https://imgur.com/gallery/UmQxgR5

I’ve found that running an **isolation forest** to derive anomaly scores first, then fitting a decision tree regression on those anomaly values (as opposed to training a classifier on state labels), is the way to go to find which variables contribute to anomalies.
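A minimal sketch of that two-step pipeline, assuming scikit-learn; the column names (income, white, doctors) and the random data are placeholders for the real state-level variables:

```python
# Sketch: an isolation forest scores each state, then a decision tree
# regression fit on those scores reveals which variables its splits use
# to separate anomalous states. Toy data stands in for the real dataset.
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "income": rng.normal(55000, 8000, 50),   # hypothetical columns
    "white": rng.uniform(0.3, 0.95, 50),
    "doctors": rng.normal(250, 40, 50),
})

# Unsupervised step: score_samples is more negative for anomalies,
# so flip the sign to make larger = more anomalous.
iso = IsolationForest(random_state=0).fit(X)
anomaly = -iso.score_samples(X)

# Supervised surrogate: regress the anomaly score on the raw variables;
# feature_importances_ summarizes which variables drive the splits.
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, anomaly)
print(dict(zip(X.columns, tree.feature_importances_.round(3))))
```

The regression-on-scores trick keeps the full ordering of anomaly strength instead of collapsing it to a binary inlier/outlier label.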

I’ve included rules, but cut out those that weren’t specific to identifying anomalies, generated with a **tree_to_code** function I found; they identify the states from highest anomaly to lowest. There’s also a full dendrogram using **dtreeviz**, another using **plot_tree**, and some original highlighting I was using that compared ZCA-whitened variables with standard-scaled ones.
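I don’t have the exact **tree_to_code** helper used here (it’s a widely shared recipe), but scikit-learn’s built-in `export_text` prints the same kind of root-to-leaf rule listing, and `plot_tree`/dtreeviz draw it graphically. A sketch with toy data and made-up feature names:

```python
# Sketch of extracting if/else rules from the fitted regression tree.
# The feature names and data are illustrative, not the author's.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=50)  # stand-in anomaly scores

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Leaves with the highest predicted value correspond to the most
# anomalous states; reading the path to those leaves gives the rules.
print(export_text(tree, feature_names=["white", "income"]))
```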

The colorful graph shows how states deviated in terms of z-scores (left) versus ZCA z-scores (right). (I removed the regression outliers I had in there from older analysis, since all of this is unsupervised.)
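The two sets of scores in that comparison can be computed roughly like this; a sketch with correlated toy data, not the author’s plotting code:

```python
# Per-column z-scores vs ZCA-whitened scores. ZCA whitening decorrelates
# the variables while keeping each whitened column as close as possible
# to its original, so the two panels stay comparable state by state.
import numpy as np

rng = np.random.default_rng(2)
# three correlated toy columns standing in for the state variables
base = rng.normal(size=(50, 1))
X = np.hstack([base + rng.normal(scale=0.3, size=(50, 1)) for _ in range(3)])

# standard scaling: per-column z-scores (columns remain correlated)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# ZCA whitening: Z @ C^(-1/2), built from the eigendecomposition of the
# covariance of Z (small epsilon guards against tiny eigenvalues)
C = np.cov(Z, rowvar=False)
vals, vecs = np.linalg.eigh(C)
W = vecs @ np.diag(1.0 / np.sqrt(vals + 1e-8)) @ vecs.T
Z_zca = Z @ W

# whitened columns are (near) uncorrelated with unit variance
print(np.cov(Z_zca, rowvar=False).round(2))
```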

I’ve included a histogram of the anomaly scores.

I’ve included the output of the .describe() function, which gives the count, mean, standard deviation, and five-number summary of each variable (as well as of the anomaly scores, which isn’t really intended, but it’s in there).

Mississippi is the highest anomaly, based on two splits in the decision tree (white, income).

I also derived two types of distance from the mean using standardized variables: mdist (“**Mahalanobis** distance”) is the summed positive z-scores converted to a cumulative distribution function, and the ZCA version does the same with ZCA-whitened scores. A state with a low score is closer to the national averages (see comments for the averages).
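A sketch of both scores as described, reading “summed positive z scores” as summed absolute z-scores; note the textbook Mahalanobis distance instead uses the inverse covariance matrix, which the ZCA variant approximates by whitening first. State names, columns, and data here are all illustrative:

```python
# "mdist": sum each state's absolute z-scores, then rank the sums into
# an empirical CDF (rank / n) so low values mean close to the national
# average. The ZCA variant whitens first, then applies the same CDF.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
states = [f"state_{i}" for i in range(50)]   # placeholder state names
X = pd.DataFrame(rng.normal(size=(50, 4)), index=states,
                 columns=["income", "white", "doctors", "births"])

Z = (X - X.mean()) / X.std(ddof=0)

def cdf_score(z):
    """Sum each row's absolute z-scores, then rank into an empirical CDF."""
    total = z.abs().sum(axis=1)
    return total.rank() / len(total)

# ZCA-whiten Z, then apply the same summed-score CDF
C = np.cov(Z, rowvar=False)
vals, vecs = np.linalg.eigh(C)
W = vecs @ np.diag(1.0 / np.sqrt(vals + 1e-8)) @ vecs.T
Z_zca = pd.DataFrame(Z.values @ W, index=Z.index)

mdist = cdf_score(Z)
zca_dist = cdf_score(Z_zca)
print(mdist.idxmin(), mdist.idxmax())  # most average / most extreme state
```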

The most average state by mdist is Pennsylvania, followed by Missouri; the most extreme is Mississippi. The second most extreme (in terms of both mdist and anomaly score) is Maryland, probably because of how close it is to DC and all the benefits that carries with it.

My guess is that ZCA is showing deviation from the norm while controlling for correlation. Looking at Maryland, it has a high birth-fatality rate after controlling for other correlated variables (say, doctors).