ZCA vs StandardScaler using Gradient Pandas DataFrame

I updated my code and made better tables.

Based on 2010 Census Data

The first picture’s gradient is based on a Box-Cox -> StandardScaler -> ZCA-whitened scale.

The second picture’s gradient is based on a Box-Cox -> StandardScaler scale.

I found that by running StandardScaler before the ZCA whitening (much as is recommended for PCA whitening), I got a less jarring gradient.
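To make the order of the steps concrete, here is a minimal sketch of a Box-Cox -> StandardScaler -> ZCA pipeline. It is not the notebook's code: the `zca_whiten` helper, the `eps` term, and the synthetic data are assumptions made for illustration, and the ZCA step uses the common eigendecomposition form W = V diag(1/sqrt(λ)) Vᵀ of the covariance matrix.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, StandardScaler


def zca_whiten(X, eps=1e-8):
    """ZCA-whiten X (rows = observations, columns = variables).

    Hypothetical helper: eps is a small assumed constant that guards
    against dividing by near-zero eigenvalues.
    """
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rotate into the eigenbasis, rescale, then rotate back (the "Z" in ZCA)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W


# Synthetic, strictly positive, correlated columns (Box-Cox needs positive data).
# These are made-up numbers, not the Census figures.
rng = np.random.default_rng(0)
base = rng.lognormal(size=(50, 1))
X = np.hstack([base * rng.lognormal(sigma=0.3, size=(50, 1)) for _ in range(3)])

# Step 1: Box-Cox only (standardize=False leaves scaling to the next step)
boxcox = PowerTransformer(method="box-cox", standardize=False).fit_transform(X)
# Step 2: per-column standardization
scaled = StandardScaler().fit_transform(boxcox)
# Step 3: ZCA whitening across all columns at once
whitened = zca_whiten(scaled)

# After whitening, the sample covariance is (approximately) the identity
print(np.round(np.cov(whitened, rowvar=False), 3))
```

The second picture's scale corresponds to stopping after step 2; the first picture's scale corresponds to the output of step 3.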

You still see the same effect with Poverty for Alabama, California, and Florida in the first picture: the gradient runs out of order relative to the raw values because ZCA decorrelates the predictors, which produces a kind of dithering (I’ve included a correlation plot for Poverty to showcase this). In other words, Florida is greener than California because its Poverty is better given the rest of Florida’s numbers. StandardScaler scales each column on its own (using only the rows within that column), while ZCA takes the relationships among all columns into account; that is, it removes correlation, and Poverty is strongly correlated with the other predictor terms.

The tables still include regression diagnostics based on a Poverty model; “flagged” marks outliers that are highly influential. Labels are groups derived by ClusterGram.

Code (includes more correlation plots): https://github.com/thistleknot/Python-Stock/blob/eafd61187758afbda2c7ade1b04008d8701856cc/ClusterGram-ElasticNet.ipynb

Original post:

I did some styled tables in pandas using a statistical abstract of the US (2010).

What you see are two behind-the-scenes color scale transformations (both based on Box-Cox). The left uses ZCA; the right uses StandardScaler (i.e. each column’s mean is subtracted and the result is divided by that column’s standard deviation).

It was confusing at first because, although California’s Poverty is close to yet greater than Florida’s, California is still in red, while Alabama, which has even greater poverty than both, is green.

This is because the ZCA transformation takes into account the correlation between predictor terms, whereas StandardScaler does not. I plotted each ZCA-transformed column against its original values, and for some columns the points did not fall on a straight line (my presumption is that these columns, such as Poverty, share a lot of correlated information with other variables, which ‘dithers’ the normalized scores).
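The straight-line check can be reproduced on synthetic data. In the sketch below, the column names `poverty`, `income`, and `unrelated` are hypothetical stand-ins generated from random numbers, not the Census columns. Standardizing is a linear rescaling of a single column, so its correlation with the original is 1; the ZCA version of a column mixes in the other columns, so the correlation drops below 1 when that column is correlated with them.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
shared = rng.normal(size=n)

# Hypothetical columns: the first two share information, the third does not.
poverty = shared + 0.5 * rng.normal(size=n)
income = -shared + 0.5 * rng.normal(size=n)
unrelated = rng.normal(size=n)
X = np.column_stack([poverty, income, unrelated])

# Per-column standardization (what StandardScaler does)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# ZCA whitening of the standardized data
vals, vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
W = vecs @ np.diag(vals ** -0.5) @ vecs.T
zca = Z @ W

# Correlation of the original first column with each transformed version
r_standard = np.corrcoef(X[:, 0], Z[:, 0])[0, 1]
r_zca = np.corrcoef(X[:, 0], zca[:, 0])[0, 1]
print(f"standardized: {r_standard:.3f}, ZCA: {r_zca:.3f}")
```

A correlation below 1 means the ranking of rows can change after whitening, which is the out-of-order coloring described above.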

The inference is that California is flagged red because its value for Poverty is lower (a negative normalized score) than would be expected given its other predictor values. The inverse can be said of Alabama.

Regression diagnostic columns are based on a model predicting Poverty (flagged = outliers); labels identify the groupings found by ClusterGram.

#dataScience
