Category Archives: Computers

Partial Correlation Robust Dimension Reduction using KFolds

I wrote some Python code that tabulates partial correlation significance across k folds for robust dimension reduction.

This has been on my to do list for a while.

I wrote this for a few reasons. One is that my stock correlation analysis kept flipping signs on me across partitions. Another is that the Bayes regression tutorial mentioned using correlation analysis for factor reduction, which seems to be pretty common across any type of supervised method. I applied my k-fold modification because it makes sense to sample correlations in a way similar to how Bayes samples distributions with Markov chains. It is not exactly the same type of sampling, just the concept of sampling to identify a metric. I’ve learned that correlations can give false positives (a variable is only occasionally significant, which happens often when using subsamples), and articles almost never discuss finding significant coefficients of determination.
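Here is a minimal sketch of the tabulation idea, not the notebook's code. It assumes a single control variable, uses the Fisher z approximation for significance, and shuffles rows into folds (for time series you may prefer contiguous partitions). The function names `partial_corr`, `p_value` and `kfold_tabulate` are hypothetical.

```python
import math
import random
from statistics import NormalDist


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def p_value(r, n, n_controls=1):
    """Two-sided p-value from the Fisher z approximation (an assumption)."""
    z = math.atanh(r) * math.sqrt(n - n_controls - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def kfold_tabulate(x, y, z, k=5, alpha=0.05, seed=0):
    """Share of folds where the partial correlation is significant, by sign."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    counts = {"pos_sig": 0, "neg_sig": 0, "not_sig": 0}
    for fold in folds:
        fx, fy, fz = ([v[i] for i in fold] for v in (x, y, z))
        r = partial_corr(fx, fy, fz)
        if p_value(r, len(fold)) >= alpha:
            counts["not_sig"] += 1
        elif r > 0:
            counts["pos_sig"] += 1
        else:
            counts["neg_sig"] += 1
    return {key: val / k for key, val in counts.items()}
```

A variable whose `pos_sig` or `neg_sig` share is close to 1 keeps its sign and significance across folds. One that splits between the buckets is the kind of sign-flipping, occasionally significant term described above.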


Notebook with full model and some added purity scoring (tabulation of sign / significance as a percentage):

Identified lagged stock variables including SP500

It looks like I [marginally] beat the SP500


I found the optimal significant lag of various variables and, for each variable, graphed an upper/lower hold signal (i.e. hold for the SP500 return when the lagged daily delta is >0, or when it is <0) and compared it to the SP500.

This was the one that showed promise because it was daily and it gave me a good result.

I actually did my homework and used a training partition to find the significant correlations, then carried these terms to the test partition and derived their Total Cumulative Returns and this one was the one that made the most sense.
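The hold-signal comparison can be sketched roughly as follows. This is an illustration under assumptions, not the notebook's code: the function names are hypothetical, returns are simple daily returns, and days without a signal are treated as sitting in cash at a 0 return.

```python
def cumulative_return(returns):
    """Compound a sequence of simple daily returns into a total return."""
    total = 1.0
    for r in returns:
        total *= 1.0 + r
    return total - 1.0


def lagged_signal_returns(signal_delta, index_returns, lag=1, hold_when_up=True):
    """Index returns on days where the lagged delta has the chosen sign.

    signal_delta[t - lag] decides whether the index is held on day t;
    days with no signal earn 0 (sitting in cash is an assumption).
    """
    held = []
    for t in range(lag, len(index_returns)):
        delta = signal_delta[t - lag]
        hold = delta > 0 if hold_when_up else delta < 0
        held.append(index_returns[t] if hold else 0.0)
    return held


# Made-up numbers purely for illustration.
signal = [0.5, -0.2, 0.1, -0.3, 0.4]
spx = [0.01, 0.02, -0.01, 0.03, -0.02]

upper = cumulative_return(lagged_signal_returns(signal, spx, lag=1, hold_when_up=True))
lower = cumulative_return(lagged_signal_returns(signal, spx, lag=1, hold_when_up=False))
buy_and_hold = cumulative_return(spx[1:])
print(upper, lower, buy_and_hold)
```

The lag would be chosen on the training partition only, and the three cumulative returns compared on the test partition.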

Now I can apply this method to any symbol


The blue line is the SP500

Additional variables


State Anomalies Using Isolation Forest

I’ve found that running the isolation forest first to derive labels, then fitting a decision tree regression on the anomaly values (as opposed to using state labels in a classifier), is the way to go to find which variables contribute to anomalies.

I’ve included rules (cutting out those that weren’t specific to identifying anomalies), generated with a tree_to_code function I found, that identify the states from highest anomaly to lowest. There is also a full dendrogram using dtreeviz, another using plot_tree, and some original highlighting I was using which compared ZCA whitened variables with standard scaled ones.
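A minimal sketch of that two-step idea with scikit-learn, on synthetic data. The column names, the planted extreme row and the tree depth are assumptions for illustration, not the values used in the notebook.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
# Synthetic stand-in for the state table: 50 rows, 3 made-up columns.
feature_names = ["income", "poverty", "doctors"]
X = rng.normal(size=(50, 3))
X[0] = [-4.0, 4.0, -3.0]  # one deliberately extreme row

# Step 1: unsupervised anomaly scores. score_samples is lower for more
# abnormal rows, so the sign is flipped to make higher = more anomalous.
forest = IsolationForest(n_estimators=200, random_state=0).fit(X)
anomaly = -forest.score_samples(X)

# Step 2: regress the continuous score on the same variables with a shallow tree.
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, anomaly)

# The splits and importances indicate which variables drive the anomaly score.
print(export_text(tree, feature_names=feature_names))
print(dict(zip(feature_names, tree.feature_importances_.round(3))))
```

Regressing on the continuous score keeps the ordering from highest anomaly to lowest, which a classifier on hard labels would throw away.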

The colorful graph represents states as they deviated from Z scores (left) or ZCA Z scores (right) (I removed the regression outliers I had in there from older analysis as all this is unsupervised).

I include a histogram of anomalies.

I’ve included the .describe() function which gives the 5 number summary of each variable (as well as the anomalies, which isn’t really intended, but it’s in there).

Mississippi is the highest anomaly based on 2 splits in the decision tree (white, income)

I also derived two types of distances from the mean using standardized variables: mdist [“Mahalanobis distance”] is the summed positive z scores converted to a cumulative distribution function, and ZCAs are ZCA whitened scores doing the same. A state with a low score is closer to national averages (see comments for averages).

The most average state based on mdist is Pennsylvania, next is Missouri; most extreme is Mississippi. The second most extreme (in terms of mdist and anomaly) is Maryland. Probably because of how close it is to DC and all the benefits that carries with it.

My guess as to what ZCA is showing is a deviation from the norm controlling for correlation. Looking at Maryland, it has a high birth fatality controlling for other correlated variables (say, doctors).

Markowitz Profiles (Modern Portfolio Theory)

I’ve significantly improved on my Markowitz algorithm.

It’s a merge of two tutorials (but I’ve read 3). Currently I derive weights based on the optimal Sharpe ratio. ATM I don’t take any risk-free rate into consideration (i.e. it’s simply return/volatility).

I was trying two different solver methods. Minimize was taking a really long time, so I tried another tutorial that made use of cvxopt and quadratic programming (which I use to draw the efficient frontier rapidly). I was unable to find the optimal value based on Sharpe ratio using the cvxopt solver… (it would only solve either the bottom or topmost portion of the line) so I looked at finquant, and it uses a Monte Carlo run of 5000 to derive the optimal points (i.e. max return, min volatility, and max Sharpe ratio).

So I fell back on Monte Carlo to do it for me, i.e. “the path of least resistance”, and the margin of error is acceptable.

The important thing is it’s very fast/straightforward (KISS). So I will use this to backtest with.

What I like about Markowitz is that it assumes past behavior is a good model for the future (t tests in stocks are used to determine if two partitions of returns are equal). It’s been said that yesterday’s price is the best predictor for tomorrow’s. Same for returns.
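A rough sketch of the Monte Carlo approach, assuming long-only weights that sum to 1 and no risk-free rate. The function names and the input numbers are made up for illustration; this is not finquant's implementation.

```python
import math
import random


def portfolio_stats(weights, mean_returns, cov):
    """Expected return and volatility of a weighted portfolio."""
    ret = sum(w * m for w, m in zip(weights, mean_returns))
    var = sum(
        wi * wj * cov[i][j]
        for i, wi in enumerate(weights)
        for j, wj in enumerate(weights)
    )
    return ret, math.sqrt(var)


def monte_carlo_max_sharpe(mean_returns, cov, trials=5000, seed=0):
    """Best of `trials` random long-only portfolios by return / volatility."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        raw = [rng.random() for _ in mean_returns]
        total = sum(raw)
        weights = [r / total for r in raw]  # normalise so weights sum to 1
        ret, vol = portfolio_stats(weights, mean_returns, cov)
        sharpe = ret / vol  # no risk-free rate subtracted
        if best is None or sharpe > best[0]:
            best = (sharpe, weights, ret, vol)
    return best


# Two hypothetical, uncorrelated assets with equal volatility.
mean_returns = [0.10, 0.05]
cov = [[0.04, 0.0], [0.0, 0.04]]
sharpe, weights, ret, vol = monte_carlo_max_sharpe(mean_returns, cov)
print(sharpe, weights)
```

Tracking the minimum-volatility and maximum-return portfolios works the same way, by keeping a separate running best for each criterion inside the loop.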


ZCA vs StandardScaler using Gradient Pandas DataFrame

I updated my code and made better tables

Based on 2010 Census Data

First picture’s gradient is based on a

box-cox -> Standard Scaler -> ZCA whitened scale

Second picture’s gradient is based on a

box-cox -> StandardScaler scale

I found by augmenting the ZCA whitened scale with a pre StandardScaler (much like PCA whitening recommends), I got a less jarring gradient.

You still see the same effect with Alabama, California, and Florida’s Poverty in the first picture (i.e. an out-of-order gradient due to ZCA decorrelation between predictors), which produces a dithering (I’ve included a correlation plot of Poverty to showcase this). In other words, Florida is greener than California because its Poverty is better given the rest of Florida’s numbers. StandardScaler only scales according to one column (and all row relations within that column), while ZCA scaling takes into consideration the relations of all columns/rows to each other (i.e. it removes correlation, and Poverty has a lot of correlation with other predictor terms).
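To make the difference concrete, here is a small NumPy sketch under assumptions: synthetic data, hypothetical function names, and the box-cox step left out. It shows that column-wise standard scaling leaves the correlation in place, while ZCA whitening applied after it removes the correlation.

```python
import numpy as np


def standard_scale(X):
    """Column-wise z scores: each column is scaled on its own."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def zca_whiten(X, eps=1e-8):
    """ZCA whitening: rotate, rescale, rotate back, so columns decorrelate."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W


rng = np.random.default_rng(0)
# Two correlated made-up columns standing in for Poverty and a related predictor.
a = rng.normal(size=200)
X = np.column_stack([a, 0.8 * a + 0.6 * rng.normal(size=200)])

Z = standard_scale(X)
Zca = zca_whiten(standard_scale(X))  # standard scaling first, then ZCA

print(np.corrcoef(Z, rowvar=False).round(2))    # off-diagonal keeps the correlation
print(np.corrcoef(Zca, rowvar=False).round(2))  # off-diagonal is approximately 0
```

Because each ZCA score mixes in information from the other columns, a row's whitened value can rank differently from its plain z score, which is the out-of-order gradient described above.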

Still includes regression diagnostics based on a Poverty model, flagged means outliers that are highly influential. Labels are groups as derived by ClusterGram.

Code (includes more correlation plots):

Original post:

I did some styled tables in pandas using a statistical abstract of the US (2010).

What you see are two behind the scenes color scale transformations (both based on box-cox). Left using ZCA, right StandardScaler (i.e. each column’s mean subtracted and divided by standard deviation).

It was confusing at first because, despite California’s Poverty being close to yet greater than Florida’s, it was still in red, yet Alabama, which has even greater poverty than both, is green.

This is because the ZCA transformation takes into consideration the correlation between predictor terms, whereas StandardScaler does not. I did some correlation graphs between each ZCA transformed value and the old values, and it wasn’t a straight line for some values (my presumption is that these values, such as Poverty, shared a lot of correlated information with other variables, thereby ‘dithering’ the normalized scores).

The inference is that California is flagged red because its value for Poverty is lower (negative normalized score) than it’s expected to be given the other predictor values it has. The inverse can be said of Alabama.

Regression diagnostic columns are based on a model predicting Poverty (flagged = outliers); labels identify groupings found from clustergram.


Regression Diagnostics Outlier Detection [Python]

I took some code that transplanted R’s regression diagnostic plots to Python and I augmented it to match what I had done previously to it in R.

Basically, it highlights the row names of outliers as opposed to simply printing off their number, and adds guiding lines.

This plot shows Studentized residuals mapped against leverage with their respective outlier flag range limits, plus whether the Cook’s distance exceeds .1 (blue) or .5 (red), which is based on its F probability distribution score.
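The quantities behind that plot can be computed directly. Below is a NumPy sketch under assumptions: ordinary least squares with an intercept, internally studentized residuals, synthetic data with one planted influential point, and hypothetical row labels. The 0.5 cutoff mirrors the red line mentioned above. The F-distribution scoring is not reproduced here.

```python
import numpy as np


def influence_measures(X, y):
    """Leverage, internally studentized residuals and Cook's distance for OLS."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])           # add intercept column
    p = A.shape[1]
    H = A @ np.linalg.pinv(A.T @ A) @ A.T           # hat matrix
    leverage = np.diag(H)
    resid = y - H @ y
    mse = resid @ resid / (n - p)
    student = resid / np.sqrt(mse * (1.0 - leverage))
    cooks = student ** 2 * leverage / (p * (1.0 - leverage))
    return leverage, student, cooks


rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2.0 * x + rng.normal(scale=0.5, size=30)
x[0], y[0] = 4.0, -6.0                              # planted influential point
names = [f"row{i}" for i in range(30)]              # hypothetical row labels

leverage, student, cooks = influence_measures(x.reshape(-1, 1), y)
flagged = [name for name, d in zip(names, cooks) if d > 0.5]
print(flagged)
```

Reporting the row names rather than positional indices is what makes the flagged points readable when the rows are states or dates.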


I’ve been working on my outlier detection methods and found a useful way to highlight them. My code now outputs the outliers in question at the very end of the regression process.

This is based on Federal Reserve data as available from their API. The regression in question is based on the DGS10 and, not surprisingly, the significant correlated terms are other measures that are directly related to mortgages and interest rates. I have more interaction terms, but due to the way I’m using matches from the non-interacted terms, only these showed up.

What I like about this is I KNOW those years are when we had economic crises (the 2008 housing bubble burst and the 2020 covid crisis), and what the outlier detection is flagging is the economic tomfoolery that was necessary to sustain the economy.

The notebook is available here

2010 Census Data Cluster Analysis

I aggregated the clustering work I’ve been doing.

Based on 2010 Census Data, a Statistical Abstract of the United States

Optimal k for clusters as found by NbClust’s 30 tests was 2 (majority rule). I used box-cox transformed PCA Whitened and weighted variables as well as fviz_dist to visualize the similarities of clusters.

Violin plots to aggregate the actual clusters

A 3d PCA plot showing the top 3 principal components which made up 77% of the proportion of variance

And finally pairplot to show the 2 clusters mapped against every possible scatterplot of the original data.

factoextra fviz_dist

Using 2010 Census Data (Statistical Abstract of the United States)

I’ve scaled the data according to PCA variance.

This is a graph of factoextra’s fviz_dist
Red = ~0 Euclidean distance (for clustering)
Blue = dissimilar


[1] 0.5639438
A value closer to 1 means the data is very clusterable

This is very interesting.