Category Archives: Machine Learning

Time Series Forecasting Using Decomposition and auto.arima for the Cycle Factor

It’s taken me 3 attempts to understand this.

Time series decomposition and forecasting.

I got a C in time series forecasting during my master’s program, the only C I got in the entire program (and it was because of time series). It was somewhat deserved, so this review has been needed. I think the students would have been better served if the professor had taught it in R or Python rather than writing his own material and relying on a commercial product whose license expired as soon as the semester was over (Forecast Pro XE). As it was, he never fully covered additive time series decomposition. In his defense, most students were newbies when it came to coding, so expecting them to deploy it in R would have been a bit much when we were still learning to clean time series and derive p scores (I had one course that really dove into R, and I’ve learned 300% more about proper R usage since then). He did, however, teach the basics of decomposition, smoothing methods, ARIMA, and autocorrelation, as well as more advanced models like Holt’s and Winters’ exponential smoothing methods.

But I finally got it.

It’s still a work in progress; for now I just wanted to understand the decomposition on the in-sample data before doing a proper holdout analysis.

I now understand why some time series decompositions include a cycle factor and trend rather than just a trend-cycle.

I intend to do forecasts of:

Linear (i.e. with a cycle factor)
* Multiplicative
* Additive

and non-linear
* Multiplicative
* Additive

using auto.arima to forecast either the cycle factor or the trend-cycle, depending on the model above.
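Here’s a rough Python sketch of the multiplicative version of the idea, since Python is where this will eventually live; it assumes a monthly series with a DatetimeIndex and uses pmdarima’s auto_arima as the Python analogue of R’s auto.arima. The function name, horizon, and recombination step are placeholders, not the finished code.

# Hedged sketch: classical multiplicative decomposition, with the trend-cycle
# split into a fitted linear trend and a residual cycle factor, which is then
# forecast with auto_arima.
import numpy as np
import pmdarima as pm
from statsmodels.tsa.seasonal import seasonal_decompose

def forecast_with_cycle_factor(y, period=12, horizon=12):
    dec = seasonal_decompose(y, model="multiplicative", period=period)
    trend_cycle = dec.trend.dropna()

    # linear trend fitted through the trend-cycle
    t = np.arange(len(trend_cycle))
    slope, intercept = np.polyfit(t, trend_cycle.values, 1)

    # multiplicative cycle factor: trend-cycle relative to the linear trend
    cycle = trend_cycle / (intercept + slope * t)

    # forecast the cycle factor itself (non-seasonal auto.arima analogue)
    cf_model = pm.auto_arima(cycle, seasonal=False, suppress_warnings=True)
    cycle_fcst = np.asarray(cf_model.predict(n_periods=horizon))

    # recombine: extended linear trend * cycle factor * seasonal index
    # (alignment of the seasonal index to the forecast months is glossed over)
    t_new = np.arange(len(trend_cycle), len(trend_cycle) + horizon)
    seasonal_idx = np.resize(dec.seasonal.values[:period], horizon)
    return (intercept + slope * t_new) * cycle_fcst * seasonal_idx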

It’s still a bit messy; I’ve been using this long weekend to brain-dump it into the IDE after reviewing multiple sets of notes (third time’s the charm). I use Excel to model it before I move it into code (something I mentioned during an interview recently and was told is how a lot of people do it). I figure it will take three coding passes to get the code clean, and it probably won’t be pristine until I’ve properly ported it into Python.

But v1 will be done this weekend and I’ll have a forecast of Los Angeles Condo prices.

ANOVA Cluster Analysis of US States

V3 (2021-11-26)

I made an error in how I derived my F scores (I carried over the population n rather than each individual cluster’s n). After correcting it, I saw that not all of my prior clusters were statistically significant.

I also redid the min and max constraints, since I caught negative F scores when a cluster’s size n was less than k. After ensuring the minimum cluster size was at least k + 1, I found the effective upper limit on k was 6, because beyond that the min constraint * k exceeds the sample size.

I adjusted the code accordingly and as a result…

The optimal number of clusters for the US is 4, with an initial min constraint of size 2 and an initial max of size 5 (the min/max constraints are not constants; min and max functions compare init_min and init_max against values that depend on k and the sample size n), i.e.:

min = MAX(init_min, k + 1)
max = MAX(init_max, CEILING(n / k))

Methodology (sketched in code below):
* Values are centered & scaled
* Principal components applied
* [Components] weighted by each component’s proportion of variance
* Before utilizing an improved k-means algorithm (KMeansConstrained)
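A rough Python sketch of that pipeline with the adaptive size constraints folded in; it assumes the k-means-constrained package for KMeansConstrained, and the helper name and defaults are placeholders rather than the exact code in the repo.

import numpy as np
from math import ceil
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from k_means_constrained import KMeansConstrained

def constrained_clusters(X, k, init_min=2, init_max=5):
    n = X.shape[0]
    # size constraints adapt to k (see the min/max formulas above)
    size_min = max(init_min, k + 1)
    size_max = max(init_max, ceil(n / k))
    assert size_min * k <= n, "min constraint * k exceeds the sample size"

    # center & scale, project onto principal components, then weight each
    # component by its proportion of variance
    Z = StandardScaler().fit_transform(X)
    pca = PCA().fit(Z)
    weighted = pca.transform(Z) * pca.explained_variance_ratio_

    km = KMeansConstrained(n_clusters=k, size_min=size_min,
                           size_max=size_max, random_state=0)
    return km.fit_predict(weighted)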

The same clustering converges regardless of seed.

Code: https://lnkd.in/gnBzHqre

Report: https://github.com/thistleknot/R/blob/master/ClusterMapReport.pdf

Cluster sizes: 11, 13, 13, 13

V2 (2021-11-25)

Based on 2010 Census Data

I didn’t like how my clustering algorithm was selecting one giant cluster and a few single-record clusters.

So I ported the ANOVA metrics and the find-knee algorithm over and used a better k-means implementation, KMeansConstrained, which allows me to set size constraints. I set a minimum cluster size of 2 and a maximum of max(5, n/k), with k iterating over the range 3 to 15.

Methodology:
* Values are centered & scaled
* Principal components applied
* [Components] weighted by each component’s proportion of variance
* Before utilizing an improved k-means algorithm (KMeansConstrained)

I then derived p scores from the resulting F scores to determine whether the clusters were statistically significant, and they are.

The point of this exercise is to accommodate a need for useful bifurcations. I correctly presumed that I could get significant cluster groupings at any k, so I defined the min and max constraints using KMeansConstrained and then picked the k with the best BSS/WSS ratio.
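Roughly how I think of that scoring step, sketched in Python; this is a generic per-variable one-way ANOVA over the cluster labels, not the exact code from the repo.

import numpy as np
from scipy import stats

def anova_ratio(scores, labels):
    # scores: (n, p) weighted component scores; labels: cluster assignments
    n, p = scores.shape
    clusters = np.unique(labels)
    k = len(clusters)
    grand_mean = scores.mean(axis=0)

    bss = np.zeros(p)
    wss = np.zeros(p)
    for c in clusters:
        grp = scores[labels == c]          # uses each cluster's own n
        bss += len(grp) * (grp.mean(axis=0) - grand_mean) ** 2
        wss += ((grp - grp.mean(axis=0)) ** 2).sum(axis=0)

    f = (bss / (k - 1)) / (wss / (n - k))   # per-variable F scores
    p_scores = stats.f.sf(f, k - 1, n - k)  # p scores from the F distribution
    return bss.sum() / wss.sum(), f, p_scores

# iterate k over its allowed range, keep clusterings whose p scores stay
# significant, and pick the k with the best overall BSS/WSS ratio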

The result is 6 clusters of sizes: 5, 9, 9, 9, 9, 9

The same clustering converges regardless of seed.

#clustering #machinelearning #anova #unsupervised

You’ll have to forgive the double upload of the map and report data. The link has the corrected report (5 pages vs. 9; I added an infographic of how the algorithm finds the optimal cluster size).

Based on this I’d say the 5th grouping is the best set of states to live in (it includes WA, CO, MN, and the Northeastern states).

V1 (2021-11-24)

ClusterMapReport

I’ve perfected my clustering of US states by 2010 census data.

I converge on the same breakdown of states each run.

Methodology

I use principal components to normalize the dataset, then weight the principal components by their contributing proportion of variance before feeding them into k-means. Finally, I do ANOVA to find the optimal k by maximizing BSS relative to WSS, as well as to derive p-values.

There are 4 clusters of sizes: 45, 3, 1, 1

The individual states are Massachusetts and Alaska.

Then there is a small group of 3 states: Alabama, New Mexico, and South Dakota.

I draw violin plots showing the distribution of the variables against each cluster.

#Clustering #ANOVA

Code: https://github.com/thistleknot/R/blob/master/AnovaClustering.R

Partial Correlation Robust Dimension Reduction using KFolds

I wrote some Python code that tabulates partial-correlation significance across k folds for robust dimension reduction.

This has been on my to do list for a while.

I wrote this for a few reasons. One was my stock correlation analysis flipping signs on me across partitions. Another was that the Bayes regression tutorial mentioned using correlation analysis for factor reduction (which seems to be pretty common across any type of supervised method). The k-fold modification was applied because it makes sense to sample correlations, loosely similar to how Bayesian methods sample distributions with Markov chains; it’s not the same type of sampling, just the same concept of sampling to identify a metric. I’ve learned that correlations can give false positives (a variable can be only occasionally significant when using subsamples, and it happens often), and almost no article discusses finding significant coefficients of determination.
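A trimmed-down Python sketch of the idea, assuming pingouin for the partial correlations; the fold count, column handling, and threshold are placeholders here, and the gist and notebook below have the full version.

import pandas as pd
import pingouin as pg
from sklearn.model_selection import KFold

def fold_partial_corr(df, target, alpha=0.05, folds=5):
    predictors = [c for c in df.columns if c != target]
    records = []
    # compute the partial correlations on each fold's subsample
    for _, idx in KFold(n_splits=folds, shuffle=True, random_state=0).split(df):
        sub = df.iloc[idx]
        for x in predictors:
            # partial correlation of x with the target, controlling for the rest
            res = pg.partial_corr(data=sub, x=x, y=target,
                                  covar=[c for c in predictors if c != x])
            records.append({"var": x,
                            "r": res["r"].iloc[0],
                            "sig": res["p-val"].iloc[0] < alpha})
    tab = pd.DataFrame(records)
    # "purity": how often a variable is significant, and how stable its sign is
    return tab.groupby("var").agg(sig_rate=("sig", "mean"),
                                  pos_rate=("r", lambda r: (r > 0).mean()))

Variables that stay significant in most folds with a sign that doesn’t flip are the ones worth keeping.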

Gist: https://gist.github.com/thistleknot/ce1fc38ea9fcb1a8dafcfe6e0d8af475

Notebook with full model and some added purity scoring (tabulation of sign / significance as a percentage): https://github.com/thistleknot/python-ml/blob/master/code/pcorr-significance.ipynb

Identified Lagged Stock Variables Including the SP500

It looks like I [marginally] beat the SP500

LQD

I found the optimal significant lag for various variables and did an upper/lower graph for each (i.e. a hold signal for the SP500 return based on whether the lagged daily delta is > 0 or < 0), then compared it to the SP500.

This was the one that showed promise because it was daily and it gave me a good result.

I actually did my homework and used a training partition to find the significant correlations, then carried those terms over to the test partition and derived their total cumulative returns; this one was the one that made the most sense.
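A hedged sketch of that workflow; the lag search range, significance cutoff, and the >0 delta rule are written generically here rather than copied from my notebook.

import pandas as pd
from scipy import stats

def best_significant_lag(driver, target_ret, max_lag=20, alpha=0.05):
    # pick the lag whose lagged daily delta correlates most with target returns
    best = None
    for lag in range(1, max_lag + 1):
        pair = pd.concat([driver.diff().shift(lag), target_ret], axis=1).dropna()
        r, p = stats.pearsonr(pair.iloc[:, 0], pair.iloc[:, 1])
        if p < alpha and (best is None or abs(r) > best[0]):
            best = (abs(r), lag)
    return None if best is None else best[1]

def upper_signal_return(driver, target_ret, lag):
    # hold only when the lagged daily delta is > 0; compare to buy-and-hold
    signal = (driver.diff().shift(lag) > 0).astype(int)
    return (1 + signal * target_ret).cumprod()

# find the lag on the training partition, then evaluate the signal's
# cumulative return on the test partition against the SP500 itself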

Now I can apply this method to any symbol

=D

The blue line is the SP500.

Additional variables

#econometrics

State Anomalies Using Isolation Forest

https://imgur.com/gallery/UmQxgR5

I’ve found that running the isolation forest first to derive anomaly values and then fitting a decision tree regression on those values (as opposed to using state labels in a classifier) is the way to go to find which variables contribute to anomalies.
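In sketch form (sklearn’s export_text stands in here for the tree_to_code helper mentioned below; the tree depth and seed are arbitrary):

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.tree import DecisionTreeRegressor, export_text

def anomaly_drivers(X, max_depth=3, random_state=0):
    # isolation forest first: higher score = more anomalous
    iso = IsolationForest(random_state=random_state).fit(X)
    anomaly = -iso.score_samples(X)

    # then regress the anomaly score on the original variables with a shallow
    # tree to see which variables the splits actually use
    tree = DecisionTreeRegressor(max_depth=max_depth,
                                 random_state=random_state).fit(X, anomaly)
    rules = export_text(tree, feature_names=list(X.columns))
    return pd.Series(anomaly, index=X.index).sort_values(ascending=False), rules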

I’ve included the rules (cutting out those that weren’t specific to identifying anomalies), generated with a tree_to_code function I found, which identify the states from highest anomaly to lowest, as well as a full dendrogram using dtreeviz, another using plot_tree, and some of the original highlighting I was using that compared ZCA-whitened variables with standard-scaled ones.

The colorful graph represents how states deviate in terms of Z scores (left) or ZCA Z scores (right); I removed the regression outliers I had in there from an older analysis, since all of this is unsupervised.

I include a histogram of anomalies.

I’ve included the .describe() output, which gives the five-number summary of each variable (as well as of the anomaly scores, which wasn’t really intended, but it’s in there).

Mississippi is the highest anomaly, based on 2 splits in the decision tree (white, income).

I also derived two types of distances from the mean using standardized variables (mdist [“Mahalanobis distance”] sums the positive z scores and converts them to a cumulative distribution function, and the ZCA score does the same with ZCA-whitened scores). A state with a low score is closer to the national averages (see the comments for the averages).
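A rough sketch of how such scores can be computed; the exact aggregation in my notebook may differ (summing absolute standardized scores and converting the sums to percentile ranks is an assumption here).

import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import StandardScaler

def distance_scores(X):
    # X: DataFrame of state-level variables
    Z = StandardScaler().fit_transform(X)

    # ZCA whitening: decorrelate while staying close to the original axes
    cov = np.cov(Z, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    Z_zca = Z @ vecs @ np.diag(1.0 / np.sqrt(vals + 1e-8)) @ vecs.T

    def to_cdf(summed):
        # convert summed scores to an empirical CDF (percentile) per state
        return pd.Series(stats.rankdata(summed) / len(summed), index=X.index)

    mdist = to_cdf(np.abs(Z).sum(axis=1))      # ignores correlation
    zca = to_cdf(np.abs(Z_zca).sum(axis=1))    # after decorrelation
    return mdist, zca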

The most average state based on mdist is Pennsylvania, followed by Missouri; the most extreme is Mississippi. The second most extreme (in terms of both mdist and anomaly score) is Maryland, probably because of how close it is to DC and all the benefits that carries with it.

My guess as to what ZCA is showing is deviation from the norm while controlling for correlation. Looking at Maryland, it has a high birth fatality rate after controlling for other correlated variables (say, doctors).

Markowitz Profiles (Modern Portfolio Theory)

I’ve significantly improved on my Markowitz algorithm.

It’s a merge of two tutorials (though I’ve read three). Currently I derive weights based on the optimal Sharpe ratio. At the moment I don’t take any risk-free rate into consideration (i.e. it’s simply return/volatility).

I was trying two different solver methods. minimize was taking a really long time, so I tried another tutorial that made use of cvxopt and quadratic programming (which I use to draw the efficient frontier rapidly). I was unable to find the optimal value for the Sharpe ratio using the cvxopt solver (it would only solve either the bottom or topmost portion of the line), so I looked at finquant, and it uses a Monte Carlo run of 5,000 portfolios to derive the optimal points (i.e. max return, min volatility, and max Sharpe ratio).

So I fell back on Monte Carlo to do it for me, i.e. the path of least resistance, and the margin of error is acceptable.
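A minimal sketch of that Monte Carlo fallback (the 5,000 draws mirror finquant; the 252-day annualization and long-only weights are my assumptions here, not necessarily what the notebook does):

import numpy as np

def monte_carlo_max_sharpe(returns, n_draws=5000, seed=0):
    # returns: DataFrame of daily returns, one column per symbol
    rng = np.random.default_rng(seed)
    mu = returns.mean().values * 252        # annualized mean returns
    cov = returns.cov().values * 252        # annualized covariance

    best_sharpe, best_w = -np.inf, None
    for _ in range(n_draws):
        w = rng.random(len(mu))
        w /= w.sum()                        # long-only weights summing to 1
        ret = w @ mu
        vol = np.sqrt(w @ cov @ w)
        sharpe = ret / vol                  # no risk-free rate: return/volatility
        if sharpe > best_sharpe:
            best_sharpe, best_w = sharpe, w
    return best_sharpe, best_w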

The important thing is that it’s very fast and straightforward (KISS), so I will use this to backtest with.

What I like about Markowitz is that it assumes past behavior is a good model for the future (t-tests on stocks are used to determine whether two partitions of returns are equal). It’s been said that yesterday’s price is the best predictor of tomorrow’s; the same goes for returns.

Code: https://github.com/thistleknot/Python-Stock/blob/master/Markowitz2.ipynb

ZCA vs StandardScaler Using Pandas DataFrame Gradients

I updated my code and made better tables

Based on 2010 Census Data

The first picture’s gradient is based on a

box-cox -> StandardScaler -> ZCA whitened scale

The second picture’s gradient is based on a

box-cox -> StandardScaler scale

I found that by preceding the ZCA whitening with a StandardScaler pass (much like what’s recommended for PCA whitening), I got a less jarring gradient.
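Roughly what the two scales look like in code; the box-cox shift, the colormap, and the use of pandas’ gmap to key the gradient to the transformed values are assumptions on my part, and the notebook linked below has the real version.

import numpy as np
import pandas as pd
from scipy import stats
from sklearn.preprocessing import StandardScaler

def boxcox_frame(df):
    # box-cox needs strictly positive values; shift each column as needed
    return df.apply(lambda col: pd.Series(stats.boxcox(col - col.min() + 1)[0],
                                          index=col.index))

def zca_whiten(Z):
    cov = np.cov(Z, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    return Z @ vecs @ np.diag(1.0 / np.sqrt(vals + 1e-8)) @ vecs.T

def styled_tables(df):
    bc = boxcox_frame(df)
    scaled = pd.DataFrame(StandardScaler().fit_transform(bc),
                          index=df.index, columns=df.columns)
    whitened = pd.DataFrame(zca_whiten(scaled.values),
                            index=df.index, columns=df.columns)
    # show the raw values but key the color gradient to each transformed scale:
    # first picture ~ ZCA-whitened scale, second picture ~ standard-scaled values
    zca_styled = df.style.background_gradient(cmap="RdYlGn", axis=None, gmap=whitened)
    std_styled = df.style.background_gradient(cmap="RdYlGn", axis=None, gmap=scaled)
    return zca_styled, std_styled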

You still see the same effect with Alabama, California, and Florida’s Poverty values in the first picture (i.e. an out-of-order gradient, due to ZCA decorrelating the predictors, which produces a dithering effect; I’ve included a correlation plot of Poverty to showcase this). In other words, Florida is greener than California because its Poverty is better given the rest of Florida’s numbers. StandardScaler scales each column on its own (using only the rows’ relations within that column), while ZCA scaling takes all column and row relations into consideration (i.e. it removes correlation, and Poverty is highly correlated with other predictor terms).

It still includes regression diagnostics based on a Poverty model; “flagged” marks outliers that are highly influential. “Labels” are the groups derived by ClusterGram.

Code (includes more correlation plots): https://github.com/thistleknot/Python-Stock/blob/eafd61187758afbda2c7ade1b04008d8701856cc/ClusterGram-ElasticNet.ipynb

Original post:

I did some styled tables in pandas using a statistical abstract of the US (2010).

What you see are two behind-the-scenes color-scale transformations (both based on box-cox): the left uses ZCA, the right StandardScaler (i.e. each column’s mean subtracted and the result divided by its standard deviation).

It was confusing at first because, despite California’s Poverty being close to yet greater than Florida’s, it was still in red, yet Alabama, which has even greater poverty than both, is green.

This is because the ZCA transformation takes the correlation between predictor terms into consideration, whereas StandardScaler does not. I did some correlation graphs between each ZCA-transformed variable and the old values, and the relationship wasn’t a straight line for some variables (my presumption is that these variables, such as Poverty, share a lot of correlated information with other variables, thereby ‘dithering’ the normalized scores).

The inference is that California is flagged red because its value for Poverty is lower (a negative normalized score) than would be expected given its other predictor values. The inverse can be said of Alabama.

The regression diagnostic columns are based on a model predicting Poverty (flagged = outliers); “labels” identifies the groupings found from ClusterGram.

#dataScience

Regression Diagnostics Outlier Detection [Python]

I took some code that transplanted R’s regression diagnostic plots to Python and augmented it to match what I had previously done in R.

Basically, it highlights the row names of outliers as opposed to simply printing off their row numbers, and adds guiding lines.

This plot shows studentized residuals mapped against leverage, with their respective outlier-flagging range limits, plus whether the Cook’s distance exceeds 0.1 (blue) or 0.5 (red), based on its F probability distribution score.
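The gist of that plot in Python, using statsmodels’ OLSInfluence; this is a generic reconstruction rather than the code in Plot.py, and the thresholds and styling are placeholders.

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import OLSInfluence

def leverage_plot(results, labels, cook_levels=(0.1, 0.5)):
    # results: a fitted statsmodels OLS results object; labels: row names
    infl = OLSInfluence(results)
    lev = infl.hat_matrix_diag
    stud = infl.resid_studentized_external
    cooks = infl.cooks_distance[0]

    fig, ax = plt.subplots()
    ax.scatter(lev, stud, s=20)
    # guiding lines at the usual studentized-residual cutoffs
    for y in (-2, 2):
        ax.axhline(y, ls="--", lw=0.5)

    # label points by row name, colored by the Cook's distance threshold exceeded
    for i in np.where(cooks > cook_levels[0])[0]:
        color = "red" if cooks[i] > cook_levels[1] else "blue"
        ax.annotate(labels[i], (lev[i], stud[i]), color=color)

    ax.set_xlabel("Leverage")
    ax.set_ylabel("Studentized residuals")
    return ax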

Code: https://github.com/thistleknot/Python-Stock/blob/master/Plot.py

I’ve been working on my outlier detection methods and found a useful way to highlight them. My code now outputs the outliers in question at the very end of the regression process.

This is based on Federal Reserve data available from their API. The regression in question is based on the DGS10, and not surprisingly the significantly correlated terms are other measures directly related to mortgages and interest rates. I have more interaction terms, but due to the way I’m using matches from the non-interacted terms, only these showed up.

What I like about this is that I KNOW those years are when we had economic crises (the 2008 housing bubble burst and the 2020 COVID crisis), and what the outlier detection is flagging is the economic tomfoolery that was necessary to sustain the economy.

The notebook is available here:

https://github.com/thistleknot/Python-Stock/blob/50cd24fce67169b1c1f4b6fa0511f7130223c4f0/FRED_analysis.ipynb