
Time Series Forecasting using decomposition and auto-arima for Cycle Factor

It’s taken me 3 attempts to understand this.

Time series decomposition and forecasting.

I got a C in time series forecasting during my master’s program, the only C I’ve ever gotten, and it was because of time series. It was somewhat deserved, so this review has been needed. I think the students would have been better served if the professor had taught it in R or Python rather than writing his own material and relying on a commercial product that expired as soon as the semester was over (Forecast Pro XE). As it was, he never fully covered additive time series decomposition. In his defense, most students were newbies when it came to coding, so expecting them to deploy it in R would have been a bit much when we were still learning to clean time series and derive p-values (I had one course that really dived into R, and I’ve learned 300% since then about proper R usage). However, he did teach the basics of decomposition, smoothing methods, ARIMA, and autocorrelation, as well as more advanced models like Holt’s and Winters’ exponential smoothing methods.

But I finally got it.

It’s still a work in progress; for now I just wanted to understand the decomposition on the in-sample data before I did a proper holdout analysis.

I now understand why some time series decompositions include a cycle factor and trend rather than just a trend-cycle.

I intend to do forecasts of:

Linear (i.e. with a cycle factor)
* Multiplicative
* Additive

Non-linear (i.e. with a trend-cycle)
* Multiplicative
* Additive

I’ll use auto.arima to forecast either the cycle factor or the trend-cycle, depending on the model above.
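Roughly, the cycle-factor (linear, multiplicative) variant looks something like the sketch below, in Python since that’s where this is eventually headed. It assumes statsmodels for the classical decomposition and pmdarima’s auto_arima standing in for R’s auto.arima; the series, period, and horizon are placeholders rather than my actual condo data.

import numpy as np
import pandas as pd
import pmdarima as pm
from statsmodels.tsa.seasonal import seasonal_decompose

def decompose_and_forecast(y: pd.Series, period: int = 12, horizon: int = 12):
    # classical multiplicative decomposition: y = trend-cycle * seasonal * irregular
    decomp = seasonal_decompose(y, model="multiplicative", period=period)
    trend_cycle = decomp.trend.dropna()

    # split the trend-cycle into a fitted linear trend and a cycle factor
    t = np.arange(len(trend_cycle))
    slope, intercept = np.polyfit(t, trend_cycle.values, 1)
    linear_trend = intercept + slope * t
    cycle_factor = trend_cycle.values / linear_trend   # oscillates around 1.0

    # forecast the cycle factor out of sample (the auto.arima step)
    cf_model = pm.auto_arima(cycle_factor, seasonal=False, suppress_warnings=True)
    cf_fc = np.asarray(cf_model.predict(n_periods=horizon))

    # extend the linear trend and tile the seasonal indices
    # (seasonal phase alignment is glossed over in this sketch)
    t_future = np.arange(len(trend_cycle), len(trend_cycle) + horizon)
    trend_fc = intercept + slope * t_future
    seasonal_fc = np.resize(decomp.seasonal.iloc[-period:].values, horizon)

    # recombine: forecast = trend * cycle factor * seasonal
    return trend_fc * cf_fc * seasonal_fc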

It’s still a bit messy; I’ve been using this long weekend to get it brain-dumped into the IDE after reviewing multiple sets of notes (third time’s a charm). I use Excel to model it before I move it into code (something I mentioned during an interview recently and was told that’s how a lot of people do it). I figure it will take three coding passes to get the code clean, and it probably won’t be pristine until I have it properly ported into Python.

But v1 will be done this weekend and I’ll have a forecast of Los Angeles Condo prices.

ANOVA Cluster Analysis of US States

V3 (2021-11-26)

I made an error in how I derived my F scores (I didn’t carry the individual cluster’s n over, but rather the population n). After correcting it, I saw that not all of my prior clusters were statistically significant.

I also redid the min and max constraints, because I caught negative F scores when a cluster’s size was n < k. After ensuring the minimum cluster size was at least k+1, I found a max limit of k at 6, because at that point min constraint * k > sample size.

I adjusted the code accordingly and as a result…

The optimal number of clusters for the US is 4, with an initial min constraint of size 2 and an initial max of size 5 (the min/max constraints are not constants: a min and a max function compare init_min and init_max against values based on k and n).

i.e.
min = MAX(init_min, k+1)
max = MAX(init_max, ceiling(n/k))

Methodology:
* Values are centered & scaled
* Principal components applied
* [Components] weighted by each component’s proportion of variance
* Then fed into an improved k-means (KMeansConstrained) algorithm

The same clustering converges regardless of seed.
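A minimal sketch of that pipeline in Python, assuming the k-means-constrained package for KMeansConstrained; the 50x10 matrix is placeholder census data, and the size constraints follow the min/max formulas above.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from k_means_constrained import KMeansConstrained

X = np.random.rand(50, 10)        # placeholder: 50 states x 10 census features
k, init_min, init_max = 4, 2, 5
n = X.shape[0]

# 1. center & scale
Z = StandardScaler().fit_transform(X)

# 2. principal components
pca = PCA()
scores = pca.fit_transform(Z)

# 3. weight each component by its proportion of variance explained
weighted = scores * pca.explained_variance_ratio_

# 4. constrained k-means with min/max cluster sizes per the formulas above
size_min = max(init_min, k + 1)
size_max = max(init_max, int(np.ceil(n / k)))
labels = KMeansConstrained(n_clusters=k, size_min=size_min,
                           size_max=size_max, random_state=0).fit_predict(weighted)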

Code: https://lnkd.in/gnBzHqre

Report: https://github.com/thistleknot/R/blob/master/ClusterMapReport.pdf

Cluster sizes: 11, 13, 13, 13

V2 (2021-11-25)

Based on 2010 Census Data

I didn’t like how my clustering algorithm was selecting one giant cluster and a few single-record clusters.

So I ported the ANOVA metrics and find-knee algorithm over and used a better k-means algorithm, KMeansConstrained, which allowed me to set constraints. I set a minimum cluster size of 2 and a maximum of max(5, n/k), with k iterating over the range 3 to 15.

Methodology:
* Values are centered & scaled
* Principal components applied
* [Components] weighted by each component’s proportion of variance
* Then fed into an improved k-means (KMeansConstrained) algorithm

I then derived p-values from the resulting F scores to determine whether the clusters were statistically significant, and they are.

The point of this exercise is to accommodate a need for useful bifurcations. I presumed, correctly, that I could get statistically significant cluster groupings at any k. So I defined mins and maxes using KMeansConstrained and then picked the k with the best BSS/WSS ratio.
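The significance and selection step looks roughly like the sketch below: a per-feature one-way ANOVA across cluster labels via scipy’s f_oneway, plus the overall BSS/WSS ratio. It’s a stand-in for my actual code and assumes f_oneway’s support for multidimensional input (one test per feature column).

import numpy as np
from scipy import stats

def cluster_anova(X, labels):
    # split the observations by cluster label
    groups = [X[labels == c] for c in np.unique(labels)]

    # per-feature F scores and p-values (each cluster's own n enters the df here)
    f_scores, p_values = stats.f_oneway(*groups)

    # overall between-cluster and within-cluster sums of squares
    grand_mean = X.mean(axis=0)
    bss = sum(len(g) * ((g.mean(axis=0) - grand_mean) ** 2).sum() for g in groups)
    wss = sum(((g - g.mean(axis=0)) ** 2).sum() for g in groups)
    return f_scores, p_values, bss / wss

Looping k over 3 to 15, fitting the constrained k-means at each k, and keeping the k with the largest BSS/WSS ratio is the selection step described above.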

The result is 6 clusters of sizes: 5, 9, 9, 9, 9, 9

The same clustering converges regardless of seed.

#clustering #machinelearning #anova #unsupervised

You’ll have to forgive the double upload of the map and report data. The link has the corrected report (5 pages vs. 9; I added an infographic of how the algorithm finds the optimal cluster size).

Based on this, I’d say the 5th grouping is the best set of states to live in (it includes WA, CO, MN, and the Northeast states).

V1 (2021-11-24)

ClusterMapReport

I’ve perfected my clustering of US states by 2010 census data.

I converge on the same breakdown of states each run.

Methodology

I use principal components to normalize the dataset, then weight the principal components by their contributing proportion of variance before feeding them into k-means. Finally, I do ANOVA, using the BSS/WSS ratio to find the optimal k as well as to derive p-values.

There are 4 clusters of sizes: 45, 3, 1, 1

The two single-state clusters are Massachusetts and Alaska.

Then there is a small group of 3 states: Alabama, New Mexico, and South Dakota.

I draw violin plots showing the distribution of the variables against each cluster.

#Clustering #ANOVA

Code: https://github.com/thistleknot/R/blob/master/AnovaClustering.R

Markowitz Profiles T-Tested

I rewrote my Markowitz profile. I tried using atoti, but at the moment it unfortunately doesn’t support a covariance matrix (covariance aggregates across columns, while atoti’s aggregation functions work on one column at a time).

So I did the simulation properly.

I take about six quarters of trading days of fully populated stocks.

Then I generate tensors and pull 100 random starting points, each using a random set of 10 ETFs (out of around 100 ETFs, including mutual funds and bonds), with 52 training weeks and 4 holdout weeks.

I then derive optimal weights from a Markowitz Monte Carlo simulation and apply them to the holdout window.

I draw the efficient frontier as well as the performance of the weighted holdout period.

Then I do a one-sided paired t-test to determine if the mean return between the two populations (100 runs of unweighted holdout vs. weighted holdout) is different.

And it is.

The p-value is below 1%.
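The weight-selection and test steps, roughly sketched below: a random-weight Monte Carlo that picks max-Sharpe weights on the training window, then a one-sided paired t-test across the 100 runs. The data loading, tensor generation, and window slicing are omitted, and the names are placeholders.

import numpy as np
from scipy import stats

def max_sharpe_weights(train_returns, n_sims=5000, seed=0):
    # train_returns: (days, assets) array of daily returns for the training window
    rng = np.random.default_rng(seed)
    mu = train_returns.mean(axis=0)
    cov = np.cov(train_returns, rowvar=False)
    best_w, best_sharpe = None, -np.inf
    for _ in range(n_sims):
        w = rng.random(train_returns.shape[1])
        w /= w.sum()                              # long-only weights summing to 1
        sharpe = (w @ mu) / np.sqrt(w @ cov @ w)
        if sharpe > best_sharpe:
            best_sharpe, best_w = sharpe, w
    return best_w

# one value per run: mean holdout return with optimal weights vs. equal weights
# weighted, unweighted = np.array([...]), np.array([...])
# t, p = stats.ttest_rel(weighted, unweighted, alternative="greater")  # paired, one-sided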

https://nbviewer.org/github/thistleknot/Python-Stock/blob/master/code/markowitz/Markowitz-slim-Atoti.ipynb

Partial Correlation Robust Dimension Reduction using KFolds

I wrote some Python code that tabulates partial correlation significance across k folds for robust dimension reduction.

This has been on my to do list for a while.

I wrote this for a few reasons: one being that my stock correlation analysis kept flipping signs on me across partitions, but also because the Bayes regression tutorial I followed mentioned using correlation analysis for factor reduction (which seems to be pretty common across any type of supervised method). The k-fold modification was applied because it makes sense to sample correlations, similar to how Bayesian methods sample distributions with Markov chains; it’s not exactly the same type of sampling, just the same concept of sampling to identify a metric. I’ve learned that correlations can give false positives (only occasionally significant, which happens often with a variable when using subsamples), and almost no article discusses finding significant coefficients of determination.
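A minimal sketch of the tabulation, assuming pingouin’s partial_corr and sklearn’s KFold; the purity-style percentages (how often a feature is significant, how often its sign is positive) are my framing of the output, and the column names are placeholders.

import pandas as pd
import pingouin as pg
from sklearn.model_selection import KFold

def pcorr_stability(df: pd.DataFrame, target: str, k: int = 5, alpha: float = 0.05):
    records = []
    for _, fold_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(df):
        fold = df.iloc[fold_idx]                       # one subsample per fold
        for col in df.columns.drop(target):
            covars = list(df.columns.drop([target, col]))
            res = pg.partial_corr(data=fold, x=col, y=target, covar=covars)
            records.append({"feature": col,
                            "r": res["r"].iloc[0],
                            "significant": res["p-val"].iloc[0] < alpha})
    tab = pd.DataFrame(records)
    # purity-style tabulation: % of folds significant, % of folds with a positive sign
    return tab.groupby("feature").agg(pct_significant=("significant", "mean"),
                                      pct_positive=("r", lambda r: (r > 0).mean()))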

Gist: https://gist.github.com/thistleknot/ce1fc38ea9fcb1a8dafcfe6e0d8af475

Notebook with full model and some added purity scoring (tabulation of sign / significance as a percentage): https://github.com/thistleknot/python-ml/blob/master/code/pcorr-significance.ipynb

Belief and faith

I was explaining to a friend yesterday how I view divine revelations: things such as synchronicities, signs, etc.


I mentioned parallelomania as a counter to such beliefs.

As a deist, I believe that if there is a higher power, it works within the confines of the laws it set forth, mainly so that each person’s belief is relevant to them alone and can be doubted by others. The takeaway is that you’re wasting your breath if you’re trying to give a witness, or testimony, of your “proof” to others. The religious experience makes for good stories, but it is unique and relevant to each person specifically.


In other words, you can’t convince others. The way reality is set up, there is a deistic veil where everything can be doubted, and it requires faith to traverse (i.e. parallelomania vs. synchronicity). It’s up to the person having the experience (i.e. seeing the signs) whether they want to believe, take the signs at their perceived face value, and act on them, or dismiss them as mere coincidences. It is this choice that determines whether the person is going to live a spiritual life or not.


If a deistic power exists, there will always be two explanations for things: a doubtable, non-spiritual one and a spiritual one (or, more accurately, a pantheistic one). Both are valid, existing on opposite sides of a pane of glass, neither capable of seeing the other, but the mind (aka the eternal soul) can perceive the acausal connections between events that paint a larger picture (i.e. via archetypes).

Identified lagged stock variables including SP500

It looks like I [marginally] beat the SP500

LQD

I found the optimal significant lag for various variables and plotted an upper/lower graph for each (i.e. a hold signal for the SP500 return when the variable’s daily delta is >0 vs. <0), then compared it to the SP500.

This was the one that showed promise because it was daily and it gave me a good result.

I actually did my homework: I used a training partition to find the significant correlations, then carried those terms to the test partition and derived their total cumulative returns, and this one was the one that made the most sense.
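A rough sketch of that procedure; the column names, lag range, and Pearson-based significance screen are placeholders, and the target column is assumed to hold daily SP500 returns while the predictor is a price series like LQD.

import pandas as pd
from scipy import stats

def best_lag(train: pd.DataFrame, predictor: str, target: str, max_lag: int = 20):
    # most significant lag of the predictor's daily delta vs. the SP500 return
    results = []
    for lag in range(1, max_lag + 1):
        pair = pd.concat([train[predictor].diff().shift(lag),
                          train[target]], axis=1).dropna()
        r, p = stats.pearsonr(pair.iloc[:, 0], pair.iloc[:, 1])
        results.append((lag, r, p))
    return min(results, key=lambda t: t[2])            # (lag, r, p) with the smallest p

def holdout_signal_return(test: pd.DataFrame, predictor: str, target: str, lag: int):
    # hold the SP500 when the lagged daily delta is > 0, sit out when it is < 0
    signal = (test[predictor].diff().shift(lag) > 0).astype(int)
    strategy = (1 + test[target] * signal).cumprod() - 1
    buy_and_hold = (1 + test[target]).cumprod() - 1
    return strategy.iloc[-1], buy_and_hold.iloc[-1]    # total cumulative returns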

Now I can apply this method to any symbol

=D

The blue line is the SP500.

Additional variables

#econometrics