2010 Census Data Cluster Analysis

I’ve pulled together the clustering work I’ve been doing.

Based on 2010 Census data (Statistical Abstract of the United States).

The optimal k, by majority rule across NbClust’s 30 tests, was 2. I used Box-Cox transformed, PCA-whitened and weighted variables, as well as fviz_dist to visualize the similarities of clusters.

Violin plots summarizing the actual clusters.

A 3D PCA plot showing the top 3 principal components, which together account for 77% of the variance.

And finally a pairplot showing the 2 clusters mapped onto every possible scatterplot of the original data.

factoextra fviz_dist

Using 2010 Census Data (Statistical Abstract of the United States)

I’ve scaled the data according to PCA variance.

This is a graph from factoextra’s fviz_dist.
Red = ~0 Euclidean distance (for clustering)
Blue = dissimilar
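
For reference, the matrix behind a distance heatmap like this is just the pairwise Euclidean distance between rows. Here is a minimal Python sketch of that calculation. The sample rows are made up for illustration (they are not census values), and the function name is my own, not part of factoextra.

```python
import math

def euclidean_distance_matrix(rows):
    """Return the symmetric matrix of pairwise Euclidean distances between rows."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(rows[i], rows[j])))
            dist[i][j] = dist[j][i] = d
    return dist

# Hypothetical scaled observations (illustration only).
sample = [[0.0, 0.0], [3.0, 4.0], [0.1, 0.0]]
dist = euclidean_distance_matrix(sample)
```

Rows 0 and 2 sit almost on top of each other, so their cell would be drawn at the "~0 distance" end of the colour scale, while row 1 is far from both.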


[1] 0.5639438
A value closer to 1 means the data is very clusterable.

This is very interesting.

PCA Clustergram Violin Plots with Pairplot

Based on the last census data (2010?)

I used Clustergram with PCA scaling to create the cluster labels, and it did a great job of separating the features (i.e. unsupervised). The best-performing k based on fitness was 2 groups (which definitely makes things easier to see).

I’ve arranged the violin plots next to each other for easy comparison.

This would be nice to have in a dashboard, to create custom segments based on various values of k.

I would like to run an ANOVA using the total, within-cluster and between-cluster sums of squares (TSS, WSS, BSS) to confirm the populations are different.
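
As a rough sketch of what that ANOVA would involve, the Python below decomposes one variable into TSS = WSS + BSS by cluster label and forms the usual one-way F statistic. The function name and the six sample values are assumptions for illustration only. A p-value would still need the F distribution with (k - 1, n - k) degrees of freedom.

```python
def anova_sums_of_squares(values, labels):
    """One-way ANOVA decomposition for one variable split by cluster label."""
    n = len(values)
    grand_mean = sum(values) / n

    # Group the observations by their cluster label.
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)

    tss = sum((v - grand_mean) ** 2 for v in values)
    wss = 0.0
    bss = 0.0
    for members in groups.values():
        m = sum(members) / len(members)
        wss += sum((v - m) ** 2 for v in members)        # spread inside the cluster
        bss += len(members) * (m - grand_mean) ** 2      # spread between cluster means

    k = len(groups)
    f_stat = (bss / (k - 1)) / (wss / (n - k))
    return tss, wss, bss, f_stat

# Made-up values for two clusters (illustration only).
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = [1, 1, 1, 2, 2, 2]
tss, wss, bss, f_stat = anova_sums_of_squares(values, labels)
```

A large F means most of the total variation sits between the clusters rather than inside them.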

Optimal SMA

I’ve perfected my tests.

The algorithm uses moving windows and finds the optimal SMA (tracking volume-weighted price) that maximizes return.
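
A minimal Python sketch of that kind of search follows, under assumptions of my own: the entry rule is "price above its SMA", the return is measured a fixed number of days later, and the best window is the one with the highest mean forward return. The function names and the synthetic price series are hypothetical, and this is not the notebook's actual code. The input list could just as well be a volume-weighted price series.

```python
def sma(prices, window):
    """Simple moving average; entry i covers prices[i - window + 1 : i + 1]."""
    out = [None] * len(prices)
    running = 0.0
    for i, p in enumerate(prices):
        running += p
        if i >= window:
            running -= prices[i - window]
        if i >= window - 1:
            out[i] = running / window
    return out

def mean_forward_return(prices, window, horizon):
    """Mean return from holding `horizon` days, entered when price > SMA."""
    avg = sma(prices, window)
    returns = []
    for i in range(window - 1, len(prices) - horizon):
        if prices[i] > avg[i]:
            returns.append(prices[i + horizon] / prices[i] - 1.0)
    return sum(returns) / len(returns) if returns else None

def best_sma_window(prices, windows, horizon):
    """Return (window, mean_return) with the highest mean forward return."""
    scored = []
    for w in windows:
        r = mean_forward_return(prices, w, horizon)
        if r is not None:
            scored.append((w, r))
    return max(scored, key=lambda item: item[1]) if scored else None

# Synthetic price series for illustration only.
prices = [100 + i + (5 if i % 7 == 0 else 0) for i in range(200)]
best = best_sma_window(prices, range(5, 60), horizon=7)
```

In practice the search would run on a training partition only, so the chosen window can be checked against data it has not seen.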

Simply using the optimal SMA strategy works best (as opposed to mean reversion or MACD), considering that I look n days ahead (7) and calculating those optional indicators gets tricky (i.e. they are relative to both the current and the future date).

This shows two different time periods as starting points and the return I would have gotten

This can be derived by running the following notebook

back tested returns

I followed the guide “An Algorithm to find the best moving average for stock trading”, applied it to BTC, and found the optimal SMA for a 1-week return.

The optimal SMA is 87 days, and a t-test across the training/test partitions supports it (i.e. the return was the same in both).
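
To make the train/test comparison concrete, here is a small Python sketch of Welch's t statistic for two sets of returns. I am assuming a Welch-style (unequal variance) test, and the returns below are made up. With SciPy available, `scipy.stats.ttest_ind(a, b, equal_var=False)` gives the p-value directly.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic and degrees of freedom for two independent samples."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean.
    va, vb = variance(sample_a) / na, variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Made-up 7-day returns for the two partitions (illustration only).
train_returns = [0.05, 0.03, 0.06, 0.02, 0.04]
test_returns = [0.04, 0.05, 0.03, 0.06, 0.02]
t_stat, dof = welch_t(train_returns, test_returns)
```

A t statistic near zero (and so a large p-value) is what "the return was the same" looks like: the test partition does not contradict the training result.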

Average return for holding for 7 days is ~4%

It’s currently above the SMA.

I think I can build portfolios

fbProphet, VWP, MACD, RSI, AutoML, FredData forecasting BTC 60% return in 1 year

I threw everything I had at it:

* fbProphet predicting volume weighted price
* Augmented Bollinger Bands (bbands)
* FRED data (110 additional terms)
* AutoML using the above (in rolling windows) as predictors

to predict the next day’s price of BTC.

I got a 60% return on BTC (500% in the last year)

I think it’s time to seriously consider that micro masters in finance.


RSlurm Cluster

Running Models
Initial test to grab stock data

This is a huge breakthrough for me. I’ve been working towards this for years.

I finished setting up my cluster to auto-boot an image that contains 95% of the libraries I need for R, and processed an actual job (downloading stock symbols). I had to figure out how to parse the input list for rslurm (slightly different from mclapply: it wants something like a C++ struct, i.e. a data frame).

Test case: 500 symbols over 16 cores. I alternated between issuing the job via sbatch and bypassing it by using rslurm directly (which autodetects the Slurm nodes). The latter fared better (it actually returned all 500 results).
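
To picture how a job like this fans out, here is a small Python sketch that splits 500 items into 16 near-equal chunks, one per core. This is only an assumption about even partitioning, not a description of how rslurm schedules work internally, and the ticker names are placeholders.

```python
def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous pieces whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

symbols = [f"SYM{i:03d}" for i in range(500)]  # placeholder tickers
chunks = split_into_chunks(symbols, 16)
```

Checking that the chunk sizes add back up to 500 is a cheap way to confirm that no symbols were dropped before comparing against the number of results that come back.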

The cluster uses Ansible to apply post-boot configuration.
It uses GlusterFS (a parallel file system) as the source for deploying images.

Now I get to begin the minor annoyance of retrofitting my stock code. It’s not a 1:1 crossover. For one, my R cluster doesn’t have an rstudio user. Another difference is that the cluster runs CentOS 7 with R 3.6, whereas I’ve been running a web-based RStudio on Debian 10 with R 4.0. Upgrading R 3.6 to 4.0 isn’t too difficult, but getting OpenBLAS set as the default on a custom compile isn’t self-explanatory.

Welcome to the world of figuring things out for yourself.

You’ll have to take my word for it that the terminal windows are showing R being forked out to each node.


Lift Charts

I believe I have the correct algorithm for cutoffs now, thanks to revisiting how to do lift charts in Data Mining For Business Analytics (page 123).

It also works with regression models. My other models (regression as well) didn’t perform as well as this one. This one actually had a positive lift.

The metric of importance is the “cumulative mean response”. That is what I’ve been trying to capture manually (I actually converged on the same formula), but now I can use the gains function on my validation set to capture that point.

That is, I grab the last cumulative mean response that is at least 66%. This will help me build portfolios with a 66% positive return.
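
The cutoff idea can be sketched in a few lines of Python. This is a per-record version written under my own assumptions (the gains function works on binned groups rather than single records), with hypothetical function names and made-up scores. Here `actual` is 1 when the pick had a positive return and 0 otherwise.

```python
def cumulative_mean_response(predicted, actual):
    """Sort by predicted score (descending) and return (score, running mean of actual)."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    out, total = [], 0.0
    for rank, i in enumerate(order, start=1):
        total += actual[i]
        out.append((predicted[i], total / rank))
    return out

def last_cutoff_at_least(curve, threshold):
    """Score at the deepest rank whose cumulative mean response is >= threshold."""
    cutoff = None
    for score, cum_mean in curve:
        if cum_mean >= threshold:
            cutoff = score
    return cutoff

# Made-up validation scores and outcomes (illustration only).
predicted = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
actual = [1, 1, 0, 1, 0, 0]
curve = cumulative_mean_response(predicted, actual)
cutoff = last_cutoff_at_least(curve, 0.66)
```

Everything scored at or above the returned cutoff would go into the portfolio, and anything below it is where the cumulative hit rate drops under the target.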

Some additional sources