I’ve aggregated the clustering work I’ve been doing.
Based on 2010 Census data (Statistical Abstract of the United States).
The optimal k found by NbClust’s 30 indices was 2 (majority rule). I used Box-Cox transformed, PCA-whitened, and weighted variables, plus fviz_dist to visualize the similarity between clusters.
Violin plots to summarize the actual clusters
A 3D PCA plot showing the top 3 principal components, which together explain 77% of the variance
And finally a pair plot showing the 2 clusters mapped against every possible scatterplot of the original data.
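The majority rule behind the choice of k can be sketched in a few lines. This is a Python illustration, not the R code I ran: the function name `majority_rule_k`, the tie-breaking choice (smallest k wins), and the vote list are all assumptions made up for the example, not NbClust output.

```python
from collections import Counter

def majority_rule_k(recommended_ks):
    """Return the k proposed by the most indices.

    Assumption: ties are broken in favor of the smallest k.
    """
    counts = Counter(recommended_ks)
    best_count = max(counts.values())
    return min(k for k, c in counts.items() if c == best_count)

# Made-up votes from 10 hypothetical indices (not real NbClust results).
votes = [2, 2, 3, 2, 4, 2, 3, 2, 5, 2]
best_k = majority_rule_k(votes)
print(best_k)
```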
I used Clustergram with PCA scaling to create the cluster labels, and it did a great job of separating the features (i.e. unsupervised). The best-performing k based on fitness was 2 groups (which definitely makes things easier to see).
I’ve arranged the violin plots next to each other for easy comparison.
This would be nice to have in a dashboard, to create custom segments for various values of k.
I would like to run an ANOVA using TSS, WSS, and BSS (total, within-cluster, and between-cluster sums of squares) to confirm the populations are different.
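A minimal sketch of that check for a single variable, assuming a one-way ANOVA where the cluster label is the grouping factor. It relies on the identity TSS = WSS + BSS and the statistic F = (BSS / (k - 1)) / (WSS / (n - k)). The function name `anova_sums` and the toy data are hypothetical, and this is Python rather than the R I would actually use.

```python
def anova_sums(values, labels):
    """Return (TSS, WSS, BSS, F) for one variable grouped by cluster label."""
    n = len(values)
    grand_mean = sum(values) / n

    # Collect the values belonging to each cluster.
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    tss = sum((v - grand_mean) ** 2 for v in values)
    wss = 0.0
    bss = 0.0
    for members in groups.values():
        group_mean = sum(members) / len(members)
        wss += sum((v - group_mean) ** 2 for v in members)
        bss += len(members) * (group_mean - grand_mean) ** 2

    k = len(groups)
    f_stat = (bss / (k - 1)) / (wss / (n - k))
    return tss, wss, bss, f_stat

# Toy data: two well-separated clusters (made up for illustration).
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = [1, 1, 1, 2, 2, 2]
tss, wss, bss, f_stat = anova_sums(values, labels)
```

A large F relative to the F distribution with (k - 1, n - k) degrees of freedom would suggest the cluster means differ on that variable.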
The algorithm uses moving windows and finds the optimal SMA (simple moving average) that maximizes return, tracking the volume-weighted price.
Simply using the optimal SMA strategy works best (as opposed to mean reversion or MACD), since I look n days ahead (n = 7) and calculating those optional indicators gets tricky (i.e. they are relative to both the current and the future date).
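Roughly, the window search looks like the sketch below. It is a simplified Python illustration, not the notebook code. The entry rule (count a day only when the price is above its SMA), the scoring (mean return over the next 7 days), and the names `sma`, `mean_forward_return`, and `best_sma_window` are assumptions for the example. `prices` is assumed to already be a volume-weighted price series.

```python
def sma(prices, window):
    """Simple moving average; entry i covers prices[i : i + window]."""
    return [sum(prices[i:i + window]) / window
            for i in range(len(prices) - window + 1)]

def mean_forward_return(prices, window, horizon=7):
    """Mean return `horizon` days ahead on days where price > SMA (assumed rule)."""
    returns = []
    for i, average in enumerate(sma(prices, window)):
        t = i + window - 1              # day on which this average is known
        if t + horizon >= len(prices):  # no future price left to score against
            break
        if prices[t] > average:
            returns.append(prices[t + horizon] / prices[t] - 1.0)
    return sum(returns) / len(returns) if returns else float("-inf")

def best_sma_window(prices, windows, horizon=7):
    """Pick the window whose signal days had the highest mean forward return."""
    return max(windows, key=lambda w: mean_forward_return(prices, w, horizon))

# Synthetic, steadily rising series (made up; not real market data).
prices = [100.0 + day for day in range(60)]
best = best_sma_window(prices, windows=[5, 10, 20])
```

Scoring on the forward return keeps the look-ahead in one place: the SMA only uses prices up to day t, and the 7-day future price is touched only when measuring the outcome.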
This shows two different time periods as starting points and the return I would have gotten from each.
This can be derived by running the following notebook
This is a huge breakthrough for me. I’ve been working towards this for years.
I finished setting up my cluster to auto-boot an image that contains 95% of the libraries I need for R, and processed an actual job (downloading stock symbols). I had to figure out how to parse the input list for rslurm (slightly different from mclapply: roughly a C++ struct, i.e. a data frame).
Test case: 500 symbols over 16 cores. I alternated between issuing the job via sbatch and bypassing it by using just rslurm (which autodetects the Slurm nodes). The latter fared better (it actually got 500 results).
The cluster uses Ansible to apply post-configuration and GlusterFS (a parallel file system) to deploy images from.
Now I get to begin the minor annoyance of retrofitting my stock code. It’s not a 1:1 crossover. For one, my R cluster doesn’t have an rstudio user. Another difference is that I’m running CentOS 7 with R 3.6, but I’ve been running a web-based RStudio on Debian 10 with R 4.0. Upgrading R 3.6 to 4.0 isn’t too difficult, but getting OpenBLAS set as the default on a custom compile isn’t self-explanatory.
Welcome to the world of figuring things out for yourself.
You’ll have to take my word for it that the terminal windows are showing R being forked out to each node.
I believe I now have the correct algorithm for cutoffs, thanks to revisiting how to do lift charts in Data Mining for Business Analytics (page 123).
And you can use regression models. My other models (regression as well) didn’t perform as well as this one. This one actually had a positive lift.
The metric of importance is the “cumulative mean response”. That is what I’ve been trying to capture manually (I actually converged on the same formula), but now I can use the gains function on my validation set to capture that point.
I.e. to grab the last cumulative mean response that is at least 66%. This will help me build portfolios with a 66% positive return.
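The idea behind that cutoff can be sketched as follows. This is a Python illustration of the concept, not the gains function itself. It works record by record rather than by groups, and the function name `cutoff_for_target`, the 0/1 coding of outcomes (1 = positive return), and the toy validation data are assumptions for the example.

```python
def cutoff_for_target(actual, predicted, target=0.66):
    """Return the lowest predicted score whose cumulative mean response
    (taken from the top-ranked record downward) is still >= target,
    or None if the target is never met.
    """
    # Rank validation records from highest to lowest predicted score.
    ranked = sorted(zip(predicted, actual), key=lambda pair: pair[0], reverse=True)
    cutoff = None
    total = 0.0
    for rank, (score, outcome) in enumerate(ranked, start=1):
        total += outcome
        if total / rank >= target:
            cutoff = score  # deepest point so far that still meets the target
    return cutoff

# Made-up validation set: 1 = positive return, 0 = not.
actual = [1, 1, 0, 1, 0, 0, 1, 0]
predicted = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
cutoff = cutoff_for_target(actual, predicted)
```

Symbols scoring at or above the returned cutoff are the ones that, on the validation set, had a cumulative positive-return rate of at least the target.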