ANOVA Cluster Analysis of US States

V3 (2021-11-26)

I made an error in how I derived my F scores: I used the population n rather than each individual cluster’s n. After correcting this, I saw that not all of my prior clusters were statistically significant.

I also redid the min and max constraints, because I caught negative F scores whenever a cluster had size n < k. After ensuring the min cluster size was at least k+1, I saw that k is capped at 6, because beyond that point min constraint * k > sample size.
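
To see why a small n produces a negative F score, here is a minimal sketch of the standard one-way ANOVA F statistic. The function name and the BSS/WSS numbers are made up for illustration, and the exact derivation in the linked code may differ; the point is only that the within-group degrees of freedom, n - k, go negative once n < k.

```python
def f_statistic(bss, wss, n, k):
    """One-way ANOVA F = (BSS / (k - 1)) / (WSS / (n - k)).

    Illustrative helper, not taken from the linked code.
    """
    if k < 2:
        raise ValueError("need at least 2 groups")
    df_between = k - 1
    df_within = n - k  # negative whenever n < k
    if df_within == 0:
        raise ValueError("n == k leaves no within-group degrees of freedom")
    return (bss / df_between) / (wss / df_within)


# Hypothetical sums of squares, chosen only to show the sign change.
f_ok = f_statistic(bss=30.0, wss=10.0, n=50, k=4)   # n - k = 46, so F is positive
f_bad = f_statistic(bss=30.0, wss=10.0, n=3, k=4)   # n - k = -1, so F is negative
print(f_ok, f_bad)
```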

I adjusted the code accordingly and as a result…

The optimal number of clusters for the US is 4, starting from an initial min size constraint of 2 and an initial max of 5. The effective min and max constraints are not constants: each is picked by comparing init_min or init_max against a value based on k.

i.e.
min = MAX(init_min,k+1)
max = MAX(init_max,ceiling(n/k))
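
A minimal sketch of these bounds, assuming n = 50 (the cluster sizes in every version sum to 50) and reading the upper bound as the larger of init_max and ceiling(n/k), which is the reading consistent with the V2 rule max(5,n/k) and with the reported cluster sizes. The function names are mine, not from the linked code.

```python
import math


def size_bounds(n, k, init_min=2, init_max=5):
    """Return (size_min, size_max) for k clusters over n records."""
    size_min = max(init_min, k + 1)              # keep every cluster larger than k
    size_max = max(init_max, math.ceil(n / k))   # leave room for an even split
    return size_min, size_max


def is_feasible(n, k, init_min=2, init_max=5):
    """True if n records can be split into k clusters within the bounds."""
    size_min, size_max = size_bounds(n, k, init_min, init_max)
    return size_min * k <= n <= size_max * k


for k in range(3, 9):
    print(k, size_bounds(50, k), is_feasible(50, k))
```

With n = 50 this stops being feasible at k = 7, where the minimum size of 8 would need 56 records.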

Methodology:
* Values are centered & scaled
* Principal Components applied
* [Components] Weighted by each component’s proportion of variance
* Weighted components are then fed into an improved kmeans (KMeansConstrained) algorithm
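
The preprocessing steps above can be sketched with NumPy alone. This is an illustration under assumptions, not the linked code: the function name is hypothetical, the data is random, and "weighted" is taken to mean multiplying each component's scores by its share of the total variance.

```python
import numpy as np


def weighted_principal_components(X):
    """Center/scale X, project onto principal components, weight by variance share."""
    X = np.asarray(X, dtype=float)
    # Center and scale each column to mean 0 and sample standard deviation 1.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # PCA via SVD: the rows of Vt are the component directions.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt.T                     # one column of scores per component
    var_ratio = s**2 / np.sum(s**2)       # proportion of variance per component
    return scores * var_ratio, var_ratio


# Random stand-in for the 50-state table (4 made-up variables).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
weighted, ratio = weighted_principal_components(X_demo)
print(weighted.shape, ratio)
```

The weighted matrix is what would then be passed to the clustering step, for example to KMeansConstrained from the k-means-constrained package with its size_min and size_max arguments.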

The same clustering converges regardless of seed.

Code: https://lnkd.in/gnBzHqre

Report: https://github.com/thistleknot/R/blob/master/ClusterMapReport.pdf

Cluster sizes: 11, 13, 13, 13

V2 (2021-11-25)

Based on 2010 Census Data

I didn’t like how my clustering algorithm was selecting one giant cluster and a few single-record clusters.

So I ported the ANOVA metrics and findknee algo over and used a better kmeans algorithm, KMeansConstrained, that allowed me to set size constraints. I set a minimum cluster size of 2 and a maximum of max(5,n/k), with k iterating over the range 3 to 15.

Methodology:
* Values are centered & scaled
* Principal Components applied
* [Components] Weighted by each component’s proportion of variance
* Weighted components are then fed into an improved kmeans (KMeansConstrained) algorithm

I then derived p-values from the resulting F scores to determine whether the clusters were statistically significant, and they are.

The point of this exercise is to accommodate a need for useful bifurcations. I presumed correctly that I could get statistically significant cluster groupings at any k. So I defined min and max cluster sizes using KMeansConstrained and then picked the k with the best BSS/WSS ratio.
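
As a rough illustration of that last step, the sketch below computes the between-cluster (BSS) and within-cluster (WSS) sums of squares for a set of labels and picks the k with the largest ratio. The function names and the toy one-dimensional data are made up for the example, and the labelings are supplied by hand rather than produced by KMeansConstrained.

```python
def bss_wss(points, labels):
    """Return (BSS, WSS) for points (equal-length tuples) and their cluster labels."""
    dims = len(points[0])
    grand = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    bss = wss = 0.0
    for members in groups.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        # Between: cluster size times squared distance from centroid to grand mean.
        bss += len(members) * sum((c - g) ** 2 for c, g in zip(centroid, grand))
        # Within: squared distance from each member to its own centroid.
        wss += sum((x - c) ** 2 for p in members for x, c in zip(p, centroid))
    return bss, wss


def best_k(points, labelings):
    """labelings maps k -> labels; return the k with the largest BSS/WSS ratio."""
    def ratio(k):
        bss, wss = bss_wss(points, labelings[k])
        return bss / wss
    return max(labelings, key=ratio)


# Toy data: three obvious pairs on a line, labeled two different ways.
toy_points = [(0,), (1,), (10,), (11,), (20,), (21,)]
toy_labelings = {2: [0, 0, 0, 0, 1, 1], 3: [0, 0, 1, 1, 2, 2]}
print(best_k(toy_points, toy_labelings))
```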

The result is 6 clusters of sizes: 5, 9, 9, 9, 9, 9

The same clustering converges regardless of seed.

#clustering #machinelearning #anova #unsupervised

You’ll have to forgive the double upload of the map and report data. The link has the corrected report (5 pages vs 9; I added an infographic of how the algorithm finds the optimal cluster size).

Based on this, I’d say the 5th grouping is the best set of states to live in (it includes WA, CO, MN, and the Northeast states).

V1 (2021-11-24)

ClusterMapReport

I’ve perfected my clustering of US states by 2010 census data.

I converge on the same breakdown of states each run.

Methodology

I use Principal Components to normalize the dataset, then weight the principal components by their contributing proportion of variance before feeding them into kmeans. Finally, I run ANOVA to find the optimal k (maximizing BSS relative to WSS) and to derive p-values.

There are 4 clusters of sizes: 45, 3, 1, 1

The two single-state clusters are Massachusetts and Alaska.

Then there is a small group of 3 states: Alabama, New Mexico, and South Dakota.

I draw violin plots showing the distribution of the variables against each cluster.

#Clustering #ANOVA

Code: https://github.com/thistleknot/R/blob/master/AnovaClustering.R
