Partial Correlation Robust Dimension Reduction using KFolds

I wrote some Python code that tabulates partial-correlation significance across k folds for robust dimension reduction.

This has been on my to-do list for a while.

I wrote this for a few reasons. One was my stock correlation analysis flipping signs on me across partitions. Another was that the Bayesian regression tutorial I followed mentioned using correlation analysis for factor reduction (which seems to be pretty common across supervised methods of any type). The k-fold modification came from the idea that it makes sense to sample correlations in a way loosely analogous to how Bayesian methods sample distributions with Markov chains; not the same kind of sampling, just the same concept of repeated sampling to estimate a metric. I’ve learned that correlations can give false positives (a variable that is only occasionally significant across subsamples turns up surprisingly often), and almost no article discusses finding significant coefficients of determination.

Gist: https://gist.github.com/thistleknot/ce1fc38ea9fcb1a8dafcfe6e0d8af475

Notebook with full model and some added purity scoring (tabulation of sign / significance as a percentage): https://github.com/thistleknot/python-ml/blob/master/code/pcorr-significance.ipynb
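
Roughly, the idea looks like this (a minimal sketch, assuming pingouin for the partial correlations and scikit-learn for the folds; the target column name, alpha, and the exact purity definition are illustrative rather than the gist's verbatim code):

```python
# Minimal sketch: tabulate partial-correlation sign/significance across k folds.
# Assumes a pandas DataFrame `df` with a numeric target column 'y'.
import pandas as pd
import pingouin as pg
from sklearn.model_selection import KFold

def kfold_pcorr(df, target='y', k=5, alpha=0.05, seed=0):
    features = [c for c in df.columns if c != target]
    tally = {f: {'sig': 0, 'pos': 0, 'neg': 0} for f in features}
    # recompute the partial correlations on each of the k training subsamples
    for train_idx, _ in KFold(n_splits=k, shuffle=True, random_state=seed).split(df):
        fold = df.iloc[train_idx]
        for f in features:
            covars = [c for c in features if c != f]  # partial out every other predictor
            res = pg.partial_corr(data=fold, x=f, y=target, covar=covars)
            r, p = res['r'].iloc[0], res['p-val'].iloc[0]
            if p < alpha:
                tally[f]['sig'] += 1
                tally[f]['pos' if r > 0 else 'neg'] += 1
    out = pd.DataFrame(tally).T
    # "purity": share of folds where the term is significant with a consistent sign
    out['purity'] = out[['pos', 'neg']].max(axis=1) / k
    return out.sort_values('purity', ascending=False)
```

Terms with low purity (significant in only a few folds, or flipping sign between them) are the false-positive candidates I want to drop.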

Belief and faith

I was explaining to a friend yesterday how I view divine revelations, things such as synchronicities, signs, and so on.


I mentioned parallelomania as a counter to such beliefs.

As a deist, I believe that if there is a higher power, it works within the confines of the laws it set forth, mainly so that each person’s belief is relevant to them alone and can be doubted by others. The takeaway is that you’re wasting your breath if you try to give a witness, or testimony, of your “proof” to others. The religious experience makes for good stories, but it is unique and relevant to each person specifically.


In other words, you can’t convince others. The way reality is set up, there is a deistic veil where everything can be doubted, and it takes faith to traverse it (i.e. parallelomania vs. synchronicity). It’s up to the person having the experience (i.e. seeing the signs) whether they want to believe, take the signs at their perceived face value, and act on them, or dismiss them as mere coincidences. That choice determines whether the person is going to live a spiritual life or not.


If a deistic power exists, there will always be two explanations for things: a doubtable non-spiritual one and a spiritual one (or, more accurately, a pantheistic one). Both are valid, existing on opposite sides of a pane of glass, neither able to see the other, but the mind (the eternal soul) can perceive the acausal connections between events that paint a larger picture (i.e. via archetypes).

Identified lagged stock variables including SP500

It looks like I [marginally] beat the SP500

LQD

I found the optimal significant lag for various variables and plotted an upper/lower signal for each (i.e. hold the SP500 return when the lagged daily delta is >0, or when it is <0), then compared each against the SP500 itself.

LQD was the one that showed promise, because it is daily data and it gave me a good result.

I actually did my homework and used a training partition to find the significant correlations, then carried those terms over to the test partition and derived their total cumulative returns; this one made the most sense.
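
As a rough sketch of that workflow (the tickers, lag window, significance cutoff, and 70/30 split below are my own illustrative choices, not necessarily what the notebook uses):

```python
# Sketch: screen lags on a training partition, then trade the signal on the test partition.
import yfinance as yf
from scipy import stats

prices = yf.download(['^GSPC', 'LQD'], start='2015-01-01')['Close']
rets = prices.pct_change().dropna()
split = int(len(rets) * 0.7)
train, test = rets.iloc[:split], rets.iloc[split:]

best = None  # (lag, r, p) of the most significant lag of LQD's daily delta vs the SP500 return
for lag in range(1, 21):
    x = train['LQD'].shift(lag).dropna()
    r, p = stats.pearsonr(x, train['^GSPC'].loc[x.index])
    if p < 0.05 and (best is None or p < best[2]):
        best = (lag, r, p)

if best:
    lag, r, _ = best
    # hold the SP500 only on days the lagged delta is on the "right" side of zero
    signal = test['LQD'].shift(lag) > 0 if r > 0 else test['LQD'].shift(lag) < 0
    strategy = (1 + test['^GSPC'].where(signal, 0)).cumprod()   # total cumulative return
    benchmark = (1 + test['^GSPC']).cumprod()
    print(strategy.iloc[-1], benchmark.iloc[-1])
```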

Now I can apply this method to any symbol

=D

The blue line is the SP500.

Additional variables

#econometrics

State Anomalies Using Isolation Forest

https://imgur.com/gallery/UmQxgR5

I’ve found that running an isolation forest to derive anomaly scores first, then fitting a decision tree regression on those anomaly values (as opposed to using state labels in a classifier), is the way to go to find which variables contribute to anomalies.
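
A minimal sketch of that two-step approach (assuming a DataFrame `states` with one row per state and numeric columns; the hyperparameters are illustrative):

```python
# 1) IsolationForest for anomaly scores, 2) DecisionTreeRegressor on those scores
#    to surface which variables drive them.
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor, export_text

X = pd.DataFrame(StandardScaler().fit_transform(states),
                 columns=states.columns, index=states.index)

iso = IsolationForest(random_state=0).fit(X)
anomaly = pd.Series(-iso.score_samples(X), index=X.index)  # higher = more anomalous

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, anomaly)
print(export_text(tree, feature_names=list(X.columns)))    # splitting rules behind the scores
print(anomaly.sort_values(ascending=False).head())         # most anomalous states
```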

I’ve included rules (extracted with a tree_to_code function I found, minus those that weren’t specific to identifying anomalies) that identify the states from highest anomaly to lowest, as well as a full dendrogram using dtreeviz, another using plot_tree, and some of my original highlighting that compared ZCA-whitened variables with standard-scaled ones.

The colorful graph shows how states deviate in terms of z-scores (left) or ZCA z-scores (right). (I removed the regression outliers left over from an older analysis, since all of this is unsupervised.)

I’ve also included a histogram of the anomaly scores.

I’ve included the .describe() output, which gives the five-number summary of each variable (as well as of the anomaly scores, which isn’t really intended, but it’s in there).

Mississippi is the highest anomaly, based on two splits in the decision tree (white, income).

I also derived two types of distance from the mean using standardized variables: mdist (“Mahalanobis distance”) is the sum of positive z-scores converted to a cumulative distribution function, and ZCA is the same thing computed on ZCA-whitened scores. A state with a low score is closer to the national averages (see the comments for the averages).
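
Here is roughly how I’d express those two scores (a sketch; I’m reading “summed positive z-scores” as summed absolute z-scores and using a rank percentile as the CDF conversion, both of which are assumptions, with the same `states` DataFrame as above):

```python
import numpy as np
from scipy import stats

Z = stats.zscore(states.values)                        # standard-scaled variables

# ZCA whitening: decorrelate the standardized variables while staying in the original basis
vals, vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
Z_zca = Z @ (vecs @ np.diag(1.0 / np.sqrt(vals + 1e-8)) @ vecs.T)

mdist = stats.rankdata(np.abs(Z).sum(axis=1)) / len(Z)        # CDF-like percentile of summed |z|
zca   = stats.rankdata(np.abs(Z_zca).sum(axis=1)) / len(Z)    # same thing on the whitened scores
```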

The most average state based on mdist is Pennsylvania, next is Missouri; the most extreme is Mississippi. The second most extreme (in terms of both mdist and anomaly) is Maryland, probably because of how close it is to DC and all the benefits that carries with it.

My guess is that ZCA shows deviation from the norm while controlling for correlation. Looking at Maryland, it has a high birth fatality rate after controlling for other correlated variables (say, doctors).

Markowitz Profiles (Modern Portfolio Theory)

I’ve significantly improved on my Markowitz algorithm.

It’s a merge of two tutorials (though I’ve read three). Currently I derive weights based on the optimal Sharpe ratio. At the moment I don’t take any risk-free rate into consideration (i.e. it’s simply return/volatility).

I was trying two different solver methods. The minimize solver was taking a really long time, so I tried another tutorial that used cvxopt and a quadratic program (which I now use to draw the efficient frontier rapidly). I was unable to find the optimal value based on the Sharpe ratio using the cvxopt solver (it would only solve either the bottom or the topmost portion of the curve), so I looked at finquant, which uses a Monte Carlo of 5,000 portfolios to derive the optimal points (i.e. max return, min volatility, and max Sharpe ratio).

So I fell back on Monte Carlo to do it for me, i.e. the path of least resistance, and the margin of error is acceptable.
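
The Monte Carlo fallback is basically this (a sketch; `returns` is assumed to be a DataFrame of daily asset returns, with 5,000 simulations, no risk-free rate, and long-only weights):

```python
import numpy as np

mu, cov = returns.mean().values * 252, returns.cov().values * 252  # annualized
rng = np.random.default_rng(0)

best_sharpe, best_w = -np.inf, None
for _ in range(5000):
    w = rng.random(len(mu))
    w /= w.sum()                              # random long-only weights summing to 1
    ret, vol = w @ mu, np.sqrt(w @ cov @ w)
    if ret / vol > best_sharpe:               # Sharpe with no risk-free rate: return / volatility
        best_sharpe, best_w = ret / vol, w
```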

The important thing is that it’s very fast and straightforward (KISS), so I will use this to backtest with.

What I like about Markowitz is that it assumes past behavior is a good model for the future (t-tests on stocks are used to determine whether two partitions of returns are equal). It’s been said that yesterday’s price is the best predictor of tomorrow’s. Same for returns.
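
For the t-test aside, this is the kind of check I mean (a sketch using scipy’s ttest_ind on two halves of an assumed `daily_returns` series; the even split is just for illustration):

```python
from scipy import stats

half = len(daily_returns) // 2
t_stat, p_val = stats.ttest_ind(daily_returns[:half], daily_returns[half:], equal_var=False)
# a large p-value means we can't reject that the two partitions' mean returns are equal
```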

Code: https://github.com/thistleknot/Python-Stock/blob/master/Markowitz2.ipynb

Trying to land a job in Data Science

“Believe in yourself. Do what you like. Do not work or learn just because you want a job or you want to earn money. Learn because you want to grow.” – Kurtis Pykes

Thank you for sharing your journey, which not only depicted your low days but also highlighted how you overcame them and worked on yourself.
The best takeaway from the session for me was “Take the criticism as a way of learning and improving yourself rather than harsh feedback.” – Manpreet Budhraja

Disclaimer: this is just me trying to make sense of my position.

I’ve struggled with an inferiority complex, or what others might call imposter syndrome. I’m certainly not trying to fake it till I make it, nor do I think I’m simply being a dilettante about my passions.

Those concepts, though related, differ from each other. For example, I struggle to arrive where I want to be (a data science role) while at the same time being intimidated by some of the big-data processing capabilities of my peers. Ultimately I do want that good cheese money, but I don’t consider not having it as disqualifying me from laying claim to the title. A woman told me a long time ago, “you are what you say you are,” but I believe what Jung said more.

“You are not what you say you are, you are what you repeatedly do,” which follows from William Durant’s take on Aristotle:
“Excellence is not an act, but a habit.”

Take, for example, my housing projects. I didn’t need a license to be qualified; I simply needed to do them.

Add to that mix being a bit more than a touch sensitive to criticism. I’ve been working on that last part, though: trying to listen without judgement and merely accept the points of view as areas to grow into. Another idea of Aristotle’s: no man is an island; the polis is the teacher of man. Meaning you are not perfect and must learn from others.

I blame a set of factors for not being in the role I thought I’d be in, given my credentials:

Gatekeeping
Companies vet higher-level candidates more stringently; the bar is raised. Different companies have different expectations for what it means to be a data scientist. For example: math knowledge? Tools used? How advanced are you in them (SQL is a big one, but so are network data frameworks)? It’s kind of like class elitism, but with skills, though there is an element of class in it as well. Companies don’t expect, nor should they, that fresh college graduates have all of these. Basically, you may have trained at some formal level but still be evaluated as not good enough… often, and more so than for other types of roles.

Ageism
“How many years of experience would you say you have?” If the answer doesn’t match the career length on your resume, you will be evaluated as too old. Companies want to hire malleable fresh college graduates to mold and shape.

Competition
Kaggle projects have become the norm for evaluating candidates.
Higher-level pay means fewer candidates get past the gate.

Entry level requirements
Three years of paid experience is considered the minimum for entry-level positions (and data science is often treated as mid to senior level).

Trying to strive for some level of formal recognition is misguided, at least if you believe it will land you the position you want. With my master’s I presumed it would be an easy transfer to data science, but that has not been the case. In two years I’ve had five interviews for data science roles (that’s the gatekeeping). One I passed, but it was for correcting statistical scholarly publications, and I turned it down after some advice from a former professor. Companies will sell you on “titles” you strive to earn in order to feel qualified, which ultimately don’t get you where you want (sophists). What really matters is not buying into the hype and simply practicing the craft. I know many people who worked in software engineering at AT&T without formal education in software concepts. They were simply bright people who applied themselves. Which is the point: education can be beneficial, but it’s not a cure-all. What excellence really is, is a habit.

My advice is twofold.

First and foremost, you have to do you. You can’t be in a job you don’t love. Don’t be discouraged; always strive for more.
Second, be practical. You need three trade skills: one you strive for, one you are practical and efficient in, and one that is a fallback.
Mine happen to be Data Science, System Administration, and Housing Construction.
But they weren’t always those. At one time they were system admin/dev, PC repair, and dishwashing.
But as you progress and evolve, you pivot into higher tiers. My fallback isn’t so much a means to make money as a means to avoid becoming destitute (i.e. I can live in the houses I’ve built). But the point is: I didn’t wait for anyone to qualify me. I just did the work and eventually made something out of my experience and opportunities. You need to be ready for when the skill you’ve been striving for presents itself as an opportunity, because you’ve already made it a habit (whether or not you’ve received formal “qualified” training or recognition).