Github Link under principle-component-analysis/
This project is my first real look into quant research methods: performing principal component analysis (PCA) on a basket of stocks, then using a mean-reverting method combined with Z-scores to produce trading signals.
However, I’m still researching and learning about the statistical arbitrage side of this, in particular how to use the beta values from the regression between the principal components and my synthetic personal portfolio to produce trading signals. As I figure things out I’ll link the updates on this page.
Rough Workflow
First, we need to define a basket of stocks from which to build our eigen-portfolios. Eventually, these will be used to hedge the stock we want to buy. For this first implementation of the project I’ll just use a hardcoded basket of stocks: currently the 50 largest tickers in the S&P 500 by weight.
basket = [
'NVDA', 'AAPL', 'MSFT', 'AMZN', 'GOOGL', 'GOOG', 'META', 'TSLA', 'BRK-B', 'WMT',
'LLY', 'JPM', 'V', 'ORCL', 'XOM', 'JNJ', 'MA', 'NFLX', 'PLTR', 'ABBV',
'COST', 'BAC', 'AMD', 'HD', 'PG', 'GE', 'CSCO', 'CVX', 'KO', 'UNH',
'IBM', 'WFC', 'CAT', 'MS', 'AXP', 'MU', 'GS', 'MRK', 'CRM', 'TMUS',
'PM', 'APP', 'RTX', 'MCD', 'ABT', 'TMO', 'AMAT', 'ISRG', 'PEP', 'LRCX'
]
1. Data Preparation and PCA
We begin by retrieving the adjusted close price for each stock and converting this into a matrix of day-on-day returns, which is then centred and standardised so that each stock’s returns have a mean of 0 and a standard deviation of 1. This matrix is defined as follows:
$$R \in \mathbb{R}^{T \times N}$$
- $T$: Number of days (time rows)
- $N$: Number of stocks (asset columns)
We then perform PCA on $R$ to obtain the eigenvalues and eigenvectors (eigen-portfolios).
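The preparation and PCA step above can be sketched roughly as follows. This is a minimal sketch, not the project's actual code: the prices are simulated (a hypothetical stand-in for downloaded adjusted closes) so that it runs offline, and PCA is done by eigen-decomposing the correlation matrix with NumPy, which is one of several equivalent ways to do it.

```python
import numpy as np

# Hypothetical stand-in for downloaded adjusted close prices:
# simulated prices with one shared "market" shock plus per-stock noise.
rng = np.random.default_rng(0)
T, N = 500, 10                                    # days, stocks (illustrative sizes)
market = rng.normal(0.0, 0.01, size=(T, 1))       # shock common to every stock
noise = rng.normal(0.0, 0.01, size=(T, N))        # idiosyncratic shocks
log_returns = market + noise
prices = 100.0 * np.exp(
    np.vstack([np.zeros((1, N)), np.cumsum(log_returns, axis=0)])
)                                                 # shape (T + 1, N)

# Day-on-day simple returns: shape (T, N).
returns = prices[1:] / prices[:-1] - 1.0

# Centre and standardise each column to mean 0 and standard deviation 1.
R = (returns - returns.mean(axis=0)) / returns.std(axis=0, ddof=1)

# With standardised columns, the covariance of R is the correlation matrix.
corr = (R.T @ R) / (T - 1)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending
# order, so reorder to put the most significant component first.
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]             # column i is the i-th eigen-portfolio

print("largest eigenvalues:", np.round(eigenvalues[:3], 3))
```

Because the data is standardised, the eigenvalues sum to the number of stocks, which is a handy sanity check on real data too.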
This is the point in the program where we need to decide how many principal components ($k$) to use, which is a relatively complex optimisation problem. If we use all $N$ principal components (one per stock), we explain all of the variance in returns, but we also pick up idiosyncratic noise (small per-stock factors), so we won’t be able to find a large enough spread on the residuals to trade. On the flip side, if we only use 1 principal component, we will likely just be mapping the market beta, and as a result we won’t be able to find any opportunities for an arbitrage strategy.
So currently my program uses $k = 15$ (the 15 most significant principal components); however, I’m looking into methods to optimise this. Some examples are:
1. Scree Plot (Elbow Method)
We plot the index of each principal component ($i$) on the X-axis against the amount of variance it individually explains (its eigenvalue, $\lambda_i$) on the Y-axis. Then we look for where the curve starts to flatten out (i.e. the ‘elbow’), which is the point where adding another principal component only explains a minimal amount of extra variance. Components beyond this point are mostly explaining idiosyncratic noise in the financial data rather than market factors, so we cut off there.
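As a rough illustration, the numbers behind a scree plot can be computed like this. The eigenvalues here are made up for the example (not results from the basket), and the bars are printed as text so the sketch has no plotting dependency; the same two arrays could be handed to any plotting library.

```python
import numpy as np

# Hypothetical eigenvalues, sorted from largest to smallest (illustrative only).
eigenvalues = np.array([5.0, 2.0, 1.0, 0.6, 0.5, 0.45, 0.45])

# Y-axis of the scree plot: share of total variance explained by each component.
explained = eigenvalues / eigenvalues.sum()

# X-axis of the scree plot: component index 1, 2, ..., N.
component_index = np.arange(1, len(eigenvalues) + 1)

# Text version of the scree plot: one bar per component.
for i, share in zip(component_index, explained):
    print(f"PC{i}: {share:6.1%} " + "#" * int(round(share * 50)))
```

Reading the elbow is still a judgement call; in this made-up series the bars stop shrinking noticeably after the third or fourth component.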
2. Cumulative Variance Threshold
This method first asks how much of the variance of the basket’s returns matrix ($R$) we want to explain. The issue with this approach is the risk of overfitting to the financial data, because in statistical arbitrage any profit is generated from the residuals ($\epsilon$):
- So if we have 99% of variance explained by PCA then our residual is only 1%, which is too small of a spread to be able to properly trade
- Whereas if 50% of variance is explained by PCA then our residual is going to sit at 50%, which is a big enough spread to trade on.
The other thing to note, though, is that we don’t want to explain too little variance with PCA, because otherwise we begin to introduce higher levels of risk into our strategy. Ideally we want to find the sweet spot of being structured for safety but having enough noise to be profitable. This approach is relatively easy to implement in code, because we just pick the smallest $k$ that satisfies:
$$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{N} \lambda_i} \geq \text{target fraction of variance}$$
- $k$: the number of principal components
- $N$: total number of stocks in our basket
- $\lambda_i$: the eigenvalues generated from PCA (the variance explained by each principal component), sorted from largest to smallest
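A minimal sketch of the cumulative variance rule, assuming the eigenvalues are already available. The function name `components_for_threshold` and the example eigenvalues are hypothetical, chosen only for illustration.

```python
import numpy as np

def components_for_threshold(eigenvalues, threshold):
    """Smallest k whose cumulative explained-variance ratio reaches `threshold`."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # largest first
    cumulative = np.cumsum(lam) / lam.sum()
    # searchsorted gives the first index where cumulative >= threshold;
    # add 1 to turn a zero-based index into a component count.
    k = int(np.searchsorted(cumulative, threshold, side="left")) + 1
    return min(k, len(lam))   # guard against rounding at threshold = 1.0

# Hypothetical eigenvalues (illustrative only).
eigenvalues = [5.0, 2.0, 1.0, 0.6, 0.5, 0.45, 0.45]
k = components_for_threshold(eigenvalues, 0.75)
print("components needed for 75% of variance:", k)
```

The threshold itself is the real design decision here: a higher value leaves a smaller residual to trade on, as described above.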
3. Marchenko-Pastur Distribution Theory (Random Matrix Theory)
This is the most complex of the 3 methods listed, but in reality it’s a fairly simple concept. If an eigenvalue ($\lambda_i$) is below some upper threshold ($\lambda_{+}$), then we know from Marchenko-Pastur distribution theory that a similar value could also be generated by PCA performed on random noise, so we discard it as it doesn’t represent a significant market factor. This upper threshold for eigenvalues ($\lambda_{+}$) can be calculated with the following formula:
$$\lambda_{+} = \sigma^2 \left(1 + \sqrt{\frac{N}{T}}\right)^2$$
- $\sigma^2$: the variance of the residuals (the noise); however, as we standardised the data this is just 1
- $N$ is the number of stocks in our basket
- $T$ is the number of days of data we’re using
Then, once we have our upper bound for the eigenvalues ($\lambda_{+}$), any eigenvalue smaller than $\lambda_{+}$ is discarded, as it likely represents idiosyncratic noise rather than a significant market factor.
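The Marchenko-Pastur cut-off can be sketched as below, using the standard upper-edge formula (noise variance times the square of one plus the square root of stocks divided by days). The function name and the sizes are assumptions for illustration; the second half runs the rule on pure random noise as a sanity check rather than on market data.

```python
import numpy as np

def marchenko_pastur_upper(n_stocks, n_days, sigma2=1.0):
    """Upper edge of the Marchenko-Pastur distribution for an n_days x n_stocks matrix."""
    q = n_stocks / n_days
    return sigma2 * (1.0 + np.sqrt(q)) ** 2

N, T = 50, 500                      # illustrative sizes: 50 stocks, 500 days
lambda_plus = marchenko_pastur_upper(N, T)
print("upper threshold:", round(lambda_plus, 4))

# Sanity check on pure noise: standardise a random matrix and count how many
# correlation-matrix eigenvalues land above the threshold.
rng = np.random.default_rng(1)
noise = rng.standard_normal((T, N))
noise = (noise - noise.mean(axis=0)) / noise.std(axis=0, ddof=1)
noise_eigenvalues = np.linalg.eigvalsh(noise.T @ noise / (T - 1))
n_significant = int(np.sum(noise_eigenvalues > lambda_plus))
print("noise eigenvalues above threshold:", n_significant)
```

On finite samples the largest noise eigenvalue can sit slightly above the theoretical edge, so the count on noise should be small but is not guaranteed to be exactly zero.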
2. Constructing Eigen-Portfolio Returns
Once we have our selected eigenvectors (let’s call this matrix $W$, with one eigenvector per row, so it has dimensions $k \times N$), we project our original returns onto these vectors to see how the “hidden” factors performed over time.
We perform a matrix multiplication of our standardised returns ($R$) against the transposed eigen-portfolios ($W^\top$).
Mathematically:
$$F = R \, W^\top$$
The resulting matrix $F$ has dimensions $T \times k$ (days by components), where each column represents the daily returns of a specific eigen-portfolio.
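A minimal sketch of this projection, assuming the selected eigenvectors are stored one per row (so the matrix is transposed before multiplying, as described above). The returns here are random placeholders and the sizes are arbitrary, so only the shapes and the mechanics are meaningful.

```python
import numpy as np

rng = np.random.default_rng(2)
T, N, k = 250, 8, 3                 # days, stocks, components kept (illustrative)

# Hypothetical standardised returns matrix, shape (T, N).
R = rng.standard_normal((T, N))
R = (R - R.mean(axis=0)) / R.std(axis=0, ddof=1)

# PCA via the correlation matrix; keep the k largest components.
eigenvalues, eigenvectors = np.linalg.eigh(R.T @ R / (T - 1))
order = np.argsort(eigenvalues)[::-1]
W = eigenvectors[:, order[:k]].T    # shape (k, N): one eigen-portfolio per row

# Project returns onto the eigen-portfolios: (T, N) @ (N, k) -> (T, k).
F = R @ W.T

print("eigen-portfolio returns shape:", F.shape)
```

Each column of `F` is one eigen-portfolio's daily return series, and the columns are uncorrelated with each other by construction.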
3. Statistical Arbitrage Regression
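Now we perform the regression to find the trading signal. We select a target stock from our basket (let’s call its returns $y$) and use the eigen-portfolio returns ($F$) as our independent variables.
We solve for the coefficients ($\beta_j$, one per eigen-portfolio) in the following structure, where $t$ is the day and $k$ is the number of eigen-portfolios:
$$y_t = \alpha + \sum_{j=1}^{k} \beta_j F_{t,j} + \epsilon_t$$
- $y_t$: The actual returns of the stock we want to trade.
- $\sum_{j=1}^{k} \beta_j F_{t,j}$: The “synthetic” version of the stock constructed using our principal components.
- $\epsilon_t$: The residual, which is the difference between the actual stock and its theoretical PCA value.
- $\alpha$: The intercept, i.e. any return we’ve made ‘beating’ the market
We can then rearrange this equation to find the residual ($\epsilon_t$):
$$\epsilon_t = y_t - \alpha - \sum_{j=1}^{k} \beta_j F_{t,j}$$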
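The regression and the residual can be sketched with ordinary least squares as below. This is an assumption about the fitting method (the write-up does not say which solver is used), and both the eigen-portfolio returns and the target stock's returns are simulated placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
T, k = 250, 3                        # days, eigen-portfolios (illustrative)

# Hypothetical eigen-portfolio returns and a target stock built from them.
F = rng.standard_normal((T, k))
true_beta = np.array([0.8, -0.3, 0.1])
y = 0.001 + F @ true_beta + rng.normal(0.0, 0.5, size=T)

# Design matrix: a column of ones (for alpha) followed by the eigen-portfolios.
X = np.column_stack([np.ones(T), F])

# Ordinary least squares fit of y on [1, F].
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha, beta = coef[0], coef[1:]

# Synthetic stock and residual: residual = actual - alpha - synthetic.
synthetic = F @ beta
residual = y - alpha - synthetic

print("alpha:", round(float(alpha), 4), "betas:", np.round(beta, 3))
```

With an intercept in the regression, the in-sample residual averages to zero and is uncorrelated with every eigen-portfolio, so whatever is left is specific to the target stock.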
From here we take the residual ($\epsilon$) and calculate its Z-score to identify whether the spread is wide enough for us to enter a trade, which then also leads into how we would hedge the position once it’s been identified. As I’m still working on that, I’ll post an update when the code for it is done.
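Until that update, here is a minimal sketch of the Z-score calculation only, with no entry or hedging rules. The function name `latest_zscore` and the 60-day lookback window are assumptions made for illustration, not values taken from the project.

```python
import numpy as np

def latest_zscore(residual, window=60):
    """Z-score of the most recent residual against the last `window` values."""
    recent = np.asarray(residual, dtype=float)[-window:]
    std = recent.std(ddof=1)
    if std == 0:
        return 0.0                   # flat series: no meaningful spread
    return float((recent[-1] - recent.mean()) / std)

# Hypothetical residual series: small oscillation, then one large deviation.
residual = np.r_[np.tile([1.0, -1.0], 30), 4.0]
z = latest_zscore(residual)
print("latest z-score:", round(z, 2))
```

A large absolute Z-score means the latest residual is far from its recent average; where to set the entry level is left open here.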