Recently I wrote a couple of posts about eigenvectors and eigenvalues. I thought it would be cool to go from that slightly more theoretical material and show something super useful in which eigenvectors / eigenvalues are an integral part. So today, I’m going to talk about Principal Component Analysis.

This algorithm is used everywhere, notably in neuroscience and computer graphics. The idea is that you have a dataset with a ton of features and you want to reduce it to its core components. With high dimensionality, not only is the curse of dimensionality a problem, but you also just can’t visualize the data, which prevents a lot of basic insights. So we want to reduce the dimensionality without losing vital information. This is where PCA comes in. With PCA, we go from a large number of features to a small number of components, which still capture a sufficient proportion of the information.

Before we talk about the algorithm itself, there are a few important math concepts which you must be familiar with in order to proceed. The first is eigenvalues and eigenvectors. You can read about them here. The second is the covariance matrix.

### Covariance Matrix

Covariance is the measure of how two different variables change together. The covariance between two variables, X and Y, can be given by the following formula.

$$cov(X, Y) = \frac{\sum\limits_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$$
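As a quick sanity check of that formula (using two made-up vectors), the manual calculation matches R’s built-in `cov` function:

```r
# two small made-up vectors, just to check the formula
X <- c(2, 4, 6, 8)
Y <- c(1, 3, 2, 5)
n <- length(X)

# the formula: sum of products of deviations, divided by n - 1
manual <- sum((X - mean(X)) * (Y - mean(Y))) / (n - 1)
manual    # 3.666667
cov(X, Y) # R's built-in function agrees
```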

Now, if we wanted to look at all the possible covariances in a dataset, we can compute the covariance matrix, which has this form –

$$C = \left( \begin{array}{ccc} cov(x,x) & cov(x,y) & cov(x,z) \\ cov(y,x) & cov(y,y) & cov(y,z) \\ cov(z,x) & cov(z,y) & cov(z,z) \end{array} \right)$$

Notice that this matrix will be symmetric \((A = A^T)\), and will have a diagonal of just variances, because \(cov(x, x)\) is the same thing as the variance of x. If you understand the covariance matrix and eigenvalues/vectors, you’re ready to learn about PCA.
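You can verify both of those properties on a tiny made-up data frame:

```r
# a tiny made-up dataset with three features
M <- data.frame(x = c(1, 2, 3, 4),
                y = c(2, 1, 4, 3),
                z = c(1, 1, 2, 3))
C <- cov(M)

isSymmetric(C)                  # TRUE - the matrix equals its transpose
diag(C)                         # the diagonal entries...
c(var(M$x), var(M$y), var(M$z)) # ...are just the variances
```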

### Principal Component Analysis

Here is the central idea of PCA – **the components are the eigenvectors of the covariance matrix.** Moreover, the amount of variance each component captures can be represented by the magnitude of its corresponding eigenvalue. Read that multiple times.

So basically, PCA boils down to 4 pretty simple steps.

1. **Normalize the data.** We’re dealing with covariance, so it’s a good idea to have features on the same scale.

2. **Calculate the covariance matrix.**

3. **Find the eigenvectors of the covariance matrix.**

4. **Translate the data to be in terms of the components**. This involves just a simple matrix multiplication.

One more important thing to mention about this algorithm: we are going from *features* to *components*. The transformed dataset won’t have a literal meaning with respect to each component. It’s very easy to digest what a *feature* is—shoe size, weight, etc.—but a component we can only talk about from a mathematical perspective. If you want to gain an intuitive understanding of a data point, you’ll have to reference the original dataset.

### Example

Now I’ll show you how to do PCA manually, and then using R’s built-in function. In practice you’ll usually use the built-in function, but doing it manually is a great way to learn what’s happening.

I found some survey data from a large financial organization. The dataset is called attitude, and is accessible within R.

```r
library(MASS)        # install.packages("MASS") if you have to
attitude <- attitude # put the data in our workspace in R
```

Ok, now let’s do those 4 steps.

1. Normalize the data. This means standardizing each feature: just like you would calculate a z-score, subtract the mean and divide by the standard deviation, applied to the entire feature vector.

```r
attach(attitude) # to save me having to type 'attitude' 20 times

attitude$rating     <- (rating - mean(rating)) / sd(rating)
attitude$complaints <- (complaints - mean(complaints)) / sd(complaints)
attitude$privileges <- (privileges - mean(privileges)) / sd(privileges)
attitude$learning   <- (learning - mean(learning)) / sd(learning)
attitude$raises     <- (raises - mean(raises)) / sd(raises)
attitude$critical   <- (critical - mean(critical)) / sd(critical)
attitude$advance    <- (advance - mean(advance)) / sd(advance)

# re-attach so it calls the updated features
attach(attitude)

summary(attitude) # means are all 0
sd(privileges)    # and sd's are all 1 (you can check them all if you like)
```

2. Get the covariance matrix

```r
# this is actually really simple.. thanks R :)
cov(attitude)

# Quiz question - Why is the diagonal all 1's?
# Because we normalized each feature to have a variance of 1!
```

3. The Principal Components

Remember, the principal components are the eigenvectors of the covariance matrix.

```r
x <- eigen(cov(attitude))
x$vectors

# just out of curiosity, I'm going to check that it did this right
# (%*% is matrix multiplication in R)
cov(attitude) %*% x$vectors[, 1] # gives the same values as...
x$values[1] * x$vectors[, 1]
```

4. Putting the data in terms of the components

We do this by matrix-multiplying the transpose of the matrix of eigenvectors by the transpose of the matrix containing the data. Why transpose? Theoretical reasons aside, the dimensions have to line up.

```r
A <- x$vectors[, 1:3]
B <- data.matrix(attitude) # because we can't do matrix multiplication with data frames!

# now we arrive at the new data by the above formula!
newData <- t(A) %*% t(B)
```

And then just a hint of data cleaning so we can have a nice data frame to work with and run algorithms on.

```r
# note - in the newData matrix, each row is a component, and each
# column is a data point. Let's change that
newData <- t(newData)
newData <- data.frame(newData)
names(newData) <- c("comp1", "comp2", "comp3")
```

So, we successfully reduced a dataset of 7 features down to 3, maintaining a large amount of the information. But why did I choose 3? How do you know how many components to take?

Recall that the magnitude of the eigenvalue (corresponding to each eigenvector) tells us how much variance is represented by each component. Knowing this, we can graph the proportion of variance captured in each component, and decide how much is sufficient. I would advise a maximum of 3, because after that you can’t visualize them all, and that’s one of the reasons we did this in the first place!
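Here’s one way to compute those proportions. This block reproduces the normalization and eigendecomposition from the steps above so it runs on its own (note that scale() does the same z-scoring as the manual version in step 1):

```r
# reproduce steps 1 and 3: normalize, then eigendecompose the covariance matrix
attitude <- data.frame(scale(datasets::attitude))
x <- eigen(cov(attitude))

# each eigenvalue's share of the total variance
props <- x$values / sum(x$values)
round(props, 2)
round(cumsum(props), 2) # cumulative - decide where you're comfortable cutting off

barplot(props, names.arg = paste0("PC", 1:length(props)))
```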

The first two components capture 70% of the variance, and the third captures another 11%. I’m definitely comfortable reducing down to 3 dimensions while maintaining 81% of the information.

And our data looks like this, with respect to the components. At this point, I would start clustering.
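As a hedged sketch of that next step, here’s what k-means on the three components could look like. It rebuilds newData from scratch so it runs standalone, and k = 3 is an arbitrary choice for illustration:

```r
# rebuild the 3-component dataset (steps 1-4 from above, condensed)
attitude <- data.frame(scale(datasets::attitude))
x <- eigen(cov(attitude))
newData <- data.frame(t(t(x$vectors[, 1:3]) %*% t(data.matrix(attitude))))

# cluster the 30 observations in component space
set.seed(42) # k-means starts from random centers, so fix the seed
clusters <- kmeans(newData, centers = 3)
table(clusters$cluster) # how many observations fell in each cluster
```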

### PCA Using R’s Built-in Function

R has a built-in ‘prcomp’ function for PCA. It’s really straightforward.

```r
# Extra Step - How To Do PCA Automatically in R (prcomp)
# Don't forget to scale the data first, or set scale. = TRUE

prcomp(attitude)
# does it look like the eigenvectors from before? It should!
# Note - if some vector v is an eigenvector, then c*v is also an
# eigenvector, for any scalar c (my point is, it's cool if the signs are switched)

plot(prcomp(attitude)) # similar to my plot above with the blue bars
summary(prcomp(attitude))
biplot(prcomp(attitude)) # This one is pretty cool...

# At the very least, this affirms that what we did above is right!
```

### Final Note: Eigenfaces

There’s this super cool application of PCA in computer vision called Eigenfaces. It basically uses these ideas to do face recognition. It makes me laugh quite a bit to think about the orthogonal components of someone’s face hahaha.