Recently I wrote a couple of posts about eigenvectors and eigenvalues. I thought it would be cool to go from that slightly more theoretical material to something super useful in which eigenvectors and eigenvalues are an integral part. So today, I'm going to talk about Principal Component Analysis (PCA). This algorithm is used everywhere, notably in neuroscience and computer graphics.

The idea is that you have a dataset with a ton of features and you want to reduce it to its core components. With high dimensionality, not only is the curse of dimensionality a problem, but you also just can't visualize the data, which prevents a lot of basic insights. So we want to reduce the dimensionality without losing vital information. This is where PCA comes in: we go from a large number of features to a small number of components that still contain a sufficient proportion of the information.

Before we talk about the algorithm itself, there are two math concepts you must be familiar with in order to proceed. The first is eigenvalues and eigenvectors. You can read about them here. The second is the covariance matrix.

### Covariance Matrix

Covariance is a measure of how two variables change together. The covariance between two variables, X and Y, is given by the following formula:

$$cov(X, Y) = \frac{\sum\limits_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$$

Now, if we want to look at all the possible covariances in a dataset, we can compute the covariance matrix, which has this form:

$$C = \left( \begin{array}{ccc} cov(x,x) & cov(x,y) & cov(x,z) \\ cov(y,x) & cov(y,y) & cov(y,z) \\ cov(z,x) & cov(z,y) & cov(z,z) \end{array} \right)$$

Notice that this matrix is symmetric \((C = C^T)\), and its diagonal contains just variances, because \(cov(x, x)\) is the same thing as the variance of x. If you understand the covariance matrix and eigenvalues/eigenvectors, you're ready to learn about PCA.
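As a quick sketch (my own illustration in Python/NumPy, not code from the post), here is how you might compute a covariance matrix and verify the two properties just mentioned; the variable names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # 100 observations of 3 features (x, y, z)

# np.cov treats rows as variables by default, so pass rowvar=False;
# it divides by n-1, matching the formula above
C = np.cov(X, rowvar=False)

print(C.shape)                                          # (3, 3)
print(np.allclose(C, C.T))                              # True: symmetric
print(np.allclose(np.diag(C), X.var(axis=0, ddof=1)))   # True: diagonal = variances
```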
### Principal Component Analysis

Here is the central idea of PCA: the components are the eigenvectors of the covariance matrix. Moreover, the amount of variance each component captures is given by the magnitude of its corresponding eigenvalue. Read that multiple times. So basically, PCA can be broken into 4 pretty simple steps.

1. Normalize the data. We're dealing with covariance, so it's a good idea to have features on the same scale.
2. Calculate the covariance matrix.
3. Find the eigenvectors of the covariance matrix.
4. Translate the data to…
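The four steps can be sketched in a few lines of NumPy. This is my own minimal illustration, not the post's code; the function and variable names are mine:

```python
import numpy as np

def pca(X, n_components):
    # 1. Normalize: center each feature and scale to unit variance
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Calculate the covariance matrix of the features
    C = np.cov(Xc, rowvar=False)
    # 3. Find eigenvalues/eigenvectors; eigh is made for symmetric matrices
    eigvals, eigvecs = np.linalg.eigh(C)
    # Sort largest-first: the biggest eigenvalue marks the most variance
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 4. Project the data onto the top components
    return Xc @ eigvecs[:, :n_components], eigvals

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
Z, eigvals = pca(X, 2)
print(Z.shape)  # (200, 2): 5 features reduced to 2 components
```

A nice sanity check: the variance of each projected column equals its eigenvalue, which is exactly the "eigenvalue = variance captured" claim above.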

## Bayes Theorem and Naive Bayes

One thing I love to do to practice machine learning is to code an algorithm myself. I mean without external packages, just trying to implement all the math and everything else myself. Naive Bayes is a great algorithm for this, because it's conceptually easy, and it gets pretty decent results. But before we can talk about Naive Bayes, we have to talk about conditional probability.

### Bayes Theorem

To understand most statistics, we need to understand conditional probability. Bayes theorem states that

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

where \(P(A|B)\) is the probability that A is true, given that B is true. Much analysis these days is Bayesian, as it is a way to inject practical subjectivity into a model. For example, what is the probability someone will have heart problems? Well, if we take into account whether or not the person smokes, drinks, exercises and eats healthy foods, we will get a better estimate. This is just a brief explanation of Bayes Theorem, and if this is your first time seeing it, you can learn more here.

### Naive Bayes

Naive Bayes is an algorithm that uses basic probability in order to classify. Let's define a few probabilities in the context of an example so it's clearer. In this example we are classifying students as accepted or rejected to some school (the data is from UCLA!).

The prior – the probability of the class. In this case, if A = accepted, then

$$P(A) = \frac{N_A}{N}$$

where \(N_A\) = number of accepted students in the data, and N = number of total students in the data.

The likelihood – this is a conditional probability. If we assume some student (some vector x) is accepted, then we can calculate the probability of him having a GPA as low or as high as he did, given he was accepted. If we are calculating this for real values, we use a Normal distribution; for binary values, a Bernoulli distribution; and for categorical values, a multinoulli. We can use different distributions for different variables within our model.
The posterior – the probability of the prior and the likelihood, normalized. In math, this is the prior times the likelihood, divided by the probability of the vector x. When we're writing a Naive Bayes algorithm, however, we don't care about the normalizing, i.e. the dividing by the probability of vector x. This is because it is constant and has no impact on our classification. So the denominator in both of…
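To make the prior/likelihood/posterior pieces concrete, here is a from-scratch Gaussian Naive Bayes sketch on toy data (my own illustration, not the post's code or the UCLA dataset); it scores each class by prior times likelihood and skips the constant denominator, exactly as described:

```python
import numpy as np

def fit(X, y):
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),     # N_c / N
            "mean": Xc.mean(axis=0),
            "var": Xc.var(axis=0) + 1e-9,  # tiny epsilon for numerical stability
        }
    return params

def predict(params, x):
    best_class, best_score = None, -np.inf
    for c, p in params.items():
        # log prior + sum of log Normal densities; logs avoid underflow
        log_likelihood = -0.5 * np.sum(
            np.log(2 * np.pi * p["var"]) + (x - p["mean"]) ** 2 / p["var"]
        )
        score = np.log(p["prior"]) + log_likelihood
        if score > best_score:  # no normalizing: the denominator is constant
            best_class, best_score = c, score
    return best_class

# Toy data: class 0 centered at (0, 0), class 1 centered at (5, 5)
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
params = fit(X, y)
print(predict(params, np.array([4.8, 5.2])))  # 1
```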

## Guide to Linear Regression

Linear regression is one of the first things you should try if you're modeling a linear relationship (actually, non-linear relationships too!). It's fairly simple, and probably the first thing to learn when tackling machine learning.

At first, linear regression shows up just as a simple equation for a line. In machine learning, the weights are usually represented by a vector θ (in statistics they're often represented by A and B!).

$$\hat{Y} = \theta_0 + \theta_1 X_1$$

But then we have to account for more than just one input variable. A more general equation for linear regression goes as follows: we multiply each input feature \(X_i\) by its corresponding weight in the weight vector θ. This is equivalent to θ transpose times the input vector X.

$$\hat{Y} = h_{\theta}(x) = \sum\limits_{i = 1}^{d} \theta_i X_i = \theta^T X$$

There are two main ways to train a linear regression model. You can use the normal equation (in which you set the derivative of the negative log likelihood (NLL) to 0), or gradient descent. Sorry for switching notation below. Note: the matrices are \(i \times j\), where \(i\) indexes rows, or training examples.

### Gradient Descent

The cost function is essentially the sum of the squared distances, where the "distance" is the vertical distance between the predicted y and the observed y. This is known as the residual. Gradient descent minimizes this cost by stepping down the cost function to the (hopefully) global minimum. Here is the cost function:

$$J(\theta) = \frac{1}{2} \sum\limits_{i = 1}^{m} (h_{\theta}(x^{(i)}) \ - \ y^{(i)})^2$$

The cost is the residual sum of squares. The \(\frac{1}{2}\) is just a constant to make the derivative prettier. You could put 1000 as a multiple of the cost function; it doesn't change the process of minimizing it. Sometimes you'll see \(m\) (the number of training examples) in the denominator out front too. It would then be present in the derivative as well, because it's a constant.
This just makes the cost 'per training example', which is perfectly valid.

And the gradient descent algorithm:

$$\Large{\theta_{j+1} := \theta_j + \alpha \sum\limits_{i=1}^{m} (y^{(i)} \ - \ h_{\theta}(x^{(i)}))x_{j}^{(i)}}$$

This is really the current weight minus alpha times the partial derivative of the cost function. Or, in math:

$$\Large{\frac{\partial}{\partial \theta_{j}} J(\theta) = (h_{\theta}(x)\ - \ y)x_j}$$

$$\Large{\theta_{j+1} := \theta_j \ - \ \alpha \frac{\partial}{\partial \theta_{j}} J(\theta)}$$

The original equation switches the position of \(h(x)\) and \(y\) to pull out a negative. This makes the…
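The update rule above can be sketched in a few lines of NumPy. This is my own illustration (not code from the post), using the 'per training example' scaling by \(1/m\) mentioned above, with a toy dataset generated from known weights:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=2000):
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        residual = y - X @ theta  # (y - h_theta(x)) for every training example
        # the update rule above, with the sum over examples scaled by 1/m
        theta = theta + alpha * (X.T @ residual) / m
    return theta

# Try to recover y = 2 + 3x from lightly noised data
rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 100)
X = np.column_stack([np.ones(100), x])  # column of ones gives the intercept theta_0
y = 2 + 3 * x + rng.normal(0, 0.01, 100)

theta = gradient_descent(X, y)
print(np.round(theta, 1))  # approximately [2. 3.]
```

Note the prepended column of ones: it lets the intercept \(\theta_0\) fall out of the same matrix form \(\theta^T X\) used earlier.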

## Basic Data Exploration in R

When you're cleaning up data, you usually end up using 5-8 functions a ton of times, and then a few more once or twice. Here are those 5-8 functions I find myself using again and again. A quick overview:

- names() – returns the column names of a dataset
- str() – gives an overview of a dataset's structure
- data.table package – includes functions for creating new columns, among other things
- %in% operator – checks if a value is in a vector

Below are some examples.

    names(rock)  # returns the column names
    [1] "area"  "peri"  "shape" "perm"

    str(rock)  # gives the format of the dataframe
    'data.frame':   48 obs. of  4 variables:
     $ area : int  4990 7002 7558 7352 7943 7979 9333 8209 8393 6425 ...
     $ peri : num  2792 3893 3931 3869 3949 ...
     $ shape: num  0.0903 0.1486 0.1833 0.1171 0.1224 ...
     $ perm : num  6.3 6.3 6.3 6.3 17.1 17.1 17.1 17.1 119 119 ...

    # import the data.table package
    install.packages("data.table")  # don't forget these 3 steps!
    library(data.table)
    dtRock <- data.table(rock)

    dtRock[1:5]  # returns the first 5 rows
       area    peri     shape perm
    1: 4990 2791.90 0.0903296  6.3
    2: 7002 3892.60 0.1486220  6.3
    3: 7558 3930.66 0.1833120  6.3
    4: 7352 3869.32 0.1170630  6.3
    5: 7943 3948.54 0.1224170 17.1

    # and my favorite way to create a new column
    dtRock[, areaMP := area / 1000]  # area is measured in pixels, so areaMP is in mega pixels

    dtRock[1, ]  # the first row, all columns
       area   peri     shape perm areaMP
    1: 4990 2791.9 0.0903296  6.3   4.99

    dtRock[, 'areaMP']  # returns the entire 'areaMP' column

    # The %in% operator is one of the most useful functions in R, I think.
    a <- c(1, 2, 3, 4)
    4 %in% a  # is the value 4 in the vector a?
    [1] TRUE

There are many other functions and packages, such as the ‘dplyr’ package by the amazing Hadley Wickham, but I am just showing the ones I use most frequently.
