In the past week, I really enjoyed this paper. It's about boosting, which arose as an answer to the question: can a set of weak learners create a strong learner? The solution is pretty fascinating. You train a series of weak learners over a number of rounds, picking the best one at the end of each round and increasing the weights of the data points that round's winner misclassified. You then combine all the weak learners, weighted by their performance, into a single strong learner: a linear combination of their predictions. I enjoyed it so much that I wrote out the algorithm in Python and really thought about the problems Rojas poses at the end of the paper. I was going to write a post about it, but I came to the conclusion that I can't do a better job describing the topic than Raúl Rojas did. Here is the link again, if you didn't get it.
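For a rough sense of the loop described above, here is a minimal AdaBoost-style sketch with decision stumps. This is my own illustration, not the paper's exact pseudocode: the stump representation and candidate set are simplifications.

```python
import math

def stump_predict(stump, x):
    # A decision stump: threshold on one feature, with a direction (sign).
    feature, threshold, sign = stump
    return sign if x[feature] > threshold else -sign

def train_adaboost(X, y, rounds):
    """Minimal AdaBoost sketch. y must be +1/-1 labels."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform weights
    ensemble = []                          # list of (alpha, stump) pairs
    # candidate stumps: every (feature, threshold, sign) seen in the data
    stumps = [(f, x[f], s) for x in X for f in range(len(X[0])) for s in (1, -1)]
    for _ in range(rounds):
        # pick the stump with the lowest weighted error this round
        best, best_err = None, float("inf")
        for stump in stumps:
            err = sum(wi for wi, xi, yi in zip(w, X, y)
                      if stump_predict(stump, xi) != yi)
            if err < best_err:
                best, best_err = stump, err
        best_err = max(best_err, 1e-10)    # avoid log(0) on a perfect stump
        alpha = 0.5 * math.log((1 - best_err) / best_err)
        ensemble.append((alpha, best))
        # upweight the points the round's winner misclassified, then renormalize
        w = [wi * math.exp(-alpha * yi * stump_predict(best, xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    # the strong learner: a weighted vote of the weak learners
    score = sum(a * stump_predict(s, x) for a, s in ensemble)
    return 1 if score > 0 else -1
```

On a trivially separable dataset one round already suffices; the point is only to show the reweight-and-combine structure.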

## Principal Component Analysis in 4 Steps

Recently I wrote a couple of posts about eigenvectors and eigenvalues. I thought it would be cool to go from that slightly more theoretical material to something super useful in which eigenvectors and eigenvalues are an integral part. So today, I'm going to talk about Principal Component Analysis. This algorithm is used everywhere, notably in neuroscience and computer graphics.

The idea is that you have a dataset with a ton of features and you want to reduce it to its core components. With high dimensionality, not only is the curse of dimensionality a problem, but you also just can't visualize the data, which prevents a lot of basic insights. So we want to reduce the dimensionality without losing vital information. This is where PCA comes in. In PCA, we go from a large number of features to a small number of components, which still contain a sufficient proportion of the information.

Before we talk about the algorithm itself, there are a few important math concepts you must be familiar with in order to proceed. The first is eigenvalues and eigenvectors. You can read about them here. The second is the covariance matrix.

### Covariance Matrix

Covariance is the measure of how two different variables change together. The covariance between two variables, X and Y, is given by the following formula.

$$cov(X, Y) = \frac{\sum\limits_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$$

Now, if we want to look at all the possible covariances in a dataset, we can compute the covariance matrix, which has this form:

$$C = \left( \begin{array}{ccc} cov(x,x) & cov(x,y) & cov(x,z) \\ cov(y,x) & cov(y,y) & cov(y,z) \\ cov(z,x) & cov(z,y) & cov(z,z) \end{array} \right)$$

Notice that this matrix is symmetric \((C = C^T)\), since \(cov(x, y) = cov(y, x)\), and that its diagonal holds plain variances, because \(cov(x, x)\) is the same thing as the variance of x. If you understand the covariance matrix and eigenvalues/eigenvectors, you're ready to learn about PCA.
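To make the formula concrete, here is a quick check in Python with made-up numbers (numpy's `np.cov` uses the same n-1 denominator as the formula above):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

# Covariance computed directly from the formula
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# The full covariance matrix: variances on the diagonal, cov(x, y) off-diagonal
C = np.cov(x, y)
```

`C[0, 1]` and `C[1, 0]` both equal `cov_xy`, which is the symmetry noted above.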
### Principal Component Analysis

Here is the central idea of PCA: the components are the eigenvectors of the covariance matrix. Moreover, the amount of variance each component captures is represented by the magnitude of its corresponding eigenvalue. Read that multiple times.

So basically, PCA can be represented in 4 pretty simple steps.

1. Normalize the data. We're dealing with covariance, so it's a good idea to have features on the same scale.
2. Calculate the covariance matrix.
3. Find the eigenvectors of the covariance matrix.
4. Translate the data to…
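The 4 steps can be sketched in a few lines of numpy. This assumes the truncated step 4 is projecting the data onto the top eigenvectors, which is the standard final step:

```python
import numpy as np

def pca(X, k):
    """Reduce X (n samples x d features) to k components."""
    # 1. Normalize: center each feature and put it on unit scale
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the features
    C = np.cov(Xn, rowvar=False)
    # 3. Eigenvectors of the covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]      # sort by variance captured
    top = eigvecs[:, order[:k]]
    # 4. Project the data onto the top k components
    return Xn @ top
```

`eigh` is used rather than `eig` because the covariance matrix is symmetric, which guarantees real eigenvalues.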

## Bayes Theorem and Naive Bayes

One thing I love to do to practice machine learning is code an algorithm myself. I mean without external packages, just trying to implement all the math and everything else myself. Naive Bayes is a great algorithm for this, because it's relatively conceptually easy, and it gets pretty decent results. But before we can talk about Naive Bayes, we have to talk about conditional probability.

### Bayes Theorem

To understand most statistics, we need to understand conditional probability. Bayes theorem states that

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

where \(P(A|B)\) is the probability that A is true, given that B is true. Much analysis these days is Bayesian, as it is a way to inject practical subjectivity into a model. For example, what is the probability someone will have heart problems? Well, if we take into account whether or not the person smokes, drinks, exercises and eats healthy foods, we will get a better estimate. This is just a brief explanation of Bayes Theorem, and if this is your first time seeing it, you can learn more here.

### Naive Bayes

Naive Bayes is an algorithm that uses basic probability in order to classify. Let's define a few probabilities in the context of an example so it's clearer. In this example we are classifying students as accepted or rejected by some school (the data is from UCLA!).

The prior – the probability of the class. In this case, if A = accepted, then

$$P(A) = \frac{N_A}{N}$$

where \(N_A\) = number of accepted students in the data, and N = number of total students in the data.

The likelihood – a conditional probability. If we assume some student (some vector x) is accepted, then we can calculate the probability of their GPA being as low or as high as it is, given that they were accepted. For real-valued features we use a Normal distribution; for binary values, a Bernoulli distribution; and for categorical values, a multinoulli. We can use different distributions for different variables within the same model.
The posterior – the prior times the likelihood, normalized. In math, this is the prior times the likelihood, divided by the probability of the vector x. When we're writing a Naive Bayes algorithm, however, we don't care about the normalizing, the division by the probability of vector x. This is because it is constant across classes and has no impact on our classification. So the denominator in both of…
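To make the "skip the denominator" point concrete, here is a minimal Gaussian Naive Bayes sketch. This is my own illustration, not the post's UCLA code: we simply compare prior × likelihood across classes and take the largest, never computing P(x).

```python
import math

def gaussian_pdf(x, mean, var):
    # Likelihood of a real-valued feature under a Normal distribution
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def train(X, y):
    """X: list of feature vectors, y: class labels. Returns per-class stats."""
    model = {}
    for label in set(y):
        rows = [xi for xi, yi in zip(X, y) if yi == label]
        prior = len(rows) / len(X)                  # the prior: N_A / N
        means = [sum(col) / len(col) for col in zip(*rows)]
        vars_ = [sum((v - m) ** 2 for v in col) / len(col)
                 for col, m in zip(zip(*rows), means)]
        model[label] = (prior, means, vars_)
    return model

def classify(model, x):
    best, best_score = None, -1.0
    for label, (prior, means, vars_) in model.items():
        # prior x likelihood; the P(x) denominator is skipped entirely
        score = prior
        for xi, m, v in zip(x, means, vars_):
            score *= gaussian_pdf(xi, m, max(v, 1e-9))
        if score > best_score:
            best, best_score = label, score
    return best
```

In practice you would sum log-probabilities instead of multiplying, to avoid underflow, but the unnormalized comparison is the same.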

## Markov Chains in R

A few weeks ago I wrote a tutorial on Markov chains, where the example I gave was a model that generated text. I enjoyed writing that, but I realize that if you're interested in using Markov chains for a more practical purpose, that tutorial only introduced the idea. So now I want to provide an example that gives an easier entry into Markov chain programming.

A Markov chain is a random process where we assume the previous state holds predictive power in predicting the next.

### Car Rental Example

Suppose a car rental agency has three locations in Ottawa: a Downtown location, an East end location, and a West end location. The agency has a group of delivery drivers to serve all three locations. The agency's statistician has determined the following:

- Of the calls to the Downtown location, 30% are delivered in the Downtown area, 30% in the East end, and 40% in the West end.
- Of the calls to the East end location, 40% are delivered in the Downtown area, 40% in the East end, and 20% in the West end.
- Of the calls to the West end location, 50% are delivered in the Downtown area, 30% in the East end, and 20% in the West end.

After making a delivery, a driver goes to the nearest location to make the next delivery. This way, the location of a specific driver is determined only by his or her previous location. This is a Markov chain because the current state has predictive power over the next state. The state space in this example is {Downtown, East, West}. Note – the code is also on my github. We can model the Markov chain with a transition matrix.

```r
rentalStates <- c("Downtown", "East", "West")
rentalTransition <- matrix(c(0.3, 0.3, 0.4,
                             0.4, 0.4, 0.2,
                             0.5, 0.3, 0.2),
                           byrow = T, nrow = 3,
                           dimnames = list(rentalStates, rentalStates))
rentalTransition
```

```
         Downtown East West
Downtown      0.3  0.3  0.4
East          0.4  0.4  0.2
West          0.5  0.3  0.2
```

There is a package in R that will make our lives easier, called 'markovchain'. Let's define our model as a markovchain object. Calling the object is equivalent to calling its show() method. We can even access elements of the transition matrix by indexing the object.

```r
library(markovchain)

mcRental <- new("markovchain", states = rentalStates, byrow = T,
                transitionMatrix = rentalTransition, name = "Rental Cars")

# We can access the transition matrix just by calling the mc object
mcRental
```

```
Rental Cars
A 3 - dimensional discrete Markov Chain with following states
Downtown East West
The transition matrix (by rows) is defined as follows
         Downtown East West
Downtown      0.3  0.3  0.4
East          0.4  0.4  0.2
West          0.5  0.3  0.2
```

You can also call a specific row of the matrix, using the mcRental object.

```r
mcRental[1]  # the probabilities that we go Downtown, East, and West, given that we are currently Downtown
```

```
Downtown     East     West
     0.3      0.3      0.4
```

We can plot it using plot(mcRental). Ok, let's ask some more statistical questions that might pop up if this were a real situation. Given that we are downtown, what is the probability of being back downtown in exactly 2 trips? To answer this, we have to find each possible way we…
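The post's code is in R, but the underlying calculation is just matrix multiplication, so here is the same two-trip question checked in Python/numpy (my sketch, not the post's code): squaring the transition matrix gives the two-step transition probabilities.

```python
import numpy as np

# Rows are the current location (Downtown, East, West)
P = np.array([[0.3, 0.3, 0.4],
              [0.4, 0.4, 0.2],
              [0.5, 0.3, 0.2]])

# Two-step transition probabilities: square the matrix
P2 = np.linalg.matrix_power(P, 2)
p_downtown_in_2 = P2[0, 0]   # Downtown -> (anywhere) -> Downtown
```

Each entry of P² sums over the intermediate stop, which is exactly the "each possible way" enumeration the text describes.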

## Linear Algebra in Julia

Most people (including myself) are drawn to Julia by its lofty goals. The speed of C, the statistical packages of R, and the ease of Python? It sounds too good to be true. However, I haven't seen anyone who has looked into it say the developers behind the language aren't on track to accomplish these goals. Having only been around since 2012, Julia's greatest disadvantage is a lack of community support. If you have an obscure Julia question and you google it, you probably won't find the answer, whereas with Python or R or Java you would. This also means less package support. The packages for linear algebra, plotting, and other staples are there, but if you want to do computer vision or NLP, you'd be among the few. However, it is definitely worth looking into. I'm not quite a pro, but I'm getting to the point where if I code something in Python, I can easily transfer it to Julia. Then sometimes I test the speed for fun. Julia always wins.

Recently, I wrote an article about linear algebra, with accompanying code in Python. Below is basically the same article, with the code in Julia. If you need help with the basic syntax, I also wrote a basic syntax guide, kind of a compressed version of the documentation. I included the concept descriptions, but they're no different from my original article. The point of this is just to show how easy it is to do linear algebra in Julia. Here is the IPython notebook on my github. (You can write Julia code in IPython… it's awesome.)

### The Basics

- matrix – a rectangular array of values
- vector – a one-dimensional matrix
- identity matrix I – an n x n diagonal matrix with ones on the diagonal from the top left to the bottom right, i.e. [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]]

When a matrix A is multiplied by its inverse A^-1, the result is the identity matrix I. Only square matrices have inverses. Example below. Note – the inverse of a matrix is not the transpose.

Matrices are notated m x n, or rows x columns.
A 2×3 matrix has 2 rows and 3 columns. Read this multiple times. You can only add matrices of the same dimensions. You can only multiply two matrices if the first is m x n and the second is n x p; the n-dimension has to match. Now the…

## Generating Text Using a Markov Model

A Markov chain is a random process where we assume the previous state(s) hold sufficient predictive power in predicting the next state. Unlike flipping a coin, these events are dependent. It's easier to understand through an example. Imagine the weather can only be rainy or sunny; that is, the state space is {rainy, sunny}. We can represent our Markov model as a transition matrix, with each row being a current state and each column holding the probability of moving to another state. However, it's easier to understand with this state transition diagram. In other words, given today is sunny, there is a 0.9 probability that tomorrow will be sunny, and a 0.1 probability that tomorrow will be rainy.

### Text Generator

One cool application of this is a language model, in which we predict the next word based on the current word(s). If we predict based only on the last word, it is a first-order Markov model. If we use the last two words, it's a second-order Markov model. In my example I trained the model on Walden by Henry Thoreau. I also included files of Thus Spoke Zarathustra by Nietzsche, and some speeches by Obama, to make it easy to experiment. The cool thing about this is that whatever text you train it on, the model spits out really similar text.

First we have to import NLTK, the best NLP library in Python. I would say the natural language processing we're doing here is pretty mild, but its built-in functions save me a lot of code. We then turn the string (taken from the text file) into an array using the split() function.

```python
import nltk
import random

file = open('Text/Walden.txt', 'r')
walden = file.read()
walden = walden.split()
```

These next two functions are the meat of the code. The conditional frequency distribution from NLTK that we are eventually going to use has to take the array as pairs. So the phrase "Hi my name is Alex" would become [("Hi", "my"), ("my", "name"), ("name", "is"), ("is", "Alex")]. The makePairs function takes in an array (a string split by word) and outputs an array in the above format. The generate function takes in a conditional frequency distribution. Think – how many times did each word appear after 'farm'? That is what a conditional frequency distribution holds (for all words, not just 'farm'). The rest of the generate function outputs text based on the distribution observed in the training data. I did this…
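The pairing half is simple enough to sketch here. This is a hypothetical version (the post's exact makePairs implementation isn't shown in this excerpt):

```python
def make_pairs(words):
    # Turn ["Hi", "my", "name", "is", "Alex"] into consecutive bigram pairs,
    # the input format NLTK's ConditionalFreqDist expects.
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]
```

Feeding these pairs to `nltk.ConditionalFreqDist` then gives, for every word, the counts of the words that followed it in the training text.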

## Introduction To Monte Carlo Methods

I'm going to keep this tutorial light on math, because the goal is just to give a general understanding.

Monte Carlo methods originated in the Manhattan Project, as a way to simulate the distance neutrons would travel through various materials [1]. Ideas using sampling had been around for a little while, but they took off in the making of the atomic bomb, and have since appeared in lots of other fields. The idea is this: generate some random samples of some random variable of interest, then use these samples to compute values you're interested in. I know, super broad. The truth is that because Monte Carlo has a ton of different applications, it's hard to give a more precise definition. It's used in product design, to simulate variability in parts manufacturing. It's used in biology, to simulate the average distance of a bird from its nest, which allows a scientist to calculate energy expended. It can be used in AI for games, for example the Chinese game Go. And in finance, to evaluate financial derivatives and option pricing [1]. In short, it's used everywhere. The big advantage of Monte Carlo methods is that they inject randomness and real-world complexity into the model. They are also robust to adjustments, such as applying a distribution to the random variable you are considering. The justification for a Monte Carlo method lies in the law of large numbers. I'll elaborate in the first example.

### Estimating Pi

We can use something called the random darts method, a Monte Carlo simulation, to estimate pi. Here is my R code for this example. The logic goes as follows: if we inscribe a circle in a square, where one side length of the square equals the diameter of the circle, we can easily calculate the ratio of circle area to square area: \(\frac{\pi r^2}{(2r)^2} = \frac{\pi}{4}\). So if we can estimate this ratio, we can estimate pi. We do this by randomly sampling points in the square, and calculating the proportion of points inside the circle to total points.
So I just calculate red points over total points, and multiply by 4. As the number of points increases, our estimate gets closer to pi. This is the law of large numbers at work. This is a simple example of a Monte Carlo method. We had some value we wanted to estimate, and we…
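The post links R code; the same darts experiment is a few lines of Python (my own sketch, not the linked code). Sampling in the unit square and testing against a quarter circle of radius 1 gives the same π/4 ratio:

```python
import random

def estimate_pi(n, seed=0):
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        # Sample a point in the unit square; the quarter circle of radius 1
        # plays the role of the inscribed circle.
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4 * inside / n   # circle area / square area = pi / 4
```

Increasing `n` shrinks the error, which is the law of large numbers in action.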

## Guide to Linear Regression

Linear regression is one of the first things you should try if you're modeling a linear relationship (and actually, it can handle some non-linear relationships too!). It's fairly simple, and probably the first thing to learn when tackling machine learning.

At first, linear regression shows up just as a simple equation for a line. In machine learning, the weights are usually represented by a vector θ (in statistics they're often represented by A and B!).

$$\hat{Y} = \theta_0 + \theta_1 X_1$$

But then we have to account for more than just one input variable. A more general equation for linear regression goes as follows: we multiply each input feature \(X_i\) by its corresponding weight in the weight vector θ. This is equivalent to θ transpose times the input vector X.

$$\hat{Y} = h_{\theta}(x) = \sum\limits_{i = 1}^{d} \theta_i X_i = \theta^T X$$

There are two main ways to train a linear regression model. You can use the normal equation (in which you set the derivative of the negative log likelihood, the NLL, to 0), or gradient descent. Sorry for switching notation below. Note – the matrices are \(i \times j\), where \(i\) indexes rows, or training examples.

### Gradient Descent

The cost function is essentially the sum of the squared distances. The "distance" is the vertical distance between the predicted y and the observed y, known as the residual. Gradient descent minimizes this cost by stepping down the cost function toward the (hopefully) global minimum. Here is the cost function:

$$J(\theta) = \frac{1}{2} \sum\limits_{i = 1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2$$

The cost is the residual sum of squares. The \(\frac{1}{2}\) is just a constant to make the derivative prettier. You could put 1000 as a multiple of the cost function; it wouldn't change the process of minimizing it. Sometimes you'll see \(m\) (the number of training examples) in the denominator out front too. It would then be present in the derivative as well, because it's a constant.
This just gives 'cost per training example', which is perfectly valid. And the gradient descent update:

$$\Large{\theta_{j+1} := \theta_j + \alpha \sum\limits_{i=1}^{m} (y^{(i)} - h_{\theta}(x^{(i)}))x_{j}^{(i)}}$$

This is really the current weight minus alpha times the partial derivative of the cost function. Or, in math:

$$\Large{\frac{\partial}{\partial \theta_{j}} J(\theta) = \sum\limits_{i=1}^{m}(h_{\theta}(x^{(i)}) - y^{(i)})x_{j}^{(i)}}$$

$$\Large{\theta_{j+1} := \theta_j - \alpha \frac{\partial}{\partial \theta_{j}} J(\theta)}$$

The first equation switches the position of \(h(x)\) and \(y\) to pull out the negative sign. This makes the…
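The batch update above is a short numpy loop. This is a minimal sketch (it assumes a bias column is already included in X, and the step size and iteration count are illustrative, not tuned):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.05, iters=2000):
    """X: (m, d) design matrix, y: (m,) targets. Returns fitted theta."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        residual = y - X @ theta                   # (y - h_theta(x)) per example
        theta = theta + alpha * (X.T @ residual)   # the summed update from above
    return theta
```

On the toy data y = 2x (with an intercept of 0), the fitted θ recovers those values.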

## k-Nearest Neighbors in scikit-learn

The k-nearest neighbor (KNN) algorithm can be used for classification or regression. It's one of the simpler machine learning algorithms, but it's usually pretty effective. It is non-parametric, meaning it doesn't have a fixed number of parameters. To classify a point, KNN finds the k "nearest" training points to it, and returns the class with the highest proportion among them. If k = 1, then you only look at the closest point and return its class. This is usually not ideal; values of k between about 3 and 10 are a common starting range.

### Distance

Euclidean distance (pictured below) between the test input x and the training examples is the typical distance measure. Other distance metrics are also used, but this is the most common. Note – KNN assumes the feature space is a metric space, as if one unit increase in shoe size were equivalent to one unit increase in height. When you use this algorithm outside of this tutorial, put your features on the same scale.

### The Curse of Dimensionality

KNN fails when we have a ton of features, in particular noisy or irrelevant ones. In a sense, noise pushes apart points that would otherwise have been close to each other. One can use tools such as PCA to reduce dimensionality, and this is a good practice if you have more than 10 or so features.

### Example – Predicting Grade Level

I made some data of shoe sizes and heights for 4th graders vs. high school seniors, so we could have a simple classification example. Plus, I'm tired of the iris dataset. You can download the data and the IPython notebook from my github. The dataset we're working with has two features, shoe size and height, plus the class: 4th grade or senior. Red is fourth grade, blue is senior.

```python
import pandas as pd
import numpy as np

train = pd.read_csv('train.csv')  # better be in the correct directory!
test = pd.read_csv('test.csv')
train.head()
```

We have our data in a pandas data frame. We need it to be in a numpy array for the algorithm to take it. I like to split it up into trainArr which has the training features and trainRes, which contains the correct output. I do the same thing to the test data, just for convenience.

```python
from sklearn.neighbors import KNeighborsClassifier

cols = ['shoe size', 'height']
cols2 = ['class']
trainArr = train.as_matrix(cols)   # training features
trainRes = train.as_matrix(cols2)  # training class labels
testArr = test.as_matrix(cols)
testRes = test.as_matrix(cols2)

knn = KNeighborsClassifier(n_neighbors=3, weights='distance')
knn.fit(trainArr, trainRes)
output = knn.predict(testArr)

# or predict on a specific example!
print(knn.predict([testArr[0]]))  # predict expects a 2-D array
print(testRes[0])  # a match!
```

```
['seniors']
['seniors']
```

Let's get a better overview to see how correct our classifier was.

```python
correct = 0.0
for i in range(len(output)):
    if testRes[i][0] == output[i]:
        correct += 1
correct / len(output)
```

```
0.9933333333333333
```

Well, not perfect, but pretty good! There's one more thing I want to get across in this blog post.

### A Reality Check About scikit-learn

Scikit-learn is amazing. We essentially create and train a classifier…

## Linear Algebra for Data Scientists

It's important to know what goes on inside a machine learning algorithm. But it's hard. There is some pretty intense math happening, much of which is linear algebra. When I took Andrew Ng's course on machine learning, I found the hardest part was the linear algebra. I'm writing this for myself as much as for you. So here is a quick review, so that next time you look under the hood of an algorithm, you're more confident. You can view the IPython notebook (usually easier to code with) on my github. For some of these concepts, I'm going to write more in depth on them later.

### The Basics

- matrix – a rectangular array of values
- vector – a one-dimensional matrix
- identity matrix I – an n x n diagonal matrix with ones on the diagonal from the top left to the bottom right, i.e. [[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]]

When a matrix A is multiplied by its inverse A^-1, the result is the identity matrix I. Only square matrices have inverses. Example below. Note – the inverse of a matrix is not the transpose.

Matrices are notated m x n, or rows x columns. A 2×3 matrix has 2 rows and 3 columns. Read this multiple times. You can only add matrices of the same dimensions. You can only multiply two matrices if the first is m x n and the second is n x p; the n-dimension has to match. Now the basics in Python

```python
import numpy as np

A = np.array([[3, 2, 4]])
B = np.array([[1], [5], [5]])

print("rows by columns, or m by n")
print("A is", A.shape)
print("B is", B.shape)
print("A * B = ", np.dot(A, B))  # note -> A*B is not matrix multiplication in numpy!!!
```

```
[Output]
rows by columns, or m by n
A is (1, 3)
B is (3, 1)
A * B =  [[33]]
```

And for using identity matrices in numpy, use the eye() function.

```python
# If a matrix A is multiplied by the identity matrix, the result is A.
A = np.array([[1, 2, 3, 4]])
print(np.dot(A, np.eye(4)))  # equals A!!!
```

```
[[ 1.  2.  3.  4.]]
```

Calculating the inverse with inv() (or the pseudoinverse with pinv()) is also important. Another important function is transpose().

```python
B = np.array([[1, 2], [3, 4]])
print(np.dot(B, np.linalg.inv(B)))  # returns the identity matrix (approximately)
print(B.transpose())
```

```
[[ 1.00000000e+00  0.00000000e+00]
 [ 8.88178420e-16  1.00000000e+00]]
[[1 3]
 [2 4]]
```

### Eigenvalues

An eigenvalue of a matrix A is a scalar you can multiply some vector X by and get the same answer you would if you multiplied A and X. In this situation, the vector X is an eigenvector. More formally:

Def: Let A be an n x n matrix. A scalar λ is called an eigenvalue of A if there is a nonzero vector X such that AX = λX. Such a vector X is called an eigenvector of A corresponding to λ.

There is a way to compute the eigenvalues of a matrix by hand, and then a corresponding eigenvector, but it's a bit beyond the scope of this tutorial.

```python
# *** eigenvalues and eigenvectors *** #
A = np.array([[2, -4], [-1, -1]])
x = np.array([[4], [-1]])  # a suspected eigenvector
eigVal = 3                 # a suspected eigenvalue

print(np.dot(A, x), "\n")
print(eigVal * x)  # They match!
```

```
[output]
[[12]
 [-3]]

[[12]
 [-3]]
```

Now that we know matrix A has a real eigenvalue, let’s compute it with numpy!…
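The excerpt cuts off here, but the numpy computation presumably uses np.linalg.eig; for the matrix above it would look something like this (my sketch):

```python
import numpy as np

A = np.array([[2, -4], [-1, -1]])
eigenvalues, eigenvectors = np.linalg.eig(A)
# The columns of `eigenvectors` are the eigenvectors; for this A the
# eigenvalues are 3 and -2 (in some order), matching the check above.
```

The characteristic polynomial here is λ² - λ - 6 = 0, whose roots are exactly 3 and -2.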
