One thing I love to do to practice machine learning is code an algorithm myself. I mean without external packages, just trying to implement all the math and everything else myself. Naive Bayes is a great algorithm for this, because it's conceptually simple and it gets pretty decent results. But before we can talk about Naive Bayes, we have to talk about conditional probability.

Bayes Theorem

To understand most statistics, we need to understand conditional probability. Bayes theorem states that

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

where \(P(A|B)\) is the probability that A is true, given that B is true. Much analysis these days is Bayesian, as it is a way to inject practical subjectivity into a model. For example, what is the probability someone will have heart problems? Well, if we take into account whether or not the person smokes, drinks, exercises and eats healthy foods, we will get a better estimate. This is just a brief explanation of Bayes theorem, and if this is your first time seeing it, you can learn more here.

Naive Bayes

Naive Bayes is an algorithm that uses basic probability in order to classify. Let's define a few probabilities in the context of an example so it's clearer. In this example we are classifying students as accepted or rejected to some school (the data is from UCLA!).

The prior – the probability of the class. In this case, if A = accepted, then

$$P(A) = \frac{N_A}{N}$$

where \(N_A\) = number of accepted students in the data, and \(N\) = number of total students in the data.

Likelihood – this is a conditional probability. If we assume some student (some vector x) is accepted, then we can calculate the probability of him having a GPA as low or as high as he did, given he was accepted. For real-valued features we use a Normal distribution; for binary features, a Bernoulli distribution; and for categorical features, a multinoulli. We can use different distributions for different variables within the same model.
Posterior – the product of the prior and the likelihood, normalized. In math this looks like the prior times the likelihood, divided by the probability of the vector x. When we're writing a Naive Bayes algorithm, however, we don't care about the normalizing, the dividing by the probability of vector x. This is because it is constant and has no impact on our classification. So the denominator in both of…
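Putting the prior, likelihood, and unnormalized posterior together, here is a minimal Gaussian Naive Bayes sketch in base R. The GPA vectors below are made up for illustration; they are not the actual UCLA dataset.

```r
# Hypothetical data: admit = 1 means accepted, 0 means rejected
gpa   <- c(3.6, 3.9, 3.2, 2.8, 3.7, 2.5, 3.1, 3.8)
admit <- c(1,   1,   0,   0,   1,   0,   0,   1)

# Priors: P(accepted) = N_A / N, and similarly for rejected
priorA <- mean(admit == 1)
priorR <- mean(admit == 0)

# Gaussian likelihood parameters, estimated per class
muA <- mean(gpa[admit == 1]); sdA <- sd(gpa[admit == 1])
muR <- mean(gpa[admit == 0]); sdR <- sd(gpa[admit == 0])

# Unnormalized posteriors for a new student with a 3.5 GPA --
# we skip dividing by P(x), since it's the same for both classes
x <- 3.5
postA <- priorA * dnorm(x, muA, sdA)
postR <- priorR * dnorm(x, muR, sdR)

# Classify by whichever unnormalized posterior is larger
prediction <- ifelse(postA > postR, "accepted", "rejected")
prediction
```

With more than one feature, you would multiply in one likelihood term per feature; that product is exactly where the "naive" independence assumption comes in.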

## Markov Chains in R

A few weeks ago I wrote a tutorial on Markov Chains, where the example I gave was a model that generated text. I enjoyed writing that, but I realize that if you're interested in using Markov Chains for a more practical purpose, that tutorial only introduced the idea. So now I want to provide an example that gives easier access to Markov Chain related programming.

Markov Chain – a random process where we assume the previous state holds predictive power in predicting the next.

Car Rental Example

Suppose a car rental agency has three locations in Ottawa: a Downtown location, an East end location and a West end location. The agency has a group of delivery drivers to serve all three locations. The agency's statistician has determined the following:

- Of the calls to the Downtown location, 30% are delivered in the Downtown area, 30% are delivered in the East end, and 40% are delivered in the West end.
- Of the calls to the East end location, 40% are delivered in the Downtown area, 40% are delivered in the East end, and 20% are delivered in the West end.
- Of the calls to the West end location, 50% are delivered in the Downtown area, 30% are delivered in the East end, and 20% are delivered in the West end.

After making a delivery, a driver goes to the nearest location to make the next delivery. This way, the location of a specific driver is determined only by his or her previous location. This is a Markov chain because the current state has predictive power over the next state. The state space in this example is Downtown, East, and West. Note – the code is also on my github. We can model the Markov chain by a transition matrix.

```r
rentalStates <- c("Downtown", "East", "West")

rentalTransition <- matrix(c(0.3, 0.3, 0.4,
                             0.4, 0.4, 0.2,
                             0.5, 0.3, 0.2),
                           byrow = T, nrow = 3,
                           dimnames = list(rentalStates, rentalStates))

rentalTransition
```

```
         Downtown East West
Downtown      0.3  0.3  0.4
East          0.4  0.4  0.2
West          0.5  0.3  0.2
```

There is a package in R that will make our lives easier. It's called 'markovchain'. Let's define our model as a markovchain object. Calling the object is equivalent to calling its show() method. We can even access elements of the transition matrix by indexing the object.

```r
mcRental <- new("markovchain", states = rentalStates, byrow = T,
                transitionMatrix = rentalTransition, name = "Rental Cars")

# We can access the transition matrix just by calling the mc object
mcRental
```

```
Rental Cars
A 3-dimensional discrete Markov Chain with following states
Downtown East West
The transition matrix (by rows) is defined as follows
         Downtown East West
Downtown      0.3  0.3  0.4
East          0.4  0.4  0.2
West          0.5  0.3  0.2
```

You can also call a specific row of the matrix, using the mcRental object.

```r
mcRental[1] # the probabilities that we go Downtown, East, and West, given that we are currently Downtown
```

```
Downtown     East     West
     0.3      0.3      0.4
```

We can plot it using plot(mcRental). Ok, let's ask some more statistical questions that might pop up if this were a real situation. Given that we are downtown, what is the probability of being back downtown in exactly 2 trips? To answer this, we have to find each possible way we…
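A quick way to check a two-step probability like this is to square the transition matrix: entry [i, j] of the squared matrix sums over every intermediate stop. The sketch below uses only base R, so it stands alone without the markovchain package.

```r
rentalStates <- c("Downtown", "East", "West")
rentalTransition <- matrix(c(0.3, 0.3, 0.4,
                             0.4, 0.4, 0.2,
                             0.5, 0.3, 0.2),
                           byrow = TRUE, nrow = 3,
                           dimnames = list(rentalStates, rentalStates))

# Squaring the transition matrix gives the two-step transition probabilities
twoStep <- rentalTransition %*% rentalTransition

# P(Downtown -> Downtown in exactly 2 trips):
# via Downtown (0.3*0.3) + via East (0.3*0.4) + via West (0.4*0.5) = 0.41
twoStep["Downtown", "Downtown"]
```

The markovchain package overloads `^` for exactly this, so `mcRental ^ 2` gives the same matrix.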

## Introduction To Monte Carlo Methods

I’m going to keep this tutorial light on math, because the goal is just to give a general understanding. Monte Carlo methods originated from the Manhattan Project, as a way to simulate the distance neutrons would travel through various materials [1]. Ideas using sampling had been around for a little while, but they took off in the making of the atomic bomb, and have since appeared in lots of other fields.

The idea is this: generate some random samples for some random variable of interest, then use these samples to compute values you’re interested in. I know, super broad. The truth is that because Monte Carlo has a ton of different applications, it’s hard to give a more precise definition. It’s used in product design, to simulate variability in parts manufacturing. It’s used in biology, to simulate the average distance of a bird from its nest, which would allow a scientist to calculate the energy it expends. It can be used in AI for games, for example the Chinese game Go. And finally, it’s used in finance, to evaluate financial derivatives and option pricing [1]. In short, it’s used everywhere.

The big advantage of Monte Carlo methods is that they inject randomness and real-world complexity into the model. They are also robust to adjustments, such as applying a distribution to the random variable you are considering. The justification for a Monte Carlo method lies in the law of large numbers. I’ll elaborate in the first example.

Estimating Pi

We can use something called the random darts method, a Monte Carlo simulation, to estimate pi. Here is my R code for this example. The logic goes as follows: if we inscribe a circle in a square, where one side length of the square equals the diameter of the circle, we can easily calculate the ratio of circle area to square area:

$$\frac{\pi r^2}{(2r)^2} = \frac{\pi}{4}$$

Now if we could estimate this ratio, we would be able to estimate pi. We can do this by randomly sampling points in the square, and calculating the proportion of those inside the circle to the total points.
So I just calculate red points over total points, and multiply by 4. We can see that as the number of points increases, our estimate gets closer to pi. This is the law of large numbers at work. This is a simple example of a Monte Carlo method. We had some value we wanted to…
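The darts method described above fits in a few lines of base R. This is a minimal sketch (variable names are my own), using the quarter-circle version: sample in the unit square and count points with \(x^2 + y^2 \le 1\).

```r
# Random-darts estimate of pi: sample points uniformly in the unit square,
# then take the fraction landing inside the quarter circle of radius 1
set.seed(42)
n <- 100000
x <- runif(n)
y <- runif(n)

inside <- x^2 + y^2 <= 1   # TRUE for "red" points inside the circle

# inside-fraction estimates pi/4, so multiply by 4
piHat <- 4 * mean(inside)
piHat
```

Re-running with larger n shrinks the error roughly like \(1/\sqrt{n}\), which is the law of large numbers (plus the central limit theorem) at work.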

## Guide to Linear Regression

Linear regression is one of the first things you should try if you’re modeling a linear relationship (actually, non-linear relationships too!). It’s fairly simple, and probably the first thing to learn when tackling machine learning. At first, linear regression shows up just as a simple equation for a line. In machine learning, the weights are usually represented by a vector θ (in statistics they’re often represented by A and B!).

$$\hat{Y} = \theta_0 + \theta_1 X_1$$

But then we have to account for more than just one input variable. A more general equation for linear regression goes as follows – we multiply each input feature \(X_i\) by its corresponding weight in the weight vector θ. This is also equivalent to theta transpose times the input vector X.

$$\hat{Y} = h_{\theta}(x) = \sum\limits_{i = 1}^{d} \theta_i X_i = \theta^T X$$

There are two main ways to train a linear regression model. You can use the normal equation (in which you set the derivative of the negative log likelihood, NLL, to 0), or gradient descent. Sorry for switching notation below. Note – the matrices are \(i \times j\), where \(i\) signifies rows, or training examples.

Gradient Descent

The cost function is essentially the sum of the squared distances. The “distance” is the vertical distance between the predicted y and the observed y, known as the residual. Gradient descent minimizes this cost by stepping down the cost function to the (hopefully) global minimum. Here is the cost function:

$$J(\theta) = \frac{1}{2} \sum\limits_{i = 1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2$$

The cost is the residual sum of squares. The \(\frac{1}{2}\) is just a constant to make the derivative prettier. You could put 1000 as a multiple of the cost function; it doesn’t change the process of minimizing it. Sometimes you’ll see \(m\) (the number of training examples) in the denominator out front too. It would then be present in the derivative as well, because it’s a constant.
This just makes it ‘cost per training example’, which is perfectly valid. And the gradient descent algorithm:

$$\Large{\theta_{j+1} := \theta_j + \alpha \sum\limits_{i=1}^{m} (y^{(i)} - h_{\theta}(x^{(i)}))x_{j}^{(i)}}$$

This is really the current weight minus alpha times the partial derivative of the cost function. Or, in math:

$$\Large{\frac{\partial}{\partial \theta_{j}} J(\theta) = (h_{\theta}(x) - y)x_j}$$

$$\Large{\theta_{j+1} := \theta_j - \alpha \frac{\partial}{\partial \theta_{j}} J(\theta)}$$

The original equation switches the position of \(h(x)\) and \(y\) to pull out a negative. This makes the…
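The update rule above can be sketched in a few lines of R. The toy data, learning rate, and step count below are all made up for illustration; the point is just to see the batch update \(\theta := \theta - \alpha \nabla J(\theta)\) converge.

```r
# Batch gradient descent for linear regression on toy data
# where the true relationship is y = 1 + 2x (plus a little noise)
set.seed(1)
m <- 100
x <- runif(m, 0, 10)
y <- 1 + 2 * x + rnorm(m, sd = 0.5)

X <- cbind(1, x)   # design matrix with an intercept column
theta <- c(0, 0)   # initial weights
alpha <- 0.01      # learning rate

for (step in 1:5000) {
  residual <- X %*% theta - y          # h_theta(x) - y for every example
  gradient <- t(X) %*% residual / m    # 'cost per training example' gradient
  theta <- theta - alpha * gradient    # step down the cost surface
}

theta   # should land near c(1, 2)
```

Dividing the gradient by m is the ‘cost per training example’ variant mentioned above; it only rescales the effective learning rate.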

## Basic Data Exploration in R

When you’re cleaning up data, you usually end up using 5-8 functions a ton of times, and then a few more once or twice. Here are those 5-8 functions I find myself using again and again. Here is a quick overview:

- names() – returns the column names of a dataset
- str() – gives the overview of a dataset
- data.table package – includes functions for creating new columns, among other things
- %in% operator – checks if a value is in a vector

Below are some examples.

```r
names(rock) # returns the column names
[1] "area"  "peri"  "shape" "perm"

str(rock) # gives the format of the dataframe
'data.frame': 48 obs. of 4 variables:
 $ area : int 4990 7002 7558 7352 7943 7979 9333 8209 8393 6425 ...
 $ peri : num 2792 3893 3931 3869 3949 ...
 $ shape: num 0.0903 0.1486 0.1833 0.1171 0.1224 ...
 $ perm : num 6.3 6.3 6.3 6.3 17.1 17.1 17.1 17.1 119 119 ...

# import the data.table package
install.packages("data.table") # don't forget these 3 steps!
library(data.table)
dtRock <- data.table(rock)

dtRock[1:5] # returns the first 5 rows
   area    peri     shape perm
1: 4990 2791.90 0.0903296  6.3
2: 7002 3892.60 0.1486220  6.3
3: 7558 3930.66 0.1833120  6.3
4: 7352 3869.32 0.1170630  6.3
5: 7943 3948.54 0.1224170 17.1

# and my favorite way to create a new column
dtRock[, areaMP := area / 1000] # area is measured in pixels, so areaMP
                                # is in mega pixels

dtRock[1, ] # indicates the first row, all columns
   area   peri     shape perm areaMP
1: 4990 2791.9 0.0903296  6.3   4.99

dtRock[, 'areaMP'] # returns the entire 'areaMP' column

# The %in% operator is one of the most useful functions in R, I think.
a <- c(1, 2, 3, 4)
4 %in% a # it's asking, is the value 4 in the vector a?
[1] TRUE
```

There are many other functions and packages, such as the ‘dplyr’ package by the amazing Hadley Wickham, but I am just showing the ones I use most frequently.
