One thing I love to do to practice machine learning is to code an algorithm myself. I mean without external packages: just implementing all the math and everything else myself.

Naive Bayes is a great algorithm for this, because it's conceptually simple and it gets pretty decent results. But before we can talk about Naive Bayes, we have to talk about conditional probability.

## Bayes' Theorem

To understand most statistics, we need to understand conditional probability. Bayes' theorem states that

\[ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \]

where \(P(A|B)\) is the probability that A is true, *given* that B is true. A lot of analysis these days is Bayesian, because conditioning like this is a way to inject practical, subjective knowledge into a model.

For example, what is the probability someone will have heart problems? Well, if we take into account whether or not the person smokes, drinks, exercises, and eats healthy foods, we will get a better estimate. This is just a brief explanation of Bayes' theorem, and if this is your first time seeing it, it's worth reading up on before continuing.
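As a quick worked example (every rate below is made up, purely for illustration): suppose 10% of people have heart problems, 40% of people with heart problems smoke, and 20% of everyone smokes.

```r
# made-up rates, purely for illustration
pHeart <- 0.10            # P(H): base rate of heart problems
pSmokeGivenHeart <- 0.40  # P(S|H): smoking rate among those with heart problems
pSmoke <- 0.20            # P(S): smoking rate overall

# Bayes' theorem: P(H|S) = P(S|H) * P(H) / P(S)
pHeartGivenSmoke <- pSmokeGivenHeart * pHeart / pSmoke
pHeartGivenSmoke  # 0.2 - conditioning on smoking doubled our estimate
```

Knowing the person smokes moved our estimate from 10% to 20%, which is exactly the "better estimate" the theorem buys us.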

## Naive Bayes

Naive Bayes is an algorithm that uses basic probability in order to classify. Let's define a few probabilities in the context of an example so it's clearer. In this example we are classifying students as accepted or rejected by some school (the data is from UCLA!).

The **prior** – The probability of the class. In this case, if A = accepted, then

\[ P(A) = \frac{N_A}{N} \]

where \(N_A\) = the number of accepted students in the data, and \(N\) = the total number of students in the data.
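To make that concrete with made-up counts: if 120 of 400 students in the training data had been accepted, the prior would be

\[ P(A) = \frac{N_A}{N} = \frac{120}{400} = 0.3 \]

(These counts are hypothetical, not from the UCLA data.)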

**Likelihood** – This is a conditional probability. If we assume some student (some vector \(x\)) is accepted, then we can calculate the probability of that student having a GPA as low or as high as they did, given that they were accepted.

If we have to calculate this for real values, we use a Normal distribution. If we are doing it for binary values, we use a Bernoulli distribution, and for categorical values, a multinoulli (categorical) distribution. We can use a different distribution for each variable within the same model.
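Here's a sketch of what those likelihoods look like in R (the values below are toy numbers, not the UCLA data): `dnorm` gives the Gaussian density for a real-valued feature, while a Bernoulli likelihood is just the empirical proportion for a binary one.

```r
# toy values, not the UCLA data
gpaAccepted <- c(3.6, 3.8, 3.4, 3.9, 3.7)  # GPAs of accepted students
workAccepted <- c(1, 0, 1, 1, 0)           # a hypothetical binary feature

# Gaussian likelihood of a new student's 3.5 GPA, given "accepted"
mu <- mean(gpaAccepted)
sigma <- sd(gpaAccepted)
likGpa <- dnorm(3.5, mean = mu, sd = sigma)

# Bernoulli likelihood of the binary feature being 1, given "accepted"
likWork <- mean(workAccepted == 1)  # 3 of 5, so 0.6
```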

**Posterior** – The prior times the likelihood, normalized. In math:

\[ P(C \mid x) = \frac{P(C)\,P(x \mid C)}{P(x)} \]

When we're writing a Naive Bayes algorithm, however, we don't care about the normalizing, i.e. the dividing by \(P(x)\). That term is constant across classes and has no impact on our classification, so we can ignore the denominator entirely.

When calculating the likelihood, we have to do it for each feature. We can simply multiply the per-feature probabilities together, because the model assumes conditional independence, a strong and likely false assumption. This is why the algorithm is called "naive".
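Written out, that naive assumption lets the class-conditional likelihood factor over the \(d\) features:

\[ P(x \mid C) = P(x_1, \ldots, x_d \mid C) = \prod_{i=1}^{d} P(x_i \mid C) \]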

## Example

Which students get accepted to UCLA?

We have two classes – accepted and rejected. We have three variables – *gre score*, *gpa*, and *rank*. We can use each of these to write a classifier.

Notes –

- I'm using R. I chose it over Python for this one because it felt so statistical, but you could just as easily do this in Python.
- I'll give the important statistical ideas that lie in the model right here, but to do this yourself or to see every line of code, go to my GitHub.
- `train` is the training data, which I randomly sampled by class.

### Prior

First we need to calculate the prior probability for each class. My favorite way to do this is with `subset`. If you want to save memory, you can compress this into two lines.

```r
one <- subset(train, admit == 1)
zero <- subset(train, admit == 0)

priorAdmitted <- nrow(one) / nrow(train)
priorRejected <- nrow(zero) / nrow(train)
```

### Likelihood

Next, we need to calculate the likelihood. This is a bit more complicated, as we have to use a Gaussian distribution for each variable. I made a dictionary (a named list in R) to keep my code concise, rather than defining 3 different means and standard deviations. I recommend you explore named lists in R a bit.

```r
condProb <- function(df, testObj, var) {
  # note - in the train df, col 2 = gre, col 3 = gpa, col 4 = rank...let's make a dictionary
  dict <- vector(mode = "list", length = 3)
  names(dict) <- c("gre", "gpa", "rank")
  dict[[1]] <- 2; dict[[2]] <- 3; dict[[3]] <- 4

  col <- as.integer(dict[[var]])
  val <- testObj[1, col]  # "val" can represent gre, gpa, or rank, depending on var

  mu <- mean(df[, col])
  sigma <- sd(df[, col])

  # we have to account for which side of the mean the value is on
  if (val < mu)
    return(pnorm(val, mean = mu, sd = sigma))
  else
    return(pnorm(val, mean = mu, sd = sigma, lower.tail = FALSE))
}
```
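One design note: the function above scores a value by its Gaussian tail probability. The more standard Gaussian Naive Bayes likelihood is the density itself, which you could sketch as a drop-in alternative (same `df`, `testObj`, and `var` arguments, and the same assumed column layout):

```r
# alternative likelihood using the Gaussian density instead of a tail probability
condProbDensity <- function(df, testObj, var) {
  cols <- c(gre = 2, gpa = 3, rank = 4)  # same column layout as condProb
  col <- cols[[var]]
  val <- testObj[1, col]
  dnorm(val, mean = mean(df[, col]), sd = sd(df[, col]))
}
```

Both give larger scores to values near the class mean, so either works for ranking classes here.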

### Posterior

And lastly, we need to predict for each row in the test data! We do this simply by taking the class with the maximum posterior probability.

```r
# let's create a new column so we have somewhere to put our predictions
test$predict <- 0

# prob(C|x) = p(C) * product(p(xi|Ck))
# in reality there is some constant, p(x), in the denominator of that formula,
# but since it's a constant it doesn't matter for classifying. The multiplying
# together of the p(xi|Ck)'s requires conditional independence, a naive assumption ;)
for (i in 1:nrow(test)) {
  probAdmit <- priorAdmitted *
    condProb(one, test[i, ], "gre") *
    condProb(one, test[i, ], "gpa") *
    condProb(one, test[i, ], "rank")

  probReject <- priorRejected *
    condProb(zero, test[i, ], "gre") *
    condProb(zero, test[i, ], "gpa") *
    condProb(zero, test[i, ], "rank")

  if (probAdmit > probReject)
    test[i, 5] <- 1
  else
    test[i, 5] <- 0
}

# now we have the predictions! It's pretty clear that the algorithm predicted
# rejected more than accepted, probably because rejected has a higher prior
# (a lot more people in the training data got rejected than accepted)

# let's calculate an accuracy measure for the algorithm
test$correct <- FALSE
for (i in 1:nrow(test)) {
  a <- test[i, 1]
  b <- test[i, 5]
  if (xor(a, b) == FALSE)
    test[i, 6] <- TRUE  # programming question for you - why does xor work here?
  else
    test[i, 6] <- FALSE
}

accuracy <- nrow(subset(test, correct == TRUE)) / nrow(test)
# it was around .65, which was higher than the accuracy I got when
# I tested this with an R package
```
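The post doesn't say which R package the comparison used; `e1071` is one common choice, so here is a hedged sketch of what that comparison could look like. The data below is a toy stand-in (not the UCLA data), just to show the shape of the code.

```r
library(e1071)

# toy stand-in data, since we don't have the UCLA file here
set.seed(42)
n <- 100
toyTrain <- data.frame(
  admit = rep(c(0, 1), each = n / 2),
  gre   = c(rnorm(n / 2, 500, 60), rnorm(n / 2, 600, 60)),
  gpa   = c(rnorm(n / 2, 3.0, 0.3), rnorm(n / 2, 3.6, 0.3)),
  rank  = sample(1:4, n, replace = TRUE)
)
toyTest <- toyTrain[sample(n, 30), ]

# naiveBayes() treats numeric predictors as Gaussian, like our model
model <- naiveBayes(as.factor(admit) ~ gre + gpa + rank, data = toyTrain)
predictions <- predict(model, toyTest)
packageAccuracy <- mean(predictions == toyTest$admit)
```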

Below is a plot for the accuracy.

One last thing – It’s been a while since I’ve posted; I’ve been really busy with school. I’ll try to write more in the next few months! 🙂