One thing I love to do to practice machine learning is code an algorithm myself. I mean without external packages, just trying to implement all the math and everything else myself. Naive Bayes is a great algorithm for this, because it’s relatively conceptually easy, and it gets pretty decent results. But before we can talk about Naive Bayes, we have to talk about conditional probability. Bayes Theorem To understand most statistics, we need to understand conditional probability. Bayes theorem states that where \(P(A|B)\) is the probability that A is true, given that B is true. Most analysis these days is Bayesian, as it is a way to inject practical subjectivity into a model. For example, what is the probability someone will have heart problems? Well, if we take into account whether or not the person smokes, drinks, exercises and eats healthy foods, we will get a better estimate. This is just a brief explanation of Bayes Theorem and if this is your first time seeing it, you can learn more here. Naive Bayes Naive Bayes is an algorithm using basic probability in order to classify. Let’s define a few probabilities in the context of an example so it’s more clear. In this example we are classifying students as accepted or rejected to some school (the data is from UCLA!). The prior – The probability of the class. In this case, if A = accepted, then where \(N_A\) = number of accepted students in the data, and N = number of total students in the data. Likelihood – This is a conditional probability. If we assume some student (some vector x) is accepted, then we can calculate the probability of him having a gpa as low or as high as he did, given he was accepted. If we have to calculate this for real values, we use a Normal distribution. If we are doing it for binary values, a Bernoulli distribution, and for categorical—multinoulli. We can use different distributions for different variables, each within our model. Posterior – The probability of the prior and the likelihood, normalized. In math this looks like the prior x the likelihood, divided by the probability of the vector x. When we’re writing a Naive Bayes algorithm however, we don’t care about the normalizing, or the dividing by the probability of vector x. This is because it is constant and has no impact no our classification. So the denominator in both of…

Read More