Linear regression is one of the first things you should try if you’re modeling a linear relationship (actually, non-linear relationships too!). It’s fairly simple, and probably the first thing to learn when tackling machine learning.

At first, linear regression shows up just as a simple equation for a line. In machine learning, the weights are usually represented by a vector θ (in statistics they’re often represented by A and B!).

$$\hat{Y} = \theta_0 + \theta_1 X_1$$

But then we have to account for more than just one input variable. A more general equation for linear regression goes as follows – we multiply each input feature Xi by it’s corresponding weight in the weight vector θ. This is also equivalent to theta transpose times input vector X.

$$\hat{Y} = h_{\theta}(x) = \sum\limits_{i = 1}^{d} \theta_i X_i = \theta^T X$$

There are two main ways to train a linear regression model. You can use the normal equation (in which you set the derivative of the negative log likelihood NLL to 0), or gradient descent.

Sorry for switching notation below. Note – the matrices are \(i \times j\). \(i\) signifies rows, or training examples.

## Gradient Descent

The cost function is essentially the sum of the squared distances. The “distance” is the vertical distance between the predicted y and the observed y. This is known as the residual.

It achieves this by stepping down the cost function to the (hopefully) global minimum.

Here is the cost function –

$$J(\theta) = \frac{1}{2} \sum\limits_{i = 1}^{m} (h_{\theta}(x^{(i)}) \ – \ y^{(i)})^2$$

The cost is the residual sum of squares. The \(\frac{1}{2}\) is just a constant to make the derivative prettier. You could put 1000 as a multiple of the cost function, it doesn’t change the process of minimizing it. Sometimes you’ll see \(m\) (number of training examples) out front in the denominator too. It would be present in the derivative as well then, because it’s a constant. This just makes ‘cost per training example’, which is perfectly valid.

And the gradient descent algorithm –

$$\Large{\theta_{j+1} := \theta_j + \alpha \sum\limits_{i=1}^{m} (y^{(i)} \ – \ h_{\theta}(x^{(i)}))x_{j}^{(i)}}$$

This is really the current weight minus the alpha times the partial derivative of the cost function. Or, in math..

$$\Large{\frac{\partial}{\partial \theta_{j}} J(\theta) = (h_{\theta}(x)\ – \ y)x_j}$$

$$\Large{\theta_{j+1} := \theta_j \ – \ \alpha \frac{\partial}{\partial \theta_{j}} J(\theta)}$$

The original equation switches the position of \(h(x)\) and \(y\), to pull out a negative. This makes the equation prettier.

**Learning rate alpha** is a value that determines how big each individual hop down the cost function is. If alpha is too small, gradient descent can take longer than you want to train. If alpha is too big, it can hop over the minima.

**The entire goal of gradient descent is to learn the optimal parameters theta. **These are also referred to as weights.

Here is a cool resource where Andrew Ng breaks down the derivative of the cost function. And here is a great tutorial on coding gradient descent yourself in Python.

But today, I’m just covering linear regression in R. I believe R is a lot stronger for linear regression, and most things more statistics-like.

## Linear Regression in R

I’m using a sample crime dataset that’s loaded into R. You can download it on my github.

First, let’s look at one feature, and throw in a linear model[1].

1 2 3 4 5 |
a <- ggplot(crime, aes(x = policeFunding, y = reportedCrime)) + geom_smooth(method=lm, colour = "#0BB5FF") + ylab("Crime") + xlab("Police Funding") + ggtitle("Predicting Crime Rate") |

The shaded area is a confidence region.

Note – remember correlation doesn’t equal causation? Well, this is a good example. More police funding per capita doesn’t *cause* a higher crime rate. It’s likely the other way around.

Now let’s look at lm() a little more.

1 2 |
lr <- lm(reportedCrime ~ policeFunding, data = crime) lr # returns the weights (theta). |

1 2 3 |
lr$residuals # residuals plot(lr) # 4 plots to evaluate your model anova(lr) # Analysis of Variance Table |

## Multiple Regression

In this case, we shouldn’t limit ourselves to one input feature. We have several that would all make the model better. This is the line to train a simple multiple regression model. All the same functions from before apply.

1 |
lr2 <- lm(reportedCrime ~ graduatedHS25 + teensHS + inCollege + graduatedCollege25 + policeFunding, data = crime) |

Here are a few of the features visualized[2] –

Or if we want to see the residuals…

## A few more notes about linear regression –

It can map non-linear relationships[3], if you replace the input vector x with a non-linear function Φ(x). This is useful, but you have to figure out whether you want Φ(x) to be x-squared, or some other exponential, quadratic, logarithmic form…I would try another algorithm (perhaps an SVM!).

This last part is a bit random, I just happened to love the math here.

You can represent linear regression in its relationship to a Gaussian distribution. If you replace predicted y (\(\hat{y}\)) with observed y, you have to add an error ϵ.

$$\Large{Y = \theta^T X + \epsilon}$$

However, ϵ is often assumed (probably correctly) to have a normal distribution[4]. So we can rewrite –

$$\Large{\begin{align} & \epsilon \sim N(0, \sigma^2) \\ & Y = \theta^T X +N(0, \sigma^2) \\ & Y \sim N(\theta^T X, \sigma^2) \end{align}}$$

This gives us a different way of thinking about y. It’s now the output of a normal distribution whose mean is changing. I really like this representation!

Sources –

1. Winston Chang’s Cookbook for R.

2. Scatterplot3d documentation.

3. Kevin Murphy’s super awesome textbook.

4. Some lecture notes from Georgia Tech.