### Guide to Linear Regression

##### Written by Alexander Woods, Jul 19, 2015, 6 Comments

Linear regression is one of the first things you should try if you’re modeling a linear relationship (actually, non-linear relationships too!). It’s fairly simple, and probably the first thing to learn when tackling machine learning.

At first, linear regression shows up just as a simple equation for a line. In machine learning, the weights are usually represented by a vector θ (in statistics they’re often represented by A and B!).

But then we have to account for more than just one input variable. A more general equation for linear regression goes as follows – we multiply each input feature Xi by it’s corresponding weight in the weight vector θ. This is also equivalent to theta transpose times input vector X.

There are two main ways to train a linear regression model. You can use the normal equation (in which you set the derivative of the negative log likelihood NLL to 0), or gradient descent.

Sorry for switching notation below. Note – the matrices are *i x j*. *i* signifies rows, or training examples.

## Gradient Descent

The cost function is essentially the sum of the squared distances. The “distance” is the vertical distance between the predicted y and the observed y. This is known as the residual.

It achieves this by stepping down the cost function to the (hopefully) global minimum.

Here is the cost function –

The cost is the residual sum of squares. The (1/2) is just a constant to make the derivative prettier. You could put 1000 as a multiple of the cost function, it doesn’t change the process of minimizing it. Sometimes you’ll see m (number of training examples) out front in the denominator too. It would be present in the derivative as well then, because it’s a constant. This just makes ‘cost per training example’, which is perfectly valid.

And the gradient descent algorithm –

This is really the current weight minus the alpha times the partial derivative of the cost function. Or, in math..

The original equation switches the position of h(x) and y, to pull out a negative. This makes the equation prettier.

**Learning rate alpha** is a value that determines how big each individual hop down the cost function is. If alpha is too small, gradient descent can take longer than you want to train. If alpha is too big, it can hop over the minima.

**The entire goal of gradient descent is to learn the optimal parameters theta. **These are also referred to as weights.

Here is a cool resource where Andrew Ng breaks down the derivative of the cost function. And here is a great tutorial on coding gradient descent yourself in Python.

But today, I’m just covering linear regression in R. I believe R is a lot stronger for linear regression, and most things more statistics-like.

## Linear Regression in R

I’m using a sample crime dataset that’s loaded into R. You can download it on my github.

First, let’s look at one feature, and throw in a linear model[1].

1 2 3 4 5 |
a <- ggplot(crime, aes(x = policeFunding, y = reportedCrime)) + geom_smooth(method=lm, colour = "#0BB5FF") + ylab("Crime") + xlab("Police Funding") + ggtitle("Predicting Crime Rate") |

The shaded area is a confidence region.

Note – remember correlation doesn’t equal causation? Well, this is a good example. More police funding per capita doesn’t *cause* a higher crime rate. It’s likely the other way around.

Now let’s look at lm() a little more.

1 2 |
lr <- lm(reportedCrime ~ policeFunding, data = crime) lr # returns the weights (theta). |

1 2 3 |
lr$residuals # residuals plot(lr) # 4 plots to evaluate your model anova(lr) # Analysis of Variance Table |

## Multiple Regression

In this case, we shouldn’t limit ourselves to one input feature. We have several that would all make the model better. This is the line to train a simple multiple regression model. All the same functions from before apply.

1 |
lr2 <- lm(reportedCrime ~ graduatedHS25 + teensHS + inCollege + graduatedCollege25 + policeFunding, data = crime) |

Here are a few of the features visualized[2] –

Or if we want to see the residuals…

## A few more notes about linear regression –

It can map non-linear relationships[3], if you replace the input vector x with a non-linear function Φ(x). This is useful, but you have to figure out whether you want Φ(x) to be x-squared, or some other exponential, quadratic, logarithmic form…I would try another algorithm (perhaps an SVM!).

This last part is a bit random, I just happened to love the math here.

You can represent linear regression in it’s relationship to a Gaussian distribution. If you replace predicted y (yhat) with observed y, you have to add an error ϵ.

However, ϵ is often assumed (probably correctly) to have a normal distribution[4]. So we can rewrite –

This gives us a different way of thinking about y. It’s now the output of a normal distribution whose mean is changing. I really like this representation!

Sources –

1. Winston Chang’s Cookbook for R.

2. Scatterplot3d documentation.

3. Kevin Murphy’s super awesome textbook.

4. Some lecture notes from Georgia Tech.

Reblogged this on Danilo Gasques.

Good intro to the topic but this misses the details about the assumptions this method makes and the interpretation of the output. Also I’m not sure why would you use gradient descent if there is a closed form solution for linear regression.

Hey Victor, thanks for reading! You’re absolutely right, I should have taken some time to talk about the assumptions of linear regression. I tried to mention them in passing, but I’ll try to be more explicit next time. I should have brought up the pearson correlation coefficient. I tried to write this from a more machine learning point of view, not as much statistics.

As for gradient descent vs normal equation—normal equation is O(n^3), with n being the number of features. So when n is large, gradient descent is much faster.

Thanks for the feedback!

Reblogged this on It's how I want it to be and commented:

Linear regression

Why take the residuals to be only along the y-axis, and not define them as normal to the proposed line equation?

Hey Vlas, thanks for you comments!

I’m going to correct this comment to try and explain better..

The epsilon ~ N(0, sigma^2) isn’t referring to the y-axis when it says 0. epsilon is defined as the actual y – predicted y, and if a prediction is perfect, epsilon = 0, so the values of epsilon are a normal distribution with a mean centered around 0.

We then say Y = (theta^T X) + epsilon, and continue from there. epsilon is defined with respect to the predicted line.