Linear regression is one of the first things you should try if you’re modeling a linear relationship (and, via feature transformations, non-linear relationships too!). It’s fairly simple, and probably the first algorithm to learn when tackling machine learning.

At first, linear regression shows up as just the equation of a line. In machine learning, the weights are usually represented by a vector θ (in statistics they’re often written as a and b!).

$$\hat{Y} = \theta_0 + \theta_1 X_1$$

But then we have to account for more than just one input variable. A more general equation for linear regression multiplies each input feature \(X_i\) by its corresponding weight in the weight vector θ. With the convention \(X_0 = 1\) (so that \(\theta_0\) serves as the intercept), this is also equivalent to theta transpose times the input vector X.

$$\hat{Y} = h_{\theta}(x) = \sum\limits_{i = 0}^{d} \theta_i X_i = \theta^T X$$

There are two main ways to train a linear regression model. You can use the normal equation (in which you set the derivative of the negative log likelihood, NLL, to zero and solve in closed form), or gradient descent. Sorry for switching notation below. Note – the matrices are \(i \times j\), where \(i\) indexes rows (training examples) and \(j\) indexes features.

Gradient Descent

The cost function is essentially the sum of the squared distances. The “distance” is the vertical distance between the predicted y and the observed y. This is known as the residual. Gradient descent minimizes the cost by stepping down the cost function to the (hopefully) global minimum. Here is the cost function –

$$J(\theta) = \frac{1}{2} \sum\limits_{i = 1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2$$

The cost is the residual sum of squares. The \(\frac{1}{2}\) is just a constant to make the derivative prettier. You could put 1000 in front of the cost function and it wouldn’t change the process of minimizing it. Sometimes you’ll see \(m\) (the number of training examples) in the denominator out front too. Since it’s a constant, it would be present in the derivative as well.
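As a concrete sketch of the pieces so far – the hypothesis \(h_\theta(x) = \theta^T x\), the cost \(J(\theta)\), and the closed-form normal-equation solution – here is a minimal NumPy version. The data, noise level, and variable names are made up for illustration:

```python
import numpy as np

# Toy data: m training examples, one raw feature.
# A column of ones is prepended so theta[0] acts as the intercept (X_0 = 1).
rng = np.random.default_rng(0)
m = 100
X_raw = rng.uniform(0, 10, size=(m, 1))
X = np.hstack([np.ones((m, 1)), X_raw])        # shape (m, d + 1)
true_theta = np.array([2.0, 3.0])
y = X @ true_theta + rng.normal(0, 0.5, size=m)

def h(theta, X):
    """Hypothesis: h_theta(x) = theta^T x, evaluated for every row of X."""
    return X @ theta

def cost(theta, X, y):
    """J(theta) = 1/2 * sum of squared residuals."""
    residuals = h(theta, X) - y
    return 0.5 * np.sum(residuals ** 2)

# Normal equation: set the gradient of J to zero and solve X^T X theta = X^T y.
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)
```

Solving the linear system directly (rather than explicitly inverting \(X^T X\)) is the usual choice, since it is faster and more numerically stable.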
Dividing by \(m\) just makes it the ‘cost per training example’, which is perfectly valid.

And the gradient descent update rule –

$$\Large{\theta_{j} := \theta_j + \alpha \sum\limits_{i=1}^{m} (y^{(i)} - h_{\theta}(x^{(i)}))x_{j}^{(i)}}$$

This is really just the current weight minus alpha times the partial derivative of the cost function. Or, in math (written here for a single training example) –

$$\Large{\frac{\partial}{\partial \theta_{j}} J(\theta) = (h_{\theta}(x) - y)x_j}$$

$$\Large{\theta_{j} := \theta_j - \alpha \frac{\partial}{\partial \theta_{j}} J(\theta)}$$

The update rule above switches the positions of \(h(x)\) and \(y\), to pull out a negative. This makes the…
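The batch update rule above can be sketched as a short NumPy loop. The toy data, learning rate, and iteration count here are hand-picked assumptions for illustration, not prescriptions:

```python
import numpy as np

# Toy data: y ~ 2 + 3x plus noise; a column of ones lets theta[0] act as the intercept.
rng = np.random.default_rng(0)
m = 100
X = np.hstack([np.ones((m, 1)), rng.uniform(0, 10, size=(m, 1))])
y = X @ np.array([2.0, 3.0]) + rng.normal(0, 0.5, size=m)

theta = np.zeros(X.shape[1])
alpha = 1e-4                     # learning rate, hand-tuned for this toy problem

for _ in range(10_000):
    # theta_j := theta_j + alpha * sum_i (y_i - h_theta(x_i)) * x_ij,
    # i.e. one step down the gradient of J(theta), for every j at once.
    residuals = y - X @ theta
    theta += alpha * (X.T @ residuals)
```

If \(\alpha\) is too large the updates diverge (for the squared-error cost the threshold is \(2/\lambda_{\max}(X^T X)\)); too small and convergence crawls. In practice you tune \(\alpha\) or normalize the features first.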
