Linear Regression from scratch
Fundamentals
Linear regression is a technique used to find the relationship between variables; in the context of machine learning, that essentially means the relationship between features and labels.
It can be represented as follows:
$$ \hat{y} = \text{b} + w_1 x_1 + w_2 x_2 + \dots + w_n x_n $$

- \(\hat{y}\) is the predicted label (the model's output)
- \(\text{b}\) is the bias
- \(w_i\) is the weight of feature i
- \(x_i\) is the value of feature i
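To make this concrete, here is a minimal Python sketch of that formula; the feature values, weights, and bias below are made up purely for illustration:

```python
# A minimal sketch of the linear regression formula above (plain Python, no libraries).
def predict(features, weights, bias):
    """Return y_hat = b + w_1*x_1 + ... + w_n*x_n for a single example."""
    return bias + sum(w * x for w, x in zip(weights, features))

# Example with two made-up features, weights, and a bias.
print(predict(features=[3.0, 5.0], weights=[2.0, -1.0], bias=0.5))  # 0.5 + 6.0 - 5.0 = 1.5
```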
Loss Functions
Loss is usually the metric we optimize the model for. It essentially tells us how wrong the model is - it measures the distance between the model's predictions (\(\hat{y}\)) and the actual labels (\(y\)). Something worth keeping in mind is that we don't really care about the direction, only the distance between the values, so loss functions usually remove the sign - e.g. 2 - 5 = -3, but the loss is 3.
| Loss Type | Definition | Equation |
|---|---|---|
| L1 Loss | Sum of the absolute differences between predicted and actual values. | \(L = \sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert\) |
| L2 Loss | Sum of the squared differences between predicted and actual values. | \(L = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2\) |
| Mean Absolute Error (MAE) | Average of L1 loss across N examples. | \(MAE = \frac{1}{N} \sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert\) |
| Mean Squared Error (MSE) | Average of L2 loss across N examples. | \(MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2\) |
MSE penalizes outliers more heavily: because the error is squared, a single outlier can pull the model towards it in order to reduce the loss. MAE is more robust, keeping the model closer to the bulk of the data rather than chasing the outlier.
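Here is a small from-scratch sketch of both metrics to make the difference concrete; the sample values (including the made-up outlier of 100) are just for illustration:

```python
# Minimal from-scratch implementations of the loss functions above.
def mae(y_true, y_pred):
    """Mean Absolute Error: average of |y_i - y_hat_i|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean Squared Error: average of (y_i - y_hat_i)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# The outlier (actual 100 vs. predicted 10) inflates MSE far more than MAE.
y_true = [18, 15, 12, 100]
y_pred = [17, 16, 12, 10]
print(mae(y_true, y_pred))  # (1 + 1 + 0 + 90) / 4 = 23.0
print(mse(y_true, y_pred))  # (1 + 1 + 0 + 8100) / 4 = 2025.5
```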
Gradient Descent
The goal of training a model is essentially figuring out the values for \(\text{b}\) and \(w_i\) for every feature i that minimize the chosen loss function. This is where gradient descent comes in: it iteratively finds the weights and bias that produce the model with the lowest loss. It does so with the following process:
- Calculate loss with current weights and bias
- For each weight and bias, determine the direction in which we should move them to reduce loss (opposite of the gradient)
- For each weight and bias, update them by moving a small amount in the direction found
- Return to step 1 and repeat until the model can't reduce the loss any further
1. Start with weight and bias as 0
We can start with the weights and bias set to 0 when training begins.
$$ \hat{y} = 0 + 0 \cdot x $$

2. Calculate MSE loss with the current model parameters
When calculating MSE for the model above, we would end up with something like the following, assuming actual label values of 18, 15 and so on for our training examples.
$$ Loss = \frac{(18 - 0)^2 + (15 - 0)^2 + \dots}{N} $$

This is obviously not going to result in the best model predictions, so our goal now becomes finding which bias and weights take us to the best possible model, thus reducing the loss.
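For a sense of scale, keeping only the first two labels above (18 and 15, so \(N = 2\)), the all-zero model would already have a loss of:

$$ Loss = \frac{(18 - 0)^2 + (15 - 0)^2}{2} = \frac{324 + 225}{2} = 274.5 $$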
3. Calculate the slope of the tangent to the loss function for each weight and bias (i.e. the gradient)
We can only optimize the loss if we understand exactly how to update each value - in this case the weight and the bias - in a way that results in a lower loss. This is where gradients come in.
The gradient is essentially how much y changes when x increases by 1; it tells us in which direction the function is heading - e.g. m is the gradient in \(y = mx + b\).
To get the slope of the line tangent to the loss curve with respect to the weight and the bias, we take the derivative of the loss function with respect to each of them, and then solve the equations.
Again, assume we use MSE as our loss function, \( \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \), and that \(\hat{y}_i = w \cdot x_i + b\), as that is our model.
Weight derivative
The weight derivative is \( \frac{\partial}{\partial w} ( (w \cdot x + b - y)^2 ) \), so we can apply the chain rule.
We let the outer function be \(u^2\), where \(u = w \cdot x + b - y\)
So \( \frac{d}{dw} (u^2) = 2u \cdot \frac{du}{dw} \)
Since \(\frac{du}{dw} = x\)
It becomes \( \frac{d}{dw} ( (w \cdot x + b - y)^2 ) = 2(w \cdot x + b - y) \cdot x \).
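As an optional sanity check, we can ask a computer algebra system to confirm the same result (this sketch assumes sympy is installed):

```python
# Optional: sanity-check the weight derivative with sympy.
import sympy as sp

w, x, b, y = sp.symbols("w x b y")
loss = (w * x + b - y) ** 2  # squared error for a single example

dw = sp.diff(loss, w)
# The difference with the hand-derived result should simplify to zero.
print(sp.simplify(dw - 2 * (w * x + b - y) * x))  # 0
```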
We need to do this for every sample in the training dataset, so it’s actually:
$$ \frac{\partial \text{MSE}}{\partial w} = \frac{1}{M} \sum_{i=1}^{M} 2(w \cdot x^{(i)} + b - y^{(i)}) \cdot x^{(i)} $$

Bias derivative
If we do the same for the bias, we would instead get:
$$ \frac{\partial \text{MSE}}{\partial b} = \frac{1}{M} \sum_{i=1}^{M} 2(w \cdot x^{(i)} + b - y^{(i)}) $$

4. Update the weights
Well, now that we know exactly how to calculate the gradients, we can update both the weight and the bias in a way that reduces the loss.
The gradient points in the direction of increasing loss, so we move in the opposite direction. An easy way to understand this: if the gradient of the loss with respect to a weight is 5, then every time we increase that weight by 1 unit the loss increases by roughly 5, so we want to do the opposite, since our goal is to decrease the loss.
$$ \text{new weight} = \text{old weight} - \eta \cdot \text{weight's gradient} $$

$$ \text{new bias} = \text{old bias} - \eta \cdot \text{bias' gradient} $$

The small value \(\eta\) that scales the gradient is the learning rate, which we will leave out of scope for now.
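Putting steps 3 and 4 together, here is a minimal sketch of one gradient-descent update for the single-feature model; the data points and learning rate are made up for illustration:

```python
# A sketch of steps 3 and 4: compute the MSE gradients for w and b, then take one
# small step against them. The data points and learning rate (eta) are made up.
def gradients(x, y, w, b):
    """Return (dMSE/dw, dMSE/db) for the single-feature model y_hat = w*x + b."""
    m = len(x)
    dw = sum(2 * (w * xi + b - yi) * xi for xi, yi in zip(x, y)) / m
    db = sum(2 * (w * xi + b - yi) for xi, yi in zip(x, y)) / m
    return dw, db

def update(x, y, w, b, eta=0.01):
    """Move w and b a small amount in the opposite direction of their gradients."""
    dw, db = gradients(x, y, w, b)
    return w - eta * dw, b - eta * db

# One update starting from w = 0, b = 0 on a tiny made-up dataset.
x = [3.0, 4.0, 5.0]
y = [18.0, 15.0, 20.0]
w, b = update(x, y, w=0.0, b=0.0)
print(w, b)  # both move away from 0 in the direction that lowers the loss
```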
5. Model Training
After iteratively applying the steps above, model training would look something like this:
| Iteration | Weight | Bias | Loss (MSE) |
|---|---|---|---|
| 1 | 0 | 0 | 303.71 |
| 2 | 1.20 | 0.34 | 170.84 |
| 3 | 2.05 | 0.59 | 103.17 |
| 4 | 2.66 | 0.78 | 68.70 |
| 5 | 3.09 | 0.91 | 51.13 |
| 6 | 3.40 | 1.01 | 42.17 |
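A sketch of the full training loop might look like the following; the dataset and learning rate are made up, so the exact numbers won't match the table above, but the loss shrinks from one iteration to the next in the same way:

```python
# A from-scratch sketch of the whole training loop on made-up data.
def train(x, y, eta=0.01, iterations=6):
    w, b = 0.0, 0.0
    m = len(x)
    for i in range(1, iterations + 1):
        preds = [w * xi + b for xi in x]
        loss = sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / m        # MSE
        dw = sum(2 * (pi - yi) * xi for pi, yi, xi in zip(preds, y, x)) / m
        db = sum(2 * (pi - yi) for pi, yi in zip(preds, y)) / m
        print(f"{i} | w = {w:.2f} | b = {b:.2f} | loss = {loss:.2f}")
        w -= eta * dw   # step against the weight gradient
        b -= eta * db   # step against the bias gradient
    return w, b

train(x=[3.0, 4.0, 5.0, 6.0], y=[18.0, 15.0, 20.0, 24.0])
```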
Recap
| Concept | Description |
|---|---|
| Goal | Find the best weights and bias that minimize the difference between predicted and actual values. |
| Model Formula | \( \hat{y} = w \cdot x + b \) - where w is the weight, x is the feature, and b is the bias. |
| Loss Functions | Measure how wrong predictions are, e.g. MSE (\( \frac{1}{M} \sum (\hat{y} - y)^2 \)). |
| Gradient Descent | Iteratively updates weights using gradients: \( \frac{dL}{dw} = \frac{2}{M} \sum ((\hat{y} - y) \cdot x) \), \( \frac{dL}{db} = \frac{2}{M} \sum (\hat{y} - y) \). Update weights/bias in the opposite direction of the gradients. Repeat until the loss converges (stops decreasing significantly). |