Brandon John Grenier

Linear regression is one of the simplest and most well understood algorithms in machine learning, and the first regression model algorithm that we’ll cover in detail. Regression models are used in supervised learning strategies to predict continuous values; in this chapter we’ll develop a linear regression algorithm to predict the price of real estate in Melbourne, Australia.

As a practical introduction to linear regression, we’ll begin by making predictions *intuitively* – no math or algorithms
will be involved, just a bit of guesswork. We’ll then formalize this intuition by introducing key machine learning concepts, mathematics
and algorithms that underpin linear regression. By the end of this chapter, you will have all of the tools you need to build your own linear
regression machine learning algorithm.

As hinted by the chapter title, we’re going to build a linear regression algorithm that uses a single variable, or
*feature*. Linear regression that uses a single variable is referred to as **univariate linear regression**.
Due to its simplicity, univariate linear regression serves as a fantastic technical introduction to foundational machine learning
concepts, mathematics and algorithms. While machine learning algorithms in upcoming chapters will increase in complexity and sophistication,
they will ultimately build upon the foundations we introduce in this chapter.

The first prediction we’re going to make using linear regression will be an entirely intuitive one – there’s no math or algorithms involved at this point, just a bit of guesswork. As a consequence of this guesswork our predictions will be reasonable, and we’ll improve our predictive ability as we formalize our intuition.

Let’s assume that you would like to buy a 150 square meter home in Melbourne’s inner city suburb of Richmond. You’ve done some homework and collected data from twelve recent sales in the area - each data point consists of a home’s sale price and its size in square metres.

Based on this data, you want to predict how much it might cost to buy a 150 square meter home in Richmond. For
the sake of this exercise, we’ll assume that the price of a home is only dependent on a single *feature* - the
size of the home.

Let’s start with a simple method to figure out how much it might cost to buy a 150m^{2} home in Richmond.

As a first step, we’ll plot all of the sales data onto a graph. Figure 1 below presents
a graph of the sales data, with house sizes (in m^{2}) on the horizontal *x axis* and house
prices on the vertical *y axis*.

Now that our sales data is plotted on a graph, we can simply use our best judgement to draw a straight red line that fits the data reasonably well. Figure 2 below presents our red line overlaying the sales data.

To predict the price of a 150m^{2} home in Richmond, draw a vertical line from the 150m^{2} value located on the x axis
until it intersects with the red line we drew in step 2. From the intersection come across to the y axis to get
the predicted price of a 150m^{2} home. Figure 3 below provides you with a visual demonstration of our prediction using
green dashed lines.

Based on the sales data you collected, we can predict that a 150m^{2} house in Richmond will cost
approximately **$900,000** to purchase.

Let’s quickly summarize the four steps we just took to make this prediction:

- We collected data
- We plotted the data
*We drew a straight line that best fit the data*- We used the line to make a prediction – an educated guess in the absence of data

In step 3, we used our judgement to draw a straight line to fit the data as best as we could. For the remainder of
this chapter, we’ll go on a journey to replace this *judgement* with *data-driven algorithms* which will
ultimately perform the same function, but do so with far more accuracy.

The **hypothesis function** is a term that describes any mathematical function that is used to fit, or model
your data. Problems spaces like population growth and infection rates may best be modelled with *exponential functions*;
predicting yearly fluctuations in global temperature could potentially be modelled with *sinusoidal functions*;
predicting earthquake behavior could be modelled with *logarithmic functions*.

A *linear function* is optimal for modelling linear relationships like our home sales data, and will be the most
appropriate linear regression hypothesis function for our current problem space, which only involves one variable - the
size of a home. The linear regression hypothesis function is defined as:

The symbol Θ is called theta, you’ll see references to theta throughout machine learning.

Now that we’ve defined the hypothesis function for linear regression, we have enough context to define
the *goal* of linear regression:

Given a training set, produce a hypothesis function h so that h(x) is a good predictor for y.

In other words, linear regression aims to produce a hypothesis function whose *predictions* match *observations*
as closely as possible. It does this by identifying the best values for **Θ _{0}** and

We’ve included a review of linear functions and a few visual examples of functions that are commonly used to model data in Appendix A. If it’s been a while since you’ve last used linear functions we recommend that you read Appendix A before you move on to the next section.

The **cost function** measures how well a hypothesis function acts as a *good predictor* for a given
training set, which we’ll refer to this as the *performance* of a hypothesis function going forward. It does this by
computing the average difference between predicted values **h(x)** and observed values **y**, for all values
of **x** in a training set. Let’s derive the cost function from first principles so that you can get an appreciation
for its behavior and mechanics.

**Step 1:** Calculate the difference between the predicted value h(x) and the observed value y, for each
instance **i** in the training set.

The superscript **i** in parentheses refers to a specific training *instance* in the training set.

**Step 2:** Sum up all of the differences for each instance in the training set.

The variable **m** refers to the size of the training set.

**Step 3:** Divide the total difference by the size of the training set to calculate the average difference for the training set.

What we’ve just derived in step 3 is a perfectly usable cost function, but we’re going to make two modifications to simplify the implementation of the function in practice.

First, we are going to square the error component (the difference between the predicted value **h(x)** and observed value **y**).
Squaring always ensures that we have a positive number; for example 4^{2} and -4^{2} are both equal to 16. A
programmatic implementation of the cost function only cares about the magnitude of error - getting rid of negatives removes a bit
of complexity in dealing with both positive and negative numbers.

Second, the entire cost function is multiplied by ½ so that the function remains consistent when derivatives of
this function are taken. Since the derivative of x

This is the formal definition of the cost function, which is also referred to as the squared error function or mean error function:

$$\dfrac {1}{2m} \displaystyle \sum _{i=0}^m \left (h (x^{(i)}) - y^{(i)} \right)^2$$
The cost function is conventionally defined as **J**. Lastly, we substitute the *reference* to our hypothesis function
**h(x _{i})** with the linear regression hypothesis function

We can now express the goal of linear regression in slightly more specific terms - the goal of linear regression is to identify values
for Θ_{0} and Θ_{1} so that **J(Θ _{0}, Θ_{1})** produces the smallest possible value. Put simply, the goal of linear
regression is to

To get a better appreciation of how the cost function works, we’ll run through an exercise to calculate and compare the cost function of two different linear regression hypothesis functions to find out which one would act as the better predictor.

In figure 5 below, I’ve plotted the function **h(x) = 3 + 11x**, the function **h(x) = 1 + 5x** and a
training set consisting of five data points. The data points in our training set are as follows:

x | y |

5 | 30 |

6 | 40 |

7 | 50 |

8 | 60 |

9 | 70 |

Figure 5 suggests that that the function **h(x) = 1 + 5x** is a better fit for the data, and therefore a better
predictor for this training set. We’ll compute the cost for both functions and compare their performance to
validate this intuition.

Table 1 and Table 2 below show the work for calculating the cost function for each hypothesis function.

x | Predicted Value | Observed Value | Difference | Squared Difference |

5 | h(5) = 1 + 5*5 = 26 | 30 | -4 | 16 |

6 | h(6) = 1 + 5*6 = 31 | 40 | -9 | 81 |

7 | h(7) = 1 + 5*7 = 36 | 50 | -14 | 196 |

8 | h(8) = 1 + 5*8 = 41 | 60 | -19 | 361 |

9 | H(9) = 1 + 5*9 = 46 | 70 | -24 | 576 |

Sum of squared differences |
1230 | |||

Cost function J(1, 5) |
123 |

x | Predicted Value | Observed Value | Difference | Squared Difference |

5 | h(5) = 3 + 11*5 = 58 | 30 | -28 | 784 |

6 | h(6) = 3 + 11*6 = 69 | 40 | -29 | 841 |

7 | h(7) = 3 + 11*7 = 80 | 50 | -30 | 900 |

8 | h(8) = 3 + 11*8 = 91 | 60 | -31 | 961 |

9 | h(9) = 3 + 11*9 = 102 | 70 | -32 | 1024 |

Sum of squared differences |
4510 | |||

Cost function J(3, 11) |
451 |

By going through this exercise, we've been able to objectively demonstrate that the hypothesis function **h(x) = 1 + 5x** with a
cost of 123 is in fact a quantifiably better predictor for this training set than the hypothesis function **h(x) = 3 + 11x** with a cost of 451.

As an exercise for the reader, come up with values for Θ_{0} and Θ_{1} that would produce an even better predictor for this training set
then J(1, 5) and prove it by calculating the cost.

The hypothesis function is used to model data, and the cost function is used to measure a model’s performance. In the final step of our journey,
we’ll introduce the gradient descent algorithm – a data-driven, algorithmic method to identify the *optimal model* for a training set.

The **gradient descent algorithm** has broad applicability in machine learning as it offers an efficient means to *minimize functions*. Linear regression uses gradient descent to
minimize the cost function **J(Θ _{0},Θ_{1})** and discovers optimal values Θ

The gradient descent algorithm is very efficient at discovering **local minima** - the smallest value of a function in and around
a *local area* of a function. Discovering the local minima of a function is what ultimately yields us with the optimal values
for Θ_{0} and Θ_{1} - in other words, the local minima of a function represents the *solution* for linear regression.

Figure 6 below will help you get a sense of what local minima are, and how the gradient descent algorithm works to discover them.

Let’s assume that the function in figure 6 represents a side-on view of a mountain range; the green star represents your current location on the mountain. You notice a storm rolling in, and visibility is getting poor – to ride out the storm in safety you quickly begin making your way down to the base of the mountain. Let’s describe an approach to descend the mountain as a series of simple, logical steps:

- You look around, and find a new location that is lower than your current location.
- You take a number of steps in the direction of the new location.
- Once you reach your new location, you repeat steps 1 and 2 until you can no longer find a location that is any lower. To the best of your knowledge, you’ve reached the base of the mountain, represented by the green circle in figure 6.

These three steps provide an intuitive representation of how the gradient descent algorithm works in practice. The green circle in figure 6 represents one of six local minima in the function – it’s the lowest point within a localized area of the function.

Had we started this same exercise from the blue star, we would have discovered an even smaller value, and consequently a more optimal solution than we did starting from the green star.

The conclusion we can draw from this exercise is that the gradient descent algorithm is not well suited for minimizing functions that have more than
one local minima, since the result is highly dependent on your initial starting position. You would never be certain if the algorithm found the
*optimal* values for Θ_{0} and Θ_{1} if the algorithm is run against functions with multiple local minima. Consequently, the gradient descent
algorithm should only be used to minimize functions that have a single, global minima. The linear regression cost function *does* have a single global
minima, so it’s well suited to be minimized by the gradient descent algorithm.

The reasoning behind why linear regression cost functions produce a single global minima goes a bit beyond the scope of linear regression. However, we are going to present a visualization of the cost function so you can appreciate the “terrain” that the gradient descent algorithm will traverse, and provide you with some insight into what the visualization represents with a thought experiment.

In our introduction to the cost function, we went through a practical exercise to compute the cost of our hypothesis function with arbitrary values
for Θ_{0} and Θ_{1}. The results that we computed were as follows:

J(1, 5) = 123

J(3, 11) = 451

Let’s imagine plotting these two results on a graph - since we have three variables to plot (Θ_{0}, Θ_{1} and the cost), we need to plot this data on a 3D
graph. We place Θ_{0} on the x axis, Θ_{1} on the y axis, and the cost on the z axis – what we end up with is a visualization of cost *as a function of*
Θ_{0} and Θ_{1}. At this stage, the graph would consist 2 distinct points sitting in a 3D space.

Now, let’s assume that we have a lot of free time on our hands, and compute the cost function thousands of times, each with slightly different values
of Θ_{0} and Θ_{1} and plot these points on our 3D graph. As you plot this data, a distinct shape will begin to emerge; computing and plotting an
arbitrarily large number of these data points will inevitably reveal the concave structure in figure 7.

This graph presents cost J(Θ_{0}, Θ_{1}) as a function of Θ_{0} and Θ_{1}, and accurately represents the topology that
our gradient descent algorithm will navigate. All linear regression cost functions will end up producing this concave shape, which is why we can use the
gradient descent algorithm to minimize linear regression cost functions.

Areas shown in red represent high costs, and are consequently *bad predictors*. Areas shown in blue represent low costs, and are
consequently *good predictors*. The gradient descent algorithm will discover the local minima of the function, which will ultimately
discover the lowest cost and the best predictor – in other words, the optimal values for Θ_{0} and Θ_{1}.

So far we’ve described the gradient descent algorithm intuitively, as three simple steps to traverse down a mountain. We’ll now formalize this intuition by introducing the gradient descent algorithm:

repeat until convergence, for j = 0 and j = 1: {

$$ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) $$

}

While this definition may initially seem unintuitive, we’ll work to demystify the gradient descent algorithm to show that it’s simply a formal representation of our earlier intuition. Figure 8 below presents the gradient descent algorithm overlayed with the concepts and intuition in our mountain hiking analogy; this should help provide some context around how the algorithm works in practice.

Let’s review some of the new terms introduced in the gradient descent algorithm:

$$ := $$ | The expression := means |

$$ \alpha $$ | This is the term |

$$ \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) $$ | This term represents the |

The algorithm refers to repeating until convergence, for j = 0 and j = 1. What does this statement mean?

Put simply, when you take a step you’re moving some distance along the x axis (Θ_{0}) and some distance along the y axis (Θ_{0}).
The gradient descent algorithm needs a practical means to *capture* these x y coordinates so that the algorithm has a reference to where it currently
is, and where it needs to move.

The statement for *j = 0 and j = 1* is simply a *shorthand notation* used by the gradient descent algorithm to express the fact that
we need to capture and store these two distinct variables. Let’s express the definition of the gradient descent algorithm in a way that explicitly
tracks and records the x and y coordinates:

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) $$
$$ \theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) $$

}

Let’s substitute the cost function term J(Θ_{0}, Θ_{1}) with the actual cost function that we defined in the previous section.
The gradient descent algorithm can now be expressed as:

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} \dfrac {1}{2m} \displaystyle \sum _{i=0}^m \left (\theta_0 + \theta_1x^{(i)} - y^{(i)} \right)^2 $$
$$ \theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} \dfrac {1}{2m} \displaystyle \sum _{i=0}^m \left (\theta_0 + \theta_1x^{(i)} - y^{(i)} \right)^2$$

}

The last step in the development of our algorithm requires us to take partial derivatives of these functions, which requires some knowledge of multivariate calculus. While you are more than welcome to work through the derivations yourself, we’ve worked through the derivations for you.

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0 + \theta_1x^{(i)} - y^{(i)} \right) $$
$$ \theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0 + \theta_1x^{(i)} - y^{(i)} \right)x^{(i)} $$

}

It took a bit of effort to get here, but we’ve successfully developed our first machine learning algorithm! Implementing this algorithm will allow you to predict the price of a home in Richmond far more accurately than our original best guess prediction that we made at the beginning of this chapter.

To close out this chapter, we will discuss convergence in gradient descent and provide some practical guidance on how to effectively manage the gradient descent learning rate. These represent the last two pieces of information you’ll need in order to implement the gradient descent algorithm in software.

When we introduced the gradient descent algorithm we briefly touched on the following term, the derivative of the cost function:

$$ \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) $$
The derivative of a cost function computes the *slope* of the cost function at points θ_0,θ_1. Put simply, the slope gives us information about how
*steep* we are at any particular point in a function.

If you refer back to Figure 7 where we presented a 3D visual of the cost function, you can see that the edges of the function are steeper than the
local minima in the center. Consequently, computing the slope of the cost function at a specific point provides us with an indication of *how
close we are* to finding an optimal solution. The shallower the slope, the closer we are to discovering the local minima.

Figure 9 below presents a cross-section of the 3D cost function that we presented in Figure 7, and shows how the slope of the function successively approaches zero as we approach the local minima.

As we approach the minima, the derivative of the cost function approaches zero; once we reach the global minima, the derivative of the cost
function *equals* zero, and a very useful characteristic of the gradient descent algorithm emerges.

Take our original definition of the gradient descent algorithm (for just one leg; a step in the x direction):

$$ \theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) $$Once we reach the minima, the derivative of the cost function is equal to zero, and we end up with:

$$ \theta_0 := \theta_0 - \alpha0 $$Multiplying the learning rate alpha by zero cancels out this term entirely, and we are left with:

$$ \theta_0 := \theta_0 $$
The key insight here is that when the gradient descent algorithm reaches the minima, the value of θ_0 becomes immutable; the value will not
change with further iterations. **Convergence** is defined as the incremental progression towards this condition.

In practice, you may not necessarily want to wait for the gradient descent algorithm to reach this condition. You’ll need to pragmatically balance performance and predictive efficacy to satisfy your specific requirements and use cases.

As the gradient descent algorithm approaches the minima, the derivative term will incrementally approach zero after every iteration. Consequently, the algorithm will naturally take incrementally smaller steps as it approaches the minima, and incrementally longer to reach the minima.

Figure 10 visualizes the incremental reduction in step size as the algorithm converges towards the global minima:

While the gradient descent algorithm *implicitly* reduces the step size as it converges on the global minima, the learning rate *alpha* allows us to
*explicitly* influence the step size. From an implementation perspective, the learning rate alpha is the only adjustable parameter in the gradient
descent algorithm, and choosing an appropriate value for alpha ensures that the algorithm finds an optimal solution efficiently. If alpha is set
too high, the algorithm may not converge on the most optimal solution; if alpha is set too low, the algorithm may a long time to find the most
optimal solution.

Figures 11 and 12 below provide some insight into the behavior of the gradient descent algorithm when the value of alpha is set too high or too low.

The larger the value of alpha, the larger the steps the gradient descent algorithm will take between each iteration. In iteration #1 and iteration #2, the algorithm takes big steps and gets closer to the minima (the yellow circle). In iteration #3 the algorithm overshoots the minima and iteration #4 begins to diverge from the right solution.

When alpha is set at a high value we risk the possibility of never converging on the minima.

The smaller the value of alpha, the smaller the steps the gradient descent algorithm will take between each iteration. If the learning rate is set conservatively, the gradient descent algorithm will take a larger number of smaller steps, which increases the time it will take to find an optimal solution.

The concepts, algorithms and methods that we’ve introduced in this chapter will lay the groundwork for much more powerful and sophisticated machine learning algorithms in upcoming chapters. With only a few incremental updates and enhancements, we will go from predictions that we can make intuitively to predictions so sophisticated they would be impossible to reason about without the tools and techniques we’ve learned in this chapter.

Here are the three key takeaways that you should be confident with:

1. The **Hypothesis Function** is used to model our training data. The hypothesis function can take many forms (linear functions, exponential functions,
logarithmic functions) and it’s the responsibility of the AI/ML designer to select a hypothesis function that most closely models their problem space.
In this chapter, we introduced the *linear function* to model our housing data.

2. Once we’ve selected an appropriate Hypothesis Function, the **Cost Function** allows us to quantify how well our model is performing
with respect to specific values for the parameters Θ_{0} and Θ_{1}. This is the key performance measurement that enables the linear
regression algorithm to discover the *optimal values* for Θ_{0} and Θ_{1}.

3. The **Gradient Descent Algorithm** leverages the Cost Function performance measurements to seek and discover the optimal
values for Θ_{0} and Θ_{1}, and consequently the best *predictor* for our data.

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0 + \theta_1x^{(i)} - y^{(i)} \right) $$
$$ \theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0 + \theta_1x^{(i)} - y^{(i)} \right)x^{(i)} $$

}

Once the gradient descent algorithm has converged on an optimal solution, substitute the parameters Θ_{0} and Θ_{1}in the hypothesis function with the
values Θ_{0} and Θ_{1} discovered by the gradient descent algorithm; you will have an optimized model ready to make the best predictions possible.