Brandon John Grenier

Linear regression is one of the simplest and most well understood algorithms in machine learning, and the first regression model algorithm that we’ll cover in detail. Regression models are used in supervised learning strategies to predict continuous values; in this chapter we’ll develop a linear regression algorithm to predict the price of real estate in Melbourne, Australia.

As a practical introduction to linear regression, we’ll begin by making predictions *intuitively* – no math or algorithms
will be involved, just a bit of guesswork. We’ll then formalize this intuition by introducing key machine learning concepts, mathematics
and algorithms that underpin linear regression. By the end of this chapter, you will have all of the tools you need to build your own linear
regression machine learning algorithm.

As hinted by the chapter title, we’re going to build a linear regression algorithm that uses a single variable, or
*feature*. Linear regression that uses a single variable is referred to as **univariate linear regression**.
Due to its simplicity, univariate linear regression serves as a fantastic technical introduction to foundational machine learning
concepts, models and mathematics. Although the machine learning algorithms in upcoming chapters will increase in
complexity and sophistication, they effectively iterate on the foundations we introduce in this chapter.

The first prediction we’re going to make using linear regression will be an intuitive one. As a consequence of the guesswork involved, our predictions will be *reasonable*, but not perfect - we’ll improve on our predictive ability as we formalize our intuition.

Let’s assume that you would like to buy a 150 square meter home in Melbourne’s inner city suburb of Richmond. You’ve done some homework and collected data from twelve recent sales in the area - each data point consists of a home’s sale price and its size in square meters.

Based on this data, you want to predict how much it might cost to buy a 150 square meter home in Richmond. For
the sake of this exercise, we’ll assume that the price of a home is only dependent on a single *feature* - the
size of the home.

Let’s start with a simple method to figure out how much it might cost to buy a 150m^{2} home in Richmond.

As a first step, we’ll plot all of the sales data onto a graph. Figure 1 below presents
a graph of the sales data, with house sizes (in m^{2}) on the horizontal *x axis* and house
prices on the vertical *y axis*.

Now that our sales data is plotted on a graph, we can simply use our best judgement to draw a straight red line that fits the data reasonably well. Figure 2 below presents our red line overlaying the sales data.

To predict the price of a 150m^{2} home in Richmond, draw a vertical line from the 150m^{2} value located on the x axis until it intersects with the red line we drew in step 2. From the intersection, move across to the y axis to read off the predicted price of a 150m^{2} home. Figure 3 below provides a visual demonstration of our prediction using green dashed lines.

Based on the sales data you collected, we can predict that a 150m^{2} house in Richmond will cost
approximately **$900,000** to purchase.

Congratulations, you made your first data-driven prediction! Let’s quickly summarize the four steps we just took to make this prediction:

- We collected data
- We plotted the data
- *We drew a straight line that best fits the data*
- We used the line to make a prediction – an educated guess in the absence of data

Notice that step 3 is italicized; this was the step where we used our judgement to draw a straight line to
fit the data as best as we could. For the remainder of this chapter, we’ll go on a journey to replace our
*judgement* with *data-driven algorithms* that will ultimately make far more accurate predictions
for us.
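The four steps above can be sketched in a few lines of Python. The intercept and slope below are hypothetical guesses standing in for our hand-drawn red line - they are not derived from real Richmond sales data:

```python
# Our hand-drawn "best guess" straight line, expressed as a function:
# price = intercept + slope * size. Both values are illustrative guesses,
# chosen so the line roughly fits the plotted sales data.
def predict_price(size_m2, intercept=150_000, slope=5_000):
    """Predict a sale price from a home's size using a guessed straight line."""
    return intercept + slope * size_m2

print(predict_price(150))  # 900000 - our ~$900,000 prediction
```

Reading a value off the red line in Figure 3 and evaluating this function are the same operation; the code just makes our guessed line explicit.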

The first step in formalizing our intuition begins with replacing the notion of our “best guess” straight line with
a *mathematical definition* of what that line represents – it’s time to introduce you to the hypothesis function.

The **hypothesis function** is a term that describes any mathematical function that is used to fit, or model
your data - much in the same way that the term *car* can be used to describe any model of car like a Ford,
Lexus or Cadillac.

Problem spaces like population growth and infection rates may best be modelled with *exponential functions*;
predicting yearly fluctuations in global temperature could potentially be modelled with *sinusoidal functions*;
predicting earthquake behavior could be modelled with *logarithmic functions*. You might not have known it at
the time, but we used a *linear function* to predict the price of a home in Richmond. The
specific mathematical function we use to model our data is referred to as the hypothesis function in our
machine learning algorithms.

As an ML designer, it will be your responsibility to select the most appropriate hypothesis function to effectively model your data. Consequently, this will be one of the first activities you’ll undertake when building machine learning solutions. In later chapters we’ll discuss how we can build ML algorithms to select the most appropriate hypothesis functions for our data automatically, but for now we’ll stick with the basics.

If it’s been a while since you’ve last studied mathematical functions, figure 4 below provides a few visual examples of functions that can be used as hypothesis functions to effectively fit your data and model your problem space.

The linear regression algorithm uses a *linear function* as its hypothesis function, which is defined as:

**h(x) = Θ_{0} + Θ_{1}x**

The hypothesis function **h(x)** represents the formal, mathematical definition of our straight red
line. The symbol Θ is called theta; you’ll see references to theta throughout machine learning.
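In code, the hypothesis function is a one-liner. A minimal Python sketch, with theta_0 and theta_1 passed in as plain arguments:

```python
def h(x, theta_0, theta_1):
    """Linear regression hypothesis function: h(x) = theta_0 + theta_1 * x."""
    return theta_0 + theta_1 * x

print(h(2, 1, 5))  # 1 + 5 * 2 = 11
```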

Now that we’ve defined the hypothesis function that we’ll use in our linear regression algorithm, let’s talk about the goal of linear regression:

Given a training set, produce a hypothesis function h so that h(x) is a good predictor for y.

More specifically, the goal of linear regression is to discover the most appropriate values for both **Θ_{0}** and **Θ_{1}** - the values that produce the line that best fits our training data.

With all that being said, what do we mean by *good predictor* exactly? While the term sounds a bit ambiguous, we’ll
produce a concrete definition of what a good predictor is, and we’ll introduce the *cost function* as a method to quantify
good predictors.

We’ve included a review of linear functions in Appendix A. If it’s been a while since you’ve last used linear functions we recommend that you read Appendix A before you move on to the next section.

A good predictor is one that most effectively minimizes the difference between the predicted values h(x) and the observed values y for all values of x in a training set. A cost function quantifies this difference to estimate how models are performing.

The cost function is an incredibly important tool in machine learning, as it acts as the feedback mechanism for machine learning algorithms. The ability to receive and adapt to feedback is one of the foundational characteristics of machine learning algorithms.

The following cost function pseudo-algorithm is used to quantify the performance of a model (i.e. a hypothesis function):

- For each instance in the training set, calculate the difference between the predicted value h(x) and the observed value y.
- Sum up all of the differences.
- Divide the total difference by the size of the training set.

The cost function outputs the average difference between all predictions and observations in a training set. The closer this value is to zero, the better a model is at making predictions. If the value is equal to zero, it means that a model perfectly predicts observations.

Now that you have an understanding of how a cost function works, we’ll derive the mathematical definition of the cost function step-by-step:

Step 1: For each instance in the training set, calculate the difference between the predicted value h(x) and the observed value y.

$$h(x_{i}) - y_{i}$$

Where the subscript i refers to a specific training instance in the training set. In this step we’re simply subtracting the observed value y from the predicted value h(x) for a specific training instance i.

Step 2: Sum up all the differences.

$$\displaystyle \sum _{i=1}^m \left (h (x_{i}) - y_{i} \right)$$

Where m refers to the size of the training set. Please note that the summation over the training set is 1-indexed (i=1). You may see training sets 0-indexed (i=0) in some literature.

Step 3: Divide the total difference by the size of the training set:

$$\dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (h (x_{i}) - y_{i} \right)$$

What we’ve just derived in step 3 is a perfectly usable cost function. Although we could use the function as-is, we are going to make two small modifications to the function so that it’s a bit simpler to implement in practice. The next section describes the modifications and the benefits of making them.
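The cost function derived in step 3 translates directly into Python. This sketch averages the signed differences between predictions and observations; note that positive and negative differences can cancel each other out, which is one of the reasons the next section modifies the function:

```python
def cost(training_set, theta_0, theta_1):
    """Average difference between predicted and observed values.

    training_set is a list of (x, y) pairs.
    """
    m = len(training_set)
    total = sum((theta_0 + theta_1 * x) - y for x, y in training_set)
    return total / m

print(cost([(1, 2), (2, 4)], 0, 2))  # 0.0 - a perfect fit on this toy data
```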

The Cost Function

This is the formal definition of the cost function, which may also be referred to as the squared error function or mean squared error function:

$$\dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h (x_{i}) - y_{i} \right)^2$$

It’s important to note that the formal definition shown above and the cost function derived in step 3 are minimized by the same values of Θ_{0} and Θ_{1} - using either function will lead you to the same model. The formal definition simply happens to be easier to work with in software, and we like simple. It differs from our step 3 derivation in two ways.

First, the error component (the difference between the predicted value and observed value) has been squared. Squaring ensures that every term is positive; for example, 4^{2} and (-4)^{2} both equal 16. A programmatic implementation of the cost function only cares about the magnitude of error - getting rid of negatives removes a bit of complexity in dealing with both positive and negative numbers.

Second, the cost function is multiplied by ½ - this is a practical change that simplifies later calculus. Because the derivative of x^{2} is 2x, taking the derivative of the squared error term introduces a factor of 2; multiplying by ½ cancels that factor out.

Lastly, we substitute the reference to our hypothesis function h(x_{i}) with the linear regression hypothesis function that we introduced previously. This is the fully expanded cost function that we will use in the linear regression algorithm:

$$J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (\theta_0 + \theta_1x_{i} - y_{i} \right)^2$$

The cost function is conventionally denoted J, so J(Θ_{0}, Θ_{1}) is a function that will produce a performance measure for specific values of Θ_{0} and Θ_{1}.

The linear regression algorithm will choose values for Θ0 and Θ1 so that J(Θ0, Θ1) produces the smallest possible output. By way of convention, we want our machine learning algorithm to minimize J(Θ0, Θ1).

Before we move on to minimizing J(Θ0, Θ1), let’s work through a concrete example with the cost function that we’ve just derived.

To get a better appreciation of the cost function, we’ll run through an exercise comparing the performance of two different linear regression hypothesis functions. In figure 5 below, I’ve plotted the function h(x) = 3 + 11x, the function h(x) = 1 + 5x and a training set consisting of five data points. The data points for our training set are as follows:

x | y |
5 | 30 |
6 | 40 |
7 | 50 |
8 | 60 |
9 | 70 |

We can assert that the function h(x) = 1 + 5x is a better fit to the data than the function h(x) = 3 + 11x by visual inspection alone, but we’ll generate the cost for each of these hypothesis functions to get an objective measure.

Table 1 and Table 2 below show the work for calculating the cost functions. The final line item in each table (Cost) is calculated by taking the total squared difference and multiplying it by 1/2m - in this case 1/10, since there are 5 instances in our training set.

Table 1: h(x) = 1 + 5x

x | Predicted Value | Observed Value | Difference | Squared Difference |
5 | 26 | 30 | -4 | 16 |
6 | 31 | 40 | -9 | 81 |
7 | 36 | 50 | -14 | 196 |
8 | 41 | 60 | -19 | 361 |
9 | 46 | 70 | -24 | 576 |
Total | | | | 1230 |
Cost | | | | 123.0 |

Table 2: h(x) = 3 + 11x

x | Predicted Value | Observed Value | Difference | Squared Difference |
5 | 58 | 30 | 28 | 784 |
6 | 69 | 40 | 29 | 841 |
7 | 80 | 50 | 30 | 900 |
8 | 91 | 60 | 31 | 961 |
9 | 102 | 70 | 32 | 1024 |
Total | | | | 4510 |
Cost | | | | 451.0 |

By going through this exercise, we have been able to show that J(1, 5) has a cost of 123.0 and that J(3, 11) has a cost of 451.0. We can now objectively assert that h(x) = 1 + 5x is a better predictor for our observed data than h(x) = 3 + 11x.
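We can check both costs with a short Python script that implements J(Θ0, Θ1) over the training set above:

```python
# The five (x, y) training instances from the worked example.
training_set = [(5, 30), (6, 40), (7, 50), (8, 60), (9, 70)]

def J(theta_0, theta_1):
    """Squared error cost: (1 / 2m) * sum of squared prediction errors."""
    m = len(training_set)
    return sum(((theta_0 + theta_1 * x) - y) ** 2
               for x, y in training_set) / (2 * m)

print(J(1, 5))   # 123.0
print(J(3, 11))  # 451.0
```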

Given that a perfect predictor has a cost of 0, the data also suggests that there are better values for Θ0 and Θ1. As an exercise for the reader, find values for Θ0 and Θ1 that make a better predictor than (1, 5), and run through the exercise above to quantify how much better your function performs. Going through the exercise should demonstrate that even though the cost function might look a bit intimidating, it’s quite easy to work with.

We now have the means to define our model and get feedback on its performance. The final step is to discover the best performing model to act as the predictor for our data. In the next section we introduce the gradient descent algorithm, which will provide us with the means to minimize J(Θ0, Θ1) efficiently.

The gradient descent algorithm is a general purpose algorithm with broad applicability in machine learning, as it provides an efficient means to minimize functions. The cost function has given us the ability to quantify the effectiveness of a model; with that feedback, the gradient descent algorithm will find the minimal cost for our model – put simply, gradient descent will find the values of Θ0 and Θ1 that act as the best predictor for our model.

By the end of this section you should have a good understanding of how the gradient descent algorithm works, and appreciate how the algorithm solves minimization problems.

Before we define the gradient descent algorithm, let’s first get an intuitive understanding of how the gradient descent algorithm works and appreciate its limitations. The gradient descent algorithm is effective at finding local minima - the smallest values of a function, in and around a local area of a function.

Figure 6 below presents a function with a couple of examples of the gradient descent algorithm in action:

Imagine this function is a mountain, and you’re standing where the green star is. A storm has rolled in, and visibility is poor – you can only see a few steps in front of you. You need to head to the base of the mountain as soon as you can to ride out the storm in safety.

How will you get down? The approach described below acts as a good analogy for the behavior of the gradient descent algorithm.

- You look around, and find a new location that is lower than your current location.
- You take a few steps in the direction of the new location and stop.
- From the new location, you repeat steps 1 and 2 until you can no longer find a location that is lower; in this case you’ve reached (as far as you know) the base of the mountain – the green circle in the figure.

One thing you may have immediately noticed going through this exercise is that we did not reach the base of the mountain when we started from the green star; this is an example of a local minimum. Had we started the same exercise from the blue star, we would have reached the base of the mountain.

The conclusion we can draw from this exercise is that the gradient descent algorithm is not well suited to minimizing functions that have more than one local minimum, since the result is highly dependent on your initial starting position – you would never know whether you had truly minimized the function. The gradient descent algorithm should only be used to minimize functions that have a single, global minimum.

The good news is that the linear regression cost function does have a single global minimum, so it’s well suited to being minimized by the gradient descent algorithm. We aren’t going to explain why linear regression cost functions have a single global minimum, but we are going to present a 3D visualization of the cost function as a function of Θ0 and Θ1 so you can appreciate the actual terrain our gradient descent algorithm will traverse.

This graph presents cost J(Θ0, Θ1) as a function of both Θ0 and Θ1, and represents the actual topology that our gradient descent algorithm will navigate - it’s pretty cool. In fact, all linear regression cost functions will end up with this convex shape, which is why we can use the gradient descent algorithm to minimize linear regression cost functions.

High areas shown in red represent high costs, and are consequently bad predictors. Low areas shown in blue represent low costs, and are consequently good predictors. The gradient descent algorithm will find the lowest point on this graph, which also means that it will find the best predictor.

While the pseudo-algorithm for walking down a mountain was quite straightforward, transcribing this concept algorithmically is a little less straightforward; we’ll break down the gradient descent algorithm and show that it’s simply a formal representation of our intuition. The definition of the gradient descent algorithm is as follows:

Repeat until convergence, for j = 0 and j = 1:

$$\theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$

There are a few new terms in the gradient descent algorithm to get your head around. Before we define these new terms, Figure 8 presents the gradient descent algorithm with an overlay of the concepts and intuition in our mountain analogy; this should help provide a bit of context around how the algorithm works in practice.
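To make the update rule concrete, here is a minimal Python sketch of gradient descent applied to the linear regression cost function. The partial derivatives below follow from differentiating J(Θ0, Θ1) with respect to each parameter; the learning rate and iteration count are illustrative choices, not prescribed values:

```python
def gradient_descent(training_set, alpha=0.03, iterations=20_000):
    """Minimize J(theta_0, theta_1) for univariate linear regression.

    training_set is a list of (x, y) pairs; alpha is the learning rate.
    """
    m = len(training_set)
    theta_0, theta_1 = 0.0, 0.0
    for _ in range(iterations):
        # Partial derivatives of J with respect to theta_0 and theta_1:
        # (1/m) * sum of errors, and (1/m) * sum of errors weighted by x.
        errors = [(theta_0 + theta_1 * x) - y for x, y in training_set]
        grad_0 = sum(errors) / m
        grad_1 = sum(e * x for e, (x, _) in zip(errors, training_set)) / m
        # Simultaneous update of both parameters.
        theta_0 -= alpha * grad_0
        theta_1 -= alpha * grad_1
    return theta_0, theta_1

# The toy training set from the cost function exercise. Its points lie
# exactly on a straight line, so the cost can be driven to zero.
t0, t1 = gradient_descent([(5, 30), (6, 40), (7, 50), (8, 60), (9, 70)])
print(t0, t1)
```

Running this recovers the line that fits the exercise data exactly - a much better predictor than either h(x) = 1 + 5x or h(x) = 3 + 11x.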