AN INTRODUCTION TO ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING ALGORITHMS

Brandon John Grenier

Chapter 2: Linear Regression with One Feature

Linear regression is one of the simplest and most well understood algorithms in machine learning, and the first regression model algorithm that we’ll cover in detail. Regression models are used in supervised learning strategies to predict continuous values; in this chapter we’ll develop a linear regression algorithm to predict the price of real estate in Melbourne, Australia.

As a practical introduction to linear regression, we’ll begin by making predictions intuitively – no math or algorithms will be involved, just a bit of guesswork. We’ll then formalize this intuition by introducing key machine learning concepts, mathematics and algorithms that underpin linear regression. By the end of this chapter, you will have all of the tools you need to build your own linear regression machine learning algorithm.

As hinted by the chapter title, we’re going to build a linear regression algorithm that uses a single variable, or feature. Linear regression that uses a single variable is referred to as univariate linear regression. Due to its simplicity, univariate linear regression serves as a fantastic technical introduction to foundational machine learning concepts, mathematics and algorithms. While machine learning algorithms in upcoming chapters will increase in complexity and sophistication, they will ultimately build upon the foundations we introduce in this chapter.

Intuitive Linear Regression

The first prediction we’re going to make using linear regression will be an entirely intuitive one – there’s no math or algorithms involved at this point, just a bit of guesswork. Because of this guesswork our predictions will be only reasonably accurate, and we’ll improve our predictive ability as we formalize our intuition.

Predicting the price of a home in Melbourne, Australia

Let’s assume that you would like to buy a 150 square meter home in Melbourne’s inner city suburb of Richmond. You’ve done some homework and collected data from twelve recent sales in the area - each data point consists of a home’s sale price and its size in square metres.

Based on this data, you want to predict how much it might cost to buy a 150 square meter home in Richmond. For the sake of this exercise, we’ll assume that the price of a home is only dependent on a single feature - the size of the home.

Let’s start with a simple method to figure out how much it might cost to buy a 150m2 home in Richmond.

Step 1: Plot the data

As a first step, we’ll plot all of the sales data onto a graph. Figure 1 below presents a graph of the sales data, with house sizes (in m2) on the horizontal x axis and house prices on the vertical y axis.

Figure 1: House price as a function of house size

Step 2: Fit the data

Now that our sales data is plotted on a graph, we can simply use our best judgement to draw a straight red line that fits the data reasonably well. Figure 2 below presents our red line overlaying the sales data.

Figure 2: Using judgement to draw a straight line that fits our housing data

Step 3: Make a prediction

To predict the price of a 150m2 home in Richmond, draw a vertical line from the 150m2 value located on the x axis until it intersects with the red line we drew in step 2. From that intersection, move horizontally across to the y axis to read off the predicted price of a 150m2 home. Figure 3 below provides you with a visual demonstration of our prediction using green dashed lines.

Figure 3: Predicting the price of a 150m2 home in Richmond

Based on the sales data you collected, we can predict that a 150m2 house in Richmond will cost approximately $900,000 to purchase.

Let’s quickly summarize the four steps we just took to make this prediction:

  1. We collected data
  2. We plotted the data
  3. We drew a straight line that best fit the data
  4. We used the line to make a prediction – an educated guess for a home size we have no sales data for

In step 3, we used our judgement to draw a straight line to fit the data as best as we could. For the remainder of this chapter, we’ll go on a journey to replace this judgement with data-driven algorithms which will ultimately perform the same function, but do so with far more accuracy.

The Hypothesis Function

The hypothesis function is the term for any mathematical function that is used to fit, or model, your data. Problem spaces like population growth and infection rates may best be modelled with exponential functions; predicting yearly fluctuations in global temperature could potentially be modelled with sinusoidal functions; predicting earthquake behavior could be modelled with logarithmic functions.

The Hypothesis Function for Linear Regression

A linear function is the natural choice for modelling linear relationships like our home sales data, and it is the hypothesis function we’ll use for our current problem space, which involves only one variable - the size of a home. The linear regression hypothesis function is defined as:

$$ h(x) = \theta_0 + \theta_1x $$

The symbol Θ is called theta; you’ll see references to theta throughout machine learning. In this function, Θ0 is the line’s y-intercept and Θ1 is its slope.
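
If you’d like to see this hypothesis function as code, here is a minimal Python sketch (the function and parameter names are our own choices, not part of the formal definition):

```python
def h(x, theta0, theta1):
    """Linear regression hypothesis: predict y for a given input x."""
    return theta0 + theta1 * x

# With theta0 = 1 and theta1 = 5, an input of x = 7 predicts 1 + 5 * 7 = 36.
print(h(7, theta0=1, theta1=5))  # 36
```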

The Goal of Linear Regression

Now that we’ve defined the hypothesis function for linear regression, we have enough context to define the goal of linear regression:

Given a training set, produce a hypothesis function h so that h(x) is a good predictor for y.

In other words, linear regression aims to produce a hypothesis function whose predictions match observations as closely as possible. It does this by identifying the best values for Θ0 and Θ1 so that the hypothesis function minimizes the difference between predictions and observations. We can concretely define a good predictor as a hypothesis function that most effectively minimizes the difference between predicted values h(x) and observed values y, for all values of x in a training set. The smaller the difference, the better the predictor; the larger the difference, the poorer the predictor.

Reader Tip: Linear Function Review

We’ve included a review of linear functions and a few visual examples of functions that are commonly used to model data in Appendix A. If it’s been a while since you’ve last used linear functions we recommend that you read Appendix A before you move on to the next section.

The Cost Function

The cost function measures how well a hypothesis function predicts a given training set, which we’ll refer to as the performance of a hypothesis function going forward. It does this by computing the average difference between predicted values h(x) and observed values y, for all values of x in a training set. Let’s derive the cost function from first principles so that you can get an appreciation for its behavior and mechanics.

Deriving The Cost Function


Step 1: Calculate the difference between the predicted value h(x) and the observed value y, for each instance i in the training set.

$$h(x^{(i)}) - y^{(i)}$$

The superscript i in parentheses refers to a specific training instance in the training set.

Step 2: Sum up all of the differences for each instance in the training set.

$$\displaystyle \sum _{i=1}^m \left (h (x^{(i)}) - y^{(i)} \right)$$

The variable m refers to the size of the training set.

Step 3: Divide the total difference by the size of the training set to calculate the average difference for the training set.

$$\dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (h (x^{(i)}) - y^{(i)} \right)$$

What we’ve just derived in step 3 is a perfectly usable cost function, but we’re going to make two modifications to simplify the implementation of the function in practice.

First, we are going to square the error component (the difference between the predicted value h(x) and observed value y). Squaring ensures that each error term is non-negative; for example, 4² and (-4)² are both equal to 16. A programmatic implementation of the cost function only cares about the magnitude of error - getting rid of negatives removes a bit of complexity in dealing with both positive and negative numbers.

Second, the entire cost function is multiplied by ½ so that the function remains clean when derivatives of this function are taken. Since the derivative of x² is 2x, multiplying by ½ cancels the factor of 2 introduced by the exponent.
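
To see why the ½ is convenient, differentiate the squared error for a single training example with respect to Θ1 using the chain rule; the 2 produced by the exponent cancels against the ½:

$$ \frac{\partial}{\partial \theta_1} \; \frac{1}{2}\left(\theta_0 + \theta_1x - y\right)^2 = \frac{1}{2} \cdot 2\left(\theta_0 + \theta_1x - y\right) \cdot x = \left(\theta_0 + \theta_1x - y\right)x $$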

This is the formal definition of the cost function, which is also referred to as the squared error function or mean squared error function:

$$\dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h (x^{(i)}) - y^{(i)} \right)^2$$

The cost function is conventionally denoted J. Lastly, we substitute the generic hypothesis function h(x) with the linear regression hypothesis function Θ0 + Θ1x, applied to each training instance. This is the fully expanded cost function that we will use in the linear regression algorithm:

$$J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (\theta_0 + \theta_1x^{(i)} - y^{(i)} \right)^2$$

We can now express the goal of linear regression in slightly more specific terms - the goal of linear regression is to identify values for Θ0 and Θ1 so that J(Θ0, Θ1) produces the smallest possible value. Put simply, the goal of linear regression is to minimize J(Θ0, Θ1).
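
As a minimal Python sketch of the cost function (representing the training set as two plain lists, xs and ys, is our own choice, not something prescribed by the math):

```python
def cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1) over a training set of m examples."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        total += (theta0 + theta1 * x - y) ** 2  # squared difference for one example
    return total / (2 * m)  # average over m examples, with the extra factor of 1/2
```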

The Cost Function in Practice

To get a better appreciation of how the cost function works, we’ll run through an exercise to calculate and compare the cost function of two different linear regression hypothesis functions to find out which one would act as the better predictor.

In figure 5 below, I’ve plotted the function h(x) = 3 + 11x, the function h(x) = 1 + 5x and a training set consisting of five data points. The data points in our training set are as follows:

x    y
5    30
6    40
7    50
8    60
9    70

Figure 5: Comparing the predictive performance of two linear regression hypothesis functions

Figure 5 suggests that the function h(x) = 1 + 5x is a better fit for the data, and therefore a better predictor for this training set. We’ll compute the cost for both functions and compare their performance to validate this intuition.

Table 1 and Table 2 below show the work for calculating the cost function for each hypothesis function.

Table 1: Calculating the cost function for h(x) = 1 + 5x
x Predicted Value Observed Value Difference Squared Difference
5 h(5) = 1 + 5*5 = 26 30 -4 16
6 h(6) = 1 + 5*6 = 31 40 -9 81
7 h(7) = 1 + 5*7 = 36 50 -14 196
8 h(8) = 1 + 5*8 = 41 60 -19 361
9 h(9) = 1 + 5*9 = 46 70 -24 576
Sum of squared differences 1230
Cost function J(1, 5) 123

Table 2: Calculating the cost function for h(x) = 3 + 11x
x Predicted Value Observed Value Difference Squared Difference
5 h(5) = 3 + 11*5 = 58 30 28 784
6 h(6) = 3 + 11*6 = 69 40 29 841
7 h(7) = 3 + 11*7 = 80 50 30 900
8 h(8) = 3 + 11*8 = 91 60 31 961
9 h(9) = 3 + 11*9 = 102 70 32 1024
Sum of squared differences 4510
Cost function J(3, 11) 451

By going through this exercise, we've been able to objectively demonstrate that the hypothesis function h(x) = 1 + 5x with a cost of 123 is in fact a quantifiably better predictor for this training set than the hypothesis function h(x) = 3 + 11x with a cost of 451.
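
If you’d like to verify these tables programmatically, the short sketch below recomputes both costs from the five data points; you can also reuse it to check your answer to the exercise that follows.

```python
def cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [5, 6, 7, 8, 9]
ys = [30, 40, 50, 60, 70]

print(cost(1, 5, xs, ys))   # 123.0, matching Table 1
print(cost(3, 11, xs, ys))  # 451.0, matching Table 2
```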

As an exercise for the reader, come up with values for Θ0 and Θ1 that would produce an even better predictor for this training set than h(x) = 1 + 5x, and prove it by calculating the cost.

The hypothesis function is used to model data, and the cost function is used to measure a model’s performance. In the final step of our journey, we’ll introduce the gradient descent algorithm – a data-driven, algorithmic method to identify the optimal model for a training set.

The Gradient Descent Algorithm

The gradient descent algorithm has broad applicability in machine learning as it offers an efficient means to minimize functions. Linear regression uses gradient descent to minimize the cost function J(Θ0, Θ1) and discover the optimal values for Θ0 and Θ1. Before we define the gradient descent algorithm, let’s gain an intuitive understanding of how the algorithm works and an appreciation of its limitations.

The Gradient Descent Hiking Analogy

The gradient descent algorithm is very efficient at discovering a local minimum - the smallest value of a function within a local region of that function. Discovering a minimum of the cost function is what ultimately yields the optimal values for Θ0 and Θ1 - in other words, that minimum represents the solution for linear regression.

Figure 6 below will help you get a sense of what local minima are, and how the gradient descent algorithm works to discover them.

Figure 6: The gradient descent algorithm - discovering minima by intuition

Let’s assume that the function in figure 6 represents a side-on view of a mountain range; the green star represents your current location on the mountain. You notice a storm rolling in, and visibility is getting poor – to ride out the storm in safety you quickly begin making your way down to the base of the mountain. Let’s describe an approach to descend the mountain as a series of simple, logical steps:

  1. You look around, and find a new location that is lower than your current location.
  2. You take a number of steps in the direction of the new location.
  3. Once you reach your new location, you repeat steps 1 and 2 until you can no longer find a location that is any lower. To the best of your knowledge, you’ve reached the base of the mountain, represented by the green circle in figure 6.

These three steps provide an intuitive representation of how the gradient descent algorithm works in practice. The green circle in figure 6 represents one of six local minima in the function – it’s the lowest point within a localized area of the function.

The Limitations of Gradient Descent

Had we started this same exercise from the blue star, we would have discovered an even lower minimum, and consequently a better solution than we did starting from the green star.

The conclusion we can draw from this exercise is that the gradient descent algorithm is not well suited to minimizing functions that have more than one local minimum, since the result is highly dependent on your initial starting position. You could never be certain that the algorithm had found the optimal values for Θ0 and Θ1 when it is run against a function with multiple local minima. Consequently, the gradient descent algorithm should only be used to minimize functions that have a single, global minimum. The linear regression cost function does have a single global minimum, so it’s well suited to being minimized by the gradient descent algorithm.

Visualizing the Topology of the Cost Function

The reasoning behind why linear regression cost functions produce a single global minimum goes a bit beyond the scope of this chapter. However, we are going to present a visualization of the cost function so you can appreciate the “terrain” that the gradient descent algorithm will traverse, and provide you with some insight into what the visualization represents with a thought experiment.

In our introduction to the cost function, we went through a practical exercise to compute the cost of our hypothesis function with arbitrary values for Θ0 and Θ1. The results that we computed were as follows:

J(1, 5) = 123
J(3, 11) = 451

Let’s imagine plotting these two results on a graph - since we have three variables to plot (Θ0, Θ1 and the cost), we need to plot this data on a 3D graph. We place Θ0 on the x axis, Θ1 on the y axis, and the cost on the z axis – what we end up with is a visualization of cost as a function of Θ0 and Θ1. At this stage, the graph would consist of two distinct points sitting in a 3D space.

Now, let’s assume that we have a lot of free time on our hands, and compute the cost function thousands of times, each with slightly different values of Θ0 and Θ1, and plot these points on our 3D graph. As you plot this data, a distinct shape will begin to emerge; computing and plotting an arbitrarily large number of these data points will inevitably reveal the convex, bowl-shaped structure in figure 7.

Figure 7: J(Θ0, Θ1) as a function of Θ0 and Θ1

This graph presents cost J(Θ0, Θ1) as a function of Θ0 and Θ1, and accurately represents the topology that our gradient descent algorithm will navigate. All linear regression cost functions end up producing this convex, bowl-like shape, which is why we can use the gradient descent algorithm to minimize them.

Areas shown in red represent high costs, and are consequently bad predictors. Areas shown in blue represent low costs, and are consequently good predictors. The gradient descent algorithm will discover the minimum of this function, which corresponds to the lowest cost and therefore the best predictor – in other words, the optimal values for Θ0 and Θ1.
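
Here is a hedged sketch of the thought experiment above: evaluate the cost over a grid of candidate Θ0 and Θ1 values (the grid ranges below are arbitrary choices, and the five-point training set from earlier stands in for real data). Plotting the resulting grid produces the bowl shape of figure 7.

```python
import numpy as np

xs = np.array([5, 6, 7, 8, 9], dtype=float)
ys = np.array([30, 40, 50, 60, 70], dtype=float)

def cost(theta0, theta1):
    m = len(xs)
    return np.sum((theta0 + theta1 * xs - ys) ** 2) / (2 * m)

# Evaluate J over a grid of candidate (theta0, theta1) pairs.
theta0_values = np.linspace(-40, 40, 81)
theta1_values = np.linspace(0, 20, 81)
J = np.array([[cost(t0, t1) for t1 in theta1_values] for t0 in theta0_values])

# The smallest entry in this "terrain" sits near the optimal parameters.
i, j = np.unravel_index(np.argmin(J), J.shape)
print(theta0_values[i], theta1_values[j], J[i, j])
```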

The Gradient Descent Algorithm

So far we’ve described the gradient descent algorithm intuitively, as three simple steps to traverse down a mountain. We’ll now formalize this intuition by introducing the gradient descent algorithm:

repeat until convergence, for j = 0 and j = 1: {

$$ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) $$

}

While this definition may initially seem unintuitive, we’ll work to demystify the gradient descent algorithm to show that it’s simply a formal representation of our earlier intuition. Figure 8 below presents the gradient descent algorithm overlaid with the concepts and intuition from our mountain hiking analogy; this should help provide some context around how the algorithm works in practice.

Figure 8: The gradient descent algorithm with an overlay of the mountain analogy

Let’s review some of the new terms introduced in the gradient descent algorithm:

$$ := $$

The expression := means assignment. The assignment ensures that we don’t lose reference to our current location as the algorithm iteratively converges towards the right solution.

$$ \alpha $$

This is the term alpha, known as the learning rate, which defines the size of the step you take. We will discuss the learning rate in detail shortly.

$$ \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) $$

This term represents the derivative of the cost function and defines the direction of the steps you’re taking. There’s a bit more to unpack here, and we’ll provide more insight about how this term works shortly.

The algorithm refers to repeating until convergence, for j = 0 and j = 1. What does this statement mean?

Put simply, when you take a step you’re moving some distance along the x axis (Θ0) and some distance along the y axis (Θ1). The gradient descent algorithm needs a practical means to capture these x and y coordinates so that the algorithm has a reference to where it currently is, and where it needs to move.

The statement for j = 0 and j = 1 is simply a shorthand notation used by the gradient descent algorithm to express the fact that we need to capture and store these two distinct variables. Let’s express the definition of the gradient descent algorithm in a way that explicitly tracks and records the x and y coordinates:

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) $$ $$ \theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) $$

}

Let’s substitute the cost function term J(Θ0, Θ1) with the actual cost function that we defined in the previous section. The gradient descent algorithm can now be expressed as:

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (\theta_0 + \theta_1x^{(i)} - y^{(i)} \right)^2 $$ $$ \theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (\theta_0 + \theta_1x^{(i)} - y^{(i)} \right)^2$$

}

The last step in the development of our algorithm requires us to take partial derivatives of these functions, which requires some knowledge of multivariate calculus. While you are more than welcome to work through the derivations yourself, we’ve done so for you below.

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (\theta_0 + \theta_1x^{(i)} - y^{(i)} \right) $$ $$ \theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (\theta_0 + \theta_1x^{(i)} - y^{(i)} \right)x^{(i)} $$

}

It took a bit of effort to get here, but we’ve successfully developed our first machine learning algorithm! Implementing this algorithm will allow you to predict the price of a home in Richmond far more accurately than our original best guess prediction that we made at the beginning of this chapter.
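
Here is a minimal Python sketch of the algorithm we just derived. The names (gradient_descent, alpha, iterations) are our own, a fixed iteration count stands in for a proper convergence test (discussed next), and the example call reuses the five-point training set from earlier rather than the Richmond sales data, which isn’t reproduced here.

```python
def gradient_descent(xs, ys, alpha=0.01, iterations=10000):
    """Fit theta0 and theta1 to the training data using gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Prediction errors h(x) - y for every example, using the current thetas.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters simultaneously, from the same current values.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

xs = [5, 6, 7, 8, 9]
ys = [30, 40, 50, 60, 70]
theta0, theta1 = gradient_descent(xs, ys, alpha=0.02, iterations=100000)
print(theta0, theta1)  # approaches theta0 = -20, theta1 = 10 for this training set
```

Note that both parameters are updated from the same current values within an iteration; updating Θ0 first and then using the already-updated Θ0 to compute Θ1 would subtly change the algorithm.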

To close out this chapter, we will discuss convergence in gradient descent and provide some practical guidance on how to effectively manage the gradient descent learning rate. These represent the last two pieces of information you’ll need in order to implement the gradient descent algorithm in software.

Convergence in Gradient Descent

When we introduced the gradient descent algorithm we briefly touched on the following term, the derivative of the cost function:

$$ \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) $$

The derivative of the cost function computes the slope of the cost function at a given point (Θ0, Θ1). Put simply, the slope tells us how steep the function is at that particular point.

If you refer back to Figure 7 where we presented a 3D visual of the cost function, you can see that the edges of the function are steeper than the region around the minimum in the center. Consequently, computing the slope of the cost function at a specific point provides us with an indication of how close we are to finding an optimal solution. The shallower the slope, the closer we are to discovering the minimum.

Figure 9 below presents a cross-section of the 3D cost function that we presented in Figure 7, and shows how the slope of the function successively approaches zero as we approach the minimum.

Figure 9: The derivative of the cost function approaches 0 as we approach the minimum.

As we approach the minimum, the derivative of the cost function approaches zero; once we reach the global minimum, the derivative of the cost function equals zero, and a very useful characteristic of the gradient descent algorithm emerges.

Take our original definition of the gradient descent algorithm (showing just one of the two updates - the step along the Θ0 axis):

$$ \theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) $$

Once we reach the minimum, the derivative of the cost function is equal to zero, and we end up with:

$$ \theta_0 := \theta_0 - \alpha \cdot 0 $$

Multiplying the learning rate alpha by zero cancels out this term entirely, and we are left with:

$$ \theta_0 := \theta_0 $$

The key insight here is that when the gradient descent algorithm reaches the minimum, the value of Θ0 becomes immutable; the value will not change with further iterations (and the same holds for Θ1). Convergence is defined as the incremental progression towards this condition.

In practice, you may not necessarily want to wait for the gradient descent algorithm to reach this condition. You’ll need to pragmatically balance running time against predictive accuracy to satisfy your specific requirements and use cases.

As the gradient descent algorithm approaches the minimum, the derivative term will incrementally approach zero after every iteration. Consequently, the algorithm will naturally take incrementally smaller steps as it approaches the minimum, and will take incrementally longer to reach it.

Figure 10 visualizes the incremental reduction in step size as the algorithm converges towards the global minimum:

Figure 10: Incremental reduction in step size as gradient descent converges on the minimum
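
In code, a common pragmatic stopping rule (a sketch; the tolerance and the iteration cap below are arbitrary choices, not part of the algorithm’s definition) is to stop once the cost stops falling by more than some small amount between iterations:

```python
def cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def fit_until_converged(xs, ys, alpha=0.02, tolerance=1e-9, max_iterations=1_000_000):
    """Run gradient descent until the cost stops improving by more than `tolerance`."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    previous_cost = cost(theta0, theta1, xs, ys)
    for _ in range(max_iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
        current_cost = cost(theta0, theta1, xs, ys)
        if abs(previous_cost - current_cost) < tolerance:
            break  # the steps have become negligibly small: treat this as convergence
        previous_cost = current_cost
    return theta0, theta1
```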

The Learning Rate Alpha

While the gradient descent algorithm implicitly reduces the step size as it converges on the global minimum, the learning rate alpha allows us to explicitly influence the step size. From an implementation perspective, the learning rate alpha is the only adjustable parameter in the gradient descent algorithm, and choosing an appropriate value for alpha ensures that the algorithm finds an optimal solution efficiently. If alpha is set too high, the algorithm may overshoot and never converge on the optimal solution; if alpha is set too low, the algorithm may take a long time to find the optimal solution.

Figures 11 and 12 below provide some insight into the behavior of the gradient descent algorithm when the value of alpha is set too high or too low.

Figure 11: Linear regression behavior with an aggressive learning rate

The larger the value of alpha, the larger the steps the gradient descent algorithm will take between each iteration. In iteration #1 and iteration #2, the algorithm takes big steps and gets closer to the minimum (the yellow circle). In iteration #3 the algorithm overshoots the minimum, and by iteration #4 it begins to diverge from the right solution.

When alpha is set too high, we risk never converging on the minimum at all.

Figure 12: Linear regression behavior with a conservative learning rate

The smaller the value of alpha, the smaller the steps the gradient descent algorithm will take between each iteration. If the learning rate is set conservatively, the gradient descent algorithm will take a larger number of smaller steps, which increases the time it will take to find an optimal solution.
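
A small, hedged experiment (the three alpha values below are arbitrary picks for the five-point training set used earlier) makes both failure modes visible: run the same fixed number of iterations for each learning rate and compare the resulting costs.

```python
def cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def run(xs, ys, alpha, iterations):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        theta0, theta1 = (theta0 - alpha * sum(errors) / m,
                          theta1 - alpha * sum(e * x for e, x in zip(errors, xs)) / m)
    return cost(theta0, theta1, xs, ys)

xs, ys = [5, 6, 7, 8, 9], [30, 40, 50, 60, 70]
for alpha in (0.1, 0.02, 0.00001):
    print(alpha, run(xs, ys, alpha, iterations=100))
# On this data, expect the cost to blow up for 0.1 (too aggressive), drop sharply
# for 0.02, and barely improve for 0.00001 (too conservative).
```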

Linear Regression Summary

The concepts, algorithms and methods that we’ve introduced in this chapter will lay the groundwork for much more powerful and sophisticated machine learning algorithms in upcoming chapters. With only a few incremental updates and enhancements, we will go from predictions that we can make intuitively to predictions so sophisticated they would be impossible to reason about without the tools and techniques we’ve learned in this chapter.

Here are the three key takeaways that you should be confident with:

1. The Hypothesis Function is used to model our training data. The hypothesis function can take many forms (linear functions, exponential functions, logarithmic functions) and it’s the responsibility of the AI/ML designer to select a hypothesis function that most closely models their problem space. In this chapter, we introduced the linear function to model our housing data.

$$ h(x) = \theta_0 + \theta_1x $$

2. Once we’ve selected an appropriate Hypothesis Function, the Cost Function allows us to quantify how well our model is performing with respect to specific values for the parameters Θ0 and Θ1. This is the key performance measurement that enables the linear regression algorithm to discover the optimal values for Θ0 and Θ1.

$$ J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (\theta_0 + \theta_1x^{(i)} - y^{(i)} \right)^2 $$

3. The Gradient Descent Algorithm leverages the Cost Function performance measurements to seek and discover the optimal values for Θ0 and Θ1, and consequently the best predictor for our data.

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (\theta_0 + \theta_1x^{(i)} - y^{(i)} \right) $$ $$ \theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (\theta_0 + \theta_1x^{(i)} - y^{(i)} \right)x^{(i)} $$

}

Once the gradient descent algorithm has converged on an optimal solution, substitute the parameters Θ0 and Θ1 in the hypothesis function with the values discovered by the gradient descent algorithm; you will have an optimized model ready to make the best predictions possible.
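
To close, here is a compact sketch of that final step, reusing the small five-point training set from this chapter (the Richmond data isn’t reproduced here) and NumPy for brevity; the learning rate and iteration count are arbitrary but workable choices for this data.

```python
import numpy as np

# Training data from earlier in the chapter.
x = np.array([5, 6, 7, 8, 9], dtype=float)
y = np.array([30, 40, 50, 60, 70], dtype=float)

theta0, theta1, alpha, m = 0.0, 0.0, 0.02, len(x)
for _ in range(100_000):
    errors = theta0 + theta1 * x - y                 # h(x) - y for every example
    theta0, theta1 = (theta0 - alpha * errors.sum() / m,
                      theta1 - alpha * (errors * x).sum() / m)

def h(new_x):
    # The hypothesis function with the learned parameters substituted in.
    return theta0 + theta1 * new_x

print(h(12))  # close to 100 for this data once the algorithm has converged
```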