We are going to learn how to build a simple supervised machine learning algorithm that can predict housing prices based on a single feature: the size of a house. Working through this exercise will introduce a number of key concepts, algorithms and pieces of mathematics used in machine learning.

### Making Intuitive Predictions

Let’s assume I want to purchase a 150m^{2} house in Sydney, and that the price of a house is primarily driven by its size. If I can get my hands on a small set of housing data I can start making my own predictions by hand. Here’s how to do it:

I’ve plotted the data points on a graph: housing prices are on the Y axis, and house sizes in square meters are on the X axis. I drew a straight line that fits the data as closely as possible. That straight line is an example of a *linear function* - linear functions are simple mathematical functions that produce straight lines.

If you remember from the first article, there are two types of machine learning problems: classification problems and regression problems. House price prediction is an example of a *regression* problem. Because we’re predicting housing prices solely from the size of the house, this problem is technically referred to as a *univariate regression problem* (univariate is a fancy way of saying one variable).

Now that I have my linear function established I can start making predictions. For a 150 square meter house, I drew a green dashed line up until it intersected my hand-rolled linear function, then came across to the Y axis, which gives me an estimate of about $390,000. My first prediction is done.

This prediction is an intuitive prediction - there’s no math involved here, I just eyeballed the data and tried to make a straight line that fit the data best.

### Optimising Fit

The graph below shows four different linear functions, each of which will result in different predictions.

It’s pretty clear that two of the linear functions don’t fit the housing price data very well, and as a consequence these functions will produce poor predictions. On the other hand, the two remaining linear functions fit the data quite well, and both will produce much better predictions. Given these two well-fitting linear functions, which one will produce the most accurate housing price predictions? Is there an even better function that could produce even better predictions?

This is exactly what our first machine learning algorithm will answer. Effectively, we’re going to implement a machine learning algorithm that can draw a really accurate straight line. Sounds exciting? You bet!

### Defining the Training Set

In a supervised learning algorithm, the data set we use to train the algorithm with the “right” answers is called the *training set*. I’m going to introduce some common notation and conventions that we use to describe training sets.

House Size - m^{2} (x) | House Price - AUD (y)
---|---
124 | 223000
199 | 430000
334 | 900000
112 | 300000

**x** - this is referred to as our input variable (also known as a *feature*), in our example the house size is our input variable.

**y** - this is referred to as our output variable (also known as a *target*), in our example the house price is our output variable.

**m** - the number of training samples. In the table above, we have 4 training samples.

**(x, y)** - this refers to a single training example - in other words, a single row in our table.

**(x^{(i)}, y^{(i)})** - this refers to a specific instance of a training example. For example, the third row in our training set would be addressed as (x^{(3)}, y^{(3)}).
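To make the notation concrete, here’s how the training set above maps onto plain Python lists (a sketch of the conventions, not any particular library’s API):

```python
# Training set from the table above.
x = [124, 199, 334, 112]              # input variable (feature): house size in m^2
y = [223000, 430000, 900000, 300000]  # output variable (target): price in AUD

m = len(x)  # m - the number of training samples

# (x^(i), y^(i)) is the i-th training example. The notation is 1-indexed
# while Python lists are 0-indexed, so the third example (i = 3) is:
third_example = (x[2], y[2])

print(m)              # 4
print(third_example)  # (334, 900000)
```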

### The Hypothesis Function

Our machine learning algorithm will figure out which function it should use to make the best possible predictions. The function produced (or “learned”) by this machine learning algorithm is referred to as **h** - short for *hypothesis*. It’s probably not the most accurate description, but it’s been the convention for quite some time so we’re sticking with it. All of the straight lines I drew in the graphs above are examples of different hypothesis functions.

The goal of our supervised learning algorithm: given a training set, produce (learn) a function **h** so that **h(x)** is a good predictor for **y**. Our hypothesis function **h(x)** is a simple linear regression function that will let us figure out how to fit the best possible straight line to our data.

The function is defined as:

h_{Θ}(x) = Θ_{0} + Θ_{1}*x

With a little shorthand notation, I’ll normally write this as:

h(x) = Θ_{0} + Θ_{1}x

This symbol Θ is called theta. The values **Θ_{0}** and **Θ_{1}** are referred to as the **parameters** of the function. Again, Θ_{0} and Θ_{1} are used by convention, and simply represent arbitrary parameters that our machine learning algorithm will learn to modify to get the best fitting straight line.

The function could just have easily been written like:

f(x) = a + bx

That function might even look familiar to you if you’ve taken a few math classes in high school. It just so happens that Θ_{0}, Θ_{1} and h are the standard convention in machine learning.
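In Python, the hypothesis is a one-line function. The parameter values below are made up purely for illustration, not learned:

```python
def h(x, theta0, theta1):
    """Hypothesis function: h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# With theta0 = 50000 and theta1 = 2000 (arbitrary values, not learned ones),
# a 150 m^2 house would be priced at 50000 + 2000 * 150:
print(h(150, 50000, 2000))  # 350000
```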

### The Cost Function

The machine learning algorithm will ultimately decide which values to use for Θ_{0} and Θ_{1}. How do we go about doing that?

Remember the goal of our hypothesis function: h(x) should be a good predictor of y, therefore we want to choose values of Θ_{0} and Θ_{1} that make h(x) a good predictor of y.

The best prediction we can make is when the predicted value **h(x)** matches the actual value **y** exactly - in other words, when the delta (difference) between the predicted value h(x) and the actual value y is **0**.

The algorithm isn’t always going to make predictions perfectly, but the idea behind this is that we want to have the *smallest possible difference* between a predicted value and an actual value, so that **h(x) - y** is as small as possible, ideally 0.

For every example (x, y) in our training set, we want to find the difference between the predicted value h(x) and the actual value y, and then calculate the average delta. This will tell us how well we can make predictions across our entire training set: the smaller the average, the better the predictions. Algorithmically:

- For every training instance, calculate the difference between the predicted value and the actual value.
- Add all of the differences together.
- Divide the sum of the differences by the size of the training set. This will give us the average difference between predictions and expected outputs across our entire training set.
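The three steps above can be sketched in a few lines of Python. This is the naive version using raw (unsquared) differences, and the toy data set is made up for illustration:

```python
def average_error(x, y, theta0, theta1):
    """Average difference between predicted and actual values
    across the whole training set."""
    m = len(x)
    total = 0.0
    for i in range(m):
        prediction = theta0 + theta1 * x[i]  # h(x) for one training instance
        total += prediction - y[i]           # difference for that instance
    return total / m                         # average across the training set

# Made-up toy data where y happens to equal x exactly.
x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 5]

# A perfect fit (theta0 = 0, theta1 = 1) gives an average difference of 0.
print(average_error(x, y, 0, 1))  # 0.0
```

Note that raw differences can cancel each other out - a prediction that’s too high offsets one that’s too low - which is one reason the formal cost function we’re building towards squares the error instead.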

We’ll now define this algorithm mathematically. First, to calculate the difference between a specific predicted value and the actual value, we can use:

h(x^{(i)}) - y^{(i)}

Add all of the differences together, from the first instance to the last instance in the training set:

Σ_{i=1}^{m} (h(x^{(i)}) - y^{(i)})

Divide the sum of the differences by the size of the training set:

(1/m) Σ_{i=1}^{m} (h(x^{(i)}) - y^{(i)})

The formal cost function is written as follows:

J(Θ_{0}, Θ_{1}) = (1/2m) Σ_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^{2}

You’ll notice a couple of differences between the cost function above and the one I’ve come up with.

First, the “error” component (the difference between the predicted and actual values) is now squared. We square the error component to make the implementation of the algorithms we use later a bit easier. Squaring always ensures that we have a positive number: 2^{2} and (-2)^{2} are both 4. Later on, it means we can simply look for the cost closest to 0, rather than finding the “positive value closest to 0” and the “negative value closest to 0” and then figuring out which of *them* is actually closer.

Instead of just dividing by the size of the training set, we now divide by two times the size of the training set. Dividing by 2m instead of m doesn’t change which values of Θ_{0} and Θ_{1} minimise the cost - every result is just scaled by a constant - but it makes the math cleaner later on: when we differentiate the squared error term during gradient descent, the 2 brought down from the exponent cancels the 2 in the denominator. This is the formal definition of the *squared error cost function*.
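As a minimal sketch, the squared error cost function translates directly into Python:

```python
def cost(x, y, theta0, theta1):
    """Squared error cost function J(theta0, theta1)."""
    m = len(x)
    total = 0.0
    for xi, yi in zip(x, y):
        prediction = theta0 + theta1 * xi  # h(x) for one training example
        total += (prediction - yi) ** 2    # squared error for that example
    return total / (2 * m)                 # divide by 2m, not m

# Made-up toy data: a perfect fit has a cost of exactly 0.
print(cost([1, 2, 3], [1, 2, 3], 0, 1))  # 0.0
```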

The machine learning algorithm will ultimately decide which values to use for Θ_{0} and Θ_{1}. Now we can clearly articulate our goal: the algorithm must find values for Θ_{0} and Θ_{1} so that the cost function J(Θ_{0}, Θ_{1}) produces the smallest possible value - in formal terms, we want to *minimise* J(Θ_{0}, Θ_{1}).

### The Cost Function in Practice

The cost function might take you a while to unpack, so I wanted to introduce some practical examples where we can use the cost function and see what it’s doing. The graph below has the observed (actual) data points plotted in red, and I’ve used the hypothesis function, h(x) = Θ_{0} + Θ_{1}x to plot out three different lines on the graph.

I chose Θ_{0} = 1 and Θ_{1} = 1 to draw the purple line: h(x) = 1 + 1x

I chose Θ_{0} = 5 and Θ_{1} = 0 to draw the light blue line: h(x) = 5 + 0x

I chose Θ_{0} = 0 and Θ_{1} = 0.5 to draw the dark blue line: h(x) = 0 + 0.5x

Given we have 5 training samples (m = 5) in the data set, the cost function for J(1, 1) can be written as:

J(1, 1) = (1/(2*5)) Σ_{i=1}^{5} (h(x^{(i)}) - y^{(i)})^{2}

We’ll sum up the squared error for each of the data points:

(h(x^{(1)}) - y^{(1)})^{2} = (2 - 1)^{2} = (1)^{2} = 1 +

(h(x^{(2)}) - y^{(2)})^{2} = (3 - 2)^{2} = (1)^{2} = 1 +

(h(x^{(3)}) - y^{(3)})^{2} = (4 - 3)^{2} = (1)^{2} = 1 +

(h(x^{(4)}) - y^{(4)})^{2} = (5 - 4)^{2} = (1)^{2} = 1 +

(h(x^{(5)}) - y^{(5)})^{2} = (6 - 5)^{2} = (1)^{2} = 1

The sum of the squared error is equal to 5 (1 + 1 + 1 + 1 + 1).

We divide the sum of the squared error by 2m = 10, and find that J(1, 1) is equal to 0.5.

I’ve calculated the values of the last two cost functions - as an exercise, see if you can calculate them yourself and get the same numbers.

J(5, 0) = 3.0

J(0, 0.5) = 1.375
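If you want to check these numbers (or your own working), the five data points implied by the sums above are (1, 1) through (5, 5), and a few lines of Python reproduce all three costs:

```python
def cost(x, y, theta0, theta1):
    """Squared error cost function J(theta0, theta1)."""
    m = len(x)
    return sum((theta0 + theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

# The five training samples from the worked example.
x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 5]

print(cost(x, y, 1, 1))    # 0.5
print(cost(x, y, 5, 0))    # 3.0
print(cost(x, y, 0, 0.5))  # 1.375
```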

After doing the math, I’ve been able to show that J(1, 1) is the lowest cost of the three that were presented. If you look at the graph again, you’ll notice that the corresponding line best fits the data as well. Is there a choice of parameters that fits the data even better? Absolutely - Θ_{0} = 0 and Θ_{1} = 1 gives J(0, 1) = 0, which means that line would fit our data set perfectly.

Now that we’re able to calculate the cost function, we have one last problem to solve before we can implement our machine learning algorithm. How can we quickly and efficiently minimise J(Θ_{0}, Θ_{1})?

We could take a naive, brute force approach and simply calculate the value of J(Θ_{0}, Θ_{1}) for every possible combination of Θ_{0} and Θ_{1}, and find the lowest value. This will *work*, but you might be waiting a while for your algorithm to finish.
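To get a feel for why brute force scales badly, here’s a sketch that scans a coarse grid of candidate parameters (the grid bounds and step size are arbitrary choices):

```python
def cost(x, y, theta0, theta1):
    """Squared error cost function J(theta0, theta1)."""
    m = len(x)
    return sum((theta0 + theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

# The five training samples from the worked example above.
x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 5]

best = None
# Even this small 200 x 200 grid requires 40,000 cost evaluations; a finer
# step, wider bounds, or more parameters blows the count up very quickly.
for i in range(-100, 100):
    for j in range(-100, 100):
        theta0, theta1 = i * 0.1, j * 0.1
        c = cost(x, y, theta0, theta1)
        if best is None or c < best[0]:
            best = (c, theta0, theta1)

print(best)  # (0.0, 0.0, 1.0) - the perfect fit happens to lie on the grid
```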

In part 2 I’ll introduce the Gradient Descent algorithm, and show you how we can use it to quickly and efficiently minimise the cost function without iterating over an arbitrarily large number of parameters.