
# Brandon John Grenier

## Artificial Intelligence • Machine Learning • Big Data

In the univariate linear regression series you learned how to implement a simple machine learning algorithm that can predict housing prices based on a single feature, the size of the house. In this article, we’ll start building a more sophisticated algorithm that can make housing price predictions based on multiple features.

We are going to learn how to build a supervised machine learning algorithm that can predict housing prices based on several properties of a house. Going through this exercise will introduce a number of key concepts, algorithms and mathematics used in machine learning.

### Defining the Training Set

In our previous example, the price of the house was based only on the size of the house, so we had a training set that looked like this:

| House Size - m2 (x) | House Price - AUD (y) |
| --- | --- |
| 124 | 223000 |
| 199 | 430000 |
| 334 | 900000 |
| 112 | 300000 |

Our new training set will introduce multiple features - the price of a house will now be based upon the house size, the number of bedrooms it has, the number of floors it has, and the age of the house:

| House Size - m2 | Bedrooms | Floors | House Age | House Price |
| --- | --- | --- | --- | --- |
| 124 | 3 | 1 | 3 | 223000 |
| 199 | 3 | 2 | 11 | 430000 |
| 334 | 5 | 3 | 22 | 900000 |
| 112 | 4 | 1 | 44 | 300000 |

We’ll introduce some notation in order to describe a training set with multiple features:
$x^{(i)}$ - this refers to all of the features of the $i$th instance in a training set. For example, $x^{(2)}$ refers to the entire feature set [199, 3, 2, 11] of the second training example (the second row).

$x_j^{(i)}$ - this refers to a specific feature $j$ of the $i$th instance in a training set. For example, $x_2^{(2)}$ refers to a specific feature (the number of bedrooms) in the second training example, which is 3.
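To make the notation concrete, here is a minimal sketch (in Java, to match the project's language, though this class is purely illustrative) of the training set and the two kinds of indexing. Note that Java arrays are zero-based, while the notation above is one-based:

```java
public class FeatureNotation {

    // Each row is one training example: house size, bedrooms, floors, age
    public static double[][] trainingSet() {
        return new double[][] {
            {124, 3, 1, 3},
            {199, 3, 2, 11},
            {334, 5, 3, 22},
            {112, 4, 1, 44}
        };
    }

    public static void main(String[] args) {
        double[][] x = trainingSet();

        // x^(2): all features of the second training example (row index 1)
        double[] secondExample = x[1];
        System.out.println(java.util.Arrays.toString(secondExample));

        // x^(2)_2: the second feature (bedrooms) of the second example
        double bedrooms = x[1][1];
        System.out.println(bedrooms);
    }
}
```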

### The Hypothesis Function

Our original hypothesis function was a simple linear function, defined as:
$h(x) = \theta_0 + \theta_1x$

This function works well when there is only one feature, but needs to be updated to accommodate for our new training set. With four features, our new hypothesis looks like this:

$h(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + \theta_4x_4$

You might notice that the function is a little inconsistent: although we have a $\theta_0$ parameter, there is no matching $x_0$ feature. To tidy this up, we'll introduce a convention whereby $x_0 = 1$.

By doing this, all of our terms are consistent with each other, and the function now looks like this:

$h(x) = \theta_0x_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + \theta_4x_4$

Setting $x_0 = 1$ keeps all of our terms consistent, and because it's defined as 1, multiplying by it has no mathematical impact. We can also generalise the hypothesis function at this point to get:

$h(x) = \theta_0x_0 + \theta_1x_1 + \ldots + \theta_nx_n$

The parameter n is used as a convention to describe the number of features. In our new training set, n equals 4, since we have 4 features.
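As a sketch of how the generalised hypothesis might translate into Java (a hypothetical illustration, not the repository's implementation), the $\theta_0$ term is simply added on its own, since $x_0 = 1$:

```java
public class Hypothesis {

    // h(x) = theta0*x0 + theta1*x1 + ... + thetan*xn, with x0 = 1.
    // theta holds n + 1 parameters; features holds the n values x1..xn.
    public static double predict(double[] theta, double[] features) {
        double h = theta[0];                   // theta0 * x0, since x0 = 1
        for (int j = 0; j < features.length; j++) {
            h += theta[j + 1] * features[j];
        }
        return h;
    }

    public static void main(String[] args) {
        double[] theta = {1.0, 2.0, 3.0};      // theta0, theta1, theta2
        double[] x = {10.0, 20.0};             // x1, x2
        System.out.println(predict(theta, x)); // 1 + 2*10 + 3*20 = 81.0
    }
}
```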

### The Cost Function

Our original cost function with a single feature was defined as: $J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (\theta_0 + \theta_1x^{(i)} - y^{(i)} \right)^2$

The cost function will be updated to support our new hypothesis function, and simply written as:
$J(\theta_0, ..., \theta_n) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)^2$

This definition is a bit long, so we'll collect the individual theta parameters into a single parameter vector, named theta, and abbreviate the weighted sum of features as the hypothesis $h_\theta(x^{(i)})$:
$J(\theta) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta(x^{(i)}) - y^{(i)} \right)^2$
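In code, the cost function sums the squared prediction errors over all $m$ training examples. A hypothetical Java sketch, using the $x_0 = 1$ convention:

```java
public class CostFunction {

    // J(theta) = 1/(2m) * sum over i of (h(x^(i)) - y^(i))^2, with x0 = 1
    public static double cost(double[] theta, double[][] x, double[] y) {
        int m = x.length;
        double total = 0.0;
        for (int i = 0; i < m; i++) {
            double h = theta[0];                    // theta0 * x0
            for (int j = 0; j < x[i].length; j++) {
                h += theta[j + 1] * x[i][j];
            }
            double error = h - y[i];
            total += error * error;
        }
        return total / (2.0 * m);
    }

    public static void main(String[] args) {
        // A perfect fit (h(x) = x, i.e. theta = [0, 1]) has zero cost
        double[][] x = {{1}, {2}, {3}};
        double[] y = {1, 2, 3};
        System.out.println(cost(new double[]{0, 1}, x, y)); // 0.0
    }
}
```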

The univariate gradient descent algorithm initially had two different update rules, one for calculating $\theta_0$, and another for calculating $\theta_1$.

repeat until convergence:
$\theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (\theta_0 + \theta_1x^{(i)} - y^{(i)} \right)$
$\theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left ((\theta_0 + \theta_1x^{(i)} - y^{(i)})x^{(i)} \right)$

If we use our new convention, $x_0 = 1$, both update rules take the same form, and the separate $\theta_0$ rule can be removed altogether. With that in mind, the only real difference between the univariate update and the multivariate update is the very last multiplication term: we multiply by $x_j^{(i)}$, the $j$th feature of the $i$th example, and not just $x^{(i)}$. This makes our gradient descent update:

repeat until convergence, simultaneously updating every $j$ from $j = 0$ to $j = n$:
$\theta_j := \theta_j - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (h_\theta(x^{(i)}) - y^{(i)} \right)x_j^{(i)}$

Source code for the Gradient Descent algorithm (implemented in Java) can be found on GitHub, in the multivariate-linear-regression project.
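As an illustration of the update rule (a hypothetical sketch, not the repository code verbatim), each iteration below computes every new theta value from the current ones before any of them change, which is what "simultaneous update" means in practice. The first column of each training row is $x_0 = 1$:

```java
public class GradientDescent {

    // Repeats the simultaneous theta update for a fixed number of iterations.
    // Each row of x includes x0 = 1 as its first column; alpha is the learning rate.
    public static double[] descend(double[] theta, double[][] x, double[] y,
                                   double alpha, int iterations) {
        int m = x.length;
        int n = theta.length;
        for (int iter = 0; iter < iterations; iter++) {
            double[] next = new double[n];
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int i = 0; i < m; i++) {
                    double h = 0.0;                 // h(x^(i)) = sum of theta_k * x_k^(i)
                    for (int k = 0; k < n; k++) {
                        h += theta[k] * x[i][k];
                    }
                    sum += (h - y[i]) * x[i][j];    // (h(x^(i)) - y^(i)) * x_j^(i)
                }
                next[j] = theta[j] - alpha * sum / m;
            }
            theta = next;                           // simultaneous update of all thetas
        }
        return theta;
    }

    public static void main(String[] args) {
        // Learn y = 2x from three examples; each row is {x0 = 1, x1}
        double[][] x = {{1, 1}, {1, 2}, {1, 3}};
        double[] y = {2, 4, 6};
        double[] theta = descend(new double[]{0, 0}, x, y, 0.1, 5000);
        System.out.println(theta[0] + " " + theta[1]); // converges towards 0 and 2
    }
}
```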

### Feature Scaling

There's one new problem that gets introduced when we move from making predictions with a single feature to making predictions with multiple features. Let's take the example of predicting the price of a house with two features: the size of the house and the number of bedrooms.

| House Size - m2 | Bedrooms | House Price |
| --- | --- | --- |
| 124 | 3 | 223000 |
| 199 | 3 | 430000 |
| 334 | 5 | 900000 |
| 112 | 4 | 300000 |

In this training set house sizes range from 112m2 to 334m2, and the number of bedrooms ranges from 3 to 5. The problem we run into is that the Gradient Descent algorithm will likely take quite a few steps to "walk down" the house sizes, because we have a pretty big range to work through. On the other hand, it should get through the bedrooms pretty quickly - they only range from 3 to 5. Since we have to compute the value of theta for all of our features simultaneously, we end up with a scenario where we find the theta value for bedrooms quite quickly, but the algorithm will still burn through iterations finding the correct theta value for house size.

If you had a training set with 10 features, and only one of them had a large range, the algorithm would spend most of its time attempting to resolve theta for that one feature. This isn't particularly efficient, and this is where feature scaling comes in. It's worthwhile spending a bit of time understanding how feature scaling works, as it's used throughout machine learning.

The purpose of feature scaling is to minimise the range of values for each feature. Ideally, we want every feature in our training set to be centred in or around zero. More specifically, for any given feature $x$ we want every instance $x_i$ to fall in the range $-1 \le x_i \le 1$.

This isn't a hard requirement; it's alright if some features fall slightly outside this range, but the further the ranges of your features diverge from one another, the slower Gradient Descent will converge.

A really simple way to scale our features is to find the maximum value of a given feature, and divide every instance of that feature by this value. For example, we know that the maximum house size is 334m2. If we were to divide every house size by 334, we would automatically scale the feature to be within a range of 0 to 1. We'd get the same outcome by dividing the number of bedrooms by 5.

Another useful technique for feature scaling is called mean normalisation, and the formula is:
$x_i := \dfrac{x_i - \mu_i}{s_i}$

Where $\mu_i$ is the average of all the values for feature (i), and $s_i$ is the range of values, defined as the largest value of feature (i) minus the smallest value of feature (i). Mean normalisation is the most often used technique for feature scaling.

Here's an example of applying mean normalisation to the house size of our second training instance. We know that the house size is 199, I've calculated the average house size (192.25), and I've also determined the range (the largest house is 334m2, the smallest house is 112m2). After applying mean normalisation, the house size should scale to:
$x_1^{(2)} := \dfrac{199 - 192.25}{(334 - 112)} = \dfrac{6.75}{222} = 0.03$
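The same calculation can be applied to an entire feature column. Here's a hypothetical Java sketch of mean normalisation for a single feature:

```java
public class MeanNormalisation {

    // x_i := (x_i - mu) / s, where mu is the feature mean and s is its range
    public static double[] normalise(double[] values) {
        double min = values[0], max = values[0], sum = 0.0;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
            sum += v;
        }
        double mean = sum / values.length;      // 192.25 for the house sizes
        double range = max - min;               // 334 - 112 = 222
        double[] scaled = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            scaled[i] = (values[i] - mean) / range;
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[] sizes = {124, 199, 334, 112};
        System.out.println(java.util.Arrays.toString(normalise(sizes)));
    }
}
```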

I used mean normalisation to scale each house size in the table presented at the beginning of this article, and the results are as follows:

| House Size (Before Scaling) | House Size (After Scaling) |
| --- | --- |
| 124 | -0.31 |
| 199 | 0.03 |
| 334 | 0.63 |
| 112 | -0.36 |

You should be able to get much better multivariate regression performance from your Gradient Descent algorithm if you apply mean normalisation to your training set.

### Implementing Feature Scaling

Feature scaling will take our training set, which currently looks like this:

| House Size - m2 | Bedrooms | Floors | House Age | House Price |
| --- | --- | --- | --- | --- |
| 124 | 3 | 1 | 3 | 223000 |
| 199 | 3 | 2 | 11 | 430000 |
| 334 | 5 | 3 | 22 | 900000 |
| 112 | 4 | 1 | 44 | 300000 |

And convert it into this:

| House Size - m2 | Bedrooms | Floors | House Age | House Price |
| --- | --- | --- | --- | --- |
| -0.31 | -0.37 | -0.37 | -0.41 | 223000 |
| 0.03 | -0.37 | 0.12 | -0.21 | 430000 |
| 0.63 | 0.62 | 0.62 | 0.05 | 900000 |
| -0.36 | 0.12 | -0.37 | 0.58 | 300000 |

Notice that our house price is not affected; only features are subject to scaling, not the target values or theta. Also, keep in mind that we compute the range and mean on a per-feature basis. In other words, you will have four different mean and range combinations based on the example above - a unique mean and range for the house size, the bedrooms, the floors, and the house age. As part of your scaling implementation, you'll want to keep these precomputed values handy, and you'll see why next.

As a consequence of scaling, our algorithm will be "trained" on the scaled training set, not the original one. Your function will be able to make accurate predictions with scaled inputs, but not with original inputs. In fact, the values of theta produced from a non-scaled training set will, in most circumstances, differ from the values of theta produced from a scaled training set.

Whenever you want to make predictions from a function trained on a scaled training set, you will also need to remember to scale your inputs, using the same per-feature mean and range that were computed during training. As a concrete example, I've included a simple test which shows how I go about scaling inputs after a training set has been normalised; refer to the comments in the test case for more context.
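To illustrate the idea (a hypothetical sketch, not the test from the repository), scaling a new input reuses the mean and range that were precomputed from the training set:

```java
public class ScaledPrediction {

    // Scale a raw input value with a mean and range precomputed during training
    public static double scale(double value, double mean, double range) {
        return (value - mean) / range;
    }

    public static void main(String[] args) {
        // Precomputed for the house size feature: mean 192.25, range 222
        double mean = 192.25, range = 222.0;

        // A new 250m2 house must be scaled before it is fed to the hypothesis
        double scaledSize = scale(250, mean, range); // (250 - 192.25) / 222
        System.out.println(scaledSize);
    }
}
```

The same `scale` call must be applied to every feature of the input, each with its own precomputed mean and range.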