In the univariate linear regression series you learned how to implement a simple machine learning algorithm that can predict housing prices based on a single feature, the size of the house. In this article, we’ll start building a more sophisticated algorithm that can make housing price predictions based on *multiple* features.

We're going to learn how to build a supervised machine learning algorithm that can predict housing prices based on multiple features of a house. Working through this exercise will introduce a number of key concepts, algorithms and mathematics used in machine learning.

### Defining the Training Set

In our previous example, the price of the house was based only on the size of the house, so we had a training set that looked like this:

House Size - m^{2} (x) | House Price - AUD (y) |
---|---|
124 | 223000 |
199 | 430000 |
334 | 900000 |
112 | 300000 |

Our new training set will introduce multiple features - the price of a house will now be based upon the house size, the number of bedrooms it has, the number of floors it has, and the age of the house:

House Size - m^{2} | Bedrooms | Floors | House Age | House Price |
---|---|---|---|---|
124 | 3 | 1 | 3 | 223000 |
199 | 3 | 2 | 11 | 430000 |
334 | 5 | 3 | 22 | 900000 |
112 | 4 | 1 | 44 | 300000 |

We’ll introduce some notation in order to describe a training set with multiple features:

**x^{(i)}** - this refers to *all* features of an instance in a training set. For example, x^{(2)} refers to the entire feature set [199, 3, 2, 11] of the second training example (the second row).

**x^{(i)}_{j}** - this refers to *a specific* feature of an instance in a training set. For example, x^{(2)}_{2} refers to a specific feature (the number of bedrooms) of the second training example, which has the value 3.
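To make the indexing concrete, here's a small sketch of the training set above as a plain Java 2D array (this array layout is an illustration, not the representation used in the accompanying project):

```java
import java.util.Arrays;

public class NotationExample {
    public static void main(String[] args) {
        // Each row is one training example; each column one feature
        // (size, bedrooms, floors, age). Java arrays are zero-based, so
        // superscript (i) maps to row i - 1 and subscript j to column j - 1.
        double[][] x = {
            {124, 3, 1, 3},
            {199, 3, 2, 11},
            {334, 5, 3, 22},
            {112, 4, 1, 44}
        };

        double[] x2 = x[1];     // x^(2): the whole second training example
        double x2_2 = x[1][1];  // x^(2)_2: bedrooms of the second example

        System.out.println(Arrays.toString(x2)); // prints [199.0, 3.0, 2.0, 11.0]
        System.out.println(x2_2);                // prints 3.0
    }
}
```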

### The Hypothesis Function

Our original hypothesis function was a simple linear function, defined as:

h(x) = Θ_{0} + Θ_{1}x

This function works well when there is only one feature, but it needs to be updated to accommodate our new training set. With four features, our new hypothesis looks like this:

h(x) = Θ_{0} + Θ_{1}x_{1} + Θ_{2}x_{2} + Θ_{3}x_{3} + Θ_{4}x_{4}

You might notice that the function is a little inconsistent. Although we have a definition for Θ_{0}, we don’t have a definition for x_{0}. To tidy this up, we’ll introduce a convention, whereby:
x_{0} = 1

By doing this, all of our terms are consistent with each other, and the function now looks like this:

h(x) = Θ_{0}x_{0} + Θ_{1}x_{1} + Θ_{2}x_{2} + Θ_{3}x_{3} + Θ_{4}x_{4}

Setting x_{0} = 1 helps to keep our terms consistent, and because it’s defined as 1 there won’t be any mathematical impact to this term. We can also generalise the hypothesis function at this point to get:

h(x) = Θ_{0}x_{0} + Θ_{1}x_{1} + … + Θ_{n}x_{n}

The parameter **n** is used by convention to describe the number of features. In our new training set, n equals 4, since we have 4 features.
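Here's a minimal sketch of the generalised hypothesis in Java, assuming theta and x are arrays of length n + 1 with x[0] fixed at 1 (the theta values in the example are made up for illustration):

```java
public class Hypothesis {
    // h(x) = theta_0 * x_0 + theta_1 * x_1 + ... + theta_n * x_n,
    // with x[0] = 1 by convention.
    static double h(double[] theta, double[] x) {
        double sum = 0.0;
        for (int j = 0; j < theta.length; j++) {
            sum += theta[j] * x[j];
        }
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical theta values, then the second training example
        // with the leading x_0 = 1 term prepended.
        double[] theta = {50000, 2000, 10000, 5000, -300};
        double[] x = {1, 199, 3, 2, 11};
        System.out.println(h(theta, x)); // prints 484700.0
    }
}
```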

### The Cost Function

Our original cost function with a single feature was defined as:

J(Θ_{0}, Θ_{1}) = (1/2m) Σ_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^{2}

The cost function will be updated to support our new hypothesis function, and simply written as:

J(Θ_{0}, Θ_{1}, …, Θ_{n}) = (1/2m) Σ_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^{2}

This definition is a bit long, so we'll substitute the list of individual theta parameters with a single vector, named θ:

J(θ) = (1/2m) Σ_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^{2}
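As a sketch, the multivariate cost function translates to Java like this (plain double arrays rather than the BigDecimal types used in the accompanying project; each row of x is assumed to already include the leading x_{0} = 1 term):

```java
public class CostFunction {
    // J(theta) = (1 / 2m) * sum over i of (h(x^(i)) - y^(i))^2
    static double cost(double[] theta, double[][] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            // Hypothesis for the ith training example.
            double h = 0.0;
            for (int j = 0; j < theta.length; j++) {
                h += theta[j] * x[i][j];
            }
            double error = h - y[i];
            sum += error * error;
        }
        return sum / (2.0 * x.length);
    }

    public static void main(String[] args) {
        // Toy set where y equals the single feature exactly.
        double[][] x = {{1, 1}, {1, 2}, {1, 3}};
        double[] y = {1, 2, 3};
        // theta = [0, 1] fits the data perfectly, so the cost is zero.
        System.out.println(cost(new double[]{0, 1}, x, y)); // prints 0.0
    }
}
```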

### Multivariate Gradient Descent

The univariate gradient descent algorithm initially had two different update rules, one for calculating Θ_{0}, and another for calculating Θ_{1}:

repeat until convergence:

Θ_{0} := Θ_{0} - α (1/m) Σ_{i=1}^{m} (h(x^{(i)}) - y^{(i)})

Θ_{1} := Θ_{1} - α (1/m) Σ_{i=1}^{m} (h(x^{(i)}) - y^{(i)}) x^{(i)}

If we use our new convention, x_{0} = 1, both of these rules become consistent with each other, and we can remove the separate Θ_{0} rule altogether. With that in mind, the only real difference between the univariate update rule and the multivariate update rule is the very last multiplication term - here we want to multiply by x^{(i)}_{j}, the jth feature of the ith training example, and not just x. This makes our update rule:

repeat simultaneously until convergence, for j = 0 to j = n:

Θ_{j} := Θ_{j} - α (1/m) Σ_{i=1}^{m} (h(x^{(i)}) - y^{(i)}) x^{(i)}_{j}
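One simultaneous update of all the thetas can be sketched as follows; this is a minimal illustration with plain double arrays and a fixed learning rate, not the adaptive BigDecimal-based implementation from the project:

```java
public class GradientDescentStep {
    // theta_j := theta_j - alpha * (1/m) * sum_i (h(x^(i)) - y^(i)) * x^(i)_j
    static double[] step(double[] theta, double[][] x, double[] y, double alpha) {
        int m = x.length;
        double[] next = new double[theta.length];
        for (int j = 0; j < theta.length; j++) {
            double sum = 0.0;
            for (int i = 0; i < m; i++) {
                double h = 0.0;
                for (int k = 0; k < theta.length; k++) {
                    h += theta[k] * x[i][k];
                }
                sum += (h - y[i]) * x[i][j];
            }
            // Write into a separate array so every theta_j is computed from
            // the *old* values -- the "simultaneously" part of the rule.
            next[j] = theta[j] - alpha * sum / m;
        }
        return next;
    }

    public static void main(String[] args) {
        // Toy set where y equals the single feature; each row starts with x_0 = 1.
        double[][] x = {{1, 1}, {1, 2}, {1, 3}};
        double[] y = {1, 2, 3};
        double[] theta = {0, 0};
        for (int iter = 0; iter < 2000; iter++) {
            theta = step(theta, x, y, 0.1);
        }
        // theta should end up very close to [0, 1] for this data.
        System.out.println(java.util.Arrays.toString(theta));
    }
}
```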

### Implementing Gradient Descent

Source code for the Gradient Descent algorithm (implemented in Java) can be found on GitHub, in the multivariate-linear-regression project. Here's a snippet of the Gradient Descent implementation:

public LinearFunction<T> run() {
    List<BigDecimal> thetas = initialise();
    List<BigDecimal> tempThetas = initialise();
    BigDecimal tempCost = new BigDecimal(100.0);
    this.cost = new BigDecimal(100.0);
    Double convergence = new Double(100.0);
    this.iterations = 1;
    while (convergence > tolerance) {
        IntStream.range(0, thetas.size()).forEach(i -> tempThetas.set(i, calculateTheta(i, thetas)));
        this.cost = costFunction.at(BigDecimals.listToArray(thetas));
        tempCost = costFunction.at(BigDecimals.listToArray(tempThetas));
        this.alpha = (tempCost.doubleValue() > cost.doubleValue()) ? alpha / 2 : alpha + 0.02;
        convergence = Math.abs(tempCost.doubleValue() - cost.doubleValue());
        this.iterations += 1;
        IntStream.range(0, tempThetas.size()).forEach(i -> thetas.set(i, tempThetas.get(i)));
    }
    return new LinearFunction<T>(BigDecimals.listToArray(thetas));
}

### Feature Scaling

There's one new problem that gets introduced when we move from making predictions with a single feature to making predictions with multiple features. Let's take the example of predicting the price of a house with two features: the size of the house and the number of bedrooms.

House Size - m^{2} | Bedrooms | House Price |
---|---|---|
124 | 3 | 223000 |
199 | 3 | 430000 |
334 | 5 | 900000 |
112 | 4 | 300000 |

In this training set house sizes range from 112m^{2} to 334m^{2}, and the number of bedrooms ranges from 3 to 5. The problem that we run into here is that the Gradient Descent algorithm will likely take quite a few steps to "walk down" the house sizes, because we have a pretty big range to work through. On the other hand, we should be able to get through the bedrooms pretty quickly - they only range from 3 to 5. Since we have to compute the value of theta for *all* of our features simultaneously, we'll end up with a scenario where we find the value of theta for our bedrooms quite quickly, but the algorithm will still burn through iterations finding the correct theta value for our house size.

If you had a training set with 10 features, and only one of them had a large range, the algorithm would spend most of its time attempting to resolve theta for that one feature. This isn't particularly efficient, and this is where *feature scaling* comes in. It's worthwhile spending a bit of time to understand how feature scaling works, as it's used throughout machine learning.

The purpose of feature scaling is to minimise the range of values for a specific feature. Ideally, we want every feature in our training set to sit in a small range around zero. More specifically, for any given feature x we want every instance of x to be in the range: -1 <= x_{i} <= 1

This isn't a hard requirement; it's alright if certain features fall slightly outside this range, but be mindful of features whose ranges remain much larger than the others.

A really simple way to scale our features is to find the maximum value of a given feature, and divide every instance of that feature by that value. For example, we know that the maximum house size is 334m^{2}. If we were to divide every house size by 334, we would automatically scale the feature to be within a range of 0 - 1. We'd get the same outcome if we divided the number of bedrooms by 5.
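A sketch of this divide-by-maximum approach, assuming plain double arrays (the class and method names are illustrative):

```java
import java.util.Arrays;

public class MaxScaling {
    // Divide every instance of a feature by its largest value, leaving
    // each scaled value in the range 0 - 1.
    static double[] scale(double[] feature) {
        double max = feature[0];
        for (double v : feature) {
            max = Math.max(max, v);
        }
        double[] scaled = new double[feature.length];
        for (int i = 0; i < feature.length; i++) {
            scaled[i] = feature[i] / max;
        }
        return scaled;
    }

    public static void main(String[] args) {
        // The house sizes from the article; the largest is 334.
        double[] scaled = scale(new double[]{124, 199, 334, 112});
        System.out.println(Arrays.toString(scaled));
    }
}
```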

Another useful technique for feature scaling is called *mean normalisation*, and the formula is:

x_{i} := (x_{i} - μ_{i}) / S_{i}

Where μ_{i} is the average of all the values for feature (i) and S_{i} is the *range* of values, defined as the largest value of feature (i) minus the smallest value of feature (i). Mean normalisation will be the most often used technique for feature scaling.

Here's an example of applying mean normalisation to the house size of our second training instance, x_{2}. We know that the value of x_{2} is 199, I've calculated the average house size (192.25), and I've also determined the range (the largest house is 334m^{2} and the smallest is 112m^{2}, giving a range of 222). After applying mean normalisation, the house size scales to:

(199 - 192.25) / 222 ≈ 0.03

I used mean normalisation to scale each house size in the table presented at the beginning of this article, and the results are as follows:

House Size (Before Scaling) | House Size (After Scaling) |
---|---|
124 | -0.31 |
199 | 0.03 |
334 | 0.63 |
112 | -0.36 |

You should be able to get much better multivariate regression performance from your Gradient Descent algorithm if you apply mean normalisation to your training set.
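The whole mean normalisation step can be sketched as a small standalone function, using plain double arrays and hypothetical names (the accompanying project wraps the same idea in its MeanNormalisation class):

```java
import java.util.Arrays;

public class MeanNormalisationSketch {
    // x_i := (x_i - mu) / s, where mu is the feature's mean and s its
    // range (largest value minus smallest value).
    static double[] normalise(double[] feature) {
        double min = feature[0];
        double max = feature[0];
        double sum = 0.0;
        for (double v : feature) {
            min = Math.min(min, v);
            max = Math.max(max, v);
            sum += v;
        }
        double mean = sum / feature.length;
        double range = max - min;
        double[] scaled = new double[feature.length];
        for (int i = 0; i < feature.length; i++) {
            scaled[i] = (feature[i] - mean) / range;
        }
        return scaled;
    }

    public static void main(String[] args) {
        // The house sizes from the article: mean 192.25, range 334 - 112 = 222.
        double[] scaled = normalise(new double[]{124, 199, 334, 112});
        System.out.println(Arrays.toString(scaled));
    }
}
```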

### Implementing Feature Scaling

Feature scaling will take our training set, which currently looks like this:

House Size - m^{2} | Bedrooms | Floors | House Age | House Price |
---|---|---|---|---|
124 | 3 | 1 | 3 | 223000 |
199 | 3 | 2 | 11 | 430000 |
334 | 5 | 3 | 22 | 900000 |
112 | 4 | 1 | 44 | 300000 |

And convert it into this:

House Size - m^{2} | Bedrooms | Floors | House Age | House Price |
---|---|---|---|---|
-0.31 | -0.37 | -0.37 | -0.41 | 223000 |
0.03 | -0.37 | 0.12 | -0.21 | 430000 |
0.63 | 0.62 | 0.62 | 0.05 | 900000 |
-0.36 | 0.12 | -0.37 | 0.58 | 300000 |

Notice that our house price is not affected; only features are subject to scaling, not actual values or theta. Also, keep in mind that we compute the range and mean on a *per feature* basis. In other words, you will have *four different sets* of mean and range combinations based on the example above - a unique mean and range for the house size, the bedrooms, the floors, and the house age. As part of your scaling implementation, you'll want to keep these precomputed values handy, and you'll see why next.
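As a sketch of what keeping those precomputed values handy might look like, here's a hypothetical helper that stores one mean/range pair per feature column (the FeatureScaler class and its method names are illustrative, not taken from the project):

```java
import java.util.ArrayList;
import java.util.List;

public class FeatureScaler {
    // One {mean, range} pair per feature column, computed once from the
    // training set and kept for scaling later prediction inputs.
    private final List<double[]> stats = new ArrayList<>();

    FeatureScaler(double[][] trainingSet) {
        int features = trainingSet[0].length;
        for (int j = 0; j < features; j++) {
            double min = trainingSet[0][j];
            double max = trainingSet[0][j];
            double sum = 0.0;
            for (double[] row : trainingSet) {
                min = Math.min(min, row[j]);
                max = Math.max(max, row[j]);
                sum += row[j];
            }
            stats.add(new double[]{sum / trainingSet.length, max - min});
        }
    }

    // Scale a single raw value for feature j with that feature's own
    // precomputed mean and range.
    double scale(int j, double value) {
        double[] s = stats.get(j);
        return (value - s[0]) / s[1];
    }

    public static void main(String[] args) {
        // The house size and bedrooms columns from the two-feature table above.
        FeatureScaler scaler = new FeatureScaler(new double[][]{
            {124, 3}, {199, 3}, {334, 5}, {112, 4}
        });
        System.out.println(scaler.scale(0, 199)); // house size: its own mean/range
        System.out.println(scaler.scale(1, 5));   // bedrooms: its own mean/range
    }
}
```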

As a consequence of scaling, our algorithm will be "trained" on the scaled training set, not the original training set. The resulting function will be able to make accurate predictions with *scaled* inputs, but not with the original inputs. In fact, the values of theta produced from a non-scaled training set will in most circumstances differ from those produced from a scaled training set.

Whenever you want to make predictions with a function trained on a scaled training set, you will also need to remember to scale your inputs. As a concrete example, I've included a simple test which shows how I go about scaling inputs after a training set has been normalised; refer to the comments in the test case for more context.

@Test
public void shouldGenerateTheCorrectScaledPredictiveFunctionWithMultipleVariables() {
    // Create a training set with three features and one actual value.
    SupervisedTrainingSet set = new SupervisedTrainingSet(3);
    set.addInstance(0, 0, 1, 1);
    set.addInstance(1, 1, 1, 4);
    set.addInstance(2, 2, 3, 3);
    set.addInstance(3, 3, 1, 4);
    set.addInstance(4, 4, 3, 1);
    set.addInstance(5, 5, 5, 5);
    // Generate a new training set by passing the original set to the mean
    // normalisation function.
    MeanNormalisation normalisation = new MeanNormalisation(set);
    SupervisedTrainingSet normalisedSet = normalisation.normalise();
    // Train the gradient descent algorithm with the normalised training set.
    // This will output a function with the appropriate thetas already set.
    GradientDescentAlgorithm algorithm = new GradientDescentAlgorithm(normalisedSet);
    MinimisableFunction function = algorithm.minimise(new LinearFunction());
    // I want to make a prediction on values 4, 7, -11, so I need to scale them
    // first, using the correct feature normalisation function.
    Double x0 = normalisation.functionForFeature(0).normalise(4);
    Double x1 = normalisation.functionForFeature(1).normalise(7);
    Double x2 = normalisation.functionForFeature(2).normalise(-11);
    // I can then pass these scaled inputs to the predictive function and
    // assert that I have the right predictions being generated.
    BigDecimal prediction = function.at(x0, x1, x2);
    assertThat(prediction.doubleValue()).isBetween(3.99, 4.01);
}