AN INTRODUCTION TO ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING ALGORITHMS

Brandon John Grenier

Chapter 3: Linear Regression with Multiple Features

In our previous chapter, we developed a linear regression algorithm to predict the price of a home based on a single feature, the size of a house. In this chapter, we’ll develop a more sophisticated linear regression algorithm that can predict the price of a home based on multiple features. Linear regression that uses multiple features, or variables is referred to as multivariate linear regression.

In addition to the size of a home, our updated training set will consider three additional features - the number of bedrooms, the number of bathrooms, and the age of the house, for a total of four distinct features. The updated training set is presented in table 3 below:

Table 3: Home sales training data set with multiple features
House Size (m2)BedroomsBathroomsAge (years)Price ($)
124313223,000
1993211430,000
3345322900,000
1124144300,000

In this chapter, you’ll learn that multivariate linear regression is simply a generalized form of univariate linear regression. In other words, the algorithms that we introduce in this chapter can be applied to models with one, many or an arbitrarily large number of features.

To start, we’ll derive a generalized version of our original hypothesis function that can be used to model linear relationships with one or many features.

The Hypothesis Function

In the previous chapter we modeled our home sales data with the following linear function as our hypothesis function:

$$ h(x) = \theta_0 + \theta_1x $$

In light of our updated home sales data, we’ll need to amend this function so that it can accommodate an arbitrary number of features, not just one. To do this, we extend our original hypothesis function by appending new terms to the function - one term for each additional feature. For example, to accommodate four distinct features, the hypothesis function can be defined as:

$$ h(x) = \theta_0 + \theta_1\color{blue}x_1\color{black} + \theta_2\color{blue}x_2\color{black} + \theta_3\color{blue}x_3\color{black}+ \theta_4\color{blue}x_4\color{black} $$

This hypothesis function accommodates four distinct variables, or features identified by the parameters x1 to x4. We’ve introduced the parameter subscript notation (highlighted in blue above) as a convenient means to keep track of the distinct features in our model.

We can generalize this hypothesis function so that it can be expressed in a single and consistent fashion, irrespective of the number of parameters. To do this, we need to make the first term Θ0 consistent with the rest of the terms in the function. We introduce a convention where we define a new variable, X0 which will always have the value 1. The hypothesis function can now be expressed as:

$$ h(x) = \theta_0\color{blue}x_0\color{black} + \theta_1x_1 + \theta_2x_2\color{black} + \theta_3x_3\color{black}+ \theta_4x_4\color{black} $$

We’ve conventionally defined X0 = 1 to ensure that the value of the first term remains unchanged.

Now that the first term in the function is consistent with the remaining terms, we can introduce a shorthand notation for the hypothesis function so that we don’t have to explicitly write a term for each feature. This is the definition of the generalized hypothesis function that we will use in multivariate linear regression:

$$ h(x)= \theta_0x_0+ … + \theta_nx_n $$

The parameter n is used as a convention to describe the number of features.

The Cost Function

The cost function used in multivariate linear regression is the same cost function that we introduced in univariate linear regression, which we previously defined as:

$$J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=0}^m \left (h(x^{(i)}) - y^{(i)} \right)^2$$

We simply substitute the hypothesis function term h(x(i)) with our multivariate hypothesis function definition, and we get:

$$J(\theta_0 ... \theta_n) = \dfrac {1}{2m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)^2$$

This definition is a bit verbose, so for convenience we’ll express this function using a shorthand notation by substituting all of the enumerated terms (all 0…n terms) with the single term theta. This is the generalized definition of the cost function that we will use in multivariate linear regression:

$$J(\theta) = \dfrac {1}{2m} \displaystyle \sum _{i=0}^m \left (\theta - y^{(i)} \right)^2$$

The Gradient Descent Algorithm

We need to make a couple of practical changes to generalize the gradient descent algorithm we introduced in the previous chapter. Our original gradient descent algorithm referenced the univariate hypothesis function, highlighted in red below.

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\color{red}\theta_0 + \theta_1x^{(i)}\color{black} - y^{(i)} \right) $$ $$ \theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\color{red}\theta_0 + \theta_1x^{(i)}\color{black} - y^{(i)} \right)x^{(i)} $$

}

First, we replace the univariate hypothesis function with our generalized multivariate hypothesis function, and we get the following definition:

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right) $$ $$ \theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)\color{red}x^{(i)} $$

}

Second, we need to update the x(i) term highlighted in red above - we’ll use the same subscript notation that we introduced in our multivariate hypothesis function so that we can keep track of each individual parameter. We end up with the following definition:

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right) $$ $$ \theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_1^{(i)} $$

}

To generalize the gradient descent algorithm, we’ll apply the same approach as we did to generalize our hypothesis function. Again, we introduce a convention where we define a new variable, x0 which will always have the value 1. This new term is highlighted in blue below:

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)\color{blue}x_0^{(i)} $$ $$ \theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_1^{(i)} $$

}

Let’s take an opportunity to review what the gradient descent algorithm would look like if we applied it to our housing data set with 4 features before we generalize the gradient descent algorithm.

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_0^{(i)} \hspace{1cm} contstant $$ $$ \theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_1^{(i)} \hspace{1cm} size $$ $$ \theta_2 := \theta_2 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_2^{(i)} \hspace{1cm} bedrooms $$ $$ \theta_3 := \theta_3 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_3^{(i)} \hspace{1cm} bathrooms $$ $$ \theta_4 := \theta_4 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_4^{(i)} \hspace{1cm} age $$

}

There’s nothing wrong with this expression, other than the need for a fair amount of typing. We’ll introduce a shorthand notation to generalize the gradient descent algorithm by collapsing each individual update operation into a single update operation, with the following definition:

Repeat simultaneously until convergence, for j = 0 to j = n: {

$$ \theta_j := \theta_j - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_j^{(i)} $$

}

Again, all enumerated terms (θ_1,θ_2,θ_(3…) x_1,x_2,x_(3…)) are replaced with parameterized terms θ_j and x_j respectively. This generalized gradient descent algorithm can be used to find optimal solutions with an arbitrary number of features in your model.

Feature Scaling

A new challenge arises when we move from making predictions with a single feature to making predictions with multiple features. To illustrate this challenge, let’s look at a training set with two features - the size of the house and the number of bedrooms.

Table 4: Example training set for feature scaling
House Size (m2)BedroomsPrice ($)
1243223,000
1993430,000
3345900,000
1124300,000

In this training set, house sizes range from 112m2 to 334m2 while the number of bedrooms range from 3 to 5. The range of the house size feature is significantly larger than the range of the bedroom feature. As a consequence, the gradient descent algorithm may have to take many more iterations to converge on an optimal value for the house size feature compared to the bedroom count feature, since the algorithm will have a significantly larger range to traverse.

Imagine a scenario where you have a training set with 10 features, and only one feature had a significant range in values. The gradient descent algorithm would spend an overwhelming amount of its time attempting to converge on that one feature. This isn’t a particularly efficient use of CPU cycles, and with sizeable data sets can significantly increase algorithm run time.

Feature scaling, or data normalization is an optimization technique which will enable the gradient descent algorithm to converge on all features with a reasonably consistent time scale. This method is used to both minimize and standardize the range of values for a specific feature.

For a given feature x, feature scaling will attempt to normalize each instance of x so that all instances lie in a common range, usually within the range of -1 to 1. This constraint is expressed as follows:

-1 <= xi <= 1

While it’s perfectly acceptable for certain features to fall out of this range, the closer you can scale your features to this range the more effectively the gradient descent algorithm will perform.

The specific technique that we’ll use for feature scaling is called mean normalization. Mean normalization is the most common method used for feature scaling and is defined as:

$$ x_i := \dfrac {x_i - u_i}{s_i} $$

Where ui is the average value for feature i and si is the range for feature i, which is simply the difference between the largest instance value and the smallest instance value.

As a concrete example, let’s apply mean normalization to the house size feature in the training set we introduced at the beginning of this section in Table 4. To figure out the range si, we calculate the difference between the largest instance value (in this case 334 m2) and the smallest instance value (in this case 112 m2), and we end up with:

$$ s_i = 334 – 112 = 222 $$

To calculate the average ui, we sum up all house sizes in our training set and divide by the training set size, and we end up with:

$$ u_i = \dfrac{(334 + 112 + 124 + 199)}{4} = 192.25 $$

To scale the value of our first training instance x0, we substitute the values for s0, u0 and x0 into the mean normalization function, and we end up with:

$$ x_0 = \dfrac{x_0 - u_0}{s_0} = \dfrac{124 - 192.25}{222} = -0.31 $$

As an exercise for the reader, use the mean normalization function to scale the remaining three training instances in the training set. You can use the following table to check your work: