Brandon John Grenier

In our previous chapter, we developed a linear regression algorithm to predict the price of a home based on a
single feature: the size of the house. In this chapter, we'll develop a more sophisticated linear regression
algorithm that can predict the price of a home based on multiple features. Linear regression that uses multiple
features, or variables, is referred to as *multivariate linear regression*.

In addition to the size of a home, our updated training set will consider three additional features: the number of bedrooms, the number of bathrooms, and the age of the house, for a total of four distinct features. The updated training set is presented in Table 3 below:

| House Size (m²) | Bedrooms | Bathrooms | Age (years) | Price ($) |
| --- | --- | --- | --- | --- |
| 124 | 3 | 1 | 3 | 223,000 |
| 199 | 3 | 2 | 11 | 430,000 |
| 334 | 5 | 3 | 22 | 900,000 |
| 112 | 4 | 1 | 44 | 300,000 |
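
For readers who want to follow along in code, the training set above could be represented as plain Python lists. This is just a sketch; the field order and variable names are our own choice, not something defined in the chapter.

```python
# Each row holds one training instance:
# (size in m^2, bedrooms, bathrooms, age in years).
# Values are taken from the training set in Table 3.
features = [
    (124, 3, 1, 3),
    (199, 3, 2, 11),
    (334, 5, 3, 22),
    (112, 4, 1, 44),
]

# The corresponding sale price for each instance, in dollars.
prices = [223_000, 430_000, 900_000, 300_000]
```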

In this chapter, you’ll learn that multivariate linear regression is simply a *generalized form* of univariate linear regression. In other words,
the algorithms that we introduce in this chapter can be applied to models with one, many, or an arbitrarily large number of features.

To start, we’ll derive a generalized version of our original hypothesis function that can be used to model linear relationships with one or many features.

In the previous chapter we modeled our home sales data with the following linear function as our hypothesis function:

$$ h(x) = \theta_0 + \theta_1x $$

In light of our updated home sales data, we’ll need to amend this function so that it can accommodate an arbitrary number of features, not just one. To do this, we extend our original hypothesis function by appending new terms to the function, one term for each additional feature. For example, to accommodate four distinct features, the hypothesis function can be defined as:

$$ h(x) = \theta_0 + \theta_1\color{blue}x_1\color{black} + \theta_2\color{blue}x_2\color{black} + \theta_3\color{blue}x_3\color{black}+ \theta_4\color{blue}x_4\color{black} $$

This hypothesis function accommodates four distinct variables, or features, identified by the terms **x_1** through **x_4**.
We’ve introduced the subscript notation (highlighted in blue above) as a convenient means to keep track of the distinct features in our model.

We can generalize this hypothesis function so that it can be expressed in a single, consistent fashion, irrespective of the number of
parameters. To do this, we need to make the first term **θ_0** consistent with the rest of the terms in the function. We introduce a convention
where we define a new variable, **x_0**, which will always have the value **1**. Defining **x_0 = 1** lets us write the first term as **θ_0x_0** while ensuring that its value remains unchanged.

Now that the first term in the function is consistent with the remaining terms, we can introduce a shorthand notation for the hypothesis function so that we don’t have to explicitly write a term for each feature. This is the definition of the generalized hypothesis function that we will use in multivariate linear regression:

$$ h(x)= \theta_0x_0+ … + \theta_nx_n $$

The parameter **n** is used by convention to denote the number of features.
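
As an illustrative sketch (not code from the chapter), the generalized hypothesis can be written as a small Python function. It assumes the **x_0 = 1** convention, and the example values for theta below are hypothetical:

```python
def hypothesis(theta, x):
    """Generalized hypothesis h(x) = theta_0*x_0 + ... + theta_n*x_n.

    x is expected to follow the convention x[0] = 1, so theta[0]
    contributes the constant term.
    """
    return sum(t * xi for t, xi in zip(theta, x))

# With illustrative parameters theta = [50_000, 1_500] and a single
# feature (a 124 m^2 house), h(x) = 50_000 + 1_500 * 124 = 236_000.
price = hypothesis([50_000, 1_500], [1, 124])
```

The same function works unchanged for one feature or for all four features in our housing model; only the lengths of `theta` and `x` change.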

The cost function used in multivariate linear regression is the same cost function that we introduced in univariate linear regression, which we previously defined as:

$$J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h(x^{(i)}) - y^{(i)} \right)^2$$
We simply substitute the hypothesis function term **h(x^{(i)})** with our multivariate hypothesis function definition, and we get:

$$J(\theta_0, \ldots, \theta_n) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (\theta_0x_0^{(i)} + \ldots + \theta_nx_n^{(i)} - y^{(i)} \right)^2$$

This definition is a bit verbose, so for convenience we’ll express this function using a shorthand notation: the enumerated parameter list (θ_0 through θ_n) is replaced with the single symbol **θ**. This is the generalized definition of the cost function that we will use in multivariate linear regression:

$$J(\theta) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h(x^{(i)}) - y^{(i)} \right)^2$$

We need to make a couple of practical changes to generalize the gradient descent algorithm we introduced in the previous chapter. Our original gradient descent algorithm referenced the univariate hypothesis function, highlighted in red below.
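
To make the cost function concrete, here is a minimal Python sketch. The helper is our own, not code from the chapter, and it assumes each training instance begins with the conventional **x_0 = 1** entry:

```python
def cost(theta, X, y):
    """Squared-error cost J(theta) = (1/2m) * sum((h(x^(i)) - y^(i))^2).

    X is a list of training instances; each instance is assumed to
    start with x_0 = 1. y holds the observed values.
    """
    m = len(y)
    total = 0.0
    for x, yi in zip(X, y):
        # h(x) = theta_0*x_0 + ... + theta_n*x_n
        h = sum(t * xi for t, xi in zip(theta, x))
        total += (h - yi) ** 2
    return total / (2 * m)
```

For example, with theta chosen so that the hypothesis fits the data exactly, the cost is zero; any misfit increases it.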

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (\color{red}\theta_0 + \theta_1x^{(i)}\color{black} - y^{(i)} \right) $$
$$ \theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (\color{red}\theta_0 + \theta_1x^{(i)}\color{black} - y^{(i)} \right)x^{(i)} $$

}

First, we replace the univariate hypothesis function with our generalized multivariate hypothesis function, and we get the following definition:

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right) $$
$$ \theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)\color{red}x^{(i)} $$

}

Second, we need to update the **x^{(i)}** term highlighted in red above. We’ll use the same subscript notation that we introduced in our
multivariate hypothesis function so that we can keep track of each individual feature. We end up with the following definition:

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right) $$
$$ \theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_1^{(i)} $$

}

To generalize the gradient descent algorithm, we’ll apply the same approach as we did to generalize our hypothesis function. Again, we use the convention that the variable **x_0** always has the value **1**. This new term is highlighted in blue below:

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)\color{blue}x_0^{(i)} $$
$$ \theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_1^{(i)} $$

}

Before we generalize the gradient descent algorithm, let’s review what it would look like applied to our housing data set with four features.

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_0^{(i)} \hspace{1cm} \text{constant} $$
$$ \theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_1^{(i)} \hspace{1cm} \text{size} $$
$$ \theta_2 := \theta_2 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_2^{(i)} \hspace{1cm} \text{bedrooms} $$
$$ \theta_3 := \theta_3 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_3^{(i)} \hspace{1cm} \text{bathrooms} $$
$$ \theta_4 := \theta_4 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_4^{(i)} \hspace{1cm} \text{age} $$

}

There’s nothing wrong with this expression, other than the need for a fair amount of typing. We’ll introduce a shorthand notation to generalize the gradient descent algorithm by collapsing each individual update operation into a single update operation, with the following definition:

repeat simultaneously until convergence, for j = 0 to j = n: {

$$ \theta_j := \theta_j - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_j^{(i)} $$

}

Again, all enumerated terms (θ_1, θ_2, θ_3, … and x_1, x_2, x_3, …) are replaced with the parameterized terms θ_j and x_j respectively. This generalized gradient descent algorithm can be used to find optimal solutions for models with an arbitrary number of features.
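
Putting the generalized update rule into code, a minimal batch gradient descent sketch might look like the following. This is our own illustration rather than the chapter's code; the learning rate and iteration count are arbitrary illustrative defaults:

```python
def gradient_descent(X, y, alpha=0.01, iterations=1000):
    """Batch gradient descent for multivariate linear regression.

    X is a list of training instances, each assumed to begin with
    x_0 = 1; y holds the observed values. alpha is the learning rate.
    """
    m, n = len(y), len(X[0])
    theta = [0.0] * n
    for _ in range(iterations):
        # h(x^(i)) for every training instance.
        predictions = [sum(t * xi for t, xi in zip(theta, x)) for x in X]
        errors = [p - yi for p, yi in zip(predictions, y)]
        # Compute every gradient before touching theta, so all theta_j
        # values are updated simultaneously, as the rule requires.
        grads = [sum(e * x[j] for e, x in zip(errors, X)) / m
                 for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, grads)]
    return theta
```

Note the simultaneous update: computing all gradients from the current theta before replacing it is what the "repeat simultaneously" wording in the algorithm refers to.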

A new challenge arises when we move from making predictions with a single feature to making predictions with multiple features. To illustrate this challenge, let’s look at a training set with two features - the size of the house and the number of bedrooms.

| House Size (m²) | Bedrooms | Price ($) |
| --- | --- | --- |
| 124 | 3 | 223,000 |
| 199 | 3 | 430,000 |
| 334 | 5 | 900,000 |
| 112 | 4 | 300,000 |

In this training set, house sizes range from **112 m²** to **334 m²**, while the number of bedrooms ranges
from **3** to **5**. The range of the house size feature is significantly larger than the range of the bedroom feature.
As a consequence, the gradient descent algorithm may take many more iterations to converge on an optimal value for the house size
feature than for the bedroom feature, since the algorithm has a significantly larger range to traverse.

Imagine a scenario where you have a training set with 10 features, and only one feature had a significant range in values. The gradient descent algorithm would spend an overwhelming amount of its time attempting to converge on that one feature. This isn’t a particularly efficient use of CPU cycles, and with sizeable data sets can significantly increase algorithm run time.

**Feature scaling**, or data normalization, is an optimization technique that enables the gradient descent
algorithm to converge on all features within a reasonably consistent time scale. The method both reduces and standardizes the
range of values for each feature.

For a given feature **x**, feature scaling will attempt to normalize each instance of **x** so that all instances lie in a common range,
usually between -1 and 1. This constraint is expressed as follows:

$$ -1 \le x_i \le 1 $$

While it’s perfectly acceptable for certain features to fall out of this range, the closer you can scale your features to this range the more effectively the gradient descent algorithm will perform.

The specific technique that we’ll use for feature scaling is called *mean normalization*. Mean normalization is the most common method used
for feature scaling and is defined as:

$$ x_i := \dfrac{x_i - u_i}{s_i} $$

Where **u_i** is the average value of feature **i** in the training set, and **s_i** is the range of the feature (the largest instance value minus the smallest instance value).

As a concrete example, let’s apply mean normalization to the house size feature in the training set we introduced at the beginning of this section
in Table 4. To figure out the range **s_i**, we calculate the difference between the largest instance value (in this case 334 m²) and the
smallest instance value (in this case 112 m²), and we end up with:

$$ s_i = 334 - 112 = 222 $$

To calculate the average **u_i**, we sum up all of the house sizes in our training set and divide by the number of training instances, and we end up with:

$$ u_i = \dfrac{124 + 199 + 334 + 112}{4} = 192.25 $$

To scale the house size of our first training instance, we substitute the values for **u_i** and **s_i** into the mean normalization function:

$$ x_1 := \dfrac{124 - 192.25}{222} \approx -0.31 $$

As an exercise for the reader, use the mean normalization function to scale the remaining three training instances in the training set. You can use the following table to check your work:

| House Size (m²) | Scaled Size |
| --- | --- |
| 124 | -0.31 |
| 199 | 0.03 |
| 334 | 0.64 |
| 112 | -0.36 |
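
You can also check the exercise programmatically with a short sketch of mean normalization (plain Python; the function name is our own):

```python
def mean_normalize(values):
    """Scale one feature with mean normalization: x := (x - u) / s,
    where u is the feature's average and s is its range (max - min)."""
    u = sum(values) / len(values)
    s = max(values) - min(values)
    return [(v - u) / s for v in values]

sizes = [124, 199, 334, 112]
scaled = mean_normalize(sizes)
# scaled[0] is (124 - 192.25) / 222, roughly -0.31.
```

After scaling, every house size lies comfortably within the -1 to 1 range, so gradient descent can converge on this feature at a pace consistent with the bedroom feature.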