In the previous chapter, we developed a simple linear regression model to predict the price of Apple shares using a single feature – the trading date. While this served as a solid introduction to machine learning, the model we developed has quite a bit of room for improvement.

In this chapter, we’ll build a more capable predictive model that can leverage an arbitrarily large number of features. There’s an
abundance of data in financial markets that we could use to produce a robust model of our domain, from economic and macroeconomic
indicators, company fundamentals, technical analysis, news, sentiment and more. Linear regression that makes use of multiple
features or variables is referred to as **multivariate linear regression**.

The key takeaway from this chapter is that multivariate linear regression is simply a *generalized form* of univariate linear
regression. The hypothesis function, cost function and gradient descent algorithm from the previous chapter will be refactored to
support an arbitrarily large number of features.

To begin this chapter, we’ll introduce two phenomena observed in financial markets known respectively as the *Weekend Effect* and the
*September Effect*, and expand on the ideas of feature engineering in an attempt to capture these phenomena in data.

The Weekend Effect describes a phenomenon where the United States stock market has been historically observed to have significantly
lower returns on Mondays compared to the immediately preceding Fridays. Modern analysis suggests that the Weekend Effect is a statistically
significant *historical* anomaly, but the effect appears to have lost statistical significance in the last 40 years.

The September Effect describes a phenomenon where the stock market has been historically observed to have its worst performance in September. Between 1945 and 2022, the S&P 500 index averaged returns of -0.73% in September, in contrast to the best performing month of April which averaged returns of +1.6%.

Let’s apply feature extraction to the date feature from the previous chapter’s data set to create new features that will allow us to investigate the impact of these phenomena on Apple’s share price.

The Weekend Effect suggests that a relationship exists between the day of the week and an underlying asset’s share price. To discover whether this phenomenon has an impact on Apple’s share price, we extract the day of the week from the date feature and create a new feature, which we’ll call *Day of Week*.

We use a simple convention to encode the Day of Week feature by declaring Monday = 1, Tuesday = 2, Wednesday = 3, Thursday = 4 and Friday = 5. The encoding is conventional in the sense that we could use any number range, the key is that the convention is applied consistently across our entire data set.

The September Effect suggests that a relationship exists between the month of the year and an underlying asset’s share price. To discover
whether this phenomenon has an impact on Apple’s share price, we extract the month from the date feature and create an additional
feature which we’ll call *Month*.

To evaluate the impact of these phenomena over longer periods of time, we extract the year from the date feature and create a third
feature which we’ll call *Year*. Finally, we extract the day of the month from the date feature to create a fourth feature
which we’ll call *Day* – we’re doing this as a convenient means to keep track of the full date.
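As a quick sketch of this feature-extraction step (a minimal illustration using only Python’s standard library; the `extract_date_features` name is our own), all four features can be pulled out of a single date:

```python
from datetime import date

def extract_date_features(d: date) -> dict:
    """Extract Year, Month, Day and Day of Week from a trading date.

    Day of Week follows the chapter's convention: Monday = 1 ... Friday = 5.
    """
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        # date.weekday() is 0-indexed from Monday, so shift by one
        "day_of_week": d.weekday() + 1,
    }

# 1 June 2023 was a Thursday, so Day of Week should be 4
print(extract_date_features(date(2023, 6, 1)))
```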

The table below presents an updated training set with our four new features.

| Year | Month | Day | Day of Week | Close |
|------|-------|-----|-------------|-------|
| 2023 | 6 | 1 | 4 | 180.09 |
| 2023 | 6 | 2 | 5 | 180.95 |
| 2023 | 6 | 5 | 1 | 179.58 |
| 2023 | 6 | 6 | 2 | 179.21 |
| 2023 | 6 | 7 | 3 | 177.82 |
| 2023 | 6 | 8 | 4 | 180.57 |
| 2023 | 6 | 9 | 5 | 180.96 |
| 2023 | 6 | 12 | 1 | 183.79 |
| 2023 | 6 | 13 | 2 | 183.31 |
| 2023 | 6 | 14 | 3 | 183.95 |
| 2023 | 6 | 15 | 4 | 186.01 |
| 2023 | 6 | 16 | 5 | 184.92 |
| 2023 | 6 | 20 | 2 | 185.01 |
| 2023 | 6 | 21 | 3 | 183.96 |
| 2023 | 6 | 22 | 4 | 187.00 |

This training set will allow our model to learn and discover whether any meaningful relationships exist between these market phenomena and Apple’s share price.

Let’s continue our use of feature engineering to determine whether a relationship exists between historical share prices and future share
prices. To explore this idea, let’s create two additional **lagging features**: the *Close (T-1)* feature captures Apple’s closing price from
one day ago, while the *Close (T-2)* feature captures Apple’s closing price from two days ago. The table below represents the training
set that we’ll use in this chapter.

| Year | Month | Day | Day of Week | Close (T-1) | Close (T-2) | Close |
|------|-------|-----|-------------|-------------|-------------|-------|
| 2023 | 6 | 1 | 4 | 177.25 | 177.30 | 180.09 |
| 2023 | 6 | 2 | 5 | 180.09 | 177.25 | 180.95 |
| 2023 | 6 | 5 | 1 | 180.95 | 180.09 | 179.58 |
| 2023 | 6 | 6 | 2 | 179.58 | 180.95 | 179.21 |
| 2023 | 6 | 7 | 3 | 179.21 | 179.58 | 177.82 |
| 2023 | 6 | 8 | 4 | 177.82 | 179.21 | 180.57 |
| 2023 | 6 | 9 | 5 | 180.57 | 177.82 | 180.96 |
| 2023 | 6 | 12 | 1 | 180.96 | 180.57 | 183.79 |
| 2023 | 6 | 13 | 2 | 183.79 | 180.96 | 183.31 |
| 2023 | 6 | 14 | 3 | 183.31 | 183.79 | 183.95 |
| 2023 | 6 | 15 | 4 | 183.95 | 183.31 | 186.01 |
| 2023 | 6 | 16 | 5 | 186.01 | 183.95 | 184.92 |
| 2023 | 6 | 20 | 2 | 184.92 | 186.01 | 185.01 |
| 2023 | 6 | 21 | 3 | 185.01 | 184.92 | 183.96 |
| 2023 | 6 | 22 | 4 | 183.96 | 185.01 | 187.00 |
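A lagging feature is simply the closing-price series shifted back in time. As a hedged sketch (the function name is our own), the lagged rows above can be derived from a chronological list of closes:

```python
def add_lagging_features(closes):
    """Build (close_t_minus_2, close_t_minus_1, close) rows from a
    chronological list of closing prices. The first two closes have no
    complete history, so rows start at the third price."""
    return [
        (closes[t - 2], closes[t - 1], closes[t])
        for t in range(2, len(closes))
    ]

# The 30 and 31 May 2023 closes seed the lag history for 1 June
closes = [177.30, 177.25, 180.09, 180.95, 179.58]
rows = add_lagging_features(closes)
print(rows[0])  # 1 June row: Close (T-2), Close (T-1), Close
```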

In the previous chapter, we modelled Apple's share price using the following linear function as our hypothesis function:

$$ h_\theta(x) = \theta_0 + \theta_1x $$

This model limits us to exploring relationships between Apple’s share price and a single feature, so we’ll need to extend it if we want to explore relationships between Apple’s share price and the six features in our new data set. For each feature in our model, we extend the hypothesis function by adding one new term to the function. As an example, a hypothesis function accommodating six features can be expressed as:

$$ h_\theta(x) = \theta_0 + \theta_1\color{blue}x_1\color{black} + \theta_2\color{blue}x_2\color{black} + \theta_3\color{blue}x_3\color{black} + \theta_4\color{blue}x_4\color{black} + \theta_5\color{blue}x_5\color{black} + \theta_6\color{blue}x_6\color{black} $$
The variables **x_1** to **x_6** represent the six features in our data set: Year, Month, Day, Day of Week, Close (T-1) and Close (T-2) respectively.

To generalize this function, we introduce a new independent variable **x_0**, highlighted in blue below. The introduction of this independent variable makes the first term in the equation consistent with the remaining terms:

$$ h_\theta(x) = \theta_0\color{blue}x_0\color{black} + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + \theta_4x_4 + \theta_5x_5 + \theta_6x_6 $$

We conventionally define **x_0 = 1** to ensure the value of the first term remains unchanged, since any value multiplied by 1 is equal to itself. Now that all of the terms are expressed consistently, we can provide a generalized definition of the hypothesis function:

$$ h_\theta(x) = \theta_0x_0 + \theta_1x_1 + ... + \theta_nx_n $$

The value **n** describes the number of features.
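As a minimal sketch in pure Python (names are our own), the generalized hypothesis is just a sum of parameter–feature products, with the constant x_0 = 1 prepended:

```python
def hypothesis(theta, features):
    """Evaluate h(x) = theta_0*x_0 + theta_1*x_1 + ... + theta_n*x_n
    with x_0 = 1 by convention.

    `theta` holds n + 1 parameters; `features` holds the n feature values.
    """
    x = [1.0] + list(features)
    return sum(t * xi for t, xi in zip(theta, x))

# theta_0*1 + theta_1*4 + theta_2*5 = 1 + 8 + 15 = 24
print(hypothesis([1.0, 2.0, 3.0], [4.0, 5.0]))
```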

If it’s been a while since you’ve studied matrices or vectors, the overview of Linear Algebra in Chapter 12 will provide you with the necessary context to follow along in this section. The purpose of this section is to show you that the generalized definition of the hypothesis function can be more conveniently expressed in linear algebraic notation.

We can think about each of the parameters in the hypothesis function as elements in a parameter vector:

$ \theta = \begin{bmatrix}\theta_0\\\theta_1\\...\\\theta_n\end{bmatrix} $

We can also think about each of the features in the hypothesis function as elements in a feature vector:

$ x = \begin{bmatrix}x_0\\x_1\\...\\x_n\end{bmatrix} $
If we transpose the parameter vector **θ** we end up with a row vector **θ^T**:

$ \theta^T = \begin{bmatrix}\theta_0 & \theta_1 & ... & \theta_n\end{bmatrix} $

If we then take the product of the row vector **θ^T** and the feature vector **x**, we end up with:

$ \theta^Tx = \begin{bmatrix}\theta_0 & \theta_1 & ... & \theta_n\end{bmatrix} \cdot \begin{bmatrix}x_0\\x_1\\...\\x_n\end{bmatrix} = \theta_0x_0 + \theta_1x_1 + ... + \theta_nx_n = h(x) $
As it turns out, **θ^T x** is an equivalent and far more compact way to express the multivariate hypothesis function **h(x)**.
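If NumPy is available, this equivalence is easy to check numerically: the dot product of the parameter vector and the feature vector matches the term-by-term sum (the values below are arbitrary):

```python
import numpy as np

theta = np.array([1.0, 2.0, 3.0])   # theta_0, theta_1, theta_2
x = np.array([1.0, 4.0, 5.0])       # x_0 = 1 by convention

# theta^T x computed two ways: as a dot product and term by term
dot = theta @ x
expanded = sum(t * xi for t, xi in zip(theta, x))
print(dot, expanded)  # both 24.0
```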

In the previous chapter, we defined the cost function as:

$$J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=0}^m \left (h(x^{(i)}) - y^{(i)} \right)^2$$
This cost function can be used to compute the cost of the generalized hypothesis function as well. We substitute the *reference* to the hypothesis function h(x) with the *definition* of our generalized hypothesis function, and we get:

$$J(\theta_0, ..., \theta_n) = \dfrac {1}{2m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)^2$$

This definition is a bit verbose, so let’s express this function using the linear algebraic notation we’ve just discussed.

The arguments to the cost function (θ_0 ... θ_n) are replaced with the single parameter vector **θ**, and the definition of the generalized hypothesis function is replaced with its linear algebraic equivalent, **θ^T x**. Using this notation, we end up with a more compact and convenient way to describe the cost function for multivariate linear regression:

$$ J(\theta) = \dfrac {1}{2m} \displaystyle \sum _{i=0}^m \left (\theta^Tx^{(i)} - y^{(i)} \right)^2 $$
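A vectorized sketch of this cost function (assuming NumPy; `X` is the m × (n+1) design matrix whose first column is all ones, so x_0 = 1 for every example):

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = (1 / 2m) * sum of squared prediction errors."""
    m = len(y)
    errors = X @ theta - y          # theta^T x_i - y_i for every example
    return (errors @ errors) / (2 * m)

X = np.array([[1.0, 1.0],
              [1.0, 2.0]])          # column of ones, then one feature
y = np.array([1.0, 2.0])
print(cost(np.array([0.0, 1.0]), X, y))  # perfect fit: cost is 0.0
print(cost(np.array([0.0, 0.0]), X, y))  # errors -1, -2 -> 5 / 4 = 1.25
```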

In the previous chapter, we defined the gradient descent algorithm as follows:

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\color{black}\theta_0 + \theta_1x^{(i)}\color{black} - y^{(i)} \right) $$
$$ \theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\color{black}\theta_0 + \theta_1x^{(i)}\color{black} - y^{(i)} \right)x^{(i)} $$

}

This algorithm has similar limitations to our univariate hypothesis function. In its current form, we’re limited to discovering optimal values for just two parameters - **θ_0** and **θ_1**. Let’s generalize the algorithm so that it can discover optimal values for an arbitrary number of parameters.

First, let’s substitute the univariate hypothesis function with the generalized hypothesis function:

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right) $$
$$ \theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)\color{black}x^{(i)} $$

}

When we generalized the hypothesis function, we started by introducing one new term to the function for each new feature in our data set. When we apply the same approach to the gradient descent algorithm, we end up with:

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right) \hspace{1cm} $$
$$ \theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)\color{blue}x_1^{(i)} \hspace{0.4cm} \color{black}year $$
$$ \theta_2 := \theta_2 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)\color{blue}x_2^{(i)} \hspace{0.4cm} \color{black}month $$
$$ \theta_3 := \theta_3 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)\color{blue}x_3^{(i)} \hspace{0.4cm} \color{black}day $$
$$ \theta_4 := \theta_4 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)\color{blue}x_4^{(i)} \hspace{0.4cm} \color{black}day of week $$
$$ \theta_5 := \theta_5 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)\color{blue}x_5^{(i)} \hspace{0.4cm} \color{black}close (t-1) $$
$$ \theta_6 := \theta_6 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)\color{blue}x_6^{(i)} \hspace{0.4cm} \color{black}close (t-2) $$

}

The variables **x_1** to **x_6** are highlighted in blue to emphasize the pairing between each parameter and its corresponding feature. Notice that the update rule for **θ_0** is missing a trailing feature variable. To express it consistently with the remaining update rules, we introduce the independent variable **x_0**:

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)\color{blue}x_0^{(i)} \hspace{0.4cm} \color{black}constant (1)$$
$$ \theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_1^{(i)} \hspace{0.4cm} year $$
$$ \theta_2 := \theta_2 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_2^{(i)} \hspace{0.4cm} month $$
$$ \theta_3 := \theta_3 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_3^{(i)} \hspace{0.4cm} day $$
$$ \theta_4 := \theta_4 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_4^{(i)} \hspace{0.4cm} day of week $$
$$ \theta_5 := \theta_5 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_5^{(i)} \hspace{0.4cm} close (t-1) $$
$$ \theta_6 := \theta_6 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_6^{(i)} \hspace{0.4cm} close (t-2) $$

}

Again, we conventionally define **x _{0} = 1** to ensure the value of the first term remains unchanged since any
value multiplied by 1 is equal to itself.

Now that all of the terms are expressed consistently, we can provide a generalized definition of the gradient descent algorithm:

Repeat simultaneously until convergence, for j = 0 to j = n: {

$$ \theta_j := \theta_j - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^m \left (\theta_0x_0^{(i)} + ... + \theta_nx_n^{(i)} - y^{(i)} \right)x_j^{(i)} $$

}

The value **n** describes the number of features, while **j** acts as a reference or index to a specific
feature.

It’s important to highlight a key implementation detail that you may have noticed when you saw the sentence “Repeat **simultaneously** until convergence”. Within each iteration of gradient descent, every parameter’s new value must be computed from the *current* values of all parameters before any of them is updated. In other words, the parameters cannot be updated one at a time as you work through the update rules - compute all of the new values first, then assign them together as a single atomic unit.
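The simultaneous-update requirement comes for free in a vectorized implementation, since every component of θ is recomputed from the same error vector before any assignment happens. A hedged sketch (NumPy assumed; names are our own):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=5000):
    """Multivariate gradient descent. X is an m x (n+1) matrix with a
    leading column of ones (x_0 = 1); all parameters are updated
    simultaneously on each iteration."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(iterations):
        errors = X @ theta - y            # h(x) - y for every example
        gradient = (X.T @ errors) / m     # one partial derivative per theta_j
        theta = theta - alpha * gradient  # atomic, simultaneous update
    return theta

# Recover y = 2x from three noiseless examples
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = gradient_descent(X, y)
print(theta)  # approximately [0, 2]
```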

With a few changes to our previous linear regression model, we’ve been able to develop a far more capable linear regression model that can learn from an arbitrarily large number of features. It’s a nice step up and has far broader applicability towards solving real-world use cases. Hopefully, this inspires you to think about all of the potential sources of market & financial data that you could leverage to make incredibly rich equity predictive models.

The benefit of broader capability comes with one new challenge to solve - not all features in a model are created equal. To wrap up this chapter, we’ll discuss this challenge and how a technique called feature scaling can help address it.

To illustrate the new challenge that arises when working with multivariate models, let’s look at a (non-engineered) subset of Apple share price data which captures the opening price, intraday high, intraday low, daily volume and closing price.

| Date | Open | High | Low | Volume | Close |
|------|------|------|-----|--------|-------|
| 1 June 2023 | 177.70 | 180.12 | 176.93 | 68,901,800 | 180.09 |
| 2 June 2023 | 181.03 | 181.78 | 179.26 | 61,945,900 | 180.95 |
| 5 June 2023 | 182.63 | 184.95 | 178.04 | 121,946,500 | 179.58 |
| 6 June 2023 | 179.97 | 180.12 | 177.43 | 64,848,400 | 179.21 |

Note that the opening price ranges between **177.70** and **182.63**, the intraday high ranges
between **180.12** and **184.95**, the intraday low ranges between **176.93** and
**179.26** while the volume ranges between **61,945,900** and **121,946,500**.

The range of the volume feature is significantly larger than the range of the open, high and low features. As a consequence, the gradient descent algorithm may take many more iterations to converge on an optimal parameter value for the volume feature compared to the open, high or low features, since the algorithm has a significantly larger range to traverse. This isn’t a particularly efficient use of your GPU or CPU cycles, and with sizeable data sets this range disparity can significantly increase training time.

**Feature scaling** describes a variety of techniques that aim to standardize or normalize the range of independent variables
or features of data. Feature scaling has broad applicability in the fields of data science and machine learning as there are a number of
distinct motivating factors for its adoption. For the purpose of this chapter, we’ll use feature scaling as a means to improve the
performance of gradient descent. More specifically, we’ll introduce two different feature scaling techniques that you can use to
standardize or normalize data.

Min-max scaling, also referred to as *min-max normalization* or *rescaling*, is used to standardize the range of values for a specific feature. For a given feature x, min-max scaling transforms each instance of x so that all instances lie in a common range, typically between 0 and 1. The min-max scaling function is defined as:

$$ x_i^\prime := \dfrac {x_i - min(x)}{max(x) - min(x)} $$

Where **x_i** is the value of the training instance at index **i**, min(x) is the smallest value of the feature across the training set, and max(x) is the largest. For our volume feature:

min(x) = 61,945,900

max(x) = 121,946,500

To scale the first instance in our volume feature we substitute the values for min(x), max(x) and our first training instance x_0, and we get:

$$ x_0^\prime = \dfrac {68{,}901{,}800 - 61{,}945{,}900}{121{,}946{,}500 - 61{,}945{,}900} \approx 0.116 $$

As an exercise for the reader, use min-max scaling to scale the remaining three training instances in the training set. You can use the following table to check your work:

| Volume | Rescaled Volume |
|--------|-----------------|
| 68,901,800 | 0.116 |
| 61,945,900 | 0.000 |
| 121,946,500 | 1.000 |
| 64,848,400 | 0.048 |

Note that in Table 3-4 above, the min volume has a rescaled volume of exactly 0, while the max volume has a rescaled volume of exactly 1. Any volumes between the min and max will consequently have rescaled volumes within the range of 0 and 1.
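This rescaling can be sketched in a few lines of Python (the function name is our own):

```python
def min_max_scale(values):
    """Rescale a feature so its values lie between 0 and 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

volumes = [68_901_800, 61_945_900, 121_946_500, 64_848_400]
print([round(v, 3) for v in min_max_scale(volumes)])
# → [0.116, 0.0, 1.0, 0.048]
```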

Although the difference between min-max scaling and mean normalization appears cosmetically subtle, mean normalization applies a more substantial transformation to your data. While the purpose of min-max scaling is simply to rescale data into a fixed range, the purpose of mean normalization is to *center* data around zero, so that the transformed feature has a mean of approximately 0 and a roughly equal number of observations fall above and below it. For a given feature x, mean normalization transforms each instance of x so that all instances lie in a common range, typically between -1 and 1.

Many learning algorithms perform better when features are centered and share a similar scale, and consequently applying mean normalization to data has a tendency to produce better results. The mean normalization function is defined as:

$$ x_i^\prime := \dfrac {x_i - mean(x)}{max(x) - min(x)} $$
Where **x_i** is the value of the training instance at index **i**, and mean(x) is the average value of the feature across the training set. For our volume feature:

mean(x) = 79,410,650

To normalize the first instance in our volume feature we substitute the values for mean(x), min(x), max(x) and our first training instance x_0, and we get:

$$ x_0^\prime = \dfrac {68{,}901{,}800 - 79{,}410{,}650}{121{,}946{,}500 - 61{,}945{,}900} \approx -0.175 $$

As an exercise for the reader, use the mean normalization function to scale the remaining three training instances in the training set. You can use the following table to check your work:

| Volume | Mean Normalized Volume |
|--------|------------------------|
| 68,901,800 | -0.175 |
| 61,945,900 | -0.291 |
| 121,946,500 | 0.709 |
| 64,848,400 | -0.243 |
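The same computation can be sketched in Python (again, the function name is our own):

```python
def mean_normalize(values):
    """Center a feature around its mean, scaled by its range."""
    mu = sum(values) / len(values)
    lo, hi = min(values), max(values)
    return [(v - mu) / (hi - lo) for v in values]

volumes = [68_901_800, 61_945_900, 121_946_500, 64_848_400]
print([round(v, 3) for v in mean_normalize(volumes)])
# → [-0.175, -0.291, 0.709, -0.243]
```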

It’s important to note that once you’ve trained a model using *scaled features*, your model will be only able to make
accurate predictions with *scaled inputs*.

We’ve published an implementation of the multivariate linear regression model on GitHub.

If you would like to implement, train and run this linear regression model yourself, feel free to use the source code as a reference and run the software to see how training and running the model works in practice. Documentation related to building, training and running the model can be found on GitHub as well.

https://github.com/build-thinking-machines/multivariate-linear-regression