Linear Regression with Multiple Features


Introduction

In the previous chapter, we developed a simple linear regression model to predict the price of Apple shares using a single feature – the trading date. While this served as a solid introduction to machine learning, the model we developed has quite a bit of room for improvement.

In this chapter, we’ll build a more capable predictive model that can leverage an arbitrarily large number of features. There’s an abundance of data in financial markets that we could use to produce a robust model of our domain, from economic and macroeconomic indicators, company fundamentals, technical analysis, news, sentiment and more. Linear regression that makes use of multiple features or variables is referred to as multivariate linear regression.

The key takeaway from this chapter is that multivariate linear regression is simply a generalized form of univariate linear regression. The hypothesis function, cost function and gradient descent algorithm from the previous chapter will be refactored to support an arbitrarily large number of features.

To begin this chapter, we’ll introduce two phenomena observed in financial markets known respectively as the Weekend Effect and the September Effect, and expand on the ideas of feature engineering in an attempt to capture these phenomena in data.

The Weekend Effect

The Weekend Effect describes a phenomenon where the United States stock market has been historically observed to have significantly lower returns on Mondays compared to the immediately preceding Fridays. Modern analysis suggests that the Weekend Effect was a statistically significant anomaly historically, but that the effect has lost its statistical significance over the last 40 years.

The September Effect

The September Effect describes a phenomenon where the stock market has been historically observed to have its worst performance in September. Between 1945 and 2022, the S&P 500 index averaged returns of -0.73% in September, in contrast to the best performing month of April which averaged returns of +1.6%.

Modelling Financial Market Phenomena

Let’s apply feature extraction to the date feature from the previous chapter’s data set to create new features that will allow us to investigate the impact of these phenomena on Apple’s share price.

The Weekend Effect suggests that a relationship exists between the day of the week and an underlying asset’s share price. To discover whether this phenomenon has an impact on Apple’s share price, we extract the day of the week from the date feature and create a new feature, which we’ll call Day of Week.

We use a simple convention to encode the Day of Week feature by declaring Monday = 1, Tuesday = 2, Wednesday = 3, Thursday = 4 and Friday = 5. The encoding is conventional in the sense that we could use any number range; the key is that the convention is applied consistently across our entire data set.

The September Effect suggests that a relationship exists between the month of the year and an underlying asset’s share price. To discover whether this phenomenon has an impact on Apple’s share price, we extract the month from the date feature and create an additional feature which we’ll call Month.

To evaluate the impact of these phenomena over longer periods of time, we extract the year from the date feature and create a third feature which we’ll call Year. Finally, we extract the day of the month from the date feature to create a fourth feature which we’ll call Day – we’re doing this as a convenient means to keep track of the full date.
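To make this concrete, here’s a minimal sketch of the feature extraction in Python (assuming pandas is available; the rows shown are just the first few from our data set, for illustration):

```python
import pandas as pd

# A few rows of the previous chapter's data set, for illustration.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2023-06-01", "2023-06-02", "2023-06-05"]),
    "Close": [180.09, 180.95, 179.58],
})

# Extract the four engineered features from the Date column.
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day
# pandas encodes Monday = 0 through Sunday = 6, so we add 1 to match our
# Monday = 1 through Friday = 5 convention.
df["Day of Week"] = df["Date"].dt.dayofweek + 1
```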

The table below presents an updated training set with our four new features.

Table 3-1: Apple share price with engineered features, June 1st 2023 to June 22nd 2023
Year Month Day Day of Week Close
2023 6 1 4 180.09
2023 6 2 5 180.95
2023 6 5 1 179.58
2023 6 6 2 179.21
2023 6 7 3 177.82
2023 6 8 4 180.57
2023 6 9 5 180.96
2023 6 12 1 183.79
2023 6 13 2 183.31
2023 6 14 3 183.95
2023 6 15 4 186.01
2023 6 16 5 184.92
2023 6 20 2 185.01
2023 6 21 3 183.96
2023 6 22 4 187.00

This training set will allow our model to learn and discover whether any meaningful relationships exist between these market phenomena and Apple’s share price.

Let’s continue our use of feature engineering to determine whether a relationship exists between historical share prices and future share prices. To explore this idea, let’s create two additional lagging features: the Close (T-1) feature captures Apple’s closing price from one day ago, while the Close (T-2) feature captures Apple’s closing price from two days ago. The table below represents the training set that we’ll use in this chapter.

Table 3-2: Apple share price with engineered features, June 1st 2023 to June 22nd 2023
Year Month Day Day of Week Close (T-1) Close (T-2) Close
2023 6 1 4 177.25 177.3 180.09
2023 6 2 5 180.09 177.25 180.95
2023 6 5 1 180.95 180.09 179.58
2023 6 6 2 179.58 180.95 179.21
2023 6 7 3 179.21 179.58 177.82
2023 6 8 4 177.82 179.21 180.57
2023 6 9 5 180.57 177.82 180.96
2023 6 12 1 180.96 180.57 183.79
2023 6 13 2 183.79 180.96 183.31
2023 6 14 3 183.31 183.79 183.95
2023 6 15 4 183.95 183.31 186.01
2023 6 16 5 186.01 183.95 184.92
2023 6 20 2 184.92 186.01 185.01
2023 6 21 3 185.01 184.92 183.96
2023 6 22 4 183.96 185.01 187.00
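As a sketch, these lagging features can be derived with pandas by shifting the Close column (continuing from the illustrative data frame above):

```python
# Shift the Close column down by one and two rows so each row sees the
# closing prices from the previous one and two trading days.
df["Close (T-1)"] = df["Close"].shift(1)
df["Close (T-2)"] = df["Close"].shift(2)

# The earliest rows have no preceding prices within the frame, so they hold
# NaN; in Table 3-2 those gaps are filled with closing prices from late May.
df = df.dropna()
```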

The Hypothesis Function

In the previous chapter, we modelled Apple's share price using the following linear function as our hypothesis function:

$$ h_\theta(x) = \theta_0 + \theta_1x $$

This model limits us to exploring relationships between Apple’s share price and a single feature, so we’ll need to extend this model if we want to explore relationships between Apple’s share price and the 6 features in our new data set. For each feature in our model, we extend the hypothesis function by adding one new term to the function. As an example, a hypothesis function accommodating 6 features can be expressed as:

$$ h_\theta(x) = \theta_0 + \theta_1\color{blue}x_1\color{black} + \theta_2\color{blue}x_2\color{black} + \theta_3\color{blue}x_3\color{black} + \theta_4\color{blue}x_4\color{black} + \theta_5\color{blue}x_5\color{black} + \theta_6\color{blue}x_6\color{black} $$

The variables x1 to x6 highlighted in blue above keep track of the individual features in our model using subscript notation.

To generalize this function, we introduce a new independent variable x0 highlighted in blue below. The introduction of this independent variable makes the first term in the equation consistent with the remaining terms.

$$ h_\theta(x) = \theta_0\color{blue}x_0\color{black} + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + \theta_4x_4 + \theta_5x_5 + \theta_6x_6 $$

We conventionally define x0 = 1 to ensure the value of the first term remains unchanged since any value multiplied by 1 is equal to itself. Now that all of the terms are expressed consistently, we can provide a generalized definition of the hypothesis function:

$$ h_\theta(x) = \theta_0x_0 + \dots + \theta_nx_n $$

The value n describes the number of features.
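Expressed as code, the generalized hypothesis is simply a sum of parameter-feature products. Here’s a minimal sketch in plain Python; the parameter values are illustrative assumptions, not trained values:

```python
def hypothesis(theta, x):
    """Compute h(x) = theta_0 * x_0 + ... + theta_n * x_n."""
    return sum(t_j * x_j for t_j, x_j in zip(theta, x))

# x[0] is the constant 1 by convention; the rest is one row of features:
# year, month, day, day of week, close (t-1), close (t-2).
theta = [1.5, 0.02, -0.1, 0.003, 0.05, 0.6, 0.3]
x = [1, 2023, 6, 1, 4, 177.25, 177.3]
prediction = hypothesis(theta, x)
```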

Linear Algebraic Interpretation

If it’s been a while since you’ve studied matrices or vectors, the overview of Linear Algebra in Chapter 12 will provide you with the necessary context to follow along in this section. The purpose of this section is to show you that the generalized definition of the hypothesis function can be more conveniently expressed in linear algebraic notation.

We can think about each of the parameters in the hypothesis function as elements in a parameter vector:

$ \theta = \begin{bmatrix}\theta_0\\\theta_1\\\vdots\\\theta_n\end{bmatrix} $

We can also think about each of the features in the hypothesis function as elements in a feature vector:

$ x = \begin{bmatrix}x_0\\x_1\\\vdots\\x_n\end{bmatrix} $

If we transpose the parameter vector θ, we end up with a row vector θT:

$ \theta^T = \begin{bmatrix}\theta_0 & \theta_1 & \dots & \theta_n\end{bmatrix} $

If we then take the product of the transposed parameter vector and the feature vector, we end up with:

$ \theta^Tx = \begin{bmatrix}\theta_0 & \theta_1 & \dots & \theta_n\end{bmatrix} \cdot \begin{bmatrix}x_0\\x_1\\\vdots\\x_n\end{bmatrix} = \theta_0x_0 + \theta_1x_1 + \dots + \theta_nx_n = h(x) $

As it turns out, θTx is an equivalent way to express the multivariate hypothesis function h(x). It’s a very compact and convenient notation that we’ll use going forward.
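In code, this notation reduces the hypothesis to a single dot product. A NumPy sketch (the parameter values are the same illustrative assumptions as before):

```python
import numpy as np

theta = np.array([1.5, 0.02, -0.1, 0.003, 0.05, 0.6, 0.3])
x = np.array([1, 2023, 6, 1, 4, 177.25, 177.3])  # x[0] = 1 by convention

# theta^T x: equivalent to summing theta_j * x_j over all j.
prediction = theta @ x
```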

The Cost Function

In the previous chapter, we defined the cost function as:

$$J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=0}^{m-1} \left (h(x^{(i)}) - y^{(i)} \right)^2$$

This cost function can be used to compute the cost of the generalized hypothesis function as well. We replace the reference to the hypothesis function h(x) with the definition of our generalized hypothesis function, and we get:

$$J(\theta_0, \dots, \theta_n) = \dfrac {1}{2m} \displaystyle \sum _{i=0}^{m-1} \left (\theta_0x_0^{(i)} + \dots + \theta_nx_n^{(i)} - y^{(i)} \right)^2$$

This definition is a bit verbose, so let's express this function using the linear algebraic notation we’ve just discussed.

The arguments to the cost function (θ0...θn) are replaced with the single parameter vector θ, and the definition of the generalized hypothesis function is replaced with its linear algebraic interpretation, θTx. Using this notation, we end up with a more compact and convenient way to describe the cost function for multivariate linear regression:

$$J(\theta) = \dfrac {1}{2m} \displaystyle \sum _{i=0}^{m-1} \left (\theta^Tx^{(i)} - y^{(i)} \right)^2$$
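As a sketch, this cost function vectorizes neatly in NumPy, assuming the training instances are stacked into an m-by-(n+1) matrix X whose first column is the constant x0 = 1, and y is the vector of observed closing prices:

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/(2m) * sum of squared prediction errors."""
    m = len(y)
    errors = X @ theta - y  # h(x^(i)) - y^(i) for every training instance
    return (errors @ errors) / (2 * m)
```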

The Gradient Descent Algorithm

In the previous chapter, we defined the gradient descent algorithm as follows:

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^{m-1} \left (\theta_0 + \theta_1x^{(i)} - y^{(i)} \right) $$

$$ \theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^{m-1} \left (\theta_0 + \theta_1x^{(i)} - y^{(i)} \right)x^{(i)} $$

}

This algorithm has similar limitations to our univariate hypothesis function. In its current form, we’re limited to discovering optimal values for just two parameters, θ0 and θ1. Since our new data set has 6 features, we need to develop a version of gradient descent that can optimize 7 parameters. Let’s develop a generalized version of gradient descent using the same strategy we used to generalize the hypothesis function.

First, let’s substitute the univariate hypothesis function with the generalized hypothesis function:

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^{m-1} \left (\theta_0x_0^{(i)} + \dots + \theta_nx_n^{(i)} - y^{(i)} \right) $$

$$ \theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^{m-1} \left (\theta_0x_0^{(i)} + \dots + \theta_nx_n^{(i)} - y^{(i)} \right)x^{(i)} $$

}

When we generalized the hypothesis function, we started by introducing one new term to the function for each new feature in our data set. When we apply the same approach to the gradient descent algorithm, we end up with:

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^{m-1} \left (\theta_0x_0^{(i)} + \dots + \theta_nx_n^{(i)} - y^{(i)} \right) \hspace{1cm} $$

$$ \theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^{m-1} \left (\theta_0x_0^{(i)} + \dots + \theta_nx_n^{(i)} - y^{(i)} \right)\color{blue}x_1^{(i)} \hspace{0.4cm} \color{black}\text{year} $$

$$ \theta_2 := \theta_2 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^{m-1} \left (\theta_0x_0^{(i)} + \dots + \theta_nx_n^{(i)} - y^{(i)} \right)\color{blue}x_2^{(i)} \hspace{0.4cm} \color{black}\text{month} $$

$$ \theta_3 := \theta_3 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^{m-1} \left (\theta_0x_0^{(i)} + \dots + \theta_nx_n^{(i)} - y^{(i)} \right)\color{blue}x_3^{(i)} \hspace{0.4cm} \color{black}\text{day} $$

$$ \theta_4 := \theta_4 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^{m-1} \left (\theta_0x_0^{(i)} + \dots + \theta_nx_n^{(i)} - y^{(i)} \right)\color{blue}x_4^{(i)} \hspace{0.4cm} \color{black}\text{day of week} $$

$$ \theta_5 := \theta_5 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^{m-1} \left (\theta_0x_0^{(i)} + \dots + \theta_nx_n^{(i)} - y^{(i)} \right)\color{blue}x_5^{(i)} \hspace{0.4cm} \color{black}\text{close (t-1)} $$

$$ \theta_6 := \theta_6 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^{m-1} \left (\theta_0x_0^{(i)} + \dots + \theta_nx_n^{(i)} - y^{(i)} \right)\color{blue}x_6^{(i)} \hspace{0.4cm} \color{black}\text{close (t-2)} $$

}

The variables x1 to x6 highlighted in blue above keep track of each feature using subscript notation. To generalize this algorithm, we apply the same technique we used to generalize the hypothesis function by introducing a new independent variable, x0 highlighted in blue below. The introduction of this independent variable makes the first term in the algorithm consistent with the remaining terms.

repeat until convergence: {

$$ \theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^{m-1} \left (\theta_0x_0^{(i)} + \dots + \theta_nx_n^{(i)} - y^{(i)} \right)\color{blue}x_0^{(i)} \hspace{0.4cm} \color{black}\text{constant (1)} $$

$$ \theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^{m-1} \left (\theta_0x_0^{(i)} + \dots + \theta_nx_n^{(i)} - y^{(i)} \right)x_1^{(i)} \hspace{0.4cm} \text{year} $$

$$ \theta_2 := \theta_2 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^{m-1} \left (\theta_0x_0^{(i)} + \dots + \theta_nx_n^{(i)} - y^{(i)} \right)x_2^{(i)} \hspace{0.4cm} \text{month} $$

$$ \theta_3 := \theta_3 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^{m-1} \left (\theta_0x_0^{(i)} + \dots + \theta_nx_n^{(i)} - y^{(i)} \right)x_3^{(i)} \hspace{0.4cm} \text{day} $$

$$ \theta_4 := \theta_4 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^{m-1} \left (\theta_0x_0^{(i)} + \dots + \theta_nx_n^{(i)} - y^{(i)} \right)x_4^{(i)} \hspace{0.4cm} \text{day of week} $$

$$ \theta_5 := \theta_5 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^{m-1} \left (\theta_0x_0^{(i)} + \dots + \theta_nx_n^{(i)} - y^{(i)} \right)x_5^{(i)} \hspace{0.4cm} \text{close (t-1)} $$

$$ \theta_6 := \theta_6 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^{m-1} \left (\theta_0x_0^{(i)} + \dots + \theta_nx_n^{(i)} - y^{(i)} \right)x_6^{(i)} \hspace{0.4cm} \text{close (t-2)} $$

}

Again, we conventionally define x0 = 1 to ensure the value of the first term remains unchanged since any value multiplied by 1 is equal to itself.

Now that all of the terms are expressed consistently, we can provide a generalized definition of the gradient descent algorithm:

Repeat simultaneously until convergence, for j = 0 to j = n: {

$$ \theta_j := \theta_j - \alpha \dfrac {1}{m} \displaystyle \sum _{i=0}^{m-1} \left (\theta_0x_0^{(i)} + \dots + \theta_nx_n^{(i)} - y^{(i)} \right)x_j^{(i)} $$

}

The value n describes the number of features, while j acts as a reference or index to a specific feature.

It’s important to highlight a key implementation detail signalled by the phrase “Repeat simultaneously until convergence”. Within each iteration of gradient descent, every parameter update must be computed using the current values of all parameters before any parameter is overwritten. In other words, the parameters cannot be updated one at a time, with each update feeding into the next; treat the collection of parameter updates as a single atomic step.
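The simultaneous update falls out naturally from a vectorized implementation, since the full gradient is computed from the current θ before any parameter is overwritten. Here’s a minimal sketch under the same assumptions as the cost function above (illustrative, not the book’s published implementation):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iterations=1000):
    """Optimize theta for the m-by-(n+1) feature matrix X and target vector y."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(iterations):
        errors = X @ theta - y            # h(x^(i)) - y^(i) for all instances
        gradient = (X.T @ errors) / m     # one partial derivative per parameter
        theta = theta - alpha * gradient  # all parameters updated at once
    return theta
```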

With a few changes to our previous linear regression model, we’ve developed a far more capable model that can learn from an arbitrarily large number of features. It’s a nice step up with far broader applicability to real-world use cases. Hopefully, this inspires you to think about all of the potential sources of market and financial data that you could leverage to build rich predictive models for equities.

The benefit of broader capability comes with one new challenge to solve: not all features in a model are created equal. To wrap up this chapter, we’ll discuss this challenge and how a technique called feature scaling can help address it.

Feature Scaling

To illustrate the new challenge that arises when working with multivariate models, let’s look at a (non-engineered) subset of Apple share price data which captures the opening price, intraday high, intraday low, daily volume and closing price.

Table 3-3: Apple share price, June 1st 2023 to June 6th 2023
Date Open High Low Volume Close
1 June 2023 177.70 180.12 176.93 68,901,800 180.09
2 June 2023 181.03 181.78 179.26 61,945,900 180.95
5 June 2023 182.63 184.95 178.04 121,946,500 179.58
6 June 2023 179.97 180.12 177.43 64,848,400 179.21

Note that the opening price ranges between 177.70 and 182.63, the intraday high ranges between 180.12 and 184.95, the intraday low ranges between 176.93 and 179.26 while the volume ranges between 61,945,900 and 121,946,500.

The range of the volume feature is significantly larger than the range of the open, high and low features. As a consequence, gradient descent may take many more iterations to converge on an optimal parameter value for the volume feature than for the open, high or low features, since the algorithm has a significantly larger range to traverse. This isn’t a particularly efficient use of your GPU or CPU cycles, and with sizeable data sets this range disparity can significantly increase training time.

Feature scaling describes a variety of techniques that aim to standardize or normalize the range of independent variables or features of data. Feature scaling has broad applicability in the fields of data science and machine learning as there are a number of distinct motivating factors for its adoption. For the purpose of this chapter, we’ll use feature scaling as a means to improve the performance of gradient descent. More specifically, we’ll introduce two different feature scaling techniques that you can use to standardize or normalize data.

Min-Max Scaling

Min-max scaling, also referred to as min-max normalization or rescaling, is used to standardize the range of values for a specific feature. For a given feature x, min-max scaling transforms each instance of x so that all instances lie in a common range, typically between 0 and 1. The min-max scaling function is defined as:

$$ x_i^\prime := \dfrac {x_i - min(x)}{max(x) - min(x)} $$

Where xi is the value of the training instance i, min(x) is the smallest value in the range of feature x, and max(x) is the largest value in the range of feature x. Based on the data from the volume feature in Table 3-3 above, we define min(x) and max(x) as follows:

min(x) = 61,945,900
max(x) = 121,946,500

To scale the first instance in our volume feature we substitute the values for min(x), max(x) and our first training instance x0 and we get:

$$ x_0^\prime = \dfrac{68,901,800 - 61,945,900}{121,946,500 - 61,945,900} = \dfrac{6,955,900}{60,000,600} = 0.116 $$

As an exercise for the reader, use min-max scaling to scale the remaining three training instances in the training set. You can use the following table to check your work:

Table 3-4: Apple trading volume before and after min-max scaling
Volume Rescaled Volume
68,901,800 0.116
61,945,900 0.000
121,946,500 1.000
64,848,400 0.048

Note that in Table 3-4 above, the min volume has a rescaled volume of exactly 0, while the max volume has a rescaled volume of exactly 1. Any volumes between the min and max will consequently have rescaled volumes within the range of 0 and 1.
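A minimal sketch of min-max scaling applied to the volume feature, assuming NumPy:

```python
import numpy as np

def min_max_scale(x):
    """Rescale every value of x into the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

volume = np.array([68_901_800, 61_945_900, 121_946_500, 64_848_400])
print(min_max_scale(volume))  # approximately [0.116, 0.0, 1.0, 0.048]
```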

Mean Normalization

Although the difference between min-max scaling and mean normalization appears cosmetically subtle, mean normalization applies a more substantial transformation to your data. While the purpose of min-max scaling is simply to rescale data into a fixed range, mean normalization also centers the data around zero, so that the transformed feature has a mean of 0. For a given feature x, mean normalization transforms each instance of x so that all instances lie in a common range, typically between -1 and 1.

Gradient descent tends to behave better when features are centered around zero, so applying mean normalization to data often produces better training results. The mean normalization function is defined as:

$$ x_i^\prime := \dfrac {x_i - mean(x)}{max(x) - min(x)} $$

Where xi is the value of the training instance i, mean(x) is the mean or average of feature x, min(x) is the smallest value in the range of feature x, and max(x) is the largest value in the range of feature x. To calculate the average, we sum up all volumes in our training set and divide by the training set size, and we end up with:

mean(x) = 79,410,650

To normalize the first instance in our volume feature we substitute the values for mean(x), min(x), max(x) and our first training instance x0 and we get:

$$ x_0^\prime = \dfrac{68,901,800 - 79,410,650}{121,946,500 - 61,945,900} = \dfrac{-10,508,850}{60,000,600} = -0.175 $$

As an exercise for the reader, use the mean normalization function to scale the remaining three training instances in the training set. You can use the following table to check your work:

Table 3-5: Apple trading volume before and after mean normalization
Volume Mean Normalized Volume
68,901,800 -0.175
61,945,900 -0.291
121,946,500 0.709
64,848,400 -0.243
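A sketch of mean normalization under the same assumptions; it differs from the min-max sketch only in the numerator:

```python
import numpy as np

def mean_normalize(x):
    """Center x around zero; results typically fall between -1 and 1."""
    return (x - x.mean()) / (x.max() - x.min())

volume = np.array([68_901_800, 61_945_900, 121_946_500, 64_848_400])
print(mean_normalize(volume))  # approximately [-0.175, -0.291, 0.709, -0.243]
```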

It’s important to note that once you’ve trained a model using scaled features, your model will only be able to make accurate predictions from inputs that have been scaled in the same way.

Build the Multivariate Linear Regression Model

We’ve published an implementation of the multivariate linear regression model on GitHub.

If you would like to implement, train and run this linear regression model yourself, feel free to use the source code as a reference and run the software to see how training and running the model works in practice. Documentation related to building, training and running the model can be found on GitHub as well.

https://github.com/build-thinking-machines/multivariate-linear-regression