The Gradient Descent algorithm is a general purpose algorithm that has a number of practical applications in machine learning. By the end of this article you should have a good understanding of how the Gradient Descent algorithm works, and appreciate how it helps to solve to minimisation problems in general.
Intuitive Gradient Descent
Before we define the gradient descent algorithm, let’s first get an intuitive understanding of what the algorithm will accomplish.
The Gradient Descent algorithm is useful for finding local minima - the smallest values of a function, in and around the local area of a function. As an example, let’s take a look at the odd-looking function below - you can think of this as a simple landscape with hills and valleys:
We’ll want the algorithm to to find the lowest valley it can, and we don’t know what the lowest point is ahead of time. To start, we’re just going to pick an arbitrary spot on our “landscape” - the red X.
After we’ve picked our starting point, we’re going to have a look around, and figure out which direction to take in order to go “downhill”, so to speak. Once we’ve figured out the direction we want to take, we’ll take one step in that direction. Now that we’ve taken our first step, the process repeats itself - we’ll have another look around, figure out which direction we want to take in order to go downhill, then take another step (this process is represented by the red dashes in the picture above).
This process repeats itself until you take a look around, and realise that taking another step won’t bring you further downhill. At this point you’ve found the local minima (the red dot in the picture above), and the process completes.
Minima and Local Minima
One of the properties of the Gradient Descent algorithm is that it is efficient at finding local minima in a function. To demonstrate this, I’ve also picked another abitrary point on the graph (the blue X), and followed the same process. You’ll notice that this starting point produced a different value compared to our first starting point. With our “hill and valley” function above, the answer that you get from the Gradient Descent algorithm is highly dependant upon our initial starting position.
The conclusion we can draw is that the Gradient Descent algorithm is not useful for minimising functions that have more than one local minima, since the answer you’ll get back will be highly dependant on where you pick your initial starting position.
The Gradient Descent algorithm will work best at minimising functions where there is only one possible local minima. As it turns out, our linear regression cost function is exactly one of these functions, so the Gradient Descent algorithm will give us the result we’re expecting.
The Gradient Descent Algorithm
The formal definition of the Gradient Descent algorithm is as follows:
repeat until convergence, for j = 0 and j = 1:
It’s not obvious, but this is really just an algorithm for taking a step from point a to point b. Imagine you’re on a walk - where you land after each step depends on three factors: your current location, the size of the step you want to take (giant stride or small shuffle), and the direction of the step (north, south, south-west). Let’s break down the gradient descent algorithm and explain how each term relates to these factors:
The first thing to note here is the expression := which means assignment. We want to assign a new value of Θj, which should intially be based on the current value of Θj. Your new location (after taking a step) will be initially based on your current location.
This is the term alpha, and defines the size of the step you want to take.
This term represents the direction of the step. Obviously there’s a bit more to unpack here, but I’ll talk about how this term works when we get to implementation.
The algorithm also mentions to repeat until convergence, for j = 0 and j = 1. All this means in practice is that we want to apply the Gradient Descent algorithm for Θ0 and Θ1, so our implementation will look like this:
We already know what J(Θ0, Θ1) is, that’s our cost function. Let’s expand our gradient descent function by replacing the J(Θ0, Θ1) term with the actual cost function. We now have:
The next part may lose some of you if you don’t know calculus - don’t worry though, you don’t actually need to understand calculus as I’ve done the work for you. We need to take the partial derivitive of our cost function. After we’ve done that, the gradient descent algorithm low looks like this:
The last thing we’re going to do is replace the hypothesis function h(x) term with the actual hypothesis function, and we finally end up our implementable gradient descent algorithm:
repeat until convergence:
In part 3 I’ll provide insights into alpha, the partial differential equation, convergence and some practical implementation notes and example source code.