### Initialising and Managing Alpha

Alpha is the only tunable parameter in the Gradient Descent algorithm, and using the right value is important; if alpha is too big, the algorithm may not converge on a solution; if alpha is to small, the algorithm can a long time to converge.

The following graphs show the behaviour of the Gradient Descent algorithm when alpha is set too high or too low. The x axis represents a Θ_{0} Θ_{1} pairing, and the y axis represents the value of the cost function for that pair, J(Θ_{0}, Θ_{1}). These *should* be 3 dimensional, with one axis for Θ_{0}, one axis for Θ_{1} and one axis for the cost function J(Θ_{0}, Θ_{1}), but I’m not that great of a 3-d drawer. The 2-d graph is alright just to illustrate the point.

Remember that the Gradient Descent algorithm is a minimisation algorithm, so the best solution is the one where the cost function of Θ_{0}, Θ_{1} is lowest (the minima of the function).

The bigger the value of alpha, the bigger steps the algorithm will take between each iteration. In iteration #1, the algorithm takes a big step and gets closer to convergence. In iteration #2, it “overshoots” the minima, and subsequent iterations keep on boucing around, never converging towards the minima. When alpha is set to high, we risk the possibility of never coverging on the right answer.

Setting the value of alpha too small is less risky; we *will* ultimatley converge on right answer. The downside to setting alpha too small is that doing so will require a large number of small steps, which makes convergence take longer than necessary.

So, how do we approach initialising alpha? I’d suggest to start with a value of 0.1 and adjust as necessary. The cost function will be your guide here - if the gradient descent algorithm is working as expected, the cost function should *decrease* after each and every iteration. If the value of the cost function is increasing, it’s a strong signal that alpha is set too high, and you should decrease it.

I’ve taken an algorithmic approach to adjusting the cost function while gradient descent is running. After every iteration, I check the newly computed cost function against the previously computed cost function. If I detect that the cost function has decreased, I increase the value of alpha by 0.02 (take slightly bigger steps). On the other hand, if I detect that the cost function has increased, I divide the cost function by 2 (we’re on the wrong path, so we want to be more aggresive in our course correction).

To show you how this works in pratice, I’ve captured the output of alpha after each iteration during a gradient descent run and plotted the results in the following graph:

The results here are pretty interesting - for this data set, it appears that there’s a “ceiling” for alpha at about 0.23, and any time the value of alpha hits this point the steps become too large and the cost function increases.

In a second experiment, I intentionally initialised the value of alpha with a really big number (4). Here are the results:

As you can see here, the value of alpha quickly decends into it’s normal operating range, and once alpha gets here it doesn’t make any significant movement upwards or downwards.

### Defining Convergence

In the original definition of the Gradient Descent algorithm, we start with the statement “repeat until convergence”. How do we define convergence?

We already know that a well behaved Gradient Descent algorithm should reduce the cost function of J(Θ_{0}, Θ_{1}) after every iteration. As we approach the minima, we should also observe smaller and smaller cost reduction changes; the following graph illustrates the point:

When we start at the “top” of the function, each iteration reduces the cost of J(Θ_{0}, Θ_{1}), and the direction of our travel is nearly vertical along the y axis. Proportional to y, our we’re not travelling as far along the x axis. As we approach the bottom of the curve, we slowly but surely modify the direction of our travel, and a greater proportion of our travel is directed horizontally along the x axis, not vertically along the y axis.

Since the value of y represents the cost function of J(Θ_{0}, Θ_{1}), we can say that our Gradient Descent algorithm has converged when the difference between the cost function of two iterations is *sufficiently small*. There’s no right answer to what sufficiently small is, as it’s entirely dependent on your personal level of tolerance for accuracy.

I’ll give you an example - the graph below shows the the delta of a cost function between between each iteration.

As you can see, by about iteration 10 the delta between the cost function for each iteration is nearly 0, so we’re getting close to finding the ideal solution. It’s hard to see on the graph, but the delta at iteration 10 is 0.0002. Is that a good enough tolerance to say we’ve converged? That’s up to you.

For this particular data set, I know that the “perfect” hypothesis function should produce Θ_{0} = 0 and Θ_{1} = 1.

If I were to set my definintion of convergence at 0.0002, the Gradient Descent algorithm would stop after about 10 iterations. The result would produce Θ_{0} = 0.203 and Θ_{1} = 0.952. That’s not too bad; we’re *pretty* close to the right answer, but there’s room for improvement.

By iteration 150, the cost function delta between iterations has shrunk down to 0.0000008. Using *this* as our definition for convergence would produce Θ_{0} = 0.005 and Θ_{1} = 0.999. This result would give you more accurate predictions, at the expense of running additional iterations.

Both definitions of convergence are correct, but have different outcomes and goals. The choice for selecting an appropriate convergence value will depend on balacing the need for accuracy and efficiency.

### Implementing Gradient Descent

Source code for the Gradient Descent algorithm (implemented in Java) can be found here. Here’s a snippet of the Gradient Descent implementation:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23

public LinearFunction<T> run() {
BigDecimal theta0 = new BigDecimal(0.0), tempTheta0 = new BigDecimal(0.0);
BigDecimal theta1 = new BigDecimal(0.0), tempTheta1 = new BigDecimal(0.0);
BigDecimal cost = new BigDecimal(100.0), tempCost = new BigDecimal(100.0);
Double convergence = new Double(100.0);
while (convergence > tolerance) {
tempTheta0 = calculateThetaZero(theta0, theta1);
tempTheta1 = calculateThetaOne(theta0, theta1);
cost = costFunction.at(theta0, theta1);
tempCost = costFunction.at(tempTheta0, tempTheta1);
alpha = (tempCost.doubleValue() > cost.doubleValue()) ? alpha / 2 : alpha + 0.02;
convergence = Math.abs(tempCost.doubleValue() - cost.doubleValue());
theta0 = tempTheta0;
theta1 = tempTheta1;
}
return new LinearFunction<T>(theta0, theta1);
}

The most important implementation note here is that *theta0* and *theta1* should be calulated atomically (in machine learning, they refer to this as calculating “simultaneously”). The newly computed values of theta0 and theta1 are not assigned immediatley; they are assigned to temporary variables *tempTheta0* and *tempTheta1*. Doing this ensures that we’re computing new values of theta0 and theta1 against the current values of both theta0 and theta1. If we were to perform direct assignment, the value of theta1 could be a bit off, as we would be calculating the value of theta1 against the current value of theta1, but the *new* value of theta0. This *could* still make for a workable Gradient Descent algorithm, but some unexpected behaviours can come about with direct assignment.