Brandon John Grenier - Artificial Intelligence • Machine Learning • Big Data
http://brandon.ai/
Sat, 18 Feb 2017 06:00:21 +0000
Classification, Part 1: Binary Classification<p>To date we’ve been solving regression problems, where the values being predicted (like the price of a house) are entirely unconstrained. Classification problems are those where we want a machine learning algorithm to predict a specific, discrete result from a predefined set of values.</p>
<h3 id="overview-of-binary-classification">Overview of Binary Classification</h3>
<p>Binary classification is the simplest form of classification problem to solve for, but it can help us answer some incredibly valuable questions. Examples of binary classification problems include spam filtering (is this email spam? yes/no), fraud detection (is this transaction legitimate? yes/no) and medicine (is this tumor malignant? yes/no).</p>
<p>For each of these examples, the variable that we’re trying to predict will take on one of two distinct values, 0 or 1. More formally, we want to predict a value y which will take on the value 0 or 1, defined as: <br />
<script type="math/tex">\displaystyle y \in \{0, 1\}</script></p>
<p>By convention, the value 0 is referred to as the “negative class”, and the value 1 is referred to as the “positive class”.</p>
<h3 id="shortcomings-with-using-linear-regression">Shortcomings with using Linear Regression</h3>
<p>If we take what we’ve learned so far with linear regression and apply it to classification problems, we’ll learn that there are a few shortcomings with this approach.</p>
<p>Linear regression works well with continuous value predictions, but with classification problems we need an approach to produce <em>discrete</em> value predictions. With binary classification specifically, we only want to make predictions that produce 0 or 1 as outputs. In the examples below, we attempt to predict whether a tumor is malignant or not based on the size of the tumor, using linear regression.</p>
<p>In each example, we generate a hypothesis function that fits the data as best as possible. To determine whether a particular data point should represent a “yes” or a “no”, we take a midpoint at y = 0.5 (in other words, the halfway point between 0 and 1 on the y axis). This line extends until it intersects our hypothesis function, and then we make a simple decision - anything to the right of the intersection will be grouped into the “yes” group, and anything to the left of the intersection will be grouped into the “no” group.</p>
<p>In our very first graph, linear regression works out quite well, and it correctly groups the malignant and non-malignant tumors. The problem lies with the remaining two graphs, where we have more data points. The data is intuitively consistent with what we might expect (larger tumors tend to be malignant), but the linear regression algorithm starts “pulling down” to fit the data points, and as a result starts predicting malignant tumors as non-malignant.</p>
<p><img src="http://brandon.ai/assets/article_images/2017-01-31-uc-binary-classification/linear-regression.jpg" alt="Shortcomings with linear regression" /></p>
<p>Linear functions are, well, too linear. We need a function that can sharply “cleave” our data points to cleanly place them into one of two camps.</p>
<p>The other practical problem that we run into is that linear regression functions (as we’ve seen before) generate predicted values y that can be less than 0 or greater than 1. We can make use of feature scaling to reduce the problem, but it won’t solve the problem entirely. Ideally, we want a function that will guarantee predicted values are always bound between 0 and 1.</p>
<h3 id="the-sigmoid-function">The Sigmoid Function</h3>
<p>The Sigmoid function (also referred to as the logistic function, which gives Logistic Regression its name) is a function that neatly satisfies our requirements for classification. Don’t get confused by the terminology here - although the technique is called Logistic Regression, it is in fact used for classification problems.</p>
<p><img src="http://brandon.ai/assets/article_images/2017-01-31-uc-binary-classification/sigmoid-function.png" alt="The Sigmoid Function" /></p>
<p>Firstly, the Sigmoid function maps any real number along the x axis into the [0, 1] interval we require for classification. For any input value x, the Sigmoid function guarantees that we will always be in our “yes”/”no” boundary. Secondly, it has a sharp transition between 0 and 1, which will help us produce a well defined decision boundary (i.e. the point where we decide what qualifies as a yes, and what qualifies as a no).</p>
<h3 id="the-hypothesis-function">The Hypothesis Function</h3>
<p>We need to “hook” our hypothesis function into the Logistic Regression function. Here’s how to do it.</p>
<p>Our existing hypothesis function is defined as:<br />
h(x) = Θ<sub>0</sub>x<sub>0</sub> + Θ<sub>1</sub>x<sub>1</sub> + Θ<sub>2</sub>x<sub>2</sub> + … + Θ<sub>n</sub>x<sub>n</sub></p>
<p>First, we temporarily assign our existing hypothesis function to a variable <em>z</em>, so we have:<br />
z = Θ<sub>0</sub>x<sub>0</sub> + Θ<sub>1</sub>x<sub>1</sub> + Θ<sub>2</sub>x<sub>2</sub> + … + Θ<sub>n</sub>x<sub>n</sub></p>
<p>The Logistic Regression function is defined as: <br />
<script type="math/tex">g(z) = \dfrac{1}{1 + e^{-z}}</script></p>
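To make the behaviour concrete, here’s a minimal sketch of the function in code (the `Sigmoid` class below is purely illustrative, not part of the project source):

```java
public class Sigmoid {
    // g(z) = 1 / (1 + e^(-z))
    public static double g(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        System.out.println(g(0.0));   // exactly 0.5, the midpoint of our interval
        System.out.println(g(6.0));   // very close to 1
        System.out.println(g(-6.0));  // very close to 0
    }
}
```

No matter how large or small z gets, the output never escapes the interval between 0 and 1.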
<p>We can formally define our new classification hypothesis function as:<br />
h(x) = g(z)</p>
<p>When we substitute z with our definition above, we get: <br />
h(x) = g(Θ<sub>0</sub>x<sub>0</sub> + Θ<sub>1</sub>x<sub>1</sub> + Θ<sub>2</sub>x<sub>2</sub> + … + Θ<sub>n</sub>x<sub>n</sub>)</p>
<p>When we expand the function g, our fully implementable hypothesis function is:<br />
<script type="math/tex">h(x) = \dfrac{1}{1 + e^{-(\theta_0x_0 + \theta_1x_1 + ... + \theta_nx_n)}}</script></p>
<p>It’s also important to note the behavioural difference between the Sigmoid function and our linear regression hypothesis function. For any input value x, the Sigmoid function returns the <em>probability</em> that y = 1 for that input value.</p>
<p>For example, if we found h(x) = 0.55, the Sigmoid function is telling us that y has a 55% chance of being 1, and therefore a 45% chance of being 0.</p>
<h3 id="the-decision-boundary">The Decision Boundary</h3>
<p>The goal of our hypothesis function is to predict a discrete value, a 1 or a 0. In this section, we’ll understand how the hypothesis function will make these predictions with the use of a decision boundary. We’ll use the following graph to get a better sense of what the logistic regression hypothesis function is computing.</p>
<p><img src="http://brandon.ai/assets/article_images/2017-01-31-uc-binary-classification/decision_boundary.jpg" alt="" /></p>
<p>The image above shows our hypothesis function (Logistic Regression or Sigmoid function), with the function asymptoting at 0 and 1. Since our predictions can only result in one of two possible values, the first question we need to ask ourselves is: when should a value of h(x) fall into the “0” camp, and when should it fall into the “1” camp?</p>
<p>We now know that if h(x) = 0.55, the hypothesis function is telling us that y has a 55% chance of being 1, and therefore a 45% chance of being 0. Let’s define our first decision boundary as:</p>
<p>when h(x) >= 0.5, the predicted value (y) should be equal to 1. <br />
when h(x) < 0.5, the predicted value (y) should be equal to 0.</p>
<p>This is a pretty sensible definition - the arbitrary component here is that when h(x) is exactly equal to 0.5, the predicted value could really go in either camp. As a convention, we’ll make the predicted value 1. If you really wanted to, you could make the predicted value 0 - there’s no right or wrong answer.</p>
<p>We can start reasoning about how the algorithm will work. If you look at the value z = 0 on the graph, you’ll notice that the function intersects at y = 0.5. The value of y increases (up to 1) as the values of z get bigger.</p>
<p>We can generalise and say that g(z) >= 0.5 when z >= 0 - that’s the right hand side of the graph. Likewise, g(z) < 0.5 when z < 0.</p>
<p>Given our hypothesis function h(x) = g(z), we can also say that</p>
<p>h(x) >= 0.5 when z >= 0, and h(x) < 0.5 when z < 0</p>
<p>Of course, 0.5 is our decision boundary for deciding when a predicted value should go into the “1” camp or “0” camp, so we can also say that</p>
<p>h(x) should produce a predicted value of 1 when z >= 0,<br />
h(x) should produce a predicted value of 0 when z < 0</p>
<p>We also know what z is, so we can more formally say</p>
<p>h(x) should produce a predicted value of 1 when Θ<sub>0</sub>x<sub>0</sub> + Θ<sub>1</sub>x<sub>1</sub> + Θ<sub>2</sub>x<sub>2</sub> + … + Θ<sub>n</sub>x<sub>n</sub> >= 0 <br />
h(x) should produce a predicted value of 0 when Θ<sub>0</sub>x<sub>0</sub> + Θ<sub>1</sub>x<sub>1</sub> + Θ<sub>2</sub>x<sub>2</sub> + … + Θ<sub>n</sub>x<sub>n</sub> < 0</p>
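Since the Sigmoid function only crosses 0.5 at z = 0, a binary classifier never actually needs to evaluate g - checking the sign of z is enough. Here’s a small illustrative sketch (the `Classifier` class and its theta values are hypothetical, assuming theta and x follow the x<sub>0</sub> = 1 convention):

```java
public class Classifier {
    // h(x) = g(theta . x); predict 1 when theta . x >= 0, and 0 otherwise
    public static int predict(double[] theta, double[] x) {
        double z = 0.0;
        for (int i = 0; i < theta.length; i++) {
            z += theta[i] * x[i];
        }
        // No need to evaluate the sigmoid: g(z) >= 0.5 exactly when z >= 0
        return z >= 0 ? 1 : 0;
    }

    public static void main(String[] args) {
        double[] theta = { -3.0, 1.0 };  // places the decision boundary at x1 = 3
        System.out.println(predict(theta, new double[] { 1.0, 5.0 }));  // 1
        System.out.println(predict(theta, new double[] { 1.0, 2.0 }));  // 0
    }
}
```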
<h3 id="the-decision-boundary-in-practice">The Decision Boundary in Practice</h3>
<p><img src="http://brandon.ai/assets/article_images/2017-01-31-uc-binary-classification/defining.jpg" alt="" /></p>
Tue, 31 Jan 2017 00:00:00 +0000
http://brandon.ai/2017/01/31/uc-binary-classification.html
Multivariate Polynomial Regression<p>A straight line won’t always be the best fit for our data. In this article, you’ll learn how to generate predictive polynomial functions that leverage the machinery of our linear function algorithms. This technique will allow you to generate sophisticated predictive functions that bend and curve to fit your data.</p>
<h3 id="polynomial-functions">Polynomial Functions</h3>
<p>There are many types of polynomial functions at our disposal that can be used to better fit our data and produce more accurate predictions as a result. The following graph shows a few different kinds of polynomial functions and the shapes they can take on:</p>
<p><img src="http://brandon.ai/assets/article_images/2017-01-27-multivariate-polynomial-regression/polynomial_functions.jpg" alt="Different types of polynomial functions" /></p>
<p>You might be surprised to learn that we can produce these more sophisticated polynomial functions without changing the linear function algorithms we’ve already developed. I’ll go through one example to show you how it works.</p>
<h3 id="mapping-polynomial-functions">Mapping Polynomial Functions</h3>
<p>Let’s start by looking at our prototypical example, where the price of a house is only based on a single feature, the size of the house.</p>
<table>
<thead>
<tr>
<th>House Size</th>
<th>House Price</th>
</tr>
</thead>
<tbody>
<tr>
<td>124</td>
<td>223000</td>
</tr>
<tr>
<td>199</td>
<td>430000</td>
</tr>
<tr>
<td>334</td>
<td>900000</td>
</tr>
<tr>
<td>112</td>
<td>300000</td>
</tr>
</tbody>
</table>
<p>To produce a straight line to fit our data, we use our univariate linear hypothesis function: <br />
h(x) = Θ<sub>0</sub> + Θ<sub>1</sub>x</p>
<p>When we have more than one feature, we use our multivariate linear hypothesis function: <br />
h(x) = Θ<sub>0</sub>x<sub>0</sub> + Θ<sub>1</sub>x<sub>1</sub> + Θ<sub>2</sub>x<sub>2</sub> + … + Θ<sub>n</sub>x<sub>n</sub></p>
<p>Let’s say that we have a few more data points than the table above suggests. When we plot the data we realise that a straight line isn’t going to give us the best fit, but a nice curve will do the trick. Instead of using a linear function, we want to use a <em>quadratic</em> function, which looks like this: <br />
h(x) = Θ<sub>0</sub> + Θ<sub>1</sub>x + Θ<sub>2</sub>x<sup>2</sup></p>
<p>How do we use this quadratic function within our existing machinery? The trick is to map the quadratic function onto our multivariate linear function. To do this, we simply add a new ‘feature’ to our model, the house size <em>squared</em>.</p>
<table>
<thead>
<tr>
<th>House Size</th>
<th>House Size<sup>2</sup></th>
<th>House Price</th>
</tr>
</thead>
<tbody>
<tr>
<td>124</td>
<td>15376</td>
<td>223000</td>
</tr>
<tr>
<td>199</td>
<td>39601</td>
<td>430000</td>
</tr>
<tr>
<td>334</td>
<td>111556</td>
<td>900000</td>
</tr>
<tr>
<td>112</td>
<td>12544</td>
<td>300000</td>
</tr>
</tbody>
</table>
<p>We use the standard multivariate linear function for these two features, which is: <br />
h(x) = Θ<sub>0</sub>x<sub>0</sub> + Θ<sub>1</sub>x<sub>1</sub> + Θ<sub>2</sub>x<sub>2</sub> <br />
where x<sub>1</sub> is our first feature, the house size <br />
where x<sub>2</sub> is our second feature, the house size squared</p>
<p>Concretely, if we substitute the values of x with our features, (remembering that x<sub>0</sub> = 1 by convention) we get the following quadratic function as an outcome: <br />
h(x) = Θ<sub>0</sub> + Θ<sub>1</sub>(size) + Θ<sub>2</sub>(size)<sup>2</sup></p>
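In code, this mapping is nothing more than a data transformation applied before training - the learning algorithm itself is untouched. A quick sketch (the helper below is illustrative, not from the project source):

```java
public class PolynomialFeatures {
    // Turn a single-feature row [size] into [size, size^2],
    // so a linear algorithm can fit a quadratic curve.
    public static double[] addSquaredFeature(double size) {
        return new double[] { size, size * size };
    }

    public static void main(String[] args) {
        double[] sizes = { 124, 199, 334, 112 };
        for (double size : sizes) {
            double[] row = addSquaredFeature(size);
            System.out.println(row[0] + ", " + row[1]);
        }
    }
}
```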
<p>That is the general approach you can take to map any kind of polynomial function onto a multivariate linear function. Instead of modifying the linear function, we modify the data set. Using this simple technique will allow you to create highly sophisticated nonlinear functions by simply adding new features, based on existing ones you already have. You could also create new <em>composite</em> features, by multiplying the values of two features and storing the product as a new standalone feature.</p>
<p>When you use this technique, please ensure that you use feature scaling, especially in cases where you square or cube an existing feature.</p>
<p>Going forward, you’ll have many ways to model your data in order to get the best fit. This might be a bit overwhelming - what’s the best approach to modelling data, given there are so many ways to model it? Don’t worry too much about this, because later on we’ll get our machine learning algorithms to seek and select the best models for us.</p>
Fri, 27 Jan 2017 00:00:00 +0000
http://brandon.ai/2017/01/27/multivariate-polynomial-regression.html
Multivariate Linear Regression & Feature Scaling<p>In the univariate linear regression series you learned how to implement a simple machine learning algorithm that can predict housing prices based on a single feature, the size of the house. In this article, we’ll start building a more sophisticated algorithm that can make housing price predictions based on <em>multiple</em> features.</p>
<p>We are going to extend the simple supervised machine learning algorithm from that series so it can predict housing prices from several features at once. Going through this exercise will introduce a number of key concepts, algorithms and mathematics used in machine learning.</p>
<h3 id="defining-the-training-set">Defining the Training Set</h3>
<p>In our previous example, the price of the house was based only on the size of the house, so we had a training set that looked like this:</p>
<table>
<thead>
<tr>
<th>House Size - m<sup>2</sup> (x)</th>
<th>House Price - AUD (y)</th>
</tr>
</thead>
<tbody>
<tr>
<td>124</td>
<td>223000</td>
</tr>
<tr>
<td>199</td>
<td>430000</td>
</tr>
<tr>
<td>334</td>
<td>900000</td>
</tr>
<tr>
<td>112</td>
<td>300000</td>
</tr>
</tbody>
</table>
<p>Our new training set will introduce multiple features - the price of a house will now be based upon the house size, the number of bedrooms it has, the number of floors it has, and the age of the house:</p>
<table>
<thead>
<tr>
<th>House Size - m2</th>
<th>Bedrooms</th>
<th>Floors</th>
<th>House Age</th>
<th>House Price</th>
</tr>
</thead>
<tbody>
<tr>
<td>124</td>
<td>3</td>
<td>1</td>
<td>3</td>
<td>223000</td>
</tr>
<tr>
<td>199</td>
<td>3</td>
<td>2</td>
<td>11</td>
<td>430000</td>
</tr>
<tr>
<td>334</td>
<td>5</td>
<td>3</td>
<td>22</td>
<td>900000</td>
</tr>
<tr>
<td>112</td>
<td>4</td>
<td>1</td>
<td>44</td>
<td>300000</td>
</tr>
</tbody>
</table>
<p>We’ll introduce some notation in order to describe a training set with multiple features: <br />
<strong>x<sup>(i)</sup></strong> - this refers to <em>all</em> features of an instance in a training set. For example, when we refer to x<sup>(2)</sup>, we are now referring to the entire feature set [199, 3, 2, 11] of the second training example (the second row).</p>
<p><strong>x<sup>(i)</sup><sub>j</sub></strong> - this refers to <em>a specific</em> feature of an instance in a training set. For example, when we refer to x<sup>(2)</sup><sub>2</sub>, we are referring to a specific feature (the number of bedrooms) in the second training example, which has the value 3.</p>
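If you hold the training set as a two-dimensional array, this notation maps directly onto array indices - just remember that Java arrays are zero-based, so x<sup>(2)</sup> lives at row index 1:

```java
public class TrainingSet {
    // Each row is one training example: size, bedrooms, floors, age
    public static final double[][] DATA = {
        { 124, 3, 1, 3 },
        { 199, 3, 2, 11 },
        { 334, 5, 3, 22 },
        { 112, 4, 1, 44 }
    };

    public static void main(String[] args) {
        double[] secondExample = DATA[1];  // x^(2): every feature of the second row
        double bedrooms = DATA[1][1];      // x^(2)_2: the second feature of that row
        System.out.println(java.util.Arrays.toString(secondExample));
        System.out.println(bedrooms);      // 3.0
    }
}
```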
<h3 id="the-hypothesis-function">The Hypothesis Function</h3>
<p>Our original hypothesis function was a simple linear function, defined as:<br />
h(x) = Θ<sub>0</sub> + Θ<sub>1</sub>x</p>
<p>This function works well when there is only one feature, but needs to be updated to accommodate for our new training set. With four features, our new hypothesis looks like this:</p>
<p>h(x) = Θ<sub>0</sub> + Θ<sub>1</sub>x<sub>1</sub> + Θ<sub>2</sub>x<sub>2</sub> + Θ<sub>3</sub>x<sub>3</sub> + Θ<sub>4</sub>x<sub>4</sub></p>
<p>You might notice that the function is a little inconsistent. Although we have a definition for Θ<sub>0</sub>, we don’t have a definition for x<sub>0</sub>. To tidy this up, we’ll introduce a convention, whereby:
x<sub>0</sub> = 1</p>
<p>By doing this, all of our terms are consistent with each other, and the function now looks like this:</p>
<p>h(x) = Θ<sub>0</sub>x<sub>0</sub> + Θ<sub>1</sub>x<sub>1</sub> + Θ<sub>2</sub>x<sub>2</sub> + Θ<sub>3</sub>x<sub>3</sub> + Θ<sub>4</sub>x<sub>4</sub></p>
<p>Setting x<sub>0</sub> = 1 helps to keep our terms consistent, and because it’s defined as 1 there won’t be any mathematical impact to this term. We can also generalise the hypothesis function at this point to get:</p>
<p>h(x) = Θ<sub>0</sub>x<sub>0</sub> + Θ<sub>1</sub>x<sub>1</sub> + … + Θ<sub>n</sub>x<sub>n</sub></p>
<p>The parameter <strong>n</strong> is used as a convention to describe the number of features. In our new training set, n equals 4, since we have 4 features.</p>
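With the x<sub>0</sub> = 1 convention in place, the hypothesis is simply a dot product over n + 1 terms. Here’s an illustrative sketch (the theta values below are made up purely for the example):

```java
public class Hypothesis {
    // h(x) = theta_0*x_0 + theta_1*x_1 + ... + theta_n*x_n, with x[0] == 1
    public static double h(double[] theta, double[] x) {
        double sum = 0.0;
        for (int i = 0; i < theta.length; i++) {
            sum += theta[i] * x[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] theta = { 50000, 2000, 10000, 5000, -1000 };  // hypothetical parameters
        double[] x = { 1, 124, 3, 1, 3 };  // x_0 = 1, then size, bedrooms, floors, age
        System.out.println(h(theta, x));   // a predicted house price
    }
}
```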
<h3 id="the-cost-function">The Cost Function</h3>
<p>Our original cost function with a single feature was defined as:
<script type="math/tex">J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (\theta_0 + \theta_1x_{i} - y_{i} \right)^2</script></p>
<p>The cost function will be updated to support our new hypothesis function, and simply written as: <br />
<script type="math/tex">J(\theta_0, ..., \theta_n) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (\theta_0x_0^i + ... + \theta_nx_n^i - y^i \right)^2</script></p>
<p>This definition is a bit long, so we’ll abbreviate the full linear sum with the shorthand h<sub>θ</sub>(x), our hypothesis evaluated at the ith training example: <br />
<script type="math/tex">J(\theta) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta(x^i) - y^i \right)^2</script></p>
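The cost function translates directly into a loop over the m training examples. This sketch is illustrative, not the project’s implementation, and it assumes each training example already includes x<sub>0</sub> = 1:

```java
public class CostFunction {
    // h(x) = theta . x (the multivariate linear hypothesis)
    static double h(double[] theta, double[] x) {
        double sum = 0.0;
        for (int i = 0; i < theta.length; i++) {
            sum += theta[i] * x[i];
        }
        return sum;
    }

    // J(theta) = (1 / 2m) * sum over i of (h(x^i) - y^i)^2
    public static double cost(double[] theta, double[][] xs, double[] ys) {
        double total = 0.0;
        for (int i = 0; i < xs.length; i++) {
            double error = h(theta, xs[i]) - ys[i];
            total += error * error;
        }
        return total / (2.0 * xs.length);
    }

    public static void main(String[] args) {
        double[][] xs = { { 1, 1 }, { 1, 2 }, { 1, 3 } };  // x_0 = 1 plus one feature
        double[] ys = { 1, 2, 3 };
        // theta = [0, 1] fits this data perfectly, so the cost is 0
        System.out.println(cost(new double[] { 0, 1 }, xs, ys));
    }
}
```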
<h3 id="multivariate-gradient-descent">Multivariate Gradient Descent</h3>
<p>The univariate linear descent algorithm initially had two different functions, one for calculating Θ<sub>0</sub>, and another for calculating Θ<sub>1</sub>.</p>
<p>repeat until convergence: <br />
<script type="math/tex">\theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (\theta_0 + \theta_1x^i - y^i \right)</script><br />
<script type="math/tex">\theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left ((\theta_0 + \theta_1x^i - y^i)x^i \right)</script></p>
<p>If we use our new convention, x<sub>0</sub> = 1, we can make both of these functions consistent with each other, and remove the dedicated theta 0 function altogether. With that in mind, the only real difference between the univariate update and the multivariate update is the very last multiplication term - here we want to multiply by x<sub>j</sub><sup>i</sup>, the jth feature of the ith training example, and not just x. This makes our gradient descent update:</p>
<p>repeat simultaneously until convergence, for j = 0 to j = n <br />
<script type="math/tex">\theta_j := \theta_j - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left ((h_\theta(x^i) - y^i)x_j^i \right)</script></p>
<h3 id="implementing-gradient-descent">Implementing Gradient Descent</h3>
<p>Source code for the Gradient Descent algorithm (implemented in Java) can be found <a href="https://github.com/BrandonJohnGrenier/machine-learning">on GitHub</a>, in the multivariate-linear-regression project. Here’s a snippet of the Gradient Descent implementation:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java">public LinearFunction&lt;T&gt; run() {
    List&lt;BigDecimal&gt; thetas = initialise();
    List&lt;BigDecimal&gt; tempThetas = initialise();
    BigDecimal tempCost = new BigDecimal(100.0);
    this.cost = new BigDecimal(100.0);
    Double convergence = new Double(100.0);
    this.iterations = 1;

    while (convergence &gt; tolerance) {
        IntStream.range(0, thetas.size()).forEach(i -&gt; tempThetas.set(i, calculateTheta(i, thetas)));
        this.cost = costFunction.at(BigDecimals.listToArray(thetas));
        tempCost = costFunction.at(BigDecimals.listToArray(tempThetas));
        this.alpha = (tempCost.doubleValue() &gt; cost.doubleValue()) ? alpha / 2 : alpha + 0.02;
        convergence = Math.abs(tempCost.doubleValue() - cost.doubleValue());
        this.iterations += 1;
        IntStream.range(0, tempThetas.size()).forEach(i -&gt; thetas.set(i, tempThetas.get(i)));
    }

    return new LinearFunction&lt;T&gt;(BigDecimals.listToArray(thetas));
}</code></pre></figure>
<h3 id="feature-scaling">Feature Scaling</h3>
<p>There’s one new problem that gets introduced when we move from making predictions with a single feature to making predictions with multiple features. Let’s take the example of predicting the price of a house with two features: the size of the house and the number of bedrooms.</p>
<table>
<thead>
<tr>
<th>House Size - m<sup>2</sup></th>
<th>Bedrooms</th>
<th>House Price</th>
</tr>
</thead>
<tbody>
<tr>
<td>124</td>
<td>3</td>
<td>223000</td>
</tr>
<tr>
<td>199</td>
<td>3</td>
<td>430000</td>
</tr>
<tr>
<td>334</td>
<td>5</td>
<td>900000</td>
</tr>
<tr>
<td>112</td>
<td>4</td>
<td>300000</td>
</tr>
</tbody>
</table>
<p>In this training set house sizes range from 112m<sup>2</sup> to 334m<sup>2</sup>, and the number of bedrooms ranges from 3 to 5. The problem we run into here is that the Gradient Descent algorithm will likely take quite a few steps to “walk down” the house sizes, because we have a pretty big range to work through. On the other hand, we should be able to get through the bedrooms pretty quickly - they only range from 3 to 5. Since we have to compute the value of theta for <em>all</em> of our features simultaneously, we’ll end up with a scenario where we find the value of theta for our bedrooms quite quickly, but the algorithm will still need to burn through iterations finding the correct theta value for our house size.</p>
<p>If you had a training set with 10 features, and only one of them had a large range, the algorithm would spend most of its time attempting to resolve theta for that one feature. This isn’t particularly efficient, and this is where <em>feature scaling</em> comes in. It’s worthwhile spending a bit of time to understand how feature scaling works, as it’s used throughout machine learning.</p>
<p>The purpose of feature scaling is to minimise the range of values for a specific feature. Ideally, we want every feature in our training set to lie in a small range around zero. More specifically, for any given feature x we want every instance of x to be in the range: -1 <= x<sub>i</sub> <= 1</p>
<p>This isn’t a hard requirement; it’s alright if certain features fall out of this range a bit, but be mindful.</p>
<p>A really simple way to scale our features is to find the maximum value of a given feature instance, and divide all features by that value. For example, we know that the maximum house size is 334m<sup>2</sup>. If we were to divide every house size by 334, we would automatically scale the feature to be within a range of 0 - 1. We’d get the same outcome if we divided the number of bedrooms by 5.</p>
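A sketch of this divide-by-maximum approach (the helper below is illustrative, not from the project source):

```java
import java.util.Arrays;

public class MaxScaling {
    // Scale a feature column into the (0, 1] range by dividing by its maximum
    public static double[] scaleByMax(double[] values) {
        double max = Arrays.stream(values).max().orElse(1.0);
        return Arrays.stream(values).map(v -> v / max).toArray();
    }

    public static void main(String[] args) {
        double[] sizes = { 124, 199, 334, 112 };
        // The largest house (334) scales to exactly 1.0; the rest fall below it
        System.out.println(Arrays.toString(scaleByMax(sizes)));
    }
}
```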
<p>Another useful technique for feature scaling is called <em>mean normalisation</em>, and the formula is:<br />
<script type="math/tex">x_i := \dfrac{x_i - \mu_i}{s_i}</script></p>
<p>Where μ<sub>i</sub> is the average of all the values for feature (i), and s<sub>i</sub> is the <em>range</em> of values, defined as the largest value of feature (i) minus the smallest value of feature (i). Mean normalisation will be the technique we use most often for feature scaling.</p>
<p>Here’s an example of applying mean normalisation to the house size of our second training instance, x<sub>2</sub>. We know that the value of x<sub>2</sub> is 199, I’ve calculated the average house size (192.25), and I’ve also determined the range (largest house is 334m<sup>2</sup>, smallest house is 112m<sup>2</sup>). After applying mean normalisation, the house size should scale to: <br />
<script type="math/tex">x_2 := \dfrac{199 - 192.25}{(334 - 112)} = \dfrac{6.75}{222} = 0.03</script></p>
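Mean normalisation is just as mechanical to implement. In this illustrative sketch, a value just above the mean scales to a small positive number, and a value below the mean scales to a negative one:

```java
public class MeanNormalisation {
    // x_i := (x_i - mean) / range, where range = max - min
    public static double[] normalise(double[] values) {
        double min = values[0], max = values[0], sum = 0.0;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
            sum += v;
        }
        double mean = sum / values.length;
        double range = max - min;
        double[] scaled = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            scaled[i] = (values[i] - mean) / range;
        }
        return scaled;
    }

    public static void main(String[] args) {
        double[] sizes = { 124, 199, 334, 112 };
        // mean = 192.25, range = 222; 199 sits just above the mean,
        // so it scales to a small positive value (about 0.03)
        System.out.println(java.util.Arrays.toString(normalise(sizes)));
    }
}
```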
<p>I used mean normalisation to scale each house size in the table presented at the beginning of this article, and the results are as follows:</p>
<table>
<thead>
<tr>
<th>House Size (Before Scaling)</th>
<th>House Size (After Scaling)</th>
</tr>
</thead>
<tbody>
<tr>
<td>124</td>
<td>-0.31</td>
</tr>
<tr>
<td>199</td>
<td>0.03</td>
</tr>
<tr>
<td>334</td>
<td>0.63</td>
</tr>
<tr>
<td>112</td>
<td>-0.36</td>
</tr>
</tbody>
</table>
<p>You should be able to get much better multivariate regression performance from your Gradient Descent algorithm if you apply mean normalisation to your training set.</p>
<h3 id="implementing-feature-scaling">Implementing Feature Scaling</h3>
<p>Feature scaling will take our training set, which currently looks like this:</p>
<table>
<thead>
<tr>
<th>House Size - m2</th>
<th>Bedrooms</th>
<th>Floors</th>
<th>House Age</th>
<th>House Price</th>
</tr>
</thead>
<tbody>
<tr>
<td>124</td>
<td>3</td>
<td>1</td>
<td>3</td>
<td>223000</td>
</tr>
<tr>
<td>199</td>
<td>3</td>
<td>2</td>
<td>11</td>
<td>430000</td>
</tr>
<tr>
<td>334</td>
<td>5</td>
<td>3</td>
<td>22</td>
<td>900000</td>
</tr>
<tr>
<td>112</td>
<td>4</td>
<td>1</td>
<td>44</td>
<td>300000</td>
</tr>
</tbody>
</table>
<p>And convert it into this:</p>
<table>
<thead>
<tr>
<th>House Size - m2</th>
<th>Bedrooms</th>
<th>Floors</th>
<th>House Age</th>
<th>House Price</th>
</tr>
</thead>
<tbody>
<tr>
<td>-0.31</td>
<td>-0.37</td>
<td>-0.37</td>
<td>-0.41</td>
<td>223000</td>
</tr>
<tr>
<td>0.03</td>
<td>-0.37</td>
<td>0.12</td>
<td>-0.21</td>
<td>430000</td>
</tr>
<tr>
<td>0.63</td>
<td>0.62</td>
<td>0.62</td>
<td>0.05</td>
<td>900000</td>
</tr>
<tr>
<td>-0.36</td>
<td>0.12</td>
<td>-0.37</td>
<td>0.58</td>
<td>300000</td>
</tr>
</tbody>
</table>
<p>Notice that our house price is not affected; only features are subject to scaling, not actual values or theta. Also, keep in mind that we compute the range and mean on a <em>per feature</em> basis. In other words, you will have <em>four different sets</em> of mean and range combinations based on the example above - we’ll have a unique mean and range for the house size, the bedrooms, the floors, and the house age. As part of your scaling implementation, you’ll want to keep these precomputed values handy, and you’ll see why next.</p>
<p>As a consequence of scaling, our algorithm will be “trained” on our scaled training set, not the original training set. Your function will be able to make accurate predictions with <em>scaled</em> inputs, but not with original inputs. In fact, the values of theta that are produced from a non-scaled training set in most circumstances will differ from the values of theta that are produced from a scaled training set.</p>
<p>Whenever you want to make predictions from a function trained using a scaled training set, you will also need to remember to scale your inputs as well. As a concrete example, I’ve included a simple test which shows how I go about scaling inputs after a training set has been normalised; refer to the comments in the test case for more context.</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33</pre></td><td class="code"><pre><span class="nd">@Test</span>
<span class="kd">public</span> <span class="kt">void</span> <span class="nf">shouldGenerateTheCorrectScaledPredictiveFunctionWithMultipleVariables</span><span class="o">()</span> <span class="o">{</span>
<span class="c1">// Create a training set with three features and one actual value.</span>
<span class="n">SupervisedTrainingSet</span> <span class="n">set</span> <span class="o">=</span> <span class="k">new</span> <span class="n">SupervisedTrainingSet</span><span class="o">(</span><span class="mi">3</span><span class="o">);</span>
<span class="n">set</span><span class="o">.</span><span class="na">addInstance</span><span class="o">(</span><span class="mi">0</span><span class="o">,</span> <span class="mi">0</span><span class="o">,</span> <span class="mi">1</span><span class="o">,</span> <span class="mi">1</span><span class="o">)</span>
<span class="n">set</span><span class="o">.</span><span class="na">addInstance</span><span class="o">(</span><span class="mi">1</span><span class="o">,</span> <span class="mi">1</span><span class="o">,</span> <span class="mi">1</span><span class="o">,</span> <span class="mi">4</span><span class="o">)</span>
<span class="n">set</span><span class="o">.</span><span class="na">addInstance</span><span class="o">(</span><span class="mi">2</span><span class="o">,</span> <span class="mi">2</span><span class="o">,</span> <span class="mi">3</span><span class="o">,</span> <span class="mi">3</span><span class="o">)</span>
<span class="n">set</span><span class="o">.</span><span class="na">addInstance</span><span class="o">(</span><span class="mi">3</span><span class="o">,</span> <span class="mi">3</span><span class="o">,</span> <span class="mi">1</span><span class="o">,</span> <span class="mi">4</span><span class="o">)</span>
<span class="n">set</span><span class="o">.</span><span class="na">addInstance</span><span class="o">(</span><span class="mi">4</span><span class="o">,</span> <span class="mi">4</span><span class="o">,</span> <span class="mi">3</span><span class="o">,</span> <span class="mi">1</span><span class="o">)</span>
<span class="n">set</span><span class="o">.</span><span class="na">addInstance</span><span class="o">(</span><span class="mi">5</span><span class="o">,</span> <span class="mi">5</span><span class="o">,</span> <span class="mi">5</span><span class="o">,</span> <span class="mi">5</span><span class="o">);</span>
<span class="c1">// Generate a new training set by passing the original set to the mean </span>
<span class="c1">// normalisation function.</span>
<span class="n">MeanNormalisation</span> <span class="n">normalistion</span> <span class="o">=</span> <span class="k">new</span> <span class="n">MeanNormalisation</span><span class="o">(</span><span class="n">set</span><span class="o">);</span>
<span class="n">SupervisedTrainingSet</span> <span class="n">normalisedSet</span> <span class="o">=</span> <span class="n">normalistion</span><span class="o">.</span><span class="na">normalise</span><span class="o">();</span>
<span class="c1">// Train the gradient descent algorithm with the normalised training set. </span>
<span class="c1">// This will output a function with the appropriate thetas already set.</span>
<span class="n">GradientDescentAlgorithm</span> <span class="n">algorithm</span> <span class="o">=</span> <span class="k">new</span> <span class="n">GradientDescentAlgorithm</span><span class="o">(</span><span class="n">normalisedSet</span><span class="o">);</span>
<span class="n">MinimisableFunction</span> <span class="n">function</span> <span class="o">=</span> <span class="n">algorithm</span><span class="o">.</span><span class="na">minimise</span><span class="o">(</span><span class="k">new</span> <span class="n">LinearFunction</span><span class="o">());</span>
<span class="c1">// I want to make a prediction on values 4, 7, -11, so I need to scale them </span>
<span class="c1">// first, using the correct feature normalisation function. </span>
<span class="n">Double</span> <span class="n">x0</span> <span class="o">=</span> <span class="n">normalistion</span><span class="o">.</span><span class="na">functionForFeature</span><span class="o">(</span><span class="mi">0</span><span class="o">).</span><span class="na">normalise</span><span class="o">(</span><span class="mi">4</span><span class="o">);</span>
<span class="n">Double</span> <span class="n">x1</span> <span class="o">=</span> <span class="n">normalistion</span><span class="o">.</span><span class="na">functionForFeature</span><span class="o">(</span><span class="mi">1</span><span class="o">).</span><span class="na">normalise</span><span class="o">(</span><span class="mi">7</span><span class="o">);</span>
<span class="n">Double</span> <span class="n">x2</span> <span class="o">=</span> <span class="n">normalistion</span><span class="o">.</span><span class="na">functionForFeature</span><span class="o">(</span><span class="mi">2</span><span class="o">).</span><span class="na">normalise</span><span class="o">(-</span><span class="mi">11</span><span class="o">);</span>
<span class="c1">// I can then pass these scaled inputs to the predictive function and</span>
<span class="c1">// assert that I have the right predictions being generated.</span>
<span class="n">BigDecimal</span> <span class="n">prediction</span> <span class="o">=</span> <span class="n">function</span><span class="o">.</span><span class="na">at</span><span class="o">(</span><span class="n">x0</span><span class="o">,</span> <span class="n">x1</span><span class="o">,</span> <span class="n">x2</span><span class="o">);</span>
<span class="n">assertThat</span><span class="o">(</span><span class="n">prediction</span><span class="o">.</span><span class="na">doubleValue</span><span class="o">()).</span><span class="na">isBetween</span><span class="o">(</span><span class="mf">3.99</span><span class="o">,</span> <span class="mf">4.01</span><span class="o">);</span>
<span class="o">}</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
Thu, 26 Jan 2017 00:00:00 +0000
http://brandon.ai/2017/01/26/multivariate-linear-regression.html
http://brandon.ai/2017/01/26/multivariate-linear-regression.htmlUnivariate Linear Regression, Part 3: Implementing Gradient Descent<h3 id="initialising-and-managing-alpha">Initialising and Managing Alpha</h3>
<p>Alpha is the only tunable parameter in the Gradient Descent algorithm, and using the right value is important; if alpha is too big, the algorithm may not converge on a solution; if alpha is too small, the algorithm can take a long time to converge.</p>
<p>The following graphs show the behaviour of the Gradient Descent algorithm when alpha is set too high or too low. The x axis represents a Θ<sub>0</sub> Θ<sub>1</sub> pairing, and the y axis represents the value of the cost function for that pair, J(Θ<sub>0</sub>, Θ<sub>1</sub>). These <em>should</em> be 3-dimensional, with one axis for Θ<sub>0</sub>, one axis for Θ<sub>1</sub> and one axis for the cost function J(Θ<sub>0</sub>, Θ<sub>1</sub>), but I’m not that great at drawing in 3-d. The 2-d graph is alright just to illustrate the point.</p>
<p>Remember that the Gradient Descent algorithm is a minimisation algorithm, so the best solution is the one where the cost function of Θ<sub>0</sub>, Θ<sub>1</sub> is lowest (the minima of the function).</p>
<p><img src="http://brandon.ai/assets/article_images/2017-01-23-implementing-gradient-descent/alpha_issues.jpg" alt="Behaviour of gradient descent with different alpha settings" /></p>
<p>The bigger the value of alpha, the bigger the steps the algorithm will take between each iteration. In iteration #1, the algorithm takes a big step and gets closer to convergence. In iteration #2, it “overshoots” the minima, and subsequent iterations keep bouncing around, never converging towards the minima. When alpha is set too high, we risk never converging on the right answer.</p>
<p>Setting the value of alpha too small is less risky; we <em>will</em> ultimately converge on the right answer. The downside to setting alpha too small is that doing so will require a large number of small steps, which makes convergence take longer than necessary.</p>
<p>So, how do we approach initialising alpha? I’d suggest starting with a value of 0.1 and adjusting as necessary. The cost function will be your guide here - if the gradient descent algorithm is working as expected, the cost function should <em>decrease</em> after each and every iteration. If the value of the cost function is increasing, it’s a strong signal that alpha is set too high, and you should decrease it.</p>
<p>I’ve taken an algorithmic approach to adjusting alpha while gradient descent is running. After every iteration, I check the newly computed cost function against the previously computed cost function. If I detect that the cost function has decreased, I increase the value of alpha by 0.02 (take slightly bigger steps). On the other hand, if I detect that the cost function has increased, I divide alpha by 2 (we’re on the wrong path, so we want to be more aggressive in our course correction).</p>
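The adjustment rule itself is only a few lines. Here’s a minimal sketch (the class name is hypothetical, and the constants 0.02 and 2 are simply the values I settled on for my data set):

```java
// Sketch of the adaptive alpha rule described above. The constants (0.02
// and 2) are the ones I use; tune them for your own data set.
public class AlphaAdjuster {
    private double alpha;

    public AlphaAdjuster(double initialAlpha) {
        this.alpha = initialAlpha;
    }

    // Called once per gradient descent iteration with the previous and
    // newly computed cost values.
    public double adjust(double previousCost, double newCost) {
        if (newCost > previousCost) {
            alpha = alpha / 2;       // cost increased: we overshot, correct aggressively
        } else {
            alpha = alpha + 0.02;    // cost decreased: take slightly bigger steps
        }
        return alpha;
    }

    public static void main(String[] args) {
        AlphaAdjuster adjuster = new AlphaAdjuster(0.1);
        System.out.println(adjuster.adjust(10.0, 9.0));  // cost fell: alpha grows by 0.02
        System.out.println(adjuster.adjust(9.0, 9.5));   // cost rose: alpha is halved
    }
}
```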
<p>To show you how this works in practice, I’ve captured the output of alpha after each iteration during a gradient descent run and plotted the results in the following graph:</p>
<p><img src="http://brandon.ai/assets/article_images/2017-01-23-implementing-gradient-descent/alpha_1.png" alt="Alpha per iteration, starting with alpha = 0.1" /></p>
<p>The results here are pretty interesting - for this data set, it appears that there’s a “ceiling” for alpha at about 0.23, and any time the value of alpha hits this point the steps become too large and the cost function increases.</p>
<p>In a second experiment, I intentionally initialised the value of alpha with a really big number (4). Here are the results:</p>
<p><img src="http://brandon.ai/assets/article_images/2017-01-23-implementing-gradient-descent/alpha_4.png" alt="Alpha per iteration, starting with alpha = 4" /></p>
<p>As you can see here, the value of alpha quickly descends into its normal operating range, and once alpha gets there it doesn’t make any significant movement upwards or downwards.</p>
<h3 id="defining-convergence">Defining Convergence</h3>
<p>In the original definition of the Gradient Descent algorithm, we start with the statement “repeat until convergence”. How do we define convergence?</p>
<p>We already know that a well behaved Gradient Descent algorithm should reduce the cost function of J(Θ<sub>0</sub>, Θ<sub>1</sub>) after every iteration. As we approach the minima, we should also observe smaller and smaller cost reduction changes; the following graph illustrates the point:</p>
<p><img src="http://brandon.ai/assets/article_images/2017-01-23-implementing-gradient-descent/convergence.jpg" alt="The difference in cost function decreases as we approach the minima" /></p>
<p>When we start at the “top” of the function, each iteration reduces the cost of J(Θ<sub>0</sub>, Θ<sub>1</sub>), and the direction of our travel is nearly vertical along the y axis. Relative to y, we’re not travelling very far along the x axis. As we approach the bottom of the curve, we slowly but surely modify the direction of our travel, and a greater proportion of our travel is directed horizontally along the x axis, not vertically along the y axis.</p>
<p>Since the value of y represents the cost function of J(Θ<sub>0</sub>, Θ<sub>1</sub>), we can say that our Gradient Descent algorithm has converged when the difference between the cost function of two iterations is <em>sufficiently small</em>. There’s no right answer to what sufficiently small is, as it’s entirely dependent on your personal level of tolerance for accuracy.</p>
<p>I’ll give you an example - the graph below shows the delta of the cost function between each iteration.</p>
<p><img src="http://brandon.ai/assets/article_images/2017-01-23-implementing-gradient-descent/convergence_measurement.png" alt="Measuring convergence across a Gradient Descent algorithm run" /></p>
<p>As you can see, by about iteration 10 the delta between the cost function for each iteration is nearly 0, so we’re getting close to finding the ideal solution. It’s hard to see on the graph, but the delta at iteration 10 is 0.0002. Is that a good enough tolerance to say we’ve converged? That’s up to you.</p>
<p>For this particular data set, I know that the “perfect” hypothesis function should produce Θ<sub>0</sub> = 0 and Θ<sub>1</sub> = 1.</p>
<p>If I were to set my definition of convergence at 0.0002, the Gradient Descent algorithm would stop after about 10 iterations. The result would produce Θ<sub>0</sub> = 0.203 and Θ<sub>1</sub> = 0.952. That’s not too bad; we’re <em>pretty</em> close to the right answer, but there’s room for improvement.</p>
<p>By iteration 150, the cost function delta between iterations has shrunk down to 0.0000008. Using <em>this</em> as our definition for convergence would produce Θ<sub>0</sub> = 0.005 and Θ<sub>1</sub> = 0.999. This result would give you more accurate predictions, at the expense of running additional iterations.</p>
<p>Both definitions of convergence are correct, but have different outcomes and goals. The choice of an appropriate convergence value will depend on balancing the need for accuracy against efficiency.</p>
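As a minimal sketch, the convergence test boils down to a single comparison of the per-iteration cost delta against your chosen tolerance (names here are illustrative):

```java
// Sketch: gradient descent has converged when the cost delta between two
// consecutive iterations drops below a chosen tolerance.
public class ConvergenceCheck {

    public static boolean hasConverged(double previousCost, double newCost, double tolerance) {
        return Math.abs(newCost - previousCost) < tolerance;
    }

    public static void main(String[] args) {
        // Early in a run the cost is still falling quickly: not converged.
        System.out.println(hasConverged(1.0, 0.9, 0.0002));
        // Later the per-iteration delta is tiny: converged at this tolerance.
        System.out.println(hasConverged(0.9001, 0.90005, 0.0002));
    }
}
```

Tightening the tolerance (say, from 0.0002 to 0.0000008) simply makes this check fail for longer, buying accuracy at the cost of extra iterations.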
<h3 id="implementing-gradient-descent">Implementing Gradient Descent</h3>
<p>Source code for the Gradient Descent algorithm (implemented in Java) can be found <a href="https://github.com/BrandonJohnGrenier/machine-learning">here</a>. Here’s a snippet of the Gradient Descent implementation:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><table style="border-spacing: 0"><tbody><tr><td class="gutter gl" style="text-align: right"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23</pre></td><td class="code"><pre><span class="kd">public</span> <span class="n">LinearFunction</span><span class="o"><</span><span class="n">T</span><span class="o">></span> <span class="nf">run</span><span class="o">()</span> <span class="o">{</span>
<span class="n">BigDecimal</span> <span class="n">theta0</span> <span class="o">=</span> <span class="k">new</span> <span class="n">BigDecimal</span><span class="o">(</span><span class="mf">0.0</span><span class="o">),</span> <span class="n">tempTheta0</span> <span class="o">=</span> <span class="k">new</span> <span class="n">BigDecimal</span><span class="o">(</span><span class="mf">0.0</span><span class="o">);</span>
<span class="n">BigDecimal</span> <span class="n">theta1</span> <span class="o">=</span> <span class="k">new</span> <span class="n">BigDecimal</span><span class="o">(</span><span class="mf">0.0</span><span class="o">),</span> <span class="n">tempTheta1</span> <span class="o">=</span> <span class="k">new</span> <span class="n">BigDecimal</span><span class="o">(</span><span class="mf">0.0</span><span class="o">);</span>
<span class="n">BigDecimal</span> <span class="n">cost</span> <span class="o">=</span> <span class="k">new</span> <span class="n">BigDecimal</span><span class="o">(</span><span class="mf">100.0</span><span class="o">),</span> <span class="n">tempCost</span> <span class="o">=</span> <span class="k">new</span> <span class="n">BigDecimal</span><span class="o">(</span><span class="mf">100.0</span><span class="o">);</span>
<span class="n">Double</span> <span class="n">convergence</span> <span class="o">=</span> <span class="k">new</span> <span class="n">Double</span><span class="o">(</span><span class="mf">100.0</span><span class="o">);</span>
<span class="k">while</span> <span class="o">(</span><span class="n">convergence</span> <span class="o">></span> <span class="n">tolerance</span><span class="o">)</span> <span class="o">{</span>
<span class="n">tempTheta0</span> <span class="o">=</span> <span class="n">calculateThetaZero</span><span class="o">(</span><span class="n">theta0</span><span class="o">,</span> <span class="n">theta1</span><span class="o">);</span>
<span class="n">tempTheta1</span> <span class="o">=</span> <span class="n">calculateThetaOne</span><span class="o">(</span><span class="n">theta0</span><span class="o">,</span> <span class="n">theta1</span><span class="o">);</span>
<span class="n">cost</span> <span class="o">=</span> <span class="n">costFunction</span><span class="o">.</span><span class="na">at</span><span class="o">(</span><span class="n">theta0</span><span class="o">,</span> <span class="n">theta1</span><span class="o">);</span>
<span class="n">tempCost</span> <span class="o">=</span> <span class="n">costFunction</span><span class="o">.</span><span class="na">at</span><span class="o">(</span><span class="n">tempTheta0</span><span class="o">,</span> <span class="n">tempTheta1</span><span class="o">);</span>
<span class="n">alpha</span> <span class="o">=</span> <span class="o">(</span><span class="n">tempCost</span><span class="o">.</span><span class="na">doubleValue</span><span class="o">()</span> <span class="o">></span> <span class="n">cost</span><span class="o">.</span><span class="na">doubleValue</span><span class="o">())</span> <span class="o">?</span> <span class="n">alpha</span> <span class="o">/</span> <span class="mi">2</span> <span class="o">:</span> <span class="n">alpha</span> <span class="o">+</span> <span class="mf">0.02</span><span class="o">;</span>
<span class="n">convergence</span> <span class="o">=</span> <span class="n">Math</span><span class="o">.</span><span class="na">abs</span><span class="o">(</span><span class="n">tempCost</span><span class="o">.</span><span class="na">doubleValue</span><span class="o">()</span> <span class="o">-</span> <span class="n">cost</span><span class="o">.</span><span class="na">doubleValue</span><span class="o">());</span>
<span class="n">theta0</span> <span class="o">=</span> <span class="n">tempTheta0</span><span class="o">;</span>
<span class="n">theta1</span> <span class="o">=</span> <span class="n">tempTheta1</span><span class="o">;</span>
<span class="o">}</span>
<span class="k">return</span> <span class="k">new</span> <span class="n">LinearFunction</span><span class="o"><</span><span class="n">T</span><span class="o">>(</span><span class="n">theta0</span><span class="o">,</span> <span class="n">theta1</span><span class="o">);</span>
<span class="o">}</span><span class="w">
</span></pre></td></tr></tbody></table></code></pre></figure>
<p>The most important implementation note here is that <em>theta0</em> and <em>theta1</em> should be calculated atomically (in machine learning, this is referred to as updating them “simultaneously”). The newly computed values of theta0 and theta1 are not assigned immediately; they are assigned to the temporary variables <em>tempTheta0</em> and <em>tempTheta1</em>. Doing this ensures that we’re computing new values of theta0 and theta1 against the current values of both theta0 and theta1. If we were to perform direct assignment, the value of theta1 could be a bit off, as we would be calculating it against the current value of theta1 but the <em>new</em> value of theta0. This <em>could</em> still make for a workable Gradient Descent algorithm, but direct assignment can lead to unexpected behaviour.</p>
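To make the difference concrete, here’s a small sketch contrasting the two update strategies. The toy gradient functions are hypothetical, chosen only because each partial derivative depends on both thetas, which is exactly the situation where update order matters:

```java
// Sketch of why the simultaneous update matters. The toy partial
// derivatives below are hypothetical, not the linear regression gradients;
// each one depends on BOTH thetas, so update order changes the result.
public class SimultaneousUpdate {

    static double dTheta0(double t0, double t1) { return 2 * t0 + t1; }
    static double dTheta1(double t0, double t1) { return t0 + 2 * t1; }

    // Correct: both gradients are evaluated against the current thetas,
    // then the results are assigned together.
    public static double[] simultaneous(double t0, double t1, double alpha) {
        double temp0 = t0 - alpha * dTheta0(t0, t1);
        double temp1 = t1 - alpha * dTheta1(t0, t1);
        return new double[] { temp0, temp1 };
    }

    // Subtly wrong: theta1's gradient sees the freshly updated theta0.
    public static double[] sequential(double t0, double t1, double alpha) {
        t0 = t0 - alpha * dTheta0(t0, t1);
        t1 = t1 - alpha * dTheta1(t0, t1);
        return new double[] { t0, t1 };
    }

    public static void main(String[] args) {
        double[] s = simultaneous(1.0, 1.0, 0.1);
        double[] q = sequential(1.0, 1.0, 0.1);
        // The theta1 values diverge between the two strategies.
        System.out.println(s[1] + " vs " + q[1]);
    }
}
```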
Mon, 23 Jan 2017 00:00:00 +0000
http://brandon.ai/2017/01/23/ulr-implementing-gradient-descent.html
http://brandon.ai/2017/01/23/ulr-implementing-gradient-descent.htmlUnivariate Linear Regression, Part 2: Introduction to Gradient Descent<p>The Gradient Descent algorithm is a general purpose algorithm that has a number of practical applications in machine learning. By the end of this article you should have a good understanding of how the Gradient Descent algorithm works, and appreciate how it helps to solve minimisation problems in general.</p>
<h3 id="intuitive-gradient-descent">Intuitive Gradient Descent</h3>
<p>Before we define the gradient descent algorithm, let’s first get an intuitive understanding of what the algorithm will accomplish.</p>
<p>The Gradient Descent algorithm is useful for finding <em>local minima</em> - the smallest values of a function, in and around the local area of a function. As an example, let’s take a look at the odd-looking function below - you can think of this as a simple landscape with hills and valleys:</p>
<p><img src="http://brandon.ai/assets/article_images/2017-01-21-gradient-descent/gradient_descent_concept_v1.png" alt="" /></p>
<p>We’ll want the algorithm to find the lowest valley it can, and we don’t know what the lowest point is ahead of time. To start, we’re just going to pick an arbitrary spot on our “landscape” - the red X.</p>
<p>After we’ve picked our starting point, we’re going to have a look around, and figure out which direction to take in order to go “downhill”, so to speak. Once we’ve figured out the direction we want to take, we’ll take one step in that direction. Now that we’ve taken our first step, the process repeats itself - we’ll have another look around, figure out which direction we want to take in order to go downhill, then take another step (this process is represented by the red dashes in the picture above).</p>
<p>This process repeats itself until you take a look around, and realise that taking another step won’t bring you further downhill. At this point you’ve found the <em>local minima</em> (the red dot in the picture above), and the process completes.</p>
<h3 id="minima-and-local-minima">Minima and Local Minima</h3>
<p>One of the properties of the Gradient Descent algorithm is that it is efficient at finding <em>local minima</em> in a function. To demonstrate this, I’ve also picked another arbitrary point on the graph (the blue X), and followed the same process. You’ll notice that this starting point produced a different value compared to our first starting point. With our “hill and valley” function above, the answer that you get from the Gradient Descent algorithm is highly dependent upon your initial starting position.</p>
<p>The conclusion we can draw is that the Gradient Descent algorithm is not useful for minimising functions that have more than one local minimum, since the answer you’ll get back will be highly dependent on where you pick your initial starting position.</p>
<p>The Gradient Descent algorithm will work best at minimising functions where there is only one local minimum. As it turns out, our linear regression cost function is exactly one of these functions, so the Gradient Descent algorithm will give us the result we’re expecting.</p>
<h3 id="the-gradient-descent-algorithm">The Gradient Descent Algorithm</h3>
<p>The formal definition of the Gradient Descent algorithm is as follows:</p>
<p>repeat until convergence, for j = 0 and j = 1:<br />
<script type="math/tex">\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)</script></p>
<p>It’s not obvious, but this is really just an algorithm for taking a step from point a to point b. Imagine you’re on a walk - where you land after each step depends on three factors: your current location, the size of the step you want to take (giant stride or small shuffle), and the direction of the step (north, south, south-west). Let’s break down the gradient descent algorithm and explain how each term relates to these factors:</p>
<p><script type="math/tex">\theta_j := \theta_j</script> <br />
The first thing to note here is the expression <strong>:=</strong> which means <em>assignment</em>. We want to assign a new value of Θ<sub>j</sub>, which should initially be based on the current value of Θ<sub>j</sub>. Your <em>new</em> location (after taking a step) will be initially based on your current location. <br />
<script type="math/tex">\alpha</script> <br />
This is the term <em>alpha</em>, which defines the size of the step you want to take.<br />
<script type="math/tex">\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)</script> <br />
This term represents the direction of the step. Obviously there’s a bit more to unpack here, but I’ll talk about how this term works when we get to implementation.</p>
<p>The algorithm also mentions to repeat until convergence, for j = 0 and j = 1. All this means in practice is that we want to apply the Gradient Descent algorithm for Θ<sub>0</sub> and Θ<sub>1</sub>, so our implementation will look like this: <br />
<script type="math/tex">\theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)</script> <br />
<script type="math/tex">\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)</script></p>
<p>We already know what J(Θ<sub>0</sub>, Θ<sub>1</sub>) is, that’s our cost function. Let’s expand our gradient descent function by replacing the J(Θ<sub>0</sub>, Θ<sub>1</sub>) term with the actual cost function. We now have: <br />
<script type="math/tex">\theta_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h (x_{i}) - y_{i} \right)^2</script> <br />
<script type="math/tex">\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h (x_{i}) - y_{i} \right)^2</script></p>
<p>The next part may lose some of you if you don’t know calculus - don’t worry though, you don’t actually need to understand calculus as I’ve done the work for you. We need to take the <em>partial derivative</em> of our cost function. After we’ve done that, the gradient descent algorithm now looks like this: <br />
<script type="math/tex">\theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (h(x_{i}) - y_{i} \right)</script><br />
<script type="math/tex">\theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left ((h(x_{i}) - y_{i})x_{i} \right)</script></p>
<p>The last thing we’re going to do is replace the hypothesis function h(x) term with the actual hypothesis function, and we finally end up with our implementable gradient descent algorithm:</p>
<p>repeat until convergence: <br />
<script type="math/tex">\theta_0 := \theta_0 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (\theta_0 + \theta_1x_i - y_{i} \right)</script><br />
<script type="math/tex">\theta_1 := \theta_1 - \alpha \dfrac {1}{m} \displaystyle \sum _{i=1}^m \left ((\theta_0 + \theta_1x_i - y_{i})x_{i} \right)</script></p>
<p>In <a href="http://brandon.ai/2017/01/23/ulr-implementing-gradient-descent.html">part 3</a> I’ll provide insights into alpha, the partial differential equation, convergence and some practical implementation notes and example source code.</p>
Sat, 21 Jan 2017 00:00:00 +0000
http://brandon.ai/2017/01/21/ulr-introduction-to-gradient-descent.html
http://brandon.ai/2017/01/21/ulr-introduction-to-gradient-descent.htmlUnivariate Linear Regression, Part 1: Model and Cost Function<p>We are going to learn how to build a simple supervised machine learning algorithm which will be able to predict housing prices based on a single feature, the size of a house. Going through this exercise will introduce a number of key concepts, algorithms and mathematics used in machine learning.</p>
<h3 id="making-intuitive-predictions">Making Intuitive Predictions</h3>
<p>Let’s assume I want to purchase a 150m2 house in Sydney, and that the price of a house is primarily driven by the size of the house. If I can get my hands on a small set of housing data I can start making my own predictions by hand. Here’s how to do it:</p>
<p><img src="http://brandon.ai/assets/article_images/2017-01-15-linear-regression-with-one-variable/linear-regression-goal.jpg" alt="Housing prices as a function of house size" /></p>
<p>I’ve plotted the data points on a graph, housing prices are on the Y axis and the size of houses in square meters are on the X axis. I drew a straight line that fits the data as closely as possible. The straight line is an example of a <em>linear function</em> - linear functions are simple mathematical functions that produce straight lines.</p>
<p>If you remember from the first article, there are two types of machine learning problems: classification problems and regression problems. House pricing prediction is an example of a <em>regression</em> problem. Because we’re predicting housing prices based solely on the size of the house, this problem is technically referred to as a <em>univariate regression problem</em> (univariate is a fancy way of saying one variable).</p>
<p>Now that I have my linear function established I can start making predictions. For a 150 square meter house, I drew a green dashed line until it intersected with my hand-rolled linear function, and then came across to the Y axis, which gives me an estimate of about $390,000. My first prediction is done.</p>
<p>This prediction is an intuitive prediction - there’s no math involved here, I just eyeballed the data and tried to make a straight line that fit the data best.</p>
<h3 id="optimising-fit">Optimising Fit</h3>
<p>The graph below shows four different linear functions, each of which will result in different predictions.</p>
<p><img src="http://brandon.ai/assets/article_images/2017-01-15-linear-regression-with-one-variable/linear-regressionminimization.jpg" alt="" /></p>
<p>It’s pretty clear that two of the linear functions don’t fit the housing price data very well, and as a consequence these functions will produce poor predictions. On the other hand, we have two other linear functions that fit the data quite well; both of these will produce much better predictions. Given these two well-fitting linear functions, which one will produce the most accurate housing price predictions? Is there an even better function that could produce even better predictions?</p>
<p>This is exactly what our first machine learning algorithm will answer. Effectively, we’re going to implement a machine learning algorithm that can draw a really accurate straight line. Sounds exciting? You bet!</p>
<h3 id="defining-the-training-set">Defining the Training Set</h3>
<p>In a supervised learning algorithm, the data set we use to train the algorithm with the “right” answers is called the <em>training set</em>. I’m going to introduce some common notation and conventions that we use to describe training sets.</p>
<table>
<thead>
<tr>
<th>House Size - m<sup>2</sup> (x)</th>
<th>House Price - AUD (y)</th>
</tr>
</thead>
<tbody>
<tr>
<td>124</td>
<td>223000</td>
</tr>
<tr>
<td>199</td>
<td>430000</td>
</tr>
<tr>
<td>334</td>
<td>900000</td>
</tr>
<tr>
<td>112</td>
<td>300000</td>
</tr>
</tbody>
</table>
<p><strong>x</strong> - this is referred to as our input variable (also known as a <em>feature</em>), in our example the house size is our input variable.<br />
<strong>y</strong> - this is referred to as our output variable (also known as a <em>target</em>), in our example the house price is our output variable.<br />
<strong>m</strong> - the number of training samples - in the table above, we have 4 training samples<br />
<strong>(x, y)</strong> - this refers to a single training example - in other words, a single row in our table.<br />
<strong>(x<sup>(i)</sup>, y<sup>(i)</sup>)</strong> - this refers to a specific instance of a training example. For example, the third row in our training set would be addressed as (x<sup>(3)</sup>, y<sup>(3)</sup>)</p>
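<p>To make the notation concrete, here’s how this training set might be represented in plain Python (a sketch for illustration - the values come straight from the table above):</p>

```python
# Training set from the table above.
x = [124, 199, 334, 112]              # input variable (feature): house size in m^2
y = [223000, 430000, 900000, 300000]  # output variable (target): house price in AUD

m = len(x)  # m - the number of training samples (4)

# (x^(i), y^(i)) addresses a specific training example; the notation is
# 1-indexed, so the third training example lives at Python index 2.
x_3, y_3 = x[2], y[2]
```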
<h3 id="the-hypothesis-function">The Hypothesis Function</h3>
<p>Our machine learning algorithm will figure out which function it should use to make the best possible predictions. The function produced (or “learned”) by this machine learning algorithm is referred to as <strong>h</strong> - short for <em>hypothesis</em>. It’s probably not the most accurate description, but it’s been the convention for quite some time, so we’re sticking with it. All of the straight lines I drew in the graphs above are examples of different hypothesis functions.</p>
<p>The goal of our supervised learning algorithm: given a training set, produce (learn) a function <strong>h</strong> so that <strong>h(x)</strong> is a good predictor for <strong>y</strong>. The hypothesis function <strong>h(x)</strong> is a simple linear regression function that will let us figure out how to fit the best possible straight line to our data.</p>
<p>The function is defined as:<br />
h<sub>Θ</sub>(x) = Θ<sub>0</sub> + Θ<sub>1</sub>*x</p>
<p>With a little shorthand notation, I’ll normally write this as:<br />
h(x) = Θ<sub>0</sub> + Θ<sub>1</sub>x</p>
<p>The symbol Θ is called theta. The values <strong>Θ<sub>0</sub></strong> and <strong>Θ<sub>1</sub></strong> are referred to as the <strong>parameters</strong> of the function. Again, Θ<sub>0</sub> and Θ<sub>1</sub> are used by convention, and simply represent arbitrary parameters that our machine learning algorithm will learn to modify to get the best fitting straight line.</p>
<p>The function could just have easily been written like:<br />
f(x) = a + bx</p>
<p>That function might even look familiar to you if you’ve taken a few math classes in high school. It just so happens that Θ<sub>0</sub>, Θ<sub>1</sub> and h are the standard convention.</p>
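<p>As a quick sketch, the hypothesis function translates directly into a couple of lines of Python (the parameter names mirror Θ<sub>0</sub> and Θ<sub>1</sub>):</p>

```python
def h(x, theta_0, theta_1):
    """The hypothesis function h(x) = theta_0 + theta_1 * x."""
    return theta_0 + theta_1 * x

# With theta_0 = 1 and theta_1 = 2, the prediction for x = 150 is 1 + 2*150 = 301.
prediction = h(150, theta_0=1, theta_1=2)
```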
<h3 id="the-cost-function">The Cost Function</h3>
<p>The machine learning algorithm will ultimately decide on which values to use for Θ<sub>0</sub> and Θ<sub>1</sub>. How do we go about doing that?</p>
<p>Remember the goal of our hypothesis function: h(x) should be a good predictor of y, therefore we want to choose values of Θ<sub>0</sub> and Θ<sub>1</sub> that make h(x) a good predictor of y!</p>
<p>The best prediction we can have for h(x) = y is when the predicted value of <strong>h(x)</strong> matches the actual value <strong>y</strong> exactly - in other words, the delta (difference) between the predicted value h(x) and the actual value is <strong>0</strong>.</p>
<p>The algorithm isn’t always going to make predictions perfectly, but the idea behind this is that we want to have the <em>smallest possible difference</em> between a predicted value and an actual value, so that <strong>h(x) - y</strong> is as small as possible, ideally 0.</p>
<p>For every example (x, y) in our training set, we want to find the difference between the predicted value h(x) and the actual value y, and then calculate the average delta. This will tell us how well we can make predictions across our entire training set: the smaller the average, the better the predictions. Algorithmically:</p>
<ol>
<li>For every training instance, calculate the difference between the predicted value and the actual value.</li>
<li>Add all of the differences together.</li>
<li>Divide the sum of the differences by the size of the training set. This will give us the average difference between predictions and expected outputs across our entire training set.</li>
</ol>
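<p>The three steps above can be sketched directly in Python (a hypothetical illustration - <code>hypothesis</code> stands for any h(x) we might be evaluating):</p>

```python
def average_difference(hypothesis, xs, ys):
    """Average difference between predicted values and actual values."""
    m = len(xs)
    # Steps 1 and 2: calculate each difference and add them all together.
    total = sum(hypothesis(x) - y for x, y in zip(xs, ys))
    # Step 3: divide the sum by the size of the training set.
    return total / m

# Example: a hypothesis that over-predicts every point of y = x by exactly 1
# has an average difference of 1.0.
delta = average_difference(lambda x: x + 1, [1, 2, 3], [1, 2, 3])
```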
<p>We’ll now define this algorithm mathematically. First, to calculate the difference between a specific predicted value and the actual value, we can use:<br />
<script type="math/tex">h(x_{i}) - y_{i}</script></p>
<p>Add all of the differences together, from the first instance to the last instance in the training set: <br />
<script type="math/tex">\displaystyle \sum _{i=1}^m \left (h (x_{i}) - y_{i} \right)</script></p>
<p>Divide the sum of the differences by the size of the training set:
<script type="math/tex">\dfrac {1}{m} \displaystyle \sum _{i=1}^m \left (h (x_{i}) - y_{i} \right)</script></p>
<p>The formal cost function is written as follows:<br />
<script type="math/tex">\dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h (x_{i}) - y_{i} \right)^2</script></p>
<p>You’ll notice a couple of differences between the cost function above and the one I’ve come up with.</p>
<p>First, the “error” component (the difference between the prediction and the actual value) is now squared. We square the error component to make the implementation of the algorithms we use later a bit easier. Squaring always ensures that we have a positive number: 2<sup>2</sup> and (-2)<sup>2</sup> are both 4. Later on, it means we can simply find the cost closest to 0, rather than finding the “positive value closest to 0” and the “negative value closest to 0” and then figuring out which of <em>them</em> is actually the closest.</p>
<p>Second, instead of just dividing by the size of the training set, we now divide by twice the size of the training set. Dividing by 2m instead of m doesn’t change which values of the parameters minimise the function - it just halves the result - and the extra factor of 2 conveniently cancels out when we differentiate the squared term later on, which makes the math a little cleaner. This is the formal definition of the <em>squared error cost function</em>.</p>
<p>The machine learning algorithm will ultimately decide on which values to use for Θ<sub>0</sub> and Θ<sub>1</sub>. Now we can clearly articulate our goal: the algorithm must find values for Θ<sub>0</sub> and Θ<sub>1</sub> so that the result from the function
<script type="math/tex">J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (\theta_0 + \theta_1x_{i} - y_{i} \right)^2</script></p>
<p>produces the smallest possible value - in formal terms, we want to <em>minimise</em> J(Θ<sub>0</sub>, Θ<sub>1</sub>).</p>
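<p>Here’s a minimal Python version of the squared error cost function, written against the definition above (a sketch, not a production implementation):</p>

```python
def cost(theta_0, theta_1, xs, ys):
    """Squared error cost function J(theta_0, theta_1)."""
    m = len(xs)
    squared_errors = ((theta_0 + theta_1 * x - y) ** 2 for x, y in zip(xs, ys))
    return sum(squared_errors) / (2 * m)

# A perfect fit produces a cost of 0: the data y = 2x is matched exactly
# by theta_0 = 0 and theta_1 = 2.
perfect = cost(0, 2, [1, 2, 3], [2, 4, 6])
```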
<h3 id="the-cost-function-in-practice">The Cost Function in Practice</h3>
<p>The cost function might take you a while to unpack, so I wanted to introduce some practical examples where we can use the cost function and see what it’s doing. The graph below has the observed (actual) data points plotted in red, and I’ve used the hypothesis function, h(x) = Θ<sub>0</sub> + Θ<sub>1</sub>x to plot out three different lines on the graph.</p>
<p>I chose Θ<sub>0</sub> = 1 and Θ<sub>1</sub> = 1 to draw the purple line: h(x) = 1 + 1x<br />
I chose Θ<sub>0</sub> = 5 and Θ<sub>1</sub> = 0 to draw the light blue line: h(x) = 5 + 0x<br />
I chose Θ<sub>0</sub> = 0 and Θ<sub>1</sub> = 0.5 to draw the dark blue line: h(x) = 0 + 0.5x</p>
<p><img src="http://brandon.ai/assets/article_images/2017-01-15-linear-regression-with-one-variable/calculating_theta.jpg" alt="" /></p>
<p>Given we have 5 training samples (m = 5) in the data set, the cost function J(1, 1) can be written as: <br />
<script type="math/tex">J(1, 1) = \dfrac {1}{10} \displaystyle \sum _{i=1}^5 \left (h (x_{i}) - y_{i} \right)^2</script></p>
<p>We’ll sum up the squared error for each of the data points:<br />
(h(x<sub>1</sub>) - y<sub>1</sub>)<sup>2</sup> = (2 - 1)<sup>2</sup> = (1)<sup>2</sup> = 1 + <br />
(h(x<sub>2</sub>) - y<sub>2</sub>)<sup>2</sup> = (3 - 2)<sup>2</sup> = (1)<sup>2</sup> = 1 + <br />
(h(x<sub>3</sub>) - y<sub>3</sub>)<sup>2</sup> = (4 - 3)<sup>2</sup> = (1)<sup>2</sup> = 1 + <br />
(h(x<sub>4</sub>) - y<sub>4</sub>)<sup>2</sup> = (5 - 4)<sup>2</sup> = (1)<sup>2</sup> = 1 + <br />
(h(x<sub>5</sub>) - y<sub>5</sub>)<sup>2</sup> = (6 - 5)<sup>2</sup> = (1)<sup>2</sup> = 1</p>
<p>The sum of the squared error is equal to 5:<br />
<script type="math/tex">\displaystyle \sum _{i=1}^5 \left (h (x_{i}) - y_{i} \right)^2 = 5</script></p>
<p>We divide the sum of the squared error by 10, and find that J(1, 1) is equal to 0.5<br />
<script type="math/tex">J(1, 1) = \dfrac {5}{10} = 0.5</script></p>
<p>I’ve calculated the values of the last two cost functions - as an exercise, see if you can calculate them yourself and get the same numbers.</p>
<p>J(5, 0) = 3.0<br />
J(0, 0.5) = 1.375</p>
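<p>If you want to check your working, this short script reproduces all three cost values on the data set from the worked example above (the five points (1, 1) through (5, 5)):</p>

```python
def cost(theta_0, theta_1, xs, ys):
    """Squared error cost function J(theta_0, theta_1)."""
    m = len(xs)
    return sum((theta_0 + theta_1 * x - y) ** 2
               for x, y in zip(xs, ys)) / (2 * m)

xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 5]

j_1_1 = cost(1, 1, xs, ys)     # 0.5
j_5_0 = cost(5, 0, xs, ys)     # 3.0
j_0_05 = cost(0, 0.5, xs, ys)  # 1.375
```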
<p>After doing the math, I’ve been able to show that J(1, 1) is the lowest cost of the three that were presented. If you look at the graph again, you’ll notice that the corresponding hypothesis best fits the data as well. Is there a choice of parameters that fits the data even better? Absolutely - J(0, 1) would give us a cost of 0, which means the hypothesis would fit our data set perfectly.</p>
<p>Now that we’re able to calculate the cost function, we have one last problem to solve before we can implement our machine learning algorithm. How can we quickly and efficiently minimise J(Θ<sub>0</sub>, Θ<sub>1</sub>)?</p>
<p>We could take a naive, brute force approach and simply calculate the value of J(Θ<sub>0</sub>, Θ<sub>1</sub>) for every possible combination of Θ<sub>0</sub> and Θ<sub>1</sub>, and find the lowest value. This will <em>work</em>, but you might be waiting a while for your algorithm to finish.</p>
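<p>To make the brute force idea concrete, here’s a naive grid search sketch - the parameter range (-5 to 5) and grid resolution are arbitrary choices for illustration, which is exactly the problem with this approach:</p>

```python
def cost(theta_0, theta_1, xs, ys):
    """Squared error cost function J(theta_0, theta_1)."""
    m = len(xs)
    return sum((theta_0 + theta_1 * x - y) ** 2
               for x, y in zip(xs, ys)) / (2 * m)

def grid_search(xs, ys, steps=101):
    """Brute force: evaluate J for every (theta_0, theta_1) pair on a
    coarse grid between -5 and 5, and keep the pair with the lowest cost."""
    candidates = [-5 + 10 * i / (steps - 1) for i in range(steps)]
    return min((cost(t0, t1, xs, ys), t0, t1)
               for t0 in candidates
               for t1 in candidates)

# On the data set from the example above, the lowest cost found is 0,
# at theta_0 = 0 and theta_1 = 1.
best_cost, t0, t1 = grid_search([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
```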
<p>In <a href="http://brandon.ai/2017/01/21/ulr-introduction-to-gradient-descent.html">part 2</a> I’ll introduce the Gradient Descent algorithm, and show you how we can use it to quickly and efficiently minimise the cost function without iterating over an arbitrarily large number of parameters.</p>
Sun, 15 Jan 2017 00:00:00 +0000
http://brandon.ai/2017/01/15/ulr-model-and-cost-function.html
Introduction to Machine Learning<p>Informally, machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. Put simply, if a program can improve the performance of a task by using previous experience, you can say that program has <em>learned</em> from the experience. A more formal definition of Machine Learning was produced by Tom Mitchell in 1998:</p>
<blockquote>
<p>A computer program is said to learn from experience E with respect to some task T
and some performance measure P, if its performance on T, as measured by P, improves
with experience E.</p>
</blockquote>
<p>The ability for software to learn from experience is significantly different from how software is written today. Typically, if a developer wanted to improve the performance of a task (to make it run faster, more efficiently, or cater for new conditions), the developer would first have to think about how to achieve the performance gain, and then explicitly program the software to make the improvement.</p>
<h4 id="machine-learning-predictions-classification-and-regression">Machine Learning Predictions: Classification and Regression</h4>
<p>A machine learning algorithm can make two types of predictions. You will need to choose the appropriate prediction type based on the problem you’re trying to solve.</p>
<h6 id="classification-problems">Classification Problems</h6>
<p>These are problems where we want a machine learning algorithm to predict a known, discrete result from an existing set of data. This is like selecting a value from a drop down menu.</p>
<p>Facial recognition, handwriting recognition and recommendations are all examples of classification problems. For facial recognition, we want the algorithm to predict a discrete value (a specific person) from a known list (a database of people). Handwriting recognition is similar - we want to predict discrete values (specific words) from an existing set of data (the English dictionary). Recommendations are also classification problems - when Netflix recommends movies and Amazon recommends products you may like, they are both producing lists of discrete values that represent a subset of their entire catalogue of products.</p>
<h6 id="regression-problems">Regression Problems</h6>
<p>These are problems where we want a machine learning algorithm to produce a continuous, unconstrained value.</p>
<p>Stock market prediction is an example of a regression problem. In this case, we want the machine learning algorithm to produce a continuous value - the price of a stock. Predicting housing prices is another example of a regression problem.</p>
<h4 id="supervised-unsupervised-and-reinforcement-learning">Supervised, Unsupervised and Reinforcement Learning</h4>
<p>In order for a machine learning algorithm to learn from experience, it needs to be taught how to learn in the first place. There are three broad techniques that can be used to train machine learning algorithms:</p>
<h6 id="supervised-learning">Supervised Learning</h6>
<p>Supervised learning is a training method where you already know the correct answers (you know the desired output). You train the algorithm by providing it with data as well as the answers. After training, the algorithm should be able to make predictions on novel data that it hasn’t seen before.</p>
<p>Spam filtering is a classic example of supervised learning. You train the algorithm by providing it with examples of spam, and tell the algorithm it’s spam. You provide the algorithm with examples of legitimate email, and tell the algorithm it’s legitimate email. With enough data, the algorithm should be able to classify email as legitimate or spam for emails it hasn’t seen before.</p>
<h6 id="unsupervised-learning">Unsupervised Learning</h6>
<p>Unsupervised learning is a training method where you don’t already know the correct answer (i.e. you don’t know the desired output). Unsupervised learning typically supports the grouping or clustering of data.</p>
<p>If we used unsupervised learning for image recognition, the algorithm wouldn’t be able to identify a particular <em>type</em> of thing - unsupervised learning would not come back with predictions saying things like “this is a photo of a cat”, “this is definitely a dog”, “this is a person”.</p>
<p>Instead, an unsupervised learning algorithm would recognise that all photos of cats look similar, and group them together. It would create similar clusters for humans and dogs, and would recognise that the “pattern” for a cat is different from the “pattern” for a dog or human. The end result of the algorithm would be three unique clusters, and behaviourally it would continue to cluster photos into one of its known groups, even when it’s exposed to new photos it hasn’t seen before.</p>
<p>After the unsupervised learning algorithm has clustered the data, it would be simple enough to “tag” the clusters. For example, once our machine learning algorithm has created the cat, dog and human clusters, we could tell the algorithm “hey, these are cats, these are dogs and these are people”. Then, when the machine learning algorithm processed any new photos, it would be able to understand exactly which type of thing it’s looking at.</p>
Wed, 11 Jan 2017 00:00:00 +0000
http://brandon.ai/2017/01/11/introduction-to-machine-learning.html