Gradient Descent part 1
Intro to Gradient Descent
What’s going on here
This is my first Data Science post on my blog. In this post I’ll be exploring my understanding of the Gradient Descent algorithm. My next post will explore how to implement this algorithm in R. Then I’ll have a play around with the function so we can all see the results of some pretty cool maths.
The Gradient Descent algorithm is the first algorithm presented by Andrew Ng in his Machine Learning course. It’s my understanding that Gradient Descent is the basis of how machines learn.
Linear regression
My understanding of linear regression is how it was presented by the lecturers I had at university. Namely, we look at our data and say that the following model might fit.
\[y_{i} = \beta_{0} + \beta_{1}x_{1i} + \beta_{2}x_{2i} + ... + \beta_{k}x_{ki} + \epsilon_{i}\] I know everyone likes to present it in matrix form or sum notation but I always conceive it in the above form. Anyway, basically we’re saying “our ys are a linear combination of some xs and there’s a random noise thing there too”.
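To have something concrete to play with later, here’s a tiny R sketch that simulates data from that kind of model with two xs (the coefficient values and sample size are completely made up for illustration):

```r
# Simulate n observations from y = b0 + b1*x1 + b2*x2 + noise
set.seed(42)
n   <- 100
x1  <- rnorm(n)
x2  <- rnorm(n)
eps <- rnorm(n)                     # the random noise term, epsilon
y   <- 2 + 3 * x1 - 1.5 * x2 + eps  # "true" betas picked arbitrarily
```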
Econometricians look at this and go “if we assume all the \(\epsilon_{i}\) are normally distributed with mean zero and the same variance then we can estimate it using Ordinary Least Squares”. I’ll leave it up to the reader to dive into this if they wish: https://en.wikipedia.org/wiki/Ordinary_least_squares.
But the data scientist? Interestingly they approach this problem and say “we want to estimate this cool equation (the above one minus the \(\epsilon_{i}\)) but we don’t like assumptions. We want to build a MACHINE that does it for us.”
Then they go about designing this machine. They conceive of some hypothesized line that will fit all the points and design a function that tells the machine off when its line is too far away from the actual points (a cost function). Then they roll a ball down a hill.
Wait, what.
That’s how Gradient Descent is apparently conceived, as rolling a ball down a hill: You roll it from where you stand, it rolls down to the bottom, then it rolls up a bit because physics and stuff, then it rolls down again, and up a little bit, and down again, and up maybe a tiny bit more, then finally it comes to rest.
What a brilliant concept. As in the physical world, if you want to find the bottom of a hill, you roll a ball down it. In the mathematical world, if you want to find a bottom (a minimum) of a function you roll a ball down it!
For a linear model the hypothesized line is
\[\hat{y}_{i} = \hat{\beta}_{0} + \hat{\beta}_{1}x_{1i} + \hat{\beta}_{2}x_{2i} + ... + \hat{\beta}_{k}x_{ki}\]
And so the data scientist sets up a cost function like \[cost = \frac{1}{2n}\sum_{i = 1}^n (\hat{y_i} - y_{i})^2\]
The 2 in \(\frac{1}{2n}\) is in there for convenience because of course when we go to minimize this function we take the first derivative, and the 2 that comes down from the square cancels it out. And so the ball is rolled down the hill of this cost function to minimize it.
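A rough sketch of how that cost function might look in R (assuming, just for illustration, that the predictors live in a matrix `X` with a leading column of 1s for the intercept and the coefficients live in a vector `beta`; the proper implementation is for the next post):

```r
# Half mean-squared-error cost: (1/2n) * sum of squared residuals
cost <- function(X, y, beta) {
  y_hat <- X %*% beta                  # the hypothesized line
  sum((y_hat - y)^2) / (2 * length(y))
}
```

On the simulated data above, something like `cost(cbind(1, x1, x2), y, c(0, 0, 0))` would give the cost of an all-zero guess.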
And of course to minimize a function we take the first derivative and set it to 0 (well, I lie, this just finds a stationary point on the function; whether that’s a minimum or a maximum we don’t know until we take a second derivative).
And we know the first derivative with respect to a \(\beta_{j}\) can be written as: \[\frac{1}{n} \sum_{i = 1}^n (\hat{y_i} - y_{i})x_{ji}\] For the “intercept” as it’s called in econometrics, the \(\beta_{0}\), this simplifies to \[\frac{1}{n} \sum_{i = 1}^n (\hat{y_i} - y_{i})\]
because \(x_{0i} = 1\).
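As a sketch, that gradient can be computed in R for all the \(\beta_{j}\) at once, intercept included (again assuming a matrix `X` whose first column is all 1s):

```r
# Gradient of the cost: (1/n) * sum over i of (y_hat_i - y_i) * x_ji, for each j
gradient <- function(X, y, beta) {
  y_hat <- X %*% beta
  drop(t(X) %*% (y_hat - y)) / length(y)
}
```

Because the first column of `X` is all 1s, the first element of that gradient is just the average residual, which is exactly the simplification for \(\beta_{0}\) above.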
And so an algorithm that will roll a ball down a hill until it finds the bottom will be:
Repeat until convergence: \[\beta_{0} := \beta_{0} - \alpha \frac{1}{n} \sum_{i = 1}^n (\hat{y_i} - y_{i})\] \[\beta_{j} := \beta_{j} - \alpha \frac{1}{n} \sum_{i = 1}^n (\hat{y_i} - y_{i})x_{ji}\] Where \(\alpha\) is the learning rate, which I think of as how fast you push the ball when you roll it down the hill. If you push it really hard it will roll to the bottom, then roll up the slope on the other side really quite high; potentially this could cause it to roll into the next valley, which might be a local minimum instead of the global minimum.
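Putting those pieces together, a bare-bones sketch of the whole loop in R might look like this (the `alpha`, `max_iter` and `tol` defaults are placeholders rather than recommendations, and the proper version is the subject of the next post):

```r
# Roll the ball down the hill: repeat the update until the steps get tiny
gradient_descent <- function(X, y, alpha = 0.01, max_iter = 10000, tol = 1e-8) {
  beta <- rep(0, ncol(X))                                # start with all betas at zero
  for (iter in seq_len(max_iter)) {
    grad <- drop(t(X) %*% (X %*% beta - y)) / length(y)  # (1/n) * sum (y_hat_i - y_i) * x_ji
    step <- alpha * grad                                 # how far the ball rolls this time
    beta <- beta - step
    if (max(abs(step)) < tol) break                      # stop once the steps are tiny
  }
  beta
}
```

On the simulated data from earlier, `gradient_descent(cbind(1, x1, x2), y)` should land somewhere very close to `coef(lm(y ~ x1 + x2))`, which is roughly the comparison the next post makes properly.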
I love that Machine Learning, especially Gradient Descent, feels more experimental than econometrics. To solve this exact problem in econometrics we would have assumed the \(\epsilon\)s are Gaussian and derived a closed form expression for \(\beta_{j}\).
Anyway that’s the maths and a bit of intuition. In my next post I’m going to write a function in R that does gradient descent. We’ll compare the results of a linear model fit using this to one fit using the lm function in R. Spoiler alert - they’re gonna be similar.
In the post after that I want to stretch my R skills a bit by re-writing the gradient descent function so that we can see what value of \(\alpha\) is “best” for a particular cost function. And also test how many iterations the algorithm needs to converge when we scale our variables versus when we don’t.