Implementing Gradient Descent

Intro

  This post derives the algorithm for implementing gradient descent. It is messy, ugly, and tedious! It is a fun exercise deriving it yourself, but in the end it's just identifying patterns. Don't let this page make your head explode and make you give up on AI! Just knowing how to use the algorithm in the end is most important (even then it can be argued that you can just use keras), but why not try it yourself and have some fun! I include the derivation here just because I know I would have loved to see it when I was learning AI.

  We now know that we must find the gradient of the error function with respect to each weight to minimize our error (If you have no idea what I just said, learn/review how to train a neural network here). That's all fine, but how do we actually do this in code. You can imagine how difficult it would be to find the gradient if we have a neural network with 5,6 or more layers each with hundreds or thousands of neurons.

  We know what to find (the gradients), now we just need to figure out how to find it. We could just use the definition of the derivative. Go through each weight and increase it by a tiny amount, then divide the resulting change in the error by this tiny amount. This will give us ∂E/∂W for this particular weight. Now if we did this for all of our weights, we would have the gradient.

  But to put things in perspective, a neural network I trained to play atari games had 1,685,667 weights and I had to update them about 10 million times. Using the definition of the derivative would just take way to long for a network like this.

  Since neural networks have very high repetition, it seems that we could find an algorithm that would find the gradients for us in an efficient way. To try to create this algorithm, we will first write out, by hand, the forward and update phase of a relatively small, but deep, neural network. We will then look for pattens in this process to try to simplify it as much as possible.

  This is the neural network we will be using to derrive our gradient descent algorithm:

a NN with [2,3,3,2] layers each with a bias
This time we are implementing the neurons bias. Note that the above implementation is the exact same as just adding a bias onto each neurons weighted sum, as long as we keep our bias 'input' at 1. The bias is blue, and anything associated with the bias will be blue in this page.

Forward phase

  First off, it is very important to remember that we draw a neural network as this computational graph in order to help us visualize what is happening. But, if we are going to find our gradient descent algorithm we will need to write out the computations done. Well, we have a lot of parameters here! The huge amount of parameters we have really makes for ugly indexing!

  Let me write out all the parameters we have in this neural network. Then let you in on my convention for indexing.

All of our parameters

  Xbar above is a list of the 'inputs' for each layer. Notice that the size of each of the lists in Xbar corresponds to the number of neurons in each layer. Xbar[0] is the initial input data with a bias, Xbar[1] is the activated weighted sum of X[0], also with a bias, and so on.

  Wbar is a list of all the weights that connect each layer. Each of these lists should have size; (# of neurons in next layer excluding bias, # of neurons in current layer including bias) (rows,columns).

  Now, the top left index tells what layer the particular element is in. The top right index tells what input neuron in the layer we are talking about, the bottom right index tells what input we are going to.

All of our parameters
X is the input to the layer, w is the layers weights, z is the weighted sum before being passed through the activation function

  Now that we are on the same page as far as notation. Here is the forward phase of neural network. Just going from right to left, writing out the computations:

computations from left to right
a(z) is our activation function applied to z. Take a step back and just look at what is happening at a high level. Don't let your head explode trying to simultaneously be concious of every detail!

  The top row of computations brings us from our first layer of inputs to our second layer of inputs. The second row brings us from our second layer of inputs to the third layer of inputs, and similar for the third row.

Backward phase

  Ok, now let me write out what we are trying to find;

gradients

  We are going to try to find the enough of the third and second gradient to identify a pattern. Using the chain rule, and a lot of staring at the problem, we can show that these gradients are equal to the following;

just using the chain rule to find an equivalent equations
I highlighted any pattern that I recognized, this will be useful in simplifying

  This can be a tough deriving yourself. If you start at the error, then work yourself back to the weight you are trying to find while using the chain rule, you'll find it's simple but very tedious.

  Now, we are trying to find a simpler way to express the above equations. If you remember your linear algebra, and kind of work backwards, you can see that the above equations are equal to:

simplifying
I included the first gradient here. Admittedly, I didn't include it in the previous picture because it didn't fit on the screen. That circle with the X inside of it means that we are multiplying elementwise. The dot between matrices means we are applying the dot product.

  Look how beautiful! You can notice that everything that is red, is the same as what is green in the row above it. This means we can start at the last layer, initialize a term, call it Ψ, to be what is in green. Then as we move back to previous layers, we can update Ψ with what is green in that layer (whatever is in red is the current value of Ψ). Now to find the gradient for that layer we just multiply Ψ by ∂Z/∂w elementwise. This is a lot prettier when you plug in values, below I assume the mean squared error function and the sigmoid activation function.

plugging in values
Ψ can be simplified even more. I will leave it as an exercise for you do it by yourself.... Just kidding, I hate it when textbooks say that. I show you how to simplify it further in the next picture.

  Now, we wrap up the update into a very simple for loop:

init psi, then for each layer update it then use it to find gradient.

Summary

  Ok I know what you're thinking. What in tarnation just happened!? All I did was identify patterns! Nothing more complicated than the chain rule was used, I just used a LOT of it. I would not blame you if you just believed my result, instead of deriving it yourself. In practice, we will be using Keras for implementation anyways, but now that you know how to do it yourself you don't have to feel guilty doing so.

  As always, we don't truly understand this yet!Let's try to implement this algorithm and build a useful class with it.