# Practice Implementing Gradient Descent

## Project overview

We have derived an algorithm on how to implement gradient descent, but there are a lot of nuances when implementing this algorithm. If you try to tackle this algorithm with a huge project right out the gate, you will find that it is very difficult keeping track of the shapes of all your matrices. Because of this, we will start very small with the same exact architecture we used in the derivation.

First, let's just focus on getting a general form of this algorithm working. After we have something that works, we can clean it up and make it useable. To test our algorithm, we are just going to try to map one input to one output.

#### Initializing our Neural Network

Our first step is to create the weights for our neural network. We can use a simple for loop to do this. First, we specify the layers that our neural network will have.

We can specify the layers to this neural network by saying; layers = [2,3,3,2]. It is important to note that we do not include the bias as part of each layer. We will include them later.

Before we create our for loop to generate the weights, take a look at the shape of each layers weights and how they relate to the layer sizes.

We can see that our weights shapes are; (neurons in next layer, neurons in current layer + 1). Using this we can make a for loop that generates the weights for the entire network for us.

weights = []

for i in range(len(layers)-1):

layers_weights = np.random.rand(layers[i+1],layers[i]+1)

weights.append(layers_weights)

#### Forward Phase

Now that we have our weights, we need to perform our forward phase and collect all the X's. For this example, we will be using 2 random inputs *that are between 0 and 1*.
It is important that they are between 0 and 1, so we are not dealing with huge numbers in our computation.

First, we define the sigmoid function, since this is the activation function we will be using. Now, if we simply do the dot product of the weights and our input, we will get an error. This is because we must add our bias term onto our input. So, we first add the bias onto our input, then do the dot product between the weights and input, activate this weighted sum, and finally record the output as the input for our next layer.

for i in range(len(layers)-1):

"""add bias"""

x_s[-1] = np.concatenate((x_s[-1],[1]))

z = np.dot(weights[i],x_s[i])

x_s.append(sigmoid(z))

Alright, let's take a look at the list 'x_s' after we run the above code.

Everything checks out. So, our next step is to update our weights so our final output matches our expected output. For this example, our expected output will again be two made up numbers between 0 and 1.

#### Backwards Phase

If you didn't go through the page where we derived our gradient descent algorithm, or you didn't understand it very well, the practical information is this: finding the gradients for all of our layers can be done in 3 steps.

Our first step is to initialize psi. Each term is psi is just ∂E/∂X_{i} * ∂X_{i}/∂Z_{i}. Using the neat fact that the derivative of the
sigmoid function = σ(x)(1-σ(x)), and the derivative of our error = -2(y_true_{i} - X_{i}), we find this partial derivative easily. So, to initialize psi we just
loop through each y_true value and append the ∂E/∂X_{i} * ∂X_{i}/∂Z_{i} associated with this y_true value.

psi = []

for i in range(len(y_true)):

output = x_s[-1][i]

psi.append(-2*(y_true[i] - output) * output * (1-output))

psi = np.array(psi)

Now here is the first tricky detail. Right now, the shape of psi is (2,). In order for our algorithm to work, we need psi to be the same shape as it is in our derived algorithm, which is (2,1). This is a small, slightly frustrating, detail; but an easy fix.

Easy peasy. Now that we have psi initialized, we can move on to step 2, which is to find the gradients for the last layer. This is just psi * the second to last input. Or;

gradients.append(psi*x_s[-2])

Numpy handles the elementwise multiplication for us. Now we are ready to move on to step 3, which is a little tricky. We want to start at the second to last layer, since we just found the gradients to the last layer. Moving backwards, we then update psi and find the layers gradient until we reach the last layer.

w = weights[i][:,:-1]

x = x_s[i][:-1]

term = w * x * (1-x)

term = np.transpose(term)

psi = np.dot(term, psi)

psi = np.reshape(psi,(psi.shape[0],1))

gradients.append(psi*x_s[i-1])

We first define the weights and inputs we are dealing with in the term we use to update psi. This is just the layers weights and inputs excluding the bias, since none of the weights we are trying to find affect the bias at all (they are always = 1 no matter what). We then get this term into the correct shape by transposing it. Then, we update psi and reshape it just as we did before. Finally, we append our gradient.

After we run the for loop above, we should end up with:

Note that the first gradient is the same size as the last layers weights, second gradient is the same size as the second to last layers gradient and so on. This is a good sanity check so that we know everything is working properly. We now update our weights with the following fancy indexing:

weights[i] -= .001*gradients[-(i+1)]

And that it! We have just updated our weights one time. Now, we can run the above code in a for loop to update our weights say 1000 times and see if it works.

return 1/(1+np.exp(-x))

y_true = [0.3,.67]

layers = [2,3,3,2]

weights = []

for i in range(len(layers)-1):

layers_weights = np.random.rand(layers[i+1],layers[i]+1)

weights.append(layers_weights)

for n in range(1000):

x_s = [[.2,.3]]

for i in range(len(layers)-1):

"""add bias"""

x_s[-1] = np.concatenate((x_s[-1],[1]))

z = np.dot(weights[i],x_s[i])

x_s.append(sigmoid(z))

psi = []

for i in range(len(y_true)):

output = x_s[-1][i]

psi.append(-2*(y_true[i] - output) * output * (1-output))

psi = np.array(psi)

psi = np.reshape(psi,(psi.shape[0],1))

gradients = []

gradients.append(psi*x_s[-2])

for i in range(len(layers) - 2, 0,-1):

w = weights[i][:,:-1]

x = x_s[i][:-1]

term = w * x * (1-x)

term = np.transpose(term)

psi = np.dot(term, psi)

psi = np.reshape(psi,(psi.shape[0],1))

gradients.append(psi*x_s[i-1])

for i in range(len(gradients)):

weights[i] -= .01*gradients[-(i+1)]

Now the moment of truth. Let's compare out final output to our y_true values.

As you can see our output matches our y_true values exactly. Now, before we clean this code up and throw it into a class let's try to scale it. Let's see if this still works will 1000 neurons in each hidden layer. Go ahead and try it yourself.

Now, let me show you the output I get after 10,000 training iterations when I change the layers to [2,1000,1000,2];

Well, as you can see... It doesn't work at all! Why is this?!

To diagnose this, let's look at our gradients.

Well, that there is the problem... Try to think of why our gradients would all be 0. I'll give you a hint, it has to do with our activation function and it derivative.

Let's just look at this last gradient. ∂E/∂W = ∂E/∂X[-1] * ∂X[-1]/∂Z * ∂Z/∂W. The PD (partial derivative) of the error w.r.t (with respect to) the last output, times the PD of the last input w.r.t the last weighted sum, times the PD of the last weighted sum w.r.t the weights. If our gradient is 0, at least one of these must be 0. Remember that ∂X[-1]/∂Z is just the derivative of our activation function.

What would be the derivative of this function if x was really large or really small? It would be 0! So, let's change our sigmoid function so is it more linear. We can do this by multiplying x by a small constant c (.01 works best here). This is kind of a 'slap a band aid on it' solution, a better solution would be to switch to the leaky Relu activation function. You can if you want to, but I am just going to multiply x by .01 in our sigmoid function.

Now, all we have to do is add a 0.01 into our sigmoid function. Technically, our new derivative would be 0.01 * σ(x)(1-σ(x)), but we don't have to worry about
this since it just gets absorbed into the learning rate. Besides, the whole point of doing this was to scale *up* our derivative. The new output and only change to the code is shown below:

As you can see, we have solved our problem! This is also why we must make our inputs between 0 and 1. If our weighted sum is very large then our derivative will be close to 0. To finish up we can clean this up a little bit and throw it into a class.

def __init__(self,layers):

self.layers = layers

self.weights = []

for i in range(len(layers)-1):

layers_weights = np.random.rand(layers[i+1],layers[i]+1)

self.weights.append(layers_weights)

def sigmoid(self,x):

return 1/(1+np.exp(-.01*x))

def predict(self,data):

x_s = [data]

for i in range(len(self.layers)-1):

"""add bias"""

x_s[-1] = np.concatenate((x_s[-1],[1]))

z = np.dot(self.weights[i],x_s[i])

x_s.append(self.sigmoid(z))

return x_s[-1]

def train(self,data,y_true):

x_s = [data]

for i in range(len(self.layers)-1):

"""add bias"""

x_s[-1] = np.concatenate((x_s[-1],[1]))

z = np.dot(self.weights[i],x_s[i])

x_s.append(self.sigmoid(z))

psi = []

for i in range(len(y_true)):

output = x_s[-1][i]

psi.append(-2*(y_true[i] - output) * (output * (1-output)))

psi = np.array(psi)

psi = np.reshape(psi,(psi.shape[0],1))

gradients = []

gradients.append(psi*x_s[-2])

for i in range(len(self.layers) - 2, 0,-1):

w = self.weights[i][:,:-1]

x = x_s[i][:-1]

term = w * x * (1-x)

term = np.transpose(term)

psi = np.dot(term, psi)

psi = np.reshape(psi,(psi.shape[0],1))

gradients.append(psi*x_s[i-1])

for i in range(len(gradients)):

self.weights[i] -= .01*gradients[-(i+1)]

return sum((y_true-x_s[-1])**2)

You can find this class of github here.

Now, we can finally do something fun with this! With this algorithm we will get a program to identify hand written digits. This project is a classic, but you never see anyone doing it from scratch (maybe for a good reason, but we are learning)!