This is a further simplified version of this fantastic tutorial: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
Let's consider a very simple neural network shown in the following image:
All we want is that, given the input .1, the network outputs the value .9.
We will be using \frac{1}{1+e^{-x}} (the sigmoid) as our activation function. Both weights w1 and w2 are initialized to .5, though any arbitrary values would do; different initializations will simply converge to different final weights.
First, let's consider the net input to the hidden layer (neth). It's simply the input multiplied by w1. (Here, outh is the output of the hidden layer, neto is the net input to the output layer, outo is the output of the output layer, and E is the error.)
neth = i1 * w1 = .1 * .5 = .05
outh = \frac{1}{1+e^{-neth}} = \frac{1}{1+e^{-.05}} = .51249
neto = outh * w2 = .51249 * .5 = .25624
outo = \frac{1}{1+e^{-neto}} = \frac{1}{1+e^{-.25624}} = .563711
E = \frac{1}{2}*(target - outo)^2 = \frac{1}{2}*(.9 - .563711)^2 = .056545
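For readers who prefer code, the whole forward pass is just a few lines. The following is a minimal sketch using Python's math module, with variable names mirroring the notation above:

import math

i1, target = .1, .9                   # input and desired output
w1, w2 = .5, .5                       # initial weights

neth = i1 * w1                        # net input to the hidden layer  -> .05
outh = 1 / (1 + math.exp(-neth))      # sigmoid activation             -> ~.51249
neto = outh * w2                      # net input to the output layer  -> ~.25624
outo = 1 / (1 + math.exp(-neto))      # sigmoid activation             -> ~.563711
E = .5 * (target - outo) ** 2         # squared error                  -> ~.056545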
This concludes the forward pass of the network. We've chosen this particular error function for convenience, as it's convex and differentiable. Our target value was .9, but we got .5637. We now want to update the weights w1 and w2 so that the network produces the desired output, and that is exactly what the famous backpropagation algorithm is for. The basic idea is to calculate how much responsibility each weight bears for the error, adjust each weight accordingly, and then begin the next iteration.
First, let's see what happens in the case of w2. The partial derivative \frac{\partial E}{\partial w2} gives the rate of change of the error with respect to w2. Since it can't be calculated directly, we use the chain rule to break it into its constituent parts.
\frac{\partial E}{\partial w2} = \frac{\partial E}{\partial outo} * \frac{\partial outo}{\partial neto} * \frac{\partial neto}{\partial w2}
Let's find out each of those parts separately:
\frac{\partial E}{\partial outo} = \frac{\partial}{\partial outo}(\frac{1}{2}*(target - outo)^2) = \frac{1}{2} * 2 * ( target - outo) * (-1)
= (.9 - .563711) * (-1) = -.336289
\frac{\partial outo}{\partial neto} = \frac{\partial }{\partial neto} (\frac{1}{1+e^{-neto}}) = \frac{-1}{(1+e^{-neto})^2} * e^{-neto} * (-1) = \frac{e^{-neto}}{(1+e^{-neto})^2}
= outo * (1-outo) = .563711 * ( 1- .563711) = .24594
\frac{\partial neto}{\partial w2} = \frac{\partial (w2*outh)}{\partial w2} = outh = .51249
So finally we have found the responsibility of w2 for producing the error E:
\frac{\partial E}{\partial w2} = -.336289 * .24594 * .51249 = -.042386
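Continuing the little sketch from the forward pass, the same three factors can be computed and multiplied directly (the derivative variable names below are just illustrative):

dE_douto = -(target - outo)                    # ~ -.336289
douto_dneto = outo * (1 - outo)                # ~  .24594
dneto_dw2 = outh                               # ~  .51249
dE_dw2 = dE_douto * douto_dneto * dneto_dw2    # ~ -.042386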
Now let's turn our attention to w1; that is, we want to calculate \frac{\partial E}{\partial w1}. Again, the chain rule comes to our rescue.
\frac{\partial E}{\partial w1} = \frac{\partial E}{\partial outo} * \frac{\partial outo}{\partial neto} *\frac{\partial neto}{\partial outh} * \frac{\partial outh}{\partial neth} * \frac{\partial neth}{\partial w1}
We already have calculated \frac{\partial E}{\partial outo} and \frac{\partial outo}{\partial neto}. Let's calculate the rest.
\frac{\partial neto}{\partial outh} = \frac{\partial (w2*outh)}{\partial outh} = w2 = .5
\frac{\partial outh}{\partial neth} = outh * ( 1- outh) = .51249 * ( 1- .51249) = .24984
\frac{\partial neth}{\partial w1} = \frac{\partial (w1 * input)}{\partial w1} = input = .1
So,
\frac{\partial E}{\partial w1} = -.336289 * .24594 * .5 * .24984 * .1 = -.001033
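In code, the five factors multiply out the same way. As a sanity check, we can also approximate both gradients numerically by nudging each weight a tiny amount and watching how E changes; the error helper below is just for illustration, and its output should agree with the chain-rule values:

dneto_douth = w2                               # ~ .5
douth_dneth = outh * (1 - outh)                # ~ .24984
dneth_dw1 = i1                                 # ~ .1
dE_dw1 = dE_douto * douto_dneto * dneto_douth * douth_dneth * dneth_dw1   # ~ -.001033

def error(w1, w2):
    # recompute the forward pass for the given weights and return the error
    outh = 1 / (1 + math.exp(-(i1 * w1)))
    outo = 1 / (1 + math.exp(-(outh * w2)))
    return .5 * (target - outo) ** 2

eps = 1e-6
print((error(w1 + eps, w2) - error(w1 - eps, w2)) / (2 * eps))   # ~ -.001033
print((error(w1, w2 + eps) - error(w1, w2 - eps)) / (2 * eps))   # ~ -.042386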
Now it's time to update our initial guesses of w1 and w2. Let's say our learning rate is \eta = .9
w1 = w1 - (\eta * \frac{\partial E}{\partial w1}) = .5 - (.9 * -.001033) = .500930 (since the gradient is negative, subtracting it increases w1)
Similarly,
w2 = w2 - (\eta * \frac{\partial E}{\partial w2}) = .5 - (.9 * -.042386) = .538148
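In code the update is one line per weight; rerunning the forward pass with the new weights (reusing the error helper from the sketch above) shows the error has already dropped a little, from about .056545 to roughly .0549:

eta = .9
w1 = w1 - eta * dE_dw1      # ~ .500930
w2 = w2 - eta * dE_dw2      # ~ .538148
print(error(w1, w2))        # ~ .0549, slightly closer to the target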
We can continue this process until the network's output is very close to the target. The following is a Python implementation of that process:
import math

ctr = 0
eta = .9
w1 = .5
w2 = .5
i1 = .1
target = .9

while True:
    ctr += 1

    # forward pass
    neth = i1 * w1
    outh = 1 / (1 + math.exp(-neth))
    neto = w2 * outh
    outo = 1 / (1 + math.exp(-neto))
    e = .5 * math.pow((target - outo), 2)
    print('error now is:', e)

    if e < .00005:
        print('*' * 100)
        print('target is:', target)
        print('output is now:', outo)
        print('w1 and w2 are:', w1, ',', w2)
        print('number of iterations:', ctr)
        break

    # gradient for w2
    de_by_douto = (-1) * (target - outo)
    douto_by_dneto = outo * (1 - outo)
    dneto_by_dw2 = outh
    de_by_dw2 = de_by_douto * douto_by_dneto * dneto_by_dw2

    # gradient for w1
    # de_by_dw1 = de_by_douto * douto_by_dneto * dneto_by_douth * douth_by_dneth * dneth_by_dw1
    dneto_by_douth = w2
    de_by_douth = de_by_douto * douto_by_dneto * dneto_by_douth
    douth_by_dneth = outh * (1 - outh)
    dneth_by_dw1 = i1
    de_by_dw1 = de_by_douth * douth_by_dneth * dneth_by_dw1

    # update w1 and w2
    w1 = w1 - (eta * de_by_dw1)
    w2 = w2 - (eta * de_by_dw2)
Sample output: