A Direct Way of Understanding Backpropagation and Gradient Descent

Summary: I believe that there are better representations of neural networks that aid in faster understanding of backpropagation and gradient descent. I find representing neural networks as equation graphs combined with the value at run-time helps engineers who don't have the necessary background in machine learning gets them up to speed faster. In this post, I generate a few graphs with Gorgonia to illustrate the examples.

Backpropagation has been explained to death. It really is as simple as applying the chain rule to compute gradients. However, in my recent adventures, I have found that this explanation isn’t intuitive to people who want to just get shit done. As part of my consultancy (hire me!) job* really, I need to pay the bills to get my startup funded , I provide a brief 1-3 day machine learning course to engineers who will maintain the algorithms that I designed. Whilst most of the work I do don’t use neural networks* people who think deep learning can solve every problem are either people with deep pockets aiming to solve a very general problem, or people who don't understand the hype. I have found that most businesses do not have problems that involves a lot of non-linearities. In fact a large majority of problems can be solved with linear regressions. , recently there was a case where deep neural networks were involved.

This blog post documents what I found was useful to explain neural networks, backpropagation and gradient descent. It’s not meant to be super heavy with theory - think of it as an enabler for an engineer to hit the ground running when dealing with deep networks. I may elide over some details, so some basic understanding/familiarity of neural networks is recommended.

Equations as Graphs

The understanding that underpins this post’s simple explanation of backpropagation is that you can represent mathematical functions as graphs. Graphs are great because they are visual. You can draw them on a whiteboard, and most software engineers are familiar with the concept of graphs. This is for example, the mathematical equation of $latex x + y$ represented as a graph:

%3 + + one x +->one one_2 y +->one_2

The $latex x$ and $latex y$ are variables and values can be substituted in later at run time.

Following? Good. The premise is: all mathematical equations can be represented as a graph. The proposition: neural networks are just mathematical equations. This specifically, is the equation of a perceptron:

$$ f(x \times w+b)$$

While it’s my personal belief that you can’t really understand neural networks and machine learning stuff without knowing the math* and I say this knowing my own base in math ain't that strong to begin with , in my experience though, the moment you start bringing out equations and bring out words like “space” and “hyperplane”, people tend to just nod, appear to understand and then come back to you 6 months later, long after you’ve forgotten about your implementation, to ask you very specific questions. So now I tend to show pictures.

Even with pictures, I discovered that some pictures are better than others. In particular, I found that this common depiction of neural networks (image found on Wikipedia) confuses people more often than not.

Output Hidden Input

Don’t get me wrong, it’s a good model of what a neural network is, but when it comes to the implementation of neural networks, it takes time for some people to realize that the edges in the picture above are represented by matrices. Add backpropagation to the representation above and you’d be guaranteed a lot of confusion.

A better way in my opinion, is to actually represent the neural network as what they are - functions* FWIW, I also personally don't like the term "neural network" as the connection to a real neural network, to quote Yann LeCun, "is tenuous at best" .

A Simple Neural Network

So, recall this is the equation for a simple perceptron:

$$ f(x \times w+b)$$

Here’s how the above equation is built (in case the animation doesn’t play, the graph is built from bottom up). Here, $latex f $ is the sigmoid function, $latex \sigma $:

expressionGraph Node_0xc4200f1c80 c4200f1c80 Aᵀ × b(%2fabbe82, %97b175c) :: Vector Float64 Op Aᵀ × b :: Matrix a → Vector a → Vector a (10) -1 Value (nil) Node_0xc4200f1a40w c4200f1a40 w :: Matrix Float64 (5, 10) -1 Value Matrix (5, 10) [10 1] Matrix (5, 10) [10 1] ⎡  0.463  -0.0168  ⋯   0.143   -0.592⎤ ⎢ -0.315   0.0608  ⋯  -0.303     0.63⎥  ⋮   ⎢ 0.0566   -0.285  ⋯   0.479   -0.339⎥ ⎣  0.143   -0.328  ⋯  -0.186   -0.158⎦ Node_0xc4200f1c80:anchor:s->Node_0xc4200f1a40:anchor:n 1 Node_0xc4200f1b00 c4200f1b00 x :: Vector Float64 (5) -1 Value (nil) Node_0xc4200f1c80:anchor:s->Node_0xc4200f1b00:anchor:n 2 Node_0xc4200f1d40 c4200f1d40 +(%5e0df3e6, %e3776aae) :: Vector Float64 Op + :: Vector a → a → Vector a (10) 0 Value (nil) Node_0xc4200f1d40:anchor:s->Node_0xc4200f1c80:anchor:n 1 Node_0xc4200f1bc0 c4200f1bc0 b :: Float64 () -1 Value 0 Node_0xc4200f1d40:anchor:s->Node_0xc4200f1bc0:anchor:n 2 Node_0xc4200f1e00 c4200f1e00 sigmoid(%366b25f4) :: Vector Float64 Op sigmoid :: a → a (10) 0 Value (nil) Node_0xc4200f1e00:anchor:s->Node_0xc4200f1d40:anchor:n 1

While this graph looks busy, it does contains quite a bit of information, and it’d be useful to get used to the format of the graph. A graph node looks like this:

idname of node :: type of the node
opname of operation :: type of the operation
resulting shape
debug info, feel free to ignore
Valuethe value

What are shapes? Shapes inform us about the value. A shape of (5,10) is a 5 by 10 matrix - 5 rows, 10 columns, while a shape of (5) is a vector (think: an array) of 5 elements. A special case is when the shape is (). This is a scalar value - that means it’s a singular number. b is 0 when the program starts.

One Run

So, let’s see what happens when we run this one single layer neural network, shall we? But before that, I’d like to point out that the value for w (or any yellow box) is randomly generated at the start of running the program, so in different pictures the values will be different. In the later part of this article, I’ll come back to this.

So, we’ll set $latex x$ to $latex \begin{bmatrix}0.1 & 0.2 & 0.3 & 0.4 & 0.5 \end{bmatrix} $, and we’ll run the neural network once, forwards. Our goal is to get the value of the sigmoid (the red outlined node), which we expect to be a column vector of size 10 - which we write (10) as its shape.

expressionGraph Node_0xc420106b40 c420106b40 Aᵀ × b(%2fabbe82, %97b175c) :: Vector Float64 Op Aᵀ × b :: Matrix a → Vector a → Vector a (10) -1 Value Grad Vector (10) [1] [  0.2   -0.2   -0.3    0.2   -0.4    0.6    0.4   -0.2    0.2  -0.09] Vector (10) [1] [0  0  ⋯ 0  0] Node_0xc420106840 c420106840 w :: Matrix Float64 (5, 10) -1 Value Grad ⎡  0.5   -0.3    0.5   -0.3    0.4   -0.2    0.3   -0.5    0.4    0.5⎤ ⎢ 0.09   -0.1   0.06   -0.3  0.005    0.4    0.4  -0.05   -0.3   -0.4⎥ ⎢  0.3    0.2   -0.6    0.1   -0.4    0.4   -0.4   0.01    0.4   -0.3⎥ ⎢  0.2   0.02    0.2  -0.09   -0.4    0.4    0.6   -0.4   -0.3    0.2⎥ ⎣ -0.2   -0.5   -0.5    0.5   -0.3    0.6    0.4   0.04    0.4   -0.1⎦ Matrix (5, 10) [10 1] ⎡0  0  ⋯ 0  0⎤ ⎢0  0  ⋯ 0  0⎥  ⋮   ⎢0  0  ⋯ 0  0⎥ ⎣0  0  ⋯ 0  0⎦ Node_0xc420106b40:anchor:s->Node_0xc420106840:anchor:n 1 Node_0xc4201069c0 c4201069c0 x :: Vector Float64 (5) -1 Value Grad Vector (5) [1] [0.1  0.2  0.3  0.4  0.5] Vector (5) [1] [0  0  ⋯ 0  0] Node_0xc420106b40:anchor:s->Node_0xc4201069c0:anchor:n 2 Node_0xc420106c00 c420106c00 +(%5e0df3e6, %e3776aae) :: Vector Float64 Op + :: Vector a → a → Vector a (10) 0 Value Grad [  0.2   -0.2   -0.3    0.2   -0.4    0.6    0.4   -0.2    0.2  -0.09] Vector (10) [1] [0  0  ⋯ 0  0] Node_0xc420106c00:anchor:s->Node_0xc420106b40:anchor:n 1 Node_0xc420106a80 c420106a80 b :: Float64 () -1 Value Grad 0 0 Node_0xc420106c00:anchor:s->Node_0xc420106a80:anchor:n 2 Node_0xc420106cc0 c420106cc0 sigmoid(%366b25f4) :: Vector Float64 Op sigmoid :: a → a (10) 0 Value Grad Vector (10) [1] [0.5  0.4  0.4  0.5  0.4  0.6  0.6  0.4  0.5  0.5] Vector (10) [1] [1  1  ⋯ 1  1] Node_0xc420106cc0:anchor:s->Node_0xc420106c00:anchor:n 1

After the first run, the result is $latex \begin{bmatrix}0.5 & 0.4 & 0.4 & 0.5 & 0.4 & 0.6 & 0.6 & 0.4 & 0.5 & 0.5 \end{bmatrix}$. You might also note that there is now a new field in the node, Grad. We’ll come back to that later. It should also be noted that the values are rounded up.

I discovered that when you display the equation in a graph with the values attached, it becomes very easy to explain. Just start from the bottom, and follow the lines.

And there you have it. You have a perceptron, the simplest neural network. To build a deeper network, simply take the result, multiply it by some other weights, and apply another activation function. But before we do that, you need the neural network to learn.

Learning

The most typical form of learning with neural networks is through backpropagation and gradient descent. Backprop has been explained to death, and amongst the best explanations are by Michael Nielsen and Andrej Kaparthy. I clearly am not as skilled at explaining backpropagation as either of them, so I shall leave the meat of the explanation to them.

However, the brief idea is we calculate derivative of the cost with regards to the weights, usually written $latex \frac{\partial cost}{\partial w}$ and $latex \frac{\partial cost}{\partial b}$. Once we have calculated the partial derivatives, we add it to the weights, changing the weights a tiny amount towards the wanted value.

Cost function

All this sounds rather abstract. Again, what I found to work very well is to use the expression graph as a visualization. But first, we’d have to define a cost. I found that while a binary cross entropy* also known as negative log likelihood is most often used in the “hello world” machine learning examples, sum of squared errors are easier to explain - a gateway drug, if you will to the notion of and a world of different cost functions.

The sum of squared errors is a lot easier to explain than the negative log-likelihood, as the prerequisite understanding of probability* I tried it once, recalling as best as I could what I learned in uni stats, but somewhere along the line, I got stuck explaining monotonic transform properties of the log function and that ended quite embarassingly isn’t really required. Everyone knows how to subtract and square. Most importantly, everyone has a mathematical “sixth sense”, if you will, about what the result of subtraction* the result generally won't be larger than the operands, except if the second operand is negative and squaring* the result is always positive (because we won't deal with imaginary numbers) . I think a many people don’t realize the power of these intuitions that will go a long way in learning new things. Most people are less familiar with $latex log_{10}$, and even less familiar with $latex log_e$.

This is also rather easy to grok:

expressionGraph Node_0xc4200dce40 c4200dce40 -(%d55882b7, %cbb775ca) :: Vector Float64 Op - :: Vector a → Vector a → Vector a (10) 0 Value Vector (10) [1] [-0.4   0.4  -0.3  -0.4  -0.3  -0.4  -0.4  -0.4   0.5  -0.4] Node_0xc4200dcd80 c4200dcd80 y :: Vector Float64 (10) -1 Value Vector (10) [1] [0.1  0.9  0.1  0.1  0.1  0.1  0.1  0.1  0.9  0.1] Node_0xc4200dce40:anchor:s->Node_0xc4200dcd80:anchor:n 1 Node_0xc4200dccc0 c4200dccc0 sigmoid(%366b25f4) :: Vector Float64 Op sigmoid :: a → a (10) 0 Value Vector (10) [1] [0.5  0.5  0.4  0.5  0.4  0.5  0.5  0.5  0.4  0.5] Node_0xc4200dce40:anchor:s->Node_0xc4200dccc0:anchor:n 2 Node_0xc4200dcf00 c4200dcf00 sq(%a6ac9570) :: Vector Float64 Op sq :: a → a (10) 0 Value Vector (10) [1] [ 0.2   0.1   0.1   0.1  0.09   0.2   0.2   0.2   0.3   0.1] Node_0xc4200dcf00:anchor:s->Node_0xc4200dce40:anchor:n 1 Node_0xc4200dcfc0 c4200dcfc0 Sum of Squared Errors (Cost) :: Float64 Op Σ[0] :: Vector a → a () 0 Value 1.5631904744425802 Node_0xc4200dcfc0:anchor:s->Node_0xc4200dcf00:anchor:n 1

Backpropagation

Explaining the why of backpropagation of errors works is beyond the scope of this post. In a longer tutorial session (say a 3-day session), I’d typically devote a day to explaining gradients, gradient descent and backprop. This typically involves walking through a few step-by-step examples and counter examples.

Otherwise, I’ll just assume that there is a magical system that calculates the partial derivative of the cost with regards to the inputs. Well there actually is - theano, Tensorflow, autograd and other software packages does this for you. I also wrote my own in Go - Gorgonia, which I used to generate the pictures in this post.

I found that when you place the partial derivative side by side with the value, it suddenly makes gradient descent a lot easier to explain. The above equation, after backprop now looks like this:

expressionGraph Node_0xc4200dce40 c4200dce40 -(%d55882b7, %cbb775ca) :: Vector Float64 Op - :: Vector a → Vector a → Vector a (10) 0 Value Grad Vector (10) [1] [-0.4   0.4  -0.3  -0.4  -0.3  -0.4  -0.4  -0.4   0.5  -0.4] Vector (10) [1] [-0.9   0.8  -0.7  -0.8  -0.6  -0.8  -0.8  -0.8     1  -0.8] Node_0xc4200dcd80 c4200dcd80 y :: Vector Float64 (10) -1 Value Grad Vector (10) [1] [0.1  0.9  0.1  0.1  0.1  0.1  0.1  0.1  0.9  0.1] Vector (10) [1] [-0.9   0.8  -0.7  -0.8  -0.6  -0.8  -0.8  -0.8     1  -0.8] Node_0xc4200dce40:anchor:s->Node_0xc4200dcd80:anchor:n 1 Node_0xc4200dccc0 c4200dccc0 sigmoid(%366b25f4) :: Vector Float64 Op sigmoid :: a → a (10) 0 Value Grad Vector (10) [1] [0.5  0.5  0.4  0.5  0.4  0.5  0.5  0.5  0.4  0.5] Vector (10) [1] [ 0.9  -0.8   0.7   0.8   0.6   0.8   0.8   0.8    -1   0.8] Node_0xc4200dce40:anchor:s->Node_0xc4200dccc0:anchor:n 2 Node_0xc4200dcb40 c4200dcb40 Aᵀ × b(%2fabbe82, %97b175c) :: Vector Float64 Op Aᵀ × b :: Matrix a → Vector a → Vector a (10) -1 Value Grad Vector (10) [1] [  0.1   0.07   -0.2  -0.07   -0.4  -0.05   0.03  -0.03   -0.4  -0.07] Vector (10) [1] [ 0.2  -0.2   0.2   0.2   0.1   0.2   0.2   0.2  -0.2   0.2] Node_0xc4200dc840 c4200dc840 w :: Matrix Float64 (5, 10) -1 Value Grad Matrix (5, 10) [10 1] ⎡  0.6    0.1   -0.2   -0.2   -0.5   -0.6    0.6   0.03    0.3   -0.3⎤ ⎢ -0.5    0.2   -0.4    0.3  -0.05   -0.3    0.1   -0.5   0.04    0.6⎥ ⎢  0.2   0.01    0.5   -0.4    0.5   -0.1  -0.03    0.4   -0.5   -0.3⎥ ⎢ -0.5   -0.6   -0.3    0.1   -0.4  -0.02   0.02   -0.6   -0.2    0.1⎥ ⎣  0.6    0.5   -0.3  -0.05   -0.6    0.2   -0.1    0.3   -0.5   -0.2⎦ Matrix (5, 10) [10 1] ⎡ 0.02  -0.02   0.02   0.02   0.01   0.02   0.02   0.02  -0.02   0.02⎤ ⎢ 0.04  -0.04   0.03   0.04   0.03   0.04   0.04   0.04  -0.05   0.04⎥ ⎢ 0.06  -0.06   0.05   0.06   0.04   0.06   0.06   0.06  -0.07   0.06⎥ ⎢ 0.09  -0.08   0.07   0.08   0.06   0.08   0.08   0.08   -0.1   0.08⎥ ⎣  0.1   -0.1   0.09    0.1   0.07    0.1    0.1    0.1   -0.1    0.1⎦ Node_0xc4200dcb40:anchor:s->Node_0xc4200dc840:anchor:n 1 Node_0xc4200dc9c0 c4200dc9c0 x :: Vector Float64 (5) -1 Value Grad Vector (5) [1] [0.1  0.2  0.3  0.4  0.5] Vector (5) [1] [-0.2  -0.2   0.2  -0.1  0.04] Node_0xc4200dcb40:anchor:s->Node_0xc4200dc9c0:anchor:n 2 Node_0xc4200dcc00 c4200dcc00 +(%5e0df3e6, %e3776aae) :: Vector Float64 Op + :: Vector a → a → Vector a (10) 0 Value Grad Vector (10) [1] [  0.1   0.07   -0.2  -0.07   -0.4  -0.05   0.03  -0.03   -0.4  -0.07] Vector (10) [1] [ 0.2  -0.2   0.2   0.2   0.1   0.2   0.2   0.2  -0.2   0.2] Node_0xc4200dcc00:anchor:s->Node_0xc4200dcb40:anchor:n 1 Node_0xc4200dca80 c4200dca80 b :: Float64 () -1 Value Grad 0 1.0725860353419272 Node_0xc4200dcc00:anchor:s->Node_0xc4200dca80:anchor:n 2 Node_0xc4200dccc0:anchor:s->Node_0xc4200dcc00:anchor:n 1 Node_0xc4200dcf00 c4200dcf00 sq(%a6ac9570) :: Vector Float64 Op sq :: a → a (10) 0 Value Grad Vector (10) [1] [ 0.2   0.1   0.1   0.1  0.09   0.2   0.2   0.2   0.3   0.1] Vector (10) [1] [1  1  1  1  1  1  1  1  1  1] Node_0xc4200dcf00:anchor:s->Node_0xc4200dce40:anchor:n 1 Node_0xc4200dcfc0 c4200dcfc0 Sum of Squared Errors (Cost) :: Float64 Op Σ[0] :: Vector a → a () 0 Value Grad 1.5631904744425802 1 Node_0xc4200dcfc0:anchor:s->Node_0xc4200dcf00:anchor:n 1

The Grad column in each node is the partial derivative of the output with regards to that node. Let’s look at node c4200dce40:

c4200dce40-(%d55882b7, %cbb775ca) :: Vector Float64
op- :: Vector a → Vector a → Vector a
(10)
debug info, feel free to ignore
ValueGrad
[-0.4   0.4  -0.3  -0.4  -0.3  -0.4  -0.4  -0.4   0.5  -0.4][-0.9   0.8  -0.7  -0.8  -0.6  -0.8  -0.8  -0.8     1  -0.8]

The way to interpret this node is quite simple.

The value is $latex \begin{bmatrix}-0.4 & 0.4 & -0.3 & -0.4 & -0.3 & -0.4 & -0.4 & -0.4 & 0.5 & -0.4\end{bmatrix} $.

The gradient is $latex \begin{bmatrix}-0.9 & 0.8 & -0.7 & -0.8 & -0.6 & -0.8 & -0.8 & -0.8 & 1 & -0.8\end{bmatrix} $.

If you’re observant you might note that the gradient is twice the value. Well, $latex \frac{d}{dx} x^2 = 2x $, if you recall. Node c4200dce40 is $latex x $, and node c4200dcf00 is $latex x^2$. And since there is only one operation that uses node c4200dce40, the gradient is simply $latex 2x$.

Once you understand this, it can be applied to any of the other nodes in the equation. Therein, in my opinion, lies the power of this form of visualization. It allows people to “see” what is happening in the neural network, and in my opinion has been quite successful at imparting an understanding for people new to the idea of neural networks and have to work on their maintenance.

Gradient Descent

While it is vital in my opinion to actually understand the general idea of gradient descent, the specific mechanics and implementation is very simple, given all the set up from above. The process of gradient descent is simply taking the value of the node, and adding the gradient from it. With flourishes of course.

In the case of a vanilla stochastic gradient descent, it’s simply the addition of the gradient multiplied by some negative learn rate η. Other gradient descent methods have fancier ways of calculating how much to add, each with good reasons from the value of the weights.

Using the above node as an example, gradient descent is simply this:

$$ \phantom{:} \begin{bmatrix}-0.4 & 0.4 & -0.3 & -0.4 & -0.3 & -0.4 & -0.4 & -0.4 & 0.5 & -0.4\end{bmatrix}\phantom{)}\\ + (-\eta \begin{bmatrix}-0.9 & 0.8 & -0.7 & -0.8 & -0.6 & -0.8 & -0.8 & -0.8 & \phantom{0.}1 & -0.8\end{bmatrix}) $$

Of course, in normal machine learning systems, we only want to learn only the value of the weights, gradient descent need only be done for the yellow coloured boxes - why waste precious processing resources to calculate something we don’t need? The gradients in the white boxes are merely there for understanding’s sake. In Gorgonia, the default behaviour is that the partial derivatives will be clobbered (the memories of the matrices reused) once that node has is done being traversed.

Key Takeaways

In case this blog post is too long, here are the key takeaways:
  1. I think the best way to represent a neural network is as equations, not the common picture used.
  2. Equations can be represented as graphs, which can be depicted visually.
  3. Attaching values to the equation graph is a very good way to concrefy the idea of equations-represented-as-graphs.
  4. Backpropagation is simply the process to calculate the partial derivatives of the output to the node itself.
  5. Gradient descent is simply the process to add the gradient values to the values themselves. Specifics are unique to each algorithm.
  6. This is kind of a form of programming by poking, where learning the implementation details would lead to faster understanding of it rather than just abstract theory. In my opinion it works.

I hope this blog post has been helpful either helping you understand backprop and gradient descent implementation mechanics better, or convinced you that this method is a superior way to teach backprop/gradient descent. Tell me what you think!

comments powered by Disqus