Deep Learning from Scratch in Go

Part 1: Equations Are Graphs

Note: This post was originally published at GopherData; what follows is the same post, reproduced here.

Summary: In this post I cover the basics of computation, then draw parallels between Go programs and mathematical equations. Lastly, I attempt to persuade you that there is a point to using Go for deep learning.

Welcome to the first part of many about writing deep learning algorithms in Go. The goal of this series is to go from having no knowledge at all to implementing some of the latest developments in this area.

Deep learning is not new. In fact, the idea of deep learning was spawned in the early 1980s. What’s changed since then is our computers - they have gotten much, much more powerful. In this blog post we’ll start with something familiar and edge towards building a conceptual model of deep learning. We won’t define deep learning for the first few posts, so don’t worry so much about the term.

A few terms need clarifying before we begin proper. In this series, the word “graph” refers to the concept of a graph as used in graph theory. For the other kind of “graph”, the kind usually used for data visualization, I’ll use the term “chart”.

Computation

I’m going to start by making a claim: all programs can be represented as graphs. This claim is not new, of course; nor is it bold or revolutionary. It’s the fundamental theory that computer scientists have been working from ever since the birth of the field of computation. But you may have missed it. If you have, the logic goes like this:

  1. All modern computer programs run on what essentially is a Turing Machine.
  2. All Turing machines are equivalent in power to the untyped lambda calculus (an equivalence closely associated with the Church-Turing thesis).
  3. Lambda calculus can be represented as graphs.
  4. Therefore all programs can be represented as graphs.

To make this idea more concrete, let’s look at a simple program:

func main() {
	fmt.Printf("%v", 1+1)	
}

This generates an abstract syntax tree like so (the AST was generated with a library built on top of goast-viewer):

[Figure: the abstract syntax tree of the program - func main() points to an *ast.BlockStmt, which points to a []ast.Stmt list containing an *ast.ExprStmt wrapping an *ast.CallExpr. The call’s children are an *ast.SelectorExpr (the *ast.Ident nodes fmt and Printf) and a []ast.Expr argument list holding an *ast.BasicLit (Kind: STRING, Value: %v) and an *ast.BinaryExpr (Op: +) whose operands are two *ast.BasicLit (Kind: INT, Value: 1) nodes.]
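
If you want to poke at the tree yourself, the standard library’s go/parser and go/ast packages are all you need. Here’s a minimal sketch (it only assumes the standard library, not the goast-viewer tooling used for the image above) that parses the snippet and dumps its raw AST:

package main

import (
	"go/ast"
	"go/parser"
	"go/token"
)

func main() {
	src := "package main\nimport \"fmt\"\nfunc main() { fmt.Printf(\"%v\", 1+1) }"

	fset := token.NewFileSet()
	// Parse the source into an *ast.File.
	f, err := parser.ParseFile(fset, "main.go", src, 0)
	if err != nil {
		panic(err)
	}
	// Dump every node of the tree to stdout.
	ast.Print(fset, f)
}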

From this, we can also say that any equation can be represented as a computer program, and any computer program can be represented as a graph. In particular, let’s zoom in on the 1+1 part, which corresponds to this part of the cleaned-up graph (with unnecessary nodes removed):

[Figure: the *ast.BinaryExpr (Op: +) node, pointing to two *ast.BasicLit (Kind: INT, Value: 1) nodes.]

The graph is traversed in a depth-first manner, starting from the top, while the values of the program flow from bottom to top. A node will not be resolved until the nodes it depends on have been evaluated; the arrows point to what each node depends on. So, for example, the value of the *ast.BinaryExpr node depends on the values of the two *ast.BasicLit (Kind: INT) nodes. Since we know both values are 1, and we know what + does, we know that the value at the *ast.BinaryExpr node is 2.
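
To make the “values flow from bottom to top” idea concrete, here is a minimal sketch of that evaluation order. The node, lit and add types are hypothetical stand-ins of my own (not go/ast types): a node only produces a value once the nodes it points to have produced theirs.

package main

import "fmt"

// node is a hypothetical graph node; lit and add stand in for
// *ast.BasicLit and *ast.BinaryExpr.
type node interface {
	eval() int
}

type lit struct{ val int }

func (l lit) eval() int { return l.val }

type add struct{ lhs, rhs node }

// An add node cannot resolve until both nodes it depends on have been evaluated.
func (a add) eval() int { return a.lhs.eval() + a.rhs.eval() }

func main() {
	expr := add{lit{1}, lit{1}} // the graph for 1 + 1
	fmt.Println(expr.eval())    // 2
}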

Equations As Graphs

Now, why did we spend all that time showing 1+1 in graph form? Well, it’s because deep learning really is, at its core, just a bunch of mathematical equations. Wait, don’t go yet! It’s not that scary. I am personally of the opinion that one can’t really do deep learning (or any machine learning, really) without understanding the mathematics behind it. And in my experience there hasn’t been a better way to learn it than visually, if only to internalize the concepts.

Most deep learning libraries, like TensorFlow, Theano, or even my own library for Go - Gorgonia - rely on this core concept that equations are representable as graphs. More importantly, these libraries expose the equation graphs as objects that can be manipulated by the programmer.

So instead of the program above, we’d create something like this:

func main() {
	// Create a graph.
	g := G.NewGraph()                         
	
	// Create a node called "x" with the value 1.
	x := G.NodeFromAny(g, 1, G.WithName("x")) 
	
	// Create a node called "y" with the value 1.
	y := G.NodeFromAny(g, 1, G.WithName("y")) 
	
	// z := x + y
	z := G.Must(G.Add(x, y))                  

	// Create a VM to execute the graph.
	vm := G.NewTapeMachine(g) 

	// Run the VM. Errors are not checked.
	vm.RunAll()               

	// Print the value of z.
	fmt.Printf("%v", z.Value()) 
}

The equation graph looks like this:

[Figure: the Gorgonia expression graph - a + node (Op: + :: a → a → a, Value: 2 :: int) pointing to its two inputs, x :: int (Value: 1 :: int) and y :: int (Value: 1 :: int).]

Why Graph Objects?

So far you might be thinking: if all computer programs are graphs, and all mathematical equations are graphs, couldn’t we just program the mathematical equations in directly? Why write the above example with graph objects when a simple fmt.Printf("%v", 1+1) would do? After all, the graphs are mostly the same. Having a graph object at this point seems like a silly amount of overhead.

You would be right. For simple equations, having a programmer-manipulable graph object really is overkill (unless you live in Java land, where it’s classes all the way down).

I would, however, posit at least three advantages of having a graph object. All of them have to do with reducing human error.

Numerical Stability

Consider the equation y = log(1 + x). This equation is not numerically stable: for very small values of x, the answer will most likely be wrong. This is because of the way float64 is designed - a float64 only carries about 16 significant decimal digits, so when a value as small as 10e-16 is added to 1, most of its digits are lost, and taking the log magnifies that rounding error. In fact, the correct way to compute y = log(1 + x) is to use the built-in library function math.Log1p, as this simple program shows:

func main() {
	fmt.Printf("%v\n", math.Log(1.0+10e-16))
	fmt.Printf("%v\n", math.Log1p(10e-16))
}

1.110223024625156e-15 // wrong
9.999999999999995e-16 // correct

Now, of course, the programmer may be well aware of this issue and opt to use math.Log1p when implementing neural network algorithms. But I’m sure you’d agree that having a library that automatically converts log(1+x) to use math.Log1p(x) would be rather awesome. It reduces the element of human error.
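
How might a library do that conversion? Below is a rough sketch using a hypothetical exprNode type - Gorgonia’s real node type and rewrite machinery look different - just to show that once the equation exists as a graph, spotting the log(1 + x) pattern and swapping in log1p(x) is a mechanical tree rewrite:

package main

import "fmt"

// exprNode is a hypothetical expression-graph node, not Gorgonia's actual type.
type exprNode struct {
	op       string // "log", "log1p", "+", "const", "var"
	val      float64
	children []*exprNode
}

// stabilize rewrites log(1 + x) into log1p(x), so the programmer doesn't have
// to remember to do it by hand.
func stabilize(n *exprNode) *exprNode {
	if n.op == "log" && len(n.children) == 1 {
		arg := n.children[0]
		if arg.op == "+" && len(arg.children) == 2 &&
			arg.children[0].op == "const" && arg.children[0].val == 1 {
			return &exprNode{op: "log1p", children: arg.children[1:]}
		}
	}
	return n
}

func main() {
	x := &exprNode{op: "var"}
	one := &exprNode{op: "const", val: 1}
	sum := &exprNode{op: "+", children: []*exprNode{one, x}}
	fmt.Println(stabilize(&exprNode{op: "log", children: []*exprNode{sum}}).op) // log1p
}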

Machine-Learning Specific Optimizations

Consider a variant of the first program in this post:

func a() int {
	return 1 + 1
}

This is the same program compiled down to assembly:

"".a t=1 size=10 args=0x8 locals=0x0
	(main.go:5)	TEXT	"".a(SB), $0-8
	(main.go:5)	FUNCDATA	$0, gclocals·2a5305abe05176240e61b8620e19a815(SB)
	(main.go:5)	FUNCDATA	$1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB)
	(main.go:6)	MOVQ	$2, "".~r0+8(FP)
	(main.go:6)	RET

In particular, pay attention to the second-to-last line: MOVQ $2, "".~r0+8(FP). The function has been optimized in such a way that 2 is simply returned - no addition is performed at run time, because the compiler knows, at compile time, that 1 + 1 = 2. By replacing the expression with a constant, the compiler saves computation cycles at run time. If you’re interested in building compilers, this is known as constant folding.

So, we’ve established that compilers are smart enough to perform optimizations. But the Go compiler (and in fact most compilers not specifically built for machine learning) isn’t smart enough to handle the kinds of values used in machine learning. For machine learning, we frequently use array-based values, like a slice of float64s or a matrix of float32s.

Imagine, if you will, that you’re not doing 1 + 1. Instead you’re doing the element-wise equivalent of []int{1, 1, 1} + []int{1, 1, 1}. The compiler wouldn’t be able to optimize this and just replace it with []int{2, 2, 2}. But building a graph object that can be optimized allows users to do just that. Gorgonia currently doesn’t do constant folding (earlier versions had constant folding, but it is quite difficult to get right), but it does come with other forms of graph optimization, like common expression elimination, some amount of variable elimination, and a minimal form of tree shaking. Other, more mature libraries like TensorFlow or Theano come with very many optimization passes for their graphs.
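
As a sketch of what this looks like on a graph (again with a hypothetical vecNode type rather than anything from Gorgonia), constant folding is just a recursive walk that collapses any addition whose operands are both constants:

package main

import "fmt"

// vecNode is a hypothetical graph node: it holds either a constant []int, or
// an unevaluated element-wise addition of two other nodes.
type vecNode struct {
	constVal []int
	lhs, rhs *vecNode // set when the node is an addition
}

// foldConstants replaces an addition of two constant nodes with a single
// constant node - the vector analogue of what the Go compiler did to 1 + 1.
func foldConstants(n *vecNode) *vecNode {
	if n.constVal != nil {
		return n
	}
	l, r := foldConstants(n.lhs), foldConstants(n.rhs)
	if l.constVal != nil && r.constVal != nil {
		sum := make([]int, len(l.constVal))
		for i := range sum {
			sum[i] = l.constVal[i] + r.constVal[i]
		}
		return &vecNode{constVal: sum}
	}
	return &vecNode{lhs: l, rhs: r}
}

func main() {
	a := &vecNode{constVal: []int{1, 1, 1}}
	b := &vecNode{constVal: []int{1, 1, 1}}
	fmt.Println(foldConstants(&vecNode{lhs: a, rhs: b}).constVal) // [2 2 2]
}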

Again, one could argue that this could be done by hand, and coding it into the program would be more than doable. But is this really where you’d rather spend your time and effort? Or would you rather be creating the coolest new deep learning stuff?

Backpropagation

Lastly, and most importantly, the graph object is essential to the implementation of backpropagation. In the following posts, I will talk about backpropagation and how it relates to computing partial derivatives. That’s where we shall see the true power of having graph objects that are manipulatable by the programmer.

In a far-future post, I shall also touch on using the graph object to generate better code.

Why Go?

If you’re working in a truly homoiconic language such as Lisp or Julia, you probably wouldn’t need a graph object. If you have access to the program’s internal data structures, and they’re modifiable on the fly at run time, you can augment plenty of the operations as the program runs (yes, you can do the same in Go, but why would you?). That would make backpropagation algorithms a lot simpler to perform at runtime. Unfortunately, Go isn’t such a language, which is why we have to build up extra data structures for deep learning.

Do note that this isn’t a knock on Go, Python, or Lua. All of these languages have their strengths and weaknesses. But why do deep learning related work in Go when there are more mature libraries in Python or Lua? Well, one of the major reasons I developed Gorgonia was the ability to deploy everything neatly into one single binary. Doing that with Python or Lua would take an immense amount of effort. By contrast, deploying Go programs is a breeze.

I believe that Go for data science is an amazing idea. It is type safe (enough for me), and it compiles down to a binary. Go allows for better mechanical sympathy, which I believe is key to having faster and better AI out there. After all, we are ALL bound by our hardware. I just wish there were better higher-level data structures for me to express my ideas. There weren’t, so I built them. And I hope you use them.
