I released Gorgonia on Thursday. Gorgonia is a library like Theano or TensorFlow, but mainly written in Go. It provides the necessary primitives for creating and executing neural networks and machine learning algorithms.
According to cloc, these are the stats:
chewxy@chewxy-Gallifrey:~/workspace/goworkspace7/src/github.com/chewxy/gorgonia$ cloc . 357 text files. 321 unique files. 604 files ignored. http://cloc.sourceforge.net v 1.60 T=0.83 s (296.5 files/s, 55471.5 lines/s) ------------------------------------------------------------------------------- Language files blank comment code ------------------------------------------------------------------------------- Go 219 6308 3924 30858 Assembly 22 585 740 2128 C/C++ Header 2 55 57 666 C 2 17 39 458 ------------------------------------------------------------------------------- SUM: 245 6965 4760 34110 -------------------------------------------------------------------------------
So, it’s a pretty huge library. But the original version is about 80,000 LoC (though most of the lines of codes were different experimental variations of assembly code). I managed to cut down 50,000 LoC to something more manageable. In this post I want to outline the release of Gorgonia, and share some of the reasoning regarding the design of the library, as well as go thru some of the weirdness found in the library.
If you’re interested, here’s the video (otherwise, skip to the meat):
And here are the slides:
In the Beginning
In the beginning, there was Theano. I learned about Theano when I was still in the advertising industry, writing systems to machine-learn brand safety - this was around the end of 2013. Naturally I was curious and played with it. The deployment of Theano on AWS was a pain in the ass though - it remained a pain in the ass until late 2015.
The idea was certainly a curious one - instead of writing machine learning algorithms directly, write the equations and compile it into a Python function and execute it. I was vaguely aware of the graph-based differentiation, but wasn’t aware of the immediate benefits. I was pretty stuck on my ways of manually writing neural networks and different machine learning algorithms directly.
The benefits became clear when I needed to try lots of different designs of neural networks. My old approach couldn’t scale as I didn’t have parts I could reuse. By then most of my developmental workflow was entrenched in Go. Even my heavy statistics component of my new job was were Go-heavy (I’d use a combination of scikits-learn, R (mainly for ggplot as I was slow to pick up Bokeh) but sorta puppet-mastered by programs written in Go). A full Go solution would be nice.
And the story of Gorgonia, like many other stories of software libraries begin the same way - as a collection of useful functions, data structures and stuff that a person develops and accumulates in their personal libraries over the years. Most data scientist friends I know have something like that in their personal Github/Bitbucket accounts.
The library however, really accelerated in development when I left to pursue my own startup last year. I’m still working on it right now, but the majority of work put into what is now Gorgonia began last year.
Autodifferentiation
In Gorgonia, one of the oldest pieces of code is the dualValue
data structure. It’s gone through many iterations, been exported, unexported, its underlying things changed. But the main idea of it hasn’t changed - it’s a data structure that holds two values: the value itself and its derivative. This idea is clearly not novel, as there’s even a Wikipedia article about dual numbers and its role in differentiation.
My first few attempts at automatic differentiation were clumsy. Mainly it was due to not having a strong foundation in math, but it took me a while to realize that you can do automatic differentiation in two ways: forwards mode and reverse mode. nd I had been writing all sorts of forwards mode differentiation while wondering why my code wasn’t efficient at differentiation. I was doing things that were at least O(n^2)!
Reverse mode differentiation is almost constantly being re-discovered. Seppo Linnainmaa* Ironically, is very not American discovered it in the 1970s. Paul Werbos discovered it in 1982, and then Yann Lecun and Rumelhart subsequently rediscovered it in 1986. I believe so did Geoffrey Hinton sometime in the 80s.
I think for me, it finally clicked when I read Christopher Olah’s exceedingly good article on reverse-mode differentiation. By then I had manually written backprop algorithms at least 20 times over (I experiment with a lot of different designs of ML algorithms). Goes to show sometimes you can blindly do things while thinking you know how it works and why.
I settled on the final design of dualValue
at around the time I was giving my Monads talk at PyConAU (errata and notes). As a result the code around dualValue
can be a bit monadish-crazy. I went to the point of having monad-transformer equivalents but then removed it because it was getting a bit unmanageable.
I like this monad-ish design for dualValue
and I think it’s rather elegant. The elegance was undone a bit in TapeMachine
(because it had to unwrap the Value
at the beginning of the executeOp
instruction.
Package tensor
Another feature of Gorgonia is the Tensor subpackage. The Tensor subpackage has further subpackages for each of the types of tensors it supports: float64
, float32
, bool
and int
.
The design is rather simple: all the features are designed and implemented in tensorf64
first, and then a generator parses the code, and spits out the equivalent for the other types. Currently the generator is stupid. It copies the code and does a simple find-and-replace as specified in the file but eventually when I have the free time, I’d like to parse the AST of the tensorf64
package and generates the tensor types for the other types.
As a result of having a dumb generator, the files in the tensor subpackages are named rather peculiarly. The main reason for splitting up a file into different bits is to aid the dumb generator. For example, arithmetic methods and functions cannot work on bool
types, so they have to be sequestered out so that the generator can ignore copying those files when generating for bool
. As I said, it’s pretty dumb, and I really want to improve it.
I have considered other methods like writing a template file, and generate all the types at the same time. However, this approach didn’t enable me to write and test code quickly. More often than not, it’s easier to write code for tensorf64
and generate from there.
Package tensor
, Part Deux
A frequently asked question is why not use the very excellent gonum/matrix package? Again, the reason is mainly historical. A few years back I had written my own datastructure for n-dimensional arrays called ndarray
that was similar to Numpy’s ndarray* heck I'm so original I even came up with the name ndarray .
That’s one reason. Another reason was because I really needed tensor-types (i.e. multidimensional arrays) when I wrote ndarray
. That time has passed, and now I’m rewriting a lot of the multidimensional tensor parts to replace the janky ass stuff I wrote in the past. For example, in the current version, I broke the TensorDot
method because I changed the way data is accessed.
Yet another reason was because at one point I was convinced that float32
operations would be faster. I was right. With SIMD (specifically AVX), float32
operations were indeed faster. They’re however not without their own challenge, but that’s a story for another day (it’s related to why there are no SIMD
The last reason why I wrote the tensor
package was because I really really really enjoy the elegance of the Numpy ndarray data structure. It’s really well thought out of and if I could emulate even a little bit of it, it would be great. This is not to disparage the developers of Gonum though. They have put in a LOT of good work into the matrix package. I just think Numpy’s data structure is more elegant. You can read more about the elegance of Numpy’s ndarray here.
For what it’s worth though, there are compatibility functions that translate Gonum’s matrix into Gorgonia’s Tensor (valid only for float64 and float32 matrices).
Ops
This is one of those “why didn’t I think of it earlier” things with the Op
s. As of right now, the Op
interface is a bunch of weird mix of exported and unexported methods. It wasn’t until I wanted to open-source the package that I realized it wasn’t a good idea. The methods of the Op
interface should just be entirely exported. It will get there. But the earliest design of Op
were functions. A vestige of this idea can still be found in the ʘBinaryOperator
interface.
The original design was something like this:
type ʘBinaryOperator interface {
isArith() bool
do(...Value) (Value, error)
}
type ssBinOp func(Scalar, Scalar) (Scalar, error)
type tsBinOp func(Tensor, Scalar) (Tensor, error)
type stBinOp func(Scalar, Tensor) (Tensor, error)
type ttBinOp func(Tensor, Tensor) (Tensor, error)
As you can tell, this would lead to a combinatorial explosion of functions, which I eventually deemed too difficult for me to maintain, so I refactored it to what it is now.
I’m still of the opinion that there is a better way to do Op
s, but I’ve been drawing blanks for sometime now for this. To me, the Op
type is probably one of the ugliest bits of Gorgonia. I would want to refactor it into something more elegant.
TensorFlow has the concept of kernels and ops. The idea of kernels were quite obviously lifted from CUDA - they’re basically the “Do” function of an op. I’m slowly beginning to think that this may have been a slightly superior idea to mine.
Type System
Part of the issue with the Op
is due to the existence of a HM-ish type system, which it turns out, Gorgonia doesn’t really need. When I had embarked on creating the early versions of Gorgonia, I had by then written a few toy languages. Most of them had a HM-style type system, because, hey who doesn’t have Haskell-envy* I imagine a virtual room full of strawmen and stereotype where the strawman Haskeller comes up and say "oh well, yeah? if only you had a superior type system! Haskell could do that.". Meanwhile the Coq and Agda and Idris people are at the far end of the room, saying "I can't hear you from waaaaaay over there" ?
And you know how programmers are - we like to copy and paste things from previous projects of ours. And that’s precisely what I did. Which is why Op
has a Type()
that returns functionType
which is essentially a [2]Type
(representing f :: a → b
). It started with a nice idea: “ops are functions! the entire library should construe a expression graph as a functional program!”. But of course as the development of ideas go on, this turns out to be a half baked idea for Gorgonia.
I really really like the idea of having a well-typed expression graph. However, I do not think that this implementation is the best. I also think that it does not need a HM type inference system - afterall, as the development went on, there were more and more hacks that were added in, usually for performance purposes (the unify
function of the HM inference is quite expensive), to the point now that I do not think it is no longer necessary.
It’s also highly unlikely that an expression graph would need to support generics. In the early days of development, I had this written down in the git commit in December 2014 (this was when Gorgonia was still a mixed bag of functions and data structures I keep):
The ops should be able to handle generics - to handle the inputs of various types.
When I had started, I imagined there to be hundreds of “various” types. Turns out there are only two that really matters: “Scalar” and “Tensor”. Sure you can definitely represent it with a HKT system, but thru the experience of writing Gorgonia, I find that it’s overkill.
I’m also split on the idea of typeclasses. On the one hand it is extremely elegant an idea. Define a op, a type variable that is constrained by the typeclass, and bam, instant generics. In real life, the constraint satisfaction algorithm took way too long and was one of the first few things I removed from the type system. Perhaps this was just me being a crappy programmer.
API
Gorgonia’s API is probably one of the more standard things I have done and is one of the things about the library that I liked quite a bit. An early version had an API that looked like this:
x := IsAMatrix(Of(Float64), InGraph(g), WithShape(2,2), WithName("x"), WithValue(xT)))
w := IsAVector(Of(Float64), InGraph(g), WithShape(2,1), WithName("w"), WithInit(Glorot(1.0)))
xw := Must(Mul(x,w))
act:= Must(Tanh(xw))
Whilst I really liked it, after a few WTFs from friends and people I tried to get to use Gorgonia I eventually changed it to the more commonly accepted theano-like API:
x := NewMatrix(g, Float64, WithShape(2,2), WithName("x"))
I think the original idea that I had definitely helped reinforce the idea that you’re building an expression graph instead of creating a value, and using it. But, YMMV.
API, Part 2
There is a part of the API that I’m highly unsure of what the best move is. That is: operations - return error or panic? The typical errors that can arise from operations are limited:
- Type Errors - user passed in nodes of different types (for example:
Add(F32N, F64N)
). - Graph Error - user passed in nodes from different graphs to perform operations
- Differentiation errors - user passed in weird/undifferentiable nodes as differential operations
The question is this: these are technically unrecoverable operations - the graph cannot be executed upon if there are type errors for example. And the standard thing to do in Go is to panic
when faced with unrecoverable operations. However, I personally feel that’s the responsibility of the application, not the library to panic. Hence Gorgonia returns the errors.
On the other hand, returning the errors meant that the API is ugly as sin. Instead of code that looks like these:
probs := Sigmoid(Add(Mul(x,w), b))
You end up with code that look like this:
if xw, err = Mul(x,w); err != nil {
log.Fatal(err)
}
if xwpb, err = Add(xw, b); err != nil {
log.Fatal(err)
}
if prob, err = Sigmoid(xwpb); err != nil {
log.Fatal(err)
}
I ended up writing a Must()
function as a helper for these sorts of things, but I’m still in two minds about it.
On the plus side of that, in being forced to write so many error catching and handling, I managed to find a few bugs in some of the neural network implementations I wrote. This is especially helpful when it comes to more complex stuff like dealing with context vectors and attention scores. On the downside, it’s a LOT of code for very little benefit.
Graph
If you use Gorgonia, you might notice some weird quirk about it: there is a heavy dependence on the *ExprGraph
type. In fact, all expression graphs have to be linked to one main *ExprGraph
. Also, because of this, you cannot do something like this:
one := NewConstant(1.0)
two := NewConstant(2.0)
three := Must(Add(one, two)) // panics
The code above panics because all nodes must be linked to a graph. Constants have no graphs linked (because constants are well, constants, and will be constant in any graph). This idea was obviously stolen from Go’s concept of constants.
The story of the *ExprGraph
is that when I started, no nodes were linked to a *ExprGraph
. In fact, a *ExprGraph
was supposed to be a lightweight container so that topological sorting of nodes could be made easier using Gonum’s topo.Sort
. In fact, the earliest versions had *ExprGraph
unexported, so applications outside the library had no access to it.
Of course, over time, nodes became more and more intertwined with the *ExprGraph
. This was mainly due to the fact that it was easier to use an *ExprGraph
to keep track of the nodes created when doing symbolic differentiation.
I would like to go back to the original design, where an *ExprGraph
was a lightweight collection of nodes. I have no idea how.
What I am Happy About
The majority of this post sounds like I’m complaining about the things I wrote. Well, it’s true. I don’t like a portion of what I wrote. But there are portions that I like too. The biggest portion that I like is that the design of Gorgonia reduces cognitive load on the user when compared to say, Theano. In Theano, the scan
function and the updates
and givens
bit are really a pain in the ass to reason about. Yeah sure, it’s not un-understandable, and you may even grow to like it (I never did). But that’s a big barrier. TensorFlow started without scan either, and then they added it. :s
Gorgonia removes the need for those. Gradient updates are done via the Solver
instead of making it part of the graph. Gorgonia also does away with the whole concept of shared
in Theano. It is necessary in Theano but not in Gorgonia, because you have access to the underlying matrix anyway.
I also like that Gorgonia tries to make some parts of the broadcasting of shapes more explicit than Theano. It’s not as explicit as I like* there exists the special case of scalar-tensor interactions which are hard coded in - but those were added for allocation optimization , but I’m happy to live with that. I find that with Theano’s behind-the-scenes broadcasting and dimshuffling, you can sometimes end up with things you didn’t intend to.
I’m also quite happy with the development of the symbolic differentiation bits of the code. That was a completely new experience for me - writing a symbolic differentiator, but one that I am quite happy to have learned to do. Gorgonia marks the third time I’ve tried to write an automatic differentiator, and by far is the largest attempt for me.
One last thing I really like about Gorgonia is its API - not the operations part, but the whole functional options part. A massive hat tip to Dave Cheney for helping refine my thoughts on that issue with his excellent blog post on this topic. You will indeed find this pattern littered across Gorgonia, from creating new VMs:
m := NewTapeMachine(WithLogger(nil), WithNaNWatch(), WithWatchlist(...))
m2 := NewLispMachine(WithLogger(nil), LogBwdOnly())
to creating new nodes:
x := NewMatrix(g,Float32, WithName("x"), WithShape(2,2))
Of course, I brought up the latter as an example of over-reliance on a pattern. Indeed, the NewMatrix
function would have been better off as this:
func NewMatrix(g *ExprGraph, dt Dtype, row, col int, ...NodeConsOpt){}
Since you already know it’s a matrix, why not directly enforce the shape by requiring rows and cols? The API will probably change into that one day.
The Name
Here’s a little trivia regarding the name Gorgonia. Theano was Pythagoras’ wife - it’s a nice feminine-sounding name of Greek origin. I wanted a nice feminine sounding name of Greek origin too (because as I said, I’m so original like that). I also wanted a name that starts with “Go"* Fun fact, 90% of my Go related work have names starting with "Go" . Gorgonia was one such name that popped up. And then I looked up pictures of Gorgonia and found one of the most beautiful seacreatures ever. So it stuck.
Future
Right now I’m kinda happy with Gorgonia. There are bits I would love to update. The API is considered stable-ish, but subject to change. I’ll consider the API stable when there is 90% test coverage for Gorgonia. These are the things I like to see for Gorgonia’s future:
- 90% test coverage (and better, more "unit" test)
- Examples!
- Better and more consistent documentation
- Reduced dependence of
*Node
and*ExprGraph
- More consistency across things (for example, the solvers are currently a mess)
- Unmessify errors
- Clean up compilation stuff, regalloc stuff were more suited for static compilation instead of dynamic on-the-fly compilation
- Fixed up float32 slices avx/sse issue
- Fixed up GPU work
- Fixed up batched BLAS work for
*LispMachine
- Reduced reliance on the type system which is slow and allocates way too many things
- Reintroduce non-backprop methods
- Derivative-free optimizations
- Refactor existing corner cutting algorithms into the package
- Expand to beyond just neural network building blocks
Thanks
In writing this library I have relied upon the work of countless crazy people who have spent their lives studying one subject and being very good at it - from optimizing floating point operations to designing better neural networks. I really couldn’t have done it without referring to their awesome body of work that is on the internets. In no particular (well, it’s somewhat alphabetical) order:
- Andrej Karpathy* I'm quite surprised in particular how much of my code ended up being inspired by Andrej... His designs are often the superior version of what I have written, so I had to divert my course of action a few times to match my designs to fit his
- Bruce Dawson
- Chris Dyer's lab
- Chris Olah
- Dave Cheney
- Felix Cloutier
- Michael Nielsen
- Honnibal
- Jacob Andreas
- John Schoolman(?)* I have this in my notes as he's from OpenAI, but I can't seem to find him
- PJReddie
- Various people behind autograd, caffe, dlib, MLPack, Numpy, TensorFlow, Theano, Torch etc - I stole a lot of ideas from these packages, some have ended up in Gorgonia.
- Your standard Big Names in ML: Hinton, LeCunn, Ng, Jordan, Bengio, Liang, deFreitas, etc
I’m sure that there are many more whose works have been immeasurably helpful to my implementation of Gorgonia.