This may come as a surprise to many people, but I do a large portion of my data science work in Go. I recently gave a talk on why I use Go for data science. The slides are here, and I’d also like to expand on a few more things after the jump:
I wrote up a cheatsheet for anyone coming from Python who is used to Numpy. I specifically want to thank Brendan Tracey, Sebastian Binet, Daniel Whitenack, Makoto Ito and Jorge Landivar for helping me shape the cheatsheet.
I would recommend reading the web format because it’s laid out in the way that I think makes the most sense. The PDF re-lays out the boxes, pushing the introduction boxes to the bottom.
The multidimensional array is a fairly fundamental data structure when it comes to doing data science. In Python, you have Numpy. In Java you have nd4j. In Julia you have… Julia. In Go, there are multiple options:
Each package was designed with different use cases in mind. I like to think that the table below summarizes the issues and helps with deciding which package to use.
What to Use
| If… | Then… |
|---|---|
| I only ever want a matrix of `float64`s | use Gonum’s `mat` |
| I want to focus on doing statistical/scientific work | use Gonum’s `mat` |
| I want to focus on doing machine learning work | use Gorgonia’s `tensor` |
| I want multidimensional arrays | use Gorgonia’s `tensor` |
| I want to work with different data types | use Gorgonia’s `tensor` |
| I want to wrangle data like in Pandas or R - with data frames | use Gota’s `dataframe` |
| I want to use external computation devices like GPUs to process my data | use Gorgonia’s `tensor` |
The last row is not included in the cheatsheet, because I don’t want to falsely promote the idea that doing data science is all about doing things on the GPU. In fact, short of deep-learning-related work, most of data science can be done with matrices of `float64` values. Speedups from using a smaller data type or an external computation device should be considered extreme optimizations.
The powerhouse behind a lot of these is the Gonum family of packages. Gorgonia’s `tensor` package has a different structure from Gonum’s `Mat` type, but it also leverages the algorithms in the Gonum packages.
The APIs for these packages are different, mainly due to the different design philosophies. The APIs are documented in the cheatsheet.
The `tensor` package was designed to be much closer to Numpy’s API, due to my familiarity with Numpy. It returns errors when possible, except in object creation functions, because one of the earliest uses of Gorgonia was to build an interactive neural network explorer that I used for my teaching courses.*

*The program was last properly used in March; when I recorded the asciinema piece months later, I had to look up some of the syntax I wrote, hence the random pauses in the video.
Gonum’s API rationale can be found in this excellent presentation. Both families of packages share Rule 1: No Silent Bugs. I like that philosophy a lot.
At this point, it wouldn’t be amiss to also enumerate the points of commonality and difference in design philosophy between Gonum and Gorgonia:
| | Gonum | Gorgonia |
|---|---|---|
| Rule 1 | No Silent Bugs | No Silent Bugs |
| Panic when… | an error is easy to check before the call | impossible parameters are passed into object creation functions |
| Return errors when… | it’s impossible to check without performing the operation | almost always: nearly all functions and methods may return an error |
| “Functional” programming | Functions and methods should not modify state, unless that is their only purpose. | Functions and methods should not modify state, unless the function options say otherwise. |
| Idioms | Try to reuse as many Go idioms as possible. | Provide commonly used methods, even if it violates some idiomatic Go. Aim to bridge the gap between Numpy and Go. |
| Package functions | Perform operations on interface types. | Perform operations on interface types. |
| Type methods | Perform operations intrinsic to the type. | Perform operations intrinsic to the type. |
Another thing I’d like to highlight is that Gonum comes with a bunch of other packages that are genuinely useful for doing data science work. For a lot of data science work I find them more than ample; almost all my needs are filled. For smaller-scale things (which is most of my work), I just use Gonum.
When I need to use `gorgonia.org/tensor`, the package provides interop with Gonum’s `*Dense` types. The methods that come with Gorgonia tensors are basic methods; any additional algorithms would require extension, which can be done by creating a new `ExecutionEngine`. That’s fairly advanced Gorgonia work, though.
The datasets I work with are fairly well known ahead of time. Exploration of said datasets is done either in a SQL client or in Jupyter. But when it comes to data science work that can be pushed to production (typically in an executable), I drop straight into Go.
The packages I use are your bog-standard SQL libraries, the CSV encoding libraries, and Gorgonia or Gonum. That’s one of the things about Go: it’s such a simple language that there is really nothing fancy to show off. It’s straightforward - what you read is what happens. Code becomes boring, and there are no “One Weird Trick”s. So there isn’t really anything to blog about on that end.
I personally prefer not to use any frameworks to perform operations, mainly because the operations I write tend to be quite task-specific. For a lot of functions I just write them; few things are truly reusable as-is. If you set out to write a truly reusable function, you tend to end up with many, many parameters and overly complicated code to handle all the edge cases. Not particularly my cup of tea. And this is from the guy who wrote a fairly generic multidimensional array for Go.
Generic frameworks that claim to do everything for everyone don’t really suit my workflow - they typically end up with me bending the logic of my programs in weird and potentially buggy ways to fit the ideals of the framework.
Do I miss nice APIs from Scikit-learn? Occasionally. I especially miss `classification_report`, and in fact the entire `sklearn.metrics` collection. I don’t miss the `fit()`-style APIs though.
I think the design of those APIs is quite silly; there shouldn’t be a one-size-fits-all method. You end up with silliness like `fit_transform()`.* Interface definitions should, in my opinion, be lazily done. That’s the beauty of Go’s implicit interfaces.

*Before anyone asks: I think the transform process should be a clear and separate step. Mashing it all into one just makes things confusing for your future self.
Plotting is another thing that is missing in the Go ecosystem. That usually isn’t a problem for my deep learning work - I just pipe the JSON out to a file and have a separate webserver that reads the file and plots it using plotly or something.
On the dynamic data exploration front, that’s what is really missing: the ability to dynamically explore data in Go. Gophernotes and Gota aim to fix that, but at this point they’re still quite young. As are Gonum and Gorgonia, to be honest. As the developer of the Gorgonia family of packages, I keep getting the anxiety-inducing feeling, from time to time, that the packages are somehow still broken.
- The GopherData website.
- The #data-science channel in the Gophers slack is an excellent resource. Most Gonum core devs hang out there, and @dwhitena is probably the friendliest, most helpful person when people ask questions.
- Daniel Whitenack has recently written a book on using Go for machine learning, and I suggest everyone check it out. I have. It’s pretty good. Could do with more on Gorgonia ;P.