On the memory alignment of Go slice values

TL;DR and Meta – I was playing around with some AVX instructions and ran into problems. This post describes how I investigated the issue and discovered that the cause was that the values in Go’s slices are not aligned to a 32-byte boundary. I then describe the alignment issue and the two solutions I devised, one of which I implemented.

On Thursday I decided to do some additional optimization of my Go code. This meant writing some assembly to get some of the AVX goodness into my program (I once gave a talk on the topic of deep learning in Go, where I touched on this issue). I am no stranger to writing assembly in Go, but it’s not something I touch very often, so it sometimes takes me a while to remember how to do things. This was one of those times, so this blog post is mainly here to remind me of one thing:

The values in Go slices are 16-byte aligned. They are not 32-byte aligned.
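
A quick way to check this on your own machine is to take the address of a slice's first element and look at it modulo 32. What follows is only a minimal sketch, not code from my actual program: the helper names isAligned and dataAddr are made up for illustration, and the addresses it prints will vary with the Go version, the platform, and the size of the slice.

```go
package main

import (
	"fmt"
	"unsafe"
)

// isAligned reports whether addr is a multiple of boundary.
// boundary must be a power of two (e.g. 16 or 32).
// Hypothetical helper, named for this example only.
func isAligned(addr, boundary uintptr) bool {
	return addr&(boundary-1) == 0
}

// dataAddr returns the address of the first element of s,
// or 0 if s is empty. Hypothetical helper, named for this example only.
func dataAddr(s []float64) uintptr {
	if len(s) == 0 {
		return 0
	}
	return uintptr(unsafe.Pointer(&s[0]))
}

func main() {
	// Allocate slices of a few sizes and report where their data landed.
	for _, n := range []int{4, 5, 6, 17, 61, 1049} {
		s := make([]float64, n)
		addr := dataAddr(s)
		fmt.Printf("len=%4d addr=%#x mod32=%2d aligned32=%v\n",
			n, addr, addr%32, isAligned(addr, 32))
	}
}
```

I deliberately show no expected output, because where the allocator places a slice is not something the language promises. The point is only that a 32-byte check like this is cheap to run before handing a slice to an instruction that requires aligned memory.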

The Problem

But first, some background. The task I was trying to accomplish was a very basic vectorization of some math operations on a couple of []float64 slices. The problem, however, can be reproduced with something as simple as adding two slices together. This is the Go equivalent of the simplest possible reproduction case:

// add adds b to a element-wise, in place.
// It assumes len(b) >= len(a).
func add(a, b []float64) {
	for i := range a {
		a[i] += b[i]
	}
}


This is the equivalent assembly:

// function header code is irrelevant and is truncated
// the addresses of the first elements of the two slices are stored in %RSI and %RDI
loop:
	// a[0] to a[3]
	// VMOVAPD (SI), Y0
	// VMOVAPD (DI), Y1
	// VADDPD Y0, Y1, Y0
	// VMOVAPD Y0, (SI)
	BYTE $0xc5; BYTE $0xfd; BYTE $0x28; BYTE $0x06
	BYTE $0xc5; BYTE $0xfd; BYTE $0x28; BYTE $0x0f
	BYTE $0xc5; BYTE $0xf5; BYTE $0x58; BYTE $0xc0
	BYTE $0xc5; BYTE $0xfd; BYTE $0x29; BYTE $0x06
	ADDQ $16, SI
	ADDQ $16, DI
	SUBQ $16, AX
	JGE loop
	... // remainder code is irrelevant and is truncated

The reason the assembly has BYTE ... is that the Go assembler doesn’t yet fully support AVX instructions (the only AVX instructions it supports are the ones that the crypto package uses), so I had to write the bytes in manually*.

* Not really: I wrote a script to convert normal assembly to Go’s assembly using the standard gcc toolchain. But saying “I did it manually” makes me sound more badass. Or stupid. Either way it’s probably gonna come back and bite me in the ass.

Anyhow, I wrote a bunch of test cases, and they all passed. It wasn’t until I ran the function on real-life data that it started failing. Specifically, it kept failing with unexpected fault address 0x0, which was perhaps the most useless error message ever. The top result from a Google search is about map concurrency, which wasn’t the case here.

Investigation

So I started investigating. I noted that the code passed my test cases but failed in my real-life runs, which raises the question: what was different?

The first thing that was immediately apparent was that the sizes of the slices were different. In my test cases, I had used three different slice sizes: 7, 1049, and 1299827. In case it isn’t apparent, these are all prime numbers. The reason was that my code had some manually unrolled loop logic, and slices with a prime number of elements would help test whether the remainder code was correct. Hence I tested with several different sizes. To my frustration, they all passed.

Perhaps an unstructured, random-number approach wouldn’t work. I reasoned that since the AVX registers are 256 bits wide, which means 4 elements, I’d try multiples of 4, and the +1s and -1s too. So I’d try slices of (4, 3, 5) elements, (8, 7, 9) elements, and so on. It was here that I figured out where and when the code would fail. Specifically, the code would fail on 5 and 6 elements in a slice.
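
The size sweep described above can be sketched roughly as follows. This is an illustration under assumptions, not my actual test code. The names addRef, addUnrolled, sweepSizes, and agrees are invented for this example, and addUnrolled is a pure-Go stand-in with the same shape as the assembly (a main loop that handles 4 elements at a time, plus a scalar remainder), not the AVX routine itself.

```go
package main

import "fmt"

// addRef is the plain reference implementation: a[i] += b[i].
func addRef(a, b []float64) {
	for i := range a {
		a[i] += b[i]
	}
}

// addUnrolled is a pure-Go stand-in for the vectorized routine:
// a main loop handling 4 elements per iteration, then a scalar remainder.
func addUnrolled(a, b []float64) {
	i := 0
	for ; i+4 <= len(a); i += 4 {
		a[i] += b[i]
		a[i+1] += b[i+1]
		a[i+2] += b[i+2]
		a[i+3] += b[i+3]
	}
	for ; i < len(a); i++ {
		a[i] += b[i]
	}
}

// sweepSizes returns 4k-1, 4k, 4k+1 for k = 1..maxK.
func sweepSizes(maxK int) []int {
	sizes := make([]int, 0, 3*maxK)
	for k := 1; k <= maxK; k++ {
		sizes = append(sizes, 4*k-1, 4*k, 4*k+1)
	}
	return sizes
}

// agrees reports whether f produces the same result as addRef
// on freshly allocated slices of length n.
func agrees(f func(a, b []float64), n int) bool {
	a1, a2, b := make([]float64, n), make([]float64, n), make([]float64, n)
	for i := range b {
		a1[i], a2[i], b[i] = float64(i), float64(i), float64(2*i+1)
	}
	addRef(a1, b)
	f(a2, b)
	for i := range a1 {
		if a1[i] != a2[i] {
			return false
		}
	}
	return true
}

func main() {
	for _, n := range sweepSizes(6) {
		fmt.Printf("n=%d ok=%v\n", n, agrees(addUnrolled, n))
	}
}
```

Note that a sweep like this only catches wrong results. A hardware fault such as the one I was chasing would typically crash the whole process instead of making agrees return false, so in practice the sweep tells you the size at which the crash happens rather than reporting a failed comparison.
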

Slices of 0, 1, 2, or 3 elements wouldn’t fail, because the code would fall into the non-AVX branches. But 5 and 6 had a mix of AVX and non-AVX code.

Convinced now that the number of elements was the problem, I wrote some code to find out to what extent it would fail. To my surprise, slices of 61 elements or larger failed 0% of the time. I had, however, found that slices of 4, 5, 6, 17, 18, 20, 21, and 22 elements would fail.

And then I went for lunch.

When I came back from lunch, I quickly re-ran the tests to warm up my mind and get back into debugging this issue*, only to get a different result. This time it was slices of 10, 12, and 13 elements that failed. It was becoming very clear that the number of elements wasn’t the problem.

* Life Pro Tip: this is actually a very effective way to get back in the groove – simply redo the last 30 mins’ work.

I was suspicious of my assembly-writing skills by then. So I tested myself by adding 1 to each element in a slice, to see if I understood assembly. This is a minimized sample of what I wrote:

TEXT ·dumb(SB), 7, $0
	MOVQ a_data+0(FP), SI
	MOVQ a_len+8(FP), AX
loop:
	ADDQ $1, (SI)
	ADDQ $8, SI