On the memory alignment of Go slice values

TL;DR and Meta – I was playing around with some AVX instructions and I discovered that there were some problems. I then described the investigation process of the issue and discovered that this was because Go’s slices are not aligned to a 32 byte boundary. I proceed to describe the alignment issue and devised two solutions, of which I implemented one.

On Thursday I decided to do some additional optimization to my Go code. This meant writing some assembly to get some of the AVX goodness into my program (I once gave a talk on the topic of deep learning in Go, where I touched on this issue). I am no stranger to writing assembly in Go, but it’s not something I touch very often, so sometimes things can take longer to remember how to do them. This is one of them. So this blog post is mainly to remind myself of that.

The values in Go slices are 16-byte aligned. They are not 32 byte aligned.

The Problem

But first, some background. The task I was trying to accomplish was a very basic vectorizing of some math operations on a couple of []float64. The simplest example to reproduce the problem however, can be shown quite simply with adding two slices together. This is the Go equivalent of the simplest possible reproduction case:

func add(a, b []float64) {
    for i, v := range a {
        a[i] += b[i]
    }
}

This is the equivalent assembly

// function header code irrelevant and is truncated
// address at top of slice is stored in %RDI and %RSI
loop:
    // a[0] to a[3]
    // VMOVAPD (SI), Y0
    // VMOVAPD (DI), Y1
    // VADDPD Y0, Y1, Y0
    // VMOVAPD Y0, (SI)
    BYTE $0xc5; BYTE $0xfd; BYTE $0x28; BYTE $0x06
    BYTE $0xc5; BYTE $0xfd; BYTE $0x28; BYTE $0x0f
    BYTE $0xc5; BYTE $0xf5; BYTE $0x58; BYTE $0xc0
    BYTE $0xc5; BYTE $0xfd; BYTE $0x29; BYTE $0x06
    ADDQ $16 SI
    ADDQ $16 DI
    SUBQ $16 AX
    JGE loop
    ... // remainder code is irrelevant and is truncated

The reason why the assembly has BYTE ... is because the Go assembler doesn’t yet fully support AVX instructions (the only AVX instructions it supports are the ones that the crypto package uses), so I had to write the bytes in manually * not really, I wrote a script to convert normal assembly to Go’s assembly using the standard gcc toolchain. But saying “I did it manually” makes me sound more badass. Or stupid. Either way it’s probably gonna come back and bite me in the as

Anyhow, I wrote a bunch of test cases, and they all passed. It wasn’t until I ran the actual function on actual real life data, that it kept failing. Specifically it kept failing with unexpected fault address 0x0. Which was perhaps the most useless error code ever. The top result from a Google search is about map concurrency, which wasn’t the case.

Investigation

So I started investigating. I noted that it passed in my test cases, but failed in my real life runs, which begs the question – what was different? The first thing that was immediately apparent was the size of the slices were different. In my test cases, I had used three different slice sizes: 7, 1049, and 1299827. In case they weren’t apparent, these are all prime numbers. The reason was because my code had some manually unrolled loop logic, and slices with prime numbered elements would help test if the remainder codes were correct.

And hence I tested with several different sizes. To my frustration, they all passed. Perhaps a unstructured, random number approach wouldn’t work. I reasoned, since the AVX registers were 256 bits, that meant 4 elements, I’d try with multiples of 4s, and the +1s and -1s too. So I’d try slices of (4, 3, 5) elements, and (8, 7, 9) elements… etc.

It was here I figured out where and when the code would fail. Specifically, the code would fail on 5 and 6 elements in a slice. 0, 1, 2, 3 elements wouldn’t fail, because the code would fall into the non-AVX branches. But 5, 6 had a mix of AVX and non AVX codes. Convinced now that the number of elements was the problem, I wrote some code to find out to which extent it would fail. To my surprise, slices of 61 elements or larger failed 0% of the time. I had however, found that slices 4, 5, 6, 17, 18, 20, 21, 22 elements would fail.

And then I went for lunch.

When I came back from lunch, I quickly re-ran the tests to warm up my mind to get into the task of debugging this issue* Life Pro Tip: this is actually a very effective way to get back in the groove – simply redo the last 30 mins’ work . Only to get a different result. This time it was 10, 12, 13 elements that failed.

It was becoming very clear that it wasn’t the number of elements that was the problem. I was suspicious about my assembly writing skills then. So I tested myself, by adding one byte to each element in a slice, to see if I understood assembly. This is a minimized sample of what I wrote:

TEXT ·dumb(SB), 7, $0
    MOVQ a_data+0(FP), SI
    MOVQ a_len+8(FP), AX
loop:
    ADDQ $1, (SI)
    ADDQ $8, SI
    SUBQ $1, AX
    JGE loop
bye:
    RET

Now if you know anything about assembly and floating point numbers, you’d know the result would be a weird float. To know if my code worked, I’d have to cast each element into a uint, and then compare the before and after. Which is what I did:

before := make([]uint, len(a))
after := make([]uint, len(a))
for i := range a {
    u := uint(uintptr((unsafe.Pointer(&a[i]))))
    before = append(before, u)
}
dumb(a)
for _, v := range {
    u := *(*uint)(unsafe.Pointer(&v))
    after = append(after, u)
}
fmt.Printf("%v\n%v\n%v\n", before, after, a)

A Fortuitous Error

If you hadn’t spotted the error by now, let me draw your attention to it:

...
for i := range a {
u := uint(uintptr((unsafe.Pointer(&a[i]))))
before = append(before, u)
}
...

What I had intended to do was to write the value of the float64 as a uint into before, but instead of casting the value (a float64) as a uint, I had converted the pointer of the element to uint. So in effect, I had written the address of the elements into before. The correct one is as per the after. In a proper project, you should use math.Float32bits and math.Float64bits for what I’m doing here

I had no idea why I wrote that line * But I suspect it’s because I had been working on C code the previous day and it’s quite common to just cast pointers to some uint. Another reason why I don’t like switching languages for a project. . Dear future employers, I don’t usually make mistakes like these* Not that it really matters. I’ve been rejected by most employers anyway for things like not being able to design an algorithm to detect a palindrome in a stream (and it was a 2nd tier data science tech support position!), I seriously wonder what kind of god level skills I’d ever need ! Regardless, I was quite surprised when the answers came out as 5e-324 (that’s uint(1) cast as a float64), but the before slice printed long numbers instead of zeroes.

However, it turned out this error was the lucky break I needed to figure the problem out.

After a while of poking around (with some help of the gogogadget lib), I noticed something odd about the numbers in the before slice.

I noticed that sometimes, the numbers would end in 00000 and sometimes it doesn’t. Immediately, that rang alarms. A few years ago, when I wrote a JIT compiler for JavaScript, I had used a very neat trick called pointer tagging. It basically relied on the fact that Go pointers are byte-aligned* I eventually moved on to NaN boxing, but that’s a story for another day .

It’s been a few years now, but when I first started writing AVX-based code for the Go programs I had, I first relied on cgo and I would use memory intrinsics in cgo* It turns out that the cgo call overhead was too high for repeated calls, so I had abandoned the plans to use cgo+mmintrinsics . I recalled having to write code that allocated slices:

func mmMalloc(size int) []float32 {
    length := size * 32
    ptr := C._mm_malloc((C.size_t)(length), 32)
    hdr := reflect.SliceHeader{
        Data: uintptr(unsafe.Pointer(ptr)),
        Len: size,
        Cap: size,
    }
    goSlice := *(*[]float32)(unsafe.Pointer(&hdr))
    return goSlice
}

It was a pretty janky way of getting slices that are aligned to 32 byte boundaries. Then all the memory of it came flooding back – I had to write mmMalloc() because I had wanted to use the _mm256_add_ps intrinsic without having to call _mm256_loadu_ps* just use \_mm\_malloc instead of aligned_alloc that came in the C11 specs, it’ll save you a lot of hair pulling from compiler issues . You see, the AVX instruction VMOVAPD requires the data to be aligned to 32-byte boundaries. However, Go does not guarantee that slices will be aligned to those boundaries (I think it does align to 16-byte boundaries, as I have MOVAPD SSE instructions in my code base that has yet to fail).

The Issue

To make the issue clearer, here’s the addresses of each element of a 5-element slice, printed out in binary:

0xc82007e2d0: 00000000 00000000 00000000 11001000 00100000 00000111 11100010 11010000
0xc82007e2d8: 00000000 00000000 00000000 11001000 00100000 00000111 11100010 11011000
0xc82007e2e0: 00000000 00000000 00000000 11001000 00100000 00000111 11100010 11100000
0xc82007e2e8: 00000000 00000000 00000000 11001000 00100000 00000111 11100010 11101000
0xc82007e2f0: 00000000 00000000 00000000 11001000 00100000 00000111 11100010 11110000

By contrast, here’s the first 5 of a 64-element slice:

0xc8200c5800: 00000000 00000000 00000000 11001000 00100000 00001100 01011000 00000000
0xc8200c5808: 00000000 00000000 00000000 11001000 00100000 00001100 01011000 00001000
0xc8200c5810: 00000000 00000000 00000000 11001000 00100000 00001100 01011000 00010000
0xc8200c5818: 00000000 00000000 00000000 11001000 00100000 00001100 01011000 00011000
0xc8200c5820: 00000000 00000000 00000000 11001000 00100000 00001100 01011000 00100000

Note how the address of the first element in the 5-element slice is not 32-byte aligned, while the first element of the 64-element slice is 32-byte aligned? When VMOVAPD is executed, the 256 bits following that address (in green) is copied into the %ymm register, and the address is not aligned, the code aborts. Hence an unexpected fault address 0x0.

It would appear that allocation of larger slices (61 elements and up) would end up more likely being aligned to a 32-byte boundary. This makes sense, as smaller slices means that the runtime could just slot the slice into somewhere in the arena that has enough free space, whereas larger slices may require the runtime to request more memory from the host, and new memory are usually 32-byte aligned (in 64 bit systems anyway).

Also, Go uses a clever allocation trick – the runtime size class to manage the overhead of allocations. This is also another reason why smaller slices tend to not be 32-byte aligned.

I Have No Clue What You’re Talking About!

Feel free to skip this section if you know about memory allocation and data alignment. This section was added on the comment of a friend that I wasn’t clear.

When you write code like this:

a := make([]float64, 6)

You’re telling the computer to allocate 6 8-byte slots in memory for use. Memory in the computer is in one contiguous block, like so:

Every box represents a byte. Each float64 takes up 8 boxes. The thicker red lines are the borders for the 32nd byte. So when you tell the computer to allocate 6 float64 in a slice, sometimes, the allocation may look like this (not aligned to 32-byte boundary):

Sometimes they look like that (aligned to 32-byte boundary):

The point is the first element (   – remember each element (float64) is 8 bytes) may start on a address that is 32-byte aligned, or it may not.

Alignment 101

To oversimplify, modern computers read data 8 bytes at a time. This is called a Word. This enables faster copying than doing it one byte at a time.

Imagine if you will, you’re told to copy data in an Excel spreadsheet. You’re told the data is grouped at every 8 rows, and it just happens that you know the times table for 8s very well. If the data starts at row 1, then it’s easy to tell where the next group of data is going to start: row 9. Now, imagine, the data starts at row 23. You know your times table of 8 by heart, and 23 is not on the times table. So now you’ve gotta do the math to figure out where the next group will start. This takes extra time. A similar logic applies to CPUs.

AVX is an extension to modern CPUs that read data 32 bytes at a time. However, the processor makers understand that not all data can be properly aligned, so they provided two separate instructions: VMOVAPD and VMOVUPD. The latter is for data that are unaligned, and is a little bit slower than the aligned version (mainly because checks have to be made).

I had used VMOVAPD which is the instruction that tells the CPU that the data I want copied into the %ymm register is indeed 32-byte aligned. This enables the CPU to use a more efficient method to just copy the entire 256 bits into the register. However, the data wasn’t aligned to 32-bytes boundary, so the instruction failed.

The Solution

With this in mind, I had one of two solutions:

  1. Use the slower instruction, VMOVUPD
  2. Only allocate aligned slices for the things I want to use AVX on

Option two meant I had to change a lot of code across many libraries. I’m generally a very lazy person so I opted to go with the one that required me to do the least amount of work – use VMOVUPD. To my surprise, the benchmark doesn’t seem to be that different from using MOVAPD (well, it’s a little slower by a very very very small amount, but I can live with that).

I also chose not to write option 2 because of the same reason why I chose to use Go – it’s a modern C. If I had wanted to mess with memory allocation I’d have been better of writing pure C. And that’d be a pain in the ass. It’s silly and prone to errors because let’s face it, I’m far from a good programmer. My strengths are in statistics. Simple arithmetic makes my head hurt.

Conclusion

This is all rather anticlimactic. But the lesson learned is that Go does not automatically align slices to any boundary. Should the language strive to do that? I’m not sure what the correct answer is. In fact alignment is a Hard Problem with capital H. Afterall Issue 599 is still active.

I think for now, the issue is easily solved with the two solutions I have listed. Afterall, I suspect not many people will have to write assembly code anyway.

Speaking of assembly, here’s a strange conclusion: On the rare occasion I had to manually edit some BYTE $0x... of the file, I found a very suitable sound track that made me feel super awesome hackerish – it’s Watching with Ten Thousand Eyes, by Ramin Djawadi, from the Person of Interest soundtrack. You should try it. God Mode is also pretty awesome.

comments powered by Disqus