Double Your Linear Search Speed with This One Weird Trick

Recently at work I optimized key-lookup for Ristretto, a caching library for Go, effectively doubling the lookup speed with one trick. This post walks you through what I did. The benchmarks and code can be found here.

Overview

Ristretto is a fast concurrent cache library in Go. In the upcoming version, the cache is B-tree based. In this B-tree, the keys and values are stored together in one flat slice of []uint64. Call this slice a node. The length of a node is always an even number. The data is laid out so that even-numbered indices (starting from 0) hold keys, and odd-numbered indices hold values. We can visualize it as such:

n := node{
    0, 1,
    2, 3,
    4, 5,
    1024, 241,
}

Here, 0, 2, 4 and 1024 are keys, indexed by n[0], n[2], n[4] and n[6] respectively.

The task is to quickly find the index of a given key (or, failing that, the index of the first key that is larger than the given key). So we want to write a function with the following signature:

func Search(kvpairs []uint64, key uint64) int16

The default is a linear search function:

func Search(kvpairs []uint64, key uint64) int16 {
	for i := 0; i < len(kvpairs); i+=2 {
		if kvpairs[i] >= key {
			return int16(i/2)
		}
	}
	return int16(len(kvpairs)/2)
}
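
To make the semantics concrete, here is a small, self-contained illustration of what Search returns (using the node from earlier; the printed values are key indices, i.e. the position in the slice divided by two):

package main

import "fmt"

// Search is the linear search from above, reproduced so this example runs on its own.
func Search(kvpairs []uint64, key uint64) int16 {
	for i := 0; i < len(kvpairs); i += 2 {
		if kvpairs[i] >= key {
			return int16(i / 2)
		}
	}
	return int16(len(kvpairs) / 2)
}

func main() {
	n := []uint64{0, 1, 2, 3, 4, 5, 1024, 241}
	fmt.Println(Search(n, 4))    // 2: the key 4 lives at n[4]
	fmt.Println(Search(n, 5))    // 3: there is no key 5, so the next larger key (1024 at n[6]) is reported
	fmt.Println(Search(n, 2048)) // 4: nothing matches, so len(n)/2 is returned
}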

How could we improve this?

SIMD Part 1: SSE

One method is to use SIMD. I am no stranger to writing SIMD code in Go - in fact, a lot of Gorgonia’s functions are SIMD’d.

The main idea of SIMD is to reduce the number of instructions the program executes. If the CPU can compare multiple values at the same time, it executes fewer instructions, making things faster.

The first step is to try SSE. It's a well-established instruction set that most x86 processors have supported for the past 20 years. With SSE, we can compare two keys at the same time. To write the SSE code, I learned how to use Avo. Previously I hand-wrote my assembly code; with Avo, I found I could significantly improve my productivity in generating assembly. Michael was extremely patient in helping me, and I am very thankful for it.

SSE operations are done with special registers, called the XMM registers. There's an SSE instruction, MOVUPS, which moves 128 bits of data from memory (or another XMM register) into an XMM register. The problem, of course, is that we only want to compare every second value of the slice.

We cannot just move 128 bits of data from the slice into an XMM register. Instead, we have to copy two keys into a temporary array, then move the 128 bits of that array into an XMM register.

Avo comes with an excellent function to help you do that: AllocLocal. With it, you can reserve 2x64 bits of stack space for the values in the function (Avo handles the frame-pointer calculations for you too!). Then it's a simple case of moving the 0th value and the 2nd value of the slice into the array.
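
For context, here is roughly what such a generator looks like in Avo. This is a minimal sketch under my own naming, not the actual Ristretto generator: only the gather into the stack array is shown, and the comparison logic and the return value are elided.

//go:build ignore

package main

import (
	. "github.com/mmcloughlin/avo/build"
	. "github.com/mmcloughlin/avo/operand"
)

func main() {
	TEXT("SearchSSE", NOSPLIT, "func(kvpairs []uint64, key uint64) int16")
	ptr := Load(Param("kvpairs").Base(), GP64()) // plays the role of CX in the listing below
	idx := GP64()                                // plays the role of SI
	MOVQ(Imm(0), idx)

	packed := AllocLocal(16) // 2x64 bits of stack; Avo works out the frame size
	tmp := GP64()
	MOVQ(Mem{Base: ptr, Index: idx, Scale: 8}, tmp)
	MOVQ(tmp, packed)
	MOVQ(Mem{Base: ptr, Index: idx, Scale: 8, Disp: 16}, tmp)
	MOVQ(tmp, packed.Offset(8))

	keys := XMM()
	MOVUPS(packed, keys) // both keys now sit in one XMM register
	// ... comparisons against the search key and the return value are elided in this sketch ...
	RET()
	Generate()
}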

The code is simply this:

	// Copy 2 keys into packed
	MOVQ (CX)(SI*8), DX
	MOVQ DX, (SP)
	MOVQ 16(CX)(SI*8), DX
	MOVQ DX, 8(SP)

Let me walk you through this code, starting with the first line. CX is a CPU register; it stores the location of the slice in memory. SI is another register; it stores a number representing the index into the slice. The parentheses denote indirection. DX is a general-purpose register. MOVQ is the move instruction, telling the CPU to move 64 bits of data at once (Q is the mnemonic for quadword).

So reading the first line, MOVQ (CX)(SI*8), DX, it says: move 64 bits of data from (CX)(SI*8) to DX. What does (CX)(SI*8) mean? To really understand it, let's draw it out:

[image: SSE explanation 1]

In this drawing, I have depicted all the registers as independent squares at the top. Observe that in the CX register, the value 0xa is filled in. 0xa is where the slice starts in the drawing. Each cell in the slice is 8 bytes, and is either yellow (keys) or blue (values). Also, observe that in the SI register, 0 is the value filled in.

So (CX) means “go to the location stored in CX”. In this case, we start looking at memory location 0xa. (SI*8) means “offset from there by the value in SI, multiplied by 8 bytes”. With SI holding 0, it is very clear that this addresses the first yellow cell.

Now we want to move 64 bits of that value into DX. Then MOVQ DX, (SP) means: move the value in DX to the location stored in SP. In the drawing, the SP register is filled with the value 0xb (a hypothetical value for illustration; in real life 0xb would be literally next to 0xa), which is the memory location of the array. In effect, we move the value in the first yellow cell into the first slot of the array.

The third line says MOVQ 16(CX)(SI*8), DX. The 16 in front means “start from the 16th byte”. Referring back to the image, starting from the 16th byte lands on the yellow cell with the label “2” underneath it.

And there you have it. We have moved the 0th and the 2nd value into the array. We can then use the MOVUPS instruction to move 128 bits into the special XMM register.
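
In Go terms, those four MOVQs amount to something like this (si stands for the element index held in SI):

	var packed [2]uint64      // the 2x64-bit stack array reserved with AllocLocal
	packed[0] = kvpairs[si]   // MOVQ (CX)(SI*8), DX; MOVQ DX, (SP)
	packed[1] = kvpairs[si+2] // MOVQ 16(CX)(SI*8), DX; MOVQ DX, 8(SP)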

A similar thing is done so that the search key is duplicated into both halves of another XMM register.

With both the node keys and the desired key in XMM registers, we can now compare two keys at once. The VPCMPEQQ, VPCMPGTQ and VPADDQ instructions make short work of that.

[image: SSE explanation 2]
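
In scalar Go terms, that combination of instructions computes, for each 64-bit lane, a “greater than or equal” mask. A sketch of the idea (the real code does this for both lanes of the XMM register at once):

// gteMask models what VPCMPGTQ + VPCMPEQQ + VPADDQ compute for one lane:
// each comparison yields all ones (-1) when true and 0 when false, and their
// sum is non-zero exactly when nodeKey >= searchKey.
func gteMask(nodeKey, searchKey uint64) int64 {
	var gt, eq int64
	if nodeKey > searchKey {
		gt = -1 // VPCMPGTQ
	}
	if nodeKey == searchKey {
		eq = -1 // VPCMPEQQ
	}
	return gt + eq // VPADDQ
}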

The final bit that may confuse readers of the code are these lines:

	MOVQ       X1, DX
	PUNPCKHQDQ X1, X1
	MOVQ       X1, BX

See, MOVQ can only move 64 bits of data, but the XMM registers are 128 bits in size. So when you have 2x64 bits of data in the register, MOVQ can only move the lower 64 bits out (in the picture below, it only has access to the lane labelled 0).

[image: PUNPCKHQDQ explanation]

PUNPCKHQDQ is an instruction that moves the high 64 bits into the low 64 bits. After that, we can use MOVQ again to move the value out to BX.

SIMD Part 2: AVX

Writing SSE code was such a joy. I thought: hey, if I can write SSE code this easily, why not compare four keys at a time? So I wrote some AVX code. The main idea behind AVX is the same as SSE; only the instructions differ slightly.

Initial Results

And so, having written the key search in assembly using SIMD instructions, how were the results? I wrote a bunch of other functions to benchmark against; the base case is the linear search. The results are as follows (searching through 512 key-value pairs):

BenchmarkSearchNaive-20       	   35263	     30679 ns/op
BenchmarkSearchClever-20      	   47138	     24708 ns/op
BenchmarkSearchAVX2-20        	    9346	    127415 ns/op
BenchmarkSearchParallel-20    	     100	  11868974 ns/op
BenchmarkSearchMrjn-20        	   32065	     34810 ns/op
BenchmarkSearchSSE-20         	    5942	    192074 ns/op
BenchmarkSearchBinary-20      	   38380	     27173 ns/op
PASS

The following table explains what each benchmark function is:

Benchmark               Algorithm                                          Written In
BenchmarkSearchNaive    Linear search                                      Go
BenchmarkSearchClever   Clever version of linear search                    Go
BenchmarkSearchAVX2     AVX based search                                   Go Assembly
BenchmarkSearchMrjn     Manish's manually written assembly based search    Go Assembly
BenchmarkSearchSSE      SSE based search                                   Go Assembly
BenchmarkSearchBinary   Uses sort.Search                                   Go
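
Each of these follows the standard testing.B pattern. As a rough sketch (this goes in a _test.go file with the testing package imported; the fill pattern and the searched key are my assumptions, not necessarily the actual harness):

// BenchmarkSearchNaiveSketch is an illustrative harness, not the real one: it
// fills 512 key-value pairs and searches for a key larger than any stored key,
// which forces the full scan (the worst case).
func BenchmarkSearchNaiveSketch(b *testing.B) {
	kvpairs := make([]uint64, 1024) // 512 key-value pairs
	for i := 0; i < len(kvpairs); i += 2 {
		kvpairs[i] = uint64(i) // ascending keys at the even indices
	}
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		Search(kvpairs, uint64(len(kvpairs)))
	}
}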

But what is Clever? It’s this:

func Clever(xs []uint64, k uint64) int16 {
	if len(xs) < 8 {
		return Naive(xs, k)
	}
	var twos, pk [4]uint64
	pk[0] = k
	pk[1] = k
	pk[2] = k
	pk[3] = k
	for i := 0; i < len(xs); i += 8 {
		twos[0] = xs[i]
		twos[1] = xs[i+2]
		twos[2] = xs[i+4]
		twos[3] = xs[i+6]
		if twos[0] >= pk[0] {
			return int16(i / 2)
		}
		if twos[1] >= pk[1] {
			return int16((i + 2) / 2)
		}
		if twos[2] >= pk[2] {
			return int16((i + 4) / 2)
		}
		if twos[3] >= pk[3] {
			return int16((i + 6) / 2)
		}

	}
	return int16(len(xs) / 2)
}

I had originally written this to explain to Manish what the AVX-based code is doing. The original code looked something like this:

	// twos and pk are declared as in Clever above; result holds the per-lane comparison results.
	var result [4]uint64
	for i := 0; i < len(xs); i += 8 {
		twos[0] = xs[i]
		twos[1] = xs[i+2]
		twos[2] = xs[i+4]
		twos[3] = xs[i+6]
		for j := range twos {
			result[j] = 0
			if twos[j] > pk[j]{
				result[j] = 1
			}
		}
		for j := range twos {
			if twos[j] == pk[j]{
				result[j] += 1
			}
		}
		for j := range result{
			if result[j] >= 1 {
				return int16((i+j*2)/2)
			}
		}

	}

For the purposes of the benchmark, I altered the algorithm to perform the comparisons immediately; in effect, I was unrolling the loop by hand. Much to my surprise, it performed better than the linear search.

One Weird Trick: Loop Unrolling

And so I stumbled on that one weird trick. Now, if I could write a more optimized unrolled version in assembly, I could outperform the one written in Go. After a lot of careful calculation, here are the results:

BenchmarkSearchNaive-20       	   38415	     30614 ns/op
BenchmarkSearchClever-20      	   43548	     25139 ns/op
BenchmarkSearchAVX2-20        	    8810	    128108 ns/op
BenchmarkSearchParallel-20    	     100	  11904737 ns/op
BenchmarkSearchMrjn-20        	   32686	     34064 ns/op
BenchmarkSearchSSE-20         	    6410	    186599 ns/op
BenchmarkSearchBinary-20      	   40081	     27411 ns/op
BenchmarkSearch2-20           	   76887	     15440 ns/op*

Here are the results in a chart (the SSE, AVX and some other variants were removed because their results were not as good and we are no longer considering them):

[image: results chart]

Once the array size is 2^10 or above, it becomes more profitable to use a binary search.
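
That crossover suggests a simple hybrid, sketched below. This is my own illustration, not something in Ristretto: the 1<<10 threshold is just the point observed in this chart on my machine, Clever stands in for the unrolled search, and sort.Search is the standard library's binary search.

// SearchAdaptive is a sketch of a hybrid: the unrolled linear search for small
// nodes, and a binary search via sort.Search for large ones. The threshold is
// the crossover observed above, not a universal constant.
func SearchAdaptive(kvpairs []uint64, key uint64) int16 {
	if len(kvpairs) >= 1<<10 {
		n := len(kvpairs) / 2
		i := sort.Search(n, func(i int) bool { return kvpairs[2*i] >= key })
		return int16(i)
	}
	return Clever(kvpairs, key)
}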

Search2 is the manually written assembly based search. Perhaps the most surprising thing is how short the assembly code is: 60 lines with comments. It’s slightly longer than Manish’s manually written assembly.

That led me to think about why the SIMD code didn’t work as well and why the manually unrolled code worked so well.

SIMD Postmortem

I decided to investigate further. Using perf I was able to find out which instructions were the hottest. In the SSE version, these instructions were the hottest:

       │    // Move the packed keys into ymm; move key into pk      ▒
       │    MOVUPS (SP), X1                                         ▒
 57.56 │      movups       (%rsp),%xmm1                             ◆
       │                                                            ▒
       │    // Check GTE                                            ▒
       │    VPCMPEQQ X1, X0, X2                                     ▒
 20.50 │      vpcmpeqq     %xmm1,%xmm0,%xmm2                        ▒
       │    VPCMPGTQ X0, X1, X1                                     ▒
  0.02 │      vpcmpgtq     %xmm0,%xmm1,%xmm1                        ▒
       │    VPADDQ   X2, X1, X1                                     ▒
 10.23 │      vpaddq       %xmm2,%xmm1,%xmm1                        ▒
       │

In the AVX version, these instructions were the hottest:

       │    // Move the packed keys into ymm; move key into pk      ▒
       │    VMOVUPS (SP), Y1                                        ▒
 54.58 │      vmovups      (%rsp),%ymm1                             ◆
       │                                                            ▒
       │    // Check GTE                                            ▒
       │    VPCMPEQQ Y1, Y0, Y2                                     ▒
 23.17 │      vpcmpeqq     %ymm1,%ymm0,%ymm2                        ▒
       │    VPCMPGTQ Y0, Y1, Y1                                     ▒
  0.06 │      vpcmpgtq     %ymm0,%ymm1,%ymm1                        ▒
       │    VPADDQ   Y2, Y1, Y1                                     ▒
  2.09 │      vpaddq       %ymm2,%ymm1,%ymm1                        ▒
       │                                                            ▒
       │    // Move result out                                      ▒
       │    VMOVMSKPD Y1, DX                                        ▒
  1.53 │      vmovmskpd    %ymm1,%edx                               ▒
       │                                                            ▒
       │    // Count trailing zeroes XX                             ▒
       │    TZCNTL DX, DX                                           ▒
  6.99 │      tzcnt        %edx,%edx

Both of these profiles show that it's the movement of data from memory and the general-purpose registers into the vector registers that was the major bottleneck.

This is par for the course when using SIMD. Often it's the movement of data that costs the most; the operations themselves are usually cheap.

Also to consider is the number of cycles each instruction uses. AVX and SSE instructions do use slightly more cycles per instruction. After some careful comparisons, I came to the conclusion that, overall, the SSE and AVX versions of the code actually consume more cycles than the scalar versions.

How Much To Unroll?

I also did some experimentation on how much to unroll. It soon became clear that unrolling so that four elements are processed at once yielded the best balance between performance on powerful machines and on less powerful ones. I use a pretty powerful machine with fairly large caches, so the results would be different for different processors; I also tested the various unroll factors on an EC2 box (the general advice is not to run benchmarks on EC2, but rather on an isolated machine that is under your full control). On lower-powered machines, unrolling too many elements at a time (8 and 16) yielded worse performance. Upon consideration of all the facts, I settled on unrolling only four elements at a time.

Conclusions

In conclusion, I optimized key-lookup in Ristretto by manually unrolling the loop. I tested various SIMD options and found that the cost of moving data into the SIMD registers far outweighs the cost of the operations themselves, leading to a slower search function.

The main reason I was able to learn all of this is that I ran experiments and measured the performance. Most importantly, the experiments were targeted at the environment that is common/expected for users of Dgraph. The takeaway is: measure your performance. Do not just take advice from a random blog post; test it for yourself.

As for Ristretto, I can't say I'm particularly pleased with the results. I believe I can push the performance further. For example, I didn't get the time to try a SIMD'd branch-free binary search function. Perhaps one day in the future, when I have more time on my hands.

Addendum

Thanks to Ivan Kurnosov, who discovered a typo. This has now been fixed.
