Fight for every byte it takesNibbling at the costs

time to read 4 min | 750 words

In my last post we implemented variable-sized encoding to be able to pack even more data into the page. We were able to achieve 40% better density because of that. This is pretty awesome, but we would still like to do better. There are two disadvantages for variable size integers:

  • They may take more space than the actual raw numbers.
  • The number of branches is high, and non-predictable.

Given that we need to encode the key and value together, let’s see if we can do better. We know that both the key and the value are 8 bytes long. Using little-endian systems, we can consider the number as a byte array.

Consider this number: 139,713,513,353 which is composed of the following bytes: [137, 7, 147, 135, 32, 0, 0, 0]. This is how it looks in memory. This means, that we only need the first 5 bytes, not the last 3 zero ones.

It turns out that there is a very simple way to compute the number of used bytes, like so:

int UsedBytes(long l)
{
return (int)(8 - Lzcnt.X64.LeadingZeroCount((ulong)l) / 8);
}
view raw UsedBytes.cs hosted with ❤ by GitHub

This translates into the following assembly:

UsedBytes|0_0(Int64)
L0000: xor eax, eax
L0002: lzcnt rax, rcx
L0007: shr rax, 3
L000b: neg eax
L000d: add eax, 8
L0010: ret
view raw UsedBytes.asm hosted with ❤ by GitHub

Which is about as tight as you can want it to be.

Of course, there is a problem. In order to read the value back, we need to store the number of bytes we used somewhere. For variable-sized integers, they use the top bit until they run out. But we cannot do that here.

Remember however, that we encode two numbers here. And the length of the number is 8 bytes. In binary, that means that we need 4 bits to encode the length of each number. This means that if we’ll take an additional byte, we can fit the length of both numbers into a single byte.

The length of the key and the value would each fit on a nibble inside that byte. Here is what the encoding step looks like now:

static unsafe uint Encode(byte* buffer, long key, long val)
{
var keyLen = (uint)(8 - Lzcnt.X64.LeadingZeroCount((ulong)key) / 8);
var valLen = (uint)(8 - Lzcnt.X64.LeadingZeroCount((ulong)val) / 8);
buffer[0] = (byte)(keyLen << 4 | valLen);
Unsafe.CopyBlock(buffer + 1, &key, keyLen);
Unsafe.CopyBlock(buffer + 1 + keyLen, &val, keyLen);
return 1 + keyLen + valLen;
}
view raw Encode.cs hosted with ❤ by GitHub

And in assembly:

Encode(ulong,long,long):uint:
push r15
push r14
push r12
push rbx
sub rsp, 24
mov qword ptr [rsp+10H], rsi
mov qword ptr [rsp+08H], rdx
mov rbx, rdi
xor edi, edi
lzcnt rdi, qword ptr [rsp+10H]
shr rdi, 3
mov r14d, edi
neg r14d
add r14d, 8
xor edi, edi
lzcnt rdi, qword ptr [rsp+08H]
shr rdi, 3
mov r15d, edi
neg r15d
add r15d, 8
mov edi, r14d
shl edi, 4
or edi, r15d
mov byte ptr [rbx], dil
lea rdi, [rbx+01H]
lea rsi, [rsp+10H]
mov r12d, r14d
mov rdx, r12
call [CORINFO_HELP_MEMCPY]
mov edi, r14d
lea rdi, [rbx+rdi+01H]
lea rsi, [rsp+08H]
mov rdx, r12
call [CORINFO_HELP_MEMCPY]
lea eax, [r14+r15+01H]
add rsp, 24
pop rbx
pop r12
pop r14
pop r15
ret
view raw Encode.asm hosted with ❤ by GitHub

Note that there are no branches at all here. Which I’m really stoked about. As for decoding, we just have to go the other way around:

Main:Decode(ulong):System.ValueTuple`2[long,long]:
push r15
push r14
push rbx
sub rsp, 16
xor eax, eax
mov qword ptr [rsp+08H], rax
mov qword ptr [rsp], rax
mov rbx, rdi
movzx r14, byte ptr [rbx]
mov r15d, r14d
sar r15d, 4
and r14d, 15
lea rdi, [rsp+08H]
lea rsi, [rbx+01H]
mov edx, r15d
call [CORINFO_HELP_MEMCPY]
lea rdi, [rsp]
mov esi, r15d
lea rsi, [rbx+rsi+01H]
mov edx, r14d
call [CORINFO_HELP_MEMCPY]
mov rax, qword ptr [rsp+08H]
mov rdx, qword ptr [rsp]
add rsp, 16
pop rbx
pop r14
pop r15
ret
view raw Decode.asm hosted with ❤ by GitHub
unsafe static (long Key, long Val) Decode(byte* buffer)
{
var keyLen = (uint)(buffer[0] >> 4);
var valLen = (uint)(buffer[0] & 0xF);
long k = 0;
long v = 0;
Unsafe.CopyBlock(&k, buffer + 1, keyLen);
Unsafe.CopyBlock(&v, buffer + 1 +keyLen, valLen);
return (k, v);
}
view raw Decode.cs hosted with ❤ by GitHub

No branches, and really predictable code.

That is all great, but what about the sizes? We are always taking 4 additional bits per number. So it is actually a single additional byte for each entry we encode. By using varint, the memory we encode numbers that are beyond the 2GB range, we’re already winning. Encoding (3,221,241,856), for example, will cost us 5 bytes (since we limit the range of each byte to 7 bits). The key advantage in our case is that if we have any case where either key or value needs to take an additional byte, we are at parity with the nibble method. If both of them need that, we are winning, since the nibble method will use a single additional byte and the variable size integer will take two (one for each number).

Now that we understand encoding and decoding, the rest of the code is basically the same. We just changed the internal format of the entry, nothing about the rest of the code changes.

And the results?

For the realistic dataset, we can fit 759 entries versus 712 for the variable integer model.

For the full dataset, we can fit 752 entries versus 710 for the variable integer model.

That is a 7% improvement in size, but it also comes with a really important benefit. Fewer branches.

This is the sort of code that runs billions of times a second. Reducing its latency has a profound impact on overall performance. One of the things that we pay attention to in high-performance code is the number of branches, because we are using super scalar CPUs, multiple instructions may execute in parallel at the chip level. A branch may cause us to stall (we have to wait until the result is known before we can execute the next instruction), so the processor will try to predict what the result of the branch would be. If this is a highly predictable branch (an error code that is almost never taken, for example), there is very little cost to that.

The variable integer code, on the other hand, is nothing but branches, and as far as the CPU is concerned, there is no way to actually predict what the result will be, so it has to wait. Branchless or well-predicted code is a key aspect of high-performance code. And this approach can have a big impact.

As a reminder, we started at 511 items, and we are now at 759. The question is, can we do more?

I’ll talk about it in the next post…

More posts in "Fight for every byte it takes" series:

  1. (01 May 2023) Decoding the entries
  2. (28 Apr 2023) Optimizing the encoding process
  3. (27 Apr 2023) Fitting 64 values in 4 bits
  4. (26 Apr 2023) Nibbling at the costs
  5. (25 Apr 2023) Variable size data
  6. (24 Apr 2023) Storing raw numbers