Fight for every byte it takes: Nibbling at the costs
In my last post we implemented variable-size encoding to pack even more data into the page, achieving 40% better density as a result. That is pretty awesome, but we would still like to do better. Variable-size integers have two disadvantages:
- They may take more space than the actual raw numbers.
- The number of branches is high, and they are unpredictable.
Given that we need to encode the key and value together, let’s see if we can do better. We know that both the key and the value are 8 bytes long. On a little-endian system, we can treat each number as a byte array.
Consider the number 139,713,513,353, which is composed of the following bytes: [137, 7, 147, 135, 32, 0, 0, 0]. That is how it looks in memory. It means we only need the first 5 bytes, not the 3 trailing zero ones.
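You can verify that layout with a small C sketch (assuming a little-endian machine, as the post does):

```c
#include <stdint.h>
#include <string.h>

/* Copies the in-memory representation of a 64-bit number into a byte
   array. On a little-endian machine the low-order bytes come first,
   so trailing zeros mark the unused portion of the number. */
static void to_bytes(uint64_t n, uint8_t out[8]) {
    memcpy(out, &n, sizeof(n));
}
```

Calling `to_bytes(139713513353, …)` yields exactly [137, 7, 147, 135, 32, 0, 0, 0].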
It turns out that there is a very simple way to compute the number of used bytes, like so:
int UsedBytes(long l)
{
    return (int)(8 - Lzcnt.X64.LeadingZeroCount((ulong)l) / 8);
}
This translates into the following assembly:
UsedBytes|0_0(Int64)
   L0000: xor eax, eax
   L0002: lzcnt rax, rcx
   L0007: shr rax, 3
   L000b: neg eax
   L000d: add eax, 8
   L0010: ret
Which is about as tight as you can want it to be.
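For reference, here is a hedged C sketch of the same computation (assuming GCC/Clang builtins). One caveat: the LZCNT instruction that the C# intrinsic maps to returns 64 for a zero input, but `__builtin_clzll(0)` is undefined in C, so zero needs an explicit guard:

```c
#include <stdint.h>

/* C sketch of UsedBytes using the GCC/Clang count-leading-zeros
   builtin. Unlike LZCNT (which returns 64 for zero),
   __builtin_clzll(0) is undefined behavior, so zero is handled
   explicitly. */
static int used_bytes(uint64_t v) {
    if (v == 0)
        return 0; /* zero needs no payload bytes at all */
    return 8 - __builtin_clzll(v) / 8;
}
```

For 139,713,513,353 this returns 5, matching the byte layout shown above.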
Of course, there is a problem. In order to read the value back, we need to store the number of bytes we used somewhere. Variable-size integers use the top bit of each byte as a continuation flag until they run out of bits. We cannot do that here.
Remember, however, that we encode two numbers here, and the maximum length of each number is 8 bytes. In binary, that means we need 4 bits to encode the length of each number. So if we take one additional byte, we can fit the lengths of both numbers into a single byte.
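To make the packing concrete, here is a minimal C sketch of splitting one byte into two length nibbles (each length is in the 0..8 range, so it fits in 4 bits):

```c
#include <stdint.h>

/* Pack two 4-bit lengths (each 0..8) into one header byte:
   key length in the high nibble, value length in the low nibble. */
static uint8_t pack_lengths(uint8_t key_len, uint8_t val_len) {
    return (uint8_t)(key_len << 4 | val_len);
}

static uint8_t key_len_of(uint8_t packed) { return packed >> 4; }
static uint8_t val_len_of(uint8_t packed) { return packed & 0xF; }
```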
The length of the key and the value would each fit on a nibble inside that byte. Here is what the encoding step looks like now:
static unsafe uint Encode(byte* buffer, long key, long val)
{
    var keyLen = (uint)(8 - Lzcnt.X64.LeadingZeroCount((ulong)key) / 8);
    var valLen = (uint)(8 - Lzcnt.X64.LeadingZeroCount((ulong)val) / 8);
    buffer[0] = (byte)(keyLen << 4 | valLen);
    Unsafe.CopyBlock(buffer + 1, &key, keyLen);
    Unsafe.CopyBlock(buffer + 1 + keyLen, &val, valLen);
    return 1 + keyLen + valLen;
}
And in assembly:
Encode(ulong,long,long):uint:
       push r15
       push r14
       push r12
       push rbx
       sub rsp, 24
       mov qword ptr [rsp+10H], rsi
       mov qword ptr [rsp+08H], rdx
       mov rbx, rdi
       xor edi, edi
       lzcnt rdi, qword ptr [rsp+10H]
       shr rdi, 3
       mov r14d, edi
       neg r14d
       add r14d, 8
       xor edi, edi
       lzcnt rdi, qword ptr [rsp+08H]
       shr rdi, 3
       mov r15d, edi
       neg r15d
       add r15d, 8
       mov edi, r14d
       shl edi, 4
       or edi, r15d
       mov byte ptr [rbx], dil
       lea rdi, [rbx+01H]
       lea rsi, [rsp+10H]
       mov r12d, r14d
       mov rdx, r12
       call [CORINFO_HELP_MEMCPY]
       mov edi, r14d
       lea rdi, [rbx+rdi+01H]
       lea rsi, [rsp+08H]
       mov rdx, r12
       call [CORINFO_HELP_MEMCPY]
       lea eax, [r14+r15+01H]
       add rsp, 24
       pop rbx
       pop r12
       pop r14
       pop r15
       ret
Note that there are no branches at all here, which I’m really stoked about. As for decoding, we just have to go the other way around:
unsafe static (long Key, long Val) Decode(byte* buffer)
{
    var keyLen = (uint)(buffer[0] >> 4);
    var valLen = (uint)(buffer[0] & 0xF);
    long k = 0;
    long v = 0;
    Unsafe.CopyBlock(&k, buffer + 1, keyLen);
    Unsafe.CopyBlock(&v, buffer + 1 + keyLen, valLen);
    return (k, v);
}

And in assembly:

Main:Decode(ulong):System.ValueTuple`2[long,long]:
       push r15
       push r14
       push rbx
       sub rsp, 16
       xor eax, eax
       mov qword ptr [rsp+08H], rax
       mov qword ptr [rsp], rax
       mov rbx, rdi
       movzx r14, byte ptr [rbx]
       mov r15d, r14d
       sar r15d, 4
       and r14d, 15
       lea rdi, [rsp+08H]
       lea rsi, [rbx+01H]
       mov edx, r15d
       call [CORINFO_HELP_MEMCPY]
       lea rdi, [rsp]
       mov esi, r15d
       lea rsi, [rbx+rsi+01H]
       mov edx, r14d
       call [CORINFO_HELP_MEMCPY]
       mov rax, qword ptr [rsp+08H]
       mov rdx, qword ptr [rsp]
       add rsp, 16
       pop rbx
       pop r14
       pop r15
       ret
No branches, and really predictable code.
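The full round trip can be sketched in C as well (a hedged translation of the C# above, assuming a little-endian machine and GCC/Clang builtins; the caller must supply a buffer of at least 17 bytes, the worst case of 1 header byte plus two 8-byte numbers):

```c
#include <stdint.h>
#include <string.h>

/* Number of payload bytes a value needs; zero needs none. */
static int used_bytes(uint64_t v) {
    return v == 0 ? 0 : 8 - __builtin_clzll(v) / 8;
}

/* Writes one header byte holding both lengths, then the used bytes
   of the key followed by the used bytes of the value.
   Returns the total number of bytes written. */
static uint32_t encode(uint8_t *buf, uint64_t key, uint64_t val) {
    uint32_t key_len = used_bytes(key);
    uint32_t val_len = used_bytes(val);
    buf[0] = (uint8_t)(key_len << 4 | val_len);
    memcpy(buf + 1, &key, key_len);
    memcpy(buf + 1 + key_len, &val, val_len);
    return 1 + key_len + val_len;
}

/* Reads the two lengths back from the header nibbles, then copies
   the payload bytes into zero-initialized outputs.
   Returns the total number of bytes consumed. */
static uint32_t decode(const uint8_t *buf, uint64_t *key, uint64_t *val) {
    uint32_t key_len = buf[0] >> 4;
    uint32_t val_len = buf[0] & 0xF;
    *key = 0;
    *val = 0;
    memcpy(key, buf + 1, key_len);
    memcpy(val, buf + 1 + key_len, val_len);
    return 1 + key_len + val_len;
}
```

Encoding the example number from earlier alongside a small value takes 1 + 5 + 1 = 7 bytes, and decoding recovers both numbers exactly.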
That is all great, but what about the sizes? We always pay 4 additional bits per number, which amounts to a single additional byte for each entry we encode. With varints, the moment we encode numbers beyond the 2GB range, we’re already winning. Encoding 3,221,241,856, for example, will cost us 5 bytes (since we limit each byte to 7 bits of payload). The key advantage in our case is that whenever either the key or the value needs an additional varint byte, we are at parity with the nibble method. If both of them do, we win, since the nibble method uses a single additional byte while the variable-size integers take two (one for each number).
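The arithmetic behind that comparison can be checked with a short sketch (illustrative helpers, not code from the post):

```c
#include <stdint.h>

/* Bytes a 7-bits-per-byte varint needs for one number. */
static int varint_size(uint64_t v) {
    int n = 1;
    while (v >= 0x80) {
        v >>= 7;
        n++;
    }
    return n;
}

/* Bytes the nibble scheme needs for a key/value pair: one shared
   header byte plus the raw used bytes of each number. */
static int nibble_pair_size(uint64_t key, uint64_t val) {
    int kb = key ? 8 - __builtin_clzll(key) / 8 : 0;
    int vb = val ? 8 - __builtin_clzll(val) / 8 : 0;
    return 1 + kb + vb;
}
```

For a pair of numbers like 3,221,241,856, the varints cost 5 + 5 = 10 bytes while the nibble scheme costs 1 + 4 + 4 = 9, which is exactly the win described above.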
Now that we understand encoding and decoding, the rest of the code is basically the same. We just changed the internal format of the entry, nothing about the rest of the code changes.
And the results?
For the realistic dataset, we can fit 759 entries versus 712 for the variable integer model.
For the full dataset, we can fit 752 entries versus 710 for the variable integer model.
That is a 7% improvement in size, but it also comes with a really important benefit. Fewer branches.
This is the sort of code that runs billions of times a second. Reducing its latency has a profound impact on overall performance. One of the things we pay attention to in high-performance code is the number of branches, because on superscalar CPUs, multiple instructions may execute in parallel at the chip level. A branch may cause a stall (we have to wait until its result is known before we can execute the next instruction), so the processor will try to predict what the result of the branch will be. If it is a highly predictable branch (an error check that is almost never taken, for example), there is very little cost to that.
The variable integer code, on the other hand, is nothing but branches, and as far as the CPU is concerned, there is no way to actually predict what the result will be, so it has to wait. Branchless or well-predicted code is a key aspect of high-performance code. And this approach can have a big impact.
As a reminder, we started at 511 items, and we are now at 759. The question is, can we do more?
I’ll talk about it in the next post…
More posts in "Fight for every byte it takes" series:
- (01 May 2023) Decoding the entries
- (28 Apr 2023) Optimizing the encoding process
- (27 Apr 2023) Fitting 64 values in 4 bits
- (26 Apr 2023) Nibbling at the costs
- (25 Apr 2023) Variable size data
- (24 Apr 2023) Storing raw numbers