Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 2 min | 273 words

I just got back from watching the Incredibles 2. The previous movie was a favorite of mine from the first viewing, and it is one of the few movies that I can actually bear to watch multiple times. I was hoping for a sequel almost from the moment I finished the first movie, and it took over a decade to get it.

I actually sat down with my 3 year old daughter to watch the first movie before I went to see the second one. I’m not sure how much she got from it, although she is very fond of trains and really loved the train scene (and then kept asking where the train was). It is unusual for me to actually “prepare” to see a movie, by the way. But it did mean that I had the plot sharp in my head and could directly compare the two movies.

First, in terms of the plot: it was funny, especially since I have a kid now and could appreciate a lot more of the not so subtle digs at parenthood.

Second, in terms of visuals: wow, it improved by a lot. The original movie has held up really well visually over the past 12 years, but the new one is visibly better in this regard.

Also, at this point I nearly had a heart attack, because a talking book (again, my daughter) started neighing at me in the middle of the night just as I got into the house.

Highly recommended.

time to read 6 min | 1017 words

[Screenshot: the data corruption error reported by RavenDB]

We started to get reports from users running RavenDB on Docker that there are situations where RavenDB reports that there has been a data corruption event. You can see what this looks like on the right. As you can see, this ain’t a happy camper. In fact, this is a pretty scary one. The kind you see in movies that air on Friday the 13th.

The really strange part was that this is one of those errors that really should never be possible. RavenDB has a lot of internal checks, including for things that really aren’t supposed to happen. The idea is that it is better to be safe than sorry when dealing with your data. So we got this scary error, and we looked into it hard. This is the kind of error that gets top priority internally, because it touches the core of what we do: keeping data safe.

The really crazy part was that we couldn’t find any data loss event. It took a while until we were able to narrow it down to Docker, so we were checking a lot of stuff in the meantime. And when we finally began to suspect Docker, it got even crazier. At some point, we were able to reproduce this more or less at will. Spin up a Docker instance, write a lot of data, wait a bit, write more data, see the data corruption message. What was crazy about that was that we were able to confirm that there wasn’t any actual data corruption.

We started diving deeper into this, and it looked like we fell down a very deep crack. Eventually we figured out that you need the following scenario to reproduce this issue:

  • A Linux Docker instance.
  • Hosted on a Windows machine.
  • Using an external volume to store the data.

That led us to explore exactly how Docker does volume sharing. In a Linux / Linux or Windows / Windows setup, that is pretty easy; it basically re-routes namespaces between the host and the container. In a Linux container running on a Windows machine, the external volume is using CIFS. In other words, it is effectively running on a network drive, even if the network is machine local only.

It turned out that the reproduction is not only very specific for a particular deployment, but also for a particular I/O pattern.

The full C code reproducing this can be found here. It is a bit verbose because I handle all the errors. The redacted version, which is much more readable, is here:

This can be executed using:

And running the following command:

docker run --rm -v $PWD:/wrk gcc /wrk/setup.sh

As you can see, what we do is the following:

  • Create a file and ensure that it is pre-allocated.
  • Write to the file using O_DIRECT | O_DSYNC.
  • Read the data back using another file descriptor.

The write operations are sequential, and so are the reads; however, the read operation will read past the written area. This is key. At this point, we write again to the file, to an area that we have already read.

At this point, we attempt to re-read the data that was just written, but instead of the data, we get just zeroes. What I believe is going on is that we are hitting cached data. Note that this is all done with system calls; there is no userland cache involved.

I reported this to Docker as a bug. I actually believe that this will be the same whenever we use CIFS (a shared drive) to run this scenario.

The underlying issue is that we have a process that reads through the journal file and applies it, at the same time that transactions are writing to it. We effectively read the file until we are done, forcing the file data into the cache. The writes, which use direct I/O, are going to bypass that cache, and we have to wait for the change notification from CIFS to know that the cached data needs to be invalidated. That turns this issue into a race condition of apparent data corruption, of sorts.

The reason that we weren’t able to detect data corruption after the fact was that there was no data corruption. The data was properly written to disk; we were just misled by the operating system when we tried to read it and got stale results. The good news is that even after catching the operating system cheating on us with the I/O system, RavenDB is handling things with decorum. In other words, we immediately commit suicide on the relevant database. The server process shuts down the database, registers an alert and tries again. At this point, we rely on the fact that we are crash resistant and effectively replay everything from scratch. The good thing about this is that we do much better the second time around (likely because there is enough time to get the change event and clear the cache). And even if we don’t, we are still able to recover the next time around.

Running Linux containers on Windows is a pretty important segment for us: developers using Docker to host RavenDB, and it makes a lot of sense that they will be using external volumes. We haven’t gotten around to testing it yet, but I suspect that CIFS writes over a “normal” network might exhibit the same behavior. That isn’t actually a good configuration for a database for a lot of other reasons, but it is still something that I want to at least be able to limp along on. Even with no real data loss, an error like the one above is pretty scary and can cause a lot of hesitation and fear for users.

Therefore, we have changed the way we handle I/O in this case: we’ll avoid using the two file descriptors and hold a bit more data in memory for the duration. This gives us more control, is actually likely to give us a small perf boost, and avoids the problematic I/O pattern entirely.

time to read 1 min | 137 words

We just upgraded our stable branch to .NET Core 2.1. The process was pretty smooth overall, but we did get the following exchange in our internal Slack channel.

It went something like this:

  • is it known that import doesn't work ?
  • As you can imagine, Import is pretty important for us.
  • no
  • does it work on your machine ?
  • checking,,,
  • what's an error?
  • no error.
  • so UI is blocked?
  • (screenshot)
  • do you have any errors in dev tools console?
  • `TypeError: e is undefined`
    doesn't says to me much
  • same thing in incognito
  • export doesn't work either
  • lol the reason is: dotnet core 2.1
  • the websockets are faster and I had race in code
    will push fix shortly

There you have it, .NET Core 2.1 broke our code. Now I have to go and add Thread.Sleep somewhere…

time to read 1 min | 98 words

And now the book is another tiny step closer to actually being completed. All editing has been completed, and we did a full pass through the book. All the content is written and there isn’t much left to do at all.

We are now sending this for production work, and once that is done, I can announce this project complete. Of course, by that time, I’ll have to start writing about the new features in RavenDB 4.1, but that is a story for another day.

You can get the updated bits here; as usual, I would really appreciate any feedback.

time to read 3 min | 523 words

RavenDB’s subscriptions give you the ability to run batch processing easily and robustly. In other words, you specify a query and subscribe to its results. RavenDB will send you all the documents matching the query. So far, that is pretty obvious, but what is important with subscriptions is the fact that they will keep sending you results. As long as your subscription is open, you’ll get any changed document that matches your query. That gives you a great way to implement event pipelines and batch processes, and in general opens up some interesting options.

In this case, I want to talk about handling failures with subscriptions. Not failures in the sense of a server going down or a client crashing. These are already handled by the subscription mechanism itself. A server going down will cause the cluster to change the ownership of the subscription, and your client code will not even notice. A client going down can either fail over to another client or, upon restart, pick up right from where it left off. No, this is handled.

What requires attention is what happens if there is an error during the processing of a batch of documents. Imagine that we want to do some background processing. We could do that in many ways, such as introducing a queuing system and a task queue, but in many cases the overhead of that is quite high. A simpler approach is to just write the tasks out as documents and use a subscription to process them. In this case, let’s imagine that we want to send emails. A subscription will run over the EmailToSend collection, doing whatever processing is required to actually send each one. Once we are done processing a batch, we’ll delete all the items that we processed. Whenever there are new emails to send, the subscription will get them for us immediately.

But what happens if there is a failure to send one particular email in a batch? Well, we can ignore this (and not delete the document), but that will require some admin involvement to resolve. Subscriptions will not revisit documents that they have already seen, except if those documents were changed. Here is one way to handle this scenario:
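A sketch of the pattern, using the RavenDB 4.x subscription worker API (EmailToSend and SendEmail are stand-ins for your actual task document and mail sending code, and the subscription named “EmailsToSend” is assumed to already be defined over the EmailToSend collection):

using System;
using System.Threading.Tasks;
using Raven.Client.Documents;

public class EmailProcessor
{
    public static async Task ProcessEmails(IDocumentStore store)
    {
        var worker = store.Subscriptions.GetSubscriptionWorker<EmailToSend>("EmailsToSend");
        await worker.Run(batch =>
        {
            using (var session = batch.OpenSession())
            {
                foreach (var item in batch.Items)
                {
                    try
                    {
                        SendEmail(item.Result);      // may throw on a transient failure
                        session.Delete(item.Id);     // processed successfully, remove the task
                    }
                    catch (Exception)
                    {
                        var metadata = session.Advanced.GetMetadataFor(item.Result);
                        int retries = metadata.TryGetValue("Retries", out object value)
                            ? Convert.ToInt32(value)
                            : 0;
                        if (retries < 5)
                        {
                            // modifying the document puts it back into the subscription,
                            // so it will show up again in a later batch
                            metadata["Retries"] = retries + 1;
                        }
                        // after 5 retries we leave the document alone for an admin to look at
                    }
                }
                session.SaveChanges();
            }
        });
    }

    private static void SendEmail(EmailToSend email)
    {
        // actually talk to the mail server here
    }
}

public class EmailToSend
{
    public string Id { get; set; }
    public string To { get; set; }
    public string Subject { get; set; }
    public string Body { get; set; }
}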

In short, we’ll try to process each document, sending the email, etc. If we fail to do so, we won’t delete the document; instead, we’ll patch it to increment a Retries property in the metadata. This operation has two interesting effects. First, it means that we can keep track of how often we retried a particular document. But as a side effect of modifying the document, we’ll get it back in the subscription again. In other words, this piece of code will give a document 5 retries before it gives up.

As an admin, you can then peek into your database, see all the documents that have exceeded the allowed retries and decide what to do with them. But anything that failed because of some transient issue will just work on a later retry.

time to read 3 min | 589 words

The previous post has a code sample in it that was figuratively* physically painful for me to write. Setting aside the number of syscalls that are invoked, the code isn’t all that efficient as I now measure things: it uses way too much managed memory and it is subject to failures as we increase the amount of data we push through. For this post, I’m going to be rewriting the CodexWriter class as I would for code that is going into RavenDB.

* I literally know what literally used to mean, amazing.

I’m sorry, but there is going to be a big jump in the complexity of the code, because I’m going to try to handle performance, parallelism and resource utilization all at once. The first thing to do is to go into the project’s settings and enable both unsafe code (without which it is nearly impossible to write high performance code) and C# 7.3 features; we’ll need both.

We can divide the task of gathering the inputs into several stages. First, we need to write the data to the file. This is similar to the way we did it before; here is the Add() method:

As you can see, there isn’t really much that changed here, but we have this notion of a segment, which is created every million keys. But what is this segment?

It is a way to refer to a specific section of records in the file. In particular, it has just one primary role: it exists to sort the records. Let’s take a look at the code:
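In rough outline, a segment looks something like this (a trimmed down sketch rather than the full implementation, with the record entries simplified to position and length pairs):

using System;
using System.Collections.Generic;
using System.IO.MemoryMappedFiles;
using System.Threading.Tasks;

// A sketch of a segment: it records the (position, length) of every key written
// while it was the active segment, and sorts those entries by the key bytes they
// point to, reading them straight out of a memory mapped view of the data file.
public unsafe class Segment : IComparer<(long Position, int Length)>
{
    private readonly byte* _basePtr;   // start of the mapped data file
    private readonly List<(long Position, int Length)> _entries = new List<(long Position, int Length)>();
    private Task _sortTask;

    public Segment(MemoryMappedViewAccessor accessor)
    {
        byte* ptr = null;
        // ReleasePointer is omitted here for brevity
        accessor.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr);
        _basePtr = ptr;
    }

    public void Add(long position, int length) => _entries.Add((position, length));

    // Kick off the sort on a background task, so it overlaps with writing the
    // next segment's data to the main file.
    public void EnsureSorted()
    {
        if (_sortTask == null)
            _sortTask = Task.Run(() => _entries.Sort(this));
    }

    public Task Sorted => _sortTask;

    // Compare two entries by the actual key bytes in the mapped file.
    public int Compare((long Position, int Length) x, (long Position, int Length) y)
    {
        var a = new ReadOnlySpan<byte>(_basePtr + x.Position, x.Length);
        var b = new ReadOnlySpan<byte>(_basePtr + y.Position, y.Length);
        return a.SequenceCompareTo(b);
    }
}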

There are a few key points. Instead of using file I/O directly, we are using memory mapped files. Why is that? Because, as we have seen, the cost of syscalls is non trivial in the extreme, and using memory mapped files means that we can access the data natively, without having to pay any price beyond a page fault if the data isn’t already in memory.

The EnsureSorted() method is also interesting: it spawns a new task to sort the entries inside the segment, in parallel with inserting data into the main file. The actual comparison logic is handled in the Compare() methods.

As we write the data into the codex, we sort it as we go, but what happens at the end? In this case, we inserted about 13 million items, so we have 13 segments that are each individually sorted. To get the final sort, we basically merge from all of them. Here is the relevant code:
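The technique itself can be sketched on its own (here merging sorted lists of strings rather than the codex’s on disk records, so this is an illustration of the approach, not the actual code):

using System.Collections.Generic;

public static class KWayMerge
{
    // Merge several individually sorted lists by using a SortedSet as a heap:
    // each entry is (current value, which input it came from, index into that input),
    // so Min is always the smallest value across all of the inputs.
    public static IEnumerable<string> Merge(IReadOnlyList<IReadOnlyList<string>> sortedInputs)
    {
        var heap = new SortedSet<(string Value, int Input, int Index)>();
        for (int i = 0; i < sortedInputs.Count; i++)
        {
            if (sortedInputs[i].Count > 0)
                heap.Add((sortedInputs[i][0], i, 0));
        }

        while (heap.Count > 0)
        {
            var min = heap.Min;        // smallest current value across all inputs
            heap.Remove(min);
            yield return min.Value;

            int next = min.Index + 1;
            if (next < sortedInputs[min.Input].Count)
                heap.Add((sortedInputs[min.Input][next], min.Input, next));
        }
    }
}

One thing to note: SortedSet rejects duplicate entries, which is why the input index is part of the tuple; two inputs holding the same string remain distinct entries in the set.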

This uses a SortedSet as a heap, to always get the minimum value from the sorted inner values in the set. Note that we need to wait for the parallel sorts to complete, then merge from all of them into the final result. We can write the result of the sort directly to the end of the file.

Overall, this process takes 59.01 seconds to complete. Remember that this is when we are pushing unsorted data through. If we pass the data in already sorted, we get a significant improvement and it takes only 35.91 seconds.

To compare, I ran the same sort of test on Voron, and I got 59.15 seconds for the unsorted case and 13.85 seconds for the sorted case. This is with Voron also doing ACID writes, which we obviously don’t do in Codex.

I guess that spending four to five years with a whole team doing performance optimization is a better way to get storage performance than a couple of evenings hacking before I go to bed, who knew?

time to read 3 min | 553 words

We are exploring a few data structures for a particular feature in RavenDB, and I ran into something that is elegant, simple and easy, yet deep enough that we can discuss serious implementation details without getting too bogged down.

The idea is that I’m going to be using this series of blog posts to give a detailed walkthrough of building a key/value store from scratch, including all the intermediate steps and wrong turns along the way. In other words, this is a “show your work” kind of series. The end result is going to be a key/value store that can:

  • Store arbitrary keys / values.
  • Get a value by its key.
  • Support range queries and iteration.
  • Support some form of ACID.

In this case, I’m going to start from the very basics and build up. The challenge we are going to deal with is ingesting all the titles of articles in Wikipedia, about 277MB of them. I took them from here: https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz. There are 13,755,095 of them in my case.

I’m calling the KV store that we’ll be creating Codex. And I’m going to start from the most basic of examples: just being able to store values and check whether a value exists in the store. Here is the code that reads from the titles list and adds them to the store. Note that the titles come sorted, but we don’t want the advantage of adding sorted data, so we randomize things.

The question here is: how are we going to store these titles in a way that allows fast retrieval? Here is the idea: we are going to write the strings to the output file as they come, and also record their positions. When we are done inserting strings into the codex, we’ll run a sort based on the positions, and that will give us an array of offsets to the strings in the file, sorted by their value. The first version of this code looks like this:
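In sketch form, it is something like the following (a simplified stand-in for the actual sample, but the important part, the seek and read inside the sort comparison, is the same):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// A sketch of the naive approach: append each string to the data file and remember
// where it starts, then sort the list of positions by seeking back into the file
// and reading the strings during every single comparison.
public class CodexWriter : IDisposable
{
    private readonly FileStream _file;
    private readonly BinaryWriter _writer;
    private readonly List<long> _positions = new List<long>();

    public CodexWriter(string path)
    {
        _file = new FileStream(path, FileMode.Create, FileAccess.ReadWrite);
        _writer = new BinaryWriter(_file, Encoding.UTF8);
    }

    public void Add(string key)
    {
        _positions.Add(_file.Position);   // remember where this key starts
        _writer.Write(key);               // length prefixed UTF8 string
    }

    public void Complete()
    {
        _writer.Flush();
        // every comparison calls ReadFrom twice: one seek plus one read per call
        _positions.Sort((x, y) => string.CompareOrdinal(ReadFrom(x), ReadFrom(y)));
        // ... then write the sorted positions out as the index ...
    }

    private string ReadFrom(long position)
    {
        _file.Position = position;                       // becomes a SetFilePointer call
        using (var reader = new BinaryReader(_file, Encoding.UTF8, leaveOpen: true))
            return reader.ReadString();                  // becomes a ReadFile call
    }

    public void Dispose()
    {
        _writer.Dispose();
    }
}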

If you run this code on the Wikipedia titles, you’ll find that it takes a while. On my machine, it took just under 50 minutes.

Well, we are dealing with the full set of Wikipedia titles, but even so, that doesn’t sound like it should take this long. What gives?

Let’s analyze what is going on here, okay? If you run this code, you’ll note that it isn’t using much CPU or I/O, nor does it really seem to be doing much of anything. What is going on?

The key is in the ReadFrom method. There, we do two seemingly innocent actions. We set the file’s position (which translates to a SetFilePointer call) and read a short string (which translates to a ReadFile call). Now, why is that expensive? Well, the ReadFrom method is called twice for every comparison during the sort. In this case, it means that ReadFrom will be called a total of 575,616,878 times.

That is not a typo. And each invocation means two separate system calls. In other words, this innocent seeming piece of code executes over 1.15 billion system calls.

For reference, simply by reading the entire file into a MemoryStream and keeping everything else the same, I was able to bring the cost of this operation down to under 3 minutes.
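In terms of the sketch above, the change amounts to copying the file into memory once, right before the sort, and pointing ReadFrom at that copy (again, just a sketch of the idea):

// Fill this once before sorting: _file.Position = 0; _file.CopyTo(_memory);
private readonly MemoryStream _memory = new MemoryStream();

private string ReadFrom(long position)
{
    _memory.Position = position;   // just updates a field now, no syscall
    using (var reader = new BinaryReader(_memory, Encoding.UTF8, leaveOpen: true))
        return reader.ReadString();
}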

Lesson learned: system calls are expensive; let’s try to reduce them as much as we can.

time to read 1 min | 170 words

In optimizing the code from Maybe.NET, I’m not actually making any structural changes; I’m merely going over the code and fixing idiomatic C# code that has unacceptable performance behavior, from my point of view.

For example, let’s take this code, the code of the Bloom Filter algorithm:

[Screenshot: the original Bloom Filter code from Maybe.NET]

This will allocate a state object for the closure, and we’ll have a delegate invocation cost to pay. We also saw the other costs that are paid here in the previous post (since this calls into the MurmurHash3 implementation), so I’ll ignore those.

I’m going to use Memory&lt;byte&gt; as the underlying data structure, because that gives me a nice abstraction over the memory. I’m using Memory&lt;byte&gt; instead of Span&lt;byte&gt; because the Bloom Filter is going to live on the heap, not just on the stack, so we can’t use Span. Here is the full code:
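The overall shape is roughly as follows (a self contained sketch rather than the actual code; FNV-1a stands in for the MurmurHash3 based hashing so the example compiles on its own):

using System;

// A sketch of a bloom filter that keeps its bits in Memory<byte>, so the filter
// itself can live on the heap, while the hot path works over Span<byte> and
// performs no allocations.
public class BloomFilter
{
    private readonly Memory<byte> _bits;
    private readonly int _numberOfHashes;

    public BloomFilter(int sizeInBytes, int numberOfHashes)
    {
        _bits = new byte[sizeInBytes];
        _numberOfHashes = numberOfHashes;
    }

    public void Add(ReadOnlySpan<byte> key)
    {
        Span<byte> bits = _bits.Span;
        int maxBits = bits.Length * 8;
        for (int i = 0; i < _numberOfHashes; i++)
        {
            int bit = (int)(Hash(key, (uint)i) % (uint)maxBits);
            bits[bit / 8] |= (byte)(1 << (bit % 8));
        }
    }

    public bool MayContain(ReadOnlySpan<byte> key)
    {
        Span<byte> bits = _bits.Span;
        int maxBits = bits.Length * 8;
        for (int i = 0; i < _numberOfHashes; i++)
        {
            int bit = (int)(Hash(key, (uint)i) % (uint)maxBits);
            if ((bits[bit / 8] & (byte)(1 << (bit % 8))) == 0)
                return false;   // definitely not in the set
        }
        return true;            // possibly in the set
    }

    // FNV-1a, used here only to keep the sketch self contained; the real code
    // would use the MurmurHash3 implementation discussed in the previous post.
    private static uint Hash(ReadOnlySpan<byte> key, uint seed)
    {
        uint hash = 2166136261u ^ seed;
        foreach (byte b in key)
        {
            hash ^= b;
            hash *= 16777619u;
        }
        return hash;
    }
}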

Note that there are no allocations at all during the critical operations.

time to read 3 min | 512 words

I needed to use Bloom Filters, and I didn’t want to use the implementation we already have in RavenDB. That one is too tied up in our infrastructure to be easily reused. So I found the Maybe.NET project. It is nice, but it doesn’t have a CoreCLR NuGet package. This meant that I had to go into the code and look at what is going on. And I started going, “nope, that isn’t the way I want it done”.

Now, to be clear, the code in the project is great. It is clear, obvious and idiomatic C#. It also raises every red flag for inefficient code that I have built up over the past few years of making RavenDB faster. Because this is such a small sample, I thought it would make a good blog post: I can explain what the code is doing, what changes I’m making, and why. Before I can get to the Bloom Filter implementation, I need the hash function, and that was just a full stop for me. Let me show you what I mean. The key parts are below, and you can find the full code here.

[Screenshot: the original hash method, with each allocation underlined in red]

This is a hash method; it produces several hashes for the purposes of the bloom filter. I underlined in red every place this code allocates. As you can see, it allocates a lot.

There is also the fact that this accepts a generic object as a parameter and serializes it to a byte[]. I’m ignoring the allocations in that part of the code, but I’m assuming that they are significant. So let’s talk about how we can optimize this function.

Well, to start with, I’m going to decide that accepting an object is too high level. This is a hash function; the caller should give us bytes. Now, let’s see what impact that has on us, shall we?
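The new shape is roughly the following (a sketch, not the project’s exact code): the caller hands us the raw bytes and a destination for the hashes, so nothing needs to be allocated here. The Hash function it calls is sketched a bit further down.

public static void ComputeHashes(ReadOnlySpan<byte> key, Span<uint> hashes, uint maxHashValue)
{
    for (int i = 0; i < hashes.Length; i++)
    {
        // a different seed per hash, derived from the index
        hashes[i] = Hash(key, (uint)i) % maxHashValue;
    }
}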

Now this is much better. We don’t allocate anything in the ComputeHashes method and we give the compiler the chance to build really efficient code here. We can probably require that the maxHashValue be a power of two and avoid the mod operation in favor of bit shifting, but I’m not writing RavenDB here and worrying about every little thing. I’ll leave that part as an exercise for the reader.

Now, let’s look at the actual Hash function, shall we?
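What follows is a sketch of a MurmurHash3 style (x86, 32 bit) hash along those lines; it isn’t necessarily identical to the version in the post, but it shows the structure: pin the span, consume whole 4 byte blocks, then deal with the tail.

public static unsafe uint Hash(ReadOnlySpan<byte> key, uint seed)
{
    const uint c1 = 0xcc9e2d51;
    const uint c2 = 0x1b873593;
    uint h = seed;
    int len = key.Length;

    fixed (byte* start = key)
    {
        byte* ptr = start;
        byte* end = start + (len & ~3);   // end of the whole 4 byte blocks

        // body: mix the key four bytes at a time
        while (ptr < end)
        {
            uint k = *(uint*)ptr;
            ptr += 4;

            k *= c1;
            k = (k << 15) | (k >> 17);
            k *= c2;

            h ^= k;
            h = (h << 13) | (h >> 19);
            h = h * 5 + 0xe6546b64;
        }

        // tail: handle the remaining 1 to 3 bytes
        uint tail = 0;
        switch (len & 3)
        {
            case 3: tail ^= (uint)ptr[2] << 16; goto case 2;
            case 2: tail ^= (uint)ptr[1] << 8; goto case 1;
            case 1:
                tail ^= ptr[0];
                tail *= c1;
                tail = (tail << 15) | (tail >> 17);
                tail *= c2;
                h ^= tail;
                break;
        }
    }

    // final avalanche
    h ^= (uint)len;
    h ^= h >> 16;
    h *= 0x85ebca6b;
    h ^= h >> 13;
    h *= 0xc2b2ae35;
    h ^= h >> 16;
    return h;
}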

There is quite a bit going on, but essentially, I’m using fixed to get the pointer from the span, then computing the hash 4 bytes at a time, then handling the remainder. There are no allocations, and this has far fewer instructions that actually need to run. This would be a great place to stop and run unit tests to verify that I didn’t break something, but I’m going to assume that I got it right and close this post; I still want to talk about the optimizations that are available for the bloom filter.
