time to read 26 min | 5029 words

One of the more interesting developments in terms of kernel API surface is the IO Ring. On Linux, it is called IO Uring, and Windows copied it shortly afterward. The idea started as a way to batch multiple IO operations at once, but it has evolved into a generic mechanism to make system calls more cheaply. On Linux, a large portion of the kernel's features is exposed through the IO Uring API, while Windows exposes a far less rich API (basically, just reading and writing).

The reason this matters is that you can use IO Ring to reduce the cost of making system calls, using both batching and asynchronous programming. As such, most new database engines have jumped on that sweet nectar of better performance results.

As part of the overall re-architecture of how Voron manages writes, we have done the same. I/O for Voron is typically composed of writes to the journals and to the data file, so that makes it a really good fit, sort of.

An ironic aspect of IO Uring is that despite it being an asynchronous mechanism, it is inherently single-threaded. There are good reasons for that, of course, but that means that if you want to use the IO Ring API in a multi-threaded environment, you need to take that into account.

A common way to handle that is to use an event-driven system, where all the actual calls are generated from a single “event loop” thread or similar. This is how the Node.js API works, and how .NET itself manages IO for sockets (there is a single thread that listens to socket events by default).

The whole point of IO Ring is that you can submit multiple operations for the kernel to run in as optimal a manner as possible. Here is one such case to consider. This is the part of the code where we write the modified pages to the data file:


using (fileHandle)
{
    for (int i = 0; i < pages.Length; i++)
    {
        int numberOfPages = pages[i].GetNumberOfPages();

        var size = numberOfPages * Constants.Storage.PageSize;
        var offset = pages[i].PageNumber * Constants.Storage.PageSize;
        var span = new Span<byte>(pages[i].Pointer, size);
        // one system call per (possibly multi-page) write
        RandomAccess.Write(fileHandle, span, offset);

        written += size;
    }
}

If you look at the threads of the process while this code runs under load, you'll see something like this:
PID     LWP TTY          TIME CMD
  22334   22345 pts/0    00:00:00 iou-wrk-22343
  22334   22346 pts/0    00:00:00 iou-wrk-22343
  22334   22347 pts/0    00:00:00 iou-wrk-22334
  22334   22348 pts/0    00:00:00 iou-wrk-22334
  22334   22349 pts/0    00:00:00 iou-wrk-22334
  22334   22350 pts/0    00:00:00 iou-wrk-22334
  22334   22351 pts/0    00:00:00 iou-wrk-22334
  22334   22352 pts/0    00:00:00 iou-wrk-22334
  22334   22353 pts/0    00:00:00 iou-wrk-22334
  22334   22354 pts/0    00:00:00 iou-wrk-22334
  22334   22355 pts/0    00:00:00 iou-wrk-22334
  22334   22356 pts/0    00:00:00 iou-wrk-22334
  22334   22357 pts/0    00:00:00 iou-wrk-22334
  22334   22358 pts/0    00:00:00 iou-wrk-22334

Actually, those aren’t threads in the normal sense. Those are kernel tasks, generated by the IO Ring at the kernel level directly. It turns out that internally, IO Ring may spawn worker threads to do the async work at the kernel level. When we had a separate IO Ring per file, each one of them had its own pool of threads to do the work.

The way it usually works is really interesting. The IO Ring will attempt to complete the operation in a synchronous manner. For example, if you are doing buffered writes to a file, the kernel can just copy the buffer into the page cache and move on; no actual I/O takes place. So the IO Ring will run through that directly, in a synchronous manner.

However, if your operation requires actual blocking, it will be sent to a worker queue to actually execute it in the background. This is one way that the IO Ring is able to complete many operations so much more efficiently than the alternatives.

In our scenario, we have a pretty simple setup, we want to write to the file, making fully buffered writes. At the very least, being able to push all the writes to the OS in one shot (versus many separate system calls) is going to reduce our overhead. More interesting, however, is that eventually, the OS will want to start writing to the disk, so if we write a lot of data, some of the requests will be blocked. At that point, the IO Ring will switch them to a worker thread and continue executing.

The problem we had was that when we had a separate IO Ring per data file and put a lot of load on the system, we started seeing contention between the worker threads across all the files. Basically, each ring had its own separate pool, so there was a lot of work for each pool but no sharing.

If the IO Ring is single-threaded, but many separate threads lead to wasted resources, what can we do? The answer is simple, we’ll use a single global IO Ring and manage the threading concerns directly.

Here is the code for the worker thread that services that single global ring (I removed all error handling to make it clearer):


void *do_ring_work(void *arg)
{
  int rc;
  if (g_cfg.low_priority_io)
  {
    syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, 
        IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 7));
  }
  pthread_setname_np(pthread_self(), "Rvn.Ring.Wrkr");
  struct io_uring *ring = &g_worker.ring;
  struct workitem *work = NULL;
  while (true)
  {
    do
    {
      // wait for any writes on the eventfd 
      // completion on the ring (associated with the eventfd)
      eventfd_t v;
      rc = read(g_worker.eventfd, &v, sizeof(eventfd_t));
    } while (rc < 0 && errno == EINTR);
    
    bool has_work = true;
    while (has_work)
    {
      int must_wait = 0;
      has_work = false;
      if (!work) 
      {
        // we may have _previous_ work to run through
        work = atomic_exchange(&g_worker.head, 0);
      }
      while (work)
      {
        has_work = true;


        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        if (sqe == NULL)
        {
          must_wait = 1;
          goto submit_and_wait; // will retry
        }
        io_uring_sqe_set_data(sqe, work);
        switch (work->type)
        {
        case workitem_fsync:
          io_uring_prep_fsync(sqe, work->filefd, IORING_FSYNC_DATASYNC);
          break;
        case workitem_write:
          io_uring_prep_writev(sqe, work->filefd, work->op.write.iovecs,
                               work->op.write.iovecs_count, work->offset);
          break;
        default:
          break;
        }
        work = work->next;
      }
    submit_and_wait:
      rc = must_wait ? 
        io_uring_submit_and_wait(ring, must_wait) : 
        io_uring_submit(ring);
      struct io_uring_cqe *cqe;
      uint32_t head = 0;
      uint32_t i = 0;


      io_uring_for_each_cqe(ring, head, cqe)
      {
        i++;
        // force another run of the inner loop, 
        // to ensure that we call io_uring_submit again
        has_work = true; 
        struct workitem *cur = io_uring_cqe_get_data(cqe);
        if (!cur)
        {
          // can be null if it is:
          // *  a notification about eventfd write
          continue;
        }
        switch (cur->type)
        {
        case workitem_fsync:
          notify_work_completed(ring, cur);
          break;
        case workitem_write:
          if (/* partial write */)
          {
            // queue again
            continue;
          }
          notify_work_completed(ring, cur);
          break;
        }
      }
      io_uring_cq_advance(ring, i);
    }
  }
  return 0;
}

What does this code do?

We start by checking if we want to use lower-priority I/O. We can do that because we don't actually care how long those operations take; the point of writing the data to the disk is simply that it gets there eventually. RavenDB has two types of writes:

  • Journal writes (durable update to the write-ahead log, required to complete a transaction).
  • Data flush / Data sync (background updates to the data file, currently buffered in memory, no user is waiting for that)

As such, we are fine with explicitly prioritizing the journal writes (which users are waiting for) ahead of all other operations.

What is this C code? I thought RavenDB was written in C#

RavenDB is written in C#, but for very low-level system details, we found that it is far easier to write a Platform Abstraction Layer to hide system-specific concerns from the rest of the code. That way, we can simply submit the data to write and have the abstraction layer take care of all of that for us. This also ensures that we amortize the cost of PInvoke calls across many operations by submitting a big batch to the C code at once.

After setting the IO priority, we start reading from what is effectively a thread-safe queue. We wait for eventfd() to signal that there is work to do, and then we grab the head of the queue and start running.
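
The snippet above refers to a couple of globals (g_worker and g_cfg) that aren't shown in this post. To make it easier to follow, here is a minimal sketch of what that shared state and its one-time setup could look like. The names are taken from the snippet; the queue depth, the eventfd registration, and the start_io_worker() function itself are assumptions, not RavenDB's actual code:


#include <liburing.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <sys/eventfd.h>

struct workitem; // a single queued operation (sketched further down)

struct worker_state
{
    struct io_uring ring;            // the single, global IO Ring
    int eventfd;                     // submitters write here to wake the worker
    _Atomic(struct workitem *) head; // lock-free submission list
};
static struct worker_state g_worker;

struct config { bool low_priority_io; };
static struct config g_cfg;

void *do_ring_work(void *arg); // the worker loop shown above

// one-time setup: create the ring + eventfd and start the worker thread
static void start_io_worker(void)
{
    io_uring_queue_init(256, &g_worker.ring, 0); // queue depth is an assumption
    g_worker.eventfd = eventfd(0, 0);
    // registering the eventfd means ring completions also wake the read() above
    io_uring_register_eventfd(&g_worker.ring, g_worker.eventfd);
    pthread_t thread;
    pthread_create(&thread, NULL, do_ring_work, NULL);
}
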

The idea is that we fetch items from the queue, then we write those operations to the IO Ring as fast as we can manage. The IO Ring size is limited, however. So we need to handle the case where we have more work for the IO Ring than it can accept. When that happens, we will go to the submit_and_wait label and wait for something to complete.

Note that there is some logic there to handle what is going on when the IO Ring is full. We submit all the work in the ring, wait for an operation to complete, and in the next run, we’ll continue processing from where we left off.

The rest of the code processes the completed operations and reports the result back to their origin via notify_work_completed(), a function I find absolutely hilarious (more on why below). Before we get to that, here is the function that the managed side calls to submit a batch of page writes:


int32_t rvn_write_io_ring(
    void *handle,
    struct page_to_write *buffers,
    int32_t count,
    int32_t *detailed_error_code)
{
    int32_t rc = SUCCESS;
    struct handle *handle_ptr = handle;
    if (count == 0)
        return SUCCESS;


    if (pthread_mutex_lock(&handle_ptr->global_state->writes_arena.lock))
    {
        *detailed_error_code = errno;
        return FAIL_MUTEX_LOCK;
    }
    size_t max_req_size = (size_t)count * 
                      (sizeof(struct iovec) + sizeof(struct workitem));
    if (handle_ptr->global_state->writes_arena.arena_size < max_req_size)
    {
        // allocate arena space
    }
    void *buf = handle_ptr->global_state->writes_arena.arena;
    struct workitem *prev = NULL;
    int eventfd = handle_ptr->global_state->writes_arena.eventfd;
    for (int32_t curIdx = 0; curIdx < count; curIdx++)
    {
        int64_t offset = buffers[curIdx].page_num * VORON_PAGE_SIZE;
        int64_t size = (int64_t)buffers[curIdx].count_of_pages *
                       VORON_PAGE_SIZE;
        int64_t after = offset + size;


        struct workitem *work = buf;
        *work = (struct workitem){
            .op.write.iovecs_count = 1,
            .op.write.iovecs = buf + sizeof(struct workitem),
            .completed = 0,
            .type = workitem_write,
            .filefd = handle_ptr->file_fd,
            .offset = offset,
            .errored = false,
            .result = 0,
            .prev = prev,
            .notifyfd = eventfd,
        };
        prev = work;
        work->op.write.iovecs[0] = (struct iovec){
            .iov_len = size, 
            .iov_base = buffers[curIdx].ptr
        };
        buf += sizeof(struct workitem) + sizeof(struct iovec);


        for (size_t nextIndex = curIdx + 1; 
            nextIndex < count && work->op.write.iovecs_count < IOV_MAX; 
            nextIndex++)
        {
            int64_t dest = buffers[nextIndex].page_num * VORON_PAGE_SIZE;
            if (after != dest)
                break;


            size = (int64_t)buffers[nextIndex].count_of_pages *
                              VORON_PAGE_SIZE;
            after = dest + size;
            work->op.write.iovecs[work->op.write.iovecs_count++] = 
                (struct iovec){
                .iov_base = buffers[nextIndex].ptr,
                .iov_len = size,
            };
            curIdx++;
            buf += sizeof(struct iovec);
        }
        queue_work(work);
    }
    rc = wait_for_work_completion(handle_ptr, prev, eventfd, detailed_error_code);
    pthread_mutex_unlock(&handle_ptr->global_state->writes_arena.lock);
    return rc;
}

Remember that when we submit writes to the data file, we must wait until they are all done. The async nature of IO Ring is meant to help us push the writes to the OS as soon as possible, as well as push writes to multiple separate files at once. For that reason, we use another eventfd() to wait (as the submitter) for the IO Ring to complete the operation. I love this part because notify_work_completed() uses the IO Ring itself to signal that eventfd, saving us an actual system call in most cases.
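
That function isn't shown in this post, so here is a minimal sketch of what such a notify_work_completed() could look like. This is a reconstruction based on the description above, not the actual RavenDB implementation:


#include <liburing.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/eventfd.h>

// signal the submitter's eventfd by queuing the write on the ring itself,
// instead of issuing a separate write() system call right away
static void notify_work_completed(struct io_uring *ring, struct workitem *cur)
{
    static const uint64_t one = 1; // eventfd writes are always 8 bytes

    atomic_store(&cur->completed, 1); // mark done before waking the submitter

    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (sqe == NULL)
    {
        // the ring is full - fall back to a plain syscall so the submitter still wakes
        eventfd_write(cur->notifyfd, 1);
        return;
    }
    io_uring_prep_write(sqe, cur->notifyfd, &one, sizeof(one), 0);
    io_uring_sqe_set_data(sqe, NULL); // NULL data: the completion loop above skips it
    // the next io_uring_submit() in the worker loop flushes this entry
    // together with whatever real work is pending
}

Note how that lines up with the worker loop above: a completion that carries NULL user data is exactly the "notification about eventfd write" case that the loop skips over.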

Here is how we submit the work to the worker thread:


// lock-free push onto the intrusive submission list that the worker drains
void queue_work(struct workitem *work)
{
    struct workitem *head = atomic_load(&g_worker.head);
    do
    {
        work->next = head;
    } while (!atomic_compare_exchange_weak(&g_worker.head, &head, work));
}

Going back to rvn_write_io_ring(): that function handles the submission of a set of pages to write to a file. Note that we protect against concurrent work on the same file. That isn't actually needed since the caller code already handles that, but an uncontended lock is cheap, and it means that I don't need to think about concurrency or worry about changes in the caller code in the future.

We ensure that we have sufficient buffer space, and then we create a work item. A work item is a single write to the file at a given location. However, we are using vectored writes, so we’ll merge writes to the consecutive pages into a single write operation. That is the purpose of the huge for loop in the code. The pages arrive already sorted, so we just need to do a single scan & merge for this.

Pay attention to the fact that the struct workitem actually belongs to two different linked lists. We have the next pointer, which is used to send work to the worker thread, and the prev pointer, which is used to iterate over the entire set of operations we submitted on completion (we’ll cover this in a bit).
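
For reference, here is roughly what the two structs used by the submission path must look like, reconstructed from the fields used in the code above. This is a sketch for readability, not the actual RavenDB definitions, and the field types are assumptions:


#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/uio.h>

// a batch entry handed over from the managed side via PInvoke
struct page_to_write
{
    void *ptr;              // pointer to the page(s) in memory
    int64_t page_num;       // first page number in the data file
    int32_t count_of_pages;
};

enum workitem_type { workitem_write, workitem_fsync };

struct workitem
{
    enum workitem_type type;
    int filefd;                 // file to write to / fsync
    int notifyfd;               // submitter's eventfd, signaled on completion
    int64_t offset;             // byte offset in the file
    union
    {
        struct
        {
            struct iovec *iovecs; // lives right after the workitem in the arena
            int iovecs_count;
        } write;
    } op;
    _Atomic int completed;      // set by the worker, polled by the submitter
    bool errored;
    int32_t result;
    struct workitem *next;      // list 1: lock-free submission queue (g_worker.head)
    struct workitem *prev;      // list 2: walked by the submitter to await completion
};
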

Waiting for the submitted work to complete is done using the following method:


int32_t
wait_for_work_completion(struct handle *handle_ptr, 
    struct workitem *prev, 
    int eventfd, 
    int32_t *detailed_error_code)
{
    // wake worker thread
    eventfd_write(g_worker.eventfd, 1);
    
    bool all_done = false;
    while (!all_done)
    {
        all_done = true;
        *detailed_error_code = 0;


        eventfd_t v;
        int rc = read(eventfd, &v, sizeof(eventfd_t));
        struct workitem *work = prev;
        while (work)
        {
            all_done &= atomic_load(&work->completed);
            work = work->prev;
        }
    }
    return SUCCESS;
}

The idea is pretty simple. We first wake the worker thread by writing to its eventfd(), and then we wait on our own eventfd() for the worker to signal us that (at least some) of the work is done.

Note that we handle the submission of multiple work items by iterating over them in reverse order, using the prev pointer. Only when all the work is done can we return to our caller.

The end result of all this behavior is that we have a completely new way to deal with background I/O operations (remember, journal writes are handled differently). We can control the volume of load we put on the system both by adjusting the size of the IO Ring and by changing its priority.

The fact that we have a single global IO Ring means that we can get much better usage out of the worker thread pool that IO Ring utilizes. We also give the OS a lot more opportunities to optimize RavenDB’s I/O.

The code in this post shows the Linux implementation, but RavenDB also supports IO Ring on Windows if you are running a recent enough version of Windows.

We aren’t done yet, mind, I still have more exciting things to tell you about how RavenDB 7.1 is optimizing writes and overall performance. In the next post, we’ll discuss what I call the High Occupancy Lane vs. Critical Lane for I/O and its impact on our performance.

time to read 8 min | 1476 words

When we build a new feature in RavenDB, we either have at least some idea about what we want to build or we are doing something that is pure speculation. In either case, we will usually spend only a short amount of time trying to plan ahead.

A good example of that can be found in my RavenDB 7.1 I/O posts, which cover more than six months of work for a major overhaul of the system. That was done mostly as a series of discussions between team members, guidance from the profiler, and our experience, seeing where the path would lead us. In that case, it led us to a five-fold performance improvement (and we'll do better still by the time we are done there).

That particular set of changes is one of the more complex and hard-to-execute changes we have made in RavenDB over the past 5 years or so. It touched a lot of code, it changed a lot of stuff, and it was done without any real upfront design. There wasn’t much point in designing, we knew what we wanted to do (get things faster), and the way forward was to remove obstacles until we were fast enough or ran out of time.

I re-read the last couple of paragraphs, and it may look like cowboy coding, but that is very much not the case. There is a process there, it is just not something we would find valuable to put down as a formal design document. The key here is that we have both a good understanding of what we are doing and what needs to be done.

RavenDB 4.0 design document

The design document we created for RavenDB 4.0 is probably the most important one in the project’s history. I just went through it again, it is over 20 pages of notes and details that discuss the current state of RavenDB at the time (written in 2015) and ideas about how to move forward.

It is interesting because I remember writing this document. And then we set out to actually make it happen, that wasn’t a minor update. It took close to three years to complete the process, to give you some context about the complexity and scale of the task.

To give some further context, here is an image from that document:

And here is the sharding feature in RavenDB right now:

This feature is called prefixed sharding in our documentation. It is the direct descendant of the image from the original 4.0 design document. We shipped that feature sometime last year. So we are talking about 10 years from “design” to implementation.

I’m using “design” in quotes here because when I go through this v4.0 design document, I can tell you that pretty much nothing that ended up in that document was implemented as envisioned. In fact, most of the things there were abandoned because we found much better ways to do the same thing, or we narrowed the scope so we could actually ship on time.

Comparing the design document to what RavenDB 4.0 ended up being is really interesting, but it is very notable that there isn’t much similarity between the two. And yet that design document was a fundamental part of the process of moving to v4.0.

What Are Design Documents?

A classic design document details the architecture, workflows, and technical approach for a software project before any code is written. It is the roadmap that guides the development process.

For RavenDB, we use them as both a sounding board and a way to lay the foundation for our understanding of the actual task we are trying to accomplish. The idea is not so much to build the design for a particular feature, but to have a good understanding of the problem space and map out various things that could work.

Recent design documents in RavenDB

I’m writing this post because I found myself writing multiple design documents in the past 6 months. More than I have written in years. Now that RavenDB 7.0 is out, most of those are already implemented and available to you. That gives me the chance to compare the design process and the implementation with recent work.

Vector Search & AI Integration for RavenDB

This was written in November 2024. It outlines what we want to achieve at a very high level. Most importantly, it starts by discussing what we won’t be trying to do, rather than what we will. Limiting the scope of the problem can be a huge force multiplier in such cases, especially when dealing with new concepts.

Reading throughout that document, it lays out the external-facing aspect of vector search in RavenDB. You have the vector.search() method in RQL, a discussion on how it works in other systems, and some ideas about vector generation and usage.

It doesn’t cover implementation details or how it will look from the perspective of RavenDB. This is at the level of the API consumer, what we want to achieve, not how we’ll achieve it.

AI Integration with RavenDB

Given that we have vector search, the next step is how to actually get and use it. This design document was a collaborative process, mostly written during and shortly after a big design discussion we had (which lasted for hours).

The idea there was to iron out the overall understanding of everyone about what we want to achieve. We considered things like caching and how it plays into the overall system, there are notes there at the level of what should be the field names.

That work has already been implemented. You can access it through the new AI button in the Studio. Check out this icon on the sidebar:

That was a much smaller task in scope, but you can see how even something that seemed pretty clear changed as we sat down and actually built it. Concepts we didn’t even think to consider were raised, handled, and implemented (without needing another design).

Voron HNSW Design Notes

This design document details our initial approach to building the HNSW implementation inside Voron, the basis for RavenDB's new vector search capabilities.

That one is really interesting because it is a pure algorithmic implementation, completely internal to our usage (so no external API is needed), and I wrote it after extensive research.

The end result is similar to what I planned, but there are still significant changes.  In fact, pretty much all the actual implementation details are different from the design document. That is both expected and a good thing because it means that once we dove in, we were able to do things in a better way.

Interestingly, this is often the result of other constraints forcing you to do things differently. And then everything rolls down from there.

“If you have a problem, you have a problem. If you have two problems, you have a path for a solution.”

In the case of HNSW, a really complex part of the algorithm is handling deletions. In our implementation, there is a vector, and it has an associated posting list attached to it with all the index entries. That means we can implement deletion simply by emptying the associated posting list. An entire section in the design document (and hours spent pondering) is gone, just like that.

If the design document doesn’t reflect the end result of the system, are they useful?

I would unequivocally state that they are tremendously useful. In fact, they are crucial for us to be able to tackle complex problems. The most important aspect of design documents is that they capture our view of what the problem space is.

Beyond their role in planning, design documents serve another critical purpose: they act as a historical record. They capture the team’s thought process, documenting why certain decisions were made and how challenges were addressed. This is especially valuable for a long-lived project like RavenDB, where future developers may need context to understand the system’s evolution.

Imagine a design document that explores a feature in detail—outlining options, discussing trade-offs, and addressing edge cases like caching or system integrations. The end result may be different, but the design document, the feature documentation (both public and internal), and the issue & commit logs serve to capture the entire process very well.

Sometimes, looking at the road not taken can give you a lot more information than looking at what you did.

I consider design documents to be a very important part of the way we design our software. At the same time, I don’t find them binding, we’ll write the software and see where it leads us in the end.

What are your expectations and experience with writing design documents? I would love to hear additional feedback.

time to read 2 min | 394 words

RavenDB is meant to be a self-managing database, one that is able to take care of itself without constant hand-holding from the database administrator. That has been one of our core tenets from the get-go. Today I checked the current state of the codebase and we have roughly 500 configuration options that are available to control various aspects of RavenDB’s behavior.

These two statements are seemingly contradictory, because if we have so many configuration options, how can we even try to be self-managing? And how can a database administrator expect to juggle all of those options?

Database configuration is a really finicky topic. For example, RocksDB’s authors flat-out admit that out loud:

Even we as RocksDB developers don't fully understand the effect of each configuration change. If you want to fully optimize RocksDB for your workload, we recommend experiments and benchmarking.

And indeed, efforts were made to tune RocksDB using deep-learning models because it is that complex.

RavenDB doesn’t take that approach, tuning is something that should work out of the box, managed directly by RavenDB itself. Much of that is achieved by not doing things and carefully arranging that the environment will balance itself out in an optimal fashion. But I’ll talk about the Zen of RavenDB another time.

Today, I want to talk about why we have so many configuration options, the vast majority of which you, as a user, should neither use, care about, nor even know of.

The idea is very simple: deploying a database engine is a Big Deal, and as such, something that users are quite reluctant to do. When we hit a problem and a support call is raised, we need to provide some mechanism for the user to fix things until we can make sure that RavenDB accounts for that behavior by default.

I treat the configuration options more as escape hatches that allow me to muddle through stuff than explicit options that an administrator is expected to monitor and manage. Some of those configuration options control whether RavenDB will utilize vectored instructions, or which compression algorithm to use over the wire. If you need to touch them, it is amazing that they exist. If you have to deal with them on a regular basis, we need to go back to the drawing board.

time to read 3 min | 462 words

For a new feature in RavenDB, I needed to associate each transaction with a source ID. The underlying idea is that I can aggregate transactions from multiple sources in a single location, but I need to be able to distinguish between transactions from A and B.

Luckily, I had the foresight to reserve space in the Transaction Header, I had a whole 16 bytes available for me. Separately, each Voron database (the underlying storage engine that we use) has a unique Guid identifier. And a Guid is 16 bytes… so everything is pretty awesome.

There was just one issue. I needed to be able to read transactions as part of the recovery of the database, but we stored the database ID inside the database itself. I figured out that I could also put a copy of the database ID in the global file header and was able to move forward.

This is part of a much larger change, so I was going full steam ahead when I realized something pretty awful. That database Guid that I was relying on was already being used as the physical identifier of the storage as part of the way RavenDB distributes data. The reason it matters is that under certain circumstances, we may need to change that.

If we change the database ID, we lose the association with the transactions for that database, leading to a whole big mess. I started sketching out a design for figuring out that the database ID has changed, re-writing all the transactions in storage, and… a colleague said: why don’t we use another ID?

It hit me like a ton of bricks. I was using the existing database Guid because it was already there, so it seemed natural to want to reuse it. But there was no benefit in doing that. Instead, it added a lot more complexity because I was adding (many) additional responsibilities to the value that it didn’t have before.

Creating a Guid is pretty easy, after all, and I was able to dedicate a new one, which I called the Journal ID, to this purpose. The existing Database ID is still there, and it is completely unrelated to the new one. Changing the Database ID has no impact on the Journal ID, so the problem space is radically simplified.

I had to throw away heaps of complexity because of a single comment. I used the Database ID because it was there, never considering having a dedicated value for it. That single suggestion led to a better, simpler design and faster delivery.

It is funny how you can sometimes be so focused on the problem at hand, when a step back will give you a much wider view and a better path to the solution.

time to read 2 min | 247 words

I write a transactional database for a living, and the best example of why we want transactions is transferring money between accounts. It is ironic, therefore, that there is no such thing as transactions for money transfers in the real world.

If you care to know why, go back 200 years and consider how a bank would operate in an environment without instant communication. I would actually recommend doing that, it is a great case study in distributed system design. For example, did you know that the Templars used cryptography to send money almost a thousand years ago?

Recently I was reviewing my bank transactions and I found the following surprise. This screenshot is from yesterday (Dec 18), and it looks like a payment that I made is still “stuck in the tubes” two and a half weeks later.

 

I got in touch with the supplier in question to apologize for the delay. They didn’t understand what I was talking about. Here is what they see when they go to their bank, they got the money.

 

For fun, look at the number of different dates that you can see in their details.

Also, as of right now, my bank account still shows the money as pending approval (to be sent out from my bank).

I might want to recommend that they use a different database. Or maybe I should just convince the bank to approve the payment by the time of the next invoice and see if I can get a bit of that infinite money glitch.

time to read 3 min | 539 words

The Cloud team at RavenDB has been working quite hard recently. The company at large is gearing up for the upcoming 6.2 release, but I can’t ignore the number of goodies that have dropped for RavenDB Cloud Customers.

Large Clusters & Sharding

RavenDB Cloud runs your production cluster with 3 nodes by default. Each one of them operates in a separate availability zone for maximum survivability. The new feature allows you to add additional nodes to your cluster. In the RavenDB Cloud Portal, you can see the “Add node” button and its impact:

Clicking this button allows you to add additional nodes to your cluster. The nodes will be deployed and attached to your cluster within a minute or two. The new nodes will be deployed in the same region (but not necessarily the same availability zone) where your cluster is already deployed.

There are plans in place to add support for deploying nodes in other regions and even in a multi-cloud environment. I would love to hear your feedback on this proposed feature.

You can see the new instances in the RavenDB Studio as well:

The key reason for adding additional nodes to a cluster is when you have very large datasets and you want to shard the data. Here is what this can look like:

In this case, we have sharded the data across 5 nodes, with a replication factor of 2.

Feature selection

There are certain Enterprise features that are only available in the higher-end instances in RavenDB Cloud (typically P30 or higher). We now allow you to selectively enable these features even on lower-tier instances.

This feature allows you to easily pick & choose (on an a-la-carte basis) the specific features you want, without having to upgrade to the more expensive tiers.

Metrics & monitoring

This feature isn’t actually new, but it absolutely deserves your attention. The RavenDB Cloud Portal has a metrics button that you should get familiar with:

Clicking it will provide a wealth of information about your cluster and its behavior. That can be really useful if you want to understand the system’s behavior. Take a peek:

Alerts & Warnings

In addition to just looking at the metrics, the RavenDB Cloud backend will give you some indication about things that you should pay attention to. For example, let’s assume that we had a node failure. You’ll typically not notice that since the RavenDB Cluster & client will work to ensure high availability.

You’ll be able to see that in the metrics, and the RavenDB Cloud Portal will bring it to your attention:

Summary

The major point we strive for in RavenDB and RavenDB Cloud is that the entire experience should be seamless, from deployment and routine management to ensuring that you don't have to concern yourself with the minutiae of data management, so you can focus on your application.

Being able to develop both the software and its execution environment greatly helps in providing solutions that Just Work. I’m really proud of what we have accomplished and I would love to get your feedback on it.

time to read 4 min | 764 words

I wanted to test low-level file-system behavior in preparation for a new feature for RavenDB. Specifically, I wanted to look into hole punching - where you can give low-level instructions to the file system to indicate that you’re giving up disk space, but without actually reducing the size of the file.

This can be very helpful in space management. If I have a section in the file that is full of zeroes, I can just tell the file system that, and it can skip storing that range of zeros on the disk entirely. This is an advanced feature for file systems. I haven't actually used that in the past, so I needed to gain some expertise with it.

I wrote the following code for Linux:


int fd = open("test.file", O_CREAT | O_WRONLY, 0644);
lseek(fd, 128 * 1024 * 1024 - 1, SEEK_SET); // 128MB file
write(fd, "", 1);
fallocate(fd, // 32 MB hole from the 16MB..48MB range
    FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 
    16 * 1024 * 1024, 32 * 1024 * 1024); 
close(fd);

The code for Windows is here if you want to see it. I tested the feature on both Windows & Linux, and it worked. I could see that while the file size was 128MB, I was able to give back 16MB to the operating system without any issues. I turned the code above into a test and called it a day.

And then the CI build broke. But that wasn’t possible since I tested that. And there had been CI runs that did work on Linux. So I did the obvious thing and started running the code above in a loop.

I found something really annoying. This code worked, sometimes. And sometimes it just didn’t.

In order to get the size, I need to run this code:


struct stat st;
fstat(fd, &st);
printf("Total size: %lld bytes\n",
    (long long)st.st_size);
printf("Actual size on disk: %lld bytes\n", 
    (long long)st.st_blocks * 512);

I’m used to weirdness from file systems at this point, but this is really simple. All the data is 4KB aligned (in fact, all the data is 16MB aligned). There shouldn’t be any weirdness here.

As you can see, I’m already working at the level of Linux syscalls, but I used strace to check if there is something funky going on. Nope, there was a 1:1 mapping between the code and the actual system calls issued.

That means that I have to debug deeper if I want to understand what is going on. This involves debugging the Linux Kernel, which is a Big Task. Take a look at the code in the relevant link. I’m fairly certain that the issue is in those lines. The problem is that this cannot be, since both offset & length are aligned to 4KB.

I got out my crystal ball and thinking hat and meditated on this. If you’ll note, the difference between the expected and actual values is exactly 4KB. It almost looks like the file itself is not aligned on a 4KB boundary, but the holes must be.

Given that I just want to release this space to the operating system and 4KB is really small, I can adjust that as a fudge factor for the test. I would love to understand exactly what is going on, but so far the “file itself is not 4KB aligned, but holes are” is a good working hypothesis (even though my gut tells me it might be wrong).
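
One way to test that hypothesis, by the way, is to ask the file system directly where the data and the holes actually are. Here is a quick diagnostic sketch using SEEK_DATA / SEEK_HOLE (just an idea for verifying the guess, not part of the actual test, and with error handling omitted):


#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("test.file", O_RDONLY);
    off_t end = lseek(fd, 0, SEEK_END);
    off_t pos = 0;
    while (pos < end)
    {
        // start of the next hole at or after pos
        off_t hole = lseek(fd, pos, SEEK_HOLE);
        // start of the next data extent after that hole (ENXIO means a trailing hole)
        off_t data = lseek(fd, hole, SEEK_DATA);
        if (data < 0)
            data = end;
        printf("data: [%lld, %lld), hole: [%lld, %lld)\n",
               (long long)pos, (long long)hole,
               (long long)hole, (long long)data);
        pos = data;
    }
    close(fd);
    return 0;
}

If the reported extents start at offsets that are not 4KB aligned, that would confirm (or kill) the working hypothesis above.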

If you know the actual reason for this, I would love to hear it.

And don't get me started on what happened with sparse files in macOS. There, the OS will randomly decide to mark some parts of your file as holes, making any deterministic testing really hard.

time to read 13 min | 2479 words

RavenDB has a hidden feature, enabled by default and not something that you usually need to be aware of. It has built-in support for caching. Consider the following code:


async Task<Dictionary<string, int>> HowMuchWorkToDo(string userId)
{
    using var session = _documentStore.OpenAsyncSession();
    var results = await session.Query<Item>()
        .GroupBy(x => new { x.Status, x.AssignedTo })
        .Where(g => g.Key.AssignedTo == userId && g.Key.Status != "Closed")
        .Select(g => new
        {
            Status = g.Key.Status,
            Count = g.Count()
        })
        .ToListAsync();

    return results.ToDictionary(x => x.Status, x => x.Count);
}

What happens if I call it twice with the same user? The first time, RavenDB will send the query to the server, where it will be evaluated and executed. The server will also send an ETag header with the response. The client will remember the response and its ETag in its own memory.

The next time this is called on the same user, the client will again send a request to the server. This time, however, it will also inform the server that it has a previous response to this query, with the specified ETag. The server, when realizing the client has a cached response, will do a (very cheap) check to see if the cached response matches the current state of the server. If so, it can inform the client (using 304 Not Modified) that it can use its cache.

In this way, we benefit twice:

  • First, on the server side, we avoid the need to compute the actual query.
  • Second, on the network side, we aren’t sending a full response back, just a very small notification to use the cached version.

You’ll note, however, that there is still an issue. We have to go to the server to check. That means that we still pay the network costs. So far, this feature is completely transparent to the user. It works behind the scenes to optimize server query costs and network bandwidth costs.

We have a full-blown article on caching in RavenDB if you care to know more details instead of just “it makes things work faster for me”.

Aggressive Caching in RavenDB

The next stage is to involve the user. Enter the AggressiveCache() feature (see the full documentation here), which allows the user to specify an additional aspect. Now, when the client has the value in the cache, it will skip going to the server entirely and serve the request directly from the cache.

What about cache invalidation? Instead of having the client check on each request if things have changed, we invert the process. The client asks the server to notify it when things change, and until it gets notice from the server, it can serve responses completely from the local cache.

I really love this feature. That was the Good part; now let's talk about the other pieces:

There are only two hard things in Computer Science: cache invalidation and naming things.

-- Phil Karlton

The bad part of caching is that this introduces more complexity to the system. Consider a system with two clients that are using the same database. An update from one of them may show up at different times in each. Cache invalidation will not happen instantly, and it is possible to get into situations where the server fails to notify the client about the update, meaning that we didn’t clear the cache.

We have a good set of solutions around all of those, I think. But it is important to understand that the problem space itself is a problem.

In particular, let’s talk about dealing with the following query:


var emps = await session.Query<Employee>()
    .Include(x => x.Department)
    .Where(x => x.Location.City == "London")
    .ToListAsync();

When an employee is changed on the server, it will send a notice to the client, which can evict the item from the cache, right? But what about when a department is changed?

For that matter, what happens if a new employee is added to London? How do we detect that we need to refresh this query?

There are solutions to those problems, but they are super complicated and have various failure modes that often require more computing power than actually running the query. For that reason, RavenDB uses a much simpler model. If the server notifies us about any change, we’ll mark the entire cache as suspect.

The next request will have to go to the server (again with an ETag, etc) to verify that the response hasn’t changed. Note that if the specific query results haven’t changed, we’ll get OK (304 Not Modified) from the server, and the client will use the cached response.

Conservatively aggressive approach

In other words, even when using aggressive caching, RavenDB still has to go to the server sometimes. What is the impact of this approach when you have a system under load?

We'll still use aggressive caching, but you'll see brief periods where we aren't checking with the server (usually we are able to cache for about a second or so), followed by queries to the server to check for any changes.

In most cases, this is what you want. We still benefit from the cache while reducing the number of remote calls by about 50%, and we don’t have to worry about missing updates. The downside is that, as application developers, we know that this particular document and query are independent, so we want to cache them until we get notice about that particular document being changed.

The default aggressive caching in RavenDB will not be of major help here, I’m afraid. But there are a few things you can do.

You can use Aggressive Caching in the NoTracking mode. In that mode, the client will not ask the server for notifications on changes, and will cache the responses in memory until they expire (clock expiration or size expiration only).

There is also a feature suggestion that calls for updating the aggressive cache in a background manner, I would love to hear more feedback on this proposal.

Another option is to take this feature higher than RavenDB directly, but still use its capabilities. Since we have a scenario where we know that we want to cache a specific set of documents and refresh the cache only when those documents are updated, let’s write it.

Here is the code:


public class RecordCache<T>
{
    private ConcurrentLru<string, T> _items = 
        new(256, StringComparer.OrdinalIgnoreCase);
    private readonly IDocumentStore _documentStore;


    public RecordCache(IDocumentStore documentStore)
    {
        const BindingFlags Flags = BindingFlags.Instance | 
            BindingFlags.NonPublic | BindingFlags.Public;
        var violation = typeof(T).GetFields(Flags)
            .FirstOrDefault(f => f.IsInitOnly is false);
        if (violation != null)
        {
            throw new InvalidOperationException(
                "You should cache *only* immutable records, but got: " + 
                typeof(T).FullName + " with " + violation.Name + 
                " which is not read only!");
        }


        var changes = documentStore.Changes();
        changes.ConnectionStatusChanged += (_, args) =>
        {
            _items = new(256, StringComparer.OrdinalIgnoreCase);
        };
        changes.ForDocumentsInCollection<T>()
            .Subscribe(e =>
            {
                _items.TryRemove(e.Id, out _);
            });
        _documentStore = documentStore;
    }


    public ValueTask<T> Get(string id)
    {
        if (_items.TryGetValue(id, out var result))
        {
            return ValueTask.FromResult(result);
        }
        return new ValueTask<T>(GetFromServer(id));


    }


    private async Task<T> GetFromServer(string id)
    {
        using var session = _documentStore.OpenAsyncSession();
        var item = await session.LoadAsync<T>(id);
        _items.Set(id, item);
        return item;
    }
}

There are a few things to note about this code. We are holding live instances, so we ensure that the values we keep are immutable records. Otherwise, we may hand the same instance to two threads which can be… fun.

Note that document IDs in RavenDB are case insensitive, so we pass the right string comparer.

Finally, the magic happens in the constructor. We register for two important events. Whenever the connection status of the Changes() connection is modified, we clear the cache. This handles any lost updates scenarios that occurred while we were disconnected.

In practice, the subscription to events on that particular collection is where we ensure that after the server notification, we can evict the document from the cache so that the next request will load a fresh version.

Caching + Distributed Systems = 🤯🤯🤯

I’m afraid this isn’t an easy topic once you dive into the specifics and constraints we operate under. As I mentioned, I would love your feedback on the background cache refresh feature, or maybe you have better insight into other ways to address the topic.

time to read 4 min | 728 words

I got into an interesting discussion on LinkedIn about my previous post, talking about Code Rot. I was asked about Legacy Code defined as code without tests and how I reconcile code rot with having tests.

I started to reply there, but it really got out of hand and became its own post.

“To me, legacy code is simply code without tests.” Michael Feathers, Working Effectively with Legacy Code

I read Working Effectively with Legacy Code for the first time in 2005 or thereabout, I think. It left a massive impression on me and on the industry at large. The book is one of the reasons I started rigorously writing tests for my code, it got me interested in mocking and eventually led me to writing Rhino Mocks.

It is ironic that the point of this post is that I disagree with this statement by Michael because of Rhino Mocks. Let's start with numbers: the last commit to the Rhino Mocks repository was about a decade ago. It has just under 1,000 tests and code coverage that ranges between 95% and 100%.

I can modify this codebase with confidence, knowing that I will not break stuff unintentionally. The design of the code is very explicitly meant to aid in testing and the entire project was developed with a Test First mindset.

I haven’t touched the codebase in a decade (and it has been close to 15 years since I really delved into it). The code itself was written in .NET 1.1 around the 2006 timeframe. It literally predates generics in .NET.

It compiles and runs all tests when I try to run it, which is great. But it is still very much a legacy codebase.

It is a legacy codebase because changing this code is a big undertaking. This code will not run on modern systems. We need to address issues related to dynamic code generation between .NET Framework and .NET.

That in turn requires a high level of expertise and knowledge. I’m fairly certain that given enough time and effort, it is possible to do so. The problem is that this will now require me to reconstitute my understanding of the code.

The tests are going to be invaluable for actually making those changes, but the core issue is that a lot of knowledge has been lost. It will be a Project just to get it back to a normative state.

This scenario is pretty interesting because I am actually looking back at my own project. Thinking about having to do the same to a similar project from someone else’s code is an even bigger challenge.

Legacy code, in this context, means that there is a huge amount of effort required to start moving the project along. Note that if we had kept the knowledge and information within the same codebase, the same process would be far cheaper and easier.

Legacy code isn’t about the state of the codebase in my eyes, it is about the state of the team maintaining it. The team, their knowledge, and expertise, are far more important than the code itself.

An orphaned codebase, one that has no one to take care of, is a legacy project even if it has tests. Conversely, a project with no tests but with an actively knowledgeable team operating on it is not.

Note that I absolutely agree that tests are crucial regardless. The distinction that I make between legacy projects and non-legacy projects is whether we can deliver a change to the system.

Reminder: A codebase that isn’t being actively maintained and has no tests is the worst thing of all. If you are in that situation, go read Working Effectively with Legacy Code, it will be a lifesaver.

I need a feature with an ideal cost of X (time, materials, effort, cost, etc). A project with no tests but people familiar with it will be able to deliver it at a cost of 2-3X. A legacy project will need 10X or more. The second feature may still require 2X from the maintained project, but only 5X from the legacy system. However, that initial cost to get things started is the killer.

In other words, what matters here is the inertia, the ability to actually deliver updates to the system.

time to read 3 min | 481 words

A customer called us about some pretty weird-looking numbers in their system:

You’ll note that the total number of entries in the index across all the nodes does not match. Notice that node C has 1 less entry than the rest of the system.

At the same time, all the indicators are green. As far as the administrator can tell, there is no issue, except for the number discrepancy. Why is it behaving in this manner?

Well, let’s zoom out a bit. What are we actually looking at here? We are looking at the state of a particular index in a single database within a cluster of machines. When examining the index, there is no apparent problem. Indexing is running properly, after all.

The actual problem was a replication issue, which prevented replication from proceeding to the third node. When looking at the index status, you can only see that the entry count is different.

When we zoom out and look at the state of the cluster, we can see this:

There are a few things that I want to point out in this scenario. The problem here is a pretty nasty one. All nodes are alive and well, they are communicating with each other, and any simple health check you run will give good results.

However, there is a problem that prevents replication from properly flowing to node C. The actual details aren’t relevant (a bug that we fixed, to tell the complete story). The most important aspect is how RavenDB behaves in such a scenario.

The cluster detected this as a problem, marked the node as problematic, and raised the appropriate alerts. As a result of this, clients would automatically be turned away from node C and use only the healthy nodes.

From the customer’s perspective, the issue was never user-visible since the cluster isolated the problematic node. I had a hand in the design of this, and I wrote some of the relevant code. And I’m still looking at these screenshots with a big sense of accomplishment.

This stuff isn’t easy or simple. But to an outside observer, the problem started from: why am I looking at funny numbers in the index state in the admin panel? And not at: why am I serving the wrong data to my users.

The design of RavenDB is inherently paranoid. We go to a lot of trouble to ensure that even if you run into problems, even if you encounter outright bugs (as in this case), the system as a whole would know how to deal with them and either recover or work around the issue.

As you can see, live in production, it actually works and does the Right Thing for you. Thus, I can end this post by saying that this behavior makes me truly happy.
