Oren Eini, CEO of RavenDB, a NoSQL Open Source Document Database

time to read 2 min | 303 words

Recently the time.gov site had a complete makeover, which I love. I don't really have much to do with US time in the normal course of things, but the site has a really interesting feature.

Here is what this shows on my machine:

[image: time.gov's display on my machine, showing how far the local clock drifts from the official time]

I love this feature because it showcases a real-world problem very clearly. Time is hard. The concept of time we have in our heads is completely wrong in many cases, and that leads to interesting bugs. In this case, the second machine will be adjusted at midnight from the network and the clock drift will be fixed (hopefully).

What will happen to any code that runs when that adjustment kicks in? As far as that code is concerned, time will move back.

RavenDB has a feature, document expiration. You can set a time for a document to go away. We had a bug which caused us to read the expiration entries to be processed at time T and then delete the documents that are older than T. Except that in this case, T wasn't the same value in both steps. We travelled back in time (and the log was confusing) and got an earlier value the second time. That meant that we removed the expiration entries but not their related documents. When the clock moved forward enough again for those documents to expire, the expiration records were already gone.

As far as RavenDB was concerned, the documents were updated to expire in the future, so the expiration records were no longer relevant. And the documents never expired, ouch.

We fixed that by remembering the original time at which we read the expiration records. I'm comforted knowing that we aren't the only ones having to deal with this.
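For illustration, the shape of the fix is roughly this (my sketch with hypothetical helper names, not the actual RavenDB code): read the clock once and reuse that same value for both steps, so a backwards clock adjustment between them can't split the operation apart.

```csharp
// Sketch (not the actual RavenDB code): the cutoff is read from the clock once
// and reused, instead of calling DateTime.UtcNow separately in each step.
// ReadExpirationEntries / DeleteDocumentsOlderThan are hypothetical helpers.
DateTime cutoff = DateTime.UtcNow;                        // remember the original time

var entries = ReadExpirationEntries(olderThan: cutoff);   // step 1 uses cutoff
DeleteDocumentsOlderThan(cutoff, entries);                // step 2 uses the same cutoff
```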

time to read 6 min | 1077 words

I ran across this article, which talks about unit testing. There isn't anything there that would be groundbreaking, but I ran across this quote, and I felt that I had to write a post to answer it.

The goal of unit testing is to segregate each part of the program and test that the individual parts are working correctly. It isolates the smallest piece of testable software from the remainder of the code and determines whether it behaves exactly as you expect.

This is a fairly common talking point when people discuss unit testing. Note that this isn't the goal. The goal is what you want to achieve; this is a method of applying unit testing. Some of the benefits of unit testing, according to the article, are:

  • Makes the Process Agile
  • Facilitates Changes and Simplifies Integration

There are other items in the article's list, but you can read them there. I want to focus right now on the items above, because they are directly contradicted by separating each part of the program and testing it individually, as unit testing is usually applied in software projects.

Here are a few examples from posts I wrote over the years. The common pattern is that you'll have interfaces, repositories, services and abstractions galore. That allows you to test just a small piece of your code, separate from everything else that you have.

This is great for unit testing. But unit testing isn’t a goal in itself. The point is to enable change down the line, to ensure that we aren’t breaking things that used to work, etc.

An interesting thing happens when you have this kind of architecture (and especially if you have this specifically so you can unit test it): it becomes very hard to make changes to the system. That is because the number of times you repeated yourself has grown. You have something once in the code and a second time in the tests.

Let’s consider something rather trivial. We have the following operation in our system, sending money:

[image: the SendMoney operation in the system]

A business rule says that we can’t send money if we don’t have enough in our account. Let’s see how we may implement it:
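Roughly, the shape is something like this (a sketch; the concrete rule, repository and service names are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Sketch only; the abstraction is referred to below as IMoneyTransferValidationRules,
// the concrete names here are illustrative.
public record MoneyTransfer(string SourceAccount, string TargetAccount, decimal Amount);
public record Account(string Id, decimal Balance);
public interface IAccountRepository { Task<Account> LoadAsync(string id); }

public interface IMoneyTransferValidationRule
{
    Task ValidateAsync(MoneyTransfer transfer);
}

public class HasSufficientFundsRule : IMoneyTransferValidationRule
{
    private readonly IAccountRepository _accounts;
    public HasSufficientFundsRule(IAccountRepository accounts) => _accounts = accounts;

    public async Task ValidateAsync(MoneyTransfer transfer)
    {
        // Each rule loads the data it needs on its own, hence a query (or more) per rule.
        Account account = await _accounts.LoadAsync(transfer.SourceAccount);
        if (account.Balance < transfer.Amount)
            throw new InvalidOperationException("Insufficient funds");
    }
}

public class MoneyTransferService
{
    private readonly IEnumerable<IMoneyTransferValidationRule> _rules;
    public MoneyTransferService(IEnumerable<IMoneyTransferValidationRule> rules) => _rules = rules;

    public async Task SendMoneyAsync(MoneyTransfer transfer)
    {
        foreach (var rule in _rules)
            await rule.ValidateAsync(transfer);

        // ... perform the actual transfer ...
    }
}
```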

This seems reasonable at first glance. We have a lot of rules around money transfers, and we expect to have more of them in the future, so we created the IMoneyTransferValidationRules abstraction to model that, and we can easily add new rules as time goes by. Nothing objectionable about that, right? And this is important, so we'll have unit tests for each one of those rules.

During the last stages of building the system, we realized that each one of those rules generates a bunch of queries to the database, and that when we have load on the system, the transfer operation will create too much pain as it currently stands. There are a few options available to us at this point:

  • Instead of running individual operations that will each load their own data, we'll load the data once for all of them. Here is how this looks:
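With RavenDB, for instance, that can look something like the following (a sketch, continuing with the illustrative names from above and assuming a store is the IDocumentStore): the rules register lazy loads, a single call executes all of them in one round trip, and only then do the checks run.

```csharp
// Sketch: register the loads lazily, pay for a single round trip, then validate.
using var session = store.OpenSession();

// Phase 1: no remote calls happen here, the loads are only registered.
Lazy<Account> source = session.Advanced.Lazily.Load<Account>(transfer.SourceAccount);
Lazy<Account> target = session.Advanced.Lazily.Load<Account>(transfer.TargetAccount);

// Phase 2: one remote call fetches everything that was registered above.
session.Advanced.Eagerly.ExecuteAllPendingLazyOperations();

// Phase 3: the rules now validate against data that is already in memory.
if (source.Value.Balance < transfer.Amount)
    throw new InvalidOperationException("Insufficient funds");
```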

As you can see, we now have a way to use Lazy queries to reduce the number of remote calls this will generate.

  • Instead of taking the data from the database and checking it, we’ll send the check script to the database and do the validation there.
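With a relational database, for example, the equivalent might be to ship the checks as a script and evaluate them next to the data (a sketch with a made-up schema, using System.Data.SqlClient and an existing connection; transaction handling omitted):

```csharp
// Sketch with a made-up schema: the checks run inside the database as a script,
// so no data has to travel to the application just to be validated.
using var cmd = new SqlCommand(@"
    IF (SELECT Balance FROM Accounts WHERE Id = @source) < @amount
        THROW 50001, 'Insufficient funds', 1;
    IF EXISTS (SELECT 1 FROM BlockedAccounts WHERE Id = @target)
        THROW 50002, 'Target account is blocked', 1;

    UPDATE Accounts SET Balance = Balance - @amount WHERE Id = @source;
    UPDATE Accounts SET Balance = Balance + @amount WHERE Id = @target;", connection);

cmd.Parameters.AddWithValue("@amount", transfer.Amount);
cmd.Parameters.AddWithValue("@source", transfer.SourceAccount);
cmd.Parameters.AddWithValue("@target", transfer.TargetAccount);
cmd.ExecuteNonQuery();   // a THROW in the script surfaces here as a SqlException
```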

And here we moved pretty much the same overall architecture directly into the database itself, so we won't have to pay the cost of remote calls when we need to access more information.

The common thing for both approaches is that they are perfectly in line with the old way of doing things. We aren't talking about a major conceptual change. We just changed things so that it is easier to work with them properly.

What about the tests?

If we tested each one of the rules independently, we now have a problem. All of those tests will require non-trivial modification. That means that instead of enabling change, the tests now serve as a barrier to change. They have set our architecture and code in concrete and made it harder to make changes. If the changes we were making were bugs, that would be great. But in this case, we don't want to modify the system's behavior, only how it achieves its end result.

The key issue with unit testing the system as a set of individually separated components is the notion that there is value in each component independently. There isn't. "The whole is greater than the sum of its parts" very much applies here.

If we had tests that looked at the system as a whole, those wouldn't break. They would continue to serve us properly and validate that this big change we made didn't break anything. Furthermore, at the edges of the system, changing the way things behave usually is a problem. We might have external clients or additional APIs that rely on us, after all. So keeping the exterior stable is something that I do want to enforce with tests.

That said, when you build your testing strategy, you may have to make allowances. It is very important for the tests to run as quickly as possible. Slow feedback cycles can be incredibly annoying and will kill productivity. If there are specific components in your system that are slow, it makes sense to insert seams to replace them. For example, if your system has a certificate generation step (which can take a long time), in the tests you might want to return a certificate that was prepared ahead of time. Or if you are working with a remote database, you may want to use an in-memory version of it. An external API you'll want to mock, etc.
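For the certificate example, such a seam can be as small as this (a sketch with hypothetical names): the production implementation does the slow generation, while the test implementation hands back a certificate created ahead of time.

```csharp
using System.Security.Cryptography.X509Certificates;

// Sketch with hypothetical names: a seam around the slow certificate generation step.
public interface ICertificateSource
{
    X509Certificate2 GetServerCertificate();
}

// Production: actually generate a certificate (slow).
public class GeneratedCertificateSource : ICertificateSource
{
    public X509Certificate2 GetServerCertificate() =>
        CertificateGenerator.CreateSelfSigned();   // stand-in for the real (slow) generation code
}

// Tests: load a certificate that was prepared ahead of time (fast).
public class PreparedCertificateSource : ICertificateSource
{
    public X509Certificate2 GetServerCertificate() =>
        new X509Certificate2("test-cert.pfx", "test-password");
}
```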

The key here isn't that you are trying to look at things in isolation; the key is that you are isolating the things that prevent you from getting quick feedback on the state of the system.

In short, unless there is uncertainty about a particular component (implementing a new algorithm or data structure, exploring an unfamiliar library, using 3rd party code, etc.), I wouldn't worry about testing it in isolation. Test it from the outside, as a user would (note that it may take some work to enable that as an option), and you'll end up with a far more robust testing infrastructure.

time to read 6 min | 1149 words

I recently got into a discussion about one of the most interesting features of RavenDB: the ability to automatically deduce and create indexes on the fly, based on the actual queries made to the server. This is a feature that RavenDB has had for a very long time, over a decade, and one that I'm quite proud of. The discussion was about whether such a feature is useful in real-world scenarios. I obviously lean toward it being incredibly useful, but I found the discussion good enough to post it here.

The gist of the argument against automatic indexes is that developers should be in control of what is going on in the database and create the appropriate indexes of their own accord. The database should not be in the business of creating indexes on the fly, which is scary to do in production.

I don't like the line of thinking that says it is the responsibility of the developers / operators / database admins to make sure that all queries use the optimal query plans. To be rather more exact, I absolutely think that they should do that; I just don't believe that they can / will / are able to.

In many respects, I consider the notion of automatic index creation to be similar to garbage collection in managed languages. There is currently one popular language that still does manual memory management, and that is C. Pretty much all other languages have switched to some other model that means the developer doesn't need to track things manually. Managed languages have a GC, Rust has its ownership model, C++ has RAII and smart pointers, etc. We have decades and decades of experience telling us that no, developers actually can't be expected to keep track of memory properly. There is a direct and immediate need for systematic help for that purpose.

Manual memory management can be boiled down to: “for every malloc(), call free()”. And yet it doesn’t work.

For database optimizations, you need to have a lot of knowledge. You need to understand the system, the actual queries being generated, how the data is laid out on disk and many other factors. The SQL Server Query Performance Tuning book is close to a thousand pages in length. This is decidedly not a trivial topic.

It is entirely possible to expect experts to know the material and to have a checkpoint before deployment that ensures you have done the Right Thing before going to production. Except that this is specialized knowledge, so now you have gatekeepers, and going back to the manual memory management woes, we know that this doesn't always work.

There is a cost / benefit calculation here. If we make it too hard for developers to deploy, the pace of work would slow down. On the other hand, if a bad query goes to production, it may take the entire system down.

In some companies, I have seen weekly meetings for all changes to the database. You submit your changes (schema or queries), they get reviewed in the next available meeting and deployed to production within two weeks of that. The system was considered to be highly efficient in ensuring nothing bad happened to the database. It also ensured that developers would cut corners. In a memorable case, a developer needed to show some related data on a page. Getting a join approved to fetch that data would take three weeks. Issuing additional database calls over the already-approved API, on the other hand, could be done immediately. You can guess how that ended up, can't you?

RavenDB has automatic indexes because they are useful. As you build your application, RavenDB learns from the actual production behavior. The more you use a particular aspect, the more RavenDB is able to optimize it. When changes happen, RavenDB adjusts, automatically. That is key, because it removes a very tedious and time-consuming chore from the developers. Instead of having to spend a couple of weeks before each deployment verifying that the current database structure still serves the current set of queries, they can rest assured that the database will handle that.
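To make that concrete, consider a query like this one (my example, assuming an Order class in the model):

```csharp
// My example: no index was defined anywhere for this query. The query optimizer
// sees its shape and creates a matching automatic index (named along the lines
// of Auto/Orders/ByCompany), then keeps it up to date as documents change.
using var session = store.OpenSession();

List<Order> orders = session.Query<Order>()
    .Where(o => o.Company == "companies/1-A")
    .ToList();
```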

In fact, RavenDB has a mode where you can run your system against a test database, take the information gathered from the test run and apply it to your production system. That way, you can avoid having to learn the new behavior on the fly. You can introduce the changes to the system model at an idle point in time and let RavenDB adjust to them without anything much going on.

I believe that much of the objection to automatic indexes comes from the usual pain involved in creating indexes in other databases. Creating an index is often seen as a huge risk. It may lock tables and pages, it may consume a lot of system resources, and even if the system has an online index creation mode (and not all do), it is something that you Just Don't do.

RavenDB, in contrast, has been running with this feature for a decade. We have had a lot of time to improve the heuristics and the behavior of the system under these conditions. New indexes being introduced are going to have bounded resources allocated to them, no locks are involved, and the other indexes are able to serve requests with no interruption in service. RavenDB is also able to go the other way: it will recognize which automatic indexes are superfluous and remove them, and automatic indexes that see no use will be expired by the query optimizer for you. The whole idea is that there is an automated DBA running inside the RavenDB query optimizer that constantly monitors what is going on, reducing the need for manual maintenance cycles.

As you can imagine, this is a pretty important feature and has been through a lot of optimization and work over the years. RavenDB is now usually good enough at this task that in many cases you don't ever need to create indexes yourself. That has an enormous impact on your ability to rapidly evolve your product, because you can do just that instead of going over a thousand-page book telling you how to optimize your queries. Write the code, issue your queries, and the database will adjust.

With all the praise that I heap upon automatic index creation, I want to note that it is at most a copper bullet, not a silver one. Just like with garbage collection, you are freed from the minutiae and tedium of manual memory management, but you still need to understand some of the system's behavior. The good thing is that you are free()-ed from having to deal with that all the time. You just need to pay attention in rare cases, usually at the hotspots of your application. That is a much better way to invest your time.

time to read 2 min | 202 words

We needed to randomize a list of values, so I quickly wrote the following code:
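A reconstruction of the gist of that code (not the exact snippet): "shuffle" by sorting with a random comparison, then count which value ends up at the head of the list over 10,000 runs.

```csharp
using System;
using System.Collections.Generic;

// Reconstruction of the gist (not the exact original snippet): "shuffle" by
// sorting with a random comparer, and count which value ends up first.
var rnd = new Random();
var counts = new Dictionary<string, int> { ["A"] = 0, ["B"] = 0, ["C"] = 0 };

for (int i = 0; i < 10_000; i++)
{
    var list = new List<string> { "A", "B", "C" };
    list.Sort((x, y) => rnd.Next(0, 2) == 0 ? -1 : 1); // every comparison is a coin flip
    counts[list[0]]++;
}

foreach (var pair in counts)
    Console.WriteLine($"{pair.Key} -> {pair.Value}");
```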


What do you expect the output to be?

  • A -> 2431
  • B -> 2571
  • C -> 4998

The numbers vary slightly between runs, but they are surprisingly consistent overall. C is about 50% likely to be at the head of the list, but obviously I wanted the probability to be 33%.

Can you figure out why this is the case? In order to sort, we need to compare the values, and we do that in a random fashion. We start by comparing A and B, and each of them has a 50% chance of coming out first.

Then we compare the winner (A | B) to C, and there is a 50% chance of C being first. In other words, because we compare C to the top once, it has a 50% chance of being first. But for A and B, they need to pass two 50% chances to get to the top, so they are only there 25% of the time.

When we figured out why this is happening, I immediately thought of the Monty Hall problem.

The right solution is the Fisher-Yates shuffle, and here is how you want to implement it:
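A minimal version in C# (my sketch): walk the list from the end, and at each position swap the current element with a randomly chosen element from the not-yet-shuffled prefix, itself included.

```csharp
using System;
using System.Collections.Generic;

// Fisher-Yates shuffle (my sketch): every permutation is equally likely, and
// each element is moved at most once.
static void Shuffle<T>(IList<T> list, Random rnd)
{
    for (int i = list.Count - 1; i > 0; i--)
    {
        int j = rnd.Next(i + 1);                    // 0 <= j <= i
        (list[i], list[j]) = (list[j], list[i]);    // swap
    }
}
```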

time to read 7 min | 1291 words

A system that runs on a single machine is an order of magnitude simpler than one that resides on multiple machines. The complexity involved in maintaining consistency across multiple machines is huge. I have been dealing with this for the past 15 years and I can confidently tell you that no sane person would choose a multi-machine setup over a single machine if they could get away with it. So what was the root cause of the push toward multiple machines and distributed architectures across the board for the past 20 years? And why are we seeing a backlash against that today?

You'll hear people talking about the need for high availability and the desire to avoid a single point of failure. And that is true, to a degree. But there are other ways to handle that (a primary / secondary model) rather than a full-blown multi-node setup.

Some users simply have too much data and have to make use of a distributed architecture. If you are gathering a TB / day of data, no single system is going to help you, for example. However, most users aren't there. A growth rate of a GB / day (fully retained) is quite respectable and will take over a decade to start becoming a problem on a single machine.

What I think people don't remember so well is that the landscape has changed quite significantly in the past 10 – 15 years. I'm not talking about Moore's law; I'm talking about something that is actually quite a bit more significant: the dramatic boost in speed that we saw in storage.

Here are some numbers from the start of the last decade: a top-of-the-line 32GB SCSI drive at 15K RPM could hit 161 IOPS. A more modern 14 TB disk will have 938 IOPS. That is a speed increase of over 500%, which is amazing, but not enough to matter. These two numbers are both for hard disks. But we had a major disruption in storage at the start of the millennium: the advent of SSD drives.

It turns out that SSDs aren't nearly as new as one would expect. They were just horribly expensive. Here are the specs for such a drive from around 2003. The cost would be tens of thousands (USD) per drive. To be fair, this was meant to be used in rugged environments (think military tech, missiles and such), but there wasn't much else in the market. In 2003 the first commodity SSDs started to appear, with sizes that topped out at 512MB.

All of this is to say that in the early 2000s, if you wanted to store a non-trivial amount of data, you had to face the fact that you were dealing with hard disks. And you could expect some pretty harsh limitations on the number of IOPS available. That, in turn, meant that the deciding factor for scaling out wasn't really processing speed. Remember that the C10K problem, handling 10K concurrent connections on a single server, was still a challenge, though a reasonable one, in 1999 (for comparison, millions of connections per server aren't out of the ordinary today).

Given 10K connections per server, with each one of them needing a single IO every 5 seconds, what would happen? That means that we need to handle 2,000 IOPS. That is over ten times what you could get from a top-of-the-line disk at that time. So even if you had a RAID0 with ten drives and were able to get a perfect distribution of I/O across the drives, you would still be about 20% short. And I don't think you'll want to run 10 drives in RAID0 in production. Given the I/O limits, you could reasonably expect to serve 100 – 300 operations per second per server, and that is assuming that you were able to cache some portion of the data in memory and avoid disk hits. The only way to handle this kind of limitation was to scale out, to have more servers (and more disks) to handle the load.

The rise of commodity SSDs changed the picture significantly, and NVMe drives are the icing on the cake. SSDs can do tens of thousands of IOPS and NVMe drives can do hundreds of thousands of IOPS (and some exceed a million IOPS with a comfortable margin).

Going back to the C10K issue? A 49.99$ drive with 256GB has specs that exceed 90,000 IOPS. Those 2,000 IOPS we had to get 10 machines for? That isn't really noticeable at all today. In fact, let's go higher. Let's say we have 50,000 concurrent connections, each one issuing an operation once a second. That is twenty-five times the work of the previous example. But the environment in which we are running is very different.

Given an operating budget of 150$, I will use the hardware from this article, which is basically a Raspberry Pi with an SSD drive (and fully 50$ of the budget goes to the adapter that plugs the SSD into the Pi). That gives me the ability to handle 4,412 requests per second using Apache, which isn't the best server in the world. Given that the disk used in the article can handle more than 250,000 IOPS, we can run a lot on a "server" that fits into a lunch box and costs significantly less than the monthly cost of a similar machine on the cloud. This factor totally changes the way you would architect your systems.

The much better hardware means that you can select a simpler architecture and avoid a lot of complexities along the way. Although… we need to talk about the cloud, where the old costs are still very much a factor.

Using AWS as the baseline, I can get a 250GB gp2 SSD drive for 25$ / month. That would give me 750 IOPS for those 25$. That is nice, I guess, but it puts me below what I can get from a modern HDD today. There is burst capability on the cloud, which can smooth out some spikes, but I'll ignore that for now. Let's say that I want higher speed. I can increase the disk size (and hence the IOPS) at a linear rate. The max I can get from gp2 is 16,000 IOPS, at a cost of 533$. Moving to io1 SSD, we can get a 500GB drive with 3,000 IOPS for 257$ per month, and exceeding 20,000 IOPS on a 1TB drive would cost 1,425$.

In contrast, 242$ / month will get us an r5ad.2xlarge machine with 8 cores, 64 GB of RAM and a 300 GB NVMe drive. 1,453$ / month will get us an r5ad.12xlarge with 48 cores, 384 GB of RAM and a 1.8TB NVMe drive. You are better off upgrading the machine entirely, running on top of the local NVMe drive and handling the persistency yourself, than paying the storage costs associated with having the data out as block storage.

This tyranny of I/O costs and performance has had a huge impact on the architecture of many systems. Scale out was not, as usually discussed, a reaction to the limits on handling the number of users. It was a reaction to the limits on how fast the I/O systems could handle concurrent load. With SSD and NVMe drives, we are in a totally different field and need to consider how that affects our systems.

In many cases, you'll want just enough data distribution to ensure high availability. Otherwise, it makes more sense to get fewer but larger machines. The reduction in management overhead alone is usually worth it, but the key aspect is reducing the number of moving pieces in your architecture.

time to read 1 min | 134 words

We ran a hackathon in the office this past week, which was quite a lot of fun. The idea was to build a simple GC for an API built in C.

Here is the original code. This is a (poorly written) C API that leaks memory by design. The idea was to build some form of a GC to handle the leaks.

Here are my attempts:

We then used the opportunity to go a little bit into how the GC in .NET works, as well as to look into how the Boehm GC works.

If you go over the code, please note that it isn’t meant to be idiomatic C and that it was written in a rush. It is meant to be a skeleton for understanding, not anything you’ll actually use.

time to read 3 min | 462 words

We are exposing an API to external users at the moment. This API is currently exposed only via a web app. The internal architecture is that we have a controller using ASP.NET MVC that will accept the request from the browser, validate the data and then put the relevant commands in a queue for backend processing.
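In rough outline (a sketch with made-up names, not our actual code), each action looks something like this:

```csharp
// Sketch with made-up names: the controller validates, builds a command and
// queues it; the heavy lifting happens in the backend that consumes the queue.
public interface ICommandQueue { void Enqueue(object command); }
public class CreateOrderRequest { public string CustomerId { get; set; } }
public class CreateOrderCommand { public string CustomerId { get; set; } }

public class OrdersController : Controller   // the public API version differs mostly in base class and auth
{
    private readonly ICommandQueue _queue;

    public OrdersController(ICommandQueue queue) => _queue = queue;

    [HttpPost]
    public ActionResult Create(CreateOrderRequest request)
    {
        if (!ModelState.IsValid)
            return new HttpStatusCodeResult(400);

        _queue.Enqueue(new CreateOrderCommand { CustomerId = request.CustomerId });

        return new HttpStatusCodeResult(202);  // accepted; processed asynchronously by the backend
    }
}
```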

The initial version of the public API is going to be 1:1 identical to the version we have in the web application. That leads to an obvious question. How do we avoid code duplication between the two?

Practically speaking, there are some differences between the two versions:

  • The authentication is different (session cookie vs. API key)
  • The base class for the controller is different

Aside from those changes, they are remarkably the same. So how should you share the code?

My answer, in this case, was that the proper strategy is Ctrl+C and Ctrl+V. I do not want to share code between those two locations. Here, code duplication is the appropriate strategy for ensuring maintainable code.

Here is my reasoning:

  • This is the entry point to our system, not where the real work is done. The controller is in charge of validation, creating the command to run and sending it to the backend. The real heavy lifting happens elsewhere and will be common regardless.
  • That doesn’t mean that the controller doesn’t have work to do, mind. However, that work is highly dependent on the interaction model. In other words, it depends on what we show to the user in the web application. The web application and the existing controller are deployed in tandem and have no versioning boundary between them. Changes to the UI (adding a new field, for example, or additional behaviors) will be reflected in the controller.
  • The API, on the other hand, is released once and maintained for a long time. We don't want any unforeseen changes here. Changes to the web controller code should not have an impact on the API behavior.

In other words, the fact that we have value equality doesn't mean that we have identity equality. The code is the same right now, sure. And it would be fairly easy to abstract it so we only have it in a single location. But that would create a maintenance burden. Whenever we make a modification in the web portion, we'll have to remember to keep the old behavior for the API. Or we won't remember, and we'll create backward compatibility issues.

Creating separate codebases for each use case results in a decision we make once, and then the most straightforward approach to writing the code will give us the behavior we want.

You might also want to read about the JFHCI approach.
