time to read 9 min | 1608 words

Kelly Sommers has been commenting about transactional models on Twitter.

"Fully ACID writes but BASE reads" - Go back to the books. You can't selectively opt out of guarantees as you wish. What can client observe?

And some other things in the same vein. I am going to assume that this is relevant for RavenDB, based on other tweets from her. And I think that this requires more space than Twitter allows. The gist of Kelly's argument, as I understand it, is that RavenDB isn't ACID because it does BASE reads.

I am not sure that I am doing the argument justice, and it is mostly pieced together from a whole bunch of tweets, so I would love to see a complete blog post about it. However, I think that at least with regards to RavenDB, there is a major misconception going on here.

RavenDB is an ACID database, period. If you put data in, you can be sure that the data will be there and you can get it out again. Indeed, the data you put in is immediately consistent. There is no scenario in which you can do any sequence of read/write/read/write and not immediately get the last committed state for that server. What I think is confusing people is the fact that we have implemented BASE queries. So let us go back a bit and discuss that.

Internally, RavenDB is structured as a set of components. One of those components is the document store, which is responsible for... storing documents (and a lot more besides, but that isn't very important right now). The document store is ACID. It ensures consistency, MVCC, durability, and all that other good stuff.

The document store is also limited in the kind of queries that it supports. In effect, it supports only the following:

  • Document by key
  • Document by key prefix
  • Documents by update order

When talking about ACID, I am talking about this part of the system, in which you are writing documents and loading them by their id or by update order.

Note that in RavenDB, we use the term "load" to refer to accessing the document store. That is probably the cause for confusion. Loading is never inconsistent and obeys a strict snapshot isolation interpretation of the data. That means that asking something like: "give me users/123" is always going to give you the immediately consistent result.
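To make that concrete, here is a minimal sketch using the standard RavenDB client API (the server URL, the document id, and the User class are just for illustration):

using Raven.Client.Document;

public class User
{
    public string Name { get; set; }
}

class LoadExample
{
    static void Main()
    {
        using (var documentStore = new DocumentStore { Url = "http://localhost:8080" }.Initialize())
        {
            using (var session = documentStore.OpenSession())
            {
                session.Store(new User { Name = "Kelly" }, "users/123");
                session.SaveChanges(); // an ACID write against the document store
            }

            using (var session = documentStore.OpenSession())
            {
                // Load goes straight to the document store: you always get the last
                // committed version of "users/123", never a stale result.
                var user = session.Load<User>("users/123");
            }
        }
    }
}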

In contrast to that, RavenDB also has the index store. The index store allows us to create complex queries, and it is subject to eventual consistency. That is by design, and a large part of what makes RavenDB able to dynamically adjust its own behavior at runtime based on production usage.

We have separate background processes that apply the indexes to the incoming data and write them to the actual index store. This is in direct contrast to other indexing systems (an RDBMS, for example), in which indexes are updated as part of the same transaction. That leads to several interesting results.

Indexes aren't as expensive - In an RDBMS, the more indexes you have, the slower your writes become. In RavenDB, indexes have no impact on the write code path. That means that you can have a lot more indexes without worrying about it too much. Indexes still have a cost, of course, but in most systems you just don't feel it.

Lock free reads & writes - Unlike other databases, where you have to pay the indexing cost either on write (common) or on read (rare), with RavenDB both reads & writes operate without having to deal with the indexing cost. When you make a query, we will immediately give you an answer from the results that we have right now. When you perform a write, we will process it without waiting for indexing.

Dynamic load balancing - because we don't promise immediate consistency, we can detect and scale back our indexing costs in times of high usage. That means that we can shift resources from indexing to answering queries or accepting writes. And we will be able to pick up the slack when the peak relaxes.

Choice - one of the things that we absolutely ensure is that we will tell you when you make a query and we give you results that might not be up to date. That means that when you get the reply, you can make an informed decision. "This query is the result of all the data in the index as of 5 seconds ago" - decide if this is good enough for you or if you want to wait. Unlike other systems, we don't force a particular consistency model on you. You get to choose: take the answers we can give you right now, or wait until we can give you the consistent version. That is up to you.
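To make that choice concrete, here is a rough sketch of what it looks like from the client side (this assumes an open IDocumentSession called session and the same kind of illustrative User class as above; the exact customization you pick is up to you):

RavenQueryStatistics stats;

var fastResults = session.Query<User>()
    .Statistics(out stats)                 // the reply carries the staleness information
    .Where(u => u.Name == "Kelly")
    .ToList();

if (stats.IsStale)
{
    // The index hasn't caught up yet. We can decide that this is good enough,
    // or explicitly ask to wait for the consistent version.
    var consistentResults = session.Query<User>()
        .Customize(x => x.WaitForNonStaleResultsAsOfNow())
        .Where(u => u.Name == "Kelly")
        .ToList();
}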

Smarter indexes - under the hood, RavenDB uses Lucene for indexes. Even without counting things like full text or spatial searches, we get a lot more than what you get from the type of indexes you are used to in other systems.

For example, from the MongoDB docs (http://docs.mongodb.org/manual/core/index-compound/):

db.events.find().sort( { username: -1, date: 1 } )

The following index can support both these sort operations:

db.events.ensureIndex( { "username" : 1, "date" : -1 } )

However, the above index cannot support sorting by ascending username values and then by ascending date values, such as the following:

db.events.find().sort( { username: 1, date: 1 } )

In contrast, in RavenDB it is sufficient that you just tell us what you want to index; you don't need to worry about creating two indexes (and paying twice the price) if you want to allow users to sort the grid in multiple directions.
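For example, a single static index along these lines (a sketch; the Event class and index name are made up for illustration) can serve sorting by username and date in either direction:

using System;
using System.Linq;
using Raven.Client.Indexes;

public class Event
{
    public string Username { get; set; }
    public DateTime Date { get; set; }
}

public class Events_ByUsernameAndDate : AbstractIndexCreationTask<Event>
{
    public Events_ByUsernameAndDate()
    {
        // Index both fields once; queries can then sort by either of them,
        // ascending or descending, without needing a second index.
        Map = events => from e in events
                        select new { e.Username, e.Date };
    }
}

A query against it can then use .OrderBy(x => x.Username).ThenBy(x => x.Date), or any other combination of directions, and it is still the same single index doing the work.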

On the fly optimizations - the decision to separate the document store and the index store into separate components has paid off in ways that we didn't really predict. Because we have that separation, and because indexes are background processes that don't really impact the behavior of the rest of the system (outside of the resources consumed), we have the freedom to do some interesting things. One of them is live index rebuild.

That means that you can create an index in RavenDB, and the system would go ahead and create it. At the same time, all other operations will go on normally. It can be writes to the document store, it can be updates to the other indexes. Anything at all.

Just to give you some idea about this feature, note that it exists only in the Enterprise edition of SQL Server, for example.

And the implication is that we can actually create an index in production. In fact, we do so routinely; that is the basis of our ad hoc querying ability. We analyze the query, the query optimizer decides where to dispatch it, and it will create a new index if that is needed. There is a whole host of behavior just in this one feature (optimizations, heuristics, scale back, etc.) that we can do, but for the most part, you don't care. You make queries to the server. RavenDB analyzes them and produces the most efficient indexes to match your actual production load.
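For instance, a dynamic query like the following (a sketch, reusing the illustrative session and User class from the earlier snippets) is all it takes; there is no index defined anywhere for it up front:

// The query optimizer analyzes this query, dispatches it to an existing index if one
// matches, or creates an automatic index for it on the fly, all while the server
// keeps serving reads and writes normally.
var users = session.Query<User>()
    .Where(u => u.Name == "Kelly")
    .ToList();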

Anyway, I am starting to go on about RavenDB, and that is rarely short. So I'll summarize. RavenDB is composed of two distinct parts: the ACID document store, which gives you absolute immediate consistency for both reads & writes, and the index store, which is eventually consistent, updated in the background, and always available for queries. Queries never wait (unless you explicitly ask them to), which gives you much better and more consistent performance.

Oh, and a final number to conclude. The average time between a document update in the document store and the results showing up in the index store? That is 23ms*.

(And I blame 15.6ms of that on the clock resolution).

time to read 2 min | 249 words

There are 6 major features for RavenDB 3.0 that we want to keep as surprises. One of them you've already learned about, the HTML5 studio. And we'll reveal the others as they become viable to show.

But leaving those major features aside, there is a lot of stuff that we are doing that would deserve a bullet point all on its own. And today I want to talk about one of those features, S2S Smuggling.

Basically, assume that you have a server at location P (for Production) and you want to get the state of this server to your database at location D (for development). Right now, you have to export the P database, copy the file to your own machine and import it. That isn't a major hassle, but it is a hassle.

Instead, in RavenDB 3.0, you can go to a server, point it at another server and just see the data streaming by. We even support doing this multiple times, and only the changes will be moved between the servers.

Yes, you can do that right now with the replication bundle. But not all databases are replicated, and it is simpler and easier to just move the documents as if there is nothing there. 

The general idea is to be more convenient, but it is also likely to be much faster than the current method of: server > file, copy file, file > server. If only because we can do two way streaming, and plug the outgoing data directly into a bulk insert pipe.

time to read 4 min | 623 words

We got a support request from a customer, and this time it was quite a doozy. They were using the FreeDB dataset as a test bed for a lot of experiments, and they found very slow indexing speed with it. In fact, what they found was utterly unacceptable indexing speed, to be frank. It took days to complete. However, that ran contrary to all of the optimizations that we have done in the past few years.

So something there was quite fishy. Luckily, it was fairly simple to reproduce. And the culprit was very easily identified. It was the SQL Replication Bundle. But why? That turned out to be a pretty complex answer. 

The FreeDB dataset currently contains over 3.1 million records, and the sample the customer sent us had about 8 indexes, varying in complexity from the trivial to full text analyzed and map reduce indexes. We expect such a workload, plus the replication to SQL, to take a while. But it should be a pretty fast process.

What turned out to be the problem was the way the SQL Replication bundle works. Since we don't know what changes you are going to replicate to SQL, the first thing we do is delete all the data that might have previously been replicated. In practice, it means that we execute something like:

DELETE FROM Disks WHERE DocumentId in (@p1, @p2, @p3)

INSERT INTO Disks (...) VALUES( ... )
INSERT INTO Disks (...) VALUES( ... )
INSERT INTO Disks (...) VALUES( ... )

And over time, this became slower & slower. Now, SQL Replication needs access to a lot of documents (potentially all of them), so we use the same prefetcher technique that we use for indexing. And we also use the same optimizations to decide how much to load.

However, in this case, SQL Replication was slow, and because we use the same optimization, to the optimizer it looked like we had a very slow index. That calls for reducing the load on the server, so we can have greater responsiveness and reduce overall resource utilization. And that impacted the indexes. In effect, SQL Replication being slow forced us to feed the data into the indexes in small tidbits, and that drastically increased the I/O costs that we had to pay.

So the first thing to do was to actually break them apart: we now have different optimizer instances for indexes and SQL Replication (and RavenDB Replication, for that matter), and they cannot impact one another.

But the root cause was that SQL Replication was slow. And I think you should be able to figure out why from the information outlined above.

As we replicated more and more data into SQL, we increased the table size. And as we increased the table size, statements like our DELETE would take more and more time. SQL was doing a lot of table scans.

To be honest, I never really thought about it. RavenDB in the same circumstances would just self optimize and things would quickly get faster. SQL Server (and any other relational database) would just be dog slow until you came and added the appropriate index.
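For reference, here is a sketch of what the fix on the SQL side looks like (assuming SQL Server and the Disks table from the statements above; the connection string and index name are made up):

using System.Data.SqlClient;

class AddReplicationTargetIndex
{
    static void Main()
    {
        // Hypothetical connection string for the replication target database.
        var connectionString = "Data Source=.;Initial Catalog=FreeDB;Integrated Security=True";

        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            // Index the DocumentId column so the DELETE ... WHERE DocumentId IN (...)
            // statements issued before every re-replication become index seeks
            // instead of table scans.
            using (var command = new SqlCommand(
                "CREATE INDEX IX_Disks_DocumentId ON Disks (DocumentId)", connection))
            {
                command.ExecuteNonQuery();
            }
        }
    }
}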

Once that was done, our performance was back on track and we could run things speedily both for indexes and for SQL Replication.

time to read 3 min | 447 words

I got a request from a customer to look at the resource utilization of RavenDB under load. In particular, the customer loaded the entire FreeDB data set into RavenDB, then set up SQL Replication and a set of pretty expensive indexes (map reduce, full text searches, etc.).

This was a great scenario to look at, and we did find interesting things that we can improve on. But during the testing, we had the following recordings taken. In fact, all of them were taken within a minute of each other:

[Three screenshots of the server's resource utilization during the test]

The interesting thing about them is that they don't indicate a problem. They indicate a server that is making as much use of the resources available to it as it can, in order to handle the workload that it has.

While I am writing this, there is about 1GB of free RAM that we've left for the rest of the system, but basically, if you have a big server, and you give it a lot of work to do, it is pretty much a given that you'll want to get your money's worth from all of that hardware. The problem in the customer's case was that while it was very busy, it didn't get things done properly, but that is a side note.

What I want to talk about is the assumption that a server using a lot of resources is a bad thing. In fact, that isn't true, assuming that it is using them properly. For example, if you have 32GB of RAM, there is little point in trying to conserve memory utilization. And there is all the reason in the world to try to prepare, ahead of time, answers to upcoming queries. So we might be using CPU even in idle moments. The key here is how you detect when you are overusing the system resources.

If you are under memory pressure, it is best to let go of that cache. And if you are busy handling requests, it is probably better to reduce the load of background tasks. We've been doing this sort of auto tuning in RavenDB for a while. A lot of the changes between 1.0 and 2.0 were done around that area.

time to read 5 min | 984 words

I mentioned that this piece of code has an issue:

public class LocalizationService
{
    MyEntities _ctx;
    Cache _cache;

    public LocalizationService(MyEntities ctx, Cache cache)
    {
        _ctx = ctx;
        _cache = cache;
        Task.Run(() =>
        {
            foreach(var item in _ctx.Resources)
            {
                _cache.Set(item.Key + "/" + item.LanguageId, item.Text);
            }
        });
    }    

    public string Get(string key, string languageId)
    {
        var cacheKey = key +"/" + languageId;
        var item = _cache.Get(cacheKey);
        if(item != null)
            return item;

        item = _ctx.Resources.Where(x => x.Key == key && x.LanguageId == languageId)
            .Select(x => x.Text) // cache the resource text itself, not the entity
            .SingleOrDefault();
        _cache.Set(cacheKey, item);
        return item;
    }
}

And I am pretty sure that a lot of you will be able to find additional issues that I've not thought about.

But there are at least three major issues in the code above. It doesn't do anything to solve the missing value problem, it doesn't have good handling for expiring values, and it has no way to handle changing values.

Look at the code above, assume that I am making continuous calls to Get(“does not exists”, “nh-YI”), or something like that. The way the code is currently written, it will always hit the database to get that value.

The second problem is that if we have had a cache cleanup run, which expired some values, we will actually load them one at a time, in pretty much the worst possible way from the point of view of performance.

Then we have the problem of how to actually handle updating values.

Let us see how we can at least approach this. We will replace the Cache with a ConcurrentDictionary. That will mean that the data cannot just go away from under us, and since we expect the number of resources to be relatively low, there is no issue in holding all of them in memory.

Because we know we hold all of them in memory, we can be sure that if the value isn’t there, it isn’t in the database either, so we can immediately return null, without checking with the database.

Last, we will add a StartRefreshingResources task, which will do the actual refreshing in an async manner. In other words:

public class LocalizationService
{
    MyEntities _ctx;
    ConcurrentDictionary<Tuple<string,string>,string> _cache = new ConcurrentDictionary<Tuple<string,string>,string>();

    Task _refreshingResourcesTask;

    public LocalizationService(MyEntities ctx)
    {
        _ctx = ctx;
        StartRefreshingResources();
    } 

    public void StartRefreshingResources()
    {
         _refreshingResourcesTask = Task.Run(() =>
        {
            foreach(var item in _ctx.Resources)
            {
                _cache[Tuple.Create(item.Key, item.LanguageId)] = item.Text;
            }
        });
    }

    public string Get(string key, string languageId)
    {
        var cacheKey = Tuple.Create(key, languageId);
        string item;
        // ConcurrentDictionary has no Get/Set; use TryGetValue and the indexer instead.
        _cache.TryGetValue(cacheKey, out item);
        if(item != null || _refreshingResourcesTask.IsCompleted)
            return item;

        item = _ctx.Resources.Where(x => x.Key == key && x.LanguageId == languageId)
            .Select(x => x.Text) // cache the resource text itself, not the entity
            .SingleOrDefault();
        _cache[cacheKey] = item;
        return item;
    }
}

Note that there is a very subtle thing going on in here. As long as the async process is running, if we can't find the value in the cache, we will go to the database to find it. This gives us a good balance between stopping the system entirely for startup/refresh and having the values immediately available.

time to read 6 min | 1075 words

The following method comes from the nopCommerce project. Take a moment to read it.

public virtual string GetResource(string resourceKey, int languageId,
    bool logIfNotFound = true, string defaultValue = "", bool returnEmptyIfNotFound = false)
{
    string result = string.Empty;
    if (resourceKey == null)
        resourceKey = string.Empty;
    resourceKey = resourceKey.Trim().ToLowerInvariant();
    if (_localizationSettings.LoadAllLocaleRecordsOnStartup)
    {
        //load all records (we know they are cached)
        var resources = GetAllResourceValues(languageId);
        if (resources.ContainsKey(resourceKey))
        {
            result = resources[resourceKey].Value;
        }
    }
    else
    {
        //gradual loading
        string key = string.Format(LOCALSTRINGRESOURCES_BY_RESOURCENAME_KEY, languageId, resourceKey);
        string lsr = _cacheManager.Get(key, () =>
        {
            var query = from l in _lsrRepository.Table
                        where l.ResourceName == resourceKey
                        && l.LanguageId == languageId
                        select l.ResourceValue;
            return query.FirstOrDefault();
        });

        if (lsr != null) 
            result = lsr;
    }
    if (String.IsNullOrEmpty(result))
    {
        if (logIfNotFound)
            _logger.Warning(string.Format("Resource string ({0}) is not found. Language ID = {1}", resourceKey, languageId));
        
        if (!String.IsNullOrEmpty(defaultValue))
        {
            result = defaultValue;
        }
        else
        {
            if (!returnEmptyIfNotFound)
                result = resourceKey;
        }
    }
    return result;
}

I am guessing, but I assume that the intent here is to have a tradeoff between startup time and system responsiveness. If you have LoadAllLocaleRecordsOnStartup set to true, it will load all the data from the database, and access it from there. Otherwise, it will load the data one piece at a time.

That is nice, but it shows a single tradeoff, and that isn't a really good idea. Not only that, but look how it uses the cache. There are separate entries in the cache for the resources depending on whether they are loaded via GetAllResourceValues() or individually. That leaves the cache with a lot fewer options when it needs to clear items. The cache deciding that it can remove a single item would result in a very expensive and long query taking place.

Instead, we can do it like this:

public class LocalizationService
{
    MyEntities _ctx;
    Cache _cache;

    public LocalizationService(MyEntities ctx, Cache cache)
    {
        _ctx = ctx;
        _cache = cache;
        Task.Run(() =>
        {
            foreach(var item in _ctx.Resources)
            {
                _cache.Set(item.Key + "/" + item.LanguageId, item.Text);
            }
        });
    }    

    public string Get(string key, string languageId)
    {
        var cacheKey = key +"/" + languageId;
        var item = _cache.Get(cacheKey);
        if(item != null)
            return item;

        item = _ctx.Resources.Where(x => x.Key == key && x.LanguageId == languageId)
            .Select(x => x.Text) // cache the resource text itself, not the entity
            .SingleOrDefault();
        _cache.Set(cacheKey, item);
        return item;
    }
}

Of course, this has a separate issue, but I’ll discuss that in my next post.

time to read 1 min | 112 words

Over a year ago I took the FreeDB data set and imported it into RavenDB, to see how it behaves with very large data sets. We got a customer request to handle a problem that was related to this data set, so I had to re-create the database.

The numbers are interesting.

  • Before optimization (2012): two hours
  • After optimization (2012): 42 minutes
  • Current status: 28 minutes

Seeing this, I think that I am going to be re-running the entire perf suite, just to see what the numbers look like now.

time to read 2 min | 208 words

We have run into a problem in our production system. Luckily, this is a pretty obvious case of misdirection.

Take a look at the stack trace that we have discovered:

[Image: the stack trace we captured, showing the exception thrown from Dispose, which was called from the constructor]

The interesting bit is that this is an “impossible” error. In fact, this is the first time that we have actually seen this error ever.

But looking at the stack trace tells us pretty much everything. The error happens in Dispose, but that one is called from the constructor. Because we are using native code, we need to make sure that an exception in the ctor will properly dispose all unmanaged resources.

The code looks like this:

[Image: the constructor code, which opens a transaction and calls Dispose when an exception occurs]

And here we can see the probable cause of the error. We try to open a transaction, then an exception happens, and Dispose is called, but it isn't ready to handle this scenario, so it throws.

The original exception is masked, and you are left with a hard debugging scenario.
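Since the actual code is only visible in the screenshots above, here is a minimal sketch of the pattern (all names are illustrative, not the real Voron types):

using System;

public class NativeResourceOwner : IDisposable
{
    private IDisposable _transaction; // stands in for the native transaction handle

    public NativeResourceOwner()
    {
        try
        {
            _transaction = OpenTransaction(); // throws before _transaction is assigned
        }
        catch
        {
            // We must not leak unmanaged resources if the ctor fails...
            Dispose(); // ...but Dispose assumes a fully constructed object, so it throws,
            throw;     // and the exception from Dispose masks the original one.
        }
    }

    public void Dispose()
    {
        // Not written to handle a partially constructed object: this is a
        // NullReferenceException when the ctor failed before the assignment.
        _transaction.Dispose();
    }

    private static IDisposable OpenTransaction()
    {
        throw new InvalidOperationException("Simulated failure while opening a transaction");
    }
}

The fix is to make Dispose tolerant of a partially constructed object, so the original exception is the one that surfaces.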

Voron’s Log

time to read 9 min | 1650 words

I was thinking, which is a bit dangerous, about Voron’s backup story, and things sort of started to roll down the hill from there.

Nitpicker note: None of the things that I am discussing here is actually novel in any way, I am well aware of that. I am articulating my thought process about a problem and how it went down from an initial small feature to something much bigger. Yes, I am aware of previous research in the area. I know that Xyz is also doing that. Thank you in advance.

Voron’s backup story is a thing of beauty, and I can say that because I didn’t write it. I took it all from LMDB. Basically, just copy the entire file in a safe and transactional manner. However, it cannot do incremental backups. So I started thinking about what would need to happen to get that, and I found that it is actually a pretty easy proposition.

Whenever we commit a transaction, we have a list of modified pages that we write back to the data file. An incremental backup strategy could be just:

  • Do a full backup – copy the full data file and put it somewhere.
  • Keep track of all the modified pages in a log file.
  • An incremental backup is then just copying the change log elsewhere and restarting the change log.

Restoring would happen by taking the full data file and applying the changes to it. This looks hard, but not when you consider that we always work in pages, so we can just have a change log format like this:

[Image: the change log format, a sequence of page records (page number plus page contents), e.g. pages 2, 3, 13, 19]

The process for applying the change log then becomes something like this:

foreach (LogEntry entry in GetLogEntryFiles())
{
    foreach (Page page in entry.Pages)
    {
        // copy each modified page from the log into its place in the data file
        WritePageToDataFile(page.PageNumber, page.Base);
    }
}

And voila, we have incremental backups.

Of course, this means that we will have to write the data twice, once for the incremental log file, and once for the actual data file. But that led me to another series of thoughts. One of the areas where we can certainly do better is random writes. This is an area that we have a particular problem with. I already discussed this here. But if we already have a log file (which we need to support incremental backups), and if we already write twice… why not take advantage of that.

Instead of having a change log, we can have a write ahead log. That turns all of our writes into sequential writes, leading to much faster performance overall.

The downside of write ahead logging is that you are going to need to do a few important things. To start with, there is a point where you need to spill the write ahead log into the actual file. That leads to doubling the costs of writes, in comparison to LMDB’s usual manner of only a single write. It also means that you now have to implement some form of recovery, since we need to read the log at startup.

The benefits however are pretty good, I think. There is a reason why most databases have some form of Write Ahead Log to make sure that their writes are sequential. And the nice thing about it is that we can now flush only very clearly defined section of the file, and only do fsyncs on the actual log file.

The idea now becomes:

In the data file header we add a record that indicates which log file was last flushed, and up to where in that log file we flushed. We pre-allocate a 64MB log file, and start writing to it using a memory map.

We also write to the end of the mapped range, so we can safely state which section of the file needs to be flushed & fsynced, and it is always at the (logical) end of the file. Once every X (where X is time, size, or just idleness), we take all the records from the log file, from the last position we flushed up to the current location, apply them to the data file, and then update the data file header accordingly.

The fun part about this is that it is absolutely safe in the case of failure. If we failed midway through for any reason, we can just re-apply the process from the beginning and there can be no corruption. There is also a good chance of being able to do additional optimizations here, because we can merge a lot of writes into a single operation and then make sure that we access the disk in a predictable and optimal fashion. All of this means that this can be an async process that is allowed to fail, which is very good from our perspective, because it means we can just kick it off with low priority to avoid impacting regular write performance.

Until we get the data flushed to the data file, we maintain an in memory map of pages. In the image above, we will have a map for pages 2,3,13,19 and when they are requested, we will give out the address of the memory mapped log file entry at the appropriate location.
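A rough sketch of that page map (an assumed shape, not the actual Voron code) might look like this:

using System.Collections.Generic;

// Until a page is flushed to the data file, reads are served from the memory mapped
// log file at the position recorded here.
public class LogPageMap
{
    public struct LogPosition
    {
        public long LogFileNumber;
        public long OffsetInLogFile;
    }

    private readonly Dictionary<long, LogPosition> _pages = new Dictionary<long, LogPosition>();

    public void RecordWrite(long pageNumber, LogPosition position)
    {
        _pages[pageNumber] = position; // a newer transaction simply overwrites the older entry
    }

    public bool TryGetUnflushedPage(long pageNumber, out LogPosition position)
    {
        return _pages.TryGetValue(pageNumber, out position);
    }

    public void Clear()
    {
        _pages.Clear(); // called once the "flush to data file" process has caught up
    }
}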

One thing that I want to avoid is stalls, like what happens to LevelDB under heavy sustained writes. So when the log file is full, we just create a new log file and write to that. It is possible that under very heavy load the "flush to data file" process will take long enough that the new log file will also be full before the process is done. Unlike LevelDB, we are going to just go ahead and create a new log file. So when the "flush to data file" process resumes, it may need to process multiple log files. That is fine with us, and likely to be even better, because free space reuse will mean that we can probably skip writing pages that were overwritten by newer transactions (for example, above, we only need to write page 3 once).

There are a bunch of “details” that still need to be done:

  • What happens if you have a single transaction that is bigger than the size of the log file?
    • We can reject that transaction, 64MB per transaction is a lot.
    • Probably it means that we need to have a log file dedicated to that transaction, but how do we detect / decide that? That doesn't seem workable.
    • Probably better would be to keep the 64MB size but say that this is a transaction start only, and that we need to process the transaction end before we consider it committed. This is very similar to the way the LevelDB log works.
  • Need to think about ways to reduce the impact of flushing the logs:
    • Maybe provide a way for the user to “suppress” this for a period of 1 minute. We can use that during things like bulk insert, to get very high speed. And as long as the bulk insert goes, we will keep asking for an extension. (I don’t want to have on/off switches, because that means we have to be very careful about turning it on again).
    • Maybe we can detect the rate of writes and adjust accordingly? But what happens when we are running multiple databases at the same time?
  • Transaction logs should be named 0000000001.txlog, 0000000002.txlog, etc., because I keep getting them sent to me as the "database log" during debug scenarios.
  • Need to make sure that this works properly for large values. So values that take more than a single page are supported.
    • To make things complicated, what happens if we only have room for 3 more pages in the log file, but we need to write a value that is 5 pages long? We need to mark the transaction as "tx started", move to the next log file, and continue the transaction.
    • For that matter, what if we have a single value that is bigger than the log file? Is that something that we want to support? We can do the same as the "tx started" marker, but maybe also indicate it with "tx large value start", "tx large value mid", "tx large value end" markers.
    • For both “tx started” and “tx large value started” we need to be sure that we consider the transaction aborted unless we have the “tx end” part.
  • On startup, we need to read the log file(s) from the last flushed log location and upward and build the map of the pages contained in the log files. This introduces a "recovery" step of a sort, but I think that this should be a pretty small one overall (see the sketch after this list).
  • We need to make sure to delete old log files (unless users want them for incremental backup purposes). And we need to make sure that the memory map for read transactions is kept alive as long as there are any transactions that may be reading from it.
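As a small illustration of that recovery scan (assumed names and layout, not the actual Voron code), the startup step boils down to enumerating the numbered log files and replaying everything after the last flushed position:

using System;
using System.IO;
using System.Linq;

class RecoveryScan
{
    static void Main()
    {
        var dataDirectory = ".";        // hypothetical database directory
        long lastFlushedLogNumber = 1;  // would come from the data file header

        var logFiles = Directory.GetFiles(dataDirectory, "*.txlog")
            .Select(path => new
            {
                Path = path,
                Number = long.Parse(Path.GetFileNameWithoutExtension(path))
            })
            .Where(f => f.Number >= lastFlushedLogNumber)
            .OrderBy(f => f.Number);

        foreach (var logFile in logFiles)
        {
            // In the real thing, this would read the page records and rebuild
            // the in-memory page map described earlier.
            Console.WriteLine("Would replay {0}", logFile.Path);
        }
    }
}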

Overall, I think that this is a pretty radical departure in our internal process compared to how LMDB works, but I have high hopes for it. Even though it started with "do we really need to support incremental backups?"

My expectation is that we will see much higher writes, and about the same numbers for reads. Especially for our scenarios.
