Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 15 min | 2857 words

I was asked about the problem of repeating property names in RavenDB. In particular, let us consider the following document:

{
    "FirstName": "John",
    "LastName": "Doe",
    "BirthDay": "1980-01-01",
    "LastLoggedInAt": "2013-08-10"
}

There are 40 chars in this document dedicated solely to field names, and just 28 chars dedicated to actual data. Now, 40 vs. 28 means that if we have 1 million of those documents, we are losing a lot of space for no good reason. Surely the database can do better on this, right? It could tokenize those things, so instead of storing 40 bytes for field names, it would store only 16 (assuming int32 token ids). And for larger documents, you would see an even more massive saving.
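To make the tokenization idea concrete, here is a minimal sketch (illustrative only, not RavenDB code) of what interning property names into int32 tokens looks like, along with the reverse lookup that every reader would need:

using System;
using System.Collections.Generic;

// A toy illustration of interning property names: each unique name is
// replaced by an int token, and a side table maps tokens back to names.
// This is not how RavenDB stores documents; it just shows the idea.
public static class FieldInterningSketch
{
    static readonly Dictionary<string, int> NameToToken = new Dictionary<string, int>();
    static readonly List<string> TokenToName = new List<string>();

    public static int Intern(string propertyName)
    {
        int token;
        if (NameToToken.TryGetValue(propertyName, out token))
            return token;

        token = TokenToName.Count;
        NameToToken.Add(propertyName, token);
        TokenToName.Add(propertyName);
        return token;
    }

    public static string Resolve(int token)
    {
        return TokenToName[token]; // fails if the intern table is out of sync
    }

    public static void Main()
    {
        // 40 chars of property names become 4 tokens (16 bytes as int32s)...
        var tokens = new[] { Intern("FirstName"), Intern("LastName"), Intern("BirthDay"), Intern("LastLoggedInAt") };

        // ...but every reader now needs the same table to make sense of the data.
        foreach (var t in tokens)
            Console.WriteLine("{0} -> {1}", t, Resolve(t));
    }
}

Note how the whole scheme only works if everyone who reads the data shares the same table, which is exactly where the trouble starts.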

I think that I talked about this before... oh yes, here it is: http://ayende.com/blog/4669/you-saved-5-cents-and-your-code-is-not-readable-congrats, in which you manually (or via config on the client side) change the document to look like this:

{
    "FN": "John",
    "LN": "Doe",
    "BD": "1980-01-01",
    "LLD": "2013-08-10"
}

This is pretty horrible, and I cover exactly why this is bad in the blog post above. Now, for 1 million documents, those 40 chars represent about 38MB of disk space. Not really a major issue, I think.

The actual question I was asked was (by Justin):

Repeating keys is so silly MongoDB has a highly voted issue, marked major with a plan to fix. And many client libraries have code built that tries to shorten names for you:

https://jira.mongodb.org/browse/SERVER-863

I personally think reading and writing 2x or more data then necessary is not a silly issue, it impacts all aspects of performance and database management. Disk are cheap sure, then again if your database is smaller you might have been able to put it on a ssd or faster but smaller drive, or just transferred less bytes from that cheap disk to accomplish the same thing. Are you really going to tell me every byte doesn't make a difference after examining these c code bases that do their best to save every byte for performance reasons?

The ugly solution shown above is a manual approach, but surely the db can do better, right? Interning property names is relatively simple conceptually, but a lot more complex in practice.

For example, if you want to do interning, you probably want to do it end to end. Which means that the client libs need to understand the server tokens. That means that you need some way of informing the client when a new interned property is added; otherwise, it wouldn't be able to figure out the real property name. Then there is the issue of how you are going to handle sharding or replication. Is each server going to have its own unique copy of the intern table? Is each client going to have to navigate that? Or are you going to try to have a single intern table, and somehow manage to agree on all the positions in that table among a distributed set of nodes?

Imagine what is going to happen during failover: you suddenly need to get the intern table for each of the replicas. And making sure that you aren't using the wrong intern table is going to be fun. Not to mention what happens when you have different clients talking to different servers, especially if you had a network split.

Even assuming that you don't care about any of that, and only want this for saving disk I/O on the server, you are going to run into issues with concurrency management around the intern list. If you have two transactions adding values that both require modifying the intern list, you have an interesting problem. Not least because if you save just the property name id to the data store, instead of the full key, that id must only be usable if the transaction that added it actually committed; otherwise you have an unknown (or wrong) property name when you read it back. So now you have concurrency issues, and the potential for conflicts, because two transactions may be modifying the values of the intern table at the same time. For fun, assume that they both have documents with the same property name that needs to be added, as well as other property names.

And that doesn't even take into account that it is actually pretty common to have dynamic property names. For example, dates & times. You have 1 million documents, and each of them has a property with a name like: "2013-08-03T19:57:32.7060000Z"

A common scenario where this occurs is tracking things by time. For instance, if you want to track the amount of hours you worked on a bug. The document might look like this:

{
    "2013-08-03T19:57:32.7060000Z": {
        "Project": "RavenDB",
        "RavenDB-1248": "Fixed",
        "RavenDB-1212": "Progress made"
    }
}

As you can see, we actually have multiple properties with dynamic names here. It would be a waste to try to intern them. In fact, it is likely going to cause us to fall over and die. And wait until you have to pass this to the client... fun & joy all around.

But you know what, I can still probably go and figure out what the 10,000 most common field names are, and have a fixed table of those in place. This would give me the ability to handle things like Name, FirstName, Date, etc. Because it is a fixed list, a lot of the problems listed above don't matter. It would mean, though, that if you wanted to call your property FirstNames, you might need to call it Names instead, because the former wouldn't be on the list and wouldn't be interned.
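As a rough illustration of that fixed-table idea (the names and the tiny table here are made up, this is not an actual RavenDB feature), the encoding side might look something like this:

using System;
using System.Collections.Generic;

// Sketch of a *fixed* intern table: a hard-coded list of common property
// names. Unknown names are stored as-is, which is exactly why "FirstNames"
// would not get the benefit while "FirstName" would.
public static class FixedInternTableSketch
{
    // In a real system this list would be much larger and frozen per format version.
    static readonly string[] WellKnownNames = { "Name", "FirstName", "LastName", "Date", "Email" };
    static readonly Dictionary<string, int> Lookup = BuildLookup();

    static Dictionary<string, int> BuildLookup()
    {
        var lookup = new Dictionary<string, int>(StringComparer.Ordinal);
        for (int i = 0; i < WellKnownNames.Length; i++)
            lookup.Add(WellKnownNames[i], i);
        return lookup;
    }

    // Returns either a small token or the original name.
    public static object Encode(string propertyName)
    {
        int token;
        return Lookup.TryGetValue(propertyName, out token) ? (object)token : propertyName;
    }

    public static void Main()
    {
        Console.WriteLine(Encode("FirstName"));  // 1  (interned)
        Console.WriteLine(Encode("FirstNames")); // FirstNames (not on the list, stored in full)
    }
}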

And that leads to a lot of interesting potential issues with backward and forward compatibility. How do I add more terms to the list? Older clients wouldn't be able to understand the tokens, and that leads to additional complications just managing the versioning story. And I can assure you that you'll have a lot of people clamoring to add their own unique properties to that list. (What do you mean 'PensionFundTarget' isn't on the intern list? It is a really common term and we could really use the perf boost from that!)

And then we have the issue of debugging. Right now, you can watch what is going on over the wire using Fiddler very easily, and the values are easily understood. Do you really want to try to figure out data like this?

{
    "231": "John",
    "432": "Doe",
    "127": "1980-01-01",
    "5841": "2013-08-10"
}

In pretty much every aspect you can think of, this is a really crazy hard feature, and the benefits that it brings aren’t really that important.

Sure, you can reduce the amount of data that you read from the disk. But that data is already sitting in the same place. And there is absolutely no difference between reading a 30-byte value and a 200-byte value (reads/writes are always done at a minimum of 512 bytes). So reading a few extra bytes means pretty much nothing to us. We don't have to seek to it, it is already there. And anyway, we have caches to alleviate that problem for us.

And reading from the disk isn't really our major problem; that is relatively cheap compared to sending the data to the user over the network. But wait, we have actually found a solution for that: we use gzip compression on the wire, a supported and well-known method that tools like Fiddler can easily see through. That usually gives you the best of all worlds. You get a small response size, you still use a readable format, and you don't have to deal with all of the issues mentioned above. Happy happy joy joy!

Hm… can we not apply the same method internally on the server side? Sure we can. We can compress & decompress data on the fly when we write / read to the disk, further reducing the amount of data that we write.

In RavenDB, we do both. Data is compressed on the wire, and can be compressed to the disk as well.
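To illustrate the principle (this is just a sketch using GZipStream, not the actual RavenDB storage code), compressing on write and decompressing on read can be as simple as wrapping the streams:

using System.IO;
using System.IO.Compression;
using System.Text;

// Minimal sketch: compress a JSON document before writing it to disk and
// decompress it on read. A real storage engine works at a lower level,
// but the principle is the same - keep the readable format, and pay with
// a bit of CPU instead of with an unreadable encoding.
public static class CompressionSketch
{
    public static void WriteCompressed(string path, string json)
    {
        using (var file = File.Create(path))
        using (var gzip = new GZipStream(file, CompressionMode.Compress))
        using (var writer = new StreamWriter(gzip, Encoding.UTF8))
        {
            writer.Write(json);
        }
    }

    public static string ReadCompressed(string path)
    {
        using (var file = File.OpenRead(path))
        using (var gzip = new GZipStream(file, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}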

As for the MongoDB issue, it may be highly voted, and it may be marked as major, but it has been sitting in the issue tracker for the past 3+ years. I would say that they aren't in any rush to resolve it. Probably for much the same reasons as I outlined here. It is cheap to vote for an issue, and usually the people setting the issue severity are the people creating it. I wouldn't put too much stock in anything like that.

Put simply, trying to solve this is a really hard problem, with a lot of cascading implications and dangerous side effects. Conversely, there are more effective solutions that are cheaper, simpler and orthogonal to everything else.

I think you can guess what I would pick.

time to read 92 min | 18267 words

This post was written at 5:30 AM. I ran into this while doing research for another post, and I couldn't really let it go.

XML as a text-based format is really wasteful in space. But that wasn't what really made it lose its shine. That happened when it became so complex that it stopped being human readable. For example, I give you:

   1: <?xml version="1.0" encoding="UTF-8" ?>
   2:  <SOAP-ENV:Envelope
   3:   xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
   4:   xmlns:xsd="http://www.w3.org/1999/XMLSchema"
   5:   xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
   6:    <SOAP-ENV:Body>
   7:        <ns1:getEmployeeDetailsResponse
   8:         xmlns:ns1="urn:MySoapServices"
   9:         SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
  10:            <return xsi:type="ns1:EmployeeContactDetail">
  11:                <employeeName xsi:type="xsd:string">Bill Posters</employeeName>
  12:                <phoneNumber xsi:type="xsd:string">+1-212-7370194</phoneNumber>
  13:                <tempPhoneNumber
  14:                 xmlns:ns2="http://schemas.xmlsoap.org/soap/encoding/"
  15:                 xsi:type="ns2:Array"
  16:                 ns2:arrayType="ns1:TemporaryPhoneNumber[3]">
  17:                    <item xsi:type="ns1:TemporaryPhoneNumber">
  18:                        <startDate xsi:type="xsd:int">37060</startDate>
  19:                        <endDate xsi:type="xsd:int">37064</endDate>
  20:                        <phoneNumber xsi:type="xsd:string">+1-515-2887505</phoneNumber>
  21:                    </item>
  22:                    <item xsi:type="ns1:TemporaryPhoneNumber">
  23:                        <startDate xsi:type="xsd:int">37074</startDate>
  24:                        <endDate xsi:type="xsd:int">37078</endDate>
  25:                        <phoneNumber xsi:type="xsd:string">+1-516-2890033</phoneNumber>
  26:                    </item>
  27:                    <item xsi:type="ns1:TemporaryPhoneNumber">
  28:                        <startDate xsi:type="xsd:int">37088</startDate>
  29:                        <endDate xsi:type="xsd:int">37092</endDate>
  30:                        <phoneNumber xsi:type="xsd:string">+1-212-7376609</phoneNumber>
  31:                    </item>
  32:                </tempPhoneNumber>
  33:            </return>
  34:        </ns1:getEmployeeDetailsResponse>
  35:    </SOAP-ENV:Body>
  36: </SOAP-ENV:Envelope>

After XML was thrown out of the company of respectable folks, we had JSON show up and entertain us. It is smaller and more concise than XML, and so far it has resisted the efforts to make it into some sort of an uber-complex enterprisey tool.

But today I ran into quite a few efforts to do strange things to JSON. I am talking about things like JSON DB (a compressed JSON format, not an actual JSON database), JSONH, json.hpack, and friends. All of those attempt to reduce the size of JSON documents.

Let us take an example. The following is a JSON document representing one of RavenDB's builds:

   1: {
   2:   "BuildName": "RavenDB Unstable v2.5",
   3:   "IsUnstable": true,
   4:   "Version": "2509-Unstable",
   5:   "PublishedAt": "2013-02-26T12:06:12.0000000",
   6:   "DownloadsIds": [],
   7:   "Changes": [
   8:     {
   9:       "Commiter": {
  10:         "Email": "david@davidwalker.org",
  11:         "Name": "David Walker"
  12:       },
  13:       "Version": "17c661cb158d5e3c528fe2c02a3346305f0234a3",
  14:       "Href": "/app/rest/changes/id:21039",
  15:       "TeamCityId": 21039,
  16:       "Username": "david walker",
  17:       "Comment": "Do not save Has-Api-Key header to metadata\n",
  18:       "Date": "2013-02-20T23:22:43.0000000",
  19:       "Files": [
  20:         "Raven.Abstractions/Extensions/MetadataExtensions.cs"
  21:       ]
  22:     },
  23:     {
  24:       "Commiter": {
  25:         "Email": "david@davidwalker.org",
  26:         "Name": "David Walker"
  27:       },
  28:       "Version": "5ffb4d61ad9102696948f6678bbecac88e1dc039",
  29:       "Href": "/app/rest/changes/id:21040",
  30:       "TeamCityId": 21040,
  31:       "Username": "david walker",
  32:       "Comment": "Do not save IIS Application Request Routing headers to metadata\n",
  33:       "Date": "2013-02-20T23:23:59.0000000",
  34:       "Files": [
  35:         "Raven.Abstractions/Extensions/MetadataExtensions.cs"
  36:       ]
  37:     },
  38:     {
  39:       "Commiter": {
  40:         "Email": "ayende@ayende.com",
  41:         "Name": "Ayende Rahien"
  42:       },
  43:       "Version": "5919521286735f50f963824a12bf121cd1df4367",
  44:       "Href": "/app/rest/changes/id:21035",
  45:       "TeamCityId": 21035,
  46:       "Username": "ayende rahien",
  47:       "Comment": "Better disposal\n",
  48:       "Date": "2013-02-26T10:16:45.0000000",
  49:       "Files": [
  50:         "Raven.Client.WinRT/MissingFromWinRT/ThreadSleep.cs"
  51:       ]
  52:     },
  53:     {
  54:       "Commiter": {
  55:         "Email": "ayende@ayende.com",
  56:         "Name": "Ayende Rahien"
  57:       },
  58:       "Version": "c93264e2a94e2aa326e7308ab3909aa4077bc3bb",
  59:       "Href": "/app/rest/changes/id:21036",
  60:       "TeamCityId": 21036,
  61:       "Username": "ayende rahien",
  62:       "Comment": "Will ensure that the value is always positive or zero (never negative).\nWhen using numeric calc, will div by 1,024 to get more concentration into buckets.\n",
  63:       "Date": "2013-02-26T10:17:23.0000000",
  64:       "Files": [
  65:         "Raven.Database/Indexing/IndexingUtil.cs"
  66:       ]
  67:     },
  68:     {
  69:       "Commiter": {
  70:         "Email": "ayende@ayende.com",
  71:         "Name": "Ayende Rahien"
  72:       },
  73:       "Version": "7bf51345d39c3993fed5a82eacad6e74b9201601",
  74:       "Href": "/app/rest/changes/id:21037",
  75:       "TeamCityId": 21037,
  76:       "Username": "ayende rahien",
  77:       "Comment": "Fixing a bug where we wouldn't decrement reduce stats for an index when multiple values from the same bucket are removed\n",
  78:       "Date": "2013-02-26T10:53:01.0000000",
  79:       "Files": [
  80:         "Raven.Database/Indexing/MapReduceIndex.cs",
  81:         "Raven.Database/Storage/Esent/StorageActions/MappedResults.cs",
  82:         "Raven.Database/Storage/IMappedResultsStorageAction.cs",
  83:         "Raven.Database/Storage/Managed/MappedResultsStorageAction.cs",
  84:         "Raven.Tests/Issues/RavenDB_784.cs",
  85:         "Raven.Tests/Storage/MappedResults.cs",
  86:         "Raven.Tests/Views/ViewStorage.cs"
  87:       ]
  88:     },
  89:     {
  90:       "Commiter": {
  91:         "Email": "ayende@ayende.com",
  92:         "Name": "Ayende Rahien"
  93:       },
  94:       "Version": "ff2c5b43eba2a8a2206152658b5e76706e12945c",
  95:       "Href": "/app/rest/changes/id:21038",
  96:       "TeamCityId": 21038,
  97:       "Username": "ayende rahien",
  98:       "Comment": "No need for so many repeats\n",
  99:       "Date": "2013-02-26T11:27:49.0000000",
 100:       "Files": [
 101:         "Raven.Tests/Bugs/MultiOutputReduce.cs"
 102:       ]
 103:     },
 104:     {
 105:       "Commiter": {
 106:         "Email": "ayende@ayende.com",
 107:         "Name": "Ayende Rahien"
 108:       },
 109:       "Version": "0620c74e51839972554fab3fa9898d7633cfea6e",
 110:       "Href": "/app/rest/changes/id:21041",
 111:       "TeamCityId": 21041,
 112:       "Username": "ayende rahien",
 113:       "Comment": "Merge branch 'master' of https://github.com/cloudbirdnet/ravendb into 2.1\n",
 114:       "Date": "2013-02-26T11:41:39.0000000",
 115:       "Files": [
 116:         "Raven.Abstractions/Extensions/MetadataExtensions.cs"
 117:       ]
 118:     }
 119:   ],
 120:   "ResolvedIssues": [],
 121:   "Contributors": [
 122:     {
 123:       "FullName": "Ayende Rahien",
 124:       "Email": "ayende@ayende.com",
 125:       "EmailHash": "730a9f9186e14b8da5a4e453aca2adfe"
 126:     },
 127:     {
 128:       "FullName": "David Walker",
 129:       "Email": "david@davidwalker.org",
 130:       "EmailHash": "4e5293ab04bc1a4fdd62bd06e2f32871"
 131:     }
 132:   ],
 133:   "BuildTypeId": "bt8",
 134:   "Href": "/app/rest/builds/id:588",
 135:   "ProjectName": "RavenDB",
 136:   "TeamCityId": 588,
 137:   "ProjectId": "project3",
 138:   "Number": 2509
 139: }

This document is 4.52KB in size. Running this through JSONH gives us the following:

   1: [
   2:     14,
   3:     "BuildName",
   4:     "IsUnstable",
   5:     "Version",
   6:     "PublishedAt",
   7:     "DownloadsIds",
   8:     "Changes",
   9:     "ResolvedIssues",
  10:     "Contributors",
  11:     "BuildTypeId",
  12:     "Href",
  13:     "ProjectName",
  14:     "TeamCityId",
  15:     "ProjectId",
  16:     "Number",
  17:     "RavenDB Unstable v2.5",
  18:     true,
  19:     "2509-Unstable",
  20:     "2013-02-26T12:06:12.0000000",
  21:     [
  22:     ],
  23:     [
  24:         {
  25:             "Commiter": {
  26:                 "Email": "david@davidwalker.org",
  27:                 "Name": "David Walker"
  28:             },
  29:             "Version": "17c661cb158d5e3c528fe2c02a3346305f0234a3",
  30:             "Href": "/app/rest/changes/id:21039",
  31:             "TeamCityId": 21039,
  32:             "Username": "david walker",
  33:             "Comment": "Do not save Has-Api-Key header to metadata\n",
  34:             "Date": "2013-02-20T23:22:43.0000000",
  35:             "Files": [
  36:                 "Raven.Abstractions/Extensions/MetadataExtensions.cs"
  37:             ]
  38:         },
  39:         {
  40:             "Commiter": {
  41:                 "Email": "david@davidwalker.org",
  42:                 "Name": "David Walker"
  43:             },
  44:             "Version": "5ffb4d61ad9102696948f6678bbecac88e1dc039",
  45:             "Href": "/app/rest/changes/id:21040",
  46:             "TeamCityId": 21040,
  47:             "Username": "david walker",
  48:             "Comment": "Do not save IIS Application Request Routing headers to metadata\n",
  49:             "Date": "2013-02-20T23:23:59.0000000",
  50:             "Files": [
  51:                 "Raven.Abstractions/Extensions/MetadataExtensions.cs"
  52:             ]
  53:         },
  54:         {
  55:             "Commiter": {
  56:                 "Email": "ayende@ayende.com",
  57:                 "Name": "Ayende Rahien"
  58:             },
  59:             "Version": "5919521286735f50f963824a12bf121cd1df4367",
  60:             "Href": "/app/rest/changes/id:21035",
  61:             "TeamCityId": 21035,
  62:             "Username": "ayende rahien",
  63:             "Comment": "Better disposal\n",
  64:             "Date": "2013-02-26T10:16:45.0000000",
  65:             "Files": [
  66:                 "Raven.Client.WinRT/MissingFromWinRT/ThreadSleep.cs"
  67:             ]
  68:         },
  69:         {
  70:             "Commiter": {
  71:                 "Email": "ayende@ayende.com",
  72:                 "Name": "Ayende Rahien"
  73:             },
  74:             "Version": "c93264e2a94e2aa326e7308ab3909aa4077bc3bb",
  75:             "Href": "/app/rest/changes/id:21036",
  76:             "TeamCityId": "...bug where we wouldn't decrement reduce stats for an index when multiple values from the same bucket are removed\n",
  77:             "Date": "2013-02-26T10:53:01.0000000",
  78:             "Files": [
  79:                 "Raven.Database/Indexing/MapReduceIndex.cs",
  80:                 "Raven.Database/Storage/Esent/StorageActions/MappedResults.cs",
  81:                 "Raven.Database/Storage/IMappedResultsStorageAction.cs",
  82:                 "Raven.Database/Storage/Managed/MappedResultsStorageAction.cs",
  83:                 "Raven.Tests/Issues/RavenDB_784.cs",
  84:                 "Raven.Tests/Storage/MappedResults.cs",
  85:                 "Raven.Tests/Views/ViewStorage.cs"
  86:             ]
  87:         },
  88:         {
  89:             "Commiter": {
  90:                 "Email": "ayende@ayende.com",
  91:                 "Name": "Ayende Rahien"
  92:             },
  93:             "Version": "ff2c5b43eba2a8a2206152658b5e76706e12945c",
  94:             "Href": "/app/rest/changes/id:21038",
  95:             "TeamCityId": 21038,
  96:             "Username": "ayende rahien",
  97:             "Comment": "No need for so many repeats\n",
  98:             "Date": "2013-02-26T11:27:49.0000000",
  99:             "Files": [
 100:                 "Raven.Tests/Bugs/MultiOutputReduce.cs"
 101:             ]
 102:         },
 103:         {
 104:             "Commiter": {
 105:                 "Email": "ayende@ayende.com",
 106:                 "Name": "Ayende Rahien"
 107:             },
 108:             "Version": "0620c74e51839972554fab3fa9898d7633cfea6e",
 109:             "Href": "/app/rest/changes/id:21041",
 110:             "TeamCityId": 21041,
 111:             "Username": "ayende rahien",
 112:             "Comment": "Merge branch 'master' of https://github.com/cloudbirdnet/ravendb into 2.1\n",
 113:             "Date": "2013-02-26T11:41:39.0000000",
 114:             "Files": [
 115:                 "Raven.Abstractions/Extensions/MetadataExtensions.cs"
 116:             ]
 117:         }
 118:     ],
 119:     [
 120:     ],
 121:     [
 122:         {
 123:             "FullName": "Ayende Rahien",
 124:             "Email": "ayende@ayende.com",
 125:             "EmailHash": "730a9f9186e14b8da5a4e453aca2adfe"
 126:         },
 127:         {
 128:             "FullName": "David Walker",
 129:             "Email": "david@davidwalker.org",
 130:             "EmailHash": "4e5293ab04bc1a4fdd62bd06e2f32871"
 131:         }
 132:     ],
 133:     "bt8",
 134:     "/app/rest/builds/id:588",
 135:     "RavenDB",
 136:     588,
 137:     "project3",
 138:     2509
 139: ]

It reduced the document size to 2.93KB! Awesome, nearly half of the size is gone. Except: this is actually generating an utterly unreadable mess. I mean, can you look at this and figure out what the hell is going on?

I thought not. At this point, we might as well use a binary format. I happen to have a zip tool at my disposal, so I checked what would happen if I ran the document through that. The end result was a file that was 1.42KB. And I had no more loss of readability than I had with the JSONH stuff.

To be frank, I just don't get efforts like this. JSON is a text-based, human-readable format. If you lose the human-readable portion of the format, you might as well drop directly to binary. It is likely to be more efficient, and you don't lose anything by it.

And if you want to compress your data, it is probably better to use an actual compression tool. HTTP compression, for example, is practically free, since all servers and clients should be able to consume it by now. Any tool that you use should be able to see through it. And it is likely to generate much better results on your JSON documents than a clever format like this.
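For example, on the .NET client side, opting into compressed responses is typically a one-line configuration. A sketch using HttpClient (the URL is just a placeholder):

using System;
using System.Net;
using System.Net.Http;

// Sketch: opting into gzip/deflate on the wire with HttpClient.
class HttpCompressionSketch
{
    static void Main()
    {
        var handler = new HttpClientHandler
        {
            AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate
        };

        using (var client = new HttpClient(handler))
        {
            // The response is compressed on the wire, but the code (and Fiddler)
            // still sees plain, readable JSON.
            string json = client.GetStringAsync("http://localhost:8080/databases/Example/docs/people/1").Result;
            Console.WriteLine(json);
        }
    }
}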

time to read 1 min | 113 words

So, I think that I have run through enough string sorting and tax calculations. My next interview question is going to be:

Given a finite set of unique numbers, find all the runs in the set. Runs are 1 or more consecutive numbers.

That is, given {1,59,12,43,4,58,5,13,46,3,6}, the output should be: {1}, {3,4,5,6}, {12,13}, {43}, {46},{58,59}.

Note that the size of the set may be very large.

This seems pretty simple as a question, and it should produce interesting results.

Just to give you some idea, this is a real-world problem we ran into. We need to find runs in a sequence of freed pages so we can optimize sequential I/O.
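For what it is worth, here is one straightforward way to attack the question (a sketch that sorts first, which may or may not be acceptable given how large the set can get):

using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: find runs of consecutive numbers by sorting first.
// For {1,59,12,43,4,58,5,13,46,3,6} this prints:
// {1} {3,4,5,6} {12,13} {43} {46} {58,59}
class Runs
{
    static IEnumerable<List<int>> FindRuns(IEnumerable<int> numbers)
    {
        var sorted = numbers.Distinct().OrderBy(n => n).ToArray();
        if (sorted.Length == 0)
            yield break;

        var run = new List<int> { sorted[0] };
        for (int i = 1; i < sorted.Length; i++)
        {
            if (sorted[i] == sorted[i - 1] + 1)
            {
                run.Add(sorted[i]); // still consecutive, extend the current run
            }
            else
            {
                yield return run;   // gap found, start a new run
                run = new List<int> { sorted[i] };
            }
        }
        yield return run;
    }

    static void Main()
    {
        var input = new[] { 1, 59, 12, 43, 4, 58, 5, 13, 46, 3, 6 };
        foreach (var run in FindRuns(input))
            Console.WriteLine("{" + string.Join(",", run) + "}");
    }
}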

time to read 4 min | 717 words

I’ve been doing a lot of research lately on storage. And in general, it seems that the most popular ways of writing to disk today are divide into the following categories.

  • Write Ahead Logging (WAL) – many databases use some sort of variant of this: PostgreSQL, SQLite, MongoDB, SQL Server, etc. Oracle has the Redo Log, which seems similar, but I didn't check too deeply.
  • Log Structured Merge (LSM) – a lot of NoSQL databases use this method: Cassandra, Riak, LevelDB, SQLite 4, etc.
  • Shadow Paging – was quite popular a long time ago (the 80s), but is still somewhat in use: LMDB, Tokyo Cabinet, CouchDB (sort of).

WAL came into being for a very simple reason: it is drastically faster to write sequentially than it is to do random writes. Let us assume that you store the data on disk using some sort of a tree. When you need to insert / update something in that tree, the record can be anywhere. That means that you would need to do random writes, and have to suffer the perf issues associated with that. Instead, you can write to the log and have some sort of a background process that updates the on-disk data.

It also means that you really only have to update the in-memory data, flush the log, and you are safe. The recovery procedure is going to be pretty complex, but it gives you some nice performance. Note that you write everything at least twice, once for the log, and once for the actual data file. The log writes are sequential, the data writes are random.
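To make the shape of that concrete, here is a minimal WAL sketch (a toy, not any particular database's log format): append the record, flush, acknowledge, and let the random writes to the data file happen later.

using System;
using System.IO;

// Toy write-ahead log: append the change to the log and flush it to disk
// before acknowledging the commit. Applying the change to the real data
// file can then happen later, in the background, in whatever order is
// convenient. The file name is a placeholder, and disposal is omitted.
class WriteAheadLogSketch
{
    readonly FileStream _log;

    public WriteAheadLogSketch(string path)
    {
        _log = new FileStream(path, FileMode.Append, FileAccess.Write);
    }

    public void Commit(byte[] record)
    {
        // Sequential append: cheap, even on spinning disks.
        var length = BitConverter.GetBytes(record.Length);
        _log.Write(length, 0, length.Length);
        _log.Write(record, 0, record.Length);

        // Force it to the platter / flash before we tell the caller it is durable.
        _log.Flush(true);

        // The random writes to the data file happen later, off the commit path.
    }
}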

LSM also takes advantage of sequential write speeds, but it takes things even further. Instead of updating the actual data, you wait until the log gets to a certain size, at which point you merge it with the current data file(s). That means that you will usually write things multiple times; in LevelDB, for example, a lot of the effort has actually gone into eradicating this cost, the cost of compacting your data. Because what ends up happening is that you have user writes competing with the compaction writes.

Shadow Paging is not actually trying to optimize sequential writes. Well, that is not really fair. Shadow Paging & sequential writes are just not related. The reason I said CouchDB is sort of using shadow paging is that it is using the exact same mechanics as other shadow paging systems, but it always writes at the end of the file. That means that it has excellent write speed, but it also means that it needs some way to reclaim space. And that means it uses compaction, which brings you right back to the competing writes story.

For our purposes, we will ignore the way CouchDB works and focus on systems that work like LMDB. In those sorts of systems, instead of modifying the data directly, we create a shadow page (copy on write) and modify that. Because the shadow page is only wired up to the rest of the pages on commit, this is absolutely atomic. It also means that modifying a page is going to use one page and leave another free (the old page). And that, in turn, means that you need to have some way of scavenging for free space. CouchDB does that by creating a whole new file.
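Here is a toy, in-memory sketch of those copy-on-write mechanics (logical pages mapping to physical pages, with commit as the single atomic switch); it is only meant to show the shape of the idea:

using System.Collections.Generic;

// Toy shadow paging: logical pages map to physical pages. A write copies the
// physical page, modifies the copy, and commit atomically repoints the page
// table entry. The old physical page becomes free space to reclaim later.
class ShadowPagingSketch
{
    const int PageSize = 4096;
    readonly Dictionary<int, int> _pageTable = new Dictionary<int, int>();        // logical -> physical
    readonly Dictionary<int, byte[]> _physicalPages = new Dictionary<int, byte[]>();
    readonly List<int> _freePhysicalPages = new List<int>();
    int _nextPhysicalPage;

    public int ModifyCopy(int logicalPage, int offset, byte value)
    {
        // Copy-on-write: never touch the currently committed physical page.
        var shadow = new byte[PageSize];
        int current;
        if (_pageTable.TryGetValue(logicalPage, out current))
            _physicalPages[current].CopyTo(shadow, 0);
        shadow[offset] = value;

        int shadowPage = _nextPhysicalPage++;
        _physicalPages[shadowPage] = shadow;
        return shadowPage;
    }

    public void Commit(int logicalPage, int shadowPage)
    {
        // The atomic step: the logical page now points at the shadow copy.
        int old;
        if (_pageTable.TryGetValue(logicalPage, out old))
            _freePhysicalPages.Add(old);   // the old page is free; someone has to reclaim it
        _pageTable[logicalPage] = shadowPage;
    }
}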

LMDB does that by recording the free space and reusing it in the next transaction. That means that writes to LMDB can happen anywhere. We can apply policies on top of that to mitigate it, but that is beside the point.

Let us go back to another important aspect that we have to deal with in databases: backups. As it turns out, it is actually really simple for most LSM / WAL systems to implement them, because you can just use the logs. For LMDB, you can create a backup really easily (in fact, since we are using shadow paging, you pretty much get it for free). However, one feature that I don't think would be possible with LMDB is incremental backups. WAL / LSM make it easy, just take the logs since a given point. But with LMDB-style dbs, I don't think that this would be possible.

time to read 30 min | 5927 words

I have been doing a lot more analysis of the actual disk access patterns that we saw in Voron. In the beginning, I strongly suspected that the problem was with how I was using memory mapped files. In particular, some experiments using my better knowledge of the environment led to substantial performance gains. Instead of calling FlushViewOfFile(0, fileLen), I was able to call FlushViewOfFile on just the ranges that I knew had changed.

That helped, but it wasn't nearly enough. So I ran some quick tests using FileStream, and realized that the fault was obviously with the broken way Windows is using memory mapped files. So I decided (no, I didn't write my own mmap impl, thank you very much) to take manual care of how Voron writes to disk. I used WriteFile with all the bells and whistles, and even had async I/O for a while there. I was able to directly pinpoint the exact locations where we needed to write, pass that to Windows in an efficient manner, and be done with it. It required me to write a malloc implementation in managed code, but it was quite efficient-looking code.

And then I ran it. Just to give you some perspective, the scenario in question here is 10,000 transactions doing 100 random writes each. Using the memory map approach, after the FlushViewOfFile range optimization, I got roughly 7,000 operations / second. Using my own hand-written, optimized I/O, I got… 262 operations / second. Okaaaay… so maybe I need to go back to the drawing board.

I sat down and started figuring out what is actually going on. I looked at the actual output that we had, in particular, how many writes we had per transaction. We are writing 100 records with a 16 byte key and a 100 byte value. That means that the absolute minimum amount we can write would be 11,600 bytes. However, we are writing in 4Kb pages, which brings us to 3 pages and 12,288 bytes per transaction. Of course, this ignores things like the writing of branch pages in the B+Tree, so let us see what the real numbers are.

   1: Flush     1 with  12 pages   - 48 kb writes and 1  seeks   (11 leaves, 1 branches, 0 overflows)
   2: Flush     2 with  13 pages   - 52 kb writes and 1  seeks   (12 leaves, 1 branches, 0 overflows)
   3: Flush     3 with  21 pages   - 84 kb writes and 1  seeks   (20 leaves, 1 branches, 0 overflows)
   4:  
   5: Flush    27 with  76 pages   - 304 kb writes and 1 seeks  (75 leaves,  1 branches, 0 overflows)
   6: Flush    28 with  73 pages   - 292 kb writes and 1 seeks  (72 leaves,  1 branches, 0 overflows)
   7: Flush    29 with  84 pages   - 336 kb writes and 1 seeks  (80 leaves,  4 branches, 0 overflows)
   8:  
   9: Flush 1,153 with 158 pages - 632 kb writes and 67  seeks (107 leaves, 51 branches, 0 overflows)
  10: Flush 1,154 with 168 pages - 672 kb writes and 65  seeks (113 leaves, 55 branches, 0 overflows)
  11: Flush 1,155 with 165 pages - 660 kb writes and 76  seeks (113 leaves, 52 branches, 0 overflows)
  12:  
  13: Flush 4,441 with 199 pages - 796 kb writes and 146 seeks (111 leaves, 88 branches, 0 overflows)
  14: Flush 4,442 with 198 pages - 792 kb writes and 133 seeks (113 leaves, 85 branches, 0 overflows)
  15: Flush 4,443 with 196 pages - 784 kb writes and 146 seeks (109 leaves, 87 branches, 0 overflows)
  16:  
  17: Flush 7,707 with 209 pages - 836 kb writes and 170 seeks (111 leaves, 98 branches, 0 overflows)
  18: Flush 7,708 with 217 pages - 868 kb writes and 169 seeks (119 leaves, 98 branches, 0 overflows)
  19: Flush 7,709 with 197 pages - 788 kb writes and 162 seeks (108 leaves, 89 branches, 0 overflows)
  20:  
  21: Flush 9,069 with 204 pages - 816 kb writes and 170 seeks (108 leaves, 96 branches, 0 overflows)
  22: Flush 9,070 with 206 pages - 824 kb writes and 166 seeks (112 leaves, 94 branches, 0 overflows)
  23: Flush 9,071 with 203 pages - 812 kb writes and 169 seeks (105 leaves, 98 branches, 0 overflows)

The very first transactions are already showing something very interesting: we are actually writing 12 - 21 pages, or 48 - 84 Kb of data, instead of 12 Kb. Why do we write 4 - 7 times as much data as we wanted?

The answer is that we are writing data that is random in nature, so it can't all sit in the same page; we get a lot of page splits, and very quickly we end up with a lot of pages. This is pretty much inevitable, since this is how trees work. But look at what happens down the road. In particular, look at lines 9 – 10. You can see that we are now at a pretty stable state. We are writing ~160 pages per transaction. And since we write random data, we tend to touch about 100 leaf pages per transaction (the stuff beyond 100 is usually page splits). But something much more interesting can be seen in the count of seeks.

The way LMDB works, we use copy-on-write, so whenever we modify a page, we are actually modifying a copy and marking the actual page as available for future transactions to re-use. This has the great advantage that we don't ever need to do compaction, but it also means that when we do want to do writes, we have to make them pretty much all over the place. And it actually gets worse as more time goes by.

Now, you have to realize that this is pretty much the worst case scenario, a transaction that does a lot of writes all over the place. But it means that toward the end, we are really hurting.

You already know the numbers, right? What about just straight out writing 1MB? What is the cost of seeks? I wrote the following code:

var buffer = new byte[1024*1024];
var random = new Random();
random.NextBytes(buffer);
if (File.Exists("test.bin"))
    File.Delete("test.bin");
using (var fs = new FileStream("test.bin", FileMode.CreateNew, FileAccess.ReadWrite))
{
    fs.SetLength(1024 * 1024 * 768);
    // warm up
    for (int i = 0; i < 200; i++)
    {
        fs.Position = random.Next(0, (int)fs.Length);
        fs.Write(buffer, 0, random.Next(0, 1024));
    }

    var sp = Stopwatch.StartNew();
    for (int i = 0; i < 200; i++)
    {
        fs.Position = random.Next(0, (int)fs.Length);
        fs.WriteByte(1);
    }
    fs.Flush(true);
    sp.Stop();
    Console.WriteLine("200 seeks & 200 bytes {0:#,#} ms", sp.ElapsedMilliseconds);

    sp = Stopwatch.StartNew();
    fs.Position = random.Next(0, (int)fs.Length);
    fs.Write(buffer, 0, buffer.Length);
    fs.Flush(true);
    sp.Stop();
    Console.WriteLine("1 MB write {0:#,#} ms", sp.ElapsedMilliseconds);
}

The results are quite interesting:

200 seeks & 200 bytes 146 ms
1 MB write 6 ms

Just to note, this is when running on an SSD; the numbers are supposed to be a lot worse when running on an HDD.

In other words, it ain’t the size of the write, but how you spread it around that really matters. Just to compare, here are the numbers for when we are doing sequential writes:

   1: Flush      1 with   6 pages -  24 kb writes and   1 seeks (  5 leaves,   1 branches,   0 overflows)
   2: Flush      2 with   6 pages -  24 kb writes and   1 seeks (  5 leaves,   1 branches,   0 overflows)
   3: Flush      3 with   6 pages -  24 kb writes and   1 seeks (  5 leaves,   1 branches,   0 overflows)
   4:  
   5: Flush    159 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
   6: Flush    160 with   8 pages -  32 kb writes and   3 seeks (  6 leaves,   2 branches,   0 overflows)
   7: Flush    161 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
   8: Flush    162 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
   9:  
  10: Flush  1,320 with   8 pages -  32 kb writes and   3 seeks (  6 leaves,   2 branches,   0 overflows)
  11: Flush  1,321 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
  12: Flush  1,322 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
  13:  
  14: Flush  4,316 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
  15: Flush  4,317 with   7 pages -  28 kb writes and   2 seeks (  5 leaves,   2 branches,   0 overflows)
  16: Flush  4,318 with   8 pages -  32 kb writes and   3 seeks (  6 leaves,   2 branches,   0 overflows)
  17:  
  18: Flush  7,409 with   8 pages -  32 kb writes and   4 seeks (  5 leaves,   3 branches,   0 overflows)
  19: Flush  7,410 with   9 pages -  36 kb writes and   2 seeks (  6 leaves,   3 branches,   0 overflows)
  20: Flush  7,411 with   8 pages -  32 kb writes and   4 seeks (  5 leaves,   3 branches,   0 overflows)
  21:  
  22: Flush  9,990 with   8 pages -  32 kb writes and   3 seeks (  5 leaves,   3 branches,   0 overflows)
  23: Flush  9,991 with   9 pages -  36 kb writes and   4 seeks (  6 leaves,   3 branches,   0 overflows)
  24: Flush  9,992 with   8 pages -  32 kb writes and   3 seeks (  5 leaves,   3 branches,   0 overflows)
  25: Flush  9,993 with   8 pages -  32 kb writes and   4 seeks (  5 leaves,   3 branches,   0 overflows)

Because we are always writing at the end, we only need to touch very few pages. Note that even with the small number of pages, we still need to do quite a few seeks relative to the number of pages we write.

I have some ideas about this, but they are still unformed. Mostly, we need to balance the amount of free space that gets used against the number of seeks it costs us. I think we can get there by being smart about tracking the number of pages modified by a transaction, and waiting until free space becomes available in sufficient quantities to be relevant. Something like that would allow auto-tuning of the amount of garbage we accumulate vs. the number of seeks we require.
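One possible shape for such a policy (purely a sketch of the idea, not what Voron actually does) is to hand out freed pages only when they form a contiguous run long enough to justify the seek:

using System.Collections.Generic;

// Sketch of a "be smart about free pages" policy: only reuse freed pages when
// they form a contiguous run of at least MinRunLength pages; otherwise keep
// allocating from the end of the file. Fewer, longer runs mean fewer seeks,
// at the cost of letting some garbage accumulate for a while.
class FreeSpacePolicySketch
{
    const int MinRunLength = 32;
    readonly SortedSet<long> _freePages = new SortedSet<long>();
    long _endOfFilePage;

    public void Free(IEnumerable<long> pages)
    {
        foreach (var page in pages)
            _freePages.Add(page);
    }

    public long AllocateRun(int count)
    {
        // Look for a contiguous run of freed pages that is long enough to be
        // worth the seek it will cost us.
        long runStart = -1, previous = long.MinValue, found = -1;
        int runLength = 0;

        foreach (var page in _freePages)
        {
            if (page == previous + 1)
            {
                runLength++;
            }
            else
            {
                runStart = page;
                runLength = 1;
            }
            previous = page;

            if (runLength >= count && runLength >= MinRunLength)
            {
                found = runStart;
                break;
            }
        }

        if (found != -1)
        {
            for (long p = found; p < found + count; p++)
                _freePages.Remove(p);
            return found;
        }

        // No run is good enough yet - just grow the file (sequential write).
        long start = _endOfFilePage;
        _endOfFilePage += count;
        return start;
    }
}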

time to read 1 min | 181 words

Here is the recording for our RavenDB 2.5 Webinar from Monday. You can start watching it right now, or you can continue reading about the timing snafu we had below.

Basically, we ran into a timing error because of daylight savings. Unlike the common error, where you forget to account for daylight savings taking effect, here we forgot to take into account daylight savings not coming into effect.

For a lot of reasons, daylight savings in Israel is a contentious issue, involving the crossroads of politics, religion and a whole bunch of other stuff. As a result, until recently we didn't have a fixed date for daylight saving changes. A few years ago that changed, and the change date was supposed to be last week. Then politics happened, and the date moved. We didn't account for a lot of software still thinking that the daylight savings change was actually happening on time, and that meant that we actually held the webinar an hour early.

I apologize for the mistake, and hopefully you’ll still enjoy the recording.

time to read 6 min | 1076 words

I was working peacefully through some stuff with Voron when I ran into some really bad performance in a pretty important scenario. I decided that this was probably something I was doing wrong, and set out to reproduce the same scenario in LMDB to figure out what I was doing wrong.

Here is the code:

int main(int argc,char * argv[])
{
    int i = 0, j = 0, rc;
    UUID id;
    MDB_env *env;

    MDB_val key, data;

    MDB_stat mst;
    MDB_cursor *cursor;
    char sval[32];
    clock_t start = clock(), end;

    srandom(time(NULL));

    rc = mdb_env_create(&env);
    rc = mdb_env_set_mapsize(env, 1024*1024*768);
    rc = mdb_env_set_flags(env, MDB_WRITEMAP, 1);
    rc = mdb_env_open(env, "E:\\data\\lmdb2", 0, 0664);

    key.mv_size = sizeof(UUID);
    data.mv_size = sizeof(sval);
    data.mv_data = sval;

    for (i = 0; i < 100; i++)
    {
        MDB_txn *txn;
        MDB_dbi dbi;
        rc = mdb_txn_begin(env, NULL, 0, &txn);
        rc = mdb_open(txn, NULL, 0, &dbi);

        for (j = 0; j < 100; j++)
        {
            UuidCreate(&id);
            key.mv_data = &id;
            rc = mdb_put(txn, dbi, &key, &data, 0);
        }

        rc = mdb_txn_commit(txn);
        mdb_close(env, dbi);
    }

    end = clock();
    printf("%i", (end - start));
    fgetc(stdin);

    mdb_env_close(env);
    return 0;
}

As you can see, we are inserting 10,000 items (100 transactions of 100 items each). Each item has a key of 16 bytes and a value of 100 bytes. Now, you might note that this is probably the worst case scenario for a B+Tree: UUIDs are unordered, and this generates a lot of fragmentation in the tree. A bad scenario, yes, but also a relatively common one, and one that needs to be handled properly for our needs. It took me a long while to narrow down what is actually going on. At first I was convinced that the problem was with Windows' implementation of memory mapped files, but eventually I realized that select ain't broken.

Here is the Process Monitor trace of the first few transactions. The highlighted calls are to FlushBuffersFile, which is how FlushFileBuffers shows up in the trace, and which indicates a commit.

[Image: Process Monitor trace of the first few transactions]

As I said, those are the first few transactions. You can see that the OS is doing a pretty good job of merging writes and avoiding seeks. However, as time goes by…

[Image: Process Monitor trace of later transactions]

And as more time goes by, we get to… (there are more rows at the top that I had to remove so I could take the snapshot).

[Image: Process Monitor trace of a much later transaction]

In fact, I put the data in Excel and got:

 

[Image: Excel chart of writes and seeks per commit]

And those are some really bad numbers, I think you'll agree. After a very short period of time, the cost of committing a transaction goes to over 50 seeks and writing of 350KB.

Let us compare that to what happens when we are actually writing sequential data (using UuidCreateSequential instead of UuidCreate). The test completes in half the time, and the actual statistics are also very interesting.

[Image: statistics for the sequential write scenario]

 

We are writing a lot less, but much more importantly, we are doing a lot fewer seeks. For reference, here is the final transaction that we have in the sequential write scenario:

[Image: Process Monitor trace of the final transaction in the sequential write scenario]

I ran those tests on an HDD, so seek times matter a lot, but I have gotten similar results when running on an SSD.

Now, the interesting question here is what exactly is going on? And I think that a large part of this is the way LMDB allocates pages. It uses a copy-on-write method, which means that after the transaction is done, the previously used pages are free. That means that they can be reclaimed. And the next transaction will do just that. The problem with this approach is that it tends to spread writes all over the file. This saves disk space, but requires a lot more seeks when you need to commit a transaction.

That means that in order to optimize this particular scenario, I probably need to do some thinking about the free page allocation behavior. We want to be a lot more conservative about when / how we give out those pages, to make sure that we aren’t generating a really bad situation for the db when commit time comes.

time to read 32 min | 6381 words

Continuing on with the same theme from our last post: how can we improve the speed at which we write to disk? In particular, I am currently focused on my worst case scenario:

fill rnd buff 10,000 tx            :    161,812 ms      6,180 ops / sec

This is 10,000 transactions all running one after another, and taking really way too long to go about doing their thing. Now, we did some improvements and got it all the way to 6,340 ops / sec, but I think you'll agree that even this optimization is probably still bad. We spent more time there, trying to figure out exactly what micro optimizations we could do, and we got all the way up to 8,078 ops / sec.

That is the point where I decided that I would really like to look at the raw numbers that I can get from this system. So I wrote the following code:

   1: var key = Guid.NewGuid().ToByteArray();
   2: var buffer = new byte[100];
   3: new Random().NextBytes(buffer);
   4:  
   5: using (var fs = new FileStream("test.bin", FileMode.Truncate, FileAccess.ReadWrite))
   6: {
   7:     fs.SetLength(1024*1024*768);
   8:  
   9:     var sp = Stopwatch.StartNew();
  10:  
  11:     for (int i = 0; i < 10*1000; i++)
  12:     {
  13:         for (int j = 0; j < 100; j++)
  14:         {
  15:             fs.Write(key,0, 16);
  16:             fs.Write(buffer, 0, 100);
  17:         }
  18:         fs.Flush(true);
  19:     }
  20:     Console.WriteLine("{0:#,#} ms for {1:#,#} ops / sec", sp.ElapsedMilliseconds, (1000*1000)/sp.Elapsed.TotalSeconds);
  21: }

This code mimics the absolute best scenario we could hope for: zero cost for managing the data, pure sequential writes. Note that we call Flush(true) to simulate 10,000 transactions. This code gives me: 147,201 ops / sec.

This is interesting, mostly because I thought the reason our random writes with 10K transactions are bad was the calls to Flush(), but it appears that this is actually working very well. I then tested this with some random writes, by adding the following lines before line 13:

   1: var next = random.Next(0, 1024*1024*512);
   2: fs.Position = next - next%4096;

I then decided to try it with memory mapped files, and I wrote:

   1: using (var fs = new FileStream("test.bin", FileMode.Truncate, FileAccess.ReadWrite))
   2: {
   3:     fs.SetLength(1024 * 1024 * 768);
   4:  
   5:     var memoryMappedFile = MemoryMappedFile.CreateFromFile(fs,
   6:                                     "test", fs.Length, MemoryMappedFileAccess.ReadWrite,
   7:                                     null, HandleInheritability.None, true);
   8:     var memoryMappedViewAccessor = memoryMappedFile.CreateViewAccessor();
   9:  
  10:     byte* p = null;
  11:     memoryMappedViewAccessor.SafeMemoryMappedViewHandle.AcquirePointer(ref p);
  12:  
  13:     var sp = Stopwatch.StartNew();
  14:  
  15:     for (int i = 0; i < 10 * 1000; i++)
  16:     {
  17:         var next = random.Next(0, 1024 * 1024 * 512);
  18:         byte* basePtr = p + next;
  19:         using (var ums = new UnmanagedMemoryStream(basePtr, 12 * 1024,12*1024, FileAccess.ReadWrite))
  20:         {
  21:             for (int j = 0; j < 100; j++)
  22:             {
  23:                 ums.Write(key, 0, 16);
  24:                 ums.Write(buffer, 0, 100);
  25:             }
  26:         }
  27:     }
  28:     Console.WriteLine("{0:#,#} ms for {1:#,#} ops / sec", sp.ElapsedMilliseconds, (1000 * 1000) / sp.Elapsed.TotalSeconds);
  29: }

You'll note that I am not doing any flushing here. That is intentional for now. Using this, I am getting 5+ million ops per second. But since I am not doing any flushing, this is pretty much me testing how fast I can write to memory.

Adding a single flush cost us 1.8 seconds for a 768MB file. And what about doing the right thing? Adding the following in line 26 means that we are actually flushing the buffers.

   1: FlushViewOfFile(basePtr, new IntPtr(12 * 1024));

Note that this still does not flush to disk; we still need to do that. But for now, let us try doing just this much. This single line took the code from 5+ million ops / sec down to 170,988 ops / sec. And that does NOT include the actual flushing to disk. When we do that, too, we get a truly ridiculous number: 20,547 ops / sec. And that explains quite a lot, I think.

For reference, here is the full code:

   1: unsafe class Program
   2: {
   3:     [DllImport("kernel32.dll", SetLastError = true)]
   4:     [return: MarshalAs(UnmanagedType.Bool)]
   5:     extern static bool FlushViewOfFile(byte* lpBaseAddress, IntPtr dwNumberOfBytesToFlush);
   6:  
   7:     static void Main(string[] args)
   8:     {
   9:         var key = Guid.NewGuid().ToByteArray();
  10:         var buffer = new byte[100];
  11:         var random = new Random();
  12:         random.NextBytes(buffer);
  13:  
  14:         using (var fs = new FileStream("test.bin", FileMode.Truncate, FileAccess.ReadWrite))
  15:         {
  16:             fs.SetLength(1024 * 1024 * 768);
  17:  
  18:             var memoryMappedFile = MemoryMappedFile.CreateFromFile(fs,
  19:                                             "test", fs.Length, MemoryMappedFileAccess.ReadWrite,
  20:                                             null, HandleInheritability.None, true);
  21:             var memoryMappedViewAccessor = memoryMappedFile.CreateViewAccessor();
  22:  
  23:             byte* p = null;
  24:             memoryMappedViewAccessor.SafeMemoryMappedViewHandle.AcquirePointer(ref p);
  25:  
  26:             var sp = Stopwatch.StartNew();
  27:  
  28:             for (int i = 0; i < 10 * 1000; i++)
  29:             {
  30:                 var next = random.Next(0, 1024 * 1024 * 512);
  31:                 byte* basePtr = p + next;
  32:                 using (var ums = new UnmanagedMemoryStream(basePtr, 12 * 1024, 12 * 1024, FileAccess.ReadWrite))
  33:                 {
  34:                     for (int j = 0; j < 100; j++)
  35:                     {
  36:                         ums.Write(key, 0, 16);
  37:                         ums.Write(buffer, 0, 100);
  38:                     }
  39:                 }
  40:                 FlushViewOfFile(basePtr, new IntPtr(12 * 1024));
  41:                 fs.Flush(true);
  42:             }
  43:             Console.WriteLine("{0:#,#} ms for {1:#,#} ops / sec", sp.ElapsedMilliseconds, (1000 * 1000) / sp.Elapsed.TotalSeconds);
  44:         }
  45:     }
  46: }

This is about as efficient as you can get when writing to the disk using memory mapped files in a transactional manner. And that is the absolute best case scenario, pretty much, where we know exactly what we wrote and where we wrote it. And we always write a single entry, of a fixed size, etc. In Voron's case, we might write to multiple pages in the same transaction (in fact, we are pretty much guaranteed to do just that).

This means that I need to think about other ways of doing that.
