time to read 13 min | 2574 words

As part of the work we have been doing on Voron, I wrote a few benchmarks and looked at where the hot spots are. One of the major ones was this function:

    public override void Flush()
    {
        if (_flushMode == FlushMode.None)
            return;

        PagerState.Accessor.Flush();
        _fileStream.Flush(_flushMode == FlushMode.Full);
    }

This is effectively the “fsync()” call. The Accessor.Flush() call resolves to FlushViewOfFile(0, size), and _fileStream.Flush(true) resolves to FlushFileBuffers on Windows.

It isn’t surprising that this would be THE hot spot; it is the part where we actually have to wait for the hardware to do stuff, after all. But further investigation revealed that it wasn’t the FlushFileBuffers that was really costly, it was the FlushViewOfFile. What FlushViewOfFile does is scan all of the pages in the range and flush them to the OS (not to disk) if they are dirty. That is great, but it is effectively an O(N) operation. We have more knowledge about what is going on, so we can do better. We already know which pages are dirty, so we can use that instead of letting the OS do all the work.

But then we run into another problem. If we call FlushViewOfFile for every page separately, we are going to spend a lot of time just calling into the OS when we have to do a large write. So we need to balance the amount of data we send to FlushViewOfFile against the number of times we call FlushViewOfFile. Therefore, I came up with the following logic: we group calls to FlushViewOfFile as long as the pages are nearby (within 256KB of one another, which at 4KB per page is the 64 page threshold you’ll see in the code below). For example, flushing pages {4, 10, 200} results in one call covering pages 4 through 10 and a second call for page 200 alone. This gives us the best balance between reducing the number of pages that FlushViewOfFile needs to scan and the number of times we call FlushViewOfFile.

This now looks like this:

    public override void Flush(List<long> sortedPagesToFlush)
    {
        if (_flushMode == FlushMode.None || sortedPagesToFlush.Count == 0)
            return;

        // here we try to optimize the amount of work we do, we will only
        // flush the actual dirty pages, and we will do so in sequential order
        // ideally, this will save the OS the trouble of actually having to flush the
        // entire range
        long start = sortedPagesToFlush[0];
        long count = 1;
        for (int i = 1; i < sortedPagesToFlush.Count; i++)
        {
            var difference = sortedPagesToFlush[i] - sortedPagesToFlush[i - 1];
            // if the difference between them is not _too_ big, we will just merge it into a single call
            // we are trying to minimize both the size of the range that we flush AND the number of times
            // we call flush, so we need to balance those needs.
            if (difference < 64)
            {
                count += difference;
                continue;
            }
            FlushPages(start, count);
            start = sortedPagesToFlush[i];
            count = 1;
        }
        FlushPages(start, count);

        if (_flushMode == FlushMode.Full)
            _fileStream.Flush(true);
    }
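The FlushPages method itself isn’t shown in the post. Here is a minimal sketch of what it might look like, assuming a memory mapped view whose base address is held in a _baseAddress field, a 4KB page size in a _pageSize field, and the Win32 FlushViewOfFile API; the field and method names here are illustrative, not the actual Voron code:

    // requires: using System; using System.ComponentModel; using System.Runtime.InteropServices;
    [DllImport("kernel32.dll", SetLastError = true)]
    private static extern bool FlushViewOfFile(IntPtr lpBaseAddress, UIntPtr dwNumberOfBytesToFlush);

    private void FlushPages(long startPage, long count)
    {
        // translate the page range into a byte range within the mapped view
        var address = new IntPtr(_baseAddress.ToInt64() + startPage * _pageSize);
        var bytes = new UIntPtr((ulong)(count * _pageSize));

        // flush only this range of dirty pages to the OS, rather than the entire view
        if (FlushViewOfFile(address, bytes) == false)
            throw new Win32Exception(Marshal.GetLastWin32Error());
    }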

A side effect of this is that we are also more likely to be writing to the disk in a sequential fashion.

The end result of this change ranged from doubling the performance of the system in the worst case scenario to “just” 25% faster under the best conditions.

time to read 1 min | 97 words

With the public release of RavenDB 2.5, we want to hear a lot more from users about what they are doing with RavenDB. Therefore, we decided to have a contest.

Basically, we ask you to write a post about your RavenDB experience on the RavenDB page on Facebook. We will send a free RavenDB care package (which includes an awesome T-Shirt & laptop stickers) to the first 50 people to send us their stories.

We will also raffle 3 RavenDB DVDs among those who submit their stories. The contest will end on Sep 20.

time to read 2 min | 229 words

About six weeks ago, we actually released RavenDB 2.5 to the world. It was build 2666. I decided to do something a bit different and do a silent launch. In other words, we released it, let the people on the mailing list know about it, but we didn’t make a big fuss about it.

Now we have a new build, 2700 (which isn’t going to evoke… certain issues for some people), and we want to make as much noise as possible. Because RavenDB 2.5 is out, and it is really cool.

Here is some of the new stuff:

And that is just a taste.

In fact, we are going to do a Webinar about how cool RavenDB 2.5 is on Monday. You can register using the following link: https://www2.gotomeeting.com/register/551636514

And, of course, go and get the latest RavenDB from our site.

And have a great New Year, everyone. We will be off for the Holiday until next Monday…

time to read 14 min | 2649 words

One of the steps that we take before releasing a stable build is to push the latest bits to our own production servers and see what goes on. So far, this has been a pretty pleasant process, and it has mostly served to increase our confidence that we can go to production with that version. But sometimes it does exactly what it is supposed to do and finds the sort of bugs that are very hard to catch anywhere but production.

In this case, after several hours (8 – 12, I am guessing), we discovered that we would start getting errors such as EsentOutOfSessionsException on some of our sites. Esent sessions are the main way we access Esent, and we are pretty careful about managing them. Previously, there wasn’t really any way that you could get this error; indeed, this is pretty much the first time we have seen it outside of the lab. The difference in 2.5 is that we allowed detached sessions to be used along with DTC calls. This gives us the ability to have a commit pending between the Prepare & Commit phases of the operation.

Reviewing the code, I found some places where we weren’t properly disposing the sessions, which could explain that. So I fixed that and pushed a new version out. It took a bit longer this time, but the same error happened.

The good thing about having this happen on our production servers is that I have full access there. Of course, it is production, so outright debugging it is out of the question, but taking a dump and transferring that to my machine was easy enough.

Now, open it with WinDBG, run “.loadby sos clr” and start investigating.

First command, as always, is !threads. And there I could see several threads marked with Microsoft.Isam.Esent.Interop.EsentOutOfSessionsException. That was interesting; it meant that we had caught the problem as it was happening, which was great.

Next, it was time to look a bit at the actual memory. I ran: !DumpHeap -type Session

[image: !DumpHeap output showing roughly 35,000 Microsoft.Isam.Esent.Interop.Session instances on the heap]

My reaction was: Huh!!! There is absolutely zero justification for that.

Now, the only question is why. So I decided to look at the class that holds the transaction state, assuming that this is probably what is holding onto all those sessions. I ran: !DumpHeap -type EsentTransactionContext

[image: !DumpHeap output showing 317 EsentTransactionContext instances]

And that tells me quite a lot. There appear to be a total of 317 in-flight DTC transactions. Considering that I know what our usage is like, that is a huge number, and it tells me that something isn’t right here. This is especially true when you consider that we don’t have that many open databases holding in-flight transactions: !DumpHeap -type EsentInFlightTransactionalState -stat

[image: !DumpHeap -stat output showing 8 EsentInFlightTransactionalState instances]

In other words, we have 8 loaded databases, each of them holding its in-flight transactional state. And we have 317 open transactions and 35 thousand sessions. That is pretty ridiculous, especially given that I know we are supposed to have at most a single-digit number of concurrent DTC transactions at any one time. So somehow we are leaking transactions & sessions. But I am still very unhappy with just “we are leaking sessions”. That is something that I knew before we started debugging anything.

I can already tell that we probably need to add a more robust way of expiring transactions, and I added that, but the numbers don’t add up for me. Since this is pretty much all I can do with WinDBG, I decided to use another tool, MemProfiler. This gives me the ability to import the dump file and then analyze it in a much nicer manner. Doing so, I quickly found this out:

[image: MemProfiler view showing tens of thousands of Session instances sitting in the finalizer queue]

Huh?!

Sessions are finalizable, sure, but I am very careful about making sure to dispose of them, especially after the previous code change. There should be just 317 undisposed sessions, and having that many items in the finalizer queue can certainly explain things. But I don’t know how they got there. And the numbers don’t match up, either; we are missing about 7K items compared to the WinDBG numbers.

Okay, next, I pulled ClrMD and wrote the following:

    using System;
    using System.Collections.Generic;
    using Microsoft.Diagnostics.Runtime;

    var dt = DataTarget.LoadCrashDump(@"C:\Users\Ayende\Downloads\w3wp\w3wp.dmp");
    var moduleInfo = dt.ClrVersions[0].TryGetDacLocation();
    var rt = dt.CreateRuntime(moduleInfo);

    var clrHeap = rt.GetHeap();

    // first, collect all the Session instances sitting in the finalizer queue
    var finalized = new HashSet<ulong>();
    foreach (var ptr in rt.EnumerateFinalizerQueue())
    {
        var type = clrHeap.GetObjectType(ptr);
        if (type != null && type.Name == "Microsoft.Isam.Esent.Interop.Session")
        {
            finalized.Add(ptr);
        }
    }
    Console.WriteLine(finalized.Count);

    // then, count the Session instances on the heap that are *not* awaiting finalization
    var live = new HashSet<ulong>();
    foreach (var ptr in clrHeap.EnumerateObjects())
    {
        var type = clrHeap.GetObjectType(ptr);
        if (type != null && type.Name == "Microsoft.Isam.Esent.Interop.Session")
        {
            if (finalized.Contains(ptr) == false)
                live.Add(ptr);
        }
    }
    Console.WriteLine(live.Count);

This gave me 28,112 sessions in the finalizer queue and 7,547 sessions that are still live. So something is creating a lot of instances, but not using or referencing them?

I did a code review over everything once again, and I think that I got it. The culprit is this guy:

[image: code calling transactionContexts.GetOrAdd(id, createContext)]

Where createContext is defined as:

[image: createContext, a delegate that opens a new Esent Session and wraps it in an EsentTransactionContext]

Now, what I think is going on is that the concurrent dictionary (which is what transactionContexts is) might be calling createContext multiple times inside the GetOrAdd method. But because those calls create values that have to be disposed… Now, in the normal course of things, the worst case scenario is that we would have them in the finalizer queue and they would be disposed in due time. However, under load, we actually gather quite a few of them, and we run out of available sessions to operate with.
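This behavior is easy to demonstrate in isolation. Here is a small self-contained console program (not from the RavenDB code base) showing that ConcurrentDictionary.GetOrAdd may invoke its value factory more than once for the same key under contention, keeping only one of the produced values:

    using System;
    using System.Collections.Concurrent;
    using System.Threading;
    using System.Threading.Tasks;

    class GetOrAddRace
    {
        static int factoryCalls;

        static void Main()
        {
            var dict = new ConcurrentDictionary<int, object>();
            Parallel.For(0, 1000, i =>
            {
                dict.GetOrAdd(42, key =>
                {
                    Interlocked.Increment(ref factoryCalls);
                    Thread.Sleep(1); // widen the race window
                    return new object();
                });
            });
            // the dictionary only ever holds one value for key 42, yet the
            // factory typically reports having run more than once
            Console.WriteLine("factory calls: {0}", factoryCalls);
        }
    }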

At least, this is my current theory. I changed the code to be like this:

[image: the fixed code, which disposes the newly created context if it was not the instance that ended up in the dictionary]

So if my value wasn’t the one that actually got added, I’ll properly dispose of it. I’ll be pushing this to production in a bit and seeing what happens. Note that there isn’t any locking here, so we might still be generating multiple sessions. That is fine, as long as only one of them survives.
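In outline, the change follows the usual create-then-dispose-the-loser pattern; this is a sketch using the names from the snippets above, not the exact RavenDB code:

    // create the context eagerly, outside of GetOrAdd
    var newContext = createContext();

    // passing a value (rather than a factory) to GetOrAdd means exactly one
    // instance ends up in the dictionary, and we can tell whether it is ours
    var context = transactionContexts.GetOrAdd(id, newContext);

    if (ReferenceEquals(context, newContext) == false)
    {
        // another thread won the race; dispose our extra context (and its
        // session) now, instead of letting it linger in the finalizer queue
        newContext.Dispose();
    }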

time to read 3 min | 409 words

Another thing that is pretty common in development cycles is the notion of who can do more. Hours, that is, rather than work. That is a pretty important distinction.

In general, I value Work much more than Hours, for the simple reason that someone doing 12 hours a day in the office usually does a lot less actual work. Sprints are possible, and we do that sometimes, usually if there is a major production issue or we are gearing up for a release.

Then again, we have just released RavenDB 2.5, and we haven’t had the need for doing that. It was simpler & easier to push the date by a week than to do long hours just to hit an arbitrary point in time. I think that in the last six months, we have had people stay in the office past 5 – 6 PM twice.

There are three reasons for that. The two obvious ones are:

  • people doing 12 – 18 hours of work each day tend to do crappy work, so that is bad for the product.
  • people doing 12 – 18 hours of work each day also tend to have… issues. They burn out, quite rapidly, too. Leaving aside issues such as this one, people crash and burn.

I know that I said it before, but it is important to note: burnout will do nasty things to you. Leaving aside the proven physical and mental health issues that it causes, it boils down to this: I’ve burned out before, and it sucks. “Let us not do that” is a pretty important aspect of what I do on a daily basis. That is why I turned to building products, because being on the road 60% of the time isn’t sustainable, and if that is something that I feel, it is certainly true for the other people who work for Hibernating Rhinos.

But I said that there are three reasons, and the third might be just as important as the others. Hibernating Rhinos was built to be a place that people retire from. This is the ideal, and we are probably talking 40 years from now, considering all factors, but that is the idea. We aren’t a startup chasing the pot of gold for that one-in-a-hundred chance to make it rich.

And that is why I had to kick people out of the office and tell them to continue working on that issue tomorrow.
