Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

time to read 5 min | 844 words

I’m using a 100/99 node cluster as the example, but the discussion applies just as well to smaller clusters (dozens of nodes) and bigger ones (hundreds or thousands). Pretty much the only reason to go with a cluster of that size is that you want to scale out your processing in some manner. I’ve already discussed why a hundred node cluster isn’t a good option for safety reasons.

Consensus algorithms create a single consensus across the entire cluster, usually about an ordered set of operations that are fed to a state machine. The easiest example of such a state machine would be a dictionary. But it makes no sense to have a single dictionary spread across a hundred nodes. Why would you need to do that? How would it let you make full use of the power of all those nodes?

Usually nodes are used for either compute or storage purposes. Compute is much easier, so let us take that as the example. A route calculation system needs to do a lot of computation on a relatively small amount of information (the map data). Whenever there is a change in the map (a route is blocked, a new road opens, etc.), it needs to send that information to all the servers and make sure that it isn’t lost.

Since calculating routes is expensive (we’ll ignore the options for optimizations and caching for now), we want to scale it to many nodes. And since the source data is relatively small, each node can have a full copy of the data. Under this scenario, the actual problem we have to solve is how to ensure that once we save something to the cluster, it is propagated to the entire cluster.

The obvious way to do this is with a hierarchy:

image

Basically, the big icons are the top cluster, each of which is responsible for updating a set of secondary servers, which are in turn responsible for updating the tertiary servers.

To be perfectly honest, this looks nice, and even reasonable, but it is going to cause a lot of issues. Sure, the top cluster is resilient to failures, but relying on a node to be up to notify other nodes isn’t so smart. If one of the nodes in the top cluster goes down, then we have about 20% of our cluster that didn’t get the notice, which kind of sucks.

A better approach would be to go with a management system and background gossip:

image

In other words, the actual decisions are made by the big guys (literally, in this picture). This is a standard consensus cluster (Paxos, Raft, etc.). Once a decision has been made by the cluster, we need to send it to the rest of the nodes in the system. We can do that either by just sending the messages to all the nodes, or by selecting a few nodes and having them send the messages to their peers. The protocol for that is something like: “What is the last command id you have? Here is what I have after that.” Assuming that each processing node is connected to a few other servers, that means that we can spread the information very quickly across the entire cluster. And even if there are errors, the gossiping servers will correct them (note that there is an absolute order of the commands, ensured by the consensus cluster, so there isn’t an issue about agreeing on the order, just distributing the data).
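To make that exchange concrete, here is a minimal sketch of the catch-up conversation between two nodes. The type and member names are hypothetical, just an illustration of the idea, not an actual protocol definition.

// A minimal sketch of the gossip catch-up exchange described above; the type and
// member names here are hypothetical, not an actual protocol definition.
using System.Collections.Generic;
using System.Linq;

public class Command
{
    public long Id;        // absolute order, assigned by the consensus cluster
    public string Payload;
}

public class GossipNode
{
    private readonly SortedDictionary<long, Command> log = new SortedDictionary<long, Command>();

    // "What is the last command id you have?"
    public long LastCommandId
    {
        get { return log.Count == 0 ? 0 : log.Keys.Last(); }
    }

    // "Here is what I have after that."
    public void GossipWith(GossipNode peer)
    {
        var missing = log.Values.Where(c => c.Id > peer.LastCommandId).ToList();
        foreach (var command in missing)
            peer.Apply(command);
    }

    public void Apply(Command command)
    {
        log[command.Id] = command; // idempotent: re-applying a command just overwrites it
    }
}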

Usually the gossip topology follows the actual physical distribution. So the consensus cluster will notify a couple of servers on each rack, and let the servers in the rack gossip among themselves about the new value.

This means that once we send a command to the cluster, the consensus cluster agrees on it, and then we distribute it to the rest of the nodes. There is a gap between the consensus confirming a command and its actual distribution to all the nodes, but that is expected in any distributed system. If it is important for a command to take effect at the same time across the entire cluster, the command is usually time activated (which requires clock sync, but that is something we can blame on the ops team, so we don’t care).

With this system, we can have an eventually consistent set of data across the entire cluster, and we are happy.

Of course, this is only relevant for compute clusters, the kind of cluster where you compute a result, return it to the client and that is about it. There are other types of clusters, but I’ll talk about them in my next post.

time to read 1 min | 85 words

We are getting to the part where we are running out of things to do, so we set up a live instance of RavenDB 3.0 and opened it up for the world to play with.

It is available here: http://live-test.ravendb.net

Disclaimer - it may go down at any moment, data will routinely be wiped, and anything you put there is public and can be copied and used by other users. This is strictly for playing around with it, nothing more.

Give it a shot, see all the new cool stuff.

time to read 4 min | 705 words

The question crossed my desk, and it was interesting enough that I felt it deserved a post. The underlying scenario is this: we have distributed consensus protocols that are built to make sure that we can properly arrive at a decision and have the entire cluster follow it, regardless of failures. Those are things like Paxos or Raft. The problem is that those protocols are all aimed at a relatively small number of nodes, typically 3 to 5. What happens if we need to manage a large number of machines?

Let us assume that we have a cluster of 99 machines. What would happen under this scenario? Well, all consensus algorithms work on top of the notion of a quorum: at least (N/2 + 1) machines must have the same data. For a 3 node cluster, that means that any decision that is on 2 machines is committed, and for a 5 node cluster, it means that any decision that is on 3 machines is committed. What about 99 nodes? Well, a decision would have to be on 50 machines to be committed.

That means making 196 requests (98 x 2: once for the command, then for the confirmation) for each command. That… is a lot of requests. And I’m not sure that I want to see what it would look like in terms of perf. So just scaling things out in this manner is out.

In fact, this is also a pretty strange thing to do. The notion of distributed consensus is that you reach agreement on a state machine. The easiest way to think about it is that you reach agreement on a set of values among all nodes. But why would you share those values among so many nodes? It isn’t for safety, that is for sure.

Assume that we have a cluster of 5 nodes, with each node having 99% availability (which translates to about 3.5 days of downtime per year). The probability of all nodes in the cluster being up at the same time is about 95%, which corresponds to roughly 18 days a year.

But we don’t need them to all be up. We just need any three of them to be up. That means that the math is going to be much nicer for us (see here for an actual discussion of the math).

In other words, here are the availability numbers if each node has a 99% availability:

Number of nodes | Quorum | Availability        | Downtime
3               | 2      | 99.97%              | ~2.5 hours per year
5               | 3      | 99.999% (5 nines)   | ~5 minutes per year
7               | 4      | 99.9999% (6 nines)  | ~12 seconds per year
99              | 50     | 100%                |
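For reference, here is a quick sketch of how those availability numbers can be computed, assuming node failures are independent (the class and method names are mine, just for illustration):

using System;

class QuorumAvailability
{
    // Probability that at least `quorum` of `nodes` machines are up at any given
    // moment, given the availability of a single node and independent failures.
    static double Availability(int nodes, int quorum, double perNode)
    {
        double total = 0;
        for (int up = quorum; up <= nodes; up++)
            total += Binomial(nodes, up) * Math.Pow(perNode, up) * Math.Pow(1 - perNode, nodes - up);
        return total;
    }

    static double Binomial(int n, int k)
    {
        double result = 1;
        for (int i = 1; i <= k; i++)
            result = result * (n - k + i) / i;
        return result;
    }

    static void Main()
    {
        Console.WriteLine(Availability(3, 2, 0.99));   // ~0.9997
        Console.WriteLine(Availability(5, 3, 0.99));   // ~0.99999
        Console.WriteLine(Availability(99, 50, 0.99)); // so close to 1 that it prints as 1
    }
}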

Note that all of this is based around each node having about 3.5 days of downtime per year. If we can have availability of 99.9% (or about 9 hours a year), the availability story is:

Number of nodes | Quorum | Availability          | Downtime
3               | 2      | 99.9997%              | ~2 minutes per year
5               | 3      | 99.999999% (8 nines)  | well under a second per year
7               | 4      | 100%                  |

So in rough terms, we can say that going to a 99 node cluster isn’t a good idea. It is quite costly in terms of the number of operations required to ensure a commit, and from a safety perspective, you can get the same level of safety at a drastically lower cost.

But there is now another question, what would we actually want to do with a 99 node cluster*? I’ll talk about this in my next post.

* A hundred node cluster only makes sense if you have machines with about 80% availability. In other words, machines that are down for about 2.5 months every year. I don’t think that this is a scenario worth discussing.

time to read 5 min | 941 words

An interesting challenge came across my desk. Let us assume that we have a set of libraries, which looks like this:

{
    "Name": "The Geek Hangout",
    "OpeningHours": {
        "Sunday": [
            {   "From": "08:00", "To": "13:00"  },
            {   "From": "16:00", "To": "19:00"  }
        ],
        "Monday": [
            {   "From": "09:00", "To": "18:00"  },
            {   "From": "22:00", "To": "23:59"  }
        ],
        "Tuesday": [
            {   "From": "00:00", "To": "04:00"  },
            {   "From": "11:00", "To": "18:00"  }
        ]
    }
}
{
    "Name": "Beer & Books",
    "OpeningHours": {
        "Sunday": [
            {   "From": "16:00", "To": "23:59"  }
        ],
        "Monday": [
            {   "From": "00:00", "To": "02:00"  },
            {   "From": "10:00", "To": "22:00"  }
        ],
        "Tuesday": [
            {   "From": "10:00", "To": "22:00"  }
        ]
    }
}

I only included three days, to make it shorter, but you get the point. You can also see that there are times when the opening hours run past midnight into the next day.

Now, the question we need to answer is: “find me an open library now”.

How can we answer such a question? If we were using SQL, it would be something like this:

select * from Libraries l 
where l.Id in (
         select oh.LibraryId from OpeningHours oh 
         where oh.Day = dayofweek(now()) AND oh.From <= now() AND oh.To > now()
) 

I’ll leave the performance of such a query to your imagination, but the key point is that we cannot actually express such a computation in RavenDB. We can do range queries, but in this case, it is the current time that we compare to the range of values. So how do we answer such a query?

As usual, by not trying to answer the question in that form at all. Here is my index:

image
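Roughly, the shape of such an index might look like the sketch below. The half-hour expansion and field layout are my own illustration of the idea, not the actual index:

from library in docs.Libraries
from day in library.OpeningHours
select new
{
    library.Name,
    // one dynamic field per open half hour, named after the day, e.g. Sunday = "16:30"
    _ = Enumerable.Range(0, 48)
            .Select(slot => string.Format("{0:00}:{1:00}", slot / 2, slot % 2 * 30))
            .Where(slot => day.Value.Any(range =>
                string.Compare(slot, range.From) >= 0 &&
                string.Compare(slot, range.To) < 0))
            .Select(slot => CreateField(day.Key, slot))
}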

The result of this is an index entry per day, and in each index entry, we have outputted the half hours that this library is open. So if we want to check for libraries that are open on Sunday at 4:30 PM, all we have to do is issue the following query:

image
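In client code, and assuming the sketch index above is deployed under a hypothetical name, the query would be along these lines:

using (var session = store.OpenSession()) // store: an already initialized DocumentStore
{
    var openNow = session.Advanced
        .DocumentQuery<Library>("Libraries/ByOpeningHours") // hypothetical index name
        .WhereEquals("Sunday", "16:30")
        .ToList();
}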

The power of dynamic fields and index time computation means that this is an easy query to make, and even more importantly, this is something that we can answer very efficiently.

time to read 1 min | 93 words

We were asked this a few times, so I think it is worth clarifying.

If you have a subscription license to RavenDB, you have automatic access to all versions of RavenDB for as long as your subscription is current. That means that if you purchase a RavenDB 2.x subscription, your license allows you to use RavenDB 3.0 without any issues.

Note that this doesn’t include using RavenFS, which will require an updated license.

If you purchased RavenDB using the one time code, you’ll need to purchase a new license for RavenDB 3.0.

time to read 2 min | 246 words

So we had a problem in our production environment. It showed up like this.

image

The first thing that I did was log into our production server, and look at the logs for errors. This is part of the new operational features that we have, and it was a great time to try it under real world conditions:

image

This gave me a subscription to the log, which gave me the exact error in question:

image

From there, I went to look at the exact line of code causing the problem:

image

Can you see the issue?

We created a grouping that was case sensitive, and then used it to create a dictionary that was case insensitive. The actual fix was adding the ignore case comparer to the group by clause, and then pushing to production.
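Here is a tiny reproduction of that failure mode, using made-up values rather than the actual production code:

using System;
using System.Linq;

class CaseSensitivityRepro
{
    static void Main()
    {
        var keys = new[] { "ETag", "etag" }; // made-up input

        try
        {
            // Case sensitive group by: "ETag" and "etag" become two separate groups,
            // so building a case insensitive dictionary from them throws.
            var broken = keys
                .GroupBy(k => k)
                .ToDictionary(g => g.Key, g => g.Count(), StringComparer.OrdinalIgnoreCase);
        }
        catch (ArgumentException e)
        {
            Console.WriteLine(e.Message); // An item with the same key has already been added.
        }

        // The fix: use the ignore case comparer in the group by as well.
        var fixedUp = keys
            .GroupBy(k => k, StringComparer.OrdinalIgnoreCase)
            .ToDictionary(g => g.Key, g => g.Count(), StringComparer.OrdinalIgnoreCase);

        Console.WriteLine(fixedUp["etag"]); // 2
    }
}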

time to read 1 min | 115 words

I’m currently with some of our team at the Oredev conference. So if you are here, seek us out.

In other good news, the new website for RavenDB is now up, and that means that we are no longer selling RavenDB 2.x. We are now selling RavenDB 3.0 only*!

With this, the last hurdle before releasing RavenDB 3.0 is pretty much out of the way. We’ll probably wait until we are back from Oredev and have recovered a bit, but we are on track for a stable release of RavenDB 3.0 next week or the one just after.

In the meantime, go ahead and look at the new website.

* A RavenDB 3.0 license can work for 2.5, though.

time to read 3 min | 486 words

This started out as a customer engagement, but it was interesting to see how we solved it.

The problem is searching for books. Let us take the following books as a good example:

image

We have users that want recommendations for books on specific topics, and authors can pay us to promote their books. You can see how that looks above.

Now, the rules we want to follow for sorting the results are fairly simple. Find all the matching books, and sort them so that:

  • If the user searched for a book’s primary tag and the author paid to promote that tag, show it first.
  • If the user searched for a book’s secondary tag and the author paid to promote that tag, show it second.
  • If the user searched for a book’s primary tag and the author didn’t pay to promote that tag, show it third.
  • If the user searched for a book’s secondary tag and the author didn’t pay to promote that tag, show it fourth.

Actually trying to specify the sort order according to these rules turns out to be quite hard to do, but we can take advantage of boosting to get what we want.

We define the following index:

from book in docs.Books
select new
{
  PaidPrimaryTag = book.Tags.Where(x=>x.Primary && x.Paid).Select(x=>x.Name),
  PaidSecondaryTag = book.Tags.Where(x=>x.Primary == false && x.Paid).Select(x=>x.Name),
  PrimaryTag = book.Tags.Where(x=>x.Primary).Select(x=>x.Name),
  SecondaryTag = book.Tags.Where(x=>x.Primary == false).Select(x=>x.Name),
}

And now we want to do a few searches: first for NoSQL, and then for RavenDB.

The actual query we issue is:

image
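In client code, the query would be along the lines of this sketch (the index name and boost values are illustrative, not the actual ones we used):

using (var session = store.OpenSession()) // store: an already initialized DocumentStore
{
    var results = session.Advanced
        .DocumentQuery<Book>("Books/ByTags")                 // hypothetical index name
        .WhereEquals("PaidPrimaryTag", "nosql").Boost(1000)  // paid + primary: show first
        .OrElse()
        .WhereEquals("PaidSecondaryTag", "nosql").Boost(100) // paid + secondary: show second
        .OrElse()
        .WhereEquals("PrimaryTag", "nosql").Boost(10)        // unpaid primary: show third
        .OrElse()
        .WhereEquals("SecondaryTag", "nosql")                // unpaid secondary: show fourth
        .ToList();
}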

And as you can see, books/3 is shown first, because the author paid for higher ranking. What about when we do that with RavenDB?

image

We have books/3, as before, but books/2 is higher ranked than books/1. Why is that? Because books/2 paid to have a higher ranking on a secondary tag, and it is more important than even a primary tag match according to our query.

This is quite elegant, and it also allows us to take into account relevancy in the search as well.

time to read 2 min | 267 words

We got a failing test because of some changes we made in RavenDB, and the underlying reason ended up being this code:

image

The problem was that the type that I was expecting did inherit from the right stuff. See this:

image

So something here is very wrong. I tracked this until I got to:

return RuntimeTypeHandle.CanCastTo(fromType, this);

And there I stopped. I worked around the issue by using IsSubclassOf instead of IsAssignableFrom.

The problem with IsAssignableFrom is that it is a confusing method. The target type is supposed to be the instance you call it on, and the type you are checking is the parameter, but it is very easy to forget that and get confused. This worked for 99% of cases, because the single assembly we usually load also contained RavenBaseApiController (which can obviously be assigned to itself), so it looked like it worked. IsSubclassOf is much nicer, but you need to understand that it won’t work for interfaces, and it doesn’t check direct equality. In this case, that was exactly what I needed, so it worked.
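To make the confusion concrete, here is a small standalone example of how the two methods behave (not the RavenDB code, just the reflection semantics):

using System;

class Base { }
class Derived : Base { }
interface IThing { }
class Impl : IThing { }

class ReflectionDemo
{
    static void Main()
    {
        // IsAssignableFrom: the instance you call it on is the *target* type,
        // the parameter is the type you are checking. Easy to get backwards.
        Console.WriteLine(typeof(Base).IsAssignableFrom(typeof(Derived)));   // True
        Console.WriteLine(typeof(Derived).IsAssignableFrom(typeof(Base)));   // False

        // IsSubclassOf reads the other way around, but only walks the base class chain.
        Console.WriteLine(typeof(Derived).IsSubclassOf(typeof(Base)));       // True
        Console.WriteLine(typeof(Impl).IsSubclassOf(typeof(IThing)));        // False - interfaces don't count
        Console.WriteLine(typeof(Base).IsSubclassOf(typeof(Base)));          // False - no direct equality
    }
}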
