Oren Eini

CEO of RavenDB

a NoSQL Open Source Document Database

Get in touch with me:

oren@ravendb.net +972 52-548-6969

Posts: 7,565
|
Comments: 51,184
Privacy Policy · Terms
filter by tags archive
time to read 8 min | 1455 words

The first thing that we will do with RavenDB and the NuGet data is to issue the same logical query as the one used to populate the packages page. As a reminder, here is how it looks:

SELECT        TOP (30) 
          -- ton of fields removed for brevity
FROM        (

            SELECT        Filtered.Id
                    ,    Filtered.PackageRegistrationKey
                    ,    Filtered.Version
                    ,    Filtered.DownloadCount
                    ,    row_number() OVER (ORDER BY Filtered.DownloadCount DESC, Filtered.Id ASC) AS [row_number]
            FROM        (
                        SELECT        PackageRegistrations.Id
                                ,    Packages.PackageRegistrationKey
                                ,    Packages.Version
                                ,    PackageRegistrations.DownloadCount
                        FROM        Packages
                        INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Packages.PackageRegistrationKey
                        WHERE        Packages.IsPrerelease <> cast(1 as bit)
                        ) Filtered
            ) Paged
INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Paged.PackageRegistrationKey
INNER JOIN    Packages ON Packages.PackageRegistrationKey = Paged.PackageRegistrationKey AND Packages.Version = Paged.Version
WHERE        Paged.[row_number] > 30
ORDER BY    PackageRegistrations.DownloadCount DESC
        ,    Paged.Id

Despite the apparent complexity ,this is a really trivial query. What is does is say:

  • Give me the first 30 – 60 rows
  • Where IsPrerelease is false
  • Order by the download count and then the id

With Linq, the client side query looks something like this:

var results = Session.Query<Package>()
                         .Where(x=>x.IsPrerelease == false)
                         .OrderBy(x=>x.DownloadCount).ThenBy(x=>x.Id)
                         .Skip(30)
                         .Take(30)
                         .ToList();

Now, I assume that this is what the NuGet code is also doing, it is just that the relational database has made it so they have to go to the data in a really convoluted way.

With RavenDB, to match the same query, I could just issue the following query, but there are subtle differences between how the query works in SQL and how it works in RavenDB. in particular, the data that we have in RavenDB is the output of this query, but it isn’t the raw output. For example, we don’t have the Id column available, which is used for sorting. Now, I think that the logic is meaning to say, “sort by download count descending and then by age ascending”. So old and popular packages are more visible than new and fresh packages.

In order to match the same behavior (and because we will need it to the next post) we will define the following index in RavenDB:

image

And querying it:

image

The really nice thing about this?

image

This is the URL for this search:

/indexes/Packages/Listing?query=IsPrerelease:false&start=0&pageSize=128&aggregation=None&sort=-DownloadCount&sort=Created

This is something that RavenDB can do in its sleep, because it is a very cheap operation. Consider the query plan that would for the SQL query above. You have to join 5 times just to get to the data that you want, paging is a real mess, and the database actually have to work a lot to answer this fiddling little query.

Just to give you some idea here. We are talking about something that conceptually should be the same as:

select top 30 skip 30 * from Data where IsPrerelease = 0

But it get really complex really fast with the joins and the tables and all the rest.

In comparison, in RavenDB, we actually do have just a property match to do. Because we keep the entire object graph in a single location, we can do very efficient searches on it.

In the next post, I’ll discuss the actual way I modeled the data, and then we get to do exciting searches Smile.

time to read 7 min | 1397 words

The first part of actually showing how RavenDB can handle the NuGet scenario was to actually get the data to RavenDB. Luckily, NuGet makes the data accessible using OData, so I quickly hacked up the following program:

using (var store = new DocumentStore
    {
        Url = "http://localhost:8080",
        DefaultDatabase = "Nuget"
    }.Initialize())
{
    string url = "https://nuget.org/api/v2/Packages";
    while (true)
    {
        Console.WriteLine("GET {0}", url);
        if (url == null)
            break;
        var webRequest = (HttpWebRequest)WebRequest.Create(url);
        webRequest.Accept = "application/json";
        using (var resp = webRequest.GetResponse())
        using (var strema = resp.GetResponseStream())
        {
            url = WritePackagesToRaven(strema, store);
        }
    }
}

This is imply going to NuGet and asking for the packages in json format. It is very easy for us to work with json data with RavenDB, so that is what we are doing.

The next stage is to actually read the response and write the packages to RavenDB, this is handled here:

private static string WritePackagesToRaven(Stream strema, IDocumentStore store)
{
    var json = RavenJToken.ReadFrom(new JsonTextReader(new StreamReader(strema)))
        .Value<RavenJObject>("d");


    using (var session = store.OpenSession())
    {
        foreach (RavenJObject result in json.Value<RavenJArray>("results"))
        {
            ModifyResult(result);
            session.Advanced.Defer(new PutCommandData
                {
                    Document = result,
                    Metadata = new RavenJObject
                        {
                            {"Raven-Entity-Name", "Packages"}
                        },
                    Key = "packages/" + result.Value<string>("Id") + "/" + result.Value<string>("Version")
                });
        }
        session.SaveChanges();
    }
    return json.Value<string>("__next");
}

I am not really sure why we have this “d” as the beginning of the json results, but that is what NuGet returns. We iterate over the query results, and write all of them to RavenDB.

You might note that we use the Defer() option, which means that we can rely on the session to handle batching for us and only go to the server once, when we call SaveChanges(). We also set the document metadata to be pretty basic, merely indicating that this should go on the Packages collection. Finally, we set the id to be composed of the package id and the version, resulting in a unique and human readable key for the imported package.

Note that we return the next page location, and continue on working on that in the next page loop.

There is one thing that we need to do, the NuGet data is still highly relational, and quite ugly at times. For example, let us take Tags and Dependencies. Here is how they show up in the raw results:

  • Dependencies: AboditUnits:1.0.4|Autofac.Mef:2.5.2.830|ImpromptuInterface:5.6.2|log4net:1.2.11
  • Tags: Ian_Mercer Natural_Language Abodit NLP

That isn’t a really nice way to work with the data, so before we save the results to RavenDB, we modify it slightly.

private static void ModifyResult(RavenJObject result)
{
    var tags = result.Value<string>("Tags");
    if (tags != null)
    {
        result["Tags"] =
            new RavenJArray(tags.Split(new[] {' ', ',', ';'}, StringSplitOptions.RemoveEmptyEntries));
    }
    else
    {
        result["Tags"] = new RavenJArray();
    }
    var deps = result.Value<string>("Dependencies");
    if (deps != null)
    {
        result["Dependencies"] =
            new RavenJArray(deps.Split(new[] {'|'}, StringSplitOptions.RemoveEmptyEntries)
                                .Select(s =>
                                    {
                                        var strings = s.Split(':');
                                        return RavenJObject.FromObject(new {Package = strings[0], Version = strings[1]});
                                    }));
    }
    result.Remove("__metadata");
}

Finally, let us take a peek at RavenDB and see how the results look like there:

image

Now that is much more like it.

On my next post, I am going to show how to do some queries againt this data, which currently have about 66,483 results.

time to read 14 min | 2742 words

It appears that NuGet has some perf problems recently and Jeff Handley posted the problematic queries as well as the new, hand optimized queries.

You can see the original problematic queries and the optimized code (still tentative) here.

Hand optimized query to load the pages for the packages page:

SELECT        TOP (30)
            Paged.PackageRegistrationKey
        ,    Paged.Id
        ,    Paged.Version
        ,    Packages.FlattenedAuthors
        ,    Packages.Copyright
        ,    Packages.Created
        ,    Packages.FlattenedDependencies
        ,    Packages.Description
        ,    PackageRegistrations.DownloadCount
        ,    Packages.ExternalPackageUrl
        ,    N'packages/' + PackageRegistrations.Id + N'/' + Packages.Version AS C1
        ,    Packages.IconUrl
        ,    Packages.IsLatestStable
        ,    Packages.Language
        ,    Packages.LastUpdated
        ,    Packages.LicenseUrl
        ,    Packages.Hash
        ,    Packages.HashAlgorithm
        ,    Packages.PackageFileSize
        ,    Packages.ProjectUrl
        ,    CASE Packages.Listed WHEN 1 THEN Packages.Published ELSE NULL END AS C2
        ,    Packages.ReleaseNotes
        ,    N'package/ReportAbuse/' + PackageRegistrations.Id + N'/' + Packages.Version AS C3
        ,    Packages.RequiresLicenseAcceptance
        ,    Packages.Summary
        ,    CASE WHEN Packages.Tags IS NULL THEN CAST(NULL as varchar(1)) ELSE N' ' + LTRIM(RTRIM(Packages.Tags)) + N' ' END AS C4
        ,    ISNULL(Packages.Title, PackageRegistrations.Id) AS C5
        ,    Packages.DownloadCount AS DownloadCount1
        ,    cast(0 as float(53)) AS C6
FROM        (

            SELECT        Filtered.Id
                    ,    Filtered.PackageRegistrationKey
                    ,    Filtered.Version
                    ,    Filtered.DownloadCount
                    ,    row_number() OVER (ORDER BY Filtered.DownloadCount DESC, Filtered.Id ASC) AS [row_number]
            FROM        (
                        SELECT        PackageRegistrations.Id
                                ,    Packages.PackageRegistrationKey
                                ,    Packages.Version
                                ,    PackageRegistrations.DownloadCount
                        FROM        Packages
                        INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Packages.PackageRegistrationKey
                        WHERE        Packages.IsPrerelease <> cast(1 as bit)
                        ) Filtered
            ) Paged
INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Paged.PackageRegistrationKey
INNER JOIN    Packages ON Packages.PackageRegistrationKey = Paged.PackageRegistrationKey AND Packages.Version = Paged.Version
WHERE        Paged.[row_number] > 30
ORDER BY    PackageRegistrations.DownloadCount DESC
        ,    Paged.Id
 

This monster query is actually translated to something like:

Give me the top 30 packages which are not pre released, ordered by the download count and then by their id.

It takes a great deal of complexity to deal with that for one major reason, the data is split up across multiple tables in a way that make it hard get all of it easily. The minor reason is that there is really no good way to do paging in SQL Server (shocking, I know). One would assume that such a basic feature would have a bit more attention.

What is worse is the optimized version of the search feature:

SELECT        TOP (30)
            Paged.PackageRegistrationKey
        ,    Paged.Id
        ,    Paged.Version
        ,    Packages.FlattenedAuthors
        ,    Packages.Copyright
        ,    Packages.Created
        ,    Packages.FlattenedDependencies
        ,    Packages.Description
        ,    PackageRegistrations.DownloadCount
        ,    Packages.ExternalPackageUrl
        ,    N'packages/' + PackageRegistrations.Id + N'/' + Packages.Version AS C1
        ,    Packages.IconUrl
        ,    Packages.IsLatestStable
        ,    Packages.Language
        ,    Packages.LastUpdated
        ,    Packages.LicenseUrl
        ,    Packages.Hash
        ,    Packages.HashAlgorithm
        ,    Packages.PackageFileSize
        ,    Packages.ProjectUrl
        ,    CASE Packages.Listed WHEN 1 THEN Packages.Published ELSE NULL END AS C2
        ,    Packages.ReleaseNotes
        ,    N'package/ReportAbuse/' + PackageRegistrations.Id + N'/' + Packages.Version AS C3
        ,    Packages.RequiresLicenseAcceptance
        ,    Packages.Summary
        ,    CASE WHEN Packages.Tags IS NULL THEN CAST(NULL as varchar(1)) ELSE N' ' + LTRIM(RTRIM(Packages.Tags)) + N' ' END AS C4
        ,    ISNULL(Packages.Title, PackageRegistrations.Id) AS C5
        ,    Packages.DownloadCount AS DownloadCount1
        ,    cast(0 as float(53)) AS C6
FROM        (

            SELECT        Filtered.Id
                    ,    Filtered.PackageRegistrationKey
                    ,    Filtered.Version
                    ,    Filtered.DownloadCount
                    ,    row_number() OVER (ORDER BY Filtered.DownloadCount DESC, Filtered.Id ASC) AS [row_number]
            FROM        (
                        SELECT        PackageRegistrations.Id
                                ,    Packages.PackageRegistrationKey
                                ,    Packages.Version
                                ,    PackageRegistrations.DownloadCount
                        FROM        Packages
                        INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Packages.PackageRegistrationKey
                        WHERE        ((((Packages.IsPrerelease <> cast(1 as bit)))))
                                ((((AND    Packages.IsLatestStable = 1))))
                                ((((AND    Packages.IsLatest = 1))))
                                AND    (
                                        PackageRegistrations.Id LIKE '%jquery%' ESCAPE N'~'
                                    OR    PackageRegistrations.Id LIKE '%ui%' ESCAPE N'~'

                                    OR    Packages.Title LIKE '%jquery%' ESCAPE N'~'
                                    OR    Packages.Title LIKE '%ui%' ESCAPE N'~'

                                    OR    Packages.Tags LIKE '%jquery%' ESCAPE N'~'
                                    OR    Packages.Tags LIKE '%ui%' ESCAPE N'~'
                                    )
                        ) Filtered
            ) Paged
INNER JOIN    PackageRegistrations ON PackageRegistrations.[Key] = Paged.PackageRegistrationKey
INNER JOIN    Packages ON Packages.PackageRegistrationKey = Paged.PackageRegistrationKey AND Packages.Version = Paged.Version
WHERE        Paged.[row_number] > 30
ORDER BY    PackageRegistrations.DownloadCount DESC
        ,    Paged.Id

One thing that immediately popped up to me was the use of queries such as "’%jquery%’. This is a flat out killer for performance in relational databases, since they cannot do any indexes on this and are forced to do a table scan.

I decided to take a stub at moving the NuGet data to RavenDB, which is  a much better fit (in my own obviously utterly unbiased opinion). In the next post, we will start with the actual import process, then we get to actual queries.

FUTURE POSTS

No future posts left, oh my!

RECENT SERIES

  1. Production Postmortem (52):
    07 Apr 2025 - The race condition in the interlock
  2. RavenDB (13):
    02 Apr 2025 - .NET Aspire integration
  3. RavenDB 7.1 (6):
    18 Mar 2025 - One IO Ring to rule them all
  4. RavenDB 7.0 Released (4):
    07 Mar 2025 - Moving to NLog
  5. Challenge (77):
    03 Feb 2025 - Giving file system developer ulcer
View all series

RECENT COMMENTS

Syndication

Main feed Feed Stats
Comments feed   Comments Feed Stats
}