Bulk-loading associations with ActiveRecord

September 13, 2010 at 10:55 AM

Here's a fairly common problem I've encountered. Say you have two classes; for example, a Restaurant and a Cuisine. They exist in a many-to-many relationship, and have associations defined like so:

This is basically the model we use at Urbanspoon.

Now, there are two interesting aspects to the data we have in this relationship. First, Urbanspoon know about approximately 40x more Restaurants than Cuisines. Second, Cuisine popularity follows a Zipf distribution. That is, the vast majority of Cuisines are associated with very few Restaurants. However, the converse is not true: we enforce that there be ~3 cuisines associated with every Restaurant in the system.

Restaurant counts by Cuisine

So, it turns out that the top 1000 Restaurants in Seattle comprise about 70 distinct cuisines. That fits well with our graph from above, and provides a good premise for this issue. Let's say we load these 1000 Restaurants and we'd like to display the Cuisines associated with all of them. What's the best way to do this?

Naïve Attempt

This takes an average of ~12.25 seconds to complete with a hot query cache. True to intuition, naïve loading of associations is dog slow in this scenario. We're duplicating work left and right, both in database loading and in ActiveRecord instantiation. If one Cuisine is referenced by N Restaurants, we pull it from the database N times and create N objects with identical information.

Improved Attempt

Great! We achieved an average speed of almost exactly twice as fast (~5.63 seconds) by using ActiveRecord's built-in association loading. But there are two problems with this approach. Firstly, it's not very flexible or composable. That is, you can't build a handful of specialized Restaurant-finding functions that are agnostic about the decision to load Cuisines or not. You can't cache a list of Restaurants and load the Cuisines later. Basically, if you want to be efficient, you have to do the Cuisine loading at query time.

The second issue is still one of performance. Even though we only load 73 Cuisine rows from the database, ActiveRecord still allocates over 2000 unique Cuisine objects. See?

Oy. That means we're consuming roughly 27.4 times the memory and clock cycles required to instantiate and retain Cuisines. So, what's the really right way to do this?

Bulk Loading, For Real

The pattern I've used is pretty simple. You query the two associations separately, and then use an id-to-record map to squash them back together in code. I did this in a few places in the iLike codebase, and I found this example waiting for me at Urbanspoon. This final version combines and improves on my previous attempts, and I have a few comments to share about it later. But first, the code!

OK, that seems reasonable enough. What are the times like?

Wow, an average of 3.18s! That's significantly faster than the association loading version. And it uses far less memory, too. In fact, that's the reason it's so much faster – you can shave off 43.5% of the total time just by instantiating each database row only once. This is a total time reduction of 74% over the naïve method.

Additionally, this function can be called as an afterthought. You can compose various functions and have them still perform well. You can load Restaurants from any cache or other non-DB resource and still fetch their associated Cuisines efficiently. Fun times.

Caveats

I mentioned that I had a few comments about this technique. Most of them pertain to changes in ActiveRecord from Rails 1.x to Rails 2.x.

[1] In kilobytes; based on this post.