Here's a fairly common problem I've encountered. Say you have two classes; for example, a Restaurant and a Cuisine. They exist in a many-to-many relationship, and have associations defined like so:
This is basically the model we use at Urbanspoon.
Now, there are two interesting aspects to the data we have in this relationship. First, Urbanspoon know about approximately 40x more Restaurants than Cuisines. Second, Cuisine popularity follows a Zipf distribution. That is, the vast majority of Cuisines are associated with very few Restaurants. However, the converse is not true: we enforce that there be ~3 cuisines associated with every Restaurant in the system.
So, it turns out that the top 1000 Restaurants in Seattle comprise about 70 distinct cuisines. That fits well with our graph from above, and provides a good premise for this issue. Let's say we load these 1000 Restaurants and we'd like to display the Cuisines associated with all of them. What's the best way to do this?
This takes an average of ~12.25 seconds to complete with a hot query cache. True to intuition, naïve loading of associations is dog slow in this scenario. We're duplicating work left and right, both in database loading and in ActiveRecord instantiation. If one Cuisine is referenced by N Restaurants, we pull it from the database N times and create N objects with identical information.
Great! We achieved an average speed of almost exactly twice as fast (~5.63 seconds) by using ActiveRecord's built-in association loading. But there are two problems with this approach. Firstly, it's not very flexible or composable. That is, you can't build a handful of specialized Restaurant-finding functions that are agnostic about the decision to load Cuisines or not. You can't cache a list of Restaurants and load the Cuisines later. Basically, if you want to be efficient, you have to do the Cuisine loading at query time.
The second issue is still one of performance. Even though we only load 73 Cuisine rows from the database, ActiveRecord still allocates over 2000 unique Cuisine objects. See?
Oy. That means we're consuming roughly 27.4 times the memory and clock cycles required to instantiate and retain Cuisines. So, what's the really right way to do this?
The pattern I've used is pretty simple. You query the two associations separately, and then use an id-to-record map to squash them back together in code. I did this in a few places in the iLike codebase, and I found this example waiting for me at Urbanspoon. This final version combines and improves on my previous attempts, and I have a few comments to share about it later. But first, the code!
OK, that seems reasonable enough. What are the times like?
Wow, an average of 3.18s! That's significantly faster than the association loading version. And it uses far less memory, too. In fact, that's the reason it's so much faster – you can shave off 43.5% of the total time just by instantiating each database row only once. This is a total time reduction of 74% over the naïve method.
Additionally, this function can be called as an afterthought. You can compose various functions and have them still perform well. You can load Restaurants from any cache or other non-DB resource and still fetch their associated Cuisines efficiently. Fun times.
I mentioned that I had a few comments about this technique. Most of them pertain to changes in ActiveRecord from Rails 1.x to Rails 2.x.
See, the companies I've been working at have all been stuck on Rails 1.1.6 – deplorable, I know. So, previous versions had a line like this:
id_map[row[:restaurant_id]].cuisines << cuisine_map[row[:cuisine_id]] # note, *not* a call to .cuisines.target
I'm upgrading Urbanspoon, and I discovered (thanks to unique key
constraints in the database – don't ever let anyone tell you
they're worthless) that modern versions of Rails will treat this
as an insert statement. Yikes. Calling .cuisines.target
lets you
modify the array directly, rather than be intercepted by the
associations codebase.
My bungling first attempt at working around this was to use
instance_eval
to completely override the association during the
ids-to-restaurants map building phase:
r.instance_eval {
@cuisines = []
...
}
This is a bad idea, as it litters your codebase with instances of Restaurants whose associations may or may not behave correctly.
The fact that each Restaurant now points to a single instance of the same cuisine could lead to unexpected things. That is, if Restaurants A and B both have the 'Chinese' cuisine, you can modify the Cuisines of A and see the change reflected in B without saving and reloading. This different than the normal associations behavior, but I think it's acceptable, and perhaps even preferable.
Finally, I think this technique could be generalized easily and added to ActiveRecord. I haven't researched whether others have gone down this path before, so it's entirely possible that one can do this already with a hip plugin that I haven't seen.
I also don't know if the cloned vs. referenced issue of the previous bullet point would be a problem with inclusion in ActiveRecord. It is the heart of the performance gain, howerever, and it is personally how I'd like associations in general to behave.
[1] In kilobytes; based on this post.