OutOfLine – A Memory-Locality Pattern for High Performance C++

In my time at Headlands Technologies, I’ve gotten the opportunity to build some utilities that have improved the ergonomics of maintaining high-performance C++ codebases. This article will give a generic overview of one of those utilities, OutOfLine.

Let’s start with a motivating example. Suppose you have a system that opens a very large number of paths. Maybe they are files, maybe they are named UNIX sockets, maybe pipes. But for whatever reason, you open a lot of file descriptors at startup, then you do a lot of processing on those descriptors, and finally when you’re done you close the descriptors and unlink the paths.

An (abbreviated) initial design might look like this:

And that’s a nice, logically reasonable design. It RAIIs the close and unlink for you. You can allocate a big array of these things, operate on them, and they clean up after themselves when the array’s lifetime ends.

But what about performance? Suppose you use the fd very often, and you use path only when cleaning up the object. Now we have an array of 40B objects, and our critical path only ever uses 4B of that. Which means you’ll see more cache line misses as you keep having to “skip over” the 90% overhead.

One very common solution to this is to switch from array-of-structs to struct-of-arrays. And that would net us our performance win here, but it would cost us the RAII. Is there a way to have the best of both worlds?

One initial compromise might be to not store a std::string which is 32B, and instead store a std::unique_ptr – which is only 8B. That takes your object size down from 40B to 16B, which is a big win. But it’s not as good as parallel arrays.

OutOfLine is a tool that you can use to keep RAII, and move your cold members completely outside your object with zero space overhead. You use OutOfLine by inheriting from it, like this. It is a CRTP base class, so the first template argument should be the child class that is inheriting. The second argument is the cold data that should be associated with each “fast” object.

And what does that class look like itself?

The implementation is based on the idea of a global map hiding somewhere that maps pointers to fast objects to pointers of cold objects.

You can build this base from anything you can build your cold data from. And when you do, it’ll create that cold data and associate it with your fast object.

When your fast object gets cleaned up, the corresponding cold object will too:

When you move your fast object, the corresponding cold object is reassociated with the new fast object (remember that means you shouldn’t use the cold data on a moved-from object).

The current implementation just makes OutOfLine non-copyable for simplicity, but one could instead choose to implement copy construction by copying the cold data.

Now for this to be useful to us though, it has to actually be convenient to access that cold data. When you inherit from OutOfLine, your class gains const and non-const member functions cold():

Calling these gives you a reference (of appropriate constness) to your cold data.

And that’s it. This UnlinkingFD will be only 4B large, it provides access to the fd member at full speed, and it still preserves all the same RAII behavior. All the lifetime-related work is handled for you. When you move the fast object, the cold object is reassociated to the new fast object. When your fast object goes away, the cold object does too.

Sometimes though, your data conspires to make your life difficult, and the fast data must be constructed first because it is a constructor argument to the cold data. That makes the construction order the reverse of the order OutOfLine imposes on you. Also, sometimes you need the fast data to outlive the cold data (maybe the cold data holds a reference to the fast data). For these cases, we need an “escape hatch” to control the order in which data is initialized and deinitialized.

There is another constructor of OutOfLine that your class could call, one that accepts the tag type TwoPhaseInit. If you build your OutOfLine in that way, your cold data will not be initialized, and you’ll be left in a half-constructed state. You then finish your two-phase construction by calling init_cold_data (with any arguments from which you can construct a ColdData) and you’ll be done. Just remember not to call .cold() on an object that has not yet had its cold data initialized. And the parallel holds too – you can release your cold data early if your data requires it by calling release_cold_data.

And that’s all of it. So ultimately, what did our 29 SLOC buy us? It bought us one more option in the space of tradeoffs. Any time you have an object where some members are drastically more important than other members, you might consider OutOfLine. It lets you make some members a little bit faster at the expense of making accesses to other members a lot slower, so you would reach for this in situations where that sounds like a good tradeoff to you.

We’ve been able to apply this technique in several places – it’s fairly common to want to tag fast data with extra metadata that is logged out on shutdown, or in rare or unexpected situations. Whether that’s recording which user this connection belongs to, which internal trade desk this order is attributed to, or the handle to a hardware-accelerated market-data session – this will keep your cache lines clean while you’re in your critical paths.

I’ve included a benchmark that you can use to see and explore the differences.

Scenario Time (ns)
With cold data in-line (original) 34684547
With cold data thrown away (best-case scenario) 2938327
With OutOfLine 2947645

I measured an ~10x speedup by using OutOfLine. Obviously this benchmark is contrived to provide the best-case-scenario of OutOfLine, but it serves to demonstrate that cache optimization can have very real performance impact, and that OutOfLine really does deliver on that front. And keeping your data cache clear of cold data can also have a difficult-to-measure holistic benefit on the rest of your code. As always, you need to measure each application to optimize it, but this might be a useful tool to have in your belt.