Benchmarking Hongli Lai's New Patch for Ruby Memory Savings

Recently, Hongli Lai of Phusion Passenger fame, has been looking at how to reduce CRuby’s memory usage without going straight to jemalloc. I think that’s an admirable goal - especially since you can often combine different fixes with good results.

When people have an interesting Ruby speedup they’d like to try out, I often offer to benchmark it for them - I’m trying to improve our collective answer to the question, “does this make Rails faster?”

So: let’s examine Hongli’s fix, benchmark it, and see what we think of it!

The Fix

Hongli has suggested a specific fix - he mentioned it to me, and I tested it out. The basic idea is to occasionally use malloc_trim to free additional blocks of memory that would otherwise not be returned to the OS.

Specifically: in gc_start(), near the end, just after the gc_marks() call, he suggests that you can call:

if(do_full_mark) { malloc_trim(0) }

This will take extra CPU cycles to trim away memory we know we’ll have to get rid of - but only when doing a “full mark”, part of Ruby’s mark/sweep garbage collection. The idea is to spend extra CPU cycles to reduce memory usage. He also suggested that you can skip the “only on a full-mark pass” part of it, and just call malloc_trim(0) every time. That might divide the work over more iterations for more even performance, but might cost overall performance.

Let’s call those variation 1 (only trim on full-mark), variation 2 (trim on every GC, full-mark or not) and baseline (released Ruby 2.6.0.)

(Want to know more about Ruby’s GC and what the pieces are? I gave a talk at RubyKaigi in 2018 on that.)

Based on the change to Ruby’s behavior, I’ll refer to this as the “trim-on-full-mark” patch. I’m open to other names. It is, in any case, a very small patch in lines of code. Let’s see how the effect looks, though!

The Trial

Starting from released Ruby 2.6.0, I tested “plain vanilla” Ruby 2.6.0 and the two variations using Rails Ruby Bench. For those of you just joining us, that means running a Rails app server (including database and Redis) on a dedicated m4.2xlarge EC2 instance, with everything running entirely on-instance (no network) for stability reasons. For each “batch,” RRB generates (in this case) 30,000 pseudorandom HTTP requests against a copy of Discourse running on a large Puma setup (10 processes, 60 threads) and sees how fast it can process them all. Other than having only small, fast database requests, it’s a pretty decent answer to the question, “how fast can Rails process HTTP requests on a big EC2 instance?”

As you may recall from my jemalloc speed testing, running 10 large Rails servers, even on a big EC2 instance, simply consumes all the memory. You won’t see a bunch of free memory sitting around because one server or another would take it. Instead, using less memory will manifest as faster request times and higher throughputs. That’s because more memory can be used for caching, or for less-frequent garbage collection. It won’t be returned to the OS.

This trial used 15 batches of 30,000 requests for each variant (V1, V2, baseline.) That won’t catch tiny, subtle differences (say 0.25%), but it’s pretty good for rough “does this work?” checks. It’s also very representative of a fairly heavy Rails workload.

I calculated median, standard deviation and so on then realized: look, it’s 15 batches. These are all approximations for eyeballing your data points and saying, “is there a meaningful difference?” So, look below for the graph. Looking at it, there does appear to be a meaningful difference between released Ruby 2.6 (orange) and the two variations. I do not see a meaningful difference between Variation 1 and Variation 2. Maybe one has slightly more predictable response time than the other? Maybe no? If there’s a significant performance difference here between V1 and V2, it would want more samples to be able to see it.

The Y axis is requests/second over the 30k requests. The X axis is Rand(0.0, 1.0). The Y axis does not start at zero, so this is not a huge difference.

The Y axis is requests/second over the 30k requests. The X axis is Rand(0.0, 1.0). The Y axis does not start at zero, so this is not a huge difference.

Hongli points out that this article gives some excellent best practices for benchmarking. RRB isn’t perfect according to its recommendations — for instance, I don’t run the CPU scheduler on a dedicated core or manually set process affinity with cores. But I think it rates pretty decently, and I think this benchmark is giving uniform enough results here, in simple enough circumstances, to trust the result.

Based on both eyeballing the graph above and using a calculator on my values, I’d call that about 1% speed difference. It appears to be about three standard deviations of difference between baseline (released 2.6) and either variation. So it appears to be a small but statistically significant result.

That’s good, if it holds for other workloads - 1 line of changed code for a 1% speedup is hard to complain about.

The Fix, More Detail

So… What does this really do? Is it really simple and free?

Normally, Ruby can only return memory to the OS if the blocks are at the end of its address space. It checks it occasionally, and returns blocks it if it can. That’s a very CPU-cheap way to handle it, which makes it a good default in many cases. But it winds up retaining more memory because freed blocks in the middle can only be reused by your Ruby process, not returned for a different process to use. So mostly, long-running Ruby processes expand up to a size with some built-in waste (“fragmentation”) and then stay that big.

With Hongli’s change, Ruby scans all of memory on certain garbage collections (Variant 1) or all garbage collections (Variant 2) and frees blocks of memory that aren’t at the end of its memory space.

The function being called, malloc_trim, is part of GLibC’s memory allocator. So this won’t directly stack with jemalloc, which doesn’t export the exact same interface, and handles freeing differently. My previous results with jemalloc suggest that this isn’t enough, by itself, to bring GLibC up to jemalloc’s level. Jemalloc already frees more memory to the OS than GLibC, and can be tuned with the lg_dirty_mult option to release even more aggressively. I haven’t timed different tunings of jemalloc, though.

A Possible Limitation

This seems like a good patch to me, but just to mention a problem it could have: the malloc_trim API is GLibC-specific. This would need to be #ifdef’d out when Ruby is compiled with jemalloc. The core team may not be thrilled to add extra allocator-specific behavior, even if it’s beneficial.

I don’t see this as a big deal, but I’m not the one who gets to decide.

Conclusions

I think Hongli’s patch shows a lot of promise. I’m curious how it compares on smaller benchmarks. But especially for Variation 1 (only on full-mark GC), I don’t think it’ll be very different — most small benchmarks do very few full-mark garbage collections. Most do very few garbage collections, period.

So I think this is a free 1% speed boost for large, memory-constrained Rails applications, and that it doesn’t hurt anybody else. I’ll look forward to the results on smaller benchmarks and more CPU-bound Ruby code.