I’ve mentioned a few times recently that something is a “microbenchmark.” What does that mean? Is it good or bad?
Let’s talk about that. Along the way, we’ll talk about benchmarks that are not microbenchmarks and how to pick a scale/size for a specific benchmark.
I talk about this because I write benchmarks for Ruby. But you may prefer to read it because you use benchmarks for Ruby - if you read the results or run them. Knowing what can go wrong in benchmarks is like learning to spot bad statistics: it’s not easy, but some practice and a few useful principles can help you out a lot.
Microbenchmarks: Definition and Benefits
The easiest size of benchmark to talk about is a very small benchmark, or microbenchmark.
The Ruby language has a bunch of microbenchmarks that ship right in the language - a benchmarks directory that’s a lot like a test directory, but for speed. The code being timed is generally tiny, simple and specific.
Each one is a perfect example of a microbenchmark: it tests very little code, sometimes just a single Ruby operation. If you want to see how fast a particular tiny Ruby operation is (e.g. passing a block, a .each loop, an Integer plus or a map) a microbenchmark can measure that very exactly while measuring almost nothing else.
A well-tuned microbenchmark can often detect very tiny changes, especially when running many iterations per step (see “Writing Good Microbenchmarks” below.) If you see a result like “this optimization speeds up Ruby loops by half of one percent", you’re pretty certainly looking at the result of a microbenchmark.
Another advantage of running just one small piece of code is that it’s usually easy and fast. You don’t do much setup, and it doesn’t usually take long to run.
A good microbenchmark measures one small, specific thing. This strength is also a weakness. If you want to know how fast Ruby is overall, a microbenchmark won’t tell you much. If you get lots of them together (example: Ruby’s benchmarks directory) then it still won’t tell you much. That’s because they’re each written to test one feature, but not set up according to which features are used the most, or in what combination. It’s like reading the dictionary - you may have all the words, but a normal block of text is going to have some words a lot (“a,” “the,” “monkey”) and some words almost never (“proprioceptive,” “batrachian,” “fustian.”)
In the same way, running your microbenchmarks directory is going to overrepresent uncommon operations (e.g. passing a block by typecasting something to a proc and passing via ampersand; dynamically adding a module as a refinement) and is going to underrepresent common operations (method calls, loops.) That’s because if you run about the same number of each, that’s not going to look much like real Ruby code — real Ruby code uses common operations a lot, and uncommon operations very little.
A microbenchmark isn’t normally a good way to test subtle, pervasive changes since it’s measuring only a short time at once. For instance, you don’t normally test garbage collector or caching changes with a microbenchmark. To do so you’d have to collect a lot of different runs and check their behavior overall… which quickly turns into a much larger, longer-term benchmark, more like the larger benchmarks I describe later in this article. It would have completely different tradeoffs and would need to be written differently.
Microbenchmarks are excellent to check a specific optimization, since they only run that optimization. They’re terrible to get an overall feel for a speedup, because they don’t run “typical” code. They also usually just run the one operation, often over and over. This is also not what normal Ruby code tends to do, and it affects the results.
Lastly, a microbenchmark can often look deceptively simple. A tiny confounding factor can spoil your entire benchmark without you noticing. Say you were testing the speed of a “rescue nil” clause and your newer Ruby version didn’t just rescue faster — it also incorrectly failed to throw the exception you wanted. It would be easy for you to say “look how fast this benchmark is!” and never realize your mistake.
Writing Good Microbenchmarks
If you’re writing or evaluating a microbenchmark, keep this in mind: your test harness needs to be very simple and very fast. If your test takes 15 milliseconds for one whole run-through, 3 milliseconds of overhead is suddenly a lot. Variable overhead, say between 1 and 3 milliseconds, is even worse - you can’t usually subtract it out and you don’t want to separately measure it.
What you want in a test harness looks like benchmark_ips or benchmark_driver. You want it to be simple and low-overhead. Often it’s a good idea to run the operation many times - perhaps 100 or 1000 times per run. That means you’ll get a very accurate average with very low overhead — but you won’t see how much variation happens between runs. So it’s a good method if you’re testing something that basically always takes about equally long.
Since microbenchmarks are very speed-dependent, try to avoid VMs or tools like Docker which can add variation to your results. If you can run your microbenchmark outside a framework (e.g. Rails) then you usually should. In general, simplify by removing everything you can.
You may also want to run warmup iterations - these are extra, optional benchmark runs before you start timing the result. If you want to know the steady-state performance of a benchmark, give it lots of warmup iterations so you’ll find out how fast it is after it’s been running awhile. Or if it’s an operation that is usually done only a few times, or occasionally, don’t give it warmup at all and see how it does from a cold start.
Warmup iterations can also avoid one-time performance costs, such as class loading in Java or reading a rarely-used file from disk. The very first time you do that it will be slow, but then it will be fast every other time - even a single warmup iteration can often make those costs nearly zero. That’s either very good if you don’t want to measure them, or very bad if you do.
Since microbenchmarks are usually meant to measure a specific operation, you’ll often want to turn off operations that may confound it - for instance, you may want to garbage collect just before the test, or even turn off GC completely if your language supports it.
Keep in mind that even (or especially!) a good microbenchmark will give chaotic results as situations change. For instance, a microbenchmark won’t normally get slightly faster every Ruby version. Instead, it will leap forward by a huge amount when a new Ruby version optimizes its specific operation… And then do nothing, or even get slower, in between. The long-term story may say “Ruby keeps getting faster!”, but if you tell that story entirely by how fast passing a symbol as a block is, you’ll find that it’s an uneven story of fits and starts — even though, in the long term, Ruby does just keep getting faster.
Okay, if those are microbenchmarks, what’s the opposite? I haven’t found a good name for these, so let’s call them macrobenchmarks.
Rails Ruby Bench is a good example of a macrobenchmark. It uses a large, real application (called Discourse) and configures it with a lot of threads and processes, like a real company would host it. RRB loads it with test data and generates real-looking URLs from multiple users to simulate real-world application performance.
In many ways, this is the mirror image opposite of a microbenchmark. For instance:
It’s very hard to see how one specific optimization affects the whole benchmark
A small, specific optimization will usually be too small to detect
Configuring the dependencies is usually hard; it’s not easy to run
There’s a lot of variation from run to run; it’s hard to get a really exact figure
It takes a long time for each run
It gives a very good overview of current Ruby performance
It’s a great way to see how Ruby “tuning” works
It’s usually easy to see a big mistake, since a sudden 30%+ shift in performance is nearly always a testing error
“Telling a story” is easier, because the overview at every point is more accurate; less chaotic results
In other words, it’s good where microbenchmarks are bad, and bad where they’re good. You’ll find that a language implementor (e.g. the Ruby Core Team) wants more microbenchmarks so they can see the effects of their work. On the other hand, a random Ruby developer probably only cares about the big picture (“how fast is this Ruby version? Does the Global Method Cache make much speed difference?”) Large and small benchmarks are for different audiences.
If a good microbenchmark is judged by its exactness and low variation, a good macrobenchmark is judged by being representative of some workload. “Yes, this is a typical Rails app” would be high praise for a macrobenchmark.
Good Practices for Macrobenchmarks
A high-quality macrobenchmark is different than a high-quality microbenchmark.
While a microbenchmark cannot, and usually should not, measure large systemic effects like garbage collection, a good macrobenchmark nearly always wants to — and usually needs to. You can’t just turn off garbage collection and run a large, long-term benchmark without running out of memory.
In a good microbenchmark you turn off everything nonessential. In a good macrobenchmark you turn off everything that is not representative. If garbage collection matters to your target audience, you should leave it on. If your audience cares about startup behavior, be careful about too many warmup iterations — they can erase the initial startup iterations’ effects.
This requires knowing (or asking, or assuming, or testing) a lot about what your audience wants - you’ll need to figure out what’s in and what’s out. In a microbenchmark, one assumes that your benchmark will test one tiny thing and developers can watch or ignore it, depending. In a macrobenchmark, you’ll have a lot of different things going on. Your responsibility is to communicate to your audience what you’re checking. Then, be sure to check what you said you would.
For instance, Rails Ruby Bench attempts to be “a mid-sized typical Rails application as deployed by a small but successful startup.” That helps a lot to define the audience and operations. Should RRB test warmup iterations? Only a little - mostly it’s about steady-state performance after warmup is finished. Early performance is mostly important to represent how quickly you can edit/debug the application. Should RRB test garbage collection? Yes, absolutely, that’s an important performance consideration to the target audience. Should it test Redis performance? Only as far as necessary for actions. The target audience doesn’t directly care about Redis except as it concerns overall performance.
A good macrobenchmark is defined by the way you choose, implement and communicate the simulated workload.
Conclusions: Choosing Your Benchmark Scale
Whether you’re writing a benchmark or looking for one, a big question is “how big should the benchmark be?” A very large benchmark will be less exact and harder to run for yourself. A tiny benchmark may not tell you what you care about. How big a benchmark should you look for? How big a benchmark should you write?
The glib answer is “exactly big enough and no bigger.” Not very useful, is it?
Here’s a better answer: who’s your target audience? It’s okay if the answer is “me” or “me and my team” or “me and my company.”
A very specific audience usually wants a very specific benchmark. What’s the best benchmark for “you and your team?” Your team’s app, usually, run in a new Ruby version or with specific settings. If what you really care about is “how fast will our app be?” then figuring out some generalized “how fast is Ruby with these settings?” benchmark is probably all work and no benefit. Just test your app, if you can.
If your answer is, “to convince the Internet!” or “to show those Ruby haters!” or even “to show those mindless Ruby fans!” then you’re probably on the pathway to a microbenchmark. Keep it small and you can easily “prove” that a particular operation is very fast (or very slow.) Similarly, if you’re a vendor selling something, microbenchmarks are a great, low-effort way to show that your doodad is 10,000% faster than normal Ruby. Pick the one thing you do really fast and only measure that. Note that just because you picked a specific audience doesn’t mean they want to hear what you have to say. So, y’know, have fun with that.
That’s not to say that microbenchmarks are bad — not at all! But they’re very specific, so make sure there’s a good specific reason for it. Microbenchmarks are at their best when they’re testing a specific small function or language feature. That’s why language implementors use so many of them.
A bigger benchmark like RRB is more painful. It’ll be harder to set up. It’ll take longer to run. You’ll have to control for a lot of factors. I only run a behemoth like that regularly because AppFolio pays the server bills (thank you, AppFolio!) But the benefit is that you can answer a larger, poorly-defined question like, “about how fast is a big Rails application?” There’s also less competition ;-)