OptCarrot: An Excellent CPU Benchmark for Ruby 3x3

You may have read here about my benchmarking attempts for Ruby 3x3. In addition, there are various small benchmarks in the Ruby source and several other aggregations of benchmarks.

But the other major benchmark currently used for Ruby 3x3 is called OptCarrot. It's written by Yusuke Endoh (aka mametter, aka mame.) It's a very well-designed benchmark. Let's talk a bit about why, shall we?

Anatomy of a Carrot

OptCarrot is a headless hardware emulator for the Nintendo Entertainment System. Everybody should have some fun with profiling, right?

And original NES. most of you are probably too young to remember these.

And original NES. most of you are probably too young to remember these.

It's not actually great for playing games. There are already lots of NES emulators out there. But the idea is cool, and it's obvious to most people what success looks like, which is nice. Here's a NES architecture introduction from the author.

Ruby 2.0 currently runs OptCarrot at 22 frames per second in the benchmarking configuration. So we'd love for Ruby 3.0 to run it at 66 frames per second. That's not as far out of reach as it sounds - Ruby 2.4 is already running at about 30 fps. There are other Ruby implementations running at up to around 200 frames per second, though they need some warmup time for that... So it's not impossible. But even 66 fps probably requires JIT. Several Rubyists are working on it.

You can find more instructions about running the benchmark in OptCarrot's docs.

Benchmark Mode

Since OptCarrot is being used as a benchmark, it normally runs headless. That means it doesn't display the screen, even though it calculates it. That's because showing images changes the timing a lot - waiting for a video card's VRAM or a monitor's vertical sync is important for a good emulator. But it messes up the timing for a benchmark. So optcarrot doesn't use a hardware display. Instead it just calculates everything at full speed, regardless of how fast the frames might be displayed to a real-world player.

OptCarrot is intentionally a CPU benchmark. It doesn't do much I/O. It generates very little garbage in memory, so fast GC doesn't help it. The main loop is simple with no fancy metaprogramming, though it does use send and Method#[].

When OptCarrot is given "--benchmark" it just runs for 180 frames, exits, and prints a checksum of what it calculated. The checksum is important. If you try to patch Ruby to be faster and it breaks OptCarrot, the checksum will change because what's displayed on the screen will change. That's a very intuitive way to verify correctness.

OptCarrot has an optimized mode that you can turn on with --opt. When you use it, it uses a giant case/when statement for all the instructions, a bit like the one inside Ruby's own VM code. That's dirty and ugly, but it's also very fast. Whether you benchmark with or without --opt probably depends on what you're trying to measure... Also, some non-MRI Ruby implementations have a slower case/when statement. Those implementations run much slower with --opt.

With --opt turned on, Ruby 2.0.0 already runs at close to 60fps on a Core i7 4500 with Ubuntu 16.10, and with Ruby 2.4.0 it's close to 80 fps.

Ooh! A Game Console? What Does It Run?

OptCarrot runs a game called "Lan Master" where the player places/rotates connections in a network to try to connect all hosts. You can see an animation of it below.

LanMaster.gif

Ooh! A Benchmark. What Are Its Results?

OptCarrot keeps a nice graph of its results on GitHub. You can see a semi-recent version below.

One thing to keep in mind: it's hard to do this "fairly" when comparing JIT versions, especially with regard to warmup. A JITted implementation like JRuby, TruffleRuby or OMR does best when given a lot of runtime, especially if you throw away times for the first few "warmup" iterations. MRI does best when you time from the very beginning and don't give it too many total iterations -- MRI has amazing startup time, but isn't as fast as JRuby or TruffleRuby for sustained steady-state server performance.

Eventually MRI is likely to add JIT, which is going to give some really interesting results in terms of warmup and runtime...

benchmark-optimized.png

Tooling

One thing I personally love about OptCarrot is that the author clearly understands that even a reproducible, consistent, low-noise benchmark will have some noise in the data from system activity. There's no avoiding it.

And so to deal with that, he includes a simple statistical test to check small differences in the runtime and deal with noise in measurement. If you're measuring very small or very random optimizations, you can run it even more iterations and potentially find even very subtle speedups.

I built a similar statistical profiling tool. I think it's a very useful approach.

I miss PartiallyClips. It's been finished for years now.

I miss PartiallyClips. It's been finished for years now.

So OptCarrot is the Perfect Benchmark?

There is, of course, no such thing as a perfect benchmark. But OptCarrot is a very good benchmark for a specific use case.

In his 2016 RubyKaigi presentation, Matt Gaudet talks about how we should measure "three times faster." Matt works on IBM's Ruby OMR project, so he knows what he's talking about.

One type of benchmark he identifies as needed is a simple CPU benchmark. Other benchmarks that he contrasts it with are numerical/scientific benchmarks, web framework benchmarks and tree modification benchmarks to give the garbage collector a nice workout. I don't think he says so, but I'll say that concurrency benchmarks are also a very good idea. Concurrency is one of Matz's stated goals for Ruby 3x3.

OptCarrot is a very solid, reproducible CPU benchmark with good configuration, tools and testing. It generates very little memory garbage. It runs in a very repeatable way, and it checksums its progress to detect breakages.

In other words, it handles exactly one category from Matt's list, and it does it solidly and simply.

That means it will be a great part of the final suite of Ruby 3x3 benchmarks when the other ones exist. My own benchmark is meant to handle the web framework category, but does so with higher concurrency and less reproducibility -- it's much harder to get good reproducibility for a concurrent benchmark.

OptCarrot is already being used heavily by other Ruby implementations (e.g. TruffleRuby) and Vladimir Makarov's MJIT branch for benchmarking and optimization work. It's helping to determine the optimization of future Ruby. So it's definitely a success in my book.