Benchmarking Rails

I've been working on a Rails benchmark for Ruby, like the one Matt Gaudet's RubyKaigi talk discusses. The short version: we'd like to have a number of Ruby benchmarks for standard workloads like "lots of CPU, not much disk I/O" and "modifying lots of memory, lots of GC time"... And of course "typical Rails application." It helps show us whether a given set of Ruby optimizations is good or bad for the performance of a "typical" Rails app... whatever that means.

There are a lot of interesting questions. What do we benchmark? How do we benchmark it? A benchmark is defined by how it answers those questions. I can talk about which decisions I prefer and why.

(And then smarter people than me will argue for a slightly better set. That's cool. That argument gets way easier if I state my assumptions to start the debate.)

Decisions

For the Rails code to benchmark, I chose Discourse. Discourse is a real Rails-based product with real-world applications. The code is complex, but not so complex you can't understand it. It uses typical Rails calls in typical proportions, insofar as there is such a thing as "typical" for a mid-sized Rails application.

Discourse has a lot of dependencies, which is why it's usually installed with Docker. I use a setup script for local development, and (soon) an AWS EC2 AMI for "real" benchmark numbers. I could have used Docker, but Docker's extra layers make it a poor fit for gathering performance numbers - benchmarking isn't really what Docker is for. However, a lot of real shops use Amazon EC2 for Rails hosting, so those numbers are useful. Even hosting that isn't done directly on EC2 is sometimes done indirectly on it, via an EC2-based cloud service.

We want a multithreaded benchmark - Matz made it clear that a lot of Ruby 3x3's increase in speed will be about concurrent performance. We can use Guilds once Rails supports them. But for now, we have to use threads. A multithreaded or Guilded benchmark won't be perfectly reproducible. But we can get pretty good reproducibility with random seeds that determine which calls are made in which order.

The idea is that there will be a "standard" random seed that we start testing with -- this guarantees that each thread has a known list of calls in a known order. But we can easily change the random seed to make sure we're not "overtraining" by optimizing too much for one specific call order, even if there's one we normally use for consistent numbers.

The random seed algorithm is very simple. That means that small changes to the benchmark can make big changes to exactly which actions are taken. But it also means the benchmark doesn't have to be hugely complex. The benchmark should change very little over time, once it has been initially debugged. And with a large number of iterations, the total benchmark time shouldn't change much with changes to the random seed.
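
To make that concrete, here's a minimal sketch of the seeding scheme. The names (perform_action, the specific seed value) are hypothetical, not taken from the real benchmark: one master seed deterministically derives a per-thread RNG, so the same seed always yields the same call order on every thread.

```ruby
MASTER_SEED = 12345   # the "standard" seed; change it to test other call orders

USERS = 5
REQUESTS_PER_USER = 10_000

threads = USERS.times.map do |user_num|
  Thread.new do
    # Mixing the user number into the seed keeps threads distinct
    # from each other, but still fully reproducible.
    rng = Random.new(MASTER_SEED + user_num)
    REQUESTS_PER_USER.times do
      roll = rng.rand(100)
      # perform_action(roll)  # hypothetical: map the roll to a Discourse request
    end
  end
end
threads.each(&:join)
```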

I'm assuming that we'll be testing with 8 or fewer virtual CPUs, because that's quite common for Rails applications early on. Later, when Rails applications need a lot more parallelism, they tend to have wildly divergent needs -- some are all about I/O, some are extremely CPU-bound. I don't think any one benchmark will do very well for very large Rails applications, but I think a smaller benchmark can be quite representative of small and mid-sized Rails applications. Also, companies that need extremely concurrent Rails workloads are probably doing more of their own benchmarking with their own application. It's reasonable for them to compile a custom version of Ruby with application-specific settings, for instance.

(Nate Berkopec points out that Amazon's idea of 8 vCPUs is actually 8 hyperthreads on four of what a normal human might call a "virtual CPU" or "virtual core." So I'm going to change the load tester to a single multithreaded process, which helps a bit, and see if there is any way I can otherwise improve the ratio of CPUs to processes. I still think CPU contention isn't going to be as big a problem as AWS network overhead when measuring.)

Discourse is already Thin-compatible, so I'm starting with that. Later, I should switch to Puma. Puma is threaded by default (Unicorn is multiprocess, Passenger is multiprocess except for the commercial version, Thin is based on EventMachine.) So Puma seems like the best choice to demonstrate the increased concurrency of Ruby 3x3.
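
For reference, a threaded Puma setup is only a few lines of configuration. This is a minimal sketch of a config/puma.rb, not Discourse's actual settings; the thread count is an assumption:

```ruby
# config/puma.rb - single multithreaded process (thread counts are assumptions)
threads 8, 8          # min and max threads for the process
workers 0             # no forked workers: pure thread concurrency
port ENV.fetch("PORT", 3000)
environment ENV.fetch("RAILS_ENV", "production")
```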

For the benchmark, I'm configuring it as a single EC2 instance with Postgres, Sidekiq, the (one) app server and the benchmark script all configured on the same instance. That is not realistic, but it avoids swamping the benchmark results with AWS-specific problems like noisy neighbors, getting all the instances into the same VPC and running the benchmark from an IP address that's too far from the nearest AWS datacenter. You can, of course, use the same pieces in a different configuration to get your own benchmark numbers. But those aren't the ones I'm generating.

I added a little bit of warmup time to the default benchmark settings. That setting can be changed, naturally. The amount of warmup is a big deal when timing MRI versus JRuby or Rubinius. The more warmup time you allow, the more you give an advantage to JIT approaches where the interpreter starts really slow and speeds up over time.
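
The mechanism is simple: run some untimed iterations before the timed ones. A sketch, where run_one_request is a hypothetical stand-in for issuing a single benchmark request:

```ruby
WARMUP_ITERATIONS = 100     # 0 favors MRI; large values favor JITted implementations
TIMED_ITERATIONS  = 10_000

WARMUP_ITERATIONS.times { run_one_request }   # untimed: let JITs reach steady state

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
TIMED_ITERATIONS.times { run_one_request }
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
puts "Timed portion: #{elapsed.round(2)}s"
```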

In addition to the time it takes to complete all the requests, another important factor is startup time: how long does it take, from when you run "rails server", to serve the very first request? When you're recovering from a power outage, or restarting a server for a deploy or crash recovery, you care a lot about how long it takes Rails to start up. And when you're debugging, you'll pay that startup cost over and over, so it matters even more.
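
Measuring that is straightforward: launch the server, then poll until the first request succeeds. A rough sketch, where the port and the application path are assumptions:

```ruby
require "net/http"

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
pid = spawn("rails server -p 3000", chdir: "/path/to/discourse")

loop do
  begin
    res = Net::HTTP.get_response(URI("http://localhost:3000/"))
    break if res.is_a?(Net::HTTPSuccess)
  rescue Errno::ECONNREFUSED, Errno::ECONNRESET
    # server isn't accepting connections yet; keep polling
  end
  sleep 0.1
end

startup_time = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
puts "Time to first request: #{startup_time.round(2)}s"
```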

What Does It Do?

The benchmark basically simulates a set of users loading the app - posting topics, viewing topics and occasionally editing posts. When I talk about a random seed, the seed is for which of these activities occur and in what order.
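
Here's one way such a seeded, weighted choice could look. The actions and weights are illustrative, not the benchmark's actual proportions:

```ruby
# Weights sum to 100; most simulated requests just read.
ACTIONS = {
  view_topic: 70,
  post_topic: 20,
  edit_post:  10,
}.freeze

def pick_action(rng)
  roll = rng.rand(ACTIONS.values.sum)
  ACTIONS.each do |action, weight|
    return action if roll < weight
    roll -= weight
  end
end

rng = Random.new(42)   # the "standard" seed would go here
puts pick_action(rng)  # => e.g. :view_topic
```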

There are a fixed number of users (5) and a fixed number of requests per user (10,000), and the benchmark is a weighted sum of:

  1. how long it takes each user to get through all of its requests
  2. how long it takes the fastest user to finish
  3. how long the Rails application took to start up
  4. maximum memory usage of the Rails process

There's nothing sacred about that exact list or how it's weighted. But I think those items reward the right things.
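
For illustration, a scoring function along those lines might look like this. The weights and example values are placeholders, not the benchmark's real numbers:

```ruby
def benchmark_score(per_user_times, startup_time, max_rss_mb)
  total_time   = per_user_times.sum   # item 1: every user's full run
  fastest_time = per_user_times.min   # item 2: the quickest user
  # Lower is better for every term, so a plain weighted sum works.
  1.0  * total_time +
  0.5  * fastest_time +
  0.25 * startup_time +
  0.01 * max_rss_mb
end

puts benchmark_score([310.2, 305.8, 312.4, 308.0, 306.5], 14.3, 850)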

If five users doesn't sound like a lot, remember that we're processing the requests as fast as possible. While I'm using the word "user" for these threads, they're more of a measure of total concurrency than a realistic number of requests for a few users. Given the other servers running on the instance (Redis, Postgres, app servers) this means that 8 vCPU cores should be reasonably loaded...

A larger number of users and Rails threads could saturate more vCPUs. But I don't think I can design a representative benchmark for diverse Rails workloads at that scale. At that many users, the engineering teams should be optimizing significantly for the specific workload of that specific application, rather than running a "typical" Rails application. Above a certain number of simultaneous users, there's no such thing as "typical."

Why Not the Current Discourse Benchmark?

Discourse includes a simple benchmark. It checks only a few specific URLs, one at a time.

There's nothing wrong with it, but it's not what we want to optimize.

Is This What the Whole Ruby Team Uses?

No. This is an early draft. I'm hoping for feedback on my decisions and my code. With luck, this benchmark (or one like it) will become the semi-official Rails benchmark of the Ruby team, just as optcarrot is their most common CPU benchmark.

Right now, there's no Rails benchmark that fits that description.

As Matt Gaudet's talk suggests, it would be great to have different benchmarks for different "typical" workloads. This is just an attempt at one of them.

Source Code?

It's not finished, but yes.

The link above is for a just-about-adequate implementation of what I describe here. My intent is to start from this and improve the code and assumptions as I go. You can comment or submit a Pull Request if you'd like to join the debate. I'd love to hear what you think.