The Benchmark and the Rails

Not long ago, I wrote about benchmarking Rails for the Ruby team and how I thought it should probably be done. I got a little feedback and adjusted for it. Now I've implemented the first draft of it. That's what you do, right?

You can read those same benchmarking principles in the README for the benchmark if you're so inclined, and how to build an AMI to test the same way. You can also benchmark locally on Mac OS or Linux -- those just aren't the canonical numbers for the benchmark. Something has to be, and AWS seems like the way to go, for reasons discussed in the previous blog post, the README, etc.

So let's talk a bit about how the code came out, and what you can do with it if you're so inclined.

Right now, you'll need to run the benchmark locally, or build your own AMI. You can run that AMI on a t2.2xlarge instance if you want basically-official benchmark results.

I'd love quibbles, pull requests, bug reports... Anything along those lines.

And if you think it's not a fair benchmark in some way, great! Please let me know what you think is wrong. Bonus points if you have a suggestion for how to fix it. For instance: on my Mac, using Puma cut nearly twenty-five percent of the total runtime relative to running with Thin. Whoah! So running with Thin would be a fine example of bad benchmarking in this specific case.

I don't yet have a public AMI so that you can just spin up an instance and run your own benchmarks... yet. It's coming. Expect another blog post when it does.

Threads, Threads, Threads

Nate Berkopec pointed out that a lot of the concurrency and threading particulars would matter. Good! Improving concurrency is a major Ruby 3x3 goal. Here are some early results along those lines...

The "official" AWS instance size, like my laptop, has four "real" cores, visible as eight Intel hyperthreaded cores. Which means, arguably, that having four unrelated processes going at all times would be the sweet spot where the (virtualized) processor is fully used but not overused.

I originally wrote the load-testing program as multiple processes, and later converted it to threads. The results turned out to be in line with the results here: a block of user actions that previously took 39 or 40 seconds to process suddenly took 35 to 37 seconds (apples-to-oranges warning: this also stopped counting a bit of process startup time.) So: definitely an improvement when not context-switching between processes as often. Threaded beats multiprocess for the load tester, presumably by reducing the number of processes and context switches.

Rails running in Puma means it's using threads not processes as well. One assumes that Ruby 3 guilds, when Rails supports them, will do even better by reducing Global Interpreter Lock (GIL) contention in the Rails server. When that happens, it'll probably be time to use guilds for the load-tester as well to allow it to simultaneously execute multiple threads, just as the Rails server will be.

So: this should be a great example of improving benchmark results as guilds improve multithreaded concurrency.

Interestingly, the load tester keeps getting faster by adding threads up to at least five worker threads, though four and five are very close (roughly 29.5 vs 30.8 seconds, where per-run variation is on the order of 0.5-0.8 seconds.) That implies that only four simultaneous threads in the load-tester doesn't quite saturate the Rails server on a four-core machine. Perhaps there's enough I/O wait while dealing with SQLite or the (localhost) network somehow, or while communicating with Redis? But I'm still guessing about the reasons - more investigation will be a good idea.

Benchmarking on AWS

One of the interesting issues is clearly going to be benchmarking on AWS. Here are a few early thoughts on that. More will come as I take more measurements, I promise :-)

AWS has a few well-known issues, and likely more that aren't as well-known.

One issue: noisy neighbors. This comes in several flavors, including difficulties in network speed/connection, CPU usage and I/O. Basically, an Amazon instance can share a specific piece of physical hardware with multiple other virtual machines, and usually does. If you wind up on a machine where the other virtual hosts are using a lot of resources, you'll find your own performance is lower. This isn't anything nefarious or about Amazon overselling - it's standard VM stuff, and it just needs to be dealt with.

Another issue: central services. Amazon's infrastructure, including virtual machine routing, load-balancing and DNS are all shared with a gigantic number of virtual machines. This, too, produces some performance "noise" as these unpredictable shared services behave slightly differently moment-to-moment.

My initial solution to both these problems is to make as little use of AWS networking as possible. That doesn't address the CPU and I/O flavors of noisy neighbors, but I'm getting nicely consistent results to begin with... And I'll need to filter results over time to account for differing CPU and I/O usage, though I'm not there yet.

Another thing that helps: large instances. The larger the size of the AWS instance being used, the fewer total VMs you'll have on the physical hardware you're sharing. This should make intuitive sense.

Another commonly-used solution: spin up a number of instances, do a speed test, and then keep only the N fastest instances for your benchmark. This obviously isn't a theoretical guarantee of everything working, since a quiet "neighbor" VM may become noisy at any time. But in general, there's a strong correlation between a random VM's resource usage at time T and at time T + 3. So: selecting your VM for quiet neighbors isn't a guarantee, but it can definitely help. Right now I'm doing the manual flavor of this where I check by hand, but it may become automated at some point in the future.

Onward

From here, I plan to keep improving the benchmark quality and convenience in various ways, and assess how to get relatively reliable benchmarks out of AWS.

What I'd love is your feedback: have I missed anything major you care about in a Rails benchmark? Does anything in my approach seem basically wrong-headed?

Big thanks to Matt GaudetChris Seaton and Nate Berkopec, who have each given significant feedback that have required changes on my part. Thanks for keeping me honest, gentlemen!