In working on Rails Ruby Bench, I've explained a bit about how it generates a bunch of fake user actions from a random seed number. Basically, each random seed gives you a particular sequence of actions like "post a new comment" or "save a draft" or "view current posts" using the Discourse forum software.
By doing this with a bunch of fake users at once, it (approximately) simulates high load on a busy Rails app.
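This isn't the actual Rails Ruby Bench code, but here's a minimal sketch of the idea -- the `ACTIONS` list and `action_stream` method are made up for illustration. A seeded PRNG turns one number into a reproducible stream of "user" actions:

```ruby
# Hypothetical sketch: a seeded PRNG producing a reproducible action stream.
ACTIONS = ["post a new comment", "save a draft", "view current posts"]

def action_stream(seed, count)
  rng = Random.new(seed)  # same seed => same sequence of random numbers
  Array.new(count) { ACTIONS[rng.rand(ACTIONS.size)] }
end

# The same seed always replays the same "user session"...
a = action_stream(12345, 5)
b = action_stream(12345, 5)
# ...while a different seed gives a slightly different workload.
c = action_stream(54321, 5)
```

So re-running with the same seed repeats the benchmark exactly, while picking a new seed gives you a slightly different mix of work.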
With a different random seed, you get a slightly different benchmark. I keep posting about how Ruby has gotten faster over time based on my benchmark.
With a different random seed, would I get a different story about that?
Take the Simple Approach
Maybe the answer is as simple as measuring again. I tried out four different random seeds with Ruby 2.3.4 and 2.4.1. Here's what I got:
Okay... So, maybe that doesn't immediately clear it up. It's nice that the random seeds don't change the results much, but it's still not clear what we're looking at. How about a closeup of the same data?
Hm. Better... maybe?
I like throughputs - the number of requests processed per second over the course of the benchmark. Let's see if those give a clearer answer:
Really, really, really no.
Bringing Out the Big Guns
It turns out that Ruby 2.3.4 and 2.4.1 are mostly about the same speed. Part of why we're not seeing a lot of difference is that there isn't a lot of difference.
So let's look at more Ruby versions. For this, I needed to use multiple versions of Discourse to get compatibility with Ruby from 2.0.0 to 2.4.1. But when I do...
There we go! That's what I was looking for.
Each group shows a specific random seed. Each set of five bars is five different Ruby versions, going from 2.0.0 to 2.1.10 to 2.2.7 to 2.3.4 to 2.4.1. And each set tells the same slightly quirky story (is Ruby getting slower? Not really, but the last two bars are with a different, slower version of Discourse. I did, like, a whole talk that explains it better.)
Would it be easier to see if I sorted by Ruby version? I think it might. Here's what that looks like:
Random Seeds Matter, But Not Too Much
The four random seeds give four slightly-different benchmarks, but each of those benchmarks agrees about what Ruby is doing. There's a bit of noise between them -- there should be, because they're doing slightly different sets of operations, which take different amounts of time.
Which is perfect - a single benchmark can't perfectly reflect everybody's workload. But if slightly different workloads gave completely different results, something would be very wrong (for instance, I might be measuring wrong, or measuring something chaotic, or not using enough iterations.)
Instead, each workload tells approximately the same story in a slightly different way.
Which is exactly what a seeded, pseudorandom benchmark should do.
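That sanity check can be sketched in code: for one Ruby version, throughputs across seeds should differ by only a few percent. The numbers below are illustrative, not measured results, and `coefficient_of_variation` is a helper I'm defining here, not part of the benchmark:

```ruby
# How much do per-seed results disagree? A small coefficient of
# variation (standard deviation / mean) means "same story, slightly
# different ways" -- a large one would mean something is very wrong.
def coefficient_of_variation(values)
  mean = values.sum / values.size.to_f
  variance = values.sum { |v| (v - mean)**2 } / values.size
  Math.sqrt(variance) / mean
end

# Hypothetical throughputs (req/sec) for one Ruby version, four seeds:
seed_throughputs = [182.1, 185.4, 179.8, 183.7]
cv = coefficient_of_variation(seed_throughputs)
# cv around 1% here: the seeds agree, within noise.
```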