Comparing Rails Performance by Ruby Version

The Ruby benchmark I've been working on has matured recently. It builds an AMI that runs the benchmark automatically and makes it downloadable from the AWS instance. It has a much better-considered number of threads and processes for Puma, its application server.

So let's see how it works as a benchmark. Let's see if there's been any particular change in Rails speed between Ruby versions. (Just want to see the pretty graphs? Scroll down.)

Because of Discourse, I'll only check Ruby versions between about 2.0.X and 2.3.X -- Discourse doesn't support 2.4 as of early April 2017. I'm not using current Discourse because the current version wants gems that don't support Ruby 2.0.X. That's a problem with using a real app for a benchmark: it creates a lot of constraints on compatibility! When I move to a version of Discourse that supports Ruby 2.4.0, it'll also be using syntax that prevents using Ruby 2.2.X or earlier. It's good that we're taking benchmarks now, I suppose, so that we can compare the speed to later Discourse versions! But that's another post...

Version Differences, Speed Differences

So we can only compare version 2.0, 2.1, 2.2 and 2.3? Great, let's compare them. I've added some extra logging to the benchmark to record the completion times of each individual HTTP request. That makes it easier to compare my benchmark with Discourse's standard benchmark, the one they run on RubyBench.

Each run of the benchmark completes 1500 requests, divided evenly among 30 load-testing threads. That's only 50 requests/thread, so you'll see some variation in how long the threads take to complete. I checked the variation between all individual requests (across all threads, 1500 samples/run) and the variation among single-thread runs (30 samples/run, 50 requests/sample.)

Individual request times - later Ruby versions are faster by a roughly constant factor.

Each column here averages 50 individual requests. So there's less variation, but lots of slow, steady improvement.

The individual request times vary a lot, as do the 0th and 100th percentiles -- that's expected. The median request and per-run totals get noticeably faster -- easily 30% faster by any reasonable estimate. And the slowest requests (90th+ percentiles) improve by a similar amount.

Here is the numeric processed data for these graphs. It's output from process.rb in the Git repo. I haven't included my JSON files of test data. But if you run tests using the public AMI (current latest: ami-36cb5820) or an AMI you create from the Packer scripts, you should get very similar results. Or email me and I'll happily send you a copy of my JSON results.

Ruby Bench

Ruby Bench has numbers for the original Discourse benchmark, though only for older Ruby and Discourse versions. Still, it gives us a second opinion on how much Ruby performance has increased between 2.0 and 2.3: we can check that the Rails Ruby Bench results are approximately sane by comparing them to a simpler, more established benchmark that's already in use.

Click the graph to rotate through pictures of several similar graphs from Ruby Bench. See the same Ruby Bench link to get source code and exact benchmark numbers.

There's a lot to those graphs. I'd summarize it as: the median request hasn't really gotten faster, but the 90th/95th/99th have gotten *much* faster, in some cases 2x or more. Yet another reason why "3 times faster" is hard to nail down.

Memory usage (the red graph) has also gotten a bit higher. So we've traded more memory for more even response times. That sounds like a win to me. YMMV.

Why hasn't the median request gotten faster in this benchmark? Hard to say. There may be a few optimizations that are included as backports that show up in the newer benchmark... But if so, not many. It's also possible that concurrent performance is better but straight-line sequential performance isn't. The default Discourse benchmark doesn't set "-c" for concurrency, so it's only checking one request at once.

Ruby 3x3

So what does all this say about Ruby 3x3? It says that Ruby 2.3.4 already runs Ruby on Rails at about 150% of the speed of 2.0.0-p648. That's a great start. It also says that Ruby is fixing up a lot of edge cases -- the rare requests that used to cause slowdowns are getting even rarer, so performance is a lot more predictable.

I think it also suggests that my Rails benchmark is a pretty good start on measuring Rails performance in these cases.

Where we may really see some interesting effects for Rails is when Guilds are properly supported, allowing us to increase the number of threads and decrease processes, running more workers in the same amount of memory. This benchmark should really sing when Guilds are working well.

Caveats For Pedants Only - Everybody Else Close the Tab

Currently there are no warmup iterations after the server starts. That's going to significantly affect performance for non-MRI Ruby implementations, and probably has a non-trivial current effect even on MRI. I'll examine warmup iterations in a later post.

Data is written to JSON files after each run. You can see how that data is written in start.rb in the Git repo, and you can see how it's processed to create the Gist and data above in process.rb.

If even one request fails, my whole benchmark fails. So you're seeing only "perfect" runs where all 1500 requests complete without error. You'll see an occasional failed run in the wild. I see bad runs 5%-10% of the time. I don't currently believe this significantly skews the results, but I'm open to counterarguments.

In a later post I'll be increasing the total requests above 1500. Then the variance per run should go down, though the variance per HTTP request will stay about the same. 1500 requests just isn't enough for this EC2 instance size and I'll be correcting that later. Also, it's possible for "lucky" load threads to finish early and not pick up pending requests, so the 100th percentile load threads can have a lot of variation.

Ruby Bench uses Discourse 1.2.5, which is very old. I used 1.5.0 because it's the oldest version I could get working smoothly with recent Ruby and Ubuntu versions. I'll examine how Discourse has changed in speed in a future post. This is a hazard of testing with a real application that changes over time.

Ruby Bench uses very old versions of Ruby for its Discourse benchmark. Basically, Discourse broke for head-of-master Ruby when 2.4.0 merged Fixnum with Integer. So Ruby Bench stopped testing with newer versions. When Discourse works with Ruby 2.4 (note: coming very soon), they can update and I can write a speed-comparison blog post that includes 2.4.

Ruby Bench and my benchmark use different hardware (in my case, an EC2 t2.2xlarge non-dedicated instance.) The slope of the graph comparing one Ruby version with another should be similar, but the request times will be different. So: don't directly compare seconds/request between their graphs and mine, for many good reasons.

The standard Discourse benchmarks request the same URL many times in a row using ApacheBench. Ruby Bench uses the standard Discourse benchmark, so it does the same. My benchmark requests different URLs in different orders with different parameters, which affects the speed of the resulting benchmark. That's what I mean when I say their results should be only "roughly the same" as mine. You can think of my results as "roughly" a weighted blend of multiple of their results, plus some URLs they don't test.

I don't include 1st%/99th% data for full runs because there just aren't enough samples. Until you're looking at 500+ samples for each Ruby version, the 1% and 99% mark are going to bounce around so much that it doesn't make sense to show them. That's about 15 full runs-through of the benchmark for each Ruby version. That's perfectly doable, but more samples than I collected for this post. Instead, I showed the 90th and 10th percentile, which are much more stable for this amount of data. As stated above, you can also request my full JSON data files and get any percentile you feel like from the raw data.

(Have you read this far? Really? I'm impressed! By the way, AppFolio is hiring Ruby folks in Southern California and is an awesome place to work. Just sayin'.)

Rails Benchmarking: Puma and MultiProcess

This week, I've been playing with Puma processes. Headius, Nate Berkopec, you can probably stop reading now. You won't learn much ;-)

One consequence of the GIL is that a single Ruby process has limited ability to fully use the capabilities of a multi-core machine, including a larger AWS instance.

As a result, the optimum number of Rails processes for your AWS instance is probably not just one.

I'm still using my Discourse-based Rails benchmark, and we'll look at how the number of processes affects the total throughput. Some slightly odd things about this benchmark haven't changed since previous articles -- for instance, Postgres and the load-testing process still run on the same machine as Puma. The load tester is now a single process using only threads, which noticeably reduces its impact.

A quick note on Puma: by default it will spawn threads on demand in a single Ruby process with a single GIL, up to a maximum of 16 or whatever you specify. In cluster mode it will spawn more Ruby processes, as many as you specify, each one having on-demand threads up to 16 or the number you picked.
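
To make that concrete, here's a minimal sketch of a config/puma.rb for cluster mode. The numbers are illustrative, not the benchmark's actual settings:

# config/puma.rb -- illustrative numbers, not the benchmark's real settings
workers 10      # cluster mode: 10 separate Ruby processes, each with its own GIL
threads 0, 16   # each worker spawns threads on demand, up to 16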

Workers and Speed

In a previous post I played with load-testing threads to find out what it took to saturate Puma's default single process with a single GIL. In this post, I'll talk a bit more about what it takes to saturate an EC2 t2.2xlarge instance with Rails, database and load-testing processes.

With a single load-testing process and one Rails worker, you can get reasonable gains in total speed up to 5-7 load-testing threads. But there are two hyperthreaded cores on the instance, for a total of somewhere between 2 and 4 effective cores. We're already running five load-testing threads, one Rails thread, Postgres, and a normal collection of Ubuntu miscellaneous processes. Aren't we close to saturated already?

(Spoiler: not hardly.)

Per Aspera Ad Astra

Our time to process 1500 total requests with one Rails process was around 35 or 36 seconds. There's a bit of variation from run to run, as you'd expect. Getting there involved adding more load-testing threads to generate more requests. Even a single process, with a GIL preventing simultaneous Ruby execution, still likes having at least 5 or so load-testing threads to keep it fully loaded.

How about with two processes? To process the same total number of requests, it takes 17 to 19 seconds, or around half the time. That's nice, and it implies that Rails is just scaling linearly at that point. It won't scale perfectly forever due to things like database access and a limited number of cores, but it's pleasant while it lasts. So clearly having two processes of 16 threads is a win for Rails throughput.

How about three processes? 13-14 seconds to process all 1500 requests, it turns out. And you can be sure that part of the delay there is in the load tester (remember, still only five load threads with a GIL), not the Rails server.

But what if we crank it up to 20 load threads? And 11 Puma processes, just to keep things interesting... Then it takes between 5.5 and 7.0 seconds, roughly, to process every request. I wasn't expecting that when we started with a (tuned) one-Rails-process server. Were you? Heck, at that level you have to wonder if the GIL is slowing things down *in the load tester*.

So go for broke: what about 30 load threads and 11 Puma processes?

Finally, at this point, we start seeing fun 500 errors, which you'd kind of expect as you just keep cranking things up. Postgres has a built-in limit of 100 connections (configurable upward), and 11 Puma processes with up-to-sixteen-on-demand threads each, plus the load testers, finally exceeds that number. As a corollary, the previous 11 Puma processes with only 5 load threads were clearly not using all 16 threads per process -- if they had been, we'd have hit this Postgres limit before, since 11 * 16 is more than 100.

Between you and me, I did hit a breakage before that: I had to increase the size of the ActiveRecord connection pool, the "pool" setting in database.yml, before I hit Postgres' limit.
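
For reference, here's roughly where that setting lives. The values below are illustrative, not the benchmark's exact configuration:

# config/database.yml -- illustrative values
production:
  adapter: postgresql
  database: discourse
  pool: 16    # ActiveRecord connections per process; should cover Puma's max threads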

This all suggests an interesting possibility: what about limiting the number of threads per process with the higher number of processes? Then we can duck under the Postgres limit easily enough.

One More Time

Let's look at some (very) quick measurements on how fast it runs with different combinations of processes and threads...

Performance by number of Puma threads and processes, and by load-testing threads.

Performance by number of Puma threads and processes, and by load-testing threads.

Puma Processes   Puma Threads   Load Threads   Time Taken (seconds)
      10               6             30              5.1 - 6.8
      10              10             20              5.3 - 6.5
       8               8             15              5.8 - 7.0
      10              10             10              6.3 - 7.3
       8              10             20              5.7 - 6.7
       4              10             20              8.5 - 9.7
       1              16             12             32.6 - 35.5

(Note: all of this is done with current-latest Rails Ruby Bench and Ruby 2.3.1p112 - having some compatibility issues with latest 2.4.0 and latest Discourse release because of JSON 1.8.3 incompatibility. Working on it!)

To stay under the Postgres 100-connection limit -- using threads for quick context switching and multiple processes to avoid the GIL -- there's a pretty good sweet spot at around 30 load-testing threads, 10 Puma processes and a limit of 6 threads/process. At that point, the noise in the benchmark starts to make it hard to tell whether a new configuration is faster -- there's too much noise and too little change. There's a tool to get around that, but for now, it's time to move to a different issue.

For now, let's call that success on tuning processes and threads. It's highly likely that the load tester is hitting GIL contention with 30 (!) threads, and I'm *sure* this quick-and-dirty configuration check isn't the very fastest way to serve 1500 requests in Rails. But we've verified that anything starting at around 8-10 Puma processes, 5+ threads/process and 20+ load-testing threads will get us quite decent performance (5-8 seconds for 1500 requests.)

But this is definitely the low-hanging fruit, and a solid configuration. And we don't need or *want* it to be perfect. We want it to be a representative configuration for a "normal" "real" Rails app, laboring hard. Speaking of which...

Ruby 3x3

Even this little two-to-four-core EC2 instance is clearly benefiting a lot from having lots of Ruby threads running at once. That's a really good sign for Ruby Guilds. It's also great for JRuby, which has the JVM's world-class thread performance and no GIL. Expect me to time JRuby in a future post, after adding some warmup iterations to keep JRuby's JIT from making it look incredibly slow here. 1500 requests just isn't that much, and even for MRI the total number may need an increase.

Later, it's likely I'll reimplement the load tester to use clustered processes with many worker threads. I think the load tester is a *great* place to use guilds once they become available, and to measure their performance when it's time to. But there may be some interesting difference between multithreaded speed and Guild speed, depending on what objects are accessed...

Hope you've enjoyed reading about this Rails benchmark performance! I'll keep writing as it evolves...

Ruby, RubyGems and Bundler

Ruby, RubyGems and Bundler can be a bit of an intertwined mess -- it can be hard to tell what magic incantation will tell you what went wrong and how to fix it.

The secret is that they're three separate layers. Ruby was originally designed without RubyGems. RubyGems is a separate layer on top with a few seams where it was designed to be detached. Bundler was created for Rails 3.0, and was built on top of a perfectly good RubyGems to add more functionality.

In other words, it makes sense to learn them separately even if you'll only use them together. Otherwise, how can you tell what's wrong with what library?

We won't discuss version managers like rvm, rbenv or chruby here. Rest assured that they're another whole layer with their own power and subtleties, and they interact with gems as well, not just the Ruby executable.

I found this talk by Andre Arko after writing this - he mentions a whole setup.rb step in between Require and RubyGems that you can ignore completely. It has a lot of great additional detail and history.

Ruby

Ruby, at its lowest level, doesn't really have "libraries" built in. It has the ability to "load" or "require" a file, and it has $LOAD_PATH, an array of paths to check when you ask for a filename.

"Load" just does what it says: read the file and execute the Ruby inside. It's almost the same as if you'd just written "eval File.read('filename')", except that it checks the paths in the $LOAD_PATH, in order, to figure out where to find your filename. Well, and it also executes inside a top-level Ruby object called "main" rather than exactly where you called "eval". Still, it's a pretty straightforward command.

"Require" is just slightly more complicated. It keeps a hash of what files have already been required. If you ask for a new one, it will load it. If you ask for a file you've already required, it will do nothing. "Require" also tries hard to not re-require different paths if they point to the same file -- think of symbolic links between directories, for instance, or relative pathnames versus absolute. Require tries to chase down the "real" canonical location of the file so it can avoid requiring the same file twice except in pretty extreme circumstances.

Ruby starts with a few default entries in the $LOAD_PATH. You may want to pop into irb and type "$LOAD_PATH" to see what they are for you. An old version of Ruby like 1.8.6 had even fewer since RubyGems wasn't loaded until you manually required it. In recent versions, you can see that RubyGems is installed by default - you're likely to see some gems automatically in the $LOAD_PATH.

You'll also notice that the current directory (".") isn't in the $LOAD_PATH. Long ago it used to be. These days it isn't. That's why you can't just "require 'myfile'" and have it magically find myfile.rb from the current directory. I mean, unless you stick "." into the $LOAD_PATH array, but that's not a common thing to do.
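
If you want to see this for yourself, try something like the following in irb ("myfile.rb" is a hypothetical file in your current directory):

$LOAD_PATH.take(3)       # a peek at the built-in entries for your Ruby
require "myfile"         # raises LoadError -- "." isn't searched
$LOAD_PATH.unshift "."   # works, but isn't a common (or great) idea
require "myfile"         # now finds ./myfile.rb

(In a script, "require_relative" is usually the better way to load a file sitting next to the current one.)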

RubyGems

RubyGems is a library on top of Ruby. You can upgrade it separately from your Ruby language version. RubyGems also has some command-line commands and strong opinions that old versions of Ruby didn't originally have.

The "gem" command will show you a lot about how RubyGems is currently set up. Specifically, try typing "gem env" and see all the good stuff:

RubyGems Environment:
- RUBYGEMS VERSION: 2.5.1
- RUBY VERSION: 2.3.1 (2016-04-26 patchlevel 112) [x86_64-darwin15]
- INSTALLATION DIRECTORY: /Users/noah.gibbs/.rvm/gems/ruby-2.3.1
- USER INSTALLATION DIRECTORY: /Users/noah.gibbs/.gem/ruby/2.3.0
- RUBY EXECUTABLE: /Users/noah.gibbs/.rvm/rubies/ruby-2.3.1/bin/ruby
- EXECUTABLE DIRECTORY: /Users/noah.gibbs/.rvm/gems/ruby-2.3.1/bin
- SPEC CACHE DIRECTORY: /Users/noah.gibbs/.gem/specs
- SYSTEM CONFIGURATION DIRECTORY: /Users/noah.gibbs/.rvm/rubies/ruby-2.3.1/etc
- RUBYGEMS PLATFORMS:
- ruby
- x86_64-darwin-15
- GEM PATHS:
 - /Users/noah.gibbs/.rvm/gems/ruby-2.3.1
 - /Users/noah.gibbs/.rvm/gems/ruby-2.3.1@global
- GEM CONFIGURATION:
 - :update_sources => true
 - :verbose => true
 - :backtrace => false
 - :bulk_threshold => 1000
- REMOTE SOURCES:
 - https://rubygems.org/
- SHELL PATH:
 - /Users/noah.gibbs/.rvm/gems/ruby-2.3.1/bin
 - /Users/noah.gibbs/.rvm/gems/ruby-2.3.1@global/bin
 - /Users/noah.gibbs/.rvm/rubies/ruby-2.3.1/bin
 - /usr/local/bin
 - /usr/bin
 - /bin
 - /usr/sbin
 - /sbin
 - /Users/noah.gibbs/.rvm/bin

There are a bunch of environment variables that affect where and how Ruby finds gems. "Gem env" shows you where they're all currently pointed. Useful!

That list of "GEM PATHS" tells you what RubyGems puts into the $LOAD_PATH to let Ruby find your gems. The "INSTALLATION DIRECTORY" is where "gem install" will put stuff.

RubyGems does some interesting things, but it's mostly an extension of $LOAD_PATH. It doesn't do as much fancy stuff as you might think. As a result, it doesn't have any ability to find things that aren't locally installed - you can't use a gem from Git using RubyGems for instance, because how and when would you update it? RubyGems has a path it installs to, a few paths it looks through, and the ability to turn a directory of files into an archive (a "gem file", but not at all like "Gemfile") and back.

The last one is interesting. You can "gem build" a gem if you have a .gemspec file in the right format. The result is just a YAML manifest of metadata plus an archive of files, all compressed into a single ".gem" archive. You can then push it to remote storage, such as RubyGems.org or a local gem server (see GemInABox for an example.)
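
A minimal gemspec looks something like this -- the name and contents are hypothetical:

# my_gem.gemspec -- a tiny, made-up example
Gem::Specification.new do |spec|
  spec.name    = "my_gem"
  spec.version = "0.1.0"
  spec.summary = "A small demonstration gem"
  spec.authors = ["Your Name"]
  spec.files   = Dir["lib/**/*.rb"]
end

Then "gem build my_gem.gemspec" produces my_gem-0.1.0.gem, and "gem push my_gem-0.1.0.gem" uploads it to RubyGems.org (or wherever you've pointed it).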

That's also how "gem install" works - it downloads a .gem archive, then unpacks it to a local directory under the "INSTALLATION DIRECTORY". The reason for things like "spec cache" above is that to download .gem archives, RubyGems wants to know who has what versions of what gems, and what platforms and Ruby versions they're compatible with. The spec files have that information but not the whole archive of files. That's so that they're smaller and quicker to download.

One more subtlety: gems are allowed to build native extensions. That is, they can link to system libraries and build new files when you install them. So this is a *bit* more complicated than just unpacking an archive of Ruby files into place. It can also involve fairly convoluted install steps. But they're basically a run-on-install shell script to build files. This is also why every Ruby version you have installed saves its own copy of every gem. If a gem builds an extension, that's compiled against your local Ruby libraries, which are different for every version of Ruby. So that copy of Nokogiri actually has different files in the Ruby 2.3.1 copy than in the Ruby 2.4.0 copy or the Ruby 1.9.3 copy. That's what happens when you build them each with different libraries, it turns out.

RubyGems is more complicated than plain old "load" and "require." But nothing I've described is terribly magical, you know?

Bundler

Bundler is a much more powerful, complex and subtle tool than RubyGems. It has more weird behaviors that you have to understand, and it enables a lot more magic.

It solves a lot of long-standing RubyGems problems, and replaces them with a new crop of Bundler-specific problems. No tool can just "do what you want" without you having to describe what you want. Bundler is no exception.

You can tell you're using Bundler when you're messing with the Gemfile and Gemfile.lock, or when you use the "bundle" command. You can also start Bundler from your Ruby files. That's why Rails commands run with Bundler active, but don't start with the "bundle" command.

The first thing Bundler does is to make undeclared gems "invisible." If it's not in your Gemfile, you can't require it. That's really powerful because it means somebody else can tell what gems you were actually *using*. It also makes undeclared gem *versions* invisible. So if you have five versions of JSON installed (don't laugh, it happens), this will make sure you get the right one and only the right one. This trick requires "bundle exec" (see below.)

It also has "bundle install". If you have a list of all the gems you can use, it makes sense to just let you install them. That's probably the most wonderful magic in Bundler. If you remember the old system of gem declarations in Rails' environment.rb, you understand just how incredible Bundler is. If you don't remember it... That's probably for the best.

Similarly, it has a Gemfile.lock with the specific version of all your gems. So even if you give a range of legal versions of MultiJSON, the Gemfile.lock will list the specific one you're currently using. That way, everybody else will also get the same version when they "bundle install" using your Gemfile.lock. For complex but good reasons, you should check in an application's Gemfile.lock so that everybody gets the same versions you do, but you should *not* check in a library's Gemfile.lock because you *can't* make everybody use your same dependencies. Oy.

Bundler also figures out which Gem versions are compatible with which other versions. When Bundler creates Gemfile.lock, it makes sure that the whole set of different gem versions works together, and that they get activated in the right order with all the right versions. Getting all your gem versions loaded in the right order used to be a very ugly process. Bundler fixes it.

Bundler can also use somewhat-dynamic gems. Specifically, you can declare a "git" URL in your Gemfile and Bundler will check it out locally and make sure it gets into your $LOAD_PATH so that "require" works. The Gemfile can also take gems with a ":path" option to point to un-installed local gems, such as in a checked-out repo. Both of these things require Bundler to be active inside your Ruby process -- just setting $LOAD_PATH isn't enough, the Bundler library has to be active. Be sure to run with "bundle exec" or this won't work.
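
For illustration, a Gemfile using those options might look something like this (the gem names, URL and path are just examples):

# Gemfile -- illustrative entries, not a real application's dependencies
source "https://rubygems.org"

gem "multi_json", "~> 1.12"                    # a normal, version-ranged gem
gem "my_gem", git: "https://github.com/example/my_gem.git", branch: "master"
gem "my_local_gem", path: "../my_local_gem"    # a checked-out local repo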

Bundler still does a lot of this with $LOAD_PATH magic. The rest is done by loading its library into your Ruby process, changing how "require" works. It gets loaded via "Bundler.setup" in Ruby, or something like Rails or "bundle exec" that calls it. There may also be a sacrifice of goats involved, so check your version number carefully.

Because Bundler needs to be running inside your Ruby process, you'll need to activate it. The easiest way to do this manually is to type "bundle exec" in front of your process name. That will find the right Gemfile, set an environment variable so that sub-processes running Bundler will use the same one, and generally make sure everything gets properly loaded. Just be careful - if you run a sub-process that also runs Ruby, it can be hard to make sure it's using the same Bundler in the same way. When in doubt, run "bundle exec" in front of the command line if there's any chance that it could run something in Ruby.
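
If you'd rather activate Bundler from inside your own Ruby code instead of prefixing "bundle exec", the standard one-liner is:

require "bundler/setup"   # finds the Gemfile and restricts what "require" can see
require "multi_json"      # now only the version locked in Gemfile.lock is visible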

Bundler also has a facility for "vendoring" gems -- that is, copying specific versions to a local directory and using them there, not from system directories. That can be valuable, but the way Bundler does it is kind of magical and kind of brain-bending. It's better than the old RubyGems method of copying the files to a local directory and replacing $LOAD_PATH. But it's still pretty weird.

If you're having trouble figuring out what's going on in Bundler, the answer is usually "bundle exec". For instance, "bundle exec gem env" will show you where Gems get installed or loaded with Bundler active, which can be a huge revelation. "Oh, *that's* why I'm not seeing it." Similarly, running things like "bundle exec gem list --local" shows you what Bundler can and can't see. That's very powerful.

There are rumors that Bundler will wind up built into RubyGems. If that happens, it will eliminate some of the headaches with subprocesses and manually running "bundle exec". That would be awesome. In the mean time you're going to need to know more about this than you'd like. I feel your pain, I promise.

Rails Benchmarking and a Public AMI

You remember that Rails benchmark I've been working on? I've been making it friendlier to quick runs and getting results in its final (?) AWS configuration.

If you check its Git repository, you'll find lots of Packer code to make a nice AMI that boots up and puts JSON benchmark results into a publicly-served directory. At least, if you happen to know the right IP address to go to. I'm assuming your JSON results aren't terribly important to your security -- or that you can modify the Packer code yourself, or not expose a public IP address when you spin up the AWS instance. Suit yourself.

I've just made the AMI public: ami-745b8262. That means you should be able to spin up a new instance of it with something like the following:

aws ec2 run-instances --image-id ami-745b8262 --count 1 --instance-type t2.2xlarge --key-name my-ec2-keypair-name

Replace the keypair name with your own keypair, naturally. Though you don't have to SSH in. Instead, you can navigate to the instance's public IP address, and to /benchmark-results/ to see a directory of (currently one file of) benchmark results. See /etc/rc.local if you're curious how the benchmark is run. The code is all from the repository above, naturally.

I'm still modifying the benchmark and the tooling. But I'd love any feedback you have. Is this a convenient way to make a benchmark canonically accessible? I'm also still testing AWS performance to see how the above varies from using dedicated hardware -- the above command line still uses standard shared instances if you copy and paste it verbatim.

 

The Benchmark and the Rails

Not long ago, I wrote about benchmarking Rails for the Ruby team and how I thought it should probably be done. I got a little feedback and adjusted for it. Now I've implemented the first draft of it. That's what you do, right?

You can read those same benchmarking principles in the README for the benchmark if you're so inclined, and how to build an AMI to test the same way. You can also benchmark locally on Mac OS or Linux -- those just aren't the canonical numbers for the benchmark. Something has to be, and AWS seems like the way to go, for reasons discussed in the previous blog post, the README, etc.

So let's talk a bit about how the code came out, and what you can do with it if you're so inclined.

Right now, you'll need to run the benchmark locally, or build your own AMI. You can run that AMI on a t2.2xlarge instance if you want basically-official benchmark results.

I'd love quibbles, pull requests, bug reports... Anything along those lines.

And if you think it's not a fair benchmark in some way, great! Please let me know what you think is wrong. Bonus points if you have a suggestion for how to fix it. For instance: on my Mac, using Puma cut nearly twenty-five percent of the total runtime relative to running with Thin. Whoah! So running with Thin would be a fine example of bad benchmarking in this specific case.

I don't yet have a public AMI so that you can just spin up an instance and run your own benchmarks... yet. It's coming. Expect another blog post when it does.

Threads, Threads, Threads

Nate Berkopec pointed out that a lot of the concurrency and threading particulars would matter. Good! Improving concurrency is a major Ruby 3x3 goal. Here are some early results along those lines...

The "official" AWS instance size, like my laptop, has four "real" cores, visible as eight Intel hyperthreaded cores. Which means, arguably, that having four unrelated processes going at all times would be the sweet spot where the (virtualized) processor is fully used but not overused.

I originally wrote the load-testing program as multiple processes, and later converted it to threads. The results turned out to be in line with the results here: a block of user actions that previously took 39 or 40 seconds to process suddenly took 35 to 37 seconds (apples-to-oranges warning: this also stopped counting a bit of process startup time.) So: definitely an improvement when not context-switching between processes as often. Threaded beats multiprocess for the load tester, presumably by reducing the number of processes and context switches.
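
For flavor, here's a minimal sketch of what a threaded load tester looks like. It assumes a local app server on port 9292 and a couple of made-up Discourse paths -- this is not the benchmark's real code:

require "net/http"

PATHS = ["/latest.json", "/categories.json"]   # hypothetical routes to hit
NUM_THREADS = 5
REQUESTS_PER_THREAD = 300

threads = NUM_THREADS.times.map do
  Thread.new do
    http = Net::HTTP.new("localhost", 9292)    # one connection per worker thread
    REQUESTS_PER_THREAD.times { http.get(PATHS.sample) }
  end
end
threads.each(&:join)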

Rails running in Puma means it's using threads not processes as well. One assumes that Ruby 3 guilds, when Rails supports them, will do even better by reducing Global Interpreter Lock (GIL) contention in the Rails server. When that happens, it'll probably be time to use guilds for the load-tester as well to allow it to simultaneously execute multiple threads, just as the Rails server will be.

So: this should be a great example of improving benchmark results as guilds improve multithreaded concurrency.

Interestingly, the load tester keeps getting faster by adding threads up to at least five worker threads, though four and five are very close (roughly 29.5 vs 30.8 seconds, where per-run variation is on the order of 0.5-0.8 seconds.) That implies that only four simultaneous threads in the load-tester doesn't quite saturate the Rails server on a four-core machine. Perhaps there's enough I/O wait while dealing with SQLite or the (localhost) network somehow, or while communicating with Redis? But I'm still guessing about the reasons - more investigation will be a good idea.

Benchmarking on AWS

One of the interesting issues is clearly going to be benchmarking on AWS. Here are a few early thoughts on that. More will come as I take more measurements, I promise :-)

AWS has a few well-known issues, and likely more that aren't as well-known.

One issue: noisy neighbors. This comes in several flavors, including difficulties in network speed/connection, CPU usage and I/O. Basically, an Amazon instance can share a specific piece of physical hardware with multiple other virtual machines, and usually does. If you wind up on a machine where the other virtual hosts are using a lot of resources, you'll find your own performance is lower. This isn't anything nefarious or about Amazon overselling - it's standard VM stuff, and it just needs to be dealt with.

Another issue: central services. Amazon's infrastructure, including virtual machine routing, load balancing and DNS, is shared with a gigantic number of virtual machines. This, too, produces some performance "noise" as these unpredictable shared services behave slightly differently moment-to-moment.

My initial solution to both these problems is to make as little use of AWS networking as possible. That doesn't address the CPU and I/O flavors of noisy neighbors, but I'm getting nicely consistent results to begin with... And I'll need to filter results over time to account for differing CPU and I/O usage, though I'm not there yet.

Another thing that helps: large instances. The larger the size of the AWS instance being used, the fewer total VMs you'll have on the physical hardware you're sharing. This should make intuitive sense.

Another commonly-used solution: spin up a number of instances, do a speed test, and then keep only the N fastest instances for your benchmark. This obviously isn't a theoretical guarantee of everything working, since a quiet "neighbor" VM may become noisy at any time. But in general, there's a strong correlation between a random VM's resource usage at time T and at time T + 3. So: selecting your VM for quiet neighbors isn't a guarantee, but it can definitely help. Right now I'm doing the manual flavor of this where I check by hand, but it may become automated at some point in the future.

Onward

From here, I plan to keep improving the benchmark quality and convenience in various ways, and assess how to get relatively reliable benchmarks out of AWS.

What I'd love is your feedback: have I missed anything major you care about in a Rails benchmark? Does anything in my approach seem basically wrong-headed?

Big thanks to Matt Gaudet, Chris Seaton, Nate Berkopec and Charles Nutter, who have each given significant feedback that has required changes on my part. Thanks for keeping me honest, gentlemen!

 

You -- Yes, You -- Can Speak at a Conference

I've been having this talk with the coworkers a lot lately. So let's talk about it here, shall we?

You, O budding programmer (or non-programmer) can speak at a conference. Better yet, you can speak in a useful way and somebody will enjoy it. Let's talk about why.

You've done a couple of weeks of work on something at some point, right? Maybe it was regular expressions. Or Ruby on Rails controllers. Or learned a little about Rust or Haskell. Or how to solve a Rubik's cube. You're literate (because you're reading this post) which means you learned how to do something at some point.

When you were doing that work, you know who would have been perfect to help you out? Future-you, a few weeks or months further along in the same task.

Six months on, you've forgotten a bunch of what problems you had. If you became amazing at the same task, you forgot everything you used to have trouble with (it turns out that it's called "the curse of knowledge" and it's totally a thing.)

Which means that whatever you've spent a few weeks on (or a few months, or a few years,) you are perfect to help out the person who is a few weeks behind you.

If you're only a few weeks in, you can submit that as a talk to perfect beginners -- and you're the best person in the world to help them out.

Or if you've been doing something for a few years (Ruby programming, databases, competitive eating), you're the perfect person to help other people out who are farther along.

If you've put the time into something, that's a pretty good indicator that somebody finds it interesting. You did.

If you can't find a talk about stuff you're doing -- great, you should give one!

So stop every week or three at work or play. Scribble down what you're working on. And the next time you want to submit an idea for a talk, use one of those.

And you'll know you're the perfect person to give that talk. And you, a few weeks (or months) earlier, are exactly who you should say the talk is for.

Got it?

 

RVM Errors with Latest Ruby?

Are you trying to build Ruby 2.4.0 or later, including the 2.4.0 preview releases? Are you doing it by installing with RVM? Are you seeing errors like the ones below?

Ruby (specifically Rubygems) now pulls code from repos with submodules. That requires an RVM change.

To fix it, you can eventually upgrade to the latest RVM using "rvm get head". In the meantime, you can use an older RubyGems: don't just "rvm install ruby-head". Instead, use "rvm install ruby-head --rubygems 2.6.6". Curious whether it's been long enough that you can use the latest RubyGems? See the RVM issue for it.

Once that issue is fixed, the correct fix should be "rvm get stable" instead of specifying a specific older RubyGems version for each Ruby installation.

Screenshot of the problem here - see below for cut-and-pasteable output.

 

rails_ruby_bench noah.gibbs$ rvm install ruby-head-nobu --url https://github.com/nobu/ruby.git --branch round-to-even
Checking requirements for osx.
Installing requirements for osx.
Updating system..............
Installing required packages: coreutils...
Certificates in '/usr/local/etc/openssl/cert.pem' are already up to date.
Requirements installation successful.
Installing Ruby from source to: /Users/noah.gibbs/.rvm/rubies/ruby-head-nobu, this may take a while depending on your cpu(s)...
HEAD is now at 61a7af5 add to: :nearest option
From https://github.com/nobu/ruby
 * branch            round-to-even -> FETCH_HEAD
Current branch round-to-even is up to date.
git checkout round-to-even
Copying from repo to src path...
ruby-head-nobu - #autoreconf.
ruby-head-nobu - #configuring..................................................................
ruby-head-nobu - #post-configuration.
ruby-head-nobu - #compiling....................................................................................................
ruby-head-nobu - #installing.......
ruby-head-nobu - #making binaries executable..
ruby-head-nobu - #downloading rubygems-7d3b7063184c0de861d9f31285ee1e7357efde15
ruby-head-nobu - #extracting rubygems-7d3b7063184c0de861d9f31285ee1e7357efde15.....
ruby-head-nobu - #removing old rubygems.........
ruby-head-nobu - #installing rubygems-7d3b7063184c0de861d9f31285ee1e7357efde15.
Error running 'env GEM_HOME=/Users/noah.gibbs/.rvm/gems/ruby-head-nobu@global GEM_PATH= /Users/noah.gibbs/.rvm/rubies/ruby-head-nobu/bin/ruby -d /Users/noah.gibbs/.rvm/src/rubygems-7d3b7063184c0de861d9f31285ee1e7357efde15/setup.rb --no-document',
showing last 15 lines of /Users/noah.gibbs/.rvm/log/1479161946_ruby-head-nobu/rubygems.install.log
[2016-11-14 14:22:20] /Users/noah.gibbs/.rvm/rubies/ruby-head-nobu/bin/ruby
current path: /Users/noah.gibbs/.rvm/src/rubygems-7d3b7063184c0de861d9f31285ee1e7357efde15
GEM_HOME=/Users/noah.gibbs/.rvm/gems/ruby-2.3.1
PATH=/Users/noah.gibbs/.rvm/usr/bin:/usr/local/opt/coreutils/bin:/usr/local/opt/pkg-config/bin:/usr/local/opt/libtool/bin:/usr/local/opt/automake/bin:/usr/local/opt/autoconf/bin:/Users/noah.gibbs/.rvm/gems/ruby-2.3.1/bin:/Users/noah.gibbs/.rvm/gems/ruby-2.3.1@global/bin:/Users/noah.gibbs/.rvm/rubies/ruby-2.3.1/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Users/noah.gibbs/.rvm/bin:/Users/noah.gibbs/.rvm/bin
GEM_PATH=/Users/noah.gibbs/.rvm/gems/ruby-2.3.1:/Users/noah.gibbs/.rvm/gems/ruby-2.3.1@global
command(7): env GEM_HOME=/Users/noah.gibbs/.rvm/gems/ruby-head-nobu@global GEM_PATH= /Users/noah.gibbs/.rvm/rubies/ruby-head-nobu/bin/ruby -d /Users/noah.gibbs/.rvm/src/rubygems-7d3b7063184c0de861d9f31285ee1e7357efde15/setup.rb --no-document
Exception `LoadError' at /Users/noah.gibbs/.rvm/rubies/ruby-head-nobu/lib/ruby/2.4.0/rubygems.rb:1345 - cannot load such file -- rubygems/defaults/operating_system
Exception `LoadError' at /Users/noah.gibbs/.rvm/rubies/ruby-head-nobu/lib/ruby/2.4.0/rubygems.rb:1354 - cannot load such file -- rubygems/defaults/ruby
Exception `Gem::MissingSpecError' at /Users/noah.gibbs/.rvm/rubies/ruby-head-nobu/lib/ruby/2.4.0/rubygems/dependency.rb:308 - Gem::MissingSpecError
ERROR:  While executing gem ... (Errno::ENOENT)
No such file or directory @ dir_chdir - bundler/lib

Using RVM with Locally-Built Ruby

If you use RVM, I'm sure you're used to installing your Rubies with it. It's a great way to make sure your build dependencies are up-to-date, to build a Ruby that doesn't overlap with your other stuff, to manage paths and gems...

(And yes, it has some down sides. By and large I still like it quite a bit. But if you use rbenv instead, I have no stones to throw. Rbenv and ruby-build are a great combination too, and will get you where you need to be.)

It's less obvious that you can use RVM to build and install Ruby from a local clone of the Ruby GitHub repo or another similar one -- for instance, if you're messing with Ruby patches yourself, or you're cloning from somebody who is.

Name and Location

Two little-known RVM features help us out here: Named Rubies, and the (undocumented, I think) --url option to "rvm install."

Which means you can do something like this:

rvm install ruby-head-nobu --url https://github.com/nobu/ruby.git --branch round-to-even

Now you have a normal-looking RVM-installed Ruby, in this case called "ruby-head-nobu", from pure-custom source. You can even build it from a local repo (see below.) Do you see errors instead? Try "rvm get head" to upgrade to the latest RVM - they patched a problem with this on November 9th, 2016.

A quick disclaimer: don't use a name like "ruby-head-round-even". Turns out that you can't have any more dashes in the name. It just cuts them off and you get a name like "ruby-head-even". Which is pretty confusing at the time. Don't use dashes in the last part of the name.

Another disclaimer: RVM has only recently updated to support Ruby including Bundler. So you may need to update RVM to latest stable before "rvm install ruby-head" works at all, custom URL or no.

Do It Locally

If you just pass the path to a local repo above as a URL, it doesn't work. It tries to clone your local directory with SVN, which doesn't work. Same problem with a file URL. How do you get it to use Git, which is smart about these things?

I had to read through the RVM source to find this out, but if your URL contains the string "git" anywhere, RVM assumes it should use Git, not SVN, to clone it.

So this worked for me:

rvm install ruby-head-nobu --url file:///Users/noah.gibbs/src/ruby/nobu_ruby/.git --branch round-to-even

It's a bit of a hack -- it only works because the URL happens to contain "git" and that's what RVM checks for as of this writing. And because you're allowed to clone from the .git directory. But it works for me! This saves a very time-consuming "git clone" step over the network. The Ruby repo is pretty big at this point, and Git is smart about how it clones a repo from the local file system.

Made of Win

This is all useful if you're building Ruby locally -- it can be annoying to set up individual install directories and then mess with your PATH and RUBYLIB. And an installed Ruby can be necessary to do certain things like running RubySpec. Look for another post about that soon!

 

Benchmarking Rails

I've been working on a Rails benchmark for Ruby, like the one Matt Gaudet's RubyKaigi talk discusses. For the short version: we'd like to have a number of Ruby benchmarks for standard workloads like "lots of CPU, not much disk I/O" and "modifying lots of memory, lots of GC time"... And of course "typical Rails application." It helps show us whether a given set of Ruby optimizations are good or bad for the performance of a "typical" Rails app... whatever that means.

There are a lot of interesting questions. What do we benchmark? How do we benchmark it? A benchmark is defined by how it answers those questions. I can talk about what decisions I prefer and why.

(And then smarter people than me will argue for a slightly better set. That's cool. That argument gets way easier if I state my assumptions to start the debate.)

Decisions

For the Rails code to benchmark, I chose Discourse. Discourse is a real Rails-based product with real-world applications. The code is complex, but not so complex you can't understand it. It uses typical Rails calls in typical proportions, insofar as there is such a thing as "typical" for a mid-sized Rails application.

Discourse has a lot of dependencies. That's why they usually use Docker to install it. I use a local setup script for local development, and (soon) an AWS EC2 AMI for "real" benchmark numbers. I could have used Docker, but Docker isn't great for performance numbers. Benchmarking isn't really what Docker is for. However, a lot of real shops use Amazon EC2 for Rails hosting, so those numbers are useful. Even hosting that isn't done directly on EC2 is sometimes done indirectly on it, via an EC2-based cloud service.

We want a multithreaded benchmark - Matz made it clear that a lot of Ruby3x3's increase in speed will be about concurrent performance. We can use Guilds when Rails supports them. But for now, we have to use threads. A multithreaded or Guilded benchmark won't be perfectly reproducible. But we can get pretty good reproducibility with random seeds for which calls are made in which order.

The idea is that there will be a "standard" random seed that we start testing with -- this guarantees that each thread has a known list of calls in a known order. But we can easily change the random seed to make sure we're not "overtraining" by optimizing too much for one specific call order, even if there's one we normally use for consistent numbers.

The random seed algorithm is very simple. That means that small changes to the benchmark can make big changes to exactly which actions are taken. But it also means the benchmark doesn't have to be hugely complex. The benchmark should change very little over time, once it has been initially debugged. And with a large number of iterations, the total benchmark time shouldn't change much with changes to the random seed.
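
The shape of the idea, as a sketch rather than the benchmark's actual code (the action names, seed and counts are illustrative):

ACTIONS = [:view_topic, :post_topic, :edit_post]   # illustrative, not the real action list
master = Random.new(123_456)                       # the shared "standard" seed

# Each simulated user gets a derived seed, so its exact call order is reproducible
per_user_actions = 5.times.map do
  rng = Random.new(master.rand(2**31))
  Array.new(10_000) { ACTIONS[rng.rand(ACTIONS.size)] }
end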

I'm assuming that we'll be testing with 8 or fewer virtual CPUs, because that's quite common for Rails applications early on. Later, when Rails applications need a lot more parallelism, they tend to have wildly divergent needs -- some are all about I/O, some are extremely CPU-bound. I don't think any one benchmark will do very well for very large Rails applications, but I think a smaller benchmark can be quite representative of small and mid-sized Rails applications. Also, companies that need extremely concurrent Rails workloads are probably doing more of their own benchmarking with their own application. It's reasonable for them to compile a custom version of Ruby with application-specific settings, for instance.

(Nate Berkopec points out that Amazon's idea of 8 vCPUs is actually 8 hyperthreads on four of what a normal human might call a "virtual CPU" or "virtual core." So I'm going to change the load tester to a single multithreaded process, which helps a bit, and see if there is any way I can otherwise improve the ratio of CPUs to processes. I still think CPU contention isn't going to be as big a problem as AWS network overhead when measuring.)

Discourse is already Thin-compatible, so I'm starting with that. Later, I should switch to Puma. Puma is threaded by default (Unicorn is multiprocess, Passenger is multiprocess except for the commercial version, Thin is based on EventMachine.) So Puma seems like the best choice to demonstrate the increased concurrency of Ruby 3x3.

For the benchmark, I'm configuring it as a single EC2 instance with Postgres, Sidekiq, the (one) app server and the benchmark script all configured on the same instance. That is not realistic, but it avoids swamping the benchmark results with AWS-specific problems like noisy neighbors, getting all the instances into the same VPC and running the benchmark from an IP address that's too far from the nearest AWS datacenter. You can, of course, use the same pieces in a different configuration to get your own benchmark numbers. But those aren't the ones I'm generating.

I added a little bit of warmup time to the default benchmark settings. That setting can be changed, naturally. The amount of warmup is a big deal when timing MRI versus JRuby or Rubinius. The more warmup time you allow, the more you give an advantage to JIT approaches where the interpreter starts really slow and speeds up over time.

In addition to the time it takes to complete all the requests, another important factor is startup time: how long, from when you run "rails server", does it take to serve the very first request? When you're debugging, recovering from a power outage or restarting a server for a deploy or crash recovery, you care a lot about how long it takes Rails to start up. And when you're debugging, you'll pay that startup cost over and over, so it matters even more.

What Does It Do?

The benchmark basically simulates a set of users loading the app - posting topics, viewing topics and occasionally editing posts. When I talk about a random seed, the seed is for which of these activities occur and in what order.

There are a fixed number of users (5) and a fixed number of requests per user (10,000), and the benchmark is a weighted sum of:

  1. how long it takes to get through all of its requests, for each user
  2. how long it takes to get through the fastest one of them
  3. how long the Rails application took to start up
  4. maximum memory usage of the Rails process

There's nothing sacred about that exact list or how it's weighted. But I think those items reward the right things.
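
As a sketch of how such a weighted sum might be combined -- with placeholder weights and numbers, not the benchmark's real ones:

# Placeholder weights -- the real benchmark may combine these quite differently
def benchmark_score(per_user_times, startup_time, max_memory_mb)
  total_time   = per_user_times.inject(:+)   # item 1: every user's total time
  fastest_user = per_user_times.min          # item 2: the fastest user
  1.0 * total_time + 1.0 * fastest_user + 0.5 * startup_time + 0.001 * max_memory_mb
end

benchmark_score([35.2, 36.0, 35.8, 36.3, 35.5], 12.1, 850.0)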

If five users doesn't sound like a lot, remember that we're processing the requests as fast as possible. While I'm using the word "user" for these threads, they're more of a measure of total concurrency than a realistic number of requests for a few users. Given the other servers running on the instance (Redis, Postgres, app servers) this means that 8 vCPU cores should be reasonably loaded...

A larger number of users and Rails threads could saturate more vCPUs. But I don't think I can design a representative benchmark for diverse Rails workloads at that scale. At that many users, the engineering teams should be optimizing significantly for the specific workload of that specific application, rather than running a "typical" Rails application. Above a certain number of simultaneous users, there's no such thing as "typical."

Why Not the Current Discourse Benchmark?

Discourse includes a simple benchmark. It checks only a few specific URLs, one at a time.

There's nothing wrong with it, but it's not what we want to optimize.

Is This What the Whole Ruby Team Uses?

No. This is an early draft. I'm hoping for feedback on my decisions and my code. With luck, this benchmark (or one like it) will become the semi-official Rails benchmark of the Ruby team, just as optcarrot is their most common CPU benchmark.

Right now, there's no Rails benchmark that fits that description.

As Matt Gaudet's talk suggests, it would be great to have different benchmarks for different "typical" workloads. This is just an attempt at one of them.

Source Code?

It's not finished, but yes.

The link above is for a just-about-adequate implementation of what I describe here. My intent is to start from this and improve the code and assumptions as I go. You can comment or submit a Pull Request if you'd like to join the debate. I'd love to hear what you think.

Diving into Japanese for Ruby

I only started studying Japanese recently... But these days, I'm working on the Ruby language, and Japanese turns out to be a big help there. You can find several gaijin learning Japanese to work on Ruby, it turns out.

If you need to pick up Japanese in a massive hurry and you want advice from somebody a few months farther along than you are, keep reading.

First, Get Google Translate

The simplest way to handle Japanese is to let the machine do it. There's a great Chrome extension for Google Translate, and failing that you can use the website manually (ew.) Some of Translate's output won't make any damn sense. That is normal and expected. But it's still mostly better than nothing.

To get an idea of how useful (and not-useful) it is, I recommend translating this wonderful article on profiling by Yui Naruse. The article isn't available in English, it kinda-sorta works with Google Translate, and you can just about limp through it without knowing any actual Japanese.

So: Google Translate is an okay step one, and really easy. Do that first.

 

Second, Immerse Yourself

You'll want to be surrounded by Japanese to pick it up. Better, surround yourself with Japanese Ruby stuff.

Google for Ruby stuff and click Japanese links. Read bugs. I'm told there's good Japanese Ruby IRC but I can't seem to find it.

For more, follow Twitter accounts. I've had good luck following Matz, Koichi Sasada and Nalsh. You might also like Yusuke Endoh, Shibata Hiroshi or Akira Matsuda. There are others, and they tend to retweet each other, so feel free to pick up new Japanese developers as you go along.

Twitter offers a second Bing-based translation for each Tweet. Sometimes it's better than Google Translate, sometimes worse, sometimes both will make zero sense. That's how it goes. You can also use Jisho and Rikaikun to translate specific words.

You will, however, start picking up little figures of speech like Koichi's "うーん" (sort of an "uh..." or "ugh..." suggesting that he's tired.) Which is what you're there for - to see Japanese words used by Japanese developers doing Ruby things.

Plus you can start picking apart Hiragana from Katakana and learning common borrowed-from-English words that are used in Japanese. Speaking of which...

Third, Learn the Alphabet(s)

However you start formally learning Japanese, they're likely to start teaching you Hiragana and Katakana, the two easier Japanese alphabets... Or syllabaries. Or whatever you want to call them. Kanji's the really hard one, and you won't learn it immediately. But Google Translate plus Hiragana and Katakana will take you farther than you'd think.

But they'll also teach you to start picking apart sentences into Hiragana, Katakana and Kanji, which lets you start teasing out and learning interesting words, such as "ツール" (tool).

This takes time, repetition and memorization. Which is why you're self-immersing on Twitter and stuff, right?

Also, practice with an app. I like Kana Mind, but there are many.

Fourth, Keep Learning

At some point these stop being "learn Japanese in a massive hurry" tips. Instead, they're just "learn Japanese" tips.

Get a teacher? Yes, it helps a lot.

Books? Sure. I like Japanese the Manga Way and your teacher probably wants you to buy Minna No Nihongo and its English translation (please do.) But do what works for you.

Study buddies? If you can, definitely. Good study partners can be hard to find. But they're really useful!

In general, yes, keep going.

Finally...

Thanks to Al Tobey, who helped me with a lot of my initial jump into Japanese. Several of these recommendations were from him first.

And thanks to the ever-friendly Ruby folks, Japanese-speaking and English-speaking alike.

Since you kept reading this far (thanks!), here's a list of Japanese terms from the Ruby bug tracker. I got them from a Ruby developer on Twitter.

まってる -> feedback
まだ -> open
まつ -> matz
なおした -> closed
なおってた -> closed
なおさない -> rejected

 

Tales from an Internship: Solving a Race Condition in Twilio Usage

by Ky-Cuong Huynh.

Overview

Interns at AppFolio are entrusted to contribute to real production work. After
2 weeks of training on AppFolio's processes and tools, I've had the awesome
privilege of shipping code to production alongside my team every week since.
Along the way, I learned so much under the mentorship of fantastic engineers.
This led to the (technical) highlight of my summer: solving a race condition in
our Twilio usage and saving the company a significant amount of money every
year.

To understand how that race condition arose, here's a quick summary of our
multi-app architecture [0]: our production boxes run Passenger, a fast and
powerful app server. It achieves its speed by coordinating a pool of worker
processes to handle incoming requests. Then, for simplicity, we can think of
each of AppFolio's customers as having their own box that serves up the
AppFolio Property Manager (APM) they know and love.

When APM needs to communicate with Twilio's API, we have an internal app called
Messenger (also running on that same box). It accepts message-send requests
from other apps and acts as a middleman for Twilio, encapsulating all the logic
required. Multiple instances of Messenger may be running on a given box, one
per Passenger worker. As message-send requests arrive from APM, they may be
assigned to different workers. Now comes the race condition. If a customer has
sent too many messages in a 24-hour period, then Twilio recommends purchasing
another number to use. This prevents the first number from being marked as a
spam-sending one by carriers.

Unfortunately, more than one Passenger worker may believe it is necessary to buy
another phone number from Twilio, even if only one more number overall was
required. Thus, a customer may end up with more numbers than needed. That
introduces an unnecessary recurring expense for AppFolio, one that increases as
we have more customers.

Now, this was a known race condition. The team that built Messenger added a
lock around the provisioning code (our critical section/region). Alas,
Messenger instances can enter/exit the region one after the other. Even if only
one worker process executes the code at any point in time, we still end up with
multiple executions of the provisioning code.

Our solution: check if provisioning is actually necessary while inside of the
critical region. In other words, we added a check that asks, "Do I actually
need another phone number, or has enough sending capacity already been added?".
In the end, we spent far more time reasoning carefully than we did coding,
which is a common occurrence for complex real world problems.

Background

The bulk of my summer was spent working with my team on bulk texting
functionality. Our work empowered property managers to easily text whole units
or properties at once. Along the way, carrier limitations meant that we had
extended our messaging architecture to potentially map many phone numbers to
each customer. Towards the end, I learned of the interesting race condition
in question and began analyzing it. But first, some context.

Twilio's Quotas

At AppFolio, we use Twilio to send text messages.
Twilio recommends a rolling quota of 250 messages per 24-hour period. "Rolling" means that old messages no longer count against the quota once 24 hours have elapsed from the time they were sent. We can think of this as a conveyor belt of messages, each of which returns to us for use after 24 hours. Note also that this is a soft limit, meaning that Twilio doesn't enforce it strictly. Rather, if we send more than 250 messages in a 24-hour period from a single phone number, that number risks being marked as a spam number by carriers. If a number is marked as a spam number, then its messages may be silently filtered out from ever reaching their recipients.

There's one more wrinkle. When traffic is seen as "conversational" by carriers, there is a higher threshold for blacklisting numbers as spam senders. The exact thresholds vary by carrier, but Twilio recommends 1,000 messages per (rolling) 24-hour period when a communication is more 2-way. To be safe, our internal limits model this conversational threshold as a +1 to the quota for every message received.
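
To make that concrete, here's a rough sketch of the quota math described above -- this is not Messenger's actual code, and the variable names are hypothetical:

BASE_QUOTA = 250        # Twilio's recommended rolling limit per number
WINDOW = 24 * 60 * 60   # rolling 24-hour window, in seconds

# `sent_times` and `received_times` are hypothetical arrays of Time objects
# for the messages sent and received on one phone number.
def effective_quota(received_times, now = Time.now)
  # +1 of capacity for every message received inside the window
  BASE_QUOTA + received_times.count { |t| t > now - WINDOW }
end

def under_limit?(sent_times, received_times, now = Time.now)
  sent_times.count { |t| t > now - WINDOW } < effective_quota(received_times, now)
end

# Example: 200 sent and 10 received in the last 24 hours leaves room to send more.
now = Time.now
puts under_limit?(Array.new(200) { now - 3600 }, Array.new(10) { now - 1800 })  # => true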

Sending Messages Using Messenger

Messenger is essentially our internal middleman for Twilio. Every app that needs to send text messages can do so by sending a request to Messenger, which will handle the back-and-forth with Twilio. Messenger itself presents a set of JSON APIs that are message-oriented. It's designed to take in requests for sending a single message. It has no concept of groups/batches of messages
(which is something we considered adding to speed up sending bulk texts).

Here's an overview diagram:

apm_bundle messaging architecture

The Race Condition

To make things more concrete, we'll walk through the code flow for sending a bulk text.

In Property, our TenantBulkSmsService is invoked to create a background job for sending requests to Messenger. Messenger receives the request from Property and sends the message using TwilioInterface::Sender.send_message:

class << self
  def send_message(body:, to_phone_number:, ignore_sending_limit: false)
    # Check customer quota and message length
    raise DailyLimitExceeded, # ... 
    raise BodyTooLong, # ...

    # RACE CONDITION entrance: we potentially provision as part of the
    # Twilio configuration
    config = verify_and_load_twilio_config

    # Send message with the proper parameters
    prepared_message = TwilioInterface::PreparedMessage.new(body: body, to_phone_number: to_phone_number)
    response = config.twilio_subaccount_client.send_message(
      # From:, To:, Body:, etc.
      # ...
    )

    # If here, successful message send
    TwilioInterface::Sender::SuccessfulResponse.new(twilio_response: response)
    # Else, Twilio error handling
    rescue Twilio::REST::ServerError
    # ...

That verify_and_load_twilio_config method leads to the retrieve method here:

module TwilioInterface
  class PhoneNumberManager
    class << self
      def retrieve
        # RACE CONDITION: more than one Messenger instance
        # may find this condition to be true and proceed to provisioning logic
        if need_provision?
          TwilioInterface::PhoneNumberProvisioner.provision!
        end
      end

      # Other class methods
      # ...

That need_provision? check is further down in the same file:

def need_provision?
  if Experiments.on?(:texting_quota)
    (TwilioInterface::DailyLimitChecker.current_amount_under_limit < 1) && !reach_max_capacity?
  else
    !provisioned?
  end
end

The race condition occurs here: more than one Passenger worker process (running Messenger) may
enter this method and find TwilioInterface::DailyLimitChecker.current_amount_under_limit to be < 1. Then, they would all proceed to TwilioInterface::PhoneNumberProvisioner.provision!:

def provision!(purchase_filter = {})
  create_new_number(purchase_filter)
end

So we go to create_new_number:

def create_new_number(purchase_filter = {})
  # RACE CONDITION consequence: multiple Passenger workers
  # may enter this critical region one at a time and each
  # purchase an (soon unnecessary) number
  lock = Lock.first
  lock.with_lock do
    unless PhoneNumberManager.reach_max_capacity?
      # Provision new phone number
      # ...
    end
  end

  rescue Twilio::REST::RequestError => e
    raise TwilioNumberNotPurchasable, e
end

This is where the phone number purchase actually occurs. Notice that the
purchasing logic is protected by a lock. This is a leftover from when a
previous team originally implemented this code and noticed the race condition.
They solved it for the single phone-number case, but now we had to generalize
to multiple phone numbers. At this point, my mentors and I understood the
race condition and were ready to solve it.

Aside: A Database-based Lock

You may have wondered why we're not using the Mutex class built into Ruby's standard library. In fact, if we dig into that Lock implementation, it's just using a single row in a database table. So why are we using the database to implement a lock?

Here's why: our issue is not coordinating multiple threads on a single machine (or guarding against interrupts within a single thread), but rather coordinating multiple processes, potentially across multiple machines. This is because customers may have more than one physical box running Passenger (though this race condition can still occur in the multi-process, single-box case). Any mutex we create would only exist in the memory of one of those boxes.

What stops this from becoming a full-blown distributed systems problem is the fact that everything in a box shares a single database. We can leverage that shared database to coordinate/synchronize access to this piece of code. This works because MySQL takes its own locks as part of achieving its ACID properties, and the biggest users of those locks are transactions.

To make use of all this, the Lock#with_lock [2] method simply wraps the passed-in block code in a transaction. To test this unusual lock, we opened two Rails consoles and had one try to enter the critical region after another had already done so. We found that the lock did indeed prevent access, at least until it was released. Note, however, that ActiveRecord appears to have a default timeout for how long to wait to acquire a lock.
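
If you're curious what such a lock can look like, here's a minimal sketch, assuming Lock is an ordinary ActiveRecord model backed by a one-row table -- this is the general shape, not the production class:

require "active_record"

# Assume a `locks` table that always contains exactly one row.
class Lock < ActiveRecord::Base
end

# ActiveRecord's #with_lock opens a transaction and re-reads the row with
# SELECT ... FOR UPDATE, so any other process sharing this database -- even
# one on another box -- blocks here until the transaction commits.
Lock.first.with_lock do
  # critical region: only one process at a time runs this
end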

The Fix

The fix for this race condition is surprisingly simple. At its core, the issue is that Messenger processes were checking how many phone numbers a customer had, rather than how much sending capacity had already been added. So, we simply add this check to the critical section:

def create_new_number(purchase_filter = {})
  lock = Lock.first
  lock.with_lock do
    if TwilioInterface::PhoneNumberManager.need_provision?
      # Provision
      # ...

And recall that that check looked like this:

def need_provision?
  if Experiments.on?(:texting_quota)
    (TwilioInterface::DailyLimitChecker.current_amount_under_limit < 1) && !reach_max_capacity?
  else
    !provisioned?
  end
end

Now, a natural question is "Why wouldn't we simply add the database-based lock around the first need_provision? check?"

Answer: because that method is called for every single message to be sent, whereas actually provisioning more numbers only happens on occasion. Whenever locks are used, we're defining a critical region of code that can only be accessed by one process at a time. This makes a tradeoff of safety vs. speed. Putting a lock around the first need_provision? overly widens our critical region and reduces message throughput.

Verifying the Fix

To check the fix, we'll want to modify some of our method calls to simplify testing. To start, we'll stub the provisioning logic inside the lock so that it tells us what occurred:

def create_new_number(purchase_filter = {})
  lock = Lock.first
  lock.with_lock do
    if TwilioInterface::PhoneNumberManager.need_provision?
      PhoneNumber.create!(
          twilio_subaccount_id: 7,
          sid:                  "#{Random.rand(100)}",
          number:               "+#{Random.rand(100)}",
          friendly_name:        "+#{Random.rand(100)}"
      )
      puts "I provisioned a new number!"
    else
      puts "I did not provision a new number"
    end
  end

Next, we modified the database and calls so that the first number is exhausted. Note that test messages must be dated for today because of the 24-hour rolling quota!

And when we run:

TwilioInterface::PhoneNumberProvisioner::provision!

Then, the first console says, "I provisioned a new number!", but the second console to execute this gives, "I did not provision a new number". Race condition resolved! We were also able to devise a test case to guard against regressions. That test now runs as part of every continuous integration (CI)
build.

If anyone has any questions, I can be reached on Twitter or via email at kycuonghuynh@ucla.edu.

Appreciations

Thank you to Yunlei and Magne for collaborating with me on this.
You were both amazing mentors and truly made my summer. I've grown so
much under your careful guidance and I can't wait to keep learning and building.

Thank you to Flavio for regression testing this fix. AppFolio cares deeply about the quality delivered to its customers, and you serve as a strong advocate for them.

Thank you to Jingyu and Chris (authors of the original Messenger code) for encouraging me to pursue a fix and then helping me understand Messenger.

Finally, thank you to Andrew for letting me write about this, as well as editing this post and recruiting me to AppFolio.

Footnotes

[0] A review of the concurrency material from OSTEP, a free operating systems textbook, can also be helpful.

[1] Using a code texted to a device as part of login is really using 2SV,
second-step verification, since the code is still a piece of knowledge.
This belongs to the same class of factors as passwords ("something you know"). A Yubikey or biometrics requirement would constitute a true second factor ("something you have" and "something you are", respectively).

[2] It is Ruby convention to refer to instance methods in documentation via ClassName#instance_method.

Internships at AppFolio

At AppFolio we take pride in having a high-quality internship program. This year, at the end of the summer, two interns chose to write up their thoughts on the summer they spent with us.

Christopher Chen, from Berkeley, had the following to say about our software engineering internship:

When I first started at Appfolio, I had two prior technical internships, the first at a very large company, and the second at a small company. Neither of them was what I'd call inspiring, and the first actually made me question majoring in computer science. Starting in October of my junior year, I applied to many of the large, well-known, Bay Area tech companies with mixed responses. Appfolio was an afterthought at the time since I'd never heard of the company before. Weighing the offers of multiple companies, I eventually chose to go to Appfolio due to its strong employee satisfaction on Glassdoor.

Soon it was mid-May, the end of my second semester as a junior, and the beginning of my internship. I was unsure of what to expect - my friends were interning at large Bay Area tech companies, meanwhile what well-known tech companies are located in Santa Barbara? (surprisingly many, as it turns out) 

My concerns were put to rest on the first day; interns were given Macbook Pros to work with, in a modern office setting with desks that could go from standing to sitting height at the touch of a button. Snacks were a-plenty and it was overall a great work environment. 

I began with a 2-week training session in Ruby on Rails, building enough technical knowledge to begin working in a team - I would continue to learn as I developed new features. I also participated in team decision-making and it struck me how seriously I was being taken, even as a fresh intern. The team welcomed my ideas and almost immediately I was accepted into the team, not as an intern, but as an engineer. 

As I type this sitting at my desk on my final day after 3 months, I reflect on how different this internship was from my previous one. At one of my previous internships, my mentor was a man who was well-known in his field, regularly taught graduate classes, had many patents under his name, and was otherwise very respected. However, I was lucky to get an hour per week from him. Here at Appfolio, Natalie, my very bright and hardworking mentor, was always willing to spend time with me. Under her guidance, and that of the entire team’s, I built many customer-facing features, those that would later bring about highly positive customer feedback. Through being productive, I also learned more than I did at any other internship. Beyond technical skills (like learning an entirely new language and web development stack), I also learned and refined social skills like working effectively in a team setting and buying in on the company culture.

The main difference, and the best one, was how much trust I was given. I hit the ground running with a non-trivial project (unlike those of my previous internships) that was immediately utilized in the live release. I was able to directly speak to customers about features they want and needs they need fulfilled. What I was able to get done was fulfilling for the customers and for me - it was my first experience as a “real” engineer in a software company, not the intern whose time was worth little.

 

Sophia Zheng, also from Berkeley, had this to say about our product management internship:

Adjusting to a new workplace can be hard for anyone, but AppFolio has a way of seamlessly incorporating interns into the company. During my first couple weeks at AppFolio, I attended several sessions that taught me the skills I needed to be an effective employee. 

From there, I was put on two teams that were working on features that would go live to all customers, and I think this is what distinguishes an AppFolio internship from other internships. Instead of putting all the interns on an intern project that may or may not be used in the end, interns are given the same responsibilities as regular employees. As a product intern, I was given my own feature to product manage--I can’t say that’s common for many other product management internships. I got my own product management experience, from working with developers and user experience to contacting customers for research. During my internship, we pushed out two major stories for the feature I was in charge of to all customers. 

I was also given two other projects: I got my own personal research project and I was put on a market validation team. During market validation meetings, my thoughts were acknowledged and taken into consideration, and the results of my research were taken seriously. 

AppFolio truly does an amazing job at making sure interns do work that matters to the company. I will say this though: if you take an internship at AppFolio, also be prepared to play a lot of foosball. 

ABProf: An Accurate Statistical Profiling Harness

ABProf is a tool to compare which of two programs is faster. It can prove that a particular optimization worked -- and proving that sort of thing is my job these days.

We talked about the easy-to-use version of ABProf recently. Let's talk about the more powerful and accurate version now. It's a bit harder to use, but often worth it.

To follow this post, first install abprof: "gem install abprof" will generally do it.

A Simple Example

Let's look at the simple version of the harness code. The below example sets up a block for sampling -- ABProf will run it repeatedly to check it for speed.

#!/usr/bin/env ruby

require "abprof"

puts "ABProf example: sleep 0.001 seconds"

ABProf::ABWorker.iteration do
  sleep 0.001
end

ABProf::ABWorker.start

This is written in ABProf's worker DSL (Domain-Specific Language), which is different from its benchmark DSL.

This DSL is used to set up a worker process for ABProf. Normally ABProf will run two workers and use them to check which of two processes is faster, unless you run it "bare" and supply a block of Ruby. Bare, it will run the block from ABProf's own process. Running in a single process is fast but limits you to a single version of Ruby in-process (and a few other things.)

So: the benchmark DSL sets up a test between two workers (command lines, number of trials to run, desired accuracy, etc.), or you can just specify all that from the command line. The worker DSL sets up a single worker, or you can skip it and get worse accuracy when measuring.

Why worse accuracy? Because measuring just one specific action is better than measuring that action plus the startup of a whole new process. The most accurate measurement is the one that does nearly nothing except measuring. Also, just running what you need instead of spinning up a new process is a bit faster. So without harness code, abprof takes longer to converge, and you wait longer for your benchmarking.
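
You can see the size of the problem with a quick, unscientific check of your own -- roughly something like this (the exact numbers will vary a lot by machine):

require "benchmark"

# Spawning a whole new Ruby process, which is what happens without a harness...
startup    = Benchmark.realtime { system("ruby", "-e", "") }

# ...versus doing a small chunk of work inside the current process.
in_process = Benchmark.realtime { 100_000.times { |i| i * i } }

puts "fresh process: #{startup}s, in-process work: #{in_process}s"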

If you wanted to do an A/A test of sleep.rb against itself, you'd run abprof this way:

abprof examples/sleep.rb examples/sleep.rb

You can also pick how many trials to run, how many times to run the block per trial and many other options. Run "abprof --help" for the full list and default values. Here's an example:

abprof --pvalue=0.001 --min-trials=10 examples/sleep.rb examples/for_loop_10k.rb

By default, abprof runs your code block or your program a fair number of times -- 10 before it starts measuring, 10 per trial run. For something very slow (e.g. 5 minutes per run) you'll probably want fewer, while for something fast like a tight "for" loop you'll probably want many more. 10 times isn't many when you're benchmarking small, tight code.

The two commands, abprof and abcompare, are very similar. If you run abprof in "bare" mode with two command-line programs, you're getting the same thing as abcompare.

A More In-Depth Example

The example above is pretty simple - the block runs a single iteration. We can get better accuracy by running several at once, or by having the code measure in a specific way. ABProf supports both of these modes by changing the method used to supply the block. In the above, that method is called "iteration" and it times very simply. If you call the method "iteration_with_return_value" and return a measurement, you can time it yourself.

Here's an example:

#!/usr/bin/env ruby

require "abprof"

puts "ABProf example: sleep 0.01 seconds"

ABProf::ABWorker.iteration_with_return_value do
  t1 = Time.now   # Could also be a counter
  sleep 0.01
  Time.now - t1   # Return the measurement
end

ABProf::ABWorker.start

This can be useful if you need to do setup or teardown for every iteration, or if you want to measure a counter rather than time, or if you have some kind of custom timer you'd rather use instead of the built-in Time.now timing that ABProf does by default.

Be careful - ABProf will tell you it's returning the lower command. When the measurements are time, lower means faster. But if you return a counter or other higher-is-faster performance measurement, ABProf will consistently return the lower result, which may not be what you expect. That's why ABProf keeps printing "faster?" with a question mark to the terminal. If you return something other than a time, lower may not mean faster.

Randomly: don't use an explicit "return" in the block. It's a block, not a lambda, so "return" won't do what you want. Just make the measurement the block's final expression, as in the example above.
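
For instance, here's a hypothetical worker that returns a counter instead of a time -- remember from above that ABProf will still report whichever result is lower:

#!/usr/bin/env ruby

require "abprof"

puts "ABProf example: iterations completed in a fixed time slice"

ABProf::ABWorker.iteration_with_return_value do
  # Hypothetical workload: count how many loop iterations fit into 10ms.
  count = 0
  deadline = Time.now + 0.01
  count += 1 while Time.now < deadline
  count   # A counter, not a time -- higher is better here, so "lower" means the worse run
end

ABProf::ABWorker.start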

Conclusion

So: want better accuracy than with abcompare? Want to take out process startup time as a factor? Use abprof.

Want to do your own timing, or otherwise control it more? Use the "iteration_with_return_value" method instead of just "iteration", and return a value. Also, use abprof.

Want to just quickly, roughly see if something is a lot faster? You can keep using abcompare. But it'll take longer to be sure, so if you're going to keep doing it, use abprof.

 

ABProf: Profiling Ruby Like an A/B Tester

Lately I've been trying stuff out with Ruby configurations -- inlining functions when compiling Ruby, for instance -- and trying to tell whether the changes made anything faster. That's harder than it sounds -- the standard benchmarks for Ruby are kind of noisy. They give slightly different answers from run to run. It's a pretty common problem in benchmarking and profiling. How can I tell if I'm actually making things consistently faster or just fooling myself?

If I take a lot of samples, then maybe the dark power of statistics could tell us if one way was consistently faster?

Turns out: YES! In fact, I'm not even the first person to do this with optcarrot, though they use the tests for something slightly different.

I built a simple tool, called ABProf, to do it. Let's talk about that.

The Basics

We want to run optcarrot with two different Rubies and see which is faster, with what likelihood, and by how much. But we don't need a Ruby-specific tool to do that -- the tool can just run two benchmarks without knowing what they are and do a statistical test. So that's what ABProf does.

Specifically, it installs a command called "abcompare" to do that.

The abcompare command is simple. Give it two command lines:

abcompare "sleep 0.01" "sleep 0.0095"

And it'll take some time, then tell you that, yes, one is faster than the other. It'll even tell you how much faster one is than the other, roughly. The output will look something like this:

Trial 3, Welch's T-test p value: 0.0032549239331043367
Based on measured P value 0.0032549239331043367, we believe there is a speed difference.
As of end of run, p value is 0.0032549239331043367. Now run more times to check, or with lower p.
Lower (faster?) process is 2, command line: "sleep 0.0095"
Lower command is (very) roughly 1.0445033124951848 times lower (faster?) -- assuming linear sampling.
Process 1 mean result: 1.3370193333333333
Process 1 median result: 1.342169
Process 2 mean result: 1.2836166666666664
Process 2 median result: 1.284983

Yup. And with one command being around 5% faster, even a very quick, rough test like this says it's around 4.4% faster. Not too shabby. Decrease the P value or increase the trials or iterations per trial for more exact results.

exe/abcompare --pvalue 0.001 --iters-per-trial=10 --min-trials=10 "sleep 0.01" "sleep 0.0095"

You'll get less exactness if you pick two things that are very close, where starting the process takes most of the time.

abcompare "echo bob" "echo bologna sandwich with mayonnaise"

The above lines compare equal, because echo is so fast (compared to the time to start the process) you just can't see the difference. There's a more exact version of abcompare in development to help with this.
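
By the way, the "Welch's T-test" in that output is the statistical test doing the heavy lifting. Here's a rough sketch of the statistic it computes, in plain Ruby -- this is not ABProf's actual code, and the real test also converts the statistic into the p value you see above:

def mean(xs)
  xs.inject(:+).to_f / xs.size
end

def variance(xs)
  m = mean(xs)
  xs.inject(0.0) { |sum, x| sum + (x - m)**2 } / (xs.size - 1)
end

# Welch's t statistic: the difference of the means, scaled by the (possibly
# unequal) variances of the two samples.
def welch_t(a, b)
  (mean(a) - mean(b)) / Math.sqrt(variance(a) / a.size + variance(b) / b.size)
end

times_a = [1.33, 1.35, 1.34, 1.32]   # made-up timings for command 1
times_b = [1.28, 1.29, 1.27, 1.28]   # made-up timings for command 2
puts welch_t(times_a, times_b)       # a large absolute value suggests a real speed difference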

The Ruby-Tinted Truth

Want to check if one Ruby or another is faster when running optcarrot, a subject near and dear to my heart? You'll need to cd into the appropriate directory and use "runruby" to make sure you get the right environment. Also, you'll probably want to see the abcompare output as it goes, so your command shouldn't print to the console. And you need to keep the total number of runs sane -- if every test takes a few seconds, running many hundreds of them takes quite a long time.

Also, here's a way to make optcarrot profile better: its benchmark mode is the same as "--headless --print-fps --print-video-checksum --frames 180", so you can use those flags directly, increase the number of frames, and avoid paying the process startup overhead so many times. Nifty!

With all that taken care of, here's how to do it with two built Rubies in "vanilla_ruby" and "inline_ruby_2500".

abcompare "cd ../vanilla_ruby && ./tool/runruby.rb ../optcarrot/bin/optcarrot --headless --print-fps --print-video-checksum --frames 1800 ../optcarrot/examples/Lan_Master.nes >> /dev/null" "cd ../inline_ruby_2500 && ./tool/runruby.rb ../optcarrot/bin/optcarrot --headless --print-fps --print-video-checksum --frames 1800 ../optcarrot/examples/Lan_Master.nes >> /dev/null" --iters-per-trial=1 --min-trials=10 --pvalue=0.001 --burnin=2

This is also a great example of using abcompare with a nontrivial program. Consider putting this long command into a shellscript to run repeatedly :-)

You can get better measurement with the more advanced ABProf version in development, the one that's a bit harder to use... This one is counting startup time as part of the total, so you'll see 5% improvement instead of the 7% improvement mentioned in a previous post... Don't worry, there will be another post on the more exact version of ABProf!

More Better Awesomer Later

ABProf can use a harness of Ruby code, kind of like how Ruby-Prof does it. That's to let it run without having to start a new process on every iteration. Process startup time is slow... and worse, it varies a lot, which can make it hard to tell whether it's your benchmark or your OS that's slow. When you can, it's better to just run your benchmark in a single, longer-lived process.

At the very least you should have the choice to do it that way. The same command can also handle a fun DSL to configure your various tests in more complicated ways...

But for right now, abcompare will do what you want, and do it pretty well. It just takes a few extra iterations to make it happen.

Look for another blog post on how to do this very soon.

What's My Takeaway?

Want to prove you made something faster? Use abcompare. To prove you have the amount of speed-up right, increase the number of iterations per trial, or increase the number of minimum trials, or decrease the P value.

Want to check which of two Rubies is faster? Build both and use abcompare.

Need to prove that something already pretty darn fast is now faster? Do the operation many times per trial or wait for the newer, spiffier ABProf to bring you a way to prove it.

 

Speed Up Ruby By 7-10% Or More

These days, my job is to speed up Ruby (thanks, AppFolio!) One way to do that is with careful configuration options when building the Ruby source. Let's talk about a nice recent discovery which may help you out. It works if you use CLang -- as Mac OS does by default, and Linux easily can. CLang is intended as a fully-compatible, LLVM-based successor to GCC; it's most of the way there, and it ships by default on the Mac. It's available pretty much everywhere. In addition to MacOS, we'll also talk about this speedup on CentOS, a popular OS to deploy Ruby on top of.

Function Inlining

By default, GCC will inline functions on the higher optimization levels. That is, for short, simple functions, it will just substitute the whole function body where you called the function. That lets you skip a function call at runtime. But more important, it lets the compiler skip parts of the function that aren't used, special-case for the parameters that get passed in and otherwise speed up the code.

The down-side is that the code gets bigger - if you put in the full function definition at each call site, things get big. So there's a balancing act for how big a function you inline -- if you inline huge functions it doesn't buy you much speed and the code gets huge. If you never inline anything, the code is as small as it can be, but also slow.

CLang has an easy way to tune what size function you inline. It has an internal metric of how big/complex a function is, and a threshold (default 225) for where it inlines. But you can turn that threshold way up and inline much bigger functions if you want to.

The result is a bit hard to measure. I'll write another blog post on the stats-based profiler I wrote to convince myself that it does speed things up noticeably. But I'm now convinced - with higher inlining settings, you can make a Ruby that's faster than vanilla Ruby, which uses -O3 and the default inlining threshold of 225.

How Big? How Fast? (Mac Edition)

I'm using a standard Ruby benchmark called optcarrot to measure speed here as I tend to do. I'm adding inlining options to Ruby's defaults for optimization ("export optflags='-O3 -fno-fast-math -mllvm -inline-threshold=5000'"). Each of the measurements below used hundreds of trials, alternating between the two implementations, because benchmarks can be noisy :-/

Still, how much faster?

  1. cflags="-mllvm -inline-threshold=1000": about 5% faster?
  2. cflags="-mllvm -inline-threshold=1800": 5% faster
  3. cflags="-mllvm -inline-threshold=2500": 6% faster
  4. cflags="-mllvm -inline-threshold=5000": 7% faster

Those Rubies are also bigger, though. Plain Ruby on my Mac is about 3MB. With an inline threshold of 1000, it's about 4.8MB. At a threshold of 2500, it's about 5.5MB. With a threshold of 5000, it's about 6MB. This doesn't change the size of most Ruby processes, but it does (once per machine) take that extra chunk of RAM -- a 6MB Ruby binary will consume 3MB of additional RAM, over what a 3MB Ruby binary would have taken.

Note: these may look linear-ish. Really they're just very unpredictable, because it depends which functions get inlined, how often they're called, how well they optimize... Please take the "X% faster" above as being noisy and hard to verify, even with many trials. The 1000-threshold and 1800-threshold are especially hard to tell apart, says my statistical profiling tool. Even 1000 and 2500 aren't easy. Also: measured on a recent MacOS with a recent Ruby 2.4.0 prerelease version.

How Big? How Fast? (CentOS 6.6 Edition)

I didn't do the full breakdown for CentOS, but... It's actually more of a difference. If I measure the difference between -inline-threshold=5000 and default CLang (with -O3 and -fno-fast-math, the defaults, which uses 225 as the inline threshold), I measure about 14% faster.

But that's cheating - the version of CLang available here (3.4.2) is slower than the default GCC on CentOS 6.6. So how about measuring it versus GCC?

Still about 9% faster than GCC.

The CentOS binaries are a bit bigger than the MacOS ones to start with -- closer to 8MB than 3MB, with debugging info. So the same rough doubling in size leaves you with about 18MB instead of the Mac's 6MB. Beware if you're tight on RAM. But the speedup appears to hold for CentOS, too.

(Measured on recently-updated CentOS 6.6, a Ruby 2.4.0 prerelease, CLang 3.4.2 vs GCC 4.4.7. YMMV, for all sorts of reasons.)

Disclaimers

So: can you just make Ruby 10% faster? Only if you're CPU-bound. Which you probably aren't.

I'm measuring with optcarrot, which is intentionally all about CPU calculations in Ruby. "Normal" Ruby apps, like Rails apps, tend to spend their time waiting on the OS, or the database, or the network, or...

Nothing you can do to Ruby will speed up the OS, the database or the network by much.

But if you're waiting on Ruby, this can help.

Did Somebody Do It For Me?

Wondering if this got picked up in mainline Ruby, or otherwise made moot? You can check the flags that your current Ruby was compiled with easily:

ruby -r rbconfig -e "puts RbConfig::CONFIG['CFLAGS']"

You're looking for flags like "-O3" (optimization) and "-mllvm -inline-threshold=5000" (to tune inlining.) You can do the same thing with ./tool/runruby inside a built Ruby directory. If you see the inline-threshold flag, somebody has already compiled this into your Ruby.
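
If you'd rather have Ruby give you a direct yes or no, a tiny variation on that one-liner works too (this is just a convenience wrapper, not an official check):

ruby -r rbconfig -e 'puts(RbConfig::CONFIG["CFLAGS"].include?("inline-threshold") ? "inlining tuned" : "default inlining")'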

Conclusion

This option is a great deal if you're on a desktop Mac -- 3MB of RAM for a 7% faster Ruby -- but questionable on a tiny Linux virtual host -- 8MB for 10% faster, but you're probably not CPU-bound.

It is, however, only available with CLang, not with vanilla GCC.

I've opened a bug with a patch to make this behavior default with CLang. Let's see if the Ruby team goes for it! And even if they don't, perhaps ruby-build or RVM will take up the torch. On a modern Mac, this seems like a really useful thing to add.

How do you add it? You need to reconfigure Ruby when building it -- set the inlining flags shown above (the optflags line) before running configure. The post on configuring the Ruby source code covers how those flags get passed in.

Using GPerfTools on MRI Ruby in Mac OS X

There are a lot of great profilers for code that runs in Ruby. There are... fewer really good profilers to work on a standard generated-by-GCC binary, like the Ruby interpreter itself. GPerfTools and Instruments are really your choices, with an honorable mention for DTrace - it isn't a profiler, but it's so flexible that it can do a lot of profiler-like things, including for Ruby.

Today, let's talk about getting GPerfTools working with Ruby - specifically, with "vanilla" Ruby, often called Matz's Ruby Interpreter, or "MRI."

First, download the Ruby source. These days, that can be as simple as git-cloning its GitHub repository. Then you can "autoconf && ./configure && make", which can be enough to get you up and working -- if you don't have OpenSSL problems, anyway.

You'll need GPerfTools, of course. Assuming you use Homebrew, you can install a few useful dependencies like this:

brew install gperftools
brew install graphviz # you'll see why later

You'll also need to build Ruby with the GPerfTools profiling library, and with Address Space Layout Randomization turned off (Mac OS calls its ASLR implementation Position Independent Executables, or PIE.) And it wouldn't be bad to tell it to always use a frame pointer -- marginally slower, but easier to debug at times. That probably looks like this:

export LIBS="-lprofiler"
export cflags="-fno-omit-frame-pointer"
export LDFLAGS="-Wl,-no_pie"
./configure && make clean && make && make check
unset LIBS
unset cflags
unset LDFLAGS

You can also pass -pg in cflags, but that's likely to slow things down a fair bit. (Why those lines? See our post on configuring the Ruby source code for details.)

You might need other things, such as "export PKG_CONFIG_PATH=/usr/local/opt/openssl/lib/pkgconfig:/usr/local/lib/pkgconfig" to configure OpenSSL -- add what you need for your setup as well, naturally.

And now your Ruby should be compatible with GPerfTools. Great! That was a nice short blog post, and now it's cocoa and schnapps all around!

But What Do We Use It For?

We'd like to check against a nontrivial benchmark - right now the core team often uses a toy Nintendo emulator called optcarrot for benchmarking. So let's set that up.

cd .. # Back up one directory from your Ruby source
git clone git@github.com:mame/optcarrot.git
cd optcarrot
ruby bin/optcarrot --benchmark examples/Lan_Master.nes

I see output like this when I run it:

fps: 34.313312258259884
checksum: 59662

So: now we have a thing to profile. How do we profile it?

What Next?

How do we use it? And what if something went wrong?

First, let's try to generate some profiling data. Profilers are quite picky about knowing what program generated the data, so you're much better off running the Ruby interpreter directly. That's a problem if you have a locally built Ruby that isn't installed and that you'd otherwise run with a script like runruby.

With a little trial and error, you can figure out which environment variables to set to run Ruby from the Ruby dir, with GPerfTools active, running optcarrot. If you followed along above, you cloned Ruby and optcarrot into two directories next to each other, so optcarrot is at "../optcarrot". Here's the command I use when I run it:

PATH=.:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin RUBYLIB=.:.ext/common:.ext/x86_64-darwin15:lib CPUPROFILE=/tmp/prof.out ./ruby ../optcarrot/bin/optcarrot --benchmark ../optcarrot/examples/Lan_Master.nes

Note the "CPUPROFILE" variable toward the end - that turns on GPerfTools profiling and tells it what file to write to. In this case, I chose "/tmp/prof.out".

Troubleshooting: if Ruby can't find its libraries, you'll need to change PATH and RUBYLIB. If you're not sure what to change them to, edit your runruby script near the end (line 105 for me) and add "STDERR.puts ENV.inspect" to see what your environment variables should look like. You can run optcarrot with runruby to see what it uses ("./tool/runruby.rb ../optcarrot/bin/optcarrot --benchmark ../optcarrot/examples/Lan_Master.nes"). Then, put in your variables instead of mine, based on the ones you printed to the console.

If you run optcarrot and a /tmp/prof.out file appears... You win! Now you have profiling data. You can move on to "Seeing the Results" below.

More troubleshooting: if it doesn't work, you may want to clean your Ruby directory and rebuild:

git clean -dxf
export LIBS="-lprofiler"
export LDFLAGS="-Wl,-no_pie"
autoconf && ./configure && make clean && make && make check
unset LIBS
unset LDFLAGS

The "git clean" line means "destroy any files that git didn't put here." It cleans out everything a "make clean" would, but also all caches, logs, configurations... Even the configure script itself. It's the equivalent of blowing away your repo and re-cloning, but with less network traffic.

But it blows away the configure script, so you have to run autoconf to regenerate it, and then you'll get a very slow full rebuild. So only do this if you think you need it.

Seeing the Results

So, that's cool. It's kind of neat seeing the file appear in /tmp/prof.out. But what can we do with it?

We'll use pprof, part of GPerfTools. Fair warning: it's a little rough around the edges. If you use it much, you may find yourself reading its Perl source regularly. Pprof isn't amazing for weird corner cases like "you made a typo in the filename for the profiler output" and other "unlikely" situations. On the plus side, it's simple enough that with just a little Perl know-how, or Ruby know-how and some squinting, you can usually figure out what it's doing.

To run pprof, give it your binary's location, the profiling file, and what kind of output you want. Here's a simple example:

pprof ./ruby /tmp/prof.out --text

That should print out examples of what functions are taking the most time, like this:

The first column is how many samples landed in that function but not others that it called. The second column is what percentage of CPU time was spent on that function but not any other functions it called. The third column is a running sum - the total percentage of CPU time that's accounted for by this function and all the ones above it in this sort order (see how it always goes upward on every line?) The fourth and fifth columns are cumulative -- that is, the samples and percent of CPU time used by this function, plus everything else that it called.

The fifth-column percentages don't even vaguely add to 100% - they're cumulative, so they include that function and everything under it. Since vm_exec_core does a lot of calling rb_ary_push, for instance, they wind up with significantly more than 100% of total CPU time between them, because the shared time is double-counted.

(And obviously, last on the line comes the function that's being called, or a big hex number like "0x00007fff9b5eb273" if pprof can't figure out what was getting called there. You won't usually see static functions here, but that's because they're usually inlined -- their addresses are similar to normal function addresses and don't start with 0x00007fff. Addresses starting with that prefix seem to be on the stack. Perhaps the EIP or ESP register is briefly being used for something else, which seems like it would confuse GPerfTools' method of sampling.)

Troubleshooting

If you see only hexadecimal numbers instead of names, you're either not getting debugging symbols, or you didn't turn off PIE (see above.) You'll need to rebuild Ruby, possibly clean from nothing.

This is what that failure looks like:

Think it might be a problem with debug symbols in your Ruby binary? Pprof just uses "nm" for that, so run "nm ./ruby". If you see a bunch of function names, your Ruby binary is fine. Here's what that looks like:

More Fun Output Formats

Pprof can output graphical call graphs. They're fun, but I personally find them a bit impractical. But if you'd like to see and you installed Graphviz via homebrew above... Go ahead and run the same thing, but with "--web" instead of "--text".

pprof ./ruby /tmp/prof.out --web

This should create a nice svg file, also in /tmp, and open Safari to view it (see below.) If instead you got a warning about not having "dot" installed, that means Graphviz wasn't installed, or isn't in your path.

In fact, go ahead and run pprof --help and look at all the output types. I'll wait.

In general, pprof can output line numbers for call sites... But mostly, it doesn't do that on Mac, which lacks the addr2line command-line utility. If I can get gaddr2line working, perhaps you'll get another blog post on it. So: if you get question marks instead of line numbers from the Ruby source code, that's expected.

Play around with pprof - this is all poorly documented, but the trial and error is pretty good.

What Does All This Tell Me?

This should nicely break down where Ruby is spending its time. It's a sampling profiler, so the accuracy is pretty good. If you remember profiling optcarrot before, the output above should make sense -- optcarrot spent its time calling array appends, and Ruby is spending its time on rb_ary_push.
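
If you want to see that connection for yourself without optcarrot, a toy script like this one should also show rb_ary_push near the top of its profile when run the same way -- it's just an illustration, not part of any benchmark:

# toy_appends.rb -- Array#<< is implemented by the C function rb_ary_push,
# so a profile of this loop should be dominated by rb_ary_push.
frame = []
10_000_000.times { frame << 0 }
puts frame.size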

From here on out, you should be able to use this and figure out what Ruby's doing.

And that's what we both wanted, right?

 

Configuring the Ruby Source Code for Profiling... And Everything Else

I'll be talking a fair bit about configuring the Ruby source code... And I'll show you how to do it as well. With luck, by the end of this post we'll be able to do lots of profiling, set Ruby build options and whatnot.

A lot of this was occasioned by me trying to use GPerfTools and Instruments... Which turn out to need some configuring.

(This post is basically for Unix, Mac and environments that are set up the same way like MinGW. If you're using an IDE and/or Visual Studio tools, it's probably not going to be a perfect match for you. And if you already know C, Make and Configure really well, and you already have a pretty good idea how the Ruby source code gets configured... You may not learn much here, but please feel free to critique in the comments below!)

The Basics

If you're a Ruby programmer, you may not use C libraries very often. They can be painful. A big C project such as the Ruby interpreter (which is written in C) does a huge amount of checking what's available on your platform, what compiler options and flags are available, what functions are available in what libraries and hundreds or thousands of other things... I wrote about the problems with just OpenSSL (only one of those libraries) not long ago.

The way C solves this is with a long list of possible options, flags and dependencies. A program called "autoconf" builds a giant bash script called "configure", which builds a Makefile, which allows you to build, test and install Ruby. The fact that autoconf, bash and Make are essentially their own individual languages will probably seem barbaric if you're a Rubyist. But this all started so long ago that, um... Well, okay, we only pretend it ever made sense. And Ruby can't do it in Ruby because first you have to build Ruby. But autoconf/configure/make is available nearly everywhere, and mostly works. Ruby, like most other large C projects, uses this system.

The basic short version of "what do I do?" goes like this:

autoconf # if there's no configure script in the current dir
./configure # with options, sometimes
make # build the software
make check # sometimes called "make test" or similar
make install # to copy the program into place

This means that when you want to change how Ruby is built, you may need to change things during configuration, or during make, by setting environment variables for various things... It can be complicated. So let's talk about how that happens.

Command Line Arguments

The configure script takes a bunch of command line arguments. Some configure scripts use them for turning features on and off -- Ruby does a little of that, too. But it's not a coincidence that I didn't mention configure scripts when talking about fixing OpenSSL. You can use them to do a few things, though. Here are some chunks of the output of "./configure --help":

If this seems cryptic... You're not alone. It can be hard to figure out how to set an option. So let's talk about some Ruby-relevant ways to do that.

Compiling and Linking

When you build a C program, there are (usually) two passes. The basic C source files get compiled from a .c file into a .o file, called an "object file", which is an intermediate not-yet-runnable chunk of binary code. You can make an object file (or many of them) runnable by putting those pieces together into a single binary with a tool called a linker.

Or you can give the compiler the names of all your .c files originally (not just one,) and it will automatically compile them behind the scenes and run the linker for you. That's good too.

Most projects do some of each -- building one .c at a time, or putting them all together in one command. Which means that the configure-and-Make stuff above needs to know which flags get used in which ways, because it needs to pass compiler flags or linker flags, but not the other kind. So we'll talk about several kinds of flags... And the difference between them is basically "compiler, linker or both?"

If you're trying to use a language feature, OS tweak or library that you found online, you'll need to know which type of flag it is. Sometimes you don't know and you have to use trial and error.

Here are some example types of flag, which I also got from "./configure --help":

Compiler Flags

The basic compiler flags, the ones that aren't meant for the linker, go into an environment variable called CFLAGS. There's a sort of "extra CFLAGS" variable to Ruby's configure called "cflags." Yeah, they're different :-/ I recommend exporting the cflags variable before you run configure, which should add the flags you want.

For instance, want to add profiler information with the "-pg" compiler flag? You could do this:

export cflags="-pg"
./configure && make clean && make # build Ruby
unset cflags # Don't leave this sitting around

If you set CFLAGS instead of cflags, it'll replace the set Ruby uses instead of adding to them. You probably don't want to do that, since they include a lot of debugging information, compiler warnings and optimization flags that you probably don't want to lose.

Sometimes, Ruby already uses a given flag. It can be bad if you add a flag that conflicts with something Ruby already uses. So there are some Ruby-specific additional variables such as "debugflags" for specific kinds of compiler flags. Unfortunately, you may just need to read the configure source code (see "configure.in" in the Ruby source) to find these... The example below is to set a compiler flag for debugging information manually. Otherwise, Ruby would do it for you. And if you put it into cflags, you and Ruby would both do it, which comes out badly.

(You could also figure out what goes in CFLAGS and replace what Ruby would put there. If so, caveat emptor!)

export debugflags="-g3"
./configure && make clean && make
unset debugflags

Similar option types include optflags for optimization and warnflags for code warnings.

Linker Flags

We mentioned that some flags get sent to the linker, not the C compiler. For instance, MacOS has Address Space Layout Randomization (MacOS uses a variant called "PIE," for Position-Independent Executable.) This feature can make it very hard to debug or profile, because all the functions move around.

But you can turn it off with a linker flag, "-no_pie". And you can tell the compiler to pass that linker flag through with "-Wl,-no_pie" since there's no compiler flag for it. That compiler option means roughly: pass the following "-no_pie" to the linker without trying to make any sense of it.

You do the same as above, but pass it in LDFLAGS instead of CFLAGS so that Ruby will use it whenever it's invoking the linker:

export LDFLAGS="-Wl,-no_pie"
./configure && make clean && make
unset LDFLAGS

Library Flags

Sometimes you want to add in a library, one that Ruby otherwise wouldn't use. I'll use the example of GPerfTools' profiling library, because hey, I have a nice "how to profile" theme going here.

export LIBS="-lprofiler"
./configure && make clean && make
unset LIBS

This can also be used with a "-L /path/to/files" (capital L, not lowercase l) to make Ruby look for existing libraries in new places. This will get passed to the compiler, but not when just making an object file -- only when it creates a binary.

About That Title

If you want to profile with GPerfTools or Instruments, you'll need to use the option above to turn off PIE. If you want to profile with GPerfTools specifically, you'll need the "-lprofiler" option. So here's the "TL;DR" version of this article:

export LDFLAGS="-Wl,-no_pie"
export LIBS="-lprofiler"
autoconf && ./configure && make clean && make
unset LIBS
unset LDFLAGS

You may need to "brew install autoconf" or the equivalent as well. And you already have XCode, right? And... Well, let's say that having a full set of C development tools is another whole blog post, if you haven't done it already.

But What If...?

Naturally I haven't discussed every possible configuration option here. The next time you'd like to remember how amazing it is to just work in Ruby, I recommend configuring a big C project of any description. But in case you need it, here are some places you can look for Ruby knobs and levers to fiddle with, roughly in the order to check:

  • Just Google your problem
  • Arguments to ./configure
  • One of the environment variables above
  • Google your problem again, with what you learned
  • Check the Makefile
  • Check the source to ./configure
  • Google your problem one more time
  • Dive into the Ruby source

Happy Rubying!

 

Configuring Ruby on MacOS with OpenSSL

If you, like me, use MacOS and Homebrew, you may have had some trouble configuring Ruby from source - and maybe even more if you don't use Homebrew. You may also have gotten some odd error messages when you tried. I had some trouble on El Capitan, but a quick Google search will find many weird errors with OpenSSL and Ruby over many years and many versions. RVM goes so far as to download its own OpenSSL and link against it when it builds.

Of course, that doesn't help you if you're building from source, without using RVM.

How Do You Check?

First off, let's talk about how you see if you have openssl working. You *can* run "make check" in your Ruby source directory, but that's quite slow. Better is to start up irb and require "openssl". If Ruby can't find it, that means it didn't get built -- or your path is wrong. You can require something else to make sure it's not a path problem.

$ irb
2.3.1 :001 > require "tsort"
 => true
2.3.1 :002 > require "openssl"
 => true

(I required "tsort" just as a way of showing that, yes, my standard Ruby libraries were available.)

You can also cut that down to a one-liner by having Ruby evaluate code from the command line:

$ ruby -r openssl -e 'print "Working\n"'
Working
$

Okay, maybe I just really like one-liners :-)
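
Along the same lines, once it loads you can ask which OpenSSL Ruby actually linked against -- handy later if you suspect it found the wrong copy:

$ ruby -r openssl -e 'puts OpenSSL::OPENSSL_VERSION'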

What If It Doesn't Work?

Extensions don't always rebuild properly if Ruby can't detect the libraries. Specifically, configure doesn't always seem to figure out when it's been looking in the wrong place. So just re-making will often not rebuild Ruby's OpenSSL extension properly if it didn't get detected right the first time. This will manifest as a C error about incompatible prototypes for HMAC_CTX_copy, a function you've rightly never heard of and don't care about. That's because Ruby is using a very old, slightly wrong version of this function in its chunk of "you don't have OpenSSL" stub code.

You can track this down to a few old commits in 2008 (commit a8a111 in Ruby GitHub, and commit 87d524 in OpenSSL, not that I checked or anything.) But if you're getting prototypes not matching on HMAC_CTX_copy, it's because Ruby has the OpenSSL extension half-configured. Kill it with fire. If you're cloned from the Ruby GitHub repo, that looks like this in your Ruby source directory:

git clean -dxf
autoconf
./configure
make
make check

That first incantation, "git clean -dxf" means "destroy every file that wouldn't be there after a new clean git checkout." Even stuff in .gitignore. If Git didn't put it there, destroy it. If you wonder if your local Ruby checkout might be in an inconsistent state, I recommend that command highly. Of course, you'll also wait quite awhile for everything to be configured and built.

However, that will make sure you're seeing the OpenSSL that's actually available on your system instead of one that configure saw and cached several versions ago.

(Is it a problem that Ruby has some of the wrong prototypes for OpenSSL functions in its "you don't have OpenSSL" stub? Not really. They're very close. And you'll only see the compiler complain if you pull in both sets, which is an error, and a sign that things are already wrong. It's actually kind of nice, because you never wind up with a compiled-but-half-functioning Ruby with part stubs and part real OpenSSL.)

Any Hints on Why?

OpenSSL is one of a number of Ruby "extensions" which get compiled along with your Ruby. Ruby will try to build them, but may not if it can't find needed libraries, or if something fails when Ruby tries to compile it.

These extensions live in your Ruby source tree under the "ext" directory:

There may be a "mkmf.log" directory in any of them -- especially if something failed to build. You can see how the extensions above may not be around if something they need isn't there (Win32ole on a Mac, say, or zlib or openssl on a computer without those libraries installed.)

If you're basically C-literate, the mkmf.log file may be able to tell you that it can't find the library, or that there's an incompatibility. May I recommend not digging too deep in those cases unless you know OpenSSL quite well? That way, madness lies. Still, opening up mkmf.log and reading the last few entries can sometimes be quite useful ("wait, it's finding what version of OpenSSL? In what directory? Why is that there?")
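In practice that usually looks something like this, run from the top of your Ruby build directory (the same pattern works for the other extensions):

$ tail -n 20 ext/openssl/mkmf.log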

It's possible to rebuild just the OpenSSL extension instead of all of Ruby as a way to see if you got the configuration right. I recommend not doing that. It's easy to wind up with a half-configured setup, some stuff cached by configure and passed in, and otherwise something that only sort-of works, and which you'll never be able to replicate if you ever need to build Ruby again. When possible, build clean and from the top level, even if it's slower, so that you'll be able to do it again later if you need to.

Homebrew? From Source?

I did a fair bit of uninstalling and reinstalling openssl in Homebrew, and from source. As it turns out, that wasn't my problem here. But in case you need it:

$ brew uninstall openssl

# And in the OpenSSL source directory:

./config --prefix=/usr/local --openssldir=/usr/local/openssl
./Configure darwin64-x86_64-cc
make depend
make
make test
make install
cd /usr/local/include && ln -s /usr/local/ssl/include/openssl .

 

What If It Still Doesn't Work?

Ruby can be picky on a Mac about picking up OpenSSL, which uses a slightly funky directory structure for a library anyway. I tried a bunch of things with setting LIBS when configuring, to no avail.

It turns out there's a utility already on your Mac that Ruby will use and that specializes in this sort of thing: pkg-config.

There's a nice blog post that explains how this works, but the short answer looks like this:

export PKG_CONFIG_PATH=/usr/local/opt/openssl/lib/pkgconfig:/usr/local/lib/pkgconfig

By telling pkg-config exactly where to find OpenSSL and to look there first, Ruby finds it, and everything works... Almost.

You should put that in your ~/.bashrc file, so that it always happens. That way, next time you build Ruby it will just "magically work," instead of you having to remember where you read this blog post.
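If you'd like to confirm that pkg-config can now see the Homebrew OpenSSL before you kick off a long Ruby build, you can ask it directly (the "openssl" package name comes from the .pc file that Homebrew's OpenSSL installs):

$ pkg-config --modversion openssl
$ pkg-config --cflags --libs openssl

The first prints the OpenSSL version pkg-config found; the second prints the compile and link flags it will hand to anything that asks, Ruby's configure included.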

And When My Tests Fail?

All of that gets you almost all the way there. But when you run all the Ruby tests, such as with "make check", four will fail with timeouts on incoming network connections. And, if you're lucky, you'll get a little pop-up window telling you that Mac OS isn't so sure that your process should just get to open incoming sockets. Security!

Your Mac comes with something called the Application Firewall, which will cause some of Ruby's OpenSSL tests to fail if you have it turned on.

Short answer: go to the "Security & Privacy" preference pane, then the "Firewall" tab, and turn it off. You'll need to click the little lock icon at the bottom to make changes, just like always.
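If you'd rather stay on the command line, there's an equivalent. The socketfilterfw path and flags below have been around on macOS for a while, but double-check them on your version:

$ /usr/libexec/ApplicationFirewall/socketfilterfw --getglobalstate
$ sudo /usr/libexec/ApplicationFirewall/socketfilterfw --setglobalstate off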

However, now it should be working!

What If It's Still Not Working?

I can only tell you what worked for me. But here are a few more possibilities that didn't help me, but might help you.

You can uninstall and reinstall OpenSSL via Homebrew. This will help if your version is damaged.

You can install a new OpenSSL from source, with the commands above. This will help if you're using the wrong version.

You can install Ruby via RVM. Remember that "rvm install ruby-head" will grab the latest, so you're not limited to released, stable versions. RVM has a lot of workarounds for getting Ruby compiled with OpenSSL -- it downloads and configures its own copy -- so it's often the easiest route. Also, even if your local system OpenSSL is broken somehow, RVM won't be using it, so that may work around the problem. Ruby-build can also work for this.

You can install OpenSSL to a different directory. In the "configure OpenSSL from source" commands above, I put it under /usr/local. But you can put it directly under /usr, for instance, or otherwise try a place that might be easier to find.

Final Thoughts

If you can, install via RVM or ruby-build to avoid doing this yourself.

If you need the latest source, but not often, still use RVM or ruby-build.

If you need to build locally, have pkg-config do the configuration rather than doing it yourself.

If you still have to configure it yourself, make sure to "git clean -dxf" to avoid even the slightest chance of a half-cached older configuration.

If you'll need to build Ruby again, put the "export PKG_CONFIG_PATH=/usr/local/opt/openssl/lib/pkgconfig:/usr/local/lib/pkgconfig" into your ~/.bashrc file so it happens always, not just this one time.

In short, Ruby's code is part of the C ecosystem, with all the usual library-based problems that entails. If somebody else can do part of the configuration work for you, please let them :-)

 

A Little Ruby-Land Profiling with StackProf

I'm looking at Ruby performance lately. Aaron Patterson was awesome when I asked him about Ruby performance in his interview -- he pointed me at a great but Linux-specific Japanese blog post about how to profile Ruby itself. The program being used, called optcarrot, is a headless NES emulator that the Ruby core team are using as a CPU-intensive optimization target. Thus "optcarrot" == "optimization carrot", the target they're trying to optimize. Cute, right?

I'm on a Mac, which seems to lack most profiling tools (e.g. no gprof), and the existing one or two, like Xcode's Instruments, seem fairly command-line-unfriendly and Makefile-unfriendly. But while I learn to use DTrace for profiling and get a Linux VM set up with other profiling tools, I can also check out the Ruby-side performance. Let's talk about that.

(One confusing point: there's the performance of the Ruby interpreter itself, and then the performance of the Ruby code running in the interpreter. Today's post is about the code running in the interpreter. So the function names and durations are from the optcarrot NES emulator, not the internals of the core Ruby language - yet. Capisce?)

The beginning of that Japanese post mentions stackprof, which I hadn't used before. I'm enjoying it more than perftools.rb or ruby-prof, and it's been much easier to set up. Let's talk about how to use it.

I got stackprof set up in a locally-compiled development Ruby. I'm not going to say too much about that here - it's kind of painful, and I don't recommend it without a good reason. So maybe I'll post about it later. Today, let's just talk about what stackprof shows us about optcarrot.

Stackprof needs to be in the local Ruby, so "gem install stackprof" to make that happen. You may need to put it into a Gemfile with the line gem "stackprof" -- I did that in optcarrot. Then it wants to be required and configured. In optcarrot, you can do that in the bin/optcarrot file. Here's the original file:

#!/usr/bin/env ruby

# I'm too lazy to type `-Ilib` every time...
require_relative "../lib/optcarrot"

Optcarrot::NES.new.run

And here's the version with stackprof. Pretty simple:

#!/usr/bin/env ruby

require 'stackprof'  # added

StackProf.run mode: :cpu,  # added
              interval: 1000,
              out: '/tmp/stackprof_optcarrot.dump' do

  # I'm too lazy to type `-Ilib` every time...
  require_relative "../lib/optcarrot"

  Optcarrot::NES.new.run

end

Okay. So what's it do? Well, run optcarrot again, just like before... If you're not running a funky local Ruby, that's probably just "ruby bin/optcarrot --benchmark examples/Lan_Master.nes". If you need to instead add a bunch of non-standard includes and stuff so that Ruby can find its local standard library, you have my sympathies :-)

Once the run works, you should see the file that's listed in the StackProf.run line. I used /tmp/stackprof_optcarrot.dump as the filename, but you can pick what you like.

$ ls -l /tmp
total 2272
drwx------  3 noah.gibbs    wheel    102 May 31 16:35 com.apple.launchd.5JyKIrJXHi
drwx------  3 _mbsetupuser  wheel    102 May 31 16:35 com.apple.launchd.6AV1lzEddb
drwx------  3 _mbsetupuser  wheel    102 May 31 16:35 com.apple.launchd.RhIpa252Cv
drwx------  3 noah.gibbs    wheel    102 May 31 16:35 com.apple.launchd.yH43wuq7kM
-rw-r--r--  1 noah.gibbs    wheel  10236 Jun  9 15:12 stackprof_optcarrot.dump

What do you do with it?

The most useful thing to do is ask it where the time went:

$ stackprof /tmp/stackprof_optcarrot.dump --text --limit 10
==================================
  Mode: cpu(1000)
  Samples: 3625 (1.01% miss rate)
  GC: 10 (0.28%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
      2854  (78.7%)        2854  (78.7%)     Optcarrot::PPU#render_pixel
       271   (7.5%)         271   (7.5%)     block (3 levels) in <class:PPU>
       218   (6.0%)         109   (3.0%)     Optcarrot::PPU#setup_lut
        79   (2.2%)          79   (2.2%)     Optcarrot::PPU#setup_frame
        55   (1.5%)          55   (1.5%)     Optcarrot::CPU#fetch
        26   (0.7%)          26   (0.7%)     Optcarrot::PPU#wait_one_clock
        19   (0.5%)          19   (0.5%)     Optcarrot::PPU#update_address_line
      6017 (166.0%)          14   (0.4%)     Optcarrot::PPU#main_loop
        13   (0.4%)          13   (0.4%)     Optcarrot::PPU#evaluate_sprites_even
        13   (0.4%)          13   (0.4%)     Optcarrot::PPU#load_tiles

Stackprof will also output files, graphviz output, flamegraphs, specific methods and more. See its GitHub page for more details.
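For example, a couple of the other output modes look like this. These flags existed in the stackprof versions I've used; "stackprof --help" will list what yours supports:

$ stackprof /tmp/stackprof_optcarrot.dump --graphviz > /tmp/optcarrot.dot
$ stackprof /tmp/stackprof_optcarrot.dump --callgrind > /tmp/optcarrot.callgrind

Graphviz's "dot" tool will turn the first into a call-graph picture, and a viewer like QCachegrind can read the second.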

So - what does that actually tell us? The PPU code is the "graphical" output (headless in this case) where it calculates all the output pixels. Optcarrot's documentation (see doc/benchmark.md) even warns us that the vast majority of the time will be spent in PPU#run and CPU#run. That makes a lot of sense for a game console emulator. The time is spent emulating the hardware.

The render_pixel function isn't terribly long (see lib/optcarrot/ppu.rb and search for "render_pixel"). But looking at that array append at the end, I had a sudden suspicion. I tried just commenting out that one last line of the function (yes, the emulator is useless at that point, but humor me.) Delete /tmp/stackprof_optcarrot.dump, run the emulator, run stackprof again and...

$ stackprof /tmp/stackprof_optcarrot.dump --text --limit 10
==================================
  Mode: cpu(1000)
  Samples: 2753 (1.18% miss rate)
  GC: 14 (0.51%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
       495  (18.0%)         493  (17.9%)     Optcarrot::PPU#vsync
       259   (9.4%)         259   (9.4%)     block (3 levels) in <class:PPU>
       258   (9.4%)         258   (9.4%)     Optcarrot::PPU#render_pixel
       202   (7.3%)         202   (7.3%)     Optcarrot::PPU#wait_one_clock
       137   (5.0%)         135   (4.9%)     Optcarrot::CPU#fetch
       119   (4.3%)         119   (4.3%)     Optcarrot::PPU#update_address_line
       113   (4.1%)         113   (4.1%)     Optcarrot::PPU#evaluate_sprites_even
       110   (4.0%)         110   (4.0%)     Optcarrot::PPU#setup_frame
       203   (7.4%)         101   (3.7%)     Optcarrot::PPU#setup_lut
      2786 (101.2%)          98   (3.6%)     Optcarrot::PPU#main_loop

Instead of 78% of the total runtime, render_pixel is down to 9.4% of the total runtime. So that array append is burning a great big chunk of our CPU cycles.

You can also verify this by asking stackprof to "zoom in" on the render_pixel method.

$ stackprof /tmp/stackprof_optcarrot.dump --text --method render_pixel

See the left columns in that per-line output? That single array-append line is using 75.2% of the total runtime. So again, the array append is slow...

If I were trying to optimize the optcarrot NES emulator, I'd restructure that -- preallocate the array, so there's only one memory allocation? Don't use an array at all, and find some other way to handle pixel output?
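Just to make the first idea concrete, here's a generic sketch of the difference -- this is illustrative Ruby, not optcarrot's actual pixel code:

# Appending grows the array as it goes, reallocating along the way.
out = []
values.each { |v| out << v }

# Preallocating sizes the array once up front, then fills in the slots.
out = Array.new(values.size)
values.each_with_index { |v, i| out[i] = v }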

But I'm not really trying to optimize optcarrot. Instead, I'm trying to optimize the Ruby internal operations that make optcarrot slow. And now I know -- those are dominated by arrays.

By the way - don't take the numbers from that commented-out-line run too seriously. They're dominated by PPU#vsync, but PPU#vsync winds up blacking the screen if render_pixel doesn't write output pixels - and that's an array append in a tight loop. In other words, if you take out the slow array appends in render_pixel, it will fill in with a bunch more array appends anyway. This is one of the hazards of modifying the code you're profiling - it changes its behavior.

In a later post, perhaps we'll return to that array append. I'm still figuring out how to optimize it. I suspect realloc() being slow, myself. And in the mean time, you have learned a bit about how to use stackprof, which is a pretty darned awesome tool.

 

Git Bisect to Find a Ruby Regression

Hi! My name is Noah. I'm AppFolio's new Ruby Fellow. Though I've been working on Ruby stuff for a long time, most of it hasn't been on the core Ruby language.

So: that means I'm getting up to speed. And I'm a compulsive blogger. So you'll get a series of posts about how to do interesting things with the Ruby code, and how to get started with it.

Today I'm looking at a minor regression in Ruby, and using git-bisect to find it. Care to follow along?

(@nalsh on Twitter points out that you can use tool/bisect.sh in the Ruby source if you're bisecting on a test that Ruby already has. It's even easier!)

I have the Ruby code checked out locally. I'm running the bug-reporter's example test, an "nth prime" example from Exercism.io. It succeeds on Ruby 2.2, and fails with an obscure "deadlock; recursive locking" error on the very latest Ruby (currently 2.4, commit d16ff5 in the GitHub mirror.)

To check if a given Ruby version is good or bad, I'm rebuilding Ruby and running the test via this incantation: "make && ./ruby -v -Ilib -I. -I.ext/x86_64-darwin15 ~/Desktop/nth_prime_test.rb". The extra arguments are to use local libraries from the Ruby build directory on my MacBook. It'll be a little different if you're using Windows or Linux.

(Here's a time-saver of a tip: check your incantation by manually switching back and forth between the first "good" and "bad" revisions - that way you'll be sure that you're rebuilding the right things to get the bug after you weren't getting it, and vice-versa. That's very important when you're trying to isolate it. And checking all the revisions when your test doesn't work will waste your time.)

I started my "git bisect" on Ruby's git history:

$ git bisect start trunk # Trunk is bad, current (ruby_2_2) is good
$ make && ./ruby #... # Run the tests
$ git bisect good

After each "git bisect" command I make Ruby locally and run my test again. Then I either type "git bisect good" or "git bisect bad", depending whether I got the nasty deadlock error that I'm tracking down.

 $ git bisect good
Bisecting: a merge base must be tested
[7632a82d5750b7908bd173eda3268ecb0855b934] * gc.c (wmap_final_func): fix memory size shortage when realloc wmap. Fix SEGV during finilize of WeakRef on Solaris (though the SEGV could occur on all OS/platforms). [ruby-dev:48779] [Bug #10646]
$ make && ./ruby #... # Run the tests again

Git bisect, if you don't know it already, will look at the whole list of commits between your "good" and "bad" revision, pick one in the middle, and then let you test it to see if it's "good" or "bad". Each time you tell it, it will cut the range of commits in half again.

So it's a binary search algorithm, but you'll be playing the role of the "compare" operation :-)

If you hit a different bug, you can "git bisect skip" to say that that commit can't be tested... It'll take a little longer to find the bug because you had to rebuild Ruby just to find out that one commit wasn't testable. But when you need it, use it. Also, there's always a weird tension about how much you rebuild. Do you "git clean -dxf && autoconf && ./configure &&..." (aka nuke it from orbit and redo everything)? Do you just remake as little as possible? This is one reason it can be tough to just hand git bisect a script and have it do the right thing, unless your script does a full nuke-and-rebuild, which can be very slow. I'm sure there's also a smart way to not rebuild rdoc every time, but I'm still getting the hang of lots of Ruby rebuilding ;-)
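For the record, here's roughly what the scripted version looks like. The script name and the assumption that the failing test exits with a non-zero status are mine, not from the bug report -- adjust the build and test lines for your own setup. Note that the script has to live outside the Ruby checkout, since "git clean -dxf" would happily delete it otherwise.

#!/bin/sh
# Full nuke-and-rebuild. Exit 125 to tell git bisect to skip commits that won't build.
git clean -dxf && autoconf && ./configure && make || exit 125
# Exit 0 means "good"; any other non-zero status (except 125) means "bad".
./ruby -v -Ilib -I. -I.ext/x86_64-darwin15 ~/Desktop/nth_prime_test.rb

Save that as, say, /tmp/bisect_test.sh and make it executable. Then, once you've marked your first good and bad revisions as above, git bisect can drive the whole loop:

$ git bisect run /tmp/bisect_test.sh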

WhoDunnit

Did you read the original bug? Wondering why innocent-looking code having nothing obvious to do with threads is getting a bizarre mutex error with some Ruby versions?

The Mutex exists because the example code uses Prime, which is a Singleton, which locks itself when getting the single, global instance of Prime that exists. The commit that git-bisect finds looks relevant: commit d2487e removes an obsolete Prime.new operation. It also switches Prime to be a Singleton. The old version of Prime used an internal @the_instance variable to track itself. The new one uses Singleton, and a mutex, and the code where it gets that deadlock error. Ah!

The deadlock error on the mutex basically means "this mutex is already locked by you and you're locking it again" -- which can never succeed. Frequently it's caused by... recursion.

In this case, Prime got rid of its .new() method. The Singleton module, in Ruby, makes its instance using .new(). Worse, the sample code winds up defining .new() as well. So when it calls .new(), Prime winds up calling .instance(), which calls .new(), which calls .instance()...
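Here's a tiny sketch of the same shape of problem, with Prime and the Exercism code taken out of the picture entirely -- just a Singleton with a .new() defined on it:

require "singleton"

class Thing
  include Singleton

  # Singleton's .instance() builds the one instance by calling .new()
  # while holding a mutex. Defining .new() to call .instance() sends it
  # straight back into that same (non-reentrant) mutex.
  def self.new
    instance
  end
end

Thing.instance  # => ThreadError: deadlock; recursive locking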

And so it looks like the problem is just the code in question defining a .new() method. It's technically a regression, but is it one that matters? Hm...

Want to see what happens? Watch the bug for updates!