Where Does Rails Spend Its Time?
You may know that I run Rails Ruby Bench and write a fair bit about it. It’s intended to answer performance questions about a large Rails app running in a fairly real-world configuration.
Here’s a basic question I haven’t addressed much in this space: where does RRB actually spend most of its time?
I’ve used the excellent StackProf for the work below. It was both very effective and shockingly painless to use. These numbers are for Ruby 2.6, which is the current stable release in 2019.
(Disclaimer: this will be a lot of big listings and not much with the pretty graphs. So expect fairly dense info-dumps punctuated with interpretation.)
About Profiling
It’s hard to get high-quality profiling data that is both accurate and complete. Specifically, there are two common types of profiling and they have significant tradeoffs. Other methods of profiling fall roughly into these two categories, or a combination of them:
Instrumenting Profilers: insert code to track the start and stop of whatever they measure; very complete, but the extra bookkeeping distorts the timing; usually high overhead; don’t run them in production
Sampling Profilers: every so many milliseconds, take a sample of where the code currently is; statistically accurate and can be quite low-overhead, but not particularly complete; fast parts of the code often receive no samples at all; don’t use them for coverage data; fast ones can be run in production
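To make the sampling idea concrete, here’s a toy sketch in Ruby. This is emphatically not how StackProf is implemented - it’s just the general shape of a sampling profiler: a second thread wakes up on an interval and records where the main thread currently is.

# Toy sampling profiler - an illustration of the idea, not StackProf itself.
def toy_sample_profile(interval: 0.005)
  counts = Hash.new(0)
  main = Thread.current

  sampler = Thread.new do
    loop do
      # Record the main thread's current top-of-stack location
      frame = main.backtrace_locations&.first
      counts[frame.to_s] += 1 if frame
      sleep interval
    end
  end

  yield                          # run the code being profiled
  sampler.kill

  counts.sort_by { |_loc, n| -n } # most-sampled locations first
end

# Usage: toy_sample_profile { 10_000.times { "x" * 1_000 } }.first(5)

Anything that finishes between two wake-ups never gets a sample, which is exactly the “not particularly complete” tradeoff above.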
StackProf is a sampling profiler. It will give us a reasonably accurate picture of what’s going on, but it could easily miss methods entirely if they’re not much of the total runtime. It’s a statistical average of samples, not a Platonic ideal analysis. I’m cool with that - I’m just trying to figure out what bits of the runtime are large. A statistical average of samples is perfect for that.
I’m also running it for a lot of HTTP requests and adding the results together. Again, it’s a statistical average of samples - just what I want here.
Running with a Single Thread
Measuring just one process and one thread is often the least complicated: there’s no interference between threads to worry about, and it makes a good baseline measurement. So let’s start with that. If I run RRB in that mode and collect 10,000 requests, here are the frames with the most CPU-time samples, as measured by StackProf.
(I’ve removed the “total” columns from this output in favor of just the “samples” columns because “total” counts all methods called by that method, not just the method itself. You can get my original data if you’re curious about both.)
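The listing below is StackProf’s standard text report. For reference, here’s roughly how you can print one yourself from a saved dump file - the path is just a placeholder for wherever your run saved its results:

# Load a saved StackProf dump and print a text report like the one below.
require "stackprof"

data = Marshal.load(IO.binread("/tmp/stackprof/stackprof.dump"))  # placeholder path
report = StackProf::Report.new(data)
report.print_text(false, 30)   # sort by self samples, show the top 30 frames

(The stackprof command-line tool can produce the same kind of text report from a dump file.)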
==================================
  Mode: cpu(1000)
  Samples: 4293 (0.00% miss rate)
  GC: 254 (5.92%)
==================================
     SAMPLES    (pct)     FRAME
        206   (4.8%)  ActiveRecord::Attribute#initialize
        189   (4.4%)  ActiveRecord::LazyAttributeHash#[]
        122   (2.8%)  block (4 levels) in class_attribute
         98   (2.3%)  ActiveModel::Serializer::Associations::Config#option
         91   (2.1%)  block (2 levels) in class_attribute
         90   (2.1%)  ActiveSupport::PerThreadRegistry#instance
         85   (2.0%)  ThreadSafe::NonConcurrentCacheBackend#[]
         79   (1.8%)  String#to_json_with_active_support_encoder
         70   (1.6%)  ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#exec_no_cache
         67   (1.6%)  ActiveModel::Serializer#include?
         65   (1.5%)  SiteSettingExtension#provider
         59   (1.4%)  block (2 levels) in <class:Numeric>
         51   (1.2%)  ActiveRecord::ConnectionAdapters::PostgreSQL::Utils#extract_schema_qualified_name
         50   (1.2%)  ThreadSafe::NonConcurrentCacheBackend#get_or_default
         50   (1.2%)  Arel::Nodes::Binary#hash
         49   (1.1%)  ActiveRecord::Associations::JoinDependency::JoinPart#extract_record
         49   (1.1%)  ActiveSupport::JSON::Encoding::JSONGemEncoder::EscapedString#to_json
         48   (1.1%)  ActiveRecord::Attribute#value
         46   (1.1%)  ActiveRecord::LazyAttributeHash#assign_default_value
         45   (1.0%)  ActiveSupport::JSON::Encoding::JSONGemEncoder#jsonify
         45   (1.0%)  block in define_include_method
         43   (1.0%)  ActiveRecord::Result#hash_rows
There are a number of possibly interesting things here. I’d summarize the results as “6% garbage collection, 17%-ish ActiveRecord/ActiveModel/Arel/Postgres, around 4-6% JSON and serialization, and assorted cache and ActiveSupport overhead like class_attribute.” That’s not bad - with the understanding that ActiveRecord is kinda slow, and this profiler data definitely reflects that. A fast ORM like Sequel would presumably do better for performance, though it would require rewriting a bunch of code.
Running with Multiple Threads
You may recall that I usually run Rails Ruby Bench with lots of threads. How does that change things? Let’s check.
==================================
  Mode: cpu(1000)
  Samples: 40421 (0.51% miss rate)
  GC: 2706 (6.69%)
==================================
     SAMPLES    (pct)     FRAME
       1398   (3.5%)  ActiveRecord::Attribute#initialize
       1169   (2.9%)  ActiveRecord::LazyAttributeHash#[]
        999   (2.5%)  ThreadSafe::NonConcurrentCacheBackend#[]
        923   (2.3%)  block (4 levels) in class_attribute
        712   (1.8%)  ActiveSupport::PerThreadRegistry#instance
        635   (1.6%)  block (2 levels) in class_attribute
        613   (1.5%)  ActiveModel::Serializer::Associations::Config#option
        556   (1.4%)  block (2 levels) in <class:Numeric>
        556   (1.4%)  Arel::Nodes::Binary#hash
        499   (1.2%)  ActiveRecord::Result#hash_rows
        489   (1.2%)  ActiveRecord::ConnectionAdapters::PostgreSQLAdapter#exec_no_cache
        480   (1.2%)  ThreadSafe::NonConcurrentCacheBackend#get_or_default
        465   (1.2%)  ActiveModel::Serializer#include?
        436   (1.1%)  Hashie::Mash#convert_key
        433   (1.1%)  SiteSettingExtension#provider
        407   (1.0%)  ActiveRecord::ConnectionAdapters::PostgreSQL::Utils#extract_schema_qualified_name
        378   (0.9%)  String#to_json_with_active_support_encoder
        360   (0.9%)  Arel::Visitors::Reduce#visit
        348   (0.9%)  ActiveRecord::Associations::JoinDependency::JoinPart#extract_record
        343   (0.8%)  ActiveSupport::TimeWithZone#transfer_time_values_to_utc_constructor
        332   (0.8%)  ActiveSupport::JSON::Encoding::JSONGemEncoder#jsonify
        330   (0.8%)  ActiveSupport::JSON::Encoding::JSONGemEncoder::EscapedString#to_json
        328   (0.8%)  ActiveRecord::Type::TimeValue#new_time
This is pretty similar. ActiveRecord is showing around 20% rather than 17%, though that figure doesn’t include any of the smaller components under 1% of the total (and it’s sampled, so there’s some noise). Serialization is still around 4-6%.
If I try to interpret these results, the first thing to point out is that they’re quite similar. Running with 6 threads per process adds to (for instance) the amount of time spent on cache contention and garbage collection, but it doesn’t change it that much. Good. A massive change there would be either a huge optimization that isn’t available single-threaded, or (more likely) a serious error of some kind.
If GC is High, Can We Fix That?
It would be reasonable to point out that 7% is a fair bit for garbage collection. It’s not unexpectedly high and Ruby has a pretty good garbage collector. But it’s high enough that it’s worth looking at - a noticeable change there could be significant.
There’s a special GC profile mode that Ruby can use, where it keeps track of information about each garbage collection that it does. So I went ahead and ran StackProf again with GC profiling turned on - first in the same “concurrent” setup as above, and then with jemalloc turned on to see if it had an effect.
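That mode is the standard library’s GC::Profiler. Here’s a minimal sketch of turning it on and reading the totals back - this isn’t the exact harness I used for the numbers below, just the mechanism:

# Minimal GC::Profiler sketch - Ruby's built-in GC profile mode.
GC::Profiler.enable

# ... run the workload you care about here, e.g. a batch of requests ...
100_000.times { Array.new(100) { "garbage" } }

puts GC::Profiler.total_time   # total seconds of profiled GC time
GC::Profiler.report(STDOUT)    # per-GC-run breakdown
GC::Profiler.disable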
The short version is: not really. Without jemalloc, the GC profiler recorded 2.4 seconds of GC time over the 10,000 HTTP requests… And with jemalloc, it recorded 2.8 seconds of GC time total. I’m pretty sure what we’re seeing is that jemalloc’s primary speed advantage is during allocation and freeing… And with Ruby deferring much of its sweeping (lazy sweep), it’s a good bet that neither of those shows up as garbage collection time here.
I also took more StackProf results with GC profiling on, but 1) they’re pretty similar to the other results and 2) GC profiling itself takes enough time to distort the results a bit, so they’re likely to be less accurate than the ones above.
What Does All This Suggest?
There are a few interesting leads we could chase from here.
For instance, could JSON be lower? Looking through Discourse’s code, it’s using the Oj gem via MultiJSON. Oj is pretty darn fast, so it’s probably going to be hard to trim much more time there. MultiJSON might be adding a tiny bit of overhead, but it shouldn’t be more than that. So we’d probably need a structural or algorithmic change of some kind (e.g. different caching) to lower JSON overhead. And for a very CRUD-heavy app, this isn’t an unreasonable amount of serialization time. Overall, I think Discourse is serializing pretty efficiently, and these results reflect that.
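For context, the pattern in question looks roughly like this - a sketch of Oj behind MultiJSON, not Discourse’s actual serialization code:

# Sketch of "fast JSON backend behind a common interface" - illustration only.
require "multi_json"
require "oj"

MultiJson.use :oj                      # pick Oj as the adapter

payload = { id: 42, title: "Hello", tags: %w[ruby rails] }
json = MultiJson.dump(payload)         # delegates to Oj under the hood
data = MultiJson.load(json)            # => {"id"=>42, "title"=>"Hello", ...}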
ActiveRecord is a constant performance bugbear in Rails, and Discourse is certainly no exception. I use this for benchmarking and I want “typical,” not “blazing fast,” so this is pretty reassuring for me personally - yup, that’s what I’m used to seeing slow down a Rails app. If you’re optimizing rather than benchmarking, the answers are 1) the ActiveRecord team keeps making improvements and 2) consider using something other than ActiveRecord, such as Sequel. None of the alternatives are 100% API-compatible with ActiveRecord, but if you’re willing to change a bit of code, some Ruby ORMs are surprisingly fast. ActiveRecord is convenient, flexible, powerful… but not terribly fast.
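To give a flavor of what that switch looks like, here’s a hedged sketch of a simple query in both styles; the posts table, columns, and connection URL are made up for illustration:

# Illustration only - hypothetical "posts" table, not code from Discourse or RRB.

# ActiveRecord style:
#   Post.where(user_id: 42).order(created_at: :desc).limit(10).to_a

# Roughly the same query with Sequel:
require "sequel"

DB = Sequel.connect(ENV.fetch("DATABASE_URL"))   # e.g. postgres://user:pass@host/db

recent_posts = DB[:posts]
  .where(user_id: 42)
  .order(Sequel.desc(:created_at))
  .limit(10)
  .all                                           # => array of plain hashes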
Since jemalloc’s not making much difference in GC time… in a real app, the next step would be optimizing to create less garbage. Again, for me personally, I’m benchmarking, so lots of garbage per request means I’m doing it right. Interestingly, jemalloc does seem to speed up Rails Ruby Bench significantly, so these results don’t mean it isn’t helping. If anything, this may be a sign that StackProf’s measurement doesn’t capture jemalloc’s benefits very well - perhaps it isn’t catching differences in free() call time? Garbage collection can be hard to measure well in any case.
Methodology
This is mostly just running 10,000 requests and seeing what they look like added/averaged together. There are many reasons not to take this as a perfect summary, starting with the fact that the server wasn’t restarted to give multiple “batches” the way I normally do for Rails Ruby Bench work. However, I ran it multiple times to make sure the numbers hold up, and they seem to.
Don’t think of this as a bulletproof, exact summary of where every Rails app spends all its time - it couldn’t be that anyway. It’s a statistical summary of samples, from one specific app, and so on. Instead, you can think of it as where a lot of time happened to go one time that some guy measured… and I can think of it as grist for later tests and optimizations.
As for specifically how I got StackProf to measure the requests… First, of course, I added the StackProf gem to the Gemfile. Then in config.ru:
use StackProf::Middleware,
  enabled: true,
  mode: :cpu,
  path: "/tmp/stackprof",  # directory to save results in
  interval: 1000,          # microseconds between samples
  save_every: 50           # save a .dump file every this many requests
You can see other configuration options in the StackProf::Middleware source.
Conclusions
Here are a few simple takeaways:
Even when configured well, a Rails CRUD app will spend a fair bit of time on DB querying, ActiveRecord overhead and serialization,
Garbage collection is a lot better than in Ruby 1.9, but it’s still a nontrivial chunk of time; try to produce fewer garbage objects where you can,
ActiveRecord adds a fair bit of overhead on top of the DB itself; consider alternatives like Sequel and whether they’ll work for you,
StackProf is easy and awesome and it’s worth trying out on your Ruby app
See you in two weeks!