Traveling Ruby Conversation!

Hey, Rubyists!

I’m going to be traveling as I work for awhile. So expect announcements of interesting locations where I may be found. If you’d like to talk Ruby, I’m interested!

I’ll also try to contact local meetups in my various locations - I’d love to give talks or just hang out with the local folks!

I’m currently in Anaheim, California. Let me know if you’re nearby!

And I’ll be in Malaysia for RubyConf MY from October 15th-30th. Do I have any Malaysian readers?

Ruby Method Lookup, RubyVM.stat and Global State

In Ruby, it can sometimes be a bit involved figuring out where a constant or a method comes from - not just for you, but for Ruby, too! When you call "foo," is it foo from your current class? One of its parents? A module included from a parent class? Somewhere else?

And of course, Ruby has a few objects whose methods are available just about everywhere - Object, BasicObject and Kernel.

When I wrote about the Global Method Cache, we touched on that just a little. Let's go a bit deeper, shall we?

Better yet, we'll learn about some things that you should carefully *not* do to keep your performance good, and how you check.

Method Caching and Cache Invalidation

Methods on three specific Ruby objects are special - BasicObject, Object and Kernel. BasicObject is the root of the inheritance tree - even the simplest Ruby objects inherit from it. Object is slightly more full-featured than BasicObject, but it's basically the same thing - it includes a module called Kernel, which gives it a lot of methods. And Object is the default for inheritance, so nearly every object inherits from it. As a result, every object inherits from BasicObject and nearly every object inherits from Object, and includes Kernel. Also, any “top level” definitions are definitions on Object (thanks to Benoit Daloze for correcting my original explanation!)

As a result, they get special hardcoding in Ruby's performance code. Ruby keeps a count of how many times you have defined methods on those three objects. It calls that count your “Global Method State.” And it tags all method cache entries with that number. Which means that if you add a new method on any of them, all the old method cache entries get invalidated...

That's good! You don't want to use stale entries that would point to the wrong method definition.

Also, that's bad. All your old cache entries are gone, and now you have to look them up again. That can be slow.

If it happens during the setup at the beginning of your Rails app... Well, sure. That sounds like Rails, defining methods in all sorts of places. Any cache you set up before it's done with that is wasted, sure. Sounds like Rails. Also, sounds fine.

But if you do that in your inner loop, that's really bad. It means you're looking up all your methods fresh, every time, even methods that don't share a name with the method you're defining on Object (or Kernel, or BasicObject.) That's no good.

Aaron Patterson has also written more about Global Method State and its relatives, if you want more details.

Constants and Cache Invalidation: Global Constant State

Ruby has really involved constant lookup. In the same way as for global methods, making a new constant makes it hard to tell exactly what constant to use. Ruby has a really similar solution.

When you define a new constant, it invalidates your old constant caching - Ruby has to look everything up again. The state number Ruby uses to track this is called the Global Constant State.

Your Global Cache Detective: RubyVM.stat

In CRuby, you can't "fix" the cache-invalidation problem except by not defining methods on Object, Kernel or BasicObject in your inner loop. If you need to define new methods constantly, do it in some other class.

But how can you tell if you — or one of your dependencies — are doing that?

Remember Ruby's GC.stat? In the same way, Ruby has a RubyVM.stat that tracks changes to (what it calls) "global_method_state".

2.5.0 :002 > RubyVM.stat
 => {:global_method_state=>137, :global_constant_state=>1115, :class_serial=>7109}

Each of these tracks an internal quantity for the VM. :global_method_state is the one we’ve just been talking about. :global_constant_state is the counter for constants. And class_serial is complicated to explain — but basically, for each individual class, they keep a similar counter, and this number assigns the value across all classes. Unlike global_method_state, it’s not a big performance problem if you reset it. Mostly you can ignore class_serial.

Great! But How Do I Use It?

You might reasonably ask, “what would I do with this?” If you’re finding your app slows down unaccountably at some times, or you’re worried that it might, it’s one more thing you can check - has global_method_state increased a lot? Or has global_constant_state increased a lot? Either case might mean that you’re doing something slow that busts your caches. OpenStruct has had this problem, for instance (edit: fixed to say OpenStruct, not Struct - thanks, commenter!)

If you’re using Rails, you might consider doing this in a Rack middleware - if either of those increases during a request, that may be a sign that you’re using something slow and you’ll want to look into that.

Ruby's Main Object Does What?

Last week, Mischa talked about debugging a problem with Rake tasks polluting the global namespace. This week, I'll talk about how that happens.

Spoiler: there will be a bunch of spelunking, first in Ruby and then in Ruby's C source code. We'll talk about how to find code in both places.

What Are We Looking For?

When we define methods on Ruby's top-level "main" object, they show up as instance methods on Object so they're callable absolutely everywhere. But "main" is mostly just a random instance of Object, and isn't part of the standard class hierarchies like, say, Kernel or Object. So why and how do those methods show up on Object?

First I checked what was going on in irb. You don't know what you're looking for if you can't write an example! Here's what I did:

2.5.0 :001 > def bobo; print "Bobo!"; end
=> :bobo
2.5.0 :002 > class NotBobo; def really; puts "I am pretty sure."; bobo; end
2.5.0 :003?>   end
=> :really
2.5.0 :004 > a = NotBobo.new
=> #<NotBobo:0x00007fc0fe031940>
2.5.0 :005 > a.really
I am pretty sure.
Bobo! => nil

That bit where it prints "Bobo!" instead of giving a no-method exception? That's the feature we're talking about.

(I hadn't actually known this happened. How'd I miss that? If you already knew that, congratulations! This is a good time to feel superior about it ;-)

This is a good example of a feature that can be mysterious if you don't already know it. So let's talk about how you'd track this feature down, by talking about how I recently did do that.

Searching in Ruby

Ruby is a language with great reflection methods. So first, we'll check where that behavior could be coming from in Ruby.

The obvious places to look for weird behavior in Ruby are the object in question ("main" in this case,) and Object and Kernel.

Let's see what "main" actually is in irb:

2.5.0 :001 > self
 => main
2.5.0 :002 > self.class
 => Object
2.5.0 :003 > self.singleton_class
 => #<Class:#<Object:0x00007f9ea78ba2e8>>
2.5.0 :005 > self.methods.sort - Object.methods
 => [:bindings, :cb, :chws, :conf, :context, :cws, :cwws, :exit, :fg,
     :help, :install_alias_method, :irb, :irb_bindings, :irb_cb,
     :irb_change_binding, :irb_change_workspace, :irb_chws,
     :irb_context, :irb_current_working_binding,
     :irb_current_working_workspace, :irb_cwb, :irb_cws, :irb_cwws,
     :irb_exit, :irb_fg, :irb_help, :irb_jobs, :irb_kill, :irb_load,
     :irb_pop_binding, :irb_pop_workspace, :irb_popb, :irb_popws,
     :irb_print_working_binding, :irb_print_working_workspace,
     :irb_push_binding, :irb_push_workspace, :irb_pushb, :irb_pushws,
     :irb_pwb, :irb_pwws, :irb_quit, :irb_require, :irb_source,
     :irb_workspaces, :jobs, :kill, :popb, :popws, :pushb, :pushws,
     :pwws, :quit, :source, :workspaces]

Huh. That feels like a lot of instance methods on the top-level object, and a lot of them irb-flavored. Maybe this is somehow an irb thing? Let's check in a separate Ruby source file with no irb loaded.

# not_in_irb.rb
def method_on_main
  print "Bobo!"
end

class NotBobo
  def some_method
    print "Yup.\n"
    method_on_main
  end
end

a = NotBobo.new
a.some_method

print "\n\n\n"

print (self.methods.sort - Object.methods).inspect

Feel free to run it. But when I do, I get an empty list for the methods. In other words, that whole list of methods on main that aren't on Object came from irb, not from plain Ruby. Which seems like there's nothing terribly special there. Hrm.

What about the singleton class (a.k.a. eigenclass) for main? Maybe it has something individually that works only on that one object? Again, I'll run this outside of irb.

dir noah.gibbs$ cat >> test2.rb
print self.singleton_class.instance_methods
dir noah.gibbs$ ruby test2.rb
[:inspect, :to_s, :instance_variable_set, :instance_variable_defined?,
 :remove_instance_variable, :instance_of?, :kind_of?, :is_a?, :tap,
 :instance_variable_get, :public_methods, :instance_variables, :method,
 :public_method, :define_singleton_method, :singleton_method,
 :public_send, :extend, :pp, :to_enum, :enum_for, :<=>, :===, :=~,
 :!~, :eql?, :respond_to?, :freeze, :object_id, :send, :display,
 :nil?, :hash, :class, :clone, :singleton_class, :itself, :dup,
 :taint, :yield_self, :untaint, :tainted?, :untrusted?, :untrust,
 :trust, :frozen?, :methods, :singleton_methods, :protected_methods,
 :private_methods, :!, :equal?, :instance_eval, :instance_exec, :==,
 :!=, :__id__, :__send__]

Ah. There's a lot there, and I'm not finding it terribly enlightening. Plus, it's really likely whatever method I care about is going to be defined in C, at least in CRuby. Let's bit the bullet and check Ruby's source code...

No Luck - Let's Make More Luck

Okay. That pretty much exhausts our Ruby options. Beyond this point, it's time to check in C. I'll use the latest (August 2018) Ruby code, because this is a long-term feature and hasn't changed any time recently -- sometimes you'll need a very specific version of the Ruby code, depending what you're looking for.

So: first we're looking for the "main" object. The word "main" is used in lots of places in Ruby, so that will be hard to track down. How else can we search?

Luckily, we know that if you print out that object, it says "main". Which means we should be able to find the string "main", quotes and all, in C. I'm going to use The Silver Searcher, a.k.a. "ag", for code search here - you can also use Ack, rgrep, or your favorite other tool.

We can ignore anything under "test", "spec", or "doc" here, so I won't list those out.

thread.c
4936:    rb_define_singleton_method(rb_cThread, "main", rb_thread_s_main, 0);

object.c
3692: *     String(self)        #=> "main"

vm.c
3177:    return rb_str_new2("main");

addr2line.c
782:    if (line->sname && strcmp("main", line->sname) == 0)

lib/rdoc/generator/template/darkfish/index.rhtml
15:<main role="main">

lib/rdoc/generator/template/darkfish/servlet_root.rhtml
16:<main role="main">

lib/rdoc/generator/template/darkfish/servlet_not_found.rhtml
13:<main role="main">

lib/rdoc/generator/template/darkfish/page.rhtml
15:<main role="main" aria-label="Page <%=h file.full_name%>">

lib/rdoc/generator/template/darkfish/table_of_contents.rhtml
2:<main role="main">

lib/rdoc/generator/template/darkfish/class.rhtml
19:<main role="main" aria-labelledby="<%=h klass.aref %>">

iseq.c
742:    const ID id_main = rb_intern("main");

(...)

Okay. The one in object.c is on a comment line. The ones in the rdoc generator are classes on markup elements -- not what we're looking for. The one in thread.c is for the name of the main thread. Let's investigate addr2line.c, iseq.c and vm.c as the promising choices.

In addr2line.c it's messing with printed text in dumping a backtrace, not defining methods on the top-level object:

void
rb_dump_backtrace_with_lines(int num_traces, void **traces)
{
  /* ... */
    /* output */
    for (i = 0; i < num_traces; i++) {
        line_info_t *line = &lines[i];
        uintptr_t addr = (uintptr_t)traces[i];
        /* ... */
        /* FreeBSD's backtrace may show _start and so on */
        if (line->sname && strcmp("main", line->sname) == 0)
            break;

(By the way - in C, anything surrounded by the slash-star star-slash things are comments.)

In iseq.c, it's naming the types of instruction sequences -- not what we want:

static enum iseq_type
iseq_type_from_sym(VALUE type)
{
    const ID id_top = rb_intern("top");
    const ID id_method = rb_intern("method");
    const ID id_block = rb_intern("block");
    const ID id_class = rb_intern("class");
    const ID id_rescue = rb_intern("rescue");
    const ID id_ensure = rb_intern("ensure");
    const ID id_eval = rb_intern("eval");
    const ID id_main = rb_intern("main");
    const ID id_plain = rb_intern("plain");

Finally, in vm.c we find something really promising:

/* top self */

static VALUE
main_to_s(VALUE obj)
{
    return rb_str_new2("main");
}

VALUE
rb_vm_top_self(void)
{
    return GET_VM()->top_self;
}

That's more like it! And it's next to something being referred to as "top self", which sounds pretty main-like. So let's see where main_to_s gets used. I checked with ag, and it's only in vm.c:

void
Init_top_self(void)
{
    rb_vm_t *vm = GET_VM();

    vm->top_self = rb_obj_alloc(rb_cObject);
    rb_define_singleton_method(rb_vm_top_self(), "to_s", main_to_s, 0);
    rb_define_alias(rb_singleton_class(rb_vm_top_self()), "inspect", "to_s");
}

Cool. That looks useful - it's defining the main object's method "to_s", so we have the right object. It's saying that main is an instance of Object (called "rb_cObject" here) and defining to_s and inspect on it. That doesn't tell us anything else special about main... But knowing that it's called "top_self" does let us find other files that use main, so let's look for it.

Main, top_self, Ruby

If we search for 'top_self', we can see what we do with main. In fact, we can see everything Ruby does with the main object...

inits.c
24:    CALL(top_self);

eval.c
1940:    rb_define_private_method(rb_singleton_class(rb_vm_top_self()),
1942:    rb_define_private_method(rb_singleton_class(rb_vm_top_self()),

internal.h
1908:PUREFUNC(VALUE rb_vm_top_self(void));

vm_method.c
2131:    rb_define_private_method(rb_singleton_class(rb_vm_top_self()),
2133:    rb_define_private_method(rb_singleton_class(rb_vm_top_self()),

ruby.c
671:    VALUE self = rb_vm_top_self();

vm.c
466:    vm_push_frame(ec, iseq, VM_FRAME_MAGIC_TOP | VM_ENV_FLAG_LOCAL | VM_FRAME_FLAG_FINISH, rb_ec_thread_ptr(ec)->top_self,
2142:   rb_gc_mark(vm->top_self);
2443:    RUBY_MARK_UNLESS_NULL(th->top_self);
2574:    th->top_self = rb_vm_top_self();
3093:   th->top_self = rb_vm_top_self();
3101:   th->ec->cfp->self = th->top_self;
3181:rb_vm_top_self(void)
3183:    return GET_VM()->top_self;
3187:Init_top_self(void)
3191:    vm->top_self = rb_obj_alloc(rb_cObject);
3192:    rb_define_singleton_method(rb_vm_top_self(), "to_s", main_to_s, 0);
3193:    rb_define_alias(rb_singleton_class(rb_vm_top_self()), "inspect", "to_s");

vm_eval.c
1394:    return eval_string_with_cref(rb_vm_top_self(), rb_str_new2(str), NULL, file, 1);
1406:    return eval_string_with_cref(rb_vm_top_self(), arg->str, NULL, arg->filename, 1);
1474:    VALUE self = th->top_self;
1479:    th->top_self = rb_obj_clone(rb_vm_top_self());
1480:    rb_extend_object(th->top_self, th->top_wrapper);
1484:    th->top_self = self;
1516:       val = eval_string_with_cref(rb_vm_top_self(), cmd, NULL, 0, 0);

vm_core.h
596:    VALUE top_self;
870:    VALUE top_self;

lib/rdoc/parser/c.rb
476:      next if var_name == "ruby_top_self"

proc.c
3175:    rb_define_private_method(rb_singleton_class(rb_vm_top_self()),

load.c
577:    volatile VALUE self = th->top_self;
589:    th->top_self = rb_obj_clone(rb_vm_top_self());
591:    rb_extend_object(th->top_self, th->top_wrapper);
619:    th->top_self = self;
996:            handle = (long)rb_vm_call_cfunc(rb_vm_top_self(), load_ext,

mjit.c
1503:    mjit_add_class_serial(RCLASS_SERIAL(CLASS_OF(rb_vm_top_self())));

variable.c
2188:    state->result = rb_funcall(rb_vm_top_self(), rb_intern("require"), 1,

Since we're looking for what happens when we define a method on main, we can ignore some of what we see here.  Anything that's being initialized, we can skip. And of course, we can still ignore anything in doc, test or spec. Here's roughly how I'd summarize what I see in these files when I check around for top_self:

  • inits.c: initialization, like the name says
  • internal.h, vm_core.h: in C, files ending in ".h" are declaring structure, not code; ignore them
  • mjit.c: this is initialization, so we can skip it.
  • lib/rdoc/parser/c.rb: this is documentation.
  • load.c: this is interesting, and could be what we're looking for... except it happens only on require or load.
  • variable.c: this is only on autoload, and it just makes sure we autoload into the main object.
  • vm_method.c: this is defining "public" and "private" as methods on main. Neat, but not what we're looking for.
  • eval.c: this is defining "include" and "using" as methods. Also neat, also not quite it.
  • ruby.c: this takes some tracking down... But it's just requiring any library you pass with "-r" on the command line into main.
  • vm_eval: this is defining various flavors of "eval", which happen on the "main" object.
  • vm.c: there's a lot going on here - initialization and garbage collection, mostly. But in the end, it's not what we want.
  • proc.c: finally! This is what we're looking for.

Any Ruby source file can be great for learning more about Ruby. It's worth your time to search for "top_self" in any of these files to see what's going on. But I'll skip to the actual thing we're looking for - proc.c.

 

The Hunt... And the Capture!

In proc.c, if you look for top_self, here's the relevant bit:

void
Init_Proc(void)
{
  /* ... */
  rb_define_private_method(rb_singleton_class(rb_vm_top_self()),
                          "define_method", top_define_method, -1);

And that's what we were looking for. It's in an init method, it turns out - d'oh! But it's also redefining 'define_method' on main, which is exactly what we're looking for.

What does top_define_method do? Here it is, the whole thing:

static VALUE
top_define_method(int argc, VALUE *argv, VALUE obj)
{
    rb_thread_t *th = GET_THREAD();
    VALUE klass;

    klass = th->top_wrapper;
    if (klass) {
        rb_warning("main.define_method in the wrapped load is effective only in wrapper module");
    }
    else {
        klass = rb_cObject;
    }
    return rb_mod_define_method(argc, argv, klass);
}

Translated from C, and from Ruby internals this is saying several things.

If you're doing a load-in-an-anonymous-module, it tells you that you probably won't get what you want. Your top-level definition won't be properly global since you asked it not to be. And then it defines the method on Object, as an instance method, not just on main. And that is the feature we were looking for, defined in Ruby.

Winding Up

We've succeeded! We tracked down a simple but well-hidden feature in the CRuby source. It's cocoa and schnapps all around!

Benoit Daloze of TruffleRuby points out that this is all much easier to read if you define your Ruby internals in Ruby, like they do. He's not wrong.

But in case you're still using CRuby for things like Rails... It may be worth your time to learn to look around in the internals! And I find that nothing makes it work better than practice.

Also, this tells us exactly what's up with last week's post about Rake! Now we know how that works, which may be important when we wish to fix that...

Rake Does What?: A Debugging Story

The Mystery

While working on upgrading one of our apps to Rails 5 I noticed that suddenly migrations were failing with the following error:

StandardError: An error has occurred, all later migrations canceled:
wrong number of arguments (given 2, expected 1)

The migration it was failing on just had something like this:

class LeakyMigration < ActiveRecord::Migration[4.2]
  results = select_one <<~SQL
    SELECT 1 FROM table_name
  SQL

  if results
    raise "Panic!"
  end
end

Digging into the ActiveRecord 4.2 and 5.0 documentation it became pretty clear that there wasn't a change to the method signature for select_one, so what gives? Looking at the stack trace I noticed that the method call was going through one of our private gems - specifically through a rake task we have defined there. Huh?

The Investigation

I opened the rake task and found something like the following:

namespace :db do
  task :leaky_task do
    min_db_schema_version
    ...Other things
  end

  def select_one
    ActiveRecord::Base.connection.select_one(ActiveRecord::Base.send(:sanitize_sql, sql, "NONE"))
  end
  
  def min_db_schema_version
    select_one('SELECT min(version) AS version FROM schema_migrations')['version']
  end
end

A quick check found that the method signature for sanitize_sql_for_conditions had changed. An optional second parameter was no longer supported, so it was understandably upset when we tried to give it a second parameter.

But why is this select_one being called during the migration anyways? Do methods defined in a namespace get shared across the namespace? While perhaps a bit inconvenient, that wouldn't be completely unreasonable. A quick check eliminated that possibility. Even with a different namespace the exception was getting raised during the migration.

It couldn't possibly be defining the method on Object, could it?

The Experiment

Well, a bit of quick investigation showed me two things:

You can experiment with this by creating a new empty rake task, defining a method foo and then starting a Rails console. Try something like User.send(:foo) and you’ll get a NoMethodError - so far so good, but now load your application’s rake tasks (which aren’t loaded by default) by calling AppName::Application.load_tasks and call User.send(:foo) again.

It runs! Madness!

These two things alone are worth digging into more, but first I think we should answer the question - why is our migration calling our select_one rather than the one provided by ActiveRecord?

In order to find out I commented out the offending rake task and added a raise statement to the ActiveRecord definition of select one. Re-running the migration yielded a stack trace that highlighted something very interesting...

The Finale

All those nice little sql methods like select_one() and update() aren't defined in ActiveRecord::Migration or any of its parents, they are all defined on the connection object. ActiveRecord::Migration defines a custom method_missing implementation that forwards the call to the connection object.

And as you remember, Rake is defining all its tasks as methods on Ruby's main object, which defines them on Object, which defines them nearly everywhere.

Unfortunately these two things combined lead to trouble. When rake tasks are loaded (which they are when they are running) these methods defined on main get defined on every object. Then, when we call select_one in the migration it sends the call to itself which it now knows how to respond to since it inherits from Object.

You might reasonably ask, "how does defining a method on 'main' define it everywhere?" You can check that it does do that for yourself. But next week, we'll dig down into how that happens and why it works.

Does ActionCable Smell Like Rails?

Not long ago, Rails got ActionCable. ActionCable is an interface to WebSockets and (potentially) other methods of turning a normal sent-to-browser web page into a two-way connection that can keep exchanging data. There have been a lot of these attempts over the years - WebSockets, Server-Sent Events, Comet and Server Push (HTTP1 and HTTP2) are all protocols to do that. There have been many Ruby implementations of these. Good old AJAX was probably the first widespread attempt. I'm sure there are some I missed. It's been the "near future" for as long as there's been a present.

Do you already know why we want WebSockets, and how bad most solutions are? Scroll down to "ActionCable and Convenience" below. Or heck, scroll wherever you like - I'm not the boss of you.

Yeah, But Why?

One big question is always: do we need this? What good does it do us?

Let's look at one-way updates (server-to-browser) separately from two-way.

The "hello, world" of one-way interactive web pages is the page update. When Twitter or Facebook shows you how many updates are waiting, or pops up new posts, those are interactive web pages. In order to tell you what's waiting for you, or to deliver it, they have to interact with a page that's already in your browser. You can also show site news banners that change, or pop up an alert like "site going down for maintenance at midnight!" The timing is often not that important, so a little delay is usually fine. AJAX is the usual way to do this, or Server-Sent Events can be a bit more efficient at the cost of some complexity.

See that "99+" up there? For you to know how many notifications are waiting, Twitter has to tell your browser somehow.

See that "99+" up there? For you to know how many notifications are waiting, Twitter has to tell your browser somehow.

The "hello, world" of two-way interactive web pages is the chatbox. From a simple "everybody sees everything" single-room chat to something more complex with sub-areas and filtering and private messages, you have to have two-way interaction to make chat work. If you put intercom.io or similar customer service chatbots on your site, they have to use two-way interaction of some kind. Any kind of document collaboration (e.g. Google Docs) or multiplayer game needs it. Two-way interaction starts to make sense quickly if you're doing something that people update together - auction sites, for instance, or fast-updating financial markets (stock trading, futures markets, in-game auction houses.) You can do two-way interaction with AJAX but it starts to get inefficient quickly, and the latency can be high. Long-polling (e.g. Comet, SSE) can help, or you can work around the browser with plugins (e.g. Flash or Unity.) This is the situation WebSockets (and thus ActionCable) were really designed for.

There are one-way updates in the browser-to-server direction, but forms and AJAX handle those just fine. They are often things like form submit, error reporting or analytics where the user takes action, but there's only a simple acknowledgement -- "yup, you did that, the server saw it and you can get back to what you were doing."

So What's Different With Interactivity?

In old-style HTTP, the browser gets an HTTP page. It may already be cached, so the browser may not use the network at all! Some pages can work entirely offline. The browser can keep requesting pages or resources from the server when and if it wants to. That may fail - the network connection isn't guaranteed. But the server can't interfere with most of this. The server can't send anything it wasn't asked for. The server may not even know that this is all going on. If the HTML page came from cache, the server probably has no idea that any of this is happening. In a normal web framework, the request gets served and the server moves on to other requests without a backward glance.

If you want interactivity with a constant connection to the server, that changes everything. The server needs to sit and spin, holding the connection open. It needs to see what other connections are doing and figure out what to send to everybody else every time you do something. If Bob sends a chat message, it may go out to 250 other people who have a certain page open.

The Programming Model

One big difficulty with two-way interaction is that web servers and frameworks aren't usually designed to handle it. If you keep a connection open all the time and you can send to it at random times... How does that look to the programmer? How does it work? That's new for Rails... And Sinatra. And nearly any Ruby framework that isn't using EventMachine. So... nearly all of them.

It's not just WebSockets that have this problem, incidentally. HTTP/2 adds server push, which also screws up all the frameworks... But not quite as much as WebSockets does.

In fact, very few application servers (e.g. Puma, Passenger, Unicorn, Thin, WEBrick) are designed for pushing from the server either. You can hack long-polling or server push on top of NGinX or Apache, but... that's not how nearly anybody uses them. You really want an evented server. There are some experimental Ruby evented webservers. But mostly, that's not how you write Ruby web applications. Or nearly any other web applications, in any language. It's an inefficient match for HTTP1 apps, but it's the only reasonable match for HTTP/2 apps with much two-way interactivity.

Node.js folks can laugh at the rest of us here. They do evented web applications all the time. And even they have a bit of a rough road to HTTP/2 and WebSocket support, because their other tools (reverse proxies, caches, etc) are designed in the traditional way, not the HTTP/2 way.

The whole reason for HTTP's weird, hacked-on model of sessions and cookies is that it can't keep a connection to the server or repeatedly identify individual clients. What would we do if it could? Maybe we'll find out.

But the fact remains that it's weird and hard to combine a long-running stateful server with lots of connections (WebSockets) to a respond-and-forget stateless web server (NGinX, Apache, Puma, Passenger, etc.) in the current day and age. They're not designed to work that way.

I only know of one web server that was ever designed with this in mind: Zed Shaw's Mongrel2. It's a powerful, interesting model for a server architecture. It's also rough around the edges, very raw and not used in production by anybody I know of. I've tried to get it running with Ruby web apps, and mostly you rapidly discover that nobody else designs anything that way, so it's hard to use with anything you don't write from scratch. There are bespoke frameworks inside big tech companies (e.g. LinkedIn, Facebook) that do the same thing. That makes sense. They'd have to, even with their current architecture. The rest of us have a long road to get there.

ActionCable and Convenience

So: Rails bundled WebSocket support into recent versions. Problem solved, right? Rails is awesome about making things convenient, so we're golden?

Yes... and no.

I've worked with the old solutions to this like Faye and Juggernaut. There is absolutely no question that ActionCable is a smoother experience. It starts the extra server automatically. It includes all the extra pieces of software. Secure sockets work approximately out-of-the-box, maybe, with a bunch of ugly caveats that aren't the server's fault. But they're ordinary HTTPS caveats, not really new ones for WebSockets. Code reloading works, kind of. When you hit "reload" in your browser the classes reload at least, like, 80% of the time. I mean, unless you have connections from multiple browsers or something else unreasonable.

Let's be clear: for hooking WebSockets up to a traditional web framework this is a disturbingly smooth experience. That's how bad the old ways of doing this are. Many old problems like "have I restarted both servers?" are nearly 100% solved in development, and are only painful in production. This is a huge step up.

The fact that code reloading works at all, ever, is surreal. You can nearly always get it to work by restarting the server and then hitting reload in the browser. Even that wasn't really reliable with older frameworks (ew, browser caching.) Thank you, Rails asset pipeline! And there's only one server in development mode, so you don't have to bring down multiple server-side processes and reload your browser.

This stuff used to be incredibly bad. ActionCable brings the pain level down to something a smartphone-app programmer would call tolerable.

(I'm an old phone operating system programmer. We had to flash ROMs every time. You kids don't know how good you have it. Get off my lawn!)

But... Does It Smell Like Rails?

Programming in Rails is a distinctive experience. From the controller actions to the views to the config files, you can just glance at a chunk of Rails code to figure out: yup, this is Ruby on Rails.

For an example of "it's not the web, but it smells like Rails," just look at ActionMailer. It may be about sending email, but it uses the same controller actions and the views feel like Rails. Or look at the now-defunct ActiveResource, which brings an ActiveRecord-style API to remote procedure calls. Yeah, okay, it was kind of a bad idea, but it really looks like Rails.

I've been trying to figure out for awhile: is there a way to make ActionCable look more like Rails? I'll show you what I've found so far and I'll let you decide.

ActionCable kind of wants to be structured like Rails controllers. I'll use examples from the ActionCable Rails Guide, just to make sure I'm not misrepresenting it.

Every user, when they connect to your site, gets a single ApplicationCable::Connection object:

# app/channels/application_cable/connection.rb
module ApplicationCable
  class Connection < ActionCable::Connection::Base
    identified_by :current_user
 
    def connect
      self.current_user = find_verified_user
    end
 
    private
      def find_verified_user
        if verified_user = User.find_by(id: cookies.encrypted[:user_id])
          verified_user
        else
          reject_unauthorized_connection
        end
      end
  end
end

You can think of it as being a bit like ApplicationController. Of course, all your Channels inherit from ApplicationCable::Channel, which is also a bit like ActionController. In the Guide there's not much in it. In fact, it can be very problematic to have many methods in it. I'll talk about why a bit later.

Then you have actual Channels inherited from ApplicationCable::Channel, which correspond roughly to pub/sub subscriptions, and then you can subscribe to one or more streams on each channel, which are sort of subchannels. This is explained poorly, but it's not too complicated once you figure it out. ActionCable wants one more level of channel-ness than most pub/sub libraries do, but it still boils down to "listen on channel names, get messages for those channels." ActionCable's streams are what every other Pub/Sub library would call a channel, approximately, and ActionCable's Channels are a namespace of streams.

It's clearly trying to look like a Rails controller:

# app/channels/chat_channel.rb
class ChatChannel < ApplicationCable::Channel
  def subscribed
    stream_from "chat_#{params[:room]}"
  end
 
  def receive(data)
    ActionCable.server.broadcast("chat_#{params[:room]}", data)
  end
end

But then that starts breaking down...

Where ActionCable Smells Funny

ActionCable actions are a bit like Rails controllers, and they use a somewhat different API than every other Pub/Sub library to get it. That's not necessarily a problem. In fact, that's fairly Rails-like, as far as it goes.

An ActionCable action can only be called from a browser, though. So there's two fairly distinct forms of sending: browsers send actions, while the server sends "broadcasts", which can also be sent from the server, including from actions. That's a bit like Rails controller actions, though it's a bit unintuitive. And then the server sends back JSON data, like a JSON API.

Then, JavaScript receives the JSON data and renders it. Here's what that looks like on the client side:

# app/assets/javascripts/cable/subscriptions/chat.coffee
# Assumes you've already requested the right to send web notifications
App.cable.subscriptions.create { channel: "ChatChannel", room: "Best Room" },
  received: (data) ->
    @appendLine(data)
 
  appendLine: (data) ->
    html = @createLine(data)
    $("[data-chat-room='Best Room']").append(html)
 
  createLine: (data) ->
    """
    <article class="chat-line">
      <span class="speaker">#{data["sent_by"]}</span>
      <span class="body">#{data["body"]}</span>
    </article>
    """

This should look like exactly what Rails encourages you NOT to do. It's ignoring everything Rails knows about HTML in favor of a client-side framework, presumably separate from whatever rendering you do on the server side. You can't easily use Rails view rendering to produce HTML, and no Rails controller is instantiated, so you can't use things like render_to_string. There's probably some way to patch it in, but that doesn't seem to be an expected choice. It's certainly not done by default, and at a minimum requires including Rails' internal modules into your Channels.

Also: please don't include random modules into your Channels, because every public method on every Channel object becomes publicly callable via JavaScript, so that's a really unsafe thing to do. So I would absolutely not recommend you include the Rails modules for view rendering into your Channels, because I would not recommend including anything you don't control into your channels. You need near-absolute control of your set of public methods for security reasons. Exposing public calls to whatever Rails has exposed from internal modules as a public API isn't a good idea even if you love Rails views (and I do.)

I'm not saying it's hard to do it in a more Rails-like way. You can build a simple template renderer using Erubis for this if you want. I'm saying ActionCable won't do this for you, or particularly encourage you to do it that way. The documentation and examples all seem to think that ActionCable should ship JSON around and your JavaScript/CoffeeScript client code should do the rendering. To be fair, that's likely to be more space-efficient than shipping around HTML, especially because WebSocket compression is fairly primitive (e.g. permessage-deflate). But compromising the programmer API for the sake of bandwidth fails the "smells like Rails" sniff test, in my opinion. Especially because it's entirely possible to have the examples mostly use the server-rendering-capable flavor and mention that you can convert to JSON and client rendering rather than vice-versa.

Here's what a very simple version of that Erubis template rendering might look like. It's not difficult. It's just not there by default:

def replace_html_with_template(elt_selector, template_name, locals: {})
  unless template_name["."]
    template_name += ".html.erb"
  end
  filename = File.join("app/views", template_name)
  tmpl = Erubis::Eruby.new(File.read filename)
  replace_html(elt_selector, tmpl.evaluate(locals))
end

(The replace_html method above needs to be a broadcast to the client, in a format the client recognizes. Also, don't make this a public method on the Channel object - I actually wrap the Channel object with a second object to avoid that problem.)

In general, ActionCable feels simple and "raw" compared to most Rails APIs. Want HTML escaping? Ask for it explicitly. Curious who is subscribed to what? Feel free to track that yourself. Wonder what some of the APIs do (e.g. identified_by)? Read the source code.

That may be because it's an immature API. ActionCable is fairly recent, as Rails APIs go. Or it may be the ActionCable isn't going to be as polished. WebSockets are a thing you add to an existing app for better performance, as a rule. So maybe they're being treated as an advanced feature, and the polish isn't wanted. It's hard to tell.

A Few More Details on How It Works

Before wrapping up, let's talk a bit more about how ActionCable does what it does.

If you're going to have a bunch of connections held open, you're going to need network sockets for it and some way to do work for those connections. ActionCable handles this with a thread pool, with four workers by default. If you're messing with the database in your ActionCable code, that means you should increase your database connection pool by that number of workers to avoid running out.

ActionCable eats a ton of memory to do its thing. That's fine - it's the best Ruby environment for WebSocket development, and you didn't start using Rails because it was the fastest or smallest - if you wanted that, you'd be writing bespoke C binaries for CGI scripts, or maybe assembly language.

But when you're looking at deploying to a production environment with a reasonable number of users, you may want to look for a more efficient, API-compatible solution like AnyCable. You can find a RubyKaigi talk by Vlad Dementyev about that and how it compares, if you'd like more information.

While ActionCable uses only a single server in development mode, it's assumed you'll want multiple in production. 

So What's the Takeaway?

ActionCable is a great way to reduce the pain of WebSockets development compared to the older solutions out there. It provides a relatively clean development environment. The API is unusual - not quite Rails, not quite standard Pub/Sub. But it's decently clean and usable once you get used to it.

A Rails API is often carefully polished. They tend to be hard to misuse with strong indicators of how to do things The Rails Way and excellent documentation. ActionCable isn't there yet, if that's where they're headed. You'll need more sophistication about what ActionCable builds on to avoid security problems - more like old-style Rails routing or controllers, less like heavily-secured Rails APIs like forms, HTML escaping or forgery protection.

How Can I Use Ruby 2.6 JIT?

I gave a talk on using JIT in Ruby 2.6 at Southeast Ruby - a great regional conference with a very friendly, cozy vibe. If you get a chance, I highly recommend going next year! It'll be August 1st and 2nd, 2019.

Wondering about what JIT is, how it works and why you'd use it? Or how to try it out in (currently pre-release) Ruby 2.6? Here are my slides. Don't miss the presenter notes, which have extra detail beyond just what's in the slides.

 

Can I Use Ten 10% Speedups to Make Ruby Instant?

There are a lot of little speedups to Ruby around. I write about a bunch of them. It wouldn't be too hard to collect 100% worth of little 3% and 5% and 10% speedups. Presumably it won't make every Ruby program instant, but what would it do? Heck, Ruby has had way more than ten big speedups over the years. Shouldn't Ruby be instant right now?

Let's talk about how you do performance math - how you check those improvements and how they add up. Then when you find speedup in the wild, you can guess a little about how much they'll help you.

Why Rockets Explode On The Launchpad

When I was a kid, we did a math exercise in class. NASA has a hard job because rockets are complicated. Each piece of a rocket has to be very reliable.

How reliable?

Well, say the pieces are 99.9% reliable and you have 10,000 of them. How likely is it that at least one piece fails and your rocket blows up?

About 99.996% likely to blow up. Don't put anything you care about (like your torso) into a rocket with those numbers. Luckily nobody in my middle school was likely to be an astronaut, so it all worked out.

As a kid, the point was "rockets should blow up all the time. I wonder if I'll get to watch?"

As a crusty old adult, I suggest the lesson, "multiplying over and over does some things you don't expect."

So let's talk performance.

Speeding Up Ruby

It's easy to find little things that give a 5% speedup here or a 10% speedup there in Ruby. For some of them you just upgrade, but others want some configuration.

So what if you added up all of them? What if you grabbed five 10% speedups and ten 5% speedups. That's 100% speedup, so any Ruby code should finish instantly with the right answer, right?

Alas, no. Three 10% speedups sound like they should add up to a 30% speedup. But it's not addition - it's repeated multiplication, like rocket reliability.

Let's talk math, future astronaut.

Let's say you have a Ruby program that takes ten seconds to run and you'd like it faster. You apply one of those 10% speedups, which works perfectly and brilliantly. Now your program runs in 9 seconds. Yay!

So you apply another 10% speedup. Unfortunately, it's not saving you 10% of ten seconds. It's saving you 10% of nine seconds - that other second is gone already. So you save nine tenths of a second, for a runtime of 8.1 seconds, not 8 seconds.

So your speedup isn't 10% + 10% = 20%. Instead, it's 90% of the runtime times 90% of the runtime is 81% of the runtime. So two 10% speedups add up to 19%. And that's why you get 8.1 seconds, not 8 seconds.

Hey, I didn't make the rules.

(There are actually a few cases where they add up to more than that for complicated reasons. Those cases are weird and rare in the real world.)

Except the Real World Sucks

Now if you take two actual 10% speedups and measure the result, you're likely to be disappointed. You often won't get 20% or even 19%. It may be more like 16% or 18%. Sometimes it'll be 10% - both speedups together are exactly as good as just one.

Why?

Sometimes it's because they solve the same problem. If your first optimization is to optimize garbage collection for a 2% speedup, and your second optimization is to turn off garbage collection completely for a 5% speedup, your total will be 5%. That first 2% doesn't do you any good at all.

In general, the more two optimizations "touch", the less their total is going to be. If you save 7%, 10%, 5% and 3% on four different CPU optimizations, it will almost never add up to 23% (note: 0.93 * 0.90 * 0.95 * 0.97 = 0.771, or about 22.9% speedup if they all work together perfectly.) There's generally some overlap between one optimization and another.

But it's worse than that. Because the rocket reliability math is too optimistic.

mountain_dew_bottle.png

Two Liters In Fifteen Seconds: GO!

If you were a mathematically inclined fourteen-year-old with nothing to do, you might decide to try a scientific experiment. Specifically, imagine your parents weren't paying attention for a bit, you could try to roll up as many Dungeons and Dragons characters as possible in a short time. Okay, maybe this is just me. Uh, or some anonymous fourteen-year-old who is only in an example.

You could roll and write out the characters for ten minutes. Let's say you could do about five high-school-quality D&D characters in that time.

But! You discover that by chugging a liter of coffee first, you can get it to six characters in ten minutes. Or with a liter of sprite, five and a half. We'll ignore the character quality, and how many are ripoffs of Drizzt Do'Urden.

So then, how many characters can you write out if you chug a two-liter of Mountain Dew, which combines that much caffeine and twice that much sugar?

A naive additive mathematician would say seven - five characters base, plus one more for the caffeine, plus one more (0.5 * 2) for the sugar. No problem!

A rocket-reliability mathematician would say 6.6 characters (20% speedup plus two 8.3% speedups, 0.8 * 0.917 * 0.917 = 0.673, or 32.7% speedup.)

And our example high-schooler discovers that his hand cramps up after six, plus the walls are now vibrating.

Operations Theory: The Academic Study of "This Class is Pass/Fail"

As our hand-cramp math suggests, you can optimize a particular problem all you like... And it'll help, right up until it doesn't. Mountain-Dew-fueled creativity doesn't help if your hand cramps first.

At any given time there will be one part of the program slowing you down ("the bottleneck.") If you can speed it up until something else is the slow part, any further speedup is wasted. Congratulations! You succeeded! All extra credit is rounded down to zero. Each bottleneck is pass/fail. Pass your papers forward and don't talk to your neighbor.

For instance, let's say your current bottleneck is shoving enough network packets through. It's about 7% slower than whatever your next bottleneck is. If you can find ten different ways to cut your network traffic by 5% each, then they should combine to give you... about 7% speedup. Because now the problem is the next bottleneck, and it really doesn't matter how fast or small your network packets are. Next!

If this sounds simple, I recommend using the phrase "Operations Theory" to describe it and mentioning that it comes from Eliyahu Goldratt's 1984 book "The Goal" (which it does.) Doesn't it sound fancier now?

It's also not quite this simple, as a skilled profiler can tell you. If you speed up an already-fast part of the program, it won't usually give you zero speedup. It'll usually just give you a very small one. That's one of the weird ways two small optimizations can add up to a big one - a big optimization to something that's not currently your bottleneck may turn into a really important optimization... If the bottleneck changes.

By the way - this is also why you should be careful adding up several small optimizations that work well right now. If they're currently giving good speedups, it's probably because they're related to your current bottleneck. Which means when that bottleneck is solved, they'll all turn into tiny (or zero) speedups.

So the Conclusion Is... It's Complicated

This is why it's hard to predict what five different 5% speedups add up to. How much do they overlap? Are they in bottlenecks in your program? If one of them changes the bottleneck, would one of the other speedups suddenly matter more?

I solve this by measuring with a big end-to-end performance test. That's inconvenient, but it changes "it's complicated" to "it's slow and takes a bunch of computer time."

But when people suggest just taking all the known speedups and putting them together, keep in mind that that can be complicated. If you're adding speedups into the core language, great! That means they're constantly tested together. If you're talking about rarely-used tuning knobs, those get complicated fast when you combine them.

 

Ruby's Global Method Cache

Hey, folks! Lately I've been exploring Ruby environment settings and how much they can help (or not) your app speed. I feel like I've already hit most of the major tuning knobs on Ruby at one point or another... But let's look at one I haven't yet: Ruby's global method cache. What is it? How do you set it? How much speed does it give?

What's the Global Method Cache?

When you use a particular method, Ruby has to figure out what classes and/or modules and/or refinements define it, and which one to use in that particular location. It's a much more involved process than you'd think, especially with how Ruby handles constants and scope. In a lot of cases you can figure out what defines that method once and keep using the lookup that you did the first time - it's slow to re-run, so we don't.

There are two ways Ruby saves those lookups: the inline method cache, and the global method cache. After I explain what they are, we'll talk about the global method cache.

The inline method cache lives at a specific call site. It is "inline" in the sense that it's cached in your Ruby code where you call the method. That seems simple and sane. When it works, the global method cache doesn't get used - the lookup happens the first time the code is hit, and gets reused afterward.

The global method cache is for cases where that doesn't work - method_missing, respond_to? and refinements are examples. In those cases, it's very unlikely that the same place in your code will always get the same answer for "what is the method here?" Here's how Pat Shaughnessy puts it:

Depending on the number of superclasses in the chain, method lookup can be time consuming. To alleviate this, Ruby caches the result of a lookup for later use. It records which class or module implemented the method that your code called in two caches: a global method cache and an inline method cache.

Ruby uses the global method cache to save a mapping between the receiver and implementer classes.

The global method cache allows Ruby to skip the method lookup process the next time your code calls a method listed in the first column of the global cache. After your code has called Fixnum#times once, Ruby knows that it can execute the Integer#times method, regardless of from where in your program you call times .
— Pat Shaughnessy, Ruby Under a Microscope

There are a fixed number of entries in the global method cache - by default, 2048 of them. A Shopify engineer finds that gives a 90%+ hit rate even for a really huge Rails app, so that's not bad.

You can set the number of entries, but only to a power of two, with the environment variable RUBY_GLOBAL_METHOD_CACHE_SIZE. The default is 2048, so you'll normally want to go up from there, not down. Each cache entry is 40 bytes. So the default cache uses about 80kb, and each time you double the number of entries, you double the size. So Shopify's setting of 128k entries at 40 bytes/entry would use about 2.5 megabytes of memory.

How's the Speed?

I write and maintain Rails Ruby Bench, a highly-concurrent Ruby benchmark based on Discourse, a large real-world Rails application. I do a lot of checking Ruby and Rails speed using it. And today I'll do that with Ruby's global method cache.

Discourse isn't as huge as Shopify's Rails app - few Rails apps are. Which means it may not need to increase the cache size as badly. But it certainly has far more possible cache entries than the default 2048. So it's a pretty good indicator of how much a mid-size Rails app benefits from the cache size increase.

Long-time readers will be expecting a pretty graph here, and I have bad news for them: the difference in speed when adjusting the cache size is so small that any reasonable way to graph it makes it look like they're identical - which they nearly are. Here are the results as a table:

RUBY_GLOBAL_METHOD_CACHE_SIZEMean req/secStd deviationSpeedup vs Default
1024155.31.7-1%
2048156.81.50%
4096158.32.31%
8192159.53.21.7%
16384160.33.42.2%

So as you can see, my smaller number of cached entries are... Hm. If I check that Shopify article... They actually only claimed to get about 3% faster results. So my own results are directly in line with theirs. I see a tiny speedup, in return for a very small amount of memory.

So... Is It a Good Idea?

I don't see any harm in using this. But for most users, I don't think a savings of 2%-3% is worth bothering about. And that's assuming your Rails app is fairly large. I would expect a smaller app, or a non-Rails app, to gain very little or even nothing at all.

In most cases, I think Ruby's global method cache does a great job and doesn't require adjustment.

But now you know how to check!

 

Finding the right engineers for the job

IMG_9036 (1).jpg

The best engineers

"This project will be critical to our success. Let’s get our best software engineers on it…" Maybe you’ve heard something like this where you work. The question naturally arises, who are your “best” engineers?

When I first started my career, I believed that the best software engineer was one who rigorously applied the principles of good software engineering (SOLID) in a steady fashion to create something new. My thinking on this has shifted a bit. While I still very much believe in these principles, I now see them as a part of a larger picture.

The right engineers

A few years ago I attended a conference where I heard a talk from a company called CorgiBytes. They work as consultants specializing in improving and maintaining legacy code. To many software engineers, this would be the worst job imaginable. But the founders recognized that there is a special type of engineer who enjoys and even thrives on this kind of (often desperately needed) work. So they target hiring engineers with this specific strength.

CorgiBytes describes their members as, "the joyful janitors of your codebase." This quote is heartwarming to me. It reminds me that for every need, there exists a person who not only can do the job, but will derive their greatest satisfaction from doing it. What a wonderfully merciful thing that there is someone out there who will joyfully do the very thing you cannot stand doing.

True greatness, then, exists wherever a need meets a person perfectly suited to fulfill it.

The developer landscape

Okay, so if there are software engineers wired for legacy code projects, what other types of software engineers might exist? In his conference talk, the founder of CorgiBytes discussed the following model which he dubs the "developer landscape:"

chart1.jpg

Each quadrant in the model defines a particular type of developer. In the upper left hand corner you have your hacker-maker. This engineer is more concerned with "building the right thing" rather than "building the thing right." This person loves to crank out prototypes, experiment, and fail fast. In the upper right hand corner you have the craftsman-maker. This is your engineer who is more concerned with "building the thing right," rather than "building the right thing." This person wants to steadily apply the principles of SOLID. At the start of my career, I viewed this quadrant as the one strength that defined a good software engineer.

Below the x-axis are the code menders. The lower right hand corner represents the craftsman-mender. This is the type of engineer that CorgiBytes selects for. These engineers love to gut a nasty piece of code and replace it with something clear and maintainable. Finally, you have the hacker-mender in the lower left hand corner. These are the firefighters of your system. When a service is exploding or customers are in trouble, they parachute in and save the day. They are motivated by driving to a resolution quickly.

All of these strengths are needed in the context of a company like AppFolio that develops and maintains software services because each strength maps to a phase in the software lifecycle. Nothing truly new happens without the rapid prototyper, your system is quickly crippled by technical debt without your SOLID engineer, your codebase becomes legacy without your remodeler, and your services stop running without your firefighter.

Finding the right fit

In my first software engineering job, I worked in a context where the needs were almost solely SOLID and remodeler strengths. I enjoyed this work and I learned much. But I was unaware of the rapid prototyper and firefighting quadrants. I didn’t know they existed.

Things changed for me when I joined AppFolio. AppFolio’s organizational structure and focus on generalist teams suddenly exposed me to the entire developer landscape. I quickly learned that my real strength shines in the hacker quadrants. I can be quite happy in the craftsman quadrants, but I feel most alive in my work when I’m operating as a hacker. I wish I had learned this about myself sooner!

There is amazing power in finding the right fit for an engineer based on their strengths. I’ve been involved in several projects where we stacked the deck with "the best engineers," only to see these projects stagnate. We were evaluating whether an engineer was "the best" separate from the need they were intended to fulfill. Once we began asking, "which is the right type of engineer for this project?" we saw that different strengths were needed. When we made changes accordingly, we saw those projects come to life and succeed beyond what we even imagined.

Where do you fit?

What about you? Are you a great software engineer? As I said above, we have many opportunities at AppFolio for all of the developer types, as well as opportunities for developers who want to gain experience in new areas. Our organizational structure lends itself to exploration, and our journey as a company into new business verticals guarantees new adventures for many years to come. Come join us!

Ruby Memory Environment Variables - Simpler Than They Look.

You've probably seen some of the great posts on how you can use environment variables to tune Ruby's memory use. They look complicated, don't they? If you need to squeeze out every last ounce of performance, they can be useful. But mostly, they give a single, simple advantage:

Quicker startup time. More specifically, quicker time-to-full-speed.

You can configure your Ruby process with more memory slots or looser malloc/oldmalloc limits. If you don't, your process will still grow to the right size if it needs it. The only reason to set the limits manually is if you want your process to grow to full size and speed a little more quickly. If you're running a big batch job or a long-running server, the environment settings won't matter much after the first hour or so, and only a little after the first few minutes - your process will figure it out quickly in any case.

Why the speed difference? Mostly because when Ruby is still figuring out the right size for your process's memory, it has to garbage-collect a little more often. That slows things down until it hits its stride.

There are also some environment variables that set how fast to expand. Which, again, basically just affects the time to full speed -- unless you mess them up :-)

But I Really Want...

But what if you do want to set them for some reason? How do you know what to set them to?

I find that Ruby does a fantastic job of figuring that out, but it may take some time to do it. So why not use your same settings from last run?

That's what EnvMem does.

You run your process, dump the current settings (via GC.stat) and then use them for the next run.

There's hardly any reason to use a dedicated tool, though - if you look at how EnvMem works, it only loads a few entries from GC.stat into the corresponding environment variables. The tool is just executable documentation of which GC.stat entries correspond to which environment variables.

The three variables that it sets -- RUBY_GC_HEAP_INIT_SLOTS, RUBY_GC_MALLOC_LIMIT and RUBY_GC_OLDMALLOC_LIMIT -- are the ones that get your process to the right initial size. And doing it based on your previous run is better than any other method I know.

For most applications, let them run for a minute or two until the settings are automatically set correctly. If your application doesn't run that long, then congratulations - these aren't things you need to worry about. If you need fast startup time, use EnvMem. Or just do the same thing yourself, since it's easy.

But What About...?

This all sounds reasonable, sure. But what about those last few variables? What about the ones that EnvMem doesn't bother to set?

You can tune those, sure. Keep in mind that if you tune process size, then you should not tune the other variables exactly like you would for a new process.

Specifically: for a new process, you want to make sure expansion is fast and happens in big chunks, so that you have a nice low startup time. For a process that is old and carefully tuned, you want to make sure expansion is slow and happens a little at a time so that you don't waste too much memory.

Ruby has several "max" variables to prevent adding too much of anything at once. That can be disastrous if they're set too low - it means expansion happens very slowly, so full speed only happens after the process has been running for many minutes. But for a mature, well-tuned application, good "max" values can prevent bloating by allocating too much of a resource at one time.

So with that in mind, here are the last few variables you might choose to tune:

  • RUBY_GC_HEAP_FREE_SLOTS_GOAL_RATIO: for a fast-growing app you might set this low, around 0.3 to 0.6, to make sure you have lots of free slots. For a mature app, set it much higher, even up to 0.8, to make sure you're not wasting much memory on unused slots. Keep in mind that you need free slots for new objects in new requests, so this should basically never be higher than 0.95, and rarely higher than 0.8.
  • RUBY_GC_HEAP_GROWTH_MAX_SLOTS: this is the cap on how many new slots can be added at once. I find the defaults work great for me here. But if you're actually counting slots allocated on new requests (via GC.stat) it may make sense for you to limit the number of maximum slots allocated. If you aren't counting slots with GC.stat, please don't set this manually. Don't optimize before you profile.
  • RUBY_GC_MALLOC_LIMIT_MAX: this determines the fastest rate you can raise RUBY_GC_MALLOC_LIMIT, which in turn determines how often to do a major GC (one that checks the old-generation objects, not just new.) If you're using GC.stat and watching the malloc_increase_bytes_limit, this determines how fast to raise that (at most.) Until everything in this paragraph sounds straightforward, please don't customize this.
  • RUBY_GC_OLDMALLOC_LIMIT_MAX: this is like RUBY_GC_MALLOC_LIMIT_MAX, but it affects the oldmalloc limit instead of the malloc limit. That is, it affects how often you get major GCs in response to the old generation increasing in (estimated) size. Again, if this all sounds like Greek to you the you're happiest with the default settings - which are pretty good.

Happy Tuning! Better Yet, Happy Not-Tuning!

So then, what's the upshot? If you're just skipping down this far, my recommended upshot would  be: Ruby mostly tunes its memory configuration wonderfully, and you should enjoy that and move on. The environment variables don't make a difference in the long-term runtime of your application, and you don't care about the (tiny) difference in startup/warmup time.

But let's pretend you're looking for even more detail about tuning and how/why the Ruby memory system works the way it does. May I recommend the slides from my RubyKaigi presentation? Don't skip the presenter notes, most of the interesting details are there.

 

Upgrading Rails 4 Controller Tests to Rails 5

At AppFolio, we're finally in the process of upgrading many of our Rails applications from Rails 4.2, up to Rails 5.2. Our first, and biggest step is to upgrade to Rails 5.0. While there are many parts necessary to complete this upgrade, I would like to share a few things we have done specifically to address the backwards incompatibility Rails 5 introduced in controller tests.

In case you aren't aware, it was common in Rails 4 to write controller test request methods in the following syntax:

get image_path, id: '12'

In addition to providing a params hash, perhaps you wanted to include something in the request's session, and pre-set a flash message as well. That can be accomplished via:

get image_path, { id: '12' }, { user_id: '39' },
    { success: 'page successfully created' }

Note that the first hash corresponds to params, the second hash session variables, and the third, flash values as indicated in the Rails 4.2 testing tutorial.

In addition to the get method, the test request methods, delete, head, patch, post, and put also existed in Rails 4.2. Finally, if you wanted to mimic an asynchronous request, one would instead use the xhr method, which aside from the first argument to indicate the HTTP verb, has the same method signature as the previously listed test request methods [ref]. An xhr method call might look like the following:

xhr :post, image_path, { title: 'New Image' }, { user_id: '39' }

In Rails 5, two things have changed: first, passing params, session, and flash by positional argument are no longer supported, and second, the xhr method no longer exists; it is replaced by adding an xhr: true keyword argument to one of the test request methods. Keyword arguments are awesome, and were introduced in Ruby 2.0, with required keyword arguments being introduced in Ruby 2.1 [ref].

The three prior test examples should be written for Rails 5 as follows:

get image_path, params: { id: '12' }

get image_path, flash: { success: 'page successfully created' },
    params: { id: '12' }, session: { user_id: '39' }

Note that in the above the order of the keyword arguments is irrelevant; I like to keep mine sorted.

post image_path, params: { title: 'New Image' }, session: { user_id: '39' }, xhr: true

Making any individual change from the Rails 4 syntax to that of Rails 5 is pretty trivial, however, it can be incredibly tedious and error prone when there are thousands of such invocations. To support this part of our upgrade to Rails 5, we have written two open source tools, and utilized another well known open source tool, Rubocop.

First, we wanted a way to support the Rails 5 syntax in Rails 4. Doing so would enable us to stop writing tests the old way, and instead write tests the new way. Additionally, once we've upgraded part of our application to use the new syntax, we wanted to enforce using only the new syntax in that part of the application so we didn't have to repeat the update process multiple times until we finally switched a project to depend on Rails 5. These two objectives were met through the introduction of our open source rails-forward_compatible_controller_tests gem.

This gem provides the ability for a Rails 4 application to use keyword arguments with the test request methods, as well as use the xhr: true keyword argument. Furthermore, this gem can be configured to do nothing, output DeprecationWarnings, or raise an exception when using the old syntax. The DeprecationWarning configuration is perfect while in transition, and raising exceptions is useful both while making a complete transition to the Rails 5 syntax, and afterward to ensure no regressions are introduced.

With the rails-forward_compatible_controller_tests gem in place, all that was left was to convert the thousands of Rails 4 test request method instances in our codebase. Fortunately, one tool was already available to aid in this effort. That tool is rubocop; more specifically rubocop used in combination with its Rails/HttpPositionalArguments cop.

By running the following, rubocop will autocorrect most uses of test request methods, except for uses of the xhr method:

rubocop -a --only Rails/HttpPositionalArguments

To support automatically fixing those xhr instances, we wrote and open sourced another gem, rails5_xhr_update. One way to utilize this gem is by looking for all files containing "xhr :" and passing those files to the rails5_xhr_update program with the --write option set to indicate overwriting the existing files like so:

git grep -l "xhr :" | rails5_xhr_update --write

By following the aforementioned steps we've made significant progress towards our Rails 5 upgrade. Of course, there is still a lot of work to do, and Rails 6 is on the horizon. If you’re interested in helping continue to support future versions of Rails while delivering exceptional value to our customers please see http://www.appfolioinc.com/jobs.
 

To Sleep, Perchance to Dream: Rails Ruby Bench and Sleepy GC

Hey, folks! It's been a few weeks since my last post about Rails Ruby Bench, so let's talk about some things you don't see it do, but it does behind the scenes! We'll also talk about an interesting new performance change that may be coming to Ruby 2.6.

That change is Eric Wong's Sleepy GC bug report and patch. With SleepyGC, Ruby will garbage collect with spare (idle) cycles. If you're just here for the latest Ruby development news, skip this post and click the bug report. Original sources are always more compete than commentary, right?

(Want to skip the narrative and go straight to the upshot? Skip to the bottom -- look for "So Did It Work?")

A Little Story

A few days ago, the excellent Sam Saffron of the Discourse team asked for my opinion on a pending Ruby speed patch. Yay! Rails Ruby Bench exists for this exact purpose: when a new Ruby patch comes out, I check how much it speeds up Rails (or slows it down.) And then you all know!

Of course, just lately I've been working on scaling out the benchmark itself, as my Ruby coordinator post suggests - right now, even if Discourse scales up just fine, my benchmark tops out at an EC2 m4.2xlarge instance. Above that I'm not configuring enough connections to Postgres, so I can't run enough threads and processes to use all that capacity. Working on it!

As a result, I haven't been constantly running RRB on the latest head of Ruby, because I've been working on other stuff. Which means my results are a bit out of date. Ruby also keeps getting faster, and the speedups keep getting smaller. This should make sense -- you've seen my "look, faster Ruby!" numbers, and it keeps getting harder to get large speedups after the last few hundred large speedups. Which means I need to crank up the number of requests per run and the number of runs per batch to keep pace. The Ruby core team are good at what they do! At the moment my "quick, rough check" numbers are 10,000 HTTP requests per run and 30 runs per batch (that's 200k HTTP requests) for reference, and that doesn't catch really small differences! And that still gets occasional outlier runs, so that's definitely not enough to check, "hey, does this sometimes cause random slowdowns?"

When I checked, things were a little broken. There was a slight speed regression for late March and early April Rubies, and a slightly older version (around May 1st) wouldn't run Rails Ruby Bench at all - the requests just didn't return.

So here's a "thanks for reading" takeaway for you -- don't run your production infrastructure on untested, non-release Rubies from random dates in the repository. ;-)

But here's another: the speed regression didn't last. Even when they're mostly testing on non-Rails code, mostly the Ruby core team do a great job keeping everything in a good shape - small problems tend to be caught and found rapidly, even in the long gaps between releases and previews.

Eventually I found a working Ruby, got a nice stable Rails Ruby Bench performance baseline just before the Sleepy GC patch, and ran a big batch of tests on Ruby 2.6.0 preview 1, the Ruby right before Sleepy GC, and Sleepy GC version 3 from Eric Wong's repository.

Wait, What's Sleepy GC?

Normally Garbage Collection (GC) runs when you've allocated a lot of memory, or when your process is running low and needs more. In other words, normally you reclaim old unused memory when you need memory -- and not before. You can manually run garbage collection before that in most languages (including Ruby) but that's not especially common.

It can be hard -- or impossible -- to avoid random pauses in your program if you use garbage collection. That's one reason that GC tuning is such a big deal in the JVM, for instance. Random pauses aren't necessarily a problem for every workload, but ask a game programmer about GC some time and you'll see what's wrong with them!

Ruby normally has "idle" times, such as when it's waiting for a file to be read, or network packets to arrive, or a database query. There can also be idleness from explicit sleeps or delays if the Ruby process is trying not to use more CPU than necessary. In all of these cases, it may make sense for the garbage collector to do some of its work in the idle time rather than making your program wait when you need memory.

Of course, if your Ruby process has lots of threads then you may already be filling this idle time with other work.

So Did It Work?

The short version is: the current Sleepy GC doesn't do anything for Rails Ruby Bench. If you think for a second, this should make sense - RRB runs a giant concurrent workload flat-out from startup until shutdown, overloaded with threads so that every CPU is running Ruby code constantly. There are no unfilled idle cycles. So Sleepy GC neither speeds up nor slows down RRB detectably -- which is a win, if it speeds up other workloads. Sam Saffron suggests it may do well for Unicorn servers, for instance. That makes sense - Unicorn runs one thread per process, so it may have lots more idle time than a heavily-multithreaded Puma workload like RRB. Sleepy GC may be useful, but RRB is a terrible way to find out one way or the other. That's fine. No benchmark shows you everything you care about, and it's important to know which is which.

While from my viewpoint, it was a great success! I have determined to my own satisfaction that there aren't lots of idle cycles for GC that I'm not capturing, so RRB did what it should have!

If you have a workload that you think may benefit from Sleepy GC, you can also try it out yourself. Sam Saffron says it helps certain Postgres workloads quite a lot, for instance. As of this writing, the latest branch is "git://80x24.org/ruby.git" on branch "sleepy-gc-v3". But read the bug report for the latest, always.

Ruby and Haskell: Culture is What You Don't Say

I'm working through a Haskell book with some friends. Learning something new is always good! But it's also because I write and teach Ruby. Learning from other communities helps me notice the cultural differences between, say, Haskell and Ruby.

I'm working with the excellent book "Haskell Programming from First Principles." It's far and away the best Haskell instruction I've found so far. It's easy to look at weirdness in bad instruction and say, "oh, this just isn't very good." But when you see things that seem weird in a first-class book like this, you're usually looking at a cultural difference.

Am I trashing Haskell? Or Haskell culture? Oh, heck no. I am really glad there are purists out there doing their thing. I'm thrilled to be learning from them. I'm very impressed with this Haskell book. Explaining unusual new concepts is hard.

But let's look at some differences between their culture and Ruby culture, shall we?

Judging Haskell by Its Cover

Our intrepid authors acknowledge that Haskell is known for being hard. To quote them:

There’s a wild rumor that goes around the internet from time to time about needing a Ph.D. in mathematics and an understanding of monads just to write “hello, world” in Haskell.
— Haskell Programming from First Principles (Allen and Moronuki)

Here's what's interesting about that: their entire first chapter is explaining the Lambda Calculus, before even talking about how to install the Haskell environment. Not just conceptual explanation, but in-depth math with work-it-out-for-yourself math exercises. They also say they strongly recommend not skipping it, and that (much) later chapters will make more sense if you know the math. They know that the math is intimidating to beginners. They respond by jumping very far into it, very rapidly.

Is that wrong? I don't think so. It's a very un-Ruby-ish cultural choice. Which is fine for a Haskell book, right? If you see something unfamiliar, mostly you need to not be intimidated by it in Haskell. If you need it spoon-fed, you're probably in the wrong place.

Ruby tries really hard to have a gentle learning curve. It doesn't always succeed, but it tries very hard, to the point of rewriting all sorts of things in Ruby, documenting and testing to a fault, and generally beckoning folks in with "look how familiar this looks!" It's not that one way or another is better. The Haskell method will give you a fearless community with a "ho-hum" attitude to code that looks scary. If that bugs you, the door is that-a-way. The Ruby method gives you a lot of beginners (yay!) who sometimes need and expect more hand-holding. We like our way, but I can't really say what we do is right and what they do is wrong. I can say that you wind up with very different groups as a result.

This is by far the simplest, most approachable guide to Haskell I have ever seen. They try really hard to not require lots of up-front math, compared to nearly anything else. One of the authors learned programming more-or-less for this book. And they still open with the lambda calculus before "here's how you install Haskell" or any code whatsoever. The entire current Haskell community has learned from this or from much less friendly sources.

Speaking in Math

Haskell is well-known as pretty math-heavy. That makes sense. Even in a book that is very intentionally not as "all math all the time," here's an example description from chapter 2:

When we talk about evaluating an expression, we’re talking about reducing the terms until the expression reaches its simplest form. Once a term has reached its simplest form, we say that it is irreducible or finished evaluating. Usually, we call this a value. Haskell uses a nonstrict evaluation (sometimes called “lazy evaluation”) strategy which defers evaluation of terms until they’re forced by other terms referring to them.

Values are irreducible, but applications of functions to arguments are reducible. Reducing an expression means evaluating the terms until you’re left with a value. As in the lambda calculus, application is evaluation: applying a function to an argument allows evaluation or reduction.
— Haskell Programming from First Principles

That doesn't exactly require you to already know the math. But I feel very confident saying that if you find math intimidating, you will find that explanation intimidating as well.

This is, again, a major departure from how the Ruby community does it. In other words, it's another way in which their community is intentionally different. This is another case of, "we're going to explain it simply, plainly and in our own vocabulary, which often happens to be the same as mathematical vocabulary. If you're not already with us, we hope you'll catch up later."

Later in chapter 2, they say, "your intuitions about precedence, associativity, and parenthesization from math classes will generally hold in Haskell." So when they talk (sincerely!) about how you don't have to know that much math, understand that they're talking to an audience for whom the phrase "your intuitions [...] from math classes" is reasonable and unremarkable.

So... Haskell Unreasonably Assumes You Already Know Everything?

You might reasonably and fairly ask me at this point, "are you saying that Ruby is easier and better at explaining everything, then?" Not so much. Ruby has a different set of unspoken assumptions.

For instance, Haskell From First Principles takes its sweet time explaining modular arithmetic, much more so than you'd expect from the rest of the book. It goes into detailed examples and hits a lot of corner cases explicitly in a way it doesn't for other operations. Modular arithmetic is certainly no harder than several things it skims over. Instead, modular arithmetic is less immediately familiar to most mathematicians than to programmers. A Ruby guide wouldn't usually call it out in such detail because historically, most Ruby programmers come from a language like C or Java that already has modulus built in, most frequently as the percent-sign operator.

In fact, the famous old free version of the Pickaxe Book for Ruby spends a lot of time waxing poetic about how Ruby has the excellence of two or three programming languages you presumably know (Perl, Python) plus one or two you mostly know by reputation (SmallTalk.) It isn't that Ruby makes no assumptions! Ruby's also okay with some quirkiness - have a look at Why's Poignant Guide to Ruby for an extreme-but-popular example.

I Didn't Say That! Though I'm Incomprehensible If You Don't Assume It.

One of the fastest ways to identify your culture is what you don't say. Haskell is fine if you're coming from math but don't know the "standard" C-descended-language idea of modulus, but very hard if you're not used to fairly abstract algebra. Ruby tutorials usually assume you've programmed in C or one of its descendants. They "know" you probably feel a little funky about Functional Programming and you probably don't have a math degree (even if you do -- I do!)

Neither one says this up front. They just say a lot of other things that casually assume them. If this "resonates" with you, it mostly means you're a match for their assumptions. Congratulations! It's always nice to find a community you fit in with. If it doesn't resonate with you, I have confidence you'll keep looking around until you find something that does. You seem resourceful that way.

Again, unstated assumptions aren't wrong. If you tried to state absolutely everything, you'd get another culture still, also with unstated assumptions (e.g. "we claim we have no unstated assumptions by virtue of cataloguing the obvious at great length - please pretend that completeness is possible in this universe.") Culture happens in the assumptions and what goes unsaid.

And the current cultures are neither right nor wrong. There may be some alternate Ruby universe where the founding Rubyists assume we all have math degrees, but we don't live in that one. Haskell could have come from a different group and speak in chemistry or biology analogies, but that's not where our world's Haskell community came from.

Can You Finish With a Moral Please?

It would be easy to tie this up with something smug on one side or the other. Nobody avoids having a preference about cultures, you know? It's easy to glibly say "Ruby is better because it's friendlier to novices" or "Haskell is better because it keeps the bar higher."

Try this as a moral, instead: don't just read and see if you get a good or a bad feeling. Listen to what gets said that makes you feel that way. Then, think about who it attracts or repels. Because culture isn't just in "learn this language!" books. It's in every part of the programming community - blog posts, Twitter, forums, talking in person.

Ruby has a very strong culture. If you're reading this, you're likely a part of it. There are problems coming, and storms to weather -- always, and as there always have been.

Don't just drink the culture around you. Learn to see it consciously, and learn to make it for yourself. Our local culture can use your help, and every culture needs more people who can see it consciously.

Ruby Coordinator Processes for Fork Servers

Often I think, "threads are really annoying. Why don't people use processes?" Then, I use processes. Usually as I'm thinking, "processes are really annoying. Why don't I use threads?" The joke's on me either way, of course.

Until Guilds happen, those are mostly our options, outside special cases that can use EventMachine or Fibers or actors or whatever. I tend to consider processes to be the lesser evil for general use. And by "general use" I mean "not on Windows."

But cleaning up processes can be ugly. Oh, man. If you don't catch all your dead children you get zombie processes. And that's not even the most gruesome mixed metaphor in Unix concurrency.

So: let's look at a pattern in Ruby for using a single coordinator process with a separate process group to wrangle the child processes. Consider this a sort of extended fork/pipe example for spawning child workers, showing one way to set everything up. It's a somewhat advanced pattern. Don't be dismayed if you have to reread this or play with the code a bit before you get it right.

I write Rails Ruby Bench. It's a sort of HTTP load tester. So that's the example I'm using. If I talk about passing around URLs and timings, that's why.

Ruby Fork and Pipes

With a little Googling, you can find an example of creating a child process in Ruby and opening a pipe between the two processes. It'll look something like this:

pipe_out, pipe_in = IO.pipe

pid = fork do
  pipe_out.close # For parent's use, not child's use

  output = do_something()
  pipe_in.write(output)
end

pipe_in.close
result = pipe_out.read
pipe_out.close  # Done with it now
Process.waitpid(pid) # Wait for the child's death and cleanup

So, okay. That's maybe a tad confusing, but not too bad. For those of you who don't write command shells in Unix all day, the call to fork is copying your Ruby process into two nearly-identical ones. There's a new process (called the "child") that will now only run the code inside that block, and your first process (called the "parent") will ignore the block and carry on as though the fork method didn't do anything. The "pid" above stands for "Process ID", and it's a big number like 32741. It's used to identify the process. You've probably seen them before in "ps" or "top" output or the Mac Activity Monitor.

The IO.pipe method returns two IO objects. If you write to one, the other can read it. We'll pass one into the child and keep one in the parent. Well, okay -- actually, we copied both sides of the pipe when we copied our process. So we'll close one side in the child, and close the other side in the parent.

The child does some work ("do_something" above) and then passes the result, as a string, through the pipe. The parent reads the result from the pipe. If necessary it will hang out doing nothing on that "read" for as long as necessary, or until the child dies.

You need the pipe because the two processes are completely separate. So you can't just assign a returned value in the child and have it show up in the parent. That's why we write in the child and read in the parent. It's also why we don't need to use a Mutex like we would with threads - the two processes are more separate than that, so they use a pipe instead.

The final Process.waitpid is telling your operating system, "when that child is finished, I'm done with it. You don't need to save me any more pipe output or whether the child succeeded or failed or anything. I'm all finished with that child process, you can clean it up." You can also call it without a process ID as the argument and it will just wait for any child to die.

That's not a coordinator. But it's the building block we'll be using to make one.

Mo Processes Mo Problems...

If you want many processes to each do work, you can do the above many times. That's the same basic principle behind app servers like Unicorn, or Puma's cluster mode, for instance. The traditional Unix "fork server" is exactly that. Normally you call the original process the "master" and the forked child processes "workers".

One difficulty of all this is cleanup. Normally a master process only manages the workers, because failures are hard. If you want to do other things in your master process (like Rails Ruby Bench does,) you'll want a coordinator process. Then the master process can calculate and handle input, the coordinator process can watch for dead children, and of course the children (a.k.a. "workers") will do the work. But you still have to keep track of everything to clean it up.

Unix and MacOS have a great tracking mechanism for watching and cleaning this stuff up: process groups. You may have used them before without knowing it. You know how if a process gets way out of hand you can use "kill -9" to kill it? The "9" part means the unblockable, uncatchable SIGKILL signal. Did you ever wonder what the minus is for?

It turns out the minus means "don't just kill this one process. Kill everything in its process group." A process group, by default, includes any other processes it forks off, so it cleans everything up, including all the child processes. This is magic that you, too, can use.

In other words: the top-level "master" process forks a coordinator. The coordinator sets up a new process group, then forks some (or many!) workers. When the task is done, terminate the coordinator's whole process group with extreme prejudice. Now you can be sure: it's all cleaned up.

Here's one way that can look:

def manage_workers(num_processes, &block)
  processes = []
  pipes = []
  num_processes.times do
    pipe_out, pipe_in = IO.pipe

    # Inside each process, run the block, print the result and exit.
    started_pid = fork do
      pipe_out.close
      val = yield
      pipe_in.write(JSON.dump val)
    end
    pipe_in.close
    processes.push(started_pid)
    pipes.push(pipe_out)
  end

  result = []
  pipes.each do |pipe|
    out = pipe.read # Or read in a loop until it returns ""
    result.concat JSON.parse(out)
    pipe.close
  end

  # Wait for all child pids
  until processes.empty?
    dead_pid = Process.waitpid(0)
    processes -= [ dead_pid ]
  end

  result
end

def run_coordinator(num_processes, &block)
  coordinator_out, coordinator_in = IO.pipe

  coordinator_pid = fork do
    coordinator_out.close
    pgid = Process.pid  # Get child's own pid
    Process.setpgid(pgid, pgid)  # Detach into new process group

    combined_output = manage_workers(num_processes, &block)
    coordinator_pipe_in.write(JSON.dump combined_output)
  end

  coordinator_pipe_in.close # For coordinator use, not parent use
  json_result = read_all_from_pipe coordinator_pipe_out
  return JSON.parse(json_result)
end

Is this a new idea? Kind of. It's not frequently done in exactly this way. But a very common method is to run a command that has a master process (e.g. Unicorn, Passenger, Puma in cluster mode) from your process. You're certainly allowed to kill off Unicorn or Puma with a "kill -9" if you want to. But you normally run the command with system or backticks, which fork-and-exec. The coordinator process is effectively moving that fork-and-exec into your own Ruby code.

(Rails Ruby Bench does both - it runs the load testers with this coordinator-based method. It runs the tested Rails server with Puma.)

Awesome! Now When Do We Do This?

Any useful pattern, library or data structure has a most important question attached: when do we use it, and when don't we?

The Coordinator pattern is needed when you want/need multiple child processes working on the same problem, but you also have work that belongs in a master process, at the top level. By separating worker management from the master-process calculation, you can make cleanup easier and separate the logic better.

When do you not use it? When you don't use multiple processes -- such as a single-process calculation, or something you do with threads, actors or events instead of worker processes. You also wouldn't use it if you just need to coordinate workers without doing top-level work -- then your coordinator is your master process, as though you were writing a fork server like Unicorn. And finally, you wouldn't use it if somebody has written a worker-coordinator process for you separately, like if you want to run a multiprocess HTTP server. In that case, just fork-and-exec their version, which replaces the coordinator. You can also set it up in its own process group if you like, using normal Unix methods.

Programming in Paradise

January sunrise over the Pacific Ocean in Santa Barbara, CA (2018)

January sunrise over the Pacific Ocean in Santa Barbara, CA (2018)

Once a month, I wake up very early, put on a warm jacket, make myself a cup of coffee, and sneak out of the house while the rest of my family is still asleep. I drive about ten minutes up highway 101, past hills, a vineyard, lemon orchards, and a couple of ranches, to a turnout on the side of the road where I can park the car. There is a short path I walk along to a cliff with a great view. I stand at the edge drinking coffee while I watch the sunrise over the Pacific Ocean.

Santa Barbara is a unique place where, because of its location and orientation on a coastal point, the sun both rises and sets over the ocean for half of the year. Sometimes I can watch the sun sink back into the ocean from a hill near my house after I leave work and before I return home to the tornado of energy that is life with three small children.

This is a blog post about software engineering in Santa Barbara (or programming in paradise, as some call it). I’m writing it as a reflection on my time here, because I think our environment has a big impact on how we see life and on the work that we do. Even a company like Apple recognizes this -- they proclaim, “designed by Apple in California,” on the packaging for their products.

I grew up in Los Angeles and decided to attend the University of California Santa Barbara after high school. I knew even in high school that my passion was coding. I figured, “what could be better than studying what I love in a beautiful city?” It was not long before I fell in love with this place. I had an ocean view all four years of my undergraduate studies. When I finished school, I decided I had found the place I wanted to call home and would find work at one of the tech companies built upon the engineering talent which UCSB attracts to the area.

Santa Barbara has a different feel from what you might think of as a beach city. Just a couple miles inland stand the impressive Santa Ynez mountains. These mountains offer hiking trails, waterfalls, and hot springs for people to explore. Santa Barbara is sandwiched between this tall mountain range and the ocean. The unique topography of the area creates a temperate Mediterranean climate unique to coastal California. The climate is well suited for vineyards, lemon orchards, and avocado orchards, all of which you can find throughout the city.

The city has a small town feel (despite the fact that it is 20 miles long). There is a high likelihood I will see someone I know whenever I visit the grocery store or a restaurant, and for me this leads to a strong feeling of connectedness with the community. This feeling of connectedness is even stronger within AppFolio, where the people I work with are the same people I love spending time with outside of work too. This is a different experience than what I had growing up in a big city like Los Angeles.

I think the connectedness we experience and foster within AppFolio because of our location in Santa Barbara has a profound impact on our business and in the lives of our customers. More than anything, we are a customer driven company. Everything we do is based on input and feedback from customers. We do this by connecting with them and with one another. We are constantly creating prototypes and putting them in front of customers to get feedback, we invite customers to our office, we send teams to visit customer offices, and we attend customer meet ups all over the country. I’ve even had a customer invite me and a team over to his house for BBQ and prototyping.

Just a couple weeks ago I traveled to Seattle with a team of engineers. We rented a big AirBnb together and hacked on code in the offices of two customers. We built a prototype in one office, and completed a shippable feature in another office in just four hours. To me there is nothing as satisfying as seeing the smile on a customer’s face because of something we created together.

Many of our customers are small, family operated businesses. When we work closely with them, I feel it is our family connecting with theirs. They trust us because we listen to them and we are top notch engineers. We love them because they give us a reason to come to work each day. We are a growing company but we believe that part of our success depends on us “keeping it small” so that no customer or teammate gets lost as a cog in a big machine.

We are a group of passionate engineers. We work hard and get the job done. We take pride in our work as engineers, and we never want to let our teammates down. In my experience, the best engineers are the ones who realize that being brilliant will only get you halfway there. The other half depends on a willingness to roll up your sleeves and make things happen. These engineers love problems because they recognize them to be opportunities.

We work exceedingly hard, but we are not running in a rat race. Our location in Santa Barbara reminds us that there is more to life than work. Across our engineering group we have avid rock climbers, cyclists, runners, surfers, volleyball players, soccer players, softball players, hikers, and kite boarders. Santa Barbara is a place that attracts people from all over the world because of its natural beauty, the influence of the university, and its deep cultural roots. There are many festivals throughout the year, including Fiesta, the French festival, the lemon and avocado festivals, the solstice parade, and the famous Santa Barbara film festival. Visitors and residents enjoy a variety of unique restaurants throughout the city.

Santa Barbara has much to offer beyond work, but for those times you need something from the big city, a short drive to the south will take you to LA. We are far enough from LA that the air is free of smog and the highways are not a constant traffic jam, but we are close enough for a day visit. To the north you can enjoy wine country. People from all over the world come here to vacation and we get to live here!

Santa Barbara is a special place to me and to so many of us at AppFolio. This is the place I met my wife. I proposed to her standing on top of the mountains overlooking the city and ocean. I loved what Santa Barbara had to offer when I was a college student who had just moved away from home, and I love it now as a place where my wife and I can raise our three young boys. The beauty of the city and surrounding nature, the people I have come to love, and the opportunities I have to build something awesome with an amazing team are the things that keep me here.

We’ve got a thriving business at AppFolio and are hiring great engineers who want to have real impact in customers’ lives. If this is you, then come program with us in paradise!

November sunset over the Pacific Ocean in Santa Barbara, CA (2017)

November sunset over the Pacific Ocean in Santa Barbara, CA (2017)

Ruby 2.6 and Ahead-Of-Time Compilation

Ruby 2.6 preview 1 has optional JIT that you can turn on with a command-line switch. It also has a mode where you can tell it to wait for JIT before running your code, which is marked as a "test" option. But can you just turn it on and get Ruby AoT for our Rails Apps?

Let's check!

I maintain Rails Ruby Bench, so that's what I'll be playing with here, but the JIT and AOT advice should apply to most large Ruby apps. Also, keep in mind that JIT has only just happened and it's not recommended for Rails apps yet - you should expect things to change a lot after April 2018, when this article was written.

How Can We Do It?

For current Ruby 2.6, you have to turn on JIT explicitly. I use this:

export RUBYOPT=--jit

But any way you set the command-line option is just fine.

Then you run your app. For longer-running, smaller-size apps, JIT should just magically make it faster. If that's what you were after - congratulations! You can skip the rest of this article and play Plants Vs Zombies. You have my permission, as long as it's the first one - the sequels aren't as good.

But I found that this slowed down Rails Ruby Bench instead of speeding it up, just like the preview announcement said it would. D'oh!

If you run "ruby --help" you'll see some new JIT-related options:

MJIT options (experimental):
  --jit-warnings  Enable printing MJIT warnings
  --jit-debug     Enable MJIT debugging (very slow)
  --jit-wait      Wait until JIT compilation is finished everytime (for testing)
  --jit-save-temps
                  Save MJIT temporary files in $TMP or /tmp (for testing)
  --jit-verbose=num
                  Print MJIT logs of level num or less to stderr (default: 0)
  --jit-max-cache=num
                  Max number of methods to be JIT-ed in a cache (default: 1000)
  --jit-min-calls=num
                  Number of calls to trigger JIT (for testing, default: 5)

Hey, "--jit-wait" looks like exactly what we're looking for. So then we turn it on and we have Ruby AoT compilation? Not quite.

If we run RRB with just "--jit --jit-wait", it will hang forever as far as I can tell.

It turns out the new JIT has a cache of compiled methods, and that cache has a maximum size. And a Rails app is generally too big for it, and Rails Ruby Bench is definitely, no question, way too big for it and it winds up basically hanging. But we can turn the size up with --jit-max-cache!

So we set it up like this:

export RUBYOPT='--jit --jit-wait --jit-max-cache=100000'

And then we run the benchmark. And we get our next indicator that something's wrong.

It doesn't hang, exactly. It just starts up veeeeeery slowly. Like, getting to the point where it would make an HTTP request takes multiple minutes. And then it fails, probably because Puma's HTTP delays aren't set up for HTTP requests taking that long.

A Little Bit About Ruby's New JIT

You may remember from some of the previous writing about Ruby's new JIT that it works in an unusual way. It writes out C source files to a temp directory, compiles them to shared libraries and then loads them in a bit like native extensions for gems. That's a perfectly good approach, and it doesn't cause any problems here.

Ruby waits to compile a method until it has called that method a few times. Sure. Takashi pointed out that it's better not to tell it to compile the method the very first time we hit it (which we aren't, but we could with --jit-min-calls above) because if we wait a bit, we'll get better invocation data so the final result will be faster. Okay, but that's still not the problem.

We have a background thread that sits and compiles methods, one at a time, in this way. This is closer to the problem we're seeing...

If we run Rails Ruby Bench and tell it to wait until we have compiled tens of thousands of methods, one at a time, in a single background thread that runs the C compiler for every individual method... Yeah, okay. That looks like the problem.

Just to add insult to injury, we're also loading a lot of stuff into that cache and it eventually gets big. But don't worry - you don't have enough patience to wait until it gets really huge. You were hoping to speed up your Rails app... And this isn't going to feel very fast to you. It adds minutes of time to startup, which causes enough race conditions that the resulting app won't run anyway.

So when the Ruby option said it was just for testing, it meant it.

I Just Skipped To The End - Does This Do AoT Or Not?

The short version is: there's not really an AoT compile option in Ruby 2.6. The JITting options simply don't do that in a useful or acceptable way. You're far better off using it in the recommended way.

And for now, the recommended way for Rails apps is: don't. But that's likely to change soon. The Ruby release (2.6) with JIT is still in preview. There's a lot of polishing-up to do yet.

Rails Ruby Bench: What Is It and Why Should You Care?

Recently the brilliant and accomplished Chris Seaton asked me what the difference was between Rails Ruby Bench and the normal Discourse benchmarks, as seen on ruby-bench.org.

Plus I keep writing about RRB and linking to the code on GitHub. That's not terrible, but it's not exactly friendly. So let's talk: what is Rails Ruby Bench? Why should you care?

(Spoiler: if you mostly care about Ruby on Rails performance on a big server or VM, Rails Ruby Bench is the closest to "your" benchmark for Ruby that you'll find.)

The Very Basics: What Is RRB?

Rails Ruby Bench uses Discourse, a real-world Ruby on Rails app, and a simulated realistic workload to benchmark the speed of Ruby. So: what is "real-world"? Let alone "simulated realistic workload?"

First off, if you have 40 minutes here's what I said about that at RubyKaigi in Hiroshima in 2017, along with a lot of the information about Ruby's speed that came from RRB. I'm pretty proud of it!

But for the quick version: Ruby benchmarks historically tend to be about the language's raw speed, and tend to be pretty small. OptCarrot is a great example - a very solid benchmark that still doesn't use Rails or concurrency or much garbage collection. A benchmark like that is really easy to work from and it has a lot to do with the language itself. That's exactly the kind of major benchmark you want if you're implementing a language. Which is why they did!

But when you're telling the community about your speedups, folks want to know: but what about lots of threads or processes? What about I/O-bound and memory-bound applications? And in Ruby they ask, "but what about Ruby on Rails?"

Those are great questions. RRB tries hard to answer them.

How Does It Do That?

Rails Ruby Bench uses a real-world, typical, in-production open-source Ruby on Rails app as its basis. That would be Discourse, software for forums built by a real commercial company. It has all the little hairy bits of DDOS-protection and security and stuff that Real Apps do and benchmarks usually do not. Then it generates a lot of repeatably-random but realistic HTTP requests browsing through topics, saving drafts, posting comments and so on. It mimics a lot of users doing their thing and determines how much load Discourse can process with that Ruby version, basically.

RRB uses a large dedicated EC2 instance and loads it up to full capacity, running flat-out. Then it measures how fast 10 processes and 60 threads can process the requests. This is a pretty realistic setup and looks a lot like how small-to-mid-size startups do it and what they care about.

It's a benchmark, so what does it do badly? It runs the load-test process and the database on the same instance as the benchmark - that's great for low-variance results (no network, no noisy neighbors) but is absolutely not a realistic setup. It also runs with only a single application server and no load balancer, so it tests nothing about the scaling of load-balancing or the database. That's because it's a benchmark for the Ruby language, and neither of those components normally use Ruby. Similarly, it runs without a reverse-proxy in front (no NGinX) and the load-tester doesn't request static files -- I want to test Ruby, not NGinX.

Other good things about RRB: its basic setup has been vetted by luminaries like Nate Berkopec, Richard Schneeman, the aforementioned Chris Seaton and others. I've changed a number of things to reflect their observations and/or my previous screwups. So it's battle-tested.

It's also shown some fairly interesting things about Ruby's speed. Again, I recommend my RubyKaigi 2017 presentation for even more details.

More for Pedants (Scroll Past If You Don't Care)

What about failed requests? If any request generates a 500-series error I throw away the whole run, normally between 4,000 and 6,000 requests, depending on what's being benchmarked. As Ruby gets faster and I measure smaller effects, the number of HTTP requests per run has to go up to get accurate results.

I most frequently run with one warmup start/stop iteration, and 100 warmup HTTP requests. If I use a different number than that, I'll say so in the specific post about it. It's configurable when you run RRB.

I've used multiple Discourse versions over time, and I update slowly over time. It's hard to keep RRB usable across many Discourse versions as the code and HTTP request format changes -- that's just a hazard of using a real-world app.

Why RRB Is Different

The standard Discourse benchmarks request the same URL many times in a row using ApacheBench, then measures the speed with a simple "best time out of N" metric -- though it also does one run where it measures the memory usage. RRB requests different URLs in different orders with different parameters, which affects the speed of the resulting benchmark and gives a "blended" result from many different URLs. As a result, you'll often see Ruby optimizations where different standard Discourse benchmarks give different amounts of optimization and RRB shows a weighted average of those speedups - sometimes a particular Ruby optimization is much faster for some URLs than others.

Smaller benchmarks like OptCarrot often give very different results than RRB. They tend to measure CPU time, while RRB effectively measures memory and I/O performance as much or more -- it's a bunch of parallel Ruby on Rails processes running in Puma, so CPU time isn't the be-all end-all that you see in a much smaller process.

RRB's primary difference from straight-up I/O or garbage collection benchmarks is that it does measure CPU performance. You might expect Ruby on Rails to not care about CPU performance, and to be purely CPU- and memory-bound, but that's not what RRB's results show. There's plenty of room to optimize Rails by optimizing CPU performance, though that's not the whole story.

Is Rails Ruby Bench All Done Now?

Benchmarks often have a kind of "shelf life" built in - by their nature they show a single metric, and at some point you've gotten nearly all the benefit from that metric.

And RRB has needed some adjustments - Discourse version upgrades, for instance, and running more full runs and more HTTP requests per run.

The best evidence I can give that Rails Ruby Bench isn't done is this: I keep finding new, interesting slants for it. Right now (April 2018) I'm trying to measure how much extra speed you can get from Rails by saving memory, and using RRB to do it. I'm trying to figure out how much of MRI's warmup time (it has some!) is from the memory setup and can be removed with environment variables. And soon, RRB will be my go-to metric to see: does JIT speed up Rails? What settings work best?

I think Rails Ruby Bench has more interesting questions to answer in the next year or three. There will, of course, be new efforts of all kinds. And Ruby 3x3 could use more benchmarks as well. But RRB isn't done yet.

 

Ruby 2.6 preview 1: Timing JIT

The new Ruby 2.6 preview 1 has JIT capability built in. Awesome! But it's still early. They say JIT doesn't help on Rails apps, for instance.

Purely by coincidence, I happen to write a big concurrent Rails-based benchmark, which Takashi was hoping to see JIT results for. And I'm freshly back to part-time work after paternity leave.

So how is its performance for Rails apps? Let's find out.

(Disclaimer: Takashi says that 2.6 head-of-master has significantly better JIT performance than prerelease 1. And I'll get around to timing that soon, too. But for now let's go with the 2.6 prerelease.)

Some Graphs

There's a way I usually graph this stuff. And several people have pointed out that I could do better with a line graph. And they're right, I totally could. So let's look at this how I usually do it, and then with some (I think) improved graphs.

I like the rainbow thing this graph has going. It's pretty. But commenters are right that it could be much clearer.

I like the rainbow thing this graph has going. It's pretty. But commenters are right that it could be much clearer.

That bar graph lets you know: Ruby 2.6.0 prerelease 1 isn't much faster than 2.5.0. But how close? And the 2.6.0 bars with JIT (far right) are higher, so it's slower. But how much higher/slower? I usually clarify with a table, which kind of makes the graph redundant. Here's what that looks like:

Percentile Ruby 2.5.0 Ruby 2.6.0 w/o JIT Ruby 2.6.0 w/ JIT speedup w/ 2.6 speedup w/ 2.6 JIT
0% 29.79 sec 29.69 sec 32.21 sec 0.35% -8.13%
10% 32.62 sec 32.01 sec 36.34 sec 1.85% -11.42%
50% 34.35 sec 33.94 sec 38.34 sec 1.20% -11.60%
90% 35.35 sec 34.89 sec 39.58 sec 1.30% -11.95%
100% 36.75 sec 35.92 sec 40.79 sec 2.25% -11.01%

It says pretty much the same thing: Ruby 2.6 is a tiny bit faster (let's call it 1.5% faster.) And with JIT it's much slower, more than 10% slower. Keep in mind this is a big, highly-concurrent Rails-based benchmark, which is exactly where we were told JIT was slower.

Still, we can do a better job presenting this data, I think. What if, instead of looking at a few representative percentiles of the full-run times, we took all 120,000 requests per Ruby (20 full runs, each with 6,000 requests,) sorted them from fastest to slowest, and overlaid them like a CDF? I think that would give us a pretty good view of how much faster or slower it is. Here's what that looks like:

I don't feel like this is as pretty. There are things I could do to improve it, obviously. But the biggest problem is that it's hard to estimate the total area between the curves in a wide, shallow graph like this. But I agree - this is an improveme…

I don't feel like this is as pretty. There are things I could do to improve it, obviously. But the biggest problem is that it's hard to estimate the total area between the curves in a wide, shallow graph like this. But I agree - this is an improvement.

Note that a small difference like the one between Ruby 2.5 and 2.6 is the worst case for a graph like this. It's about a 1.5% difference, as we saw in the table above. In fact, it's much smaller than that -- the 1.5% difference was aggregated and included a lot of the longer requests, while most requests on this graph are nearly the same between Ruby versions. Very few graphs will do a good job of showing that. And even the 2.6 with/without JIT difference isn't massive, at around 10%. Still, it's hard to recognize even the biggest, most important features of this graph, such as the fact that the slowest requests, with JIT, are much slower. And that's what you'd hope a graph like this would be best at.

Still, it's worth a look at the full-run version rather than the per-request version. Anything that shows every individual request, all 360,000 of them, is going to be, at best, too much information. What about just the 18,000 most important points, the aggregated run times?

Here's what that looks like:

This makes you appreciate the buttery smoothness of the version with far too many points, doesn't it? 18,000 points sounds like a lot when I write it, but it's not really that huge.

This makes you appreciate the buttery smoothness of the version with far too many points, doesn't it? 18,000 points sounds like a lot when I write it, but it's not really that huge.

This is the clearest graph so far, no question. By aggregating the full-run times, you can see that there's actually a lot of difference between the Ruby versions, even if most individual requests are very similar. In 6,000 requests, nearly every run is going to have something that is faster in 2.6, or much slower in 2.6 with JIT.

Also, the Y axis is zoomed in here. Notice that it runs from around 30 to 40 seconds, which is the basic spread for a full run of 6,000 requests for this benchmark. The individual-request graph had to start at zero because some requests are nearly instantaneous, while others take upwards of a second. This lets you see a lot more clearly what it means that the green and purple lines are "about 1.5%" apart - the fastest runs are very close together, the vast majority of runs are nearly a constant factor apart, and there are a few at the end that are outliers -- barely. As graphs go, this is a very orderly, neat one rather than a noisy one with lots of weirdness. Right now, Ruby 2.6 is a small, simple, uniform optimization and its JIT is a smallish, simple, uniform slowdown to this benchmark.

Methodology and Conclusion

Right now you don't want to use Ruby 2.6 JIT for your large, highly-concurrent Rails app, just like it said in the prerelease announcement. That makes sense. And don't worry, I'll be timing the newer 2.6 versions very soon. You'll find out when JIT breaks even for Rails Ruby Bench, and when it gets faster. I'll also try playing with different JIT settings a bit -- if I find anything interesting, I'll let you all know.

In case you haven't read my other articles on Ruby speed, this is all measured using Rails Ruby Bench (aka RRB.) RRB preloads Discourse with a bit of data and runs with with 10 Puma processes and 60 threads, then shoves pseudorandomly-generated HTTP requests through as fast as possible on a single large EC2 dedicated host. This gets more predictable benchmark results than you'd think, for reasons you can read about in my previous posts and on GitHub.

So: when you read about "how fast Ruby 2.6 prerelease 1 is" in this article, you're finding out how its speed looks for a large, real-world, highly-concurrent Rails workload. Other workloads will vary -- the Ruby 2.6 JIT is much faster on optcarrot, for instance.

Benchmarking Ruby's Heap: malloc, tcmalloc, jemalloc

Last week's post talked about different kinds of Ruby objects: some are contained in the 64-bit reference directly, some use up a 40-byte "Slot", and some use a Slot and a chunk of heap. Let's talk about that last set of objects.

"The Heap" isn't specifically a Ruby concept. It's a standard part of Unix processes. Other than garbage collection, Ruby doesn't do much that's special or unusual with the Heap in its processes. So: what's there to talk about?

It turns out that the heap does get managed. The C standard library has a "normal" malloc. But memory allocation is like everything else run by programmers: you have a bunch of different choices with subtle differences between them. And so you have several memory allocators to choose from. There are smart folks who strongly favor nonstandard allocators like tcmalloc and jemalloc, and get great results with them.

Also, I haven't measured anything with RailsRubyBench in a little while. I get itchy. You know how it goes.

(Do you just want to see the pretty graph? I totally understand. Scroll down, it's near the bottom after the explanation.)

 

What Does Malloc Do?

I won't dive too deeply into malloc -- the information is out there, and mostly it's not what you need to know for Ruby. But let's hit the basics.

Your process uses memory "pages" which it requests from the operating system. They're usually 4 kibibytes (4096 bytes), though it can get complicated. Your memory allocator figures out when to ask the operating system for new pages. It manages chunks of memory that usually aren't 4096 bytes, the ones your program asks for. If you return them later, it manages those too. So it often winds up with a kind of memory Swiss cheese as your Ruby program asks for various sizes of objects and hands them back in a different order.

(Doesn't Ruby use garbage collection? Sure. But when it frees your object automatically, it turns around and frees up that memory using whatever malloc implementation it's using. Just because you don't manually free the memory doesn't mean Ruby doesn't do that. Ruby is written in C, and behaves like it.)

Malloc needs to keep a list of what memory is used and free. It needs to update that list when you allocate or free memory in the Heap. You're asking for more Heap when your process asks for a Page full of Slots. But you don't touch the Heap when you use a Slot in a Page that Ruby already has.

 

What's the Difference?

If you enjoy reading C code, may I recommend the Dan Luu tutorial on implementing a really basic malloc? It's a great way to start thinking about what malloc does and how it does it. Of course, real malloc is normally a lot more complicated, but it's a good start.

There are a few different commonly-available malloc implementations, aside from whatever version came as part of your C standard library.

The two we'll talk about today are called tcmalloc and jemalloc. You can build or run Ruby with either one. Tcmalloc is part of the Google Performance Tools suite and keeps a thread-local cache for each malloc so you don't have to go to a single big pool of memory for every allocation on every thread. That's not going to help much for an only-processes Ruby application that doesn't use threads... But Rails Ruby Bench uses threading pretty heavily, so you'd think tcmalloc could help a lot.

Jemalloc is the old FreeBSD allocator, separated from FreeBSD. Like tcmalloc, it keeps per-thread chunks of memory and tries to avoid memory fragmentation. It comes highly recommended by Ruby performance luminaries like Nate Berkopec. Both allocators are good, and there are a few interesting differences between them.

So, shall we have a speed shootout?

 

How Fast?

I'm going to use Ruby 2.5.0 and Rails Ruby Bench for this. So I'm answering the question, "for a big concurrent Rails benchmark, what's the difference in total speed/throughput?" Memory benchmarks are a little odd this way. I'm measuring speed, but making some changes that clearly affect total memory usage. But the memory usage is basically capped by the fact that it's a single m4.2xlarge dedicated EC2 instance running ten Rails processes. So this benchmark answers how better memory efficiency translates into speed with a constant amount of memory, not how little memory you can use.

(Why do we care exactly what/how this measures? For starters, because there's probably a better way to tune Puma for lower memory usage per process. This shootout probably understates how much faster a more efficient malloc is, because it starts with a benchmark that has been carefully tuned for normal system malloc. You might be able to get better throughput with more Puma processes, for instance, or more threads per process. This benchmark doesn't measure that.)

So, what do we measure? I started with normal Ruby 2.5.0, and Ruby 2.5.0 with jemalloc and tcmalloc. I also tried using the memory environment variable settings the page suggested for tcmalloc, but they're entirely within the margin of error - in this benchmark, warmup is serving the same purpose, so on-boot memory settings don't matter enough to be measurable.

jemalloc_tcmalloc_full_runs.png

These are full-run times with Rails Ruby Bench, measured in seconds. In case you're not already familiar with RRB, I'm running Discourse with 10 processes and 60 threads on an EC2 dedicated m4.2xlarge instance in a don't-use-the-network configuration to reduce variation. It's a configuration that's meant to answer, "what's the speed difference for a highly-concurrent Rails application, running as hard as my fairly-large EC2 instance can handle?" There's a 30-thread load tester processes running 6000 requests/batch, and each of these configurations was tested for 60 batches. 60 batches of 6,000 requests gives 360,000 HTTP requests for each configuration. I throw out any batch that has an HTTP error, but in this case none of the 180 batches got any errors. It does happen for some benchmarks because HTTP is like that.

That graph is reasonably pretty, but it's hard to pull a specific percentage out. Let's put the numbers in a table and check percentages, shall we?

Percentile CRuby 2.5.0 2.5.0 w/ jemalloc 2.5.0 w/ tcmalloc speedup w/ tcmalloc speedup w/ jemalloc
0% 28.22 sec 24.45 sec 25.45 sec 9.82% 13.38%
10% 31.41 sec 27.86 sec 30.03 sec 4.40% 11.30%
50% 33.13 sec 29.40 sec 31.72 sec 4.27% 11.25%
90% 34.12 sec 30.28 sec 32.70 sec 4.15% 11.25%
100% 34.87 sec 31.17 sec 33.90 sec 2.80% 10.62%

Overall we're getting about an 11% speedup with jemalloc here, and a much more variable speedup, between 3% and 10%, with tcmalloc. That makes sense, and matches the reports I've heard of both jemalloc and tcmalloc. Note that this is with no additional tuning (e.g. no change in the number of processes or threads,) and none of these got a single server error in the 360,000 requests they each handled. So: reasonably solid.

(I've run these numbers and gotten as low as 9% speedup for jemalloc before as well. But the speedup is still in a similar range.)

When I check throughput rather than runtimes I usually get a smaller speedup - total throughput uses the slowest of the runs as the total time, so it nearly always shows less speedup. What's interesting here is that that's clearly true of tcmalloc... But jemalloc gets great results on throughput, suggesting a very consistent speedup (as you see above as well):

CRuby 2.5.0 jemalloc tcmalloc increase w/ tcmalloc increase w/ jemalloc
Median Throughput 175.13 req/sec 197.49 req/sec 183.33 req/sec 4.68% 12.77%

 

Conclusions

It looks like the jemalloc advocates have a darn good point. That makes sense. Richard Schneeman and Nate Berkopec are the kind of folks who would know. It looks like jemalloc gives conservatively an 11% speedup for a big concurrent Rails app. Tcmalloc is more variable but still hits 4%-9% speedup or so, which is nothing to sneeze at. Remember that this is overall speedup - not just when allocating memory, but for the full Rails app's runtime and throughput.

Among other things, this should tell you that heap management is important in Ruby. If you weren't seeing many bytes allocated on the heap, or if heap management was really fast, there wouldn't be a 10% speedup to give!