A Little Ruby-Land Profiling with StackProf

I'm looking at Ruby performance lately. Aaron Patterson was awesome when I asked him about Ruby performance and his interview -- he pointed me at a great but Linux-specific Japanese blog post about how to profile Ruby itself. The program being used, called optcarrot, is a headless NES emulator that the Ruby core team are using as a CPU-intensive optimization target. Thus "optcarrot" == "optimization carrot", the target they're trying to optimize. Cute, right?

I'm on a Mac, which seems to lack most profiling tools (e.g. no gprof) and the existing one or two like XCode's Instruments seem fairly command-line-unfriendly and Makefile-unfriendly. But while I learn to use DTrace for profiling and get a Linux VM set up with other profiling tools, I can also check out the Ruby-side performance. Let's talk about that.

(One confusing point: there's the performance of the Ruby interpreter itself, and then the performance of the Ruby code running in the interpreter. Today's post is about the code running in the interpreter. So the function names and durations are from the optcarrot NES emulator, not the internals of the core Ruby language - yet. Capisce?)

The beginning of that Japanese post mentions stackprof, which I hadn't used before. I'm enjoying it more than perftools.rb or ruby-prof, and it's been much easier to set up. Let's talk about how to use it.

I got stackprof set up in a locally-compiled development Ruby. I'm not going to say too much about that here - it's kind of painful, and I don't recommend it without a good reason. So maybe I'll post about it later. Today, let's just talk about what stackprof shows us about optcarrot.

Stackprof needs to be in the local ruby, so "gem install stackprof" to make that happen. You may need to put it into a Gemfile with "gem stackprof" -- I did that in optcarrot. Then it wants to be required and configured. In optcarrot, you can do that in the bin/optcarrot file. Here's the original file:

#!/usr/bin/env ruby

# I'm too lazy to type `-Ilib` every time...
require_relative "../lib/optcarrot"

Optcarrot::NES.new.run

end

And here's the version with stackprof. Pretty simple:

#!/usr/bin/env ruby

require 'stackprof'# added

StackProf.run mode: :cpu, # added
interval: 1000,
out: '/tmp/stackprof_optcarrot.dump' do

# I'm too lazy to type `-Ilib` every time...
require_relative "../lib/optcarrot"

Optcarrot::NES.new.run

end

Okay. So what's it do? Well, run optcarrot again, just like before... If you're not running a funky local Ruby, that's probably just "ruby bin/optcarrot --benchmark examples/Lan_Master.nes". If you need to instead add a bunch of non-standard includes and stuff so that Ruby can find its local standard library, you have my sympathies :-)

Once the run works, you should see the file that's listed in the StackProf.run line. I used /tmp/stackprof_optcarrot.dump as the filename, but you can pick what you like.

$ ls -l /tmp
total 2272
drwx------3 noah.gibbswheel102 May 31 16:35 com.apple.launchd.5JyKIrJXHi
drwx------3 _mbsetupuserwheel102 May 31 16:35 com.apple.launchd.6AV1lzEddb
drwx------3 _mbsetupuserwheel102 May 31 16:35 com.apple.launchd.RhIpa252Cv
drwx------3 noah.gibbswheel102 May 31 16:35 com.apple.launchd.yH43wuq7kM
-rw-r--r--1 noah.gibbswheel10236 Jun9 15:12 stackprof_optcarrot.dump

What do you do with it?

The most useful is to ask it what used most of the performance:

$ stackprof /tmp/stackprof_optcarrot.dump --text --limit 10
==================================
Mode: cpu(1000)
Samples: 3625 (1.01% miss rate)
GC: 10 (0.28%)
==================================
 TOTAL(pct) SAMPLES(pct) FRAME
2854(78.7%)2854(78.7%) Optcarrot::PPU#render_pixel
 271 (7.5%) 271 (7.5%) block (3 levels) in <class:PPU>
 218 (6.0%) 109 (3.0%) Optcarrot::PPU#setup_lut
79 (2.2%)79 (2.2%) Optcarrot::PPU#setup_frame
55 (1.5%)55 (1.5%) Optcarrot::CPU#fetch
26 (0.7%)26 (0.7%) Optcarrot::PPU#wait_one_clock
19 (0.5%)19 (0.5%) Optcarrot::PPU#update_address_line
6017 (166.0%)14 (0.4%) Optcarrot::PPU#main_loop
13 (0.4%)13 (0.4%) Optcarrot::PPU#evaluate_sprites_even
13 (0.4%)13 (0.4%) Optcarrot::PPU#load_tiles

Stackprof will also output files, graphviz output, flamegraphs, specific methods and more. See its GitHub page for more details.

So - what does that actually tell us? The PPU code is the "graphical" output (headless in this case) where it calculates all the output pixels. Optcarrot's documentation (see doc/benchmark.md) even warns us that the vast majority of the time will be spent in PPU#run and CPU#run. That makes a lot of sense for a game console emulator. The time is spent emulating the hardware.

The render_pixel function isn't terribly long (see lib/optcarrot/ppu.rb and search for "render_pixel"). But looking at that array append at the end, I had a sudden suspicion. I tried just commenting out that one last line of the function (yes, the emulator is useless at that point, but humor me.) Delete /tmp/stackprof_optcarrot.dump, run the emulator, run stackprof again and...

$ stackprof /tmp/stackprof_optcarrot.dump --text --limit 10
==================================
Mode: cpu(1000)
Samples: 2753 (1.18% miss rate)
GC: 14 (0.51%)
==================================
 TOTAL(pct) SAMPLES(pct) FRAME
 495(18.0%) 493(17.9%) Optcarrot::PPU#vsync
 259 (9.4%) 259 (9.4%) block (3 levels) in <class:PPU>
 258 (9.4%) 258 (9.4%) Optcarrot::PPU#render_pixel
 202 (7.3%) 202 (7.3%) Optcarrot::PPU#wait_one_clock
 137 (5.0%) 135 (4.9%) Optcarrot::CPU#fetch
 119 (4.3%) 119 (4.3%) Optcarrot::PPU#update_address_line
 113 (4.1%) 113 (4.1%) Optcarrot::PPU#evaluate_sprites_even
 110 (4.0%) 110 (4.0%) Optcarrot::PPU#setup_frame
 203 (7.4%) 101 (3.7%) Optcarrot::PPU#setup_lut
2786 (101.2%)98 (3.6%) Optcarrot::PPU#main_loop

Instead of 78% of the total runtime, render_pixel is down to 9.4% of the total runtime. So that array append is burning a great big chunk of our CPU cycles.

You can also verify this by asking stackprof to "zoom in" on the render_pixel method.

$ stackprof /tmp/stackprof_optcarrot.dump --text --method render_pixel

See the left columns? That single line is using 75.2% of the total runtime. So again, the array append is slow...

If I were trying to optimize the optcarrot NES emulator, I'd restructure that -- preallocate the array, so there's only one memory allocation? Don't use an array at all, and find some other way to handle pixel output?

But I'm not really trying to optimize optcarrot. Instead, I'm trying to optimize the Ruby internal operations that make optcarrot slow. And now I know -- those are dominated by arrays.

By the way - don't take the one-line-commented-out op count too seriously. That's dominated by PPU#vsync, but PPU#vsync winds up blacking the screen if render_pixel doesn't write output pixels. And that's an array append in a tight loop. In other words, if you take out the slow array appends in render_pixel, it will fill in with a bunch more array appends anyway. This is one of the hazards of modifying the code you're profiling - it changes the behavior.

In a later post, perhaps we'll return to that array append. I'm still figuring out how to optimize it. I suspect realloc() being slow, myself. And in the mean time, you have learned a bit about how to use stackprof, which is a pretty darned awesome tool.