How Ruby Encodes References - Ruby Tiny Objects Explained
When you’re using Ruby and you care about performance, you’ll hear a specific recommendation: “use small, fast objects.” As a variation on this, people will suggest you use symbols (“they’re faster than strings!”), prefer nil to the empty string and a few similar recommendations.
It’s usually passed around as hearsay and black magic, and often the recommendations are subtly wrong. For instance, some folks used to say “don’t use symbols! They can’t be garbage collected!”. But nope, now they can be. And the strings versus symbols story gets a lot more complicated if you use frozen strings…
I’ve explained how Ruby allocates tiny, small and large objects before, but this will be a deep dive into tiny (reference) objects and how they work. That will help you understand the current situation and what’s likely to change in the future.
We’ll also talk a bit about how C stores objects. CRuby (aka Matz’s Ruby or “plain” Ruby) is written in C, and uses C data structures to store your Ruby objects.
And along the way you’ll pick up a common-in-C trick that can both be used in Ruby (Matz does!) and help you understand the deeper binary underpinnings of a lot of higher-level languages.
How Ruby Stores Objects
You may recall that Ruby has three different object sizes, which I’ll call “tiny,” “small” and “large.” For deeper details on that, the slides from my 2018 RubyKaigi talk are pretty good (or: video link.)
But the short version for Ruby on 64-bit architectures (such as any modern processor) is:
A Ruby 8-byte “reference” encodes tiny objects directly inside it, or points to…
A 40-byte RVALUE structure, which can fully contain a small object or the starting 40 bytes of…
A Large object (anything bigger), which uses an RVALUE and an allocation from the OS.
Make sense? Any Ruby value gets a reference, even the smallest ones. Tiny values are encoded directly into the 8-byte reference. Small or large objects (but not tiny) also get a 40-byte RVALUE. Small objects are encoded directly into the 40-byte RVALUE. And large objects don’t fit in just a reference or just an RVALUE, so they get an extra allocation of whatever size they actually need (plus the RVALUE and the reference.) For the C folks in the audience, that “extra allocation” is the same thing as a call to malloc(), the usual C memory allocation function.
The RVALUE is often called a “Slot” when you’re talking about Ruby memory. Technically Ruby uses the word “slot” for the allocation and “RVALUE” for the data type of the structure that goes in a slot, but you’ll see both words used both ways - treat them as the same thing.
Why the three-level system? Because it gets more expensive in performance as the objects get bigger. 8-byte references are tiny and very cheap. Slots get allocated in blocks of 408 at a time and aren’t that big, so they’re fairly cheap - but a thousand or more of them start to get expensive. And a large object takes a reference and a slot and a whole allocation of its own that gets separately tracked - not cheap.
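If you want to poke at the three levels from inside Ruby, the objspace standard library gives a rough view. Treat this as a sketch rather than a precise measurement: `footprint` is a hypothetical helper name, it assumes you are running CRuby, and the exact byte counts differ between Ruby versions and builds.

```ruby
require "objspace"

# Hypothetical helper: the bytes CRuby reports for an object,
# not counting the 8-byte reference itself.
def footprint(obj)
  ObjectSpace.memsize_of(obj)
end

tiny  = 42            # lives entirely inside the reference
small = "hello"       # short enough to fit inside one slot
large = "x" * 10_000  # needs a slot plus a separate allocation

puts "tiny:  #{footprint(tiny)} bytes"
puts "small: #{footprint(small)} bytes"
puts "large: #{footprint(large)} bytes"
```

The tiny value should report no storage beyond its reference, and the large string should report noticeably more than the small one.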
So: let’s look at references. Those are the 8-byte tiny values.
Which Values are Tiny?
I say that “some” objects are encoded into the reference. Which ones?
Fixnums between about negative four quintillion and four quintillion (-2^62 to 2^62 - 1 on 64-bit builds)
Symbols
Most floating-point numbers (like 3.7 or -421.74)
The special values true, false, undef and nil
That’s a pretty specific set. Why?
C: Mindset, Hallucinations and One Weird Trick That Will Shock You
C really treats all data as a chunk of bits with a length. There are all sorts of operations that act on chunks of bits, of course, and some of those operations might be assigned something resembling a “type” by a biased human observer. But C is a big fan of the idea that if you have a chunk of bytes and you want to treat it as a string in one line and an integer the next, that’s fine. Length is the major limitation, and even length is surprisingly flexible if you’re careful and/or you don’t mind the occasional buffer overrun.
What’s a pointer? Pointers are how C tracks memory. If you imagine numbering all the bytes of memory starting at zero, and the next byte is one, the next byte two and so on, you get exactly how old processors addressed memory. Some very simple embedded processors still do it that way. That’s exactly what a C pointer is - an index for a location in memory, if you were to treat all of memory as one giant vector of bytes. Memory addressing is more complicated in newer processors, OSes and languages, but they still present your program with that same old abstraction. In C, you use it very directly.
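Here is a small Ruby simulation of that mindset, not real C memory access. It models memory as one binary string and a “pointer” as a plain integer index into it. The helper names (`new_memory`, `write_bytes`, `read_string`, `read_u64`) are made up for this illustration, and the little-endian byte order is an assumption that matches most current desktop processors.

```ruby
# Simulated memory: a flat run of zeroed bytes.
def new_memory(size)
  ("\x00" * size).b
end

# Copy raw bytes into memory starting at the given address.
def write_bytes(memory, address, bytes)
  memory[address, bytes.bytesize] = bytes.b
  memory
end

# View `length` bytes at an address as a string.
def read_string(memory, address, length)
  memory.byteslice(address, length)
end

# View the same kind of bytes as an unsigned 64-bit little-endian integer.
def read_u64(memory, address)
  memory.byteslice(address, 8).unpack1("Q<")
end

memory  = new_memory(32)
pointer = 8 # just a number: an index into the byte vector

write_bytes(memory, pointer, "RubyRuby")
puts read_string(memory, pointer, 8) # the bytes treated as text
puts read_u64(memory, pointer)       # the very same bytes treated as an integer
```

Nothing in the simulated memory records whether those eight bytes “are” a string or an integer. That decision lives entirely in which read function you call.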
So when I say that in C a pointer is a memory address, you might ask, “is that a separate type from integer with a bunch of separate operations you can do on it?” and I might answer “it’s C, so I just mean there are a bunch of pointer operations that you can do with any piece of data anywhere inside your process.” The theme here is “C doesn’t track your stuff for you at runtime, who do you think C is, your mother?” The other, related theme is “C assumes when you tell it something you know what you’re doing, whether you actually do or not.” And if not, eh, crashes happen.
One bit related to this mindset: allocating a new “object” (really a chunk of bytes) in C is simple: you call a function and you get back a pointer to a chunk of bytes, guaranteed to hold at least the size you asked for. Ask it for 137 bytes, get back a pointer to a buffer that is at least 137 bytes big. That’s what “malloc” does. When you’re done with the buffer you call “free” to give it back, after which it may become invalid or be handed back to somebody else, or split up and parts of it handed back to somebody else. Data made of bits is weird.
A side effect of all of this “made of bits” and “track it yourself” stuff is that often you’ll do type tagging. You keep one piece of data that says what type another piece of data is, and then you interpret the second one completely differently depending on the first one. Wait, what? Okay, so, an example: if you know you could have an integer or a string, you keep a tag, which is either 0 for integer or 1 for string. When you read the object, first you check the tag for how to interpret the second chunk of bits. When you set a new value (which could be either integer or string) you also set the tag to the correct value. Does this all sound disorganized and error-prone? Good, you’re understanding a bit of what C is like.
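The integer-or-string example can be sketched in Ruby like this. It is an illustration of the idea only: `Cell`, `write_cell` and `read_cell` are hypothetical names, and the 8-byte little-endian integer layout is an assumption chosen for the demo.

```ruby
TAG_INTEGER = 0
TAG_STRING  = 1

# One tag plus one chunk of raw bytes. The tag says how to read the bytes.
Cell = Struct.new(:tag, :bytes)

# Store a value and set the matching tag at the same time.
def write_cell(value)
  case value
  when Integer then Cell.new(TAG_INTEGER, [value].pack("q<"))
  when String  then Cell.new(TAG_STRING, value.dup)
  else raise ArgumentError, "only integers and strings are supported"
  end
end

# Check the tag first, then interpret the bytes accordingly.
def read_cell(cell)
  case cell.tag
  when TAG_INTEGER then cell.bytes.unpack1("q<")
  when TAG_STRING  then cell.bytes
  else raise ArgumentError, "unknown tag #{cell.tag}"
  end
end

puts read_cell(write_cell(42))
puts read_cell(write_cell("hello"))

# Forget to keep the tag in sync and the same bytes get misread:
confused = write_cell(42)
confused.tag = TAG_STRING
p read_cell(confused) # eight raw bytes, not the number 42
```

The last three lines are the error-prone part: nothing stops the tag and the data from drifting apart.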
One last oddity: because of how processor alignment and memory tracking work, and thanks to a quirk of history, pointers are essentially always even. In fact, values returned by a memory allocator on a modern processor are always a multiple of 8, because most processors don’t like accessing an 8-byte value at an address that isn’t a multiple of 8. The memory allocator can’t just tell you not to use any 8-byte values. Processors are weird, yo.
Which means if you looked at the representation of your pointer in binary, the smallest three bits would always be zero. Because, y’know, multiple of 8. Which means you could use those three bits for something. Keep that in mind for the next bit.
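A quick arithmetic sketch of that idea, using ordinary Ruby integers to stand in for addresses. The address below is invented for illustration, and `tag_pointer` / `untag_pointer` are hypothetical helper names, not anything from CRuby.

```ruby
ALIGNMENT     = 8
LOW_BITS_MASK = ALIGNMENT - 1 # 0b111, the three lowest bits

# A multiple of 8 always has its three lowest bits clear.
def aligned?(address)
  (address & LOW_BITS_MASK).zero?
end

# Hide a small tag (0..7) in the unused low bits of an aligned address.
def tag_pointer(address, tag)
  raise ArgumentError, "address is not 8-byte aligned" unless aligned?(address)
  raise ArgumentError, "tag must fit in three bits" unless (0..LOW_BITS_MASK).cover?(tag)
  address | tag
end

# Split a tagged value back into [address, tag].
def untag_pointer(tagged)
  [tagged & ~LOW_BITS_MASK, tagged & LOW_BITS_MASK]
end

address = 0x7f3a_1c00_4010 # made-up address, a multiple of 8
tagged  = tag_pointer(address, 0b101)

puts address.to_s(2) # ends in 000
puts tagged.to_s(2)  # ends in 101
p untag_pointer(tagged) == [address, 0b101]
```

Masking off the low bits gives back the original address untouched, which is why the three bits are free to carry extra information.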
Okay, So What Does Ruby Do?
If this sounds like I’m building up to explaining some type-tagging… Yup, well spotted!
It turns out that a reference is normally a C pointer under the hood. Basically every dynamic language does this, with different little variations. So all references to small and large Ruby objects are pointers. The exception is for tiny objects, which live completely in the reference.
Think about the last three bits of Ruby’s 8-byte references. You know that if those last bits are all zeroes, the value is (or could be) a pointer to something returned by the memory allocator - so it’s a small or large object. But if they’re not zero, the value lives in the reference and it’s a tiny object.
And Ruby is going to pass around a lot of values that you’d like to be small and fast… Numbers, say, or symbols. Heck, you’d like nil to be pretty small and fast too.
So: CRuby has a few things that it calls “immediate” values in its source code. And the list of those immediate values looks exactly like the list above - values you can store as tiny objects directly in a reference.
Let’s get back to those last three bits of the reference again.
If the final bit is a “1” then the reference contains a Fixnum. If the final two bits are “10” then it’s a Float. And if the last four bits are “1100” then it’s a Symbol. But the last three of “1100” are still illegal for an allocated pointer, so it works out.
The four “special” values (true, false, undef, nil) are all represented by small numbers that will also never be returned by the memory allocator. For completeness, here they are:
Value | Hexadecimal value | Decimal value |
---|---|---|
true | 0x14 | 20 |
false | 0x00 | 0 |
undef | 0x34 | 52 |
nil | 0x08 | 8 |
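Putting the bit patterns and the table together, here is a rough Ruby sketch of how you could classify a raw 8-byte reference. It follows the rules exactly as described in this article. CRuby’s real checks are C macros that differ in detail and between versions, and `classify` and `SPECIAL_VALUES` are hypothetical names for this illustration.

```ruby
# The four special values from the table, keyed by their raw reference value.
SPECIAL_VALUES = {
  0x00 => :false,
  0x08 => :nil,
  0x14 => :true,
  0x34 => :undef
}.freeze

# Classify a raw reference (given as an integer) by its low bits.
def classify(ref)
  # Specials go first: false (0) and nil (8) would otherwise look like pointers.
  return SPECIAL_VALUES[ref] if SPECIAL_VALUES.key?(ref)
  return :fixnum  if (ref & 0b1) == 0b1
  return :float   if (ref & 0b11) == 0b10
  return :symbol  if (ref & 0b1111) == 0b1100
  return :pointer if (ref & 0b111).zero?
  :unknown
end

[0x14, 0x08, 7, 0x0a, 0x10c, 0x7f00_0000_1000].each do |ref|
  puts format("%#x -> %s", ref, classify(ref))
end
```

The order of the checks matters. The special values are matched by exact value before any bit-pattern test, and the pointer test comes last because it is the least specific.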
So Every Integer Ends in 1, Then?
You might reasonably ask… but what about even integers?
I mean, “ends in 1” is a reasonable way to distinguish between pointers and not-pointers. But what if you want to store the number 4 at some point? Its binary representation ends in “00,” not “1.” The number 88 is even worse - like a pointer, it’s a multiple of 8!
It turns out that CRuby stores your integer in just the top 63 bits out of 64. The final “1” bit isn’t part of the integer’s value, it’s just a sign saying, “yup, this is an integer.” So if type-tagging is two values with one tagging the other, then the bottom bit is the tag and the top 63 bits are the “other” piece of data. They’re both crunched together, but… Well, this is C. If you want to crunch up “multiple” pieces of data into one chunk… C isn’t your mother, and it won’t stop you. In fact, that’s what C does with all its arrays anyway. And in this case it makes for pretty fast code, so that’s what CRuby does.
If you’re up for it, here’s the C code for immediate Fixnums - all this code makes heavy use of bitwise operations, as you’d expect.
    // Check if a reference is an immediate Fixnum
    #define RB_FIXNUM_P(f) (((int)(SIGNED_VALUE)(f))&RUBY_FIXNUM_FLAG)

    // Convert a C int into a Ruby immediate Fixnum reference
    #define RB_INT2FIX(i) (((VALUE)(i))<<1 | RUBY_FIXNUM_FLAG)

    // Convert a Ruby immediate Fixnum into a C long - RSHIFT is just >>
    #define RB_FIX2LONG(x) ((long)RSHIFT((SIGNED_VALUE)(x),1))
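If C macros aren’t your thing, the same arithmetic can be simulated in plain Ruby. This is a sketch of the encoding, not CRuby’s code: `int2fix`, `fixnum_ref?` and `fix2long` are hypothetical names loosely mirroring the macros, and because Ruby integers are arbitrary precision, the 64-bit width is simulated with an explicit mask.

```ruby
WORD_MASK     = (1 << 64) - 1 # keep results to 64 bits, like a C VALUE
FIXNUM_TAG    = 0x01
FIXNUM_MAX    = (1 << 62) - 1
FIXNUM_MIN    = -(1 << 62)

# Shift the integer up one bit and set the low bit as the tag.
def int2fix(i)
  raise RangeError, "does not fit in 63 bits" unless (FIXNUM_MIN..FIXNUM_MAX).cover?(i)
  ((i << 1) | FIXNUM_TAG) & WORD_MASK
end

# A reference is a Fixnum when its lowest bit is 1.
def fixnum_ref?(ref)
  (ref & FIXNUM_TAG) == FIXNUM_TAG
end

# Reinterpret the 64 bits as signed, then shift the tag bit away.
def fix2long(ref)
  signed = ref >= (1 << 63) ? ref - (1 << 64) : ref
  signed >> 1
end

puts int2fix(4).to_s(2)  # 4 is 100 in binary, stored as 1001
puts int2fix(88).to_s(2) # 88 is a multiple of 8, but the stored form still ends in 1
puts fix2long(int2fix(-5))
```

Every encoded value is odd, so it can never be mistaken for an 8-byte-aligned pointer, and shifting right by one bit recovers the original number, including negative ones.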
So It’s All That Simple, Then?
This article can’t cover everything. If you think about symbols for a moment, you’ll realize they have to be a bit more complicated than that - what about a symbol like :thisIsAParticularlyLongName? You can’t fit that in 8 bytes! And yet it’s still an immediate value. Spoiler: Ruby keeps a table that maps the symbol names to fixed-length keys. This is another very old trick, often called String Interning.
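The interning idea itself is small enough to sketch in a few lines of Ruby. This is a toy version and not how CRuby’s symbol table is actually built: `InternTable`, `intern` and `name_of` are hypothetical names, and the keys here are simply sequential integers.

```ruby
# Toy string-interning table: every distinct name gets one small integer key.
class InternTable
  def initialize
    @id_for_name = {} # name -> key
    @name_for_id = [] # key  -> name
  end

  # Return the existing key for a name, or assign the next free one.
  def intern(name)
    @id_for_name.fetch(name) do
      id = @name_for_id.length
      @name_for_id << name.dup.freeze
      @id_for_name[name] = id
    end
  end

  # Look the name back up from its key.
  def name_of(id)
    @name_for_id.fetch(id)
  end
end

table = InternTable.new
key   = table.intern("thisIsAParticularlyLongName")

puts key                                              # a small fixed-size number
puts table.intern("thisIsAParticularlyLongName") == key # same name, same key
puts table.name_of(key)
```

However long the name is, the key stays a small fixed-size number, and a number like that is what can fit inside an 8-byte reference.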
And as for what it does to the Float representation… I’ll get into a lot more detail about that, and about what it does to Ruby’s floating-point performance, in a later post.