A closer look at the interpreter

October 27, 2020 · by John Högberg

In my previous post we had a look at BEAM, and now that we’re more familiar with it, it’s time for us to look at the reference implementation: the interpreter.

The interpreter can be thought of as an endless loop that looks at the current instruction, executes it, and then moves on to the next one.

In normal builds our code is laid out as directly threaded code, where each instruction consists of the machine code address of its handler, followed by its arguments, which are in turn followed by the next instruction:

Instruction address
    First argument
    Second argument
    ... and so on.
Instruction address
    First argument
    Second argument
    ... and so on.

When an instruction finishes it reads the next instruction address and jumps to it, forever following the “thread” of instructions.
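
To make the dispatch concrete, here is a minimal sketch of a directly threaded interpreter in plain C, using the same “labels as values” GCC extension that the BEAM interpreter relies on. The toy instruction set and every name in it are invented for illustration and are far simpler than BEAM’s:

/* A minimal directly threaded interpreter (toy example, not
 * BEAM's actual layout). Requires GCC or Clang for the
 * "labels as values" extension. */
#include <stdio.h>

int main(void) {
    long accumulator = 0;

    /* Each instruction is the address of its handler, followed
     * by its arguments, just like the layout described above. */
    void *code[] = {
        &&load_const, (void *)5L,   /* accumulator  = 5  */
        &&add_const,  (void *)37L,  /* accumulator += 37 */
        &&print_acc,                /* print the result  */
        &&halt
    };

    void **ip = code;

    /* Dispatch: jump to the handler address stored at *ip. */
    goto **ip;

load_const:
    accumulator = (long)ip[1];
    ip += 2;                        /* skip handler address + argument */
    goto **ip;

add_const:
    accumulator += (long)ip[1];
    ip += 2;
    goto **ip;

print_acc:
    printf("%ld\n", accumulator);
    ip += 1;                        /* no arguments */
    goto **ip;

halt:
    return 0;
}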

With very few exceptions, the instructions are “pure” in the sense that they always do the same thing given the same input, and only affect BEAM itself, either by changing the control flow or by writing a result to a register. This makes them very easy to read and reason about in isolation, so for the sake of brevity we’ll only have a look at one of them:

is_nonempty_list(Label, Src) {

    /* Check if our $Src is not a list. */
    if (is_not_list($Src)) {

        /* Invoke the $FAIL macro, jumping to our
         * $Label. */
        $FAIL($Label);
    }

    /* Execute the next instruction. */
}

The above is written in a domain-specific language (DSL) that is a superset of C, where $-prefixed macros are expanded and the rest is kept as is.

The beam_makeops script takes this definition and generates code for the parts that are cumbersome to write by hand, such as jumping to the next instruction. It also hides argument handling, which lets us reduce code size by packing small arguments together behind the scenes, something we explored in an earlier post on the subject.
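
As a rough illustration of the packing itself, the sketch below stuffs a register index and a signed fail-label offset into a single 64-bit word and unpacks them again, mirroring the upper/lower 32-bit split we’ll see in the generated code further down. The layout and helper names here are invented for the example; the real layout is chosen per instruction by beam_makeops:

#include <stdint.h>
#include <stdio.h>

/* Pack a source register index into the upper 32 bits and a signed
 * fail-label offset into the lower 32 bits of one argument word.
 * (Illustrative layout only.) */
static uint64_t pack_args(uint32_t src_index, int32_t fail_offset) {
    return ((uint64_t)src_index << 32) | (uint32_t)fail_offset;
}

int main(void) {
    uint64_t arg_word = pack_args(3, -16);

    /* Unpacking mirrors what the specialized instruction does. */
    uint32_t src_index   = (uint32_t)(arg_word >> 32);
    int32_t  fail_offset = (int32_t)arg_word;

    printf("source: x%u, fail offset: %d words\n", src_index, fail_offset);
    return 0;
}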

For performance reasons it also generates different variants based on the argument types outlined in ops.tab to reduce the amount of work we need to do at runtime.

Let’s have a look at the generated code for this instruction, specifically the variant for X registers:

/* (This has been modified slightly for readability) */
case is_nonempty_list_fx:
{
    Eterm arg_word, term;

    /* Read the argument word from the instruction
     * stream. */
    arg_word = I[1];

    /* Unpack the offset of our source register (upper
     * 32 bits) and then read its contents.
     *
     * Note that we address the X registers directly;
     * had this instruction not been specialized, we
     * would first need to determine whether the
     * argument was an X or a Y register. */
    term = x_registers[arg_word >> 32];

    /* is_not_list(term) */
    if (term & (_TAG_PRIMARY_MASK - TAG_PRIMARY_LIST)) {

        /* Unpack our fail label (lower 32 bits) and add
         * it to our current instruction pointer. This
         * is an offset and may be negative for backward
         * jumps. */
        I += (Sint32)arg_word;

        /* Jump to the instruction at the fail label using
         * the "labels as values" GCC extension. */
        goto *(void *) *I;
    }

    /* Skip the current label address and argument word,
     * then jump to the next instruction. This was
     * automatically generated by beam_makeops. */
    I += 2;
    goto *(void *) *I;
}

There’s a bit more to it, and those who would like to know more can read the documentation for the beam_makeops script, but the above covers the gist of it.

While the above is quite efficient, there’s significant overhead from argument handling and instruction dispatch. Here’s a slightly altered assembly listing of the generated code:

    ; Read the argument word from the instruction
    ; stream.
    mov    rdx, [rbx + 8]

    ; Unpack the offset of our source register (upper
    ; 32 bits).
    mov    rcx, rdx
    shr    rcx, 32

    ; X registers live in machine register r15, and we
    ; placed our offset in rcx, so we can find our term at
    ; [r15 + rcx].
    ;
    ; Perform a non-destructive bitwise AND on the term
    ; using the `test` instruction, and jump to the fail
    ; label if the result is non-zero.
    test   byte [r15 + rcx], (_TAG_PRIMARY_MASK - TAG_PRIMARY_LIST)
    jne    jump_to_fail_label

    ; Skip the current label address and argument word,
    ; then jump to the next instruction.
    add    rbx, 16
    jmp    [rbx]

jump_to_fail_label:
    ; Unpack our fail label (lower 32 bits) and add
    ; it to our current instruction pointer.
    movsxd rdx, edx
    lea    rbx, [rbx + rdx * 8]

    ; Jump to the instruction at the fail label.
    jmp    [rbx]

The test and jne instructions are the meat of the operation; the rest is argument unpacking and instruction dispatch. While this is not much of a problem for large instructions, its effect on short ones like this is very large, and when looking through a profiler (e.g. perf) it’s not unusual for the final jmp to dominate the rest.

To lessen this effect, the loader combines commonly used instruction sequences into a single instruction. For example, two independent moves may be fused into move2_par, and is_nonempty_list followed by get_list may be fused into is_nonempty_list_get_list.
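
As a rough sketch of what such a fused handler could look like, written in the same DSL style as the is_nonempty_list example above (the argument names are illustrative and the actual definition in the runtime may differ):

is_nonempty_list_get_list(Fail, Src, Hd, Tl) {

    /* Same check as the stand-alone is_nonempty_list
     * instruction. */
    if (is_not_list($Src)) {
        $FAIL($Fail);
    } else {
        /* The term is a cons cell: read its head and tail
         * into the destination registers in one go, saving
         * a dispatch compared to running is_nonempty_list
         * and get_list separately. */
        Eterm *cell = list_val($Src);

        $Hd = CAR(cell);
        $Tl = CDR(cell);
    }

    /* Execute the next instruction. */
}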

This reduces the cost of short instructions, but only works to the extent we’re able to identify common patterns and comes at a significant maintenance cost as each combination must be implemented manually. Even so, the effect tends to be moderate and the dispatch overhead remains significant.

Another, albeit lesser, downside of the interpreter is that modern processors are heavily optimized for patterns commonly found in “ordinary” native code. For example, nearly all of them have a special branch predictor just for calls and returns: assuming that every call has a corresponding return lets it predict returns perfectly unless an exception is thrown. Since the interpreter does not use native calls and returns, it cannot make use of this optimization.

Unfortunately there’s not a whole lot that can be done about this, and after more than two decades of refinement it’s becoming increasingly difficult to optimize the interpreter in meaningful ways.

Because of that our quest to improve performance has instead focused on two areas: improving the compiler and implementing a JIT. We’ve made great strides with both as of late, and are very proud to have finally merged the latter.

Stay tuned for our next post, where we’ll have a look at the JIT and see how it avoids these issues.