Retiring old performance pitfalls
Erlang/OTP 22 will bring many performance improvements to the table, but most of them have a broad impact and don’t affect the way you write efficient code. In this post I’d like to highlight a few things that used to be surprisingly slow but no longer need to be avoided.
Named fun recursion #
Named funs have a neat little feature that might not be obvious at a first glance; their name is a variable like any other and you’re free to pass it to another function or even return it.
deepfoldl(F, Acc0, L) -> (fun NamedFun([_|_]=Elem, Acc) -> lists:foldl(NamedFun, Acc, Elem); NamedFun(, Acc) -> Acc; NamedFun(Elem, Acc) -> F(Elem, Acc) end)(L, Acc0).
This is cool but a bit of a headache for the compiler. To create a fun we pass
its definition and free variables to a
make_fun2 instruction, but we can’t
include the fun itself as a free variable because it hasn’t been created yet.
Prior to OTP 22 we solved this by creating a new equivalent fun inside the
fun, but this made recursion surprisingly expensive both in terms of run-time
and memory use.
As of OTP 22 we translate recursion to a direct function call instead which avoids creating a new fun. Other cases still require recreating the fun, but they’re far less common.
List subtraction with large operands (– operator) #
While the Erlang VM appears to be pre-emptively scheduled to the programmer, it’s co-operatively scheduled internally. When a native function runs it monopolizes the scheduler until it returns, so a long-running one can severely harm the responsiveness of the system. We’ve therefore written nearly all such functions in a style that breaks the work into short units that complete quickly enough, but there’s a steadily shrinking list of functions that misbehave, and list subtraction was one of these.
It’s usually pretty straightforward to rewrite functions in this style, but the old algorithm processed the second list in a loop around the first list, which is problematic since both lists can be very long and resuming work in nested loops is often trickier than expected.
In this case it was easier to just get rid of the nested loop altogether. The
new algorithm starts out by building a red-black tree from the right-hand side
before removing elements from the left-hand side. As all operations on the tree
log n complexity we know that they will finish really quickly, so all we
need to care about is yielding in the outer loops.
This also had the nice side-effect of reducing the worst-case complexity from
O(n log n) and let us remove some warnings from the reference
manual and efficiency guide. It’s worth noting
that the new implementation is always faster than the proposed workarounds, and
that it falls back to the old algorithm when it’s faster to do so.
This change will be rolled out in OTP 21.2, big thanks to Dmytro Lytovchenko (@kvakvs on GitHub) for writing the better half of it!
Lookahead in bit-syntax matching #
The optimization pass for bit-syntax matching was completely rewritten in OTP 22 to take advantage of the new SSA-based intermediate format. It applies the same optimizations as before so already well-optimized code is unlikely to see any benefit, but it manages to apply them in far more cases.
For those who aren’t familiar, all bit-syntax matching operates on a “match context” internally, which is a mutable object that keeps track of the current match position. This helps a lot when matching complicated patterns as it can zip back and forth as required, saving us from having to match components more than once.
This is great when matching several different patterns, but it comes in real handy in loops like the following:
trim_zero(<<0,Tail/binary>>) -> trim_zero(Tail); trim_zero(B) when is_binary(B) -> B.
As the compiler can see that
Tail is passed directly to
promptly begins with a bit-match, it can skip extracting
Tail as a sub-binary
and pass the match context instead. This is a pretty well-known optimization
called “match context reuse” which greatly improves performance when applied,
and a lot of code has been written with it in mind.
The catch of passing a match context like this is that we have to maintain the illusion that we’re dealing with an immutable binary. Whenever it’s used in a non-matching expression we either need to convert the context to an equivalent binary, or admit defeat and skip the optimization.
While the compiler did a pretty good job prior to OTP 22 it gave up a bit too easily in many cases, and the most trivial example is almost funny:
calls_wrapper(<<"hello",Tail/binary>>) -> count_ones(Tail). %% This simple wrapper prevents context reuse in the call above. :( count_ones(Bin) -> count_ones_1(Bin, 0). count_ones_1(<<1, Tail/binary>>, Acc) -> count_ones_1(Tail, Acc + 1); count_ones_1(<<_, Tail/binary>>, Acc) -> count_ones_1(Tail, Acc); count_ones_1(<<>>, Acc) -> Acc.
A trickier example can be found in the
bin_search_inv_1(<<CP1/utf8, BinRest/binary>>=Bin0, Cont, Sep) -> case BinRest of %% 1 <<CP2/utf8, _/binary>> when ?ASCII_LIST(CP1, CP2) -> case CP1 of Sep -> %% 2 bin_search_inv_1(BinRest, Cont, Sep); _ -> %% 3 [Bin0|Cont] end; %% ... snip ...
What we’re looking at is a fast-path for ASCII characters; when both
CP2 are ASCII we know that
CP1 is not a part of a grapheme cluster and we
can thus avoid a call to
unicode_util:gc/1. It’s not a particularly expensive
function but calling it once per character adds up quickly.
At first glance it might seem safe to pass the context at
2, but this is made
Bin0 being returned at
3. As contexts are mutable and change
their position whenever a match succeeds, naively converting
Bin0 back to a
binary would give you what comes after
Now, you might be wondering why we couldn’t simply restore the position before
Bin0 back to a binary. It’s an obvious thing to do but before OTP
22 the context tracked not only the current position but also previous ones
needed when backtracking. These were saved in per-context “slots” which were
mutable and heavily reused, and the match at
1 clobbered the slot needed to
This also meant that a context couldn’t be used again after being passed to
another function or entering a
catch, which made it more or less
impossible to apply this optimization in code that requires looking ahead.
As of OTP 22 these positions are stored outside the context so there’s no need to worry about them becoming invalid, making it possible to optimize the above cases.