A list of all changes is found in Erlang/OTP 26 Readme. Or, as always, look at the release notes of the application you are interested in. For instance: Erlang/OTP 26 - Erts Release Notes - Version 14.0.
This year’s highlights mentioned in this blog post are:

- The shell
- Improvements of maps
- The lists module
- maybe in the runtime system
- Dialyzer’s incremental mode
- The argparse module
- Safer defaults for the ssl application
- Optimizations in the compiler and JIT

The shell

OTP 26 brings many improvements to the experience of using the Erlang shell.
For example, functions can now be defined directly in the shell:
1> factorial(N) -> factorial(N, 1).
ok
2> factorial(N, F) when N > 1 -> factorial(N - 1, F * N);
.. factorial(_, F) -> F.
ok
3> factorial(5).
120
The shell prompt changes to ..
when the previous line is not a
complete Erlang construct.
Functions defined in this way are evaluated using the erl_eval module, not compiled by the Erlang compiler. That means that the performance will not be comparable to compiled Erlang code.
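To get a rough feel for that overhead, one can time a call with timer:tc/1 and compare it against a compiled version of the same function (a sketch of mine; the measured times are machine-dependent, so none are quoted here):
%% timer:tc/1 returns {ElapsedMicroseconds, Result}.
timer:tc(fun() -> factorial(10000) end).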
It is also possible to define types, specs, and records, making it possible to paste code from a module directly into the shell for testing. For example:
1> -record(coord, {x=0.0 :: float(), y=0.0 :: float()}).
ok
2> -type coord() :: #coord{}.
ok
3> -spec add(coord(), coord()) -> coord().
ok
4> add(#coord{x=X1, y=Y1}, #coord{x=X2, y=Y2}) ->
.. #coord{x=X1+X2, y=Y1+Y2}.
ok
5> Origin = #coord{}.
#coord{x = 0.0,y = 0.0}
6> add(Origin, #coord{y=10.0}).
#coord{x = 0.0,y = 10.0}
The auto-completion feature in the shell has been vastly improved, supporting auto-completion of variables, record names, record fields names, map keys, function parameter types, and file names.
For example, instead of typing the variable name Origin, I can just type O and press TAB to expand it to Origin, since the only variable defined in the shell with the initial letter O is Origin. That is a little bit difficult to illustrate in a blog post, so let’s introduce another variable starting with O:
7> Oxford = #coord{x=51.752022, y=-1.257677}.
#coord{x = 51.752022,y = -1.257677}
If I now press O
and TAB, the shell shows the possible completions:
8> O
bindings
Origin Oxford
(The word bindings
is shown in bold and underlined.)
If I press x and TAB, the word is completed to Oxford:
8> Oxford.
#coord{x = 51.752022,y = -1.257677}
To type #coord{, it is sufficient to type # and TAB (because there is only one record currently defined in the shell):
9> #coord{
Pressing TAB one more time causes the field names in the record to be printed:
9> #coord{
fields
x= y=
When trying to complete something which has many possible expansions, the shell attempts to show the most likely completions first. For example, if I type l and press TAB, the shell shows a list of BIFs beginning with the letter l:
10> l
bifs
length( link( list_to_atom(
list_to_binary( list_to_bitstring( list_to_existing_atom(
list_to_float( list_to_integer( list_to_pid(
list_to_port( list_to_ref( list_to_tuple(
Press tab to see all 37 expansions
Pressing TAB again, more BIFs are shown, as well as possible shell commands and modules:
10> l
bifs
length( link( list_to_atom(
list_to_binary( list_to_bitstring( list_to_existing_atom(
list_to_float( list_to_integer( list_to_pid(
list_to_port( list_to_ref( list_to_tuple(
load_module(
commands
l( lc( lm( ls(
modules
lcnt: leex: lists:
local_tcp: local_udp: log_mf_h:
logger: logger_backend: logger_config:
logger_disk_log_h: logger_filters: logger_formatter:
logger_h_common: logger_handler_watcher: logger_olp:
logger_proxy: logger_server: logger_simple_h:
logger_std_h: logger_sup:
Typing ists: (to complete the word lists) and pressing TAB, a partial list of the functions in the lists module is shown:
10> lists:
functions
all( any( append( concat( delete(
droplast( dropwhile( duplicate( enumerate( filter(
filtermap( flatlength( flatmap( flatten( foldl(
foldr( foreach( join( keydelete( keyfind(
Press tab to see all 72 expansions
Typing m and pressing TAB, the list of functions is narrowed down to just those beginning with the letter m:
10> lists:m
functions
map( mapfoldl( mapfoldr( max( member(
merge( merge3( min( module_info(
OTP 25 and earlier releases printed small maps (up to 32 elements) with atom keys according to the term order of their keys:
1> AM = #{a => 1, b => 2, c => 3}.
#{a => 1,b => 2,c => 3}
2> maps:to_list(AM).
[{a,1},{b,2},{c,3}]
In OTP 26, as an optimization for certain map operations, such as maps:from_list/1, maps with atom keys are now sorted in a different order. The new order is undefined and may change between different invocations of the Erlang VM. On my computer at the time of writing, I got the following order:
1> AM = #{a => 1, b => 2, c => 3}.
#{c => 3,a => 1,b => 2}
2> maps:to_list(AM).
[{c,3},{a,1},{b,2}]
There is a new modifier k
for format strings to specify that maps should
be sorted according to the term order of their keys before printing:
3> io:format("~kp\n", [AM]).
#{a => 1,b => 2,c => 3}
ok
It is also possible to use a custom ordering fun. For example, to order the map elements in reverse order based on their keys:
4> io:format("~Kp\n", [fun(A, B) -> A > B end, AM]).
#{c => 3,b => 2,a => 1}
ok
There is also a new maps:iterator/2 function that supports iterating over the elements of the map in a more intuitive order. Examples will be shown in the next section.
In OTP 25 and earlier, it was common to combine maps:from_list/1 and maps:to_list/1 with list comprehensions. For example:
1> M = maps:from_list([{I,I*I} || I <- lists:seq(1, 5)]).
#{1 => 1,2 => 4,3 => 9,4 => 16,5 => 25}
In OTP 26, that can be written more succinctly with a map comprehension:
1> M = #{I => I*I || I <- lists:seq(1, 5)}.
#{1 => 1,2 => 4,3 => 9,4 => 16,5 => 25}
With a map generator, a comprehension can now iterate over the elements of a map. For example:
2> [K || K := V <- M, V < 10].
[1,2,3]
Using a map comprehension with a map generator, here is an example showing how keys and values can be swapped:
3> #{V => K || K := V <- M}.
#{1 => 1,4 => 2,9 => 3,16 => 4,25 => 5}
Map generators accept map iterators as well as maps. Especially useful are the ordered iterators returned from the new maps:iterator/2 function:
4> AM = #{a => 1, b => 2, c => 1}.
#{c => 1,a => 1,b => 2}
5> [{K,V} || K := V <- maps:iterator(AM, ordered)].
[{a,1},{b,2},{c,1}]
6> [{K,V} || K := V <- maps:iterator(AM, reversed)].
[{c,1},{b,2},{a,1}]
7> [{K,V} || K := V <- maps:iterator(AM, fun(A, B) -> A > B end)].
[{c,1},{b,2},{a,1}]
Map comprehensions were first suggested in EEP 58.
maps:get/3
In OTP 26, the compiler will inline calls to maps:get/3, making them slightly more efficient.
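As a reminder of what the BIF does (a shell example of mine, not from the release notes): maps:get(Key, Map, Default) returns the value associated with Key, or Default if Key is not present:
1> M = #{a => 1}.
#{a => 1}
2> maps:get(a, M, 0).
1
3> maps:get(b, M, 0).
0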
maps:merge/2
When merging two maps, the maps:merge/2 function will now try to reuse the key tuple from one of the maps in order to reduce the memory usage for maps.
For example:
1> maps:merge(#{x => 13, y => 99, z => 100}, #{x => 0, z => -7}).
#{y => 99,x => 0,z => -7}
The resulting map has the same three keys as the first map, so it can reuse the key tuple from the first map.
This optimization is not possible if one of the maps has any key not present in the other map. For example:
2> maps:merge(#{x => 1000}, #{y => 2000}).
#{y => 2000,x => 1000}
Updating a map using the => operator has been improved to avoid updates that don’t change the value of the map or its key tuple.
For example:
1> M = #{a => 42}.
#{a => 42}
2> M#{a => 42}.
#{a => 42}
The update operation does not change the value of the map, so in order to save memory, the original map is returned.
(A similar optimization for the :=
operator was implemented 5
years ago.)
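One way to observe the reuse from the shell is the internal, undocumented erts_debug:same/2 BIF, which tests whether two terms are represented by the very same object in memory (my own illustration; being undocumented, its behaviour may change):
%% With the optimization, the unchanged update returns the original
%% map, so this is expected to return true on OTP 26.
erts_debug:same(M, M#{a => 42}).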
When updating the values of keys that already exist in a map using the
=>
operator, the key tuple will now be re-used. For example:
3> M#{a => 100}.
#{a => 100}
For anyone who wants to dig deeper, here are the main pull requests for maps for OTP 26:
lists module

lists:enumerate/3
In OTP 25, lists:enumerate/1 and lists:enumerate/2 were introduced. For example:
1> lists:enumerate([a,b,c]).
[{1,a},{2,b},{3,c}]
2> lists:enumerate(0, [a,b,c]).
[{0,a},{1,b},{2,c}]
In OTP 26, lists:enumerate/3 completes the family of functions by allowing an increment to be specified:
3> lists:enumerate(0, 10, [a,b,c]).
[{0,a},{10,b},{20,c}]
4> lists:enumerate(0, -1, [a,b,c]).
[{0,a},{-1,b},{-2,c}]
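Conceptually, lists:enumerate/3 behaves like zipping the list with an arithmetic sequence. Here is a rough functional sketch of mine (not the actual implementation):
%% Behaves like lists:enumerate(Index, Step, List) for proper lists.
enumerate(Index, Step, List) ->
    Indices = lists:seq(Index, Index + (length(List) - 1) * Step, Step),
    lists:zip(Indices, List).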
zip family of functions

The zip family of functions in the lists module combines two or three lists into a single list of tuples. For example:
1> lists:zip([a,b,c], [1,2,3]).
[{a,1},{b,2},{c,3}]
The existing zip
functions fail if the lists don’t have the same length:
2> lists:zip([a,b,c,d], [1,2,3]).
** exception error: no function clause matching . . .
In OTP 26, the zip
functions now take
an extra How
parameter that determines what should happen when the
lists are of unequal length.
For some use cases for zip, ignoring the superfluous elements in the longer list or lists can make sense. That can be done using the trim option:
3> lists:zip([a,b,c,d], [1,2,3], trim).
[{a,1},{b,2},{c,3}]
For other use cases it could make more sense to extend the shorter list or lists to the length of the longest list. That can be done using the {pad, Defaults} option, where Defaults should be a tuple having the same number of elements as the number of lists. For lists:zip/3, that means that the Defaults tuple should have two elements:
4> lists:zip([a,b,c,d], [1,2,3], {pad, {zzz, 999}}).
[{a,1},{b,2},{c,3},{d,999}]
5> lists:zip([a,b,c], [1,2,3,4,5], {pad, {zzz, 999}}).
[{a,1},{b,2},{c,3},{zzz,4},{zzz,5}]
For lists:zip3/4, the Defaults tuple should have three elements:
6> lists:zip3([], [a], [1,2,3], {pad, {0.0, zzz, 999}}).
[{0.0,a,1},{0.0,zzz,2},{0.0,zzz,3}]
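For completeness: the default behaviour of the new three-argument functions corresponds to the fail option, which raises an exception for lists of unequal length, just like the two-argument functions shown earlier (my example):
7> lists:zip([a,b,c,d], [1,2,3], fail).
** exception error: no function clause matching . . .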
maybe in the runtime system

In OTP 25, the feature concept and the maybe feature were introduced. In order to use maybe in OTP 25, it is necessary to enable it in both the compiler and the runtime system. For example:
$ cat t.erl
-module(t).
-feature(maybe_expr, enable).
-export([listen_port/2]).

listen_port(Port, Options) ->
    maybe
        {ok, ListenSocket} ?= inet_tcp:listen(Port, Options),
        {ok, Address} ?= inet:sockname(ListenSocket),
        {ok, {ListenSocket, Address}}
    end.
$ erlc t.erl
$ erl
Erlang/OTP 25 . . .
Eshell V13.1.1 (abort with ^G)
1> t:listen_port(50000, []).
=ERROR REPORT==== 6-Apr-2023::12:01:20.373223 ===
Loading of . . ./t.beam failed: {features_not_allowed,
[maybe_expr]}
** exception error: undefined function t:listen_port/2
2> q().
$ erl -enable-feature maybe_expr
Erlang/OTP 25 . . .
Eshell V13.1.1 (abort with ^G)
1> t:listen_port(50000, []).
{ok,{#Port<0.5>,{{0,0,0,0},50000}}}
In OTP 26, it is no longer necessary to enable a feature in the
runtime system in order to load modules that are using it.
It is sufficient to have -feature(maybe_expr, enable).
in the module.
For example:
$ erlc t.erl
$ erl
Erlang/OTP 26 . . .
Eshell V14.0 (press Ctrl+G to abort, type help(). for help)
1> t:listen_port(50000, []).
{ok,{#Port<0.4>,{{0,0,0,0},50000}}}
OTP 26 improves on the type-based optimizations in the JIT introduced last year, but the most noticeable improvements are for matching and construction of binaries using the bit syntax. Those improvements, combined with changes to the base64 module itself, make encoding to Base64 about 4 times faster and decoding from Base64 more than 3 times faster.
More details about these improvements can be found in the blog post More Optimizations in the Compiler and JIT.
Worth mentioning here is also the re-introduction of an optimization that was lost when the JIT was introduced in OTP 24:
erts: Reintroduce literal fun optimization
It turns out that this optimization is important for the Jason library. Without it, JSON decoding is 10 percent slower.
Dialyzer has a new incremental mode implemented by Tom Davies. The incremental mode can greatly speed up the analysis when only small changes have been made to a code base.
Let’s jump straight into an example. Assuming that we want to prepare
a pull request for the stdlib
application, here is how we can use Dialyzer’s
incremental mode to show warnings for any issues in stdlib
:
$ dialyzer --incremental --apps erts kernel stdlib compiler crypto --warning_apps stdlib
Proceeding with incremental analysis... done in 0m14.91s
done (passed successfully)
Let’s break down the command line:
- The --incremental option tells Dialyzer to use the incremental mode.
- The --warning_apps stdlib option lists the application that we want warnings for. In this case, it’s the stdlib application.
- The --apps erts kernel stdlib compiler crypto option lists the applications that should be analyzed, but without generating any warnings.
Dialyzer analyzed all modules given for the --apps
and
--warning_apps
options. On my computer, the analysis finished in
about 15 seconds.
If I immediately run Dialyzer with the same arguments, it finishes pretty much instantaneously because nothing has been changed:
$ dialyzer --incremental --warning_apps stdlib --apps erts kernel stdlib compiler crypto
done (passed successfully)
If I make any change to the lists module (for example, by adding a new function), Dialyzer will re-analyze all modules that depend on it directly or indirectly:
$ dialyzer --incremental --warning_apps stdlib --apps erts kernel stdlib compiler crypto
There have been changes to analyze
Of the 270 files being tracked, 1 have been changed or removed,
resulting in 270 requiring analysis because they depend on those changes
Proceeding with incremental analysis... done in 0m14.95s
done (passed successfully)
It turns out that all modules in the analyzed applications depend on
the lists
module directly or indirectly.
If I change something in the base64
module, the re-analysis will be
much quicker because there are fewer dependencies:
$ dialyzer --incremental --warning_apps stdlib --apps erts kernel stdlib compiler crypto
There have been changes to analyze
Of the 270 files being tracked, 1 have been changed or removed,
resulting in 3 requiring analysis because they depend on those changes
Proceeding with incremental analysis... done in 0m1.07s
done (passed successfully)
In this case only three modules needed to be re-analyzed, which was done in about one second.
Note that all of the examples above used the same command line.
When running Dialyzer in the incremental mode, the list of applications to be analyzed and the list of applications to produce warnings for must be supplied every time Dialyzer is invoked.
To avoid having to supply the application lists on the command line, they can be put into a configuration file named dialyzer.config. To find out in which directory Dialyzer will look for the configuration file, run the following command:
$ dialyzer --help
.
.
.
Configuration file:
Dialyzer's configuration file may also be used to augment the default
options and those given directly to the Dialyzer command. It is commonly
used to avoid repeating options which would otherwise need to be given
explicitly to Dialyzer on every invocation.
The location of the configuration file can be set via the
DIALYZER_CONFIG environment variable, and defaults to
within the user_config location given by filename:basedir/3.
On your system, the location is currently configured as:
/Users/bjorng/Library/Application Support/erlang/dialyzer.config
An example configuration file's contents might be:
{incremental,
{default_apps,[stdlib,kernel,erts]},
{default_warning_apps,[stdlib]}
}.
{warnings, [no_improper_lists]}.
{add_pathsa,["/users/samwise/potatoes/ebin"]}.
{add_pathsz,["/users/smeagol/fish/ebin"]}.
.
.
.
Near the end there is information about the configuration file and where Dialyzer will look for it.
To shorten the command line for our previous examples, the following term can be stored in dialyzer.config:
{incremental,
{default_apps, [erts,kernel,stdlib,compiler,crypto]},
{default_warning_apps, [stdlib]}
}.
Now it is sufficient to just give the --incremental
option to Dialyzer:
$ dialyzer --incremental
done (passed successfully)
As a final example, let’s run Dialyzer on PropEr. To do that, the default_warning_apps option in the configuration file must be changed to proper. It is also necessary to add the add_pathsa option to prepend the path of the proper application to the code path:
{incremental,
{default_apps, [erts,kernel,stdlib,compiler,crypto]},
{default_warning_apps, [proper]}
}.
{add_pathsa, ["/Users/bjorng/git/proper/_build/default/lib/proper"]}.
Running Dialyzer:
$ dialyzer --incremental
There have been changes to analyze
Of the 296 files being tracked,
26 have been changed or removed,
resulting in 26 requiring analysis because they depend on those changes
Proceeding with incremental analysis...
proper.erl:2417:13: Unknown function cover:start/1
proper.erl:2426:13: Unknown function cover:stop/1
proper_symb.erl:249:9: Unknown function erl_syntax:atom/1
proper_symb.erl:250:5: Unknown function erl_syntax:revert/1
proper_symb.erl:250:23: Unknown function erl_syntax:application/3
proper_symb.erl:257:51: Unknown function erl_syntax:nil/0
proper_symb.erl:259:49: Unknown function erl_syntax:cons/2
proper_symb.erl:262:5: Unknown function erl_syntax:revert/1
proper_symb.erl:262:23: Unknown function erl_syntax:tuple/1
done in 0m2.36s
done (warnings were emitted)
Dialyzer found 26 new files to analyze (the BEAM files in the proper
application).
Those were analyzed in about two and a half seconds.
Dialyzer emitted warnings for unknown functions because proper calls functions in applications that were not being analyzed. To eliminate those warnings, the tools and syntax_tools applications can be added to the default_apps list in the configuration file:
{incremental,
{default_apps, [erts,kernel,stdlib,compiler,crypto,tools,syntax_tools]},
{default_warning_apps, [proper]}
}.
{add_pathsa, ["/Users/bjorng/git/proper/_build/default/lib/proper"]}.
With that change to the configuration file, no more warnings are printed:
$ dialyzer --incremental
There have been changes to analyze
Of the 319 files being tracked,
23 have been changed or removed,
resulting in 38 requiring analysis because they depend on those changes
Proceeding with incremental analysis... done in 0m6.47s
It is also possible to include warning options in the configuration file, for example to disable warnings for non-proper lists or to enable warnings for unmatched returns. Let’s enable warnings for unmatched returns:
{incremental,
{default_apps, [erts,kernel,stdlib,compiler,crypto,tools,syntax_tools]},
{default_warning_apps, [proper]}
}.
{warnings, [unmatched_returns]}.
{add_pathsa, ["/Users/bjorng/git/proper/_build/default/lib/proper"]}.
When the warning options are changed, Dialyzer will re-analyze all modules:
$ dialyzer --incremental
PLT was built for a different set of enabled warnings,
so an analysis must be run for 319 modules to rebuild it
Proceeding with incremental analysis... done in 0m19.43s
done (passed successfully)
dialyzer: Add incremental analysis mode
New in OTP 26 is the
argparse module, which
simplifies parsing of the command line in
escripts. argparse
was implemented by Maxim Fedorov.
To show only a few of the features, let’s implement the command-line
parsing for an escript called ehead
, inspired by the Unix command
head:
#!/usr/bin/env escript
%% -*- erlang -*-

main(Args) ->
    argparse:run(Args, cli(), #{progname => ehead}).

cli() ->
    #{
      arguments =>
          [#{name => lines, type => {integer, [{min, 1}]},
             short => $n, long => "-lines", default => 10,
             help => "number of lines to print"},
           #{name => files, nargs => nonempty_list, action => extend,
             help => "lists of files"}],
      handler => fun(Args) ->
                         io:format("~p\n", [Args])
                 end
     }.
As currently written, the ehead
script will simply print the
arguments collected by argparse
and quit.
If ehead
is run without any arguments an error message will be
shown:
$ ehead
error: ehead: required argument missing: files
Usage:
ehead [-n <lines>] [--lines <lines>] <files>...
Arguments:
files lists of files
Optional arguments:
-n, --lines number of lines to print (int >= 1, 10)
The message tells us that at least one file name must be given:
$ ehead foo bar baz
#{lines => 10,files => ["foo","bar","baz"]}
Since the command line was valid, argparse
collected the arguments
into a map, which was then printed by the handler
fun.
The number of lines to be printed from each file defaults to 10
, but
can be changed using either the -n
or --lines
option:
$ ehead -n 42 foo bar baz
#{lines => 42,files => ["foo","bar","baz"]}
$ ehead foo --lines=42 bar baz
#{lines => 42,files => ["foo","bar","baz"]}
$ ehead --lines 42 foo bar baz
#{lines => 42,files => ["foo","bar","baz"]}
$ ehead foo bar --lines 42 baz
#{lines => 42,files => ["foo","bar","baz"]}
Attempting to give the number of lines as 0
results in an error message:
$ ehead -n 0 foobar
error: ehead: invalid argument for lines: 0 is less than accepted minimum
Usage:
ehead [-n <lines>] [--lines <lines>] <files>...
Arguments:
files lists of files
Optional arguments:
-n, --lines number of lines to print (int >= 1, 10)
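To turn ehead into something useful, the handler fun in cli/0 could actually print the first Lines lines of each file instead of printing the argument map. Here is a minimal sketch of mine (print_head/2 is a hypothetical helper, not part of the original example):
%% In cli/0, replace the handler with:
%%     handler => fun(#{lines := Lines, files := Files}) ->
%%                        lists:foreach(fun(F) -> print_head(F, Lines) end, Files)
%%                end

%% Hypothetical helper: print the first Lines lines of File.
print_head(File, Lines) ->
    {ok, Bin} = file:read_file(File),
    AllLines = string:split(Bin, "\n", all),
    Head = lists:sublist(AllLines, Lines),
    io:format("~s\n", [lists:join("\n", Head)]).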
[argparse] Command line parser for Erlang
In OTP 25, the default options for ssl:connect/3 would allow setting up a connection without verifying the authenticity of the server (that is, without checking the server’s certificate chain). For example:
Erlang/OTP 25 . . .
Eshell V13.1.1 (abort with ^G)
1> application:ensure_all_started(ssl).
{ok,[crypto,asn1,public_key,ssl]}
2> ssl:connect("www.erlang.org", 443, []).
=WARNING REPORT==== 6-Apr-2023::12:29:20.824457 ===
Description: "Authenticity is not established by certificate path validation"
Reason: "Option {verify, verify_peer} and cacertfile/cacerts is missing"
{ok,{sslsocket,{gen_tcp,#Port<0.6>,tls_connection,undefined},
[<0.122.0>,<0.121.0>]}}
A warning report would be generated, but a connection would be set up.
In OTP 26, the default value for the verify option is now verify_peer instead of verify_none. Host verification requires trusted CA certificates to be supplied using one of the options cacerts or cacertfile. Therefore, a connection attempt with an empty option list will fail in OTP 26:
Erlang/OTP 26 . . .
Eshell V14.0 (press Ctrl+G to abort, type help(). for help)
1> application:ensure_all_started(ssl).
{ok,[crypto,asn1,public_key,ssl]}
2> ssl:connect("www.erlang.org", 443, []).
{error,{options,incompatible,
[{verify,verify_peer},{cacerts,undefined}]}}
The default value for the cacerts option is undefined, which is not compatible with the {verify,verify_peer} option.
To make the connection succeed, the recommended way is to
use the cacerts
option to supply CA certificates to be used
for verifying. For example:
1> application:ensure_all_started(ssl).
{ok,[crypto,asn1,public_key,ssl]}
2> ssl:connect("www.erlang.org", 443, [{cacerts, public_key:cacerts_get()}]).
{ok,{sslsocket,{gen_tcp,#Port<0.5>,tls_connection,undefined},
[<0.137.0>,<0.136.0>]}}
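If the trusted CA certificates are stored in a PEM file, they can instead be supplied with the cacertfile option (the path below is just an example; connection output not shown):
3> ssl:connect("www.erlang.org", 443, [{cacertfile, "/etc/ssl/certs/ca-certificates.crt"}]).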
Alternatively, host verification can be explicitly disabled. For example:
1> application:ensure_all_started(ssl).
{ok,[crypto,asn1,public_key,ssl]}
2> ssl:connect("www.erlang.org", 443, [{verify,verify_none}]).
{ok,{sslsocket,{gen_tcp,#Port<0.6>,tls_connection,undefined},
[<0.143.0>,<0.142.0>]}}
Another way that OTP 26 is safer is that legacy algorithms such as SHA1 and DSA are no longer allowed by default.
In OTP 26, the checking of options is strengthened to return errors
for incorrect options that used to be silently ignored. For example,
ssl
now rejects the fail_if_no_peer_cert
option if used for the
client:
1> application:ensure_all_started(ssl).
{ok,[crypto,asn1,public_key,ssl]}
2> ssl:connect("www.erlang.org", 443, [{fail_if_no_peer_cert, true}, {verify, verify_peer}, {cacerts, public_key:cacerts_get()}]).
{error,{option,server_only,fail_if_no_peer_cert}}
In OTP 25, the option would be silently ignored.
ssl
in OTP 26 also returns clearer error reasons. In the example in
the previous section the following connection attempt was shown:
2> ssl:connect("www.erlang.org", 443, []).
{error,{options,incompatible,
[{verify,verify_peer},{cacerts,undefined}]}}
In OTP 25, the corresponding error return is less clear:
2> ssl:connect("www.erlang.org", 443, [{verify,verify_peer}]).
{error,{options,{cacertfile,[]}}}
In OTP 25, the compiler was updated to embed type information in the BEAM file and the JIT was extended to emit better code based on that type information. Those improvements were described in the blog post Type-Based Optimizations in the JIT.
As mentioned in that blog post, there were limitations in both the compiler and the JIT that prevented many optimizations. In OTP 26, the compiler will produce better type information and the JIT will take better advantage of the improved type information, typically resulting in fewer redundant type tests and smaller native code size.
A new BEAM instruction introduced in OTP 26 makes record updates faster by a small but measurable amount.
The most noticeable performance improvements in OTP 26 are probably for matching and construction of binaries using the bit syntax. Those improvements, combined with changes to the base64 module itself, make encoding to Base64 about 4 times as fast and decoding from Base64 more than 3 times as fast.
While this blog post will show many examples of generated code, I have attempted to explain the optimizations in English as well. Feel free to skip the code examples.
On the other hand, if you want more code examples…
To examine the native code for loaded modules, start the runtime system like this:
erl +JDdump true
The native code for all modules that are loaded will be dumped to files with the
extension .asm
.
To examine the BEAM code for a module, use the -S
option when
compiling. For example:
erlc -S base64.erl
Let’s quickly summarize the type-based optimizations in OTP 25. For more details, see the aforementioned blog post.
First consider an addition of two values with nothing known about their types:
add1(X, Y) ->
    X + Y.
The BEAM code looks like this:
{gc_bif,'+',{f,0},2,[{x,0},{x,1}],{x,0}}.
return.
Without any information about the operands, the JIT must emit code that can handle all possible types for the operands. For the x86_64 architecture, 14 native instructions are needed.
If the operands are known to be integers sufficiently small to make overflow impossible, the JIT needs to emit only 5 native instructions for the addition.
Here is an example where the types and ranges of the operands for the
+
operator are known:
add5(X, Y) when X =:= X band 16#3FF,
                Y =:= Y band 16#3FF ->
    X + Y.
The BEAM code for this function is as follows:
{gc_bif,'band',{f,24},2,[{x,0},{integer,1023}],{x,2}}.
{test,is_eq_exact,
{f,24},
[{tr,{x,0},{t_integer,any}},{tr,{x,2},{t_integer,{0,1023}}}]}.
{gc_bif,'band',{f,24},2,[{x,1},{integer,1023}],{x,2}}.
{test,is_eq_exact,
{f,24},
[{tr,{x,1},{t_integer,any}},{tr,{x,2},{t_integer,{0,1023}}}]}.
{gc_bif,'+',
{f,0},
2,
[{tr,{x,0},{t_integer,{0,1023}}},{tr,{x,1},{t_integer,{0,1023}}}],
{x,0}}.
return.
The register operands ({x,0} and {x,1}) have now been annotated with type information:
{tr,Register,Type}
That is, each register operand is a three-tuple with tr
as the first
element. tr
stands for typed register. The second element is the
BEAM register ({x,0}
or {x,1}
in this case), and the third element
is the type of the register in the compiler’s internal type
representation. {t_integer,{0,1023}}
means that the value is an
integer in the inclusive range 0 through 1023.
With that type information, the JIT emits the following native code
for the +
operator:
# i_plus_ssjd
# add without overflow check
mov rax, qword ptr [rbx]
mov rsi, qword ptr [rbx+8]
and rax, -16 ; Zero the tag bits
add rax, rsi
mov qword ptr [rbx], rax
(Lines starting with #
are comments emitted by the JIT, while the
text that follows ;
is a comment added by me for clarification.)
The reduction in code size from 14 instructions down to 5 is nice, but having to express the range check in that convoluted way using band can hardly be called natural.
If we try to express the range checks in a more natural way:
add4(X, Y) when is_integer(X), 0 =< X, X < 16#400,
                is_integer(Y), 0 =< Y, Y < 16#400 ->
    X + Y.
the compiler in OTP 25 will no longer be able to figure out the ranges for the operands. Here is the BEAM code:
{test,is_integer,{f,22},[{x,0}]}.
{test,is_ge,{f,22},[{tr,{x,0},{t_integer,any}},{integer,0}]}.
{test,is_lt,{f,22},[{tr,{x,0},{t_integer,any}},{integer,1024}]}.
{test,is_integer,{f,22},[{x,1}]}.
{test,is_ge,{f,22},[{tr,{x,1},{t_integer,any}},{integer,0}]}.
{test,is_lt,{f,22},[{tr,{x,1},{t_integer,any}},{integer,1024}]}.
{gc_bif,'+',
{f,0},
2,
[{tr,{x,0},{t_integer,any}},{tr,{x,1},{t_integer,any}}],
{x,0}}.
return.
Because of that severe limitation in the compiler’s value range analysis, I wrote:
We aim to improve the type analysis and optimizations in OTP 26 and generate better code for this example.
Compiling the same example with OTP 26, the result is:
{test,is_integer,{f,19},[{x,0}]}.
{test,is_ge,{f,19},[{tr,{x,0},{t_integer,any}},{integer,0}]}.
{test,is_ge,{f,19},[{integer,1023},{tr,{x,0},{t_integer,{0,'+inf'}}}]}.
{test,is_integer,{f,19},[{x,1}]}.
{test,is_ge,{f,19},[{tr,{x,1},{t_integer,any}},{integer,0}]}.
{test,is_ge,{f,19},[{integer,1023},{tr,{x,1},{t_integer,{0,'+inf'}}}]}.
{gc_bif,'+',
{f,0},
2,
[{tr,{x,0},{t_integer,{0,1023}}},{tr,{x,1},{t_integer,{0,1023}}}],
{x,0}}.
The BEAM instruction for the + operator now has ranges for its operands.
Let’s look a little bit closer at the first three instructions, which correspond to the guard test is_integer(X), 0 =< X, X < 16#400.
First is the guard check for an integer:
{test,is_integer,{f,19},[{x,0}]}.
It is followed by the guard test 0 =< X
(rewritten to X >= 0
by the compiler):
{test,is_ge,{f,19},[{tr,{x,0},{t_integer,any}},{integer,0}]}.
As a result of the is_integer/1
test it is known that {x,0}
is an integer.
The third instruction corresponds to X < 16#400
, which the compiler
has rewritten to 16#3FF >= X
(1023 >= X
):
{test,is_ge,{f,19},[{integer,1023},{tr,{x,0},{t_integer,{0,'+inf'}}}]}.
In the type for the {x,0} register there is something new for OTP 26. It says that the range is 0 through '+inf', that is, from 0 up to positive infinity. Combining that range with the range from this instruction, the Erlang compiler can infer that if this instruction succeeds, the type for {x,0} is {t_integer,{0,1023}}.
In OTP 25, the JIT would emit native code for each BEAM instruction in the guard individually. When translated individually, the three guard tests for one of the variables each require 11 native instructions, or 33 instructions for all three.
By having the BEAM loader combine the three guard tests into a single is_int_in_range instruction, the JIT is capable of doing a much better job, emitting a mere 6 native instructions.
How is that possible?
As individual BEAM instructions, each guard test needs 5 instructions
to fetch the value from {x,0}
and test that the value is a small
integer. As a combined instruction, that only needs to be done once.
Other parts of the guard tests also become redundant in the combined
instruction and can be omitted. For example, the is_integer/1
type
test will also succeed if its argument is a bignum (an integer
that does not fit in a machine word). Clearly, a bignum will fall well
outside the range 0 through 1023, so if the argument is not a small
integer, the combined guard test will fail immediately.
With those and some other simplifications, we end up with the following native instructions:
# is_int_in_range_fScc
mov rax, qword ptr [rbx]
sub rax, 15
test al, 15
short jne label_19
cmp rax, 16368
short ja label_19
The first instruction fetches the value of {x,0}
to the CPU
register rax
:
mov rax, qword ptr [rbx]
The next instruction subtracts the tagged value for the lower bound of the range. Since the lower bound of the range is 0 and the tag for small integers is 15, the value that is subtracted is 16 * 0 + 15 or simply 15. (For small integers, the runtime system uses the 4 least significant bits of the word as tag bits.) If the lower bound had been 1, the value to be subtracted would have been 16 * 1 + 15 or 31:
sub rax, 15
The subtraction achieves two aims at once. Firstly, it simplifies the tag test in the next two instructions, because if the value of {x,0} is a small integer, the 4 least significant bits will now be zero:
test al, 15
short jne label_19
The test al, 15 instruction does a bitwise AND of the lower byte of the CPU register rax with the value 15, discarding the result but setting CPU flags depending on the value. The next instruction tests whether the result was nonzero (the tag was not the tag for a small integer), in which case the test fails and a jump to the failure label is made.
The second aim for the subtraction is to simplify the range check.
If the value being tested was below the lower bound, the value
of rax
will be negative after the subtraction.
Since integers are represented in two’s complement notation, a
signed negative integer interpreted as an unsigned integer will be a
very large integer. Therefore, both bounds can be checked at once
using the old trick of treating the value in rax
as unsigned:
cmp rax, 16368
short ja label_19
The cmp rax, 16368
instruction compares the value in rax
with the
difference of the tagged upper bound and the tagged lower bound, that
is:
(16 * 1023 + 15) - (16 * 0 + 15)
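Evaluating that expression confirms the constant used in the cmp instruction:
(16 * 1023 + 15) - (16 * 0 + 15) = 16383 - 15 = 16368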
ja stands for “Jump (if) Above”, that is, jump if the CPU flags indicate that in the previous comparison of unsigned integers the first integer was greater than the second. Since a negative number represented in two’s complement notation looks like a huge integer when interpreted as an unsigned integer, short ja label_19 will transfer control to the failure label for values both below the lower bound and above the upper bound.
The JIT in OTP 26 generates better code for common combinations of
relational operators. In order to reduce the number of combinations
that the JIT will need to handle, the compiler rewrites the <
operator to >=
if possible. In the previous example, it was shown
that the compiler rewrote X < 1024
to 1023 >= X
.
Let’s look at a contrived example to show (off) a few more improvements in the code generation:
add6(M) when is_map(M) ->
    A = map_size(M),
    if
        9 < A, A < 100 ->
            A + 6
    end.
The main part of the BEAM code looks like this:
{test,is_map,{f,41},[{x,0}]}.
{gc_bif,map_size,{f,0},1,[{tr,{x,0},{t_map,any,any}}],{x,0}}.
{test,is_ge,
{f,43},
[{tr,{x,0},{t_integer,{0,288230376151711743}}},{integer,10}]}.
{test,is_ge,
{f,43},
[{integer,99},{tr,{x,0},{t_integer,{10,288230376151711743}}}]}.
{gc_bif,'+',{f,0},1,[{tr,{x,0},{t_integer,{10,99}}},{integer,6}],{x,0}}.
return.
In OTP 26, the JIT will inline the code for many of the most
frequently used guard BIFs. Here is the native code for the
map_size/1
call:
# bif_map_size_jsd
mov rax, qword ptr [rbx] ; Fetch map from {x,0}
# skipped type check because the argument is always a map
mov rax, qword ptr [rax+6] ; Fetch size of map
shl rax, 4
or al, 15 ; Tag as small integer
mov qword ptr [rbx], rax ; Store size in {x,0}
The two is_ge
instructions are combined by the BEAM loader into
an is_in_range
instruction:
# is_in_range_ffScc
# simplified fetching of BEAM register
mov rdi, rax
# skipped test for small operand since it always small
sub rdi, 175
cmp rdi, 1424
ja label_43
The first instruction is a new optimization in OTP 26. Normally {x,0} is fetched using the instruction mov rax, qword ptr [rbx]. However, in this case, the last native instruction emitted for the previous BEAM instruction is mov qword ptr [rbx], rax. Therefore, since it is known that the contents of {x,0} are already in CPU register rax, the fetch can be simplified to:
# simplified fetching of BEAM register
mov rdi, rax
The size of a map that will fit in memory on a 64-bit computer is always a small integer, so the test for a small integer is skipped:
# skipped test for small operand since it always small
sub rdi, 175 ; Subtract 16 * 10 + 15
cmp rdi, 1424 ; Compare with (16*99+15)-(16*10+15)
ja label_43
The native code for the +
operator looks like this:
# i_plus_ssjd
# add without overflow check
mov rax, qword ptr [rbx]
add rax, 96 ; 16 * 6 + 0
mov qword ptr [rbx], rax
The previous example of combining guard tests showed that the JIT can often generate better code if multiple BEAM instructions are combined into one. While the BEAM loader is capable of combining instructions it is often more practical to let the Erlang compiler emit combined instructions.
OTP 26 introduces two new instructions, each of which replaces a sequence of any number of simpler instructions:

- update_record for updating any number of fields in a record.
- bs_match for matching multiple segments of fixed size.
In OTP 25, the bs_create_bin
instruction for constructing a binary
with any number of segments was introduced, but its full potential for
generating efficient code was not leveraged in OTP 25.
Consider the following example of a record definition and three functions that update the record:
-record(r, {a,b,c,d,e}).

update_a(R) ->
    R#r{a=42}.

update_ce(R) ->
    R#r{c=99,e=777}.

update_bcde(R) ->
    R#r{b=2,c=3,d=4,e=5}.
In OTP 25 and earlier, the way in which a record is updated depends on both the number of fields being updated and the size of the record.
When a single field in a record is updated, as in update_a/1
, the
setelement/3
BIF is called:
{test,is_tagged_tuple,{f,34},[{x,0},6,{atom,r}]}.
{move,{x,0},{x,1}}.
{move,{integer,42},{x,2}}.
{move,{integer,2},{x,0}}.
{call_ext_only,3,{extfunc,erlang,setelement,3}}.
When updating more than one field but fewer than approximately half of
the fields, as in update_ce/1
, code similar to the following is
emitted:
{test,is_tagged_tuple,{f,37},[{x,0},6,{atom,r}]}.
{allocate,0,1}.
{move,{x,0},{x,1}}.
{move,{integer,777},{x,2}}.
{move,{integer,6},{x,0}}.
{call_ext,3,{extfunc,erlang,setelement,3}}.
{set_tuple_element,{integer,99},{x,0},3}.
{deallocate,0}.
return.
Here the e
field is updated using setelement/3
, followed by
set_tuple_element
to update the c
field destructively. Erlang does
not allow mutation of terms, but here it is done “under the hood” in a
safe way.
When a majority of the fields are updated, as in update_bcde/1
, a
new tuple is built:
{test,is_tagged_tuple,{f,40},[{x,0},6,{atom,r}]}.
{test_heap,7,1}.
{get_tuple_element,{x,0},1,{x,0}}.
{put_tuple2,{x,0},
{list,[{atom,r},
{x,0},
{integer,2},
{integer,3},
{integer,4},
{integer,5}]}}.
return.
In OTP 26, all records are updated using the new BEAM instruction update_record. For example, here is the main part of the BEAM code for update_a/1:
{test,is_tagged_tuple,{f,34},[{x,0},6,{atom,r}]}.
{test_heap,7,1}.
{update_record,{atom,reuse},6,{x,0},{x,0},{list,[2,{integer,42}]}}.
return.
The last operand is a list of positions in the tuple and their corresponding new values. (Position 2 is the a field, since element 1 of the tuple is the record name r.)
The first operand, {atom,reuse}
, is a hint to the JIT that it is possible
that the source tuple is already up to date and does not need to be updated.
Another possible value for the hint operand is {atom,copy}
, meaning that
the source tuple is definitely not up to date.
The JIT emits the following native code for the update_record
instruction:
# update_record_aIsdI
mov rax, qword ptr [rbx]
mov rdi, rax
cmp qword ptr [rdi+14], 687
je L130
vmovups xmm0, [rax-2]
vmovups [r15], xmm0
mov qword ptr [r15+16], 687
vmovups ymm0, [rax+22]
vmovups [r15+24], ymm0
lea rax, qword ptr [r15+2]
add r15, 56
L130:
mov qword ptr [rbx], rax
Let’s walk through those instructions. First the value of {x,0}
is fetched:
mov rax, qword ptr [rbx]
Since the hint operand is the atom reuse, it is possible that it is unnecessary to copy the tuple. Therefore, the JIT emits an instruction sequence to test whether the a field (position 2 in the tuple) already contains the value 42. If so, the source tuple can be reused:
mov rdi, rax
cmp qword ptr [rdi+14], 687 ; 42
je L130 ; Reuse source tuple
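As an aside, the constant 687 is simply the tagged representation of 42, following the small-integer tagging scheme described earlier:
16 * 42 + 15 = 687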
Next follows the copy and update sequence. First the header word for
the tuple and its first element (the r
atom) are copied using
AVX instructions:
vmovups xmm0, [rax-2]
vmovups [r15], xmm0
Next the value 42 is stored into position 2 of the copy of the tuple:
mov qword ptr [r15+16], 687 ; 42
Finally the remaining four elements of the tuple are copied:
vmovups ymm0, [rax+22]
vmovups [r15+24], ymm0
All that remains is to create a tagged pointer to the newly created tuple and increment the heap pointer:
lea rax, qword ptr [r15+2]
add r15, 56
The last instruction stores the tagged pointer to either the original
or the updated tuple to {x,0}
:
L130:
mov qword ptr [rbx], rax
The BEAM code for update_ce/1
is very similar to the code for update_a/1
:
{test,is_tagged_tuple,{f,37},[{x,0},6,{atom,r}]}.
{test_heap,7,1}.
{update_record,{atom,reuse},
6,
{x,0},
{x,0},
{list,[4,{integer,99},6,{integer,777}]}}.
return.
The native code looks like this:
# update_record_aIsdI
mov rax, qword ptr [rbx]
vmovups ymm0, [rax-2]
vmovups [r15], ymm0
mov qword ptr [r15+32], 1599 ; 99
mov rdi, [rax+38]
mov [r15+40], rdi
mov qword ptr [r15+48], 12447 ; 777
lea rax, qword ptr [r15+2]
add r15, 56
mov qword ptr [rbx], rax
Note that the copying and updating is done unconditionally, despite the reuse hint. The JIT is free to ignore the hints. When multiple fields are being updated, the test for whether the update is unnecessary would be more expensive, and it is also much less likely that all of the fields would turn out to be unchanged. Therefore, trying to reuse the original tuple is more likely to be a pessimization than an optimization.
To explore the optimizations of binaries, the following example will be used:
bin_swap(<<A:8,B:24>>) ->
    <<B:24,A:8>>.
Somewhat simplified, the main part of the BEAM code as emitted by the compiler in OTP 25 looks like this:
{test,bs_start_match3,{f,1},1,[{x,0}],{x,1}}.
{bs_get_position,{x,1},{x,0},2}.
{test,bs_get_integer2,
{f,2},
2,
[{x,1},
{integer,8},
1,
{field_flags,[unsigned,big]}],
{x,2}}.
{test,bs_get_integer2,
{f,2},
3,
[{x,1},
{integer,24},
1,
{field_flags,[unsigned,big]}],
{x,3}}.
{test,bs_test_tail2,{f,2},[{x,1},0]}.
{bs_create_bin,{f,0},
0,4,1,
{x,0},
{list,[{atom,integer},
1,1,nil,
{tr,{x,3},{t_integer,{0,16777215}}},
{integer,24},
{atom,integer},
2,1,nil,
{tr,{x,2},{t_integer,{0,255}}},
{integer,8}]}}.
return.
Let’s walk through the code. The first instruction sets up a match context:
{test,bs_start_match3,{f,1},1,[{x,0}],{x,1}}.
A match context holds several pieces of information needed for matching a binary.
The next instruction saves information that will be needed if matching of the binary fails for some reason:
{bs_get_position,{x,1},{x,0},2}.
The next two instructions match out two segments as integers (comments added by me):
{test,bs_get_integer2,
{f,2}, % Failure label
2, % Number of live X registers (needed for GC)
[{x,1}, % Match context register
{integer,8}, % Size of segment in units
1, % Unit value
{field_flags,[unsigned,big]}],
{x,2}}. % Destination register
{test,bs_get_integer2,
{f,2},
3,
[{x,1},
{integer,24},
1,
{field_flags,[unsigned,big]}],
{x,3}}.
The next instruction makes sure that the end of the binary has now been reached:
{test,bs_test_tail2,{f,2},[{x,1},0]}.
The next instruction creates the binary with the segments swapped:
{bs_create_bin,{f,0},
0,4,1,
{x,0},
{list,[{atom,integer},
1,1,nil,
{tr,{x,3},{t_integer,{0,16777215}}},
{integer,24},
{atom,integer},
2,1,nil,
{tr,{x,2},{t_integer,{0,255}}},
{integer,8}]}}.
Before OTP 25, creation of binaries was done using multiple
instructions, similar to how binary matching is still done in
OTP 25. The reason for creating the bs_create_bin
instruction in OTP 25
was to be able to provide improved error information when construction
of a binary fails, similar to the improved BIF error
information.
When a segment of size 8, 16, 32, or 64 is matched, specialized instructions are used for x86_64. The specialized instructions do everything inline, provided that the segment is byte-aligned. (The JIT in OTP 25 for AArch64/ARM64 does not have these specialized instructions.) Here is the instruction for matching a segment of size 8:
# i_bs_get_integer_8_Stfd
mov rcx, qword ptr [rbx+8]
mov rsi, qword ptr [rcx+22]
lea rdx, qword ptr [rsi+8]
cmp rdx, qword ptr [rcx+30]
ja label_25
rex test sil, 7
short je L91
mov edx, 64
call L92
short jmp L90
L91:
mov rdi, qword ptr [rcx+14]
shr rsi, 3
mov qword ptr [rcx+22], rdx
movzx rax, byte ptr [rdi+rsi]
shl rax, 4
or rax, 15
L90:
mov qword ptr [rbx+16], rax
The first two instructions pick up the pointer to the match context and from the match context the current bit offset into the binary:
mov rcx, qword ptr [rbx+8] ; Load pointer to match context
mov rsi, qword ptr [rcx+22] ; Get offset in bits into binary
The next three instructions ensure that the length of the binary is at least 8 bits:
lea rdx, qword ptr [rsi+8] ; Add 8 to the offset
cmp rdx, qword ptr [rcx+30] ; Compare offset+8 with size of binary
ja label_25 ; Fail if the binary is too short
The next five instructions test whether the current byte in the binary is aligned at a byte boundary. If not, a helper code fragment is called:
rex test sil, 7 ; Test the 3 least significant bits
short je L91 ; Jump if 0 (meaning byte-aligned)
mov edx, 64 ; Load size and flags
call L92 ; Call helper fragment
short jmp L90 ; Done
A helper code fragment is a shared block of code that can be called from the native code generated for BEAM instructions, typically to handle cases that are uncommon and/or would require more native instructions than are practical to include inline. Each such code fragment has its own calling convention, typically tailor-made to be as convenient for the caller as possible. (See Further adventures in the JIT for more information about helper code fragments.)
The remaining instructions read one byte from memory, convert it to a tagged Erlang term, store it in {x,2}, and advance the bit offset in the match context:
L91:
mov rdi, qword ptr [rcx+14] ; Load base pointer for binary
shr rsi, 3 ; Convert bit offset to byte offset
mov qword ptr [rcx+22], rdx ; Update bit offset in match context
movzx rax, byte ptr [rdi+rsi] ; Read one byte from the binary
shl rax, 4 ; Multiply by 16...
or rax, 15 ; ... and add tag for a small integer
L90:
mov qword ptr [rbx+16], rax ; Store extracted integer
When matching a segment of a size other than one of the special sizes mentioned earlier, the JIT will always emit a call to a general routine that can handle matching of any integer segment with any alignment, endianness, and signedness.
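For example, a function head like the following (a made-up example of mine), which matches a 7-bit segment, falls back to that general routine, since 7 is not one of the specialized sizes:
%% 7 bits is not a specialized size (8, 16, 32, or 64), so the JIT
%% emits a call to the general extraction routine for this segment.
take7(<<X:7,_Rest/bitstring>>) ->
    X.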
In OTP 25, the full potential for optimization of the bs_create_bin instruction is not realized. The construction of each segment is done by calling a helper routine that builds the segment. Here is the native code for the part of the bs_create_bin instruction that builds the integer segments:
# construct integer segment
mov edx, 24
mov rsi, qword ptr [rbx+24]
xor ecx, ecx
lea rdi, qword ptr [rbx-80]
call 4387496416
# construct integer segment
mov edx, 8
mov rsi, qword ptr [rbx+16]
xor ecx, ecx
lea rdi, qword ptr [rbx-80]
call 4387496416
In OTP 26, there is a new BEAM bs_match
instruction used for
matching segments with sizes known at compile time. The BEAM code for
the matching code in the function head for bin_swap/1
is as follows:
{test,bs_start_match3,{f,1},1,[{x,0}],{x,1}}.
{bs_get_position,{x,1},{x,0},2}.
{bs_match,{f,2},
{x,1},
{commands,[{ensure_exactly,32},
{integer,2,{literal,[]},8,1,{x,2}},
{integer,3,{literal,[]},24,1,{x,3}}]}}.
The first two instructions are identical to their OTP 25 counterparts.
The first operand of the bs_match instruction, {f,2}, is the failure label, and the second operand, {x,1}, is the register holding the match context. The third operand, {commands,[...]}, is a list of matching commands.
The first command in the commands
list, {ensure_exactly,32}
, tests
that the remaining number of bits in the binary being matched is
exactly 32. If not, a jump is made to the failure label.
The second command extracts an integer of 8 bits and stores it in {x,2}. The third command extracts an integer of 24 bits and stores it in {x,3}.
Having matching of multiple segments contained in a single BEAM instruction makes it much easier for the JIT to generate efficient code. Here is what the native code will do:

- Test that there are exactly 32 bits left in the binary.
- If the segment is byte-aligned, read a 4-byte word from the binary and store it in a CPU register.
- If the segment is not byte-aligned, read an 8-byte word from the binary and shift to extract the 32 bits needed.
- Shift and mask out 8 bits and tag as an integer. Store into {x,2}.
- Shift and mask out 24 bits and tag as an integer. Store into {x,3}.
The native code for the bs_match instruction (slightly simplified) is as follows:
# i_bs_match_fS
# ensure_exactly 32
mov rsi, qword ptr [rbx+8]
mov rax, qword ptr [rsi+30]
mov rcx, qword ptr [rsi+22]
sub rax, rcx
cmp rax, 32
jne label_3
# read 32
mov rdi, qword ptr [rsi+14]
add qword ptr [rsi+22], 32
mov rax, rcx
shr rax, 3
add rdi, rax
and ecx, 7
jnz L38
movbe edx, dword ptr [rdi]
add ecx, 32
short jmp L40
L38:
mov rdx, qword ptr [rdi-3]
shr rdx, 24
bswap rdx
L40:
shl rdx, cl
# extract integer 8
mov rax, rdx
# store extracted integer as a small
shr rax, 52
or rax, 15
mov qword ptr [rbx+16], rax
shl rdx, 8
# extract integer 24
shr rdx, 36
or rdx, 15
mov qword ptr [rbx+24], rdx
The first part of the code ensures that there are exactly 32 bits remaining in the binary:
# ensure_exactly 32
mov rsi, qword ptr [rbx+8] ; Get pointer to match context
mov rax, qword ptr [rsi+30] ; Get size of binary in bits
mov rcx, qword ptr [rsi+22] ; Get offset in bits into binary
sub rax, rcx
cmp rax, 32
jne label_3
The next part of the code does not directly correspond to the commands
in the bs_match
BEAM instruction. Instead, the code reads 32 bits
from the binary:
# read 32
mov rdi, qword ptr [rsi+14]
add qword ptr [rsi+22], 32 ; Increment bit offset in match context
mov rax, rcx
shr rax, 3
add rdi, rax
and ecx, 7 ; Test alignment
jnz L38 ; Jump if segment not byte-aligned
; Read 32 bits (4 bytes) byte-aligned and convert to big-endian
movbe edx, dword ptr [rdi]
add ecx, 32
short jmp L40
L38:
; Read a 8-byte word and extract the 32 bits that are needed.
mov rdx, qword ptr [rdi-3]
shr rdx, 24
bswap rdx ; Convert to big-endian
L40:
; Shift the read bytes to the most significant bytes of the word
shl rdx, cl
The 4 bytes read will be converted to big-endian and placed as the
most significant bytes of CPU register rdx
with the rest of the
register zeroed.
The following instructions extract the 8 bits for the first segment and store them as a tagged integer in {x,2}:
# extract integer 8
mov rax, rdx
# store extracted integer as a small
shr rax, 52
or rax, 15
mov qword ptr [rbx+16], rax
shl rdx, 8
The following instructions extract the 24 bits for the second segment and store them as a tagged integer in {x,3}:
# extract integer 24
shr rdx, 36
or rdx, 15
mov qword ptr [rbx+24], rdx
For binary construction in OTP 26, the compiler emits a
bs_create_bin
BEAM instruction just as in OTP 25. However, the
native code that the JIT in OTP 26 emits for that instruction bears
little resemblance to the native code emitted by OTP 25. The native
code will do the following:
- Allocate room on the heap for a binary and initialize it with inlined native code. A helper code fragment is called to do a garbage collection if there is not sufficient room left on the heap.
- Read the integer from {x,3} and untag it.
- Read the integer from {x,2} and untag it. Combine the value with the previous 24-bit value to obtain a 32-bit value.
- Write the combined 32 bits into the binary.
Here follows the complete native code for the bs_create_bin
instruction (somewhat simplified):
# i_bs_create_bin_jItd
# allocate heap binary
lea rdx, qword ptr [r15+56]
cmp rdx, rsp
short jbe L43
mov ecx, 4
.db 0x90
call 4343630296
L43:
lea rax, qword ptr [r15+2]
mov qword ptr [rbx-120], rax
mov qword ptr [r15], 164
mov qword ptr [r15+8], 4
add r15, 16
mov qword ptr [rbx-64], r15
mov qword ptr [rbx-56], 0
add r15, 8
# accumulate value for integer segment
xor r8d, r8d
mov rdi, qword ptr [rbx+24]
sar rdi, 4
or r8, rdi
# accumulate value for integer segment
shl r8, 8
mov rdi, qword ptr [rbx+16]
sar rdi, 4
or r8, rdi
# construct integer segment from accumulator
bswap r8d
mov rdi, qword ptr [rbx-64]
mov qword ptr [rbx-56], 32
mov dword ptr [rdi], r8d
Let’s walk through it.
The first part of the code, starting with # allocate heap binary and ending before the next comment line, allocates a heap binary with inlined native code. The only call to a helper code fragment is in case there is not sufficient space left on the heap.
Next follows the construction of the segments of the binary.
Instead of writing the value of each segment to memory one at a time, multiple segments are accumulated into a CPU register. Here follows the code for the first segment to be constructed (24 bits):
# accumulate value for integer segment
xor r8d, r8d ; Initialize accumulator
mov rdi, qword ptr [rbx+24] ; Fetch {x,3}
sar rdi, 4 ; Untag
or r8, rdi ; OR into accumulator
Here follows the code for the second segment (8 bits):
# accumulate value for integer segment
shl r8, 8 ; Make room for 8 bits
mov rdi, qword ptr [rbx+16] ; Fetch {x,2}
sar rdi, 4 ; Untag
or r8, rdi ; OR into accumulator
Since there are no segments of the binary left, the accumulated value will be written out to memory:
# construct integer segment from accumulator
bswap r8d ; Make accumulator big-endian
mov rdi, qword ptr [rbx-64] ; Get pointer into binary
mov qword ptr [rbx-56], 32 ; Update size of binary
mov dword ptr [rdi], r8d ; Write 32 bits
The ancient OTP R12B release introduced an optimization for efficiently appending to a binary. Let’s look at an example to see the optimization in action:
-module(append).
-export([expand/1, expand_bc/1]).

expand(Bin) when is_binary(Bin) ->
    expand(Bin, <<>>).

expand(<<B:8,T/binary>>, Acc) ->
    expand(T, <<Acc/binary,B:16>>);
expand(<<>>, Acc) ->
    Acc.

expand_bc(Bin) when is_binary(Bin) ->
    << <<B:16>> || <<B:8>> <= Bin >>.
Both append:expand/1
and append:expand_bc/1
take a binary and
double its size by expanding each byte to two bytes. For example:
1> append:expand(<<1,2,3>>).
<<0,1,0,2,0,3>>
2> append:expand_bc(<<4,5,6>>).
<<0,4,0,5,0,6>>
Both functions accept only binaries:
3> append:expand(<<1,7:4>>).
** exception error: no function clause matching append:expand(<<1,7:4>>,<<>>)
4> append:expand_bc(<<1,7:4>>).
** exception error: no function clause matching append:expand_bc(<<1,7:4>>)
Before looking at the BEAM code, let’s do some benchmarking using erlperf to find out which function is faster:
erlperf --init_runner_all 'rand:bytes(10_000).' \
'r(Bin) -> append:expand(Bin).' \
'r(Bin) -> append:expand_bc(Bin).'
Code || QPS Time Rel
r(Bin) -> append:expand_bc(Bin). 1 7936 126 us 100%
r(Bin) -> append:expand(Bin). 1 4369 229 us 55%
The expression for the --init_runner_all
option uses
rand:bytes/1 to create a binary with 10,000 random
bytes, which will be passed to both expand functions.
From the benchmark results, it can be seen that the expand_bc/1
function is
almost twice as fast.
To find out why, let’s compare the BEAM code for the two functions. Here is
the instruction that appends to the binary in expand/1
:
{bs_create_bin,{f,0},
0,3,8,
{x,1},
{list,[{atom,append}, % Append operation
1,8,nil,
{tr,{x,1},{t_bitstring,1}}, % Source/destination
{atom,all},
{atom,integer},
2,1,nil,
{tr,{x,2},{t_integer,{0,255}}},
{integer,16}]}}.
The first segment is an append
operation. The operand
{tr,{x,1},{t_bitstring,1}}
denotes both source and destination of
the operation. That is, the binary referenced by {x,1}
will be
mutated. Erlang normally does not allow mutation, but this mutation
is done under the hood in a way not observable from outside. That
makes the append operation much more efficient than it would be if the
source binary had to be copied.
For the binary comprehension in expand_bc/1
, there is a similar
BEAM instruction for appending to the binary:
{bs_create_bin,{f,0},
0,3,1,
{x,1},
{list,[{atom,private_append}, % Private append operation
1,1,nil,
{x,1},
{atom,all},
{atom,integer},
2,1,nil,
{tr,{x,2},{t_integer,{0,255}}},
{integer,16}]}}.
The main difference is that the binary comprehension uses the more
efficient private_append
operation instead of append
.
The append
operation has more overhead because it must produce the
correct result for code such as:
bins(Bin) ->
    bins(Bin, <<>>).

bins(<<H,T/binary>>, Acc) ->
    [Acc|bins(T, <<Acc/binary,H>>)];
bins(<<>>, Acc) ->
    [Acc].
Running it:
1> example:bins(<<"abcde">>).
[<<>>,<<"a">>,<<"ab">>,<<"abc">>,<<"abcd">>,<<"abcde">>]
In the expand/1 function, only the final value of the binary being appended to was needed. In bins/1, all of the intermediate values of the binary are collected in a list. For correctness, the append operation must ensure that the binary Acc is copied before H is appended to it. To be able to know when it is necessary to copy the binary, the append operation does some extra bookkeeping that does not come for free.
In OTP 26, there is a new optimization in the compiler that replaces
an append
operation with a private_append
operation whenever it is
correct and safe to do so. This optimization was implemented by Frej
Drejhammar. That is, the optimization will rewrite append:expand/2 to use private_append, but not example:bins/2.
The difference between append:expand/1
and append:expand_bc/1
is now
much smaller:
erlperf --init_runner_all 'rand:bytes(10_000).' \
'r(Bin) -> append:expand(Bin).' \
'r(Bin) -> append:expand_bc(Bin).'
Code || QPS Time Rel
r(Bin) -> append:expand_bc(Bin). 1 13164 75988 ns 100%
r(Bin) -> append:expand(Bin). 1 12419 80550 ns 94%
expand_bc/1
is still a bit faster because the compiler emits
somewhat more efficient binary matching code for it than for the
expand/1
function.
is_binary/1 guards

The expand/1 function has an is_binary/1 guard test that may seem unnecessary:
expand(Bin) when is_binary(Bin) ->
expand(Bin, <<>>).
The guard test is not necessary for correctness, because expand/2
will raise a function_clause
exception if its argument is not a
binary. However, better code will be generated for expand/2
with
the guard test.
With the guard test, the first BEAM instruction in expand/2
is:
{bs_start_match4,{atom,no_fail},2,{x,0},{x,0}}.
Without the guard test, the first BEAM instruction is:
{test,bs_start_match3,{f,3},2,[{x,0}],{x,2}}.
The bs_start_match4
instruction is more efficient because it does
not have to test that {x,0}
contains a binary.
The benchmark results show a measurable increase in execution time for expand/1 if the guard test is removed:
erlperf --init_runner_all 'rand:bytes(10_000).' \
'r(Bin) -> append:expand(Bin).' \
'r(Bin) -> append:expand_bc(Bin).'
Code || QPS Time Rel
r(Bin) -> append:expand_bc(Bin). 1 13273 75366 ns 100%
r(Bin) -> append:expand(Bin). 1 11875 84236 ns 89%
base64 module

Traditionally, up to OTP 25, the clause in the base64 module that does most of the work of encoding a binary to Base64 looked like this:
encode_binary(<<B1:8, B2:8, B3:8, Ls/bits>>, A) ->
BB = (B1 bsl 16) bor (B2 bsl 8) bor B3,
encode_binary(Ls,
<<A/bits,(b64e(BB bsr 18)):8,
(b64e((BB bsr 12) band 63)):8,
(b64e((BB bsr 6) band 63)):8,
(b64e(BB band 63)):8>>).
The reason is that matching out segments of size 8 has always been specially optimized and has been much faster than matching out a segment of size 6. That is no longer true in OTP 26. With the improvements in binary matching described in this blog post, the clause can be written in a more natural way:
encode_binary(<<B1:6, B2:6, B3:6, B4:6, Ls/bits>>, A) ->
encode_binary(Ls,
<<A/bits,
(b64e(B1)):8,
(b64e(B2)):8,
(b64e(B3)):8,
(b64e(B4)):8>>);
(This is not the exact code in OTP 26, because of additional features added later.)
The benchmark results for encoding a random binary of 1,000,000 bytes to Base64 for OTP 25 is:
erlperf --init_runner_all 'rand:bytes(1_000_000).' \
'r(Bin) -> base64:encode(Bin).'
Code || QPS Time
r(Bin) -> base64:encode(Bin). 1 61 16489 us
The benchmark results for encoding a random binary of 1,000,000 bytes to Base64 for OTP 26 is:
erlperf --init_runner_all 'rand:bytes(1_000_000).' \
'r(Bin) -> base64:encode(Bin).'
Code || QPS Time
r(Bin) -> base64:encode(Bin). 1 249 4023 us
That is, encoding is about 4 times faster.
Here are the main pull requests for the optimizations mentioned in this blog post:
- private_append optimization for binaries

You can download the readme describing all the changes here: Erlang/OTP 25 Readme. Or, as always, look at the release notes of the application you are interested in. For instance here: Erlang/OTP 25 - Erts Release Notes - Version 13.0.

This year’s highlights are:
- maps and lists modules
- maybe_expr feature
- short for erlang:float_to_list/2 and erlang:float_to_binary/2
- peer supersedes the slave module
- gen_xxx modules have got a new format_status/1 callback
- timer module has been modernized and made more efficient

maps and lists modules

Triggered by suggestions from the users, we have introduced new functions in the maps and lists modules in stdlib.
maps:groups_from_list/2,3
In short, this function takes a list of elements and groups them. The result is a map of the form #{Group1 => [Group1Elements], ..., GroupN => [GroupNElements]}.
Let us look at some examples from the shell:
> maps:groups_from_list(fun(X) -> X rem 2 end, [1,2,3]).
#{0 => [2], 1 => [1, 3]}
The provided fun calculates X rem 2 for every element X in the input list and then groups the elements in a map, with the result of X rem 2 as key and the corresponding elements as a list value for that key.
> maps:groups_from_list(fun erlang:length/1, ["ant", "buffalo", "cat", "dingo"]).
#{3 => ["ant", "cat"], 5 => ["dingo"], 7 => ["buffalo"]}
In the example above the strings in the input list are grouped into a map based on their length.
There is also a variant of groups_from_list
with an additional fun by which the values can be converted before they are put into their groups.
> maps:groups_from_list(fun(X) -> X rem 2 end, fun(X) -> X*X end, [1,2,3]).
#{0 => [4], 1 => [1, 9]}
In the example above the elements X in the list are grouped according to the X rem 2 calculation, but the values stored in the groups are the elements multiplied by themselves (X * X).
> maps:groups_from_list(fun erlang:length/1, fun lists:reverse/1, ["ant", "buffalo", "cat", "dingo"]).
#{3 => ["tna","tac"],5 => ["ognid"],7 => ["olaffub"]}
In the example above the strings from the input list are grouped according to their length and they are reversed before they are stored in the groups.
For more details see the maps:groups_from_list/2
documentation.
lists:enumerate/1,2
Takes a list of elements and returns a new list where each element has been associated with its position in the original list. Returns a new list with tuples of the form {I, H}
where I
is the position of H
in the original list. The enumeration starts with 1 and increases by 1 in each step.
Example:
> lists:enumerate([a,b,c]).
[{1,a},{2,b},{3,c}]
There is also an enumerate/2 function which can be used to set the initial number to something other than 1. See the example below:
> lists:enumerate(10, [a,b,c]).
[{10,a},{11,b},{12,c}]
For more details see the lists:enumerate/1
documentation.
lists:uniq/1,2
Removes duplicates from a list while preserving the order of the elements. The first occurrence of each element is kept.
We already have lists:usort
which also removes duplicates but returns a sorted list.
Examples:
> lists:uniq([3,3,1,2,1,2,3]).
[3,1,2]
> lists:uniq([a, a, 1, b, 2, a, 3]).
[a, 1, b, 2, 3]
lists:uniq/2 lets the user provide a fun that computes the value by which elements are compared for equality. In the example below the provided fun returns the first element of each tuple, so two tuples are considered duplicates when their first elements are equal.
Examples:
> lists:uniq(fun({X, _}) -> X end, [{b, 2}, {a, 1}, {c, 3}, {a, 2}]).
[{b, 2}, {a, 1}, {c, 3}]
For more details see the lists:uniq/1
documentation.
maybe_expr feature

Selectable features are a new mechanism and concept whereby a new, potentially incompatible feature (language or runtime) can be introduced and tested without causing trouble for those that don’t use it.
When it comes to language features the intention is that they can be activated per module with no impact on modules where they are not activated.
Let’s use the new maybe_expr
feature as an example.
In module my_experiment
the feature is activated and used like this:
-module(my_experiment).
-export([foo/1]).
%% Enable the feature maybe_expr in this module only
%% Makes maybe a keyword which might be incompatible
%% in modules using maybe as a function name or an atom
-feature(maybe_expr,enable).
foo(Foo) ->
maybe
{ok, X} ?= f(Foo),
[H|T] ?= g([1,2,3]),
...
else
{error, Y} ->
{ok, "default"};
{ok, _Term} ->
{error, "unexpected wrapper"}
end.
The compiler will note that the feature maybe_expr
is enabled and will handle the maybe construct correctly. In the generated .beam
file it will also be noted that
the module has enabled the feature.
When starting an Erlang node, the specific feature (or all features) must be enabled, otherwise the .beam file using the feature will not be allowed to load.
erl -enable-feature maybe_expr
Or
erl -enable-feature all
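The feature can also be enabled at compile time; a small sketch, assuming erlc accepts the same flag as erl:

erlc -enable-feature maybe_expr my_experiment.erl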
For more details see the feature section in the Erlang Reference Manual.
maybe_expr feature EEP-49

EEP-49, “Value-Based Error Handling Mechanisms”, was suggested by Fred Hebert already in 2018, and it has now finally been implemented as the first feature within the new feature concept.
The maybe ... end
construct is similar to begin ... end
in that it is used to group multiple distinct expressions as a
single block. But there is one important difference in that the
maybe
block does not export its variables while begin
does
export its variables.
A new type of expressions (denoted MatchOrReturnExprs
) are introduced, which are only valid within a
maybe ... end
expression:
maybe
Exprs | MatchOrReturnExprs
end
MatchOrReturnExprs
are defined as having the following form:
Pattern ?= Expr
This definition means that MatchOrReturnExprs
are only allowed at the
top-level of maybe ... end
expressions.
The ?=
operator takes the value returned by Expr
and pattern matches
it against Pattern
.
If the pattern matches, all variables from Pattern
are bound in the local
environment, and the expression is equivalent to a successful Pattern = Expr
call. If the value does not match, the maybe ... end
expression returns the
failed expression directly.
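As a minimal sketch, where lookup/1 is a hypothetical function returning {ok, Value} or {error, Reason}:

maybe
    {ok, V} ?= lookup(Key)
end
%% If lookup/1 returns {ok, Value}, then V is bound and the block returns {ok, Value}.
%% If lookup/1 returns {error, Reason}, the block returns {error, Reason} directly.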
A special case exists in which we extend maybe ... end
into the following form:
maybe
Exprs | MatchOrReturnExprs
else
Pattern -> Exprs;
...
Pattern -> Exprs
end
This form exists to capture non-matching expressions in a MatchOrReturnExprs
to handle failed matches rather than returning their value. In such a case, an
unhandled failed match will raise an else_clause
error, otherwise identical to
a case_clause
error.
This extended form is useful to properly identify and handle successful and unsuccessful matches within the same construct without risking confusion between the happy and unhappy paths.
Given the structure described here, the final expression may look like:
maybe
Foo = bar(), % normal exprs still allowed
{ok, X} ?= f(Foo),
[H|T] ?= g([1,2,3]),
...
else
{error, Y} ->
{ok, "default"};
{ok, _Term} ->
{error, "unexpected wrapper"}
end
For more details see the maybe section in the Erlang Reference Manual.
With the maybe
construct it is possible to reduce deeply nested conditional expressions and make messy patterns found in the wild unnecessary. It also provides a better separation of concerns when implementing functions.
One common pattern that can be seen in Erlang is deep nesting of case
... end
expressions, to check complex conditionals.
Take the following code taken from Mnesia, for example:
commit_write(OpaqueData) ->
B = OpaqueData,
case disk_log:sync(B#backup.file_desc) of
ok ->
case disk_log:close(B#backup.file_desc) of
ok ->
case file:rename(B#backup.tmp_file, B#backup.file) of
ok ->
{ok, B#backup.file};
{error, Reason} ->
{error, Reason}
end;
{error, Reason} ->
{error, Reason}
end;
{error, Reason} ->
{error, Reason}
end.
The code is nested to the extent that shorter aliases must be introduced
for variables (OpaqueData
renamed to B
), and half of the code just
transparently returns the exact values each function was given.
By comparison, the same code could be written as follows with the new construct:
commit_write(OpaqueData) ->
maybe
ok ?= disk_log:sync(OpaqueData#backup.file_desc),
ok ?= disk_log:close(OpaqueData#backup.file_desc),
ok ?= file:rename(OpaqueData#backup.tmp_file, OpaqueData#backup.file),
{ok, OpaqueData#backup.file}
end.
Or, to protect against disk_log calls returning something other than ok | {error, Reason}, the following form could be used:
commit_write(OpaqueData) ->
maybe
ok ?= disk_log:sync(OpaqueData#backup.file_desc),
ok ?= disk_log:close(OpaqueData#backup.file_desc),
ok ?= file:rename(OpaqueData#backup.tmp_file, OpaqueData#backup.file),
{ok, OpaqueData#backup.file}
else
{error, Reason} -> {error, Reason}
end.
The semantics of these calls are identical, except that it is now much easier to focus on the flow of individual operations and either success or error paths.
Dialyzer now supports the missing_return and extra_return options to raise warnings when specifications differ from the inferred types. These are similar to, but not quite as verbose as, overspecs and underspecs.
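The new warnings can be requested like other Dialyzer warning options; a sketch, where my_module.erl is a placeholder for the module to analyze:

dialyzer -Wmissing_return -Wextra_return --src my_module.erl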
Dialyzer now better understands the types for min/2
, max/2
, and erlang:raise/3
. Because of that, Dialyzer can potentially generate new warnings. In particular, functions that use erlang:raise/3
could now need a spec with a no_return()
return type to avoid an unwanted warning.
The JIT compiler introduced in Erlang/OTP 24 improved the performance for Erlang applications.
Erlang/OTP 25 introduces some major improvements of the JIT:
The JIT now supports the AArch64 (ARM64) architecture, used by (for example) Apple Silicon Macs and newer Raspberry Pi devices.
Better code generated based on types provided by the Erlang compiler.
Better support for perf
and gdb
with line numbers for Erlang code.
How much speedup one can expect from the JIT compared to the interpreter varies from nothing to up to four times.
To get some more concrete figures we have run three different benchmarks with the JIT disabled and enabled on a MacBook Pro (M1 processor; released in 2020).
First we ran the EStone benchmark. Without the JIT, 691,962 EStones were achieved and with the JIT 1,597,949 EStones. That is, more than twice as many EStones with the JIT.
Next we tried running Dialyzer to build a small PLT:
dialyzer --build_plt --apps erts kernel stdlib
With the JIT, the time for building the PLT was reduced from 18.38 seconds down to 9.64 seconds. That is, almost but not quite twice as fast.
Finally, we ran a benchmark for the base64 module included in this Github issue.
With the JIT:
== Testing with 1 MB ==
fun base64:encode/1: 1000 iterations in 11846 ms: 84 it/sec
fun base64:decode/1: 1000 iterations in 14617 ms: 68 it/sec
Without the JIT:
== Testing with 1 MB ==
fun base64:encode/1: 1000 iterations in 25938 ms: 38 it/sec
fun base64:decode/1: 1000 iterations in 20603 ms: 48 it/sec
Encoding with the JIT is almost two and half times as fast, while the decoding time with the JIT is about 75 percent of the decoding time without the JIT.
The JIT translates one BEAM instruction at a time to native code without any knowledge of previous instructions. For example, the native code for the + operator must work for any operands: small integers that fit in a 64-bit word, large integers, floats, and non-numbers that should result in raising an exception.
In Erlang/OTP 25, the compiler embeds type information in the BEAM file to help the JIT generate better native code without unnecessary type tests.
For more details, see the blog post Type-Based Optimizations in the JIT.
perf and gdb
It is now possible to profile Erlang systems with perf and get a mapping from the JIT code to the corresponding Erlang code. This will make it easy to find bottlenecks in the code.
The same goes for gdb
which also can show which line of Erlang code a specific address in the JIT code corresponds to.
Perf is a Linux command-line tool for lightweight CPU profiling; it checks CPU performance counters, trace points, uprobes, and kprobes, monitors program events, and creates reports.
An Erlang node running under perf
can be started like this:
perf record --call-graph fp -- erl +JPperf true
The result from perf could then be viewed like this:
perf report
It is also possible to attach perf
to an already running Erlang node like this:
# start Erlang and note the OS pid of the node
erl +JPperf true

Assume the OS pid for the node is 4711.
You can then attach perf
to the node like this:
sudo perf record --call-graph fp -p 4711
Below is an example where perf
is run to analyze dialyzer
building a PLT like this:
ERL_FLAGS="+JPperf true +S 1" perf record --call-graph=fp \
dialyzer --build_plt -Wunknown --apps compiler crypto erts kernel stdlib \
syntax_tools asn1 edoc et ftp inets mnesia observer public_key \
sasl runtime_tools snmp ssl tftp wx xmerl tools
The above code is run using +S 1 to make the perf output easier to understand.
If you then run perf report -f --no-children you may get something similar to this:

[Screenshot of a perf report; the hottest functions, such as eq, are listed at the top]
Frame pointers are enabled when the +JPperf true
option is passed, so you can
use perf record --call-graph=fp
to get more context.
Any Erlang function in the report is prefixed with a $
and all C functions have
their normal names. Any Erlang function that has the prefix $global::
refers
to a global shared fragment.
So in the above, we can see that we spend the most time doing eq
, i.e. comparing two terms.
By expanding it and looking at its parents we can see that it is the function
erl_types:t_is_equal/2
that contributes the most to this value. Go and have a look
at it in the source code to see if you can figure out why so much time is spent there.
After eq we see the function erl_types:t_has_var/1, in which we spend almost 5% of the entire execution time. A bit further down you can see copy_struct_x
which is the function used to copy terms. If we expand it to view the parents
we find that it is mostly ets:lookup_element/3
that contributes to this time
via the Erlang function dialyzer_plt:ets_table_lookup/2
.
perf tips and tricks

You can do a lot of neat things with perf. Below is a list of some of the options we have found useful:

- perf report --no-children: Do not include the accumulation of all children in a call.
- perf report --call-graph callee: Show the callee rather than the caller when expanding a function call.
- perf archive: Create an archive with all the artifacts needed to inspect the data on another host. In early versions of perf this command does not work; instead you can use this bash script.
- perf report gives “failed to process sample” and/or “failed to process type: 68”: This probably means that you are running a buggy version of perf. We have seen this when running Ubuntu 18.04 with kernel version 4. If you update to Ubuntu 20.04 or use Ubuntu 18.04 with kernel version 5 the problem should go away.
In Erlang/OTP 25, improved error information is also given when the creation of a binary using the bit syntax fails.
Consider this function:
bin(A, B, C, D) ->
<<A/float,B:4/binary,C:16,D/binary>>.
If we call this function with incorrect arguments in past releases we will just be told that something was wrong and the line number:
1> t:bin(<<"abc">>, 2.0, 42, <<1:7>>).
** exception error: bad argument
in function t:bin/4 (t.erl, line 5)
But which part of line 5? Imagine that t:bin/4
was called from deep
within an application and we had no idea what the actual values for
the arguments were. It could take a while to figure out exactly what
went wrong.
Erlang/OTP 25 gives us more information:
1> c(t).
{ok,t}
2> t:bin(<<"abc">>, 2.0, 42, <<1:7>>).
** exception error: construction of binary failed
in function t:bin/4 (t.erl, line 5)
*** segment 1 of type 'float': expected a float or an integer but got: <<"abc">>
Note that the module must be compiled by the compiler in Erlang/OTP 25 in order to get the more informative error message. The old-style message will be shown if the module was compiled by a previous release.
Here the message tells us that the first segment in the construction was given
the binary <<"abc">>
instead of a float or an integer, which is the expected
type for a float
segment.
It seems that we switched the first and second arguments for bin/4
,
so we try again:
3> t:bin(2.0, <<"abc">>, 42, <<1:7>>).
** exception error: construction of binary failed
in function t:bin/4 (t.erl, line 5)
*** segment 2 of type 'binary': the value <<"abc">> is shorter than the size of the segment
It seems that there was more than one incorrect argument. In this case, the message tells us that the given binary is shorter than the size of the segment.
Fixing that:
4> t:bin(2.0, <<"abcd">>, 42, <<1:7>>).
** exception error: construction of binary failed
in function t:bin/4 (t.erl, line 5)
*** segment 4 of type 'binary': the size of the value <<1:7>> is not a multiple of the unit for the segment
A binary
segment has a default unit of 8. Therefore, passing a bit string of
size 7 will fail.
Finally:
5> t:bin(2.0, <<"abcd">>, 42, <<1:8>>).
<<64,0,0,0,0,0,0,0,97,98,99,100,0,42,1>>
Another improvement is in the exceptions raised when matching a record fails.
Consider this record and function:
-record(rec, {count}).
rec_add(R) ->
R#rec{count = R#rec.count + 1}.
In past releases, failure to match a record or retrieve an element from a record would result in the following exception:
1> t:rec_add({wrong,0}).
** exception error: {badrecord,rec}
in function t:rec_add/1 (t.erl, line 8)
Before Erlang/OTP R15, which introduced line numbers in exceptions, knowing which record was expected could be useful if the error occurred in a large function.
Nowadays, unless several different records are accessed on the same line, the line number makes it obvious which record was expected.
Therefore, in Erlang/OTP 25 the badrecord
exception has been changed
to show the actual incorrect value:
2> t:rec_add({wrong,0}).
** exception error: {badrecord,{wrong,0}}
in function t:rec_add/1 (t.erl, line 8)
The new badrecord
exceptions will show up for code that has been compiled
with Erlang/OTP 25.
Previously, shell scripts (e.g., erl and start) and the RELEASES file for an Erlang installation depended on a hard-coded absolute path to the installation’s root directory. This made it cumbersome to move an installation to a different directory, which can be problematic for platforms such as Android (#2879) where the installation directory is unknown at compile time. This is fixed by:
- Changing the shell scripts so that they can dynamically find the ROOTDIR. The dynamically found ROOTDIR is selected if it differs from the hard-coded ROOTDIR and seems to point to a valid Erlang installation. The dyn_erl program has been changed so that it can return its absolute canonicalized path when given the --realpath argument (dyn_erl gets its absolute canonicalized path from the realpath POSIX function). The dyn_erl --realpath functionality is used by the scripts to get the root dir dynamically.
- Changing the release_handler module, which reads and writes the RELEASES file, so that it prepends code:root_dir() whenever it encounters relative paths. This is necessary since the current working directory can be changed so that it is something different from code:root_dir().
It has long been possible to optimize an ETS table for write concurrency like this:
ets:new(my_table, [{write_concurrency, true}]).
Now we also introduce adaptive support for write concurrency which can be configured like this:
ets:new(my_table, [{write_concurrency, auto}]).
This option makes tables automatically change the number of locks that are used at run-time depending on how much concurrency is detected. When you enable automatic write concurrency, decentralized_counters are also activated for even more scalable ETS tables. Use this option when you know that a lot of processes will be accessing an ETS table on systems with many cores.
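As a sketch, the option can be combined with the existing read_concurrency option for a table with both many concurrent readers and many concurrent writers:

ets:new(my_table, [public, {write_concurrency, auto}, {read_concurrency, true}]).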
For more details you can read PR 5208 that introduced the change and the blog post about decentralized counters.
short for erlang:float_to_list/2 and erlang:float_to_binary/2

A new option called short has been added to the functions erlang:float_to_list/2 and erlang:float_to_binary/2. This option creates the shortest correctly rounded string representation of the given float that can be converted back to the same float again.
If the option short is specified, the float is formatted with the smallest number of digits that still guarantees that F =:= list_to_float(float_to_list(F, [short])).
When the float is inside the range (-2⁵³, 2⁵³), the notation that yields the smallest number of characters is used (scientific notation or normal decimal notation). Floats outside the range (-2⁵³, 2⁵³) are always formatted using scientific notation to avoid confusing results when doing arithmetic operations.
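For example, in the shell (the default, non-short output shown here uses the 20-digit scientific format, so the exact digits may vary):

1> float_to_list(0.1).
"1.00000000000000005551e-01"
2> float_to_list(0.1, [short]).
"0.1"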
The implementation is contributed by Thomas Depierre and uses the Ryū algorithm.
Ryū is a new algorithm to convert binary floating point numbers to their decimal representations using only fixed-size integer operations. It is simpler and approximately three times faster than the previously fastest implementation. https://github.com/ulfjack/ryu
peer supersedes the slave module

The peer module provides functions for starting linked Erlang nodes. The Erlang node spawning new “peer” nodes is called origin, and the newly started nodes are peers.
A peer node automatically terminates when it loses the control connection to the origin. This connection could be an Erlang distribution connection, or an alternative one over TCP or standard I/O. The alternative connection provides a way to execute remote procedure calls even when Erlang Distribution is not available, which makes it possible to test the distribution itself.
Peer node terminal input/output is relayed through the origin. If a standard I/O alternative connection is requested, console output also goes via the origin, allowing debugging of node startup and boot script execution (see -init_debug). File I/O is not redirected, contrary to slave
behavior.
The peer node can start on the same or a different host (via ssh) or in a separate container (for example Docker). When the peer starts on the same host as the origin, it inherits the current directory and environment variables from the origin.
This module is designed to facilitate multi-node testing with Common Test. Use the ?CT_PEER() macro to start a linked peer node according to Common Test conventions (crash dumps written to a specific location, node name prefixed with the module name, calling function, and origin OS process ID). Use random_name/1 to create sufficiently unique node names if you need more control.
A peer node started without an alternative connection behaves similarly to slave(3).
gen_XXX modules have got a new format_status/1 callback

The format_status/2 callback for gen_server, gen_statem and gen_event has been deprecated in favor of the new format_status/1 callback.

The new callback adds the possibility to limit and change many more things than just the state.

The purpose of both the old and the new format_status callbacks is to let the user filter away sensitive information, and possibly data of huge volume, from crash reports.
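As a sketch of the new callback for a gen_server, assuming the server state is a map containing a password key that should be scrubbed from crash reports:

format_status(Status) ->
    maps:map(fun(state, State) -> State#{password => removed};
                (_Key, Value) -> Value
             end,
             Status).

The Status argument is a map, and the callback returns a possibly modified copy of it.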
timer module has been modernized and made more efficient

The timer module has been modernized and made more efficient, which makes the timer server less susceptible to being overloaded. The timer:sleep/1 function now accepts an arbitrarily large integer.
Some applications in OTP like SSL/TLS and SSH need cryptography to work. That is provided by the OTP application crypto, which interfaces Erlang to an external cryptolib in C using NIFs. The main example of such an external cryptolib is OpenSSL.
The OpenSSL cryptolib exists in many versions. OTP/crypto supports 0.9.8c and later, although only 1.1.1 is still maintained by OpenSSL.
OpenSSL has released its version 3.0 series, which is their future platform, totally re-built with a new API. The APIs of previous versions (1.1.1 and older) are partly deprecated, although still available in 3.0. The support of 1.1.1 will also end at some point in the future.
Since it is vital to get security patches in the cryptolib, and in the future only the 3.0 API might be available, OTP/crypto now, from OTP-25.0, interfaces OpenSSL 3.0 using the new 3.0 API. A few functions from old APIs are still used, but they will be replaced as soon as possible.
You as a user will hopefully not notice any difference: if you have OpenSSL 1.1.1 (or older, which is not recommended) and build OTP, that one will be used as previously. If you have any OpenSSL 3.0 version installed, that one will be used without the need to do anything special except for normal handling of dynamic loading paths in the OS.
With the new functions public_key:cacerts_load/0,1
and public_key:cacerts_get/0
the CA certificates can be fetched from the standard place of the OS (or from a file).
They will then be cached in decoded form by use of persistent_term
which makes them available in an efficient way for the ssl
and httpc
modules. The intention is to make it unnecessary to depend on, for example, certifi in many packages.
On Windows and macOS the certificate store is not an ordinary file, so the information is fetched via an API using a NIF (Windows) or with an external program (macOS).
Example with ssl
%% makes the certificates available without copying
CaCerts = public_key:cacerts_get(),
%% use the certificates when establishing a connection
{ok,Socket} = ssl:connect("erlang.org",443,[{cacerts,CaCerts}, {verify,verify_peer}]),
...
We also plan to update the http client (httpc
) to use this soon.
A new custom-designed pseudo-random number generator, rand:mwc59, has been implemented. It is probably the fastest possible generator with good quality that can be written in Erlang. To achieve this it barely avoids bignums and allocating heap data, and uses only a minimal number of fast operations. Under the “right” circumstances, a number that takes 60 ns to generate with the default generator can be generated in 4 ns with rand:mwc59.

It is intended for applications in dire need of speed in PRNG numbers, but not of any of the comfort features that rand otherwise offers; that is, for cases where the features of the rand module are overkill.
This blog post will dive into new additions to said module, how the Just-In-Time compiler optimizes them, and known tricks, and it tries to compare these apples and potatoes.
The pseudo-random number generators implemented in the rand module offer many useful features such as repeatable sequences, non-biased range generation, any size range, non-overlapping sequences, generating floats, normal distribution floats, etc.
Many of those features are implemented through
a plug-in framework, with a performance cost.
The different algorithms offered by the rand
module are selected
to have excellent statistical quality and to perform well
in serious PRNG tests (see section PRNG tests).
Most of these algorithms are designed for machines with 64-bit arithmetic (unsigned), but in Erlang such integers become bignums and almost an order of magnitude slower to handle than immediate integers.
Erlang terms in the 64-bit VM are tagged 64-bit words. The tag for an immediate integer is 4 bits, leaving 60 bits for the signed integer value. The largest positive immediate integer value is therefore 2⁵⁹ - 1.
Many algorithms work on unsigned integers, so we have 59 bits useful for that. It would be theoretically possible to pretend 60 bits unsigned by using split code paths for negative and positive values, but that is extremely impractical.
We decided on 58-bit unsigned integers in this context, since then we can, for example, add two integers and check for overflow, or simply mask back to 58 bits, without the intermediate result becoming a bignum. Working with 59-bit integers would require checking for overflow before even doing an addition, so the code that avoids bignums would eat up much of the speed gained from avoiding them. So 58-bit integers it is!
The algorithms that perform well in Erlang are the ones
that have been redesigned to work on 58-bit integers.
But still, when executed in Erlang, they are far from
as fast as their C origins. Achieving good PRNG quality
costs much more in Erlang than in C. In the section
Measurement results we see that the algorithm exsp
that boasts sub-ns speed in C needs 17 ns in Erlang.
32-bit Erlang is a sad story in this regard. The bignum limit on such an Erlang system is so low that calculations would have to use 26-bit integers, so a PRNG designed to not use bignums would have to be so small in period and size that it becomes too bad to be useful.
The known trick erlang:phash2(erlang:unique_integer(), Range) is still fairly fast, but all rand generators work exactly the same as on a 64-bit system, hence operate on bignums and are much slower.
If your application needs a “random” integer for a non-critical purpose such as selecting a worker, choosing a route, etc., and performance is much more important than repeatability and statistical quality, what are then the options?

Reasoning and measurement results are in the following sections, but, in short:

- Using rand anyway is a reasonable start.
- erlang:phash2(erlang:unique_integer(), Range) has its use cases.
- For raw speed there is the new mwc59.

Use rand anyway

Is rand slow, really? Well, perhaps not considering what it does.
In the Measurement results at the end of this text,
it shows that generating a good quality random number using
the rand
module’s default algorithm is done in 45 ns.
Generating a number as fast as possible (rand:mwc59/1
) can be done
in less than 4 ns, but that algorithm has problems with the
statistical quality. See section PRNG tests and Implementing a PRNG.
Using a good quality algorithm instead (rand:exsp_next/1
) takes 16 ns,
if you can store the generator’s state in a loop variable.
If you can not store the generator state in a loop variable there will be more overhead, see section Storing the state.
Now, if you also need a number in an awkward range, as in not much smaller than the generator’s size, you might have to implement a reject-and-resample loop, or even concatenate numbers.
The overhead of code that has to implement this much of the features
that the rand
module already offers will easily approach
its 26 ns overhead, so often there is no point in
re-implementing this wheel…
There has been a discussion thread on Erlang Forums: Looking for a faster RNG. Triggered by this, Andrew Bennett (aka potatosalad) wrote an experimental BIF.
The suggested BIF erlang:random_integer(Range)
offered
no repeatability, generator state per scheduler, guaranteed
sequence separation between schedulers, and high generator
quality. All this thanks to using one of the good generators from
the rand
module, but now written in its original
programming language, C, in the BIF.
The performance was a bit slower than the mwc59
generator state update,
but with top of the line quality. See section Measurement results.
Questions arose regarding the maintenance burden, what more to implement, etc.
For example we probably also would need erlang:random_integer/0
,
erlang:random_float/0
, and some system info
to get the generator bit size…
A BIF could achieve good performance on a 32-bit system if it were to return a 27-bit integer there, which became another open question. Should a BIF generator be platform-independent with respect to generated numbers, or with respect to performance?
potatosalad also wrote a NIF, since we (The Erlang/OTP team) suggested that it could have good enough performance.
Measurements, however, showed that the overhead is significantly larger
than for a BIF. Although the NIF used the same trick as the BIF to store
the state in thread specific data it ended up with the same
performance as erlang:phash2(erlang:unique_integer(), Range)
,
which is about 2 to 3 times slower than the BIF.
One speed improvement we tried was to have the NIF generate a list of numbers, and use that list as a cache in Erlang. The performance with such a cache was as fast as the BIF, but it introduced problems such as having to decide on a cache size, the application having to keep the cache on the heap, and, when generating in a number range, the application having to generate numbers in the same range for the whole cache.
A NIF could, like a BIF, also achieve good performance on a 32-bit system, with the same open question: platform-independent numbers or performance?
One suggested trick is to use os:system_time(microseconds) to get a number. The trick has some peculiarities.
See section Measurement results for the performance for this “solution”.
The best combination would most certainly be
erlang:phash2(erlang:unique_integer(), Range)
or
erlang:phash2(erlang:unique_integer())
which is slightly faster.
erlang:unique_integer/0 is designed to return a unique integer with a very small overhead. It is hard to find a better candidate for an integer to hash.
erlang:phash2/1,2
is the current generic hash function for Erlang terms.
It has a default return size well suited for 32-bit Erlang systems,
and it has a Range
argument. The range capping is done with a simple rem in C (%), which is much faster than in Erlang. This works well only for ranges much smaller than 32 bits; if the range is larger than 16 bits, the bias from the range capping starts to be noticeable.
Alas this solution does not perform well in PRNG tests.
See section Measurement results for the performance for this solution.
To be fast, the implementation of a PRNG algorithm cannot execute many operations. The operations have to be on immediate values (not bignums), and the return value from a function has to be an immediate value (a compound term would burden the garbage collector). This seriously limits how powerful an algorithm can be used.
We wrote one and named it mwc59
because it has a 59-bit
state, and the most thorough scrambling function returns
a 59-bit value. There is also a faster, intermediate scrambling
function, that returns a 32-bit value, which is the “digit” size
of the MWC generator. It is also possible to directly
use the low 16 bits of the state without scrambling.
See section Implementing a PRNG for how this generator
was designed and why.
As another gap filler between really fast with low quality,
and full featured, an internal function in rand
has been exported:
rand:exsp_next/1
. This function implements Xoroshiro116+ that exists
within the rand
plug-in framework as algorithm exsp
.
It has been exported so it is possible to get good quality without
the plug-in framework overhead, for applications that do not
need any framework features.
See section Measurement results for speed comparisons.
There are many different aspects of a PRNG’s quality. Here are some.
erlang:phash2(erlang:unique_integer(), Range)
has, conceptually,
an infinite period, since the time it will take for it to repeat
is assumed to be longer than the Erlang node will survive.
For the new fast mwc59 generator the period is about 2⁵⁹. For the regular ones in rand it is at least 2¹¹⁶ - 1, which is a huge difference. It might be possible to consume 2⁵⁹ numbers during an Erlang node’s lifetime, but not 2¹¹⁶.

There are also generators in rand with a period of 2⁹²⁸ - 1, which might seem ridiculously long, but this facilitates generating very many parallel sub-sequences guaranteed to not overlap.
In, for example, a physical simulation it is common practice to only use a fraction of the generator’s period, both regarding how many numbers you generate and how large a range you generate in, or it may affect the simulation, for example in that specific numbers do not reoccur. If you have pulled 3 aces from a deck you know there is only one left.
Some applications may be sensitive to the generator period, while others are not, and this needs to be considered.
The value size of the new fast mwc59
generators is 59, 32, or 16 bits,
depending on the scrambling function that is used.
Most of the regular generators in the rand module have got a value size of 58 bits.
If you need numbers in a power of 2 range then you can simply mask out the low bits:
V = X band ((1 bsl RangeBits) - 1).
Or shift down the required number of bits:
V = X bsr (GeneratorBits - RangeBits).
Which one to use depends on whether the generator is known to have weak high or low bits.
If the range you need is not a power of 2, but still
much smaller than the generator’s size you can use rem
:
V = X rem Range.
The rule of thumb is that Range
should be less than
the square root of the generator’s size. This is much slower
than bit-wise operations, and the operation propagates low bits,
which can be a problem if the generator is known to have weak low bits.
Another way is to use truncated multiplication:
V = (X * Range) bsr GeneratorBits
The rule of thumb here is that Range should be less than the square root of 2^GeneratorBits, that is, 2^(GeneratorBits/2). Also, X * Range should not create a bignum, so not more than 59 bits.
This method propagates high bits, which can be a problem
if the generator is known to have weak high bits.
Other tricks are possible, for example if you need numbers
in the range 0 through 999 you may use bit-wise operations to get
a number 0 through 1023, and if too high re-try, which actually
may be faster on average than using rem
. This method is also
completely free from bias in the generated numbers. The previous methods rely on rules of thumb to keep the bias so small that it becomes hard to notice.
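A minimal sketch of such a reject-and-resample loop for the range 0 through 999; rand_0_999/1 is a hypothetical helper built on the mwc59 generator and scrambler presented in this post:

rand_0_999(T0) ->
    T1 = rand:mwc59(T0),
    case rand:mwc59_value(T1) band 1023 of
        V when V < 1000 -> {V, T1};
        _ -> rand_0_999(T1) %% re-try on 1000..1023
    end.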
The spectral score of a generator measures how unrelated the numbers in a sequence from the generator are. A sequence of N numbers is interpreted as an N-dimensional vector, and the spectral score for dimension N is a measure of how evenly these vectors are distributed in an N-dimensional (hyper)cube.
os:system_time(microseconds)
simply increments so it should have
a lousy spectral score.
erlang:phash2(erlang:unique_integer(), Range) has got an unknown spectral score, since that is not part of the math behind a hash function. But a hash function is designed to distribute the hash value well for any input, so one can hope that the statistical distribution of the numbers is decent and “random” anyway. Unfortunately, this does not seem to hold in PRNG tests.
All regular PRNGs in the rand module have got good spectral scores. The new mwc59 generator mostly has as well, but not in 2 and 3 dimensions, due to its unbalanced design and power-of-2 multiplier. Scramblers are used to compensate for those flaws.
There are test frameworks that test the statistical properties of PRNGs, such as the TestU01 framework, or PractRand.
The regular generators in the rand
module perform well
in such tests, and pass thorough test suites.
Although the mwc59 generator passes 2 TB in PractRand and TestU01 with its low 16 bits without any scrambling, its statistical problems show when the test parameters are tweaked just a little. To perform well in more cases, and with more bits, scrambling functions are needed.
Still, the small state space and the flaws of the base generator
makes it hard to pass all tests with flying colors.
With the thorough double Xorshift scrambler it gets very good, though.
erlang:phash2(N, Range) over an incrementing sequence does not do well in TestU01, which suggests that a hash function has got different design criteria from PRNGs.
However, these kind of tests may be completely irrelevant for your application.
For some applications, a generated number may have to be even cryptographically unpredictable, while for others there are no strict requirements.
There is a grey zone for “non-critical” applications where, for example, a rogue party may be able to affect input data, and if it knows the PRNG sequence it can steer all data to one hash table slot, overload one particular worker process, or something similar, and in this way attack an application. And an application that starts out as “non-critical” may one day silently have become business critical…
This is an aspect that needs to be considered.
If the state of a PRNG can be kept in a loop variable, the cost can be almost nothing. But as soon as it has to be stored in a heap variable it will cost performance due to heap data allocation, term building, and garbage collection.
In the section Measurement results we see that the fastest PRNG can generate a new state that is also the generated integer in just under 4 ns. Unfortunately, just to return both the value and the new state in a 2-tuple adds roughly 10 ns.
The application state in which the PRNG state must be stored is often more complex, so the cost for updating it will probably be even larger.
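As a sketch, the cheap loop-variable case looks like this, where do_something/1 is a hypothetical consumer of the generated values:

generate(_T, 0) -> ok;
generate(T0, N) ->
    T1 = rand:mwc59(T0),
    do_something(rand:mwc59_value32(T1)),
    generate(T1, N - 1).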
Seeding is related to predictability. If you can guess the seed you know the generator output.
The seed is generator dependent and how to create a good seed usually takes much longer than generating a number. Sometimes the seed and its predictability is so unimportant that a constant can be used. If a generator instance generates just a few numbers per seeding, then seeding can be the harder problem.
erlang:phash2(erlang:unique_integer(), Range)
is pre-seeded,
or rather cannot be seeded, so it has no seeding cost, but it can on the other hand be rather predictable, if it is possible to estimate how many unique integers have been generated since node start.
The default seeding in the rand
module uses a combination
of a hash value of the node name, the system time,
and erlang:unique_integer()
, to create a seed,
which is hopefully sufficiently unpredictable.
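For the mwc59 generator there is a corresponding seeding function; a minimal sketch:

T0 = rand:mwc59_seed(), %% create a state using the default seeding
T1 = rand:mwc59(T0).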
The suggested NIF and BIF solutions would also need a way to create a good enough seed, where “good enough” is hard to put a number on.
The speed of the newly implemented mwc59
generator
is partly thanks to the recent type-based optimizations in the compiler
and the Just-In-Time compiling BEAM code loader.
This is the Erlang code for the mwc59
generator:
mwc59(CX) ->
C = CX band ((1 bsl 32)-1),
X = CX bsr 32,
16#7fa6502 * X + C.
The code compiles to this Erlang BEAM assembler, (erlc -S rand.erl
),
using the no_type_opt
flag to disable type-based optimizations:
{gc_bif,'bsr',{f,0},1,[{x,0},{integer,32}],{x,1}}.
{gc_bif,'band',{f,0},2,[{x,0},{integer,4294967295}],{x,0}}.
{gc_bif,'*',{f,0},2,[{x,0},{integer,133850370}],{x,0}}.
{gc_bif,'+',{f,0},2,[{x,0},{x,1}],{x,0}}.
When loaded by the JIT (x86) (erl +JDdump true
)
the machine code becomes:
# i_bsr_ssjd
mov rsi, qword ptr [rbx]
# is the operand small?
mov edi, esi
and edi, 15
cmp edi, 15
short jne L2271
Above was a test if {x,0}
is a small integer and if not
the fallback at L2271
is called to handle any term.
Then follows the machine code for right shift, Erlang bsr 32
,
x86 sar rax, 32
, and a skip over the fallback code:
mov rax, rsi
sar rax, 32
or rax, 15
short jmp L2272
L2271:
mov eax, 527
call 140439031217336
L2272:
mov qword ptr [rbx+8], rax
# line_I
Here follows band
with similar test and fallback code:
# i_band_ssjd
mov rsi, qword ptr [rbx]
mov rax, 68719476735
# is the operand small?
mov edi, esi
and edi, 15
cmp edi, 15
short jne L2273
and rax, rsi
short jmp L2274
L2273:
call 140439031216768
L2274:
mov qword ptr [rbx], rax
Below comes *
with test, fallback code, and overflow check:
# line_I
# i_times_jssd
mov rsi, qword ptr [rbx]
mov edx, 2141605935
# is the operand small?
mov edi, esi
and edi, 15
cmp edi, 15
short jne L2276
# mul with overflow check, imm RHS
mov rax, rsi
mov rcx, 133850370
and rax, -16
imul rax, rcx
short jo L2276
or rax, 15
short jmp L2275
L2276:
call 140439031220000
L2275:
mov qword ptr [rbx], rax
The following is +
with tests, fallback code, and overflow check:
# i_plus_ssjd
mov rsi, qword ptr [rbx]
mov rdx, qword ptr [rbx+8]
# are both operands small?
mov eax, esi
and eax, edx
and al, 15
cmp al, 15
short jne L2278
# add with overflow check
mov rax, rsi
mov rcx, rdx
and rcx, -16
add rax, rcx
short jno L2277
L2278:
call 140439031219296
L2277:
mov qword ptr [rbx], rax
When the compiler can figure out type information about the arguments, it can emit more efficient code. One would like to add a guard that restricts the argument to a 59-bit integer, but unfortunately the compiler cannot yet make use of such a guard test.
But adding a redundant input bit mask to the Erlang code puts the compiler on the right track. This is a kludge, and will only be used until the compiler has been improved to deduce the same information from a guard instead.
The Erlang code now has a first redundant mask to 59 bits:
mwc59(CX0) ->
CX = CX0 band ((1 bsl 59)-1),
C = CX band ((1 bsl 32)-1),
X = CX bsr 32,
16#7fa6502 * X + C.
The BEAM assembler then becomes, with the default type-based optimizations in the compiler of the OTP-25.0 release:
{gc_bif,'band',{f,0},1,[{x,0},{integer,576460752303423487}],{x,0}}.
{gc_bif,'bsr',{f,0},1,[{tr,{x,0},{t_integer,{0,576460752303423487}}},
{integer,32}],{x,1}}.
{gc_bif,'band',{f,0},2,[{tr,{x,0},{t_integer,{0,576460752303423487}}},
{integer,4294967295}],{x,0}}.
{gc_bif,'*',{f,0},2,[{tr,{x,0},{t_integer,{0,4294967295}}},
{integer,133850370}],{x,0}}.
{gc_bif,'+',{f,0},2,[{tr,{x,0},{t_integer,{0,572367635452168875}}},
{tr,{x,1},{t_integer,{0,134217727}}}],{x,0}}.
Note that after the initial input band operation, type information {tr,{x,_},{t_integer,Range}} has been propagated all the way down.
Now the JIT:ed code becomes noticeably shorter.
The input mask operation knows nothing about the value so it has the operand test and the fallback to any term code:
# i_band_ssjd
mov rsi, qword ptr [rbx]
mov rax, 9223372036854775807
# is the operand small?
mov edi, esi
and edi, 15
cmp edi, 15
short jne L1816
and rax, rsi
short jmp L1817
L1816:
call 139812177115776
L1817:
mov qword ptr [rbx], rax
For all the following operations, the operand tests and fallback code have been optimized away, leaving a straight sequence of machine code:
# line_I
# i_bsr_ssjd
mov rsi, qword ptr [rbx]
# skipped test for small left operand because it is always small
mov rax, rsi
sar rax, 32
or rax, 15
L1818:
L1819:
mov qword ptr [rbx+8], rax
# line_I
# i_band_ssjd
mov rsi, qword ptr [rbx]
mov rax, 68719476735
# skipped test for small operands since they are always small
and rax, rsi
mov qword ptr [rbx], rax
# line_I
# i_times_jssd
# multiplication without overflow check
mov rax, qword ptr [rbx]
mov esi, 2141605935
and rax, -16
sar rsi, 4
imul rax, rsi
or rax, 15
mov qword ptr [rbx], rax
# i_plus_ssjd
# add without overflow check
mov rax, qword ptr [rbx]
mov rsi, qword ptr [rbx+8]
and rax, -16
add rax, rsi
mov qword ptr [rbx], rax
The execution time goes down from 3.7 ns to 3.3 ns, which is 10% faster, just by avoiding redundant checks and tests, despite adding an initial input mask operation that is not needed.
And there is room for improvement. The values are moved back and forth
to BEAM {x,_}
registers (qword ptr [rbx]
) between operations.
Moving back from the {x,_}
register could be avoided by the JIT
since it is possible to know that the value is in a process register.
Moving out to the {x,_}
register could be optimized away if the compiler
would emit the information that the value will not be used
from the {x,_}
register after the operation.
To create a really fast PRNG in Erlang there are some limitations coming with the language implementation, as described above: the operations must stay on immediate values to avoid bignums, and heap allocations must be avoided.
The first attempt was to try a classical power of 2 Linear Congruential Generator:
X1 = (A * X0 + C) band (P-1)
And a Multiplicative Congruential Generator:
X1 = (A * X0) rem P
To avoid bignum operations the product A * X0 must fit in 59 bits. The classical paper “Tables of Linear Congruential Generators of Different Sizes and Good Lattice Structure” by Pierre L’Ecuyer lists two generators that are 35-bit, that is, an LCG with P = 2³⁵ and an MCG with P being a prime number just below 2³⁵. These were the largest generators to be found for which the multiplication did not overflow 59 bits.
The speed of the LCG is very good. The MCG less so, since it has to do an integer division by rem, but thanks to P being close to 2³⁵ that could be optimized, so it ended up only about 50% slower than the LCG.
The short period and known quirks of a power-of-2 LCG unfortunately showed in PRNG tests. They failed miserably.
Sebastiano Vigna of the University of Milano, who also helped design our current 58-bit Xorshift family generators, suggested to use a Multiply With Carry generator instead:
T = A * X0 + C0,
X1 = T band ((1 bsl Bits)-1),
C0 = T bsr Bits.
This generator operates on “digits” of size Bits
, and if a digit
is half a machine word then the multiplication does not overflow.
Instead of having the state as a digit X
and a carry C
these
can be merged to have T
as the state instead. We get:
X = T0 band ((1 bsl Bits)-1),
C = T0 bsr Bits,
T1 = A * X + C
An MWC generator is actually a different form of an MCG generator with a power-of-2 multiplier, so this is an equivalent generator:
T0 = (T1 bsl Bits) rem ((A bsl Bits) - 1)
In this form the generator updates the state in the reverse order,
hence T0
and T1
are swapped. The modulus (A bsl Bits) - 1
has to be a safe prime number or else the generator
does not have maximum period.
Because the multiplier (or its multiplicative inverse) is a power of 2, the MWC generator gets a bad spectral score in 3 dimensions, so using a scrambling function on the state to get a number would be necessary to improve the quality.
A search for a suitable digit size and multiplier started, mostly done by using programs that try multipliers for safe prime numbers and estimate spectral scores, such as CPRNG.
When the generator is balanced, that is, the multiplier A
has got close to Bits
bits, the spectral scores are the best,
apart from the known problem in 3 dimensions. But since a scrambling
function would be needed anyway there was an opportunity to
try to generate a comfortable 32-bit digit using a 27-bit multiplier.
With these sizes the product A * X0
does not create a bignum,
and with a 32-bit digit it becomes possible to use standard
PRNG tests to test the generator during development.
Because of using such slightly unbalanced parameters, the spectral scores for 2 dimensions unfortunately also get bad, but the scrambler could solve that too…
The final generator is:
mwc59(T) ->
C = T bsr 32,
X = T band ((1 bsl 32)-1),
16#7fa6502 * X + C.
The 32-bit digits of this base generator do not perform very well in PRNG tests, but the low 16 bits actually pass 2 TB in PractRand, and 1 TB with the bits reversed, which is surprisingly good. The problem of bad spectral scores in 2 and 3 dimensions lies in the higher bits of the MWC digit.
The scrambler has to be fast, as in using only a few fast operations. For an arithmetic generator like this, Xorshift is a suitable scrambler. We looked at single Xorshift, double Xorshift and double XorRot. Double XorRot was slower than double Xorshift but not better, probably because the generator has got good low bits that need to be shifted up to improve the high bits. Rotating high bits down to the low end is no improvement.
This is a single Xorshift scrambler:
V = T bxor (T bsl Shift)
When trying Shift constants it turned out that with a large shift constant the generator performed better in PractRand, and with a small one it performed better in birthday spacing tests (such as in TestU01 BigCrush) and collision tests. Alas, it was not possible to find a constant good for both.
The chosen single Xorshift constant is 8, which passes 4 TB in PractRand and BigCrush in TestU01, but fails more thorough birthday spacing tests. The failures are few, such as the lowest bit in 8 and 9 dimensions, and some intermediate bits in 2 and 3 dimensions. This is something unlikely to affect most applications, and if you use the high bits of the 32 generated, these imperfections should stay under the rug.
The final scrambler has to avoid bignum operations and masks the value to 32 bits so it looks like this:
mwc59_value32(T) ->
V0 = T band ((1 bsl 32)-1),
V1 = V0 band ((1 bsl (32-8))-1),
V0 bxor (V1 bsl 8).
A better scrambler would be a double Xorshift that can have both a small shift and a large shift. Using the small shift 4 makes the combined generator do very well in birthday spacings and collision tests, and following up with a large shift 27 shifts the whole improved 32-bit MWC digit all the way up to the top bit of the generator’s 59-bit state. That was the idea, and it turned out to work fine.
The double Xorshift scrambler produces a 59-bit number where the low, the high, reversed low, reversed high, etc… all perform very well in PractRand, TestU01 BigCrush, and in exhaustive birthday spacing and collision tests. It is also not terribly much slower than the single Xorshift scrambler.
Here is a double Xorshift scrambler 4 then 27:
V1 = T bxor (T bsl 4),
V = V1 bxor (V1 bsl 27).
Which, avoiding bignum operations and producing a 59-bit value, becomes the final scrambler:
mwc59_value(T) ->
V0 = T band ((1 bsl (59-4)) - 1),
V1 = T bxor (V0 bsl 4),
V2 = V1 band ((1 bsl (59-27)) - 1),
V1 bxor (V2 bsl 27).
Many thanks to Sebastiano Vigna, who has done most of (practically all) the parameter searching and extensive testing of the generator and scramblers, backed by knowledge of what could work. Using an MWC generator in this particular way is rather uncharted territory regarding the math, so extensive testing is the way to trust the quality of the generator.
rand_SUITE:measure/1
The test suite for the rand module, rand_SUITE, in the Erlang/OTP source tree, contains a test case measure/1. This test case is a micro-benchmark of all the algorithms in the rand module, and some more. It measures the execution time in nanoseconds per generated number, and presents the times both absolute and relative to the default algorithm exsss, which is considered to be 100%. See Measurement Results.
measure/1 can also be run without a test framework. As long as rand_SUITE.beam is in the code path, rand_SUITE:measure(N) will run the benchmark with N as an effort factor. N = 1 is the default; for example, N = 5 gives a slower and more thorough measurement.
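For example, a more thorough run can be started directly from the Erlang shell (assuming the suite has been compiled and rand_SUITE.beam is in the code path):

1> rand_SUITE:measure(5).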
The test case is divided into sections, where each section first runs a warm-up with the default generator, then runs an empty benchmark generator to measure the benchmark overhead, and after that runs all generators for the specific section. The measured overhead is subtracted from the results presented after the overhead run.
The warm-up and the overhead measurement & compensation are recent improvements to the measure/1 test case. Overhead has also been reduced by in-lining 10 PRNG iterations per test case loop iteration, which got the overhead down to one third of what it was without such in-lining. The overhead is now about as large as the fastest generator itself, approaching the function call overhead in Erlang.
The different measure/1 sections are different use cases, such as “uniform integer half range + 1”, etc. Many of these test the performance of plug-in framework features. The test sections that are interesting for this text are “uniform integer range 10000”, “uniform integer 32-bit”, and “uniform integer full range”. Here are some selected results from the author’s laptop, from running rand_SUITE:measure(20):
The {mwc59,Tag} generator is rand:mwc59/1, where Tag indicates whether the raw generator, the rand:mwc59_value32/1 scrambler, or the rand:mwc59_value/1 scrambler was used. The {exsp,_} generator is rand:exsp_next/1, which is a newly exported internal function that does not use the plug-in framework. When called from the plug-in framework it is called exsp below. unique_phash2 is erlang:phash2(erlang:unique_integer(), Range). system_time is os:system_time(microsecond).
RNG uniform integer range 10000 performance
exsss: 57.5 ns (warm-up)
overhead: 3.9 ns 6.8%
exsss: 53.7 ns 100.0%
exsp: 49.2 ns 91.7%
{mwc59,raw_mod}: 9.8 ns 18.2%
{mwc59,value_mod}: 18.8 ns 35.0%
{exsp,mod}: 22.5 ns 41.9%
{mwc59,raw_tm}: 3.5 ns 6.5%
{mwc59,value32_tm}: 8.0 ns 15.0%
{mwc59,value_tm}: 11.7 ns 21.8%
{exsp,tm}: 18.1 ns 33.7%
unique_phash2: 23.6 ns 44.0%
system_time: 30.7 ns 57.2%
The first two lines are the warm-up and overhead measurements. The measured overhead is subtracted from all measurements after the “overhead:” line. The measured overhead here is 3.9 ns, which matches well with exsss measuring 3.8 ns more during the warm-up run than after the overhead run. The warm-up run is, however, a bit unpredictable.
{_,*mod} and system_time all use (X rem 10000) + 1 to achieve the desired range. The rem operation is expensive, which we will see when comparing with the next section. {_,*tm} use truncated multiplication to achieve the range, that is, ((X * 10000) bsr GeneratorBits) + 1, which is much faster than using rem.
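To make the difference concrete, here is an illustrative-only sketch of the two range-reduction techniques, written for a 32-bit value V (so GeneratorBits = 32; the function names are made up for this example):

range_rem(V, Range) ->
    (V rem Range) + 1.            % division based; expensive
range_tm(V, Range) ->
    ((V * Range) bsr 32) + 1.     % truncated multiplication; cheap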
erlang:phash2/2 has got a range argument that performs the rem 10000 operation in the BIF, which is fairly cheap, as we also will see when comparing with the next section.
RNG uniform integer 32 bit performance
exsss: 55.3 ns 100.0%
exsp: 51.4 ns 93.0%
{mwc59,raw_mask}: 2.7 ns 4.9%
{mwc59,value32}: 6.6 ns 12.0%
{mwc59,value_shift}: 8.6 ns 15.5%
{exsp,shift}: 16.6 ns 30.0%
unique_phash2: 22.1 ns 40.0%
system_time: 23.5 ns 42.6%
In this section, to generate a number in a 32-bit range, {mwc59,raw_mask} and system_time use a bit mask X band 16#ffffffff, {_,*shift} use bsr to shift out the low bits, and {mwc59,value32} has got the right range in itself. Here we see that bit operations are up to 10 ns faster than the rem operation in the previous section, and {mwc59,raw_*} is more than 3 times faster.
Compared to the truncated multiplication variants in the previous section, the bit operations here are up to 3 ns faster.
unique_phash2 still uses BIF-coded integer division to achieve the range, which gives it about the same speed as in the previous section, though it seems integer division by a power of 2 is a bit faster.
RNG uniform integer full range performance
exsss: 45.1 ns 100.0%
exsp: 39.8 ns 88.3%
dummy: 25.5 ns 56.6%
{mwc59,raw}: 3.7 ns 8.3%
{mwc59,value32}: 6.9 ns 15.2%
{mwc59,value}: 8.5 ns 18.8%
{exsp,next}: 16.8 ns 37.2%
{splitmix64,next}: 331.1 ns 734.3%
unique_phash2: 21.1 ns 46.8%
procdict: 75.2 ns 166.7%
{mwc59,procdict}: 16.6 ns 36.8%
In this section no range capping is done. The raw generator output is used.
Here we have the dummy
generator, which is an undocumented generator
within the rand
plug-in framework that only does a minimal state
update and returns a constant. It is used here to measure
plug-in framework overhead.
The plug-in framework overhead is measured to 25.5 ns, which matches exsp - {exsp,next} = 23.0 ns fairly well; that is the same algorithm within and outside the plug-in framework, giving another measure of the framework overhead.
procdict
is the default algorithm exsss
but makes the plug-in
framework store the generator state in the process dictionary,
which here costs 30 ns.
{mwc59,procdict}
stores the generator state in the process dictionary,
which here costs 12.9 ns. The state term that is stored is much smaller
than for the plug-in framework. Compare to procdict
in the previous paragraph.
The new fast generator functions in the rand module fill a niche for speed over quality, where the type-based JIT optimizations have elevated the performance.
The combination of high speed and high quality can only be fulfilled with a BIF implementation, but we hope that is a combination we will not need to address…
Implementing a PRNG is tricky business.
Recent improvements in rand_SUITE:measure/1 highlight what the precious CPU cycles are used for.
The SSA-based compiler passes introduced in OTP 22 do sophisticated type analysis, which allows for more optimizations and better code generation. There are, however, limits to what kind of optimizations the Erlang compiler can do, because a BEAM file must be possible to load on any BEAM machine running on a 32-bit or 64-bit computer. Therefore, the compiler cannot do optimizations that depend on the size of integers that fit in a machine word or on how Erlang terms are represented.
The JIT (introduced in OTP 24) knows that it is running on a 64-bit
computer and knows how Erlang terms are represented. The JIT is still
limited in how much optimization it can do because it translates a
single BEAM instruction at a time. For example, the + operator can
add floats or integers of any size or any combination
thereof. Previously executed BEAM instructions might have made it
clear that the operands can only be small integers, but the JIT does
not know that, since it only looks at one instruction at a time, and
therefore it must emit native code that handles all possible operands.
In OTP 25, the compiler has been updated to embed type information in the BEAM file and the JIT has been extended to emit better code based on that type information.
The embedded type information is versioned so that we can continue to improve the type-based optimizations in every OTP release. The loader will ignore versions it does not recognize so that the module can still be loaded without the type-based optimizations.
OTP 25 is just the beginning for type-based optimizations. We hope to improve both the type information from the compiler and the optimizations in the JIT in OTP 26.
How much better the native code emitted by the JIT will be depends on the nature of the code in the module.
The most commonly applied optimization is simplified tests. For example, a test for a tuple can frequently be reduced from 5 instructions down to 3 instructions, and a test for small integer operands can frequently be reduced from 5 instructions down to 4 instructions.
Less commonly applied but more significant are the simplifications
that can be made when an integer is known to be “small” (fits in 60
bits). For example, a relational operator (such as <
) used in a
guard can be reduced from 11 instructions down to 4 if the operands
are known to be small integers. This kind of optimization is most
often applied in modules that use binary pattern matching because
integers matched out from a binary have a well-defined range.
In the Erlang/OTP code base, the first kind of optimizations (shaving off one or two instructions) are applied roughly ten times as often as the second kind.
We will see later in this blog post that the optimizations of the
second kind applied to the base64
module resulted in a significant
speed up.
Let’s dive right into some examples.
Consider this module:
-module(example).
-export([tuple_matching/1]).
tuple_matching(X) ->
case increment(X) of
{ok,Result} -> Result;
error -> X
end.
increment(X) when is_integer(X) -> {ok,X+1};
increment(_) -> error.
The BEAM code for the tuple_matching/1
function emitted
by the compiler in OTP 24 is (somewhat simplified):
{allocate,1,1}.
{move,{x,0},{y,0}}.
{call,1,{f,5}}.
{test,is_tuple,{f,3},[{x,0}]}.
{get_tuple_element,{x,0},1,{x,0}}.
{deallocate,1}.
return.
{label,3}.
{move,{y,0},{x,0}}.
{deallocate,1}.
return.
The compiler has figured out that increment/1
returns either the
atom error
or a two-tuple with ok
as the first element. Therefore,
to distinguish between those two possible return values, a single
instruction suffices:
{test,is_tuple,{f,3},[{x,0}]}.
There is no need to explicitly test for the value error
because it
must be error
if it is not a tuple. Similarly, there is no need
to test that the first element of the tuple is ok
because it must be.
In OTP 24, the JIT translates that instruction to a sequence of 5 native instructions for x86_64:
# i_is_tuple_fs
mov rsi, qword ptr [rbx]
rex test sil, 1
jne L2
test byte ptr [rsi-2], 63
jne L2
(Lines starting with #
are comments.)
The mov
instruction fetches the value of the BEAM register {x,0}
to the CPU register rsi
. The next two instructions test whether the
term is a pointer to an object on the heap. If it is, the header word
for the heap object is tested to make sure it is a tuple. The second
test is needed because the heap object could be some other Erlang term,
such as a binary, a map, or an integer that does not fit in a machine
word.
Now let’s see what the compiler and the JIT in OTP 25 do with this instruction. The BEAM code is now:
{test,is_tuple,
{f,3},
[{tr,{x,0},
{t_union,{t_atom,[error]},
none,none,
[{{2,{t_atom,[ok]}},
{t_tuple,2,true,
#{1 => {t_atom,[ok]},
2 => {t_integer,any}}}}],
none}}]}.
The operand that was {x,0}
in OTP 24 is now a tuple:
{tr,Register,Type}
That is, it is a three-tuple with tr
as the first element. tr
stands for typed register. The second element is the BEAM register
({x,0}
in this case), and the third element is the type of the
register in the compiler’s internal type representation. The type
is equivalent to the following type spec:
'error' | {'ok', integer()}
The JIT cannot take advantage of that level of detail in the types, so the compiler embeds a simplified version of that type into the BEAM file. The embedded type is equivalent to:
atom() | tuple()
By knowing that {x,0}
must be an atom or a tuple, the JIT in OTP 25
emits the following simplified native code:
# i_is_tuple_fs
mov rsi, qword ptr [rbx]
# simplified tuple test since the source is always a tuple when boxed
rex test sil, 1
jne label_3
(The JIT generally emits a comment when type information made a simplification possible.)
Only the first test is now necessary, because if the term is a pointer to a heap object, according to the type information, it must be a tuple.
As another example, let’s look at how the relational operators in guards are translated. Consider this function:
my_less_than(A, B) ->
if
A < B -> smaller;
true -> larger_or_equal
end.
The BEAM code looks like this:
{test,is_lt,{f,9},[{x,0},{x,1}]}.
{move,{atom,smaller},{x,0}}.
return.
{label,9}.
{move,{atom,larger_or_equal},{x,0}}.
return.
When relational operators are used as guard tests, the compiler rewrites
them as special instructions. Thus, the <
operator is rewritten to an
is_lt
instruction.
The <
operator can compare any Erlang terms. It would be impractical
for the JIT to emit the code to handle all kinds of terms. Therefore, the
JIT emits code that directly handles the most common case and
calls a generic routine to handle everything else:
# is_lt_fss
mov rsi, qword ptr [rbx+8]
mov rdi, qword ptr [rbx]
mov eax, edi
and eax, esi
and al, 15
cmp al, 15
short jne L39
cmp rdi, rsi
short jmp L40
L39:
call 5447639136
L40:
jge label_9
Let’s walk through the code. The first two instructions:
mov rsi, qword ptr [rbx+8]
mov rdi, qword ptr [rbx]
fetch the BEAM registers {x,1} and {x,0} into CPU registers.
The most common comparison is between two integers. Depending on the
magnitude, integers can be represented in two different ways. On a 64-bit
computer, signed integers that fit in 60 bits are stored directly
in a 64-bit word. The remaining 4 bits in the word are used for the
tag, which for a small integer is 15. If the integer does
not fit, it is represented as a bignum, which is a pointer to
an object on the heap.
Here is the native code for testing that both operands are small:
mov eax, edi
and eax, esi
and al, 15
cmp al, 15
short jne L39
If one or both of the operands have another tag than 15
(are not
small integers), control is transferred to code at label L39
that
handles all other types of terms.
The next lines do the comparison of the small integers. The code is
written in a slightly convoluted way so that the conditional jump
(jge label_9
) that transfers control to the failure label can be
shared with the generic code:
cmp rdi, rsi
short jmp L40
L39:
call 5447639136
L40:
jge label_9
Thus, without type information, 11 instructions are needed to implement
is_lt
.
Now let’s see what happens when types are available:
my_less_than(A, B) when is_integer(A), is_integer(B) ->
.
.
.
When compiled by the compiler in OTP 25, the BEAM code is:
{test,is_integer,{f,7},[{x,0}]}.
{test,is_integer,{f,7},[{x,1}]}.
{test,is_lt,{f,9},[{tr,{x,0},{t_integer,any}},{tr,{x,1},{t_integer,any}}]}.
{move,{atom,smaller},{x,0}}.
return.
{label,9}.
{move,{atom,larger_or_equal},{x,0}}.
return.
The operands for the is_lt
instruction now have types. The BEAM
registers {x,0}
and {x,1}
have the type {t_integer,any}
, which
means an integer with an unknown range.
Having that knowledge of the types, the JIT can emit a slightly shorter test for a small integer:
# simplified small test since all other types are boxed
mov eax, edi
and eax, esi
test al, 1
short je L39
To do a better job, the JIT will need better type information. For example:
map_size_less_than(Map1, Map2) ->
if
map_size(Map1) < map_size(Map2) -> smaller;
true -> larger_or_equal
end.
The BEAM code looks like this:
{gc_bif,map_size,{f,12},2,[{x,0}],{x,0}}.
{gc_bif,map_size,{f,12},2,[{x,1}],{x,1}}.
{test,is_lt,
{f,12},
[{tr,{x,0},{t_integer,{0,288230376151711743}}},
{tr,{x,1},{t_integer,{0,288230376151711743}}}]}.
{move,{atom,smaller},{x,0}}.
return.
{label,12}.
{move,{atom,larger_or_equal},{x,0}}.
return.
Both operands for is_lt
now have the type
{t_integer,{0,288230376151711743}}
, meaning an integer in the range
0 through 288230376151711743 (that is, (1 bsl 58) - 1
). There is no
documented upper limit for the number of elements in a map, but for
the foreseeable future, there is no way that the number of elements in
a map will exceed or even get close to 288230376151711743.
Since both the lower and upper bounds for {x,0}
and {x,1}
fit in
60 bits, there is no need to test the type of the operands:
# is_lt_fss
mov rsi, qword ptr [rbx+8]
mov rdi, qword ptr [rbx]
# skipped test for small operands since they are always small
cmp rdi, rsi
L42:
L43:
jge label_12
Since the operands are always small, the call to the generic routine
(following label L42
) has been omitted.
Looking at arithmetic instructions, we will see the potential for nice simplifications by the JIT, but unfortunately we will also see the limitations of the type analysis done by the Erlang compiler in OTP 25.
Let’s look at the generated code for this function:
add1(X, Y) ->
X + Y.
The BEAM code looks like this:
{gc_bif,'+',{f,0},2,[{x,0},{x,1}],{x,0}}.
return.
The JIT translates the +
instruction to the following native instructions:
# i_plus_ssjd
mov rsi, qword ptr [rbx]
mov rdx, qword ptr [rbx+8]
# are both operands small?
mov eax, esi
and eax, edx
and al, 15
cmp al, 15
short jne L15
# add with overflow check
mov rax, rsi
mov rcx, rdx
and rcx, -16
add rax, rcx
short jno L14
L15:
call 4328985696
L14:
mov qword ptr [rbx], rax
The first two instructions:
mov rsi, qword ptr [rbx]
mov rdx, qword ptr [rbx+8]
load the operands for the + operation from BEAM registers into CPU registers.
The next 5 instructions test for small operands:
# are both operands small?
mov eax, esi
and eax, edx
and al, 15
cmp al, 15
short jne L15
The code is almost identical to the code in the is_lt instruction
that we examined earlier. The only difference is that other CPU
registers are used. If one or both of the operands are not small
integers, a jump is made to label L15, which looks like this:
L15:
call 4328985696
This code calls a generic routine that can add any combination of
small integers, bignums, or floats. The generic routine also handles
non-number operands by raising a badarith exception.
If both operands are indeed small integers, the following code adds them and checks for overflow:
# add with overflow check
mov rax, rsi
mov rcx, rdx
and rcx, -16
add rax, rcx
short jno L14
If the addition overflowed, the generic addition routine is called. Otherwise, control is transferred to the following instruction:
mov qword ptr [rbx], rax
which stores the result in {x,0}
.
To summarize, the addition itself (including dealing with the tags) requires 4 instructions. However, 10 more instructions are needed to fetch the operands from the BEAM registers, test that both operands are small integers, handle overflow and non-small operands through the generic routine, and store the result into a BEAM register.
Now let’s see what happens if types are introduced.
Consider:
add2(X0, Y0) ->
X = 2 * X0,
Y = 2 * Y0,
X + Y.
The BEAM code looks like:
{gc_bif,'*',{f,0},2,[{x,0},{integer,2}],{x,0}}.
{gc_bif,'*',{f,0},2,[{x,1},{integer,2}],{x,1}}.
{gc_bif,'+',{f,0},2,[{tr,{x,0},number},{tr,{x,1},number}],{x,0}}.
return.
Types are propagated from arithmetic instructions to other arithmetic
instructions. Because the result of *
(if it succeeds) is a number
(integer or float), the operands for the +
instruction now have the
type number
.
Based on our experience of adding types to the <
operator, we might
guess that we would save only one instruction in the type test. We
would be right:
# simplified test for small operands since both are numbers
mov eax, esi
and eax, edx
test al, 1
short je L22
Returning to the simpler example with addition and no multiplication,
let’s add a guard to ensure that X
and Y
are integers:
add3(X, Y) when is_integer(X), is_integer(Y) ->
X + Y.
That results in the following BEAM code:
{test,is_integer,{f,5},[{x,0}]}.
{test,is_integer,{f,5},[{x,1}]}.
{gc_bif,'+',
{f,0},
2,
[{tr,{x,0},{t_integer,any}},{tr,{x,1},{t_integer,any}}],
{x,0}}.
return.
The types for both operands are now {t_integer,any}
. However, that
will still result in the same simplified four-instruction sequence for
testing small integers, because the integers might not fit in 60 bits.
Clearly, based on our experience with is_lt
, we will need to establish
a range for X
and Y
. A reasonable way to do that would be:
add4(X, Y) when is_integer(X), 0 =< X, X < 16#400,
is_integer(Y), 0 =< Y, Y < 16#400 ->
X + Y.
However, because of limitations in the compiler’s value range analysis,
the types for the +
operator will not improve:
{test,is_integer,{f,19},[{x,0}]}.
{test,is_ge,{f,19},[{tr,{x,0},{t_integer,any}},{integer,0}]}.
{test,is_lt,{f,19},[{tr,{x,0},{t_integer,any}},{integer,1024}]}.
{test,is_integer,{f,19},[{x,1}]}.
{test,is_ge,{f,19},[{tr,{x,1},{t_integer,any}},{integer,0}]}.
{test,is_lt,{f,19},[{tr,{x,1},{t_integer,any}},{integer,1024}]}.
{gc_bif,'+',
{f,0},
2,
[{tr,{x,0},{t_integer,any}},{tr,{x,1},{t_integer,any}}],
{x,0}}.
return.
To add insult to injury, the first 6 instructions cannot be simplified
by the JIT because there is not sufficient type information. That is,
the is_lt
and is_ge
instructions will comprise 11 instructions each.
We aim to improve the type analysis and optimizations in OTP 26 and generate better code for this example. We are also considering adding a new guard BIF in OTP 26 for testing that a term is an integer in a given range.
Meanwhile, while we wait for OTP 26, there is a way in
OTP 25 to write an equivalent guard that will result in
much more efficient code and establish known ranges for X
and
Y
:
add5(X, Y) when X =:= X band 16#3FF,
Y =:= Y band 16#3FF ->
X + Y.
We show this way of writing guards for illustrative purposes only; we don’t recommend rewriting your guards in this way.
The band operator fails unless both of its operands are integers, so no is_integer/1 test is needed. The =:= comparison will return false if the corresponding variable is outside the range 0 through 16#3FF.
This guard results in the following BEAM code, where the compiler has now been able to figure out the possible ranges for the operands of the + operator:
{gc_bif,'band',{f,21},2,[{x,0},{integer,1023}],{x,2}}.
{test,is_eq_exact,
{f,21},
[{tr,{x,0},{t_integer,any}},{tr,{x,2},{t_integer,{0,1023}}}]}.
{gc_bif,'band',{f,21},2,[{x,1},{integer,1023}],{x,2}}.
{test,is_eq_exact,
{f,21},
[{tr,{x,1},{t_integer,any}},{tr,{x,2},{t_integer,{0,1023}}}]}.
{gc_bif,'+',
{f,0},
2,
[{tr,{x,0},{t_integer,{0,1023}}},{tr,{x,1},{t_integer,{0,1023}}}],
{x,0}}.
return.
Also, the 4 instructions that precede the + instruction are now relatively efficient.
The band
instruction needs to test the operands and be prepared to handle
integers that don’t fit in 60 bits:
# i_band_ssjd
mov rsi, qword ptr [rbx]
mov eax, 16383
# is the operand small?
mov edi, esi
and edi, 15
cmp edi, 15
short jne L97
and rax, rsi
short jmp L98
L97:
call 4456532680
short je label_25
L98:
mov qword ptr [rbx+16], rax
The is_eq_exact
instruction benefits from type information derived from
executing the band
instruction. Since the right-hand side operand is known
to be a small integer that fits in a machine word, a simple comparison is
sufficient with no need for fallback code to handle other Erlang terms:
# is_eq_exact_fss
# simplified check since one argument is an immediate
mov rdi, qword ptr [rbx+16]
cmp qword ptr [rbx], rdi
short jne label_25
The JIT generates the following code for the +
operator:
# i_plus_ssjd
# add without overflow check
mov rax, qword ptr [rbx]
mov rsi, qword ptr [rbx+8]
and rax, -16
add rax, rsi
mov qword ptr [rbx], rax
base64
As far as we know, base64 is the module in OTP that has benefited the most from the improvements in OTP 25.
Here follow benchmark results for a benchmark included in a GitHub issue. First, the results for OTP 24 on my computer:
== Testing with 1 MB ==
fun base64:encode/1: 1000 iterations in 19805 ms: 50 it/sec
fun base64:decode/1: 1000 iterations in 20075 ms: 49 it/sec
The results for OTP 25 on the same computer:
== Testing with 1 MB ==
fun base64:encode/1: 1000 iterations in 16024 ms: 62 it/sec
fun base64:decode/1: 1000 iterations in 18306 ms: 54 it/sec
In OTP 25, the encoding is done in 80 percent of the time that OTP 24 needs. Decoding is also more than a second faster.
The base64
module has not been modified in OTP 25, so the improvements
are entirely down to improvements in the compiler and the JIT.
Here is the clause of encode_binary/2
in the base64
module that does
most of the work of encoding a binary to Base64:
encode_binary(<<B1:8, B2:8, B3:8, Ls/bits>>, A) ->
BB = (B1 bsl 16) bor (B2 bsl 8) bor B3,
encode_binary(Ls,
<<A/bits,(b64e(BB bsr 18)):8,
(b64e((BB bsr 12) band 63)):8,
(b64e((BB bsr 6) band 63)):8,
(b64e(BB band 63)):8>>).
The binary matching in the function head establishes ranges for the variables B1, B2, and B3. (The type for all three variables will be {t_integer,{0,255}}.)
Because of the ranges, all of the bsl
, bsr
, band
, and bor
operations that follow do not need any type checks. Also, in the
creation of the binary, there is no need to test whether the binary
creation succeeded because all values are known to be small integers.
The 4 calls to the b64e/1 function are inlined. The function looks like this:
-compile({inline, [{b64e, 1}]}).
b64e(X) ->
element(X+1,
{$A, $B, $C, $D, $E, $F, $G, $H, $I, $J, $K, $L, $M, $N,
$O, $P, $Q, $R, $S, $T, $U, $V, $W, $X, $Y, $Z,
$a, $b, $c, $d, $e, $f, $g, $h, $i, $j, $k, $l, $m, $n,
$o, $p, $q, $r, $s, $t, $u, $v, $w, $x, $y, $z,
$0, $1, $2, $3, $4, $5, $6, $7, $8, $9, $+, $/}).
In OTP 25, the JIT will optimize calls to element/2 where the position argument is an integer and the tuple argument is a literal tuple. For the way element/2 is used in b64e/1, all type tests and range checks will be removed:
# bif_element_jssd
# skipped tuple test since source is always a literal tuple
L302:
long mov rsi, 9223372036854775807
mov rdi, qword ptr [rbx+24]
lea rcx, qword ptr [rsi-2]
# skipped test for small position since it is always small
mov rax, rdi
sar rax, 4
# skipped check for position =:= 0 since it is always >= 1
# skipped check for negative position and position beyond tuple
mov rax, qword ptr [rcx+rax*8]
L300:
L301:
mov qword ptr [rbx+24], rax
That is 7 instructions with no conditional branches.
If you want to follow along and examine the native code for loaded modules, start the runtime system like this:
erl +JDdump true
The native code for all modules that are loaded will be dumped to files with the
extension .asm
.
To find code that has been simplified by the JIT, use this command:
egrep "simplified|skipped|without overflow" *.asm
To examine the BEAM code for a module, use the -S
option. For example:
erlc -S base64.erl
Here are the main pull requests that implement type-based optimizations:
Erlang/OTP 25 comes with a new parallel signal sending optimization for processes configured with the {message_queue_data, off_heap} setting. The following figure gives an idea of what type of scalability improvement the optimization can give in extreme scenarios (number of Erlang processes sending signals on the x-axis and throughput on the y-axis):
This blog post aims to give you an understanding of how signal sending on a single node is implemented in Erlang and how the new optimization can yield the impressive scalability improvement illustrated in the figure above. Let us begin with a brief introduction to what Erlang signals are.
All concurrently executing entities (processes, ports, etc.) in an Erlang system communicate using asynchronous signals. The most common signal is normal messages that are typically sent between processes with the bang (!) operator. As Erlang takes pride in being a concurrent programming language, it is, of course, essential that signals are sent efficiently between different entities. Let us now discuss what guarantees Erlang programmers get about signal sending ordering, as this will help when learning how the new optimization works.
The signal ordering guarantee is described in the Erlang documentation like this:
“The only signal ordering guarantee given is the following: if an entity sends multiple signals to the same destination entity, the order is preserved; that is, if
A
sends a signalS1
toB
, and later sends signalS2
toB
,S1
is guaranteed not to arrive afterS2
.”
This guarantee means that if multiple processes send signals to a
single process, all signals from the same process are received in the
send order in the receiving process. Still, there is no ordering
guarantee for two signals coming from two distinct processes. One
should not think about signal sending as instantaneous. There can be
an arbitrary delay after a signal has been sent until it has reached
its destination, but all signals from A
to B
travel on the same path
and cannot pass each other.
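To make the guarantee concrete, here is a small illustrative sketch:

Receiver = self(),
spawn(fun() -> Receiver ! a, Receiver ! b end),
spawn(fun() -> Receiver ! c end).
%% a is guaranteed to be received before b (same sender);
%% c may arrive before a, between a and b, or after b.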
The guarantee has deliberately been designed to allow for efficient implementations and allow for future optimizations. However, as we will see in the next section, before the optimization presented in this blog post, the implementation did not take advantage of the permissive ordering guarantee for signals sent between processes running on the same node.
Conceptually, the Erlang VM organized the data structure for an Erlang process as in the following figure before the optimization:
Of course, this is an extreme simplification of the Erlang process
structure, but it is enough for our explanation. When a process has
the {message_queue_data, off_heap}
setting activated, the following
algorithm is executed to send a signal:
1. Acquire the OuterSignalQueueLock in the receiving process
2. Insert the signal into the OuterSignalQueue
3. Release the OuterSignalQueueLock
When a receiving process has run out of signals in its
InnerSignalQueue
and/or wants to check if there are more signals in
the outer queue, the following algorithm is executed:
1. Acquire the OuterSignalQueueLock
2. Append the OuterSignalQueue at the end of the InnerSignalQueue
3. Release the OuterSignalQueueLock
How signal sending works when the receiving process is configured with
{message_queue_data, on_heap} is not so relevant for the main topic
of this blog post. Still, understanding how {message_queue_data,
on_heap} works will also give you an understanding of why the parallel
signal queue optimization is not enabled when a process is configured
with {message_queue_data, on_heap} (which is the default setting),
so here is the algorithm for sending a signal to such a process:
1. Try to acquire the MainProcessLock with a try_lock call
2. If the try_lock call succeeded:
   1. Copy the signal data directly to the receiving process’ main heap
   2. Acquire the OuterSignalQueueLock
   3. Insert the signal into the OuterSignalQueue
   4. Release the OuterSignalQueueLock
   5. Release the MainProcessLock
3. Otherwise:
   1. Acquire the OuterSignalQueueLock
   2. Insert the signal into the OuterSignalQueue
   3. Release the OuterSignalQueueLock
The advantage of {message_queue_data, on_heap}
compared to
{message_queue_data, off_heap}
is that the signal data is copied
directly to the receiving process main heap (when the try_lock
call
for the MainProcessLock
succeeds). The disadvantage of
{message_queue_data, on_heap}
is that the sender creates extra
contention on the receiver’s MainProcessLock
. Notice that we cannot
simply release the MainProcessLock
directly after allocating the
data on the receiver’s process heap. If a garbage collection happen
before the signal have been inserted into the process’ heap, the
signal data would be lost (holding the MainProcessLock
prevents a
garbage collection from happening). Therefore, {message_queue_data,
off_heap}
provides much better scalability than {message_queue_data,
on_heap}
when multiple processes send signals to the same process
concurrently on a multicore system.
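For reference, this is how a process gets the off_heap setting, using the existing spawn_opt/2 and process_flag/2 APIs (the fun below is just a placeholder):

Pid = spawn_opt(fun() -> receive stop -> ok end end,
                [{message_queue_data, off_heap}]),
%% ...or from within an already running process:
process_flag(message_queue_data, off_heap).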
However, even though {message_queue_data, off_heap}
scales better
than {message_queue_data, on_heap}
with the old implementation,
signal senders still had to acquire the OuterSignalQueueLock
for a
short time. This lock can become a scalability bottleneck and a
contended hot-spot when there are enough parallel senders. This is why
we saw very poor scalability and even a slowdown for the old
implementation in the benchmark figure above. Now, we are ready to
look at the new optimization.
The optimization takes advantage of Erlang’s permissive signal
ordering guarantee discussed above. It is enough to keep the order of
signals coming from the same entity to ensure that the signal ordering
guarantee holds. So there is no need for different senders to
synchronize with each other! In theory, signal sending could therefore
be parallelized perfectly. In practice, however, there is only one
thread of execution that handles incoming signals, so we also have to
keep in mind that we don’t want to slow down the receiver and ideally
make receiving signals faster. As signal queue data is stored outside
the process main heap area when the {message_queue_data, off_heap}
setting is enabled, the garbage collector does not need to go through
the whole signal queue, giving better performance for processes with a
lot of signals in their signal queue. Therefore, it is also important
for the optimization not to add unnecessary overhead when the
OuterSignalQueueLock
is uncontended, so that we do not slow down
existing use cases for {message_queue_data, off_heap}
too much.
We decided to go for a design that enables the parallel signal sending
optimization on demand when the contention on the OuterSignalQueueLock
seems to be high to avoid as much overhead as possible when the
optimization is unnecessary. Here is a conceptual view of the process
structure when the optimization is not active (which is the initial
state when creating a process with {message_queue_data, off_heap}
):
The following figure shows a conceptual view of the process structure
when the parallel signal sending optimization is turned on. The only
difference between this and the previous figure is that the
OuterSignalQueueBufferArray
field now points to a structure
containing an array with buffers.
When the parallel signal sending optimization is active, senders do
not need to acquire the OuterSignalQueueLock
anymore. Senders are
mapped to a slot in the OuterSignalQueueBufferArray
by a simple hash
function that is applied to the process ID (senders without a process
ID are currently mapped to the same slot). Before a sender takes the
OuterSignalQueueLock
in the receiving process’ structure, the sender
tries to enqueue in its slot in the OuterSignalQueueBufferArray
(if
it exists). If the enqueue attempt succeeds, the sender can continue
without even touching the OuterSignalQueueLock
! The order of signals
coming from the same sender is maintained because the same sender is
always mapped to the same slot in the buffer array. Now, you have
probably got an idea of why the signal sending throughput can increase
so much with the new optimization, as we saw in the benchmark figure
presented earlier. Essentially, the contention on the
OuterSignalQueueLock
gets distributed among the slots in the
OuterSignalQueueBufferArray
. The rest of the subsections in this
section cover details of the implementation, so you can skip those
if you do not want to dig deeper.
As the figure above tries to illustrate, the OuterSignalQueueLock
carries
a statistics counter. When that statistics counter reaches a certain
threshold, the new parallel signal sending optimization is activated
by installing the OuterSignalQueueBufferArray
in the process
structure. The statistics counter for the lock is updated in a simple
way. When a thread tries to acquire the OuterSignalQueueLock
and the lock
is already taken, the counter is increased, and otherwise, it is
decreased, as the following code snippet illustrates:
void erts_proc_sig_queue_lock(Process* proc)
{
if (EBUSY == erts_proc_trylock(proc, ERTS_PROC_LOCK_MSGQ)) {
erts_proc_lock(proc, ERTS_PROC_LOCK_MSGQ);
proc->sig_inq_contention_counter += 1;
} else if(proc->sig_inq_contention_counter > 0) {
proc->sig_inq_contention_counter -= 1;
}
}
Currently, the number of slots in the OuterSignalQueueBufferArray
is
fixed to 64. Sixty-four slots should go a long way to reduce signal
queue contention in most practical application that exists today. Few
servers have more than 100 cores, and typical applications spend a lot
of time doing other things than sending signals. Using 64 slots also
allows us to implement a very efficient atomically updatable bitset
containing information about which slots are currently non-empty (the
NonEmptySlots
field in the figure above). This bitset makes flushing
the buffer array into the OuterSignalQueue
more efficient
since only the non-empty slots in the buffer array need to be visited
and updated to perform the flush.
Pseudo-code for the algorithm that is executed when a process is
sending a signal to another process that has the
OuterSignalQueueBufferArray installed can be seen below:

1. Calculate the buffer slot I with the hash function
2. Acquire the SlotLock for the slot I
3. Read the IsAlive field for slot I
4. If the IsAlive field’s value is true:
   1. Update the NonEmptySlots field, if the buffer is empty
   2. Insert the signal into the BufferQueue for slot I
   3. Increase NumberOfEnqueues in slot I by 1
   4. Release the SlotLock for slot I
5. Otherwise (the OuterSignalQueueBufferArray has been deactivated):
   1. Release the SlotLock for slot I
   2. Insert the signal into the OuterSignalQueue in the same way as
      the signal sending algorithm did it prior to the optimization

The algorithm for fetching signals from the outer signal queue uses
the NonEmptySlots field in the OuterSignalQueueBufferArray, so it
only needs to check slots that are guaranteed to be non-empty. At a
high level, the routine works according to the following pseudo-code:
1. Acquire the OuterSignalQueueLock
2. For each non-empty slot (according to the NonEmptySlots field):
   1. Acquire the slot’s SlotLock
   2. Append the contents of the slot’s BufferQueue to the end of the OuterSignalQueue
   3. Add the slot’s NumberOfEnqueues field to the
      TotNumberOfEnqueues field in the OuterSignalQueueBufferArray
   4. Reset the slot’s BufferQueue and NumberOfEnqueues fields
   5. Release the slot’s SlotLock
3. Increase the NumberOfFlushes field in the
   OuterSignalQueueBufferArray by one
4. If the NumberOfFlushes field has reached a certain
   threshold T:
   1. Calculate the average number of enqueues per flush
      (EnqPerFlush) during the last T flushes
      (TotNumberOfEnqueues / T).
   2. If EnqPerFlush is below a certain threshold Q,
      deactivate the optimization by removing the
      OuterSignalQueueBufferArray:
      1. For each slot: acquire the SlotLock, append the slot’s
         contents to the OuterSignalQueue, set the
         IsAlive field to false, and release the
         SlotLock
      2. Set the OuterSignalQueueBufferArray field in the process
         structure to NULL
   3. Otherwise (EnqPerFlush is at least Q):
      1. Reset the NumberOfFlushes and the TotNumberOfEnqueues
         fields in the buffer array struct to 0
5. Append the OuterSignalQueue to the end of the InnerSignalQueue
6. Empty the OuterSignalQueue
7. Release the OuterSignalQueueLock
For simplicity, many details have been left out from the pseudo-code snippets above. However, if you have understood them, you have an excellent understanding of how signal sending in Erlang works, how the new optimization is implemented, and how it automatically activates and deactivates itself. Let us now dive a little bit deeper into benchmark results for the new implementation.
A configurable benchmark to measure the performance of both signal
sending processes and receiving processes has been created. The
benchmark lets N
Erlang processes send signals (of configurable types
and sizes) to a single process during a period of T
seconds. Both N
and T
are configurable variables. A signal with size S
has a payload
consisting of a list of length S
with word-sized (64 bits) items. The
send throughput is calculated by dividing the number of signals that
are sent by T
. The receive throughput is calculated by waiting until
all sent signals have been received and then dividing the total number
of signals sent by the time between when the first signal was sent and
when the last signal was received. The benchmark machine has 32 cores
and two hardware threads per core (giving 64 hardware threads). You
can find a detailed benchmark description on the signal queue
benchmark page.
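As a rough illustration (not the actual benchmark code), a minimal version of such a benchmark could look like the sketch below, with hypothetical helper names:

bench(N, T) ->
    %% One off_heap receiver, N senders, run for T milliseconds
    Recv = spawn_opt(fun recv_loop/0, [{message_queue_data, off_heap}]),
    Senders = [spawn(fun() -> send_loop(Recv) end) || _ <- lists:seq(1, N)],
    timer:sleep(T),
    [exit(P, kill) || P <- [Recv | Senders]],
    ok.

send_loop(P) -> P ! {msg, [1]}, send_loop(P).    % payload: a 1-item list
recv_loop() -> receive _ -> recv_loop() end.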
First, let us look at the results for very small messages (a list containing a single integer) below. The graph for the receive throughput is the same as we saw at the beginning of this blog post. Not surprisingly, the scalability for sending messages is much better after the optimization. More surprising is that the performance of receiving messages is also substantially improved. For example, with 16 processes, the receive throughput is 520 times better with the optimization! The improved receive throughput can be explained by the fact that in this scenario, the receiver has to fetch messages from the outer signal queue much less often. Sending is much faster after the optimization, so the receiver will bring more messages from the outer signal queue to the inner one every time it runs out of messages. The receiver can thus process messages from the inner queue for a longer time before it needs to fetch messages from the outer queue again. We cannot expect any improvement for the receiver beyond a certain point, as there is only a single hardware thread that can work on processing messages at the same time.
Below are the results for larger messages (a list containing 100 integers). We do not get as good improvement in this scenario with a larger message size. With larger messages, the benchmark spends more time doing other work than sending and receiving messages. Things like the speed of the memory system and memory allocation might become limiting factors. Still, we get decent improvement both in the send throughput and receive throughput, as seen below.
You can find results for even larger messages as well as for non-message signals on the benchmark page. Real Erlang applications do much more than message and signal sending, so this benchmark is, of course, not representative of what kind of improvements real applications will get. However, the benchmarks show that we have pushed the threshold for when parallel message sending to a single process becomes a problem. Perhaps the new optimization opens up new interesting ways of writing software that was impractical due to previous performance reasons.
Users can configure processes with {message_queue_data, off_heap}
or
{message_queue_data, on_heap}
. This configurability increases the
burden for Erlang programmers as it can be difficult to figure out
which one is better for a particular process. It would therefore make
sense also to have a {message_queue_data, auto}
option that would
automatically detect lock contention even in on_heap
mode and
seamlessly switch between on_heap
and off_heap
based on how much
contention is detected.
As discussed previously, 64 slots in the signal queue buffer array is a good start but might not be enough when servers have thousands of cores. A possible way to make the implementation even more scalable would be to make the signal queue buffer array expandable. For example, one could have contention detecting locks for each slot in the array. If the contention is high in a particular slot, one could expand this slot by creating a link to a subarray with buffers where senders can use another hash function (similar to how the HAMT data structure works).
The new parallel signal queue optimization that affects processes
configured with {message_queue_data, off_heap}
yields much better
scalability when multiple processes send signals to the same process
in parallel. The optimization has a very low overhead when the
contention is low as it is only activated when its contention
detection mechanism indicates that the contention is high.
The decentralized_counters option brings us one step closer to perfect scalability.
The ETS table option decentralized_counters (introduced in Erlang/OTP 22 for ordered_set tables and in Erlang/OTP 23 for the other table types) has made the scalability much better. A table with decentralized_counters activated uses decentralized counters instead of centralized counters to track the number of items in the table and the memory consumption. Unfortunately, tables with decentralized_counters activated have slow operations to get the table size and memory usage (ets:info(Table, size) and ets:info(Table, memory)), so whether it is beneficial to turn decentralized_counters on or off depends on your use case. This blog post will give you a better understanding of when one should activate the decentralized_counters option and how the decentralized counters work.
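As a quick sketch of how the option is used (note that, as far as we know, the option only takes effect when write_concurrency is enabled):

T = ets:new(tab, [set, public,
                  {write_concurrency, true},
                  {decentralized_counters, true}]),
true = ets:insert(T, {key, 1}),
1 = ets:info(T, size).    % exact, but slow with decentralized counters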
The following figure shows the throughput (operations/second) achieved
when processes are doing inserts (ets:insert/2
) and deletes
(ets:delete/2
) to an ETS table of the set
type on a machine with
64 hardware threads both when decentralized_counters
option is
activated and when it is deactivated. The table types bag
and
duplicate_bag
have similar scalability behavior as their
implementation is based on the same hash table.
The following figure shows the results for the same benchmark but with
a table of type ordered_set
:
The interested reader can find more information about the benchmark at the benchmark website for decentralized_counters. The benchmark results above show that both set and ordered_set tables get a significant scalability boost when the decentralized_counters option is activated. The ordered_set type receives a more substantial scalability improvement than the set type. Tables of the set type have a fixed number of locks for the hash table buckets. The ordered_set table type is implemented with a contention adapting search tree that dynamically changes the locking granularity based on how much contention is detected. This implementation difference explains the difference in scalability between set and ordered_set. The interested reader can find details about the ordered_set implementation in an earlier blog post.
Worth noting is also that the Erlang VM that ran the benchmarks has
been compiled with the configure option “./configure
--with-ets-write-concurrency-locks=256
”. The configure option
--with-ets-write-concurrency-locks=256
changes the number of locks
for hash-based ETS tables from the current default of 64 to 256 (256
is currently the max value one can set this configuration option
to). Changing the implementation of the hash-based tables so that one
can set the number of locks per table instance or so that the lock
granularity is adjusted automatically seems like an excellent future
improvement, but this is not what this blog post is about.
A centralized counter consists of a single memory word that is incremented and decremented with atomic instructions. The problem with a centralized counter is that modifications of the counter by multiple cores are serialized. This problem is amplified because frequent modifications of a single memory word by multiple cores cause a lot of expensive traffic in the cache coherence system. However, reading from a centralized counter is quite efficient as the reader only has to read a single memory word.
When designing the decentralized counters for ETS, we have tried to
optimize for update performance and scalability as most applications
need to get the size of an ETS table relatively rarely. However, since
there may be applications out in the wild that frequently call
ets:info(Table, size)
and ets:info(Table,
memory)
, we have chosen
to make decentralized counters optional.
Another thing that might be worth keeping in mind is that the hash-based tables that use decentralized counters tend to use slightly more hash table buckets than the corresponding tables without decentralized counters. The reason for this is that, with decentralized counters activated, the resizing decision is based on an estimate of the number of items in the table rather than an exact count, and the resizing heuristics trigger an increase of the number of buckets more eagerly than a decrease.
You will now learn how the decentralized counters in ETS work. The
decentralized counter implementation exports an API
that makes it easy to swap between a decentralized counter and a
centralized one. ETS uses this to support the usage of both
centralized and decentralized counters. The data structure for the
decentralized counter is illustrated in the following picture. When
is_decentralized = false, the counter field represents the current
count instead of a pointer to an array of cache line padded counters.
When is_decentralized = true, processes that update (increment or
decrement) the counter follow the pointer to the array of counters and
increment the counter at the slot in the array that the current
scheduler maps to (one takes the scheduler identifier modulo the
number of slots in the array to get the appropriate slot). Updates do
not need to do anything else, so they are very efficient and can scale
perfectly with the number of cores as long as there are as many slots
as schedulers. One can configure the maximum number of slots in the
array of counters with the +dcg option.
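For example (the value 32 here is only illustrative):

$ erl +dcg 32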
To implement the ets:info(Table, size)
and ets:info(Table, memory)
operations, one also needs to read the current counter value. Reading
the current counter value can be implemented by taking the sum of the
values in the counter array. However, if this summation is done
concurrently with updates to the array of counters, we could get
strange results. For example, we could end up in a situation where
ets:info(Table, size)
returns a negative number, which is not
exactly what we want. On the other hand, we want to make counter
updates as fast as possible so having locks to protect the counters in
the counter array is not a good solution. We opted for a solution that
lets readers swap out the entire counter array and wait (using the
Erlang VM’s thread progress
system)
until no updates can occur in the swapped-out array before the sum is
calculated. The following example illustrates this approach:
[Step 1]
A thread is going to read the counter value.
[Step 2]
The reader starts by creating a new counter array.
[Step 3]
The pointer to the old counter array is changed to point to the new
one with the snapshot_ongoing
field set to true
. This
change can only be done when the snapshot_ongoing
field is set to
false
in the old counter array.
[Step 4]
Now, the reader has to wait until all other threads that will update a counter in the old array have completed their updates. As mentioned, this can be done using the Erlang VM’s thread progress system. After that, the reader can safely calculate the sum of counters in the old counter array (the sum is 1406). The calculated sum is also given to the process that requested the count so that it can continue execution.
[Step 5]
The read operation is not done, even though we have successfully calculated a count. The calculated sum from the old array must be added to the new array so that nothing is lost.
[Step 6]
Finally, the snapshot_ongoing
field in the new counter array is
set to false
so that other read operations can swap out the new
counter array.
Now, you should have got a basic understanding of how ETS’ decentralized counters work. You are also welcome to look at the source code in erl_flxctr.c and erl_flxctr.h if you are interested in details of the implementation.
As you can imagine, reading the value of a decentralized counter with, for example, ets:info(Table, size) is extremely slow compared to reading a centralized counter. Fortunately, most of the time spent reading the value of a decentralized counter is spent waiting for the thread progress system to report that it is safe to read the swapped-out array, and the read operation does not block any scheduler and does not consume any CPU time during this wait. On the other hand, the decentralized counter can be updated in a very efficient and scalable way, so decentralized counters are most likely preferable if you rarely need to get the size of, and the memory consumed by, your shared ETS table.
This blog post has described the implementation of the decentralized
counter option for ETS tables. ETS tables with decentralized counters
scale much better with the number of cores than ETS tables with
centralized counters. However, as decentralized counters make
ets:info(Table, size)
and ets:info(Table, memory)
very slow, one
should not use them if any of these two operations need to be
performed frequently.
Erlang/OTP 24 includes contributions from 60+ external contributors totalling 1400+ commits, 300+ PRs and changing 0.5 million(!) lines of code. Though I’m not sure the line count should count, as we vendored all of AsmJit and re-generated the wxWidgets support. If we ignore AsmJit and wx, there are still 260k lines of code added and 320k lines removed, which is about 100k more than what our releases normally contain.
You can download the readme describing the changes here: Erlang/OTP 24 Readme. Or, as always, look at the release notes of the application you are interested in. For instance here: Erlang/OTP 24 - Erts Release Notes - Version 12.0.
This year’s highlights are:
The most anticipated feature of Erlang/OTP 24 has to be the JIT compiler. A lot has already been said about it:
and even before the release, the WhatsApp team showed what it is capable of.
However, besides the performance gains that the JIT brings, what I am the most excited about is the benefits that come with running native code instead of interpreting. What I’m talking about is the native code tooling that now becomes available to all Erlang programmers, such as integration with perf.
As an example, when building a dialyzer plt of a small core of Erlang, the previous way to profile would be via something like eprof.
> eprof:profile(fun() ->
dialyzer:run([{analysis_type,'plt_build'},{apps,[erts]}])
end).
This increases the time to build the PLT from about 1.2 seconds to 15 seconds on
my system. In the end, you get something like the below that will guide you to
what you need to optimize. Maybe take a look at erl_types:t_has_var*/1
and check if you really need to call it 13-15 million times!
> eprof:analyze(total).
FUNCTION CALLS % TIME [uS / CALLS]
-------- ----- ------- ---- [----------]
erl_types:t_sup1/2 2744805 1.68 752795 [ 0.27]
erl_types:t_subst/2 2803211 1.92 858180 [ 0.31]
erl_types:t_limit_k/2 3783173 2.04 913217 [ 0.24]
maps:find/2 4798032 2.14 957223 [ 0.20]
erl_types:t_has_var/1 15943238 5.89 2634428 [ 0.17]
erl_types:t_has_var_list/1 13736485 7.51 3360309 [ 0.24]
------------------------ --------- ------- -------- [----------]
Total: 174708211 100.00% 44719837 [ 0.26]
In Erlang/OTP 24 we can get the same result without having to pay the pretty steep cost of profiling with eprof. When running the same analysis as above using perf it takes roughly 1.3 seconds to run.
$ ERL_FLAGS="+JPperf true" perf record dialyzer --build_plt \
--apps erts
Then we can use tools such as perf report, hotspot or speedscope to analyze the results.
$ hotspot perf.data
In the above, we can see that we get roughly the same result as when using
eprof
, though interestingly not exactly the same. I’ll leave the whys of
this up to the reader to find out :)
With this little overhead when profiling, we can run scenarios that previously would take too long to run when profiling. For those brave enough it might even be possible to run always-on profiling in production!
The journey with what can be done with perf has only started. In PR-4676 we will be adding frame pointer support, which will give much more accurate call frames when profiling and, in the end, the goal is to have mappings to Erlang source code lines instead of only functions when using perf report and hotspot to analyze a perf recording.
Erlang’s error messages tend to get a lot of (valid) criticism for being hard to understand. Two great new features have been added to help the user understand why something has failed.
Thanks to the work of Richard Carlsson and Hans Bolinder, when you compile
Erlang code you now get the line and column of errors and warnings printed in
the shell together with a ^
-sign showing exactly where the error
actually was. For example, if you compile the below:
foo(A, B) ->
#{ a => A, b := B }.
you would in Erlang/OTP 23 and earlier get:
$ erlc t.erl
t.erl:6: only association operators '=>' are allowed in map construction
but in Erlang/OTP 24 you now also get the following printout:
$ erlc t.erl
t.erl:6:16: only association operators '=>' are allowed in map construction
% 6| #{ a => A, b := B }.
% | ^
This behavior also extends into most of the Erlang code editors so that when you use VSCode or Emacs through Erlang LS or flycheck you also get a narrower warning/error indicator, for example in Emacs using Erlang LS.
One of the other big changes when it comes to error information is the introduction of EEP-54. In the past many of the BIFs (built-in functions) would give very cryptic error messages:
1> element({a,b,c}, 1).
** exception error: bad argument
in function element/2
called as element({a,b,c},1)
In the example above, the only thing we know is that one or more of the
arguments are invalid, but without checking
the documentation
there is no way of knowing which one and why. This is especially a problem for
BIFs where the arguments may fail for different reasons depending on factors not
visible in the arguments. For example in the ets:update_counter
call below:
> ets:update_counter(table, k, 1).
** exception error: bad argument
in function ets:update_counter/3
called as ets:update_counter(table,k,1)
We don’t know if the call failed because the table did not exist at all
or if the key k
that we wanted to update did not exist in the table.
In Erlang/OTP 24 both of the examples above will have much clearer error messages.
1> element({a,b,c}, 1).
** exception error: bad argument
in function element/2
called as element({a,b,c},1)
*** argument 1: not an integer
*** argument 2: not a tuple
2> ets:new(table,[named_table]).
table
3> ets:update_counter(table, k, 1).
** exception error: bad argument
in function ets:update_counter/3
called as ets:update_counter(table,k,1)
*** argument 2: not a key that exists in the table
That looks much better and now we can see what the problem was! The standard logging formatters also include the additional information so that if this type of error happens in a production environment you will get the extra error information:
1> proc_lib:spawn(fun() -> ets:update_counter(table, k, 1) end).
<0.94.0>
=CRASH REPORT==== 10-May-2021::11:20:35.367023 ===
crasher:
initial call: erl_eval:'-expr/5-fun-3-'/0
pid: <0.94.0>
registered_name: []
exception error: bad argument
in function ets:update_counter/3
called as ets:update_counter(table,k,1)
*** argument 1: the table identifier does
not refer to an existing ETS table
ancestors: [<0.92.0>]
EEP-54 is not only useful for error messages coming from BIFs but can be used
by any application that wants to provide extra information about their exceptions.
For example, we have been working on providing better error information around
io:format
in PR-4757.
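As a hedged sketch of how a library could use this (the module and messages below are made up): an exception raised with erlang:error/3 can carry an error_info map, and the named module's format_error/2 callback is then called to explain each argument:
-module(my_db).
-export([lookup/2, format_error/2]).

lookup(Table, Key) when is_atom(Table) ->
    {Table, Key};  %% the real work is elided in this sketch
lookup(Table, Key) ->
    %% Attach EEP-54 error information to the exception; the shell
    %% will call my_db:format_error/2 to render the extra details.
    erlang:error(badarg, [Table, Key], [{error_info, #{module => ?MODULE}}]).

%% Maps argument positions to human-readable explanations.
format_error(badarg, [{_M, _F, _Args, _Info} | _StackTrace]) ->
    #{1 => "not an atom", general => "invalid table name"}.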
Since Erlang/OTP R14 (released in 2010), the Erlang compiler and run-time system
have co-operated to optimize for the pattern of code used by
gen_server:call-like functionality to avoid scanning a potentially
huge mailbox. The basic pattern looks like this:
call(To, Msg) ->
    Ref = make_ref(),
    To ! {call, Ref, self(), Msg},
    receive
        {reply, Ref, Reply} -> Reply
    end.
From this, the compiler can figure out that when Ref is created, no message
already in the mailbox can contain Ref, and can therefore skip all of those
messages when receiving the Reply.
This has always worked great in simple scenarios like this, but as soon as you had to make the scenarios a little more complex it tended to break the compiler’s analysis and you would end up scanning the entire mailbox. For example, in the code below Erlang/OTP 23 will not optimize the receive.
call(To, Msg, Async) ->
    Ref = make_ref(),
    To ! {call, Ref, self(), Msg},
    if
        Async ->
            {ok, Ref};
        not Async ->
            receive
                {reply, Ref, Reply} -> Reply
            end
    end.
That all changes with Erlang/OTP 24! Many more complex scenarios are now covered by the optimization and a new compiler flag has been added to tell the user if an optimization is done.
$ erlc +recv_opt_info test.erl
test.erl:6: Warning: OPTIMIZED: reference used to mark a message queue position
% 6|   Ref = make_ref(),
test.erl:12: Warning: OPTIMIZED: all clauses match reference created by make_ref/0 at test.erl:6
% 12|   receive
Even patterns such as multi_call are now optimized to not scan the mailbox of the process.
multi_call(ToList, Msg) ->
    %% OPTIMIZED: reference used to mark a message queue position
    Ref = make_ref(),
    %% INFO: passing reference created by make_ref/0 at test.erl:18
    [To ! {call, Ref, self(), Msg} || To <- ToList],
    %% INFO: passing reference created by make_ref/0 at test.erl:18
    %% OPTIMIZED: all clauses match reference in function parameter 2
    [receive {reply, Ref, Reply} -> Reply end || _ <- ToList].
There are still a lot of places where this optimization does not trigger. For instance, as soon as the make_ref, send, and receive are in different modules, it will not work. However, the new improvements in Erlang/OTP 24 cover a lot more scenarios, and now we also have the tools to check whether the optimization is triggered!
You can read more about this optimization and others in the Efficiency Guide.
When doing a call to another Erlang process, the pattern used by
gen_server:call
, gen_statem:call
and others normally looks something
like this:
call(To, Msg, Tmo) ->
    MonRef = erlang:monitor(process, To),
    To ! {call, MonRef, self(), Msg},
    receive
        {'DOWN', MonRef, _, _, Reason} ->
            {error, Reason};
        {reply, MonRef, Reply} ->
            erlang:demonitor(MonRef, [flush]),
            {ok, Reply}
    after Tmo ->
        erlang:demonitor(MonRef, [flush]),
        {error, timeout}
    end.
This normally works well except when a timeout happens. When a timeout happens, the process on the other end has no way of knowing that the reply is no longer needed, so it will send it anyway when it is done. This causes all kinds of problems, as the user of a third-party library can never know which messages to expect in the mailbox.
There have been numerous attempts to solve this problem using the primitives
that Erlang gives you, but in the end, most ended up just adding a handle_info
in their gen_servers that ignored any unknown messages.
In Erlang/OTP 24, EEP-53 has introduced the alias
functionality to solve this problem.
An alias is a temporary reference to a process that can be used
to send messages to. In most respects, it works just like a PID, except that
the lifetime of an alias is not tied to the lifetime of the process it
represents. So when you try to send a late reply to an alias that has been
deactivated, the message will just be dropped.
The code changes needed to make this happen are very small and are already used
behind the scenes in all the standard behaviors of Erlang/OTP. The only things
that need to change in the example code above are that a new option must be
given to erlang:monitor and that the reply reference should now be the alias
instead of the calling PID. That is, like this:
call(To, Msg, Tmo) ->
    MonAlias = erlang:monitor(process, To, [{alias, demonitor}]),
    To ! {call, MonAlias, MonAlias, Msg},
    receive
        {'DOWN', MonAlias, _, _, Reason} ->
            {error, Reason};
        {reply, MonAlias, Reply} ->
            erlang:demonitor(MonAlias, [flush]),
            {ok, Reply}
    after Tmo ->
        erlang:demonitor(MonAlias, [flush]),
        {error, timeout}
    end.
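An alias can also be created and deactivated explicitly, without a monitor. A minimal shell sketch (the printed reference is illustrative):
1> Alias = erlang:alias().
#Ref<0.2959644163.2576220161.69673>
2> Alias ! hello, flush().
Shell got hello
ok
3> erlang:unalias(Alias).
true
4> Alias ! too_late, flush().
ok
The second send is simply dropped, since the alias has been deactivated by unalias/1.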
You can read more about this functionality in the alias documentation.
In Erlang/OTP 23 erl_docgen was extended to be able to emit EEP-48 style
documentation. This allowed the documentation to be used by h(lists)
in
the Erlang shell and external tools such as Erlang LS. However, there
are very few applications outside Erlang/OTP that use erl_docgen
to
create documentation, so EEP-48 style documentation was unavailable to
those applications. Until now!
Radek Szymczyszyn has added support for EEP-48 into edoc which means
that from Erlang/OTP 24 you can view both the documentation of lists:foldl/3
and recon:info/1
.
$ rebar3 as docs shell
Erlang/OTP 24 [erts-12.0] [source] [jit]
Eshell V12.0 (abort with ^G)
1> h(recon,info,1).
-spec info(PidTerm) ->
[{info_type(), [{info_key(), Value}]}, ...]
when PidTerm :: pid_term().
Allows to be similar to erlang:process_info/1, but excludes
fields such as the mailbox, which tend to grow
and be unsafe when called in production systems. Also includes
a few more fields than what is usually given (monitors,
monitored_by, etc.), and separates the fields in a more
readable format based on the type of information contained.
For more information about how to enable this in your project see the Doc chunks section in the Edoc User’s Guide.
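For a rebar3 project, enabling doc chunks could look something like the sketch below in rebar.config (the docs profile matches the shell example above; the exact options for your project may differ):
{profiles,
 [{docs,
   [{edoc_opts,
     [{doclet, edoc_doclet_chunks},
      {layout, edoc_layout_chunks},
      {preprocess, true}]}]}]}.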
socket support in gen_tcp
The gen_tcp module has gotten support for optionally using the new socket
nif API instead of the previous inet driver. The new interface can be configured
to be used either on a system level through setting the application
configuration parameter like this: -kernel inet_backend socket, or on a
per-connection basis like this: gen_tcp:connect(localhost,8080,[{inet_backend,socket}]).
If you do this you will notice that the Socket returned by gen_tcp is no
longer a port but instead a tuple containing (among other things) a PID and a
reference.
1> gen_tcp:connect(localhost,8080,[{inet_backend,socket}]).
{ok,{'$inet',gen_tcp_socket,
{<0.88.0>,{'$socket',#Ref<0.2959644163.2576220161.68602>}}}}
This data structure is and always has been opaque, and therefore should not be inspected directly but instead only used as an argument to other gen_tcp and inet functions.
You can then use inet:i/0 to get a listing of all open sockets in the system:
2> inet:i().
Port       Module          Recv  Sent  Owner     Local Address    Foreign Address     State  Type
esock[19]  gen_tcp_socket  0     0     <0.98.0>  localhost:44082  localhost:http-alt  CD:SD  STREAM
The gen_tcp API should be completely backward compatible with the old implementation, so if you can, please test it and report any bugs that you find back to us.
Why should you want to test this? Because in some of our benchmarks, we get up to 4 times the throughput vs the old implementation. In others, there is no difference or even a loss of throughput. So, as always, you need to measure and check for yourself!
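If you want to try it out, a minimal sketch could look like this (the port number and options are illustrative; note that inet_backend is placed first in the option list):
%% Accept one connection on the socket backend and echo one packet back.
{ok, L} = gen_tcp:listen(8080, [{inet_backend, socket}, binary, {active, false}]),
{ok, S} = gen_tcp:accept(L),
{ok, Data} = gen_tcp:recv(S, 0),
ok = gen_tcp:send(S, Data),
ok = gen_tcp:close(S).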
When creating supervisor hierarchies for applications that manage connections such as ssl or ssh, there are times when there is a need for terminating that supervisor hierarchy from within. Some event happens on the socket that should trigger a graceful shutdown of the processes associated with the connection.
Normally this would be done by using supervisor:terminate_child/2. However, this has two problems: the child needs to know about the supervisor hierarchy it is part of, and a child terminating its own supervisor this way risks deadlocking itself during termination.
To solve this problem EEP-56 has added a mechanism in which a child can be marked as significant and if such a child terminates, it can trigger an automatic shutdown of the supervisor that it is part of.
This way a child process can trigger the shutdown of a supervisor hierarchy from within, without having to know anything about the supervisor hierarchy and without risking deadlocking itself during termination.
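A hedged sketch of what this can look like (module and child names are made up): the supervisor flag auto_shutdown is combined with a child spec marked as significant:
-module(conn_sup).
-behaviour(supervisor).
-export([start_link/1, init/1]).

start_link(Socket) ->
    supervisor:start_link(?MODULE, Socket).

init(Socket) ->
    %% any_significant: shut the supervisor down as soon as any
    %% significant child terminates.
    SupFlags = #{strategy => one_for_all,
                 auto_shutdown => any_significant},
    Conn = #{id => connection,
             start => {my_connection, start_link, [Socket]},
             %% significant children must be transient or temporary
             restart => transient,
             significant => true},
    {ok, {SupFlags, [Conn]}}.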
You can read more about automatic shutdown in the supervisor documentation.
With Erlang/OTP 24 comes support for the Edwards-curve Digital Signature Algorithm
(EdDSA). EdDSA can be used when connecting to or acting as a TLS 1.3
client/server.
EdDSA is an elliptic curve signature algorithm, related to ECDSA,
that can be used for secure communication. The security of ECDSA relies on a
cryptographically strong random number for each signature, which can cause issues
when that random number is by mistake not strong enough, as has been the case in
several real-world uses of ECDSA (none of them in Erlang as far as we know :).
EdDSA
does not rely on a strong random number to be secure. This means that
when you are using EdDSA
, the communication is secure even if your random
number generator is not.
Despite the added security, EdDSA
is claimed to be faster than other elliptic
curve signature algorithms. If you have OpenSSL 1.1.1 or later, then as of
Erlang/OTP 24 you will have access to this algorithm!
> crypto:supports(curves).
[...
c2tnb359v1, c2tnb431r1, ed25519, ed448, ipsec3, ipsec4
...]                    ^        ^
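Beyond TLS, the algorithm can also be used directly through the crypto module. A small shell sketch (the key and signature binaries are elided here):
1> {Pub, Priv} = crypto:generate_key(eddsa, ed25519).
{<<...>>,<<...>>}
2> Sig = crypto:sign(eddsa, none, <<"hello">>, [Priv, ed25519]).
<<...>>
3> crypto:verify(eddsa, none, <<"hello">>, Sig, [Pub, ed25519]).
true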
Erlang processes communicate with each other by sending signals
(not to be confused with Unix signals). There are many different kinds and
messages are just the most common. Practically everything involving more than
one process uses signals internally: for example, the link/1
function is
implemented by having the involved processes talk back and forth until they’ve
agreed on a link.
This helps us avoid a great deal of locks and would make an interesting blog post on its own, but for now we only need to keep two things in mind: all signals (including messages) are continuously received and handled behind the scenes, and they have a defined order:
Signals between two processes are guaranteed to arrive in the order they were
sent. In other words, if process A
sends signal 1
and then 2
to process
B
, signal 1
is guaranteed to arrive before signal 2
.
Why is this important? Consider the request-response idiom:
%% Send a monitor signal to `Pid`, requesting a 'DOWN' message
%% when `Pid` dies.
Mref = monitor(process, Pid),
%% Send a message signal to `Pid` with our `Request`
Pid ! {self(), Mref, Request},
receive
    {Mref, Response} ->
        %% Send a demonitor signal to `Pid`, and remove the
        %% corresponding 'DOWN' message that might have
        %% arrived in the meantime.
        erlang:demonitor(Mref, [flush]),
        {ok, Response};
    {'DOWN', Mref, _, _, Reason} ->
        {error, Reason}
end
Since dead processes cannot send messages we know that the response must come
before any eventual 'DOWN'
message, but without a guaranteed order the
'DOWN'
message could arrive before the response and we’d have no idea whether
a response was coming or not, which would be very annoying to deal with.
Having a defined order saves us quite a bit of hassle and doesn’t come at much of a cost, but the guarantees stop there. If more than one process sends signals to a common process, they can arrive in any order even when you “know” that one of the signals was sent first. For example, this sequence of events is legal and entirely possible:
A sends signal 1 to B
A sends signal 2 to C
C, in response to signal 2, sends signal 3 to B
B receives signal 3
B receives signal 1
Luckily, global orders are rarely needed and are easy to impose yourself (outside distributed cases): just let all involved parties synchronize with a common process.
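As a small sketch of that idea: if A and C from the example above both send to B through a common forwarding process instead of directly, B sees the messages in the order the forwarder handled them:
%% A minimal forwarder: every message routed through it reaches B in
%% the order the forwarder received it.
serializer(B) ->
    receive
        {fwd, Msg} ->
            B ! Msg,
            serializer(B)
    end.
A and C would then send {fwd, Signal} to the serializer instead of messaging B directly.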
Sending a message is straightforward: we try to find the process associated with the process identifier, and if one exists we insert the message into its signal queue.
Messages are always copied before being inserted into the queue. As wasteful as this may sound it greatly reduces garbage collection (GC) latency as the GC never has to look beyond a single process. Non-copying implementations have been tried in the past, but they turned out to be a bad fit as low latency is more important than sheer throughput for the kind of soft-realtime systems that Erlang is designed to build.
By default, messages are copied directly into the receiving process’ heap but when this isn’t possible (or desired – see the message_queue_data flag) we allocate the message outside of the heap instead.
Memory allocation makes such “off-heap” messages slightly more expensive but
they’re very neat for processes that receive a ton of messages. We don’t need
to interact with the receiver when copying the message – only when adding it
to the queue – and since the only way a process can see a message is by
matching them in a receive
expression, the GC doesn’t need to consider
unmatched messages which further reduces latency.
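Selecting off-heap message storage is done per process through the standard flags; a minimal sketch (loop/0 is a placeholder for the process body):
%% At spawn time:
Pid = spawn_opt(fun() -> loop() end, [{message_queue_data, off_heap}]),
%% ...or from within an already running process:
OldSetting = process_flag(message_queue_data, off_heap).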
Sending messages to processes on other Erlang nodes works in the same way, albeit there’s now a risk of messages being lost in transit. Messages are guaranteed to be delivered as long as the distribution link between the nodes is active, but it gets tricky when the link goes down.
Using monitor/2
on the remote process (or node) will tell you when this
happens, acting as if the process died (with reason noconnection
), but that
doesn’t always help: the link could have died after the message was received
and handled on the other end; all we know is that the link went down before
we got any eventual response.
As with everything else there’s no free lunch, and you need to decide how your applications should handle these scenarios.
One might guess that processes receive messages through receive
expressions,
but receive
is a bit of a misnomer. As with all other signals the process
continuously handles them in the background, moving received messages from the
signal queue to the message queue.
receive
searches for matching messages in the message queue (in the order
they arrived), or waits for new messages if none were found. Searching through
the message queue rather than the signal queue means it doesn’t have to worry
about processes that send messages, which greatly increases performance.
This ability to “selectively receive” specific messages is very convenient: we’re not always in a context where we can decide what to do with a message and having to manually lug around all unhandled messages is certainly annoying.
Unfortunately, sweeping the search under the rug doesn’t make it go away:
receive
    {reply, Result} ->
        {ok, Result}
end
The above expression finishes instantly if the next message in the queue
matches {reply, Result}
, but if there’s no matching message it has to walk
through them all before giving up. This is expensive when there are a lot of
messages queued up which is common for server-like processes, and since
receive
expressions can match on just about anything there’s little that can
be done to optimize the search itself.
The only optimization we do at the moment is to mark a starting point for the search when we know that a message couldn’t exist prior to a certain point. Let’s revisit the request-response idiom:
Mref = monitor(process, Pid),
Pid ! {self(), Mref, Request},
receive
    {Mref, Response} ->
        erlang:demonitor(Mref, [flush]),
        {ok, Response};
    {'DOWN', Mref, _, _, Reason} ->
        {error, Reason}
end
Since the reference created by monitor/2
is globally unique and cannot exist
before said call, and the receive
only matches messages that contain said
reference, we don’t need to look at any of the messages received before then.
This makes the idiom efficient even on processes that have absurdly long message queues, but unfortunately it isn’t something we can do in the general case. While you as a programmer can be sure that a certain response must come after its request even without a reference, for example by using your own sequence numbers, the compiler can’t read your intent and has to assume that you want any message that matches.
Figuring out whether the above optimization has kicked in is rather annoying at the moment. It requires inspecting BEAM assembly and even then you’re not guaranteed that it will work due to some annoying limitations:
A process that creates more than one reference and then enters a receive
with the first reference will end up searching through the entire message queue.
The creation of the reference and the receive need to be next to each other,
and you can't have multiple functions calling a common receive helper.
We've addressed these shortcomings in the upcoming OTP 24 release, and have added a compiler option to help you spot where it's applied:
$ erlc +recv_opt_info example.erl
-module(example).
-export([t/2]).

t(Pid, Request) ->
    %% example.erl:5: OPTIMIZED: reference used to mark a
    %%                message queue position
    Mref = monitor(process, Pid),
    Pid ! {self(), Mref, Request},
    %% example.erl:7: INFO: passing reference created by
    %%                monitor/2 at example.erl:5
    await_result(Mref).

await_result(Mref) ->
    %% example.erl:10: OPTIMIZED: all clauses match reference
    %%                 in function parameter 1
    receive
        {Mref, Response} ->
            erlang:demonitor(Mref, [flush]),
            {ok, Response};
        {'DOWN', Mref, _, _, Reason} ->
            {error, Reason}
    end.
The first version of Erlang was implemented in Prolog in 1986. That version of Erlang was too slow for creating real applications, but it was useful for finding out which features of the language were useful and which were not. New language features could be added or deleted in a matter of hours or days.
It soon became clear that Erlang needed to be at least 40 times faster to be useful in real projects.
In 1989 JAM (Joe’s Abstract Machine) was first implemented. Mike Williams wrote the runtime system in C, Joe Armstrong wrote the compiler, and Robert Virding wrote the libraries.
JAM was 70 times faster than the Prolog interpreter, but it turned out that this still wasn’t fast enough.
Bogumil (“Bogdan”) Hausman created TEAM (Turbo Erlang Abstract Machine). It compiled the Erlang code to C code, which was then compiled to native code using GCC.
It was significantly faster than JAM for small projects. Unfortunately, compilation was very slow, and the code size of the compiled code was too big to make it useful for large projects.
Bogumil Hausman’s next machine was called BEAM (Bogdan’s Erlang Abstract Machine). It was a hybrid machine that could execute both native code (translated via C) and threaded code with an interpreter. That allowed customers to compile their time-critical modules to native code and all other modules to threaded BEAM code. The threaded BEAM in itself was faster than JAM code.
The modern BEAM only has the interpreter. The ability of BEAM to generate C code was dropped in OTP R4. Why?
C is not a suitable target language for an Erlang compiler. The main reason is that an Erlang function can’t simply be translated to a C function because of Erlang’s process model. Each Erlang process must have its own stack and that stack cannot be automatically managed by the C compiler.
BEAM/C generated a single C function for each Erlang module. Local
calls within the module were made by explicitly pushing the return
address to the Erlang stack followed by a goto
to the label of the
called function. (Strictly speaking, the calling function stores the
return address in a BEAM register and the called function pushes that
register to the stack.)
Calls to other modules were done similarly by using the GCC extension
that makes it possible to take the address of a label
and later jump to it. Thus an external call was made by pushing
the return address to the stack followed by a goto
to the address of
a label in another C function.
Isn’t that undefined behavior?
Yes, it is undefined behavior even in GCC. It happened to work with GCC on Sparc, but not with GCC on X86. A further complication was that some embedded systems had ANSI-C compilers without any GCC extensions.
Because of that, we had to maintain three distinct flavors of BEAM/C to handle different C compilers and platforms. I don’t remember any benchmarks from that time, but it is unlikely that BEAM/C was faster than interpreted BEAM on any other platform than Solaris on Sparc.
In the end, we removed BEAM/C and optimized the interpreted BEAM so that it could beat BEAM/C in speed.
HiPE (The High-Performance Erlang Project) was a research project at Uppsala University running for many years starting around 1996. It was “aimed at efficiently implementing concurrent programming systems using message-passing in general and the concurrent functional language Erlang in particular”.
One of the many outcomes of the project was the HiPE native code compiler for Erlang. HiPE became a part of the OTP distribution in OTP R8 in 2001. The HiPE native compiler is written in Erlang and translates the BEAM code to native code without the help of a C compiler, therefore avoiding many of the problems that BEAM/C ran into.
The HiPE native compiler can often speed up sequential code by a factor of two or three compared to interpreted BEAM code. We hoped that it would speed up huge real-world application systems. Unfortunately, projects within Ericsson that tried HiPE found that it did not improve performance.
Why is that?
The main reason is probably that most huge Erlang applications don’t contain enough sequential code that HiPE could optimize. The runtime of those systems is typically dominated by some combination of message passing, calls to the ETS BIFs, and garbage collection, none of which HiPE can optimize.
Another reason could be that big systems typically have many small modules. The HiPE native compiler (in common with the Erlang compiler) cannot optimize code across module boundaries, and is thus unable to do much type-based optimization.
Also, for most big systems, compiling all Erlang modules to native code would lead to impractically long build times, and the resulting code would consume too much memory. There is a small overhead when switching from native code to interpreted BEAM and vice versa. It is a non-trivial task to figure out which modules would gain from being compiled to native code while avoiding an excessive number of context switches between native and interpreted code.
Because none of the Ericsson Erlang projects used the HiPE native compiler, the OTP team could only afford to spend a limited amount of time maintaining HiPE. Therefore, the documentation for HiPE includes this note:
HiPE and execution of HiPE compiled code only have limited support by the OTP team at Ericsson. The OTP team only does limited maintenance of HiPE and does not actively develop HiPE. HiPE is mainly supported by the HiPE team at Uppsala University.
I think it is fair to say that Erlang/OTP would look very different today if it hadn't been for the HiPE project. Here are the major contributions from the HiPE project to OTP:
A new staged tag scheme in OTP R7. The new tag scheme allowed the Erlang system to address the full 4GB address space (the previous tag scheme only supported addressing the lower 1 GB). Surprisingly, the new tag scheme also improved performance.
The Core Erlang intermediate representation is used in the Erlang compiler to this day. For more information, see An introduction to Core Erlang and Core Erlang by Example.
Dialyzer (DIscrepancy AnaLYZer for ERlang programs) started out as a type analysis pass for the HiPE native compiler, but soon became a tool for Erlang programmers to help find bugs and unreachable code in their applications.
Introducing try…catch in OTP R10.
Implementing per-function counters and the cprof module. The counters were originally meant to be used for finding hot functions and generating native code only for those. But the overhead of the context switch between interpreted and native code made this usage less useful.
Repeatedly suggesting that Erlang needed a literal pool for premade literal terms (instead of constructing them each time they are used). At one of our meetings between the HiPE team and the OTP team, I remember Richard Carlsson pointing out to me that it would be nice for Wings3D to have floating-point literals. The OTP team implemented literal pools in OTP R12.
There have been three separate research projects that tried to develop a tracing JIT for Erlang. All of them have been led by Frej Drejhammar of RISE (formerly SICS).
A tracing JIT (Just In Time compiler) is a JIT that runs in two phases:
First it traces execution to find sequences of hot (frequently executed) code.
It then rewrites the found traces to native code.
The goals for the three JIT projects were:
The JIT should work automatically with no need for the user to identify which modules to compile to native code beforehand.
There should be total feature compatibility with the non-JIT BEAM. In particular, tracing, scheduling behavior, save calls, and hot code reloading should continue to work, and stack traces should be identical to the ones in the non-JIT BEAM.
The system should at least on average never be slower than the non-JIT BEAM.
There were some promising results when running some benchmarks, but ultimately it turned out to be impossible to fulfill the goal to never be slower than the non-JIT system. Here are the main reasons for the slowdowns:
To do the tracing (finding hot code), the BEAM interpreter needed tweaking. It was difficult to be able to do tracing without lowering the base speed of the BEAM interpreter.
It was also difficult to design the mechanism for context switching between the interpreted code and native code in a way that didn’t lower the base speed of the BEAM interpreter.
When a hot sequence of code has been found, the code needed to be compiled to native code. The compilation, that used LLVM, was slow.
When a hot sequence had finally been converted to native code, it could turn out that it would not be executed again. That was particularly a problem for the Erlang compiler that runs many passes. Typically, when some of the code for one pass had been converted to native code, the compiler was already running the next pass.
The later projects mitigated some of the issues in the previous projects. For example, the compilation time was reduced by doing more optimizations before invoking LLVM. Ultimately, though, it was decided to terminate the third and final tracing JIT project at the end of 2019.
For more information about BEAMJIT, see:
After the end of the third tracing JIT project, Lukas Larsson, having been involved in the last two tracing JIT projects, could not stop thinking about different approaches that might lead to a useful JIT. The things that slowed down the previous approaches were the tracing to find hot code and the generation of optimized native code using LLVM. Would it be possible to have a simpler JIT that didn’t do tracing and did no or little optimization?
In January 2020, salvaging some code from the third tracing JIT project, Lukas quickly built a prototype BEAM system that translated each BEAM instruction at load time to native code. The resulting code was less optimized than LLVM-generated code because it would still use BEAM’s stack and X registers (stored in memory), but the overhead for instruction unpacking and instruction dispatch was eliminated.
The initial benchmark results were promising: about twice as fast compared to interpreted BEAM code, so Lukas extended the prototype so that it could handle more kinds of BEAM instructions.
John Högberg quickly became interested in the project and started to act as a sounding board. Some time later, probably in March, John suggested that the new JIT should translate all loaded code to native code. That way, there would be no need to support context switching between the BEAM interpreter and native code, which would make the design simpler and eliminate the cost for context switches.
That was a gamble, of course. After all, it could turn out that the native code was too large to be practically useful, or that it decreased performance because it fitted badly in the code cache. They decided that it was worth taking the risk and that it would probably be possible to optimize the size of the code later. (Spoiler: At the time of writing, the native code generated by the JIT is about 10 percent larger than interpreted BEAM code.)
Another change to the design was the tooling for generating the native code. In Lukas’s prototype, the native code template for each instruction was contained in text files similar to the other files used by the loader. That was inflexible, so it was decided to use some library that could generate native code. While some pure C libraries could have been used, the C++ library AsmJIT was more convenient in practical use than any of the C libraries. Also, some C libraries were excluded because they used a GNU license, which we can’t use in OTP. Therefore the part of the loader that translates BEAM instructions to native code needed to be written in C++, but the rest of the runtime system is still pure C code and will remain so.
John joined the practical work on the rejigged JIT project at the end of March.
On April 7, 2020, John reached the “prompt beer” milestone.
When the Erlang system is started, a surprisingly large amount of code is executed before the prompt appears. On the one hand, that means that the translation of many instructions needs to be implemented before it would be possible to even start the Erlang system, let alone run any test suites or benchmarks.
On the other hand, when the prompt finally appears, it is a major milestone worth celebrating with some prompt beer or other appropriate beverage or by taking the rest of the evening off.
On April 14 John got Dialyzer running with the JIT, and on April 17, after some improvements to the code generation, Dialyzer was only about 10 percent slower with the JIT than with HiPE. None of the tracing JITs had had any success in speeding up Dialyzer. (At the time of writing, Dialyzer runs roughly as fast with the JIT as it did with HiPE, although it has become increasingly difficult to do a fair comparison since HiPE doesn’t work beyond OTP 23.)
It was probably at that point we realized that we had a JIT that could finally be included in an OTP release.
The next major milestone was reached on May 6 when line numbers in stack traces were implemented. That meant that many more test cases now succeeded.
Soon after that, all test suites could be run successfully. During the summer and early fall Dan and I joined the project part-time and the following was done:
A major refactoring of the BEAM loader so that as much code as possible could be shared between the JIT and the BEAM interpreter. (The BEAM interpreter is only used on platforms that don’t support the JIT.)
Implementation and polishing of important but less used features
such as tracing, perf support, and save calls (see the
save_calls flag for process_flag/2).
Shrinking of the code size of the generated native code.
Porting the JIT to Windows, which turned out to be relatively easy.
Making it possible to use the native stack pointer register and stack manipulation instructions. That improved perf support and slightly reduced the size of the native code.
The work culminated in a public pull request that Lukas created during his presentation of the new JIT on September 11.
The pull request was merged on September 22.
Here are a few of the improvements that we have been thinking of for future releases:
Supporting ARM-64 (used by Raspberry Pi and Apple’s new Macs with Apple Silicon).
Implementing type-guided generation of native code. The new SSA-based compiler passes introduced in OTP 22 do a sophisticated type analysis. Frustratingly, not all of that type information can be leveraged to generate better code for the interpreted BEAM. We plan to modify the compiler so that some of the type information is included in the BEAM files and can then be used by the JIT during code generation.
Introducing new instructions for binary matching and/or construction to help the JIT generate better code.