Some surprising Erlang behavior

Erlang is in many ways a fantastic language. Its syntax is a little foreign at first to someone like me, coming from a background in C and C++, but the language and the runtime environment have some really nice qualities. When I began working with it, I quickly acquired an appreciation for it. I won’t get too deep into the details of the language and why I like it, but I will start with a couple of key features relevant to this discussion.

First, work in an Erlang application is done by Erlang processes. Each process is a lightweight thread with an identifier called a pid, and each has an input queue of messages. A process generally operates by receiving a message from its queue, doing some work to handle it, and then waiting for the next message to arrive. Some of that work may involve generating and sending messages to other processes using their pids. Under normal circumstances, sending a message from one process to another is asynchronous: the send operation does not block waiting for the receiving process to handle the message, or even necessarily for it to be added to the receiving process’s queue.
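
As a minimal illustration (a shell-style sketch; the process here just prints whatever it receives):

%% Spawn a process that waits for a single message and prints it.
Pid = spawn(fun() ->
          receive
              Msg -> io:format("received ~p~n", [Msg])
          end
      end),

%% The send returns immediately; it does not wait for Pid to dequeue
%% or handle the message.
Pid ! hello.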

Second, an Erlang node is an instance of the Erlang runtime system, an OS-level process. Erlang nodes can easily be joined together so that a single Erlang application spans many physical hosts. Once a node has joined an Erlang cluster, it is trivial for processes on different hosts to communicate with one another. Sending a message to a remote process uses the pid of the remote process, just as with a local process. This provides Location Transparency, one of the nice features of Erlang.
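
For example, given two hypothetical nodes a@host1 and b@host2 started with the same cookie, a process on a@host1 could do the following (some_server is a made-up registered process on the remote node):

%% Join the cluster; pong means the connection succeeded.
pong = net_adm:ping('b@host2'),

%% Fetch a remote pid (via rpc, just for illustration) and send to it
%% with exactly the same syntax as a local send.
RemotePid = rpc:call('b@host2', erlang, whereis, [some_server]),
RemotePid ! {hello, self()}.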

Erlang has gotten a reputation as a very stable platform. Joe Armstrong, one of the original authors of Erlang, has famously claimed that the Ericsson AXD301 switch, developed using Erlang, achieved NINE nines of reliability.

Having heard this, I was extremely surprised to identify a case where an Erlang application doesn’t just perform poorly but comes to a screeching halt. The problem occurs when one of the nodes in an Erlang cluster suddenly goes network silent. This can happen for a variety of reasons: the host may have kernel panicked, it may have started swapping heavily, or a network cable connecting one of the physical hosts in the cluster may have been unplugged. When this condition occurs, messages sent to processes on the dark node are buffered up to a limit, but once that buffer fills, sending to the dark node goes from being asynchronous to being synchronous. Processes not sending messages to the failed node continue to work as normal, but any process that sends a message to the failed node will halt.
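
As far as I can tell, the buffer in question is the distribution send buffer, whose busy limit is tunable (in kilobytes) with the emulator flag +zdbbl. Raising it delays the onset of blocking but doesn’t prevent it:

$ erl -sname faucet -setcookie monster +zdbbl 8192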

The Erlang runtime monitors the other nodes to which the local node is attached. If a host where one of those nodes is running goes network silent, then after some period of time (defaulting to about 1 minute) the other nodes will decide that the silent node is down. The length of time before a non-responsive node is considered down is tunable via the Erlang kernel parameter net_ticktime. You might think that the processes waiting to send messages to processes on the down node would get unblocked once the target node is considered down, but that (mostly) doesn’t happen. It turns out that once the target node is considered down, blocked senders get unblocked only once every 7 seconds, likely tunable via the Erlang kernel parameter net_setuptime.
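
For example, to shorten the detection time from the default of 60 seconds to roughly 20, the parameter can be set on the command line when starting a node (net_kernel:set_net_ticktime/1 can also change it at runtime):

$ erl -sname faucet -setcookie monster -kernel net_ticktime 20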

I’ve come up with a fairly easy way to demonstrate this problem given two Linux VMs. Let’s call them faucet-vm and sink-vm. Both need Erlang installed.

On the sink-vm host, create sink.erl with these contents:

-module(sink).
-export([start/0, entry/0]).

start() ->
    erlang:register(?MODULE, erlang:spawn(?MODULE, entry, [])).

entry() ->
    loop(0).

loop(X) ->
    receive
        ping ->
            io:format("~p: ping ~p~n", [timeasfloat(), X]),
            loop(X + 1)
    end.

timeasfloat() ->
    {Mega, Sec, Micro} = os:timestamp(),
    ((Mega * 1000000) + Sec + (Micro * 0.000001)).

Now compile and start the sink process:

$ erlc sink.erl
$ erl -sname sink -setcookie monster
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:4:4] [async-threads:0] [kernel-poll:false]

Eshell V5.9.1  (abort with ^G)
(sink@sink-vm)1> sink:start().
true
(sink@sink-vm)2>

On the faucet-vm host, create faucet.erl with these contents:

-module(faucet).
-export([start/0, entry/0]).

start() ->
    erlang:spawn(?MODULE, entry, []).

entry() ->
    Sink = rpc:call('sink@sink-vm', erlang, whereis, [sink]),
    loop(Sink, 0).

loop(Sink, X) ->
    io:format("~p: sending ping ~p~n", [timeasfloat(), X]),
    Sink ! ping,
    loop(Sink, X + 1).

timeasfloat() ->
    {Mega, Sec, Micro} = os:timestamp(),
    ((Mega * 1000000) + Sec + (Micro * 0.000001)).

Now compile and start the faucet process:

$ erlc faucet.erl
$ erl -sname faucet -setcookie monster
Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:2:2] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.1  (abort with ^G)
(faucet@faucet-vm)1> faucet:start().

This spews a horrific amount of output on both the faucet-vm and sink-vm consoles. If you don’t see tons of output on both sides, you’ve probably done something wrong.

In a separate console on sink-vm, use iptables to block all IP traffic between sink-vm and faucet-vm:

# iptables -A INPUT -s faucet-vm -j DROP && iptables -A OUTPUT -d faucet-vm -j DROP

The output of the two processes will halt shortly after the traffic between them is blocked. Which one stops first actually depends on whether the sink process is able to keep up with the faucet process before the traffic is blocked. About 60 seconds after halting, each node should report that the other node is DOWN. The number of messages the faucet claims to have sent will be higher than the number the sink has received. The difference will be the number of messages the faucet has buffered to send to the sink before blocking. After the faucet reports that the sink is DOWN, it will report one sent message every 7 seconds.

Here’s the tail of the output of the sink process when I tried it:

1440398109.640845: ping 45404
1440398109.640864: ping 45405
1440398109.640881: ping 45406
1440398109.640899: ping 45407
1440398109.640917: ping 45408
1440398109.640935: ping 45409
1440398109.640953: ping 45410
1440398109.640971: ping 45411
1440398109.640988: ping 45412
1440398109.641006: ping 45413
1440398109.641022: ping 45414
1440398109.64104: ping 45415
1440398109.641057: ping 45416
(sink@sink-vm)2>
=ERROR REPORT==== 23-Aug-2015::23:36:03 ===
** Node faucet@faucet-vm not responding **
** Removing (timedout) connection **

And here’s the tail of the output of the faucet process when I tried it:

1440398099.86504: sending ping 63339
1440398099.865061: sending ping 63340
1440398099.865107: sending ping 63341
1440398099.865159: sending ping 63342
1440398099.865199: sending ping 63343
1440398099.865224: sending ping 63344
1440398099.865246: sending ping 63345
1440398099.865293: sending ping 63346
1440398099.865328: sending ping 63347
1440398099.865364: sending ping 63348
1440398157.560028: sending ping 63349
(faucet@faucet-vm)2>
=ERROR REPORT==== 23-Aug-2015::23:35:57 ===
** Node 'sink@sink-vm' not responding **
** Removing (timedout) connection **
1440398164.566233: sending ping 63350
1440398171.574962: sending ping 63351
1440398178.582127: sending ping 63352
(faucet@faucet-vm)2>

The implications of this behavior are rather severe. If you have an Erlang application running on a cluster of N nodes, you can’t assume that when one node goes network silent you will lose only 1/N of the cluster’s total capacity. If critical processes attempt to communicate with processes on the failed node, they will eventually be unable to do any work at all, including work that is unrelated to the failed node.

I haven’t yet tried this, but it is likely possible to solve this problem by routing all messages to remote processes through a proxy process, as sketched below. Each remote node can have a registered local proxy, and the sender would find the appropriate proxy based on the remote process’s node, obtained with erlang:node/1. Each node proxy could be watched by a separate process that detects when the proxy has become unresponsive because the remote node went down. When hung, the proxy can either be restarted, to keep its message queue from growing too large, or simply killed until the remote node becomes available again. At the very least, Location Transparency is lost.
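
Here’s a minimal sketch of the proxy idea (untested; the module name, the {forward, Pid, Msg} message shape, and the registration scheme are all mine, and the watchdog logic is left out):

-module(node_proxy).
-export([start/1, send/2, loop/1]).

%% Register a local proxy under the remote node's name, e.g.
%% 'sink@sink-vm'. Node names are atoms, so they can be used
%% directly as registered names.
start(RemoteNode) ->
    erlang:register(RemoteNode, erlang:spawn(?MODULE, loop, [RemoteNode])).

%% Senders route every remote message through the proxy for the
%% target's node instead of sending to the remote pid directly.
send(RemotePid, Msg) ->
    erlang:whereis(erlang:node(RemotePid)) ! {forward, RemotePid, Msg},
    ok.

loop(RemoteNode) ->
    receive
        {forward, RemotePid, Msg} ->
            %% If RemoteNode goes dark, only this proxy blocks here;
            %% a watchdog process can kill or restart it.
            RemotePid ! Msg,
            loop(RemoteNode)
    end.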

An astute reader might notice that my use of iptables to simulate a network silent node is imperfect. Specifically, iptables will not block ARP traffic. It’s conceivable that ARP traffic might interfere with the experiment. It turns out it does not. I’ve specifically tested this by also blocking ARP traffic using arptables and obtained identical results. I left ARP and arptables out of the example for simplicity.

Additionally, I’ve carefully experimented with various iptables rules to simulate nodes becoming unavailable in other ways with similar results. Specifically, using a REJECT rule for the INPUT chain rather than DROP results in an ICMP destination port unreachable packet being returned to the faucet-vm host but the faucet process will still block. Using REJECT --reject-with icmp-host-unreachable will generate ICMP destination host unreachable packets returned to the faucet-vm host but this will also leave the faucet process blocked.

It turns out the only way to get the sending process (mostly) unblocked is if the sending node receives a TCP RST packet from the target host. I tested this by using REJECT --reject-with tcp-reset and allowing outbound RST packets. This effectively simulates a node whose host is still running but where Erlang is not. The reason this only mostly unblocks the sending process is that the rate of sending is still significantly lower than if the target node is up. I measured the rate of pings “sent” by the faucet under these conditions to be roughly 1/10th the rate when the sink node is up and traffic is not blocked.
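
The rules would look something like this (my reconstruction of the setup described above, inserted at the top of each chain so that the RST generated by the REJECT rule can get back out past the OUTPUT rule):

# iptables -I INPUT -s faucet-vm -p tcp -j REJECT --reject-with tcp-reset
# iptables -I OUTPUT -d faucet-vm -p tcp --tcp-flags RST RST -j ACCEPT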

One might think this problem is just a bug, and if it were a bug, it might be limited to a particular version. After all, the example output above is from R15B01. If this behavior is a bug, it’s been around for a while and it hasn’t been fixed yet. I’ve reproduced it with at least three different versions: the earliest was R15B01 and the newest is the current version, OTP 18.0.

The conclusion to be drawn is that when designing a distributed application using Erlang, you have to be aware of this behavior and either work around it one way or another or else accept that your application is unlikely to be highly available.

Edit:

After spending more time investigating this problem and work-arounds, I can now say for sure that the solution I suggested above can be made to work. However, it seems that there is a simpler solution. Instead of using the ! operator or erlang:send/2, it is possible to completely mitigate this problem by using erlang:send/3, passing [nosuspend, noconnect] as the options parameter.
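
In the faucet, that would mean replacing “Sink ! ping” with something along these lines (the wrapper and its return values are my own invention; erlang:send/3 itself returns ok, nosuspend, or noconnect):

%% Hand the message off without ever blocking the caller.
safe_send(Dest, Msg) ->
    case erlang:send(Dest, Msg, [nosuspend, noconnect]) of
        ok        -> ok;                     %% queued for delivery
        nosuspend -> {error, busy};          %% send would have suspended us
        noconnect -> {error, not_connected}  %% no connection to the node
    end.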

If only nosuspend is used, the problem is mitigated up until the point where the remote node is identified as DOWN. Once that happens, it’s back to one message per 7 seconds until communication with the remote host is re-established. This suggests that the 7 second delay is the timeout waiting to connect to the remote node once it has been removed from the node list. Also passing noconnect avoids blocking for 7 seconds on each send after the remote node has been determined to be down, which solves the problem in question but potentially causes others: with noconnect, you have to have some other mechanism for re-establishing communication with remote nodes once they become available again. This isn’t necessarily challenging, but it needs to be considered when designing your application.
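
One simple mechanism would be a process that periodically tries to reconnect to any expected cluster member missing from nodes() (a sketch; the node list and the 5 second retry interval are placeholders):

%% Periodically try to re-establish connections to expected nodes.
reconnect_loop(Expected) ->
    [net_kernel:connect_node(N) || N <- Expected,
                                   not lists:member(N, nodes())],
    timer:sleep(5000),
    reconnect_loop(Expected).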

2 thoughts on “Some surprising Erlang behavior”

  1. Hey Eric-

    I’m having this very problem with a highly distributed media server right now. Have you thought any more about this issue? I tried what you suggested with nosuspend/noconnect, but the blocking persists.

    Thanks
    P

    1. The strategy of using send/3 with nosuspend/noconnect was intended to solve a specific problem: when one host in an Erlang cluster temporarily (or permanently) goes network silent. I have tested this method thoroughly using several different versions of Erlang, including R15B01, R16, and R18, and found it to effectively mitigate the problem. If you are not having luck with it, the problem you are experiencing may actually be a different one.
