***Warning, long email. Binding maintainers and anyone else doing signal
handling in 0MQ applications please read this***
Hi Brian,
First, to summarize to others who were not part of the IRC discussion
yesterday (especially binding maintainers, but anyone doing signal handling
in a 0MQ application, take note!) what the "EINTR issue" is:
a) Current status:
Currently 0MQ API calls do not return with (-1, errno=EINTR) if a system
call being called by the 0MQ API returns EINTR. This is due to a historical
decision by Martin Sustrik.
b) As a consequence, for all applications not involving a "runtime library"
doing delayed signal handling (see below):
If an application thread is blocking on a 0MQ API call and a signal
arrives, the application thread will not be interrupted[1] unless the
application programmer somehow asynchronously notifies that thread from a
signal handler[2].
c) As a consequence, for all applications involving a "runtime library"
doing delayed signal handling (in this case we define "The
Python/Perl/Ruby/... interpreter" as a "runtime library"):
The notion of delayed signal handling is that the "runtime library"
installs a signal handler which does not actually perform any action, but
only sets a flag somewhere saying "I got this signal". Then, upon return
from the blocking 0MQ call the "runtime library" looks at its "I got this
signal" flag and processes any pending actions[3].
d) Therefore, if blocking 0MQ calls are not interrupted by signals, the
"runtime library" will never get around to processing the signal which
leads to the illusion of "the signal was lost".
e) Therefore, the current 0MQ behaviour is *broken* (the general brokenness
of signal handling notwithstanding), and a critical bug, and must be fixed
ASAP. Policy discussion needs to define what release(s) any such fix
should actually land in.
f) However, at least in the case of Python (and probably Ruby/Perl/...) the
proposed fix to "just return (-1, errno=EINTR) from a 0MQ API call which
gets (..., errno=EINTR) back from a system call" will not be a 100%
solution, because:
Take the following situation (pseudocode):
    zmq_blocking_api_call ()
    {
        do_some_amount_of_processing ()   // 1
        blocking_system_call ()           // 2
    }
If a signal is delivered during step 1, but before entering the
blocking_system_call() in step 2, the 0MQ code has no way of detecting that
a signal has been delivered and will start step 2 anyway. That call will
block, and the signal will be "lost" until the blocking call returns, if ever.
Therefore, this will need to be documented. In the simple case of a user
pressing ^C, the solution is just to press ^C again. In fact, most people
familiar with the UNIX signal mess know this and take it into account: for
example, when terminating applications from system init scripts, SIGTERM is
resent at least a few times until the application goes away, generally
followed by SIGKILL if the application is still not responding.
[Thanks to Martin Sustrik for documenting this precise race condition].
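The race in point f) can be reproduced in miniature. This is only a sketch
under stated assumptions: blocking_api_call() is a hypothetical stand-in for a
blocking 0MQ call, select() with a timeout plays the part of the blocking
system call, and SIGUSR1 is raised by the process itself so that the timing is
deterministic. It needs a UNIX-like system and must run in the main thread.

```python
import select
import signal
import time

received = []  # signals seen by the handler


def on_signal(signum, frame):
    # Record the signal; nothing else happens here.
    received.append(signum)


def blocking_api_call(timeout):
    """Hypothetical stand-in for a blocking 0MQ API call."""
    # Step 1: "some amount of processing". The signal arrives here,
    # before the blocking system call has been entered.
    signal.raise_signal(signal.SIGUSR1)

    # Step 2: the blocking system call. The signal has already been
    # delivered, so nothing interrupts this call and it blocks for the
    # full timeout.
    start = time.monotonic()
    select.select([], [], [], timeout)
    return time.monotonic() - start


old_handler = signal.signal(signal.SIGUSR1, on_signal)
try:
    elapsed = blocking_api_call(0.2)
finally:
    signal.signal(signal.SIGUSR1, old_handler)

print("signals seen:", received)
print("blocked for %.2f seconds" % elapsed)
```

The handler did run, yet the call still blocked for the whole timeout. With
an indefinitely blocking call in step 2, nothing would return control to the
caller, which is why returning EINTR alone cannot close this window.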
Note [1]: If an application does not set a signal handler for SIGINT,
obviously the default OS handler will be called and the action is to exit
the program.
Note [2]: One method is that used by the zmq-camera example in my
zeromq-examples Github repository. This uses a separate signal handling
thread using a sigwait () loop, and delivers the signals using 0MQ sockets.
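The shape of the approach in Note [2] can be sketched as follows. This is not
the zmq-camera code: it is a minimal illustration that assumes a UNIX-like
system, uses a queue.Queue where the real example uses 0MQ sockets (to keep
the sketch self-contained), and picks SIGUSR1/SIGUSR2 arbitrarily, with
SIGUSR2 as a hypothetical "stop" signal.

```python
import os
import queue
import signal
import threading

HANDLED = {signal.SIGUSR1, signal.SIGUSR2}


def signal_thread(out):
    # sigwait() blocks until one of the signals in HANDLED is pending,
    # then returns its number. No asynchronous handler is involved.
    while True:
        signum = signal.sigwait(HANDLED)
        out.put(signum)  # hand the signal to the application thread
        if signum == signal.SIGUSR2:  # hypothetical "stop" signal
            return


# Block the signals before starting any thread so that every thread
# inherits the mask and only sigwait() ever consumes them.
signal.pthread_sigmask(signal.SIG_BLOCK, HANDLED)

events = queue.Queue()
worker = threading.Thread(target=signal_thread, args=(events,), daemon=True)
worker.start()

# Simulate two external signals sent to this process.
os.kill(os.getpid(), signal.SIGUSR1)
first = events.get(timeout=5)
os.kill(os.getpid(), signal.SIGUSR2)
second = events.get(timeout=5)

worker.join(timeout=5)
signal.pthread_sigmask(signal.SIG_UNBLOCK, HANDLED)
```

Because the signals arrive as ordinary messages, the application thread can
wait for them with the same mechanism it uses for its other input, instead of
relying on a blocking call being interrupted.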
Note [3]: The reason "^C does not work" in Python is that the system's
default SIGINT handler is never involved. Python always handles SIGINT in a
delayed fashion, and it is the "delayed action" which actually raises the
KeyboardInterrupt exception, not the C signal handler.
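A toy model of the delayed handling described in point c) and Note [3] may
help. Everything here is an assumption made for illustration: the
Python-level handler stands in for a runtime's C signal handler, Interrupted
stands in for KeyboardInterrupt, and fake_blocking_call() stands in for a
blocking call that does return after a signal. It is not how any interpreter
is actually implemented.

```python
import signal


class Interrupted(Exception):
    """Hypothetical analogue of KeyboardInterrupt."""


pending = []  # the runtime's "I got this signal" flag


def flag_only_handler(signum, frame):
    # Models the runtime's handler: record the signal, do nothing else.
    pending.append(signum)


def call_blocking(blocking_call):
    # Models the runtime wrapping a blocking call: the delayed action
    # runs only once the call has returned.
    result = blocking_call()
    if pending:
        signum = pending.pop()
        raise Interrupted(signal.Signals(signum).name)
    return result


def fake_blocking_call():
    # Stand-in for a blocking call during which a signal arrives and
    # which then returns to its caller.
    signal.raise_signal(signal.SIGUSR1)
    return "message"


old_handler = signal.signal(signal.SIGUSR1, flag_only_handler)
try:
    call_blocking(fake_blocking_call)
    outcome = "no signal seen"
except Interrupted as exc:
    outcome = "interrupted by %s" % exc
finally:
    signal.signal(signal.SIGUSR1, old_handler)

print(outcome)
```

If fake_blocking_call() never returned, the check in call_blocking() would
never run, which is exactly the "signal was lost" illusion of point d).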
To address your specific comments:
Brian said:
> I just tested this idea of having the blocking recv and poll not trap
> and silence EINTR. It work exactly like we want, at least on OS X. I
> just put in a print statement that prints if EINTR is returned, and if
> my program is sitting in blocking recv and I send SIGINT, the print
> statement is triggered at the right time. Thus, if zeromq starts to
> simply return EINTR for any blocking call (recv, poll, etc) all
> language bindings will be able to properly handle signals in blocking
> calls. BUT, one big question...
Correct, except for the caveat in point f) above. However, this appears to
be a known (if subtle) issue with Python signals; see for example
http://bugs.python.org/issue5315, where msg102829 in particular provides a
good analysis.
So, signals will still be lost sometimes, even with the proposed change to
0MQ.
> This issue is the biggest one that we face with both 2.0.7 and 2.0.8.
> Would it be possible to back port this fix to both of these branches
> as well? But that would mean releasing 2.0.7.1 and 2.0.8.1. I guess
> an alternative would be to release a 2.0.9 that branches off 2.0.8 and
> has this fix. It would be a huge pain for us if this didn't show up
> until 2.1. Do you think this is possible?
This is up to Martin Sustrik to decide. Personally I think returning EINTR
from 0MQ API calls is definitely a fix for *broken* API behaviour, and thus
should definitely go into at least 2.1.x.
For 2.0.x, I'm not sure. The problem is it will break code that does not
expect API calls to return EINTR, *if and only if* that application does
get a signal. Given that most of the time the only thing signals are used
for is telling an application to terminate, I would not expect the effect
to be fatal.
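For code that wants the old "never interrupted" behaviour back, one option is
a retry wrapper around the blocking call. The sketch below is hypothetical
throughout: retry_on_eintr() and flaky_recv() are made-up names, and it
assumes the binding surfaces EINTR as an OSError with errno set, which a given
binding may or may not do.

```python
import errno


def retry_on_eintr(call, *args):
    """Call call(*args), retrying for as long as it fails with EINTR."""
    while True:
        try:
            return call(*args)
        except OSError as exc:
            if exc.errno != errno.EINTR:
                raise  # a real error, not an interruption
            # Interrupted by a signal: loop and call again. An
            # application that wants to act on the signal would check
            # its own shutdown flag here instead of retrying blindly.


# Stand-in for a blocking receive that is interrupted twice before it
# finally succeeds.
attempts = {"n": 0}


def flaky_recv():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise InterruptedError(errno.EINTR, "Interrupted system call")
    return b"payload"


result = retry_on_eintr(flaky_recv)
print(result, "after", attempts["n"], "attempts")
```

Applications that do want to react to signals would skip the wrapper and
handle the EINTR return themselves.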
So, if it was my choice, I'd own up to "I messed up", make loud
announcements about the EINTR change and probably push it into 2.0.x. In
the grand scheme of things, it'll fix more than it breaks...
Cheers,
-mato