asr_run_sync is a simple loop around poll() and asr_run(). In theory it polls
until either its timeout expires or the condition that asr_run() requires to
make progress occurs. It currently handles poll() returning -1/EINTR by simply
restarting, without taking into account that some time passed before poll()
returned, meaning it calls poll() again with the same timeout value.

The ruby runtime, when running a threaded program, sends SIGVTALRM to threads
every 100ms under certain conditions, one of which is when the thread is
executing some non-ruby code outside ruby's global vm lock. I didn't get as
far as figuring out why it does this. Anyway, the ruby standard library
resolver functions call the libc resolver outside the lock.

Add these two things together and we find that this ruby program will never
time out if it doesn't get a response to its first query:

    require "thread";
    require "socket";
    t = Thread.new { p "hi"; };
    p Socket.gethostbyname("openbsd.org");

After a recent network outage, we ended up with puppet stuck in the
resolver on 20 or so boxes, which was fairly annoying to clean up.

The diff below improves the situation by deducting elapsed time from the
timeout before restarting poll().
Index: asr.c
===================================================================
RCS file: /cvs/src/lib/libc/asr/asr.c,v
retrieving revision 1.51
diff -u -p -u -p -r1.51 asr.c
--- asr.c 24 Feb 2016 20:52:53 -0000 1.51
+++ asr.c 23 May 2016 03:57:23 -0000
@@ -169,15 +169,30 @@ int
 asr_run_sync(struct asr_query *as, struct asr_result *ar)
 {
 	struct pollfd	 fds[1];
-	int		 r, saved_errno = errno;
+	struct timespec	 pollstart, pollend, elapsed;
+	int		 timeout, r, p, saved_errno = errno;
 	while ((r = asr_run(as, ar)) == ASYNC_COND) {
 		fds[0].fd = ar->ar_fd;
 		fds[0].events = (ar->ar_cond == ASR_WANT_READ) ? POLLIN:POLLOUT;
+
+		timeout = ar->ar_timeout;
 again:
-		r = poll(fds, 1, ar->ar_timeout);
-		if (r == -1 && errno == EINTR)
+		if (clock_gettime(CLOCK_MONOTONIC, &pollstart))
+			break;
+		p = poll(fds, 1, timeout);
+		if (p == -1 && errno == EINTR) {
+			if (clock_gettime(CLOCK_MONOTONIC, &pollend))
+				break;
+
+			timespecsub(&pollend, &pollstart, &elapsed);
+			timeout -= (elapsed.tv_sec * 1000) +
+			    (elapsed.tv_nsec / 1000000);
+			if (timeout < 1)
+				break;
 			goto again;
+		}
+
 		/*
 		 * Otherwise, just ignore the error and let asr_run()
 		 * catch the failure.