Re: [Qemu-devel] [PATCH v2 3/3] AioContext: fix broken ctx->dispatching optimization
On Fri, Jul 17, 2015 at 06:17:47AM +0200, Paolo Bonzini wrote:
> On 16/07/2015 11:56, Paolo Bonzini wrote:
> > @@ -286,13 +283,15 @@ bool aio_poll(AioContext *ctx, bool blocking)
> >      npfd = 0;
> >      ctx->walking_handlers--;
> >  
> > +    if (blocking) {
> > +        atomic_sub(&ctx->notify_me, 2);
> > +    }
> > +
> 
> I kept this place for subtracting notify_me because it is the same
> place where aio_set_dispatching was called.  However, it can be
> anticipated to
> 
>     /* if we have any readable fds, dispatch event */
>     if (ret > 0) {
>         for (i = 0; i < npfd; i++) {
> 
> i.e. right after poll.  As poll has exited, it can't be blocking the
> thread anymore.
> 
> Stefan, please let me send v3 on Monday.

Sure.

Stefan
[Qemu-devel] [PATCH v2 3/3] AioContext: fix broken ctx->dispatching optimization
This patch rewrites the ctx->dispatching optimization, which was the cause
of some mysterious hangs that could be reproduced on aarch64 KVM only.
The hangs were indirectly caused by aio_poll() and in particular by flash
memory updates' call to blk_write(), which invokes aio_poll().  Fun stuff:
they had an extremely short race window, so much that adding all kinds of
tracing to either the kernel or QEMU made it go away (a single printf
made it half as reproducible).

On the plus side, the failure mode (a hang until the next keypress) made
it very easy to examine the state of the process with a debugger.  And
there was a very nice reproducer from Laszlo, which failed pretty often
(more than half of the time) on any version of QEMU with a non-debug
kernel; it also failed fast, while still in the firmware.  So, it could
have been worse.

For some unknown reason the hangs happened only with virtio-scsi, but
that's not important.  It's more interesting that they disappeared with
io=native, making thread-pool.c a likely suspect for where the bug arose.
thread-pool.c is also one of the few places which use bottom halves
across threads, by the way.

I hope that no other similar bugs exist, but just in case :) I am going
to describe how the successful debugging went...  Since the likely
culprit was the ctx->dispatching optimization, which mostly affects
bottom halves, the first observation was that there are two
qemu_bh_schedule() invocations in the thread pool: the one in the aio
worker and the one in thread_pool_completion_bh.  The latter always
causes the optimization to trigger, the former may or may not.
In order to restrict the possibilities, I introduced new functions
qemu_bh_schedule_slow() and qemu_bh_schedule_fast():

     /* qemu_bh_schedule_slow: */
     ctx = bh->ctx;
     bh->idle = 0;
     if (atomic_xchg(&bh->scheduled, 1) == 0) {
         event_notifier_set(&ctx->notifier);
     }

     /* qemu_bh_schedule_fast: */
     ctx = bh->ctx;
     bh->idle = 0;
     assert(ctx->dispatching);
     atomic_xchg(&bh->scheduled, 1);

Notice how the atomic_xchg is still in qemu_bh_schedule_slow().  This
was already debated a few months ago, so I assumed it to be correct.  In
retrospect this was a very good idea, as you'll see later.

Changing thread_pool_completion_bh() to qemu_bh_schedule_fast() didn't
trigger the assertion (as expected).  Changing the worker's invocation
to qemu_bh_schedule_slow() didn't hide the bug (another assumption which
luckily held).  This already heavily limited the amount of interaction
between the threads, hinting that the problematic events must have
triggered around thread_pool_completion_bh().

As mentioned earlier, invoking a debugger to examine the state of a hung
process was pretty easy; the iothread was always waiting on a
poll(..., -1) system call.  Infinite timeouts are much rarer on x86, and
this could be the reason why the bug was never observed there.  With the
buggy sequence more or less resolved to an interaction between
thread_pool_completion_bh() and poll(..., -1), my tracing strategy was
to just add a few qemu_clock_get_ns(QEMU_CLOCK_REALTIME) calls, hoping
that the ordering of aio_ctx_prepare(), aio_ctx_dispatch(), poll() and
qemu_bh_schedule_fast() would provide some hint.  The output was:

    (gdb) p last_prepare
    $3 = 103885451
    (gdb) p last_dispatch
    $4 = 103876492
    (gdb) p last_poll
    $5 = 115909333
    (gdb) p last_schedule
    $6 = 115925212

Notice how the last call to qemu_poll_ns() came after aio_ctx_dispatch().
This makes little sense unless there is an aio_poll() call involved, and
indeed with a slightly different instrumentation you can see that there
is one:

    (gdb) p last_prepare
    $3 = 107569679
    (gdb) p last_dispatch
    $4 = 107561600
    (gdb) p last_aio_poll
    $5 = 110671400
    (gdb) p last_schedule
    $6 = 110698917

So the scenario becomes clearer:

    iothread                          VCPU thread
    --------------------------------------------------------------------
    aio_ctx_prepare
    aio_ctx_check
    qemu_poll_ns(timeout=-1)
                                      aio_poll
                                        aio_dispatch
                                          thread_pool_completion_bh
                                            qemu_bh_schedule()

At this point bh->scheduled = 1 and the iothread has not been woken up.
The solution must be close, but this alone should not be a problem,
because the bottom half is only rescheduled to account for rare
situations (see commit 3c80ca1, thread-pool: avoid deadlock in nested
aio_poll() calls, 2014-07-15).

Introducing a third thread---a thread pool worker thread, which also
does qemu_bh_schedule()---does bring out the problematic case.  The
third thread must be awakened *after* the callback is complete and
thread_pool_completion_bh has redone the whole loop, explaining the
short race window.  And then this is what happens:

    thread pool worker
Re: [Qemu-devel] [PATCH v2 3/3] AioContext: fix broken ctx->dispatching optimization
On Thu, 07/16 11:56, Paolo Bonzini wrote:
> diff --git a/aio-posix.c b/aio-posix.c
> index 4abec38..268d14d 100644
> --- a/aio-posix.c
> +++ b/aio-posix.c
> @@ -233,26 +233,23 @@ static void add_pollfd(AioHandler *node)
>  bool aio_poll(AioContext *ctx, bool blocking)
>  {
>      AioHandler *node;
> -    bool was_dispatching;
>      int i, ret;
>      bool progress;
>      int64_t timeout;
>  
>      aio_context_acquire(ctx);
> -    was_dispatching = ctx->dispatching;
>      progress = false;
>  
>      /* aio_notify can avoid the expensive event_notifier_set if
>       * everything (file descriptors, bottom halves, timers) will
>       * be re-evaluated before the next blocking poll().  This is
>       * already true when aio_poll is called with blocking == false;
> -     * if blocking == true, it is only true after poll() returns.
> -     *
> -     * If we're in a nested event loop, ctx->dispatching might be true.
> -     * In that case we can restore it just before returning, but we
> -     * have to clear it now.
> +     * if blocking == true, it is only true after poll() returns,
> +     * so disable the optimization now.
>       */
> -    aio_set_dispatching(ctx, !blocking);
> +    if (blocking) {
> +        atomic_add(&ctx->notify_me, 2);
> +    }

Sorry if my questions are stupid, but I'm having difficulties in fully
understanding it.

What if aio_notify happens after the previous aio_dispatch() but before
the next necessary atomic_add?  The aio_notify would still skip the
event_notifier_set(), and the next ppoll() will not return.  For example:

    Thread A                           Thread B
    -------------------------------------------------------------------
    aio_poll(blocking=true)
                                       aio_notify()
                                         smp_mb()
                                         if (ctx->notify_me)  /* false! */
      atomic_add(&ctx->notify_me, 2)
      ppoll()
      atomic_sub(&ctx->notify_me, 2)
                                         event_notifier_set() /* not run */

And if that's not a problem, why don't we need something like
ACCESS_ONCE in aio_notify()?
Fam

>      ctx->walking_handlers++;

> @@ -286,13 +283,15 @@ bool aio_poll(AioContext *ctx, bool blocking)
>      npfd = 0;
>      ctx->walking_handlers--;
>  
> +    if (blocking) {
> +        atomic_sub(&ctx->notify_me, 2);
> +    }
> +
>      /* Run dispatch even if there were no readable fds to run timers */
> -    aio_set_dispatching(ctx, true);
>      if (aio_dispatch(ctx)) {
>          progress = true;
>      }
> 
> -    aio_set_dispatching(ctx, was_dispatching);
>      aio_context_release(ctx);
>  
>      return progress;
> 
> diff --git a/aio-win32.c b/aio-win32.c
> index 9268b5c..9d6c12f 100644
> --- a/aio-win32.c
> +++ b/aio-win32.c
> @@ -279,25 +279,23 @@ bool aio_poll(AioContext *ctx, bool blocking)
>  {
>      AioHandler *node;
>      HANDLE events[MAXIMUM_WAIT_OBJECTS + 1];
> -    bool was_dispatching, progress, have_select_revents, first;
> +    bool progress, have_select_revents, first;
>      int count;
>      int timeout;
>  
>      aio_context_acquire(ctx);
> -    was_dispatching = ctx->dispatching;
>      progress = false;
>  
>      /* aio_notify can avoid the expensive event_notifier_set if
>       * everything (file descriptors, bottom halves, timers) will
>       * be re-evaluated before the next blocking poll().  This is
>       * already true when aio_poll is called with blocking == false;
> -     * if blocking == true, it is only true after poll() returns.
> -     *
> -     * If we're in a nested event loop, ctx->dispatching might be true.
> -     * In that case we can restore it just before returning, but we
> -     * have to clear it now.
> +     * if blocking == true, it is only true after poll() returns,
> +     * so disable the optimization now.
>       */
> -    aio_set_dispatching(ctx, !blocking);
> +    if (blocking) {
> +        atomic_add(&ctx->notify_me, 2);
> +    }
>  
>      have_select_revents = aio_prepare(ctx);
> 
> @@ -334,7 +332,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
>          if (timeout) {
>              aio_context_acquire(ctx);
>          }
> -        aio_set_dispatching(ctx, true);
> +        if (blocking) {
> +            assert(first);
> +            atomic_sub(&ctx->notify_me, 2);
> +        }
>  
>          if (first && aio_bh_poll(ctx)) {
>              progress = true;
> 
> @@ -358,7 +359,6 @@ bool aio_poll(AioContext *ctx, bool blocking)
>  
>      progress |= timerlistgroup_run_timers(&ctx->tlg);
>  
> -    aio_set_dispatching(ctx, was_dispatching);
>      aio_context_release(ctx);
>      return progress;
>  }
> 
> diff --git a/async.c b/async.c
> index 77d080d..a232192 100644
> --- a/async.c
> +++ b/async.c
> @@ -184,6 +184,8 @@ aio_ctx_prepare(GSource *source, gint *timeout)
>  {
>      AioContext *ctx = (AioContext *) source;
>  
> +    atomic_or(&ctx->notify_me, 1);
> +
>      /* We assume there is no timeout already supplied */
>      *timeout = qemu_timeout_ns_to_ms(aio_compute_timeout(ctx));
> 
> @@ -200,6 +202,7 @@ aio_ctx_check(GSource *source)
>  {
>      AioContext *ctx = (AioContext *)
Re: [Qemu-devel] [PATCH v2 3/3] AioContext: fix broken ctx->dispatching optimization
On 17/07/2015 04:25, Fam Zheng wrote:
> What if aio_notify happens after the previous aio_dispatch() but before
> the next necessary atomic_add?  The aio_notify would still skip the
> event_notifier_set(), and the next ppoll() will not return.  For example:
> 
>     Thread A                           Thread B
>     -------------------------------------------------------------------
>     aio_poll(blocking=true)
>                                        aio_notify()
>                                          smp_mb()
>                                          if (ctx->notify_me)  /* false! */
>       atomic_add(&ctx->notify_me, 2)
>       ppoll()
>       atomic_sub(&ctx->notify_me, 2)
>                                          event_notifier_set() /* not run */

It's not a problem because ppoll() has exited.  The next call to
aio_poll or aio_ctx_prepare will notice the bottom half, do a
non-blocking ppoll(), and then service the bottom half.

> And if that's not a problem, why don't we need something like
> ACCESS_ONCE in aio_notify()?

Because there's already a smp_mb() which is stronger.

Paolo
Re: [Qemu-devel] [PATCH v2 3/3] AioContext: fix broken ctx->dispatching optimization
On 16/07/2015 11:56, Paolo Bonzini wrote:
> @@ -286,13 +283,15 @@ bool aio_poll(AioContext *ctx, bool blocking)
>      npfd = 0;
>      ctx->walking_handlers--;
>  
> +    if (blocking) {
> +        atomic_sub(&ctx->notify_me, 2);
> +    }
> +

I kept this place for subtracting notify_me because it is the same place
where aio_set_dispatching was called.  However, it can be anticipated to

    /* if we have any readable fds, dispatch event */
    if (ret > 0) {
        for (i = 0; i < npfd; i++) {

i.e. right after poll.  As poll has exited, it can't be blocking the
thread anymore.

Stefan, please let me send v3 on Monday.

Paolo

>      /* Run dispatch even if there were no readable fds to run timers */
> -    aio_set_dispatching(ctx, true);
>      if (aio_dispatch(ctx)) {
>          progress = true;
>      }