Re: [Qemu-devel] [PATCH v2 3/3] AioContext: fix broken ctx->dispatching optimization

2015-07-17 Thread Stefan Hajnoczi
On Fri, Jul 17, 2015 at 06:17:47AM +0200, Paolo Bonzini wrote:
 
 
 On 16/07/2015 11:56, Paolo Bonzini wrote:
  @@ -286,13 +283,15 @@ bool aio_poll(AioContext *ctx, bool blocking)
   npfd = 0;
   ctx->walking_handlers--;
   
  +if (blocking) {
  +atomic_sub(&ctx->notify_me, 2);
  +}
  +
 
 I kept this place for subtracting notify_me because it is the same place
 where aio_set_dispatching was called.  However, it can be moved earlier, to
 
  /* if we have any readable fds, dispatch event */
  if (ret > 0) {
      for (i = 0; i < npfd; i++) {
 
 i.e. right after poll.  As poll has exited, it can't be blocking the
 thread anymore.  Stefan, please let me send v3 on Monday.

Sure.

Stefan



[Qemu-devel] [PATCH v2 3/3] AioContext: fix broken ctx->dispatching optimization

2015-07-16 Thread Paolo Bonzini
This patch rewrites the ctx->dispatching optimization, which was the cause
of some mysterious hangs that could be reproduced on aarch64 KVM only.
The hangs were indirectly caused by aio_poll() and in particular by
flash memory updates' call to blk_write(), which invokes aio_poll().
Fun stuff: they had an extremely short race window, so much so that
adding all kinds of tracing to either the kernel or QEMU made it
go away (a single printf made it half as reproducible).

On the plus side, the failure mode (a hang until the next keypress)
made it very easy to examine the state of the process with a debugger.
And there was a very nice reproducer from Laszlo, which failed pretty
often (more than half of the time) on any version of QEMU with a non-debug
kernel; it also failed fast, while still in the firmware.  So, it could
have been worse.

For some unknown reason they happened only with virtio-scsi, but
that's not important.  It's more interesting that they disappeared with
io=native, making thread-pool.c a likely suspect for where the bug arose.
thread-pool.c is also one of the few places which use bottom halves
across threads, by the way.

I hope that no other similar bugs exist, but just in case :) I am
going to describe how the successful debugging went...  Since the
likely culprit was the ctx->dispatching optimization, which mostly
affects bottom halves, the first observation was that there are two
qemu_bh_schedule() invocations in the thread pool: the one in the aio
worker and the one in thread_pool_completion_bh.  The latter always
causes the optimization to trigger, the former may or may not.  In
order to restrict the possibilities, I introduced new functions
qemu_bh_schedule_slow() and qemu_bh_schedule_fast():

 /* qemu_bh_schedule_slow: */
 ctx = bh->ctx;
 bh->idle = 0;
 if (atomic_xchg(&bh->scheduled, 1) == 0) {
     event_notifier_set(&ctx->notifier);
 }

 /* qemu_bh_schedule_fast: */
 ctx = bh->ctx;
 bh->idle = 0;
 assert(ctx->dispatching);
 atomic_xchg(&bh->scheduled, 1);

Notice how the atomic_xchg is still in qemu_bh_schedule_slow().  This
was already debated a few months ago, so I assumed it to be correct.
In retrospect this was a very good idea, as you'll see later.

Changing thread_pool_completion_bh() to qemu_bh_schedule_fast() didn't
trigger the assertion (as expected).  Changing the worker's invocation
to qemu_bh_schedule_slow() didn't hide the bug (another assumption
which luckily held).  This already heavily limited the amount of
interaction between the threads, hinting that the problematic events
must have triggered around thread_pool_completion_bh().

As mentioned earlier, invoking a debugger to examine the state of a
hung process was pretty easy; the iothread was always waiting on a
poll(..., -1) system call.  Infinite timeouts are much rarer on x86,
and this could be the reason why the bug was never observed there.
With the buggy sequence more or less resolved to an interaction between
thread_pool_completion_bh() and poll(..., -1), my tracing strategy was
to just add a few qemu_clock_get_ns(QEMU_CLOCK_REALTIME) calls, hoping
that the ordering of aio_ctx_prepare(), aio_ctx_dispatch(), poll() and
qemu_bh_schedule_fast() would provide some hint.  The output was:

(gdb) p last_prepare
$3 = 103885451
(gdb) p last_dispatch
$4 = 103876492
(gdb) p last_poll
$5 = 115909333
(gdb) p last_schedule
$6 = 115925212

Notice how the last call to qemu_poll_ns() came after aio_ctx_dispatch().
This makes little sense unless there is an aio_poll() call involved,
and indeed with a slightly different instrumentation you can see that
there is one:

(gdb) p last_prepare
$3 = 107569679
(gdb) p last_dispatch
$4 = 107561600
(gdb) p last_aio_poll
$5 = 110671400
(gdb) p last_schedule
$6 = 110698917
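For context, values such as last_prepare and last_aio_poll above come from
ad-hoc instrumentation along the lines sketched below; the helper trace_ts()
and the exact call sites are assumptions for illustration only, not part of
the posted patches.

    /* Hypothetical debugging aid: one timestamp per event of interest,
     * readable later from gdb with "p last_prepare" and friends. */
    #include "qemu/timer.h"

    int64_t last_prepare, last_dispatch, last_poll, last_aio_poll, last_schedule;

    static inline void trace_ts(int64_t *slot)
    {
        *slot = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
    }

    /* Assumed call sites:
     *   aio_ctx_prepare():        trace_ts(&last_prepare);
     *   aio_ctx_dispatch():       trace_ts(&last_dispatch);
     *   before qemu_poll_ns():    trace_ts(&last_poll);
     *   entry of aio_poll():      trace_ts(&last_aio_poll);
     *   qemu_bh_schedule_fast():  trace_ts(&last_schedule);
     */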

So the scenario becomes clearer:

   iothread                            VCPU thread
-----------------------------------------------------------------------
   aio_ctx_prepare
   aio_ctx_check
   qemu_poll_ns(timeout=-1)
                                       aio_poll
                                         aio_dispatch
                                           thread_pool_completion_bh
                                             qemu_bh_schedule()

At this point bh->scheduled = 1 and the iothread has not been woken up.
The solution must be close, but this alone should not be a problem,
because the bottom half is only rescheduled to account for rare situations
(see commit 3c80ca1, thread-pool: avoid deadlock in nested aio_poll()
calls, 2014-07-15).
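For readers who do not have commit 3c80ca1 at hand, the rescheduling pattern
it introduced is roughly the following; this is a simplified sketch rather
than a verbatim quote of thread-pool.c, and pool/elem are shorthand names:

    /* Simplified sketch: reschedule the completion bottom half *before*
     * running the callback, so that a nested aio_poll() inside the
     * callback can still pick up other requests that completed at the
     * same time. */
    QLIST_REMOVE(elem, all);
    qemu_bh_schedule(pool->completion_bh);
    elem->common.cb(elem->common.opaque, elem->ret);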

Introducing a third thread---a thread pool worker thread, which
also does qemu_bh_schedule()---does bring out the problematic case.
The third thread must be awakened *after* the callback is complete and
thread_pool_completion_bh has redone the whole loop, explaining the
short race window.  And then this is what happens:

  thread pool worker

Re: [Qemu-devel] [PATCH v2 3/3] AioContext: fix broken ctx->dispatching optimization

2015-07-16 Thread Fam Zheng
On Thu, 07/16 11:56, Paolo Bonzini wrote:
 diff --git a/aio-posix.c b/aio-posix.c
 index 4abec38..268d14d 100644
 --- a/aio-posix.c
 +++ b/aio-posix.c
 @@ -233,26 +233,23 @@ static void add_pollfd(AioHandler *node)
  bool aio_poll(AioContext *ctx, bool blocking)
  {
  AioHandler *node;
 -bool was_dispatching;
  int i, ret;
  bool progress;
  int64_t timeout;
  
  aio_context_acquire(ctx);
 -was_dispatching = ctx->dispatching;
  progress = false;
  
  /* aio_notify can avoid the expensive event_notifier_set if
   * everything (file descriptors, bottom halves, timers) will
   * be re-evaluated before the next blocking poll().  This is
   * already true when aio_poll is called with blocking == false;
 - * if blocking == true, it is only true after poll() returns.
 - *
 - * If we're in a nested event loop, ctx->dispatching might be true.
 - * In that case we can restore it just before returning, but we
 - * have to clear it now.
 + * if blocking == true, it is only true after poll() returns,
 + * so disable the optimization now.
   */
 -aio_set_dispatching(ctx, !blocking);
 +if (blocking) {
 +atomic_add(&ctx->notify_me, 2);
 +}

Sorry if my questions are stupid, but I'm having difficulties in fully
understanding it.

What if aio_notify happens after the previous aio_dispatch() but before the
next necessary atomic_add? The aio_notify would still skip the
event_notifier_set(), and the next ppoll() will not return. For example:

  Thread A                              Thread B

  aio_poll(blocking=true)
                                        aio_notify()
                                          smp_mb()
                                          if (ctx->notify_me) /* false! */
  atomic_add(&ctx->notify_me, 2)
  ppoll()
  atomic_sub(&ctx->notify_me, 2)        event_notifier_set() /* not run */


And if that's not a problem, why don't we need something like ACCESS_ONCE in
aio_notify()?

Fam

  
  ctx->walking_handlers++;
  
 @@ -286,13 +283,15 @@ bool aio_poll(AioContext *ctx, bool blocking)
  npfd = 0;
  ctx->walking_handlers--;
  
 +if (blocking) {
 +atomic_sub(&ctx->notify_me, 2);
 +}
 +
  /* Run dispatch even if there were no readable fds to run timers */
 -aio_set_dispatching(ctx, true);
  if (aio_dispatch(ctx)) {
  progress = true;
  }
  
 -aio_set_dispatching(ctx, was_dispatching);
  aio_context_release(ctx);
  
  return progress;
 diff --git a/aio-win32.c b/aio-win32.c
 index 9268b5c..9d6c12f 100644
 --- a/aio-win32.c
 +++ b/aio-win32.c
 @@ -279,25 +279,23 @@ bool aio_poll(AioContext *ctx, bool blocking)
  {
  AioHandler *node;
  HANDLE events[MAXIMUM_WAIT_OBJECTS + 1];
 -bool was_dispatching, progress, have_select_revents, first;
 +bool progress, have_select_revents, first;
  int count;
  int timeout;
  
  aio_context_acquire(ctx);
 -was_dispatching = ctx->dispatching;
  progress = false;
  
  /* aio_notify can avoid the expensive event_notifier_set if
   * everything (file descriptors, bottom halves, timers) will
   * be re-evaluated before the next blocking poll().  This is
   * already true when aio_poll is called with blocking == false;
 - * if blocking == true, it is only true after poll() returns.
 - *
 - * If we're in a nested event loop, ctx->dispatching might be true.
 - * In that case we can restore it just before returning, but we
 - * have to clear it now.
 + * if blocking == true, it is only true after poll() returns,
 + * so disable the optimization now.
   */
 -aio_set_dispatching(ctx, !blocking);
 +if (blocking) {
 +atomic_add(&ctx->notify_me, 2);
 +}
  
  have_select_revents = aio_prepare(ctx);
  
 @@ -334,7 +332,10 @@ bool aio_poll(AioContext *ctx, bool blocking)
  if (timeout) {
  aio_context_acquire(ctx);
  }
 -aio_set_dispatching(ctx, true);
 +if (blocking) {
 +assert(first);
 +atomic_sub(&ctx->notify_me, 2);
 +}
  
  if (first && aio_bh_poll(ctx)) {
  progress = true;
 @@ -358,7 +359,6 @@ bool aio_poll(AioContext *ctx, bool blocking)
  
  progress |= timerlistgroup_run_timers(&ctx->tlg);
  
 -aio_set_dispatching(ctx, was_dispatching);
  aio_context_release(ctx);
  return progress;
  }
 diff --git a/async.c b/async.c
 index 77d080d..a232192 100644
 --- a/async.c
 +++ b/async.c
 @@ -184,6 +184,8 @@ aio_ctx_prepare(GSource *source, gint *timeout)
  {
  AioContext *ctx = (AioContext *) source;
  
 +atomic_or(&ctx->notify_me, 1);
 +
  /* We assume there is no timeout already supplied */
  *timeout = qemu_timeout_ns_to_ms(aio_compute_timeout(ctx));
  
 @@ -200,6 +202,7 @@ aio_ctx_check(GSource *source)
  AioContext *ctx = (AioContext *) 

Re: [Qemu-devel] [PATCH v2 3/3] AioContext: fix broken ctx->dispatching optimization

2015-07-16 Thread Paolo Bonzini


On 17/07/2015 04:25, Fam Zheng wrote:
 What if aio_notify happens after the previous aio_dispatch() but before the
 next necessary atomic_add? The aio_notify would still skip the
 event_notifier_set(), and the next ppoll() will not return. For example:
 
   Thread A                              Thread B
 
   aio_poll(blocking=true)
                                         aio_notify()
                                           smp_mb()
                                           if (ctx->notify_me) /* false! */
   atomic_add(&ctx->notify_me, 2)
   ppoll()
   atomic_sub(&ctx->notify_me, 2)        event_notifier_set() /* not run */

It's not a problem because ppoll() has exited.  The next call to
aio_poll or aio_ctx_prepare will notice the bottom half, do a
non-blocking ppoll(), and then service the bottom half.
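Concretely, the non-blocking ppoll() comes from the timeout computation: a
scheduled, non-idle bottom half forces a zero timeout.  A simplified sketch
of the logic referred to here (the loop is paraphrased from async.c of that
period, and deadline_from_timers() is a stand-in name for the real timer
deadline code):

    /* Simplified: the timeout computation returns 0 as soon as it sees a
     * scheduled, non-idle bottom half, so the next ppoll() cannot block. */
    for (bh = ctx->first_bh; bh; bh = bh->next) {
        if (bh->scheduled && !bh->idle) {
            return 0;
        }
    }
    return deadline_from_timers(ctx);   /* stand-in for the timer deadline */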

 
 And if that's not a problem, why don't we need something like ACCESS_ONCE in
 aio_notify()?

Because there's already a smp_mb() which is stronger.
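For reference, the aio_notify() fast path that the diagram refers to is
essentially the following; this is reconstructed from the diagram and the
discussion, not a verbatim quote of the patch:

    void aio_notify(AioContext *ctx)
    {
        /* Publish bh->scheduled (and similar state) before reading
         * ctx->notify_me; pairs with the atomic read-modify-writes of
         * notify_me in aio_poll() and aio_ctx_prepare(). */
        smp_mb();
        if (ctx->notify_me) {
            event_notifier_set(&ctx->notifier);
        }
    }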

Paolo



Re: [Qemu-devel] [PATCH v2 3/3] AioContext: fix broken ctx->dispatching optimization

2015-07-16 Thread Paolo Bonzini


On 16/07/2015 11:56, Paolo Bonzini wrote:
 @@ -286,13 +283,15 @@ bool aio_poll(AioContext *ctx, bool blocking)
  npfd = 0;
  ctx->walking_handlers--;
  
 +if (blocking) {
 +atomic_sub(&ctx->notify_me, 2);
 +}
 +

I kept this place for subtracting notify_me because it is the same place
where aio_set_dispatching was called.  However, it can be moved earlier, to

 /* if we have any readable fds, dispatch event */
 if (ret > 0) {
     for (i = 0; i < npfd; i++) {

i.e. right after poll.  As poll has exited, it can't be blocking the
thread anymore.  Stefan, please let me send v3 on Monday.
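For illustration, the anticipated v3 placement would put the subtraction
immediately after the poll call, along these lines (a sketch of the intended
move, not the posted patch):

     /* ... qemu_poll_ns()/ppoll() has just returned ... */
 +    if (blocking) {
 +        atomic_sub(&ctx->notify_me, 2);
 +    }

     /* if we have any readable fds, dispatch event */
     if (ret > 0) {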

Paolo

 
  /* Run dispatch even if there were no readable fds to run timers */
 -aio_set_dispatching(ctx, true);
  if (aio_dispatch(ctx)) {
  progress = true;
  }