Re: [Qemu-devel] QEMU event loop optimizations

2019-04-08 Thread Paolo Bonzini
On 05/04/19 18:33, Sergio Lopez wrote:
>> - qemu_bh_schedule_nested should not be necessary since we have 
>> ctx->notify_me to also avoid the event_notifier_set call.  However, it 
>> is possible to avoid the smp_mb at the beginning of aio_notify, since  
>> atomic_xchg already implies it.  Maybe add a "static void 
>> aio_notify__after_smp_mb"?
> try_poll_mode() is called with ctx->notify_me != 0, so we get at least
> one event_notifier_set() call while working in polling mode.

Right, though then there is the idea of making ctx->notify_me and
ctx->notified bitmasks, where bit 0 is "BH ready" and bit 1 is "poll()
parameters changed".  try_poll_mode() cares about the latter, but does
not need event_notifier_set() for BH ready.

Paolo



Re: [Qemu-devel] QEMU event loop optimizations

2019-04-08 Thread Stefan Hajnoczi
On Fri, Apr 05, 2019 at 06:29:49PM +0200, Sergio Lopez wrote:
> 
> Stefan Hajnoczi writes:
> 
> > Hi Sergio,
> > Here are the forgotten event loop optimizations I mentioned:
> >
> >   https://github.com/stefanha/qemu/commits/event-loop-optimizations
> >
> > The goal was to eliminate or reorder syscalls so that useful work (like
> > executing BHs) occurs as soon as possible after an event is detected.
> >
> > I remember that these optimizations only shave off a handful of
> > microseconds, so they aren't a huge win.  They do become attractive on
> > fast SSDs with <10us read/write latency.
> >
> > These optimizations are aggressive and there is a possibility of
> > introducing regressions.
> >
> > If you have time to pick up this work, try benchmarking each commit
> > individually so performance changes are attributed individually.
> > There's no need to send them together in a single patch series, the
> > changes are quite independent.
> 
> It took me a while to find a way to get meaningful numbers to evaluate
> those optimizations. The problem is that here (Xeon E5-2640 v3 and EPYC
> 7351P) the cost of event_notifier_set() is just ~0.4us when the code
> path is hot, and it's hard to differentiate it from the noise.
> 
> To do so, I've used a patched kernel with a naive io_poll implementation
> for virtio_blk [1], a QEMU also patched with poll-inflight [2] (just to
> be sure we're polling), and ran the test on semi-isolated cores
> (nohz_full + rcu_nocbs + systemd_isolation) with idle siblings. The
> storage is simulated by null_blk with "completion_nsec=0 no_sched=1
> irqmode=0".
> 
> # fio --time_based --runtime=30 --rw=randread --name=randread \
>  --filename=/dev/vdb --direct=1 --ioengine=pvsync2 --iodepth=1 --hipri=1
> 
> | avg_lat (us) | master | qbsn* |
> |   run1   | 11.32  | 10.96 |
> |   run2   | 11.37  | 10.79 |
> |   run3   | 11.42  | 10.67 |
> |   run4   | 11.32  | 11.06 |
> |   run5   | 11.42  | 11.19 |
> |   run6   | 11.42  | 10.91 |
 * qbsn = patched with "aio: add optimized qemu_bh_schedule_nested() API"
> 
> Even though there's still some variance in the numbers, the 0.4us
> improvement is clearly visible.
> 
> I haven't tested the other 3 patches, as their optimizations only take
> effect when the event loop is not running in polling mode. Without
> polling, we get an additional overhead of at least 10us, plus a lot of
> noise, due to both direct costs (ppoll()...) and indirect ones
> (re-scheduling and TLB/cache pollution), so I don't think we can
> reliably benchmark them. Their impact probably won't be significant
> either, due to the costs I've just mentioned.

Thanks for benchmarking them.  We can leave them for now, since there is
a risk of introducing bugs and they don't make a great difference.

Stefan

> Sergio.
> 
> [1] 
> https://github.com/slp/linux/commit/d369b37db3e298933e8bb88c6eeacff07f39bc13
> [2] https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg00447.html






Re: [Qemu-devel] QEMU event loop optimizations

2019-04-05 Thread Sergio Lopez

Paolo Bonzini writes:

> On 26/03/19 14:18, Stefan Hajnoczi wrote:
>> Hi Sergio,
>> Here are the forgotten event loop optimizations I mentioned:
>> 
>>   https://github.com/stefanha/qemu/commits/event-loop-optimizations
>> 
>> The goal was to eliminate or reorder syscalls so that useful work (like
>> executing BHs) occurs as soon as possible after an event is detected.
>> 
>> I remember that these optimizations only shave off a handful of
>> microseconds, so they aren't a huge win.  They do become attractive on
>> fast SSDs with <10us read/write latency.
>> 
>> These optimizations are aggressive and there is a possibility of
>> introducing regressions.
>> 
>> If you have time to pick up this work, try benchmarking each commit
>> individually so performance changes are attributed individually.
>> There's no need to send them together in a single patch series, the
>> changes are quite independent.
>
> I reviewed the patches now:
>
> - qemu_bh_schedule_nested should not be necessary since we have 
> ctx->notify_me to also avoid the event_notifier_set call.  However, it 
> is possible to avoid the smp_mb at the beginning of aio_notify, since  
> atomic_xchg already implies it.  Maybe add a "static void 
> aio_notify__after_smp_mb"?

try_poll_mode() is called with ctx->notify_me != 0, so we get at least
one event_notifier_set() call while working in polling mode.

Sergio.




Re: [Qemu-devel] QEMU event loop optimizations

2019-04-05 Thread Sergio Lopez

Stefan Hajnoczi writes:

> Hi Sergio,
> Here are the forgotten event loop optimizations I mentioned:
>
>   https://github.com/stefanha/qemu/commits/event-loop-optimizations
>
> The goal was to eliminate or reorder syscalls so that useful work (like
> executing BHs) occurs as soon as possible after an event is detected.
>
> I remember that these optimizations only shave off a handful of
> microseconds, so they aren't a huge win.  They do become attractive on
> fast SSDs with <10us read/write latency.
>
> These optimizations are aggressive and there is a possibility of
> introducing regressions.
>
> If you have time to pick up this work, try benchmarking each commit
> individually so performance changes are attributed individually.
> There's no need to send them together in a single patch series, the
> changes are quite independent.

It took me a while to find a way to get meaningful numbers to evaluate
those optimizations. The problem is that here (Xeon E5-2640 v3 and EPYC
7351P) the cost of event_notifier_set() is just ~0.4us when the code
path is hot, and it's hard to differentiate it from the noise.

To do so, I've used a patched kernel with a naive io_poll implementation
for virtio_blk [1], a QEMU also patched with poll-inflight [2] (just to
be sure we're polling), and ran the test on semi-isolated cores
(nohz_full + rcu_nocbs + systemd_isolation) with idle siblings. The
storage is simulated by null_blk with "completion_nsec=0 no_sched=1
irqmode=0".

# fio --time_based --runtime=30 --rw=randread --name=randread \
 --filename=/dev/vdb --direct=1 --ioengine=pvsync2 --iodepth=1 --hipri=1

| avg_lat (us) | master | qbsn* |
|   run1   | 11.32  | 10.96 |
|   run2   | 11.37  | 10.79 |
|   run3   | 11.42  | 10.67 |
|   run4   | 11.32  | 11.06 |
|   run5   | 11.42  | 11.19 |
|   run6   | 11.42  | 10.91 |
 * qbsn = patched with "aio: add optimized qemu_bh_schedule_nested() API"

Even though there's still some variance in the numbers, the 0.4us
improvement is clearly visible.

I haven't tested the other 3 patches, as their optimizations only take
effect when the event loop is not running in polling mode. Without
polling, we get an additional overhead of at least 10us, plus a lot of
noise, due to both direct costs (ppoll()...) and indirect ones
(re-scheduling and TLB/cache pollution), so I don't think we can
reliably benchmark them. Their impact probably won't be significant
either, due to the costs I've just mentioned.

Sergio.

[1] https://github.com/slp/linux/commit/d369b37db3e298933e8bb88c6eeacff07f39bc13
[2] https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg00447.html




Re: [Qemu-devel] QEMU event loop optimizations

2019-04-02 Thread Paolo Bonzini
On 02/04/19 18:18, Kevin Wolf wrote:
> Am 26.03.2019 um 15:11 hat Paolo Bonzini geschrieben:
>> - but actually (and a precursor to using IOCB_CMD_POLL) it should be
>> possible to have just one LinuxAioState per AioContext, and then
>> it can simply share the AioContext's EventNotifier.  This removes
>> the need to do the event_notifier_test_and_clear in linux-aio.c.
> 
> Isn't having only one LinuxAioState per AioContext what we already do?
> See aio_get_linux_aio().

And I should have known that:

commit 0187f5c9cb172771ba85c66e3bf61f8cde6d6561
Author: Paolo Bonzini 
Date:   Mon Jul 4 18:33:20 2016 +0200

linux-aio: share one LinuxAioState within an AioContext

This has better performance because it executes fewer system calls
and does not use a bottom half per disk.

Originally proposed by Ming Lei.

The second point, which is to share the AioContext's EventNotifier,
still stands.

Paolo




Re: [Qemu-devel] QEMU event loop optimizations

2019-04-02 Thread Kevin Wolf
Am 26.03.2019 um 15:11 hat Paolo Bonzini geschrieben:
> - but actually (and a precursor to using IOCB_CMD_POLL) it should be
> possible to have just one LinuxAioState per AioContext, and then
> it can simply share the AioContext's EventNotifier.  This removes
> the need to do the event_notifier_test_and_clear in linux-aio.c.

Isn't having only one LinuxAioState per AioContext what we already do?
See aio_get_linux_aio().

Kevin



Re: [Qemu-devel] QEMU event loop optimizations

2019-03-27 Thread Stefan Hajnoczi
On Tue, Mar 26, 2019 at 02:37:35PM +0100, Paolo Bonzini wrote:
> On 26/03/19 14:18, Stefan Hajnoczi wrote:
> > Hi Sergio,
> > Here are the forgotten event loop optimizations I mentioned:
> > 
> >   https://github.com/stefanha/qemu/commits/event-loop-optimizations
> > 
> > The goal was to eliminate or reorder syscalls so that useful work (like
> > executing BHs) occurs as soon as possible after an event is detected.
> > 
> > I remember that these optimizations only shave off a handful of
> > microseconds, so they aren't a huge win.  They do become attractive on
> > fast SSDs with <10us read/write latency.
> > 
> > These optimizations are aggressive and there is a possibility of
> > introducing regressions.
> > 
> > If you have time to pick up this work, try benchmarking each commit
> > individually so performance changes are attributed individually.
> > There's no need to send them together in a single patch series, the
> > changes are quite independent.
> 
> Another thing to consider is IOCB_CMD_POLL, and replacing poll() with
> io_getevents(min_nr=1, nr=1).  If more than one event is ready, it can
> be read directly from the ring buffer.

Hey, cool.  io_uring has an equivalent operation too.  We'll add it as a
stretch goal to the GSoC/Outreachy project.

> Maybe we can even remove the little-tested epoll() path in favor of this
> one.

That would be nice if the performance results are favorable.

Stefan




Re: [Qemu-devel] QEMU event loop optimizations

2019-03-26 Thread Paolo Bonzini
On 26/03/19 14:18, Stefan Hajnoczi wrote:
> Hi Sergio,
> Here are the forgotten event loop optimizations I mentioned:
> 
>   https://github.com/stefanha/qemu/commits/event-loop-optimizations
> 
> The goal was to eliminate or reorder syscalls so that useful work (like
> executing BHs) occurs as soon as possible after an event is detected.
> 
> I remember that these optimizations only shave off a handful of
> microseconds, so they aren't a huge win.  They do become attractive on
> fast SSDs with <10us read/write latency.
> 
> These optimizations are aggressive and there is a possibility of
> introducing regressions.
> 
> If you have time to pick up this work, try benchmarking each commit
> individually so performance changes are attributed individually.
> There's no need to send them together in a single patch series, the
> changes are quite independent.

I reviewed the patches now:

- qemu_bh_schedule_nested should not be necessary since we have 
ctx->notify_me to also avoid the event_notifier_set call.  However, it 
is possible to avoid the smp_mb at the beginning of aio_notify, since  
atomic_xchg already implies it.  Maybe add a "static void 
aio_notify__after_smp_mb"?

- another possible optimization for latency is to delay 
event_notifier_test_and_clear until the very last moment:

diff --git a/block/linux-aio.c b/block/linux-aio.c
index d4b61fb251..5c80ceb85f 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -200,15 +207,16 @@ io_getevents_advance_and_peek(io_context_t ctx,
  * can be called again in a nested event loop.  When there are no events left
  * to complete the BH is being canceled.
  */
-static void qemu_laio_process_completions(LinuxAioState *s)
+static void qemu_laio_process_completions(LinuxAioState *s, bool clear)
 {
 struct io_event *events;
 
 /* Reschedule so nested event loops see currently pending completions */
 qemu_bh_schedule(s->completion_bh);
 
-while ((s->event_max = io_getevents_advance_and_peek(s->ctx, &events,
- s->event_idx))) {
+do {
+s->event_max = io_getevents_advance_and_peek(s->ctx, &events,
+ s->event_idx);
 for (s->event_idx = 0; s->event_idx < s->event_max; ) {
 struct iocb *iocb = events[s->event_idx].obj;
 struct qemu_laiocb *laiocb =
@@ -221,21 +229,30 @@ static void qemu_laio_process_completions(LinuxAioState *s)
 s->event_idx++;
 qemu_laio_process_completion(laiocb);
 }
-}
+
+if (!s->event_max && clear) {
+/* Clearing the eventfd is expensive, only do it once.  */
+clear = false;
+event_notifier_test_and_clear(&s->e);
+/* Check one last time after the EventNotifier's trailing edge.  */
+s->event_max = io_getevents_peek(ctx, &events);
+}
+} while (s->event_max);
 
 qemu_bh_cancel(s->completion_bh);
 
 /* If we are nested we have to notify the level above that we are done
  * by setting event_max to zero, upper level will then jump out of it's
- * own `for` loop.  If we are the last all counters droped to zero. */
+ * own `for` loop.  If we are the last all counters dropped to zero. */
 s->event_max = 0;
 s->event_idx = 0;
 }
 
-static void qemu_laio_process_completions_and_submit(LinuxAioState *s)
+static void qemu_laio_process_completions_and_submit(LinuxAioState *s,
+ bool clear)
 {
 aio_context_acquire(s->aio_context);
-qemu_laio_process_completions(s);
+qemu_laio_process_completions(s, clear);
 
 if (!s->io_q.plugged && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
 ioq_submit(s);
@@ -247,16 +264,14 @@ static void qemu_laio_completion_bh(void *opaque)
 {
 LinuxAioState *s = opaque;
 
-qemu_laio_process_completions_and_submit(s);
+qemu_laio_process_completions_and_submit(s, true);
 }
 
 static void qemu_laio_completion_cb(EventNotifier *e)
 {
 LinuxAioState *s = container_of(e, LinuxAioState, e);
 
-if (event_notifier_test_and_clear(&s->e)) {
-qemu_laio_process_completions_and_submit(s);
-}
+qemu_laio_process_completions_and_submit(s, true);
 }
 
 static bool qemu_laio_poll_cb(void *opaque)
@@ -269,7 +284,7 @@ static bool qemu_laio_poll_cb(void *opaque)
 return false;
 }
 
-qemu_laio_process_completions_and_submit(s);
+qemu_laio_process_completions_and_submit(s, false);
 return true;
 }
 
- but actually (and a precursor to using IOCB_CMD_POLL) it should be
possible to have just one LinuxAioState per AioContext, and then
it can simply share the AioContext's EventNotifier.  This removes
the need to do the event_notifier_test_and_clear in linux-aio.c.

Paolo



Re: [Qemu-devel] QEMU event loop optimizations

2019-03-26 Thread Paolo Bonzini
On 26/03/19 14:18, Stefan Hajnoczi wrote:
> Hi Sergio,
> Here are the forgotten event loop optimizations I mentioned:
> 
>   https://github.com/stefanha/qemu/commits/event-loop-optimizations
> 
> The goal was to eliminate or reorder syscalls so that useful work (like
> executing BHs) occurs as soon as possible after an event is detected.
> 
> I remember that these optimizations only shave off a handful of
> microseconds, so they aren't a huge win.  They do become attractive on
> fast SSDs with <10us read/write latency.
> 
> These optimizations are aggressive and there is a possibility of
> introducing regressions.
> 
> If you have time to pick up this work, try benchmarking each commit
> individually so performance changes are attributed individually.
> There's no need to send them together in a single patch series, the
> changes are quite independent.

Another thing to consider is IOCB_CMD_POLL, and replacing poll() with
io_getevents(min_nr=1, nr=1).  If more than one event is ready, it can
be read directly from the ring buffer.

Maybe we can even remove the little-tested epoll() path in favor of this
one.

Paolo