Re: [Qemu-devel] QEMU event loop optimizations
On 05/04/19 18:33, Sergio Lopez wrote:
>> - qemu_bh_schedule_nested should not be necessary since we have
>> ctx->notify_me to also avoid the event_notifier_set call.  However, it
>> is possible to avoid the smp_mb at the beginning of aio_notify, since
>> atomic_xchg already implies it.  Maybe add a "static void
>> aio_notify__after_smp_mb"?
>
> try_poll_mode() is called with ctx->notify_me != 0, so we get at least
> one event_notifier_set() call while working in polling mode.

Right, but there is also the idea of making ctx->notify_me and
ctx->notified bitmasks, where bit 0 is "BH ready" and bit 1 is "poll()
parameters changed".  try_poll_mode() cares about the latter, but does
not need event_notifier_set() for BH ready.

Paolo
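To make the bitmask idea concrete, here is a minimal standalone sketch.
The NOTIFY_* names, the Ctx type and the ctx_notify() helper are
hypothetical, not QEMU API; QEMU would use its own atomics and an
EventNotifier rather than a bare eventfd:

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

enum {
    NOTIFY_BH_READY    = 1u << 0,  /* a bottom half became runnable */
    NOTIFY_POLL_PARAMS = 1u << 1,  /* ppoll() fd set or timeout changed */
};

typedef struct {
    atomic_uint notify_me;  /* event classes the poller must be woken for */
    atomic_uint notified;   /* event classes that fired since last check */
    int eventfd;
} Ctx;

static void ctx_notify(Ctx *ctx, unsigned bits)
{
    /* fetch_or is a full-barrier RMW, so no separate memory fence is
     * needed before reading notify_me. */
    atomic_fetch_or(&ctx->notified, bits);

    /* Wake the poller only if it declared interest in these classes. */
    if (atomic_load(&ctx->notify_me) & bits) {
        uint64_t one = 1;
        ssize_t r = write(ctx->eventfd, &one, sizeof(one));
        (void)r;
    }
}

int main(void)
{
    Ctx ctx = { .eventfd = eventfd(0, 0) };

    /* A polling loop checks BHs itself; it only needs a wakeup when the
     * ppoll() parameters change. */
    atomic_store(&ctx.notify_me, NOTIFY_POLL_PARAMS);

    ctx_notify(&ctx, NOTIFY_BH_READY);     /* sets the bit, no syscall */
    ctx_notify(&ctx, NOTIFY_POLL_PARAMS);  /* sets the bit, writes eventfd */

    printf("notified = 0x%x\n", atomic_load(&ctx.notified));
    close(ctx.eventfd);
    return 0;
}

Under this scheme a polling loop would leave NOTIFY_BH_READY out of
notify_me, so qemu_bh_schedule() never pays for an eventfd write while
the loop is busy-polling.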
Re: [Qemu-devel] QEMU event loop optimizations
On Fri, Apr 05, 2019 at 06:29:49PM +0200, Sergio Lopez wrote:
> 
> Stefan Hajnoczi writes:
> 
> > Hi Sergio,
> > Here are the forgotten event loop optimizations I mentioned:
> >
> > https://github.com/stefanha/qemu/commits/event-loop-optimizations
> >
> > The goal was to eliminate or reorder syscalls so that useful work (like
> > executing BHs) occurs as soon as possible after an event is detected.
> >
> > I remember that these optimizations only shave off a handful of
> > microseconds, so they aren't a huge win.  They do become attractive on
> > fast SSDs with <10us read/write latency.
> >
> > These optimizations are aggressive and there is a possibility of
> > introducing regressions.
> >
> > If you have time to pick up this work, try benchmarking each commit
> > individually so performance changes are attributed individually.
> > There's no need to send them together in a single patch series, the
> > changes are quite independent.
> 
> It took me a while to find a way to get meaningful numbers to evaluate
> those optimizations.  The problem is that here (Xeon E5-2640 v3 and EPYC
> 7351P) the cost of event_notifier_set() is just ~0.4us when the code
> path is hot, and it's hard to differentiate it from the noise.
> 
> To do so, I've used a patched kernel with a naive io_poll implementation
> for virtio_blk [1], a QEMU patched with poll-inflight [2] (just to be
> sure we're polling), and ran the test on semi-isolated cores (nohz_full
> + rcu_nocbs + systemd_isolation) with idle siblings.  The storage is
> simulated by null_blk with "completion_nsec=0 no_sched=1 irqmode=0".
> 
> # fio --time_based --runtime=30 --rw=randread --name=randread \
>   --filename=/dev/vdb --direct=1 --ioengine=pvsync2 --iodepth=1 --hipri=1
> 
> | avg_lat (us) | master | qbsn* |
> | run1         | 11.32  | 10.96 |
> | run2         | 11.37  | 10.79 |
> | run3         | 11.42  | 10.67 |
> | run4         | 11.32  | 11.06 |
> | run5         | 11.42  | 11.19 |
> | run6         | 11.42  | 10.91 |
> 
>  * patched with "aio: add optimized qemu_bh_schedule_nested() API"
> 
> Even though there's still some variance in the numbers, the 0.4us
> improvement can be clearly appreciated.
> 
> I haven't tested the other 3 patches, as their optimizations only have
> an effect when the event loop is not running in polling mode.  Without
> polling, we get an additional overhead of at least 10us, plus a lot of
> noise, due to both direct costs (ppoll()...) and indirect ones
> (re-scheduling and TLB/cache pollution), so I don't think we can
> reliably benchmark them.  Their impact probably won't be significant
> either, due to the costs I've just mentioned.

Thanks for benchmarking them.  We can leave them for now, since there is
a risk of introducing bugs and they don't make a great difference.

Stefan

> Sergio.
> 
> [1] https://github.com/slp/linux/commit/d369b37db3e298933e8bb88c6eeacff07f39bc13
> [2] https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg00447.html
Re: [Qemu-devel] QEMU event loop optimizations
Paolo Bonzini writes:

> On 26/03/19 14:18, Stefan Hajnoczi wrote:
>> Hi Sergio,
>> Here are the forgotten event loop optimizations I mentioned:
>>
>> https://github.com/stefanha/qemu/commits/event-loop-optimizations
>>
>> The goal was to eliminate or reorder syscalls so that useful work (like
>> executing BHs) occurs as soon as possible after an event is detected.
>>
>> I remember that these optimizations only shave off a handful of
>> microseconds, so they aren't a huge win.  They do become attractive on
>> fast SSDs with <10us read/write latency.
>>
>> These optimizations are aggressive and there is a possibility of
>> introducing regressions.
>>
>> If you have time to pick up this work, try benchmarking each commit
>> individually so performance changes are attributed individually.
>> There's no need to send them together in a single patch series, the
>> changes are quite independent.
>
> I reviewed the patches now:
>
> - qemu_bh_schedule_nested should not be necessary since we have
>   ctx->notify_me to also avoid the event_notifier_set call.  However, it
>   is possible to avoid the smp_mb at the beginning of aio_notify, since
>   atomic_xchg already implies it.  Maybe add a "static void
>   aio_notify__after_smp_mb"?

try_poll_mode() is called with ctx->notify_me != 0, so we get at least
one event_notifier_set() call while working in polling mode.

Sergio.
Re: [Qemu-devel] QEMU event loop optimizations
Stefan Hajnoczi writes:

> Hi Sergio,
> Here are the forgotten event loop optimizations I mentioned:
>
> https://github.com/stefanha/qemu/commits/event-loop-optimizations
>
> The goal was to eliminate or reorder syscalls so that useful work (like
> executing BHs) occurs as soon as possible after an event is detected.
>
> I remember that these optimizations only shave off a handful of
> microseconds, so they aren't a huge win.  They do become attractive on
> fast SSDs with <10us read/write latency.
>
> These optimizations are aggressive and there is a possibility of
> introducing regressions.
>
> If you have time to pick up this work, try benchmarking each commit
> individually so performance changes are attributed individually.
> There's no need to send them together in a single patch series, the
> changes are quite independent.

It took me a while to find a way to get meaningful numbers to evaluate
those optimizations.  The problem is that here (Xeon E5-2640 v3 and EPYC
7351P) the cost of event_notifier_set() is just ~0.4us when the code
path is hot, and it's hard to differentiate it from the noise.

To do so, I've used a patched kernel with a naive io_poll implementation
for virtio_blk [1], a QEMU patched with poll-inflight [2] (just to be
sure we're polling), and ran the test on semi-isolated cores (nohz_full
+ rcu_nocbs + systemd_isolation) with idle siblings.  The storage is
simulated by null_blk with "completion_nsec=0 no_sched=1 irqmode=0".

# fio --time_based --runtime=30 --rw=randread --name=randread \
  --filename=/dev/vdb --direct=1 --ioengine=pvsync2 --iodepth=1 --hipri=1

| avg_lat (us) | master | qbsn* |
| run1         | 11.32  | 10.96 |
| run2         | 11.37  | 10.79 |
| run3         | 11.42  | 10.67 |
| run4         | 11.32  | 11.06 |
| run5         | 11.42  | 11.19 |
| run6         | 11.42  | 10.91 |

 * patched with "aio: add optimized qemu_bh_schedule_nested() API"

Even though there's still some variance in the numbers, the 0.4us
improvement can be clearly appreciated.

I haven't tested the other 3 patches, as their optimizations only have
an effect when the event loop is not running in polling mode.  Without
polling, we get an additional overhead of at least 10us, plus a lot of
noise, due to both direct costs (ppoll()...) and indirect ones
(re-scheduling and TLB/cache pollution), so I don't think we can
reliably benchmark them.  Their impact probably won't be significant
either, due to the costs I've just mentioned.

Sergio.

[1] https://github.com/slp/linux/commit/d369b37db3e298933e8bb88c6eeacff07f39bc13
[2] https://lists.nongnu.org/archive/html/qemu-devel/2019-04/msg00447.html
Re: [Qemu-devel] QEMU event loop optimizations
On 02/04/19 18:18, Kevin Wolf wrote:
> Am 26.03.2019 um 15:11 hat Paolo Bonzini geschrieben:
>> - but actually (and a precursor to using IOCB_CMD_POLL) it should be
>>   possible to have just one LinuxAioState per AioContext, and then
>>   it can simply share the AioContext's EventNotifier.  This removes
>>   the need to do the event_notifier_test_and_clear in linux-aio.c.
>
> Isn't having only one LinuxAioState per AioContext what we already do?
> See aio_get_linux_aio().

And I should have known that:

commit 0187f5c9cb172771ba85c66e3bf61f8cde6d6561
Author: Paolo Bonzini
Date:   Mon Jul 4 18:33:20 2016 +0200

    linux-aio: share one LinuxAioState within an AioContext

    This has better performance because it executes fewer system calls
    and does not use a bottom half per disk.

    Originally proposed by Ming Lei.

The second point, which is to share the AioContext's EventNotifier,
still stands.

Paolo
Re: [Qemu-devel] QEMU event loop optimizations
Am 26.03.2019 um 15:11 hat Paolo Bonzini geschrieben:
> - but actually (and a precursor to using IOCB_CMD_POLL) it should be
>   possible to have just one LinuxAioState per AioContext, and then
>   it can simply share the AioContext's EventNotifier.  This removes
>   the need to do the event_notifier_test_and_clear in linux-aio.c.

Isn't having only one LinuxAioState per AioContext what we already do?
See aio_get_linux_aio().

Kevin
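For context, the accessor Kevin points to looks roughly like this
(paraphrased from util/async.c as introduced by the commit Paolo cites
above; check the tree for the exact signature):

/* One LinuxAioState is created lazily and cached in the AioContext,
 * so every disk attached to the context shares it. */
LinuxAioState *aio_get_linux_aio(AioContext *ctx)
{
    if (!ctx->linux_aio) {
        ctx->linux_aio = laio_init();
        laio_attach_aio_context(ctx->linux_aio, ctx);
    }
    return ctx->linux_aio;
}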
Re: [Qemu-devel] QEMU event loop optimizations
On Tue, Mar 26, 2019 at 02:37:35PM +0100, Paolo Bonzini wrote:
> On 26/03/19 14:18, Stefan Hajnoczi wrote:
> > Hi Sergio,
> > Here are the forgotten event loop optimizations I mentioned:
> >
> > https://github.com/stefanha/qemu/commits/event-loop-optimizations
> >
> > The goal was to eliminate or reorder syscalls so that useful work (like
> > executing BHs) occurs as soon as possible after an event is detected.
> >
> > I remember that these optimizations only shave off a handful of
> > microseconds, so they aren't a huge win.  They do become attractive on
> > fast SSDs with <10us read/write latency.
> >
> > These optimizations are aggressive and there is a possibility of
> > introducing regressions.
> >
> > If you have time to pick up this work, try benchmarking each commit
> > individually so performance changes are attributed individually.
> > There's no need to send them together in a single patch series, the
> > changes are quite independent.
> 
> Another thing to consider is IOCB_CMD_POLL, and replacing poll() with
> io_getevents(min_nr=1, nr=1).  If more than one event is ready, it can
> be read directly from the ring buffer.

Hey, cool.  io_uring has an equivalent operation too.  We'll add it as a
stretch goal to the GSoC/Outreachy project.

> Maybe we can even remove the little-tested epoll() path in favor of this
> one.

That would be nice if the performance results are favorable.

Stefan
Re: [Qemu-devel] QEMU event loop optimizations
On 26/03/19 14:18, Stefan Hajnoczi wrote:
> Hi Sergio,
> Here are the forgotten event loop optimizations I mentioned:
>
> https://github.com/stefanha/qemu/commits/event-loop-optimizations
>
> The goal was to eliminate or reorder syscalls so that useful work (like
> executing BHs) occurs as soon as possible after an event is detected.
>
> I remember that these optimizations only shave off a handful of
> microseconds, so they aren't a huge win.  They do become attractive on
> fast SSDs with <10us read/write latency.
>
> These optimizations are aggressive and there is a possibility of
> introducing regressions.
>
> If you have time to pick up this work, try benchmarking each commit
> individually so performance changes are attributed individually.
> There's no need to send them together in a single patch series, the
> changes are quite independent.

I reviewed the patches now:

- qemu_bh_schedule_nested should not be necessary since we have
  ctx->notify_me to also avoid the event_notifier_set call.  However, it
  is possible to avoid the smp_mb at the beginning of aio_notify, since
  atomic_xchg already implies it.  Maybe add a "static void
  aio_notify__after_smp_mb"?  (A sketch follows this message.)

- another possible optimization for latency is to delay
  event_notifier_test_and_clear until the very last moment:

diff --git a/block/linux-aio.c b/block/linux-aio.c
index d4b61fb251..5c80ceb85f 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -200,15 +207,16 @@ io_getevents_advance_and_peek(io_context_t ctx,
  * can be called again in a nested event loop.  When there are no events left
  * to complete the BH is being canceled.
  */
-static void qemu_laio_process_completions(LinuxAioState *s)
+static void qemu_laio_process_completions(LinuxAioState *s, bool clear)
 {
     struct io_event *events;
 
     /* Reschedule so nested event loops see currently pending completions */
     qemu_bh_schedule(s->completion_bh);
 
-    while ((s->event_max = io_getevents_advance_and_peek(s->ctx, &events,
-                                                         s->event_idx))) {
+    do {
+        s->event_max = io_getevents_advance_and_peek(s->ctx, &events,
+                                                     s->event_idx);
         for (s->event_idx = 0; s->event_idx < s->event_max; ) {
             struct iocb *iocb = events[s->event_idx].obj;
             struct qemu_laiocb *laiocb =
@@ -221,21 +229,30 @@ static void qemu_laio_process_completions(LinuxAioState *s)
             s->event_idx++;
             qemu_laio_process_completion(laiocb);
         }
-    }
+
+        if (!s->event_max && clear) {
+            /* Clearing the eventfd is expensive, only do it once. */
+            clear = false;
+            event_notifier_test_and_clear(&s->e);
+            /* Check one last time after the EventNotifier's trailing edge. */
+            s->event_max = io_getevents_peek(s->ctx, &events);
+        }
+    } while (s->event_max);
 
     qemu_bh_cancel(s->completion_bh);
 
     /* If we are nested we have to notify the level above that we are done
      * by setting event_max to zero, upper level will then jump out of it's
-     * own `for` loop.  If we are the last all counters droped to zero. */
+     * own `for` loop.  If we are the last all counters dropped to zero. */
     s->event_max = 0;
     s->event_idx = 0;
 }
 
-static void qemu_laio_process_completions_and_submit(LinuxAioState *s)
+static void qemu_laio_process_completions_and_submit(LinuxAioState *s,
+                                                     bool clear)
 {
     aio_context_acquire(s->aio_context);
-    qemu_laio_process_completions(s);
+    qemu_laio_process_completions(s, clear);
 
     if (!s->io_q.plugged && !QSIMPLEQ_EMPTY(&s->io_q.pending)) {
         ioq_submit(s);
@@ -247,16 +264,14 @@ static void qemu_laio_completion_bh(void *opaque)
 {
     LinuxAioState *s = opaque;
 
-    qemu_laio_process_completions_and_submit(s);
+    qemu_laio_process_completions_and_submit(s, true);
 }
 
 static void qemu_laio_completion_cb(EventNotifier *e)
 {
     LinuxAioState *s = container_of(e, LinuxAioState, e);
 
-    if (event_notifier_test_and_clear(&s->e)) {
-        qemu_laio_process_completions_and_submit(s);
-    }
+    qemu_laio_process_completions_and_submit(s, true);
 }
 
 static bool qemu_laio_poll_cb(void *opaque)
@@ -269,7 +284,7 @@ static bool qemu_laio_poll_cb(void *opaque)
         return false;
     }
 
-    qemu_laio_process_completions_and_submit(s);
+    qemu_laio_process_completions_and_submit(s, false);
 
     return true;
 }

- but actually (and a precursor to using IOCB_CMD_POLL) it should be
  possible to have just one LinuxAioState per AioContext, and then
  it can simply share the AioContext's EventNotifier.  This removes
  the need to do the event_notifier_test_and_clear in linux-aio.c.

Paolo
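On the first point, the suggested split could look roughly like the
following, written against the 2019-era aio_notify() in util/async.c.
The double-underscore name comes from the thread, not from QEMU, and
field names may differ in the current tree:

/* Sketch only: callers that just performed a full-barrier RMW (such as
 * the atomic_xchg in qemu_bh_schedule) may skip aio_notify()'s leading
 * smp_mb() by calling this variant directly. */
static void aio_notify__after_smp_mb(AioContext *ctx)
{
    if (atomic_read(&ctx->notify_me)) {
        event_notifier_set(&ctx->notifier);
        atomic_mb_set(&ctx->notified, true);
    }
}

void aio_notify(AioContext *ctx)
{
    /* Order bh->scheduled and similar writes before the notify_me read. */
    smp_mb();
    aio_notify__after_smp_mb(ctx);
}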
Re: [Qemu-devel] QEMU event loop optimizations
On 26/03/19 14:18, Stefan Hajnoczi wrote:
> Hi Sergio,
> Here are the forgotten event loop optimizations I mentioned:
>
> https://github.com/stefanha/qemu/commits/event-loop-optimizations
>
> The goal was to eliminate or reorder syscalls so that useful work (like
> executing BHs) occurs as soon as possible after an event is detected.
>
> I remember that these optimizations only shave off a handful of
> microseconds, so they aren't a huge win.  They do become attractive on
> fast SSDs with <10us read/write latency.
>
> These optimizations are aggressive and there is a possibility of
> introducing regressions.
>
> If you have time to pick up this work, try benchmarking each commit
> individually so performance changes are attributed individually.
> There's no need to send them together in a single patch series, the
> changes are quite independent.

Another thing to consider is IOCB_CMD_POLL, and replacing poll() with
io_getevents(min_nr=1, nr=1).  If more than one event is ready, it can
be read directly from the ring buffer.

Maybe we can even remove the little-tested epoll() path in favor of this
one.

Paolo
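A minimal standalone sketch of the kernel interface referred to here,
using raw syscalls against the Linux AIO ABI (IOCB_CMD_POLL needs Linux
4.18+; this is not QEMU code, and error handling is trimmed):

#include <linux/aio_abi.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

static long sys_io_setup(unsigned nr, aio_context_t *ctx)
{
    return syscall(SYS_io_setup, nr, ctx);
}

static long sys_io_submit(aio_context_t ctx, long n, struct iocb **iocbs)
{
    return syscall(SYS_io_submit, ctx, n, iocbs);
}

static long sys_io_getevents(aio_context_t ctx, long min_nr, long nr,
                             struct io_event *events, struct timespec *ts)
{
    return syscall(SYS_io_getevents, ctx, min_nr, nr, events, ts);
}

int main(void)
{
    aio_context_t ctx = 0;
    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    struct io_event ev;

    if (sys_io_setup(128, &ctx) < 0) {
        perror("io_setup");
        return 1;
    }

    /* One poll request: watch stdin for readability. */
    memset(&cb, 0, sizeof(cb));
    cb.aio_lio_opcode = IOCB_CMD_POLL;
    cb.aio_fildes = STDIN_FILENO;
    cb.aio_buf = POLLIN;            /* the poll mask goes in aio_buf */

    if (sys_io_submit(ctx, 1, cbs) != 1) {
        perror("io_submit");
        return 1;
    }

    /* Block for exactly one completion, replacing poll(); if more events
     * were pending they could be read straight from the completion ring
     * without entering the kernel again. */
    if (sys_io_getevents(ctx, 1, 1, &ev, NULL) == 1) {
        printf("fd ready, revents=0x%llx\n", (unsigned long long)ev.res);
    }
    return 0;
}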