Re: [Qemu-devel] [PATCH v4 00/13] aio: experimental virtio-blk polling mode

2016-12-06 Thread Stefan Hajnoczi
On Mon, Dec 05, 2016 at 10:40:49AM -0600, Karl Rister wrote:
> On 12/05/2016 08:56 AM, Stefan Hajnoczi wrote:
> 
> 
> > Karl: do you have time to run a bigger suite of benchmarks to identify a
> > reasonable default poll-max-ns value?  Both aio=native and aio=threads
> > are important.
> > 
> > If there is a sweet spot that improves performance without pathological
> > cases then we could even enable polling by default in QEMU.
> > 
> > Otherwise we'd just document the recommended best polling duration as a
> > starting point for users.
> > 
> 
> I have collected a baseline on the latest patches and am currently
> collecting poll-max-ns=16384.  I can certainly throw in a few more
> scenarios.  Do we want to stick with powers of 2 or some other strategy?

Excellent, thanks!  The algorithm self-tunes by doubling the poll time, so
there's not much advantage in looking at non-power-of-2 values.
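
To make the doubling concrete, the adjustment is roughly along these lines
(an illustrative sketch only; the field names and exact conditions differ
from the real aio-posix.c code):

#include <stdbool.h>
#include <stdint.h>

struct PollState {
    int64_t poll_ns;      /* current busy-wait window */
    int64_t poll_max_ns;  /* ceiling set via -object iothread,poll-max-ns=... */
};

/* Called after each event loop iteration.  block_ns is how long we ended up
 * waiting in the blocking syscall once polling had given up. */
static void adjust_poll_time(struct PollState *s, bool poll_succeeded,
                             int64_t block_ns)
{
    if (poll_succeeded) {
        return;                               /* the window was long enough */
    }

    if (block_ns < s->poll_max_ns) {
        /* The event arrived shortly after we stopped polling: double the
         * window, which is why power-of-2 values are the natural test points. */
        s->poll_ns = s->poll_ns ? s->poll_ns * 2 : 1;
        if (s->poll_ns > s->poll_max_ns) {
            s->poll_ns = s->poll_max_ns;
        }
    } else {
        /* Events arrive slowly; polling only burns CPU, so back off. */
        s->poll_ns /= 2;
    }
}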

Thanks,
Stefan




Re: [Qemu-devel] [PATCH v4 00/13] aio: experimental virtio-blk polling mode

2016-12-05 Thread Karl Rister
On 12/05/2016 08:56 AM, Stefan Hajnoczi wrote:


> Karl: do you have time to run a bigger suite of benchmarks to identify a
> reasonable default poll-max-ns value?  Both aio=native and aio=threads
> are important.
> 
> If there is a sweet spot that improves performance without pathological
> cases then we could even enable polling by default in QEMU.
> 
> Otherwise we'd just document the recommended best polling duration as a
> starting point for users.
> 

I have collected a baseline on the latest patches and am currently
collecting poll-max-ns=16384.  I can certainly throw in a few more
scenarios.  Do we want to stick with powers of 2 or some other strategy?

-- 
Karl Rister 



Re: [Qemu-devel] [PATCH v4 00/13] aio: experimental virtio-blk polling mode

2016-12-05 Thread Stefan Hajnoczi
On Mon, Dec 05, 2016 at 06:25:49PM +0800, Fam Zheng wrote:
> On Thu, 12/01 19:26, Stefan Hajnoczi wrote:
> > One way to avoid the costly exit is to use polling instead of notification.
> 
> Testing with fio on virtio-blk backed by NVMe device, I can see significant
> performance improvement with this series:
> 
> poll-max-ns    iodepth=1    iodepth=32
> ---------------------------------------
> 0              24806.94     151430.36
> 1              24435.02     155285.59
> 100            24702.41     149937.2
> 1000           24457.08     152936.3
> 10000          24605.05     156158.12
> 13000          24412.95     154908.73
> 16000          30654.24     185731.58
> 18000          36365.69     185538.78
> 20000          35640.48     188640.61
> 30000          37411.05     184198.39
> 60000          35230.66     190427.42
> 
> So this definitely helps synchronous I/O with queue depth = 1. Great!

Nice.  Even with iodepth=32 the improvement is significant.

> I have a small concern with high queue depth, though. The more frequently we
> check the virtqueue, the less likely the requests can be batched, and the more
> submissions (both from guest to QEMU and from QEMU to host) are needed to
> achieve the same bandwidth, because we'd do less plug/unplug. This could be a
> problem under heavy workload.  We may want to consider the driver's transient
> state when it is appending requests to the virtqueue. For example, the
> virtio-blk driver in Linux updates the avail idx after adding each request. If
> QEMU looks at the index in the middle, it will miss the opportunities to
> plug/unplug and merge requests. On the other hand, though the virtio-blk
> driver doesn't have an I/O scheduler, it does have some batch submission
> semantics passed down by blk-mq (see the "notify" condition in
> drivers/block/virtio_blk.c:virtio_queue_rq()).  So I'm wondering whether it
> makes sense to wait for the whole batch of requests to be added to the
> virtqueue before processing it.  This can be done by changing the driver to
> only update the "avail" index after all requests are added to the queue, or
> even by adding a flag in the virtio ring descriptor to suppress busy polling.

Only benchmarking can tell.  It would be interesting to extract the
vq->vring.avail->idx update from virtqueue_add().  I don't think a new
flag is necessary.

> > The main drawback of polling is that it consumes CPU resources.  In order to
> > benefit performance the host must have extra CPU cycles available on physical
> > CPUs that aren't used by the guest.
> > 
> > This is an experimental AioContext polling implementation.  It adds a polling
> > callback into the event loop.  Polling functions are implemented for virtio-blk
> > virtqueue guest->host kick and Linux AIO completion.
> > 
> > The -object iothread,poll-max-ns=NUM parameter sets the number of nanoseconds
> > to poll before entering the usual blocking poll(2) syscall.  Try setting this
> > parameter to the time from old request completion to new virtqueue kick.  By
> > default no polling is done so you must set this parameter to get busy polling.
> 
> Given the self-tuning, should we document the best practice for setting the
> value?  Is it okay for the user or even QEMU to use a relatively large value
> by default and expect it to tune itself sensibly?

Karl: do you have time to run a bigger suite of benchmarks to identify a
reasonable default poll-max-ns value?  Both aio=native and aio=threads
are important.

If there is a sweet spot that improves performance without pathological
cases then we could even enable polling by default in QEMU.

Otherwise we'd just document the recommended best polling duration as a
starting point for users.

Stefan




Re: [Qemu-devel] [PATCH v4 00/13] aio: experimental virtio-blk polling mode

2016-12-05 Thread Christian Borntraeger
On 12/01/2016 08:26 PM, Stefan Hajnoczi wrote:
> v4:
>  * Added poll time self-tuning algorithm [Christian and Paolo]
>  * Try a single iteration of polling to avoid non-blocking 
>    ppoll(2)/epoll_wait(2) [Paolo]
>  * Reordered patches to make performance analysis easier - see below

This looks quite promising. I will give this to our performance folks to get
some more controlled measurements.

> 
> v3:
>  * Avoid ppoll(2)/epoll_wait(2) if polling succeeded [Paolo]
>  * Disable guest->host virtqueue notification during polling [Christian]
>  * Rebased on top of my virtio-blk/scsi virtqueue notification disable patches
> 
> v2:
>  * Uninitialized node->deleted gone [Fam]
>  * Removed 1024 polling loop iteration qemu_clock_get_ns() optimization which
>    created a weird step pattern [Fam]
>  * Unified with AioHandler, dropped AioPollHandler struct [Paolo]
>    (actually I think Paolo had more in mind but this is the first step)
>  * Only poll when all event loop resources support it [Paolo]
>  * Added run_poll_handlers_begin/end trace events for perf analysis
>  * Sorry, Christian, no virtqueue kick suppression yet
> 
> Recent performance investigation work done by Karl Rister shows that the
> guest->host notification takes around 20 us.  This is more than the "overhead"
> of QEMU itself (e.g. block layer).
> 
> One way to avoid the costly exit is to use polling instead of notification.
> The main drawback of polling is that it consumes CPU resources.  In order to
> benefit performance the host must have extra CPU cycles available on physical
> CPUs that aren't used by the guest.
> 
> This is an experimental AioContext polling implementation.  It adds a polling
> callback into the event loop.  Polling functions are implemented for 
> virtio-blk
> virtqueue guest->host kick and Linux AIO completion.
> 
> The -object iothread,poll-max-ns=NUM parameter sets the number of nanoseconds
> to poll before entering the usual blocking poll(2) syscall.  Try setting this
> parameter to the time from old request completion to new virtqueue kick.  By
> default no polling is done so you must set this parameter to get busy polling.
> 
> Patch series overview:
> 
> Here are convenient points in the patch series at which to do performance
> benchmarking to see how various pieces interact.
> 
> 1. Basic -iothread poll-max-ns=NUM parameter without self-tuning:
> 
> aio: add flag to skip fds to aio_dispatch()
> aio: add AioPollFn and io_poll() interface
> aio: add polling mode to AioContext
> virtio: poll virtqueues for new buffers
> linux-aio: poll ring for completions
> iothread: add polling parameters
> 
> 2. My "[PATCH 0/3] virtio: disable notifications in blk and scsi" series is
>needed as a base.  This is unrelated to polling but we'll also disable
>notifications during polling later:
> 
> virtio-blk: suppress virtqueue kick during processing
> virtio-scsi: suppress virtqueue kick during processing
> 
> 3. Disable virtqueue notifications (reduce vmexits) during polling:
> 
> virtio: turn vq->notification into a nested counter
> aio: add .io_poll_begin/end() callbacks
> virtio: disable virtqueue notifications during polling
> 
> 4. Self-tuning, adjusts polling time depending on wait times:
> 
> aio: self-tune polling time
> iothread: add poll-grow and poll-shrink parameters
> 
> Stefan Hajnoczi (13):
>   aio: add flag to skip fds to aio_dispatch()
>   aio: add AioPollFn and io_poll() interface
>   aio: add polling mode to AioContext
>   virtio: poll virtqueues for new buffers
>   linux-aio: poll ring for completions
>   iothread: add polling parameters
>   virtio-blk: suppress virtqueue kick during processing
>   virtio-scsi: suppress virtqueue kick during processing
>   virtio: turn vq->notification into a nested counter
>   aio: add .io_poll_begin/end() callbacks
>   virtio: disable virtqueue notifications during polling
>   aio: self-tune polling time
>   iothread: add poll-grow and poll-shrink parameters
> 
>  include/block/aio.h |  53 +++-
>  include/sysemu/iothread.h   |   5 +
>  aio-posix.c | 308 +++-
>  aio-win32.c |  32 -
>  async.c |  21 ++-
>  block/curl.c|   8 +-
>  block/iscsi.c   |   3 +-
>  block/linux-aio.c   |  19 ++-
>  block/nbd-client.c  |   8 +-
>  block/nfs.c |   7 +-
>  block/sheepdog.c|  26 ++--
>  block/ssh.c |   4 +-
>  block/win32-aio.c   |   4 +-
>  hw/block/virtio-blk.c   |  18 ++-
>  hw/scsi/virtio-scsi.c   |  36 +++---
>  hw/virtio/virtio.c  |  54 ++--
>  iohandler.c |   2 +-
>  iothread.c  |  84 
>  nbd/server.c|   9 +-
>  stubs/set-fd-handler.c  |   1 +
>  tests/test-aio.c|   4 +-
>  util/event_notifier-posix.c |   2 +-
>  trace-events|   6 +
>  23 files changed, 604 insertions(+), 110 deletions(-)

Re: [Qemu-devel] [PATCH v4 00/13] aio: experimental virtio-blk polling mode

2016-12-05 Thread Fam Zheng
On Thu, 12/01 19:26, Stefan Hajnoczi wrote:
> One way to avoid the costly exit is to use polling instead of notification.

Testing with fio on virtio-blk backed by NVMe device, I can see significant
performance improvement with this series:

poll-max-ns    iodepth=1    iodepth=32
---------------------------------------
0              24806.94     151430.36
1              24435.02     155285.59
100            24702.41     149937.2
1000           24457.08     152936.3
10000          24605.05     156158.12
13000          24412.95     154908.73
16000          30654.24     185731.58
18000          36365.69     185538.78
20000          35640.48     188640.61
30000          37411.05     184198.39
60000          35230.66     190427.42

So this definitely helps synchronous I/O with queue depth = 1. Great!

I have a small concern with high queue depth, though. The more frequently we
check the virtqueue, the less likely the requests can be batched, and the more
submissions (both from guest to QEMU and from QEMU to host) are needed to
achieve the same bandwidth, because we'd do less plug/unplug. This could be a
problem under heavy workload.  We may want to consider the driver's transient
state when it is appending requests to the virtqueue. For example, the
virtio-blk driver in Linux updates the avail idx after adding each request. If
QEMU looks at the index in the middle, it will miss the opportunities to
plug/unplug and merge requests. On the other hand, though the virtio-blk
driver doesn't have an I/O scheduler, it does have some batch submission
semantics passed down by blk-mq (see the "notify" condition in
drivers/block/virtio_blk.c:virtio_queue_rq()).  So I'm wondering whether it
makes sense to wait for the whole batch of requests to be added to the
virtqueue before processing it.  This can be done by changing the driver to
only update the "avail" index after all requests are added to the queue, or
even by adding a flag in the virtio ring descriptor to suppress busy polling.
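
Roughly, the driver-side change I have in mind looks like this (purely an
illustrative sketch with made-up types and helpers, not the actual Linux
virtio-blk code):

#include <stdint.h>

struct avail_ring {
    uint16_t flags;
    uint16_t idx;                 /* the index that a polling QEMU watches */
};

struct fake_vring {
    struct avail_ring *avail;
    uint16_t avail_idx_shadow;    /* driver-local copy, not yet visible */
};

struct fake_req { int dummy; };

static void add_descriptors(struct fake_vring *vr, struct fake_req *req,
                            uint16_t slot)
{
    /* fill the descriptor chain for req at ring position slot (omitted) */
    (void)vr; (void)req; (void)slot;
}

/* Publish the whole blk-mq batch with a single avail->idx update instead of
 * one update per request, so a polling QEMU sees the batch at once. */
static void submit_batch(struct fake_vring *vr, struct fake_req *reqs, int n)
{
    uint16_t idx = vr->avail_idx_shadow;
    int i;

    for (i = 0; i < n; i++) {
        add_descriptors(vr, &reqs[i], (uint16_t)(idx + i));
    }

    __atomic_thread_fence(__ATOMIC_RELEASE);  /* descriptors before index */
    vr->avail->idx = (uint16_t)(idx + n);     /* one visible update per batch */
    vr->avail_idx_shadow = (uint16_t)(idx + n);
}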

> The main drawback of polling is that it consumes CPU resources.  In order to
> benefit performance the host must have extra CPU cycles available on physical
> CPUs that aren't used by the guest.
> 
> This is an experimental AioContext polling implementation.  It adds a polling
> callback into the event loop.  Polling functions are implemented for 
> virtio-blk
> virtqueue guest->host kick and Linux AIO completion.
> 
> The -object iothread,poll-max-ns=NUM parameter sets the number of nanoseconds
> to poll before entering the usual blocking poll(2) syscall.  Try setting this
> parameter to the time from old request completion to new virtqueue kick.  By
> default no polling is done so you must set this parameter to get busy polling.

Given the self-tuning, should we document the best practice for setting the
value?  Is it okay for the user or even QEMU to use a relatively large value
by default and expect it to tune itself sensibly?

Fam



Re: [Qemu-devel] [PATCH v4 00/13] aio: experimental virtio-blk polling mode

2016-12-02 Thread Stefan Hajnoczi
On Thu, Dec 01, 2016 at 07:26:39PM +, Stefan Hajnoczi wrote:
> v4:
>  * Added poll time self-tuning algorithm [Christian and Paolo]
>  * Try a single iteration of polling to avoid non-blocking 
>    ppoll(2)/epoll_wait(2) [Paolo]
>  * Reordered patches to make performance analysis easier - see below
> 
> v3:
>  * Avoid ppoll(2)/epoll_wait(2) if polling succeeded [Paolo]
>  * Disable guest->host virtqueue notification during polling [Christian]
>  * Rebased on top of my virtio-blk/scsi virtqueue notification disable patches

Karl Rister found a performance regression going from v2 to v3.

There was a bug in the poll_disable_cnt code in aio_set_fd_handler()
which meant that polling was disabled unnecessarily.  This has been
fixed in v4 and I suspect it was the cause for the regression.
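
For reference, the counter is conceptually just a gate like the following
(an illustrative sketch, not the actual aio-posix.c code):

#include <stdbool.h>
#include <stddef.h>

typedef bool (*AioPollFn)(void *opaque);

typedef struct {
    int poll_disable_cnt;   /* number of handlers that cannot be polled */
} Ctx;

/* Called whenever a handler's io_poll callback is installed or removed. */
static void update_poll_disable_cnt(Ctx *ctx, AioPollFn old_poll,
                                    AioPollFn new_poll)
{
    ctx->poll_disable_cnt += (new_poll == NULL) - (old_poll == NULL);
}

/* Busy polling is only attempted when every handler supports it. */
static bool polling_allowed(const Ctx *ctx)
{
    return ctx->poll_disable_cnt == 0;
}

The bug was that this bookkeeping went wrong, so polling stayed disabled
even though all handlers could be polled.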

Stefan




Re: [Qemu-devel] [PATCH v4 00/13] aio: experimental virtio-blk polling mode

2016-12-02 Thread Paolo Bonzini

> v4:
>  * Added poll time self-tuning algorithm [Christian and Paolo]
>  * Try a single iteration of polling to avoid non-blocking
>  ppoll(2)/epoll_wait(2) [Paolo]
>  * Reordered patches to make performance analysis easier - see below

Reviewed-by: Paolo Bonzini 

> v3:
>  * Avoid ppoll(2)/epoll_wait(2) if polling succeeded [Paolo]
>  * Disable guest->host virtqueue notification during polling [Christian]
>  * Rebased on top of my virtio-blk/scsi virtqueue notification disable
>  patches
> 
> v2:
>  * Uninitialized node->deleted gone [Fam]
>  * Removed 1024 polling loop iteration qemu_clock_get_ns() optimization which
>    created a weird step pattern [Fam]
>  * Unified with AioHandler, dropped AioPollHandler struct [Paolo]
>    (actually I think Paolo had more in mind but this is the first step)
>  * Only poll when all event loop resources support it [Paolo]
>  * Added run_poll_handlers_begin/end trace events for perf analysis
>  * Sorry, Christian, no virtqueue kick suppression yet
> 
> Recent performance investigation work done by Karl Rister shows that the
> guest->host notification takes around 20 us.  This is more than the
> "overhead"
> of QEMU itself (e.g. block layer).
> 
> One way to avoid the costly exit is to use polling instead of notification.
> The main drawback of polling is that it consumes CPU resources.  In order to
> benefit performance the host must have extra CPU cycles available on physical
> CPUs that aren't used by the guest.
> 
> This is an experimental AioContext polling implementation.  It adds a polling
> callback into the event loop.  Polling functions are implemented for
> virtio-blk
> virtqueue guest->host kick and Linux AIO completion.
> 
> The -object iothread,poll-max-ns=NUM parameter sets the number of nanoseconds
> to poll before entering the usual blocking poll(2) syscall.  Try setting this
> parameter to the time from old request completion to new virtqueue kick.  By
> default no polling is done so you must set this parameter to get busy
> polling.
> 
> Patch series overview:
> 
> Here are convenient points in the patch series at which to do performance
> benchmarking to see how various pieces interact.
> 
> 1. Basic -iothread poll-max-ns=NUM parameter without self-tuning:
> 
> aio: add flag to skip fds to aio_dispatch()
> aio: add AioPollFn and io_poll() interface
> aio: add polling mode to AioContext
> virtio: poll virtqueues for new buffers
> linux-aio: poll ring for completions
> iothread: add polling parameters
> 
> 2. My "[PATCH 0/3] virtio: disable notifications in blk and scsi" series is
>needed as a base.  This is unrelated to polling but we'll also disable
>notifications during polling later:
> 
> virtio-blk: suppress virtqueue kick during processing
> virtio-scsi: suppress virtqueue kick during processing
> 
> 3. Disable virtqueue notifications (reduce vmexits) during polling:
> 
> virtio: turn vq->notification into a nested counter
> aio: add .io_poll_begin/end() callbacks
> virtio: disable virtqueue notifications during polling
> 
> 4. Self-tuning, adjusts polling time depending on wait times:
> 
> aio: self-tune polling time
> iothread: add poll-grow and poll-shrink parameters
> 
> Stefan Hajnoczi (13):
>   aio: add flag to skip fds to aio_dispatch()
>   aio: add AioPollFn and io_poll() interface
>   aio: add polling mode to AioContext
>   virtio: poll virtqueues for new buffers
>   linux-aio: poll ring for completions
>   iothread: add polling parameters
>   virtio-blk: suppress virtqueue kick during processing
>   virtio-scsi: suppress virtqueue kick during processing
>   virtio: turn vq->notification into a nested counter
>   aio: add .io_poll_begin/end() callbacks
>   virtio: disable virtqueue notifications during polling
>   aio: self-tune polling time
>   iothread: add poll-grow and poll-shrink parameters
> 
>  include/block/aio.h |  53 +++-
>  include/sysemu/iothread.h   |   5 +
>  aio-posix.c | 308 +++-
>  aio-win32.c |  32 -
>  async.c |  21 ++-
>  block/curl.c|   8 +-
>  block/iscsi.c   |   3 +-
>  block/linux-aio.c   |  19 ++-
>  block/nbd-client.c  |   8 +-
>  block/nfs.c |   7 +-
>  block/sheepdog.c|  26 ++--
>  block/ssh.c |   4 +-
>  block/win32-aio.c   |   4 +-
>  hw/block/virtio-blk.c   |  18 ++-
>  hw/scsi/virtio-scsi.c   |  36 +++---
>  hw/virtio/virtio.c  |  54 ++--
>  iohandler.c |   2 +-
>  iothread.c  |  84 
>  nbd/server.c|   9 +-
>  stubs/set-fd-handler.c  |   1 +
>  tests/test-aio.c|   4 +-
>  util/event_notifier-posix.c |   2 +-
>  trace-events|   6 +
>  23 files changed, 604 insertions(+), 110 deletions(-)
> 
> --
> 2.9.3
> 
> 



[Qemu-devel] [PATCH v4 00/13] aio: experimental virtio-blk polling mode

2016-12-01 Thread Stefan Hajnoczi
v4:
 * Added poll time self-tuning algorithm [Christian and Paolo]
 * Try a single iteration of polling to avoid non-blocking
   ppoll(2)/epoll_wait(2) [Paolo]
 * Reordered patches to make performance analysis easier - see below

v3:
 * Avoid ppoll(2)/epoll_wait(2) if polling succeeded [Paolo]
 * Disable guest->host virtqueue notification during polling [Christian]
 * Rebased on top of my virtio-blk/scsi virtqueue notification disable patches

v2:
 * Uninitialized node->deleted gone [Fam]
 * Removed 1024 polling loop iteration qemu_clock_get_ns() optimization which
   created a weird step pattern [Fam]
 * Unified with AioHandler, dropped AioPollHandler struct [Paolo]
   (actually I think Paolo had more in mind but this is the first step)
 * Only poll when all event loop resources support it [Paolo]
 * Added run_poll_handlers_begin/end trace events for perf analysis
 * Sorry, Christian, no virtqueue kick suppression yet

Recent performance investigation work done by Karl Rister shows that the
guest->host notification takes around 20 us.  This is more than the "overhead"
of QEMU itself (e.g. block layer).

One way to avoid the costly exit is to use polling instead of notification.
The main drawback of polling is that it consumes CPU resources.  In order to
benefit performance the host must have extra CPU cycles available on physical
CPUs that aren't used by the guest.

This is an experimental AioContext polling implementation.  It adds a polling
callback into the event loop.  Polling functions are implemented for virtio-blk
virtqueue guest->host kick and Linux AIO completion.
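
In rough pseudo-C the event loop change boils down to the following (an
illustrative sketch; the real code in aio-posix.c is structured differently):

#include <poll.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

typedef bool (*io_poll_fn)(void *opaque);   /* returns true if work was found */

struct handler {
    io_poll_fn io_poll;
    void *opaque;
};

static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Busy-poll all handlers until one reports progress or the budget expires. */
static bool run_poll_handlers(struct handler *h, int n, int64_t max_ns)
{
    int64_t deadline = now_ns() + max_ns;

    do {
        for (int i = 0; i < n; i++) {
            if (h[i].io_poll && h[i].io_poll(h[i].opaque)) {
                return true;               /* progress made without a syscall */
            }
        }
    } while (now_ns() < deadline);

    return false;
}

static void event_loop_iteration(struct handler *h, int n,
                                 struct pollfd *fds, int nfds,
                                 int64_t poll_max_ns)
{
    if (poll_max_ns > 0 && run_poll_handlers(h, n, poll_max_ns)) {
        return;                            /* polling succeeded, skip poll(2) */
    }
    poll(fds, nfds, -1);                   /* usual blocking path */
    /* ...dispatch ready fds... */
}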

The -object iothread,poll-max-ns=NUM parameter sets the number of nanoseconds
to poll before entering the usual blocking poll(2) syscall.  Try setting this
parameter to the time from old request completion to new virtqueue kick.  By
default no polling is done so you must set this parameter to get busy polling.
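
For example, a command line along these lines enables busy polling on a
dedicated IOThread (the drive and device names are placeholders; only the
iothread syntax matters here):

  qemu-system-x86_64 \
      -object iothread,id=iothread0,poll-max-ns=16384 \
      -drive if=none,id=drive0,file=test.img,format=raw,cache=none,aio=native \
      -device virtio-blk-pci,drive=drive0,iothread=iothread0 \
      ...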

Patch series overview:

Here are convenient points in the patch series at which to do performance
benchmarking to see how various pieces interact.

1. Basic -iothread poll-max-ns=NUM parameter without self-tuning:

aio: add flag to skip fds to aio_dispatch()
aio: add AioPollFn and io_poll() interface
aio: add polling mode to AioContext
virtio: poll virtqueues for new buffers
linux-aio: poll ring for completions
iothread: add polling parameters

2. My "[PATCH 0/3] virtio: disable notifications in blk and scsi" series is
   needed as a base.  This is unrelated to polling but we'll also disable
   notifications during polling later:

virtio-blk: suppress virtqueue kick during processing
virtio-scsi: suppress virtqueue kick during processing

3. Disable virtqueue notifications (reduce vmexits) during polling (a sketch of
   this step follows the overview):

virtio: turn vq->notification into a nested counter
aio: add .io_poll_begin/end() callbacks
virtio: disable virtqueue notifications during polling

4. Self-tuning, adjusts polling time depending on wait times:

aio: self-tune polling time
iothread: add poll-grow and poll-shrink parameters
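
Regarding step 3, the begin/end hooks conceptually amount to the following
(an illustrative sketch with made-up structure and helper names):

#include <stdbool.h>

typedef struct {
    bool notify_enabled;      /* would a guest kick reach us right now? */
    int notify_disable_cnt;   /* the nested counter from step 3 */
} FakeVirtQueue;

/* Nested enable/disable: notifications come back only once every disable
 * has been matched by an enable. */
static void vq_set_notification(FakeVirtQueue *vq, bool enable)
{
    if (enable) {
        if (--vq->notify_disable_cnt == 0) {
            vq->notify_enabled = true;    /* e.g. clear the suppress flag */
        }
    } else {
        if (vq->notify_disable_cnt++ == 0) {
            vq->notify_enabled = false;   /* e.g. set the suppress flag */
        }
    }
}

/* Hooked in around each busy-polling interval: while the event loop polls
 * the ring itself, the guest does not need to kick us. */
static void vq_io_poll_begin(FakeVirtQueue *vq) { vq_set_notification(vq, false); }
static void vq_io_poll_end(FakeVirtQueue *vq)   { vq_set_notification(vq, true); }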

Stefan Hajnoczi (13):
  aio: add flag to skip fds to aio_dispatch()
  aio: add AioPollFn and io_poll() interface
  aio: add polling mode to AioContext
  virtio: poll virtqueues for new buffers
  linux-aio: poll ring for completions
  iothread: add polling parameters
  virtio-blk: suppress virtqueue kick during processing
  virtio-scsi: suppress virtqueue kick during processing
  virtio: turn vq->notification into a nested counter
  aio: add .io_poll_begin/end() callbacks
  virtio: disable virtqueue notifications during polling
  aio: self-tune polling time
  iothread: add poll-grow and poll-shrink parameters

 include/block/aio.h |  53 +++-
 include/sysemu/iothread.h   |   5 +
 aio-posix.c | 308 +++-
 aio-win32.c |  32 -
 async.c |  21 ++-
 block/curl.c|   8 +-
 block/iscsi.c   |   3 +-
 block/linux-aio.c   |  19 ++-
 block/nbd-client.c  |   8 +-
 block/nfs.c |   7 +-
 block/sheepdog.c|  26 ++--
 block/ssh.c |   4 +-
 block/win32-aio.c   |   4 +-
 hw/block/virtio-blk.c   |  18 ++-
 hw/scsi/virtio-scsi.c   |  36 +++---
 hw/virtio/virtio.c  |  54 ++--
 iohandler.c |   2 +-
 iothread.c  |  84 
 nbd/server.c|   9 +-
 stubs/set-fd-handler.c  |   1 +
 tests/test-aio.c|   4 +-
 util/event_notifier-posix.c |   2 +-
 trace-events|   6 +
 23 files changed, 604 insertions(+), 110 deletions(-)

-- 
2.9.3