Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-16 Thread Fam Zheng
On Mon, 11/14 16:29, Paolo Bonzini wrote:
> 
> 
> On 14/11/2016 16:26, Stefan Hajnoczi wrote:
> > On Fri, Nov 11, 2016 at 01:59:25PM -0600, Karl Rister wrote:
> >> QEMU_AIO_POLL_MAX_NS    IOPs
> >>                unset  31,383
> >>                    1  46,860
> >>                    2  46,440
> >>                    4  35,246
> >>                    8  34,973
> >>                   16  46,794
> >>                   32  46,729
> >>                   64  35,520
> >>                  128  45,902
> > 
> > The environment variable is in nanoseconds.  The range of values you
> > tried are very small (all <1 usec).  It would be interesting to try
> > larger values in the ballpark of the latencies you have traced.  For
> > example 2000, 4000, 8000, 16000, and 32000 ns.
> > 
> > Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
> > much CPU overhead.
> 
> That basically means "avoid a syscall if you already know there's
> something to do", so in retrospect it's not that surprising.  Still
> interesting though, and it means that the feature is useful even if you
> don't have CPU to waste.

With the "deleted" bug fixed I did a little more testing to understand this.

Setting QEMU_AIO_POLL_MAX_NS=1 doesn't mean run_poll_handlers() will only loop
for 1 ns - the patch only checks the deadline once every 1024 polls. The first
poll in a run_poll_handlers() call rarely succeeds, so we poll at least 1024
times.
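In pseudo-C the loop looks roughly like this (my reading of the patch as a
simplified sketch - poll_handlers_once() is a made-up helper name, not the
actual code):

    static bool run_poll_handlers(AioContext *ctx, int64_t max_ns)
    {
        int64_t deadline = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + max_ns;
        unsigned int i = 0;

        for (;;) {
            if (poll_handlers_once(ctx)) {  /* did any handler make progress? */
                return true;
            }
            /* The deadline is only consulted every 1024 iterations, so even
             * max_ns=1 runs the loop at least 1024 times. */
            if (++i % 1024 == 0 &&
                qemu_clock_get_ns(QEMU_CLOCK_REALTIME) >= deadline) {
                return false;
            }
        }
    }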

According to my test, on average each run_poll_handlers() call takes ~12000 ns,
i.e. ~160 iterations of the poll loop, before getting a new event (either from
the virtio queue or from linux-aio; I don't have the ratio here).

So in the worst case (no new event), 1024 iterations take roughly (12000 / 160 *
1024) = 76800 ns!

The above is with iodepth=1 and jobs=1.  With iodepth=32 and jobs=1, or
iodepth=8 and jobs=4, polling succeeds around the 30th iteration, after roughly
5600 ns.

Fam



Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-15 Thread Karl Rister
On 11/15/2016 04:32 AM, Stefan Hajnoczi wrote:
> On Mon, Nov 14, 2016 at 09:52:00PM +0100, Paolo Bonzini wrote:
>> On 14/11/2016 21:12, Karl Rister wrote:
>>>                  256  46,929
>>>                  512  35,627
>>>                1,024  46,477
>>>                2,000  35,247
>>>                2,048  46,322
>>>                4,000  46,540
>>>                4,096  46,368
>>>                8,000  47,054
>>>                8,192  46,671
>>>               16,000  46,466
>>>               16,384  32,504
>>>               32,000  20,620
>>>               32,768  20,807
>>
>> Huh, it breaks down exactly when it should start going faster
>> (10^9/46000 = ~21000).
> 
> Could it be because we're not breaking the polling loop for BHs, new
> timers, or aio_notify()?
> 
> Once that is fixed polling should achieve maximum performance when
> QEMU_AIO_POLL_MAX_NS is at least as long as the duration of a request.
> 
> This is logical if there are enough pinned CPUs so the polling thread
> can run flat out.
> 

I removed all the pinning and restored the guest to a "normal"
configuration.

QEMU_AIO_POLL_MAX_NS    IOPs
               unset  25,553
                   1  28,684
                   2  38,213
                   4  29,413
                   8  38,612
                  16  30,578
                  32  30,145
                  64  41,637
                 128  28,554
                 256  29,661
                 512  39,178
               1,024  29,644
               2,048  37,190
               4,096  29,838
               8,192  38,581
              16,384  37,793
              32,768  20,332
              65,536  35,755

-- 
Karl Rister 



Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-15 Thread Stefan Hajnoczi
On Mon, Nov 14, 2016 at 06:15:46PM +0100, Paolo Bonzini wrote:
> 
> 
> On 14/11/2016 18:06, Stefan Hajnoczi wrote:
> > > > Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
> > > > much CPU overhead.
> > >
> > > That basically means "avoid a syscall if you already know there's
> > > something to do", so in retrospect it's not that surprising.  Still
> > > interesting though, and it means that the feature is useful even if you
> > > don't have CPU to waste.
> > Can you spell out which syscall you mean?  Reading the ioeventfd?
> 
> I mean ppoll.  If ppoll succeeds without ever going to sleep, you can
> achieve the same result with QEMU_AIO_POLL_MAX_NS=1, but cheaper.

It's not obvious to me that ioeventfd or Linux AIO will become ready
with QEMU_AIO_POLL_MAX_NS=1.

This benchmark is iodepth=1 so there's just a single request.  Fam
suggested that maybe Linux AIO is ready immediately but AFAIK this NVMe
device should still take a few microseconds to complete a request
whereas our polling time is 1 nanosecond.

Tracing would reveal what is going on here.

Stefan




Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-15 Thread Stefan Hajnoczi
On Mon, Nov 14, 2016 at 09:52:00PM +0100, Paolo Bonzini wrote:
> On 14/11/2016 21:12, Karl Rister wrote:
> >                  256  46,929
> >                  512  35,627
> >                1,024  46,477
> >                2,000  35,247
> >                2,048  46,322
> >                4,000  46,540
> >                4,096  46,368
> >                8,000  47,054
> >                8,192  46,671
> >               16,000  46,466
> >               16,384  32,504
> >               32,000  20,620
> >               32,768  20,807
> 
> Huh, it breaks down exactly when it should start going faster
> (10^9/46000 = ~21000).

Could it be because we're not breaking the polling loop for BHs, new
timers, or aio_notify()?

Once that is fixed polling should achieve maximum performance when
QEMU_AIO_POLL_MAX_NS is at least as long as the duration of a request.

This is logical if there are enough pinned CPUs so the polling thread
can run flat out.
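
Something like this inside the polling loop (a hypothetical sketch;
aio_compute_timeout() exists today, run_poll_handlers_once() and the
'notified' check are illustrative):

    for (;;) {
        if (run_poll_handlers_once(ctx)) {
            return true;    /* a virtqueue kick or AIO completion arrived */
        }
        /* Stop polling as soon as anything else needs the event loop: an
         * aio_notify() from another thread, or a BH/expired timer making
         * the computed timeout zero. */
        if (atomic_read(&ctx->notified) || aio_compute_timeout(ctx) == 0) {
            return false;
        }
    }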

Stefan




Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Paolo Bonzini


On 14/11/2016 21:12, Karl Rister wrote:
>                  256  46,929
>                  512  35,627
>                1,024  46,477
>                2,000  35,247
>                2,048  46,322
>                4,000  46,540
>                4,096  46,368
>                8,000  47,054
>                8,192  46,671
>               16,000  46,466
>               16,384  32,504
>               32,000  20,620
>               32,768  20,807

Huh, it breaks down exactly when it should start going faster: at 46,000 IOPS
the next kick arrives roughly every 10^9/46,000 = ~21,000 ns, so any budget
above that should catch it while still polling.

Paolo



Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Karl Rister
On 11/14/2016 09:26 AM, Stefan Hajnoczi wrote:
> On Fri, Nov 11, 2016 at 01:59:25PM -0600, Karl Rister wrote:
>> On 11/09/2016 11:13 AM, Stefan Hajnoczi wrote:
>>> Recent performance investigation work done by Karl Rister shows that the
>>> guest->host notification takes around 20 us.  This is more than the
>>> "overhead" of QEMU itself (e.g. block layer).
>>>
>>> One way to avoid the costly exit is to use polling instead of notification.
>>> The main drawback of polling is that it consumes CPU resources.  In order
>>> to benefit performance the host must have extra CPU cycles available on
>>> physical CPUs that aren't used by the guest.
>>>
>>> This is an experimental AioContext polling implementation.  It adds a
>>> polling callback into the event loop.  Polling functions are implemented
>>> for virtio-blk virtqueue guest->host kick and Linux AIO completion.
>>>
>>> The QEMU_AIO_POLL_MAX_NS environment variable sets the number of
>>> nanoseconds to poll before entering the usual blocking poll(2) syscall.
>>> Try setting this variable to the time from old request completion to new
>>> virtqueue kick.
>>>
>>> By default no polling is done.  The QEMU_AIO_POLL_MAX_NS must be set to
>>> get any polling!
>>>
>>> Karl: I hope you can try this patch series with several
>>> QEMU_AIO_POLL_MAX_NS values.  If you don't find a good value we should
>>> double-check the tracing data to see if this experimental code can be
>>> improved.
>>
>> Stefan
>>
>> I ran some quick tests with your patches and got some pretty good gains,
>> but also some seemingly odd behavior.
>>
>> These results are for a 5 minute test doing sequential 4KB requests from
>> fio using O_DIRECT, libaio, and IO depth of 1.  The requests are
>> performed directly against the virtio-blk device (no filesystem) which
>> is backed by a 400GB NVMe card.
>>
>> QEMU_AIO_POLL_MAX_NS    IOPs
>>                unset  31,383
>>                    1  46,860
>>                    2  46,440
>>                    4  35,246
>>                    8  34,973
>>                   16  46,794
>>                   32  46,729
>>                   64  35,520
>>                  128  45,902
> 
> The environment variable is in nanoseconds.  The range of values you
> tried are very small (all <1 usec).  It would be interesting to try
> larger values in the ballpark of the latencies you have traced.  For
> example 2000, 4000, 8000, 16000, and 32000 ns.

Here are some more numbers with higher values.  I continued the power of
2 values and added in your examples as well:

QEMU_AIO_POLL_MAX_NS    IOPs
                 256  46,929
                 512  35,627
               1,024  46,477
               2,000  35,247
               2,048  46,322
               4,000  46,540
               4,096  46,368
               8,000  47,054
               8,192  46,671
              16,000  46,466
              16,384  32,504
              32,000  20,620
              32,768  20,807

> 
> Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
> much CPU overhead.
> 
>> I found the results for 4, 8, and 64 odd so I re-ran some tests to check
>> for consistency.  I used values of 2 and 4 and ran each 5 times.  Here
>> is what I got:
>>
>> Iteration  QEMU_AIO_POLL_MAX_NS=2  QEMU_AIO_POLL_MAX_NS=4
>>         1                  46,972                  35,434
>>         2                  46,939                  35,719
>>         3                  47,005                  35,584
>>         4                  47,016                  35,615
>>         5                  47,267                  35,474
>>
>> So the results seem consistent.
> 
> That is interesting.  I don't have an explanation for the consistent
> difference between 2 and 4 ns polling time.  The time difference is so
> small yet the IOPS difference is clear.
> 
> Comparing traces could shed light on the cause for this difference.
> 
>> I saw some discussion on the patches which makes me think you'll be
>> making some changes, is that right?  If so, I may wait for the updates
>> and then we can run the much more exhaustive set of workloads
>> (sequential read and write, random read and write) at various block
>> sizes (4, 8, 16, 32, 64, 128, and 256) and multiple IO depths (1 and 32)
>> that we were doing when we started looking at this.
> 
> I'll send an updated version of the patches.
> 
> Stefan
> 


-- 
Karl Rister 



Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Paolo Bonzini


On 14/11/2016 18:06, Stefan Hajnoczi wrote:
>>> Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
>>> much CPU overhead.
>>
>> That basically means "avoid a syscall if you already know there's
>> something to do", so in retrospect it's not that surprising.  Still
>> interesting though, and it means that the feature is useful even if you
>> don't have CPU to waste.
> Can you spell out which syscall you mean?  Reading the ioeventfd?

I mean ppoll.  If ppoll succeeds without ever going to sleep, you can
achieve the same result with QEMU_AIO_POLL_MAX_NS=1, but cheaper.
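
In other words (schematic only, the virtio/linux-aio helper names below are
made up for illustration):

    /* Syscall path: enter the kernel even though an fd may already be
     * readable.  A ppoll() that returns immediately still costs on the
     * order of a microsecond. */
    ret = ppoll(pollfds, npfd, &timeout, NULL);

    /* Polling path: the same readiness information is available to
     * userspace as a couple of memory reads. */
    progress = virtqueue_avail_idx_changed(vq) || aio_ring_has_events(s);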

Paolo

> The benchmark uses virtio-blk dataplane and iodepth=1 so there shouldn't
> be much IOThread event loop activity besides the single I/O request.
> 
> The reason this puzzles me is that I wouldn't expect poll to succeed
> with QEMU_AIO_POLL_MAX_NS=1 and iodepth=1.





Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Stefan Hajnoczi
On Mon, Nov 14, 2016 at 04:29:49PM +0100, Paolo Bonzini wrote:
> On 14/11/2016 16:26, Stefan Hajnoczi wrote:
> > On Fri, Nov 11, 2016 at 01:59:25PM -0600, Karl Rister wrote:
> >> QEMU_AIO_POLL_MAX_NS    IOPs
> >>                unset  31,383
> >>                    1  46,860
> >>                    2  46,440
> >>                    4  35,246
> >>                    8  34,973
> >>                   16  46,794
> >>                   32  46,729
> >>                   64  35,520
> >>                  128  45,902
> > 
> > The environment variable is in nanoseconds.  The range of values you
> > tried are very small (all <1 usec).  It would be interesting to try
> > larger values in the ballpark of the latencies you have traced.  For
> > example 2000, 4000, 8000, 16000, and 32000 ns.
> > 
> > Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
> > much CPU overhead.
> 
> That basically means "avoid a syscall if you already know there's
> something to do", so in retrospect it's not that surprising.  Still
> interesting though, and it means that the feature is useful even if you
> don't have CPU to waste.

Can you spell out which syscall you mean?  Reading the ioeventfd?

The benchmark uses virtio-blk dataplane and iodepth=1 so there shouldn't
be much IOThread event loop activity besides the single I/O request.

The reason this puzzles me is that I wouldn't expect poll to succeed
with QEMU_AIO_POLL_MAX_NS=1 and iodepth=1.

Thanks,
Stefan




Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Fam Zheng
On Mon, 11/14 17:06, Stefan Hajnoczi wrote:
> On Mon, Nov 14, 2016 at 04:29:49PM +0100, Paolo Bonzini wrote:
> > On 14/11/2016 16:26, Stefan Hajnoczi wrote:
> > > On Fri, Nov 11, 2016 at 01:59:25PM -0600, Karl Rister wrote:
> > >> QEMU_AIO_POLL_MAX_NS    IOPs
> > >>                unset  31,383
> > >>                    1  46,860
> > >>                    2  46,440
> > >>                    4  35,246
> > >>                    8  34,973
> > >>                   16  46,794
> > >>                   32  46,729
> > >>                   64  35,520
> > >>                  128  45,902
> > > 
> > > The environment variable is in nanoseconds.  The range of values you
> > > tried are very small (all <1 usec).  It would be interesting to try
> > > larger values in the ballpark of the latencies you have traced.  For
> > > example 2000, 4000, 8000, 16000, and 32000 ns.
> > > 
> > > Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
> > > much CPU overhead.
> > 
> > That basically means "avoid a syscall if you already know there's
> > something to do", so in retrospect it's not that surprising.  Still
> > interesting though, and it means that the feature is useful even if you
> > don't have CPU to waste.
> 
> Can you spell out which syscall you mean?  Reading the ioeventfd?
> 
> The benchmark uses virtio-blk dataplane and iodepth=1 so there shouldn't
> be much IOThread event loop activity besides the single I/O request.
> 
> The reason this puzzles me is that I wouldn't expect poll to succeed
> with QEMU_AIO_POLL_MAX_NS=1 and iodepth=1.

I see the guest shouldn't send more requests, but isn't it possible for
the linux-aio poll to succeed?
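
(The completion ring is mapped into userspace by io_setup(), so checking it
needs no syscall at all.  Roughly like this - the struct layout is from my
memory of the kernel's fs/aio.c, so treat it as a sketch:)

    #include <stdbool.h>
    #include <libaio.h>

    /* Userspace view of the kernel's AIO completion ring. */
    struct aio_ring {
        unsigned id;
        unsigned nr;     /* number of io_events in the ring */
        unsigned head;   /* consumer index */
        unsigned tail;   /* producer index, written by the kernel */
        unsigned magic;
        unsigned compat_features;
        unsigned incompat_features;
        unsigned header_length;
        struct io_event io_events[];
    };

    /* The io_context_t returned by io_setup() points at the shared ring,
     * so pending completions are visible as head != tail without entering
     * the kernel. */
    static bool laio_ring_has_completions(io_context_t io_ctx)
    {
        struct aio_ring *ring = (struct aio_ring *)io_ctx;
        return ring->head != ring->tail;
    }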

Fam

> 
> Thanks,
> Stefan





Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Stefan Hajnoczi
On Mon, Nov 14, 2016 at 08:52:21AM -0600, Karl Rister wrote:
> On 11/14/2016 07:53 AM, Fam Zheng wrote:
> > On Fri, 11/11 13:59, Karl Rister wrote:
> >>
> >> Stefan
> >>
> >> I ran some quick tests with your patches and got some pretty good gains,
> >> but also some seemingly odd behavior.
> >>
> >> These results are for a 5 minute test doing sequential 4KB requests from
> >> fio using O_DIRECT, libaio, and IO depth of 1.  The requests are
> >> performed directly against the virtio-blk device (no filesystem) which
> >> is backed by a 400GB NVMe card.
> >>
> >> QEMU_AIO_POLL_MAX_NS    IOPs
> >>                unset  31,383
> >>                    1  46,860
> >>                    2  46,440
> >>                    4  35,246
> >>                    8  34,973
> >>                   16  46,794
> >>                   32  46,729
> >>                   64  35,520
> >>                  128  45,902
> > 
> > For sequential read with ioq=1, each request takes >22 us under 45,000
> > IOPs (10^9 / 45,000 = ~22,000 ns).
> > Isn't a poll time of 128ns a mismatching order of magnitude? Have you tried
> > larger values? Not criticizing, just trying to understand how it works.
> 
> Not yet, I was just trying to get something out as quick as I could
> (while juggling this with some other stuff...).  Frankly I was a bit
> surprised that the low values made such an impact and then got
> distracted by the behaviors of 4, 8, and 64.
> 
> > 
> > Also, do you happen to have numbers for unpatched QEMU (just to confirm
> > that the "unset" case doesn't cause a regression) and baremetal for
> > comparison?
> 
> I didn't run this exact test on the same qemu.git master changeset
> unpatched.  I did however previously try it against the v2.7.0 tag and
> got somewhere around 27.5K IOPs.  My original intention was to apply the
> patches to v2.7.0 but it wouldn't build.
> 
> We have done a lot of testing and tracing on the qemu-rhev package and
> 27K IOPs is about what we see there (with tracing disabled).
> 
> Given the patch discussions I saw I was mainly trying to get a sniff
> test out and then do a more complete workup with whatever updates are made.
> 
> I should probably note that there are a lot of pinning optimizations
> made here to assist in our tracing efforts which also result in improved
> performance.  Ultimately, in a proper evaluation of these patches most
> of that will be removed so the behavior may change somewhat.

To clarify: QEMU_AIO_POLL_MAX_NS unset or 0 disables polling completely.
Therefore it's not necessary to run unpatched.
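
I.e. the startup logic amounts to something like this (sketch; the actual
parsing in the series may differ):

    #include <stdint.h>
    #include <stdlib.h>

    /* Unset and "0" both leave polling disabled, so the polling path
     * behaves exactly like unpatched QEMU. */
    const char *env = getenv("QEMU_AIO_POLL_MAX_NS");
    int64_t aio_poll_max_ns = env ? strtoll(env, NULL, 10) : 0;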




Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Stefan Hajnoczi
On Mon, Nov 14, 2016 at 03:51:18PM +0100, Christian Borntraeger wrote:
> On 11/09/2016 06:13 PM, Stefan Hajnoczi wrote:
> > Recent performance investigation work done by Karl Rister shows that the
> > guest->host notification takes around 20 us.  This is more than the
> > "overhead" of QEMU itself (e.g. block layer).
> > 
> > One way to avoid the costly exit is to use polling instead of notification.
> > The main drawback of polling is that it consumes CPU resources.  In order
> > to benefit performance the host must have extra CPU cycles available on
> > physical CPUs that aren't used by the guest.
> > 
> > This is an experimental AioContext polling implementation.  It adds a
> > polling callback into the event loop.  Polling functions are implemented
> > for virtio-blk virtqueue guest->host kick and Linux AIO completion.
> > 
> > The QEMU_AIO_POLL_MAX_NS environment variable sets the number of
> > nanoseconds to poll before entering the usual blocking poll(2) syscall.
> > Try setting this variable to the time from old request completion to new
> > virtqueue kick.
> > 
> > By default no polling is done.  The QEMU_AIO_POLL_MAX_NS must be set to
> > get any polling!
> > 
> > Karl: I hope you can try this patch series with several
> > QEMU_AIO_POLL_MAX_NS values.  If you don't find a good value we should
> > double-check the tracing data to see if this experimental code can be
> > improved.
> > 
> > Stefan Hajnoczi (3):
> >   aio-posix: add aio_set_poll_handler()
> >   virtio: poll virtqueues for new buffers
> >   linux-aio: poll ring for completions
> > 
> >  aio-posix.c         | 133
> >  block/linux-aio.c   |  17 +++
> >  hw/virtio/virtio.c  |  19
> >  include/block/aio.h |  16 +++
> >  4 files changed, 185 insertions(+)
> 
> Hmm, I see all affected threads using more CPU power, but the performance
> numbers are somewhat inconclusive on s390. I have no proper test setup
> (only a shared LPAR), but all numbers are in the same ballpark of
> 3-5 Gbyte/sec for 5 disks for 4k random reads with iodepth=8.
> 
> What I find interesting is that the guest still does a huge amount of
> exits for the guest->host notifications. I think if we could combine this
> with some notification suppression, then things could be even more
> interesting.

Great idea.  I'll add that to the next revision.

Stefan




Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Stefan Hajnoczi
On Mon, Nov 14, 2016 at 03:59:32PM +0100, Christian Borntraeger wrote:
> On 11/09/2016 06:13 PM, Stefan Hajnoczi wrote:
> > Recent performance investigation work done by Karl Rister shows that the
> > guest->host notification takes around 20 us.  This is more than the
> > "overhead" of QEMU itself (e.g. block layer).
> > 
> > One way to avoid the costly exit is to use polling instead of notification.
> > The main drawback of polling is that it consumes CPU resources.  In order
> > to benefit performance the host must have extra CPU cycles available on
> > physical CPUs that aren't used by the guest.
> > 
> > This is an experimental AioContext polling implementation.  It adds a
> > polling callback into the event loop.  Polling functions are implemented
> > for virtio-blk virtqueue guest->host kick and Linux AIO completion.
> > 
> > The QEMU_AIO_POLL_MAX_NS environment variable sets the number of
> > nanoseconds to poll before entering the usual blocking poll(2) syscall.
> > Try setting this variable to the time from old request completion to new
> > virtqueue kick.
> > 
> > By default no polling is done.  The QEMU_AIO_POLL_MAX_NS must be set to
> > get any polling!
> > 
> > Karl: I hope you can try this patch series with several
> > QEMU_AIO_POLL_MAX_NS values.  If you don't find a good value we should
> > double-check the tracing data to see if this experimental code can be
> > improved.
> > 
> > Stefan Hajnoczi (3):
> >   aio-posix: add aio_set_poll_handler()
> >   virtio: poll virtqueues for new buffers
> >   linux-aio: poll ring for completions
> > 
> >  aio-posix.c         | 133
> >  block/linux-aio.c   |  17 +++
> >  hw/virtio/virtio.c  |  19
> >  include/block/aio.h |  16 +++
> >  4 files changed, 185 insertions(+)
> > 
> 
> Another observation: With more iothreads than host CPUs the performance drops 
> significantly.

This makes sense although we can eliminate it in common cases by only
polling when we actually need to monitor events.

The current series wastes CPU on Linux AIO polling when no requests are
pending ;).
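
A fix could be as simple as gating the poll callback on in-flight requests
(sketch - 'io_pending' is a hypothetical counter, not a field the series
actually has):

    static bool laio_poll_cb(void *opaque)
    {
        LinuxAioState *s = opaque;

        /* Nothing submitted means the completion ring cannot become
         * non-empty, so skip the busy-wait entirely. */
        if (s->io_pending == 0) {
            return false;
        }
        return laio_ring_has_completions(s);
    }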




Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Karl Rister
On 11/14/2016 09:26 AM, Stefan Hajnoczi wrote:
> On Fri, Nov 11, 2016 at 01:59:25PM -0600, Karl Rister wrote:
>> On 11/09/2016 11:13 AM, Stefan Hajnoczi wrote:
>>> Recent performance investigation work done by Karl Rister shows that the
>>> guest->host notification takes around 20 us.  This is more than the
>>> "overhead" of QEMU itself (e.g. block layer).
>>>
>>> One way to avoid the costly exit is to use polling instead of notification.
>>> The main drawback of polling is that it consumes CPU resources.  In order
>>> to benefit performance the host must have extra CPU cycles available on
>>> physical CPUs that aren't used by the guest.
>>>
>>> This is an experimental AioContext polling implementation.  It adds a
>>> polling callback into the event loop.  Polling functions are implemented
>>> for virtio-blk virtqueue guest->host kick and Linux AIO completion.
>>>
>>> The QEMU_AIO_POLL_MAX_NS environment variable sets the number of
>>> nanoseconds to poll before entering the usual blocking poll(2) syscall.
>>> Try setting this variable to the time from old request completion to new
>>> virtqueue kick.
>>>
>>> By default no polling is done.  The QEMU_AIO_POLL_MAX_NS must be set to
>>> get any polling!
>>>
>>> Karl: I hope you can try this patch series with several
>>> QEMU_AIO_POLL_MAX_NS values.  If you don't find a good value we should
>>> double-check the tracing data to see if this experimental code can be
>>> improved.
>>
>> Stefan
>>
>> I ran some quick tests with your patches and got some pretty good gains,
>> but also some seemingly odd behavior.
>>
>> These results are for a 5 minute test doing sequential 4KB requests from
>> fio using O_DIRECT, libaio, and IO depth of 1.  The requests are
>> performed directly against the virtio-blk device (no filesystem) which
>> is backed by a 400GB NVMe card.
>>
>> QEMU_AIO_POLL_MAX_NS    IOPs
>>                unset  31,383
>>                    1  46,860
>>                    2  46,440
>>                    4  35,246
>>                    8  34,973
>>                   16  46,794
>>                   32  46,729
>>                   64  35,520
>>                  128  45,902
> 
> The environment variable is in nanoseconds.  The range of values you
> tried are very small (all <1 usec).  It would be interesting to try
> larger values in the ballpark of the latencies you have traced.  For
> example 2000, 4000, 8000, 16000, and 32000 ns.

Agreed.  As I alluded to in another post, I decided to start at 1 and
double the values until I saw a difference with the expectation that it
would have to get quite large before that happened.  The results went in
a different direction, and then I got distracted by the variation at
certain points.  I figured that by itself the fact that noticeable
improvements were possible with such low values was interesting.

I will definitely continue the progression and capture some larger values.

> 
> Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
> much CPU overhead.
> 
>> I found the results for 4, 8, and 64 odd so I re-ran some tests to check
>> for consistency.  I used values of 2 and 4 and ran each 5 times.  Here
>> is what I got:
>>
>> Iteration  QEMU_AIO_POLL_MAX_NS=2  QEMU_AIO_POLL_MAX_NS=4
>>         1                  46,972                  35,434
>>         2                  46,939                  35,719
>>         3                  47,005                  35,584
>>         4                  47,016                  35,615
>>         5                  47,267                  35,474
>>
>> So the results seem consistent.
> 
> That is interesting.  I don't have an explanation for the consistent
> difference between 2 and 4 ns polling time.  The time difference is so
> small yet the IOPS difference is clear.
> 
> Comparing traces could shed light on the cause for this difference.
> 
>> I saw some discussion on the patches which makes me think you'll be
>> making some changes, is that right?  If so, I may wait for the updates
>> and then we can run the much more exhaustive set of workloads
>> (sequential read and write, random read and write) at various block
>> sizes (4, 8, 16, 32, 64, 128, and 256) and multiple IO depths (1 and 32)
>> that we were doing when we started looking at this.
> 
> I'll send an updated version of the patches.
> 
> Stefan
> 


-- 
Karl Rister 



Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Paolo Bonzini


On 14/11/2016 16:26, Stefan Hajnoczi wrote:
> On Fri, Nov 11, 2016 at 01:59:25PM -0600, Karl Rister wrote:
>> QEMU_AIO_POLL_MAX_NS    IOPs
>>                unset  31,383
>>                    1  46,860
>>                    2  46,440
>>                    4  35,246
>>                    8  34,973
>>                   16  46,794
>>                   32  46,729
>>                   64  35,520
>>                  128  45,902
> 
> The environment variable is in nanoseconds.  The range of values you
> tried are very small (all <1 usec).  It would be interesting to try
> larger values in the ballpark of the latencies you have traced.  For
> example 2000, 4000, 8000, 16000, and 32000 ns.
> 
> Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
> much CPU overhead.

That basically means "avoid a syscall if you already know there's
something to do", so in retrospect it's not that surprising.  Still
interesting though, and it means that the feature is useful even if you
don't have CPU to waste.

Paolo





Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Stefan Hajnoczi
On Fri, Nov 11, 2016 at 01:59:25PM -0600, Karl Rister wrote:
> On 11/09/2016 11:13 AM, Stefan Hajnoczi wrote:
> > Recent performance investigation work done by Karl Rister shows that the
> > guest->host notification takes around 20 us.  This is more than the
> > "overhead" of QEMU itself (e.g. block layer).
> > 
> > One way to avoid the costly exit is to use polling instead of notification.
> > The main drawback of polling is that it consumes CPU resources.  In order
> > to benefit performance the host must have extra CPU cycles available on
> > physical CPUs that aren't used by the guest.
> > 
> > This is an experimental AioContext polling implementation.  It adds a
> > polling callback into the event loop.  Polling functions are implemented
> > for virtio-blk virtqueue guest->host kick and Linux AIO completion.
> > 
> > The QEMU_AIO_POLL_MAX_NS environment variable sets the number of
> > nanoseconds to poll before entering the usual blocking poll(2) syscall.
> > Try setting this variable to the time from old request completion to new
> > virtqueue kick.
> > 
> > By default no polling is done.  The QEMU_AIO_POLL_MAX_NS must be set to
> > get any polling!
> > 
> > Karl: I hope you can try this patch series with several
> > QEMU_AIO_POLL_MAX_NS values.  If you don't find a good value we should
> > double-check the tracing data to see if this experimental code can be
> > improved.
> 
> Stefan
> 
> I ran some quick tests with your patches and got some pretty good gains,
> but also some seemingly odd behavior.
>
> These results are for a 5 minute test doing sequential 4KB requests from
> fio using O_DIRECT, libaio, and IO depth of 1.  The requests are
> performed directly against the virtio-blk device (no filesystem) which
> is backed by a 400GB NVMe card.
> 
> QEMU_AIO_POLL_MAX_NS    IOPs
>                unset  31,383
>                    1  46,860
>                    2  46,440
>                    4  35,246
>                    8  34,973
>                   16  46,794
>                   32  46,729
>                   64  35,520
>                  128  45,902

The environment variable is in nanoseconds.  The range of values you
tried are very small (all <1 usec).  It would be interesting to try
larger values in the ballpark of the latencies you have traced.  For
example 2000, 4000, 8000, 16000, and 32000 ns.

Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
much CPU overhead.

> I found the results for 4, 8, and 64 odd so I re-ran some tests to check
> for consistency.  I used values of 2 and 4 and ran each 5 times.  Here
> is what I got:
> 
> Iteration  QEMU_AIO_POLL_MAX_NS=2  QEMU_AIO_POLL_MAX_NS=4
>         1                  46,972                  35,434
>         2                  46,939                  35,719
>         3                  47,005                  35,584
>         4                  47,016                  35,615
>         5                  47,267                  35,474
> 
> So the results seem consistent.

That is interesting.  I don't have an explanation for the consistent
difference between 2 and 4 ns polling time.  The time difference is so
small yet the IOPS difference is clear.

Comparing traces could shed light on the cause for this difference.

> I saw some discussion on the patches which makes me think you'll be
> making some changes, is that right?  If so, I may wait for the updates
> and then we can run the much more exhaustive set of workloads
> (sequential read and write, random read and write) at various block
> sizes (4, 8, 16, 32, 64, 128, and 256) and multiple IO depths (1 and 32)
> that we were doing when we started looking at this.

I'll send an updated version of the patches.

Stefan




Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Christian Borntraeger
On 11/09/2016 06:13 PM, Stefan Hajnoczi wrote:
> Recent performance investigation work done by Karl Rister shows that the
> guest->host notification takes around 20 us.  This is more than the "overhead"
> of QEMU itself (e.g. block layer).
> 
> One way to avoid the costly exit is to use polling instead of notification.
> The main drawback of polling is that it consumes CPU resources.  In order to
> benefit performance the host must have extra CPU cycles available on physical
> CPUs that aren't used by the guest.
> 
> This is an experimental AioContext polling implementation.  It adds a polling
> callback into the event loop.  Polling functions are implemented for
> virtio-blk virtqueue guest->host kick and Linux AIO completion.
> 
> The QEMU_AIO_POLL_MAX_NS environment variable sets the number of nanoseconds
> to poll before entering the usual blocking poll(2) syscall.  Try setting this
> variable to the time from old request completion to new virtqueue kick.
> 
> By default no polling is done.  The QEMU_AIO_POLL_MAX_NS must be set to get
> any polling!
> 
> Karl: I hope you can try this patch series with several QEMU_AIO_POLL_MAX_NS
> values.  If you don't find a good value we should double-check the tracing
> data to see if this experimental code can be improved.
> 
> Stefan Hajnoczi (3):
>   aio-posix: add aio_set_poll_handler()
>   virtio: poll virtqueues for new buffers
>   linux-aio: poll ring for completions
> 
>  aio-posix.c         | 133
>  block/linux-aio.c   |  17 +++
>  hw/virtio/virtio.c  |  19
>  include/block/aio.h |  16 +++
>  4 files changed, 185 insertions(+)
> 

Another observation: With more iothreads than host CPUs the performance drops 
significantly.





Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Christian Borntraeger
On 11/09/2016 06:13 PM, Stefan Hajnoczi wrote:
> Recent performance investigation work done by Karl Rister shows that the
> guest->host notification takes around 20 us.  This is more than the "overhead"
> of QEMU itself (e.g. block layer).
> 
> One way to avoid the costly exit is to use polling instead of notification.
> The main drawback of polling is that it consumes CPU resources.  In order to
> benefit performance the host must have extra CPU cycles available on physical
> CPUs that aren't used by the guest.
> 
> This is an experimental AioContext polling implementation.  It adds a polling
> callback into the event loop.  Polling functions are implemented for
> virtio-blk virtqueue guest->host kick and Linux AIO completion.
> 
> The QEMU_AIO_POLL_MAX_NS environment variable sets the number of nanoseconds
> to poll before entering the usual blocking poll(2) syscall.  Try setting this
> variable to the time from old request completion to new virtqueue kick.
> 
> By default no polling is done.  The QEMU_AIO_POLL_MAX_NS must be set to get
> any polling!
> 
> Karl: I hope you can try this patch series with several QEMU_AIO_POLL_MAX_NS
> values.  If you don't find a good value we should double-check the tracing
> data to see if this experimental code can be improved.
> 
> Stefan Hajnoczi (3):
>   aio-posix: add aio_set_poll_handler()
>   virtio: poll virtqueues for new buffers
>   linux-aio: poll ring for completions
> 
>  aio-posix.c         | 133
>  block/linux-aio.c   |  17 +++
>  hw/virtio/virtio.c  |  19
>  include/block/aio.h |  16 +++
>  4 files changed, 185 insertions(+)

Hmm, I see all affected threads using more CPU power, but the performance
numbers are somewhat inconclusive on s390. I have no proper test setup (only
a shared LPAR), but all numbers are in the same ballpark of 3-5 Gbyte/sec for
5 disks for 4k random reads with iodepth=8.

What I find interesting is that the guest still does a huge amount of exits
for the guest->host notifications. I think if we could combine this with some
notification suppression, then things could be even more interesting.
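
virtio already has a knob for this: while the host is polling, the device
can set VRING_USED_F_NO_NOTIFY in the used ring (or use the event-index
feature) so that the guest skips the expensive kick.  Roughly (untested
sketch, helper names invented):

    /* Suppress guest->host kicks while polling will notice new buffers
     * anyway. */
    vring_set_used_flag(vq, VRING_USED_F_NO_NOTIFY);

    run_poll_handlers(ctx, max_ns);

    /* Re-enable notifications before blocking in ppoll(), then re-check
     * the avail ring to close the race against a request that arrived in
     * between. */
    vring_clear_used_flag(vq, VRING_USED_F_NO_NOTIFY);
    smp_mb();
    if (virtqueue_has_avail_buffers(vq)) {
        process_new_requests(vq);
    }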

Christian




Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Karl Rister
On 11/14/2016 07:53 AM, Fam Zheng wrote:
> On Fri, 11/11 13:59, Karl Rister wrote:
>>
>> Stefan
>>
>> I ran some quick tests with your patches and got some pretty good gains,
>> but also some seemingly odd behavior.
>>
>> These results are for a 5 minute test doing sequential 4KB requests from
>> fio using O_DIRECT, libaio, and IO depth of 1.  The requests are
>> performed directly against the virtio-blk device (no filesystem) which
>> is backed by a 400GB NVMe card.
>>
>> QEMU_AIO_POLL_MAX_NS    IOPs
>>                unset  31,383
>>                    1  46,860
>>                    2  46,440
>>                    4  35,246
>>                    8  34,973
>>                   16  46,794
>>                   32  46,729
>>                   64  35,520
>>                  128  45,902
> 
> For sequential read with ioq=1, each request takes >22 us under 45,000 IOPs
> (10^9 / 45,000 = ~22,000 ns).
> Isn't a poll time of 128ns a mismatching order of magnitude? Have you tried
> larger values? Not criticizing, just trying to understand how it works.

Not yet, I was just trying to get something out as quick as I could
(while juggling this with some other stuff...).  Frankly I was a bit
surprised that the low values made such an impact and then got
distracted by the behaviors of 4, 8, and 64.

> 
> Also, do you happen to have numbers for unpatched QEMU (just to confirm
> that the "unset" case doesn't cause a regression) and baremetal for
> comparison?

I didn't run this exact test on the same qemu.git master changeset
unpatched.  I did however previously try it against the v2.7.0 tag and
got somewhere around 27.5K IOPs.  My original intention was to apply the
patches to v2.7.0 but it wouldn't build.

We have done a lot of testing and tracing on the qemu-rhev package and
27K IOPs is about what we see there (with tracing disabled).

Given the patch discussions I saw I was mainly trying to get a sniff
test out and then do a more complete workup with whatever updates are made.

I should probably note that there are a lot of pinning optimizations
made here to assist in our tracing efforts which also result in improved
performance.  Ultimately, in a proper evaluation of these patches most
of that will be removed so the behavior may change somewhat.

> 
> Fam
> 


-- 
Karl Rister 



Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Fam Zheng
On Fri, 11/11 13:59, Karl Rister wrote:
> 
> Stefan
> 
> I ran some quick tests with your patches and got some pretty good gains,
> but also some seemingly odd behavior.
> 
> These results are for a 5 minute test doing sequential 4KB requests from
> fio using O_DIRECT, libaio, and IO depth of 1.  The requests are
> performed directly against the virtio-blk device (no filesystem) which
> is backed by a 400GB NVMe card.
> 
> QEMU_AIO_POLL_MAX_NS    IOPs
>                unset  31,383
>                    1  46,860
>                    2  46,440
>                    4  35,246
>                    8  34,973
>                   16  46,794
>                   32  46,729
>                   64  35,520
>                  128  45,902

For sequential read with ioq=1, each request takes >22 us under 45,000 IOPs
(10^9 / 45,000 = ~22,000 ns).
Isn't a poll time of 128ns a mismatching order of magnitude? Have you tried
larger values? Not criticizing, just trying to understand how it works.

Also, do you happen to have numbers for unpatched QEMU (just to confirm that
the "unset" case doesn't cause a regression) and baremetal for comparison?

Fam



Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-12 Thread no-reply
Hi,

Your series failed automatic build test. Please find the testing commands and
their output below. If you have docker installed, you can probably reproduce it
locally.

Type: series
Subject: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
Message-id: 1478711602-12620-1-git-send-email-stefa...@redhat.com

=== TEST SCRIPT BEGIN ===
#!/bin/bash
set -e
git submodule update --init dtc
# Let docker tests dump environment info
export SHOW_ENV=1
export J=16
make docker-test-quick@centos6
make docker-test-mingw@fedora
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
40e0074 linux-aio: poll ring for completions
c7e69fe virtio: poll virtqueues for new buffers
3129add aio-posix: add aio_set_poll_handler()

=== OUTPUT BEGIN ===
Submodule 'dtc' (git://git.qemu-project.org/dtc.git) registered for path 'dtc'
Cloning into 'dtc'...
Submodule path 'dtc': checked out '65cc4d2748a2c2e6f27f1cf39e07a5dbabd80ebf'
  BUILD   centos6
make[1]: Entering directory `/var/tmp/patchew-tester-tmp-c68h00s0/src'
  ARCHIVE qemu.tgz
  ARCHIVE dtc.tgz
  COPYRUNNER
RUN test-quick in qemu:centos6 
Packages installed:
SDL-devel-1.2.14-7.el6_7.1.x86_64
ccache-3.1.6-2.el6.x86_64
epel-release-6-8.noarch
gcc-4.4.7-17.el6.x86_64
git-1.7.1-4.el6_7.1.x86_64
glib2-devel-2.28.8-5.el6.x86_64
libfdt-devel-1.4.0-1.el6.x86_64
make-3.81-23.el6.x86_64
package g++ is not installed
pixman-devel-0.32.8-1.el6.x86_64
tar-1.23-15.el6_8.x86_64
zlib-devel-1.2.3-29.el6.x86_64

Environment variables:
PACKAGES=libfdt-devel ccache tar git make gcc g++ zlib-devel 
glib2-devel SDL-devel pixman-devel epel-release
HOSTNAME=50ce3cd670bd
TERM=xterm
MAKEFLAGS= -j16
HISTSIZE=1000
J=16
USER=root
CCACHE_DIR=/var/tmp/ccache
EXTRA_CONFIGURE_OPTS=
V=
SHOW_ENV=1
MAIL=/var/spool/mail/root
PATH=/usr/lib/ccache:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/
LANG=en_US.UTF-8
TARGET_LIST=
HISTCONTROL=ignoredups
SHLVL=1
HOME=/root
TEST_DIR=/tmp/qemu-test
LOGNAME=root
LESSOPEN=||/usr/bin/lesspipe.sh %s
FEATURES= dtc
DEBUG=
G_BROKEN_FILENAMES=1
CCACHE_HASHDIR=
_=/usr/bin/env

Configure options:
--enable-werror --target-list=x86_64-softmmu,aarch64-softmmu 
--prefix=/var/tmp/qemu-build/install
No C++ compiler available; disabling C++ specific optional code
Install prefix    /var/tmp/qemu-build/install
BIOS directory    /var/tmp/qemu-build/install/share/qemu
binary directory  /var/tmp/qemu-build/install/bin
library directory /var/tmp/qemu-build/install/lib
module directory  /var/tmp/qemu-build/install/lib/qemu
libexec directory /var/tmp/qemu-build/install/libexec
include directory /var/tmp/qemu-build/install/include
config directory  /var/tmp/qemu-build/install/etc
local state directory   /var/tmp/qemu-build/install/var
Manual directory  /var/tmp/qemu-build/install/share/man
ELF interp prefix /usr/gnemul/qemu-%M
Source path   /tmp/qemu-test/src
C compiler        cc
Host C compiler   cc
C++ compiler  
Objective-C compiler cc
ARFLAGS   rv
CFLAGS-O2 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -g 
QEMU_CFLAGS   -I/usr/include/pixman-1-pthread -I/usr/include/glib-2.0 
-I/usr/lib64/glib-2.0/include   -fPIE -DPIE -m64 -mcx16 -D_GNU_SOURCE 
-D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes 
-Wredundant-decls -Wall -Wundef -Wwrite-strings -Wmissing-prototypes 
-fno-strict-aliasing -fno-common -fwrapv  -Wendif-labels -Wmissing-include-dirs 
-Wempty-body -Wnested-externs -Wformat-security -Wformat-y2k -Winit-self 
-Wignored-qualifiers -Wold-style-declaration -Wold-style-definition 
-Wtype-limits -fstack-protector-all
LDFLAGS   -Wl,--warn-common -Wl,-z,relro -Wl,-z,now -pie -m64 -g 
make  make
install   install
python            python -B
smbd  /usr/sbin/smbd
module support    no
host CPU  x86_64
host big endian   no
target list   x86_64-softmmu aarch64-softmmu
tcg debug enabled no
gprof enabled no
sparse enabled    no
strip binaries    yes
profiler  no
static build  no
pixman            system
SDL support   yes (1.2.14)
GTK support   no 
GTK GL supportno
VTE support   no 
TLS priority  NORMAL
GNUTLS support    no
GNUTLS rnd        no
libgcrypt no
libgcrypt kdf no
nettle            no
nettle kdf        no
libtasn1  no
curses support    no
virgl support no
curl support  no
mingw32 support   no
Audio drivers oss
Block whitelist (rw) 
Block whitelist (ro) 
VirtFS support    no
VNC support   yes
VNC SASL support  no
VNC JPEG support  no
VNC PNG support   no
xen support   no
brlapi support    no
bluez support     no
Documentation no
PIE   yes
vde support   no
netmap support    no
Linux AIO support no
ATTR/XATTR support yes
Install blobs yes
KVM support   yes
COLO support  yes
RDMA support  no
TCG interpreter   no
fdt support   yes
preadv supportyes
fdatasync 

Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-11 Thread Karl Rister
On 11/09/2016 11:13 AM, Stefan Hajnoczi wrote:
> Recent performance investigation work done by Karl Rister shows that the
> guest->host notification takes around 20 us.  This is more than the "overhead"
> of QEMU itself (e.g. block layer).
> 
> One way to avoid the costly exit is to use polling instead of notification.
> The main drawback of polling is that it consumes CPU resources.  In order to
> benefit performance the host must have extra CPU cycles available on physical
> CPUs that aren't used by the guest.
> 
> This is an experimental AioContext polling implementation.  It adds a polling
> callback into the event loop.  Polling functions are implemented for
> virtio-blk virtqueue guest->host kick and Linux AIO completion.
> 
> The QEMU_AIO_POLL_MAX_NS environment variable sets the number of nanoseconds
> to poll before entering the usual blocking poll(2) syscall.  Try setting this
> variable to the time from old request completion to new virtqueue kick.
> 
> By default no polling is done.  The QEMU_AIO_POLL_MAX_NS must be set to get
> any polling!
> 
> Karl: I hope you can try this patch series with several QEMU_AIO_POLL_MAX_NS
> values.  If you don't find a good value we should double-check the tracing
> data to see if this experimental code can be improved.

Stefan

I ran some quick tests with your patches and got some pretty good gains,
but also some seemingly odd behavior.

These results are for a 5 minute test doing sequential 4KB requests from
fio using O_DIRECT, libaio, and IO depth of 1.  The requests are
performed directly against the virtio-blk device (no filesystem) which
is backed by a 400GB NVMe card.

QEMU_AIO_POLL_MAX_NS    IOPs
               unset  31,383
                   1  46,860
                   2  46,440
                   4  35,246
                   8  34,973
                  16  46,794
                  32  46,729
                  64  35,520
                 128  45,902

I found the results for 4, 8, and 64 odd so I re-ran some tests to check
for consistency.  I used values of 2 and 4 and ran each 5 times.  Here
is what I got:

Iteration  QEMU_AIO_POLL_MAX_NS=2  QEMU_AIO_POLL_MAX_NS=4
        1                  46,972                  35,434
        2                  46,939                  35,719
        3                  47,005                  35,584
        4                  47,016                  35,615
        5                  47,267                  35,474

So the results seem consistent.

I saw some discussion on the patches which makes me think you'll be
making some changes, is that right?  If so, I may wait for the updates
and then we can run the much more exhaustive set of workloads
(sequential read and write, random read and write) at various block
sizes (4, 8, 16, 32, 64, 128, and 256) and multiple IO depths (1 and 32)
that we were doing when we started looking at this.

Karl

> 
> Stefan Hajnoczi (3):
>   aio-posix: add aio_set_poll_handler()
>   virtio: poll virtqueues for new buffers
>   linux-aio: poll ring for completions
> 
>  aio-posix.c         | 133
>  block/linux-aio.c   |  17 +++
>  hw/virtio/virtio.c  |  19
>  include/block/aio.h |  16 +++
>  4 files changed, 185 insertions(+)
> 


-- 
Karl Rister