[Qemu-devel] dataplane performance on s390

2014-06-09 Thread Karl Rister

Hi All

I was asked by our development team to do a performance sniff test of 
the latest dataplane code on s390 and compare it against qemu.git.  Here 
is a brief description of the configuration, the testing done, and then 
the results.


Configuration:

Host: 26 CPU LPAR, 64GB, 8 zFCP adapters
Guest: 4 VCPU, 1GB, 128 virtio block devices

Each virtio block device maps to a dm-multipath device in the host with 
8 paths.  Multipath is configured with the service-time policy.  All 
block devices are configured to use the deadline IO scheduler.


Test:

FIO is used to run 4 scenarios: sequential read, sequential write,
random read, and random write.  Sequential scenarios use a 128KB request
size and random scenarios use an 8KB request size.  Each scenario is run
with an increasing number of jobs, from 1 to 128 (powers of 2).  Each
job is bound to an individual file on an ext3 file system on a virtio
device and uses O_DIRECT, libaio, and iodepth=1.  Each test is run three
times for 2 minutes each; the first iteration (a warmup) is thrown out
and the remaining two iterations are averaged together.
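For reference, a minimal fio invocation approximating one of these data
points (sequential read, 4 jobs); the mount point, file size, and run
options shown here are illustrative assumptions, not the exact job files
used:

  fio --name=seqread --rw=read --bs=128k --numjobs=4 \
      --direct=1 --ioengine=libaio --iodepth=1 \
      --directory=/mnt/vdisk01 --size=1g --nrfiles=1 \
      --runtime=120 --time_based --group_reporting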


Results:

Baseline: qemu.git 93f94f9018229f146ed6bbe9e5ff72d67e4bd7ab

Dataplane: bdrv_set_aio_context 0ab50cde71aa27f39b8a3ea4766ff82671adb2a4

Sequential Read:

Overall a slight throughput regression with a noticeable reduction in 
CPU efficiency.


1 Job: Throughput regressed -1.4%, CPU improved -0.83%.
2 Job: Throughput regressed -2.5%, CPU regressed +2.81%
4 Job: Throughput regressed -2.2%, CPU regressed +12.22%
8 Job: Throughput regressed -0.7%, CPU regressed +9.77%
16 Job: Throughput regressed -3.4%, CPU regressed +7.04%
32 Job: Throughput regressed -1.8%, CPU regressed +12.03%
64 Job: Throughput regressed -0.1%, CPU regressed +10.60%
128 Job: Throughput increased +0.3%, CPU regressed +10.70%

Sequential Write:

Throughput mostly regressed, although the gap narrows as the job count
increases and there are even some gains at the higher job counts.  CPU
efficiency regressed at every data point.


1 Job: Throughput regressed -1.9%, CPU regressed +0.90%
2 Job: Throughput regressed -2.0%, CPU regressed +1.07%
4 Job: Throughput regressed -2.4%, CPU regressed +8.68%
8 Job: Throughput regressed -2.0%, CPU regressed +4.23%
16 Job: Throughput regressed -5.0%, CPU regressed +10.53%
32 Job: Throughput improved +7.6%, CPU regressed +7.37%
64 Job: Throughput regressed -0.6%, CPU regressed +7.29%
128 Job: Throughput improved +8.3%, CPU regressed +6.68%

Random Read:

Again, mostly throughput regressions except at the largest job counts.
CPU efficiency regressed at all data points.


1 Job: Throughput regressed -3.0%, CPU regressed +0.14%
2 Job: Throughput regressed -3.6%, CPU regressed +6.86%
4 Job: Throughput regressed -5.1%, CPU regressed +11.11%
8 Job: Throughput regressed -8.6%, CPU regressed +12.32%
16 Job: Throughput regressed -5.7%, CPU regressed +12.99%
32 Job: Throughput regressed -7.4%, CPU regressed +7.62%
64 Job: Throughput improved +10.0%, CPU regressed +10.83%
128 Job: Throughput improved +10.7%, CPU regressed +10.85%

Random Write:

Throughput regressed at all but the highest job count, and CPU
efficiency regressed at all but the lowest.

1 Job: Throughput regressed -2.3%, CPU improved -1.50%
2 Job: Throughput regressed -2.2%, CPU regressed +0.16%
4 Job: Throughput regressed -1.0%, CPU regressed +8.36%
8 Job: Throughput regressed -8.6%, CPU regressed +12.47%
16 Job: Throughput regressed -3.1%, CPU regressed +12.40%
32 Job: Throughput regressed -0.2%, CPU regressed +11.59%
64 Job: Throughput regressed -1.9%, CPU regressed +12.65%
128 Job: Throughput improved +5.6%, CPU regressed +11.68%


* The CPU figures are an efficiency metric: CPU consumption per MB of
throughput.  Lower is better, so a positive delta is a regression and a
negative delta is an improvement.


--
Karl Rister k...@us.ibm.com
IBM Linux/KVM Development Optimization




Re: [Qemu-devel] dataplane performance on s390

2014-06-10 Thread Karl Rister

On 06/09/2014 08:40 PM, Fam Zheng wrote:

On Mon, 06/09 15:43, Karl Rister wrote:

Hi All

I was asked by our development team to do a performance sniff test of the
latest dataplane code on s390 and compare it against qemu.git.  Here is a
brief description of the configuration, the testing done, and then the
results.

Configuration:

Host: 26 CPU LPAR, 64GB, 8 zFCP adapters
Guest: 4 VCPU, 1GB, 128 virtio block devices

Each virtio block device maps to a dm-multipath device in the host with 8
paths.  Multipath is configured with the service-time policy.  All block
devices are configured to use the deadline IO scheduler.

Test:

FIO is used to run 4 scenarios: sequential read, sequential write, random
read, and random write.  Sequential scenarios use a 128KB request size and
random scenarios use an 8KB request size.  Each scenario is run with an
increasing number of jobs, from 1 to 128 (powers of 2).  Each job is bound
to an individual file on an ext3 file system on a virtio device and uses
O_DIRECT, libaio, and iodepth=1.  Each test is run three times for 2 minutes
each, the first iteration (a warmup) is thrown out and the next two
iterations are averaged together.

Results:

Baseline: qemu.git 93f94f9018229f146ed6bbe9e5ff72d67e4bd7ab

Dataplane: bdrv_set_aio_context 0ab50cde71aa27f39b8a3ea4766ff82671adb2a4


Hi Karl,

Thanks for the results.

The throughput differences look minimal; where is the bandwidth saturated in
these tests?  And why use iodepth=1, not more?


Hi Fam

Based on previously collected data, the configuration is hitting 
saturation at the following points:


Sequential Read: 128 jobs
Sequential Write: 32 jobs
Random Read: 64 jobs
Random Write: saturation not reached

The iodepth=1 configuration is a somewhat arbitrary choice, limited
mainly by machine run time; I could certainly run higher loads, and at
times I do.


Thanks.

Karl




--
Karl Rister k...@us.ibm.com
IBM Linux/KVM Development Optimization




Re: [Qemu-devel] Performance about x-data-plane

2017-01-16 Thread Karl Rister
On 01/16/2017 07:15 AM, Stefan Hajnoczi wrote:
> On Tue, Jan 03, 2017 at 12:02:14PM -0500, Weiwei Jia wrote:
>>> The expensive part is the virtqueue kick.  Recently we tried polling the
>>> virtqueue instead of waiting for the ioeventfd file descriptor and got
>>> double-digit performance improvements:
>>> https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg00148.html
>>>
>>> If you want to understand the performance of your benchmark you'll have
>>> compare host/guest disk stats (e.g. request lifetime, disk utilization,
>>> queue depth, average request size) to check that the bare metal and
>>> guest workloads are really sending comparable I/O patterns to the
>>> physical disk.
>>>
>>> Then you using Linux and/or QEMU tracing to analyze the request latency
>>> by looking at interesting points in the request lifecycle like virtqueue
>>> kick, host Linux AIO io_submit(2), etc.
>>>
>>
>> Thank you. I will look into "polling the virtqueue" as you said above.
>> Currently, I just use blktrace to see disk stats and add logs in the
>> I/O workload to see the latency of each request. What kind of tools
>> are you using to analyze the request lifecycle (virtqueue kick, host
>> Linux AIO io_submit, etc.)?
>>
>> Do you trace the lifecycle like this
>> (http://www.linux-kvm.org/page/Virtio/Block/Latency#Performance_data)?
>> It seems to be out of date. Does it
>> (http://repo.or.cz/qemu-kvm/stefanha.git/shortlog/refs/heads/tracing-dev-0.12.4)
>> still work on QEMU 2.4.1?
> 
> The details are out of date but the general approach to tracing the I/O
> request lifecycle still applies.
> 
> There are multiple tracing tools that can do what you need.  I've CCed
> Karl Rister who did the latest virtio-blk dataplane tracing.

I roughly followed this guide by Luiz Capitulino:

https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html

I tweaked his trace-host-and-guest script to avoid doing IO while
tracing is enabled; my version is available here:

http://people.redhat.com/~krister/tracing/trace-host-and-guest

I built QEMU with --enable-trace-backends=ftrace and then turned on the
QEMU trace events I was interested in with this bit of bash:

for event in $(/usr/libexec/qemu-kvm -trace help 2>&1 | grep virtio | \
               grep -v "gpu\|console\|serial\|rng\|balloon\|ccw"); do
    virsh qemu-monitor-command master --hmp trace-event ${event} on
done

At this point, the QEMU trace events are automatically inserted into the
ftrace buffers and the methodology outlined by Luiz gets the guest
kernel, host kernel, and QEMU events properly interleaved.
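As a quick sanity check (an assumption about the workflow, not a step
stated above), the ftrace trace backend writes QEMU's trace records into
the host ftrace ring buffer via trace_marker, so an enabled event such
as virtio_queue_notify should show up alongside the kernel events:

  grep virtio_queue_notify /sys/kernel/debug/tracing/trace | head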

> 
> "perf record -a -e kvm:\*" is a good start.  You can use "perf probe" to
> trace QEMU's trace events (recent versions have sdt support, which means
> SystemTap tracepoints work) and also trace any function in QEMU:
> http://blog.vmsplice.net/2011/03/how-to-use-perf-probe.html
> 
> Stefan
> 
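For completeness, a hedged sketch of the perf workflow described above
(the probed function and binary path are illustrative; use whatever
event name "perf probe" reports):

  # Record KVM tracepoints system-wide for a short window
  perf record -a -e 'kvm:*' -- sleep 10
  perf report --stdio | head -50

  # Add a dynamic probe on a QEMU function, then confirm the event name
  perf probe -x /usr/libexec/qemu-kvm virtio_queue_notify
  perf probe --list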


-- 
Karl Rister <kris...@redhat.com>



Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Karl Rister
On 11/14/2016 07:53 AM, Fam Zheng wrote:
> On Fri, 11/11 13:59, Karl Rister wrote:
>>
>> Stefan
>>
>> I ran some quick tests with your patches and got some pretty good gains,
>> but also some seemingly odd behavior.
>>
>> These results are for a 5 minute test doing sequential 4KB requests from
>> fio using O_DIRECT, libaio, and IO depth of 1.  The requests are
>> performed directly against the virtio-blk device (no filesystem) which
>> is backed by a 400GB NVme card.
>>
>> QEMU_AIO_POLL_MAX_NS    IOPs
>>                unset  31,383
>>                    1  46,860
>>                    2  46,440
>>                    4  35,246
>>                    8  34,973
>>                   16  46,794
>>                   32  46,729
>>                   64  35,520
>>                  128  45,902
> 
> For sequential read with ioq=1, each request takes over 20,000 ns at 45,000
> IOPs. Isn't a poll time of 128 ns a mismatched order of magnitude? Have you
> tried larger values? Not criticizing, just trying to understand how it works.

Not yet; I was just trying to get something out as quickly as I could
(while juggling this with some other stuff...).  Frankly, I was a bit
surprised that the low values made such an impact and then got
distracted by the behavior at 4, 8, and 64.

> 
> Also, do you happen to have numbers for unpatched QEMU (just to confirm that
> "unset" case doesn't cause regression) and baremetal for comparison?

I didn't run this exact test on the same qemu.git master changeset
unpatched.  I did however previously try it against the v2.7.0 tag and
got somewhere around 27.5K IOPs.  My original intention was to apply the
patches to v2.7.0 but it wouldn't build.

We have done a lot of testing and tracing on the qemu-rhev package and
27K IOPs is about what we see there (with tracing disabled).

Given the patch discussions I saw I was mainly trying to get a sniff
test out and then do a more complete workup with whatever updates are made.

I should probably note that there are a lot of pinning optimizations
made here to assist in our tracing efforts which also result in improved
performance.  Ultimately, in a proper evaluation of these patches most
of that will be removed so the behavior may change somewhat.

> 
> Fam
> 


-- 
Karl Rister <kris...@redhat.com>



Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Karl Rister
On 11/14/2016 09:26 AM, Stefan Hajnoczi wrote:
> On Fri, Nov 11, 2016 at 01:59:25PM -0600, Karl Rister wrote:
>> On 11/09/2016 11:13 AM, Stefan Hajnoczi wrote:
>>> Recent performance investigation work done by Karl Rister shows that the
>>> guest->host notification takes around 20 us.  This is more than the 
>>> "overhead"
>>> of QEMU itself (e.g. block layer).
>>>
>>> One way to avoid the costly exit is to use polling instead of notification.
>>> The main drawback of polling is that it consumes CPU resources.  In order to
>>> benefit performance the host must have extra CPU cycles available on 
>>> physical
>>> CPUs that aren't used by the guest.
>>>
>>> This is an experimental AioContext polling implementation.  It adds a 
>>> polling
>>> callback into the event loop.  Polling functions are implemented for 
>>> virtio-blk
>>> virtqueue guest->host kick and Linux AIO completion.
>>>
>>> The QEMU_AIO_POLL_MAX_NS environment variable sets the number of 
>>> nanoseconds to
>>> poll before entering the usual blocking poll(2) syscall.  Try setting this
>>> variable to the time from old request completion to new virtqueue kick.
>>>
>>> By default no polling is done.  The QEMU_AIO_POLL_MAX_NS must be set to get 
>>> any
>>> polling!
>>>
>>> Karl: I hope you can try this patch series with several QEMU_AIO_POLL_MAX_NS
>>> values.  If you don't find a good value we should double-check the tracing 
>>> data
>>> to see if this experimental code can be improved.
>>
>> Stefan
>>
>> I ran some quick tests with your patches and got some pretty good gains,
>> but also some seemingly odd behavior.
>>
>> These results are for a 5 minute test doing sequential 4KB requests from
>> fio using O_DIRECT, libaio, and IO depth of 1.  The requests are
>> performed directly against the virtio-blk device (no filesystem) which
>> is backed by a 400GB NVme card.
>>
>> QEMU_AIO_POLL_MAX_NS    IOPs
>>                unset  31,383
>>                    1  46,860
>>                    2  46,440
>>                    4  35,246
>>                    8  34,973
>>                   16  46,794
>>                   32  46,729
>>                   64  35,520
>>                  128  45,902
> 
> The environment variable is in nanoseconds.  The range of values you
> tried are very small (all <1 usec).  It would be interesting to try
> larger values in the ballpark of the latencies you have traced.  For
> example 2000, 4000, 8000, 16000, and 32000 ns.

Agreed.  As I alluded to in another post, I decided to start at 1 and
double the values until I saw a difference with the expectation that it
would have to get quite large before that happened.  The results went in
a different direction, and then I got distracted by the variation at
certain points.  I figured that by itself the fact that noticeable
improvements were possible with such low values was interesting.

I will definitely continue the progression and capture some larger values.

> 
> Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
> much CPU overhead.
> 
>> I found the results for 4, 8, and 64 odd so I re-ran some tests to check
>> for consistency.  I used values of 2 and 4 and ran each 5 times.  Here
>> is what I got:
>>
>> Iteration   QEMU_AIO_POLL_MAX_NS=2   QEMU_AIO_POLL_MAX_NS=4
>>         1                   46,972                   35,434
>>         2                   46,939                   35,719
>>         3                   47,005                   35,584
>>         4                   47,016                   35,615
>>         5                   47,267                   35,474
>>
>> So the results seem consistent.
> 
> That is interesting.  I don't have an explanation for the consistent
> difference between 2 and 4 ns polling time.  The time difference is so
> small yet the IOPS difference is clear.
> 
> Comparing traces could shed light on the cause for this difference.
> 
>> I saw some discussion on the patches made which make me think you'll be
>> making some changes, is that right?  If so, I may wait for the updates
>> and then we can run the much more exhaustive set of workloads
>> (sequential read and write, random read and write) at various block
>> sizes (4, 8, 16, 32, 64, 128, and 256) and multiple IO depths (1 and 32)
>> that we were doing when we started looking at this.
> 
> I'll send an updated version of the patches.
> 
> Stefan
> 


-- 
Karl Rister <kris...@redhat.com>



Re: [Qemu-devel] [PATCH v2 0/4] aio: experimental virtio-blk polling mode

2016-11-17 Thread Karl Rister
I think these results look a bit more in line with expectations on the
quick sniff test:

QEMU_AIO_POLL_MAX_NS    IOPs
               unset  26,299
                   1  25,929
                   2  25,753
                   4  27,214
                   8  27,053
                  16  26,861
                  32  24,752
                  64  25,058
                 128  24,732
                 256  25,560
                 512  24,614
               1,024  25,186
               2,048  25,829
               4,096  25,671
               8,192  27,896
              16,384  38,086
              32,768  35,493
              65,536  38,496
             131,072  38,296

I did a spot check of CPU utilization when the polling started having
benefits.

Without polling (QEMU_AIO_POLL_MAX_NS=unset) the iothread's CPU usage
looked like this:

user time:   25.94%
system time: 22.11%

With polling and QEMU_AIO_POLL_MAX_NS=16384 the iothread's CPU usage
looked like this:

user time:   78.92%
system time: 20.80%
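For context, a hedged sketch of how the knob is applied (the exact
command lines and device paths from these runs are not in this thread;
the invocation below is an illustrative x86 example, and the environment
variable comes from the experimental patches):

  QEMU_AIO_POLL_MAX_NS=16384 qemu-system-x86_64 \
      -machine accel=kvm -m 4096 -smp 4 \
      -drive file=guest.img,if=virtio,format=raw \
      -object iothread,id=io1 \
      -drive file=/dev/nvme0n1,if=none,id=nvme0,format=raw,cache=none,aio=native \
      -device virtio-blk-pci,drive=nvme0,iothread=io1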

Karl

On 11/16/2016 11:46 AM, Stefan Hajnoczi wrote:
> v2:
>  * Uninitialized node->deleted gone [Fam]
>  * Removed 1024 polling loop iteration qemu_clock_get_ns() optimization which
>    created a weird step pattern [Fam]
>  * Unified with AioHandler, dropped AioPollHandler struct [Paolo]
>    (actually I think Paolo had more in mind but this is the first step)
>  * Only poll when all event loop resources support it [Paolo]
>  * Added run_poll_handlers_begin/end trace events for perf analysis
>  * Sorry, Christian, no virtqueue kick suppression yet
> 
> Recent performance investigation work done by Karl Rister shows that the
> guest->host notification takes around 20 us.  This is more than the "overhead"
> of QEMU itself (e.g. block layer).
> 
> One way to avoid the costly exit is to use polling instead of notification.
> The main drawback of polling is that it consumes CPU resources.  In order to
> benefit performance the host must have extra CPU cycles available on physical
> CPUs that aren't used by the guest.
> 
> This is an experimental AioContext polling implementation.  It adds a polling
> callback into the event loop.  Polling functions are implemented for 
> virtio-blk
> virtqueue guest->host kick and Linux AIO completion.
> 
> The QEMU_AIO_POLL_MAX_NS environment variable sets the number of nanoseconds 
> to
> poll before entering the usual blocking poll(2) syscall.  Try setting this
> variable to the time from old request completion to new virtqueue kick.
> 
> By default no polling is done.  The QEMU_AIO_POLL_MAX_NS must be set to get 
> any
> polling!
> 
> Stefan Hajnoczi (4):
>   aio: add AioPollFn and io_poll() interface
>   aio: add polling mode to AioContext
>   virtio: poll virtqueues for new buffers
>   linux-aio: poll ring for completions
> 
>  aio-posix.c                 | 115 ++--
>  async.c                     |  14 +-
>  block/curl.c                |   8 +--
>  block/iscsi.c               |   3 +-
>  block/linux-aio.c           |  19 +++-
>  block/nbd-client.c          |   8 +--
>  block/nfs.c                 |   7 +--
>  block/sheepdog.c            |  26 +-
>  block/ssh.c                 |   4 +-
>  block/win32-aio.c           |   4 +-
>  hw/virtio/virtio.c          |  18 ++-
>  include/block/aio.h         |   8 ++-
>  iohandler.c                 |   2 +-
>  nbd/server.c                |   9 ++--
>  stubs/set-fd-handler.c      |   1 +
>  tests/test-aio.c            |   4 +-
>  trace-events                |   4 ++
>  util/event_notifier-posix.c |   2 +-
>  18 files changed, 207 insertions(+), 49 deletions(-)
> 


-- 
Karl Rister <kris...@redhat.com>



Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-11 Thread Karl Rister
On 11/09/2016 11:13 AM, Stefan Hajnoczi wrote:
> Recent performance investigation work done by Karl Rister shows that the
> guest->host notification takes around 20 us.  This is more than the "overhead"
> of QEMU itself (e.g. block layer).
> 
> One way to avoid the costly exit is to use polling instead of notification.
> The main drawback of polling is that it consumes CPU resources.  In order to
> benefit performance the host must have extra CPU cycles available on physical
> CPUs that aren't used by the guest.
> 
> This is an experimental AioContext polling implementation.  It adds a polling
> callback into the event loop.  Polling functions are implemented for 
> virtio-blk
> virtqueue guest->host kick and Linux AIO completion.
> 
> The QEMU_AIO_POLL_MAX_NS environment variable sets the number of nanoseconds 
> to
> poll before entering the usual blocking poll(2) syscall.  Try setting this
> variable to the time from old request completion to new virtqueue kick.
> 
> By default no polling is done.  The QEMU_AIO_POLL_MAX_NS must be set to get 
> any
> polling!
> 
> Karl: I hope you can try this patch series with several QEMU_AIO_POLL_MAX_NS
> values.  If you don't find a good value we should double-check the tracing 
> data
> to see if this experimental code can be improved.

Stefan

I ran some quick tests with your patches and got some pretty good gains,
but also some seemingly odd behavior.

These results are for a 5-minute test doing sequential 4KB requests from
fio using O_DIRECT, libaio, and an IO depth of 1.  The requests are
performed directly against the virtio-blk device (no filesystem), which
is backed by a 400GB NVMe card.
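A hedged sketch of an fio invocation matching this description (the
guest device path is an assumption):

  fio --name=seqread-4k --filename=/dev/vdb --rw=read --bs=4k \
      --direct=1 --ioengine=libaio --iodepth=1 \
      --runtime=300 --time_based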

QEMU_AIO_POLL_MAX_NS    IOPs
               unset  31,383
                   1  46,860
                   2  46,440
                   4  35,246
                   8  34,973
                  16  46,794
                  32  46,729
                  64  35,520
                 128  45,902

I found the results for 4, 8, and 64 odd so I re-ran some tests to check
for consistency.  I used values of 2 and 4 and ran each 5 times.  Here
is what I got:

Iteration   QEMU_AIO_POLL_MAX_NS=2   QEMU_AIO_POLL_MAX_NS=4
        1                   46,972                   35,434
        2                   46,939                   35,719
        3                   47,005                   35,584
        4                   47,016                   35,615
        5                   47,267                   35,474

So the results seem consistent.

I saw some discussion on the patches which makes me think you'll be
making some changes; is that right?  If so, I may wait for the updates
and then we can run the much more exhaustive set of workloads
(sequential read and write, random read and write) at various block
sizes (4, 8, 16, 32, 64, 128, and 256 KB) and multiple IO depths (1 and
32) that we were doing when we started looking at this.

Karl

> 
> Stefan Hajnoczi (3):
>   aio-posix: add aio_set_poll_handler()
>   virtio: poll virtqueues for new buffers
>   linux-aio: poll ring for completions
> 
>  aio-posix.c         | 133 
>  block/linux-aio.c   |  17 +++
>  hw/virtio/virtio.c  |  19 
>  include/block/aio.h |  16 +++
>  4 files changed, 185 insertions(+)
> 


-- 
Karl Rister <kris...@redhat.com>



Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-15 Thread Karl Rister
On 11/15/2016 04:32 AM, Stefan Hajnoczi wrote:
> On Mon, Nov 14, 2016 at 09:52:00PM +0100, Paolo Bonzini wrote:
>> On 14/11/2016 21:12, Karl Rister wrote:
>>>                  256  46,929
>>>                  512  35,627
>>>                1,024  46,477
>>>                2,000  35,247
>>>                2,048  46,322
>>>                4,000  46,540
>>>                4,096  46,368
>>>                8,000  47,054
>>>                8,192  46,671
>>>               16,000  46,466
>>>               16,384  32,504
>>>               32,000  20,620
>>>               32,768  20,807
>>
>> Huh, it breaks down exactly when it should start going faster
>> (10^9/46000 = ~21000).
> 
> Could it be because we're not breaking the polling loop for BHs, new
> timers, or aio_notify()?
> 
> Once that is fixed polling should achieve maximum performance when
> QEMU_AIO_POLL_MAX_NS is at least as long as the duration of a request.
> 
> This is logical if there are enough pinned CPUs so the polling thread
> can run flat out.
> 

I removed all the pinning and restored the guest to a "normal"
configuration.

QEMU_AIO_POLL_MAX_NS    IOPs
               unset  25,553
                   1  28,684
                   2  38,213
                   4  29,413
                   8  38,612
                  16  30,578
                  32  30,145
                  64  41,637
                 128  28,554
                 256  29,661
                 512  39,178
               1,024  29,644
               2,048  37,190
               4,096  29,838
               8,192  38,581
              16,384  37,793
              32,768  20,332
              65,536  35,755

-- 
Karl Rister <kris...@redhat.com>



Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode

2016-11-14 Thread Karl Rister
On 11/14/2016 09:26 AM, Stefan Hajnoczi wrote:
> On Fri, Nov 11, 2016 at 01:59:25PM -0600, Karl Rister wrote:
>> On 11/09/2016 11:13 AM, Stefan Hajnoczi wrote:
>>> Recent performance investigation work done by Karl Rister shows that the
>>> guest->host notification takes around 20 us.  This is more than the 
>>> "overhead"
>>> of QEMU itself (e.g. block layer).
>>>
>>> One way to avoid the costly exit is to use polling instead of notification.
>>> The main drawback of polling is that it consumes CPU resources.  In order to
>>> benefit performance the host must have extra CPU cycles available on 
>>> physical
>>> CPUs that aren't used by the guest.
>>>
>>> This is an experimental AioContext polling implementation.  It adds a 
>>> polling
>>> callback into the event loop.  Polling functions are implemented for 
>>> virtio-blk
>>> virtqueue guest->host kick and Linux AIO completion.
>>>
>>> The QEMU_AIO_POLL_MAX_NS environment variable sets the number of 
>>> nanoseconds to
>>> poll before entering the usual blocking poll(2) syscall.  Try setting this
>>> variable to the time from old request completion to new virtqueue kick.
>>>
>>> By default no polling is done.  The QEMU_AIO_POLL_MAX_NS must be set to get 
>>> any
>>> polling!
>>>
>>> Karl: I hope you can try this patch series with several QEMU_AIO_POLL_MAX_NS
>>> values.  If you don't find a good value we should double-check the tracing 
>>> data
>>> to see if this experimental code can be improved.
>>
>> Stefan
>>
>> I ran some quick tests with your patches and got some pretty good gains,
>> but also some seemingly odd behavior.
>>
>> These results are for a 5 minute test doing sequential 4KB requests from
>> fio using O_DIRECT, libaio, and IO depth of 1.  The requests are
>> performed directly against the virtio-blk device (no filesystem) which
>> is backed by a 400GB NVme card.
>>
>> QEMU_AIO_POLL_MAX_NS    IOPs
>>                unset  31,383
>>                    1  46,860
>>                    2  46,440
>>                    4  35,246
>>                    8  34,973
>>                   16  46,794
>>                   32  46,729
>>                   64  35,520
>>                  128  45,902
> 
> The environment variable is in nanoseconds.  The range of values you
> tried are very small (all <1 usec).  It would be interesting to try
> larger values in the ballpark of the latencies you have traced.  For
> example 2000, 4000, 8000, 16000, and 32000 ns.

Here are some more numbers with higher values.  I continued the
powers-of-2 progression and added in your example values as well:
QEMU_AIO_POLL_MAX_NS    IOPs
                 256  46,929
                 512  35,627
               1,024  46,477
               2,000  35,247
               2,048  46,322
               4,000  46,540
               4,096  46,368
               8,000  47,054
               8,192  46,671
              16,000  46,466
              16,384  32,504
              32,000  20,620
              32,768  20,807

> 
> Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
> much CPU overhead.
> 
>> I found the results for 4, 8, and 64 odd so I re-ran some tests to check
>> for consistency.  I used values of 2 and 4 and ran each 5 times.  Here
>> is what I got:
>>
>> Iteration   QEMU_AIO_POLL_MAX_NS=2   QEMU_AIO_POLL_MAX_NS=4
>>         1                   46,972                   35,434
>>         2                   46,939                   35,719
>>         3                   47,005                   35,584
>>         4                   47,016                   35,615
>>         5                   47,267                   35,474
>>
>> So the results seem consistent.
> 
> That is interesting.  I don't have an explanation for the consistent
> difference between 2 and 4 ns polling time.  The time difference is so
> small yet the IOPS difference is clear.
> 
> Comparing traces could shed light on the cause for this difference.
> 
>> I saw some discussion on the patches made which make me think you'll be
>> making some changes, is that right?  If so, I may wait for the updates
>> and then we can run the much more exhaustive set of workloads
>> (sequential read and write, random read and write) at various block
>> sizes (4, 8, 16, 32, 64, 128, and 256) and multiple IO depths (1 and 32)
>> that we were doing when we started looking at this.
> 
> I'll send an updated version of the patches.
> 
> Stefan
> 


-- 
Karl Rister <kris...@redhat.com>



Re: [Qemu-devel] [PATCH v4 00/13] aio: experimental virtio-blk polling mode

2016-12-05 Thread Karl Rister
On 12/05/2016 08:56 AM, Stefan Hajnoczi wrote:


> Karl: do you have time to run a bigger suite of benchmarks to identify a
> reasonable default poll-max-ns value?  Both aio=native and aio=threads
> are important.
> 
> If there is a sweet spot that improves performance without pathological
> cases then we could even enable polling by default in QEMU.
> 
> Otherwise we'd just document the recommended best polling duration as a
> starting point for users.
> 

I have collected a baseline on the latest patches and am currently
collecting data with poll-max-ns=16384.  I can certainly throw in a few
more scenarios.  Do we want to stick with powers of 2 or some other
strategy?
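For reference, a hedged sketch of how a poll-max-ns value can be applied
to an iothread (assuming the property name used by this patch series;
the object/drive IDs and domain name are placeholders, and values are in
nanoseconds):

  # Command-line form (fragment of a full QEMU invocation):
  -object iothread,id=io1,poll-max-ns=16384
  -device virtio-blk-pci,drive=drive0,iothread=io1

  # Runtime form via QMP qom-set:
  virsh qemu-monitor-command mydomain \
      '{"execute": "qom-set", "arguments":
        {"path": "/objects/io1", "property": "poll-max-ns", "value": 16384}}'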

-- 
Karl Rister <kris...@redhat.com>