Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
On Mon, 11/14 16:29, Paolo Bonzini wrote:
> On 14/11/2016 16:26, Stefan Hajnoczi wrote:
> > On Fri, Nov 11, 2016 at 01:59:25PM -0600, Karl Rister wrote:
> >> QEMU_AIO_POLL_MAX_NS  IOPs
> >> unset                 31,383
> >> 1                     46,860
> >> 2                     46,440
> >> 4                     35,246
> >> 8                     34,973
> >> 16                    46,794
> >> 32                    46,729
> >> 64                    35,520
> >> 128                   45,902
> >
> > The environment variable is in nanoseconds.  The range of values you
> > tried is very small (all <1 usec).  It would be interesting to try
> > larger values in the ballpark of the latencies you have traced.  For
> > example 2000, 4000, 8000, 16000, and 32000 ns.
> >
> > Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
> > much CPU overhead.
>
> That basically means "avoid a syscall if you already know there's
> something to do", so in retrospect it's not that surprising.  Still
> interesting though, and it means that the feature is useful even if you
> don't have CPU to waste.

With the "deleted" bug fixed I did a little more testing to understand
this.  Setting QEMU_AIO_POLL_MAX_NS=1 doesn't mean run_poll_handlers()
will only loop for 1 ns - the patch only checks the deadline every 1024
polls.  The first poll in a run_poll_handlers() call can hardly succeed,
so we poll at least 1024 times.

According to my test, on average each run_poll_handlers() call takes
~12000 ns, which is ~160 iterations of the poll loop, before getting a
new event (either from the virtio queue or linux-aio, I don't have the
ratio here).  So in the worst case (no new event), 1024 iterations is
basically (12000 / 160 * 1024) = 76800 ns!

The above is with iodepth=1 and jobs=1.  With iodepth=32 and jobs=1, or
iodepth=8 and jobs=4, the numbers are ~30 polls at ~5600 ns.

Fam
Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
On 11/15/2016 04:32 AM, Stefan Hajnoczi wrote:
> On Mon, Nov 14, 2016 at 09:52:00PM +0100, Paolo Bonzini wrote:
>> On 14/11/2016 21:12, Karl Rister wrote:
>>> 256                   46,929
>>> 512                   35,627
>>> 1,024                 46,477
>>> 2,000                 35,247
>>> 2,048                 46,322
>>> 4,000                 46,540
>>> 4,096                 46,368
>>> 8,000                 47,054
>>> 8,192                 46,671
>>> 16,000                46,466
>>> 16,384                32,504
>>> 32,000                20,620
>>> 32,768                20,807
>>
>> Huh, it breaks down exactly when it should start going faster
>> (10^9/46000 = ~21000).
>
> Could it be because we're not breaking the polling loop for BHs, new
> timers, or aio_notify()?
>
> Once that is fixed polling should achieve maximum performance when
> QEMU_AIO_POLL_MAX_NS is at least as long as the duration of a request.
>
> This is logical if there are enough pinned CPUs so the polling thread
> can run flat out.

I removed all the pinning and restored the guest to a "normal"
configuration.

QEMU_AIO_POLL_MAX_NS  IOPs
unset                 25,553
1                     28,684
2                     38,213
4                     29,413
8                     38,612
16                    30,578
32                    30,145
64                    41,637
128                   28,554
256                   29,661
512                   39,178
1,024                 29,644
2,048                 37,190
4,096                 29,838
8,192                 38,581
16,384                37,793
32,768                20,332
65,536                35,755

--
Karl Rister
Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
On Mon, Nov 14, 2016 at 06:15:46PM +0100, Paolo Bonzini wrote:
> On 14/11/2016 18:06, Stefan Hajnoczi wrote:
>>>> Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well
>>>> without much CPU overhead.
>>>
>>> That basically means "avoid a syscall if you already know there's
>>> something to do", so in retrospect it's not that surprising.  Still
>>> interesting though, and it means that the feature is useful even if
>>> you don't have CPU to waste.
>>
>> Can you spell out which syscall you mean?  Reading the ioeventfd?
>
> I mean ppoll.  If ppoll succeeds without ever going to sleep, you can
> achieve the same result with QEMU_AIO_POLL_MAX_NS=1, but cheaper.

It's not obvious to me that ioeventfd or Linux AIO will become ready
with QEMU_AIO_POLL_MAX_NS=1.  This benchmark is iodepth=1 so there's
just a single request.  Fam suggested that maybe Linux AIO is ready
immediately, but AFAIK this NVMe device should still take a few
microseconds to complete a request whereas our polling time is 1
nanosecond.  Tracing would reveal what is going on here.

Stefan
Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
On Mon, Nov 14, 2016 at 09:52:00PM +0100, Paolo Bonzini wrote:
> On 14/11/2016 21:12, Karl Rister wrote:
>> 256                   46,929
>> 512                   35,627
>> 1,024                 46,477
>> 2,000                 35,247
>> 2,048                 46,322
>> 4,000                 46,540
>> 4,096                 46,368
>> 8,000                 47,054
>> 8,192                 46,671
>> 16,000                46,466
>> 16,384                32,504
>> 32,000                20,620
>> 32,768                20,807
>
> Huh, it breaks down exactly when it should start going faster
> (10^9/46000 = ~21000).

Could it be because we're not breaking the polling loop for BHs, new
timers, or aio_notify()?

Once that is fixed polling should achieve maximum performance when
QEMU_AIO_POLL_MAX_NS is at least as long as the duration of a request.

This is logical if there are enough pinned CPUs so the polling thread
can run flat out.

Stefan
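Stefan's suggested fix - breaking the polling loop for BHs, new timers, and aio_notify() - can be sketched as a stop condition.  The struct and field names here are hypothetical, not QEMU's actual AioContext bookkeeping; the point is only that a large QEMU_AIO_POLL_MAX_NS must not starve other event-loop work:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical polling-loop state.  In QEMU these conditions live in
 * the AioContext; this sketch just names the three cases mentioned. */
struct poll_state {
    atomic_bool bh_scheduled;  /* a bottom half was scheduled          */
    atomic_bool notified;      /* another thread called aio_notify()    */
    int64_t next_timer_ns;     /* earliest timer deadline, -1 if none  */
};

/* Polling must give up not only when an fd handler is ready but also
 * when a BH, an expired timer, or an aio_notify() demands attention -
 * otherwise polling past ~21 us blocks the rest of the event loop,
 * which would explain the breakdown in the table above. */
static bool must_stop_polling(struct poll_state *s, int64_t now_ns)
{
    if (atomic_load(&s->bh_scheduled) || atomic_load(&s->notified)) {
        return true;
    }
    return s->next_timer_ns >= 0 && now_ns >= s->next_timer_ns;
}
```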
Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
On 14/11/2016 21:12, Karl Rister wrote:
> 256                   46,929
> 512                   35,627
> 1,024                 46,477
> 2,000                 35,247
> 2,048                 46,322
> 4,000                 46,540
> 4,096                 46,368
> 8,000                 47,054
> 8,192                 46,671
> 16,000                46,466
> 16,384                32,504
> 32,000                20,620
> 32,768                20,807

Huh, it breaks down exactly when it should start going faster
(10^9/46000 = ~21000).

Paolo
Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
On 11/14/2016 09:26 AM, Stefan Hajnoczi wrote:
> On Fri, Nov 11, 2016 at 01:59:25PM -0600, Karl Rister wrote:
>> On 11/09/2016 11:13 AM, Stefan Hajnoczi wrote:
>>> Recent performance investigation work done by Karl Rister shows that
>>> the guest->host notification takes around 20 us.  This is more than
>>> the "overhead" of QEMU itself (e.g. block layer).
>>>
>>> One way to avoid the costly exit is to use polling instead of
>>> notification.  The main drawback of polling is that it consumes CPU
>>> resources.  In order to benefit performance the host must have extra
>>> CPU cycles available on physical CPUs that aren't used by the guest.
>>>
>>> This is an experimental AioContext polling implementation.  It adds
>>> a polling callback into the event loop.  Polling functions are
>>> implemented for virtio-blk virtqueue guest->host kick and Linux AIO
>>> completion.
>>>
>>> The QEMU_AIO_POLL_MAX_NS environment variable sets the number of
>>> nanoseconds to poll before entering the usual blocking poll(2)
>>> syscall.  Try setting this variable to the time from old request
>>> completion to new virtqueue kick.
>>>
>>> By default no polling is done.  QEMU_AIO_POLL_MAX_NS must be set to
>>> get any polling!
>>>
>>> Karl: I hope you can try this patch series with several
>>> QEMU_AIO_POLL_MAX_NS values.  If you don't find a good value we
>>> should double-check the tracing data to see if this experimental
>>> code can be improved.
>>
>> Stefan
>>
>> I ran some quick tests with your patches and got some pretty good
>> gains, but also some seemingly odd behavior.
>>
>> These results are for a 5 minute test doing sequential 4KB requests
>> from fio using O_DIRECT, libaio, and IO depth of 1.  The requests are
>> performed directly against the virtio-blk device (no filesystem)
>> which is backed by a 400GB NVMe card.
>>
>> QEMU_AIO_POLL_MAX_NS  IOPs
>> unset                 31,383
>> 1                     46,860
>> 2                     46,440
>> 4                     35,246
>> 8                     34,973
>> 16                    46,794
>> 32                    46,729
>> 64                    35,520
>> 128                   45,902
>
> The environment variable is in nanoseconds.  The range of values you
> tried is very small (all <1 usec).  It would be interesting to try
> larger values in the ballpark of the latencies you have traced.  For
> example 2000, 4000, 8000, 16000, and 32000 ns.

Here are some more numbers with higher values.  I continued the power
of 2 values and added in your examples as well:

QEMU_AIO_POLL_MAX_NS  IOPs
256                   46,929
512                   35,627
1,024                 46,477
2,000                 35,247
2,048                 46,322
4,000                 46,540
4,096                 46,368
8,000                 47,054
8,192                 46,671
16,000                46,466
16,384                32,504
32,000                20,620
32,768                20,807

> Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
> much CPU overhead.
>
>> I found the results for 4, 8, and 64 odd so I re-ran some tests to
>> check for consistency.  I used values of 2 and 4 and ran each 5
>> times.  Here is what I got:
>>
>> Iteration  QEMU_AIO_POLL_MAX_NS=2  QEMU_AIO_POLL_MAX_NS=4
>> 1          46,972                  35,434
>> 2          46,939                  35,719
>> 3          47,005                  35,584
>> 4          47,016                  35,615
>> 5          47,267                  35,474
>>
>> So the results seem consistent.
>
> That is interesting.  I don't have an explanation for the consistent
> difference between 2 and 4 ns polling time.  The time difference is so
> small yet the IOPS difference is clear.
>
> Comparing traces could shed light on the cause for this difference.
>
>> I saw some discussion on the patches which makes me think you'll be
>> making some changes, is that right?  If so, I may wait for the
>> updates and then we can run the much more exhaustive set of workloads
>> (sequential read and write, random read and write) at various block
>> sizes (4, 8, 16, 32, 64, 128, and 256) and multiple IO depths (1 and
>> 32) that we were doing when we started looking at this.
>
> I'll send an updated version of the patches.
>
> Stefan

--
Karl Rister
Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
On 14/11/2016 18:06, Stefan Hajnoczi wrote:
>>> Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well
>>> without much CPU overhead.
>>
>> That basically means "avoid a syscall if you already know there's
>> something to do", so in retrospect it's not that surprising.  Still
>> interesting though, and it means that the feature is useful even if
>> you don't have CPU to waste.
>
> Can you spell out which syscall you mean?  Reading the ioeventfd?

I mean ppoll.  If ppoll succeeds without ever going to sleep, you can
achieve the same result with QEMU_AIO_POLL_MAX_NS=1, but cheaper.

Paolo

> The benchmark uses virtio-blk dataplane and iodepth=1 so there
> shouldn't be much IOThread event loop activity besides the single I/O
> request.
>
> The reason this puzzles me is that I wouldn't expect poll to succeed
> with QEMU_AIO_POLL_MAX_NS and iodepth=1.
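Paolo's "avoid a syscall" point can be made concrete with a sketch (simplified, hypothetical vring layout - not QEMU's actual structures): the guest publishes its avail index in shared memory, so the host can detect a pending request with a plain memory read instead of a ppoll()/ioeventfd-read round trip.

```c
#include <stdint.h>

/* Simplified view of the guest-writable part of a virtqueue ring.
 * The real virtio vring_avail has flags and a ring array as well. */
struct vring_avail_view {
    uint16_t idx;  /* incremented by the guest after queueing a buffer */
};

/* Detecting work is a memory compare, not a syscall: if the guest's
 * published index differs from the last index we processed, there is a
 * pending request and the ppoll() sleep/wake cycle can be skipped. */
static int virtqueue_poll(const struct vring_avail_view *avail,
                          uint16_t last_avail_idx)
{
    return avail->idx != last_avail_idx;
}
```

If this check succeeds, the event loop processes the request directly; only when it fails for the whole polling budget does the thread fall back to the blocking ppoll() syscall.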
Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
On Mon, Nov 14, 2016 at 04:29:49PM +0100, Paolo Bonzini wrote:
> On 14/11/2016 16:26, Stefan Hajnoczi wrote:
>> On Fri, Nov 11, 2016 at 01:59:25PM -0600, Karl Rister wrote:
>>> QEMU_AIO_POLL_MAX_NS  IOPs
>>> unset                 31,383
>>> 1                     46,860
>>> 2                     46,440
>>> 4                     35,246
>>> 8                     34,973
>>> 16                    46,794
>>> 32                    46,729
>>> 64                    35,520
>>> 128                   45,902
>>
>> The environment variable is in nanoseconds.  The range of values you
>> tried is very small (all <1 usec).  It would be interesting to try
>> larger values in the ballpark of the latencies you have traced.  For
>> example 2000, 4000, 8000, 16000, and 32000 ns.
>>
>> Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
>> much CPU overhead.
>
> That basically means "avoid a syscall if you already know there's
> something to do", so in retrospect it's not that surprising.  Still
> interesting though, and it means that the feature is useful even if
> you don't have CPU to waste.

Can you spell out which syscall you mean?  Reading the ioeventfd?

The benchmark uses virtio-blk dataplane and iodepth=1 so there
shouldn't be much IOThread event loop activity besides the single I/O
request.

The reason this puzzles me is that I wouldn't expect poll to succeed
with QEMU_AIO_POLL_MAX_NS and iodepth=1.

Thanks,
Stefan
Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
On Mon, 11/14 17:06, Stefan Hajnoczi wrote:
> On Mon, Nov 14, 2016 at 04:29:49PM +0100, Paolo Bonzini wrote:
>> On 14/11/2016 16:26, Stefan Hajnoczi wrote:
>>> [...]
>>>
>>> Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well
>>> without much CPU overhead.
>>
>> That basically means "avoid a syscall if you already know there's
>> something to do", so in retrospect it's not that surprising.  Still
>> interesting though, and it means that the feature is useful even if
>> you don't have CPU to waste.
>
> Can you spell out which syscall you mean?  Reading the ioeventfd?
>
> The benchmark uses virtio-blk dataplane and iodepth=1 so there
> shouldn't be much IOThread event loop activity besides the single I/O
> request.
>
> The reason this puzzles me is that I wouldn't expect poll to succeed
> with QEMU_AIO_POLL_MAX_NS and iodepth=1.

I see the guest shouldn't send more requests, but isn't it possible for
the linux-aio poll to succeed?

Fam
Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
On Mon, Nov 14, 2016 at 08:52:21AM -0600, Karl Rister wrote:
> On 11/14/2016 07:53 AM, Fam Zheng wrote:
>> On Fri, 11/11 13:59, Karl Rister wrote:
>>> Stefan
>>>
>>> I ran some quick tests with your patches and got some pretty good
>>> gains, but also some seemingly odd behavior.
>>>
>>> These results are for a 5 minute test doing sequential 4KB requests
>>> from fio using O_DIRECT, libaio, and IO depth of 1.  The requests
>>> are performed directly against the virtio-blk device (no
>>> filesystem) which is backed by a 400GB NVMe card.
>>>
>>> QEMU_AIO_POLL_MAX_NS  IOPs
>>> unset                 31,383
>>> 1                     46,860
>>> 2                     46,440
>>> 4                     35,246
>>> 8                     34,973
>>> 16                    46,794
>>> 32                    46,729
>>> 64                    35,520
>>> 128                   45,902
>>
>> For sequential read with ioq=1, each request takes >20,000 ns under
>> 45,000 IOPs.  Isn't a poll time of 128 ns a mismatching order of
>> magnitude?  Have you tried larger values?  Not criticizing, just
>> trying to understand how it works.
>
> Not yet, I was just trying to get something out as quick as I could
> (while juggling this with some other stuff...).  Frankly I was a bit
> surprised that the low values made such an impact and then got
> distracted by the behaviors of 4, 8, and 64.
>
>> Also, do you happen to have numbers for unpatched QEMU (just to
>> confirm that the "unset" case doesn't cause regression) and
>> baremetal for comparison?
>
> I didn't run this exact test on the same qemu.git master changeset
> unpatched.  I did however previously try it against the v2.7.0 tag
> and got somewhere around 27.5K IOPs.  My original intention was to
> apply the patches to v2.7.0 but it wouldn't build.
>
> We have done a lot of testing and tracing on the qemu-rhev package
> and 27K IOPs is about what we see there (with tracing disabled).
>
> Given the patch discussions I saw I was mainly trying to get a sniff
> test out and then do a more complete workup with whatever updates are
> made.
>
> I should probably note that there are a lot of pinning optimizations
> made here to assist in our tracing efforts which also result in
> improved performance.  Ultimately, in a proper evaluation of these
> patches most of that will be removed so the behavior may change
> somewhat.

To clarify: QEMU_AIO_POLL_MAX_NS unset or 0 disables polling
completely.  Therefore it's not necessary to run unpatched.

Stefan
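The "unset or 0 disables polling" semantics Stefan describes can be sketched as follows.  The parsing below is illustrative, not the patch's literal code; the point is that the patched binary with the variable unset behaves like the unpatched one:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical sketch: read QEMU_AIO_POLL_MAX_NS once at startup.
 * Unset or "0" yields 0, which means "never poll, go straight to the
 * blocking ppoll() path" - i.e. pre-patch behaviour. */
static int64_t aio_poll_max_ns(void)
{
    const char *val = getenv("QEMU_AIO_POLL_MAX_NS");
    return val ? strtoll(val, NULL, 10) : 0;  /* 0 = polling disabled */
}
```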
Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
On Mon, Nov 14, 2016 at 03:51:18PM +0100, Christian Borntraeger wrote:
> On 11/09/2016 06:13 PM, Stefan Hajnoczi wrote:
>> Recent performance investigation work done by Karl Rister shows that
>> the guest->host notification takes around 20 us.  This is more than
>> the "overhead" of QEMU itself (e.g. block layer).
>>
>> [...]
>>
>> Stefan Hajnoczi (3):
>>   aio-posix: add aio_set_poll_handler()
>>   virtio: poll virtqueues for new buffers
>>   linux-aio: poll ring for completions
>>
>>  aio-posix.c         | 133 ++++++++++++++++++++
>>  block/linux-aio.c   |  17 +++
>>  hw/virtio/virtio.c  |  19 ++++
>>  include/block/aio.h |  16 +++
>>  4 files changed, 185 insertions(+)
>
> Hmm, I see all affected threads using more CPU power, but the
> performance numbers are somewhat inconclusive on s390.  I have no
> proper test setup (only a shared LPAR), but all numbers are in the
> same ballpark of 3-5 Gbyte/sec for 5 disks for 4k random reads with
> iodepth=8.
>
> What I find interesting is that the guest still does a huge amount of
> exits for the guest->host notifications.  I think if we could combine
> this with some notification suppression, then things could be even
> more interesting.

Great idea.  I'll add that to the next revision.

Stefan
Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
On Mon, Nov 14, 2016 at 03:59:32PM +0100, Christian Borntraeger wrote:
> On 11/09/2016 06:13 PM, Stefan Hajnoczi wrote:
>> Recent performance investigation work done by Karl Rister shows that
>> the guest->host notification takes around 20 us.  This is more than
>> the "overhead" of QEMU itself (e.g. block layer).
>>
>> [...]
>
> Another observation: With more iothreads than host CPUs the
> performance drops significantly.

This makes sense although we can eliminate it in common cases by only
polling when we actually need to monitor events.  The current series
wastes CPU on Linux AIO polling when no requests are pending ;).

Stefan
Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
On 11/14/2016 09:26 AM, Stefan Hajnoczi wrote:
> On Fri, Nov 11, 2016 at 01:59:25PM -0600, Karl Rister wrote:
>> On 11/09/2016 11:13 AM, Stefan Hajnoczi wrote:
>>> Recent performance investigation work done by Karl Rister shows
>>> that the guest->host notification takes around 20 us.  This is more
>>> than the "overhead" of QEMU itself (e.g. block layer).
>>>
>>> [...]
>>
>> Stefan
>>
>> I ran some quick tests with your patches and got some pretty good
>> gains, but also some seemingly odd behavior.
>>
>> These results are for a 5 minute test doing sequential 4KB requests
>> from fio using O_DIRECT, libaio, and IO depth of 1.  The requests
>> are performed directly against the virtio-blk device (no filesystem)
>> which is backed by a 400GB NVMe card.
>>
>> QEMU_AIO_POLL_MAX_NS  IOPs
>> unset                 31,383
>> 1                     46,860
>> 2                     46,440
>> 4                     35,246
>> 8                     34,973
>> 16                    46,794
>> 32                    46,729
>> 64                    35,520
>> 128                   45,902
>
> The environment variable is in nanoseconds.  The range of values you
> tried is very small (all <1 usec).  It would be interesting to try
> larger values in the ballpark of the latencies you have traced.  For
> example 2000, 4000, 8000, 16000, and 32000 ns.

Agreed.  As I alluded to in another post, I decided to start at 1 and
double the values until I saw a difference with the expectation that it
would have to get quite large before that happened.  The results went
in a different direction, and then I got distracted by the variation at
certain points.  I figured that by itself the fact that noticeable
improvements were possible with such low values was interesting.

I will definitely continue the progression and capture some larger
values.

> Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
> much CPU overhead.
>
>> I found the results for 4, 8, and 64 odd so I re-ran some tests to
>> check for consistency.  I used values of 2 and 4 and ran each 5
>> times.  Here is what I got:
>>
>> Iteration  QEMU_AIO_POLL_MAX_NS=2  QEMU_AIO_POLL_MAX_NS=4
>> 1          46,972                  35,434
>> 2          46,939                  35,719
>> 3          47,005                  35,584
>> 4          47,016                  35,615
>> 5          47,267                  35,474
>>
>> So the results seem consistent.
>
> That is interesting.  I don't have an explanation for the consistent
> difference between 2 and 4 ns polling time.  The time difference is
> so small yet the IOPS difference is clear.
>
> Comparing traces could shed light on the cause for this difference.
>
>> I saw some discussion on the patches which makes me think you'll be
>> making some changes, is that right?  If so, I may wait for the
>> updates and then we can run the much more exhaustive set of
>> workloads (sequential read and write, random read and write) at
>> various block sizes (4, 8, 16, 32, 64, 128, and 256) and multiple IO
>> depths (1 and 32) that we were doing when we started looking at
>> this.
>
> I'll send an updated version of the patches.
>
> Stefan

--
Karl Rister
Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
On 14/11/2016 16:26, Stefan Hajnoczi wrote:
> On Fri, Nov 11, 2016 at 01:59:25PM -0600, Karl Rister wrote:
>> QEMU_AIO_POLL_MAX_NS  IOPs
>> unset                 31,383
>> 1                     46,860
>> 2                     46,440
>> 4                     35,246
>> 8                     34,973
>> 16                    46,794
>> 32                    46,729
>> 64                    35,520
>> 128                   45,902
>
> The environment variable is in nanoseconds.  The range of values you
> tried is very small (all <1 usec).  It would be interesting to try
> larger values in the ballpark of the latencies you have traced.  For
> example 2000, 4000, 8000, 16000, and 32000 ns.
>
> Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
> much CPU overhead.

That basically means "avoid a syscall if you already know there's
something to do", so in retrospect it's not that surprising.  Still
interesting though, and it means that the feature is useful even if you
don't have CPU to waste.

Paolo
Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
On Fri, Nov 11, 2016 at 01:59:25PM -0600, Karl Rister wrote:
> On 11/09/2016 11:13 AM, Stefan Hajnoczi wrote:
>> Recent performance investigation work done by Karl Rister shows that
>> the guest->host notification takes around 20 us.  This is more than
>> the "overhead" of QEMU itself (e.g. block layer).
>>
>> [...]
>
> Stefan
>
> I ran some quick tests with your patches and got some pretty good
> gains, but also some seemingly odd behavior.
>
> These results are for a 5 minute test doing sequential 4KB requests
> from fio using O_DIRECT, libaio, and IO depth of 1.  The requests are
> performed directly against the virtio-blk device (no filesystem)
> which is backed by a 400GB NVMe card.
>
> QEMU_AIO_POLL_MAX_NS  IOPs
> unset                 31,383
> 1                     46,860
> 2                     46,440
> 4                     35,246
> 8                     34,973
> 16                    46,794
> 32                    46,729
> 64                    35,520
> 128                   45,902

The environment variable is in nanoseconds.  The range of values you
tried is very small (all <1 usec).  It would be interesting to try
larger values in the ballpark of the latencies you have traced.  For
example 2000, 4000, 8000, 16000, and 32000 ns.

Very interesting that QEMU_AIO_POLL_MAX_NS=1 performs so well without
much CPU overhead.

> I found the results for 4, 8, and 64 odd so I re-ran some tests to
> check for consistency.  I used values of 2 and 4 and ran each 5
> times.  Here is what I got:
>
> Iteration  QEMU_AIO_POLL_MAX_NS=2  QEMU_AIO_POLL_MAX_NS=4
> 1          46,972                  35,434
> 2          46,939                  35,719
> 3          47,005                  35,584
> 4          47,016                  35,615
> 5          47,267                  35,474
>
> So the results seem consistent.

That is interesting.  I don't have an explanation for the consistent
difference between 2 and 4 ns polling time.  The time difference is so
small yet the IOPS difference is clear.

Comparing traces could shed light on the cause for this difference.

> I saw some discussion on the patches which makes me think you'll be
> making some changes, is that right?  If so, I may wait for the
> updates and then we can run the much more exhaustive set of workloads
> (sequential read and write, random read and write) at various block
> sizes (4, 8, 16, 32, 64, 128, and 256) and multiple IO depths (1 and
> 32) that we were doing when we started looking at this.

I'll send an updated version of the patches.

Stefan
Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
On 11/09/2016 06:13 PM, Stefan Hajnoczi wrote:
> Recent performance investigation work done by Karl Rister shows that
> the guest->host notification takes around 20 us.  This is more than
> the "overhead" of QEMU itself (e.g. block layer).
>
> One way to avoid the costly exit is to use polling instead of
> notification.  The main drawback of polling is that it consumes CPU
> resources.  In order to benefit performance the host must have extra
> CPU cycles available on physical CPUs that aren't used by the guest.
>
> [...]
>
> Stefan Hajnoczi (3):
>   aio-posix: add aio_set_poll_handler()
>   virtio: poll virtqueues for new buffers
>   linux-aio: poll ring for completions
>
>  aio-posix.c         | 133 ++++++++++++++++++++
>  block/linux-aio.c   |  17 +++
>  hw/virtio/virtio.c  |  19 ++++
>  include/block/aio.h |  16 +++
>  4 files changed, 185 insertions(+)

Another observation: With more iothreads than host CPUs the performance
drops significantly.
Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
On 11/09/2016 06:13 PM, Stefan Hajnoczi wrote: > Recent performance investigation work done by Karl Rister shows that the > guest->host notification takes around 20 us. This is more than the "overhead" > of QEMU itself (e.g. block layer). > > One way to avoid the costly exit is to use polling instead of notification. > The main drawback of polling is that it consumes CPU resources. In order to > benefit performance the host must have extra CPU cycles available on physical > CPUs that aren't used by the guest. > > This is an experimental AioContext polling implementation. It adds a polling > callback into the event loop. Polling functions are implemented for > virtio-blk > virtqueue guest->host kick and Linux AIO completion. > > The QEMU_AIO_POLL_MAX_NS environment variable sets the number of nanoseconds > to > poll before entering the usual blocking poll(2) syscall. Try setting this > variable to the time from old request completion to new virtqueue kick. > > By default no polling is done. The QEMU_AIO_POLL_MAX_NS must be set to get > any > polling! > > Karl: I hope you can try this patch series with several QEMU_AIO_POLL_MAX_NS > values. If you don't find a good value we should double-check the tracing > data > to see if this experimental code can be improved. > > Stefan Hajnoczi (3): > aio-posix: add aio_set_poll_handler() > virtio: poll virtqueues for new buffers > linux-aio: poll ring for completions > > aio-posix.c | 133 > > block/linux-aio.c | 17 +++ > hw/virtio/virtio.c | 19 > include/block/aio.h | 16 +++ > 4 files changed, 185 insertions(+) Hmm, I see all affected threads using more CPU power, but the performance numbers are somewhat inconclusive on s390. I have no proper test setup (only a shared LPAR), but all numbers are in the same ballpark of 3-5Gbyte/sec for 5 disks for 4k random reads with iodepth=8. What I find interesting is that the guest still does a huge amount of exits for the guest->host notifications. 
I think if we could combine this with some notification suppression, then
things could be even more interesting.

Christian
Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
On 11/14/2016 07:53 AM, Fam Zheng wrote:
> On Fri, 11/11 13:59, Karl Rister wrote:
>>
>> Stefan
>>
>> I ran some quick tests with your patches and got some pretty good gains,
>> but also some seemingly odd behavior.
>>
>> These results are for a 5 minute test doing sequential 4KB requests from
>> fio using O_DIRECT, libaio, and IO depth of 1. The requests are
>> performed directly against the virtio-blk device (no filesystem) which
>> is backed by a 400GB NVMe card.
>>
>> QEMU_AIO_POLL_MAX_NS      IOPs
>>                unset    31,383
>>                    1    46,860
>>                    2    46,440
>>                    4    35,246
>>                    8    34,973
>>                   16    46,794
>>                   32    46,729
>>                   64    35,520
>>                  128    45,902
>
> For sequential read with ioq=1, each request takes >20000ns at 45,000
> IOPs. Isn't a poll time of 128ns a mismatching order of magnitude? Have
> you tried larger values? Not criticizing, just trying to understand how
> it works.

Not yet, I was just trying to get something out as quickly as I could
(while juggling this with some other stuff...). Frankly I was a bit
surprised that the low values made such an impact and then got distracted
by the behavior of 4, 8, and 64.

> Also, do you happen to have numbers for unpatched QEMU (just to confirm
> that the "unset" case doesn't cause a regression) and bare metal for
> comparison?

I didn't run this exact test on the same qemu.git master changeset
unpatched. I did, however, previously try it against the v2.7.0 tag and
got somewhere around 27.5K IOPs. My original intention was to apply the
patches to v2.7.0, but it wouldn't build. We have done a lot of testing
and tracing on the qemu-rhev package and 27K IOPs is about what we see
there (with tracing disabled).

Given the patch discussions I saw, I was mainly trying to get a sniff test
out and then do a more complete workup with whatever updates are made.

I should probably note that there are a lot of pinning optimizations made
here to assist in our tracing efforts, which also result in improved
performance.
Ultimately, in a proper evaluation of these patches most of that will be
removed, so the behavior may change somewhat.

-- 
Karl Rister
Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
On Fri, 11/11 13:59, Karl Rister wrote:
>
> Stefan
>
> I ran some quick tests with your patches and got some pretty good gains,
> but also some seemingly odd behavior.
>
> These results are for a 5 minute test doing sequential 4KB requests from
> fio using O_DIRECT, libaio, and IO depth of 1. The requests are
> performed directly against the virtio-blk device (no filesystem) which
> is backed by a 400GB NVMe card.
>
> QEMU_AIO_POLL_MAX_NS      IOPs
>                unset    31,383
>                    1    46,860
>                    2    46,440
>                    4    35,246
>                    8    34,973
>                   16    46,794
>                   32    46,729
>                   64    35,520
>                  128    45,902

For sequential read with ioq=1, each request takes >20000ns at 45,000
IOPs. Isn't a poll time of 128ns a mismatching order of magnitude? Have
you tried larger values? Not criticizing, just trying to understand how it
works.

Also, do you happen to have numbers for unpatched QEMU (just to confirm
that the "unset" case doesn't cause a regression) and bare metal for
comparison?

Fam
Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
Hi,

Your series failed automatic build test. Please find the testing commands
and their output below. If you have docker installed, you can probably
reproduce it locally.

Type: series
Subject: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
Message-id: 1478711602-12620-1-git-send-email-stefa...@redhat.com

=== TEST SCRIPT BEGIN ===
#!/bin/bash
set -e
git submodule update --init dtc
# Let docker tests dump environment info
export SHOW_ENV=1
export J=16
make docker-test-quick@centos6
make docker-test-mingw@fedora
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
40e0074 linux-aio: poll ring for completions
c7e69fe virtio: poll virtqueues for new buffers
3129add aio-posix: add aio_set_poll_handler()

=== OUTPUT BEGIN ===
Submodule 'dtc' (git://git.qemu-project.org/dtc.git) registered for path 'dtc'
Cloning into 'dtc'...
Submodule path 'dtc': checked out '65cc4d2748a2c2e6f27f1cf39e07a5dbabd80ebf'
  BUILD   centos6
make[1]: Entering directory `/var/tmp/patchew-tester-tmp-c68h00s0/src'
  ARCHIVE qemu.tgz
  ARCHIVE dtc.tgz
  COPY    RUNNER
    RUN test-quick in qemu:centos6
Packages installed:
SDL-devel-1.2.14-7.el6_7.1.x86_64
ccache-3.1.6-2.el6.x86_64
epel-release-6-8.noarch
gcc-4.4.7-17.el6.x86_64
git-1.7.1-4.el6_7.1.x86_64
glib2-devel-2.28.8-5.el6.x86_64
libfdt-devel-1.4.0-1.el6.x86_64
make-3.81-23.el6.x86_64
package g++ is not installed
pixman-devel-0.32.8-1.el6.x86_64
tar-1.23-15.el6_8.x86_64
zlib-devel-1.2.3-29.el6.x86_64

Environment variables:
PACKAGES=libfdt-devel ccache tar git make gcc g++ zlib-devel glib2-devel SDL-devel pixman-devel epel-release
HOSTNAME=50ce3cd670bd
TERM=xterm
MAKEFLAGS= -j16
HISTSIZE=1000
J=16
USER=root
CCACHE_DIR=/var/tmp/ccache
EXTRA_CONFIGURE_OPTS=
V=
SHOW_ENV=1
MAIL=/var/spool/mail/root
PATH=/usr/lib/ccache:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/
LANG=en_US.UTF-8
TARGET_LIST=
HISTCONTROL=ignoredups
SHLVL=1
HOME=/root
TEST_DIR=/tmp/qemu-test
Re: [Qemu-devel] [RFC 0/3] aio: experimental virtio-blk polling mode
On 11/09/2016 11:13 AM, Stefan Hajnoczi wrote:
> Recent performance investigation work done by Karl Rister shows that the
> guest->host notification takes around 20 us. This is more than the
> "overhead" of QEMU itself (e.g. block layer).
>
> One way to avoid the costly exit is to use polling instead of
> notification. The main drawback of polling is that it consumes CPU
> resources. In order to benefit performance the host must have extra CPU
> cycles available on physical CPUs that aren't used by the guest.
>
> This is an experimental AioContext polling implementation. It adds a
> polling callback into the event loop. Polling functions are implemented
> for virtio-blk virtqueue guest->host kick and Linux AIO completion.
>
> The QEMU_AIO_POLL_MAX_NS environment variable sets the number of
> nanoseconds to poll before entering the usual blocking poll(2) syscall.
> Try setting this variable to the time from old request completion to new
> virtqueue kick.
>
> By default no polling is done. QEMU_AIO_POLL_MAX_NS must be set to get
> any polling!
>
> Karl: I hope you can try this patch series with several
> QEMU_AIO_POLL_MAX_NS values. If you don't find a good value we should
> double-check the tracing data to see if this experimental code can be
> improved.

Stefan

I ran some quick tests with your patches and got some pretty good gains,
but also some seemingly odd behavior.

These results are for a 5 minute test doing sequential 4KB requests from
fio using O_DIRECT, libaio, and IO depth of 1. The requests are performed
directly against the virtio-blk device (no filesystem) which is backed by
a 400GB NVMe card.

QEMU_AIO_POLL_MAX_NS      IOPs
               unset    31,383
                   1    46,860
                   2    46,440
                   4    35,246
                   8    34,973
                  16    46,794
                  32    46,729
                  64    35,520
                 128    45,902

I found the results for 4, 8, and 64 odd so I re-ran some tests to check
for consistency.
Here is what I got:

Iteration  QEMU_AIO_POLL_MAX_NS=2  QEMU_AIO_POLL_MAX_NS=4
        1                  46,972                  35,434
        2                  46,939                  35,719
        3                  47,005                  35,584
        4                  47,016                  35,615
        5                  47,267                  35,474

So the results seem consistent.

I saw some discussion on the patches which makes me think you'll be making
some changes, is that right? If so, I may wait for the updates and then we
can run the much more exhaustive set of workloads (sequential read and
write, random read and write) at various block sizes (4, 8, 16, 32, 64,
128, and 256) and multiple IO depths (1 and 32) that we were doing when we
started looking at this.

Karl

>
> Stefan Hajnoczi (3):
>   aio-posix: add aio_set_poll_handler()
>   virtio: poll virtqueues for new buffers
>   linux-aio: poll ring for completions
>
>  aio-posix.c         | 133
>  block/linux-aio.c   |  17 +++
>  hw/virtio/virtio.c  |  19
>  include/block/aio.h |  16 +++
>  4 files changed, 185 insertions(+)
>

-- 
Karl Rister