Re: [PATCH v2 0/5] Multiqueue virtio-scsi, and API for piecewise buffer submission
On 12/18/2012 09:42 PM, Michael S. Tsirkin wrote:
> On Tue, Dec 18, 2012 at 01:32:47PM +0100, Paolo Bonzini wrote:
>> Hi all,
>>
>> this series adds multiqueue support to the virtio-scsi driver, based
>> on Jason Wang's work on virtio-net.  It uses a simple queue steering
>> algorithm that expects one queue per CPU.  LUNs in the same target
>> always use the same queue (so that commands are not reordered); queue
>> switching occurs when the request being queued is the only one for the
>> target.  Also based on Jason's patches, the virtqueue affinity is set
>> so that each CPU is associated to one virtqueue.
>>
>> I tested the patches with fio, using up to 32 virtio-scsi disks backed
>> by tmpfs on the host.  These numbers are with 1 LUN per target.
>>
>> FIO configuration
>> -----------------
>> [global]
>> rw=read
>> bsrange=4k-64k
>> ioengine=libaio
>> direct=1
>> iodepth=4
>> loops=20
>>
>> overall bandwidth (MB/s)
>> ------------------------
>> # of targets   single-queue   multi-queue, 4 VCPUs   multi-queue, 8 VCPUs
>>        1            540              626                    599
>>        2            795              965                    925
>>        4            997             1376                   1500
>>        8           1136             2130                   2060
>>       16           1440             2269                   2474
>>       24           1408             2179                   2436
>>       32           1515             1978                   2319
>>
>> (These numbers for single-queue are with 4 VCPUs, but the impact of
>> adding more VCPUs is very limited.)
>>
>> avg bandwidth per LUN (MB/s)
>> ----------------------------
>> # of targets   single-queue   multi-queue, 4 VCPUs   multi-queue, 8 VCPUs
>>        1            540              626                    599
>>        2            397              482                    462
>>        4            249              344                    375
>>        8            142              266                    257
>>       16             90              141                    154
>>       24             58               90                    101
>>       32             47               61                     72
>
> Could you please try and measure host CPU utilization?

I measured and didn't see any CPU utilization regression here.

> Without this data it is possible that your host is undersubscribed and
> you are drinking up more host CPU.
>
> Another thing to note is that at the moment you might need to test with
> idle=poll on the host, otherwise there is a strange interaction with
> power management where reducing the overhead switches to a lower power
> state and gives you worse IOPS.

Yes, I measured with idle=poll on the host and saw that performance
improved by about 68%.

Thanks,
Wanlong Gao

>> Patch 1 adds a new API for piecewise addition of buffers, which enables
>> various simplifications in virtio-scsi (patches 2-3) and a small
>> performance improvement of 2-6%.  Patches 4 and 5 add multiqueuing.
>>
>> I'm mostly looking for comments on the new API of patch 1 for inclusion
>> into the 3.9 kernel.
>>
>> Thanks to Wanlong Gao for help rebasing and benchmarking these patches.
>>
>> Paolo Bonzini (5):
>>   virtio: add functions for piecewise addition of buffers
>>   virtio-scsi: use functions for piecewise composition of buffers
>>   virtio-scsi: redo allocation of target data
>>   virtio-scsi: pass struct virtio_scsi to virtqueue completion function
>>   virtio-scsi: introduce multiqueue support
>>
>>  drivers/scsi/virtio_scsi.c   | 374 +-
>>  drivers/virtio/virtio_ring.c | 205
>>  include/linux/virtio.h       |  21 +++
>>  3 files changed, 485 insertions(+), 115 deletions(-)
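For readers following the steering description quoted above, here is a
minimal sketch of the idea.  The struct layout, field names and locking
are hypothetical, made up for illustration rather than taken from the
patches: the point is only that a target keeps using its current queue
while it has requests in flight, and is re-steered to the submitting
CPU's queue only when the request being queued is the only one for that
target, so per-target command ordering is preserved.

/*
 * Hypothetical sketch of the queue steering rule described in the cover
 * letter.  virtio_scsi, req_vqs and num_queues are illustrative names;
 * the matching decrement of tgt->reqs on completion is omitted.
 */
struct virtio_scsi_target_state {
	spinlock_t lock;
	unsigned int reqs;		/* requests in flight for this target */
	struct virtqueue *req_vq;	/* queue currently used by this target */
};

static struct virtqueue *virtscsi_pick_vq_sketch(struct virtio_scsi *vscsi,
						 struct virtio_scsi_target_state *tgt)
{
	struct virtqueue *vq;
	unsigned long flags;
	u32 queue;

	spin_lock_irqsave(&tgt->lock, flags);
	if (tgt->reqs++ == 0) {
		/*
		 * No other request is outstanding for this target, so it is
		 * safe to switch the target to the queue of the submitting
		 * CPU without reordering commands.
		 */
		queue = smp_processor_id() % vscsi->num_queues;
		tgt->req_vq = vscsi->req_vqs[queue];
	}
	vq = tgt->req_vq;
	spin_unlock_irqrestore(&tgt->lock, flags);
	return vq;
}

Because all LUNs behind a target funnel through tgt->req_vq, different
targets can still be serviced by different virtqueues in parallel.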
Re: [PATCH v2 0/5] Multiqueue virtio-scsi, and API for piecewise buffer submission
On 18/12/2012 23:18, Rolf Eike Beer wrote:
> Paolo Bonzini wrote:
>> Hi all,
>>
>> this series adds multiqueue support to the virtio-scsi driver, based
>> on Jason Wang's work on virtio-net.
>>
>> [FIO configuration and benchmark tables snipped; quoted in full earlier
>> in the thread]
>
> Is there an explanation why 8x8 is slower than 4x8 in both cases?

Regarding the "in both cases" part, it's because the second table has the
same data as the first, but divided by the first column (for example, with
8 VCPUs and 8 targets, the 2060 MB/s overall figure split across 8 targets
gives roughly 257 MB/s per LUN).

In general, the strangenesses you find are probably within statistical
noise, or due to other effects such as host CPU utilization or contention
on the big QEMU lock.

Paolo

> 8x1 and 8x2 being slower than 4x1 and 4x2 is more or less expected, but
> 8x8 loses against 4x8, while 8x4 wins against 4x4 and 8x16 against 4x16.
>
> Eike
Re: [PATCH v2 0/5] Multiqueue virtio-scsi, and API for piecewise buffer submission
On Wed, Dec 19, 2012 at 09:52:59AM +0100, Paolo Bonzini wrote:
> On 18/12/2012 23:18, Rolf Eike Beer wrote:
>> Paolo Bonzini wrote:
>>> [cover letter, FIO configuration and benchmark tables snipped; quoted
>>> in full earlier in the thread]
>>
>> Is there an explanation why 8x8 is slower than 4x8 in both cases?
>
> Regarding the "in both cases" part, it's because the second table has
> the same data as the first, but divided by the first column.
>
> In general, the strangenesses you find are probably within statistical
> noise, or due to other effects such as host CPU utilization or
> contention on the big QEMU lock.
>
> Paolo

That's exactly what bothers me.  If the IOPS divided by host CPU goes
down, then the win on a lightly loaded host will become a regression on a
loaded host.  Need to measure that.

>> 8x1 and 8x2 being slower than 4x1 and 4x2 is more or less expected, but
>> 8x8 loses against 4x8, while 8x4 wins against 4x4 and 8x16 against 4x16.
>>
>> Eike
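As background to the affinity remark in the quoted cover letter (each CPU
associated to one virtqueue, following Jason Wang's virtio-net work), here
is a rough sketch of how request virtqueues could be spread across online
CPUs.  The vscsi fields are hypothetical, and virtqueue_set_affinity() is
shown in the integer-CPU form it had in kernels of that era; treat this as
an illustration under those assumptions, not as the patch code.

/*
 * Illustrative only: bind request virtqueue i to the i-th online CPU so
 * that each CPU is served by (at most) one virtqueue.  req_vqs[] and
 * num_queues are hypothetical names; CPU hotplug locking and error
 * handling are omitted for brevity.
 */
static void virtscsi_set_affinity_sketch(struct virtio_scsi *vscsi)
{
	int i = 0;
	int cpu;

	for_each_online_cpu(cpu) {
		if (i == vscsi->num_queues)
			break;
		virtqueue_set_affinity(vscsi->req_vqs[i++], cpu);
	}
}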
Re: [PATCH v2 0/5] Multiqueue virtio-scsi, and API for piecewise buffer submission
On Tue, Dec 18, 2012 at 01:32:47PM +0100, Paolo Bonzini wrote:
> Hi all,
>
> this series adds multiqueue support to the virtio-scsi driver, based
> on Jason Wang's work on virtio-net.  It uses a simple queue steering
> algorithm that expects one queue per CPU.  LUNs in the same target
> always use the same queue (so that commands are not reordered); queue
> switching occurs when the request being queued is the only one for the
> target.  Also based on Jason's patches, the virtqueue affinity is set
> so that each CPU is associated to one virtqueue.
>
> I tested the patches with fio, using up to 32 virtio-scsi disks backed
> by tmpfs on the host.  These numbers are with 1 LUN per target.
>
> [FIO configuration and benchmark tables snipped; quoted in full earlier
> in the thread]

Could you please try and measure host CPU utilization?  Without this data
it is possible that your host is undersubscribed and you are drinking up
more host CPU.

Another thing to note is that at the moment you might need to test with
idle=poll on the host, otherwise there is a strange interaction with power
management where reducing the overhead switches to a lower power state and
gives you worse IOPS.

> Patch 1 adds a new API for piecewise addition of buffers, which enables
> various simplifications in virtio-scsi (patches 2-3) and a small
> performance improvement of 2-6%.  Patches 4 and 5 add multiqueuing.
>
> I'm mostly looking for comments on the new API of patch 1 for inclusion
> into the 3.9 kernel.
>
> Thanks to Wanlong Gao for help rebasing and benchmarking these patches.
>
> Paolo Bonzini (5):
>   virtio: add functions for piecewise addition of buffers
>   virtio-scsi: use functions for piecewise composition of buffers
>   virtio-scsi: redo allocation of target data
>   virtio-scsi: pass struct virtio_scsi to virtqueue completion function
>   virtio-scsi: introduce multiqueue support
>
>  drivers/scsi/virtio_scsi.c   | 374 +-
>  drivers/virtio/virtio_ring.c | 205
>  include/linux/virtio.h       |  21 +++
>  3 files changed, 485 insertions(+), 115 deletions(-)
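Since the cover letter asks specifically for comments on the patch 1 API,
here is a caller's-eye sketch of what piecewise buffer submission means:
instead of handing the ring one pre-built scatterlist, the request, data
and response pieces are appended one at a time and the descriptor chain is
closed at the end.  The function names and signatures below
(virtqueue_start_buf, virtqueue_add_sg, virtqueue_end_buf) are assumptions
made for illustration; the real API is whatever patch 1 defines.

/*
 * Hypothetical caller-side sketch of piecewise buffer submission; the
 * virtqueue_start_buf/virtqueue_add_sg/virtqueue_end_buf names and
 * arguments are illustrative, not the actual patch 1 API.  This example
 * assumes a write-style command (data flows to the device).
 */
static int virtscsi_kick_cmd_sketch(struct virtqueue *vq,
				    struct scatterlist *req_sg, unsigned int req_nents,
				    struct scatterlist *data_sg, unsigned int data_nents,
				    struct scatterlist *resp_sg, unsigned int resp_nents,
				    void *cookie)
{
	unsigned int pieces = 2 + (data_nents ? 1 : 0);
	int err;

	/* Reserve room for all pieces of the chain up front. */
	err = virtqueue_start_buf(vq, cookie, pieces, GFP_ATOMIC);
	if (err)
		return err;

	/* Driver-to-device pieces: SCSI request header, then write data. */
	virtqueue_add_sg(vq, req_sg, req_nents, DMA_TO_DEVICE);
	if (data_nents)
		virtqueue_add_sg(vq, data_sg, data_nents, DMA_TO_DEVICE);

	/* Device-to-driver piece: the response/sense buffer. */
	virtqueue_add_sg(vq, resp_sg, resp_nents, DMA_FROM_DEVICE);

	/* Close the chain and make it visible to the device. */
	virtqueue_end_buf(vq);
	return 0;
}

Presumably the simplifications in patches 2-3 and the 2-6% improvement
come from being able to describe a command this way without first
assembling a single contiguous scatterlist for each submission.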