Re: [PATCH v2 0/5] Multiqueue virtio-scsi, and API for piecewise buffer submission

2012-12-23 Thread Wanlong Gao
On 12/18/2012 09:42 PM, Michael S. Tsirkin wrote:
 On Tue, Dec 18, 2012 at 01:32:47PM +0100, Paolo Bonzini wrote:
 Hi all,

 this series adds multiqueue support to the virtio-scsi driver, based
 on Jason Wang's work on virtio-net.  It uses a simple queue steering
 algorithm that expects one queue per CPU.  LUNs in the same target always
 use the same queue (so that commands are not reordered); queue switching
 occurs when the request being queued is the only one for the target.
 Also based on Jason's patches, the virtqueue affinity is set so that
 each CPU is associated to one virtqueue.
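
 To make the steering rule above concrete, here is a minimal sketch of
 the per-target queue selection and the affinity setup.  The struct and
 field names (tgt->reqs, tgt->req_vq, tgt->tgt_lock, vscsi->req_vqs,
 vscsi->num_queues) are placeholders for illustration only, not
 necessarily the identifiers used in patches 4 and 5:

	/*
	 * Sketch only -- names are illustrative; the real code is in
	 * patches 4 and 5.
	 */
	static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
						       struct virtio_scsi_target *tgt)
	{
		struct virtio_scsi_vq *vq;
		unsigned long flags;
		unsigned int queue;

		spin_lock_irqsave(&tgt->tgt_lock, flags);

		/*
		 * Switch queues only when this request is the only one
		 * outstanding for the target; otherwise keep the current
		 * queue, so commands to the same target are never
		 * reordered across virtqueues.
		 */
		if (tgt->reqs == 0) {
			queue = smp_processor_id() % vscsi->num_queues;
			tgt->req_vq = &vscsi->req_vqs[queue];
		}
		tgt->reqs++;
		vq = tgt->req_vq;

		spin_unlock_irqrestore(&tgt->tgt_lock, flags);
		return vq;
	}

	/*
	 * Affinity in the same spirit: bind each request virtqueue to one
	 * CPU, so completions are handled on the CPU that submitted the
	 * request (again only a sketch of the idea described above).
	 */
	static void virtscsi_set_affinity(struct virtio_scsi *vscsi)
	{
		int i;

		for (i = 0; i < vscsi->num_queues; i++)
			virtqueue_set_affinity(vscsi->req_vqs[i].vq, i);
	}

 Keeping all LUNs of a target on one virtqueue is what guarantees
 ordering; switching only while the target is otherwise idle is what
 lets an idle target migrate to the submitting CPU's queue.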

 I tested the patches with fio, using up to 32 virtio-scsi disks backed
 by tmpfs on the host.  These numbers are with 1 LUN per target.

 FIO configuration
 -----------------
 [global]
 rw=read
 bsrange=4k-64k
 ioengine=libaio
 direct=1
 iodepth=4
 loops=20

 overall bandwidth (MB/s)
 ------------------------

 # of targets   single-queue   multi-queue, 4 VCPUs   multi-queue, 8 VCPUs
            1            540                    626                    599
            2            795                    965                    925
            4            997                   1376                   1500
            8           1136                   2130                   2060
           16           1440                   2269                   2474
           24           1408                   2179                   2436
           32           1515                   1978                   2319

 (These numbers for single-queue are with 4 VCPUs, but the impact of adding
 more VCPUs is very limited).

 avg bandwidth per LUN (MB/s)
 ----------------------------

 # of targets   single-queue   multi-queue, 4 VCPUs   multi-queue, 8 VCPUs
            1            540                    626                    599
            2            397                    482                    462
            4            249                    344                    375
            8            142                    266                    257
           16             90                    141                    154
           24             58                     90                    101
           32             47                     61                     72
 
 
 Could you please try and measure host CPU utilization?

I measured and didn't see any CPU utilization regression here.

 Without this data it is possible that your host is undersubscribed
 and you are simply consuming more host CPU.
 
 Another thing to note is that at the moment you might need to test
 with idle=poll on the host; otherwise there is a strange interaction
 with power management, where reducing the overhead lets the CPU drop
 to a lower power state and so gives you worse IOPS.

Yeah, I measured with idle=poll on the host CPU and saw that
performance improved by about 68%.

Thanks,
Wanlong Gao

 
 
 Patch 1 adds new functions for piecewise addition of buffers to a
 virtqueue, which enables various simplifications in virtio-scsi
 (patches 2-3) and a small performance improvement of 2-6%.  Patches 4
 and 5 add multiqueuing.
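
 As a rough illustration of what "piecewise" means here: a driver could
 reserve descriptors for a request once, then append each scatterlist
 piece separately, instead of first merging everything into one large
 scatterlist.  The function names, signatures, and the bool
 direction flag below are placeholders for illustration; the
 authoritative interface is the one defined in patch 1:

	/*
	 * Sketch of piecewise submission.  virtqueue_start_buf(),
	 * virtqueue_add_sg() and virtqueue_end_buf() are used here with
	 * assumed signatures; see patch 1 ("virtio: add functions for
	 * piecewise addition of buffers") for the real API.
	 */
	static int submit_piecewise(struct virtqueue *vq, void *token,
				    struct scatterlist *hdr, unsigned int hdr_nents,
				    struct scatterlist *data, unsigned int data_nents,
				    struct scatterlist *resp, unsigned int resp_nents)
	{
		int err;

		/* Reserve room for the whole request up front. */
		err = virtqueue_start_buf(vq, token,
					  hdr_nents + data_nents + resp_nents,
					  data_nents ? 3 : 2, GFP_ATOMIC);
		if (err)
			return err;

		/* Add each piece separately; no temporary merged sg list. */
		virtqueue_add_sg(vq, hdr, hdr_nents, true);	/* driver -> device */
		if (data_nents)
			virtqueue_add_sg(vq, data, data_nents, true);
		virtqueue_add_sg(vq, resp, resp_nents, false);	/* device -> driver */

		/* Expose the buffer to the device. */
		virtqueue_end_buf(vq);
		return 0;
	}

 The attraction for virtio-scsi is that the request header, the data
 payload and the response can then be queued from separate scatterlists
 without first being copied into one oversized preallocated list.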

 I'm mostly looking for comments on the new API of patch 1 for inclusion
 into the 3.9 kernel.

 Thanks to Wanlong Gao for help rebasing and benchmarking these patches.

 Paolo Bonzini (5):
   virtio: add functions for piecewise addition of buffers
   virtio-scsi: use functions for piecewise composition of buffers
   virtio-scsi: redo allocation of target data
   virtio-scsi: pass struct virtio_scsi to virtqueue completion function
   virtio-scsi: introduce multiqueue support

  drivers/scsi/virtio_scsi.c   |  374 +-
  drivers/virtio/virtio_ring.c |  205
  include/linux/virtio.h       |   21 +++
  3 files changed, 485 insertions(+), 115 deletions(-)
 



Re: [PATCH v2 0/5] Multiqueue virtio-scsi, and API for piecewise buffer submission

2012-12-19 Thread Paolo Bonzini
On 18/12/2012 23:18, Rolf Eike Beer wrote:
 Paolo Bonzini wrote:
 Hi all,

 this series adds multiqueue support to the virtio-scsi driver, based
 on Jason Wang's work on virtio-net.  It uses a simple queue steering
 algorithm that expects one queue per CPU.  LUNs in the same target always
 use the same queue (so that commands are not reordered); queue switching
 occurs when the request being queued is the only one for the target.
 Also based on Jason's patches, the virtqueue affinity is set so that
 each CPU is associated to one virtqueue.

 I tested the patches with fio, using up to 32 virtio-scsi disks backed
 by tmpfs on the host.  These numbers are with 1 LUN per target.

 FIO configuration
 -----------------
 [global]
 rw=read
 bsrange=4k-64k
 ioengine=libaio
 direct=1
 iodepth=4
 loops=20

 overall bandwidth (MB/s)
 ------------------------

 # of targets   single-queue   multi-queue, 4 VCPUs   multi-queue, 8 VCPUs
            1            540                    626                    599
            2            795                    965                    925
            4            997                   1376                   1500
            8           1136                   2130                   2060
           16           1440                   2269                   2474
           24           1408                   2179                   2436
           32           1515                   1978                   2319

 (These numbers for single-queue are with 4 VCPUs, but the impact of adding
 more VCPUs is very limited).

 avg bandwidth per LUN (MB/s)
 ----------------------------

 # of targets   single-queue   multi-queue, 4 VCPUs   multi-queue, 8 VCPUs
            1            540                    626                    599
            2            397                    482                    462
            4            249                    344                    375
            8            142                    266                    257
           16             90                    141                    154
           24             58                     90                    101
           32             47                     61                     72
 
 Is there an explanation why 8x8 is slower than 4x8 in both cases?

Regarding the "in both cases" part, it's because the second table has
the same data as the first, just divided by the number of targets in
the first column (e.g., for 8 targets: 1136 / 8 = 142), so any anomaly
in the first table shows up in the second as well.

In general, the anomalies you find are probably within statistical
noise, or due to other effects such as host CPU utilization or
contention on the big QEMU lock.

Paolo


 8x1 and 8x2
 being slower than 4x1 and 4x2 is more or less expected, but 8x8 loses against 
 4x8 while 8x4 wins against 4x4 and 8x16 against 4x16.
 
 Eike
 



Re: [PATCH v2 0/5] Multiqueue virtio-scsi, and API for piecewise buffer submission

2012-12-19 Thread Michael S. Tsirkin
On Wed, Dec 19, 2012 at 09:52:59AM +0100, Paolo Bonzini wrote:
 On 18/12/2012 23:18, Rolf Eike Beer wrote:
  Paolo Bonzini wrote:
  Hi all,
 
  this series adds multiqueue support to the virtio-scsi driver, based
  on Jason Wang's work on virtio-net.  It uses a simple queue steering
  algorithm that expects one queue per CPU.  LUNs in the same target always
  use the same queue (so that commands are not reordered); queue switching
  occurs when the request being queued is the only one for the target.
  Also based on Jason's patches, the virtqueue affinity is set so that
  each CPU is associated to one virtqueue.
 
  I tested the patches with fio, using up to 32 virtio-scsi disks backed
  by tmpfs on the host.  These numbers are with 1 LUN per target.
 
  FIO configuration
  -----------------
  [global]
  rw=read
  bsrange=4k-64k
  ioengine=libaio
  direct=1
  iodepth=4
  loops=20
 
  overall bandwidth (MB/s)
  ------------------------

  # of targets   single-queue   multi-queue, 4 VCPUs   multi-queue, 8 VCPUs
             1            540                    626                    599
             2            795                    965                    925
             4            997                   1376                   1500
             8           1136                   2130                   2060
            16           1440                   2269                   2474
            24           1408                   2179                   2436
            32           1515                   1978                   2319
 
  (These numbers for single-queue are with 4 VCPUs, but the impact of adding
  more VCPUs is very limited).
 
  avg bandwidth per LUN (MB/s)
  ----------------------------

  # of targets   single-queue   multi-queue, 4 VCPUs   multi-queue, 8 VCPUs
             1            540                    626                    599
             2            397                    482                    462
             4            249                    344                    375
             8            142                    266                    257
            16             90                    141                    154
            24             58                     90                    101
            32             47                     61                     72
  
  Is there an explanation why 8x8 is slower than 4x8 in both cases?
 
 Regarding the "in both cases" part, it's because the second table has
 the same data as the first, just divided by the number of targets in
 the first column (e.g., for 8 targets: 1136 / 8 = 142), so any anomaly
 in the first table shows up in the second as well.

 In general, the anomalies you find are probably within statistical
 noise, or due to other effects such as host CPU utilization or
 contention on the big QEMU lock.
 
 Paolo
 

That's exactly what bothers me. If the IOPS per unit of host CPU
goes down, then the win on a lightly loaded host will become a
regression on a loaded host.

Need to measure that.

  8x1 and 8x2
  being slower than 4x1 and 4x2 is more or less expected, but 8x8 loses 
  against 
  4x8 while 8x4 wins against 4x4 and 8x16 against 4x16.
  
  Eike
  


Re: [PATCH v2 0/5] Multiqueue virtio-scsi, and API for piecewise buffer submission

2012-12-18 Thread Michael S. Tsirkin
On Tue, Dec 18, 2012 at 01:32:47PM +0100, Paolo Bonzini wrote:
 Hi all,
 
 this series adds multiqueue support to the virtio-scsi driver, based
 on Jason Wang's work on virtio-net.  It uses a simple queue steering
 algorithm that expects one queue per CPU.  LUNs in the same target always
 use the same queue (so that commands are not reordered); queue switching
 occurs when the request being queued is the only one for the target.
 Also based on Jason's patches, the virtqueue affinity is set so that
 each CPU is associated to one virtqueue.
 
 I tested the patches with fio, using up to 32 virtio-scsi disks backed
 by tmpfs on the host.  These numbers are with 1 LUN per target.
 
 FIO configuration
 -----------------
 [global]
 rw=read
 bsrange=4k-64k
 ioengine=libaio
 direct=1
 iodepth=4
 loops=20
 
 overall bandwidth (MB/s)
 ------------------------

 # of targets   single-queue   multi-queue, 4 VCPUs   multi-queue, 8 VCPUs
            1            540                    626                    599
            2            795                    965                    925
            4            997                   1376                   1500
            8           1136                   2130                   2060
           16           1440                   2269                   2474
           24           1408                   2179                   2436
           32           1515                   1978                   2319
 
 (These numbers for single-queue are with 4 VCPUs, but the impact of adding
 more VCPUs is very limited).
 
 avg bandwidth per LUN (MB/s)
 ----------------------------

 # of targets   single-queue   multi-queue, 4 VCPUs   multi-queue, 8 VCPUs
            1            540                    626                    599
            2            397                    482                    462
            4            249                    344                    375
            8            142                    266                    257
           16             90                    141                    154
           24             58                     90                    101
           32             47                     61                     72


Could you please try and measure host CPU utilization?
Without this data it is possible that your host is undersubscribed
and you are simply consuming more host CPU.

Another thing to note is that at the moment you might need to test
with idle=poll on the host; otherwise there is a strange interaction
with power management, where reducing the overhead lets the CPU drop
to a lower power state and so gives you worse IOPS.


 Patch 1 adds new functions for piecewise addition of buffers to a
 virtqueue, which enables various simplifications in virtio-scsi
 (patches 2-3) and a small performance improvement of 2-6%.  Patches 4
 and 5 add multiqueuing.
 
 I'm mostly looking for comments on the new API of patch 1 for inclusion
 into the 3.9 kernel.
 
 Thanks to Wanlong Gao for help rebasing and benchmarking these patches.
 
 Paolo Bonzini (5):
   virtio: add functions for piecewise addition of buffers
   virtio-scsi: use functions for piecewise composition of buffers
   virtio-scsi: redo allocation of target data
   virtio-scsi: pass struct virtio_scsi to virtqueue completion function
   virtio-scsi: introduce multiqueue support
 
  drivers/scsi/virtio_scsi.c   |  374 +-
  drivers/virtio/virtio_ring.c |  205
  include/linux/virtio.h       |   21 +++
  3 files changed, 485 insertions(+), 115 deletions(-)