Re: [Ksummit-2012-discuss] SCSI Performance regression [was Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6]

2012-07-16 Thread Christoph Lameter
On Fri, 6 Jul 2012, James Bottomley wrote:

 What people might pay attention to is evidence that there's a problem in
 3.5-rc6 (without any OFED crap).  If you're not going to bother
 investigating, it has to be in an environment they can reproduce (so
 ordinary hardware, not infiniband) otherwise it gets ignored as an
 esoteric hardware issue.

The OFED stuff is in the meantime part of 3.5-rc6. Infiniband has been
supported for a long time, and it's a very important technology given the
problematic nature of Ethernet at high network speeds.

OFED crap exists for those running RHEL5/6. The new enterprise distros are
based on the 3.2 kernel which has pretty good Infiniband support
out of the box.



Re: SCSI Performance regression [was Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6]

2012-07-06 Thread Nicholas A. Bellinger
On Fri, 2012-07-06 at 09:43 +0400, James Bottomley wrote:
 On Thu, 2012-07-05 at 20:01 -0700, Nicholas A. Bellinger wrote:
 
  So I'm pretty sure this discrepancy can be attributed to the small block
  random I/O bottleneck currently present for all Linux/SCSI core LLDs,
  regardless of physical or virtual storage fabric.
  
  The SCSI-wide host-lock-less conversion that happened in the .38 code back
  in 2010, and subsequently having LLDs like virtio-scsi converted to run in
  host-lock-less mode, have helped to some extent..  But it's still not
  enough..
  
  Another example where we've been able to prove this bottleneck recently
  is with the following target setup:
  
  *) Intel Romley production machines with 128 GB of DDR-3 memory
  *) 4x FusionIO ioDrive 2 (1.5 TB @ PCI-e Gen2 x2)
  *) Mellanox PCI-Express Gen3 HCA running at 56 Gb/sec
  *) Infiniband SRP Target backported to RHEL 6.2 + latest OFED
  
  In this setup, using ib_srpt + IBLOCK w/ emulate_write_cache=1 + an
  iomemory_vsl export, we end up avoiding the SCSI core bottleneck on the
  target machine, just as with the tcm_vhost example here for host kernel
  side processing with vhost.
  
  Using the Linux IB SRP initiator + Windows Server 2008 R2 SCSI-miniport SRP
  (OFED) initiator connected to four ib_srpt LUNs, we've observed that
  MSFT SCSI is currently outperforming RHEL 6.2 on the order of ~285K vs.
  ~215K IOPS with heavy random 4k WRITE iometer / fio tests.  Note this is
  with an optimized queue_depth ib_srp client w/ the noop I/O scheduler, but
  it is still lacking the host_lock-less patches on RHEL 6.2 OFED..
  
  This bottleneck has been mentioned by various people (including myself)
  on linux-scsi over the last 18 months, and I've proposed that it be
  discussed at KS-2012 so we can start making some forward progress:
 
 Well, no, it hasn't.  You randomly drop things like this into unrelated
 email (I suppose that is a mention in strict English construction) but
 it's not really enough to get anyone to pay attention since they mostly
 stopped reading at the top, if they got that far: most people just go by
 subject when wading through threads initially.
 

It most certainly has been made clear to me, numerous times and from many
people in the Linux/SCSI community, that there is a bottleneck for small
block random I/O in SCSI core vs. raw Linux/Block, as well as vs.
non-Linux-based SCSI subsystems.

My apologies if mentioning this issue to you privately at LC 2011 last
year did not come across as seriously as intended, or if proposing a
topic for LSF-2012 this year was not a clear enough indication of a
problem with SCSI small block random I/O performance.

 But even if anyone noticed, a statement that RHEL6.2 (on a 2.6.32
 kernel, which is now nearly three years old) is 25% slower than W2k8R2
 on infiniband isn't really going to get anyone excited either
 (particularly when you mention OFED, which usually means a stack
 replacement on Linux anyway).
 

The specific issue was first raised for .38, where we were able to get
most of the interesting high-performance LLDs converted to using
internal locking methods so that host_lock did not have to be obtained
during each ->queuecommand() I/O dispatch, right..?

This has helped a good deal for large multi-lun scsi_host configs that
are now running in host-lock-less mode, but there is still a large
discrepancy between single LUN and raw struct block_device access, even
with LLD host_lock-less mode enabled.
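
To make it concrete, here is roughly what that conversion looks like from
an LLD's point of view.  This is only a sketch (none of the my_* names
come from an in-tree driver), but it shows the post-.37 ->queuecommand()
being entered without host_lock and using a driver-private lock instead,
with the old behaviour only kept around via the DEF_SCSI_QCMD() wrapper:

/* Illustration only -- not from an in-tree driver; my_* names are made up. */
#include <linux/spinlock.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

struct my_adapter {
        spinlock_t ring_lock;           /* driver-private lock, not host_lock */
        /* ... hardware queue state ... */
};

/* New-style entry point: called by SCSI core *without* host_lock held. */
static int my_queuecommand(struct Scsi_Host *shost, struct scsi_cmnd *cmd)
{
        struct my_adapter *adap = shost_priv(shost);
        unsigned long flags;

        spin_lock_irqsave(&adap->ring_lock, flags);
        /* post cmd to the hardware queue; the completion path later
         * invokes cmd->scsi_done(cmd) from the IRQ/softirq handler */
        spin_unlock_irqrestore(&adap->ring_lock, flags);
        return 0;
}

/*
 * A legacy driver instead keeps the old prototype and wraps it with
 * DEF_SCSI_QCMD(), which takes host_lock around every single dispatch --
 * exactly the serialization the .38 conversion lets us drop:
 *
 *      static int my_queuecommand_lck(struct scsi_cmnd *cmd,
 *                                     void (*done)(struct scsi_cmnd *));
 *      static DEF_SCSI_QCMD(my_queuecommand)
 */

static struct scsi_host_template my_template = {
        .name           = "my_lld",
        .queuecommand   = my_queuecommand,      /* handed to scsi_host_alloc() */
};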

Now I think the virtio-blk client performance is demonstrating this
issue pretty vividly, along with this week's tcm_vhost IBLOCK raw block
flash benchmarks, which demonstrate some other yet-to-be-determined
limitations of virtio-scsi-raw vs. tcm_vhost for this particular fio
randrw workload.

 What people might pay attention to is evidence that there's a problem in
 3.5-rc6 (without any OFED crap).  If you're not going to bother
 investigating, it has to be in an environment they can reproduce (so
 ordinary hardware, not infiniband) otherwise it gets ignored as an
 esoteric hardware issue.
 

It's really quite simple for anyone to demonstrate the bottleneck
locally on any machine using tcm_loop with raw block flash.  Take a
struct block_device backend (like a Fusion IO /dev/fio*), wrap it with
IBLOCK, and export locally accessible SCSI LUNs via tcm_loop..

Using fio, there is a significant drop in randrw 4k performance between
tcm_loop + IBLOCK and raw struct block_device backends.  And no, it's
not some type of target IBLOCK or tcm_loop bottleneck; it's a per-SCSI-LUN
limitation for small block random I/Os, on the order of ~75K IOPS for
each SCSI LUN.
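
For anyone who wants a quick sanity check without writing an fio job
file, even a dumb QD=1 probe like the sketch below works.  To be clear,
this is just an illustration -- the device paths are placeholders, and
synchronous QD=1 reads report far lower absolute numbers than an fio
randrw run at depth -- but point it first at the raw /dev/fio* node and
then at the same backend exported as a tcm_loop SCSI LUN, and the extra
per-command overhead should still show up:

/* iops_probe.c -- tiny QD=1 4k random read probe (illustration only).
 * Build: gcc -O2 -o iops_probe iops_probe.c -lrt
 * Run:   ./iops_probe /dev/fioa       <- raw struct block_device backend
 *        ./iops_probe /dev/sdX        <- same backend via tcm_loop SCSI LUN
 */
#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <time.h>
#include <unistd.h>

#define BS 4096ULL

int main(int argc, char **argv)
{
        unsigned long long dev_bytes = 0, ios = 0;
        struct timespec start, now;
        void *buf;
        int fd;

        if (argc < 2) {
                fprintf(stderr, "usage: %s /dev/<blockdev>\n", argv[0]);
                return 1;
        }

        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0 || ioctl(fd, BLKGETSIZE64, &dev_bytes) < 0) {
                perror(argv[1]);
                return 1;
        }
        if (posix_memalign(&buf, BS, BS)) {
                fprintf(stderr, "posix_memalign failed\n");
                return 1;
        }

        srandom(getpid());
        clock_gettime(CLOCK_MONOTONIC, &start);

        for (;;) {
                /* pick a random 4k-aligned offset and do one direct read */
                off_t off = (off_t)(random() % (dev_bytes / BS)) * BS;

                if (pread(fd, buf, BS, off) != (ssize_t)BS) {
                        perror("pread");
                        return 1;
                }

                if ((++ios & 0x3ff) == 0) {
                        double secs;

                        clock_gettime(CLOCK_MONOTONIC, &now);
                        secs = (now.tv_sec - start.tv_sec) +
                               (now.tv_nsec - start.tv_nsec) / 1e9;
                        if (secs >= 10.0) {
                                printf("%s: %.0f IOPS (4k random read, QD=1)\n",
                                       argv[1], ios / secs);
                                break;
                        }
                }
        }
        return 0;
}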

If anyone has actually gone faster than this with any single SCSI LUN on
any storage fabric, I would be interested in hearing about your setup.

Thanks,

--nab



Re: SCSI Performance regression [was Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6]

2012-07-06 Thread Nicholas A. Bellinger
On Fri, 2012-07-06 at 17:49 +0400, James Bottomley wrote:
 On Fri, 2012-07-06 at 02:13 -0700, Nicholas A. Bellinger wrote:
  On Fri, 2012-07-06 at 09:43 +0400, James Bottomley wrote:
   On Thu, 2012-07-05 at 20:01 -0700, Nicholas A. Bellinger wrote:
   

SNIP

This bottleneck has been mentioned by various people (including myself)
on linux-scsi over the last 18 months, and I've proposed that it be
discussed at KS-2012 so we can start making some forward progress:
   
   Well, no, it hasn't.  You randomly drop things like this into unrelated
   email (I suppose that is a mention in strict English construction) but
   it's not really enough to get anyone to pay attention since they mostly
   stopped reading at the top, if they got that far: most people just go by
   subject when wading through threads initially.
   
  
  It most certainly has been made clear to me, numerous times and from many
  people in the Linux/SCSI community, that there is a bottleneck for small
  block random I/O in SCSI core vs. raw Linux/Block, as well as vs.
  non-Linux-based SCSI subsystems.
  
  My apologies if mentioning this issue to you privately at LC 2011 last
  year did not come across as seriously as intended, or if proposing a
  topic for LSF-2012 this year was not a clear enough indication of a
  problem with SCSI small block random I/O performance.
  
   But even if anyone noticed, a statement that RHEL6.2 (on a 2.6.32
   kernel, which is now nearly three years old) is 25% slower than W2k8R2
   on infiniband isn't really going to get anyone excited either
   (particularly when you mention OFED, which usually means a stack
   replacement on Linux anyway).
   
  
  The specific issue was first raised for .38, where we were able to get
  most of the interesting high-performance LLDs converted to using
  internal locking methods so that host_lock did not have to be obtained
  during each ->queuecommand() I/O dispatch, right..?
  
  This has helped a good deal for large multi-lun scsi_host configs that
  are now running in host-lock-less mode, but there is still a large
  discrepancy between single LUN and raw struct block_device access, even
  with LLD host_lock-less mode enabled.
  
  Now I think the virtio-blk client performance is demonstrating this
  issue pretty vividly, along with this week's tcm_vhost IBLOCK raw block
  flash benchmarks, which demonstrate some other yet-to-be-determined
  limitations of virtio-scsi-raw vs. tcm_vhost for this particular fio
  randrw workload.
  
   What people might pay attention to is evidence that there's a problem in
   3.5-rc6 (without any OFED crap).  If you're not going to bother
   investigating, it has to be in an environment they can reproduce (so
   ordinary hardware, not infiniband) otherwise it gets ignored as an
   esoteric hardware issue.
   
  
  It's really quite simple for anyone to demonstrate the bottleneck
  locally on any machine using tcm_loop with raw block flash.  Take a
  struct block_device backend (like a Fusion IO /dev/fio*), wrap it with
  IBLOCK, and export locally accessible SCSI LUNs via tcm_loop..
  
  Using fio, there is a significant drop in randrw 4k performance between
  tcm_loop + IBLOCK and raw struct block_device backends.  And no, it's
  not some type of target IBLOCK or tcm_loop bottleneck; it's a per-SCSI-LUN
  limitation for small block random I/Os, on the order of ~75K IOPS for
  each SCSI LUN.
 
 Here, you're saying that the end-to-end SCSI stack tops out at
 around 75k iops, which is reasonably respectable if you don't employ any
 mitigation like queue steering and interrupt polling ... what were the
 mitigation techniques in the test you employed, by the way?
 

~75K per SCSI LUN in a multi-lun per host setup is optimistic, btw.
On the other side of the coin, the same pure block device can easily go
~200K IOPS per backend.

For the simplest case with tcm_loop, a struct scsi_cmnd is queued via
cmwq to execute in process context and submit the backend I/O.  Once
completed by IBLOCK, the I/O is run through a target completion wq and
completed back to SCSI.

There is no fancy queue steering or interrupt polling going on (at least
not in tcm_loop) because it's a simple virtual SCSI LLD similar to
scsi_debug.
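
To be clear about what that path looks like, the hand-off is basically
the pattern below.  This is not the actual tcm_loop code -- the demo_*
names are invented, and the real thing dispatches into target_core_mod
rather than completing directly -- it's just meant to show the
->queuecommand() -> cmwq -> process-context submission hand-off I'm
describing:

/* Illustration only; demo_* names invented, error paths trimmed. */
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/workqueue.h>
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

struct demo_cmd {
        struct scsi_cmnd *sc;
        struct work_struct work;
};

static struct workqueue_struct *demo_wq;  /* alloc_workqueue() at module init */

static void demo_submit_work(struct work_struct *work)
{
        struct demo_cmd *dc = container_of(work, struct demo_cmd, work);

        /*
         * Process context: translate the CDB and submit the backend
         * (e.g. IBLOCK) I/O here.  The backend completion would then
         * set the result and hand the command back to the SCSI ML:
         */
        dc->sc->result = SAM_STAT_GOOD;
        dc->sc->scsi_done(dc->sc);
        kfree(dc);
}

static int demo_queuecommand(struct Scsi_Host *sh, struct scsi_cmnd *sc)
{
        struct demo_cmd *dc = kzalloc(sizeof(*dc), GFP_ATOMIC);

        if (!dc)
                return SCSI_MLQUEUE_HOST_BUSY;

        dc->sc = sc;
        INIT_WORK(&dc->work, demo_submit_work);
        queue_work(demo_wq, &dc->work);
        return 0;
}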

 But previously, you ascribed a performance drop of around 75% on
 virtio-scsi (topping out around 15-20k iops) to this same problem ...
 that doesn't really seem likely.
 

No.  I ascribed the performance difference between virtio-scsi+tcm_vhost
and bare-metal raw block flash to this bottleneck in Linux/SCSI.

It's obvious that virtio-scsi-raw going through QEMU SCSI / block is
hitting some other shortcomings.

 Here's the rough ranges of concern:
 
 10K iops: standard arrays
 100K iops: modern expensive fast flash drives on 6Gb links
 1M iops: PCIe NVMexpress like devices
 
 SCSI should do arrays with no problem at all, so I'd be really concerned
 that it can't make 0-20k iops.  If you push the system and fine tune it,
 SCSI can just about 

Re: [Ksummit-2012-discuss] SCSI Performance regression [was Re: [PATCH 0/6] tcm_vhost/virtio-scsi WIP code for-3.6]

2012-07-06 Thread Nicholas A. Bellinger
On Fri, 2012-07-06 at 15:30 -0500, Christoph Lameter wrote:
 On Fri, 6 Jul 2012, James Bottomley wrote:
 
  What people might pay attention to is evidence that there's a problem in
  3.5-rc6 (without any OFED crap).  If you're not going to bother
  investigating, it has to be in an environment they can reproduce (so
  ordinary hardware, not infiniband) otherwise it gets ignored as an
  esoteric hardware issue.
 
 The OFED stuff is in the meantime part of 3.5-rc6. Infiniband has been
 supported for a long time, and it's a very important technology given the
 problematic nature of Ethernet at high network speeds.
 
 OFED crap exists for those running RHEL5/6. The new enterprise distros are
 based on the 3.2 kernel which has pretty good Infiniband support
 out of the box.
 

So I don't think the HCAs or Infiniband fabric was the limiting factor
for small block random I/O in the RHEL 6.2 w/ OFED vs. Windows Server
2008 R2 w/ OFED setup mentioned earlier.

I've seen both FC and iSCSI fabrics demonstrate the same type of random
small block I/O performance anomalies with Linux/SCSI clients too.  The
v3.x Linux/SCSI clients are certainly better in the multi-lun per host
small block random I/O case, but single LUN performance is (still)
lacking compared to everything else.

Also, RHEL 6.2 does have the SCSI host-lock-less bits in place now, but
it's been more a matter of converting the OFED ib_srp code to run in
host-lock-less mode to realize extra gains for multi-lun per host.
