RE: [PATCH 00/10] mpt3sas: full mq support

2017-02-16 Thread Kashyap Desai
> > - Later we can explore if nr_hw_queue more than one really add benefit.
> > From current limited testing, I don't see major performance boost if
> > we have nr_hw_queue more than one.
> >
> Well, the _actual_ code to support mq is rather trivial, and really serves
> as a
> good testbed for scsi-mq.
> I would prefer to leave it in, and disable it via a module parameter.

I am thinking that adding extra code for more than one nr_hw_queue will add
maintenance and support overhead. Especially the IO error handling code becomes
more complex in the nr_hw_queues > 1 case.  If we really want to see a
performance boost, we should attempt it and bear the other side effects.

For the time being, my choice would be to drop the nr_hw_queue > 1 support
(not even keep it behind a module parameter).

>
> But in either case, I can rebase the patches to leave any notions of
> 'nr_hw_queues' to patch 8 for implementing full mq support.

Thanks Hannes. It was just a heads up... We are not sure when we can submit the
upcoming patch set from Broadcom. Maybe we can sync up with you offline in case
any rebase is required.

>
> And we need to discuss how to handle MPI2_FUNCTION_SCSI_IO_REQUEST;
> the current method doesn't work with blk-mq.
> I really would like to see that go, especially as sg/bsg supports the same
> functionality ...
>


Re: [PATCH 00/10] mpt3sas: full mq support

2017-02-16 Thread Hannes Reinecke
On 02/16/2017 10:48 AM, Kashyap Desai wrote:
>> -Original Message-
>> From: Hannes Reinecke [mailto:h...@suse.de]
>> Sent: Wednesday, February 15, 2017 3:35 PM
>> To: Kashyap Desai; Sreekanth Reddy
>> Cc: Christoph Hellwig; Martin K. Petersen; James Bottomley; linux-
>> s...@vger.kernel.org; Sathya Prakash Veerichetty; PDL-MPT-FUSIONLINUX
>> Subject: Re: [PATCH 00/10] mpt3sas: full mq support
>>
>> On 02/15/2017 10:18 AM, Kashyap Desai wrote:
>>>>
>>>>
>>>> Hannes,
>>>>
>>>> Result I have posted last time is with merge operation enabled in
>>>> block layer. If I disable merge operation then I don't see much
>>>> improvement with multiple hw request queues. Here is the result,
>>>>
>>>> fio results when nr_hw_queues=1,
>>>> 4k read when numjobs=24: io=248387MB, bw=1655.1MB/s, iops=423905,
>>>> runt=150003msec
>>>>
>>>> fio results when nr_hw_queues=24,
>>>> 4k read when numjobs=24: io=263904MB, bw=1759.4MB/s, iops=450393,
>>>> runt=150001msec
>>>
>>> Hannes -
>>>
>>>  I worked with Sreekanth and also understand pros/cons of Patch #10.
>>> " [PATCH 10/10] mpt3sas: scsi-mq interrupt steering"
>>>
>>> In above patch, can_queue of HBA is divided based on logic CPU, it
>>> means we want to mimic as if mpt3sas HBA support multi queue
>>> distributing actual resources which is single Submission H/W Queue.
>>> This approach badly impact many performance areas.
>>>
>>> nr_hw_queues = 1 is what I observe as best performance approach since
>>> it never throttle IO if sdev->queue_depth is set to HBA queue depth.
>>> In case of nr_hw_queues = "CPUs" throttle IO at SCSI level since we
>>> never allow more than "updated can_queue" in LLD.
>>>
>> True.
>> And this was actually one of the things I wanted to demonstrate with this
>> patchset :-) ATM blk-mq really works best when having a distinct tag space
>> per port/device. As soon as the hardware provides a _shared_ tag space you
>> end up with tag starvation issues as blk-mq only allows you to do a static
>> split of the available tagspace.
>> While this patchset demonstrates that the HBA itself _does_ benefit from
>> using block-mq (especially on highly parallel loads), it also demonstrates
>> that
>> _block-mq_ has issues with singlethreaded loads on this HBA (or, rather,
>> type of HBA, as I doubt this issue is affecting mpt3sas only).
>>
>>> Below code bring actual HBA can_queue very low ( Ea on 96 logical core
>>> CPU new can_queue goes to 42, if HBA queue depth is 4K). It means we
>>> will see lots of IO throttling in scsi mid layer due to
>>> shost->can_queue reach the limit very soon if you have  jobs with
>> higher QD.
>>>
>>> if (ioc->shost->nr_hw_queues > 1) {
>>> ioc->shost->nr_hw_queues = ioc->msix_vector_count;
>>> ioc->shost->can_queue /= ioc->msix_vector_count;
>>> }
>>> I observe negative performance if I have 8 SSD drives attached to
>>> Ventura (latest IT controller). 16 fio jobs at QD=128 gives ~1600K
>>> IOPs and the moment I switch to nr_hw_queues = "CPUs", it gave hardly
>>> ~850K IOPs. This is mainly because of host_busy stuck at very low ~169
>>> on
>> my setup.
>>>
>> Which actually might be an issue with the way scsi is hooked into blk-mq.
>> The SCSI stack is using 'can_queue' as a check for 'host_busy', ie if the
>> host is
>> capable of accepting more commands.
>> As we're limiting can_queue (to get the per-queue command depth
>> correctly) we should be using the _overall_ command depth for the
>> can_queue value itself to make the host_busy check work correctly.
>>
>> I've attached a patch for that; can you test if it makes a difference?
> Hannes -
> Attached patch works fine for me. FYI -  We need to set device queue depth
> to can_queue as we are currently not doing in mpt3sas driver.
> 
> With attached patch when I tried, I see ~2-3% improvement running multiple
> jobs. Single job profile no difference.
> 
> So looks like we are good to reach performance with single nr_hw_queues.
> 
Whee, cool.

> We have some patches to be send so want to know how to rebase this patch
> series as few patches coming from Broadcom. Can we consider below as plan ?
> 
Sure, can do.

> - Patches from 1-7 will be reposted. Also Sreekanth will complete review on
> exi

RE: [PATCH 00/10] mpt3sas: full mq support

2017-02-16 Thread Kashyap Desai
> -Original Message-
> From: Hannes Reinecke [mailto:h...@suse.de]
> Sent: Wednesday, February 15, 2017 3:35 PM
> To: Kashyap Desai; Sreekanth Reddy
> Cc: Christoph Hellwig; Martin K. Petersen; James Bottomley; linux-
> s...@vger.kernel.org; Sathya Prakash Veerichetty; PDL-MPT-FUSIONLINUX
> Subject: Re: [PATCH 00/10] mpt3sas: full mq support
>
> On 02/15/2017 10:18 AM, Kashyap Desai wrote:
> >>
> >>
> >> Hannes,
> >>
> >> Result I have posted last time is with merge operation enabled in
> >> block layer. If I disable merge operation then I don't see much
> >> improvement with multiple hw request queues. Here is the result,
> >>
> >> fio results when nr_hw_queues=1,
> >> 4k read when numjobs=24: io=248387MB, bw=1655.1MB/s, iops=423905,
> >> runt=150003msec
> >>
> >> fio results when nr_hw_queues=24,
> >> 4k read when numjobs=24: io=263904MB, bw=1759.4MB/s, iops=450393,
> >> runt=150001msec
> >
> > Hannes -
> >
> >  I worked with Sreekanth and also understand pros/cons of Patch #10.
> > " [PATCH 10/10] mpt3sas: scsi-mq interrupt steering"
> >
> > In above patch, can_queue of HBA is divided based on logic CPU, it
> > means we want to mimic as if mpt3sas HBA support multi queue
> > distributing actual resources which is single Submission H/W Queue.
> > This approach badly impact many performance areas.
> >
> > nr_hw_queues = 1 is what I observe as best performance approach since
> > it never throttle IO if sdev->queue_depth is set to HBA queue depth.
> > In case of nr_hw_queues = "CPUs" throttle IO at SCSI level since we
> > never allow more than "updated can_queue" in LLD.
> >
> True.
> And this was actually one of the things I wanted to demonstrate with this
> patchset :-) ATM blk-mq really works best when having a distinct tag space
> per port/device. As soon as the hardware provides a _shared_ tag space you
> end up with tag starvation issues as blk-mq only allows you to do a static
> split of the available tagspace.
> While this patchset demonstrates that the HBA itself _does_ benefit from
> using block-mq (especially on highly parallel loads), it also demonstrates
> that
> _block-mq_ has issues with singlethreaded loads on this HBA (or, rather,
> type of HBA, as I doubt this issue is affecting mpt3sas only).
>
> > Below code bring actual HBA can_queue very low ( Ea on 96 logical core
> > CPU new can_queue goes to 42, if HBA queue depth is 4K). It means we
> > will see lots of IO throttling in scsi mid layer due to
> > shost->can_queue reach the limit very soon if you have  jobs with
> higher QD.
> >
> > if (ioc->shost->nr_hw_queues > 1) {
> > ioc->shost->nr_hw_queues = ioc->msix_vector_count;
> > ioc->shost->can_queue /= ioc->msix_vector_count;
> > }
> > I observe negative performance if I have 8 SSD drives attached to
> > Ventura (latest IT controller). 16 fio jobs at QD=128 gives ~1600K
> > IOPs and the moment I switch to nr_hw_queues = "CPUs", it gave hardly
> > ~850K IOPs. This is mainly because of host_busy stuck at very low ~169
> > on
> my setup.
> >
> Which actually might be an issue with the way scsi is hooked into blk-mq.
> The SCSI stack is using 'can_queue' as a check for 'host_busy', ie if the
> host is
> capable of accepting more commands.
> As we're limiting can_queue (to get the per-queue command depth
> correctly) we should be using the _overall_ command depth for the
> can_queue value itself to make the host_busy check work correctly.
>
> I've attached a patch for that; can you test if it makes a difference?
Hannes -
The attached patch works fine for me. FYI - we need to set the device queue
depth to can_queue, as we are currently not doing that in the mpt3sas driver.
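
For illustration, a minimal sketch of that idea (the hook name and the exact
call site are my assumptions, not the actual mpt3sas change):

#include <scsi/scsi_device.h>
#include <scsi/scsi_host.h>

/*
 * Sketch only: raise the per-device queue depth to the host can_queue so a
 * single device is not throttled below what the HBA can actually accept.
 */
static int example_slave_configure(struct scsi_device *sdev)
{
	struct Scsi_Host *shost = sdev->host;

	scsi_change_queue_depth(sdev, shost->can_queue);
	return 0;
}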

With the attached patch I see a ~2-3% improvement when running multiple
jobs. The single-job profile shows no difference.

So it looks like we can reach full performance with a single nr_hw_queue.

We have some patches to send, so we want to know how to rebase this patch
series, as a few patches are coming from Broadcom. Can we consider the below as
the plan?

- Patches 1-7 will be reposted. Also, Sreekanth will complete the review of the
existing patches 1-7.
- We need blk_tag support only for nr_hw_queues = 1.

With that said, many code paths/functions can drop the "shost_use_blk_mq"
check and assume the driver supports a single nr_hw_queue.

E.g. the function below can be simplified - just take the tag from
scmd->request and drop the shost_use_blk_mq + nr_hw_queues checks.

u16
mpt3sas_base_get_smid_scsiio(struct MPT3SAS
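
For reference, a rough sketch of that direction (illustrative only, not the
actual Broadcom patch; it assumes the driver's existing scsiio_tracker lookup
array, and that SMIDs stay 1-based while blk-mq tags are 0-based):

#include "mpt3sas_base.h"

u16
mpt3sas_base_get_smid_scsiio(struct MPT3SAS_ADAPTER *ioc, u8 cb_idx,
	struct scsi_cmnd *scmd)
{
	/* the block layer has already assigned a unique tag to this command */
	u16 smid = scmd->request->tag + 1;
	struct scsiio_tracker *st = &ioc->scsi_lookup[smid - 1];

	st->scmd = scmd;
	st->cb_idx = cb_idx;
	st->smid = smid;

	return smid;
}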

RE: [PATCH 00/10] mpt3sas: full mq support

2017-02-15 Thread Kashyap Desai
>
>
> Hannes,
>
> Result I have posted last time is with merge operation enabled in block
> layer. If I disable merge operation then I don't see much improvement
> with
> multiple hw request queues. Here is the result,
>
> fio results when nr_hw_queues=1,
> 4k read when numjobs=24: io=248387MB, bw=1655.1MB/s, iops=423905,
> runt=150003msec
>
> fio results when nr_hw_queues=24,
> 4k read when numjobs=24: io=263904MB, bw=1759.4MB/s, iops=450393,
> runt=150001msec

Hannes -

I worked with Sreekanth and I also understand the pros/cons of Patch #10,
"[PATCH 10/10] mpt3sas: scsi-mq interrupt steering".

In the above patch, the can_queue of the HBA is divided based on logical CPUs;
that is, we want to mimic the mpt3sas HBA supporting multiple queues while
distributing the resources of what is actually a single submission H/W queue.
This approach badly impacts many performance areas.

nr_hw_queues = 1 is what I observe as the best-performing approach, since it
never throttles IO if sdev->queue_depth is set to the HBA queue depth.
The nr_hw_queues = "CPUs" case throttles IO at the SCSI level, since we never
allow more than the "updated can_queue" in the LLD.

The code below brings the actual HBA can_queue very low (e.g. on a 96 logical
core CPU the new can_queue goes down to 42, if the HBA queue depth is 4K). It
means we will see lots of IO throttling in the SCSI mid layer, because
shost->can_queue reaches the limit very soon if you have jobs with a higher QD.

if (ioc->shost->nr_hw_queues > 1) {
	ioc->shost->nr_hw_queues = ioc->msix_vector_count;
	ioc->shost->can_queue /= ioc->msix_vector_count;
}
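
To make the arithmetic concrete (illustrative numbers from the example above):

/*
 * HBA queue depth        = 4096
 * logical CPUs / MSI-X   =   96
 * new shost->can_queue   = 4096 / 96 = 42
 *
 * The SCSI midlayer's host_busy check compares against this same
 * shost->can_queue, which is why the throttling kicks in so early.
 */
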
I observe a performance regression if I have 8 SSD drives attached to Ventura
(the latest IT controller). 16 fio jobs at QD=128 give ~1600K IOPs, and the
moment I switch to nr_hw_queues = "CPUs", it gives hardly ~850K IOPs. This is
mainly because host_busy is stuck at a very low ~169 on my setup.

Maybe, as Sreekanth mentioned, the performance improvement you have observed
is because nomerges=2 is not set and the OS will attempt soft back/front merges.

I debugged a live machine and understood that we never see as many parallel
instances of "scsi_dispatch_cmd" as we expect, because can_queue is low. If we
really had a *very* large HBA QD, this patch #10 exposing multiple SQs might be
useful.

For now, we are looking at an updated version of the patch which will only keep
the IT HBA in SQ mode (like we are doing in the  driver) and add an
interface to use blk_tag in both scsi-mq and !scsi-mq mode.  Sreekanth has
already started working on it, but we may need a full performance test run
before posting the actual patch.
Maybe we can cherry-pick a few patches from this series and get blk_tag
support to improve the performance of  later, which will not allow the user
to choose nr_hw_queue as a tunable.

Thanks, Kashyap


>
> Thanks,
> Sreekanth


Re: [PATCH 00/10] mpt3sas: full mq support

2017-02-15 Thread Sreekanth Reddy
On Mon, Feb 13, 2017 at 6:41 PM, Hannes Reinecke  wrote:
> On 02/13/2017 07:15 AM, Sreekanth Reddy wrote:
>> On Fri, Feb 10, 2017 at 12:29 PM, Hannes Reinecke  wrote:
>>> On 02/10/2017 05:43 AM, Sreekanth Reddy wrote:
 On Thu, Feb 9, 2017 at 6:42 PM, Hannes Reinecke  wrote:
> On 02/09/2017 02:03 PM, Sreekanth Reddy wrote:
>>> [ .. ]
>>
>>
>> Hannes,
>>
>> I have created a md raid0 with 4 SAS SSD drives using below command,
>> #mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdg /dev/sdh
>> /dev/sdi /dev/sdj
>>
>> And here is 'mdadm --detail /dev/md0' command output,
>> --
>> /dev/md0:
>> Version : 1.2
>>   Creation Time : Thu Feb  9 14:38:47 2017
>>  Raid Level : raid0
>>  Array Size : 780918784 (744.74 GiB 799.66 GB)
>>Raid Devices : 4
>>   Total Devices : 4
>> Persistence : Superblock is persistent
>>
>> Update Time : Thu Feb  9 14:38:47 2017
>>   State : clean
>>  Active Devices : 4
>> Working Devices : 4
>>  Failed Devices : 0
>>   Spare Devices : 0
>>
>>  Chunk Size : 512K
>>
>>Name : host_name
>>UUID : b63f9da7:b7de9a25:6a46ca00:42214e22
>>  Events : 0
>>
>> Number   Major   Minor   RaidDevice State
>>0   8   960  active sync   /dev/sdg
>>1   8  1121  active sync   /dev/sdh
>>2   8  1442  active sync   /dev/sdj
>>3   8  1283  active sync   /dev/sdi
>> --
>>
>> Then I have used below fio profile to run 4K sequence read operations
>> with nr_hw_queues=1 driver and with nr_hw_queues=24 driver (as my
>> system has two numa node and each with 12 cpus).
>> -
>> [global]
>> ioengine=libaio
>> group_reporting
>> direct=1
>> rw=read
>> bs=4k
>> allow_mounted_write=0
>> iodepth=128
>> runtime=150s
>>
>> [job1]
>> filename=/dev/md0
>> -
>>
>> Here are the fio results when nr_hw_queues=1 (i.e. single request
>> queue) with various number of job counts
>> 1JOB 4k read  : io=213268MB, bw=1421.8MB/s, iops=363975, runt=150001msec
>> 2JOBs 4k read : io=309605MB, bw=2064.2MB/s, iops=528389, runt=150001msec
>> 4JOBs 4k read : io=281001MB, bw=1873.4MB/s, iops=479569, runt=150002msec
>> 8JOBs 4k read : io=236297MB, bw=1575.2MB/s, iops=403236, runt=150016msec
>>
>> Here are the fio results when nr_hw_queues=24 (i.e. multiple request
>> queue) with various number of job counts
>> 1JOB 4k read   : io=95194MB, bw=649852KB/s, iops=162463, runt=150001msec
>> 2JOBs 4k read : io=189343MB, bw=1262.3MB/s, iops=323142, runt=150001msec
>> 4JOBs 4k read : io=314832MB, bw=2098.9MB/s, iops=537309, runt=150001msec
>> 8JOBs 4k read : io=277015MB, bw=1846.8MB/s, iops=472769, runt=150001msec
>>
>> Here we can see that on less number of jobs count, single request
>> queue (nr_hw_queues=1) is giving more IOPs than multi request
>> queues(nr_hw_queues=24).
>>
>> Can you please share your fio profile, so that I can try same thing on
>> my system.
>>
> Have you tried with the latest git update from Jens for-4.11/block (or
> for-4.11/next) branch?

 I am using below git repo,

 https://git.kernel.org/cgit/linux/kernel/git/mkp/scsi.git/log/?h=4.11/scsi-queue

 Today I will try with Jens for-4.11/block.

>>> By all means, do.
>>>
> I've found that using the mq-deadline scheduler has a noticeable
> performance boost.
>
> The fio job I'm using is essentially the same; you just should make sure
> to specify a 'numjob=' statement in there.
> Otherwise fio will just use a single CPU, which of course leads to
> averse effects in the multiqueue case.

 Yes I am providing 'numjob=' on fio command line as shown below,

 # fio md_fio_profile --numjobs=8 --output=fio_results.txt

>>> Still, it looks as if you'd be using less jobs than you have CPUs.
>>> Which means you'll be running into a tag starvation scenario on those
>>> CPUs, especially for the small blocksizes.
>>> What are the results if you set 'numjobs' to the number of CPUs?
>>>
>>
>> Hannes,
>>
>> Tried on Jens for-4.11/block kernel repo and also set each block PD's
>> scheduler as 'mq-deadline', and here is my results for 4K SR on md0
>> (raid0 with 4 drives). I have 24 CPUs and so tried even with setting
>> numjobs=24.
>>

Re: [PATCH 00/10] mpt3sas: full mq support

2017-02-15 Thread Hannes Reinecke
On 02/15/2017 09:15 AM, Christoph Hellwig wrote:
> On Tue, Feb 07, 2017 at 02:19:09PM +0100, Christoph Hellwig wrote:
>> Patch 1-7 look fine to me with minor fixups, and I'd love to see
>> them go into 4.11.
> 
> Any chance to see a resend of these?
> 
Sure.

Will do shortly.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 00/10] mpt3sas: full mq support

2017-02-15 Thread Christoph Hellwig
On Tue, Feb 07, 2017 at 02:19:09PM +0100, Christoph Hellwig wrote:
> Patch 1-7 look fine to me with minor fixups, and I'd love to see
> them go into 4.11.

Any chance to see a resend of these?


Re: [PATCH 00/10] mpt3sas: full mq support

2017-02-13 Thread Hannes Reinecke
On 02/13/2017 07:15 AM, Sreekanth Reddy wrote:
> On Fri, Feb 10, 2017 at 12:29 PM, Hannes Reinecke  wrote:
>> On 02/10/2017 05:43 AM, Sreekanth Reddy wrote:
>>> On Thu, Feb 9, 2017 at 6:42 PM, Hannes Reinecke  wrote:
 On 02/09/2017 02:03 PM, Sreekanth Reddy wrote:
>> [ .. ]
>
>
> Hannes,
>
> I have created a md raid0 with 4 SAS SSD drives using below command,
> #mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdg /dev/sdh
> /dev/sdi /dev/sdj
>
> And here is 'mdadm --detail /dev/md0' command output,
> --
> /dev/md0:
> Version : 1.2
>   Creation Time : Thu Feb  9 14:38:47 2017
>  Raid Level : raid0
>  Array Size : 780918784 (744.74 GiB 799.66 GB)
>Raid Devices : 4
>   Total Devices : 4
> Persistence : Superblock is persistent
>
> Update Time : Thu Feb  9 14:38:47 2017
>   State : clean
>  Active Devices : 4
> Working Devices : 4
>  Failed Devices : 0
>   Spare Devices : 0
>
>  Chunk Size : 512K
>
>Name : host_name
>UUID : b63f9da7:b7de9a25:6a46ca00:42214e22
>  Events : 0
>
> Number   Major   Minor   RaidDevice State
>0   8   960  active sync   /dev/sdg
>1   8  1121  active sync   /dev/sdh
>2   8  1442  active sync   /dev/sdj
>3   8  1283  active sync   /dev/sdi
> --
>
> Then I have used below fio profile to run 4K sequence read operations
> with nr_hw_queues=1 driver and with nr_hw_queues=24 driver (as my
> system has two numa node and each with 12 cpus).
> -
> [global]
> ioengine=libaio
> group_reporting
> direct=1
> rw=read
> bs=4k
> allow_mounted_write=0
> iodepth=128
> runtime=150s
>
> [job1]
> filename=/dev/md0
> -
>
> Here are the fio results when nr_hw_queues=1 (i.e. single request
> queue) with various number of job counts
> 1JOB 4k read  : io=213268MB, bw=1421.8MB/s, iops=363975, runt=150001msec
> 2JOBs 4k read : io=309605MB, bw=2064.2MB/s, iops=528389, runt=150001msec
> 4JOBs 4k read : io=281001MB, bw=1873.4MB/s, iops=479569, runt=150002msec
> 8JOBs 4k read : io=236297MB, bw=1575.2MB/s, iops=403236, runt=150016msec
>
> Here are the fio results when nr_hw_queues=24 (i.e. multiple request
> queue) with various number of job counts
> 1JOB 4k read   : io=95194MB, bw=649852KB/s, iops=162463, runt=150001msec
> 2JOBs 4k read : io=189343MB, bw=1262.3MB/s, iops=323142, runt=150001msec
> 4JOBs 4k read : io=314832MB, bw=2098.9MB/s, iops=537309, runt=150001msec
> 8JOBs 4k read : io=277015MB, bw=1846.8MB/s, iops=472769, runt=150001msec
>
> Here we can see that on less number of jobs count, single request
> queue (nr_hw_queues=1) is giving more IOPs than multi request
> queues(nr_hw_queues=24).
>
> Can you please share your fio profile, so that I can try same thing on
> my system.
>
 Have you tried with the latest git update from Jens for-4.11/block (or
 for-4.11/next) branch?
>>>
>>> I am using below git repo,
>>>
>>> https://git.kernel.org/cgit/linux/kernel/git/mkp/scsi.git/log/?h=4.11/scsi-queue
>>>
>>> Today I will try with Jens for-4.11/block.
>>>
>> By all means, do.
>>
 I've found that using the mq-deadline scheduler has a noticeable
 performance boost.

 The fio job I'm using is essentially the same; you just should make sure
 to specify a 'numjob=' statement in there.
 Otherwise fio will just use a single CPU, which of course leads to
 averse effects in the multiqueue case.
>>>
>>> Yes I am providing 'numjob=' on fio command line as shown below,
>>>
>>> # fio md_fio_profile --numjobs=8 --output=fio_results.txt
>>>
>> Still, it looks as if you'd be using less jobs than you have CPUs.
>> Which means you'll be running into a tag starvation scenario on those
>> CPUs, especially for the small blocksizes.
>> What are the results if you set 'numjobs' to the number of CPUs?
>>
> 
> Hannes,
> 
> Tried on Jens for-4.11/block kernel repo and also set each block PD's
> scheduler as 'mq-deadline', and here is my results for 4K SR on md0
> (raid0 with 4 drives). I have 24 CPUs and so tried even with setting
> numjobs=24.
> 
> fio results when nr_hw_queues=1 (i.e. single request queue) with
> various number of job counts
> 
> 4k read when numjobs=1 : io=215553MB, bw=1437.9MB/s, iops=367874,
> runt=150001msec
> 

Re: [PATCH 00/10] mpt3sas: full mq support

2017-02-12 Thread Sreekanth Reddy
On Fri, Feb 10, 2017 at 12:29 PM, Hannes Reinecke  wrote:
> On 02/10/2017 05:43 AM, Sreekanth Reddy wrote:
>> On Thu, Feb 9, 2017 at 6:42 PM, Hannes Reinecke  wrote:
>>> On 02/09/2017 02:03 PM, Sreekanth Reddy wrote:
> [ .. ]


 Hannes,

 I have created a md raid0 with 4 SAS SSD drives using below command,
 #mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdg /dev/sdh
 /dev/sdi /dev/sdj

 And here is 'mdadm --detail /dev/md0' command output,
 --
 /dev/md0:
 Version : 1.2
   Creation Time : Thu Feb  9 14:38:47 2017
  Raid Level : raid0
  Array Size : 780918784 (744.74 GiB 799.66 GB)
Raid Devices : 4
   Total Devices : 4
 Persistence : Superblock is persistent

 Update Time : Thu Feb  9 14:38:47 2017
   State : clean
  Active Devices : 4
 Working Devices : 4
  Failed Devices : 0
   Spare Devices : 0

  Chunk Size : 512K

Name : host_name
UUID : b63f9da7:b7de9a25:6a46ca00:42214e22
  Events : 0

 Number   Major   Minor   RaidDevice State
0   8   960  active sync   /dev/sdg
1   8  1121  active sync   /dev/sdh
2   8  1442  active sync   /dev/sdj
3   8  1283  active sync   /dev/sdi
 --

 Then I have used below fio profile to run 4K sequence read operations
 with nr_hw_queues=1 driver and with nr_hw_queues=24 driver (as my
 system has two numa node and each with 12 cpus).
 -
 [global]
 ioengine=libaio
 group_reporting
 direct=1
 rw=read
 bs=4k
 allow_mounted_write=0
 iodepth=128
 runtime=150s

 [job1]
 filename=/dev/md0
 -

 Here are the fio results when nr_hw_queues=1 (i.e. single request
 queue) with various number of job counts
 1JOB 4k read  : io=213268MB, bw=1421.8MB/s, iops=363975, runt=150001msec
 2JOBs 4k read : io=309605MB, bw=2064.2MB/s, iops=528389, runt=150001msec
 4JOBs 4k read : io=281001MB, bw=1873.4MB/s, iops=479569, runt=150002msec
 8JOBs 4k read : io=236297MB, bw=1575.2MB/s, iops=403236, runt=150016msec

 Here are the fio results when nr_hw_queues=24 (i.e. multiple request
 queue) with various number of job counts
 1JOB 4k read   : io=95194MB, bw=649852KB/s, iops=162463, runt=150001msec
 2JOBs 4k read : io=189343MB, bw=1262.3MB/s, iops=323142, runt=150001msec
 4JOBs 4k read : io=314832MB, bw=2098.9MB/s, iops=537309, runt=150001msec
 8JOBs 4k read : io=277015MB, bw=1846.8MB/s, iops=472769, runt=150001msec

 Here we can see that on less number of jobs count, single request
 queue (nr_hw_queues=1) is giving more IOPs than multi request
 queues(nr_hw_queues=24).

 Can you please share your fio profile, so that I can try same thing on
 my system.

>>> Have you tried with the latest git update from Jens for-4.11/block (or
>>> for-4.11/next) branch?
>>
>> I am using below git repo,
>>
>> https://git.kernel.org/cgit/linux/kernel/git/mkp/scsi.git/log/?h=4.11/scsi-queue
>>
>> Today I will try with Jens for-4.11/block.
>>
> By all means, do.
>
>>> I've found that using the mq-deadline scheduler has a noticeable
>>> performance boost.
>>>
>>> The fio job I'm using is essentially the same; you just should make sure
>>> to specify a 'numjob=' statement in there.
>>> Otherwise fio will just use a single CPU, which of course leads to
>>> averse effects in the multiqueue case.
>>
>> Yes I am providing 'numjob=' on fio command line as shown below,
>>
>> # fio md_fio_profile --numjobs=8 --output=fio_results.txt
>>
> Still, it looks as if you'd be using less jobs than you have CPUs.
> Which means you'll be running into a tag starvation scenario on those
> CPUs, especially for the small blocksizes.
> What are the results if you set 'numjobs' to the number of CPUs?
>

Hannes,

Tried on the Jens for-4.11/block kernel repo and also set each block PD's
scheduler to 'mq-deadline'; here are my results for 4K SR on md0
(raid0 with 4 drives). I have 24 CPUs, so I also tried setting
numjobs=24.

Here are the fio results when nr_hw_queues=1 (i.e. single request queue)
with various job counts:

4k read when numjobs=1 : io=215553MB, bw=1437.9MB/s, iops=367874,
runt=150001msec
4k read when numjobs=2 : io=307771MB, bw=2051.9MB/s, iops=525258,
runt=150001msec
4k read when numjobs=4 : io=300382MB, bw=2002.6MB/s, iops=512644,
runt=150002msec
4k read when numjobs=8 : 

Re: [PATCH 00/10] mpt3sas: full mq support

2017-02-09 Thread Hannes Reinecke
On 02/10/2017 05:43 AM, Sreekanth Reddy wrote:
> On Thu, Feb 9, 2017 at 6:42 PM, Hannes Reinecke  wrote:
>> On 02/09/2017 02:03 PM, Sreekanth Reddy wrote:
[ .. ]
>>>
>>>
>>> Hannes,
>>>
>>> I have created a md raid0 with 4 SAS SSD drives using below command,
>>> #mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdg /dev/sdh
>>> /dev/sdi /dev/sdj
>>>
>>> And here is 'mdadm --detail /dev/md0' command output,
>>> --
>>> /dev/md0:
>>> Version : 1.2
>>>   Creation Time : Thu Feb  9 14:38:47 2017
>>>  Raid Level : raid0
>>>  Array Size : 780918784 (744.74 GiB 799.66 GB)
>>>Raid Devices : 4
>>>   Total Devices : 4
>>> Persistence : Superblock is persistent
>>>
>>> Update Time : Thu Feb  9 14:38:47 2017
>>>   State : clean
>>>  Active Devices : 4
>>> Working Devices : 4
>>>  Failed Devices : 0
>>>   Spare Devices : 0
>>>
>>>  Chunk Size : 512K
>>>
>>>Name : host_name
>>>UUID : b63f9da7:b7de9a25:6a46ca00:42214e22
>>>  Events : 0
>>>
>>> Number   Major   Minor   RaidDevice State
>>>0   8   960  active sync   /dev/sdg
>>>1   8  1121  active sync   /dev/sdh
>>>2   8  1442  active sync   /dev/sdj
>>>3   8  1283  active sync   /dev/sdi
>>> --
>>>
>>> Then I have used below fio profile to run 4K sequence read operations
>>> with nr_hw_queues=1 driver and with nr_hw_queues=24 driver (as my
>>> system has two numa node and each with 12 cpus).
>>> -
>>> [global]
>>> ioengine=libaio
>>> group_reporting
>>> direct=1
>>> rw=read
>>> bs=4k
>>> allow_mounted_write=0
>>> iodepth=128
>>> runtime=150s
>>>
>>> [job1]
>>> filename=/dev/md0
>>> -
>>>
>>> Here are the fio results when nr_hw_queues=1 (i.e. single request
>>> queue) with various number of job counts
>>> 1JOB 4k read  : io=213268MB, bw=1421.8MB/s, iops=363975, runt=150001msec
>>> 2JOBs 4k read : io=309605MB, bw=2064.2MB/s, iops=528389, runt=150001msec
>>> 4JOBs 4k read : io=281001MB, bw=1873.4MB/s, iops=479569, runt=150002msec
>>> 8JOBs 4k read : io=236297MB, bw=1575.2MB/s, iops=403236, runt=150016msec
>>>
>>> Here are the fio results when nr_hw_queues=24 (i.e. multiple request
>>> queue) with various number of job counts
>>> 1JOB 4k read   : io=95194MB, bw=649852KB/s, iops=162463, runt=150001msec
>>> 2JOBs 4k read : io=189343MB, bw=1262.3MB/s, iops=323142, runt=150001msec
>>> 4JOBs 4k read : io=314832MB, bw=2098.9MB/s, iops=537309, runt=150001msec
>>> 8JOBs 4k read : io=277015MB, bw=1846.8MB/s, iops=472769, runt=150001msec
>>>
>>> Here we can see that on less number of jobs count, single request
>>> queue (nr_hw_queues=1) is giving more IOPs than multi request
>>> queues(nr_hw_queues=24).
>>>
>>> Can you please share your fio profile, so that I can try same thing on
>>> my system.
>>>
>> Have you tried with the latest git update from Jens for-4.11/block (or
>> for-4.11/next) branch?
> 
> I am using below git repo,
> 
> https://git.kernel.org/cgit/linux/kernel/git/mkp/scsi.git/log/?h=4.11/scsi-queue
> 
> Today I will try with Jens for-4.11/block.
> 
By all means, do.

>> I've found that using the mq-deadline scheduler has a noticeable
>> performance boost.
>>
>> The fio job I'm using is essentially the same; you just should make sure
>> to specify a 'numjob=' statement in there.
>> Otherwise fio will just use a single CPU, which of course leads to
>> averse effects in the multiqueue case.
> 
> Yes I am providing 'numjob=' on fio command line as shown below,
> 
> # fio md_fio_profile --numjobs=8 --output=fio_results.txt
> 
Still, it looks as if you'd be using fewer jobs than you have CPUs.
Which means you'll be running into a tag starvation scenario on those
CPUs, especially for the small blocksizes.
What are the results if you set 'numjobs' to the number of CPUs?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 00/10] mpt3sas: full mq support

2017-02-09 Thread Sreekanth Reddy
On Thu, Feb 9, 2017 at 6:42 PM, Hannes Reinecke <h...@suse.de> wrote:
> On 02/09/2017 02:03 PM, Sreekanth Reddy wrote:
>> On Wed, Feb 1, 2017 at 1:13 PM, Hannes Reinecke <h...@suse.de> wrote:
>>>
>>> On 02/01/2017 08:07 AM, Kashyap Desai wrote:
>>>>>
>>>>> -Original Message-
>>>>> From: Hannes Reinecke [mailto:h...@suse.de]
>>>>> Sent: Wednesday, February 01, 2017 12:21 PM
>>>>> To: Kashyap Desai; Christoph Hellwig
>>>>> Cc: Martin K. Petersen; James Bottomley; linux-scsi@vger.kernel.org;
>>>>> Sathya
>>>>> Prakash Veerichetty; PDL-MPT-FUSIONLINUX; Sreekanth Reddy
>>>>> Subject: Re: [PATCH 00/10] mpt3sas: full mq support
>>>>>
>>>>> On 01/31/2017 06:54 PM, Kashyap Desai wrote:
>>>>>>>
>>>>>>> -Original Message-
>>>>>>> From: Hannes Reinecke [mailto:h...@suse.de]
>>>>>>> Sent: Tuesday, January 31, 2017 4:47 PM
>>>>>>> To: Christoph Hellwig
>>>>>>> Cc: Martin K. Petersen; James Bottomley; linux-scsi@vger.kernel.org;
>>>>>>
>>>>>> Sathya
>>>>>>>
>>>>>>> Prakash; Kashyap Desai; mpt-fusionlinux@broadcom.com
>>>>>>> Subject: Re: [PATCH 00/10] mpt3sas: full mq support
>>>>>>>
>>>>>>> On 01/31/2017 11:02 AM, Christoph Hellwig wrote:
>>>>>>>>
>>>>>>>> On Tue, Jan 31, 2017 at 10:25:50AM +0100, Hannes Reinecke wrote:
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> this is a patchset to enable full multiqueue support for the
>>>>>>>>> mpt3sas
>>>>>>>
>>>>>>> driver.
>>>>>>>>>
>>>>>>>>> While the HBA only has a single mailbox register for submitting
>>>>>>>>> commands, it does have individual receive queues per MSI-X
>>>>>>>>> interrupt and as such does benefit from converting it to full
>>>>>>>>> multiqueue
>>>>>>
>>>>>> support.
>>>>>>>>
>>>>>>>>
>>>>>>>> Explanation and numbers on why this would be beneficial, please.
>>>>>>>> We should not need multiple submissions queues for a single register
>>>>>>>> to benefit from multiple completion queues.
>>>>>>>>
>>>>>>> Well, the actual throughput very strongly depends on the blk-mq-sched
>>>>>>> patches from Jens.
>>>>>>> As this is barely finished I didn't post any numbers yet.
>>>>>>>
>>>>>>> However:
>>>>>>> With multiqueue support:
>>>>>>> 4k seq read : io=60573MB, bw=1009.2MB/s, iops=258353, runt=
>>>>>
>>>>> 60021msec
>>>>>>>
>>>>>>> With scsi-mq on 1 queue:
>>>>>>> 4k seq read : io=17369MB, bw=296291KB/s, iops=74072, runt= 60028msec
>>>>>>> So yes, there _is_ a benefit.
>>
>>
>> Hannes,
>>
>> I have created a md raid0 with 4 SAS SSD drives using below command,
>> #mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdg /dev/sdh
>> /dev/sdi /dev/sdj
>>
>> And here is 'mdadm --detail /dev/md0' command output,
>> --
>> /dev/md0:
>> Version : 1.2
>>   Creation Time : Thu Feb  9 14:38:47 2017
>>  Raid Level : raid0
>>  Array Size : 780918784 (744.74 GiB 799.66 GB)
>>Raid Devices : 4
>>   Total Devices : 4
>> Persistence : Superblock is persistent
>>
>> Update Time : Thu Feb  9 14:38:47 2017
>>   State : clean
>>  Active Devices : 4
>> Working Devices : 4
>>  Failed Devices : 0
>>   Spare Devices : 0
>>
>>  Chunk Size : 512K
>>
>>Name : host_name
>>UUID : b63f9da7:b7de9a25:6a46ca00:42214e22
>>  Events : 0
>>
>> Number   Major   Minor   RaidDevice State
>>0   8   960  active sync   /dev/sdg
>>1   8  1121  active sync   /dev/sdh
>>2

Re: [PATCH 00/10] mpt3sas: full mq support

2017-02-09 Thread Hannes Reinecke
On 02/09/2017 02:03 PM, Sreekanth Reddy wrote:
> On Wed, Feb 1, 2017 at 1:13 PM, Hannes Reinecke <h...@suse.de> wrote:
>>
>> On 02/01/2017 08:07 AM, Kashyap Desai wrote:
>>>>
>>>> -Original Message-
>>>> From: Hannes Reinecke [mailto:h...@suse.de]
>>>> Sent: Wednesday, February 01, 2017 12:21 PM
>>>> To: Kashyap Desai; Christoph Hellwig
>>>> Cc: Martin K. Petersen; James Bottomley; linux-scsi@vger.kernel.org;
>>>> Sathya
>>>> Prakash Veerichetty; PDL-MPT-FUSIONLINUX; Sreekanth Reddy
>>>> Subject: Re: [PATCH 00/10] mpt3sas: full mq support
>>>>
>>>> On 01/31/2017 06:54 PM, Kashyap Desai wrote:
>>>>>>
>>>>>> -Original Message-
>>>>>> From: Hannes Reinecke [mailto:h...@suse.de]
>>>>>> Sent: Tuesday, January 31, 2017 4:47 PM
>>>>>> To: Christoph Hellwig
>>>>>> Cc: Martin K. Petersen; James Bottomley; linux-scsi@vger.kernel.org;
>>>>>
>>>>> Sathya
>>>>>>
>>>>>> Prakash; Kashyap Desai; mpt-fusionlinux@broadcom.com
>>>>>> Subject: Re: [PATCH 00/10] mpt3sas: full mq support
>>>>>>
>>>>>> On 01/31/2017 11:02 AM, Christoph Hellwig wrote:
>>>>>>>
>>>>>>> On Tue, Jan 31, 2017 at 10:25:50AM +0100, Hannes Reinecke wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> this is a patchset to enable full multiqueue support for the
>>>>>>>> mpt3sas
>>>>>>
>>>>>> driver.
>>>>>>>>
>>>>>>>> While the HBA only has a single mailbox register for submitting
>>>>>>>> commands, it does have individual receive queues per MSI-X
>>>>>>>> interrupt and as such does benefit from converting it to full
>>>>>>>> multiqueue
>>>>>
>>>>> support.
>>>>>>>
>>>>>>>
>>>>>>> Explanation and numbers on why this would be beneficial, please.
>>>>>>> We should not need multiple submissions queues for a single register
>>>>>>> to benefit from multiple completion queues.
>>>>>>>
>>>>>> Well, the actual throughput very strongly depends on the blk-mq-sched
>>>>>> patches from Jens.
>>>>>> As this is barely finished I didn't post any numbers yet.
>>>>>>
>>>>>> However:
>>>>>> With multiqueue support:
>>>>>> 4k seq read : io=60573MB, bw=1009.2MB/s, iops=258353, runt=
>>>>
>>>> 60021msec
>>>>>>
>>>>>> With scsi-mq on 1 queue:
>>>>>> 4k seq read : io=17369MB, bw=296291KB/s, iops=74072, runt= 60028msec
>>>>>> So yes, there _is_ a benefit.
> 
> 
> Hannes,
> 
> I have created a md raid0 with 4 SAS SSD drives using below command,
> #mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdg /dev/sdh
> /dev/sdi /dev/sdj
> 
> And here is 'mdadm --detail /dev/md0' command output,
> --
> /dev/md0:
> Version : 1.2
>   Creation Time : Thu Feb  9 14:38:47 2017
>  Raid Level : raid0
>  Array Size : 780918784 (744.74 GiB 799.66 GB)
>Raid Devices : 4
>   Total Devices : 4
> Persistence : Superblock is persistent
> 
> Update Time : Thu Feb  9 14:38:47 2017
>   State : clean
>  Active Devices : 4
> Working Devices : 4
>  Failed Devices : 0
>   Spare Devices : 0
> 
>  Chunk Size : 512K
> 
>Name : host_name
>UUID : b63f9da7:b7de9a25:6a46ca00:42214e22
>  Events : 0
> 
> Number   Major   Minor   RaidDevice State
>0   8   960  active sync   /dev/sdg
>1   8  1121  active sync   /dev/sdh
>2   8  1442  active sync   /dev/sdj
>3   8  1283  active sync   /dev/sdi
> --
> 
> Then I have used below fio profile to run 4K sequence read operations
> with nr_hw_queues=1 driver and with nr_hw_queues=24 driver (as my
> system has two numa node and each with 12 cpus).
> 

Re: [PATCH 00/10] mpt3sas: full mq support

2017-02-09 Thread Sreekanth Reddy
On Wed, Feb 1, 2017 at 1:13 PM, Hannes Reinecke <h...@suse.de> wrote:
>
> On 02/01/2017 08:07 AM, Kashyap Desai wrote:
>>>
>>> -Original Message-
>>> From: Hannes Reinecke [mailto:h...@suse.de]
>>> Sent: Wednesday, February 01, 2017 12:21 PM
>>> To: Kashyap Desai; Christoph Hellwig
>>> Cc: Martin K. Petersen; James Bottomley; linux-scsi@vger.kernel.org;
>>> Sathya
>>> Prakash Veerichetty; PDL-MPT-FUSIONLINUX; Sreekanth Reddy
>>> Subject: Re: [PATCH 00/10] mpt3sas: full mq support
>>>
>>> On 01/31/2017 06:54 PM, Kashyap Desai wrote:
>>>>>
>>>>> -Original Message-
>>>>> From: Hannes Reinecke [mailto:h...@suse.de]
>>>>> Sent: Tuesday, January 31, 2017 4:47 PM
>>>>> To: Christoph Hellwig
>>>>> Cc: Martin K. Petersen; James Bottomley; linux-scsi@vger.kernel.org;
>>>>
>>>> Sathya
>>>>>
>>>>> Prakash; Kashyap Desai; mpt-fusionlinux@broadcom.com
>>>>> Subject: Re: [PATCH 00/10] mpt3sas: full mq support
>>>>>
>>>>> On 01/31/2017 11:02 AM, Christoph Hellwig wrote:
>>>>>>
>>>>>> On Tue, Jan 31, 2017 at 10:25:50AM +0100, Hannes Reinecke wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> this is a patchset to enable full multiqueue support for the
>>>>>>> mpt3sas
>>>>>
>>>>> driver.
>>>>>>>
>>>>>>> While the HBA only has a single mailbox register for submitting
>>>>>>> commands, it does have individual receive queues per MSI-X
>>>>>>> interrupt and as such does benefit from converting it to full
>>>>>>> multiqueue
>>>>
>>>> support.
>>>>>>
>>>>>>
>>>>>> Explanation and numbers on why this would be beneficial, please.
>>>>>> We should not need multiple submissions queues for a single register
>>>>>> to benefit from multiple completion queues.
>>>>>>
>>>>> Well, the actual throughput very strongly depends on the blk-mq-sched
>>>>> patches from Jens.
>>>>> As this is barely finished I didn't post any numbers yet.
>>>>>
>>>>> However:
>>>>> With multiqueue support:
>>>>> 4k seq read : io=60573MB, bw=1009.2MB/s, iops=258353, runt=
>>>
>>> 60021msec
>>>>>
>>>>> With scsi-mq on 1 queue:
>>>>> 4k seq read : io=17369MB, bw=296291KB/s, iops=74072, runt= 60028msec
>>>>> So yes, there _is_ a benefit.


Hannes,

I have created an md raid0 with 4 SAS SSD drives using the below command,
#mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdg /dev/sdh
/dev/sdi /dev/sdj

And here is 'mdadm --detail /dev/md0' command output,
--
/dev/md0:
        Version : 1.2
  Creation Time : Thu Feb  9 14:38:47 2017
     Raid Level : raid0
     Array Size : 780918784 (744.74 GiB 799.66 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Thu Feb  9 14:38:47 2017
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 512K

           Name : host_name
           UUID : b63f9da7:b7de9a25:6a46ca00:42214e22
         Events : 0

    Number   Major   Minor   RaidDevice State
       0       8       96        0      active sync   /dev/sdg
       1       8      112        1      active sync   /dev/sdh
       2       8      144        2      active sync   /dev/sdj
       3       8      128        3      active sync   /dev/sdi
--

Then I have used the below fio profile to run 4K sequential read operations
with the nr_hw_queues=1 driver and with the nr_hw_queues=24 driver (as my
system has two NUMA nodes, each with 12 CPUs).
-
[global]
ioengine=libaio
group_reporting
direct=1
rw=read
bs=4k
allow_mounted_write=0
iodepth=128
runtime=150s

[job1]
filename=/dev/md0
-

Here are the fio results when nr_hw_queues=1 (i.e. single request
queue) with various job counts
1JOB 4k read  : io=213268MB, bw=1421.8MB/s, iops=363975, runt=150001msec
2JOBs 4k read : io=309605MB, bw=2064.2MB/s, iops=528389, runt=150001msec
4JOBs 4k 

Re: [PATCH 00/10] mpt3sas: full mq support

2017-02-07 Thread Hannes Reinecke
On 02/07/2017 04:40 PM, Christoph Hellwig wrote:
> On Tue, Feb 07, 2017 at 04:39:01PM +0100, Hannes Reinecke wrote:
>> But we do; we're getting the index/tag/smid from the high-priority list,
>> which is separated from the normal SCSI I/O tag space.
>> (which reminds me; there's another cleanup patch to be had in
>> _ctl_do_mpt_command(), but that's beside the point).
> 
> The calls to blk_mq_tagset_busy_iter added in patch 8 indicate the
> contrary.
> 
Right. Now I see what you mean.
We should have used reserved_tags here.
Sadly we still don't have an interface to actually _allocate_ reserved
tags, do we?
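
(For what it's worth, a minimal sketch of how reserved tags are declared when
the tag set is registered - illustrative only; whether an LLD can conveniently
allocate from that pool for its internal commands is exactly the missing piece:

#include <linux/blk-mq.h>

static struct blk_mq_tag_set example_tag_set = {
	.nr_hw_queues	= 1,
	.queue_depth	= 4096,		/* HBA queue depth, assumed */
	.reserved_tags	= 8,		/* carved out of the shared tag space */
	.numa_node	= NUMA_NO_NODE,
};

A request from the reserved pool is then obtained by passing the
BLK_MQ_REQ_RESERVED flag to blk_mq_alloc_request().)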

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 00/10] mpt3sas: full mq support

2017-02-07 Thread Christoph Hellwig
On Tue, Feb 07, 2017 at 04:39:01PM +0100, Hannes Reinecke wrote:
> But we do; we're getting the index/tag/smid from the high-priority list,
> which is separated from the normal SCSI I/O tag space.
> (which reminds me; there's another cleanup patch to be had in
> _ctl_do_mpt_command(), but that's beside the point).

The calls to blk_mq_tagset_busy_iter added in patch 8 indicate the
contrary.


Re: [PATCH 00/10] mpt3sas: full mq support

2017-02-07 Thread Hannes Reinecke
On 02/07/2017 04:34 PM, Christoph Hellwig wrote:
> On Tue, Feb 07, 2017 at 03:38:51PM +0100, Hannes Reinecke wrote:
>> The SCSI passthrough commands pass in pre-formatted SGLs, so the driver
>> just has to map them.
>> If we were converting that we first have to re-format the
>> (driver-specific) SGLs into linux sg lists, only to have them converted
>> back into driver-specific ones once queuecommand is called.
>> You sure it's worth the effort?
>>
>> The driver already reserves some tags for precisely this use-case, so it
>> won't conflict with normal I/O operation.
>> So where's the problem with that?
> 
> If it was an entirely separate path that would be easy, but it's
> not - see all the poking into the tag maps that your patch 8
> includes.  If it was just a few tags on the side not interacting
> with the scsi or blk-mq it wouldn't be such a problem.
> 
But we do; we're getting the index/tag/smid from the high-priority list,
which is separated from the normal SCSI I/O tag space.
(which reminds me; there's another cleanup patch to be had in
_ctl_do_mpt_command(), but that's beside the point).

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 00/10] mpt3sas: full mq support

2017-02-07 Thread Christoph Hellwig
On Tue, Feb 07, 2017 at 03:38:51PM +0100, Hannes Reinecke wrote:
> The SCSI passthrough commands pass in pre-formatted SGLs, so the driver
> just has to map them.
> If we were converting that we first have to re-format the
> (driver-specific) SGLs into linux sg lists, only to have them converted
> back into driver-specific ones once queuecommand is called.
> You sure it's worth the effort?
> 
> The driver already reserves some tags for precisely this use-case, so it
> won't conflict with normal I/O operation.
> So where's the problem with that?

If it was an entirely separate path that would be easy, but it's
not - see all the poking into the tag maps that your patch 8
includes.  If it was just a few tags on the side not interacting
with the scsi or blk-mq it wouldn't be such a problem.


Re: [PATCH 00/10] mpt3sas: full mq support

2017-02-07 Thread Hannes Reinecke
On 02/07/2017 02:19 PM, Christoph Hellwig wrote:
> Patch 1-7 look fine to me with minor fixups, and I'd love to see
> them go into 4.11.  The last one looks really questionable,
> and 8 and 9 will need some work so that the MPT passthrough ioctls
> either go away or make use of struct request and the block layer
> and SCSI infrastructure.
> 
Hmm. Which is quite a bit of effort for very little gain.

The SCSI passthrough commands pass in pre-formatted SGLs, so the driver
just has to map them.
If we were converting that, we would first have to re-format the
(driver-specific) SGLs into Linux sg lists, only to have them converted
back into driver-specific ones once queuecommand is called.
You sure it's worth the effort?

The driver already reserves some tags for precisely this use-case, so it
won't conflict with normal I/O operation.
So where's the problem with that?

I know the SCSI passthrough operations are decidedly ugly, but if I were
to change them I'd rather move them over to bsg once we converted bsg to
operate without a request queue.
But for now ... not sure.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke   Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 00/10] mpt3sas: full mq support

2017-02-07 Thread Christoph Hellwig
Patch 1-7 look fine to me with minor fixups, and I'd love to see
them go into 4.11.  The last one looks really questionable,
and 8 and 9 will need some work so that the MPT passthrough ioctls
either go away or make use of struct request and the block layer
and SCSI infrastructure.


Re: [PATCH 00/10] mpt3sas: full mq support

2017-01-31 Thread Hannes Reinecke

On 01/31/2017 06:54 PM, Kashyap Desai wrote:

-Original Message-
From: Hannes Reinecke [mailto:h...@suse.de]
Sent: Tuesday, January 31, 2017 4:47 PM
To: Christoph Hellwig
Cc: Martin K. Petersen; James Bottomley; linux-scsi@vger.kernel.org; Sathya
Prakash; Kashyap Desai; mpt-fusionlinux@broadcom.com
Subject: Re: [PATCH 00/10] mpt3sas: full mq support

On 01/31/2017 11:02 AM, Christoph Hellwig wrote:

On Tue, Jan 31, 2017 at 10:25:50AM +0100, Hannes Reinecke wrote:

Hi all,

this is a patchset to enable full multiqueue support for the mpt3sas driver.

While the HBA only has a single mailbox register for submitting
commands, it does have individual receive queues per MSI-X interrupt
and as such does benefit from converting it to full multiqueue support.


Explanation and numbers on why this would be beneficial, please.
We should not need multiple submissions queues for a single register
to benefit from multiple completion queues.


Well, the actual throughput very strongly depends on the blk-mq-sched
patches from Jens.
As this is barely finished I didn't post any numbers yet.

However:
With multiqueue support:
4k seq read : io=60573MB, bw=1009.2MB/s, iops=258353, runt= 60021msec
With scsi-mq on 1 queue:
4k seq read : io=17369MB, bw=296291KB/s, iops=74072, runt= 60028msec
So yes, there _is_ a benefit.

(Which is actually quite cool, as these tests were done on a SAS3 HBA, so
we're getting close to the theoretical maximum of 1.2GB/s).
(Unlike the single-queue case :-)


Hannes -

Can you share details about your setup? How many drives do you have and how
are they connected (enclosure -> drives)?
To me it looks like the current mpt3sas driver might be taking a bigger hit on
spinlock operations (the penalty on a NUMA arch is higher compared to a
single-socket server), unlike the shared blk tag usage we have in the
megaraid_sas driver.

The tests were done with a single LSI SAS3008 connected to a NetApp 
E-series (2660), using 4 LUNs under MD-RAID0.


Megaraid_sas is even worse here; due to the odd nature of the 'fusion' 
implementation we're ending up having _two_ sets of tags, making it 
really hard to use scsi-mq here.
(Not that I didn't try; but lacking a proper backend it's really hard to 
evaluate the benefit of those ... spinning HDDs simply don't cut it here)



I mean " [PATCH 08/10] mpt3sas: lockless command submission for scsi-mq"
patch is improving performance removing spinlock overhead and attempting
to get request using blk_tags.
Are you seeing performance improvement  if you hard code nr_hw_queues = 1
in below code changes part of "[PATCH 10/10] mpt3sas: scsi-mq interrupt
steering"

No. The numbers posted above are generated with exactly that patch; the 
first line is running with nr_hw_queues=32 and the second line with 
nr_hw_queues=1.


Curiously, though, patch 8/10 also reduces the 'can_queue' value by
dividing it by the number of CPUs (required for blk tag space scaling).
If I _increase_ can_queue back to the original value after setting up the
tagspace, performance _drops_ again.

Most unexpected; I'll be doing more experimenting there.

Full results will be presented at VAULT, btw :-)

Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)