Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-23 Thread Ming Lei
On Wed, Aug 23, 2017 at 10:15:29AM -0600, Jens Axboe wrote:
> On 08/23/2017 10:12 AM, Bart Van Assche wrote:
> > On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> >> In the Red Hat internal storage tests of the blk-mq scheduler, we
> >> found that I/O performance is quite bad with mq-deadline, especially
> >> for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
> >> SRP...)
> > 
> > Hello Ming and Jens,
> > 
> > There may not be enough time left to reach agreement about the whole patch
> > series before the kernel v4.14 merge window opens. How about focusing on
> > patches 1..8 of this series for kernel v4.14 and revisiting the rest of this
> > patch series later?
> 
> I was going to go over the series today with 4.14 in mind. Looks to me like
> this should really be 2-3 patch series, that depend on each other. Might be
> better for review purposes as well. So I'd agree with Bart - can we get this
> split a bit and geared towards what we need for 4.14 at least, since it's
> getting close? Some of the changes do make me somewhat nervous; they
> need proper cooking time.

I agree to split the patchset and will do it tomorrow.
If you guys have any suggestions about the split (such as
which parts should aim at v4.14), please let me know.

-- 
Ming


Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-23 Thread Jens Axboe
On 08/23/2017 10:12 AM, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
>> In the Red Hat internal storage tests of the blk-mq scheduler, we
>> found that I/O performance is quite bad with mq-deadline, especially
>> for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
>> SRP...)
> 
> Hello Ming and Jens,
> 
> There may not be enough time left to reach agreement about the whole patch
> series before the kernel v4.14 merge window opens. How about focusing on
> patches 1..8 of this series for kernel v4.14 and revisiting the rest of this
> patch series later?

I was going to go over the series today with 4.14 in mind. Looks to me like
this should really be 2-3 patch series, that depend on each other. Might be
better for review purposes as well. So I'd agree with Bart - can we get this
split a bit and geared towards what we need for 4.14 at least, since it's
getting close? Some of the changes do make me somewhat nervous; they
need proper cooking time.

-- 
Jens Axboe



Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-23 Thread Bart Van Assche
On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> In the Red Hat internal storage tests of the blk-mq scheduler, we
> found that I/O performance is quite bad with mq-deadline, especially
> for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
> SRP...)

Hello Ming and Jens,

There may not be enough time left to reach agreement about the whole patch
series before the kernel v4.14 merge window opens. How about focusing on
patches 1..8 of this series for kernel v4.14 and revisiting the rest of this
patch series later?

Thanks,

Bart.

Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-11 Thread James Bottomley
On Fri, 2017-08-11 at 01:11 -0700, Christoph Hellwig wrote:
> [+ Martin and linux-scsi]
> 
> Given that we need this big pile and a few bfq fixes to avoid
> major regressions, I'm tempted to revert the scsi-mq default
> for 4.14, but bring it back a little later for 4.15.
> 
> What do you think?  Maybe for 4.15 we could also do it through the
> block tree where all the fixes will be queued.

Given the severe workload regressions Mel reported, I think that's
wise.

I also think we wouldn't have found all these problems if it hadn't
been the default, so the original patch was the best way of trying to
find out if we were ready for the switch and forcing all the issues
out.

Thanks,

James



Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-11 Thread Christoph Hellwig
[+ Martin and linux-scsi]

Given that we need this big pile and a few bfq fixes to avoid
major regressions, I'm tempted to revert the scsi-mq default
for 4.14, but bring it back a little later for 4.15.

What do you think?  Maybe for 4.15 we could also do it through the
block tree where all the fixes will be queued.


Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-08 Thread Laurence Oberman



On 08/08/2017 09:41 AM, Ming Lei wrote:

Hi Laurence and Guys,

On Mon, Aug 07, 2017 at 06:06:11PM -0400, Laurence Oberman wrote:

On Mon, Aug 7, 2017 at 8:48 AM, Laurence Oberman 
wrote:
Hello

I need to retract my Tested-by:

While it's valid that the patches do not introduce performance regressions,
they seem to cause a hard lockup when the [mq-deadline] scheduler is
enabled, so I am not confident in a passing result here.

This is specific to large buffered I/O writes (4MB). At least that is my
current test.

I did not wait long enough for the issue to show when I first sent the pass
(Tested-by) message because I know my test platform so well I thought I had
given it enough time to validate the patches for performance regressions.

I don't know if the failing clone in blk_get_request() is a direct
catalyst for the hard lockup, but what I do know is that with the stock
upstream 4.13-RC3 I only see them when set to [none], and stock upstream
never seems to hit the hard lockup.

With [mq-deadline] enabled on stock I don't see them at all and it behaves.

Now with Ming's patches, if we enable [mq-deadline] we DO see the clone
failures and the hard lockup, so we have the opposite behaviour with
respect to the scheduler choice.

On Ming's kernel with [none] we are well behaved and that was my original
focus, testing on [none] and hence my Tested-by: pass.

So more investigation is needed here.


Laurence, as we talked on IRC, the hard lockup issue you saw isn't
related to this patchset, because the issue can be reproduced on
both v4.13-rc3 and RHEL7. The only trick is to run your hammer
write script in 16 concurrent jobs; then it takes only several
minutes to trigger, no matter whether the mq none or the mq-deadline
scheduler is used.

Given it is easy to reproduce, I believe it shouldn't be very
difficult to investigate and root cause.

I will report the issue on another thread, and attach the
script for reproduction.

So let's focus on this patchset ([PATCH V2 00/20] blk-mq-sched: improve
SCSI-MQ performance) in this thread.

Thanks again for your test!

Thanks,
Ming



Hello Ming,

Yes, I agree; this means my original Tested-by: for your patch set is
still valid for the large-size I/O tests.

Thank you for all this hard work and for improving blk-mq.

Regards
Laurence


Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-08 Thread Ming Lei
Hi Laurence and Guys, 

On Mon, Aug 07, 2017 at 06:06:11PM -0400, Laurence Oberman wrote:
> On Mon, Aug 7, 2017 at 8:48 AM, Laurence Oberman 
> wrote:
> Hello
> 
> I need to retract my Tested-by:
> 
> While it's valid that the patches do not introduce performance regressions,
> they seem to cause a hard lockup when the [mq-deadline] scheduler is
> enabled, so I am not confident in a passing result here.
> 
> This is specific to large buffered I/O writes (4MB). At least that is my
> current test.
> 
> I did not wait long enough for the issue to show when I first sent the pass
> (Tested-by) message because I know my test platform so well I thought I had
> given it enough time to validate the patches for performance regressions.
> 
> I don't know if the failing clone in blk_get_request() is a direct
> catalyst for the hard lockup, but what I do know is that with the stock
> upstream 4.13-RC3 I only see them when set to [none], and stock upstream
> never seems to hit the hard lockup.
> 
> With [mq-deadline] enabled on stock I don't see them at all and it behaves.
> 
> Now with Ming's patches, if we enable [mq-deadline] we DO see the clone
> failures and the hard lockup, so we have the opposite behaviour with
> respect to the scheduler choice.
> 
> On Ming's kernel with [none] we are well behaved and that was my original
> focus, testing on [none] and hence my Tested-by: pass.
> 
> So more investigation is needed here.

Laurence, as we talked on IRC, the hard lockup issue you saw isn't
related to this patchset, because the issue can be reproduced on
both v4.13-rc3 and RHEL7. The only trick is to run your hammer
write script in 16 concurrent jobs; then it takes only several
minutes to trigger, no matter whether the mq none or the mq-deadline
scheduler is used.

Given it is easy to reproduce, I believe it shouldn't be very
difficult to investigate and root cause.

I will report the issue on another thread, and attach the
script for reproduction.

So let's focus on this patchset ([PATCH V2 00/20] blk-mq-sched: improve
SCSI-MQ performance) in this thread.

Thanks again for your test!

Thanks,
Ming


Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-08 Thread Paolo Valente

> On 8 Aug 2017, at 11:09, Ming Lei  wrote:
> 
> On Tue, Aug 08, 2017 at 10:09:57AM +0200, Paolo Valente wrote:
>> 
>>> On 5 Aug 2017, at 08:56, Ming Lei  wrote:
>>> 
>>> In the Red Hat internal storage tests of the blk-mq scheduler, we
>>> found that I/O performance is quite bad with mq-deadline, especially
>>> for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
>>> SRP...)
>>> 
>>> It turns out that one big issue causes the performance regression:
>>> requests are still dequeued from the sw queue/scheduler queue even
>>> when the LLD's queue is busy, so I/O merging becomes quite difficult
>>> and sequential I/O degrades a lot.
>>> 
>>> The first five patches improve this situation and bring back some of
>>> the lost performance.
>>> 
>>> But it looks like they are still not enough. That is caused by the
>>> shared queue depth among all hw queues. For SCSI devices,
>>> .cmd_per_lun defines the max number of pending I/O on one request
>>> queue, i.e. a per-request_queue depth. So during dispatch, if one
>>> hctx is too busy to make progress, no hctx can dispatch because of
>>> the per-request_queue depth.
>>> 
>>> Patches 6-14 use a per-request_queue dispatch list to avoid dequeuing
>>> requests from the sw/scheduler queues while the LLD queue is busy.
>>> 
>>> Patches 15-20 improve bio merging via a hash table in the sw queue,
>>> which makes bio merging more efficient than the current approach, in
>>> which only the last 8 requests are checked. Since patches 6-14 switch
>>> SCSI devices to the scheduler style of dequeuing one request at a
>>> time from the sw queue, ctx->lock is acquired more often; merging
>>> bios via the hash table shortens the ctx->lock hold time and should
>>> eliminate the effect of patch 14.
>>> 
>>> With these changes, SCSI-MQ sequential I/O performance improves a
>>> lot. For lpfc it is basically brought back to the level of the legacy
>>> block path [1]; in particular, mq-deadline improves by more than 10x
>>> [1] on lpfc and by more than 3x on SCSI SRP. For mq-none, lpfc
>>> improves by 10%, and writes improve by more than 10% on SRP too.
>>> 
>>> Also, Bart was worried that this patchset might affect SRP, so test
>>> data on SCSI SRP is provided this time:
>>> 
>>> - fio(libaio, bs:4k, dio, queue_depth:64, 64 jobs)
>>> - system(16 cores, dual sockets, mem: 96G)
>>> 
>>>                |v4.13-rc3     |v4.13-rc3     |v4.13-rc3+patches
>>>                |blk-legacy dd |blk-mq none   |blk-mq none
>>> ---------------|--------------|--------------|------------------
>>> read :iops     | 587K         | 526K         | 537K
>>> randread :iops | 115K         | 140K         | 139K
>>> write:iops     | 596K         | 519K         | 602K
>>> randwrite:iops | 103K         | 122K         | 120K
>>> 
>>> 
>>>                |v4.13-rc3     |v4.13-rc3     |v4.13-rc3+patches
>>>                |blk-legacy dd |blk-mq dd     |blk-mq dd
>>> ---------------|--------------|--------------|------------------
>>> read :iops     | 587K         | 155K         | 522K
>>> randread :iops | 115K         | 140K         | 141K
>>> write:iops     | 596K         | 135K         | 587K
>>> randwrite:iops | 103K         | 120K         | 118K
>>> 
>>> V2:
>>> - dequeue requests from sw queues in a round-robin style,
>>> as suggested by Bart, and introduce one helper in sbitmap
>>> for this purpose
>>> - improve bio merging via a hash table in the sw queue
>>> - add comments about using the DISPATCH_BUSY state in a lockless
>>> way, simplifying the handling of the busy state
>>> - hold ctx->lock when clearing the ctx busy bit, as suggested
>>> by Bart
>>> 
>>> 
>> 
>> Hi,
>> I've performance-tested Ming's patchset with the dbench4 test in
>> MMTests, and with the mq-deadline and bfq schedulers.  Max latencies
>> have decreased dramatically: by up to 32 times.  Very good results for
>> average latencies as well.
>> 
>> For brevity, here are only results for deadline.  You can find full
>> results with bfq in the thread that triggered my testing of Ming's
>> patches [1].
>> 
>> MQ-DEADLINE WITHOUT MING'S PATCHES
>> 
>>  Operation                 Count     AvgLat     MaxLat
>>  -------------------------------------------------------
>>  Flush                     13760     90.542     13221.495
>>  Close                     137654    0.008      27.133
>>  LockX                     640       0.009      0.115
>>  Rename                    8064      1.062      246.759
>>  ReadX                     297956    0.051      347.018
>>  WriteX                    94698     425.636    15090.020
>>  Unlink                    35077     0.580      208.462
>>  UnlockX                   640       0.007      0.291
>>  FIND_FIRST                66630     0.566      530.339
>>  SET_FILE_INFORMATION      16000     1.419      811.494
>>  QUERY_FILE_INFORMATION    30717     0.004      1.108
>>  QUERY_PATH_INFORMATION    176153    0.182      517.419
>>  QUERY_FS_INFORMATION      30857     0.018      18.562
>>  NTCreateX                 184145    0.281      582.076

Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-08 Thread Ming Lei
On Tue, Aug 08, 2017 at 10:09:57AM +0200, Paolo Valente wrote:
> 
> > On 5 Aug 2017, at 08:56, Ming Lei  wrote:
> > 
> > In the Red Hat internal storage tests of the blk-mq scheduler, we
> > found that I/O performance is quite bad with mq-deadline, especially
> > for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
> > SRP...)
> > 
> > It turns out that one big issue causes the performance regression:
> > requests are still dequeued from the sw queue/scheduler queue even
> > when the LLD's queue is busy, so I/O merging becomes quite difficult
> > and sequential I/O degrades a lot.
> > 
> > The first five patches improve this situation and bring back some of
> > the lost performance.
> > 
> > But it looks like they are still not enough. That is caused by the
> > shared queue depth among all hw queues. For SCSI devices,
> > .cmd_per_lun defines the max number of pending I/O on one request
> > queue, i.e. a per-request_queue depth. So during dispatch, if one
> > hctx is too busy to make progress, no hctx can dispatch because of
> > the per-request_queue depth.
> > 
> > Patches 6-14 use a per-request_queue dispatch list to avoid dequeuing
> > requests from the sw/scheduler queues while the LLD queue is busy.
> > 
> > Patches 15-20 improve bio merging via a hash table in the sw queue,
> > which makes bio merging more efficient than the current approach, in
> > which only the last 8 requests are checked. Since patches 6-14 switch
> > SCSI devices to the scheduler style of dequeuing one request at a
> > time from the sw queue, ctx->lock is acquired more often; merging
> > bios via the hash table shortens the ctx->lock hold time and should
> > eliminate the effect of patch 14.
> > 
> > With these changes, SCSI-MQ sequential I/O performance improves a
> > lot. For lpfc it is basically brought back to the level of the legacy
> > block path [1]; in particular, mq-deadline improves by more than 10x
> > [1] on lpfc and by more than 3x on SCSI SRP. For mq-none, lpfc
> > improves by 10%, and writes improve by more than 10% on SRP too.
> > 
> > Also, Bart was worried that this patchset might affect SRP, so test
> > data on SCSI SRP is provided this time:
> > 
> > - fio(libaio, bs:4k, dio, queue_depth:64, 64 jobs)
> > - system(16 cores, dual sockets, mem: 96G)
> > 
> >                |v4.13-rc3     |v4.13-rc3     |v4.13-rc3+patches
> >                |blk-legacy dd |blk-mq none   |blk-mq none
> > ---------------|--------------|--------------|------------------
> > read :iops     | 587K         | 526K         | 537K
> > randread :iops | 115K         | 140K         | 139K
> > write:iops     | 596K         | 519K         | 602K
> > randwrite:iops | 103K         | 122K         | 120K
> > 
> > 
> >                |v4.13-rc3     |v4.13-rc3     |v4.13-rc3+patches
> >                |blk-legacy dd |blk-mq dd     |blk-mq dd
> > ---------------|--------------|--------------|------------------
> > read :iops     | 587K         | 155K         | 522K
> > randread :iops | 115K         | 140K         | 141K
> > write:iops     | 596K         | 135K         | 587K
> > randwrite:iops | 103K         | 120K         | 118K
> > 
> > V2:
> > - dequeue requests from sw queues in a round-robin style,
> > as suggested by Bart, and introduce one helper in sbitmap
> > for this purpose
> > - improve bio merging via a hash table in the sw queue
> > - add comments about using the DISPATCH_BUSY state in a lockless
> > way, simplifying the handling of the busy state
> > - hold ctx->lock when clearing the ctx busy bit, as suggested
> > by Bart
> > 
> > 
> 
> Hi,
> I've performance-tested Ming's patchset with the dbench4 test in
> MMTests, and with the mq-deadline and bfq schedulers.  Max latencies
> have decreased dramatically: by up to 32 times.  Very good results for
> average latencies as well.
> 
> For brevity, here are only results for deadline.  You can find full
> results with bfq in the thread that triggered my testing of Ming's
> patches [1].
> 
> MQ-DEADLINE WITHOUT MING'S PATCHES
> 
>  Operation                 Count     AvgLat     MaxLat
>  -------------------------------------------------------
>  Flush                     13760     90.542     13221.495
>  Close                     137654    0.008      27.133
>  LockX                     640       0.009      0.115
>  Rename                    8064      1.062      246.759
>  ReadX                     297956    0.051      347.018
>  WriteX                    94698     425.636    15090.020
>  Unlink                    35077     0.580      208.462
>  UnlockX                   640       0.007      0.291
>  FIND_FIRST                66630     0.566      530.339
>  SET_FILE_INFORMATION      16000     1.419      811.494
>  QUERY_FILE_INFORMATION    30717     0.004      1.108
>  QUERY_PATH_INFORMATION    176153    0.182      517.419
>  QUERY_FS_INFORMATION      30857     0.018      18.562
>  NTCreateX                 184145    0.281      582.076
> 
> Throughput 8.93961 MB/sec  64 clients  64 procs  max_late

Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-08 Thread Paolo Valente

> On 5 Aug 2017, at 08:56, Ming Lei  wrote:
> 
> In the Red Hat internal storage tests of the blk-mq scheduler, we
> found that I/O performance is quite bad with mq-deadline, especially
> for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
> SRP...)
> 
> It turns out that one big issue causes the performance regression:
> requests are still dequeued from the sw queue/scheduler queue even
> when the LLD's queue is busy, so I/O merging becomes quite difficult
> and sequential I/O degrades a lot.
> 
> The first five patches improve this situation and bring back some of
> the lost performance.
> 
> But it looks like they are still not enough. That is caused by the
> shared queue depth among all hw queues. For SCSI devices,
> .cmd_per_lun defines the max number of pending I/O on one request
> queue, i.e. a per-request_queue depth. So during dispatch, if one
> hctx is too busy to make progress, no hctx can dispatch because of
> the per-request_queue depth.
> 
> Patches 6-14 use a per-request_queue dispatch list to avoid dequeuing
> requests from the sw/scheduler queues while the LLD queue is busy.
> 
> Patches 15-20 improve bio merging via a hash table in the sw queue,
> which makes bio merging more efficient than the current approach, in
> which only the last 8 requests are checked. Since patches 6-14 switch
> SCSI devices to the scheduler style of dequeuing one request at a
> time from the sw queue, ctx->lock is acquired more often; merging
> bios via the hash table shortens the ctx->lock hold time and should
> eliminate the effect of patch 14.
> 
> With these changes, SCSI-MQ sequential I/O performance improves a
> lot. For lpfc it is basically brought back to the level of the legacy
> block path [1]; in particular, mq-deadline improves by more than 10x
> [1] on lpfc and by more than 3x on SCSI SRP. For mq-none, lpfc
> improves by 10%, and writes improve by more than 10% on SRP too.
> 
> Also, Bart was worried that this patchset might affect SRP, so test
> data on SCSI SRP is provided this time:
> 
> - fio(libaio, bs:4k, dio, queue_depth:64, 64 jobs)
> - system(16 cores, dual sockets, mem: 96G)
> 
>                |v4.13-rc3     |v4.13-rc3     |v4.13-rc3+patches
>                |blk-legacy dd |blk-mq none   |blk-mq none
> ---------------|--------------|--------------|------------------
> read :iops     | 587K         | 526K         | 537K
> randread :iops | 115K         | 140K         | 139K
> write:iops     | 596K         | 519K         | 602K
> randwrite:iops | 103K         | 122K         | 120K
> 
> 
>                |v4.13-rc3     |v4.13-rc3     |v4.13-rc3+patches
>                |blk-legacy dd |blk-mq dd     |blk-mq dd
> ---------------|--------------|--------------|------------------
> read :iops     | 587K         | 155K         | 522K
> randread :iops | 115K         | 140K         | 141K
> write:iops     | 596K         | 135K         | 587K
> randwrite:iops | 103K         | 120K         | 118K
> 
> V2:
>   - dequeue requests from sw queues in a round-robin style,
>   as suggested by Bart, and introduce one helper in sbitmap
>   for this purpose
>   - improve bio merging via a hash table in the sw queue
>   - add comments about using the DISPATCH_BUSY state in a lockless
>   way, simplifying the handling of the busy state
>   - hold ctx->lock when clearing the ctx busy bit, as suggested
>   by Bart
> 
> 

Hi,
I've performance-tested Ming's patchset with the dbench4 test in
MMTests, and with the mq-deadline and bfq schedulers.  Max latencies
have decreased dramatically: by up to 32 times.  Very good results for
average latencies as well.

For brevity, here are only results for deadline.  You can find full
results with bfq in the thread that triggered my testing of Ming's
patches [1].

MQ-DEADLINE WITHOUT MING'S PATCHES

 Operation                 Count     AvgLat     MaxLat
 -------------------------------------------------------
 Flush                     13760     90.542     13221.495
 Close                     137654    0.008      27.133
 LockX                     640       0.009      0.115
 Rename                    8064      1.062      246.759
 ReadX                     297956    0.051      347.018
 WriteX                    94698     425.636    15090.020
 Unlink                    35077     0.580      208.462
 UnlockX                   640       0.007      0.291
 FIND_FIRST                66630     0.566      530.339
 SET_FILE_INFORMATION      16000     1.419      811.494
 QUERY_FILE_INFORMATION    30717     0.004      1.108
 QUERY_PATH_INFORMATION    176153    0.182      517.419
 QUERY_FS_INFORMATION      30857     0.018      18.562
 NTCreateX                 184145    0.281      582.076

Throughput 8.93961 MB/sec  64 clients  64 procs  max_latency=15090.026 ms

MQ-DEADLINE WITH MING'S PATCHES

 Operation                 Count     AvgLat     MaxLat
 -------------------------------------------------------
 Flush                     13760     48.650     431.525
 Close                     144320    0.004      7.605
 LockX

Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-07 Thread Ming Lei
On Mon, Aug 07, 2017 at 06:06:11PM -0400, Laurence Oberman wrote:
> On Mon, Aug 7, 2017 at 8:48 AM, Laurence Oberman 
> wrote:
> 
> >
> >
> > On 08/05/2017 02:56 AM, Ming Lei wrote:
> >
> >> In the Red Hat internal storage tests of the blk-mq scheduler, we
> >> found that I/O performance is quite bad with mq-deadline, especially
> >> for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
> >> SRP...)
> >>
> >> It turns out that one big issue causes the performance regression:
> >> requests are still dequeued from the sw queue/scheduler queue even
> >> when the LLD's queue is busy, so I/O merging becomes quite difficult
> >> and sequential I/O degrades a lot.
> >>
> >> The first five patches improve this situation and bring back some of
> >> the lost performance.
> >>
> >> But it looks like they are still not enough. That is caused by the
> >> shared queue depth among all hw queues. For SCSI devices,
> >> .cmd_per_lun defines the max number of pending I/O on one request
> >> queue, i.e. a per-request_queue depth. So during dispatch, if one
> >> hctx is too busy to make progress, no hctx can dispatch because of
> >> the per-request_queue depth.
> >>
> >> Patches 6-14 use a per-request_queue dispatch list to avoid dequeuing
> >> requests from the sw/scheduler queues while the LLD queue is busy.
> >>
> >> Patches 15-20 improve bio merging via a hash table in the sw queue,
> >> which makes bio merging more efficient than the current approach, in
> >> which only the last 8 requests are checked. Since patches 6-14 switch
> >> SCSI devices to the scheduler style of dequeuing one request at a
> >> time from the sw queue, ctx->lock is acquired more often; merging
> >> bios via the hash table shortens the ctx->lock hold time and should
> >> eliminate the effect of patch 14.
> >>
> >> With these changes, SCSI-MQ sequential I/O performance improves a
> >> lot. For lpfc it is basically brought back to the level of the legacy
> >> block path [1]; in particular, mq-deadline improves by more than 10x
> >> [1] on lpfc and by more than 3x on SCSI SRP. For mq-none, lpfc
> >> improves by 10%, and writes improve by more than 10% on SRP too.
> >>
> >> Also, Bart was worried that this patchset might affect SRP, so test
> >> data on SCSI SRP is provided this time:
> >>
> >> - fio(libaio, bs:4k, dio, queue_depth:64, 64 jobs)
> >> - system(16 cores, dual sockets, mem: 96G)
> >>
> >>                |v4.13-rc3     |v4.13-rc3     |v4.13-rc3+patches
> >>                |blk-legacy dd |blk-mq none   |blk-mq none
> >> ---------------|--------------|--------------|------------------
> >> read :iops     | 587K         | 526K         | 537K
> >> randread :iops | 115K         | 140K         | 139K
> >> write:iops     | 596K         | 519K         | 602K
> >> randwrite:iops | 103K         | 122K         | 120K
> >>
> >>
> >>                |v4.13-rc3     |v4.13-rc3     |v4.13-rc3+patches
> >>                |blk-legacy dd |blk-mq dd     |blk-mq dd
> >> ---------------|--------------|--------------|------------------
> >> read :iops     | 587K         | 155K         | 522K
> >> randread :iops | 115K         | 140K         | 141K
> >> write:iops     | 596K         | 135K         | 587K
> >> randwrite:iops | 103K         | 120K         | 118K
> >>
> >> V2:
> >> - dequeue requests from sw queues in a round-robin style,
> >> as suggested by Bart, and introduce one helper in sbitmap
> >> for this purpose
> >> - improve bio merging via a hash table in the sw queue
> >> - add comments about using the DISPATCH_BUSY state in a lockless
> >> way, simplifying the handling of the busy state
> >> - hold ctx->lock when clearing the ctx busy bit, as suggested
> >> by Bart
> >>
> >>
> >> [1] http://marc.info/?l=linux-block&m=150151989915776&w=2
> >>
> >> Ming Lei (20):
> >>blk-mq-sched: fix scheduler bad performance
> >>sbitmap: introduce __sbitmap_for_each_set()
> >>blk-mq: introduce blk_mq_dispatch_rq_from_ctx()
> >>blk-mq-sched: move actual dispatching into one helper
> >>blk-mq-sched: improve dispatching from sw queue
> >>blk-mq-sched: don't dequeue request until all in ->dispatch are
> >>  flushed
> >>blk-mq-sched: introduce blk_mq_sched_queue_depth()
> >>blk-mq-sched: use q->queue_depth as hint for q->nr_requests
> >>blk-mq: introduce BLK_MQ_F_SHARED_DEPTH
> >>blk-mq-sched: introduce helpers for query, change busy state
> >>blk-mq: introduce helpers for operating ->dispatch list
> >>blk-mq: introduce pointers to dispatch lock & list
> >>blk-mq: pass 'request_queue *' to several helpers of operating BUSY
> >>blk-mq-sched: improve IO scheduling on SCSI devcie
> >>block: introduce rqhash helpers
> >>block: move actual bio merge code into __elv_merge
> >>block: add check on elevator for supporting bio merge via hashtable
> >>  from blk-mq sw queue
> >>block: introduce .last_merge and .hash to blk_mq_ctx
> >>  

Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-07 Thread Ming Lei
On Mon, Aug 07, 2017 at 01:29:41PM -0400, Laurence Oberman wrote:
> 
> 
> On 08/07/2017 11:27 AM, Bart Van Assche wrote:
> > On Mon, 2017-08-07 at 08:48 -0400, Laurence Oberman wrote:
> > > I tested this series using Ming's tests as well as my own set of tests
> > > typically run against changes to upstream code in my SRP test-bed.
> > > My tests also include very large sequential buffered and un-buffered I/O.
> > > 
> > > This series seems to be fine for me. I did uncover another issue that is
> > > unrelated to these patches and also exists in 4.13-RC3 generic that I am
> > > still debugging.
> > 
> > Hello Laurence,
> > 
> > What kind of tests did you run? Only functional tests or also performance
> > tests?
> > 
> > Has the issue you encountered with kernel 4.13-rc3 already been reported on
> > the linux-rdma mailing list?
> > 
> > Thanks,
> > 
> > Bart.
> > 
> 
> Hi Bart
> 
> Actually I was focusing on just performance to see if we had any regressions
> with Ming's new patches for the large sequential I/O cases.
> 
> Ming had already tested the small I/O performance cases so I was making sure
> the large I/O sequential tests did not suffer.
> 
> The 4MB un-buffered direct read tests to DM devices seem to perform much
> the same in my test bed.
> The 4MB buffered and un-buffered writes also seem to be well behaved,
> with not much improvement.

As I described, this patchset improves I/O scheduling, and especially
I/O merging for sequential I/O. BLK_DEF_MAX_SECTORS is defined as
2560 (i.e. 1280KB), so it is expected that this patchset can't help the
4MB I/O case, because there is no merging left to do for 4MB I/O.

But the result is still positive, since there is no regression with
the patchset.
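To make that arithmetic explicit, here is a tiny standalone sketch
(illustration only, not code from the series): a 4MB I/O is already
submitted as requests capped at BLK_DEF_MAX_SECTORS, so each request is
as large as the block layer allows and the scheduler has nothing left
to merge.

#include <stdio.h>

int main(void)
{
	unsigned long long io_bytes  = 4ULL << 20;    /* one 4MB write */
	unsigned long long max_bytes = 2560ULL * 512; /* BLK_DEF_MAX_SECTORS * 512B */

	printf("max request size: %llu KB\n", max_bytes >> 10);      /* 1280 */
	printf("requests issued : %llu\n",
	       (io_bytes + max_bytes - 1) / max_bytes);              /* 4 */
	return 0;
}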

> 
> These were not exhaustive tests and did not include my usual port disconnect
> and recovery tests either.
> I was just making sure we did not regress with Ming's changes.
> 
> I was only just starting to baseline test the mq-deadline scheduler as prior
> to 4.13-RC3 I had not been testing any of the new MQ schedulers.
> I had always only tested with [none]
> 
> The tests were with [none] and [mq-deadline]
> 
> The new issue I started seeing has not been reported yet, as I was still
> investigating it.
> 
> In summary:
> With buffered writes we see the clone fail in blk_get_request() in both
> Ming's kernel and in the upstream 4.13-RC3 stock kernel.
> 
> [  885.271451] io scheduler mq-deadline registered
> [  898.455442] device-mapper: multipath: blk_get_request() returned -11 -
> requeuing

-11 is -EAGAIN, and it isn't an error.

GFP_ATOMIC is passed to blk_get_request() in multipath_clone_and_map(),
so it isn't strange to see this failure, especially when there is a lot
of concurrent I/O. So 

> 
> This is due to
> 
> multipath_clone_and_map()
> 
> /*
>  * Map cloned requests (request-based multipath)
>  */
> static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
>union map_info *map_context,
>struct request **__clone)
> {
> ..
> ..
> ..
> clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
> if (IS_ERR(clone)) {
> /* EBUSY, ENODEV or EWOULDBLOCK: requeue */
> bool queue_dying = blk_queue_dying(q);
> DMERR_LIMIT("blk_get_request() returned %ld%s - requeuing",
> PTR_ERR(clone), queue_dying ? " (path offline)"
> : "");
> if (queue_dying) {
> atomic_inc(&m->pg_init_in_progress);
> activate_or_offline_path(pgpath);
> return DM_MAPIO_REQUEUE;
> }
> return DM_MAPIO_DELAY_REQUEUE;
> }
> 
> Still investigating but it leads to a hard lockup
> 
> 
> So I still need to see if the hard-lockup happens in the stock kernel with
> mq-deadline and some other work before coming up with a full summary of the
> issue.
> 
> I also intend to re-run all tests including disconnect and reconnect tests
> on both mq-deadline and none.
> 
> Trace below
> 
> 
> [ 1553.167357] NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
> [ 1553.167359] Modules linked in: mq_deadline binfmt_misc dm_round_robin
> xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter
> ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set
> nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat
> nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 rpcrdma ip6table_mangle
> ip6table_security ip6table_raw iptable_nat ib_isert nf_conntrack_ipv4
> iscsi_target_mod nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ib_iser
> libiscsi scsi_transport_iscsi iptable_mangle iptable_security iptable_raw
> ebtable_filter ebtables target_core_mod ip6table_filter ip6_tables
> iptable_filter ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs
> ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core intel_powerclamp coretemp
> kvm_intel kvm irqbypass crct10dif_

Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-07 Thread Bart Van Assche
On Mon, 2017-08-07 at 18:06 -0400, Laurence Oberman wrote:
> With [mq-deadline] enabled on stock I don't see them at all and it behaves.
> 
> Now with Ming's patches, if we enable [mq-deadline] we DO see the clone
> failures and the hard lockup, so we have the opposite behaviour with
> respect to the scheduler choice.
> 
> On Ming's kernel with [none] we are well behaved and that was my original
> focus, testing on [none] and hence my Tested-by: pass.
> 
> So more investigation is needed here.

Hello Laurence,

Was debugfs enabled on your test setup? Had you perhaps collected the
contents of the block layer debugfs files after the lockup occurred, e.g. as
follows: (cd /sys/kernel/debug/block && find -type f | xargs grep -aH '')?

Thanks,

Bart.

Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-07 Thread Laurence Oberman



On 08/07/2017 02:46 PM, Laurence Oberman wrote:



On 08/07/2017 01:29 PM, Laurence Oberman wrote:



On 08/07/2017 11:27 AM, Bart Van Assche wrote:

On Mon, 2017-08-07 at 08:48 -0400, Laurence Oberman wrote:

I tested this series using Ming's tests as well as my own set of tests
typically run against changes to upstream code in my SRP test-bed.
My tests also include very large sequential buffered and un-buffered I/O.

This series seems to be fine for me. I did uncover another issue that is
unrelated to these patches and also exists in 4.13-RC3 generic that I am
still debugging.


Hello Laurence,

What kind of tests did you run? Only functional tests or also performance
tests?

Has the issue you encountered with kernel 4.13-rc3 already been reported on
the linux-rdma mailing list?

Thanks,

Bart.



Hi Bart

Actually I was focusing on just performance to see if we had any
regressions with Ming's new patches for the large sequential I/O cases.


Ming had already tested the small I/O performance cases so I was 
making sure the large I/O sequential tests did not suffer.


The 4MB un-buffered direct read tests to DM devices seem to perform
much the same in my test bed.
The 4MB buffered and un-buffered writes also seem to be well behaved,
with not much improvement.


These were not exhaustive tests and did not include my usual port 
disconnect and recovery tests either.

I was just making sure we did not regress with Ming's changes.

I was only just starting to baseline test the mq-deadline scheduler as 
prior to 4.13-RC3 I had not been testing any of the new MQ schedulers.

I had always only tested with [none]

The tests were with [none] and [mq-deadline]

The new issue I started seeing has not been reported yet, as I was still
investigating it.


In summary:
With buffered writes we see the clone fail in blk_get_request() in both
Ming's kernel and in the upstream 4.13-RC3 stock kernel.


[  885.271451] io scheduler mq-deadline registered
[  898.455442] device-mapper: multipath: blk_get_request() returned 
-11 - requeuing


This is due to

multipath_clone_and_map()

/*
 * Map cloned requests (request-based multipath)
 */
static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
				   union map_info *map_context,
				   struct request **__clone)
{
..
..
..
	clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
	if (IS_ERR(clone)) {
		/* EBUSY, ENODEV or EWOULDBLOCK: requeue */
		bool queue_dying = blk_queue_dying(q);
		DMERR_LIMIT("blk_get_request() returned %ld%s - requeuing",
			    PTR_ERR(clone), queue_dying ? " (path offline)" : "");
		if (queue_dying) {
			atomic_inc(&m->pg_init_in_progress);
			activate_or_offline_path(pgpath);
			return DM_MAPIO_REQUEUE;
		}
		return DM_MAPIO_DELAY_REQUEUE;
	}

Still investigating but it leads to a hard lockup


So I still need to see if the hard-lockup happens in the stock kernel 
with mq-deadline and some other work before coming up with a full 
summary of the issue.


I also intend to re-run all tests including disconnect and reconnect 
tests on both mq-deadline and none.


Trace below


[ 1553.167357] NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
[ 1553.167359] Modules linked in: mq_deadline binfmt_misc 
dm_round_robin xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun 
ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 
xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp 
llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 rpcrdma 
ip6table_mangle ip6table_security ip6table_raw iptable_nat ib_isert 
nf_conntrack_ipv4 iscsi_target_mod nf_defrag_ipv4 nf_nat_ipv4 nf_nat 
nf_conntrack ib_iser libiscsi scsi_transport_iscsi iptable_mangle 
iptable_security iptable_raw ebtable_filter ebtables target_core_mod 
ip6table_filter ip6_tables iptable_filter ib_srp scsi_transport_srp 
ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib 
ib_core intel_powerclamp coretemp kvm_intel kvm irqbypass 
crct10dif_pclmul
[ 1553.167385]  crc32_pclmul ghash_clmulni_intel pcbc aesni_intel sg 
joydev ipmi_si hpilo crypto_simd hpwdt iTCO_wdt cryptd ipmi_devintf 
glue_helper gpio_ich iTCO_vendor_support shpchp ipmi_msghandler pcspkr 
acpi_power_meter i7core_edac lpc_ich pcc_cpufreq nfsd auth_rpcgss 
nfs_acl lockd grace sunrpc dm_multipath ip_tables xfs libcrc32c sd_mod 
amdkfd amd_iommu_v2 radeon i2c_algo_bit drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops ttm mlx5_core drm mlxfw i2c_core ptp 
serio_raw hpsa crc32c_intel bnx2 pps_core devlink scsi_transport_sas 
dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ib_srpt]
[ 1553.167410] CPU: 4 PID: 11532 Comm: dd Tainted: G  I 
4.13.0-rc3lo

Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-07 Thread Laurence Oberman



On 08/07/2017 01:29 PM, Laurence Oberman wrote:



On 08/07/2017 11:27 AM, Bart Van Assche wrote:

On Mon, 2017-08-07 at 08:48 -0400, Laurence Oberman wrote:

I tested this series using Ming's tests as well as my own set of tests
typically run against changes to upstream code in my SRP test-bed.
My tests also include very large sequential buffered and un-buffered I/O.


This series seems to be fine for me. I did uncover another issue that is
unrelated to these patches and also exists in 4.13-RC3 generic that I am
still debugging.


Hello Laurence,

What kind of tests did you run? Only functional tests or also performance
tests?

Has the issue you encountered with kernel 4.13-rc3 already been reported on

Thanks,

Bart.



Hi Bart

Actually I was focusing on just performance to see if we had any
regressions with Ming's new patches for the large sequential I/O cases.


Ming had already tested the small I/O performance cases so I was making 
sure the large I/O sequential tests did not suffer.


The 4MB un-buffered direct read tests to DM devices seem to perform
much the same in my test bed.
The 4MB buffered and un-buffered writes also seem to be well behaved,
with not much improvement.


These were not exhaustive tests and did not include my usual port 
disconnect and recovery tests either.

I was just making sure we did not regress with Ming's changes.

I was only just starting to baseline test the mq-deadline scheduler as 
prior to 4.13-RC3 I had not been testing any of the new MQ schedulers.

I had always only tested with [none]

The tests were with [none] and [mq-deadline]

The new issue I started seeing has not been reported yet, as I was still
investigating it.


In summary:
With buffered writes we see the clone fail in blk_get_request() in both
Ming's kernel and in the upstream 4.13-RC3 stock kernel.


[  885.271451] io scheduler mq-deadline registered
[  898.455442] device-mapper: multipath: blk_get_request() returned -11 
- requeuing


This is due to

multipath_clone_and_map()

/*
 * Map cloned requests (request-based multipath)
 */
static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
				   union map_info *map_context,
				   struct request **__clone)
{
..
..
..
	clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
	if (IS_ERR(clone)) {
		/* EBUSY, ENODEV or EWOULDBLOCK: requeue */
		bool queue_dying = blk_queue_dying(q);
		DMERR_LIMIT("blk_get_request() returned %ld%s - requeuing",
			    PTR_ERR(clone), queue_dying ? " (path offline)" : "");
		if (queue_dying) {
			atomic_inc(&m->pg_init_in_progress);
			activate_or_offline_path(pgpath);
			return DM_MAPIO_REQUEUE;
		}
		return DM_MAPIO_DELAY_REQUEUE;
	}

Still investigating but it leads to a hard lockup


So I still need to see if the hard-lockup happens in the stock kernel 
with mq-deadline and some other work before coming up with a full 
summary of the issue.


I also intend to re-run all tests including disconnect and reconnect 
tests on both mq-deadline and none.


Trace below


[ 1553.167357] NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
[ 1553.167359] Modules linked in: mq_deadline binfmt_misc dm_round_robin 
xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter 
ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set 
nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat 
nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 rpcrdma ip6table_mangle 
ip6table_security ip6table_raw iptable_nat ib_isert nf_conntrack_ipv4 
iscsi_target_mod nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ib_iser 
libiscsi scsi_transport_iscsi iptable_mangle iptable_security 
iptable_raw ebtable_filter ebtables target_core_mod ip6table_filter 
ip6_tables iptable_filter ib_srp scsi_transport_srp ib_ipoib rdma_ucm 
ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core 
intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul
[ 1553.167385]  crc32_pclmul ghash_clmulni_intel pcbc aesni_intel sg 
joydev ipmi_si hpilo crypto_simd hpwdt iTCO_wdt cryptd ipmi_devintf 
glue_helper gpio_ich iTCO_vendor_support shpchp ipmi_msghandler pcspkr 
acpi_power_meter i7core_edac lpc_ich pcc_cpufreq nfsd auth_rpcgss 
nfs_acl lockd grace sunrpc dm_multipath ip_tables xfs libcrc32c sd_mod 
amdkfd amd_iommu_v2 radeon i2c_algo_bit drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops ttm mlx5_core drm mlxfw i2c_core ptp 
serio_raw hpsa crc32c_intel bnx2 pps_core devlink scsi_transport_sas 
dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ib_srpt]
[ 1553.167410] CPU: 4 PID: 11532 Comm: dd Tainted: G  I 
4.13.0-rc3lobeming.ming_V4+ #20

[ 1553.167411] Hardware name: HP Pro

Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-07 Thread Laurence Oberman



On 08/07/2017 11:27 AM, Bart Van Assche wrote:

On Mon, 2017-08-07 at 08:48 -0400, Laurence Oberman wrote:

I tested this series using Ming's tests as well as my own set of tests
typically run against changes to upstream code in my SRP test-bed.
My tests also include very large sequential buffered and un-buffered I/O.

This series seems to be fine for me. I did uncover another issue that is
unrelated to these patches and also exists in 4.13-RC3 generic that I am
still debugging.


Hello Laurence,

What kind of tests did you run? Only functional tests or also performance
tests?

Has the issue you encountered with kernel 4.13-rc3 already been reported on
the linux-rdma mailing list?

Thanks,

Bart.



Hi Bart

Actually I was focusing on just performance to see if we had any
regressions with Ming's new patches for the large sequential I/O cases.


Ming had already tested the small I/O performance cases so I was making 
sure the large I/O sequential tests did not suffer.


The 4MB un-buffered direct read tests to DM devices seem to perform
much the same in my test bed.
The 4MB buffered and un-buffered writes also seem to be well behaved,
with not much improvement.


These were not exhaustive tests and did not include my usual port 
disconnect and recovery tests either.

I was just making sure we did not regress with Ming's changes.

I was only just starting to baseline test the mq-deadline scheduler as 
prior to 4.13-RC3 I had not been testing any of the new MQ schedulers.

I had always only tested with [none]

The tests were with [none] and [mq-deadline]

The new issue I started seeing has not been reported yet, as I was still
investigating it.


In summary:
With buffered writes we see the clone fail in blk_get_request() in both
Ming's kernel and in the upstream 4.13-RC3 stock kernel.


[  885.271451] io scheduler mq-deadline registered
[  898.455442] device-mapper: multipath: blk_get_request() returned -11 
- requeuing


This is due to

multipath_clone_and_map()

/*
 * Map cloned requests (request-based multipath)
 */
static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
				   union map_info *map_context,
				   struct request **__clone)
{
..
..
..
	clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
	if (IS_ERR(clone)) {
		/* EBUSY, ENODEV or EWOULDBLOCK: requeue */
		bool queue_dying = blk_queue_dying(q);
		DMERR_LIMIT("blk_get_request() returned %ld%s - requeuing",
			    PTR_ERR(clone), queue_dying ? " (path offline)" : "");
		if (queue_dying) {
			atomic_inc(&m->pg_init_in_progress);
			activate_or_offline_path(pgpath);
			return DM_MAPIO_REQUEUE;
		}
		return DM_MAPIO_DELAY_REQUEUE;
	}

Still investigating but it leads to a hard lockup


So I still need to see if the hard-lockup happens in the stock kernel 
with mq-deadline and some other work before coming up with a full 
summary of the issue.


I also intend to re-run all tests including disconnect and reconnect 
tests on both mq-deadline and none.


Trace below


[ 1553.167357] NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
[ 1553.167359] Modules linked in: mq_deadline binfmt_misc dm_round_robin 
xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter 
ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set 
nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat 
nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 rpcrdma ip6table_mangle 
ip6table_security ip6table_raw iptable_nat ib_isert nf_conntrack_ipv4 
iscsi_target_mod nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ib_iser 
libiscsi scsi_transport_iscsi iptable_mangle iptable_security 
iptable_raw ebtable_filter ebtables target_core_mod ip6table_filter 
ip6_tables iptable_filter ib_srp scsi_transport_srp ib_ipoib rdma_ucm 
ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core 
intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul
[ 1553.167385]  crc32_pclmul ghash_clmulni_intel pcbc aesni_intel sg 
joydev ipmi_si hpilo crypto_simd hpwdt iTCO_wdt cryptd ipmi_devintf 
glue_helper gpio_ich iTCO_vendor_support shpchp ipmi_msghandler pcspkr 
acpi_power_meter i7core_edac lpc_ich pcc_cpufreq nfsd auth_rpcgss 
nfs_acl lockd grace sunrpc dm_multipath ip_tables xfs libcrc32c sd_mod 
amdkfd amd_iommu_v2 radeon i2c_algo_bit drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops ttm mlx5_core drm mlxfw i2c_core ptp 
serio_raw hpsa crc32c_intel bnx2 pps_core devlink scsi_transport_sas 
dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ib_srpt]
[ 1553.167410] CPU: 4 PID: 11532 Comm: dd Tainted: G  I 
4.13.0-rc3lobeming.ming_V4+ #20

[ 1553.167411] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
[ 1553.167412] task: 9d9344b0d800 t

Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-07 Thread Bart Van Assche
On Mon, 2017-08-07 at 08:48 -0400, Laurence Oberman wrote:
> I tested this series using Ming's tests as well as my own set of tests 
> typically run against changes to upstream code in my SRP test-bed.
> My tests also include very large sequential buffered and un-buffered I/O.
> 
> This series seems to be fine for me. I did uncover another issue that is 
> unrelated to these patches and also exists in 4.13-RC3 generic that I am 
> still debugging.

Hello Laurence,

What kind of tests did you run? Only functional tests or also performance
tests?

Has the issue you encountered with kernel 4.13-rc3 already been reported on
the linux-rdma mailing list?

Thanks,

Bart.

Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-07 Thread Laurence Oberman



On 08/05/2017 02:56 AM, Ming Lei wrote:

In the Red Hat internal storage tests of the blk-mq scheduler, we
found that I/O performance is quite bad with mq-deadline, especially
for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
SRP...)

It turns out that one big issue causes the performance regression:
requests are still dequeued from the sw queue/scheduler queue even
when the LLD's queue is busy, so I/O merging becomes quite difficult
and sequential I/O degrades a lot.

The first five patches improve this situation and bring back some of
the lost performance.

But it looks like they are still not enough. That is caused by the
shared queue depth among all hw queues. For SCSI devices,
.cmd_per_lun defines the max number of pending I/O on one request
queue, i.e. a per-request_queue depth. So during dispatch, if one
hctx is too busy to make progress, no hctx can dispatch because of
the per-request_queue depth.
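To illustrate that shared-depth problem, here is a minimal userspace
sketch (all names are made up, this is not code from the series): every
hw queue draws dispatch budget from the same per-request_queue counter,
so once one stuck hctx has consumed it, the remaining hctxs cannot
dispatch either.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define CMD_PER_LUN 3            /* shared per-request_queue depth */

static atomic_int inflight;      /* I/O in flight on the whole queue */

static bool get_dispatch_budget(void)
{
	if (atomic_fetch_add(&inflight, 1) < CMD_PER_LUN)
		return true;
	atomic_fetch_sub(&inflight, 1);  /* over budget: back off */
	return false;
}

int main(void)
{
	/* hctx 0 is "too busy to move on": it holds budget it can't complete */
	for (int i = 0; i < CMD_PER_LUN; i++)
		get_dispatch_budget();

	/* every other hctx now fails to dispatch, even though it is idle */
	for (int hctx = 1; hctx < 4; hctx++)
		printf("hctx %d can dispatch: %s\n", hctx,
		       get_dispatch_budget() ? "yes" : "no");
	return 0;
}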

Patches 6-14 use a per-request_queue dispatch list to avoid dequeuing
requests from the sw/scheduler queues while the LLD queue is busy.
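A rough userspace sketch of that idea follows (hypothetical types and
names, not the actual kernel code): requests the driver rejects are
parked on a per-request_queue dispatch list, and while that list is
non-empty nothing more is pulled from the sw/scheduler queues, which is
what preserves the merge opportunities there.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct request { struct request *next; };

struct queue {
	struct request *sched_head;     /* sw/scheduler queue */
	struct request *dispatch_head;  /* per-request_queue dispatch list */
	bool lld_busy;                  /* driver reported "queue full" */
};

static int device_slots = 2;            /* pretend LLD queue depth */

static bool lld_queue_rq(struct request *rq)
{
	(void)rq;
	if (device_slots == 0)
		return false;           /* BUSY: caller must hold the request back */
	device_slots--;
	return true;
}

static struct request *pop(struct request **head)
{
	struct request *rq = *head;

	if (rq)
		*head = rq->next;
	return rq;
}

static void push(struct request **head, struct request *rq)
{
	rq->next = *head;
	*head = rq;
}

static void dispatch(struct queue *q)
{
	struct request *rq;

	/* flush previously parked requests first */
	while ((rq = pop(&q->dispatch_head)) != NULL) {
		if (!lld_queue_rq(rq)) {
			push(&q->dispatch_head, rq);   /* park it again */
			q->lld_busy = true;
			return;                        /* don't touch the sw queue */
		}
	}
	q->lld_busy = false;

	/* only now dequeue from the sw/scheduler queue, one request at a time */
	while (!q->lld_busy && (rq = pop(&q->sched_head)) != NULL) {
		if (!lld_queue_rq(rq)) {
			push(&q->dispatch_head, rq);   /* park it and stop dequeuing */
			q->lld_busy = true;
		}
	}
}

int main(void)
{
	struct request r[4];
	struct queue q = { 0 };
	int left = 0;

	for (int i = 0; i < 4; i++)
		push(&q.sched_head, &r[i]);

	dispatch(&q);                    /* device accepts 2, then reports BUSY */

	for (struct request *rq = q.sched_head; rq; rq = rq->next)
		left++;
	printf("still merge-able in sw queue: %d, LLD busy: %d\n", left, q.lld_busy);
	return 0;
}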

Patches 15-20 improve bio merging via a hash table in the sw queue,
which makes bio merging more efficient than the current approach, in
which only the last 8 requests are checked. Since patches 6-14 switch
SCSI devices to the scheduler style of dequeuing one request at a
time from the sw queue, ctx->lock is acquired more often; merging
bios via the hash table shortens the ctx->lock hold time and should
eliminate the effect of patch 14.
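As an illustration of the hash-table merge lookup (made-up names, not
the code added by these patches): instead of scanning the most recent
requests for a back-merge candidate, each pending request is hashed by
its end sector so the request a new bio could be appended to is found in
O(1).

#include <stdint.h>
#include <stdio.h>

#define MERGE_HASH_BITS 6
#define MERGE_HASH_SIZE (1u << MERGE_HASH_BITS)

struct pending_rq {
	uint64_t start, nr_sectors;
	struct pending_rq *hash_next;
};

static struct pending_rq *merge_hash[MERGE_HASH_SIZE];

static unsigned int hash_sector(uint64_t sector)
{
	/* Fibonacci hashing of the sector number into a small bucket index */
	return (unsigned int)(sector * 11400714819323198485ull >>
			      (64 - MERGE_HASH_BITS));
}

/* index a request by the sector right after its last one */
static void rq_hash_add(struct pending_rq *rq)
{
	unsigned int b = hash_sector(rq->start + rq->nr_sectors);

	rq->hash_next = merge_hash[b];
	merge_hash[b] = rq;
}

/* find a request that ends exactly where the new bio begins */
static struct pending_rq *find_back_merge(uint64_t bio_start)
{
	for (struct pending_rq *rq = merge_hash[hash_sector(bio_start)];
	     rq; rq = rq->hash_next)
		if (rq->start + rq->nr_sectors == bio_start)
			return rq;
	return NULL;
}

int main(void)
{
	struct pending_rq a = { .start = 0, .nr_sectors = 8 };
	struct pending_rq b = { .start = 1024, .nr_sectors = 8 };

	rq_hash_add(&a);
	rq_hash_add(&b);

	/* a sequential 4k bio starting at sector 8 merges into request 'a' */
	printf("merge candidate found: %s\n",
	       find_back_merge(8) == &a ? "yes" : "no");
	return 0;
}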

With these changes, SCSI-MQ sequential I/O performance improves a
lot. For lpfc it is basically brought back to the level of the legacy
block path [1]; in particular, mq-deadline improves by more than 10x
[1] on lpfc and by more than 3x on SCSI SRP. For mq-none, lpfc
improves by 10%, and writes improve by more than 10% on SRP too.

Also, Bart was worried that this patchset might affect SRP, so test
data on SCSI SRP is provided this time:

- fio(libaio, bs:4k, dio, queue_depth:64, 64 jobs)
- system(16 cores, dual sockets, mem: 96G)

               |v4.13-rc3     |v4.13-rc3     |v4.13-rc3+patches
               |blk-legacy dd |blk-mq none   |blk-mq none
---------------|--------------|--------------|------------------
read :iops     | 587K         | 526K         | 537K
randread :iops | 115K         | 140K         | 139K
write:iops     | 596K         | 519K         | 602K
randwrite:iops | 103K         | 122K         | 120K


               |v4.13-rc3     |v4.13-rc3     |v4.13-rc3+patches
               |blk-legacy dd |blk-mq dd     |blk-mq dd
---------------|--------------|--------------|------------------
read :iops     | 587K         | 155K         | 522K
randread :iops | 115K         | 140K         | 141K
write:iops     | 596K         | 135K         | 587K
randwrite:iops | 103K         | 120K         | 118K

V2:
- dequeue requests from sw queues in a round-robin style, as
suggested by Bart, and introduce one helper in sbitmap for this
purpose (see the sketch below)
- improve bio merging via a hash table in the sw queue
- add comments about using the DISPATCH_BUSY state in a lockless
way, simplifying the handling of the busy state
- hold ctx->lock when clearing the ctx busy bit, as suggested
by Bart
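As a rough illustration of that first change (made-up names; the real
series adds a helper to sbitmap for this), round-robin selection simply
means the scan over the per-ctx "pending work" bits resumes after the
last serviced ctx instead of always starting at ctx 0:

#include <stdbool.h>
#include <stdio.h>

#define NR_CTX 8

struct hctx {
	bool ctx_has_work[NR_CTX]; /* one bit per sw queue */
	unsigned int next_ctx;     /* where the next scan starts */
};

/* return the next ctx with pending work, scanning round-robin */
static int pick_ctx(struct hctx *h)
{
	for (unsigned int i = 0; i < NR_CTX; i++) {
		unsigned int ctx = (h->next_ctx + i) % NR_CTX;

		if (h->ctx_has_work[ctx]) {
			h->next_ctx = (ctx + 1) % NR_CTX; /* resume after it */
			return (int)ctx;
		}
	}
	return -1; /* nothing pending */
}

int main(void)
{
	struct hctx h = { .ctx_has_work = { [1] = true, [5] = true } };

	printf("%d\n", pick_ctx(&h)); /* 1 */
	printf("%d\n", pick_ctx(&h)); /* 5: ctx 1 is not rescanned first */
	return 0;
}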


[1] http://marc.info/?l=linux-block&m=150151989915776&w=2

Ming Lei (20):
   blk-mq-sched: fix scheduler bad performance
   sbitmap: introduce __sbitmap_for_each_set()
   blk-mq: introduce blk_mq_dispatch_rq_from_ctx()
   blk-mq-sched: move actual dispatching into one helper
   blk-mq-sched: improve dispatching from sw queue
   blk-mq-sched: don't dequeue request until all in ->dispatch are
 flushed
   blk-mq-sched: introduce blk_mq_sched_queue_depth()
   blk-mq-sched: use q->queue_depth as hint for q->nr_requests
   blk-mq: introduce BLK_MQ_F_SHARED_DEPTH
   blk-mq-sched: introduce helpers for query, change busy state
   blk-mq: introduce helpers for operating ->dispatch list
   blk-mq: introduce pointers to dispatch lock & list
   blk-mq: pass 'request_queue *' to several helpers of operating BUSY
   blk-mq-sched: improve IO scheduling on SCSI devcie
   block: introduce rqhash helpers
   block: move actual bio merge code into __elv_merge
   block: add check on elevator for supporting bio merge via hashtable
 from blk-mq sw queue
   block: introduce .last_merge and .hash to blk_mq_ctx
   blk-mq-sched: refactor blk_mq_sched_try_merge()
   blk-mq: improve bio merge from blk-mq sw queue

  block/blk-mq-debugfs.c  |  12 ++--
  block/blk-mq-sched.c| 187 +---
  block/blk-mq-sched.h|  23 ++
  block/blk-mq.c  | 133 +++---
  block/blk-mq.h  |  73 +++
  block/blk-settings.c|   2 +
  block/blk.h |  55 ++
  block/elevator.c|  93 ++--
  include/linux/blk-mq.h  |   5 ++
  include/linux/blkdev.h  |   5 ++
  include/linux/sbitmap.h |  54