Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
On Wed, Aug 23, 2017 at 10:15:29AM -0600, Jens Axboe wrote:
> On 08/23/2017 10:12 AM, Bart Van Assche wrote:
> > On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > > In a Red Hat internal storage test of the blk-mq scheduler, we
> > > found that I/O performance is much worse with mq-deadline, especially
> > > for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
> > > SRP...)
> >
> > Hello Ming and Jens,
> >
> > There may not be enough time left to reach agreement about the whole patch
> > series before the kernel v4.14 merge window opens. How about focusing on
> > patches 1..8 of this series for kernel v4.14 and revisiting the rest of this
> > patch series later?
>
> I was going to go over the series today with 4.14 in mind. Looks to me like
> this should really be 2-3 patch series that depend on each other. Might be
> better for review purposes as well. So I'd agree with Bart - can we get this
> split a bit and geared towards what we need for 4.14 at least, since it's
> getting close. And some of the changes do make me somewhat nervous; they
> need proper cooking time.

I agree to split the patchset and will do it tomorrow. If you guys have any
suggestions about the splitting (such as which parts should aim at v4.14),
please let me know.

--
Ming
Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
On 08/23/2017 10:12 AM, Bart Van Assche wrote:
> On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> > In a Red Hat internal storage test of the blk-mq scheduler, we
> > found that I/O performance is much worse with mq-deadline, especially
> > for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
> > SRP...)
>
> Hello Ming and Jens,
>
> There may not be enough time left to reach agreement about the whole patch
> series before the kernel v4.14 merge window opens. How about focusing on
> patches 1..8 of this series for kernel v4.14 and revisiting the rest of this
> patch series later?

I was going to go over the series today with 4.14 in mind. Looks to me like
this should really be 2-3 patch series that depend on each other. Might be
better for review purposes as well. So I'd agree with Bart - can we get this
split a bit and geared towards what we need for 4.14 at least, since it's
getting close. And some of the changes do make me somewhat nervous; they
need proper cooking time.

--
Jens Axboe
Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
On Sat, 2017-08-05 at 14:56 +0800, Ming Lei wrote:
> In a Red Hat internal storage test of the blk-mq scheduler, we
> found that I/O performance is much worse with mq-deadline, especially
> for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
> SRP...)

Hello Ming and Jens,

There may not be enough time left to reach agreement about the whole patch
series before the kernel v4.14 merge window opens. How about focusing on
patches 1..8 of this series for kernel v4.14 and revisiting the rest of this
patch series later?

Thanks,

Bart.
Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
On Fri, 2017-08-11 at 01:11 -0700, Christoph Hellwig wrote:
> [+ Martin and linux-scsi]
>
> Given that we need this big pile and a few bfq fixes to avoid
> major regressions, I'm tempted to revert the default to scsi-mq
> for 4.14, but bring it back a little later for 4.15.
>
> What do you think? Maybe for 4.15 we could also do it through the
> block tree where all the fixes will be queued.

Given the severe workload regressions Mel reported, I think that's wise. I
also think we wouldn't have found all these problems if it hadn't been the
default, so the original patch was the best way of trying to find out if we
were ready for the switch and forcing all the issues out.

Thanks,

James
Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
[+ Martin and linux-scsi]

Given that we need this big pile and a few bfq fixes to avoid
major regressions, I'm tempted to revert the default to scsi-mq
for 4.14, but bring it back a little later for 4.15.

What do you think? Maybe for 4.15 we could also do it through the
block tree where all the fixes will be queued.
Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
On 08/08/2017 09:41 AM, Ming Lei wrote:
> Hi Laurence and Guys,
>
> On Mon, Aug 07, 2017 at 06:06:11PM -0400, Laurence Oberman wrote:
> > On Mon, Aug 7, 2017 at 8:48 AM, Laurence Oberman wrote:
> >
> > Hello
> >
> > I need to retract my Tested-by:
> >
> > While it's valid that the patches do not introduce performance regressions,
> > they seem to cause a hard lockup when the [mq-deadline] scheduler is
> > enabled, so I am not confident with a passing result here.
> >
> > This is specific to large buffered I/O writes (4MB); at least that is my
> > current test.
> >
> > I did not wait long enough for the issue to show when I first sent the pass
> > (Tested-by) message. Because I know my test platform so well, I thought I had
> > given it enough time to validate the patches for performance regressions.
> >
> > I don't know if the failing clone in blk_get_request() is a direct
> > catalyst for the hard lockup, but what I do know is that with the stock
> > upstream 4.13-RC3 I only see the clone failures when I am set to [none], and
> > stock upstream never seems to see the hard lockup.
> >
> > With [mq-deadline] enabled on stock I don't see them at all and it behaves.
> >
> > Now with Ming's patches, if we enable [mq-deadline] we DO see the clone
> > failures and the hard lockup, so we have the opposite behaviour with the
> > scheduler choice, and we have the hard lockup.
> >
> > On Ming's kernel with [none] we are well behaved, and that was my original
> > focus, testing on [none], and hence my Tested-by: pass.
> >
> > So more investigation is needed here.
>
> Laurence, as we talked about on IRC, the hard lockup issue you saw isn't
> related to this patchset, because the issue can be reproduced on both
> v4.13-rc3 and RHEL7. The only trick is to run your hammer write script
> concurrently in 16 jobs; then it only takes several minutes to trigger, no
> matter whether the mq none or mq-deadline scheduler is used.
>
> Given it is easy to reproduce, I believe it shouldn't be very difficult to
> investigate and root cause. I will report the issue in another thread, and
> attach the script for reproduction.
>
> So let's focus on this patchset ([PATCH V2 00/20] blk-mq-sched: improve
> SCSI-MQ performance) in this thread. Thanks again for your test!
>
> Thanks,
> Ming

Hello Ming,

Yes, I agree. This means my original Tested-by: for your patch set is still
valid for the large-size I/O tests.

Thank you for all this hard work on improving block-MQ.

Regards,
Laurence
Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
Hi Laurence and Guys,

On Mon, Aug 07, 2017 at 06:06:11PM -0400, Laurence Oberman wrote:
> On Mon, Aug 7, 2017 at 8:48 AM, Laurence Oberman wrote:
>
> Hello
>
> I need to retract my Tested-by:
>
> While it's valid that the patches do not introduce performance regressions,
> they seem to cause a hard lockup when the [mq-deadline] scheduler is
> enabled, so I am not confident with a passing result here.
>
> This is specific to large buffered I/O writes (4MB); at least that is my
> current test.
>
> I did not wait long enough for the issue to show when I first sent the pass
> (Tested-by) message. Because I know my test platform so well, I thought I had
> given it enough time to validate the patches for performance regressions.
>
> I don't know if the failing clone in blk_get_request() is a direct
> catalyst for the hard lockup, but what I do know is that with the stock
> upstream 4.13-RC3 I only see the clone failures when I am set to [none], and
> stock upstream never seems to see the hard lockup.
>
> With [mq-deadline] enabled on stock I don't see them at all and it behaves.
>
> Now with Ming's patches, if we enable [mq-deadline] we DO see the clone
> failures and the hard lockup, so we have the opposite behaviour with the
> scheduler choice, and we have the hard lockup.
>
> On Ming's kernel with [none] we are well behaved, and that was my original
> focus, testing on [none], and hence my Tested-by: pass.
>
> So more investigation is needed here.

Laurence, as we talked about on IRC, the hard lockup issue you saw isn't
related to this patchset, because the issue can be reproduced on both
v4.13-rc3 and RHEL7. The only trick is to run your hammer write script
concurrently in 16 jobs; then it only takes several minutes to trigger, no
matter whether the mq none or mq-deadline scheduler is used.

Given it is easy to reproduce, I believe it shouldn't be very difficult to
investigate and root cause. I will report the issue in another thread, and
attach the script for reproduction.

So let's focus on this patchset ([PATCH V2 00/20] blk-mq-sched: improve
SCSI-MQ performance) in this thread. Thanks again for your test!

Thanks,
Ming
Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
> Il giorno 08 ago 2017, alle ore 11:09, Ming Lei ha > scritto: > > On Tue, Aug 08, 2017 at 10:09:57AM +0200, Paolo Valente wrote: >> >>> Il giorno 05 ago 2017, alle ore 08:56, Ming Lei ha >>> scritto: >>> >>> In Red Hat internal storage test wrt. blk-mq scheduler, we >>> found that I/O performance is much bad with mq-deadline, especially >>> about sequential I/O on some multi-queue SCSI devcies(lpfc, qla2xxx, >>> SRP...) >>> >>> Turns out one big issue causes the performance regression: requests >>> are still dequeued from sw queue/scheduler queue even when ldd's >>> queue is busy, so I/O merge becomes quite difficult to make, then >>> sequential IO degrades a lot. >>> >>> The 1st five patches improve this situation, and brings back >>> some performance loss. >>> >>> But looks they are still not enough. It is caused by >>> the shared queue depth among all hw queues. For SCSI devices, >>> .cmd_per_lun defines the max number of pending I/O on one >>> request queue, which is per-request_queue depth. So during >>> dispatch, if one hctx is too busy to move on, all hctxs can't >>> dispatch too because of the per-request_queue depth. >>> >>> Patch 6 ~ 14 use per-request_queue dispatch list to avoid >>> to dequeue requests from sw/scheduler queue when lld queue >>> is busy. >>> >>> Patch 15 ~20 improve bio merge via hash table in sw queue, >>> which makes bio merge more efficient than current approch >>> in which only the last 8 requests are checked. Since patch >>> 6~14 converts to the scheduler way of dequeuing one request >>> from sw queue one time for SCSI device, and the times of >>> acquring ctx->lock is increased, and merging bio via hash >>> table decreases holding time of ctx->lock and should eliminate >>> effect from patch 14. 
>>> >>> With this changes, SCSI-MQ sequential I/O performance is >>> improved much, for lpfc, it is basically brought back >>> compared with block legacy path[1], especially mq-deadline >>> is improved by > X10 [1] on lpfc and by > 3X on SCSI SRP, >>> For mq-none it is improved by 10% on lpfc, and write is >>> improved by > 10% on SRP too. >>> >>> Also Bart worried that this patchset may affect SRP, so provide >>> test data on SCSI SRP this time: >>> >>> - fio(libaio, bs:4k, dio, queue_depth:64, 64 jobs) >>> - system(16 cores, dual sockets, mem: 96G) >>> >>> |v4.13-rc3 |v4.13-rc3 | v4.13-rc3+patches | >>> |blk-legacy dd |blk-mq none | blk-mq none | >>> ---| >>> read :iops| 587K | 526K | 537K | >>> randread :iops| 115K | 140K | 139K | >>> write:iops| 596K | 519K | 602K | >>> randwrite:iops| 103K | 122K | 120K | >>> >>> >>> |v4.13-rc3 |v4.13-rc3 | v4.13-rc3+patches >>> |blk-legacy dd |blk-mq dd | blk-mq dd| >>> >>> read :iops| 587K | 155K | 522K | >>> randread :iops| 115K | 140K | 141K | >>> write:iops| 596K | 135K | 587K | >>> randwrite:iops| 103K | 120K | 118K | >>> >>> V2: >>> - dequeue request from sw queues in round roubin's style >>> as suggested by Bart, and introduces one helper in sbitmap >>> for this purpose >>> - improve bio merge via hash table from sw queue >>> - add comments about using DISPATCH_BUSY state in lockless way, >>> simplifying handling on busy state, >>> - hold ctx->lock when clearing ctx busy bit as suggested >>> by Bart >>> >>> >> >> Hi, >> I've performance-tested Ming's patchset with the dbench4 test in >> MMTests, and with the mq-deadline and bfq schedulers. Max latencies, >> have decreased dramatically: up to 32 times. Very good results for >> average latencies as well. >> >> For brevity, here are only results for deadline. You can find full >> results with bfq in the thread that triggered my testing of Ming's >> patches [1]. 
>> >> MQ-DEADLINE WITHOUT MING'S PATCHES >> >> OperationCountAvgLatMaxLat >> -- >> Flush1376090.542 13221.495 >> Close 137654 0.00827.133 >> LockX 640 0.009 0.115 >> Rename8064 1.062 246.759 >> ReadX 297956 0.051 347.018 >> WriteX 94698 425.636 15090.020 >> Unlink 35077 0.580 208.462 >> UnlockX640 0.007 0.291 >> FIND_FIRST 66630 0.566 530.339 >> SET_FILE_INFORMATION 16000 1.419 811.494 >> QUERY_FILE_INFORMATION 30717 0.004 1.108 >> QUERY_PATH_INFORMATION 176153 0.182 517.419 >> QUERY_FS_INFORMATION 30857 0.01818.562 >> NTCreateX 184145
Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
On Tue, Aug 08, 2017 at 10:09:57AM +0200, Paolo Valente wrote: > > > Il giorno 05 ago 2017, alle ore 08:56, Ming Lei ha > > scritto: > > > > In Red Hat internal storage test wrt. blk-mq scheduler, we > > found that I/O performance is much bad with mq-deadline, especially > > about sequential I/O on some multi-queue SCSI devcies(lpfc, qla2xxx, > > SRP...) > > > > Turns out one big issue causes the performance regression: requests > > are still dequeued from sw queue/scheduler queue even when ldd's > > queue is busy, so I/O merge becomes quite difficult to make, then > > sequential IO degrades a lot. > > > > The 1st five patches improve this situation, and brings back > > some performance loss. > > > > But looks they are still not enough. It is caused by > > the shared queue depth among all hw queues. For SCSI devices, > > .cmd_per_lun defines the max number of pending I/O on one > > request queue, which is per-request_queue depth. So during > > dispatch, if one hctx is too busy to move on, all hctxs can't > > dispatch too because of the per-request_queue depth. > > > > Patch 6 ~ 14 use per-request_queue dispatch list to avoid > > to dequeue requests from sw/scheduler queue when lld queue > > is busy. > > > > Patch 15 ~20 improve bio merge via hash table in sw queue, > > which makes bio merge more efficient than current approch > > in which only the last 8 requests are checked. Since patch > > 6~14 converts to the scheduler way of dequeuing one request > > from sw queue one time for SCSI device, and the times of > > acquring ctx->lock is increased, and merging bio via hash > > table decreases holding time of ctx->lock and should eliminate > > effect from patch 14. 
> > > > With this changes, SCSI-MQ sequential I/O performance is > > improved much, for lpfc, it is basically brought back > > compared with block legacy path[1], especially mq-deadline > > is improved by > X10 [1] on lpfc and by > 3X on SCSI SRP, > > For mq-none it is improved by 10% on lpfc, and write is > > improved by > 10% on SRP too. > > > > Also Bart worried that this patchset may affect SRP, so provide > > test data on SCSI SRP this time: > > > > - fio(libaio, bs:4k, dio, queue_depth:64, 64 jobs) > > - system(16 cores, dual sockets, mem: 96G) > > > > |v4.13-rc3 |v4.13-rc3 | v4.13-rc3+patches | > > |blk-legacy dd |blk-mq none | blk-mq none | > > ---| > > read :iops| 587K | 526K | 537K | > > randread :iops| 115K | 140K | 139K | > > write:iops| 596K | 519K | 602K | > > randwrite:iops| 103K | 122K | 120K | > > > > > > |v4.13-rc3 |v4.13-rc3 | v4.13-rc3+patches > > |blk-legacy dd |blk-mq dd | blk-mq dd| > > > > read :iops| 587K | 155K | 522K | > > randread :iops| 115K | 140K | 141K | > > write:iops| 596K | 135K | 587K | > > randwrite:iops| 103K | 120K | 118K | > > > > V2: > > - dequeue request from sw queues in round roubin's style > > as suggested by Bart, and introduces one helper in sbitmap > > for this purpose > > - improve bio merge via hash table from sw queue > > - add comments about using DISPATCH_BUSY state in lockless way, > > simplifying handling on busy state, > > - hold ctx->lock when clearing ctx busy bit as suggested > > by Bart > > > > > > Hi, > I've performance-tested Ming's patchset with the dbench4 test in > MMTests, and with the mq-deadline and bfq schedulers. Max latencies, > have decreased dramatically: up to 32 times. Very good results for > average latencies as well. > > For brevity, here are only results for deadline. You can find full > results with bfq in the thread that triggered my testing of Ming's > patches [1]. 
> > MQ-DEADLINE WITHOUT MING'S PATCHES > > OperationCountAvgLatMaxLat > -- > Flush1376090.542 13221.495 > Close 137654 0.00827.133 > LockX 640 0.009 0.115 > Rename8064 1.062 246.759 > ReadX 297956 0.051 347.018 > WriteX 94698 425.636 15090.020 > Unlink 35077 0.580 208.462 > UnlockX640 0.007 0.291 > FIND_FIRST 66630 0.566 530.339 > SET_FILE_INFORMATION 16000 1.419 811.494 > QUERY_FILE_INFORMATION 30717 0.004 1.108 > QUERY_PATH_INFORMATION 176153 0.182 517.419 > QUERY_FS_INFORMATION 30857 0.01818.562 > NTCreateX 184145 0.281 582.076 > > Throughput 8.93961 MB/sec 64 clients 64 procs max_late
Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
> On 05 Aug 2017, at 08:56, Ming Lei wrote:
>
> In a Red Hat internal storage test of the blk-mq scheduler, we
> found that I/O performance is much worse with mq-deadline, especially
> for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
> SRP...)
>
> Turns out one big issue causes the performance regression: requests
> are still dequeued from the sw queue/scheduler queue even when the LLD's
> queue is busy, so I/O merging becomes quite difficult, and
> sequential I/O degrades a lot.
>
> The first five patches improve this situation and bring back
> some of the performance loss.
>
> But it looks like they are still not enough. That is caused by
> the queue depth shared among all hw queues. For SCSI devices,
> .cmd_per_lun defines the max number of pending I/Os on one
> request queue, which is a per-request_queue depth. So during
> dispatch, if one hctx is too busy to move on, no hctx can
> dispatch either, because of the per-request_queue depth.
>
> Patches 6~14 use a per-request_queue dispatch list to avoid
> dequeuing requests from the sw/scheduler queue when the LLD queue
> is busy.
>
> Patches 15~20 improve bio merging via a hash table in the sw queue,
> which makes bio merging more efficient than the current approach,
> in which only the last 8 requests are checked. Since patches
> 6~14 convert to the scheduler way of dequeuing one request
> from the sw queue at a time for SCSI devices, the number of times
> ctx->lock is acquired is increased; merging bios via a hash
> table decreases the hold time of ctx->lock and should eliminate
> the effect from patch 14.
>
> With these changes, SCSI-MQ sequential I/O performance is
> improved a lot; for lpfc, it is basically brought back
> compared with the block legacy path [1]. In particular, mq-deadline
> is improved by >10X [1] on lpfc and by >3X on SCSI SRP.
> For mq-none it is improved by 10% on lpfc, and write is
> improved by >10% on SRP too.
>
> Also Bart worried that this patchset may affect SRP, so here is
> test data on SCSI SRP this time:
>
> - fio (libaio, bs: 4k, dio, queue_depth: 64, 64 jobs)
> - system (16 cores, dual sockets, mem: 96G)
>
>                | v4.13-rc3     | v4.13-rc3   | v4.13-rc3+patches |
>                | blk-legacy dd | blk-mq none | blk-mq none       |
> ---------------|---------------|-------------|-------------------|
> read     :iops | 587K          | 526K        | 537K              |
> randread :iops | 115K          | 140K        | 139K              |
> write    :iops | 596K          | 519K        | 602K              |
> randwrite:iops | 103K          | 122K        | 120K              |
>
>                | v4.13-rc3     | v4.13-rc3   | v4.13-rc3+patches |
>                | blk-legacy dd | blk-mq dd   | blk-mq dd         |
> ---------------|---------------|-------------|-------------------|
> read     :iops | 587K          | 155K        | 522K              |
> randread :iops | 115K          | 140K        | 141K              |
> write    :iops | 596K          | 135K        | 587K              |
> randwrite:iops | 103K          | 120K        | 118K              |
>
> V2:
> - dequeue requests from sw queues in round-robin style as suggested
>   by Bart, and introduce one helper in sbitmap for this purpose
> - improve bio merging via a hash table from the sw queue
> - add comments about using the DISPATCH_BUSY state in a lockless way,
>   simplifying handling of the busy state
> - hold ctx->lock when clearing the ctx busy bit, as suggested by Bart

Hi,
I've performance-tested Ming's patchset with the dbench4 test in
MMTests, with the mq-deadline and bfq schedulers. Max latencies
have decreased dramatically: up to 32 times. Very good results for
average latencies as well.

For brevity, here are only results for deadline. You can find full
results with bfq in the thread that triggered my testing of Ming's
patches [1].

MQ-DEADLINE WITHOUT MING'S PATCHES

Operation                  Count    AvgLat    MaxLat
----------------------------------------------------
Flush                      13760    90.542  13221.495
Close                     137654     0.008     27.133
LockX                        640     0.009      0.115
Rename                      8064     1.062    246.759
ReadX                     297956     0.051    347.018
WriteX                     94698   425.636  15090.020
Unlink                     35077     0.580    208.462
UnlockX                      640     0.007      0.291
FIND_FIRST                 66630     0.566    530.339
SET_FILE_INFORMATION       16000     1.419    811.494
QUERY_FILE_INFORMATION     30717     0.004      1.108
QUERY_PATH_INFORMATION    176153     0.182    517.419
QUERY_FS_INFORMATION       30857     0.018     18.562
NTCreateX                 184145     0.281    582.076

Throughput 8.93961 MB/sec  64 clients  64 procs  max_latency=15090.026 ms

MQ-DEADLINE WITH MING'S PATCHES

Operation                  Count    AvgLat    MaxLat
----------------------------------------------------
Flush                      13760    48.650    431.525
Close                     144320     0.004      7.605
LockX
Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
On Mon, Aug 07, 2017 at 06:06:11PM -0400, Laurence Oberman wrote: > On Mon, Aug 7, 2017 at 8:48 AM, Laurence Oberman > wrote: > > > > > > > On 08/05/2017 02:56 AM, Ming Lei wrote: > > > >> In Red Hat internal storage test wrt. blk-mq scheduler, we > >> found that I/O performance is much bad with mq-deadline, especially > >> about sequential I/O on some multi-queue SCSI devcies(lpfc, qla2xxx, > >> SRP...) > >> > >> Turns out one big issue causes the performance regression: requests > >> are still dequeued from sw queue/scheduler queue even when ldd's > >> queue is busy, so I/O merge becomes quite difficult to make, then > >> sequential IO degrades a lot. > >> > >> The 1st five patches improve this situation, and brings back > >> some performance loss. > >> > >> But looks they are still not enough. It is caused by > >> the shared queue depth among all hw queues. For SCSI devices, > >> .cmd_per_lun defines the max number of pending I/O on one > >> request queue, which is per-request_queue depth. So during > >> dispatch, if one hctx is too busy to move on, all hctxs can't > >> dispatch too because of the per-request_queue depth. > >> > >> Patch 6 ~ 14 use per-request_queue dispatch list to avoid > >> to dequeue requests from sw/scheduler queue when lld queue > >> is busy. > >> > >> Patch 15 ~20 improve bio merge via hash table in sw queue, > >> which makes bio merge more efficient than current approch > >> in which only the last 8 requests are checked. Since patch > >> 6~14 converts to the scheduler way of dequeuing one request > >> from sw queue one time for SCSI device, and the times of > >> acquring ctx->lock is increased, and merging bio via hash > >> table decreases holding time of ctx->lock and should eliminate > >> effect from patch 14. 
> >> > >> With this changes, SCSI-MQ sequential I/O performance is > >> improved much, for lpfc, it is basically brought back > >> compared with block legacy path[1], especially mq-deadline > >> is improved by > X10 [1] on lpfc and by > 3X on SCSI SRP, > >> For mq-none it is improved by 10% on lpfc, and write is > >> improved by > 10% on SRP too. > >> > >> Also Bart worried that this patchset may affect SRP, so provide > >> test data on SCSI SRP this time: > >> > >> - fio(libaio, bs:4k, dio, queue_depth:64, 64 jobs) > >> - system(16 cores, dual sockets, mem: 96G) > >> > >>|v4.13-rc3 |v4.13-rc3 | v4.13-rc3+patches | > >>|blk-legacy dd |blk-mq none | blk-mq none | > >> ---| > >> read :iops| 587K | 526K | 537K | > >> randread :iops| 115K | 140K | 139K | > >> write:iops| 596K | 519K | 602K | > >> randwrite:iops| 103K | 122K | 120K | > >> > >> > >>|v4.13-rc3 |v4.13-rc3 | v4.13-rc3+patches > >>|blk-legacy dd |blk-mq dd | blk-mq dd| > >> > >> read :iops| 587K | 155K | 522K | > >> randread :iops| 115K | 140K | 141K | > >> write:iops| 596K | 135K | 587K | > >> randwrite:iops| 103K | 120K | 118K | > >> > >> V2: > >> - dequeue request from sw queues in round roubin's style > >> as suggested by Bart, and introduces one helper in sbitmap > >> for this purpose > >> - improve bio merge via hash table from sw queue > >> - add comments about using DISPATCH_BUSY state in lockless way, > >> simplifying handling on busy state, > >> - hold ctx->lock when clearing ctx busy bit as suggested > >> by Bart > >> > >> > >> [1] http://marc.info/?l=linux-block&m=150151989915776&w=2 > >> > >> Ming Lei (20): > >>blk-mq-sched: fix scheduler bad performance > >>sbitmap: introduce __sbitmap_for_each_set() > >>blk-mq: introduce blk_mq_dispatch_rq_from_ctx() > >>blk-mq-sched: move actual dispatching into one helper > >>blk-mq-sched: improve dispatching from sw queue > >>blk-mq-sched: don't dequeue request until all in ->dispatch are > >> flushed > >>blk-mq-sched: introduce blk_mq_sched_queue_depth() > 
>>blk-mq-sched: use q->queue_depth as hint for q->nr_requests > >>blk-mq: introduce BLK_MQ_F_SHARED_DEPTH > >>blk-mq-sched: introduce helpers for query, change busy state > >>blk-mq: introduce helpers for operating ->dispatch list > >>blk-mq: introduce pointers to dispatch lock & list > >>blk-mq: pass 'request_queue *' to several helpers of operating BUSY > >>blk-mq-sched: improve IO scheduling on SCSI devcie > >>block: introduce rqhash helpers > >>block: move actual bio merge code into __elv_merge > >>block: add check on elevator for supporting bio merge via hashtable > >> from blk-mq sw queue > >>block: introduce .last_merge and .hash to blk_mq_ctx > >>
Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
On Mon, Aug 07, 2017 at 01:29:41PM -0400, Laurence Oberman wrote:
> On 08/07/2017 11:27 AM, Bart Van Assche wrote:
> > On Mon, 2017-08-07 at 08:48 -0400, Laurence Oberman wrote:
> > > I tested this series using Ming's tests as well as my own set of tests
> > > typically run against changes to upstream code in my SRP test-bed.
> > > My tests also include very large sequential buffered and un-buffered I/O.
> > >
> > > This series seems to be fine for me. I did uncover another issue that is
> > > unrelated to these patches and also exists in 4.13-RC3 generic that I am
> > > still debugging.
> >
> > Hello Laurence,
> >
> > What kind of tests did you run? Only functional tests, or also performance
> > tests?
> >
> > Has the issue you encountered with kernel 4.13-rc3 already been reported on
> > the linux-rdma mailing list?
> >
> > Thanks,
> >
> > Bart.
>
> Hi Bart
>
> Actually I was focusing on just performance to see if we had any regressions
> with Ming's new patches for the large sequential I/O cases.
>
> Ming had already tested the small I/O performance cases, so I was making sure
> the large sequential I/O tests did not suffer.
>
> The 4MB un-buffered direct read tests to DM devices seem to perform much
> the same in my test bed.
> The 4MB buffered and un-buffered writes also seem to be well behaved,
> with not much improvement.

As I described, this patchset improves I/O scheduling, and especially I/O
merging for sequential I/O. BLK_DEF_MAX_SECTORS is defined as 2560 (1280K),
so it is expected that this patchset can't help the 4MB I/O case, because
there is no merging for 4MB I/Os. But the result is still positive, since
there isn't a regression with the patchset.

> These were not exhaustive tests and did not include my usual port disconnect
> and recovery tests either.
> I was just making sure we did not regress with Ming's changes.
>
> I was only just starting to baseline test the mq-deadline scheduler, as prior
> to 4.13-RC3 I had not been testing any of the new MQ schedulers.
> I had always only tested with [none].
>
> The tests were with [none] and [mq-deadline].
>
> The new issue I started seeing was not reported yet, as I was still
> investigating it.
>
> In summary:
> With buffered writes we see the clone fail in blk_get_request() in both
> Ming's kernel and in the upstream 4.13-RC3 stock kernel:
>
> [  885.271451] io scheduler mq-deadline registered
> [  898.455442] device-mapper: multipath: blk_get_request() returned -11 -
> requeuing

-11 is -EAGAIN, and it isn't an error. GFP_ATOMIC is passed to
blk_get_request() in multipath_clone_and_map(), so it isn't strange to see
this failure, especially when there is a lot of concurrent I/O.

> This is due to
>
> multipath_clone_and_map()
>
> /*
>  * Map cloned requests (request-based multipath)
>  */
> static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
>                                    union map_info *map_context,
>                                    struct request **__clone)
> {
>         ...
>         clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
>         if (IS_ERR(clone)) {
>                 /* EBUSY, ENODEV or EWOULDBLOCK: requeue */
>                 bool queue_dying = blk_queue_dying(q);
>
>                 DMERR_LIMIT("blk_get_request() returned %ld%s - requeuing",
>                             PTR_ERR(clone),
>                             queue_dying ? " (path offline)" : "");
>                 if (queue_dying) {
>                         atomic_inc(&m->pg_init_in_progress);
>                         activate_or_offline_path(pgpath);
>                         return DM_MAPIO_REQUEUE;
>                 }
>                 return DM_MAPIO_DELAY_REQUEUE;
>         }
>
> Still investigating, but it leads to a hard lockup.
>
> So I still need to see if the hard lockup happens in the stock kernel with
> mq-deadline, and do some other work, before coming up with a full summary of
> the issue.
>
> I also intend to re-run all tests, including disconnect and reconnect tests,
> on both mq-deadline and none.
> > Trace below > > > [ 1553.167357] NMI watchdog: Watchdog detected hard LOCKUP on cpu 4 > [ 1553.167359] Modules linked in: mq_deadline binfmt_misc dm_round_robin > xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter > ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set > nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat > nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 rpcrdma ip6table_mangle > ip6table_security ip6table_raw iptable_nat ib_isert nf_conntrack_ipv4 > iscsi_target_mod nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ib_iser > libiscsi scsi_transport_iscsi iptable_mangle iptable_security iptable_raw > ebtable_filter ebtables target_core_mod ip6table_filter ip6_tables > iptable_filter ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs > ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core intel_powerclamp coretemp > kvm_intel kvm irqbypass crct10dif_
Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
On Mon, 2017-08-07 at 18:06 -0400, Laurence Oberman wrote: > With [mq-deadline] enabled on stock I dont see them at all and it behaves. > > Now with Ming's patches if we enable [mq-deadline] we DO see the clone > failures and the hard lockup so we have opposit behaviour with the > scheduler choice and we have the hard lockup. > > On Ming's kernel with [none] we are well behaved and that was my original > focus, testing on [none] and hence my Tested-by: pass. > > So more investigation is needed here. Hello Laurence, Was debugfs enabled on your test setup? Had you perhaps collected the contents of the block layer debugfs files after the lockup occurred, e.g. as follows: (cd /sys/kernel/debug/block && find -type f | xargs grep -aH '')? Thanks, Bart.
Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
On 08/07/2017 02:46 PM, Laurence Oberman wrote: On 08/07/2017 01:29 PM, Laurence Oberman wrote: On 08/07/2017 11:27 AM, Bart Van Assche wrote: On Mon, 2017-08-07 at 08:48 -0400, Laurence Oberman wrote: I tested this series using Ming's tests as well as my own set of tests typically run against changes to upstream code in my SRP test-bed. My tests also include very large sequential buffered and un-buffered I/O. This series seems to be fine for me. I did uncover another issue that is unrelated to these patches and also exists in 4.13-RC3 generic that I am still debugging. Hello Laurence, What kind of tests did you run? Only functional tests or also performance tests? Has the issue you encountered with kernel 4.13-rc3 already been reported on the linux-rdma mailing list? Thanks, Bart. Hi Bart Actually I was focusing on just performance to see if we had any regressions with Mings new patches for the large sequential I/O cases. Ming had already tested the small I/O performance cases so I was making sure the large I/O sequential tests did not suffer. The 4MB un-buffered direct read tests to DM devices seems to perform much the same in my test bed. The 4MB buffered and un-buffered 4MB writes also seem to be well behaved with not much improvement. These were not exhaustive tests and did not include my usual port disconnect and recovery tests either. I was just making sure we did not regress with Ming's changes. I was only just starting to baseline test the mq-deadline scheduler as prior to 4.13-RC3 I had not been testing any of the new MQ schedulers. I had always only tested with [none] The tests were with [none] and [mq-deadline] The new issue I started seeing was not yet reported yet as I was still investigating it. 
>> In summary:
>>
>> With buffered writes we see the clone fail in blk_get_request() in both
>> Ming's kernel and in the upstream 4.13-RC3 stock kernel:
>>
>> [  885.271451] io scheduler mq-deadline registered
>> [  898.455442] device-mapper: multipath: blk_get_request() returned -11 - requeuing
>>
>> This is due to multipath_clone_and_map():
>>
>> /*
>>  * Map cloned requests (request-based multipath)
>>  */
>> static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
>> 				   union map_info *map_context,
>> 				   struct request **__clone)
>> {
>> ...
>> 	clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
>> 	if (IS_ERR(clone)) {
>> 		/* EBUSY, ENODEV or EWOULDBLOCK: requeue */
>> 		bool queue_dying = blk_queue_dying(q);
>>
>> 		DMERR_LIMIT("blk_get_request() returned %ld%s - requeuing",
>> 			    PTR_ERR(clone),
>> 			    queue_dying ? " (path offline)" : "");
>> 		if (queue_dying) {
>> 			atomic_inc(&m->pg_init_in_progress);
>> 			activate_or_offline_path(pgpath);
>> 			return DM_MAPIO_REQUEUE;
>> 		}
>> 		return DM_MAPIO_DELAY_REQUEUE;
>> 	}
>>
>> Still investigating, but it leads to a hard lockup.
>>
>> So I still need to see if the hard lockup happens in the stock kernel
>> with mq-deadline, and do some other work, before coming up with a full
>> summary of the issue. I also intend to re-run all tests, including
>> disconnect and reconnect tests, on both mq-deadline and none.
>> Trace below:
>>
>> [ 1553.167357] NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
>> [ 1553.167359] Modules linked in: mq_deadline binfmt_misc dm_round_robin xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 rpcrdma ip6table_mangle ip6table_security ip6table_raw iptable_nat ib_isert nf_conntrack_ipv4 iscsi_target_mod nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ib_iser libiscsi scsi_transport_iscsi iptable_mangle iptable_security iptable_raw ebtable_filter ebtables target_core_mod ip6table_filter ip6_tables iptable_filter ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul
>> [ 1553.167385] crc32_pclmul ghash_clmulni_intel pcbc aesni_intel sg joydev ipmi_si hpilo crypto_simd hpwdt iTCO_wdt cryptd ipmi_devintf glue_helper gpio_ich iTCO_vendor_support shpchp ipmi_msghandler pcspkr acpi_power_meter i7core_edac lpc_ich pcc_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_multipath ip_tables xfs libcrc32c sd_mod amdkfd amd_iommu_v2 radeon i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mlx5_core drm mlxfw i2c_core ptp serio_raw hpsa crc32c_intel bnx2 pps_core devlink scsi_transport_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ib_srpt]
>> [ 1553.167410] CPU: 4 PID: 11532 Comm: dd Tainted: G          I 4.13.0-rc3lobeming.ming_V4+ #20
>> [ 1553.167411] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
On Mon, 2017-08-07 at 08:48 -0400, Laurence Oberman wrote:
> I tested this series using Ming's tests as well as my own set of tests
> typically run against changes to upstream code in my SRP test-bed.
> My tests also include very large sequential buffered and un-buffered I/O.
>
> This series seems to be fine for me. I did uncover another issue that is
> unrelated to these patches and also exists in 4.13-RC3 generic that I am
> still debugging.

Hello Laurence,

What kind of tests did you run? Only functional tests, or also performance
tests? Has the issue you encountered with kernel 4.13-rc3 already been
reported on the linux-rdma mailing list?

Thanks,

Bart.
Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance
On 08/05/2017 02:56 AM, Ming Lei wrote:
> In Red Hat internal storage tests wrt. the blk-mq scheduler, we found
> that I/O performance is quite bad with mq-deadline, especially for
> sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx, SRP...).
>
> It turns out one big issue causes the performance regression: requests
> are still dequeued from the sw queue/scheduler queue even when the lld's
> queue is busy, so I/O merging becomes quite difficult, and sequential I/O
> degrades a lot.
>
> The first five patches improve this situation and bring back some of the
> performance loss, but they are still not enough. The remaining loss is
> caused by the queue depth being shared among all hw queues: for SCSI
> devices, .cmd_per_lun defines the max number of pending I/Os on one
> request queue, i.e. a per-request_queue depth. So during dispatch, if one
> hctx is too busy to move on, no hctx can dispatch, because of the
> per-request_queue depth.
>
> Patches 6-14 use a per-request_queue dispatch list to avoid dequeuing
> requests from the sw/scheduler queue when the lld queue is busy.
>
> Patches 15-20 improve bio merging via a hash table in the sw queue,
> which makes bio merging more efficient than the current approach, in
> which only the last 8 requests are checked. Since patches 6-14 convert
> to the scheduler's way of dequeuing one request at a time from the sw
> queue for SCSI devices, the number of times ctx->lock is acquired
> increases; merging bios via the hash table decreases the hold time of
> ctx->lock and should eliminate the effect of patch 14.
>
> With these changes, SCSI-MQ sequential I/O performance improves a lot.
> For lpfc it is basically brought back level with the block legacy path
> [1]; in particular, mq-deadline improves by > 10X [1] on lpfc and by
> > 3X on SCSI SRP. For mq-none, lpfc improves by 10%, and write improves
> by > 10% on SRP too.
Also, Bart worried that this patchset may affect SRP, so here is test data
on SCSI SRP this time:

- fio (libaio, bs: 4k, direct I/O, queue depth: 64, 64 jobs)
- system (16 cores, dual sockets, mem: 96G)

               |v4.13-rc3     |v4.13-rc3    |v4.13-rc3+patches
               |blk-legacy dd |blk-mq none  |blk-mq none
---------------|--------------|-------------|------------------
read     :iops |587K          |526K         |537K
randread :iops |115K          |140K         |139K
write    :iops |596K          |519K         |602K
randwrite:iops |103K          |122K         |120K

               |v4.13-rc3     |v4.13-rc3    |v4.13-rc3+patches
               |blk-legacy dd |blk-mq dd    |blk-mq dd
---------------|--------------|-------------|------------------
read     :iops |587K          |155K         |522K
randread :iops |115K          |140K         |141K
write    :iops |596K          |135K         |587K
randwrite:iops |103K          |120K         |118K

V2:
- dequeue requests from sw queues in round-robin style, as suggested by
  Bart, and introduce one helper in sbitmap for this purpose
- improve bio merging via a hash table in the sw queue
- add comments about using the DISPATCH_BUSY state in a lockless way,
  simplifying the handling of the busy state
- hold ctx->lock when clearing the ctx busy bit, as suggested by Bart

[1] http://marc.info/?l=linux-block&m=150151989915776&w=2

Ming Lei (20):
  blk-mq-sched: fix scheduler bad performance
  sbitmap: introduce __sbitmap_for_each_set()
  blk-mq: introduce blk_mq_dispatch_rq_from_ctx()
  blk-mq-sched: move actual dispatching into one helper
  blk-mq-sched: improve dispatching from sw queue
  blk-mq-sched: don't dequeue request until all in ->dispatch are flushed
  blk-mq-sched: introduce blk_mq_sched_queue_depth()
  blk-mq-sched: use q->queue_depth as hint for q->nr_requests
  blk-mq: introduce BLK_MQ_F_SHARED_DEPTH
  blk-mq-sched: introduce helpers for query, change busy state
  blk-mq: introduce helpers for operating ->dispatch list
  blk-mq: introduce pointers to dispatch lock & list
  blk-mq: pass 'request_queue *' to several helpers of operating BUSY
  blk-mq-sched: improve IO scheduling on SCSI device
  block: introduce rqhash helpers
  block: move actual bio merge code into __elv_merge
  block: add check on elevator for supporting bio merge via hashtable from blk-mq sw queue
  block: introduce .last_merge and .hash to blk_mq_ctx
  blk-mq-sched: refactor blk_mq_sched_try_merge()
  blk-mq: improve bio merge from blk-mq sw queue

 block/blk-mq-debugfs.c  |  12 ++--
 block/blk-mq-sched.c    | 187 +---
 block/blk-mq-sched.h    |  23 ++
 block/blk-mq.c          | 133 +++---
 block/blk-mq.h          |  73 +++
 block/blk-settings.c    |   2 +
 block/blk.h             |  55 ++
 block/elevator.c        |  93 ++--
 include/linux/blk-mq.h  |   5 ++
 include/linux/blkdev.h  |   5 ++
 include/linux/sbitmap.h |  54
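The fio parameters listed in the cover letter (libaio, 4k blocks, direct I/O, queue depth 64, 64 jobs) correspond roughly to an invocation along the following lines. The device path, job name, and runtime are placeholders; the actual job file was not posted in the thread.

```shell
#!/bin/sh
# Approximate reconstruction of the benchmark described above.
# DEV and the runtime are placeholders, not from the original post.
DEV=${DEV:-/dev/dm-0}
RW=${RW:-read}      # read | randread | write | randwrite

fio_cmd() {
	printf '%s' "fio --name=srp-test --filename=$DEV --ioengine=libaio \
--direct=1 --bs=4k --iodepth=64 --numjobs=64 --rw=$RW \
--group_reporting --runtime=60 --time_based"
}

# Print the command rather than running it, so it can be reviewed first.
echo "$(fio_cmd)"
```

Running each of the four --rw modes against both the legacy and blk-mq paths (and with [none] vs. [mq-deadline]) reproduces the grid of IOPS numbers in the tables above.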