Re: [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)

2017-10-10 Thread John Garry

On 10/10/2017 14:45, Ming Lei wrote:

Hi John,

All changes in V6.2 are blk-mq/scsi-mq only, which shouldn't
affect the non-SCSI_MQ path, so I suggest you compare the perf
between deadline and mq-deadline, as Johannes mentioned.


>
> V6.2 series with default SCSI_MQ
> read, rw, write IOPS   
> 700K, 130K/128K, 640K

If possible, could you provide your fio script and logs for both
non SCSI_MQ (deadline) and SCSI_MQ (mq-deadline)? Maybe some clues
can be figured out.

Also, I have just put another patch on the V6.2 branch, which may
improve things a bit too. You may try it in your test.


https://github.com/ming1/linux/commit/e31e2eec46c9b5ae7cfa181e9b77adad2c6a97ce

--
Ming


Hi Ming Lei,

OK, I have tested deadline vs mq-deadline on your v6.2 branch and 
4.14-rc2. Unfortunately I don't have time now to test your experimental 
patches.


4.14-rc2 without default SCSI_MQ, deadline scheduler
read, rw, write IOPS
920K, 115K/115K, 806K

4.14-rc2 with default SCSI_MQ, mq-deadline scheduler
read, rw, write IOPS
280K, 99K/99K, 300K

V6.2 series without default SCSI_MQ, deadline scheduler
read, rw, write IOPS
919K, 117K/117K, 806K

V6.2 series with default SCSI_MQ, mq-deadline scheduler
read, rw, write IOPS
688K, 128K/128K, 630K

I think that the non-mq results look a bit more sensible - that is, 
they are consistent.
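
As a quick sanity check on the size of the gap (just arithmetic on the V6.2 figures above, nothing more):

```python
# IOPS (in thousands) from the V6.2 runs above.
legacy = {"read": 919, "write": 806}    # deadline, without SCSI_MQ
scsi_mq = {"read": 688, "write": 630}   # mq-deadline, with SCSI_MQ

for op in ("read", "write"):
    drop = 100.0 * (legacy[op] - scsi_mq[op]) / legacy[op]
    print(f"{op}: {drop:.0f}% lower with SCSI_MQ")
```

i.e. roughly a 25% read and 22% write deficit remains with mq-deadline even with the series applied.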


Here's my script sample:
[global]
rw=rw
direct=1
ioengine=libaio
iodepth=2048
numjobs=1
bs=4k
;size=1024m
;zero_buffers=1
group_reporting=1
;ioscheduler=noop
cpumask=0xff
;cpus_allowed=0-3
;gtod_reduce=1
;iodepth_batch=2
;iodepth_batch_complete=2
runtime=1
;thread
loops = 1

[job1]
filename=/dev/sdb:
[job1]
filename=/dev/sdc:
[job1]
filename=/dev/sdd:
[job1]
filename=/dev/sde:
[job1]
filename=/dev/sdf:
[job1]
filename=/dev/sdg:
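
(As an aside for anyone scripting the deadline vs mq-deadline comparison across those devices: the active scheduler is the bracketed entry in each device's queue/scheduler sysfs file. A small helper sketch for parsing it - illustrative only, not part of the test setup:)

```python
def active_scheduler(text: str) -> str:
    """Return the active scheduler from a sysfs queue/scheduler file.

    The kernel marks the active scheduler with brackets, e.g.
    'noop [cfq]' or '[mq-deadline] kyber none'.
    """
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return text.strip()  # some queues list a single scheduler bare

# The same strings pasted from sysfs in this thread:
print(active_scheduler("noop [cfq]"))                # cfq
print(active_scheduler("[mq-deadline] kyber none"))  # mq-deadline
```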

John




Re: [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)

2017-10-10 Thread Ming Lei
On Tue, Oct 10, 2017 at 01:24:52PM +0100, John Garry wrote:
> On 10/10/2017 02:46, Ming Lei wrote:
> > > > > > I tested this series for the SAS controller on HiSilicon hip07 
> > > > > > platform as I
> > > > > > am interested in enabling MQ for this driver. Driver is
> > > > > > ./drivers/scsi/hisi_sas/.
> > > > > >
> > > > > > So I found that performance is improved when enabling default 
> > > > > > SCSI_MQ
> > > > > > with this series vs baseline. However, it is still not as good as 
> > > > > > when
> > > > > > default SCSI_MQ is disabled.
> > > > > >
> > > > > > Here are some figures I got with fio:
> > > > > > 4.14-rc2 without default SCSI_MQ
> > > > > > read, rw, write IOPS
> > > > > > 952K, 133K/133K, 800K
> > > > > >
> > > > > > 4.14-rc2 with default SCSI_MQ
> > > > > > read, rw, write IOPS
> > > > > > 311K, 117K/117K, 320K
> > > > > >
> > > > > > This series* without default SCSI_MQ
> > > > > > read, rw, write IOPS
> > > > > > 975K, 132K/132K, 790K
> > > > > >
> > > > > > This series* with default SCSI_MQ
> > > > > > read, rw, write IOPS
> > > > > > 770K, 164K/164K, 594K
> > > >
> > > > Thanks for testing this patchset!
> > > >
> > > > Looks like there is a big improvement, but the gap compared with
> > > > the legacy block path is not small either.
> > > >
> > > > > >
> > > > > > Please note that hisi_sas driver does not enable mq by exposing 
> > > > > > multiple
> > > > > > queues to upper layer (even though it has multiple queues). I have 
> > > > > > been
> > > > > > playing with enabling it, but my performance is always worse...
> > > > > >
> > > > > > * I'm using
> > > > > > https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V5.1,
> > > > > > as advised by Ming Lei.
> > > >
> > > > Could you test on the following branch and see if it makes a
> > > > difference?
> > > >
> > > > 
> > > > https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V6.1_test
> > Hi John,
> > 
> > Please test the following branch directly:
> > 
> > https://github.com/ming1/linux/tree/blk_mq_improve_scsi_mpath_perf_V6.2_test
> > 
> > The code is simplified and cleaned up a lot in V6.2, so only two extra
> > patches (the top 2) are needed against V6, which was posted yesterday.
> > 
> > Please test SCSI_MQ with mq-deadline, which should be the default
> > mq scheduler on your HiSilicon SAS.
> 
> Hi Ming Lei,
> 
> It's using cfq (for non-mq) and mq-deadline (obviously for mq).
> 
> root@(none)$ pwd
> /sys/devices/platform/HISI0162:01/host0/port-0:0/expander-0:0/port-0:0:7/end_device-0:0:7
> root@(none)$ more ./target0:0:3/0:0:3:0/block/sdd/queue/scheduler
> noop [cfq]
> 
> and
> 
> root@(none)$ more ./target0:0:3/0:0:3:0/block/sdd/queue/scheduler
> [mq-deadline] kyber none
> 
> Unfortunately my setup has changed since yesterday, and the absolute figures
> are not exactly the same (I retested 4.14-rc2). However, we still see that
> drop when mq is enabled.
> 
> Here's the results:
> 4.14-rc4 without default SCSI_MQ
> read, rw, write IOPS  
> 860K, 112K/112K, 800K
> 
> 4.14-rc2 without default SCSI_MQ
> read, rw, write IOPS  
> 880K, 113K/113K, 808K
> 
> V6.2 series without default SCSI_MQ
> read, rw, write IOPS  
> 820K, 114K/114K, 790K

Hi John,

All changes in V6.2 are blk-mq/scsi-mq only, which shouldn't
affect the non-SCSI_MQ path, so I suggest you compare the perf
between deadline and mq-deadline, as Johannes mentioned.

> 
> V6.2 series with default SCSI_MQ
> read, rw, write IOPS  
> 700K, 130K/128K, 640K

If possible, could you provide your fio script and logs for both
non SCSI_MQ (deadline) and SCSI_MQ (mq-deadline)? Maybe some clues
can be figured out.

Also, I have just put another patch on the V6.2 branch, which may
improve things a bit too. You may try it in your test.


https://github.com/ming1/linux/commit/e31e2eec46c9b5ae7cfa181e9b77adad2c6a97ce

-- 
Ming


Re: [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)

2017-10-10 Thread Paolo Valente

> On 10 Oct 2017, at 14:34, Johannes Thumshirn 
>  wrote:
> 
> Hi John,
> 
> On Tue, Oct 10, 2017 at 01:24:52PM +0100, John Garry wrote:
>> It's using cfq (for non-mq) and mq-deadline (obviously for mq).
> 
> Please be aware that cfq and mq-deadline are _not_ comparable; for a realistic
> comparison, please use deadline and mq-deadline, or cfq and bfq.
> 

Please set low_latency=0 for bfq if yours is just a maximum-throughput test.

Thanks,
Paolo

>> root@(none)$ pwd
>> /sys/devices/platform/HISI0162:01/host0/port-0:0/expander-0:0/port-0:0:7/end_device-0:0:7
>> root@(none)$ more ./target0:0:3/0:0:3:0/block/sdd/queue/scheduler
>> noop [cfq]
> 
> Maybe missing CONFIG_IOSCHED_DEADLINE?
> 
> Thanks,
>   Johannes
> 
> -- 
> Johannes Thumshirn  Storage
jthumsh...@suse.de   +49 911 74053 689
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)
> Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850



Re: [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)

2017-10-10 Thread Johannes Thumshirn
Hi John,

On Tue, Oct 10, 2017 at 01:24:52PM +0100, John Garry wrote:
> It's using cfq (for non-mq) and mq-deadline (obviously for mq).

Please be aware that cfq and mq-deadline are _not_ comparable; for a realistic
comparison, please use deadline and mq-deadline, or cfq and bfq.

> root@(none)$ pwd
> /sys/devices/platform/HISI0162:01/host0/port-0:0/expander-0:0/port-0:0:7/end_device-0:0:7
> root@(none)$ more ./target0:0:3/0:0:3:0/block/sdd/queue/scheduler
> noop [cfq]

Maybe missing CONFIG_IOSCHED_DEADLINE?

Thanks,
Johannes

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de   +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)

2017-10-10 Thread John Garry

On 10/10/2017 02:46, Ming Lei wrote:

> > I tested this series for the SAS controller on HiSilicon hip07 platform as I
> > am interested in enabling MQ for this driver. Driver is
> > ./drivers/scsi/hisi_sas/.
> >
> > So I found that performance is improved when enabling default SCSI_MQ
> > with this series vs baseline. However, it is still not as good as when
> > default SCSI_MQ is disabled.
> >
> > Here are some figures I got with fio:
> > 4.14-rc2 without default SCSI_MQ
> > read, rw, write IOPS  
> > 952K, 133K/133K, 800K
> >
> > 4.14-rc2 with default SCSI_MQ
> > read, rw, write IOPS  
> > 311K, 117K/117K, 320K
> >
> > This series* without default SCSI_MQ
> > read, rw, write IOPS  
> > 975K, 132K/132K, 790K
> >
> > This series* with default SCSI_MQ
> > read, rw, write IOPS  
> > 770K, 164K/164K, 594K

>
> Thanks for testing this patchset!
>
> Looks like there is a big improvement, but the gap compared with
> the legacy block path is not small either.
>

> >
> > Please note that hisi_sas driver does not enable mq by exposing multiple
> > queues to upper layer (even though it has multiple queues). I have been
> > playing with enabling it, but my performance is always worse...
> >
> > * I'm using
> > https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V5.1,
> > as advised by Ming Lei.

>
> Could you test on the following branch and see if it makes a
> difference?
>
>
https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V6.1_test

Hi John,

Please test the following branch directly:

https://github.com/ming1/linux/tree/blk_mq_improve_scsi_mpath_perf_V6.2_test

The code is simplified and cleaned up a lot in V6.2, so only two extra
patches (the top 2) are needed against V6, which was posted yesterday.

Please test SCSI_MQ with mq-deadline, which should be the default
mq scheduler on your HiSilicon SAS.


Hi Ming Lei,

It's using cfq (for non-mq) and mq-deadline (obviously for mq).

root@(none)$ pwd
/sys/devices/platform/HISI0162:01/host0/port-0:0/expander-0:0/port-0:0:7/end_device-0:0:7
root@(none)$ more ./target0:0:3/0:0:3:0/block/sdd/queue/scheduler
noop [cfq]

and

root@(none)$ more ./target0:0:3/0:0:3:0/block/sdd/queue/scheduler
[mq-deadline] kyber none

Unfortunately my setup has changed since yesterday, and the absolute 
figures are not exactly the same (I retested 4.14-rc2). However, we still 
see that drop when mq is enabled.


Here's the results:
4.14-rc4 without default SCSI_MQ
read, rw, write IOPS
860K, 112K/112K, 800K

4.14-rc2 without default SCSI_MQ
read, rw, write IOPS
880K, 113K/113K, 808K

V6.2 series without default SCSI_MQ
read, rw, write IOPS
820K, 114K/114K, 790K

V6.2 series with default SCSI_MQ
read, rw, write IOPS
700K, 130K/128K, 640K

Cheers,
John








Re: [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)

2017-10-09 Thread Ming Lei
On Mon, Oct 09, 2017 at 11:04:39PM +0800, Ming Lei wrote:
> Hi John,
> 
> On Mon, Oct 09, 2017 at 01:09:22PM +0100, John Garry wrote:
> > On 30/09/2017 11:27, Ming Lei wrote:
> > > Hi Jens,
> > > 
> > > In Red Hat's internal storage tests of the blk-mq scheduler, we
> > > found that I/O performance is much worse with mq-deadline, especially
> > > for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
> > > SRP...).
> > > 
> > > It turns out one big issue causes the performance regression: requests
> > > are still dequeued from the sw/scheduler queue even when the LLD's
> > > queue is busy, so I/O merging becomes very difficult, and
> > > sequential I/O degrades a lot.
> > > 
> > > This issue became one of the main reasons for reverting default SCSI_MQ
> > > in V4.13.
> > > 
> > > The 1st patch issues requests directly in blk_mq_request_bypass_insert(),
> > > so that we can improve dm-mpath's performance in part 2, which will
> > > be posted soon.
> > > 
> > > The next six patches improve this situation and bring back
> > > some of the lost performance.
> > > 
> > > With these changes, SCSI-MQ sequential I/O performance is
> > > much improved; Paolo reported that mq-deadline performance
> > > improved a lot [2] in his dbench test with V2. A performance
> > > improvement on lpfc/qla2xxx was also observed with V1 [1].
> > > 
> > > Please consider it for V4.15.
> > > 
> > > [1] http://marc.info/?l=linux-block&m=150151989915776&w=2
> > > [2] https://marc.info/?l=linux-block&m=150217980602843&w=2
> > > 
> > 
> > I tested this series for the SAS controller on HiSilicon hip07 platform as I
> > am interested in enabling MQ for this driver. Driver is
> > ./drivers/scsi/hisi_sas/.
> > 
> > So I found that performance is improved when enabling default SCSI_MQ
> > with this series vs baseline. However, it is still not as good as when
> > default SCSI_MQ is disabled.
> > 
> > Here are some figures I got with fio:
> > 4.14-rc2 without default SCSI_MQ
> > read, rw, write IOPS
> > 952K, 133K/133K, 800K
> > 
> > 4.14-rc2 with default SCSI_MQ
> > read, rw, write IOPS
> > 311K, 117K/117K, 320K
> > 
> > This series* without default SCSI_MQ
> > read, rw, write IOPS
> > 975K, 132K/132K, 790K
> > 
> > This series* with default SCSI_MQ
> > read, rw, write IOPS
> > 770K, 164K/164K, 594K
> 
> Thanks for testing this patchset!
> 
> Looks like there is a big improvement, but the gap compared with
> the legacy block path is not small either.
> 
> > 
> > Please note that hisi_sas driver does not enable mq by exposing multiple
> > queues to upper layer (even though it has multiple queues). I have been
> > playing with enabling it, but my performance is always worse...
> > 
> > * I'm using
> > https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V5.1,
> > as advised by Ming Lei.
> 
> Could you test on the following branch and see if it makes a
> difference?
> 
>   
> https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V6.1_test

Hi John,

Please test the following branch directly:

https://github.com/ming1/linux/tree/blk_mq_improve_scsi_mpath_perf_V6.2_test

The code is simplified and cleaned up a lot in V6.2, so only two extra
patches (the top 2) are needed against V6, which was posted yesterday.

Please test SCSI_MQ with mq-deadline, which should be the default
mq scheduler on your HiSilicon SAS.

-- 
Ming


Re: [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)

2017-10-09 Thread Ming Lei
Hi John,

On Mon, Oct 09, 2017 at 01:09:22PM +0100, John Garry wrote:
> On 30/09/2017 11:27, Ming Lei wrote:
> > Hi Jens,
> > 
> > In Red Hat's internal storage tests of the blk-mq scheduler, we
> > found that I/O performance is much worse with mq-deadline, especially
> > for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
> > SRP...).
> > 
> > It turns out one big issue causes the performance regression: requests
> > are still dequeued from the sw/scheduler queue even when the LLD's
> > queue is busy, so I/O merging becomes very difficult, and
> > sequential I/O degrades a lot.
> > 
> > This issue became one of the main reasons for reverting default SCSI_MQ
> > in V4.13.
> > 
> > The 1st patch issues requests directly in blk_mq_request_bypass_insert(),
> > so that we can improve dm-mpath's performance in part 2, which will
> > be posted soon.
> > 
> > The next six patches improve this situation and bring back
> > some of the lost performance.
> > 
> > With these changes, SCSI-MQ sequential I/O performance is
> > much improved; Paolo reported that mq-deadline performance
> > improved a lot [2] in his dbench test with V2. A performance
> > improvement on lpfc/qla2xxx was also observed with V1 [1].
> > 
> > Please consider it for V4.15.
> > 
> > [1] http://marc.info/?l=linux-block&m=150151989915776&w=2
> > [2] https://marc.info/?l=linux-block&m=150217980602843&w=2
> > 
> 
> I tested this series for the SAS controller on HiSilicon hip07 platform as I
> am interested in enabling MQ for this driver. Driver is
> ./drivers/scsi/hisi_sas/.
> 
> So I found that performance is improved when enabling default SCSI_MQ
> with this series vs baseline. However, it is still not as good as when
> default SCSI_MQ is disabled.
> 
> Here are some figures I got with fio:
> 4.14-rc2 without default SCSI_MQ
> read, rw, write IOPS  
> 952K, 133K/133K, 800K
> 
> 4.14-rc2 with default SCSI_MQ
> read, rw, write IOPS  
> 311K, 117K/117K, 320K
> 
> This series* without default SCSI_MQ
> read, rw, write IOPS  
> 975K, 132K/132K, 790K
> 
> This series* with default SCSI_MQ
> read, rw, write IOPS  
> 770K, 164K/164K, 594K

Thanks for testing this patchset!

Looks like there is a big improvement, but the gap compared with
the legacy block path is not small either.

> 
> Please note that hisi_sas driver does not enable mq by exposing multiple
> queues to upper layer (even though it has multiple queues). I have been
> playing with enabling it, but my performance is always worse...
> 
> * I'm using
> https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V5.1,
> as advised by Ming Lei.

Could you test on the following branch and see if it makes a
difference?


https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V6.1_test

BTW, one big change is in the following commit, which simply adopts the
legacy block path's policy for dequeuing requests. I can observe some
improvement on virtio-scsi with it too. This commit is just for
verification/debug purposes and has never been posted before.

https://github.com/ming1/linux/commit/94a117fdd9cfc1291445e5a35f04464c89c9ce70


Thanks,
Ming


Re: [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)

2017-10-09 Thread John Garry

On 30/09/2017 11:27, Ming Lei wrote:

Hi Jens,

In Red Hat's internal storage tests of the blk-mq scheduler, we
found that I/O performance is much worse with mq-deadline, especially
for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
SRP...).

It turns out one big issue causes the performance regression: requests
are still dequeued from the sw/scheduler queue even when the LLD's
queue is busy, so I/O merging becomes very difficult, and
sequential I/O degrades a lot.

This issue became one of the main reasons for reverting default SCSI_MQ
in V4.13.

The 1st patch issues requests directly in blk_mq_request_bypass_insert(),
so that we can improve dm-mpath's performance in part 2, which will
be posted soon.

The next six patches improve this situation and bring back
some of the lost performance.

With these changes, SCSI-MQ sequential I/O performance is
much improved; Paolo reported that mq-deadline performance
improved a lot [2] in his dbench test with V2. A performance
improvement on lpfc/qla2xxx was also observed with V1 [1].

Please consider it for V4.15.

[1] http://marc.info/?l=linux-block&m=150151989915776&w=2
[2] https://marc.info/?l=linux-block&m=150217980602843&w=2



I tested this series for the SAS controller on HiSilicon hip07 platform 
as I am interested in enabling MQ for this driver. Driver is 
./drivers/scsi/hisi_sas/.


So I found that performance is improved when enabling default 
SCSI_MQ with this series vs baseline. However, it is still not as good 
as when default SCSI_MQ is disabled.


Here are some figures I got with fio:
4.14-rc2 without default SCSI_MQ
read, rw, write IOPS
952K, 133K/133K, 800K

4.14-rc2 with default SCSI_MQ
read, rw, write IOPS
311K, 117K/117K, 320K

This series* without default SCSI_MQ
read, rw, write IOPS
975K, 132K/132K, 790K

This series* with default SCSI_MQ
read, rw, write IOPS
770K, 164K/164K, 594K

Please note that hisi_sas driver does not enable mq by exposing multiple 
queues to upper layer (even though it has multiple queues). I have been 
playing with enabling it, but my performance is always worse...


* I'm using 
https://github.com/ming1/linux/commits/blk_mq_improve_scsi_mpath_perf_V5.1, 
as advised by Ming Lei.


Thanks,
John


V5:
- address some comments from Omar
- add Tested-by & Reviewed-by tags
- use direct issue for blk_mq_request_bypass_insert(), and
start to consider improving sequential I/O for dm-mpath
- only include part 1 (the original patches 1 ~ 6), as suggested
by Omar

V4:
- add Reviewed-by tag
- some trivial changes: typo fixes in commit log or comments,
variable names, no actual functional change

V3:
- use pure round-robin for picking reqs from ctxs, as suggested
by Bart
- remove one local variable in __sbitmap_for_each_set()
- drop the single-dispatch-list patches, which can improve
performance on mq-deadline but cause a slight degradation on
none, because all hctxs need to be checked after ->dispatch
is flushed. Will post them again once they are mature.
- rebase on v4.13-rc6 with block for-next

V2:
- dequeue requests from sw queues in round-robin style,
as suggested by Bart, and introduce one helper in sbitmap
for this purpose
- improve bio merging via a hash table from the sw queue
- add comments about using the DISPATCH_BUSY state in a lockless way,
simplifying handling of the busy state
- hold ctx->lock when clearing the ctx busy bit, as suggested
by Bart


Ming Lei (7):
  blk-mq: issue rq directly in blk_mq_request_bypass_insert()
  blk-mq-sched: fix scheduler bad performance
  sbitmap: introduce __sbitmap_for_each_set()
  blk-mq: introduce blk_mq_dequeue_from_ctx()
  blk-mq-sched: move actual dispatching into one helper
  blk-mq-sched: improve dispatching from sw queue
  blk-mq-sched: don't dequeue request until all in ->dispatch are
flushed

 block/blk-core.c|   3 +-
 block/blk-mq-debugfs.c  |   1 +
 block/blk-mq-sched.c| 104 ---
 block/blk-mq.c  | 114 +++-
 block/blk-mq.h  |   4 +-
 drivers/md/dm-rq.c  |   2 +-
 include/linux/blk-mq.h  |   3 ++
 include/linux/sbitmap.h |  64 +++
 8 files changed, 238 insertions(+), 57 deletions(-)






Re: [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)

2017-09-30 Thread Ming Lei
On Sat, Sep 30, 2017 at 06:27:13PM +0800, Ming Lei wrote:
> Hi Jens,
> 
> In Red Hat's internal storage tests of the blk-mq scheduler, we
> found that I/O performance is much worse with mq-deadline, especially
> for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
> SRP...).
> 
> It turns out one big issue causes the performance regression: requests
> are still dequeued from the sw/scheduler queue even when the LLD's
> queue is busy, so I/O merging becomes very difficult, and
> sequential I/O degrades a lot.
> 
> This issue became one of the main reasons for reverting default SCSI_MQ
> in V4.13.
> 
> The 1st patch issues requests directly in blk_mq_request_bypass_insert(),
> so that we can improve dm-mpath's performance in part 2, which will
> be posted soon.
> 
> The next six patches improve this situation and bring back
> some of the lost performance.
> 
> With these changes, SCSI-MQ sequential I/O performance is
> much improved; Paolo reported that mq-deadline performance
> improved a lot [2] in his dbench test with V2. A performance
> improvement on lpfc/qla2xxx was also observed with V1 [1].
> 
> Please consider it for V4.15.
> 
> [1] http://marc.info/?l=linux-block&m=150151989915776&w=2
> [2] https://marc.info/?l=linux-block&m=150217980602843&w=2
> 
> V5:
>   - address some comments from Omar
>   - add Tested-by & Reviewed-by tags
>   - use direct issue for blk_mq_request_bypass_insert(), and
>   start to consider improving sequential I/O for dm-mpath
>   - only include part 1 (the original patches 1 ~ 6), as suggested
>   by Omar
> 
> V4:
>   - add Reviewed-by tag
>   - some trivial changes: typo fixes in commit log or comments,
>   variable names, no actual functional change
> 
> V3:
>   - use pure round-robin for picking reqs from ctxs, as suggested
>   by Bart
>   - remove one local variable in __sbitmap_for_each_set()
>   - drop the single-dispatch-list patches, which can improve
>   performance on mq-deadline but cause a slight degradation on
>   none, because all hctxs need to be checked after ->dispatch
>   is flushed. Will post them again once they are mature.
>   - rebase on v4.13-rc6 with block for-next
> 
> V2:
>   - dequeue requests from sw queues in round-robin style,
>   as suggested by Bart, and introduce one helper in sbitmap
>   for this purpose
>   - improve bio merging via a hash table from the sw queue
>   - add comments about using the DISPATCH_BUSY state in a lockless way,
>   simplifying handling of the busy state
>   - hold ctx->lock when clearing the ctx busy bit, as suggested
>   by Bart
> 
> 
> Ming Lei (7):
>   blk-mq: issue rq directly in blk_mq_request_bypass_insert()
>   blk-mq-sched: fix scheduler bad performance
>   sbitmap: introduce __sbitmap_for_each_set()
>   blk-mq: introduce blk_mq_dequeue_from_ctx()
>   blk-mq-sched: move actual dispatching into one helper
>   blk-mq-sched: improve dispatching from sw queue
>   blk-mq-sched: don't dequeue request until all in ->dispatch are
> flushed
> 
>  block/blk-core.c|   3 +-
>  block/blk-mq-debugfs.c  |   1 +
>  block/blk-mq-sched.c| 104 ---
>  block/blk-mq.c  | 114 
> +++-
>  block/blk-mq.h  |   4 +-
>  drivers/md/dm-rq.c  |   2 +-
>  include/linux/blk-mq.h  |   3 ++
>  include/linux/sbitmap.h |  64 +++
>  8 files changed, 238 insertions(+), 57 deletions(-)

Oops, the title should have been:

[PATCH V5 0/7] blk-mq-sched: improve sequential I/O performance(part 1)

Sorry for that.

-- 
Ming


[PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)

2017-09-30 Thread Ming Lei
Hi Jens,

In Red Hat's internal storage tests of the blk-mq scheduler, we
found that I/O performance is much worse with mq-deadline, especially
for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
SRP...).

It turns out one big issue causes the performance regression: requests
are still dequeued from the sw/scheduler queue even when the LLD's
queue is busy, so I/O merging becomes very difficult, and
sequential I/O degrades a lot.

This issue became one of the main reasons for reverting default SCSI_MQ
in V4.13.

The 1st patch issues requests directly in blk_mq_request_bypass_insert(),
so that we can improve dm-mpath's performance in part 2, which will
be posted soon.

The next six patches improve this situation and bring back
some of the lost performance.

With these changes, SCSI-MQ sequential I/O performance is
much improved; Paolo reported that mq-deadline performance
improved a lot [2] in his dbench test with V2. A performance
improvement on lpfc/qla2xxx was also observed with V1 [1].

Please consider it for V4.15.

[1] http://marc.info/?l=linux-block&m=150151989915776&w=2
[2] https://marc.info/?l=linux-block&m=150217980602843&w=2
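
(The dequeue-while-busy problem described above can be illustrated with a toy model - purely illustrative, not the kernel code: requests left in the scheduler queue can still back-merge with the next sequential bio, while requests eagerly drained to the dispatch list are fixed and go out as-is.)

```python
def submit(bios, hold_when_busy):
    """Toy model: sequential 1-sector bios arriving while the device is busy."""
    sched_q = []      # (start_sector, nr_sectors) still open to merging
    dispatched = []   # requests already moved to the dispatch list
    for start in bios:
        if sched_q and sum(sched_q[-1]) == start:
            s, n = sched_q[-1]
            sched_q[-1] = (s, n + 1)          # back-merge with the tail request
        else:
            sched_q.append((start, 1))
        if not hold_when_busy:
            # eager dequeue: drain the scheduler queue even though the
            # device is busy, killing future merge opportunities
            dispatched.extend(sched_q)
            sched_q.clear()
    dispatched.extend(sched_q)
    return dispatched

print(len(submit(range(64), hold_when_busy=False)))  # 64 one-sector requests
print(len(submit(range(64), hold_when_busy=True)))   # 1 merged 64-sector request
```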

V5:
- address some comments from Omar
- add Tested-by & Reviewed-by tags
- use direct issue for blk_mq_request_bypass_insert(), and
start to consider improving sequential I/O for dm-mpath
- only include part 1 (the original patches 1 ~ 6), as suggested
by Omar

V4:
- add Reviewed-by tag
- some trivial changes: typo fixes in commit log or comments,
variable names, no actual functional change

V3:
- use pure round-robin for picking reqs from ctxs, as suggested
by Bart
- remove one local variable in __sbitmap_for_each_set()
- drop the single-dispatch-list patches, which can improve
performance on mq-deadline but cause a slight degradation on
none, because all hctxs need to be checked after ->dispatch
is flushed. Will post them again once they are mature.
- rebase on v4.13-rc6 with block for-next

V2:
- dequeue requests from sw queues in round-robin style,
as suggested by Bart, and introduce one helper in sbitmap
for this purpose
- improve bio merging via a hash table from the sw queue
- add comments about using the DISPATCH_BUSY state in a lockless way,
simplifying handling of the busy state
- hold ctx->lock when clearing the ctx busy bit, as suggested
by Bart
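
(The round-robin dequeue mentioned in the V2/V3 notes can be sketched as scanning a bitmap of busy ctxs circularly from a rotating start index, so low-numbered sw queues are not always favoured - a toy sketch of the idea behind __sbitmap_for_each_set(), not the actual sbitmap code:)

```python
def for_each_set_from(bits, start):
    """Yield indices of set bits, scanning circularly from `start`."""
    n = len(bits)
    for i in range(n):
        idx = (start + i) % n
        if bits[idx]:
            yield idx

# ctxs 1, 3 and 6 have pending requests; the last dispatch stopped at
# ctx 4, so the next scan starts there instead of always at ctx 0.
busy = [False, True, False, True, False, False, True, False]
print(list(for_each_set_from(busy, start=4)))  # [6, 1, 3]
```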


Ming Lei (7):
  blk-mq: issue rq directly in blk_mq_request_bypass_insert()
  blk-mq-sched: fix scheduler bad performance
  sbitmap: introduce __sbitmap_for_each_set()
  blk-mq: introduce blk_mq_dequeue_from_ctx()
  blk-mq-sched: move actual dispatching into one helper
  blk-mq-sched: improve dispatching from sw queue
  blk-mq-sched: don't dequeue request until all in ->dispatch are
flushed

 block/blk-core.c|   3 +-
 block/blk-mq-debugfs.c  |   1 +
 block/blk-mq-sched.c| 104 ---
 block/blk-mq.c  | 114 +++-
 block/blk-mq.h  |   4 +-
 drivers/md/dm-rq.c  |   2 +-
 include/linux/blk-mq.h  |   3 ++
 include/linux/sbitmap.h |  64 +++
 8 files changed, 238 insertions(+), 57 deletions(-)

-- 
2.9.5