Re: [PATCH 4/5] bcache: writeback: collapse contiguous IO better

2017-09-30 Thread Coly Li
On 2017/10/1 6:49 AM, Michael Lyle wrote:
> One final attempt to resend, because gmail has been giving me trouble
> sending plain text mail.
> 
> Two instances of this.  Tested as above, with a big set of random I/Os
> that ultimately cover every block in a file (e.g. allowing sequential
> writeback).
> 
> With the 5 patches, samsung 940 SSD cache + crummy 5400 RPM USB hard drive:
> 
> Typical seconds look like:
> 
> Reading 38232K from cache in 4809 IO.  38232/4809=7.95k per cache device IO.
> 
> Writing 38112k to cache in 400 I/O = 95.28k -- or we are combining
> about 11.9 extents to a contiguous writeback.  Tracing, there are
> still contiguous things that are not getting merged well, but it's OK.
> (I'm hoping plugging makes this better).
> 
> sda            4809.00     38232.00       446.00      38232        446
> sdb             400.00         0.00     38112.00          0      38112
> 
> Without the 5 patches, a typical second--
> 
> sda            2509.00     19968.00       316.00      19968        316
> sdb             502.00         0.00     19648.00          0      38112
> 
> or we are combining about 4.9 extents to a contiguous writeback, and
> writing back at about half the rate.  All of these numbers are +/- 10%
> and obtained by eyeballing and grabbing representative seconds.
> 
> Mike
> 
> On Sat, Sep 30, 2017 at 2:02 AM, Michael Lyle  wrote:
>>
>> Resent because I can't seem to get gmail to not send HTML mail.  And off to 
>> sleep.
>>
>>
>> On Sat, Sep 30, 2017 at 1:57 AM, Michael Lyle  wrote:
>>> Two instances of this.
>>>
>>> With the 5 patches, samsung 940 SSD cache + crummy 5400 RPM USB hard drive:
>>>
>>> Typical seconds look like:
>>>
>>> Reading 38232K from cache in 4809 IO.  38232/4809=7.95k per cache device IO.
>>>
>>> Writing 38112k to cache in 400 I/O = 95.28k -- or we are combining about
>>> 11.9 extents to a contiguous writeback.  Tracing, there are still contiguous
>>> things that are not getting merged well, but it's OK.  (I'm hoping plugging
>>> makes this better).
>>>
>>> sda            4809.00     38232.00       446.00      38232        446
>>> sdb             400.00         0.00     38112.00          0      38112
>>>
>>> Without the 5 patches, a typical second--
>>>
>>> sda            2509.00     19968.00       316.00      19968        316
>>> sdb             502.00         0.00     19648.00          0      38112
>>>
>>> or we are combining about 4.9 extents to a contiguous writeback, and writing
>>> back at about half the rate.

Hi Mike,

Got it. I am now testing your patches (all 5 of them). The run has been
going for 12+ hours and should take about 12 more. The backing device is a
RAID0 composed of four 1.8TB 2.5-inch hard disks, and the cache device is
a 3.8TB NVMe SSD. Block size is 512K.

I hope it works as you expect on my server.

-- 
Coly Li



Re: [PATCH 4/5] bcache: writeback: collapse contiguous IO better

2017-09-30 Thread Michael Lyle
One final attempt to resend, because gmail has been giving me trouble
sending plain text mail.

Two instances of this.  Tested as above, with a big set of random I/Os
that ultimately cover every block in a file (e.g. allowing sequential
writeback).

With the 5 patches, samsung 940 SSD cache + crummy 5400 RPM USB hard drive:

Typical seconds look like:

Reading 38232K from cache in 4809 IO.  38232/4809=7.95k per cache device IO.

Writing 38112k to cache in 400 I/O = 95.28k -- or we are combining
about 11.9 extents to a contiguous writeback.  Tracing, there are
still contiguous things that are not getting merged well, but it's OK.
(I'm hoping plugging makes this better).

sda            4809.00     38232.00       446.00      38232        446
sdb             400.00         0.00     38112.00          0      38112

Without the 5 patches, a typical second--

sda            2509.00     19968.00       316.00      19968        316
sdb             502.00         0.00     19648.00          0      38112

or we are combining about 4.9 extents to a contiguous writeback, and
writing back at about half the rate.  All of these numbers are +/- 10%
and obtained by eyeballing and grabbing representative seconds.
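For reference, the arithmetic above as a trivial standalone program (the
values are copied from the iostat samples; nothing here is measured by the
program itself, it just restates the division):

    /* back-of-the-envelope check of the per-IO sizes quoted above */
    #include <stdio.h>

    int main(void)
    {
        double read_kb  = 38232, read_ios  = 4809;  /* cache device reads */
        double write_kb = 38112, write_ios = 400;   /* backing device writes */

        double kb_per_read  = read_kb / read_ios;   /* ~7.95k per cache IO */
        double kb_per_write = write_kb / write_ios; /* ~95.3k per writeback IO */

        printf("%.2fk per read, %.2fk per writeback, ~%.1f extents merged\n",
               kb_per_read, kb_per_write, kb_per_write / kb_per_read);
        return 0;
    }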

Mike

On Sat, Sep 30, 2017 at 2:02 AM, Michael Lyle  wrote:
>
> Resent because I can't seem to get gmail to not send HTML mail.  And off to 
> sleep.
>
>
> On Sat, Sep 30, 2017 at 1:57 AM, Michael Lyle  wrote:
> > Two instances of this.
> >
> > With the 5 patches, samsung 940 SSD cache + crummy 5400 RPM USB hard drive:
> >
> > Typical seconds look like:
> >
> > Reading 38232K from cache in 4809 IO.  38232/4809=7.95k per cache device IO.
> >
> > Writing 38112k to cache in 400 I/O = 95.28k -- or we are combining about
> > 11.9 extents to a contiguous writeback.  Tracing, there are still contiguous
> > things that are not getting merged well, but it's OK.  (I'm hoping plugging
> > makes this better).
> >
> > sda            4809.00     38232.00       446.00      38232        446
> > sdb             400.00         0.00     38112.00          0      38112
> >
> > Without the 5 patches, a typical second--
> >
> > sda            2509.00     19968.00       316.00      19968        316
> > sdb             502.00         0.00     19648.00          0      38112
> >
> > or we are combining about 4.9 extents to a contiguous writeback, and writing
> > back at about half the rate.
> >
> > Anyways, cya, I'm unsubscribed.
> >
> > Mike
>


Re: [PATCH 8/9] nvme: implement multipath access to nvme subsystems

2017-09-30 Thread Johannes Thumshirn

[+Cc Hannes ]

Keith Busch  writes:

> On Mon, Sep 25, 2017 at 03:40:30PM +0200, Christoph Hellwig wrote:
>> The new block devices nodes for multipath access will show up as
>> 
>>  /dev/nvm-subXnZ
>
> Just thinking ahead ... Once this goes in, someone will want to boot their
> OS from a multipath target. It was a pain getting installers to recognize
> /dev/nvmeXnY as an install destination. I'm not sure if installers have
> gotten any better in the last 5 years about recognizing new block names.

We (Hannes, Christoph and I) discussed this offline this week as well. It
should either be possible with some udev magic (faking a /dev/mapper/WWID
symlink) or with a shim device-mapper target that prevents dm-mpath from
attaching to the underlying block devices and creates a real
/dev/mapper/WWID device node (Christoph especially dislikes the latter).
Neither option is pretty, but they are probably all we can do for now.
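A rough idea of what the udev approach could look like (untested sketch; the
KERNEL match pattern and the ID_WWN property are guesses, not something we
have verified against the new subsystem nodes):

    # untested sketch -- match pattern and ID_WWN availability are assumptions
    ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="nvm-sub*n*", \
        ENV{ID_WWN}=="?*", SYMLINK+="mapper/$env{ID_WWN}"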

Byte,
Johannes

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de    +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH v4] blk-mq: fix nr_requests wrong value when modify it from sysfs

2017-09-30 Thread weiping zhang
On Fri, Sep 22, 2017 at 11:36:28PM +0800, weiping zhang wrote:
> If blk-mq uses the "none" I/O scheduler, nr_requests gets a wrong value
> when a number larger than tag_set->queue_depth is written:
> blk_mq_tag_update_depth() applies the smaller one, min(nr, set->queue_depth),
> but q->nr_requests is still set to the requested value.
> 
> Reproduce:
> 
> echo none > /sys/block/nvme0n1/queue/scheduler
> echo 100 > /sys/block/nvme0n1/queue/nr_requests
> cat /sys/block/nvme0n1/queue/nr_requests
> 100
> 
> Signed-off-by: weiping zhang 
> ---
> 
> Changes since v4:
>  * fix typo in commit message(queue/ioscheduler => queue/scheduler)
> 
> Changes since v3:
>  * remove compare nr with tags->qdepth, pass nr to blk_mq_tag_update_depth
> directly
> 
>  * remove return EINVAL when user modify nr_request less than BLKDEV_MIN_RQ
> 
> Changes since v2:
>  * add return EINVAL when user modify nr_request less than BLKDEV_MIN_RQ
>  * remove pr_warn, and return EINVAL, if input number is too large
> 
>  block/blk-mq.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 98a1860..491e336 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2642,8 +2642,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, 
> unsigned int nr)
>* queue depth. This is similar to what the old code would do.
>*/
>   if (!hctx->sched_tags) {
> -			ret = blk_mq_tag_update_depth(hctx, &hctx->tags,
> -							min(nr, set->queue_depth),
> +			ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
> 							false);
> 		} else {
> 			ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
> -- 
> 2.9.4
> 

Hi Jens,

As you said before:
> blk_mq_tag_update_depth() should already return
> -EINVAL for the case where we can't grow the tags. Looks like this patch
> should simply remove the min(nr, set->queue_depth) and just pass in 'nr'.
I also find that hctx->tags->nr_tags is equal to tag_set->queue_depth
regardless of whether hctx->sched_tags is used, so I think it's safe to
validate 'nr' against hctx->tags->nr_tags.
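With the patch applied, writing a value larger than the hardware tag depth
should now be rejected instead of being recorded, something like this
(device name taken from the reproducer above; the exact shell error text
may vary):

    echo none > /sys/block/nvme0n1/queue/scheduler
    echo 100000 > /sys/block/nvme0n1/queue/nr_requests
    -bash: echo: write error: Invalid argument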

Thanks
weiping


[PATCH 4/5] dm-mpath: cache ti->clone during requeue

2017-09-30 Thread Ming Lei
During requeue, the block layer won't change the request any more (no
merging, for example), so we can cache ti->clone and let .clone_and_map_rq
check whether the cached clone can be reused.

Signed-off-by: Ming Lei 
---
 drivers/md/dm-mpath.c | 31 ---
 drivers/md/dm-rq.c| 41 +
 2 files changed, 53 insertions(+), 19 deletions(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 9ee223170ee9..52e4730541fd 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -457,6 +457,11 @@ do {   
\
 dm_noflush_suspending((m)->ti));   \
 } while (0)
 
+static void multipath_release_clone(struct request *clone)
+{
+   blk_put_request(clone);
+}
+
 /*
  * Map cloned requests (request-based multipath)
  */
@@ -470,7 +475,7 @@ static int multipath_clone_and_map(struct dm_target *ti, 
struct request *rq,
struct block_device *bdev;
struct dm_mpath_io *mpio = get_mpio(map_context);
struct request_queue *q;
-   struct request *clone;
+   struct request *clone = *__clone;
 
/* Do we need to select a new pgpath? */
pgpath = lockless_dereference(m->current_pgpath);
@@ -493,7 +498,23 @@ static int multipath_clone_and_map(struct dm_target *ti, 
struct request *rq,
 
bdev = pgpath->path.dev->bdev;
q = bdev_get_queue(bdev);
-   clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
+
+   /*
+* This request may be from requeue path, and its clone
+* may have been allocated before. We need to check
+* if the cached clone can be hit.
+*/
+   if (clone) {
+   if (clone->q != q) {
+   blk_rq_unprep_clone(clone);
+   multipath_release_clone(clone);
+   clone = NULL;
+   } else
+   goto start_io;
+   }
+
+   if (!clone)
+   clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, 
GFP_ATOMIC);
if (IS_ERR(clone)) {
/* EBUSY, ENODEV or EWOULDBLOCK: requeue */
bool queue_dying = blk_queue_dying(q);
@@ -520,6 +541,7 @@ static int multipath_clone_and_map(struct dm_target *ti, 
struct request *rq,
clone->cmd_flags |= REQ_FAILFAST_TRANSPORT;
*__clone = clone;
 
+ start_io:
if (pgpath->pg->ps.type->start_io)
pgpath->pg->ps.type->start_io(>pg->ps,
  >path,
@@ -527,11 +549,6 @@ static int multipath_clone_and_map(struct dm_target *ti, 
struct request *rq,
return DM_MAPIO_REMAPPED;
 }
 
-static void multipath_release_clone(struct request *clone)
-{
-   blk_put_request(clone);
-}
-
 /*
  * Map cloned bios (bio-based multipath)
  */
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 46f012185b43..2ef524bddd38 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -221,6 +221,12 @@ static void dm_end_request(struct request *clone, 
blk_status_t error)
blk_rq_unprep_clone(clone);
tio->ti->type->release_clone_rq(clone);
 
+   /*
+* We move the clearing from tio_init in .queue_rq to here because
+* tio->clone may be cached during requeue
+*/
+   tio->clone = NULL;
+
rq_end_stats(md, rq);
if (!rq->q->mq_ops)
blk_end_request_all(rq, error);
@@ -267,11 +273,15 @@ static void dm_requeue_original_request(struct 
dm_rq_target_io *tio, bool delay_
int rw = rq_data_dir(rq);
unsigned long delay_ms = delay_requeue ? 100 : 0;
 
+   /*
+* This request won't be changed any more during requeue,
+* so we cache tio->clone and let .clone_and_map_rq decide
+* to use the cached clone or allocate a new clone, and
+* the cached clone has to be freed before allocating a
+* new one.
+*/
+
rq_end_stats(md, rq);
-   if (tio->clone) {
-   blk_rq_unprep_clone(tio->clone);
-   tio->ti->type->release_clone_rq(tio->clone);
-   }
 
if (!rq->q->mq_ops)
dm_old_requeue_request(rq, delay_ms);
@@ -448,7 +458,6 @@ static void init_tio(struct dm_rq_target_io *tio, struct 
request *rq,
 {
tio->md = md;
tio->ti = NULL;
-   tio->clone = NULL;
tio->orig = rq;
tio->error = 0;
tio->completed = 0;
@@ -456,8 +465,12 @@ static void init_tio(struct dm_rq_target_io *tio, struct 
request *rq,
 * Avoid initializing info for blk-mq; it passes
 * target-specific data through info.ptr
 * (see: dm_mq_init_request)
+*
+* If tio->clone is cached during requeue, we don't
+* clear tio->info, and delay the initialization
+* to .clone_and_map_rq if the cache isn't hit.
 */
-   if (!md->init_tio_pdu)
+   if 

[PATCH 5/5] dm-rq: improve I/O merge by dealing with underlying STS_RESOURCE

2017-09-30 Thread Ming Lei
If the underlying queue returns BLK_STS_RESOURCE, we let dm-rq
handle the requeue instead of blk-mq, so I/O merging can be
improved because the underlying queue's out-of-resource condition
can now be perceived and handled by dm-rq.

Below are IOPS results for mpath on lpfc, with fio (libaio, bs=4k,
direct I/O, queue depth 64, 8 jobs).

1) blk-mq none scheduler
---------------------------------------------------------------
 IOPS(K)   |  v4.14-rc2  | v4.14-rc2 with | v4.14-rc2 with
           |             |    [1] [2]     |  [1] [2] [3]
---------------------------------------------------------------
 read      |    53.69    |     40.26      |     94.61
---------------------------------------------------------------
 randread  |    24.64    |     30.08      |     35.57
---------------------------------------------------------------
 write     |    39.55    |     41.51      |    216.84
---------------------------------------------------------------
 randwrite |    33.97    |     34.27      |     33.98
---------------------------------------------------------------

2) blk-mq mq-deadline scheduler
---------------------------------------------------------------
 IOPS(K)   |  v4.14-rc2  | v4.14-rc2 with | v4.14-rc2 with
           |             |    [1] [2]     |  [1] [2] [3]
---------------------------------------------------------------
 IOPS(K)   | MQ-DEADLINE |  MQ-DEADLINE   |  MQ-DEADLINE
---------------------------------------------------------------
 read      |    23.81    |     21.91      |     89.94
---------------------------------------------------------------
 randread  |    38.47    |     38.96      |     38.02
---------------------------------------------------------------
 write     |    39.52    |     40.2       |    225.75
---------------------------------------------------------------
 randwrite |    34.8     |     33.73      |     33.44
---------------------------------------------------------------

[1] [PATCH V5 0/7] blk-mq-sched: improve sequential I/O performance(part 1)

https://marc.info/?l=linux-block=150676854821077=2

[2] [PATCH V5 0/8] blk-mq: improve bio merge for none scheduler

https://marc.info/?l=linux-block=150677085521416=2

[3] this patchset
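A matching fio invocation would look roughly like this (the device path,
runtime and job name are placeholders; only ioengine/bs/direct/iodepth/numjobs
come from the description above):

    fio --name=seqread --filename=/dev/mapper/mpatha --ioengine=libaio \
        --direct=1 --bs=4k --iodepth=64 --numjobs=8 --rw=read \
        --runtime=60 --time_based --group_reporting
    # use --rw=randread / --rw=write / --rw=randwrite for the other rows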

Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 17 +
 drivers/md/dm-rq.c | 14 --
 2 files changed, 13 insertions(+), 18 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9a3a561a63b5..58d2268f9733 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1467,17 +1467,6 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx, 
struct request *rq,
blk_mq_hctx_mark_pending(hctx, ctx);
 }
 
-static void blk_mq_request_direct_insert(struct blk_mq_hw_ctx *hctx,
-struct request *rq)
-{
-   spin_lock(>lock);
-   list_add_tail(>queuelist, >dispatch);
-   set_bit(BLK_MQ_S_DISPATCH_BUSY, >state);
-   spin_unlock(>lock);
-
-   blk_mq_run_hw_queue(hctx, false);
-}
-
 /*
  * Should only be used carefully, when the caller knows we want to
  * bypass a potential IO scheduler on the target device.
@@ -1487,12 +1476,8 @@ blk_status_t blk_mq_request_bypass_insert(struct request 
*rq)
struct blk_mq_ctx *ctx = rq->mq_ctx;
struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
blk_qc_t cookie;
-   blk_status_t ret;
 
-   ret = blk_mq_try_issue_directly(hctx, rq, , true);
-   if (ret == BLK_STS_RESOURCE)
-   blk_mq_request_direct_insert(hctx, rq);
-   return ret;
+   return blk_mq_try_issue_directly(hctx, rq, , true);
 }
 
 void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 2ef524bddd38..feb49c4d6fa2 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -405,7 +405,7 @@ static void end_clone_request(struct request *clone, 
blk_status_t error)
dm_complete_request(tio->orig, error);
 }
 
-static void dm_dispatch_clone_request(struct request *clone, struct request 
*rq)
+static blk_status_t dm_dispatch_clone_request(struct request *clone, struct 
request *rq)
 {
blk_status_t r;
 
@@ -417,6 +417,7 @@ static void dm_dispatch_clone_request(struct request 
*clone, struct request *rq)
if (r != BLK_STS_OK && r != BLK_STS_RESOURCE)
/* must complete clone in terms of original request */
dm_complete_request(rq, r);
+   return r;
 }
 
 static int dm_rq_bio_constructor(struct bio *bio, struct bio *bio_orig,
@@ -490,8 +491,10 @@ static int map_request(struct dm_rq_target_io *tio)
struct request *rq = tio->orig;
struct request *cache = tio->clone;
struct request *clone = cache;
+   blk_status_t ret;
 
r = ti->type->clone_and_map_rq(ti, rq, >info, );
+ again:
switch (r) {
case DM_MAPIO_SUBMITTED:
/* The target has taken the I/O to submit by itself later */
@@ -509,7 +512,14 @@ static int map_request(struct dm_rq_target_io *tio)
/* The target has remapped the I/O so dispatch it */
trace_block_rq_remap(clone->q, 

[PATCH 3/5] dm-mpath: return DM_MAPIO_REQUEUE in case of rq allocation failure

2017-09-30 Thread Ming Lei
blk-mq will rerun the queue via RESTART after one request is completed,
so there is no need to wait a random amount of time before requeuing; we
should trust blk-mq to do it.

More importantly, we need to return BLK_STS_RESOURCE to blk-mq so that
dequeuing from the I/O scheduler can be stopped, which improves I/O
merging.

Signed-off-by: Ming Lei 
---
 drivers/md/dm-mpath.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 97e4bd100fa1..9ee223170ee9 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -500,8 +500,20 @@ static int multipath_clone_and_map(struct dm_target *ti, 
struct request *rq,
if (queue_dying) {
atomic_inc(>pg_init_in_progress);
activate_or_offline_path(pgpath);
+   return DM_MAPIO_DELAY_REQUEUE;
}
-   return DM_MAPIO_DELAY_REQUEUE;
+
+   /*
+* blk-mq's SCHED_RESTART can cover this requeue, so
+* we needn't to deal with it by DELAY_REQUEUE. More
+* importantly, we have to return DM_MAPIO_REQUEUE
+* so that blk-mq can get the queue busy feedback,
+* otherwise I/O merge can be hurt.
+*/
+   if (q->mq_ops)
+   return DM_MAPIO_REQUEUE;
+   else
+   return DM_MAPIO_DELAY_REQUEUE;
}
clone->bio = clone->biotail = NULL;
clone->rq_disk = bdev->bd_disk;
-- 
2.9.5



[PATCH 1/5] dm-mpath: remove annoying message of 'blk_get_request() returned -11'

2017-09-30 Thread Ming Lei
It is entirely normal to see an allocation failure here, so there is no
need to log it and annoy people.

Signed-off-by: Ming Lei 
---
 drivers/md/dm-mpath.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 11f273d2f018..e8094d8fbe0d 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -499,8 +499,6 @@ static int multipath_clone_and_map(struct dm_target *ti, 
struct request *rq,
if (IS_ERR(clone)) {
/* EBUSY, ENODEV or EWOULDBLOCK: requeue */
bool queue_dying = blk_queue_dying(q);
-   DMERR_LIMIT("blk_get_request() returned %ld%s - requeuing",
-   PTR_ERR(clone), queue_dying ? " (path offline)" : 
"");
if (queue_dying) {
atomic_inc(>pg_init_in_progress);
activate_or_offline_path(pgpath);
-- 
2.9.5



[PATCH 0/5] dm-rq: improve sequential I/O performance

2017-09-30 Thread Ming Lei
Hi,

The 1st patch removes a log message that can be triggered very easily.

The 2nd patch removes the blk_mq_delay_run_hw_queue() workaround in the
requeue path; it isn't necessary and, worse, it defeats
BLK_MQ_S_SCHED_RESTART and degrades I/O performance.

The 3rd patch returns DM_MAPIO_REQUEUE to dm-rq if the underlying request
allocation fails; dm-rq can then return BLK_STS_RESOURCE to blk-mq, so
that blk-mq can hold back the requests to be dequeued.

The 4th patch is a preparation for the 5th one: even when the underlying
request allocation succeeds, its queue may be busy, and we can now get
this feedback from blk_insert_cloned_request(). This patch caches the
allocated request so that it can be reused in the next dispatch to the
underlying queue.

The 5th patch improves sequential I/O performance by returning
BLK_STS_RESOURCE when the underlying queue is busy.
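To illustrate the idea behind patches 3-5, here is a stripped-down toy model
of the feedback loop (plain C with simplified stand-in names, not the real
dm-rq/dm-mpath code):

    #include <stdio.h>
    #include <stdbool.h>

    enum sts { STS_OK, STS_RESOURCE };

    struct tio {
        bool has_cached_clone;              /* stands in for tio->clone != NULL */
    };

    static enum sts dispatch(struct tio *tio, bool underlying_busy)
    {
        if (!tio->has_cached_clone)
            tio->has_cached_clone = true;   /* allocate the clone once */

        if (underlying_busy) {
            /* keep the clone cached and push the busy feedback up, so
             * blk-mq stops dequeuing and bios can still be merged */
            return STS_RESOURCE;
        }

        tio->has_cached_clone = false;      /* clone consumed by the lld */
        return STS_OK;
    }

    int main(void)
    {
        struct tio t = { false };

        printf("busy dispatch  -> %d (clone kept cached)\n", dispatch(&t, true));
        printf("retry dispatch -> %d (cached clone reused)\n", dispatch(&t, false));
        return 0;
    }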

The commit log of the 5th patch provides IOPS data showing that
sequential I/O performance is improved a lot by this patchset.

This patchset depends on the following two patchset:

[1] [PATCH V5 0/7] blk-mq-sched: improve sequential I/O performance(part 1)

https://marc.info/?l=linux-block=150676854821077=2

[2] [PATCH V5 0/8] blk-mq: improve bio merge for none scheduler

https://marc.info/?l=linux-block=150677085521416=2

Any comments are welcome! 

Thanks,
Ming

Ming Lei (5):
  dm-mpath: remove annoying message of 'blk_get_request() returned -11'
  dm-mpath: don't call blk_mq_delay_run_hw_queue() in case of
BLK_STS_RESOURCE
  dm-mpath: return DM_MAPIO_REQUEUE in case of rq allocation failure
  dm-mpath: cache ti->clone during requeue
  dm-rq: improve I/O merge by dealing with underlying STS_RESOURCE

 block/blk-mq.c| 17 +---
 drivers/md/dm-mpath.c | 51 ++
 drivers/md/dm-rq.c| 56 +--
 3 files changed, 80 insertions(+), 44 deletions(-)

-- 
2.9.5



[PATCH 2/5] dm-mpath: don't call blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE

2017-09-30 Thread Ming Lei
If .queue_rq() returns BLK_STS_RESOURCE, blk-mq will rerun
the queue in three situations:

1) if BLK_MQ_S_SCHED_RESTART is set
- queue is rerun after one rq is completed, see blk_mq_sched_restart()
which is run from blk_mq_free_request()

2) BLK_MQ_S_TAG_WAITING is set
- queue is rerun after one tag is freed

3) otherwise
- queue is run immediately in blk_mq_dispatch_rq_list()

This random delay before rerunning the hw queue was introduced by commit
6077c2d706097c0 ("dm rq: Avoid that request processing stalls
sporadically"), which claimed to fix a request-processing stall but never
explained the idea behind it; it is a workaround at most. The question
hasn't been answered by anyone in the recent discussion either.

Also, calling blk_mq_delay_run_hw_queue() inside .queue_rq() is a
horrible hack because it defeats BLK_MQ_S_SCHED_RESTART and degrades I/O
performance a lot.

Finally, this patch makes sure that dm-rq returns BLK_STS_RESOURCE to
blk-mq only when the underlying queue is out of resources, so we switch
to returning DM_MAPIO_DELAY_REQUEUE if either MPATHF_QUEUE_IO or
MPATHF_PG_INIT_REQUIRED is set in multipath_clone_and_map().

Signed-off-by: Ming Lei 
---
 drivers/md/dm-mpath.c | 4 +---
 drivers/md/dm-rq.c| 1 -
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index e8094d8fbe0d..97e4bd100fa1 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -484,9 +484,7 @@ static int multipath_clone_and_map(struct dm_target *ti, 
struct request *rq,
return DM_MAPIO_KILL;
} else if (test_bit(MPATHF_QUEUE_IO, >flags) ||
   test_bit(MPATHF_PG_INIT_REQUIRED, >flags)) {
-   if (pg_init_all_paths(m))
-   return DM_MAPIO_DELAY_REQUEUE;
-   return DM_MAPIO_REQUEUE;
+   return DM_MAPIO_DELAY_REQUEUE;
}
 
memset(mpio, 0, sizeof(*mpio));
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index f5e2b6967357..46f012185b43 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -758,7 +758,6 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx 
*hctx,
/* Undo dm_start_request() before requeuing */
rq_end_stats(md, rq);
rq_completed(md, rq_data_dir(rq), false);
-   blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
return BLK_STS_RESOURCE;
}
 
-- 
2.9.5



[PATCH V5 8/8] blk-mq: improve bio merge from blk-mq sw queue

2017-09-30 Thread Ming Lei
This patch uses a hash table to do bio merging from the sw queue, so that
we align with the way bio merging is done by blk-mq schedulers and the
legacy block path.

It turns out that bio merging via a hash table is more efficient than the
simple merge against the last 8 requests in the sw queue. On SCSI SRP,
~10% higher IOPS is observed in a sequential I/O test with this patch.

It is also one step toward a real 'none' scheduler, which would make the
blk-mq scheduler framework cleaner.

Tested-by: Oleksandr Natalenko 
Tested-by: Tom Nguyen 
Tested-by: Paolo Valente 
Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c | 49 -
 block/blk-mq.c   | 28 +---
 2 files changed, 37 insertions(+), 40 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index a58f4746317c..8262ae71e0cd 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -260,50 +260,25 @@ bool blk_mq_sched_try_merge(struct request_queue *q, 
struct bio *bio,
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_try_merge);
 
-/*
- * Reverse check our software queue for entries that we could potentially
- * merge with. Currently includes a hand-wavy stop count of 8, to not spend
- * too much time checking for merges.
- */
-static bool blk_mq_attempt_merge(struct request_queue *q,
+static bool blk_mq_ctx_try_merge(struct request_queue *q,
 struct blk_mq_ctx *ctx, struct bio *bio)
 {
-   struct request *rq;
-   int checked = 8;
+   struct request *rq, *free = NULL;
+   enum elv_merge type;
+   bool merged;
 
lockdep_assert_held(>lock);
 
-   list_for_each_entry_reverse(rq, >rq_list, queuelist) {
-   bool merged = false;
-
-   if (!checked--)
-   break;
-
-   if (!blk_rq_merge_ok(rq, bio))
-   continue;
+   type = elv_merge_ctx(q, , bio, ctx);
+   merged = __blk_mq_try_merge(q, bio, , rq, type);
 
-   switch (blk_try_merge(rq, bio)) {
-   case ELEVATOR_BACK_MERGE:
-   if (blk_mq_sched_allow_merge(q, rq, bio))
-   merged = bio_attempt_back_merge(q, rq, bio);
-   break;
-   case ELEVATOR_FRONT_MERGE:
-   if (blk_mq_sched_allow_merge(q, rq, bio))
-   merged = bio_attempt_front_merge(q, rq, bio);
-   break;
-   case ELEVATOR_DISCARD_MERGE:
-   merged = bio_attempt_discard_merge(q, rq, bio);
-   break;
-   default:
-   continue;
-   }
+   if (free)
+   blk_mq_free_request(free);
 
-   if (merged)
-   ctx->rq_merged++;
-   return merged;
-   }
+   if (merged)
+   ctx->rq_merged++;
 
-   return false;
+   return merged;
 }
 
 bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
@@ -321,7 +296,7 @@ bool __blk_mq_sched_bio_merge(struct request_queue *q, 
struct bio *bio)
if (hctx->flags & BLK_MQ_F_SHOULD_MERGE) {
/* default per sw-queue merge */
spin_lock(>lock);
-   ret = blk_mq_attempt_merge(q, ctx, bio);
+   ret = blk_mq_ctx_try_merge(q, ctx, bio);
spin_unlock(>lock);
}
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 561a663cdd0e..9a3a561a63b5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -849,6 +849,18 @@ static void blk_mq_timeout_work(struct work_struct *work)
blk_queue_exit(q);
 }
 
+static void blk_mq_ctx_remove_rq_list(struct blk_mq_ctx *ctx,
+   struct list_head *head)
+{
+   struct request *rq;
+
+   lockdep_assert_held(>lock);
+
+   list_for_each_entry(rq, head, queuelist)
+   rqhash_del(rq);
+   ctx->last_merge = NULL;
+}
+
 struct flush_busy_ctx_data {
struct blk_mq_hw_ctx *hctx;
struct list_head *list;
@@ -863,6 +875,7 @@ static bool flush_busy_ctx(struct sbitmap *sb, unsigned int 
bitnr, void *data)
sbitmap_clear_bit(sb, bitnr);
spin_lock(>lock);
list_splice_tail_init(>rq_list, flush_data->list);
+   blk_mq_ctx_remove_rq_list(ctx, flush_data->list);
spin_unlock(>lock);
return true;
 }
@@ -892,17 +905,23 @@ static bool dispatch_rq_from_ctx(struct sbitmap *sb, 
unsigned int bitnr, void *d
struct dispatch_rq_data *dispatch_data = data;
struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
+   struct request *rq = NULL;
 
spin_lock(>lock);
if (unlikely(!list_empty(>rq_list))) {
-   dispatch_data->rq = list_entry_rq(ctx->rq_list.next);
-   list_del_init(_data->rq->queuelist);
+   rq = 

[PATCH V5 5/8] block: add check on elevator for supporting bio merge via hashtable from blk-mq sw queue

2017-09-30 Thread Ming Lei
blk_mq_sched_try_merge() will be reused in the following patches to
support bio merging into the blk-mq sw queue, so add checks to the
related functions that are called from blk_mq_sched_try_merge().

Tested-by: Oleksandr Natalenko 
Tested-by: Tom Nguyen 
Tested-by: Paolo Valente 
Signed-off-by: Ming Lei 
---
 block/elevator.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/block/elevator.c b/block/elevator.c
index e11c7873fc21..2424aea85393 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -71,6 +71,10 @@ bool elv_bio_merge_ok(struct request *rq, struct bio *bio)
if (!blk_rq_merge_ok(rq, bio))
return false;
 
+   /* We need to support to merge bio from sw queue */
+   if (!rq->q->elevator)
+   return true;
+
if (!elv_iosched_allow_bio_merge(rq, bio))
return false;
 
@@ -449,6 +453,10 @@ static enum elv_merge __elv_merge(struct request_queue *q,
return ELEVATOR_BACK_MERGE;
}
 
+   /* no elevator when merging bio to blk-mq sw queue */
+   if (!e)
+   return ELEVATOR_NO_MERGE;
+
if (e->uses_mq && e->type->ops.mq.request_merge)
return e->type->ops.mq.request_merge(q, req, bio);
else if (!e->uses_mq && e->type->ops.sq.elevator_merge_fn)
@@ -711,6 +719,10 @@ struct request *elv_latter_request(struct request_queue 
*q, struct request *rq)
 {
struct elevator_queue *e = q->elevator;
 
+   /* no elevator when merging bio to blk-mq sw queue */
+   if (!e)
+   return NULL;
+
if (e->uses_mq && e->type->ops.mq.next_request)
return e->type->ops.mq.next_request(q, rq);
else if (!e->uses_mq && e->type->ops.sq.elevator_latter_req_fn)
@@ -723,6 +735,10 @@ struct request *elv_former_request(struct request_queue 
*q, struct request *rq)
 {
struct elevator_queue *e = q->elevator;
 
+   /* no elevator when merging bio to blk-mq sw queue */
+   if (!e)
+   return NULL;
+
if (e->uses_mq && e->type->ops.mq.former_request)
return e->type->ops.mq.former_request(q, rq);
if (!e->uses_mq && e->type->ops.sq.elevator_former_req_fn)
-- 
2.9.5



[PATCH V5 7/8] blk-mq-sched: refactor blk_mq_sched_try_merge()

2017-09-30 Thread Ming Lei
This patch introduces a function, __blk_mq_try_merge(), which will be
reused for bio merging into the sw queue in the following patch.

No functional change.

Tested-by: Oleksandr Natalenko 
Tested-by: Tom Nguyen 
Tested-by: Paolo Valente 
Reviewed-by: Bart Van Assche 
Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 8c09959bc0d0..a58f4746317c 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -222,12 +222,11 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
}
 }
 
-bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
-   struct request **merged_request)
+static bool __blk_mq_try_merge(struct request_queue *q,
+   struct bio *bio, struct request **merged_request,
+   struct request *rq, enum elv_merge type)
 {
-   struct request *rq;
-
-   switch (elv_merge(q, , bio)) {
+   switch (type) {
case ELEVATOR_BACK_MERGE:
if (!blk_mq_sched_allow_merge(q, rq, bio))
return false;
@@ -250,6 +249,15 @@ bool blk_mq_sched_try_merge(struct request_queue *q, 
struct bio *bio,
return false;
}
 }
+
+bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
+   struct request **merged_request)
+{
+   struct request *rq;
+   enum elv_merge type = elv_merge(q, , bio);
+
+   return __blk_mq_try_merge(q, bio, merged_request, rq, type);
+}
 EXPORT_SYMBOL_GPL(blk_mq_sched_try_merge);
 
 /*
-- 
2.9.5



[PATCH V5 6/8] block: introduce .last_merge and .hash to blk_mq_ctx

2017-09-30 Thread Ming Lei
Prepare for supporting bio merging into the sw queue when no blk-mq I/O
scheduler is in use.

Tested-by: Oleksandr Natalenko 
Tested-by: Tom Nguyen 
Tested-by: Paolo Valente 
Signed-off-by: Ming Lei 
---
 block/blk-mq.h   |  4 
 block/blk.h  |  3 +++
 block/elevator.c | 22 +++---
 3 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 5bca6ce1f01d..85ea8615fecf 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -18,6 +18,10 @@ struct blk_mq_ctx {
unsigned long   rq_dispatched[2];
unsigned long   rq_merged;
 
+   /* bio merge via request hash table */
+   struct request  *last_merge;
+   DECLARE_HASHTABLE(hash, ELV_HASH_BITS);
+
/* incremented at completion time */
unsigned long   cacheline_aligned_in_smp rq_completed[2];
 
diff --git a/block/blk.h b/block/blk.h
index eb3436d4a73f..fa4f232afc18 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -198,6 +198,9 @@ static inline struct request *rqhash_find(struct hlist_head 
*hash, sector_t offs
return NULL;
 }
 
+enum elv_merge elv_merge_ctx(struct request_queue *q, struct request **req,
+struct bio *bio, struct blk_mq_ctx *ctx);
+
 void blk_insert_flush(struct request *rq);
 
 static inline struct request *__elv_next_request(struct request_queue *q)
diff --git a/block/elevator.c b/block/elevator.c
index 2424aea85393..0e13e5c18982 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -471,6 +471,13 @@ enum elv_merge elv_merge(struct request_queue *q, struct 
request **req,
return __elv_merge(q, req, bio, q->elevator->hash, q->last_merge);
 }
 
+enum elv_merge elv_merge_ctx(struct request_queue *q, struct request **req,
+   struct bio *bio, struct blk_mq_ctx *ctx)
+{
+   WARN_ON_ONCE(!q->mq_ops);
+   return __elv_merge(q, req, bio, ctx->hash, ctx->last_merge);
+}
+
 /*
  * Attempt to do an insertion back merge. Only check for the case where
  * we can append 'rq' to an existing request, so we can throw 'rq' away
@@ -516,16 +523,25 @@ void elv_merged_request(struct request_queue *q, struct 
request *rq,
enum elv_merge type)
 {
struct elevator_queue *e = q->elevator;
+   struct hlist_head *hash = e->hash;
+
+   /* we do bio merge on blk-mq sw queue */
+   if (q->mq_ops && !e) {
+   rq->mq_ctx->last_merge = rq;
+   hash = rq->mq_ctx->hash;
+   goto reposition;
+   }
+
+   q->last_merge = rq;
 
if (e->uses_mq && e->type->ops.mq.request_merged)
e->type->ops.mq.request_merged(q, rq, type);
else if (!e->uses_mq && e->type->ops.sq.elevator_merged_fn)
e->type->ops.sq.elevator_merged_fn(q, rq, type);
 
+ reposition:
if (type == ELEVATOR_BACK_MERGE)
-   elv_rqhash_reposition(q, rq);
-
-   q->last_merge = rq;
+   rqhash_reposition(hash, rq);
 }
 
 void elv_merge_requests(struct request_queue *q, struct request *rq,
-- 
2.9.5



[PATCH V5 4/8] block: move actual bio merge code into __elv_merge

2017-09-30 Thread Ming Lei
So that we can reuse __elv_merge() to merge bios into requests from the
sw queue in the following patches.

No functional change.

Tested-by: Oleksandr Natalenko 
Tested-by: Tom Nguyen 
Tested-by: Paolo Valente 
Signed-off-by: Ming Lei 
---
 block/elevator.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index 824cc3e69ac3..e11c7873fc21 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -409,8 +409,9 @@ void elv_dispatch_add_tail(struct request_queue *q, struct 
request *rq)
 }
 EXPORT_SYMBOL(elv_dispatch_add_tail);
 
-enum elv_merge elv_merge(struct request_queue *q, struct request **req,
-   struct bio *bio)
+static enum elv_merge __elv_merge(struct request_queue *q,
+   struct request **req, struct bio *bio,
+   struct hlist_head *hash, struct request *last_merge)
 {
struct elevator_queue *e = q->elevator;
struct request *__rq;
@@ -427,11 +428,11 @@ enum elv_merge elv_merge(struct request_queue *q, struct 
request **req,
/*
 * First try one-hit cache.
 */
-   if (q->last_merge && elv_bio_merge_ok(q->last_merge, bio)) {
-   enum elv_merge ret = blk_try_merge(q->last_merge, bio);
+   if (last_merge && elv_bio_merge_ok(last_merge, bio)) {
+   enum elv_merge ret = blk_try_merge(last_merge, bio);
 
if (ret != ELEVATOR_NO_MERGE) {
-   *req = q->last_merge;
+   *req = last_merge;
return ret;
}
}
@@ -442,7 +443,7 @@ enum elv_merge elv_merge(struct request_queue *q, struct 
request **req,
/*
 * See if our hash lookup can find a potential backmerge.
 */
-   __rq = elv_rqhash_find(q, bio->bi_iter.bi_sector);
+   __rq = rqhash_find(hash, bio->bi_iter.bi_sector);
if (__rq && elv_bio_merge_ok(__rq, bio)) {
*req = __rq;
return ELEVATOR_BACK_MERGE;
@@ -456,6 +457,12 @@ enum elv_merge elv_merge(struct request_queue *q, struct 
request **req,
return ELEVATOR_NO_MERGE;
 }
 
+enum elv_merge elv_merge(struct request_queue *q, struct request **req,
+   struct bio *bio)
+{
+   return __elv_merge(q, req, bio, q->elevator->hash, q->last_merge);
+}
+
 /*
  * Attempt to do an insertion back merge. Only check for the case where
  * we can append 'rq' to an existing request, so we can throw 'rq' away
-- 
2.9.5



[PATCH V5 1/8] blk-mq-sched: introduce blk_mq_sched_queue_depth()

2017-09-30 Thread Ming Lei
The following patch will use a hint to figure out the default queue depth
for the scheduler queue, so introduce the helper
blk_mq_sched_queue_depth() for this purpose.

Reviewed-by: Christoph Hellwig 
Reviewed-by: Bart Van Assche 
Tested-by: Oleksandr Natalenko 
Tested-by: Tom Nguyen 
Tested-by: Paolo Valente 
Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c |  8 +---
 block/blk-mq-sched.h | 11 +++
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index c5eac1eee442..8c09959bc0d0 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -588,13 +588,7 @@ int blk_mq_init_sched(struct request_queue *q, struct 
elevator_type *e)
return 0;
}
 
-   /*
-* Default to double of smaller one between hw queue_depth and 128,
-* since we don't split into sync/async like the old code did.
-* Additionally, this is a per-hw queue depth.
-*/
-   q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
-  BLKDEV_MAX_RQ);
+   q->nr_requests = blk_mq_sched_queue_depth(q);
 
queue_for_each_hw_ctx(q, hctx, i) {
ret = blk_mq_sched_alloc_tags(q, hctx, i);
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 9267d0b7c197..1d47f3fda1d0 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -96,4 +96,15 @@ static inline bool blk_mq_sched_needs_restart(struct 
blk_mq_hw_ctx *hctx)
return test_bit(BLK_MQ_S_SCHED_RESTART, >state);
 }
 
+static inline unsigned blk_mq_sched_queue_depth(struct request_queue *q)
+{
+   /*
+* Default to double of smaller one between hw queue_depth and 128,
+* since we don't split into sync/async like the old code did.
+* Additionally, this is a per-hw queue depth.
+*/
+   return 2 * min_t(unsigned int, q->tag_set->queue_depth,
+  BLKDEV_MAX_RQ);
+}
+
 #endif
-- 
2.9.5



[PATCH V5 3/8] block: introduce rqhash helpers

2017-09-30 Thread Ming Lei
We need these helpers to support using a hash table to improve bio
merging from the sw queue in the following patches.

No functional change.

Tested-by: Oleksandr Natalenko 
Tested-by: Tom Nguyen 
Tested-by: Paolo Valente 
Signed-off-by: Ming Lei 
---
 block/blk.h  | 52 
 block/elevator.c | 36 +++-
 2 files changed, 59 insertions(+), 29 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index fcb9775b997d..eb3436d4a73f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -146,6 +146,58 @@ static inline void blk_clear_rq_complete(struct request 
*rq)
  */
 #define ELV_ON_HASH(rq) ((rq)->rq_flags & RQF_HASHED)
 
+/*
+ * Merge hash stuff.
+ */
+#define rq_hash_key(rq)(blk_rq_pos(rq) + blk_rq_sectors(rq))
+
+#define bucket(head, key)  &((head)[hash_min((key), ELV_HASH_BITS)])
+
+static inline void __rqhash_del(struct request *rq)
+{
+   hash_del(>hash);
+   rq->rq_flags &= ~RQF_HASHED;
+}
+
+static inline void rqhash_del(struct request *rq)
+{
+   if (ELV_ON_HASH(rq))
+   __rqhash_del(rq);
+}
+
+static inline void rqhash_add(struct hlist_head *hash, struct request *rq)
+{
+   BUG_ON(ELV_ON_HASH(rq));
+   hlist_add_head(>hash, bucket(hash, rq_hash_key(rq)));
+   rq->rq_flags |= RQF_HASHED;
+}
+
+static inline void rqhash_reposition(struct hlist_head *hash, struct request 
*rq)
+{
+   __rqhash_del(rq);
+   rqhash_add(hash, rq);
+}
+
+static inline struct request *rqhash_find(struct hlist_head *hash, sector_t 
offset)
+{
+   struct hlist_node *next;
+   struct request *rq = NULL;
+
+   hlist_for_each_entry_safe(rq, next, bucket(hash, offset), hash) {
+   BUG_ON(!ELV_ON_HASH(rq));
+
+   if (unlikely(!rq_mergeable(rq))) {
+   __rqhash_del(rq);
+   continue;
+   }
+
+   if (rq_hash_key(rq) == offset)
+   return rq;
+   }
+
+   return NULL;
+}
+
 void blk_insert_flush(struct request *rq);
 
 static inline struct request *__elv_next_request(struct request_queue *q)
diff --git a/block/elevator.c b/block/elevator.c
index 153926a90901..824cc3e69ac3 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -47,11 +47,6 @@ static DEFINE_SPINLOCK(elv_list_lock);
 static LIST_HEAD(elv_list);
 
 /*
- * Merge hash stuff.
- */
-#define rq_hash_key(rq)(blk_rq_pos(rq) + blk_rq_sectors(rq))
-
-/*
  * Query io scheduler to see if the current process issuing bio may be
  * merged with rq.
  */
@@ -268,14 +263,12 @@ EXPORT_SYMBOL(elevator_exit);
 
 static inline void __elv_rqhash_del(struct request *rq)
 {
-   hash_del(>hash);
-   rq->rq_flags &= ~RQF_HASHED;
+   __rqhash_del(rq);
 }
 
 void elv_rqhash_del(struct request_queue *q, struct request *rq)
 {
-   if (ELV_ON_HASH(rq))
-   __elv_rqhash_del(rq);
+   rqhash_del(rq);
 }
 EXPORT_SYMBOL_GPL(elv_rqhash_del);
 
@@ -283,37 +276,22 @@ void elv_rqhash_add(struct request_queue *q, struct 
request *rq)
 {
struct elevator_queue *e = q->elevator;
 
-   BUG_ON(ELV_ON_HASH(rq));
-   hash_add(e->hash, >hash, rq_hash_key(rq));
-   rq->rq_flags |= RQF_HASHED;
+   rqhash_add(e->hash, rq);
 }
 EXPORT_SYMBOL_GPL(elv_rqhash_add);
 
 void elv_rqhash_reposition(struct request_queue *q, struct request *rq)
 {
-   __elv_rqhash_del(rq);
-   elv_rqhash_add(q, rq);
+   struct elevator_queue *e = q->elevator;
+
+   rqhash_reposition(e->hash, rq);
 }
 
 struct request *elv_rqhash_find(struct request_queue *q, sector_t offset)
 {
struct elevator_queue *e = q->elevator;
-   struct hlist_node *next;
-   struct request *rq;
-
-   hash_for_each_possible_safe(e->hash, rq, next, hash, offset) {
-   BUG_ON(!ELV_ON_HASH(rq));
 
-   if (unlikely(!rq_mergeable(rq))) {
-   __elv_rqhash_del(rq);
-   continue;
-   }
-
-   if (rq_hash_key(rq) == offset)
-   return rq;
-   }
-
-   return NULL;
+   return rqhash_find(e->hash, offset);
 }
 
 /*
-- 
2.9.5



[PATCH V5 2/8] blk-mq-sched: use q->queue_depth as hint for q->nr_requests

2017-09-30 Thread Ming Lei
SCSI sets q->queue_depth from shost->cmd_per_lun. q->queue_depth is per
request_queue and more closely related to the scheduler queue than the hw
queue depth, which can be shared by queues (e.g. with TAG_SHARED).

This patch uses q->queue_depth as a hint for computing q->nr_requests,
which should be more effective than the current approach.
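For illustration, the computation introduced by this patch together with
patch 1 boils down to the following standalone sketch (BLKDEV_MAX_RQ taken
as 128; the sample depths in main() are made up, not measured):

    #include <stdio.h>

    #define BLKDEV_MAX_RQ 128u

    static unsigned int sched_queue_depth(unsigned int q_depth,
                                          unsigned int tagset_depth)
    {
        /* prefer q->queue_depth when the driver sets it */
        unsigned int depth = q_depth ? q_depth : tagset_depth;

        /* double the smaller of depth and 128, but never go below 128 */
        depth = 2 * (depth < BLKDEV_MAX_RQ ? depth : BLKDEV_MAX_RQ);
        return depth > BLKDEV_MAX_RQ ? depth : BLKDEV_MAX_RQ;
    }

    int main(void)
    {
        /* e.g. cmd_per_lun = 3 -> 128; no q->queue_depth, 1024 tags -> 256 */
        printf("%u %u\n", sched_queue_depth(3, 62), sched_queue_depth(0, 1024));
        return 0;
    }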

Reviewed-by: Bart Van Assche 
Reviewed-by: Christoph Hellwig 
Tested-by: Oleksandr Natalenko 
Tested-by: Tom Nguyen 
Tested-by: Paolo Valente 
Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.h | 18 +++---
 block/blk-mq.c   | 27 +--
 block/blk-mq.h   |  1 +
 block/blk-settings.c |  2 ++
 4 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 1d47f3fda1d0..906b10c54f78 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -99,12 +99,24 @@ static inline bool blk_mq_sched_needs_restart(struct 
blk_mq_hw_ctx *hctx)
 static inline unsigned blk_mq_sched_queue_depth(struct request_queue *q)
 {
/*
-* Default to double of smaller one between hw queue_depth and 128,
+* q->queue_depth is more close to scheduler queue, so use it
+* as hint for computing scheduler queue depth if it is valid
+*/
+   unsigned q_depth = q->queue_depth ?: q->tag_set->queue_depth;
+
+   /*
+* Default to double of smaller one between queue depth and 128,
 * since we don't split into sync/async like the old code did.
 * Additionally, this is a per-hw queue depth.
 */
-   return 2 * min_t(unsigned int, q->tag_set->queue_depth,
-  BLKDEV_MAX_RQ);
+   q_depth = 2 * min_t(unsigned int, q_depth, BLKDEV_MAX_RQ);
+
+   /*
+* when queue depth of driver is too small, we set queue depth
+* of scheduler queue as 128 which is the default setting of
+* block legacy code.
+*/
+   return max_t(unsigned, q_depth, BLKDEV_MAX_RQ);
 }
 
 #endif
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7cb3f87334c0..561a663cdd0e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2694,7 +2694,9 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 }
 EXPORT_SYMBOL(blk_mq_free_tag_set);
 
-int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
+static int __blk_mq_update_nr_requests(struct request_queue *q,
+  bool sched_only,
+  unsigned int nr)
 {
struct blk_mq_tag_set *set = q->tag_set;
struct blk_mq_hw_ctx *hctx;
@@ -2713,7 +2715,7 @@ int blk_mq_update_nr_requests(struct request_queue *q, 
unsigned int nr)
 * If we're using an MQ scheduler, just update the scheduler
 * queue depth. This is similar to what the old code would do.
 */
-   if (!hctx->sched_tags) {
+   if (!sched_only && !hctx->sched_tags) {
ret = blk_mq_tag_update_depth(hctx, >tags,
min(nr, 
set->queue_depth),
false);
@@ -2733,6 +2735,27 @@ int blk_mq_update_nr_requests(struct request_queue *q, 
unsigned int nr)
return ret;
 }
 
+int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
+{
+   return __blk_mq_update_nr_requests(q, false, nr);
+}
+
+/*
+ * When drivers update q->queue_depth, this API is called so that
+ * we can use this queue depth as hint for adjusting scheduler
+ * queue depth.
+ */
+int blk_mq_update_sched_queue_depth(struct request_queue *q)
+{
+   unsigned nr;
+
+   if (!q->mq_ops || !q->elevator)
+   return 0;
+
+   nr = blk_mq_sched_queue_depth(q);
+   return __blk_mq_update_nr_requests(q, true, nr);
+}
+
 static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
int nr_hw_queues)
 {
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 915de58572e7..5bca6ce1f01d 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -37,6 +37,7 @@ bool blk_mq_get_driver_tag(struct request *rq, struct 
blk_mq_hw_ctx **hctx,
bool wait);
 struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
struct blk_mq_ctx *start);
+int blk_mq_update_sched_queue_depth(struct request_queue *q);
 
 /*
  * Internal helpers for allocating/freeing the request map
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 8559e9563c52..c2db38d2ec2b 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -878,6 +878,8 @@ void blk_set_queue_depth(struct request_queue *q, unsigned 
int depth)
 {
q->queue_depth = depth;
wbt_set_queue_depth(q->rq_wb, depth);
+
+  

[PATCH V5 0/8] blk-mq: improve bio merge for none scheduler

2017-09-30 Thread Ming Lei
Hi,

Patches 1-2 use q->queue_depth as a hint for setting up the scheduler
queue depth.

Patches 3-8 improve bio merging via a hash table in the sw queue, which
is more efficient than the current approach in which only the last 8
requests in the sw queue are checked.

This approach has also been used in the block legacy path for a long
time, and blk-mq schedulers use a hash table for bio merging too.
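To sketch the idea: requests sitting in the sw queue are hashed by their end
sector, so a bio that starts where some request ends is found with one bucket
lookup instead of scanning the last 8 entries. A simplified standalone model
(not the real request/hash types):

    #include <stdio.h>

    #define HASH_BITS   6
    #define NR_BUCKETS  (1u << HASH_BITS)

    struct sw_rq {
        unsigned long long pos, sectors;    /* start sector, length */
        struct sw_rq *next;                 /* bucket chain */
    };

    static struct sw_rq *buckets[NR_BUCKETS];

    static unsigned int bucket_of(unsigned long long end_sector)
    {
        return end_sector & (NR_BUCKETS - 1);   /* stand-in for hash_min() */
    }

    /* hash a queued request by its end sector (pos + sectors) */
    static void rq_hash_add(struct sw_rq *rq)
    {
        unsigned int b = bucket_of(rq->pos + rq->sectors);

        rq->next = buckets[b];
        buckets[b] = rq;
    }

    /* a bio starting at 'sector' can back-merge into a request ending there */
    static struct sw_rq *find_back_merge(unsigned long long sector)
    {
        struct sw_rq *rq;

        for (rq = buckets[bucket_of(sector)]; rq; rq = rq->next)
            if (rq->pos + rq->sectors == sector)
                return rq;
        return NULL;
    }

    int main(void)
    {
        struct sw_rq rq = { .pos = 2048, .sectors = 8 };

        rq_hash_add(&rq);
        printf("back-merge candidate at sector 2056: %s\n",
               find_back_merge(2056) ? "found" : "none");
        return 0;
    }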

V5:
- splitted from previous patchset of 'blk-mq-sched: improve
SCSI-MQ performance V4'

Ming Lei (8):
  blk-mq-sched: introduce blk_mq_sched_queue_depth()
  blk-mq-sched: use q->queue_depth as hint for q->nr_requests
  block: introduce rqhash helpers
  block: move actual bio merge code into __elv_merge
  block: add check on elevator for supporting bio merge via hashtable
from blk-mq sw queue
  block: introduce .last_merge and .hash to blk_mq_ctx
  blk-mq-sched: refactor blk_mq_sched_try_merge()
  blk-mq: improve bio merge from blk-mq sw queue

 block/blk-mq-sched.c | 75 +++---
 block/blk-mq-sched.h | 23 +
 block/blk-mq.c   | 55 ---
 block/blk-mq.h   |  5 +++
 block/blk-settings.c |  2 ++
 block/blk.h  | 55 +++
 block/elevator.c | 93 +++-
 7 files changed, 216 insertions(+), 92 deletions(-)

-- 
2.9.5



Re: [PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)

2017-09-30 Thread Ming Lei
On Sat, Sep 30, 2017 at 06:27:13PM +0800, Ming Lei wrote:
> Hi Jens,
> 
> In Red Hat internal storage tests of the blk-mq scheduler, we found
> that I/O performance is much worse with mq-deadline, especially for
> sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
> SRP...).
> 
> It turns out one big issue causes the performance regression: requests
> are still dequeued from the sw queue/scheduler queue even when the
> lld's queue is busy, so I/O merging becomes quite difficult, and
> sequential I/O degrades a lot.
> 
> This issue became one of the main reasons for reverting default SCSI_MQ
> in v4.13.
> 
> The 1st patch uses direct issue in blk_mq_request_bypass_insert(), so
> that we can improve dm-mpath's performance in part 2, which will be
> posted out soon.
> 
> The next six patches improve this situation and bring back some of the
> lost performance.
> 
> With this change, SCSI-MQ sequential I/O performance is improved a lot;
> Paolo reported that mq-deadline performance improved much [2] in his
> dbench test with V2. Performance improvement on lpfc/qla2xxx was also
> observed with V1 [1].
> 
> Please consider it for V4.15.
> 
> [1] http://marc.info/?l=linux-block=150151989915776=2
> [2] https://marc.info/?l=linux-block=150217980602843=2
> 
> V5:
>   - address some comments from Omar
>   - add Tested-by & Reviewed-by tags
>   - use direct issue for blk_mq_request_bypass_insert(), and
>   start to consider to improve sequential I/O for dm-mpath
>   - only include part 1(the original patch 1 ~ 6), as suggested
>   by Omar
> 
> V4:
>   - add Reviewed-by tag
>   - some trivial changes: typo fixes in commit log or comments,
>   variable names; no actual functional change
> 
> V3:
>   - totally round robin for picking req from ctx, as suggested
>   by Bart
>   - remove one local variable in __sbitmap_for_each_set()
>   - drop patches of single dispatch list, which can improve
>   performance on mq-deadline, but cause a bit degrade on
>   none because all hctxs need to be checked after ->dispatch
>   is flushed. Will post it again once it is mature.
>   - rebase on v4.13-rc6 with block for-next
> 
> V2:
>   - dequeue requests from sw queues in round-robin style,
>   as suggested by Bart, and introduce one helper in sbitmap
>   for this purpose
>   - improve bio merge via hash table from sw queue
>   - add comments about using DISPATCH_BUSY state in lockless way,
>   simplifying handling on busy state,
>   - hold ctx->lock when clearing ctx busy bit as suggested
>   by Bart
> 
> 
> Ming Lei (7):
>   blk-mq: issue rq directly in blk_mq_request_bypass_insert()
>   blk-mq-sched: fix scheduler bad performance
>   sbitmap: introduce __sbitmap_for_each_set()
>   blk-mq: introduce blk_mq_dequeue_from_ctx()
>   blk-mq-sched: move actual dispatching into one helper
>   blk-mq-sched: improve dispatching from sw queue
>   blk-mq-sched: don't dequeue request until all in ->dispatch are
> flushed
> 
>  block/blk-core.c|   3 +-
>  block/blk-mq-debugfs.c  |   1 +
>  block/blk-mq-sched.c| 104 ---
>  block/blk-mq.c  | 114 
> +++-
>  block/blk-mq.h  |   4 +-
>  drivers/md/dm-rq.c  |   2 +-
>  include/linux/blk-mq.h  |   3 ++
>  include/linux/sbitmap.h |  64 +++
>  8 files changed, 238 insertions(+), 57 deletions(-)

Oops, the title should have been:

[PATCH V5 0/7] blk-mq-sched: improve sequential I/O performance(part 1)

Sorry for that.

-- 
Ming


[PATCH V5 4/7] blk-mq: introduce blk_mq_dequeue_from_ctx()

2017-09-30 Thread Ming Lei
This function is introduced for dequeuing requests from the sw queue so
that we can dispatch them in the scheduler's way.

More importantly, some SCSI devices may set q->queue_depth, which is a
per-request_queue limit applied to pending I/O from all hctxs. This
function avoids dequeuing too many requests from the sw queue when
->dispatch isn't completely flushed.

Tested-by: Oleksandr Natalenko 
Tested-by: Tom Nguyen 
Tested-by: Paolo Valente 
Reviewed-by: Bart Van Assche 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 38 ++
 block/blk-mq.h |  2 ++
 2 files changed, 40 insertions(+)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index d1b9fb539eba..8b49af1ade7f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -882,6 +882,44 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, 
struct list_head *list)
 }
 EXPORT_SYMBOL_GPL(blk_mq_flush_busy_ctxs);
 
+struct dispatch_rq_data {
+   struct blk_mq_hw_ctx *hctx;
+   struct request *rq;
+};
+
+static bool dispatch_rq_from_ctx(struct sbitmap *sb, unsigned int bitnr, void 
*data)
+{
+   struct dispatch_rq_data *dispatch_data = data;
+   struct blk_mq_hw_ctx *hctx = dispatch_data->hctx;
+   struct blk_mq_ctx *ctx = hctx->ctxs[bitnr];
+
+   spin_lock(>lock);
+   if (unlikely(!list_empty(>rq_list))) {
+   dispatch_data->rq = list_entry_rq(ctx->rq_list.next);
+   list_del_init(_data->rq->queuelist);
+   if (list_empty(>rq_list))
+   sbitmap_clear_bit(sb, bitnr);
+   }
+   spin_unlock(>lock);
+
+   return !dispatch_data->rq;
+}
+
+struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
+   struct blk_mq_ctx *start)
+{
+   unsigned off = start ? start->index_hw : 0;
+   struct dispatch_rq_data data = {
+   .hctx = hctx,
+   .rq   = NULL,
+   };
+
+   __sbitmap_for_each_set(>ctx_map, off,
+  dispatch_rq_from_ctx, );
+
+   return data.rq;
+}
+
 static inline unsigned int queued_to_index(unsigned int queued)
 {
if (!queued)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 61aecf398a4b..915de58572e7 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -35,6 +35,8 @@ void blk_mq_flush_busy_ctxs(struct blk_mq_hw_ctx *hctx, 
struct list_head *list);
 bool blk_mq_hctx_has_pending(struct blk_mq_hw_ctx *hctx);
 bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
bool wait);
+struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
+   struct blk_mq_ctx *start);
 
 /*
  * Internal helpers for allocating/freeing the request map
-- 
2.9.5



[PATCH V5 7/7] blk-mq-sched: don't dequeue request until all in ->dispatch are flushed

2017-09-30 Thread Ming Lei
During dispatch, we move all requests from hctx->dispatch to a temporary
list, then dispatch them one by one from that list. Unfortunately, during
this period a queue run from another context may think the queue is idle
and start to dequeue from the sw/scheduler queue and try to dispatch,
because ->dispatch is empty. This hurts sequential I/O performance
because requests are dequeued while the lld queue is busy.

This patch introduces the BLK_MQ_S_DISPATCH_BUSY state to make sure that
requests aren't dequeued until ->dispatch has been flushed.

Reviewed-by: Bart Van Assche 
Tested-by: Oleksandr Natalenko 
Tested-by: Tom Nguyen 
Tested-by: Paolo Valente 
Signed-off-by: Ming Lei 
---
 block/blk-mq-debugfs.c |  1 +
 block/blk-mq-sched.c   | 53 +-
 block/blk-mq.c |  6 ++
 include/linux/blk-mq.h |  1 +
 4 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 813ca3bbbefc..f1a62c0d1acc 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -182,6 +182,7 @@ static const char *const hctx_state_name[] = {
HCTX_STATE_NAME(SCHED_RESTART),
HCTX_STATE_NAME(TAG_WAITING),
HCTX_STATE_NAME(START_ON_RUN),
+   HCTX_STATE_NAME(DISPATCH_BUSY),
 };
 #undef HCTX_STATE_NAME
 
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 3ba112d9dc15..c5eac1eee442 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -146,7 +146,6 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
struct request_queue *q = hctx->queue;
struct elevator_queue *e = q->elevator;
const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
-   bool do_sched_dispatch = true;
LIST_HEAD(rq_list);
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -177,8 +176,33 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
 */
if (!list_empty(_list)) {
blk_mq_sched_mark_restart_hctx(hctx);
-   do_sched_dispatch = blk_mq_dispatch_rq_list(q, _list);
-   } else if (!has_sched_dispatch && !q->queue_depth) {
+   blk_mq_dispatch_rq_list(q, _list);
+
+   /*
+* We may clear DISPATCH_BUSY just after it
+* is set from another context, the only cost
+* is that one request is dequeued a bit early,
+* we can survive that. Given the window is
+* small enough, no need to worry about performance
+* effect.
+*/
+   if (list_empty_careful(>dispatch))
+   clear_bit(BLK_MQ_S_DISPATCH_BUSY, >state);
+   }
+
+   /*
+* If DISPATCH_BUSY is set, that means hw queue is busy
+* and requests in the list of hctx->dispatch need to
+* be flushed first, so return early.
+*
+* Wherever DISPATCH_BUSY is set, blk_mq_run_hw_queue()
+* will be run to try to make progress, so it is always
+* safe to check the state here.
+*/
+   if (test_bit(BLK_MQ_S_DISPATCH_BUSY, >state))
+   return;
+
+   if (!has_sched_dispatch) {
/*
 * If there is no per-request_queue depth, we
 * flush all requests in this hw queue, otherwise
@@ -187,22 +211,15 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
 * run out of resource, which can be triggered
 * easily by per-request_queue queue depth
 */
-   blk_mq_flush_busy_ctxs(hctx, _list);
-   blk_mq_dispatch_rq_list(q, _list);
-   }
-
-   if (!do_sched_dispatch)
-   return;
-
-   /*
-* We want to dispatch from the scheduler if there was nothing
-* on the dispatch list or we were able to dispatch from the
-* dispatch list.
-*/
-   if (has_sched_dispatch)
+   if (!q->queue_depth) {
+   blk_mq_flush_busy_ctxs(hctx, &rq_list);
+   blk_mq_dispatch_rq_list(q, &rq_list);
+   } else {
+   blk_mq_do_dispatch_ctx(q, hctx);
+   }
+   } else {
blk_mq_do_dispatch_sched(q, e, hctx);
-   else
-   blk_mq_do_dispatch_ctx(q, hctx);
+   }
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8b49af1ade7f..7cb3f87334c0 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1142,6 +1142,11 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list)
 
spin_lock(&hctx->lock);
list_splice_init(list, &hctx->dispatch);
+   /*
+* DISPATCH_BUSY 

[PATCH V5 5/7] blk-mq-sched: move actual dispatching into one helper

2017-09-30 Thread Ming Lei
So that it becomes easy to support dispatching from the
sw queue in the following patch.

No functional change.

Reviewed-by: Bart Van Assche 
Reviewed-by: Omar Sandoval 
Tested-by: Oleksandr Natalenko 
Tested-by: Tom Nguyen 
Tested-by: Paolo Valente 
Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c | 28 ++--
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index eca011fdfa0e..538f363f39ca 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -89,6 +89,22 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx 
*hctx)
return false;
 }
 
+static void blk_mq_do_dispatch_sched(struct request_queue *q,
+struct elevator_queue *e,
+struct blk_mq_hw_ctx *hctx)
+{
+   LIST_HEAD(rq_list);
+
+   do {
+   struct request *rq;
+
+   rq = e->type->ops.mq.dispatch_request(hctx);
+   if (!rq)
+   break;
+   list_add(&rq->queuelist, &rq_list);
+   } while (blk_mq_dispatch_rq_list(q, &rq_list));
+}
+
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
struct request_queue *q = hctx->queue;
@@ -136,16 +152,8 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
 * on the dispatch list or we were able to dispatch from the
 * dispatch list.
 */
-   if (do_sched_dispatch && has_sched_dispatch) {
-   do {
-   struct request *rq;
-
-   rq = e->type->ops.mq.dispatch_request(hctx);
-   if (!rq)
-   break;
-   list_add(&rq->queuelist, &rq_list);
-   } while (blk_mq_dispatch_rq_list(q, &rq_list));
-   }
+   if (do_sched_dispatch && has_sched_dispatch)
+   blk_mq_do_dispatch_sched(q, e, hctx);
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
-- 
2.9.5



[PATCH V5 6/7] blk-mq-sched: improve dispatching from sw queue

2017-09-30 Thread Ming Lei
SCSI devices use a host-wide tagset, and the shared
driver tag space is often quite big. Meanwhile there
is also a per-LUN queue depth (.cmd_per_lun), which
is often small.

So lots of requests may stay in the sw queue, and we
always flush all of those belonging to the same hw
queue and dispatch them all to the driver.
Unfortunately it is easy to hit a busy queue because
of the small per-LUN queue depth. Once these requests
are flushed out, they have to stay in hctx->dispatch,
no bio merge can happen against them any more, and
sequential IO performance is hurt.

This patch improves dispatching from the sw queue
when there is a per-request-queue queue depth, by
taking requests one by one from the sw queue, just
like an IO scheduler does.
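
As a rough illustration of the disparity this targets (the numbers and
the example_sht name below are made up, not from any real driver), a
SCSI host might advertise something like:

	#include <scsi/scsi_host.h>

	static struct scsi_host_template example_sht = {
		.can_queue   = 1024,	/* shared, host-wide tag space */
		.cmd_per_lun = 3,	/* per-LUN queue depth */
	};

so flushing dozens of queued requests at once means most of them bounce
back and sit in hctx->dispatch, where later bios can no longer be
merged with them.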

Reviewed-by: Omar Sandoval 
Reviewed-by: Bart Van Assche 
Tested-by: Oleksandr Natalenko 
Tested-by: Tom Nguyen 
Tested-by: Paolo Valente 
Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c   | 53 --
 include/linux/blk-mq.h |  2 ++
 2 files changed, 53 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 538f363f39ca..3ba112d9dc15 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -105,6 +105,42 @@ static void blk_mq_do_dispatch_sched(struct request_queue 
*q,
} while (blk_mq_dispatch_rq_list(q, &rq_list));
 }
 
+static struct blk_mq_ctx *blk_mq_next_ctx(struct blk_mq_hw_ctx *hctx,
+ struct blk_mq_ctx *ctx)
+{
+   unsigned idx = ctx->index_hw;
+
+   if (++idx == hctx->nr_ctx)
+   idx = 0;
+
+   return hctx->ctxs[idx];
+}
+
+static void blk_mq_do_dispatch_ctx(struct request_queue *q,
+  struct blk_mq_hw_ctx *hctx)
+{
+   LIST_HEAD(rq_list);
+   struct blk_mq_ctx *ctx = READ_ONCE(hctx->dispatch_from);
+   bool dispatched;
+
+   do {
+   struct request *rq;
+
+   rq = blk_mq_dequeue_from_ctx(hctx, ctx);
+   if (!rq)
+   break;
+   list_add(&rq->queuelist, &rq_list);
+
+   /* round robin for fair dispatch */
+   ctx = blk_mq_next_ctx(hctx, rq->mq_ctx);
+
+   dispatched = blk_mq_dispatch_rq_list(q, &rq_list);
+   } while (dispatched);
+
+   if (!dispatched)
+   WRITE_ONCE(hctx->dispatch_from, ctx);
+}
+
 void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
 {
struct request_queue *q = hctx->queue;
@@ -142,18 +178,31 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
if (!list_empty(&rq_list)) {
blk_mq_sched_mark_restart_hctx(hctx);
do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
-   } else if (!has_sched_dispatch) {
+   } else if (!has_sched_dispatch && !q->queue_depth) {
+   /*
+* If there is no per-request_queue depth, we
+* flush all requests in this hw queue, otherwise
+* pick up request one by one from sw queue for
+* avoiding to mess up I/O merge when dispatch
+* run out of resource, which can be triggered
+* easily by per-request_queue queue depth
+*/
blk_mq_flush_busy_ctxs(hctx, &rq_list);
blk_mq_dispatch_rq_list(q, &rq_list);
}
 
+   if (!do_sched_dispatch)
+   return;
+
/*
 * We want to dispatch from the scheduler if there was nothing
 * on the dispatch list or we were able to dispatch from the
 * dispatch list.
 */
-   if (do_sched_dispatch && has_sched_dispatch)
+   if (has_sched_dispatch)
blk_mq_do_dispatch_sched(q, e, hctx);
+   else
+   blk_mq_do_dispatch_ctx(q, hctx);
 }
 
 bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2747469cedaf..fccabe00fb55 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -30,6 +30,8 @@ struct blk_mq_hw_ctx {
 
struct sbitmap  ctx_map;
 
+   struct blk_mq_ctx   *dispatch_from;
+
struct blk_mq_ctx   **ctxs;
unsigned intnr_ctx;
 
-- 
2.9.5



[PATCH V5 3/7] sbitmap: introduce __sbitmap_for_each_set()

2017-09-30 Thread Ming Lei
We need to iterate over the ctxs starting from any ctx in a
round-robin way, so introduce this helper.
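
A hypothetical caller, just to illustrate the new start parameter
(count_bit/count_from are made-up names, not part of this patch):

	#include <linux/sbitmap.h>

	/* callback: called once per set bit; return true to keep iterating */
	static bool count_bit(struct sbitmap *sb, unsigned int bitnr, void *data)
	{
		unsigned int *count = data;

		(*count)++;
		return true;
	}

	/* scan the whole bitmap once, starting (round robin) at 'start' */
	static unsigned int count_from(struct sbitmap *sb, unsigned int start)
	{
		unsigned int count = 0;

		__sbitmap_for_each_set(sb, start, count_bit, &count);
		return count;
	}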

Cc: Omar Sandoval 
Tested-by: Oleksandr Natalenko 
Tested-by: Tom Nguyen 
Tested-by: Paolo Valente 
Signed-off-by: Ming Lei 
---
 include/linux/sbitmap.h | 64 -
 1 file changed, 47 insertions(+), 17 deletions(-)

diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index a1904aadbc45..0dcc60e820de 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -211,10 +211,14 @@ bool sbitmap_any_bit_set(const struct sbitmap *sb);
  */
 bool sbitmap_any_bit_clear(const struct sbitmap *sb);
 
+#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
+#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
+
 typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned int, void *);
 
 /**
- * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
+ * __sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
+ * @start: Where to start the iteration.
  * @sb: Bitmap to iterate over.
  * @fn: Callback. Should return true to continue or false to break early.
  * @data: Pointer to pass to callback.
@@ -222,35 +226,61 @@ typedef bool (*sb_for_each_fn)(struct sbitmap *, unsigned 
int, void *);
  * This is inline even though it's non-trivial so that the function calls to 
the
  * callback will hopefully get optimized away.
  */
-static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
-   void *data)
+static inline void __sbitmap_for_each_set(struct sbitmap *sb,
+ unsigned int start,
+ sb_for_each_fn fn, void *data)
 {
-   unsigned int i;
+   unsigned int index;
+   unsigned int nr;
+   unsigned int scanned = 0;
 
-   for (i = 0; i < sb->map_nr; i++) {
-   struct sbitmap_word *word = &sb->map[i];
-   unsigned int off, nr;
+   if (start >= sb->depth)
+   start = 0;
+   index = SB_NR_TO_INDEX(sb, start);
+   nr = SB_NR_TO_BIT(sb, start);
 
-   if (!word->word)
-   continue;
+   while (scanned < sb->depth) {
+   struct sbitmap_word *word = &sb->map[index];
+   unsigned int depth = min_t(unsigned int, word->depth - nr,
+  sb->depth - scanned);
 
-   nr = 0;
-   off = i << sb->shift;
+   scanned += depth;
+   if (!word->word)
+   goto next;
+
+   /*
+* On the first iteration of the outer loop, we need to add the
+* bit offset back to the size of the word for find_next_bit().
+* On all other iterations, nr is zero, so this is a noop.
+*/
+   depth += nr;
while (1) {
-   nr = find_next_bit(&word->word, word->depth, nr);
-   if (nr >= word->depth)
+   nr = find_next_bit(&word->word, depth, nr);
+   if (nr >= depth)
break;
-
-   if (!fn(sb, off + nr, data))
+   if (!fn(sb, (index << sb->shift) + nr, data))
return;
 
nr++;
}
+next:
+   nr = 0;
+   if (++index >= sb->map_nr)
+   index = 0;
}
 }
 
-#define SB_NR_TO_INDEX(sb, bitnr) ((bitnr) >> (sb)->shift)
-#define SB_NR_TO_BIT(sb, bitnr) ((bitnr) & ((1U << (sb)->shift) - 1U))
+/**
+ * sbitmap_for_each_set() - Iterate over each set bit in a &struct sbitmap.
+ * @sb: Bitmap to iterate over.
+ * @fn: Callback. Should return true to continue or false to break early.
+ * @data: Pointer to pass to callback.
+ */
+static inline void sbitmap_for_each_set(struct sbitmap *sb, sb_for_each_fn fn,
+   void *data)
+{
+   __sbitmap_for_each_set(sb, 0, fn, data);
+}
 
 static inline unsigned long *__sbitmap_word(struct sbitmap *sb,
unsigned int bitnr)
-- 
2.9.5



[PATCH V5 1/7] blk-mq: issue rq directly in blk_mq_request_bypass_insert()

2017-09-30 Thread Ming Lei
By issuing the rq directly in blk_mq_request_bypass_insert(),
we can:

1) avoid acquiring hctx->lock;

2) return the dispatch result to dm-rq, so that dm-rq can use
this information to improve I/O performance; part 2 of this
patchset will do that;

3) also, the following patch for improving sequential I/O performance
uses hctx->dispatch to decide if the hctx is busy, so we need to avoid
adding the rq into hctx->dispatch directly.

There will be another patch in which we move blk_mq_request_direct_insert()
out, since it is better for dm-rq to deal with this situation, and
the IO scheduler actually lives on the dm-rq side.
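
To illustrate 2), a dm-rq side caller could use the propagated status
roughly like this (a sketch only; the error handling shown here is
illustrative, the real dm-rq change is in part 2 of the series):

	blk_status_t ret;

	/* blk_insert_cloned_request() now returns the direct-issue result */
	ret = blk_insert_cloned_request(clone->q, clone);
	switch (ret) {
	case BLK_STS_OK:
		break;
	case BLK_STS_RESOURCE:
		/* underlying queue busy: requeue at the dm level instead of
		 * parking the clone in hctx->dispatch
		 */
		break;
	default:
		/* hard error: fail the original request */
		break;
	}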

Signed-off-by: Ming Lei 
---
 block/blk-core.c   |  3 +--
 block/blk-mq.c | 70 ++
 block/blk-mq.h |  2 +-
 drivers/md/dm-rq.c |  2 +-
 4 files changed, 52 insertions(+), 25 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 048be4aa6024..4c7fd2231145 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2350,8 +2350,7 @@ blk_status_t blk_insert_cloned_request(struct 
request_queue *q, struct request *
 * bypass a potential scheduler on the bottom device for
 * insert.
 */
-   blk_mq_request_bypass_insert(rq);
-   return BLK_STS_OK;
+   return blk_mq_request_bypass_insert(rq);
}
 
spin_lock_irqsave(q->queue_lock, flags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 98a18609755e..d1b9fb539eba 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -39,6 +39,8 @@
 
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
+static blk_status_t blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+   struct request *rq, blk_qc_t *cookie, bool dispatch_only);
 
 static int blk_mq_poll_stats_bkt(const struct request *rq)
 {
@@ -1401,20 +1403,31 @@ void __blk_mq_insert_request(struct blk_mq_hw_ctx 
*hctx, struct request *rq,
blk_mq_hctx_mark_pending(hctx, ctx);
 }
 
+static void blk_mq_request_direct_insert(struct blk_mq_hw_ctx *hctx,
+struct request *rq)
+{
+   spin_lock(&hctx->lock);
+   list_add_tail(&rq->queuelist, &hctx->dispatch);
+   spin_unlock(&hctx->lock);
+
+   blk_mq_run_hw_queue(hctx, false);
+}
+
 /*
  * Should only be used carefully, when the caller knows we want to
  * bypass a potential IO scheduler on the target device.
  */
-void blk_mq_request_bypass_insert(struct request *rq)
+blk_status_t blk_mq_request_bypass_insert(struct request *rq)
 {
struct blk_mq_ctx *ctx = rq->mq_ctx;
struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
+   blk_qc_t cookie;
+   blk_status_t ret;
 
-   spin_lock(&hctx->lock);
-   list_add_tail(&rq->queuelist, &hctx->dispatch);
-   spin_unlock(&hctx->lock);
-
-   blk_mq_run_hw_queue(hctx, false);
+   ret = blk_mq_try_issue_directly(hctx, rq, &cookie, true);
+   if (ret == BLK_STS_RESOURCE)
+   blk_mq_request_direct_insert(hctx, rq);
+   return ret;
 }
 
 void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, struct blk_mq_ctx *ctx,
@@ -1527,9 +1540,14 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx 
*hctx, struct request *rq)
return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
 }
 
-static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
-   struct request *rq,
-   blk_qc_t *cookie, bool may_sleep)
+/*
+ * 'dispatch_only' means we only try to dispatch it out, and
+ * don't deal with dispatch failure if BLK_STS_RESOURCE or
+ * BLK_STS_IOERR happens.
+ */
+static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+   struct request *rq, blk_qc_t *cookie, bool may_sleep,
+   bool dispatch_only)
 {
struct request_queue *q = rq->q;
struct blk_mq_queue_data bd = {
@@ -1537,7 +1555,7 @@ static void __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
.last = true,
};
blk_qc_t new_cookie;
-   blk_status_t ret;
+   blk_status_t ret = BLK_STS_OK;
bool run_queue = true;
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -1546,9 +1564,10 @@ static void __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
goto insert;
}
 
-   if (q->elevator)
+   if (q->elevator && !dispatch_only)
goto insert;
 
+   ret = BLK_STS_RESOURCE;
if (!blk_mq_get_driver_tag(rq, NULL, false))
goto insert;
 
@@ -1563,26 +1582,32 @@ static void __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
switch (ret) {
case BLK_STS_OK:
*cookie = new_cookie;
-   return;
+   return ret;
case BLK_STS_RESOURCE:

[PATCH V5 2/7] blk-mq-sched: fix scheduler bad performance

2017-09-30 Thread Ming Lei
When the hw queue is busy, we shouldn't take requests from the
scheduler queue any more, otherwise it becomes difficult to do
IO merging.

This patch fixes the awful IO performance on some
SCSI devices (lpfc, qla2xxx, ...) when mq-deadline/kyber
is used, by not taking requests while the hw queue is busy.

Tested-by: Oleksandr Natalenko 
Tested-by: Tom Nguyen 
Tested-by: Paolo Valente 
Reviewed-by: Bart Van Assche 
Signed-off-by: Ming Lei 
---
 block/blk-mq-sched.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 4ab69435708c..eca011fdfa0e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -94,7 +94,7 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
struct request_queue *q = hctx->queue;
struct elevator_queue *e = q->elevator;
const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
-   bool did_work = false;
+   bool do_sched_dispatch = true;
LIST_HEAD(rq_list);
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -125,18 +125,18 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
 */
if (!list_empty(&rq_list)) {
blk_mq_sched_mark_restart_hctx(hctx);
-   did_work = blk_mq_dispatch_rq_list(q, &rq_list);
+   do_sched_dispatch = blk_mq_dispatch_rq_list(q, &rq_list);
} else if (!has_sched_dispatch) {
blk_mq_flush_busy_ctxs(hctx, &rq_list);
blk_mq_dispatch_rq_list(q, &rq_list);
}
 
/*
-* We want to dispatch from the scheduler if we had no work left
-* on the dispatch list, OR if we did have work but weren't able
-* to make progress.
+* We want to dispatch from the scheduler if there was nothing
+* on the dispatch list or we were able to dispatch from the
+* dispatch list.
 */
-   if (!did_work && has_sched_dispatch) {
+   if (do_sched_dispatch && has_sched_dispatch) {
do {
struct request *rq;
 
-- 
2.9.5



[PATCH V5 00/14] blk-mq-sched: improve sequential I/O performance(part 1)

2017-09-30 Thread Ming Lei
Hi Jens,

In Red Hat internal storage tests wrt. the blk-mq scheduler, we
found that I/O performance is quite bad with mq-deadline, especially
for sequential I/O on some multi-queue SCSI devices (lpfc, qla2xxx,
SRP...).

It turns out one big issue causes the performance regression: requests
are still dequeued from the sw queue/scheduler queue even when the lld's
queue is busy, so I/O merging becomes quite difficult, and
sequential IO degrades a lot.

This issue became one of the main reasons for reverting default SCSI_MQ
in V4.13.

The 1st patch does direct issue in blk_mq_request_bypass_insert(),
then we can improve dm-mpath's performance in part 2, which will
be posted out soon.

The following six patches improve this situation and bring back
some of the lost performance.

With this change, SCSI-MQ sequential I/O performance is
much improved; Paolo reported that mq-deadline performance
improved much[2] in his dbench test wrt V2. Performance
improvement on lpfc/qla2xxx was also observed with V1.[1]

Please consider it for V4.15.

[1] http://marc.info/?l=linux-block=150151989915776=2
[2] https://marc.info/?l=linux-block=150217980602843=2

V5:
- address some comments from Omar
- add Tested-by & Reviewed-by tags
- use direct issue for blk_mq_request_bypass_insert(), and
start to consider to improve sequential I/O for dm-mpath
- only include part 1(the original patch 1 ~ 6), as suggested
by Omar

V4:
- add Reviewed-by tag
- some trivial changes: typo fixes in commit log or comments,
variable name, no actual functional change

V3:
- totally round robin for picking req from ctx, as suggested
by Bart
- remove one local variable in __sbitmap_for_each_set()
- drop the single-dispatch-list patches, which can improve
performance on mq-deadline, but cause a slight degradation on
none because all hctxs need to be checked after ->dispatch
is flushed. Will post them again once they are mature.
- rebase on v4.13-rc6 with block for-next

V2:
- dequeue requests from sw queues in a round-robin style,
as suggested by Bart, and introduce one helper in sbitmap
for this purpose
- improve bio merge via hash table from sw queue
- add comments about using DISPATCH_BUSY state in lockless way,
simplifying handling on busy state,
- hold ctx->lock when clearing ctx busy bit as suggested
by Bart


Ming Lei (7):
  blk-mq: issue rq directly in blk_mq_request_bypass_insert()
  blk-mq-sched: fix scheduler bad performance
  sbitmap: introduce __sbitmap_for_each_set()
  blk-mq: introduce blk_mq_dequeue_from_ctx()
  blk-mq-sched: move actual dispatching into one helper
  blk-mq-sched: improve dispatching from sw queue
  blk-mq-sched: don't dequeue request until all in ->dispatch are
flushed

 block/blk-core.c|   3 +-
 block/blk-mq-debugfs.c  |   1 +
 block/blk-mq-sched.c| 104 ---
 block/blk-mq.c  | 114 +++-
 block/blk-mq.h  |   4 +-
 drivers/md/dm-rq.c  |   2 +-
 include/linux/blk-mq.h  |   3 ++
 include/linux/sbitmap.h |  64 +++
 8 files changed, 238 insertions(+), 57 deletions(-)

-- 
2.9.5



Re: [PATCH V7 0/6] block/scsi: safe SCSI quiescing

2017-09-30 Thread Ming Lei
Hi Martin,

On Sat, Sep 30, 2017 at 11:47:10AM +0200, Martin Steigerwald wrote:
> Hi Ming.
> 
> Ming Lei - 30.09.17, 14:12:
> > Please consider this patchset for V4.15, and it fixes one
> > kind of long-term I/O hang issue in either block legacy path
> > or blk-mq.
> >
> > The current SCSI quiesce isn't safe and easy to trigger I/O deadlock.
> 
> Isn't that material for -stable as well?

Yeah, the patch 6 is CCed to -stable.

> 
> I'd love to see this go into 4.14. Especially as it's an LTS release.

I am fine with either 4.14 or 4.15, and it is up to Jens.

Thanks
Ming


Re: [PATCH V7 0/6] block/scsi: safe SCSI quiescing

2017-09-30 Thread Martin Steigerwald
Hi Ming.

Ming Lei - 30.09.17, 14:12:
> Please consider this patchset for V4.15, and it fixes one
> kind of long-term I/O hang issue in either block legacy path
> or blk-mq.
>
> The current SCSI quiesce isn't safe and easy to trigger I/O deadlock.

Isn't that material for -stable as well?

I'd love to see this go into 4.14. Especially as it's an LTS release.

Thanks,
Martin

> Once SCSI device is put into QUIESCE, no new request except for
> RQF_PREEMPT can be dispatched to SCSI successfully, and
> scsi_device_quiesce() just simply waits for completion of I/Os
> dispatched to SCSI stack. It isn't enough at all.
> 
> Because new requests can still be coming in, but none of the allocated
> requests can be dispatched successfully, the request pool can be
> consumed easily.
> 
> Then a request with RQF_PREEMPT can't be allocated and waits forever,
> so the system hangs forever, such as during system suspend or
> when sending SCSI domain validation in the case of transport_spi.
> 
> Both IO hang inside system suspend[1] or SCSI domain validation
> were reported before.
> 
> This patch introduces preempt only mode, and solves the issue
> by allowing only RQF_PREEMPT during SCSI quiesce.
> 
> Both SCSI and SCSI_MQ have this IO deadlock issue, this patch fixes
> them all.
> 
> V7:
>   - add Reviewed-by & Tested-by
>   - one line change in patch 5 for checking preempt request
> 
> V6:
>   - borrow Bart's idea of preempt only, with clean
> implementation(patch 5/patch 6)
>   - needn't any external driver's dependency, such as MD's
>   change
> 
> V5:
>   - fix one tiny race by introducing blk_queue_enter_preempt_freeze()
>   given this change is small enough compared with V4, I added
>   tested-by directly
> 
> V4:
>   - reorganize patch order to make it more reasonable
>   - support nested preempt freeze, as required by SCSI transport spi
>   - check preempt freezing in slow path of of blk_queue_enter()
>   - add "SCSI: transport_spi: resume a quiesced device"
>   - wake up freeze queue in setting dying for both blk-mq and legacy
>   - rename blk_mq_[freeze|unfreeze]_queue() in one patch
>   - rename .mq_freeze_wq and .mq_freeze_depth
>   - improve comment
> 
> V3:
>   - introduce q->preempt_unfreezing to fix one bug of preempt freeze
>   - call blk_queue_enter_live() only when queue is preempt frozen
>   - cleanup a bit on the implementation of preempt freeze
>   - only patch 6 and 7 are changed
> 
> V2:
>   - drop the 1st patch in V1 because percpu_ref_is_dying() is
>   enough as pointed by Tejun
>   - introduce preempt version of blk_[freeze|unfreeze]_queue
>   - sync between preempt freeze and normal freeze
>   - fix warning from percpu-refcount as reported by Oleksandr
> 
> 
> [1] https://marc.info/?t=150340250100013=3=2
> 
> 
> Thanks,
> Ming
> 
> Ming Lei (6):
>   blk-mq: only run hw queues for blk-mq
>   block: tracking request allocation with q_usage_counter
>   block: pass flags to blk_queue_enter()
>   block: prepare for passing RQF_PREEMPT to request allocation
>   block: support PREEMPT_ONLY
>   SCSI: set block queue at preempt only when SCSI device is put into
> quiesce
> 
>  block/blk-core.c| 63
> +++-- block/blk-mq.c  |
> 14 ---
>  block/blk-timeout.c |  2 +-
>  drivers/scsi/scsi_lib.c | 25 +---
>  fs/block_dev.c  |  4 ++--
>  include/linux/blk-mq.h  |  7 +++---
>  include/linux/blkdev.h  | 27 ++---
>  7 files changed, 107 insertions(+), 35 deletions(-)


-- 
Martin


Re: [PATCH 4/5] bcache: writeback: collapse contiguous IO better

2017-09-30 Thread Michael Lyle
Just one more note---

IO merging is not happening properly right now.

It's easy to get a case together where basically all the writeback is
sequential.  E.g. if your writeback dirty data target is 15GB, do
something like:

$ sync;fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
--name=test --filename=test --bs=8k --iodepth=256 --size=30G
--readwrite=randwrite --ramp_time=4

This creates a situation where bcache has lots of 8K segments and
they're almost all sequential.

Wait for the test to complete, then watch iostat 1 .  You'll see that
the writeback KB written / tps == just a little more than 8, despite
the spinning disks doing IO at maximum rate.  You can feel the disk
and tell it is seeking tons.  The IOs are not being merged properly.

Mike

On Sat, Sep 30, 2017 at 1:23 AM, Michael Lyle  wrote:
> On Sat, Sep 30, 2017 at 1:03 AM, Coly Li  wrote:
 If writeback_rate is not minimum value, it means there are front end
 write requests existing.
>>>
>>> This is wrong.  Else we'd never make it back to the target.
>>>
>>
>> Of cause we can :-) When dirty data is far beyond dirty percent,
>> writeback rate is quite high, in that case dc->last_read takes effect.
>> But this is not the situation which your patch handles better.
>
> ... and dirty data can be beyond dirty percent without front-end write
> requests happening, right?  dirty data being significantly beyond
> dirty percent means that we fell far behind at some point in the past,
> because the spinning disk couldn't keep up with the writeback rate...
>
> I really don't even understand what you're saying.  Front-end write
> exists existing that will make their way to the backing disk will only
> happen if writeback is bypassed or if they're judged to be sequential.
> The point of writeback is to eat these front-end write requests.
>
>> You are right. But when writeback rate is high, dc->last_read can solve
>> the seeking problem as your patch does.
>
> Sure, but it doesn't lay the ground work for plugging..
>
>> Do you mean write I/O cannot be issued within slice_idle, or they are
>> issued within slice_idle but underlying elevator does not merge the
>> continuous request ?
>
> I mean this:  imagine the writeback rate is high with the current
> code, and the writeback code issues read requests for backing
> locations 1,2,3,4,5,6, which are noncontiguous on the SSD.
>
> 1, 3, 4, 5 are read quickly and the underlying elevator merges 3, 4, and 5.
>
> 2 & 6 arrive 2ms later.
>
> What happens?
>
>> I meant the dirty_io list in your future patch, not current 5 bios
>> patch. Hmm, maybe you can ignore my noise here, I just jumped off the topic.
>
> Keeping the set of dirty_io objects that are being issued in a list so
> that it's easy to iterate contiguous ones doesn't change the
> commitment behavior.
>
 And plug list will be unplugged automatically as default, when context
 switching happens. If you will performance read I/Os to the btrees, a
 context switch is probably to happen, then you won't keep a large bio
 lists ...
>>>
>>> Thankfully we'll have 5 things to fire immediately after each other,
>>> so we don't need to worry about automatic unplug.
>>>
>>
>> Forgive me, again I meant dirty_io list stuffs, not this patch...
>
> The idea is this: the group of up-to-5 IOs that the current patch
> gathers, are put in a list.  When all 5 have been read, then the
> device is plugged, and the writes are issued together, and then the
> device is unplugged.  Kent suggested this to me on IRC.
>
>> Could you please show exact benchmark result ? Then it will be easier to
>> change my mind. I only know if writeback I/O is not continuously issued
>> within slice_idle, mostly it is because read dirty delayed in line 238
>> of read_dirty(),
>> 235 if (KEY_START(&w->key) != dc->last_read ||
>> 236 jiffies_to_msecs(delay) > 50)
>> 237 while (!kthread_should_stop() && delay)
>> 238 delay = schedule_timeout_interruptible(delay);
>>
>> If writeback rate is high and writeback I/O is continuous, read I/O will
>> be issued without delay. Then writeback I/O will be issued by
>> write_dirty() in very short time.
>
> What this code says is, you're willing to get up to 50ms "ahead" on
> the writeback for contiguous I/O.
>
> This is bad, because: A) there is no bounds on how much you are
> willing to aggregate.  If the writeback rate is high, 50ms could be a
> lot of data, and a lot of IOPs to the cache device.  The new code
> bounds this.  B) Even if a single I/O could put you 50ms ahead, it
> still could be advantageous to do multiple writes.
>
>> Therefore without a real performance comparison, I cannot image why
>> adjacent write requests cannot merged in elevator... Unless I do the
>> test myself.
>
> See the above example with the lack of write-ordering-enforcement.
>
> Mike
>
>>
>>
>> --
>> Coly Li


Re: [PATCH 4/5] bcache: writeback: collapse contiguous IO better

2017-09-30 Thread Michael Lyle
On Sat, Sep 30, 2017 at 1:03 AM, Coly Li  wrote:
>>> If writeback_rate is not minimum value, it means there are front end
>>> write requests existing.
>>
>> This is wrong.  Else we'd never make it back to the target.
>>
>
> Of cause we can :-) When dirty data is far beyond dirty percent,
> writeback rate is quite high, in that case dc->last_read takes effect.
> But this is not the situation which your patch handles better.

... and dirty data can be beyond dirty percent without front-end write
requests happening, right?  dirty data being significantly beyond
dirty percent means that we fell far behind at some point in the past,
because the spinning disk couldn't keep up with the writeback rate...

I really don't even understand what you're saying.  Front-end writes
that will make their way to the backing disk will only
happen if writeback is bypassed or if they're judged to be sequential.
The point of writeback is to eat these front-end write requests.

> You are right. But when writeback rate is high, dc->last_read can solve
> the seeking problem as your patch does.

Sure, but it doesn't lay the ground work for plugging..

> Do you mean write I/O cannot be issued within slice_idle, or they are
> issued within slice_idle but underlying elevator does not merge the
> continuous request ?

I mean this:  imagine the writeback rate is high with the current
code, and the writeback code issues read requests for backing
locations 1,2,3,4,5,6, which are noncontiguous on the SSD.

1, 3, 4, 5 are read quickly and the underlying elevator merges 3, 4, and 5.

2 & 6 arrive 2ms later.

What happens?

> I meant the dirty_io list in your future patch, not current 5 bios
> patch. Hmm, maybe you can ignore my noise here, I just jumped off the topic.

Keeping the set of dirty_io objects that are being issued in a list so
that it's easy to iterate contiguous ones doesn't change the
commitment behavior.

>>> And plug list will be unplugged automatically as default, when context
>>> switching happens. If you will performance read I/Os to the btrees, a
>>> context switch is probably to happen, then you won't keep a large bio
>>> lists ...
>>
>> Thankfully we'll have 5 things to fire immediately after each other,
>> so we don't need to worry about automatic unplug.
>>
>
> Forgive me, again I meant dirty_io list stuffs, not this patch...

The idea is this: the group of up-to-5 IOs that the current patch
gathers, are put in a list.  When all 5 have been read, then the
device is plugged, and the writes are issued together, and then the
device is unplugged.  Kent suggested this to me on IRC.
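
Roughly like this (a sketch of the plan, not code from this series;
dio_list, the ->list member and write_dirty_io() are placeholder names):

	struct blk_plug plug;
	struct dirty_io *io;

	/* all reads for this contiguous group have completed by now */
	blk_start_plug(&plug);
	list_for_each_entry(io, &dio_list, list)
		write_dirty_io(io);	/* issue the backing-device write */
	blk_finish_plug(&plug);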

> Could you please show exact benchmark result ? Then it will be easier to
> change my mind. I only know if writeback I/O is not continuously issued
> within slice_idle, mostly it is because read dirty delayed in line 238
> of read_dirty(),
> 235 if (KEY_START(&w->key) != dc->last_read ||
> 236 jiffies_to_msecs(delay) > 50)
> 237 while (!kthread_should_stop() && delay)
> 238 delay = schedule_timeout_interruptible(delay);
>
> If writeback rate is high and writeback I/O is continuous, read I/O will
> be issued without delay. Then writeback I/O will be issued by
> write_dirty() in very short time.

What this code says is, you're willing to get up to 50ms "ahead" on
the writeback for contiguous I/O.

This is bad, because: A) there is no bound on how much you are
willing to aggregate.  If the writeback rate is high, 50ms could be a
lot of data, and a lot of IOPs to the cache device.  The new code
bounds this.  B) Even if a single I/O could put you 50ms ahead, it
still could be advantageous to do multiple writes.

> Therefore without a real performance comparison, I cannot image why
> adjacent write requests cannot merged in elevator... Unless I do the
> test myself.

See the above example with the lack of write-ordering-enforcement.

Mike

>
>
> --
> Coly Li


Re: [PATCH] blk-mq: remove unused function hctx_allow_merges

2017-09-30 Thread Jens Axboe
On 09/30/2017 10:15 AM, Jens Axboe wrote:
> On 09/30/2017 09:01 AM, weiping zhang wrote:
>> since 9bddeb2a5b981 "blk-mq: make per-sw-queue bio merge as default 
>> .bio_merge"
>> there is no caller for this function.
> 
> Looks like this should change the new check to actually use this function
> instead, or we will be doing merges if someone has turned off merging
> through the sysfs 'nomerges' function.

Actually, looks like the caller does that now, so the patch was actually
fine.

-- 
Jens Axboe



Re: [PATCH] blk-mq: remove unused function hctx_allow_merges

2017-09-30 Thread Jens Axboe
On 09/30/2017 09:01 AM, weiping zhang wrote:
> since 9bddeb2a5b981 "blk-mq: make per-sw-queue bio merge as default 
> .bio_merge"
> there is no caller for this function.

Looks like this should change the new check to actually use this function
instead, or we will be doing merges if someone has turned off merging
through the sysfs 'nomerges' function.


diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 4ab69435708c..5a3fe1bc5de6 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -224,6 +224,12 @@ static bool blk_mq_attempt_merge(struct request_queue *q,
return false;
 }
 
+static inline bool hctx_allow_merges(struct blk_mq_hw_ctx *hctx)
+{
+   return (hctx->flags & BLK_MQ_F_SHOULD_MERGE) &&
+   !blk_queue_nomerges(hctx->queue);
+}
+
 bool __blk_mq_sched_bio_merge(struct request_queue *q, struct bio *bio)
 {
struct elevator_queue *e = q->elevator;
@@ -236,7 +242,7 @@ bool __blk_mq_sched_bio_merge(struct request_queue *q, 
struct bio *bio)
return e->type->ops.mq.bio_merge(hctx, bio);
}
 
-   if (hctx->flags & BLK_MQ_F_SHOULD_MERGE) {
+   if (hctx_allow_merges(hctx)) {
/* default per sw-queue merge */
spin_lock(>lock);
ret = blk_mq_attempt_merge(q, ctx, bio);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 98a18609755e..59687ed6561b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1504,12 +1504,6 @@ static void blk_mq_bio_to_request(struct request *rq, 
struct bio *bio)
blk_account_io_start(rq, true);
 }
 
-static inline bool hctx_allow_merges(struct blk_mq_hw_ctx *hctx)
-{
-   return (hctx->flags & BLK_MQ_F_SHOULD_MERGE) &&
-   !blk_queue_nomerges(hctx->queue);
-}
-
 static inline void blk_mq_queue_io(struct blk_mq_hw_ctx *hctx,
   struct blk_mq_ctx *ctx,
   struct request *rq)

-- 
Jens Axboe



[RFC PATCH 2/2] block/cfq: Fix memory leak of async cfqq

2017-09-30 Thread Jeffy Chen
Currently we only unref the async cfqqs in cfq_pd_offline, which would
not be called when CONFIG_CFQ_GROUP_IOSCHED is disabled.

Kmemleak reported:
unreferenced object 0xffc0cd9fc000 (size 240):
  comm "kworker/3:1", pid 52, jiffies 4294673527 (age 97.149s)
  hex dump (first 32 bytes):
01 00 00 00 00 00 00 00 80 55 13 cf c0 ff ff ff  .U..
10 c0 9f cd c0 ff ff ff 00 00 00 00 00 00 00 00  
  backtrace:
[] kmemleak_alloc+0x58/0x8c
[] kmem_cache_alloc+0x184/0x268
[] cfq_get_queue.isra.11+0x144/0x2e0
[] cfq_set_request+0x1bc/0x444
[] elv_set_request+0x88/0x9c
[] get_request+0x494/0x914
[] blk_get_request+0xdc/0x160
[] scsi_execute+0x70/0x23c
[] scsi_test_unit_ready+0xf4/0x1ec

Signed-off-by: Jeffy Chen 
---

 block/cfq-iosched.c | 28 ++--
 1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 9f342ef1ad42..75773eabecc7 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -401,6 +401,7 @@ struct cfq_data {
 
 static struct cfq_group *cfq_get_next_cfqg(struct cfq_data *cfqd);
 static void cfq_put_queue(struct cfq_queue *cfqq);
+static void cfqg_offline(struct cfq_group *cfqg);
 
 static struct cfq_rb_root *st_for(struct cfq_group *cfqg,
enum wl_class_t class,
@@ -1638,17 +1639,8 @@ static void cfq_pd_init(struct blkg_policy_data *pd)
 static void cfq_pd_offline(struct blkg_policy_data *pd)
 {
struct cfq_group *cfqg = pd_to_cfqg(pd);
-   int i;
-
-   for (i = 0; i < IOPRIO_BE_NR; i++) {
-   if (cfqg->async_cfqq[0][i])
-   cfq_put_queue(cfqg->async_cfqq[0][i]);
-   if (cfqg->async_cfqq[1][i])
-   cfq_put_queue(cfqg->async_cfqq[1][i]);
-   }
 
-   if (cfqg->async_idle_cfqq)
-   cfq_put_queue(cfqg->async_idle_cfqq);
+   cfqg_offline(cfqg);
 
/*
 * @blkg is going offline and will be ignored by
@@ -3741,6 +3733,21 @@ static void cfq_init_cfqq(struct cfq_data *cfqd, struct 
cfq_queue *cfqq,
cfqq->pid = pid;
 }
 
+static void cfqg_offline(struct cfq_group *cfqg)
+{
+   int i;
+
+   for (i = 0; i < IOPRIO_BE_NR; i++) {
+   if (cfqg->async_cfqq[0][i])
+   cfq_put_queue(cfqg->async_cfqq[0][i]);
+   if (cfqg->async_cfqq[1][i])
+   cfq_put_queue(cfqg->async_cfqq[1][i]);
+   }
+
+   if (cfqg->async_idle_cfqq)
+   cfq_put_queue(cfqg->async_idle_cfqq);
+}
+
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
 static void check_blkcg_changed(struct cfq_io_cq *cic, struct bio *bio)
 {
@@ -4564,6 +4571,7 @@ static void cfq_exit_queue(struct elevator_queue *e)
 #ifdef CONFIG_CFQ_GROUP_IOSCHED
blkcg_deactivate_policy(q, &blkcg_policy_cfq);
 #else
+   cfqg_offline(cfqd->root_group);
kfree(cfqd->root_group);
 #endif
kfree(cfqd);
-- 
2.11.0




[PATCH] blk-mq: wire up completion notifier for laptop mode

2017-09-30 Thread Jens Axboe
For some reason, the laptop mode IO completion notifier was never wired
up for blk-mq. Ensure that we trigger the callback appropriately, to arm
the laptop mode flush timer.

Signed-off-by: Jens Axboe 

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 98a18609755e..09e92667be98 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -476,6 +476,9 @@ void blk_mq_free_request(struct request *rq)
if (rq->rq_flags & RQF_MQ_INFLIGHT)
atomic_dec(&hctx->nr_active);
 
+   if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
+   laptop_io_completion(q->backing_dev_info);
+
wbt_done(q->rq_wb, &rq->issue_stat);
 
clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);

-- 
Jens Axboe



Re: [PATCH 4/5] bcache: writeback: collapse contiguous IO better

2017-09-30 Thread Coly Li
On 2017/9/30 3:13 PM, Michael Lyle wrote:
> Coly--
> 
> 
> On Fri, Sep 29, 2017 at 11:58 PM, Coly Li  wrote:
> On 2017/9/30 11:17 AM, Michael Lyle wrote:
>> [snip]
>>
>> If writeback_rate is not minimum value, it means there are front end
>> write requests existing.
> 
> This is wrong.  Else we'd never make it back to the target.
> 

Of course we can :-) When dirty data is far beyond the dirty percent,
the writeback rate is quite high, and in that case dc->last_read takes
effect. But this is not the situation which your patch handles better.


>> In this case, backend writeback I/O should nice
>> I/O throughput to front end I/O. Otherwise, application will observe
>> increased I/O latency, especially when dirty percentage is not very
>> high. For enterprise workload, this change hurts performance.
> 
> No, utilizing less of disk throughput for a given writeback rate
> improves workload latency.  :P  The maximum that will be aggregated in
> this way is constrained more strongly than the old code...
> 

Could you please show exact latency comparison data? Imagining that
from the code alone is too hard for me ...


>> An desired behavior for low latency enterprise workload is, when dirty
>> percentage is low, once there is front end I/O, backend writeback should
>> be at minimum rate. This patch will introduce unstable and unpredictable
>> I/O latency.
> 
> Nope.  It lowers disk utilization overall, and the amount of disk
> bandwidth any individual request chunk can use is explicitly bounded
> (unlike before, where it was not).
> 

It will be great if we can see performance data here.

>> Unless there is performance bottleneck of writeback seeking, at least
>> enterprise users will focus more on front end I/O latency 
> 
> The less writeback seeks the disk, the more the real workload can seek
> this disk!  And if you're at high writeback rates, the vast majority
> of the time the disk is seeking!
> 

You are right. But when writeback rate is high, dc->last_read can solve
the seeking problem as your patch does.


>> This method is helpful only when writeback I/Os is not issued
>> continuously, other wise if they are issued within slice_idle,
>> underlying elevator will reorder or merge the I/Os in larger request.
>> We have a subsystem in place to rate-limit and make sure that they are
> not issued continuously!  If you want to preserve good latency for
> userspace, you want to keep the system in the regime where that
> control system is effective!
> 


Do you mean write I/Os cannot be issued within slice_idle, or that they
are issued within slice_idle but the underlying elevator does not merge
the contiguous requests?


>> Hmm, if you move the dirty IO from btree into dirty_io list, then
>> perform I/O, there is risk that once machine is power down during
>> writeback there might be dirty data lost. If you continuously issue
>> dirty I/O and remove them from btree at same time, that means you will
>> introduce more latency to front end I/O...
> 
> No... this doesn't change the consistency properties.  I am just
> saying-- if we have (up to 5 contiguous things that we're going to
> write back), wait till they're all read; plug; dispatch writes ;
> unplug.
> 

I meant the dirty_io list in your future patch, not current 5 bios
patch. Hmm, maybe you can ignore my noise here, I just jumped off the topic.


>> And plug list will be unplugged automatically as default, when context
>> switching happens. If you will performance read I/Os to the btrees, a
>> context switch is probably to happen, then you won't keep a large bio
>> lists ...
> 
> Thankfully we'll have 5 things to fire immediately after each other,
> so we don't need to worry about automatic unplug.
> 

Forgive me, again I meant the dirty_io list stuff, not this patch...


>> IMHO when writeback rate is low, especially when backing hard disk is
>> not bottleneck, group continuous I/Os in bcache code does not help too
>> much for writeback performance. The only benefit is less I/O issued when
>> front end I/O is low or idle, but not most of users care about it,
>> especially enterprise users.
> 
> I disagree!
> 
>>> I believe patch 4 is useful on its own, but I have this and other
>>> pieces of development that depend upon it.
>>
>> Current bcache code works well in most of writeback loads, I just worry
>> that implementing an elevator in bcache writeback logic is a big
>> investment with a little return.
> 
> Bcache already implements a (one-way) elevator!  Bcache invests
> **significant effort** already to do writebacks in LBA order (to
> effectively short-stroke the disk), but because of different arrival
> times of the read request they can end up reordered.  This is bad.
> This is a bug.

Could you please show an exact benchmark result? Then it will be easier
to change my mind. I only know that if writeback I/O is not issued
continuously within slice_idle, it is mostly because the dirty read is
delayed at line 238 of read_dirty(),
235 if (KEY_START(&w->key) != dc->last_read ||
236  

Re: [PATCH] blk-mq: remove unused function hctx_allow_merges

2017-09-30 Thread Ming Lei
On Sat, Sep 30, 2017 at 3:01 PM, weiping zhang
 wrote:
> since 9bddeb2a5b981 "blk-mq: make per-sw-queue bio merge as default 
> .bio_merge"
> there is no caller for this function.
>
> Signed-off-by: weiping zhang 

Reviewed-by: Ming Lei 



-- 
Ming Lei


Re: [PATCH 4/5] bcache: writeback: collapse contiguous IO better

2017-09-30 Thread Michael Lyle
Actually-- I give up.

You've initially bounced every single one of my patches, even the ones
that fix crash & data corruption bugs.

I spend 10x as much time fighting with you as writing stuff for
bcache, and basically every time it's turned out that you're wrong.

I will go do something else where it is just not so much work to get
something done.

Mike

On Sat, Sep 30, 2017 at 12:13 AM, Michael Lyle  wrote:
> Coly--
>
>
> On Fri, Sep 29, 2017 at 11:58 PM, Coly Li  wrote:
>> On 2017/9/30 11:17 AM, Michael Lyle wrote:
>> [snip]
>>
>> If writeback_rate is not minimum value, it means there are front end
>> write requests existing.
>
> This is wrong.  Else we'd never make it back to the target.
>
>> In this case, backend writeback I/O should nice
>> I/O throughput to front end I/O. Otherwise, application will observe
>> increased I/O latency, especially when dirty percentage is not very
>> high. For enterprise workload, this change hurts performance.
>
> No, utilizing less of disk throughput for a given writeback rate
> improves workload latency.  :P  The maximum that will be aggregated in
> this way is constrained more strongly than the old code...
>
>> An desired behavior for low latency enterprise workload is, when dirty
>> percentage is low, once there is front end I/O, backend writeback should
>> be at minimum rate. This patch will introduce unstable and unpredictable
>> I/O latency.
>
> Nope.  It lowers disk utilization overall, and the amount of disk
> bandwidth any individual request chunk can use is explicitly bounded
> (unlike before, where it was not).
>
>> Unless there is performance bottleneck of writeback seeking, at least
>> enterprise users will focus more on front end I/O latency 
>
> The less writeback seeks the disk, the more the real workload can seek
> this disk!  And if you're at high writeback rates, the vast majority
> of the time the disk is seeking!
>
>> This method is helpful only when writeback I/Os is not issued
>> continuously, other wise if they are issued within slice_idle,
>> underlying elevator will reorder or merge the I/Os in larger request.
>
> We have a subsystem in place to rate-limit and make sure that they are
> not issued continuously!  If you want to preserve good latency for
> userspace, you want to keep the system in the regime where that
> control system is effective!
>
>> Hmm, if you move the dirty IO from btree into dirty_io list, then
>> perform I/O, there is risk that once machine is power down during
>> writeback there might be dirty data lost. If you continuously issue
>> dirty I/O and remove them from btree at same time, that means you will
>> introduce more latency to front end I/O...
>
> No... this doesn't change the consistency properties.  I am just
> saying-- if we have (up to 5 contiguous things that we're going to
> write back), wait till they're all read; plug; dispatch writes ;
> unplug.
>
>> And plug list will be unplugged automatically as default, when context
>> switching happens. If you will performance read I/Os to the btrees, a
>> context switch is probably to happen, then you won't keep a large bio
>> lists ...
>
> Thankfully we'll have 5 things to fire immediately after each other,
> so we don't need to worry about automatic unplug.
>
>> IMHO when writeback rate is low, especially when backing hard disk is
>> not bottleneck, group continuous I/Os in bcache code does not help too
>> much for writeback performance. The only benefit is less I/O issued when
>> front end I/O is low or idle, but not most of users care about it,
>> especially enterprise users.
>
> I disagree!
>
>>> I believe patch 4 is useful on its own, but I have this and other
>>> pieces of development that depend upon it.
>>
>> Current bcache code works well in most of writeback loads, I just worry
>> that implementing an elevator in bcache writeback logic is a big
>> investment with a little return.
>
> Bcache already implements a (one-way) elevator!  Bcache invests
> **significant effort** already to do writebacks in LBA order (to
> effectively short-stroke the disk), but because of different arrival
> times of the read request they can end up reordered.  This is bad.
> This is a bug.
>
> Mike
>
>>
>> --
>> Coly Li
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] blk-mq: remove unused function hctx_allow_merges

2017-09-30 Thread weiping zhang
Since commit 9bddeb2a5b981 ("blk-mq: make per-sw-queue bio merge as default
.bio_merge") there has been no caller of this function.

Signed-off-by: weiping zhang 
---
 block/blk-mq.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f84d145..520d257 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1484,12 +1484,6 @@ static void blk_mq_bio_to_request(struct request *rq, 
struct bio *bio)
blk_account_io_start(rq, true);
 }
 
-static inline bool hctx_allow_merges(struct blk_mq_hw_ctx *hctx)
-{
-   return (hctx->flags & BLK_MQ_F_SHOULD_MERGE) &&
-   !blk_queue_nomerges(hctx->queue);
-}
-
 static inline void blk_mq_queue_io(struct blk_mq_hw_ctx *hctx,
   struct blk_mq_ctx *ctx,
   struct request *rq)
-- 
2.9.4



Re: [PATCH 4/5] bcache: writeback: collapse contiguous IO better

2017-09-30 Thread Coly Li
On 2017/9/30 11:17 AM, Michael Lyle wrote:
> Coly--
> 
> What you say is correct-- it has a few changes from current behavior.
> 
> - When writeback rate is low, it is more willing to do contiguous
> I/Os.  This provides an opportunity for the IO scheduler to combine
> operations together.  The cost of doing 5 contiguous I/Os and 1 I/O is
> usually about the same on spinning disks, because most of the cost is
> seeking and rotational latency-- the actual sequential I/O bandwidth
> is very high.  This is a benefit.

Hi Mike,

Yes I can see it.


> - When writeback rate is medium, it does I/O more efficiently.  e.g.
> if the current writeback rate is 10MB/sec, and there are two
> contiguous 1MB segments, they would not presently be combined.  A 1MB
> write would occur, then we would increase the delay counter by 100ms,
> and then the next write would wait; this new code would issue 2 1MB
> writes one after the other, and then sleep 200ms.  On a disk that does
> 150MB/sec sequential, and has a 7ms seek time, this uses the disk for
> 13ms + 7ms, compared to the old code that does 13ms + 7ms * 2.  This
> is the difference between using 10% of the disk's I/O throughput and
> 13% of the disk's throughput to do the same work.


If writeback_rate is not at its minimum value, it means there are front-end
write requests existing. In this case, back-end writeback I/O should yield
I/O throughput to front-end I/O. Otherwise, applications will observe
increased I/O latency, especially when the dirty percentage is not very
high. For enterprise workloads, this change hurts performance.

A desired behavior for low-latency enterprise workloads is: when the dirty
percentage is low, once there is front-end I/O, back-end writeback should
run at the minimum rate. This patch will introduce unstable and unpredictable
I/O latency.

Unless there is a performance bottleneck from writeback seeking, at least
enterprise users will focus more on front-end I/O latency 

> - When writeback rate is very high (e.g. can't be obtained), there is
> not much difference currently, BUT:
> 
> Patch 5 is very important.  Right now, if there are many writebacks
> happening at once, the cached blocks can be read in any order.  This
> means that if we want to writeback blocks 1,2,3,4,5 we could actually
> end up issuing the write I/Os to the backing device as 3,1,4,2,5, with
> delays between them.  This is likely to make the disk seek a lot.
> Patch 5 provides an ordering property to ensure that the writes get
> issued in LBA order to the backing device.

This method is helpful only when writeback I/Os are not issued
continuously; otherwise, if they are issued within slice_idle, the
underlying elevator will reorder or merge the I/Os into larger requests.


> 
> ***The next step in this line of development (patch 6 ;) is to link
> groups of contiguous I/Os into a list in the dirty_io structure.  To
> know whether the "next I/Os" will be contiguous, we need to scan ahead
> like the new code in patch 4 does.  Then, in turn, we can plug the
> block device, and issue the contiguous writes together.  This allows
> us to guarantee that the I/Os will be properly merged and optimized by
> the underlying block IO scheduler.   Even with patch 5, currently the
> I/Os end up imperfectly combined, and the block layer ends up issuing
> writes 1, then 2,3, then 4,5.  This is great that things are combined
> some, but it could be combined into one big request.***  To get this
> benefit, it requires something like what was done in patch 4.
> 

Hmm, if you move the dirty IO from the btree into a dirty_io list and
then perform the I/O, there is a risk that if the machine powers down
during writeback, dirty data might be lost. If you continuously issue
dirty I/O and remove it from the btree at the same time, that means you
will introduce more latency to front-end I/O...

And the plug list will be unplugged automatically by default when a
context switch happens. If you perform read I/Os to the btrees, a
context switch is likely to happen, and then you won't keep a large bio
list ...

IMHO when the writeback rate is low, especially when the backing hard disk
is not the bottleneck, grouping contiguous I/Os in the bcache code does not
help writeback performance much. The only benefit is less I/O issued when
front-end I/O is low or idle, but most users do not care about that,
especially enterprise users.


> I believe patch 4 is useful on its own, but I have this and other
> pieces of development that depend upon it.

The current bcache code works well under most writeback loads; I just
worry that implementing an elevator in the bcache writeback logic is a
big investment with little return.

-- 
Coly Li


[PATCH v2] blk-throttle: fix possible io stall when upgrade to max

2017-09-30 Thread Joseph Qi
From: Joseph Qi 

There is a case which will lead to io stall. The case is described as
follows. 
/test1
  |-subtest1
/test2
  |-subtest2
And subtest1 and subtest2 each has 32 queued bios already.

Now upgrade to max. In throtl_upgrade_state, it will try to dispatch
bios as follows:
1) tg=subtest1, do nothing;
2) tg=test1, transfer 32 queued bios from subtest1 to test1; no pending
left, no need to schedule next dispatch;
3) tg=subtest2, do nothing;
4) tg=test2, transfer 32 queued bios from subtest2 to test2; no pending
left, no need to schedule next dispatch;
5) tg=/, transfer 8 queued bios from test1 to /, 8 queued bios from
test2 to /, 8 queued bios from test1 to /, and 8 queued bios from test2
to /; note that test1 and test2 each still has 16 queued bios left;
6) tg=/, try to schedule the next dispatch, but since disptime is now
(updated in tg_update_disptime, wait=0), the pending timer is in fact
not scheduled;
7) in total, throtl_upgrade_state dispatches 32 queued bios, with 32
left over; test1 and test2 each still has 16 queued bios;
8) throtl_pending_timer_fn sees the leftover bios, but can do nothing,
because throtl_select_dispatch returns 0 and test1/test2 have no
pending tg.

The blktrace shows the following:
8,32   00 2.539007641 0  m   N throtl upgrade to max
8,32   00 2.539072267 0  m   N throtl /test2 dispatch 
nr_queued=16 read=0 write=16
8,32   70 2.539077142 0  m   N throtl /test1 dispatch 
nr_queued=16 read=0 write=16

So force schedule dispatch if there are pending children.

Signed-off-by: Joseph Qi 
---
 block/blk-throttle.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0fea76a..17816a0 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1911,11 +1911,11 @@ static void throtl_upgrade_state(struct throtl_data *td)
 
tg->disptime = jiffies - 1;
throtl_select_dispatch(sq);
-   throtl_schedule_next_dispatch(sq, false);
+   throtl_schedule_next_dispatch(sq, true);
}
rcu_read_unlock();
throtl_select_dispatch(&td->service_queue);
-   throtl_schedule_next_dispatch(&td->service_queue, false);
+   throtl_schedule_next_dispatch(&td->service_queue, true);
queue_work(kthrotld_workqueue, &td->dispatch_work);
 }
 
-- 
1.9.4


[PATCH V7 3/6] block: pass flags to blk_queue_enter()

2017-09-30 Thread Ming Lei
We need to pass PREEMPT flags to blk_queue_enter()
in order to allocate a request with RQF_PREEMPT in the
following patch.

Tested-by: Oleksandr Natalenko 
Tested-by: Martin Steigerwald 
Cc: Bart Van Assche 
Signed-off-by: Ming Lei 
---
 block/blk-core.c   | 10 ++
 block/blk-mq.c |  5 +++--
 block/blk-timeout.c|  2 +-
 fs/block_dev.c |  4 ++--
 include/linux/blkdev.h |  7 ++-
 5 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a5011c824ac6..7d5040a6d5a4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -766,7 +766,7 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
-int blk_queue_enter(struct request_queue *q, bool nowait)
+int blk_queue_enter(struct request_queue *q, unsigned flags)
 {
while (true) {
int ret;
@@ -774,7 +774,7 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
if (percpu_ref_tryget_live(&q->q_usage_counter))
return 0;
 
-   if (nowait)
+   if (flags & BLK_REQ_NOWAIT)
return -EBUSY;
 
/*
@@ -1408,7 +1408,8 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
/* create ioc upfront */
create_io_context(gfp_mask, q->node);
 
-   ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
+   ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) ?
+   BLK_REQ_NOWAIT : 0);
if (ret)
return ERR_PTR(ret);
spin_lock_irq(q->queue_lock);
@@ -2215,7 +2216,8 @@ blk_qc_t generic_make_request(struct bio *bio)
do {
struct request_queue *q = bio->bi_disk->queue;
 
-   if (likely(blk_queue_enter(q, bio->bi_opf & REQ_NOWAIT) == 0)) {
+   if (likely(blk_queue_enter(q, (bio->bi_opf & REQ_NOWAIT) ?
+   BLK_REQ_NOWAIT : 0) == 0)) {
struct bio_list lower, same;
 
/* Create a fresh bio_list for all subordinate requests 
*/
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 10c1f49f663d..45bff90e08f7 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -384,7 +384,8 @@ struct request *blk_mq_alloc_request(struct request_queue 
*q, unsigned int op,
struct request *rq;
int ret;
 
-   ret = blk_queue_enter(q, flags & BLK_MQ_REQ_NOWAIT);
+   ret = blk_queue_enter(q, (flags & BLK_MQ_REQ_NOWAIT) ?
+   BLK_REQ_NOWAIT : 0);
if (ret)
return ERR_PTR(ret);
 
@@ -423,7 +424,7 @@ struct request *blk_mq_alloc_request_hctx(struct 
request_queue *q,
if (hctx_idx >= q->nr_hw_queues)
return ERR_PTR(-EIO);
 
-   ret = blk_queue_enter(q, true);
+   ret = blk_queue_enter(q, BLK_REQ_NOWAIT);
if (ret)
return ERR_PTR(ret);
 
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 17ec83bb0900..e803106a5e5b 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -134,7 +134,7 @@ void blk_timeout_work(struct work_struct *work)
struct request *rq, *tmp;
int next_set = 0;
 
-   if (blk_queue_enter(q, true))
+   if (blk_queue_enter(q, BLK_REQ_NOWAIT))
return;
spin_lock_irqsave(q->queue_lock, flags);
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 93d088ffc05c..98cf2d7ee9d3 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -674,7 +674,7 @@ int bdev_read_page(struct block_device *bdev, sector_t 
sector,
if (!ops->rw_page || bdev_get_integrity(bdev))
return result;
 
-   result = blk_queue_enter(bdev->bd_queue, false);
+   result = blk_queue_enter(bdev->bd_queue, 0);
if (result)
return result;
result = ops->rw_page(bdev, sector + get_start_sect(bdev), page, false);
@@ -710,7 +710,7 @@ int bdev_write_page(struct block_device *bdev, sector_t 
sector,
 
if (!ops->rw_page || bdev_get_integrity(bdev))
return -EOPNOTSUPP;
-   result = blk_queue_enter(bdev->bd_queue, false);
+   result = blk_queue_enter(bdev->bd_queue, 0);
if (result)
return result;
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 02fa42d24b52..127f64c7012c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -858,6 +858,11 @@ enum {
BLKPREP_INVALID,/* invalid command, kill, return -EREMOTEIO */
 };
 
+/* passed to blk_queue_enter */
+enum {
+   BLK_REQ_NOWAIT = (1 << 0),
+};
+
 extern unsigned long blk_max_low_pfn, blk_max_pfn;
 
 /*
@@ -963,7 +968,7 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct 
gendisk *, fmode_t,
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 

[PATCH V7 4/6] block: prepare for passing RQF_PREEMPT to request allocation

2017-09-30 Thread Ming Lei
RQF_PREEMPT is a bit special because the request is required
to be dispatched to the LLD even when the SCSI device is quiesced.

So this patch introduces __blk_get_request() and allows users to pass
the RQF_PREEMPT flag in; then we can allow allocation of RQF_PREEMPT
requests while the queue is in the PREEMPT ONLY mode which will be
introduced in the following patch.
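
A hedged usage sketch (the blkdev.h hunk below is truncated in this
archive, so the wrapper shown here is an assumption rather than a quote
from the patch): existing callers presumably keep blk_get_request() as
a thin wrapper passing no extra flags, while callers that must make
progress during SCSI quiesce pass BLK_REQ_PREEMPT explicitly, as patch
6/6 does for scsi_execute():

/* presumed wrapper for existing callers (assumption, see note above) */
static inline struct request *blk_get_request(struct request_queue *q,
					      unsigned int op, gfp_t gfp_mask)
{
	return __blk_get_request(q, op, gfp_mask, 0);
}

/* a caller that must make progress during SCSI quiesce */
req = __blk_get_request(q, REQ_OP_SCSI_IN, __GFP_RECLAIM, BLK_REQ_PREEMPT);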

Tested-by: Oleksandr Natalenko 
Tested-by: Martin Steigerwald 
Cc: Bart Van Assche 
Signed-off-by: Ming Lei 
---
 block/blk-core.c   | 19 +--
 block/blk-mq.c |  3 +--
 include/linux/blk-mq.h |  7 ---
 include/linux/blkdev.h | 17 ++---
 4 files changed, 28 insertions(+), 18 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7d5040a6d5a4..95b1c5e50be3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1398,7 +1398,8 @@ static struct request *get_request(struct request_queue 
*q, unsigned int op,
 }
 
 static struct request *blk_old_get_request(struct request_queue *q,
-  unsigned int op, gfp_t gfp_mask)
+  unsigned int op, gfp_t gfp_mask,
+  unsigned int flags)
 {
struct request *rq;
int ret = 0;
@@ -1408,8 +1409,7 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
/* create ioc upfront */
create_io_context(gfp_mask, q->node);
 
-   ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) ?
-   BLK_REQ_NOWAIT : 0);
+   ret = blk_queue_enter(q, flags & BLK_REQ_BITS_MASK);
if (ret)
return ERR_PTR(ret);
spin_lock_irq(q->queue_lock);
@@ -1427,26 +1427,25 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
return rq;
 }
 
-struct request *blk_get_request(struct request_queue *q, unsigned int op,
-   gfp_t gfp_mask)
+struct request *__blk_get_request(struct request_queue *q, unsigned int op,
+ gfp_t gfp_mask, unsigned int flags)
 {
struct request *req;
 
+   flags |= gfp_mask & __GFP_DIRECT_RECLAIM ? 0 : BLK_REQ_NOWAIT;
if (q->mq_ops) {
-   req = blk_mq_alloc_request(q, op,
-   (gfp_mask & __GFP_DIRECT_RECLAIM) ?
-   0 : BLK_MQ_REQ_NOWAIT);
+   req = blk_mq_alloc_request(q, op, flags);
if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
q->mq_ops->initialize_rq_fn(req);
} else {
-   req = blk_old_get_request(q, op, gfp_mask);
+   req = blk_old_get_request(q, op, gfp_mask, flags);
if (!IS_ERR(req) && q->initialize_rq_fn)
q->initialize_rq_fn(req);
}
 
return req;
 }
-EXPORT_SYMBOL(blk_get_request);
+EXPORT_SYMBOL(__blk_get_request);
 
 /**
  * blk_requeue_request - put a request back on queue
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 45bff90e08f7..90b43f607e3c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -384,8 +384,7 @@ struct request *blk_mq_alloc_request(struct request_queue 
*q, unsigned int op,
struct request *rq;
int ret;
 
-   ret = blk_queue_enter(q, (flags & BLK_MQ_REQ_NOWAIT) ?
-   BLK_REQ_NOWAIT : 0);
+   ret = blk_queue_enter(q, flags & BLK_REQ_BITS_MASK);
if (ret)
return ERR_PTR(ret);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 50c6485cb04f..066a676d7749 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -197,9 +197,10 @@ void blk_mq_free_request(struct request *rq);
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *);
 
 enum {
-   BLK_MQ_REQ_NOWAIT   = (1 << 0), /* return when out of requests */
-   BLK_MQ_REQ_RESERVED = (1 << 1), /* allocate from reserved pool */
-   BLK_MQ_REQ_INTERNAL = (1 << 2), /* allocate internal/sched tag */
+   BLK_MQ_REQ_NOWAIT   = BLK_REQ_NOWAIT, /* return when out of requests */
+   BLK_MQ_REQ_PREEMPT  = BLK_REQ_PREEMPT, /* allocate for RQF_PREEMPT */
+   BLK_MQ_REQ_RESERVED = (1 << BLK_REQ_MQ_START_BIT), /* allocate from reserved pool */
+   BLK_MQ_REQ_INTERNAL = (1 << (BLK_REQ_MQ_START_BIT + 1)), /* allocate internal/sched tag */
 };
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 127f64c7012c..68445adc8765 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -860,7 +860,10 @@ enum {
 
 /* passed to blk_queue_enter */
 enum {
-   BLK_REQ_NOWAIT = (1 << 0),
+   BLK_REQ_NOWAIT  = (1 << 0),
+   BLK_REQ_PREEMPT = (1 << 1),
+   BLK_REQ_MQ_START_BIT    = 2,
+   BLK_REQ_BITS_MASK   = (1U << BLK_REQ_MQ_START_BIT) - 1,

[PATCH V7 6/6] SCSI: set block queue at preempt only when SCSI device is put into quiesce

2017-09-30 Thread Ming Lei
Simply quiescing the SCSI device and waiting for completion of the I/O
already dispatched to the SCSI queue isn't safe: it is easy to use up
the request pool, because none of the previously allocated requests can
be dispatched once the device is put in QUIESCE. Then no request can be
allocated for RQF_PREEMPT, and the system may hang somewhere, such as
when sending sync_cache or start_stop commands in the system suspend
path.

Before quiescing SCSI, this patch sets the block queue to preempt-only
mode first, so no new normal request can enter the queue any more, and
all pending requests are drained once blk_set_preempt_only(true)
returns. Then RQF_PREEMPT requests can be allocated successfully during
SCSI quiescing.

This patch fixes a long-standing I/O hang issue in both the block
legacy and blk-mq paths.

Tested-by: Oleksandr Natalenko 
Tested-by: Martin Steigerwald 
Cc: sta...@vger.kernel.org
Cc: Bart Van Assche 
Signed-off-by: Ming Lei 
---
 drivers/scsi/scsi_lib.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 9cf6a80fe297..82c51619f1b7 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -252,9 +252,10 @@ int scsi_execute(struct scsi_device *sdev, const unsigned 
char *cmd,
struct scsi_request *rq;
int ret = DRIVER_ERROR << 24;
 
-   req = blk_get_request(sdev->request_queue,
+   req = __blk_get_request(sdev->request_queue,
data_direction == DMA_TO_DEVICE ?
-   REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM);
+   REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM,
+   BLK_REQ_PREEMPT);
if (IS_ERR(req))
return ret;
rq = scsi_req(req);
@@ -2928,12 +2929,28 @@ scsi_device_quiesce(struct scsi_device *sdev)
 {
int err;
 
+   /*
+* Simply quiescing SCSI device isn't safe, it is easy
+* to use up requests because all these allocated requests
+* can't be dispatched when device is put in QUIESCE.
+* Then no request can be allocated and we may hang
+* somewhere, such as system suspend/resume.
+*
+* So we set block queue in preempt only first, no new
+* normal request can enter queue any more, and all pending
+* requests are drained once blk_set_preempt_only()
+* returns. Only RQF_PREEMPT is allowed in preempt only mode.
+*/
+   blk_set_preempt_only(sdev->request_queue, true);
+
mutex_lock(&sdev->state_mutex);
err = scsi_device_set_state(sdev, SDEV_QUIESCE);
mutex_unlock(&sdev->state_mutex);
 
-   if (err)
+   if (err) {
+   blk_set_preempt_only(sdev->request_queue, false);
return err;
+   }
 
scsi_run_queue(sdev->request_queue);
while (atomic_read(&sdev->device_busy)) {
@@ -2964,6 +2981,8 @@ void scsi_device_resume(struct scsi_device *sdev)
scsi_device_set_state(sdev, SDEV_RUNNING) == 0)
scsi_run_queue(sdev->request_queue);
mutex_unlock(&sdev->state_mutex);
+
+   blk_set_preempt_only(sdev->request_queue, false);
 }
 EXPORT_SYMBOL(scsi_device_resume);
 
-- 
2.9.5



[PATCH V7 5/6] block: support PREEMPT_ONLY

2017-09-30 Thread Ming Lei
When the queue is in PREEMPT_ONLY mode, only RQF_PREEMPT requests
can be allocated and dispatched; other requests won't be allowed
to enter the I/O path.

This is useful for supporting a safe SCSI quiesce.
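
A minimal usage sketch of the new mode (mirrors the SCSI patch in this
series; the bracketed section is illustrative, not quoted code):

	/* drain normal I/O and refuse new non-preempt requests */
	blk_set_preempt_only(q, true);

	/*
	 * only RQF_PREEMPT requests may enter now, e.g. allocated
	 * with __blk_get_request(q, op, gfp, BLK_REQ_PREEMPT)
	 */

	/* allow normal request allocation again */
	blk_set_preempt_only(q, false);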

Part of this patch is from Bart's '[PATCH v4 4/7] block: Add the
QUEUE_FLAG_PREEMPT_ONLY request queue flag'.

Tested-by: Oleksandr Natalenko 
Tested-by: Martin Steigerwald 
Cc: Bart Van Assche 
Signed-off-by: Ming Lei 
---
 block/blk-core.c   | 26 --
 include/linux/blkdev.h |  5 +
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 95b1c5e50be3..bb683bfe37b2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -346,6 +346,17 @@ void blk_sync_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
+void blk_set_preempt_only(struct request_queue *q, bool preempt_only)
+{
+   blk_mq_freeze_queue(q);
+   if (preempt_only)
+   queue_flag_set_unlocked(QUEUE_FLAG_PREEMPT_ONLY, q);
+   else
+   queue_flag_clear_unlocked(QUEUE_FLAG_PREEMPT_ONLY, q);
+   blk_mq_unfreeze_queue(q);
+}
+EXPORT_SYMBOL(blk_set_preempt_only);
+
 /**
  * __blk_run_queue_uncond - run a queue whether or not it has been stopped
  * @q: The queue to run
@@ -771,9 +782,18 @@ int blk_queue_enter(struct request_queue *q, unsigned 
flags)
while (true) {
int ret;
 
+   /*
+* preempt_only flag has to be set after queue is frozen,
+* so it can be checked locklessly and safely here
+*/
+   if (blk_queue_preempt_only(q)) {
+   if (!(flags & BLK_REQ_PREEMPT))
+   goto slow_path;
+   }
+
if (percpu_ref_tryget_live(&q->q_usage_counter))
return 0;
-
+ slow_path:
if (flags & BLK_REQ_NOWAIT)
return -EBUSY;
 
@@ -787,7 +807,9 @@ int blk_queue_enter(struct request_queue *q, unsigned flags)
smp_rmb();
 
ret = wait_event_interruptible(q->mq_freeze_wq,
-   !atomic_read(&q->mq_freeze_depth) ||
+   (!atomic_read(&q->mq_freeze_depth) &&
+   ((flags & BLK_REQ_PREEMPT) ||
+!blk_queue_preempt_only(q))) ||
blk_queue_dying(q));
if (blk_queue_dying(q))
return -ENODEV;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 68445adc8765..b01a0c6bb1f0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -631,6 +631,7 @@ struct request_queue {
#define QUEUE_FLAG_REGISTERED  26  /* queue has been registered to a disk */
#define QUEUE_FLAG_SCSI_PASSTHROUGH 27 /* queue supports SCSI commands */
#define QUEUE_FLAG_QUIESCED    28  /* queue has been quiesced */
+#define QUEUE_FLAG_PREEMPT_ONLY    29  /* only process REQ_PREEMPT requests */
 
 #define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) |\
 (1 << QUEUE_FLAG_STACKABLE)|   \
@@ -735,6 +736,10 @@ static inline void queue_flag_clear(unsigned int flag, 
struct request_queue *q)
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
 REQ_FAILFAST_DRIVER))
 #define blk_queue_quiesced(q)  test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags)
+#define blk_queue_preempt_only(q)  \
+   test_bit(QUEUE_FLAG_PREEMPT_ONLY, &(q)->queue_flags)
+
+extern void blk_set_preempt_only(struct request_queue *q, bool preempt_only);
 
 static inline bool blk_account_rq(struct request *rq)
 {
-- 
2.9.5



[PATCH V7 1/6] blk-mq: only run hw queues for blk-mq

2017-09-30 Thread Ming Lei
This patch just makes it explicit that hardware queues are only run
for blk-mq queues.

Tested-by: Oleksandr Natalenko 
Tested-by: Martin Steigerwald 
Reviewed-by: Johannes Thumshirn 
Cc: Bart Van Assche 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 98a18609755e..6fd9f86fc86d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -125,7 +125,8 @@ void blk_freeze_queue_start(struct request_queue *q)
freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
if (freeze_depth == 1) {
percpu_ref_kill(&q->q_usage_counter);
-   blk_mq_run_hw_queues(q, false);
+   if (q->mq_ops)
+   blk_mq_run_hw_queues(q, false);
}
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
-- 
2.9.5



[PATCH V7 0/6] block/scsi: safe SCSI quiescing

2017-09-30 Thread Ming Lei
Hi Jens,

Please consider this patchset for V4.15; it fixes one kind of
long-term I/O hang issue in either the block legacy path or blk-mq.

The current SCSI quiesce isn't safe and can easily trigger an I/O deadlock.

Once the SCSI device is put into QUIESCE, no new request except for
RQF_PREEMPT can be dispatched to SCSI successfully, and
scsi_device_quiesce() simply waits for completion of the I/Os already
dispatched to the SCSI stack. That isn't enough at all.

New requests can still be coming in, but none of the allocated
requests can be dispatched successfully, so the request pool can
easily be used up.

Then a request with RQF_PREEMPT can't be allocated and waits forever,
and the system hangs forever, such as during system suspend or when
sending SCSI domain validation in the transport_spi case.
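
An illustrative fragment of where the legacy path gets stuck (not from
the patches, for explanation only): the suspend code reaches an
allocation like the one below with the request pool already exhausted
by undispatchable normal I/O, so it sleeps forever waiting for a free
request that can only come from completions the quiesced device will
never produce:

	req = blk_get_request(sdev->request_queue, REQ_OP_SCSI_IN,
			      __GFP_RECLAIM);
	/* never returns: get_request() waits on the exhausted pool */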

I/O hangs both inside system suspend[1] and during SCSI domain
validation have been reported before.

This patchset introduces a preempt-only mode and solves the issue
by allowing only RQF_PREEMPT requests during SCSI quiesce.

Both SCSI and SCSI_MQ have this I/O deadlock issue; this patchset
fixes them both.

V7:
- add Reviewed-by & Tested-by
- one line change in patch 5 for checking preempt request

V6:
- borrow Bart's idea of preempt only, with a clean
  implementation (patch 5/patch 6)
- no dependency on any external driver change, such as MD's
  change

V5:
- fix one tiny race by introducing blk_queue_enter_preempt_freeze();
  given this change is small enough compared with V4, I added the
  Tested-by tags directly

V4:
- reorganize patch order to make it more reasonable
- support nested preempt freeze, as required by SCSI transport spi
- check preempt freezing in the slow path of blk_queue_enter()
- add "SCSI: transport_spi: resume a quiesced device"
- wake up freeze queue in setting dying for both blk-mq and legacy
- rename blk_mq_[freeze|unfreeze]_queue() in one patch
- rename .mq_freeze_wq and .mq_freeze_depth
- improve comment

V3:
- introduce q->preempt_unfreezing to fix one bug of preempt freeze
- call blk_queue_enter_live() only when queue is preempt frozen
- cleanup a bit on the implementation of preempt freeze
- only patch 6 and 7 are changed

V2:
- drop the 1st patch in V1 because percpu_ref_is_dying() is
  enough, as pointed out by Tejun
- introduce preempt version of blk_[freeze|unfreeze]_queue
- sync between preempt freeze and normal freeze
- fix warning from percpu-refcount as reported by Oleksandr


[1] https://marc.info/?t=150340250100013&r=3&w=2


Thanks,
Ming

Ming Lei (6):
  blk-mq: only run hw queues for blk-mq
  block: tracking request allocation with q_usage_counter
  block: pass flags to blk_queue_enter()
  block: prepare for passing RQF_PREEMPT to request allocation
  block: support PREEMPT_ONLY
  SCSI: set block queue at preempt only when SCSI device is put into
quiesce

 block/blk-core.c| 63 +++--
 block/blk-mq.c  | 14 ---
 block/blk-timeout.c |  2 +-
 drivers/scsi/scsi_lib.c | 25 +---
 fs/block_dev.c  |  4 ++--
 include/linux/blk-mq.h  |  7 +++---
 include/linux/blkdev.h  | 27 ++---
 7 files changed, 107 insertions(+), 35 deletions(-)

-- 
2.9.5



[PATCH V7 2/6] block: tracking request allocation with q_usage_counter

2017-09-30 Thread Ming Lei
This usage is basically the same as in blk-mq, so that we can
support freezing the legacy queue easily.

Also 'wake_up_all(&q->mq_freeze_wq)' has to be moved
into blk_set_queue_dying(), since both legacy and blk-mq
may wait on the .mq_freeze_wq wait queue.
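
For context, the legacy submission path below leans on
blk_queue_enter_live(), which (already present in block/blk.h in this
era of the tree) is roughly just an unconditional reference grab, valid
only while the queue is known not to be going away:

static inline void blk_queue_enter_live(struct request_queue *q)
{
	/*
	 * The caller guarantees the queue is alive, e.g. it already
	 * holds a usage reference, so take another q_usage_counter
	 * reference directly without any liveness check.
	 */
	percpu_ref_get(&q->q_usage_counter);
}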

Tested-by: Oleksandr Natalenko 
Tested-by: Martin Steigerwald 
Reviewed-by: Hannes Reinecke 
Cc: Bart Van Assche 
Signed-off-by: Ming Lei 
---
 block/blk-core.c | 14 ++
 block/blk-mq.c   |  7 ---
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 048be4aa6024..a5011c824ac6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -610,6 +610,12 @@ void blk_set_queue_dying(struct request_queue *q)
}
spin_unlock_irq(q->queue_lock);
}
+
+   /*
+* We need to ensure that processes currently waiting on
+* the queue are notified as well.
+*/
+   wake_up_all(&q->mq_freeze_wq);
 }
 EXPORT_SYMBOL_GPL(blk_set_queue_dying);
 
@@ -1395,16 +1401,21 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
   unsigned int op, gfp_t gfp_mask)
 {
struct request *rq;
+   int ret = 0;
 
WARN_ON_ONCE(q->mq_ops);
 
/* create ioc upfront */
create_io_context(gfp_mask, q->node);
 
+   ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM));
+   if (ret)
+   return ERR_PTR(ret);
spin_lock_irq(q->queue_lock);
rq = get_request(q, op, NULL, gfp_mask);
if (IS_ERR(rq)) {
spin_unlock_irq(q->queue_lock);
+   blk_queue_exit(q);
return rq;
}
 
@@ -1576,6 +1587,7 @@ void __blk_put_request(struct request_queue *q, struct 
request *req)
blk_free_request(rl, req);
freed_request(rl, sync, rq_flags);
blk_put_rl(rl);
+   blk_queue_exit(q);
}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
@@ -1857,8 +1869,10 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, 
struct bio *bio)
 * Grab a free request. This is might sleep but can not fail.
 * Returns with the queue unlocked.
 */
+   blk_queue_enter_live(q);
req = get_request(q, bio->bi_opf, bio, GFP_NOIO);
if (IS_ERR(req)) {
+   blk_queue_exit(q);
__wbt_done(q->rq_wb, wb_acct);
if (PTR_ERR(req) == -ENOMEM)
bio->bi_status = BLK_STS_RESOURCE;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6fd9f86fc86d..10c1f49f663d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -256,13 +256,6 @@ void blk_mq_wake_waiters(struct request_queue *q)
queue_for_each_hw_ctx(q, hctx, i)
if (blk_mq_hw_queue_mapped(hctx))
blk_mq_tag_wakeup_all(hctx->tags, true);
-
-   /*
-* If we are called because the queue has now been marked as
-* dying, we need to ensure that processes currently waiting on
-* the queue are notified as well.
-*/
-   wake_up_all(&q->mq_freeze_wq);
 }
 
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
-- 
2.9.5