Re: [PATCH] bcache: fix for allocator and register thread race

2018-01-15 Thread Coly Li
On 10/01/2018 4:51 PM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> After a long run of random small-IO writes,
> we rebooted the machine, and after the machine powered on,
> bcache got stuck. The stack is:
> [root@ceph153 ~]# cat /proc/2510/task/*/stack
> [] closure_sync+0x25/0x90 [bcache]
> [] bch_journal+0x118/0x2b0 [bcache]
> [] bch_journal_meta+0x47/0x70 [bcache]
> [] bch_prio_write+0x237/0x340 [bcache]
> [] bch_allocator_thread+0x3c8/0x3d0 [bcache]
> [] kthread+0xcf/0xe0
> [] ret_from_fork+0x58/0x90
> [] 0x
> [root@ceph153 ~]# cat /proc/2038/task/*/stack
> [] __bch_btree_map_nodes+0x12d/0x150 [bcache]
> [] bch_btree_insert+0xf1/0x170 [bcache]
> [] bch_journal_replay+0x13f/0x230 [bcache]
> [] run_cache_set+0x79a/0x7c2 [bcache]
> [] register_bcache+0xd48/0x1310 [bcache]
> [] kobj_attr_store+0xf/0x20
> [] sysfs_write_file+0xc6/0x140
> [] vfs_write+0xbd/0x1e0
> [] SyS_write+0x7f/0xe0
> [] system_call_fastpath+0x16/0x1
> The stack shows the register thread and allocator thread
> were stuck while registering the cache device.
> 
> We rebooted the machine several times; the issue always
> exists on this machine.
> 
> We debugged the code and found the call trace below:
> register_bcache()
>   ==>run_cache_set()
>  ==>bch_journal_replay()
> ==>bch_btree_insert()
>==>__bch_btree_map_nodes()
>   ==>btree_insert_fn()
>  ==>btree_split() //node needs split
> ==>btree_check_reserve()
> In btree_check_reserve(), it checks whether there are enough buckets
> of RESERVE_BTREE type. Since the allocator thread was not working yet,
> no buckets of RESERVE_BTREE type had been allocated, so the register
> thread waits on c->btree_cache_wait and goes to sleep.
> 
> Then the allocator thread is initialized and goes to work;
> the call trace is below:
> bch_allocator_thread()
>   ==>bch_prio_write()
>  ==>bch_journal_meta()
> ==>bch_journal()
>==>journal_wait_for_write()
> In journal_wait_for_write(), it checks whether the journal is full via
> journal_full(), but the long run of random small-IO writes has
> exhausted the journal buckets (journal.blocks_free=0).
> To release journal buckets,
> the allocator calls btree_flush_write() to flush keys to
> btree nodes, and waits on c->journal.wait until the btree node writes
> finish or some journal bucket space is freed; then the allocator
> thread goes to sleep. But in btree_flush_write(), since
> bch_journal_replay() has not finished, no btree nodes carry journal
> references (the condition "if (btree_current_write(b)->journal)" is
> never satisfied), so we get no btree node to flush, no journal bucket
> is released, and the allocator sleeps forever.
> 
> From the above analysis, we can see that:
> 1) The register thread waits for the allocator thread to allocate
>    buckets of RESERVE_BTREE type;
> 2) The allocator thread waits for the register thread to replay the
>    journal, so it can flush btree nodes and free a journal bucket.
> So the two threads get stuck waiting for each other.
> 
> Hua Rui provided a patch to me that allocates some buckets of
> RESERVE_BTREE type in advance, so the register thread can get a bucket
> when a btree node splits, without waiting for the allocator thread.
> I tested it; it helped, and the register thread ran a step further,
> but it eventually still got stuck. The reason is that only 8 buckets
> of RESERVE_BTREE type were allocated, and in bch_journal_replay(),
> after 2 btree node splits only 4 buckets of RESERVE_BTREE type were
> left, so btree_check_reserve() was no longer satisfied and went to
> sleep again. Meanwhile, the allocator thread had not flushed enough
> btree nodes to release a journal bucket, so both got stuck again.
> 
> So we need to allocate more buckets of RESERVE_BTREE type in advance,
> but how many is enough? From experience and testing, I think it should
> be as many as the journal buckets. I modified the code as in this
> patch, tested it on the machine, and it works.
> 
> This patch is based on Hua Rui's patch, and allocates more buckets
> of RESERVE_BTREE type in advance so that the register thread and
> allocator thread do not wait for each other.
> 
> Signed-off-by: Hua Rui 
> Signed-off-by: Tang Junhui 

Hi Junhui,

It took me some time to understand the problem and the fix :-)
The root cause is that the bcache journal does not reserve space in
advance; it allocates a journal slot when a btree node splits while
inserting a key from the journal into the btree. If we reserved journal
space in advance, we would need something like a commit record to make
sure the operation is atomic. That is complicated, and I doubt bcache
deserves that complexity.
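
Putting the cycle from the quoted analysis in one place:

  register thread                        allocator thread
  ---------------                        ----------------
  bch_journal_replay()                   bch_prio_write()
    btree_split()                          bch_journal()
      btree_check_reserve()                  journal_wait_for_write()
        no RESERVE_BTREE bucket:               journal full, no node to flush:
        sleep on c->btree_cache_wait,          sleep on c->journal.wait,
        needs the allocator to run             needs journal replay to finish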

This fix is much simpler and it makes things work. I am OK with it.

Reviewed-by: Coly Li 

> ---
>  drivers/md/bcache/btree.c |  9 ++---
>  drivers/md/bcache/super.c | 12 +++-
>  2 files 

Re: [PATCH V4 1/3] blk-mq: move actual issue into one helper

2018-01-15 Thread Ming Lei
On Tue, Jan 16, 2018 at 9:40 AM, Ming Lei  wrote:
> On Mon, Jan 15, 2018 at 10:29:46AM -0700, Jens Axboe wrote:
>> On 1/15/18 9:58 AM, Ming Lei wrote:
>> > No functional change; just clean up the code a bit, so that the
>> > following change of using direct issue for
>> > blk_mq_request_bypass_insert(), which is needed by DM, is easier to do.
>> >
>> > Signed-off-by: Ming Lei 
>> > ---
>> >  block/blk-mq.c | 39 +++
>> >  1 file changed, 27 insertions(+), 12 deletions(-)
>> >
>> > diff --git a/block/blk-mq.c b/block/blk-mq.c
>> > index edb1291a42c5..bf8d6651f40e 100644
>> > --- a/block/blk-mq.c
>> > +++ b/block/blk-mq.c
>> > @@ -1696,15 +1696,37 @@ static blk_qc_t request_to_qc_t(struct 
>> > blk_mq_hw_ctx *hctx, struct request *rq)
>> > return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
>> >  }
>> >
>> > -static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>> > -   struct request *rq,
>> > -   blk_qc_t *cookie)
>> > +static blk_status_t __blk_mq_issue_req(struct blk_mq_hw_ctx *hctx,
>> > +  struct request *rq,
>> > +  blk_qc_t *new_cookie)
>> >  {
>> > +   blk_status_t ret;
>> > struct request_queue *q = rq->q;
>> > struct blk_mq_queue_data bd = {
>> > .rq = rq,
>> > .last = true,
>> > };
>> > +
>> > +   if (!blk_mq_get_driver_tag(rq, NULL, false))
>> > +   return BLK_STS_AGAIN;
>> > +
>> > +   if (!blk_mq_get_dispatch_budget(hctx)) {
>> > +   blk_mq_put_driver_tag(rq);
>> > +   return BLK_STS_AGAIN;
>> > +   }
>> > +
>> > +   *new_cookie = request_to_qc_t(hctx, rq);
>> > +
>> > +   ret = q->mq_ops->queue_rq(hctx, &bd);
>> > +
>> > +   return ret;
>>
>>   return q->mq_ops->queue_rq(hctx, &bd);
>>
>> and kill 'ret', it's not used. But more importantly, who puts the
>
> OK.
>
>> driver tag and the budget if we get != OK for ->queue_rq()?
>
> For the budget, the current protocol is that the driver is responsible
> for the release once .queue_rq is called; see scsi_mq_queue_rq() and
> the comment in blk_mq_do_dispatch_sched().
> 
> For the driver tag, it is released in blk_mq_free_request(), which is
> done in the failure handling of != OK.

Actually the driver tag should be released here when .queue_rq()
returns !OK, and this patch follows the current logic:

- call __blk_mq_requeue_request() to release the tag and make the
request ready for requeue if STS_RESOURCE is returned.

- For other !OK cases, let the driver free the request.
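
In other words, the caller ends up with something like this (a sketch
of the existing handling in __blk_mq_try_issue_directly(), not new
code):

	ret = q->mq_ops->queue_rq(hctx, &bd);
	switch (ret) {
	case BLK_STS_OK:
		*cookie = new_cookie;
		return;
	case BLK_STS_RESOURCE:
		/* releases the driver tag and readies the rq for requeue */
		__blk_mq_requeue_request(rq);
		goto insert;
	default:
		*cookie = BLK_QC_T_NONE;
		/* frees the rq; driver tag dropped in blk_mq_free_request() */
		blk_mq_end_request(rq, ret);
		return;
	}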

-- 
Ming Lei


Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready

2018-01-15 Thread Ming Lei
On Tue, Jan 16, 2018 at 7:13 AM, Mike Snitzer  wrote:
> On Mon, Jan 15 2018 at  6:10P -0500,
> Mike Snitzer  wrote:
>
>> On Mon, Jan 15 2018 at  5:51pm -0500,
>> Bart Van Assche  wrote:
>>
>> > On Mon, 2018-01-15 at 17:15 -0500, Mike Snitzer wrote:
>> > > sysfs write op calls kernfs_fop_write which takes:
>> > > of->mutex then kn->count#213 (no idea what that is)
>> > > then q->sysfs_lock (via queue_attr_store)
>> > >
>> > > vs
>> > >
>> > > blk_unregister_queue takes:
>> > > q->sysfs_lock then
>> > > kernfs_mutex (via kernfs_remove)
>> > > seems lockdep thinks "kernfs_mutex" is "kn->count#213"?
>> > >
>> > > Feels like lockdep code in fs/kernfs/file.c and fs/kernfs/dir.c is
>> > > triggering false positives.. because these seem like different kernfs
>> > > locks yet they are reported as "kn->count#213".
>> > >
>> > > Certainly feeling out of my depth with kernfs's locking though.
>> >
>> > Hello Mike,
>> >
>> > I don't think that this is a false positive but rather the following 
>> > traditional
>> > pattern of a potential deadlock involving sysfs attributes:
>> > * One context obtains a mutex from inside a sysfs attribute method:
>> >   queue_attr_store() obtains q->sysfs_lock.
>> > * Another context removes a sysfs attribute while holding a mutex:
>> >   blk_unregister_queue() removes the queue sysfs attributes while holding
>> >   q->sysfs_lock.
>> >
>> > This can result in a real deadlock because the code that removes sysfs 
>> > objects
>> > waits until all ongoing attribute callbacks have finished.
>> >
>> > Since commit 667257e8b298 ("block: properly protect the 'queue' kobj in
>> > blk_unregister_queue") modified blk_unregister_queue() such that 
>> > q->sysfs_lock
> > is held around the kobject_del(&q->kobj) call I think this is a regression
>> > introduced by that commit.
>>
>> Sure, of course it is a regression.
>>
>> Aside from moving the mutex_unlock(&q->sysfs_lock) above the
>> kobject_del(&q->kobj) I don't know how to fix it.
>>
>> Though, realistically that'd be an adequate fix (given the way the code
>> was before).
>
> Any chance you apply this and re-run your srp_test that triggered the
> lockdep splat?
>
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 4a6a40ffd78e..c50e08e9bf17 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -952,10 +952,10 @@ void blk_unregister_queue(struct gendisk *disk)
> if (q->request_fn || (q->mq_ops && q->elevator))
> elv_unregister_queue(q);
>
> +   mutex_unlock(&q->sysfs_lock);
> +
> kobject_uevent(&q->kobj, KOBJ_REMOVE);
> kobject_del(&q->kobj);
> blk_trace_remove_sysfs(disk_to_dev(disk));
> kobject_put(&disk_to_dev(disk)->kobj);
> -
> -   mutex_unlock(&q->sysfs_lock);
>  }

If this patch is required, a similar change is needed in the failure
path of blk_register_queue() too.
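
Something along these lines (an untested sketch, mirroring the
reordering above) for the elv_register_queue() failure path:

	ret = elv_register_queue(q);
	if (ret) {
		mutex_unlock(&q->sysfs_lock);
		kobject_uevent(&q->kobj, KOBJ_REMOVE);
		kobject_del(&q->kobj);
		blk_trace_remove_sysfs(dev);
		kobject_put(&dev->kobj);
		return ret;
	}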

-- 
Ming Lei


Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assigned to irq vector

2018-01-15 Thread Ming Lei
On Mon, Jan 15, 2018 at 09:40:36AM -0800, Christoph Hellwig wrote:
> On Tue, Jan 16, 2018 at 12:03:43AM +0800, Ming Lei wrote:
> > Hi,
> > 
> > These two patches fixes IO hang issue reported by Laurence.
> > 
> > 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> > may cause one irq vector assigned to all offline CPUs, then this vector
> > can't handle irq any more.
> 
> Well, that very much was the intention of managed interrupts.  Why
> does the device raise an interrupt for a queue that has no online
> cpu assigned to it?

If pci_alloc_irq_vectors() returns OK, the driver may think everything
is fine and configure the related hw queues (such as enabling irqs on
the queues), and finally an irq arrives and no CPU can handle it.

Also, I don't think there are drivers that check whether the CPUs
assigned to irq vectors are online, and that never seems to be a job a
driver is supposed to do.

-- 
Ming


Re: [PATCH V4 0/3] blk-mq: issue request directly for blk_insert_cloned_request

2018-01-15 Thread Ming Lei
On Mon, Jan 15, 2018 at 12:43:44PM -0500, Mike Snitzer wrote:
> On Mon, Jan 15 2018 at 11:58am -0500,
> Ming Lei  wrote:
> 
> > Hi Guys,
> > 
> > The 3 patches change the blk-mq part of blk_insert_cloned_request(),
> > in which we switch to blk_mq_try_issue_directly(), so that both dm-rq
> > and blk-mq can get the dispatch result of the underlying queue; with
> > this information, blk-mq can handle IO merging much better, and
> > sequential I/O performance improves a lot.
> > 
> > In my dm-mpath over virtio-scsi test, this whole patchset improves
> > sequential IO by 3X ~ 5X.
> > 
> > V4:
> > - remove dm patches which are in DM tree already
> > - cleanup __blk_mq_issue_req as suggested by Jens
> > 
> 
> Ming,
> 
> You dropped the header cleanups that I did in v3 ("blk-mq: issue request
> directly for blk_insert_cloned_request" being the one header I care
> about being updated).
> 
> I also worked in parallel on my own v4 to address Jens' dislike for v3's
> 3 returns.  But I skinned the cat a different way, by dropping your
> first patch that introduces the __blk_mq_issue_req helper, please see
> these 2 commits:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=block-4.16=40f8947784128bb83dc5f7a6aed7ed230222f675
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=block-4.16=f641015f42f41df75220313ee62e8241f2fd
> 
> I think it makes the changes more obvious (not spread across 2 methods
> and doesn't require use of BLK_STS_AGAIN).
> 
> Happy to yield to Jens to decide which he prefers.
> 
> Jens, if you'd like to pick my variant of v4 up they are here, thanks!

Hi Mike,

Looks like we were working on it at the same time; I am fine with your V4.

Jens, please let us know if Mike's V4 is OK; if not, we can make a V5, ... :-)

Thanks,
Ming


Re: [PATCH V4 1/3] blk-mq: move actual issue into one helper

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at  8:43pm -0500,
Ming Lei  wrote:

> On Mon, Jan 15, 2018 at 02:41:12PM -0500, Mike Snitzer wrote:
> > On Mon, Jan 15 2018 at 12:29pm -0500,
> > Jens Axboe  wrote:
> > 
> > > On 1/15/18 9:58 AM, Ming Lei wrote:
> > > > No functional change; just clean up the code a bit, so that the
> > > > following change of using direct issue for
> > > > blk_mq_request_bypass_insert(), which is needed by DM, is easier to do.
> > > > 
> > > > Signed-off-by: Ming Lei 
> > > > ---
> > > >  block/blk-mq.c | 39 +++
> > > >  1 file changed, 27 insertions(+), 12 deletions(-)
> > > > 
> > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > index edb1291a42c5..bf8d6651f40e 100644
> > > > --- a/block/blk-mq.c
> > > > +++ b/block/blk-mq.c
> > > > @@ -1696,15 +1696,37 @@ static blk_qc_t request_to_qc_t(struct 
> > > > blk_mq_hw_ctx *hctx, struct request *rq)
> > > > return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
> > > >  }
> > > >  
> > > > -static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> > > > -   struct request *rq,
> > > > -   blk_qc_t *cookie)
> > > > +static blk_status_t __blk_mq_issue_req(struct blk_mq_hw_ctx *hctx,
> > > > +  struct request *rq,
> > > > +  blk_qc_t *new_cookie)
> > > >  {
> > > > +   blk_status_t ret;
> > > > struct request_queue *q = rq->q;
> > > > struct blk_mq_queue_data bd = {
> > > > .rq = rq,
> > > > .last = true,
> > > > };
> > > > +
> > > > +   if (!blk_mq_get_driver_tag(rq, NULL, false))
> > > > +   return BLK_STS_AGAIN;
> > > > +
> > > > +   if (!blk_mq_get_dispatch_budget(hctx)) {
> > > > +   blk_mq_put_driver_tag(rq);
> > > > +   return BLK_STS_AGAIN;
> > > > +   }
> > > > +
> > > > +   *new_cookie = request_to_qc_t(hctx, rq);
> > > > +
> > > > +   ret = q->mq_ops->queue_rq(hctx, &bd);
> > > > +
> > > > +   return ret;
> > > 
> > >   return q->mq_ops->queue_rq(hctx, &bd);
> > > 
> > > and kill 'ret', it's not used.
> > 
> > Yeap, good point.
> > 
> > > But more importantly, who puts the
> > > driver tag and the budget if we get != OK for ->queue_rq()?
> > 
> > __blk_mq_try_issue_directly() processes the returned value the same as
> > before this patch, meaning this patch makes no functional change:
> > If BLK_STS_RESOURCE: __blk_mq_requeue_request() is called.
> > __blk_mq_requeue_request() will blk_mq_put_driver_tag().
> > Otherwise, all other errors result in blk_mq_end_request(rq, ret);
> > 
> > So ignoring this patch, are you concerned that:
> > 1) Does blk_mq_end_request() put both?  Looks like blk_mq_free_request()
> > handles rq->tag != -1 but why not have it use __blk_mq_put_driver_tag()?
> > I'm not seeing where the budget is put from blk_mq_end_request()...
> 
> blk_mq_free_request() releases the driver tag, and the budget is owned
> by the driver once .queue_rq is called.
> 
> > 
> > 2) Nothing seems to be putting the budget in
> > __blk_mq_try_issue_directly()'s BLK_STS_RESOURCE error path?  I share
> > your concern now (for drivers who set {get,put}_budget in mq_ops).
> > Should __blk_mq_requeue_request() be updated to also
> > blk_mq_put_dispatch_budget()?
> 
> No, at least that is the current protocol for using the budget; please
> see scsi_mq_queue_rq() and the comment in blk_mq_do_dispatch_sched().

Yeap, thanks for clarifying.


Re: [PATCH V4 1/3] blk-mq: move actual issue into one helper

2018-01-15 Thread Ming Lei
On Mon, Jan 15, 2018 at 10:29:46AM -0700, Jens Axboe wrote:
> On 1/15/18 9:58 AM, Ming Lei wrote:
> > No functional change; just clean up the code a bit, so that the
> > following change of using direct issue for
> > blk_mq_request_bypass_insert(), which is needed by DM, is easier to do.
> > 
> > Signed-off-by: Ming Lei 
> > ---
> >  block/blk-mq.c | 39 +++
> >  1 file changed, 27 insertions(+), 12 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index edb1291a42c5..bf8d6651f40e 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1696,15 +1696,37 @@ static blk_qc_t request_to_qc_t(struct 
> > blk_mq_hw_ctx *hctx, struct request *rq)
> > return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
> >  }
> >  
> > -static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> > -   struct request *rq,
> > -   blk_qc_t *cookie)
> > +static blk_status_t __blk_mq_issue_req(struct blk_mq_hw_ctx *hctx,
> > +  struct request *rq,
> > +  blk_qc_t *new_cookie)
> >  {
> > +   blk_status_t ret;
> > struct request_queue *q = rq->q;
> > struct blk_mq_queue_data bd = {
> > .rq = rq,
> > .last = true,
> > };
> > +
> > +   if (!blk_mq_get_driver_tag(rq, NULL, false))
> > +   return BLK_STS_AGAIN;
> > +
> > +   if (!blk_mq_get_dispatch_budget(hctx)) {
> > +   blk_mq_put_driver_tag(rq);
> > +   return BLK_STS_AGAIN;
> > +   }
> > +
> > +   *new_cookie = request_to_qc_t(hctx, rq);
> > +
> > +   ret = q->mq_ops->queue_rq(hctx, &bd);
> > +
> > +   return ret;
> 
>   return q->mq_ops->queue_rq(hctx, &bd);
> 
> and kill 'ret', it's not used. But more importantly, who puts the

OK.

> driver tag and the budget if we get != OK for ->queue_rq()?

For the budget, the current protocol is that the driver is responsible
for the release once .queue_rq is called; see scsi_mq_queue_rq() and
the comment in blk_mq_do_dispatch_sched().

For the driver tag, it is released in blk_mq_free_request(), which is
done in the failure handling of != OK.

-- 
Ming


Re: [PATCH V4 1/3] blk-mq: move actual issue into one helper

2018-01-15 Thread Ming Lei
On Mon, Jan 15, 2018 at 12:15:47PM -0500, Mike Snitzer wrote:
> On Mon, Jan 15 2018 at 11:58am -0500,
> Ming Lei  wrote:
> 
> > No functional change; just clean up the code a bit, so that the
> > following change of using direct issue for
> > blk_mq_request_bypass_insert(), which is needed by DM, is easier to do.
> > 
> > Signed-off-by: Ming Lei 
> > ---
> >  block/blk-mq.c | 39 +++
> >  1 file changed, 27 insertions(+), 12 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index edb1291a42c5..bf8d6651f40e 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1696,15 +1696,37 @@ static blk_qc_t request_to_qc_t(struct 
> > blk_mq_hw_ctx *hctx, struct request *rq)
> > return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
> >  }
> >  
> > -static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> > -   struct request *rq,
> > -   blk_qc_t *cookie)
> > +static blk_status_t __blk_mq_issue_req(struct blk_mq_hw_ctx *hctx,
> > +  struct request *rq,
> > +  blk_qc_t *new_cookie)
> >  {
> > +   blk_status_t ret;
> > struct request_queue *q = rq->q;
> > struct blk_mq_queue_data bd = {
> > .rq = rq,
> > .last = true,
> > };
> > +
> > +   if (!blk_mq_get_driver_tag(rq, NULL, false))
> > +   return BLK_STS_AGAIN;
> > +
> > +   if (!blk_mq_get_dispatch_budget(hctx)) {
> > +   blk_mq_put_driver_tag(rq);
> > +   return BLK_STS_AGAIN;
> > +   }
> > +
> > +   *new_cookie = request_to_qc_t(hctx, rq);
> > +
> > +   ret = q->mq_ops->queue_rq(hctx, &bd);
> > +
> > +   return ret;
> > +}
> > +
> > +static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> > +   struct request *rq,
> > +   blk_qc_t *cookie)
> > +{
> > +   struct request_queue *q = rq->q;
> > blk_qc_t new_cookie;
> > blk_status_t ret;
> > bool run_queue = true;
> > @@ -1718,22 +1740,15 @@ static void __blk_mq_try_issue_directly(struct 
> > blk_mq_hw_ctx *hctx,
> > if (q->elevator)
> > goto insert;
> >  
> > -   if (!blk_mq_get_driver_tag(rq, NULL, false))
> > +   ret = __blk_mq_issue_req(hctx, rq, &new_cookie);
> > +   if (ret == BLK_STS_AGAIN)
> > goto insert;
> 
> You're (ab)using BLK_STS_AGAIN as a means to an end...
> But what if in the future q->mq_ops->queue_rq() returns BLK_STS_AGAIN?

Yeah, but for now only dio uses that, so I'll let Jens decide.

-- 
Ming


Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assigned to irq vector

2018-01-15 Thread Ming Lei
On Mon, Jan 15, 2018 at 06:43:47PM +0100, Thomas Gleixner wrote:
> On Tue, 16 Jan 2018, Ming Lei wrote:
> > These two patches fixes IO hang issue reported by Laurence.
> > 
> > 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> > may cause one irq vector assigned to all offline CPUs, then this vector
> > can't handle irq any more.
> > 
> > The 1st patch moves irq vectors spread into one function, and prepares
> > for the fix done in 2nd patch.
> > 
> > The 2nd patch fixes the issue by trying to make sure online CPUs assigned
> > to irq vector.
> 
> Which means it's completely undoing the intent and mechanism of managed
> interrupts. Not going to happen.

As I replied in a previous mail, after we assign vectors to all
possible CPUs, some of which are not present, some irq vectors may end
up with only offline CPUs assigned to them.

> 
> Which driver is that which abuses managed interrupts and does not keep its
> queues properly sorted on cpu hotplug?

It isn't related to a particular driver/device; besides HPSA, I can
trigger this issue easily on NVMe.


Thanks,
Ming


Re: [PATCH V4 3/3] blk-mq: issue request directly for blk_insert_cloned_request

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at 11:58am -0500,
Ming Lei  wrote:

> blk_insert_cloned_request() is called in the fast path of the dm-rq
> driver, and in this function we append the request directly to
> hctx->dispatch_list of the underlying queue.
> 
> 1) This isn't efficient enough, because the hctx lock is always required.
> 
> 2) With blk_insert_cloned_request(), we totally bypass the underlying
> queue's IO scheduler and depend on the DM rq driver to do all IO
> scheduling. But the DM rq driver can't get the underlying queue's
> dispatch feedback at all, and this information is extremely useful for
> IO merging. Without it, blk-mq basically can't merge IO, which causes
> very bad sequential IO performance.
> 
> This patch makes use of blk_mq_try_issue_directly() to dispatch the rq
> to the underlying queue and provides the dispatch result to dm-rq and
> blk-mq, which improves the above situations very much.
> 
> With this patch, sequential IO is improved by 3X in my test over
> mpath/virtio-scsi.
> 
> Signed-off-by: Ming Lei 
> ---
>  block/blk-mq.c | 29 ++---
>  1 file changed, 26 insertions(+), 3 deletions(-)

Hey Ming,

I also just noticed your V4 of this patch only includes the
block/blk-mq.c changes.

You're missing:

 block/blk-core.c   |  3 +--
 block/blk-mq.h |  3 +++
 drivers/md/dm-rq.c | 19 +---
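
For reference, the blk-core.c side is essentially this (a rough sketch,
not the exact diff):

	/* in blk_insert_cloned_request(), sketch only */
	if (q->mq_ops) {
		/*
		 * Issue the clone directly to the underlying queue and
		 * hand the dispatch result back to the dm-rq caller.
		 */
		return blk_mq_request_direct_issue(rq);
	}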

Please let Jens know if you're OK with my V4, tagged here:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/tag/?h=for-block-4.16/dm-changes-2
And also detailed in this message from earlier in this thread:
https://marc.info/?l=linux-block=151603824725438=2

Or please generate V5 of your series.  Hopefully it'd include the header
I revised for this 3/3 patch, see:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-block-4.16/dm-changes-2=d86beab5712a8f18123011487dee797a1e3a07e1

We also need to address the issues Jens noticed and I looked at a bit
closer: https://marc.info/?l=linux-block=151604528127707=2
(those issues might need fixing first, marked for stable@, and the
series rebased on top of them?)

Thanks!
Mike


Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assigned to irq vector

2018-01-15 Thread Ming Lei
On Mon, Jan 15, 2018 at 09:40:36AM -0800, Christoph Hellwig wrote:
> On Tue, Jan 16, 2018 at 12:03:43AM +0800, Ming Lei wrote:
> > Hi,
> > 
> > These two patches fixes IO hang issue reported by Laurence.
> > 
> > 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> > may cause one irq vector assigned to all offline CPUs, then this vector
> > can't handle irq any more.
> 
> Well, that very much was the intention of managed interrupts.  Why
> does the device raise an interrupt for a queue that has no online
> cpu assigned to it?

It is because of irq_create_affinity_masks().

Once irq vectors are spread across all possible CPUs, some of which are
offline, a vector may end up with only offline CPUs assigned to it.

For example, with HPSA there are 8 irq vectors on this device, and the
system supports at most 32 CPUs, but only 16 (0-15) are present after
booting. We should be able to assign at least one online CPU to handle
each irq vector for HPSA, but:

1) before commit 84676c1f21:

irq 25, cpu list 0
irq 26, cpu list 2
irq 27, cpu list 4
irq 28, cpu list 6
irq 29, cpu list 8
irq 30, cpu list 10
irq 31, cpu list 12
irq 32, cpu list 14
irq 33, cpu list 1
irq 34, cpu list 3
irq 35, cpu list 5
irq 36, cpu list 7
irq 37, cpu list 9
irq 38, cpu list 11
irq 39, cpu list 13
irq 40, cpu list 15

2) after commit 84676c1f21:

irq 25, cpu list 0, 2
irq 26, cpu list 4, 6
irq 27, cpu list 8, 10
irq 28, cpu list 12, 14
irq 29, cpu list 16, 18
irq 30, cpu list 20, 22
irq 31, cpu list 24, 26
irq 32, cpu list 28, 30
irq 33, cpu list 1, 3
irq 34, cpu list 5, 7
irq 35, cpu list 9, 11
irq 36, cpu list 13, 15
irq 37, cpu list 17, 19
irq 38, cpu list 21, 23
irq 39, cpu list 25, 27
irq 40, cpu list 29, 31

And vectors 29-32 and 37-40 are assigned to offline CPUs only.
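
An irq vector in that state is easy to detect from its affinity mask,
for example (an illustrative check only, not part of these patches):

	#include <linux/cpumask.h>

	/* true if at least one CPU in the vector's mask is online */
	static bool vector_has_online_cpu(const struct cpumask *mask)
	{
		return cpumask_first_and(mask, cpu_online_mask) < nr_cpu_ids;
	}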

-- 
Ming


[LSF/MM TOPIC] blk-mq priority based hctx selection

2018-01-15 Thread Keith Busch
For the storage track, I would like to propose a topic on differentiated
blk-mq hardware contexts. Today, blk-mq considers all hardware contexts
equal, and they are selected based on the software's CPU context. There
are use cases that benefit from hardware context selection criteria
beyond which CPU a process happens to be running on.

One example is exclusive polling for latency-sensitive use cases.
Mixing polled and non-polled requests in the same context loses part of
the benefit when interrupts unnecessarily occur, and coalescing tricks
to mitigate this have undesirable side effects during times when
HIPRI commands are not issued.

Another example is hardware priority queues, where the command queues
the hardware provides may not all be equal to one another. Many newer storage
controllers provide such queues with different QoS guarantees, and Linux
currently does not make use of this feature.

This talk would discuss potential new blk-mq APIs needed for
a block driver to register its different priority queues, changes to
blk-mq hwctx selection, and implications for low-level drivers that
utilize IRQ affinity to set up the current mappings.
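
As a strawman for that discussion, the registration side might look
something like this (purely hypothetical, not an existing API):

	/* hypothetical API sketch, just to frame the discussion */
	enum blk_mq_hctx_class {
		BLK_MQ_HCTX_DEFAULT,
		BLK_MQ_HCTX_POLL,	/* exclusively polled queues */
		BLK_MQ_HCTX_HIPRI,	/* hardware high-priority queues */
	};

	struct blk_mq_class_map {
		enum blk_mq_hctx_class class;
		unsigned int nr_queues;	/* hw queues dedicated to this class */
	};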

Thanks,
Keith


Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at  6:10P -0500,
Mike Snitzer  wrote:

> On Mon, Jan 15 2018 at  5:51pm -0500,
> Bart Van Assche  wrote:
> 
> > On Mon, 2018-01-15 at 17:15 -0500, Mike Snitzer wrote:
> > > sysfs write op calls kernfs_fop_write which takes:
> > > of->mutex then kn->count#213 (no idea what that is)
> > > then q->sysfs_lock (via queue_attr_store)
> > > 
> > > vs 
> > > 
> > > blk_unregister_queue takes:
> > > q->sysfs_lock then
> > > kernfs_mutex (via kernfs_remove)
> > > seems lockdep thinks "kernfs_mutex" is "kn->count#213"?
> > > 
> > > Feels like lockdep code in fs/kernfs/file.c and fs/kernfs/dir.c is
> > > triggering false positives.. because these seem like different kernfs
> > > locks yet they are reported as "kn->count#213".
> > > 
> > > Certainly feeling out of my depth with kernfs's locking though.
> > 
> > Hello Mike,
> > 
> > I don't think that this is a false positive but rather the following 
> > traditional
> > pattern of a potential deadlock involving sysfs attributes:
> > * One context obtains a mutex from inside a sysfs attribute method:
> >   queue_attr_store() obtains q->sysfs_lock.
> > * Another context removes a sysfs attribute while holding a mutex:
> >   blk_unregister_queue() removes the queue sysfs attributes while holding
> >   q->sysfs_lock.
> > 
> > This can result in a real deadlock because the code that removes sysfs 
> > objects
> > waits until all ongoing attribute callbacks have finished.
> > 
> > Since commit 667257e8b298 ("block: properly protect the 'queue' kobj in
> > blk_unregister_queue") modified blk_unregister_queue() such that 
> > q->sysfs_lock
> > is held around the kobject_del(&q->kobj) call I think this is a regression
> > introduced by that commit.
> 
> Sure, of course it is a regression.
> 
> Aside from moving the mutex_unlock(&q->sysfs_lock) above the
> kobject_del(&q->kobj) I don't know how to fix it.
> 
> Though, realistically that'd be an adequate fix (given the way the code
> was before).

Any chance you apply this and re-run your srp_test that triggered the
lockdep splat?

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 4a6a40ffd78e..c50e08e9bf17 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -952,10 +952,10 @@ void blk_unregister_queue(struct gendisk *disk)
if (q->request_fn || (q->mq_ops && q->elevator))
elv_unregister_queue(q);
 
+   mutex_unlock(&q->sysfs_lock);
+
kobject_uevent(&q->kobj, KOBJ_REMOVE);
kobject_del(&q->kobj);
blk_trace_remove_sysfs(disk_to_dev(disk));
kobject_put(&disk_to_dev(disk)->kobj);
-
-   mutex_unlock(&q->sysfs_lock);
 }


Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at  5:51pm -0500,
Bart Van Assche  wrote:

> On Mon, 2018-01-15 at 17:15 -0500, Mike Snitzer wrote:
> > sysfs write op calls kernfs_fop_write which takes:
> > of->mutex then kn->count#213 (no idea what that is)
> > then q->sysfs_lock (via queue_attr_store)
> > 
> > vs 
> > 
> > blk_unregister_queue takes:
> > q->sysfs_lock then
> > kernfs_mutex (via kernfs_remove)
> > seems lockdep thinks "kernfs_mutex" is "kn->count#213"?
> > 
> > Feels like lockdep code in fs/kernfs/file.c and fs/kernfs/dir.c is
> > triggering false positives.. because these seem like different kernfs
> > locks yet they are reported as "kn->count#213".
> > 
> > Certainly feeling out of my depth with kernfs's locking though.
> 
> Hello Mike,
> 
> I don't think that this is a false positive but rather the following 
> traditional
> pattern of a potential deadlock involving sysfs attributes:
> * One context obtains a mutex from inside a sysfs attribute method:
>   queue_attr_store() obtains q->sysfs_lock.
> * Another context removes a sysfs attribute while holding a mutex:
>   blk_unregister_queue() removes the queue sysfs attributes while holding
>   q->sysfs_lock.
> 
> This can result in a real deadlock because the code that removes sysfs objects
> waits until all ongoing attribute callbacks have finished.
> 
> Since commit 667257e8b298 ("block: properly protect the 'queue' kobj in
> blk_unregister_queue") modified blk_unregister_queue() such that q->sysfs_lock
> is held around the kobject_del(&q->kobj) call I think this is a regression
> introduced by that commit.

Sure, of course it is a regression.

Aside from moving the mutex_unlock(&q->sysfs_lock) above the
kobject_del(&q->kobj) I don't know how to fix it.

Though, realistically that'd be an adequate fix (given the way the code
was before).

Mike


Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready

2018-01-15 Thread Bart Van Assche
On Mon, 2018-01-15 at 17:15 -0500, Mike Snitzer wrote:
> sysfs write op calls kernfs_fop_write which takes:
> of->mutex then kn->count#213 (no idea what that is)
> then q->sysfs_lock (via queue_attr_store)
> 
> vs 
> 
> blk_unregister_queue takes:
> q->sysfs_lock then
> kernfs_mutex (via kernfs_remove)
> seems lockdep thinks "kernfs_mutex" is "kn->count#213"?
> 
> Feels like lockdep code in fs/kernfs/file.c and fs/kernfs/dir.c is
> triggering false positives.. because these seem like different kernfs
> locks yet they are reported as "kn->count#213".
> 
> Certainly feeling out of my depth with kernfs's locking though.

Hello Mike,

I don't think that this is a false positive but rather the following traditional
pattern of a potential deadlock involving sysfs attributes:
* One context obtains a mutex from inside a sysfs attribute method:
  queue_attr_store() obtains q->sysfs_lock.
* Another context removes a sysfs attribute while holding a mutex:
  blk_unregister_queue() removes the queue sysfs attributes while holding
  q->sysfs_lock.

This can result in a real deadlock because the code that removes sysfs objects
waits until all ongoing attribute callbacks have finished.

Since commit 667257e8b298 ("block: properly protect the 'queue' kobj in
blk_unregister_queue") modified blk_unregister_queue() such that q->sysfs_lock
is held around the kobject_del(&q->kobj) call I think this is a regression
introduced by that commit.

Bart.

Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at 12:16pm -0500,
Bart Van Assche  wrote:

> On Fri, 2018-01-12 at 10:06 -0500, Mike Snitzer wrote:
> > I'm submitting this v5 with more feeling now ;)
> 
> Hello Mike,
> 
> Have these patches been tested with lockdep enabled? The following appeared in
> the kernel log after I started testing Jens' for-next tree of this 
> morning:
> 
> ==
> WARNING: possible circular locking dependency detected
> 4.15.0-rc7-dbg+ #1 Not tainted
> --
> 02-mq/1211 is trying to acquire lock:
>  (&q->sysfs_lock){+.+.}, at: [<8b65bdad>] queue_attr_store+0x35/0x80
> 
> but task is already holding lock:
>  (kn->count#213){}, at: [<7a18ad18>] kernfs_fop_write+0xe5/0x190
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (kn->count#213){}:
>kernfs_remove+0x1a/0x30
>kobject_del.part.3+0xe/0x40
>blk_unregister_queue+0xa7/0xe0
>del_gendisk+0x12f/0x260
>sd_remove+0x58/0xc0 [sd_mod]
>device_release_driver_internal+0x15a/0x220
>bus_remove_device+0xf4/0x170
>device_del+0x12f/0x330
>__scsi_remove_device+0xef/0x120 [scsi_mod]
>scsi_forget_host+0x1b/0x60 [scsi_mod]
>scsi_remove_host+0x6f/0x110 [scsi_mod]
>0xc09ed6e4
>process_one_work+0x21c/0x6d0
>worker_thread+0x35/0x380
>kthread+0x117/0x130
>ret_from_fork+0x24/0x30
> 
> -> #0 (&q->sysfs_lock){+.+.}:
>__mutex_lock+0x6c/0x9e0
>queue_attr_store+0x35/0x80
>kernfs_fop_write+0x109/0x190
>__vfs_write+0x1e/0x130
>vfs_write+0xb9/0x1b0
>SyS_write+0x40/0xa0
>entry_SYSCALL_64_fastpath+0x23/0x9a
> 
> other info that might help us debug this:
> 
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(kn->count#213);
>                                lock(&q->sysfs_lock);
>                                lock(kn->count#213);
>   lock(&q->sysfs_lock);
> 
>  *** DEADLOCK ***
> 
> 3 locks held by 02-mq/1211:
>  #0:  (sb_writers#6){.+.+}, at: [] vfs_write+0x17f/0x1b0
>  #1:  (&of->mutex){+.+.}, at: [] kernfs_fop_write+0xdd/0x190
>  #2:  (kn->count#213){}, at: [<7a18ad18>] 
> kernfs_fop_write+0xe5/0x190

sysfs write op calls kernfs_fop_write which takes:
of->mutex then kn->count#213 (no idea what that is)
then q->sysfs_lock (via queue_attr_store)

vs 

blk_unregister_queue takes:
q->sysfs_lock then
kernfs_mutex (via kernfs_remove)
seems lockdep thinks "kernfs_mutex" is "kn->count#213"?

Feels like lockdep code in fs/kernfs/file.c and fs/kernfs/dir.c is
triggering false positives.. because these seem like different kernfs
locks yet they are reported as "kn->count#213".

Certainly feeling out of my depth with kernfs's locking though.

Mike


Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at 12:51pm -0500,
Bart Van Assche  wrote:

> On Mon, 2018-01-15 at 12:48 -0500, Mike Snitzer wrote:
> > Do I need to do something more to enable lockdep aside from set
> > CONFIG_LOCKDEP_SUPPORT=y ?
> 
> Hello Mike,
> 
> I think you also need to set CONFIG_PROVE_LOCKING=y.

Ah ok, was missing that, thanks.

Mike


Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assigned to irq vector

2018-01-15 Thread Laurence Oberman
On Mon, 2018-01-15 at 18:43 +0100, Thomas Gleixner wrote:
> On Tue, 16 Jan 2018, Ming Lei wrote:
> > These two patches fixes IO hang issue reported by Laurence.
> > 
> > 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> > may cause one irq vector assigned to all offline CPUs, then this
> > vector
> > can't handle irq any more.
> > 
> > The 1st patch moves irq vectors spread into one function, and
> > prepares
> > for the fix done in 2nd patch.
> > 
> > The 2nd patch fixes the issue by trying to make sure online CPUs
> > assigned
> > to irq vector.
> 
> Which means it's completely undoing the intent and mechanism of
> managed
> interrupts. Not going to happen.
> 
> Which driver is that which abuses managed interrupts and does not
> keep its
> queues properly sorted on cpu hotplug?
> 
> Thanks,
> 
>   tglx

Hello Thomas

The servers I am using all boot off hpsa (SmartArray).
The system would hang on boot with the stack below.

So this is seen when booting off the hpsa driver, and not seen by Mike
when booting off a server not using hpsa.

It is also not seen after reverting the patch I called out.

Putting that patch back into the combined Mike/Jens tree and adding
Ming's patches seems to fix this issue now. I can boot.

I just did a quick sanity boot and check, not any in-depth testing
right now.

The code Ming changed to make this work is not code I am at all
familiar with, so I defer to Ming to explain it in depth.

[  246.751050] INFO: task systemd-udevd:411 blocked for more than 120
seconds.
[  246.791852]   Tainted: G  I  4.15.0-
rc4.block.dm.4.16+ #1
[  246.830650] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables
this message.
[  246.874637] systemd-udevd   D0   411408 0x8004
[  246.904934] Call Trace:
[  246.918191]  ? __schedule+0x28d/0x870
[  246.937643]  ? _cond_resched+0x15/0x30
[  246.958222]  schedule+0x32/0x80
[  246.975424]  async_synchronize_cookie_domain+0x8b/0x140
[  247.004452]  ? remove_wait_queue+0x60/0x60
[  247.027335]  do_init_module+0xbe/0x219
[  247.048022]  load_module+0x21d6/0x2910
[  247.069436]  ? m_show+0x1c0/0x1c0
[  247.087999]  SYSC_finit_module+0x94/0xe0
[  247.110392]  entry_SYSCALL_64_fastpath+0x1a/0x7d
[  247.136669] RIP: 0033:0x7f84049287f9
[  247.156112] RSP: 002b:7ffd13199ab8 EFLAGS: 0246 ORIG_RAX:
0139
[  247.196883] RAX: ffda RBX: 55b712b59e80 RCX:
7f84049287f9
[  247.237989] RDX:  RSI: 7f8405245099 RDI:
0008
[  247.279105] RBP: 7f8404bf2760 R08:  R09:
55b712b45760
[  247.320005] R10: 0008 R11: 0246 R12:
0020
[  247.360625] R13: 7f8404bf2818 R14: 0050 R15:
7f8404bf27b8
[  247.401062] INFO: task scsi_eh_0:471 blocked for more than 120
seconds.
[  247.438161]   Tainted: G  I  4.15.0-
rc4.block.dm.4.16+ #1
[  247.476640] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables
this message.
[  247.520700] scsi_eh_0   D0   471  2 0x8000
[  247.551339] Call Trace:
[  247.564360]  ? __schedule+0x28d/0x870
[  247.584720]  schedule+0x32/0x80
[  247.601294]  hpsa_eh_device_reset_handler+0x68c/0x700 [hpsa]
[  247.633358]  ? remove_wait_queue+0x60/0x60
[  247.656345]  scsi_try_bus_device_reset+0x27/0x40
[  247.682424]  scsi_eh_ready_devs+0x53f/0xe20
[  247.706467]  ? __pm_runtime_resume+0x55/0x70
[  247.730327]  scsi_error_handler+0x434/0x5e0
[  247.754387]  ? __schedule+0x295/0x870
[  247.775420]  kthread+0xf5/0x130
[  247.793461]  ? scsi_eh_get_sense+0x240/0x240
[  247.818008]  ? kthread_associate_blkcg+0x90/0x90
[  247.844759]  ret_from_fork+0x1f/0x30
[  247.865440] INFO: task scsi_id:488 blocked for more than 120
seconds.
[  247.901112]   Tainted: G  I  4.15.0-
rc4.block.dm.4.16+ #1
[  247.938743] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables
this message.
[  247.981092] scsi_id D0   488  1 0x0004
[  248.010535] Call Trace:
[  248.023567]  ? __schedule+0x28d/0x870
[  248.044236]  ? __switch_to+0x1f5/0x460
[  248.065776]  schedule+0x32/0x80
[  248.084238]  schedule_timeout+0x1d4/0x2f0
[  248.106184]  wait_for_completion+0x123/0x190
[  248.130759]  ? wake_up_q+0x70/0x70
[  248.150295]  flush_work+0x119/0x1a0
[  248.169238]  ? wake_up_worker+0x30/0x30
[  248.189670]  __cancel_work_timer+0x103/0x190
[  248.213751]  ? kobj_lookup+0x10b/0x160
[  248.235441]  disk_block_events+0x6f/0x90
[  248.257820]  __blkdev_get+0x6a/0x480
[  248.278770]  ? bd_acquire+0xd0/0xd0
[  248.298438]  blkdev_get+0x1a5/0x300
[  248.316587]  ? bd_acquire+0xd0/0xd0
[  248.334814]  do_dentry_open+0x202/0x320
[  248.354372]  ? security_inode_permission+0x3c/0x50
[  248.378818]  path_openat+0x537/0x12c0
[  248.397386]  ? vm_insert_page+0x1e0/0x1f0
[  248.417664]  ? vvar_fault+0x75/0x140
[  248.435811]  do_filp_open+0x91/0x100
[  248.454061]  do_sys_open+0x126/0x210
[  248.472462]  entry_SYSCALL_64_fastpath+0x1a/0x7d
[  

Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready

2018-01-15 Thread Bart Van Assche
On Mon, 2018-01-15 at 12:48 -0500, Mike Snitzer wrote:
> Do I need to do something more to enable lockdep aside from set
> CONFIG_LOCKDEP_SUPPORT=y ?

Hello Mike,

I think you also need to set CONFIG_PROVE_LOCKING=y.
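
For reference, enabling CONFIG_PROVE_LOCKING should pull in the rest of
what lockdep needs (on a 4.15-era kernel, if memory serves, it selects
these automatically):

CONFIG_PROVE_LOCKING=y
# selected by PROVE_LOCKING:
CONFIG_LOCKDEP=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y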

Bart.

Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at 12:36pm -0500,
Bart Van Assche  wrote:

> On Mon, 2018-01-15 at 12:29 -0500, Mike Snitzer wrote:
> > So you replied to v5, but I emailed a v6 out for the relevant patch.
> > Just want to make sure you're testing with either Jens' latest tree or
> > my v6, which fixed blk_mq_unregister_dev() to require the caller to
> > hold q->sysfs_lock?
> 
> Hello Mike,
> 
> In my test I was using Jens' latest for-next tree (commit 563877ae7dae).

OK, that includes his latest for-4.16/block (commit c100ec49fdd8) so
I'll take a closer look at this lockdep splat.

Do I need to do something more to enable lockdep aside from set
CONFIG_LOCKDEP_SUPPORT=y ?

Thanks,
Mike


Re: [PATCH V4 0/3] blk-mq: issue request directly for blk_insert_cloned_request

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at 11:58am -0500,
Ming Lei  wrote:

> Hi Guys,
> 
> The 3 patches change the blk-mq part of blk_insert_cloned_request(),
> in which we switch to blk_mq_try_issue_directly(), so that both dm-rq
> and blk-mq can get the dispatch result of the underlying queue; with
> this information, blk-mq can handle IO merging much better, and
> sequential I/O performance improves a lot.
> 
> In my dm-mpath over virtio-scsi test, this whole patchset improves
> sequential IO by 3X ~ 5X.
> 
> V4:
>   - remove dm patches which are in DM tree already
>   - cleanup __blk_mq_issue_req as suggested by Jens
> 

Ming,

You dropped the header cleanups that I did in v3 ("blk-mq: issue request
directly for blk_insert_cloned_request" being the one header I care
about being updated).

I also worked in parallel on my own v4 to address Jens' dislike for v3's
3 returns.  But I skinned the cat a different way, by dropping your
first patch that introduces the __blk_mq_issue_req helper, please see
these 2 commits:

https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=block-4.16=40f8947784128bb83dc5f7a6aed7ed230222f675
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=block-4.16=f641015f42f41df75220313ee62e8241f2fd

I think it makes the changes more obvious (not spread across 2 methods
and doesn't require use of BLK_STS_AGAIN).

Happy to yield to Jens to decide which he prefers.

Jens, if you'd like to pick my variant of v4 up they are here, thanks!

The following changes since commit c100ec49fdd836ff8a17c7bfcc7611d2ee2b:

  dm: fix incomplete request_queue initialization (2018-01-15 08:54:32 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git 
tags/for-block-4.16/dm-changes-2

for you to fetch changes up to d86beab5712a8f18123011487dee797a1e3a07e1:

  blk-mq: issue request directly for blk_insert_cloned_request (2018-01-15 
12:40:44 -0500)


- Ming's blk-mq improvements to blk_insert_cloned_request(), which is
  used exclusively by request-based DM's blk-mq mode, that enable
  substantial dm-mpath sequential IO performance improvements.


Ming Lei (2):
  blk-mq: return dispatch result from blk_mq_try_issue_directly
  blk-mq: issue request directly for blk_insert_cloned_request

 block/blk-core.c   |  3 +--
 block/blk-mq.c | 65 +-
 block/blk-mq.h |  3 +++
 drivers/md/dm-rq.c | 19 +---
 4 files changed, 70 insertions(+), 20 deletions(-)


Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assigned to irq vector

2018-01-15 Thread Thomas Gleixner
On Tue, 16 Jan 2018, Ming Lei wrote:
> These two patches fixes IO hang issue reported by Laurence.
> 
> 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> may cause one irq vector assigned to all offline CPUs, then this vector
> can't handle irq any more.
> 
> The 1st patch moves irq vectors spread into one function, and prepares
> for the fix done in 2nd patch.
> 
> The 2nd patch fixes the issue by trying to make sure online CPUs assigned
> to irq vector.

Which means it's completely undoing the intent and mechanism of managed
interrupts. Not going to happen.

Which driver is that which abuses managed interrupts and does not keep its
queues properly sorted on cpu hotplug?

Thanks,

tglx


Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assigned to irq vector

2018-01-15 Thread Christoph Hellwig
On Tue, Jan 16, 2018 at 12:03:43AM +0800, Ming Lei wrote:
> Hi,
> 
> These two patches fixes IO hang issue reported by Laurence.
> 
> 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> may cause one irq vector assigned to all offline CPUs, then this vector
> can't handle irq any more.

Well, that very much was the intention of managed interrupts.  Why
does the device raise an interrupt for a queue that has no online
cpu assigned to it?


Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready

2018-01-15 Thread Bart Van Assche
On Mon, 2018-01-15 at 12:29 -0500, Mike Snitzer wrote:
> So you replied to v5, but I emailed a v6 out for the relevant patch.
> Just want to make sure you're testing with either Jens' latest tree or
> my v6, which fixed blk_mq_unregister_dev() to require the caller to
> hold q->sysfs_lock?

Hello Mike,

In my test I was using Jens' latest for-next tree (commit 563877ae7dae).

Bart.

Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at 12:16pm -0500,
Bart Van Assche  wrote:

> On Fri, 2018-01-12 at 10:06 -0500, Mike Snitzer wrote:
> > I'm submitting this v5 with more feeling now ;)
> 
> Hello Mike,
> 
> Have these patches been tested with lockdep enabled? The following appeared in
> the kernel log after I started testing Jens' for-next tree of this 
> morning:

So you replied to v5, but I emailed a v6 out for the relevant patch.
Just want to make sure you're testing with either Jens' latest tree or
my v6, which fixed blk_mq_unregister_dev() to require the caller to
hold q->sysfs_lock?

With the latest code that Jens merged, I do have lockdep enabled in my
kernel... but haven't seen any issues on the 2 testbeds I've been
hammering the changes on.

Ming was concerned about the potential for a lockdep splat (said he
tried expanding the scope of q->sysfs_lock in blk_unregister_queue
before and in doing so got lockdep to trigger).

Once you respond with what it is you're testing, and if it is the
latest code, I'll look closer at what you've provided and see whether
it is real or lockdep just needs some training.

Thanks,
Mike


Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready

2018-01-15 Thread Bart Van Assche
On Fri, 2018-01-12 at 10:06 -0500, Mike Snitzer wrote:
> I'm submitting this v5 with more feeling now ;)

Hello Mike,

Have these patches been tested with lockdep enabled? The following appeared in
the kernel log after I started testing Jens' for-next tree of this morning:

==
WARNING: possible circular locking dependency detected
4.15.0-rc7-dbg+ #1 Not tainted
--
02-mq/1211 is trying to acquire lock:
 (&q->sysfs_lock){+.+.}, at: [<8b65bdad>] queue_attr_store+0x35/0x80

but task is already holding lock:
 (kn->count#213){}, at: [<7a18ad18>] kernfs_fop_write+0xe5/0x190

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #1 (kn->count#213){}:
   kernfs_remove+0x1a/0x30
   kobject_del.part.3+0xe/0x40
   blk_unregister_queue+0xa7/0xe0
   del_gendisk+0x12f/0x260
   sd_remove+0x58/0xc0 [sd_mod]
   device_release_driver_internal+0x15a/0x220
   bus_remove_device+0xf4/0x170
   device_del+0x12f/0x330
   __scsi_remove_device+0xef/0x120 [scsi_mod]
   scsi_forget_host+0x1b/0x60 [scsi_mod]
   scsi_remove_host+0x6f/0x110 [scsi_mod]
   0xc09ed6e4
   process_one_work+0x21c/0x6d0
   worker_thread+0x35/0x380
   kthread+0x117/0x130
   ret_from_fork+0x24/0x30

-> #0 (&q->sysfs_lock){+.+.}:
   __mutex_lock+0x6c/0x9e0
   queue_attr_store+0x35/0x80
   kernfs_fop_write+0x109/0x190
   __vfs_write+0x1e/0x130
   vfs_write+0xb9/0x1b0
   SyS_write+0x40/0xa0
   entry_SYSCALL_64_fastpath+0x23/0x9a

other info that might help us debug this:

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(kn->count#213);
                               lock(&q->sysfs_lock);
                               lock(kn->count#213);
  lock(&q->sysfs_lock);

 *** DEADLOCK ***

3 locks held by 02-mq/1211:
 #0:  (sb_writers#6){.+.+}, at: [] vfs_write+0x17f/0x1b0
 #1:  (&of->mutex){+.+.}, at: [] kernfs_fop_write+0xdd/0x190
 #2:  (kn->count#213){}, at: [<7a18ad18>] 
kernfs_fop_write+0xe5/0x190

stack backtrace:
CPU: 2 PID: 1211 Comm: 02-mq Not tainted 4.15.0-rc7-dbg+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.0.0-prebuilt.qemu-project.org 04/01/2014
Call Trace:
 dump_stack+0x85/0xbf
 print_circular_bug.isra.35+0x1d7/0x1e4
 __lock_acquire+0x12af/0x1370
 ? lock_acquire+0xa8/0x230
 lock_acquire+0xa8/0x230
 ? queue_attr_store+0x35/0x80
 ? queue_attr_store+0x35/0x80
 __mutex_lock+0x6c/0x9e0
 ? queue_attr_store+0x35/0x80
 ? kernfs_fop_write+0xdd/0x190
 ? __mutex_lock+0x6c/0x9e0
 ? __mutex_lock+0x3d0/0x9e0
 ? lock_acquire+0xa8/0x230
 ? __lock_is_held+0x55/0x90
 ? queue_attr_store+0x35/0x80
 queue_attr_store+0x35/0x80
 kernfs_fop_write+0x109/0x190
 __vfs_write+0x1e/0x130
 ? rcu_read_lock_sched_held+0x66/0x70
 ? rcu_sync_lockdep_assert+0x23/0x50
 ? __sb_start_write+0x152/0x1f0
 ? __sb_start_write+0x168/0x1f0
 vfs_write+0xb9/0x1b0
 SyS_write+0x40/0xa0
 entry_SYSCALL_64_fastpath+0x23/0x9a
RIP: 0033:0x7fab8be8d054
RSP: 002b:7ffd39f0def8 EFLAGS: 0246

Thanks,

Bart.

Re: [PATCH V4 1/3] blk-mq: move actual issue into one helper

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at 11:58am -0500,
Ming Lei  wrote:

> No functional change; just clean up the code a bit, so that the
> following change of using direct issue for
> blk_mq_request_bypass_insert(), which is needed by DM, is easier to do.
> 
> Signed-off-by: Ming Lei 
> ---
>  block/blk-mq.c | 39 +++
>  1 file changed, 27 insertions(+), 12 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index edb1291a42c5..bf8d6651f40e 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1696,15 +1696,37 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx 
> *hctx, struct request *rq)
>   return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
>  }
>  
> -static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> - struct request *rq,
> - blk_qc_t *cookie)
> +static blk_status_t __blk_mq_issue_req(struct blk_mq_hw_ctx *hctx,
> +struct request *rq,
> +blk_qc_t *new_cookie)
>  {
> + blk_status_t ret;
>   struct request_queue *q = rq->q;
>   struct blk_mq_queue_data bd = {
>   .rq = rq,
>   .last = true,
>   };
> +
> + if (!blk_mq_get_driver_tag(rq, NULL, false))
> + return BLK_STS_AGAIN;
> +
> + if (!blk_mq_get_dispatch_budget(hctx)) {
> + blk_mq_put_driver_tag(rq);
> + return BLK_STS_AGAIN;
> + }
> +
> + *new_cookie = request_to_qc_t(hctx, rq);
> +
> +	ret = q->mq_ops->queue_rq(hctx, &bd);
> +
> + return ret;
> +}
> +
> +static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> + struct request *rq,
> + blk_qc_t *cookie)
> +{
> + struct request_queue *q = rq->q;
>   blk_qc_t new_cookie;
>   blk_status_t ret;
>   bool run_queue = true;
> @@ -1718,22 +1740,15 @@ static void __blk_mq_try_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   if (q->elevator)
>   goto insert;
>  
> - if (!blk_mq_get_driver_tag(rq, NULL, false))
> +	ret = __blk_mq_issue_req(hctx, rq, &new_cookie);
> + if (ret == BLK_STS_AGAIN)
>   goto insert;

You're (ab)using BLK_STS_AGAIN as a means to an end...
But what if in the future q->mq_ops->queue_rq() returns BLK_STS_AGAIN?

Mike


Re: [GIT PULL] block changes to improve device mapper for 4.16

2018-01-15 Thread Ming Lei
On Mon, Jan 15, 2018 at 08:33:41AM -0700, Jens Axboe wrote:
> On 1/14/18 7:59 PM, Mike Snitzer wrote:
> > Hi Jens,
> > 
> > I prepared this pull request in the hope that it may help you review and
> > stage these changes for 4.16.
> > 
> > I went over Ming's changes again to refine the headers and code comments
> > for clarity to help ease review and inclusion.
> > 
> > I've done extensive testing of the changes in this pull request in
> > combination with all the dm-4.16 changes I'll be sending to Linus, using
> > this tree (which gets pulled into linux-next, I wanted some coverage):
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-next
> > I'll obviously drop these changes from dm's linux-next once you pick
> > them up.
> > 
> > FYI, this dm-4.16 commit addresses the concerns Bart raised last week
> > against Ming's dm-mpath changes:
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16=7a9d12664a183c4a1e2fa86b0a9206006b95138e
> > 
> > The following changes since commit bf9ae8c5325c0070d0ec81a849bba8d156f65993:
> > 
> >   blk-mq: fix bad clear of RQF_MQ_INFLIGHT in blk_mq_ct_ctx_init() 
> > (2018-01-14 10:46:24 -0700)
> > 
> > are available in the Git repository at:
> > 
> >   git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git 
> > tags/for-block-4.16/dm
> > 
> > for you to fetch changes up to 0e94dfab56071a7afd0a9de9b8f6a0535148fb36:
> > 
> >   blk-mq: issue request directly for blk_insert_cloned_request (2018-01-14 
> > 13:00:02 -0500)
> > 
> > Please pull, thanks!
> > Mike
> > 
> > 
> > - Small correctness fix in del_gendisk() if GENHD_FL_HIDDEN is used.
> > 
> > - Cleanup blk_unregister_queue() to more precisely protect against
> >   concurrent sysfs changes, blk_mq_unregister_dev() now requires caller
> >   to hold q->sysfslock (blk_unregister_queue is only caller).
> > 
> > - Introduce add_disk() variant, add_disk_no_queue_reg(), that allows the
> >   gendisk to be registered but the associated disk->queue's
> >   blk_register_queue() is left for the driver to do once its
> >   request_queue is fully initialized.  Fixes long-standing DM
> >   request_queue initialization issues.
> > 
> > - Ming's blk-mq improvements to blk_insert_cloned_request(), which is
> >   used exclusively by request-based DM's blk-mq mode, that enable
> >   substantial dm-mpath sequential IO performance improvements.
> > 
> > 
> > Mike Snitzer (4):
> >   block: only bdi_unregister() in del_gendisk() if !GENHD_FL_HIDDEN
> >   block: properly protect the 'queue' kobj in blk_unregister_queue
> >   block: allow gendisk's request_queue registration to be deferred
> >   dm: fix incomplete request_queue initialization
> > 
> > Ming Lei (3):
> >   blk-mq: move actual issue into __blk_mq_issue_req helper
> 
> I don't like this patch at all - it's a 10 line function (if that)
> that ends up with three outputs, two of them hidden in passed
> in pointers. On top of that, for a function that is named
> __blk_mq_issue_req() and returns bool, you would logically expect
> a 'true' return to mean that it succeeded. This is the opposite.

OK, __blk_mq_issue_req() has been cleaned up in V4, please consider
it for V4.16.

Thanks,
Ming


[PATCH V4 3/3] blk-mq: issue request directly for blk_insert_cloned_request

2018-01-15 Thread Ming Lei
blk_insert_cloned_request() is called in the fast path of the dm-rq
driver, and in this function we append the request directly to
hctx->dispatch_list of the underlying queue.

1) This way isn't efficient enough because the hctx lock is always
required.

2) With blk_insert_cloned_request(), we bypass the underlying queue's IO
scheduler entirely and depend on the dm-rq driver to do all the IO
scheduling. But the dm-rq driver can't get any dispatch feedback from
the underlying queue, and this information is extremely useful for IO
merging. Without it, blk-mq basically can't merge IO, which causes very
bad sequential IO performance.

This patch makes use of blk_mq_try_issue_directly() to dispatch the
request to the underlying queue, provides the dispatch result to dm-rq
and blk-mq, and improves the above situations very much.

With this patch, sequential IO is improved by 3X in my test over
mpath/virtio-scsi.

Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 29 ++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1654c9c284d8..ce3965f271e3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1730,6 +1730,11 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
blk_qc_t new_cookie;
blk_status_t ret = BLK_STS_OK;
bool run_queue = true;
+   /*
+* This function is called from blk_insert_cloned_request() if
+* 'cookie' is NULL, and for dispatching this request only.
+*/
+   bool dispatch_only = !cookie;
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
@@ -1737,10 +1742,17 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
goto insert;
}
 
-   if (q->elevator)
+   if (q->elevator && !dispatch_only)
goto insert;
 
ret = __blk_mq_issue_req(hctx, rq, &new_cookie);
+   if (dispatch_only) {
+   if (ret == BLK_STS_AGAIN)
+   return BLK_STS_RESOURCE;
+   if (ret == BLK_STS_RESOURCE)
+   __blk_mq_requeue_request(rq);
+   return ret;
+   }
if (ret == BLK_STS_AGAIN)
goto insert;
 
@@ -1763,8 +1775,11 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
}
 
 insert:
-   blk_mq_sched_insert_request(rq, false, run_queue, false,
-   hctx->flags & BLK_MQ_F_BLOCKING);
+   if (!dispatch_only)
+   blk_mq_sched_insert_request(rq, false, run_queue, false,
+   hctx->flags & BLK_MQ_F_BLOCKING);
+   else
+   blk_mq_request_bypass_insert(rq, run_queue);
return ret;
 }
 
@@ -1784,6 +1799,14 @@ static blk_status_t blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
return ret;
 }
 
+blk_status_t blk_mq_request_direct_issue(struct request *rq)
+{
+   struct blk_mq_ctx *ctx = rq->mq_ctx;
+   struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
+
+   return blk_mq_try_issue_directly(hctx, rq, NULL);
+}
+
 static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
 {
const int is_sync = op_is_sync(bio->bi_opf);
-- 
2.9.5
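
For illustration, here is a minimal sketch of how a caller such as
blk_insert_cloned_request() might consume the new helper; only
blk_mq_request_direct_issue() comes from the patch above, the
surrounding dm-rq requeue plumbing is assumed:

#include <linux/blk-mq.h>

/* Sketch only: dispatch a dm-rq clone straight to the underlying
 * queue and hand the result back to the caller. */
static blk_status_t issue_clone_directly(struct request *clone)
{
        blk_status_t ret = blk_mq_request_direct_issue(clone);

        switch (ret) {
        case BLK_STS_OK:        /* dispatched to the driver */
                break;
        case BLK_STS_RESOURCE:  /* underlying queue busy; dm-rq can
                                 * requeue and retry later */
                break;
        default:                /* hard error; fail the clone */
                break;
        }
        return ret;
}

The point is that the caller now sees BLK_STS_RESOURCE from the
underlying queue, which is exactly the dispatch feedback dm-rq needs
for IO merging.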



[PATCH V4 1/3] blk-mq: move actual issue into one helper

2018-01-15 Thread Ming Lei
No functional change; just clean up the code a bit, so that the
following change, which switches blk_mq_request_bypass_insert() to
direct issue as needed by DM, is easier to do.

Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 39 +++
 1 file changed, 27 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index edb1291a42c5..bf8d6651f40e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1696,15 +1696,37 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx 
*hctx, struct request *rq)
return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
 }
 
-static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
-   struct request *rq,
-   blk_qc_t *cookie)
+static blk_status_t __blk_mq_issue_req(struct blk_mq_hw_ctx *hctx,
+  struct request *rq,
+  blk_qc_t *new_cookie)
 {
+   blk_status_t ret;
struct request_queue *q = rq->q;
struct blk_mq_queue_data bd = {
.rq = rq,
.last = true,
};
+
+   if (!blk_mq_get_driver_tag(rq, NULL, false))
+   return BLK_STS_AGAIN;
+
+   if (!blk_mq_get_dispatch_budget(hctx)) {
+   blk_mq_put_driver_tag(rq);
+   return BLK_STS_AGAIN;
+   }
+
+   *new_cookie = request_to_qc_t(hctx, rq);
+
+   ret = q->mq_ops->queue_rq(hctx, &bd);
+
+   return ret;
+}
+
+static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+   struct request *rq,
+   blk_qc_t *cookie)
+{
+   struct request_queue *q = rq->q;
blk_qc_t new_cookie;
blk_status_t ret;
bool run_queue = true;
@@ -1718,22 +1740,15 @@ static void __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
if (q->elevator)
goto insert;
 
-   if (!blk_mq_get_driver_tag(rq, NULL, false))
+   ret = __blk_mq_issue_req(hctx, rq, &new_cookie);
+   if (ret == BLK_STS_AGAIN)
goto insert;
 
-   if (!blk_mq_get_dispatch_budget(hctx)) {
-   blk_mq_put_driver_tag(rq);
-   goto insert;
-   }
-
-   new_cookie = request_to_qc_t(hctx, rq);
-
/*
 * For OK queue, we are done. For error, kill it. Any other
 * error (busy), just add it to our list as we previously
 * would have done
 */
-   ret = q->mq_ops->queue_rq(hctx, &bd);
switch (ret) {
case BLK_STS_OK:
*cookie = new_cookie;
-- 
2.9.5



[PATCH V4 2/3] blk-mq: return dispatch result to caller in blk_mq_try_issue_directly

2018-01-15 Thread Ming Lei
In the following patch, we will use blk_mq_try_issue_directly() for DM
to return the dispatch result; DM needs this information to improve
IO merging.

Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 23 ++-
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index bf8d6651f40e..1654c9c284d8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1722,13 +1722,13 @@ static blk_status_t __blk_mq_issue_req(struct 
blk_mq_hw_ctx *hctx,
return ret;
 }
 
-static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
-   struct request *rq,
-   blk_qc_t *cookie)
+static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+   struct request *rq,
+   blk_qc_t *cookie)
 {
struct request_queue *q = rq->q;
blk_qc_t new_cookie;
-   blk_status_t ret;
+   blk_status_t ret = BLK_STS_OK;
bool run_queue = true;
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -1752,31 +1752,36 @@ static void __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
switch (ret) {
case BLK_STS_OK:
*cookie = new_cookie;
-   return;
+   return ret;
case BLK_STS_RESOURCE:
__blk_mq_requeue_request(rq);
goto insert;
default:
*cookie = BLK_QC_T_NONE;
blk_mq_end_request(rq, ret);
-   return;
+   return ret;
}
 
 insert:
blk_mq_sched_insert_request(rq, false, run_queue, false,
hctx->flags & BLK_MQ_F_BLOCKING);
+   return ret;
 }
 
-static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
-   struct request *rq, blk_qc_t *cookie)
+static blk_status_t blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+ struct request *rq,
+ blk_qc_t *cookie)
 {
int srcu_idx;
+   blk_status_t ret;
 
might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
 
hctx_lock(hctx, &srcu_idx);
-   __blk_mq_try_issue_directly(hctx, rq, cookie);
+   ret = __blk_mq_try_issue_directly(hctx, rq, cookie);
hctx_unlock(hctx, srcu_idx);
+
+   return ret;
 }
 
 static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
-- 
2.9.5



[PATCH V4 0/3] blk-mq: issue request directly for blk_insert_cloned_request

2018-01-15 Thread Ming Lei
Hi Guys,

These 3 patches change the blk-mq part of blk_insert_cloned_request():
we switch to blk_mq_try_issue_directly(), so that both dm-rq and blk-mq
can get the dispatch result of the underlying queue. With this
information, blk-mq can handle IO merging much better, and sequential
I/O performance improves significantly.

In my dm-mpath over virtio-scsi test, this whole patchset improves
sequential IO by 3X ~ 5X.

V4:
- remove dm patches which are in DM tree already
- cleanup __blk_mq_issue_req as suggested by Jens

V3:
- rebase on the latest for-4.16/block of block tree
- add missed pg_init_all_paths() in patch 1, according to Bart's review

V2:
- drop 'dm-mpath: cache ti->clone during requeue', which is a bit
too complicated, and not see obvious performance improvement.
- make change on blk-mq part cleaner

Ming Lei (3):
  blk-mq: move actual issue into one helper
  blk-mq: return dispatch result to caller in blk_mq_try_issue_directly
  blk-mq: issue request directly for blk_insert_cloned_request

 block/blk-mq.c | 85 +++---
 1 file changed, 64 insertions(+), 21 deletions(-)

-- 
2.9.5



LSF/MM 2018: Call for Proposals

2018-01-15 Thread Johannes Weiner
The annual Linux Storage, Filesystem and Memory Management (LSF/MM)
Summit for 2018 will be held from April 23-25 at the Deer Valley
Lodges in Park City, Utah. LSF/MM is an invitation-only technical
workshop to map out improvements to the Linux storage, filesystem and
memory management subsystems that will make their way into the
mainline kernel within the coming years.


http://events.linuxfoundation.org/events/linux-storage-filesystem-and-mm-summit

LSF/MM 2018 will be a three day, stand-alone conference with three
subsystem-specific tracks, cross-track discussions, as well as BoF and
hacking sessions.

On behalf of the committee I am issuing a call for agenda proposals
that are suitable for cross-track discussion as well as technical
subjects for the breakout sessions.

If advance notice is required for visa applications then please point
that out in your proposal or request to attend, and submit the topic
as soon as possible.

1) Proposals for agenda topics should be sent before January 31st,
2018 to:

lsf...@lists.linux-foundation.org

and CC the mailing lists that are relevant for the topic in question:

FS: linux-fsde...@vger.kernel.org
MM: linux...@kvack.org
Block:  linux-block@vger.kernel.org
ATA:linux-...@vger.kernel.org
SCSI:   linux-s...@vger.kernel.org
NVMe:   linux-n...@lists.infradead.org

Please tag your proposal with [LSF/MM TOPIC] to make it easier to
track. In addition, please make sure to start a new thread for each
topic rather than following up to an existing one. Agenda topics and
attendees will be selected by the program committee, but the final
agenda will be formed by consensus of the attendees on the day.

2) Requests to attend the summit for those that are not proposing a
topic should be sent to:

lsf...@lists.linux-foundation.org

Please summarize what expertise you will bring to the meeting, and
what you would like to discuss. Please also tag your email with
[LSF/MM ATTEND] and send it as a new thread so there is less chance of
it getting lost.

We will try to cap attendance at around 25-30 per track to facilitate
discussions although the final numbers will depend on the room sizes
at the venue.

For discussion leaders, slides and visualizations are encouraged to
outline the subject matter and focus the discussions. Please refrain
from lengthy presentations and talks; the sessions are supposed to be
interactive, inclusive discussions.

There will be no recording or audio bridge. However, we expect that
written minutes will be published as we did in previous years:

2017: https://lwn.net/Articles/lsfmm2017/

2016: https://lwn.net/Articles/lsfmm2016/

2015: https://lwn.net/Articles/lsfmm2015/

2014: http://lwn.net/Articles/LSFMM2014/

2013: http://lwn.net/Articles/548089/

3) If you have feedback on last year's meeting that we can use to
improve this year's, please also send that to:

lsf...@lists.linux-foundation.org

Thank you on behalf of the program committee:

Anna Schumaker (Filesystems)
Jens Axboe (Storage)
Josef Bacik (Filesystems)
Martin K. Petersen (Storage)
Michal Hocko (MM)
Rik van Riel (MM)

Johannes Weiner


[RFC PATCH] blktrace: fail earlier if blk_trace in use

2018-01-15 Thread weiping zhang
Add a check for blk_trace already being in use before allocating any
resources for it, so that setup fails earlier.

Signed-off-by: weiping zhang 
---
 kernel/trace/blktrace.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 987d9a9a..16c 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -491,6 +491,9 @@ static int do_blk_trace_setup(struct request_queue *q, char 
*name, dev_t dev,
struct dentry *dir = NULL;
int ret;
 
+   if (unlikely(q->blk_trace))
+   return -EBUSY;
+
if (!buts->buf_size || !buts->buf_nr)
return -EINVAL;
 
-- 
2.9.4
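
As a hedged illustration of the new behaviour (the device path and
buffer sizes below are arbitrary, and root plus a mounted debugfs are
assumed), a second BLKTRACESETUP on a queue whose blk_trace is already
in use now fails with EBUSY before any resources are allocated:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/blktrace_api.h>

int main(void)
{
        struct blk_user_trace_setup buts;
        int fd = open("/dev/sda", O_RDONLY);    /* any block device */

        if (fd < 0)
                return 1;

        memset(&buts, 0, sizeof(buts));
        buts.buf_size = 512 * 1024;
        buts.buf_nr = 4;
        buts.act_mask = 0xffff;

        if (ioctl(fd, BLKTRACESETUP, &buts) < 0)
                perror("first BLKTRACESETUP");  /* expected to succeed */

        /* q->blk_trace is now set, so with this patch the second call
         * fails fast instead of allocating buffers first */
        if (ioctl(fd, BLKTRACESETUP, &buts) < 0)
                perror("second BLKTRACESETUP"); /* EBUSY */

        return 0;
}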



[PATCH 2/2] genirq/affinity: try best to make sure online CPU is assigned to vector

2018-01-15 Thread Ming Lei
84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
can cause an irq vector to be assigned only offline CPUs, and an IO hang
on HPSA was reported by Laurence as a result.

This patch fixes the issue by trying its best to make sure an online CPU
is assigned to each irq vector. It takes two steps to spread irq vectors:

1) spread irq vectors across offline CPUs in the node cpumask

2) spread irq vectors across online CPUs in the node cpumask

Fixes: 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
Cc: Thomas Gleixner 
Cc: Christoph Hellwig 
Reported-by: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 kernel/irq/affinity.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 99eb38a4cc83..8b716548b3db 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -103,6 +103,10 @@ static int irq_vecs_spread_affinity(struct cpumask *irqmsk,
int v, ncpus = cpumask_weight(nmsk);
int vecs_to_assign, extra_vecs;
 
+   /* May happen when spreading vectors across offline cpus */
+   if (!ncpus)
+   return 0;
+
/* How many vectors we will try to spread */
vecs_to_assign = min(max_vecs, ncpus);
 
@@ -165,13 +169,16 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
/* Stabilize the cpumasks */
get_online_cpus();
build_node_to_possible_cpumask(node_to_possible_cpumask);
-   nodes = get_nodes_in_cpumask(node_to_possible_cpumask, 
cpu_possible_mask,
-     &nodemsk);
 
/*
+* Don't spread irq vector across offline node.
+*
 * If the number of nodes in the mask is greater than or equal the
 * number of vectors we just spread the vectors across the nodes.
+*
 */
+   nodes = get_nodes_in_cpumask(node_to_possible_cpumask, cpu_online_mask,
+     &nodemsk);
if (affv <= nodes) {
for_each_node_mask(n, nodemsk) {
cpumask_copy(masks + curvec,
@@ -182,14 +189,22 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
goto done;
}
 
+   nodes_clear(nodemsk);
+   nodes = get_nodes_in_cpumask(node_to_possible_cpumask, 
cpu_possible_mask,
+     &nodemsk);
for_each_node_mask(n, nodemsk) {
int vecs_per_node;
 
/* Spread the vectors per node */
vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
 
-   cpumask_and(nmsk, cpu_possible_mask, 
node_to_possible_cpumask[n]);
+   /* spread vectors across offline cpus in the node cpumask */
+   cpumask_andnot(nmsk, node_to_possible_cpumask[n], 
cpu_online_mask);
+   irq_vecs_spread_affinity(&masks[curvec], last_affv - curvec,
+   vecs_per_node, nmsk);
 
+   /* spread vectors across online cpus in the node cpumask */
+   cpumask_and(nmsk, node_to_possible_cpumask[n], cpu_online_mask);
curvec += irq_vecs_spread_affinity(&masks[curvec],
   last_affv - curvec,
   vecs_per_node, nmsk);
-- 
2.9.5



[PATCH 1/2] genirq/affinity: move irq vectors spread into one function

2018-01-15 Thread Ming Lei
This patch prepares for doing a two-step spread:

- spread vectors across non-online CPUs
- spread vectors across online CPUs

This approach tries its best to avoid assigning only offline CPUs to a
single vector.

No functional change, and code gets cleaned up too.

Cc: Thomas Gleixner 
Cc: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 kernel/irq/affinity.c | 56 +++
 1 file changed, 34 insertions(+), 22 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index a37a3b4b6342..99eb38a4cc83 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -94,6 +94,35 @@ static int get_nodes_in_cpumask(cpumask_var_t 
*node_to_possible_cpumask,
return nodes;
 }
 
+/* Spread irq vectors, and the result is stored to @irqmsk. */
+static int irq_vecs_spread_affinity(struct cpumask *irqmsk,
+   int max_irqmsks,
+   int max_vecs,
+   struct cpumask *nmsk)
+{
+   int v, ncpus = cpumask_weight(nmsk);
+   int vecs_to_assign, extra_vecs;
+
+   /* How many vectors we will try to spread */
+   vecs_to_assign = min(max_vecs, ncpus);
+
+   /* Account for rounding errors */
+   extra_vecs = ncpus - vecs_to_assign * (ncpus / vecs_to_assign);
+
+   for (v = 0; v < min(max_irqmsks, vecs_to_assign); v++) {
+   int cpus_per_vec = ncpus / vecs_to_assign;
+
+   /* Account for extra vectors to compensate rounding errors */
+   if (extra_vecs) {
+   cpus_per_vec++;
+   --extra_vecs;
+   }
+   irq_spread_init_one(irqmsk + v, nmsk, cpus_per_vec);
+   }
+
+   return v;
+}
+
 /**
  * irq_create_affinity_masks - Create affinity masks for multiqueue spreading
  * @nvecs: The total number of vectors
@@ -104,7 +133,7 @@ static int get_nodes_in_cpumask(cpumask_var_t 
*node_to_possible_cpumask,
 struct cpumask *
 irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 {
-   int n, nodes, cpus_per_vec, extra_vecs, curvec;
+   int n, nodes, curvec;
int affv = nvecs - affd->pre_vectors - affd->post_vectors;
int last_affv = affv + affd->pre_vectors;
nodemask_t nodemsk = NODE_MASK_NONE;
@@ -154,33 +183,16 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
}
 
for_each_node_mask(n, nodemsk) {
-   int ncpus, v, vecs_to_assign, vecs_per_node;
+   int vecs_per_node;
 
/* Spread the vectors per node */
vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
 
-   /* Get the cpus on this node which are in the mask */
cpumask_and(nmsk, cpu_possible_mask, 
node_to_possible_cpumask[n]);
 
-   /* Calculate the number of cpus per vector */
-   ncpus = cpumask_weight(nmsk);
-   vecs_to_assign = min(vecs_per_node, ncpus);
-
-   /* Account for rounding errors */
-   extra_vecs = ncpus - vecs_to_assign * (ncpus / vecs_to_assign);
-
-   for (v = 0; curvec < last_affv && v < vecs_to_assign;
-curvec++, v++) {
-   cpus_per_vec = ncpus / vecs_to_assign;
-
-   /* Account for extra vectors to compensate rounding 
errors */
-   if (extra_vecs) {
-   cpus_per_vec++;
-   --extra_vecs;
-   }
-   irq_spread_init_one(masks + curvec, nmsk, cpus_per_vec);
-   }
-
+   curvec += irq_vecs_spread_affinity(&masks[curvec],
+  last_affv - curvec,
+  vecs_per_node, nmsk);
if (curvec >= last_affv)
break;
--nodes;
-- 
2.9.5
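
To see the rounding logic at work, here is a small userspace sketch of
the same distribution math (editorial; the cpumask handling is elided):

#include <stdio.h>

/* Same arithmetic as irq_vecs_spread_affinity(): split ncpus CPUs
 * across up to max_vecs vectors as evenly as possible. */
static void spread(int ncpus, int max_vecs)
{
        int vecs_to_assign = max_vecs < ncpus ? max_vecs : ncpus;
        /* Account for rounding errors */
        int extra_vecs = ncpus - vecs_to_assign * (ncpus / vecs_to_assign);
        int v;

        for (v = 0; v < vecs_to_assign; v++) {
                int cpus_per_vec = ncpus / vecs_to_assign;

                /* the first extra_vecs vectors take one extra CPU */
                if (extra_vecs) {
                        cpus_per_vec++;
                        --extra_vecs;
                }
                printf("vector %d gets %d CPUs\n", v, cpus_per_vec);
        }
}

int main(void)
{
        spread(10, 4);  /* prints 3, 3, 2, 2: all ten CPUs covered */
        return 0;
}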



[PATCH 0/2] genirq/affinity: try to make sure online CPU is assigned to irq vector

2018-01-15 Thread Ming Lei
Hi,

These two patches fixes IO hang issue reported by Laurence.

84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
may cause one irq vector to be assigned only offline CPUs, and then this
vector can't handle irqs any more.

The 1st patch moves the irq vector spreading into one function and
prepares for the fix done in the 2nd patch.

The 2nd patch fixes the issue by trying to make sure an online CPU is
assigned to each irq vector.


Ming Lei (2):
  genirq/affinity: move irq vectors spread into one function
  genirq/affinity: try best to make sure online CPU is assigned to
vector

 kernel/irq/affinity.c | 77 ++-
 1 file changed, 52 insertions(+), 25 deletions(-)

-- 
2.9.5



Re: [GIT PULL] block changes to improve device mapper for 4.16

2018-01-15 Thread Jens Axboe
On 1/15/18 8:52 AM, Mike Snitzer wrote:
> On Mon, Jan 15 2018 at 10:33am -0500,
> Jens Axboe  wrote:
> 
>> On 1/14/18 7:59 PM, Mike Snitzer wrote:
> ...
>>> Ming Lei (3):
>>>   blk-mq: move actual issue into __blk_mq_issue_req helper
>>
>> I don't like this patch at all - it's a 10 line function (if that)
>> that ends up with three outputs, two of them hidden in passed
>> in pointers. On top of that, a function that is named
>> __blk_mq_issue_req() and returns bool, you would logically expect
>> a 'true' return to mean that it succeeded. This is the opposite.
>>
>> Not strongly opposed to the rest.
> 
> OK, I'll have a closer look at how to clean it up (and also get with
> Ming).
> 
> In the meantime, you can either cherry-pick my first 4 patches or feel
> free to use this to pull them in:

I took 1-3 initially, added 4 now as well.

-- 
Jens Axboe



Re: [GIT PULL] block changes to improve device mapper for 4.16

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at 10:33am -0500,
Jens Axboe  wrote:

> On 1/14/18 7:59 PM, Mike Snitzer wrote:
...
> > Ming Lei (3):
> >   blk-mq: move actual issue into __blk_mq_issue_req helper
> 
> I don't like this patch at all - it's a 10 line function (if that)
> that ends up with three outputs, two of them hidden in passed
> in pointers. On top of that, a function that is named
> __blk_mq_issue_req() and returns bool, you would logically expect
> a 'true' return to mean that it succeeded. This is the opposite.
> 
> Not strongly opposed to the rest.

OK, I'll have a closer look at how to clean it up (and also get with
Ming).

In the meantime, you can either cherry-pick my first 4 patches or feel
free to use this to pull them in:

The following changes since commit bf9ae8c5325c0070d0ec81a849bba8d156f65993:

  blk-mq: fix bad clear of RQF_MQ_INFLIGHT in blk_mq_rq_ctx_init() (2018-01-14 
10:46:24 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git 
tags/for-block-4.16/dm-changes-1

for you to fetch changes up to d0cc27da2b04351b3cb52afeb99ceca7b9f91f3b:

  dm: fix incomplete request_queue initialization (2018-01-14 12:59:59 -0500)


- Small correctness fix in del_gendisk() if GENHD_FL_HIDDEN is used.

- Cleanup blk_unregister_queue() to more precisely protect against
  concurrent sysfs changes, blk_mq_unregister_dev() now requires caller
  to hold q->sysfs_lock (blk_unregister_queue is only caller).

- Introduce add_disk() variant, add_disk_no_queue_reg(), that allows the
  gendisk to be registered but the associated disk->queue's
  blk_register_queue() is left for the driver to do once its
  request_queue is fully initialized.  Fixes long-standing DM
  request_queue initialization issues.
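
  For illustration, a minimal sketch of the deferred-registration flow
  described above; apart from add_disk_no_queue_reg() and
  blk_register_queue(), the names and context are assumptions:

  #include <linux/genhd.h>

  /* Sketch of a driver (DM-like) deferring queue registration until
   * its request_queue is fully set up. */
  static int example_add_disk(struct gendisk *disk)
  {
          /* expose the gendisk, but skip registering disk->queue */
          add_disk_no_queue_reg(disk);

          /* ... finish initializing the request_queue here ... */

          /* now register the fully initialized queue in sysfs */
          return blk_register_queue(disk);
  }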


Mike Snitzer (4):
  block: only bdi_unregister() in del_gendisk() if !GENHD_FL_HIDDEN
  block: properly protect the 'queue' kobj in blk_unregister_queue
  block: allow gendisk's request_queue registration to be deferred
  dm: fix incomplete request_queue initialization

 block/blk-mq-sysfs.c  |  9 +
 block/blk-sysfs.c | 18 +++---
 block/genhd.c | 23 +++
 drivers/md/dm-rq.c|  9 -
 drivers/md/dm.c   | 11 ++-
 include/linux/genhd.h |  5 +
 6 files changed, 50 insertions(+), 25 deletions(-)


Re: 4.16 genirq change prevents HP servers from booting [was: Re: linux-next: Signed-off-by missing for commit in the device-mapper tree]

2018-01-15 Thread Ming Lei
On Mon, Jan 15, 2018 at 10:25:01AM -0500, Mike Snitzer wrote:
> On Mon, Jan 15 2018 at  8:27am -0500,
> Stephen Rothwell  wrote:
> 
> > Hi all,
> > 
> > Commit
> > 
> >   34e1467da673 ("Revert "genirq/affinity: assign vectors to all possible 
> > CPUs"")
> > 
> > is missing a Signed-off-by from its author and committer.
> > 
> > Reverts are commits as well.
> 
> Right, I'm aware.  I staged the tree that made some HP servers finally
> work with the latest linux-block 4.16 changes.  Without thinking about
> the broader implications.  Anyway, I'll drop the revert from
> linux-dm.git's 'for-next'.
> 
> Because I'm confident others will hunt down the irq issues.
> 
> I think Ming was looking to grab the queue mapping info and CPU related
> info from the affected server.

Hi Mike,

I have cooked a fix today, please see the following link:

https://marc.info/?l=linux-kernel&m=151601867018444&w=2

This issue is clear, and it has been described in the above link and in
the patch. If there is no objection, I will prepare a formal version.

-- 
Ming


4.16 genirq change prevents HP servers from booting [was: Re: linux-next: Signed-off-by missing for commit in the device-mapper tree]

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at  8:27am -0500,
Stephen Rothwell  wrote:

> Hi all,
> 
> Commit
> 
>   34e1467da673 ("Revert "genirq/affinity: assign vectors to all possible 
> CPUs"")
> 
> is missing a Signed-off-by from its author and committer.
> 
> Reverts are commits as well.

Right, I'm aware.  I staged the tree that made some HP servers finally
work with the latest linux-block 4.16 changes.  Without thinking about
the broader implications.  Anyway, I'll drop the revert from
linux-dm.git's 'for-next'.

Because I'm confident others will hunt down the irq issues.

I think Ming was looking to grab the queue mapping info and CPU related
info from the affected server.

> Though I do note it actually has a reasonable commit message, thanks.

Semi-reasonable.  Lacks detail.  The issue is that over the weekend
Laurence found linux-block.git commit 84676c1f21e8ff54befe98 prevents
some HP servers from booting.  They'd hang when trying to initialize
their HPSA controller's devices, e.g.:

[  246.751050] INFO: task systemd-udevd:411 blocked for more than 120
seconds.
[  246.791852]   Tainted: G  I  4.15.0-rc4.block.dm.4.16+ #1
[  246.830650] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[  246.874637] systemd-udevd   D0   411408 0x8004
[  246.904934] Call Trace:
[  246.918191]  ? __schedule+0x28d/0x870
[  246.937643]  ? _cond_resched+0x15/0x30
[  246.958222]  schedule+0x32/0x80
[  246.975424]  async_synchronize_cookie_domain+0x8b/0x140
[  247.004452]  ? remove_wait_queue+0x60/0x60
[  247.027335]  do_init_module+0xbe/0x219
[  247.048022]  load_module+0x21d6/0x2910
[  247.069436]  ? m_show+0x1c0/0x1c0
[  247.087999]  SYSC_finit_module+0x94/0xe0
[  247.110392]  entry_SYSCALL_64_fastpath+0x1a/0x7d
[  247.136669] RIP: 0033:0x7f84049287f9
[  247.156112] RSP: 002b:7ffd13199ab8 EFLAGS: 0246 ORIG_RAX:
0139
[  247.196883] RAX: ffda RBX: 55b712b59e80 RCX:
7f84049287f9
[  247.237989] RDX:  RSI: 7f8405245099 RDI:
0008
[  247.279105] RBP: 7f8404bf2760 R08:  R09:
55b712b45760
[  247.320005] R10: 0008 R11: 0246 R12:
0020
[  247.360625] R13: 7f8404bf2818 R14: 0050 R15:
7f8404bf27b8
[  247.401062] INFO: task scsi_eh_0:471 blocked for more than 120 seconds.
[  247.438161]   Tainted: G  I  4.15.0-rc4.block.dm.4.16+ #1
[  247.476640] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[  247.520700] scsi_eh_0   D0   471  2 0x8000
[  247.551339] Call Trace:
[  247.564360]  ? __schedule+0x28d/0x870
[  247.584720]  schedule+0x32/0x80
[  247.601294]  hpsa_eh_device_reset_handler+0x68c/0x700 [hpsa]
[  247.633358]  ? remove_wait_queue+0x60/0x60
[  247.656345]  scsi_try_bus_device_reset+0x27/0x40
[  247.682424]  scsi_eh_ready_devs+0x53f/0xe20
[  247.706467]  ? __pm_runtime_resume+0x55/0x70
[  247.730327]  scsi_error_handler+0x434/0x5e0
[  247.754387]  ? __schedule+0x295/0x870
[  247.775420]  kthread+0xf5/0x130
[  247.793461]  ? scsi_eh_get_sense+0x240/0x240
[  247.818008]  ? kthread_associate_blkcg+0x90/0x90
[  247.844759]  ret_from_fork+0x1f/0x30
[  247.865440] INFO: task scsi_id:488 blocked for more than 120 seconds.
[  247.901112]   Tainted: G  I  4.15.0-rc4.block.dm.4.16+ #1
[  247.938743] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[  247.981092] scsi_id D0   488  1 0x0004
[  248.010535] Call Trace:
[  248.023567]  ? __schedule+0x28d/0x870
[  248.044236]  ? __switch_to+0x1f5/0x460
[  248.065776]  schedule+0x32/0x80
[  248.084238]  schedule_timeout+0x1d4/0x2f0
[  248.106184]  wait_for_completion+0x123/0x190
[  248.130759]  ? wake_up_q+0x70/0x70
[  248.150295]  flush_work+0x119/0x1a0
[  248.169238]  ? wake_up_worker+0x30/0x30
[  248.189670]  __cancel_work_timer+0x103/0x190
[  248.213751]  ? kobj_lookup+0x10b/0x160
[  248.235441]  disk_block_events+0x6f/0x90
[  248.257820]  __blkdev_get+0x6a/0x480
[  248.278770]  ? bd_acquire+0xd0/0xd0
[  248.298438]  blkdev_get+0x1a5/0x300
[  248.316587]  ? bd_acquire+0xd0/0xd0
[  248.334814]  do_dentry_open+0x202/0x320
[  248.354372]  ? security_inode_permission+0x3c/0x50
[  248.378818]  path_openat+0x537/0x12c0
[  248.397386]  ? vm_insert_page+0x1e0/0x1f0
[  248.417664]  ? vvar_fault+0x75/0x140
[  248.435811]  do_filp_open+0x91/0x100
[  248.454061]  do_sys_open+0x126/0x210
[  248.472462]  entry_SYSCALL_64_fastpath+0x1a/0x7d
[  248.495438] RIP: 0033:0x7f39e60e1e90
[  248.513136] RSP: 002b:7ffc4c906ba8 EFLAGS: 0246 ORIG_RAX:
0002
[  248.550726] RAX: ffda RBX: 5624aead3010 RCX:
7f39e60e1e90
[  248.586207] RDX: 7f39e60cc0c4 RSI: 00080800 RDI:
7ffc4c906ed0
[  248.622411] RBP: 7ffc4c906b60 R08: 7f39e60cc140 R09:
7f39e60cc140
[  248.658704] R10: 001f R11: 0246 R12:
7ffc4c906ed0

Re: [PATCH 23/27] drbd: make intelligent use of blkdev_issue_zeroout

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at  7:46am -0500,
Lars Ellenberg  wrote:
 
> As I understood it,
> blkdev_issue_zeroout() was supposed to "always try to unmap",
> deprovision, the relevant region, and zero-out any unaligned
> head or tail, just like my work around above was doing.
> 
> And that device mapper thin was "about to" learn this, "soon",
> or maybe block core would do the equivalent of my workaround
> described above.
> 
> But it then did not.
> 
> See also:
> https://www.redhat.com/archives/dm-devel/2017-March/msg00213.html
> https://www.redhat.com/archives/dm-devel/2017-March/msg00226.html

Right, now that you mention it it is starting to ring a bell (especially
after I read your 2nd dm-devel archive url above).

> I then did not follow this closely enough anymore,
> and I missed that with recent enough kernel,
> discard on DRBD on dm-thin would fully allocate.
> 
> In our out-of-tree module, we had to keep the older code for
> compat reasons, anyways. I will just re-enable our zeroout
> workaround there again.
> 
> In tree, either dm-thin learns to do REQ_OP_WRITE_ZEROES "properly",
> so the result in this scenario is what we expect:
> 
>   _: unprovisioned, not allocated, returns zero on read anyways
>   *: provisioned, some arbitrary data
>   0: explicitly zeroed:
> 
>   |gran|ular|ity |    |
>      to|-be-|zero|ed
>   |**00|____|____|00**|
> 
> (leave unallocated blocks alone,
>  de-allocate full blocks just like with discard,
>  explicitly zero unaligned head and tail)

"de-allocate full blocks just like with discard" is an interesting take
what it means for dm-thin to handle REQ_OP_WRITE_ZEROES "properly".

> Or DRBD will have to resurrect that reinvented zeroout again,
> with exactly those semantics. I did reinvent it for a reason ;)

Yeah, I now recall dropping that line of development because it
became "hard" (or at least harder than originally thought).

Don't people use REQ_OP_WRITE_ZEROES to initialize a portion of the
disk?  E.g. zeroing superblocks, metadata areas, or whatever?

If we just discarded the logical extent and then a user did a partial
write to the block, areas that a user might expect to be zeroed wouldn't
be (at least in the case of dm-thinp if "skip_block_zeroing" is
enabled).  And yes if discard passdown is enabled and the device's
discard implementation does "discard_zeroes_data" then it'd be
fine.. but there are a lot of things that need to line up for drbd's
REQ_OP_WRITE_ZEROES to "just work" (as it expects).

(now I'm just echoing the kinds of concerns I had in that 2nd dm-devel
post above).

This post from mkp is interesting:
https://www.redhat.com/archives/dm-devel/2017-March/msg00228.html

Specifically:
"You don't have a way to mark those blocks as being full of zeroes
without actually writing them?

Note that the fallback to a zeroout command is to do a regular write. So
if DM doesn't zero the blocks, the block layer is going to do it."

No, dm-thinp doesn't have an easy way to mark an allocated block as
containing zeroes (without actually zeroing).  I toyed with adding that
but then realized that even if we had it it'd still require block
zeroing be enabled.  But block zeroing is done at allocation time.  So
we'd need to interpret the "this block is zeroes" flag to mean "on first
write or read to this block it needs to first zero it".  Fugly to say
the least...

I've been quite busy with other things but I can revisit all this with
Joe Thornber and see what we come up with after a 2nd discussion.

But sadly, in general, this is a low priority for me, so you might do
well to reintroduce your drbd workaround.. sorry about that :(

Mike


Re: [Drbd-dev] [PATCH 23/27] drbd: make intelligent use of blkdev_issue_zeroout

2018-01-15 Thread Lars Ellenberg
On Sat, Jan 13, 2018 at 12:46:40AM +0000, Eric Wheeler wrote:
> Hello All,
> 
> We just noticed that discards to DRBD devices backed by dm-thin devices 
> are fully allocating the thin blocks.
> 
> This behavior does not exist before 
> ee472d83 block: add a flags argument to (__)blkdev_issue_zeroout
> 
> The problem exists somewhere between
> [working] c20cfc27 block: stop using blkdev_issue_write_same for zeroing
>   and
> [broken]  45c21793 drbd: implement REQ_OP_WRITE_ZEROES
> 
> Note that c20cfc27 works as expected, but 45c21793 discards blocks 
> being zeroed on the dm-thin backing device. All commits between those two 
> produce the following error:
> 
> blkdiscard: /dev/drbd/by-res/test: BLKDISCARD ioctl failed: Input/output error
> 
> Also note that issuing a blkdiscard to the backing device directly 
> discards as you would expect. This is just a problem when sending discards 
> through DRBD.
> 
> Is there an easy way to solve this in the short term, even if the ultimate 
> fix is more involved?

> On Wed, 5 Apr 2017, Christoph Hellwig wrote:
> 

commit 0dbed96a3cc9786bc4814dab98a7218753bde934
Author: Christoph Hellwig 
Date:   Wed Apr 5 19:21:21 2017 +0200

drbd: make intelligent use of blkdev_issue_zeroout

> > drbd always wants its discard wire operations to zero the blocks, so
> > use blkdev_issue_zeroout with the BLKDEV_ZERO_UNMAP flag instead of
> > reinventing it poorly.

> > -/*
> > - * We *may* ignore the discard-zeroes-data setting, if so configured.
> > - *
> > - * Assumption is that it "discard_zeroes_data=0" is only because the 
> > backend
> > - * may ignore partial unaligned discards.
> > - *
> > - * LVM/DM thin as of at least
> > - *   LVM version: 2.02.115(2)-RHEL7 (2015-01-28)
> > - *   Library version: 1.02.93-RHEL7 (2015-01-28)
> > - *   Driver version:  4.29.0
> > - * still behaves this way.
> > - *
> > - * For unaligned (wrt. alignment and granularity) or too small discards,
> > - * we zero-out the initial (and/or) trailing unaligned partial chunks,
> > - * but discard all the aligned full chunks.
> > - *
> > - * At least for LVM/DM thin, the result is effectively 
> > "discard_zeroes_data=1".
> > - */
> > -int drbd_issue_discard_or_zero_out(struct drbd_device *device, sector_t 
> > start, unsigned int nr_sectors, bool discard)


As I understood it,
blkdev_issue_zeroout() was supposed to "always try to unmap",
deprovision, the relevant region, and zero-out any unaligned
head or tail, just like my work around above was doing.

And that device mapper thin was "about to" learn this, "soon",
or maybe block core would do the equivalent of my workaround
described above.

But it then did not.

See also:
https://www.redhat.com/archives/dm-devel/2017-March/msg00213.html
https://www.redhat.com/archives/dm-devel/2017-March/msg00226.html

I then did not follow this closely enough anymore,
and I missed that with recent enough kernel,
discard on DRBD on dm-thin would fully allocate.

In our out-of-tree module, we had to keep the older code for
compat reasons, anyways. I will just re-enable our zeroout
workaround there again.

In tree, either dm-thin learns to do REQ_OP_WRITE_ZEROES "properly",
so the result in this scenario is what we expect:

  _: unprovisioned, not allocated, returns zero on read anyways
  *: provisioned, some arbitrary data
  0: explicitly zeroed:

  |gran|ular|ity |    |
     to|-be-|zero|ed
  |**00|____|____|00**|

(leave unallocated blocks alone,
 de-allocate full blocks just like with discard,
 explicitly zero unaligned head and tail)

Or DRBD will have to resurrect that reinvented zeroout again,
with exactly those semantics. I did reinvent it for a reason ;)
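
For reference, those semantics boil down to something like the sketch
below; 'gran' stands in for the allocation granularity in sectors, and
the helper name and the plain 64-bit modulo are illustrative, not the
actual DRBD code:

#include <linux/blkdev.h>

/* Sketch only: explicitly zero the unaligned head and tail, and
 * de-allocate the aligned middle just like a discard would. */
static int sketch_zeroout_unmap(struct block_device *bdev, sector_t start,
                                sector_t nr, sector_t gran, gfp_t gfp)
{
        sector_t head = (gran - start % gran) % gran;
        sector_t tail, mid;
        int err = 0;

        if (head > nr)
                head = nr;
        tail = (nr - head) % gran;
        mid = nr - head - tail;

        if (head)               /* zero the unaligned head */
                err = blkdev_issue_zeroout(bdev, start, head, gfp, 0);
        if (!err && mid)        /* de-allocate the full blocks */
                err = blkdev_issue_discard(bdev, start + head, mid, gfp, 0);
        if (!err && tail)       /* zero the unaligned tail */
                err = blkdev_issue_zeroout(bdev, start + head + mid,
                                           tail, gfp, 0);
        return err;
}

Unallocated blocks inside the middle are left alone by the discard, so
the result matches the diagram above.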

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT