Re: [PATCH] bcache: fix for allocator and register thread race
On 10/01/2018 4:51 PM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui
>
> After a long run of random small-IO writing, we rebooted the machine,
> and after power-on bcache got stuck. The stacks are:
> [root@ceph153 ~]# cat /proc/2510/task/*/stack
> [] closure_sync+0x25/0x90 [bcache]
> [] bch_journal+0x118/0x2b0 [bcache]
> [] bch_journal_meta+0x47/0x70 [bcache]
> [] bch_prio_write+0x237/0x340 [bcache]
> [] bch_allocator_thread+0x3c8/0x3d0 [bcache]
> [] kthread+0xcf/0xe0
> [] ret_from_fork+0x58/0x90
> [] 0x
> [root@ceph153 ~]# cat /proc/2038/task/*/stack
> [] __bch_btree_map_nodes+0x12d/0x150 [bcache]
> [] bch_btree_insert+0xf1/0x170 [bcache]
> [] bch_journal_replay+0x13f/0x230 [bcache]
> [] run_cache_set+0x79a/0x7c2 [bcache]
> [] register_bcache+0xd48/0x1310 [bcache]
> [] kobj_attr_store+0xf/0x20
> [] sysfs_write_file+0xc6/0x140
> [] vfs_write+0xbd/0x1e0
> [] SyS_write+0x7f/0xe0
> [] system_call_fastpath+0x16/0x1
>
> The stacks show the register thread and allocator thread were stuck
> while registering the cache device.
>
> We rebooted the machine several times; the issue always exists on this
> machine.
>
> We debugged the code and found the call trace below:
> register_bcache()
>    ==>run_cache_set()
>       ==>bch_journal_replay()
>          ==>bch_btree_insert()
>             ==>__bch_btree_map_nodes()
>                ==>btree_insert_fn()
>                   ==>btree_split() //node needs split
>                      ==>btree_check_reserve()
> In btree_check_reserve() it checks whether there are enough buckets of
> RESERVE_BTREE type. Since the allocator thread has not started working
> yet, no buckets of RESERVE_BTREE type have been allocated, so the
> register thread waits on c->btree_cache_wait and goes to sleep.
>
> Then the allocator thread is initialized and goes to work; the call
> trace is below:
> bch_allocator_thread()
>    ==>bch_prio_write()
>       ==>bch_journal_meta()
>          ==>bch_journal()
>             ==>journal_wait_for_write()
> In journal_wait_for_write() it checks whether the journal is full via
> journal_full(), but the long run of random small-IO writing has
> exhausted the journal buckets (journal.blocks_free == 0). To release
> journal buckets, the allocator calls btree_flush_write() to flush keys
> to btree nodes, and waits on c->journal.wait until the btree node
> writes finish or some journal bucket space becomes available; then the
> allocator thread goes to sleep. But in btree_flush_write(), since
> bch_journal_replay() has not finished, no btree node holds a journal
> reference (the condition "if (btree_current_write(b)->journal)" is
> never satisfied), so there is no btree node to flush, no journal
> bucket is released, and the allocator sleeps forever.
>
> From the above analysis we can see that:
> 1) The register thread waits for the allocator thread to allocate
>    buckets of RESERVE_BTREE type;
> 2) The allocator thread waits for the register thread to replay the
>    journal, so it can flush btree nodes and get a journal bucket.
> So they are stuck waiting for each other.
>
> Hua Rui provided a patch that allocates some buckets of RESERVE_BTREE
> type in advance, so the register thread can get buckets when splitting
> btree nodes without waiting for the allocator thread. I tested it; it
> had an effect, and the register thread ran a step further, but finally
> still got stuck. The reason is that only 8 buckets of RESERVE_BTREE
> type were allocated, and in bch_journal_replay(), after 2 btree node
> splits only 4 buckets of RESERVE_BTREE type were left, so
> btree_check_reserve() was no longer satisfied and the register thread
> went to sleep again; meanwhile the allocator thread had not flushed
> enough btree nodes to release a journal bucket, so both got stuck
> again.
>
> So we need to allocate more buckets of RESERVE_BTREE type in advance,
> but how many are enough? By experience and test, I think it should be
> as many as the journal buckets. I modified the code as in this patch,
> tested it on the machine, and it works.
>
> This patch is based on Hua Rui's patch, and allocates more buckets of
> RESERVE_BTREE type in advance so the register thread and allocator
> thread do not end up waiting for each other.
>
> Signed-off-by: Hua Rui
> Signed-off-by: Tang Junhui

Hi Junhui,

It took me some time to understand the problem and the fix :-) The root
cause is that the bcache journal does not reserve space in advance, and
allocates a journal slot when a btree node splits while inserting keys
from the journal into the btree. If we reserved journal space in
advance, we would need something like a commit record to keep the
operation atomic. That is complicated, and I doubt it is worth it for
bcache. This fix is much simpler and it makes things work. I am OK with
it.

Reviewed-by: Coly Li

> ---
>  drivers/md/bcache/btree.c |  9 ++---
>  drivers/md/bcache/super.c | 12 +++-
>  2 files
Re: [PATCH V4 1/3] blk-mq: move actual issue into one helper
On Tue, Jan 16, 2018 at 9:40 AM, Ming Lei wrote:
> On Mon, Jan 15, 2018 at 10:29:46AM -0700, Jens Axboe wrote:
>> On 1/15/18 9:58 AM, Ming Lei wrote:
>> > No functional change, just to clean up code a bit, so that the following
>> > change of using direct issue for blk_mq_request_bypass_insert() which is
>> > needed by DM can be easier to do.
>> >
>> > Signed-off-by: Ming Lei
>> > ---
>> >  block/blk-mq.c | 39 +++
>> >  1 file changed, 27 insertions(+), 12 deletions(-)
>> >
>> > diff --git a/block/blk-mq.c b/block/blk-mq.c
>> > index edb1291a42c5..bf8d6651f40e 100644
>> > --- a/block/blk-mq.c
>> > +++ b/block/blk-mq.c
>> > @@ -1696,15 +1696,37 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq)
>> >  	return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
>> >  }
>> >
>> > -static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>> > -					struct request *rq,
>> > -					blk_qc_t *cookie)
>> > +static blk_status_t __blk_mq_issue_req(struct blk_mq_hw_ctx *hctx,
>> > +				       struct request *rq,
>> > +				       blk_qc_t *new_cookie)
>> >  {
>> > +	blk_status_t ret;
>> >  	struct request_queue *q = rq->q;
>> >  	struct blk_mq_queue_data bd = {
>> >  		.rq = rq,
>> >  		.last = true,
>> >  	};
>> > +
>> > +	if (!blk_mq_get_driver_tag(rq, NULL, false))
>> > +		return BLK_STS_AGAIN;
>> > +
>> > +	if (!blk_mq_get_dispatch_budget(hctx)) {
>> > +		blk_mq_put_driver_tag(rq);
>> > +		return BLK_STS_AGAIN;
>> > +	}
>> > +
>> > +	*new_cookie = request_to_qc_t(hctx, rq);
>> > +
>> > +	ret = q->mq_ops->queue_rq(hctx, &bd);
>> > +
>> > +	return ret;
>>
>> 	return q->mq_ops->queue_rq(hctx, &bd);
>>
>> and kill 'ret', it's not used. But more importantly, who puts the
>
> OK.
>
>> driver tag and the budget if we get != OK for ->queue_rq()?
>
> For the budget, the current protocol is that the driver is responsible
> for the release once .queue_rq is called, see scsi_mq_queue_rq() and the
> comment in blk_mq_do_dispatch_sched().
>
> For the driver tag, it is released in blk_mq_free_request(), which is
> done in the failure handling of != OK.

Actually the driver tag should be released here when .queue_rq() returns
!OK, and this patch does follow the current logic:

- call __blk_mq_requeue_request() to release the tag and make the request
  ready for requeue if STS_RESOURCE is returned.
- For other !OK cases, let the driver free the request.

--
Ming Lei
Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready
On Tue, Jan 16, 2018 at 7:13 AM, Mike Snitzer wrote:
> On Mon, Jan 15 2018 at  6:10P -0500,
> Mike Snitzer wrote:
>
>> On Mon, Jan 15 2018 at  5:51pm -0500,
>> Bart Van Assche wrote:
>>
>> > On Mon, 2018-01-15 at 17:15 -0500, Mike Snitzer wrote:
>> > > sysfs write op calls kernfs_fop_write which takes:
>> > > of->mutex then kn->count#213 (no idea what that is)
>> > > then q->sysfs_lock (via queue_attr_store)
>> > >
>> > > vs
>> > >
>> > > blk_unregister_queue takes:
>> > > q->sysfs_lock then
>> > > kernfs_mutex (via kernfs_remove)
>> > > seems lockdep thinks "kernfs_mutex" is "kn->count#213"?
>> > >
>> > > Feels like lockdep code in fs/kernfs/file.c and fs/kernfs/dir.c is
>> > > triggering false positives.. because these seem like different kernfs
>> > > locks yet they are reported as "kn->count#213".
>> > >
>> > > Certainly feeling out of my depth with kernfs's locking though.
>> >
>> > Hello Mike,
>> >
>> > I don't think that this is a false positive but rather the following
>> > traditional pattern of a potential deadlock involving sysfs attributes:
>> > * One context obtains a mutex from inside a sysfs attribute method:
>> >   queue_attr_store() obtains q->sysfs_lock.
>> > * Another context removes a sysfs attribute while holding a mutex:
>> >   blk_unregister_queue() removes the queue sysfs attributes while
>> >   holding q->sysfs_lock.
>> >
>> > This can result in a real deadlock because the code that removes sysfs
>> > objects waits until all ongoing attribute callbacks have finished.
>> >
>> > Since commit 667257e8b298 ("block: properly protect the 'queue' kobj in
>> > blk_unregister_queue") modified blk_unregister_queue() such that
>> > q->sysfs_lock is held around the kobject_del(&q->kobj) call, I think
>> > this is a regression introduced by that commit.
>>
>> Sure, of course it is a regression.
>>
>> Aside from moving the mutex_unlock(&q->sysfs_lock) above the
>> kobject_del(&q->kobj) I don't know how to fix it.
>>
>> Though, realistically that'd be an adequate fix (given the way the code
>> was before).
>
> Any chance you apply this and re-run your srp_test that triggered the
> lockdep splat?
>
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 4a6a40ffd78e..c50e08e9bf17 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -952,10 +952,10 @@ void blk_unregister_queue(struct gendisk *disk)
>  	if (q->request_fn || (q->mq_ops && q->elevator))
>  		elv_unregister_queue(q);
>
> +	mutex_unlock(&q->sysfs_lock);
> +
>  	kobject_uevent(&q->kobj, KOBJ_REMOVE);
>  	kobject_del(&q->kobj);
>  	blk_trace_remove_sysfs(disk_to_dev(disk));
>  	kobject_put(&disk_to_dev(disk)->kobj);
> -
> -	mutex_unlock(&q->sysfs_lock);
>  }

If this patch is required, a similar change should be needed in the
failure path of blk_register_queue() too.

--
Ming Lei
Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assigned to irq vector
On Mon, Jan 15, 2018 at 09:40:36AM -0800, Christoph Hellwig wrote:
> On Tue, Jan 16, 2018 at 12:03:43AM +0800, Ming Lei wrote:
> > Hi,
> >
> > These two patches fix the IO hang issue reported by Laurence.
> >
> > 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> > may cause one irq vector to be assigned only offline CPUs, and then
> > this vector can't handle irq any more.
>
> Well, that very much was the intention of managed interrupts.  Why
> does the device raise an interrupt for a queue that has no online
> cpu assigned to it?

If pci_alloc_irq_vectors() returns OK, the driver may think everything is
fine, configure the related hw queues (such as enabling irq on the
queues), and finally an irq arrives and no CPU can handle it.

Also I don't think there are drivers which check whether the CPUs
assigned to an irq vector are online, and that never seems to be a job a
driver is supposed to do.

--
Ming
Re: [PATCH V4 0/3] blk-mq: issue request directly for blk_insert_cloned_request
On Mon, Jan 15, 2018 at 12:43:44PM -0500, Mike Snitzer wrote:
> On Mon, Jan 15 2018 at 11:58am -0500,
> Ming Lei wrote:
>
> > Hi Guys,
> >
> > These 3 patches change the blk-mq part of blk_insert_cloned_request():
> > we switch to blk_mq_try_issue_directly(), so that both dm-rq and
> > blk-mq can get the dispatch result of the underlying queue, and with
> > this information blk-mq can handle IO merging much better; sequential
> > I/O performance is then improved a lot.
> >
> > In my dm-mpath over virtio-scsi test, this whole patchset improves
> > sequential IO by 3X ~ 5X.
> >
> > V4:
> > 	- remove dm patches which are in the DM tree already
> > 	- clean up __blk_mq_issue_req as suggested by Jens
>
> Ming,
>
> You dropped the header cleanups that I did in v3 ("blk-mq: issue request
> directly for blk_insert_cloned_request") being the one header I care
> about being updated).
>
> I also worked in parallel on my own v4 to address Jens' dislike for
> v3's 3 returns.  But I skinned the cat a different way, by dropping your
> first patch that introduces the __blk_mq_issue_req helper, please see
> these 2 commits:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=block-4.16=40f8947784128bb83dc5f7a6aed7ed230222f675
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=block-4.16=f641015f42f41df75220313ee62e8241f2fd
>
> I think it makes the changes more obvious (not spread across 2 methods
> and doesn't require use of BLK_STS_AGAIN).
>
> Happy to yield to Jens to decide which he prefers.
>
> Jens, if you'd like to pick my variant of v4 up they are here, thanks!

Hi Mike,

Looks like we were working on it at the same time; I am fine with your V4.

Jens, please let us know if Mike's V4 is OK; if not, we can make V5/.., :-)

Thanks,
Ming
Re: [PATCH V4 1/3] blk-mq: move actual issue into one helper
On Mon, Jan 15 2018 at  8:43pm -0500,
Ming Lei wrote:

> On Mon, Jan 15, 2018 at 02:41:12PM -0500, Mike Snitzer wrote:
> > On Mon, Jan 15 2018 at 12:29pm -0500,
> > Jens Axboe wrote:
> >
> > > On 1/15/18 9:58 AM, Ming Lei wrote:
> > > > No functional change, just to clean up code a bit, so that the
> > > > following change of using direct issue for
> > > > blk_mq_request_bypass_insert() which is needed by DM can be easier
> > > > to do.
> > > >
> > > > Signed-off-by: Ming Lei
> > > > ---
> > > >  block/blk-mq.c | 39 +++
> > > >  1 file changed, 27 insertions(+), 12 deletions(-)
> > > >
> > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > index edb1291a42c5..bf8d6651f40e 100644
> > > > --- a/block/blk-mq.c
> > > > +++ b/block/blk-mq.c
> > > > @@ -1696,15 +1696,37 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq)
> > > >  	return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
> > > >  }
> > > >
> > > > -static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> > > > -					struct request *rq,
> > > > -					blk_qc_t *cookie)
> > > > +static blk_status_t __blk_mq_issue_req(struct blk_mq_hw_ctx *hctx,
> > > > +				       struct request *rq,
> > > > +				       blk_qc_t *new_cookie)
> > > >  {
> > > > +	blk_status_t ret;
> > > >  	struct request_queue *q = rq->q;
> > > >  	struct blk_mq_queue_data bd = {
> > > >  		.rq = rq,
> > > >  		.last = true,
> > > >  	};
> > > > +
> > > > +	if (!blk_mq_get_driver_tag(rq, NULL, false))
> > > > +		return BLK_STS_AGAIN;
> > > > +
> > > > +	if (!blk_mq_get_dispatch_budget(hctx)) {
> > > > +		blk_mq_put_driver_tag(rq);
> > > > +		return BLK_STS_AGAIN;
> > > > +	}
> > > > +
> > > > +	*new_cookie = request_to_qc_t(hctx, rq);
> > > > +
> > > > +	ret = q->mq_ops->queue_rq(hctx, &bd);
> > > > +
> > > > +	return ret;
> > >
> > > 	return q->mq_ops->queue_rq(hctx, &bd);
> > >
> > > and kill 'ret', it's not used.
> >
> > Yeap, good point.
> >
> > > But more importantly, who puts the
> > > driver tag and the budget if we get != OK for ->queue_rq()?
> >
> > __blk_mq_try_issue_directly() processes the returned value the same as
> > before this patch, meaning this patch isn't making any functional
> > change: if BLK_STS_RESOURCE, __blk_mq_requeue_request() is called, and
> > __blk_mq_requeue_request() will blk_mq_put_driver_tag().
> > Otherwise, all other errors result in blk_mq_end_request(rq, ret);
> >
> > So ignoring this patch, are you concerned that:
> > 1) Does blk_mq_end_request() put both?  Looks like blk_mq_free_request()
> >    handles rq->tag != -1, but why not have it use
> >    __blk_mq_put_driver_tag()?  I'm not seeing where the budget is put
> >    from blk_mq_end_request()...
>
> blk_mq_free_request() releases the driver tag, and the budget is owned
> by the driver once .queue_rq is called.
>
> > 2) Nothing seems to be putting the budget in
> >    __blk_mq_try_issue_directly()'s BLK_STS_RESOURCE error path?  I
> >    share your concern now (for drivers who set {get,put}_budget in
> >    mq_ops).  Should __blk_mq_requeue_request() be updated to also
> >    blk_mq_put_dispatch_budget()?
>
> No, at least it is the current protocol of using the budget, please see
> scsi_mq_queue_rq() and the comment in blk_mq_do_dispatch_sched().

Yeap, thanks for clarifying.
Re: [PATCH V4 1/3] blk-mq: move actual issue into one helper
On Mon, Jan 15, 2018 at 10:29:46AM -0700, Jens Axboe wrote:
> On 1/15/18 9:58 AM, Ming Lei wrote:
> > No functional change, just to clean up code a bit, so that the
> > following change of using direct issue for
> > blk_mq_request_bypass_insert() which is needed by DM can be easier to
> > do.
> >
> > Signed-off-by: Ming Lei
> > ---
> >  block/blk-mq.c | 39 +++
> >  1 file changed, 27 insertions(+), 12 deletions(-)
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index edb1291a42c5..bf8d6651f40e 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1696,15 +1696,37 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq)
> >  	return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
> >  }
> >
> > -static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> > -					struct request *rq,
> > -					blk_qc_t *cookie)
> > +static blk_status_t __blk_mq_issue_req(struct blk_mq_hw_ctx *hctx,
> > +				       struct request *rq,
> > +				       blk_qc_t *new_cookie)
> >  {
> > +	blk_status_t ret;
> >  	struct request_queue *q = rq->q;
> >  	struct blk_mq_queue_data bd = {
> >  		.rq = rq,
> >  		.last = true,
> >  	};
> > +
> > +	if (!blk_mq_get_driver_tag(rq, NULL, false))
> > +		return BLK_STS_AGAIN;
> > +
> > +	if (!blk_mq_get_dispatch_budget(hctx)) {
> > +		blk_mq_put_driver_tag(rq);
> > +		return BLK_STS_AGAIN;
> > +	}
> > +
> > +	*new_cookie = request_to_qc_t(hctx, rq);
> > +
> > +	ret = q->mq_ops->queue_rq(hctx, &bd);
> > +
> > +	return ret;
>
> 	return q->mq_ops->queue_rq(hctx, &bd);
>
> and kill 'ret', it's not used. But more importantly, who puts the

OK.

> driver tag and the budget if we get != OK for ->queue_rq()?

For the budget, the current protocol is that the driver is responsible
for the release once .queue_rq is called, see scsi_mq_queue_rq() and the
comment in blk_mq_do_dispatch_sched().

For the driver tag, it is released in blk_mq_free_request(), which is
done in the failure handling of != OK.

--
Ming
Re: [PATCH V4 1/3] blk-mq: move actual issue into one helper
On Mon, Jan 15, 2018 at 12:15:47PM -0500, Mike Snitzer wrote:
> On Mon, Jan 15 2018 at 11:58am -0500,
> Ming Lei wrote:
>
> > No functional change, just to clean up code a bit, so that the
> > following change of using direct issue for
> > blk_mq_request_bypass_insert() which is needed by DM can be easier to
> > do.
> >
> > Signed-off-by: Ming Lei
> > ---
> >  block/blk-mq.c | 39 +++
> >  1 file changed, 27 insertions(+), 12 deletions(-)
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index edb1291a42c5..bf8d6651f40e 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1696,15 +1696,37 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq)
> >  	return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
> >  }
> >
> > -static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> > -					struct request *rq,
> > -					blk_qc_t *cookie)
> > +static blk_status_t __blk_mq_issue_req(struct blk_mq_hw_ctx *hctx,
> > +				       struct request *rq,
> > +				       blk_qc_t *new_cookie)
> >  {
> > +	blk_status_t ret;
> >  	struct request_queue *q = rq->q;
> >  	struct blk_mq_queue_data bd = {
> >  		.rq = rq,
> >  		.last = true,
> >  	};
> > +
> > +	if (!blk_mq_get_driver_tag(rq, NULL, false))
> > +		return BLK_STS_AGAIN;
> > +
> > +	if (!blk_mq_get_dispatch_budget(hctx)) {
> > +		blk_mq_put_driver_tag(rq);
> > +		return BLK_STS_AGAIN;
> > +	}
> > +
> > +	*new_cookie = request_to_qc_t(hctx, rq);
> > +
> > +	ret = q->mq_ops->queue_rq(hctx, &bd);
> > +
> > +	return ret;
> > +}
> > +
> > +static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> > +					struct request *rq,
> > +					blk_qc_t *cookie)
> > +{
> > +	struct request_queue *q = rq->q;
> >  	blk_qc_t new_cookie;
> >  	blk_status_t ret;
> >  	bool run_queue = true;
> > @@ -1718,22 +1740,15 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> >  	if (q->elevator)
> >  		goto insert;
> >
> > -	if (!blk_mq_get_driver_tag(rq, NULL, false))
> > +	ret = __blk_mq_issue_req(hctx, rq, &new_cookie);
> > +	if (ret == BLK_STS_AGAIN)
> >  		goto insert;
>
> You're (ab)using BLK_STS_AGAIN as a means to an end...
> But what if in the future q->mq_ops->queue_rq() returns BLK_STS_AGAIN?

Yeah, but for now only dio uses that, so I'll let Jens decide.

--
Ming
Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assigned to irq vector
On Mon, Jan 15, 2018 at 06:43:47PM +0100, Thomas Gleixner wrote:
> On Tue, 16 Jan 2018, Ming Lei wrote:
> > These two patches fix the IO hang issue reported by Laurence.
> >
> > 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> > may cause one irq vector to be assigned only offline CPUs, and then
> > this vector can't handle irq any more.
> >
> > The 1st patch moves irq vector spreading into one function, and
> > prepares for the fix done in the 2nd patch.
> >
> > The 2nd patch fixes the issue by trying to make sure online CPUs are
> > assigned to each irq vector.
>
> Which means it's completely undoing the intent and mechanism of managed
> interrupts. Not going to happen.

As I replied in a previous mail, some irq vectors may end up with only
offline CPUs assigned to them after we spread vectors across all possible
CPUs, some of which are not present.

> Which driver is that which abuses managed interrupts and does not keep
> its queues properly sorted on cpu hotplug?

It isn't related to a particular driver/device; I can trigger this issue
easily on NVMe, not just HPSA.

Thanks,
Ming
Re: [PATCH V4 3/3] blk-mq: issue request directly for blk_insert_cloned_request
On Mon, Jan 15 2018 at 11:58am -0500,
Ming Lei wrote:

> blk_insert_cloned_request() is called in the fast path of the dm-rq
> driver, and in this function we append the request to
> hctx->dispatch_list of the underlying queue directly.
>
> 1) This way isn't efficient enough because the hctx lock is always
>    required.
>
> 2) With blk_insert_cloned_request(), we bypass the underlying queue's
>    IO scheduler totally, and depend on the DM rq driver to do IO
>    scheduling completely.  But the DM rq driver can't get the
>    underlying queue's dispatch feedback at all, and this information is
>    extremely useful for IO merging.  Without it, IO merging basically
>    can't be done by blk-mq, which causes very bad sequential IO
>    performance.
>
> This patch makes use of blk_mq_try_issue_directly() to dispatch requests
> to the underlying queue, provides the dispatch result to dm-rq and
> blk-mq, and improves the above situations very much.
>
> With this patch, sequential IO is improved by 3X in my test over
> mpath/virtio-scsi.
>
> Signed-off-by: Ming Lei
> ---
>  block/blk-mq.c | 29 ++---
>  1 file changed, 26 insertions(+), 3 deletions(-)

Hey Ming,

I also just noticed your V4 of this patch only includes the
block/blk-mq.c changes.  You're missing:

 block/blk-core.c   |  3 +--
 block/blk-mq.h     |  3 +++
 drivers/md/dm-rq.c | 19 +---

Please let Jens know if you're OK with my V4, tagged here:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/tag/?h=for-block-4.16/dm-changes-2

And also detailed in this message from earlier in this thread:
https://marc.info/?l=linux-block=151603824725438=2

Or please generate V5 of your series.  Hopefully it'd include the header
I revised for this 3/3 patch, see:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-block-4.16/dm-changes-2=d86beab5712a8f18123011487dee797a1e3a07e1

We also need to address the issues Jens noticed and I looked at a bit
closer:
https://marc.info/?l=linux-block=151604528127707=2

(those issues might need fixing first, and marked for stable@, and the
series rebased on top of it?)

Thanks!
Mike
Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assigned to irq vector
On Mon, Jan 15, 2018 at 09:40:36AM -0800, Christoph Hellwig wrote:
> On Tue, Jan 16, 2018 at 12:03:43AM +0800, Ming Lei wrote:
> > Hi,
> >
> > These two patches fix the IO hang issue reported by Laurence.
> >
> > 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> > may cause one irq vector to be assigned only offline CPUs, and then
> > this vector can't handle irq any more.
>
> Well, that very much was the intention of managed interrupts.  Why
> does the device raise an interrupt for a queue that has no online
> cpu assigned to it?

It is because of irq_create_affinity_masks(): once irq vectors are spread
across possible CPUs, some of which are offline, only offline CPUs may be
assigned to a vector.

For example of HPSA, there are 8 irq vectors in this device, and the
system supports at most 32 CPUs, but only 16 are present (0-15) after
booting. We should assign at least one online CPU for handling each irq
vector for HPSA, but:

1) before commit 84676c1f21:
	irq 25, cpu list 0
	irq 26, cpu list 2
	irq 27, cpu list 4
	irq 28, cpu list 6
	irq 29, cpu list 8
	irq 30, cpu list 10
	irq 31, cpu list 12
	irq 32, cpu list 14
	irq 33, cpu list 1
	irq 34, cpu list 3
	irq 35, cpu list 5
	irq 36, cpu list 7
	irq 37, cpu list 9
	irq 38, cpu list 11
	irq 39, cpu list 13
	irq 40, cpu list 15

2) after commit 84676c1f21:
	irq 25, cpu list 0, 2
	irq 26, cpu list 4, 6
	irq 27, cpu list 8, 10
	irq 28, cpu list 12, 14
	irq 29, cpu list 16, 18
	irq 30, cpu list 20, 22
	irq 31, cpu list 24, 26
	irq 32, cpu list 28, 30
	irq 33, cpu list 1, 3
	irq 34, cpu list 5, 7
	irq 35, cpu list 9, 11
	irq 36, cpu list 13, 15
	irq 37, cpu list 17, 19
	irq 38, cpu list 21, 23
	irq 39, cpu list 25, 27
	irq 40, cpu list 29, 31

And vectors 29-32 and 37-40 are assigned only offline CPUs.

--
Ming
[LSF/MM TOPIC] blk-mq priority based hctx selection
For the storage track, I would like to propose a topic on differentiated
blk-mq hardware contexts.

Today, blk-mq considers all hardware contexts equal, and they are
selected based on the software's CPU context. There are use cases that
benefit from hardware context selection criteria beyond which CPU a
process happens to be running on.

One example is exclusive polling for latency-sensitive use cases. Mixing
polled and non-polled requests in the same context loses part of the
benefit when interrupts unnecessarily occur, and the coalescing tricks
used to mitigate this have undesirable side effects during times when
HIPRI commands are not issued.

Another example is hardware priority queues, where not all command queues
the hardware provides are equal to one another. Many newer storage
controllers provide such queues with different QoS guarantees, and Linux
currently does not make use of this feature.

The talk would discuss potential new blk-mq APIs needed for a block
driver to register its different priority queues, changes to blk-mq hwctx
selection, and implications for low-level drivers that utilize IRQ
affinity to set up their queue mappings.

Thanks,
Keith
Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready
On Mon, Jan 15 2018 at  6:10P -0500,
Mike Snitzer wrote:

> On Mon, Jan 15 2018 at  5:51pm -0500,
> Bart Van Assche wrote:
>
> > On Mon, 2018-01-15 at 17:15 -0500, Mike Snitzer wrote:
> > > sysfs write op calls kernfs_fop_write which takes:
> > > of->mutex then kn->count#213 (no idea what that is)
> > > then q->sysfs_lock (via queue_attr_store)
> > >
> > > vs
> > >
> > > blk_unregister_queue takes:
> > > q->sysfs_lock then
> > > kernfs_mutex (via kernfs_remove)
> > > seems lockdep thinks "kernfs_mutex" is "kn->count#213"?
> > >
> > > Feels like lockdep code in fs/kernfs/file.c and fs/kernfs/dir.c is
> > > triggering false positives.. because these seem like different kernfs
> > > locks yet they are reported as "kn->count#213".
> > >
> > > Certainly feeling out of my depth with kernfs's locking though.
> >
> > Hello Mike,
> >
> > I don't think that this is a false positive but rather the following
> > traditional pattern of a potential deadlock involving sysfs attributes:
> > * One context obtains a mutex from inside a sysfs attribute method:
> >   queue_attr_store() obtains q->sysfs_lock.
> > * Another context removes a sysfs attribute while holding a mutex:
> >   blk_unregister_queue() removes the queue sysfs attributes while
> >   holding q->sysfs_lock.
> >
> > This can result in a real deadlock because the code that removes sysfs
> > objects waits until all ongoing attribute callbacks have finished.
> >
> > Since commit 667257e8b298 ("block: properly protect the 'queue' kobj in
> > blk_unregister_queue") modified blk_unregister_queue() such that
> > q->sysfs_lock is held around the kobject_del(&q->kobj) call, I think
> > this is a regression introduced by that commit.
>
> Sure, of course it is a regression.
>
> Aside from moving the mutex_unlock(&q->sysfs_lock) above the
> kobject_del(&q->kobj) I don't know how to fix it.
>
> Though, realistically that'd be an adequate fix (given the way the code
> was before).

Any chance you apply this and re-run your srp_test that triggered the
lockdep splat?

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 4a6a40ffd78e..c50e08e9bf17 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -952,10 +952,10 @@ void blk_unregister_queue(struct gendisk *disk)
 	if (q->request_fn || (q->mq_ops && q->elevator))
 		elv_unregister_queue(q);
 
+	mutex_unlock(&q->sysfs_lock);
+
 	kobject_uevent(&q->kobj, KOBJ_REMOVE);
 	kobject_del(&q->kobj);
 	blk_trace_remove_sysfs(disk_to_dev(disk));
 	kobject_put(&disk_to_dev(disk)->kobj);
-
-	mutex_unlock(&q->sysfs_lock);
 }
Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready
On Mon, Jan 15 2018 at  5:51pm -0500,
Bart Van Assche wrote:

> On Mon, 2018-01-15 at 17:15 -0500, Mike Snitzer wrote:
> > sysfs write op calls kernfs_fop_write which takes:
> > of->mutex then kn->count#213 (no idea what that is)
> > then q->sysfs_lock (via queue_attr_store)
> >
> > vs
> >
> > blk_unregister_queue takes:
> > q->sysfs_lock then
> > kernfs_mutex (via kernfs_remove)
> > seems lockdep thinks "kernfs_mutex" is "kn->count#213"?
> >
> > Feels like lockdep code in fs/kernfs/file.c and fs/kernfs/dir.c is
> > triggering false positives.. because these seem like different kernfs
> > locks yet they are reported as "kn->count#213".
> >
> > Certainly feeling out of my depth with kernfs's locking though.
>
> Hello Mike,
>
> I don't think that this is a false positive but rather the following
> traditional pattern of a potential deadlock involving sysfs attributes:
> * One context obtains a mutex from inside a sysfs attribute method:
>   queue_attr_store() obtains q->sysfs_lock.
> * Another context removes a sysfs attribute while holding a mutex:
>   blk_unregister_queue() removes the queue sysfs attributes while
>   holding q->sysfs_lock.
>
> This can result in a real deadlock because the code that removes sysfs
> objects waits until all ongoing attribute callbacks have finished.
>
> Since commit 667257e8b298 ("block: properly protect the 'queue' kobj in
> blk_unregister_queue") modified blk_unregister_queue() such that
> q->sysfs_lock is held around the kobject_del(&q->kobj) call, I think
> this is a regression introduced by that commit.

Sure, of course it is a regression.

Aside from moving the mutex_unlock(&q->sysfs_lock) above the
kobject_del(&q->kobj) I don't know how to fix it.

Though, realistically that'd be an adequate fix (given the way the code
was before).

Mike
Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready
On Mon, 2018-01-15 at 17:15 -0500, Mike Snitzer wrote:
> sysfs write op calls kernfs_fop_write which takes:
> of->mutex then kn->count#213 (no idea what that is)
> then q->sysfs_lock (via queue_attr_store)
>
> vs
>
> blk_unregister_queue takes:
> q->sysfs_lock then
> kernfs_mutex (via kernfs_remove)
> seems lockdep thinks "kernfs_mutex" is "kn->count#213"?
>
> Feels like lockdep code in fs/kernfs/file.c and fs/kernfs/dir.c is
> triggering false positives.. because these seem like different kernfs
> locks yet they are reported as "kn->count#213".
>
> Certainly feeling out of my depth with kernfs's locking though.

Hello Mike,

I don't think that this is a false positive but rather the following
traditional pattern of a potential deadlock involving sysfs attributes:
* One context obtains a mutex from inside a sysfs attribute method:
  queue_attr_store() obtains q->sysfs_lock.
* Another context removes a sysfs attribute while holding a mutex:
  blk_unregister_queue() removes the queue sysfs attributes while holding
  q->sysfs_lock.

This can result in a real deadlock because the code that removes sysfs
objects waits until all ongoing attribute callbacks have finished.

Since commit 667257e8b298 ("block: properly protect the 'queue' kobj in
blk_unregister_queue") modified blk_unregister_queue() such that
q->sysfs_lock is held around the kobject_del(&q->kobj) call, I think this
is a regression introduced by that commit.

Bart.
Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready
On Mon, Jan 15 2018 at 12:16pm -0500, Bart Van Assche wrote: > On Fri, 2018-01-12 at 10:06 -0500, Mike Snitzer wrote: > > I'm submitting this v5 with more feeling now ;) > > Hello Mike, > > Have these patches been tested with lockdep enabled? The following appeared in > the kernel log after I started testing Jens' for-next tree of this > morning: > > == > WARNING: possible circular locking dependency detected > 4.15.0-rc7-dbg+ #1 Not tainted > -- > 02-mq/1211 is trying to acquire lock: > (&q->sysfs_lock){+.+.}, at: [<8b65bdad>] queue_attr_store+0x35/0x80 > > but task is already holding lock: > (kn->count#213){}, at: [<7a18ad18>] kernfs_fop_write+0xe5/0x190 > > which lock already depends on the new lock. > > > the existing dependency chain (in reverse order) is: > > -> #1 (kn->count#213){}: >kernfs_remove+0x1a/0x30 >kobject_del.part.3+0xe/0x40 >blk_unregister_queue+0xa7/0xe0 >del_gendisk+0x12f/0x260 >sd_remove+0x58/0xc0 [sd_mod] >device_release_driver_internal+0x15a/0x220 >bus_remove_device+0xf4/0x170 >device_del+0x12f/0x330 >__scsi_remove_device+0xef/0x120 [scsi_mod] >scsi_forget_host+0x1b/0x60 [scsi_mod] >scsi_remove_host+0x6f/0x110 [scsi_mod] >0xc09ed6e4 >process_one_work+0x21c/0x6d0 >worker_thread+0x35/0x380 >kthread+0x117/0x130 >ret_from_fork+0x24/0x30 > > -> #0 (&q->sysfs_lock){+.+.}: >__mutex_lock+0x6c/0x9e0 >queue_attr_store+0x35/0x80 >kernfs_fop_write+0x109/0x190 >__vfs_write+0x1e/0x130 >vfs_write+0xb9/0x1b0 >SyS_write+0x40/0xa0 >entry_SYSCALL_64_fastpath+0x23/0x9a > > other info that might help us debug this: > > Possible unsafe locking scenario: > >CPU0CPU1 > > lock(kn->count#213); >lock(&q->sysfs_lock); >lock(kn->count#213); > lock(&q->sysfs_lock); > > *** DEADLOCK *** > > 3 locks held by 02-mq/1211: > #0: (sb_writers#6){.+.+}, at: [ ] vfs_write+0x17f/0x1b0 > #1: (&of->mutex){+.+.}, at: [ ] kernfs_fop_write+0xdd/0x190 > #2: (kn->count#213){}, at: [<7a18ad18>] > kernfs_fop_write+0xe5/0x190 sysfs write op calls kernfs_fop_write which takes: of->mutex then
kn->count#213 (no idea what that is) then q->sysfs_lock (via queue_attr_store) vs blk_unregister_queue takes: q->sysfs_lock then kernfs_mutex (via kernfs_remove) seems lockdep thinks "kernfs_mutex" is "kn->count#213"? Feels like lockdep code in fs/kernfs/file.c and fs/kernfs/dir.c is triggering false positives... because these seem like different kernfs locks yet they are reported as "kn->count#213". Certainly feeling out of my depth with kernfs's locking though. Mike
Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready
On Mon, Jan 15 2018 at 12:51pm -0500, Bart Van Assche wrote: > On Mon, 2018-01-15 at 12:48 -0500, Mike Snitzer wrote: > > Do I need to do something more to enable lockdep aside from set > > CONFIG_LOCKDEP_SUPPORT=y ? > > Hello Mike, > > I think you also need to set CONFIG_PROVE_LOCKING=y. Ah ok, was missing that, thanks. Mike
Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assigned to irq vector
On Mon, 2018-01-15 at 18:43 +0100, Thomas Gleixner wrote: > On Tue, 16 Jan 2018, Ming Lei wrote: > > These two patches fixes IO hang issue reported by Laurence. > > > > 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs") > > may cause one irq vector assigned to all offline CPUs, then this > > vector > > can't handle irq any more. > > > > The 1st patch moves irq vectors spread into one function, and > > prepares > > for the fix done in 2nd patch. > > > > The 2nd patch fixes the issue by trying to make sure online CPUs > > assigned > > to irq vector. > > Which means it's completely undoing the intent and mechanism of > managed > interrupts. Not going to happen. > > Which driver is that which abuses managed interrupts and does not > keep its > queues properly sorted on cpu hotplug? > > Thanks, > > tglx Hello Thomas, The servers I am using are all booting off hpsa (SmartArray). The system would hang on boot with the stack below. So it is seen when booting off the hpsa driver, and not seen by Mike when booting off a server not using hpsa. It is also not seen when reverting the patch I called out. Putting that patch back into Mike/Jens' combined tree and adding Ming's patch seems to fix this issue now. I can boot. I just did a quick sanity boot and check, not any in-depth testing right now. It's not code I am at all familiar with that Ming has changed to make it work, so I defer to Ming to explain in depth. [ 246.751050] INFO: task systemd-udevd:411 blocked for more than 120 seconds. [ 246.791852] Tainted: G I 4.15.0-rc4.block.dm.4.16+ #1 [ 246.830650] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 246.874637] systemd-udevd D0 411408 0x8004 [ 246.904934] Call Trace: [ 246.918191] ? __schedule+0x28d/0x870 [ 246.937643] ? _cond_resched+0x15/0x30 [ 246.958222] schedule+0x32/0x80 [ 246.975424] async_synchronize_cookie_domain+0x8b/0x140 [ 247.004452] ?
remove_wait_queue+0x60/0x60 [ 247.027335] do_init_module+0xbe/0x219 [ 247.048022] load_module+0x21d6/0x2910 [ 247.069436] ? m_show+0x1c0/0x1c0 [ 247.087999] SYSC_finit_module+0x94/0xe0 [ 247.110392] entry_SYSCALL_64_fastpath+0x1a/0x7d [ 247.136669] RIP: 0033:0x7f84049287f9 [ 247.156112] RSP: 002b:7ffd13199ab8 EFLAGS: 0246 ORIG_RAX: 0139 [ 247.196883] RAX: ffda RBX: 55b712b59e80 RCX: 7f84049287f9 [ 247.237989] RDX: RSI: 7f8405245099 RDI: 0008 [ 247.279105] RBP: 7f8404bf2760 R08: R09: 55b712b45760 [ 247.320005] R10: 0008 R11: 0246 R12: 0020 [ 247.360625] R13: 7f8404bf2818 R14: 0050 R15: 7f8404bf27b8 [ 247.401062] INFO: task scsi_eh_0:471 blocked for more than 120 seconds. [ 247.438161] Tainted: G I 4.15.0- rc4.block.dm.4.16+ #1 [ 247.476640] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 247.520700] scsi_eh_0 D0 471 2 0x8000 [ 247.551339] Call Trace: [ 247.564360] ? __schedule+0x28d/0x870 [ 247.584720] schedule+0x32/0x80 [ 247.601294] hpsa_eh_device_reset_handler+0x68c/0x700 [hpsa] [ 247.633358] ? remove_wait_queue+0x60/0x60 [ 247.656345] scsi_try_bus_device_reset+0x27/0x40 [ 247.682424] scsi_eh_ready_devs+0x53f/0xe20 [ 247.706467] ? __pm_runtime_resume+0x55/0x70 [ 247.730327] scsi_error_handler+0x434/0x5e0 [ 247.754387] ? __schedule+0x295/0x870 [ 247.775420] kthread+0xf5/0x130 [ 247.793461] ? scsi_eh_get_sense+0x240/0x240 [ 247.818008] ? kthread_associate_blkcg+0x90/0x90 [ 247.844759] ret_from_fork+0x1f/0x30 [ 247.865440] INFO: task scsi_id:488 blocked for more than 120 seconds. [ 247.901112] Tainted: G I 4.15.0- rc4.block.dm.4.16+ #1 [ 247.938743] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 247.981092] scsi_id D0 488 1 0x0004 [ 248.010535] Call Trace: [ 248.023567] ? __schedule+0x28d/0x870 [ 248.044236] ? __switch_to+0x1f5/0x460 [ 248.065776] schedule+0x32/0x80 [ 248.084238] schedule_timeout+0x1d4/0x2f0 [ 248.106184] wait_for_completion+0x123/0x190 [ 248.130759] ? 
wake_up_q+0x70/0x70 [ 248.150295] flush_work+0x119/0x1a0 [ 248.169238] ? wake_up_worker+0x30/0x30 [ 248.189670] __cancel_work_timer+0x103/0x190 [ 248.213751] ? kobj_lookup+0x10b/0x160 [ 248.235441] disk_block_events+0x6f/0x90 [ 248.257820] __blkdev_get+0x6a/0x480 [ 248.278770] ? bd_acquire+0xd0/0xd0 [ 248.298438] blkdev_get+0x1a5/0x300 [ 248.316587] ? bd_acquire+0xd0/0xd0 [ 248.334814] do_dentry_open+0x202/0x320 [ 248.354372] ? security_inode_permission+0x3c/0x50 [ 248.378818] path_openat+0x537/0x12c0 [ 248.397386] ? vm_insert_page+0x1e0/0x1f0 [ 248.417664] ? vvar_fault+0x75/0x140 [ 248.435811] do_filp_open+0x91/0x100 [ 248.454061] do_sys_open+0x126/0x210 [ 248.472462] entry_SYSCALL_64_fastpath+0x1a/0x7d [
Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready
On Mon, 2018-01-15 at 12:48 -0500, Mike Snitzer wrote: > Do I need to do something more to enable lockdep aside from set > CONFIG_LOCKDEP_SUPPORT=y ? Hello Mike, I think you also need to set CONFIG_PROVE_LOCKING=y. Bart.
Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready
On Mon, Jan 15 2018 at 12:36pm -0500, Bart Van Assche wrote: > On Mon, 2018-01-15 at 12:29 -0500, Mike Snitzer wrote: > > So you replied to v5, I emailed a v6 out for the relevant patch. Just > > want to make sure you're testing with either Jens' latest tree or are > > using my v6 that fixed blk_mq_unregister_dev() to require caller holds > > q->sysfs_lock ? > > Hello Mike, > > In my test I was using Jens' latest for-next tree (commit 563877ae7dae). OK, that includes his latest for-4.16/block (commit c100ec49fdd8) so I'll take a closer look at this lockdep splat. Do I need to do something more to enable lockdep aside from set CONFIG_LOCKDEP_SUPPORT=y? Thanks, Mike
Re: [PATCH V4 0/3] blk-mq: issue request directly for blk_insert_cloned_request
On Mon, Jan 15 2018 at 11:58am -0500, Ming Lei wrote: > Hi Guys, > > The 3 patches change the blk-mq part of blk_insert_cloned_request(), > in which we switch to blk_mq_try_issue_directly(), so that both dm-rq > and blk-mq can get the dispatch result of underlying queue, and with > this information, blk-mq can handle IO merge much better, then > sequential I/O performance is improved much. > > In my dm-mpath over virtio-scsi test, this whole patchset improves > sequential IO by 3X ~ 5X. > > V4: > - remove dm patches which are in DM tree already > - cleanup __blk_mq_issue_req as suggested by Jens > Ming, You dropped the header cleanups that I did in v3 ("blk-mq: issue request directly for blk_insert_cloned_request" being the one header I care about being updated). I also worked in parallel on my own v4 to address Jens' dislike for v3's 3 returns. But I skinned the cat a different way, by dropping your first patch that introduces the __blk_mq_issue_req helper, please see these 2 commits: https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=block-4.16&id=40f8947784128bb83dc5f7a6aed7ed230222f675 https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=block-4.16&id=f641015f42f41df75220313ee62e8241f2fd I think it makes the changes more obvious (not spread across 2 methods and doesn't require use of BLK_STS_AGAIN). Happy to yield to Jens to decide which he prefers. Jens, if you'd like to pick my variant of v4 up they are here, thanks!
The following changes since commit c100ec49fdd836ff8a17c7bfcc7611d2ee2b: dm: fix incomplete request_queue initialization (2018-01-15 08:54:32 -0700) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git tags/for-block-4.16/dm-changes-2 for you to fetch changes up to d86beab5712a8f18123011487dee797a1e3a07e1: blk-mq: issue request directly for blk_insert_cloned_request (2018-01-15 12:40:44 -0500) - Ming's blk-mq improvements to blk_insert_cloned_request(), which is used exclusively by request-based DM's blk-mq mode, that enable substantial dm-mpath sequential IO performance improvements. Ming Lei (2): blk-mq: return dispatch result from blk_mq_try_issue_directly blk-mq: issue request directly for blk_insert_cloned_request block/blk-core.c | 3 +-- block/blk-mq.c | 65 +- block/blk-mq.h | 3 +++ drivers/md/dm-rq.c | 19 +--- 4 files changed, 70 insertions(+), 20 deletions(-)
Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assigned to irq vector
On Tue, 16 Jan 2018, Ming Lei wrote: > These two patches fixes IO hang issue reported by Laurence. > > 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs") > may cause one irq vector assigned to all offline CPUs, then this vector > can't handle irq any more. > > The 1st patch moves irq vectors spread into one function, and prepares > for the fix done in 2nd patch. > > The 2nd patch fixes the issue by trying to make sure online CPUs assigned > to irq vector. Which means it's completely undoing the intent and mechanism of managed interrupts. Not going to happen. Which driver is that which abuses managed interrupts and does not keep its queues properly sorted on cpu hotplug? Thanks, tglx
Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assigned to irq vector
On Tue, Jan 16, 2018 at 12:03:43AM +0800, Ming Lei wrote: > Hi, > > These two patches fixes IO hang issue reported by Laurence. > > 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs") > may cause one irq vector assigned to all offline CPUs, then this vector > can't handle irq any more. Well, that very much was the intention of managed interrupts. Why does the device raise an interrupt for a queue that has no online cpu assigned to it?
Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready
On Mon, 2018-01-15 at 12:29 -0500, Mike Snitzer wrote: > So you replied to v5, I emailed a v6 out for the relevant patch. Just > want to make sure you're testing with either Jens' latest tree or are > using my v6 that fixed blk_mq_unregister_dev() to require caller holds > q->sysfs_lock ? Hello Mike, In my test I was using Jens' latest for-next tree (commit 563877ae7dae). Bart.
Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready
On Mon, Jan 15 2018 at 12:16pm -0500, Bart Van Assche wrote: > On Fri, 2018-01-12 at 10:06 -0500, Mike Snitzer wrote: > > I'm submitting this v5 with more feeling now ;) > > Hello Mike, > > Have these patches been tested with lockdep enabled? The following appeared in > the kernel log after I started testing Jens' for-next tree of this > morning: So you replied to v5, I emailed a v6 out for the relevant patch. Just want to make sure you're testing with either Jens' latest tree or are using my v6 that fixed blk_mq_unregister_dev() to require caller holds q->sysfs_lock? With the latest code that Jens merged, I do have lockdep enabled in my kernel... but haven't seen any issues on the 2 testbeds I've been hammering the changes on. Ming was concerned about the potential for a lockdep splat (said he tried expanding the scope of q->sysfs_lock in blk_unregister_queue before and in doing so got lockdep to trigger). Once you respond with what it is you're testing, and if it is latest code, I'll look closer at what you've provided and see if it is real or that lockdep just needs some training. Thanks, Mike
Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready
On Fri, 2018-01-12 at 10:06 -0500, Mike Snitzer wrote: > I'm submitting this v5 with more feeling now ;) Hello Mike, Have these patches been tested with lockdep enabled? The following appeared in the kernel log after I started testing Jens' for-next tree of this morning: == WARNING: possible circular locking dependency detected 4.15.0-rc7-dbg+ #1 Not tainted -- 02-mq/1211 is trying to acquire lock: (&q->sysfs_lock){+.+.}, at: [<8b65bdad>] queue_attr_store+0x35/0x80 but task is already holding lock: (kn->count#213){}, at: [<7a18ad18>] kernfs_fop_write+0xe5/0x190 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (kn->count#213){}: kernfs_remove+0x1a/0x30 kobject_del.part.3+0xe/0x40 blk_unregister_queue+0xa7/0xe0 del_gendisk+0x12f/0x260 sd_remove+0x58/0xc0 [sd_mod] device_release_driver_internal+0x15a/0x220 bus_remove_device+0xf4/0x170 device_del+0x12f/0x330 __scsi_remove_device+0xef/0x120 [scsi_mod] scsi_forget_host+0x1b/0x60 [scsi_mod] scsi_remove_host+0x6f/0x110 [scsi_mod] 0xc09ed6e4 process_one_work+0x21c/0x6d0 worker_thread+0x35/0x380 kthread+0x117/0x130 ret_from_fork+0x24/0x30 -> #0 (&q->sysfs_lock){+.+.}: __mutex_lock+0x6c/0x9e0 queue_attr_store+0x35/0x80 kernfs_fop_write+0x109/0x190 __vfs_write+0x1e/0x130 vfs_write+0xb9/0x1b0 SyS_write+0x40/0xa0 entry_SYSCALL_64_fastpath+0x23/0x9a other info that might help us debug this: Possible unsafe locking scenario: CPU0CPU1 lock(kn->count#213); lock(&q->sysfs_lock); lock(kn->count#213); lock(&q->sysfs_lock); *** DEADLOCK *** 3 locks held by 02-mq/1211: #0: (sb_writers#6){.+.+}, at: [] vfs_write+0x17f/0x1b0 #1: (&of->mutex){+.+.}, at: [ ] kernfs_fop_write+0xdd/0x190 #2: (kn->count#213){}, at: [<7a18ad18>] kernfs_fop_write+0xe5/0x190 stack backtrace: CPU: 2 PID: 1211 Comm: 02-mq Not tainted 4.15.0-rc7-dbg+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014 Call Trace: dump_stack+0x85/0xbf print_circular_bug.isra.35+0x1d7/0x1e4
__lock_acquire+0x12af/0x1370 ? lock_acquire+0xa8/0x230 lock_acquire+0xa8/0x230 ? queue_attr_store+0x35/0x80 ? queue_attr_store+0x35/0x80 __mutex_lock+0x6c/0x9e0 ? queue_attr_store+0x35/0x80 ? kernfs_fop_write+0xdd/0x190 ? __mutex_lock+0x6c/0x9e0 ? __mutex_lock+0x3d0/0x9e0 ? lock_acquire+0xa8/0x230 ? __lock_is_held+0x55/0x90 ? queue_attr_store+0x35/0x80 queue_attr_store+0x35/0x80 kernfs_fop_write+0x109/0x190 __vfs_write+0x1e/0x130 ? rcu_read_lock_sched_held+0x66/0x70 ? rcu_sync_lockdep_assert+0x23/0x50 ? __sb_start_write+0x152/0x1f0 ? __sb_start_write+0x168/0x1f0 vfs_write+0xb9/0x1b0 SyS_write+0x40/0xa0 entry_SYSCALL_64_fastpath+0x23/0x9a RIP: 0033:0x7fab8be8d054 RSP: 002b:7ffd39f0def8 EFLAGS: 0246 Thanks, Bart.
Re: [PATCH V4 1/3] blk-mq: move actual issue into one helper
On Mon, Jan 15 2018 at 11:58am -0500, Ming Lei wrote: > No functional change, just to clean up code a bit, so that the following > change of using direct issue for blk_mq_request_bypass_insert() which is > needed by DM can be easier to do. > > Signed-off-by: Ming Lei > --- > block/blk-mq.c | 39 +++ > 1 file changed, 27 insertions(+), 12 deletions(-) > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index edb1291a42c5..bf8d6651f40e 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -1696,15 +1696,37 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx > *hctx, struct request *rq) > return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true); > } > > -static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, > - struct request *rq, > - blk_qc_t *cookie) > +static blk_status_t __blk_mq_issue_req(struct blk_mq_hw_ctx *hctx, > +struct request *rq, > +blk_qc_t *new_cookie) > { > + blk_status_t ret; > struct request_queue *q = rq->q; > struct blk_mq_queue_data bd = { > .rq = rq, > .last = true, > }; > + > + if (!blk_mq_get_driver_tag(rq, NULL, false)) > + return BLK_STS_AGAIN; > + > + if (!blk_mq_get_dispatch_budget(hctx)) { > + blk_mq_put_driver_tag(rq); > + return BLK_STS_AGAIN; > + } > + > + *new_cookie = request_to_qc_t(hctx, rq); > + > + ret = q->mq_ops->queue_rq(hctx, &bd); > + > + return ret; > +} > + > +static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, > + struct request *rq, > + blk_qc_t *cookie) > +{ > + struct request_queue *q = rq->q; > blk_qc_t new_cookie; > blk_status_t ret; > bool run_queue = true; > @@ -1718,22 +1740,15 @@ static void __blk_mq_try_issue_directly(struct > blk_mq_hw_ctx *hctx, > if (q->elevator) > goto insert; > > - if (!blk_mq_get_driver_tag(rq, NULL, false)) > + ret = __blk_mq_issue_req(hctx, rq, &new_cookie); > + if (ret == BLK_STS_AGAIN) > goto insert; You're (ab)using BLK_STS_AGAIN as a means to an end... But what if in the future q->mq_ops->queue_rq() returns BLK_STS_AGAIN? Mike
Re: [GIT PULL] block changes to improve device mapper for 4.16
On Mon, Jan 15, 2018 at 08:33:41AM -0700, Jens Axboe wrote: > On 1/14/18 7:59 PM, Mike Snitzer wrote: > > Hi Jens, > > > > I prepared this pull request in the hope that it may help you review and > > stage these changes for 4.16. > > > > I went over Ming's changes again to refine the headers and code comments > > for clarity to help ease review and inclusion. > > > > I've done extensive testing of the changes in this pull request in > > combination with all the dm-4.16 changes I'll be sending to Linus, using > > this tree (which gets pulled into linux-next, I wanted some coverage): > > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-next > > I'll obviously drop these changes from dm's linux-next once you pick > > them up. > > > > FYI, this dm-4.16 commit addresses the concerns Bart raised last week > > against Ming's dm-mpath changes: > > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=7a9d12664a183c4a1e2fa86b0a9206006b95138e > > > > The following changes since commit bf9ae8c5325c0070d0ec81a849bba8d156f65993: > > > > blk-mq: fix bad clear of RQF_MQ_INFLIGHT in blk_mq_ct_ctx_init() > > (2018-01-14 10:46:24 -0700) > > > > are available in the Git repository at: > > > > git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git > > tags/for-block-4.16/dm > > > > for you to fetch changes up to 0e94dfab56071a7afd0a9de9b8f6a0535148fb36: > > > > blk-mq: issue request directly for blk_insert_cloned_request (2018-01-14 > > 13:00:02 -0500) > > > > Please pull, thanks! > > Mike > > > > > > - Small correctness fix in del_gendisk() if GENHD_FL_HIDDEN is used. > > > > - Cleanup blk_unregister_queue() to more precisely protect against > > concurrent sysfs changes, blk_mq_unregister_dev() now requires caller > > to hold q->sysfs_lock (blk_unregister_queue is only caller).
> > > > - Introduce add_disk() variant, add_disk_no_queue_reg(), that allows the > > gendisk to be registered but the associated disk->queue's > > blk_register_queue() is left for the driver to do once its > > request_queue is fully initialized. Fixes long-standing DM > > request_queue initialization issues. > > > > - Ming's blk-mq improvements to blk_insert_cloned_request(), which is > > used exclusively by request-based DM's blk-mq mode, that enable > > substantial dm-mpath sequential IO performance improvements. > > > > > > Mike Snitzer (4): > > block: only bdi_unregister() in del_gendisk() if !GENHD_FL_HIDDEN > > block: properly protect the 'queue' kobj in blk_unregister_queue > > block: allow gendisk's request_queue registration to be deferred > > dm: fix incomplete request_queue initialization > > > > Ming Lei (3): > > blk-mq: move actual issue into __blk_mq_issue_req helper > > I don't like this patch at all - it's a 10 line function (if that) > that ends up with three outputs, two of them hidden in passed > in pointers. On top of that, a function that is named > __blk_mq_issue_req() and returns bool, you would logically expect > a 'true' return to mean that it succeeded. This is the opposite. OK, __blk_mq_issue_req() has been cleaned up in V4, please consider it for V4.16. Thanks, Ming
[PATCH V4 3/3] blk-mq: issue request directly for blk_insert_cloned_request
blk_insert_cloned_request() is called in the fast path of the dm-rq driver, and in this function we append the request to hctx->dispatch_list of the underlying queue directly. 1) This way isn't efficient enough because the hctx lock is always required 2) With blk_insert_cloned_request(), we bypass the underlying queue's IO scheduler totally, and depend on the DM rq driver to do IO scheduling completely. But the DM rq driver can't get the underlying queue's dispatch feedback at all, and this information is extremely useful for IO merge. Without it, IO merge basically can't be done by blk-mq, which causes very bad sequential IO performance. This patch makes use of blk_mq_try_issue_directly() to dispatch the rq to the underlying queue and provides the dispatch result to dm-rq and blk-mq, and improves the above situations very much. With this patch, sequential IO is improved by 3X in my test over mpath/virtio-scsi. Signed-off-by: Ming Lei --- block/blk-mq.c | 29 ++--- 1 file changed, 26 insertions(+), 3 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 1654c9c284d8..ce3965f271e3 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1730,6 +1730,11 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, blk_qc_t new_cookie; blk_status_t ret = BLK_STS_OK; bool run_queue = true; + /* + * This function is called from blk_insert_cloned_request() if + * 'cookie' is NULL, and for dispatching this request only. + */ + bool dispatch_only = !cookie; /* RCU or SRCU read lock is needed before checking quiesced flag */ if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) { @@ -1737,10 +1742,17 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, goto insert; } - if (q->elevator) + if (q->elevator && !dispatch_only) goto insert; ret = __blk_mq_issue_req(hctx, rq, &new_cookie); + if (dispatch_only) { + if (ret == BLK_STS_AGAIN) + return BLK_STS_RESOURCE; + if (ret == BLK_STS_RESOURCE) + __blk_mq_requeue_request(rq); + return ret; + } if (ret == BLK_STS_AGAIN) goto insert; @@ -1763,8 +1775,11 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, } insert: - blk_mq_sched_insert_request(rq, false, run_queue, false, - hctx->flags & BLK_MQ_F_BLOCKING); + if (!dispatch_only) + blk_mq_sched_insert_request(rq, false, run_queue, false, + hctx->flags & BLK_MQ_F_BLOCKING); + else + blk_mq_request_bypass_insert(rq, run_queue); return ret; } @@ -1784,6 +1799,14 @@ static blk_status_t blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, return ret; } +blk_status_t blk_mq_request_direct_issue(struct request *rq) +{ + struct blk_mq_ctx *ctx = rq->mq_ctx; + struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu); + + return blk_mq_try_issue_directly(hctx, rq, NULL); +} + static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) { const int is_sync = op_is_sync(bio->bi_opf); -- 2.9.5
[PATCH V4 1/3] blk-mq: move actual issue into one helper
No functional change, just to clean up code a bit, so that the following change of using direct issue for blk_mq_request_bypass_insert(), which is needed by DM, can be easier to do. Signed-off-by: Ming Lei --- block/blk-mq.c | 39 +++ 1 file changed, 27 insertions(+), 12 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index edb1291a42c5..bf8d6651f40e 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1696,15 +1696,37 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq) return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true); } -static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, - struct request *rq, - blk_qc_t *cookie) +static blk_status_t __blk_mq_issue_req(struct blk_mq_hw_ctx *hctx, + struct request *rq, + blk_qc_t *new_cookie) { + blk_status_t ret; struct request_queue *q = rq->q; struct blk_mq_queue_data bd = { .rq = rq, .last = true, }; + + if (!blk_mq_get_driver_tag(rq, NULL, false)) + return BLK_STS_AGAIN; + + if (!blk_mq_get_dispatch_budget(hctx)) { + blk_mq_put_driver_tag(rq); + return BLK_STS_AGAIN; + } + + *new_cookie = request_to_qc_t(hctx, rq); + + ret = q->mq_ops->queue_rq(hctx, &bd); + + return ret; +} + +static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, + struct request *rq, + blk_qc_t *cookie) +{ + struct request_queue *q = rq->q; blk_qc_t new_cookie; blk_status_t ret; bool run_queue = true; @@ -1718,22 +1740,15 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, if (q->elevator) goto insert; - if (!blk_mq_get_driver_tag(rq, NULL, false)) + ret = __blk_mq_issue_req(hctx, rq, &new_cookie); + if (ret == BLK_STS_AGAIN) goto insert; - if (!blk_mq_get_dispatch_budget(hctx)) { - blk_mq_put_driver_tag(rq); - goto insert; - } - - new_cookie = request_to_qc_t(hctx, rq); - /* * For OK queue, we are done. For error, kill it. Any other * error (busy), just add it to our list as we previously * would have done */ - ret = q->mq_ops->queue_rq(hctx, &bd); switch (ret) { case BLK_STS_OK: *cookie = new_cookie; -- 2.9.5
[PATCH V4 2/3] blk-mq: return dispatch result to caller in blk_mq_try_issue_directly
In the following patch, we will use blk_mq_try_issue_directly() for DM to return the dispatch result, and DM needs this information to improve IO merge. Signed-off-by: Ming Lei --- block/blk-mq.c | 23 ++- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index bf8d6651f40e..1654c9c284d8 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1722,13 +1722,13 @@ static blk_status_t __blk_mq_issue_req(struct blk_mq_hw_ctx *hctx, return ret; } -static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, - struct request *rq, - blk_qc_t *cookie) +static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, + struct request *rq, + blk_qc_t *cookie) { struct request_queue *q = rq->q; blk_qc_t new_cookie; - blk_status_t ret; + blk_status_t ret = BLK_STS_OK; bool run_queue = true; /* RCU or SRCU read lock is needed before checking quiesced flag */ @@ -1752,31 +1752,36 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, switch (ret) { case BLK_STS_OK: *cookie = new_cookie; - return; + return ret; case BLK_STS_RESOURCE: __blk_mq_requeue_request(rq); goto insert; default: *cookie = BLK_QC_T_NONE; blk_mq_end_request(rq, ret); - return; + return ret; } insert: blk_mq_sched_insert_request(rq, false, run_queue, false, hctx->flags & BLK_MQ_F_BLOCKING); + return ret; } -static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, - struct request *rq, blk_qc_t *cookie) +static blk_status_t blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, + struct request *rq, + blk_qc_t *cookie) { int srcu_idx; + blk_status_t ret; might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING); hctx_lock(hctx, &srcu_idx); - __blk_mq_try_issue_directly(hctx, rq, cookie); + ret = __blk_mq_try_issue_directly(hctx, rq, cookie); hctx_unlock(hctx, srcu_idx); + + return ret; } static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio) -- 2.9.5
[PATCH V4 0/3] blk-mq: issue request directly for blk_insert_cloned_request
Hi Guys, The 3 patches change the blk-mq part of blk_insert_cloned_request(), in which we switch to blk_mq_try_issue_directly(), so that both dm-rq and blk-mq can get the dispatch result of the underlying queue, and with this information, blk-mq can handle IO merge much better, then sequential I/O performance is improved much. In my dm-mpath over virtio-scsi test, this whole patchset improves sequential IO by 3X ~ 5X. V4: - remove dm patches which are in DM tree already - cleanup __blk_mq_issue_req as suggested by Jens V3: - rebase on the latest for-4.16/block of block tree - add missed pg_init_all_paths() in patch 1, according to Bart's review V2: - drop 'dm-mpath: cache ti->clone during requeue', which is a bit too complicated, and did not show an obvious performance improvement. - make change on blk-mq part cleaner Ming Lei (3): blk-mq: move actual issue into one helper blk-mq: return dispatch result to caller in blk_mq_try_issue_directly blk-mq: issue request directly for blk_insert_cloned_request block/blk-mq.c | 85 +++--- 1 file changed, 64 insertions(+), 21 deletions(-) -- 2.9.5
LSF/MM 2018: Call for Proposals
The annual Linux Storage, Filesystem and Memory Management (LSF/MM) Summit for 2018 will be held from April 23-25 at the Deer Valley Lodges in Park City, Utah. LSF/MM is an invitation-only technical workshop to map out improvements to the Linux storage, filesystem and memory management subsystems that will make their way into the mainline kernel within the coming years. http://events.linuxfoundation.org/events/linux-storage-filesystem-and-mm-summit LSF/MM 2018 will be a three day, stand-alone conference with three subsystem-specific tracks, cross-track discussions, as well as BoF and hacking sessions. On behalf of the committee I am issuing a call for agenda proposals that are suitable for cross-track discussion as well as technical subjects for the breakout sessions. If advance notice is required for visa applications then please point that out in your proposal or request to attend, and submit the topic as soon as possible. 1) Proposals for agenda topics should be sent before January 31st, 2018 to: lsf...@lists.linux-foundation.org and CC the mailing lists that are relevant for the topic in question: FS: linux-fsde...@vger.kernel.org MM: linux...@kvack.org Block: linux-block@vger.kernel.org ATA:linux-...@vger.kernel.org SCSI: linux-s...@vger.kernel.org NVMe: linux-n...@lists.infradead.org Please tag your proposal with [LSF/MM TOPIC] to make it easier to track. In addition, please make sure to start a new thread for each topic rather than following up to an existing one. Agenda topics and attendees will be selected by the program committee, but the final agenda will be formed by consensus of the attendees on the day. 2) Requests to attend the summit for those that are not proposing a topic should be sent to: lsf...@lists.linux-foundation.org Please summarize what expertise you will bring to the meeting, and what you would like to discuss. Please also tag your email with [LSF/MM ATTEND] and send it as a new thread so there is less chance of it getting lost. 
We will try to cap attendance at around 25-30 per track to facilitate discussions, although the final numbers will depend on the room sizes at the venue.

For discussion leaders, slides and visualizations are encouraged to outline the subject matter and focus the discussions. Please refrain from lengthy presentations and talks; the sessions are supposed to be interactive, inclusive discussions.

There will be no recording or audio bridge. However, we expect that written minutes will be published as we did in previous years:

2017: https://lwn.net/Articles/lsfmm2017/
2016: https://lwn.net/Articles/lsfmm2016/
2015: https://lwn.net/Articles/lsfmm2015/
2014: http://lwn.net/Articles/LSFMM2014/
2013: http://lwn.net/Articles/548089/

3) If you have feedback on last year's meeting that we can use to improve this year's, please also send that to:

       lsf...@lists.linux-foundation.org

Thank you on behalf of the program committee:

Anna Schumaker (Filesystems)
Jens Axboe (Storage)
Josef Bacik (Filesystems)
Martin K. Petersen (Storage)
Michal Hocko (MM)
Rik van Riel (MM)
Johannes Weiner
[RFC PATCH] blktrace: fail earlier if blk_trace in use
Add a check before allocating resources for blk_trace, and fail earlier if it is already in use.

Signed-off-by: weiping zhang
---
 kernel/trace/blktrace.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 987d9a9a..16c 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -491,6 +491,9 @@ static int do_blk_trace_setup(struct request_queue *q, char *name, dev_t dev,
 	struct dentry *dir = NULL;
 	int ret;
 
+	if (unlikely(q->blk_trace))
+		return -EBUSY;
+
 	if (!buts->buf_size || !buts->buf_nr)
 		return -EINVAL;
-- 
2.9.4
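The guard the patch adds is a plain check-before-allocate pattern. Below is a userspace sketch of the same shape; the `struct queue` type and `trace_setup()` name are illustrative stand-ins, not the kernel's symbols:

```c
#include <errno.h>
#include <stddef.h>

/* Illustrative stand-in for struct request_queue: just the field we need. */
struct queue {
	void *blk_trace;	/* non-NULL once a trace is attached */
};

/*
 * Mirror of the patch's shape: fail with -EBUSY before doing any
 * allocation work if a trace is already attached, then run the
 * pre-existing sanity checks.
 */
static int trace_setup(struct queue *q, size_t buf_size, size_t buf_nr)
{
	if (q->blk_trace)
		return -EBUSY;		/* already in use: fail early */

	if (!buf_size || !buf_nr)
		return -EINVAL;		/* existing parameter check */

	q->blk_trace = &q->blk_trace;	/* stand-in for the real allocation */
	return 0;
}
```

With the check in front, a second setup attempt on the same queue fails up front instead of allocating buffers first and only then discovering the conflict.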
[PATCH 2/2] genirq/affinity: try best to make sure online CPU is assigned to vector
84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs") causes irq vectors to be assigned to offline CPUs, and an IO hang on HPSA was reported by Laurence.

This patch fixes the issue by trying our best to make sure an online CPU is assigned to each irq vector, taking two steps to spread irq vectors:

1) spread irq vectors across offline CPUs in the node cpumask
2) spread irq vectors across online CPUs in the node cpumask

Fixes: 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
Cc: Thomas Gleixner
Cc: Christoph Hellwig
Reported-by: Laurence Oberman
Signed-off-by: Ming Lei
---
 kernel/irq/affinity.c | 21 ++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 99eb38a4cc83..8b716548b3db 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -103,6 +103,10 @@ static int irq_vecs_spread_affinity(struct cpumask *irqmsk,
 	int v, ncpus = cpumask_weight(nmsk);
 	int vecs_to_assign, extra_vecs;
 
+	/* May happen when spreading vectors across offline cpus */
+	if (!ncpus)
+		return 0;
+
 	/* How many vectors we will try to spread */
 	vecs_to_assign = min(max_vecs, ncpus);
 
@@ -165,13 +169,16 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	/* Stabilize the cpumasks */
 	get_online_cpus();
 	build_node_to_possible_cpumask(node_to_possible_cpumask);
-	nodes = get_nodes_in_cpumask(node_to_possible_cpumask, cpu_possible_mask,
-				     &nodemsk);
 
 	/*
+	 * Don't spread irq vector across offline node.
+	 *
 	 * If the number of nodes in the mask is greater than or equal the
 	 * number of vectors we just spread the vectors across the nodes.
 	 */
+	nodes = get_nodes_in_cpumask(node_to_possible_cpumask, cpu_online_mask,
+				     &nodemsk);
 	if (affv <= nodes) {
 		for_each_node_mask(n, nodemsk) {
 			cpumask_copy(masks + curvec,
@@ -182,14 +189,22 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 		goto done;
 	}
 
+	nodes_clear(nodemsk);
+	nodes = get_nodes_in_cpumask(node_to_possible_cpumask, cpu_possible_mask,
+				     &nodemsk);
 	for_each_node_mask(n, nodemsk) {
 		int vecs_per_node;
 
 		/* Spread the vectors per node */
 		vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
 
-		cpumask_and(nmsk, cpu_possible_mask, node_to_possible_cpumask[n]);
+		/* spread vectors across offline cpus in the node cpumask */
+		cpumask_andnot(nmsk, node_to_possible_cpumask[n], cpu_online_mask);
+		irq_vecs_spread_affinity(&masks[curvec], last_affv - curvec,
+					 vecs_per_node, nmsk);
 
+		/* spread vectors across online cpus in the node cpumask */
+		cpumask_and(nmsk, node_to_possible_cpumask[n], cpu_online_mask);
 		curvec += irq_vecs_spread_affinity(&masks[curvec],
 						   last_affv - curvec,
 						   vecs_per_node, nmsk);
-- 
2.9.5
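The effect of the two-step spread in patch 2/2 can be sketched in userspace with cpumasks modeled as bitmasks. This is an illustration only — `spread_into()` is a hypothetical, much-simplified stand-in for the kernel's spreading helper, not the kernel code. The key point it demonstrates: the offline pass fills the vector masks starting at the cursor without advancing it, then the online pass ORs online CPUs into the same vectors, so every assigned vector ends up containing at least one online CPU.

```c
/*
 * Model: CPUs in one node as a bitmask (e.g. 8 CPUs, bits 0-7).
 * spread_into() scatters the CPUs in nmsk over consecutive vectors
 * starting at curvec, OR-ing bits in the way cpumask_set_cpu() would,
 * and returns how many vectors it used.
 */
static int spread_into(unsigned *vecs, int curvec, int nvecs, unsigned nmsk)
{
	int ncpus = __builtin_popcount(nmsk);
	int vecs_to_assign, cpu = 0, v;

	if (!ncpus)		/* the !ncpus early return patch 2/2 adds */
		return 0;

	vecs_to_assign = nvecs < ncpus ? nvecs : ncpus;
	for (v = 0; v < vecs_to_assign; v++) {
		/* base share, plus one extra CPU to absorb the remainder */
		int share = ncpus / vecs_to_assign +
			    (v < ncpus % vecs_to_assign);

		while (share--) {
			while (!(nmsk & (1u << cpu)))
				cpu++;		/* next CPU present in nmsk */
			vecs[curvec + v] |= 1u << cpu;
			nmsk &= ~(1u << cpu);
		}
	}
	return vecs_to_assign;
}
```

Calling it twice from the same starting vector — first with `node & ~online`, then with `node & online` — leaves each vector holding both an offline and an online CPU, which is exactly the property the fix is after.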
[PATCH 1/2] genirq/affinity: move irq vectors spread into one function
This patch prepares for doing a two-step spread:

- spread vectors across non-online CPUs
- spread vectors across online CPUs

This is applied to try our best to avoid assigning only offline CPUs to any single vector. No functional change, and the code gets cleaned up too.

Cc: Thomas Gleixner
Cc: Christoph Hellwig
Signed-off-by: Ming Lei
---
 kernel/irq/affinity.c | 56 ++++++++++++++++++++++++++++----------------
 1 file changed, 34 insertions(+), 22 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index a37a3b4b6342..99eb38a4cc83 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -94,6 +94,35 @@ static int get_nodes_in_cpumask(cpumask_var_t *node_to_possible_cpumask,
 	return nodes;
 }
 
+/* Spread irq vectors, and the result is stored to @irqmsk. */
+static int irq_vecs_spread_affinity(struct cpumask *irqmsk,
+				    int max_irqmsks,
+				    int max_vecs,
+				    struct cpumask *nmsk)
+{
+	int v, ncpus = cpumask_weight(nmsk);
+	int vecs_to_assign, extra_vecs;
+
+	/* How many vectors we will try to spread */
+	vecs_to_assign = min(max_vecs, ncpus);
+
+	/* Account for rounding errors */
+	extra_vecs = ncpus - vecs_to_assign * (ncpus / vecs_to_assign);
+
+	for (v = 0; v < min(max_irqmsks, vecs_to_assign); v++) {
+		int cpus_per_vec = ncpus / vecs_to_assign;
+
+		/* Account for extra vectors to compensate rounding errors */
+		if (extra_vecs) {
+			cpus_per_vec++;
+			--extra_vecs;
+		}
+		irq_spread_init_one(irqmsk + v, nmsk, cpus_per_vec);
+	}
+
+	return v;
+}
+
 /**
  * irq_create_affinity_masks - Create affinity masks for multiqueue spreading
  * @nvecs:	The total number of vectors
@@ -104,7 +133,7 @@ static int get_nodes_in_cpumask(cpumask_var_t *node_to_possible_cpumask,
 struct cpumask *
 irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 {
-	int n, nodes, cpus_per_vec, extra_vecs, curvec;
+	int n, nodes, curvec;
 	int affv = nvecs - affd->pre_vectors - affd->post_vectors;
 	int last_affv = affv + affd->pre_vectors;
 	nodemask_t nodemsk = NODE_MASK_NONE;
@@ -154,33 +183,16 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 	}
 
 	for_each_node_mask(n, nodemsk) {
-		int ncpus, v, vecs_to_assign, vecs_per_node;
+		int vecs_per_node;
 
 		/* Spread the vectors per node */
 		vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
 
-		/* Get the cpus on this node which are in the mask */
 		cpumask_and(nmsk, cpu_possible_mask, node_to_possible_cpumask[n]);
 
-		/* Calculate the number of cpus per vector */
-		ncpus = cpumask_weight(nmsk);
-		vecs_to_assign = min(vecs_per_node, ncpus);
-
-		/* Account for rounding errors */
-		extra_vecs = ncpus - vecs_to_assign * (ncpus / vecs_to_assign);
-
-		for (v = 0; curvec < last_affv && v < vecs_to_assign;
-		     curvec++, v++) {
-			cpus_per_vec = ncpus / vecs_to_assign;
-
-			/* Account for extra vectors to compensate rounding errors */
-			if (extra_vecs) {
-				cpus_per_vec++;
-				--extra_vecs;
-			}
-			irq_spread_init_one(masks + curvec, nmsk, cpus_per_vec);
-		}
-
+		curvec += irq_vecs_spread_affinity(&masks[curvec],
+						   last_affv - curvec,
+						   vecs_per_node, nmsk);
 		if (curvec >= last_affv)
 			break;
 		--nodes;
-- 
2.9.5
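The arithmetic the new helper factors out is worth seeing in isolation. The userspace sketch below (the `spread_counts()` name is hypothetical, not the kernel function) reproduces how `ncpus` CPUs are divided among the vectors: every vector gets the base `ncpus / vecs_to_assign` share, and the first `extra_vecs` vectors take one extra CPU so the remainder is absorbed.

```c
/*
 * Compute how many CPUs each vector receives when spreading ncpus CPUs
 * over at most max_vecs vectors, using the same rounding compensation
 * as the helper above. Returns the number of vectors actually used;
 * out[] receives the per-vector CPU counts.
 */
static int spread_counts(int ncpus, int max_vecs, int *out)
{
	int vecs_to_assign, extra_vecs, v;

	if (!ncpus)		/* avoid dividing by zero on an empty mask */
		return 0;

	/* How many vectors we will try to spread */
	vecs_to_assign = max_vecs < ncpus ? max_vecs : ncpus;

	/* Account for rounding errors */
	extra_vecs = ncpus - vecs_to_assign * (ncpus / vecs_to_assign);

	for (v = 0; v < vecs_to_assign; v++) {
		int cpus_per_vec = ncpus / vecs_to_assign;

		/* the first extra_vecs vectors take one extra CPU */
		if (extra_vecs) {
			cpus_per_vec++;
			--extra_vecs;
		}
		out[v] = cpus_per_vec;
	}
	return v;
}
```

For 10 CPUs over 4 vectors this yields counts of 3, 3, 2, 2, which sum back to 10; with more vectors than CPUs (say 3 CPUs, 8 vectors) only 3 vectors are used, one CPU each.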
[PATCH 0/2] genirq/affinity: try to make sure online CPU is assigned to irq vector
Hi,

These two patches fix an IO hang issue reported by Laurence.

84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs") may cause an irq vector to be assigned only offline CPUs, and then this vector can't handle irq any more.

The 1st patch moves the irq vector spreading into one function, and prepares for the fix done in the 2nd patch.

The 2nd patch fixes the issue by trying to make sure online CPUs are assigned to each irq vector.

Ming Lei (2):
  genirq/affinity: move irq vectors spread into one function
  genirq/affinity: try best to make sure online CPU is assigned to vector

 kernel/irq/affinity.c | 77 ++++++++++++++++++++++++++++-----------------
 1 file changed, 52 insertions(+), 25 deletions(-)

-- 
2.9.5
Re: [GIT PULL] block changes to improve device mapper for 4.16
On 1/15/18 8:52 AM, Mike Snitzer wrote:
> On Mon, Jan 15 2018 at 10:33am -0500,
> Jens Axboe wrote:
>
>> On 1/14/18 7:59 PM, Mike Snitzer wrote:
> ...
>>> Ming Lei (3):
>>>   blk-mq: move actual issue into __blk_mq_issue_req helper
>>
>> I don't like this patch at all - it's a 10 line function (if that)
>> that ends up with three outputs, two of them hidden in passed
>> in pointers. On top of that, a function that is named
>> __blk_mq_issue_req() and returns bool, you would logically expect
>> a 'true' return to mean that it succeeded. This is the opposite.
>>
>> Not strongly opposed to the rest.
>
> OK, I'll have a closer look at how to clean it up (and also get with
> Ming).
>
> In the meantime, you can either cherry-pick my first 4 patches or feel
> free to use this to pull them in:

I took 1-3 initially, added 4 now as well.

-- 
Jens Axboe
Re: [GIT PULL] block changes to improve device mapper for 4.16
On Mon, Jan 15 2018 at 10:33am -0500, Jens Axboe wrote:

> On 1/14/18 7:59 PM, Mike Snitzer wrote:
...
>> Ming Lei (3):
>>   blk-mq: move actual issue into __blk_mq_issue_req helper
>
> I don't like this patch at all - it's a 10 line function (if that)
> that ends up with three outputs, two of them hidden in passed
> in pointers. On top of that, a function that is named
> __blk_mq_issue_req() and returns bool, you would logically expect
> a 'true' return to mean that it succeeded. This is the opposite.
>
> Not strongly opposed to the rest.

OK, I'll have a closer look at how to clean it up (and also get with Ming).

In the meantime, you can either cherry-pick my first 4 patches or feel free to use this to pull them in:

The following changes since commit bf9ae8c5325c0070d0ec81a849bba8d156f65993:

  blk-mq: fix bad clear of RQF_MQ_INFLIGHT in blk_mq_ct_ctx_init() (2018-01-14 10:46:24 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git tags/for-block-4.16/dm-changes-1

for you to fetch changes up to d0cc27da2b04351b3cb52afeb99ceca7b9f91f3b:

  dm: fix incomplete request_queue initialization (2018-01-14 12:59:59 -0500)

----------------------------------------------------------------
- Small correctness fix in del_gendisk() if GENHD_FL_HIDDEN is used.

- Cleanup blk_unregister_queue() to more precisely protect against
  concurrent sysfs changes; blk_mq_unregister_dev() now requires the
  caller to hold q->sysfs_lock (blk_unregister_queue is the only caller).

- Introduce an add_disk() variant, add_disk_no_queue_reg(), that allows
  the gendisk to be registered but leaves the associated disk->queue's
  blk_register_queue() for the driver to do once its request_queue is
  fully initialized. Fixes long-standing DM request_queue initialization
  issues.

----------------------------------------------------------------
Mike Snitzer (4):
      block: only bdi_unregister() in del_gendisk() if !GENHD_FL_HIDDEN
      block: properly protect the 'queue' kobj in blk_unregister_queue
      block: allow gendisk's request_queue registration to be deferred
      dm: fix incomplete request_queue initialization

 block/blk-mq-sysfs.c  |  9 +++++----
 block/blk-sysfs.c     | 18 +++++++++---------
 block/genhd.c         | 23 +++++++++++++++++++----
 drivers/md/dm-rq.c    |  9 ---------
 drivers/md/dm.c       | 11 ++++++++++-
 include/linux/genhd.h |  5 +++++
 6 files changed, 50 insertions(+), 25 deletions(-)
Re: 4.16 genirq change prevents HP servers from booting [was: Re: linux-next: Signed-off-by missing for commit in the device-mapper tree]
On Mon, Jan 15, 2018 at 10:25:01AM -0500, Mike Snitzer wrote:
> On Mon, Jan 15 2018 at 8:27am -0500,
> Stephen Rothwell wrote:
>
>> Hi all,
>>
>> Commit
>>
>>   34e1467da673 ("Revert "genirq/affinity: assign vectors to all possible CPUs"")
>>
>> is missing a Signed-off-by from its author and committer.
>>
>> Reverts are commits as well.
>
> Right, I'm aware. I staged the tree that made some HP servers finally
> work with the latest linux-block 4.16 changes. Without thinking about
> the broader implications. Anyway, I'll drop the revert from
> linux-dm.git's 'for-next'.
>
> Because I'm confident others will hunt down the irq issues.
>
> I think Ming was looking to grab the queue mapping info and CPU related
> info from the affected server.

Hi Mike,

I have cooked a fix today, please see the following link:

https://marc.info/?l=linux-kernel&m=151601867018444&w=2

This issue is clear, and has been described in the above link and the patch. If there is no objection, I will prepare a formal version.

-- 
Ming
4.16 genirq change prevents HP servers from booting [was: Re: linux-next: Signed-off-by missing for commit in the device-mapper tree]
On Mon, Jan 15 2018 at 8:27am -0500, Stephen Rothwell wrote:

> Hi all,
>
> Commit
>
>   34e1467da673 ("Revert "genirq/affinity: assign vectors to all possible CPUs"")
>
> is missing a Signed-off-by from its author and committer.
>
> Reverts are commits as well.

Right, I'm aware. I staged the tree that made some HP servers finally work with the latest linux-block 4.16 changes. Without thinking about the broader implications. Anyway, I'll drop the revert from linux-dm.git's 'for-next'.

Because I'm confident others will hunt down the irq issues.

I think Ming was looking to grab the queue mapping info and CPU related info from the affected server.

> Though I do note it actually has a reasonable commit message, thanks.

Semi-reasonable. Lacks detail. The issue is that over the weekend Laurence found linux-block.git commit 84676c1f21e8ff54befe98 prevents some HP servers from booting. They'd hang when trying to initialize their HPSA controller's devices, e.g.:

[ 246.751050] INFO: task systemd-udevd:411 blocked for more than 120 seconds.
[ 246.791852] Tainted: G I 4.15.0-rc4.block.dm.4.16+ #1
[ 246.830650] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 246.874637] systemd-udevd D0 411408 0x8004
[ 246.904934] Call Trace:
[ 246.918191] ? __schedule+0x28d/0x870
[ 246.937643] ? _cond_resched+0x15/0x30
[ 246.958222] schedule+0x32/0x80
[ 246.975424] async_synchronize_cookie_domain+0x8b/0x140
[ 247.004452] ? remove_wait_queue+0x60/0x60
[ 247.027335] do_init_module+0xbe/0x219
[ 247.048022] load_module+0x21d6/0x2910
[ 247.069436] ? m_show+0x1c0/0x1c0
[ 247.087999] SYSC_finit_module+0x94/0xe0
[ 247.110392] entry_SYSCALL_64_fastpath+0x1a/0x7d
[ 247.136669] RIP: 0033:0x7f84049287f9
[ 247.156112] RSP: 002b:7ffd13199ab8 EFLAGS: 0246 ORIG_RAX: 0139
[ 247.196883] RAX: ffda RBX: 55b712b59e80 RCX: 7f84049287f9
[ 247.237989] RDX: RSI: 7f8405245099 RDI: 0008
[ 247.279105] RBP: 7f8404bf2760 R08: R09: 55b712b45760
[ 247.320005] R10: 0008 R11: 0246 R12: 0020
[ 247.360625] R13: 7f8404bf2818 R14: 0050 R15: 7f8404bf27b8
[ 247.401062] INFO: task scsi_eh_0:471 blocked for more than 120 seconds.
[ 247.438161] Tainted: G I 4.15.0-rc4.block.dm.4.16+ #1
[ 247.476640] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 247.520700] scsi_eh_0 D0 471 2 0x8000
[ 247.551339] Call Trace:
[ 247.564360] ? __schedule+0x28d/0x870
[ 247.584720] schedule+0x32/0x80
[ 247.601294] hpsa_eh_device_reset_handler+0x68c/0x700 [hpsa]
[ 247.633358] ? remove_wait_queue+0x60/0x60
[ 247.656345] scsi_try_bus_device_reset+0x27/0x40
[ 247.682424] scsi_eh_ready_devs+0x53f/0xe20
[ 247.706467] ? __pm_runtime_resume+0x55/0x70
[ 247.730327] scsi_error_handler+0x434/0x5e0
[ 247.754387] ? __schedule+0x295/0x870
[ 247.775420] kthread+0xf5/0x130
[ 247.793461] ? scsi_eh_get_sense+0x240/0x240
[ 247.818008] ? kthread_associate_blkcg+0x90/0x90
[ 247.844759] ret_from_fork+0x1f/0x30
[ 247.865440] INFO: task scsi_id:488 blocked for more than 120 seconds.
[ 247.901112] Tainted: G I 4.15.0-rc4.block.dm.4.16+ #1
[ 247.938743] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 247.981092] scsi_id D0 488 1 0x0004
[ 248.010535] Call Trace:
[ 248.023567] ? __schedule+0x28d/0x870
[ 248.044236] ? __switch_to+0x1f5/0x460
[ 248.065776] schedule+0x32/0x80
[ 248.084238] schedule_timeout+0x1d4/0x2f0
[ 248.106184] wait_for_completion+0x123/0x190
[ 248.130759] ? wake_up_q+0x70/0x70
[ 248.150295] flush_work+0x119/0x1a0
[ 248.169238] ? wake_up_worker+0x30/0x30
[ 248.189670] __cancel_work_timer+0x103/0x190
[ 248.213751] ? kobj_lookup+0x10b/0x160
[ 248.235441] disk_block_events+0x6f/0x90
[ 248.257820] __blkdev_get+0x6a/0x480
[ 248.278770] ? bd_acquire+0xd0/0xd0
[ 248.298438] blkdev_get+0x1a5/0x300
[ 248.316587] ? bd_acquire+0xd0/0xd0
[ 248.334814] do_dentry_open+0x202/0x320
[ 248.354372] ? security_inode_permission+0x3c/0x50
[ 248.378818] path_openat+0x537/0x12c0
[ 248.397386] ? vm_insert_page+0x1e0/0x1f0
[ 248.417664] ? vvar_fault+0x75/0x140
[ 248.435811] do_filp_open+0x91/0x100
[ 248.454061] do_sys_open+0x126/0x210
[ 248.472462] entry_SYSCALL_64_fastpath+0x1a/0x7d
[ 248.495438] RIP: 0033:0x7f39e60e1e90
[ 248.513136] RSP: 002b:7ffc4c906ba8 EFLAGS: 0246 ORIG_RAX: 0002
[ 248.550726] RAX: ffda RBX: 5624aead3010 RCX: 7f39e60e1e90
[ 248.586207] RDX: 7f39e60cc0c4 RSI: 00080800 RDI: 7ffc4c906ed0
[ 248.622411] RBP: 7ffc4c906b60 R08: 7f39e60cc140 R09: 7f39e60cc140
[ 248.658704] R10: 001f R11: 0246 R12: 7ffc4c906ed0
Re: [PATCH 23/27] drbd: make intelligent use of blkdev_issue_zeroout
On Mon, Jan 15 2018 at 7:46am -0500, Lars Ellenberg wrote:

> As I understood it,
> blkdev_issue_zeroout() was supposed to "always try to unmap",
> deprovision, the relevant region, and zero-out any unaligned
> head or tail, just like my work around above was doing.
>
> And that device mapper thin was "about to" learn this, "soon",
> or maybe block core would do the equivalent of my workaround
> described above.
>
> But it then did not.
>
> See also:
> https://www.redhat.com/archives/dm-devel/2017-March/msg00213.html
> https://www.redhat.com/archives/dm-devel/2017-March/msg00226.html

Right, now that you mention it, it is starting to ring a bell (especially after I read your 2nd dm-devel archive url above).

> I then did not follow this closely enough anymore,
> and I missed that with a recent enough kernel,
> discard on DRBD on dm-thin would fully allocate.
>
> In our out-of-tree module, we had to keep the older code for
> compat reasons, anyways. I will just re-enable our zeroout
> workaround there again.
>
> In tree, either dm-thin learns to do REQ_OP_WRITE_ZEROES "properly",
> so the result in this scenario is what we expect:
>
>  _: unprovisioned, not allocated, returns zero on read anyways
>  *: provisioned, some arbitrary data
>  0: explicitly zeroed:
>
>  |gran|ular|ity |    |
>  |____|____|____|____|
>    to|-be-|zero|ed
>  |**00|____|____|00**|
>
> (leave unallocated blocks alone,
> de-allocate full blocks just like with discard,
> explicitly zero unaligned head and tail)

"de-allocate full blocks just like with discard" is an interesting take on what it means for dm-thin to handle REQ_OP_WRITE_ZEROES "properly".

> Or DRBD will have to resurrect that reinvented zeroout again,
> with exactly those semantics. I did reinvent it for a reason ;)

Yeah, I now recall dropping that line of development because it became "hard" (or at least harder than originally thought).

Don't people use REQ_OP_WRITE_ZEROES to initialize a portion of the disk? E.g. zeroing superblocks, metadata areas, or whatever? If we just discarded the logical extent and then a user did a partial write to the block, areas that a user might expect to be zeroed wouldn't be (at least in the case of dm-thinp if "skip_block_zeroing" is enabled).

And yes, if discard passdown is enabled and the device's discard implementation does "discard_zeroes_data" then it'd be fine.. but there are a lot of things that need to line up for drbd's REQ_OP_WRITE_ZEROES to "just work" (as it expects). (Now I'm just echoing the kinds of concerns I had in that 2nd dm-devel post above.)

This post from mkp is interesting:
https://www.redhat.com/archives/dm-devel/2017-March/msg00228.html

Specifically:
"You don't have a way to mark those blocks as being full of zeroes without actually writing them? Note that the fallback to a zeroout command is to do a regular write. So if DM doesn't zero the blocks, the block layer is going to do it."

No, dm-thinp doesn't have an easy way to mark an allocated block as containing zeroes (without actually zeroing). I toyed with adding that but then realized that even if we had it, it'd still require block zeroing be enabled. But block zeroing is done at allocation time. So we'd need to interpret the "this block is zeroes" flag to mean "on first write or read to this block it needs to first zero it". Fugly to say the least...

I've been quite busy with other things but I can revisit all this with Joe Thornber and see what we come up with after a 2nd discussion. But sadly, in general, this is a low priority for me, so you might do well to reintroduce your drbd workaround.. sorry about that :(

Mike
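The head/full-chunks/tail split being debated above can be written down concretely. A userspace sketch under stated assumptions — units are sectors, and `split_region()` plus the field names are illustrative, not DRBD's symbols: zero the unaligned head and tail explicitly, and discard only the granularity-aligned middle.

```c
/*
 * Split [start, start + len) against an allocation granularity:
 * head and tail are the unaligned partial chunks to zero explicitly,
 * [mid_start, mid_start + mid_len) is the aligned middle to discard.
 */
struct zsplit {
	unsigned long head_len;		/* unaligned leading sectors to zero */
	unsigned long mid_start;	/* first aligned sector to discard */
	unsigned long mid_len;		/* whole granules to discard */
	unsigned long tail_len;		/* unaligned trailing sectors to zero */
};

static struct zsplit split_region(unsigned long start, unsigned long len,
				  unsigned long gran)
{
	struct zsplit s = { 0, start + len, 0, 0 };
	unsigned long end = start + len;
	unsigned long astart = (start + gran - 1) / gran * gran; /* round up */
	unsigned long aend = end / gran * gran;			 /* round down */

	if (astart >= aend) {		/* no full granule covered: zero it all */
		s.head_len = len;
		return s;
	}
	s.head_len = astart - start;
	s.mid_start = astart;
	s.mid_len = aend - astart;
	s.tail_len = end - aend;
	return s;
}
```

With a granularity of 4 and Lars's picture (a region from sector 2 to 14 over four granules), this zeroes sectors 2-3 and 12-13 and discards the two full granules in between, matching the `|**00|____|____|00**|` result.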
Re: [Drbd-dev] [PATCH 23/27] drbd: make intelligent use of blkdev_issue_zeroout
On Sat, Jan 13, 2018 at 12:46:40AM +0000, Eric Wheeler wrote:

> Hello All,
>
> We just noticed that discards to DRBD devices backed by dm-thin devices
> are fully allocating the thin blocks.
>
> This behavior does not exist before
>   ee472d83 block: add a flags argument to (__)blkdev_issue_zeroout
>
> The problem exists somewhere between
>   [working] c20cfc27 block: stop using blkdev_issue_write_same for zeroing
> and
>   [broken]  45c21793 drbd: implement REQ_OP_WRITE_ZEROES
>
> Note that c20cfc27 works as expected, but 45c21793 discards blocks
> being zeroed on the dm-thin backing device. All commits between those two
> produce the following error:
>
>   blkdiscard: /dev/drbd/by-res/test: BLKDISCARD ioctl failed: Input/output error
>
> Also note that issuing a blkdiscard to the backing device directly
> discards as you would expect. This is just a problem when sending discards
> through DRBD.
>
> Is there an easy way to solve this in the short term, even if the ultimate
> fix is more involved?
>
> On Wed, 5 Apr 2017, Christoph Hellwig wrote:
>
>   commit 0dbed96a3cc9786bc4814dab98a7218753bde934
>   Author: Christoph Hellwig
>   Date:   Wed Apr 5 19:21:21 2017 +0200
>
>       drbd: make intelligent use of blkdev_issue_zeroout
>
> > drbd always wants its discard wire operations to zero the blocks, so
> > use blkdev_issue_zeroout with the BLKDEV_ZERO_UNMAP flag instead of
> > reinventing it poorly.
>
> > -/*
> > - * We *may* ignore the discard-zeroes-data setting, if so configured.
> > - *
> > - * Assumption is that it "discard_zeroes_data=0" is only because the backend
> > - * may ignore partial unaligned discards.
> > - *
> > - * LVM/DM thin as of at least
> > - * LVM version:     2.02.115(2)-RHEL7 (2015-01-28)
> > - * Library version: 1.02.93-RHEL7 (2015-01-28)
> > - * Driver version:  4.29.0
> > - * still behaves this way.
> > - *
> > - * For unaligned (wrt. alignment and granularity) or too small discards,
> > - * we zero-out the initial (and/or) trailing unaligned partial chunks,
> > - * but discard all the aligned full chunks.
> > - *
> > - * At least for LVM/DM thin, the result is effectively "discard_zeroes_data=1".
> > - */
> > -int drbd_issue_discard_or_zero_out(struct drbd_device *device, sector_t start, unsigned int nr_sectors, bool discard)

As I understood it, blkdev_issue_zeroout() was supposed to "always try to unmap", deprovision, the relevant region, and zero-out any unaligned head or tail, just like my work around above was doing.

And that device mapper thin was "about to" learn this, "soon", or maybe block core would do the equivalent of my workaround described above.

But it then did not.

See also:
https://www.redhat.com/archives/dm-devel/2017-March/msg00213.html
https://www.redhat.com/archives/dm-devel/2017-March/msg00226.html

I then did not follow this closely enough anymore, and I missed that with a recent enough kernel, discard on DRBD on dm-thin would fully allocate.

In our out-of-tree module, we had to keep the older code for compat reasons, anyways. I will just re-enable our zeroout workaround there again.

In tree, either dm-thin learns to do REQ_OP_WRITE_ZEROES "properly", so the result in this scenario is what we expect:

 _: unprovisioned, not allocated, returns zero on read anyways
 *: provisioned, some arbitrary data
 0: explicitly zeroed:

 |gran|ular|ity |    |
 |____|____|____|____|
   to|-be-|zero|ed
 |**00|____|____|00**|

(leave unallocated blocks alone, de-allocate full blocks just like with discard, explicitly zero unaligned head and tail)

Or DRBD will have to resurrect that reinvented zeroout again, with exactly those semantics. I did reinvent it for a reason ;)

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT