Re: [PATCH V2] blk-mq: re-build queue map in case of kdump kernel
On 12/6/18 8:03 PM, Ming Lei wrote:
> Now almost all .map_queues() implementations based on managed irq
> affinity don't update the queue mapping; they just retrieve the old
> built mapping, so if nr_hw_queues is changed, the mapping table
> includes stale entries. Only blk_mq_map_queues() may rebuild the
> mapping table.
>
> One case is that we limit .nr_hw_queues to 1 for the kdump kernel.
> However, drivers often build the queue mapping before allocating the
> tagset via pci_alloc_irq_vectors_affinity(), while set->nr_hw_queues
> can be set to 1 for the kdump kernel, so a wrong queue mapping is
> used, and a kernel panic[1] is observed during booting.
>
> This patch fixes the kernel panic triggered on nvme by rebuilding the
> mapping table via blk_mq_map_queues().
>
> [1] kernel panic log
> [4.438371] nvme nvme0: 16/0/0 default/read/poll queues
> [4.443277] BUG: unable to handle kernel NULL pointer dereference at 0098
> [4.444681] PGD 0 P4D 0
> [4.445367] Oops: [#1] SMP NOPTI
> [4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459
> [4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
> [4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core]
> [4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222
> [4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6
> [4.453132] RSP: 0018:c900023b3cd8 EFLAGS: 00010286
> [4.454061] RAX: RBX: 888174448000 RCX: 0001
> [4.456480] RDX: 0001 RSI: e8feffc506c0 RDI: 0001
> [4.458750] RBP: 88810722d008 R08: 88817647a880 R09: 0002
> [4.464580] R10: c900023b3c10 R11: 0004 R12: 888174448538
> [4.467803] R13: 0004 R14: 0001 R15: 0001
> [4.469220] FS: () GS:88817bac() knlGS:
> [4.471554] CS: 0010 DS: ES: CR0: 80050033
> [4.472464] CR2: 0098 CR3: 000174e4e001 CR4: 00760ee0
> [4.474264] DR0: DR1: DR2:
> [4.476007] DR3: DR6: fffe0ff0 DR7: 0400
> [4.477061] PKRU: 5554
> [4.477464] Call Trace:
> [4.478731] blk_mq_init_allocated_queue+0x36a/0x3ad
> [4.479595] blk_mq_init_queue+0x32/0x4e
> [4.480178] nvme_validate_ns+0x98/0x623 [nvme_core]
> [4.480963] ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core]
> [4.481685] ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core]
> [4.482601] nvme_scan_work+0x23a/0x29b [nvme_core]
> [4.483269] ? _raw_spin_unlock_irqrestore+0x25/0x38
> [4.483930] ? try_to_wake_up+0x38d/0x3b3
> [4.484478] ? process_one_work+0x179/0x2fc
> [4.485118] process_one_work+0x1d3/0x2fc
> [4.485655] ? rescuer_thread+0x2ae/0x2ae
> [4.486196] worker_thread+0x1e9/0x2be
> [4.486841] kthread+0x115/0x11d
> [4.487294] ? kthread_park+0x76/0x76
> [4.487784] ret_from_fork+0x3a/0x50
> [4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi ip_tables
> [4.489428] Dumping ftrace buffer:
> [4.489939]    (ftrace buffer empty)
> [4.490492] CR2: 0098
> [4.491052] ---[ end trace 03cd268ad5a86ff7 ]---

Works for me, tested various configs and stubbed out the kdump check.
Thanks for fixing this, applied to 4.21.

-- 
Jens Axboe
[PATCH v3] blk-mq: punt failed direct issue to dispatch list
After the direct dispatch corruption fix, we permanently disallow direct
dispatch of non read/write requests. This works fine off the normal IO
path, as they will be retried like any other failed direct dispatch
request. But for blk_insert_cloned_request(), which only DM uses to
bypass the bottom level scheduler, we always first attempt direct
dispatch. For some types of requests, that's now a permanent failure,
and no amount of retrying will make it succeed. This results in a
livelock.

Instead of making special cases for what we can direct issue, and now
having to deal with DM solving the livelock while still retaining a BUSY
condition feedback loop, always just add a request that has been through
->queue_rq() to the hardware queue dispatch list. These are safe to use
as no merging can take place there. Additionally, if requests do have
prepped data from drivers, we aren't dependent on them not sharing space
in the request structure to safely add them to the IO scheduler lists.

This basically reverts ffe81d45322c and is based on a patch from Ming,
but with the list insert case covered as well.

Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
Cc: sta...@vger.kernel.org
Suggested-by: Ming Lei
Reported-by: Bart Van Assche
Signed-off-by: Jens Axboe

---

I've thrown the initial hang test reported by Bart at it, works fine.
My reproducer for the corruption case is also happy, as expected. I'm
running blktests and xfstests on it overnight. If that passes as
expected, this quells my initial worries about using ->dispatch as a
holding place for these types of requests.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3262d83b9e07..6a7566244de3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1715,15 +1715,6 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
 		break;
 	case BLK_STS_RESOURCE:
 	case BLK_STS_DEV_RESOURCE:
-		/*
-		 * If direct dispatch fails, we cannot allow any merging on
-		 * this IO. Drivers (like SCSI) may have set up permanent state
-		 * for this request, like SG tables and mappings, and if we
-		 * merge to it later on then we'll still only do IO to the
-		 * original part.
-		 */
-		rq->cmd_flags |= REQ_NOMERGE;
-
 		blk_mq_update_dispatch_busy(hctx, true);
 		__blk_mq_requeue_request(rq);
 		break;
@@ -1736,18 +1727,6 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
 	return ret;
 }
 
-/*
- * Don't allow direct dispatch of anything but regular reads/writes,
- * as some of the other commands can potentially share request space
- * with data we need for the IO scheduler. If we attempt a direct dispatch
- * on those and fail, we can't safely add it to the scheduler afterwards
- * without potentially overwriting data that the driver has already written.
- */
-static bool blk_rq_can_direct_dispatch(struct request *rq)
-{
-	return req_op(rq) == REQ_OP_READ || req_op(rq) == REQ_OP_WRITE;
-}
-
 static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 						struct request *rq,
 						blk_qc_t *cookie,
@@ -1769,7 +1748,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 		goto insert;
 	}
 
-	if (!blk_rq_can_direct_dispatch(rq) || (q->elevator && !bypass_insert))
+	if (q->elevator && !bypass_insert)
 		goto insert;
 
 	if (!blk_mq_get_dispatch_budget(hctx))
@@ -1785,7 +1764,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	if (bypass_insert)
 		return BLK_STS_RESOURCE;
 
-	blk_mq_sched_insert_request(rq, false, run_queue, false);
+	blk_mq_request_bypass_insert(rq, run_queue);
 	return BLK_STS_OK;
 }
 
@@ -1801,7 +1780,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 
 	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
 	if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
-		blk_mq_sched_insert_request(rq, false, true, false);
+		blk_mq_request_bypass_insert(rq, true);
 	else if (ret != BLK_STS_OK)
 		blk_mq_end_request(rq, ret);
 
@@ -1831,15 +1810,13 @@ void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
 		struct request *rq = list_first_entry(list, struct request,
 				queuelist);
 
-		if (!blk_rq_can_direct_dispatch(rq))
-			break;
-
 		list_del_init(&rq->queuelist);
 		ret = blk_mq_request_issue_directly(rq);
 		if (ret != BLK_STS_OK) {
 			if (ret == BLK_STS_RESOURCE ||
 					ret == BLK_STS_DEV_RESOURCE) {
-
Re: [PATCH v2] block/dm: fix handling of busy off direct dispatch path
On 12/6/18 9:18 PM, Mike Snitzer wrote:
> On Thu, Dec 06 2018 at 11:06pm -0500, Jens Axboe wrote:
>
>> On 12/6/18 8:54 PM, Mike Snitzer wrote:
>>> On Thu, Dec 06 2018 at 9:49pm -0500, Jens Axboe wrote:
>>>
>>>> After the direct dispatch corruption fix, we permanently disallow
>>>> direct dispatch of non read/write requests. This works fine off the
>>>> normal IO path, as they will be retried like any other failed direct
>>>> dispatch request. But for the blk_insert_cloned_request() that only
>>>> DM uses to bypass the bottom level scheduler, we always first
>>>> attempt direct dispatch. For some types of requests, that's now a
>>>> permanent failure, and no amount of retrying will make that succeed.
>>>>
>>>> Use the driver private RQF_DONTPREP to track this condition in DM.
>>>> If we encounter a BUSY condition from blk_insert_cloned_request(),
>>>> then flag the request with RQF_DONTPREP. When we next see this
>>>> request, ask blk_insert_cloned_request() to bypass insert the
>>>> request directly. This avoids the livelock of repeatedly trying to
>>>> direct dispatch a request, while still retaining the BUSY feedback
>>>> loop for blk-mq so that we don't over-dispatch to the lower level
>>>> queue and mess up opportunities for merging on the DM queue.
>>>>
>>>> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
>>>> Reported-by: Bart Van Assche
>>>> Cc: sta...@vger.kernel.org
>>>> Signed-off-by: Jens Axboe
>>>>
>>>> ---
>>>>
>>>> This passes my testing as well, like the previous patch. But unlike
>>>> the previous patch, we retain the BUSY feedback loop information for
>>>> better merging.
>>>
>>> But it is kind of gross to workaround the new behaviour to
>>> "permanently disallow direct dispatch of non read/write requests" by
>>> always failing such requests back to DM for later immediate direct
>>> dispatch. That bouncing of the request was acceptable when there was
>>> load-based justification for having to retry (and in doing so: taking
>>> the cost of freeing the clone request gotten via get_request() from
>>> the underlying request_queues).
>>>
>>> Having to retry like this purely because the request isn't a read or
>>> write seems costly.. every non-read-write will have implied
>>> request_queue bouncing. In multipath's case: it could select an
>>> entirely different underlying path the next time it is destaged (with
>>> RQF_DONTPREP set). Which you'd think would negate all hope of IO
>>> merging based performance improvements -- but that is a tangent I'll
>>> need to ask Ming about (again).
>>>
>>> I really don't like this business of bouncing requests as a
>>> workaround for the recent implementation of the corruption fix.
>>>
>>> Why not just add an override flag to _really_ allow direct dispatch
>>> for _all_ types of requests?
>>>
>>> (just peeked at linux-block and it is looking like you took
>>> jianchao.wang's series to avoid this hack... ;)
>>>
>>> Awesome.. my work is done for tonight!
>>
>> The whole point is doing something that is palatable to 4.20 and
>> leaving the more experimental stuff to 4.21, where we have some weeks
>> to verify that there are no conditions that cause IO stalls. I don't
>> envision there will be, but I'm not willing to risk it this late in
>> the 4.20 cycle.
>>
>> That said, this isn't a quick and dirty, and I don't think it's fair
>> calling this a hack. Using RQF_DONTPREP is quite common in drivers to
>> retain state over multiple ->queue_rq invocations. Using it to avoid
>> multiple direct dispatch failures (and obviously this new livelock)
>> seems fine to me.
>
> But it bounces IO purely because it is non-read-write. That results in
> guaranteed multiple blk_get_request() -- from underlying request_queues
> request-based DM is stacked on -- for every non-read-write IO that is
> cloned. That seems pathological. I must still be missing something.
>
>> I really don't want to go around and audit every driver for potential
>> retained state over special commands, that's why the read+write thing
>> is in place. It's the safe option, which is what we need right now.
>
> Maybe leave the blk_mq_request_issue_directly() interface how it is,
> non-read-write restriction and all, but export a new
> __blk_mq_request_issue_directly() that _only_
> blk_insert_cloned_request() -- and future comparable users -- makes
> use of?
>
> To me that is the best of both worlds: fix the corruption issue but
> don't impose needless blk_get_request() dances for non-read-write IO
> issued to dm-multipath.

Alright, I hear your point. Maybe we are going to be better off with
just dispatch insert. Here's a version of that on top of -git, basically
reverting the previous one, and applying the two hunks from Ming, but
also adding a missed one in blk_mq_try_issue_list_directly() where we
don't want to be adding the current request back to the list. That
should go to dispatch, too.

Note note note - not tested yet. Will do so now.

diff --git a/block/blk-mq.c
Re: [PATCH v2] block/dm: fix handling of busy off direct dispatch path
On Thu, Dec 06 2018 at 11:06pm -0500, Jens Axboe wrote:

> On 12/6/18 8:54 PM, Mike Snitzer wrote:
>> On Thu, Dec 06 2018 at 9:49pm -0500, Jens Axboe wrote:
>>
>>> After the direct dispatch corruption fix, we permanently disallow
>>> direct dispatch of non read/write requests. This works fine off the
>>> normal IO path, as they will be retried like any other failed direct
>>> dispatch request. But for the blk_insert_cloned_request() that only
>>> DM uses to bypass the bottom level scheduler, we always first attempt
>>> direct dispatch. For some types of requests, that's now a permanent
>>> failure, and no amount of retrying will make that succeed.
>>>
>>> Use the driver private RQF_DONTPREP to track this condition in DM. If
>>> we encounter a BUSY condition from blk_insert_cloned_request(), then
>>> flag the request with RQF_DONTPREP. When we next see this request,
>>> ask blk_insert_cloned_request() to bypass insert the request
>>> directly. This avoids the livelock of repeatedly trying to direct
>>> dispatch a request, while still retaining the BUSY feedback loop for
>>> blk-mq so that we don't over-dispatch to the lower level queue and
>>> mess up opportunities for merging on the DM queue.
>>>
>>> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
>>> Reported-by: Bart Van Assche
>>> Cc: sta...@vger.kernel.org
>>> Signed-off-by: Jens Axboe
>>>
>>> ---
>>>
>>> This passes my testing as well, like the previous patch. But unlike
>>> the previous patch, we retain the BUSY feedback loop information for
>>> better merging.
>>
>> But it is kind of gross to workaround the new behaviour to
>> "permanently disallow direct dispatch of non read/write requests" by
>> always failing such requests back to DM for later immediate direct
>> dispatch. That bouncing of the request was acceptable when there was
>> load-based justification for having to retry (and in doing so: taking
>> the cost of freeing the clone request gotten via get_request() from
>> the underlying request_queues).
>>
>> Having to retry like this purely because the request isn't a read or
>> write seems costly.. every non-read-write will have implied
>> request_queue bouncing. In multipath's case: it could select an
>> entirely different underlying path the next time it is destaged (with
>> RQF_DONTPREP set). Which you'd think would negate all hope of IO
>> merging based performance improvements -- but that is a tangent I'll
>> need to ask Ming about (again).
>>
>> I really don't like this business of bouncing requests as a
>> workaround for the recent implementation of the corruption fix.
>>
>> Why not just add an override flag to _really_ allow direct dispatch
>> for _all_ types of requests?
>>
>> (just peeked at linux-block and it is looking like you took
>> jianchao.wang's series to avoid this hack... ;)
>>
>> Awesome.. my work is done for tonight!
>
> The whole point is doing something that is palatable to 4.20 and
> leaving the more experimental stuff to 4.21, where we have some weeks
> to verify that there are no conditions that cause IO stalls. I don't
> envision there will be, but I'm not willing to risk it this late in
> the 4.20 cycle.
>
> That said, this isn't a quick and dirty, and I don't think it's fair
> calling this a hack. Using RQF_DONTPREP is quite common in drivers to
> retain state over multiple ->queue_rq invocations. Using it to avoid
> multiple direct dispatch failures (and obviously this new livelock)
> seems fine to me.

But it bounces IO purely because it is non-read-write. That results in
guaranteed multiple blk_get_request() -- from underlying request_queues
request-based DM is stacked on -- for every non-read-write IO that is
cloned. That seems pathological. I must still be missing something.

> I really don't want to go around and audit every driver for potential
> retained state over special commands, that's why the read+write thing
> is in place. It's the safe option, which is what we need right now.

Maybe leave the blk_mq_request_issue_directly() interface how it is,
non-read-write restriction and all, but export a new
__blk_mq_request_issue_directly() that _only_ blk_insert_cloned_request()
-- and future comparable users -- makes use of?

To me that is the best of both worlds: fix the corruption issue but
don't impose needless blk_get_request() dances for non-read-write IO
issued to dm-multipath.

Mike
Re: [PATCH v2] block/dm: fix handling of busy off direct dispatch path
On 12/6/18 8:54 PM, Mike Snitzer wrote:
> On Thu, Dec 06 2018 at 9:49pm -0500, Jens Axboe wrote:
>
>> After the direct dispatch corruption fix, we permanently disallow
>> direct dispatch of non read/write requests. This works fine off the
>> normal IO path, as they will be retried like any other failed direct
>> dispatch request. But for the blk_insert_cloned_request() that only DM
>> uses to bypass the bottom level scheduler, we always first attempt
>> direct dispatch. For some types of requests, that's now a permanent
>> failure, and no amount of retrying will make that succeed.
>>
>> Use the driver private RQF_DONTPREP to track this condition in DM. If
>> we encounter a BUSY condition from blk_insert_cloned_request(), then
>> flag the request with RQF_DONTPREP. When we next see this request, ask
>> blk_insert_cloned_request() to bypass insert the request directly.
>> This avoids the livelock of repeatedly trying to direct dispatch a
>> request, while still retaining the BUSY feedback loop for blk-mq so
>> that we don't over-dispatch to the lower level queue and mess up
>> opportunities for merging on the DM queue.
>>
>> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
>> Reported-by: Bart Van Assche
>> Cc: sta...@vger.kernel.org
>> Signed-off-by: Jens Axboe
>>
>> ---
>>
>> This passes my testing as well, like the previous patch. But unlike
>> the previous patch, we retain the BUSY feedback loop information for
>> better merging.
>
> But it is kind of gross to workaround the new behaviour to "permanently
> disallow direct dispatch of non read/write requests" by always failing
> such requests back to DM for later immediate direct dispatch. That
> bouncing of the request was acceptable when there was load-based
> justification for having to retry (and in doing so: taking the cost of
> freeing the clone request gotten via get_request() from the underlying
> request_queues).
>
> Having to retry like this purely because the request isn't a read or
> write seems costly.. every non-read-write will have implied
> request_queue bouncing. In multipath's case: it could select an
> entirely different underlying path the next time it is destaged (with
> RQF_DONTPREP set). Which you'd think would negate all hope of IO
> merging based performance improvements -- but that is a tangent I'll
> need to ask Ming about (again).
>
> I really don't like this business of bouncing requests as a workaround
> for the recent implementation of the corruption fix.
>
> Why not just add an override flag to _really_ allow direct dispatch for
> _all_ types of requests?
>
> (just peeked at linux-block and it is looking like you took
> jianchao.wang's series to avoid this hack... ;)
>
> Awesome.. my work is done for tonight!

The whole point is doing something that is palatable to 4.20 and leaving
the more experimental stuff to 4.21, where we have some weeks to verify
that there are no conditions that cause IO stalls. I don't envision there
will be, but I'm not willing to risk it this late in the 4.20 cycle.

That said, this isn't a quick and dirty, and I don't think it's fair
calling this a hack. Using RQF_DONTPREP is quite common in drivers to
retain state over multiple ->queue_rq invocations. Using it to avoid
multiple direct dispatch failures (and obviously this new livelock)
seems fine to me.

I really don't want to go around and audit every driver for potential
retained state over special commands, that's why the read+write thing is
in place. It's the safe option, which is what we need right now.

-- 
Jens Axboe
Re: [GIT PULL] Follow up block fix
On Thu, Dec 6, 2018 at 6:12 PM Jens Axboe wrote:
>
> Linus, I just now noticed you are not on the CC for the discussion
> about the patch. Don't pull this one yet. It'll solve the issue, but
> it'll also mess up the BUSY feedback loop that DM relies on for good
> merging of sequential IO. Testing a new patch now, will push it to you
> when folks are happy and when my testing has completed.

Ok, this pull request dropped from my queue.

Linus
Re: [PATCH v2] block/dm: fix handling of busy off direct dispatch path
On Thu, Dec 06 2018 at 9:49pm -0500, Jens Axboe wrote:

> After the direct dispatch corruption fix, we permanently disallow
> direct dispatch of non read/write requests. This works fine off the
> normal IO path, as they will be retried like any other failed direct
> dispatch request. But for the blk_insert_cloned_request() that only DM
> uses to bypass the bottom level scheduler, we always first attempt
> direct dispatch. For some types of requests, that's now a permanent
> failure, and no amount of retrying will make that succeed.
>
> Use the driver private RQF_DONTPREP to track this condition in DM. If
> we encounter a BUSY condition from blk_insert_cloned_request(), then
> flag the request with RQF_DONTPREP. When we next see this request, ask
> blk_insert_cloned_request() to bypass insert the request directly.
> This avoids the livelock of repeatedly trying to direct dispatch a
> request, while still retaining the BUSY feedback loop for blk-mq so
> that we don't over-dispatch to the lower level queue and mess up
> opportunities for merging on the DM queue.
>
> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
> Reported-by: Bart Van Assche
> Cc: sta...@vger.kernel.org
> Signed-off-by: Jens Axboe
>
> ---
>
> This passes my testing as well, like the previous patch. But unlike the
> previous patch, we retain the BUSY feedback loop information for better
> merging.

But it is kind of gross to workaround the new behaviour to "permanently
disallow direct dispatch of non read/write requests" by always failing
such requests back to DM for later immediate direct dispatch. That
bouncing of the request was acceptable when there was load-based
justification for having to retry (and in doing so: taking the cost of
freeing the clone request gotten via get_request() from the underlying
request_queues).

Having to retry like this purely because the request isn't a read or
write seems costly.. every non-read-write will have implied
request_queue bouncing. In multipath's case: it could select an
entirely different underlying path the next time it is destaged (with
RQF_DONTPREP set). Which you'd think would negate all hope of IO
merging based performance improvements -- but that is a tangent I'll
need to ask Ming about (again).

I really don't like this business of bouncing requests as a workaround
for the recent implementation of the corruption fix.

Why not just add an override flag to _really_ allow direct dispatch for
_all_ types of requests?

(just peeked at linux-block and it is looking like you took
jianchao.wang's series to avoid this hack... ;)

Awesome.. my work is done for tonight!

Mike
Re: [PATCH] blk-mq: fix corruption with direct issue
On Thu, Dec 06, 2018 at 09:46:42PM -0500, Theodore Y. Ts'o wrote:
> On Wed, Dec 05, 2018 at 11:03:01AM +0800, Ming Lei wrote:
>>
>> But at that time, there isn't io scheduler for MQ, so in theory the
>> issue should be there since v4.11, especially 945ffb60c11d
>> ("mq-deadline: add blk-mq adaptation of the deadline IO scheduler").
>
> Hi Ming,
>
> How serious were you about this being (theoretically) an issue since
> 4.11? Can you talk about how it might get triggered, and how we can
> test for it? The reason why I ask is because we're trying to track
> down a mysterious file system corruption problem on a 4.14.x stable
> kernel. The symptoms are *very* eerily similar to kernel bugzilla
> #201685.

Hi Theodore,

It is just a theoretical analysis. blk_mq_try_issue_directly() is
called in two branches of blk_mq_make_request(), both on real MQ disks.
IO merge can be done on none or real io schedulers, so in theory the
risk might be there since v4.11, but IO merge on the sw queue didn't
work for quite a while until it was fixed by ab42f35d9cb5ac49b5a2.

As Jens mentioned in bugzilla, there are several conditions required
for triggering the issue:

- MQ device
- queue busy can be triggered. It is hard to trigger on NVMe PCI, but
  may be possible on NVMe FC. However, it can be quite easy to trigger
  on SCSI devices. We know there are some MQ SCSI HBAs: qlogic FC,
  megaraid_sas.
- IO merge is enabled.

I have set up scsi_debug in the following way:

	modprobe scsi_debug dev_size_mb=4096 clustering=1 \
		max_luns=1 submit_queues=2 max_queue=2

- submit_queues=2 may set this disk up as MQ
- max_queue=4 may trigger the queue busy condition easily

and run some write IO on ext4 over the disk: fio, kernel building, ...
for some time, but still can't trigger the data corruption once. I
should have created more LUNs, so that the queue may become busy more
easily; will do that soon.

> The problem is that it is super-rare --- roughly once a week out of a
> population of about 2500 systems. The workload is NFS serving.
> Unfortunately, since 4.14.63 we can no longer disable blk-mq for the
> virtio-scsi driver, thanks to the commit b5b6e8c8d3b4 ("scsi:
> virtio_scsi: fix IO hang caused by automatic irq vector affinity")
> getting backported into 4.14.63 as commit 70b522f163bbb32.

virtio_scsi supports multi-queue mode; if that is enabled in your
setting, you may try single queue and see if a difference can be made.
If multi-queue mode isn't enabled, your problem should be different
from this one. I remember multi-queue mode isn't enabled on virtio-scsi
in GCE.

> We're considering reverting this patch in our 4.14 LTS kernel, and
> seeing whether it makes the problem go away. Is there anything else
> you might suggest?

The IO hang is only triggered on machines with a special CPU topology,
so it should be fine to revert b5b6e8c8d3b4 on a normal VM. No other
suggestions yet.

Thanks,
Ming
Re: [PATCH] blk-mq: fix corruption with direct issue
On 12/6/18 7:46 PM, Theodore Y. Ts'o wrote:
> On Wed, Dec 05, 2018 at 11:03:01AM +0800, Ming Lei wrote:
>>
>> But at that time, there isn't io scheduler for MQ, so in theory the
>> issue should be there since v4.11, especially 945ffb60c11d
>> ("mq-deadline: add blk-mq adaptation of the deadline IO scheduler").
>
> Hi Ming,
>
> How serious were you about this being (theoretically) an issue since
> 4.11? Can you talk about how it might get triggered, and how we can
> test for it? The reason why I ask is because we're trying to track
> down a mysterious file system corruption problem on a 4.14.x stable
> kernel. The symptoms are *very* eerily similar to kernel bugzilla
> #201685.
>
> The problem is that it is super-rare --- roughly once a week out of a
> population of about 2500 systems. The workload is NFS serving.
> Unfortunately, since 4.14.63 we can no longer disable blk-mq for the
> virtio-scsi driver, thanks to the commit b5b6e8c8d3b4 ("scsi:
> virtio_scsi: fix IO hang caused by automatic irq vector affinity")
> getting backported into 4.14.63 as commit 70b522f163bbb32.
>
> We're considering reverting this patch in our 4.14 LTS kernel, and
> seeing whether it makes the problem go away. Is there anything else
> you might suggest?

We should just make SCSI do the right thing, which is to unprep if it
sees BUSY and prep again next time. Otherwise I fear the direct
dispatch isn't going to be super useful, if a failed direct dispatch
prevents future merging. This would be a lot less error prone as well
for other cases.

-- 
Jens Axboe
[PATCH V2] blk-mq: re-build queue map in case of kdump kernel
Now almost all .map_queues() implementations based on managed irq
affinity don't update the queue mapping; they just retrieve the old
built mapping, so if nr_hw_queues is changed, the mapping table
includes stale entries. Only blk_mq_map_queues() may rebuild the
mapping table.

One case is that we limit .nr_hw_queues to 1 for the kdump kernel.
However, drivers often build the queue mapping before allocating the
tagset via pci_alloc_irq_vectors_affinity(), while set->nr_hw_queues
can be set to 1 for the kdump kernel, so a wrong queue mapping is used,
and a kernel panic[1] is observed during booting.

This patch fixes the kernel panic triggered on nvme by rebuilding the
mapping table via blk_mq_map_queues().

[1] kernel panic log
[4.438371] nvme nvme0: 16/0/0 default/read/poll queues
[4.443277] BUG: unable to handle kernel NULL pointer dereference at 0098
[4.444681] PGD 0 P4D 0
[4.445367] Oops: [#1] SMP NOPTI
[4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459
[4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
[4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222
[4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6
[4.453132] RSP: 0018:c900023b3cd8 EFLAGS: 00010286
[4.454061] RAX: RBX: 888174448000 RCX: 0001
[4.456480] RDX: 0001 RSI: e8feffc506c0 RDI: 0001
[4.458750] RBP: 88810722d008 R08: 88817647a880 R09: 0002
[4.464580] R10: c900023b3c10 R11: 0004 R12: 888174448538
[4.467803] R13: 0004 R14: 0001 R15: 0001
[4.469220] FS: () GS:88817bac() knlGS:
[4.471554] CS: 0010 DS: ES: CR0: 80050033
[4.472464] CR2: 0098 CR3: 000174e4e001 CR4: 00760ee0
[4.474264] DR0: DR1: DR2:
[4.476007] DR3: DR6: fffe0ff0 DR7: 0400
[4.477061] PKRU: 5554
[4.477464] Call Trace:
[4.478731] blk_mq_init_allocated_queue+0x36a/0x3ad
[4.479595] blk_mq_init_queue+0x32/0x4e
[4.480178] nvme_validate_ns+0x98/0x623 [nvme_core]
[4.480963] ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core]
[4.481685] ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core]
[4.482601] nvme_scan_work+0x23a/0x29b [nvme_core]
[4.483269] ? _raw_spin_unlock_irqrestore+0x25/0x38
[4.483930] ? try_to_wake_up+0x38d/0x3b3
[4.484478] ? process_one_work+0x179/0x2fc
[4.485118] process_one_work+0x1d3/0x2fc
[4.485655] ? rescuer_thread+0x2ae/0x2ae
[4.486196] worker_thread+0x1e9/0x2be
[4.486841] kthread+0x115/0x11d
[4.487294] ? kthread_park+0x76/0x76
[4.487784] ret_from_fork+0x3a/0x50
[4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi ip_tables
[4.489428] Dumping ftrace buffer:
[4.489939]    (ftrace buffer empty)
[4.490492] CR2: 0098
[4.491052] ---[ end trace 03cd268ad5a86ff7 ]---

Cc: Christoph Hellwig
Cc: linux-n...@lists.infradead.org
Cc: David Milburn
Signed-off-by: Ming Lei
---
 block/blk-mq.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 900550594651..65770da99159 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2960,7 +2960,7 @@ static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 
 static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
 {
-	if (set->ops->map_queues) {
+	if (set->ops->map_queues && !is_kdump_kernel()) {
 		int i;
 
 		/*
@@ -3030,6 +3030,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	 */
 	if (is_kdump_kernel()) {
 		set->nr_hw_queues = 1;
+		set->nr_maps = 1;
 		set->queue_depth = min(64U, set->queue_depth);
 	}
 	/*
@@ -3051,7 +3052,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 					  GFP_KERNEL, set->numa_node);
 		if (!set->map[i].mq_map)
 			goto out_free_mq_map;
-		set->map[i].nr_queues = set->nr_hw_queues;
+		set->map[i].nr_queues = is_kdump_kernel() ? 1 : set->nr_hw_queues;
 	}
 
 	ret = blk_mq_update_queue_map(set);
-- 
2.9.5
Re: [PATCH] blk-mq: re-build queue map in case of kdump kernel
On Thu, Dec 06, 2018 at 07:57:34PM -0700, Jens Axboe wrote: > On 12/6/18 7:55 PM, Ming Lei wrote: > > Now almost all .map_queues() implementation based on managed irq > > affinity doesn't update queue mapping and it just retrieves the > > old built mapping, so if nr_hw_queues is changed, the mapping talbe > > includes stale mapping. And only blk_mq_map_queues() may rebuild > > the mapping talbe. > > > > One case is that we limit .nr_hw_queues as 1 in case of kdump kernel. > > However, drivers often builds queue mapping before allocating tagset > > via pci_alloc_irq_vectors_affinity(), but set->nr_hw_queues can be set > > as 1 in case of kdump kernel, so wrong queue mapping is used, and > > kernel panic[1] is observed during booting. > > > > This patch fixes the kernel panic triggerd on nvme by rebulding the > > mapping table via blk_mq_map_queues(). > > > > [1] kernel panic log > > [4.438371] nvme nvme0: 16/0/0 default/read/poll queues > > [4.443277] BUG: unable to handle kernel NULL pointer dereference at > > 0098 > > [4.444681] PGD 0 P4D 0 > > [4.445367] Oops: [#1] SMP NOPTI > > [4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted > > 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459 > > [4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS > > 1.10.2-2.fc27 04/01/2014 > > [4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core] > > [4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222 > > [4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 > > e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 > > <48> 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6 > > [4.453132] RSP: 0018:c900023b3cd8 EFLAGS: 00010286 > > [4.454061] RAX: RBX: 888174448000 RCX: > > 0001 > > [4.456480] RDX: 0001 RSI: e8feffc506c0 RDI: > > 0001 > > [4.458750] RBP: 88810722d008 R08: 88817647a880 R09: > > 0002 > > [4.464580] R10: c900023b3c10 R11: 0004 R12: > > 888174448538 > > [4.467803] R13: 0004 R14: 0001 R15: > > 0001 > > 
[4.469220] FS: () GS:88817bac() > > knlGS: > > [4.471554] CS: 0010 DS: ES: CR0: 80050033 > > [4.472464] CR2: 0098 CR3: 000174e4e001 CR4: > > 00760ee0 > > [4.474264] DR0: DR1: DR2: > > > > [4.476007] DR3: DR6: fffe0ff0 DR7: > > 0400 > > [4.477061] PKRU: 5554 > > [4.477464] Call Trace: > > [4.478731] blk_mq_init_allocated_queue+0x36a/0x3ad > > [4.479595] blk_mq_init_queue+0x32/0x4e > > [4.480178] nvme_validate_ns+0x98/0x623 [nvme_core] > > [4.480963] ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core] > > [4.481685] ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core] > > [4.482601] nvme_scan_work+0x23a/0x29b [nvme_core] > > [4.483269] ? _raw_spin_unlock_irqrestore+0x25/0x38 > > [4.483930] ? try_to_wake_up+0x38d/0x3b3 > > [4.484478] ? process_one_work+0x179/0x2fc > > [4.485118] process_one_work+0x1d3/0x2fc > > [4.485655] ? rescuer_thread+0x2ae/0x2ae > > [4.486196] worker_thread+0x1e9/0x2be > > [4.486841] kthread+0x115/0x11d > > [4.487294] ? kthread_park+0x76/0x76 > > [4.487784] ret_from_fork+0x3a/0x50 > > [4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi > > ip_tables > > [4.489428] Dumping ftrace buffer: > > [4.489939](ftrace buffer empty) > > [4.490492] CR2: 0098 > > [4.491052] ---[ end trace 03cd268ad5a86ff7 ]--- > > > > Cc: Christoph Hellwig > > Cc: linux-n...@lists.infradead.org > > Cc: David Milburn > > Signed-off-by: Ming Lei > > --- > > block/blk-mq.c | 12 +--- > > 1 file changed, 9 insertions(+), 3 deletions(-) > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c > > index 900550594651..a3e463a726a6 100644 > > --- a/block/blk-mq.c > > +++ b/block/blk-mq.c > > @@ -38,6 +38,11 @@ > > #include "blk-mq-sched.h" > > #include "blk-rq-qos.h" > > > > +static inline bool blk_mq_kdump_kernel(void) > > +{ > > + return !!is_kdump_kernel(); > > +} > > Let's drop the redundant !! here, and the wrapper? I would imagine the > wrapper is handy for testing outside of kdump, but I don't think we > should include it in the final. OK. Thanks, Ming
Re: [PATCH] blk-mq: re-build queue map in case of kdump kernel
On 12/6/18 7:55 PM, Ming Lei wrote: > Now almost all .map_queues() implementation based on managed irq > affinity doesn't update queue mapping and it just retrieves the > old built mapping, so if nr_hw_queues is changed, the mapping talbe > includes stale mapping. And only blk_mq_map_queues() may rebuild > the mapping talbe. > > One case is that we limit .nr_hw_queues as 1 in case of kdump kernel. > However, drivers often builds queue mapping before allocating tagset > via pci_alloc_irq_vectors_affinity(), but set->nr_hw_queues can be set > as 1 in case of kdump kernel, so wrong queue mapping is used, and > kernel panic[1] is observed during booting. > > This patch fixes the kernel panic triggerd on nvme by rebulding the > mapping table via blk_mq_map_queues(). > > [1] kernel panic log > [4.438371] nvme nvme0: 16/0/0 default/read/poll queues > [4.443277] BUG: unable to handle kernel NULL pointer dereference at > 0098 > [4.444681] PGD 0 P4D 0 > [4.445367] Oops: [#1] SMP NOPTI > [4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted > 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459 > [4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS > 1.10.2-2.fc27 04/01/2014 > [4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core] > [4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222 > [4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 > 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b > b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6 > [4.453132] RSP: 0018:c900023b3cd8 EFLAGS: 00010286 > [4.454061] RAX: RBX: 888174448000 RCX: > 0001 > [4.456480] RDX: 0001 RSI: e8feffc506c0 RDI: > 0001 > [4.458750] RBP: 88810722d008 R08: 88817647a880 R09: > 0002 > [4.464580] R10: c900023b3c10 R11: 0004 R12: > 888174448538 > [4.467803] R13: 0004 R14: 0001 R15: > 0001 > [4.469220] FS: () GS:88817bac() > knlGS: > [4.471554] CS: 0010 DS: ES: CR0: 80050033 > [4.472464] CR2: 0098 CR3: 000174e4e001 CR4: > 00760ee0 > [4.474264] 
DR0: DR1: DR2: > > [4.476007] DR3: DR6: fffe0ff0 DR7: > 0400 > [4.477061] PKRU: 5554 > [4.477464] Call Trace: > [4.478731] blk_mq_init_allocated_queue+0x36a/0x3ad > [4.479595] blk_mq_init_queue+0x32/0x4e > [4.480178] nvme_validate_ns+0x98/0x623 [nvme_core] > [4.480963] ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core] > [4.481685] ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core] > [4.482601] nvme_scan_work+0x23a/0x29b [nvme_core] > [4.483269] ? _raw_spin_unlock_irqrestore+0x25/0x38 > [4.483930] ? try_to_wake_up+0x38d/0x3b3 > [4.484478] ? process_one_work+0x179/0x2fc > [4.485118] process_one_work+0x1d3/0x2fc > [4.485655] ? rescuer_thread+0x2ae/0x2ae > [4.486196] worker_thread+0x1e9/0x2be > [4.486841] kthread+0x115/0x11d > [4.487294] ? kthread_park+0x76/0x76 > [4.487784] ret_from_fork+0x3a/0x50 > [4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi > ip_tables > [4.489428] Dumping ftrace buffer: > [4.489939](ftrace buffer empty) > [4.490492] CR2: 0098 > [4.491052] ---[ end trace 03cd268ad5a86ff7 ]--- > > Cc: Christoph Hellwig > Cc: linux-n...@lists.infradead.org > Cc: David Milburn > Signed-off-by: Ming Lei > --- > block/blk-mq.c | 12 +--- > 1 file changed, 9 insertions(+), 3 deletions(-) > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 900550594651..a3e463a726a6 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -38,6 +38,11 @@ > #include "blk-mq-sched.h" > #include "blk-rq-qos.h" > > +static inline bool blk_mq_kdump_kernel(void) > +{ > + return !!is_kdump_kernel(); > +} Let's drop the redundant !! here, and the wrapper? I would imagine the wrapper is handy for testing outside of kdump, but I don't think we should include it in the final. Otherwise this looks fine, I can test it here too. -- Jens Axboe
[PATCH] blk-mq: re-build queue map in case of kdump kernel
Now almost all .map_queues() implementations based on managed irq
affinity don't update the queue mapping; they just retrieve the
previously built mapping, so if nr_hw_queues is changed, the mapping
table contains stale entries. Only blk_mq_map_queues() can rebuild
the mapping table.

One such case is that we limit .nr_hw_queues to 1 for a kdump kernel.
However, drivers often build the queue mapping before allocating the
tagset via pci_alloc_irq_vectors_affinity(), while set->nr_hw_queues
may be forced to 1 for the kdump kernel, so a wrong queue mapping is
used and a kernel panic[1] is observed during boot.

This patch fixes the kernel panic triggered on nvme by rebuilding the
mapping table via blk_mq_map_queues().

[1] kernel panic log
[4.438371] nvme nvme0: 16/0/0 default/read/poll queues
[4.443277] BUG: unable to handle kernel NULL pointer dereference at 0098
[4.444681] PGD 0 P4D 0
[4.445367] Oops: [#1] SMP NOPTI
[4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459
[4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
[4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222
[4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6
[4.453132] RSP: 0018:c900023b3cd8 EFLAGS: 00010286
[4.454061] RAX: RBX: 888174448000 RCX: 0001
[4.456480] RDX: 0001 RSI: e8feffc506c0 RDI: 0001
[4.458750] RBP: 88810722d008 R08: 88817647a880 R09: 0002
[4.464580] R10: c900023b3c10 R11: 0004 R12: 888174448538
[4.467803] R13: 0004 R14: 0001 R15: 0001
[4.469220] FS: () GS:88817bac() knlGS:
[4.471554] CS: 0010 DS: ES: CR0: 80050033
[4.472464] CR2: 0098 CR3: 000174e4e001 CR4: 00760ee0
[4.474264] DR0: DR1: DR2:
[4.476007] DR3: DR6: fffe0ff0 DR7: 0400
[4.477061] PKRU: 5554
[4.477464] Call Trace:
[4.478731] blk_mq_init_allocated_queue+0x36a/0x3ad
[4.479595] blk_mq_init_queue+0x32/0x4e
[4.480178] nvme_validate_ns+0x98/0x623 [nvme_core]
[4.480963] ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core]
[4.481685] ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core]
[4.482601] nvme_scan_work+0x23a/0x29b [nvme_core]
[4.483269] ? _raw_spin_unlock_irqrestore+0x25/0x38
[4.483930] ? try_to_wake_up+0x38d/0x3b3
[4.484478] ? process_one_work+0x179/0x2fc
[4.485118] process_one_work+0x1d3/0x2fc
[4.485655] ? rescuer_thread+0x2ae/0x2ae
[4.486196] worker_thread+0x1e9/0x2be
[4.486841] kthread+0x115/0x11d
[4.487294] ? kthread_park+0x76/0x76
[4.487784] ret_from_fork+0x3a/0x50
[4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi ip_tables
[4.489428] Dumping ftrace buffer:
[4.489939]    (ftrace buffer empty)
[4.490492] CR2: 0098
[4.491052] ---[ end trace 03cd268ad5a86ff7 ]---

Cc: Christoph Hellwig
Cc: linux-n...@lists.infradead.org
Cc: David Milburn
Signed-off-by: Ming Lei
---
 block/blk-mq.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 900550594651..a3e463a726a6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -38,6 +38,11 @@
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
 
+static inline bool blk_mq_kdump_kernel(void)
+{
+	return !!is_kdump_kernel();
+}
+
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
 
@@ -2960,7 +2965,7 @@ static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 
 static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
 {
-	if (set->ops->map_queues) {
+	if (set->ops->map_queues && !blk_mq_kdump_kernel()) {
 		int i;
 
 		/*
@@ -3028,8 +3033,9 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	 * memory constrained environment. Limit us to 1 queue and
 	 * 64 tags to prevent using too much memory.
 	 */
-	if (is_kdump_kernel()) {
+	if (blk_mq_kdump_kernel()) {
 		set->nr_hw_queues = 1;
+		set->nr_maps = 1;
 		set->queue_depth = min(64U, set->queue_depth);
 	}
 	/*
@@ -3051,7 +3057,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 			GFP_KERNEL, set->numa_node);
 		if (!set->map[i].mq_map)
 			goto out_free_mq_map;
-
[PATCH v2] block/dm: fix handling of busy off direct dispatch path
After the direct dispatch corruption fix, we permanently disallow direct
dispatch of non read/write requests. This works fine off the normal IO
path, as they will be retried like any other failed direct dispatch
request. But for blk_insert_cloned_request(), which only DM uses to
bypass the bottom level scheduler, we always first attempt direct
dispatch. For some types of requests, that is now a permanent failure,
and no amount of retrying will make it succeed.

Use the driver-private RQF_DONTPREP flag to track this condition in DM.
If we encounter a BUSY condition from blk_insert_cloned_request(), flag
the request with RQF_DONTPREP. The next time we see this request, ask
blk_insert_cloned_request() to use a bypass insert instead of direct
dispatch. This avoids the livelock of repeatedly trying to direct
dispatch a request, while still retaining the BUSY feedback loop for
blk-mq so that we don't over-dispatch to the lower level queue and mess
up opportunities for merging on the DM queue.

Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
Reported-by: Bart Van Assche
Cc: sta...@vger.kernel.org
Signed-off-by: Jens Axboe
---

This passes my testing as well, like the previous patch. But unlike the
previous patch, we retain the BUSY feedback loop information for better
merging.
diff --git a/block/blk-core.c b/block/blk-core.c index deb56932f8c4..cccda51e165f 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2617,7 +2617,8 @@ static int blk_cloned_rq_check_limits(struct request_queue *q, * @q: the queue to submit the request * @rq: the request being queued */ -blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *rq) +blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *rq, + bool force_insert) { unsigned long flags; int where = ELEVATOR_INSERT_BACK; @@ -2637,7 +2638,11 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request * * bypass a potential scheduler on the bottom device for * insert. */ - return blk_mq_request_issue_directly(rq); + if (force_insert) { + blk_mq_request_bypass_insert(rq, true); + return BLK_STS_OK; + } else + return blk_mq_request_issue_directly(rq); } spin_lock_irqsave(q->queue_lock, flags); diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c index 7cd36e4d1310..e497a2ab6766 100644 --- a/drivers/md/dm-rq.c +++ b/drivers/md/dm-rq.c @@ -299,16 +299,20 @@ static void end_clone_request(struct request *clone, blk_status_t error) static blk_status_t dm_dispatch_clone_request(struct request *clone, struct request *rq) { + bool was_busy = (rq->rq_flags & RQF_DONTPREP) != 0; blk_status_t r; if (blk_queue_io_stat(clone->q)) clone->rq_flags |= RQF_IO_STAT; clone->start_time_ns = ktime_get_ns(); - r = blk_insert_cloned_request(clone->q, clone); - if (r != BLK_STS_OK && r != BLK_STS_RESOURCE && r != BLK_STS_DEV_RESOURCE) + r = blk_insert_cloned_request(clone->q, clone, was_busy); + if (r == BLK_STS_RESOURCE || r == BLK_STS_DEV_RESOURCE) + rq->rq_flags |= RQF_DONTPREP; + else if (r != BLK_STS_OK) /* must complete clone in terms of original request */ dm_complete_request(rq, r); + return r; } diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 4293dc1cd160..7cb84ee4c9f4 100644 --- a/include/linux/blkdev.h +++ 
b/include/linux/blkdev.h @@ -994,7 +994,7 @@ extern int blk_rq_prep_clone(struct request *rq, struct request *rq_src, void *data); extern void blk_rq_unprep_clone(struct request *rq); extern blk_status_t blk_insert_cloned_request(struct request_queue *q, -struct request *rq); +struct request *rq, bool force_insert); extern int blk_rq_append_bio(struct request *rq, struct bio **bio); extern void blk_delay_queue(struct request_queue *, unsigned long); extern void blk_queue_split(struct request_queue *, struct bio **); -- Jens Axboe
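The RQF_DONTPREP handshake in the patch above can be modeled in a few lines: the first dispatch attempts direct issue; a BUSY result is propagated back (preserving the feedback loop) and remembered in a request flag; the retry then takes the bypass-insert path instead of re-attempting direct dispatch, so it cannot livelock. This is a hedged userspace sketch with hypothetical "toy" names, not the kernel API.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins; not the kernel's definitions. */
enum toy_status { TOY_STS_OK, TOY_STS_RESOURCE };

#define TOY_RQF_DONTPREP 0x1u

struct toy_request {
	unsigned int rq_flags;
};

/* Models blk_mq_request_issue_directly(): succeeds unless the lower
 * queue reports BUSY. */
static enum toy_status toy_issue_directly(bool lower_busy)
{
	return lower_busy ? TOY_STS_RESOURCE : TOY_STS_OK;
}

/* Models the dm-rq side of the patch: remember a BUSY result in a
 * request flag, and use that flag to force a bypass insert on the
 * next dispatch attempt. */
static enum toy_status toy_dispatch(struct toy_request *rq, bool lower_busy,
				    bool *bypassed)
{
	enum toy_status r;

	if (rq->rq_flags & TOY_RQF_DONTPREP) {
		/* models blk_mq_request_bypass_insert(): always succeeds */
		*bypassed = true;
		return TOY_STS_OK;
	}

	*bypassed = false;
	r = toy_issue_directly(lower_busy);
	if (r == TOY_STS_RESOURCE)
		rq->rq_flags |= TOY_RQF_DONTPREP;	/* remember BUSY */
	return r;
}
```

The first call with a busy lower queue returns BUSY and sets the flag; the second call bypass-inserts and returns OK even if the lower queue is still busy.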
Re: [PATCH] blk-mq: fix corruption with direct issue
On Wed, Dec 05, 2018 at 11:03:01AM +0800, Ming Lei wrote:
> But at that time, there isn't io scheduler for MQ, so in theory the
> issue should be there since v4.11, especially 945ffb60c11d ("mq-deadline:
> add blk-mq adaptation of the deadline IO scheduler").

Hi Ming,

How serious were you about this being (theoretically) an issue since
4.11? Can you talk about how it might get triggered, and how we can
test for it?

The reason I ask is that we're trying to track down a mysterious file
system corruption problem on a 4.14.x stable kernel. The symptoms are
*very* eerily similar to kernel bugzilla #201685. The problem is that
the problem is super-rare --- roughly once a week out of a population
of about 2500 systems. The workload is NFS serving.

Unfortunately, since 4.14.63 we can no longer disable blk-mq for the
virtio-scsi driver, thanks to commit b5b6e8c8d3b4 ("scsi: virtio_scsi:
fix IO hang caused by automatic irq vector affinity") getting
backported into 4.14.63 as commit 70b522f163bbb32. We're considering
reverting this patch in our 4.14 LTS kernel and seeing whether it makes
the problem go away. Is there anything else you might suggest?

Thanks,

- Ted

P.S. Unlike the repros that users were seeing in #201685, we *did*
have an I/O scheduler enabled --- it was mq-deadline. But right now,
given your comments and the corruptions that we're seeing, I'm not
feeling very warm and fuzzy about blk-mq. :-( :-( :-(
Re: block: fix direct dispatch issue failure for clones
On 12/6/18 7:16 PM, Jens Axboe wrote: > On 12/6/18 7:06 PM, Jens Axboe wrote: >> On 12/6/18 6:58 PM, Mike Snitzer wrote: There is another way to fix this - still do the direct dispatch, but have dm track if it failed and do bypass insert in that case. I didn't want do to that since it's more involved, but it's doable. Let me cook that up and test it... Don't like it, though. >>> >>> Not following how DM can track if issuing the request worked if it is >>> always told it worked with BLK_STS_OK. We care about feedback when the >>> request is actually issued because of the elaborate way blk-mq elevators >>> work. DM is forced to worry about all these details, as covered some in >>> the header for commit 396eaf21ee17c476e8f66249fb1f4a39003d0ab4, it is >>> trying to have its cake and eat it too. It just wants IO scheduling to >>> work for request-based DM devices. That's it. >> >> It needs the feedback, I don't disagree on that part at all. If we >> always return OK, then that loop is broken. How about something like the >> below? Totally untested right now... >> >> We track if a request ever saw BLK_STS_RESOURCE from direct dispatch, >> and if it did, we store that information with RQF_DONTPREP. When we then >> next time go an insert a request, if it has RQF_DONTPREP set, then we >> ask blk_insert_cloned_request() to bypass insert. >> >> I'll go test this now. > > Passes the test case for me. Here's one that doesn't re-arrange the return value check into a switch. Turns out cleaner (and less LOC changes), and also doesn't fiddle with request after freeing it if we got OK return... Will give this a whirl too, just in case. 
diff --git a/block/blk-core.c b/block/blk-core.c index deb56932f8c4..cccda51e165f 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2617,7 +2617,8 @@ static int blk_cloned_rq_check_limits(struct request_queue *q, * @q: the queue to submit the request * @rq: the request being queued */ -blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *rq) +blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *rq, + bool force_insert) { unsigned long flags; int where = ELEVATOR_INSERT_BACK; @@ -2637,7 +2638,11 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request * * bypass a potential scheduler on the bottom device for * insert. */ - return blk_mq_request_issue_directly(rq); + if (force_insert) { + blk_mq_request_bypass_insert(rq, true); + return BLK_STS_OK; + } else + return blk_mq_request_issue_directly(rq); } spin_lock_irqsave(q->queue_lock, flags); diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c index 7cd36e4d1310..e497a2ab6766 100644 --- a/drivers/md/dm-rq.c +++ b/drivers/md/dm-rq.c @@ -299,16 +299,20 @@ static void end_clone_request(struct request *clone, blk_status_t error) static blk_status_t dm_dispatch_clone_request(struct request *clone, struct request *rq) { + bool was_busy = (rq->rq_flags & RQF_DONTPREP) != 0; blk_status_t r; if (blk_queue_io_stat(clone->q)) clone->rq_flags |= RQF_IO_STAT; clone->start_time_ns = ktime_get_ns(); - r = blk_insert_cloned_request(clone->q, clone); - if (r != BLK_STS_OK && r != BLK_STS_RESOURCE && r != BLK_STS_DEV_RESOURCE) + r = blk_insert_cloned_request(clone->q, clone, was_busy); + if (r == BLK_STS_RESOURCE || r == BLK_STS_DEV_RESOURCE) + rq->rq_flags |= RQF_DONTPREP; + else if (r != BLK_STS_OK) /* must complete clone in terms of original request */ dm_complete_request(rq, r); + return r; } diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 4293dc1cd160..7cb84ee4c9f4 100644 --- a/include/linux/blkdev.h +++ 
b/include/linux/blkdev.h @@ -994,7 +994,7 @@ extern int blk_rq_prep_clone(struct request *rq, struct request *rq_src, void *data); extern void blk_rq_unprep_clone(struct request *rq); extern blk_status_t blk_insert_cloned_request(struct request_queue *q, -struct request *rq); +struct request *rq, bool force_insert); extern int blk_rq_append_bio(struct request *rq, struct bio **bio); extern void blk_delay_queue(struct request_queue *, unsigned long); extern void blk_queue_split(struct request_queue *, struct bio **); -- Jens Axboe
Re: block: fix direct dispatch issue failure for clones
On 12/6/18 7:06 PM, Jens Axboe wrote:
> On 12/6/18 6:58 PM, Mike Snitzer wrote:
>>> There is another way to fix this - still do the direct dispatch, but have
>>> dm track if it failed and do bypass insert in that case. I didn't want do
>>> to that since it's more involved, but it's doable.
>>>
>>> Let me cook that up and test it... Don't like it, though.
>>
>> Not following how DM can track if issuing the request worked if it is
>> always told it worked with BLK_STS_OK. We care about feedback when the
>> request is actually issued because of the elaborate way blk-mq elevators
>> work. DM is forced to worry about all these details, as covered some in
>> the header for commit 396eaf21ee17c476e8f66249fb1f4a39003d0ab4, it is
>> trying to have its cake and eat it too. It just wants IO scheduling to
>> work for request-based DM devices. That's it.
>
> It needs the feedback, I don't disagree on that part at all. If we
> always return OK, then that loop is broken. How about something like the
> below? Totally untested right now...
>
> We track if a request ever saw BLK_STS_RESOURCE from direct dispatch,
> and if it did, we store that information with RQF_DONTPREP. When we then
> next time go and insert a request, if it has RQF_DONTPREP set, then we
> ask blk_insert_cloned_request() to bypass insert.
>
> I'll go test this now.

Passes the test case for me.

-- 
Jens Axboe
Re: [GIT PULL] Follow up block fix
On 12/6/18 5:33 PM, Jens Axboe wrote:
> On 12/6/18 5:31 PM, Bart Van Assche wrote:
>> On Thu, 2018-12-06 at 17:26 -0700, Jens Axboe wrote:
>>> On 12/6/18 5:20 PM, Bart Van Assche wrote:
>>>> Which branch does that tag correspond to? Even after having run git
>>>> fetch --tags I can't find that tag in your repository.
>>>
>>> I pushed it before I sent the email, where are you looking?
>>>
>>> http://git.kernel.dk/cgit/linux-block/tag/?h=for-linus-20181206
>>
>> On my development system git has been configured to fetch from the
>> following repository:
>> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/
>>
>> Should I fetch your branches from the git.kernel.dk repository instead?
>
> Ah I see, I always push to git.kernel.dk, and that syncs to git.kernel.org
> only every hour or so. So either one will work, but you might be a bit
> behind if you use git.kernel.org.

Linus, I just now noticed you are not on the CC for the discussion about
the patch. Don't pull this one yet. It'll solve the issue, but it'll also
mess up the BUSY feedback loop that DM relies on for good merging of
sequential IO. Testing a new patch now, will push it to you when folks
are happy and when my testing has completed.

-- 
Jens Axboe
Re: block: fix direct dispatch issue failure for clones
On Thu, Dec 06 2018 at 8:58pm -0500, Mike Snitzer wrote:
> DM is forced to worry about all these details, as covered some in
> the header for commit 396eaf21ee17c476e8f66249fb1f4a39003d0ab4, it is
> trying to have its cake and eat it too.

Gah, obviously meant: DM is _NOT_ trying to have its cake and eat it too.

> It just wants IO scheduling to work for request-based DM devices.
Re: block: fix direct dispatch issue failure for clones
On 12/6/18 6:58 PM, Mike Snitzer wrote: >> There is another way to fix this - still do the direct dispatch, but have >> dm track if it failed and do bypass insert in that case. I didn't want do >> to that since it's more involved, but it's doable. >> >> Let me cook that up and test it... Don't like it, though. > > Not following how DM can track if issuing the request worked if it is > always told it worked with BLK_STS_OK. We care about feedback when the > request is actually issued because of the elaborate way blk-mq elevators > work. DM is forced to worry about all these details, as covered some in > the header for commit 396eaf21ee17c476e8f66249fb1f4a39003d0ab4, it is > trying to have its cake and eat it too. It just wants IO scheduling to > work for request-based DM devices. That's it. It needs the feedback, I don't disagree on that part at all. If we always return OK, then that loop is broken. How about something like the below? Totally untested right now... We track if a request ever saw BLK_STS_RESOURCE from direct dispatch, and if it did, we store that information with RQF_DONTPREP. When we then next time go an insert a request, if it has RQF_DONTPREP set, then we ask blk_insert_cloned_request() to bypass insert. I'll go test this now. diff --git a/block/blk-core.c b/block/blk-core.c index deb56932f8c4..cccda51e165f 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2617,7 +2617,8 @@ static int blk_cloned_rq_check_limits(struct request_queue *q, * @q: the queue to submit the request * @rq: the request being queued */ -blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *rq) +blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *rq, + bool force_insert) { unsigned long flags; int where = ELEVATOR_INSERT_BACK; @@ -2637,7 +2638,11 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request * * bypass a potential scheduler on the bottom device for * insert. 
*/ - return blk_mq_request_issue_directly(rq); + if (force_insert) { + blk_mq_request_bypass_insert(rq, true); + return BLK_STS_OK; + } else + return blk_mq_request_issue_directly(rq); } spin_lock_irqsave(q->queue_lock, flags); diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c index 7cd36e4d1310..8d4c5020ccaa 100644 --- a/drivers/md/dm-rq.c +++ b/drivers/md/dm-rq.c @@ -299,16 +299,27 @@ static void end_clone_request(struct request *clone, blk_status_t error) static blk_status_t dm_dispatch_clone_request(struct request *clone, struct request *rq) { + bool was_busy = (rq->rq_flags & RQF_DONTPREP) != 0; blk_status_t r; if (blk_queue_io_stat(clone->q)) clone->rq_flags |= RQF_IO_STAT; clone->start_time_ns = ktime_get_ns(); - r = blk_insert_cloned_request(clone->q, clone); - if (r != BLK_STS_OK && r != BLK_STS_RESOURCE && r != BLK_STS_DEV_RESOURCE) + + r = blk_insert_cloned_request(clone->q, clone, was_busy); + switch (r) { + default: /* must complete clone in terms of original request */ dm_complete_request(rq, r); + /* fall through */ + case BLK_STS_RESOURCE: + case BLK_STS_DEV_RESOURCE: + rq->rq_flags |= RQF_DONTPREP; + case BLK_STS_OK: + break; + } + return r; } diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 4293dc1cd160..7cb84ee4c9f4 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -994,7 +994,7 @@ extern int blk_rq_prep_clone(struct request *rq, struct request *rq_src, void *data); extern void blk_rq_unprep_clone(struct request *rq); extern blk_status_t blk_insert_cloned_request(struct request_queue *q, -struct request *rq); +struct request *rq, bool force_insert); extern int blk_rq_append_bio(struct request *rq, struct bio **bio); extern void blk_delay_queue(struct request_queue *, unsigned long); extern void blk_queue_split(struct request_queue *, struct bio **); -- Jens Axboe
Re: [PATCH] block: fix direct dispatch issue failure for clones
On 12/7/18 9:34 AM, Jens Axboe wrote: > On 12/6/18 6:22 PM, jianchao.wang wrote: >> >> >> On 12/7/18 9:13 AM, Jens Axboe wrote: >>> On 12/6/18 6:04 PM, jianchao.wang wrote: On 12/7/18 6:20 AM, Jens Axboe wrote: > After the direct dispatch corruption fix, we permanently disallow direct > dispatch of non read/write requests. This works fine off the normal IO > path, as they will be retried like any other failed direct dispatch > request. But for the blk_insert_cloned_request() that only DM uses to > bypass the bottom level scheduler, we always first attempt direct > dispatch. For some types of requests, that's now a permanent failure, > and no amount of retrying will make that succeed. > > Don't use direct dispatch off the cloned insert path, always just use > bypass inserts. This still bypasses the bottom level scheduler, which is > what DM wants. > > Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue") > Signed-off-by: Jens Axboe > > --- > > diff --git a/block/blk-core.c b/block/blk-core.c > index deb56932f8c4..4c44e6fa0d08 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct > request_queue *q, struct request * >* bypass a potential scheduler on the bottom device for >* insert. >*/ > - return blk_mq_request_issue_directly(rq); > + blk_mq_request_bypass_insert(rq, true); > + return BLK_STS_OK; > } > > spin_lock_irqsave(q->queue_lock, flags); > Not sure about this because it will break the merging promotion for request based DM from Ming. 396eaf21ee17c476e8f66249fb1f4a39003d0ab4 (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback) We could use some other way to fix this. >>> >>> That really shouldn't matter as this is the cloned insert, merging should >>> have been done on the original request. >>> >>> >> Just quote some comments from the patch. >> >> " >>But dm-rq currently can't get the underlying queue's >> dispatch feedback at all. 
Without knowing whether a request was issued >> or not (e.g. due to underlying queue being busy) the dm-rq elevator will >> not be able to provide effective IO merging (as a side-effect of dm-rq >> currently blindly destaging a request from its elevator only to requeue >> it after a delay, which kills any opportunity for merging). This >> obviously causes very bad sequential IO performance. >> ... >> With this, request-based DM's blk-mq sequential IO performance is vastly >> improved (as much as 3X in mpath/virtio-scsi testing) >> " >> >> Using blk_mq_request_bypass_insert to replace the >> blk_mq_request_issue_directly >> could be a fast method to fix the current issue. Maybe we could get the >> merging >> promotion back after some time. > > This really sucks, mostly because DM wants to have it both ways - not use > the bottom level IO scheduler, but still actually use it if it makes sense. > > There is another way to fix this - still do the direct dispatch, but have > dm track if it failed and do bypass insert in that case. I didn't want do > to that since it's more involved, but it's doable. > > Let me cook that up and test it... Don't like it, though. > Actually, I have tried to fix this issue in the 1st patch of my patchset blk-mq: refactor code of issue directly. Just insert the non-read-write command into dispatch list directly and return BLK_STS_OK. Thanks Jianchao
Re: block: fix direct dispatch issue failure for clones
On Thu, Dec 06 2018 at 8:34pm -0500, Jens Axboe wrote: > On 12/6/18 6:22 PM, jianchao.wang wrote: > > > > > > On 12/7/18 9:13 AM, Jens Axboe wrote: > >> On 12/6/18 6:04 PM, jianchao.wang wrote: > >>> > >>> > >>> On 12/7/18 6:20 AM, Jens Axboe wrote: > After the direct dispatch corruption fix, we permanently disallow direct > dispatch of non read/write requests. This works fine off the normal IO > path, as they will be retried like any other failed direct dispatch > request. But for the blk_insert_cloned_request() that only DM uses to > bypass the bottom level scheduler, we always first attempt direct > dispatch. For some types of requests, that's now a permanent failure, > and no amount of retrying will make that succeed. > > Don't use direct dispatch off the cloned insert path, always just use > bypass inserts. This still bypasses the bottom level scheduler, which is > what DM wants. > > Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue") > Signed-off-by: Jens Axboe > > --- > > diff --git a/block/blk-core.c b/block/blk-core.c > index deb56932f8c4..4c44e6fa0d08 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct > request_queue *q, struct request * > * bypass a potential scheduler on the bottom device for > * insert. > */ > -return blk_mq_request_issue_directly(rq); > +blk_mq_request_bypass_insert(rq, true); > +return BLK_STS_OK; > } > > spin_lock_irqsave(q->queue_lock, flags); > > >>> Not sure about this because it will break the merging promotion for > >>> request based DM > >>> from Ming. > >>> 396eaf21ee17c476e8f66249fb1f4a39003d0ab4 > >>> (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request > >>> feedback) > >>> > >>> We could use some other way to fix this. > >> > >> That really shouldn't matter as this is the cloned insert, merging should > >> have been done on the original request. > >> > >> > > Just quote some comments from the patch. 
> > > > " > >But dm-rq currently can't get the underlying queue's > > dispatch feedback at all. Without knowing whether a request was issued > > or not (e.g. due to underlying queue being busy) the dm-rq elevator will > > not be able to provide effective IO merging (as a side-effect of dm-rq > > currently blindly destaging a request from its elevator only to requeue > > it after a delay, which kills any opportunity for merging). This > > obviously causes very bad sequential IO performance. > > ... > > With this, request-based DM's blk-mq sequential IO performance is vastly > > improved (as much as 3X in mpath/virtio-scsi testing) > > " > > > > Using blk_mq_request_bypass_insert to replace the > > blk_mq_request_issue_directly > > could be a fast method to fix the current issue. Maybe we could get the > > merging > > promotion back after some time. > > This really sucks, mostly because DM wants to have it both ways - not use > the bottom level IO scheduler, but still actually use it if it makes sense. Well no, that isn't what DM is doing. DM does have an upper layer scheduler that would like to be afforded the same capabilities that any request-based driver is given. Yes that comes with plumbing in safe passage for upper layer requests dispatched from a stacked blk-mq IO scheduler. > There is another way to fix this - still do the direct dispatch, but have > dm track if it failed and do bypass insert in that case. I didn't want to do > that since it's more involved, but it's doable. > > Let me cook that up and test it... Don't like it, though. Not following how DM can track if issuing the request worked if it is always told it worked with BLK_STS_OK. We care about feedback when the request is actually issued because of the elaborate way blk-mq elevators work. DM is forced to worry about all these details, as covered some in the header for commit 396eaf21ee17c476e8f66249fb1f4a39003d0ab4, it is trying to have its cake and eat it too. 
It just wants IO scheduling to work for request-based DM devices. That's it. Mike
Re: block: fix direct dispatch issue failure for clones
On Thu, Dec 06 2018 at 8:13pm -0500, Jens Axboe wrote: > On 12/6/18 6:04 PM, jianchao.wang wrote: > > > > > > On 12/7/18 6:20 AM, Jens Axboe wrote: > >> After the direct dispatch corruption fix, we permanently disallow direct > >> dispatch of non read/write requests. This works fine off the normal IO > >> path, as they will be retried like any other failed direct dispatch > >> request. But for the blk_insert_cloned_request() that only DM uses to > >> bypass the bottom level scheduler, we always first attempt direct > >> dispatch. For some types of requests, that's now a permanent failure, > >> and no amount of retrying will make that succeed. > >> > >> Don't use direct dispatch off the cloned insert path, always just use > >> bypass inserts. This still bypasses the bottom level scheduler, which is > >> what DM wants. > >> > >> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue") > >> Signed-off-by: Jens Axboe > >> > >> --- > >> > >> diff --git a/block/blk-core.c b/block/blk-core.c > >> index deb56932f8c4..4c44e6fa0d08 100644 > >> --- a/block/blk-core.c > >> +++ b/block/blk-core.c > >> @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct > >> request_queue *q, struct request * > >> * bypass a potential scheduler on the bottom device for > >> * insert. > >> */ > >> - return blk_mq_request_issue_directly(rq); > >> + blk_mq_request_bypass_insert(rq, true); > >> + return BLK_STS_OK; > >>} > >> > >>spin_lock_irqsave(q->queue_lock, flags); > >> > > Not sure about this because it will break the merging promotion for request > > based DM > > from Ming. > > 396eaf21ee17c476e8f66249fb1f4a39003d0ab4 > > (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request > > feedback) > > > > We could use some other way to fix this. > > That really shouldn't matter as this is the cloned insert, merging should > have been done on the original request. 
Reading the header of 396eaf21ee17c476e8f66249fb1f4a39003d0ab4 brings me back to relatively recent hell. Thing is, dm-rq was the original justification and consumer of blk_mq_request_issue_directly -- but Ming's later use of directly issuing requests has forced fixes that didn't consider the original valid/safe use of the interface that is now too rigid. dm-rq needs the functionality the blk_mq_request_issue_directly interface provides. Sorry to say we cannot lose the sequential IO performance improvements that the IO merging feedback loop gives us.
Re: [PATCH] block: fix direct dispatch issue failure for clones
On 12/6/18 6:22 PM, jianchao.wang wrote: > > > On 12/7/18 9:13 AM, Jens Axboe wrote: >> On 12/6/18 6:04 PM, jianchao.wang wrote: >>> >>> >>> On 12/7/18 6:20 AM, Jens Axboe wrote: After the direct dispatch corruption fix, we permanently disallow direct dispatch of non read/write requests. This works fine off the normal IO path, as they will be retried like any other failed direct dispatch request. But for the blk_insert_cloned_request() that only DM uses to bypass the bottom level scheduler, we always first attempt direct dispatch. For some types of requests, that's now a permanent failure, and no amount of retrying will make that succeed. Don't use direct dispatch off the cloned insert path, always just use bypass inserts. This still bypasses the bottom level scheduler, which is what DM wants. Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue") Signed-off-by: Jens Axboe --- diff --git a/block/blk-core.c b/block/blk-core.c index deb56932f8c4..4c44e6fa0d08 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request * * bypass a potential scheduler on the bottom device for * insert. */ - return blk_mq_request_issue_directly(rq); + blk_mq_request_bypass_insert(rq, true); + return BLK_STS_OK; } spin_lock_irqsave(q->queue_lock, flags); >>> Not sure about this because it will break the merging promotion for request >>> based DM >>> from Ming. >>> 396eaf21ee17c476e8f66249fb1f4a39003d0ab4 >>> (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request >>> feedback) >>> >>> We could use some other way to fix this. >> >> That really shouldn't matter as this is the cloned insert, merging should >> have been done on the original request. >> >> > Just quote some comments from the patch. > > " >But dm-rq currently can't get the underlying queue's > dispatch feedback at all. Without knowing whether a request was issued > or not (e.g. 
due to underlying queue being busy) the dm-rq elevator will > not be able to provide effective IO merging (as a side-effect of dm-rq > currently blindly destaging a request from its elevator only to requeue > it after a delay, which kills any opportunity for merging). This > obviously causes very bad sequential IO performance. > ... > With this, request-based DM's blk-mq sequential IO performance is vastly > improved (as much as 3X in mpath/virtio-scsi testing) > " > > Using blk_mq_request_bypass_insert to replace the > blk_mq_request_issue_directly > could be a fast method to fix the current issue. Maybe we could get the > merging > promotion back after some time. This really sucks, mostly because DM wants to have it both ways - not use the bottom level IO scheduler, but still actually use it if it makes sense. There is another way to fix this - still do the direct dispatch, but have dm track if it failed and do bypass insert in that case. I didn't want to do that since it's more involved, but it's doable. Let me cook that up and test it... Don't like it, though. -- Jens Axboe
Re: [PATCH] block: fix direct dispatch issue failure for clones
On 12/7/18 9:13 AM, Jens Axboe wrote: > On 12/6/18 6:04 PM, jianchao.wang wrote: >> >> >> On 12/7/18 6:20 AM, Jens Axboe wrote: >>> After the direct dispatch corruption fix, we permanently disallow direct >>> dispatch of non read/write requests. This works fine off the normal IO >>> path, as they will be retried like any other failed direct dispatch >>> request. But for the blk_insert_cloned_request() that only DM uses to >>> bypass the bottom level scheduler, we always first attempt direct >>> dispatch. For some types of requests, that's now a permanent failure, >>> and no amount of retrying will make that succeed. >>> >>> Don't use direct dispatch off the cloned insert path, always just use >>> bypass inserts. This still bypasses the bottom level scheduler, which is >>> what DM wants. >>> >>> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue") >>> Signed-off-by: Jens Axboe >>> >>> --- >>> >>> diff --git a/block/blk-core.c b/block/blk-core.c >>> index deb56932f8c4..4c44e6fa0d08 100644 >>> --- a/block/blk-core.c >>> +++ b/block/blk-core.c >>> @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct >>> request_queue *q, struct request * >>> * bypass a potential scheduler on the bottom device for >>> * insert. >>> */ >>> - return blk_mq_request_issue_directly(rq); >>> + blk_mq_request_bypass_insert(rq, true); >>> + return BLK_STS_OK; >>> } >>> >>> spin_lock_irqsave(q->queue_lock, flags); >>> >> Not sure about this because it will break the merging promotion for request >> based DM >> from Ming. >> 396eaf21ee17c476e8f66249fb1f4a39003d0ab4 >> (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request >> feedback) >> >> We could use some other way to fix this. > > That really shouldn't matter as this is the cloned insert, merging should > have been done on the original request. > > Just quote some comments from the patch. " But dm-rq currently can't get the underlying queue's dispatch feedback at all. 
Without knowing whether a request was issued or not (e.g. due to underlying queue being busy) the dm-rq elevator will not be able to provide effective IO merging (as a side-effect of dm-rq currently blindly destaging a request from its elevator only to requeue it after a delay, which kills any opportunity for merging). This obviously causes very bad sequential IO performance. ... With this, request-based DM's blk-mq sequential IO performance is vastly improved (as much as 3X in mpath/virtio-scsi testing) " Using blk_mq_request_bypass_insert to replace the blk_mq_request_issue_directly could be a fast method to fix the current issue. Maybe we could get the merging promotion back after some time. Thanks Jianchao
Re: block: fix direct dispatch issue failure for clones
On Thu, Dec 06 2018 at 8:04pm -0500, jianchao.wang wrote: > > > On 12/7/18 6:20 AM, Jens Axboe wrote: > > After the direct dispatch corruption fix, we permanently disallow direct > > dispatch of non read/write requests. This works fine off the normal IO > > path, as they will be retried like any other failed direct dispatch > > request. But for the blk_insert_cloned_request() that only DM uses to > > bypass the bottom level scheduler, we always first attempt direct > > dispatch. For some types of requests, that's now a permanent failure, > > and no amount of retrying will make that succeed. > > > > Don't use direct dispatch off the cloned insert path, always just use > > bypass inserts. This still bypasses the bottom level scheduler, which is > > what DM wants. > > > > Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue") > > Signed-off-by: Jens Axboe > > > > --- > > > > diff --git a/block/blk-core.c b/block/blk-core.c > > index deb56932f8c4..4c44e6fa0d08 100644 > > --- a/block/blk-core.c > > +++ b/block/blk-core.c > > @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct > > request_queue *q, struct request * > > * bypass a potential scheduler on the bottom device for > > * insert. > > */ > > - return blk_mq_request_issue_directly(rq); > > + blk_mq_request_bypass_insert(rq, true); > > + return BLK_STS_OK; > > } > > > > spin_lock_irqsave(q->queue_lock, flags); > > > Not sure about this because it will break the merging promotion for request > based DM > from Ming. > 396eaf21ee17c476e8f66249fb1f4a39003d0ab4 > (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request > feedback) Ngh.. yeah, forgot about that convoluted feedback loop. > We could use some other way to fix this. Yeah, afraid we have to.
Re: [PATCH] block: fix direct dispatch issue failure for clones
On 12/6/18 6:04 PM, jianchao.wang wrote: > > > On 12/7/18 6:20 AM, Jens Axboe wrote: >> After the direct dispatch corruption fix, we permanently disallow direct >> dispatch of non read/write requests. This works fine off the normal IO >> path, as they will be retried like any other failed direct dispatch >> request. But for the blk_insert_cloned_request() that only DM uses to >> bypass the bottom level scheduler, we always first attempt direct >> dispatch. For some types of requests, that's now a permanent failure, >> and no amount of retrying will make that succeed. >> >> Don't use direct dispatch off the cloned insert path, always just use >> bypass inserts. This still bypasses the bottom level scheduler, which is >> what DM wants. >> >> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue") >> Signed-off-by: Jens Axboe >> >> --- >> >> diff --git a/block/blk-core.c b/block/blk-core.c >> index deb56932f8c4..4c44e6fa0d08 100644 >> --- a/block/blk-core.c >> +++ b/block/blk-core.c >> @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct >> request_queue *q, struct request * >> * bypass a potential scheduler on the bottom device for >> * insert. >> */ >> -return blk_mq_request_issue_directly(rq); >> +blk_mq_request_bypass_insert(rq, true); >> +return BLK_STS_OK; >> } >> >> spin_lock_irqsave(q->queue_lock, flags); >> > Not sure about this because it will break the merging promotion for request > based DM > from Ming. > 396eaf21ee17c476e8f66249fb1f4a39003d0ab4 > (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request > feedback) > > We could use some other way to fix this. That really shouldn't matter as this is the cloned insert, merging should have been done on the original request. -- Jens Axboe
Re: [PATCH] block: fix direct dispatch issue failure for clones
On 12/7/18 6:20 AM, Jens Axboe wrote: > After the direct dispatch corruption fix, we permanently disallow direct > dispatch of non read/write requests. This works fine off the normal IO > path, as they will be retried like any other failed direct dispatch > request. But for the blk_insert_cloned_request() that only DM uses to > bypass the bottom level scheduler, we always first attempt direct > dispatch. For some types of requests, that's now a permanent failure, > and no amount of retrying will make that succeed. > > Don't use direct dispatch off the cloned insert path, always just use > bypass inserts. This still bypasses the bottom level scheduler, which is > what DM wants. > > Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue") > Signed-off-by: Jens Axboe > > --- > > diff --git a/block/blk-core.c b/block/blk-core.c > index deb56932f8c4..4c44e6fa0d08 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct > request_queue *q, struct request * >* bypass a potential scheduler on the bottom device for >* insert. >*/ > - return blk_mq_request_issue_directly(rq); > + blk_mq_request_bypass_insert(rq, true); > + return BLK_STS_OK; > } > > spin_lock_irqsave(q->queue_lock, flags); > Not sure about this because it will break the merging promotion for request based DM from Ming. 396eaf21ee17c476e8f66249fb1f4a39003d0ab4 (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback) We could use some other way to fix this. Thanks Jianchao
Re: [GIT PULL] Follow up block fix
On 12/6/18 5:31 PM, Bart Van Assche wrote: > On Thu, 2018-12-06 at 17:26 -0700, Jens Axboe wrote: >> On 12/6/18 5:20 PM, Bart Van Assche wrote: >>> Which branch does that tag correspond to? Even after having run git fetch >>> --tags I can't find that tag in your repository. >> >> I pushed it before I sent the email, where are you looking? >> >> http://git.kernel.dk/cgit/linux-block/tag/?h=for-linus-20181206 > > On my development system git has been configured to fetch from the following > repository: > https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/ > > Should I fetch your branches from the git.kernel.dk repository instead? Ah I see, I always push to git.kernel.dk, and that syncs to git.kernel.org only every hour or so. So either one will work, but you might be a bit behind if you use git.kernel.org. -- Jens Axboe
Re: [GIT PULL] Follow up block fix
On Thu, 2018-12-06 at 17:26 -0700, Jens Axboe wrote: > On 12/6/18 5:20 PM, Bart Van Assche wrote: > > Which branch does that tag correspond to? Even after having run git fetch > > --tags I can't find that tag in your repository. > > I pushed it before I sent the email, where are you looking? > > http://git.kernel.dk/cgit/linux-block/tag/?h=for-linus-20181206 On my development system git has been configured to fetch from the following repository: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/ Should I fetch your branches from the git.kernel.dk repository instead? Thanks, Bart.
Re: [GIT PULL] Follow up block fix
On 12/6/18 5:20 PM, Bart Van Assche wrote: > On Thu, 2018-12-06 at 16:59 -0700, Jens Axboe wrote: >> Just a single followup fix to the corruption fix from yesterday. We have >> an exported interface that does direct dispatch, with DM being the sole >> user of it. Change that to do bypass insert always instead of attempting >> direct dispatch. This fixes a deadlock that can happen on DM. >> Additionally, it gets rid of any exported user of direct dispatch, and >> enables DM to drop any BUSY handling from using the interface. >> >> Please pull! >> >> >> git://git.kernel.dk/linux-block.git tags/for-linus-20181206 > > Hi Jens, > > Which branch does that tag correspond to? Even after having run git fetch > --tags I can't find that tag in your repository. I pushed it before I sent the email, where are you looking? http://git.kernel.dk/cgit/linux-block/tag/?h=for-linus-20181206 > Additionally, the patch on your for-linus branch is missing the "Cc: > stable" and "Reported-by:" tags that you had promised to add. Did I > look at the wrong branch? Doh I did - Linus, I'm going to amend the commit with that info and push it out again, jfyi. Now done. -- Jens Axboe
Re: [GIT PULL] Follow up block fix
On Thu, 2018-12-06 at 16:59 -0700, Jens Axboe wrote: > Just a single followup fix to the corruption fix from yesterday. We have > an exported interface that does direct dispatch, with DM being the sole > user of it. Change that to do bypass insert always instead of attempting > direct dispatch. This fixes a deadlock that can happen on DM. > Additionally, it gets rid of any exported user of direct dispatch, and > enables DM to drop any BUSY handling from using the interface. > > Please pull! > > > git://git.kernel.dk/linux-block.git tags/for-linus-20181206 Hi Jens, Which branch does that tag correspond to? Even after having run git fetch --tags I can't find that tag in your repository. Additionally, the patch on your for-linus branch is missing the "Cc: stable" and "Reported-by:" tags that you had promised to add. Did I look at the wrong branch? Thanks, Bart.
[GIT PULL] Follow up block fix
Hi Linus,

Just a single followup fix to the corruption fix from yesterday. We have an exported interface that does direct dispatch, with DM being the sole user of it. Change that to do bypass insert always instead of attempting direct dispatch. This fixes a deadlock that can happen on DM. Additionally, it gets rid of any exported user of direct dispatch, and enables DM to drop any BUSY handling from using the interface.

Please pull!

  git://git.kernel.dk/linux-block.git tags/for-linus-20181206

Jens Axboe (1):
      block: fix direct dispatch issue failure for clones

 block/blk-core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

-- 
Jens Axboe
Re: [PATCH] block: fix direct dispatch issue failure for clones
On Thu, 2018-12-06 at 15:21 -0700, Jens Axboe wrote: > On 12/6/18 3:20 PM, Jens Axboe wrote: > > After the direct dispatch corruption fix, we permanently disallow direct > > dispatch of non read/write requests. This works fine off the normal IO > > path, as they will be retried like any other failed direct dispatch > > request. But for the blk_insert_cloned_request() that only DM uses to > > bypass the bottom level scheduler, we always first attempt direct > > dispatch. For some types of requests, that's now a permanent failure, > > and no amount of retrying will make that succeed. > > > > Don't use direct dispatch off the cloned insert path, always just use > > bypass inserts. This still bypasses the bottom level scheduler, which is > > what DM wants. > > > > Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue") > > Signed-off-by: Jens Axboe > > Bart, I'll add your reported-by here of course, and also a stable CC > since the original patch went into stable. Feel free to add the following: Tested-by: Bart Van Assche
Re: block: fix direct dispatch issue failure for clones
On 12/6/18 3:28 PM, Mike Snitzer wrote: > On Thu, Dec 06 2018 at 5:20pm -0500, > Jens Axboe wrote: > >> After the direct dispatch corruption fix, we permanently disallow direct >> dispatch of non read/write requests. This works fine off the normal IO >> path, as they will be retried like any other failed direct dispatch >> request. But for the blk_insert_cloned_request() that only DM uses to >> bypass the bottom level scheduler, we always first attempt direct >> dispatch. For some types of requests, that's now a permanent failure, >> and no amount of retrying will make that succeed. >> >> Don't use direct dispatch off the cloned insert path, always just use >> bypass inserts. This still bypasses the bottom level scheduler, which is >> what DM wants. >> >> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue") >> Signed-off-by: Jens Axboe >> >> --- >> >> diff --git a/block/blk-core.c b/block/blk-core.c >> index deb56932f8c4..4c44e6fa0d08 100644 >> --- a/block/blk-core.c >> +++ b/block/blk-core.c >> @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct >> request_queue *q, struct request * >> * bypass a potential scheduler on the bottom device for >> * insert. >> */ >> -return blk_mq_request_issue_directly(rq); >> +blk_mq_request_bypass_insert(rq, true); >> +return BLK_STS_OK; >> } >> >> spin_lock_irqsave(q->queue_lock, flags); > > Not sure what this trailing spin_lock_irqsave(q->queue_lock, flags) is > about.. but this looks good. It's because it's against current -git, that is gone from the 4.21 branch. > I'll cleanup dm-rq.c to do away with the > extra STS_RESOURCE checks for its call to blk_insert_cloned_request() > once this lands. That is indeed a nice benefit, all source based failure cases can be removed from the caller after this. -- Jens Axboe
Re: block: fix direct dispatch issue failure for clones
On Thu, Dec 06 2018 at 5:20pm -0500, Jens Axboe wrote: > After the direct dispatch corruption fix, we permanently disallow direct > dispatch of non read/write requests. This works fine off the normal IO > path, as they will be retried like any other failed direct dispatch > request. But for the blk_insert_cloned_request() that only DM uses to > bypass the bottom level scheduler, we always first attempt direct > dispatch. For some types of requests, that's now a permanent failure, > and no amount of retrying will make that succeed. > > Don't use direct dispatch off the cloned insert path, always just use > bypass inserts. This still bypasses the bottom level scheduler, which is > what DM wants. > > Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue") > Signed-off-by: Jens Axboe > > --- > > diff --git a/block/blk-core.c b/block/blk-core.c > index deb56932f8c4..4c44e6fa0d08 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct > request_queue *q, struct request * >* bypass a potential scheduler on the bottom device for >* insert. >*/ > - return blk_mq_request_issue_directly(rq); > + blk_mq_request_bypass_insert(rq, true); > + return BLK_STS_OK; > } > > spin_lock_irqsave(q->queue_lock, flags); Not sure what this trailing spin_lock_irqsave(q->queue_lock, flags) is about.. but this looks good. I'll cleanup dm-rq.c to do away with the extra STS_RESOURCE checks for its call to blk_insert_cloned_request() once this lands. Acked-by: Mike Snitzer Thanks.
Re: [PATCH] block: fix direct dispatch issue failure for clones
On 12/6/18 3:20 PM, Jens Axboe wrote: > After the direct dispatch corruption fix, we permanently disallow direct > dispatch of non read/write requests. This works fine off the normal IO > path, as they will be retried like any other failed direct dispatch > request. But for the blk_insert_cloned_request() that only DM uses to > bypass the bottom level scheduler, we always first attempt direct > dispatch. For some types of requests, that's now a permanent failure, > and no amount of retrying will make that succeed. > > Don't use direct dispatch off the cloned insert path, always just use > bypass inserts. This still bypasses the bottom level scheduler, which is > what DM wants. > > Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue") > Signed-off-by: Jens Axboe Bart, I'll add your reported-by here of course, and also a stable CC since the original patch went into stable. -- Jens Axboe
[PATCH] block: fix direct dispatch issue failure for clones
After the direct dispatch corruption fix, we permanently disallow direct dispatch of non read/write requests. This works fine off the normal IO path, as they will be retried like any other failed direct dispatch request. But for the blk_insert_cloned_request() that only DM uses to bypass the bottom level scheduler, we always first attempt direct dispatch. For some types of requests, that's now a permanent failure, and no amount of retrying will make that succeed.

Don't use direct dispatch off the cloned insert path, always just use bypass inserts. This still bypasses the bottom level scheduler, which is what DM wants.

Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
Signed-off-by: Jens Axboe

---

diff --git a/block/blk-core.c b/block/blk-core.c
index deb56932f8c4..4c44e6fa0d08 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
		 * bypass a potential scheduler on the bottom device for
		 * insert.
		 */
-		return blk_mq_request_issue_directly(rq);
+		blk_mq_request_bypass_insert(rq, true);
+		return BLK_STS_OK;
	}

	spin_lock_irqsave(q->queue_lock, flags);

-- 
Jens Axboe
Re: for-next branch and blktests/srp
On 12/6/18 2:04 PM, Jens Axboe wrote: > On 12/6/18 1:56 PM, Bart Van Assche wrote: >> On Thu, 2018-12-06 at 08:47 -0800, Bart Van Assche wrote: >>> If I merge Jens' for-next branch with Linus' master branch, boot the >>> resulting kernel in a VM and run blktests/tests/srp/002 then that test >>> never finishes. The same test passes against Linus' master branch. I >>> think this is a regression. The following appears in the system log if >>> I run that test: >>> >>> Call Trace: >>> INFO: task kworker/0:1:12 blocked for more than 120 seconds. >>> Call Trace: >>> INFO: task ext4lazyinit:2079 blocked for more than 120 seconds. >>> Call Trace: >>> INFO: task fio:2151 blocked for more than 120 seconds. >>> Call Trace: >>> INFO: task fio:2154 blocked for more than 120 seconds. >> >> Hi Jens, >> >> My test results so far are as follows: >> * With kernel v4.20-rc5 test srp/002 passes. >> * With your for-next branch test srp/002 reports the symptoms reported in my >> e-mail. >> * With Linus' master branch from this morning test srp/002 fails in the same >> way as >> your for-next branch. >> * Also with Linus' master branch, test srp/002 passes if I revert the >> following commit: >> ffe81d45322c ("blk-mq: fix corruption with direct issue"). So it seems >> like that >> commit fixed one regression but introduced another regression. > > Yes, I'm on the same page, I've been able to reproduce. It seems to be related > to dm and bypass insert, which is somewhat odd. If I just do: > > diff --git a/block/blk-core.c b/block/blk-core.c > index deb56932f8c4..4c44e6fa0d08 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct > request_queue *q, struct request * >* bypass a potential scheduler on the bottom device for >* insert. >*/ > - return blk_mq_request_issue_directly(rq); > + blk_mq_request_bypass_insert(rq, true); > + return BLK_STS_OK; > } > > spin_lock_irqsave(q->queue_lock, flags); > > it works fine. 
Well, at least this regression is less serious, I'll bang > out a fix for it and ensure we make -rc6. I'm guessing it's the bypassing > of non-read/write, which your top of dispatch also shows to be a > non-read/write. But there should be no new failure case here that wasn't > possible before, only it's easier to hit now. OK, so here's the thing. As part of the corruption fix, we disallowed direct dispatch for anything that wasn't a read or write. This means that your WRITE_ZEROES will always fail direct dispatch. When it does, we return busy. But next time dm will try the exact same thing again, blk_insert_cloned_request() -> direct dispatch -> fail. Before we'd succeed eventually, now we will always fail for that type. The insert clone path is unique in that regard. So we have two options - the patch I did above which always just does bypass insert for DM, or we need to mark the request as having failed and just not retry direct dispatch for it. I'm still not crazy about exploring the dispatch insert off this path. And we'd need to do this on the original request in dm, not the clone we are passed or it won't be persistent. Hence I lean towards the already posted patch. -- Jens Axboe
Re: for-next branch and blktests/srp
On 12/6/18 1:56 PM, Bart Van Assche wrote: > On Thu, 2018-12-06 at 08:47 -0800, Bart Van Assche wrote: >> If I merge Jens' for-next branch with Linus' master branch, boot the >> resulting kernel in a VM and run blktests/tests/srp/002 then that test >> never finishes. The same test passes against Linus' master branch. I >> think this is a regression. The following appears in the system log if >> I run that test: >> >> Call Trace: >> INFO: task kworker/0:1:12 blocked for more than 120 seconds. >> Call Trace: >> INFO: task ext4lazyinit:2079 blocked for more than 120 seconds. >> Call Trace: >> INFO: task fio:2151 blocked for more than 120 seconds. >> Call Trace: >> INFO: task fio:2154 blocked for more than 120 seconds. > > Hi Jens, > > My test results so far are as follows: > * With kernel v4.20-rc5 test srp/002 passes. > * With your for-next branch test srp/002 reports the symptoms reported in my > e-mail. > * With Linus' master branch from this morning test srp/002 fails in the same > way as > your for-next branch. > * Also with Linus' master branch, test srp/002 passes if I revert the > following commit: > ffe81d45322c ("blk-mq: fix corruption with direct issue"). So it seems like > that > commit fixed one regression but introduced another regression. Yes, I'm on the same page, I've been able to reproduce. It seems to be related to dm and bypass insert, which is somewhat odd. If I just do: diff --git a/block/blk-core.c b/block/blk-core.c index deb56932f8c4..4c44e6fa0d08 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request * * bypass a potential scheduler on the bottom device for * insert. */ - return blk_mq_request_issue_directly(rq); + blk_mq_request_bypass_insert(rq, true); + return BLK_STS_OK; } spin_lock_irqsave(q->queue_lock, flags); it works fine. Well, at least this regression is less serious, I'll bang out a fix for it and ensure we make -rc6. 
I'm guessing it's the bypassing of non-read/write, which your top of dispatch also shows to be a non-read/write. But there should be no new failure case here that wasn't possible before, only it's easier to hit now. -- Jens Axboe
Re: for-next branch and blktests/srp
On Thu, 2018-12-06 at 08:47 -0800, Bart Van Assche wrote: > If I merge Jens' for-next branch with Linus' master branch, boot the > resulting kernel in a VM and run blktests/tests/srp/002 then that test > never finishes. The same test passes against Linus' master branch. I > think this is a regression. The following appears in the system log if > I run that test: > > Call Trace: > INFO: task kworker/0:1:12 blocked for more than 120 seconds. > Call Trace: > INFO: task ext4lazyinit:2079 blocked for more than 120 seconds. > Call Trace: > INFO: task fio:2151 blocked for more than 120 seconds. > Call Trace: > INFO: task fio:2154 blocked for more than 120 seconds. Hi Jens, My test results so far are as follows: * With kernel v4.20-rc5 test srp/002 passes. * With your for-next branch test srp/002 reports the symptoms reported in my e-mail. * With Linus' master branch from this morning test srp/002 fails in the same way as your for-next branch. * Also with Linus' master branch, test srp/002 passes if I revert the following commit: ffe81d45322c ("blk-mq: fix corruption with direct issue"). So it seems like that commit fixed one regression but introduced another regression. Bart.
Re: [PATCH 4/4] block: expose devt for GENHD_FL_HIDDEN disks
On Thu, Dec 06, 2018 at 02:48:12PM -0200, Thadeu Lima de Souza Cascardo wrote: > Without this exposure, lsblk will fail as it tries to find out the > device's dev_t numbers. This causes a real problem for nvme multipath > devices, as their slaves are hidden. > > Exposing them fixes the problem, even though trying to open the devices > returns an error in the case of nvme multipath. So, right now, it's the > driver's responsibility to return a failure to open hidden devices. So the problem with this is that it will cause udev to actually create the /dev/nvmeXcYnZ nodes, which due to the previous patch will always fail to open, which is a bit confusing. I guess we could live with this if we add udev rules to suppress the creation or something, but in general it is a bit ugly.
Re: [PATCH 08/26] aio: don't zero entire aio_kiocb in aio_get_req()
On 12/6/18 12:38 PM, Jeff Moyer wrote: > Jens Axboe writes: > >> On 12/6/18 12:27 PM, Jeff Moyer wrote: >>> Jens Axboe writes: >>> It's 192 bytes, fairly substantial. Most items don't need to be cleared, especially not upfront. Clear the ones we do need to clear, and leave the other ones for setup when the iocb is prepared and submitted. >>> >>> What performance gains do you see from this? >> >> Before this, I had 1% in memset doing high IOPS. With it, that's gone. >> 1% is a lot, when you have just one thread doing everything from submission >> to completion. > > I'm used to customers complaining about fractions of a percent, so I get > it. :-) I just wanted to know we had some measurable impact, as I've > seen bugs crop up from code like this in the past. Oh for sure, I wouldn't do it if there wasn't a noticeable gain from this! -- Jens Axboe
Re: [PATCH 08/26] aio: don't zero entire aio_kiocb in aio_get_req()
Jens Axboe writes: > On 12/6/18 12:27 PM, Jeff Moyer wrote: >> Jens Axboe writes: >> >>> It's 192 bytes, fairly substantial. Most items don't need to be cleared, >>> especially not upfront. Clear the ones we do need to clear, and leave >>> the other ones for setup when the iocb is prepared and submitted. >> >> What performance gains do you see from this? > > Before this, I had 1% in memset doing high IOPS. With it, that's gone. > 1% is a lot, when you have just one thread doing everything from submission > to completion. I'm used to customers complaining about fractions of a percent, so I get it. :-) I just wanted to know we had some measurable impact, as I've seen bugs crop up from code like this in the past. Thanks! Jeff
Re: [PATCH 09/26] aio: only use blk plugs for > 2 depth submissions
On 12/6/18 12:29 PM, Jeff Moyer wrote: > Jens Axboe writes: > >> Plugging is meant to optimize submission of a string of IOs, if we don't >> have more than 2 being submitted, don't bother setting up a plug. > > Is there really that much overhead in blk_{start|finish}_plug? It's not just the overhead of the functions themselves, it's the pointless work we end up doing for no reason at all. Plugging is, by definition, meant to help batch things up, amortizing the cost of each individual IO if you have a batch of them. If you don't have a batch, then there's no point. Normally callers of blk_start_plug() don't necessarily know how much IO is coming, but in this case we clearly do. -- Jens Axboe
Re: [PATCH 09/26] aio: only use blk plugs for > 2 depth submissions
Jens Axboe writes: > Plugging is meant to optimize submission of a string of IOs, if we don't > have more than 2 being submitted, don't bother setting up a plug. Is there really that much overhead in blk_{start|finish}_plug? -Jeff > > Reviewed-by: Christoph Hellwig > Signed-off-by: Jens Axboe > --- > fs/aio.c | 18 ++ > 1 file changed, 14 insertions(+), 4 deletions(-) > > diff --git a/fs/aio.c b/fs/aio.c > index 522c04864d82..ed6c3914477a 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -69,6 +69,12 @@ struct aio_ring { > struct io_event io_events[0]; > }; /* 128 bytes + ring size */ > > +/* > + * Plugging is meant to work with larger batches of IOs. If we don't > + * have more than the below, then don't bother setting up a plug. > + */ > +#define AIO_PLUG_THRESHOLD 2 > + > #define AIO_RING_PAGES 8 > > struct kioctx_table { > @@ -1919,7 +1925,8 @@ SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, > nr, > if (nr > ctx->nr_events) > nr = ctx->nr_events; > > - blk_start_plug(&plug); > + if (nr > AIO_PLUG_THRESHOLD) > + blk_start_plug(&plug); > for (i = 0; i < nr; i++) { > struct iocb __user *user_iocb; > > @@ -1932,7 +1939,8 @@ SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, > nr, > if (ret) > break; > } > - blk_finish_plug(&plug); > + if (nr > AIO_PLUG_THRESHOLD) > + blk_finish_plug(&plug); > > percpu_ref_put(&ctx->users); > return i ? i : ret; > @@ -1959,7 +1967,8 @@ COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, > ctx_id, > if (nr > ctx->nr_events) > nr = ctx->nr_events; > > - blk_start_plug(&plug); > + if (nr > AIO_PLUG_THRESHOLD) > + blk_start_plug(&plug); > for (i = 0; i < nr; i++) { > compat_uptr_t user_iocb; > > @@ -1972,7 +1981,8 @@ COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, > ctx_id, > if (ret) > break; > } > - blk_finish_plug(&plug); > + if (nr > AIO_PLUG_THRESHOLD) > + blk_finish_plug(&plug); > > percpu_ref_put(&ctx->users); > return i ? i : ret;
Re: [PATCH 08/26] aio: don't zero entire aio_kiocb in aio_get_req()
On 12/6/18 12:27 PM, Jeff Moyer wrote: > Jens Axboe writes: > >> It's 192 bytes, fairly substantial. Most items don't need to be cleared, >> especially not upfront. Clear the ones we do need to clear, and leave >> the other ones for setup when the iocb is prepared and submitted. > > What performance gains do you see from this? Before this, I had 1% in memset doing high IOPS. With it, that's gone. 1% is a lot, when you have just one thread doing everything from submission to completion. -- Jens Axboe
Re: [PATCH 08/26] aio: don't zero entire aio_kiocb in aio_get_req()
Jens Axboe writes: > It's 192 bytes, fairly substantial. Most items don't need to be cleared, > especially not upfront. Clear the ones we do need to clear, and leave > the other ones for setup when the iocb is prepared and submitted. What performance gains do you see from this? -Jeff > Reviewed-by: Christoph Hellwig > Signed-off-by: Jens Axboe > --- > fs/aio.c | 9 +++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/fs/aio.c b/fs/aio.c > index eaceb40e6cf5..522c04864d82 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -1009,14 +1009,15 @@ static inline struct aio_kiocb *aio_get_req(struct > kioctx *ctx) > { > struct aio_kiocb *req; > > - req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO); > + req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL); > if (unlikely(!req)) > return NULL; > > percpu_ref_get(&ctx->reqs); > + req->ki_ctx = ctx; > INIT_LIST_HEAD(&req->ki_list); > refcount_set(&req->ki_refcnt, 0); > - req->ki_ctx = ctx; > + req->ki_eventfd = NULL; > return req; > } > > @@ -1730,6 +1731,10 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, > struct iocb *iocb) > if (unlikely(!req->file)) > return -EBADF; > + > + req->head = NULL; > + req->woken = false; > + req->cancelled = false; > + > apt.pt._qproc = aio_poll_queue_proc; > apt.pt._key = req->events; > apt.iocb = aiocb;
Re: [PATCH 03/26] block: wire up block device iopoll method
On 12/6/18 12:15 PM, Christoph Hellwig wrote: > On Thu, Dec 06, 2018 at 12:14:29PM -0700, Jens Axboe wrote: >> On 12/6/18 12:11 PM, Jeff Moyer wrote: >>> Jens Axboe writes: >>> From: Christoph Hellwig Just call blk_poll on the iocb cookie, we can derive the block device from the inode trivially. >>> >>> Does this work for multi-device file systems? >> >> It should, that's the whole purpose of having fops->iopoll. You should >> be able to derive it from inode + cookie. > > It still assumes the whole I/O will got to a single device. For > XFS this is always tree, but for btrfs this might mean I/O could > need splitting if ->iopoll was to be supported there. Right, we can only support one target for one particular piece of IO, but multi-device works fine in general. -- Jens Axboe
Re: [PATCH 01/26] fs: add an iopoll method to struct file_operations
Jens Axboe writes: > From: Christoph Hellwig > > This new method is used to explicitly poll for I/O completion for an > iocb. It must be called for any iocb submitted asynchronously (that > is with a non-null ki_complete) which has the IOCB_HIPRI flag set. > > The method is assisted by a new ki_cookie field in struct iocb to store > the polling cookie. > > TODO: we can probably union ki_cookie with the existing hint and I/O > priority fields to avoid struct kiocb growth. Please document return values. Thanks, Jeff > > Reviewed-by: Johannes Thumshirn > Signed-off-by: Christoph Hellwig > Signed-off-by: Jens Axboe > --- > Documentation/filesystems/vfs.txt | 3 +++ > include/linux/fs.h | 2 ++ > 2 files changed, 5 insertions(+) > > diff --git a/Documentation/filesystems/vfs.txt > b/Documentation/filesystems/vfs.txt > index 5f71a252e2e0..d9dc5e4d82b9 100644 > --- a/Documentation/filesystems/vfs.txt > +++ b/Documentation/filesystems/vfs.txt > @@ -857,6 +857,7 @@ struct file_operations { > ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); > ssize_t (*read_iter) (struct kiocb *, struct iov_iter *); > ssize_t (*write_iter) (struct kiocb *, struct iov_iter *); > + int (*iopoll)(struct kiocb *kiocb, bool spin); > int (*iterate) (struct file *, struct dir_context *); > int (*iterate_shared) (struct file *, struct dir_context *); > __poll_t (*poll) (struct file *, struct poll_table_struct *); > @@ -902,6 +903,8 @@ otherwise noted. 
> >write_iter: possibly asynchronous write with iov_iter as source > > + iopoll: called when aio wants to poll for completions on HIPRI iocbs > + >iterate: called when the VFS needs to read the directory contents > >iterate_shared: called when the VFS needs to read the directory contents > diff --git a/include/linux/fs.h b/include/linux/fs.h > index a1ab233e6469..6a5f71f8ae06 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -310,6 +310,7 @@ struct kiocb { > int ki_flags; > u16 ki_hint; > u16 ki_ioprio; /* See linux/ioprio.h */ > + unsigned intki_cookie; /* for ->iopoll */ > } __randomize_layout; > > static inline bool is_sync_kiocb(struct kiocb *kiocb) > @@ -1781,6 +1782,7 @@ struct file_operations { > ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); > ssize_t (*read_iter) (struct kiocb *, struct iov_iter *); > ssize_t (*write_iter) (struct kiocb *, struct iov_iter *); > + int (*iopoll)(struct kiocb *kiocb, bool spin); > int (*iterate) (struct file *, struct dir_context *); > int (*iterate_shared) (struct file *, struct dir_context *); > __poll_t (*poll) (struct file *, struct poll_table_struct *);
Re: [PATCH 03/26] block: wire up block device iopoll method
On Thu, Dec 06, 2018 at 12:14:29PM -0700, Jens Axboe wrote: > On 12/6/18 12:11 PM, Jeff Moyer wrote: > > Jens Axboe writes: > > > >> From: Christoph Hellwig > >> > >> Just call blk_poll on the iocb cookie, we can derive the block device > >> from the inode trivially. > > > > Does this work for multi-device file systems? > > It should, that's the whole purpose of having fops->iopoll. You should > be able to derive it from inode + cookie. It still assumes the whole I/O will got to a single device. For XFS this is always tree, but for btrfs this might mean I/O could need splitting if ->iopoll was to be supported there.
Re: [PATCH 03/26] block: wire up block device iopoll method
On 12/6/18 12:11 PM, Jeff Moyer wrote: > Jens Axboe writes: > >> From: Christoph Hellwig >> >> Just call blk_poll on the iocb cookie, we can derive the block device >> from the inode trivially. > > Does this work for multi-device file systems? It should, that's the whole purpose of having fops->iopoll. You should be able to derive it from inode + cookie. -- Jens Axboe
Re: [PATCH 03/26] block: wire up block device iopoll method
Jens Axboe writes: > From: Christoph Hellwig > > Just call blk_poll on the iocb cookie, we can derive the block device > from the inode trivially. Does this work for multi-device file systems? -Jeff > > Reviewed-by: Johannes Thumshirn > Signed-off-by: Christoph Hellwig > Signed-off-by: Jens Axboe > --- > fs/block_dev.c | 10 ++ > 1 file changed, 10 insertions(+) > > diff --git a/fs/block_dev.c b/fs/block_dev.c > index e1886cc7048f..6de8d35f6e41 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -281,6 +281,14 @@ struct blkdev_dio { > > static struct bio_set blkdev_dio_pool; > > +static int blkdev_iopoll(struct kiocb *kiocb, bool wait) > +{ > + struct block_device *bdev = I_BDEV(kiocb->ki_filp->f_mapping->host); > + struct request_queue *q = bdev_get_queue(bdev); > + > + return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait); > +} > + > static void blkdev_bio_end_io(struct bio *bio) > { > struct blkdev_dio *dio = bio->bi_private; > @@ -398,6 +406,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter > *iter, int nr_pages) > bio->bi_opf |= REQ_HIPRI; > > qc = submit_bio(bio); > + WRITE_ONCE(iocb->ki_cookie, qc); > break; > } > > @@ -2070,6 +2079,7 @@ const struct file_operations def_blk_fops = { > .llseek = block_llseek, > .read_iter = blkdev_read_iter, > .write_iter = blkdev_write_iter, > + .iopoll = blkdev_iopoll, > .mmap = generic_file_mmap, > .fsync = blkdev_fsync, > .unlocked_ioctl = block_ioctl,
Re: for-next branch and blktests/srp
On Thu, 2018-12-06 at 11:12 -0700, Jens Axboe wrote: > On 12/6/18 11:10 AM, Bart Van Assche wrote: > > On Thu, 2018-12-06 at 10:02 -0800, Bart Van Assche wrote: > > > On Thu, 2018-12-06 at 10:48 -0700, Jens Axboe wrote: > > > > which would result in a non-zero exit, which should be expected for this > > > > test? > > > > > > Test srp/002 simulates network failures while running fio on top of > > > dm-mpath. > > > Since queue_if_no_path is enabled in multipath.conf dm-mpath will keep > > > retrying > > > the block layer requests it receives if these requests fail due to path > > > removal. > > > If fio reports "io_u error" for this test then that means that the test > > > failed. > > > I haven't seen fio reporting any I/O errors for this test with any > > > upstream > > > kernel that I tested in the past year or so. > > > > Hi Jens, > > > > Please also verify that the version of multipath-tools that you are running > > includes > > the following patch: > > I installed from apt... Says this: > > # apt show multipath-tools > Package: multipath-tools > Version: 0.6.4-5+deb9u1 > > I can run a self-compiled one, if this one doesn't work. Please do. I use the following script to build and install multipath-tools over the Debian binaries:

export LIB=/lib
apt-get install -y libaio-dev libdevmapper-dev libjson-c-dev librados-dev libreadline-dev libsystemd-dev liburcu-dev
make -s
make -s install

Bart.
Re: for-next branch and blktests/srp
On Thu, 2018-12-06 at 10:02 -0800, Bart Van Assche wrote: > On Thu, 2018-12-06 at 10:48 -0700, Jens Axboe wrote: > > which would result in a non-zero exit, which should be expected for this > > test? > > Test srp/002 simulates network failures while running fio on top of dm-mpath. > Since queue_if_no_path is enabled in multipath.conf dm-mpath will keep > retrying > the block layer requests it receives if these requests fail due to path > removal. > If fio reports "io_u error" for this test then that means that the test > failed. > I haven't seen fio reporting any I/O errors for this test with any upstream > kernel that I tested in the past year or so. Hi Jens, Please also verify that the version of multipath-tools that you are running includes the following patch: >From a2675025ae9f652b005345b9082f5042b32c992c Mon Sep 17 00:00:00 2001 From: Bart Van Assche Date: Fri, 18 Nov 2016 13:33:44 -0800 Subject: [PATCH] Avoid that reloading a map sporadically triggers I/O errors Avoid that reloading a map while there are no paths triggers a flush and hence unwanted I/O errors if 'queue_if_no_path' is enabled. Fixes: commit d569988e7528 ("libmultipath: Fixup 'DM_DEVICE_RELOAD' handling") Signed-off-by: Bart Van Assche --- libmultipath/devmapper.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/libmultipath/devmapper.c b/libmultipath/devmapper.c index 9b6b0537ba6d..1576dd017d6a 100644 --- a/libmultipath/devmapper.c +++ b/libmultipath/devmapper.c @@ -392,7 +392,7 @@ int dm_addmap_reload(struct multipath *mpp, char *params, int flush) params, ADDMAP_RO, SKIP_KPARTX_OFF); } if (r) - r = dm_simplecmd(DM_DEVICE_RESUME, mpp->alias, flush, + r = dm_simplecmd(DM_DEVICE_RESUME, mpp->alias, !flush, 1, udev_flags, 0); return r; }
Re: for-next branch and blktests/srp
On 12/6/18 11:10 AM, Bart Van Assche wrote: > On Thu, 2018-12-06 at 10:02 -0800, Bart Van Assche wrote: >> On Thu, 2018-12-06 at 10:48 -0700, Jens Axboe wrote: >>> which would result in a non-zero exit, which should be expected for this >>> test? >> >> Test srp/002 simulates network failures while running fio on top of dm-mpath. >> Since queue_if_no_path is enabled in multipath.conf dm-mpath will keep >> retrying >> the block layer requests it receives if these requests fail due to path >> removal. >> If fio reports "io_u error" for this test then that means that the test >> failed. >> I haven't seen fio reporting any I/O errors for this test with any upstream >> kernel that I tested in the past year or so. > > Hi Jens, > > Please also verify that the version of multipath-tools that you are running > includes > the following patch: I installed from apt... Says this: # apt show multipath-tools Package: multipath-tools Version: 0.6.4-5+deb9u1 I can run a self-compiled one, if this one doesn't work. -- Jens Axboe
Re: for-next branch and blktests/srp
On 12/6/18 11:02 AM, Bart Van Assche wrote: > On Thu, 2018-12-06 at 10:48 -0700, Jens Axboe wrote: >> which would result in a non-zero exit, which should be expected for this >> test? > > Hi Jens, > > Test srp/002 simulates network failures while running fio on top of dm-mpath. > Since queue_if_no_path is enabled in multipath.conf dm-mpath will keep > retrying > the block layer requests it receives if these requests fail due to path > removal. > If fio reports "io_u error" for this test then that means that the test > failed. > I haven't seen fio reporting any I/O errors for this test with any upstream > kernel that I tested in the past year or so. Just ran it on current -git, and get the exact same results. Maybe there are some clues in the log files I sent in the previous email? -- Jens Axboe
Re: for-next branch and blktests/srp
On Thu, 2018-12-06 at 10:48 -0700, Jens Axboe wrote: > which would result in a non-zero exit, which should be expected for this > test? Hi Jens, Test srp/002 simulates network failures while running fio on top of dm-mpath. Since queue_if_no_path is enabled in multipath.conf dm-mpath will keep retrying the block layer requests it receives if these requests fail due to path removal. If fio reports "io_u error" for this test then that means that the test failed. I haven't seen fio reporting any I/O errors for this test with any upstream kernel that I tested in the past year or so. Bart.
Re: for-next branch and blktests/srp
On 12/6/18 10:28 AM, Jens Axboe wrote: > On 12/6/18 10:19 AM, Bart Van Assche wrote: >> On Thu, 2018-12-06 at 10:00 -0700, Jens Axboe wrote: >>> On 12/6/18 9:47 AM, Bart Van Assche wrote: If I merge Jens' for-next branch with Linus' master branch, boot the resulting kernel in a VM and run blktests/tests/srp/002 then that test never finishes. The same test passes against Linus' master branch. I think this is a regression. The following appears in the system log if I run that test: >>> >>> You are running that test on a dm device? Can you shed some light on >>> how that dm device is setup? >> >> Hi Jens, >> >> The dm device referred to in my e-mail is a dm-mpath device created by test >> srp/002. All parameters of that device are under control of that test script. >> From the srp/002 test script: >> >> DESCRIPTION="File I/O on top of multipath concurrently with logout and login >> (mq)" >> >> From srp/multipath.conf: >> >> defaults { >> find_multipaths no >> user_friendly_names yes >> queue_without_daemonno >> } >> >> devices { >> device { >> vendor "LIO-ORG|SCST_BIO|FUSIONIO" >> product ".*" >> features"1 queue_if_no_path" >> path_checkertur >> } >> } > > 5 kernel compiles later, and I think I'm almost ready to run it... axboe@dell:/home/axboe/git/blktests $ sudo ./check srp/002 srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed] runtime 33.689s ... 32.212s --- tests/srp/002.out 2018-11-09 08:42:22.842029804 -0700 +++ /home/axboe/git/blktests/results/nodev/srp/002.out.bad 2018-12-06 10:45:35.995183521 -0700 @@ -1,2 +1 @@ Configured SRP target driver -Passed It runs for about 32-33s every time and then I get this. The fio output shows this: data-integrity-test-mq: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64 ... 
fio-3.12-27-g68af Starting 16 threads data-integrity-test-mq: Laying out IO file (1 file / 1MiB) data-integrity-test-mq: Laying out IO file (1 file / 1MiB) data-integrity-test-mq: Laying out IO file (1 file / 1MiB) data-integrity-test-mq: Laying out IO file (1 file / 1MiB) data-integrity-test-mq: Laying out IO file (1 file / 1MiB) data-integrity-test-mq: Laying out IO file (1 file / 1MiB) data-integrity-test-mq: Laying out IO file (1 file / 1MiB) data-integrity-test-mq: Laying out IO file (1 file / 1MiB) data-integrity-test-mq: Laying out IO file (1 file / 1MiB) data-integrity-test-mq: Laying out IO file (1 file / 1MiB) data-integrity-test-mq: Laying out IO file (1 file / 1MiB) data-integrity-test-mq: Laying out IO file (1 file / 1MiB) data-integrity-test-mq: Laying out IO file (1 file / 1MiB) data-integrity-test-mq: Laying out IO file (1 file / 1MiB) data-integrity-test-mq: Laying out IO file (1 file / 1MiB) data-integrity-test-mq: Laying out IO file (1 file / 1MiB) fio: io_u error on file /home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.14.0: Input/output error: read offset=724992, buflen=4096 fio: io_u error on file /home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.5.0: Input/output error: read offset=983040, buflen=4096 fio: io_u error on file /home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.7.0: Input/output error: write offset=700416, buflen=4096 fio: io_u error on file /home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.8.0: Input/output error: write offset=237568, buflen=4096 fio: io_u error on file /home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.9.0: Input/output error: write offset=1277952, buflen=4096 fio: io_u error on file /home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.9.0: Input/output error: write offset=970752, buflen=4096 fio: io_u error on file 
/home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.9.0: Input/output error: write offset=974848, buflen=4096 fio: io_u error on file /home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.9.0: Input/output error: write offset=978944, buflen=4096 fio: io_u error on file /home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.9.0: Input/output error: write offset=983040, buflen=4096 fio: io_u error on file /home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.15.0: Input/output error: write offset=172032, buflen=4096 fio: io_u error on file /home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.13.0: Input/output error: write offset=237568, buflen=4096 fio: io_u error on file /home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.3.0: Input/output error: write offset=294912, buflen=4096 fio: io_u error on
Re: for-next branch and blktests/srp
On 12/6/18 10:19 AM, Bart Van Assche wrote: > On Thu, 2018-12-06 at 10:00 -0700, Jens Axboe wrote: >> On 12/6/18 9:47 AM, Bart Van Assche wrote: >>> If I merge Jens' for-next branch with Linus' master branch, boot the >>> resulting kernel in a VM and run blktests/tests/srp/002 then that test >>> never finishes. The same test passes against Linus' master branch. I >>> think this is a regression. The following appears in the system log if >>> I run that test: >> >> You are running that test on a dm device? Can you shed some light on >> how that dm device is setup? > > Hi Jens, > > The dm device referred to in my e-mail is a dm-mpath device created by test > srp/002. All parameters of that device are under control of that test script. > From the srp/002 test script: > > DESCRIPTION="File I/O on top of multipath concurrently with logout and login > (mq)" > > From srp/multipath.conf: > > defaults { > find_multipaths no > user_friendly_names yes > queue_without_daemonno > } > > devices { > device { > vendor "LIO-ORG|SCST_BIO|FUSIONIO" > product ".*" > features"1 queue_if_no_path" > path_checkertur > } > } 5 kernel compiles later, and I think I'm almost ready to run it... -- Jens Axboe
Re: for-next branch and blktests/srp
On Thu, 2018-12-06 at 10:00 -0700, Jens Axboe wrote: > On 12/6/18 9:47 AM, Bart Van Assche wrote: > > If I merge Jens' for-next branch with Linus' master branch, boot the > > resulting kernel in a VM and run blktests/tests/srp/002 then that test > > never finishes. The same test passes against Linus' master branch. I > > think this is a regression. The following appears in the system log if > > I run that test: > > You are running that test on a dm device? Can you shed some light on > how that dm device is setup? Hi Jens, The dm device referred to in my e-mail is a dm-mpath device created by test srp/002. All parameters of that device are under control of that test script. >From the srp/002 test script: DESCRIPTION="File I/O on top of multipath concurrently with logout and login (mq)" >From srp/multipath.conf: defaults { find_multipaths no user_friendly_names yes queue_without_daemonno } devices { device { vendor "LIO-ORG|SCST_BIO|FUSIONIO" product ".*" features"1 queue_if_no_path" path_checkertur } } Bart.
Re: for-next branch and blktests/srp
On 12/6/18 9:47 AM, Bart Van Assche wrote: > Hello, > > If I merge Jens' for-next branch with Linus' master branch, boot the > resulting kernel in a VM and run blktests/tests/srp/002 then that test > never finishes. The same test passes against Linus' master branch. I > think this is a regression. The following appears in the system log if > I run that test: You are running that test on a dm device? Can you shed some light on how that dm device is setup? -- Jens Axboe
Re: [PATCH 2/2] nvme: create slaves/holder entries for multipath devices
On Thu, Dec 06, 2018 at 04:44:32PM +0100, Christoph Hellwig wrote: > On Wed, Nov 28, 2018 at 06:38:17AM -0200, Thadeu Lima de Souza Cascardo wrote: > > I will send a followup that includes your two patches, fixing up this one, > > plus > > another two because that is simply reverting the revert of Hanne's patch. > > Which > > means it still has the lsblk bug. > > > > Then, we have two options: do not use GENHD_FL_HIDDEN on multipath nvme > > disks, > > or expose devt for those disks. I am sending a patch for the second option. > > Can you please send it out so we can look at it before the 4.21 merge > window opens? Just sent. Sorry about the delay. Thanks. Cascardo.
[PATCH 4/4] block: expose devt for GENHD_FL_HIDDEN disks
Without this exposure, lsblk will fail as it tries to find out the device's dev_t numbers. This causes a real problem for nvme multipath devices, as their slaves are hidden. Exposing them fixes the problem, even though trying to open the devices returns an error in the case of nvme multipath. So, right now, it's the driver's responsibility to return a failure to open hidden devices. Signed-off-by: Thadeu Lima de Souza Cascardo --- block/genhd.c | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/block/genhd.c b/block/genhd.c index 7674fce32fca..65a7fa664188 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -687,6 +687,7 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk, disk_alloc_events(disk); + disk_to_dev(disk)->devt = devt; if (disk->flags & GENHD_FL_HIDDEN) { /* * Don't let hidden disks show up in /proc/partitions, @@ -698,13 +699,12 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk, int ret; /* Register BDI before referencing it from bdev */ - disk_to_dev(disk)->devt = devt; ret = bdi_register_owner(disk->queue->backing_dev_info, disk_to_dev(disk)); WARN_ON(ret); - blk_register_region(disk_devt(disk), disk->minors, NULL, - exact_match, exact_lock, disk); } + blk_register_region(disk_devt(disk), disk->minors, NULL, + exact_match, exact_lock, disk); register_disk(parent, disk, groups); if (register_queue) blk_register_queue(disk); @@ -776,8 +776,7 @@ void del_gendisk(struct gendisk *disk) WARN_ON(1); } - if (!(disk->flags & GENHD_FL_HIDDEN)) - blk_unregister_region(disk_devt(disk), disk->minors); + blk_unregister_region(disk_devt(disk), disk->minors); kobject_put(disk->part0.holder_dir); kobject_put(disk->slave_dir); -- 2.19.1
[PATCH 3/4] nvme: Should not warn when a disk path is opened
As we hide the disks used for nvme multipath, their open functions should be allowed to be called, as their devices are not exposed. However, not exposing the devices causes problems with lsblk, which won't find the devices and show them, which is even more of a problem when holders/slaves relationships are exposed. Not warning here allows us to expose the disks without creating spurious warnings. A failure should still be returned, which is the case here, and still allows lsblk to work.

Signed-off-by: Thadeu Lima de Souza Cascardo
---
 drivers/nvme/host/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1dc29795f1ee..22424b2adfad 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1327,8 +1327,8 @@ static int nvme_open(struct block_device *bdev, fmode_t mode)
 	struct nvme_ns *ns = bdev->bd_disk->private_data;
 
 #ifdef CONFIG_NVME_MULTIPATH
-	/* should never be called due to GENHD_FL_HIDDEN */
-	if (WARN_ON_ONCE(ns->head->disk))
+	/* Should fail as it's hidden */
+	if (ns->head->disk)
 		goto fail;
 #endif
 	if (!kref_get_unless_zero(&ns->kref))
-- 
2.19.1
[PATCH 1/4] block: move holder tracking from struct block_device to hd_struct
From: Christoph Hellwig

We'd like to track the slaves and holders for nvme multipath devices in the same standard fashion as all the other stacked block devices to make life easy for things like distro installers. But struct block_device only exists while we have open instances, which we never have for the underlying devices of a nvme-multipath setup. We can easily move the holder list into struct hd_struct, which exists all the time the block device exists; the only interesting bit is that we need a new mutex for it.

Signed-off-by: Christoph Hellwig
---
 block/genhd.c                |  4 +++
 block/partition-generic.c    |  4 +++
 drivers/block/drbd/drbd_nl.c |  4 +--
 drivers/md/bcache/super.c    |  8 +++---
 drivers/md/dm.c              |  4 +--
 drivers/md/md.c              |  4 +--
 fs/block_dev.c               | 48 ++++++++++++++++++-------------------
 include/linux/fs.h           | 11 +++++----
 include/linux/genhd.h        |  4 +++
 9 files changed, 47 insertions(+), 44 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 0145bcb0cc76..7674fce32fca 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1468,6 +1468,10 @@ struct gendisk *__alloc_disk_node(int minors, int node_id)
 		disk->minors = minors;
 		rand_initialize_disk(disk);
+#ifdef CONFIG_SYSFS
+		INIT_LIST_HEAD(&disk->part0.holder_disks);
+		mutex_init(&disk->part0.holder_mutex);
+#endif
 		disk_to_dev(disk)->class = &block_class;
 		disk_to_dev(disk)->type = &disk_type;
 		device_initialize(disk_to_dev(disk));
diff --git a/block/partition-generic.c b/block/partition-generic.c
index 5f8db5c5140f..83edc23ff1e8 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -333,6 +333,10 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno,
 	}
 
 	seqcount_init(&p->nr_sects_seq);
+#ifdef CONFIG_SYSFS
+	INIT_LIST_HEAD(&p->holder_disks);
+	mutex_init(&p->holder_mutex);
+#endif
 	pdev = part_to_dev(p);
 
 	p->start_sect = start;
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index d15703b1ffe8..42354e34b565 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1670,7 +1670,7 @@ static struct block_device *open_backing_dev(struct drbd_device *device,
 	if (!do_bd_link)
 		return bdev;
 
-	err = bd_link_disk_holder(bdev, device->vdisk);
+	err = bd_link_disk_holder(bdev->bd_part, device->vdisk);
 	if (err) {
 		blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
 		drbd_err(device, "bd_link_disk_holder(\"%s\", ...) failed with %d\n",
@@ -1719,7 +1719,7 @@ static void close_backing_dev(struct drbd_device *device, struct block_device *bdev,
 	if (!bdev)
 		return;
 	if (do_bd_unlink)
-		bd_unlink_disk_holder(bdev, device->vdisk);
+		bd_unlink_disk_holder(bdev->bd_part, device->vdisk);
 	blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
 }
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 7bbd670a5a84..e49b177b1c60 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -677,7 +677,7 @@ static void bcache_device_unlink(struct bcache_device *d)
 		sysfs_remove_link(&d->kobj, "cache");
 
 		for_each_cache(ca, d->c, i)
-			bd_unlink_disk_holder(ca->bdev, d->disk);
+			bd_unlink_disk_holder(ca->bdev->bd_part, d->disk);
 	}
 }
 
@@ -688,7 +688,7 @@ static void bcache_device_link(struct bcache_device *d, struct cache_set *c,
 	struct cache *ca;
 
 	for_each_cache(ca, d->c, i)
-		bd_link_disk_holder(ca->bdev, d->disk);
+		bd_link_disk_holder(ca->bdev->bd_part, d->disk);
 
 	snprintf(d->name, BCACHEDEVNAME_SIZE, "%s%u", name, d->id);
 
@@ -936,7 +936,7 @@ void bch_cached_dev_run(struct cached_dev *dc)
 	}
 
 	add_disk(d->disk);
-	bd_link_disk_holder(dc->bdev, dc->disk.disk);
+	bd_link_disk_holder(dc->bdev->bd_part, dc->disk.disk);
 	/*
 	 * won't show up in the uevent file, use udevadm monitor -e instead
 	 * only class / kset properties are persistent
@@ -1199,7 +1199,7 @@ static void cached_dev_free(struct closure *cl)
 		kthread_stop(dc->status_update_thread);
 
 	if (atomic_read(&dc->running))
-		bd_unlink_disk_holder(dc->bdev, dc->disk.disk);
+		bd_unlink_disk_holder(dc->bdev->bd_part, dc->disk.disk);
 	bcache_device_free(&dc->disk);
 	list_del(&dc->list);
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ab72d79775ee..823b592f066b 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -770,7 +770,7 @@ static int open_table_device(struct table_device *td, dev_t dev,
 	if (IS_ERR(bdev))
 		return PTR_ERR(bdev);
 
-	r = bd_link_disk_holder(bdev,
for-next branch and blktests/srp
Hello, If I merge Jens' for-next branch with Linus' master branch, boot the resulting kernel in a VM and run blktests/tests/srp/002 then that test never finishes. The same test passes against Linus' master branch. I think this is a regression. The following appears in the system log if I run that test: Call Trace: INFO: task kworker/0:1:12 blocked for more than 120 seconds. Call Trace: INFO: task ext4lazyinit:2079 blocked for more than 120 seconds. Call Trace: INFO: task fio:2151 blocked for more than 120 seconds. Call Trace: INFO: task fio:2154 blocked for more than 120 seconds. My list-pending-block-requests script produces the following output: dm-0 dm-0/hctx0/active:0 dm-0/hctx0/dispatch:b856a515 {.op=WRITE_ZEROES, .cmd_flags=SYNC, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1064, .internal_tag=-1} dm-0/hctx0/dispatch:6a2dca0a {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1065, .internal_tag=-1} dm-0/hctx0/dispatch:1baa5734 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1085, .internal_tag=-1} dm-0/hctx0/dispatch:2d618841 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1025, .internal_tag=-1} dm-0/hctx0/dispatch:c8e6e436 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1250, .internal_tag=-1} dm-0/hctx0/dispatch:7af12ca3 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1057, .internal_tag=-1} dm-0/hctx0/dispatch:2ad59d1e {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1060, .internal_tag=-1} dm-0/hctx0/dispatch:51c35166 {.op=READ, .cmd_flags=, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1188, .internal_tag=-1} dm-0/hctx0/dispatch:05da9610 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1253, .internal_tag=-1} dm-0/hctx0/dispatch:a25483c5 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1254, .internal_tag=-1} 
dm-0/hctx0/dispatch:7ef87035 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1255, .internal_tag=-1} dm-0/hctx0/dispatch:8a8f15c0 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1261, .internal_tag=-1} dm-0/hctx0/dispatch:d65ae959 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1263, .internal_tag=-1} dm-0/hctx0/dispatch:88aac981 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1264, .internal_tag=-1} dm-0/hctx0/dispatch:89489e70 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1265, .internal_tag=-1} dm-0/hctx0/dispatch:7c69e2d2 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1154, .internal_tag=-1} dm-0/hctx0/dispatch:e32aa9b5 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1155, .internal_tag=-1} dm-0/hctx0/dispatch:5b88a332 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1157, .internal_tag=-1} dm-0/hctx0/dispatch:1de14ef0 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1159, .internal_tag=-1} dm-0/hctx0/dispatch:a380d0f3 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1161, .internal_tag=-1} dm-0/hctx0/dispatch:c48cfa16 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1162, .internal_tag=-1} dm-0/hctx0/dispatch:11d96382 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1163, .internal_tag=-1} dm-0/hctx0/dispatch:b5efd5d7 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1167, .internal_tag=-1} dm-0/hctx0/dispatch:80f5c476 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1153, .internal_tag=-1} dm-0/hctx0/dispatch:c73c244d {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1266, .internal_tag=-1} dm-0/hctx0/dispatch:69c43800 {.op=WRITE, 
.cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1268, .internal_tag=-1} dm-0/hctx0/dispatch:d2ec3c96 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1269, .internal_tag=-1} dm-0/hctx0/dispatch:9b2099db {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1270, .internal_tag=-1} dm-0/hctx0/dispatch:7d433090 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1272, .internal_tag=-1} dm-0/hctx0/dispatch:afa7d2ea {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1274, .internal_tag=-1} dm-0/hctx0/dispatch:c0f9d890 {.op=WRITE, .cmd_flags=SYNC|IDLE, .rq_flags=IO_STAT|STATS, .state=idle, .tag=1276, .internal_tag=-1}
Re: for-next branch and throttling
On 12/6/18 9:39 AM, Bart Van Assche wrote: > Hello, > > If I merge Jens' for-next branch with Linus' master branch and boot the > resulting kernel in a VM then the call trace shown below appears. This call > trace does not appear when building and booting the code from Linus' master > branch. Is this perhaps a regression? > > WARNING: CPU: 1 PID: 257 at block/blk-throttle.c:2128 > blk_throtl_bio+0xc00/0x1120 > Modules linked in: floppy(+) virtio_blk(+) virtio_net net_failover failover > i2c_piix4 pata_acpi > CPU: 1 PID: 257 Comm: systemd-udevd Not tainted 4.20.0-rc5-dbg+ #1 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 > 04/01/2014 > RIP: 0010:blk_throtl_bio+0xc00/0x1120 It's some over-zealous rcu removal, simple fix is coming. CC Dennis. -- Jens Axboe
[PATCH 2/4] nvme: create slaves/holder entries for multipath devices
From: Christoph Hellwig

This allows tools like distro installers to easily track the relationship.

Signed-off-by: Christoph Hellwig
[cascardo: only add disk when there is ns->head]
Signed-off-by: Thadeu Lima de Souza Cascardo
---
 drivers/nvme/host/core.c      |  5 +++--
 drivers/nvme/host/multipath.c | 13 +++++++++++--
 drivers/nvme/host/nvme.h      | 12 ++++++++----
 3 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3ce7fc9df378..1dc29795f1ee 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -371,7 +371,7 @@ static void nvme_free_ns_head(struct kref *ref)
 	struct nvme_ns_head *head =
 		container_of(ref, struct nvme_ns_head, ref);
 
-	nvme_mpath_remove_disk(head);
+	nvme_mpath_remove_nshead(head);
 	ida_simple_remove(&head->subsys->ns_ida, head->instance);
 	list_del_init(&head->entry);
 	cleanup_srcu_struct_quiesced(&head->srcu);
@@ -3120,7 +3120,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
 	device_add_disk(ctrl->device, ns->disk, nvme_ns_id_attr_groups);
 
-	nvme_mpath_add_disk(ns, id);
+	nvme_mpath_add_ns(ns, id);
 	nvme_fault_inject_init(ns);
 	kfree(id);
 
@@ -3144,6 +3144,7 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 	nvme_fault_inject_fini(ns);
 	if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
+		nvme_mpath_remove_ns(ns);
 		del_gendisk(ns->disk);
 		blk_cleanup_queue(ns->queue);
 		if (blk_get_integrity(ns->disk))
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index ec310b1b9267..174c64643592 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -500,7 +500,7 @@ static int nvme_set_ns_ana_state(struct nvme_ctrl *ctrl,
 	return 0;
 }
 
-void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id)
+void nvme_mpath_add_ns(struct nvme_ns *ns, struct nvme_id_ns *id)
 {
 	if (nvme_ctrl_use_ana(ns->ctrl)) {
 		mutex_lock(&ns->ctrl->ana_lock);
@@ -513,9 +513,18 @@ void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id)
 		nvme_mpath_set_live(ns);
 		mutex_unlock(&ns->head->lock);
 	}
+
+	if (ns->head->disk)
+		bd_link_disk_holder(&ns->disk->part0, ns->head->disk);
+}
+
+void nvme_mpath_remove_ns(struct nvme_ns *ns)
+{
+	if (ns->head->disk)
+		bd_unlink_disk_holder(&ns->disk->part0, ns->head->disk);
 }
 
-void nvme_mpath_remove_disk(struct nvme_ns_head *head)
+void nvme_mpath_remove_nshead(struct nvme_ns_head *head)
 {
 	if (!head->disk)
 		return;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ae77eb16fd1f..365262d11e53 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -470,8 +470,9 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns,
 void nvme_failover_req(struct request *req);
 void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl);
 int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head);
-void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id);
-void nvme_mpath_remove_disk(struct nvme_ns_head *head);
+void nvme_mpath_add_ns(struct nvme_ns *ns, struct nvme_id_ns *id);
+void nvme_mpath_remove_ns(struct nvme_ns *ns);
+void nvme_mpath_remove_nshead(struct nvme_ns_head *head);
 int nvme_mpath_init(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id);
 void nvme_mpath_uninit(struct nvme_ctrl *ctrl);
 void nvme_mpath_stop(struct nvme_ctrl *ctrl);
@@ -515,11 +516,14 @@ static inline int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,
 {
 	return 0;
 }
-static inline void nvme_mpath_add_disk(struct nvme_ns *ns,
+static inline void nvme_mpath_add_ns(struct nvme_ns *ns,
 		struct nvme_id_ns *id)
 {
 }
-static inline void nvme_mpath_remove_disk(struct nvme_ns_head *head)
+static inline void nvme_mpath_remove_ns(struct nvme_ns *ns)
+{
+}
+static inline void nvme_mpath_remove_nshead(struct nvme_ns_head *head)
 {
 }
 static inline void nvme_mpath_clear_current_path(struct nvme_ns *ns)
-- 
2.19.1
[PATCH 0/4] nvme multipath: expose slaves/holders
Exposing slaves/holders is necessary in order to find out the real PCI device and its driver for the root filesystem when generating an initramfs with initramfs-tools. That fails right now for nvme multipath devices, which this patchset fixes. However, because the slave devices are hidden, lsblk fails without some extra patches, as it can't find the device numbers for the slave devices, and exits. Christoph Hellwig (2): block: move holder tracking from struct block_device to hd_struct nvme: create slaves/holder entries for multipath devices Thadeu Lima de Souza Cascardo (2): nvme: Should not warn when a disk path is opened block: expose devt for GENHD_FL_HIDDEN disks block/genhd.c | 13 ++ block/partition-generic.c | 4 +++ drivers/block/drbd/drbd_nl.c | 4 +-- drivers/md/bcache/super.c | 8 +++--- drivers/md/dm.c | 4 +-- drivers/md/md.c | 4 +-- drivers/nvme/host/core.c | 9 --- drivers/nvme/host/multipath.c | 13 -- drivers/nvme/host/nvme.h | 12 ++--- fs/block_dev.c| 48 +++ include/linux/fs.h| 11 +++- include/linux/genhd.h | 4 +++ 12 files changed, 75 insertions(+), 59 deletions(-) -- 2.19.1
for-next branch and throttling
Hello, If I merge Jens' for-next branch with Linus' master branch and boot the resulting kernel in a VM then the call trace shown below appears. This call trace does not appear when building and booting the code from Linus' master branch. Is this perhaps a regression? WARNING: CPU: 1 PID: 257 at block/blk-throttle.c:2128 blk_throtl_bio+0xc00/0x1120 Modules linked in: floppy(+) virtio_blk(+) virtio_net net_failover failover i2c_piix4 pata_acpi CPU: 1 PID: 257 Comm: systemd-udevd Not tainted 4.20.0-rc5-dbg+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 RIP: 0010:blk_throtl_bio+0xc00/0x1120 Call Trace: ? create_task_io_context+0x1a2/0x1e0 ? lock_downgrade+0x2e0/0x2e0 ? kasan_check_read+0x11/0x20 ? do_raw_spin_unlock+0xa8/0x140 ? preempt_count_sub+0x18/0xd0 generic_make_request_checks+0x432/0xcc0 ? trace_event_raw_event_block_rq_requeue+0x290/0x290 ? __lock_acquire+0x5b0/0x17e0 ? __bio_associate_blkg.isra.33+0x1d0/0x3c0 generic_make_request+0x151/0x950 ? blk_queue_enter+0x7e0/0x7e0 submit_bio+0x9b/0x250 ? submit_bio+0x9b/0x250 ? generic_make_request+0x950/0x950 ? guard_bio_eod+0x120/0x2d0 submit_bh_wbc+0x2df/0x310 block_read_full_page+0x34c/0x4c0 ? I_BDEV+0x20/0x20 ? __bread_gfp+0x170/0x170 ? add_to_page_cache_locked+0x20/0x20 ? blkdev_writepages+0x10/0x10 ? blkdev_writepages+0x10/0x10 blkdev_readpage+0x18/0x20 do_read_cache_page+0x5ed/0xbf0 ? blkdev_writepages+0x10/0x10 ? find_valid_gpt+0x1a8/0x8b0 ? efi_partition+0x147/0x6c1 ? check_partition+0x1b2/0x314 ? blkdev_get+0x1f2/0x5a0 ? __device_add_disk+0x6c1/0x7e0 ? device_add_disk+0x13/0x20 ? pagecache_get_page+0x420/0x420 ? driver_probe_device+0x118/0x180 ? __driver_attach+0x167/0x190 ? bus_for_each_dev+0x105/0x160 ? driver_attach+0x2b/0x30 ? bus_add_driver+0x23d/0x350 ? driver_register+0xdc/0x160 ? register_virtio_driver+0x3c/0x60 ? init+0x51/0x1000 [virtio_blk] ? do_one_initcall+0xb4/0x3a1 ? do_init_module+0x103/0x360 ? load_module+0x443e/0x4630 ? 
__do_sys_finit_module+0x12d/0x1b0 ? __x64_sys_finit_module+0x43/0x50 ? do_syscall_64+0x71/0x210 ? entry_SYSCALL_64_after_hwframe+0x49/0xbe ? match_held_lock+0x20/0x250 ? match_held_lock+0x35/0x250 ? blkdev_writepages+0x10/0x10 read_cache_page+0x3a/0x50 read_dev_sector+0x64/0x170 read_lba+0x1f9/0x3a0 ? msdos_partition+0x1370/0x1370 ? find_valid_gpt+0x1a8/0x8b0 ? kmem_cache_alloc_trace+0x151/0x340 ? find_valid_gpt+0x1a8/0x8b0 find_valid_gpt+0x1c6/0x8b0 ? is_gpt_valid.part.5+0x640/0x640 ? __alloc_pages_nodemask+0x1c4/0x4d0 ? lockdep_hardirqs_on+0x182/0x260 ? get_page_from_freelist+0x648/0x2070 ? mark_lock+0x69/0x8d0 ? trace_hardirqs_on+0x24/0x130 ? kasan_unpoison_shadow+0x35/0x50 ? get_page_from_freelist+0x648/0x2070 ? __asan_loadN+0xf/0x20 ? widen_string+0x73/0x170 ? __asan_loadN+0xf/0x20 ? widen_string+0x73/0x170 ? get_page_from_freelist+0x1eea/0x2070 ? set_precision+0x90/0x90 ? string+0x135/0x180 efi_partition+0x147/0x6c1 ? format_decode+0xce/0x550 ? enable_ptr_key_workfn+0x30/0x30 ? find_valid_gpt+0x8b0/0x8b0 ? vsnprintf+0x11d/0x820 ? pointer+0x4d0/0x4d0 ? snprintf+0x88/0xa0 ? snprintf+0x88/0xa0 ? vsprintf+0x20/0x20 ? find_valid_gpt+0x8b0/0x8b0 check_partition+0x1b2/0x314 rescan_partitions+0x13b/0x440 __blkdev_get+0x61f/0x930 ? check_disk_change+0xa0/0xa0 ? lock_downgrade+0x2e0/0x2e0 ? var_wake_function+0x90/0x90 blkdev_get+0x1f2/0x5a0 ? preempt_count_sub+0x18/0xd0 ? _raw_spin_unlock+0x31/0x50 ? bd_may_claim+0x80/0x80 ? bdget+0x26b/0x280 ? blkdev_writepage+0x20/0x20 ? kasan_check_read+0x11/0x20 ? kobject_put+0x23/0x220 __device_add_disk+0x6c1/0x7e0 ? blk_alloc_devt+0x150/0x150 ? __asan_storeN+0x12/0x20 device_add_disk+0x13/0x20 virtblk_probe+0x839/0xace [virtio_blk] ? virtblk_restore+0xd0/0xd0 [virtio_blk] ? mutex_unlock+0x12/0x20 ? kernfs_activate+0xd0/0xe0 ? kasan_check_write+0x14/0x20 ? kernfs_put+0x2c/0x280 ? vring_transport_features+0x19/0x70 ? vp_finalize_features+0xe6/0x160 ? vp_get_status+0x29/0x30 ? 
virtio_finalize_features+0xbe/0xe0 virtio_dev_probe+0x27d/0x390 really_probe+0x191/0x560 ? driver_probe_device+0x180/0x180 driver_probe_device+0x118/0x180 ? driver_probe_device+0x180/0x180 __driver_attach+0x167/0x190 bus_for_each_dev+0x105/0x160 ? subsys_dev_iter_exit+0x10/0x10 ? kasan_check_read+0x11/0x20 ? preempt_count_sub+0x18/0xd0 ? _raw_spin_unlock+0x31/0x50 driver_attach+0x2b/0x30 bus_add_driver+0x23d/0x350 driver_register+0xdc/0x160 ? 0xa007 register_virtio_driver+0x3c/0x60 init+0x51/0x1000 [virtio_blk] do_one_initcall+0xb4/0x3a1 ? trace_event_raw_event_initcall_finish+0x120/0x120 ? kasan_unpoison_shadow+0x35/0x50 ? __asan_register_globals+0x7b/0xa0 do_init_module+0x103/0x360 load_module+0x443e/0x4630 ? module_frob_arch_sections+0x20/0x20 ? vfs_read+0xe6/0x1e0 ? kernel_read+0x79/0xa0 ? kasan_check_write+0x14/0x20 ? kernel_read_file+0x259/0x310 ? do_mmap+0x431/0x6c0 __do_sys_finit_module+0x12d/0x1b0 ?
Re: [PATCH v3] block: add documentation for io_timeout
On Thu, Dec 06, 2018 at 08:06:16AM -0800, Bart Van Assche wrote: > On Wed, 2018-12-05 at 22:59 +0800, Weiping Zhang wrote: > > Weiping Zhang 于2018年12月5日周三 下午10:49写道: > > > Christoph Hellwig 于2018年12月5日周三 下午10:40写道: > > > > Can you please also send a patch to not show this attribute for > > > > drivers without a timeout handler? Thanks! > > > > Is there a simple way do that ? > > How about checking the timeout member of struct blk_mq_ops for blk-mq and > checking the rq_timed_out_fn member in struct request_queue for the legacy > block layer? Just the former given that the legacy code is gone in for-next. > > Shall we return -ENOTSUPP when user read/write this attribute when > > driver has no timeout handler ? > > A much more elegant solution is to introduce a sysfs attribute group for the > io_timeout attribute and to make that group visible only if a timeout handler > has been defined. See e.g. disk_attr_group in block/genhd.c for an example. Agreed, that is the way to go.
Re: [PATCH v3] block: add documentation for io_timeout
On Wed, 2018-12-05 at 22:17 +0800, Weiping Zhang wrote: > +Description: > + io_timeout is a request’s timeouts at block layer in > + milliseconds. When the underlying driver starts processing > + a request, the generic block layer will start a timer, if > + this request cannot be completed in io_timeout milliseconds, > + a timeout event will occur. Sorry but I think this description is still somewhat confusing. How about changing that description into the following? io_timeout is the request timeout in milliseconds. If a request does not complete in this time then the block driver timeout handler is invoked. That timeout handler can decide to retry the request, to fail it or to start a device recovery strategy. Bart.
Re: [PATCH 1/2] block: get rid of BLK_MAX_TIMEOUT
On Thu, 2018-12-06 at 22:18 +0800, Weiping Zhang wrote: > Before this patch, even we set io_timeout to 30*HZ(default), but > blk_rq_timeout always return jiffies +5*HZ, > [1]. if there no pending request in timeout list, the timer callback > blk_rq_timed_out_timer will be called after 5*HZ, and then > blk_mq_check_expired will check is there exist some request > was delayed by compare jiffies and request->deadline, obvious > request is not timeout because we set request's timeouts is 30*HZ. > So for this case timer callback should be called at jiffies + 30 instead > of jiffies + 5*HZ. > > [2]. if there are pending request in timeout list, we compare request's > expiry and queue's expiry. If time_after(request->expire, queue->expire) > modify > queue->timeout.expire to request->expire, otherwise do nothing. > > So I think this patch just solve problem in [1], no other regression, or > what's > I missing here ? The blk_rq_timeout() function was introduced by commit 0d2602ca30e4 ("blk-mq: improve support for shared tags maps"). I think the purpose of that function is to make sure that the nr_active counter in struct blk_mq_hw_ctx gets updated at least once every five seconds. So there are two problems with this patch: - It reduces the frequency of 'nr_active' updates. I think that is wrong and also that it will negatively affect drivers that rely on this functionality, e.g. the SCSI core. - The patch description does not match the code changes in this patch. Bart.
Re: [PATCH v3] block: add documentation for io_timeout
On Wed, 2018-12-05 at 22:59 +0800, Weiping Zhang wrote: > Weiping Zhang 于2018年12月5日周三 下午10:49写道: > > Christoph Hellwig 于2018年12月5日周三 下午10:40写道: > > > Can you please also send a patch to not show this attribute for > > > drivers without a timeout handler? Thanks! > > Is there a simple way do that ? How about checking the timeout member of struct blk_mq_ops for blk-mq and checking the rq_timed_out_fn member in struct request_queue for the legacy block layer? > Shall we return -ENOTSUPP when user read/write this attribute when > driver has no timeout handler ? A much more elegant solution is to introduce a sysfs attribute group for the io_timeout attribute and to make that group visible only if a timeout handler has been defined. See e.g. disk_attr_group in block/genhd.c for an example. Bart.
[PATCH 1/2] lightnvm: pblk: Do not overwrite ppa list with meta list
Currently when using PBLK with 0 sized metadata, both the ppa list and the meta list point to the same memory, since pblk_dma_meta_size() returns 0 in that case. This commit fixes that issue by ensuring that pblk_dma_meta_size() always returns space equal to at least sizeof(struct pblk_sec_meta), so the ppa list and meta list point to different memory addresses. Even though in that case the drive does not really care about the meta_list pointer, this is the easiest way to fix the issue without introducing changes in many places in the code just for the 0 sized metadata case.

Reported-by: Hans Holmberg
Signed-off-by: Igor Konopko
---
 drivers/lightnvm/pblk.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index bc40b1381ff6..e5c9ff2bf0da 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -1393,7 +1393,8 @@ static inline struct pblk_sec_meta *pblk_get_meta(struct pblk *pblk,
 
 static inline int pblk_dma_meta_size(struct pblk *pblk)
 {
-	return pblk->oob_meta_size * NVM_MAX_VLBA;
+	return max_t(int, sizeof(struct pblk_sec_meta), pblk->oob_meta_size)
+				* NVM_MAX_VLBA;
 }
 
 static inline int pblk_is_oob_meta_supported(struct pblk *pblk)
-- 
2.17.1
[PATCH 2/2] lightnvm: pblk: Ensure that bio is not freed on recovery
When we are using PBLK with 0 sized metadata during the recovery process, we need to reference the last page of the bio. Currently KASAN reports a use-after-free in that case, since the bio is freed on IO completion. This patch adds an additional bio reference to ensure that we can still use the bio memory after IO completion. It also ensures that we are not reusing the same bio on the retry_rq path.

Reported-by: Hans Holmberg
Signed-off-by: Igor Konopko
---
 drivers/lightnvm/pblk-recovery.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 009faf5db40f..3fcf062d752c 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -376,12 +376,14 @@ static int pblk_recov_scan_oob(struct pblk *pblk, struct pblk_line *line,
 		rq_ppas = pblk->min_write_pgs;
 	rq_len = rq_ppas * geo->csecs;
 
+retry_rq:
 	bio = bio_map_kern(dev->q, data, rq_len, GFP_KERNEL);
 	if (IS_ERR(bio))
 		return PTR_ERR(bio);
 
 	bio->bi_iter.bi_sector = 0; /* internal bio */
 	bio_set_op_attrs(bio, REQ_OP_READ, 0);
+	bio_get(bio);
 
 	rqd->bio = bio;
 	rqd->opcode = NVM_OP_PREAD;
@@ -394,7 +396,6 @@ static int pblk_recov_scan_oob(struct pblk *pblk, struct pblk_line *line,
 	if (pblk_io_aligned(pblk, rq_ppas))
 		rqd->is_seq = 1;
 
-retry_rq:
 	for (i = 0; i < rqd->nr_ppas; ) {
 		struct ppa_addr ppa;
 		int pos;
@@ -417,6 +418,7 @@ static int pblk_recov_scan_oob(struct pblk *pblk, struct pblk_line *line,
 	if (ret) {
 		pblk_err(pblk, "I/O submission failed: %d\n", ret);
 		bio_put(bio);
+		bio_put(bio);
 		return ret;
 	}
 
@@ -428,19 +430,25 @@ static int pblk_recov_scan_oob(struct pblk *pblk, struct pblk_line *line,
 
 		if (padded) {
 			pblk_log_read_err(pblk, rqd);
+			bio_put(bio);
 			return -EINTR;
 		}
 
 		pad_distance = pblk_pad_distance(pblk, line);
 		ret = pblk_recov_pad_line(pblk, line, pad_distance);
-		if (ret)
+		if (ret) {
+			bio_put(bio);
 			return ret;
+		}
 
 		padded = true;
+		bio_put(bio);
 		goto retry_rq;
 	}
 
 	pblk_get_packed_meta(pblk, rqd);
+	bio_put(bio);
+
 	for (i = 0; i < rqd->nr_ppas; i++) {
 		struct pblk_sec_meta *meta = pblk_get_meta(pblk,
							   meta_list, i);
 		u64 lba = le64_to_cpu(meta->lba);
-- 
2.17.1
Re: [PATCH 2/2] nvme: create slaves/holder entries for multipath devices
On Wed, Nov 28, 2018 at 06:38:17AM -0200, Thadeu Lima de Souza Cascardo wrote: > I will send a followup that includes your two patches, fixing up this one, > plus > another two because that is simply reverting the revert of Hanne's patch. > Which > means it still has the lsblk bug. > > Then, we have two options: do not use GENHD_FL_HIDDEN on multipath nvme disks, > or expose devt for those disks. I am sending a patch for the second option. Can you please send it out so we can look at it before the 4.21 merge window opens?
Re: [PATCH 1/2] block: get rid of BLK_MAX_TIMEOUT
Bart Van Assche 于2018年12月5日周三 下午11:59写道: > > On Wed, 2018-12-05 at 23:37 +0800, Weiping Zhang wrote: > > @@ -130,7 +119,7 @@ void blk_add_timer(struct request *req) > >* than an existing one, modify the timer. Round up to next nearest > >* second. > >*/ > > - expiry = blk_rq_timeout(round_jiffies_up(expiry)); > > + expiry = round_jiffies_up(expiry); > > If you would have read the comment above this code, you would have known > that this patch does not do what you think it does and additionally that > it introduces a regression. > Let's paste full comments here: /* * If the timer isn't already pending or this timeout is earlier * than an existing one, modify the timer. Round up to next nearest * second. */ Before this patch, even we set io_timeout to 30*HZ(default), but blk_rq_timeout always return jiffies +5*HZ, [1]. if there no pending request in timeout list, the timer callback blk_rq_timed_out_timer will be called after 5*HZ, and then blk_mq_check_expired will check is there exist some request was delayed by compare jiffies and request->deadline, obvious request is not timeout because we set request's timeouts is 30*HZ. So for this case timer callback should be called at jiffies + 30 instead of jiffies + 5*HZ. [2]. if there are pending request in timeout list, we compare request's expiry and queue's expiry. If time_after(request->expire, queue->expire) modify queue->timeout.expire to request->expire, otherwise do nothing. So I think this patch just solve problem in [1], no other regression, or what's I missing here ? Thanks Weiping > Bart.
Re: [PATCH 0/2][V2] io.latency test for blktests
On Wed, Dec 05, 2018 at 10:34:02AM -0500, Josef Bacik wrote: > v1->v2: > - dropped my python library, TIL about jq. > - fixed the spelling mistakes in the test. > > -- Original message -- > > This patchset is to add a test to verify io.latency is working properly, and > to > add all the supporting code to run that test. > > First is the cgroup2 infrastructure which is fairly straightforward. Just > verifies we have cgroup2, and gives us the helpers to check and make sure we > have the right controllers in place for the test. > > The second patch brings over some python scripts I put in xfstests for parsing > the fio json output. I looked at the existing fio performance stuff in > blktests, but we only capture bw stuff, which is wonky with this particular > test > because once the fast group is finished the slow group is allowed to go as > fast > as it wants. So I needed this to pull out actual jobtime spent. This will > give > us flexibility to pull out other fio performance data in the future. > > The final patch is the test itself. It simply runs a job by itself to get a > baseline view of the disk performance. Then it creates 2 cgroups, one fast > and > one slow, and runs the same job simultaneously in both groups. The result > should be that the fast group takes just slightly longer time than the > baseline > (I use a 15% threshold to be safe), and that the slow one takes considerably > longer. Thanks, I cleaned up a ton of shellcheck warnings (from `make check`) and pushed to https://github.com/osandov/blktests/tree/josef. I tested with QEMU on Jens' for-next branch. With an emulated NVMe device, it failed with "Too much of a performance drop for the protected workload". On virtio-blk, I hit this: [ 1843.056452] INFO: task fio:20750 blocked for more than 120 seconds. [ 1843.057495] Not tainted 4.20.0-rc5-00251-g90efb26fa9a4 #19 [ 1843.058487] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1843.059769] fio D0 20750 20747 0x0080 [ 1843.060688] Call Trace: [ 1843.061123] ? __schedule+0x286/0x870 [ 1843.061735] ? blkcg_iolatency_done_bio+0x680/0x680 [ 1843.062574] ? blkcg_iolatency_cleanup+0x60/0x60 [ 1843.063347] schedule+0x32/0x80 [ 1843.063874] io_schedule+0x12/0x40 [ 1843.064449] rq_qos_wait+0x9a/0x120 [ 1843.065007] ? karma_partition+0x210/0x210 [ 1843.065661] ? blkcg_iolatency_done_bio+0x680/0x680 [ 1843.066435] blkcg_iolatency_throttle+0x185/0x360 [ 1843.067196] __rq_qos_throttle+0x23/0x30 [ 1843.067958] blk_mq_make_request+0x101/0x5c0 [ 1843.068637] generic_make_request+0x1b3/0x3c0 [ 1843.069329] submit_bio+0x45/0x140 [ 1843.069876] blkdev_direct_IO+0x3db/0x440 [ 1843.070527] ? aio_complete+0x2f0/0x2f0 [ 1843.071146] generic_file_direct_write+0x96/0x160 [ 1843.071880] __generic_file_write_iter+0xb3/0x1c0 [ 1843.072599] ? blk_mq_dispatch_rq_list+0x3aa/0x550 [ 1843.073340] blkdev_write_iter+0xa0/0x120 [ 1843.073960] ? __fget+0x6e/0xa0 [ 1843.074452] aio_write+0x11f/0x1d0 [ 1843.074979] ? __blk_mq_run_hw_queue+0x6f/0xe0 [ 1843.075658] ? __check_object_size+0xa0/0x189 [ 1843.076345] ? preempt_count_add+0x5a/0xb0 [ 1843.077086] ? aio_read_events+0x259/0x380 [ 1843.077819] ? kmem_cache_alloc+0x16e/0x1c0 [ 1843.078427] io_submit_one+0x4a8/0x790 [ 1843.078975] ? read_events+0x76/0x150 [ 1843.079510] __se_sys_io_submit+0x98/0x1a0 [ 1843.080116] ? syscall_trace_enter+0x1d3/0x2d0 [ 1843.080785] do_syscall_64+0x55/0x160 [ 1843.081404] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1843.082210] RIP: 0033:0x7f6e571fc4ed [ 1843.082763] Code: Bad RIP value. [ 1843.083268] RSP: 002b:7ffc212b76f8 EFLAGS: 0246 ORIG_RAX: 00d1 [ 1843.084445] RAX: ffda RBX: 7f6e4c876870 RCX: 7f6e571fc4ed [ 1843.085545] RDX: 557c4bc11208 RSI: 0001 RDI: 7f6e4c85e000 [ 1843.086251] RBP: 7f6e4c85e000 R08: 557c4bc2b130 R09: 02f8 [ 1843.087308] R10: 557c4bbf4470 R11: 0246 R12: 0001 [ 1843.088310] R13: R14: 557c4bc11208 R15: 7f6e2b17f070
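For readers following along, the two moving parts the cover letter describes — carving out a protected and an unprotected cgroup via io.latency, and pulling per-job runtime (rather than bandwidth) out of fio's JSON with jq — might look roughly like this. This is an illustrative sketch, not the blktests code under review; the function names, the 10ms/100ms targets, the device number, and the reliance on fio's `job_runtime` field are all my assumptions:

```shell
#!/bin/bash
# Illustrative sketch only -- not the patchset's actual helpers.

# Create the fast/slow split under a cgroup2 hierarchy rooted at $1:
# a tight io.latency target (10ms) protects the fast group, a loose
# one (100ms) lets the slow group absorb all the throttling.
setup_iolatency_groups() {
	local root=$1 dev=$2
	mkdir -p "$root/fast" "$root/slow"
	echo "$dev target=10000" > "$root/fast/io.latency"
	echo "$dev target=100000" > "$root/slow/io.latency"
}

# Extract one named job's elapsed runtime (msec) from fio's
# --output-format=json output, since bandwidth numbers are misleading
# once the fast group finishes and the slow group speeds up.
fio_job_runtime() {
	local json=$1 job=$2
	jq -r --arg job "$job" \
		'.jobs[] | select(.jobname == $job) | .job_runtime' "$json"
}
```

The test described above would then compare `fio_job_runtime` of the fast group against the solo baseline run with the 15% threshold, and check that the slow group's runtime is considerably larger.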
Re: [PATCH 15/26] aio: support for IO polling
On Wed, 2018-12-05 at 06:10 -0700, Jens Axboe wrote:
> On 12/5/18 2:56 AM, Benny Halevy wrote:
> > > +static int aio_iopoll_getevents(struct kioctx *ctx,
> > > +				struct io_event __user *event,
> > > +				unsigned int *nr_events, long min, long max)
> > > +{
> > > +	struct aio_kiocb *iocb;
> > > +	int to_poll, polled, ret;
> > > +
> > > +	/*
> > > +	 * Check if we already have done events that satisfy what we need
> > > +	 */
> > > +	if (!list_empty(&ctx->poll_completing)) {
> > > +		ret = aio_iopoll_reap(ctx, event, nr_events, max);
> > > +		if (ret < 0)
> > > +			return ret;
> > > +		/* if min is zero, still go through a poll check */
> >
> > A function-level comment about that would be more visible.
>
> Yes, that might be a better idea.
>
> > > +		if (min && *nr_events >= min)
> > > +			return 0;
> >
> > if ((min && *nr_events >= min) || *nr_events >= max) ?
> >
> > Otherwise, if poll_completing or poll_submitted are not empty,
> > besides getting to the "Shouldn't happen" case in aio_iopoll_reap,
> > it looks like we're gonna waste some cycles and never call f_op->iopoll
>
> Agree, that would be a better place to catch it. I'll make those two
> changes, thanks!
>
> > > +
> > > +	/*
> > > +	 * Find up to 'max' worth of events to poll for, including the
> > > +	 * events we already successfully polled
> > > +	 */
> > > +	polled = to_poll = 0;
> > > +	list_for_each_entry(iocb, &ctx->poll_completing, ki_list) {
> > > +		/*
> > > +		 * Poll for needed events with spin == true, anything after
> > > +		 * that we just check if we have more, up to max.
> > > +		 */
> > > +		bool spin = polled + *nr_events >= min;
> >
> > I'm confused. Shouldn't spin == true if polled + *nr_events < min?
>
> Heh, that does look off, and it is. I think the correct condition is:
>
> 	bool spin = !polled || *nr_events < min;
>
> and I'm not sure why that got broken. I just tested it, slight
> performance win as expected correcting this. It ties in with the above
> nr_events check - it's perfectly valid to pass min_nr == 0 and just do a
> poll check. Conversely, if we already spun, just do non-spin checks on
> followup poll checks.

Yep, that makes sense.

> > > +static int __aio_iopoll_check(struct kioctx *ctx, struct io_event __user *event,
> > > +			      unsigned int *nr_events, long min_nr, long max_nr)
> > > +{
> > > +	int ret = 0;
> > > +
> > > +	while (!*nr_events || !need_resched()) {
> > > +		int tmin = 0;
> > > +
> > > +		if (*nr_events < min_nr)
> > > +			tmin = min_nr - *nr_events;
> >
> > Otherwise, shouldn't the function just return 0?
> >
> > Or perhaps you explicitly want to go through the tmin==0 case
> > in aio_iopoll_getevents if *nr_events == min_nr (or min_nr==0)?
>
> No, we need to go through poll, if min_nr == 0, but only if min_nr == 0
> and !nr_events. But we only get here for nr_events != 0, so should be
> fine as-is.
>
> Thanks for your review, very useful! I'll make the above changes right
> now.

Thanks!
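To make the corrected condition concrete, here is a tiny user-space model of the spin decision (illustrative only — `should_spin` is a made-up name; the real check lives inline in aio_iopoll_getevents()): spin on the first iocb in the list, or for as long as the caller's minimum hasn't been met; after that, later iterations only do cheap non-spin checks up to max.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * User-space model of the fixed condition from the thread:
 *
 *	bool spin = !polled || *nr_events < min;
 *
 * spin == true  -> busy-poll this iocb until it completes
 * spin == false -> just peek for a completion, don't burn cycles
 */
static bool should_spin(int polled, unsigned int nr_events, long min)
{
	/* First entry always spins; after that, only while short of min. */
	return !polled || (long)nr_events < min;
}
```

Note how this also covers the min_nr == 0 case discussed above: the first iteration spins (nothing polled yet), and every subsequent one falls back to a non-spin check.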