Re: [PATCH v3] blk-mq: punt failed direct issue to dispatch list
On Thu, Dec 06, 2018 at 10:17:44PM -0700, Jens Axboe wrote:
> After the direct dispatch corruption fix, we permanently disallow direct
> dispatch of non read/write requests. This works fine off the normal IO
> path, as they will be retried like any other failed direct dispatch
> request. But for the blk_insert_cloned_request() that only DM uses to
> bypass the bottom level scheduler, we always first attempt direct
> dispatch. For some types of requests, that's now a permanent failure,
> and no amount of retrying will make that succeed. This results in a
> livelock.
>
> Instead of making special cases for what we can direct issue, and now
> having to deal with DM solving the livelock while still retaining a BUSY
> condition feedback loop, always just add a request that has been through
> ->queue_rq() to the hardware queue dispatch list. These are safe to use
> as no merging can take place there. Additionally, if requests do have
> prepped data from drivers, we aren't dependent on them not sharing space
> in the request structure to safely add them to the IO scheduler lists.
>
> This basically reverts ffe81d45322c and is based on a patch from Ming,
> but with the list insert case covered as well.
>
> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
> Cc: sta...@vger.kernel.org
> Suggested-by: Ming Lei
> Reported-by: Bart Van Assche
> Signed-off-by: Jens Axboe
>
> ---
>
> I've thrown the initial hang test reported by Bart at it, works fine.
> My reproducer for the corruption case is also happy, as expected.
>
> I'm running blktests and xfstests on it overnight. If that passes as
> expected, this quells my initial worries about using ->dispatch as a
> holding place for these types of requests.
>
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3262d83b9e07..6a7566244de3 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1715,15 +1715,6 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
> 		break;
> 	case BLK_STS_RESOURCE:
> 	case BLK_STS_DEV_RESOURCE:
> -		/*
> -		 * If direct dispatch fails, we cannot allow any merging on
> -		 * this IO. Drivers (like SCSI) may have set up permanent state
> -		 * for this request, like SG tables and mappings, and if we
> -		 * merge to it later on then we'll still only do IO to the
> -		 * original part.
> -		 */
> -		rq->cmd_flags |= REQ_NOMERGE;
> -
> 		blk_mq_update_dispatch_busy(hctx, true);
> 		__blk_mq_requeue_request(rq);
> 		break;
> @@ -1736,18 +1727,6 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
> 	return ret;
> }
>
> -/*
> - * Don't allow direct dispatch of anything but regular reads/writes,
> - * as some of the other commands can potentially share request space
> - * with data we need for the IO scheduler. If we attempt a direct dispatch
> - * on those and fail, we can't safely add it to the scheduler afterwards
> - * without potentially overwriting data that the driver has already written.
> - */
> -static bool blk_rq_can_direct_dispatch(struct request *rq)
> -{
> -	return req_op(rq) == REQ_OP_READ || req_op(rq) == REQ_OP_WRITE;
> -}
> -
> static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> 						struct request *rq,
> 						blk_qc_t *cookie,
> @@ -1769,7 +1748,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> 		goto insert;
> 	}
>
> -	if (!blk_rq_can_direct_dispatch(rq) || (q->elevator && !bypass_insert))
> +	if (q->elevator && !bypass_insert)
> 		goto insert;
>
> 	if (!blk_mq_get_dispatch_budget(hctx))
> @@ -1785,7 +1764,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> 	if (bypass_insert)
> 		return BLK_STS_RESOURCE;
>
> -	blk_mq_sched_insert_request(rq, false, run_queue, false);
> +	blk_mq_request_bypass_insert(rq, run_queue);
> 	return BLK_STS_OK;
> }
>
> @@ -1801,7 +1780,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>
> 	ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
> 	if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
> -		blk_mq_sched_insert_request(rq, false, true, false);
> +		blk_mq_request_bypass_insert(rq, true);
> 	else if (ret != BLK_STS_OK)
> 		blk_mq_end_request(rq, ret);
>
Re: [PATCH] blk-mq: fix corruption with direct issue
On Fri, Dec 07, 2018 at 11:44:39AM +0800, Ming Lei wrote:
> On Thu, Dec 06, 2018 at 09:46:42PM -0500, Theodore Y. Ts'o wrote:
> > On Wed, Dec 05, 2018 at 11:03:01AM +0800, Ming Lei wrote:
> > >
> > > But at that time, there was no io scheduler for MQ, so in theory the
> > > issue should have been there since v4.11, especially 945ffb60c11d
> > > ("mq-deadline: add blk-mq adaptation of the deadline IO scheduler").
> >
> > Hi Ming,
> >
> > How serious were you about this being (theoretically) an issue since
> > 4.11? Can you talk about how it might get triggered, and how we can
> > test for it? The reason why I ask is because we're trying to track
> > down a mysterious file system corruption problem on a 4.14.x stable
> > kernel. The symptoms are *very* eerily similar to kernel bugzilla
> > #201685.
>
> Hi Theodore,
>
> It is just a theory analysis.
>
> blk_mq_try_issue_directly() is called in two branches of
> blk_mq_make_request(), both on real MQ disks.
>
> IO merge can be done by none or by real io schedulers, so in theory the
> risk might exist since v4.11, but IO merge on the sw queue didn't work
> for quite a while, until it was fixed by ab42f35d9cb5ac49b5a2.
>
> As Jens mentioned in bugzilla, there are several conditions required
> for triggering the issue:
>
> - MQ device
>
> - queue busy can be triggered. It is hard to trigger on NVMe PCI, but
>   may be possible on NVMe FC. However, it can be quite easy to trigger
>   on SCSI devices. We know there are some MQ SCSI HBAs, such as qlogic
>   FC and megaraid_sas.
>
> - IO merge is enabled.
>
> I have set up scsi_debug in the following way:
>
> modprobe scsi_debug dev_size_mb=4096 clustering=1 \
>	max_luns=1 submit_queues=2 max_queue=2
>
> - submit_queues=2 may set this disk as MQ
> - max_queue=2 may trigger the queue busy condition easily
>
> and run some write IO on ext4 over the disk: fio, kernel building, ...
> for some time, but still couldn't trigger the data corruption even once.
>
> I should have created more LUNs, so that the queue becomes busy more
> easily; will do that soon.

Actually I should have used SDEBUG_OPT_HOST_BUSY to simulate queue busy.

Thanks,
Ming
Re: [PATCH] blk-mq: fix corruption with direct issue
On Thu, Dec 06, 2018 at 09:46:42PM -0500, Theodore Y. Ts'o wrote:
> On Wed, Dec 05, 2018 at 11:03:01AM +0800, Ming Lei wrote:
> >
> > But at that time, there was no io scheduler for MQ, so in theory the
> > issue should have been there since v4.11, especially 945ffb60c11d
> > ("mq-deadline: add blk-mq adaptation of the deadline IO scheduler").
>
> Hi Ming,
>
> How serious were you about this being (theoretically) an issue since
> 4.11? Can you talk about how it might get triggered, and how we can
> test for it? The reason why I ask is because we're trying to track
> down a mysterious file system corruption problem on a 4.14.x stable
> kernel. The symptoms are *very* eerily similar to kernel bugzilla
> #201685.

Hi Theodore,

It is just a theory analysis.

blk_mq_try_issue_directly() is called in two branches of
blk_mq_make_request(), both on real MQ disks.

IO merge can be done by none or by real io schedulers, so in theory the
risk might exist since v4.11, but IO merge on the sw queue didn't work for
quite a while, until it was fixed by ab42f35d9cb5ac49b5a2.

As Jens mentioned in bugzilla, there are several conditions required
for triggering the issue:

- MQ device

- queue busy can be triggered. It is hard to trigger on NVMe PCI, but may
  be possible on NVMe FC. However, it can be quite easy to trigger on SCSI
  devices. We know there are some MQ SCSI HBAs, such as qlogic FC and
  megaraid_sas.

- IO merge is enabled.

I have set up scsi_debug in the following way:

modprobe scsi_debug dev_size_mb=4096 clustering=1 \
	max_luns=1 submit_queues=2 max_queue=2

- submit_queues=2 may set this disk as MQ
- max_queue=2 may trigger the queue busy condition easily

and run some write IO on ext4 over the disk: fio, kernel building, ... for
some time, but still couldn't trigger the data corruption even once.

I should have created more LUNs, so that the queue becomes busy more
easily; will do that soon.

> The problem is that the issue is super-rare --- roughly once a week out
> of a population of about 2500 systems. The workload is NFS serving.
> Unfortunately, since 4.14.63 we can no longer disable blk-mq for the
> virtio-scsi driver, thanks to the commit b5b6e8c8d3b4 ("scsi:
> virtio_scsi: fix IO hang caused by automatic irq vector affinity")
> getting backported into 4.14.63 as commit 70b522f163bbb32.

virtio_scsi supports multi-queue mode. If that is enabled in your setting,
you may try single queue and see if it makes a difference. If multi-queue
mode isn't enabled, your problem should be different from this one. I
remember multi-queue mode isn't enabled on virtio-scsi in GCE.

> We're considering reverting this patch in our 4.14 LTS kernel, and
> seeing whether it makes the problem go away. Is there anything else
> you might suggest?

The IO hang is only triggered on machines with a special CPU topology, so
it should be fine to revert b5b6e8c8d3b4 on a normal VM.

No other suggestions yet.

Thanks,
Ming
[PATCH V2] blk-mq: re-build queue map in case of kdump kernel
Almost all .map_queues() implementations based on managed irq affinity
don't update the queue mapping; they just retrieve the previously built
mapping, so if nr_hw_queues is changed, the mapping table includes stale
entries. Only blk_mq_map_queues() may rebuild the mapping table.

One case is that we limit .nr_hw_queues to 1 for a kdump kernel. However,
drivers often build the queue mapping before allocating the tagset via
pci_alloc_irq_vectors_affinity(), while set->nr_hw_queues can be set to 1
for a kdump kernel, so a wrong queue mapping is used, and a kernel
panic[1] is observed during booting.

This patch fixes the kernel panic triggered on nvme by rebuilding the
mapping table via blk_mq_map_queues().

[1] kernel panic log
[4.438371] nvme nvme0: 16/0/0 default/read/poll queues
[4.443277] BUG: unable to handle kernel NULL pointer dereference at 0098
[4.444681] PGD 0 P4D 0
[4.445367] Oops: [#1] SMP NOPTI
[4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459
[4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
[4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222
[4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6
[4.453132] RSP: 0018:c900023b3cd8 EFLAGS: 00010286
[4.454061] RAX: RBX: 888174448000 RCX: 0001
[4.456480] RDX: 0001 RSI: e8feffc506c0 RDI: 0001
[4.458750] RBP: 88810722d008 R08: 88817647a880 R09: 0002
[4.464580] R10: c900023b3c10 R11: 0004 R12: 888174448538
[4.467803] R13: 0004 R14: 0001 R15: 0001
[4.469220] FS: () GS:88817bac() knlGS:
[4.471554] CS: 0010 DS: ES: CR0: 80050033
[4.472464] CR2: 0098 CR3: 000174e4e001 CR4: 00760ee0
[4.474264] DR0: DR1: DR2:
[4.476007] DR3: DR6: fffe0ff0 DR7: 0400
[4.477061] PKRU: 5554
[4.477464] Call Trace:
[4.478731] blk_mq_init_allocated_queue+0x36a/0x3ad
[4.479595] blk_mq_init_queue+0x32/0x4e
[4.480178] nvme_validate_ns+0x98/0x623 [nvme_core]
[4.480963] ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core]
[4.481685] ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core]
[4.482601] nvme_scan_work+0x23a/0x29b [nvme_core]
[4.483269] ? _raw_spin_unlock_irqrestore+0x25/0x38
[4.483930] ? try_to_wake_up+0x38d/0x3b3
[4.484478] ? process_one_work+0x179/0x2fc
[4.485118] process_one_work+0x1d3/0x2fc
[4.485655] ? rescuer_thread+0x2ae/0x2ae
[4.486196] worker_thread+0x1e9/0x2be
[4.486841] kthread+0x115/0x11d
[4.487294] ? kthread_park+0x76/0x76
[4.487784] ret_from_fork+0x3a/0x50
[4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi ip_tables
[4.489428] Dumping ftrace buffer:
[4.489939]    (ftrace buffer empty)
[4.490492] CR2: 0098
[4.491052] ---[ end trace 03cd268ad5a86ff7 ]---

Cc: Christoph Hellwig
Cc: linux-n...@lists.infradead.org
Cc: David Milburn
Signed-off-by: Ming Lei
---
 block/blk-mq.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 900550594651..65770da99159 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2960,7 +2960,7 @@ static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 
 static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
 {
-	if (set->ops->map_queues) {
+	if (set->ops->map_queues && !is_kdump_kernel()) {
 		int i;
 
 		/*
@@ -3030,6 +3030,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	 */
 	if (is_kdump_kernel()) {
 		set->nr_hw_queues = 1;
+		set->nr_maps = 1;
 		set->queue_depth = min(64U, set->queue_depth);
 	}
 	/*
@@ -3051,7 +3052,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 				GFP_KERNEL, set->numa_node);
 		if (!set->map[i].mq_map)
 			goto out_free_mq_map;
-		set->map[i].nr_queues = set->nr_hw_queues;
+		set->map[i].nr_queues = is_kdump_kernel() ? 1 : set->nr_hw_queues;
 	}
 
 	ret = blk_mq_update_queue_map(set);
-- 
2.9.5
Re: [PATCH] blk-mq: re-build queue map in case of kdump kernel
On Thu, Dec 06, 2018 at 07:57:34PM -0700, Jens Axboe wrote:
> On 12/6/18 7:55 PM, Ming Lei wrote:
> > Almost all .map_queues() implementations based on managed irq affinity
> > don't update the queue mapping; they just retrieve the previously
> > built mapping, so if nr_hw_queues is changed, the mapping table
> > includes stale entries. Only blk_mq_map_queues() may rebuild the
> > mapping table.
> >
> > One case is that we limit .nr_hw_queues to 1 for a kdump kernel.
> > However, drivers often build the queue mapping before allocating the
> > tagset via pci_alloc_irq_vectors_affinity(), while set->nr_hw_queues
> > can be set to 1 for a kdump kernel, so a wrong queue mapping is used,
> > and a kernel panic[1] is observed during booting.
> >
> > This patch fixes the kernel panic triggered on nvme by rebuilding the
> > mapping table via blk_mq_map_queues().
> >
> > [1] kernel panic log
> > [4.438371] nvme nvme0: 16/0/0 default/read/poll queues
> > [4.443277] BUG: unable to handle kernel NULL pointer dereference at 0098
> > [4.444681] PGD 0 P4D 0
> > [4.445367] Oops: [#1] SMP NOPTI
> > [4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459
> > [4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
> > [4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core]
> > [4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222
> > [4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6
> > [4.453132] RSP: 0018:c900023b3cd8 EFLAGS: 00010286
> > [4.454061] RAX: RBX: 888174448000 RCX: 0001
> > [4.456480] RDX: 0001 RSI: e8feffc506c0 RDI: 0001
> > [4.458750] RBP: 88810722d008 R08: 88817647a880 R09: 0002
> > [4.464580] R10: c900023b3c10 R11: 0004 R12: 888174448538
> > [4.467803] R13: 0004 R14: 0001 R15: 0001
> > [4.469220] FS: () GS:88817bac() knlGS:
> > [4.471554] CS: 0010 DS: ES: CR0: 80050033
> > [4.472464] CR2: 0098 CR3: 000174e4e001 CR4: 00760ee0
> > [4.474264] DR0: DR1: DR2:
> > [4.476007] DR3: DR6: fffe0ff0 DR7: 0400
> > [4.477061] PKRU: 5554
> > [4.477464] Call Trace:
> > [4.478731] blk_mq_init_allocated_queue+0x36a/0x3ad
> > [4.479595] blk_mq_init_queue+0x32/0x4e
> > [4.480178] nvme_validate_ns+0x98/0x623 [nvme_core]
> > [4.480963] ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core]
> > [4.481685] ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core]
> > [4.482601] nvme_scan_work+0x23a/0x29b [nvme_core]
> > [4.483269] ? _raw_spin_unlock_irqrestore+0x25/0x38
> > [4.483930] ? try_to_wake_up+0x38d/0x3b3
> > [4.484478] ? process_one_work+0x179/0x2fc
> > [4.485118] process_one_work+0x1d3/0x2fc
> > [4.485655] ? rescuer_thread+0x2ae/0x2ae
> > [4.486196] worker_thread+0x1e9/0x2be
> > [4.486841] kthread+0x115/0x11d
> > [4.487294] ? kthread_park+0x76/0x76
> > [4.487784] ret_from_fork+0x3a/0x50
> > [4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi ip_tables
> > [4.489428] Dumping ftrace buffer:
> > [4.489939]    (ftrace buffer empty)
> > [4.490492] CR2: 0098
> > [4.491052] ---[ end trace 03cd268ad5a86ff7 ]---
> >
> > Cc: Christoph Hellwig
> > Cc: linux-n...@lists.infradead.org
> > Cc: David Milburn
> > Signed-off-by: Ming Lei
> > ---
> >  block/blk-mq.c | 12 +++++++++---
> >  1 file changed, 9 insertions(+), 3 deletions(-)
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 900550594651..a3e463a726a6 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -38,6 +38,11 @@
> >  #include "blk-mq-sched.h"
> >  #include "blk-rq-qos.h"
> >
> > +static inline bool blk_mq_kdump_kernel(void)
> > +{
> > +	return !!is_kdump_kernel();
> > +}
>
> Let's drop the redundant !! here, and the wrapper? I would imagine the
> wrapper is handy for testing outside of kdump, but I don't think we
> should include it in the final.

OK.

Thanks,
Ming
[PATCH] blk-mq: re-build queue map in case of kdump kernel
Almost all .map_queues() implementations based on managed irq affinity
don't update the queue mapping; they just retrieve the previously built
mapping, so if nr_hw_queues is changed, the mapping table includes stale
entries. Only blk_mq_map_queues() may rebuild the mapping table.

One case is that we limit .nr_hw_queues to 1 for a kdump kernel. However,
drivers often build the queue mapping before allocating the tagset via
pci_alloc_irq_vectors_affinity(), while set->nr_hw_queues can be set to 1
for a kdump kernel, so a wrong queue mapping is used, and a kernel
panic[1] is observed during booting.

This patch fixes the kernel panic triggered on nvme by rebuilding the
mapping table via blk_mq_map_queues().

[1] kernel panic log
[4.438371] nvme nvme0: 16/0/0 default/read/poll queues
[4.443277] BUG: unable to handle kernel NULL pointer dereference at 0098
[4.444681] PGD 0 P4D 0
[4.445367] Oops: [#1] SMP NOPTI
[4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459
[4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
[4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222
[4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6
[4.453132] RSP: 0018:c900023b3cd8 EFLAGS: 00010286
[4.454061] RAX: RBX: 888174448000 RCX: 0001
[4.456480] RDX: 0001 RSI: e8feffc506c0 RDI: 0001
[4.458750] RBP: 88810722d008 R08: 88817647a880 R09: 0002
[4.464580] R10: c900023b3c10 R11: 0004 R12: 888174448538
[4.467803] R13: 0004 R14: 0001 R15: 0001
[4.469220] FS: () GS:88817bac() knlGS:
[4.471554] CS: 0010 DS: ES: CR0: 80050033
[4.472464] CR2: 0098 CR3: 000174e4e001 CR4: 00760ee0
[4.474264] DR0: DR1: DR2:
[4.476007] DR3: DR6: fffe0ff0 DR7: 0400
[4.477061] PKRU: 5554
[4.477464] Call Trace:
[4.478731] blk_mq_init_allocated_queue+0x36a/0x3ad
[4.479595] blk_mq_init_queue+0x32/0x4e
[4.480178] nvme_validate_ns+0x98/0x623 [nvme_core]
[4.480963] ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core]
[4.481685] ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core]
[4.482601] nvme_scan_work+0x23a/0x29b [nvme_core]
[4.483269] ? _raw_spin_unlock_irqrestore+0x25/0x38
[4.483930] ? try_to_wake_up+0x38d/0x3b3
[4.484478] ? process_one_work+0x179/0x2fc
[4.485118] process_one_work+0x1d3/0x2fc
[4.485655] ? rescuer_thread+0x2ae/0x2ae
[4.486196] worker_thread+0x1e9/0x2be
[4.486841] kthread+0x115/0x11d
[4.487294] ? kthread_park+0x76/0x76
[4.487784] ret_from_fork+0x3a/0x50
[4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi ip_tables
[4.489428] Dumping ftrace buffer:
[4.489939]    (ftrace buffer empty)
[4.490492] CR2: 0098
[4.491052] ---[ end trace 03cd268ad5a86ff7 ]---

Cc: Christoph Hellwig
Cc: linux-n...@lists.infradead.org
Cc: David Milburn
Signed-off-by: Ming Lei
---
 block/blk-mq.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 900550594651..a3e463a726a6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -38,6 +38,11 @@
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
 
+static inline bool blk_mq_kdump_kernel(void)
+{
+	return !!is_kdump_kernel();
+}
+
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
 
@@ -2960,7 +2965,7 @@ static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 
 static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
 {
-	if (set->ops->map_queues) {
+	if (set->ops->map_queues && !blk_mq_kdump_kernel()) {
 		int i;
 
 		/*
@@ -3028,8 +3033,9 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	 * memory constrained environment. Limit us to 1 queue and
 	 * 64 tags to prevent using too much memory.
 	 */
-	if (is_kdump_kernel()) {
+	if (blk_mq_kdump_kernel()) {
 		set->nr_hw_queues = 1;
+		set->nr_maps = 1;
 		set->queue_depth = min(64U, set->queue_depth);
 	}
 	/*
@@ -3051,7 +3057,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 			GFP_KERNEL, set->numa_node);
 	if (!set->map[i].mq_map
Re: [PATCH] blk-mq: fix corruption with direct issue
On Tue, Dec 04, 2018 at 07:30:24PM -0700, Jens Axboe wrote: > On 12/4/18 7:27 PM, Ming Lei wrote: > > On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote: > >> On 12/4/18 6:37 PM, Ming Lei wrote: > >>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: > >>>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then > >>>> we queue the request up normally. However, the SCSI layer may have > >>>> already setup SG tables etc for this particular command. If we later > >>>> merge with this request, then the old tables are no longer valid. Once > >>>> we issue the IO, we only read/write the original part of the request, > >>>> not the new state of it. > >>>> > >>>> This causes data corruption, and is most often noticed with the file > >>>> system complaining about the just read data being invalid: > >>>> > >>>> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: > >>>> comm dpkg-query: bad extra_isize 24937 (inode size 256) > >>>> > >>>> because most of it is garbage... > >>>> > >>>> This doesn't happen from the normal issue path, as we will simply defer > >>>> the request to the hardware queue dispatch list if we fail. Once it's on > >>>> the dispatch list, we never merge with it. > >>>> > >>>> Fix this from the direct issue path by flagging the request as > >>>> REQ_NOMERGE so we don't change the size of it before issue. 
> >>>> > >>>> See also: > >>>> https://bugzilla.kernel.org/show_bug.cgi?id=201685 > >>>> > >>>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in > >>>> case of 'none'") > >>>> Signed-off-by: Jens Axboe > >>>> > >>>> --- > >>>> > >>>> diff --git a/block/blk-mq.c b/block/blk-mq.c > >>>> index 3f91c6e5b17a..d8f518c6ea38 100644 > >>>> --- a/block/blk-mq.c > >>>> +++ b/block/blk-mq.c > >>>> @@ -1715,6 +1715,15 @@ static blk_status_t > >>>> __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, > >>>> break; > >>>> case BLK_STS_RESOURCE: > >>>> case BLK_STS_DEV_RESOURCE: > >>>> +/* > >>>> + * If direct dispatch fails, we cannot allow any > >>>> merging on > >>>> + * this IO. Drivers (like SCSI) may have set up > >>>> permanent state > >>>> + * for this request, like SG tables and mappings, and > >>>> if we > >>>> + * merge to it later on then we'll still only do IO to > >>>> the > >>>> + * original part. > >>>> + */ > >>>> +rq->cmd_flags |= REQ_NOMERGE; > >>>> + > >>>> blk_mq_update_dispatch_busy(hctx, true); > >>>> __blk_mq_requeue_request(rq); > >>>> break; > >>>> > >>> > >>> Not sure it is enough to just mark it as NOMERGE, for example, driver > >>> may have setup the .special_vec for discard, and NOMERGE may not prevent > >>> request from entering elevator queue completely. Cause 'rq.rb_node' and > >>> 'rq.special_vec' share same space. > >> > >> We should rather limit the scope of the direct dispatch instead. It > >> doesn't make sense to do for anything but read/write anyway. > > > > discard is kind of write, and it isn't treated very specially in make > > request path, except for multi-range discard. > > The point of direct dispatch is to reduce latencies for requests, > discards are so damn slow on ALL devices anyway that it doesn't make any > sense to try direct dispatch to begin with, regardless of whether it > possible or not. SCSI MQ device may benefit from direct dispatch from reduced lock contention. 
> > >>> So how about inserting this request via blk_mq_request_bypass_insert() > >>> in case that direct issue returns BUSY? Then it is invariant that > >>> any request queued via .queue_rq() won't enter scheduler queue. > >> > >> I did consider this, but I didn't want to experiment with exercising > >> a new path for an important bug fix.
Re: [PATCH] blk-mq: fix corruption with direct issue
On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote: > On 12/4/18 6:37 PM, Ming Lei wrote: > > On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: > >> If we attempt a direct issue to a SCSI device, and it returns BUSY, then > >> we queue the request up normally. However, the SCSI layer may have > >> already setup SG tables etc for this particular command. If we later > >> merge with this request, then the old tables are no longer valid. Once > >> we issue the IO, we only read/write the original part of the request, > >> not the new state of it. > >> > >> This causes data corruption, and is most often noticed with the file > >> system complaining about the just read data being invalid: > >> > >> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: > >> comm dpkg-query: bad extra_isize 24937 (inode size 256) > >> > >> because most of it is garbage... > >> > >> This doesn't happen from the normal issue path, as we will simply defer > >> the request to the hardware queue dispatch list if we fail. Once it's on > >> the dispatch list, we never merge with it. > >> > >> Fix this from the direct issue path by flagging the request as > >> REQ_NOMERGE so we don't change the size of it before issue. > >> > >> See also: > >> https://bugzilla.kernel.org/show_bug.cgi?id=201685 > >> > >> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case > >> of 'none'") > >> Signed-off-by: Jens Axboe > >> > >> --- > >> > >> diff --git a/block/blk-mq.c b/block/blk-mq.c > >> index 3f91c6e5b17a..d8f518c6ea38 100644 > >> --- a/block/blk-mq.c > >> +++ b/block/blk-mq.c > >> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct > >> blk_mq_hw_ctx *hctx, > >>break; > >>case BLK_STS_RESOURCE: > >>case BLK_STS_DEV_RESOURCE: > >> + /* > >> + * If direct dispatch fails, we cannot allow any merging on > >> + * this IO. 
Drivers (like SCSI) may have set up permanent state > >> + * for this request, like SG tables and mappings, and if we > >> + * merge to it later on then we'll still only do IO to the > >> + * original part. > >> + */ > >> + rq->cmd_flags |= REQ_NOMERGE; > >> + > >>blk_mq_update_dispatch_busy(hctx, true); > >>__blk_mq_requeue_request(rq); > >>break; > >> > > > > Not sure it is enough to just mark it as NOMERGE, for example, driver > > may have setup the .special_vec for discard, and NOMERGE may not prevent > > request from entering elevator queue completely. Cause 'rq.rb_node' and > > 'rq.special_vec' share same space. > > We should rather limit the scope of the direct dispatch instead. It > doesn't make sense to do for anything but read/write anyway. discard is kind of write, and it isn't treated very specially in make request path, except for multi-range discard. > > > So how about inserting this request via blk_mq_request_bypass_insert() > > in case that direct issue returns BUSY? Then it is invariant that > > any request queued via .queue_rq() won't enter scheduler queue. > > I did consider this, but I didn't want to experiment with exercising > a new path for an important bug fix. You do realize that your original > patch has been corrupting data for months? I think a little caution > is in order here. But marking NOMERGE still may have a hole on re-insert discard request as mentioned above. Given we never allow to re-insert queued request to scheduler queue except for 6ce3dd6eec1, I think it is the correct thing to do, and the fix is simple too. Thanks, Ming
Re: [PATCH] blk-mq: fix corruption with direct issue
On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: > If we attempt a direct issue to a SCSI device, and it returns BUSY, then > we queue the request up normally. However, the SCSI layer may have > already setup SG tables etc for this particular command. If we later > merge with this request, then the old tables are no longer valid. Once > we issue the IO, we only read/write the original part of the request, > not the new state of it. > > This causes data corruption, and is most often noticed with the file > system complaining about the just read data being invalid: > > [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm > dpkg-query: bad extra_isize 24937 (inode size 256) > > because most of it is garbage... > > This doesn't happen from the normal issue path, as we will simply defer > the request to the hardware queue dispatch list if we fail. Once it's on > the dispatch list, we never merge with it. > > Fix this from the direct issue path by flagging the request as > REQ_NOMERGE so we don't change the size of it before issue. > > See also: > https://bugzilla.kernel.org/show_bug.cgi?id=201685 > > Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of > 'none'") > Signed-off-by: Jens Axboe > > --- > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 3f91c6e5b17a..d8f518c6ea38 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct > blk_mq_hw_ctx *hctx, > break; > case BLK_STS_RESOURCE: > case BLK_STS_DEV_RESOURCE: > + /* > + * If direct dispatch fails, we cannot allow any merging on > + * this IO. Drivers (like SCSI) may have set up permanent state > + * for this request, like SG tables and mappings, and if we > + * merge to it later on then we'll still only do IO to the > + * original part. 
> + */ > + rq->cmd_flags |= REQ_NOMERGE; > + > blk_mq_update_dispatch_busy(hctx, true); > __blk_mq_requeue_request(rq); > break; > Not sure it is enough to just mark it as NOMERGE, for example, driver may have setup the .special_vec for discard, and NOMERGE may not prevent request from entering elevator queue completely. Cause 'rq.rb_node' and 'rq.special_vec' share same space. So how about inserting this request via blk_mq_request_bypass_insert() in case that direct issue returns BUSY? Then it is invariant that any request queued via .queue_rq() won't enter scheduler queue. -- diff --git a/block/blk-mq.c b/block/blk-mq.c index 3f91c6e5b17a..4b2db0b2909e 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1764,7 +1764,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, if (bypass_insert) return BLK_STS_RESOURCE; - blk_mq_sched_insert_request(rq, false, run_queue, false); + blk_mq_request_bypass_insert(rq, run_queue); return BLK_STS_OK; } @@ -1780,7 +1780,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false); if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) - blk_mq_sched_insert_request(rq, false, true, false); + blk_mq_request_bypass_insert(rq, true); else if (ret != BLK_STS_OK) blk_mq_end_request(rq, ret); Thanks, Ming
Re: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag
On Tue, Dec 04, 2018 at 03:30:11PM +0530, Kashyap Desai wrote: > Problem statement: > Whenever we try to get an outstanding request via scsi_host_find_tag, the > block layer will return stale entries instead of the actual outstanding > request. Kernel panic if the stale entry is inaccessible or the memory is reused. > Fix: > Undo the request mapping in blk_mq_put_driver_tag once the request is returned. > > More detail: > Whenever each SDEV entry is created, the block layer allocates separate tags > and static requests. Those requests are not valid after the SDEV is deleted > from the system. On the fly, the block layer maps static rqs to rqs as below > from blk_mq_get_driver_tag() > > data.hctx->tags->rqs[rq->tag] = rq; > > The above mapping is for active in-use requests, and it is the same mapping > referred to in scsi_host_find_tag(). > After running some IOs, "data.hctx->tags->rqs[rq->tag]" will have some > entries which will never be reset in the block layer. However, if rq and rq->tag are valid, data.hctx->tags->rqs[rq->tag] should have pointed to one active request instead of the stale one, right? Thanks, Ming
block: sbitmap related lockdep warning
[register dump garbled in archive]
[  204.246336] null: module loaded
[  294.406501] null: module loaded
[  396.936354] null: module loaded
[  497.202198] end test sanity/001: (NO_HANG, 0)

Thanks,
Ming Lei
Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests
On Sun, Dec 2, 2018 at 12:48 AM Christoph Hellwig wrote: > > On Fri, Nov 30, 2018 at 01:36:09PM -0700, Jens Axboe wrote: > > On 11/30/18 1:26 PM, Keith Busch wrote: > > > A driver may wish to iterate every tagged request, not just ones that > > > satisfy blk_mq_request_started(). The intended use is so a driver may > > > terminate entered requests on quiesced queues. > > > > How about we just move the started check into the handler passed in for > > those that care about it? Much saner to make the interface iterate > > everything, and leave whatever state check to the callback. > > So we used to do that, and I changed it back in May to test for > MQ_RQ_IN_FLIGHT, and then Ming changed it to check > blk_mq_request_started. So this is clearly a minefield of sorts.. The change to blk_mq_request_started() was just for fixing SCSI's boot hang issue. > > Note that at least mtip32xx, nbd, skd and the various nvme transports > want to use the function to terminate all requests in the error > path, and it would be great to have one single understood, documented > and debugged helper for that in the core, so this is a vote for moving > more of the logic in your second helper into the core code. skd > will actually need to use ->complete to release resources for that, though, > and mtip plays some odd abort bits. If it weren't for the interesting > abort behavior in nvme-fc that means we could even unexport the > low-level interface. Agreed, and especially SCSI's use should be understood, given SCSI is so widely deployed in production systems. thanks, Ming Lei
[PATCH] block: fix single range discard merge
There are actually two kinds of discard merge:

- one is the normal discard merge, just like a normal read/write request; call it single-range discard
- another is the multi-range discard, where queue_max_discard_segments(rq->q) > 1

For the former case, queue_max_discard_segments(rq->q) is 1, and we should handle this kind of discard merge like a normal read/write request.

This patch fixes the following kernel panic issue [1], which is caused by not removing the single-range discard request from the elevator queue.

Guangwu has one raid discard test case in which this issue is a bit easier to trigger, and I verified that this patch can fix the kernel panic issue in Guangwu's test case.

[1] kernel panic log from Jens's report (register/hex values garbled in archive, omitted)

BUG: unable to handle kernel NULL pointer dereference at 0148
Oops: [#1] SMP PTI
CPU: 37 PID: 763 Comm: kworker/37:1H Not tainted 4.20.0-rc3-00649-ge64d9a554a91-dirty #14
Hardware name: Wiwynn Leopard-Orv2/Leopard-DDR BW, BIOS LBM08 03/03/2017
Workqueue: kblockd blk_mq_run_work_fn
RIP: 0010:blk_mq_get_driver_tag+0x81/0x120
Call Trace:
 blk_mq_dispatch_rq_list+0xec/0x480
 ? elv_rb_del+0x11/0x30
 blk_mq_do_dispatch_sched+0x6e/0xf0
 blk_mq_sched_dispatch_requests+0xfa/0x170
 __blk_mq_run_hw_queue+0x5f/0xe0
 process_one_work+0x154/0x350
 worker_thread+0x46/0x3c0
 kthread+0xf5/0x130
 ? process_one_work+0x350/0x350
 ? kthread_destroy_worker+0x50/0x50
 ret_from_fork+0x1f/0x30
Modules linked in: sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm switchtec irqbypass iTCO_wdt iTCO_vendor_support efivars cdc_ether usbnet mii cdc_acm i2c_i801 lpc_ich mfd_core ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq button sch_fq_codel nfsd nfs_acl lockd grace auth_rpcgss oid_registry sunrpc nvme nvme_core fuse sg loop efivarfs autofs4
---[ end trace 340a1fb996df1b9b ]---

Fixes: 445251d0f4d329a ("blk-mq: fix discard merge with scheduler attached")
Reported-by: Jens Axboe
Cc: Guangwu Zhang
Cc: Christoph Hellwig
Cc: Jianchao Wang
Signed-off-by: Ming Lei
---
 block/blk-merge.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index e7696c47489a..7695034f4b87 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -820,7 +820,7 @@ static struct request *attempt_merge(struct request_queue *q,
 	req->__data_len += blk_rq_bytes(next);
 
-	if (req_op(req) != REQ_OP_DISCARD)
+	if (!blk_discard_mergable(req))
 		elv_merge_requests(q, req, next);
 
 	/*
-- 
2.9.5
Re: [PATCH 7/7] blk-mq: use plug for devices that implement ->commits_rqs()
On Wed, Nov 28, 2018 at 06:35:38AM -0700, Jens Axboe wrote: > If we have that hook, we know the driver handles bd->last == true in > a smart fashion. If it does, even for multiple hardware queues, it's > a good idea to flush batches of requests to the device, if we have > batches of requests from the submitter. > > Signed-off-by: Jens Axboe > --- > block/blk-mq.c | 7 ++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 9e1045152e45..091557a43d53 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -1945,7 +1945,12 @@ static blk_qc_t blk_mq_make_request(struct > request_queue *q, struct bio *bio) > /* bypass scheduler for flush rq */ > blk_insert_flush(rq); > blk_mq_run_hw_queue(data.hctx, true); > - } else if (plug && q->nr_hw_queues == 1) { > + } else if (plug && (q->nr_hw_queues == 1 || q->mq_ops->commit_rqs)) { > + /* > + * Use plugging if we have a ->commit_rqs() hook as well, > + * as we know the driver uses bd->last in a smart > + * fashion. > + */ > unsigned int request_count = plug->rq_count; > struct request *last = NULL; > > -- > 2.17.1 > Reviewed-by: Ming Lei thanks, Ming
Re: [PATCH 5/7] ataflop: implement mq_ops->commit_rqs() hook
On Wed, Nov 28, 2018 at 06:35:36AM -0700, Jens Axboe wrote: > We need this for blk-mq to kick things into gear, if we told it that > we had more IO coming, but then failed to deliver on that promise. > > Reviewed-by: Omar Sandoval > Signed-off-by: Jens Axboe > --- > drivers/block/ataflop.c | 10 ++ > 1 file changed, 10 insertions(+) > > diff --git a/drivers/block/ataflop.c b/drivers/block/ataflop.c > index f88b4c26d422..a0c6745b0034 100644 > --- a/drivers/block/ataflop.c > +++ b/drivers/block/ataflop.c > @@ -1471,6 +1471,15 @@ static void setup_req_params( int drive ) > ReqTrack, ReqSector, (unsigned long)ReqData )); > } > > +static void ataflop_commit_rqs(struct blk_mq_hw_ctx *hctx) > +{ > + spin_lock_irq(_lock); > + atari_disable_irq(IRQ_MFP_FDC); > + finish_fdc(); > + atari_enable_irq(IRQ_MFP_FDC); > + spin_unlock_irq(_lock); > +} > + > static blk_status_t ataflop_queue_rq(struct blk_mq_hw_ctx *hctx, >const struct blk_mq_queue_data *bd) > { > @@ -1947,6 +1956,7 @@ static const struct block_device_operations floppy_fops > = { > > static const struct blk_mq_ops ataflop_mq_ops = { > .queue_rq = ataflop_queue_rq, > + .commit_rqs = ataflop_commit_rqs, > }; > > static struct kobject *floppy_find(dev_t dev, int *part, void *data) > -- > 2.17.1 > Reviewed-by: Ming Lei thanks, Ming
Re: [PATCH 6/7] blk-mq: use bd->last == true for list inserts
On Wed, Nov 28, 2018 at 06:35:37AM -0700, Jens Axboe wrote: > If we are issuing a list of requests, we know if we're at the last one. > If we fail issuing, ensure that we call ->commits_rqs() to flush any > potential previous requests. > > Reviewed-by: Omar Sandoval > Signed-off-by: Jens Axboe > --- > block/blk-core.c | 2 +- > block/blk-mq.c | 16 > block/blk-mq.h | 2 +- > 3 files changed, 10 insertions(+), 10 deletions(-) > > diff --git a/block/blk-core.c b/block/blk-core.c > index d107d016b92b..3f6f5e6c2fe4 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -1334,7 +1334,7 @@ blk_status_t blk_insert_cloned_request(struct > request_queue *q, struct request * >* bypass a potential scheduler on the bottom device for >* insert. >*/ > - return blk_mq_request_issue_directly(rq); > + return blk_mq_request_issue_directly(rq, true); > } > EXPORT_SYMBOL_GPL(blk_insert_cloned_request); > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 18ff47a85ad3..9e1045152e45 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -1743,12 +1743,12 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx > *hctx, struct request *rq) > > static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, > struct request *rq, > - blk_qc_t *cookie) > + blk_qc_t *cookie, bool last) > { > struct request_queue *q = rq->q; > struct blk_mq_queue_data bd = { > .rq = rq, > - .last = true, > + .last = last, > }; > blk_qc_t new_cookie; > blk_status_t ret; > @@ -1783,7 +1783,7 @@ static blk_status_t __blk_mq_issue_directly(struct > blk_mq_hw_ctx *hctx, > static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx, > struct request *rq, > blk_qc_t *cookie, > - bool bypass_insert) > + bool bypass_insert, bool last) > { > struct request_queue *q = rq->q; > bool run_queue = true; > @@ -1812,7 +1812,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct > blk_mq_hw_ctx *hctx, > goto insert; > } > > - return __blk_mq_issue_directly(hctx, rq, cookie); > + return 
__blk_mq_issue_directly(hctx, rq, cookie, last); > insert: > if (bypass_insert) > return BLK_STS_RESOURCE; > @@ -1831,7 +1831,7 @@ static void blk_mq_try_issue_directly(struct > blk_mq_hw_ctx *hctx, > > hctx_lock(hctx, _idx); > > - ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false); > + ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false, true); > if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) > blk_mq_sched_insert_request(rq, false, true, false); > else if (ret != BLK_STS_OK) > @@ -1840,7 +1840,7 @@ static void blk_mq_try_issue_directly(struct > blk_mq_hw_ctx *hctx, > hctx_unlock(hctx, srcu_idx); > } > > -blk_status_t blk_mq_request_issue_directly(struct request *rq) > +blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last) > { > blk_status_t ret; > int srcu_idx; > @@ -1848,7 +1848,7 @@ blk_status_t blk_mq_request_issue_directly(struct > request *rq) > struct blk_mq_hw_ctx *hctx = rq->mq_hctx; > > hctx_lock(hctx, _idx); > - ret = __blk_mq_try_issue_directly(hctx, rq, _cookie, true); > + ret = __blk_mq_try_issue_directly(hctx, rq, _cookie, true, last); > hctx_unlock(hctx, srcu_idx); > > return ret; > @@ -1863,7 +1863,7 @@ void blk_mq_try_issue_list_directly(struct > blk_mq_hw_ctx *hctx, > queuelist); > > list_del_init(>queuelist); > - ret = blk_mq_request_issue_directly(rq); > + ret = blk_mq_request_issue_directly(rq, list_empty(list)); > if (ret != BLK_STS_OK) { > if (ret == BLK_STS_RESOURCE || > ret == BLK_STS_DEV_RESOURCE) { > diff --git a/block/blk-mq.h b/block/blk-mq.h > index 9ae8e9f8f8b1..7291e5379358 100644 > --- a/block/blk-mq.h > +++ b/block/blk-mq.h > @@ -69,7 +69,7 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, > struct blk_mq_ctx *ctx, > struct list_head *list); > > /* Used by blk_insert_cloned_request() to issue request directly */ > -blk_status_t blk_mq_request_issue_directly(struct request *rq); > +blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last); > void 
blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx, > struct list_head *list); > > -- > 2.17.1 > Reviewed-by: Ming Lei Thanks, Ming
Re: [PATCH 5/8] virtio_blk: implement mq_ops->commit_rqs() hook
On Wed, Nov 28, 2018 at 08:13:43PM -0700, Jens Axboe wrote: > On 11/28/18 7:51 PM, Ming Lei wrote: > > On Wed, Nov 28, 2018 at 07:19:09PM -0700, Jens Axboe wrote: > >> On 11/28/18 6:23 PM, Ming Lei wrote: > >>> On Tue, Nov 27, 2018 at 07:34:51PM -0700, Jens Axboe wrote: > >>>> On 11/27/18 7:10 PM, Ming Lei wrote: > >>>>> On Mon, Nov 26, 2018 at 09:35:53AM -0700, Jens Axboe wrote: > >>>>>> We need this for blk-mq to kick things into gear, if we told it that > >>>>>> we had more IO coming, but then failed to deliver on that promise. > >>>>>> > >>>>>> Signed-off-by: Jens Axboe > >>>>>> --- > >>>>>> drivers/block/virtio_blk.c | 15 +++ > >>>>>> 1 file changed, 15 insertions(+) > >>>>>> > >>>>>> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c > >>>>>> index 6e869d05f91e..b49c57e77780 100644 > >>>>>> --- a/drivers/block/virtio_blk.c > >>>>>> +++ b/drivers/block/virtio_blk.c > >>>>>> @@ -214,6 +214,20 @@ static void virtblk_done(struct virtqueue *vq) > >>>>>>spin_unlock_irqrestore(>vqs[qid].lock, flags); > >>>>>> } > >>>>>> > >>>>>> +static void virtio_commit_rqs(struct blk_mq_hw_ctx *hctx) > >>>>>> +{ > >>>>>> + struct virtio_blk *vblk = hctx->queue->queuedata; > >>>>>> + int qid = hctx->queue_num; > >>>>>> + bool kick; > >>>>>> + > >>>>>> + spin_lock_irq(>vqs[qid].lock); > >>>>>> + kick = virtqueue_kick_prepare(vblk->vqs[qid].vq); > >>>>>> + spin_unlock_irq(>vqs[qid].lock); > >>>>>> + > >>>>>> + if (kick) > >>>>>> + virtqueue_notify(vblk->vqs[qid].vq); > >>>>>> +} > >>>>>> + > >>>>>> static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx, > >>>>>> const struct blk_mq_queue_data *bd) > >>>>>> { > >>>>>> @@ -638,6 +652,7 @@ static void virtblk_initialize_rq(struct request > >>>>>> *req) > >>>>>> > >>>>>> static const struct blk_mq_ops virtio_mq_ops = { > >>>>>>.queue_rq = virtio_queue_rq, > >>>>>> + .commit_rqs = virtio_commit_rqs, > >>>>>>.complete = virtblk_request_done, > >>>>>>.init_request = virtblk_init_request, > >>>>>> #ifdef 
CONFIG_VIRTIO_BLK_SCSI > >>>>>> -- > >>>>>> 2.17.1 > >>>>>> > >>>>> > >>>>> If .commit_rqs() is implemented, virtqueue_notify() in virtio_queue_rq() > >>>>> should have been removed for saving the world switch per .queue_rq() > >>>> > >>>> ->commits_rqs() is only for the case where bd->last is set to false, > >>>> and we never make it to the end and flag bd->last == true. If bd->last > >>>> is true, the driver should kick things into gear. > >>> > >>> OK, looks I misunderstood it. However, virtio-blk doesn't need this > >>> change since virtio_queue_rq() can handle it well. This patch may > >>> introduce > >>> one unnecessary VM world switch in case of queue busy. > >> > >> Not it won't, it may in the case of some failure outside of the driver. > > > > If the failure is because of out of tag, blk_mq_dispatch_wake() will > > rerun the queue, and the bd->last will be set finally. Or is there > > other failure(outside of driver) not covered? > > The point is to make this happen when we commit the IOs, not needing to > do a restart (or relying on IO being in-flight). If we're submitting a > string of requests, we should not rely on failures happening only due to > IO being going and thus restarting us. It defeats the purpose of even > having ->last in the first place. OK, it makes sense. > > >> The only reason that virtio-blk doesn't currently hang is because it > >> has restart logic, and the failure case only happens in the if we > >> already have IO in-flight. > > > > Yeah, virtqueue_kic
Re: [PATCH 5/8] virtio_blk: implement mq_ops->commit_rqs() hook
On Wed, Nov 28, 2018 at 07:19:09PM -0700, Jens Axboe wrote: > On 11/28/18 6:23 PM, Ming Lei wrote: > > On Tue, Nov 27, 2018 at 07:34:51PM -0700, Jens Axboe wrote: > >> On 11/27/18 7:10 PM, Ming Lei wrote: > >>> On Mon, Nov 26, 2018 at 09:35:53AM -0700, Jens Axboe wrote: > >>>> We need this for blk-mq to kick things into gear, if we told it that > >>>> we had more IO coming, but then failed to deliver on that promise. > >>>> > >>>> Signed-off-by: Jens Axboe > >>>> --- > >>>> drivers/block/virtio_blk.c | 15 +++ > >>>> 1 file changed, 15 insertions(+) > >>>> > >>>> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c > >>>> index 6e869d05f91e..b49c57e77780 100644 > >>>> --- a/drivers/block/virtio_blk.c > >>>> +++ b/drivers/block/virtio_blk.c > >>>> @@ -214,6 +214,20 @@ static void virtblk_done(struct virtqueue *vq) > >>>> spin_unlock_irqrestore(>vqs[qid].lock, flags); > >>>> } > >>>> > >>>> +static void virtio_commit_rqs(struct blk_mq_hw_ctx *hctx) > >>>> +{ > >>>> +struct virtio_blk *vblk = hctx->queue->queuedata; > >>>> +int qid = hctx->queue_num; > >>>> +bool kick; > >>>> + > >>>> +spin_lock_irq(>vqs[qid].lock); > >>>> +kick = virtqueue_kick_prepare(vblk->vqs[qid].vq); > >>>> +spin_unlock_irq(>vqs[qid].lock); > >>>> + > >>>> +if (kick) > >>>> +virtqueue_notify(vblk->vqs[qid].vq); > >>>> +} > >>>> + > >>>> static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx, > >>>> const struct blk_mq_queue_data *bd) > >>>> { > >>>> @@ -638,6 +652,7 @@ static void virtblk_initialize_rq(struct request > >>>> *req) > >>>> > >>>> static const struct blk_mq_ops virtio_mq_ops = { > >>>> .queue_rq = virtio_queue_rq, > >>>> +.commit_rqs = virtio_commit_rqs, > >>>> .complete = virtblk_request_done, > >>>> .init_request = virtblk_init_request, > >>>> #ifdef CONFIG_VIRTIO_BLK_SCSI > >>>> -- > >>>> 2.17.1 > >>>> > >>> > >>> If .commit_rqs() is implemented, virtqueue_notify() in virtio_queue_rq() > >>> should have been removed for saving the world switch per 
.queue_rq() > >> > >> ->commits_rqs() is only for the case where bd->last is set to false, > >> and we never make it to the end and flag bd->last == true. If bd->last > >> is true, the driver should kick things into gear. > > > > OK, looks I misunderstood it. However, virtio-blk doesn't need this > > change since virtio_queue_rq() can handle it well. This patch may introduce > > one unnecessary VM world switch in case of queue busy. > > Not it won't, it may in the case of some failure outside of the driver. If the failure is because of out of tag, blk_mq_dispatch_wake() will rerun the queue, and the bd->last will be set finally. Or is there other failure(outside of driver) not covered? > The only reason that virtio-blk doesn't currently hang is because it > has restart logic, and the failure case only happens in the if we > already have IO in-flight. Yeah, virtqueue_kick() is called in case of any error in virtio_queue_rq(), so I am still wondering why we have to implement .commit_rqs() for virtio-blk. > For the NVMe variant, that's not going to be the case. OK. > > > IMO bd->last won't work well in case of io scheduler given the rq_list > > only includes one single request. > > But that's a fake limitation that definitely should just be lifted, > the fact that blk-mq-sched is _currently_ just doing single requests > is woefully inefficient. I agree, but seems a bit hard given we have to consider request merge. > > > I wrote this kind of patch(never posted) before to use sort of > > ->commits_rqs() to replace the current bd->last mechanism which need > > one extra driver tag, which may improve the above case, also code gets > > cleaned up. > > It doesn't need one extra driver tag, we currently get an extra one just > to flag ->last correctly. That's not a requirement, that's a limitation > of the current implementation. We could get rid of that, and it it > proves to be an issue, that's not hard to do. What do you think about using .commit_rqs() to replace ->last? 
For example, just call .commit_rqs() after the last request has been queued to the driver successfully. Then we can remove bd->last and avoid getting the extra tag for figuring out bd->last. Thanks, Ming
Re: [PATCH 5/8] virtio_blk: implement mq_ops->commit_rqs() hook
On Tue, Nov 27, 2018 at 07:34:51PM -0700, Jens Axboe wrote: > On 11/27/18 7:10 PM, Ming Lei wrote: > > On Mon, Nov 26, 2018 at 09:35:53AM -0700, Jens Axboe wrote: > >> We need this for blk-mq to kick things into gear, if we told it that > >> we had more IO coming, but then failed to deliver on that promise. > >> > >> Signed-off-by: Jens Axboe > >> --- > >> drivers/block/virtio_blk.c | 15 +++ > >> 1 file changed, 15 insertions(+) > >> > >> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c > >> index 6e869d05f91e..b49c57e77780 100644 > >> --- a/drivers/block/virtio_blk.c > >> +++ b/drivers/block/virtio_blk.c > >> @@ -214,6 +214,20 @@ static void virtblk_done(struct virtqueue *vq) > >>spin_unlock_irqrestore(>vqs[qid].lock, flags); > >> } > >> > >> +static void virtio_commit_rqs(struct blk_mq_hw_ctx *hctx) > >> +{ > >> + struct virtio_blk *vblk = hctx->queue->queuedata; > >> + int qid = hctx->queue_num; > >> + bool kick; > >> + > >> + spin_lock_irq(>vqs[qid].lock); > >> + kick = virtqueue_kick_prepare(vblk->vqs[qid].vq); > >> + spin_unlock_irq(>vqs[qid].lock); > >> + > >> + if (kick) > >> + virtqueue_notify(vblk->vqs[qid].vq); > >> +} > >> + > >> static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx, > >> const struct blk_mq_queue_data *bd) > >> { > >> @@ -638,6 +652,7 @@ static void virtblk_initialize_rq(struct request *req) > >> > >> static const struct blk_mq_ops virtio_mq_ops = { > >>.queue_rq = virtio_queue_rq, > >> + .commit_rqs = virtio_commit_rqs, > >>.complete = virtblk_request_done, > >>.init_request = virtblk_init_request, > >> #ifdef CONFIG_VIRTIO_BLK_SCSI > >> -- > >> 2.17.1 > >> > > > > If .commit_rqs() is implemented, virtqueue_notify() in virtio_queue_rq() > > should have been removed for saving the world switch per .queue_rq() > > ->commits_rqs() is only for the case where bd->last is set to false, > and we never make it to the end and flag bd->last == true. If bd->last > is true, the driver should kick things into gear. 
OK, looks like I misunderstood it. However, virtio-blk doesn't need this change since virtio_queue_rq() can handle it well. This patch may introduce one unnecessary VM world switch in case of queue busy. IMO bd->last won't work well in case of an io scheduler, given the rq_list only includes one single request. I wrote this kind of patch (never posted) before, using a sort of ->commit_rqs() to replace the current bd->last mechanism, which needs one extra driver tag; it may improve the above case, and the code gets cleaned up too. Thanks, Ming
Re: [PATCH 5/8] virtio_blk: implement mq_ops->commit_rqs() hook
On Mon, Nov 26, 2018 at 09:35:53AM -0700, Jens Axboe wrote: > We need this for blk-mq to kick things into gear, if we told it that > we had more IO coming, but then failed to deliver on that promise. > > Signed-off-by: Jens Axboe > --- > drivers/block/virtio_blk.c | 15 +++ > 1 file changed, 15 insertions(+) > > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c > index 6e869d05f91e..b49c57e77780 100644 > --- a/drivers/block/virtio_blk.c > +++ b/drivers/block/virtio_blk.c > @@ -214,6 +214,20 @@ static void virtblk_done(struct virtqueue *vq) > spin_unlock_irqrestore(>vqs[qid].lock, flags); > } > > +static void virtio_commit_rqs(struct blk_mq_hw_ctx *hctx) > +{ > + struct virtio_blk *vblk = hctx->queue->queuedata; > + int qid = hctx->queue_num; > + bool kick; > + > + spin_lock_irq(>vqs[qid].lock); > + kick = virtqueue_kick_prepare(vblk->vqs[qid].vq); > + spin_unlock_irq(>vqs[qid].lock); > + > + if (kick) > + virtqueue_notify(vblk->vqs[qid].vq); > +} > + > static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx, > const struct blk_mq_queue_data *bd) > { > @@ -638,6 +652,7 @@ static void virtblk_initialize_rq(struct request *req) > > static const struct blk_mq_ops virtio_mq_ops = { > .queue_rq = virtio_queue_rq, > + .commit_rqs = virtio_commit_rqs, > .complete = virtblk_request_done, > .init_request = virtblk_init_request, > #ifdef CONFIG_VIRTIO_BLK_SCSI > -- > 2.17.1 > If .commit_rqs() is implemented, virtqueue_notify() in virtio_queue_rq() should have been removed for saving the world switch per .queue_rq() thanks, Ming
Re: [PATCH 7/8] blk-mq: use bd->last == true for list inserts
On Mon, Nov 26, 2018 at 09:35:55AM -0700, Jens Axboe wrote: > If we are issuing a list of requests, we know if we're at the last one. > If we fail issuing, ensure that we call ->commits_rqs() to flush any > potential previous requests. > > Signed-off-by: Jens Axboe > --- > block/blk-core.c | 2 +- > block/blk-mq.c | 32 > block/blk-mq.h | 2 +- > 3 files changed, 26 insertions(+), 10 deletions(-) > > diff --git a/block/blk-core.c b/block/blk-core.c > index c9758d185357..808a65d23f1a 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -1334,7 +1334,7 @@ blk_status_t blk_insert_cloned_request(struct > request_queue *q, struct request * >* bypass a potential scheduler on the bottom device for >* insert. >*/ > - return blk_mq_request_issue_directly(rq); > + return blk_mq_request_issue_directly(rq, true); > } > EXPORT_SYMBOL_GPL(blk_insert_cloned_request); > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 6a249bf6ed00..0a12cec0b426 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -1260,6 +1260,14 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, > struct list_head *list, > if (!list_empty(list)) { > bool needs_restart; > > + /* > + * If we didn't flush the entire list, we could have told > + * the driver there was more coming, but that turned out to > + * be a lie. > + */ > + if (q->mq_ops->commit_rqs) > + q->mq_ops->commit_rqs(hctx); > + Looks you miss to do it for blk_mq_do_dispatch_sched() and blk_mq_do_dispatch_ctx(), in which only one request is added to the rq_list. Maybe we can call the .commit_rqs(hctx) just at the end of blk_mq_sched_dispatch_requests() for cover all non-direct-issue cases. Thanks, Ming
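The failure case being discussed — telling the driver more requests were coming via bd->last and then stopping early — can be modeled in plain user-space C. This is only an illustrative sketch with invented names, not the kernel interface: the doorbell counter stands in for the costly notify that bd->last defers and ->commit_rqs() flushes.

```c
#include <stdbool.h>

/* Toy driver state: counts doorbell writes (the "world switches"). */
struct toy_drv {
    int pending;   /* requests queued but not yet kicked */
    int doorbells; /* how many times we notified the hardware */
};

/* Model of ->queue_rq(): only ring the doorbell when told this is the
 * last request of the sequence. */
void toy_queue_rq(struct toy_drv *d, bool last)
{
    d->pending++;
    if (last) {
        d->doorbells++;
        d->pending = 0;
    }
}

/* Model of ->commit_rqs(): flush anything queued when issuing stopped
 * before the request with last == true was ever sent. */
void toy_commit_rqs(struct toy_drv *d)
{
    if (d->pending) {
        d->doorbells++;
        d->pending = 0;
    }
}

/* Issue n requests, but stop early (think BLK_STS_RESOURCE) at 'stop'.
 * Returns how many requests were actually issued. */
int toy_issue_list(struct toy_drv *d, int n, int stop)
{
    for (int i = 0; i < n; i++) {
        if (i == stop) {
            /* We promised more requests but can't deliver: kick now. */
            toy_commit_rqs(d);
            return i;
        }
        toy_queue_rq(d, i == n - 1);
    }
    return n;
}
```

Either path ends with exactly one doorbell per batch, which is the whole point of deferring the notify to the last request.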
Re: [PATCH 3/8] blk-mq: add mq_ops->commit_rqs()
On Mon, Nov 26, 2018 at 09:35:51AM -0700, Jens Axboe wrote: > blk-mq passes information to the hardware about any given request being > the last that we will issue in this sequence. The point is that hardware > can defer costly doorbell type writes to the last request. But if we run > into errors issuing a sequence of requests, we may never send the request > with bd->last == true set. For that case, we need a hook that tells the > hardware that nothing else is coming right now. > > For failures returned by the drivers ->queue_rq() hook, the driver is > responsible for flushing pending requests, if it uses bd->last to > optimize that part. This works like before, no changes there. > > Signed-off-by: Jens Axboe > --- > include/linux/blk-mq.h | 10 ++ > 1 file changed, 10 insertions(+) > > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h > index ca0520ca6437..1fd139b65a6e 100644 > --- a/include/linux/blk-mq.h > +++ b/include/linux/blk-mq.h > @@ -117,6 +117,7 @@ struct blk_mq_queue_data { > > typedef blk_status_t (queue_rq_fn)(struct blk_mq_hw_ctx *, > const struct blk_mq_queue_data *); > +typedef void (commit_rqs_fn)(struct blk_mq_hw_ctx *); > /* takes rq->cmd_flags as input, returns a hardware type index */ > typedef int (rq_flags_to_type_fn)(struct request_queue *, unsigned int); > typedef bool (get_budget_fn)(struct blk_mq_hw_ctx *); > @@ -144,6 +145,15 @@ struct blk_mq_ops { >*/ > queue_rq_fn *queue_rq; > > + /* > + * If a driver uses bd->last to judge when to submit requests to > + * hardware, it must define this function. In case of errors that > + * make us stop issuing further requests, this hook serves the > + * purpose of kicking the hardware (which the last request otherwise > + * would have done). > + */ > + commit_rqs_fn *commit_rqs; > + > /* >* Return a queue map type for the given request/bio flags >*/ > -- > 2.17.1 > Looks fine, Reviewed-by: Ming Lei Thanks, Ming
Re: [PATCH] block/025: test discard sector alignement and sector size overflow
On Mon, Nov 26, 2018 at 04:33:10PM -0800, Omar Sandoval wrote: > On Thu, Nov 15, 2018 at 12:00:17PM +0800, Ming Lei wrote: > > This test covers the following two issues: > > > > 1) discard sector need to be aligned with logical block size > > > > 2) make sure 'sector_t' instead of 'unsigned int' is used when comparing > > with discard sector size > > > > Signed-off-by: Ming Lei > > --- > > tests/block/025 | 37 + > > tests/block/025.out | 2 ++ > > 2 files changed, 39 insertions(+) > > create mode 100755 tests/block/025 > > create mode 100644 tests/block/025.out > > > > diff --git a/tests/block/025 b/tests/block/025 > > new file mode 100755 > > index ..32b632431793 > > --- /dev/null > > +++ b/tests/block/025 > > @@ -0,0 +1,37 @@ > > +#!/bin/bash > > +# SPDX-License-Identifier: GPL-3.0+ > > +# Copyright (C) 2018 Ming Lei > > +# > > +# Check two corener cases of BLKDISCARD. > > +# > > +# 1) test if discard bio's sector is algined with logical size, fixed by > > +#1adfc5e4136f ("block: make sure discard bio is aligned with logical > > block size") > > Hm, I'm not seeing how this test case tests this commit. Aren't 2049G > and 512M both aligned to 4096 bytes? 2049G caused 32bit 'nr_sects' in __blkdev_issue_discard() overflow, please see commit 4800bf7bc8c725e955fcb ("block: fix 32 bit overflow in __blkdev_issue_discard()"). 4096 logical block size may trigger 'req_sects' un-alignment issue because 'nr_sects' from fs is 512byte aligned, please see 1adfc5e4136f5967d59 ("block: make sure discard bio is aligned with logical block size"). Thanks, Ming
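The two bugs the test covers come down to plain arithmetic, and can be demonstrated in user space. The sketch below (invented helper names, not the kernel code) shows why a 2049G device has more 512-byte sectors than fit in an 'unsigned int', and what 4096-byte logical block alignment means in sectors:

```c
#include <stdint.h>

typedef uint64_t sector_t;

/* Number of 512-byte sectors in a device of the given size in GiB. */
sector_t device_sectors(uint64_t gib)
{
    return gib * 1024 * 1024 * 1024 / 512;
}

/* The buggy form: storing the sector count in 'unsigned int'
 * silently truncates anything above UINT32_MAX. */
uint32_t truncated(sector_t nr_sects)
{
    return (uint32_t)nr_sects;
}

/* A discard request's sector count must stay aligned to the logical
 * block size; with 4096-byte blocks that is a multiple of 8 sectors. */
sector_t align_down(sector_t sects, unsigned int lbs_sectors)
{
    return sects - (sects % lbs_sectors);
}
```

With 2049 GiB the count is 4,297,064,448 sectors, just past the 32-bit limit — which is exactly why the test picks virtual_gb=2049.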
Re: [PATCH] block/025: test discard sector alignement and sector size overflow
On Thu, Nov 15, 2018 at 12:01 PM Ming Lei wrote: > > This test covers the following two issues: > > 1) discard sector need to be aligned with logical block size > > 2) make sure 'sector_t' instead of 'unsigned int' is used when comparing > with discard sector size > > Signed-off-by: Ming Lei > --- > tests/block/025 | 37 + > tests/block/025.out | 2 ++ > 2 files changed, 39 insertions(+) > create mode 100755 tests/block/025 > create mode 100644 tests/block/025.out > > diff --git a/tests/block/025 b/tests/block/025 > new file mode 100755 > index ..32b632431793 > --- /dev/null > +++ b/tests/block/025 > @@ -0,0 +1,37 @@ > +#!/bin/bash > +# SPDX-License-Identifier: GPL-3.0+ > +# Copyright (C) 2018 Ming Lei > +# > +# Check two corener cases of BLKDISCARD. > +# > +# 1) test if discard bio's sector is algined with logical size, fixed by > +#1adfc5e4136f ("block: make sure discard bio is aligned with logical > block size") > +# 2) test 32 bit overflow when comparing discard sector size. Fixed by > +#4800bf7bc8c72 ("block: fix 32 bit overflow in __blkdev_issue_discard()") > + > +. tests/block/rc > +. common/scsi_debug > + > +DESCRIPTION="check sector alignment and sector size overflow of BLKDISCARD" > + > +requires() { > + _have_scsi_debug > +} > + > +test() { > + echo "Running ${TEST_NAME}" > + > + rm -f "$FULL" > + > + # Create virtual device with unmap_zeroes_data support > + if ! _init_scsi_debug virtual_gb=2049 sector_size=4096 lbpws10=1 > dev_size_mb=512; then > + return 1 > + fi > + > + local dev="/dev/${SCSI_DEBUG_DEVICES[0]}" > + blkdiscard "$dev" > + > + _exit_scsi_debug > + > + echo "Test complete" > +} > diff --git a/tests/block/025.out b/tests/block/025.out > new file mode 100644 > index ..fd9a6d5f70de > --- /dev/null > +++ b/tests/block/025.out > @@ -0,0 +1,2 @@ > +Running block/025 > +Test complete > -- > 2.9.5 > Gentle ping... thanks, Ming Lei
Re: [PATCH 2/7] block: Remove bio->bi_ioc
On Tue, Nov 20, 2018 at 10:31:19AM -0700, Jens Axboe wrote: > On 11/20/18 10:21 AM, Ming Lei wrote: > > On Mon, Nov 19, 2018 at 12:51:26PM +0900, Damien Le Moal wrote: > >> bio->bi_ioc is never set so always NULL. Remove references to it in > >> bio_disassociate_task() and in rq_ioc() and delete this field from > >> struct bio. With this change, rq_ioc() always returns > >> current->io_context without the need for a bio argument. Further > >> simplify the code and make it more readable by also removing this > >> helper, which also allows to simplify blk_mq_sched_assign_ioc() by > >> removing its bio argument. > >> > >> Signed-off-by: Damien Le Moal > >> --- > >> block/bio.c | 4 > >> block/blk-core.c | 2 +- > >> block/blk-mq-sched.c | 4 ++-- > >> block/blk-mq-sched.h | 2 +- > >> block/blk-mq.c| 4 ++-- > >> block/blk.h | 16 > >> include/linux/blk_types.h | 3 +-- > >> 7 files changed, 7 insertions(+), 28 deletions(-) > >> > >> diff --git a/block/bio.c b/block/bio.c > >> index 4f4d9884443b..03895cc0d74a 100644 > >> --- a/block/bio.c > >> +++ b/block/bio.c > >> @@ -2027,10 +2027,6 @@ int bio_associate_blkg(struct bio *bio, struct > >> blkcg_gq *blkg) > >> */ > >> void bio_disassociate_task(struct bio *bio) > >> { > >> - if (bio->bi_ioc) { > >> - put_io_context(bio->bi_ioc); > >> - bio->bi_ioc = NULL; > >> - } > >>if (bio->bi_css) { > >>css_put(bio->bi_css); > >>bio->bi_css = NULL; > >> diff --git a/block/blk-core.c b/block/blk-core.c > >> index d6e8ab9ca99d..492648c96992 100644 > >> --- a/block/blk-core.c > >> +++ b/block/blk-core.c > >> @@ -813,7 +813,7 @@ unsigned int blk_plug_queued_count(struct > >> request_queue *q) > >> > >> void blk_init_request_from_bio(struct request *req, struct bio *bio) > >> { > >> - struct io_context *ioc = rq_ioc(bio); > >> + struct io_context *ioc = current->io_context; > >> > >>if (bio->bi_opf & REQ_RAHEAD) > >>req->cmd_flags |= REQ_FAILFAST_MASK; > >> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c > >> index 
d084f731d104..13b8dc332541 100644 > >> --- a/block/blk-mq-sched.c > >> +++ b/block/blk-mq-sched.c > >> @@ -31,10 +31,10 @@ void blk_mq_sched_free_hctx_data(struct request_queue > >> *q, > >> } > >> EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data); > >> > >> -void blk_mq_sched_assign_ioc(struct request *rq, struct bio *bio) > >> +void blk_mq_sched_assign_ioc(struct request *rq) > >> { > >>struct request_queue *q = rq->q; > >> - struct io_context *ioc = rq_ioc(bio); > >> + struct io_context *ioc = current->io_context; > >>struct io_cq *icq; > >> > >>spin_lock_irq(>queue_lock); > >> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h > >> index 7ff5671bf128..0f719c8532ae 100644 > >> --- a/block/blk-mq-sched.h > >> +++ b/block/blk-mq-sched.h > >> @@ -8,7 +8,7 @@ > >> void blk_mq_sched_free_hctx_data(struct request_queue *q, > >> void (*exit)(struct blk_mq_hw_ctx *)); > >> > >> -void blk_mq_sched_assign_ioc(struct request *rq, struct bio *bio); > >> +void blk_mq_sched_assign_ioc(struct request *rq); > >> > >> void blk_mq_sched_request_inserted(struct request *rq); > >> bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio, > >> diff --git a/block/blk-mq.c b/block/blk-mq.c > >> index 32b246ed44c0..636f80b96fa6 100644 > >> --- a/block/blk-mq.c > >> +++ b/block/blk-mq.c > >> @@ -389,8 +389,8 @@ static struct request *blk_mq_get_request(struct > >> request_queue *q, > >>if (!op_is_flush(data->cmd_flags)) { > >>rq->elv.icq = NULL; > >>if (e && e->type->ops.prepare_request) { > >> - if (e->type->icq_cache && rq_ioc(bio)) > >> - blk_mq_sched_assign_ioc(rq, bio); > >> + if (e->type->icq_cache) > >> + blk_mq_sched_assign_ioc(rq); > >> > >>e->type->ops.prepare_re
Re: [PATCH 2/7] block: Remove bio->bi_ioc
On Mon, Nov 19, 2018 at 12:51:26PM +0900, Damien Le Moal wrote: > bio->bi_ioc is never set so always NULL. Remove references to it in > bio_disassociate_task() and in rq_ioc() and delete this field from > struct bio. With this change, rq_ioc() always returns > current->io_context without the need for a bio argument. Further > simplify the code and make it more readable by also removing this > helper, which also allows to simplify blk_mq_sched_assign_ioc() by > removing its bio argument. > > Signed-off-by: Damien Le Moal > --- > block/bio.c | 4 > block/blk-core.c | 2 +- > block/blk-mq-sched.c | 4 ++-- > block/blk-mq-sched.h | 2 +- > block/blk-mq.c| 4 ++-- > block/blk.h | 16 > include/linux/blk_types.h | 3 +-- > 7 files changed, 7 insertions(+), 28 deletions(-) > > diff --git a/block/bio.c b/block/bio.c > index 4f4d9884443b..03895cc0d74a 100644 > --- a/block/bio.c > +++ b/block/bio.c > @@ -2027,10 +2027,6 @@ int bio_associate_blkg(struct bio *bio, struct > blkcg_gq *blkg) > */ > void bio_disassociate_task(struct bio *bio) > { > - if (bio->bi_ioc) { > - put_io_context(bio->bi_ioc); > - bio->bi_ioc = NULL; > - } > if (bio->bi_css) { > css_put(bio->bi_css); > bio->bi_css = NULL; > diff --git a/block/blk-core.c b/block/blk-core.c > index d6e8ab9ca99d..492648c96992 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -813,7 +813,7 @@ unsigned int blk_plug_queued_count(struct request_queue > *q) > > void blk_init_request_from_bio(struct request *req, struct bio *bio) > { > - struct io_context *ioc = rq_ioc(bio); > + struct io_context *ioc = current->io_context; > > if (bio->bi_opf & REQ_RAHEAD) > req->cmd_flags |= REQ_FAILFAST_MASK; > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c > index d084f731d104..13b8dc332541 100644 > --- a/block/blk-mq-sched.c > +++ b/block/blk-mq-sched.c > @@ -31,10 +31,10 @@ void blk_mq_sched_free_hctx_data(struct request_queue *q, > } > EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data); > > -void blk_mq_sched_assign_ioc(struct 
request *rq, struct bio *bio) > +void blk_mq_sched_assign_ioc(struct request *rq) > { > struct request_queue *q = rq->q; > - struct io_context *ioc = rq_ioc(bio); > + struct io_context *ioc = current->io_context; > struct io_cq *icq; > > spin_lock_irq(>queue_lock); > diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h > index 7ff5671bf128..0f719c8532ae 100644 > --- a/block/blk-mq-sched.h > +++ b/block/blk-mq-sched.h > @@ -8,7 +8,7 @@ > void blk_mq_sched_free_hctx_data(struct request_queue *q, >void (*exit)(struct blk_mq_hw_ctx *)); > > -void blk_mq_sched_assign_ioc(struct request *rq, struct bio *bio); > +void blk_mq_sched_assign_ioc(struct request *rq); > > void blk_mq_sched_request_inserted(struct request *rq); > bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio, > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 32b246ed44c0..636f80b96fa6 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -389,8 +389,8 @@ static struct request *blk_mq_get_request(struct > request_queue *q, > if (!op_is_flush(data->cmd_flags)) { > rq->elv.icq = NULL; > if (e && e->type->ops.prepare_request) { > - if (e->type->icq_cache && rq_ioc(bio)) > - blk_mq_sched_assign_ioc(rq, bio); > + if (e->type->icq_cache) > + blk_mq_sched_assign_ioc(rq); > > e->type->ops.prepare_request(rq, bio); > rq->rq_flags |= RQF_ELVPRIV; > diff --git a/block/blk.h b/block/blk.h > index 816a9abb87cd..610948157a5b 100644 > --- a/block/blk.h > +++ b/block/blk.h > @@ -254,22 +254,6 @@ void ioc_clear_queue(struct request_queue *q); > > int create_task_io_context(struct task_struct *task, gfp_t gfp_mask, int > node); > > -/** > - * rq_ioc - determine io_context for request allocation > - * @bio: request being allocated is for this bio (can be %NULL) > - * > - * Determine io_context to use for request allocation for @bio. May return > - * %NULL if %current->io_context doesn't exist. 
> - */ > -static inline struct io_context *rq_ioc(struct bio *bio) > -{ > -#ifdef CONFIG_BLK_CGROUP > - if (bio && bio->bi_ioc) > - return bio->bi_ioc; > -#endif > - return current->io_context; > -} > - > /** > * create_io_context - try to create task->io_context > * @gfp_mask: allocation mask > diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h > index dbdbfbd6a987..c0ba1a038ff3 100644 > --- a/include/linux/blk_types.h > +++ b/include/linux/blk_types.h > @@ -174,10 +174,9 @@ struct bio { > void*bi_private; > #ifdef CONFIG_BLK_CGROUP > /* > - * Optional ioc and
[PATCH 4.21 V3] blk-mq: not embed .mq_kobj and ctx->kobj into queue instance
Even though .mq_kobj, ctx->kobj and q->kobj share same lifetime from block layer's view, actually they don't because userspace may grab one kobject anytime via sysfs. This patch fixes the issue by the following approach: 1) introduce 'struct blk_mq_ctxs' for holding .mq_kobj and managing all ctxs 2) free all allocated ctxs and the 'blk_mq_ctxs' instance in release handler of .mq_kobj 3) grab one ref of .mq_kobj before initializing each ctx->kobj, so that .mq_kobj is always released after all ctxs are freed. This patch fixes kernel panic issue during booting when DEBUG_KOBJECT_RELEASE is enabled. Reported-by: Guenter Roeck Cc: "jianchao.wang" Cc: Guenter Roeck Cc: Greg Kroah-Hartman Cc: sta...@vger.kernel.org Signed-off-by: Ming Lei --- V3: - keep to allocate q->queue_ctx via percpu allocator, so one extra pointer reference can be saved for getting ctx V2: - allocate 'blk_mq_ctx' inside blk_mq_init_allocated_queue() - allocate q->mq_kobj directly block/blk-mq-sysfs.c | 34 -- block/blk-mq.c | 39 --- block/blk-mq.h | 6 ++ include/linux/blkdev.h | 2 +- 4 files changed, 63 insertions(+), 18 deletions(-) diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c index 3d25b9c419e9..6efef1f679f0 100644 --- a/block/blk-mq-sysfs.c +++ b/block/blk-mq-sysfs.c @@ -15,6 +15,18 @@ static void blk_mq_sysfs_release(struct kobject *kobj) { + struct blk_mq_ctxs *ctxs = container_of(kobj, struct blk_mq_ctxs, kobj); + + free_percpu(ctxs->queue_ctx); + kfree(ctxs); +} + +static void blk_mq_ctx_sysfs_release(struct kobject *kobj) +{ + struct blk_mq_ctx *ctx = container_of(kobj, struct blk_mq_ctx, kobj); + + /* ctx->ctxs won't be released until all ctx are freed */ + kobject_put(>ctxs->kobj); } static void blk_mq_hw_sysfs_release(struct kobject *kobj) @@ -213,7 +225,7 @@ static struct kobj_type blk_mq_ktype = { static struct kobj_type blk_mq_ctx_ktype = { .sysfs_ops = _mq_sysfs_ops, .default_attrs = default_ctx_attrs, - .release= blk_mq_sysfs_release, + .release= blk_mq_ctx_sysfs_release, 
}; static struct kobj_type blk_mq_hw_ktype = { @@ -245,7 +257,7 @@ static int blk_mq_register_hctx(struct blk_mq_hw_ctx *hctx) if (!hctx->nr_ctx) return 0; - ret = kobject_add(>kobj, >mq_kobj, "%u", hctx->queue_num); + ret = kobject_add(>kobj, q->mq_kobj, "%u", hctx->queue_num); if (ret) return ret; @@ -268,8 +280,8 @@ void blk_mq_unregister_dev(struct device *dev, struct request_queue *q) queue_for_each_hw_ctx(q, hctx, i) blk_mq_unregister_hctx(hctx); - kobject_uevent(>mq_kobj, KOBJ_REMOVE); - kobject_del(>mq_kobj); + kobject_uevent(q->mq_kobj, KOBJ_REMOVE); + kobject_del(q->mq_kobj); kobject_put(>kobj); q->mq_sysfs_init_done = false; @@ -289,7 +301,7 @@ void blk_mq_sysfs_deinit(struct request_queue *q) ctx = per_cpu_ptr(q->queue_ctx, cpu); kobject_put(>kobj); } - kobject_put(>mq_kobj); + kobject_put(q->mq_kobj); } void blk_mq_sysfs_init(struct request_queue *q) @@ -297,10 +309,12 @@ void blk_mq_sysfs_init(struct request_queue *q) struct blk_mq_ctx *ctx; int cpu; - kobject_init(>mq_kobj, _mq_ktype); + kobject_init(q->mq_kobj, _mq_ktype); for_each_possible_cpu(cpu) { ctx = per_cpu_ptr(q->queue_ctx, cpu); + + kobject_get(q->mq_kobj); kobject_init(>kobj, _mq_ctx_ktype); } } @@ -313,11 +327,11 @@ int __blk_mq_register_dev(struct device *dev, struct request_queue *q) WARN_ON_ONCE(!q->kobj.parent); lockdep_assert_held(>sysfs_lock); - ret = kobject_add(>mq_kobj, kobject_get(>kobj), "%s", "mq"); + ret = kobject_add(q->mq_kobj, kobject_get(>kobj), "%s", "mq"); if (ret < 0) goto out; - kobject_uevent(>mq_kobj, KOBJ_ADD); + kobject_uevent(q->mq_kobj, KOBJ_ADD); queue_for_each_hw_ctx(q, hctx, i) { ret = blk_mq_register_hctx(hctx); @@ -334,8 +348,8 @@ int __blk_mq_register_dev(struct device *dev, struct request_queue *q) while (--i >= 0) blk_mq_unregister_hctx(q->queue_hw_ctx[i]); - kobject_uevent(>mq_kobj, KOBJ_REMOVE); - kobject_del(>mq_kobj); + kobject_uevent(q->mq_kobj, KOBJ_REMOVE); + kobject_del(q->mq_kobj); kobject_put(>kobj); return ret; } diff --git a/block/blk-mq.c 
b/block/blk-mq.c index 32b246ed44c0..9228f2e9bd40 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.
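The ordering the patch establishes — every ctx->kobj holds a reference on the parent .mq_kobj, so the parent's release handler runs only after all ctxs are gone — can be modeled with a bare reference count. A user-space sketch with invented names, not the kobject API:

```c
/* Stand-in for the .mq_kobj: a refcount plus a "released" flag. */
struct toy_parent {
    int refs;
    int released;
};

void toy_parent_get(struct toy_parent *p)
{
    p->refs++;
}

void toy_parent_put(struct toy_parent *p)
{
    if (--p->refs == 0)
        p->released = 1;  /* release handler: free ctxs + parent */
}

struct toy_ctx {
    struct toy_parent *parent;
};

/* Mirrors blk_mq_sysfs_init(): each ctx grabs a parent reference
 * before its own kobject is initialized. */
void toy_ctx_init(struct toy_ctx *ctx, struct toy_parent *p)
{
    ctx->parent = p;
    toy_parent_get(p);
}

/* Mirrors blk_mq_ctx_sysfs_release(): the last ctx to go away drops
 * the last parent ref, so the parent is released strictly afterwards,
 * no matter how long userspace pinned any individual ctx kobject. */
void toy_ctx_release(struct toy_ctx *ctx)
{
    toy_parent_put(ctx->parent);
}
```

This is why DEBUG_KOBJECT_RELEASE (which delays kobject release to expose exactly this kind of ordering bug) no longer panics.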
Re: [PATCH V2 for-4.21 2/2] blk-mq: alloc q->queue_ctx as normal array
On Mon, Nov 19, 2018 at 11:17:49AM +0100, Greg Kroah-Hartman wrote: > On Mon, Nov 19, 2018 at 10:04:27AM +0800, Ming Lei wrote: > > On Sat, Nov 17, 2018 at 11:03:42AM +0100, Greg Kroah-Hartman wrote: > > > On Sat, Nov 17, 2018 at 10:34:18AM +0800, Ming Lei wrote: > > > > On Fri, Nov 16, 2018 at 06:06:23AM -0800, Greg Kroah-Hartman wrote: > > > > > On Fri, Nov 16, 2018 at 07:23:11PM +0800, Ming Lei wrote: > > > > > > Now q->queue_ctx is just one read-mostly table for query the > > > > > > 'blk_mq_ctx' instance from one cpu index, it isn't necessary > > > > > > to allocate it as percpu variable. One simple array may be > > > > > > more efficient. > > > > > > > > > > "may be", have you run benchmarks to be sure? If so, can you add the > > > > > results of them to this changelog? If there is no measurable > > > > > difference, then why make this change at all? > > > > > > > > __blk_mq_get_ctx() is used in fast path, what do you think about which > > > > one is more efficient? > > > > > > > > - *per_cpu_ptr(q->queue_ctx, cpu); > > > > > > > > - q->queue_ctx[cpu] > > > > > > You need to actually test to see which one is faster, you might be > > > surprised :) > > > > > > In other words, do not just guess. > > > > No performance difference is observed wrt. this patchset when I > > run the following fio test on null_blk(modprobe null_blk) in my VM: > > > > fio --direct=1 --size=128G --bsrange=4k-4k --runtime=40 --numjobs=32 \ > > --ioengine=libaio --iodepth=64 --group_reporting=1 --filename=/dev/nullb0 > > \ > > --name=null_blk-ttest-randread --rw=randread > > > > Running test is important, but IMO it is more important to understand > > the idea behind is correct, or the approach can be proved as correct. > > > > Given the count of test cases can be increased exponentially when the > > related > > factors or settings are covered, obviously we can't run all the test cases. > > And what happens when you start to scale the number of queues and cpus > in the system? 
It is related to the number of cpus. > Does both options work the same? This patch may introduce one extra memory read for getting one 'blk_mq_ctx' instance. > Why did the original code have per-cpu variables? Each instance of 'struct blk_mq_ctx' is actually percpu, so it is natural to allocate it as one percpu variable. However, there is the kobject lifetime issue if we allocate all 'blk_mq_ctx' instances as one single percpu variable, because we can't release just one part (for one cpu) of the single percpu variable. So this patch converts the percpu variable into one lookup table (read only) plus one 'blk_mq_ctx' instance for each CPU, and all the instances are allocated from the local NUMA node of its CPU. Another approach is to keep the percpu allocation and introduce one reference counter for counting how many active 'ctx' there are, and only free the percpu variable when the ref drops to zero; then we may save one extra memory read in __blk_mq_get_ctx(). Thanks, Ming
Re: [PATCH V2 for-4.21 1/2] blk-mq: not embed .mq_kobj and ctx->kobj into queue instance
On Mon, Nov 19, 2018 at 11:06:06AM +0100, Greg Kroah-Hartman wrote: > On Sat, Nov 17, 2018 at 10:26:38AM +0800, Ming Lei wrote: > > On Fri, Nov 16, 2018 at 06:05:21AM -0800, Greg Kroah-Hartman wrote: > > > On Fri, Nov 16, 2018 at 07:23:10PM +0800, Ming Lei wrote: > > > > @@ -456,7 +456,7 @@ struct request_queue { > > > > /* > > > > * mq queue kobject > > > > */ > > > > - struct kobject mq_kobj; > > > > + struct kobject *mq_kobj; > > > > > > What is this kobject even used for? It wasn't obvious at all from this > > > patch, why is it needed if you are not using it to reference count the > > > larger structure here? > > > > All attributes and kobjects under /sys/block/$DEV/mq are covered by this > > kobject > > actually, and all are for exposing blk-mq specific information, but now > > there is > > only blk-mq, and legacy io path is removed. > > I am sorry, but I really can not parse this sentance at all. > > What Documentation/ABI/ entries are covered by this kobject, that should > help me out more. And what do you mean by "legacy io"? After blk-mq is introduced, there are two main IO request paths: 1) blk-mq, IO is queued via blk_mq_make_request() 2) legacy IO path, IO is queued via blk_queue_bio() The legacy IO path will be removed in 4.21, and patches have been queued in for-4.21/block. > > > That is why I mentioned we may remove this kobject last time and move all > > under > > /sys/block/$DEV/queue, however you thought that may break some userspace. > > Who relies on these sysfs files today? I don't know, actually the question is from you, :-( https://marc.info/?l=linux-kernel=154205455332755=2 Even we remove q->mq_kobj, the same kobject lifetime issue is still here in kobjects embedded in 'struct blk_mq_ctx'. > > > If we want to backport them to stable, this patch may be a bit easier to go. > > Why do you want to backport any of this to stable? 
For this report from Guenter, it should be enough to backport the 1st patch, and it shouldn't be very hard to backport both. I can help backport them if the patches can't be applied cleanly. Thanks, Ming
Re: [PATCH V10 18/19] block: kill QUEUE_FLAG_NO_SG_MERGE
On Fri, Nov 16, 2018 at 02:58:03PM +0100, Christoph Hellwig wrote: > On Thu, Nov 15, 2018 at 04:53:05PM +0800, Ming Lei wrote: > > Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after > > splitting"), > > physical segment number is mainly figured out in blk_queue_split() for > > fast path, and the flag of BIO_SEG_VALID is set there too. > > > > Now only blk_recount_segments() and blk_recalc_rq_segments() use this > > flag. > > > > Basically blk_recount_segments() is bypassed in fast path given > > BIO_SEG_VALID > > is set in blk_queue_split(). > > > > For another user of blk_recalc_rq_segments(): > > > > - run in partial completion branch of blk_update_request, which is an > > unusual case > > > > - run in blk_cloned_rq_check_limits(), still not a big problem if the flag > > is killed > > since dm-rq is the only user. > > > > Multi-page bvec is enabled now, QUEUE_FLAG_NO_SG_MERGE doesn't make sense > > any more. > > > > Cc: Dave Chinner > > Cc: Kent Overstreet > > Cc: Mike Snitzer > > Cc: dm-de...@redhat.com > > Cc: Alexander Viro > > Cc: linux-fsde...@vger.kernel.org > > Cc: Shaohua Li > > Cc: linux-r...@vger.kernel.org > > Cc: linux-er...@lists.ozlabs.org > > Cc: David Sterba > > Cc: linux-bt...@vger.kernel.org > > Cc: Darrick J. 
Wong > > Cc: linux-...@vger.kernel.org > > Cc: Gao Xiang > > Cc: Christoph Hellwig > > Cc: Theodore Ts'o > > Cc: linux-e...@vger.kernel.org > > Cc: Coly Li > > Cc: linux-bca...@vger.kernel.org > > Cc: Boaz Harrosh > > Cc: Bob Peterson > > Cc: cluster-de...@redhat.com > > Signed-off-by: Ming Lei > > --- > > block/blk-merge.c | 31 ++- > > block/blk-mq-debugfs.c | 1 - > > block/blk-mq.c | 3 --- > > drivers/md/dm-table.c | 13 - > > include/linux/blkdev.h | 1 - > > 5 files changed, 6 insertions(+), 43 deletions(-) > > > > diff --git a/block/blk-merge.c b/block/blk-merge.c > > index 153a659fde74..06be298be332 100644 > > --- a/block/blk-merge.c > > +++ b/block/blk-merge.c > > @@ -351,8 +351,7 @@ void blk_queue_split(struct request_queue *q, struct > > bio **bio) > > EXPORT_SYMBOL(blk_queue_split); > > > > static unsigned int __blk_recalc_rq_segments(struct request_queue *q, > > -struct bio *bio, > > -bool no_sg_merge) > > +struct bio *bio) > > { > > struct bio_vec bv, bvprv = { NULL }; > > int cluster, prev = 0; > > @@ -379,13 +378,6 @@ static unsigned int __blk_recalc_rq_segments(struct > > request_queue *q, > > nr_phys_segs = 0; > > for_each_bio(bio) { > > bio_for_each_bvec(bv, bio, iter) { > > - /* > > -* If SG merging is disabled, each bio vector is > > -* a segment > > -*/ > > - if (no_sg_merge) > > - goto new_segment; > > - > > if (prev && cluster) { > > if (seg_size + bv.bv_len > > > queue_max_segment_size(q)) > > @@ -420,27 +412,16 @@ static unsigned int __blk_recalc_rq_segments(struct > > request_queue *q, > > > > void blk_recalc_rq_segments(struct request *rq) > > { > > - bool no_sg_merge = !!test_bit(QUEUE_FLAG_NO_SG_MERGE, > > - >q->queue_flags); > > - > > - rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio, > > - no_sg_merge); > > + rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio); > > Can we rename __blk_recalc_rq_segments to blk_recalc_rq_segments > can kill the old blk_recalc_rq_segments now? Sure. Thanks, Ming
Re: [PATCH V10 18/19] block: kill QUEUE_FLAG_NO_SG_MERGE
On Thu, Nov 15, 2018 at 06:18:11PM -0800, Omar Sandoval wrote: > On Thu, Nov 15, 2018 at 04:53:05PM +0800, Ming Lei wrote: > > Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after > > splitting"), > > physical segment number is mainly figured out in blk_queue_split() for > > fast path, and the flag of BIO_SEG_VALID is set there too. > > > > Now only blk_recount_segments() and blk_recalc_rq_segments() use this > > flag. > > > > Basically blk_recount_segments() is bypassed in fast path given > > BIO_SEG_VALID > > is set in blk_queue_split(). > > > > For another user of blk_recalc_rq_segments(): > > > > - run in partial completion branch of blk_update_request, which is an > > unusual case > > > > - run in blk_cloned_rq_check_limits(), still not a big problem if the flag > > is killed > > since dm-rq is the only user. > > > > Multi-page bvec is enabled now, QUEUE_FLAG_NO_SG_MERGE doesn't make sense > > any more. > > This commit message wasn't very clear. Is it the case that > QUEUE_FLAG_NO_SG_MERGE is no longer set by any drivers? OK, I will add the explanation to commit log in next version. 05f1dd53152173 (block: add queue flag for disabling SG merging) introduces this flag for NVMe performance purpose only, so that merging to segment can be bypassed for NVMe. Actually this optimization was bypassed by 54efd50bfd873e2d (block: make generic_make_request handle arbitrarily sized bios) and bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting"). Now segment computation can be very quick, given most of times one bvec can be thought as one segment, so we can remove the flag. thanks, Ming
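The "one bvec can be treated as one segment" argument is essentially the merge loop itself: a new physical segment starts only when the previous one cannot be extended. A hedged user-space model of that loop (invented types; it ignores queue segment boundaries and only checks physical contiguity and the max segment size):

```c
#include <stdint.h>

struct toy_bvec {
    uint64_t phys;  /* physical address covered by this vector */
    uint32_t len;
};

/* Count physical segments, merging physically contiguous vectors as
 * long as the merged segment stays within max_seg_size — the check
 * that QUEUE_FLAG_NO_SG_MERGE used to skip entirely. */
int toy_count_segments(const struct toy_bvec *bv, int n,
                       uint32_t max_seg_size)
{
    int segs = 0;
    uint32_t seg_size = 0;
    uint64_t next_phys = 0;

    for (int i = 0; i < n; i++) {
        if (segs && bv[i].phys == next_phys &&
            seg_size + bv[i].len <= max_seg_size) {
            seg_size += bv[i].len;      /* extend current segment */
        } else {
            segs++;                     /* start a new segment */
            seg_size = bv[i].len;
        }
        next_phys = bv[i].phys + bv[i].len;
    }
    return segs;
}
```

With multi-page bvecs the contiguous case is already folded into a single vector at bio build time, so this loop rarely merges anything — which is why skipping it no longer buys anything.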
Re: [PATCH V10 17/19] block: don't use bio->bi_vcnt to figure out segment number
On Thu, Nov 15, 2018 at 06:11:40PM -0800, Omar Sandoval wrote: > On Thu, Nov 15, 2018 at 04:53:04PM +0800, Ming Lei wrote: > > It is wrong to use bio->bi_vcnt to figure out how many segments > > there are in the bio even though CLONED flag isn't set on this bio, > > because this bio may be splitted or advanced. > > > > So always use bio_segments() in blk_recount_segments(), and it shouldn't > > cause any performance loss now because the physical segment number is > > figured > > out in blk_queue_split() and BIO_SEG_VALID is set meantime since > > bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting"). > > > > Cc: Dave Chinner > > Cc: Kent Overstreet > > Fixes: 7f60dcaaf91 ("block: blk-merge: fix blk_recount_segments()") > > From what I can tell, the problem was originally introduced by > 76d8137a3113 ("blk-merge: recaculate segment if it isn't less than max > segments") > > Is that right? Indeed, will update it in next version. Thanks, Ming
Re: [PATCH V10 15/19] block: always define BIO_MAX_PAGES as 256
On Thu, Nov 15, 2018 at 05:59:36PM -0800, Omar Sandoval wrote: > On Thu, Nov 15, 2018 at 04:53:02PM +0800, Ming Lei wrote: > > Now multi-page bvec can cover CONFIG_THP_SWAP, so we don't need to > > increase BIO_MAX_PAGES for it. > You mentioned it in the cover letter, but this needs more explanation > in the commit message. Why did CONFIG_THP_SWAP require > 256? Why does > multipage bvecs remove that requirement? CONFIG_THP_SWAP needs to split one THP into normal pages and add them all to one bio. With multi-page bvecs, it just takes one bvec to hold them all. thanks, Ming
Re: [PATCH V10 14/19] block: enable multipage bvecs
On Fri, Nov 16, 2018 at 02:53:08PM +0100, Christoph Hellwig wrote: > > - > > - if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) { > > - bv->bv_len += len; > > - bio->bi_iter.bi_size += len; > > - return true; > > - } > > + struct request_queue *q = NULL; > > + > > + if (page == bv->bv_page && off == (bv->bv_offset + bv->bv_len) > > + && (off + len) <= PAGE_SIZE) > > How could the page struct be the same, but the range beyond PAGE_SIZE > (at least with the existing callers)? > > Also no need for the inner btraces, and the && always goes on the > first line. OK. > > > + if (bio->bi_disk) > > + q = bio->bi_disk->queue; > > + > > + /* disable multi-page bvec too if cluster isn't enabled */ > > + if (!q || !blk_queue_cluster(q) || > > + ((page_to_phys(bv->bv_page) + bv->bv_offset + bv->bv_len) != > > +(page_to_phys(page) + off))) > > + return false; > > + merge: > > + bv->bv_len += len; > > + bio->bi_iter.bi_size += len; > > + return true; > > Ok, this is scary, as it will give differen results depending on when > bi_disk is assigned. It is just merge or not, both can be handled well now. > But then again we shouldn't really do the cluster > check here, but rather when splitting the bio for the actual low-level > driver. Yeah, I thought of this way too, but it may cause tons of bio split for no-clustering, and there are quite a few scsi devices which require to disable clustering. [linux]$ git grep -n DISABLE_CLUSTERING ./drivers/scsi/ | wc -l 28 Or we may introduce bio_split_to_single_page_bvec() to allocate & convert to single-page bvec table for non-clustering, will try this approach in next version. > > (and eventually we should kill this clustering setting off in favor > of our normal segment limits). Yeah, it has been in my post-multi-page todo list already, :-) thanks, Ming
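The two merge cases in the quoted hunk — appending within the same page, and extending across physically contiguous pages when the queue supports clustering — can be written as a small predicate. A user-space sketch with invented types, not the real __bio_try_merge_page():

```c
#include <stdbool.h>
#include <stdint.h>

struct toy_page {
    uint64_t phys;  /* stand-in for page_to_phys() */
};

struct toy_bv {
    struct toy_page *page;
    uint32_t offset;
    uint32_t len;
};

/* Can (page, off) be glued onto the tail bvec of the bio? */
bool toy_can_merge(const struct toy_bv *bv, const struct toy_page *page,
                   uint32_t off, bool cluster)
{
    /* Same page, and the new data starts exactly where the bvec
     * ends: the pre-existing sub-page-size optimization. */
    if (page == bv->page && off == bv->offset + bv->len)
        return true;

    /* Multi-page bvec: allowed only when the queue supports
     * clustering and the bytes are physically contiguous. */
    return cluster &&
           bv->page->phys + bv->offset + bv->len == page->phys + off;
}
```

The second branch is the one that disables multi-page bvecs for the no-clustering SCSI hosts mentioned above, while leaving the same-page case untouched.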
Re: [PATCH V10 14/19] block: enable multipage bvecs
On Thu, Nov 15, 2018 at 05:56:27PM -0800, Omar Sandoval wrote: > On Thu, Nov 15, 2018 at 04:53:01PM +0800, Ming Lei wrote: > > This patch pulls the trigger for multi-page bvecs. > > > > Now any request queue which supports queue cluster will see multi-page > > bvecs. > > > > Cc: Dave Chinner > > Cc: Kent Overstreet > > Cc: Mike Snitzer > > Cc: dm-de...@redhat.com > > Cc: Alexander Viro > > Cc: linux-fsde...@vger.kernel.org > > Cc: Shaohua Li > > Cc: linux-r...@vger.kernel.org > > Cc: linux-er...@lists.ozlabs.org > > Cc: David Sterba > > Cc: linux-bt...@vger.kernel.org > > Cc: Darrick J. Wong > > Cc: linux-...@vger.kernel.org > > Cc: Gao Xiang > > Cc: Christoph Hellwig > > Cc: Theodore Ts'o > > Cc: linux-e...@vger.kernel.org > > Cc: Coly Li > > Cc: linux-bca...@vger.kernel.org > > Cc: Boaz Harrosh > > Cc: Bob Peterson > > Cc: cluster-de...@redhat.com > > Signed-off-by: Ming Lei > > --- > > block/bio.c | 24 ++-- > > 1 file changed, 18 insertions(+), 6 deletions(-) > > > > diff --git a/block/bio.c b/block/bio.c > > index 6486722d4d4b..ed6df6f8e63d 100644 > > --- a/block/bio.c > > +++ b/block/bio.c > > This comment above __bio_try_merge_page() doesn't make sense after this > change: > > This is a > a useful optimisation for file systems with a block size smaller than the > page size. > > Can you please get rid of it in this patch? I understand __bio_try_merge_page() still works for original cases, so looks the optimization for sub-pagesize is still there too, isn't it? 
> > > @@ -767,12 +767,24 @@ bool __bio_try_merge_page(struct bio *bio, struct > > page *page, > > > > if (bio->bi_vcnt > 0) { > > struct bio_vec *bv = >bi_io_vec[bio->bi_vcnt - 1]; > > - > > - if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) { > > - bv->bv_len += len; > > - bio->bi_iter.bi_size += len; > > - return true; > > - } > > + struct request_queue *q = NULL; > > + > > + if (page == bv->bv_page && off == (bv->bv_offset + bv->bv_len) > > + && (off + len) <= PAGE_SIZE) > > + goto merge; > > The parentheses around (bv->bv_offset + bv->bv_len) and (off + len) are > unnecessary noise. > > What's the point of the new (off + len) <= PAGE_SIZE check? Yeah, I don't know why I did it, :-(, the check is absolutely always true. > > > + > > + if (bio->bi_disk) > > + q = bio->bi_disk->queue; > > + > > + /* disable multi-page bvec too if cluster isn't enabled */ > > + if (!q || !blk_queue_cluster(q) || > > + ((page_to_phys(bv->bv_page) + bv->bv_offset + bv->bv_len) != > > +(page_to_phys(page) + off))) > > More unnecessary parentheses here. OK. Thanks, Ming
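The merge test under discussion can be sketched in a hedged, self-contained form (`fake_page` and `fake_bvec` are illustrative stand-ins, not kernel types; `page_to_phys()` is modeled as a stored physical address): merge when the new range extends the same page, or when the two ranges are physically contiguous and clustering is enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

struct fake_page { uint64_t phys; };    /* stand-in for struct page */

struct fake_bvec {
    struct fake_page *bv_page;
    unsigned bv_offset;
    unsigned bv_len;
};

/* Can (page, off) be merged into the tail bvec bv?  Mirrors the logic
 * in the patch: same-page extension always merges; otherwise merging
 * requires physical contiguity and an enabled cluster flag. */
static bool can_merge(const struct fake_bvec *bv, const struct fake_page *page,
                      unsigned off, bool cluster)
{
    if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len)
        return true;
    if (!cluster)
        return false;
    return bv->bv_page->phys + bv->bv_offset + bv->bv_len ==
           page->phys + off;
}
```

This also shows why the result depends on whether the queue (and hence the cluster flag) is known at merge time, which is the concern raised in the review.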
Re: [PATCH V10 13/19] iomap & xfs: only account for new added page
On Fri, Nov 16, 2018 at 02:49:36PM +0100, Christoph Hellwig wrote: > I'd much rather have __bio_try_merge_page only do merges in > the same page, and have a new __bio_try_merge_segment that does > multi-page merges. This will keep the accounting a lot simpler. This approach looks cleaner; will do it in the next version. Thanks, Ming
Re: [PATCH V10 13/19] iomap & xfs: only account for new added page
On Thu, Nov 15, 2018 at 05:46:58PM -0800, Omar Sandoval wrote: > On Thu, Nov 15, 2018 at 04:53:00PM +0800, Ming Lei wrote: > > After multi-page is enabled, one new page may be merged to a segment > > even though it is a new added page. > > > > This patch deals with this issue by post-check in case of merge, and > > only a freshly new added page need to be dealt with for iomap & xfs. > > > > Cc: Dave Chinner > > Cc: Kent Overstreet > > Cc: Mike Snitzer > > Cc: dm-de...@redhat.com > > Cc: Alexander Viro > > Cc: linux-fsde...@vger.kernel.org > > Cc: Shaohua Li > > Cc: linux-r...@vger.kernel.org > > Cc: linux-er...@lists.ozlabs.org > > Cc: David Sterba > > Cc: linux-bt...@vger.kernel.org > > Cc: Darrick J. Wong > > Cc: linux-...@vger.kernel.org > > Cc: Gao Xiang > > Cc: Christoph Hellwig > > Cc: Theodore Ts'o > > Cc: linux-e...@vger.kernel.org > > Cc: Coly Li > > Cc: linux-bca...@vger.kernel.org > > Cc: Boaz Harrosh > > Cc: Bob Peterson > > Cc: cluster-de...@redhat.com > > Signed-off-by: Ming Lei > > --- > > fs/iomap.c | 22 ++ > > fs/xfs/xfs_aops.c | 10 -- > > include/linux/bio.h | 11 +++ > > 3 files changed, 33 insertions(+), 10 deletions(-) > > > > diff --git a/fs/iomap.c b/fs/iomap.c > > index df0212560b36..a1b97a5c726a 100644 > > --- a/fs/iomap.c > > +++ b/fs/iomap.c > > @@ -288,6 +288,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, > > loff_t length, void *data, > > loff_t orig_pos = pos; > > unsigned poff, plen; > > sector_t sector; > > + bool need_account = false; > > > > if (iomap->type == IOMAP_INLINE) { > > WARN_ON_ONCE(pos); > > @@ -313,18 +314,15 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, > > loff_t length, void *data, > > */ > > sector = iomap_sector(iomap, pos); > > if (ctx->bio && bio_end_sector(ctx->bio) == sector) { > > - if (__bio_try_merge_page(ctx->bio, page, plen, poff)) > > + if (__bio_try_merge_page(ctx->bio, page, plen, poff)) { > > + need_account = iop && bio_is_last_segment(ctx->bio, > > + page, plen, poff); > > It's 
redundant to make this iop && ... since you already check > iop && need_account below. Maybe rename it to added_page? Also, this > indentation is wack. We may avoid to call bio_is_last_segment() in case of !iop, and will fix the indentation. Looks added_page is one better name. > > > goto done; > > + } > > is_contig = true; > > } > > > > - /* > > -* If we start a new segment we need to increase the read count, and we > > -* need to do so before submitting any previous full bio to make sure > > -* that we don't prematurely unlock the page. > > -*/ > > - if (iop) > > - atomic_inc(>read_count); > > + need_account = true; > > > > if (!ctx->bio || !is_contig || bio_full(ctx->bio)) { > > gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL); > > @@ -347,6 +345,14 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, > > loff_t length, void *data, > > __bio_add_page(ctx->bio, page, plen, poff); > > done: > > /* > > +* If we add a new page we need to increase the read count, and we > > +* need to do so before submitting any previous full bio to make sure > > +* that we don't prematurely unlock the page. > > +*/ > > + if (iop && need_account) > > + atomic_inc(>read_count); > > + > > + /* > > * Move the caller beyond our range so that it keeps making progress. > > * For that we have to include any leading non-uptodate ranges, but > > * we can skip trailing ones as they will be handled in the next > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > > index 1f1829e506e8..d8e9cc9f751a 100644 > > --- a/fs/xfs/xfs_aops.c > > +++ b/fs/xfs/xfs_aops.c > > @@ -603,6 +603,7 @@ xfs_add_to_ioend( > > unsignedlen = i_blocksize(inode); > > unsignedpoff = offset & (PAGE_SIZE - 1); > > sector_tsector; > > + boolneed_account; > > > > sector = xfs_fsb_to_db(ip, wpc->imap.br_startblock) + > > ((of
Re: [PATCH V10 12/19] block: allow bio_for_each_segment_all() to iterate over multi-page bvec
On Thu, Nov 15, 2018 at 05:22:45PM -0800, Omar Sandoval wrote: > On Thu, Nov 15, 2018 at 04:52:59PM +0800, Ming Lei wrote: > > This patch introduces one extra iterator variable to > > bio_for_each_segment_all(), > > then we can allow bio_for_each_segment_all() to iterate over multi-page > > bvec. > > > > Given it is just one mechannical & simple change on all > > bio_for_each_segment_all() > > users, this patch does tree-wide change in one single patch, so that we can > > avoid to use a temporary helper for this conversion. > > > > Cc: Dave Chinner > > Cc: Kent Overstreet > > Cc: linux-fsde...@vger.kernel.org > > Cc: Alexander Viro > > Cc: Shaohua Li > > Cc: linux-r...@vger.kernel.org > > Cc: linux-er...@lists.ozlabs.org > > Cc: linux-bt...@vger.kernel.org > > Cc: David Sterba > > Cc: Darrick J. Wong > > Cc: Gao Xiang > > Cc: Christoph Hellwig > > Cc: Theodore Ts'o > > Cc: linux-e...@vger.kernel.org > > Cc: Coly Li > > Cc: linux-bca...@vger.kernel.org > > Cc: Boaz Harrosh > > Cc: Bob Peterson > > Cc: cluster-de...@redhat.com > > Signed-off-by: Ming Lei > > --- > > block/bio.c | 27 ++- > > block/blk-zoned.c | 1 + > > block/bounce.c| 6 -- > > drivers/md/bcache/btree.c | 3 ++- > > drivers/md/dm-crypt.c | 3 ++- > > drivers/md/raid1.c| 3 ++- > > drivers/staging/erofs/data.c | 3 ++- > > drivers/staging/erofs/unzip_vle.c | 3 ++- > > fs/block_dev.c| 6 -- > > fs/btrfs/compression.c| 3 ++- > > fs/btrfs/disk-io.c| 3 ++- > > fs/btrfs/extent_io.c | 12 > > fs/btrfs/inode.c | 6 -- > > fs/btrfs/raid56.c | 3 ++- > > fs/crypto/bio.c | 3 ++- > > fs/direct-io.c| 4 +++- > > fs/exofs/ore.c| 3 ++- > > fs/exofs/ore_raid.c | 3 ++- > > fs/ext4/page-io.c | 3 ++- > > fs/ext4/readpage.c| 3 ++- > > fs/f2fs/data.c| 9 ++--- > > fs/gfs2/lops.c| 6 -- > > fs/gfs2/meta_io.c | 3 ++- > > fs/iomap.c| 6 -- > > fs/mpage.c| 3 ++- > > fs/xfs/xfs_aops.c | 5 +++-- > > include/linux/bio.h | 11 +-- > > include/linux/bvec.h | 31 +++ > > 28 files changed, 129 insertions(+), 46 deletions(-) > > > > [snip] > > > 
diff --git a/include/linux/bio.h b/include/linux/bio.h > > index 3496c816946e..1a2430a8b89d 100644 > > --- a/include/linux/bio.h > > +++ b/include/linux/bio.h > > @@ -131,12 +131,19 @@ static inline bool bio_full(struct bio *bio) > > return bio->bi_vcnt >= bio->bi_max_vecs; > > } > > > > +#define bvec_for_each_segment(bv, bvl, i, iter_all) > > \ > > + for (bv = bvec_init_iter_all(&iter_all);\ > > + (iter_all.done < (bvl)->bv_len) && \ > > + ((bvec_next_segment((bvl), &iter_all)), 1); \ > > The parentheses around (bvec_next_segment((bvl), &iter_all)) are > unnecessary. OK. > > > + iter_all.done += bv->bv_len, i += 1) > > + > > /* > > * drivers should _never_ use the all version - the bio may have been split > > * before it got to the driver and the driver won't own all of it > > */ > > -#define bio_for_each_segment_all(bvl, bio, i) > > \ > > - for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++) > > +#define bio_for_each_segment_all(bvl, bio, i, iter_all)\ > > + for (i = 0, iter_all.idx = 0; iter_all.idx < (bio)->bi_vcnt; > > iter_all.idx++)\ > > + bvec_for_each_segment(bvl, &((bio)->bi_io_vec[iter_all.idx]), > > i, iter_all) > > Would it be possible to move i into iter_all to streamline this a bit? That may cause unnecessary conversion work for us, because the local variable 'i' is defined in the calling function. > > > static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter > > *iter, > > unsigned bytes, bool mp) > > diff --git a/include
Re: [PATCH V10 12/19] block: allow bio_for_each_segment_all() to iterate over multi-page bvec
On Thu, Nov 15, 2018 at 01:42:52PM +0100, David Sterba wrote: > On Thu, Nov 15, 2018 at 04:52:59PM +0800, Ming Lei wrote: > > diff --git a/block/blk-zoned.c b/block/blk-zoned.c > > index 13ba2011a306..789b09ae402a 100644 > > --- a/block/blk-zoned.c > > +++ b/block/blk-zoned.c > > @@ -123,6 +123,7 @@ static int blk_report_zones(struct gendisk *disk, > > sector_t sector, > > unsigned int z = 0, n, nrz = *nr_zones; > > sector_t capacity = get_capacity(disk); > > int ret; > > + struct bvec_iter_all iter_all; > > > > while (z < nrz && sector < capacity) { > > n = nrz - z; > > iter_all is added but not used and I don't see any > bio_for_each_segment_all for conversion in this function. Good catch, will fix it in next version. Thanks, Ming
Re: [PATCH V10 11/19] bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()
On Fri, Nov 16, 2018 at 02:46:45PM +0100, Christoph Hellwig wrote: > > - bio_for_each_segment_all(bv, bio, i) { > > + for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; bv++) { > > This really needs a comment. Otherwise it looks fine to me. OK, will do it in next version. Thanks, Ming
Re: [PATCH V10 11/19] bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()
On Thu, Nov 15, 2018 at 04:44:02PM -0800, Omar Sandoval wrote: > On Thu, Nov 15, 2018 at 04:52:58PM +0800, Ming Lei wrote: > > bch_bio_alloc_pages() is always called on one new bio, so it is safe > > to access the bvec table directly. Given it is the only kind of this > > case, open code the bvec table access since bio_for_each_segment_all() > > will be changed to support for iterating over multipage bvec. > > > > Cc: Dave Chinner > > Cc: Kent Overstreet > > Acked-by: Coly Li > > Cc: Mike Snitzer > > Cc: dm-de...@redhat.com > > Cc: Alexander Viro > > Cc: linux-fsde...@vger.kernel.org > > Cc: Shaohua Li > > Cc: linux-r...@vger.kernel.org > > Cc: linux-er...@lists.ozlabs.org > > Cc: David Sterba > > Cc: linux-bt...@vger.kernel.org > > Cc: Darrick J. Wong > > Cc: linux-...@vger.kernel.org > > Cc: Gao Xiang > > Cc: Christoph Hellwig > > Cc: Theodore Ts'o > > Cc: linux-e...@vger.kernel.org > > Cc: Coly Li > > Cc: linux-bca...@vger.kernel.org > > Cc: Boaz Harrosh > > Cc: Bob Peterson > > Cc: cluster-de...@redhat.com > > Signed-off-by: Ming Lei > > --- > > drivers/md/bcache/util.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c > > index 20eddeac1531..8517aebcda2d 100644 > > --- a/drivers/md/bcache/util.c > > +++ b/drivers/md/bcache/util.c > > @@ -270,7 +270,7 @@ int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp_mask) > > int i; > > struct bio_vec *bv; > > > > - bio_for_each_segment_all(bv, bio, i) { > > + for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; bv++) { > > This is missing an i++. Good catch, will fix it in next version. thanks, Ming
Re: [PATCH V10 10/19] block: loop: pass multi-page bvec to iov_iter
On Thu, Nov 15, 2018 at 04:40:22PM -0800, Omar Sandoval wrote: > On Thu, Nov 15, 2018 at 04:52:57PM +0800, Ming Lei wrote: > > iov_iter is implemented with bvec itererator, so it is safe to pass > > multipage bvec to it, and this way is much more efficient than > > passing one page in each bvec. > > > > Cc: Dave Chinner > > Cc: Kent Overstreet > > Cc: Mike Snitzer > > Cc: dm-de...@redhat.com > > Cc: Alexander Viro > > Cc: linux-fsde...@vger.kernel.org > > Cc: Shaohua Li > > Cc: linux-r...@vger.kernel.org > > Cc: linux-er...@lists.ozlabs.org > > Cc: David Sterba > > Cc: linux-bt...@vger.kernel.org > > Cc: Darrick J. Wong > > Cc: linux-...@vger.kernel.org > > Cc: Gao Xiang > > Cc: Christoph Hellwig > > Cc: Theodore Ts'o > > Cc: linux-e...@vger.kernel.org > > Cc: Coly Li > > Cc: linux-bca...@vger.kernel.org > > Cc: Boaz Harrosh > > Cc: Bob Peterson > > Cc: cluster-de...@redhat.com > > Reviewed-by: Omar Sandoval > > Comments below. > > > Signed-off-by: Ming Lei > > --- > > drivers/block/loop.c | 23 --- > > 1 file changed, 12 insertions(+), 11 deletions(-) > > > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c > > index bf6bc35aaf88..a3fd418ec637 100644 > > --- a/drivers/block/loop.c > > +++ b/drivers/block/loop.c > > @@ -515,16 +515,16 @@ static int lo_rw_aio(struct loop_device *lo, struct > > loop_cmd *cmd, > > struct bio *bio = rq->bio; > > struct file *file = lo->lo_backing_file; > > unsigned int offset; > > - int segments = 0; > > + int nr_bvec = 0; > > int ret; > > > > if (rq->bio != rq->biotail) { > > - struct req_iterator iter; > > + struct bvec_iter iter; > > struct bio_vec tmp; > > > > __rq_for_each_bio(bio, rq) > > - segments += bio_segments(bio); > > - bvec = kmalloc_array(segments, sizeof(struct bio_vec), > > + nr_bvec += bio_bvecs(bio); > > + bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec), > > GFP_NOIO); > > if (!bvec) > > return -EIO; > > @@ -533,13 +533,14 @@ static int lo_rw_aio(struct loop_device *lo, struct > > loop_cmd *cmd, > > /* > 
> * The bios of the request may be started from the middle of > > * the 'bvec' because of bio splitting, so we can't directly > > -* copy bio->bi_iov_vec to new bvec. The rq_for_each_segment > > +* copy bio->bi_iov_vec to new bvec. The bio_for_each_bvec > > * API will take care of all details for us. > > */ > > - rq_for_each_segment(tmp, rq, iter) { > > - *bvec = tmp; > > - bvec++; > > - } > > + __rq_for_each_bio(bio, rq) > > + bio_for_each_bvec(tmp, bio, iter) { > > + *bvec = tmp; > > + bvec++; > > + } > > Even if they're not strictly necessary, could you please include the > curly braces for __rq_for_each_bio() here? Sure, will do it. > > > bvec = cmd->bvec; > > offset = 0; > > } else { > > @@ -550,11 +551,11 @@ static int lo_rw_aio(struct loop_device *lo, struct > > loop_cmd *cmd, > > */ > > offset = bio->bi_iter.bi_bvec_done; > > bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter); > > - segments = bio_segments(bio); > > + nr_bvec = bio_bvecs(bio); > > This scared me for a second, but it's fine to do here because we haven't > actually enabled multipage bvecs yet, right? Well, it is fine: all helpers supporting multi-page bvecs actually work correctly even when the feature isn't enabled, because a single-page bvec is just a special case that the multi-page bvec helpers have to handle anyway. Thanks, Ming
Re: [PATCH V10 09/19] block: introduce bio_bvecs()
On Fri, Nov 16, 2018 at 02:45:41PM +0100, Christoph Hellwig wrote: > On Thu, Nov 15, 2018 at 04:52:56PM +0800, Ming Lei wrote: > > There are still cases in which we need to use bio_bvecs() for get the > > number of multi-page segment, so introduce it. > > The only user in your final tree seems to be the loop driver, and > even that one only uses the helper for read/write bios. > > I think something like this would be much simpler in the end: > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c > index d509902a8046..712511815ac6 100644 > --- a/drivers/block/loop.c > +++ b/drivers/block/loop.c > @@ -514,16 +514,18 @@ static int lo_rw_aio(struct loop_device *lo, struct > loop_cmd *cmd, > struct request *rq = blk_mq_rq_from_pdu(cmd); > struct bio *bio = rq->bio; > struct file *file = lo->lo_backing_file; > + struct bvec_iter bvec_iter; > + struct bio_vec tmp; > unsigned int offset; > int nr_bvec = 0; > int ret; > > + __rq_for_each_bio(bio, rq) > + bio_for_each_bvec(tmp, bio, bvec_iter) > + nr_bvec++; > + > if (rq->bio != rq->biotail) { > - struct bvec_iter iter; > - struct bio_vec tmp; > > - __rq_for_each_bio(bio, rq) > - nr_bvec += bio_bvecs(bio); > bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec), >GFP_NOIO); > if (!bvec) > @@ -537,7 +539,7 @@ static int lo_rw_aio(struct loop_device *lo, struct > loop_cmd *cmd, >* API will take care of all details for us. 
>*/ > __rq_for_each_bio(bio, rq) > - bio_for_each_bvec(tmp, bio, iter) { > + bio_for_each_bvec(tmp, bio, bvec_iter) { > *bvec = tmp; > bvec++; > } > @@ -551,7 +553,6 @@ static int lo_rw_aio(struct loop_device *lo, struct > loop_cmd *cmd, >*/ > offset = bio->bi_iter.bi_bvec_done; > bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter); > - nr_bvec = bio_bvecs(bio); > } > atomic_set(>ref, 2); > > diff --git a/include/linux/bio.h b/include/linux/bio.h > index dcad0b69f57a..379440d1ced0 100644 > --- a/include/linux/bio.h > +++ b/include/linux/bio.h > @@ -200,30 +200,6 @@ static inline unsigned bio_segments(struct bio *bio) > } > } > > -static inline unsigned bio_bvecs(struct bio *bio) > -{ > - unsigned bvecs = 0; > - struct bio_vec bv; > - struct bvec_iter iter; > - > - /* > - * We special case discard/write same/write zeroes, because they > - * interpret bi_size differently: > - */ > - switch (bio_op(bio)) { > - case REQ_OP_DISCARD: > - case REQ_OP_SECURE_ERASE: > - case REQ_OP_WRITE_ZEROES: > - return 0; > - case REQ_OP_WRITE_SAME: > - return 1; > - default: > - bio_for_each_bvec(bv, bio, iter) > - bvecs++; > - return bvecs; > - } > -} > - > /* > * get a reference to a bio, so it won't disappear. the intended use is > * something like: OK, will do it in next version. Thanks, Ming
Re: [PATCH V10 08/19] btrfs: move bio_pages_all() to btrfs
On Fri, Nov 16, 2018 at 02:38:45PM +0100, Christoph Hellwig wrote: > On Thu, Nov 15, 2018 at 04:52:55PM +0800, Ming Lei wrote: > > BTRFS is the only user of this helper, so move this helper into > > BTRFS, and implement it via bio_for_each_segment_all(), since > > bio->bi_vcnt may not equal to number of pages after multipage bvec > > is enabled. > > btrfs only uses the value to check if it is larger than 1. No amount > of multipage bio merging should ever make bi_vcnt go from 0 to 1 or > vice versa. Could you explain a bit why? Suppose 2 physically contiguous pages are added to this bio: .bi_vcnt can be 1 in the case of multi-page bvec, but it is 2 in the case of single-page bvec. Thanks, Ming
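The bi_vcnt difference can be illustrated with a small sketch (a simplified model, not kernel code): count the bvec table entries needed for full pages at given physical addresses, with and without multi-page merging.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Count bvec table entries for full pages at the given physical
 * addresses: single-page mode uses one entry per page, multi-page mode
 * merges physically contiguous neighbours into the previous entry. */
static unsigned count_bvecs(const uint64_t *phys, size_t nr_pages, int multipage)
{
    unsigned vcnt = 0;

    for (size_t i = 0; i < nr_pages; i++) {
        if (multipage && vcnt && phys[i] == phys[i - 1] + PAGE_SIZE)
            continue;   /* merged into the previous entry */
        vcnt++;
    }
    return vcnt;
}
```

So two contiguous pages give bi_vcnt == 1 with multi-page bvecs but 2 with single-page bvecs, which is exactly the case that breaks a "> 1" check on bi_vcnt.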
Re: [PATCH V10 08/19] btrfs: move bio_pages_all() to btrfs
On Thu, Nov 15, 2018 at 04:23:56PM -0800, Omar Sandoval wrote: > On Thu, Nov 15, 2018 at 04:52:55PM +0800, Ming Lei wrote: > > BTRFS is the only user of this helper, so move this helper into > > BTRFS, and implement it via bio_for_each_segment_all(), since > > bio->bi_vcnt may not equal to number of pages after multipage bvec > > is enabled. > > Shouldn't you also get rid of bio_pages_all() in this patch? Good catch! thanks, Ming
Re: [PATCH V10 07/19] btrfs: use bvec_last_segment to get bio's last page
On Fri, Nov 16, 2018 at 02:37:10PM +0100, Christoph Hellwig wrote: > On Thu, Nov 15, 2018 at 04:52:54PM +0800, Ming Lei wrote: > > index 2955a4ea2fa8..161e14b8b180 100644 > > --- a/fs/btrfs/compression.c > > +++ b/fs/btrfs/compression.c > > @@ -400,8 +400,11 @@ blk_status_t btrfs_submit_compressed_write(struct > > inode *inode, u64 start, > > static u64 bio_end_offset(struct bio *bio) > > { > > struct bio_vec *last = bio_last_bvec_all(bio); > > + struct bio_vec bv; > > > > - return page_offset(last->bv_page) + last->bv_len + last->bv_offset; > > + bvec_last_segment(last, ); > > + > > + return page_offset(bv.bv_page) + bv.bv_len + bv.bv_offset; > > I don't think we need this. If last is a multi-page bvec bv_offset > will already contain the correct offset from the first page. Yeah, it is true for this specific case, looks we can drop this patch. thanks, Ming
Re: [PATCH V10 05/19] block: introduce bvec_last_segment()
On Thu, Nov 15, 2018 at 03:23:56PM -0800, Omar Sandoval wrote: > On Thu, Nov 15, 2018 at 04:52:52PM +0800, Ming Lei wrote: > > BTRFS and guard_bio_eod() need to get the last singlepage segment > > from one multipage bvec, so introduce this helper to make them happy. > > > > Cc: Dave Chinner > > Cc: Kent Overstreet > > Cc: Mike Snitzer > > Cc: dm-de...@redhat.com > > Cc: Alexander Viro > > Cc: linux-fsde...@vger.kernel.org > > Cc: Shaohua Li > > Cc: linux-r...@vger.kernel.org > > Cc: linux-er...@lists.ozlabs.org > > Cc: David Sterba > > Cc: linux-bt...@vger.kernel.org > > Cc: Darrick J. Wong > > Cc: linux-...@vger.kernel.org > > Cc: Gao Xiang > > Cc: Christoph Hellwig > > Cc: Theodore Ts'o > > Cc: linux-e...@vger.kernel.org > > Cc: Coly Li > > Cc: linux-bca...@vger.kernel.org > > Cc: Boaz Harrosh > > Cc: Bob Peterson > > Cc: cluster-de...@redhat.com > > Reviewed-by: Omar Sandoval > > Minor comments below. > > > Signed-off-by: Ming Lei > > --- > > include/linux/bvec.h | 25 + > > 1 file changed, 25 insertions(+) > > > > diff --git a/include/linux/bvec.h b/include/linux/bvec.h > > index 3d61352cd8cf..01616a0b6220 100644 > > --- a/include/linux/bvec.h > > +++ b/include/linux/bvec.h > > @@ -216,4 +216,29 @@ static inline bool mp_bvec_iter_advance(const struct > > bio_vec *bv, > > .bi_bvec_done = 0,\ > > } > > > > +/* > > + * Get the last singlepage segment from the multipage bvec and store it > > + * in @seg > > + */ > > +static inline void bvec_last_segment(const struct bio_vec *bvec, > > + struct bio_vec *seg) > > Indentation is all messed up here. Will fix it. > > > +{ > > + unsigned total = bvec->bv_offset + bvec->bv_len; > > + unsigned last_page = total / PAGE_SIZE; > > + > > + if (last_page * PAGE_SIZE == total) > > + last_page--; > > I think this could just be > > unsigned int last_page = (total - 1) / PAGE_SIZE; This way is really elegant. Thanks, Ming
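The two forms of the last-page computation can be checked against each other in a standalone sketch (assuming bv_len >= 1, which holds for any valid bvec):

```c
#include <assert.h>

#define PAGE_SIZE 4096u

/* Original form from the patch under review. */
static unsigned last_page_orig(unsigned bv_offset, unsigned bv_len)
{
    unsigned total = bv_offset + bv_len;
    unsigned last_page = total / PAGE_SIZE;

    if (last_page * PAGE_SIZE == total)
        last_page--;
    return last_page;
}

/* Omar's suggested simplification: the index of the page containing
 * the last byte of the bvec. */
static unsigned last_page_simple(unsigned bv_offset, unsigned bv_len)
{
    return (bv_offset + bv_len - 1) / PAGE_SIZE;
}
```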
Re: [PATCH V10 04/19] block: use bio_for_each_bvec() to map sg
On Fri, Nov 16, 2018 at 02:33:14PM +0100, Christoph Hellwig wrote: > > + if (!*sg) > > + return sglist; > > + else { > > No need for an else after an early return. OK, good catch! Thanks, Ming
Re: [PATCH V10 03/19] block: use bio_for_each_bvec() to compute multi-page bvec count
On Thu, Nov 15, 2018 at 12:20:28PM -0800, Omar Sandoval wrote: > On Thu, Nov 15, 2018 at 04:52:50PM +0800, Ming Lei wrote: > > First it is more efficient to use bio_for_each_bvec() in both > > blk_bio_segment_split() and __blk_recalc_rq_segments() to compute how > > many multi-page bvecs there are in the bio. > > > > Secondly once bio_for_each_bvec() is used, the bvec may need to be > > splitted because its length can be very longer than max segment size, > > so we have to split the big bvec into several segments. > > > > Thirdly when splitting multi-page bvec into segments, the max segment > > limit may be reached, so the bio split need to be considered under > > this situation too. > > > > Cc: Dave Chinner > > Cc: Kent Overstreet > > Cc: Mike Snitzer > > Cc: dm-de...@redhat.com > > Cc: Alexander Viro > > Cc: linux-fsde...@vger.kernel.org > > Cc: Shaohua Li > > Cc: linux-r...@vger.kernel.org > > Cc: linux-er...@lists.ozlabs.org > > Cc: David Sterba > > Cc: linux-bt...@vger.kernel.org > > Cc: Darrick J. Wong > > Cc: linux-...@vger.kernel.org > > Cc: Gao Xiang > > Cc: Christoph Hellwig > > Cc: Theodore Ts'o > > Cc: linux-e...@vger.kernel.org > > Cc: Coly Li > > Cc: linux-bca...@vger.kernel.org > > Cc: Boaz Harrosh > > Cc: Bob Peterson > > Cc: cluster-de...@redhat.com > > Signed-off-by: Ming Lei > > --- > > block/blk-merge.c | 90 > > ++- > > 1 file changed, 76 insertions(+), 14 deletions(-) > > > > diff --git a/block/blk-merge.c b/block/blk-merge.c > > index 91b2af332a84..6f7deb94a23f 100644 > > --- a/block/blk-merge.c > > +++ b/block/blk-merge.c > > @@ -160,6 +160,62 @@ static inline unsigned get_max_io_size(struct > > request_queue *q, > > return sectors; > > } > > > > +/* > > + * Split the bvec @bv into segments, and update all kinds of > > + * variables. 
> > + */ > > +static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv, > > + unsigned *nsegs, unsigned *last_seg_size, > > + unsigned *front_seg_size, unsigned *sectors) > > +{ > > + bool need_split = false; > > + unsigned len = bv->bv_len; > > + unsigned total_len = 0; > > + unsigned new_nsegs = 0, seg_size = 0; > > "unsigned int" here and everywhere else. > > > + if ((*nsegs >= queue_max_segments(q)) || !len) > > + return need_split; > > + > > + /* > > +* Multipage bvec may be too big to hold in one segment, > > +* so the current bvec has to be splitted as multiple > > +* segments. > > +*/ > > + while (new_nsegs + *nsegs < queue_max_segments(q)) { > > + seg_size = min(queue_max_segment_size(q), len); > > + > > + new_nsegs++; > > + total_len += seg_size; > > + len -= seg_size; > > + > > + if ((queue_virt_boundary(q) && ((bv->bv_offset + > > + total_len) & queue_virt_boundary(q))) || !len) > > + break; > > Checking queue_virt_boundary(q) != 0 is superfluous, and the len check > could just control the loop, i.e., > > while (len && new_nsegs + *nsegs < queue_max_segments(q)) { > seg_size = min(queue_max_segment_size(q), len); > > new_nsegs++; > total_len += seg_size; > len -= seg_size; > > if ((bv->bv_offset + total_len) & queue_virt_boundary(q)) > break; > } > > And if you rewrite it this way, I _think_ you can get rid of this > special case: > > if ((*nsegs >= queue_max_segments(q)) || !len) > return need_split; > > above. Good point, will do in next version. > > > + } > > + > > + /* split in the middle of the bvec */ > > + if (len) > > + need_split = true; > > need_split is unnecessary, just return len != 0. OK. 
> > > + > > + /* update front segment size */ > > + if (!*nsegs) { > > + unsigned first_seg_size = seg_size; > > + > > + if (new_nsegs > 1) > > + first_seg_size = queue_max_segment_size(q); > > + if (*front_seg_size < first_seg_size) > > + *front_seg_size = first_seg_size; > > + } > > + > > + /* update other varibles */ > > + *last_seg_size = seg_size; > >
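A hedged, self-contained sketch of the restructured split loop Omar suggests, with the queue limits passed as plain parameters instead of being read from a request_queue (names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/* Split a bvec of 'len' bytes at 'bv_offset' into segments, honouring
 * a max segment size, a max segment count, and a virt boundary mask
 * (0 = no boundary).  Returns true when the bvec itself must be split
 * (bytes left over), and reports how many segments were produced. */
static bool bvec_split_segs_sketch(unsigned max_seg_size, unsigned virt_boundary,
                                   unsigned max_segs, unsigned cur_segs,
                                   unsigned bv_offset, unsigned len,
                                   unsigned *new_segs)
{
    unsigned total_len = 0;

    *new_segs = 0;
    /* len controls the loop, so the initial !len special case goes away */
    while (len && *new_segs + cur_segs < max_segs) {
        unsigned seg_size = len < max_seg_size ? len : max_seg_size;

        (*new_segs)++;
        total_len += seg_size;
        len -= seg_size;

        /* ANDing with a zero mask is harmless, so no != 0 pre-check */
        if ((bv_offset + total_len) & virt_boundary)
            break;
    }
    return len != 0;    /* split in the middle of the bvec */
}
```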
Re: [PATCH V10 01/19] block: introduce multi-page page bvec helpers
On Sun, Nov 18, 2018 at 08:10:14PM -0700, Jens Axboe wrote: > On 11/18/18 7:23 PM, Ming Lei wrote: > > On Fri, Nov 16, 2018 at 02:13:05PM +0100, Christoph Hellwig wrote: > >>> -#define bvec_iter_page(bvec, iter) \ > >>> +#define mp_bvec_iter_page(bvec, iter)\ > >>> (__bvec_iter_bvec((bvec), (iter))->bv_page) > >>> > >>> -#define bvec_iter_len(bvec, iter)\ > >>> +#define mp_bvec_iter_len(bvec, iter) \ > >> > >> I'd much prefer if we would stick to the segment naming that > >> we also use in the higher level helper. > >> > >> So segment_iter_page, segment_iter_len, etc. > > > > We discussed the naming problem before, one big problem is that the > > 'segment' > > in bio_for_each_segment*() means one single page segment actually. > > > > If we use segment_iter_page() here for multi-page segment, it may > > confuse people. > > > > Of course, I prefer to the naming of segment/page, > > > > And Jens didn't agree to rename bio_for_each_segment*() before. > > I didn't like frivolous renaming (and I still don't), but mp_ > is horrible imho. Don't name these after the fact that they > are done in conjunction with supporting multipage bvecs. That > very fact will be irrelevant very soon OK, so what is your suggestion for the naming issue? Are you fine to use segment_iter_page() here? Then the term of 'segment' may be interpreted as multi-page segment here, but as single-page in bio_for_each_segment*(). thanks Ming
Re: [PATCH V10 02/19] block: introduce bio_for_each_bvec()
On Fri, Nov 16, 2018 at 02:30:28PM +0100, Christoph Hellwig wrote: > > +static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter > > *iter, > > + unsigned bytes, bool mp) > > I think these magic 'bool np' arguments and wrappers over wrapper > don't help anyone to actually understand the code. I'd vote for > removing as many wrappers as we really don't need, and passing the > actual segment limit instead of the magic bool flag. Something like > this untested patch: I think this way is fine, just a little comment. > > diff --git a/include/linux/bio.h b/include/linux/bio.h > index 277921ad42e7..dcad0b69f57a 100644 > --- a/include/linux/bio.h > +++ b/include/linux/bio.h > @@ -138,30 +138,21 @@ static inline bool bio_full(struct bio *bio) > bvec_for_each_segment(bvl, &((bio)->bi_io_vec[iter_all.idx]), > i, iter_all) > > static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter > *iter, > - unsigned bytes, bool mp) > + unsigned bytes, unsigned max_segment) The new parameter should have been named as 'max_segment_len' or 'max_seg_len'. > { > iter->bi_sector += bytes >> 9; > > if (bio_no_advance_iter(bio)) > iter->bi_size -= bytes; > else > - if (!mp) > - bvec_iter_advance(bio->bi_io_vec, iter, bytes); > - else > - mp_bvec_iter_advance(bio->bi_io_vec, iter, bytes); > + __bvec_iter_advance(bio->bi_io_vec, iter, bytes, max_segment); > /* TODO: It is reasonable to complete bio with error here. 
*/ > } > > static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter, > unsigned bytes) > { > - __bio_advance_iter(bio, iter, bytes, false); > -} > - > -static inline void bio_advance_mp_iter(struct bio *bio, struct bvec_iter > *iter, > -unsigned bytes) > -{ > - __bio_advance_iter(bio, iter, bytes, true); > + __bio_advance_iter(bio, iter, bytes, PAGE_SIZE); > } > > #define __bio_for_each_segment(bvl, bio, iter, start) > \ > @@ -177,7 +168,7 @@ static inline void bio_advance_mp_iter(struct bio *bio, > struct bvec_iter *iter, > for (iter = (start);\ >(iter).bi_size && \ > ((bvl = bio_iter_mp_iovec((bio), (iter))), 1); \ > - bio_advance_mp_iter((bio), &(iter), (bvl).bv_len)) > + __bio_advance_iter((bio), &(iter), (bvl).bv_len, 0)) Even we might pass '-1' for multi-page segment. > > /* returns one real segment(multipage bvec) each time */ > #define bio_for_each_bvec(bvl, bio, iter)\ > diff --git a/include/linux/bvec.h b/include/linux/bvec.h > index 02f26d2b59ad..5e2ed46c1c88 100644 > --- a/include/linux/bvec.h > +++ b/include/linux/bvec.h > @@ -138,8 +138,7 @@ struct bvec_iter_all { > }) > > static inline bool __bvec_iter_advance(const struct bio_vec *bv, > -struct bvec_iter *iter, > -unsigned bytes, bool mp) > + struct bvec_iter *iter, unsigned bytes, unsigned max_segment) > { > if (WARN_ONCE(bytes > iter->bi_size, >"Attempted to advance past end of bvec iter\n")) { > @@ -148,18 +147,18 @@ static inline bool __bvec_iter_advance(const struct > bio_vec *bv, > } > > while (bytes) { > - unsigned len; > + unsigned segment_len = mp_bvec_iter_len(bv, *iter); > > - if (mp) > - len = mp_bvec_iter_len(bv, *iter); > - else > - len = bvec_iter_len(bv, *iter); > + if (max_segment) { > + max_segment -= bvec_iter_offset(bv, *iter); > + segment_len = min(segment_len, max_segment); Looks 'max_segment' needs to be constant, shouldn't be updated. 
If '-1' is passed for the multipage case, the above change may become: segment_len = min_t(unsigned, segment_len, max_seg_len - bvec_iter_offset(bv, *iter)); This way is cleaner, but with the extra cost of the above line for the multipage case. Thanks, Ming
Re: [PATCH V2 for-4.21 2/2] blk-mq: alloc q->queue_ctx as normal array
On Sat, Nov 17, 2018 at 11:03:42AM +0100, Greg Kroah-Hartman wrote: > On Sat, Nov 17, 2018 at 10:34:18AM +0800, Ming Lei wrote: > > On Fri, Nov 16, 2018 at 06:06:23AM -0800, Greg Kroah-Hartman wrote: > > > On Fri, Nov 16, 2018 at 07:23:11PM +0800, Ming Lei wrote: > > > > Now q->queue_ctx is just one read-mostly table for query the > > > > 'blk_mq_ctx' instance from one cpu index, it isn't necessary > > > > to allocate it as percpu variable. One simple array may be > > > > more efficient. > > > > > > "may be", have you run benchmarks to be sure? If so, can you add the > > > results of them to this changelog? If there is no measurable > > > difference, then why make this change at all? > > > > __blk_mq_get_ctx() is used in fast path, what do you think about which > > one is more efficient? > > > > - *per_cpu_ptr(q->queue_ctx, cpu); > > > > - q->queue_ctx[cpu] > > You need to actually test to see which one is faster, you might be > surprised :) > > In other words, do not just guess. No performance difference is observed with this patchset when I run the following fio test on null_blk (modprobe null_blk) in my VM: fio --direct=1 --size=128G --bsrange=4k-4k --runtime=40 --numjobs=32 \ --ioengine=libaio --iodepth=64 --group_reporting=1 --filename=/dev/nullb0 \ --name=null_blk-test-randread --rw=randread Running tests is important, but IMO it is more important to understand that the underlying idea is correct, or that the approach can be proven correct. Given that the number of test cases grows exponentially with the related factors and settings to be covered, we obviously can't run all the test cases. > > > At least the latter isn't worse than the former. > > How do you know? As I explained, q->queue_ctx is basically a read-only look-up table after the queue is initialized, and there isn't any benefit to using the percpu allocator here. > > > Especially q->queue_ctx is just a read-only look-up table, it doesn't > > make sense to make it percpu any more.
> > > > Not mention q->queue_ctx[cpu] is more clean/readable. > > Again, please test to verify this. IMO, testing alone can't verify that an approach or a patch is correct, given that we can't cover all cases by test; it should serve as a supplement to proof and patch analysis, which is usually done in the patch commit log or during review. Thanks, Ming
Re: [PATCH V2 for-4.21 2/2] blk-mq: alloc q->queue_ctx as normal array
On Fri, Nov 16, 2018 at 06:06:23AM -0800, Greg Kroah-Hartman wrote: > On Fri, Nov 16, 2018 at 07:23:11PM +0800, Ming Lei wrote: > > Now q->queue_ctx is just one read-mostly table for query the > > 'blk_mq_ctx' instance from one cpu index, it isn't necessary > > to allocate it as percpu variable. One simple array may be > > more efficient. > > "may be", have you run benchmarks to be sure? If so, can you add the > results of them to this changelog? If there is no measurable > difference, then why make this change at all? __blk_mq_get_ctx() is used in the fast path; which one do you think is more efficient? - *per_cpu_ptr(q->queue_ctx, cpu); - q->queue_ctx[cpu] At least the latter isn't worse than the former. Especially since q->queue_ctx is just a read-only look-up table, it doesn't make sense to make it percpu any more. Not to mention that q->queue_ctx[cpu] is cleaner and more readable. Thanks, Ming
Re: [PATCH V2 for-4.21 1/2] blk-mq: not embed .mq_kobj and ctx->kobj into queue instance
On Fri, Nov 16, 2018 at 06:05:21AM -0800, Greg Kroah-Hartman wrote: > On Fri, Nov 16, 2018 at 07:23:10PM +0800, Ming Lei wrote: > > @@ -456,7 +456,7 @@ struct request_queue { > > /* > > * mq queue kobject > > */ > > - struct kobject mq_kobj; > > + struct kobject *mq_kobj; > > What is this kobject even used for? It wasn't obvious at all from this > patch, why is it needed if you are not using it to reference count the > larger structure here? All attributes and kobjects under /sys/block/$DEV/mq are actually covered by this kobject, and they all expose blk-mq specific information; but now only blk-mq is left, since the legacy IO path has been removed. That is why I suggested last time that we might remove this kobject and move everything under /sys/block/$DEV/queue; however, you thought that might break some userspace. If we want to backport these fixes to stable, this patch may be a bit easier to apply. Thanks, Ming
Re: [PATCH] block: fix 32 bit overflow in __blkdev_issue_discard()
On Fri, Nov 16, 2018 at 03:04:57PM +1100, Dave Chinner wrote: > On Thu, Nov 15, 2018 at 02:24:19PM -0800, Darrick J. Wong wrote: > > On Fri, Nov 16, 2018 at 09:13:37AM +1100, Dave Chinner wrote: > > > On Thu, Nov 15, 2018 at 11:10:36AM +0800, Ming Lei wrote: > > > > On Thu, Nov 15, 2018 at 12:22:01PM +1100, Dave Chinner wrote: > > > > > On Thu, Nov 15, 2018 at 09:06:52AM +0800, Ming Lei wrote: > > > > > > On Wed, Nov 14, 2018 at 08:18:24AM -0700, Jens Axboe wrote: > > > > > > > On 11/13/18 2:43 PM, Dave Chinner wrote: > > > > > > > > From: Dave Chinner > > > > > > > > > > > > > > > > A discard cleanup merged into 4.20-rc2 causes fstests xfs/259 to > > > > > > > > fall into an endless loop in the discard code. The test is > > > > > > > > creating > > > > > > > > a device that is exactly 2^32 sectors in size to test mkfs > > > > > > > > boundary > > > > > > > > conditions around the 32 bit sector overflow region. > > > > > > > > > > > > > > > > mkfs issues a discard for the entire device size by default, and > > > > > > > > hence this throws a sector count of 2^32 into > > > > > > > > blkdev_issue_discard(). It takes the number of sectors to > > > > > > > > discard as > > > > > > > > a sector_t - a 64 bit value. > > > > > > > > > > > > > > > > The commit ba5d73851e71 ("block: cleanup > > > > > > > > __blkdev_issue_discard") > > > > > > > > takes this sector count and casts it to a 32 bit value before > > > > > > > > comapring it against the maximum allowed discard size the device > > > > > > > > has. This truncates away the upper 32 bits, and so if the lower > > > > > > > > 32 > > > > > > > > bits of the sector count is zero, it starts issuing discards of > > > > > > > > length 0. This causes the code to fall into an endless loop, > > > > > > > > issuing > > > > > > > > a zero length discards over and over again on the same sector. > > > > > > > > > > > > > > Applied, thanks. Ming, can you please add a blktests test for > > > > > > > this case? 
This is the 2nd time it's been broken. > > > > > > > > > > > > OK, I will add zram discard test in blktests, which should cover the > > > > > > 1st report. For the xfs/259, I need to investigate if it is easy to > > > > > > do in blktests. > > > > > > > > > > Just write a test that creates block devices of 2^32 + (-1,0,1) > > > > > sectors and runs a discard across the entire device. That's all that > > > > > xfs/259 it doing - exercising mkfs on 2TB, 4TB and 16TB boundaries. > > > > > i.e. the boundaries where sectors and page cache indexes (on 4k page > > > > > size systems) overflow 32 bit int and unsigned int sizes. mkfs > > > > > issues a discard for the entire device, so it's testing that as > > > > > well... > > > > > > > > Indeed, I can reproduce this issue via the following commands: > > > > > > > > modprobe scsi_debug virtual_gb=2049 sector_size=512 lbpws10=1 > > > > dev_size_mb=512 > > > > blkdiscard /dev/sde > > > > > > > > > > > > > > You need to write tests that exercise write_same, write_zeros and > > > > > discard operations around these boundaries, because they all take > > > > > a 64 bit sector count and stuff them into 32 bit size fields in > > > > > the bio tha tis being submitted. > > > > > > > > write_same/write_zeros are usually used by driver directly, so we > > > > may need make the test case on some specific device. > > > > > > My local linux iscsi server and client advertise support for them. > > > It definitely does not ships zeros across the wire(*) when I use > > > things like FALLOC_FL_ZERO_RANGE, but fstests does not have block > > > device fallocate() tests for zeroing or punching... > > > > fstests does (generic/{349,350,351}) but those basic functionality tests > > don't include creating a 2^32 block device and seeing if overflows > > happen... :/ > > They don't run on my test machines because they require a modular > kernel and I run a monolithic kernel specified externally by the > qemu command line on all my test VMs. 
> > generic/349 [not run] scsi_debug module not found > generic/350 [not run] scsi_debug module not found > generic/351 [not run] scsi_debug module not found Given that both null_blk and scsi_debug have tons of module parameters, it is really worth building them as modules, so you can set them up in any way you want. Most of my tests are done in a VM too: I start the VM with an official Fedora kernel, install modules built on the host via 'scp & tar zxf' inside the VM, then boot the VM with the newly built kernel and run any tests. The whole script is quite simple, and it doesn't need host root privileges. Thanks, Ming
[PATCH V2 0/2] blk-mq: fix kobject lifetime issue
Hi Jens, The 1st patch fixes the kobject lifetime issue which is triggered when DEBUG_KOBJECT_RELEASE is enabled. The 2nd patch can be thought of as a follow-up cleanup. V2: - allocate 'blk_mq_ctx' inside blk_mq_init_allocated_queue() - allocate q->mq_kobj directly Ming Lei (2): blk-mq: not embed .mq_kobj and ctx->kobj into queue instance blk-mq: alloc q->queue_ctx as normal array block/blk-mq-sysfs.c | 51 - block/blk-mq.c | 61 +- block/blk-mq.h | 4 ++-- include/linux/blkdev.h | 4 ++-- 4 files changed, 84 insertions(+), 36 deletions(-) Cc: "jianchao.wang" Cc: Guenter Roeck Cc: Greg Kroah-Hartman -- 2.9.5
[PATCH V2 for-4.21 2/2] blk-mq: alloc q->queue_ctx as normal array
Now q->queue_ctx is just one read-mostly table for query the 'blk_mq_ctx' instance from one cpu index, it isn't necessary to allocate it as percpu variable. One simple array may be more efficient. Cc: "jianchao.wang" Cc: Guenter Roeck Cc: Greg Kroah-Hartman Signed-off-by: Ming Lei --- block/blk-mq-sysfs.c | 14 -- block/blk-mq.c | 18 ++ block/blk-mq.h | 2 +- include/linux/blkdev.h | 2 +- 4 files changed, 16 insertions(+), 20 deletions(-) diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c index cc2fef909afc..ae2cafd6f8a8 100644 --- a/block/blk-mq-sysfs.c +++ b/block/blk-mq-sysfs.c @@ -290,19 +290,15 @@ void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx) void blk_mq_sysfs_deinit(struct request_queue *q) { - struct blk_mq_ctx *ctx; int cpu; - for_each_possible_cpu(cpu) { - ctx = *per_cpu_ptr(q->queue_ctx, cpu); - kobject_put(>kobj); - } + for_each_possible_cpu(cpu) + kobject_put(>queue_ctx[cpu]->kobj); kobject_put(q->mq_kobj); } int blk_mq_sysfs_init(struct request_queue *q) { - struct blk_mq_ctx *ctx; int cpu; struct kobject *mq_kobj; @@ -312,10 +308,8 @@ int blk_mq_sysfs_init(struct request_queue *q) kobject_init(mq_kobj, _mq_ktype); - for_each_possible_cpu(cpu) { - ctx = *per_cpu_ptr(q->queue_ctx, cpu); - kobject_init(>kobj, _mq_ctx_ktype); - } + for_each_possible_cpu(cpu) + kobject_init(>queue_ctx[cpu]->kobj, _mq_ctx_ktype); q->mq_kobj = mq_kobj; return 0; } diff --git a/block/blk-mq.c b/block/blk-mq.c index 376c04778d33..85d5dba56272 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2299,7 +2299,7 @@ static void blk_mq_init_cpu_queues(struct request_queue *q, unsigned int i, j; for_each_possible_cpu(i) { - struct blk_mq_ctx *__ctx = *per_cpu_ptr(q->queue_ctx, i); + struct blk_mq_ctx *__ctx = q->queue_ctx[i]; struct blk_mq_hw_ctx *hctx; __ctx->cpu = i; @@ -2385,7 +2385,7 @@ static void blk_mq_map_swqueue(struct request_queue *q) set->map[0].mq_map[i] = 0; } - ctx = *per_cpu_ptr(q->queue_ctx, i); + ctx = q->queue_ctx[i]; for (j = 0; j < set->nr_maps; 
j++) { hctx = blk_mq_map_queue_type(q, j, i); @@ -2520,9 +2520,9 @@ static void blk_mq_dealloc_queue_ctx(struct request_queue *q, bool free_ctxs) if (free_ctxs) { int cpu; for_each_possible_cpu(cpu) - kfree(*per_cpu_ptr(q->queue_ctx, cpu)); + kfree(q->queue_ctx[cpu]); } - free_percpu(q->queue_ctx); + kfree(q->queue_ctx); } static int blk_mq_alloc_queue_ctx(struct request_queue *q) @@ -2530,7 +2530,9 @@ static int blk_mq_alloc_queue_ctx(struct request_queue *q) struct blk_mq_ctx *ctx; int cpu; - q->queue_ctx = alloc_percpu(struct blk_mq_ctx *); + q->queue_ctx = kmalloc_array_node(nr_cpu_ids, + sizeof(struct blk_mq_ctx *), + GFP_KERNEL, q->tag_set->numa_node); if (!q->queue_ctx) return -ENOMEM; @@ -2538,7 +2540,7 @@ static int blk_mq_alloc_queue_ctx(struct request_queue *q) ctx = kzalloc_node(sizeof(*ctx), GFP_KERNEL, cpu_to_node(cpu)); if (!ctx) goto fail; - *per_cpu_ptr(q->queue_ctx, cpu) = ctx; + q->queue_ctx[cpu] = ctx; } return 0; @@ -2759,6 +2761,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set, /* mark the queue as mq asap */ q->mq_ops = set->ops; + q->tag_set = set; + q->poll_cb = blk_stat_alloc_callback(blk_mq_poll_stats_fn, blk_mq_poll_stats_bkt, BLK_MQ_POLL_STATS_BKTS, q); @@ -2786,8 +2790,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set, INIT_WORK(>timeout_work, blk_mq_timeout_work); blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ); - q->tag_set = set; - q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT; if (!(set->flags & BLK_MQ_F_SG_MERGE)) diff --git a/block/blk-mq.h b/block/blk-mq.h index 84898793c230..97829388e1db 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -129,7 +129,7 @@ static inline enum mq_rq_state blk_mq_rq_state(struct request *rq) static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q, unsigned int cp
[PATCH V2 for-4.21 1/2] blk-mq: not embed .mq_kobj and ctx->kobj into queue instance
Even though .mq_kobj, ctx->kobj and q->kobj share same lifetime from block layer's view, actually they don't because userspace may grab one kobject anytime via sysfs, so each kobject's lifetime has to be independent, then the objects(mq_kobj, ctx) which hosts its own kobject have to be allocated dynamically. This patch fixes kernel panic issue during booting when DEBUG_KOBJECT_RELEASE is enabled. Reported-by: Guenter Roeck Cc: "jianchao.wang" Cc: Guenter Roeck Cc: Greg Kroah-Hartman Cc: sta...@vger.kernel.org Signed-off-by: Ming Lei --- block/blk-mq-sysfs.c | 41 + block/blk-mq.c | 55 +- block/blk-mq.h | 4 ++-- include/linux/blkdev.h | 4 ++-- 4 files changed, 78 insertions(+), 26 deletions(-) diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c index 3d25b9c419e9..cc2fef909afc 100644 --- a/block/blk-mq-sysfs.c +++ b/block/blk-mq-sysfs.c @@ -15,6 +15,14 @@ static void blk_mq_sysfs_release(struct kobject *kobj) { + kfree(kobj); +} + +static void blk_mq_ctx_sysfs_release(struct kobject *kobj) +{ + struct blk_mq_ctx *ctx = container_of(kobj, struct blk_mq_ctx, kobj); + + kfree(ctx); } static void blk_mq_hw_sysfs_release(struct kobject *kobj) @@ -213,7 +221,7 @@ static struct kobj_type blk_mq_ktype = { static struct kobj_type blk_mq_ctx_ktype = { .sysfs_ops = _mq_sysfs_ops, .default_attrs = default_ctx_attrs, - .release= blk_mq_sysfs_release, + .release= blk_mq_ctx_sysfs_release, }; static struct kobj_type blk_mq_hw_ktype = { @@ -245,7 +253,7 @@ static int blk_mq_register_hctx(struct blk_mq_hw_ctx *hctx) if (!hctx->nr_ctx) return 0; - ret = kobject_add(>kobj, >mq_kobj, "%u", hctx->queue_num); + ret = kobject_add(>kobj, q->mq_kobj, "%u", hctx->queue_num); if (ret) return ret; @@ -268,8 +276,8 @@ void blk_mq_unregister_dev(struct device *dev, struct request_queue *q) queue_for_each_hw_ctx(q, hctx, i) blk_mq_unregister_hctx(hctx); - kobject_uevent(>mq_kobj, KOBJ_REMOVE); - kobject_del(>mq_kobj); + kobject_uevent(q->mq_kobj, KOBJ_REMOVE); + kobject_del(q->mq_kobj); 
kobject_put(>kobj); q->mq_sysfs_init_done = false; @@ -286,23 +294,30 @@ void blk_mq_sysfs_deinit(struct request_queue *q) int cpu; for_each_possible_cpu(cpu) { - ctx = per_cpu_ptr(q->queue_ctx, cpu); + ctx = *per_cpu_ptr(q->queue_ctx, cpu); kobject_put(>kobj); } - kobject_put(>mq_kobj); + kobject_put(q->mq_kobj); } -void blk_mq_sysfs_init(struct request_queue *q) +int blk_mq_sysfs_init(struct request_queue *q) { struct blk_mq_ctx *ctx; int cpu; + struct kobject *mq_kobj; + + mq_kobj = kzalloc(sizeof(*mq_kobj), GFP_KERNEL); + if (!mq_kobj) + return -ENOMEM; - kobject_init(>mq_kobj, _mq_ktype); + kobject_init(mq_kobj, _mq_ktype); for_each_possible_cpu(cpu) { - ctx = per_cpu_ptr(q->queue_ctx, cpu); + ctx = *per_cpu_ptr(q->queue_ctx, cpu); kobject_init(>kobj, _mq_ctx_ktype); } + q->mq_kobj = mq_kobj; + return 0; } int __blk_mq_register_dev(struct device *dev, struct request_queue *q) @@ -313,11 +328,11 @@ int __blk_mq_register_dev(struct device *dev, struct request_queue *q) WARN_ON_ONCE(!q->kobj.parent); lockdep_assert_held(>sysfs_lock); - ret = kobject_add(>mq_kobj, kobject_get(>kobj), "%s", "mq"); + ret = kobject_add(q->mq_kobj, kobject_get(>kobj), "%s", "mq"); if (ret < 0) goto out; - kobject_uevent(>mq_kobj, KOBJ_ADD); + kobject_uevent(q->mq_kobj, KOBJ_ADD); queue_for_each_hw_ctx(q, hctx, i) { ret = blk_mq_register_hctx(hctx); @@ -334,8 +349,8 @@ int __blk_mq_register_dev(struct device *dev, struct request_queue *q) while (--i >= 0) blk_mq_unregister_hctx(q->queue_hw_ctx[i]); - kobject_uevent(>mq_kobj, KOBJ_REMOVE); - kobject_del(>mq_kobj); + kobject_uevent(q->mq_kobj, KOBJ_REMOVE); + kobject_del(q->mq_kobj); kobject_put(>kobj); return ret; } diff --git a/block/blk-mq.c b/block/blk-mq.c index 3b823891b3ef..376c04778d33 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2299,7 +2299,7 @@ static void blk_mq_init_cpu_queues(struct request_queue *q, unsigned int i, j; for_each_possible_cpu(i) { - struct blk_mq_ctx *__ctx = per_cpu_ptr(q->queue_ctx, i); + struct 
blk_mq_ctx *__ctx = *per_cpu_ptr(q->queue_
Re: [PATCH 1/2] blk-mq: not embed .mq_kobj and ctx->kobj into queue instance
On Fri, Nov 16, 2018 at 02:11:07PM +0800, jianchao.wang wrote: > > > On 11/16/18 11:28 AM, Ming Lei wrote: > ... > > > > +struct blk_mq_kobj { > > + struct kobject kobj; > > +}; > > + > > static void blk_mq_sysfs_release(struct kobject *kobj) > > { > > + struct blk_mq_kobj *mq_kobj = container_of(kobj, struct blk_mq_kobj, > > + kobj); > > + kfree(mq_kobj); > > +} > > + > ... > > > > -void blk_mq_sysfs_init(struct request_queue *q) > > +int blk_mq_sysfs_init(struct request_queue *q) > > { > > struct blk_mq_ctx *ctx; > > int cpu; > > + struct blk_mq_kobj *mq_kobj; > > + > > + mq_kobj = kzalloc(sizeof(struct blk_mq_kobj), GFP_KERNEL); > > + if (!mq_kobj) > > + return -ENOMEM; > > > > - kobject_init(>mq_kobj, _mq_ktype); > > + kobject_init(_kobj->kobj, _mq_ktype); > > > > for_each_possible_cpu(cpu) { > > - ctx = per_cpu_ptr(q->queue_ctx, cpu); > > + ctx = kzalloc_node(sizeof(*ctx), GFP_KERNEL, cpu_to_node(cpu)); > > + if (!ctx) > > + goto fail; > > + *per_cpu_ptr(q->queue_ctx, cpu) = ctx; > > kobject_init(>kobj, _mq_ctx_ktype); > > } > > + q->mq_kobj = _kobj->kobj; > > + return 0; > > + > > + fail: > > + for_each_possible_cpu(cpu) { > > + ctx = *per_cpu_ptr(q->queue_ctx, cpu); > > + if (ctx) > > + kobject_put(>kobj); > > + } > > + kobject_put(_kobj->kobj); > > + return -ENOMEM; > > } > > > blk_mq_kobj looks meaningless, why do we need it, or do I miss something ? Right, it should have been allocated directly. > And maybe we should allocate ctx in blk_mq_init_allocated_queue. Looks either way is fine. Thanks, Ming
Re: [PATCH 1/2] blk-mq: not embed .mq_kobj and ctx->kobj into queue instance
On Fri, Nov 16, 2018 at 01:52:05AM -0500, Sasha Levin wrote: > On Fri, Nov 16, 2018 at 11:28:25AM +0800, Ming Lei wrote: > > Even though .mq_kobj, ctx->kobj and q->kobj share same lifetime > > from block layer's view, actually they don't because userspace may > > grab one kobject anytime via sysfs, so each kobject's lifetime has > > to be independent, then the objects(mq_kobj, ctx) which hosts its > > own kobject have to be allocated dynamically. > > > > This patch fixes kernel panic issue during booting when > > DEBUG_KOBJECT_RELEASE > > is enabled. > > > > Reported-by: Guenter Roeck > > Cc: Guenter Roeck > > Cc: Greg Kroah-Hartman > > Cc: sta...@vger.kernel.org > > Signed-off-by: Ming Lei > > What does this patch depend on? It doesn't apply to Linus's tree nor to > the block tree. > > Also, could you please cc lkml with patches? It depends on for-4.21/block. thanks, Ming
[PATCH 0/2] blk-mq: fix kobject lifetime issue
Hi Jens, The 1st patch fixes the kobject lifetime issue which is triggered when DEBUG_KOBJECT_RELEASE is enabled. The 2nd patch can be thought of as a follow-up cleanup. Ming Lei (2): blk-mq: not embed .mq_kobj and ctx->kobj into queue instance blk-mq: alloc q->queue_ctx as normal array block/blk-mq-sysfs.c | 59 +++--- block/blk-mq.c | 19 ++-- block/blk-mq.h | 4 ++-- include/linux/blkdev.h | 4 ++-- 4 files changed, 62 insertions(+), 24 deletions(-) Cc: Guenter Roeck Cc: Greg Kroah-Hartman -- 2.9.5
[PATCH 1/2] blk-mq: not embed .mq_kobj and ctx->kobj into queue instance
Even though .mq_kobj, ctx->kobj and q->kobj share same lifetime from block layer's view, actually they don't because userspace may grab one kobject anytime via sysfs, so each kobject's lifetime has to be independent, then the objects(mq_kobj, ctx) which hosts its own kobject have to be allocated dynamically. This patch fixes kernel panic issue during booting when DEBUG_KOBJECT_RELEASE is enabled. Reported-by: Guenter Roeck Cc: Guenter Roeck Cc: Greg Kroah-Hartman Cc: sta...@vger.kernel.org Signed-off-by: Ming Lei --- block/blk-mq-sysfs.c | 59 +++--- block/blk-mq.c | 13 ++- block/blk-mq.h | 4 ++-- include/linux/blkdev.h | 4 ++-- 4 files changed, 58 insertions(+), 22 deletions(-) diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c index 3d25b9c419e9..bab236955f56 100644 --- a/block/blk-mq-sysfs.c +++ b/block/blk-mq-sysfs.c @@ -13,8 +13,22 @@ #include "blk-mq.h" #include "blk-mq-tag.h" +struct blk_mq_kobj { + struct kobject kobj; +}; + static void blk_mq_sysfs_release(struct kobject *kobj) { + struct blk_mq_kobj *mq_kobj = container_of(kobj, struct blk_mq_kobj, + kobj); + kfree(mq_kobj); +} + +static void blk_mq_ctx_sysfs_release(struct kobject *kobj) +{ + struct blk_mq_ctx *ctx = container_of(kobj, struct blk_mq_ctx, kobj); + + kfree(ctx); } static void blk_mq_hw_sysfs_release(struct kobject *kobj) @@ -213,7 +227,7 @@ static struct kobj_type blk_mq_ktype = { static struct kobj_type blk_mq_ctx_ktype = { .sysfs_ops = _mq_sysfs_ops, .default_attrs = default_ctx_attrs, - .release= blk_mq_sysfs_release, + .release= blk_mq_ctx_sysfs_release, }; static struct kobj_type blk_mq_hw_ktype = { @@ -245,7 +259,7 @@ static int blk_mq_register_hctx(struct blk_mq_hw_ctx *hctx) if (!hctx->nr_ctx) return 0; - ret = kobject_add(>kobj, >mq_kobj, "%u", hctx->queue_num); + ret = kobject_add(>kobj, q->mq_kobj, "%u", hctx->queue_num); if (ret) return ret; @@ -268,8 +282,8 @@ void blk_mq_unregister_dev(struct device *dev, struct request_queue *q) queue_for_each_hw_ctx(q, hctx, i) 
blk_mq_unregister_hctx(hctx); - kobject_uevent(>mq_kobj, KOBJ_REMOVE); - kobject_del(>mq_kobj); + kobject_uevent(q->mq_kobj, KOBJ_REMOVE); + kobject_del(q->mq_kobj); kobject_put(>kobj); q->mq_sysfs_init_done = false; @@ -286,23 +300,42 @@ void blk_mq_sysfs_deinit(struct request_queue *q) int cpu; for_each_possible_cpu(cpu) { - ctx = per_cpu_ptr(q->queue_ctx, cpu); + ctx = *per_cpu_ptr(q->queue_ctx, cpu); kobject_put(>kobj); } - kobject_put(>mq_kobj); + kobject_put(q->mq_kobj); } -void blk_mq_sysfs_init(struct request_queue *q) +int blk_mq_sysfs_init(struct request_queue *q) { struct blk_mq_ctx *ctx; int cpu; + struct blk_mq_kobj *mq_kobj; + + mq_kobj = kzalloc(sizeof(struct blk_mq_kobj), GFP_KERNEL); + if (!mq_kobj) + return -ENOMEM; - kobject_init(>mq_kobj, _mq_ktype); + kobject_init(_kobj->kobj, _mq_ktype); for_each_possible_cpu(cpu) { - ctx = per_cpu_ptr(q->queue_ctx, cpu); + ctx = kzalloc_node(sizeof(*ctx), GFP_KERNEL, cpu_to_node(cpu)); + if (!ctx) + goto fail; + *per_cpu_ptr(q->queue_ctx, cpu) = ctx; kobject_init(>kobj, _mq_ctx_ktype); } + q->mq_kobj = _kobj->kobj; + return 0; + + fail: + for_each_possible_cpu(cpu) { + ctx = *per_cpu_ptr(q->queue_ctx, cpu); + if (ctx) + kobject_put(>kobj); + } + kobject_put(_kobj->kobj); + return -ENOMEM; } int __blk_mq_register_dev(struct device *dev, struct request_queue *q) @@ -313,11 +346,11 @@ int __blk_mq_register_dev(struct device *dev, struct request_queue *q) WARN_ON_ONCE(!q->kobj.parent); lockdep_assert_held(>sysfs_lock); - ret = kobject_add(>mq_kobj, kobject_get(>kobj), "%s", "mq"); + ret = kobject_add(q->mq_kobj, kobject_get(>kobj), "%s", "mq"); if (ret < 0) goto out; - kobject_uevent(>mq_kobj, KOBJ_ADD); + kobject_uevent(q->mq_kobj, KOBJ_ADD); queue_for_each_hw_ctx(q, hctx, i) { ret = blk_mq_register_hctx(hctx); @@ -334,8 +367,8 @@ int __blk_mq_register_dev(struct device *dev, struct request_queue *q) while (--i >= 0) blk_mq_unregister_hctx(q->queue_hw_ctx[i]); - kobject_ue
[PATCH 2/2] blk-mq: alloc q->queue_ctx as normal array
Now q->queue_ctx is just one read-mostly table for query the 'blk_mq_ctx' instance from one cpu index, it isn't necessary to allocate it as percpu variable. One simple array may be more efficient. Cc: Guenter Roeck Cc: Greg Kroah-Hartman Signed-off-by: Ming Lei --- block/blk-mq-sysfs.c | 6 +++--- block/blk-mq.c | 12 +++- block/blk-mq.h | 2 +- include/linux/blkdev.h | 2 +- 4 files changed, 12 insertions(+), 10 deletions(-) diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c index bab236955f56..dc4ac733a125 100644 --- a/block/blk-mq-sysfs.c +++ b/block/blk-mq-sysfs.c @@ -300,7 +300,7 @@ void blk_mq_sysfs_deinit(struct request_queue *q) int cpu; for_each_possible_cpu(cpu) { - ctx = *per_cpu_ptr(q->queue_ctx, cpu); + ctx = q->queue_ctx[cpu]; kobject_put(>kobj); } kobject_put(q->mq_kobj); @@ -322,7 +322,7 @@ int blk_mq_sysfs_init(struct request_queue *q) ctx = kzalloc_node(sizeof(*ctx), GFP_KERNEL, cpu_to_node(cpu)); if (!ctx) goto fail; - *per_cpu_ptr(q->queue_ctx, cpu) = ctx; + q->queue_ctx[cpu] = ctx; kobject_init(>kobj, _mq_ctx_ktype); } q->mq_kobj = _kobj->kobj; @@ -330,7 +330,7 @@ int blk_mq_sysfs_init(struct request_queue *q) fail: for_each_possible_cpu(cpu) { - ctx = *per_cpu_ptr(q->queue_ctx, cpu); + ctx = q->queue_ctx[cpu]; if (ctx) kobject_put(>kobj); } diff --git a/block/blk-mq.c b/block/blk-mq.c index 3589ee601f37..20c485e1817a 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2299,7 +2299,7 @@ static void blk_mq_init_cpu_queues(struct request_queue *q, unsigned int i, j; for_each_possible_cpu(i) { - struct blk_mq_ctx *__ctx = *per_cpu_ptr(q->queue_ctx, i); + struct blk_mq_ctx *__ctx = q->queue_ctx[i]; struct blk_mq_hw_ctx *hctx; __ctx->cpu = i; @@ -2385,7 +2385,7 @@ static void blk_mq_map_swqueue(struct request_queue *q) set->map[0].mq_map[i] = 0; } - ctx = *per_cpu_ptr(q->queue_ctx, i); + ctx = q->queue_ctx[i]; for (j = 0; j < set->nr_maps; j++) { hctx = blk_mq_map_queue_type(q, j, i); @@ -2541,7 +2541,7 @@ void blk_mq_release(struct request_queue 
*q) */ blk_mq_sysfs_deinit(q); - free_percpu(q->queue_ctx); + kfree(q->queue_ctx); } struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set) @@ -2731,7 +2731,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set, if (!q->poll_cb) goto err_exit; - q->queue_ctx = alloc_percpu(struct blk_mq_ctx *); + q->queue_ctx = kmalloc_array_node(nr_cpu_ids, + sizeof(struct blk_mq_ctx *), + GFP_KERNEL, set->numa_node); if (!q->queue_ctx) goto err_exit; @@ -2798,7 +2800,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set, err_sys_init: blk_mq_sysfs_deinit(q); err_percpu: - free_percpu(q->queue_ctx); + kfree(q->queue_ctx); err_exit: q->mq_ops = NULL; return ERR_PTR(-ENOMEM); diff --git a/block/blk-mq.h b/block/blk-mq.h index 84898793c230..97829388e1db 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -129,7 +129,7 @@ static inline enum mq_rq_state blk_mq_rq_state(struct request *rq) static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q, unsigned int cpu) { - return *per_cpu_ptr(q->queue_ctx, cpu); + return q->queue_ctx[cpu]; } /* diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 9e3892bd67fd..9b6ddc5c7a40 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -407,7 +407,7 @@ struct request_queue { const struct blk_mq_ops *mq_ops; /* sw queues */ - struct blk_mq_ctx __percpu **queue_ctx; + struct blk_mq_ctx **queue_ctx; unsigned intnr_queues; unsigned intqueue_depth; -- 2.9.5
[PATCH] block/025: test discard sector alignement and sector size overflow
This test covers the following two issues: 1) discard sectors need to be aligned with logical block size 2) make sure 'sector_t' instead of 'unsigned int' is used when comparing with discard sector size Signed-off-by: Ming Lei --- tests/block/025 | 37 + tests/block/025.out | 2 ++ 2 files changed, 39 insertions(+) create mode 100755 tests/block/025 create mode 100644 tests/block/025.out diff --git a/tests/block/025 b/tests/block/025 new file mode 100755 index ..32b632431793 --- /dev/null +++ b/tests/block/025 @@ -0,0 +1,37 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-3.0+ +# Copyright (C) 2018 Ming Lei +# +# Check two corner cases of BLKDISCARD. +# +# 1) test if discard bio's sector is aligned with logical size, fixed by +#1adfc5e4136f ("block: make sure discard bio is aligned with logical block size") +# 2) test 32 bit overflow when comparing discard sector size. Fixed by +#4800bf7bc8c72 ("block: fix 32 bit overflow in __blkdev_issue_discard()") + . tests/block/rc +. common/scsi_debug + +DESCRIPTION="check sector alignment and sector size overflow of BLKDISCARD" + +requires() { + _have_scsi_debug +} + +test() { + echo "Running ${TEST_NAME}" + + rm -f "$FULL" + + # Create virtual device with unmap_zeroes_data support + if ! _init_scsi_debug virtual_gb=2049 sector_size=4096 lbpws10=1 dev_size_mb=512; then + return 1 + fi + + local dev="/dev/${SCSI_DEBUG_DEVICES[0]}" + blkdiscard "$dev" + + _exit_scsi_debug + + echo "Test complete" } diff --git a/tests/block/025.out b/tests/block/025.out new file mode 100644 index ..fd9a6d5f70de --- /dev/null +++ b/tests/block/025.out @@ -0,0 +1,2 @@ +Running block/025 +Test complete -- 2.9.5
Re: [PATCH] block: fix 32 bit overflow in __blkdev_issue_discard()
On Thu, Nov 15, 2018 at 12:22:01PM +1100, Dave Chinner wrote: > On Thu, Nov 15, 2018 at 09:06:52AM +0800, Ming Lei wrote: > > On Wed, Nov 14, 2018 at 08:18:24AM -0700, Jens Axboe wrote: > > > On 11/13/18 2:43 PM, Dave Chinner wrote: > > > > From: Dave Chinner > > > > > > > > A discard cleanup merged into 4.20-rc2 causes fstests xfs/259 to > > > > fall into an endless loop in the discard code. The test is creating > > > > a device that is exactly 2^32 sectors in size to test mkfs boundary > > > > conditions around the 32 bit sector overflow region. > > > > > > > > mkfs issues a discard for the entire device size by default, and > > > > hence this throws a sector count of 2^32 into > > > > blkdev_issue_discard(). It takes the number of sectors to discard as > > > > a sector_t - a 64 bit value. > > > > > > > > The commit ba5d73851e71 ("block: cleanup __blkdev_issue_discard") > > > > takes this sector count and casts it to a 32 bit value before > > > > comapring it against the maximum allowed discard size the device > > > > has. This truncates away the upper 32 bits, and so if the lower 32 > > > > bits of the sector count is zero, it starts issuing discards of > > > > length 0. This causes the code to fall into an endless loop, issuing > > > > a zero length discards over and over again on the same sector. > > > > > > Applied, thanks. Ming, can you please add a blktests test for > > > this case? This is the 2nd time it's been broken. > > > > OK, I will add zram discard test in blktests, which should cover the > > 1st report. For the xfs/259, I need to investigate if it is easy to > > do in blktests. > > Just write a test that creates block devices of 2^32 + (-1,0,1) > sectors and runs a discard across the entire device. That's all that > xfs/259 it doing - exercising mkfs on 2TB, 4TB and 16TB boundaries. > i.e. the boundaries where sectors and page cache indexes (on 4k page > size systems) overflow 32 bit int and unsigned int sizes. 
> mkfs issues a discard for the entire device, so it's testing that as
> well...

Indeed, I can reproduce this issue via the following commands:

    modprobe scsi_debug virtual_gb=2049 sector_size=512 lbpws10=1 dev_size_mb=512
    blkdiscard /dev/sde

> You need to write tests that exercise write_same, write_zeros and
> discard operations around these boundaries, because they all take
> a 64 bit sector count and stuff them into 32 bit size fields in
> the bio that is being submitted.

write_same/write_zeros are usually used by the driver directly, so we may need to make the test case on some specific device.

Thanks,
Ming
Re: [PATCH] block: fix 32 bit overflow in __blkdev_issue_discard()
On Wed, Nov 14, 2018 at 08:18:24AM -0700, Jens Axboe wrote: > On 11/13/18 2:43 PM, Dave Chinner wrote: > > From: Dave Chinner > > > > A discard cleanup merged into 4.20-rc2 causes fstests xfs/259 to > > fall into an endless loop in the discard code. The test is creating > > a device that is exactly 2^32 sectors in size to test mkfs boundary > > conditions around the 32 bit sector overflow region. > > > > mkfs issues a discard for the entire device size by default, and > > hence this throws a sector count of 2^32 into > > blkdev_issue_discard(). It takes the number of sectors to discard as > > a sector_t - a 64 bit value. > > > > The commit ba5d73851e71 ("block: cleanup __blkdev_issue_discard") > > takes this sector count and casts it to a 32 bit value before > > comparing it against the maximum allowed discard size the device > > has. This truncates away the upper 32 bits, and so if the lower 32 > > bits of the sector count is zero, it starts issuing discards of > > length 0. This causes the code to fall into an endless loop, issuing > > zero length discards over and over again on the same sector. > > Applied, thanks. Ming, can you please add a blktests test for > this case? This is the 2nd time it's been broken. OK, I will add a zram discard test in blktests, which should cover the 1st report. For the xfs/259, I need to investigate if it is easy to do in blktests. Thanks, Ming
Re: [PATCH] block: fix 32 bit overflow in __blkdev_issue_discard()
On Wed, Nov 14, 2018 at 4:09 PM Dave Chinner wrote: > > On Wed, Nov 14, 2018 at 10:53:11AM +0800, Ming Lei wrote: > > On Wed, Nov 14, 2018 at 5:44 AM Dave Chinner wrote: > > > > > > From: Dave Chinner > > > > > > A discard cleanup merged into 4.20-rc2 causes fstests xfs/259 to > > > fall into an endless loop in the discard code. The test is creating > > > a device that is exactly 2^32 sectors in size to test mkfs boundary > > > conditions around the 32 bit sector overflow region. > > > > > > mkfs issues a discard for the entire device size by default, and > > > hence this throws a sector count of 2^32 into > > > blkdev_issue_discard(). It takes the number of sectors to discard as > > > a sector_t - a 64 bit value. > > > > > > The commit ba5d73851e71 ("block: cleanup __blkdev_issue_discard") > > > takes this sector count and casts it to a 32 bit value before > > > comparing it against the maximum allowed discard size the device > > > has. This truncates away the upper 32 bits, and so if the lower 32 > > > bits of the sector count is zero, it starts issuing discards of > > > length 0. This causes the code to fall into an endless loop, issuing > > > zero length discards over and over again on the same sector. 
> > > > > > Fixes: ba5d73851e71 ("block: cleanup __blkdev_issue_discard") > > > Signed-off-by: Dave Chinner > > > --- > > > block/blk-lib.c | 5 ++++- > > > 1 file changed, 4 insertions(+), 1 deletion(-) > > > > > > diff --git a/block/blk-lib.c b/block/blk-lib.c > > > index e8b3bb9bf375..144e156ed341 100644 > > > --- a/block/blk-lib.c > > > +++ b/block/blk-lib.c > > > @@ -55,9 +55,12 @@ int __blkdev_issue_discard(struct block_device *bdev, > > > sector_t sector, > > > return -EINVAL; > > > > > > while (nr_sects) { > > > - unsigned int req_sects = min_t(unsigned int, nr_sects, > > > + sector_t req_sects = min_t(sector_t, nr_sects, > > > bio_allowed_max_sectors(q)); > > > > bio_allowed_max_sectors(q) is always < UINT_MAX, and 'sector_t' is only > > required during the comparison, so another simpler fix might be the > > following, > > could you test if it is workable? > > > > diff --git a/block/blk-lib.c b/block/blk-lib.c > > index e8b3bb9bf375..6ef44f99e83f 100644 > > --- a/block/blk-lib.c > > +++ b/block/blk-lib.c > > @@ -55,7 +55,7 @@ int __blkdev_issue_discard(struct block_device > > *bdev, sector_t sector, > > return -EINVAL; > > > > while (nr_sects) { > > - unsigned int req_sects = min_t(unsigned int, nr_sects, > > + unsigned int req_sects = min_t(sector_t, nr_sects, > > bio_allowed_max_sectors(q)); > > Rearrange the deck chairs all you like, just make sure you fix your > regression test suite to exercise obvious boundary conditions like > this so the next cleanup doesn't break the code again. Good point, we may add a comment on the overflow story. > > > > > > > + WARN_ON_ONCE(req_sects == 0); > > > > The above line isn't necessary given 'nr_sects' can't be zero. > > Except it was 0 and it caused the bug I had to fix. So it should > have a warning on it. Obviously, it can't be zero unless the CPU is broken, :-) Thanks, Ming Lei
Re: [PATCH] block: fix 32 bit overflow in __blkdev_issue_discard()
On Wed, Nov 14, 2018 at 5:44 AM Dave Chinner wrote: > > From: Dave Chinner > > A discard cleanup merged into 4.20-rc2 causes fstests xfs/259 to > fall into an endless loop in the discard code. The test is creating > a device that is exactly 2^32 sectors in size to test mkfs boundary > conditions around the 32 bit sector overflow region. > > mkfs issues a discard for the entire device size by default, and > hence this throws a sector count of 2^32 into > blkdev_issue_discard(). It takes the number of sectors to discard as > a sector_t - a 64 bit value. > > The commit ba5d73851e71 ("block: cleanup __blkdev_issue_discard") > takes this sector count and casts it to a 32 bit value before > comparing it against the maximum allowed discard size the device > has. This truncates away the upper 32 bits, and so if the lower 32 > bits of the sector count is zero, it starts issuing discards of > length 0. This causes the code to fall into an endless loop, issuing > zero length discards over and over again on the same sector. > > Fixes: ba5d73851e71 ("block: cleanup __blkdev_issue_discard") > Signed-off-by: Dave Chinner > --- > block/blk-lib.c | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/block/blk-lib.c b/block/blk-lib.c > index e8b3bb9bf375..144e156ed341 100644 > --- a/block/blk-lib.c > +++ b/block/blk-lib.c > @@ -55,9 +55,12 @@ int __blkdev_issue_discard(struct block_device *bdev, > sector_t sector, > return -EINVAL; > > while (nr_sects) { > - unsigned int req_sects = min_t(unsigned int, nr_sects, > + sector_t req_sects = min_t(sector_t, nr_sects, > bio_allowed_max_sectors(q)); bio_allowed_max_sectors(q) is always < UINT_MAX, and 'sector_t' is only required during the comparison, so another simpler fix might be the following, could you test if it is workable? 
diff --git a/block/blk-lib.c b/block/blk-lib.c index e8b3bb9bf375..6ef44f99e83f 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -55,7 +55,7 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, return -EINVAL; while (nr_sects) { - unsigned int req_sects = min_t(unsigned int, nr_sects, + unsigned int req_sects = min_t(sector_t, nr_sects, bio_allowed_max_sectors(q)); > > + WARN_ON_ONCE(req_sects == 0); The above line isn't necessary given 'nr_sects' can't be zero. Thanks, Ming Lei
Re: [PATCH] block: Clear kernel memory before copying to user
On Thu, Nov 8, 2018 at 6:07 PM Johannes Thumshirn wrote: > > On 08/11/2018 02:22, Keith Busch wrote: > > $ ./sg-test /dev/sda | grep -v 0 > > 40733f4019db8001244019db4065244019db0094244019db > > c025244019dbc0e43a4019db40973a4019dbc0623a4019db > > 800c244019dbc0d61d4019dbc05f244019db80091e4019db > > 40913a4019db806f3f4019db40a83f4019dbc083244019db > > 80eb1e4019db00a93f4019dbc09a3a4019db40503f4019db > > 007f1b4019dbc0d91e4019db40551e4019db804a1b4019db > > > > -- > > > > Hi Keith, > Can you please add the above to blktests? > > This would be very useful for downstream distributions. I guess the issue may depend on the specific QEMU version; I just tried the test over virtio-scsi/sata/usb-storage emulated via qemu-2.10.2-1.fc27 and didn't observe this problem. Thanks, Ming Lei
Re: [PATCH] block: Clear kernel memory before copying to user
On Wed, Nov 7, 2018 at 11:47 PM Keith Busch wrote: > > On Wed, Nov 07, 2018 at 11:44:59PM +0800, Ming Lei wrote: > > blk_update_request() may tell us how much progress made, :-) > > Except when it doesn't, which is 100% of the time for many block > drivers, including nvme. Please look at blk_mq_end_request()(<- nvme_complete_rq()), which calls blk_update_request(). Thanks, Ming Lei
Re: [PATCH] block: Clear kernel memory before copying to user
On Wed, Nov 7, 2018 at 11:19 PM Keith Busch wrote: > > On Wed, Nov 07, 2018 at 11:09:27PM +0800, Ming Lei wrote: > > On Wed, Nov 7, 2018 at 10:42 PM Keith Busch wrote: > > > > > > If the kernel allocates a bounce buffer for user read data, this memory > > > needs to be cleared before copying it to the user, otherwise it may leak > > > kernel memory to user space. > > > > > > Signed-off-by: Keith Busch > > > --- > > > block/bio.c | 1 + > > > 1 file changed, 1 insertion(+) > > > > > > diff --git a/block/bio.c b/block/bio.c > > > index d5368a445561..a50d59236b19 100644 > > > --- a/block/bio.c > > > +++ b/block/bio.c > > > @@ -1260,6 +1260,7 @@ struct bio *bio_copy_user_iov(struct request_queue > > > *q, > > > if (ret) > > > goto cleanup; > > > } else { > > > + zero_fill_bio(bio); > > > iov_iter_advance(iter, bio->bi_iter.bi_size); > > > } > > > > This way looks inefficient because zero fill should only be required > > for short READ. > > Sure, but how do you know that happened before copying the bounce buffer > to user space? blk_update_request() may tell us how much progress made, :-) So it should work by calling the following code after the req is completed and before copying data to userspace: __rq_for_each_bio(bio, rq) zero_fill_bio(bio); > > We could zero the pages on allocation if that's better (and doesn't zero > twice if __GFP_ZERO was already provided): > > --- > diff --git a/block/bio.c b/block/bio.c > index d5368a445561..a1b6383294f4 100644 > --- a/block/bio.c > +++ b/block/bio.c > @@ -1212,6 +1212,9 @@ struct bio *bio_copy_user_iov(struct request_queue *q, > nr_pages = 1 << map_data->page_order; > i = map_data->offset / PAGE_SIZE; > } > + > + if (iov_iter_rw(iter) == READ) > + gfp_mask |= __GFP_ZERO; No big difference compared with before, because most of times the zero fill shouldn't be needed. However, this patch changes to do it every time. Thanks, Ming Lei
Re: [PATCH] block: Clear kernel memory before copying to user
On Wed, Nov 7, 2018 at 10:42 PM Keith Busch wrote: > > If the kernel allocates a bounce buffer for user read data, this memory > needs to be cleared before copying it to the user, otherwise it may leak > kernel memory to user space. > > Signed-off-by: Keith Busch > --- > block/bio.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/block/bio.c b/block/bio.c > index d5368a445561..a50d59236b19 100644 > --- a/block/bio.c > +++ b/block/bio.c > @@ -1260,6 +1260,7 @@ struct bio *bio_copy_user_iov(struct request_queue *q, > if (ret) > goto cleanup; > } else { > + zero_fill_bio(bio); > iov_iter_advance(iter, bio->bi_iter.bi_size); > } This way looks inefficient because zero fill should only be required for short READ. Thanks, Ming Lei
Re: [PATCH V2 0/3] block: make sure discard/writesame bio is aligned with logical block size
On Tue, Nov 06, 2018 at 04:48:13PM +, Rui Salvaterra wrote: > On Mon, 5 Nov 2018 at 03:41, Ming Lei wrote: > > > > V2 addresses Christoph's comment by introducing bio_allowed_max_sectors(). > > > > Ping... > > > > Thanks, > > Ming > > Hi, Ming, > > Sorry for the delay. I tested your V2 against Linux 4.20-rc1 and > everything seems fine. FWIW, V2 is also > > Tested-by: Rui Salvaterra Rui, thanks for your test on V2. Jens, what do you think about this patchset? Thanks, Ming
Re: [PATCH V2 0/3] block: make sure discard/writesame bio is aligned with logical block size
On Mon, Oct 29, 2018 at 08:57:16PM +0800, Ming Lei wrote: > Hi, > > The 1st & 3rd patch fixes bio size alignment issue. > > The 2nd patch cleans up __blkdev_issue_discard() a bit. > > V2: > - introduce helper of bio_allowed_max_sectors() > - add commit log for patch 2 > > > Ming Lei (3): > block: make sure discard bio is aligned with logical block size > block: cleanup __blkdev_issue_discard() > block: make sure writesame bio is aligned with logical block size > > block/blk-lib.c | 26 +++--- > block/blk-merge.c | 3 ++- > block/blk.h | 10 ++ > 3 files changed, 19 insertions(+), 20 deletions(-) > > Cc: Rui Salvaterra > Cc: sta...@vger.kernel.org > Cc: Mike Snitzer > Cc: Christoph Hellwig > Cc: Xiao Ni > Cc: Mariusz Dabrowski V2 addresses Christoph's comment by introducing bio_allowed_max_sectors(). Ping... Thanks, Ming
[PATCH 1/4] Revert "irq: add support for allocating (and affinitizing) sets of IRQs"
This reverts commit 1d44f6f43e229ca06bf680aa7eb5ad380eaa5d72. --- drivers/pci/msi.c | 14 -- include/linux/interrupt.h | 4 kernel/irq/affinity.c | 40 +--- 3 files changed, 9 insertions(+), 49 deletions(-) diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index 265ed3e4c920..af24ed50a245 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -1036,13 +1036,6 @@ static int __pci_enable_msi_range(struct pci_dev *dev, int minvec, int maxvec, if (maxvec < minvec) return -ERANGE; - /* -* If the caller is passing in sets, we can't support a range of -* vectors. The caller needs to handle that. -*/ - if (affd && affd->nr_sets && minvec != maxvec) - return -EINVAL; - if (WARN_ON_ONCE(dev->msi_enabled)) return -EINVAL; @@ -1094,13 +1087,6 @@ static int __pci_enable_msix_range(struct pci_dev *dev, if (maxvec < minvec) return -ERANGE; - /* -* If the caller is passing in sets, we can't support a range of -* supported vectors. The caller needs to handle that. -*/ - if (affd && affd->nr_sets && minvec != maxvec) - return -EINVAL; - if (WARN_ON_ONCE(dev->msix_enabled)) return -EINVAL; diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index ca397ff40836..1d6711c28271 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -247,14 +247,10 @@ struct irq_affinity_notify { * the MSI(-X) vector space * @post_vectors: Don't apply affinity to @post_vectors at end of * the MSI(-X) vector space - * @nr_sets: Length of passed in *sets array - * @sets: Number of affinitized sets */ struct irq_affinity { int pre_vectors; int post_vectors; - int nr_sets; - int *sets; }; #if defined(CONFIG_SMP) diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c index 2046a0f0f0f1..f4f29b9d90ee 100644 --- a/kernel/irq/affinity.c +++ b/kernel/irq/affinity.c @@ -180,7 +180,6 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) int curvec, usedvecs; cpumask_var_t nmsk, npresmsk, *node_to_cpumask; struct cpumask *masks = NULL; - int i, nr_sets; /* * If 
there aren't any vectors left after applying the pre/post @@ -211,23 +210,10 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd) get_online_cpus(); build_node_to_cpumask(node_to_cpumask); - /* -* Spread on present CPUs starting from affd->pre_vectors. If we -* have multiple sets, build each sets affinity mask separately. -*/ - nr_sets = affd->nr_sets; - if (!nr_sets) - nr_sets = 1; - - for (i = 0, usedvecs = 0; i < nr_sets; i++) { - int this_vecs = affd->sets ? affd->sets[i] : affvecs; - int nr; - - nr = irq_build_affinity_masks(affd, curvec, this_vecs, - node_to_cpumask, cpu_present_mask, - nmsk, masks + usedvecs); - usedvecs += nr; - } + /* Spread on present CPUs starting from affd->pre_vectors */ + usedvecs = irq_build_affinity_masks(affd, curvec, affvecs, + node_to_cpumask, cpu_present_mask, + nmsk, masks); /* * Spread on non present CPUs starting from the next vector to be @@ -272,21 +258,13 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, const struct irq_affinity { int resv = affd->pre_vectors + affd->post_vectors; int vecs = maxvec - resv; - int set_vecs; + int ret; if (resv > minvec) return 0; - if (affd->nr_sets) { - int i; - - for (i = 0, set_vecs = 0; i < affd->nr_sets; i++) - set_vecs += affd->sets[i]; - } else { - get_online_cpus(); - set_vecs = cpumask_weight(cpu_possible_mask); - put_online_cpus(); - } - - return resv + min(set_vecs, vecs); + get_online_cpus(); + ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs) + resv; + put_online_cpus(); + return ret; } -- 2.9.5
[PATCH V2] block: brd: associate with queue until adding disk
brd_free() may be called in failure path on one brd instance which disk isn't added yet, so release handler of gendisk may free the associated request_queue early and causes the following use-after-free[1]. This patch fixes this issue by associating gendisk with request_queue just before adding disk. [1] KASAN: use-after-free Read in del_timer_syncNon-volatile memory driver v1.3 Linux agpgart interface v0.103 [drm] Initialized vgem 1.0.0 20120112 for virtual device on minor 0 usbcore: registered new interface driver udl == BUG: KASAN: use-after-free in __lock_acquire+0x36d9/0x4c20 kernel/locking/lockdep.c:3218 Read of size 8 at addr 8801d1b6b540 by task swapper/0/1 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.0+ #88 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x244/0x39d lib/dump_stack.c:113 print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433 __lock_acquire+0x36d9/0x4c20 kernel/locking/lockdep.c:3218 lock_acquire+0x1ed/0x520 kernel/locking/lockdep.c:3844 del_timer_sync+0xb7/0x270 kernel/time/timer.c:1283 blk_cleanup_queue+0x413/0x710 block/blk-core.c:809 brd_free+0x5d/0x71 drivers/block/brd.c:422 brd_init+0x2eb/0x393 drivers/block/brd.c:518 do_one_initcall+0x145/0x957 init/main.c:890 do_initcall_level init/main.c:958 [inline] do_initcalls init/main.c:966 [inline] do_basic_setup init/main.c:984 [inline] kernel_init_freeable+0x5c6/0x6b9 init/main.c:1148 kernel_init+0x11/0x1ae init/main.c:1068 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:350 Reported-by: syzbot+3701447012fe951da...@syzkaller.appspotmail.com Signed-off-by: Ming Lei --- V2: - don't reference queue via disk->queue in brd_alloc() drivers/block/brd.c | 16 +++- 1 file changed, 11 insertions(+), 5 deletions(-) diff 
--git a/drivers/block/brd.c b/drivers/block/brd.c index df8103dd40ac..c18586fccb6f 100644 --- a/drivers/block/brd.c +++ b/drivers/block/brd.c @@ -396,15 +396,14 @@ static struct brd_device *brd_alloc(int i) disk->first_minor = i * max_part; disk->fops = &brd_fops; disk->private_data = brd; - disk->queue = brd->brd_queue; disk->flags = GENHD_FL_EXT_DEVT; sprintf(disk->disk_name, "ram%d", i); set_capacity(disk, rd_size * 2); - disk->queue->backing_dev_info->capabilities |= BDI_CAP_SYNCHRONOUS_IO; + brd->brd_queue->backing_dev_info->capabilities |= BDI_CAP_SYNCHRONOUS_IO; /* Tell the block layer that this is not a rotational device */ - blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue); - blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, disk->queue); + blk_queue_flag_set(QUEUE_FLAG_NONROT, brd->brd_queue); + blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, brd->brd_queue); return brd; @@ -436,6 +435,7 @@ static struct brd_device *brd_init_one(int i, bool *new) brd = brd_alloc(i); if (brd) { + brd->brd_disk->queue = brd->brd_queue; add_disk(brd->brd_disk); list_add_tail(&brd->brd_list, &brd_devices); } @@ -503,8 +503,14 @@ static int __init brd_init(void) /* point of no return */ - list_for_each_entry(brd, &brd_devices, brd_list) + list_for_each_entry(brd, &brd_devices, brd_list) { + /* + * associate the queue with the disk just before adding the disk, + * to avoid messing up the failure path + */ + brd->brd_disk->queue = brd->brd_queue; add_disk(brd->brd_disk); + } blk_register_region(MKDEV(RAMDISK_MAJOR, 0), 1UL << MINORBITS, THIS_MODULE, brd_probe, NULL, NULL); -- 2.9.5
[PATCH] block: brd: associate with queue until adding disk
brd_free() may be called in failure path on one brd instance which disk isn't added yet, so release handler of gendisk may free the associated request_queue early and causes the following use-after-free[1]. This patch fixes this issue by associating gendisk with request_queue just before adding disk. [1] KASAN: use-after-free Read in del_timer_syncNon-volatile memory driver v1.3 Linux agpgart interface v0.103 [drm] Initialized vgem 1.0.0 20120112 for virtual device on minor 0 usbcore: registered new interface driver udl == BUG: KASAN: use-after-free in __lock_acquire+0x36d9/0x4c20 kernel/locking/lockdep.c:3218 Read of size 8 at addr 8801d1b6b540 by task swapper/0/1 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.0+ #88 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x244/0x39d lib/dump_stack.c:113 print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256 kasan_report_error mm/kasan/report.c:354 [inline] kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412 __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433 __lock_acquire+0x36d9/0x4c20 kernel/locking/lockdep.c:3218 lock_acquire+0x1ed/0x520 kernel/locking/lockdep.c:3844 del_timer_sync+0xb7/0x270 kernel/time/timer.c:1283 blk_cleanup_queue+0x413/0x710 block/blk-core.c:809 brd_free+0x5d/0x71 drivers/block/brd.c:422 brd_init+0x2eb/0x393 drivers/block/brd.c:518 do_one_initcall+0x145/0x957 init/main.c:890 do_initcall_level init/main.c:958 [inline] do_initcalls init/main.c:966 [inline] do_basic_setup init/main.c:984 [inline] kernel_init_freeable+0x5c6/0x6b9 init/main.c:1148 kernel_init+0x11/0x1ae init/main.c:1068 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:350 Reported-by: syzbot+3701447012fe951da...@syzkaller.appspotmail.com Signed-off-by: Ming Lei --- drivers/block/brd.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/block/brd.c b/drivers/block/brd.c index 
df8103dd40ac..6b3b8964b996 100644 --- a/drivers/block/brd.c +++ b/drivers/block/brd.c @@ -396,7 +396,6 @@ static struct brd_device *brd_alloc(int i) disk->first_minor = i * max_part; disk->fops = &brd_fops; disk->private_data = brd; - disk->queue = brd->brd_queue; disk->flags = GENHD_FL_EXT_DEVT; sprintf(disk->disk_name, "ram%d", i); set_capacity(disk, rd_size * 2); @@ -436,6 +435,7 @@ static struct brd_device *brd_init_one(int i, bool *new) brd = brd_alloc(i); if (brd) { + brd->brd_disk->queue = brd->brd_queue; add_disk(brd->brd_disk); list_add_tail(&brd->brd_list, &brd_devices); } @@ -503,8 +503,14 @@ static int __init brd_init(void) /* point of no return */ - list_for_each_entry(brd, &brd_devices, brd_list) + list_for_each_entry(brd, &brd_devices, brd_list) { + /* + * associate the queue with the disk just before adding the disk, + * to avoid messing up the failure path + */ + brd->brd_disk->queue = brd->brd_queue; add_disk(brd->brd_disk); + } blk_register_region(MKDEV(RAMDISK_MAJOR, 0), 1UL << MINORBITS, THIS_MODULE, brd_probe, NULL, NULL); -- 2.9.5
Re: [PATCH] block: call rq_qos_exit() after queue is frozen
On Wed, Oct 24, 2018 at 9:18 PM Ming Lei wrote: > > rq_qos_exit() removes the current q->rq_qos; this action has to be > done after the queue is frozen, otherwise the IO queue path may never > be woken up, and an IO hang is caused. > > So fix this issue by moving rq_qos_exit() after the queue is frozen. > > Cc: Josef Bacik > Signed-off-by: Ming Lei > --- > block/blk-core.c | 3 +++ > block/blk-sysfs.c | 2 -- > 2 files changed, 3 insertions(+), 2 deletions(-) > > diff --git a/block/blk-core.c b/block/blk-core.c > index 3ed60723e242..0a5a6cf5add7 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -785,6 +785,9 @@ void blk_cleanup_queue(struct request_queue *q) > * prevent that q->request_fn() gets invoked after draining finished. > */ > blk_freeze_queue(q); > + > + rq_qos_exit(q); > + > spin_lock_irq(lock); > queue_flag_set(QUEUE_FLAG_DEAD, q); > spin_unlock_irq(lock); > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c > index 3772671cf2bc..d4f1280965b6 100644 > --- a/block/blk-sysfs.c > +++ b/block/blk-sysfs.c > @@ -994,8 +994,6 @@ void blk_unregister_queue(struct gendisk *disk) > kobject_del(&q->kobj); > blk_trace_remove_sysfs(disk_to_dev(disk)); > > - rq_qos_exit(q); > - > mutex_lock(&q->sysfs_lock); > if (q->request_fn || (q->mq_ops && q->elevator)) > elv_unregister_queue(q); > -- > 2.9.5 > Ping... -- Ming Lei
[PATCH V2 3/3] block: make sure writesame bio is aligned with logical block size
Obviously the created writesame bio has to be aligned with logical block size, and use bio_allowed_max_sectors() to retrieve this number. Fixes: b49a0871be31a745b2ef ("block: remove split code in blkdev_issue_{discard,write_same}") Cc: Rui Salvaterra Cc: sta...@vger.kernel.org Cc: Mike Snitzer Cc: Christoph Hellwig Cc: Xiao Ni Cc: Mariusz Dabrowski Signed-off-by: Ming Lei --- block/blk-lib.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index d58d5d87dd88..e8b3bb9bf375 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -149,7 +149,7 @@ static int __blkdev_issue_write_same(struct block_device *bdev, sector_t sector, return -EOPNOTSUPP; /* Ensure that max_write_same_sectors doesn't overflow bi_size */ - max_write_same_sectors = UINT_MAX >> 9; + max_write_same_sectors = bio_allowed_max_sectors(q); while (nr_sects) { bio = blk_next_bio(bio, 1, gfp_mask); -- 2.9.5
[PATCH V2 2/3] block: cleanup __blkdev_issue_discard()
Cleanup __blkdev_issue_discard() a bit: - remove local variable of 'end_sect' - remove code block of 'fail' Cc: Rui Salvaterra Cc: Mike Snitzer Cc: Christoph Hellwig Cc: Xiao Ni Cc: Mariusz Dabrowski Signed-off-by: Ming Lei --- block/blk-lib.c | 23 ++- 1 file changed, 6 insertions(+), 17 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index d56fd159d2e8..d58d5d87dd88 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -51,15 +51,12 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, if ((sector | nr_sects) & bs_mask) return -EINVAL; - while (nr_sects) { - unsigned int req_sects = nr_sects; - sector_t end_sect; - - if (!req_sects) - goto fail; - req_sects = min(req_sects, bio_allowed_max_sectors(q)); + if (!nr_sects) + return -EINVAL; - end_sect = sector + req_sects; + while (nr_sects) { + unsigned int req_sects = min_t(unsigned int, nr_sects, + bio_allowed_max_sectors(q)); bio = blk_next_bio(bio, 0, gfp_mask); bio->bi_iter.bi_sector = sector; @@ -67,8 +64,8 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, bio_set_op_attrs(bio, op, 0); bio->bi_iter.bi_size = req_sects << 9; + sector += req_sects; nr_sects -= req_sects; - sector = end_sect; /* * We can loop for a long time in here, if someone does @@ -81,14 +78,6 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, *biop = bio; return 0; - -fail: - if (bio) { - submit_bio_wait(bio); - bio_put(bio); - } - *biop = NULL; - return -EOPNOTSUPP; } EXPORT_SYMBOL(__blkdev_issue_discard); -- 2.9.5
[PATCH V2 0/3] block: make sure discard/writesame bio is aligned with logical block size
Hi, The 1st & 3rd patch fixes bio size alignment issue. The 2nd patch cleans up __blkdev_issue_discard() a bit. V2: - introduce helper of bio_allowed_max_sectors() - add commit log for patch 2 Ming Lei (3): block: make sure discard bio is aligned with logical block size block: cleanup __blkdev_issue_discard() block: make sure writesame bio is aligned with logical block size block/blk-lib.c | 26 +++--- block/blk-merge.c | 3 ++- block/blk.h | 10 ++ 3 files changed, 19 insertions(+), 20 deletions(-) Cc: Rui Salvaterra Cc: sta...@vger.kernel.org Cc: Mike Snitzer Cc: Christoph Hellwig Cc: Xiao Ni Cc: Mariusz Dabrowski -- 2.9.5
[PATCH V2 1/3] block: make sure discard bio is aligned with logical block size
Obviously the created discard bio has to be aligned with logical block size. This patch introduces the helper of bio_allowed_max_sectors() for this purpose. Fixes: 744889b7cbb56a6 ("block: don't deal with discard limit in blkdev_issue_discard()") Fixes: a22c4d7e34402cc ("block: re-add discard_granularity and alignment checks") Reported-by: Rui Salvaterra Cc: Rui Salvaterra Cc: sta...@vger.kernel.org Cc: Mike Snitzer Cc: Christoph Hellwig Cc: Xiao Ni Cc: Mariusz Dabrowski Signed-off-by: Ming Lei --- block/blk-lib.c | 3 +-- block/blk-merge.c | 3 ++- block/blk.h | 10 ++ 3 files changed, 13 insertions(+), 3 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index 76f867ea9a9b..d56fd159d2e8 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -57,8 +57,7 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, if (!req_sects) goto fail; - if (req_sects > UINT_MAX >> 9) - req_sects = UINT_MAX >> 9; + req_sects = min(req_sects, bio_allowed_max_sectors(q)); end_sect = sector + req_sects; diff --git a/block/blk-merge.c b/block/blk-merge.c index 42a46744c11b..507fbaa6b4c0 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -90,7 +90,8 @@ static struct bio *blk_bio_discard_split(struct request_queue *q, /* Zero-sector (unknown) and one-sector granularities are the same. 
*/ granularity = max(q->limits.discard_granularity >> 9, 1U); - max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9); + max_discard_sectors = min(q->limits.max_discard_sectors, + bio_allowed_max_sectors(q)); max_discard_sectors -= max_discard_sectors % granularity; if (unlikely(!max_discard_sectors)) { diff --git a/block/blk.h b/block/blk.h index a1841b8ff129..2680aa4fcd88 100644 --- a/block/blk.h +++ b/block/blk.h @@ -396,6 +396,16 @@ static inline unsigned long blk_rq_deadline(struct request *rq) } /* + * The max size one bio can handle is UINT_MAX because bvec_iter.bi_size + * is defined as 'unsigned int'; meanwhile it has to be aligned with the + * logical block size, which is the minimum unit accepted by hardware. + */ +static inline unsigned int bio_allowed_max_sectors(struct request_queue *q) +{ + return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9; +} + +/* * Internal io_context interface */ void get_io_context(struct io_context *ioc); -- 2.9.5
Re: [PATCH V2 1/3] blk-mq: refactor the code of issue request directly
On Sat, Oct 27, 2018 at 12:01:09AM +0800, Jianchao Wang wrote: > Merge blk_mq_try_issue_directly and __blk_mq_try_issue_directly > into one interface which is able to handle the return value from > the .queue_rq callback. Since we can only issue directly w/o an io > scheduler, remove blk_mq_get_driver_tag. It isn't correct for dm-rq, see blk_insert_cloned_request(). Thanks, Ming
Re: [PATCH 1/3] block: make sure discard bio is aligned with logical block size
On Sun, Oct 28, 2018 at 04:49:47PM +0100, Christoph Hellwig wrote: > On Sun, Oct 28, 2018 at 08:51:31AM +0800, Ming Lei wrote: > > On Fri, Oct 26, 2018 at 09:44:15AM +0200, Christoph Hellwig wrote: > > > > if (req_sects > UINT_MAX >> 9) > > > > - req_sects = UINT_MAX >> 9; > > > > + req_sects = (UINT_MAX >> 9) & ~bs_mask; > > > > > > Given that we have this same thing duplicated in write zeroes > > > what about a documented helper? > > > > IMO, using UINT_MAX & bs_mask is better because it is self-explanatory > > in the context. > > I don't think it is in any way. I understand it because I know the > code, but there is nothing that documents why we do that. Then how about introducing this helper? /* + * The max sectors one bio can handle is 'UINT_MAX >> 9' because + * bvec_iter.bi_size is defined as 'unsigned int'; also it has to be + * aligned with the logical block size, which is the minimum unit accepted by hardware. + */ +static inline unsigned int blk_max_allowed_max_sectors(struct request_queue *q) +{ + return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9; +} + +/* Thanks, Ming
Re: [PATCH 1/3] block: make sure discard bio is aligned with logical block size
On Fri, Oct 26, 2018 at 09:44:15AM +0200, Christoph Hellwig wrote: > > if (req_sects > UINT_MAX >> 9) > > - req_sects = UINT_MAX >> 9; > > + req_sects = (UINT_MAX >> 9) & ~bs_mask; > > Given that we have this same thing duplicated in write zeroes > what about a documented helper? IMO, using UINT_MAX & bs_mask is better because it is self-explanatory in the context. If we introduce one helper, it may not be easy to find a better name than UINT_MAX. thanks, Ming
[PATCH 0/3] block: make sure discard/writesame bio is aligned with logical block size
Hi, The 1st & 3rd patch fixes bio size alignment issue. The 2nd patch cleans up __blkdev_issue_discard() a bit. Thanks, Ming Lei (3): block: make sure discard bio is aligned with logical block size block: cleanup __blkdev_issue_discard() block: make sure writesame bio is aligned with logical block size block/blk-lib.c | 25 ++--- 1 file changed, 6 insertions(+), 19 deletions(-) Cc: Rui Salvaterra Cc: sta...@vger.kernel.org Cc: Mike Snitzer Cc: Christoph Hellwig Cc: Xiao Ni Cc: Mariusz Dabrowski -- 2.9.5
[PATCH 3/3] block: make sure writesame bio is aligned with logical block size
Obviously the created writesame bio has to be aligned with the logical block size. Fixes: b49a0871be31a745b2ef ("block: remove split code in blkdev_issue_{discard,write_same}") Cc: Rui Salvaterra Cc: sta...@vger.kernel.org Cc: Mike Snitzer Cc: Christoph Hellwig Cc: Xiao Ni Cc: Mariusz Dabrowski Signed-off-by: Ming Lei --- block/blk-lib.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index 93011785f710..1750f0e480c0 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -149,7 +149,7 @@ static int __blkdev_issue_write_same(struct block_device *bdev, sector_t sector, return -EOPNOTSUPP; /* Ensure that max_write_same_sectors doesn't overflow bi_size */ - max_write_same_sectors = UINT_MAX >> 9; + max_write_same_sectors = (UINT_MAX >> 9) & ~bs_mask; while (nr_sects) { bio = next_bio(bio, 1, gfp_mask); -- 2.9.5
[PATCH 2/3] block: cleanup __blkdev_issue_discard()
Cleanup __blkdev_issue_discard(). Cc: Rui Salvaterra Cc: Mike Snitzer Cc: Christoph Hellwig Cc: Xiao Ni Cc: Mariusz Dabrowski Signed-off-by: Ming Lei --- block/blk-lib.c | 23 +-- 1 file changed, 5 insertions(+), 18 deletions(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index aa3944946b2f..93011785f710 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -52,16 +52,11 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, if ((sector | nr_sects) & bs_mask) return -EINVAL; - while (nr_sects) { - unsigned int req_sects = nr_sects; - sector_t end_sect; - - if (!req_sects) - goto fail; - if (req_sects > UINT_MAX >> 9) - req_sects = (UINT_MAX >> 9) & ~bs_mask; + if (!nr_sects) + return -EINVAL; - end_sect = sector + req_sects; + while (nr_sects) { + unsigned int req_sects = min(nr_sects, (UINT_MAX >> 9) & ~bs_mask); bio = next_bio(bio, 0, gfp_mask); bio->bi_iter.bi_sector = sector; @@ -69,8 +64,8 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, bio_set_op_attrs(bio, op, 0); bio->bi_iter.bi_size = req_sects << 9; + sector += req_sects; nr_sects -= req_sects; - sector = end_sect; /* * We can loop for a long time in here, if someone does @@ -83,14 +78,6 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, *biop = bio; return 0; - -fail: - if (bio) { - submit_bio_wait(bio); - bio_put(bio); - } - *biop = NULL; - return -EOPNOTSUPP; } EXPORT_SYMBOL(__blkdev_issue_discard); -- 2.9.5
[PATCH 1/3] block: make sure discard bio is aligned with logical block size
Obviously the created discard bio has to be aligned with the logical block size. Fixes: 744889b7cbb56a6 ("block: don't deal with discard limit in blkdev_issue_discard()") Reported-by: Rui Salvaterra Cc: Rui Salvaterra Cc: sta...@vger.kernel.org Cc: Mike Snitzer Cc: Christoph Hellwig Cc: Xiao Ni Cc: Mariusz Dabrowski Signed-off-by: Ming Lei --- block/blk-lib.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/blk-lib.c b/block/blk-lib.c index bbd44666f2b5..aa3944946b2f 100644 --- a/block/blk-lib.c +++ b/block/blk-lib.c @@ -59,7 +59,7 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector, if (!req_sects) goto fail; if (req_sects > UINT_MAX >> 9) - req_sects = UINT_MAX >> 9; + req_sects = (UINT_MAX >> 9) & ~bs_mask; end_sect = sector + req_sects; -- 2.9.5
[PATCH] block: call rq_qos_exit() after queue is frozen
rq_qos_exit() removes the current q->rq_qos; this has to be done after the queue is frozen, otherwise the IO queue path may never be woken up, causing an IO hang. Fix this issue by moving rq_qos_exit() to after the queue is frozen. Cc: Josef Bacik Signed-off-by: Ming Lei --- block/blk-core.c | 3 +++ block/blk-sysfs.c | 2 -- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 3ed60723e242..0a5a6cf5add7 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -785,6 +785,9 @@ void blk_cleanup_queue(struct request_queue *q) * prevent that q->request_fn() gets invoked after draining finished. */ blk_freeze_queue(q); + + rq_qos_exit(q); + spin_lock_irq(lock); queue_flag_set(QUEUE_FLAG_DEAD, q); spin_unlock_irq(lock); diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 3772671cf2bc..d4f1280965b6 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -994,8 +994,6 @@ void blk_unregister_queue(struct gendisk *disk) kobject_del(&q->kobj); blk_trace_remove_sysfs(disk_to_dev(disk)); - rq_qos_exit(q); - mutex_lock(&q->sysfs_lock); if (q->request_fn || (q->mq_ops && q->elevator)) elv_unregister_queue(q); -- 2.9.5
Re: [PATCH] block: setup bounce bio_sets properly
On Thu, Oct 18, 2018 at 03:24:47PM -0600, Jens Axboe wrote: > We're only setting up the bounce bio sets if we happen > to need bouncing for regular HIGHMEM, not if we only need > it for ISA devices. > > Reported-by: Ondrej Zary > Tested-by: Ondrej Zary > Signed-off-by: Jens Axboe > > diff --git a/block/bounce.c b/block/bounce.c > index b30071ac4ec6..1356a2f4aae2 100644 > --- a/block/bounce.c > +++ b/block/bounce.c > @@ -31,6 +31,24 @@ > static struct bio_set bounce_bio_set, bounce_bio_split; > static mempool_t page_pool, isa_page_pool; > > +static void init_bounce_bioset(void) > +{ > + static bool bounce_bs_setup; > + int ret; > + > + if (bounce_bs_setup) > + return; > + > + ret = bioset_init(&bounce_bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS); > + BUG_ON(ret); > + if (bioset_integrity_create(&bounce_bio_set, BIO_POOL_SIZE)) > + BUG_ON(1); > + > + ret = bioset_init(&bounce_bio_split, BIO_POOL_SIZE, 0, 0); > + BUG_ON(ret); > + bounce_bs_setup = true; By the time an initcall is run from do_basic_setup(), all CPUs and the scheduler are already up, so init_emergency_pool() may run while a driver is calling blk_queue_bounce_limit() to start init_emergency_isa_pool(). So the above code might be run twice. Thanks, Ming
Re: [PATCH 4/5] block: introduce helpers for allocating IO buffers from slab
On Thu, Oct 18, 2018 at 05:22:19PM +0200, Christoph Hellwig wrote: > On Thu, Oct 18, 2018 at 08:11:23AM -0700, Matthew Wilcox wrote: > > On Thu, Oct 18, 2018 at 04:42:07PM +0200, Christoph Hellwig wrote: > > > This all seems quite complicated. > > > > > > I think the interface we'd want is more one that has a little > > > cache of a single page in the queue, and a little bitmap which > > > sub-page size blocks of it are used. > > > > > > Something like (pseudo code minus locking): > > > > > > void *blk_alloc_sector_buffer(struct block_device *bdev, gfp_t gfp) > > > { > > > unsigned block_size = block_size(bdev); > > > > > > if (blocksize >= PAGE_SIZE) > > > return (void *)__get_free_pages(gfp, get_order(blocksize)); > > > > > > if (bdev->fragment_cache_page) { > > > [ fragment_cache_page using > > > e.g. bitmap and return if found] > > > } > > > > > > bdev->fragment_cache_page = (void *)__get_free_page(gfp); > > > goto find_again; > > > } > > > > This looks a lot like page_frag_alloc() except I think page_frag_alloc() > > may be more efficient. > > Oh, nice. Sounds like XFS should just use page_frag_alloc. I'll give > it a spin. XFS or any other fs can use page_frag_alloc() directly; it seems unnecessary to introduce this change in the block layer any more, given that a 512-aligned buffer should be fine everywhere. The only benefit of making it a block helper is that the offset or size can be checked against q->dma_alignment. Dave/Jens, which way do you think is better? Provide the allocation as a block helper, or have the fs use page_frag_alloc() directly for allocating 512*N-byte buffers (total size less than PAGE_SIZE)? Thanks, Ming
Re: [PATCH 1/5] block: warn on un-aligned DMA IO buffer
On Thu, Oct 18, 2018 at 07:52:59PM -0600, Jens Axboe wrote: > On 10/18/18 7:39 PM, Ming Lei wrote: > > On Thu, Oct 18, 2018 at 07:33:50PM -0600, Jens Axboe wrote: > >> On 10/18/18 7:28 PM, Ming Lei wrote: > >>> On Thu, Oct 18, 2018 at 08:27:28AM -0600, Jens Axboe wrote: > >>>> On 10/18/18 7:18 AM, Ming Lei wrote: > >>>>> Now we only check if DMA IO buffer is aligned to queue_dma_alignment() > >>>>> for pass-through request, and it isn't done for normal IO request. > >>>>> > >>>>> Given the check has to be done on each bvec, it isn't efficient to add > >>>>> the > >>>>> check in generic_make_request_checks(). > >>>>> > >>>>> This patch addes one WARN in blk_queue_split() for capturing this issue. > >>>> > >>>> I don't want to do this, because then we are forever doomed to > >>>> have something that fully loops a bio at submission time. I > >>>> absolutely hate the splitting we have and the need for it, > >>>> hopefully it can go away for a subset of IOs at some point. > >>>> > >>>> In many ways, this seems to be somewhat of a made-up problem, I don't > >>>> recall a single bug report for something like this over decades of > >>>> working with the IO stack. 512b alignment restrictions for DMA seems > >>>> absolutely insane. I know people claim they exist, but clearly that > >>>> isn't a hard requirement or we would have been boned years ago. 
> >>> > >>> There are still some drivers with this requirement: > >>> > >>> drivers/ata/libata-scsi.c:1308: blk_queue_update_dma_alignment(q, > >>> sdev->sector_size - 1); > >>> drivers/ata/pata_macio.c:812: > >>> blk_queue_update_dma_alignment(sdev->request_queue, 31); > >>> drivers/ata/pata_macio.c:827: > >>> blk_queue_update_dma_alignment(sdev->request_queue, 15); > >>> drivers/block/ps3disk.c:470:blk_queue_dma_alignment(queue, > >>> dev->blk_size-1); > >>> drivers/block/rsxx/dev.c:282: > >>> blk_queue_dma_alignment(card->queue, blk_size - 1); > >>> drivers/block/xen-blkfront.c:957: blk_queue_dma_alignment(rq, 511); > >>> drivers/ide/ide-cd.c:1512: blk_queue_dma_alignment(q, 31); > >>> drivers/message/fusion/mptscsih.c:2388: blk_queue_dma_alignment > >>> (sdev->request_queue, 512 - 1); > >>> drivers/staging/rts5208/rtsx.c:94: > >>> blk_queue_dma_alignment(sdev->request_queue, (512 - 1)); > >>> drivers/usb/image/microtek.c:329: > >>> blk_queue_dma_alignment(s->request_queue, (512 - 1)); > >>> drivers/usb/storage/scsiglue.c:92: > >>> blk_queue_update_dma_alignment(sdev->request_queue, (512 - 1)); > >>> drivers/usb/storage/uas.c:818: > >>> blk_queue_update_dma_alignment(sdev->request_queue, (512 - 1)); > >> > >> Of course, I too can grep :-) > >> > >> My point is that these settings might not match reality. And the > >> WARN_ON(), as implemented, is going to trigger on any device that > >> DOESN'T set the alignment, as Bart pointed out. > > > > It is just a WARN_ON_ONCE() which exactly shows something which need > > further attention, then related people may take a look and we can move > > on. > > > > So I think it is correct thing to do. 
> > It most certainly is NOT the right thing to do, when we know that: > > 1) We currently have drivers setting an alignment that we don't meet > 2) We have drivers not setting an alignment, and getting 512 by default The 512 default should have been removed, given it isn't respected at all in the normal io path, but it has been there since the beginning of 2.6.12. > 3) We have drivers setting an alignment that seems incorrect Then the WARN_ON() is helpful for both 1) and 2) after the default 512 limit is removed. Thanks, Ming
Re: [PATCH 1/5] block: warn on un-aligned DMA IO buffer
On Thu, Oct 18, 2018 at 07:33:50PM -0600, Jens Axboe wrote: > On 10/18/18 7:28 PM, Ming Lei wrote: > > On Thu, Oct 18, 2018 at 08:27:28AM -0600, Jens Axboe wrote: > >> On 10/18/18 7:18 AM, Ming Lei wrote: > >>> Now we only check if DMA IO buffer is aligned to queue_dma_alignment() > >>> for pass-through request, and it isn't done for normal IO request. > >>> > >>> Given the check has to be done on each bvec, it isn't efficient to add the > >>> check in generic_make_request_checks(). > >>> > >>> This patch addes one WARN in blk_queue_split() for capturing this issue. > >> > >> I don't want to do this, because then we are forever doomed to > >> have something that fully loops a bio at submission time. I > >> absolutely hate the splitting we have and the need for it, > >> hopefully it can go away for a subset of IOs at some point. > >> > >> In many ways, this seems to be somewhat of a made-up problem, I don't > >> recall a single bug report for something like this over decades of > >> working with the IO stack. 512b alignment restrictions for DMA seems > >> absolutely insane. I know people claim they exist, but clearly that > >> isn't a hard requirement or we would have been boned years ago. 
> > > > There are still some drivers with this requirement: > > > > drivers/ata/libata-scsi.c:1308: blk_queue_update_dma_alignment(q, > > sdev->sector_size - 1); > > drivers/ata/pata_macio.c:812: > > blk_queue_update_dma_alignment(sdev->request_queue, 31); > > drivers/ata/pata_macio.c:827: > > blk_queue_update_dma_alignment(sdev->request_queue, 15); > > drivers/block/ps3disk.c:470:blk_queue_dma_alignment(queue, > > dev->blk_size-1); > > drivers/block/rsxx/dev.c:282: > > blk_queue_dma_alignment(card->queue, blk_size - 1); > > drivers/block/xen-blkfront.c:957: blk_queue_dma_alignment(rq, 511); > > drivers/ide/ide-cd.c:1512: blk_queue_dma_alignment(q, 31); > > drivers/message/fusion/mptscsih.c:2388: blk_queue_dma_alignment > > (sdev->request_queue, 512 - 1); > > drivers/staging/rts5208/rtsx.c:94: > > blk_queue_dma_alignment(sdev->request_queue, (512 - 1)); > > drivers/usb/image/microtek.c:329: > > blk_queue_dma_alignment(s->request_queue, (512 - 1)); > > drivers/usb/storage/scsiglue.c:92: > > blk_queue_update_dma_alignment(sdev->request_queue, (512 - 1)); > > drivers/usb/storage/uas.c:818: > > blk_queue_update_dma_alignment(sdev->request_queue, (512 - 1)); > > Of course, I too can grep :-) > > My point is that these settings might not match reality. And the > WARN_ON(), as implemented, is going to trigger on any device that > DOESN'T set the alignment, as Bart pointed out. It is just a WARN_ON_ONCE() which exactly shows something which need further attention, then related people may take a look and we can move on. So I think it is correct thing to do. Thanks, Ming
Re: [PATCH 1/5] block: warn on un-aligned DMA IO buffer
On Thu, Oct 18, 2018 at 08:27:28AM -0600, Jens Axboe wrote: > On 10/18/18 7:18 AM, Ming Lei wrote: > > Now we only check if DMA IO buffer is aligned to queue_dma_alignment() > > for pass-through request, and it isn't done for normal IO request. > > > > Given the check has to be done on each bvec, it isn't efficient to add the > > check in generic_make_request_checks(). > > > > This patch addes one WARN in blk_queue_split() for capturing this issue. > > I don't want to do this, because then we are forever doomed to > have something that fully loops a bio at submission time. I > absolutely hate the splitting we have and the need for it, > hopefully it can go away for a subset of IOs at some point. > > In many ways, this seems to be somewhat of a made-up problem, I don't > recall a single bug report for something like this over decades of > working with the IO stack. 512b alignment restrictions for DMA seems > absolutely insane. I know people claim they exist, but clearly that > isn't a hard requirement or we would have been boned years ago. 
There are still some drivers with this requirement: drivers/ata/libata-scsi.c:1308: blk_queue_update_dma_alignment(q, sdev->sector_size - 1); drivers/ata/pata_macio.c:812: blk_queue_update_dma_alignment(sdev->request_queue, 31); drivers/ata/pata_macio.c:827: blk_queue_update_dma_alignment(sdev->request_queue, 15); drivers/block/ps3disk.c:470:blk_queue_dma_alignment(queue, dev->blk_size-1); drivers/block/rsxx/dev.c:282: blk_queue_dma_alignment(card->queue, blk_size - 1); drivers/block/xen-blkfront.c:957: blk_queue_dma_alignment(rq, 511); drivers/ide/ide-cd.c:1512: blk_queue_dma_alignment(q, 31); drivers/message/fusion/mptscsih.c:2388: blk_queue_dma_alignment (sdev->request_queue, 512 - 1); drivers/staging/rts5208/rtsx.c:94: blk_queue_dma_alignment(sdev->request_queue, (512 - 1)); drivers/usb/image/microtek.c:329: blk_queue_dma_alignment(s->request_queue, (512 - 1)); drivers/usb/storage/scsiglue.c:92: blk_queue_update_dma_alignment(sdev->request_queue, (512 - 1)); drivers/usb/storage/uas.c:818: blk_queue_update_dma_alignment(sdev->request_queue, (512 - 1)); Thanks, Ming
[PATCH 3/5] block: make dma_alignment as stacked limit
This patch converts .dma_alignment into a stacked limit, so the stacking driver can pick up the underlying queue's dma alignment and allocate IO buffers as queue DMA aligned. Cc: Vitaly Kuznetsov Cc: Dave Chinner Cc: Linux FS Devel Cc: Darrick J. Wong Cc: x...@vger.kernel.org Cc: Dave Chinner Cc: Christoph Hellwig Cc: Bart Van Assche Cc: Matthew Wilcox Signed-off-by: Ming Lei --- block/blk-settings.c | 89 +--- 1 file changed, 50 insertions(+), 39 deletions(-) diff --git a/block/blk-settings.c b/block/blk-settings.c index cf9cd241dc16..aef4510a99b6 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -525,6 +525,54 @@ void blk_queue_stack_limits(struct request_queue *t, struct request_queue *b) EXPORT_SYMBOL(blk_queue_stack_limits); /** + * blk_queue_dma_alignment - set dma length and memory alignment + * @q: the request queue for the device + * @mask: alignment mask + * + * description: + *set required memory and length alignment for direct dma transactions. + *this is used when building direct io requests for the queue. + * + **/ +void blk_queue_dma_alignment(struct request_queue *q, int mask) +{ + q->limits.dma_alignment = mask; +} +EXPORT_SYMBOL(blk_queue_dma_alignment); + +static int __blk_queue_update_dma_alignment(struct queue_limits *t, int mask) +{ + BUG_ON(mask >= PAGE_SIZE); + + if (mask > t->dma_alignment) + return mask; + else + return t->dma_alignment; +} + +/** + * blk_queue_update_dma_alignment - update dma length and memory alignment + * @q: the request queue for the device + * @mask: alignment mask + * + * description: + *update required memory and length alignment for direct dma transactions. + *If the requested alignment is larger than the current alignment, then + *the current queue alignment is updated to the new value, otherwise it + *is left alone. The design of this is to allow multiple objects + *(driver, device, transport etc) to set their respective + *alignments without having them interfere. 
+ * + **/ +void blk_queue_update_dma_alignment(struct request_queue *q, int mask) +{ + q->limits.dma_alignment = + __blk_queue_update_dma_alignment(&q->limits, mask); +} +EXPORT_SYMBOL(blk_queue_update_dma_alignment); + + +/** + * blk_stack_limits - adjust queue_limits for stacked devices + * @t: the stacking driver limits (top device) + * @b: the underlying queue limits (bottom, component device) @@ -563,6 +611,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b, b->seg_boundary_mask); t->virt_boundary_mask = min_not_zero(t->virt_boundary_mask, b->virt_boundary_mask); + t->dma_alignment = __blk_queue_update_dma_alignment(t, + b->dma_alignment); t->max_segments = min_not_zero(t->max_segments, b->max_segments); t->max_discard_segments = min_not_zero(t->max_discard_segments, @@ -818,45 +868,6 @@ void blk_queue_virt_boundary(struct request_queue *q, unsigned long mask) } EXPORT_SYMBOL(blk_queue_virt_boundary); -/** - * blk_queue_dma_alignment - set dma length and memory alignment - * @q: the request queue for the device - * @mask: alignment mask - * - * description: - *set required memory and length alignment for direct dma transactions. - *this is used when building direct io requests for the queue. - * - **/ -void blk_queue_dma_alignment(struct request_queue *q, int mask) -{ - q->limits.dma_alignment = mask; -} -EXPORT_SYMBOL(blk_queue_dma_alignment); - -/** - * blk_queue_update_dma_alignment - update dma length and memory alignment - * @q: the request queue for the device - * @mask: alignment mask - * - * description: - *update required memory and length alignment for direct dma transactions. - *If the requested alignment is larger than the current alignment, then - *the current queue alignment is updated to the new value, otherwise it - *is left alone. The design of this is to allow multiple objects - *(driver, device, transport etc) to set their respective - *alignments without having them interfere. 
- * - **/ -void blk_queue_update_dma_alignment(struct request_queue *q, int mask) -{ - BUG_ON(mask > PAGE_SIZE); - - if (mask > q->limits.dma_alignment) - q->limits.dma_alignment = mask; -} -EXPORT_SYMBOL(blk_queue_update_dma_alignment); - void blk_queue_flush_queueable(struct request_queue *q, bool queueable) { if (queueable) -- 2.9.5
[PATCH 2/5] block: move .dma_alignment into q->limits
Turns out q->dma_alignment should be a stacked limit, because the bvec table is now immutable: the underlying queue's dma alignment has to be visible to the stacking driver, so that IO buffers can be allocated dma aligned before being added to a bio. So this patch moves .dma_alignment into q->limits and prepares for making it a stacked limit. Cc: Vitaly Kuznetsov Cc: Dave Chinner Cc: Linux FS Devel Cc: Darrick J. Wong Cc: x...@vger.kernel.org Cc: Dave Chinner Cc: Christoph Hellwig Cc: Bart Van Assche Cc: Matthew Wilcox Signed-off-by: Ming Lei --- block/blk-settings.c | 6 +++--- include/linux/blkdev.h | 4 ++-- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/block/blk-settings.c b/block/blk-settings.c index ffd459969689..cf9cd241dc16 100644 --- a/block/blk-settings.c +++ b/block/blk-settings.c @@ -830,7 +830,7 @@ EXPORT_SYMBOL(blk_queue_virt_boundary); **/ void blk_queue_dma_alignment(struct request_queue *q, int mask) { - q->dma_alignment = mask; + q->limits.dma_alignment = mask; } EXPORT_SYMBOL(blk_queue_dma_alignment); @@ -852,8 +852,8 @@ void blk_queue_update_dma_alignment(struct request_queue *q, int mask) { BUG_ON(mask > PAGE_SIZE); - if (mask > q->dma_alignment) - q->dma_alignment = mask; + if (mask > q->limits.dma_alignment) + q->limits.dma_alignment = mask; } EXPORT_SYMBOL(blk_queue_update_dma_alignment); diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 61207560e826..be938a31bc2e 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -366,6 +366,7 @@ struct queue_limits { unsigned long seg_boundary_mask; unsigned long virt_boundary_mask; + unsigned int dma_alignment; unsigned int max_hw_sectors; unsigned int max_dev_sectors; unsigned int chunk_sectors; @@ -561,7 +562,6 @@ struct request_queue { unsigned int dma_drain_size; void *dma_drain_buffer; unsigned int dma_pad_mask; - unsigned int dma_alignment; struct blk_queue_tag *queue_tags; @@ -1617,7 +1617,7 @@ static inline unsigned int bdev_zone_sectors(struct block_device *bdev) static 
inline int queue_dma_alignment(struct request_queue *q) { - return q ? q->dma_alignment : 511; + return q ? q->limits.dma_alignment : 511; } static inline int blk_rq_aligned(struct request_queue *q, unsigned long addr, -- 2.9.5
[PATCH 0/5] block: introduce helpers for allocating io buffer from slab
Hi, Filesystems may allocate an io buffer from slab and use this buffer to submit a bio. This can break storage drivers that have special requirements on DMA alignment. The 1st patch adds one warning if the io buffer isn't aligned to the DMA alignment. The 2nd & 3rd patches make DMA alignment a stacked limit. The 4th patch introduces helpers for allocating io buffers from slab, and DMA alignment is respected in this allocation. The 5th patch converts xfs to use the introduced helpers for allocating io buffers from slab. See previous discussion on this topic: https://marc.info/?t=15373485754=1=2 Thanks, Ming Ming Lei (5): block: warn on un-aligned DMA IO buffer block: move .dma_alignment into q->limits block: make dma_alignment as stacked limit block: introduce helpers for allocating IO buffers from slab xfs: use block layer helpers to allocate io buffer from slab block/Makefile | 3 +- block/blk-core.c | 2 + block/blk-merge.c | 2 + block/blk-sec-buf.c | 144 block/blk-settings.c | 89 +++ fs/xfs/xfs_buf.c | 28 - fs/xfs/xfs_super.c | 13 +++- include/linux/blk-sec-buf.h | 43 + include/linux/blkdev.h | 9 ++- 9 files changed, 287 insertions(+), 46 deletions(-) create mode 100644 block/blk-sec-buf.c create mode 100644 include/linux/blk-sec-buf.h Cc: Vitaly Kuznetsov Cc: Dave Chinner Cc: Linux FS Devel Cc: Darrick J. Wong Cc: x...@vger.kernel.org Cc: Dave Chinner Cc: Christoph Hellwig Cc: Bart Van Assche Cc: Matthew Wilcox -- 2.9.5