Re: [PATCH v3] blk-mq: punt failed direct issue to dispatch list

2018-12-07 Thread Ming Lei
On Thu, Dec 06, 2018 at 10:17:44PM -0700, Jens Axboe wrote:
> After the direct dispatch corruption fix, we permanently disallow direct
> dispatch of non read/write requests. This works fine off the normal IO
> path, as they will be retried like any other failed direct dispatch
> request. But for the blk_insert_cloned_request() that only DM uses to
> bypass the bottom level scheduler, we always first attempt direct
> dispatch. For some types of requests, that's now a permanent failure,
> and no amount of retrying will make that succeed. This results in a
> livelock.
> 
> Instead of making special cases for what we can direct issue, and now
> having to deal with DM solving the livelock while still retaining a BUSY
> condition feedback loop, always just add a request that has been through
> ->queue_rq() to the hardware queue dispatch list. These are safe to use
> as no merging can take place there. Additionally, if requests do have
> prepped data from drivers, we aren't dependent on them not sharing space
> in the request structure to safely add them to the IO scheduler lists.
> 
> This basically reverts ffe81d45322c and is based on a patch from Ming,
> but with the list insert case covered as well.
> 
> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
> Cc: sta...@vger.kernel.org
> Suggested-by: Ming Lei 
> Reported-by: Bart Van Assche 
> Signed-off-by: Jens Axboe 
> 
> ---
> 
> I've thrown the initial hang test reported by Bart at it, works fine.
> My reproducer for the corruption case is also happy, as expected.
> 
> I'm running blktests and xfstests on it overnight. If that passes as
> expected, this quells my initial worries about using ->dispatch as a
> holding place for these types of requests.
> 
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3262d83b9e07..6a7566244de3 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1715,15 +1715,6 @@ static blk_status_t __blk_mq_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   break;
>   case BLK_STS_RESOURCE:
>   case BLK_STS_DEV_RESOURCE:
> - /*
> -  * If direct dispatch fails, we cannot allow any merging on
> -  * this IO. Drivers (like SCSI) may have set up permanent state
> -  * for this request, like SG tables and mappings, and if we
> -  * merge to it later on then we'll still only do IO to the
> -  * original part.
> -  */
> - rq->cmd_flags |= REQ_NOMERGE;
> -
>   blk_mq_update_dispatch_busy(hctx, true);
>   __blk_mq_requeue_request(rq);
>   break;
> @@ -1736,18 +1727,6 @@ static blk_status_t __blk_mq_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   return ret;
>  }
>  
> -/*
> - * Don't allow direct dispatch of anything but regular reads/writes,
> - * as some of the other commands can potentially share request space
> - * with data we need for the IO scheduler. If we attempt a direct dispatch
> - * on those and fail, we can't safely add it to the scheduler afterwards
> - * without potentially overwriting data that the driver has already written.
> - */
> -static bool blk_rq_can_direct_dispatch(struct request *rq)
> -{
> - return req_op(rq) == REQ_OP_READ || req_op(rq) == REQ_OP_WRITE;
> -}
> -
>  static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>   struct request *rq,
>   blk_qc_t *cookie,
> @@ -1769,7 +1748,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   goto insert;
>   }
>  
> - if (!blk_rq_can_direct_dispatch(rq) || (q->elevator && !bypass_insert))
> + if (q->elevator && !bypass_insert)
>   goto insert;
>  
>   if (!blk_mq_get_dispatch_budget(hctx))
> @@ -1785,7 +1764,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   if (bypass_insert)
>   return BLK_STS_RESOURCE;
>  
> - blk_mq_sched_insert_request(rq, false, run_queue, false);
> + blk_mq_request_bypass_insert(rq, run_queue);
>   return BLK_STS_OK;
>  }
>  
> @@ -1801,7 +1780,7 @@ static void blk_mq_try_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>  
>   ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
>   if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
> - blk_mq_sched_insert_request(rq, false, true, false);
> + blk_mq_request_bypass_insert(rq, true);
>   else if (ret != BLK_STS_OK)
>   blk_mq_end_request(rq, ret);
>  
> 
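For context, punting to the hardware queue dispatch list amounts to roughly the following (an illustrative sketch of what blk_mq_request_bypass_insert() does, not the exact in-tree body; the hctx lookup is elided). Requests sitting on hctx->dispatch are never merged, which is why parking an already-prepped request there is safe:

static void bypass_insert_sketch(struct blk_mq_hw_ctx *hctx, struct request *rq,
				 bool run_queue)
{
	spin_lock(&hctx->lock);
	list_add_tail(&rq->queuelist, &hctx->dispatch);	/* no merging ever happens here */
	spin_unlock(&hctx->lock);

	if (run_queue)
		blk_mq_run_hw_queue(hctx, false);
}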

Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-07 Thread Ming Lei
On Fri, Dec 07, 2018 at 11:44:39AM +0800, Ming Lei wrote:
> On Thu, Dec 06, 2018 at 09:46:42PM -0500, Theodore Y. Ts'o wrote:
> > On Wed, Dec 05, 2018 at 11:03:01AM +0800, Ming Lei wrote:
> > > 
> > > But at that time, there isn't io scheduler for MQ, so in theory the
> > > issue should be there since v4.11, especially 945ffb60c11d ("mq-deadline:
> > > add blk-mq adaptation of the deadline IO scheduler").
> > 
> > Hi Ming,
> > 
> > How serious were you about this (theoretically) being an issue since
> > 4.11?  Can you talk about how it might get triggered, and
> > how we can test for it?  The reason why I ask is because we're trying
> > to track down a mysterious file system corruption problem on a 4.14.x
> > stable kernel.  The symptoms are *very* eerily similar to kernel
> > bugzilla #201685.
> 
> Hi Theodore,
> 
> It is just a theoretical analysis.
> 
> blk_mq_try_issue_directly() is called in two branches of 
> blk_mq_make_request(),
> both are on real MQ disks.
> 
> IO merge can be done with the 'none' or a real IO scheduler, so in theory the
> risk has existed since v4.11, but IO merge on the sw queue didn't work for
> quite a while, until it was fixed by ab42f35d9cb5ac49b5a2.
> 
> As Jens mentioned in bugzilla, there are several conditions required
> for triggering the issue:
> 
> - MQ device
> 
> - queue busy can be triggered. It is hard to trigger in NVMe PCI,
>   but may be possible on NVMe FC. However, it can be quite easy to
>   trigger on SCSI devices. We know there are some MQ SCSI HBAs,
>   such as qlogic FC and megaraid_sas.
> 
> - IO merge is enabled. 
> 
> I have setup scsi_debug in the following way:
> 
> modprobe scsi_debug dev_size_mb=4096 clustering=1 \
>   max_luns=1 submit_queues=2 max_queue=2
> 
> - submit_queues=2 may set this disk as MQ
> - max_queue=4 may trigger the queue busy condition easily
> 
> and ran some write IO on ext4 over the disk (fio, kernel building, ...) for
> some time, but still couldn't trigger the data corruption even once.
> 
> I should have created more LUNs, so that the queue becomes busy more
> easily; will do that soon.

Actually I should have used SDEBUG_OPT_HOST_BUSY to simulate the queue busy.

Thanks,
Ming


Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-06 Thread Ming Lei
On Thu, Dec 06, 2018 at 09:46:42PM -0500, Theodore Y. Ts'o wrote:
> On Wed, Dec 05, 2018 at 11:03:01AM +0800, Ming Lei wrote:
> > 
> > But at that time, there isn't io scheduler for MQ, so in theory the
> > issue should be there since v4.11, especially 945ffb60c11d ("mq-deadline:
> > add blk-mq adaptation of the deadline IO scheduler").
> 
> Hi Ming,
> 
> How serious were you about this (theoretically) being an issue since
> 4.11?  Can you talk about how it might get triggered, and
> how we can test for it?  The reason why I ask is because we're trying
> to track down a mysterious file system corruption problem on a 4.14.x
> stable kernel.  The symptoms are *very* eerily similar to kernel
> bugzilla #201685.

Hi Theodore,

It is just a theoretical analysis.

blk_mq_try_issue_directly() is called in two branches of blk_mq_make_request(),
both are on real MQ disks.

IO merge can be done with the 'none' or a real IO scheduler, so in theory the
risk has existed since v4.11, but IO merge on the sw queue didn't work for
quite a while, until it was fixed by ab42f35d9cb5ac49b5a2.

As Jens mentioned in bugzilla, there are several conditions required
for triggering the issue:

- MQ device

- queue busy can be triggered. It is hard to trigger in NVMe PCI,
  but may be possible on NVMe FC. However, it can be quite easy to
  trigger on SCSI devices. We know there are some MQ SCSI HBAs,
  such as qlogic FC and megaraid_sas.

- IO merge is enabled. 

I have setup scsi_debug in the following way:

modprobe scsi_debug dev_size_mb=4096 clustering=1 \
max_luns=1 submit_queues=2 max_queue=2

- submit_queues=2 may set this disk as MQ
- max_queue=4 may trigger the queue busy condition easily

and ran some write IO on ext4 over the disk (fio, kernel building, ...) for
some time, but still couldn't trigger the data corruption even once.

I should have created more LUNs, so that the queue becomes busy more
easily; will do that soon.

> 
> The problem is that it is super-rare --- roughly once a week
> out of a population of about 2500 systems.  The workload is NFS
> serving.  Unfortunately, the problem is that since 4.14.63, we can no
> longer disable blk-mq for the virtio-scsi driver, thanks to the commit
> b5b6e8c8d3b4 ("scsi: virtio_scsi: fix IO hang caused by automatic irq
> vector affinity") getting backported into 4.14.63 as commit
> 70b522f163bbb32.

virtio_scsi supports multi-queue mode; if that is enabled in your
setup, you may try single queue and see if it makes a difference.

If multi-queue mode isn't enabled, your problem should be different from
this one. I remember multi-queue mode isn't enabled on virtio-scsi in GCE.

> We're considering reverting this patch in our 4.14 LTS kernel, and
> seeing whether it makes the problem go away.  Is there any thing else
> you might suggest?

The IO hang is only triggered on machines with a special CPU topology, so it
should be fine to revert b5b6e8c8d3b4 on a normal VM.

No other suggestions yet.

Thanks,
Ming


[PATCH V2] blk-mq: re-build queue map in case of kdump kernel

2018-12-06 Thread Ming Lei
Now almost all .map_queues() implementations based on managed irq
affinity don't update the queue mapping; they just retrieve the
previously built mapping, so if nr_hw_queues is changed, the mapping
table includes stale entries. Only blk_mq_map_queues() may rebuild
the mapping table.

One such case is that we limit .nr_hw_queues to 1 for a kdump kernel.
However, drivers often build the queue mapping before allocating the
tagset via pci_alloc_irq_vectors_affinity(), while set->nr_hw_queues is
set to 1 for a kdump kernel, so a wrong queue mapping is used and a
kernel panic[1] is observed during boot.

This patch fixes the kernel panic triggered on nvme by rebuilding the
mapping table via blk_mq_map_queues().

[1] kernel panic log
[4.438371] nvme nvme0: 16/0/0 default/read/poll queues
[4.443277] BUG: unable to handle kernel NULL pointer dereference at 
0098
[4.444681] PGD 0 P4D 0
[4.445367] Oops:  [#1] SMP NOPTI
[4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 
4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459
[4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.10.2-2.fc27 04/01/2014
[4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222
[4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 
04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b b8 
98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6
[4.453132] RSP: 0018:c900023b3cd8 EFLAGS: 00010286
[4.454061] RAX:  RBX: 888174448000 RCX: 0001
[4.456480] RDX: 0001 RSI: e8feffc506c0 RDI: 0001
[4.458750] RBP: 88810722d008 R08: 88817647a880 R09: 0002
[4.464580] R10: c900023b3c10 R11: 0004 R12: 888174448538
[4.467803] R13: 0004 R14: 0001 R15: 0001
[4.469220] FS:  () GS:88817bac() 
knlGS:
[4.471554] CS:  0010 DS:  ES:  CR0: 80050033
[4.472464] CR2: 0098 CR3: 000174e4e001 CR4: 00760ee0
[4.474264] DR0:  DR1:  DR2: 
[4.476007] DR3:  DR6: fffe0ff0 DR7: 0400
[4.477061] PKRU: 5554
[4.477464] Call Trace:
[4.478731]  blk_mq_init_allocated_queue+0x36a/0x3ad
[4.479595]  blk_mq_init_queue+0x32/0x4e
[4.480178]  nvme_validate_ns+0x98/0x623 [nvme_core]
[4.480963]  ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core]
[4.481685]  ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core]
[4.482601]  nvme_scan_work+0x23a/0x29b [nvme_core]
[4.483269]  ? _raw_spin_unlock_irqrestore+0x25/0x38
[4.483930]  ? try_to_wake_up+0x38d/0x3b3
[4.484478]  ? process_one_work+0x179/0x2fc
[4.485118]  process_one_work+0x1d3/0x2fc
[4.485655]  ? rescuer_thread+0x2ae/0x2ae
[4.486196]  worker_thread+0x1e9/0x2be
[4.486841]  kthread+0x115/0x11d
[4.487294]  ? kthread_park+0x76/0x76
[4.487784]  ret_from_fork+0x3a/0x50
[4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi 
ip_tables
[4.489428] Dumping ftrace buffer:
[4.489939](ftrace buffer empty)
[4.490492] CR2: 0098
[4.491052] ---[ end trace 03cd268ad5a86ff7 ]---

Cc: Christoph Hellwig 
Cc: linux-n...@lists.infradead.org
Cc: David Milburn 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 900550594651..65770da99159 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2960,7 +2960,7 @@ static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set 
*set)
 
 static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
 {
-   if (set->ops->map_queues) {
+   if (set->ops->map_queues && !is_kdump_kernel()) {
int i;
 
/*
@@ -3030,6 +3030,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 */
if (is_kdump_kernel()) {
set->nr_hw_queues = 1;
+   set->nr_maps = 1;
set->queue_depth = min(64U, set->queue_depth);
}
/*
@@ -3051,7 +3052,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
  GFP_KERNEL, set->numa_node);
if (!set->map[i].mq_map)
goto out_free_mq_map;
-   set->map[i].nr_queues = set->nr_hw_queues;
+   set->map[i].nr_queues = is_kdump_kernel() ? 1 : 
set->nr_hw_queues;
}
 
ret = blk_mq_update_queue_map(set);
-- 
2.9.5



Re: [PATCH] blk-mq: re-build queue map in case of kdump kernel

2018-12-06 Thread Ming Lei
On Thu, Dec 06, 2018 at 07:57:34PM -0700, Jens Axboe wrote:
> On 12/6/18 7:55 PM, Ming Lei wrote:
> > Now almost all .map_queues() implementations based on managed irq
> > affinity don't update the queue mapping; they just retrieve the
> > previously built mapping, so if nr_hw_queues is changed, the mapping
> > table includes stale entries. Only blk_mq_map_queues() may rebuild
> > the mapping table.
> > 
> > One such case is that we limit .nr_hw_queues to 1 for a kdump kernel.
> > However, drivers often build the queue mapping before allocating the
> > tagset via pci_alloc_irq_vectors_affinity(), while set->nr_hw_queues is
> > set to 1 for a kdump kernel, so a wrong queue mapping is used and a
> > kernel panic[1] is observed during boot.
> > 
> > This patch fixes the kernel panic triggered on nvme by rebuilding the
> > mapping table via blk_mq_map_queues().
> > 
> > [1] kernel panic log
> > [4.438371] nvme nvme0: 16/0/0 default/read/poll queues
> > [4.443277] BUG: unable to handle kernel NULL pointer dereference at 
> > 0098
> > [4.444681] PGD 0 P4D 0
> > [4.445367] Oops:  [#1] SMP NOPTI
> > [4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 
> > 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459
> > [4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> > 1.10.2-2.fc27 04/01/2014
> > [4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core]
> > [4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222
> > [4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 
> > e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 
> > <48> 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6
> > [4.453132] RSP: 0018:c900023b3cd8 EFLAGS: 00010286
> > [4.454061] RAX:  RBX: 888174448000 RCX: 
> > 0001
> > [4.456480] RDX: 0001 RSI: e8feffc506c0 RDI: 
> > 0001
> > [4.458750] RBP: 88810722d008 R08: 88817647a880 R09: 
> > 0002
> > [4.464580] R10: c900023b3c10 R11: 0004 R12: 
> > 888174448538
> > [4.467803] R13: 0004 R14: 0001 R15: 
> > 0001
> > [4.469220] FS:  () GS:88817bac() 
> > knlGS:
> > [4.471554] CS:  0010 DS:  ES:  CR0: 80050033
> > [4.472464] CR2: 0098 CR3: 000174e4e001 CR4: 
> > 00760ee0
> > [4.474264] DR0:  DR1:  DR2: 
> > 
> > [4.476007] DR3:  DR6: fffe0ff0 DR7: 
> > 0400
> > [4.477061] PKRU: 5554
> > [4.477464] Call Trace:
> > [4.478731]  blk_mq_init_allocated_queue+0x36a/0x3ad
> > [4.479595]  blk_mq_init_queue+0x32/0x4e
> > [4.480178]  nvme_validate_ns+0x98/0x623 [nvme_core]
> > [4.480963]  ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core]
> > [4.481685]  ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core]
> > [4.482601]  nvme_scan_work+0x23a/0x29b [nvme_core]
> > [4.483269]  ? _raw_spin_unlock_irqrestore+0x25/0x38
> > [4.483930]  ? try_to_wake_up+0x38d/0x3b3
> > [4.484478]  ? process_one_work+0x179/0x2fc
> > [4.485118]  process_one_work+0x1d3/0x2fc
> > [4.485655]  ? rescuer_thread+0x2ae/0x2ae
> > [4.486196]  worker_thread+0x1e9/0x2be
> > [4.486841]  kthread+0x115/0x11d
> > [4.487294]  ? kthread_park+0x76/0x76
> > [4.487784]  ret_from_fork+0x3a/0x50
> > [4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi 
> > ip_tables
> > [4.489428] Dumping ftrace buffer:
> > [4.489939](ftrace buffer empty)
> > [4.490492] CR2: 0098
> > [4.491052] ---[ end trace 03cd268ad5a86ff7 ]---
> > 
> > Cc: Christoph Hellwig 
> > Cc: linux-n...@lists.infradead.org
> > Cc: David Milburn 
> > Signed-off-by: Ming Lei 
> > ---
> >  block/blk-mq.c | 12 +---
> >  1 file changed, 9 insertions(+), 3 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 900550594651..a3e463a726a6 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -38,6 +38,11 @@
> >  #include "blk-mq-sched.h"
> >  #include "blk-rq-qos.h"
> >  
> > +static inline bool blk_mq_kdump_kernel(void)
> > +{
> > +   return !!is_kdump_kernel();
> > +}
> 
> Let's drop the redundant !! here, and the wrapper? I would imagine the
> wrapper is handy for testing outside of kdump, but I don't think we
> should include it in the final.

OK.


Thanks,
Ming


[PATCH] blk-mq: re-build queue map in case of kdump kernel

2018-12-06 Thread Ming Lei
Now almost all .map_queues() implementations based on managed irq
affinity don't update the queue mapping; they just retrieve the
previously built mapping, so if nr_hw_queues is changed, the mapping
table includes stale entries. Only blk_mq_map_queues() may rebuild
the mapping table.

One such case is that we limit .nr_hw_queues to 1 for a kdump kernel.
However, drivers often build the queue mapping before allocating the
tagset via pci_alloc_irq_vectors_affinity(), while set->nr_hw_queues is
set to 1 for a kdump kernel, so a wrong queue mapping is used and a
kernel panic[1] is observed during boot.

This patch fixes the kernel panic triggered on nvme by rebuilding the
mapping table via blk_mq_map_queues().

[1] kernel panic log
[4.438371] nvme nvme0: 16/0/0 default/read/poll queues
[4.443277] BUG: unable to handle kernel NULL pointer dereference at 
0098
[4.444681] PGD 0 P4D 0
[4.445367] Oops:  [#1] SMP NOPTI
[4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 
4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459
[4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.10.2-2.fc27 04/01/2014
[4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222
[4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 
04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b b8 
98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6
[4.453132] RSP: 0018:c900023b3cd8 EFLAGS: 00010286
[4.454061] RAX:  RBX: 888174448000 RCX: 0001
[4.456480] RDX: 0001 RSI: e8feffc506c0 RDI: 0001
[4.458750] RBP: 88810722d008 R08: 88817647a880 R09: 0002
[4.464580] R10: c900023b3c10 R11: 0004 R12: 888174448538
[4.467803] R13: 0004 R14: 0001 R15: 0001
[4.469220] FS:  () GS:88817bac() 
knlGS:
[4.471554] CS:  0010 DS:  ES:  CR0: 80050033
[4.472464] CR2: 0098 CR3: 000174e4e001 CR4: 00760ee0
[4.474264] DR0:  DR1:  DR2: 
[4.476007] DR3:  DR6: fffe0ff0 DR7: 0400
[4.477061] PKRU: 5554
[4.477464] Call Trace:
[4.478731]  blk_mq_init_allocated_queue+0x36a/0x3ad
[4.479595]  blk_mq_init_queue+0x32/0x4e
[4.480178]  nvme_validate_ns+0x98/0x623 [nvme_core]
[4.480963]  ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core]
[4.481685]  ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core]
[4.482601]  nvme_scan_work+0x23a/0x29b [nvme_core]
[4.483269]  ? _raw_spin_unlock_irqrestore+0x25/0x38
[4.483930]  ? try_to_wake_up+0x38d/0x3b3
[4.484478]  ? process_one_work+0x179/0x2fc
[4.485118]  process_one_work+0x1d3/0x2fc
[4.485655]  ? rescuer_thread+0x2ae/0x2ae
[4.486196]  worker_thread+0x1e9/0x2be
[4.486841]  kthread+0x115/0x11d
[4.487294]  ? kthread_park+0x76/0x76
[4.487784]  ret_from_fork+0x3a/0x50
[4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi 
ip_tables
[4.489428] Dumping ftrace buffer:
[4.489939](ftrace buffer empty)
[4.490492] CR2: 0098
[4.491052] ---[ end trace 03cd268ad5a86ff7 ]---

Cc: Christoph Hellwig 
Cc: linux-n...@lists.infradead.org
Cc: David Milburn 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 900550594651..a3e463a726a6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -38,6 +38,11 @@
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
 
+static inline bool blk_mq_kdump_kernel(void)
+{
+   return !!is_kdump_kernel();
+}
+
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
 
@@ -2960,7 +2965,7 @@ static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set 
*set)
 
 static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
 {
-   if (set->ops->map_queues) {
+   if (set->ops->map_queues && !blk_mq_kdump_kernel()) {
int i;
 
/*
@@ -3028,8 +3033,9 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 * memory constrained environment. Limit us to 1 queue and
 * 64 tags to prevent using too much memory.
 */
-   if (is_kdump_kernel()) {
+   if (blk_mq_kdump_kernel()) {
set->nr_hw_queues = 1;
+   set->nr_maps = 1;
set->queue_depth = min(64U, set->queue_depth);
}
/*
@@ -3051,7 +3057,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
  GFP_KERNEL, set->numa_node);
if (!set->map[i].mq_map

Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-04 Thread Ming Lei
On Wed, Dec 05, 2018 at 10:58:02AM +0800, Ming Lei wrote:
> On Tue, Dec 04, 2018 at 07:30:24PM -0700, Jens Axboe wrote:
> > On 12/4/18 7:27 PM, Ming Lei wrote:
> > > On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote:
> > >> On 12/4/18 6:37 PM, Ming Lei wrote:
> > >>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
> > >>>> If we attempt a direct issue to a SCSI device, and it returns BUSY, 
> > >>>> then
> > >>>> we queue the request up normally. However, the SCSI layer may have
> > >>>> already setup SG tables etc for this particular command. If we later
> > >>>> merge with this request, then the old tables are no longer valid. Once
> > >>>> we issue the IO, we only read/write the original part of the request,
> > >>>> not the new state of it.
> > >>>>
> > >>>> This causes data corruption, and is most often noticed with the file
> > >>>> system complaining about the just read data being invalid:
> > >>>>
> > >>>> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode 
> > >>>> #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
> > >>>>
> > >>>> because most of it is garbage...
> > >>>>
> > >>>> This doesn't happen from the normal issue path, as we will simply defer
> > >>>> the request to the hardware queue dispatch list if we fail. Once it's 
> > >>>> on
> > >>>> the dispatch list, we never merge with it.
> > >>>>
> > >>>> Fix this from the direct issue path by flagging the request as
> > >>>> REQ_NOMERGE so we don't change the size of it before issue.
> > >>>>
> > >>>> See also:
> > >>>>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
> > >>>>
> > >>>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in 
> > >>>> case of 'none'")
> > >>>> Signed-off-by: Jens Axboe 
> > >>>>
> > >>>> ---
> > >>>>
> > >>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> > >>>> index 3f91c6e5b17a..d8f518c6ea38 100644
> > >>>> --- a/block/blk-mq.c
> > >>>> +++ b/block/blk-mq.c
> > >>>> @@ -1715,6 +1715,15 @@ static blk_status_t 
> > >>>> __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
> > >>>>break;
> > >>>>case BLK_STS_RESOURCE:
> > >>>>case BLK_STS_DEV_RESOURCE:
> > >>>> +  /*
> > >>>> +   * If direct dispatch fails, we cannot allow any 
> > >>>> merging on
> > >>>> +   * this IO. Drivers (like SCSI) may have set up 
> > >>>> permanent state
> > >>>> +   * for this request, like SG tables and mappings, and 
> > >>>> if we
> > >>>> +   * merge to it later on then we'll still only do IO to 
> > >>>> the
> > >>>> +   * original part.
> > >>>> +   */
> > >>>> +  rq->cmd_flags |= REQ_NOMERGE;
> > >>>> +
> > >>>>blk_mq_update_dispatch_busy(hctx, true);
> > >>>>__blk_mq_requeue_request(rq);
> > >>>>break;
> > >>>>
> > >>>
> > >>> Not sure it is enough to just mark it as NOMERGE, for example, driver
> > >>> may have setup the .special_vec for discard, and NOMERGE may not prevent
> > >>> request from entering elevator queue completely. Cause 'rq.rb_node' and
> > >>> 'rq.special_vec' share same space.
> > >>
> > >> We should rather limit the scope of the direct dispatch instead. It
> > >> doesn't make sense to do for anything but read/write anyway.
> > > 
> > > discard is kind of write, and it isn't treated very specially in make
> > > request path, except for multi-range discard.
> > 
> > The point of direct dispatch is to reduce latencies for requests,
> > discards are so damn slow on ALL devices anyway that it doesn't make any
> > sense to try direct dispatch to begin with, regardless of whether it is
> > possible or not.
>

Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-04 Thread Ming Lei
On Tue, Dec 04, 2018 at 07:30:24PM -0700, Jens Axboe wrote:
> On 12/4/18 7:27 PM, Ming Lei wrote:
> > On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote:
> >> On 12/4/18 6:37 PM, Ming Lei wrote:
> >>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
> >>>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
> >>>> we queue the request up normally. However, the SCSI layer may have
> >>>> already setup SG tables etc for this particular command. If we later
> >>>> merge with this request, then the old tables are no longer valid. Once
> >>>> we issue the IO, we only read/write the original part of the request,
> >>>> not the new state of it.
> >>>>
> >>>> This causes data corruption, and is most often noticed with the file
> >>>> system complaining about the just read data being invalid:
> >>>>
> >>>> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: 
> >>>> comm dpkg-query: bad extra_isize 24937 (inode size 256)
> >>>>
> >>>> because most of it is garbage...
> >>>>
> >>>> This doesn't happen from the normal issue path, as we will simply defer
> >>>> the request to the hardware queue dispatch list if we fail. Once it's on
> >>>> the dispatch list, we never merge with it.
> >>>>
> >>>> Fix this from the direct issue path by flagging the request as
> >>>> REQ_NOMERGE so we don't change the size of it before issue.
> >>>>
> >>>> See also:
> >>>>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
> >>>>
> >>>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in 
> >>>> case of 'none'")
> >>>> Signed-off-by: Jens Axboe 
> >>>>
> >>>> ---
> >>>>
> >>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >>>> index 3f91c6e5b17a..d8f518c6ea38 100644
> >>>> --- a/block/blk-mq.c
> >>>> +++ b/block/blk-mq.c
> >>>> @@ -1715,6 +1715,15 @@ static blk_status_t 
> >>>> __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
> >>>>  break;
> >>>>  case BLK_STS_RESOURCE:
> >>>>  case BLK_STS_DEV_RESOURCE:
> >>>> +/*
> >>>> + * If direct dispatch fails, we cannot allow any 
> >>>> merging on
> >>>> + * this IO. Drivers (like SCSI) may have set up 
> >>>> permanent state
> >>>> + * for this request, like SG tables and mappings, and 
> >>>> if we
> >>>> + * merge to it later on then we'll still only do IO to 
> >>>> the
> >>>> + * original part.
> >>>> + */
> >>>> +rq->cmd_flags |= REQ_NOMERGE;
> >>>> +
> >>>>  blk_mq_update_dispatch_busy(hctx, true);
> >>>>  __blk_mq_requeue_request(rq);
> >>>>  break;
> >>>>
> >>>
> >>> Not sure it is enough to just mark it as NOMERGE, for example, driver
> >>> may have setup the .special_vec for discard, and NOMERGE may not prevent
> >>> request from entering elevator queue completely. Cause 'rq.rb_node' and
> >>> 'rq.special_vec' share same space.
> >>
> >> We should rather limit the scope of the direct dispatch instead. It
> >> doesn't make sense to do for anything but read/write anyway.
> > 
> > discard is kind of write, and it isn't treated very specially in make
> > request path, except for multi-range discard.
> 
> The point of direct dispatch is to reduce latencies for requests,
> discards are so damn slow on ALL devices anyway that it doesn't make any
> sense to try direct dispatch to begin with, regardless of whether it is
> possible or not.

SCSI MQ devices may benefit from direct dispatch due to reduced lock contention.

> 
> >>> So how about inserting this request via blk_mq_request_bypass_insert()
> >>> in case that direct issue returns BUSY? Then it is invariant that
> >>> any request queued via .queue_rq() won't enter scheduler queue.
> >>
> >> I did consider this, but I didn't want to experiment with exercising
> >> a new path for an import

Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-04 Thread Ming Lei
On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote:
> On 12/4/18 6:37 PM, Ming Lei wrote:
> > On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
> >> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
> >> we queue the request up normally. However, the SCSI layer may have
> >> already setup SG tables etc for this particular command. If we later
> >> merge with this request, then the old tables are no longer valid. Once
> >> we issue the IO, we only read/write the original part of the request,
> >> not the new state of it.
> >>
> >> This causes data corruption, and is most often noticed with the file
> >> system complaining about the just read data being invalid:
> >>
> >> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: 
> >> comm dpkg-query: bad extra_isize 24937 (inode size 256)
> >>
> >> because most of it is garbage...
> >>
> >> This doesn't happen from the normal issue path, as we will simply defer
> >> the request to the hardware queue dispatch list if we fail. Once it's on
> >> the dispatch list, we never merge with it.
> >>
> >> Fix this from the direct issue path by flagging the request as
> >> REQ_NOMERGE so we don't change the size of it before issue.
> >>
> >> See also:
> >>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
> >>
> >> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case 
> >> of 'none'")
> >> Signed-off-by: Jens Axboe 
> >>
> >> ---
> >>
> >> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >> index 3f91c6e5b17a..d8f518c6ea38 100644
> >> --- a/block/blk-mq.c
> >> +++ b/block/blk-mq.c
> >> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct 
> >> blk_mq_hw_ctx *hctx,
> >>break;
> >>case BLK_STS_RESOURCE:
> >>case BLK_STS_DEV_RESOURCE:
> >> +  /*
> >> +   * If direct dispatch fails, we cannot allow any merging on
> >> +   * this IO. Drivers (like SCSI) may have set up permanent state
> >> +   * for this request, like SG tables and mappings, and if we
> >> +   * merge to it later on then we'll still only do IO to the
> >> +   * original part.
> >> +   */
> >> +  rq->cmd_flags |= REQ_NOMERGE;
> >> +
> >>blk_mq_update_dispatch_busy(hctx, true);
> >>__blk_mq_requeue_request(rq);
> >>break;
> >>
> > 
> > Not sure it is enough to just mark it as NOMERGE, for example, driver
> > may have setup the .special_vec for discard, and NOMERGE may not prevent
> > request from entering elevator queue completely. Cause 'rq.rb_node' and
> > 'rq.special_vec' share same space.
> 
> We should rather limit the scope of the direct dispatch instead. It
> doesn't make sense to do for anything but read/write anyway.

A discard is a kind of write, and it isn't treated very specially in the
make_request path, except for the multi-range discard.

> 
> > So how about inserting this request via blk_mq_request_bypass_insert()
> > in case that direct issue returns BUSY? Then it is invariant that
> > any request queued via .queue_rq() won't enter scheduler queue.
> 
> I did consider this, but I didn't want to experiment with exercising
> a new path for an important bug fix. You do realize that your original
> patch has been corrupting data for months? I think a little caution
> is in order here.

But marking NOMERGE may still leave a hole when re-inserting a discard
request, as mentioned above.

Given that we never allowed re-inserting a queued request into the scheduler
queue except for 6ce3dd6eec1, I think it is the correct thing to do, and the
fix is simple too.

Thanks,
Ming


Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-04 Thread Ming Lei
On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
> we queue the request up normally. However, the SCSI layer may have
> already setup SG tables etc for this particular command. If we later
> merge with this request, then the old tables are no longer valid. Once
> we issue the IO, we only read/write the original part of the request,
> not the new state of it.
> 
> This causes data corruption, and is most often noticed with the file
> system complaining about the just read data being invalid:
> 
> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm 
> dpkg-query: bad extra_isize 24937 (inode size 256)
> 
> because most of it is garbage...
> 
> This doesn't happen from the normal issue path, as we will simply defer
> the request to the hardware queue dispatch list if we fail. Once it's on
> the dispatch list, we never merge with it.
> 
> Fix this from the direct issue path by flagging the request as
> REQ_NOMERGE so we don't change the size of it before issue.
> 
> See also:
>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
> 
> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 
> 'none'")
> Signed-off-by: Jens Axboe 
> 
> ---
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 3f91c6e5b17a..d8f518c6ea38 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   break;
>   case BLK_STS_RESOURCE:
>   case BLK_STS_DEV_RESOURCE:
> + /*
> +  * If direct dispatch fails, we cannot allow any merging on
> +  * this IO. Drivers (like SCSI) may have set up permanent state
> +  * for this request, like SG tables and mappings, and if we
> +  * merge to it later on then we'll still only do IO to the
> +  * original part.
> +  */
> + rq->cmd_flags |= REQ_NOMERGE;
> +
>   blk_mq_update_dispatch_busy(hctx, true);
>   __blk_mq_requeue_request(rq);
>   break;
> 

I am not sure it is enough to just mark it as NOMERGE; for example, the driver
may have set up .special_vec for a discard, and NOMERGE may not prevent the
request from entering the elevator queue completely, because 'rq.rb_node' and
'rq.special_vec' share the same space.
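
For reference, that overlap comes from the anonymous union in struct request; a rough sketch (member list recalled from that kernel era, so treat the details as an assumption) looks like:

struct request_overlap_sketch {
	/* ... other struct request fields ... */
	union {
		struct rb_node rb_node;		/* only used while in the IO scheduler */
		struct bio_vec special_vec;	/* e.g. set up by SCSI for a discard */
		void *completion_data;
	};
	/* ... */
};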

So how about inserting this request via blk_mq_request_bypass_insert()
in case direct issue returns BUSY? Then it is an invariant that
any request queued via .queue_rq() won't enter the scheduler queue.

--

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3f91c6e5b17a..4b2db0b2909e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1764,7 +1764,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
if (bypass_insert)
return BLK_STS_RESOURCE;
 
-   blk_mq_sched_insert_request(rq, false, run_queue, false);
+   blk_mq_request_bypass_insert(rq, run_queue);
return BLK_STS_OK;
 }
 
@@ -1780,7 +1780,7 @@ static void blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
-   blk_mq_sched_insert_request(rq, false, true, false);
+   blk_mq_request_bypass_insert(rq, true);
else if (ret != BLK_STS_OK)
blk_mq_end_request(rq, ret);
 


Thanks,
Ming


Re: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-04 Thread Ming Lei
On Tue, Dec 04, 2018 at 03:30:11PM +0530, Kashyap Desai wrote:
> Problem statement:
> Whenever we try to get an outstanding request via scsi_host_find_tag,
> the block layer will return stale entries instead of the actual outstanding
> request. Kernel panic if the stale entry is inaccessible or its memory is reused.
> Fix:
> Undo the request mapping in blk_mq_put_driver_tag once the request is returned.
> 
> More detail:
> Whenever an SDEV entry is created, the block layer allocates separate tags
> and static requests. Those requests are not valid after the SDEV is deleted
> from the system. On the fly, the block layer maps static rqs to rqs as below
> in blk_mq_get_driver_tag():
> 
> data.hctx->tags->rqs[rq->tag] = rq;
> 
> The above mapping is for active in-use requests, and it is the same mapping
> that is referenced in scsi_host_find_tag().
> After running some IOs, "data.hctx->tags->rqs[rq->tag]" will have some
> entries which will never be reset in the block layer.
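
As a rough illustration of the fix idea described above (an assumption based on the problem statement, not the exact posted patch), the undo would clear the rqs[] slot when the driver tag is released:

/*
 * Illustrative only: clear the tag -> request mapping that was set up by
 * blk_mq_get_driver_tag(), so that a later scsi_host_find_tag()-style lookup
 * cannot return a stale request pointer.
 */
static void clear_rq_mapping_sketch(struct blk_mq_hw_ctx *hctx, struct request *rq)
{
	if (rq->tag < 0)
		return;				/* no driver tag assigned */

	hctx->tags->rqs[rq->tag] = NULL;	/* undo rqs[rq->tag] = rq from dispatch */
}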

However, if rq and rq->tag are valid, data.hctx->tags->rqs[rq->tag] should
point to an active request instead of a stale one, right?

Thanks,
Ming


block: sbitmap related lockdep warning

2018-12-03 Thread Ming Lei
0d1
[  106.398018] RAX: ffda RBX: 7f8d989f5670 RCX: 7f8dbd04c687
[  106.399001] RDX: 0207b780 RSI: 0001 RDI: 7f8dbe52d000
[  106.399989] RBP:  R08: 0001 R09: 0207b4e0
[  106.400976] R10: 000c R11: 0202 R12: 7f8d989f5670
[  106.401951] R13:  R14: 0207b7b0 R15: 0205b7c0
[  204.246336] null: module loaded
[  294.406501] null: module loaded
[  396.936354] null: module loaded
[  497.202198] end test sanity/001: (NO_HANG, 0)


Thanks,
Ming Lei


Re: [PATCH 1/2] blk-mq: Export iterating all tagged requests

2018-12-02 Thread Ming Lei
On Sun, Dec 2, 2018 at 12:48 AM Christoph Hellwig  wrote:
>
> On Fri, Nov 30, 2018 at 01:36:09PM -0700, Jens Axboe wrote:
> > On 11/30/18 1:26 PM, Keith Busch wrote:
> > > A driver may wish to iterate every tagged request, not just ones that
> > > satisfy blk_mq_request_started(). The intended use is so a driver may
> > > terminate entered requests on quiesced queues.
> >
> > How about we just move the started check into the handler passed in for
> > those that care about it? Much saner to make the interface iterate
> > everything, and leave whatever state check to the callback.
>
> So we used to do that, and I changed it back in May to test for
> MQ_RQ_IN_FLIGHT, and then Ming changed it to check
> blk_mq_request_started.  So this is clearly a minefield of sorts..

The change to blk_mq_request_started() was just for fixing SCSI's boot
hang issue.

>
> Note that at least mtip32xx, nbd, skd and the various nvme transports
> want to use the function to terminate all requests in the error
> path, and it would be great to have one single understood, documented
> and debugged helper for that in the core, so this is a vote for moving
> more of the logic in your second helper into the core code.  skd
> will need actually use ->complete to release resources for that, though
> and mtip plays some odd abort bits.  If it weren't for the interesting
> abort behavior in nvme-fc that means we could even unexport the
> low-level interface.

Agreed, and especially SCSI's use should be understood, given that
SCSI is so widely deployed in production systems.
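
As a rough sketch of the suggestion above (the callback shape here is an assumption, not the exact blk-mq iterator signature), the started check would simply move into the per-request callback:

/*
 * Illustrative callback for a tag iterator: skip requests that were never
 * started and terminate the rest.  blk_mq_request_started() and
 * blk_mq_end_request() are existing blk-mq helpers; the callback signature
 * is only an approximation.
 */
static void cancel_one_rq_sketch(struct request *rq, void *data, bool reserved)
{
	if (!blk_mq_request_started(rq))
		return;				/* not yet handed to the driver */

	blk_mq_end_request(rq, BLK_STS_IOERR);	/* terminate the entered request */
}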

thanks,
Ming Lei


[PATCH] block: fix single range discard merge

2018-11-30 Thread Ming Lei
There are actually two kinds of discard merge:

- one is the normal discard merge, just like a normal read/write request;
call it the single-range discard

- the other is the multi-range discard, where queue_max_discard_segments(rq->q) > 1

For the former case, queue_max_discard_segments(rq->q) is 1, and we
should handle this kind of discard merge like a normal read/write
request.
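
For reference, the distinction is captured by a helper along the lines of the blk_discard_mergable() used in the hunk below; the exact in-tree definition may differ slightly:

static inline bool blk_discard_mergable(struct request *req)
{
	/* multi-range discard: merging is handled specially, not by the elevator */
	if (req_op(req) == REQ_OP_DISCARD &&
	    queue_max_discard_segments(req->q) > 1)
		return true;
	return false;
}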

This patch fixes the following kernel panic issue[1], which is caused by
not removing the single-range discard request from the elevator queue.

Guangwu has a raid discard test case in which this issue is a bit
easier to trigger, and I verified that this patch fixes the kernel
panic in that test case.

[1] kernel panic log from Jens's report

 BUG: unable to handle kernel NULL pointer dereference at 0148
 PGD 0 P4D 0.
 Oops:  [#1] SMP PTI
 CPU: 37 PID: 763 Comm: kworker/37:1H Not tainted \
4.20.0-rc3-00649-ge64d9a554a91-dirty #14  Hardware name: Wiwynn \
Leopard-Orv2/Leopard-DDR BW, BIOS LBM08   03/03/2017   Workqueue: kblockd \
blk_mq_run_work_fnRIP: \
0010:blk_mq_get_driver_tag+0x81/0x120   
Code: 24 \
10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 25 28 00 
00 00 \
0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 39 43 20 
72 37 \
f6 87 b0 00 00 00 02  RSP: 0018:c90004aabd30 EFLAGS: 00010246   
  \
  RAX: 0003 RBX: 888465ea1300 RCX: c90004aabde8
 RDX:  RSI: c90004aabde8 RDI: 
 RBP:  R08: 888465ea1348 R09: 
 R10: 1000 R11:  R12: 888465ea1300
 R13:  R14: 888465ea1348 R15: 888465d1
 FS:  () GS:88846f9c() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 0148 CR3: 0220a003 CR4: 003606e0
 DR0:  DR1:  DR2: 
 DR3:  DR6: fffe0ff0 DR7: 0400
 Call Trace:
  blk_mq_dispatch_rq_list+0xec/0x480
  ? elv_rb_del+0x11/0x30
  blk_mq_do_dispatch_sched+0x6e/0xf0
  blk_mq_sched_dispatch_requests+0xfa/0x170
  __blk_mq_run_hw_queue+0x5f/0xe0
  process_one_work+0x154/0x350
  worker_thread+0x46/0x3c0
  kthread+0xf5/0x130
  ? process_one_work+0x350/0x350
  ? kthread_destroy_worker+0x50/0x50
  ret_from_fork+0x1f/0x30
 Modules linked in: sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp 
kvm_intel \
kvm switchtec irqbypass iTCO_wdt iTCO_vendor_support efivars cdc_ether usbnet 
mii \
cdc_acm i2c_i801 lpc_ich mfd_core ipmi_si ipmi_devintf ipmi_msghandler 
acpi_cpufreq \
button sch_fq_codel nfsd nfs_acl lockd grace auth_rpcgss oid_registry sunrpc 
nvme \
nvme_core fuse sg loop efivarfs autofs4  CR2: 0148  
  \

 ---[ end trace 340a1fb996df1b9b ]---
 RIP: 0010:blk_mq_get_driver_tag+0x81/0x120
 Code: 24 10 48 89 7c 24 20 74 21 83 fa ff 0f 95 c0 48 8b 4c 24 28 65 48 33 0c 
25 28 \
00 00 00 0f 85 96 00 00 00 48 83 c4 30 5b 5d c3 <48> 8b 87 48 01 00 00 8b 40 04 
39 43 \
20 72 37 f6 87 b0 00 00 00 02

Fixes: 445251d0f4d329a ("blk-mq: fix discard merge with scheduler attached")
Reported-by: Jens Axboe 
Cc: Guangwu Zhang 
Cc: Christoph Hellwig 
Cc: Jianchao Wang 
Signed-off-by: Ming Lei 
---
 block/blk-merge.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index e7696c47489a..7695034f4b87 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -820,7 +820,7 @@ static struct request *attempt_merge(struct request_queue 
*q,
 
req->__data_len += blk_rq_bytes(next);
 
-   if (req_op(req) != REQ_OP_DISCARD)
+   if (!blk_discard_mergable(req))
elv_merge_requests(q, req, next);
 
/*
-- 
2.9.5



Re: [PATCH 7/7] blk-mq: use plug for devices that implement ->commits_rqs()

2018-11-28 Thread Ming Lei
On Wed, Nov 28, 2018 at 06:35:38AM -0700, Jens Axboe wrote:
> If we have that hook, we know the driver handles bd->last == true in
> a smart fashion. If it does, even for multiple hardware queues, it's
> a good idea to flush batches of requests to the device, if we have
> batches of requests from the submitter.
> 
> Signed-off-by: Jens Axboe 
> ---
>  block/blk-mq.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 9e1045152e45..091557a43d53 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1945,7 +1945,12 @@ static blk_qc_t blk_mq_make_request(struct 
> request_queue *q, struct bio *bio)
>   /* bypass scheduler for flush rq */
>   blk_insert_flush(rq);
>   blk_mq_run_hw_queue(data.hctx, true);
> - } else if (plug && q->nr_hw_queues == 1) {
> + } else if (plug && (q->nr_hw_queues == 1 || q->mq_ops->commit_rqs)) {
> + /*
> +  * Use plugging if we have a ->commit_rqs() hook as well,
> +  * as we know the driver uses bd->last in a smart
> +  * fashion.
> +  */
>   unsigned int request_count = plug->rq_count;
>   struct request *last = NULL;
>  
> -- 
> 2.17.1
> 

Reviewed-by: Ming Lei 

thanks,
Ming


Re: [PATCH 5/7] ataflop: implement mq_ops->commit_rqs() hook

2018-11-28 Thread Ming Lei
On Wed, Nov 28, 2018 at 06:35:36AM -0700, Jens Axboe wrote:
> We need this for blk-mq to kick things into gear, if we told it that
> we had more IO coming, but then failed to deliver on that promise.
> 
> Reviewed-by: Omar Sandoval 
> Signed-off-by: Jens Axboe 
> ---
>  drivers/block/ataflop.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/block/ataflop.c b/drivers/block/ataflop.c
> index f88b4c26d422..a0c6745b0034 100644
> --- a/drivers/block/ataflop.c
> +++ b/drivers/block/ataflop.c
> @@ -1471,6 +1471,15 @@ static void setup_req_params( int drive )
>   ReqTrack, ReqSector, (unsigned long)ReqData ));
>  }
>  
> +static void ataflop_commit_rqs(struct blk_mq_hw_ctx *hctx)
> +{
> + spin_lock_irq(&ataflop_lock);
> + atari_disable_irq(IRQ_MFP_FDC);
> + finish_fdc();
> + atari_enable_irq(IRQ_MFP_FDC);
> + spin_unlock_irq(&ataflop_lock);
> +}
> +
>  static blk_status_t ataflop_queue_rq(struct blk_mq_hw_ctx *hctx,
>const struct blk_mq_queue_data *bd)
>  {
> @@ -1947,6 +1956,7 @@ static const struct block_device_operations floppy_fops 
> = {
>  
>  static const struct blk_mq_ops ataflop_mq_ops = {
>   .queue_rq = ataflop_queue_rq,
> + .commit_rqs = ataflop_commit_rqs,
>  };
>  
>  static struct kobject *floppy_find(dev_t dev, int *part, void *data)
> -- 
> 2.17.1
> 

Reviewed-by: Ming Lei 

thanks,
Ming


Re: [PATCH 6/7] blk-mq: use bd->last == true for list inserts

2018-11-28 Thread Ming Lei
On Wed, Nov 28, 2018 at 06:35:37AM -0700, Jens Axboe wrote:
> If we are issuing a list of requests, we know if we're at the last one.
> If we fail issuing, ensure that we call ->commits_rqs() to flush any
> potential previous requests.
> 
> Reviewed-by: Omar Sandoval 
> Signed-off-by: Jens Axboe 
> ---
>  block/blk-core.c |  2 +-
>  block/blk-mq.c   | 16 
>  block/blk-mq.h   |  2 +-
>  3 files changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index d107d016b92b..3f6f5e6c2fe4 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1334,7 +1334,7 @@ blk_status_t blk_insert_cloned_request(struct 
> request_queue *q, struct request *
>* bypass a potential scheduler on the bottom device for
>* insert.
>*/
> - return blk_mq_request_issue_directly(rq);
> + return blk_mq_request_issue_directly(rq, true);
>  }
>  EXPORT_SYMBOL_GPL(blk_insert_cloned_request);
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 18ff47a85ad3..9e1045152e45 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1743,12 +1743,12 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx 
> *hctx, struct request *rq)
>  
>  static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
>   struct request *rq,
> - blk_qc_t *cookie)
> + blk_qc_t *cookie, bool last)
>  {
>   struct request_queue *q = rq->q;
>   struct blk_mq_queue_data bd = {
>   .rq = rq,
> - .last = true,
> + .last = last,
>   };
>   blk_qc_t new_cookie;
>   blk_status_t ret;
> @@ -1783,7 +1783,7 @@ static blk_status_t __blk_mq_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>  static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>   struct request *rq,
>   blk_qc_t *cookie,
> - bool bypass_insert)
> + bool bypass_insert, bool last)
>  {
>   struct request_queue *q = rq->q;
>   bool run_queue = true;
> @@ -1812,7 +1812,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   goto insert;
>   }
>  
> - return __blk_mq_issue_directly(hctx, rq, cookie);
> + return __blk_mq_issue_directly(hctx, rq, cookie, last);
>  insert:
>   if (bypass_insert)
>   return BLK_STS_RESOURCE;
> @@ -1831,7 +1831,7 @@ static void blk_mq_try_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>  
>   hctx_lock(hctx, _idx);
>  
> - ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
> + ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false, true);
>   if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
>   blk_mq_sched_insert_request(rq, false, true, false);
>   else if (ret != BLK_STS_OK)
> @@ -1840,7 +1840,7 @@ static void blk_mq_try_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   hctx_unlock(hctx, srcu_idx);
>  }
>  
> -blk_status_t blk_mq_request_issue_directly(struct request *rq)
> +blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last)
>  {
>   blk_status_t ret;
>   int srcu_idx;
> @@ -1848,7 +1848,7 @@ blk_status_t blk_mq_request_issue_directly(struct 
> request *rq)
>   struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
>  
>   hctx_lock(hctx, _idx);
> - ret = __blk_mq_try_issue_directly(hctx, rq, _cookie, true);
> + ret = __blk_mq_try_issue_directly(hctx, rq, _cookie, true, last);
>   hctx_unlock(hctx, srcu_idx);
>  
>   return ret;
> @@ -1863,7 +1863,7 @@ void blk_mq_try_issue_list_directly(struct 
> blk_mq_hw_ctx *hctx,
>   queuelist);
>  
>   list_del_init(>queuelist);
> - ret = blk_mq_request_issue_directly(rq);
> + ret = blk_mq_request_issue_directly(rq, list_empty(list));
>   if (ret != BLK_STS_OK) {
>   if (ret == BLK_STS_RESOURCE ||
>   ret == BLK_STS_DEV_RESOURCE) {
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 9ae8e9f8f8b1..7291e5379358 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -69,7 +69,7 @@ void blk_mq_insert_requests(struct blk_mq_hw_ctx *hctx, 
> struct blk_mq_ctx *ctx,
>   struct list_head *list);
>  
>  /* Used by blk_insert_cloned_request() to issue request directly */
> -blk_status_t blk_mq_request_issue_directly(struct request *rq);
> +blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last);
>  void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
>   struct list_head *list);
>  
> -- 
> 2.17.1
> 

Reviewed-by: Ming Lei 

Thanks,
Ming


Re: [PATCH 5/8] virtio_blk: implement mq_ops->commit_rqs() hook

2018-11-28 Thread Ming Lei
On Wed, Nov 28, 2018 at 08:13:43PM -0700, Jens Axboe wrote:
> On 11/28/18 7:51 PM, Ming Lei wrote:
> > On Wed, Nov 28, 2018 at 07:19:09PM -0700, Jens Axboe wrote:
> >> On 11/28/18 6:23 PM, Ming Lei wrote:
> >>> On Tue, Nov 27, 2018 at 07:34:51PM -0700, Jens Axboe wrote:
> >>>> On 11/27/18 7:10 PM, Ming Lei wrote:
> >>>>> On Mon, Nov 26, 2018 at 09:35:53AM -0700, Jens Axboe wrote:
> >>>>>> We need this for blk-mq to kick things into gear, if we told it that
> >>>>>> we had more IO coming, but then failed to deliver on that promise.
> >>>>>>
> >>>>>> Signed-off-by: Jens Axboe 
> >>>>>> ---
> >>>>>>  drivers/block/virtio_blk.c | 15 +++
> >>>>>>  1 file changed, 15 insertions(+)
> >>>>>>
> >>>>>> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> >>>>>> index 6e869d05f91e..b49c57e77780 100644
> >>>>>> --- a/drivers/block/virtio_blk.c
> >>>>>> +++ b/drivers/block/virtio_blk.c
> >>>>>> @@ -214,6 +214,20 @@ static void virtblk_done(struct virtqueue *vq)
> >>>>>>spin_unlock_irqrestore(>vqs[qid].lock, flags);
> >>>>>>  }
> >>>>>>  
> >>>>>> +static void virtio_commit_rqs(struct blk_mq_hw_ctx *hctx)
> >>>>>> +{
> >>>>>> +  struct virtio_blk *vblk = hctx->queue->queuedata;
> >>>>>> +  int qid = hctx->queue_num;
> >>>>>> +  bool kick;
> >>>>>> +
> >>>>>> +  spin_lock_irq(&vblk->vqs[qid].lock);
> >>>>>> +  kick = virtqueue_kick_prepare(vblk->vqs[qid].vq);
> >>>>>> +  spin_unlock_irq(&vblk->vqs[qid].lock);
> >>>>>> +
> >>>>>> +  if (kick)
> >>>>>> +  virtqueue_notify(vblk->vqs[qid].vq);
> >>>>>> +}
> >>>>>> +
> >>>>>>  static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
> >>>>>>   const struct blk_mq_queue_data *bd)
> >>>>>>  {
> >>>>>> @@ -638,6 +652,7 @@ static void virtblk_initialize_rq(struct request 
> >>>>>> *req)
> >>>>>>  
> >>>>>>  static const struct blk_mq_ops virtio_mq_ops = {
> >>>>>>.queue_rq   = virtio_queue_rq,
> >>>>>> +  .commit_rqs = virtio_commit_rqs,
> >>>>>>.complete   = virtblk_request_done,
> >>>>>>.init_request   = virtblk_init_request,
> >>>>>>  #ifdef CONFIG_VIRTIO_BLK_SCSI
> >>>>>> -- 
> >>>>>> 2.17.1
> >>>>>>
> >>>>>
> >>>>> If .commit_rqs() is implemented, virtqueue_notify() in virtio_queue_rq()
> >>>>> should have been removed for saving the world switch per .queue_rq()
> >>>>
> >>>> ->commits_rqs() is only for the case where bd->last is set to false,
> >>>> and we never make it to the end and flag bd->last == true. If bd->last
> >>>> is true, the driver should kick things into gear.
> >>>
> >>> OK, looks I misunderstood it. However, virtio-blk doesn't need this
> >>> change since virtio_queue_rq() can handle it well. This patch may 
> >>> introduce
> >>> one unnecessary VM world switch in case of queue busy.
> >>
> >> Not it won't, it may in the case of some failure outside of the driver.
> > 
> > If the failure is because of out of tag, blk_mq_dispatch_wake() will
> > rerun the queue, and the bd->last will be set finally. Or is there
> > other failure(outside of driver) not covered?
> 
> The point is to make this happen when we commit the IOs, not needing to
> do a restart (or relying on IO being in-flight). If we're submitting a
> string of requests, we should not rely on failures happening only due to
> IO being going and thus restarting us. It defeats the purpose of even
> having ->last in the first place.

OK, it makes sense.

> 
> >> The only reason that virtio-blk doesn't currently hang is because it
> >> has restart logic, and the failure case only happens in the if we
> >> already have IO in-flight.
> > 
> > Yeah, virtqueue_kic

Re: [PATCH 5/8] virtio_blk: implement mq_ops->commit_rqs() hook

2018-11-28 Thread Ming Lei
On Wed, Nov 28, 2018 at 07:19:09PM -0700, Jens Axboe wrote:
> On 11/28/18 6:23 PM, Ming Lei wrote:
> > On Tue, Nov 27, 2018 at 07:34:51PM -0700, Jens Axboe wrote:
> >> On 11/27/18 7:10 PM, Ming Lei wrote:
> >>> On Mon, Nov 26, 2018 at 09:35:53AM -0700, Jens Axboe wrote:
> >>>> We need this for blk-mq to kick things into gear, if we told it that
> >>>> we had more IO coming, but then failed to deliver on that promise.
> >>>>
> >>>> Signed-off-by: Jens Axboe 
> >>>> ---
> >>>>  drivers/block/virtio_blk.c | 15 +++
> >>>>  1 file changed, 15 insertions(+)
> >>>>
> >>>> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> >>>> index 6e869d05f91e..b49c57e77780 100644
> >>>> --- a/drivers/block/virtio_blk.c
> >>>> +++ b/drivers/block/virtio_blk.c
> >>>> @@ -214,6 +214,20 @@ static void virtblk_done(struct virtqueue *vq)
> >>>>  spin_unlock_irqrestore(>vqs[qid].lock, flags);
> >>>>  }
> >>>>  
> >>>> +static void virtio_commit_rqs(struct blk_mq_hw_ctx *hctx)
> >>>> +{
> >>>> +struct virtio_blk *vblk = hctx->queue->queuedata;
> >>>> +int qid = hctx->queue_num;
> >>>> +bool kick;
> >>>> +
> >>>> +spin_lock_irq(>vqs[qid].lock);
> >>>> +kick = virtqueue_kick_prepare(vblk->vqs[qid].vq);
> >>>> +spin_unlock_irq(>vqs[qid].lock);
> >>>> +
> >>>> +if (kick)
> >>>> +virtqueue_notify(vblk->vqs[qid].vq);
> >>>> +}
> >>>> +
> >>>>  static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
> >>>> const struct blk_mq_queue_data *bd)
> >>>>  {
> >>>> @@ -638,6 +652,7 @@ static void virtblk_initialize_rq(struct request 
> >>>> *req)
> >>>>  
> >>>>  static const struct blk_mq_ops virtio_mq_ops = {
> >>>>  .queue_rq   = virtio_queue_rq,
> >>>> +.commit_rqs = virtio_commit_rqs,
> >>>>  .complete   = virtblk_request_done,
> >>>>  .init_request   = virtblk_init_request,
> >>>>  #ifdef CONFIG_VIRTIO_BLK_SCSI
> >>>> -- 
> >>>> 2.17.1
> >>>>
> >>>
> >>> If .commit_rqs() is implemented, virtqueue_notify() in virtio_queue_rq()
> >>> should have been removed for saving the world switch per .queue_rq()
> >>
> >> ->commits_rqs() is only for the case where bd->last is set to false,
> >> and we never make it to the end and flag bd->last == true. If bd->last
> >> is true, the driver should kick things into gear.
> > 
> > OK, looks I misunderstood it. However, virtio-blk doesn't need this
> > change since virtio_queue_rq() can handle it well. This patch may introduce
> > one unnecessary VM world switch in case of queue busy.
> 
> Not it won't, it may in the case of some failure outside of the driver.

If the failure is because we run out of tags, blk_mq_dispatch_wake() will
rerun the queue, and bd->last will be set eventually. Or is there some
other failure (outside of the driver) that isn't covered?

> The only reason that virtio-blk doesn't currently hang is because it
> has restart logic, and the failure case only happens in the if we
> already have IO in-flight.

Yeah, virtqueue_kick() is called in case of any error in virtio_queue_rq(),
so I am still wondering why we have to implement .commit_rqs() for virtio-blk.

> For the NVMe variant, that's not going to be the case.

OK.

> 
> > IMO bd->last won't work well in case of io scheduler given the rq_list
> > only includes one single request.
> 
> But that's a fake limitation that definitely should just be lifted,
> the fact that blk-mq-sched is _currently_ just doing single requests
> is woefully inefficient.

I agree, but it seems a bit hard given that we have to consider request
merging.

> 
> > I wrote this kind of patch(never posted) before to use sort of
> > ->commits_rqs() to replace the current bd->last mechanism which need
> > one extra driver tag, which may improve the above case, also code gets
> > cleaned up.
> 
> It doesn't need one extra driver tag, we currently get an extra one just
> to flag ->last correctly. That's not a requirement, that's a limitation
> of the current implementation. We could get rid of that, and it it
> proves to be an issue, that's not hard to do.

What do you think about using .commit_rqs() to replace ->last? For
example, just call .commit_rqs() after the last request has been queued
to the driver successfully. Then we can remove bd->last and avoid
grabbing the extra driver tag just to figure out bd->last.
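
Roughly what I have in mind, as an untested sketch: 'list', 'q' and 'hctx'
are the usual dispatch-context variables, and the error handling is
simplified for illustration.

	unsigned int queued = 0;

	while (!list_empty(list)) {
		struct request *rq = list_first_entry(list, struct request, queuelist);
		struct blk_mq_queue_data bd = { .rq = rq };

		list_del_init(&rq->queuelist);
		if (q->mq_ops->queue_rq(hctx, &bd) != BLK_STS_OK)
			break;
		queued++;
	}

	/* one doorbell for the whole batch, no matter how it ended */
	if (queued && q->mq_ops->commit_rqs)
		q->mq_ops->commit_rqs(hctx);

With that, drivers wouldn't need bd->last at all; they would simply batch
internally until .commit_rqs() is called.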

Thanks,
Ming


Re: [PATCH 5/8] virtio_blk: implement mq_ops->commit_rqs() hook

2018-11-28 Thread Ming Lei
On Tue, Nov 27, 2018 at 07:34:51PM -0700, Jens Axboe wrote:
> On 11/27/18 7:10 PM, Ming Lei wrote:
> > On Mon, Nov 26, 2018 at 09:35:53AM -0700, Jens Axboe wrote:
> >> We need this for blk-mq to kick things into gear, if we told it that
> >> we had more IO coming, but then failed to deliver on that promise.
> >>
> >> Signed-off-by: Jens Axboe 
> >> ---
> >>  drivers/block/virtio_blk.c | 15 +++
> >>  1 file changed, 15 insertions(+)
> >>
> >> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> >> index 6e869d05f91e..b49c57e77780 100644
> >> --- a/drivers/block/virtio_blk.c
> >> +++ b/drivers/block/virtio_blk.c
> >> @@ -214,6 +214,20 @@ static void virtblk_done(struct virtqueue *vq)
> >>spin_unlock_irqrestore(>vqs[qid].lock, flags);
> >>  }
> >>  
> >> +static void virtio_commit_rqs(struct blk_mq_hw_ctx *hctx)
> >> +{
> >> +  struct virtio_blk *vblk = hctx->queue->queuedata;
> >> +  int qid = hctx->queue_num;
> >> +  bool kick;
> >> +
> >> +  spin_lock_irq(>vqs[qid].lock);
> >> +  kick = virtqueue_kick_prepare(vblk->vqs[qid].vq);
> >> +  spin_unlock_irq(>vqs[qid].lock);
> >> +
> >> +  if (kick)
> >> +  virtqueue_notify(vblk->vqs[qid].vq);
> >> +}
> >> +
> >>  static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
> >>   const struct blk_mq_queue_data *bd)
> >>  {
> >> @@ -638,6 +652,7 @@ static void virtblk_initialize_rq(struct request *req)
> >>  
> >>  static const struct blk_mq_ops virtio_mq_ops = {
> >>.queue_rq   = virtio_queue_rq,
> >> +  .commit_rqs = virtio_commit_rqs,
> >>.complete   = virtblk_request_done,
> >>.init_request   = virtblk_init_request,
> >>  #ifdef CONFIG_VIRTIO_BLK_SCSI
> >> -- 
> >> 2.17.1
> >>
> > 
> > If .commit_rqs() is implemented, virtqueue_notify() in virtio_queue_rq()
> > should have been removed for saving the world switch per .queue_rq()
> 
> ->commits_rqs() is only for the case where bd->last is set to false,
> and we never make it to the end and flag bd->last == true. If bd->last
> is true, the driver should kick things into gear.

OK, it looks like I misunderstood it. However, virtio-blk doesn't need this
change since virtio_queue_rq() already handles it well. This patch may
introduce one unnecessary VM world switch when the queue is busy.

IMO bd->last won't work well when an IO scheduler is used, given that the
rq_list only includes one single request.

I wrote this kind of patch (never posted) before, using something like
->commit_rqs() to replace the current bd->last mechanism, which needs one
extra driver tag. That may improve the above case, and the code gets
cleaned up too.

Thanks,
Ming


Re: [PATCH 5/8] virtio_blk: implement mq_ops->commit_rqs() hook

2018-11-27 Thread Ming Lei
On Mon, Nov 26, 2018 at 09:35:53AM -0700, Jens Axboe wrote:
> We need this for blk-mq to kick things into gear, if we told it that
> we had more IO coming, but then failed to deliver on that promise.
> 
> Signed-off-by: Jens Axboe 
> ---
>  drivers/block/virtio_blk.c | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 6e869d05f91e..b49c57e77780 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -214,6 +214,20 @@ static void virtblk_done(struct virtqueue *vq)
>   spin_unlock_irqrestore(>vqs[qid].lock, flags);
>  }
>  
> +static void virtio_commit_rqs(struct blk_mq_hw_ctx *hctx)
> +{
> + struct virtio_blk *vblk = hctx->queue->queuedata;
> + int qid = hctx->queue_num;
> + bool kick;
> +
> + spin_lock_irq(>vqs[qid].lock);
> + kick = virtqueue_kick_prepare(vblk->vqs[qid].vq);
> + spin_unlock_irq(>vqs[qid].lock);
> +
> + if (kick)
> + virtqueue_notify(vblk->vqs[qid].vq);
> +}
> +
>  static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx *hctx,
>  const struct blk_mq_queue_data *bd)
>  {
> @@ -638,6 +652,7 @@ static void virtblk_initialize_rq(struct request *req)
>  
>  static const struct blk_mq_ops virtio_mq_ops = {
>   .queue_rq   = virtio_queue_rq,
> + .commit_rqs = virtio_commit_rqs,
>   .complete   = virtblk_request_done,
>   .init_request   = virtblk_init_request,
>  #ifdef CONFIG_VIRTIO_BLK_SCSI
> -- 
> 2.17.1
> 

If .commit_rqs() is implemented, the virtqueue_notify() in virtio_queue_rq()
should be removed to save the VM world switch per .queue_rq().

thanks,
Ming


Re: [PATCH 7/8] blk-mq: use bd->last == true for list inserts

2018-11-27 Thread Ming Lei
On Mon, Nov 26, 2018 at 09:35:55AM -0700, Jens Axboe wrote:
> If we are issuing a list of requests, we know if we're at the last one.
> If we fail issuing, ensure that we call ->commits_rqs() to flush any
> potential previous requests.
> 
> Signed-off-by: Jens Axboe 
> ---
>  block/blk-core.c |  2 +-
>  block/blk-mq.c   | 32 
>  block/blk-mq.h   |  2 +-
>  3 files changed, 26 insertions(+), 10 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index c9758d185357..808a65d23f1a 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1334,7 +1334,7 @@ blk_status_t blk_insert_cloned_request(struct 
> request_queue *q, struct request *
>* bypass a potential scheduler on the bottom device for
>* insert.
>*/
> - return blk_mq_request_issue_directly(rq);
> + return blk_mq_request_issue_directly(rq, true);
>  }
>  EXPORT_SYMBOL_GPL(blk_insert_cloned_request);
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 6a249bf6ed00..0a12cec0b426 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1260,6 +1260,14 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
> struct list_head *list,
>   if (!list_empty(list)) {
>   bool needs_restart;
>  
> + /*
> +  * If we didn't flush the entire list, we could have told
> +  * the driver there was more coming, but that turned out to
> +  * be a lie.
> +  */
> + if (q->mq_ops->commit_rqs)
> + q->mq_ops->commit_rqs(hctx);
> +

Looks like you missed doing the same for blk_mq_do_dispatch_sched() and
blk_mq_do_dispatch_ctx(), in which only one request is added to the
rq_list.

Maybe we can call .commit_rqs(hctx) just at the end of
blk_mq_sched_dispatch_requests() to cover all the non-direct-issue cases,
for example via a small helper like the one sketched below.
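
Untested sketch; the helper name and the exact call sites are only a
suggestion:

static void blk_mq_commit_rqs(struct blk_mq_hw_ctx *hctx)
{
	struct request_queue *q = hctx->queue;

	if (q->mq_ops->commit_rqs)
		q->mq_ops->commit_rqs(hctx);
}

Calling it once at the end of blk_mq_sched_dispatch_requests() would cover
both the single-request dispatch helpers and the list dispatch path.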

Thanks,
Ming


Re: [PATCH 3/8] blk-mq: add mq_ops->commit_rqs()

2018-11-27 Thread Ming Lei
On Mon, Nov 26, 2018 at 09:35:51AM -0700, Jens Axboe wrote:
> blk-mq passes information to the hardware about any given request being
> the last that we will issue in this sequence. The point is that hardware
> can defer costly doorbell type writes to the last request. But if we run
> into errors issuing a sequence of requests, we may never send the request
> with bd->last == true set. For that case, we need a hook that tells the
> hardware that nothing else is coming right now.
> 
> For failures returned by the drivers ->queue_rq() hook, the driver is
> responsible for flushing pending requests, if it uses bd->last to
> optimize that part. This works like before, no changes there.
> 
> Signed-off-by: Jens Axboe 
> ---
>  include/linux/blk-mq.h | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index ca0520ca6437..1fd139b65a6e 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -117,6 +117,7 @@ struct blk_mq_queue_data {
>  
>  typedef blk_status_t (queue_rq_fn)(struct blk_mq_hw_ctx *,
>   const struct blk_mq_queue_data *);
> +typedef void (commit_rqs_fn)(struct blk_mq_hw_ctx *);
>  /* takes rq->cmd_flags as input, returns a hardware type index */
>  typedef int (rq_flags_to_type_fn)(struct request_queue *, unsigned int);
>  typedef bool (get_budget_fn)(struct blk_mq_hw_ctx *);
> @@ -144,6 +145,15 @@ struct blk_mq_ops {
>*/
>   queue_rq_fn *queue_rq;
>  
> + /*
> +  * If a driver uses bd->last to judge when to submit requests to
> +  * hardware, it must define this function. In case of errors that
> +  * make us stop issuing further requests, this hook serves the
> +  * purpose of kicking the hardware (which the last request otherwise
> +  * would have done).
> +  */
> + commit_rqs_fn   *commit_rqs;
> +
>   /*
>* Return a queue map type for the given request/bio flags
>*/
> -- 
> 2.17.1
>

Looks fine,

Reviewed-by: Ming Lei 

Thanks,
Ming


Re: [PATCH] block/025: test discard sector alignement and sector size overflow

2018-11-26 Thread Ming Lei
On Mon, Nov 26, 2018 at 04:33:10PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 12:00:17PM +0800, Ming Lei wrote:
> > This test covers the following two issues:
> > 
> > 1) discard sector need to be aligned with logical block size
> > 
> > 2) make sure 'sector_t' instead of 'unsigned int' is used when comparing
> > with discard sector size
> > 
> > Signed-off-by: Ming Lei 
> > ---
> >  tests/block/025 | 37 +
> >  tests/block/025.out |  2 ++
> >  2 files changed, 39 insertions(+)
> >  create mode 100755 tests/block/025
> >  create mode 100644 tests/block/025.out
> > 
> > diff --git a/tests/block/025 b/tests/block/025
> > new file mode 100755
> > index ..32b632431793
> > --- /dev/null
> > +++ b/tests/block/025
> > @@ -0,0 +1,37 @@
> > +#!/bin/bash
> > +# SPDX-License-Identifier: GPL-3.0+
> > +# Copyright (C) 2018 Ming Lei 
> > +#
> > +# Check two corener cases of BLKDISCARD.
> > +#
> > +# 1) test if discard bio's sector is algined with logical size, fixed by
> > +#1adfc5e4136f ("block: make sure discard bio is aligned with logical 
> > block size")
> 
> Hm, I'm not seeing how this test case tests this commit. Aren't 2049G
> and 512M both aligned to 4096 bytes?

2049G causes the 32-bit 'nr_sects' in __blkdev_issue_discard() to overflow;
please see commit 4800bf7bc8c725e955fcb ("block: fix 32 bit overflow in
__blkdev_issue_discard()").

A 4096-byte logical block size may trigger the 'req_sects' mis-alignment
issue, because 'nr_sects' coming from the filesystem is only 512-byte
aligned; please see 1adfc5e4136f5967d59 ("block: make sure discard bio is
aligned with logical block size").
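
Just to illustrate the overflow part with plain arithmetic (a stand-alone
user-space demo, not the kernel code itself):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* 2049 GiB worth of 512-byte sectors, as created by the scsi_debug device */
	uint64_t nr_sects = 2049ULL * 1024 * 1024 * 1024 / 512;

	printf("64-bit sector count: %llu\n", (unsigned long long)nr_sects);
	printf("32-bit truncation:   %u\n", (unsigned int)nr_sects);
	return 0;
}

The 64-bit count is about 4.3 billion sectors, while the 32-bit truncation
leaves only 2097152 sectors (1GiB), so a 32-bit comparison silently drops
most of the range.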

Thanks,
Ming


Re: [PATCH] block/025: test discard sector alignement and sector size overflow

2018-11-20 Thread Ming Lei
On Thu, Nov 15, 2018 at 12:01 PM Ming Lei  wrote:
>
> This test covers the following two issues:
>
> 1) discard sector need to be aligned with logical block size
>
> 2) make sure 'sector_t' instead of 'unsigned int' is used when comparing
> with discard sector size
>
> Signed-off-by: Ming Lei 
> ---
>  tests/block/025 | 37 +
>  tests/block/025.out |  2 ++
>  2 files changed, 39 insertions(+)
>  create mode 100755 tests/block/025
>  create mode 100644 tests/block/025.out
>
> diff --git a/tests/block/025 b/tests/block/025
> new file mode 100755
> index ..32b632431793
> --- /dev/null
> +++ b/tests/block/025
> @@ -0,0 +1,37 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-3.0+
> +# Copyright (C) 2018 Ming Lei 
> +#
> +# Check two corener cases of BLKDISCARD.
> +#
> +# 1) test if discard bio's sector is algined with logical size, fixed by
> +#1adfc5e4136f ("block: make sure discard bio is aligned with logical 
> block size")
> +# 2) test 32 bit overflow when comparing discard sector size. Fixed by
> +#4800bf7bc8c72 ("block: fix 32 bit overflow in __blkdev_issue_discard()")
> +
> +. tests/block/rc
> +. common/scsi_debug
> +
> +DESCRIPTION="check sector alignment and sector size overflow of BLKDISCARD"
> +
> +requires() {
> +   _have_scsi_debug
> +}
> +
> +test() {
> +   echo "Running ${TEST_NAME}"
> +
> +   rm -f "$FULL"
> +
> +   # Create virtual device with unmap_zeroes_data support
> +   if ! _init_scsi_debug virtual_gb=2049 sector_size=4096 lbpws10=1 
> dev_size_mb=512; then
> +   return 1
> +   fi
> +
> +   local dev="/dev/${SCSI_DEBUG_DEVICES[0]}"
> +   blkdiscard "$dev"
> +
> +   _exit_scsi_debug
> +
> +   echo "Test complete"
> +}
> diff --git a/tests/block/025.out b/tests/block/025.out
> new file mode 100644
> index ..fd9a6d5f70de
> --- /dev/null
> +++ b/tests/block/025.out
> @@ -0,0 +1,2 @@
> +Running block/025
> +Test complete
> --
> 2.9.5
>

Gentle ping...

thanks,
Ming Lei


Re: [PATCH 2/7] block: Remove bio->bi_ioc

2018-11-20 Thread Ming Lei
On Tue, Nov 20, 2018 at 10:31:19AM -0700, Jens Axboe wrote:
> On 11/20/18 10:21 AM, Ming Lei wrote:
> > On Mon, Nov 19, 2018 at 12:51:26PM +0900, Damien Le Moal wrote:
> >> bio->bi_ioc is never set so always NULL. Remove references to it in
> >> bio_disassociate_task() and in rq_ioc() and delete this field from
> >> struct bio. With this change, rq_ioc() always returns
> >> current->io_context without the need for a bio argument. Further
> >> simplify the code and make it more readable by also removing this
> >> helper, which also allows to simplify blk_mq_sched_assign_ioc() by
> >> removing its bio argument.
> >>
> >> Signed-off-by: Damien Le Moal 
> >> ---
> >>  block/bio.c   |  4 
> >>  block/blk-core.c  |  2 +-
> >>  block/blk-mq-sched.c  |  4 ++--
> >>  block/blk-mq-sched.h  |  2 +-
> >>  block/blk-mq.c|  4 ++--
> >>  block/blk.h   | 16 
> >>  include/linux/blk_types.h |  3 +--
> >>  7 files changed, 7 insertions(+), 28 deletions(-)
> >>
> >> diff --git a/block/bio.c b/block/bio.c
> >> index 4f4d9884443b..03895cc0d74a 100644
> >> --- a/block/bio.c
> >> +++ b/block/bio.c
> >> @@ -2027,10 +2027,6 @@ int bio_associate_blkg(struct bio *bio, struct 
> >> blkcg_gq *blkg)
> >>   */
> >>  void bio_disassociate_task(struct bio *bio)
> >>  {
> >> -  if (bio->bi_ioc) {
> >> -  put_io_context(bio->bi_ioc);
> >> -  bio->bi_ioc = NULL;
> >> -  }
> >>if (bio->bi_css) {
> >>css_put(bio->bi_css);
> >>bio->bi_css = NULL;
> >> diff --git a/block/blk-core.c b/block/blk-core.c
> >> index d6e8ab9ca99d..492648c96992 100644
> >> --- a/block/blk-core.c
> >> +++ b/block/blk-core.c
> >> @@ -813,7 +813,7 @@ unsigned int blk_plug_queued_count(struct 
> >> request_queue *q)
> >>  
> >>  void blk_init_request_from_bio(struct request *req, struct bio *bio)
> >>  {
> >> -  struct io_context *ioc = rq_ioc(bio);
> >> +  struct io_context *ioc = current->io_context;
> >>  
> >>if (bio->bi_opf & REQ_RAHEAD)
> >>req->cmd_flags |= REQ_FAILFAST_MASK;
> >> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> >> index d084f731d104..13b8dc332541 100644
> >> --- a/block/blk-mq-sched.c
> >> +++ b/block/blk-mq-sched.c
> >> @@ -31,10 +31,10 @@ void blk_mq_sched_free_hctx_data(struct request_queue 
> >> *q,
> >>  }
> >>  EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
> >>  
> >> -void blk_mq_sched_assign_ioc(struct request *rq, struct bio *bio)
> >> +void blk_mq_sched_assign_ioc(struct request *rq)
> >>  {
> >>struct request_queue *q = rq->q;
> >> -  struct io_context *ioc = rq_ioc(bio);
> >> +  struct io_context *ioc = current->io_context;
> >>struct io_cq *icq;
> >>  
> >>spin_lock_irq(>queue_lock);
> >> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> >> index 7ff5671bf128..0f719c8532ae 100644
> >> --- a/block/blk-mq-sched.h
> >> +++ b/block/blk-mq-sched.h
> >> @@ -8,7 +8,7 @@
> >>  void blk_mq_sched_free_hctx_data(struct request_queue *q,
> >> void (*exit)(struct blk_mq_hw_ctx *));
> >>  
> >> -void blk_mq_sched_assign_ioc(struct request *rq, struct bio *bio);
> >> +void blk_mq_sched_assign_ioc(struct request *rq);
> >>  
> >>  void blk_mq_sched_request_inserted(struct request *rq);
> >>  bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
> >> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >> index 32b246ed44c0..636f80b96fa6 100644
> >> --- a/block/blk-mq.c
> >> +++ b/block/blk-mq.c
> >> @@ -389,8 +389,8 @@ static struct request *blk_mq_get_request(struct 
> >> request_queue *q,
> >>if (!op_is_flush(data->cmd_flags)) {
> >>rq->elv.icq = NULL;
> >>if (e && e->type->ops.prepare_request) {
> >> -  if (e->type->icq_cache && rq_ioc(bio))
> >> -  blk_mq_sched_assign_ioc(rq, bio);
> >> +  if (e->type->icq_cache)
> >> +  blk_mq_sched_assign_ioc(rq);
> >>  
> >>e->type->ops.prepare_re

Re: [PATCH 2/7] block: Remove bio->bi_ioc

2018-11-20 Thread Ming Lei
On Mon, Nov 19, 2018 at 12:51:26PM +0900, Damien Le Moal wrote:
> bio->bi_ioc is never set so always NULL. Remove references to it in
> bio_disassociate_task() and in rq_ioc() and delete this field from
> struct bio. With this change, rq_ioc() always returns
> current->io_context without the need for a bio argument. Further
> simplify the code and make it more readable by also removing this
> helper, which also allows to simplify blk_mq_sched_assign_ioc() by
> removing its bio argument.
> 
> Signed-off-by: Damien Le Moal 
> ---
>  block/bio.c   |  4 
>  block/blk-core.c  |  2 +-
>  block/blk-mq-sched.c  |  4 ++--
>  block/blk-mq-sched.h  |  2 +-
>  block/blk-mq.c|  4 ++--
>  block/blk.h   | 16 
>  include/linux/blk_types.h |  3 +--
>  7 files changed, 7 insertions(+), 28 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 4f4d9884443b..03895cc0d74a 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -2027,10 +2027,6 @@ int bio_associate_blkg(struct bio *bio, struct 
> blkcg_gq *blkg)
>   */
>  void bio_disassociate_task(struct bio *bio)
>  {
> - if (bio->bi_ioc) {
> - put_io_context(bio->bi_ioc);
> - bio->bi_ioc = NULL;
> - }
>   if (bio->bi_css) {
>   css_put(bio->bi_css);
>   bio->bi_css = NULL;
> diff --git a/block/blk-core.c b/block/blk-core.c
> index d6e8ab9ca99d..492648c96992 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -813,7 +813,7 @@ unsigned int blk_plug_queued_count(struct request_queue 
> *q)
>  
>  void blk_init_request_from_bio(struct request *req, struct bio *bio)
>  {
> - struct io_context *ioc = rq_ioc(bio);
> + struct io_context *ioc = current->io_context;
>  
>   if (bio->bi_opf & REQ_RAHEAD)
>   req->cmd_flags |= REQ_FAILFAST_MASK;
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index d084f731d104..13b8dc332541 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -31,10 +31,10 @@ void blk_mq_sched_free_hctx_data(struct request_queue *q,
>  }
>  EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
>  
> -void blk_mq_sched_assign_ioc(struct request *rq, struct bio *bio)
> +void blk_mq_sched_assign_ioc(struct request *rq)
>  {
>   struct request_queue *q = rq->q;
> - struct io_context *ioc = rq_ioc(bio);
> + struct io_context *ioc = current->io_context;
>   struct io_cq *icq;
>  
>   spin_lock_irq(>queue_lock);
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index 7ff5671bf128..0f719c8532ae 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -8,7 +8,7 @@
>  void blk_mq_sched_free_hctx_data(struct request_queue *q,
>void (*exit)(struct blk_mq_hw_ctx *));
>  
> -void blk_mq_sched_assign_ioc(struct request *rq, struct bio *bio);
> +void blk_mq_sched_assign_ioc(struct request *rq);
>  
>  void blk_mq_sched_request_inserted(struct request *rq);
>  bool blk_mq_sched_try_merge(struct request_queue *q, struct bio *bio,
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 32b246ed44c0..636f80b96fa6 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -389,8 +389,8 @@ static struct request *blk_mq_get_request(struct 
> request_queue *q,
>   if (!op_is_flush(data->cmd_flags)) {
>   rq->elv.icq = NULL;
>   if (e && e->type->ops.prepare_request) {
> - if (e->type->icq_cache && rq_ioc(bio))
> - blk_mq_sched_assign_ioc(rq, bio);
> + if (e->type->icq_cache)
> + blk_mq_sched_assign_ioc(rq);
>  
>   e->type->ops.prepare_request(rq, bio);
>   rq->rq_flags |= RQF_ELVPRIV;
> diff --git a/block/blk.h b/block/blk.h
> index 816a9abb87cd..610948157a5b 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -254,22 +254,6 @@ void ioc_clear_queue(struct request_queue *q);
>  
>  int create_task_io_context(struct task_struct *task, gfp_t gfp_mask, int 
> node);
>  
> -/**
> - * rq_ioc - determine io_context for request allocation
> - * @bio: request being allocated is for this bio (can be %NULL)
> - *
> - * Determine io_context to use for request allocation for @bio.  May return
> - * %NULL if %current->io_context doesn't exist.
> - */
> -static inline struct io_context *rq_ioc(struct bio *bio)
> -{
> -#ifdef CONFIG_BLK_CGROUP
> - if (bio && bio->bi_ioc)
> - return bio->bi_ioc;
> -#endif
> - return current->io_context;
> -}
> -
>  /**
>   * create_io_context - try to create task->io_context
>   * @gfp_mask: allocation mask
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index dbdbfbd6a987..c0ba1a038ff3 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -174,10 +174,9 @@ struct bio {
>   void*bi_private;
>  #ifdef CONFIG_BLK_CGROUP
>   /*
> -  * Optional ioc and 

[PATCH 4.21 V3] blk-mq: not embed .mq_kobj and ctx->kobj into queue instance

2018-11-19 Thread Ming Lei
Even though .mq_kobj, ctx->kobj and q->kobj share the same lifetime
from the block layer's view, in reality they don't, because userspace may
grab any of these kobjects at any time via sysfs.

This patch fixes the issue by the following approach:

1) introduce 'struct blk_mq_ctxs' for holding .mq_kobj and managing
all ctxs

2) free all allocated ctxs and the 'blk_mq_ctxs' instance in release
handler of .mq_kobj

3) grab one ref of .mq_kobj before initializing each ctx->kobj, so that
.mq_kobj is always released after all ctxs are freed.

This patch fixes a kernel panic during boot when DEBUG_KOBJECT_RELEASE
is enabled.

Reported-by: Guenter Roeck 
Cc: "jianchao.wang" 
Cc: Guenter Roeck 
Cc: Greg Kroah-Hartman 
Cc: sta...@vger.kernel.org
Signed-off-by: Ming Lei 
---
V3:
- keep allocating q->queue_ctx via the percpu allocator, so one extra
  pointer dereference can be saved when getting a ctx
V2:
- allocate 'blk_mq_ctx' inside blk_mq_init_allocated_queue()
- allocate q->mq_kobj directly 

 block/blk-mq-sysfs.c   | 34 --
 block/blk-mq.c | 39 ---
 block/blk-mq.h |  6 ++
 include/linux/blkdev.h |  2 +-
 4 files changed, 63 insertions(+), 18 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 3d25b9c419e9..6efef1f679f0 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -15,6 +15,18 @@
 
 static void blk_mq_sysfs_release(struct kobject *kobj)
 {
+   struct blk_mq_ctxs *ctxs = container_of(kobj, struct blk_mq_ctxs, kobj);
+
+   free_percpu(ctxs->queue_ctx);
+   kfree(ctxs);
+}
+
+static void blk_mq_ctx_sysfs_release(struct kobject *kobj)
+{
+   struct blk_mq_ctx *ctx = container_of(kobj, struct blk_mq_ctx, kobj);
+
+   /* ctx->ctxs won't be released until all ctx are freed */
+   kobject_put(&ctx->ctxs->kobj);
 }
 
 static void blk_mq_hw_sysfs_release(struct kobject *kobj)
@@ -213,7 +225,7 @@ static struct kobj_type blk_mq_ktype = {
 static struct kobj_type blk_mq_ctx_ktype = {
.sysfs_ops  = &blk_mq_sysfs_ops,
.default_attrs  = default_ctx_attrs,
-   .release= blk_mq_sysfs_release,
+   .release= blk_mq_ctx_sysfs_release,
 };
 
 static struct kobj_type blk_mq_hw_ktype = {
@@ -245,7 +257,7 @@ static int blk_mq_register_hctx(struct blk_mq_hw_ctx *hctx)
if (!hctx->nr_ctx)
return 0;
 
-   ret = kobject_add(&hctx->kobj, &q->mq_kobj, "%u", hctx->queue_num);
+   ret = kobject_add(&hctx->kobj, q->mq_kobj, "%u", hctx->queue_num);
if (ret)
return ret;
 
@@ -268,8 +280,8 @@ void blk_mq_unregister_dev(struct device *dev, struct 
request_queue *q)
queue_for_each_hw_ctx(q, hctx, i)
blk_mq_unregister_hctx(hctx);
 
-   kobject_uevent(&q->mq_kobj, KOBJ_REMOVE);
-   kobject_del(&q->mq_kobj);
+   kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
+   kobject_del(q->mq_kobj);
kobject_put(&dev->kobj);
 
q->mq_sysfs_init_done = false;
@@ -289,7 +301,7 @@ void blk_mq_sysfs_deinit(struct request_queue *q)
ctx = per_cpu_ptr(q->queue_ctx, cpu);
kobject_put(&ctx->kobj);
}
-   kobject_put(&q->mq_kobj);
+   kobject_put(q->mq_kobj);
 }
 
 void blk_mq_sysfs_init(struct request_queue *q)
@@ -297,10 +309,12 @@ void blk_mq_sysfs_init(struct request_queue *q)
struct blk_mq_ctx *ctx;
int cpu;
 
-   kobject_init(&q->mq_kobj, &blk_mq_ktype);
+   kobject_init(q->mq_kobj, &blk_mq_ktype);
 
for_each_possible_cpu(cpu) {
ctx = per_cpu_ptr(q->queue_ctx, cpu);
+
+   kobject_get(q->mq_kobj);
kobject_init(&ctx->kobj, &blk_mq_ctx_ktype);
}
 }
@@ -313,11 +327,11 @@ int __blk_mq_register_dev(struct device *dev, struct 
request_queue *q)
WARN_ON_ONCE(!q->kobj.parent);
lockdep_assert_held(>sysfs_lock);
 
-   ret = kobject_add(&q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
+   ret = kobject_add(q->mq_kobj, kobject_get(&dev->kobj), "%s", "mq");
if (ret < 0)
goto out;
 
-   kobject_uevent(&q->mq_kobj, KOBJ_ADD);
+   kobject_uevent(q->mq_kobj, KOBJ_ADD);
 
queue_for_each_hw_ctx(q, hctx, i) {
ret = blk_mq_register_hctx(hctx);
@@ -334,8 +348,8 @@ int __blk_mq_register_dev(struct device *dev, struct 
request_queue *q)
while (--i >= 0)
blk_mq_unregister_hctx(q->queue_hw_ctx[i]);
 
-   kobject_uevent(&q->mq_kobj, KOBJ_REMOVE);
-   kobject_del(&q->mq_kobj);
+   kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
+   kobject_del(q->mq_kobj);
kobject_put(&dev->kobj);
return ret;
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 32b246ed44c0..9228f2e9bd40 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.

Re: [PATCH V2 for-4.21 2/2] blk-mq: alloc q->queue_ctx as normal array

2018-11-19 Thread Ming Lei
On Mon, Nov 19, 2018 at 11:17:49AM +0100, Greg Kroah-Hartman wrote:
> On Mon, Nov 19, 2018 at 10:04:27AM +0800, Ming Lei wrote:
> > On Sat, Nov 17, 2018 at 11:03:42AM +0100, Greg Kroah-Hartman wrote:
> > > On Sat, Nov 17, 2018 at 10:34:18AM +0800, Ming Lei wrote:
> > > > On Fri, Nov 16, 2018 at 06:06:23AM -0800, Greg Kroah-Hartman wrote:
> > > > > On Fri, Nov 16, 2018 at 07:23:11PM +0800, Ming Lei wrote:
> > > > > > Now q->queue_ctx is just one read-mostly table for query the
> > > > > > 'blk_mq_ctx' instance from one cpu index, it isn't necessary
> > > > > > to allocate it as percpu variable. One simple array may be
> > > > > > more efficient.
> > > > > 
> > > > > "may be", have you run benchmarks to be sure?  If so, can you add the
> > > > > results of them to this changelog?  If there is no measurable
> > > > > difference, then why make this change at all?
> > > > 
> > > > __blk_mq_get_ctx() is used in fast path, what do you think about which
> > > > one is more efficient?
> > > > 
> > > > - *per_cpu_ptr(q->queue_ctx, cpu);
> > > > 
> > > > - q->queue_ctx[cpu]
> > > 
> > > You need to actually test to see which one is faster, you might be
> > > surprised :)
> > > 
> > > In other words, do not just guess.
> > 
> > No performance difference is observed wrt. this patchset when I
> > run the following fio test on null_blk(modprobe null_blk) in my VM:
> > 
> > fio --direct=1 --size=128G --bsrange=4k-4k --runtime=40 --numjobs=32 \
> >   --ioengine=libaio --iodepth=64 --group_reporting=1 --filename=/dev/nullb0 
> > \
> >   --name=null_blk-ttest-randread --rw=randread
> > 
> > Running test is important, but IMO it is more important to understand
> > the idea behind is correct, or the approach can be proved as correct.
> > 
> > Given the count of test cases can be increased exponentially when the 
> > related
> > factors or settings are covered, obviously we can't run all the test cases.
> 
> And what happens when you start to scale the number of queues and cpus
> in the system?

It is related to the number of CPUs.

> Does both options work the same?

This patch may introduce one extra memory read for getting one
'blk_mq_ctx' instance.

> Why did the original code have per-cpu variables?

Each instance of 'struct blk_mq_ctx' is logically per-CPU, so it is natural
to allocate it as one percpu variable.

However, there is a kobject lifetime issue if we allocate all 'blk_mq_ctx'
instances as one single percpu variable, because we can't release just one
part (for one CPU) of that single percpu variable.

So this patch converts the percpu variable into one lookup table (read-only)
plus one 'blk_mq_ctx' instance for each CPU, and all the instances are
allocated from the local NUMA node of their CPU.

Another approach is to keep the percpu allocation and introduce one
reference counter for counting how many active 'ctx' there are, and only
free the percpu variable when the ref drops to zero; then we could save the
one extra memory read in __blk_mq_get_ctx().
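
For reference, the fast-path difference between the two choices looks like
this (function names are made up, and q->queue_ctx obviously has a
different type in each variant):

/* current: every ctx is carved out of one percpu allocation */
static inline struct blk_mq_ctx *
ctx_from_percpu(struct request_queue *q, unsigned int cpu)
{
	return per_cpu_ptr(q->queue_ctx, cpu);
}

/* alternative: q->queue_ctx is a plain read-mostly 'struct blk_mq_ctx **' table */
static inline struct blk_mq_ctx *
ctx_from_table(struct request_queue *q, unsigned int cpu)
{
	return q->queue_ctx[cpu];
}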


Thanks,
Ming


Re: [PATCH V2 for-4.21 1/2] blk-mq: not embed .mq_kobj and ctx->kobj into queue instance

2018-11-19 Thread Ming Lei
On Mon, Nov 19, 2018 at 11:06:06AM +0100, Greg Kroah-Hartman wrote:
> On Sat, Nov 17, 2018 at 10:26:38AM +0800, Ming Lei wrote:
> > On Fri, Nov 16, 2018 at 06:05:21AM -0800, Greg Kroah-Hartman wrote:
> > > On Fri, Nov 16, 2018 at 07:23:10PM +0800, Ming Lei wrote:
> > > > @@ -456,7 +456,7 @@ struct request_queue {
> > > > /*
> > > >  * mq queue kobject
> > > >  */
> > > > -   struct kobject mq_kobj;
> > > > +   struct kobject *mq_kobj;
> > > 
> > > What is this kobject even used for?  It wasn't obvious at all from this
> > > patch, why is it needed if you are not using it to reference count the
> > > larger structure here?
> > 
> > All attributes and kobjects under /sys/block/$DEV/mq are covered by this 
> > kobject
> > actually, and all are for exposing blk-mq specific information, but now 
> > there is
> > only blk-mq, and legacy io path is removed.
> 
> I am sorry, but I really can not parse this sentance at all.
> 
> What Documentation/ABI/ entries are covered by this kobject, that should
> help me out more.  And what do you mean by "legacy io"?

After blk-mq was introduced, there are two main IO request paths:

1) blk-mq, IO is queued via blk_mq_make_request()

2) legacy IO path, IO is queued via blk_queue_bio()

The legacy IO path will be removed in 4.21, and patches have been queued
in for-4.21/block.

> 
> > That is why I mentioned we may remove this kobject last time and move all 
> > under
> > /sys/block/$DEV/queue, however you thought that may break some userspace.
> 
> Who relies on these sysfs files today?

I don't know, actually the question is from you, :-(

https://marc.info/?l=linux-kernel&m=154205455332755&w=2

Even if we remove q->mq_kobj, the same kobject lifetime issue is still there
for the kobjects embedded in 'struct blk_mq_ctx'.

> 
> > If we want to backport them to stable, this patch may be a bit easier to go.
> 
> Why do you want to backport any of this to stable?

For this report from Guenter, it should be enough to backport the 1st patch,
and it shouldn't be very hard to backport both. 

I can help backport them if the patches can't be applied cleanly.

Thanks,
Ming


Re: [PATCH V10 18/19] block: kill QUEUE_FLAG_NO_SG_MERGE

2018-11-19 Thread Ming Lei
On Fri, Nov 16, 2018 at 02:58:03PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 15, 2018 at 04:53:05PM +0800, Ming Lei wrote:
> > Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after 
> > splitting"),
> > physical segment number is mainly figured out in blk_queue_split() for
> > fast path, and the flag of BIO_SEG_VALID is set there too.
> > 
> > Now only blk_recount_segments() and blk_recalc_rq_segments() use this
> > flag.
> > 
> > Basically blk_recount_segments() is bypassed in fast path given 
> > BIO_SEG_VALID
> > is set in blk_queue_split().
> > 
> > For another user of blk_recalc_rq_segments():
> > 
> > - run in partial completion branch of blk_update_request, which is an 
> > unusual case
> > 
> > - run in blk_cloned_rq_check_limits(), still not a big problem if the flag 
> > is killed
> > since dm-rq is the only user.
> > 
> > Multi-page bvec is enabled now, QUEUE_FLAG_NO_SG_MERGE doesn't make sense 
> > any more.
> > 
> > Cc: Dave Chinner 
> > Cc: Kent Overstreet 
> > Cc: Mike Snitzer 
> > Cc: dm-de...@redhat.com
> > Cc: Alexander Viro 
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Shaohua Li 
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-er...@lists.ozlabs.org
> > Cc: David Sterba 
> > Cc: linux-bt...@vger.kernel.org
> > Cc: Darrick J. Wong 
> > Cc: linux-...@vger.kernel.org
> > Cc: Gao Xiang 
> > Cc: Christoph Hellwig 
> > Cc: Theodore Ts'o 
> > Cc: linux-e...@vger.kernel.org
> > Cc: Coly Li 
> > Cc: linux-bca...@vger.kernel.org
> > Cc: Boaz Harrosh 
> > Cc: Bob Peterson 
> > Cc: cluster-de...@redhat.com
> > Signed-off-by: Ming Lei 
> > ---
> >  block/blk-merge.c  | 31 ++-
> >  block/blk-mq-debugfs.c |  1 -
> >  block/blk-mq.c |  3 ---
> >  drivers/md/dm-table.c  | 13 -
> >  include/linux/blkdev.h |  1 -
> >  5 files changed, 6 insertions(+), 43 deletions(-)
> > 
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index 153a659fde74..06be298be332 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -351,8 +351,7 @@ void blk_queue_split(struct request_queue *q, struct 
> > bio **bio)
> >  EXPORT_SYMBOL(blk_queue_split);
> >  
> >  static unsigned int __blk_recalc_rq_segments(struct request_queue *q,
> > -struct bio *bio,
> > -bool no_sg_merge)
> > +struct bio *bio)
> >  {
> > struct bio_vec bv, bvprv = { NULL };
> > int cluster, prev = 0;
> > @@ -379,13 +378,6 @@ static unsigned int __blk_recalc_rq_segments(struct 
> > request_queue *q,
> > nr_phys_segs = 0;
> > for_each_bio(bio) {
> > bio_for_each_bvec(bv, bio, iter) {
> > -   /*
> > -* If SG merging is disabled, each bio vector is
> > -* a segment
> > -*/
> > -   if (no_sg_merge)
> > -   goto new_segment;
> > -
> > if (prev && cluster) {
> > if (seg_size + bv.bv_len
> > > queue_max_segment_size(q))
> > @@ -420,27 +412,16 @@ static unsigned int __blk_recalc_rq_segments(struct 
> > request_queue *q,
> >  
> >  void blk_recalc_rq_segments(struct request *rq)
> >  {
> > -   bool no_sg_merge = !!test_bit(QUEUE_FLAG_NO_SG_MERGE,
> > -   >q->queue_flags);
> > -
> > -   rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio,
> > -   no_sg_merge);
> > +   rq->nr_phys_segments = __blk_recalc_rq_segments(rq->q, rq->bio);
> 
> Can we rename __blk_recalc_rq_segments to blk_recalc_rq_segments
> can kill the old blk_recalc_rq_segments now?

Sure.

Thanks,
Ming


Re: [PATCH V10 18/19] block: kill QUEUE_FLAG_NO_SG_MERGE

2018-11-19 Thread Ming Lei
On Thu, Nov 15, 2018 at 06:18:11PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:53:05PM +0800, Ming Lei wrote:
> > Since bdced438acd83ad83a6c ("block: setup bi_phys_segments after 
> > splitting"),
> > physical segment number is mainly figured out in blk_queue_split() for
> > fast path, and the flag of BIO_SEG_VALID is set there too.
> > 
> > Now only blk_recount_segments() and blk_recalc_rq_segments() use this
> > flag.
> > 
> > Basically blk_recount_segments() is bypassed in fast path given 
> > BIO_SEG_VALID
> > is set in blk_queue_split().
> > 
> > For another user of blk_recalc_rq_segments():
> > 
> > - run in partial completion branch of blk_update_request, which is an 
> > unusual case
> > 
> > - run in blk_cloned_rq_check_limits(), still not a big problem if the flag 
> > is killed
> > since dm-rq is the only user.
> > 
> > Multi-page bvec is enabled now, QUEUE_FLAG_NO_SG_MERGE doesn't make sense 
> > any more.
> 
> This commit message wasn't very clear. Is it the case that
> QUEUE_FLAG_NO_SG_MERGE is no longer set by any drivers?

OK, I will add the explanation to the commit log in the next version.

05f1dd53152173 ("block: add queue flag for disabling SG merging") introduced
this flag for NVMe performance purposes only, so that merging bvecs into
segments could be bypassed for NVMe.

Actually this optimization was already bypassed by 54efd50bfd873e2d ("block:
make generic_make_request handle arbitrarily sized bios") and
bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting").

Now segment computation can be very quick, given that most of the time one
bvec can be treated as one segment, so we can remove the flag.

thanks, 
Ming


Re: [PATCH V10 17/19] block: don't use bio->bi_vcnt to figure out segment number

2018-11-19 Thread Ming Lei
On Thu, Nov 15, 2018 at 06:11:40PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:53:04PM +0800, Ming Lei wrote:
> > It is wrong to use bio->bi_vcnt to figure out how many segments
> > there are in the bio even though CLONED flag isn't set on this bio,
> > because this bio may be splitted or advanced.
> > 
> > So always use bio_segments() in blk_recount_segments(), and it shouldn't
> > cause any performance loss now because the physical segment number is 
> > figured
> > out in blk_queue_split() and BIO_SEG_VALID is set meantime since
> > bdced438acd83ad83a6c ("block: setup bi_phys_segments after splitting").
> > 
> > Cc: Dave Chinner 
> > Cc: Kent Overstreet 
> > Fixes: 7f60dcaaf91 ("block: blk-merge: fix blk_recount_segments()")
> 
> From what I can tell, the problem was originally introduced by
> 76d8137a3113 ("blk-merge: recaculate segment if it isn't less than max 
> segments")
> 
> Is that right?

Indeed, will update it in next version.

Thanks,
Ming


Re: [PATCH V10 15/19] block: always define BIO_MAX_PAGES as 256

2018-11-19 Thread Ming Lei
On Thu, Nov 15, 2018 at 05:59:36PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:53:02PM +0800, Ming Lei wrote:
> > Now multi-page bvec can cover CONFIG_THP_SWAP, so we don't need to
> > increase BIO_MAX_PAGES for it.
> 
> You mentioned to it in the cover letter, but this needs more explanation
> in the commit message. Why did CONFIG_THP_SWAP require > 256? Why does
> multipage bvecs remove that requirement?

CONFIG_THP_SWAP needs to split one THP into normal pages and add them all
to one bio. With multi-page bvecs, it just takes one bvec to hold them all.
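
For example, on x86-64 with a 2MB THP ('thp' below is just a hypothetical
head page pointer):

	/*
	 * Before multi-page bvecs: HPAGE_PMD_NR (512) single-page bvecs
	 * were needed.  With multi-page bvecs: one bvec describes the
	 * whole huge page.
	 */
	struct bio_vec bv = {
		.bv_page   = thp,
		.bv_offset = 0,
		.bv_len    = HPAGE_PMD_SIZE,	/* 512 * PAGE_SIZE = 2MB */
	};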

thanks,
Ming


Re: [PATCH V10 14/19] block: enable multipage bvecs

2018-11-19 Thread Ming Lei
On Fri, Nov 16, 2018 at 02:53:08PM +0100, Christoph Hellwig wrote:
> > -
> > -   if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
> > -   bv->bv_len += len;
> > -   bio->bi_iter.bi_size += len;
> > -   return true;
> > -   }
> > +   struct request_queue *q = NULL;
> > +
> > +   if (page == bv->bv_page && off == (bv->bv_offset + bv->bv_len)
> > +   && (off + len) <= PAGE_SIZE)
> 
> How could the page struct be the same, but the range beyond PAGE_SIZE
> (at least with the existing callers)?
> 
> Also no need for the inner btraces, and the && always goes on the
> first line.

OK.

> 
> > +   if (bio->bi_disk)
> > +   q = bio->bi_disk->queue;
> > +
> > +   /* disable multi-page bvec too if cluster isn't enabled */
> > +   if (!q || !blk_queue_cluster(q) ||
> > +   ((page_to_phys(bv->bv_page) + bv->bv_offset + bv->bv_len) !=
> > +(page_to_phys(page) + off)))
> > +   return false;
> > + merge:
> > +   bv->bv_len += len;
> > +   bio->bi_iter.bi_size += len;
> > +   return true;
> 
> Ok, this is scary, as it will give differen results depending on when
> bi_disk is assigned.

It only affects whether we merge or not, and both cases are handled well now.

> But then again we shouldn't really do the cluster
> check here, but rather when splitting the bio for the actual low-level
> driver.

Yeah, I thought of this way too, but it may cause tons of bio splits for
non-clustering queues, and there are quite a few SCSI devices which require
clustering to be disabled.

[linux]$ git grep -n DISABLE_CLUSTERING ./drivers/scsi/ | wc -l
 28

Or we may introduce bio_split_to_single_page_bvec() to allocate and convert
to a single-page bvec table for non-clustering queues; I will try this
approach in the next version.

> 
> (and eventually we should kill this clustering setting off in favor
> of our normal segment limits).

Yeah, it has been in my post-multi-page todo list already, :-)

thanks,
Ming


Re: [PATCH V10 14/19] block: enable multipage bvecs

2018-11-19 Thread Ming Lei
On Thu, Nov 15, 2018 at 05:56:27PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:53:01PM +0800, Ming Lei wrote:
> > This patch pulls the trigger for multi-page bvecs.
> > 
> > Now any request queue which supports queue cluster will see multi-page
> > bvecs.
> > 
> > Cc: Dave Chinner 
> > Cc: Kent Overstreet 
> > Cc: Mike Snitzer 
> > Cc: dm-de...@redhat.com
> > Cc: Alexander Viro 
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Shaohua Li 
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-er...@lists.ozlabs.org
> > Cc: David Sterba 
> > Cc: linux-bt...@vger.kernel.org
> > Cc: Darrick J. Wong 
> > Cc: linux-...@vger.kernel.org
> > Cc: Gao Xiang 
> > Cc: Christoph Hellwig 
> > Cc: Theodore Ts'o 
> > Cc: linux-e...@vger.kernel.org
> > Cc: Coly Li 
> > Cc: linux-bca...@vger.kernel.org
> > Cc: Boaz Harrosh 
> > Cc: Bob Peterson 
> > Cc: cluster-de...@redhat.com
> > Signed-off-by: Ming Lei 
> > ---
> >  block/bio.c | 24 ++--
> >  1 file changed, 18 insertions(+), 6 deletions(-)
> > 
> > diff --git a/block/bio.c b/block/bio.c
> > index 6486722d4d4b..ed6df6f8e63d 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> 
> This comment above __bio_try_merge_page() doesn't make sense after this
> change:
> 
>  This is a
>  a useful optimisation for file systems with a block size smaller than the
>  page size.
> 
> Can you please get rid of it in this patch?

I understand __bio_try_merge_page() still works for the original cases, so
it looks like the optimization for sub-page block sizes is still there too,
isn't it?

> 
> > @@ -767,12 +767,24 @@ bool __bio_try_merge_page(struct bio *bio, struct 
> > page *page,
> >  
> > if (bio->bi_vcnt > 0) {
> > struct bio_vec *bv = >bi_io_vec[bio->bi_vcnt - 1];
> > -
> > -   if (page == bv->bv_page && off == bv->bv_offset + bv->bv_len) {
> > -   bv->bv_len += len;
> > -   bio->bi_iter.bi_size += len;
> > -   return true;
> > -   }
> > +   struct request_queue *q = NULL;
> > +
> > +   if (page == bv->bv_page && off == (bv->bv_offset + bv->bv_len)
> > +   && (off + len) <= PAGE_SIZE)
> > +   goto merge;
> 
> The parentheses around (bv->bv_offset + bv->bv_len) and (off + len) are
> unnecessary noise.
> 
> What's the point of the new (off + len) <= PAGE_SIZE check?

Yeah, I don't know why I did that, :-( The check is always true.

> 
> > +
> > +   if (bio->bi_disk)
> > +   q = bio->bi_disk->queue;
> > +
> > +   /* disable multi-page bvec too if cluster isn't enabled */
> > +   if (!q || !blk_queue_cluster(q) ||
> > +   ((page_to_phys(bv->bv_page) + bv->bv_offset + bv->bv_len) !=
> > +(page_to_phys(page) + off)))
> 
> More unnecessary parentheses here.

OK.

Thanks,
Ming


Re: [PATCH V10 13/19] iomap & xfs: only account for new added page

2018-11-19 Thread Ming Lei
On Fri, Nov 16, 2018 at 02:49:36PM +0100, Christoph Hellwig wrote:
> I'd much rather have __bio_try_merge_page only do merges in
> the same page, and have a new __bio_try_merge_segment that does
> multi-page merges.  This will keep the accounting a lot simpler.

This approach looks clever; I will do it in the next version.

Thanks,
Ming


Re: [PATCH V10 13/19] iomap & xfs: only account for new added page

2018-11-19 Thread Ming Lei
On Thu, Nov 15, 2018 at 05:46:58PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:53:00PM +0800, Ming Lei wrote:
> > After multi-page is enabled, one new page may be merged to a segment
> > even though it is a new added page.
> > 
> > This patch deals with this issue by post-check in case of merge, and
> > only a freshly new added page need to be dealt with for iomap & xfs.
> > 
> > Cc: Dave Chinner 
> > Cc: Kent Overstreet 
> > Cc: Mike Snitzer 
> > Cc: dm-de...@redhat.com
> > Cc: Alexander Viro 
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Shaohua Li 
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-er...@lists.ozlabs.org
> > Cc: David Sterba 
> > Cc: linux-bt...@vger.kernel.org
> > Cc: Darrick J. Wong 
> > Cc: linux-...@vger.kernel.org
> > Cc: Gao Xiang 
> > Cc: Christoph Hellwig 
> > Cc: Theodore Ts'o 
> > Cc: linux-e...@vger.kernel.org
> > Cc: Coly Li 
> > Cc: linux-bca...@vger.kernel.org
> > Cc: Boaz Harrosh 
> > Cc: Bob Peterson 
> > Cc: cluster-de...@redhat.com
> > Signed-off-by: Ming Lei 
> > ---
> >  fs/iomap.c  | 22 ++
> >  fs/xfs/xfs_aops.c   | 10 --
> >  include/linux/bio.h | 11 +++
> >  3 files changed, 33 insertions(+), 10 deletions(-)
> > 
> > diff --git a/fs/iomap.c b/fs/iomap.c
> > index df0212560b36..a1b97a5c726a 100644
> > --- a/fs/iomap.c
> > +++ b/fs/iomap.c
> > @@ -288,6 +288,7 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
> > loff_t length, void *data,
> > loff_t orig_pos = pos;
> > unsigned poff, plen;
> > sector_t sector;
> > +   bool need_account = false;
> >  
> > if (iomap->type == IOMAP_INLINE) {
> > WARN_ON_ONCE(pos);
> > @@ -313,18 +314,15 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
> > loff_t length, void *data,
> >  */
> > sector = iomap_sector(iomap, pos);
> > if (ctx->bio && bio_end_sector(ctx->bio) == sector) {
> > -   if (__bio_try_merge_page(ctx->bio, page, plen, poff))
> > +   if (__bio_try_merge_page(ctx->bio, page, plen, poff)) {
> > +   need_account = iop && bio_is_last_segment(ctx->bio,
> > +   page, plen, poff);
> 
> It's redundant to make this iop && ... since you already check
> iop && need_account below. Maybe rename it to added_page? Also, this
> indentation is wack.

We can avoid calling bio_is_last_segment() in the !iop case, and I will
fix the indentation.

added_page does look like a better name.

> 
> > goto done;
> > +   }
> > is_contig = true;
> > }
> >  
> > -   /*
> > -* If we start a new segment we need to increase the read count, and we
> > -* need to do so before submitting any previous full bio to make sure
> > -* that we don't prematurely unlock the page.
> > -*/
> > -   if (iop)
> > -   atomic_inc(>read_count);
> > +   need_account = true;
> >  
> > if (!ctx->bio || !is_contig || bio_full(ctx->bio)) {
> > gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
> > @@ -347,6 +345,14 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
> > loff_t length, void *data,
> > __bio_add_page(ctx->bio, page, plen, poff);
> >  done:
> > /*
> > +* If we add a new page we need to increase the read count, and we
> > +* need to do so before submitting any previous full bio to make sure
> > +* that we don't prematurely unlock the page.
> > +*/
> > +   if (iop && need_account)
> > +   atomic_inc(>read_count);
> > +
> > +   /*
> >  * Move the caller beyond our range so that it keeps making progress.
> >  * For that we have to include any leading non-uptodate ranges, but
> >  * we can skip trailing ones as they will be handled in the next
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 1f1829e506e8..d8e9cc9f751a 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -603,6 +603,7 @@ xfs_add_to_ioend(
> > unsignedlen = i_blocksize(inode);
> > unsignedpoff = offset & (PAGE_SIZE - 1);
> > sector_tsector;
> > +   boolneed_account;
> >  
> > sector = xfs_fsb_to_db(ip, wpc->imap.br_startblock) +
> > ((of

Re: [PATCH V10 12/19] block: allow bio_for_each_segment_all() to iterate over multi-page bvec

2018-11-19 Thread Ming Lei
On Thu, Nov 15, 2018 at 05:22:45PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:52:59PM +0800, Ming Lei wrote:
> > This patch introduces one extra iterator variable to 
> > bio_for_each_segment_all(),
> > then we can allow bio_for_each_segment_all() to iterate over multi-page 
> > bvec.
> > 
> > Given it is just one mechannical & simple change on all 
> > bio_for_each_segment_all()
> > users, this patch does tree-wide change in one single patch, so that we can
> > avoid to use a temporary helper for this conversion.
> > 
> > Cc: Dave Chinner 
> > Cc: Kent Overstreet 
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Alexander Viro 
> > Cc: Shaohua Li 
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-er...@lists.ozlabs.org
> > Cc: linux-bt...@vger.kernel.org
> > Cc: David Sterba 
> > Cc: Darrick J. Wong 
> > Cc: Gao Xiang 
> > Cc: Christoph Hellwig 
> > Cc: Theodore Ts'o 
> > Cc: linux-e...@vger.kernel.org
> > Cc: Coly Li 
> > Cc: linux-bca...@vger.kernel.org
> > Cc: Boaz Harrosh 
> > Cc: Bob Peterson 
> > Cc: cluster-de...@redhat.com
> > Signed-off-by: Ming Lei 
> > ---
> >  block/bio.c   | 27 ++-
> >  block/blk-zoned.c |  1 +
> >  block/bounce.c|  6 --
> >  drivers/md/bcache/btree.c |  3 ++-
> >  drivers/md/dm-crypt.c |  3 ++-
> >  drivers/md/raid1.c|  3 ++-
> >  drivers/staging/erofs/data.c  |  3 ++-
> >  drivers/staging/erofs/unzip_vle.c |  3 ++-
> >  fs/block_dev.c|  6 --
> >  fs/btrfs/compression.c|  3 ++-
> >  fs/btrfs/disk-io.c|  3 ++-
> >  fs/btrfs/extent_io.c  | 12 
> >  fs/btrfs/inode.c  |  6 --
> >  fs/btrfs/raid56.c |  3 ++-
> >  fs/crypto/bio.c   |  3 ++-
> >  fs/direct-io.c|  4 +++-
> >  fs/exofs/ore.c|  3 ++-
> >  fs/exofs/ore_raid.c   |  3 ++-
> >  fs/ext4/page-io.c |  3 ++-
> >  fs/ext4/readpage.c|  3 ++-
> >  fs/f2fs/data.c|  9 ++---
> >  fs/gfs2/lops.c|  6 --
> >  fs/gfs2/meta_io.c |  3 ++-
> >  fs/iomap.c|  6 --
> >  fs/mpage.c|  3 ++-
> >  fs/xfs/xfs_aops.c |  5 +++--
> >  include/linux/bio.h   | 11 +--
> >  include/linux/bvec.h  | 31 +++
> >  28 files changed, 129 insertions(+), 46 deletions(-)
> > 
> 
> [snip]
> 
> > diff --git a/include/linux/bio.h b/include/linux/bio.h
> > index 3496c816946e..1a2430a8b89d 100644
> > --- a/include/linux/bio.h
> > +++ b/include/linux/bio.h
> > @@ -131,12 +131,19 @@ static inline bool bio_full(struct bio *bio)
> > return bio->bi_vcnt >= bio->bi_max_vecs;
> >  }
> >  
> > +#define bvec_for_each_segment(bv, bvl, i, iter_all)
> > \
> > +   for (bv = bvec_init_iter_all(_all);\
> > +   (iter_all.done < (bvl)->bv_len) &&  \
> > +   ((bvec_next_segment((bvl), _all)), 1); \
> 
> The parentheses around (bvec_next_segment((bvl), _all)) are
> unnecessary.

OK.

> 
> > +   iter_all.done += bv->bv_len, i += 1)
> > +
> >  /*
> >   * drivers should _never_ use the all version - the bio may have been split
> >   * before it got to the driver and the driver won't own all of it
> >   */
> > -#define bio_for_each_segment_all(bvl, bio, i)  
> > \
> > -   for (i = 0, bvl = (bio)->bi_io_vec; i < (bio)->bi_vcnt; i++, bvl++)
> > +#define bio_for_each_segment_all(bvl, bio, i, iter_all)\
> > +   for (i = 0, iter_all.idx = 0; iter_all.idx < (bio)->bi_vcnt; 
> > iter_all.idx++)\
> > +   bvec_for_each_segment(bvl, &((bio)->bi_io_vec[iter_all.idx]), 
> > i, iter_all)
> 
> Would it be possible to move i into iter_all to streamline this a bit?

That may cause unnecessary conversion work for us, because the local
variable 'i' is defined in the external (calling) function.
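
So each caller keeps declaring 'i' itself; a converted call site just gains
the extra iterator, roughly:

	struct bio_vec *bvl;
	struct bvec_iter_all iter_all;
	int i;

	bio_for_each_segment_all(bvl, bio, i, iter_all) {
		/* 'bvl' points at one single-page segment per iteration */
	}
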

> 
> >  static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter 
> > *iter,
> >   unsigned bytes, bool mp)
> > diff --git a/include

Re: [PATCH V10 12/19] block: allow bio_for_each_segment_all() to iterate over multi-page bvec

2018-11-19 Thread Ming Lei
On Thu, Nov 15, 2018 at 01:42:52PM +0100, David Sterba wrote:
> On Thu, Nov 15, 2018 at 04:52:59PM +0800, Ming Lei wrote:
> > diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> > index 13ba2011a306..789b09ae402a 100644
> > --- a/block/blk-zoned.c
> > +++ b/block/blk-zoned.c
> > @@ -123,6 +123,7 @@ static int blk_report_zones(struct gendisk *disk, 
> > sector_t sector,
> > unsigned int z = 0, n, nrz = *nr_zones;
> > sector_t capacity = get_capacity(disk);
> > int ret;
> > +   struct bvec_iter_all iter_all;
> >  
> > while (z < nrz && sector < capacity) {
> > n = nrz - z;
> 
> iter_all is added but not used and I don't see any
> bio_for_each_segment_all for conversion in this function.

Good catch, will fix it in next version.

Thanks,
Ming


Re: [PATCH V10 11/19] bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()

2018-11-19 Thread Ming Lei
On Fri, Nov 16, 2018 at 02:46:45PM +0100, Christoph Hellwig wrote:
> > -   bio_for_each_segment_all(bv, bio, i) {
> > +   for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; bv++) {
> 
> This really needs a comment.  Otherwise it looks fine to me.

OK, will do it in next version.

Thanks,
Ming


Re: [PATCH V10 11/19] bcache: avoid to use bio_for_each_segment_all() in bch_bio_alloc_pages()

2018-11-19 Thread Ming Lei
On Thu, Nov 15, 2018 at 04:44:02PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:52:58PM +0800, Ming Lei wrote:
> > bch_bio_alloc_pages() is always called on a new bio, so it is safe
> > to access the bvec table directly. Given this is the only case of its
> > kind, open-code the bvec table access, since bio_for_each_segment_all()
> > will be changed to support iterating over multi-page bvecs.
> > 
> > Cc: Dave Chinner 
> > Cc: Kent Overstreet 
> > Acked-by: Coly Li 
> > Cc: Mike Snitzer 
> > Cc: dm-de...@redhat.com
> > Cc: Alexander Viro 
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Shaohua Li 
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-er...@lists.ozlabs.org
> > Cc: David Sterba 
> > Cc: linux-bt...@vger.kernel.org
> > Cc: Darrick J. Wong 
> > Cc: linux-...@vger.kernel.org
> > Cc: Gao Xiang 
> > Cc: Christoph Hellwig 
> > Cc: Theodore Ts'o 
> > Cc: linux-e...@vger.kernel.org
> > Cc: Coly Li 
> > Cc: linux-bca...@vger.kernel.org
> > Cc: Boaz Harrosh 
> > Cc: Bob Peterson 
> > Cc: cluster-de...@redhat.com
> > Signed-off-by: Ming Lei 
> > ---
> >  drivers/md/bcache/util.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/md/bcache/util.c b/drivers/md/bcache/util.c
> > index 20eddeac1531..8517aebcda2d 100644
> > --- a/drivers/md/bcache/util.c
> > +++ b/drivers/md/bcache/util.c
> > @@ -270,7 +270,7 @@ int bch_bio_alloc_pages(struct bio *bio, gfp_t gfp_mask)
> > int i;
> > struct bio_vec *bv;
> >  
> > -   bio_for_each_segment_all(bv, bio, i) {
> > +   for (i = 0, bv = bio->bi_io_vec; i < bio->bi_vcnt; bv++) {
> 
> This is missing an i++.

Good catch, will fix it in next version.

thanks,
Ming


Re: [PATCH V10 10/19] block: loop: pass multi-page bvec to iov_iter

2018-11-19 Thread Ming Lei
On Thu, Nov 15, 2018 at 04:40:22PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:52:57PM +0800, Ming Lei wrote:
> > iov_iter is implemented on top of the bvec iterator, so it is safe to
> > pass a multi-page bvec to it, and this way is much more efficient than
> > passing one page in each bvec.
> > 
> > Cc: Dave Chinner 
> > Cc: Kent Overstreet 
> > Cc: Mike Snitzer 
> > Cc: dm-de...@redhat.com
> > Cc: Alexander Viro 
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Shaohua Li 
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-er...@lists.ozlabs.org
> > Cc: David Sterba 
> > Cc: linux-bt...@vger.kernel.org
> > Cc: Darrick J. Wong 
> > Cc: linux-...@vger.kernel.org
> > Cc: Gao Xiang 
> > Cc: Christoph Hellwig 
> > Cc: Theodore Ts'o 
> > Cc: linux-e...@vger.kernel.org
> > Cc: Coly Li 
> > Cc: linux-bca...@vger.kernel.org
> > Cc: Boaz Harrosh 
> > Cc: Bob Peterson 
> > Cc: cluster-de...@redhat.com
> 
> Reviewed-by: Omar Sandoval 
> 
> Comments below.
> 
> > Signed-off-by: Ming Lei 
> > ---
> >  drivers/block/loop.c | 23 ---
> >  1 file changed, 12 insertions(+), 11 deletions(-)
> > 
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index bf6bc35aaf88..a3fd418ec637 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -515,16 +515,16 @@ static int lo_rw_aio(struct loop_device *lo, struct 
> > loop_cmd *cmd,
> > struct bio *bio = rq->bio;
> > struct file *file = lo->lo_backing_file;
> > unsigned int offset;
> > -   int segments = 0;
> > +   int nr_bvec = 0;
> > int ret;
> >  
> > if (rq->bio != rq->biotail) {
> > -   struct req_iterator iter;
> > +   struct bvec_iter iter;
> > struct bio_vec tmp;
> >  
> > __rq_for_each_bio(bio, rq)
> > -   segments += bio_segments(bio);
> > -   bvec = kmalloc_array(segments, sizeof(struct bio_vec),
> > +   nr_bvec += bio_bvecs(bio);
> > +   bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec),
> >  GFP_NOIO);
> > if (!bvec)
> > return -EIO;
> > @@ -533,13 +533,14 @@ static int lo_rw_aio(struct loop_device *lo, struct 
> > loop_cmd *cmd,
> > /*
> >  * The bios of the request may be started from the middle of
> >  * the 'bvec' because of bio splitting, so we can't directly
> > -* copy bio->bi_iov_vec to new bvec. The rq_for_each_segment
> > +* copy bio->bi_iov_vec to new bvec. The bio_for_each_bvec
> >  * API will take care of all details for us.
> >  */
> > -   rq_for_each_segment(tmp, rq, iter) {
> > -   *bvec = tmp;
> > -   bvec++;
> > -   }
> > +   __rq_for_each_bio(bio, rq)
> > +   bio_for_each_bvec(tmp, bio, iter) {
> > +   *bvec = tmp;
> > +   bvec++;
> > +   }
> 
> Even if they're not strictly necessary, could you please include the
> curly braces for __rq_for_each_bio() here?

Sure, will do it.

> 
> > bvec = cmd->bvec;
> > offset = 0;
> > } else {
> > @@ -550,11 +551,11 @@ static int lo_rw_aio(struct loop_device *lo, struct 
> > loop_cmd *cmd,
> >  */
> > offset = bio->bi_iter.bi_bvec_done;
> > bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
> > -   segments = bio_segments(bio);
> > +   nr_bvec = bio_bvecs(bio);
> 
> This scared me for a second, but it's fine to do here because we haven't
> actually enabled multipage bvecs yet, right?

Well, it is fine: all helpers supporting multi-page bvecs actually work
correctly even when the feature isn't enabled, because a single-page bvec is
just a special case that the multi-page bvec helpers have to handle anyway.

Thanks,
Ming


Re: [PATCH V10 09/19] block: introduce bio_bvecs()

2018-11-19 Thread Ming Lei
On Fri, Nov 16, 2018 at 02:45:41PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 15, 2018 at 04:52:56PM +0800, Ming Lei wrote:
> > There are still cases in which we need bio_bvecs() to get the
> > number of multi-page segments, so introduce it.
> 
> The only user in your final tree seems to be the loop driver, and
> even that one only uses the helper for read/write bios.
> 
> I think something like this would be much simpler in the end:
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index d509902a8046..712511815ac6 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -514,16 +514,18 @@ static int lo_rw_aio(struct loop_device *lo, struct 
> loop_cmd *cmd,
>   struct request *rq = blk_mq_rq_from_pdu(cmd);
>   struct bio *bio = rq->bio;
>   struct file *file = lo->lo_backing_file;
> + struct bvec_iter bvec_iter;
> + struct bio_vec tmp;
>   unsigned int offset;
>   int nr_bvec = 0;
>   int ret;
>  
> + __rq_for_each_bio(bio, rq)
> + bio_for_each_bvec(tmp, bio, bvec_iter)
> + nr_bvec++;
> +
>   if (rq->bio != rq->biotail) {
> - struct bvec_iter iter;
> - struct bio_vec tmp;
>  
> - __rq_for_each_bio(bio, rq)
> - nr_bvec += bio_bvecs(bio);
>   bvec = kmalloc_array(nr_bvec, sizeof(struct bio_vec),
>GFP_NOIO);
>   if (!bvec)
> @@ -537,7 +539,7 @@ static int lo_rw_aio(struct loop_device *lo, struct 
> loop_cmd *cmd,
>* API will take care of all details for us.
>*/
>   __rq_for_each_bio(bio, rq)
> - bio_for_each_bvec(tmp, bio, iter) {
> + bio_for_each_bvec(tmp, bio, bvec_iter) {
>   *bvec = tmp;
>   bvec++;
>   }
> @@ -551,7 +553,6 @@ static int lo_rw_aio(struct loop_device *lo, struct 
> loop_cmd *cmd,
>*/
>   offset = bio->bi_iter.bi_bvec_done;
>   bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
> - nr_bvec = bio_bvecs(bio);
>   }
>   atomic_set(>ref, 2);
>  
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index dcad0b69f57a..379440d1ced0 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -200,30 +200,6 @@ static inline unsigned bio_segments(struct bio *bio)
>   }
>  }
>  
> -static inline unsigned bio_bvecs(struct bio *bio)
> -{
> - unsigned bvecs = 0;
> - struct bio_vec bv;
> - struct bvec_iter iter;
> -
> - /*
> -  * We special case discard/write same/write zeroes, because they
> -  * interpret bi_size differently:
> -  */
> - switch (bio_op(bio)) {
> - case REQ_OP_DISCARD:
> - case REQ_OP_SECURE_ERASE:
> - case REQ_OP_WRITE_ZEROES:
> - return 0;
> - case REQ_OP_WRITE_SAME:
> - return 1;
> - default:
> - bio_for_each_bvec(bv, bio, iter)
> - bvecs++;
> - return bvecs;
> - }
> -}
> -
>  /*
>   * get a reference to a bio, so it won't disappear. the intended use is
>   * something like:

OK, will do it in next version.

Thanks,
Ming


Re: [PATCH V10 08/19] btrfs: move bio_pages_all() to btrfs

2018-11-19 Thread Ming Lei
On Fri, Nov 16, 2018 at 02:38:45PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 15, 2018 at 04:52:55PM +0800, Ming Lei wrote:
> > BTRFS is the only user of this helper, so move it into BTRFS and
> > implement it via bio_for_each_segment_all(), since bio->bi_vcnt may
> > not equal the number of pages once multi-page bvecs are enabled.
> 
> btrfs only uses the value to check if it is larger than 1.  No amount
> of multipage bio merging should ever make bi_vcnt go from 0 to 1 or
> vice versa.

Could you explain a bit why?

Suppose 2 physically contiguous pages are added to this bio: .bi_vcnt
can be 1 in the multi-page bvec case, but it is 2 in the single-page
bvec case.
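
To make the difference concrete, here is a quick user-space sketch (not
kernel code; the struct below is only a stand-in for bio_vec, with made-up
addresses) of the two ways the same pair of contiguous pages can be
described:

#include <stdio.h>

#define MODEL_PAGE_SIZE 4096u

/* simplified stand-in for struct bio_vec: (start address, length) */
struct model_bvec {
        unsigned long   bv_addr;
        unsigned int    bv_len;
};

int main(void)
{
        /* two physically contiguous 4k pages starting at address 0x10000 */
        struct model_bvec single_page[] = {
                { 0x10000, MODEL_PAGE_SIZE },
                { 0x11000, MODEL_PAGE_SIZE },
        };
        struct model_bvec multi_page[] = {
                { 0x10000, 2 * MODEL_PAGE_SIZE },       /* merged into one bvec */
        };

        printf("single-page bvecs: bi_vcnt = %zu\n",
               sizeof(single_page) / sizeof(single_page[0]));   /* prints 2 */
        printf("multi-page bvec:   bi_vcnt = %zu\n",
               sizeof(multi_page) / sizeof(multi_page[0]));     /* prints 1 */
        return 0;
}

So anything that reads .bi_vcnt as a page count, like bio_pages_all(), sees a
different value depending on whether the two pages were merged into one bvec.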

Thanks,
Ming


Re: [PATCH V10 08/19] btrfs: move bio_pages_all() to btrfs

2018-11-19 Thread Ming Lei
On Thu, Nov 15, 2018 at 04:23:56PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:52:55PM +0800, Ming Lei wrote:
> > BTRFS is the only user of this helper, so move it into BTRFS and
> > implement it via bio_for_each_segment_all(), since bio->bi_vcnt may
> > not equal the number of pages once multi-page bvecs are enabled.
> 
> Shouldn't you also get rid of bio_pages_all() in this patch?

Good catch!

thanks,
Ming


Re: [PATCH V10 07/19] btrfs: use bvec_last_segment to get bio's last page

2018-11-19 Thread Ming Lei
On Fri, Nov 16, 2018 at 02:37:10PM +0100, Christoph Hellwig wrote:
> On Thu, Nov 15, 2018 at 04:52:54PM +0800, Ming Lei wrote:
> > index 2955a4ea2fa8..161e14b8b180 100644
> > --- a/fs/btrfs/compression.c
> > +++ b/fs/btrfs/compression.c
> > @@ -400,8 +400,11 @@ blk_status_t btrfs_submit_compressed_write(struct 
> > inode *inode, u64 start,
> >  static u64 bio_end_offset(struct bio *bio)
> >  {
> > struct bio_vec *last = bio_last_bvec_all(bio);
> > +   struct bio_vec bv;
> >  
> > -   return page_offset(last->bv_page) + last->bv_len + last->bv_offset;
> > +   bvec_last_segment(last, );
> > +
> > +   return page_offset(bv.bv_page) + bv.bv_len + bv.bv_offset;
> 
> I don't think we need this.  If last is a multi-page bvec bv_offset
> will already contain the correct offset from the first page.

Yeah, that is true for this specific case; looks like we can drop this patch.
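
A throwaway user-space sketch (made-up page numbers, not the btrfs code) of
why the direct computation and the bvec_last_segment() style computation end
up with the same value for a bvec spanning several pages:

#include <stdio.h>

#define MODEL_PAGE_SIZE 4096u

int main(void)
{
        /*
         * Model a multi-page bvec: 'first_page_offset' stands in for
         * page_offset(bv_page) of the first page, bv_offset is relative
         * to that first page, and bv_len spans three pages.
         */
        unsigned long first_page_offset = 100UL * MODEL_PAGE_SIZE;
        unsigned int bv_offset = 512;
        unsigned int bv_len = 3 * MODEL_PAGE_SIZE;

        /* direct form, as in bio_end_offset() on the multi-page bvec */
        unsigned long direct = first_page_offset + bv_offset + bv_len;

        /* via the last single-page segment, bvec_last_segment() style */
        unsigned int total = bv_offset + bv_len;
        unsigned int last_page = (total - 1) / MODEL_PAGE_SIZE;
        unsigned int seg_offset, seg_len;
        unsigned long via_last_seg;

        if (bv_offset >= last_page * MODEL_PAGE_SIZE) {
                seg_offset = bv_offset % MODEL_PAGE_SIZE;
                seg_len = bv_len;
        } else {
                seg_offset = 0;
                seg_len = total - last_page * MODEL_PAGE_SIZE;
        }
        via_last_seg = first_page_offset + last_page * MODEL_PAGE_SIZE +
                       seg_offset + seg_len;

        printf("direct=%lu via_last_segment=%lu\n", direct, via_last_seg);
        return 0;
}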


thanks,
Ming


Re: [PATCH V10 05/19] block: introduce bvec_last_segment()

2018-11-18 Thread Ming Lei
On Thu, Nov 15, 2018 at 03:23:56PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:52:52PM +0800, Ming Lei wrote:
> > BTRFS and guard_bio_eod() need to get the last single-page segment
> > from a multi-page bvec, so introduce this helper to make them happy.
> > 
> > Cc: Dave Chinner 
> > Cc: Kent Overstreet 
> > Cc: Mike Snitzer 
> > Cc: dm-de...@redhat.com
> > Cc: Alexander Viro 
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Shaohua Li 
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-er...@lists.ozlabs.org
> > Cc: David Sterba 
> > Cc: linux-bt...@vger.kernel.org
> > Cc: Darrick J. Wong 
> > Cc: linux-...@vger.kernel.org
> > Cc: Gao Xiang 
> > Cc: Christoph Hellwig 
> > Cc: Theodore Ts'o 
> > Cc: linux-e...@vger.kernel.org
> > Cc: Coly Li 
> > Cc: linux-bca...@vger.kernel.org
> > Cc: Boaz Harrosh 
> > Cc: Bob Peterson 
> > Cc: cluster-de...@redhat.com
> 
> Reviewed-by: Omar Sandoval 
> 
> Minor comments below.
> 
> > Signed-off-by: Ming Lei 
> > ---
> >  include/linux/bvec.h | 25 +
> >  1 file changed, 25 insertions(+)
> > 
> > diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> > index 3d61352cd8cf..01616a0b6220 100644
> > --- a/include/linux/bvec.h
> > +++ b/include/linux/bvec.h
> > @@ -216,4 +216,29 @@ static inline bool mp_bvec_iter_advance(const struct 
> > bio_vec *bv,
> > .bi_bvec_done   = 0,\
> >  }
> >  
> > +/*
> > + * Get the last singlepage segment from the multipage bvec and store it
> > + * in @seg
> > + */
> > +static inline void bvec_last_segment(const struct bio_vec *bvec,
> > +   struct bio_vec *seg)
> 
> Indentation is all messed up here.

Will fix it.

> 
> > +{
> > +   unsigned total = bvec->bv_offset + bvec->bv_len;
> > +   unsigned last_page = total / PAGE_SIZE;
> > +
> > +   if (last_page * PAGE_SIZE == total)
> > +   last_page--;
> 
> I think this could just be
> 
>   unsigned int last_page = (total - 1) / PAGE_SIZE;

This way is really elegant.
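
For what it's worth, a tiny user-space check (purely illustrative) that the
one-liner agrees with the original two-step computation, including the exact
page-boundary values:

#include <assert.h>
#include <stdio.h>

#define MODEL_PAGE_SIZE 4096u

/* original form: divide, then correct the exact-multiple case */
static unsigned int last_page_orig(unsigned int total)
{
        unsigned int last_page = total / MODEL_PAGE_SIZE;

        if (last_page * MODEL_PAGE_SIZE == total)
                last_page--;
        return last_page;
}

/* suggested form */
static unsigned int last_page_new(unsigned int total)
{
        return (total - 1) / MODEL_PAGE_SIZE;
}

int main(void)
{
        /* total = bv_offset + bv_len, always >= 1 for a valid bvec */
        unsigned int samples[] = { 1, 4095, 4096, 4097, 8192, 8193, 65536 };
        unsigned int i;

        for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
                assert(last_page_orig(samples[i]) == last_page_new(samples[i]));

        printf("both forms agree on all samples\n");
        return 0;
}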

Thanks,
Ming


Re: [PATCH V10 04/19] block: use bio_for_each_bvec() to map sg

2018-11-18 Thread Ming Lei
On Fri, Nov 16, 2018 at 02:33:14PM +0100, Christoph Hellwig wrote:
> > +   if (!*sg)
> > +   return sglist;
> > +   else {
> 
> No need for an else after an early return.

OK, good catch!

Thanks,
Ming


Re: [PATCH V10 03/19] block: use bio_for_each_bvec() to compute multi-page bvec count

2018-11-18 Thread Ming Lei
On Thu, Nov 15, 2018 at 12:20:28PM -0800, Omar Sandoval wrote:
> On Thu, Nov 15, 2018 at 04:52:50PM +0800, Ming Lei wrote:
> > First, it is more efficient to use bio_for_each_bvec() in both
> > blk_bio_segment_split() and __blk_recalc_rq_segments() to compute how
> > many multi-page bvecs there are in the bio.
> > 
> > Secondly, once bio_for_each_bvec() is used, the bvec may need to be
> > split because its length can be much longer than the max segment size,
> > so we have to split the big bvec into several segments.
> > 
> > Thirdly, when splitting a multi-page bvec into segments, the max segment
> > limit may be reached, so bio splitting needs to be considered in this
> > situation too.
> > 
> > Cc: Dave Chinner 
> > Cc: Kent Overstreet 
> > Cc: Mike Snitzer 
> > Cc: dm-de...@redhat.com
> > Cc: Alexander Viro 
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Shaohua Li 
> > Cc: linux-r...@vger.kernel.org
> > Cc: linux-er...@lists.ozlabs.org
> > Cc: David Sterba 
> > Cc: linux-bt...@vger.kernel.org
> > Cc: Darrick J. Wong 
> > Cc: linux-...@vger.kernel.org
> > Cc: Gao Xiang 
> > Cc: Christoph Hellwig 
> > Cc: Theodore Ts'o 
> > Cc: linux-e...@vger.kernel.org
> > Cc: Coly Li 
> > Cc: linux-bca...@vger.kernel.org
> > Cc: Boaz Harrosh 
> > Cc: Bob Peterson 
> > Cc: cluster-de...@redhat.com
> > Signed-off-by: Ming Lei 
> > ---
> >  block/blk-merge.c | 90 
> > ++-
> >  1 file changed, 76 insertions(+), 14 deletions(-)
> > 
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index 91b2af332a84..6f7deb94a23f 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -160,6 +160,62 @@ static inline unsigned get_max_io_size(struct 
> > request_queue *q,
> > return sectors;
> >  }
> >  
> > +/*
> > + * Split the bvec @bv into segments, and update all kinds of
> > + * variables.
> > + */
> > +static bool bvec_split_segs(struct request_queue *q, struct bio_vec *bv,
> > +   unsigned *nsegs, unsigned *last_seg_size,
> > +   unsigned *front_seg_size, unsigned *sectors)
> > +{
> > +   bool need_split = false;
> > +   unsigned len = bv->bv_len;
> > +   unsigned total_len = 0;
> > +   unsigned new_nsegs = 0, seg_size = 0;
> 
> "unsigned int" here and everywhere else.
> 
> > +   if ((*nsegs >= queue_max_segments(q)) || !len)
> > +   return need_split;
> > +
> > +   /*
> > +* Multipage bvec may be too big to hold in one segment,
> > +* so the current bvec has to be splitted as multiple
> > +* segments.
> > +*/
> > +   while (new_nsegs + *nsegs < queue_max_segments(q)) {
> > +   seg_size = min(queue_max_segment_size(q), len);
> > +
> > +   new_nsegs++;
> > +   total_len += seg_size;
> > +   len -= seg_size;
> > +
> > +   if ((queue_virt_boundary(q) && ((bv->bv_offset +
> > +   total_len) & queue_virt_boundary(q))) || !len)
> > +   break;
> 
> Checking queue_virt_boundary(q) != 0 is superfluous, and the len check
> could just control the loop, i.e.,
> 
>   while (len && new_nsegs + *nsegs < queue_max_segments(q)) {
>   seg_size = min(queue_max_segment_size(q), len);
> 
>   new_nsegs++;
>   total_len += seg_size;
>   len -= seg_size;
> 
>   if ((bv->bv_offset + total_len) & queue_virt_boundary(q))
>   break;
>   }
> 
> And if you rewrite it this way, I _think_ you can get rid of this
> special case:
> 
>   if ((*nsegs >= queue_max_segments(q)) || !len)
>   return need_split;
> 
> above.

Good point, will do in next version.

> 
> > +   }
> > +
> > +   /* split in the middle of the bvec */
> > +   if (len)
> > +   need_split = true;
> 
> need_split is unnecessary, just return len != 0.

OK.

> 
> > +
> > +   /* update front segment size */
> > +   if (!*nsegs) {
> > +   unsigned first_seg_size = seg_size;
> > +
> > +   if (new_nsegs > 1)
> > +   first_seg_size = queue_max_segment_size(q);
> > +   if (*front_seg_size < first_seg_size)
> > +   *front_seg_size = first_seg_size;
> > +   }
> > +
> > +   /* update other variables */
> > +   *last_seg_size = seg_size;
> > 

Re: [PATCH V10 01/19] block: introduce multi-page page bvec helpers

2018-11-18 Thread Ming Lei
On Sun, Nov 18, 2018 at 08:10:14PM -0700, Jens Axboe wrote:
> On 11/18/18 7:23 PM, Ming Lei wrote:
> > On Fri, Nov 16, 2018 at 02:13:05PM +0100, Christoph Hellwig wrote:
> >>> -#define bvec_iter_page(bvec, iter)   \
> >>> +#define mp_bvec_iter_page(bvec, iter)\
> >>>   (__bvec_iter_bvec((bvec), (iter))->bv_page)
> >>>  
> >>> -#define bvec_iter_len(bvec, iter)\
> >>> +#define mp_bvec_iter_len(bvec, iter) \
> >>
> >> I'd much prefer if we would stick to the segment naming that
> >> we also use in the higher level helper.
> >>
> >> So segment_iter_page, segment_iter_len, etc.
> > 
> > We discussed the naming problem before; one big problem is that 'segment'
> > in bio_for_each_segment*() actually means a single-page segment.
> > 
> > If we use segment_iter_page() here for multi-page segment, it may
> > confuse people.
> > 
> > Of course, I prefer to the naming of segment/page, 
> > 
> > And Jens didn't agree to rename bio_for_each_segment*() before.
> 
> I didn't like frivolous renaming (and I still don't), but mp_
> is horrible imho. Don't name these after the fact that they
> are done in conjunction with supporting multipage bvecs. That
> very fact will be irrelevant very soon

OK, so what is your suggestion for the naming issue?

Are you fine with using segment_iter_page() here? Then the term 'segment'
would be interpreted as a multi-page segment here, but as a single-page
segment in bio_for_each_segment*().

thanks
Ming


Re: [PATCH V10 02/19] block: introduce bio_for_each_bvec()

2018-11-18 Thread Ming Lei
On Fri, Nov 16, 2018 at 02:30:28PM +0100, Christoph Hellwig wrote:
> > +static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter 
> > *iter,
> > + unsigned bytes, bool mp)
> 
> I think these magic 'bool np' arguments and wrappers over wrapper
> don't help anyone to actually understand the code.  I'd vote for
> removing as many wrappers as we really don't need, and passing the
> actual segment limit instead of the magic bool flag.  Something like
> this untested patch:

I think this way is fine, just a little comment.

> 
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 277921ad42e7..dcad0b69f57a 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -138,30 +138,21 @@ static inline bool bio_full(struct bio *bio)
>   bvec_for_each_segment(bvl, &((bio)->bi_io_vec[iter_all.idx]), 
> i, iter_all)
>  
>  static inline void __bio_advance_iter(struct bio *bio, struct bvec_iter 
> *iter,
> -   unsigned bytes, bool mp)
> +   unsigned bytes, unsigned max_segment)

The new parameter would be better named 'max_segment_len' or
'max_seg_len'.

>  {
>   iter->bi_sector += bytes >> 9;
>  
>   if (bio_no_advance_iter(bio))
>   iter->bi_size -= bytes;
>   else
> - if (!mp)
> - bvec_iter_advance(bio->bi_io_vec, iter, bytes);
> - else
> - mp_bvec_iter_advance(bio->bi_io_vec, iter, bytes);
> + __bvec_iter_advance(bio->bi_io_vec, iter, bytes, max_segment);
>   /* TODO: It is reasonable to complete bio with error here. */
>  }
>  
>  static inline void bio_advance_iter(struct bio *bio, struct bvec_iter *iter,
>   unsigned bytes)
>  {
> - __bio_advance_iter(bio, iter, bytes, false);
> -}
> -
> -static inline void bio_advance_mp_iter(struct bio *bio, struct bvec_iter 
> *iter,
> -unsigned bytes)
> -{
> - __bio_advance_iter(bio, iter, bytes, true);
> + __bio_advance_iter(bio, iter, bytes, PAGE_SIZE);
>  }
>  
>  #define __bio_for_each_segment(bvl, bio, iter, start)
> \
> @@ -177,7 +168,7 @@ static inline void bio_advance_mp_iter(struct bio *bio, 
> struct bvec_iter *iter,
>   for (iter = (start);\
>(iter).bi_size &&  \
>   ((bvl = bio_iter_mp_iovec((bio), (iter))), 1);  \
> -  bio_advance_mp_iter((bio), &(iter), (bvl).bv_len))
> +  __bio_advance_iter((bio), &(iter), (bvl).bv_len, 0))

We might even pass '-1' for the multi-page segment case.

>  
>  /* returns one real segment(multipage bvec) each time */
>  #define bio_for_each_bvec(bvl, bio, iter)\
> diff --git a/include/linux/bvec.h b/include/linux/bvec.h
> index 02f26d2b59ad..5e2ed46c1c88 100644
> --- a/include/linux/bvec.h
> +++ b/include/linux/bvec.h
> @@ -138,8 +138,7 @@ struct bvec_iter_all {
>  })
>  
>  static inline bool __bvec_iter_advance(const struct bio_vec *bv,
> -struct bvec_iter *iter,
> -unsigned bytes, bool mp)
> + struct bvec_iter *iter, unsigned bytes, unsigned max_segment)
>  {
>   if (WARN_ONCE(bytes > iter->bi_size,
>"Attempted to advance past end of bvec iter\n")) {
> @@ -148,18 +147,18 @@ static inline bool __bvec_iter_advance(const struct 
> bio_vec *bv,
>   }
>  
>   while (bytes) {
> - unsigned len;
> + unsigned segment_len = mp_bvec_iter_len(bv, *iter);
>  
> - if (mp)
> - len = mp_bvec_iter_len(bv, *iter);
> - else
> - len = bvec_iter_len(bv, *iter);
> + if (max_segment) {
> + max_segment -= bvec_iter_offset(bv, *iter);
> + segment_len = min(segment_len, max_segment);

Looks like 'max_segment' needs to stay constant here; it shouldn't be updated.

If '-1' is passed for the multi-page case, the above change may become:

	segment_len = min_t(unsigned, segment_len,
			    max_seg_len - bvec_iter_offset(bv, *iter));

This way is cleaner, but with the extra cost of the above line for the
multi-page case.
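
To illustrate the point (a user-space model only; 'max_seg_len' and the
single-/multi-page split follow the discussion above rather than any final
in-tree API), passing PAGE_SIZE caps each advance step at the page boundary,
while passing -1 effectively removes the cap:

#include <stdio.h>

#define MODEL_PAGE_SIZE 4096u

/*
 * Per-step length selection: 'mp_len' is the full multi-page length left in
 * the current bvec, 'offset' is the offset within the current page, and
 * 'max_seg_len' is either PAGE_SIZE (single-page iteration) or UINT_MAX,
 * i.e. "-1" (multi-page iteration).
 */
static unsigned int step_len(unsigned int mp_len, unsigned int offset,
                             unsigned int max_seg_len)
{
        unsigned int cap = max_seg_len - offset;

        return mp_len < cap ? mp_len : cap;
}

int main(void)
{
        unsigned int mp_len = 3 * MODEL_PAGE_SIZE;      /* bvec spanning 3 pages */
        unsigned int offset = 512;                      /* offset inside first page */

        printf("single-page step: %u bytes\n",
               step_len(mp_len, offset, MODEL_PAGE_SIZE));     /* 4096 - 512 = 3584 */
        printf("multi-page step:  %u bytes\n",
               step_len(mp_len, offset, (unsigned int)-1));    /* all 12288 bytes */
        return 0;
}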

Thanks,
Ming


Re: [PATCH V2 for-4.21 2/2] blk-mq: alloc q->queue_ctx as normal array

2018-11-18 Thread Ming Lei
On Sat, Nov 17, 2018 at 11:03:42AM +0100, Greg Kroah-Hartman wrote:
> On Sat, Nov 17, 2018 at 10:34:18AM +0800, Ming Lei wrote:
> > On Fri, Nov 16, 2018 at 06:06:23AM -0800, Greg Kroah-Hartman wrote:
> > > On Fri, Nov 16, 2018 at 07:23:11PM +0800, Ming Lei wrote:
> > > > Now q->queue_ctx is just one read-mostly table for query the
> > > > 'blk_mq_ctx' instance from one cpu index, it isn't necessary
> > > > to allocate it as percpu variable. One simple array may be
> > > > more efficient.
> > > 
> > > "may be", have you run benchmarks to be sure?  If so, can you add the
> > > results of them to this changelog?  If there is no measurable
> > > difference, then why make this change at all?
> > 
> > __blk_mq_get_ctx() is used in the fast path; which of these do you think
> > is more efficient?
> > 
> > - *per_cpu_ptr(q->queue_ctx, cpu);
> > 
> > - q->queue_ctx[cpu]
> 
> You need to actually test to see which one is faster, you might be
> surprised :)
> 
> In other words, do not just guess.

No performance difference is observed with this patchset when I
run the following fio test on null_blk (modprobe null_blk) in my VM:

fio --direct=1 --size=128G --bsrange=4k-4k --runtime=40 --numjobs=32 \
  --ioengine=libaio --iodepth=64 --group_reporting=1 --filename=/dev/nullb0 \
  --name=null_blk-ttest-randread --rw=randread

Running tests is important, but IMO it is more important to understand
that the idea behind the change is correct, or that the approach can be
proven correct.

Given that the number of test cases grows exponentially with the related
factors and settings to be covered, we obviously can't run all of them.

> 
> > At least the latter isn't worse than the former.
> 
> How do you know?

As I explained, q->queue_ctx is basically a read-only look-up table after
the queue is initialized, and there isn't any benefit in using the percpu
allocator here.

> 
> > Especially since q->queue_ctx is just a read-only look-up table, it doesn't
> > make sense to keep it percpu any more.
> > 
> > Not to mention that q->queue_ctx[cpu] is cleaner and more readable.
> 
> Again, please test to verify this.

IMO, testing alone can't verify that an approach or a patch is correct,
given that we can't cover all cases by testing; it should serve as a
supplement to the proof or analysis of the patch, which is usually done
in the patch commit log or during review.


Thanks,
Ming


Re: [PATCH V2 for-4.21 2/2] blk-mq: alloc q->queue_ctx as normal array

2018-11-16 Thread Ming Lei
On Fri, Nov 16, 2018 at 06:06:23AM -0800, Greg Kroah-Hartman wrote:
> On Fri, Nov 16, 2018 at 07:23:11PM +0800, Ming Lei wrote:
> > Now q->queue_ctx is just one read-mostly table for query the
> > 'blk_mq_ctx' instance from one cpu index, it isn't necessary
> > to allocate it as percpu variable. One simple array may be
> > more efficient.
> 
> "may be", have you run benchmarks to be sure?  If so, can you add the
> results of them to this changelog?  If there is no measurable
> difference, then why make this change at all?

__blk_mq_get_ctx() is used in the fast path; which of these do you think
is more efficient?

- *per_cpu_ptr(q->queue_ctx, cpu);

- q->queue_ctx[cpu]

At least the latter isn't worse than the former.

Especially since q->queue_ctx is just a read-only look-up table, it doesn't
make sense to keep it percpu any more.

Not to mention that q->queue_ctx[cpu] is cleaner and more readable.

Thanks,
Ming


Re: [PATCH V2 for-4.21 1/2] blk-mq: not embed .mq_kobj and ctx->kobj into queue instance

2018-11-16 Thread Ming Lei
On Fri, Nov 16, 2018 at 06:05:21AM -0800, Greg Kroah-Hartman wrote:
> On Fri, Nov 16, 2018 at 07:23:10PM +0800, Ming Lei wrote:
> > @@ -456,7 +456,7 @@ struct request_queue {
> > /*
> >  * mq queue kobject
> >  */
> > -   struct kobject mq_kobj;
> > +   struct kobject *mq_kobj;
> 
> What is this kobject even used for?  It wasn't obvious at all from this
> patch, why is it needed if you are not using it to reference count the
> larger structure here?

All attributes and kobjects under /sys/block/$DEV/mq are covered by this
kobject, and all of them expose blk-mq specific information. Now only blk-mq
is left, since the legacy IO path has been removed.

That is why I mentioned last time that we might remove this kobject and move
everything under /sys/block/$DEV/queue; however, you thought that might break
some userspace.

If we want to backport these fixes to stable, this patch may be the easier
route.

Thanks,
Ming


Re: [PATCH] block: fix 32 bit overflow in __blkdev_issue_discard()

2018-11-16 Thread Ming Lei
On Fri, Nov 16, 2018 at 03:04:57PM +1100, Dave Chinner wrote:
> On Thu, Nov 15, 2018 at 02:24:19PM -0800, Darrick J. Wong wrote:
> > On Fri, Nov 16, 2018 at 09:13:37AM +1100, Dave Chinner wrote:
> > > On Thu, Nov 15, 2018 at 11:10:36AM +0800, Ming Lei wrote:
> > > > On Thu, Nov 15, 2018 at 12:22:01PM +1100, Dave Chinner wrote:
> > > > > On Thu, Nov 15, 2018 at 09:06:52AM +0800, Ming Lei wrote:
> > > > > > On Wed, Nov 14, 2018 at 08:18:24AM -0700, Jens Axboe wrote:
> > > > > > > On 11/13/18 2:43 PM, Dave Chinner wrote:
> > > > > > > > From: Dave Chinner 
> > > > > > > > 
> > > > > > > > A discard cleanup merged into 4.20-rc2 causes fstests xfs/259 to
> > > > > > > > fall into an endless loop in the discard code. The test is 
> > > > > > > > creating
> > > > > > > > a device that is exactly 2^32 sectors in size to test mkfs 
> > > > > > > > boundary
> > > > > > > > conditions around the 32 bit sector overflow region.
> > > > > > > > 
> > > > > > > > mkfs issues a discard for the entire device size by default, and
> > > > > > > > hence this throws a sector count of 2^32 into
> > > > > > > > blkdev_issue_discard(). It takes the number of sectors to 
> > > > > > > > discard as
> > > > > > > > a sector_t - a 64 bit value.
> > > > > > > > 
> > > > > > > > The commit ba5d73851e71 ("block: cleanup 
> > > > > > > > __blkdev_issue_discard")
> > > > > > > > takes this sector count and casts it to a 32 bit value before
> > > > > > > > comapring it against the maximum allowed discard size the device
> > > > > > > > has. This truncates away the upper 32 bits, and so if the lower 
> > > > > > > > 32
> > > > > > > > bits of the sector count is zero, it starts issuing discards of
> > > > > > > > length 0. This causes the code to fall into an endless loop, 
> > > > > > > > issuing
> > > > > > > > a zero length discards over and over again on the same sector.
> > > > > > > 
> > > > > > > Applied, thanks. Ming, can you please add a blktests test for
> > > > > > > this case? This is the 2nd time it's been broken.
> > > > > > 
> > > > > > OK, I will add zram discard test in blktests, which should cover the
> > > > > > 1st report. For the xfs/259, I need to investigate if it is easy to
> > > > > > do in blktests.
> > > > > 
> > > > > Just write a test that creates block devices of 2^32 + (-1,0,1)
> > > > > sectors and runs a discard across the entire device. That's all that
> > > > > xfs/259 it doing - exercising mkfs on 2TB, 4TB and 16TB boundaries.
> > > > > i.e. the boundaries where sectors and page cache indexes (on 4k page
> > > > > size systems) overflow 32 bit int and unsigned int sizes. mkfs
> > > > > issues a discard for the entire device, so it's testing that as
> > > > > well...
> > > > 
> > > > Indeed, I can reproduce this issue via the following commands:
> > > > 
> > > > modprobe scsi_debug virtual_gb=2049 sector_size=512 lbpws10=1 
> > > > dev_size_mb=512
> > > > blkdiscard /dev/sde
> > > > 
> > > > > 
> > > > > You need to write tests that exercise write_same, write_zeros and
> > > > > discard operations around these boundaries, because they all take
> > > > > a 64 bit sector count and stuff them into 32 bit size fields in
> > > > > the bio tha tis being submitted.
> > > > 
> > > > write_same/write_zeros are usually used by driver directly, so we
> > > > may need make the test case on some specific device.
> > > 
> > > My local linux iscsi server and client advertise support for them.
> > > It definitely does not ships zeros across the wire(*) when I use
> > > things like FALLOC_FL_ZERO_RANGE, but fstests does not have block
> > > device fallocate() tests for zeroing or punching...
> > 
> > fstests does (generic/{349,350,351}) but those basic functionality tests
> > don't include creating a 2^32 block device and seeing if overflows
> > happen... :/
> 
> They don't run on my test machines because they require a modular
> kernel and I run a monolithic kernel specified externally by the
> qemu command line on all my test VMs.
> 
> generic/349 [not run] scsi_debug module not found
> generic/350 [not run] scsi_debug module not found
> generic/351 [not run] scsi_debug module not found

Given that both null_blk and scsi_debug have tons of module parameters,
building them as modules is pretty much inevitable; then you can set them
up in any way you want.

Most of my tests are done in a VM too. I start the VM with an official
Fedora kernel, install the modules built on the host via 'scp & tar zxf'
inside the VM, then boot the VM with the newly built kernel and run the
tests. The whole script is quite simple, and it doesn't need root
privileges on the host.

Thanks,
Ming


[PATCH V2 0/2] blk-mq: fix kobject lifetime issue

2018-11-16 Thread Ming Lei
Hi Jens,

The 1st patch fixes the kobject lifetime issue which is triggered when
DEBUG_KOBJECT_RELEASE is enabled.

The 2nd patch can be thought of as a follow-up cleanup.

V2:
- allocate 'blk_mq_ctx' inside blk_mq_init_allocated_queue()
- allocate q->mq_kobj directly 

Ming Lei (2):
  blk-mq: not embed .mq_kobj and ctx->kobj into queue instance
  blk-mq: alloc q->queue_ctx as normal array

 block/blk-mq-sysfs.c   | 51 -
 block/blk-mq.c | 61 +-
 block/blk-mq.h |  4 ++--
 include/linux/blkdev.h |  4 ++--
 4 files changed, 84 insertions(+), 36 deletions(-)

Cc: "jianchao.wang" 
Cc: Guenter Roeck 
Cc: Greg Kroah-Hartman 

-- 
2.9.5



[PATCH V2 for-4.21 2/2] blk-mq: alloc q->queue_ctx as normal array

2018-11-16 Thread Ming Lei
Now q->queue_ctx is just a read-mostly table for looking up the
'blk_mq_ctx' instance from a cpu index, so it isn't necessary
to allocate it as a percpu variable. A simple array may be
more efficient.

Cc: "jianchao.wang" 
Cc: Guenter Roeck 
Cc: Greg Kroah-Hartman 
Signed-off-by: Ming Lei 
---
 block/blk-mq-sysfs.c   | 14 --
 block/blk-mq.c | 18 ++
 block/blk-mq.h |  2 +-
 include/linux/blkdev.h |  2 +-
 4 files changed, 16 insertions(+), 20 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index cc2fef909afc..ae2cafd6f8a8 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -290,19 +290,15 @@ void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx)
 
 void blk_mq_sysfs_deinit(struct request_queue *q)
 {
-   struct blk_mq_ctx *ctx;
int cpu;
 
-   for_each_possible_cpu(cpu) {
-   ctx = *per_cpu_ptr(q->queue_ctx, cpu);
-   kobject_put(>kobj);
-   }
+   for_each_possible_cpu(cpu)
+   kobject_put(>queue_ctx[cpu]->kobj);
kobject_put(q->mq_kobj);
 }
 
 int blk_mq_sysfs_init(struct request_queue *q)
 {
-   struct blk_mq_ctx *ctx;
int cpu;
struct kobject *mq_kobj;
 
@@ -312,10 +308,8 @@ int blk_mq_sysfs_init(struct request_queue *q)
 
kobject_init(mq_kobj, _mq_ktype);
 
-   for_each_possible_cpu(cpu) {
-   ctx = *per_cpu_ptr(q->queue_ctx, cpu);
-   kobject_init(>kobj, _mq_ctx_ktype);
-   }
+   for_each_possible_cpu(cpu)
+   kobject_init(>queue_ctx[cpu]->kobj, _mq_ctx_ktype);
q->mq_kobj = mq_kobj;
return 0;
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 376c04778d33..85d5dba56272 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2299,7 +2299,7 @@ static void blk_mq_init_cpu_queues(struct request_queue 
*q,
unsigned int i, j;
 
for_each_possible_cpu(i) {
-   struct blk_mq_ctx *__ctx = *per_cpu_ptr(q->queue_ctx, i);
+   struct blk_mq_ctx *__ctx = q->queue_ctx[i];
struct blk_mq_hw_ctx *hctx;
 
__ctx->cpu = i;
@@ -2385,7 +2385,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
set->map[0].mq_map[i] = 0;
}
 
-   ctx = *per_cpu_ptr(q->queue_ctx, i);
+   ctx = q->queue_ctx[i];
for (j = 0; j < set->nr_maps; j++) {
hctx = blk_mq_map_queue_type(q, j, i);
 
@@ -2520,9 +2520,9 @@ static void blk_mq_dealloc_queue_ctx(struct request_queue 
*q, bool free_ctxs)
if (free_ctxs) {
int cpu;
for_each_possible_cpu(cpu)
-   kfree(*per_cpu_ptr(q->queue_ctx, cpu));
+   kfree(q->queue_ctx[cpu]);
}
-   free_percpu(q->queue_ctx);
+   kfree(q->queue_ctx);
 }
 
 static int blk_mq_alloc_queue_ctx(struct request_queue *q)
@@ -2530,7 +2530,9 @@ static int blk_mq_alloc_queue_ctx(struct request_queue *q)
struct blk_mq_ctx *ctx;
int cpu;
 
-   q->queue_ctx = alloc_percpu(struct blk_mq_ctx *);
+   q->queue_ctx = kmalloc_array_node(nr_cpu_ids,
+ sizeof(struct blk_mq_ctx *),
+ GFP_KERNEL, q->tag_set->numa_node);
if (!q->queue_ctx)
return -ENOMEM;
 
@@ -2538,7 +2540,7 @@ static int blk_mq_alloc_queue_ctx(struct request_queue *q)
ctx = kzalloc_node(sizeof(*ctx), GFP_KERNEL, cpu_to_node(cpu));
if (!ctx)
goto fail;
-   *per_cpu_ptr(q->queue_ctx, cpu) = ctx;
+   q->queue_ctx[cpu] = ctx;
}
 
return 0;
@@ -2759,6 +2761,8 @@ struct request_queue *blk_mq_init_allocated_queue(struct 
blk_mq_tag_set *set,
/* mark the queue as mq asap */
q->mq_ops = set->ops;
 
+   q->tag_set = set;
+
q->poll_cb = blk_stat_alloc_callback(blk_mq_poll_stats_fn,
 blk_mq_poll_stats_bkt,
 BLK_MQ_POLL_STATS_BKTS, q);
@@ -2786,8 +2790,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct 
blk_mq_tag_set *set,
INIT_WORK(>timeout_work, blk_mq_timeout_work);
blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
 
-   q->tag_set = set;
-
q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;
 
if (!(set->flags & BLK_MQ_F_SG_MERGE))
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 84898793c230..97829388e1db 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -129,7 +129,7 @@ static inline enum mq_rq_state blk_mq_rq_state(struct 
request *rq)
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
   unsigned int cp

[PATCH V2 for-4.21 1/2] blk-mq: not embed .mq_kobj and ctx->kobj into queue instance

2018-11-16 Thread Ming Lei
Even though .mq_kobj, ctx->kobj and q->kobj appear to share the same
lifetime from the block layer's point of view, they actually don't, because
userspace may grab a kobject at any time via sysfs. Each kobject's lifetime
therefore has to be independent, and the objects (mq_kobj, ctx) that host
their own kobject have to be allocated dynamically.

This patch fixes a kernel panic during boot when DEBUG_KOBJECT_RELEASE
is enabled.

Reported-by: Guenter Roeck 
Cc: "jianchao.wang" 
Cc: Guenter Roeck 
Cc: Greg Kroah-Hartman 
Cc: sta...@vger.kernel.org
Signed-off-by: Ming Lei 
---
 block/blk-mq-sysfs.c   | 41 +
 block/blk-mq.c | 55 +-
 block/blk-mq.h |  4 ++--
 include/linux/blkdev.h |  4 ++--
 4 files changed, 78 insertions(+), 26 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 3d25b9c419e9..cc2fef909afc 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -15,6 +15,14 @@
 
 static void blk_mq_sysfs_release(struct kobject *kobj)
 {
+   kfree(kobj);
+}
+
+static void blk_mq_ctx_sysfs_release(struct kobject *kobj)
+{
+   struct blk_mq_ctx *ctx = container_of(kobj, struct blk_mq_ctx, kobj);
+
+   kfree(ctx);
 }
 
 static void blk_mq_hw_sysfs_release(struct kobject *kobj)
@@ -213,7 +221,7 @@ static struct kobj_type blk_mq_ktype = {
 static struct kobj_type blk_mq_ctx_ktype = {
.sysfs_ops  = _mq_sysfs_ops,
.default_attrs  = default_ctx_attrs,
-   .release= blk_mq_sysfs_release,
+   .release= blk_mq_ctx_sysfs_release,
 };
 
 static struct kobj_type blk_mq_hw_ktype = {
@@ -245,7 +253,7 @@ static int blk_mq_register_hctx(struct blk_mq_hw_ctx *hctx)
if (!hctx->nr_ctx)
return 0;
 
-   ret = kobject_add(>kobj, >mq_kobj, "%u", hctx->queue_num);
+   ret = kobject_add(>kobj, q->mq_kobj, "%u", hctx->queue_num);
if (ret)
return ret;
 
@@ -268,8 +276,8 @@ void blk_mq_unregister_dev(struct device *dev, struct 
request_queue *q)
queue_for_each_hw_ctx(q, hctx, i)
blk_mq_unregister_hctx(hctx);
 
-   kobject_uevent(>mq_kobj, KOBJ_REMOVE);
-   kobject_del(>mq_kobj);
+   kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
+   kobject_del(q->mq_kobj);
kobject_put(>kobj);
 
q->mq_sysfs_init_done = false;
@@ -286,23 +294,30 @@ void blk_mq_sysfs_deinit(struct request_queue *q)
int cpu;
 
for_each_possible_cpu(cpu) {
-   ctx = per_cpu_ptr(q->queue_ctx, cpu);
+   ctx = *per_cpu_ptr(q->queue_ctx, cpu);
kobject_put(>kobj);
}
-   kobject_put(>mq_kobj);
+   kobject_put(q->mq_kobj);
 }
 
-void blk_mq_sysfs_init(struct request_queue *q)
+int blk_mq_sysfs_init(struct request_queue *q)
 {
struct blk_mq_ctx *ctx;
int cpu;
+   struct kobject *mq_kobj;
+
+   mq_kobj = kzalloc(sizeof(*mq_kobj), GFP_KERNEL);
+   if (!mq_kobj)
+   return -ENOMEM;
 
-   kobject_init(>mq_kobj, _mq_ktype);
+   kobject_init(mq_kobj, _mq_ktype);
 
for_each_possible_cpu(cpu) {
-   ctx = per_cpu_ptr(q->queue_ctx, cpu);
+   ctx = *per_cpu_ptr(q->queue_ctx, cpu);
kobject_init(>kobj, _mq_ctx_ktype);
}
+   q->mq_kobj = mq_kobj;
+   return 0;
 }
 
 int __blk_mq_register_dev(struct device *dev, struct request_queue *q)
@@ -313,11 +328,11 @@ int __blk_mq_register_dev(struct device *dev, struct 
request_queue *q)
WARN_ON_ONCE(!q->kobj.parent);
lockdep_assert_held(>sysfs_lock);
 
-   ret = kobject_add(>mq_kobj, kobject_get(>kobj), "%s", "mq");
+   ret = kobject_add(q->mq_kobj, kobject_get(>kobj), "%s", "mq");
if (ret < 0)
goto out;
 
-   kobject_uevent(>mq_kobj, KOBJ_ADD);
+   kobject_uevent(q->mq_kobj, KOBJ_ADD);
 
queue_for_each_hw_ctx(q, hctx, i) {
ret = blk_mq_register_hctx(hctx);
@@ -334,8 +349,8 @@ int __blk_mq_register_dev(struct device *dev, struct 
request_queue *q)
while (--i >= 0)
blk_mq_unregister_hctx(q->queue_hw_ctx[i]);
 
-   kobject_uevent(>mq_kobj, KOBJ_REMOVE);
-   kobject_del(>mq_kobj);
+   kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
+   kobject_del(q->mq_kobj);
kobject_put(>kobj);
return ret;
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3b823891b3ef..376c04778d33 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2299,7 +2299,7 @@ static void blk_mq_init_cpu_queues(struct request_queue 
*q,
unsigned int i, j;
 
for_each_possible_cpu(i) {
-   struct blk_mq_ctx *__ctx = per_cpu_ptr(q->queue_ctx, i);
+   struct blk_mq_ctx *__ctx = *per_cpu_ptr(q->queue_

Re: [PATCH 1/2] blk-mq: not embed .mq_kobj and ctx->kobj into queue instance

2018-11-15 Thread Ming Lei
On Fri, Nov 16, 2018 at 02:11:07PM +0800, jianchao.wang wrote:
> 
> 
> On 11/16/18 11:28 AM, Ming Lei wrote:
> ...
> >  
> > +struct blk_mq_kobj {
> > +   struct kobject kobj;
> > +};
> > +
> >  static void blk_mq_sysfs_release(struct kobject *kobj)
> >  {
> > +   struct blk_mq_kobj *mq_kobj = container_of(kobj, struct blk_mq_kobj,
> > +  kobj);
> > +   kfree(mq_kobj);
> > +}
> > +
> ...
> >  
> > -void blk_mq_sysfs_init(struct request_queue *q)
> > +int blk_mq_sysfs_init(struct request_queue *q)
> >  {
> > struct blk_mq_ctx *ctx;
> > int cpu;
> > +   struct blk_mq_kobj *mq_kobj;
> > +
> > +   mq_kobj = kzalloc(sizeof(struct blk_mq_kobj), GFP_KERNEL);
> > +   if (!mq_kobj)
> > +   return -ENOMEM;
> >  
> > -   kobject_init(>mq_kobj, _mq_ktype);
> > +   kobject_init(_kobj->kobj, _mq_ktype);
> >  
> > for_each_possible_cpu(cpu) {
> > -   ctx = per_cpu_ptr(q->queue_ctx, cpu);
> > +   ctx = kzalloc_node(sizeof(*ctx), GFP_KERNEL, cpu_to_node(cpu));
> > +   if (!ctx)
> > +   goto fail;
> > +   *per_cpu_ptr(q->queue_ctx, cpu) = ctx;
> > kobject_init(>kobj, _mq_ctx_ktype);
> > }
> > +   q->mq_kobj = _kobj->kobj;
> > +   return 0;
> > +
> > + fail:
> > +   for_each_possible_cpu(cpu) {
> > +   ctx = *per_cpu_ptr(q->queue_ctx, cpu);
> > +   if (ctx)
> > +   kobject_put(>kobj);
> > +   }
> > +   kobject_put(_kobj->kobj);
> > +   return -ENOMEM;
> >  }
> 
> 
> blk_mq_kobj looks meaningless, why do we need it, or do I miss something ?

Right, it should have been allocated directly.

> And maybe we should allocate ctx in blk_mq_init_allocated_queue.

Looks like either way is fine.


Thanks,
Ming


Re: [PATCH 1/2] blk-mq: not embed .mq_kobj and ctx->kobj into queue instance

2018-11-15 Thread Ming Lei
On Fri, Nov 16, 2018 at 01:52:05AM -0500, Sasha Levin wrote:
> On Fri, Nov 16, 2018 at 11:28:25AM +0800, Ming Lei wrote:
> > Even though .mq_kobj, ctx->kobj and q->kobj share same lifetime
> > from block layer's view, actually they don't because userspace may
> > grab one kobject anytime via sysfs, so each kobject's lifetime has
> > to be independent, then the objects(mq_kobj, ctx) which hosts its
> > own kobject have to be allocated dynamically.
> > 
> > This patch fixes kernel panic issue during booting when 
> > DEBUG_KOBJECT_RELEASE
> > is enabled.
> > 
> > Reported-by: Guenter Roeck 
> > Cc: Guenter Roeck 
> > Cc: Greg Kroah-Hartman 
> > Cc: sta...@vger.kernel.org
> > Signed-off-by: Ming Lei 
> 
> What does this patch depend on? It doesn't apply to Linus's tree nor to
> the block tree.
> 
> Also, could you please cc lkml with patches?

It depends on for-4.21/block.

thanks,
Ming


[PATCH 0/2] blk-mq: fix kobject lifetime issue

2018-11-15 Thread Ming Lei
Hi Jens,

The 1st patch fixes the kobject lifetime issue which is triggered when
DEBUG_KOBJECT_RELEASE is enabled.

The 2nd patch can be thought of as a follow-up cleanup.


Ming Lei (2):
  blk-mq: not embed .mq_kobj and ctx->kobj into queue instance
  blk-mq: alloc q->queue_ctx as normal array

 block/blk-mq-sysfs.c   | 59 +++---
 block/blk-mq.c | 19 ++--
 block/blk-mq.h |  4 ++--
 include/linux/blkdev.h |  4 ++--
 4 files changed, 62 insertions(+), 24 deletions(-)

Cc: Guenter Roeck 
Cc: Greg Kroah-Hartman 


-- 
2.9.5



[PATCH 1/2] blk-mq: not embed .mq_kobj and ctx->kobj into queue instance

2018-11-15 Thread Ming Lei
Even though .mq_kobj, ctx->kobj and q->kobj appear to share the same
lifetime from the block layer's point of view, they actually don't, because
userspace may grab a kobject at any time via sysfs. Each kobject's lifetime
therefore has to be independent, and the objects (mq_kobj, ctx) that host
their own kobject have to be allocated dynamically.

This patch fixes a kernel panic during boot when DEBUG_KOBJECT_RELEASE
is enabled.

Reported-by: Guenter Roeck 
Cc: Guenter Roeck 
Cc: Greg Kroah-Hartman 
Cc: sta...@vger.kernel.org
Signed-off-by: Ming Lei 
---
 block/blk-mq-sysfs.c   | 59 +++---
 block/blk-mq.c | 13 ++-
 block/blk-mq.h |  4 ++--
 include/linux/blkdev.h |  4 ++--
 4 files changed, 58 insertions(+), 22 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 3d25b9c419e9..bab236955f56 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -13,8 +13,22 @@
 #include "blk-mq.h"
 #include "blk-mq-tag.h"
 
+struct blk_mq_kobj {
+   struct kobject kobj;
+};
+
 static void blk_mq_sysfs_release(struct kobject *kobj)
 {
+   struct blk_mq_kobj *mq_kobj = container_of(kobj, struct blk_mq_kobj,
+  kobj);
+   kfree(mq_kobj);
+}
+
+static void blk_mq_ctx_sysfs_release(struct kobject *kobj)
+{
+   struct blk_mq_ctx *ctx = container_of(kobj, struct blk_mq_ctx, kobj);
+
+   kfree(ctx);
 }
 
 static void blk_mq_hw_sysfs_release(struct kobject *kobj)
@@ -213,7 +227,7 @@ static struct kobj_type blk_mq_ktype = {
 static struct kobj_type blk_mq_ctx_ktype = {
.sysfs_ops  = _mq_sysfs_ops,
.default_attrs  = default_ctx_attrs,
-   .release= blk_mq_sysfs_release,
+   .release= blk_mq_ctx_sysfs_release,
 };
 
 static struct kobj_type blk_mq_hw_ktype = {
@@ -245,7 +259,7 @@ static int blk_mq_register_hctx(struct blk_mq_hw_ctx *hctx)
if (!hctx->nr_ctx)
return 0;
 
-   ret = kobject_add(>kobj, >mq_kobj, "%u", hctx->queue_num);
+   ret = kobject_add(>kobj, q->mq_kobj, "%u", hctx->queue_num);
if (ret)
return ret;
 
@@ -268,8 +282,8 @@ void blk_mq_unregister_dev(struct device *dev, struct 
request_queue *q)
queue_for_each_hw_ctx(q, hctx, i)
blk_mq_unregister_hctx(hctx);
 
-   kobject_uevent(>mq_kobj, KOBJ_REMOVE);
-   kobject_del(>mq_kobj);
+   kobject_uevent(q->mq_kobj, KOBJ_REMOVE);
+   kobject_del(q->mq_kobj);
kobject_put(>kobj);
 
q->mq_sysfs_init_done = false;
@@ -286,23 +300,42 @@ void blk_mq_sysfs_deinit(struct request_queue *q)
int cpu;
 
for_each_possible_cpu(cpu) {
-   ctx = per_cpu_ptr(q->queue_ctx, cpu);
+   ctx = *per_cpu_ptr(q->queue_ctx, cpu);
kobject_put(>kobj);
}
-   kobject_put(>mq_kobj);
+   kobject_put(q->mq_kobj);
 }
 
-void blk_mq_sysfs_init(struct request_queue *q)
+int blk_mq_sysfs_init(struct request_queue *q)
 {
struct blk_mq_ctx *ctx;
int cpu;
+   struct blk_mq_kobj *mq_kobj;
+
+   mq_kobj = kzalloc(sizeof(struct blk_mq_kobj), GFP_KERNEL);
+   if (!mq_kobj)
+   return -ENOMEM;
 
-   kobject_init(>mq_kobj, _mq_ktype);
+   kobject_init(_kobj->kobj, _mq_ktype);
 
for_each_possible_cpu(cpu) {
-   ctx = per_cpu_ptr(q->queue_ctx, cpu);
+   ctx = kzalloc_node(sizeof(*ctx), GFP_KERNEL, cpu_to_node(cpu));
+   if (!ctx)
+   goto fail;
+   *per_cpu_ptr(q->queue_ctx, cpu) = ctx;
kobject_init(>kobj, _mq_ctx_ktype);
}
+   q->mq_kobj = _kobj->kobj;
+   return 0;
+
+ fail:
+   for_each_possible_cpu(cpu) {
+   ctx = *per_cpu_ptr(q->queue_ctx, cpu);
+   if (ctx)
+   kobject_put(>kobj);
+   }
+   kobject_put(_kobj->kobj);
+   return -ENOMEM;
 }
 
 int __blk_mq_register_dev(struct device *dev, struct request_queue *q)
@@ -313,11 +346,11 @@ int __blk_mq_register_dev(struct device *dev, struct 
request_queue *q)
WARN_ON_ONCE(!q->kobj.parent);
lockdep_assert_held(>sysfs_lock);
 
-   ret = kobject_add(>mq_kobj, kobject_get(>kobj), "%s", "mq");
+   ret = kobject_add(q->mq_kobj, kobject_get(>kobj), "%s", "mq");
if (ret < 0)
goto out;
 
-   kobject_uevent(>mq_kobj, KOBJ_ADD);
+   kobject_uevent(q->mq_kobj, KOBJ_ADD);
 
queue_for_each_hw_ctx(q, hctx, i) {
ret = blk_mq_register_hctx(hctx);
@@ -334,8 +367,8 @@ int __blk_mq_register_dev(struct device *dev, struct 
request_queue *q)
while (--i >= 0)
blk_mq_unregister_hctx(q->queue_hw_ctx[i]);
 
-   kobject_ue

[PATCH 2/2] blk-mq: alloc q->queue_ctx as normal array

2018-11-15 Thread Ming Lei
Now q->queue_ctx is just a read-mostly table for looking up the
'blk_mq_ctx' instance from a cpu index, so it isn't necessary
to allocate it as a percpu variable. A simple array may be
more efficient.

Cc: Guenter Roeck 
Cc: Greg Kroah-Hartman 
Signed-off-by: Ming Lei 
---
 block/blk-mq-sysfs.c   |  6 +++---
 block/blk-mq.c | 12 +++-
 block/blk-mq.h |  2 +-
 include/linux/blkdev.h |  2 +-
 4 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index bab236955f56..dc4ac733a125 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -300,7 +300,7 @@ void blk_mq_sysfs_deinit(struct request_queue *q)
int cpu;
 
for_each_possible_cpu(cpu) {
-   ctx = *per_cpu_ptr(q->queue_ctx, cpu);
+   ctx = q->queue_ctx[cpu];
kobject_put(>kobj);
}
kobject_put(q->mq_kobj);
@@ -322,7 +322,7 @@ int blk_mq_sysfs_init(struct request_queue *q)
ctx = kzalloc_node(sizeof(*ctx), GFP_KERNEL, cpu_to_node(cpu));
if (!ctx)
goto fail;
-   *per_cpu_ptr(q->queue_ctx, cpu) = ctx;
+   q->queue_ctx[cpu] = ctx;
kobject_init(>kobj, _mq_ctx_ktype);
}
q->mq_kobj = _kobj->kobj;
@@ -330,7 +330,7 @@ int blk_mq_sysfs_init(struct request_queue *q)
 
  fail:
for_each_possible_cpu(cpu) {
-   ctx = *per_cpu_ptr(q->queue_ctx, cpu);
+   ctx = q->queue_ctx[cpu];
if (ctx)
kobject_put(>kobj);
}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3589ee601f37..20c485e1817a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2299,7 +2299,7 @@ static void blk_mq_init_cpu_queues(struct request_queue 
*q,
unsigned int i, j;
 
for_each_possible_cpu(i) {
-   struct blk_mq_ctx *__ctx = *per_cpu_ptr(q->queue_ctx, i);
+   struct blk_mq_ctx *__ctx = q->queue_ctx[i];
struct blk_mq_hw_ctx *hctx;
 
__ctx->cpu = i;
@@ -2385,7 +2385,7 @@ static void blk_mq_map_swqueue(struct request_queue *q)
set->map[0].mq_map[i] = 0;
}
 
-   ctx = *per_cpu_ptr(q->queue_ctx, i);
+   ctx = q->queue_ctx[i];
for (j = 0; j < set->nr_maps; j++) {
hctx = blk_mq_map_queue_type(q, j, i);
 
@@ -2541,7 +2541,7 @@ void blk_mq_release(struct request_queue *q)
 */
blk_mq_sysfs_deinit(q);
 
-   free_percpu(q->queue_ctx);
+   kfree(q->queue_ctx);
 }
 
 struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
@@ -2731,7 +2731,9 @@ struct request_queue *blk_mq_init_allocated_queue(struct 
blk_mq_tag_set *set,
if (!q->poll_cb)
goto err_exit;
 
-   q->queue_ctx = alloc_percpu(struct blk_mq_ctx *);
+   q->queue_ctx = kmalloc_array_node(nr_cpu_ids,
+ sizeof(struct blk_mq_ctx *),
+ GFP_KERNEL, set->numa_node);
if (!q->queue_ctx)
goto err_exit;
 
@@ -2798,7 +2800,7 @@ struct request_queue *blk_mq_init_allocated_queue(struct 
blk_mq_tag_set *set,
 err_sys_init:
blk_mq_sysfs_deinit(q);
 err_percpu:
-   free_percpu(q->queue_ctx);
+   kfree(q->queue_ctx);
 err_exit:
q->mq_ops = NULL;
return ERR_PTR(-ENOMEM);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 84898793c230..97829388e1db 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -129,7 +129,7 @@ static inline enum mq_rq_state blk_mq_rq_state(struct 
request *rq)
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
   unsigned int cpu)
 {
-   return *per_cpu_ptr(q->queue_ctx, cpu);
+   return q->queue_ctx[cpu];
 }
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 9e3892bd67fd..9b6ddc5c7a40 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -407,7 +407,7 @@ struct request_queue {
const struct blk_mq_ops *mq_ops;
 
/* sw queues */
-   struct blk_mq_ctx __percpu  **queue_ctx;
+   struct blk_mq_ctx   **queue_ctx;
unsigned intnr_queues;
 
unsigned intqueue_depth;
-- 
2.9.5



[PATCH] block/025: test discard sector alignement and sector size overflow

2018-11-14 Thread Ming Lei
This test covers the following two issues:

1) the discard sector needs to be aligned with the logical block size

2) make sure 'sector_t' instead of 'unsigned int' is used when comparing
against the discard sector count

Signed-off-by: Ming Lei 
---
 tests/block/025 | 37 +
 tests/block/025.out |  2 ++
 2 files changed, 39 insertions(+)
 create mode 100755 tests/block/025
 create mode 100644 tests/block/025.out

diff --git a/tests/block/025 b/tests/block/025
new file mode 100755
index ..32b632431793
--- /dev/null
+++ b/tests/block/025
@@ -0,0 +1,37 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-3.0+
+# Copyright (C) 2018 Ming Lei 
+#
+# Check two corner cases of BLKDISCARD.
+#
+# 1) test if discard bio's sector is aligned with logical block size, fixed by
+#1adfc5e4136f ("block: make sure discard bio is aligned with logical block 
size")
+# 2) test 32 bit overflow when comparing discard sector size. Fixed by
+#4800bf7bc8c72 ("block: fix 32 bit overflow in __blkdev_issue_discard()")
+
+. tests/block/rc
+. common/scsi_debug
+
+DESCRIPTION="check sector alignment and sector size overflow of BLKDISCARD"
+
+requires() {
+   _have_scsi_debug
+}
+
+test() {
+   echo "Running ${TEST_NAME}"
+
+   rm -f "$FULL"
+
+   # Create virtual device with unmap_zeroes_data support
+   if ! _init_scsi_debug virtual_gb=2049 sector_size=4096 lbpws10=1 
dev_size_mb=512; then
+   return 1
+   fi
+
+   local dev="/dev/${SCSI_DEBUG_DEVICES[0]}"
+   blkdiscard "$dev"
+
+   _exit_scsi_debug
+
+   echo "Test complete"
+}
diff --git a/tests/block/025.out b/tests/block/025.out
new file mode 100644
index ..fd9a6d5f70de
--- /dev/null
+++ b/tests/block/025.out
@@ -0,0 +1,2 @@
+Running block/025
+Test complete
-- 
2.9.5



Re: [PATCH] block: fix 32 bit overflow in __blkdev_issue_discard()

2018-11-14 Thread Ming Lei
On Thu, Nov 15, 2018 at 12:22:01PM +1100, Dave Chinner wrote:
> On Thu, Nov 15, 2018 at 09:06:52AM +0800, Ming Lei wrote:
> > On Wed, Nov 14, 2018 at 08:18:24AM -0700, Jens Axboe wrote:
> > > On 11/13/18 2:43 PM, Dave Chinner wrote:
> > > > From: Dave Chinner 
> > > > 
> > > > A discard cleanup merged into 4.20-rc2 causes fstests xfs/259 to
> > > > fall into an endless loop in the discard code. The test is creating
> > > > a device that is exactly 2^32 sectors in size to test mkfs boundary
> > > > conditions around the 32 bit sector overflow region.
> > > > 
> > > > mkfs issues a discard for the entire device size by default, and
> > > > hence this throws a sector count of 2^32 into
> > > > blkdev_issue_discard(). It takes the number of sectors to discard as
> > > > a sector_t - a 64 bit value.
> > > > 
> > > > The commit ba5d73851e71 ("block: cleanup __blkdev_issue_discard")
> > > > takes this sector count and casts it to a 32 bit value before
> > > > comparing it against the maximum allowed discard size the device
> > > > has. This truncates away the upper 32 bits, and so if the lower 32
> > > > bits of the sector count is zero, it starts issuing discards of
> > > > length 0. This causes the code to fall into an endless loop, issuing
> > > > a zero length discards over and over again on the same sector.
> > > 
> > > Applied, thanks. Ming, can you please add a blktests test for
> > > this case? This is the 2nd time it's been broken.
> > 
> > OK, I will add zram discard test in blktests, which should cover the
> > 1st report. For the xfs/259, I need to investigate if it is easy to
> > do in blktests.
> 
> Just write a test that creates block devices of 2^32 + (-1,0,1)
> sectors and runs a discard across the entire device. That's all that
> xfs/259 it doing - exercising mkfs on 2TB, 4TB and 16TB boundaries.
> i.e. the boundaries where sectors and page cache indexes (on 4k page
> size systems) overflow 32 bit int and unsigned int sizes. mkfs
> issues a discard for the entire device, so it's testing that as
> well...

Indeed, I can reproduce this issue via the following commands:

modprobe scsi_debug virtual_gb=2049 sector_size=512 lbpws10=1 dev_size_mb=512
blkdiscard /dev/sde

> 
> You need to write tests that exercise write_same, write_zeros and
> discard operations around these boundaries, because they all take
> a 64 bit sector count and stuff them into 32 bit size fields in
> the bio tha tis being submitted.

write_same/write_zeros are usually used by the driver directly, so we
may need to build the test case on top of a specific device.

Thanks,
Ming


Re: [PATCH] block: fix 32 bit overflow in __blkdev_issue_discard()

2018-11-14 Thread Ming Lei
On Wed, Nov 14, 2018 at 08:18:24AM -0700, Jens Axboe wrote:
> On 11/13/18 2:43 PM, Dave Chinner wrote:
> > From: Dave Chinner 
> > 
> > A discard cleanup merged into 4.20-rc2 causes fstests xfs/259 to
> > fall into an endless loop in the discard code. The test is creating
> > a device that is exactly 2^32 sectors in size to test mkfs boundary
> > conditions around the 32 bit sector overflow region.
> > 
> > mkfs issues a discard for the entire device size by default, and
> > hence this throws a sector count of 2^32 into
> > blkdev_issue_discard(). It takes the number of sectors to discard as
> > a sector_t - a 64 bit value.
> > 
> > The commit ba5d73851e71 ("block: cleanup __blkdev_issue_discard")
> > takes this sector count and casts it to a 32 bit value before
> > comparing it against the maximum allowed discard size the device
> > has. This truncates away the upper 32 bits, and so if the lower 32
> > bits of the sector count is zero, it starts issuing discards of
> > length 0. This causes the code to fall into an endless loop, issuing
> > zero length discards over and over again on the same sector.
> 
> Applied, thanks. Ming, can you please add a blktests test for
> this case? This is the 2nd time it's been broken.

OK, I will add a zram discard test to blktests, which should cover the
1st report. For xfs/259, I need to investigate whether it is easy to
do in blktests.

Thanks,
Ming


Re: [PATCH] block: fix 32 bit overflow in __blkdev_issue_discard()

2018-11-14 Thread Ming Lei
On Wed, Nov 14, 2018 at 4:09 PM Dave Chinner  wrote:
>
> On Wed, Nov 14, 2018 at 10:53:11AM +0800, Ming Lei wrote:
> > On Wed, Nov 14, 2018 at 5:44 AM Dave Chinner  wrote:
> > >
> > > From: Dave Chinner 
> > >
> > > A discard cleanup merged into 4.20-rc2 causes fstests xfs/259 to
> > > fall into an endless loop in the discard code. The test is creating
> > > a device that is exactly 2^32 sectors in size to test mkfs boundary
> > > conditions around the 32 bit sector overflow region.
> > >
> > > mkfs issues a discard for the entire device size by default, and
> > > hence this throws a sector count of 2^32 into
> > > blkdev_issue_discard(). It takes the number of sectors to discard as
> > > a sector_t - a 64 bit value.
> > >
> > > The commit ba5d73851e71 ("block: cleanup __blkdev_issue_discard")
> > > takes this sector count and casts it to a 32 bit value before
> > > comparing it against the maximum allowed discard size the device
> > > has. This truncates away the upper 32 bits, and so if the lower 32
> > > bits of the sector count is zero, it starts issuing discards of
> > > length 0. This causes the code to fall into an endless loop, issuing
> > > zero length discards over and over again on the same sector.
> > >
> > > Fixes: ba5d73851e71 ("block: cleanup __blkdev_issue_discard")
> > > Signed-off-by: Dave Chinner 
> > > ---
> > >  block/blk-lib.c | 5 -
> > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > > index e8b3bb9bf375..144e156ed341 100644
> > > --- a/block/blk-lib.c
> > > +++ b/block/blk-lib.c
> > > @@ -55,9 +55,12 @@ int __blkdev_issue_discard(struct block_device *bdev, 
> > > sector_t sector,
> > > return -EINVAL;
> > >
> > > while (nr_sects) {
> > > -   unsigned int req_sects = min_t(unsigned int, nr_sects,
> > > +   sector_t req_sects = min_t(sector_t, nr_sects,
> > > bio_allowed_max_sectors(q));
> >
> > bio_allowed_max_sectors(q) is always < UINT_MAX, and 'sector_t' is only
> > required during the comparison, so another, simpler fix might be the
> > following; could you test whether it works?
> >
> > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > index e8b3bb9bf375..6ef44f99e83f 100644
> > --- a/block/blk-lib.c
> > +++ b/block/blk-lib.c
> > @@ -55,7 +55,7 @@ int __blkdev_issue_discard(struct block_device
> > *bdev, sector_t sector,
> > return -EINVAL;
> >
> > while (nr_sects) {
> > -   unsigned int req_sects = min_t(unsigned int, nr_sects,
> > +   unsigned int req_sects = min_t(sector_t, nr_sects,
> > bio_allowed_max_sectors(q));
>
> Rearrange the deck chairs all you like, just make sure you fix your
> regression test suite to exercise obvious boundary conditions like
> this so the next cleanup doesn't break the code again.

Good point, we may add a comment documenting the overflow story.

>
> > >
> > > +   WARN_ON_ONCE(req_sects == 0);
> >
> > The above line isn't necessary given 'nr_sects' can't be zero.
>
> Except it was 0 and it caused the bug I had to fix. So it should
> have a warning on it.

Obviously, it can't be zero unless the CPU is broken, :-)

Thanks,
Ming Lei


Re: [PATCH] block: fix 32 bit overflow in __blkdev_issue_discard()

2018-11-13 Thread Ming Lei
On Wed, Nov 14, 2018 at 5:44 AM Dave Chinner  wrote:
>
> From: Dave Chinner 
>
> A discard cleanup merged into 4.20-rc2 causes fstests xfs/259 to
> fall into an endless loop in the discard code. The test is creating
> a device that is exactly 2^32 sectors in size to test mkfs boundary
> conditions around the 32 bit sector overflow region.
>
> mkfs issues a discard for the entire device size by default, and
> hence this throws a sector count of 2^32 into
> blkdev_issue_discard(). It takes the number of sectors to discard as
> a sector_t - a 64 bit value.
>
> The commit ba5d73851e71 ("block: cleanup __blkdev_issue_discard")
> takes this sector count and casts it to a 32 bit value before
> comparing it against the maximum allowed discard size the device
> has. This truncates away the upper 32 bits, and so if the lower 32
> bits of the sector count is zero, it starts issuing discards of
> length 0. This causes the code to fall into an endless loop, issuing
> zero length discards over and over again on the same sector.
>
> Fixes: ba5d73851e71 ("block: cleanup __blkdev_issue_discard")
> Signed-off-by: Dave Chinner 
> ---
>  block/blk-lib.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index e8b3bb9bf375..144e156ed341 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -55,9 +55,12 @@ int __blkdev_issue_discard(struct block_device *bdev, 
> sector_t sector,
> return -EINVAL;
>
> while (nr_sects) {
> -   unsigned int req_sects = min_t(unsigned int, nr_sects,
> +   sector_t req_sects = min_t(sector_t, nr_sects,
> bio_allowed_max_sectors(q));

bio_allowed_max_sectors(q) is always < UINT_MAX, and 'sector_t' is only
required during the comparison, so another, simpler fix might be the following;
could you test whether it works?

diff --git a/block/blk-lib.c b/block/blk-lib.c
index e8b3bb9bf375..6ef44f99e83f 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -55,7 +55,7 @@ int __blkdev_issue_discard(struct block_device
*bdev, sector_t sector,
return -EINVAL;

while (nr_sects) {
-   unsigned int req_sects = min_t(unsigned int, nr_sects,
+   unsigned int req_sects = min_t(sector_t, nr_sects,
bio_allowed_max_sectors(q));
>
> +   WARN_ON_ONCE(req_sects == 0);

The above line isn't necessary given 'nr_sects' can't be zero.
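As an aside, a tiny user-space sketch (illustrative only, not kernel code)
of the truncation this type change avoids: with 'unsigned int' as the
comparison type a 2^32 sector count wraps to 0, while comparing in a 64-bit
type keeps the value intact.

	/* gcc -o trunc trunc.c && ./trunc  ->  prints 0, then 65535 */
	#include <stdio.h>
	#include <stdint.h>

	#define min_t(type, a, b) \
		((type)(a) < (type)(b) ? (type)(a) : (type)(b))

	int main(void)
	{
		uint64_t nr_sects = 1ULL << 32;	/* 2^32 sectors, as in xfs/259 */
		uint32_t max = 0xffff;		/* some maximum request size */

		printf("%u\n", min_t(uint32_t, nr_sects, max));
		printf("%llu\n",
		       (unsigned long long)min_t(uint64_t, nr_sects, max));
		return 0;
	}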


Thanks,
Ming Lei


Re: [PATCH] block: Clear kernel memory before copying to user

2018-11-08 Thread Ming Lei
On Thu, Nov 8, 2018 at 6:07 PM Johannes Thumshirn  wrote:
>
> On 08/11/2018 02:22, Keith Busch wrote:
> > $ ./sg-test /dev/sda | grep -v 0
> > 40733f4019db8001244019db4065244019db0094244019db
> > c025244019dbc0e43a4019db40973a4019dbc0623a4019db
> > 800c244019dbc0d61d4019dbc05f244019db80091e4019db
> > 40913a4019db806f3f4019db40a83f4019dbc083244019db
> > 80eb1e4019db00a93f4019dbc09a3a4019db40503f4019db
> > 007f1b4019dbc0d91e4019db40551e4019db804a1b4019db
> > 
> > --
> >
>
> Hi Keith,
> Can you please add the above to blktests?
>
> This would be very useful for downstream distributions.

I guess the issue may depend on the specific QEMU version. I just tried the test
over virtio-scsi/sata/usb-storage emulated via qemu-2.10.2-1.fc27 and did not
observe this problem.

Thanks,
Ming Lei


Re: [PATCH] block: Clear kernel memory before copying to user

2018-11-07 Thread Ming Lei
On Wed, Nov 7, 2018 at 11:47 PM Keith Busch  wrote:
>
> On Wed, Nov 07, 2018 at 11:44:59PM +0800, Ming Lei wrote:
> > blk_update_request() may tell us how much progress was made, :-)
>
> Except when it doesn't, which is 100% of the time for many block
> drivers, including nvme.

Please look at blk_mq_end_request() (called from nvme_complete_rq()), which
calls blk_update_request().

Thanks,
Ming Lei


Re: [PATCH] block: Clear kernel memory before copying to user

2018-11-07 Thread Ming Lei
On Wed, Nov 7, 2018 at 11:19 PM Keith Busch  wrote:
>
> On Wed, Nov 07, 2018 at 11:09:27PM +0800, Ming Lei wrote:
> > On Wed, Nov 7, 2018 at 10:42 PM Keith Busch  wrote:
> > >
> > > If the kernel allocates a bounce buffer for user read data, this memory
> > > needs to be cleared before copying it to the user, otherwise it may leak
> > > kernel memory to user space.
> > >
> > > Signed-off-by: Keith Busch 
> > > ---
> > >  block/bio.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > diff --git a/block/bio.c b/block/bio.c
> > > index d5368a445561..a50d59236b19 100644
> > > --- a/block/bio.c
> > > +++ b/block/bio.c
> > > @@ -1260,6 +1260,7 @@ struct bio *bio_copy_user_iov(struct request_queue 
> > > *q,
> > > if (ret)
> > > goto cleanup;
> > > } else {
> > > +   zero_fill_bio(bio);
> > > iov_iter_advance(iter, bio->bi_iter.bi_size);
> > > }
> >
> > This approach looks inefficient because the zero fill should only be
> > required for a short READ.
>
> Sure, but how do you know that happened before copying the bounce buffer
> to user space?

blk_update_request() may tell us how much progress was made, :-)

So it should work by calling the following code after the req is completed
and before copying data to userspace:

	__rq_for_each_bio(bio, rq)
		zero_fill_bio(bio);
>
> We could zero the pages on allocation if that's better (and doesn't zero
> twice if __GFP_ZERO was already provided):
>
> ---
> diff --git a/block/bio.c b/block/bio.c
> index d5368a445561..a1b6383294f4 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1212,6 +1212,9 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
> nr_pages = 1 << map_data->page_order;
> i = map_data->offset / PAGE_SIZE;
> }
> +
> +   if (iov_iter_rw(iter) == READ)
> +   gfp_mask |= __GFP_ZERO;

No big difference compared with before, because most of the time the zero fill
shouldn't be needed. However, this patch changes it to be done every time.

Thanks,
Ming Lei


Re: [PATCH] block: Clear kernel memory before copying to user

2018-11-07 Thread Ming Lei
On Wed, Nov 7, 2018 at 10:42 PM Keith Busch  wrote:
>
> If the kernel allocates a bounce buffer for user read data, this memory
> needs to be cleared before copying it to the user, otherwise it may leak
> kernel memory to user space.
>
> Signed-off-by: Keith Busch 
> ---
>  block/bio.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/block/bio.c b/block/bio.c
> index d5368a445561..a50d59236b19 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1260,6 +1260,7 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
> if (ret)
> goto cleanup;
> } else {
> +   zero_fill_bio(bio);
> iov_iter_advance(iter, bio->bi_iter.bi_size);
> }

This approach looks inefficient because the zero fill should only be required
for a short READ.

Thanks,
Ming Lei


Re: [PATCH V2 0/3] block: make sure discard/writesame bio is aligned with logical block size

2018-11-06 Thread Ming Lei
On Tue, Nov 06, 2018 at 04:48:13PM +, Rui Salvaterra wrote:
> On Mon, 5 Nov 2018 at 03:41, Ming Lei  wrote:
> >
> > V2 addresses Christoph's comment by introducing bio_allowed_max_sectors().
> >
> > Ping...
> >
> > Thanks,
> > Ming
> 
> Hi, Ming,
> 
> Sorry for the delay. I tested your V2 against Linux 4.20-rc1 and
> everything seems fine. FWIW, V2 is also
> 
> Tested-by: Rui Salvaterra 

Rui, thanks for your test on V2.

Jens, what do you think about this patchset?

Thanks,
Ming


Re: [PATCH V2 0/3] block: make sure discard/writesame bio is aligned with logical block size

2018-11-04 Thread Ming Lei
On Mon, Oct 29, 2018 at 08:57:16PM +0800, Ming Lei wrote:
> Hi,
> 
> The 1st & 3rd patch fixes bio size alignment issue.
> 
> The 2nd patch cleans up __blkdev_issue_discard() a bit.
> 
> V2:
>   - introduce helper of bio_allowed_max_sectors()
>   - add commit log for patch 2
> 
> 
> Ming Lei (3):
>   block: make sure discard bio is aligned with logical block size
>   block: cleanup __blkdev_issue_discard()
>   block: make sure writesame bio is aligned with logical block size
> 
>  block/blk-lib.c   | 26 +++---
>  block/blk-merge.c |  3 ++-
>  block/blk.h   | 10 ++
>  3 files changed, 19 insertions(+), 20 deletions(-)
> 
> Cc: Rui Salvaterra 
> Cc: sta...@vger.kernel.org
> Cc: Mike Snitzer 
> Cc: Christoph Hellwig 
> Cc: Xiao Ni 
> Cc: Mariusz Dabrowski 

V2 addresses Christoph's comment by introducing bio_allowed_max_sectors().

Ping...

Thanks,
Ming


[PATCH 1/4] Revert "irq: add support for allocating (and affinitizing) sets of IRQs"

2018-11-02 Thread Ming Lei
This reverts commit 1d44f6f43e229ca06bf680aa7eb5ad380eaa5d72.
---
 drivers/pci/msi.c | 14 --
 include/linux/interrupt.h |  4 
 kernel/irq/affinity.c | 40 +---
 3 files changed, 9 insertions(+), 49 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 265ed3e4c920..af24ed50a245 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -1036,13 +1036,6 @@ static int __pci_enable_msi_range(struct pci_dev *dev, 
int minvec, int maxvec,
if (maxvec < minvec)
return -ERANGE;
 
-   /*
-* If the caller is passing in sets, we can't support a range of
-* vectors. The caller needs to handle that.
-*/
-   if (affd && affd->nr_sets && minvec != maxvec)
-   return -EINVAL;
-
if (WARN_ON_ONCE(dev->msi_enabled))
return -EINVAL;
 
@@ -1094,13 +1087,6 @@ static int __pci_enable_msix_range(struct pci_dev *dev,
if (maxvec < minvec)
return -ERANGE;
 
-   /*
-* If the caller is passing in sets, we can't support a range of
-* supported vectors. The caller needs to handle that.
-*/
-   if (affd && affd->nr_sets && minvec != maxvec)
-   return -EINVAL;
-
if (WARN_ON_ONCE(dev->msix_enabled))
return -EINVAL;
 
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index ca397ff40836..1d6711c28271 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -247,14 +247,10 @@ struct irq_affinity_notify {
  * the MSI(-X) vector space
  * @post_vectors:  Don't apply affinity to @post_vectors at end of
  * the MSI(-X) vector space
- * @nr_sets:   Length of passed in *sets array
- * @sets:  Number of affinitized sets
  */
 struct irq_affinity {
int pre_vectors;
int post_vectors;
-   int nr_sets;
-   int *sets;
 };
 
 #if defined(CONFIG_SMP)
diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 2046a0f0f0f1..f4f29b9d90ee 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -180,7 +180,6 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
int curvec, usedvecs;
cpumask_var_t nmsk, npresmsk, *node_to_cpumask;
struct cpumask *masks = NULL;
-   int i, nr_sets;
 
/*
 * If there aren't any vectors left after applying the pre/post
@@ -211,23 +210,10 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
get_online_cpus();
build_node_to_cpumask(node_to_cpumask);
 
-   /*
-* Spread on present CPUs starting from affd->pre_vectors. If we
-* have multiple sets, build each sets affinity mask separately.
-*/
-   nr_sets = affd->nr_sets;
-   if (!nr_sets)
-   nr_sets = 1;
-
-   for (i = 0, usedvecs = 0; i < nr_sets; i++) {
-   int this_vecs = affd->sets ? affd->sets[i] : affvecs;
-   int nr;
-
-   nr = irq_build_affinity_masks(affd, curvec, this_vecs,
- node_to_cpumask, cpu_present_mask,
- nmsk, masks + usedvecs);
-   usedvecs += nr;
-   }
+   /* Spread on present CPUs starting from affd->pre_vectors */
+   usedvecs = irq_build_affinity_masks(affd, curvec, affvecs,
+   node_to_cpumask, cpu_present_mask,
+   nmsk, masks);
 
/*
 * Spread on non present CPUs starting from the next vector to be
@@ -272,21 +258,13 @@ int irq_calc_affinity_vectors(int minvec, int maxvec, 
const struct irq_affinity
 {
int resv = affd->pre_vectors + affd->post_vectors;
int vecs = maxvec - resv;
-   int set_vecs;
+   int ret;
 
if (resv > minvec)
return 0;
 
-   if (affd->nr_sets) {
-   int i;
-
-   for (i = 0, set_vecs = 0;  i < affd->nr_sets; i++)
-   set_vecs += affd->sets[i];
-   } else {
-   get_online_cpus();
-   set_vecs = cpumask_weight(cpu_possible_mask);
-   put_online_cpus();
-   }
-
-   return resv + min(set_vecs, vecs);
+   get_online_cpus();
+   ret = min_t(int, cpumask_weight(cpu_possible_mask), vecs) + resv;
+   put_online_cpus();
+   return ret;
 }
-- 
2.9.5



[PATCH V2] block: brd: associate with queue until adding disk

2018-11-01 Thread Ming Lei
brd_free() may be called in the failure path on a brd instance whose
disk hasn't been added yet, so the release handler of the gendisk may free the
associated request_queue early and cause the following use-after-free [1].

This patch fixes the issue by associating the gendisk with the request_queue
just before adding the disk.

[1] KASAN: use-after-free Read in del_timer_sync
Non-volatile memory driver v1.3
Linux agpgart interface v0.103
[drm] Initialized vgem 1.0.0 20120112 for virtual device on minor 0
usbcore: registered new interface driver udl
==
BUG: KASAN: use-after-free in __lock_acquire+0x36d9/0x4c20
kernel/locking/lockdep.c:3218
Read of size 8 at addr 8801d1b6b540 by task swapper/0/1

CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.0+ #88
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x244/0x39d lib/dump_stack.c:113
  print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
  kasan_report_error mm/kasan/report.c:354 [inline]
  kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
  __lock_acquire+0x36d9/0x4c20 kernel/locking/lockdep.c:3218
  lock_acquire+0x1ed/0x520 kernel/locking/lockdep.c:3844
  del_timer_sync+0xb7/0x270 kernel/time/timer.c:1283
  blk_cleanup_queue+0x413/0x710 block/blk-core.c:809
  brd_free+0x5d/0x71 drivers/block/brd.c:422
  brd_init+0x2eb/0x393 drivers/block/brd.c:518
  do_one_initcall+0x145/0x957 init/main.c:890
  do_initcall_level init/main.c:958 [inline]
  do_initcalls init/main.c:966 [inline]
  do_basic_setup init/main.c:984 [inline]
  kernel_init_freeable+0x5c6/0x6b9 init/main.c:1148
  kernel_init+0x11/0x1ae init/main.c:1068
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:350

Reported-by: syzbot+3701447012fe951da...@syzkaller.appspotmail.com
Signed-off-by: Ming Lei 
---
V2:
- don't reference queue via disk->queue in brd_alloc()


 drivers/block/brd.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index df8103dd40ac..c18586fccb6f 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -396,15 +396,14 @@ static struct brd_device *brd_alloc(int i)
disk->first_minor   = i * max_part;
disk->fops  = &brd_fops;
disk->private_data  = brd;
-   disk->queue = brd->brd_queue;
disk->flags = GENHD_FL_EXT_DEVT;
sprintf(disk->disk_name, "ram%d", i);
set_capacity(disk, rd_size * 2);
-   disk->queue->backing_dev_info->capabilities |= BDI_CAP_SYNCHRONOUS_IO;
+   brd->brd_queue->backing_dev_info->capabilities |= 
BDI_CAP_SYNCHRONOUS_IO;
 
/* Tell the block layer that this is not a rotational device */
-   blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue);
-   blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, disk->queue);
+   blk_queue_flag_set(QUEUE_FLAG_NONROT, brd->brd_queue);
+   blk_queue_flag_clear(QUEUE_FLAG_ADD_RANDOM, brd->brd_queue);
 
return brd;
 
@@ -436,6 +435,7 @@ static struct brd_device *brd_init_one(int i, bool *new)
 
brd = brd_alloc(i);
if (brd) {
+   brd->brd_disk->queue = brd->brd_queue;
add_disk(brd->brd_disk);
list_add_tail(&brd->brd_list, &brd_devices);
}
@@ -503,8 +503,14 @@ static int __init brd_init(void)
 
/* point of no return */
 
-   list_for_each_entry(brd, &brd_devices, brd_list)
+   list_for_each_entry(brd, &brd_devices, brd_list) {
+   /*
+* associate with queue just before adding disk for
+* avoiding to mess up failure path
+*/
+   brd->brd_disk->queue = brd->brd_queue;
add_disk(brd->brd_disk);
+   }
 
blk_register_region(MKDEV(RAMDISK_MAJOR, 0), 1UL << MINORBITS,
  THIS_MODULE, brd_probe, NULL, NULL);
-- 
2.9.5



[PATCH] block: brd: associate with queue until adding disk

2018-10-31 Thread Ming Lei
brd_free() may be called in the failure path on a brd instance whose
disk hasn't been added yet, so the release handler of the gendisk may free the
associated request_queue early and cause the following use-after-free [1].

This patch fixes the issue by associating the gendisk with the request_queue
just before adding the disk.

[1] KASAN: use-after-free Read in del_timer_sync
Non-volatile memory driver v1.3
Linux agpgart interface v0.103
[drm] Initialized vgem 1.0.0 20120112 for virtual device on minor 0
usbcore: registered new interface driver udl
==
BUG: KASAN: use-after-free in __lock_acquire+0x36d9/0x4c20
kernel/locking/lockdep.c:3218
Read of size 8 at addr 8801d1b6b540 by task swapper/0/1

CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.0+ #88
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x244/0x39d lib/dump_stack.c:113
  print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
  kasan_report_error mm/kasan/report.c:354 [inline]
  kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
  __lock_acquire+0x36d9/0x4c20 kernel/locking/lockdep.c:3218
  lock_acquire+0x1ed/0x520 kernel/locking/lockdep.c:3844
  del_timer_sync+0xb7/0x270 kernel/time/timer.c:1283
  blk_cleanup_queue+0x413/0x710 block/blk-core.c:809
  brd_free+0x5d/0x71 drivers/block/brd.c:422
  brd_init+0x2eb/0x393 drivers/block/brd.c:518
  do_one_initcall+0x145/0x957 init/main.c:890
  do_initcall_level init/main.c:958 [inline]
  do_initcalls init/main.c:966 [inline]
  do_basic_setup init/main.c:984 [inline]
  kernel_init_freeable+0x5c6/0x6b9 init/main.c:1148
  kernel_init+0x11/0x1ae init/main.c:1068
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:350

Reported-by: syzbot+3701447012fe951da...@syzkaller.appspotmail.com
Signed-off-by: Ming Lei 
---
 drivers/block/brd.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index df8103dd40ac..6b3b8964b996 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -396,7 +396,6 @@ static struct brd_device *brd_alloc(int i)
disk->first_minor   = i * max_part;
disk->fops  = &brd_fops;
disk->private_data  = brd;
-   disk->queue = brd->brd_queue;
disk->flags = GENHD_FL_EXT_DEVT;
sprintf(disk->disk_name, "ram%d", i);
set_capacity(disk, rd_size * 2);
@@ -436,6 +435,7 @@ static struct brd_device *brd_init_one(int i, bool *new)
 
brd = brd_alloc(i);
if (brd) {
+   brd->brd_disk->queue = brd->brd_queue;
add_disk(brd->brd_disk);
list_add_tail(&brd->brd_list, &brd_devices);
}
@@ -503,8 +503,14 @@ static int __init brd_init(void)
 
/* point of no return */
 
-   list_for_each_entry(brd, &brd_devices, brd_list)
+   list_for_each_entry(brd, &brd_devices, brd_list) {
+   /*
+* associate with queue just before adding disk for
+* avoiding to mess up failure path
+*/
+   brd->brd_disk->queue = brd->brd_queue;
add_disk(brd->brd_disk);
+   }
 
blk_register_region(MKDEV(RAMDISK_MAJOR, 0), 1UL << MINORBITS,
  THIS_MODULE, brd_probe, NULL, NULL);
-- 
2.9.5



Re: [PATCH] block: call rq_qos_exit() after queue is frozen

2018-10-30 Thread Ming Lei
On Wed, Oct 24, 2018 at 9:18 PM Ming Lei  wrote:
>
> rq_qos_exit() removes the current q->rq_qos. This has to be
> done after the queue is frozen, otherwise the IO queueing path may never
> be woken up, and an IO hang is caused.
>
> So fix this issue by moving rq_qos_exit() to after the queue is frozen.
>
> Cc: Josef Bacik 
> Signed-off-by: Ming Lei 
> ---
>  block/blk-core.c  | 3 +++
>  block/blk-sysfs.c | 2 --
>  2 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3ed60723e242..0a5a6cf5add7 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -785,6 +785,9 @@ void blk_cleanup_queue(struct request_queue *q)
>  * prevent that q->request_fn() gets invoked after draining finished.
>  */
> blk_freeze_queue(q);
> +
> +   rq_qos_exit(q);
> +
> spin_lock_irq(lock);
> queue_flag_set(QUEUE_FLAG_DEAD, q);
> spin_unlock_irq(lock);
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 3772671cf2bc..d4f1280965b6 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -994,8 +994,6 @@ void blk_unregister_queue(struct gendisk *disk)
> kobject_del(&q->kobj);
> blk_trace_remove_sysfs(disk_to_dev(disk));
>
> -   rq_qos_exit(q);
> -
> mutex_lock(&q->sysfs_lock);
>         if (q->request_fn || (q->mq_ops && q->elevator))
> elv_unregister_queue(q);
> --
> 2.9.5
>

Ping...

-- 
Ming Lei


[PATCH V2 3/3] block: make sure writesame bio is aligned with logical block size

2018-10-29 Thread Ming Lei
Obviously the created writesame bio has to be aligned with the logical block
size; use bio_allowed_max_sectors() to retrieve this limit.

Fixes: b49a0871be31a745b2ef ("block: remove split code in 
blkdev_issue_{discard,write_same}")
Cc: Rui Salvaterra 
Cc: sta...@vger.kernel.org
Cc: Mike Snitzer 
Cc: Christoph Hellwig 
Cc: Xiao Ni 
Cc: Mariusz Dabrowski 
Signed-off-by: Ming Lei 
---
 block/blk-lib.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index d58d5d87dd88..e8b3bb9bf375 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -149,7 +149,7 @@ static int __blkdev_issue_write_same(struct block_device 
*bdev, sector_t sector,
return -EOPNOTSUPP;
 
/* Ensure that max_write_same_sectors doesn't overflow bi_size */
-   max_write_same_sectors = UINT_MAX >> 9;
+   max_write_same_sectors = bio_allowed_max_sectors(q);
 
while (nr_sects) {
bio = blk_next_bio(bio, 1, gfp_mask);
-- 
2.9.5



[PATCH V2 2/3] block: cleanup __blkdev_issue_discard()

2018-10-29 Thread Ming Lei
Cleanup __blkdev_issue_discard() a bit:

- remove local variable of 'end_sect'
- remove code block of 'fail'

Cc: Rui Salvaterra 
Cc: Mike Snitzer 
Cc: Christoph Hellwig 
Cc: Xiao Ni 
Cc: Mariusz Dabrowski 
Signed-off-by: Ming Lei 
---
 block/blk-lib.c | 23 ++-
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index d56fd159d2e8..d58d5d87dd88 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -51,15 +51,12 @@ int __blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
if ((sector | nr_sects) & bs_mask)
return -EINVAL;
 
-   while (nr_sects) {
-   unsigned int req_sects = nr_sects;
-   sector_t end_sect;
-
-   if (!req_sects)
-   goto fail;
-   req_sects = min(req_sects, bio_allowed_max_sectors(q));
+   if (!nr_sects)
+   return -EINVAL;
 
-   end_sect = sector + req_sects;
+   while (nr_sects) {
+   unsigned int req_sects = min_t(unsigned int, nr_sects,
+   bio_allowed_max_sectors(q));
 
bio = blk_next_bio(bio, 0, gfp_mask);
bio->bi_iter.bi_sector = sector;
@@ -67,8 +64,8 @@ int __blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
bio_set_op_attrs(bio, op, 0);
 
bio->bi_iter.bi_size = req_sects << 9;
+   sector += req_sects;
nr_sects -= req_sects;
-   sector = end_sect;
 
/*
 * We can loop for a long time in here, if someone does
@@ -81,14 +78,6 @@ int __blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
 
*biop = bio;
return 0;
-
-fail:
-   if (bio) {
-   submit_bio_wait(bio);
-   bio_put(bio);
-   }
-   *biop = NULL;
-   return -EOPNOTSUPP;
 }
 EXPORT_SYMBOL(__blkdev_issue_discard);
 
-- 
2.9.5



[PATCH V2 0/3] block: make sure discard/writesame bio is aligned with logical block size

2018-10-29 Thread Ming Lei
Hi,

The 1st & 3rd patch fixes bio size alignment issue.

The 2nd patch cleans up __blkdev_issue_discard() a bit.

V2:
- introduce helper of bio_allowed_max_sectors()
- add commit log for patch 2


Ming Lei (3):
  block: make sure discard bio is aligned with logical block size
  block: cleanup __blkdev_issue_discard()
  block: make sure writesame bio is aligned with logical block size

 block/blk-lib.c   | 26 +++---
 block/blk-merge.c |  3 ++-
 block/blk.h   | 10 ++
 3 files changed, 19 insertions(+), 20 deletions(-)

Cc: Rui Salvaterra 
Cc: sta...@vger.kernel.org
Cc: Mike Snitzer 
Cc: Christoph Hellwig 
Cc: Xiao Ni 
Cc: Mariusz Dabrowski 
-- 
2.9.5



[PATCH V2 1/3] block: make sure discard bio is aligned with logical block size

2018-10-29 Thread Ming Lei
Obviously the created discard bio has to be aligned with logical block size.

This patch introduces the helper of bio_allowed_max_sectors() for
this purpose.

Fixes: 744889b7cbb56a6 ("block: don't deal with discard limit in 
blkdev_issue_discard()")
Fixes: a22c4d7e34402cc ("block: re-add discard_granularity and alignment 
checks")
Reported-by: Rui Salvaterra 
Cc: Rui Salvaterra 
Cc: sta...@vger.kernel.org
Cc: Mike Snitzer 
Cc: Christoph Hellwig 
Cc: Xiao Ni 
Cc: Mariusz Dabrowski 
Signed-off-by: Ming Lei 
---
 block/blk-lib.c   |  3 +--
 block/blk-merge.c |  3 ++-
 block/blk.h   | 10 ++
 3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 76f867ea9a9b..d56fd159d2e8 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -57,8 +57,7 @@ int __blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
 
if (!req_sects)
goto fail;
-   if (req_sects > UINT_MAX >> 9)
-   req_sects = UINT_MAX >> 9;
+   req_sects = min(req_sects, bio_allowed_max_sectors(q));
 
end_sect = sector + req_sects;
 
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 42a46744c11b..507fbaa6b4c0 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -90,7 +90,8 @@ static struct bio *blk_bio_discard_split(struct request_queue 
*q,
/* Zero-sector (unknown) and one-sector granularities are the same.  */
granularity = max(q->limits.discard_granularity >> 9, 1U);
 
-   max_discard_sectors = min(q->limits.max_discard_sectors, UINT_MAX >> 9);
+   max_discard_sectors = min(q->limits.max_discard_sectors,
+   bio_allowed_max_sectors(q));
max_discard_sectors -= max_discard_sectors % granularity;
 
if (unlikely(!max_discard_sectors)) {
diff --git a/block/blk.h b/block/blk.h
index a1841b8ff129..2680aa4fcd88 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -396,6 +396,16 @@ static inline unsigned long blk_rq_deadline(struct request 
*rq)
 }
 
 /*
+ * The max size one bio can handle is UINT_MAX because bvec_iter.bi_size
+ * is defined as 'unsigned int'; meanwhile it has to be aligned to the logical
+ * block size, which is the minimum unit accepted by the hardware.
+ */
+static inline unsigned int bio_allowed_max_sectors(struct request_queue *q)
+{
+   return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9;
+}
+
+/*
  * Internal io_context interface
  */
 void get_io_context(struct io_context *ioc);
-- 
2.9.5



Re: [PATCH V2 1/3] blk-mq: refactor the code of issue request directly

2018-10-28 Thread Ming Lei
On Sat, Oct 27, 2018 at 12:01:09AM +0800, Jianchao Wang wrote:
> Merge blk_mq_try_issue_directly and __blk_mq_try_issue_directly
> into one interface which is able to handle the return value from
> .queue_rq callback. Due to we can only issue directly w/o io
> scheduler, so remove the blk_mq_get_driver_tag.

It isn't correct for dm-rq, see blk_insert_cloned_request().

Thanks,
Ming


Re: [PATCH 1/3] block: make sure discard bio is aligned with logical block size

2018-10-28 Thread Ming Lei
On Sun, Oct 28, 2018 at 04:49:47PM +0100, Christoph Hellwig wrote:
> On Sun, Oct 28, 2018 at 08:51:31AM +0800, Ming Lei wrote:
> > On Fri, Oct 26, 2018 at 09:44:15AM +0200, Christoph Hellwig wrote:
> > > > if (req_sects > UINT_MAX >> 9)
> > > > -   req_sects = UINT_MAX >> 9;
> > > > +   req_sects = (UINT_MAX >> 9) & ~bs_mask;
> > > 
> > > Given that we have this same thing duplicated in write zeroes
> > > what about a documented helper?
> > 
> > IMO, using UINT_MAX & bs_mask is better because it is self-explanatory
> > in the context.
> 
> I don't think it is in any way.  I understand it because I know the
> code, but there is nothing that documents why we do that.

Then how about introducing this helper?

 /*
+ * The max sectors one bio can handle is 'UINT_MAX >> 9' because
+ * bvec_iter.bi_size is defined as 'unsigned int'; also it has to be aligned
+ * to the logical block size, which is the minimum unit accepted by hardware.
+ */
+static inline unsigned int blk_max_allowed_max_sectors(struct request_queue *q)
+{
+   return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9;
+}
+
+/*
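As a quick sanity check of the helper's arithmetic (a user-space sketch,
illustrative only): for a 4096-byte logical block size it yields
round_down(UINT_MAX, 4096) >> 9 = 0xfffff000 >> 9 = 8388600 sectors, i.e. a
whole number of 4096-byte blocks.

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint32_t lbs = 4096;	/* logical block size in bytes */
		/* round_down(UINT_MAX, lbs) */
		uint32_t max_bytes = UINT32_MAX - (UINT32_MAX % lbs);

		printf("%u bytes -> %u sectors\n", max_bytes, max_bytes >> 9);
		return 0;
	}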

Thanks,
Ming


Re: [PATCH 1/3] block: make sure discard bio is aligned with logical block size

2018-10-27 Thread Ming Lei
On Fri, Oct 26, 2018 at 09:44:15AM +0200, Christoph Hellwig wrote:
> > if (req_sects > UINT_MAX >> 9)
> > -   req_sects = UINT_MAX >> 9;
> > +   req_sects = (UINT_MAX >> 9) & ~bs_mask;
> 
> Given that we have this same thing duplicated in write zeroes
> what about a documented helper?

IMO, using UINT_MAX & bs_mask is better because it is self-explanatory
in the context.

If we introduce a helper, it may not be easy to find a name that is
clearer than UINT_MAX itself.

thanks,
Ming


[PATCH 0/3] block: make sure discard/writesame bio is aligned with logical block size

2018-10-26 Thread Ming Lei
Hi,

The 1st & 3rd patch fixes bio size alignment issue.

The 2nd patch cleans up __blkdev_issue_discard() a bit.

Thanks,


Ming Lei (3):
  block: make sure discard bio is aligned with logical block size
  block: cleanup __blkdev_issue_discard()
  block: make sure writesame bio is aligned with logical block size

 block/blk-lib.c | 25 ++---
 1 file changed, 6 insertions(+), 19 deletions(-)

Cc: Rui Salvaterra 
Cc: sta...@vger.kernel.org
Cc: Mike Snitzer 
Cc: Christoph Hellwig 
Cc: Xiao Ni 
Cc: Mariusz Dabrowski 

-- 
2.9.5



[PATCH 3/3] block: make sure writesame bio is aligned with logical block size

2018-10-26 Thread Ming Lei
Obviously the created writesame bio has to be aligned with logical block
size.

Fixes: b49a0871be31a745b2ef ("block: remove split code in 
blkdev_issue_{discard,write_same}")
Cc: Rui Salvaterra 
Cc: sta...@vger.kernel.org
Cc: Mike Snitzer 
Cc: Christoph Hellwig 
Cc: Xiao Ni 
Cc: Mariusz Dabrowski 
Signed-off-by: Ming Lei 
---
 block/blk-lib.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index 93011785f710..1750f0e480c0 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -149,7 +149,7 @@ static int __blkdev_issue_write_same(struct block_device 
*bdev, sector_t sector,
return -EOPNOTSUPP;
 
/* Ensure that max_write_same_sectors doesn't overflow bi_size */
-   max_write_same_sectors = UINT_MAX >> 9;
+   max_write_same_sectors = (UINT_MAX >> 9) & ~bs_mask;
 
while (nr_sects) {
bio = next_bio(bio, 1, gfp_mask);
-- 
2.9.5



[PATCH 2/3] block: cleanup __blkdev_issue_discard()

2018-10-26 Thread Ming Lei
Cleanup __blkdev_issue_discard().

Cc: Rui Salvaterra 
Cc: Mike Snitzer 
Cc: Christoph Hellwig 
Cc: Xiao Ni 
Cc: Mariusz Dabrowski 
Signed-off-by: Ming Lei 
---
 block/blk-lib.c | 23 +--
 1 file changed, 5 insertions(+), 18 deletions(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index aa3944946b2f..93011785f710 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -52,16 +52,11 @@ int __blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
if ((sector | nr_sects) & bs_mask)
return -EINVAL;
 
-   while (nr_sects) {
-   unsigned int req_sects = nr_sects;
-   sector_t end_sect;
-
-   if (!req_sects)
-   goto fail;
-   if (req_sects > UINT_MAX >> 9)
-   req_sects = (UINT_MAX >> 9) & ~bs_mask;
+   if (!nr_sects)
+   return -EINVAL;
 
-   end_sect = sector + req_sects;
+   while (nr_sects) {
+   unsigned int req_sects = min(nr_sects, (UINT_MAX >> 9) & 
~bs_mask);
 
bio = next_bio(bio, 0, gfp_mask);
bio->bi_iter.bi_sector = sector;
@@ -69,8 +64,8 @@ int __blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
bio_set_op_attrs(bio, op, 0);
 
bio->bi_iter.bi_size = req_sects << 9;
+   sector += req_sects;
nr_sects -= req_sects;
-   sector = end_sect;
 
/*
 * We can loop for a long time in here, if someone does
@@ -83,14 +78,6 @@ int __blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
 
*biop = bio;
return 0;
-
-fail:
-   if (bio) {
-   submit_bio_wait(bio);
-   bio_put(bio);
-   }
-   *biop = NULL;
-   return -EOPNOTSUPP;
 }
 EXPORT_SYMBOL(__blkdev_issue_discard);
 
-- 
2.9.5



[PATCH 1/3] block: make sure discard bio is aligned with logical block size

2018-10-26 Thread Ming Lei
Obviously the created discard bio has to be aligned with logical block
size.

Fixes: 744889b7cbb56a6 ("block: don't deal with discard limit in 
blkdev_issue_discard()")
Reported-by: Rui Salvaterra 
Cc: Rui Salvaterra 
Cc: sta...@vger.kernel.org
Cc: Mike Snitzer 
Cc: Christoph Hellwig 
Cc: Xiao Ni 
Cc: Mariusz Dabrowski 
Signed-off-by: Ming Lei 
---
 block/blk-lib.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index bbd44666f2b5..aa3944946b2f 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -59,7 +59,7 @@ int __blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
if (!req_sects)
goto fail;
if (req_sects > UINT_MAX >> 9)
-   req_sects = UINT_MAX >> 9;
+   req_sects = (UINT_MAX >> 9) & ~bs_mask;
 
end_sect = sector + req_sects;
 
-- 
2.9.5



[PATCH] block: call rq_qos_exit() after queue is frozen

2018-10-24 Thread Ming Lei
rq_qos_exit() removes the current q->rq_qos. This has to be
done after the queue is frozen, otherwise the IO queueing path may never
be woken up, and an IO hang is caused.

So fix this issue by moving rq_qos_exit() to after the queue is frozen.

Cc: Josef Bacik 
Signed-off-by: Ming Lei 
---
 block/blk-core.c  | 3 +++
 block/blk-sysfs.c | 2 --
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 3ed60723e242..0a5a6cf5add7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -785,6 +785,9 @@ void blk_cleanup_queue(struct request_queue *q)
 * prevent that q->request_fn() gets invoked after draining finished.
 */
blk_freeze_queue(q);
+
+   rq_qos_exit(q);
+
spin_lock_irq(lock);
queue_flag_set(QUEUE_FLAG_DEAD, q);
spin_unlock_irq(lock);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 3772671cf2bc..d4f1280965b6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -994,8 +994,6 @@ void blk_unregister_queue(struct gendisk *disk)
kobject_del(&q->kobj);
blk_trace_remove_sysfs(disk_to_dev(disk));
 
-   rq_qos_exit(q);
-
mutex_lock(&q->sysfs_lock);
if (q->request_fn || (q->mq_ops && q->elevator))
elv_unregister_queue(q);
-- 
2.9.5



Re: [PATCH] block: setup bounce bio_sets properly

2018-10-19 Thread Ming Lei
On Thu, Oct 18, 2018 at 03:24:47PM -0600, Jens Axboe wrote:
> We're only setting up the bounce bio sets if we happen
> to need bouncing for regular HIGHMEM, not if we only need
> it for ISA devices.
> 
> Reported-by: Ondrej Zary 
> Tested-by: Ondrej Zary 
> Signed-off-by: Jens Axboe 
> 
> diff --git a/block/bounce.c b/block/bounce.c
> index b30071ac4ec6..1356a2f4aae2 100644
> --- a/block/bounce.c
> +++ b/block/bounce.c
> @@ -31,6 +31,24 @@
>  static struct bio_set bounce_bio_set, bounce_bio_split;
>  static mempool_t page_pool, isa_page_pool;
>  
> +static void init_bounce_bioset(void)
> +{
> + static bool bounce_bs_setup;
> + int ret;
> +
> + if (bounce_bs_setup)
> + return;
> +
> + ret = bioset_init(&bounce_bio_set, BIO_POOL_SIZE, 0, BIOSET_NEED_BVECS);
> + BUG_ON(ret);
> + if (bioset_integrity_create(&bounce_bio_set, BIO_POOL_SIZE))
> + BUG_ON(1);
> +
> + ret = bioset_init(&bounce_bio_split, BIO_POOL_SIZE, 0, 0);
> + BUG_ON(ret);
> + bounce_bs_setup = true;

When the initcall is run from do_basic_setup(), all CPUs and the scheduler are
already up, so init_emergency_pool() may run while a driver is calling
blk_queue_bounce_limit(), which starts init_emergency_isa_pool().

So the above code might be run twice.
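One way to close that window (a sketch only; the 'bounce_lock' name is made
up here and this is not the final patch) is to serialize the setup with a
mutex so that concurrent callers cannot initialize the bio_sets twice:

	static DEFINE_MUTEX(bounce_lock);

	static void init_bounce_bioset(void)
	{
		static bool bounce_bs_setup;
		int ret;

		mutex_lock(&bounce_lock);
		if (!bounce_bs_setup) {
			ret = bioset_init(&bounce_bio_set, BIO_POOL_SIZE, 0,
					  BIOSET_NEED_BVECS);
			BUG_ON(ret);
			if (bioset_integrity_create(&bounce_bio_set, BIO_POOL_SIZE))
				BUG_ON(1);

			ret = bioset_init(&bounce_bio_split, BIO_POOL_SIZE, 0, 0);
			BUG_ON(ret);
			bounce_bs_setup = true;
		}
		mutex_unlock(&bounce_lock);
	}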

Thanks,
Ming


Re: [PATCH 4/5] block: introduce helpers for allocating IO buffers from slab

2018-10-18 Thread Ming Lei
On Thu, Oct 18, 2018 at 05:22:19PM +0200, Christoph Hellwig wrote:
> On Thu, Oct 18, 2018 at 08:11:23AM -0700, Matthew Wilcox wrote:
> > On Thu, Oct 18, 2018 at 04:42:07PM +0200, Christoph Hellwig wrote:
> > > This all seems quite complicated.
> > > 
> > > I think the interface we'd want is more one that has a little
> > > cache of a single page in the queue, and a little bitmap which
> > > sub-page size blocks of it are used.
> > > 
> > > Something like (pseudo code minus locking):
> > > 
> > > void *blk_alloc_sector_buffer(struct block_device *bdev, gfp_t gfp)
> > > {
> > >   unsigned block_size = block_size(bdev);
> > > 
> > >   if (blocksize >= PAGE_SIZE)
> > >   return (void *)__get_free_pages(gfp, get_order(blocksize));
> > > 
> > >   if (bdev->fragment_cache_page) {
> > >   [ fragment_cache_page using
> > > e.g. bitmap and return if found]
> > >   }
> > > 
> > >   bdev->fragment_cache_page = (void *)__get_free_page(gfp);
> > >   goto find_again;
> > > }
> > 
> > This looks a lot like page_frag_alloc() except I think page_frag_alloc()
> > may be more efficient.
> 
> Oh, nice. Sounds like XFS should just use page_frag_alloc.  I'll give
> it a spin.

XFS or other filesystems can use page_frag_alloc() directly; it doesn't seem
necessary to introduce this change in the block layer any more, given that a
512-aligned buffer should be fine everywhere.

The only benefit of making it a block helper is that the offset and size
can be checked against q->dma_alignment.

Dave/Jens, which way do you think is better: providing the allocation as a
block helper, or having the filesystem use page_frag_alloc() directly to
allocate a 512*N-byte buffer (total size less than PAGE_SIZE)?
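For reference, a minimal sketch of the filesystem-side option (illustrative
only: 'fs_frag_cache' and 'fs_alloc_io_buf' are made-up names, and callers
of page_frag_alloc() have to provide their own serialization per cache):

	#include <linux/kernel.h>
	#include <linux/mm_types.h>
	#include <linux/gfp.h>
	#include <linux/blkdev.h>

	static struct page_frag_cache fs_frag_cache;	/* zero-initialized */

	static void *fs_alloc_io_buf(struct block_device *bdev, size_t size,
				     gfp_t gfp)
	{
		unsigned int lbs = bdev_logical_block_size(bdev);

		/*
		 * Round the fragment up to the logical block size so that
		 * consecutive allocations from the same page stay 512*N
		 * aligned.
		 */
		return page_frag_alloc(&fs_frag_cache, round_up(size, lbs), gfp);
	}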

Thanks,
Ming


Re: [PATCH 1/5] block: warn on un-aligned DMA IO buffer

2018-10-18 Thread Ming Lei
On Thu, Oct 18, 2018 at 07:52:59PM -0600, Jens Axboe wrote:
> On 10/18/18 7:39 PM, Ming Lei wrote:
> > On Thu, Oct 18, 2018 at 07:33:50PM -0600, Jens Axboe wrote:
> >> On 10/18/18 7:28 PM, Ming Lei wrote:
> >>> On Thu, Oct 18, 2018 at 08:27:28AM -0600, Jens Axboe wrote:
> >>>> On 10/18/18 7:18 AM, Ming Lei wrote:
> >>>>> Now we only check if DMA IO buffer is aligned to queue_dma_alignment()
> >>>>> for pass-through request, and it isn't done for normal IO request.
> >>>>>
> >>>>> Given the check has to be done on each bvec, it isn't efficient to add 
> >>>>> the
> >>>>> check in generic_make_request_checks().
> >>>>>
> >>>>> This patch addes one WARN in blk_queue_split() for capturing this issue.
> >>>>
> >>>> I don't want to do this, because then we are forever doomed to
> >>>> have something that fully loops a bio at submission time. I
> >>>> absolutely hate the splitting we have and the need for it,
> >>>> hopefully it can go away for a subset of IOs at some point.
> >>>>
> >>>> In many ways, this seems to be somewhat of a made-up problem, I don't
> >>>> recall a single bug report for something like this over decades of
> >>>> working with the IO stack. 512b alignment restrictions for DMA seems
> >>>> absolutely insane. I know people claim they exist, but clearly that
> >>>> isn't a hard requirement or we would have been boned years ago.
> >>>
> >>> There are still some drivers with this requirement:
> >>>
> >>> drivers/ata/libata-scsi.c:1308: blk_queue_update_dma_alignment(q, 
> >>> sdev->sector_size - 1);
> >>> drivers/ata/pata_macio.c:812:   
> >>> blk_queue_update_dma_alignment(sdev->request_queue, 31);
> >>> drivers/ata/pata_macio.c:827:   
> >>> blk_queue_update_dma_alignment(sdev->request_queue, 15);
> >>> drivers/block/ps3disk.c:470:blk_queue_dma_alignment(queue, 
> >>> dev->blk_size-1);
> >>> drivers/block/rsxx/dev.c:282:   
> >>> blk_queue_dma_alignment(card->queue, blk_size - 1);
> >>> drivers/block/xen-blkfront.c:957:   blk_queue_dma_alignment(rq, 511);
> >>> drivers/ide/ide-cd.c:1512:  blk_queue_dma_alignment(q, 31);
> >>> drivers/message/fusion/mptscsih.c:2388: blk_queue_dma_alignment 
> >>> (sdev->request_queue, 512 - 1);
> >>> drivers/staging/rts5208/rtsx.c:94:  
> >>> blk_queue_dma_alignment(sdev->request_queue, (512 - 1));
> >>> drivers/usb/image/microtek.c:329:   
> >>> blk_queue_dma_alignment(s->request_queue, (512 - 1));
> >>> drivers/usb/storage/scsiglue.c:92:  
> >>> blk_queue_update_dma_alignment(sdev->request_queue, (512 - 1));
> >>> drivers/usb/storage/uas.c:818:  
> >>> blk_queue_update_dma_alignment(sdev->request_queue, (512 - 1));
> >>
> >> Of course, I too can grep :-)
> >>
> >> My point is that these settings might not match reality. And the
> >> WARN_ON(), as implemented, is going to trigger on any device that
> >> DOESN'T set the alignment, as Bart pointed out.
> > 
> > It is just a WARN_ON_ONCE(), which points out exactly what needs
> > further attention; then the related people can take a look and we can
> > move on.
> > 
> > So I think it is correct thing to do.
> 
> It most certainly is NOT the right thing to do, when we know that:
> 
> 1) We currently have drivers setting an alignment that we don't meet
> 2) We have drivers not setting an alignment, and getting 512 by default

The 512 default should have been removed given that it isn't respected at
all in the normal IO path, but it has been there since the beginning of 2.6.12.

> 3) We have drivers setting an alignment that seems incorrect

Then the WARN_ON() is helpful for both 1) and 2) once the default 512
limit is removed.

Thanks,
Ming


Re: [PATCH 1/5] block: warn on un-aligned DMA IO buffer

2018-10-18 Thread Ming Lei
On Thu, Oct 18, 2018 at 07:33:50PM -0600, Jens Axboe wrote:
> On 10/18/18 7:28 PM, Ming Lei wrote:
> > On Thu, Oct 18, 2018 at 08:27:28AM -0600, Jens Axboe wrote:
> >> On 10/18/18 7:18 AM, Ming Lei wrote:
> >>> Now we only check if DMA IO buffer is aligned to queue_dma_alignment()
> >>> for pass-through request, and it isn't done for normal IO request.
> >>>
> >>> Given the check has to be done on each bvec, it isn't efficient to add the
> >>> check in generic_make_request_checks().
> >>>
> >>> This patch addes one WARN in blk_queue_split() for capturing this issue.
> >>
> >> I don't want to do this, because then we are forever doomed to
> >> have something that fully loops a bio at submission time. I
> >> absolutely hate the splitting we have and the need for it,
> >> hopefully it can go away for a subset of IOs at some point.
> >>
> >> In many ways, this seems to be somewhat of a made-up problem, I don't
> >> recall a single bug report for something like this over decades of
> >> working with the IO stack. 512b alignment restrictions for DMA seems
> >> absolutely insane. I know people claim they exist, but clearly that
> >> isn't a hard requirement or we would have been boned years ago.
> > 
> > There are still some drivers with this requirement:
> > 
> > drivers/ata/libata-scsi.c:1308: blk_queue_update_dma_alignment(q, 
> > sdev->sector_size - 1);
> > drivers/ata/pata_macio.c:812:   
> > blk_queue_update_dma_alignment(sdev->request_queue, 31);
> > drivers/ata/pata_macio.c:827:   
> > blk_queue_update_dma_alignment(sdev->request_queue, 15);
> > drivers/block/ps3disk.c:470:blk_queue_dma_alignment(queue, 
> > dev->blk_size-1);
> > drivers/block/rsxx/dev.c:282:   
> > blk_queue_dma_alignment(card->queue, blk_size - 1);
> > drivers/block/xen-blkfront.c:957:   blk_queue_dma_alignment(rq, 511);
> > drivers/ide/ide-cd.c:1512:  blk_queue_dma_alignment(q, 31);
> > drivers/message/fusion/mptscsih.c:2388: blk_queue_dma_alignment 
> > (sdev->request_queue, 512 - 1);
> > drivers/staging/rts5208/rtsx.c:94:  
> > blk_queue_dma_alignment(sdev->request_queue, (512 - 1));
> > drivers/usb/image/microtek.c:329:   
> > blk_queue_dma_alignment(s->request_queue, (512 - 1));
> > drivers/usb/storage/scsiglue.c:92:  
> > blk_queue_update_dma_alignment(sdev->request_queue, (512 - 1));
> > drivers/usb/storage/uas.c:818:  
> > blk_queue_update_dma_alignment(sdev->request_queue, (512 - 1));
> 
> Of course, I too can grep :-)
> 
> My point is that these settings might not match reality. And the
> WARN_ON(), as implemented, is going to trigger on any device that
> DOESN'T set the alignment, as Bart pointed out.

It is just a WARN_ON_ONCE(), which points out exactly what needs
further attention; then the related people can take a look and we can
move on.

So I think it is correct thing to do.

Thanks,
Ming


Re: [PATCH 1/5] block: warn on un-aligned DMA IO buffer

2018-10-18 Thread Ming Lei
On Thu, Oct 18, 2018 at 08:27:28AM -0600, Jens Axboe wrote:
> On 10/18/18 7:18 AM, Ming Lei wrote:
> > Now we only check if DMA IO buffer is aligned to queue_dma_alignment()
> > for pass-through request, and it isn't done for normal IO request.
> > 
> > Given the check has to be done on each bvec, it isn't efficient to add the
> > check in generic_make_request_checks().
> > 
> > This patch addes one WARN in blk_queue_split() for capturing this issue.
> 
> I don't want to do this, because then we are forever doomed to
> have something that fully loops a bio at submission time. I
> absolutely hate the splitting we have and the need for it,
> hopefully it can go away for a subset of IOs at some point.
> 
> In many ways, this seems to be somewhat of a made-up problem, I don't
> recall a single bug report for something like this over decades of
> working with the IO stack. 512b alignment restrictions for DMA seems
> absolutely insane. I know people claim they exist, but clearly that
> isn't a hard requirement or we would have been boned years ago.

There are still some drivers with this requirement:

drivers/ata/libata-scsi.c:1308: blk_queue_update_dma_alignment(q, 
sdev->sector_size - 1);
drivers/ata/pata_macio.c:812:   
blk_queue_update_dma_alignment(sdev->request_queue, 31);
drivers/ata/pata_macio.c:827:   
blk_queue_update_dma_alignment(sdev->request_queue, 15);
drivers/block/ps3disk.c:470:blk_queue_dma_alignment(queue, dev->blk_size-1);
drivers/block/rsxx/dev.c:282:   blk_queue_dma_alignment(card->queue, 
blk_size - 1);
drivers/block/xen-blkfront.c:957:   blk_queue_dma_alignment(rq, 511);
drivers/ide/ide-cd.c:1512:  blk_queue_dma_alignment(q, 31);
drivers/message/fusion/mptscsih.c:2388: blk_queue_dma_alignment 
(sdev->request_queue, 512 - 1);
drivers/staging/rts5208/rtsx.c:94:  
blk_queue_dma_alignment(sdev->request_queue, (512 - 1));
drivers/usb/image/microtek.c:329:   
blk_queue_dma_alignment(s->request_queue, (512 - 1));
drivers/usb/storage/scsiglue.c:92:  
blk_queue_update_dma_alignment(sdev->request_queue, (512 - 1));
drivers/usb/storage/uas.c:818:  
blk_queue_update_dma_alignment(sdev->request_queue, (512 - 1));


Thanks,
Ming


[PATCH 3/5] block: make dma_alignment as stacked limit

2018-10-18 Thread Ming Lei
This patch converts .dma_alignment into a stacked limit, so the stacking
driver gets updated with the underlying DMA alignment and can allocate
IO buffers aligned to the queue's DMA alignment.

Cc: Vitaly Kuznetsov 
Cc: Dave Chinner 
Cc: Linux FS Devel 
Cc: Darrick J. Wong 
Cc: x...@vger.kernel.org
Cc: Dave Chinner 
Cc: Christoph Hellwig 
Cc: Bart Van Assche 
Cc: Matthew Wilcox 
Signed-off-by: Ming Lei 
---
 block/blk-settings.c | 89 +---
 1 file changed, 50 insertions(+), 39 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index cf9cd241dc16..aef4510a99b6 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -525,6 +525,54 @@ void blk_queue_stack_limits(struct request_queue *t, 
struct request_queue *b)
 EXPORT_SYMBOL(blk_queue_stack_limits);
 
 /**
+ * blk_queue_dma_alignment - set dma length and memory alignment
+ * @q: the request queue for the device
+ * @mask:  alignment mask
+ *
+ * description:
+ *set required memory and length alignment for direct dma transactions.
+ *this is used when building direct io requests for the queue.
+ *
+ **/
+void blk_queue_dma_alignment(struct request_queue *q, int mask)
+{
+   q->limits.dma_alignment = mask;
+}
+EXPORT_SYMBOL(blk_queue_dma_alignment);
+
+static int __blk_queue_update_dma_alignment(struct queue_limits *t, int mask)
+{
+   BUG_ON(mask >= PAGE_SIZE);
+
+   if (mask > t->dma_alignment)
+   return mask;
+   else
+   return t->dma_alignment;
+}
+
+/**
+ * blk_queue_update_dma_alignment - update dma length and memory alignment
+ * @q: the request queue for the device
+ * @mask:  alignment mask
+ *
+ * description:
+ *update required memory and length alignment for direct dma transactions.
+ *If the requested alignment is larger than the current alignment, then
+ *the current queue alignment is updated to the new value, otherwise it
+ *is left alone.  The design of this is to allow multiple objects
+ *(driver, device, transport etc) to set their respective
+ *alignments without having them interfere.
+ *
+ **/
+void blk_queue_update_dma_alignment(struct request_queue *q, int mask)
+{
+   q->limits.dma_alignment =
+   __blk_queue_update_dma_alignment(&q->limits, mask);
+}
+EXPORT_SYMBOL(blk_queue_update_dma_alignment);
+
+
+/**
  * blk_stack_limits - adjust queue_limits for stacked devices
  * @t: the stacking driver limits (top device)
  * @b:  the underlying queue limits (bottom, component device)
@@ -563,6 +611,8 @@ int blk_stack_limits(struct queue_limits *t, struct 
queue_limits *b,
b->seg_boundary_mask);
t->virt_boundary_mask = min_not_zero(t->virt_boundary_mask,
b->virt_boundary_mask);
+   t->dma_alignment = __blk_queue_update_dma_alignment(t,
+   b->dma_alignment);
 
t->max_segments = min_not_zero(t->max_segments, b->max_segments);
t->max_discard_segments = min_not_zero(t->max_discard_segments,
@@ -818,45 +868,6 @@ void blk_queue_virt_boundary(struct request_queue *q, 
unsigned long mask)
 }
 EXPORT_SYMBOL(blk_queue_virt_boundary);
 
-/**
- * blk_queue_dma_alignment - set dma length and memory alignment
- * @q: the request queue for the device
- * @mask:  alignment mask
- *
- * description:
- *set required memory and length alignment for direct dma transactions.
- *this is used when building direct io requests for the queue.
- *
- **/
-void blk_queue_dma_alignment(struct request_queue *q, int mask)
-{
-   q->limits.dma_alignment = mask;
-}
-EXPORT_SYMBOL(blk_queue_dma_alignment);
-
-/**
- * blk_queue_update_dma_alignment - update dma length and memory alignment
- * @q: the request queue for the device
- * @mask:  alignment mask
- *
- * description:
- *update required memory and length alignment for direct dma transactions.
- *If the requested alignment is larger than the current alignment, then
- *the current queue alignment is updated to the new value, otherwise it
- *is left alone.  The design of this is to allow multiple objects
- *(driver, device, transport etc) to set their respective
- *alignments without having them interfere.
- *
- **/
-void blk_queue_update_dma_alignment(struct request_queue *q, int mask)
-{
-   BUG_ON(mask > PAGE_SIZE);
-
-   if (mask > q->limits.dma_alignment)
-   q->limits.dma_alignment = mask;
-}
-EXPORT_SYMBOL(blk_queue_update_dma_alignment);
-
 void blk_queue_flush_queueable(struct request_queue *q, bool queueable)
 {
if (queueable)
-- 
2.9.5



[PATCH 2/5] block: move .dma_alignment into q->limits

2018-10-18 Thread Ming Lei
It turns out q->dma_alignment should be a stacked limit: now that the bvec table
is immutable, the underlying queue's DMA alignment has to be visible to the
stacking driver, so IO buffers can be allocated DMA-aligned before being
added to a bio.

So this patch moves .dma_alignment into q->limits and prepares for
making it a stacked limit.

Cc: Vitaly Kuznetsov 
Cc: Dave Chinner 
Cc: Linux FS Devel 
Cc: Darrick J. Wong 
Cc: x...@vger.kernel.org
Cc: Dave Chinner 
Cc: Christoph Hellwig 
Cc: Bart Van Assche 
Cc: Matthew Wilcox 
Signed-off-by: Ming Lei 
---
 block/blk-settings.c   | 6 +++---
 include/linux/blkdev.h | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/blk-settings.c b/block/blk-settings.c
index ffd459969689..cf9cd241dc16 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -830,7 +830,7 @@ EXPORT_SYMBOL(blk_queue_virt_boundary);
  **/
 void blk_queue_dma_alignment(struct request_queue *q, int mask)
 {
-   q->dma_alignment = mask;
+   q->limits.dma_alignment = mask;
 }
 EXPORT_SYMBOL(blk_queue_dma_alignment);
 
@@ -852,8 +852,8 @@ void blk_queue_update_dma_alignment(struct request_queue 
*q, int mask)
 {
BUG_ON(mask > PAGE_SIZE);
 
-   if (mask > q->dma_alignment)
-   q->dma_alignment = mask;
+   if (mask > q->limits.dma_alignment)
+   q->limits.dma_alignment = mask;
 }
 EXPORT_SYMBOL(blk_queue_update_dma_alignment);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 61207560e826..be938a31bc2e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -366,6 +366,7 @@ struct queue_limits {
unsigned long   seg_boundary_mask;
unsigned long   virt_boundary_mask;
 
+   unsigned intdma_alignment;
unsigned intmax_hw_sectors;
unsigned intmax_dev_sectors;
unsigned intchunk_sectors;
@@ -561,7 +562,6 @@ struct request_queue {
unsigned intdma_drain_size;
void*dma_drain_buffer;
unsigned intdma_pad_mask;
-   unsigned intdma_alignment;
 
struct blk_queue_tag*queue_tags;
 
@@ -1617,7 +1617,7 @@ static inline unsigned int bdev_zone_sectors(struct 
block_device *bdev)
 
 static inline int queue_dma_alignment(struct request_queue *q)
 {
-   return q ? q->dma_alignment : 511;
+   return q ? q->limits.dma_alignment : 511;
 }
 
 static inline int blk_rq_aligned(struct request_queue *q, unsigned long addr,
-- 
2.9.5



[PATCH 0/5] block: introduce helpers for allocating io buffer from slab

2018-10-18 Thread Ming Lei
Hi,

Filesystems may allocate IO buffers from slab and use them to submit bios.
This may break storage drivers that have special requirements on DMA
alignment.

Patch 1 adds a warning if the IO buffer isn't aligned to the queue's DMA
alignment.

The 2nd & 3rd patches make DMA alignment a stacked limit.

The 4th patch introduces helpers for allocating IO buffers from slab, and
DMA alignment is respected in this allocation.

The 5th patch converts xfs to use the introduced helpers for allocating IO
buffers from slab.
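For patch 1, the per-bvec check being added is roughly of the following
shape (a sketch only; the function name here is made up and the actual
patch may differ in placement and details):

	#include <linux/blkdev.h>
	#include <linux/bio.h>

	static void blk_warn_unaligned_bio(struct request_queue *q,
					   struct bio *bio)
	{
		unsigned int align = queue_dma_alignment(q);
		struct bio_vec bv;
		struct bvec_iter iter;

		/* warn once if any segment violates the queue's DMA alignment */
		bio_for_each_segment(bv, bio, iter)
			WARN_ON_ONCE((bv.bv_offset | bv.bv_len) & align);
	}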

See previous discussion on this topic:

https://marc.info/?t=15373485754=1=2

Thanks,
Ming

Ming Lei (5):
  block: warn on un-aligned DMA IO buffer
  block: move .dma_alignment into q->limits
  block: make dma_alignment as stacked limit
  block: introduce helpers for allocating IO buffers from slab
  xfs: use block layer helpers to allocate io buffer from slab

 block/Makefile  |   3 +-
 block/blk-core.c|   2 +
 block/blk-merge.c   |   2 +
 block/blk-sec-buf.c | 144 
 block/blk-settings.c|  89 +++
 fs/xfs/xfs_buf.c|  28 -
 fs/xfs/xfs_super.c  |  13 +++-
 include/linux/blk-sec-buf.h |  43 +
 include/linux/blkdev.h  |   9 ++-
 9 files changed, 287 insertions(+), 46 deletions(-)
 create mode 100644 block/blk-sec-buf.c
 create mode 100644 include/linux/blk-sec-buf.h

Cc: Vitaly Kuznetsov 
Cc: Dave Chinner 
Cc: Linux FS Devel 
Cc: Darrick J. Wong 
Cc: x...@vger.kernel.org
Cc: Dave Chinner 
Cc: Christoph Hellwig 
Cc: Bart Van Assche 
Cc: Matthew Wilcox 


-- 
2.9.5


