Re: [PATCH V2] blk-mq: re-build queue map in case of kdump kernel

2018-12-06 Thread Jens Axboe
On 12/6/18 8:03 PM, Ming Lei wrote:
> Now almost all .map_queues() implementations based on managed irq
> affinity don't update the queue mapping; they just retrieve the
> old built mapping. So if nr_hw_queues is changed, the mapping table
> contains stale entries, and only blk_mq_map_queues() can rebuild
> the mapping table.
> 
> One such case is that we limit .nr_hw_queues to 1 for a kdump kernel.
> However, drivers often build the queue mapping via
> pci_alloc_irq_vectors_affinity() before allocating the tagset, while
> set->nr_hw_queues can be set to 1 for a kdump kernel, so a wrong
> queue mapping is used, and a kernel panic[1] is observed during
> booting.
> 
> This patch fixes the kernel panic triggered on nvme by rebuilding the
> mapping table via blk_mq_map_queues().
> 
> [1] kernel panic log
> [4.438371] nvme nvme0: 16/0/0 default/read/poll queues
> [4.443277] BUG: unable to handle kernel NULL pointer dereference at 0098
> [4.444681] PGD 0 P4D 0
> [4.445367] Oops:  [#1] SMP NOPTI
> [4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459
> [4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
> [4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core]
> [4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222
> [4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6
> [4.453132] RSP: 0018:c900023b3cd8 EFLAGS: 00010286
> [4.454061] RAX:  RBX: 888174448000 RCX: 0001
> [4.456480] RDX: 0001 RSI: e8feffc506c0 RDI: 0001
> [4.458750] RBP: 88810722d008 R08: 88817647a880 R09: 0002
> [4.464580] R10: c900023b3c10 R11: 0004 R12: 888174448538
> [4.467803] R13: 0004 R14: 0001 R15: 0001
> [4.469220] FS:  () GS:88817bac() knlGS:
> [4.471554] CS:  0010 DS:  ES:  CR0: 80050033
> [4.472464] CR2: 0098 CR3: 000174e4e001 CR4: 00760ee0
> [4.474264] DR0:  DR1:  DR2: 
> [4.476007] DR3:  DR6: fffe0ff0 DR7: 0400
> [4.477061] PKRU: 5554
> [4.477464] Call Trace:
> [4.478731]  blk_mq_init_allocated_queue+0x36a/0x3ad
> [4.479595]  blk_mq_init_queue+0x32/0x4e
> [4.480178]  nvme_validate_ns+0x98/0x623 [nvme_core]
> [4.480963]  ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core]
> [4.481685]  ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core]
> [4.482601]  nvme_scan_work+0x23a/0x29b [nvme_core]
> [4.483269]  ? _raw_spin_unlock_irqrestore+0x25/0x38
> [4.483930]  ? try_to_wake_up+0x38d/0x3b3
> [4.484478]  ? process_one_work+0x179/0x2fc
> [4.485118]  process_one_work+0x1d3/0x2fc
> [4.485655]  ? rescuer_thread+0x2ae/0x2ae
> [4.486196]  worker_thread+0x1e9/0x2be
> [4.486841]  kthread+0x115/0x11d
> [4.487294]  ? kthread_park+0x76/0x76
> [4.487784]  ret_from_fork+0x3a/0x50
> [4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi ip_tables
> [4.489428] Dumping ftrace buffer:
> [4.489939](ftrace buffer empty)
> [4.490492] CR2: 0098
> [4.491052] ---[ end trace 03cd268ad5a86ff7 ]---

Works for me, tested various configs and stubbed out the kdump check.
Thanks for fixing this, applied to 4.21.

-- 
Jens Axboe



[PATCH v3] blk-mq: punt failed direct issue to dispatch list

2018-12-06 Thread Jens Axboe
After the direct dispatch corruption fix, we permanently disallow direct
dispatch of non read/write requests. This works fine off the normal IO
path, as they will be retried like any other failed direct dispatch
request. But for the blk_insert_cloned_request() that only DM uses to
bypass the bottom level scheduler, we always first attempt direct
dispatch. For some types of requests, that's now a permanent failure,
and no amount of retrying will make that succeed. This results in a
livelock.

Instead of making special cases for what we can direct issue, and now
having to deal with DM solving the livelock while still retaining a BUSY
condition feedback loop, always just add a request that has been through
->queue_rq() to the hardware queue dispatch list. These are safe to use
as no merging can take place there. Additionally, if requests do have
prepped data from drivers, we aren't dependent on them not sharing space
in the request structure to safely add them to the IO scheduler lists.

This basically reverts ffe81d45322c and is based on a patch from Ming,
but with the list insert case covered as well.

Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
Cc: stable@vger.kernel.org
Suggested-by: Ming Lei 
Reported-by: Bart Van Assche 
Signed-off-by: Jens Axboe 

---

I've thrown the initial hang test reported by Bart at it, works fine.
My reproducer for the corruption case is also happy, as expected.

I'm running blktests and xfstests on it overnight. If that passes as
expected, it will quell my initial worries about using ->dispatch as a
holding place for these types of requests.
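
For reference, the bypass insert used below amounts to queueing the
request on the hctx dispatch list under the hctx lock, where no bio
merging is ever attempted. A rough sketch of that path (simplified,
not the verbatim kernel source):

    spin_lock(&hctx->lock);
    list_add_tail(&rq->queuelist, &hctx->dispatch);
    spin_unlock(&hctx->lock);
    if (run_queue)
        blk_mq_run_hw_queue(hctx, false);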


diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3262d83b9e07..6a7566244de3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1715,15 +1715,6 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
break;
case BLK_STS_RESOURCE:
case BLK_STS_DEV_RESOURCE:
-   /*
-* If direct dispatch fails, we cannot allow any merging on
-* this IO. Drivers (like SCSI) may have set up permanent state
-* for this request, like SG tables and mappings, and if we
-* merge to it later on then we'll still only do IO to the
-* original part.
-*/
-   rq->cmd_flags |= REQ_NOMERGE;
-
blk_mq_update_dispatch_busy(hctx, true);
__blk_mq_requeue_request(rq);
break;
@@ -1736,18 +1727,6 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
return ret;
 }
 
-/*
- * Don't allow direct dispatch of anything but regular reads/writes,
- * as some of the other commands can potentially share request space
- * with data we need for the IO scheduler. If we attempt a direct dispatch
- * on those and fail, we can't safely add it to the scheduler afterwards
- * without potentially overwriting data that the driver has already written.
- */
-static bool blk_rq_can_direct_dispatch(struct request *rq)
-{
-   return req_op(rq) == REQ_OP_READ || req_op(rq) == REQ_OP_WRITE;
-}
-
 static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
struct request *rq,
blk_qc_t *cookie,
@@ -1769,7 +1748,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
goto insert;
}
 
-   if (!blk_rq_can_direct_dispatch(rq) || (q->elevator && !bypass_insert))
+   if (q->elevator && !bypass_insert)
goto insert;
 
if (!blk_mq_get_dispatch_budget(hctx))
@@ -1785,7 +1764,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
if (bypass_insert)
return BLK_STS_RESOURCE;
 
-   blk_mq_sched_insert_request(rq, false, run_queue, false);
+   blk_mq_request_bypass_insert(rq, run_queue);
return BLK_STS_OK;
 }
 
@@ -1801,7 +1780,7 @@ static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 
ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
-   blk_mq_sched_insert_request(rq, false, true, false);
+   blk_mq_request_bypass_insert(rq, true);
else if (ret != BLK_STS_OK)
blk_mq_end_request(rq, ret);
 
@@ -1831,15 +1810,13 @@ void blk_mq_try_issue_list_directly(struct blk_mq_hw_ctx *hctx,
struct request *rq = list_first_entry(list, struct request,
queuelist);
 
-   if (!blk_rq_can_direct_dispatch(rq))
-   break;
-
list_del_init(&rq->queuelist);
ret = blk_mq_request_issue_directly(rq);
if (ret != BLK_STS_OK) {
if (ret == BLK_STS_RESOURCE ||
ret == BLK_STS_DEV_RESOURCE) {
-

Re: [PATCH v2] block/dm: fix handling of busy off direct dispatch path

2018-12-06 Thread Jens Axboe
On 12/6/18 9:18 PM, Mike Snitzer wrote:
> On Thu, Dec 06 2018 at 11:06pm -0500,
> Jens Axboe  wrote:
> 
>> On 12/6/18 8:54 PM, Mike Snitzer wrote:
>>> On Thu, Dec 06 2018 at  9:49pm -0500,
>>> Jens Axboe  wrote:
>>>
 After the direct dispatch corruption fix, we permanently disallow direct
 dispatch of non read/write requests. This works fine off the normal IO
 path, as they will be retried like any other failed direct dispatch
 request. But for the blk_insert_cloned_request() that only DM uses to
 bypass the bottom level scheduler, we always first attempt direct
 dispatch. For some types of requests, that's now a permanent failure,
 and no amount of retrying will make that succeed.

 Use the driver private RQF_DONTPREP to track this condition in DM. If
 we encounter a BUSY condition from blk_insert_cloned_request(), then
flag the request with RQF_DONTPREP. The next time we see this request,
 ask blk_insert_cloned_request() to bypass insert the request directly.
 This avoids the livelock of repeatedly trying to direct dispatch a
 request, while still retaining the BUSY feedback loop for blk-mq so
 that we don't over-dispatch to the lower level queue and mess up
 opportunities for merging on the DM queue.

 Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
 Reported-by: Bart Van Assche 
Cc: stable@vger.kernel.org
 Signed-off-by: Jens Axboe 

 ---

 This passes my testing as well, like the previous patch. But unlike the
 previous patch, we retain the BUSY feedback loop information for better
 merging.
>>>
>>> But it is kind of gross to work around the new behaviour to "permanently
>>> disallow direct dispatch of non read/write requests" by always failing
>>> such requests back to DM for later immediate direct dispatch.  That
>>> bouncing of the request was acceptable when there was load-based
>>> justification for having to retry (and in doing so: taking the cost of
>>> freeing the clone request gotten via get_request() from the underlying
>>> request_queues).
>>>
>>> Having to retry like this purely because the request isn't a read or
>>> write seems costly: every non-read-write will have implied
>>> request_queue bouncing.  In multipath's case: it could select an
>>> entirely different underlying path the next time it is destaged (with
>>> RQF_DONTPREP set).  Which you'd think would negate all hope of IO
>>> merging based performance improvements -- but that is a tangent I'll
>>> need to ask Ming about (again).
>>>
>>> I really don't like this business of bouncing requests as a workaround
>>> for the recent implementation of the corruption fix.
>>>
>>> Why not just add an override flag to _really_ allow direct dispatch for
>>> _all_ types of requests?
>>>
>>> (just peeked at linux-block and it is looking like you took 
>>> jianchao.wang's series to avoid this hack... ;)
>>>
>>> Awesome.. my work is done for tonight!
>>
>> The whole point is doing something that is palatable to 4.20 and leaving
>> the more experimental stuff to 4.21, where we have some weeks to verify
>> that there are no conditions that cause IO stalls. I don't envision there
>> will be, but I'm not willing to risk it this late in the 4.20 cycle.
>>
>> That said, this isn't quick and dirty, and I don't think it's fair to
>> call it a hack. Using RQF_DONTPREP is quite common in drivers to
>> retain state over multiple ->queue_rq invocations. Using it to avoid
>> multiple direct dispatch failures (and obviously this new livelock)
>> seems fine to me.
> 
> But it bounces IO purely because it is non-read-write.  That results in
> guaranteed multiple blk_get_request() -- from underlying request_queues
> request-based DM is stacked on -- for every non-read-write IO that is
> cloned.  That seems pathological.  I must still be missing something.
> 
>> I really don't want to go around and audit every driver for potential
>> retained state over special commands, that's why the read+write thing is
>> in place. It's the safe option, which is what we need right now.
> 
> Maybe leave the blk_mq_request_issue_directly() interface as it is,
> non-read-write restriction and all, but export a new
> __blk_mq_request_issue_directly() that _only_
> blk_insert_cloned_request() -- and future comparable users -- makes use
> of?
> 
> To me that is the best of both worlds: Fix corruption issue but don't
> impose needless blk_get_request() dances for non-read-write IO issued to
> dm-multipath.

Alright, I hear your point. Maybe we are going to be better off with
just dispatch insert. Here's a version of that on top of -git, basically
reverting the previous one, and applying the two hunks from Ming, but
also adding a missed one in blk_mq_try_issue_list_directly() where we
don't want to be adding the current request back to the list. That should
go to dispatch, too.

Note note note - not tested yet. Will do so now.


diff --git a/block/blk-mq.c 

Re: [PATCH v2] block/dm: fix handling of busy off direct dispatch path

2018-12-06 Thread Mike Snitzer
On Thu, Dec 06 2018 at 11:06pm -0500,
Jens Axboe  wrote:

> On 12/6/18 8:54 PM, Mike Snitzer wrote:
> > On Thu, Dec 06 2018 at  9:49pm -0500,
> > Jens Axboe  wrote:
> > 
> >> After the direct dispatch corruption fix, we permanently disallow direct
> >> dispatch of non read/write requests. This works fine off the normal IO
> >> path, as they will be retried like any other failed direct dispatch
> >> request. But for the blk_insert_cloned_request() that only DM uses to
> >> bypass the bottom level scheduler, we always first attempt direct
> >> dispatch. For some types of requests, that's now a permanent failure,
> >> and no amount of retrying will make that succeed.
> >>
> >> Use the driver private RQF_DONTPREP to track this condition in DM. If
> >> we encounter a BUSY condition from blk_insert_cloned_request(), then
> >> flag the request with RQF_DONTPREP. The next time we see this request,
> >> ask blk_insert_cloned_request() to bypass insert the request directly.
> >> This avoids the livelock of repeatedly trying to direct dispatch a
> >> request, while still retaining the BUSY feedback loop for blk-mq so
> >> that we don't over-dispatch to the lower level queue and mess up
> >> opportunities for merging on the DM queue.
> >>
> >> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
> >> Reported-by: Bart Van Assche 
> >> Cc: stable@vger.kernel.org
> >> Signed-off-by: Jens Axboe 
> >>
> >> ---
> >>
> >> This passes my testing as well, like the previous patch. But unlike the
> >> previous patch, we retain the BUSY feedback loop information for better
> >> merging.
> > 
> > But it is kind of gross to work around the new behaviour to "permanently
> > disallow direct dispatch of non read/write requests" by always failing
> > such requests back to DM for later immediate direct dispatch.  That
> > bouncing of the request was acceptable when there was load-based
> > justification for having to retry (and in doing so: taking the cost of
> > freeing the clone request gotten via get_request() from the underlying
> > request_queues).
> > 
> > Having to retry like this purely because the request isn't a read or
> > write seems costly: every non-read-write will have implied
> > request_queue bouncing.  In multipath's case: it could select an
> > entirely different underlying path the next time it is destaged (with
> > RQF_DONTPREP set).  Which you'd think would negate all hope of IO
> > merging based performance improvements -- but that is a tangent I'll
> > need to ask Ming about (again).
> > 
> > I really don't like this business of bouncing requests as a workaround
> > for the recent implementation of the corruption fix.
> > 
> > Why not just add an override flag to _really_ allow direct dispatch for
> > _all_ types of requests?
> > 
> > (just peeked at linux-block and it is looking like you took 
> > jianchao.wang's series to avoid this hack... ;)
> > 
> > Awesome.. my work is done for tonight!
> 
> The whole point is doing something that is palatable to 4.20 and leaving
> the more experimental stuff to 4.21, where we have some weeks to verify
> that there are no conditions that cause IO stalls. I don't envision there
> will be, but I'm not willing to risk it this late in the 4.20 cycle.
> 
> That said, this isn't quick and dirty, and I don't think it's fair to
> call it a hack. Using RQF_DONTPREP is quite common in drivers to
> retain state over multiple ->queue_rq invocations. Using it to avoid
> multiple direct dispatch failures (and obviously this new livelock)
> seems fine to me.

But it bounces IO purely because it is non-read-write.  That results in
guaranteed multiple blk_get_request() -- from underlying request_queues
request-based DM is stacked on -- for every non-read-write IO that is
cloned.  That seems pathological.  I must still be missing something.

> I really don't want to go around and audit every driver for potential
> retained state over special commands, that's why the read+write thing is
> in place. It's the safe option, which is what we need right now.

Maybe leave the blk_mq_request_issue_directly() interface as it is,
non-read-write restriction and all, but export a new
__blk_mq_request_issue_directly() that _only_
blk_insert_cloned_request() -- and future comparable users -- makes use
of?

To me that is the best of both worlds: Fix corruption issue but don't
impose needless blk_get_request() dances for non-read-write IO issued to
dm-multipath.
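
As a sketch of that split (hypothetical, assuming the unrestricted
internals get factored out into the suggested helper; only the thin
wrapper is shown):

    /* normal submission paths keep the conservative filter */
    blk_status_t blk_mq_request_issue_directly(struct request *rq)
    {
        if (!blk_rq_can_direct_dispatch(rq))
            return BLK_STS_RESOURCE;
        return __blk_mq_request_issue_directly(rq);
    }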

Mike


Re: [PATCH v2] block/dm: fix handling of busy off direct dispatch path

2018-12-06 Thread Jens Axboe
On 12/6/18 8:54 PM, Mike Snitzer wrote:
> On Thu, Dec 06 2018 at  9:49pm -0500,
> Jens Axboe  wrote:
> 
>> After the direct dispatch corruption fix, we permanently disallow direct
>> dispatch of non read/write requests. This works fine off the normal IO
>> path, as they will be retried like any other failed direct dispatch
>> request. But for the blk_insert_cloned_request() that only DM uses to
>> bypass the bottom level scheduler, we always first attempt direct
>> dispatch. For some types of requests, that's now a permanent failure,
>> and no amount of retrying will make that succeed.
>>
>> Use the driver private RQF_DONTPREP to track this condition in DM. If
>> we encounter a BUSY condition from blk_insert_cloned_request(), then
>> flag the request with RQF_DONTPREP. The next time we see this request,
>> ask blk_insert_cloned_request() to bypass insert the request directly.
>> This avoids the livelock of repeatedly trying to direct dispatch a
>> request, while still retaining the BUSY feedback loop for blk-mq so
>> that we don't over-dispatch to the lower level queue and mess up
>> opportunities for merging on the DM queue.
>>
>> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
>> Reported-by: Bart Van Assche 
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Jens Axboe 
>>
>> ---
>>
>> This passes my testing as well, like the previous patch. But unlike the
>> previous patch, we retain the BUSY feedback loop information for better
>> merging.
> 
> But it is kind of gross to work around the new behaviour to "permanently
> disallow direct dispatch of non read/write requests" by always failing
> such requests back to DM for later immediate direct dispatch.  That
> bouncing of the request was acceptable when there was load-based
> justification for having to retry (and in doing so: taking the cost of
> freeing the clone request gotten via get_request() from the underlying
> request_queues).
> 
> Having to retry like this purely because the request isn't a read or
> write seems costly: every non-read-write will have implied
> request_queue bouncing.  In multipath's case: it could select an
> entirely different underlying path the next time it is destaged (with
> RQF_DONTPREP set).  Which you'd think would negate all hope of IO
> merging based performance improvements -- but that is a tangent I'll
> need to ask Ming about (again).
> 
> I really don't like this business of bouncing requests as a workaround
> for the recent implementation of the corruption fix.
> 
> Why not just add an override flag to _really_ allow direct dispatch for
> _all_ types of requests?
> 
> (just peeked at linux-block and it is looking like you took 
> jianchao.wang's series to avoid this hack... ;)
> 
> Awesome.. my work is done for tonight!

The whole point is doing something that is palatable to 4.20 and leaving
the more experimental stuff to 4.21, where we have some weeks to verify
that there are no conditions that cause IO stalls. I don't envision there
will be, but I'm not willing to risk it this late in the 4.20 cycle.

That said, this isn't quick and dirty, and I don't think it's fair to
call it a hack. Using RQF_DONTPREP is quite common in drivers to
retain state over multiple ->queue_rq invocations. Using it to avoid
multiple direct dispatch failures (and obviously this new livelock)
seems fine to me.
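
The usual shape of that pattern, as a minimal sketch (the foo_*
helpers are hypothetical, not any specific driver):

    static blk_status_t foo_queue_rq(struct blk_mq_hw_ctx *hctx,
                                     const struct blk_mq_queue_data *bd)
    {
        struct request *rq = bd->rq;

        if (!(rq->rq_flags & RQF_DONTPREP)) {
            if (foo_prep(rq))               /* e.g. build SG tables */
                return BLK_STS_RESOURCE;
            rq->rq_flags |= RQF_DONTPREP;   /* prep survives a requeue */
        }

        if (foo_hw_busy(hctx))
            return BLK_STS_RESOURCE;        /* requeued with prep kept */

        foo_issue(rq);
        return BLK_STS_OK;
    }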

I really don't want to go around and audit every driver for potential
retained state over special commands, that's why the read+write thing is
in place. It's the safe option, which is what we need right now.

-- 
Jens Axboe



Re: [GIT PULL] Follow up block fix

2018-12-06 Thread Linus Torvalds
On Thu, Dec 6, 2018 at 6:12 PM Jens Axboe  wrote:
>
>
> Linus, I just now noticed you are not on the CC for the discussion about
> the patch. Don't pull this one yet. It'll solve the issue, but it'll also
> mess up the BUSY feedback loop that DM relies on for good merging of
> sequential IO. Testing a new patch now, will push it to you when folks
> are happy and when my testing has completed.

Ok, this pull request dropped from my queue.

   Linus




Re: [PATCH v2] block/dm: fix handling of busy off direct dispatch path

2018-12-06 Thread Mike Snitzer
On Thu, Dec 06 2018 at  9:49pm -0500,
Jens Axboe  wrote:

> After the direct dispatch corruption fix, we permanently disallow direct
> dispatch of non read/write requests. This works fine off the normal IO
> path, as they will be retried like any other failed direct dispatch
> request. But for the blk_insert_cloned_request() that only DM uses to
> bypass the bottom level scheduler, we always first attempt direct
> dispatch. For some types of requests, that's now a permanent failure,
> and no amount of retrying will make that succeed.
> 
> Use the driver private RQF_DONTPREP to track this condition in DM. If
> we encounter a BUSY condition from blk_insert_cloned_request(), then
> flag the request with RQF_DONTPREP. The next time we see this request,
> ask blk_insert_cloned_request() to bypass insert the request directly.
> This avoids the livelock of repeatedly trying to direct dispatch a
> request, while still retaining the BUSY feedback loop for blk-mq so
> that we don't over-dispatch to the lower level queue and mess up
> opportunities for merging on the DM queue.
> 
> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
> Reported-by: Bart Van Assche 
> Cc: stable@vger.kernel.org
> Signed-off-by: Jens Axboe 
> 
> ---
> 
> This passes my testing as well, like the previous patch. But unlike the
> previous patch, we retain the BUSY feedback loop information for better
> merging.

But it is kind of gross to work around the new behaviour to "permanently
disallow direct dispatch of non read/write requests" by always failing
such requests back to DM for later immediate direct dispatch.  That
bouncing of the request was acceptable when there was load-based
justification for having to retry (and in doing so: taking the cost of
freeing the clone request gotten via get_request() from the underlying
request_queues).

Having to retry like this purely because the request isn't a read or
write seems costly: every non-read-write will have implied
request_queue bouncing.  In multipath's case: it could select an
entirely different underlying path the next time it is destaged (with
RQF_DONTPREP set).  Which you'd think would negate all hope of IO
merging based performance improvements -- but that is a tangent I'll
need to ask Ming about (again).

I really don't like this business of bouncing requests as a workaround
for the recent implementation of the corruption fix.

Why not just add an override flag to _really_ allow direct dispatch for
_all_ types of requests?
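
Something like the following, say (purely a sketch of the idea, not a
posted patch):

    static bool blk_rq_can_direct_dispatch(struct request *rq, bool force)
    {
        /* 'force' would let blk_insert_cloned_request() opt out of
         * the conservative read/write-only filter
         */
        return force || req_op(rq) == REQ_OP_READ ||
               req_op(rq) == REQ_OP_WRITE;
    }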

(just peeked at linux-block and it is looking like you took 
jianchao.wang's series to avoid this hack... ;)

Awesome.. my work is done for tonight!

Mike


Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-06 Thread Ming Lei
On Thu, Dec 06, 2018 at 09:46:42PM -0500, Theodore Y. Ts'o wrote:
> On Wed, Dec 05, 2018 at 11:03:01AM +0800, Ming Lei wrote:
> > 
> > But at that time, there wasn't an io scheduler for MQ, so in theory the
> > issue should be there since v4.11, especially 945ffb60c11d ("mq-deadline:
> > add blk-mq adaptation of the deadline IO scheduler").
> 
> Hi Ming,
> 
> How serious were you about this being (theoretically) an
> issue since 4.11?  Can you talk about how it might get triggered, and
> how we can test for it?  The reason why I ask is because we're trying
> to track down a mysterious file system corruption problem on a 4.14.x
> stable kernel.  The symptoms are *very* eerily similar to kernel
> bugzilla #201685.

Hi Theodore,

It is just theoretical analysis.

blk_mq_try_issue_directly() is called in two branches of blk_mq_make_request(),
both only on real MQ disks.

IO merge can be done with none or a real io scheduler, so in theory the
risk might exist since v4.1, but IO merge on the sw queue didn't work for
quite a while, until it was fixed by ab42f35d9cb5ac49b5a2.

As Jens mentioned in bugzilla, there are several conditions required
for triggering the issue:

- MQ device

- queue busy can be triggered. It is hard to trigger in NVMe PCI,
  but may be possible on NVMe FC. However, it can be quite easy to
  trigger on SCSI devices. We know there are some MQ SCSI HBAs:
  qlogic FC, megaraid_sas.

- IO merge is enabled. 

I have setup scsi_debug in the following way:

modprobe scsi_debug dev_size_mb=4096 clustering=1 \
max_luns=1 submit_queues=2 max_queue=2

- submit_queues=2 may make this disk MQ
- max_queue=4 may trigger the queue busy condition easily

and run some write IO on ext4 over the disk: fio, kernel building,... for
some time, but still couldn't trigger the data corruption even once.

I should have created more LUNs, so that the queue becomes busy more
easily; will do that soon.

> 
> The problem is that the problem is super-rare --- roughly once a week
> out of a popuation of about 2500 systems.  The workload is NFS
> serving.  Unfortunately, the problem is since 4.14.63, we can no
> longer disable blk-mq for the virtio-scsi driver, thanks to the commit
> b5b6e8c8d3b4 ("scsi: virtio_scsi: fix IO hang caused by automatic irq
> vector affinity") getting backported into 4.14.63 as commit
> 70b522f163bbb32.

virtio_scsi supports multi-queue mode; if that is enabled in your
setup, you may try single queue and see if it makes a difference.

If multi-queue mode isn't enabled, your problem should be different from
this one. I remember multi-queue mode isn't enabled on virtio-scsi in GCE.

> We're considering reverting this patch in our 4.14 LTS kernel, and
> seeing whether it makes the problem go away.  Is there any thing else
> you might suggest?

The IO hang is only triggered on machines with a special CPU topology;
it should be fine to revert b5b6e8c8d3b4 on a normal VM.

No other suggestions yet.

Thanks,
Ming


Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-06 Thread Jens Axboe
On 12/6/18 7:46 PM, Theodore Y. Ts'o wrote:
> On Wed, Dec 05, 2018 at 11:03:01AM +0800, Ming Lei wrote:
>>
>> But at that time, there wasn't an io scheduler for MQ, so in theory the
>> issue should be there since v4.11, especially 945ffb60c11d ("mq-deadline:
>> add blk-mq adaptation of the deadline IO scheduler").
> 
> Hi Ming,
> 
> How serious were you about this being (theoretically) an
> issue since 4.11?  Can you talk about how it might get triggered, and
> how we can test for it?  The reason why I ask is because we're trying
> to track down a mysterious file system corruption problem on a 4.14.x
> stable kernel.  The symptoms are *very* eerily similar to kernel
> bugzilla #201685.
> 
> The problem is super-rare --- roughly once a week
> out of a population of about 2500 systems.  The workload is NFS
> serving.  Unfortunately, the problem is that since 4.14.63, we can no
> longer disable blk-mq for the virtio-scsi driver, thanks to the commit
> b5b6e8c8d3b4 ("scsi: virtio_scsi: fix IO hang caused by automatic irq
> vector affinity") getting backported into 4.14.63 as commit
> 70b522f163bbb32.
> 
> We're considering reverting this patch in our 4.14 LTS kernel, and
> seeing whether it makes the problem go away.  Is there any thing else
> you might suggest?

We should just make SCSI do the right thing, which is to unprep if
it sees BUSY and prep again next time. Otherwise I fear the direct
dispatch isn't going to be super useful, if a failed direct dispatch
prevents future merging.

This would be a lot less error prone as well for other cases.
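
Concretely, the idea is something like this in a driver's busy path (a
sketch only, with hypothetical foo_* helpers):

    if (foo_hw_busy(hctx)) {
        foo_unprep(rq);                     /* free SG tables etc. */
        rq->rq_flags &= ~RQF_DONTPREP;      /* safe to prep/merge again */
        return BLK_STS_RESOURCE;
    }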

-- 
Jens Axboe



[PATCH V2] blk-mq: re-build queue map in case of kdump kernel

2018-12-06 Thread Ming Lei
Now almost all .map_queues() implementations based on managed irq
affinity don't update the queue mapping; they just retrieve the
old built mapping. So if nr_hw_queues is changed, the mapping table
contains stale entries, and only blk_mq_map_queues() can rebuild
the mapping table.

One such case is that we limit .nr_hw_queues to 1 for a kdump kernel.
However, drivers often build the queue mapping via
pci_alloc_irq_vectors_affinity() before allocating the tagset, while
set->nr_hw_queues can be set to 1 for a kdump kernel, so a wrong
queue mapping is used, and a kernel panic[1] is observed during
booting.

This patch fixes the kernel panic triggered on nvme by rebuilding the
mapping table via blk_mq_map_queues().

[1] kernel panic log
[4.438371] nvme nvme0: 16/0/0 default/read/poll queues
[4.443277] BUG: unable to handle kernel NULL pointer dereference at 0098
[4.444681] PGD 0 P4D 0
[4.445367] Oops:  [#1] SMP NOPTI
[4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459
[4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
[4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222
[4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6
[4.453132] RSP: 0018:c900023b3cd8 EFLAGS: 00010286
[4.454061] RAX:  RBX: 888174448000 RCX: 0001
[4.456480] RDX: 0001 RSI: e8feffc506c0 RDI: 0001
[4.458750] RBP: 88810722d008 R08: 88817647a880 R09: 0002
[4.464580] R10: c900023b3c10 R11: 0004 R12: 888174448538
[4.467803] R13: 0004 R14: 0001 R15: 0001
[4.469220] FS:  () GS:88817bac() knlGS:
[4.471554] CS:  0010 DS:  ES:  CR0: 80050033
[4.472464] CR2: 0098 CR3: 000174e4e001 CR4: 00760ee0
[4.474264] DR0:  DR1:  DR2: 
[4.476007] DR3:  DR6: fffe0ff0 DR7: 0400
[4.477061] PKRU: 5554
[4.477464] Call Trace:
[4.478731]  blk_mq_init_allocated_queue+0x36a/0x3ad
[4.479595]  blk_mq_init_queue+0x32/0x4e
[4.480178]  nvme_validate_ns+0x98/0x623 [nvme_core]
[4.480963]  ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core]
[4.481685]  ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core]
[4.482601]  nvme_scan_work+0x23a/0x29b [nvme_core]
[4.483269]  ? _raw_spin_unlock_irqrestore+0x25/0x38
[4.483930]  ? try_to_wake_up+0x38d/0x3b3
[4.484478]  ? process_one_work+0x179/0x2fc
[4.485118]  process_one_work+0x1d3/0x2fc
[4.485655]  ? rescuer_thread+0x2ae/0x2ae
[4.486196]  worker_thread+0x1e9/0x2be
[4.486841]  kthread+0x115/0x11d
[4.487294]  ? kthread_park+0x76/0x76
[4.487784]  ret_from_fork+0x3a/0x50
[4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi ip_tables
[4.489428] Dumping ftrace buffer:
[4.489939](ftrace buffer empty)
[4.490492] CR2: 0098
[4.491052] ---[ end trace 03cd268ad5a86ff7 ]---

Cc: Christoph Hellwig 
Cc: linux-nvme@lists.infradead.org
Cc: David Milburn 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 900550594651..65770da99159 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2960,7 +2960,7 @@ static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 
 static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
 {
-   if (set->ops->map_queues) {
+   if (set->ops->map_queues && !is_kdump_kernel()) {
int i;
 
/*
@@ -3030,6 +3030,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 */
if (is_kdump_kernel()) {
set->nr_hw_queues = 1;
+   set->nr_maps = 1;
set->queue_depth = min(64U, set->queue_depth);
}
/*
@@ -3051,7 +3052,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
  GFP_KERNEL, set->numa_node);
if (!set->map[i].mq_map)
goto out_free_mq_map;
-   set->map[i].nr_queues = set->nr_hw_queues;
+   set->map[i].nr_queues = is_kdump_kernel() ? 1 : set->nr_hw_queues;
}
 
ret = blk_mq_update_queue_map(set);
-- 
2.9.5
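
To see why the rebuild matters: blk_mq_map_queues() spreads CPUs over
the current number of queues, so every map entry stays below
nr_hw_queues. A minimal model of that invariant (simplified userspace
sketch, not the kernel function):

    /* a stale map built for 16 queues still contains entries up to 15;
     * reused with nr_hw_queues == 1 it indexes queue_hw_ctx[15], which
     * is the NULL dereference in the log above
     */
    static void rebuild_map(unsigned int *map, unsigned int nr_cpus,
                            unsigned int nr_queues)
    {
        unsigned int cpu;

        for (cpu = 0; cpu < nr_cpus; cpu++)
            map[cpu] = cpu % nr_queues;
    }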



Re: [PATCH] blk-mq: re-build queue map in case of kdump kernel

2018-12-06 Thread Ming Lei
On Thu, Dec 06, 2018 at 07:57:34PM -0700, Jens Axboe wrote:
> On 12/6/18 7:55 PM, Ming Lei wrote:
> > Now almost all .map_queues() implementations based on managed irq
> > affinity don't update the queue mapping; they just retrieve the
> > old built mapping. So if nr_hw_queues is changed, the mapping table
> > contains stale entries, and only blk_mq_map_queues() can rebuild
> > the mapping table.
> > 
> > One such case is that we limit .nr_hw_queues to 1 for a kdump kernel.
> > However, drivers often build the queue mapping via
> > pci_alloc_irq_vectors_affinity() before allocating the tagset, while
> > set->nr_hw_queues can be set to 1 for a kdump kernel, so a wrong
> > queue mapping is used, and a kernel panic[1] is observed during
> > booting.
> > 
> > This patch fixes the kernel panic triggered on nvme by rebuilding the
> > mapping table via blk_mq_map_queues().
> > 
> > [1] kernel panic log
> > [4.438371] nvme nvme0: 16/0/0 default/read/poll queues
> > [4.443277] BUG: unable to handle kernel NULL pointer dereference at 0098
> > [4.444681] PGD 0 P4D 0
> > [4.445367] Oops:  [#1] SMP NOPTI
> > [4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459
> > [4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
> > [4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core]
> > [4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222
> > [4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6
> > [4.453132] RSP: 0018:c900023b3cd8 EFLAGS: 00010286
> > [4.454061] RAX:  RBX: 888174448000 RCX: 0001
> > [4.456480] RDX: 0001 RSI: e8feffc506c0 RDI: 0001
> > [4.458750] RBP: 88810722d008 R08: 88817647a880 R09: 0002
> > [4.464580] R10: c900023b3c10 R11: 0004 R12: 888174448538
> > [4.467803] R13: 0004 R14: 0001 R15: 0001
> > [4.469220] FS:  () GS:88817bac() knlGS:
> > [4.471554] CS:  0010 DS:  ES:  CR0: 80050033
> > [4.472464] CR2: 0098 CR3: 000174e4e001 CR4: 00760ee0
> > [4.474264] DR0:  DR1:  DR2: 
> > [4.476007] DR3:  DR6: fffe0ff0 DR7: 0400
> > [4.477061] PKRU: 5554
> > [4.477464] Call Trace:
> > [4.478731]  blk_mq_init_allocated_queue+0x36a/0x3ad
> > [4.479595]  blk_mq_init_queue+0x32/0x4e
> > [4.480178]  nvme_validate_ns+0x98/0x623 [nvme_core]
> > [4.480963]  ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core]
> > [4.481685]  ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core]
> > [4.482601]  nvme_scan_work+0x23a/0x29b [nvme_core]
> > [4.483269]  ? _raw_spin_unlock_irqrestore+0x25/0x38
> > [4.483930]  ? try_to_wake_up+0x38d/0x3b3
> > [4.484478]  ? process_one_work+0x179/0x2fc
> > [4.485118]  process_one_work+0x1d3/0x2fc
> > [4.485655]  ? rescuer_thread+0x2ae/0x2ae
> > [4.486196]  worker_thread+0x1e9/0x2be
> > [4.486841]  kthread+0x115/0x11d
> > [4.487294]  ? kthread_park+0x76/0x76
> > [4.487784]  ret_from_fork+0x3a/0x50
> > [4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi ip_tables
> > [4.489428] Dumping ftrace buffer:
> > [4.489939](ftrace buffer empty)
> > [4.490492] CR2: 0098
> > [4.491052] ---[ end trace 03cd268ad5a86ff7 ]---
> > 
> > Cc: Christoph Hellwig 
> > Cc: linux-nvme@lists.infradead.org
> > Cc: David Milburn 
> > Signed-off-by: Ming Lei 
> > ---
> >  block/blk-mq.c | 12 +++++++++---
> >  1 file changed, 9 insertions(+), 3 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 900550594651..a3e463a726a6 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -38,6 +38,11 @@
> >  #include "blk-mq-sched.h"
> >  #include "blk-rq-qos.h"
> >  
> > +static inline bool blk_mq_kdump_kernel(void)
> > +{
> > +   return !!is_kdump_kernel();
> > +}
> 
> Let's drop the redundant !! here, and the wrapper? I would imagine the
> wrapper is handy for testing outside of kdump, but I don't think we
> should include it in the final.

OK.


Thanks,
Ming


Re: [PATCH] blk-mq: re-build queue map in case of kdump kernel

2018-12-06 Thread Jens Axboe
On 12/6/18 7:55 PM, Ming Lei wrote:
> Now almost all .map_queues() implementations based on managed irq
> affinity don't update the queue mapping; they just retrieve the
> old built mapping. So if nr_hw_queues is changed, the mapping table
> contains stale entries, and only blk_mq_map_queues() can rebuild
> the mapping table.
> 
> One such case is that we limit .nr_hw_queues to 1 for a kdump kernel.
> However, drivers often build the queue mapping via
> pci_alloc_irq_vectors_affinity() before allocating the tagset, while
> set->nr_hw_queues can be set to 1 for a kdump kernel, so a wrong
> queue mapping is used, and a kernel panic[1] is observed during
> booting.
> 
> This patch fixes the kernel panic triggered on nvme by rebuilding the
> mapping table via blk_mq_map_queues().
> 
> [1] kernel panic log
> [4.438371] nvme nvme0: 16/0/0 default/read/poll queues
> [4.443277] BUG: unable to handle kernel NULL pointer dereference at 0098
> [4.444681] PGD 0 P4D 0
> [4.445367] Oops:  [#1] SMP NOPTI
> [4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459
> [4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
> [4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core]
> [4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222
> [4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6
> [4.453132] RSP: 0018:c900023b3cd8 EFLAGS: 00010286
> [4.454061] RAX:  RBX: 888174448000 RCX: 0001
> [4.456480] RDX: 0001 RSI: e8feffc506c0 RDI: 0001
> [4.458750] RBP: 88810722d008 R08: 88817647a880 R09: 0002
> [4.464580] R10: c900023b3c10 R11: 0004 R12: 888174448538
> [4.467803] R13: 0004 R14: 0001 R15: 0001
> [4.469220] FS:  () GS:88817bac() knlGS:
> [4.471554] CS:  0010 DS:  ES:  CR0: 80050033
> [4.472464] CR2: 0098 CR3: 000174e4e001 CR4: 00760ee0
> [4.474264] DR0:  DR1:  DR2: 
> [4.476007] DR3:  DR6: fffe0ff0 DR7: 0400
> [4.477061] PKRU: 5554
> [4.477464] Call Trace:
> [4.478731]  blk_mq_init_allocated_queue+0x36a/0x3ad
> [4.479595]  blk_mq_init_queue+0x32/0x4e
> [4.480178]  nvme_validate_ns+0x98/0x623 [nvme_core]
> [4.480963]  ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core]
> [4.481685]  ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core]
> [4.482601]  nvme_scan_work+0x23a/0x29b [nvme_core]
> [4.483269]  ? _raw_spin_unlock_irqrestore+0x25/0x38
> [4.483930]  ? try_to_wake_up+0x38d/0x3b3
> [4.484478]  ? process_one_work+0x179/0x2fc
> [4.485118]  process_one_work+0x1d3/0x2fc
> [4.485655]  ? rescuer_thread+0x2ae/0x2ae
> [4.486196]  worker_thread+0x1e9/0x2be
> [4.486841]  kthread+0x115/0x11d
> [4.487294]  ? kthread_park+0x76/0x76
> [4.487784]  ret_from_fork+0x3a/0x50
> [4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi ip_tables
> [4.489428] Dumping ftrace buffer:
> [4.489939](ftrace buffer empty)
> [4.490492] CR2: 0098
> [4.491052] ---[ end trace 03cd268ad5a86ff7 ]---
> 
> Cc: Christoph Hellwig 
> Cc: linux-nvme@lists.infradead.org
> Cc: David Milburn 
> Signed-off-by: Ming Lei 
> ---
>  block/blk-mq.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 900550594651..a3e463a726a6 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -38,6 +38,11 @@
>  #include "blk-mq-sched.h"
>  #include "blk-rq-qos.h"
>  
> +static inline bool blk_mq_kdump_kernel(void)
> +{
> + return !!is_kdump_kernel();
> +}

Let's drop the redundant !! here, and the wrapper? I would imagine the
wrapper is handy for testing outside of kdump, but I don't think we
should include it in the final.

Otherwise this looks fine, I can test it here too.

-- 
Jens Axboe



[PATCH] blk-mq: re-build queue map in case of kdump kernel

2018-12-06 Thread Ming Lei
Now almost all .map_queues() implementations based on managed irq
affinity don't update the queue mapping; they just retrieve the
old built mapping. So if nr_hw_queues is changed, the mapping table
contains stale entries, and only blk_mq_map_queues() can rebuild
the mapping table.

One such case is that we limit .nr_hw_queues to 1 for a kdump kernel.
However, drivers often build the queue mapping via
pci_alloc_irq_vectors_affinity() before allocating the tagset, while
set->nr_hw_queues can be set to 1 for a kdump kernel, so a wrong
queue mapping is used, and a kernel panic[1] is observed during
booting.

This patch fixes the kernel panic triggered on nvme by rebuilding the
mapping table via blk_mq_map_queues().

[1] kernel panic log
[4.438371] nvme nvme0: 16/0/0 default/read/poll queues
[4.443277] BUG: unable to handle kernel NULL pointer dereference at 0098
[4.444681] PGD 0 P4D 0
[4.445367] Oops:  [#1] SMP NOPTI
[4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459
[4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
[4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core]
[4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222
[4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6
[4.453132] RSP: 0018:c900023b3cd8 EFLAGS: 00010286
[4.454061] RAX:  RBX: 888174448000 RCX: 0001
[4.456480] RDX: 0001 RSI: e8feffc506c0 RDI: 0001
[4.458750] RBP: 88810722d008 R08: 88817647a880 R09: 0002
[4.464580] R10: c900023b3c10 R11: 0004 R12: 888174448538
[4.467803] R13: 0004 R14: 0001 R15: 0001
[4.469220] FS:  () GS:88817bac() knlGS:
[4.471554] CS:  0010 DS:  ES:  CR0: 80050033
[4.472464] CR2: 0098 CR3: 000174e4e001 CR4: 00760ee0
[4.474264] DR0:  DR1:  DR2: 
[4.476007] DR3:  DR6: fffe0ff0 DR7: 0400
[4.477061] PKRU: 5554
[4.477464] Call Trace:
[4.478731]  blk_mq_init_allocated_queue+0x36a/0x3ad
[4.479595]  blk_mq_init_queue+0x32/0x4e
[4.480178]  nvme_validate_ns+0x98/0x623 [nvme_core]
[4.480963]  ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core]
[4.481685]  ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core]
[4.482601]  nvme_scan_work+0x23a/0x29b [nvme_core]
[4.483269]  ? _raw_spin_unlock_irqrestore+0x25/0x38
[4.483930]  ? try_to_wake_up+0x38d/0x3b3
[4.484478]  ? process_one_work+0x179/0x2fc
[4.485118]  process_one_work+0x1d3/0x2fc
[4.485655]  ? rescuer_thread+0x2ae/0x2ae
[4.486196]  worker_thread+0x1e9/0x2be
[4.486841]  kthread+0x115/0x11d
[4.487294]  ? kthread_park+0x76/0x76
[4.487784]  ret_from_fork+0x3a/0x50
[4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi ip_tables
[4.489428] Dumping ftrace buffer:
[4.489939](ftrace buffer empty)
[4.490492] CR2: 0098
[4.491052] ---[ end trace 03cd268ad5a86ff7 ]---

Cc: Christoph Hellwig 
Cc: linux-n...@lists.infradead.org
Cc: David Milburn 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 900550594651..a3e463a726a6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -38,6 +38,11 @@
 #include "blk-mq-sched.h"
 #include "blk-rq-qos.h"
 
+static inline bool blk_mq_kdump_kernel(void)
+{
+   return !!is_kdump_kernel();
+}
+
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
 
@@ -2960,7 +2965,7 @@ static int blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 
 static int blk_mq_update_queue_map(struct blk_mq_tag_set *set)
 {
-   if (set->ops->map_queues) {
+   if (set->ops->map_queues && !blk_mq_kdump_kernel()) {
int i;
 
/*
@@ -3028,8 +3033,9 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 * memory constrained environment. Limit us to 1 queue and
 * 64 tags to prevent using too much memory.
 */
-   if (is_kdump_kernel()) {
+   if (blk_mq_kdump_kernel()) {
set->nr_hw_queues = 1;
+   set->nr_maps = 1;
set->queue_depth = min(64U, set->queue_depth);
}
/*
@@ -3051,7 +3057,7 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
  GFP_KERNEL, set->numa_node);
if (!set->map[i].mq_map)
goto out_free_mq_map;
-   

[PATCH v2] block/dm: fix handling of busy off direct dispatch path

2018-12-06 Thread Jens Axboe
After the direct dispatch corruption fix, we permanently disallow direct
dispatch of non read/write requests. This works fine off the normal IO
path, as they will be retried like any other failed direct dispatch
request. But for the blk_insert_cloned_request() that only DM uses to
bypass the bottom level scheduler, we always first attempt direct
dispatch. For some types of requests, that's now a permanent failure,
and no amount of retrying will make that succeed.

Use the driver private RQF_DONTPREP to track this condition in DM. If
we encounter a BUSY condition from blk_insert_cloned_request(), then
flag the request with RQF_DONTPREP. The next time we see this request,
ask blk_insert_cloned_request() to bypass insert the request directly.
This avoids the livelock of repeatedly trying to direct dispatch a
request, while still retaining the BUSY feedback loop for blk-mq so
that we don't over-dispatch to the lower level queue and mess up
opportunities for merging on the DM queue.

Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
Reported-by: Bart Van Assche 
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe 

---

This passes my testing as well, like the previous patch. But unlike the
previous patch, we retain the BUSY feedback loop information for better
merging.

diff --git a/block/blk-core.c b/block/blk-core.c
index deb56932f8c4..cccda51e165f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2617,7 +2617,8 @@ static int blk_cloned_rq_check_limits(struct request_queue *q,
  * @q:  the queue to submit the request
  * @rq: the request being queued
  */
-blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *rq)
+blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *rq,
+  bool force_insert)
 {
unsigned long flags;
int where = ELEVATOR_INSERT_BACK;
@@ -2637,7 +2638,11 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
 * bypass a potential scheduler on the bottom device for
 * insert.
 */
-   return blk_mq_request_issue_directly(rq);
+   if (force_insert) {
+   blk_mq_request_bypass_insert(rq, true);
+   return BLK_STS_OK;
+   } else
+   return blk_mq_request_issue_directly(rq);
}
 
spin_lock_irqsave(q->queue_lock, flags);
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 7cd36e4d1310..e497a2ab6766 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -299,16 +299,20 @@ static void end_clone_request(struct request *clone, blk_status_t error)
 
static blk_status_t dm_dispatch_clone_request(struct request *clone, struct request *rq)
 {
+   bool was_busy = (rq->rq_flags & RQF_DONTPREP) != 0;
blk_status_t r;
 
if (blk_queue_io_stat(clone->q))
clone->rq_flags |= RQF_IO_STAT;
 
clone->start_time_ns = ktime_get_ns();
-   r = blk_insert_cloned_request(clone->q, clone);
-   if (r != BLK_STS_OK && r != BLK_STS_RESOURCE && r != BLK_STS_DEV_RESOURCE)
+   r = blk_insert_cloned_request(clone->q, clone, was_busy);
+   if (r == BLK_STS_RESOURCE || r == BLK_STS_DEV_RESOURCE)
+   rq->rq_flags |= RQF_DONTPREP;
+   else if (r != BLK_STS_OK)
/* must complete clone in terms of original request */
dm_complete_request(rq, r);
+
return r;
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4293dc1cd160..7cb84ee4c9f4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -994,7 +994,7 @@ extern int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
 void *data);
 extern void blk_rq_unprep_clone(struct request *rq);
 extern blk_status_t blk_insert_cloned_request(struct request_queue *q,
-struct request *rq);
+struct request *rq, bool force_insert);
 extern int blk_rq_append_bio(struct request *rq, struct bio **bio);
 extern void blk_delay_queue(struct request_queue *, unsigned long);
 extern void blk_queue_split(struct request_queue *, struct bio **);

-- 
Jens Axboe



Re: [PATCH] blk-mq: fix corruption with direct issue

2018-12-06 Thread Theodore Y. Ts'o
On Wed, Dec 05, 2018 at 11:03:01AM +0800, Ming Lei wrote:
> 
> But at that time, there wasn't an io scheduler for MQ, so in theory the
> issue should be there since v4.11, especially 945ffb60c11d ("mq-deadline:
> add blk-mq adaptation of the deadline IO scheduler").

Hi Ming,

How serious were you about this being (theoretically) an
issue since 4.11?  Can you talk about how it might get triggered, and
how we can test for it?  The reason why I ask is because we're trying
to track down a mysterious file system corruption problem on a 4.14.x
stable kernel.  The symptoms are *very* eerily similar to kernel
bugzilla #201685.

The problem is super-rare --- roughly once a week
out of a population of about 2500 systems.  The workload is NFS
serving.  Unfortunately, the problem is that since 4.14.63, we can no
longer disable blk-mq for the virtio-scsi driver, thanks to the commit
b5b6e8c8d3b4 ("scsi: virtio_scsi: fix IO hang caused by automatic irq
vector affinity") getting backported into 4.14.63 as commit
70b522f163bbb32.

We're considering reverting this patch in our 4.14 LTS kernel, and
seeing whether it makes the problem go away.  Is there any thing else
you might suggest?

Thanks,

- Ted

P.S.  Unlike the repros that users were seeing in #201685, we *did*
have an I/O scheduler enabled --- it was mq-deadline.  But right now,
given your comments, and the corruptions that we're seeing, I'm not
feeling very warm and fuzzy about block-mq.  :-( :-( :-(


Re: block: fix direct dispatch issue failure for clones

2018-12-06 Thread Jens Axboe
On 12/6/18 7:16 PM, Jens Axboe wrote:
> On 12/6/18 7:06 PM, Jens Axboe wrote:
>> On 12/6/18 6:58 PM, Mike Snitzer wrote:
 There is another way to fix this - still do the direct dispatch, but have
dm track if it failed and do bypass insert in that case. I didn't want
to do that since it's more involved, but it's doable.

 Let me cook that up and test it... Don't like it, though.
>>>
>>> Not following how DM can track if issuing the request worked if it is
>>> always told it worked with BLK_STS_OK.  We care about feedback when the
>>> request is actually issued because of the elaborate way blk-mq elevators
>>> work.  DM is forced to worry about all these details, as covered some in
>>> the header for commit 396eaf21ee17c476e8f66249fb1f4a39003d0ab4, it is
>>> trying to have its cake and eat it too.  It just wants IO scheduling to
>>> work for request-based DM devices.  That's it.
>>
>> It needs the feedback, I don't disagree on that part at all. If we
>> always return OK, then that loop is broken. How about something like the
>> below? Totally untested right now...
>>
>> We track if a request ever saw BLK_STS_RESOURCE from direct dispatch,
>> and if it did, we store that information with RQF_DONTPREP. When we then
>> next go to insert a request, if it has RQF_DONTPREP set, then we
>> ask blk_insert_cloned_request() to bypass insert.
>>
>> I'll go test this now.
> 
> Passes the test case for me.

Here's one that doesn't re-arrange the return value check into a switch.
It turns out cleaner (with fewer LOC changed), and also doesn't fiddle with
the request after freeing it if we got an OK return...

Will give this a whirl too, just in case.

diff --git a/block/blk-core.c b/block/blk-core.c
index deb56932f8c4..cccda51e165f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2617,7 +2617,8 @@ static int blk_cloned_rq_check_limits(struct request_queue *q,
  * @q:  the queue to submit the request
  * @rq: the request being queued
  */
-blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *rq)
+blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *rq,
+  bool force_insert)
 {
unsigned long flags;
int where = ELEVATOR_INSERT_BACK;
@@ -2637,7 +2638,11 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
 * bypass a potential scheduler on the bottom device for
 * insert.
 */
-   return blk_mq_request_issue_directly(rq);
+   if (force_insert) {
+   blk_mq_request_bypass_insert(rq, true);
+   return BLK_STS_OK;
+   } else
+   return blk_mq_request_issue_directly(rq);
}
 
spin_lock_irqsave(q->queue_lock, flags);
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 7cd36e4d1310..e497a2ab6766 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -299,16 +299,20 @@ static void end_clone_request(struct request *clone, blk_status_t error)
 
static blk_status_t dm_dispatch_clone_request(struct request *clone, struct request *rq)
 {
+   bool was_busy = (rq->rq_flags & RQF_DONTPREP) != 0;
blk_status_t r;
 
if (blk_queue_io_stat(clone->q))
clone->rq_flags |= RQF_IO_STAT;
 
clone->start_time_ns = ktime_get_ns();
-   r = blk_insert_cloned_request(clone->q, clone);
-   if (r != BLK_STS_OK && r != BLK_STS_RESOURCE && r != BLK_STS_DEV_RESOURCE)
+   r = blk_insert_cloned_request(clone->q, clone, was_busy);
+   if (r == BLK_STS_RESOURCE || r == BLK_STS_DEV_RESOURCE)
+   rq->rq_flags |= RQF_DONTPREP;
+   else if (r != BLK_STS_OK)
/* must complete clone in terms of original request */
dm_complete_request(rq, r);
+
return r;
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4293dc1cd160..7cb84ee4c9f4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -994,7 +994,7 @@ extern int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
 void *data);
 extern void blk_rq_unprep_clone(struct request *rq);
 extern blk_status_t blk_insert_cloned_request(struct request_queue *q,
-struct request *rq);
+struct request *rq, bool force_insert);
 extern int blk_rq_append_bio(struct request *rq, struct bio **bio);
 extern void blk_delay_queue(struct request_queue *, unsigned long);
 extern void blk_queue_split(struct request_queue *, struct bio **);

-- 
Jens Axboe



Re: block: fix direct dispatch issue failure for clones

2018-12-06 Thread Jens Axboe
On 12/6/18 7:06 PM, Jens Axboe wrote:
> On 12/6/18 6:58 PM, Mike Snitzer wrote:
>>> There is another way to fix this - still do the direct dispatch, but have
>>> dm track if it failed and do bypass insert in that case. I didn't want
>>> to do that since it's more involved, but it's doable.
>>>
>>> Let me cook that up and test it... Don't like it, though.
>>
>> Not following how DM can track if issuing the request worked if it is
>> always told it worked with BLK_STS_OK.  We care about feedback when the
>> request is actually issued because of the elaborate way blk-mq elevators
>> work.  DM is forced to worry about all these details, as covered some in
>> the header for commit 396eaf21ee17c476e8f66249fb1f4a39003d0ab4, it is
>> trying to have its cake and eat it too.  It just wants IO scheduling to
>> work for request-based DM devices.  That's it.
> 
> It needs the feedback, I don't disagree on that part at all. If we
> always return OK, then that loop is broken. How about something like the
> below? Totally untested right now...
> 
> We track if a request ever saw BLK_STS_RESOURCE from direct dispatch,
> and if it did, we store that information with RQF_DONTPREP. When we then
> next go to insert a request, if it has RQF_DONTPREP set, then we
> ask blk_insert_cloned_request() to bypass insert.
> 
> I'll go test this now.

Passes the test case for me.

-- 
Jens Axboe



Re: [GIT PULL] Follow up block fix

2018-12-06 Thread Jens Axboe
On 12/6/18 5:33 PM, Jens Axboe wrote:
> On 12/6/18 5:31 PM, Bart Van Assche wrote:
>> On Thu, 2018-12-06 at 17:26 -0700, Jens Axboe wrote:
>>> On 12/6/18 5:20 PM, Bart Van Assche wrote:
>>>> Which branch does that tag correspond to? Even after having run git fetch
>>>> --tags I can't find that tag in your repository.
>>>
>>> I pushed it before I sent the email, where are you looking?
>>>
>>> http://git.kernel.dk/cgit/linux-block/tag/?h=for-linus-20181206
>>
>> On my development system git has been configured to fetch from the following 
>> repository:
>> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/
>>
>> Should I fetch your branches from the git.kernel.dk repository instead?
> 
> Ah I see, I always push to git.kernel.dk, and that syncs to git.kernel.org
> only every hour or so. So either one will work, but you might be a bit
> behind if you use git.kernel.org.

Linus, I just now noticed you are not on the CC for the discussion about
the patch. Don't pull this one yet. It'll solve the issue, but it'll also
mess up the BUSY feedback loop that DM relies on for good merging of
sequential IO. Testing a new patch now, will push it to you when folks
are happy and when my testing has completed.

-- 
Jens Axboe



Re: block: fix direct dispatch issue failure for clones

2018-12-06 Thread Mike Snitzer
On Thu, Dec 06 2018 at  8:58pm -0500,
Mike Snitzer  wrote:

> DM is forced to worry about all these details, as covered some in
> the header for commit 396eaf21ee17c476e8f66249fb1f4a39003d0ab4, it is
> trying to have its cake and eat it too.

Gah, obviously meant: DM is _NOT_ trying to have its cake and eat it too.

> It just wants IO scheduling to work for request-based DM devices.


Re: block: fix direct dispatch issue failure for clones

2018-12-06 Thread Jens Axboe
On 12/6/18 6:58 PM, Mike Snitzer wrote:
>> There is another way to fix this - still do the direct dispatch, but have
>> dm track if it failed and do bypass insert in that case. I didn't want to
>> do that since it's more involved, but it's doable.
>>
>> Let me cook that up and test it... Don't like it, though.
> 
> Not following how DM can track if issuing the request worked if it is
> always told it worked with BLK_STS_OK.  We care about feedback when the
> request is actually issued because of the elaborate way blk-mq elevators
> work.  DM is forced to worry about all these details, as covered some in
> the header for commit 396eaf21ee17c476e8f66249fb1f4a39003d0ab4, it is
> trying to have its cake and eat it too.  It just wants IO scheduling to
> work for request-based DM devices.  That's it.

It needs the feedback, I don't disagree on that part at all. If we
always return OK, then that loop is broken. How about something like the
below? Totally untested right now...

We track if a request ever saw BLK_STS_RESOURCE from direct dispatch,
and if it did, we store that information with RQF_DONTPREP. When we then
next time go and insert a request, if it has RQF_DONTPREP set, then we
ask blk_insert_cloned_request() to bypass insert.

I'll go test this now.

diff --git a/block/blk-core.c b/block/blk-core.c
index deb56932f8c4..cccda51e165f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2617,7 +2617,8 @@ static int blk_cloned_rq_check_limits(struct request_queue *q,
  * @q:  the queue to submit the request
  * @rq: the request being queued
  */
-blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *rq)
+blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *rq,
+                                      bool force_insert)
 {
unsigned long flags;
int where = ELEVATOR_INSERT_BACK;
@@ -2637,7 +2638,11 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
 * bypass a potential scheduler on the bottom device for
 * insert.
 */
-   return blk_mq_request_issue_directly(rq);
+   if (force_insert) {
+   blk_mq_request_bypass_insert(rq, true);
+   return BLK_STS_OK;
+   } else
+   return blk_mq_request_issue_directly(rq);
}
 
spin_lock_irqsave(q->queue_lock, flags);
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 7cd36e4d1310..8d4c5020ccaa 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -299,16 +299,27 @@ static void end_clone_request(struct request *clone, blk_status_t error)
 
 static blk_status_t dm_dispatch_clone_request(struct request *clone, struct request *rq)
 {
+   bool was_busy = (rq->rq_flags & RQF_DONTPREP) != 0;
blk_status_t r;
 
if (blk_queue_io_stat(clone->q))
clone->rq_flags |= RQF_IO_STAT;
 
clone->start_time_ns = ktime_get_ns();
-   r = blk_insert_cloned_request(clone->q, clone);
-   if (r != BLK_STS_OK && r != BLK_STS_RESOURCE && r != BLK_STS_DEV_RESOURCE)
+
+   r = blk_insert_cloned_request(clone->q, clone, was_busy);
+   switch (r) {
+   default:
/* must complete clone in terms of original request */
dm_complete_request(rq, r);
+   /* fall through */
+   case BLK_STS_RESOURCE:
+   case BLK_STS_DEV_RESOURCE:
+   rq->rq_flags |= RQF_DONTPREP;
+   case BLK_STS_OK:
+   break;
+   }
+
return r;
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4293dc1cd160..7cb84ee4c9f4 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -994,7 +994,7 @@ extern int blk_rq_prep_clone(struct request *rq, struct request *rq_src,
 void *data);
 extern void blk_rq_unprep_clone(struct request *rq);
 extern blk_status_t blk_insert_cloned_request(struct request_queue *q,
-struct request *rq);
+struct request *rq, bool force_insert);
 extern int blk_rq_append_bio(struct request *rq, struct bio **bio);
 extern void blk_delay_queue(struct request_queue *, unsigned long);
 extern void blk_queue_split(struct request_queue *, struct bio **);
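
A note on the switch in the dm-rq hunk above: because the default (error) label
deliberately falls through into the BLK_STS_RESOURCE cases, every non-OK status
marks the original request, which the equivalent if/else fragment below makes
explicit (same names as the patch, kernel-context fragment only). The v3
version earlier on this page narrows that so only the two BUSY statuses set
RQF_DONTPREP:

	/* equivalent control flow of the switch in this draft */
	if (r != BLK_STS_OK && r != BLK_STS_RESOURCE && r != BLK_STS_DEV_RESOURCE)
		dm_complete_request(rq, r);	/* hard error: complete the clone */
	if (r != BLK_STS_OK)
		rq->rq_flags |= RQF_DONTPREP;	/* any non-OK status marks rq */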

-- 
Jens Axboe



Re: [PATCH] block: fix direct dispatch issue failure for clones

2018-12-06 Thread jianchao.wang



On 12/7/18 9:34 AM, Jens Axboe wrote:
> On 12/6/18 6:22 PM, jianchao.wang wrote:
>>
>>
>> On 12/7/18 9:13 AM, Jens Axboe wrote:
>>> On 12/6/18 6:04 PM, jianchao.wang wrote:


 On 12/7/18 6:20 AM, Jens Axboe wrote:
> After the direct dispatch corruption fix, we permanently disallow direct
> dispatch of non read/write requests. This works fine off the normal IO
> path, as they will be retried like any other failed direct dispatch
> request. But for the blk_insert_cloned_request() that only DM uses to
> bypass the bottom level scheduler, we always first attempt direct
> dispatch. For some types of requests, that's now a permanent failure,
> and no amount of retrying will make that succeed.
>
> Don't use direct dispatch off the cloned insert path, always just use
> bypass inserts. This still bypasses the bottom level scheduler, which is
> what DM wants.
>
> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
> Signed-off-by: Jens Axboe 
>
> ---
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index deb56932f8c4..4c44e6fa0d08 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
>* bypass a potential scheduler on the bottom device for
>* insert.
>*/
> - return blk_mq_request_issue_directly(rq);
> + blk_mq_request_bypass_insert(rq, true);
> + return BLK_STS_OK;
>   }
>  
>   spin_lock_irqsave(q->queue_lock, flags);
>
 Not sure about this because it will break the merging promotion for 
 request based DM
 from Ming.
 396eaf21ee17c476e8f66249fb1f4a39003d0ab4
 (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request 
 feedback)

 We could use some other way to fix this.
>>>
>>> That really shouldn't matter as this is the cloned insert, merging should
>>> have been done on the original request.
>>>
>>>
>> Just quote some comments from the patch.
>>
>> "
>>But dm-rq currently can't get the underlying queue's
>> dispatch feedback at all.  Without knowing whether a request was issued
>> or not (e.g. due to underlying queue being busy) the dm-rq elevator will
>> not be able to provide effective IO merging (as a side-effect of dm-rq
>> currently blindly destaging a request from its elevator only to requeue
>> it after a delay, which kills any opportunity for merging).  This
>> obviously causes very bad sequential IO performance.
>> ...
>> With this, request-based DM's blk-mq sequential IO performance is vastly
>> improved (as much as 3X in mpath/virtio-scsi testing)
>> "
>>
>> Using blk_mq_request_bypass_insert to replace the 
>> blk_mq_request_issue_directly
>> could be a fast method to fix the current issue. Maybe we could get the 
>> merging
>> promotion back after some time.
> 
> This really sucks, mostly because DM wants to have it both ways - not use
> the bottom level IO scheduler, but still actually use it if it makes sense.
> 
> There is another way to fix this - still do the direct dispatch, but have
> dm track if it failed and do bypass insert in that case. I didn't want to
> do that since it's more involved, but it's doable.
> 
> Let me cook that up and test it... Don't like it, though.
> 
Actually, I have tried to fix this issue in the 1st patch of my patchset
"blk-mq: refactor code of issue directly".

Just insert the non-read-write command into the dispatch list directly and return
BLK_STS_OK.
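
As a sketch, that alternative might look like the fragment below at the
issue-directly site. This is condensed from the description here, not the
actual patch from that series, and the exact placement inside the
issue-directly helper is an assumption:

	/* Non-read/write commands can't be issued directly since the
	 * corruption fix, so queue them to the dispatch list and report
	 * success instead of bouncing BLK_STS_RESOURCE back to the caller. */
	if (req_op(rq) != REQ_OP_READ && req_op(rq) != REQ_OP_WRITE) {
		blk_mq_request_bypass_insert(rq, true);
		return BLK_STS_OK;
	}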

Thanks
Jianchao


Re: block: fix direct dispatch issue failure for clones

2018-12-06 Thread Mike Snitzer
On Thu, Dec 06 2018 at  8:34pm -0500,
Jens Axboe  wrote:

> On 12/6/18 6:22 PM, jianchao.wang wrote:
> > 
> > 
> > On 12/7/18 9:13 AM, Jens Axboe wrote:
> >> On 12/6/18 6:04 PM, jianchao.wang wrote:
> >>>
> >>>
> >>> On 12/7/18 6:20 AM, Jens Axboe wrote:
>  After the direct dispatch corruption fix, we permanently disallow direct
>  dispatch of non read/write requests. This works fine off the normal IO
>  path, as they will be retried like any other failed direct dispatch
>  request. But for the blk_insert_cloned_request() that only DM uses to
>  bypass the bottom level scheduler, we always first attempt direct
>  dispatch. For some types of requests, that's now a permanent failure,
>  and no amount of retrying will make that succeed.
> 
>  Don't use direct dispatch off the cloned insert path, always just use
>  bypass inserts. This still bypasses the bottom level scheduler, which is
>  what DM wants.
> 
>  Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
>  Signed-off-by: Jens Axboe 
> 
>  ---
> 
>  diff --git a/block/blk-core.c b/block/blk-core.c
>  index deb56932f8c4..4c44e6fa0d08 100644
>  --- a/block/blk-core.c
>  +++ b/block/blk-core.c
>  @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct 
>  request_queue *q, struct request *
>    * bypass a potential scheduler on the bottom device for
>    * insert.
>    */
>  -return blk_mq_request_issue_directly(rq);
>  +blk_mq_request_bypass_insert(rq, true);
>  +return BLK_STS_OK;
>   }
>   
>   spin_lock_irqsave(q->queue_lock, flags);
> 
> >>> Not sure about this because it will break the merging promotion for 
> >>> request based DM
> >>> from Ming.
> >>> 396eaf21ee17c476e8f66249fb1f4a39003d0ab4
> >>> (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request 
> >>> feedback)
> >>>
> >>> We could use some other way to fix this.
> >>
> >> That really shouldn't matter as this is the cloned insert, merging should
> >> have been done on the original request.
> >>
> >>
> > Just quote some comments from the patch.
> > 
> > "
> >But dm-rq currently can't get the underlying queue's
> > dispatch feedback at all.  Without knowing whether a request was issued
> > or not (e.g. due to underlying queue being busy) the dm-rq elevator will
> > not be able to provide effective IO merging (as a side-effect of dm-rq
> > currently blindly destaging a request from its elevator only to requeue
> > it after a delay, which kills any opportunity for merging).  This
> > obviously causes very bad sequential IO performance.
> > ...
> > With this, request-based DM's blk-mq sequential IO performance is vastly
> > improved (as much as 3X in mpath/virtio-scsi testing)
> > "
> > 
> > Using blk_mq_request_bypass_insert to replace the 
> > blk_mq_request_issue_directly
> > could be a fast method to fix the current issue. Maybe we could get the 
> > merging
> > promotion back after some time.
> 
> This really sucks, mostly because DM wants to have it both ways - not use
> the bottom level IO scheduler, but still actually use it if it makes sense.

Well no, that isn't what DM is doing.  DM does have an upper layer
scheduler that would like to be afforded the same capabilities that any
request-based driver is given.  Yes that comes with plumbing in safe
passage for upper layer requests dispatched from a stacked blk-mq IO
scheduler.
 
> There is another way to fix this - still do the direct dispatch, but have
> dm track if it failed and do bypass insert in that case. I didn't want to
> do that since it's more involved, but it's doable.
> 
> Let me cook that up and test it... Don't like it, though.

Not following how DM can track if issuing the request worked if it is
always told it worked with BLK_STS_OK.  We care about feedback when the
request is actually issued because of the elaborate way blk-mq elevators
work.  DM is forced to worry about all these details, as covered some in
the header for commit 396eaf21ee17c476e8f66249fb1f4a39003d0ab4, it is
trying to have its cake and eat it too.  It just wants IO scheduling to
work for request-based DM devices.  That's it.

Mike


Re: block: fix direct dispatch issue failure for clones

2018-12-06 Thread Mike Snitzer
On Thu, Dec 06 2018 at  8:13pm -0500,
Jens Axboe  wrote:

> On 12/6/18 6:04 PM, jianchao.wang wrote:
> > 
> > 
> > On 12/7/18 6:20 AM, Jens Axboe wrote:
> >> After the direct dispatch corruption fix, we permanently disallow direct
> >> dispatch of non read/write requests. This works fine off the normal IO
> >> path, as they will be retried like any other failed direct dispatch
> >> request. But for the blk_insert_cloned_request() that only DM uses to
> >> bypass the bottom level scheduler, we always first attempt direct
> >> dispatch. For some types of requests, that's now a permanent failure,
> >> and no amount of retrying will make that succeed.
> >>
> >> Don't use direct dispatch off the cloned insert path, always just use
> >> bypass inserts. This still bypasses the bottom level scheduler, which is
> >> what DM wants.
> >>
> >> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
> >> Signed-off-by: Jens Axboe 
> >>
> >> ---
> >>
> >> diff --git a/block/blk-core.c b/block/blk-core.c
> >> index deb56932f8c4..4c44e6fa0d08 100644
> >> --- a/block/blk-core.c
> >> +++ b/block/blk-core.c
> >> @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct 
> >> request_queue *q, struct request *
> >> * bypass a potential scheduler on the bottom device for
> >> * insert.
> >> */
> >> -  return blk_mq_request_issue_directly(rq);
> >> +  blk_mq_request_bypass_insert(rq, true);
> >> +  return BLK_STS_OK;
> >>}
> >>  
> >>spin_lock_irqsave(q->queue_lock, flags);
> >>
> > Not sure about this because it will break the merging promotion for request 
> > based DM
> > from Ming.
> > 396eaf21ee17c476e8f66249fb1f4a39003d0ab4
> > (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request 
> > feedback)
> > 
> > We could use some other way to fix this.
> 
> That really shouldn't matter as this is the cloned insert, merging should
> have been done on the original request.

Reading the header of 396eaf21ee17c476e8f66249fb1f4a39003d0ab4 brings me
back to relatively recent hell.

Thing is, dm-rq was the original justification and consumer of
blk_mq_request_issue_directly -- but Ming's later use of directly issuing
requests has forced fixes that didn't consider the original valid/safe
use of the interface that is now too rigid.  dm-rq needs the
functionality the blk_mq_request_issue_directly interface provides.

Sorry to say we cannot lose the sequential IO performance improvements
that the IO merging feedback loop gives us.


Re: [PATCH] block: fix direct dispatch issue failure for clones

2018-12-06 Thread Jens Axboe
On 12/6/18 6:22 PM, jianchao.wang wrote:
> 
> 
> On 12/7/18 9:13 AM, Jens Axboe wrote:
>> On 12/6/18 6:04 PM, jianchao.wang wrote:
>>>
>>>
>>> On 12/7/18 6:20 AM, Jens Axboe wrote:
 After the direct dispatch corruption fix, we permanently disallow direct
 dispatch of non read/write requests. This works fine off the normal IO
 path, as they will be retried like any other failed direct dispatch
 request. But for the blk_insert_cloned_request() that only DM uses to
 bypass the bottom level scheduler, we always first attempt direct
 dispatch. For some types of requests, that's now a permanent failure,
 and no amount of retrying will make that succeed.

 Don't use direct dispatch off the cloned insert path, always just use
 bypass inserts. This still bypasses the bottom level scheduler, which is
 what DM wants.

 Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
 Signed-off-by: Jens Axboe 

 ---

 diff --git a/block/blk-core.c b/block/blk-core.c
 index deb56932f8c4..4c44e6fa0d08 100644
 --- a/block/blk-core.c
 +++ b/block/blk-core.c
 @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
 * bypass a potential scheduler on the bottom device for
 * insert.
 */
 -  return blk_mq_request_issue_directly(rq);
 +  blk_mq_request_bypass_insert(rq, true);
 +  return BLK_STS_OK;
}
  
spin_lock_irqsave(q->queue_lock, flags);

>>> Not sure about this because it will break the merging promotion for request 
>>> based DM
>>> from Ming.
>>> 396eaf21ee17c476e8f66249fb1f4a39003d0ab4
>>> (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request 
>>> feedback)
>>>
>>> We could use some other way to fix this.
>>
>> That really shouldn't matter as this is the cloned insert, merging should
>> have been done on the original request.
>>
>>
> Just quote some comments from the patch.
> 
> "
>But dm-rq currently can't get the underlying queue's
> dispatch feedback at all.  Without knowing whether a request was issued
> or not (e.g. due to underlying queue being busy) the dm-rq elevator will
> not be able to provide effective IO merging (as a side-effect of dm-rq
> currently blindly destaging a request from its elevator only to requeue
> it after a delay, which kills any opportunity for merging).  This
> obviously causes very bad sequential IO performance.
> ...
> With this, request-based DM's blk-mq sequential IO performance is vastly
> improved (as much as 3X in mpath/virtio-scsi testing)
> "
> 
> Using blk_mq_request_bypass_insert to replace the 
> blk_mq_request_issue_directly
> could be a fast method to fix the current issue. Maybe we could get the 
> merging
> promotion back after some time.

This really sucks, mostly because DM wants to have it both ways - not use
the bottom level IO scheduler, but still actually use it if it makes sense.

There is another way to fix this - still do the direct dispatch, but have
dm track if it failed and do bypass insert in that case. I didn't want to
do that since it's more involved, but it's doable.

Let me cook that up and test it... Don't like it, though.

-- 
Jens Axboe



Re: [PATCH] block: fix direct dispatch issue failure for clones

2018-12-06 Thread jianchao.wang



On 12/7/18 9:13 AM, Jens Axboe wrote:
> On 12/6/18 6:04 PM, jianchao.wang wrote:
>>
>>
>> On 12/7/18 6:20 AM, Jens Axboe wrote:
>>> After the direct dispatch corruption fix, we permanently disallow direct
>>> dispatch of non read/write requests. This works fine off the normal IO
>>> path, as they will be retried like any other failed direct dispatch
>>> request. But for the blk_insert_cloned_request() that only DM uses to
>>> bypass the bottom level scheduler, we always first attempt direct
>>> dispatch. For some types of requests, that's now a permanent failure,
>>> and no amount of retrying will make that succeed.
>>>
>>> Don't use direct dispatch off the cloned insert path, always just use
>>> bypass inserts. This still bypasses the bottom level scheduler, which is
>>> what DM wants.
>>>
>>> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
>>> Signed-off-by: Jens Axboe 
>>>
>>> ---
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index deb56932f8c4..4c44e6fa0d08 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
>>>  * bypass a potential scheduler on the bottom device for
>>>  * insert.
>>>  */
>>> -   return blk_mq_request_issue_directly(rq);
>>> +   blk_mq_request_bypass_insert(rq, true);
>>> +   return BLK_STS_OK;
>>> }
>>>  
>>> spin_lock_irqsave(q->queue_lock, flags);
>>>
>> Not sure about this because it will break the merging promotion for request 
>> based DM
>> from Ming.
>> 396eaf21ee17c476e8f66249fb1f4a39003d0ab4
>> (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request 
>> feedback)
>>
>> We could use some other way to fix this.
> 
> That really shouldn't matter as this is the cloned insert, merging should
> have been done on the original request.
> 
> 
Just quote some comments from the patch.

"
   But dm-rq currently can't get the underlying queue's
dispatch feedback at all.  Without knowing whether a request was issued
or not (e.g. due to underlying queue being busy) the dm-rq elevator will
not be able to provide effective IO merging (as a side-effect of dm-rq
currently blindly destaging a request from its elevator only to requeue
it after a delay, which kills any opportunity for merging).  This
obviously causes very bad sequential IO performance.
...
With this, request-based DM's blk-mq sequential IO performance is vastly
improved (as much as 3X in mpath/virtio-scsi testing)
"

Using blk_mq_request_bypass_insert to replace the blk_mq_request_issue_directly
could be a fast method to fix the current issue. Maybe we could get the merging
promotion back after some time.



Thanks
Jianchao



Re: block: fix direct dispatch issue failure for clones

2018-12-06 Thread Mike Snitzer
On Thu, Dec 06 2018 at  8:04pm -0500,
jianchao.wang  wrote:

> 
> 
> On 12/7/18 6:20 AM, Jens Axboe wrote:
> > After the direct dispatch corruption fix, we permanently disallow direct
> > dispatch of non read/write requests. This works fine off the normal IO
> > path, as they will be retried like any other failed direct dispatch
> > request. But for the blk_insert_cloned_request() that only DM uses to
> > bypass the bottom level scheduler, we always first attempt direct
> > dispatch. For some types of requests, that's now a permanent failure,
> > and no amount of retrying will make that succeed.
> > 
> > Don't use direct dispatch off the cloned insert path, always just use
> > bypass inserts. This still bypasses the bottom level scheduler, which is
> > what DM wants.
> > 
> > Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
> > Signed-off-by: Jens Axboe 
> > 
> > ---
> > 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index deb56932f8c4..4c44e6fa0d08 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
> >  * bypass a potential scheduler on the bottom device for
> >  * insert.
> >  */
> > -   return blk_mq_request_issue_directly(rq);
> > +   blk_mq_request_bypass_insert(rq, true);
> > +   return BLK_STS_OK;
> > }
> >  
> > spin_lock_irqsave(q->queue_lock, flags);
> > 
> Not sure about this because it will break the merging promotion for request 
> based DM
> from Ming.
> 396eaf21ee17c476e8f66249fb1f4a39003d0ab4
> (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request 
> feedback)

Ngh.. yeah, forgot about that convoluted feedback loop.

> We could use some other way to fix this.

Yeah, afraid we have to.


Re: [PATCH] block: fix direct dispatch issue failure for clones

2018-12-06 Thread Jens Axboe
On 12/6/18 6:04 PM, jianchao.wang wrote:
> 
> 
> On 12/7/18 6:20 AM, Jens Axboe wrote:
>> After the direct dispatch corruption fix, we permanently disallow direct
>> dispatch of non read/write requests. This works fine off the normal IO
>> path, as they will be retried like any other failed direct dispatch
>> request. But for the blk_insert_cloned_request() that only DM uses to
>> bypass the bottom level scheduler, we always first attempt direct
>> dispatch. For some types of requests, that's now a permanent failure,
>> and no amount of retrying will make that succeed.
>>
>> Don't use direct dispatch off the cloned insert path, always just use
>> bypass inserts. This still bypasses the bottom level scheduler, which is
>> what DM wants.
>>
>> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
>> Signed-off-by: Jens Axboe 
>>
>> ---
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index deb56932f8c4..4c44e6fa0d08 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
>>   * bypass a potential scheduler on the bottom device for
>>   * insert.
>>   */
>> -return blk_mq_request_issue_directly(rq);
>> +blk_mq_request_bypass_insert(rq, true);
>> +return BLK_STS_OK;
>>  }
>>  
>>  spin_lock_irqsave(q->queue_lock, flags);
>>
> Not sure about this because it will break the merging promotion for request 
> based DM
> from Ming.
> 396eaf21ee17c476e8f66249fb1f4a39003d0ab4
> (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request 
> feedback)
> 
> We could use some other way to fix this.

That really shouldn't matter as this is the cloned insert, merging should
have been done on the original request.


-- 
Jens Axboe



Re: [PATCH] block: fix direct dispatch issue failure for clones

2018-12-06 Thread jianchao.wang



On 12/7/18 6:20 AM, Jens Axboe wrote:
> After the direct dispatch corruption fix, we permanently disallow direct
> dispatch of non read/write requests. This works fine off the normal IO
> path, as they will be retried like any other failed direct dispatch
> request. But for the blk_insert_cloned_request() that only DM uses to
> bypass the bottom level scheduler, we always first attempt direct
> dispatch. For some types of requests, that's now a permanent failure,
> and no amount of retrying will make that succeed.
> 
> Don't use direct dispatch off the cloned insert path, always just use
> bypass inserts. This still bypasses the bottom level scheduler, which is
> what DM wants.
> 
> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
> Signed-off-by: Jens Axboe 
> 
> ---
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index deb56932f8c4..4c44e6fa0d08 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
>* bypass a potential scheduler on the bottom device for
>* insert.
>*/
> - return blk_mq_request_issue_directly(rq);
> + blk_mq_request_bypass_insert(rq, true);
> + return BLK_STS_OK;
>   }
>  
>   spin_lock_irqsave(q->queue_lock, flags);
> 
Not sure about this because it will break the merging promotion for request based DM
from Ming.
396eaf21ee17c476e8f66249fb1f4a39003d0ab4
(blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback)

We could use some other way to fix this.

Thanks
Jianchao


Re: [GIT PULL] Follow up block fix

2018-12-06 Thread Jens Axboe
On 12/6/18 5:31 PM, Bart Van Assche wrote:
> On Thu, 2018-12-06 at 17:26 -0700, Jens Axboe wrote:
>> On 12/6/18 5:20 PM, Bart Van Assche wrote:
>>> Which branch does that tag correspond to? Even after having run git fetch
>>> --tags I can't find that tag in your repository.
>>
>> I pushed it before I sent the email, where are you looking?
>>
>> http://git.kernel.dk/cgit/linux-block/tag/?h=for-linus-20181206
> 
> On my development system git has been configured to fetch from the following 
> repository:
> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/
> 
> Should I fetch your branches from the git.kernel.dk repository instead?

Ah I see, I always push to git.kernel.dk, and that syncs to git.kernel.org
only every hour or so. So either one will work, but you might be a bit
behind if you use git.kernel.org.

-- 
Jens Axboe



Re: [GIT PULL] Follow up block fix

2018-12-06 Thread Bart Van Assche
On Thu, 2018-12-06 at 17:26 -0700, Jens Axboe wrote:
> On 12/6/18 5:20 PM, Bart Van Assche wrote:
> > Which branch does that tag correspond to? Even after having run git fetch
> > --tags I can't find that tag in your repository.
> 
> I pushed it before I sent the email, where are you looking?
> 
> http://git.kernel.dk/cgit/linux-block/tag/?h=for-linus-20181206

On my development system git has been configured to fetch from the following 
repository:
https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/

Should I fetch your branches from the git.kernel.dk repository instead?

Thanks,

Bart.


Re: [GIT PULL] Follow up block fix

2018-12-06 Thread Jens Axboe
On 12/6/18 5:20 PM, Bart Van Assche wrote:
> On Thu, 2018-12-06 at 16:59 -0700, Jens Axboe wrote:
>> Just a single followup fix to the corruption fix from yesterday. We have
>> an exported interface that does direct dispatch, with DM being the sole
>> user of it. Change that to do bypass insert always instead of attempting
>> direct dispatch. This fixes a deadlock that can happen on DM.
>> Additionally, it gets rid of any exported user of direct dispatch, and
>> enables DM to drop any BUSY handling from using the interface.
>>
>> Please pull!
>>
>>
>>   git://git.kernel.dk/linux-block.git tags/for-linus-20181206
> 
> Hi Jens,
> 
> Which branch does that tag correspond to? Even after having run git fetch
> --tags I can't find that tag in your repository.

I pushed it before I sent the email, where are you looking?

http://git.kernel.dk/cgit/linux-block/tag/?h=for-linus-20181206

> Additionally, the patch on your for-linus branch is missing the "Cc:
> stable" and "Reported-by:" tags that you had promised to add. Did I
> look at the wrong branch?

Doh I did - Linus, I'm going to amend the commit with that info and push
it out again, jfyi. Now done.

-- 
Jens Axboe



Re: [GIT PULL] Follow up block fix

2018-12-06 Thread Bart Van Assche
On Thu, 2018-12-06 at 16:59 -0700, Jens Axboe wrote:
> Just a single followup fix to the corruption fix from yesterday. We have
> an exported interface that does direct dispatch, with DM being the sole
> user of it. Change that to do bypass insert always instead of attempting
> direct dispatch. This fixes a deadlock that can happen on DM.
> Additionally, it gets rid of any exported user of direct dispatch, and
> enables DM to drop any BUSY handling from using the interface.
> 
> Please pull!
> 
> 
>   git://git.kernel.dk/linux-block.git tags/for-linus-20181206

Hi Jens,

Which branch does that tag correspond to? Even after having run git fetch
--tags I can't find that tag in your repository. Additionally, the patch on
your for-linus branch is missing the "Cc: stable" and "Reported-by:" tags
that you had promised to add. Did I look at the wrong branch?

Thanks,

Bart.


[GIT PULL] Follow up block fix

2018-12-06 Thread Jens Axboe
Hi Linus,

Just a single followup fix to the corruption fix from yesterday. We have
an exported interface that does direct dispatch, with DM being the sole
user of it. Change that to do bypass insert always instead of attempting
direct dispatch. This fixes a deadlock that can happen on DM.
Additionally, it gets rid of any exported user of direct dispatch, and
enables DM to drop any BUSY handling from using the interface.

Please pull!


  git://git.kernel.dk/linux-block.git tags/for-linus-20181206



Jens Axboe (1):
  block: fix direct dispatch issue failure for clones

 block/blk-core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

-- 
Jens Axboe



Re: [PATCH] block: fix direct dispatch issue failure for clones

2018-12-06 Thread Bart Van Assche
On Thu, 2018-12-06 at 15:21 -0700, Jens Axboe wrote:
> On 12/6/18 3:20 PM, Jens Axboe wrote:
> > After the direct dispatch corruption fix, we permanently disallow direct
> > dispatch of non read/write requests. This works fine off the normal IO
> > path, as they will be retried like any other failed direct dispatch
> > request. But for the blk_insert_cloned_request() that only DM uses to
> > bypass the bottom level scheduler, we always first attempt direct
> > dispatch. For some types of requests, that's now a permanent failure,
> > and no amount of retrying will make that succeed.
> > 
> > Don't use direct dispatch off the cloned insert path, always just use
> > bypass inserts. This still bypasses the bottom level scheduler, which is
> > what DM wants.
> > 
> > Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
> > Signed-off-by: Jens Axboe 
> 
> Bart, I'll add your reported-by here of course, and also a stable CC
> since the original patch went into stable.

Feel free to add the following:

Tested-by: Bart Van Assche 



Re: block: fix direct dispatch issue failure for clones

2018-12-06 Thread Jens Axboe
On 12/6/18 3:28 PM, Mike Snitzer wrote:
> On Thu, Dec 06 2018 at  5:20pm -0500,
> Jens Axboe  wrote:
> 
>> After the direct dispatch corruption fix, we permanently disallow direct
>> dispatch of non read/write requests. This works fine off the normal IO
>> path, as they will be retried like any other failed direct dispatch
>> request. But for the blk_insert_cloned_request() that only DM uses to
>> bypass the bottom level scheduler, we always first attempt direct
>> dispatch. For some types of requests, that's now a permanent failure,
>> and no amount of retrying will make that succeed.
>>
>> Don't use direct dispatch off the cloned insert path, always just use
>> bypass inserts. This still bypasses the bottom level scheduler, which is
>> what DM wants.
>>
>> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
>> Signed-off-by: Jens Axboe 
>>
>> ---
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index deb56932f8c4..4c44e6fa0d08 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
>>   * bypass a potential scheduler on the bottom device for
>>   * insert.
>>   */
>> -return blk_mq_request_issue_directly(rq);
>> +blk_mq_request_bypass_insert(rq, true);
>> +return BLK_STS_OK;
>>  }
>>  
>>  spin_lock_irqsave(q->queue_lock, flags);
> 
> Not sure what this trailing spin_lock_irqsave(q->queue_lock, flags) is
> about.. but this looks good.

It's because it's against current -git; that is gone from the 4.21 branch.


> I'll clean up dm-rq.c to do away with the
> extra STS_RESOURCE checks for its call to blk_insert_cloned_request()
> once this lands.

That is indeed a nice benefit; all source-based failure cases can be
removed from the caller after this.

-- 
Jens Axboe



Re: block: fix direct dispatch issue failure for clones

2018-12-06 Thread Mike Snitzer
On Thu, Dec 06 2018 at  5:20pm -0500,
Jens Axboe  wrote:

> After the direct dispatch corruption fix, we permanently disallow direct
> dispatch of non read/write requests. This works fine off the normal IO
> path, as they will be retried like any other failed direct dispatch
> request. But for the blk_insert_cloned_request() that only DM uses to
> bypass the bottom level scheduler, we always first attempt direct
> dispatch. For some types of requests, that's now a permanent failure,
> and no amount of retrying will make that succeed.
> 
> Don't use direct dispatch off the cloned insert path, always just use
> bypass inserts. This still bypasses the bottom level scheduler, which is
> what DM wants.
> 
> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
> Signed-off-by: Jens Axboe 
> 
> ---
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index deb56932f8c4..4c44e6fa0d08 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
>* bypass a potential scheduler on the bottom device for
>* insert.
>*/
> - return blk_mq_request_issue_directly(rq);
> + blk_mq_request_bypass_insert(rq, true);
> + return BLK_STS_OK;
>   }
>  
>   spin_lock_irqsave(q->queue_lock, flags);

Not sure what this trailing spin_lock_irqsave(q->queue_lock, flags) is
about.. but this looks good.  I'll clean up dm-rq.c to do away with the
extra STS_RESOURCE checks for its call to blk_insert_cloned_request()
once this lands.

Acked-by: Mike Snitzer 

Thanks.


Re: [PATCH] block: fix direct dispatch issue failure for clones

2018-12-06 Thread Jens Axboe
On 12/6/18 3:20 PM, Jens Axboe wrote:
> After the direct dispatch corruption fix, we permanently disallow direct
> dispatch of non read/write requests. This works fine off the normal IO
> path, as they will be retried like any other failed direct dispatch
> request. But for the blk_insert_cloned_request() that only DM uses to
> bypass the bottom level scheduler, we always first attempt direct
> dispatch. For some types of requests, that's now a permanent failure,
> and no amount of retrying will make that succeed.
> 
> Don't use direct dispatch off the cloned insert path, always just use
> bypass inserts. This still bypasses the bottom level scheduler, which is
> what DM wants.
> 
> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
> Signed-off-by: Jens Axboe 

Bart, I'll add your reported-by here of course, and also a stable CC
since the original patch went into stable.

-- 
Jens Axboe



[PATCH] block: fix direct dispatch issue failure for clones

2018-12-06 Thread Jens Axboe
After the direct dispatch corruption fix, we permanently disallow direct
dispatch of non read/write requests. This works fine off the normal IO
path, as they will be retried like any other failed direct dispatch
request. But for the blk_insert_cloned_request() that only DM uses to
bypass the bottom level scheduler, we always first attempt direct
dispatch. For some types of requests, that's now a permanent failure,
and no amount of retrying will make that succeed.

Don't use direct dispatch off the cloned insert path, always just use
bypass inserts. This still bypasses the bottom level scheduler, which is
what DM wants.

Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
Signed-off-by: Jens Axboe 

---

diff --git a/block/blk-core.c b/block/blk-core.c
index deb56932f8c4..4c44e6fa0d08 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
 * bypass a potential scheduler on the bottom device for
 * insert.
 */
-   return blk_mq_request_issue_directly(rq);
+   blk_mq_request_bypass_insert(rq, true);
+   return BLK_STS_OK;
}
 
spin_lock_irqsave(q->queue_lock, flags);

-- 
Jens Axboe



Re: for-next branch and blktests/srp

2018-12-06 Thread Jens Axboe
On 12/6/18 2:04 PM, Jens Axboe wrote:
> On 12/6/18 1:56 PM, Bart Van Assche wrote:
>> On Thu, 2018-12-06 at 08:47 -0800, Bart Van Assche wrote:
>>> If I merge Jens' for-next branch with Linus' master branch, boot the
>>> resulting kernel in a VM and run blktests/tests/srp/002 then that test
>>> never finishes. The same test passes against Linus' master branch. I
>>> think this is a regression. The following appears in the system log if
>>> I run that test:
>>>
>>> Call Trace:
>>> INFO: task kworker/0:1:12 blocked for more than 120 seconds.
>>> Call Trace:
>>> INFO: task ext4lazyinit:2079 blocked for more than 120 seconds.
>>> Call Trace:
>>> INFO: task fio:2151 blocked for more than 120 seconds.
>>> Call Trace:
>>> INFO: task fio:2154 blocked for more than 120 seconds.
>>
>> Hi Jens,
>>
>> My test results so far are as follows:
>> * With kernel v4.20-rc5 test srp/002 passes.
>> * With your for-next branch test srp/002 reports the symptoms reported in my 
>> e-mail.
>> * With Linus' master branch from this morning test srp/002 fails in the same 
>> way as
>>   your for-next branch.
>> * Also with Linus' master branch, test srp/002 passes if I revert the 
>> following commit:
>>   ffe81d45322c ("blk-mq: fix corruption with direct issue"). So it seems 
>> like that
>>   commit fixed one regression but introduced another regression.
> 
> Yes, I'm on the same page; I've been able to reproduce. It seems to be related
> to dm and bypass insert, which is somewhat odd. If I just do:
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index deb56932f8c4..4c44e6fa0d08 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
>* bypass a potential scheduler on the bottom device for
>* insert.
>*/
> - return blk_mq_request_issue_directly(rq);
> + blk_mq_request_bypass_insert(rq, true);
> + return BLK_STS_OK;
>   }
>  
>   spin_lock_irqsave(q->queue_lock, flags);
> 
> it works fine. Well, at least this regression is less serious; I'll bang
> out a fix for it and ensure we make -rc6. I'm guessing it's the bypassing
> of non-read/write, which your top of dispatch also shows to be a
> non-read/write. But there should be no new failure case here that wasn't
> possible before, only it's easier to hit now.

OK, so here's the thing. As part of the corruption fix, we disallowed
direct dispatch for anything that wasn't a read or write. This means
that your WRITE_ZEROES will always fail direct dispatch. When it does,
we return busy. But next time dm will try the exact same thing again,
blk_insert_cloned_request() -> direct dispatch -> fail. Before we'd succeed
eventually, now we will always fail for that type.

The insert clone path is unique in that regard.
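
Spelled out as a toy userspace model (hypothetical names; the retry count is
capped here so the loop terminates, which it would not in the kernel):

/* Toy model of the hang: direct dispatch now permanently rejects
 * non-read/write requests, yet the cloned-insert path retries the
 * same direct dispatch on every requeue. */
#include <stdio.h>

enum op { OP_WRITE, OP_WRITE_ZEROES };
enum status { STS_OK, STS_BUSY };

static enum status direct_dispatch(enum op op)
{
	/* After the corruption fix, only reads/writes may issue directly. */
	return op == OP_WRITE ? STS_OK : STS_BUSY;
}

int main(void)
{
	enum op op = OP_WRITE_ZEROES;
	int tries = 0;

	while (direct_dispatch(op) == STS_BUSY && tries < 5)
		tries++;		/* dm requeues and tries again... */

	printf("%d retries, %s\n", tries,
	       tries == 5 ? "still busy: in the kernel this never ends"
			  : "dispatched");
	return 0;
}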

So we have two options - the patch I did above which always just does
bypass insert for DM, or we need to mark the request as having failed
and just not retry direct dispatch for it. I'm still not crazy about
exploring the dispatch insert off this path.

And we'd need to do this on the original request in dm, not the clone
we are passed or it won't be persistent. Hence I lean towards the
already posted patch.

-- 
Jens Axboe



Re: for-next branch and blktests/srp

2018-12-06 Thread Jens Axboe
On 12/6/18 1:56 PM, Bart Van Assche wrote:
> On Thu, 2018-12-06 at 08:47 -0800, Bart Van Assche wrote:
>> If I merge Jens' for-next branch with Linus' master branch, boot the
>> resulting kernel in a VM and run blktests/tests/srp/002 then that test
>> never finishes. The same test passes against Linus' master branch. I
>> think this is a regression. The following appears in the system log if
>> I run that test:
>>
>> Call Trace:
>> INFO: task kworker/0:1:12 blocked for more than 120 seconds.
>> Call Trace:
>> INFO: task ext4lazyinit:2079 blocked for more than 120 seconds.
>> Call Trace:
>> INFO: task fio:2151 blocked for more than 120 seconds.
>> Call Trace:
>> INFO: task fio:2154 blocked for more than 120 seconds.
> 
> Hi Jens,
> 
> My test results so far are as follows:
> * With kernel v4.20-rc5 test srp/002 passes.
> * With your for-next branch test srp/002 reports the symptoms reported in my 
> e-mail.
> * With Linus' master branch from this morning test srp/002 fails in the same 
> way as
>   your for-next branch.
> * Also with Linus' master branch, test srp/002 passes if I revert the 
> following commit:
>   ffe81d45322c ("blk-mq: fix corruption with direct issue"). So it seems like 
> that
>   commit fixed one regression but introduced another regression.

Yes, I'm on the same page; I've been able to reproduce. It seems to be related
to dm and bypass insert, which is somewhat odd. If I just do:

diff --git a/block/blk-core.c b/block/blk-core.c
index deb56932f8c4..4c44e6fa0d08 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
 * bypass a potential scheduler on the bottom device for
 * insert.
 */
-   return blk_mq_request_issue_directly(rq);
+   blk_mq_request_bypass_insert(rq, true);
+   return BLK_STS_OK;
}
 
spin_lock_irqsave(q->queue_lock, flags);

it works fine. Well, at least this regression is less serious; I'll bang
out a fix for it and ensure we make -rc6. I'm guessing it's the bypassing
of non-read/write, which your top of dispatch also shows to be a
non-read/write. But there should be no new failure case here that wasn't
possible before, only it's easier to hit now.

-- 
Jens Axboe



Re: for-next branch and blktests/srp

2018-12-06 Thread Bart Van Assche
On Thu, 2018-12-06 at 08:47 -0800, Bart Van Assche wrote:
> If I merge Jens' for-next branch with Linus' master branch, boot the
> resulting kernel in a VM and run blktests/tests/srp/002 then that test
> never finishes. The same test passes against Linus' master branch. I
> think this is a regression. The following appears in the system log if
> I run that test:
> 
> Call Trace:
> INFO: task kworker/0:1:12 blocked for more than 120 seconds.
> Call Trace:
> INFO: task ext4lazyinit:2079 blocked for more than 120 seconds.
> Call Trace:
> INFO: task fio:2151 blocked for more than 120 seconds.
> Call Trace:
> INFO: task fio:2154 blocked for more than 120 seconds.

Hi Jens,

My test results so far are as follows:
* With kernel v4.20-rc5 test srp/002 passes.
* With your for-next branch test srp/002 reports the symptoms reported in my 
e-mail.
* With Linus' master branch from this morning test srp/002 fails in the same 
way as
  your for-next branch.
* Also with Linus' master branch, test srp/002 passes if I revert the following 
commit:
  ffe81d45322c ("blk-mq: fix corruption with direct issue"). So it seems like 
that
  commit fixed one regression but introduced another regression.

Bart.


Re: [PATCH 4/4] block: expose devt for GENHD_FL_HIDDEN disks

2018-12-06 Thread Christoph Hellwig
On Thu, Dec 06, 2018 at 02:48:12PM -0200, Thadeu Lima de Souza Cascardo wrote:
> Without this exposure, lsblk will fail as it tries to find out the
> device's dev_t numbers. This causes a real problem for nvme multipath
> devices, as their slaves are hidden.
> 
> Exposing them fixes the problem, even though trying to open the devices
> returns an error in the case of nvme multipath. So, right now, it's the
> driver's responsibility to return a failure to open hidden devices.

So the problem with this is that it will cause udev to actually create
the /dev/nvmeXcYnZ nodes, which due to the previous patch will always
fail to open, which is a bit confusing.  I guess we could live with this
if we add udev rules to suppress the creation or something, but in general
it is a bit ugly.


Re: [PATCH 08/26] aio: don't zero entire aio_kiocb aio_get_req()

2018-12-06 Thread Jens Axboe
On 12/6/18 12:38 PM, Jeff Moyer wrote:
> Jens Axboe  writes:
> 
>> On 12/6/18 12:27 PM, Jeff Moyer wrote:
>>> Jens Axboe  writes:
>>>
 It's 192 bytes, fairly substantial. Most items don't need to be cleared,
 especially not upfront. Clear the ones we do need to clear, and leave
 the other ones for setup when the iocb is prepared and submitted.
>>>
>>> What performance gains do you see from this?
>>
>> Before this, I had 1% in memset doing high IOPS. With it, that's gone.
>> 1% is a lot, when you have just one thread doing everything from submission
>> to completion.
> 
> I'm used to customers complaining about fractions of a percent, so I get
> it.  :-)  I just wanted to know we had some measurable impact, as I've
> seen bugs crop up from code like this in the past.

Oh for sure, I wouldn't do it if there wasn't a noticeable gain from this!
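
For a rough feel of where that 1% goes, here is a userspace sketch comparing a
full 192-byte clear per allocation against touching only a few fields. Build
with -O0 so the loops aren't optimized away; the numbers are machine-dependent
and purely illustrative:

/* Userspace sketch only: contrasts zeroing a 192-byte object with
 * initializing a handful of fields, as the patch does for aio_kiocb. */
#include <stdio.h>
#include <string.h>
#include <time.h>

struct fake_kiocb { char blob[192]; };

int main(void)
{
	static struct fake_kiocb req;
	const long iters = 100000000L;
	clock_t t0, t1, t2;
	long i;

	t0 = clock();
	for (i = 0; i < iters; i++)
		memset(&req, 0, sizeof(req));	/* old: __GFP_ZERO behavior */
	t1 = clock();
	for (i = 0; i < iters; i++) {
		req.blob[0] = 0;		/* new: only fields that need it */
		req.blob[8] = 0;
		req.blob[16] = 0;
	}
	t2 = clock();

	printf("full clear: %.2fs, field init: %.2fs\n",
	       (double)(t1 - t0) / CLOCKS_PER_SEC,
	       (double)(t2 - t1) / CLOCKS_PER_SEC);
	return 0;
}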


-- 
Jens Axboe



Re: [PATCH 08/26] aio: don't zero entire aio_kiocb aio_get_req()

2018-12-06 Thread Jeff Moyer
Jens Axboe  writes:

> On 12/6/18 12:27 PM, Jeff Moyer wrote:
>> Jens Axboe  writes:
>> 
>>> It's 192 bytes, fairly substantial. Most items don't need to be cleared,
>>> especially not upfront. Clear the ones we do need to clear, and leave
>>> the other ones for setup when the iocb is prepared and submitted.
>> 
>> What performance gains do you see from this?
>
> Before this, I had 1% in memset doing high IOPS. With it, that's gone.
> 1% is a lot, when you have just one thread doing everything from submission
> to completion.

I'm used to customers complaining about fractions of a percent, so I get
it.  :-)  I just wanted to know we had some measurable impact, as I've
seen bugs crop up from code like this in the past.

Thanks!
Jeff


Re: [PATCH 09/26] aio: only use blk plugs for > 2 depth submissions

2018-12-06 Thread Jens Axboe
On 12/6/18 12:29 PM, Jeff Moyer wrote:
> Jens Axboe  writes:
> 
>> Plugging is meant to optimize submission of a string of IOs; if we don't
>> have more than 2 being submitted, don't bother setting up a plug.
> 
> Is there really that much overhead in blk_{start|finish}_plug?

It's not just the overhead of the functions themselves; it's the
pointless work we end up doing for no reason at all. Plugging is, by
definition, meant to help batch things up, amortizing the cost of each
individual IO if you have a batch of them. If you don't have a batch,
then there's no point. Normally callers of blk_start_plug() don't
necessarily know how much IO is coming, but in this case we clearly do.
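
Condensed, the pattern this introduces in io_submit() looks roughly like the
fragment below (based on the patch quoted in full later in this thread;
surrounding declarations elided, so treat it as a sketch rather than the
verbatim result):

	struct blk_plug plug;

	if (nr > AIO_PLUG_THRESHOLD)
		blk_start_plug(&plug);
	for (i = 0; i < nr; i++) {
		/* plugged IO from each iocb is batched, then flushed below */
		ret = io_submit_one(ctx, user_iocb, false);
		if (ret)
			break;
	}
	if (nr > AIO_PLUG_THRESHOLD)
		blk_finish_plug(&plug);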

-- 
Jens Axboe



Re: [PATCH 09/26] aio: only use blk plugs for > 2 depth submissions

2018-12-06 Thread Jeff Moyer
Jens Axboe  writes:

> Plugging is meant to optimize submission of a string of IOs; if we don't
> have more than 2 being submitted, don't bother setting up a plug.

Is there really that much overhead in blk_{start|finish}_plug?
-Jeff

>
> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Jens Axboe 
> ---
>  fs/aio.c | 18 ++
>  1 file changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index 522c04864d82..ed6c3914477a 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -69,6 +69,12 @@ struct aio_ring {
>   struct io_event io_events[0];
>  }; /* 128 bytes + ring size */
>  
> +/*
> + * Plugging is meant to work with larger batches of IOs. If we don't
> + * have more than the below, then don't bother setting up a plug.
> + */
> +#define AIO_PLUG_THRESHOLD   2
> +
>  #define AIO_RING_PAGES   8
>  
>  struct kioctx_table {
> @@ -1919,7 +1925,8 @@ SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, nr,
>   if (nr > ctx->nr_events)
>   nr = ctx->nr_events;
>  
> - blk_start_plug(&plug);
> + if (nr > AIO_PLUG_THRESHOLD)
> + blk_start_plug(&plug);
>   for (i = 0; i < nr; i++) {
>   struct iocb __user *user_iocb;
>  
> @@ -1932,7 +1939,8 @@ SYSCALL_DEFINE3(io_submit, aio_context_t, ctx_id, long, nr,
>   if (ret)
>   break;
>   }
> - blk_finish_plug(&plug);
> + if (nr > AIO_PLUG_THRESHOLD)
> + blk_finish_plug(&plug);
>  
>   percpu_ref_put(&ctx->users);
>   return i ? i : ret;
> @@ -1959,7 +1967,8 @@ COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, ctx_id,
>   if (nr > ctx->nr_events)
>   nr = ctx->nr_events;
>  
> - blk_start_plug(&plug);
> + if (nr > AIO_PLUG_THRESHOLD)
> + blk_start_plug(&plug);
>   for (i = 0; i < nr; i++) {
>   compat_uptr_t user_iocb;
>  
> @@ -1972,7 +1981,8 @@ COMPAT_SYSCALL_DEFINE3(io_submit, compat_aio_context_t, ctx_id,
>   if (ret)
>   break;
>   }
> - blk_finish_plug(&plug);
> + if (nr > AIO_PLUG_THRESHOLD)
> + blk_finish_plug(&plug);
>  
>   percpu_ref_put(&ctx->users);
>   return i ? i : ret;


Re: [PATCH 08/26] aio: don't zero entire aio_kiocb aio_get_req()

2018-12-06 Thread Jens Axboe
On 12/6/18 12:27 PM, Jeff Moyer wrote:
> Jens Axboe  writes:
> 
>> It's 192 bytes, fairly substantial. Most items don't need to be cleared,
>> especially not upfront. Clear the ones we do need to clear, and leave
>> the other ones for setup when the iocb is prepared and submitted.
> 
> What performance gains do you see from this?

Before this, I had 1% in memset doing high IOPS. With it, that's gone.
1% is a lot, when you have just one thread doing everything from submission
to completion.

-- 
Jens Axboe



Re: [PATCH 08/26] aio: don't zero entire aio_kiocb aio_get_req()

2018-12-06 Thread Jeff Moyer
Jens Axboe  writes:

> It's 192 bytes, fairly substantial. Most items don't need to be cleared,
> especially not upfront. Clear the ones we do need to clear, and leave
> the other ones for setup when the iocb is prepared and submitted.

What performance gains do you see from this?

-Jeff

> Reviewed-by: Christoph Hellwig 
> Signed-off-by: Jens Axboe 
> ---
>  fs/aio.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/fs/aio.c b/fs/aio.c
> index eaceb40e6cf5..522c04864d82 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -1009,14 +1009,15 @@ static inline struct aio_kiocb *aio_get_req(struct kioctx *ctx)
>  {
>   struct aio_kiocb *req;
>  
> - req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL|__GFP_ZERO);
> + req = kmem_cache_alloc(kiocb_cachep, GFP_KERNEL);
>   if (unlikely(!req))
>   return NULL;
>  
>   percpu_ref_get(&ctx->reqs);
> + req->ki_ctx = ctx;
>   INIT_LIST_HEAD(&req->ki_list);
>   refcount_set(&req->ki_refcnt, 0);
> - req->ki_ctx = ctx;
> + req->ki_eventfd = NULL;
>   return req;
>  }
>  
> @@ -1730,6 +1731,10 @@ static ssize_t aio_poll(struct aio_kiocb *aiocb, struct iocb *iocb)
>   if (unlikely(!req->file))
>   return -EBADF;
>  
> + req->head = NULL;
> + req->woken = false;
> + req->cancelled = false;
> +
>   apt.pt._qproc = aio_poll_queue_proc;
>   apt.pt._key = req->events;
>   apt.iocb = aiocb;


Re: [PATCH 03/26] block: wire up block device iopoll method

2018-12-06 Thread Jens Axboe
On 12/6/18 12:15 PM, Christoph Hellwig wrote:
> On Thu, Dec 06, 2018 at 12:14:29PM -0700, Jens Axboe wrote:
>> On 12/6/18 12:11 PM, Jeff Moyer wrote:
>>> Jens Axboe  writes:
>>>
 From: Christoph Hellwig 

 Just call blk_poll on the iocb cookie, we can derive the block device
 from the inode trivially.
>>>
>>> Does this work for multi-device file systems?
>>
>> It should; that's the whole purpose of having fops->iopoll. You should
>> be able to derive it from inode + cookie.
> 
> It still assumes the whole I/O will go to a single device.  For
> XFS this is always true, but for btrfs this might mean I/O could
> need splitting if ->iopoll were to be supported there.

Right, we can only support one target for one particular piece of IO,
but multi-device works fine in general.

-- 
Jens Axboe



Re: [PATCH 01/26] fs: add an iopoll method to struct file_operations

2018-12-06 Thread Jeff Moyer
Jens Axboe  writes:

> From: Christoph Hellwig 
>
> This new method is used to explicitly poll for I/O completion for an
> iocb.  It must be called for any iocb submitted asynchronously (that
> is with a non-null ki_complete) which has the IOCB_HIPRI flag set.
>
> The method is assisted by a new ki_cookie field in struct iocb to store
> the polling cookie.
>
> TODO: we can probably union ki_cookie with the existing hint and I/O
> priority fields to avoid struct kiocb growth.

Please document return values.

Thanks,
Jeff

>
> Reviewed-by: Johannes Thumshirn 
> Signed-off-by: Christoph Hellwig 
> Signed-off-by: Jens Axboe 
> ---
>  Documentation/filesystems/vfs.txt | 3 +++
>  include/linux/fs.h| 2 ++
>  2 files changed, 5 insertions(+)
>
> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> index 5f71a252e2e0..d9dc5e4d82b9 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -857,6 +857,7 @@ struct file_operations {
>   ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
>   ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
>   ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
> + int (*iopoll)(struct kiocb *kiocb, bool spin);
>   int (*iterate) (struct file *, struct dir_context *);
>   int (*iterate_shared) (struct file *, struct dir_context *);
>   __poll_t (*poll) (struct file *, struct poll_table_struct *);
> @@ -902,6 +903,8 @@ otherwise noted.
>  
>write_iter: possibly asynchronous write with iov_iter as source
>  
> +  iopoll: called when aio wants to poll for completions on HIPRI iocbs
> +
>iterate: called when the VFS needs to read the directory contents
>  
>iterate_shared: called when the VFS needs to read the directory contents
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index a1ab233e6469..6a5f71f8ae06 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -310,6 +310,7 @@ struct kiocb {
>   int ki_flags;
>   u16 ki_hint;
>   u16 ki_ioprio; /* See linux/ioprio.h */
> + unsigned int ki_cookie; /* for ->iopoll */
>  } __randomize_layout;
>  
>  static inline bool is_sync_kiocb(struct kiocb *kiocb)
> @@ -1781,6 +1782,7 @@ struct file_operations {
>   ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
>   ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
>   ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
> + int (*iopoll)(struct kiocb *kiocb, bool spin);
>   int (*iterate) (struct file *, struct dir_context *);
>   int (*iterate_shared) (struct file *, struct dir_context *);
>   __poll_t (*poll) (struct file *, struct poll_table_struct *);


Re: [PATCH 03/26] block: wire up block device iopoll method

2018-12-06 Thread Christoph Hellwig
On Thu, Dec 06, 2018 at 12:14:29PM -0700, Jens Axboe wrote:
> On 12/6/18 12:11 PM, Jeff Moyer wrote:
> > Jens Axboe  writes:
> > 
> >> From: Christoph Hellwig 
> >>
> >> Just call blk_poll on the iocb cookie, we can derive the block device
> >> from the inode trivially.
> > 
> > Does this work for multi-device file systems?
> 
> It should, that's the whole purpose of having fops->iopoll. You should
> be able to derive it from inode + cookie.

It still assumes the whole I/O will go to a single device.  For
XFS this is always true, but for btrfs this might mean I/O could
need splitting if ->iopoll were to be supported there.


Re: [PATCH 03/26] block: wire up block device iopoll method

2018-12-06 Thread Jens Axboe
On 12/6/18 12:11 PM, Jeff Moyer wrote:
> Jens Axboe  writes:
> 
>> From: Christoph Hellwig 
>>
>> Just call blk_poll on the iocb cookie, we can derive the block device
>> from the inode trivially.
> 
> Does this work for multi-device file systems?

It should, that's the whole purpose of having fops->iopoll. You should
be able to derive it from inode + cookie.

-- 
Jens Axboe



Re: [PATCH 03/26] block: wire up block device iopoll method

2018-12-06 Thread Jeff Moyer
Jens Axboe  writes:

> From: Christoph Hellwig 
>
> Just call blk_poll on the iocb cookie, we can derive the block device
> from the inode trivially.

Does this work for multi-device file systems?

-Jeff

>
> Reviewed-by: Johannes Thumshirn 
> Signed-off-by: Christoph Hellwig 
> Signed-off-by: Jens Axboe 
> ---
>  fs/block_dev.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index e1886cc7048f..6de8d35f6e41 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -281,6 +281,14 @@ struct blkdev_dio {
>  
>  static struct bio_set blkdev_dio_pool;
>  
> +static int blkdev_iopoll(struct kiocb *kiocb, bool wait)
> +{
> + struct block_device *bdev = I_BDEV(kiocb->ki_filp->f_mapping->host);
> + struct request_queue *q = bdev_get_queue(bdev);
> +
> + return blk_poll(q, READ_ONCE(kiocb->ki_cookie), wait);
> +}
> +
>  static void blkdev_bio_end_io(struct bio *bio)
>  {
>   struct blkdev_dio *dio = bio->bi_private;
> @@ -398,6 +406,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
> *iter, int nr_pages)
>   bio->bi_opf |= REQ_HIPRI;
>  
>   qc = submit_bio(bio);
> + WRITE_ONCE(iocb->ki_cookie, qc);
>   break;
>   }
>  
> @@ -2070,6 +2079,7 @@ const struct file_operations def_blk_fops = {
>   .llseek = block_llseek,
>   .read_iter  = blkdev_read_iter,
>   .write_iter = blkdev_write_iter,
> + .iopoll = blkdev_iopoll,
>   .mmap   = generic_file_mmap,
>   .fsync  = blkdev_fsync,
>   .unlocked_ioctl = block_ioctl,


Re: for-next branch and blktests/srp

2018-12-06 Thread Bart Van Assche
On Thu, 2018-12-06 at 11:12 -0700, Jens Axboe wrote:
> On 12/6/18 11:10 AM, Bart Van Assche wrote:
> > On Thu, 2018-12-06 at 10:02 -0800, Bart Van Assche wrote:
> > > On Thu, 2018-12-06 at 10:48 -0700, Jens Axboe wrote:
> > > > which would result in a non-zero exit, which should be expected for this
> > > > test?
> > > 
> > > Test srp/002 simulates network failures while running fio on top of 
> > > dm-mpath.
> > > Since queue_if_no_path is enabled in multipath.conf dm-mpath will keep 
> > > retrying
> > > the block layer requests it receives if these requests fail due to path 
> > > removal.
> > > If fio reports "io_u error" for this test then that means that the test 
> > > failed.
> > > I haven't seen fio reporting any I/O errors for this test with any 
> > > upstream
> > > kernel that I tested in the past year or so.
> > 
> > Hi Jens,
> > 
> > Please also verify that the version of multipath-tools that you are running 
> > includes
> > the following patch:
> 
> I installed from apt... Says this:
> 
> # apt show multipath-tools
> Package: multipath-tools
> Version: 0.6.4-5+deb9u1
> 
> I can run a self-compiled one, if this one doesn't work.

Please do. I use the following script to build and install multipath-tools
over the Debian binaries:

export LIB=/lib
apt-get install -y libaio-dev libdevmapper-dev libjson-c-dev librados-dev 
libreadline-dev libsystemd-dev liburcu-dev
make -s 
make -s install

Bart.


Re: for-next branch and blktests/srp

2018-12-06 Thread Bart Van Assche
On Thu, 2018-12-06 at 10:02 -0800, Bart Van Assche wrote:
> On Thu, 2018-12-06 at 10:48 -0700, Jens Axboe wrote:
> > which would result in a non-zero exit, which should be expected for this
> > test?
> 
> Test srp/002 simulates network failures while running fio on top of dm-mpath.
> Since queue_if_no_path is enabled in multipath.conf dm-mpath will keep 
> retrying
> the block layer requests it receives if these requests fail due to path 
> removal.
> If fio reports "io_u error" for this test then that means that the test 
> failed.
> I haven't seen fio reporting any I/O errors for this test with any upstream
> kernel that I tested in the past year or so.

Hi Jens,

Please also verify that the version of multipath-tools that you are running 
includes
the following patch:

From a2675025ae9f652b005345b9082f5042b32c992c Mon Sep 17 00:00:00 2001
From: Bart Van Assche 
Date: Fri, 18 Nov 2016 13:33:44 -0800
Subject: [PATCH] Avoid that reloading a map sporadically triggers I/O errors

Avoid that reloading a map while there are no paths triggers a flush
and hence unwanted I/O errors if 'queue_if_no_path' is enabled.

Fixes: commit d569988e7528 ("libmultipath: Fixup 'DM_DEVICE_RELOAD' handling")
Signed-off-by: Bart Van Assche 
---
 libmultipath/devmapper.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libmultipath/devmapper.c b/libmultipath/devmapper.c
index 9b6b0537ba6d..1576dd017d6a 100644
--- a/libmultipath/devmapper.c
+++ b/libmultipath/devmapper.c
@@ -392,7 +392,7 @@ int dm_addmap_reload(struct multipath *mpp, char *params, int flush)
  params, ADDMAP_RO, SKIP_KPARTX_OFF);
}
if (r)
-   r = dm_simplecmd(DM_DEVICE_RESUME, mpp->alias, flush,
+   r = dm_simplecmd(DM_DEVICE_RESUME, mpp->alias, !flush,
 1, udev_flags, 0);
return r;
 }



Re: for-next branch and blktests/srp

2018-12-06 Thread Jens Axboe
On 12/6/18 11:10 AM, Bart Van Assche wrote:
> On Thu, 2018-12-06 at 10:02 -0800, Bart Van Assche wrote:
>> On Thu, 2018-12-06 at 10:48 -0700, Jens Axboe wrote:
>>> which would result in a non-zero exit, which should be expected for this
>>> test?
>>
>> Test srp/002 simulates network failures while running fio on top of dm-mpath.
>> Since queue_if_no_path is enabled in multipath.conf dm-mpath will keep 
>> retrying
>> the block layer requests it receives if these requests fail due to path 
>> removal.
>> If fio reports "io_u error" for this test then that means that the test 
>> failed.
>> I haven't seen fio reporting any I/O errors for this test with any upstream
>> kernel that I tested in the past year or so.
> 
> Hi Jens,
> 
> Please also verify that the version of multipath-tools that you are running 
> includes
> the following patch:

I installed from apt... Says this:

# apt show multipath-tools
Package: multipath-tools
Version: 0.6.4-5+deb9u1

I can run a self-compiled one, if this one doesn't work.

-- 
Jens Axboe



Re: for-next branch and blktests/srp

2018-12-06 Thread Jens Axboe
On 12/6/18 11:02 AM, Bart Van Assche wrote:
> On Thu, 2018-12-06 at 10:48 -0700, Jens Axboe wrote:
>> which would result in a non-zero exit, which should be expected for this
>> test?
> 
> Hi Jens,
> 
> Test srp/002 simulates network failures while running fio on top of dm-mpath.
> Since queue_if_no_path is enabled in multipath.conf dm-mpath will keep 
> retrying
> the block layer requests it receives if these requests fail due to path 
> removal.
> If fio reports "io_u error" for this test then that means that the test 
> failed.
> I haven't seen fio reporting any I/O errors for this test with any upstream
> kernel that I tested in the past year or so.

Just ran it on current -git, and I get the exact same results. Maybe there are
some clues in the log files I sent in the previous email?

-- 
Jens Axboe



Re: for-next branch and blktests/srp

2018-12-06 Thread Bart Van Assche
On Thu, 2018-12-06 at 10:48 -0700, Jens Axboe wrote:
> which would result in a non-zero exit, which should be expected for this
> test?

Hi Jens,

Test srp/002 simulates network failures while running fio on top of dm-mpath.
Since queue_if_no_path is enabled in multipath.conf dm-mpath will keep retrying
the block layer requests it receives if these requests fail due to path removal.
If fio reports "io_u error" for this test then that means that the test failed.
I haven't seen fio reporting any I/O errors for this test with any upstream
kernel that I tested in the past year or so.

Bart.


Re: for-next branch and blktests/srp

2018-12-06 Thread Jens Axboe
On 12/6/18 10:28 AM, Jens Axboe wrote:
> On 12/6/18 10:19 AM, Bart Van Assche wrote:
>> On Thu, 2018-12-06 at 10:00 -0700, Jens Axboe wrote:
>>> On 12/6/18 9:47 AM, Bart Van Assche wrote:
 If I merge Jens' for-next branch with Linus' master branch, boot the
 resulting kernel in a VM and run blktests/tests/srp/002 then that test
 never finishes. The same test passes against Linus' master branch. I
 think this is a regression. The following appears in the system log if
 I run that test:
>>>
>>> You are running that test on a dm device? Can you shed some light on
>>> how that dm device is setup?
>>
>> Hi Jens,
>>
>> The dm device referred to in my e-mail is a dm-mpath device created by test
>> srp/002. All parameters of that device are under control of that test script.
>> From the srp/002 test script:
>>
>> DESCRIPTION="File I/O on top of multipath concurrently with logout and login 
>> (mq)"
>>
>> From srp/multipath.conf:
>>
>> defaults {
>>  find_multipaths no
>>  user_friendly_names yes
>>  queue_without_daemon no
>> }
>>
>> devices {
>>  device {
>>  vendor  "LIO-ORG|SCST_BIO|FUSIONIO"
>>  product ".*"
>>  features "1 queue_if_no_path"
>>  path_checker tur
>>  }
>> }
> 
> 5 kernel compiles later, and I think I'm almost ready to run it...

axboe@dell:/home/axboe/git/blktests $ sudo ./check srp/002
srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) 
[failed]
runtime  33.689s  ...  32.212s
--- tests/srp/002.out   2018-11-09 08:42:22.842029804 -0700
+++ /home/axboe/git/blktests/results/nodev/srp/002.out.bad  2018-12-06 10:45:35.995183521 -0700
@@ -1,2 +1 @@
 Configured SRP target driver
-Passed

It runs for about 32-33s every time and then I get this. The fio output shows 
this:

data-integrity-test-mq: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 
4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
...
fio-3.12-27-g68af
Starting 16 threads
data-integrity-test-mq: Laying out IO file (1 file / 1MiB)
data-integrity-test-mq: Laying out IO file (1 file / 1MiB)
data-integrity-test-mq: Laying out IO file (1 file / 1MiB)
data-integrity-test-mq: Laying out IO file (1 file / 1MiB)
data-integrity-test-mq: Laying out IO file (1 file / 1MiB)
data-integrity-test-mq: Laying out IO file (1 file / 1MiB)
data-integrity-test-mq: Laying out IO file (1 file / 1MiB)
data-integrity-test-mq: Laying out IO file (1 file / 1MiB)
data-integrity-test-mq: Laying out IO file (1 file / 1MiB)
data-integrity-test-mq: Laying out IO file (1 file / 1MiB)
data-integrity-test-mq: Laying out IO file (1 file / 1MiB)
data-integrity-test-mq: Laying out IO file (1 file / 1MiB)
data-integrity-test-mq: Laying out IO file (1 file / 1MiB)
data-integrity-test-mq: Laying out IO file (1 file / 1MiB)
data-integrity-test-mq: Laying out IO file (1 file / 1MiB)
data-integrity-test-mq: Laying out IO file (1 file / 1MiB)
fio: io_u error on file 
/home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.14.0:
 Input/output error: read offset=724992, buflen=4096
fio: io_u error on file 
/home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.5.0:
 Input/output error: read offset=983040, buflen=4096
fio: io_u error on file 
/home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.7.0:
 Input/output error: write offset=700416, buflen=4096
fio: io_u error on file 
/home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.8.0:
 Input/output error: write offset=237568, buflen=4096
fio: io_u error on file 
/home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.9.0:
 Input/output error: write offset=1277952, buflen=4096
fio: io_u error on file 
/home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.9.0:
 Input/output error: write offset=970752, buflen=4096
fio: io_u error on file 
/home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.9.0:
 Input/output error: write offset=974848, buflen=4096
fio: io_u error on file 
/home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.9.0:
 Input/output error: write offset=978944, buflen=4096
fio: io_u error on file 
/home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.9.0:
 Input/output error: write offset=983040, buflen=4096
fio: io_u error on file 
/home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.15.0:
 Input/output error: write offset=172032, buflen=4096
fio: io_u error on file 
/home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.13.0:
 Input/output error: write offset=237568, buflen=4096
fio: io_u error on file 
/home/axboe/git/blktests/results/tmpdir.srp.002.7WR/mnt0/data-integrity-test-mq.3.0:
 Input/output error: write offset=294912, buflen=4096
fio: io_u error on 

Re: for-next branch and blktests/srp

2018-12-06 Thread Jens Axboe
On 12/6/18 10:19 AM, Bart Van Assche wrote:
> On Thu, 2018-12-06 at 10:00 -0700, Jens Axboe wrote:
>> On 12/6/18 9:47 AM, Bart Van Assche wrote:
>>> If I merge Jens' for-next branch with Linus' master branch, boot the
>>> resulting kernel in a VM and run blktests/tests/srp/002 then that test
>>> never finishes. The same test passes against Linus' master branch. I
>>> think this is a regression. The following appears in the system log if
>>> I run that test:
>>
>> You are running that test on a dm device? Can you shed some light on
>> how that dm device is setup?
> 
> Hi Jens,
> 
> The dm device referred to in my e-mail is a dm-mpath device created by test
> srp/002. All parameters of that device are under control of that test script.
> From the srp/002 test script:
> 
> DESCRIPTION="File I/O on top of multipath concurrently with logout and login 
> (mq)"
> 
> From srp/multipath.conf:
> 
> defaults {
>   find_multipaths no
>   user_friendly_names yes
>   queue_without_daemon no
> }
> 
> devices {
>   device {
>   vendor  "LIO-ORG|SCST_BIO|FUSIONIO"
>   product ".*"
>   features "1 queue_if_no_path"
>   path_checker tur
>   }
> }

5 kernel compiles later, and I think I'm almost ready to run it...

-- 
Jens Axboe



Re: for-next branch and blktests/srp

2018-12-06 Thread Bart Van Assche
On Thu, 2018-12-06 at 10:00 -0700, Jens Axboe wrote:
> On 12/6/18 9:47 AM, Bart Van Assche wrote:
> > If I merge Jens' for-next branch with Linus' master branch, boot the
> > resulting kernel in a VM and run blktests/tests/srp/002 then that test
> > never finishes. The same test passes against Linus' master branch. I
> > think this is a regression. The following appears in the system log if
> > I run that test:
> 
> You are running that test on a dm device? Can you shed some light on
> how that dm device is setup?

Hi Jens,

The dm device referred to in my e-mail is a dm-mpath device created by test
srp/002. All parameters of that device are under control of that test script.
From the srp/002 test script:

DESCRIPTION="File I/O on top of multipath concurrently with logout and login 
(mq)"

From srp/multipath.conf:

defaults {
find_multipaths no
user_friendly_names yes
queue_without_daemon no
}

devices {
device {
vendor  "LIO-ORG|SCST_BIO|FUSIONIO"
product ".*"
features"1 queue_if_no_path"
path_checkertur
}
}

Bart.


Re: for-next branch and blktests/srp

2018-12-06 Thread Jens Axboe
On 12/6/18 9:47 AM, Bart Van Assche wrote:
> Hello,
> 
> If I merge Jens' for-next branch with Linus' master branch, boot the
> resulting kernel in a VM and run blktests/tests/srp/002 then that test
> never finishes. The same test passes against Linus' master branch. I
> think this is a regression. The following appears in the system log if
> I run that test:

You are running that test on a dm device? Can you shed some light on
how that dm device is setup?

-- 
Jens Axboe



Re: [PATCH 2/2] nvme: create slaves/holder entries for multipath devices

2018-12-06 Thread Thadeu Lima de Souza Cascardo
On Thu, Dec 06, 2018 at 04:44:32PM +0100, Christoph Hellwig wrote:
> On Wed, Nov 28, 2018 at 06:38:17AM -0200, Thadeu Lima de Souza Cascardo wrote:
> > I will send a followup that includes your two patches, fixing up this one, 
> > plus
> > another two because that is simply reverting the revert of Hanne's patch. 
> > Which
> > means it still has the lsblk bug.
> > 
> > Then, we have two options: do not use GENHD_FL_HIDDEN on multipath nvme 
> > disks,
> > or expose devt for those disks. I am sending a patch for the second option.
> 
> Can you please send it out so we can look at it before the 4.21 merge
> window opens?

Just sent. Sorry about the delay.

Thanks.
Cascardo.


[PATCH 4/4] block: expose devt for GENHD_FL_HIDDEN disks

2018-12-06 Thread Thadeu Lima de Souza Cascardo
Without this exposure, lsblk will fail as it tries to find out the
device's dev_t numbers. This causes a real problem for nvme multipath
devices, as their slaves are hidden.

Exposing them fixes the problem, even though trying to open the devices
returns an error in the case of nvme multipath. So, right now, it's the
driver's responsibility to return a failure to open hidden devices.

Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 block/genhd.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 7674fce32fca..65a7fa664188 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -687,6 +687,7 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
 
disk_alloc_events(disk);
 
+   disk_to_dev(disk)->devt = devt;
if (disk->flags & GENHD_FL_HIDDEN) {
/*
 * Don't let hidden disks show up in /proc/partitions,
@@ -698,13 +699,12 @@ static void __device_add_disk(struct device *parent, struct gendisk *disk,
int ret;
 
/* Register BDI before referencing it from bdev */
-   disk_to_dev(disk)->devt = devt;
ret = bdi_register_owner(disk->queue->backing_dev_info,
disk_to_dev(disk));
WARN_ON(ret);
-   blk_register_region(disk_devt(disk), disk->minors, NULL,
-   exact_match, exact_lock, disk);
}
+   blk_register_region(disk_devt(disk), disk->minors, NULL,
+   exact_match, exact_lock, disk);
register_disk(parent, disk, groups);
if (register_queue)
blk_register_queue(disk);
@@ -776,8 +776,7 @@ void del_gendisk(struct gendisk *disk)
WARN_ON(1);
}
 
-   if (!(disk->flags & GENHD_FL_HIDDEN))
-   blk_unregister_region(disk_devt(disk), disk->minors);
+   blk_unregister_region(disk_devt(disk), disk->minors);
 
kobject_put(disk->part0.holder_dir);
kobject_put(disk->slave_dir);
-- 
2.19.1



[PATCH 3/4] nvme: Should not warn when a disk path is opened

2018-12-06 Thread Thadeu Lima de Souza Cascardo
As we hide the disks used for nvme multipath, their open functions
should be allowed to be called, as their devices are not exposed.

However, not exposing the devices causes problems with lsblk, which won't
find the devices and show them, which is even more of a problem when
holders/slaves relationships are exposed.

Not warning here allows us to expose the disks without creating spurious
warnings. A failure should still be returned, which is the case here,
and still allows lsblk to work.

Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 drivers/nvme/host/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1dc29795f1ee..22424b2adfad 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1327,8 +1327,8 @@ static int nvme_open(struct block_device *bdev, fmode_t 
mode)
struct nvme_ns *ns = bdev->bd_disk->private_data;
 
 #ifdef CONFIG_NVME_MULTIPATH
-   /* should never be called due to GENHD_FL_HIDDEN */
-   if (WARN_ON_ONCE(ns->head->disk))
+   /* Should fail as it's hidden */
+   if (ns->head->disk)
goto fail;
 #endif
if (!kref_get_unless_zero(&ns->kref))
-- 
2.19.1



[PATCH 1/4] block: move holder tracking from struct block_device to hd_struct

2018-12-06 Thread Thadeu Lima de Souza Cascardo
From: Christoph Hellwig 

We'd like to track the slaves and holders for nvme multipath devices
in the same standard fashion as all the other stacked block devices
to make life easy for things like distro installers.

But struct block_device only exists while we have open instances,
which we never have for the underlying devices of a nvme-multipath
setup.  But we can easily move the holder list into struct hd_struct,
which exists all the time the block device exists; the only interesting
bit is that we need a new mutex for it.
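
The include/linux/genhd.h hunk is not quoted here; judging from the
INIT_LIST_HEAD()/mutex_init() calls in the hunks below, the new fields
presumably look roughly like this (an inferred sketch, not the literal
hunk):

struct hd_struct {
	/* ... existing fields ... */
#ifdef CONFIG_SYSFS
	struct list_head holder_disks;	/* holder entries for this partition */
	struct mutex holder_mutex;	/* protects holder_disks */
#endif
	/* ... */
};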

Signed-off-by: Christoph Hellwig 
---
 block/genhd.c|  4 +++
 block/partition-generic.c|  4 +++
 drivers/block/drbd/drbd_nl.c |  4 +--
 drivers/md/bcache/super.c|  8 +++---
 drivers/md/dm.c  |  4 +--
 drivers/md/md.c  |  4 +--
 fs/block_dev.c   | 48 
 include/linux/fs.h   | 11 +++--
 include/linux/genhd.h|  4 +++
 9 files changed, 47 insertions(+), 44 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 0145bcb0cc76..7674fce32fca 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1468,6 +1468,10 @@ struct gendisk *__alloc_disk_node(int minors, int node_id)
 
disk->minors = minors;
rand_initialize_disk(disk);
+#ifdef CONFIG_SYSFS
+   INIT_LIST_HEAD(&disk->part0.holder_disks);
+   mutex_init(&disk->part0.holder_mutex);
+#endif
disk_to_dev(disk)->class = &block_class;
disk_to_dev(disk)->type = &disk_type;
device_initialize(disk_to_dev(disk));
diff --git a/block/partition-generic.c b/block/partition-generic.c
index 5f8db5c5140f..83edc23ff1e8 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -333,6 +333,10 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno,
}
 
seqcount_init(&p->nr_sects_seq);
+#ifdef CONFIG_SYSFS
+   INIT_LIST_HEAD(&p->holder_disks);
+   mutex_init(&p->holder_mutex);
+#endif
pdev = part_to_dev(p);
 
p->start_sect = start;
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index d15703b1ffe8..42354e34b565 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -1670,7 +1670,7 @@ static struct block_device *open_backing_dev(struct drbd_device *device,
if (!do_bd_link)
return bdev;
 
-   err = bd_link_disk_holder(bdev, device->vdisk);
+   err = bd_link_disk_holder(bdev->bd_part, device->vdisk);
if (err) {
blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
drbd_err(device, "bd_link_disk_holder(\"%s\", ...) failed with %d\n",
@@ -1719,7 +1719,7 @@ static void close_backing_dev(struct drbd_device *device, struct block_device *b
if (!bdev)
return;
if (do_bd_unlink)
-   bd_unlink_disk_holder(bdev, device->vdisk);
+   bd_unlink_disk_holder(bdev->bd_part, device->vdisk);
blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
 }
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 7bbd670a5a84..e49b177b1c60 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -677,7 +677,7 @@ static void bcache_device_unlink(struct bcache_device *d)
sysfs_remove_link(&d->kobj, "cache");
 
for_each_cache(ca, d->c, i)
-   bd_unlink_disk_holder(ca->bdev, d->disk);
+   bd_unlink_disk_holder(ca->bdev->bd_part, d->disk);
}
 }
 
@@ -688,7 +688,7 @@ static void bcache_device_link(struct bcache_device *d, struct cache_set *c,
struct cache *ca;
 
for_each_cache(ca, d->c, i)
-   bd_link_disk_holder(ca->bdev, d->disk);
+   bd_link_disk_holder(ca->bdev->bd_part, d->disk);
 
snprintf(d->name, BCACHEDEVNAME_SIZE,
 "%s%u", name, d->id);
@@ -936,7 +936,7 @@ void bch_cached_dev_run(struct cached_dev *dc)
}
 
add_disk(d->disk);
-   bd_link_disk_holder(dc->bdev, dc->disk.disk);
+   bd_link_disk_holder(dc->bdev->bd_part, dc->disk.disk);
/*
 * won't show up in the uevent file, use udevadm monitor -e instead
 * only class / kset properties are persistent
@@ -1199,7 +1199,7 @@ static void cached_dev_free(struct closure *cl)
kthread_stop(dc->status_update_thread);
 
if (atomic_read(&dc->running))
-   bd_unlink_disk_holder(dc->bdev, dc->disk.disk);
+   bd_unlink_disk_holder(dc->bdev->bd_part, dc->disk.disk);
bcache_device_free(&dc->disk);
list_del(&dc->list);
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ab72d79775ee..823b592f066b 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -770,7 +770,7 @@ static int open_table_device(struct table_device *td, dev_t dev,
if (IS_ERR(bdev))
return PTR_ERR(bdev);
 
-   r = bd_link_disk_holder(bdev, 

for-next branch and blktests/srp

2018-12-06 Thread Bart Van Assche
Hello,

If I merge Jens' for-next branch with Linus' master branch, boot the
resulting kernel in a VM and run blktests/tests/srp/002 then that test
never finishes. The same test passes against Linus' master branch. I
think this is a regression. The following appears in the system log if
I run that test:

Call Trace:
INFO: task kworker/0:1:12 blocked for more than 120 seconds.
Call Trace:
INFO: task ext4lazyinit:2079 blocked for more than 120 seconds.
Call Trace:
INFO: task fio:2151 blocked for more than 120 seconds.
Call Trace:
INFO: task fio:2154 blocked for more than 120 seconds.

My list-pending-block-requests script produces the following output:

dm-0
dm-0/hctx0/active:0
dm-0/hctx0/dispatch:b856a515 {.op=WRITE_ZEROES, .cmd_flags=SYNC, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1064, .internal_tag=-1}
dm-0/hctx0/dispatch:6a2dca0a {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1065, .internal_tag=-1}
dm-0/hctx0/dispatch:1baa5734 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1085, .internal_tag=-1}
dm-0/hctx0/dispatch:2d618841 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1025, .internal_tag=-1}
dm-0/hctx0/dispatch:c8e6e436 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1250, .internal_tag=-1}
dm-0/hctx0/dispatch:7af12ca3 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1057, .internal_tag=-1}
dm-0/hctx0/dispatch:2ad59d1e {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1060, .internal_tag=-1}
dm-0/hctx0/dispatch:51c35166 {.op=READ, .cmd_flags=, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1188, .internal_tag=-1}
dm-0/hctx0/dispatch:05da9610 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1253, .internal_tag=-1}
dm-0/hctx0/dispatch:a25483c5 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1254, .internal_tag=-1}
dm-0/hctx0/dispatch:7ef87035 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1255, .internal_tag=-1}
dm-0/hctx0/dispatch:8a8f15c0 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1261, .internal_tag=-1}
dm-0/hctx0/dispatch:d65ae959 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1263, .internal_tag=-1}
dm-0/hctx0/dispatch:88aac981 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1264, .internal_tag=-1}
dm-0/hctx0/dispatch:89489e70 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1265, .internal_tag=-1}
dm-0/hctx0/dispatch:7c69e2d2 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1154, .internal_tag=-1}
dm-0/hctx0/dispatch:e32aa9b5 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1155, .internal_tag=-1}
dm-0/hctx0/dispatch:5b88a332 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1157, .internal_tag=-1}
dm-0/hctx0/dispatch:1de14ef0 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1159, .internal_tag=-1}
dm-0/hctx0/dispatch:a380d0f3 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1161, .internal_tag=-1}
dm-0/hctx0/dispatch:c48cfa16 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1162, .internal_tag=-1}
dm-0/hctx0/dispatch:11d96382 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1163, .internal_tag=-1}
dm-0/hctx0/dispatch:b5efd5d7 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1167, .internal_tag=-1}
dm-0/hctx0/dispatch:80f5c476 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1153, .internal_tag=-1}
dm-0/hctx0/dispatch:c73c244d {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1266, .internal_tag=-1}
dm-0/hctx0/dispatch:69c43800 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1268, .internal_tag=-1}
dm-0/hctx0/dispatch:d2ec3c96 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1269, .internal_tag=-1}
dm-0/hctx0/dispatch:9b2099db {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1270, .internal_tag=-1}
dm-0/hctx0/dispatch:7d433090 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1272, .internal_tag=-1}
dm-0/hctx0/dispatch:afa7d2ea {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1274, .internal_tag=-1}
dm-0/hctx0/dispatch:c0f9d890 {.op=WRITE, .cmd_flags=SYNC|IDLE, 
.rq_flags=IO_STAT|STATS, .state=idle, .tag=1276, .internal_tag=-1}

Re: for-next branch and throttling

2018-12-06 Thread Jens Axboe
On 12/6/18 9:39 AM, Bart Van Assche wrote:
> Hello,
> 
> If I merge Jens' for-next branch with Linus' master branch and boot the
> resulting kernel in a VM then the call trace shown below appears. This call
> trace does not appear when building and booting the code from Linus' master
> branch. Is this perhaps a regression?
> 
> WARNING: CPU: 1 PID: 257 at block/blk-throttle.c:2128 
> blk_throtl_bio+0xc00/0x1120
> Modules linked in: floppy(+) virtio_blk(+) virtio_net net_failover failover 
> i2c_piix4 pata_acpi
> CPU: 1 PID: 257 Comm: systemd-udevd Not tainted 4.20.0-rc5-dbg+ #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 
> 04/01/2014
> RIP: 0010:blk_throtl_bio+0xc00/0x1120

It's some over-zealous rcu removal, simple fix is coming. CC Dennis.

-- 
Jens Axboe



[PATCH 2/4] nvme: create slaves/holder entries for multipath devices

2018-12-06 Thread Thadeu Lima de Souza Cascardo
From: Christoph Hellwig 

This allows tools like distro installers to easily track the relationship.

Signed-off-by: Christoph Hellwig 
[cascardo: only add disk when there is ns->head]
Signed-off-by: Thadeu Lima de Souza Cascardo 
---
 drivers/nvme/host/core.c  |  5 +++--
 drivers/nvme/host/multipath.c | 13 +++--
 drivers/nvme/host/nvme.h  | 12 
 3 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3ce7fc9df378..1dc29795f1ee 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -371,7 +371,7 @@ static void nvme_free_ns_head(struct kref *ref)
struct nvme_ns_head *head =
container_of(ref, struct nvme_ns_head, ref);
 
-   nvme_mpath_remove_disk(head);
+   nvme_mpath_remove_nshead(head);
ida_simple_remove(&head->subsys->ns_ida, head->instance);
list_del_init(&head->entry);
cleanup_srcu_struct_quiesced(&head->srcu);
@@ -3120,7 +3120,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 
device_add_disk(ctrl->device, ns->disk, nvme_ns_id_attr_groups);
 
-   nvme_mpath_add_disk(ns, id);
+   nvme_mpath_add_ns(ns, id);
nvme_fault_inject_init(ns);
kfree(id);
 
@@ -3144,6 +3144,7 @@ static void nvme_ns_remove(struct nvme_ns *ns)
 
nvme_fault_inject_fini(ns);
if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
+   nvme_mpath_remove_ns(ns);
del_gendisk(ns->disk);
blk_cleanup_queue(ns->queue);
if (blk_get_integrity(ns->disk))
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index ec310b1b9267..174c64643592 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -500,7 +500,7 @@ static int nvme_set_ns_ana_state(struct nvme_ctrl *ctrl,
return 0;
 }
 
-void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id)
+void nvme_mpath_add_ns(struct nvme_ns *ns, struct nvme_id_ns *id)
 {
if (nvme_ctrl_use_ana(ns->ctrl)) {
mutex_lock(&ns->ctrl->ana_lock);
@@ -513,9 +513,18 @@ void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id)
nvme_mpath_set_live(ns);
mutex_unlock(&ns->head->lock);
}
+
+   if (ns->head->disk)
+   bd_link_disk_holder(&ns->disk->part0, ns->head->disk);
+}
+
+void nvme_mpath_remove_ns(struct nvme_ns *ns)
+{
+   if (ns->head->disk)
+   bd_unlink_disk_holder(&ns->disk->part0, ns->head->disk);
 }
 
-void nvme_mpath_remove_disk(struct nvme_ns_head *head)
+void nvme_mpath_remove_nshead(struct nvme_ns_head *head)
 {
if (!head->disk)
return;
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ae77eb16fd1f..365262d11e53 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -470,8 +470,9 @@ void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns,
 void nvme_failover_req(struct request *req);
 void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl);
int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head);
-void nvme_mpath_add_disk(struct nvme_ns *ns, struct nvme_id_ns *id);
-void nvme_mpath_remove_disk(struct nvme_ns_head *head);
+void nvme_mpath_add_ns(struct nvme_ns *ns, struct nvme_id_ns *id);
+void nvme_mpath_remove_ns(struct nvme_ns *ns);
+void nvme_mpath_remove_nshead(struct nvme_ns_head *head);
 int nvme_mpath_init(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id);
 void nvme_mpath_uninit(struct nvme_ctrl *ctrl);
 void nvme_mpath_stop(struct nvme_ctrl *ctrl);
@@ -515,11 +516,14 @@ static inline int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,
 {
return 0;
 }
-static inline void nvme_mpath_add_disk(struct nvme_ns *ns,
+static inline void nvme_mpath_add_ns(struct nvme_ns *ns,
struct nvme_id_ns *id)
 {
 }
-static inline void nvme_mpath_remove_disk(struct nvme_ns_head *head)
+static inline void nvme_mpath_remove_ns(struct nvme_ns *ns)
+{
+}
+static inline void nvme_mpath_remove_nshead(struct nvme_ns_head *head)
 {
 }
 static inline void nvme_mpath_clear_current_path(struct nvme_ns *ns)
-- 
2.19.1



[PATCH 0/4] nvme multipath: expose slaves/holders

2018-12-06 Thread Thadeu Lima de Souza Cascardo
Exposing slaves/holders is necessary in order to find out the real PCI device
and its driver for the root filesystem when generating an initramfs with
initramfs-tools. That fails right now for nvme multipath devices, which this
patchset fixes.

However, because the slave devices are hidden, lsblk fails without some extra
patches, as it can't find the device numbers for the slave devices, and exits.

Christoph Hellwig (2):
  block: move holder tracking from struct block_device to hd_struct
  nvme: create slaves/holder entries for multipath devices

Thadeu Lima de Souza Cascardo (2):
  nvme: Should not warn when a disk path is opened
  block: expose devt for GENHD_FL_HIDDEN disks

 block/genhd.c | 13 ++
 block/partition-generic.c |  4 +++
 drivers/block/drbd/drbd_nl.c  |  4 +--
 drivers/md/bcache/super.c |  8 +++---
 drivers/md/dm.c   |  4 +--
 drivers/md/md.c   |  4 +--
 drivers/nvme/host/core.c  |  9 ---
 drivers/nvme/host/multipath.c | 13 --
 drivers/nvme/host/nvme.h  | 12 ++---
 fs/block_dev.c| 48 +++
 include/linux/fs.h| 11 +++-
 include/linux/genhd.h |  4 +++
 12 files changed, 75 insertions(+), 59 deletions(-)

-- 
2.19.1



for-next branch and throttling

2018-12-06 Thread Bart Van Assche
Hello,

If I merge Jens' for-next branch with Linus' master branch and boot the
resulting kernel in a VM then the call trace shown below appears. This call
trace does not appear when building and booting the code from Linus' master
branch. Is this perhaps a regression?

WARNING: CPU: 1 PID: 257 at block/blk-throttle.c:2128 
blk_throtl_bio+0xc00/0x1120
Modules linked in: floppy(+) virtio_blk(+) virtio_net net_failover failover 
i2c_piix4 pata_acpi
CPU: 1 PID: 257 Comm: systemd-udevd Not tainted 4.20.0-rc5-dbg+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
RIP: 0010:blk_throtl_bio+0xc00/0x1120
Call Trace:
 ? create_task_io_context+0x1a2/0x1e0
 ? lock_downgrade+0x2e0/0x2e0
 ? kasan_check_read+0x11/0x20
 ? do_raw_spin_unlock+0xa8/0x140
 ? preempt_count_sub+0x18/0xd0
 generic_make_request_checks+0x432/0xcc0
 ? trace_event_raw_event_block_rq_requeue+0x290/0x290
 ? __lock_acquire+0x5b0/0x17e0
 ? __bio_associate_blkg.isra.33+0x1d0/0x3c0
 generic_make_request+0x151/0x950
 ? blk_queue_enter+0x7e0/0x7e0
 submit_bio+0x9b/0x250
 ? submit_bio+0x9b/0x250
 ? generic_make_request+0x950/0x950
 ? guard_bio_eod+0x120/0x2d0
 submit_bh_wbc+0x2df/0x310
 block_read_full_page+0x34c/0x4c0
 ? I_BDEV+0x20/0x20
 ? __bread_gfp+0x170/0x170
 ? add_to_page_cache_locked+0x20/0x20
 ? blkdev_writepages+0x10/0x10
 ? blkdev_writepages+0x10/0x10
 blkdev_readpage+0x18/0x20
 do_read_cache_page+0x5ed/0xbf0
 ? blkdev_writepages+0x10/0x10
 ? find_valid_gpt+0x1a8/0x8b0
 ? efi_partition+0x147/0x6c1
 ? check_partition+0x1b2/0x314
 ? blkdev_get+0x1f2/0x5a0
 ? __device_add_disk+0x6c1/0x7e0
 ? device_add_disk+0x13/0x20
 ? pagecache_get_page+0x420/0x420
 ? driver_probe_device+0x118/0x180
 ? __driver_attach+0x167/0x190
 ? bus_for_each_dev+0x105/0x160
 ? driver_attach+0x2b/0x30
 ? bus_add_driver+0x23d/0x350
 ? driver_register+0xdc/0x160
 ? register_virtio_driver+0x3c/0x60
 ? init+0x51/0x1000 [virtio_blk]
 ? do_one_initcall+0xb4/0x3a1
 ? do_init_module+0x103/0x360
 ? load_module+0x443e/0x4630
 ? __do_sys_finit_module+0x12d/0x1b0
 ? __x64_sys_finit_module+0x43/0x50
 ? do_syscall_64+0x71/0x210
 ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
 ? match_held_lock+0x20/0x250
 ? match_held_lock+0x35/0x250
 ? blkdev_writepages+0x10/0x10
 read_cache_page+0x3a/0x50
 read_dev_sector+0x64/0x170
 read_lba+0x1f9/0x3a0
 ? msdos_partition+0x1370/0x1370
 ? find_valid_gpt+0x1a8/0x8b0
 ? kmem_cache_alloc_trace+0x151/0x340
 ? find_valid_gpt+0x1a8/0x8b0
 find_valid_gpt+0x1c6/0x8b0
 ? is_gpt_valid.part.5+0x640/0x640
 ? __alloc_pages_nodemask+0x1c4/0x4d0
 ? lockdep_hardirqs_on+0x182/0x260
 ? get_page_from_freelist+0x648/0x2070
 ? mark_lock+0x69/0x8d0
 ? trace_hardirqs_on+0x24/0x130
 ? kasan_unpoison_shadow+0x35/0x50
 ? get_page_from_freelist+0x648/0x2070
 ? __asan_loadN+0xf/0x20
 ? widen_string+0x73/0x170
 ? __asan_loadN+0xf/0x20
 ? widen_string+0x73/0x170
 ? get_page_from_freelist+0x1eea/0x2070
 ? set_precision+0x90/0x90
 ? string+0x135/0x180
 efi_partition+0x147/0x6c1
 ? format_decode+0xce/0x550
 ? enable_ptr_key_workfn+0x30/0x30
 ? find_valid_gpt+0x8b0/0x8b0
 ? vsnprintf+0x11d/0x820
 ? pointer+0x4d0/0x4d0
 ? snprintf+0x88/0xa0
 ? snprintf+0x88/0xa0
 ? vsprintf+0x20/0x20
 ? find_valid_gpt+0x8b0/0x8b0
 check_partition+0x1b2/0x314
 rescan_partitions+0x13b/0x440
 __blkdev_get+0x61f/0x930
 ? check_disk_change+0xa0/0xa0
 ? lock_downgrade+0x2e0/0x2e0
 ? var_wake_function+0x90/0x90
 blkdev_get+0x1f2/0x5a0
 ? preempt_count_sub+0x18/0xd0
 ? _raw_spin_unlock+0x31/0x50
 ? bd_may_claim+0x80/0x80
 ? bdget+0x26b/0x280
 ? blkdev_writepage+0x20/0x20
 ? kasan_check_read+0x11/0x20
 ? kobject_put+0x23/0x220
 __device_add_disk+0x6c1/0x7e0
 ? blk_alloc_devt+0x150/0x150
 ? __asan_storeN+0x12/0x20
 device_add_disk+0x13/0x20
 virtblk_probe+0x839/0xace [virtio_blk]
 ? virtblk_restore+0xd0/0xd0 [virtio_blk]
 ? mutex_unlock+0x12/0x20
 ? kernfs_activate+0xd0/0xe0
 ? kasan_check_write+0x14/0x20
 ? kernfs_put+0x2c/0x280
 ? vring_transport_features+0x19/0x70
 ? vp_finalize_features+0xe6/0x160
 ? vp_get_status+0x29/0x30
 ? virtio_finalize_features+0xbe/0xe0
 virtio_dev_probe+0x27d/0x390
 really_probe+0x191/0x560
 ? driver_probe_device+0x180/0x180
 driver_probe_device+0x118/0x180
 ? driver_probe_device+0x180/0x180
 __driver_attach+0x167/0x190
 bus_for_each_dev+0x105/0x160
 ? subsys_dev_iter_exit+0x10/0x10
 ? kasan_check_read+0x11/0x20
 ? preempt_count_sub+0x18/0xd0
 ? _raw_spin_unlock+0x31/0x50
 driver_attach+0x2b/0x30
 bus_add_driver+0x23d/0x350
 driver_register+0xdc/0x160
 ? 0xa007
 register_virtio_driver+0x3c/0x60
 init+0x51/0x1000 [virtio_blk]
 do_one_initcall+0xb4/0x3a1
 ? trace_event_raw_event_initcall_finish+0x120/0x120
 ? kasan_unpoison_shadow+0x35/0x50
 ? __asan_register_globals+0x7b/0xa0
 do_init_module+0x103/0x360
 load_module+0x443e/0x4630
 ? module_frob_arch_sections+0x20/0x20
 ? vfs_read+0xe6/0x1e0
 ? kernel_read+0x79/0xa0
 ? kasan_check_write+0x14/0x20
 ? kernel_read_file+0x259/0x310
 ? do_mmap+0x431/0x6c0
 __do_sys_finit_module+0x12d/0x1b0
 ? 

Re: [PATCH v3] block: add documentation for io_timeout

2018-12-06 Thread Christoph Hellwig
On Thu, Dec 06, 2018 at 08:06:16AM -0800, Bart Van Assche wrote:
> On Wed, 2018-12-05 at 22:59 +0800, Weiping Zhang wrote:
> > Weiping Zhang wrote on Wed, Dec 5, 2018 at 10:49 PM:
> > > Christoph Hellwig wrote on Wed, Dec 5, 2018 at 10:40 PM:
> > > > Can you please also send a patch to not show this attribute for
> > > > drivers without a timeout handler?  Thanks!
> > 
> Is there a simple way to do that?
> 
> How about checking the timeout member of struct blk_mq_ops for blk-mq and
> checking the rq_timed_out_fn member in struct request_queue for the legacy
> block layer?

Just the former given that the legacy code is gone in for-next.

> > Shall we return -ENOTSUPP when a user reads/writes this attribute and the
> > driver has no timeout handler?
> 
> A much more elegant solution is to introduce a sysfs attribute group for the
> io_timeout attribute and to make that group visible only if a timeout handler
> has been defined. See e.g. disk_attr_group in block/genhd.c for an example.

Agreed, that is the way to go.


Re: [PATCH v3] block: add documentation for io_timeout

2018-12-06 Thread Bart Van Assche
On Wed, 2018-12-05 at 22:17 +0800, Weiping Zhang wrote:
> +Description:
> + io_timeout is a request’s timeouts at block layer in
> + milliseconds. When the underlying driver starts processing
> + a request, the generic block layer will start a timer, if
> + this request cannot be completed in io_timeout milliseconds,
> + a timeout event will occur.

Sorry but I think this description is still somewhat confusing. How about
changing that description into the following?

io_timeout is the request timeout in milliseconds. If a request does not
complete in this time then the block driver timeout handler is invoked.
That timeout handler can decide to retry the request, to fail it or to
start a device recovery strategy.

Bart.


Re: [PATCH 1/2] block: get rid of BLK_MAX_TIMEOUT

2018-12-06 Thread Bart Van Assche
On Thu, 2018-12-06 at 22:18 +0800, Weiping Zhang wrote:
> Before this patch, even if we set io_timeout to 30*HZ (the default),
> blk_rq_timeout always returns jiffies + 5*HZ.
>   [1]. If there is no pending request in the timeout list, the timer callback
> blk_rq_timed_out_timer will be called after 5*HZ, and then
> blk_mq_check_expired will check whether some request has been delayed
> by comparing jiffies and request->deadline; obviously the request has
> not timed out because we set the request's timeout to 30*HZ.
> So for this case the timer callback should be called at jiffies + 30*HZ
> instead of jiffies + 5*HZ.
> 
>   [2]. If there are pending requests in the timeout list, we compare the
> request's expiry and the queue's expiry. If time_after(request->expire,
> queue->expire), modify queue->timeout.expire to request->expire;
> otherwise do nothing.
> 
> So I think this patch just solves the problem in [1], with no other
> regression, or what am I missing here?

The blk_rq_timeout() function was introduced by commit 0d2602ca30e4 ("blk-mq:
improve support for shared tags maps"). I think the purpose of that function
is to make sure that the nr_active counter in struct blk_mq_hw_ctx gets updated
at least once every five seconds. So there are two problems with this patch:
- It reduces the frequency of 'nr_active' updates. I think that is wrong and
  also that it will negatively affect drivers that rely on this functionality,
  e.g. the SCSI core.
- The patch description does not match the code changes in this patch.
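
For reference, the helper under discussion looks roughly like this
before the patch (paraphrased from block/blk-timeout.c, where
BLK_MAX_TIMEOUT is 5 * HZ):

static unsigned long blk_rq_timeout(unsigned long timeout)
{
	unsigned long maxt;

	/* cap the expiry so the timeout timer fires at least every 5s */
	maxt = round_jiffies_up(jiffies + BLK_MAX_TIMEOUT);
	if (time_after(timeout, maxt))
		timeout = maxt;

	return timeout;
}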

Bart.



Re: [PATCH v3] block: add documentation for io_timeout

2018-12-06 Thread Bart Van Assche
On Wed, 2018-12-05 at 22:59 +0800, Weiping Zhang wrote:
> Weiping Zhang wrote on Wed, Dec 5, 2018 at 10:49 PM:
> > Christoph Hellwig wrote on Wed, Dec 5, 2018 at 10:40 PM:
> > > Can you please also send a patch to not show this attribute for
> > > drivers without a timeout handler?  Thanks!
> 
> Is there a simple way to do that?

How about checking the timeout member of struct blk_mq_ops for blk-mq and
checking the rq_timed_out_fn member in struct request_queue for the legacy
block layer?

> Shall we return -ENOTSUPP when a user reads/writes this attribute and the
> driver has no timeout handler?

A much more elegant solution is to introduce a sysfs attribute group for the
io_timeout attribute and to make that group visible only if a timeout handler
has been defined. See e.g. disk_attr_group in block/genhd.c for an example.
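
A minimal sketch of that approach (hypothetical names; it assumes an
existing queue_io_timeout_entry sysfs attribute and maps the kobject
back to the queue the way blk-sysfs.c does):

static umode_t queue_io_timeout_visible(struct kobject *kobj,
					struct attribute *attr, int n)
{
	struct request_queue *q =
		container_of(kobj, struct request_queue, kobj);

	/* only show io_timeout if the driver has a timeout handler */
	return q->mq_ops && q->mq_ops->timeout ? attr->mode : 0;
}

static struct attribute *queue_io_timeout_attrs[] = {
	&queue_io_timeout_entry.attr,	/* hypothetical queue_sysfs_entry */
	NULL,
};

static const struct attribute_group queue_io_timeout_group = {
	.attrs		= queue_io_timeout_attrs,
	.is_visible	= queue_io_timeout_visible,
};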

Bart.


[PATCH 1/2] lightnvm: pblk: Do not overwrite ppa list with meta list

2018-12-06 Thread Igor Konopko
Currently, when using PBLK with 0-sized metadata, both the ppa list
and the meta list point to the same memory, since pblk_dma_meta_size()
returns 0 in that case.

This commit fixes that issue by ensuring that pblk_dma_meta_size()
always returns at least sizeof(struct pblk_sec_meta) per sector, so
that the ppa list and the meta list point to different memory addresses.

Even though in that case the drive does not really care about the
meta_list pointer, this is the easiest way to fix the issue without
introducing changes in many places in the code just for the 0-sized
metadata case.
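
The aliasing happens because the ppa list is carved out of the same
DMA buffer as the metadata, at an offset of pblk_dma_meta_size();
roughly like this (a paraphrase of the allocation pattern, not the
literal pblk code):

	/* one DMA buffer holds the OOB metadata followed by the PPA list */
	rqd->meta_list = nvm_dev_dma_alloc(dev->parent, GFP_KERNEL,
					   &rqd->dma_meta_list);
	rqd->ppa_list = rqd->meta_list + pblk_dma_meta_size(pblk);
	rqd->dma_ppa_list = rqd->dma_meta_list + pblk_dma_meta_size(pblk);
	/* if pblk_dma_meta_size() is 0, both lists alias the same memory */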

Reported-by: Hans Holmberg 
Signed-off-by: Igor Konopko 
---
 drivers/lightnvm/pblk.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index bc40b1381ff6..e5c9ff2bf0da 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -1393,7 +1393,8 @@ static inline struct pblk_sec_meta *pblk_get_meta(struct pblk *pblk,
 
 static inline int pblk_dma_meta_size(struct pblk *pblk)
 {
-   return pblk->oob_meta_size * NVM_MAX_VLBA;
+   return max_t(int, sizeof(struct pblk_sec_meta), pblk->oob_meta_size)
+   * NVM_MAX_VLBA;
 }
 
 static inline int pblk_is_oob_meta_supported(struct pblk *pblk)
-- 
2.17.1



[PATCH 2/2] lightnvm: pblk: Ensure that bio is not freed on recovery

2018-12-06 Thread Igor Konopko
When we are using PBLK with 0-sized metadata, the recovery process
needs to reference the last page of the bio. Currently KASAN reports
a use-after-free in that case, since the bio is freed on IO
completion.

This patch takes an additional bio reference to ensure that we can
still use the bio memory after IO completion. It also ensures that
we are not reusing the same bio on the retry_rq path.
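
The pattern, as an illustrative sketch (the submit helper and inspect
step are hypothetical stand-ins for the recovery path, not pblk code):

static int read_and_inspect(struct request_queue *q, void *data, int len)
{
	struct bio *bio;

	bio = bio_map_kern(q, data, len, GFP_KERNEL);
	if (IS_ERR(bio))
		return PTR_ERR(bio);

	bio_get(bio);		/* our reference, on top of the bio's own */
	submit_and_wait(bio);	/* hypothetical; completion calls bio_put() */

	/* still safe: our reference keeps the bio memory alive */
	inspect_last_page(bio);	/* hypothetical */

	bio_put(bio);		/* drop our reference; bio may now be freed */
	return 0;
}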

Reported-by: Hans Holmberg 
Signed-off-by: Igor Konopko 
---
 drivers/lightnvm/pblk-recovery.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/lightnvm/pblk-recovery.c b/drivers/lightnvm/pblk-recovery.c
index 009faf5db40f..3fcf062d752c 100644
--- a/drivers/lightnvm/pblk-recovery.c
+++ b/drivers/lightnvm/pblk-recovery.c
@@ -376,12 +376,14 @@ static int pblk_recov_scan_oob(struct pblk *pblk, struct pblk_line *line,
rq_ppas = pblk->min_write_pgs;
rq_len = rq_ppas * geo->csecs;
 
+retry_rq:
bio = bio_map_kern(dev->q, data, rq_len, GFP_KERNEL);
if (IS_ERR(bio))
return PTR_ERR(bio);
 
bio->bi_iter.bi_sector = 0; /* internal bio */
bio_set_op_attrs(bio, REQ_OP_READ, 0);
+   bio_get(bio);
 
rqd->bio = bio;
rqd->opcode = NVM_OP_PREAD;
@@ -394,7 +396,6 @@ static int pblk_recov_scan_oob(struct pblk *pblk, struct pblk_line *line,
if (pblk_io_aligned(pblk, rq_ppas))
rqd->is_seq = 1;
 
-retry_rq:
for (i = 0; i < rqd->nr_ppas; ) {
struct ppa_addr ppa;
int pos;
@@ -417,6 +418,7 @@ static int pblk_recov_scan_oob(struct pblk *pblk, struct pblk_line *line,
if (ret) {
pblk_err(pblk, "I/O submission failed: %d\n", ret);
bio_put(bio);
+   bio_put(bio);
return ret;
}
 
@@ -428,19 +430,25 @@ static int pblk_recov_scan_oob(struct pblk *pblk, struct pblk_line *line,
 
if (padded) {
pblk_log_read_err(pblk, rqd);
+   bio_put(bio);
return -EINTR;
}
 
pad_distance = pblk_pad_distance(pblk, line);
ret = pblk_recov_pad_line(pblk, line, pad_distance);
-   if (ret)
+   if (ret) {
+   bio_put(bio);
return ret;
+   }
 
padded = true;
+   bio_put(bio);
goto retry_rq;
}
 
pblk_get_packed_meta(pblk, rqd);
+   bio_put(bio);
+
for (i = 0; i < rqd->nr_ppas; i++) {
struct pblk_sec_meta *meta = pblk_get_meta(pblk, meta_list, i);
u64 lba = le64_to_cpu(meta->lba);
-- 
2.17.1



Re: [PATCH 2/2] nvme: create slaves/holder entries for multipath devices

2018-12-06 Thread Christoph Hellwig
On Wed, Nov 28, 2018 at 06:38:17AM -0200, Thadeu Lima de Souza Cascardo wrote:
> I will send a followup that includes your two patches, fixing up this one, 
> plus
> another two because that is simply reverting the revert of Hanne's patch. 
> Which
> means it still has the lsblk bug.
> 
> Then, we have two options: do not use GENHD_FL_HIDDEN on multipath nvme disks,
> or expose devt for those disks. I am sending a patch for the second option.

Can you please send it out so we can look at it before the 4.21 merge
window opens?


Re: [PATCH 1/2] block: get rid of BLK_MAX_TIMEOUT

2018-12-06 Thread Weiping Zhang
Bart Van Assche wrote on Wed, Dec 5, 2018 at 11:59 PM:
>
> On Wed, 2018-12-05 at 23:37 +0800, Weiping Zhang wrote:
> > @@ -130,7 +119,7 @@ void blk_add_timer(struct request *req)
> >* than an existing one, modify the timer. Round up to next nearest
> >* second.
> >*/
> > - expiry = blk_rq_timeout(round_jiffies_up(expiry));
> > + expiry = round_jiffies_up(expiry);
>
> If you would have read the comment above this code, you would have known
> that this patch does not do what you think it does and additionally that
> it introduces a regression.
>
Let's paste full comments here:
/*
 * If the timer isn't already pending or this timeout is earlier
 * than an existing one, modify the timer. Round up to next nearest
 * second.
 */
Before this patch, even if we set io_timeout to 30*HZ (the default),
blk_rq_timeout always returns jiffies + 5*HZ.
  [1]. If there is no pending request in the timeout list, the timer callback
blk_rq_timed_out_timer will be called after 5*HZ, and then
blk_mq_check_expired will check whether some request has been delayed
by comparing jiffies and request->deadline; obviously the request has
not timed out because we set the request's timeout to 30*HZ.
So for this case the timer callback should be called at jiffies + 30*HZ
instead of jiffies + 5*HZ.

  [2]. If there are pending requests in the timeout list, we compare the
request's expiry and the queue's expiry. If time_after(request->expire,
queue->expire), modify queue->timeout.expire to request->expire;
otherwise do nothing.

So I think this patch just solves the problem in [1], with no other
regression, or what am I missing here?

Thanks
Weiping
> Bart.


Re: [PATCH 0/2][V2] io.latency test for blktests

2018-12-06 Thread Omar Sandoval
On Wed, Dec 05, 2018 at 10:34:02AM -0500, Josef Bacik wrote:
> v1->v2:
> - dropped my python library, TIL about jq.
> - fixed the spelling mistakes in the test.
> 
> -- Original message --
> 
> This patchset is to add a test to verify io.latency is working properly, and 
> to
> add all the supporting code to run that test.
> 
> First is the cgroup2 infrastructure which is fairly straightforward.  Just
> verifies we have cgroup2, and gives us the helpers to check and make sure we
> have the right controllers in place for the test.
> 
> The second patch brings over some python scripts I put in xfstests for parsing
> the fio json output.  I looked at the existing fio performance stuff in
> blktests, but we only capture bw stuff, which is wonky with this particular 
> test
> because once the fast group is finished the slow group is allowed to go as 
> fast
> as it wants.  So I needed this to pull out actual jobtime spent.  This will 
> give
> us flexibility to pull out other fio performance data in the future.
> 
> The final patch is the test itself.  It simply runs a job by itself to get a
> baseline view of the disk performance.  Then it creates 2 cgroups, one fast 
> and
> one slow, and runs the same job simultaneously in both groups.  The result
> should be that the fast group takes just slightly longer time than the 
> baseline
> (I use a 15% threshold to be safe), and that the slow one takes considerably
> longer.  Thanks,

I cleaned up a ton of shellcheck warnings (from `make check`) and pushed
to https://github.com/osandov/blktests/tree/josef. I tested with QEMU
on Jens' for-next branch. With an emulated NVMe device, it failed with
"Too much of a performance drop for the protected workload". On
virtio-blk, I hit this:

[ 1843.056452] INFO: task fio:20750 blocked for more than 120 seconds.
[ 1843.057495]   Not tainted 4.20.0-rc5-00251-g90efb26fa9a4 #19
[ 1843.058487] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 1843.059769] fio D0 20750  20747 0x0080
[ 1843.060688] Call Trace:
[ 1843.061123]  ? __schedule+0x286/0x870
[ 1843.061735]  ? blkcg_iolatency_done_bio+0x680/0x680
[ 1843.062574]  ? blkcg_iolatency_cleanup+0x60/0x60
[ 1843.063347]  schedule+0x32/0x80
[ 1843.063874]  io_schedule+0x12/0x40
[ 1843.064449]  rq_qos_wait+0x9a/0x120
[ 1843.065007]  ? karma_partition+0x210/0x210
[ 1843.065661]  ? blkcg_iolatency_done_bio+0x680/0x680
[ 1843.066435]  blkcg_iolatency_throttle+0x185/0x360
[ 1843.067196]  __rq_qos_throttle+0x23/0x30
[ 1843.067958]  blk_mq_make_request+0x101/0x5c0
[ 1843.068637]  generic_make_request+0x1b3/0x3c0
[ 1843.069329]  submit_bio+0x45/0x140
[ 1843.069876]  blkdev_direct_IO+0x3db/0x440
[ 1843.070527]  ? aio_complete+0x2f0/0x2f0
[ 1843.071146]  generic_file_direct_write+0x96/0x160
[ 1843.071880]  __generic_file_write_iter+0xb3/0x1c0
[ 1843.072599]  ? blk_mq_dispatch_rq_list+0x3aa/0x550
[ 1843.073340]  blkdev_write_iter+0xa0/0x120
[ 1843.073960]  ? __fget+0x6e/0xa0
[ 1843.074452]  aio_write+0x11f/0x1d0
[ 1843.074979]  ? __blk_mq_run_hw_queue+0x6f/0xe0
[ 1843.075658]  ? __check_object_size+0xa0/0x189
[ 1843.076345]  ? preempt_count_add+0x5a/0xb0
[ 1843.077086]  ? aio_read_events+0x259/0x380
[ 1843.077819]  ? kmem_cache_alloc+0x16e/0x1c0
[ 1843.078427]  io_submit_one+0x4a8/0x790
[ 1843.078975]  ? read_events+0x76/0x150
[ 1843.079510]  __se_sys_io_submit+0x98/0x1a0
[ 1843.080116]  ? syscall_trace_enter+0x1d3/0x2d0
[ 1843.080785]  do_syscall_64+0x55/0x160
[ 1843.081404]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1843.082210] RIP: 0033:0x7f6e571fc4ed
[ 1843.082763] Code: Bad RIP value.
[ 1843.083268] RSP: 002b:7ffc212b76f8 EFLAGS: 0246 ORIG_RAX: 
00d1
[ 1843.084445] RAX: ffda RBX: 7f6e4c876870 RCX: 7f6e571fc4ed
[ 1843.085545] RDX: 557c4bc11208 RSI: 0001 RDI: 7f6e4c85e000
[ 1843.086251] RBP: 7f6e4c85e000 R08: 557c4bc2b130 R09: 02f8
[ 1843.087308] R10: 557c4bbf4470 R11: 0246 R12: 0001
[ 1843.088310] R13:  R14: 557c4bc11208 R15: 7f6e2b17f070


Re: [PATCH 15/26] aio: support for IO polling

2018-12-06 Thread Benny Halevy
On Wed, 2018-12-05 at 06:10 -0700, Jens Axboe wrote:
> On 12/5/18 2:56 AM, Benny Halevy wrote:
> > > +static int aio_iopoll_getevents(struct kioctx *ctx,
> > > + struct io_event __user *event,
> > > + unsigned int *nr_events, long min, long max)
> > > +{
> > > + struct aio_kiocb *iocb;
> > > + int to_poll, polled, ret;
> > > +
> > > + /*
> > > +  * Check if we already have done events that satisfy what we need
> > > +  */
> > > + if (!list_empty(&ctx->poll_completing)) {
> > > + ret = aio_iopoll_reap(ctx, event, nr_events, max);
> > > + if (ret < 0)
> > > + return ret;
> > > + /* if min is zero, still go through a poll check */
> > 
> > A function-level comment about that would be more visible.
> 
> Yes, that might be a better idea.
> 
> > > + if (min && *nr_events >= min)
> > > 
> > > + return 0;
> > 
> > if ((min && *nr_events >= min) || *nr_events >= max) ?
> > 
> > Otherwise, if poll_completing or poll_submitted are not empty,
> > besides getting to the "Shouldn't happen" case in aio_iopoll_reap,
> > it looks like we're gonna waste some cycles and never call f_op->iopoll
> 
> Agree, that would be a better place to catch it. I'll make those two
> changes, thanks!
> 
> > > +
> > > + /*
> > > +  * Find up to 'max' worth of events to poll for, including the
> > > +  * events we already successfully polled
> > > +  */
> > > + polled = to_poll = 0;
> > > + list_for_each_entry(iocb, &ctx->poll_completing, ki_list) {
> > > + /*
> > > +  * Poll for needed events with spin == true, anything after
> > > +  * that we just check if we have more, up to max.
> > > +  */
> > > + bool spin = polled + *nr_events >= min;
> > 
> > I'm confused. Shouldn't spin == true if polled + *nr_events < min?
> 
> Heh, that does look off, and it is. I think the correct condition is:
> 
> bool spin = !polled || *nr_events < min;
> 
> and I'm not sure why that got broken. I just tested it, slight
> performance win as expected correcting this. It ties in with the above
> nr_events check - it's perfectly valid to pass min_nr == 0 and just do a
> poll check. Conversely, if we already spun, just do non-spin checks on
> followup poll checks.
> 

Yep, that makes sense.

> > > +static int __aio_iopoll_check(struct kioctx *ctx, struct io_event __user *event,
> > > +   unsigned int *nr_events, long min_nr, long max_nr)
> > > +{
> > > + int ret = 0;
> > > +
> > > + while (!*nr_events || !need_resched()) {
> > > + int tmin = 0;
> > > +
> > > + if (*nr_events < min_nr)
> > > + tmin = min_nr - *nr_events;
> > 
> > Otherwise, shouldn't the function just return 0?
> > 
> > Or perhaps you explicitly want to go through the tmin==0 case
> > in aio_iopoll_getevents if *nr_events == min_nr (or min_nr==0)?
> 
> No, we need to go through poll, if min_nr == 0, but only if min_nr == 0
> and !nr_events. But we only get here for nr_events != 0, so should be
> fine as-is.
> 
> Thanks for your review, very useful! I'll make the above changes right
> now.
> 

Thanks!