Re: [PATCH v3] blk-mq: punt failed direct issue to dispatch list

2018-12-07 Thread Mike Snitzer
On Fri, Dec 07 2018 at 12:17am -0500,
Jens Axboe  wrote:

> After the direct dispatch corruption fix, we permanently disallow direct
> dispatch of non read/write requests. This works fine off the normal IO
> path, as they will be retried like any other failed direct dispatch
> request. But for the blk_insert_cloned_request() that only DM uses to
> bypass the bottom level scheduler, we always first attempt direct
> dispatch. For some types of requests, that's now a permanent failure,
> and no amount of retrying will make that succeed. This results in a
> livelock.
> 
> Instead of making special cases for what we can direct issue, and now
> having to deal with DM solving the livelock while still retaining a BUSY
> condition feedback loop, always just add a request that has been through
> ->queue_rq() to the hardware queue dispatch list. These are safe to use
> as no merging can take place there. Additionally, if requests do have
> prepped data from drivers, we aren't dependent on them not sharing space
> in the request structure to safely add them to the IO scheduler lists.
> 
> This basically reverts ffe81d45322c and is based on a patch from Ming,
> but with the list insert case covered as well.
> 
> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
> Cc: sta...@vger.kernel.org
> Suggested-by: Ming Lei 
> Reported-by: Bart Van Assche 
> Signed-off-by: Jens Axboe 

Looks good, thanks!

Acked-by: Mike Snitzer 
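
For readers following along, a rough sketch of the behaviour the commit
message above describes (illustrative only, not the literal v3 patch; the
surrounding function name is made up):

/*
 * Sketch only: once a request has been through ->queue_rq() and comes
 * back busy, punt it to the hardware queue's dispatch list instead of
 * back into the IO scheduler, where its prepped driver data could be
 * clobbered by merging.
 */
static void punt_failed_direct_issue(struct request *rq, blk_status_t ret)
{
        if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
                /* dispatch-list entries are never merged, so this is safe */
                blk_mq_request_bypass_insert(rq, true /* run the queue */);
        }
}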


Re: [PATCH v2] block/dm: fix handling of busy off direct dispatch path

2018-12-06 Thread Mike Snitzer
On Thu, Dec 06 2018 at 11:06pm -0500,
Jens Axboe  wrote:

> On 12/6/18 8:54 PM, Mike Snitzer wrote:
> > On Thu, Dec 06 2018 at  9:49pm -0500,
> > Jens Axboe  wrote:
> > 
> >> After the direct dispatch corruption fix, we permanently disallow direct
> >> dispatch of non read/write requests. This works fine off the normal IO
> >> path, as they will be retried like any other failed direct dispatch
> >> request. But for the blk_insert_cloned_request() that only DM uses to
> >> bypass the bottom level scheduler, we always first attempt direct
> >> dispatch. For some types of requests, that's now a permanent failure,
> >> and no amount of retrying will make that succeed.
> >>
> >> Use the driver private RQF_DONTPREP to track this condition in DM. If
> >> we encounter a BUSY condition from blk_insert_cloned_request(), then
> >> flag the request with RQF_DONTPREP. When we next time see this request,
> >> ask blk_insert_cloned_request() to bypass insert the request directly.
> >> This avoids the livelock of repeatedly trying to direct dispatch a
> >> request, while still retaining the BUSY feedback loop for blk-mq so
> >> that we don't over-dispatch to the lower level queue and mess up
> >> opportunities for merging on the DM queue.
> >>
> >> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
> >> Reported-by: Bart Van Assche 
> >> Cc: sta...@vger.kernel.org
> >> Signed-off-by: Jens Axboe 
> >>
> >> ---
> >>
> >> This passes my testing as well, like the previous patch. But unlike the
> >> previous patch, we retain the BUSY feedback loop information for better
> >> merging.
> > 
> > But it is kind of gross to work around the new behaviour to "permanently
> > disallow direct dispatch of non read/write requests" by always failing
> > such requests back to DM for later immediate direct dispatch.  That
> > bouncing of the request was acceptable when there was load-based
> > justification for having to retry (and in doing so: taking the cost of
> > freeing the clone request gotten via get_request() from the underlying
> > request_queues).
> > 
> > Having to retry like this purely because the request isn't a read or
> > write seems costly.. every non-read-write will have implied
> > request_queue bouncing.  In multipath's case: it could select an
> > entirely different underlying path the next time it is destaged (with
> > RQF_DONTPREP set).  Which you'd think would negate all hope of IO
> > merging based performance improvements -- but that is a tangent I'll
> > need to ask Ming about (again).
> > 
> > I really don't like this business of bouncing requests as a workaround
> > for the recent implementation of the corruption fix.
> > 
> > Why not just add an override flag to _really_ allow direct dispatch for
> > _all_ types of requests?
> > 
> > (just peeked at linux-block and it is looking like you took 
> > jianchao.wang's series to avoid this hack... ;)
> > 
> > Awesome.. my work is done for tonight!
> 
> The whole point is doing something that is palatable to 4.20 and leaving
> the more experimental stuff to 4.21, where we have some weeks to verify
> that there are no conditions that cause IO stalls. I don't envision there
> will be, but I'm not willing to risk it this late in the 4.20 cycle.
> 
> That said, this isn't a quick and dirty and I don't think it's fair
> calling this a hack. Using RQF_DONTPREP is quite common in drivers to
> retain state over multiple ->queue_rq invocations. Using it to avoid
> multiple direct dispatch failures (and obviously this new livelock)
> seems fine to me.

But it bounces IO purely because it isn't a read or write.  That results in
guaranteed multiple blk_get_request() calls -- from the underlying
request_queues request-based DM is stacked on -- for every non-read-write IO
that is cloned.  That seems pathological.  I must still be missing something.

> I really don't want to go around and audit every driver for potential
> retained state over special commands, that's why the read+write thing is
> in place. It's the safe option, which is what we need right now.

Maybe leave blk_mq_request_issue_directly() interface how it is,
non-read-write restriction and all, but export a new
__blk_mq_request_issue_directly() that _only_
blk_insert_cloned_request() -- and future comparable users -- makes use
of?

To me that is the best of both worlds: Fix corruption issue but don't
impose needless blk_get_request() dances for non-read-write IO issued to
dm-multipath.
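
To make that suggestion concrete, something along these lines (purely a
hypothetical sketch, not a real patch; the read/write check shown is a
stand-in for whatever restriction ffe81d45322c actually enforces):

/* unrestricted issue, for callers (like blk_insert_cloned_request())
 * that know direct dispatch of the request is safe */
blk_status_t __blk_mq_request_issue_directly(struct request *rq);

static inline blk_status_t blk_mq_request_issue_directly(struct request *rq)
{
        /* normal path keeps the conservative read/write-only restriction */
        if (req_op(rq) != REQ_OP_READ && req_op(rq) != REQ_OP_WRITE)
                return BLK_STS_RESOURCE;
        return __blk_mq_request_issue_directly(rq);
}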

Mike


Re: [PATCH v2] block/dm: fix handling of busy off direct dispatch path

2018-12-06 Thread Mike Snitzer
On Thu, Dec 06 2018 at  9:49pm -0500,
Jens Axboe  wrote:

> After the direct dispatch corruption fix, we permanently disallow direct
> dispatch of non read/write requests. This works fine off the normal IO
> path, as they will be retried like any other failed direct dispatch
> request. But for the blk_insert_cloned_request() that only DM uses to
> bypass the bottom level scheduler, we always first attempt direct
> dispatch. For some types of requests, that's now a permanent failure,
> and no amount of retrying will make that succeed.
> 
> Use the driver private RQF_DONTPREP to track this condition in DM. If
> we encounter a BUSY condition from blk_insert_cloned_request(), then
> flag the request with RQF_DONTPREP. When we next time see this request,
> ask blk_insert_cloned_request() to bypass insert the request directly.
> This avoids the livelock of repeatedly trying to direct dispatch a
> request, while still retaining the BUSY feedback loop for blk-mq so
> that we don't over-dispatch to the lower level queue and mess up
> opportunities for merging on the DM queue.
> 
> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
> Reported-by: Bart Van Assche 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Jens Axboe 
> 
> ---
> 
> This passes my testing as well, like the previous patch. But unlike the
> previous patch, we retain the BUSY feedback loop information for better
> merging.

But it is kind of gross to work around the new behaviour to "permanently
disallow direct dispatch of non read/write requests" by always failing
such requests back to DM for later immediate direct dispatch.  That
bouncing of the request was acceptable when there was load-based
justification for having to retry (and in doing so: taking the cost of
freeing the clone request gotten via get_request() from the underlying
request_queues).

Having to retry like this purely because the request isn't a read or
write seems costly.. every non-read-write will have implied
request_queue bouncing.  In multipath's case: it could select an
entirely different underlying path the next time it is destaged (with
RQF_DONTPREP set).  Which you'd think would negate all hope of IO
merging based performance improvements -- but that is a tangent I'll
need to ask Ming about (again).

I really don't like this business of bouncing requests as a workaround
for the recent implementation of the corruption fix.

Why not just add an override flag to _really_ allow direct dispatch for
_all_ types of requests?

(just peeked at linux-block and it is looking like you took 
jianchao.wang's series to avoid this hack... ;)

Awesome.. my work is done for tonight!

Mike
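
For reference, the flow the quoted patch description amounts to, as a rough
sketch (hypothetical function, not the actual dm-rq code from the v2 patch):

static int dispatch_clone_sketch(struct request *clone)
{
        blk_status_t ret = blk_insert_cloned_request(clone->q, clone);

        if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
                /*
                 * Remember that direct issue already failed for this clone;
                 * on the retry blk_insert_cloned_request() would see the
                 * flag and do a plain bypass insert instead of another
                 * doomed direct-issue attempt.
                 */
                clone->rq_flags |= RQF_DONTPREP;
                return DM_MAPIO_REQUEUE;
        }
        return DM_MAPIO_SUBMITTED;
}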


Re: block: fix direct dispatch issue failure for clones

2018-12-06 Thread Mike Snitzer
On Thu, Dec 06 2018 at  8:58pm -0500,
Mike Snitzer  wrote:

> DM is forced to worry about all these details, as covered some in
> the header for commit 396eaf21ee17c476e8f66249fb1f4a39003d0ab4, it is
> trying to have its cake and eat it too.

Gah, obviously meant: DM is _NOT_ trying to have its cake and eat it too.

> It just wants IO scheduling to work for request-based DM devices.


Re: block: fix direct dispatch issue failure for clones

2018-12-06 Thread Mike Snitzer
On Thu, Dec 06 2018 at  8:34pm -0500,
Jens Axboe  wrote:

> On 12/6/18 6:22 PM, jianchao.wang wrote:
> > 
> > 
> > On 12/7/18 9:13 AM, Jens Axboe wrote:
> >> On 12/6/18 6:04 PM, jianchao.wang wrote:
> >>>
> >>>
> >>> On 12/7/18 6:20 AM, Jens Axboe wrote:
>  After the direct dispatch corruption fix, we permanently disallow direct
>  dispatch of non read/write requests. This works fine off the normal IO
>  path, as they will be retried like any other failed direct dispatch
>  request. But for the blk_insert_cloned_request() that only DM uses to
>  bypass the bottom level scheduler, we always first attempt direct
>  dispatch. For some types of requests, that's now a permanent failure,
>  and no amount of retrying will make that succeed.
> 
>  Don't use direct dispatch off the cloned insert path, always just use
>  bypass inserts. This still bypasses the bottom level scheduler, which is
>  what DM wants.
> 
>  Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
>  Signed-off-by: Jens Axboe 
> 
>  ---
> 
>  diff --git a/block/blk-core.c b/block/blk-core.c
>  index deb56932f8c4..4c44e6fa0d08 100644
>  --- a/block/blk-core.c
>  +++ b/block/blk-core.c
>  @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct 
>  request_queue *q, struct request *
>    * bypass a potential scheduler on the bottom device for
>    * insert.
>    */
>  -return blk_mq_request_issue_directly(rq);
>  +blk_mq_request_bypass_insert(rq, true);
>  +return BLK_STS_OK;
>   }
>   
>   spin_lock_irqsave(q->queue_lock, flags);
> 
> >>> Not sure about this because it will break the merging promotion for 
> >>> request based DM
> >>> from Ming.
> >>> 396eaf21ee17c476e8f66249fb1f4a39003d0ab4
> >>> (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request 
> >>> feedback)
> >>>
> >>> We could use some other way to fix this.
> >>
> >> That really shouldn't matter as this is the cloned insert, merging should
> >> have been done on the original request.
> >>
> >>
> > Just quote some comments from the patch.
> > 
> > "
> >But dm-rq currently can't get the underlying queue's
> > dispatch feedback at all.  Without knowing whether a request was issued
> > or not (e.g. due to underlying queue being busy) the dm-rq elevator will
> > not be able to provide effective IO merging (as a side-effect of dm-rq
> > currently blindly destaging a request from its elevator only to requeue
> > it after a delay, which kills any opportunity for merging).  This
> > obviously causes very bad sequential IO performance.
> > ...
> > With this, request-based DM's blk-mq sequential IO performance is vastly
> > improved (as much as 3X in mpath/virtio-scsi testing)
> > "
> > 
> > Using blk_mq_request_bypass_insert to replace the 
> > blk_mq_request_issue_directly
> > could be a fast method to fix the current issue. Maybe we could get the 
> > merging
> > promotion back after some time.
> 
> This really sucks, mostly because DM wants to have it both ways - not use
> the bottom level IO scheduler, but still actually use it if it makes sense.

Well no, that isn't what DM is doing.  DM does have an upper layer
scheduler that would like to be afforded the same capabilities that any
request-based driver is given.  Yes that comes with plumbing in safe
passage for upper layer requests dispatched from a stacked blk-mq IO
scheduler.
 
> There is another way to fix this - still do the direct dispatch, but have
> dm track if it failed and do bypass insert in that case. I didn't want do
> to that since it's more involved, but it's doable.
> 
> Let me cook that up and test it... Don't like it, though.

Not following how DM can track if issuing the request worked if it is
always told it worked with BLK_STS_OK.  We care about feedback when the
request is actually issued because of the elaborate way blk-mq elevators
work.  DM is forced to worry about all these details, as covered some in
the header for commit 396eaf21ee17c476e8f66249fb1f4a39003d0ab4, it is
trying to have its cake and eat it too.  It just wants IO scheduling to
work for request-based DM devices.  That's it.

Mike


Re: block: fix direct dispatch issue failure for clones

2018-12-06 Thread Mike Snitzer
On Thu, Dec 06 2018 at  8:13pm -0500,
Jens Axboe  wrote:

> On 12/6/18 6:04 PM, jianchao.wang wrote:
> > 
> > 
> > On 12/7/18 6:20 AM, Jens Axboe wrote:
> >> After the direct dispatch corruption fix, we permanently disallow direct
> >> dispatch of non read/write requests. This works fine off the normal IO
> >> path, as they will be retried like any other failed direct dispatch
> >> request. But for the blk_insert_cloned_request() that only DM uses to
> >> bypass the bottom level scheduler, we always first attempt direct
> >> dispatch. For some types of requests, that's now a permanent failure,
> >> and no amount of retrying will make that succeed.
> >>
> >> Don't use direct dispatch off the cloned insert path, always just use
> >> bypass inserts. This still bypasses the bottom level scheduler, which is
> >> what DM wants.
> >>
> >> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
> >> Signed-off-by: Jens Axboe 
> >>
> >> ---
> >>
> >> diff --git a/block/blk-core.c b/block/blk-core.c
> >> index deb56932f8c4..4c44e6fa0d08 100644
> >> --- a/block/blk-core.c
> >> +++ b/block/blk-core.c
> >> @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct 
> >> request_queue *q, struct request *
> >> * bypass a potential scheduler on the bottom device for
> >> * insert.
> >> */
> >> -  return blk_mq_request_issue_directly(rq);
> >> +  blk_mq_request_bypass_insert(rq, true);
> >> +  return BLK_STS_OK;
> >>}
> >>  
> >>spin_lock_irqsave(q->queue_lock, flags);
> >>
> > Not sure about this because it will break the merging promotion for request 
> > based DM
> > from Ming.
> > 396eaf21ee17c476e8f66249fb1f4a39003d0ab4
> > (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request 
> > feedback)
> > 
> > We could use some other way to fix this.
> 
> That really shouldn't matter as this is the cloned insert, merging should
> have been done on the original request.

Reading the header of 396eaf21ee17c476e8f66249fb1f4a39003d0ab4 brings me
back to relatively recent hell.

Thing is, dm-rq was the original justification and consumer of
blk_mq_request_issue_directly -- but Ming's later use of directly issuing
requests has forced fixes that didn't consider the original valid/safe
use of the interface, which is now too rigid.  dm-rq needs the
functionality the blk_mq_request_issue_directly interface provides.

Sorry to say we cannot lose the sequential IO performance improvements
that the IO merging feedback loop gives us.


Re: block: fix direct dispatch issue failure for clones

2018-12-06 Thread Mike Snitzer
On Thu, Dec 06 2018 at  8:04pm -0500,
jianchao.wang  wrote:

> 
> 
> On 12/7/18 6:20 AM, Jens Axboe wrote:
> > After the direct dispatch corruption fix, we permanently disallow direct
> > dispatch of non read/write requests. This works fine off the normal IO
> > path, as they will be retried like any other failed direct dispatch
> > request. But for the blk_insert_cloned_request() that only DM uses to
> > bypass the bottom level scheduler, we always first attempt direct
> > dispatch. For some types of requests, that's now a permanent failure,
> > and no amount of retrying will make that succeed.
> > 
> > Don't use direct dispatch off the cloned insert path, always just use
> > bypass inserts. This still bypasses the bottom level scheduler, which is
> > what DM wants.
> > 
> > Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
> > Signed-off-by: Jens Axboe 
> > 
> > ---
> > 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index deb56932f8c4..4c44e6fa0d08 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct 
> > request_queue *q, struct request *
> >  * bypass a potential scheduler on the bottom device for
> >  * insert.
> >  */
> > -   return blk_mq_request_issue_directly(rq);
> > +   blk_mq_request_bypass_insert(rq, true);
> > +   return BLK_STS_OK;
> > }
> >  
> > spin_lock_irqsave(q->queue_lock, flags);
> > 
> Not sure about this because it will break the merging promotion for request 
> based DM
> from Ming.
> 396eaf21ee17c476e8f66249fb1f4a39003d0ab4
> (blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request 
> feedback)

Ngh.. yeah, forgot about that convoluted feedback loop.

> We could use some other way to fix this.

Yeah, afraid we have to.


Re: block: fix direct dispatch issue failure for clones

2018-12-06 Thread Mike Snitzer
On Thu, Dec 06 2018 at  5:20pm -0500,
Jens Axboe  wrote:

> After the direct dispatch corruption fix, we permanently disallow direct
> dispatch of non read/write requests. This works fine off the normal IO
> path, as they will be retried like any other failed direct dispatch
> request. But for the blk_insert_cloned_request() that only DM uses to
> bypass the bottom level scheduler, we always first attempt direct
> dispatch. For some types of requests, that's now a permanent failure,
> and no amount of retrying will make that succeed.
> 
> Don't use direct dispatch off the cloned insert path, always just use
> bypass inserts. This still bypasses the bottom level scheduler, which is
> what DM wants.
> 
> Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
> Signed-off-by: Jens Axboe 
> 
> ---
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index deb56932f8c4..4c44e6fa0d08 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct 
> request_queue *q, struct request *
>* bypass a potential scheduler on the bottom device for
>* insert.
>*/
> - return blk_mq_request_issue_directly(rq);
> + blk_mq_request_bypass_insert(rq, true);
> + return BLK_STS_OK;
>   }
>  
>   spin_lock_irqsave(q->queue_lock, flags);

Not sure what this trailing spin_lock_irqsave(q->queue_lock, flags) is
about.. but this looks good.  I'll cleanup dm-rq.c to do away with the
extra STS_RESOURCE checks for its call to blk_insert_cloned_request()
once this lands.

Acked-by: Mike Snitzer 

Thanks.


Re: nvme: allow ANA support to be independent of native multipathing

2018-11-19 Thread Mike Snitzer
On Mon, Nov 19 2018 at  4:39am -0500,
Christoph Hellwig  wrote:

> On Fri, Nov 16, 2018 at 02:28:02PM -0500, Mike Snitzer wrote:
> > You rejected the idea of allowing fine-grained control over whether
> > native NVMe multipathing is enabled or not on a per-namespace basis.
> > All we have is the coarse-grained nvme_core.multipath=N knob.  Now
> > you're forecasting removing even that.  Please don't do that.
> 
> The whole point is that this hook was intended as a band aid for the
> hypothetical pre-existing setups.  Not ever for new development.

It pains me that you're marching us towards increased conflict that
needs resolution through more formal means.  And that now I'm going to
have to document the timeline of your authoritarian approach to
stifling another Linux maintainer's freedom to do his job of delivering
on functionality that has been depended on for many years.

> > Please, PLEASE take v2 of this patch.. please? ;)
> 
> See the previous mail for the plan ahead.   I'm sick and tired of you
> trying to sneak your new developments back in.

I've only ever posted patches in the open and never has it been with the
idea of sneaking anything in.  Miscategorizing my actions as such is a
gross abuse that I will not tolerate.

If we were to confine ourselves to this v2 patch I pleaded with you to
take (https://lkml.org/lkml/2018/11/17/4), then consider:

1) Hannes proposed a simplistic patch that didn't account for the fact
   that NVMe's ana workqueue wouldn't get kicked, his intent was to make
   ANA work independent of your native multipathing.  (The fact he or I
   even need to post such patches, to unwind your tight-coupling of ANA
   and multipathing, speaks to how you've calculatingly undermined any
   effort to implement proper NVMe multipathing outside of the NVMe
   driver)

2) ANA and multipathing are completely disjoint on an NVMe spec level.
   You know this.

SO: will you be taking my v2 patch for 4.21 or not?

Please advise, thanks.
Mike


Re: [PATCH 2/2] blk-mq: provide a helper to check if a queue is busy

2018-11-08 Thread Mike Snitzer
On Thu, Nov 08 2018 at 10:42am -0500,
Jens Axboe  wrote:

> Returns true if the queue currently has requests pending,
> false if not.
> 
> DM can use this to replace the atomic_inc/dec they do per device
> to see if a device is busy.
> 
> Cc: Mike Snitzer 
> Signed-off-by: Jens Axboe 

Reviewed-by: Mike Snitzer 
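
A sketch of the sort of DM-side usage being reviewed here (the helper name
is inferred from the patch title and may not match the final API):

/* assumed helper from the patch under review; name may differ */
bool blk_mq_queue_busy(struct request_queue *q);

static bool dm_queue_busy_sketch(struct request_queue *md_queue)
{
        /* replaces the per-device atomic_inc/dec in-flight accounting */
        return blk_mq_queue_busy(md_queue);
}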


Re: [PATCH 1/2] blk-mq-tag: change busy_iter_fn to return whether to continue or not

2018-11-08 Thread Mike Snitzer
On Thu, Nov 08 2018 at 10:42am -0500,
Jens Axboe  wrote:

> We have this functionality in sbitmap, but we don't export it in
> blk-mq for users of the tags busy iteration. This can be useful
> for stopping the iteration, if the caller doesn't need to find
> more requests.
> 
> Cc: Mike Snitzer 
> Signed-off-by: Jens Axboe 

Reviewed-by: Mike Snitzer 

(could add Tested-by too but probably overkill?)

Thanks,
Mike


Re: [PATCH blktests 3/3] Add NVMeOF dm-mpath tests

2018-08-17 Thread Mike Snitzer
On Fri, Aug 17 2018 at 11:46am -0400,
Bart Van Assche  wrote:

> On Fri, 2018-08-17 at 10:24 -0400, Mike Snitzer wrote:
> > Put differently: until Jens stops taking hch's pull requests despite
> > concerns being raised against hch's approach (that hch completely
> > ignores because he rules with an iron fist from the top of a mountain in
> > the Alps) we're pretty much screwed.
> 
> If everything works out as expected I will attend that mountain summit a
> few weeks from now. Unless I missed something it's still possible to join
> the summit as an attendee. I think you would be welcome.

Cannot make it.

But even if I did attend, I've had quite a few instances where hch has
said one thing in person, later changed his mind, and then silently
executed on implementing something that only serves his needs.

Not the most productive.


Re: [PATCH blktests 3/3] Add NVMeOF dm-mpath tests

2018-08-17 Thread Mike Snitzer
On Wed, Aug 15 2018 at  4:37pm -0400,
Bart Van Assche  wrote:

> Add a series of tests for the NVMeOF drivers on top of the dm-mpath
> driver. These tests are similar to the tests under tests/srp. Both
> tests use the dm-mpath driver for multipath and the loopback
> functionality of the rdma_rxe driver. The only difference is that the
> nvmeof-mp tests use the NVMeOF initiator and target drivers instead
> of the SRP initiator and target drivers.
> 
> Signed-off-by: Bart Van Assche 

I like the prospect of keeping NVMe honest by testing it with DM
multipath.

But will you grow dependent on ANA in the future?  If so then you're out
of luck given the way the NVMe driver's ANA support was maliciously
added without care for making it work without the NVMe driver's native
multipath support.

Seems very few people care about making NVMe multipath _not_ so tightly
coupled to native NVMe multipath.  And sadly I don't have time to work
on untangling the "all-in" nature of NVMe ANA and native NVMe
multipath.

Put differently: until Jens stops taking hch's pull requests despite
concerns being raised against hch's approach (that hch completely
ignores because he rules with an iron fist from the top of a mountain in
the Alps) we're pretty much screwed.

Mike


Re: [PATCH 0/3] block: fix mismatch of figuring out physical segment number

2018-07-31 Thread Mike Snitzer
On Tue, Jul 31 2018 at  6:49am -0400,
Ming Lei  wrote:

> Hi,
> 
> This patchset fixes one issue related with physical segment computation,
> which is found by Mike. In case of dm-rq, the warning of 
> 'blk_cloned_rq_check_limits:
> over max segments limit' can be triggered easily.
> 
> Follows the cause:
> 
> 1) in IO fast path(blk_queue_split()), we always figure out physical segment 
> number
> no matter the flag of QUEUE_FLAG_NO_SG_MERGE is set or not.
> 
> 2) only blk_recount_segments() and blk_recalc_rq_segments() uses the flag of
> QUEUE_FLAG_NO_SG_MERGE, but the two are only called in some unusual
> cases, such as request clone in dm-rq.
> 
> 3) the above two computation don't match, and cause the warning of
> "blk_cloned_rq_check_limits: over max segments limit".
> 
> This patchset fixes this issue by killing the queue flag since it is basically
> bypassed since v4.4, and no one complains that at all. Also multipage bvec 
> will
> come soon, and it doesn't make sense to keep QUEUE_FLAG_NO_SG_MERGE any more.

Thanks Ming, for the series:

Acked-by: Mike Snitzer 


Re: block: Fix cloning of requests with a special payload

2018-06-28 Thread Mike Snitzer
On Wed, Jun 27 2018 at  3:55pm -0400,
Bart Van Assche  wrote:

> This patch avoids that removing a path controlled by the dm-mpath driver
> while mkfs is running triggers the following kernel bug:
> 
> kernel BUG at block/blk-core.c:3347!
> invalid opcode:  [#1] PREEMPT SMP KASAN
> CPU: 20 PID: 24369 Comm: mkfs.ext4 Not tainted 4.18.0-rc1-dbg+ #2
> RIP: 0010:blk_end_request_all+0x68/0x70
> Call Trace:
>  
>  dm_softirq_done+0x326/0x3d0 [dm_mod]
>  blk_done_softirq+0x19b/0x1e0
>  __do_softirq+0x128/0x60d
>  irq_exit+0x100/0x110
>  smp_call_function_single_interrupt+0x90/0x330
>  call_function_single_interrupt+0xf/0x20
>  
> 
> Fixes: f9d03f96b988 ("block: improve handling of the magic discard payload")
> Signed-off-by: Bart Van Assche 
> Cc: Christoph Hellwig 
> Cc: Mike Snitzer 
> Cc: Ming Lei 
> Cc: Hannes Reinecke 
> Cc: Johannes Thumshirn 
> Cc: 

Acked-by: Mike Snitzer 
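
For context, the "special payload" here is the single built-in bvec carried
by discard/write-zeroes style requests flagged RQF_SPECIAL_PAYLOAD
(introduced by f9d03f96b988).  A rough sketch of what cloning such a request
needs to carry over (illustrative, not the exact fix):

static void clone_special_payload_sketch(struct request *dst,
                                         struct request *src)
{
        if (src->rq_flags & RQF_SPECIAL_PAYLOAD) {
                dst->rq_flags |= RQF_SPECIAL_PAYLOAD;
                dst->special_vec = src->special_vec;
                /* keep segment accounting consistent with the payload */
                dst->nr_phys_segments = src->nr_phys_segments;
        }
}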


Re: block: Make __bio_clone_fast() copy bi_vcnt

2018-06-28 Thread Mike Snitzer
On Thu, Jun 28 2018 at 11:21am -0400,
Bart Van Assche  wrote:

> On 06/27/18 17:30, Ming Lei wrote:
> >One core idea of immutable bvec is to use bio->bi_iter and the original
> >bvec table to iterate over anywhere in the bio. That is why .bi_io_vec
> >needs to copy, but not see any reason why .bi_vcnt needs to do.
> >
> >Do you have use cases on .bi_vcnt for cloned bio?
> 
> So far this is only a theoretical concern. There are many functions
> in the block layer that use .bi_vcnt, and it is a lot of work to
> figure out all the callers of all these functions.

No point wasting time with that.  I don't understand why Ming cares.
Your change is obviously correct.  The state should get transferred over
to reflect reality.

This patch doesn't harm anything, it just prevents some clone-specific
failure in the future.

Acked-by: Mike Snitzer 
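
The change being debated is essentially one extra assignment in
__bio_clone_fast(); roughly (a sketch of the idea, not the verbatim patch):

static void bio_clone_fast_sketch(struct bio *bio, struct bio *bio_src)
{
        bio->bi_io_vec = bio_src->bi_io_vec;  /* clone shares the bvec table */
        bio->bi_iter   = bio_src->bi_iter;    /* immutable iterator copy */
        bio->bi_vcnt   = bio_src->bi_vcnt;    /* the addition under discussion */
}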


Re: [PATCH v2] Revert "block: Add warning for bi_next not NULL in bio_endio()"

2018-06-19 Thread Mike Snitzer
On Tue, Jun 19 2018 at  1:26pm -0400,
Bart Van Assche  wrote:

> Commit 0ba99ca4838b ("block: Add warning for bi_next not NULL in
> bio_endio()") breaks the dm driver. end_clone_bio() detects whether
> or not a bio is the last bio associated with a request by checking
> the .bi_next field. Commit 0ba99ca4838b clears that field before
> end_clone_bio() has had a chance to inspect that field. Hence revert
> commit 0ba99ca4838b.
> 
> This patch avoids that KASAN reports the following complaint when
> running the srp-test software (srp-test/run_tests -c -d -r 10 -t 02-mq):
> 
> ==
> BUG: KASAN: use-after-free in bio_advance+0x11b/0x1d0
> Read of size 4 at addr 8801300e06d0 by task ksoftirqd/0/9
> 
> CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 4.18.0-rc1-dbg+ #1
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> 1.0.0-prebuilt.qemu-project.org 04/01/2014
> Call Trace:
>  dump_stack+0xa4/0xf5
>  print_address_description+0x6f/0x270
>  kasan_report+0x241/0x360
>  __asan_load4+0x78/0x80
>  bio_advance+0x11b/0x1d0
>  blk_update_request+0xa7/0x5b0
>  scsi_end_request+0x56/0x320 [scsi_mod]
>  scsi_io_completion+0x7d6/0xb20 [scsi_mod]
>  scsi_finish_command+0x1c0/0x280 [scsi_mod]
>  scsi_softirq_done+0x19a/0x230 [scsi_mod]
>  blk_mq_complete_request+0x160/0x240
>  scsi_mq_done+0x50/0x1a0 [scsi_mod]
>  srp_recv_done+0x515/0x1330 [ib_srp]
>  __ib_process_cq+0xa0/0xf0 [ib_core]
>  ib_poll_handler+0x38/0xa0 [ib_core]
>  irq_poll_softirq+0xe8/0x1f0
>  __do_softirq+0x128/0x60d
>  run_ksoftirqd+0x3f/0x60
>  smpboot_thread_fn+0x352/0x460
>  kthread+0x1c1/0x1e0
>  ret_from_fork+0x24/0x30
> 
> Allocated by task 1918:
>  save_stack+0x43/0xd0
>  kasan_kmalloc+0xad/0xe0
>  kasan_slab_alloc+0x11/0x20
>  kmem_cache_alloc+0xfe/0x350
>  mempool_alloc_slab+0x15/0x20
>  mempool_alloc+0xfb/0x270
>  bio_alloc_bioset+0x244/0x350
>  submit_bh_wbc+0x9c/0x2f0
>  __block_write_full_page+0x299/0x5a0
>  block_write_full_page+0x16b/0x180
>  blkdev_writepage+0x18/0x20
>  __writepage+0x42/0x80
>  write_cache_pages+0x376/0x8a0
>  generic_writepages+0xbe/0x110
>  blkdev_writepages+0xe/0x10
>  do_writepages+0x9b/0x180
>  __filemap_fdatawrite_range+0x178/0x1c0
>  file_write_and_wait_range+0x59/0xc0
>  blkdev_fsync+0x46/0x80
>  vfs_fsync_range+0x66/0x100
>  do_fsync+0x3d/0x70
>  __x64_sys_fsync+0x21/0x30
>  do_syscall_64+0x77/0x230
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Freed by task 9:
>  save_stack+0x43/0xd0
>  __kasan_slab_free+0x137/0x190
>  kasan_slab_free+0xe/0x10
>  kmem_cache_free+0xd3/0x380
>  mempool_free_slab+0x17/0x20
>  mempool_free+0x63/0x160
>  bio_free+0x81/0xa0
>  bio_put+0x59/0x60
>  end_bio_bh_io_sync+0x5d/0x70
>  bio_endio+0x1a7/0x360
>  blk_update_request+0xd0/0x5b0
>  end_clone_bio+0xa3/0xd0 [dm_mod]
>  bio_endio+0x1a7/0x360
>  blk_update_request+0xd0/0x5b0
>  scsi_end_request+0x56/0x320 [scsi_mod]
>  scsi_io_completion+0x7d6/0xb20 [scsi_mod]
>  scsi_finish_command+0x1c0/0x280 [scsi_mod]
>  scsi_softirq_done+0x19a/0x230 [scsi_mod]
>  blk_mq_complete_request+0x160/0x240
>  scsi_mq_done+0x50/0x1a0 [scsi_mod]
>  srp_recv_done+0x515/0x1330 [ib_srp]
>  __ib_process_cq+0xa0/0xf0 [ib_core]
>  ib_poll_handler+0x38/0xa0 [ib_core]
>  irq_poll_softirq+0xe8/0x1f0
>  __do_softirq+0x128/0x60d
> 
> The buggy address belongs to the object at 8801300e0640
>  which belongs to the cache bio-0 of size 200
> The buggy address is located 144 bytes inside of
>  200-byte region [8801300e0640, 8801300e0708)
> The buggy address belongs to the page:
> page:ea0004c03800 count:1 mapcount:0 mapping:88015a563a00 index:0x0 
> compound_mapcount: 0
> flags: 0x80008100(slab|head)
> raw: 80008100 dead0100 dead0200 88015a563a00
> raw:  00330033 0001 
> page dumped because: kasan: bad access detected
> 
> Memory state around the buggy address:
>  8801300e0580: fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc
>  8801300e0600: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
> >8801300e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  ^
>  8801300e0700: fb fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>  8801300e0780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> ==
> 
> Fixes: 0ba99ca4838b ("block: Add warning for bi_next not NULL in bio_endio()")
> Signed-off-by: Bart Van Assche 
> Cc: Kent Overstreet 
> Cc: Mike Snitzer 

Acked-by: Mike Snitzer 

Thanks Bart.
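
To illustrate the breakage the commit message describes: end_clone_bio()
infers "last bio of the cloned request" from bi_next, so clearing bi_next in
bio_endio() before end_clone_bio() runs breaks that accounting.  A minimal
sketch of the dependency (not the actual dm-rq code):

static bool clone_bio_is_last_sketch(struct bio *clone)
{
        /* only valid if bio_endio() has not already cleared ->bi_next */
        return !clone->bi_next;
}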


Re: dm-zoned-tools: add zoned disk udev rules for scheduler / dmsetup

2018-06-15 Thread Mike Snitzer
On Fri, Jun 15 2018 at  5:59am -0400,
Damien Le Moal  wrote:

> Mike,
> 
> On 6/15/18 02:58, Mike Snitzer wrote:
> > On Thu, Jun 14 2018 at  1:37pm -0400,
> > Luis R. Rodriguez  wrote:
> > 
> >> On Thu, Jun 14, 2018 at 08:38:06AM -0400, Mike Snitzer wrote:
> >>> On Wed, Jun 13 2018 at  8:11pm -0400,
> >>> Luis R. Rodriguez  wrote:
> >>>
> >>>> Setting up a zoned disks in a generic form is not so trivial. There
> >>>> is also quite a bit of tribal knowledge with these devices which is not
> >>>> easy to find.
> >>>>
> >>>> The currently supplied demo script works but it is not generic enough to 
> >>>> be
> >>>> practical for Linux distributions or even developers which often move
> >>>> from one kernel to another.
> >>>>
> >>>> This tries to put a bit of this tribal knowledge into an initial udev
> >>>> rule for development with the hopes Linux distributions can later
> >>>> deploy. Three rule are added. One rule is optional for now, it should be
> >>>> extended later to be more distribution-friendly and then I think this
> >>>> may be ready for consideration for integration on distributions.
> >>>>
> >>>> 1) scheduler setup
> >>>
> >>> This is wrong.. if zoned devices are so dependent on deadline or
> >>> mq-deadline then the kernel should allow them to be hardcoded.  I know
> >>> Jens removed the API to do so but the fact that drivers need to rely on
> >>> hacks like this udev rule to get a functional device is proof we need to
> >>> allow drivers to impose the scheduler used.
> >>
> >> This is the point to the patch as well, I actually tend to agree with you,
> >> and I had tried to draw up a patch to do just that, however its *not* 
> >> possible
> >> today to do this and would require some consensus. So from what I can tell
> >> we *have* to live with this one or a form of it. Ie a file describing which
> >> disk serial gets deadline and which one gets mq-deadline.
> >>
> >> Jens?
> >>
> >> Anyway, let's assume this is done in the kernel, which one would use 
> >> deadline,
> >> which one would use mq-deadline?
> > 
> > The zoned storage driver needs to make that call based on what mode it
> > is in.  If it is using blk-mq then it selects mq-deadline, otherwise
> > deadline.
> 
> As Bart pointed out, deadline is an alias of mq-deadline. So using
> "deadline" as the scheduler name works in both legacy and mq cases.
> 
> >>>> 2) backlist f2fs devices
> >>>
> >>> There should probably be support in dm-zoned for detecting whether a
> >>> zoned device was formatted with f2fs (assuming there is a known f2fs
> >>> superblock)?
> >>
> >> Not sure what you mean. Are you suggesting we always setup dm-zoned for
> >> all zoned disks and just make an excemption on dm-zone code to somehow
> >> use the disk directly if a filesystem supports zoned disks directly 
> >> somehow?
> > 
> > No, I'm saying that a udev rule wouldn't be needed if dm-zoned just
> > errored out if asked to consume disks that already have an f2fs
> > superblock.  And existing filesystems should get conflicting superblock
> > awareness "for free" if blkid or whatever is trained to be aware of
> > f2fs's superblock.
> 
> Well that is the case already: on startup, dm-zoned will read its own
> metadata from sector 0, same as f2fs would do with its super-block. If
> the format/magic does not match expected values, dm-zoned will bail out
> and return an error. dm-zoned metadata and f2fs metadata reside in the
> same place and overwrite each other. There is no way to get one working
> on top of the other. I do not see any possibility of a problem on startup.
> 
> But definitely, the user land format tools can step on each other toes.
> That needs fixing.

Right, I was talking about the .ctr path for initial device creation,
not activation of a previously created dm-zoned device.

But I agree it makes most sense to do this check in userspace.

> >> f2fs does not require dm-zoned. What would be required is a bit more 
> >> complex
> >> given one could dedicate portions of the disk to f2fs and other portions to
> >> another filesystem, which would require dm-zoned.
> >>
> >> Also filesystems which *do not* support zoned disks should *not* be 
> >> allowing
> >> 

Re: dm-zoned-tools: add zoned disk udev rules for scheduler / dmsetup

2018-06-14 Thread Mike Snitzer
On Thu, Jun 14 2018 at  1:37pm -0400,
Luis R. Rodriguez  wrote:

> On Thu, Jun 14, 2018 at 08:38:06AM -0400, Mike Snitzer wrote:
> > On Wed, Jun 13 2018 at  8:11pm -0400,
> > Luis R. Rodriguez  wrote:
> > 
> > > Setting up a zoned disks in a generic form is not so trivial. There
> > > is also quite a bit of tribal knowledge with these devices which is not
> > > easy to find.
> > > 
> > > The currently supplied demo script works but it is not generic enough to 
> > > be
> > > practical for Linux distributions or even developers which often move
> > > from one kernel to another.
> > > 
> > > This tries to put a bit of this tribal knowledge into an initial udev
> > > rule for development with the hopes Linux distributions can later
> > > deploy. Three rule are added. One rule is optional for now, it should be
> > > extended later to be more distribution-friendly and then I think this
> > > may be ready for consideration for integration on distributions.
> > > 
> > > 1) scheduler setup
> > 
> > This is wrong.. if zoned devices are so dependent on deadline or
> > mq-deadline then the kernel should allow them to be hardcoded.  I know
> > Jens removed the API to do so but the fact that drivers need to rely on
> > hacks like this udev rule to get a functional device is proof we need to
> > allow drivers to impose the scheduler used.
> 
> This is the point to the patch as well, I actually tend to agree with you,
> and I had tried to draw up a patch to do just that, however its *not* possible
> today to do this and would require some consensus. So from what I can tell
> we *have* to live with this one or a form of it. Ie a file describing which
> disk serial gets deadline and which one gets mq-deadline.
> 
> Jens?
> 
> Anyway, let's assume this is done in the kernel, which one would use deadline,
> which one would use mq-deadline?

The zoned storage driver needs to make that call based on what mode it
is in.  If it is using blk-mq then it selects mq-deadline, otherwise
deadline.
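
A minimal sketch of the selection being argued for (hypothetical helper,
not existing kernel code):

static const char *zoned_default_elevator(struct request_queue *q)
{
        /* blk-mq queues get mq-deadline, legacy queues get deadline */
        return q->mq_ops ? "mq-deadline" : "deadline";
}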
 
> > > 2) backlist f2fs devices
> > 
> > There should probably be support in dm-zoned for detecting whether a
> > zoned device was formatted with f2fs (assuming there is a known f2fs
> > superblock)?
> 
> Not sure what you mean. Are you suggesting we always setup dm-zoned for
> all zoned disks and just make an excemption on dm-zone code to somehow
> use the disk directly if a filesystem supports zoned disks directly somehow?

No, I'm saying that a udev rule wouldn't be needed if dm-zoned just
errored out if asked to consume disks that already have an f2fs
superblock.  And existing filesystems should get conflicting superblock
awareness "for free" if blkid or whatever is trained to be aware of
f2fs's superblock.
 
> f2fs does not require dm-zoned. What would be required is a bit more complex
> given one could dedicate portions of the disk to f2fs and other portions to
> another filesystem, which would require dm-zoned.
> 
> Also filesystems which *do not* support zoned disks should *not* be allowing
> direct setup. Today that's all filesystems other than f2fs, in the future
> that may change. Those are bullets we are allowing to trigger for users
> just waiting to shot themselves on the foot with.
> 
> So who's going to work on all the above?

It should take care of itself if existing tools are trained to be aware
of new signatures.  E.g. ext4 and xfs already are aware of one another
so that you cannot reformat a device with the other unless force is
given.

Same kind of mutual exclusion needs to happen for zoned devices.

So the zoned device tools, dm-zoned, f2fs, whatever.. they need to be
updated to not step on each other's toes.  And other filesystems' tools
need to be updated to be zoned device aware.

> The point of the udev script is to illustrate the pains to properly deploy
> zoned disks on distributions today and without a roadmap... this is what
> at least I need on my systems today to reasonably deploy these disks for
> my own development.
> 
> Consensus is indeed needed for a broader picture.

Yeap.

> > > 3) run dmsetup for the rest of devices
> > 
> > automagically running dmsetup directly from udev to create a dm-zoned
> > target is very much wrong.  It just gets in the way of proper support
> > that should be added to appropriate tools that admins use to set up their
> > zoned devices.  For instance, persistent use of dm-zoned target should
> > be made reliable with a volume manager..
> 
> Ah yes, but who's working on that? How long will it take?

No idea, as is (from my vantage point) there is close to zero demand for
zoned devices.  It won't be

Re: dm-zoned-tools: add zoned disk udev rules for scheduler / dmsetup

2018-06-14 Thread Mike Snitzer
On Wed, Jun 13 2018 at  8:11pm -0400,
Luis R. Rodriguez  wrote:

> Setting up a zoned disks in a generic form is not so trivial. There
> is also quite a bit of tribal knowledge with these devices which is not
> easy to find.
> 
> The currently supplied demo script works but it is not generic enough to be
> practical for Linux distributions or even developers which often move
> from one kernel to another.
> 
> This tries to put a bit of this tribal knowledge into an initial udev
> rule for development with the hopes Linux distributions can later
> deploy. Three rule are added. One rule is optional for now, it should be
> extended later to be more distribution-friendly and then I think this
> may be ready for consideration for integration on distributions.
> 
> 1) scheduler setup

This is wrong.. if zoned devices are so dependent on deadline or
mq-deadline then the kernel should allow them to be hardcoded.  I know
Jens removed the API to do so but the fact that drivers need to rely on
hacks like this udev rule to get a functional device is proof we need to
allow drivers to impose the scheduler used.

> 2) backlist f2fs devices

There should probably be support in dm-zoned for detecting whether a
zoned device was formatted with f2fs (assuming there is a known f2fs
superblock)?

> 3) run dmsetup for the rest of devices

automagically running dmsetup directly from udev to create a dm-zoned
target is very much wrong.  It just gets in the way of proper support
that should be added to appropriate tools that admins use to set up their
zoned devices.  For instance, persistent use of dm-zoned target should
be made reliable with a volume manager..

In general this udev script is unwelcome and makes things way worse for
the long-term success of zoned devices.

I don't dispute there is an obvious void for how to properly setup zoned
devices, but this script is _not_ what should fill that void.

So a heartfelt:

Nacked-by: Mike Snitzer 


Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()

2018-05-22 Thread Mike Snitzer
On Tue, May 22 2018 at  2:41am -0400,
Christoph Hellwig <h...@infradead.org> wrote:

> On Mon, May 21, 2018 at 07:38:55PM -0400, Kent Overstreet wrote:
> > On Mon, May 21, 2018 at 02:24:32PM -0400, Mike Snitzer wrote:
> > > Every single data structure change in this series should be reviewed for
> > > unforeseen alignment consequences.  Jens seemed to say that is
> > > worthwhile.  Not sure if he'll do it or we divide it up.  If we divide
> > > it up a temp topic branch should be published for others to inspect.
> > > 
> > > Could be alignment hasn't been a historic concern for a bunch of the
> > > data structures changed in this series.. if so then all we can do is fix
> > > up any obvious potential for false sharing.
> > 
> > Honestly, I almost never worry about alignment... the very few times I do 
> > care,
> > I use __cacheline_aligned_in_smp.
> 
> And that generally is the right stratgey.  If Mike really doesn't want
> a change we can just open code the kmalloc for the bio set there, the
> important point is that we should not keep the old API around for no
> good reason.

Again, I never said I didn't want these changes.  I just thought it
worthwhile to mention the potential for alignment quirks arising.

Reality is false sharing is quite uncommon.  So my concern was/is pretty
niche and unlikely to be applicable.

That said, I did take the opportunity to look at all the DM structures
that were modified.  The bio_set and mempool_t structs are so large that
they span multiple cachelines as is.  So I focused on eliminating
unnecessary spanning of cachelines (by non-bio_set and non-mempool_t
members) and eliminated most holes in DM structs.  This is the result:
http://people.redhat.com/msnitzer/dm-4.18-struct-reorg/

Compare before.txt vs after_kent.txt vs after_mike.txt

Nothing staggering or special.  Just something I'll add once Kent's
latest changes land.

But, I also eliminated 2 4-byte holes in struct bio_set; Jens, please
feel free to pick this up (if you think it useful):

From: Mike Snitzer <snit...@redhat.com>
Subject: [PATCH] block: eliminate 2 4-byte holes in bio_set structure

Signed-off-by: Mike Snitzer <snit...@redhat.com>
---
 include/linux/bio.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/bio.h b/include/linux/bio.h
index 5e472fc..e6fc692 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -735,7 +735,6 @@ static inline void bio_inc_remaining(struct bio *bio)
 
 struct bio_set {
struct kmem_cache *bio_slab;
-   unsigned int front_pad;
 
mempool_t bio_pool;
mempool_t bvec_pool;
@@ -743,6 +742,7 @@ struct bio_set {
mempool_t bio_integrity_pool;
mempool_t bvec_integrity_pool;
 #endif
+   unsigned int front_pad;
 
/*
 * Deadlock avoidance for stacking block drivers: see comments in



Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()

2018-05-21 Thread Mike Snitzer
On Mon, May 21 2018 at  1:37pm -0400,
Kent Overstreet  wrote:

> 
> Uh, you came across as upset and paranoid to me too. Chalk it up to email? :)

Awesome.  See how easy it is to make someone with purely constructive
questions and feedback come off as upset and paranoid?

The tipping point occurs when bait is set with:
"It's not like ".
Then:
"Let's focus on getting it reviewed, rather than pontificate on what
could potentially go all wrong with this."

Message received: less pontificating, more doing!

And here I thought that focusing on what could go wrong (across all code
touched) was review.  But OK, I'm the one that made this all weird ;)

It is what it is at this point.

> I personally don't care, I have no horse in this race. This particular patch
> series wasn't my idea, Christoph wanted all these conversions done so
> bioset_create() could be deleted. If you want us to hold off on the dm patch 
> for
> awhile until someone can get around to testing it or whatever (since I don't
> have tests for everything I pushed) that would be perfectly fine by me.

As I clarified already: this isn't about DM.

Every single data structure change in this series should be reviewed for
unforeseen alignment consequences.  Jens seemed to say that is
worthwhile.  Not sure if he'll do it or we divide it up.  If we divide
it up a temp topic branch should be published for others to inspect.

Could be alignment hasn't been a historic concern for a bunch of the
data structures changed in this series.. if so then all we can do is fix
up any obvious potential for false sharing.
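
For illustration, the usual guard against the kind of false sharing being
discussed is the cacheline-alignment annotation Kent mentioned earlier; a
minimal sketch with a made-up struct (not from this series):

struct example_dev {
        spinlock_t              lock;
        struct list_head        list;

        /* keep the hot, frequently written member on its own cacheline so
         * it does not share one with the members above */
        atomic_t                in_flight ____cacheline_aligned_in_smp;
};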

Mike


Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()

2018-05-21 Thread Mike Snitzer
On Mon, May 21 2018 at 11:36am -0400,
Jens Axboe <ax...@kernel.dk> wrote:

> On 5/21/18 9:18 AM, Mike Snitzer wrote:
> > On Mon, May 21 2018 at 11:09am -0400,
> > Jens Axboe <ax...@kernel.dk> wrote:
> > 
> >> On 5/21/18 9:04 AM, Mike Snitzer wrote:
> >>> On Mon, May 21 2018 at 10:52am -0400,
> >>> Jens Axboe <ax...@kernel.dk> wrote:
> >>>
...
> >>>> IMHO you're making a big deal out of something that should not be.
> >>>
> >>> I raised an issue that had seemingly not been considered at all.  Not
> >>> making a big deal.  Raising it for others' benefit.
> >>>
> >>>> If the dm bits are that sensitive and cache line honed to perfection
> >>>> already due to previous regressions in that area, then it might
> >>>> not be a bad idea to have some compile checks for false cacheline
> >>>> sharing between sensitive members, or spilling of a sub-struct
> >>>> into multiple cachelines.
> >>>>
> >>>> It's not like this was pushed behind your back. It's posted for
> >>>> review. It's quite possible the net change is a win for dm. Let's
> >>>> focus on getting it reviewed, rather than pontificate on what
> >>>> could potentially go all wrong with this.
> >>>
> >>> Why are you making this personal?  Or purely about DM?  I'm merely
> >>> pointing out this change isn't something that can be given a quick
> >>> blanket "looks good".
> >>
> >> I'm not making this personal at all?! You raised a (valid) concern,
> >> I'm merely stating why I don't think it's a high risk issue. I'm
> >> assuming your worry is related to dm, as those are the reports
> >> that would ultimately land on your desk.
> > 
> > Then we'll just agree to disagree with what this implies: "It's not like
> > this was pushed behind your back."
> 
> I'm afraid you've lost me now - it was not pushed behind your back,
> it was posted for review, with you on the CC list. Not trying to
> be deliberately dense here, I just don't see what our disagreement is.

You're having an off day ;)  Mondays and all?

I just raised an alignment concern that needs to be considered during
review by all stakeholders.  Wasn't upset at all.  Maybe my email tone
came off otherwise?

And then you got confused by me pointing out how it was weird for you to
suggest I felt this stuff was pushed behind my back.. and went on to
think I really think that.  It's almost like you're a confused hypnotist
seeding me with paranoid delusions. ;)

/me waits for Jens to snap his fingers so he can just slowly step away
from this line of discussion that is solidly dead now...

> > Reality is I'm fine with the change.  Just think there is follow-on work
> > (now or later) that is needed.
> 
> It's not hard to run the quick struct layout checks or alignment. If
> there's a concern, that should be done now, instead of being deferred to
> later. There's no point merging something that we expect to have
> follow-on work. If that's the case, then it should not be merged. There
> are no dependencies in the patchset, except that the last patch
> obviously can't be applied until all of the previous ones are in.

Cool, sounds good.

> I took a quick look at the struct mapped_device layout, which I'm
> assuming is the most important on the dm side. It's pretty big to
> begin with, obviously this makes it bigger since we're now
> embedding the bio_sets. On my config, doesn't show any false sharing
> that would be problematic, or layout changes that would worry me. FWIW.

Great, thanks, do you happen to have a tree you can push so others can
get a quick compile and look at the series fully applied?

BTW, I'm upstream DM maintainer but I have a role in caring for related
subsystems (e.g. block core, etc) in downstream releases.. *cough* RHEL.
So this alignment concern wasn't ever purely about DM.

Mike


Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()

2018-05-21 Thread Mike Snitzer
On Mon, May 21 2018 at 11:09am -0400,
Jens Axboe <ax...@kernel.dk> wrote:

> On 5/21/18 9:04 AM, Mike Snitzer wrote:
> > On Mon, May 21 2018 at 10:52am -0400,
> > Jens Axboe <ax...@kernel.dk> wrote:
> > 
> >> On 5/21/18 8:47 AM, Mike Snitzer wrote:
> >>> On Mon, May 21 2018 at 10:36am -0400,
> >>> Jens Axboe <ax...@kernel.dk> wrote:
> >>>
> >>>> On 5/21/18 8:31 AM, Mike Snitzer wrote:
> >>>>> On Mon, May 21 2018 at 10:19am -0400,
> >>>>> Jens Axboe <ax...@kernel.dk> wrote:
> >>>>>
> >>>>>> On 5/21/18 8:03 AM, Mike Snitzer wrote:
> >>>>>>> On Sun, May 20 2018 at  6:25pm -0400,
> >>>>>>> Kent Overstreet <kent.overstr...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Jens - this series does the rest of the conversions that Christoph 
> >>>>>>>> wanted, and
> >>>>>>>> drops bioset_create().
> >>>>>>>>
> >>>>>>>> Only lightly tested, but the changes are pretty mechanical. Based on 
> >>>>>>>> your
> >>>>>>>> for-next tree.
> >>>>>>>
> >>>>>>> By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set'
> >>>>>>> you've altered the alignment of members in data structures.  So I'll
> >>>>>>> need to audit all the data structures you've modified in DM.
> >>>>>>>
> >>>>>>> Could we get the backstory on _why_ you're making this change?
> >>>>>>> Would go a long way to helping me appreciate why this is a good use of
> >>>>>>> anyone's time.
> >>>>>>
> >>>>>> Yeah, it's in the first series, it gets rid of a pointer indirection.
> >>>>>
> >>>>> "Allows mempools to be embedded in other structs, getting rid of a
> >>>>> pointer indirection from allocation fastpaths."
> >>>>>
> >>>>> So this is about using contiguous memory or avoiding partial allocation
> >>>>> failure?  Or both?
> >>>>>
> >>>>> Or more to it?  Just trying to fully appreciate the theory behind the
> >>>>> perceived associated benefit.
> >>>>
> >>>> It's about avoiding a pointer indirection. Instead of having to
> >>>> follow a pointer to get to that struct, it's simple offset math off
> >>>> your main structure.
> >>>>
> >>>>> I do think the increased risk of these embedded bio_set and mempool_t
> >>>>> themselves crossing cachelines, or struct members that follow them doing
> >>>>> so, really detracts from these types of changes.
> >>>>
> >>>> Definitely something to look out for, though most of them should be
> >>>> per-dev structures and not in-flight structures. That makes it a bit
> >>>> less sensitive. But can't hurt to audit the layouts and adjust if
> >>>> necessary. This is why it's posted for review :-)
> >>>
> >>> This isn't something that is easily caught upfront.  Yes we can all be
> >>> busy little beavers with pahole to audit alignment.  But chances are
> >>> most people won't do it.
> >>>
> >>> Reality is there is potential for a regression due to false sharing to
> >>> creep in if a hot struct member suddenly starts straddling a cacheline.
> >>> That type of NUMA performance killer is pretty insidious and somewhat
> >>> tedious to hunt down even when looking for it with specialized tools:
> >>> https://joemario.github.io/blog/2016/09/01/c2c-blog/
> >>
> >> IMHO you're making a big deal out of something that should not be.
> > 
> > I raised an issue that had seemingly not been considered at all.  Not
> > making a big deal.  Raising it for others' benefit.
> > 
> >> If the dm bits are that sensitive and cache line honed to perfection
> >> already due to previous regressions in that area, then it might
> >> not be a bad idea to have some compile checks for false cacheline
> >> sharing between sensitive members, or spilling of a sub-struct
> >> into multiple cachelines.
> >>
> >> It's not like this was pushed behind your back. It's posted for
> >> review. It's quite possible the net change is a win for dm. Let's
> >> focus on getting it reviewed, rather than pontificate on what
> >> could potentially go all wrong with this.
> > 
> > Why are you making this personal?  Or purely about DM?  I'm merely
> > pointing out this change isn't something that can be given a quick
> > blanket "looks good".
> 
> I'm not making this personal at all?! You raised a (valid) concern,
> I'm merely stating why I don't think it's a high risk issue. I'm
> assuming your worry is related to dm, as those are the reports
> that would ultimately land on your desk.

Then we'll just agree to disagree with what this implies: "It's not like
this was pushed behind your back."

Reality is I'm fine with the change.  Just think there is follow-on work
(now or later) that is needed.

Enough said.


Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()

2018-05-21 Thread Mike Snitzer
On Mon, May 21 2018 at 10:52am -0400,
Jens Axboe <ax...@kernel.dk> wrote:

> On 5/21/18 8:47 AM, Mike Snitzer wrote:
> > On Mon, May 21 2018 at 10:36am -0400,
> > Jens Axboe <ax...@kernel.dk> wrote:
> > 
> >> On 5/21/18 8:31 AM, Mike Snitzer wrote:
> >>> On Mon, May 21 2018 at 10:19am -0400,
> >>> Jens Axboe <ax...@kernel.dk> wrote:
> >>>
> >>>> On 5/21/18 8:03 AM, Mike Snitzer wrote:
> >>>>> On Sun, May 20 2018 at  6:25pm -0400,
> >>>>> Kent Overstreet <kent.overstr...@gmail.com> wrote:
> >>>>>
> >>>>>> Jens - this series does the rest of the conversions that Christoph 
> >>>>>> wanted, and
> >>>>>> drops bioset_create().
> >>>>>>
> >>>>>> Only lightly tested, but the changes are pretty mechanical. Based on 
> >>>>>> your
> >>>>>> for-next tree.
> >>>>>
> >>>>> By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set'
> >>>>> you've altered the alignment of members in data structures.  So I'll
> >>>>> need to audit all the data structures you've modified in DM.
> >>>>>
> >>>>> Could we get the backstory on _why_ you're making this change?
> >>>>> Would go a long way to helping me appreciate why this is a good use of
> >>>>> anyone's time.
> >>>>
> >>>> Yeah, it's in the first series, it gets rid of a pointer indirection.
> >>>
> >>> "Allows mempools to be embedded in other structs, getting rid of a
> >>> pointer indirection from allocation fastpaths."
> >>>
> >>> So this is about using contiguous memory or avoiding partial allocation
> >>> failure?  Or both?
> >>>
> >>> Or more to it?  Just trying to fully appreciate the theory behind the
> >>> perceived associated benefit.
> >>
> >> It's about avoiding a pointer indirection. Instead of having to
> >> follow a pointer to get to that struct, it's simple offset math off
> >> your main structure.
> >>
> >>> I do think the increased risk of these embedded bio_set and mempool_t
> >>> themselves crossing cachelines, or struct members that follow them doing
> >>> so, really detracts from these types of changes.
> >>
> >> Definitely something to look out for, though most of them should be
> >> per-dev structures and not in-flight structures. That makes it a bit
> >> less sensitive. But can't hurt to audit the layouts and adjust if
> >> necessary. This is why it's posted for review :-)
> > 
> > This isn't something that is easily caught upfront.  Yes we can all be
> > busy little beavers with pahole to audit alignment.  But chances are
> > most people won't do it.
> > 
> > Reality is there is potential for a regression due to false sharing to
> > creep in if a hot struct member suddenly starts straddling a cacheline.
> > That type of NUMA performance killer is pretty insidious and somewhat
> > tedious to hunt down even when looking for it with specialized tools:
> > https://joemario.github.io/blog/2016/09/01/c2c-blog/
> 
> IMHO you're making a big deal out of something that should not be.

I raised an issue that had seemingly not been considered at all.  Not
making a big deal.  Raising it for others' benefit.

> If the dm bits are that sensitive and cache line honed to perfection
> already due to previous regressions in that area, then it might
> not be a bad idea to have some compile checks for false cacheline
> sharing between sensitive members, or spilling of a sub-struct
> into multiple cachelines.
> 
> It's not like this was pushed behind your back. It's posted for
> review. It's quite possible the net change is a win for dm. Let's
> focus on getting it reviewed, rather than pontificate on what
> could potentially go all wrong with this.

Why are you making this personal?  Or purely about DM?  I'm merely
pointing out this change isn't something that can be given a quick
blanket "looks good".

Mike


Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()

2018-05-21 Thread Mike Snitzer
On Mon, May 21 2018 at 10:36am -0400,
Jens Axboe <ax...@kernel.dk> wrote:

> On 5/21/18 8:31 AM, Mike Snitzer wrote:
> > On Mon, May 21 2018 at 10:19am -0400,
> > Jens Axboe <ax...@kernel.dk> wrote:
> > 
> >> On 5/21/18 8:03 AM, Mike Snitzer wrote:
> >>> On Sun, May 20 2018 at  6:25pm -0400,
> >>> Kent Overstreet <kent.overstr...@gmail.com> wrote:
> >>>
> >>>> Jens - this series does the rest of the conversions that Christoph 
> >>>> wanted, and
> >>>> drops bioset_create().
> >>>>
> >>>> Only lightly tested, but the changes are pretty mechanical. Based on your
> >>>> for-next tree.
> >>>
> >>> By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set'
> >>> you've altered the alignment of members in data structures.  So I'll
> >>> need to audit all the data structures you've modified in DM.
> >>>
> >>> Could we get the backstory on _why_ you're making this change?
> >>> Would go a long way to helping me appreciate why this is a good use of
> >>> anyone's time.
> >>
> >> Yeah, it's in the first series, it gets rid of a pointer indirection.
> > 
> > "Allows mempools to be embedded in other structs, getting rid of a
> > pointer indirection from allocation fastpaths."
> > 
> > So this is about using contiguous memory or avoiding partial allocation
> > failure?  Or both?
> > 
> > Or more to it?  Just trying to fully appreciate the theory behind the
> > perceived associated benefit.
> 
> It's about avoiding a pointer indirection. Instead of having to
> follow a pointer to get to that struct, it's simple offset math off
> your main structure.
> 
> > I do think the increased risk of these embedded bio_set and mempool_t
> > themselves crossing cachelines, or struct members that follow them doing
> > so, really detracts from these types of changes.
> 
> Definitely something to look out for, though most of them should be
> per-dev structures and not in-flight structures. That makes it a bit
> less sensitive. But can't hurt to audit the layouts and adjust if
> necessary. This is why it's posted for review :-)

This isn't something that is easily caught upfront.  Yes we can all be
busy little beavers with pahole to audit alignment.  But chances are
most people won't do it.

Reality is there is potential for a regression due to false sharing to
creep in if a hot struct member suddenly starts straddling a cacheline.
That type of NUMA performance killer is pretty insidious and somewhat
tedious to hunt down even when looking for it with specialized tools:
https://joemario.github.io/blog/2016/09/01/c2c-blog/
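
A minimal sketch of the compile-time layout check idea raised in this exchange; the struct, its members, and the choice of "hot" field are hypothetical, and pahole(1) can still be used to inspect the resulting layout by hand:

#include <linux/atomic.h>
#include <linux/bio.h>
#include <linux/cache.h>
#include <linux/kernel.h>
#include <linux/mempool.h>

struct example_dev {
	struct bio_set	bs;		/* embedded per-device pool */
	mempool_t	io_pool;	/* embedded per-device pool */
	/* hot, touched on every IO completion; keep it off the pools' lines */
	atomic_t	in_flight ____cacheline_aligned_in_smp;
};

static inline void example_dev_layout_checks(void)
{
	/* fail the build if in_flight ever shares a cacheline with io_pool */
	BUILD_BUG_ON(offsetof(struct example_dev, in_flight) / SMP_CACHE_BYTES ==
		     (offsetof(struct example_dev, io_pool) +
		      sizeof(mempool_t) - 1) / SMP_CACHE_BYTES);
}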


Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()

2018-05-21 Thread Mike Snitzer
On Mon, May 21 2018 at 10:19am -0400,
Jens Axboe <ax...@kernel.dk> wrote:

> On 5/21/18 8:03 AM, Mike Snitzer wrote:
> > On Sun, May 20 2018 at  6:25pm -0400,
> > Kent Overstreet <kent.overstr...@gmail.com> wrote:
> > 
> >> Jens - this series does the rest of the conversions that Christoph wanted, 
> >> and
> >> drops bioset_create().
> >>
> >> Only lightly tested, but the changes are pretty mechanical. Based on your
> >> for-next tree.
> > 
> > By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set'
> > you've altered the alignment of members in data structures.  So I'll
> > need to audit all the data structures you've modified in DM.
> > 
> > Could we get the backstory on _why_ you're making this change?
> > Would go a long way to helping me appreciate why this is a good use of
> > anyone's time.
> 
> Yeah, it's in the first series, it gets rid of a pointer indirection.

"Allows mempools to be embedded in other structs, getting rid of a
pointer indirection from allocation fastpaths."

So this is about using contiguous memory or avoiding partial allocation
failure?  Or both?

Or more to it?  Just trying to fully appreciate the theory behind the
perceived associated benefit.

I do think the increased risk of these embedded bio_set and mempool_t
themselves crossing cachelines, or struct members that follow them doing
so, really detracts from these types of changes.

Thanks,
Mike


Re: [PATCH 00/13] convert block layer to bioset_init()/mempool_init()

2018-05-21 Thread Mike Snitzer
On Sun, May 20 2018 at  6:25pm -0400,
Kent Overstreet  wrote:

> Jens - this series does the rest of the conversions that Christoph wanted, and
> drops bioset_create().
> 
> Only lightly tested, but the changes are pretty mechanical. Based on your
> for-next tree.

By switching 'mempool_t *' to 'mempool_t' and 'bio_set *' to 'bio_set'
you've altered the alignment of members in data structures.  So I'll
need to audit all the data structures you've modified in DM.

Could we get the backstory on _why_ you're making this change?
Would go a long way to helping me appreciate why this is a good use of
anyone's time.

Thanks,
Mike


Re: limits->max_sectors is getting set to 0, why/where? [was: Re: dm: kernel oops by divide error on v4.16+]

2018-04-09 Thread Mike Snitzer
On Mon, Apr 09 2018 at 11:51am -0400,
Mike Snitzer <snit...@redhat.com> wrote:

> On Sun, Apr 08 2018 at 12:00am -0400,
> Ming Lei <ming@redhat.com> wrote:
> 
> > Hi,
> > 
> > The following kernel oops(divide error) is triggered when running
> > xfstest(generic/347) on ext4.
> > 
> > [  442.632954] run fstests generic/347 at 2018-04-07 18:06:44
> > [  443.839480] divide error:  [#1] PREEMPT SMP PTI
> > [  443.840201] Dumping ftrace buffer:
> > [  443.840692](ftrace buffer empty)
...
> > [  443.845756] CPU: 1 PID: 29607 Comm: dmsetup Not tainted 
> > 4.16.0_f605ba97fb80_master+ #1
> > [  443.846968] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> > 1.10.2-2.fc27 04/01/2014
> > [  443.848147] RIP: 0010:pool_io_hints+0x77/0x153 [dm_thin_pool]

...

> I was able to reproduce (in my case RIP was pool_io_hints+0x45)
> 
> Which on my kernel, is:
> 
> crash> dis -l pool_io_hints+0x45
> /root/snitm/git/linux/drivers/md/dm-thin.c: 2748
> 0xc0765165 <pool_io_hints+69>:  div    %rdi
> 
> Which is drivers/md/dm-thin.c:is_factor()'s return
> !sector_div(block_size, n);
> 
> SO looking at pool_io_hints() it would seem limits->max_sectors is 0 for
> this xfstests device... why would that be!?
> 
> Clearly pool_io_hints() could stand to be more defensive with a
> !limits->max_sectors negative check but is it ever really valid for
> max_sectors to be 0?
> 
> Pretty sure the ultimate bug is outside DM (but not seeing an obvious
> place where block core would set max_sectors to 0, all blk-settings.c
> uses min_not_zero(), etc).

I successfully ran this test against the linux-dm.git
"for-4.17/dm-changes" tag that Linus merged after the block changes:
 git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git 
tags/for-4.17/dm-changes

# ./check tests/generic/347
FSTYP -- ext4
PLATFORM  -- Linux/x86_64 thegoat 4.16.0-rc5.snitm
MKFS_OPTIONS  -- /dev/mapper/test-xfstests_scratch
MOUNT_OPTIONS -- -o acl,user_xattr /dev/mapper/test-xfstests_scratch /scratch

generic/347  65s
Ran: generic/347
Passed all 1 tests

SO this would seem to implicate some regression in the 4.17 block layer
changes.


limits->max_sectors is getting set to 0, why/where? [was: Re: dm: kernel oops by divide error on v4.16+]

2018-04-09 Thread Mike Snitzer
On Sun, Apr 08 2018 at 12:00am -0400,
Ming Lei  wrote:

> Hi,
> 
> The following kernel oops(divide error) is triggered when running
> xfstest(generic/347) on ext4.
> 
> [  442.632954] run fstests generic/347 at 2018-04-07 18:06:44
> [  443.839480] divide error:  [#1] PREEMPT SMP PTI
> [  443.840201] Dumping ftrace buffer:
> [  443.840692](ftrace buffer empty)
> [  443.841195] Modules linked in: dm_thin_pool dm_persistent_data 
> dm_bio_prison dm_snapshot dm_bufio xfs libcrc32c dm_flakey isofs iTCO_wdt 
> iTCO_vendor_support lpc_ich i2c_i801 i2c_core mfd_core ip_tables sr_mod cdrom 
> sd_mod usb_storage ahci libahci libata nvme crc32c_intel nvme_core 
> virtio_scsi qemu_fw_cfg dm_mirror dm_region_hash dm_log dm_mod [last 
> unloaded: scsi_debug]
> [  443.845756] CPU: 1 PID: 29607 Comm: dmsetup Not tainted 
> 4.16.0_f605ba97fb80_master+ #1
> [  443.846968] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> 1.10.2-2.fc27 04/01/2014
> [  443.848147] RIP: 0010:pool_io_hints+0x77/0x153 [dm_thin_pool]
> [  443.848949] RSP: 0018:c90001407af0 EFLAGS: 00010246
> [  443.849679] RAX: 0400 RBX: c90001407b48 RCX: 
> 
> [  443.850969] RDX:  RSI:  RDI: 
> 
> [  443.852097] RBP: 88006ce028a0 R08:  R09: 
> 0001
> [  443.853099] R10: c90001407b20 R11: ea0001cfad60 R12: 
> 88006de62000
> [  443.854404] R13:  R14:  R15: 
> 
> [  443.856129] FS:  7fb30462d840() GS:88007bc8() 
> knlGS:
> [  443.857741] CS:  0010 DS:  ES:  CR0: 80050033
> [  443.858576] CR2: 7efc82a10440 CR3: 7e76 CR4: 
> 007606e0
> [  443.859583] DR0:  DR1:  DR2: 
> 
> [  443.860587] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [  443.861595] PKRU: 5554
> [  443.861978] Call Trace:
> [  443.862344]  dm_calculate_queue_limits+0xb5/0x262 [dm_mod]
> [  443.863128]  dm_setup_md_queue+0xe2/0x131 [dm_mod]
> [  443.863819]  table_load+0x15e/0x2a7 [dm_mod]
> [  443.864425]  ? table_clear+0xc1/0xc1 [dm_mod]
> [  443.865079]  ctl_ioctl+0x295/0x374 [dm_mod]
> [  443.865679]  dm_ctl_ioctl+0xa/0xd [dm_mod]
> [  443.866262]  vfs_ioctl+0x1e/0x2b
> [  443.866721]  do_vfs_ioctl+0x515/0x53d
> [  443.867242]  ? ksys_semctl+0xb9/0x126
> [  443.867761]  ? __fput+0x17a/0x18d
> [  443.868236]  ksys_ioctl+0x3e/0x5d
> [  443.868707]  SyS_ioctl+0xa/0xd
> [  443.869144]  do_syscall_64+0x9d/0x15e
> [  443.869669]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> [  443.870381] RIP: 0033:0x7fb303ee8dc7
> [  443.870886] RSP: 002b:7ffdc3c81478 EFLAGS: 0246 ORIG_RAX: 
> 0010
> [  443.871937] RAX: ffda RBX: 7fb3041cbec0 RCX: 
> 7fb303ee8dc7
> [  443.872925] RDX: 563591b81c30 RSI: c138fd09 RDI: 
> 0003
> [  443.873912] RBP:  R08: 7fb3042071c8 R09: 
> 7ffdc3c812e0
> [  443.874900] R10: 7fb304206683 R11: 0246 R12: 
> 
> [  443.875901] R13: 563591b81c60 R14: 563591b81c30 R15: 
> 563591b81a80
> [  443.876905] Code: 72 41 eb 33 8d 41 ff 85 c8 75 03 89 43 24 8b 43 24 44 89 
> c1 48 0f bd c8 4c 89 c8 48 d3 e0 89 43 24 8b 73 24 41 8b 44 24 38 31 d2 <48> 
> f7 f6 48 89 f1 85 d2 75 cf eb bf 31 d2 89 f8 48 f7 f1 48 85
> [  443.879519] RIP: pool_io_hints+0x77/0x153 [dm_thin_pool] RSP: 
> c90001407af0
> [  443.880549] ---[ end trace 56e7f9b41e671f53 ]---

I was able to reproduce (in my case RIP was pool_io_hints+0x45)

Which on my kernel, is:

crash> dis -l pool_io_hints+0x45
/root/snitm/git/linux/drivers/md/dm-thin.c: 2748
0xc0765165 <pool_io_hints+69>:  div    %rdi

Which is drivers/md/dm-thin.c:is_factor()'s return
!sector_div(block_size, n);

SO looking at pool_io_hints() it would seem limits->max_sectors is 0 for
this xfstests device... why would that be!?

Clearly pool_io_hints() could stand to be more defensive with a
!limits->max_sectors negative check but is it ever really valid for
max_sectors to be 0?

Pretty sure the ultimate bug is outside DM (but not seeing an obvious
place where block core would set max_sectors to 0, all blk-settings.c
uses min_not_zero(), etc).

Mike
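
A sketch of the kind of defensive check described above, not the fix that eventually went upstream: is_factor() is as quoted from drivers/md/dm-thin.c, and the max_sectors guard in pool_io_hints() is the hypothetical addition:

static bool is_factor(sector_t block_size, uint32_t n)
{
	return !sector_div(block_size, n);	/* divides by n: n == 0 oopses */
}

static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
{
	struct pool_c *pt = ti->private;
	struct pool *pool = pt->pool;

	/* never feed a zero divisor into is_factor()/sector_div() below */
	if (WARN_ON_ONCE(!limits->max_sectors))
		limits->max_sectors = BLK_DEF_MAX_SECTORS;

	if (limits->max_sectors < pool->sectors_per_block) {
		/* existing logic: shrink max_sectors to a power-of-two
		 * factor of pool->sectors_per_block using is_factor()
		 */
	}

	/* ... remainder of pool_io_hints() unchanged ... */
}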


Re: dm mpath: fix passing integrity data

2018-03-14 Thread Mike Snitzer
On Wed, Mar 14 2018 at 10:33am -0400,
Steffen Maier  wrote:

> After v4.12 commit e2460f2a4bc7 ("dm: mark targets that pass integrity
> data"), dm-multipath, e.g. on DIF+DIX SCSI disk paths, does not support
> block integrity any more. So add it to the whitelist.
> 
> This is also a pre-requisite to use block integrity with other dm layer(s)
> on top of multipath, such as kpartx partitions (dm-linear) or LVM.
> 
> Signed-off-by: Steffen Maier 
> Bisected-by: Fedor Loshakov 
> Fixes: e2460f2a4bc7 ("dm: mark targets that pass integrity data")
> Cc:  #4.12+
> ---
>  drivers/md/dm-mpath.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> index 3fde9e9faddd..c174f0c53dc9 100644
> --- a/drivers/md/dm-mpath.c
> +++ b/drivers/md/dm-mpath.c
> @@ -2023,7 +2023,8 @@ static int multipath_busy(struct dm_target *ti)
>  static struct target_type multipath_target = {
>   .name = "multipath",
>   .version = {1, 12, 0},
> - .features = DM_TARGET_SINGLETON | DM_TARGET_IMMUTABLE,
> + .features = DM_TARGET_SINGLETON | DM_TARGET_IMMUTABLE |
> + DM_TARGET_PASSES_INTEGRITY,
>   .module = THIS_MODULE,
>   .ctr = multipath_ctr,
>   .dtr = multipath_dtr,

Thanks, I've queued this for 4.16-rc6, will send to Linus tomorrow.


Re: [LSF/MM TOPIC] block: extend generic biosets to allow per-device frontpad

2018-02-02 Thread Mike Snitzer
On Fri, Feb 02 2018 at 11:08am -0500,
Mike Snitzer <snit...@redhat.com> wrote:
> 
> But if the bioset enhancements are implemented properly then the kernels
> N biosets shouldn't need to be in doubt.  They'll all just adapt to have
> N backing mempools (N being for N conflicting front_pad requirements).

This should've read:

"But if the bioset enhancements are implemented properly then the kernels
N biosets ability to provide adequate front_pad shouldn't need to be in
doubt.  They'll all just adapt to have M backing mempools (for M
conflicting front_pad requirements)."

What this implies is there would need to be a way for the bioset
code to maintain a global graph of all biosets in the system.  And when
a device comes along with a unique bioset front_pad requirement, that
isn't already met by existing mempool, the device's driver (DM in
my case) would call into a bioset interface that would add a new backing
mempool, that accounts for the front_pad increase, to each bioset in the
system.

Not liking that (DM) device creation would potentially spawn a new
mempool within each existing bioset.  It could/would easily result in
many of those mempools going completely unused.

In addition: how would a bio_alloc_bioset() call _know_ the bio was for
use on a specific block device?  The entire beauty of the existing
bio_set code, especially for upper layers like filesystems, is it _is_
device agnostic.

So all this could be the worst idea ever.. not sure.  I've deferred
judging it one way or the other because the details are shaky at best.

And I still need to look closer at all the existing code.

Mike


Re: [LSF/MM TOPIC] block: extend generic biosets to allow per-device frontpad

2018-02-02 Thread Mike Snitzer
On Fri, Feb 02 2018 at  1:19am -0500,
NeilBrown <ne...@suse.com> wrote:

> On Mon, Jan 29 2018, Mike Snitzer wrote:
> 
> > I'd like to enable bio-based DM to _not_ need to clone bios.  But to do
> > so each bio-based DM target's required per-bio-data would need to be
> > provided by upper layer biosets (as opposed to the bioset DM currently
> > creates).
> >
> > So my thinking is that all system-level biosets (e.g. fs_bio_set,
> > blkdev_dio_pool) would redirect to a device specific variant bioset IFF
> > the underlying device advertises the need for a specific per-bio-data
> > payload to be provided.
> >
> > I know this _could_ become a rathole but I'd like to avoid reverting DM
> > back to the days of having to worry about managing mempools for the
> > purpose of per-io allocations.  I've grown spoiled by the performance
> > and elegance that comes with having the bio and per-bio-data allocated
> > from the same bioset.
> >
> > Thoughts?
> 
> md/raid0 remaps each bio and passes it directly down to one of several
> devices.
> I think your scheme would mean that it would need to clone each bio to
> make sure it is from the correctly sized pool.

Not sure why the md/raid0 device would need to do anything (not unless
it wanted to take advantage of this).

The model, in my head, would be that this is _not_ intended for an
arbitrary stacked MD or DM device.  So the underlying device(s) for
a DM or MD device, that wants to leverage upper layer bio_set provided
front_pad, would be a real device (e.g. NVMe, SCSI, whatever).

But if a mix of underlying drivers were used (each with unique
per-bio-data, aka front_pad, requirements) then the blk_stack_limits()
interface _could_ build that information up.  And while that pretty much
gets us all we'd need to support md/raid0 on arbitrary stacked DM or MD
volumes that isn't what I'm interested in.  But supporting an arbitrary
stacked device could be done as follow-on work.

I'm specifically looking to optimize DM's new DM_TYPE_NVME_BIO_BASED
device (see commits 22c11858e, 978e51ba3, cd02538445, etc) so that bios
are just remapped to the underlying NVMe device without cloning.  DM
core now takes care to validate that a DM_TYPE_NVME_BIO_BASED DM device
(e.g. a DM multipath device -- via "queue_mode nvme") is _only_ stacked
directly ontop of native NVMe devices.

> I suspect it could be made to work though.
> 
> 1/ have a way for the driver receiving a bio to discover how much
>frontpad was allocated.

Yes

> 2/ require drivers to accept bios with any size of frontpad, but a
>fast-path is taken if it is already big enough.

Yes and no, the driver really should be able to trust that the block
layer is sending it bios with adequate front_pad (if the device
registered its front_pad requirements).

That said, making it optional (via "hint") is likely safer from the
standpoint that it is less cut-throat given we'd be depending on the
various biosets to respect our wishes.  I just worry that having the
code for falling back to cloning, if the front_pad isn't adequate, will
defeat much of the benefit of optimizing what is intended to be a faster
fast path.

But if the bioset enhancements are implemented properly then the kernels
N biosets shouldn't need to be in doubt.  They'll all just adapt to have
N backing mempools (N being for N conflicting front_pad requirements).

> 3/ allow a block device to advertise it's preferred frontpad.

s/preferred/required/ in my mental model.

> 4/ make sure your config-change-notification mechanism can communicate
>changes to this number.

You're referring to the notifier chain idea about restacking
queue_limits?  Yes, that'd be needed once that exists.

> 5/ gather statistics on what percentage of bios have a too-small
>frontpad.
>
> Then start modifying places that allocate bios to use the hint,
> and when benchmarks show the percentage is high - use it to encourage
> other people to allocate better bios.

This shouldn't be needed if we were to go the route where bio_sets
dynamically select the appropriate mempool based on the front_pad
requirements advertised by the underlying device ("3/" above).

Thanks,
Mike


[PATCH v6] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-30 Thread Mike Snitzer
From: Ming Lei <ming@redhat.com>

This status is returned from driver to block layer if device related
resource is unavailable, but driver can guarantee that IO dispatch
will be triggered in future when the resource is available.

Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if driver
returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after
a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls.  BLK_MQ_DELAY_QUEUE is
3 ms because both scsi-mq and nvmefc are using that magic value.

If a driver can make sure there is in-flight IO, it is safe to return
BLK_STS_DEV_RESOURCE because:

1) If all in-flight IOs complete before examining SCHED_RESTART in
blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
is run immediately in this case by blk_mq_dispatch_rq_list();

2) if there is any in-flight IO after/when examining SCHED_RESTART
in blk_mq_dispatch_rq_list():
- if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
- otherwise, this request will be dispatched after any in-flight IO is
  completed via blk_mq_sched_restart()

3) if SCHED_RESTART is set concurrently in context because of
BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
cases and make sure IO hang can be avoided.

One invariant is that queue will be rerun if SCHED_RESTART is set.

Suggested-by: Jens Axboe <ax...@kernel.dk>
Tested-by: Laurence Oberman <lober...@redhat.com>
Signed-off-by: Ming Lei <ming@redhat.com>
Signed-off-by: Mike Snitzer <snit...@redhat.com>
---
V6:
- use -EBUSY, instead of -ENOMEM, for BLK_STS_DEV_RESOURCE
- rename BLK_MQ_QUEUE_DELAY to BLK_MQ_RESOURCE_DELAY
- cleanup blk_types.h comment block above BLK_STS_DEV_RESOURCE
- all suggested by Jens
V5:
- fixed stale reference to 10ms delay in blk-mq.c comment
- revised commit header to include Ming's proof of how
  blk-mq drivers will make progress (serves to document how
  scsi_queue_rq now works)
V4:
- cleanup header and code comments
- rerun queue after BLK_MQ_QUEUE_DELAY (3ms) instead of 10ms
- eliminate nvmefc's queue rerun now that blk-mq does it
V3:
- fix typo, and improvement document
- add tested-by tag
V2:
- add comments on the new introduced status
- patch style fix
- both are suggested by Christoph

 block/blk-core.c |  1 +
 block/blk-mq.c   | 20 
 drivers/block/null_blk.c |  2 +-
 drivers/block/virtio_blk.c   |  2 +-
 drivers/block/xen-blkfront.c |  2 +-
 drivers/md/dm-rq.c   |  5 ++---
 drivers/nvme/host/fc.c   | 12 ++--
 drivers/scsi/scsi_lib.c  |  6 +++---
 include/linux/blk_types.h| 18 ++
 9 files changed, 45 insertions(+), 23 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index cdae69be68e9..79dfef84c66c 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -145,6 +145,7 @@ static const struct {
[BLK_STS_MEDIUM]= { -ENODATA,   "critical medium" },
[BLK_STS_PROTECTION]= { -EILSEQ,"protection" },
[BLK_STS_RESOURCE]  = { -ENOMEM,"kernel resource" },
+   [BLK_STS_DEV_RESOURCE]  = { -EBUSY, "device resource" },
[BLK_STS_AGAIN] = { -EAGAIN,"nonblocking retry" },
 
/* device mapper special case, should not leak out: */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 43e7449723e0..52a206e3777f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1160,6 +1160,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
**hctx,
return true;
 }
 
+#define BLK_MQ_RESOURCE_DELAY  3   /* ms units */
+
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 bool got_budget)
 {
@@ -1167,6 +1169,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
struct request *rq, *nxt;
bool no_tag = false;
int errors, queued;
+   blk_status_t ret = BLK_STS_OK;
 
if (list_empty(list))
return false;
@@ -1179,7 +1182,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
errors = queued = 0;
do {
struct blk_mq_queue_data bd;
-   blk_status_t ret;
 
rq = list_first_entry(list, struct request, queuelist);
if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
@@ -1224,7 +1226,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
}
 
ret = q->mq_ops->queue_rq(hctx, &bd);
-   if (ret == BLK_STS_RESOURCE) {
+   if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
/*
 * If an I/O scheduler has been configured and we got a

Re: [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-30 Thread Mike Snitzer
On Tue, Jan 30 2018 at  2:42pm -0500,
Bart Van Assche <bart.vanass...@wdc.com> wrote:

> On Tue, 2018-01-30 at 14:33 -0500, Mike Snitzer wrote:
> > On Tue, Jan 30 2018 at 12:52pm -0500,
> > Bart Van Assche <bart.vanass...@wdc.com> wrote:
> > 
> > > - This patch does not fix any bugs nor makes block drivers easier to
> > >   read or to implement. So why is this patch considered useful?
> > 
> > It enables blk-mq core to own the problem that individual drivers should
> > have no business needing to worry about.  Period.
> 
> Thanks for having confirmed that this patch is an API-change only and does
> not fix any bugs.

No, it is an API change that enables blk-mq drivers to make forward
progress without compromising the potential benefits associated with
blk-mq's SCHED_RESTART functionality.

(If we're going to beat this horse to death it might as well be with
precision)


Re: [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-30 Thread Mike Snitzer
On Tue, Jan 30 2018 at 12:52pm -0500,
Bart Van Assche <bart.vanass...@wdc.com> wrote:

> On 01/30/18 06:24, Mike Snitzer wrote:
> >+ *
> >+ * If driver returns BLK_STS_RESOURCE and SCHED_RESTART
> >+ * bit is set, run queue after a delay to avoid IO stalls
> >+ * that could otherwise occur if the queue is idle.
> >  */
> >-if (!blk_mq_sched_needs_restart(hctx) ||
> >+needs_restart = blk_mq_sched_needs_restart(hctx);
> >+if (!needs_restart ||
> > (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
> > blk_mq_run_hw_queue(hctx, true);
> >+else if (needs_restart && (ret == BLK_STS_RESOURCE))
> >+blk_mq_delay_run_hw_queue(hctx, BLK_MQ_QUEUE_DELAY);
> > }
> 
> If a request completes concurrently with execution of the above code
> then the request completion will trigger a call of
> blk_mq_sched_restart_hctx() and that call will clear the
> BLK_MQ_S_SCHED_RESTART bit. If that bit is cleared before the above
> code tests it then the above code will schedule an asynchronous call
> of __blk_mq_run_hw_queue(). If the .queue_rq() call triggered by the
> new queue run returns again BLK_STS_RESOURCE then the above code
> will be executed again. In other words, a loop occurs. That loop
> will repeat as long as the described race occurs. The current
> (kernel v4.15) block layer behavior is simpler: only block drivers
> call blk_mq_delay_run_hw_queue() and the block layer core never
> calls that function. Hence that loop cannot occur with the v4.15
> block layer core and block drivers. A motivation of why that loop is
> preferred compared to the current behavior (no loop) is missing.
> Does this mean that that loop is a needless complication of the
> block layer core?

No it means the loop is an internal blk-mq concern.  And that drivers
needn't worry about kicking the queue.  And it affords blk-mq latitude
to change how it responds to BLK_STS_RESOURCE in the future (without
needing to change every driver).

But even v4.15 has a similar loop.  It just happens to extend into the
.queue_rq() where the driver is completely blind to SCHED_RESTART.  And
the driver may just repeatedly kick the queue after a delay (via
blk_mq_delay_run_hw_queue).

This cycle should be a very rare occurrence regardless of which approach
is taken (V5 vs 4.15).  The fact that you engineered your SRP initiator
and target code to pathologically trigger this worst case (via
target_can_queue) doesn't mean it is the fast path for a properly
configured and functioning storage network.

> Sorry but I still prefer the v4.15 block layer approach because this
> patch has in my view the following disadvantages:
> - It involves a blk-mq API change. API changes are always risky and need
>   some time to stabilize.

The number of blk-mq API changes that have occurred since blk-mq was
introduced is a very long list.  Seems contrived to make this the one
that is breaking the camel's back.

> - The delay after which to rerun the queue is moved from block layer
>   drivers into the block layer core. I think that's wrong because only
>   the block driver authors can make a good choice for this constant.

Unsubstantiated.  3ms (scsi-mq, nvmefc) vs 100ms (dm-mq mpath): which is
better?  Pretty sure if the underlying storage network is 1) performant
2) properly configured -- then a shorter delay is preferable.

> - This patch makes block drivers harder to understand. Anyone who sees
>   return BLK_STS_RESOURCE / return BLK_STS_DEV_RESOURCE for the first
>   time will have to look up the meaning of these constants. An explicit
>   blk_mq_delay_run_hw_queue() call is easier to understand.

No it doesn't make blk-mq harder to understand.  But even if it did it
actually acknowledges that there is existing blk-mq SCHED_RESTART
heuristic for how to deal with the need for blk-mq to back-off in the
face of BLK_STS_RESOURCE returns.  By just having each blk-mq driver
blindly kick the queue you're actively ignoring, and defeating, that
entire design element of blk-mq (SCHED_RESTART).

It is an instance where blk-mq can and does know better.  Acknowledging
that fact is moving blk-mq in a better direction.

> - This patch makes the blk-mq core harder to understand because of the
>   loop mentioned above.

You've said your piece.  But you've taken on this campaign to undermine
this line of development with such passion that we're now in a place
where Jens is shell-shocked by all the repeat campaigning and noise.

Bart you keep saying the same things over and over.  Yet you cannot show
the patch to actively be a problem with testing-based detail.

Seems you'd rather refuse to even test it than open yourself up t

[PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-30 Thread Mike Snitzer
From: Ming Lei <ming@redhat.com>

This status is returned from driver to block layer if device related
resource is unavailable, but driver can guarantee that IO dispatch
will be triggered in future when the resource is available.

Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if driver
returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after
a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls.  BLK_MQ_DELAY_QUEUE is
3 ms because both scsi-mq and nvmefc are using that magic value.

If a driver can make sure there is in-flight IO, it is safe to return
BLK_STS_DEV_RESOURCE because:

1) If all in-flight IOs complete before examining SCHED_RESTART in
blk_mq_dispatch_rq_list(), SCHED_RESTART must be cleared, so queue
is run immediately in this case by blk_mq_dispatch_rq_list();

2) if there is any in-flight IO after/when examining SCHED_RESTART
in blk_mq_dispatch_rq_list():
- if SCHED_RESTART isn't set, queue is run immediately as handled in 1)
- otherwise, this request will be dispatched after any in-flight IO is
  completed via blk_mq_sched_restart()

3) if SCHED_RESTART is set concurrently in context because of
BLK_STS_RESOURCE, blk_mq_delay_run_hw_queue() will cover the above two
cases and make sure IO hang can be avoided.

One invariant is that queue will be rerun if SCHED_RESTART is set.

Suggested-by: Jens Axboe <ax...@kernel.dk>
Tested-by: Laurence Oberman <lober...@redhat.com>
Signed-off-by: Ming Lei <ming@redhat.com>
Signed-off-by: Mike Snitzer <snit...@redhat.com>
---
V5:
- fixed stale reference to 10ms delay in blk-mq.c comment
- revised commit header to include Ming's proof of how
  blk-mq drivers will make progress (serves to document how
  scsi_queue_rq now works)
V4:
- cleanup header and code comments
- rerun queue after BLK_MQ_QUEUE_DELAY (3ms) instead of 10ms
- eliminate nvmefc's queue rerun now that blk-mq does it
V3:
- fix typo, and improvement document
- add tested-by tag
V2:
- add comments on the new introduced status
- patch style fix
- both are suggested by Christoph

 block/blk-core.c |  1 +
 block/blk-mq.c   | 20 
 drivers/block/null_blk.c |  2 +-
 drivers/block/virtio_blk.c   |  2 +-
 drivers/block/xen-blkfront.c |  2 +-
 drivers/md/dm-rq.c   |  5 ++---
 drivers/nvme/host/fc.c   | 12 ++--
 drivers/scsi/scsi_lib.c  |  6 +++---
 include/linux/blk_types.h| 17 +
 9 files changed, 44 insertions(+), 23 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index cdae69be68e9..38279d4ae08b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -145,6 +145,7 @@ static const struct {
[BLK_STS_MEDIUM]= { -ENODATA,   "critical medium" },
[BLK_STS_PROTECTION]= { -EILSEQ,"protection" },
[BLK_STS_RESOURCE]  = { -ENOMEM,"kernel resource" },
+   [BLK_STS_DEV_RESOURCE]  = { -ENOMEM,"device resource" },
[BLK_STS_AGAIN] = { -EAGAIN,"nonblocking retry" },
 
/* device mapper special case, should not leak out: */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 43e7449723e0..e39b4e2a63db 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1160,6 +1160,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
**hctx,
return true;
 }
 
+#define BLK_MQ_QUEUE_DELAY 3   /* ms units */
+
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 bool got_budget)
 {
@@ -1167,6 +1169,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
struct request *rq, *nxt;
bool no_tag = false;
int errors, queued;
+   blk_status_t ret = BLK_STS_OK;
 
if (list_empty(list))
return false;
@@ -1179,7 +1182,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
errors = queued = 0;
do {
struct blk_mq_queue_data bd;
-   blk_status_t ret;
 
rq = list_first_entry(list, struct request, queuelist);
if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
@@ -1224,7 +1226,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
}
 
ret = q->mq_ops->queue_rq(hctx, &bd);
-   if (ret == BLK_STS_RESOURCE) {
+   if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
/*
 * If an I/O scheduler has been configured and we got a
 * driver tag for the next request already, free it
@@ -1255,6 +1257,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
 * that is where we will continue on next queu

[LSF/MM TOPIC] block: extend generic biosets to allow per-device frontpad

2018-01-29 Thread Mike Snitzer
I'd like to enable bio-based DM to _not_ need to clone bios.  But to do
so each bio-based DM target's required per-bio-data would need to be
provided by upper layer biosets (as opposed to the bioset DM currently
creates).

So my thinking is that all system-level biosets (e.g. fs_bio_set,
blkdev_dio_pool) would redirect to a device specific variant bioset IFF
the underlying device advertises the need for a specific per-bio-data
payload to be provided.

I know this _could_ become a rathole but I'd like to avoid reverting DM
back to the days of having to worry about managing mempools for the
purpose of per-io allocations.  I've grown spoiled by the performance
and elegance that comes with having the bio and per-bio-data allocated
from the same bioset.

Thoughts?
Mike
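
As a concrete (if simplified) sketch of the per-bio-data pattern referred to above: a bioset reserves front_pad bytes ahead of each bio it allocates, so the per-bio data comes from the same allocation and is reached with pointer math. The foo_* names are hypothetical:

#include <linux/bio.h>

struct foo_per_bio_data {
	struct foo_target *tgt;
	unsigned long start_jiffies;
};

static struct bio_set foo_bs;

static int foo_init(void)
{
	/* reserve room for foo_per_bio_data in front of every bio */
	return bioset_init(&foo_bs, BIO_POOL_SIZE,
			   sizeof(struct foo_per_bio_data), BIOSET_NEED_BVECS);
}

static struct bio *foo_alloc_bio(unsigned short nr_vecs, gfp_t gfp)
{
	/* bio and per-bio data come from one mempool allocation */
	return bio_alloc_bioset(gfp, nr_vecs, &foo_bs);
}

static struct foo_per_bio_data *foo_per_bio(struct bio *bio)
{
	/* the front_pad sits immediately in front of the struct bio */
	return (void *)((char *)bio - sizeof(struct foo_per_bio_data));
}

The open question in this topic is how a bio allocated from a generic upper-layer bioset (fs_bio_set etc.) could carry that same front_pad without DM having to clone it into its own bioset.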


[LSF/MM TOPIC] block, dm: restack queue_limits

2018-01-29 Thread Mike Snitzer
We currently don't restack the queue_limits if the lowest, or
intermediate, layer of an IO stack changes.

This is particularly unfortunate in the case of FLUSH/FUA which may
change if/when a HW controller's BBU fails; whereby requiring the device
advertise that it has a volatile write cache (WCE=1).

But in the context of DM, really it'd be best if the entire stack of
devices had their limits restacked if any underlying layer's limits
change.

In the past, Martin and I discussed that we should "just do it" but
never did.  Not sure we need a lengthy discussion but figured I'd put it
out there.

Maybe I'll find time, between now and April, to try implementing it.

Thanks,
Mike
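
A rough sketch of what "restacking" could look like, assuming a hypothetical helper that is handed the upper disk and the set of underlying devices (this is not an existing interface; real code would also need locking and to propagate further up the stack):

#include <linux/blkdev.h>
#include <linux/genhd.h>

static void restack_queue_limits(struct gendisk *upper,
				 struct block_device **lower, int nr_lower)
{
	struct queue_limits lim;
	int i;

	blk_set_stacking_limits(&lim);	/* start from permissive defaults */

	for (i = 0; i < nr_lower; i++)
		blk_stack_limits(&lim, &bdev_get_queue(lower[i])->limits,
				 get_start_sect(lower[i]));

	/* apply the freshly combined limits to the upper queue */
	upper->queue->limits = lim;
}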


Re: [LSF/MM TOPIC] Two blk-mq related topics

2018-01-29 Thread Mike Snitzer
On Mon, Jan 29 2018 at 10:46am -0500,
Ming Lei  wrote:
 
> 2. When to enable SCSI_MQ at default again?
> 
> SCSI_MQ is enabled on V3.17 firstly, but disabled at default. In V4.13-rc1,
> it is enabled at default, but later the patch is reverted in V4.13-rc7, and
> becomes disabled at default too.
> 
> Now both the original reported PM issue(actually SCSI quiesce) and the
> sequential IO performance issue have been addressed. And MQ IO schedulers
> are ready too for traditional disks. Are there other issues to be addressed
> for enabling SCSI_MQ at default? When can we do that again?
> 
> Last time, the two issues were reported during V4.13 dev cycle just when it is
> enabled at default, that seems if SCSI_MQ isn't enabled at default, it 
> wouldn't
> be exposed to run/tested completely & fully.  
> 
> So if we continue to disable it at default, maybe it can never be exposed to
> full test/production environment.

I was going to propose revisiting this as well.

I'd really like to see all the old .request_fn block core code removed.

But maybe we take a first step of enabling:
CONFIG_SCSI_MQ_DEFAULT=Y
CONFIG_DM_MQ_DEFAULT=Y

Thanks,
Mike


[PATCH V4] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-29 Thread Mike Snitzer
From: Ming Lei <ming@redhat.com>

This status is returned from driver to block layer if device related
resource is unavailable, but driver can guarantee that IO dispatch
will be triggered in future when the resource is available.

Convert some drivers to return BLK_STS_DEV_RESOURCE.  Also, if driver
returns BLK_STS_RESOURCE and SCHED_RESTART is set, rerun queue after
a delay (BLK_MQ_DELAY_QUEUE) to avoid IO stalls.  BLK_MQ_DELAY_QUEUE is
3 ms because both scsi-mq and nvmefc are using that magic value.

Suggested-by: Jens Axboe <ax...@kernel.dk>
Tested-by: Laurence Oberman <lober...@redhat.com>
Signed-off-by: Ming Lei <ming@redhat.com>
Signed-off-by: Mike Snitzer <snit...@redhat.com>
---

V4:
- cleanup header and code comments
- rerun queue after BLK_MQ_QUEUE_DELAY (3ms) instead of 10ms
- eliminate nvmefc's queue rerun now that blk-mq does it
V3:
- fix typo, and improvement document
- add tested-by tag
V2:
- add comments on the new introduced status
- patch style fix
- both are suggested by Christoph

 block/blk-core.c |  1 +
 block/blk-mq.c   | 20 
 drivers/block/null_blk.c |  2 +-
 drivers/block/virtio_blk.c   |  2 +-
 drivers/block/xen-blkfront.c |  2 +-
 drivers/md/dm-rq.c   |  5 ++---
 drivers/nvme/host/fc.c   | 12 ++--
 drivers/scsi/scsi_lib.c  |  6 +++---
 include/linux/blk_types.h| 14 ++
 9 files changed, 41 insertions(+), 23 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index cdae69be68e9..38279d4ae08b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -145,6 +145,7 @@ static const struct {
[BLK_STS_MEDIUM]= { -ENODATA,   "critical medium" },
[BLK_STS_PROTECTION]= { -EILSEQ,"protection" },
[BLK_STS_RESOURCE]  = { -ENOMEM,"kernel resource" },
+   [BLK_STS_DEV_RESOURCE]  = { -ENOMEM,"device resource" },
[BLK_STS_AGAIN] = { -EAGAIN,"nonblocking retry" },
 
/* device mapper special case, should not leak out: */
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 43e7449723e0..dd097ca5f1e9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1160,6 +1160,8 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx 
**hctx,
return true;
 }
 
+#define BLK_MQ_QUEUE_DELAY 3   /* ms units */
+
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 bool got_budget)
 {
@@ -1167,6 +1169,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
struct request *rq, *nxt;
bool no_tag = false;
int errors, queued;
+   blk_status_t ret = BLK_STS_OK;
 
if (list_empty(list))
return false;
@@ -1179,7 +1182,6 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
errors = queued = 0;
do {
struct blk_mq_queue_data bd;
-   blk_status_t ret;
 
rq = list_first_entry(list, struct request, queuelist);
if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
@@ -1224,7 +1226,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
}
 
ret = q->mq_ops->queue_rq(hctx, &bd);
-   if (ret == BLK_STS_RESOURCE) {
+   if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
/*
 * If an I/O scheduler has been configured and we got a
 * driver tag for the next request already, free it
@@ -1255,6 +1257,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
 * that is where we will continue on next queue run.
 */
if (!list_empty(list)) {
+   bool needs_restart;
+
spin_lock(&hctx->lock);
list_splice_init(list, &hctx->dispatch);
spin_unlock(&hctx->lock);
@@ -1278,10 +1282,17 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
 * - Some but not all block drivers stop a queue before
 *   returning BLK_STS_RESOURCE. Two exceptions are scsi-mq
 *   and dm-rq.
+*
+* If driver returns BLK_STS_RESOURCE and SCHED_RESTART
+* bit is set, run queue after 10ms to avoid IO stalls
+* that could otherwise occur if the queue is idle.
 */
-   if (!blk_mq_sched_needs_restart(hctx) ||
+   needs_restart = blk_mq_sched_needs_restart(hctx);
+   if (!needs_restart ||
(no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
blk_mq_run_hw_queue(hctx, true);
+   e

Re: [PATCH V3] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-27 Thread Mike Snitzer
On Sat, Jan 27 2018 at 10:00pm -0500,
Bart Van Assche <bart.vanass...@wdc.com> wrote:

> On Sat, 2018-01-27 at 21:03 -0500, Mike Snitzer wrote:
> > You cannot even be forthcoming about the technical merit of a change you
> > authored (commit 6077c2d70) that I'm left to clean up in the face of
> > performance bottlenecks it unwittingly introduced?  If you were being
> > honest: you'd grant that the random delay of 100ms is utterly baseless
> > (not to mention that kicking the queue like you did is a complete
> > hack).  So that 100ms delay is what my dm-4.16 commit is talking about.
> 
> There are multiple errors in the above:
> 1. I have already explained in detail why commit 6077c2d70 is (a) correct
>and (b) essential. See e.g. 
> https://www.redhat.com/archives/dm-devel/2018-January/msg00168.html.

And you'd be wrong.  Again.  We've already established that commit
6077c2d70 is _not_ essential.  Ming's V3, in conjunction with all the
changes that already went into block and DM for 4.16, prove that.

Yet here you go repeating yourself.  You are clinging to a patch that
absolutely had no business going in when it did.  And even when
presented with the reality that it was not the proper way to fix the
issue you observed you keep going back to it like you cured cancer with
a single line of code.

> 2. With patch "blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces
>unintended delays" applied, there is nothing to clean up anymore since
>that patch eliminates the queue delays that were triggered by
>blk_mq_delay_run_hw_queue().

The issue Ming fixed is that your random queue kicks break RESTART on
dm-mq mpath.

> 3. You know that I'm honest. Suggesting that I'm not is wrong.

No, I truly do _not_ know you're always honest.  You've conflated so
many details on this topic that it makes me have serious doubts.  You're
lashing out so much, in defense of your dm_mq_queue_rq delayed queue
kick, that you're missing there is a genuine generic blk-mq problem that
is getting fixed in Ming's V3.

> 4. I never claimed that 100ms is the optimal value for the queue
>rerunning delay. I have already explained to you that I copied that
>value from older dm-rq code.

And I pointed out to you that you most certainly did _not_ copy the
value from elsewhere:
https://www.redhat.com/archives/dm-devel/2018-January/msg00202.html

The specific delay used isn't the issue; the need to kick the queue like
this is.  Ming's V3 avoids unnecessary kicking.

> > Don't project onto me Bart.  This isn't the first time you've been
> > completely unprofessional and sadly it likely won't be the last.
> 
> The only person who is behaving unprofessionally in this e-mail thread
> is you.

Bart, the number of emails exchanged on this topic has exhausted
_everyone_.  Emotions get tense when things don't evolve despite every
opportunity for progress to be realized.  When people cling to solutions
that are no longer relevant.  There is a very real need for progress
here.  So either get out of the way or suggest a specific series of
change(s) that you feel is better than Ming's V3.

With that, I'll also stop responding to your baiting now and forevermore
(e.g. "maintainers should this.. and maintainers should not that" or
"your employer is not investing adequately".  Next time you find
yourself typing drivel like that: spare us all and hit delete).


Re: [PATCH V3] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-27 Thread Mike Snitzer
On Sat, Jan 27 2018 at  7:54pm -0500,
Bart Van Assche <bart.vanass...@wdc.com> wrote:

> On Sat, 2018-01-27 at 19:23 -0500, Mike Snitzer wrote:
> > Your contributions do _not_ make up for your inability to work well with
> > others.  Tiresome doesn't begin to describe these interactions.
> > 
> > Life is too short to continue enduring your bullshit.
> > 
> > But do let us know when you have something of substance to contribute
> > (hint: code talks).
> 
> Sorry Mike but what you wrote above is not only very gross but it is also
> wrong. What I did in my e-mail is to identify technical problems in Ming's
> work. Additionally, it seems like you forgot that recently I helped Ming?
> My patch "blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces
> unintended delays" has been queued by Jens for kernel v4.16 and solves a
> problem that Ming has been working on for two months but that he was
> unable to come up with a proper fix for. Additionally, there is something
> that you have to explain: the patch "dm mpath: don't call
> blk_mq_delay_run_hw_queue() in case of BLK_STS_RESOURCE" that you have
> queued in your tree is wrong and introduces a regression
> (https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=dm-4.16&id=316a795ad388e0c3ca613454851a28079d917a92).
> I'm surprised that you have not yet reverted that patch? BTW, in case you
> would not yet have noticed my patch "blk-mq: Avoid that
> blk_mq_delay_run_hw_queue() introduces unintended delays" eliminates the
> delay you referred to in the description of that patch.

You cannot even be forthcoming about the technical merit of a change you
authored (commit 6077c2d70) that I'm left to clean up in the face of
performance bottlenecks it unwittingly introduced?  If you were being
honest: you'd grant that the random delay of 100ms is utterly baseless
(not to mention that kicking the queue like you did is a complete
hack).  So that 100ms delay is what my dm-4.16 commit is talking about.
Not the fact that blk-mq wasn't using kblockd_mod_delayed_work_on().

Commit 6077c2d70 was wrong and never should've been introduced!  And you
even said that reintroducing commit 6077c2d70 didn't fix the regression
Ming's V3 fully corrects.  So why would I revert eliminating it exactly?

You aren't doing yourself any justice.  We're all smart enough to see
through your misdirection and labored attempt to cover your tracks.

In the past you've been very helpful with highly reasoned and technical
contributions.  But you get unhinged more often than not when it comes
to Ming's work.  That has made you one of the most duplicitous engineers
I've witnessed and had to deal with directly.  It is like Dr Jeykll and
Mr Hyde with you.

So I'm done treating you with kid gloves.  You are incapable of
responding favorably to subtle social queues or even outright pleas for
more productive behavior:
https://marc.info/?l=linux-scsi&m=151011037229460&w=2

Otherwise I wouldn't be having to respond to you right now!

> In case the above would not yet have addressed the technical issue(s) you
> are facing, I would really appreciate it if you would stop using insults and
> if you could explain what technical problems you are facing. Isn't that what
> the Linux kernel is about, namely about collaboration between technical
> people from different organizations? Isn't that what made the Linux kernel
> great?

Don't project onto me Bart.  This isn't the first time you've been
completely unprofessional and sadly it likely won't be the last.

Treat others how you want to be treated.  It really is that simple.


Re: [PATCH V3] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-27 Thread Mike Snitzer
On Sat, Jan 27 2018 at  5:12pm -0500,
Bart Van Assche <bart.vanass...@wdc.com> wrote:

> On Sat, 2018-01-27 at 14:09 -0500, Mike Snitzer wrote:
> > Ming let me know that he successfully tested this V3 patch using both
> > your test (fio to both mpath and underlying path) and Bart's (02-mq with
> > can_queue in guest).
> > 
> > Would be great if you'd review and verify this fix works for you too.
> > 
> > Ideally we'd get a fix for this regression staged for 4.16 inclusion.
> > This V3 patch seems like the best option we have at this point.
> 
> Hello Mike,
> 
> There are several issues with the patch at the start of this thread:
> - It is an unnecessary change of the block layer API. Queue stalls can
>   already be addressed with the current block layer API, namely by inserting
>   a blk_mq_delay_run_hw_queue() call before returning BLK_STS_RESOURCE.
> - The patch at the start of this thread complicates code further that is
>   already too complicated, namely the blk-mq core.

The above says _nothing_ of substance.  You talk so loudly against
Ming's work that it has gotten to the point where nothing you say
against Ming's work can be taken seriously.

> - The patch at the start of this thread introduces a regression in the
>   SCSI core, namely a queue stall if a request completion occurs concurrently
>   with the newly added BLK_MQ_S_SCHED_RESTART test in the blk-mq core.

And why is this supposed race unique to SCSI core?  

Fact is Ming dutifully implemented what Jens suggested.  And he verified
it to work.  What have you done other than play the antagonist?

> As a kernel maintainer one of your responsibilities is to help keeping the
> quality of the kernel code high. So I think that you, as a kernel maintainer,
> should tell Ming to discard this patch instead of
> asking to merge it upstream
> given all the disadvantages of this patch.

Your contributions do _not_ make up for your inability to work well with
others.  Tiresome doesn't begin to describe these interactions.

Life is too short to continue enduring your bullshit.

But do let us know when you have something of substance to contribute
(hint: code talks).


Re: [PATCH V3] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-27 Thread Mike Snitzer
On Tue, Jan 23 2018 at 10:31pm -0500,
Ming Lei  wrote:

> On Tue, Jan 23, 2018 at 04:57:34PM +, Bart Van Assche wrote:
> > On Wed, 2018-01-24 at 00:37 +0800, Ming Lei wrote:
> > > On Tue, Jan 23, 2018 at 04:24:20PM +, Bart Van Assche wrote:
> > > > My opinion about this patch is as follows:
> > > > * Changing a blk_mq_delay_run_hw_queue() call followed by return
> > > >   BLK_STS_DEV_RESOURCE into return BLK_STS_RESOURCE is wrong because it 
> > > > changes
> > > >   a guaranteed queue rerun into a queue rerun that may or may not happen
> > > >   depending on whether or not multiple queue runs happen simultaneously.
> > > 
> > > You may not understand the two:
> > > 
> > > 1) it is always safe to return BLK_STS_RESOURCE, which will make sure to
> > > avoid IO hang by blk_mq_delay_run_hw_queue() or blk_mq_run_hw_queue(),
> > > and using which one depends on SCHED_RESTART.
> > > 
> > > 2) if driver can make sure the queue will be rerun after some resource
> > > is available, either by itself or by blk-mq, it will return 
> > > BLK_STS_DEV_RESOURCE
> > > 
> > > So what is wrong with this way?
> > 
> > Sorry, I swapped BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE accidentally in 
> > my
> > reply. What I meant is that changing a blk_mq_delay_run_hw_queue() call 
> > followed
> > by return BLK_STS_RESOURCE into BLK_STS_DEV_RESOURCE is wrong and 
> > introduces a
> > race condition in code where there was no race condition.
> 
> OK, then no such race you worried about in this patch.
> 
> Jens, could you take a look at this patch?

Hi Jens,

Ming let me know that he successfully tested this V3 patch using both
your test (fio to both mpath and underlying path) and Bart's (02-mq with
can_queue in guest).

Would be great if you'd review and verify this fix works for you too.

Ideally we'd get a fix for this regression staged for 4.16 inclusion.
This V3 patch seems like the best option we have at this point.

Mike


Re: [PATCH V3] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-23 Thread Mike Snitzer
On Tue, Jan 23 2018 at 11:16am -0500,
Ming Lei <ming@redhat.com> wrote:

> This status is returned from driver to block layer if device related
> resource is run out of, but driver can guarantee that IO dispatch will
> be triggered in future when the resource is available.
> 
> This patch converts some drivers to use this return value. Meantime
> if driver returns BLK_STS_RESOURCE and S_SCHED_RESTART is marked, run
> queue after 10ms for avoiding IO hang.
> 
> Suggested-by: Jens Axboe <ax...@kernel.dk>
> Tested-by: Laurence Oberman <lober...@redhat.com>
> Cc: Christoph Hellwig <h...@infradead.org>
> Cc: Mike Snitzer <snit...@redhat.com>
> Cc: Laurence Oberman <lober...@redhat.com>
> Cc: Bart Van Assche <bart.vanass...@sandisk.com>
> Cc: Martin K. Petersen <martin.peter...@oracle.com>
> Cc: James E.J. Bottomley <j...@linux.vnet.ibm.com>
> Signed-off-by: Ming Lei <ming@redhat.com>

Acked-by: Mike Snitzer <snit...@redhat.com>

Thanks Ming


Re: [PATCH V2] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-23 Thread Mike Snitzer
On Tue, Jan 23 2018 at  9:27am -0500,
Ming Lei  wrote:

> Hello Martin,
> 
> On Tue, Jan 23, 2018 at 08:30:41AM -0500, Martin K. Petersen wrote:
> > 
> > Ming,
> > 
> > > + * Block layer and block driver specific status, which is ususally 
> > > returnd
> >   
> > ^^^
> > > + * from driver to block layer in IO path.
> > 
> > Given that the comment blurb is long and the flag not defined until
> > later, it is not entirely obvious that you are documenting
> > BLK_STS_DEV_RESOURCE. So please make that clear at the beginning of the
> > comment.
> 
> OK, how about the following document?
> 
> /*
>  * BLK_STS_DEV_RESOURC: Block layer and block driver specific status,
 ^^^ typo

>  * which is usually returned from driver to block layer in IO path.
>  *
>  * BLK_STS_DEV_RESOURCE is returned from driver to block layer if device
>  * related resource is run out of, but driver can guarantee that queue
>  * will be rerun in future for dispatching the current request when this
>  * resource is available.
>  *
>  * Difference with BLK_STS_RESOURCE:
>  * If driver isn't sure if the queue can be run again for dealing with the
>  * current request after this kind of resource is available, please return
>  * BLK_STS_SOURCE, for example, when memory allocation, DMA Mapping or other
 ^^ typo

>  * system resource allocation fails and IO can't be submitted to device,
>  * BLK_STS_RESOURCE should be used for avoiding IO hang.
>  */

In general the 2nd paragraph is one big run-on sentence.  It needs to be
broken up.
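
As a concrete illustration of the convention that comment is documenting, a
driver's ->queue_rq() would choose between the two statuses roughly as below.
This is a hypothetical sketch: example_queue_rq() and the two condition
helpers are made-up names, not code from the thread or from any real driver.

/* Hypothetical driver sketch of the BLK_STS_DEV_RESOURCE convention. */
static blk_status_t example_queue_rq(struct blk_mq_hw_ctx *hctx,
                                     const struct blk_mq_queue_data *bd)
{
        struct request *rq = bd->rq;

        /*
         * Device-owned resource (e.g. ring slots): a completion on this
         * device will free it, so blk-mq's SCHED_RESTART will rerun us.
         */
        if (example_device_ring_full(hctx))
                return BLK_STS_DEV_RESOURCE;

        /*
         * System-wide resource (memory, DMA mapping, ...): no completion is
         * guaranteed, so ask blk-mq for the delayed-rerun safety net.
         */
        if (!example_prep_atomic(rq))
                return BLK_STS_RESOURCE;

        /* ... hand rq to the hardware ... */
        return BLK_STS_OK;
}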


Re: block: neutralize blk_insert_cloned_request IO stall regression (was: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle)

2018-01-23 Thread Mike Snitzer
On Tue, Jan 23 2018 at  7:17am -0500,
Ming Lei <tom.leim...@gmail.com> wrote:

> On Tue, Jan 23, 2018 at 8:15 PM, Mike Snitzer <snit...@redhat.com> wrote:
> > On Tue, Jan 23 2018 at  5:53am -0500,
> > Ming Lei <ming@redhat.com> wrote:
> >
> >> Hi Mike,
> >>
> >> On Tue, Jan 23, 2018 at 10:22:04AM +0100, Mike Snitzer wrote:
> >> >
> >> > From: Mike Snitzer <snit...@redhat.com>
> >> > Date: Tue, 23 Jan 2018 09:40:22 +0100
> >> > Subject: [PATCH] block: neutralize blk_insert_cloned_request IO stall 
> >> > regression
> >> >
> >> > The series of blk-mq changes intended to improve sequential IO
> >> > performace (through improved merging with dm-mapth blk-mq stacked on
> >> > underlying blk-mq device).  Unfortunately these changes have caused
> >> > dm-mpath blk-mq IO stalls when blk_mq_request_issue_directly()'s call to
> >> > q->mq_ops->queue_rq() fails (due to device-specific resource
> >> > unavailability).
> >> >
> >> > Fix this by reverting back to how blk_insert_cloned_request() functioned
> >> > prior to commit 396eaf21ee -- by using blk_mq_request_bypass_insert()
> >> > instead of blk_mq_request_issue_directly().
> >> >
> >> > In the future, this commit should be reverted as the first change in a
> >> > followup series of changes that implements a comprehensive solution to
> >> > allowing an underlying blk-mq queue's resource limitation to trigger the
> >> > upper blk-mq queue to run once that underlying limited resource is
> >> > replenished.
> >> >
> >> > Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via 
> >> > blk_insert_cloned_request feedback")
> >> > Signed-off-by: Mike Snitzer <snit...@redhat.com>
> >> > ---
> >> >  block/blk-core.c | 3 ++-
> >> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >> >
> >> > diff --git a/block/blk-core.c b/block/blk-core.c
> >> > index cdae69be68e9..a224f282b4a6 100644
> >> > --- a/block/blk-core.c
> >> > +++ b/block/blk-core.c
> >> > @@ -2520,7 +2520,8 @@ blk_status_t blk_insert_cloned_request(struct 
> >> > request_queue *q, struct request *
> >> >  * bypass a potential scheduler on the bottom device for
> >> >  * insert.
> >> >  */
> >> > -   return blk_mq_request_issue_directly(rq);
> >> > +   blk_mq_request_bypass_insert(rq, true);
> >> > +   return BLK_STS_OK;
> >> > }
> >>
> >> If this patch is for fixing IO stall on DM, it isn't needed, and actually
> >> it can't fix the IO stall issue.
> >
> > It should "fix it" in conjunction with this:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16=87b7e73546b55f4a3a37d4726daedd9a17a20509
> >
> > (Bart already said as much, I just effectively enabled the equivalent of
> > his reverts)
> 
> If the generic solution is take, Bart's revert isn't needed.

Yes, we need Bart to verify your v2 patch:
https://patchwork.kernel.org/patch/10179945/

Bart please test this tree:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-next

Please report back with your results (and include details on any changes
you make to the tree, hopefully no changes are needed).

Thanks,
Mike


Re: block: neutralize blk_insert_cloned_request IO stall regression (was: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle)

2018-01-23 Thread Mike Snitzer
On Tue, Jan 23 2018 at  5:53am -0500,
Ming Lei <ming@redhat.com> wrote:

> Hi Mike,
> 
> On Tue, Jan 23, 2018 at 10:22:04AM +0100, Mike Snitzer wrote:
> > On Thu, Jan 18 2018 at  5:20pm -0500,
> > Bart Van Assche <bart.vanass...@wdc.com> wrote:
> > 
> > > On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote:
> > > > And yet Laurence cannot reproduce any such lockups with your test...
> > > 
> > > Hmm ... maybe I misunderstood Laurence but I don't think that Laurence has
> > > already succeeded at running an unmodified version of my tests. In one of 
> > > the
> > > e-mails Laurence sent me this morning I read that he modified these 
> > > scripts
> > > to get past a kernel module unload failure that was reported while 
> > > starting
> > > these tests. So the next step is to check which changes were made to the 
> > > test
> > > scripts and also whether the test results are still valid.
> > > 
> > > > Are you absolutely certain this patch doesn't help you?
> > > > https://patchwork.kernel.org/patch/10174037/
> > > > 
> > > > If it doesn't then that is actually very useful to know.
> > > 
> > > The first I tried this morning is to run the srp-test software against a 
> > > merge
> > > of Jens' for-next branch and your dm-4.16 branch. Since I noticed that 
> > > the dm
> > > queue locked up I reinserted a blk_mq_delay_run_hw_queue() call in the dm 
> > > code.
> > > Since even that was not sufficient I tried to kick the queues via debugfs 
> > > (for
> > > s in /sys/kernel/debug/block/*/state; do echo kick >$s; done). Since that 
> > > was
> > > not sufficient to resolve the queue stall I reverted the following tree 
> > > patches
> > > that are in Jens' tree:
> > > * "blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request 
> > > feedback"
> > > * "blk-mq-sched: remove unused 'can_block' arg from 
> > > blk_mq_sched_insert_request"
> > > * "blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue 
> > > is busy"
> > > 
> > > Only after I had done this the srp-test software ran again without 
> > > triggering
> > > dm queue lockups.
> > 
> > Given that Ming's notifier-based patchset needs more development time I
> > think we're unfortunately past the point where we can comfortably wait
> > for that to be ready.
> > 
> > So we need to explore alternatives to fixing this IO stall regression.
> 
> The fix for IO stall doesn't need the notifier-based patchset, and only
> the 1st patch is enough for fixing the IO stall. And it is a generic
> issue, which need generic solution, that is the conclusion made by
> Jens and me.
> 
>   https://marc.info/?l=linux-kernel=151638176727612=2

That's fine if Bart verifies it.

> And the notifier-based patchset is for solving the performance issue
> reported by Jens:
> 
>   - run IO on dm-mpath
>   - run background IO on low depth underlying queue
>   - then IO performance on dm-mpath is extremely slow
> 
> I will send out the 1st patch of 'blk-mq: introduce BLK_STS_DEV_RESOURCE'
> soon, but the notifier-based patchset shouldn't be very urgent, since
> the above test case isn't usual in reality.
> 
> > Rather than attempt the above block reverts (which is an incomplete
> > listing given newer changes): might we develop a more targeted code
> > change to neutralize commit 396eaf21ee ("blk-mq: improve DM's blk-mq IO
> > merging via blk_insert_cloned_request feedback")? -- which, given Bart's
> > findings above, seems to be the most problematic block commit.
> 
> The stall isn't related with commit 396eaf21ee too.
> 
> > 
> > To that end, assuming I drop this commit from dm-4.16:
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16=316a795ad388e0c3ca613454851a28079d917a92
> > 
> > Here is my proposal for putting this regression behind us for 4.16
> > (Ming's line of development would continue and hopefully be included in
> > 4.17):
> 
> Actually notifier based approach is ready, even cache for clone is ready
> too, but the test result isn't good enough on random IO on Jens's above
> case, and sequential IO is much better with both cache clone and
> notifier based allocation(much better than non-mq). And follows the tree
> if anyone is interested:
> 
> https://github.com/ming1/linux/commits/v4.15-rc-block-dm-mpat

[PATCH] block: neutralize blk_insert_cloned_request IO stall regression (was: Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle)

2018-01-23 Thread Mike Snitzer
On Thu, Jan 18 2018 at  5:20pm -0500,
Bart Van Assche <bart.vanass...@wdc.com> wrote:

> On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote:
> > And yet Laurence cannot reproduce any such lockups with your test...
> 
> Hmm ... maybe I misunderstood Laurence but I don't think that Laurence has
> already succeeded at running an unmodified version of my tests. In one of the
> e-mails Laurence sent me this morning I read that he modified these scripts
> to get past a kernel module unload failure that was reported while starting
> these tests. So the next step is to check which changes were made to the test
> scripts and also whether the test results are still valid.
> 
> > Are you absolutely certain this patch doesn't help you?
> > https://patchwork.kernel.org/patch/10174037/
> > 
> > If it doesn't then that is actually very useful to know.
> 
> The first I tried this morning is to run the srp-test software against a merge
> of Jens' for-next branch and your dm-4.16 branch. Since I noticed that the dm
> queue locked up I reinserted a blk_mq_delay_run_hw_queue() call in the dm 
> code.
> Since even that was not sufficient I tried to kick the queues via debugfs (for
> s in /sys/kernel/debug/block/*/state; do echo kick >$s; done). Since that was
> not sufficient to resolve the queue stall I reverted the following tree 
> patches
> that are in Jens' tree:
> * "blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request 
> feedback"
> * "blk-mq-sched: remove unused 'can_block' arg from 
> blk_mq_sched_insert_request"
> * "blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is 
> busy"
> 
> Only after I had done this the srp-test software ran again without triggering
> dm queue lockups.

Given that Ming's notifier-based patchset needs more development time I
think we're unfortunately past the point where we can comfortably wait
for that to be ready.

So we need to explore alternatives to fixing this IO stall regression.
Rather than attempt the above block reverts (which is an incomplete
listing given newer changes): might we develop a more targeted code
change to neutralize commit 396eaf21ee ("blk-mq: improve DM's blk-mq IO
merging via blk_insert_cloned_request feedback")? -- which, given Bart's
findings above, seems to be the most problematic block commit.

To that end, assuming I drop this commit from dm-4.16:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16=316a795ad388e0c3ca613454851a28079d917a92

Here is my proposal for putting this regression behind us for 4.16
(Ming's line of development would continue and hopefully be included in
4.17):

From: Mike Snitzer <snit...@redhat.com>
Date: Tue, 23 Jan 2018 09:40:22 +0100
Subject: [PATCH] block: neutralize blk_insert_cloned_request IO stall regression

The series of blk-mq changes was intended to improve sequential IO
performance (through improved merging with dm-mpath blk-mq stacked on an
underlying blk-mq device).  Unfortunately these changes have caused
dm-mpath blk-mq IO stalls when blk_mq_request_issue_directly()'s call to
q->mq_ops->queue_rq() fails (due to device-specific resource
unavailability).

Fix this by reverting back to how blk_insert_cloned_request() functioned
prior to commit 396eaf21ee -- by using blk_mq_request_bypass_insert()
instead of blk_mq_request_issue_directly().

In the future, this commit should be reverted as the first change in a
followup series of changes that implements a comprehensive solution to
allowing an underlying blk-mq queue's resource limitation to trigger the
upper blk-mq queue to run once that underlying limited resource is
replenished.

Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via 
blk_insert_cloned_request feedback")
Signed-off-by: Mike Snitzer <snit...@redhat.com>
---
 block/blk-core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index cdae69be68e9..a224f282b4a6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2520,7 +2520,8 @@ blk_status_t blk_insert_cloned_request(struct 
request_queue *q, struct request *
 * bypass a potential scheduler on the bottom device for
 * insert.
 */
-   return blk_mq_request_issue_directly(rq);
+   blk_mq_request_bypass_insert(rq, true);
+   return BLK_STS_OK;
}
 
spin_lock_irqsave(q->queue_lock, flags);
-- 
2.15.0



Re: blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-20 Thread Mike Snitzer
On Sat, Jan 20 2018 at  7:57pm -0500,
Ming Lei <ming@redhat.com> wrote:

> On Sat, Jan 20, 2018 at 12:30:02PM -0500, Mike Snitzer wrote:
> > On Sat, Jan 20 2018 at  8:48am -0500,
> > Ming Lei <ming@redhat.com> wrote:
> > 
> > > This status is returned from driver to block layer if device related
> > > resource is run out of, but driver can guarantee that IO dispatch is
> > > triggered in future when the resource is available.
> > > 
> > > This patch converts some drivers to use this return value. Meantime
> > > if driver returns BLK_STS_RESOURCE and S_SCHED_RESTART is marked, run
> > > queue after 10ms for avoiding IO hang.
> > > 
> > > Suggested-by: Jens Axboe <ax...@kernel.dk>
> > > Cc: Mike Snitzer <snit...@redhat.com>
> > > Cc: Laurence Oberman <lober...@redhat.com>
> > > Cc: Bart Van Assche <bart.vanass...@sandisk.com>
> > > Signed-off-by: Ming Lei <ming@redhat.com>
> > > ---
> > >  block/blk-core.c |  1 +
> > >  block/blk-mq.c   | 20 
> > >  drivers/block/null_blk.c |  2 +-
> > >  drivers/block/virtio_blk.c   |  2 +-
> > >  drivers/block/xen-blkfront.c |  2 +-
> > >  drivers/scsi/scsi_lib.c  |  6 +++---
> > >  include/linux/blk_types.h|  7 +++
> > >  7 files changed, 30 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > index 01f271d40825..6e97e0bf8178 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -1226,7 +1226,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue 
> > > *q, struct list_head *list,
> > >   }
> > >  
> > >   ret = q->mq_ops->queue_rq(hctx, );
> > > - if (ret == BLK_STS_RESOURCE) {
> > > + if ((ret == BLK_STS_RESOURCE) ||
> > > + (ret == BLK_STS_DEV_RESOURCE)) {
> > >   /*
> > >* If an I/O scheduler has been configured and we got a
> > >* driver tag for the next request already, free it
> > 
> > Just a nit, but this should be on one line.
> 
> It is too long, and my editor starts to highlight/complain it, :-)

Look at the lines immediately following it; yours isn't longer than
them.

> > > @@ -1764,6 +1775,7 @@ static blk_status_t __blk_mq_issue_directly(struct 
> > > blk_mq_hw_ctx *hctx,
> > >   *cookie = new_cookie;
> > >   break;
> > >   case BLK_STS_RESOURCE:
> > > + case BLK_STS_DEV_RESOURCE:
> > >   __blk_mq_requeue_request(rq);
> > >   break;
> > >   default:
> > 
> > It seems the strategy for BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE is
> > too muddled: calling __blk_mq_requeue_request() for both will cause
> > underlying blk-mq driver to retain the request, won't it?
> 
> blk_mq_request_issue_directly() is used by driver(dm-rq) on underlying
> queue, and driver need to deal with underlying queue busy, now we simply
> free the (underlying)request and feedback the busy status to blk-mq via
> dm-rq.
> 
> Except for blk_mq_request_issue_directly(), this request need to be
> requeued, and is retained by blk-mq in hctx->dispatch_list.
> 
> The difference is that if driver returns BLK_STS_DEV_RESOURCE, the queue
> will be rerun when resource is available, so don't need to run queue after
> a delay for avoiding IO hang explicitly.

Yes, I understand the intent.

> > > @@ -1826,7 +1838,7 @@ static void blk_mq_try_issue_directly(struct 
> > > blk_mq_hw_ctx *hctx,
> > >   hctx_lock(hctx, _idx);
> > >  
> > >   ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
> > > - if (ret == BLK_STS_RESOURCE)
> > > + if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE))
> > >   blk_mq_sched_insert_request(rq, false, true, false);
> > >   else if (ret != BLK_STS_OK)
> > >   blk_mq_end_request(rq, ret);
> > 
> > For this normal (non dm-mpath) case the request gets re-inserted;
> > dm-mpath must avoid that.
> > 
> > But with dm-mpath, which instead uses blk_mq_request_issue_directly(),
> > we're driving IO with stacked blk-mq drivers.  If the underlying blk-mq 
> > driver (e.g. scsi-mq or nvme) is made to retain the request, using
> > __blk_mq_issue_directly()'s call to __blk_mq_requeue_request() above,
> > then dm-mpath will not have the ability to requeue the request without
> &

Re: blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-20 Thread Mike Snitzer
On Sat, Jan 20 2018 at  8:48am -0500,
Ming Lei <ming@redhat.com> wrote:

> This status is returned from driver to block layer if device related
> resource is run out of, but driver can guarantee that IO dispatch is
> triggered in future when the resource is available.
> 
> This patch converts some drivers to use this return value. Meantime
> if driver returns BLK_STS_RESOURCE and S_SCHED_RESTART is marked, run
> queue after 10ms for avoiding IO hang.
> 
> Suggested-by: Jens Axboe <ax...@kernel.dk>
> Cc: Mike Snitzer <snit...@redhat.com>
> Cc: Laurence Oberman <lober...@redhat.com>
> Cc: Bart Van Assche <bart.vanass...@sandisk.com>
> Signed-off-by: Ming Lei <ming@redhat.com>
> ---
>  block/blk-core.c |  1 +
>  block/blk-mq.c   | 20 
>  drivers/block/null_blk.c |  2 +-
>  drivers/block/virtio_blk.c   |  2 +-
>  drivers/block/xen-blkfront.c |  2 +-
>  drivers/scsi/scsi_lib.c  |  6 +++---
>  include/linux/blk_types.h|  7 +++
>  7 files changed, 30 insertions(+), 10 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 01f271d40825..6e97e0bf8178 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1226,7 +1226,8 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
> struct list_head *list,
>   }
>  
>   ret = q->mq_ops->queue_rq(hctx, );
> - if (ret == BLK_STS_RESOURCE) {
> + if ((ret == BLK_STS_RESOURCE) ||
> + (ret == BLK_STS_DEV_RESOURCE)) {
>   /*
>* If an I/O scheduler has been configured and we got a
>* driver tag for the next request already, free it

Just a nit, but this should be on one line.

> @@ -1764,6 +1775,7 @@ static blk_status_t __blk_mq_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   *cookie = new_cookie;
>   break;
>   case BLK_STS_RESOURCE:
> + case BLK_STS_DEV_RESOURCE:
>   __blk_mq_requeue_request(rq);
>   break;
>   default:

It seems the strategy for BLK_STS_DEV_RESOURCE and BLK_STS_RESOURCE is
too muddled: calling __blk_mq_requeue_request() for both will cause
underlying blk-mq driver to retain the request, won't it?

> @@ -1826,7 +1838,7 @@ static void blk_mq_try_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   hctx_lock(hctx, _idx);
>  
>   ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
> - if (ret == BLK_STS_RESOURCE)
> + if ((ret == BLK_STS_RESOURCE) || (ret == BLK_STS_DEV_RESOURCE))
>   blk_mq_sched_insert_request(rq, false, true, false);
>   else if (ret != BLK_STS_OK)
>   blk_mq_end_request(rq, ret);

For this normal (non dm-mpath) case the request gets re-inserted;
dm-mpath must avoid that.

But with dm-mpath, which instead uses blk_mq_request_issue_directly(),
we're driving IO with stacked blk-mq drivers.  If the underlying blk-mq 
driver (e.g. scsi-mq or nvme) is made to retain the request, using
__blk_mq_issue_directly()'s call to __blk_mq_requeue_request() above,
then dm-mpath will not have the ability to requeue the request without
conflicting with the underlying blk-mq driver, will it?

Or am I'm misunderstanding what __blk_mq_requeue_request() is doing?

dm_mq_queue_rq
-> multipath_clone_and_map
   -> blk_get_request (scsi_mq)
      -> if error, dm-mpath conditionally requeues (w/ or w/o delay)
      -> if BLK_STS_OK then blk_mq_request_issue_directly() gets called
-> dm_dispatch_clone_request
   -> blk_mq_request_issue_directly
      -> __blk_mq_try_issue_directly
         -> __blk_mq_issue_directly
            -> q->mq_ops->queue_rq (this is the underlying scsi_mq)
               -> a BLK_STS_RESOURCE return here is how Bart was able to cause stalls
            -> __blk_mq_requeue_request, if BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE **1
      -> (return from blk_mq_request_issue_directly)
      -> if BLK_STS_RESOURCE, the dm-mpath request is released using blk_put_request();
         and DM_MAPIO_REQUEUE is returned to dm_mq_queue_rq **2
-> if DM_MAPIO_REQUEUE return from map_request()'s call to dm_dispatch_clone_request:
   BLK_STS_RESOURCE is returned from dm-mpath's dm_mq_queue_rq

The redundant queueing (both to underlying blk-mq at **1 above, and
upper layer blk-mq at **2 above) is what I'm concerned about.

Hope this is clear.

I'd love to be missing something, please advise.

Thanks,
Mike
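
To make the **2 step in the call chain above concrete, here is a minimal
sketch of the upper-layer (dm-rq) side of the feedback loop.  It is a
simplified paraphrase, not the exact upstream map_request() (which, for
example, releases the clone through the target's release_clone_rq hook rather
than calling blk_put_request() directly).

/* Hedged sketch of the dm-rq side of the feedback loop (simplified). */
static int sketch_map_request(struct dm_rq_target_io *tio)
{
        struct request *rq = tio->orig;
        struct request *clone = tio->clone;  /* built by multipath_clone_and_map() */
        blk_status_t ret;

        ret = dm_dispatch_clone_request(clone, rq);  /* -> blk_insert_cloned_request() */
        if (ret == BLK_STS_RESOURCE) {
                /* bottom queue was busy: drop the clone, requeue the original */
                blk_put_request(clone);
                tio->clone = NULL;
                return DM_MAPIO_REQUEUE;  /* dm_mq_queue_rq() maps this to BLK_STS_RESOURCE */
        }
        return DM_MAPIO_REMAPPED;
}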


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Mike Snitzer
On Fri, Jan 19 2018 at 12:38pm -0500,
Jens Axboe  wrote:

> On 1/19/18 9:37 AM, Ming Lei wrote:
> > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
> >>
> >> There are no pending requests for this case, nothing to restart the
> >> queue. When you fail that blk_get_request(), you are idle, nothing
> >> is pending.
> > 
> > I think we needn't worry about that, once a device is attached to
> > dm-rq, it can't be mounted any more, and usually user don't use the device
> > directly and by dm-mpath at the same time.
> 
> Here's an example of that, using my current block tree (merged into
> master).  The setup is dm-mpath on top of null_blk, the latter having
> just a single request. Both are mq devices.
> 
> Fio direct 4k random reads on dm_mq: ~250K iops
> 
> Start dd on underlying device (or partition on same device), just doing
> sequential reads.
> 
> Fio direct 4k random reads on dm_mq with dd running: 9 iops
> 
> No schedulers involved.
> 
> https://i.imgur.com/WTDnnwE.gif

FYI, your tree doesn't have these dm-4.16 changes (which are needed to
make Ming's blk-mq changes that are in your tree, commit 396eaf21e et
al, have Ming's desired effect on DM):

https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16=050af08ffb1b62af69196d61c22a0755f9a3cdbd
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16=459b54019cfeb7330ed4863ad40f78489e0ff23d
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16=ec3eaf9a673106f66606896aed6ddd20180b02ec

Fact that you're seeing such shit results without dm-4.16 commit
ec3eaf9a67 (that reverts older commit 6077c2d706) means: yeap, this
is really awful, let's fix it!  But it is a different flavor of awful
because the dm-rq.c:map_request() will handle the DM_MAPIO_DELAY_REQUEUE
result from the null_blk's blk_get_request() failure using
dm_mq_delay_requeue_request() against the dm-mq mpath device:

blk_mq_requeue_request(rq, false);
__dm_mq_kick_requeue_list(rq->q, msecs);

So this begs the question: why are we stalling _exactly_?  (you may have it
all figured out, as your gif implies.. but I'm not there yet).

Might be interesting to see how your same test behaves without all of
the churn we've staged for 4.16, e.g. against v4.15-rc8

Mike


Re: [PATCH 3/3] block: Remove kblockd_schedule_delayed_work{,_on}()

2018-01-19 Thread Mike Snitzer
On Fri, Jan 19 2018 at 11:58am -0500,
Bart Van Assche <bart.vanass...@wdc.com> wrote:

> The previous patch removed all users of these two functions. Hence
> also remove the functions themselves.
> 
> Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>

Reviewed-by: Mike Snitzer <snit...@redhat.com>


Re: [PATCH 2/3] blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays

2018-01-19 Thread Mike Snitzer
On Fri, Jan 19 2018 at 11:58am -0500,
Bart Van Assche <bart.vanass...@wdc.com> wrote:

> Make sure that calling blk_mq_run_hw_queue() or
> blk_mq_kick_requeue_list() triggers a queue run without delay even
> if blk_mq_delay_run_hw_queue() has been called recently and if its
> delay has not yet expired.
> 
> Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>

Reviewed-by: Mike Snitzer <snit...@redhat.com>


Re: [PATCH 1/3] blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()

2018-01-19 Thread Mike Snitzer
On Fri, Jan 19 2018 at 11:58am -0500,
Bart Van Assche <bart.vanass...@wdc.com> wrote:

> Most blk-mq functions have a name that follows the pattern blk_mq_${action}.
> However, the function name blk_mq_request_direct_issue is an exception.
> Hence rename this function. This patch does not change any functionality.
> 
> Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>

Reviewed-by: Mike Snitzer <snit...@redhat.com>


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Mike Snitzer
On Fri, Jan 19 2018 at 11:41am -0500,
Jens Axboe  wrote:

> On 1/19/18 9:37 AM, Ming Lei wrote:
> > On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
> >> On 1/19/18 9:26 AM, Ming Lei wrote:
> >>> On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
> >>
> >> There are no pending requests for this case, nothing to restart the
> >> queue. When you fail that blk_get_request(), you are idle, nothing
> >> is pending.
> > 
> > I think we needn't worry about that, once a device is attached to
> > dm-rq, it can't be mounted any more, and usually user don't use the device
> > directly and by dm-mpath at the same time.
> 
> Even if it doesn't happen for a normal dm setup, it is a case that
> needs to be handled. The request allocation is just one example of
> a wider scope resource that can be unavailable. If the driver returns
> NO_DEV_RESOURCE (or whatever name), it will be a possibility that
> the device itself is currently idle.

How would a driver's resources be exhausted while the device is idle (such
that it cannot benefit from RESTART)?


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-19 Thread Mike Snitzer
On Fri, Jan 19 2018 at 10:48am -0500,
Jens Axboe  wrote:

> On 1/19/18 8:40 AM, Ming Lei wrote:
>  Where does the dm STS_RESOURCE error usually come from - what's exact
>  resource are we running out of?
> >>>
> >>> It is from blk_get_request(underlying queue), see
> >>> multipath_clone_and_map().
> >>
> >> That's what I thought. So for a low queue depth underlying queue, it's
> >> quite possible that this situation can happen. Two potential solutions
> >> I see:
> >>
> >> 1) As described earlier in this thread, having a mechanism for being
> >>notified when the scarce resource becomes available. It would not
> >>be hard to tap into the existing sbitmap wait queue for that.
> >>
> >> 2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
> >>allocation. I haven't read the dm code to know if this is a
> >>possibility or not.

Right, #2 is _not_ the way forward.  Historically request-based DM used
its own mempool for requests; this was to be able to have some measure
of control and resiliency in the face of low memory conditions that
might be affecting the broader system.

Then Christoph switched over to adding per-request-data; which ushered
in the use of blk_get_request using ATOMIC allocations.  I like the
result of that line of development.  But taking the next step of setting
BLK_MQ_F_BLOCKING is highly unfortunate (especially in that this
dm-mpath.c code is common to old .request_fn and blk-mq, at least the
call to blk_get_request is).  Ultimately dm-mpath would like to avoid blocking
for a request because for this dm-mpath device we have multiple queues
to allocate from if need be (provided we have an active-active storage
network topology).

> >> I'd probably prefer #1. It's a classic case of trying to get the
> >> request, and if it fails, add ourselves to the sbitmap tag wait
> >> queue head, retry, and bail if that also fails. Connecting the
> >> scarce resource and the consumer is the only way to really fix
> >> this, without bogus arbitrary delays.
> > 
> > Right, as I have replied to Bart, using mod_delayed_work_on() with
> > returning BLK_STS_NO_DEV_RESOURCE(or sort of name) for the scarce
> > resource should fix this issue.
> 
> It'll fix the forever stall, but it won't really fix it, as we'll slow
> down the dm device by some random amount.

Agreed.

> A simple test case would be to have a null_blk device with a queue depth
> of one, and dm on top of that. Start a fio job that runs two jobs: one
> that does IO to the underlying device, and one that does IO to the dm
> device. If the job on the dm device runs substantially slower than the
> one to the underlying device, then the problem isn't really fixed.

Not sure DM will allow the underlying device to be opened (due to
master/slave ownership that is part of loading a DM table)?

> That said, I'm fine with ensuring that we make forward progress always
> first, and then we can come up with a proper solution to the issue. The
> forward progress guarantee will be needed for the more rare failure
> cases, like allocation failures. nvme needs that too, for instance, for
> the discard range struct allocation.

Yeap, I'd be OK with that too.  We'd be better off revisiting this and
then taking some time to develop the ultimate robust fix (#1, callback
from above).

Mike
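
For reference, the blk_get_request() failure discussed above comes from
dm-mpath cloning the original request against the bottom (e.g. scsi-mq) queue
with a non-blocking allocation.  Below is a minimal sketch of that path; the
helper name is illustrative, the real function is multipath_clone_and_map(),
and path selection plus the dying-queue handling are omitted.

/* Hedged sketch: 'q' is the request_queue of the currently selected path. */
static int sketch_mpath_clone(struct request_queue *q, struct request *rq,
                              struct request **__clone)
{
        struct request *clone;

        /* non-blocking allocation against the (possibly shallow) bottom queue */
        clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
        if (IS_ERR(clone))
                return DM_MAPIO_REQUEUE;  /* feeds BLK_STS_RESOURCE back up */

        *__clone = clone;
        return DM_MAPIO_REMAPPED;
}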


Re: [PATCH v3 RESEND 2/2] blk-throttle: fix wrong initialization in case of dm device

2018-01-18 Thread Mike Snitzer
On Thu, Jan 18 2018 at 10:09pm -0500,
Joseph Qi  wrote:

> From: Joseph Qi 
> 
> DM device sets QUEUE_FLAG_NONROT after the queue is registered. That is
> to mean, the previous initialization in blk_throtl_register_queue is
> wrong in this case.
> Fix it by checking and then updating the info during root tg
> initialization as we don't have a better choice.
> 
> Signed-off-by: Joseph Qi 
> Reviewed-by: Shaohua Li 
> ---
>  block/blk-throttle.c | 19 +++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/block/blk-throttle.c b/block/blk-throttle.c
> index bf52035..7150f14 100644
> --- a/block/blk-throttle.c
> +++ b/block/blk-throttle.c
> @@ -541,6 +541,25 @@ static void throtl_pd_init(struct blkg_policy_data *pd)
>   if (cgroup_subsys_on_dfl(io_cgrp_subsys) && blkg->parent)
>   sq->parent_sq = _to_tg(blkg->parent)->service_queue;
>   tg->td = td;
> +
> +#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
> + /*
> +  * DM device sets QUEUE_FLAG_NONROT after the queue is registered,
> +  * so the previous initialization is wrong in this case. Check and
> +  * update it here.
> +  */
> + if (blk_queue_nonrot(blkg->q) &&
> + td->filtered_latency != LATENCY_FILTERED_SSD) {
> + int i;
> +
> + td->throtl_slice = DFL_THROTL_SLICE_SSD;
> + td->filtered_latency = LATENCY_FILTERED_SSD;
> + for (i = 0; i < LATENCY_BUCKET_SIZE; i++) {
> + td->avg_buckets[READ][i].latency = 0;
> + td->avg_buckets[WRITE][i].latency = 0;
> + }
> + }
> +#endif
>  }
>  
>  /*
> -- 
> 1.9.4

This should be fixed for 4.16; please see these block tree commits:
http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.16/block=fa70d2e2c4a0a54ced98260c6a176cc94c876d27
http://git.kernel.dk/cgit/linux-block/commit/?h=for-4.16/block=c100ec49fdd836ff8a17c7bfcc7611d2ee2b

The last commit's patch header even references the previous submission
you had for this patch with:

"These changes also stave off the need to introduce new DM-specific
workarounds in block core, e.g. this proposal:
https://patchwork.kernel.org/patch/10067961/;


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Mike Snitzer
On Thu, Jan 18 2018 at  4:39pm -0500,
Bart Van Assche <bart.vanass...@wdc.com> wrote:

> On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote:
> > On Thu, Jan 18 2018 at  3:58P -0500,
> > Bart Van Assche <bart.vanass...@wdc.com> wrote:
> > 
> > > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > > > For Bart's test the underlying scsi-mq driver is what is regularly
> > > > hitting this case in __blk_mq_try_issue_directly():
> > > > 
> > > > if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
> > >
> > > These lockups were all triggered by incorrect handling of
> > > .queue_rq() returning BLK_STS_RESOURCE.
> > 
> > Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE?
> > "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()?
> 
> In what I wrote I was referring to both dm_mq_queue_rq() and scsi_queue_rq().
> With "incorrect" I meant that queue lockups are introduced that make user
> space processes unkillable. That's a severe bug.

And yet Laurence cannot reproduce any such lockups with your test...

Are you absolutely certain this patch doesn't help you?
https://patchwork.kernel.org/patch/10174037/

If it doesn't then that is actually very useful to know.

> > We have time to get this right, please stop hyperventilating about
> > "regressions".
> 
> Sorry Mike but that's something I consider as an unfair comment. If Ming and
> you work on patches together, it's your job to make sure that no regressions
> are introduced. Instead of blaming me because I report these regressions you
> should be grateful that I take the time and effort to report these regressions
> early. And since you are employed by a large organization that sells Linux
> support services, your employer should invest in developing test cases that
> reach a higher coverage of the dm, SCSI and block layer code. I don't think
> that it's normal that my tests discovered several issues that were not
> discovered by Red Hat's internal test suite. That's something Red Hat has to
> address.

You have no self-awareness of just how myopic you're being about this.

I'm not ignoring or blaming you for your test no longer passing.  Far
from it.  I very much want to fix this.  But I want it fixed in a way
that doesn't paper over the real bug(s) while also introducing blind
queue runs that compromise the blk-mq RESTART code's ability to work as
intended.

I'd have thought you could appreciate this.  We need a root cause on
this, not hand-waving justifications on why problematic delayed queue
runs are correct.

Please just focus on helping Laurence get his very capable testbed to
reproduce this issue.  Once we can reproduce these "unkillable" "stalls"
in-house it'll be _much_ easier to analyze and fix.

Thanks,
Mike


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Mike Snitzer
On Thu, Jan 18 2018 at  3:58P -0500,
Bart Van Assche <bart.vanass...@wdc.com> wrote:

> On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > For Bart's test the underlying scsi-mq driver is what is regularly
> > hitting this case in __blk_mq_try_issue_directly():
> > 
> > if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
> 
> Hello Mike,
> 
> That code path is not the code path that triggered the lockups that I reported
> during the past days.

If you're hitting blk_mq_sched_insert_request() then you most certainly
are hitting that code path.

If you aren't then what was your earlier email going on about?
https://www.redhat.com/archives/dm-devel/2018-January/msg00372.html

If you were just focusing on that as one possible reason, that isn't
very helpful.  By this point you really should _know_ what is triggering
the stall based on the code paths taken.  Please use ftrace's
function_graph tracer if need be.

> These lockups were all triggered by incorrect handling of
> .queue_rq() returning BLK_STS_RESOURCE.

Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE?
"Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()?

Please try to do more work analyzing the test case that only you can
easily run (due to srp_test being a PITA).  And spend less time lobbying for
a change that you don't understand to _really_ be correct.

We have time to get this right, please stop hyperventilating about
"regressions".

Thanks,
Mike


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Mike Snitzer
On Thu, Jan 18 2018 at  3:11pm -0500,
Jens Axboe  wrote:

> On 1/18/18 11:47 AM, Bart Van Assche wrote:
> >> This is all very tiresome.
> > 
> > Yes, this is tiresome. It is very annoying to me that others keep
> > introducing so many regressions in such important parts of the kernel.
> > It is also annoying to me that I get blamed if I report a regression
> > instead of seeing that the regression gets fixed.
> 
> I agree, it sucks that any change there introduces the regression. I'm
> fine with doing the delay insert again until a new patch is proven to be
> better.
> 
> From the original topic of this email, we have conditions that can cause
> the driver to not be able to submit an IO. A set of those conditions can
> only happen if IO is in flight, and those cases we have covered just
> fine. Another set can potentially trigger without IO being in flight.
> These are cases where a non-device resource is unavailable at the time
> of submission. This might be iommu running out of space, for instance,
> or it might be a memory allocation of some sort. For these cases, we
> don't get any notification when the shortage clears. All we can do is
> ensure that we restart operations at some point in the future. We're SOL
> at that point, but we have to ensure that we make forward progress.
> 
> That last set of conditions better not be a a common occurence, since
> performance is down the toilet at that point. I don't want to introduce
> hot path code to rectify it. Have the driver return if that happens in a
> way that is DIFFERENT from needing a normal restart. The driver knows if
> this is a resource that will become available when IO completes on this
> device or not. If we get that return, we have a generic run-again delay.
> 
> This basically becomes the same as doing the delay queue thing from DM,
> but just in a generic fashion.

This is a bit confusing for me (as I see it we have 2 blk-mq drivers
trying to collaborate, so your referring to "driver" lacks precision; but
I could just be missing something)...

For Bart's test the underlying scsi-mq driver is what is regularly
hitting this case in __blk_mq_try_issue_directly():

if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))

It certainly better not be the norm (Bart's test hammering on this aside).

For starters, it'd be very useful to know if Bart is hitting the
blk_mq_hctx_stopped() or blk_queue_quiesced() for this case that is
triggering the use of blk_mq_sched_insert_request() -- I'd wager it is
due to blk_queue_quiesced() but Bart _please_ try to figure it out.

Anyway, in response to this case Bart would like the upper layer dm-mq
driver to blk_mq_delay_run_hw_queue().  Certainly is quite the hammer.

But that hammer aside, in general for this case, I'm concerned about: is
it really correct to allow an already stopped/quiesced underlying queue
to retain responsibility for processing the request?  Or would the
upper-layer dm-mq benefit from being able to retry the request on its
terms (via a "DIFFERENT" return from blk-mq core)?

Like this?  The (ab)use of BLK_STS_DM_REQUEUE certainly seems fitting in
this case but...

(Bart please note that this patch applies on linux-dm.git's 'for-next';
which is just a merge of Jens' 4.16 tree and dm-4.16)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 74a4f237ba91..371a1b97bf56 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1781,16 +1781,11 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
struct request_queue *q = rq->q;
bool run_queue = true;
 
-   /*
-* RCU or SRCU read lock is needed before checking quiesced flag.
-*
-* When queue is stopped or quiesced, ignore 'bypass_insert' from
-* blk_mq_request_direct_issue(), and return BLK_STS_OK to caller,
-* and avoid driver to try to dispatch again.
-*/
+   /* RCU or SRCU read lock is needed before checking quiesced flag */
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
run_queue = false;
-   bypass_insert = false;
+   if (bypass_insert)
+   return BLK_STS_DM_REQUEUE;
goto insert;
}
 
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index d8519ddd7e1a..2f554ea485c3 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -408,7 +408,7 @@ static blk_status_t dm_dispatch_clone_request(struct 
request *clone, struct requ
 
clone->start_time = jiffies;
r = blk_insert_cloned_request(clone->q, clone);
-   if (r != BLK_STS_OK && r != BLK_STS_RESOURCE)
+   if (r != BLK_STS_OK && r != BLK_STS_RESOURCE && r != BLK_STS_DM_REQUEUE)
/* must complete clone in terms of original request */
dm_complete_request(rq, r);
return r;
@@ -472,6 +472,7 @@ static void init_tio(struct dm_rq_target_io *tio, struct 
request *rq,
  * Returns:
  * DM_MAPIO_*   : the request has 

Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Mike Snitzer
On Thu, Jan 18 2018 at 12:20pm -0500,
Bart Van Assche <bart.vanass...@wdc.com> wrote:

> On Thu, 2018-01-18 at 12:03 -0500, Mike Snitzer wrote:
> > On Thu, Jan 18 2018 at 11:50am -0500,
> > Bart Van Assche <bart.vanass...@wdc.com> wrote:
> > > My comments about the above are as follows:
> > > - It can take up to q->rq_timeout jiffies after a .queue_rq()
> > >   implementation returned BLK_STS_RESOURCE before blk_mq_timeout_work()
> > >   gets called. However, it can happen that only a few milliseconds after
> > >   .queue_rq() returned BLK_STS_RESOURCE that the condition that caused
> > >   it to return BLK_STS_RESOURCE gets cleared. So the above approach can
> > >   result in long delays during which it will seem like the queue got
> > >   stuck. Additionally, I think that the block driver should decide how
> > >   long it takes before a queue is rerun and not the block layer core.
> > 
> > So configure q->rq_timeout to be shorter?  Which is configurable though
> > blk_mq_tag_set's 'timeout' member.  It apparently defaults to 30 * HZ.
> > 
> > That is the problem with timeouts, there is generally no one size fits
> > all.
> 
> Sorry but I think that would be wrong. The delay after which a queue is rerun
> should not be coupled to the request timeout. These two should be independent.

That's fair.  Not saying I think that is a fix anyway.

> > > - The lockup that I reported only occurs with the dm driver but not any
> > >   other block driver. So why to modify the block layer core since this
> > >   can be fixed by modifying the dm driver?
> > 
> > Hard to know it is only DM's blk-mq that is impacted.  That is the only
> > blk-mq driver that you're testing like this (that is also able to handle
> > faults, etc).
> 
> That's not correct. I'm also testing the SCSI core, which is one of the most
> complicated block drivers.

OK, but SCSI mq is part of the problem here.  It is a snowflake that
has more exotic reasons for returning BLK_STS_RESOURCE.

> > > - A much simpler fix and a fix that is known to work exists, namely
> > >   inserting a blk_mq_delay_run_hw_queue() call in the dm driver.
> > 
> > Because your "much simpler" fix actively hurts performance, as is
> > detailed in this header:
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16=ec3eaf9a673106f66606896aed6ddd20180b02ec
> 
> We are close to the start of the merge window so I think it's better to fall
> back to an old approach that is known to work than to keep a new approach
> that is known not to work. Additionally, the performance issue you referred
> to only affects IOPS and bandwidth more than 1% with the lpfc driver and that
> is because the queue depth it supports is much lower than for other SCSI HBAs,
> namely 3 instead of 64.

1%!?  Where are you getting that number?  Ming has detailed more
significant performance gains than 1%.. and not just on lpfc (though you
keep seizing on lpfc because of the low queue_depth of 3).

This is all very tiresome.  I'm _really_ not interested in this debate
any more.  The specific case that causes the stall needs to be identified
and a real fix needs to be developed.  Ming is doing a lot of that hard
work.  Please contribute or at least stop pleading for your hack to be
reintroduced.

If at the end of the 4.16 release we still don't have a handle on the
stall you're seeing I'll revisit this and likely revert to blindly
kicking the queue after an arbitrary delay.  But I'm willing to let this
issue get more time without papering over it.

Mike


Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Mike Snitzer
On Thu, Jan 18 2018 at 11:50am -0500,
Bart Van Assche  wrote:

> On 01/17/18 18:41, Ming Lei wrote:
> >BLK_STS_RESOURCE can be returned from driver when any resource
> >is running out of. And the resource may not be related with tags,
> >such as kmalloc(GFP_ATOMIC), when queue is idle under this kind of
> >BLK_STS_RESOURCE, restart can't work any more, then IO hang may
> >be caused.
> >
> >Most of drivers may call kmalloc(GFP_ATOMIC) in IO path, and almost
> >all returns BLK_STS_RESOURCE under this situation. But for dm-mpath,
> >it may be triggered a bit easier since the request pool of underlying
> >queue may be consumed up much easier. But in reality, it is still not
> >easy to trigger it. I run all kinds of test on dm-mpath/scsi-debug
> >with all kinds of scsi_debug parameters, can't trigger this issue
> >at all. But finally it is triggered in Bart's SRP test, which seems
> >made by genius, :-)
> >
> >[ ... ]
> >
> >  static void blk_mq_timeout_work(struct work_struct *work)
> >  {
> > struct request_queue *q =
> >@@ -966,8 +1045,10 @@ static void blk_mq_timeout_work(struct work_struct 
> >*work)
> >  */
> > queue_for_each_hw_ctx(q, hctx, i) {
> > /* the hctx may be unmapped, so check it here */
> >-if (blk_mq_hw_queue_mapped(hctx))
> >+if (blk_mq_hw_queue_mapped(hctx)) {
> > blk_mq_tag_idle(hctx);
> >+blk_mq_fixup_restart(hctx);
> >+}
> > }
> > }
> > blk_queue_exit(q);
> 
> Hello Ming,
> 
> My comments about the above are as follows:
> - It can take up to q->rq_timeout jiffies after a .queue_rq()
>   implementation returned BLK_STS_RESOURCE before blk_mq_timeout_work()
>   gets called. However, it can happen that only a few milliseconds after
>   .queue_rq() returned BLK_STS_RESOURCE that the condition that caused
>   it to return BLK_STS_RESOURCE gets cleared. So the above approach can
>   result in long delays during which it will seem like the queue got
>   stuck. Additionally, I think that the block driver should decide how
>   long it takes before a queue is rerun and not the block layer core.

So configure q->rq_timeout to be shorter?  Which is configurable through
blk_mq_tag_set's 'timeout' member.  It apparently defaults to 30 * HZ.

That is the problem with timeouts, there is generally no one size fits
all.

> - The lockup that I reported only occurs with the dm driver but not any
>   other block driver. So why to modify the block layer core since this
>   can be fixed by modifying the dm driver?

Hard to know it is only DM's blk-mq that is impacted.  That is the only
blk-mq driver that you're testing like this (that is also able to handle
faults, etc).

> - A much simpler fix and a fix that is known to work exists, namely
>   inserting a blk_mq_delay_run_hw_queue() call in the dm driver.

Because your "much simpler" fix actively hurts performance, as is
detailed in this header:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16=ec3eaf9a673106f66606896aed6ddd20180b02ec

I'm not going to take your bandaid fix given it very much seems to be
papering over a real blk-mq issue.

Mike
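
For reference, the timeout knob mentioned above is the 'timeout' member of a
driver's blk_mq_tag_set.  The fragment below is purely illustrative (the
example_* names are made up, and the other tag-set fields a real driver must
fill in are omitted).

/* Illustrative fragment: shortening the request timeout via blk_mq_tag_set. */
static struct blk_mq_tag_set example_tag_set;

static int example_setup(void)
{
        /* .ops, .cmd_size, .nr_hw_queues, .queue_depth etc. set elsewhere */
        example_tag_set.timeout = 5 * HZ;  /* default is 30 * HZ when left at 0 */
        return blk_mq_alloc_tag_set(&example_tag_set);
}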


Re: dm rq: Avoid that request processing stalls sporadically

2018-01-18 Thread Mike Snitzer
On Thu, Jan 18 2018 at 11:37am -0500,
Bart Van Assche  wrote:

> If the .queue_rq() implementation of a block driver returns
> BLK_STS_RESOURCE then that block driver is responsible for
> rerunning the queue once the condition that caused it to return
> BLK_STS_RESOURCE has been cleared. The dm-mpath driver tells the
> dm core to requeue a request if e.g. not enough memory is
> available for cloning a request or if the underlying path is
> busy. Since the dm-mpath driver does not receive any kind of
> notification if the condition that caused it to return "requeue"
> is cleared, the only solution to avoid that dm-mpath request
> processing stalls is to call blk_mq_delay_run_hw_queue(). Hence
> this patch.
> 
> Fixes: ec3eaf9a6731 ("dm mpath: don't call blk_mq_delay_run_hw_queue() in 
> case of BLK_STS_RESOURCE")
> Signed-off-by: Bart Van Assche 
> Cc: Ming Lei 
> ---
>  drivers/md/dm-rq.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index f16096af879a..c59c59cfd2a5 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -761,6 +761,7 @@ static blk_status_t dm_mq_queue_rq(struct blk_mq_hw_ctx 
> *hctx,
>   /* Undo dm_start_request() before requeuing */
>   rq_end_stats(md, rq);
>   rq_completed(md, rq_data_dir(rq), false);
> + blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
>   return BLK_STS_RESOURCE;
>   }
>  
> -- 
> 2.15.1
> 

Sorry but we need to understand why you still need this.

The issue you say it was originally intended to fix _should_ be
addressed with this change:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16=4dd6edd23e7ea971efddc303f9e67eb79e95808e

So if you still feel you need to blindly kick the queue then it is very
likely a bug in blk-mq (either its RESTART heuristics or whatever
internal implementation is lacking).

Did you try Ming's RFC patch to "fixup RESTART" before resorting to the
above again?  See: https://patchwork.kernel.org/patch/10172315/

Thanks,
Mike


Re: blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy

2018-01-17 Thread Mike Snitzer
On Wed, Jan 17 2018 at 11:06pm -0500,
Ming Lei <ming@redhat.com> wrote:

> If we run into blk_mq_request_direct_issue(), when queue is busy, we
> don't want to dispatch this request into hctx->dispatch_list, and
> what we need to do is to return the queue busy info to caller, so
> that caller can deal with it well.
> 
> Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via 
> blk_insert_cloned_request feedback")
> Reported-by: Laurence Oberman <lober...@redhat.com>
> Reviewed-by: Mike Snitzer <snit...@redhat.com>
> Signed-off-by: Ming Lei <ming@redhat.com>
> ---
>  block/blk-mq.c | 22 ++
>  1 file changed, 10 insertions(+), 12 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4d4af8d712da..1af7fa70993b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1856,15 +1856,6 @@ static blk_status_t __blk_mq_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   return ret;
>  }
>  
> -static void __blk_mq_fallback_to_insert(struct request *rq,
> - bool run_queue, bool bypass_insert)
> -{
> - if (!bypass_insert)
> - blk_mq_sched_insert_request(rq, false, run_queue, false);
> - else
> - blk_mq_request_bypass_insert(rq, run_queue);
> -}
> -
>  static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>   struct request *rq,
>   blk_qc_t *cookie,
> @@ -1873,9 +1864,16 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   struct request_queue *q = rq->q;
>   bool run_queue = true;
>  
> - /* RCU or SRCU read lock is needed before checking quiesced flag */
> + /*
> +  * RCU or SRCU read lock is needed before checking quiesced flag.
> +  *
> +  * When queue is stopped or quiesced, ignore 'bypass_insert' from
> +  * blk_mq_request_direct_issue(), and return BLK_STS_OK to caller,
> +  * and avoid driver to try to dispatch again.
> +  */
>   if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
>   run_queue = false;
> + bypass_insert = false;
>   goto insert;
>   }
>  
> @@ -1892,10 +1890,10 @@ static blk_status_t 
> __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>  
>   return __blk_mq_issue_directly(hctx, rq, cookie);
>  insert:
> - __blk_mq_fallback_to_insert(rq, run_queue, bypass_insert);
>   if (bypass_insert)
>   return BLK_STS_RESOURCE;
>  
> + blk_mq_sched_insert_request(rq, false, run_queue, false);
>   return BLK_STS_OK;
>  }

OK so you're just leveraging blk_mq_sched_insert_request()'s
ability to resort to __blk_mq_insert_request() if !q->elevator.
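
The fallback being leveraged can be sketched roughly as below; it is a
simplified paraphrase of the 4.15-era blk_mq_sched_insert_request() with the
flush/bypass special cases omitted, and the _sketch suffix marks it as
illustrative rather than the exact upstream code.

/* Hedged sketch of blk_mq_sched_insert_request()'s elevator fallback. */
static void blk_mq_sched_insert_request_sketch(struct request *rq, bool at_head,
                                               bool run_queue, bool async)
{
        struct request_queue *q = rq->q;
        struct elevator_queue *e = q->elevator;
        struct blk_mq_ctx *ctx = rq->mq_ctx;
        struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);

        if (e && e->type->ops.mq.insert_requests) {
                LIST_HEAD(list);

                list_add(&rq->queuelist, &list);
                e->type->ops.mq.insert_requests(hctx, &list, at_head);
        } else {
                /* no elevator on the queue: plain per-cpu software queue insert */
                spin_lock(&ctx->lock);
                __blk_mq_insert_request(hctx, rq, at_head);
                spin_unlock(&ctx->lock);
        }

        if (run_queue)
                blk_mq_run_hw_queue(hctx, async);
}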


Re: blk mq: don't blk_mq_request_bypass_insert _and_ return BLK_STS_RESOURCE

2018-01-17 Thread Mike Snitzer
On Wed, Jan 17 2018 at 10:39pm -0500,
Ming Lei <ming@redhat.com> wrote:

> On Wed, Jan 17, 2018 at 10:33:35PM -0500, Mike Snitzer wrote:
> > On Wed, Jan 17 2018 at  7:54P -0500,
> > Mike Snitzer <snit...@redhat.com> wrote:
> >  
> > > But sure, I suppose there is something I missed when refactoring Ming's
> > > change to get it acceptable for upstream.  I went over the mechanical
> > > nature of what I did many times (comparing Ming's v4 to my v5).
> > 
> > And yes there is one subtlety that I missed.
> > 
> > > The call to blk_mq_request_bypass_insert will only occur via
> > > __blk_mq_fallback_to_insert.  Which as the name implies this is not the
> > > fast path.  This will occur if the underlying blk-mq device cannot get
> > > resources it needs in order to issue the request.  Specifically: if/when
> > > in __blk_mq_try_issue_directly() the hctx is stopped, or queue is
> > > quiesced, or it cannot get the driver tag or dispatch_budget (in the
> > > case of scsi-mq).
> > > 
> > > The same fallback, via call to blk_mq_request_bypass_insert, occured
> > > with Ming's v4 though.
> > 
> > Turns out Ming's v4 doesn't fallback to insert for the "or it cannot get
> > the driver tag or dispatch_budget" case.
> > 
> > This patch should fix it (Laurence, please report back on if this fixes
> > your list_add corruption, pretty sure it will):
> > 
> > From: Mike Snitzer <snit...@redhat.com>
> > Date: Wed, 17 Jan 2018 22:02:07 -0500
> > Subject: [PATCH] blk mq: don't blk_mq_request_bypass_insert _and_ return 
> > BLK_STS_RESOURCE
> > 
> > It isn't ever valid to call blk_mq_request_bypass_insert() and return
> > BLK_STS_RESOURCE.
> > 
> > Unfortunately after commit 396eaf21ee ("blk-mq: improve DM's blk-mq IO
> > merging via blk_insert_cloned_request feedback") we do just that if
> > blk_mq_request_direct_issue() cannot get the resources (driver_tag or
> > dispatch_budget) needed to directly issue a request.  This will lead to
> > "list_add corruption" because blk-mq submits the IO but then reports
> > that it didn't (BLK_STS_RESOURCE in this case).
> > 
> > Fix this by simply returning BLK_STS_RESOURCE for this case.
> > 
> > Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via 
> > blk_insert_cloned_request feedback")
> > Reported-by: Laurence Oberman <lober...@redhat.com>
> > Signed-off-by: Mike Snitzer <snit...@redhat.com>
> > ---
> >  block/blk-mq.c | 12 +---
> >  1 file changed, 5 insertions(+), 7 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index c418858a60ef..8bee37239255 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1799,20 +1799,18 @@ static blk_status_t 
> > __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> > if (q->elevator && !bypass_insert)
> > goto insert;
> >  
> > -   if (!blk_mq_get_driver_tag(rq, NULL, false))
> > -   goto insert;
> > -
> > -   if (!blk_mq_get_dispatch_budget(hctx)) {
> > +   if (!blk_mq_get_driver_tag(rq, NULL, false) ||
> > +   !blk_mq_get_dispatch_budget(hctx)) {
> > +   /* blk_mq_put_driver_tag() is idempotent */
> > blk_mq_put_driver_tag(rq);
> > +   if (bypass_insert)
> > +   return BLK_STS_RESOURCE;
> > goto insert;
> > }
> >  
> > return __blk_mq_issue_directly(hctx, rq, cookie);
> >  insert:
> > __blk_mq_fallback_to_insert(rq, run_queue, bypass_insert);
> > -   if (bypass_insert)
> > -   return BLK_STS_RESOURCE;
> > -
> > return BLK_STS_OK;
> >  }
> 
> Hi Mike,
> 
> I'd suggest to use the following one, which is simple and clean:
> 
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4d4af8d712da..816ff5d6bc88 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1856,15 +1856,6 @@ static blk_status_t __blk_mq_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   return ret;
>  }
>  
> -static void __blk_mq_fallback_to_insert(struct request *rq,
> - bool run_queue, bool bypass_insert)
> -{
> - if (!bypass_insert)
> - blk_mq_sched_insert_request(rq, false, run_queue, false);
> - else
> - blk_mq_request_bypass_insert(rq, run_queue);
> -}
> -
>  static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>

Re: [for-4.16 PATCH v6 2/3] blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

2018-01-17 Thread Mike Snitzer
On Wed, Jan 17 2018 at 10:25pm -0500,
Ming Lei <ming@redhat.com> wrote:

> Hi Mike,
> 
> On Wed, Jan 17, 2018 at 11:25:57AM -0500, Mike Snitzer wrote:
> > From: Ming Lei <ming@redhat.com>
> > 
> > blk_insert_cloned_request() is called in the fast path of a dm-rq driver
> > (e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
> > blk_mq_request_bypass_insert() to directly append the request to the
> > blk-mq hctx->dispatch_list of the underlying queue.
> > 
> > 1) This way isn't efficient enough because the hctx spinlock is always
> > used.
> > 
> > 2) With blk_insert_cloned_request(), we completely bypass underlying
> > queue's elevator and depend on the upper-level dm-rq driver's elevator
> > to schedule IO.  But dm-rq currently can't get the underlying queue's
> > dispatch feedback at all.  Without knowing whether a request was issued
> > or not (e.g. due to underlying queue being busy) the dm-rq elevator will
> > not be able to provide effective IO merging (as a side-effect of dm-rq
> > currently blindly destaging a request from its elevator only to requeue
> > it after a delay, which kills any opportunity for merging).  This
> > obviously causes very bad sequential IO performance.
> > 
> > Fix this by updating blk_insert_cloned_request() to use
> > blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
> > request to be issued directly to the underlying queue and returns the
> > dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
> > returns BLK_STS_RESOURCE, the dm-rq driver will now use DM_MAPIO_REQUEUE
> > to _not_ destage the request, thereby preserving the opportunity to
> > merge IO.
> > 
> > With this, request-based DM's blk-mq sequential IO performance is vastly
> > improved (as much as 3X in mpath/virtio-scsi testing).
> > 
> > Signed-off-by: Ming Lei <ming@redhat.com>
> > [blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
> > they were refactored to make them less fragile and easier to read/review]
> > Signed-off-by: Mike Snitzer <snit...@redhat.com>
...
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index c117c2baf2c9..f5f0d8456713 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1731,15 +1731,19 @@ static blk_status_t __blk_mq_issue_directly(struct 
> > blk_mq_hw_ctx *hctx,
> >  
> >  static void __blk_mq_fallback_to_insert(struct blk_mq_hw_ctx *hctx,
> > struct request *rq,
> > -   bool run_queue)
> > +   bool run_queue, bool bypass_insert)
> >  {
> > -   blk_mq_sched_insert_request(rq, false, run_queue, false,
> > -   hctx->flags & BLK_MQ_F_BLOCKING);
> > +   if (!bypass_insert)
> > +   blk_mq_sched_insert_request(rq, false, run_queue, false,
> > +   hctx->flags & BLK_MQ_F_BLOCKING);
> > +   else
> > +   blk_mq_request_bypass_insert(rq, run_queue);
> >  }
> 
> If 'bypass_insert' is true, we don't need to insert the request into
> hctx->dispatch_list for dm-rq; doing so causes the issue (use after free)
> reported by Bart and Laurence.
> 
> Also this way is the exact opposite of the idea of the improvement,
> we do not want to dispatch the request if the underlying queue is busy.

Yeap, please see the patch I just posted to fix it.

But your v4 does fallback to using blk_mq_request_bypass_insert() as
well, just in a much narrower case -- specifically:
   if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))

Thanks,
Mike


[PATCH] blk mq: don't blk_mq_request_bypass_insert _and_ return BLK_STS_RESOURCE

2018-01-17 Thread Mike Snitzer
On Wed, Jan 17 2018 at  7:54pm -0500,
Mike Snitzer <snit...@redhat.com> wrote:
 
> But sure, I suppose there is something I missed when refactoring Ming's
> change to get it acceptable for upstream.  I went over the mechanical
> nature of what I did many times (comparing Ming's v4 to my v5).

And yes there is one subtlety that I missed.

> The call to blk_mq_request_bypass_insert will only occur via
> __blk_mq_fallback_to_insert.  Which as the name implies this is not the
> fast path.  This will occur if the underlying blk-mq device cannot get
> resources it needs in order to issue the request.  Specifically: if/when
> in __blk_mq_try_issue_directly() the hctx is stopped, or queue is
> quiesced, or it cannot get the driver tag or dispatch_budget (in the
> case of scsi-mq).
> 
> The same fallback, via call to blk_mq_request_bypass_insert, occurred
> with Ming's v4 though.

Turns out Ming's v4 doesn't fallback to insert for the "or it cannot get
the driver tag or dispatch_budget" case.

This patch should fix it (Laurence, please report back on if this fixes
your list_add corruption, pretty sure it will):

From: Mike Snitzer <snit...@redhat.com>
Date: Wed, 17 Jan 2018 22:02:07 -0500
Subject: [PATCH] blk mq: don't blk_mq_request_bypass_insert _and_ return 
BLK_STS_RESOURCE

It isn't ever valid to call blk_mq_request_bypass_insert() and return
BLK_STS_RESOURCE.

Unfortunately after commit 396eaf21ee ("blk-mq: improve DM's blk-mq IO
merging via blk_insert_cloned_request feedback") we do just that if
blk_mq_request_direct_issue() cannot get the resources (driver_tag or
dispatch_budget) needed to directly issue a request.  This will lead to
"list_add corruption" because blk-mq submits the IO but then reports
that it didn't (BLK_STS_RESOURCE in this case).

Fix this by simply returning BLK_STS_RESOURCE for this case.

Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via 
blk_insert_cloned_request feedback")
Reported-by: Laurence Oberman <lober...@redhat.com>
Signed-off-by: Mike Snitzer <snit...@redhat.com>
---
 block/blk-mq.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c418858a60ef..8bee37239255 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1799,20 +1799,18 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
if (q->elevator && !bypass_insert)
goto insert;
 
-   if (!blk_mq_get_driver_tag(rq, NULL, false))
-   goto insert;
-
-   if (!blk_mq_get_dispatch_budget(hctx)) {
+   if (!blk_mq_get_driver_tag(rq, NULL, false) ||
+   !blk_mq_get_dispatch_budget(hctx)) {
+   /* blk_mq_put_driver_tag() is idempotent */
blk_mq_put_driver_tag(rq);
+   if (bypass_insert)
+   return BLK_STS_RESOURCE;
goto insert;
}
 
return __blk_mq_issue_directly(hctx, rq, cookie);
 insert:
__blk_mq_fallback_to_insert(rq, run_queue, bypass_insert);
-   if (bypass_insert)
-   return BLK_STS_RESOURCE;
-
return BLK_STS_OK;
 }
 
-- 
2.15.0
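
To see why inserting the request _and_ returning BLK_STS_RESOURCE corrupts
the dispatch list, here is a minimal userspace toy (illustrative only, not
kernel code): blk-mq has already linked the request onto hctx->dispatch,
yet the caller is told BLK_STS_RESOURCE and will recycle and re-insert the
very same request.  Re-adding a node that is still on the list silently
mangles its links, and the next insertion then trips the same prev->next
check that __list_add_valid() performs in lib/list_debug.c:

	#include <stdio.h>

	struct list_head { struct list_head *next, *prev; };

	static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

	/* the prev->next sanity check __list_add_valid() performs */
	static int list_add_valid(struct list_head *prev, struct list_head *next)
	{
		if (prev->next != next) {
			printf("list_add corruption. prev->next should be next (%p), but was %p.\n",
			       (void *)next, (void *)prev->next);
			return 0;
		}
		return 1;
	}

	static void list_add_tail(struct list_head *entry, struct list_head *head)
	{
		struct list_head *prev = head->prev, *next = head;

		if (!list_add_valid(prev, next))
			return;
		next->prev = entry; entry->next = next;
		entry->prev = prev; prev->next = entry;
	}

	int main(void)
	{
		struct list_head dispatch, rq1, rq2;

		INIT_LIST_HEAD(&dispatch);
		list_add_tail(&rq1, &dispatch);	/* blk-mq queues the request */
		list_add_tail(&rq1, &dispatch);	/* same node again: check passes, links mangled */
		list_add_tail(&rq2, &dispatch);	/* a later insert now trips the check */
		return 0;
	}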



Re: [for-4.16 PATCH v6 0/3] blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

2018-01-17 Thread Mike Snitzer
On Wed, Jan 17 2018 at  6:53pm -0500,
Bart Van Assche <bart.vanass...@wdc.com> wrote:

> On Wed, 2018-01-17 at 18:43 -0500, Laurence Oberman wrote:
> > On Wed, 2018-01-17 at 23:31 +, Bart Van Assche wrote:
> > > On Wed, 2018-01-17 at 11:58 -0500, Mike Snitzer wrote:
> > > > On Wed, Jan 17 2018 at 11:50am -0500,
> > > > Jens Axboe <ax...@kernel.dk> wrote:
> > > > 
> > > > > On 1/17/18 9:25 AM, Mike Snitzer wrote:
> > > > > > Hi Jens,
> > > > > > 
> > > > > > Think this finally takes care of it! ;)
> > > > > > 
> > > > > > Thanks,
> > > > > > Mike
> > > > > > 
> > > > > > Mike Snitzer (2):
> > > > > >   blk-mq: factor out a few helpers from
> > > > > > __blk_mq_try_issue_directly
> > > > > >   blk-mq-sched: remove unused 'can_block' arg from
> > > > > > blk_mq_sched_insert_request
> > > > > > 
> > > > > > Ming Lei (1):
> > > > > >   blk-mq: improve DM's blk-mq IO merging via
> > > > > > blk_insert_cloned_request feedback
> > > > > 
> > > > > Applied - added actual commit message to patch 3.
> > > > 
> > > > Great, thanks.
> > > 
> > > Hello Mike,
> > > 
> > > Laurence hit the following while retesting the SRP initiator code:
> > > 
> > > [ 2223.797129] list_add corruption. prev->next should be next
> > > (e0ddd5dd), but was 3defe5cd.
> > > (prev=3defe5cd).
> > > [ 2223.862168] WARNING: CPU: 14 PID: 577 at lib/list_debug.c:28
> > > __list_add_valid+0x6a/0x70
> > > [ 2224.481151] CPU: 14 PID: 577 Comm: kworker/14:1H Tainted:
> > > G  I  4.15.0-rc8.bart3+ #1
> > > [ 2224.531193] Hardware name: HP ProLiant DL380 G7, BIOS P67
> > > 08/16/2015
> > > [ 2224.567150] Workqueue: kblockd blk_mq_run_work_fn
> > > [ 2224.593182] RIP: 0010:__list_add_valid+0x6a/0x70
> > > [ 2224.967002] Call Trace:
> > > [ 2224.980941]  blk_mq_request_bypass_insert+0x57/0xa0
> > > [ 2225.009044]  __blk_mq_try_issue_directly+0x56/0x1e0
> > > [ 2225.037007]  blk_mq_request_direct_issue+0x5d/0xc0
> > > [ 2225.090608]  map_request+0x142/0x260 [dm_mod]
> > > [ 2225.114756]  dm_mq_queue_rq+0xa4/0x120 [dm_mod]
> > > [ 2225.140812]  blk_mq_dispatch_rq_list+0x90/0x5b0
> > > [ 2225.211769]  blk_mq_sched_dispatch_requests+0x107/0x1a0
> > > [ 2225.240825]  __blk_mq_run_hw_queue+0x5f/0xf0
> > > [ 2225.264852]  process_one_work+0x141/0x340
> > > [ 2225.287872]  worker_thread+0x47/0x3e0
> > > [ 2225.308354]  kthread+0xf5/0x130
> > > [ 2225.396405]  ret_from_fork+0x32/0x40
> > > 
> > > That call trace did not show up before this patch series was added to
> > > Jens'
> > > tree. This is a regression. Could this have been introduced by this
> > > patch
> > > series?
> > > 
> > > Thanks,
> > > 
> > > Bart.
> > 
> > Hi Bart
> > One thing to note.
> > 
> > I tested Mike's combined tree on the weekend fully dm4.16-block4.16 and
> > did not see this.
> > This was with Mike combined tree and SRPT running 4.13-rc2.
> > 
> > I also tested your tree Monday with the revert of the scatter/gather
> > patches with both SRP and SRPT running your tree and it was fine.
> > 
> > So its a combination of what you provided me before and that has been
> > added to your tree.
> > 
> > Mike combined tree seemed to be fine, I can revisit that if needed. I
> > still have that kernel in place.
> > 
> > I was not running latest SRPT when I tested Mike's tree
> 
> Hello Laurence,
> 
> The tree I sent you this morning did not only include Mike's latest dm code
> but also Jens' latest for-next branch. So what you wrote above does not
> contradict what I wrote in my e-mail, namely that I suspect that a regression
> was introduced by the patches in the series "blk-mq: improve DM's blk-mq IO
> merging via blk_insert_cloned_request feedback". These changes namely went in
> through the block tree and not through the dm tree. Additionally, these
> changes were not present in the block-scsi-for-next branch I sent you on
> Monday.

Functionality shouldn't be any different than the variant (Ming's v4)
that Laurence tested on Sunday/Monday (once we got past the genirq issue
on HPSA).

But sure, I suppose there is something I missed when re

Re: [for-4.16 PATCH v6 0/3] blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

2018-01-17 Thread Mike Snitzer
On Wed, Jan 17 2018 at 11:50am -0500,
Jens Axboe <ax...@kernel.dk> wrote:

> On 1/17/18 9:25 AM, Mike Snitzer wrote:
> > Hi Jens,
> > 
> > Think this finally takes care of it! ;)
> > 
> > Thanks,
> > Mike
> > 
> > Mike Snitzer (2):
> >   blk-mq: factor out a few helpers from __blk_mq_try_issue_directly
> >   blk-mq-sched: remove unused 'can_block' arg from 
> > blk_mq_sched_insert_request
> > 
> > Ming Lei (1):
> >   blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request 
> > feedback
> 
> Applied - added actual commit message to patch 3.

Great, thanks.


[for-4.16 PATCH v6 2/3] blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

2018-01-17 Thread Mike Snitzer
From: Ming Lei <ming@redhat.com>

blk_insert_cloned_request() is called in the fast path of a dm-rq driver
(e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
blk_mq_request_bypass_insert() to directly append the request to the
blk-mq hctx->dispatch_list of the underlying queue.

1) This way isn't efficient enough because the hctx spinlock is always
used.

2) With blk_insert_cloned_request(), we completely bypass underlying
queue's elevator and depend on the upper-level dm-rq driver's elevator
to schedule IO.  But dm-rq currently can't get the underlying queue's
dispatch feedback at all.  Without knowing whether a request was issued
or not (e.g. due to underlying queue being busy) the dm-rq elevator will
not be able to provide effective IO merging (as a side-effect of dm-rq
currently blindly destaging a request from its elevator only to requeue
it after a delay, which kills any opportunity for merging).  This
obviously causes very bad sequential IO performance.

Fix this by updating blk_insert_cloned_request() to use
blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
request to be issued directly to the underlying queue and returns the
dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
returns BLK_STS_RESOURCE, the dm-rq driver will now use DM_MAPIO_REQUEUE
to _not_ destage the request, thereby preserving the opportunity to
merge IO.

With this, request-based DM's blk-mq sequential IO performance is vastly
improved (as much as 3X in mpath/virtio-scsi testing).

Signed-off-by: Ming Lei <ming@redhat.com>
[blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
they were refactored to make them less fragile and easier to read/review]
Signed-off-by: Mike Snitzer <snit...@redhat.com>
---
 block/blk-core.c   |  3 +--
 block/blk-mq.c | 37 +
 block/blk-mq.h |  3 +++
 drivers/md/dm-rq.c | 19 ---
 4 files changed, 49 insertions(+), 13 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7ba607527487..55f338020254 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2500,8 +2500,7 @@ blk_status_t blk_insert_cloned_request(struct 
request_queue *q, struct request *
 * bypass a potential scheduler on the bottom device for
 * insert.
 */
-   blk_mq_request_bypass_insert(rq, true);
-   return BLK_STS_OK;
+   return blk_mq_request_direct_issue(rq);
}
 
spin_lock_irqsave(q->queue_lock, flags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c117c2baf2c9..f5f0d8456713 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1731,15 +1731,19 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
 static void __blk_mq_fallback_to_insert(struct blk_mq_hw_ctx *hctx,
struct request *rq,
-   bool run_queue)
+   bool run_queue, bool bypass_insert)
 {
-   blk_mq_sched_insert_request(rq, false, run_queue, false,
-   hctx->flags & BLK_MQ_F_BLOCKING);
+   if (!bypass_insert)
+   blk_mq_sched_insert_request(rq, false, run_queue, false,
+   hctx->flags & BLK_MQ_F_BLOCKING);
+   else
+   blk_mq_request_bypass_insert(rq, run_queue);
 }
 
 static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
struct request *rq,
-   blk_qc_t *cookie)
+   blk_qc_t *cookie,
+   bool bypass_insert)
 {
struct request_queue *q = rq->q;
bool run_queue = true;
@@ -1750,7 +1754,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
goto insert;
}
 
-   if (q->elevator)
+   if (q->elevator && !bypass_insert)
goto insert;
 
if (!blk_mq_get_driver_tag(rq, NULL, false))
@@ -1763,7 +1767,9 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
return __blk_mq_issue_directly(hctx, rq, cookie);
 insert:
-   __blk_mq_fallback_to_insert(hctx, rq, run_queue);
+   __blk_mq_fallback_to_insert(hctx, rq, run_queue, bypass_insert);
+   if (bypass_insert)
+   return BLK_STS_RESOURCE;
 
return BLK_STS_OK;
 }
@@ -1778,15 +1784,30 @@ static void blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
hctx_lock(hctx, &srcu_idx);
 
-   ret = __blk_mq_try_issue_directly(hctx, rq, cookie);
+   ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
if (ret == BLK_STS_RESOURCE)
-   __blk_mq_fallback_to_insert(hctx, rq, true);
+ 
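
The drivers/md/dm-rq.c hunk is cut off above, but the consumer side follows
from the commit message: dm-rq direct-issues the clone and, if it gets
BLK_STS_RESOURCE back, returns DM_MAPIO_REQUEUE so the original request is
not destaged.  A rough sketch of that shape (the helper name here is made
up for illustration; only blk_insert_cloned_request() and the DM_MAPIO_*
codes come from the series):

	/* illustrative only -- not the actual dm-rq hunk from this series */
	static int dm_rq_issue_clone(struct request *clone)
	{
		blk_status_t ret;

		/* direct-issue to the underlying blk-mq queue, get its verdict */
		ret = blk_insert_cloned_request(clone->q, clone);
		if (ret == BLK_STS_RESOURCE)
			/*
			 * Underlying queue is busy: don't destage.  The original
			 * request stays in dm-rq's elevator, so it can still be
			 * merged with later IO.
			 */
			return DM_MAPIO_REQUEUE;

		/* issued (or completed with an error by the lower layer) */
		return DM_MAPIO_REMAPPED;
	}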

Re: [for-4.16 PATCH v5 2/3] blk-mq: improve DM's blk-mq IO merging performance

2018-01-16 Thread Mike Snitzer
This one is redundant and should be dropped.  I ran git-format-patch
twice after a quick rebase to tweak the subject and header.

Sorry for the confusion.

On Tue, Jan 16 2018 at 11:33pm -0500,
Mike Snitzer <snit...@redhat.com> wrote:

> From: Ming Lei <ming@redhat.com>
> 
> blk_insert_cloned_request() is called in the fast path of a dm-rq driver
> (e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
> blk_mq_request_bypass_insert() to directly append the request to the
> blk-mq hctx->dispatch_list of the underlying queue.
> 
> 1) This way isn't efficient enough because the hctx spinlock is always
> used.
> 
> 2) With blk_insert_cloned_request(), we completely bypass underlying
> queue's elevator and depend on the upper-level dm-rq driver's elevator
> to schedule IO.  But dm-rq currently can't get the underlying queue's
> dispatch feedback at all.  Without knowing whether a request was issued
> or not (e.g. due to underlying queue being busy) the dm-rq elevator will
> not be able to provide effective IO merging (as a side-effect of dm-rq
> currently blindly destaging a request from its elevator only to requeue
> it after a delay, which kills any opportunity for merging).  This
> obviously causes very bad sequential IO performance.
> 
> Fix this by updating blk_insert_cloned_request() to use
> blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
> request to be issued directly to the underlying queue and returns the
> dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
> returns BLK_STS_RESOURCE, the dm-rq driver will now use DM_MAPIO_REQUEUE
> to _not_ destage the request, thereby preserving the opportunity to
> merge IO.
> 
> With this, request-based DM's blk-mq sequential IO performance is vastly
> improved (as much as 3X in mpath/virtio-scsi testing).
> 
> Signed-off-by: Ming Lei <ming@redhat.com>
> [based _heavily_ on Ming Lei's initial solution, but blk-mq.c changes
> were refactored to make them less fragile and easier to read/review]
> Signed-off-by: Mike Snitzer <snit...@redhat.com>
> ---
>  block/blk-core.c   |  3 +--
>  block/blk-mq.c | 42 +-
>  block/blk-mq.h |  3 +++
>  drivers/md/dm-rq.c | 19 ---
>  4 files changed, 53 insertions(+), 14 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 7ba607527487..55f338020254 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2500,8 +2500,7 @@ blk_status_t blk_insert_cloned_request(struct 
> request_queue *q, struct request *
>* bypass a potential scheduler on the bottom device for
>* insert.
>*/
> - blk_mq_request_bypass_insert(rq, true);
> - return BLK_STS_OK;
> + return blk_mq_request_direct_issue(rq);
>   }
>  
>   spin_lock_irqsave(q->queue_lock, flags);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f30e34a22a6c..81ee3f9124dc 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1706,7 +1706,8 @@ static blk_status_t __blk_mq_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   blk_qc_t new_cookie;
>   blk_status_t ret;
>  
> - new_cookie = request_to_qc_t(hctx, rq);
> + if (cookie)
> + new_cookie = request_to_qc_t(hctx, rq);
>  
>   /*
>* For OK queue, we are done. For error, caller may kill it.
> @@ -1716,13 +1717,15 @@ static blk_status_t __blk_mq_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   ret = q->mq_ops->queue_rq(hctx, &bd);
>   switch (ret) {
>   case BLK_STS_OK:
> - *cookie = new_cookie;
> + if (cookie)
> + *cookie = new_cookie;
>   break;
>   case BLK_STS_RESOURCE:
>   __blk_mq_requeue_request(rq);
>   break;
>   default:
> - *cookie = BLK_QC_T_NONE;
> + if (cookie)
> + *cookie = BLK_QC_T_NONE;
>   break;
>   }
>  
> @@ -1731,15 +1734,20 @@ static blk_status_t __blk_mq_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>  
>  static void __blk_mq_fallback_to_insert(struct blk_mq_hw_ctx *hctx,
>   struct request *rq,
> - bool run_queue)
> + bool run_queue, bool bypass_insert)
>  {
> + if (bypass_insert) {
> + blk_mq_request_bypass_insert(rq, run_queue);
> + return;
> + }
>   blk_mq_sched_insert_request(rq, false, run_queue, false,
>   hctx->flags & BL

[for-4.16 PATCH v5 2/3] blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

2018-01-16 Thread Mike Snitzer
From: Ming Lei <ming@redhat.com>

blk_insert_cloned_request() is called in the fast path of a dm-rq driver
(e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
blk_mq_request_bypass_insert() to directly append the request to the
blk-mq hctx->dispatch_list of the underlying queue.

1) This way isn't efficient enough because the hctx spinlock is always
used.

2) With blk_insert_cloned_request(), we completely bypass underlying
queue's elevator and depend on the upper-level dm-rq driver's elevator
to schedule IO.  But dm-rq currently can't get the underlying queue's
dispatch feedback at all.  Without knowing whether a request was issued
or not (e.g. due to underlying queue being busy) the dm-rq elevator will
not be able to provide effective IO merging (as a side-effect of dm-rq
currently blindly destaging a request from its elevator only to requeue
it after a delay, which kills any opportunity for merging).  This
obviously causes very bad sequential IO performance.

Fix this by updating blk_insert_cloned_request() to use
blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
request to be issued directly to the underlying queue and returns the
dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
returns BLK_STS_RESOURCE, the dm-rq driver will now use DM_MAPIO_REQUEUE
to _not_ destage the request, thereby preserving the opportunity to
merge IO.

With this, request-based DM's blk-mq sequential IO performance is vastly
improved (as much as 3X in mpath/virtio-scsi testing).

Signed-off-by: Ming Lei <ming@redhat.com>
[blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
they were refactored to make them less fragile and easier to read/review]
Signed-off-by: Mike Snitzer <snit...@redhat.com>
---
 block/blk-core.c   |  3 +--
 block/blk-mq.c | 42 +-
 block/blk-mq.h |  3 +++
 drivers/md/dm-rq.c | 19 ---
 4 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7ba607527487..55f338020254 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2500,8 +2500,7 @@ blk_status_t blk_insert_cloned_request(struct 
request_queue *q, struct request *
 * bypass a potential scheduler on the bottom device for
 * insert.
 */
-   blk_mq_request_bypass_insert(rq, true);
-   return BLK_STS_OK;
+   return blk_mq_request_direct_issue(rq);
}
 
spin_lock_irqsave(q->queue_lock, flags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c117c2baf2c9..0b64f7210a89 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1706,7 +1706,8 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
blk_qc_t new_cookie;
blk_status_t ret;
 
-   new_cookie = request_to_qc_t(hctx, rq);
+   if (cookie)
+   new_cookie = request_to_qc_t(hctx, rq);
 
/*
 * For OK queue, we are done. For error, caller may kill it.
@@ -1716,13 +1717,15 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
ret = q->mq_ops->queue_rq(hctx, &bd);
switch (ret) {
case BLK_STS_OK:
-   *cookie = new_cookie;
+   if (cookie)
+   *cookie = new_cookie;
break;
case BLK_STS_RESOURCE:
__blk_mq_requeue_request(rq);
break;
default:
-   *cookie = BLK_QC_T_NONE;
+   if (cookie)
+   *cookie = BLK_QC_T_NONE;
break;
}
 
@@ -1731,15 +1734,20 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
 static void __blk_mq_fallback_to_insert(struct blk_mq_hw_ctx *hctx,
struct request *rq,
-   bool run_queue)
+   bool run_queue, bool bypass_insert)
 {
+   if (bypass_insert) {
+   blk_mq_request_bypass_insert(rq, run_queue);
+   return;
+   }
blk_mq_sched_insert_request(rq, false, run_queue, false,
hctx->flags & BLK_MQ_F_BLOCKING);
 }
 
 static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
struct request *rq,
-   blk_qc_t *cookie)
+   blk_qc_t *cookie,
+   bool bypass_insert)
 {
struct request_queue *q = rq->q;
bool run_queue = true;
@@ -1750,7 +1758,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
goto insert;
}
 
-   if (q->elevator)
+   if (q->elevator && !bypass_insert)
goto

[for-4.16 PATCH v5 3/3] blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request

2018-01-16 Thread Mike Snitzer
Signed-off-by: Mike Snitzer <snit...@redhat.com>
---
 block/blk-exec.c |  2 +-
 block/blk-mq-sched.c |  2 +-
 block/blk-mq-sched.h |  2 +-
 block/blk-mq.c   | 16 +++-
 4 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/block/blk-exec.c b/block/blk-exec.c
index 5c0f3dc446dc..f7b292f12449 100644
--- a/block/blk-exec.c
+++ b/block/blk-exec.c
@@ -61,7 +61,7 @@ void blk_execute_rq_nowait(struct request_queue *q, struct 
gendisk *bd_disk,
 * be reused after dying flag is set
 */
if (q->mq_ops) {
-   blk_mq_sched_insert_request(rq, at_head, true, false, false);
+   blk_mq_sched_insert_request(rq, at_head, true, false);
return;
}
 
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 2ff7cf0cbf73..55c0a745b427 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -427,7 +427,7 @@ void blk_mq_sched_restart(struct blk_mq_hw_ctx *const hctx)
 }
 
 void blk_mq_sched_insert_request(struct request *rq, bool at_head,
-bool run_queue, bool async, bool can_block)
+bool run_queue, bool async)
 {
struct request_queue *q = rq->q;
struct elevator_queue *e = q->elevator;
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index ba1d1418a96d..1e9c9018ace1 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -18,7 +18,7 @@ bool blk_mq_sched_try_insert_merge(struct request_queue *q, 
struct request *rq);
 void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_sched_insert_request(struct request *rq, bool at_head,
-bool run_queue, bool async, bool can_block);
+bool run_queue, bool async);
 void blk_mq_sched_insert_requests(struct request_queue *q,
  struct blk_mq_ctx *ctx,
  struct list_head *list, bool run_queue_async);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0b64f7210a89..06ef6a7cc29c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -745,13 +745,13 @@ static void blk_mq_requeue_work(struct work_struct *work)
 
rq->rq_flags &= ~RQF_SOFTBARRIER;
list_del_init(&rq->queuelist);
-   blk_mq_sched_insert_request(rq, true, false, false, true);
+   blk_mq_sched_insert_request(rq, true, false, false);
}
 
while (!list_empty(&rq_list)) {
rq = list_entry(rq_list.next, struct request, queuelist);
list_del_init(&rq->queuelist);
-   blk_mq_sched_insert_request(rq, false, false, false, true);
+   blk_mq_sched_insert_request(rq, false, false, false);
}
 
blk_mq_run_hw_queues(q, false);
@@ -1732,16 +1732,14 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
return ret;
 }
 
-static void __blk_mq_fallback_to_insert(struct blk_mq_hw_ctx *hctx,
-   struct request *rq,
+static void __blk_mq_fallback_to_insert(struct request *rq,
bool run_queue, bool bypass_insert)
 {
if (bypass_insert) {
blk_mq_request_bypass_insert(rq, run_queue);
return;
}
-   blk_mq_sched_insert_request(rq, false, run_queue, false,
-   hctx->flags & BLK_MQ_F_BLOCKING);
+   blk_mq_sched_insert_request(rq, false, run_queue, false);
 }
 
 static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
@@ -1771,7 +1769,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
return __blk_mq_issue_directly(hctx, rq, cookie);
 insert:
-   __blk_mq_fallback_to_insert(hctx, rq, run_queue, bypass_insert);
+   __blk_mq_fallback_to_insert(rq, run_queue, bypass_insert);
if (bypass_insert)
return BLK_STS_RESOURCE;
 
@@ -1790,7 +1788,7 @@ static void blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
if (ret == BLK_STS_RESOURCE)
-   __blk_mq_fallback_to_insert(hctx, rq, true, false);
+   __blk_mq_fallback_to_insert(rq, true, false);
else if (ret != BLK_STS_OK)
blk_mq_end_request(rq, ret);
 
@@ -1919,7 +1917,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue 
*q, struct bio *bio)
} else if (q->elevator) {
blk_mq_put_ctx(data.ctx);
blk_mq_bio_to_request(rq, bio);
-   blk_mq_sched_insert_request(rq, false, true, true, true);
+   blk_mq_sched_insert_request(rq, false, true, true);
} else {
blk_mq_put_ctx(data.ctx);
blk_mq_bio_to_request(rq, bio);
-- 
2.15.0



[for-4.16 PATCH v5 2/3] blk-mq: improve DM's blk-mq IO merging performance

2018-01-16 Thread Mike Snitzer
From: Ming Lei <ming@redhat.com>

blk_insert_cloned_request() is called in the fast path of a dm-rq driver
(e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
blk_mq_request_bypass_insert() to directly append the request to the
blk-mq hctx->dispatch_list of the underlying queue.

1) This way isn't efficient enough because the hctx spinlock is always
used.

2) With blk_insert_cloned_request(), we completely bypass underlying
queue's elevator and depend on the upper-level dm-rq driver's elevator
to schedule IO.  But dm-rq currently can't get the underlying queue's
dispatch feedback at all.  Without knowing whether a request was issued
or not (e.g. due to underlying queue being busy) the dm-rq elevator will
not be able to provide effective IO merging (as a side-effect of dm-rq
currently blindly destaging a request from its elevator only to requeue
it after a delay, which kills any opportunity for merging).  This
obviously causes very bad sequential IO performance.

Fix this by updating blk_insert_cloned_request() to use
blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
request to be issued directly to the underlying queue and returns the
dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
returns BLK_STS_RESOURCE, the dm-rq driver will now use DM_MAPIO_REQUEUE
to _not_ destage the request, thereby preserving the opportunity to
merge IO.

With this, request-based DM's blk-mq sequential IO performance is vastly
improved (as much as 3X in mpath/virtio-scsi testing).

Signed-off-by: Ming Lei <ming@redhat.com>
[based _heavily_ on Ming Lei's initial solution, but blk-mq.c changes
were refactored to make them less fragile and easier to read/review]
Signed-off-by: Mike Snitzer <snit...@redhat.com>
---
 block/blk-core.c   |  3 +--
 block/blk-mq.c | 42 +-
 block/blk-mq.h |  3 +++
 drivers/md/dm-rq.c | 19 ---
 4 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7ba607527487..55f338020254 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2500,8 +2500,7 @@ blk_status_t blk_insert_cloned_request(struct 
request_queue *q, struct request *
 * bypass a potential scheduler on the bottom device for
 * insert.
 */
-   blk_mq_request_bypass_insert(rq, true);
-   return BLK_STS_OK;
+   return blk_mq_request_direct_issue(rq);
}
 
spin_lock_irqsave(q->queue_lock, flags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f30e34a22a6c..81ee3f9124dc 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1706,7 +1706,8 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
blk_qc_t new_cookie;
blk_status_t ret;
 
-   new_cookie = request_to_qc_t(hctx, rq);
+   if (cookie)
+   new_cookie = request_to_qc_t(hctx, rq);
 
/*
 * For OK queue, we are done. For error, caller may kill it.
@@ -1716,13 +1717,15 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
ret = q->mq_ops->queue_rq(hctx, &bd);
switch (ret) {
case BLK_STS_OK:
-   *cookie = new_cookie;
+   if (cookie)
+   *cookie = new_cookie;
break;
case BLK_STS_RESOURCE:
__blk_mq_requeue_request(rq);
break;
default:
-   *cookie = BLK_QC_T_NONE;
+   if (cookie)
+   *cookie = BLK_QC_T_NONE;
break;
}
 
@@ -1731,15 +1734,20 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
 static void __blk_mq_fallback_to_insert(struct blk_mq_hw_ctx *hctx,
struct request *rq,
-   bool run_queue)
+   bool run_queue, bool bypass_insert)
 {
+   if (bypass_insert) {
+   blk_mq_request_bypass_insert(rq, run_queue);
+   return;
+   }
blk_mq_sched_insert_request(rq, false, run_queue, false,
hctx->flags & BLK_MQ_F_BLOCKING);
 }
 
 static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
struct request *rq,
-   blk_qc_t *cookie)
+   blk_qc_t *cookie,
+   bool bypass_insert)
 {
struct request_queue *q = rq->q;
bool run_queue = true;
@@ -1750,7 +1758,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
goto insert;
}
 
-   if (q->elevator)
+   if (q->elevator && !bypass_insert)
goto insert

[for-4.16 PATCH v5 1/3] blk-mq: factor out a few helpers from __blk_mq_try_issue_directly

2018-01-16 Thread Mike Snitzer
No functional change.  Just makes code flow more logically.

In following commit, __blk_mq_try_issue_directly() will be used to
return the dispatch result (blk_status_t) to DM.  DM needs this
information to improve IO merging.

Signed-off-by: Mike Snitzer <snit...@redhat.com>
---
 block/blk-mq.c | 79 ++
 1 file changed, 52 insertions(+), 27 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c8f62e6be6b6..c117c2baf2c9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1694,9 +1694,9 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx 
*hctx, struct request *rq)
return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
 }
 
-static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
-   struct request *rq,
-   blk_qc_t *cookie)
+static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
+   struct request *rq,
+   blk_qc_t *cookie)
 {
struct request_queue *q = rq->q;
struct blk_mq_queue_data bd = {
@@ -1705,6 +1705,43 @@ static void __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
};
blk_qc_t new_cookie;
blk_status_t ret;
+
+   new_cookie = request_to_qc_t(hctx, rq);
+
+   /*
+* For OK queue, we are done. For error, caller may kill it.
+* Any other error (busy), just add it to our list as we
+* previously would have done.
+*/
+   ret = q->mq_ops->queue_rq(hctx, &bd);
+   switch (ret) {
+   case BLK_STS_OK:
+   *cookie = new_cookie;
+   break;
+   case BLK_STS_RESOURCE:
+   __blk_mq_requeue_request(rq);
+   break;
+   default:
+   *cookie = BLK_QC_T_NONE;
+   break;
+   }
+
+   return ret;
+}
+
+static void __blk_mq_fallback_to_insert(struct blk_mq_hw_ctx *hctx,
+   struct request *rq,
+   bool run_queue)
+{
+   blk_mq_sched_insert_request(rq, false, run_queue, false,
+   hctx->flags & BLK_MQ_F_BLOCKING);
+}
+
+static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+   struct request *rq,
+   blk_qc_t *cookie)
+{
+   struct request_queue *q = rq->q;
bool run_queue = true;
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -1724,41 +1761,29 @@ static void __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
goto insert;
}
 
-   new_cookie = request_to_qc_t(hctx, rq);
-
-   /*
-* For OK queue, we are done. For error, kill it. Any other
-* error (busy), just add it to our list as we previously
-* would have done
-*/
-   ret = q->mq_ops->queue_rq(hctx, &bd);
-   switch (ret) {
-   case BLK_STS_OK:
-   *cookie = new_cookie;
-   return;
-   case BLK_STS_RESOURCE:
-   __blk_mq_requeue_request(rq);
-   goto insert;
-   default:
-   *cookie = BLK_QC_T_NONE;
-   blk_mq_end_request(rq, ret);
-   return;
-   }
-
+   return __blk_mq_issue_directly(hctx, rq, cookie);
 insert:
-   blk_mq_sched_insert_request(rq, false, run_queue, false,
-   hctx->flags & BLK_MQ_F_BLOCKING);
+   __blk_mq_fallback_to_insert(hctx, rq, run_queue);
+
+   return BLK_STS_OK;
 }
 
 static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
struct request *rq, blk_qc_t *cookie)
 {
+   blk_status_t ret;
int srcu_idx;
 
might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
 
hctx_lock(hctx, &srcu_idx);
-   __blk_mq_try_issue_directly(hctx, rq, cookie);
+
+   ret = __blk_mq_try_issue_directly(hctx, rq, cookie);
+   if (ret == BLK_STS_RESOURCE)
+   __blk_mq_fallback_to_insert(hctx, rq, true);
+   else if (ret != BLK_STS_OK)
+   blk_mq_end_request(rq, ret);
+
hctx_unlock(hctx, srcu_idx);
 }
 
-- 
2.15.0



[for-4.16 PATCH v5 0/3] blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

2018-01-16 Thread Mike Snitzer
Hi Jens,

I spent a decent amount of time going over this and am happy with it.
Hopefully you'll be too.

Thanks,
Mike

Mike Snitzer (2):
  blk-mq: factor out a few helpers from __blk_mq_try_issue_directly
  blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request

Ming Lei (1):
  blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

 block/blk-core.c |   3 +-
 block/blk-exec.c |   2 +-
 block/blk-mq-sched.c |   2 +-
 block/blk-mq-sched.h |   2 +-
 block/blk-mq.c   | 109 ---
 block/blk-mq.h   |   3 ++
 drivers/md/dm-rq.c   |  19 +++--
 7 files changed, 101 insertions(+), 39 deletions(-)

-- 
2.15.0



Re: [PATCH 3/3] block: Protect less code with sysfs_lock in blk_{un,}register_queue()

2018-01-16 Thread Mike Snitzer
On Tue, Jan 16 2018 at  1:17pm -0500,
Bart Van Assche  wrote:

> The __blk_mq_register_dev(), blk_mq_unregister_dev(),
> elv_register_queue() and elv_unregister_queue() calls need to be
> protected with sysfs_lock but other code in these functions not.
> Hence protect only this code with sysfs_lock. This patch fixes a
> locking inversion issue in blk_unregister_queue() and also in an
> error path of blk_register_queue(): it is not allowed to hold
> sysfs_lock around the kobject_del(>kobj) call.
> 
> Signed-off-by: Bart Van Assche 
> ---
>  block/blk-sysfs.c | 13 -
>  1 file changed, 4 insertions(+), 9 deletions(-)
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 4a6a40ffd78e..e9ce45ff0ef2 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -909,11 +909,12 @@ int blk_register_queue(struct gendisk *disk)
>   if (q->request_fn || (q->mq_ops && q->elevator)) {
>   ret = elv_register_queue(q);
>   if (ret) {
> + mutex_unlock(&q->sysfs_lock);
>   kobject_uevent(&q->kobj, KOBJ_REMOVE);
>   kobject_del(&q->kobj);
>   blk_trace_remove_sysfs(dev);
>   kobject_put(&dev->kobj);
> - goto unlock;
> + return ret;
>   }
>   }
>   ret = 0;
> @@ -934,28 +935,22 @@ void blk_unregister_queue(struct gendisk *disk)
>   if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags))
>   return;
>  
> - /*
> -  * Protect against the 'queue' kobj being accessed
> -  * while/after it is removed.
> -  */
> - mutex_lock(&q->sysfs_lock);
> -
>   spin_lock_irq(q->queue_lock);
>   queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
>   spin_unlock_irq(q->queue_lock);
>  
>   wbt_exit(q);
>  
> + mutex_lock(&q->sysfs_lock);
>   if (q->mq_ops)
>   blk_mq_unregister_dev(disk_to_dev(disk), q);
>  
>   if (q->request_fn || (q->mq_ops && q->elevator))
>   elv_unregister_queue(q);

My concern with this change is detailed in the following portion of
the header for commit 667257e8b2988c0183ba23e2bcd6900e87961606:

2) Conversely, __elevator_change() is testing for QUEUE_FLAG_REGISTERED
in case elv_iosched_store() loses the race with blk_unregister_queue(),
it needs a way to know the 'queue' kobj isn't there.

I don't think moving mutex_lock(&q->sysfs_lock) after the clearing of
QUEUE_FLAG_REGISTERED is a step in the right direction.

Current code shows:

blk_cleanup_queue() calls blk_set_queue_dying() while holding
the sysfs_lock.

queue_attr_{show,store} both test if blk_queue_dying(q) while holding
the sysfs_lock.

BUT drivers can/do call del_gendisk() _before_ blk_cleanup_queue().
(if your proposed change above were to go in, all of the block drivers
would first need to be audited for the need to call blk_cleanup_queue()
before del_gendisk() -- seems awful).

Therefore it seems to me that all queue_attr_{show,store} are racy vs
blk_unregister_queue() removing the 'queue' kobject.

And it was just that __elevator_change() was myopically fixed to address
the race whereas a more generic solution was/is needed.  But short of
that more generic fix your change will reintroduce the potential for
hitting the issue that commit e9a823fb34a8b fixed.

In that light, think it best to leave blk_unregister_queue()'s
mutex_lock() above the QUEUE_FLAG_REGISTERED clearing _and_ update
queue_attr_{show,store} to test for QUEUE_FLAG_REGISTERED while holding
sysfs_lock.

Then remove the unicorn test_bit for QUEUE_FLAG_REGISTERED from
__elevator_change().

But it could be I'm wrong for some reason.. as you know that happens ;)

Mike
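
A sketch of that suggestion (not a posted patch): have queue_attr_show()
and queue_attr_store() bail out under sysfs_lock once the queue is no
longer registered, which would make the one-off QUEUE_FLAG_REGISTERED test
in __elevator_change() unnecessary.  Assuming the existing
queue_attr_show() in block/blk-sysfs.c, the show side would look roughly
like:

	static ssize_t
	queue_attr_show(struct kobject *kobj, struct attribute *attr, char *page)
	{
		struct queue_sysfs_entry *entry = to_queue(attr);
		struct request_queue *q =
			container_of(kobj, struct request_queue, kobj);
		ssize_t res;

		if (!entry->show)
			return -EIO;
		mutex_lock(&q->sysfs_lock);
		/* unregistered/dying queues are now caught here, under sysfs_lock */
		if (blk_queue_dying(q) ||
		    !test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags)) {
			mutex_unlock(&q->sysfs_lock);
			return -ENOENT;
		}
		res = entry->show(q, page);
		mutex_unlock(&q->sysfs_lock);
		return res;
	}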


Re: [for-4.16 PATCH v4-mike 2/2] blk-mq: issue request directly for blk_insert_cloned_request

2018-01-16 Thread Mike Snitzer
On Tue, Jan 16 2018 at 12:41pm -0500,
Jens Axboe <ax...@kernel.dk> wrote:

> On 1/16/18 10:38 AM, Mike Snitzer wrote:
> > On Tue, Jan 16 2018 at 12:20pm -0500,
> > Jens Axboe <ax...@kernel.dk> wrote:
> > 
> >> On 1/16/18 8:01 AM, Mike Snitzer wrote:
> >>> From: Ming Lei <ming@redhat.com>
> >>>
> >>> blk_insert_cloned_request() is called in fast path of dm-rq driver, and
> >>> in this function we append request to hctx->dispatch_list of the 
> >>> underlying
> >>> queue directly.
> >>>
> >>> 1) This way isn't efficient enough because hctx lock is always required
> >>>
> >>> 2) With blk_insert_cloned_request(), we bypass underlying queue's IO
> >>> scheduler totally, and depend on DM rq driver to do IO schedule
> >>> completely.  But DM rq driver can't get underlying queue's dispatch
> >>> feedback at all, and this information is extremely useful for IO merging.
> >>> Without it, blk-mq basically can't do IO merging, which causes
> >>> very bad sequential IO performance.
> >>>
> >>> Fix this by having blk_insert_cloned_request() make use of
> >>> blk_mq_try_issue_directly() via blk_mq_request_direct_issue().
> >>> blk_mq_request_direct_issue() allows a request to be issued directly
> >>> to the underlying queue and provides the dispatch result to
> >>> dm-rq and blk-mq.
> >>>
> >>> With this, the DM's blk-mq sequential IO performance is vastly
> >>> improved (as much as 3X in mpath/virtio-scsi testing).
> >>
> >> This still feels pretty hacky...
> >>
> >>> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >>> index 55f3a27fb2e6..3168a13cb012 100644
> >>> --- a/block/blk-mq.c
> >>> +++ b/block/blk-mq.c
> >>> @@ -1706,6 +1706,12 @@ static blk_status_t 
> >>> __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> >>>   blk_qc_t new_cookie;
> >>>   blk_status_t ret = BLK_STS_OK;
> >>>   bool run_queue = true;
> >>> + /*
> >>> +  * If @cookie is NULL do not insert the request, this mode is used
> >>> +  * by blk_insert_cloned_request() via blk_mq_request_direct_issue()
> >>> +  */
> >>> + bool dispatch_only = !cookie;
> >>> + bool need_insert = false;
> >>
> >> Overloading 'cookie' to also mean this isn't very future proof or solid.
> > 
> > It enables the existing interface to be used without needing to prop up
> > something else that extends out to the edge (blk_insert_cloned_request).
> 
> Doesn't really matter if the end result is too ugly/fragile to live.

Agreed.
 
> >> And now __blk_mq_try_issue_directly() is pretty much a mess. Feels like
> >> it should be split in two, where the other half would do the actual
> >> insert. Then let the caller do it, if we could not issue directly. That
> >> would be a lot more solid and easier to read.
> > 
> > That is effectively what Ming's variant did (by splitting out the issue
> > to a helper).
> > 
> > BUT I'll see what I can come up with...
> 
> Maybe I missed that version, there were many rapid fire versions posted.
> Please just take your time and get it right, that's much more important.

Not trying to rush, going over it carefully now.

Think I have a cleaner way forward.

Thanks,
Mike


Re: [for-4.16 PATCH v4-mike 2/2] blk-mq: issue request directly for blk_insert_cloned_request

2018-01-16 Thread Mike Snitzer
On Tue, Jan 16 2018 at 12:20pm -0500,
Jens Axboe <ax...@kernel.dk> wrote:

> On 1/16/18 8:01 AM, Mike Snitzer wrote:
> > From: Ming Lei <ming@redhat.com>
> > 
> > blk_insert_cloned_request() is called in fast path of dm-rq driver, and
> > in this function we append request to hctx->dispatch_list of the underlying
> > queue directly.
> > 
> > 1) This way isn't efficient enough because hctx lock is always required
> > 
> > 2) With blk_insert_cloned_request(), we bypass underlying queue's IO
> > scheduler totally, and depend on DM rq driver to do IO schedule
> > completely.  But DM rq driver can't get underlying queue's dispatch
> > feedback at all, and this information is extremely useful for IO merging.
> > Without it, blk-mq basically can't do IO merging, which causes
> > very bad sequential IO performance.
> > 
> > Fix this by having blk_insert_cloned_request() make use of
> > blk_mq_try_issue_directly() via blk_mq_request_direct_issue().
> > blk_mq_request_direct_issue() allows a request to be issued directly
> > to the underlying queue and provides the dispatch result to
> > dm-rq and blk-mq.
> > 
> > With this, the DM's blk-mq sequential IO performance is vastly
> > improved (as much as 3X in mpath/virtio-scsi testing).
> 
> This still feels pretty hacky...
> 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 55f3a27fb2e6..3168a13cb012 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1706,6 +1706,12 @@ static blk_status_t 
> > __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> > blk_qc_t new_cookie;
> > blk_status_t ret = BLK_STS_OK;
> > bool run_queue = true;
> > +   /*
> > +* If @cookie is NULL do not insert the request, this mode is used
> > +* by blk_insert_cloned_request() via blk_mq_request_direct_issue()
> > +*/
> > +   bool dispatch_only = !cookie;
> > +   bool need_insert = false;
> 
> Overloading 'cookie' to also mean this isn't very future proof or solid.

It enables the existing interface to be used without needing to prop up
something else that extends out to the edge (blk_insert_cloned_request).
 
> And now __blk_mq_try_issue_directly() is pretty much a mess. Feels like
> it should be split in two, where the other half would do the actual
> insert. Then let the caller do it, if we could not issue directly. That
> would be a lot more solid and easier to read.

That is effectively what Ming's variant did (by splitting out the issue
to a helper).

BUT I'll see what I can come up with...

(Ming please stand down until you hear back from me ;)

Thanks,
Mike


Re: [for-4.16 PATCH v4-mike 2/2] blk-mq: issue request directly for blk_insert_cloned_request

2018-01-16 Thread Mike Snitzer
On Tue, Jan 16 2018 at 10:01pm -0500,
Mike Snitzer <snit...@redhat.com> wrote:

> From: Ming Lei <ming@redhat.com>
> 
> blk_insert_cloned_request() is called in fast path of dm-rq driver, and
> in this function we append request to hctx->dispatch_list of the underlying
> queue directly.
> 
> 1) This way isn't efficient enough because hctx lock is always required
> 
> 2) With blk_insert_cloned_request(), we bypass underlying queue's IO
> scheduler totally, and depend on DM rq driver to do IO schedule
> completely.  But DM rq driver can't get underlying queue's dispatch
> feedback at all, and this information is extremely useful for IO merging.
> Without it, blk-mq basically can't do IO merging, which causes
> very bad sequential IO performance.
> 
> Fix this by having blk_insert_cloned_request() make use of
> blk_mq_try_issue_directly() via blk_mq_request_direct_issue().
> blk_mq_request_direct_issue() allows a request to be issued directly
> to the underlying queue and provides the dispatch result to
> dm-rq and blk-mq.
> 
> With this, the DM's blk-mq sequential IO performance is vastly
> improved (as much as 3X in mpath/virtio-scsi testing).
> 
> Signed-off-by: Ming Lei <ming@redhat.com>
> Signed-off-by: Mike Snitzer <snit...@redhat.com>
> ---
>  block/blk-core.c   |  3 +--
>  block/blk-mq.c | 42 --
>  block/blk-mq.h |  3 +++
>  drivers/md/dm-rq.c | 19 ---
>  4 files changed, 56 insertions(+), 11 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 7ba607527487..55f338020254 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2500,8 +2500,7 @@ blk_status_t blk_insert_cloned_request(struct 
> request_queue *q, struct request *
>* bypass a potential scheduler on the bottom device for
>* insert.
>*/
> - blk_mq_request_bypass_insert(rq, true);
> - return BLK_STS_OK;
> + return blk_mq_request_direct_issue(rq);
>   }
>  
>   spin_lock_irqsave(q->queue_lock, flags);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 55f3a27fb2e6..3168a13cb012 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1706,6 +1706,12 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   blk_qc_t new_cookie;
>   blk_status_t ret = BLK_STS_OK;
>   bool run_queue = true;
> + /*
> +  * If @cookie is NULL do not insert the request, this mode is used
> +  * by blk_insert_cloned_request() via blk_mq_request_direct_issue()
> +  */
> + bool dispatch_only = !cookie;
> + bool need_insert = false;
>  
>   /* RCU or SRCU read lock is needed before checking quiesced flag */
>   if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
> @@ -1713,25 +1719,38 @@ static blk_status_t 
> __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>   goto insert;
>   }
>  
> - if (q->elevator)
> + if (q->elevator && !dispatch_only)
>   goto insert;
>  
>   if (!blk_mq_get_driver_tag(rq, NULL, false))
> - goto insert;
> + need_insert = true;
>  
> - if (!blk_mq_get_dispatch_budget(hctx)) {
> + if (!need_insert && !blk_mq_get_dispatch_budget(hctx)) {
>   blk_mq_put_driver_tag(rq);
> + need_insert = true;
> + }
> +
> + if (need_insert) {
> + if (dispatch_only)
> + return BLK_STS_RESOURCE;
>   goto insert;
>   }
>  
>   new_cookie = request_to_qc_t(hctx, rq);
>  
> + ret = q->mq_ops->queue_rq(hctx, &bd);
> +
> + if (dispatch_only) {
> + if (ret == BLK_STS_RESOURCE)
> + __blk_mq_requeue_request(rq);
> + return ret;
> + }
> +
>   /*
>* For OK queue, we are done. For error, kill it. Any other
>* error (busy), just add it to our list as we previously
>* would have done
>*/
> - ret = q->mq_ops->queue_rq(hctx, &bd);
>   switch (ret) {
>   case BLK_STS_OK:
>   *cookie = new_cookie;

FYI, because blk_mq_put_driver_tag() is idempotent, the above can be
simplified by eliminating 'need_insert' like so (Jens feel free to fold
this in?):

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3168a13cb012..c2f66a5c5242 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1711,7 +1711,6 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 * by blk_insert_cloned_request() via blk_mq_request_direct_issue
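
The folded diff is cut off here.  Because blk_mq_put_driver_tag() is
idempotent, the simplification being described collapses the two resource
checks into one, the same combined check that appears in the later
"don't blk_mq_request_bypass_insert _and_ return BLK_STS_RESOURCE" patch
above -- roughly (illustrative reconstruction, not the missing hunk):

	if (!blk_mq_get_driver_tag(rq, NULL, false) ||
	    !blk_mq_get_dispatch_budget(hctx)) {
		/* safe even if we never held the tag: the put is idempotent */
		blk_mq_put_driver_tag(rq);
		if (dispatch_only)
			return BLK_STS_RESOURCE;
		goto insert;
	}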

[for-4.16 PATCH v4-mike 2/2] blk-mq: issue request directly for blk_insert_cloned_request

2018-01-16 Thread Mike Snitzer
From: Ming Lei <ming@redhat.com>

blk_insert_cloned_request() is called in fast path of dm-rq driver, and
in this function we append request to hctx->dispatch_list of the underlying
queue directly.

1) This way isn't efficient enough because hctx lock is always required

2) With blk_insert_cloned_request(), we bypass underlying queue's IO
scheduler totally, and depend on DM rq driver to do IO schedule
completely.  But DM rq driver can't get underlying queue's dispatch
feedback at all, and this information is extremely useful for IO merging.
Without it, blk-mq basically can't do IO merging, which causes
very bad sequential IO performance.

Fix this by having blk_insert_cloned_request() make use of
blk_mq_try_issue_directly() via blk_mq_request_direct_issue().
blk_mq_request_direct_issue() allows a request to be issued directly
to the underlying queue and provides the dispatch result to
dm-rq and blk-mq.

With this, the DM's blk-mq sequential IO performance is vastly
improved (as much as 3X in mpath/virtio-scsi testing).

Signed-off-by: Ming Lei <ming@redhat.com>
Signed-off-by: Mike Snitzer <snit...@redhat.com>
---
 block/blk-core.c   |  3 +--
 block/blk-mq.c | 42 --
 block/blk-mq.h |  3 +++
 drivers/md/dm-rq.c | 19 ---
 4 files changed, 56 insertions(+), 11 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7ba607527487..55f338020254 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2500,8 +2500,7 @@ blk_status_t blk_insert_cloned_request(struct 
request_queue *q, struct request *
 * bypass a potential scheduler on the bottom device for
 * insert.
 */
-   blk_mq_request_bypass_insert(rq, true);
-   return BLK_STS_OK;
+   return blk_mq_request_direct_issue(rq);
}
 
spin_lock_irqsave(q->queue_lock, flags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 55f3a27fb2e6..3168a13cb012 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1706,6 +1706,12 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
blk_qc_t new_cookie;
blk_status_t ret = BLK_STS_OK;
bool run_queue = true;
+   /*
+* If @cookie is NULL do not insert the request, this mode is used
+* by blk_insert_cloned_request() via blk_mq_request_direct_issue()
+*/
+   bool dispatch_only = !cookie;
+   bool need_insert = false;
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
@@ -1713,25 +1719,38 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
goto insert;
}
 
-   if (q->elevator)
+   if (q->elevator && !dispatch_only)
goto insert;
 
if (!blk_mq_get_driver_tag(rq, NULL, false))
-   goto insert;
+   need_insert = true;
 
-   if (!blk_mq_get_dispatch_budget(hctx)) {
+   if (!need_insert && !blk_mq_get_dispatch_budget(hctx)) {
blk_mq_put_driver_tag(rq);
+   need_insert = true;
+   }
+
+   if (need_insert) {
+   if (dispatch_only)
+   return BLK_STS_RESOURCE;
goto insert;
}
 
new_cookie = request_to_qc_t(hctx, rq);
 
+   ret = q->mq_ops->queue_rq(hctx, &bd);
+
+   if (dispatch_only) {
+   if (ret == BLK_STS_RESOURCE)
+   __blk_mq_requeue_request(rq);
+   return ret;
+   }
+
/*
 * For OK queue, we are done. For error, kill it. Any other
 * error (busy), just add it to our list as we previously
 * would have done
 */
-   ret = q->mq_ops->queue_rq(hctx, &bd);
switch (ret) {
case BLK_STS_OK:
*cookie = new_cookie;
@@ -1746,8 +1765,11 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
}
 
 insert:
-   blk_mq_sched_insert_request(rq, false, run_queue, false,
-   hctx->flags & BLK_MQ_F_BLOCKING);
+   if (!dispatch_only)
+   blk_mq_sched_insert_request(rq, false, run_queue, false,
+   hctx->flags & BLK_MQ_F_BLOCKING);
+   else
+   blk_mq_request_bypass_insert(rq, run_queue);
return ret;
 }
 
@@ -1767,6 +1789,14 @@ static blk_status_t blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
return ret;
 }
 
+blk_status_t blk_mq_request_direct_issue(struct request *rq)
+{
+   struct blk_mq_ctx *ctx = rq->mq_ctx;
+   struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);
+
+   return blk_mq_try_issue_directly(hctx, rq, NULL);
+}
+
 static blk_qc_t blk_mq_make_request(struct request_queue *q, st

[for-4.16 PATCH v4-mike 1/2] blk-mq: return dispatch result from blk_mq_try_issue_directly

2018-01-16 Thread Mike Snitzer
From: Ming Lei <ming@redhat.com>

In the following patch, we will use blk_mq_try_issue_directly() for DM
to return the dispatch result.  DM needs this information to improve
IO merging.

Signed-off-by: Ming Lei <ming@redhat.com>
Signed-off-by: Mike Snitzer <snit...@redhat.com>
---
 block/blk-mq.c | 23 ++-
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c8f62e6be6b6..55f3a27fb2e6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1694,9 +1694,9 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx 
*hctx, struct request *rq)
return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
 }
 
-static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
-   struct request *rq,
-   blk_qc_t *cookie)
+static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+   struct request *rq,
+   blk_qc_t *cookie)
 {
struct request_queue *q = rq->q;
struct blk_mq_queue_data bd = {
@@ -1704,7 +1704,7 @@ static void __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
.last = true,
};
blk_qc_t new_cookie;
-   blk_status_t ret;
+   blk_status_t ret = BLK_STS_OK;
bool run_queue = true;
 
/* RCU or SRCU read lock is needed before checking quiesced flag */
@@ -1735,31 +1735,36 @@ static void __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
switch (ret) {
case BLK_STS_OK:
*cookie = new_cookie;
-   return;
+   return ret;
case BLK_STS_RESOURCE:
__blk_mq_requeue_request(rq);
goto insert;
default:
*cookie = BLK_QC_T_NONE;
blk_mq_end_request(rq, ret);
-   return;
+   return ret;
}
 
 insert:
blk_mq_sched_insert_request(rq, false, run_queue, false,
hctx->flags & BLK_MQ_F_BLOCKING);
+   return ret;
 }
 
-static void blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
-   struct request *rq, blk_qc_t *cookie)
+static blk_status_t blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
+ struct request *rq,
+ blk_qc_t *cookie)
 {
int srcu_idx;
+   blk_status_t ret;
 
might_sleep_if(hctx->flags & BLK_MQ_F_BLOCKING);
 
hctx_lock(hctx, &srcu_idx);
-   __blk_mq_try_issue_directly(hctx, rq, cookie);
+   ret = __blk_mq_try_issue_directly(hctx, rq, cookie);
hctx_unlock(hctx, srcu_idx);
+
+   return ret;
 }
 
 static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
-- 
2.15.0



Re: [PATCH V4 1/3] blk-mq: move actual issue into one helper

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at  8:43pm -0500,
Ming Lei <ming@redhat.com> wrote:

> On Mon, Jan 15, 2018 at 02:41:12PM -0500, Mike Snitzer wrote:
> > On Mon, Jan 15 2018 at 12:29pm -0500,
> > Jens Axboe <ax...@kernel.dk> wrote:
> > 
> > > On 1/15/18 9:58 AM, Ming Lei wrote:
> > > > No functional change, just to clean up code a bit, so that the following
> > > > change of using direct issue for blk_mq_request_bypass_insert() which is
> > > > needed by DM can be easier to do.
> > > > 
> > > > Signed-off-by: Ming Lei <ming@redhat.com>
> > > > ---
> > > >  block/blk-mq.c | 39 +++
> > > >  1 file changed, 27 insertions(+), 12 deletions(-)
> > > > 
> > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > index edb1291a42c5..bf8d6651f40e 100644
> > > > --- a/block/blk-mq.c
> > > > +++ b/block/blk-mq.c
> > > > @@ -1696,15 +1696,37 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq)
> > > > return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
> > > >  }
> > > >  
> > > > -static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> > > > -   struct request *rq,
> > > > -   blk_qc_t *cookie)
> > > > +static blk_status_t __blk_mq_issue_req(struct blk_mq_hw_ctx *hctx,
> > > > +  struct request *rq,
> > > > +  blk_qc_t *new_cookie)
> > > >  {
> > > > +   blk_status_t ret;
> > > > struct request_queue *q = rq->q;
> > > > struct blk_mq_queue_data bd = {
> > > > .rq = rq,
> > > > .last = true,
> > > > };
> > > > +
> > > > +   if (!blk_mq_get_driver_tag(rq, NULL, false))
> > > > +   return BLK_STS_AGAIN;
> > > > +
> > > > +   if (!blk_mq_get_dispatch_budget(hctx)) {
> > > > +   blk_mq_put_driver_tag(rq);
> > > > +   return BLK_STS_AGAIN;
> > > > +   }
> > > > +
> > > > +   *new_cookie = request_to_qc_t(hctx, rq);
> > > > +
> > > > +   ret = q->mq_ops->queue_rq(hctx, &bd);
> > > > +
> > > > +   return ret;
> > > 
> > >   return q->mq_ops->queue_rq(hctx, &bd);
> > > 
> > > and kill 'ret', it's not used.
> > 
> > Yeap, good point.
> > 
> > > But more importantly, who puts the
> > > driver tag and the budget if we get != OK for ->queue_rq()?
> > 
> > __blk_mq_try_issue_directly() processes the returned value same as before
> > this patch.  Means this patch isn't making any functional change:
> > If BLK_STS_RESOURCE: __blk_mq_requeue_request() is called.
> > __blk_mq_requeue_request() will blk_mq_put_driver_tag().
> > Otherwise, all other errors result in blk_mq_end_request(rq, ret);
> > 
> > So ignoring this patch, are you concerned that:
> > 1) Does blk_mq_end_request() put both?  Looks like blk_mq_free_request()
> > handles rq->tag != -1 but why not have it use __blk_mq_put_driver_tag()?
> > I'm not seeing where the budget is put from blk_mq_end_request()...
> 
> blk_mq_free_request() releases driver tag, and budget is owned by driver
> once .queue_rq is called.
> 
> > 
> > 2) Nothing seems to be putting the budget in
> > __blk_mq_try_issue_directly()'s BLK_STS_RESOURCE error path?  I share
> > your concern now (for drivers who set {get,put}_budget in mq_ops).
> > Should __blk_mq_requeue_request() be updated to also
> > blk_mq_put_dispatch_budget()?
> 
> No, at least it is current protocol of using budget, please see
> scsi_mq_queue_rq() and comment of blk_mq_do_dispatch_sched().

Yeap, thanks for clarifying.
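
As an aside, the ownership rules clarified above can be restated in a small sketch
(not the actual blk-mq code; it only summarizes the protocol using the helpers named
in this thread):

	static blk_status_t issue_one(struct blk_mq_hw_ctx *hctx, struct request *rq,
				      struct blk_mq_queue_data *bd)
	{
		struct request_queue *q = rq->q;
		blk_status_t ret;

		if (!blk_mq_get_driver_tag(rq, NULL, false))
			return BLK_STS_RESOURCE;	/* nothing acquired yet */

		if (!blk_mq_get_dispatch_budget(hctx)) {
			blk_mq_put_driver_tag(rq);	/* we still own the tag here */
			return BLK_STS_RESOURCE;
		}

		ret = q->mq_ops->queue_rq(hctx, bd);
		switch (ret) {
		case BLK_STS_OK:
			break;				/* tag freed at completion; budget consumed by the driver */
		case BLK_STS_RESOURCE:
			__blk_mq_requeue_request(rq);	/* this puts the driver tag */
			break;				/* the driver already dealt with the budget */
		default:
			blk_mq_end_request(rq, ret);	/* tag released via completion */
			break;
		}
		return ret;
	}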


Re: [PATCH V4 3/3] blk-mq: issue request directly for blk_insert_cloned_request

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at 11:58am -0500,
Ming Lei  wrote:

> blk_insert_cloned_request() is called in fast path of dm-rq driver, and
> in this function we append request to hctx->dispatch_list of the underlying
> queue directly.
> 
> 1) This way isn't efficient enough because hctx lock is always required
> 
> 2) With blk_insert_cloned_request(), we bypass underlying queue's IO scheduler
> totally, and depend on DM rq driver to do IO schedule completely. But DM
> rq driver can't get underlying queue's dispatch feedback at all, and this
> information is extreamly useful for IO merge. Without that IO merge can't
> be done basically by blk-mq, and causes very bad sequential IO performance.
> 
> This patch makes use of blk_mq_try_issue_directly() to dispatch rq to
> underlying queue and provides dispatch result to dm-rq and blk-mq, and
> improves the above situations very much.
> 
> With this patch, sequential IO is improved by 3X in my test over
> mpath/virtio-scsi.
> 
> Signed-off-by: Ming Lei 
> ---
>  block/blk-mq.c | 29 ++---
>  1 file changed, 26 insertions(+), 3 deletions(-)

Hey Ming,

I also just noticed your V4 of this patch only includes the
block/blk-mq.c changes.

You're missing:

 block/blk-core.c   |  3 +--
 block/blk-mq.h |  3 +++
 drivers/md/dm-rq.c | 19 +---

Please let Jens know if you're OK with my V4, tagged here:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/tag/?h=for-block-4.16/dm-changes-2
And also detailed in this message from earlier in this thread:
https://marc.info/?l=linux-block&m=151603824725438&w=2

Or please generate V5 of your series.  Hopefully it'd include the header
I revised for this 3/3 patch, see:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=for-block-4.16/dm-changes-2=d86beab5712a8f18123011487dee797a1e3a07e1

We also need to address the issues Jens noticed and I looked at a bit
closer: https://marc.info/?l=linux-block&m=151604528127707&w=2
(those issues might need fixing first, and marked for stable@, and the
series rebased ontop of it?)

Thanks!
Mike
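
For readers not following the DM tree: the companion dm-rq change (not quoted here)
consumes the dispatch result roughly like the sketch below; the clone/tio field names
are assumptions based on the existing dm-rq clone handling.

	ret = blk_insert_cloned_request(clone->q, clone);
	if (ret == BLK_STS_RESOURCE) {
		/*
		 * The underlying queue is busy: give the clone back and requeue
		 * the original request on the DM queue, where it can still be
		 * merged with later IO.
		 */
		blk_rq_unprep_clone(clone);
		tio->ti->type->release_clone_rq(clone);
		tio->clone = NULL;
		return DM_MAPIO_REQUEUE;
	}
	/* BLK_STS_OK and hard errors keep their existing handling. */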


Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at  6:10pm -0500,
Mike Snitzer <snit...@redhat.com> wrote:

> On Mon, Jan 15 2018 at  5:51pm -0500,
> Bart Van Assche <bart.vanass...@wdc.com> wrote:
> 
> > On Mon, 2018-01-15 at 17:15 -0500, Mike Snitzer wrote:
> > > sysfs write op calls kernfs_fop_write which takes:
> > > of->mutex then kn->count#213 (no idea what that is)
> > > then q->sysfs_lock (via queue_attr_store)
> > > 
> > > vs 
> > > 
> > > blk_unregister_queue takes:
> > > q->sysfs_lock then
> > > kernfs_mutex (via kernfs_remove)
> > > seems lockdep thinks "kernfs_mutex" is "kn->count#213"?
> > > 
> > > Feels like lockdep code in fs/kernfs/file.c and fs/kernfs/dir.c is
> > > triggering false positives.. because these seem like different kernfs
> > > locks yet they are reported as "kn->count#213".
> > > 
> > > Certainly feeling out of my depth with kernfs's locking though.
> > 
> > Hello Mike,
> > 
> > I don't think that this is a false positive but rather the following 
> > traditional
> > pattern of a potential deadlock involving sysfs attributes:
> > * One context obtains a mutex from inside a sysfs attribute method:
> >   queue_attr_store() obtains q->sysfs_lock.
> > * Another context removes a sysfs attribute while holding a mutex:
> >   blk_unregister_queue() removes the queue sysfs attributes while holding
> >   q->sysfs_lock.
> > 
> > This can result in a real deadlock because the code that removes sysfs 
> > objects
> > waits until all ongoing attribute callbacks have finished.
> > 
> > Since commit 667257e8b298 ("block: properly protect the 'queue' kobj in
> > blk_unregister_queue") modified blk_unregister_queue() such that 
> > q->sysfs_lock
> > is held around the kobject_del(>kobj) call I think this is a regression
> > introduced by that commit.
> 
> Sure, of course it is a regression.
> 
> Aside from moving the mutex_unlock(&q->sysfs_lock) above the
> kobject_del(&q->kobj) I don't know how to fix it.
> 
> Though, realistically that'd be an adequate fix (given the way the code
> was before).

Any chance you apply this and re-run your srp_test that triggered the
lockdep splat?

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 4a6a40ffd78e..c50e08e9bf17 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -952,10 +952,10 @@ void blk_unregister_queue(struct gendisk *disk)
if (q->request_fn || (q->mq_ops && q->elevator))
elv_unregister_queue(q);
 
+   mutex_unlock(&q->sysfs_lock);
+
kobject_uevent(&q->kobj, KOBJ_REMOVE);
kobject_del(&q->kobj);
blk_trace_remove_sysfs(disk_to_dev(disk));
kobject_put(&disk_to_dev(disk)->kobj);
-
-   mutex_unlock(&q->sysfs_lock);
 }


Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at  5:51pm -0500,
Bart Van Assche <bart.vanass...@wdc.com> wrote:

> On Mon, 2018-01-15 at 17:15 -0500, Mike Snitzer wrote:
> > sysfs write op calls kernfs_fop_write which takes:
> > of->mutex then kn->count#213 (no idea what that is)
> > then q->sysfs_lock (via queue_attr_store)
> > 
> > vs 
> > 
> > blk_unregister_queue takes:
> > q->sysfs_lock then
> > kernfs_mutex (via kernfs_remove)
> > seems lockdep thinks "kernfs_mutex" is "kn->count#213"?
> > 
> > Feels like lockdep code in fs/kernfs/file.c and fs/kernfs/dir.c is
> > triggering false positives.. because these seem like different kernfs
> > locks yet they are reported as "kn->count#213".
> > 
> > Certainly feeling out of my depth with kernfs's locking though.
> 
> Hello Mike,
> 
> I don't think that this is a false positive but rather the following 
> traditional
> pattern of a potential deadlock involving sysfs attributes:
> * One context obtains a mutex from inside a sysfs attribute method:
>   queue_attr_store() obtains q->sysfs_lock.
> * Another context removes a sysfs attribute while holding a mutex:
>   blk_unregister_queue() removes the queue sysfs attributes while holding
>   q->sysfs_lock.
> 
> This can result in a real deadlock because the code that removes sysfs objects
> waits until all ongoing attribute callbacks have finished.
> 
> Since commit 667257e8b298 ("block: properly protect the 'queue' kobj in
> blk_unregister_queue") modified blk_unregister_queue() such that q->sysfs_lock
> is held around the kobject_del(&q->kobj) call I think this is a regression
> introduced by that commit.

Sure, of course it is a regression.

Aside from moving the mutex_unlock(&q->sysfs_lock) above the
kobject_del(&q->kobj) I don't know how to fix it.

Though, realistically that'd be an adequate fix (given the way the code
was before).

Mike


Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at 12:16pm -0500,
Bart Van Assche <bart.vanass...@wdc.com> wrote:

> On Fri, 2018-01-12 at 10:06 -0500, Mike Snitzer wrote:
> > I'm submitting this v5 with more feeling now ;)
> 
> Hello Mike,
> 
> Have these patches been tested with lockdep enabled? The following appeared in
> the kernel log after I started testing Jens' for-next tree of this 
> morning:
> 
> ==
> WARNING: possible circular locking dependency detected
> 4.15.0-rc7-dbg+ #1 Not tainted
> --
> 02-mq/1211 is trying to acquire lock:
>  (&q->sysfs_lock){+.+.}, at: [<8b65bdad>] queue_attr_store+0x35/0x80
> 
> but task is already holding lock:
>  (kn->count#213){}, at: [<7a18ad18>] kernfs_fop_write+0xe5/0x190
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (kn->count#213){}:
>kernfs_remove+0x1a/0x30
>kobject_del.part.3+0xe/0x40
>blk_unregister_queue+0xa7/0xe0
>del_gendisk+0x12f/0x260
>sd_remove+0x58/0xc0 [sd_mod]
>device_release_driver_internal+0x15a/0x220
>bus_remove_device+0xf4/0x170
>device_del+0x12f/0x330
>__scsi_remove_device+0xef/0x120 [scsi_mod]
>scsi_forget_host+0x1b/0x60 [scsi_mod]
>scsi_remove_host+0x6f/0x110 [scsi_mod]
>0xc09ed6e4
>process_one_work+0x21c/0x6d0
>worker_thread+0x35/0x380
>kthread+0x117/0x130
>ret_from_fork+0x24/0x30
> 
> -> #0 (&q->sysfs_lock){+.+.}:
>__mutex_lock+0x6c/0x9e0
>queue_attr_store+0x35/0x80
>kernfs_fop_write+0x109/0x190
>__vfs_write+0x1e/0x130
>vfs_write+0xb9/0x1b0
>SyS_write+0x40/0xa0
>entry_SYSCALL_64_fastpath+0x23/0x9a
> 
> other info that might help us debug this:
> 
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(kn->count#213);
>                                lock(&q->sysfs_lock);
>                                lock(kn->count#213);
>   lock(&q->sysfs_lock);
> 
>  *** DEADLOCK ***
> 
> 3 locks held by 02-mq/1211:
>  #0:  (sb_writers#6){.+.+}, at: [<afdb61d3>] vfs_write+0x17f/0x1b0
>  #1:  (&of->mutex){+.+.}, at: [<b291cabb>] kernfs_fop_write+0xdd/0x190
>  #2:  (kn->count#213){}, at: [<7a18ad18>] kernfs_fop_write+0xe5/0x190

sysfs write op calls kernfs_fop_write which takes:
of->mutex then kn->count#213 (no idea what that is)
then q->sysfs_lock (via queue_attr_store)

vs 

blk_unregister_queue takes:
q->sysfs_lock then
kernfs_mutex (via kernfs_remove)
seems lockdep thinks "kernfs_mutex" is "kn->count#213"?

Feels like lockdep code in fs/kernfs/file.c and fs/kernfs/dir.c is
triggering false positives.. because these seem like different kernfs
locks yet they are reported as "kn->count#213".

Certainly feeling out of my depth with kernfs's locking though.

Mike


Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at 12:51pm -0500,
Bart Van Assche <bart.vanass...@wdc.com> wrote:

> On Mon, 2018-01-15 at 12:48 -0500, Mike Snitzer wrote:
> > Do I need to do something more to enable lockdep aside from set
> > CONFIG_LOCKDEP_SUPPORT=y ?
> 
> Hello Mike,
> 
> I think you also need to set CONFIG_PROVE_LOCKING=y.

Ah ok, was missing that, thanks.

Mike
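
For anyone else trying to reproduce these reports: CONFIG_LOCKDEP_SUPPORT only says
the architecture can support lockdep; the checking itself is enabled by something
like the fragment below (option names from lib/Kconfig.debug; the exact selects may
vary by kernel version):

	CONFIG_PROVE_LOCKING=y     # turns on lockdep's dependency checking
	CONFIG_DEBUG_LOCK_ALLOC=y  # selected by PROVE_LOCKING
	CONFIG_LOCK_STAT=y         # optional: lock contention statistics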


Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at 12:36pm -0500,
Bart Van Assche <bart.vanass...@wdc.com> wrote:

> On Mon, 2018-01-15 at 12:29 -0500, Mike Snitzer wrote:
> > So you replied to v5, I emailed a v6 out for the relevant patch.  Just
> > want to make sure you're testing with either Jens' latest tree or are
> > using my v6 that fixed blk_mq_unregister_dev() to require caller holds
> > q->sysfs_lock ?
> 
> Hello Mike,
> 
> In my test I was using Jens' latest for-next tree (commit 563877ae7dae).

OK, that includes his latest for-4.16/block (commit c100ec49fdd8) so
I'll take a closer look at this lockdep splat.

Do I need to do something more to enable lockdep aside from set
CONFIG_LOCKDEP_SUPPORT=y ?

Thanks,
Mike


Re: [PATCH V4 0/3] blk-mq: issue request directly for blk_insert_cloned_request

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at 11:58am -0500,
Ming Lei  wrote:

> Hi Guys,
> 
> The 3 paches changes the blk-mq part of blk_insert_cloned_request(),
> in which we switch to blk_mq_try_issue_directly(), so that both dm-rq
> and blk-mq can get the dispatch result of underlying queue, and with
> this information, blk-mq can handle IO merge much better, then
> sequential I/O performance is improved much.
> 
> In my dm-mpath over virtio-scsi test, this whole patchset improves
> sequential IO by 3X ~ 5X.
> 
> V4:
>   - remove dm patches which are in DM tree already
>   - cleanup __blk_mq_issue_req as suggested by Jens
> 

Ming,

You dropped the header cleanups that I did in v3 ("blk-mq: issue request
directly for blk_insert_cloned_request" being the one header I care
about being updated).

I also worked in parallel on my own v4 to address Jens' dislike for v3's
3 returns.  But I skinned the cat a different way, by dropping your
first patch that introduces the __blk_mq_issue_req helper, please see
these 2 commits:

https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=block-4.16=40f8947784128bb83dc5f7a6aed7ed230222f675
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=block-4.16=f641015f42f41df75220313ee62e8241f2fd

I think it makes the changes more obvious (not spread across 2 methods
and doesn't require use of BLK_STS_AGAIN).

Happy to yield to Jens to decide which he prefers.

Jens, if you'd like to pick my variant of v4 up they are here, thanks!

The following changes since commit c100ec49fdd836ff8a17c7bfcc7611d2ee2b:

  dm: fix incomplete request_queue initialization (2018-01-15 08:54:32 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git 
tags/for-block-4.16/dm-changes-2

for you to fetch changes up to d86beab5712a8f18123011487dee797a1e3a07e1:

  blk-mq: issue request directly for blk_insert_cloned_request (2018-01-15 
12:40:44 -0500)


- Ming's blk-mq improvements to blk_insert_cloned_request(), which is
  used exclusively by request-based DM's blk-mq mode, that enable
  substantial dm-mpath sequential IO performance improvements.


Ming Lei (2):
  blk-mq: return dispatch result from blk_mq_try_issue_directly
  blk-mq: issue request directly for blk_insert_cloned_request

 block/blk-core.c   |  3 +--
 block/blk-mq.c | 65 +-
 block/blk-mq.h |  3 +++
 drivers/md/dm-rq.c | 19 +---
 4 files changed, 70 insertions(+), 20 deletions(-)


Re: [for-4.16 PATCH v5 0/4] block/dm: allow DM to defer blk_register_queue() until ready

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at 12:16pm -0500,
Bart Van Assche <bart.vanass...@wdc.com> wrote:

> On Fri, 2018-01-12 at 10:06 -0500, Mike Snitzer wrote:
> > I'm submitting this v5 with more feeling now ;)
> 
> Hello Mike,
> 
> Have these patches been tested with lockdep enabled? The following appeared in
> the kernel log after I started testing Jens' for-next tree of this 
> morning:

So you replied to v5, I emailed a v6 out for the relevant patch.  Just
want to make sure you're testing with either Jens' latest tree or are
using my v6 that fixed blk_mq_unregister_dev() to require caller holds
q->sysfs_lock ?

With the latest code that Jens merged, I do have lockdep enabled in my
kernel... but haven't seen any issues on the 2 testbeds I've been
hammering the changes on.

Ming was concerned about the potential for a lockdep splat (said he
tried expanding the scope of q->sysfs_lock in blk_unregister_queue
before and in doing so got lockdep to trigger).

Once you respond with what it is you're testing, and if it is latest
code, I'll look closer at what you've provided and see if it is real or
that lockdep just needs some training.

Thanks,
Mike


Re: [PATCH V4 1/3] blk-mq: move actual issue into one helper

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at 11:58am -0500,
Ming Lei  wrote:

> No functional change, just to clean up code a bit, so that the following
> change of using direct issue for blk_mq_request_bypass_insert() which is
> needed by DM can be easier to do.
> 
> Signed-off-by: Ming Lei 
> ---
>  block/blk-mq.c | 39 +++
>  1 file changed, 27 insertions(+), 12 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index edb1291a42c5..bf8d6651f40e 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1696,15 +1696,37 @@ static blk_qc_t request_to_qc_t(struct blk_mq_hw_ctx *hctx, struct request *rq)
>   return blk_tag_to_qc_t(rq->internal_tag, hctx->queue_num, true);
>  }
>  
> -static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> - struct request *rq,
> - blk_qc_t *cookie)
> +static blk_status_t __blk_mq_issue_req(struct blk_mq_hw_ctx *hctx,
> +struct request *rq,
> +blk_qc_t *new_cookie)
>  {
> + blk_status_t ret;
>   struct request_queue *q = rq->q;
>   struct blk_mq_queue_data bd = {
>   .rq = rq,
>   .last = true,
>   };
> +
> + if (!blk_mq_get_driver_tag(rq, NULL, false))
> + return BLK_STS_AGAIN;
> +
> + if (!blk_mq_get_dispatch_budget(hctx)) {
> + blk_mq_put_driver_tag(rq);
> + return BLK_STS_AGAIN;
> + }
> +
> + *new_cookie = request_to_qc_t(hctx, rq);
> +
> + ret = q->mq_ops->queue_rq(hctx, &bd);
> +
> + return ret;
> +}
> +
> +static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> + struct request *rq,
> + blk_qc_t *cookie)
> +{
> + struct request_queue *q = rq->q;
>   blk_qc_t new_cookie;
>   blk_status_t ret;
>   bool run_queue = true;
> @@ -1718,22 +1740,15 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>   if (q->elevator)
>   goto insert;
>  
> - if (!blk_mq_get_driver_tag(rq, NULL, false))
> + ret = __blk_mq_issue_req(hctx, rq, &new_cookie);
> + if (ret == BLK_STS_AGAIN)
>   goto insert;

You're (ab)using BLK_STS_AGAIN as a means to an end...
But what if in the future q->mq_ops->queue_rq() returns BLK_STS_AGAIN?

Mike


Re: [GIT PULL] block changes to improve device mapper for 4.16

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at 10:33am -0500,
Jens Axboe <ax...@kernel.dk> wrote:

> On 1/14/18 7:59 PM, Mike Snitzer wrote:
...
> > Ming Lei (3):
> >   blk-mq: move actual issue into __blk_mq_issue_req helper
> 
> I don't like this patch at all - it's a 10 line function (if that)
> that ends up with three outputs, two of them hidden in passed
> in pointers. On top of that, a function that is named
> __blk_mq_issue_req() and returns bool, you would logically expect
> a 'true' return to mean that it succeeded. This is the opposite.
> 
> Not strongly opposed to the rest.

OK, I'll have a closer look at how to clean it up (and also get with
Ming).

In the meantime, you can either cherry-pick my first 4 patches or feel
free to use this to pull them in:

The following changes since commit bf9ae8c5325c0070d0ec81a849bba8d156f65993:

  blk-mq: fix bad clear of RQF_MQ_INFLIGHT in blk_mq_ct_ctx_init() (2018-01-14 
10:46:24 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git 
tags/for-block-4.16/dm-changes-1

for you to fetch changes up to d0cc27da2b04351b3cb52afeb99ceca7b9f91f3b:

  dm: fix incomplete request_queue initialization (2018-01-14 12:59:59 -0500)


- Small correctness fix in del_gendisk() if GENHD_FL_HIDDEN is used.

- Cleanup blk_unregister_queue() to more precisely protect against
  concurrent sysfs changes, blk_mq_unregister_dev() now requires caller
  to hold q->sysfs_lock (blk_unregister_queue is the only caller).

- Introduce add_disk() variant, add_disk_no_queue_reg(), that allows the
  gendisk to be registered but the associated disk->queue's
  blk_register_queue() is left for the driver to do once its
  request_queue is fully initialized.  Fixes long-standing DM
  request_queue initialization issues.

--------
Mike Snitzer (4):
  block: only bdi_unregister() in del_gendisk() if !GENHD_FL_HIDDEN
  block: properly protect the 'queue' kobj in blk_unregister_queue
  block: allow gendisk's request_queue registration to be deferred
  dm: fix incomplete request_queue initialization

 block/blk-mq-sysfs.c  |  9 +
 block/blk-sysfs.c | 18 +++---
 block/genhd.c | 23 +++
 drivers/md/dm-rq.c|  9 -
 drivers/md/dm.c   | 11 ++-
 include/linux/genhd.h |  5 +
 6 files changed, 50 insertions(+), 25 deletions(-)
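
The intended driver-side pattern for the deferred queue registration described above
is, in sketch form (generic names; the DM-specific wiring lives in the dm-4.16
commits):

	add_disk_no_queue_reg(disk);	/* gendisk becomes visible, but the queue
					 * is not registered in sysfs yet */

	/* ... finish request_queue initialization (elevator, blk-mq setup, ...) ... */

	blk_register_queue(disk);	/* now expose /sys/block/<disk>/queue */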


4.16 genirq change prevents HP servers from booting [was: Re: linux-next: Signed-off-by missing for commit in the device-mapper tree]

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at  8:27am -0500,
Stephen Rothwell  wrote:

> Hi all,
> 
> Commit
> 
>   34e1467da673 ("Revert "genirq/affinity: assign vectors to all possible 
> CPUs"")
> 
> is missing a Signed-off-by from its author and committer.
> 
> Reverts are commits as well.

Right, I'm aware.  I staged the tree that made some HP servers finally
work with the latest linux-block 4.16 changes.  Without thinking about
the broader implications.  Anyway, I'll drop the revert from
linux-dm.git's 'for-next'.

Because I'm confident others will hunt down the irq issues.

I think Ming was looking to grab the queue mapping info and CPU related
info from the affected server.

> Though I do note it actually has a reasonable commit message, thanks.

Semi-reasonable.  Lacks detail.  The issue is that over the weekend
Laurence found linux-block.git commit 84676c1f21e8ff54befe98 prevents
some HP servers from booting.  They'd hang when trying to initialize
their HPSA controller's devices, e.g.:

[  246.751050] INFO: task systemd-udevd:411 blocked for more than 120
seconds.
[  246.791852]   Tainted: G  I  4.15.0-rc4.block.dm.4.16+ #1
[  246.830650] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[  246.874637] systemd-udevd   D0   411408 0x8004
[  246.904934] Call Trace:
[  246.918191]  ? __schedule+0x28d/0x870
[  246.937643]  ? _cond_resched+0x15/0x30
[  246.958222]  schedule+0x32/0x80
[  246.975424]  async_synchronize_cookie_domain+0x8b/0x140
[  247.004452]  ? remove_wait_queue+0x60/0x60
[  247.027335]  do_init_module+0xbe/0x219
[  247.048022]  load_module+0x21d6/0x2910
[  247.069436]  ? m_show+0x1c0/0x1c0
[  247.087999]  SYSC_finit_module+0x94/0xe0
[  247.110392]  entry_SYSCALL_64_fastpath+0x1a/0x7d
[  247.136669] RIP: 0033:0x7f84049287f9
[  247.156112] RSP: 002b:7ffd13199ab8 EFLAGS: 0246 ORIG_RAX:
0139
[  247.196883] RAX: ffda RBX: 55b712b59e80 RCX:
7f84049287f9
[  247.237989] RDX:  RSI: 7f8405245099 RDI:
0008
[  247.279105] RBP: 7f8404bf2760 R08:  R09:
55b712b45760
[  247.320005] R10: 0008 R11: 0246 R12:
0020
[  247.360625] R13: 7f8404bf2818 R14: 0050 R15:
7f8404bf27b8
[  247.401062] INFO: task scsi_eh_0:471 blocked for more than 120 seconds.
[  247.438161]   Tainted: G  I  4.15.0-rc4.block.dm.4.16+ #1
[  247.476640] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[  247.520700] scsi_eh_0   D0   471  2 0x8000
[  247.551339] Call Trace:
[  247.564360]  ? __schedule+0x28d/0x870
[  247.584720]  schedule+0x32/0x80
[  247.601294]  hpsa_eh_device_reset_handler+0x68c/0x700 [hpsa]
[  247.633358]  ? remove_wait_queue+0x60/0x60
[  247.656345]  scsi_try_bus_device_reset+0x27/0x40
[  247.682424]  scsi_eh_ready_devs+0x53f/0xe20
[  247.706467]  ? __pm_runtime_resume+0x55/0x70
[  247.730327]  scsi_error_handler+0x434/0x5e0
[  247.754387]  ? __schedule+0x295/0x870
[  247.775420]  kthread+0xf5/0x130
[  247.793461]  ? scsi_eh_get_sense+0x240/0x240
[  247.818008]  ? kthread_associate_blkcg+0x90/0x90
[  247.844759]  ret_from_fork+0x1f/0x30
[  247.865440] INFO: task scsi_id:488 blocked for more than 120 seconds.
[  247.901112]   Tainted: G  I  4.15.0-rc4.block.dm.4.16+ #1
[  247.938743] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[  247.981092] scsi_id D0   488  1 0x0004
[  248.010535] Call Trace:
[  248.023567]  ? __schedule+0x28d/0x870
[  248.044236]  ? __switch_to+0x1f5/0x460
[  248.065776]  schedule+0x32/0x80
[  248.084238]  schedule_timeout+0x1d4/0x2f0
[  248.106184]  wait_for_completion+0x123/0x190
[  248.130759]  ? wake_up_q+0x70/0x70
[  248.150295]  flush_work+0x119/0x1a0
[  248.169238]  ? wake_up_worker+0x30/0x30
[  248.189670]  __cancel_work_timer+0x103/0x190
[  248.213751]  ? kobj_lookup+0x10b/0x160
[  248.235441]  disk_block_events+0x6f/0x90
[  248.257820]  __blkdev_get+0x6a/0x480
[  248.278770]  ? bd_acquire+0xd0/0xd0
[  248.298438]  blkdev_get+0x1a5/0x300
[  248.316587]  ? bd_acquire+0xd0/0xd0
[  248.334814]  do_dentry_open+0x202/0x320
[  248.354372]  ? security_inode_permission+0x3c/0x50
[  248.378818]  path_openat+0x537/0x12c0
[  248.397386]  ? vm_insert_page+0x1e0/0x1f0
[  248.417664]  ? vvar_fault+0x75/0x140
[  248.435811]  do_filp_open+0x91/0x100
[  248.454061]  do_sys_open+0x126/0x210
[  248.472462]  entry_SYSCALL_64_fastpath+0x1a/0x7d
[  248.495438] RIP: 0033:0x7f39e60e1e90
[  248.513136] RSP: 002b:7ffc4c906ba8 EFLAGS: 0246 ORIG_RAX:
0002
[  248.550726] RAX: ffda RBX: 5624aead3010 RCX:
7f39e60e1e90
[  248.586207] RDX: 7f39e60cc0c4 RSI: 00080800 RDI:
7ffc4c906ed0
[  248.622411] RBP: 7ffc4c906b60 R08: 7f39e60cc140 R09:
7f39e60cc140
[  248.658704] R10: 001f R11: 0246 R12:
7ffc4c906ed0

Re: [PATCH 23/27] drbd: make intelligent use of blkdev_issue_zeroout

2018-01-15 Thread Mike Snitzer
On Mon, Jan 15 2018 at  7:46am -0500,
Lars Ellenberg  wrote:
 
> As I understood it,
> blkdev_issue_zeroout() was supposed to "always try to unmap",
> deprovision, the relevant region, and zero-out any unaligned
> head or tail, just like my work around above was doing.
> 
> And that device mapper thin was "about to" learn this, "soon",
> or maybe block core would do the equivalent of my workaround
> described above.
> 
> But it then did not.
> 
> See also:
> https://www.redhat.com/archives/dm-devel/2017-March/msg00213.html
> https://www.redhat.com/archives/dm-devel/2017-March/msg00226.html

Right, now that you mention it it is starting to ring a bell (especially
after I read your 2nd dm-devel archive url above).

> I then did not follow this closely enough anymore,
> and I missed that with recent enough kernel,
> discard on DRBD on dm-thin would fully allocate.
> 
> In our out-of-tree module, we had to keep the older code for
> compat reasons, anyways. I will just re-enable our zeroout
> workaround there again.
> 
> In tree, either dm-thin learns to do REQ_OP_WRITE_ZEROES "properly",
> so the result in this scenario is what we expect:
> 
>   _: unprovisioned, not allocated, returns zero on read anyways
>   *: provisioned, some arbitrary data
>   0: explicitly zeroed:
> 
>   |gran|ular|ity ||||
>   |||||
>  to|-be-|zero|ed
>   |**00|||00**|
> 
> (leave unallocated blocks alone,
>  de-allocate full blocks just like with discard,
>  explicitly zero unaligned head and tail)

"de-allocate full blocks just like with discard" is an interesting take
what it means for dm-thin to handle REQ_OP_WRITE_ZEROES "properly".

> Or DRBD will have to resurrect that reinvented zeroout again,
> with exactly those semantics. I did reinvent it for a reason ;)

Yeah, I now recall dropping that line of development because it
became "hard" (or at least harder than originally thought).

Don't people use REQ_OP_WRITE_ZEROES to initialize a portion of the
disk?  E.g. zeroing superblocks, metadata areas, or whatever?

If we just discarded the logical extent and then a user did a partial
write to the block, areas that a user might expect to be zeroed wouldn't
be (at least in the case of dm-thinp if "skip_block_zeroing" is
enabled).  And yes if discard passdown is enabled and the device's
discard implementation does "discard_zeroes_data" then it'd be
fine.. but there are a lot of things that need to line up for drbd's
REQ_OP_WRITE_ZEROES to "just work" (as it expects).

(now I'm just echoing the kinds of concerns I had in that 2nd dm-devel
post above).

This post from mkp is interesting:
https://www.redhat.com/archives/dm-devel/2017-March/msg00228.html

Specifically:
"You don't have a way to mark those blocks as being full of zeroes
without actually writing them?

Note that the fallback to a zeroout command is to do a regular write. So
if DM doesn't zero the blocks, the block layer is going to do it."

No, dm-thinp doesn't have an easy way to mark an allocated block as
containing zeroes (without actually zeroing).  I toyed with adding that
but then realized that even if we had it it'd still require block
zeroing be enabled.  But block zeroing is done at allocation time.  So
we'd need to interpret the "this block is zeroes" flag to mean "on first
write or read to this block it needs to first zero it".  Fugly to say
the least...

I've been quite busy with other things but I can revisit all this with
Joe Thornber and see what we come up with after a 2nd discussion.

But sadly, in general, this is a low priority for me, so you might do
well to reintroduce your drbd workaround.. sorry about that :(

Mike
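
The head/middle/tail split Lars describes could be computed along these lines; this
is purely illustrative (it assumes a power-of-two granularity so round_up() and
round_down() apply, and the helper name is made up):

	/*
	 * Split a zeroout range against the provisioning granularity: unaligned
	 * head and tail need explicit zeroing, the aligned middle can be
	 * deallocated just like a discard.
	 */
	static void split_zeroout(sector_t start, sector_t nr_sects, sector_t gran,
				  sector_t *head, sector_t *middle, sector_t *tail)
	{
		sector_t first = round_up(start, gran);
		sector_t last = round_down(start + nr_sects, gran);

		if (first >= last) {
			/* range never covers a whole block: zero it all explicitly */
			*head = nr_sects;
			*middle = *tail = 0;
			return;
		}
		*head = first - start;			/* explicit zeroing */
		*middle = last - first;			/* deallocate / discard */
		*tail = start + nr_sects - last;	/* explicit zeroing */
	}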


[GIT PULL] block changes to improve device mapper for 4.16

2018-01-14 Thread Mike Snitzer
Hi Jens,

I prepared this pull request in the hope that it may help you review and
stage these changes for 4.16.

I went over Ming's changes again to refine the headers and code comments
for clarity to help ease review and inclusion.

I've done extensive testing of the changes in this pull request in
combination with all the dm-4.16 changes I'll be sending to Linus, using
this tree (which gets pulled into linux-next, I wanted some coverage):
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=for-next
I'll obviously drop these changes from dm's linux-next once you pick
them up.

FYI, this dm-4.16 commit addresses the concerns Bart raised last week
against Ming's dm-mpath changes:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16=7a9d12664a183c4a1e2fa86b0a9206006b95138e

The following changes since commit bf9ae8c5325c0070d0ec81a849bba8d156f65993:

  blk-mq: fix bad clear of RQF_MQ_INFLIGHT in blk_mq_ct_ctx_init() (2018-01-14 
10:46:24 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git 
tags/for-block-4.16/dm

for you to fetch changes up to 0e94dfab56071a7afd0a9de9b8f6a0535148fb36:

  blk-mq: issue request directly for blk_insert_cloned_request (2018-01-14 
13:00:02 -0500)

Please pull, thanks!
Mike


- Small correctness fix in del_gendisk() if GENHD_FL_HIDDEN is used.

- Cleanup blk_unregister_queue() to more precisely protect against
  concurrent sysfs changes, blk_mq_unregister_dev() now requires caller
  to hold q->sysfs_lock (blk_unregister_queue is the only caller).

- Introduce add_disk() variant, add_disk_no_queue_reg(), that allows the
  gendisk to be registered but the associated disk->queue's
  blk_register_queue() is left for the driver to do once its
  request_queue is fully initialized.  Fixes long-standing DM
  request_queue initialization issues.

- Ming's blk-mq improvements to blk_insert_cloned_request(), which is
  used exclusively by request-based DM's blk-mq mode, that enable
  substantial dm-mpath sequential IO performance improvements.

----
Mike Snitzer (4):
  block: only bdi_unregister() in del_gendisk() if !GENHD_FL_HIDDEN
  block: properly protect the 'queue' kobj in blk_unregister_queue
  block: allow gendisk's request_queue registration to be deferred
  dm: fix incomplete request_queue initialization

Ming Lei (3):
  blk-mq: move actual issue into __blk_mq_issue_req helper
  blk-mq: return dispatch result from blk_mq_try_issue_directly
  blk-mq: issue request directly for blk_insert_cloned_request

 block/blk-core.c  |  3 +-
 block/blk-mq-sysfs.c  |  9 +-
 block/blk-mq.c| 86 +++
 block/blk-mq.h|  3 ++
 block/blk-sysfs.c | 18 +--
 block/genhd.c | 23 +++---
 drivers/md/dm-rq.c| 28 ++---
 drivers/md/dm.c   | 11 ++-
 include/linux/genhd.h |  5 +++
 9 files changed, 136 insertions(+), 50 deletions(-)


Re: blk-mq: don't clear RQF_MQ_INFLIGHT in blk_mq_rq_ctx_init()

2018-01-14 Thread Mike Snitzer
On Sun, Jan 14 2018 at 12:38pm -0500,
Jens Axboe <ax...@kernel.dk> wrote:

> On 1/13/18 10:49 AM, Ming Lei wrote:
> > In case of no IO scheduler, RQF_MQ_INFLIGHT is set in blk_mq_rq_ctx_init(),
> > but 7c3fb70f0341 clears it mistakenly, so fix it.
> 
> Oops yeah, that's my bad. However, I think the below fix is cleaner
> and avoids a conditional.
> 
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b3b2003b7429..c8f62e6be6b6 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -269,13 +269,14 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>  {
>   struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
>   struct request *rq = tags->static_rqs[tag];
> + req_flags_t rq_flags = 0;
>  
>   if (data->flags & BLK_MQ_REQ_INTERNAL) {
>   rq->tag = -1;
>   rq->internal_tag = tag;
>   } else {
>   if (blk_mq_tag_busy(data->hctx)) {
> - rq->rq_flags = RQF_MQ_INFLIGHT;
> + rq_flags = RQF_MQ_INFLIGHT;
>   atomic_inc(&data->hctx->nr_active);
>   }
>   rq->tag = tag;
> @@ -286,7 +287,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>   /* csd/requeue_work/fifo_time is initialized before use */
>   rq->q = data->q;
>   rq->mq_ctx = data->ctx;
> - rq->rq_flags = 0;
> + rq->rq_flags = rq_flags;
>   rq->cpu = -1;
>   rq->cmd_flags = op;
>   if (data->flags & BLK_MQ_REQ_PREEMPT)

Yeap, looks good.

Reviewed-by:  Mike Snitzer <snit...@redhat.com>

Definitely need this fix; without it all my dm-mpath testing hangs.

Thanks,
Mike


Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-13 Thread Mike Snitzer
On Fri, Jan 12 2018 at  8:37pm -0500,
Mike Snitzer <snit...@redhat.com> wrote:

> On Fri, Jan 12 2018 at  8:00pm -0500,
> Bart Van Assche <bart.vanass...@wdc.com> wrote:
> 
> > On Fri, 2018-01-12 at 19:52 -0500, Mike Snitzer wrote:
> > > It was 50 ms before it was 100 ms.  No real explaination for these
> > > values other than they seem to make Bart's IB SRP testbed happy?
> > 
> > But that constant was not introduced by me in the dm code.
> 
> No actually it was (not that there's anything wrong with that):
> 
> commit 06eb061f48594aa369f6e852b352410298b317a8
> Author: Bart Van Assche <bart.vanass...@sandisk.com>
> Date:   Fri Apr 7 16:50:44 2017 -0700
> 
> dm mpath: requeue after a small delay if blk_get_request() fails
> 
> If blk_get_request() returns ENODEV then multipath_clone_and_map()
> causes a request to be requeued immediately. This can cause a kworker
> thread to spend 100% of the CPU time of a single core in
> __blk_mq_run_hw_queue() and also can cause device removal to never
> finish.
> 
> Avoid this by only requeuing after a delay if blk_get_request() fails.
> Additionally, reduce the requeue delay.
> 
> Cc: sta...@vger.kernel.org # 4.9+
> Signed-off-by: Bart Van Assche <bart.vanass...@sandisk.com>
> Signed-off-by: Mike Snitzer <snit...@redhat.com>
> 
> Note that this commit actually details a different case where a
> blk_get_request() (in existing code) return of -ENODEV is a very
> compelling case to use DM_MAPIO_DELAY_REQUEUE.
> 
> SO I'll revisit what is appropriate in multipath_clone_and_map() on
> Monday.

Sleep helped.  I had another look and it is only the old .request_fn
blk_get_request() code that even sets -ENODEV (if blk_queue_dying).

But thankfully the blk_get_request() error handling in
multipath_clone_and_map() checks for blk_queue_dying() and will return
DM_MAPIO_DELAY_REQUEUE.

So we're all set for this case.

Mike
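
In sketch form, the error path being referred to looks roughly like this (simplified;
the real multipath_clone_and_map() also handles path selection and pg_init, so treat
the details as assumptions):

	clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
	if (IS_ERR(clone)) {
		if (blk_queue_dying(q))
			/* back off instead of hammering a dying queue */
			return DM_MAPIO_DELAY_REQUEUE;
		/* no request available right now: requeue on the DM queue */
		return DM_MAPIO_REQUEUE;
	}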

