Re: blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy

2018-01-17 Thread Laurence Oberman
On Wed, 2018-01-17 at 23:36 -0500, Mike Snitzer wrote:
> On Wed, Jan 17 2018 at 11:06pm -0500,
> Ming Lei  wrote:
> 
> > If we run into blk_mq_request_direct_issue() when the queue is busy,
> > we don't want to dispatch this request into hctx->dispatch_list;
> > what we need to do is return the queue-busy status to the caller, so
> > that the caller can deal with it properly.
> > 
> > Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via
> > blk_insert_cloned_request feedback")
> > Reported-by: Laurence Oberman 
> > Reviewed-by: Mike Snitzer 
> > Signed-off-by: Ming Lei 
> > ---
> >  block/blk-mq.c | 22 ++
> >  1 file changed, 10 insertions(+), 12 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 4d4af8d712da..1af7fa70993b 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1856,15 +1856,6 @@ static blk_status_t
> > __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
> >     return ret;
> >  }
> >  
> > -static void __blk_mq_fallback_to_insert(struct request *rq,
> > -   bool run_queue, bool
> > bypass_insert)
> > -{
> > -   if (!bypass_insert)
> > -   blk_mq_sched_insert_request(rq, false, run_queue,
> > false);
> > -   else
> > -   blk_mq_request_bypass_insert(rq, run_queue);
> > -}
> > -
> >  static blk_status_t __blk_mq_try_issue_directly(struct
> > blk_mq_hw_ctx *hctx,
> >     struct request
> > *rq,
> >     blk_qc_t *cookie,
> > @@ -1873,9 +1864,16 @@ static blk_status_t
> > __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> >     struct request_queue *q = rq->q;
> >     bool run_queue = true;
> >  
> > -   /* RCU or SRCU read lock is needed before checking
> > quiesced flag */
> > +   /*
> > +    * RCU or SRCU read lock is needed before checking
> > quiesced flag.
> > +    *
> > +    * When queue is stopped or quiesced, ignore
> > 'bypass_insert' from
> > +    * blk_mq_request_direct_issue(), and return BLK_STS_OK to
> > caller,
> > +    * and avoid driver to try to dispatch again.
> > +    */
> >     if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
> >     run_queue = false;
> > +   bypass_insert = false;
> >     goto insert;
> >     }
> >  
> > @@ -1892,10 +1890,10 @@ static blk_status_t
> > __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> >  
> >     return __blk_mq_issue_directly(hctx, rq, cookie);
> >  insert:
> > -   __blk_mq_fallback_to_insert(rq, run_queue, bypass_insert);
> >     if (bypass_insert)
> >     return BLK_STS_RESOURCE;
> >  
> > +   blk_mq_sched_insert_request(rq, false, run_queue, false);
> >     return BLK_STS_OK;
> >  }
> 
> OK, so you're just leveraging blk_mq_sched_insert_request()'s
> ability to resort to __blk_mq_insert_request() if !q->elevator.

I tested this against Mike's latest combined tree and it's stable.
This fixes the list corruption issue.
Many Thanks Ming and Mike.

I will apply it to Bart's latest SRP/SRPT tree tomorrow as it's very
late here, but it will clearly fix the issue in Bart's tree too.

Tested-by: Laurence Oberman 



Re: [PATCH v2 06/12] bcache: set error_limit correctly

2018-01-17 Thread tang . junhui
From: Tang Junhui 

Hello Coly:

>It is because of ca->set->error_decay. When error_decay is set, bcache
>tries to do an exponential decay of the error count. That is, the error
>count decays against the quantity of I/O; this is to avoid errors
>accumulated over a long time exceeding error_limit and triggering
>bch_cache_set_error().
>
>The I/O error counter uses the most significant 12 bits to record the
>real error count. Too many I/Os may decay the error count too fast, so
>it is left shifted by 20 bits to make sure there are still enough
>errors left after a while (e.g. the half-life).
>
>If we don't use the left shifting, when error_decay is set and too many
>I/Os happen, after a short time the I/O error count will be decayed too
>fast to ever reach error_limit. Therefore IMHO the left shifting is
>necessary.
>
>BTW, I doubt whether the current exponential error decay in
>bch_count_io_errors() works properly, because I don't see the
>connection between the IO counter (ca->io_count) and the error counter
>(ca->io_errors). By default ca->set->error_decay is 0, so I doubt how
>many people use/test this piece of code. (Maybe I am wrong)

OK, LGTM.

Reviewed-by: Tang Junhui 

And ping Mike?


Thanks,
Tang Junhui
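
(A condensed sketch of the counter scheme discussed above, for readers who
don't have the bcache source at hand. It is paraphrased, not copied, from
bch_count_io_errors(); the helper names here are made up, and how
error_limit itself is scaled against the shifted counter is exactly what
this patch adjusts, so the final comparison is only illustrative.)

#define IO_ERROR_SHIFT  20      /* real error count lives in the top 12 bits */

/* One decay step scales the whole counter by 127/128.  Because a single
 * real error is recorded as 1 << 20, it survives many decay steps; if
 * errors were counted as plain 1s, 1 * 127 / 128 == 0 and one error
 * would vanish after a single step. */
static void decay_io_errors(atomic_t *io_errors)
{
        unsigned int old, new;

        do {
                old = atomic_read(io_errors);
                new = ((u64)old * 127) / 128;
        } while (atomic_cmpxchg(io_errors, old, new) != old);
}

/* Recording an error adds 1 << 20; shifting back down recovers the whole
 * number of errors to compare against error_limit (illustrative only). */
static bool too_many_io_errors(struct cache *ca)
{
        unsigned int errors = atomic_add_return(1 << IO_ERROR_SHIFT,
                                                &ca->io_errors);

        return (errors >> IO_ERROR_SHIFT) >= ca->set->error_limit;
}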


Re: [PATCH] blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy

2018-01-17 Thread Jens Axboe
On 1/17/18 9:06 PM, Ming Lei wrote:
> If we run into blk_mq_request_direct_issue() when the queue is busy, we
> don't want to dispatch this request into hctx->dispatch_list;
> what we need to do is return the queue-busy status to the caller, so
> that the caller can deal with it properly.

Thanks for fixing this up, applied.

-- 
Jens Axboe



Re: blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy

2018-01-17 Thread Mike Snitzer
On Wed, Jan 17 2018 at 11:06pm -0500,
Ming Lei  wrote:

> If we run into blk_mq_request_direct_issue() when the queue is busy, we
> don't want to dispatch this request into hctx->dispatch_list;
> what we need to do is return the queue-busy status to the caller, so
> that the caller can deal with it properly.
> 
> Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via 
> blk_insert_cloned_request feedback")
> Reported-by: Laurence Oberman 
> Reviewed-by: Mike Snitzer 
> Signed-off-by: Ming Lei 
> ---
>  block/blk-mq.c | 22 ++
>  1 file changed, 10 insertions(+), 12 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4d4af8d712da..1af7fa70993b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1856,15 +1856,6 @@ static blk_status_t __blk_mq_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   return ret;
>  }
>  
> -static void __blk_mq_fallback_to_insert(struct request *rq,
> - bool run_queue, bool bypass_insert)
> -{
> - if (!bypass_insert)
> - blk_mq_sched_insert_request(rq, false, run_queue, false);
> - else
> - blk_mq_request_bypass_insert(rq, run_queue);
> -}
> -
>  static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>   struct request *rq,
>   blk_qc_t *cookie,
> @@ -1873,9 +1864,16 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   struct request_queue *q = rq->q;
>   bool run_queue = true;
>  
> - /* RCU or SRCU read lock is needed before checking quiesced flag */
> + /*
> +  * RCU or SRCU read lock is needed before checking quiesced flag.
> +  *
> +  * When queue is stopped or quiesced, ignore 'bypass_insert' from
> +  * blk_mq_request_direct_issue(), and return BLK_STS_OK to caller,
> +  * and avoid driver to try to dispatch again.
> +  */
>   if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
>   run_queue = false;
> + bypass_insert = false;
>   goto insert;
>   }
>  
> @@ -1892,10 +1890,10 @@ static blk_status_t 
> __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>  
>   return __blk_mq_issue_directly(hctx, rq, cookie);
>  insert:
> - __blk_mq_fallback_to_insert(rq, run_queue, bypass_insert);
>   if (bypass_insert)
>   return BLK_STS_RESOURCE;
>  
> + blk_mq_sched_insert_request(rq, false, run_queue, false);
>   return BLK_STS_OK;
>  }

OK, so you're just leveraging blk_mq_sched_insert_request()'s
ability to resort to __blk_mq_insert_request() if !q->elevator.
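
For readers following the thread: a minimal sketch of the fallback being
referred to, paraphrased from the 4.15-era blk_mq_sched_insert_request()
(flush handling, BLK_MQ_F_BLOCKING, and the ctx/hctx lookup details are
simplified, so treat the exact calls as an approximation rather than the
real source):

/* Sketch: with an elevator the request goes to the scheduler; without
 * one it is inserted straight into the software queue via
 * __blk_mq_insert_request(), which is what the patch now relies on. */
void blk_mq_sched_insert_request(struct request *rq, bool at_head,
                                 bool run_queue, bool async)
{
        struct request_queue *q = rq->q;
        struct elevator_queue *e = q->elevator;
        struct blk_mq_ctx *ctx = rq->mq_ctx;
        struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, ctx->cpu);

        if (e && e->type->ops.mq.insert_requests) {
                LIST_HEAD(list);

                list_add(&rq->queuelist, &list);
                e->type->ops.mq.insert_requests(hctx, &list, at_head);
        } else {
                spin_lock(&ctx->lock);
                __blk_mq_insert_request(hctx, rq, at_head);
                spin_unlock(&ctx->lock);
        }

        if (run_queue)
                blk_mq_run_hw_queue(hctx, async);
}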


[PATCH] blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy

2018-01-17 Thread Ming Lei
If we run into blk_mq_request_direct_issue() when the queue is busy, we
don't want to dispatch this request into hctx->dispatch_list;
what we need to do is return the queue-busy status to the caller, so
that the caller can deal with it properly.

Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via 
blk_insert_cloned_request feedback")
Reported-by: Laurence Oberman 
Reviewed-by: Mike Snitzer 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 22 ++
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4d4af8d712da..1af7fa70993b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1856,15 +1856,6 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
return ret;
 }
 
-static void __blk_mq_fallback_to_insert(struct request *rq,
-   bool run_queue, bool bypass_insert)
-{
-   if (!bypass_insert)
-   blk_mq_sched_insert_request(rq, false, run_queue, false);
-   else
-   blk_mq_request_bypass_insert(rq, run_queue);
-}
-
 static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
struct request *rq,
blk_qc_t *cookie,
@@ -1873,9 +1864,16 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
struct request_queue *q = rq->q;
bool run_queue = true;
 
-   /* RCU or SRCU read lock is needed before checking quiesced flag */
+   /*
+* RCU or SRCU read lock is needed before checking quiesced flag.
+*
+* When queue is stopped or quiesced, ignore 'bypass_insert' from
+* blk_mq_request_direct_issue(), and return BLK_STS_OK to caller,
+* and avoid driver to try to dispatch again.
+*/
if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
run_queue = false;
+   bypass_insert = false;
goto insert;
}
 
@@ -1892,10 +1890,10 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
return __blk_mq_issue_directly(hctx, rq, cookie);
 insert:
-   __blk_mq_fallback_to_insert(rq, run_queue, bypass_insert);
if (bypass_insert)
return BLK_STS_RESOURCE;
 
+   blk_mq_sched_insert_request(rq, false, run_queue, false);
return BLK_STS_OK;
 }
 
@@ -1911,7 +1909,7 @@ static void blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
if (ret == BLK_STS_RESOURCE)
-   __blk_mq_fallback_to_insert(rq, true, false);
+   blk_mq_sched_insert_request(rq, false, true, false);
else if (ret != BLK_STS_OK)
blk_mq_end_request(rq, ret);
 
-- 
2.9.5
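
(blk_mq_request_direct_issue(), which this patch and the whole thread refer
to but whose body is not quoted in any of the hunks above, was added by
commit 396eaf21ee as a thin wrapper that calls __blk_mq_try_issue_directly()
with bypass_insert == true.  Roughly, with the RCU/SRCU hctx locking details
elided, it looks like the sketch below -- an approximation, not the exact
source:)

blk_status_t blk_mq_request_direct_issue(struct request *rq)
{
        blk_status_t ret;
        blk_qc_t unused_cookie;
        struct blk_mq_ctx *ctx = rq->mq_ctx;
        struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);

        /* hctx_lock()/hctx_unlock() around the call are omitted here */
        ret = __blk_mq_try_issue_directly(hctx, rq, &unused_cookie, true);

        return ret;
}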



Re: blk mq: don't blk_mq_request_bypass_insert _and_ return BLK_STS_RESOURCE

2018-01-17 Thread Ming Lei
On Wed, Jan 17, 2018 at 10:49:58PM -0500, Mike Snitzer wrote:
> On Wed, Jan 17 2018 at 10:39pm -0500,
> Ming Lei  wrote:
> 
> > On Wed, Jan 17, 2018 at 10:33:35PM -0500, Mike Snitzer wrote:
> > > On Wed, Jan 17 2018 at  7:54P -0500,
> > > Mike Snitzer  wrote:
> > >  
> > > > But sure, I suppose there is something I missed when refactoring Ming's
> > > > change to get it acceptable for upstream.  I went over the mechanical
> > > > nature of what I did many times (comparing Ming's v4 to my v5).
> > > 
> > > And yes there is one subtlety that I missed.
> > > 
> > > > The call to blk_mq_request_bypass_insert will only occur via
> > > > __blk_mq_fallback_to_insert, which, as the name implies, is not the
> > > > fast path.  This will occur if the underlying blk-mq device cannot get
> > > > resources it needs in order to issue the request.  Specifically: if/when
> > > > in __blk_mq_try_issue_directly() the hctx is stopped, or queue is
> > > > quiesced, or it cannot get the driver tag or dispatch_budget (in the
> > > > case of scsi-mq).
> > > > 
> > > > The same fallback, via call to blk_mq_request_bypass_insert, occurred
> > > > with Ming's v4 though.
> > > 
> > > Turns out Ming's v4 doesn't fallback to insert for the "or it cannot get
> > > the driver tag or dispatch_budget" case.
> > > 
> > > This patch should fix it (Laurence, please report back on if this fixes
> > > your list_add corruption, pretty sure it will):
> > > 
> > > From: Mike Snitzer 
> > > Date: Wed, 17 Jan 2018 22:02:07 -0500
> > > Subject: [PATCH] blk mq: don't blk_mq_request_bypass_insert _and_ return 
> > > BLK_STS_RESOURCE
> > > 
> > > It isn't ever valid to call blk_mq_request_bypass_insert() and return
> > > BLK_STS_RESOURCE.
> > > 
> > > Unfortunately after commit 396eaf21ee ("blk-mq: improve DM's blk-mq IO
> > > merging via blk_insert_cloned_request feedback") we do just that if
> > > blk_mq_request_direct_issue() cannot get the resources (driver_tag or
> > > dispatch_budget) needed to directly issue a request.  This will lead to
> > > "list_add corruption" because blk-mq submits the IO but then reports
> > > that it didn't (BLK_STS_RESOURCE in this case).
> > > 
> > > Fix this by simply returning BLK_STS_RESOURCE for this case.
> > > 
> > > Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via 
> > > blk_insert_cloned_request feedback")
> > > Reported-by: Laurence Oberman 
> > > Signed-off-by: Mike Snitzer 
> > > ---
> > >  block/blk-mq.c | 12 +---
> > >  1 file changed, 5 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > index c418858a60ef..8bee37239255 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -1799,20 +1799,18 @@ static blk_status_t 
> > > __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> > >   if (q->elevator && !bypass_insert)
> > >   goto insert;
> > >  
> > > - if (!blk_mq_get_driver_tag(rq, NULL, false))
> > > - goto insert;
> > > -
> > > - if (!blk_mq_get_dispatch_budget(hctx)) {
> > > + if (!blk_mq_get_driver_tag(rq, NULL, false) ||
> > > + !blk_mq_get_dispatch_budget(hctx)) {
> > > + /* blk_mq_put_driver_tag() is idempotent */
> > >   blk_mq_put_driver_tag(rq);
> > > + if (bypass_insert)
> > > + return BLK_STS_RESOURCE;
> > >   goto insert;
> > >   }
> > >  
> > >   return __blk_mq_issue_directly(hctx, rq, cookie);
> > >  insert:
> > >   __blk_mq_fallback_to_insert(rq, run_queue, bypass_insert);
> > > - if (bypass_insert)
> > > - return BLK_STS_RESOURCE;
> > > -
> > >   return BLK_STS_OK;
> > >  }
> > 
> > Hi Mike,
> > 
> > I'd suggest to use the following one, which is simple and clean:
> > 
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 4d4af8d712da..816ff5d6bc88 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1856,15 +1856,6 @@ static blk_status_t __blk_mq_issue_directly(struct 
> > blk_mq_hw_ctx *hctx,
> > return ret;
> >  }
> >  
> > -static void __blk_mq_fallback_to_insert(struct request *rq,
> > -   bool run_queue, bool bypass_insert)
> > -{
> > -   if (!bypass_insert)
> > -   blk_mq_sched_insert_request(rq, false, run_queue, false);
> > -   else
> > -   blk_mq_request_bypass_insert(rq, run_queue);
> > -}
> > -
> >  static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> > struct request *rq,
> > blk_qc_t *cookie,
> > @@ -1892,10 +1883,10 @@ static blk_status_t 
> > __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> >  
> > return __blk_mq_issue_directly(hctx, rq, cookie);
> >  insert:
> > -   __blk_mq_fallback_to_insert(rq, run_queue, bypass_insert);
> > if (bypass_insert)
> > return BLK_STS_RESOURCE;
> >  
> > +   

Re: blk mq: don't blk_mq_request_bypass_insert _and_ return BLK_STS_RESOURCE

2018-01-17 Thread Mike Snitzer
On Wed, Jan 17 2018 at 10:39pm -0500,
Ming Lei  wrote:

> On Wed, Jan 17, 2018 at 10:33:35PM -0500, Mike Snitzer wrote:
> > On Wed, Jan 17 2018 at  7:54P -0500,
> > Mike Snitzer  wrote:
> >  
> > > But sure, I suppose there is something I missed when refactoring Ming's
> > > change to get it acceptable for upstream.  I went over the mechanical
> > > nature of what I did many times (comparing Ming's v4 to my v5).
> > 
> > And yes there is one subtlety that I missed.
> > 
> > > The call to blk_mq_request_bypass_insert will only occur via
> > > __blk_mq_fallback_to_insert, which, as the name implies, is not the
> > > fast path.  This will occur if the underlying blk-mq device cannot get
> > > resources it needs in order to issue the request.  Specifically: if/when
> > > in __blk_mq_try_issue_directly() the hctx is stopped, or queue is
> > > quiesced, or it cannot get the driver tag or dispatch_budget (in the
> > > case of scsi-mq).
> > > 
> > > The same fallback, via call to blk_mq_request_bypass_insert, occurred
> > > with Ming's v4 though.
> > 
> > Turns out Ming's v4 doesn't fallback to insert for the "or it cannot get
> > the driver tag or dispatch_budget" case.
> > 
> > This patch should fix it (Laurence, please report back on if this fixes
> > your list_add corruption, pretty sure it will):
> > 
> > From: Mike Snitzer 
> > Date: Wed, 17 Jan 2018 22:02:07 -0500
> > Subject: [PATCH] blk mq: don't blk_mq_request_bypass_insert _and_ return 
> > BLK_STS_RESOURCE
> > 
> > It isn't ever valid to call blk_mq_request_bypass_insert() and return
> > BLK_STS_RESOURCE.
> > 
> > Unfortunately after commit 396eaf21ee ("blk-mq: improve DM's blk-mq IO
> > merging via blk_insert_cloned_request feedback") we do just that if
> > blk_mq_request_direct_issue() cannot get the resources (driver_tag or
> > dispatch_budget) needed to directly issue a request.  This will lead to
> > "list_add corruption" because blk-mq submits the IO but then reports
> > that it didn't (BLK_STS_RESOURCE in this case).
> > 
> > Fix this by simply returning BLK_STS_RESOURCE for this case.
> > 
> > Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via 
> > blk_insert_cloned_request feedback")
> > Reported-by: Laurence Oberman 
> > Signed-off-by: Mike Snitzer 
> > ---
> >  block/blk-mq.c | 12 +---
> >  1 file changed, 5 insertions(+), 7 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index c418858a60ef..8bee37239255 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1799,20 +1799,18 @@ static blk_status_t 
> > __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> > if (q->elevator && !bypass_insert)
> > goto insert;
> >  
> > -   if (!blk_mq_get_driver_tag(rq, NULL, false))
> > -   goto insert;
> > -
> > -   if (!blk_mq_get_dispatch_budget(hctx)) {
> > +   if (!blk_mq_get_driver_tag(rq, NULL, false) ||
> > +   !blk_mq_get_dispatch_budget(hctx)) {
> > +   /* blk_mq_put_driver_tag() is idempotent */
> > blk_mq_put_driver_tag(rq);
> > +   if (bypass_insert)
> > +   return BLK_STS_RESOURCE;
> > goto insert;
> > }
> >  
> > return __blk_mq_issue_directly(hctx, rq, cookie);
> >  insert:
> > __blk_mq_fallback_to_insert(rq, run_queue, bypass_insert);
> > -   if (bypass_insert)
> > -   return BLK_STS_RESOURCE;
> > -
> > return BLK_STS_OK;
> >  }
> 
> Hi Mike,
> 
> I'd suggest to use the following one, which is simple and clean:
> 
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4d4af8d712da..816ff5d6bc88 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1856,15 +1856,6 @@ static blk_status_t __blk_mq_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   return ret;
>  }
>  
> -static void __blk_mq_fallback_to_insert(struct request *rq,
> - bool run_queue, bool bypass_insert)
> -{
> - if (!bypass_insert)
> - blk_mq_sched_insert_request(rq, false, run_queue, false);
> - else
> - blk_mq_request_bypass_insert(rq, run_queue);
> -}
> -
>  static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>   struct request *rq,
>   blk_qc_t *cookie,
> @@ -1892,10 +1883,10 @@ static blk_status_t 
> __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>  
>   return __blk_mq_issue_directly(hctx, rq, cookie);
>  insert:
> - __blk_mq_fallback_to_insert(rq, run_queue, bypass_insert);
>   if (bypass_insert)
>   return BLK_STS_RESOURCE;
>  
> + blk_mq_sched_insert_request(rq, false, run_queue, false);
>   return BLK_STS_OK;
>  }
>  
> @@ -1911,7 +1902,7 @@ static void blk_mq_try_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>  
>   ret = __blk_mq_try_issue_directly(hctx, rq, cookie, 

Re: [for-4.16 PATCH v6 2/3] blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

2018-01-17 Thread Ming Lei
On Wed, Jan 17, 2018 at 10:37:23PM -0500, Mike Snitzer wrote:
> On Wed, Jan 17 2018 at 10:25pm -0500,
> Ming Lei  wrote:
> 
> > Hi Mike,
> > 
> > On Wed, Jan 17, 2018 at 11:25:57AM -0500, Mike Snitzer wrote:
> > > From: Ming Lei 
> > > 
> > > blk_insert_cloned_request() is called in the fast path of a dm-rq driver
> > > (e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
> > > blk_mq_request_bypass_insert() to directly append the request to the
> > > blk-mq hctx->dispatch_list of the underlying queue.
> > > 
> > > 1) This way isn't efficient enough because the hctx spinlock is always
> > > used.
> > > 
> > > 2) With blk_insert_cloned_request(), we completely bypass underlying
> > > queue's elevator and depend on the upper-level dm-rq driver's elevator
> > > to schedule IO.  But dm-rq currently can't get the underlying queue's
> > > dispatch feedback at all.  Without knowing whether a request was issued
> > > or not (e.g. due to underlying queue being busy) the dm-rq elevator will
> > > not be able to provide effective IO merging (as a side-effect of dm-rq
> > > currently blindly destaging a request from its elevator only to requeue
> > > it after a delay, which kills any opportunity for merging).  This
> > > obviously causes very bad sequential IO performance.
> > > 
> > > Fix this by updating blk_insert_cloned_request() to use
> > > blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
> > > request to be issued directly to the underlying queue and returns the
> > > dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
> > > returns BLK_SYS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
> > > to _not_ destage the request.  Whereby preserving the opportunity to
> > > merge IO.
> > > 
> > > With this, request-based DM's blk-mq sequential IO performance is vastly
> > > improved (as much as 3X in mpath/virtio-scsi testing).
> > > 
> > > Signed-off-by: Ming Lei 
> > > [blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
> > > they were refactored to make them less fragile and easier to read/review]
> > > Signed-off-by: Mike Snitzer 
> ...
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > index c117c2baf2c9..f5f0d8456713 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -1731,15 +1731,19 @@ static blk_status_t 
> > > __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
> > >  
> > >  static void __blk_mq_fallback_to_insert(struct blk_mq_hw_ctx *hctx,
> > >   struct request *rq,
> > > - bool run_queue)
> > > + bool run_queue, bool bypass_insert)
> > >  {
> > > - blk_mq_sched_insert_request(rq, false, run_queue, false,
> > > - hctx->flags & BLK_MQ_F_BLOCKING);
> > > + if (!bypass_insert)
> > > + blk_mq_sched_insert_request(rq, false, run_queue, false,
> > > + hctx->flags & BLK_MQ_F_BLOCKING);
> > > + else
> > > + blk_mq_request_bypass_insert(rq, run_queue);
> > >  }
> > 
> > If 'bypass_insert' is true, we must not insert the request into
> > hctx->dispatch_list for dm-rq, since doing so causes the issue (use
> > after free) reported by Bart and Laurence.
> > 
> > Also, this way is the exact opposite of the idea of the improvement:
> > we do not want to dispatch the request if the underlying queue is busy.
> 
> Yeap, please see the patch I just posted to fix it.

Your patch looks a bit complicated; also, __blk_mq_fallback_to_insert()
can be removed entirely.

> 
> But your v4 does fallback to using blk_mq_request_bypass_insert() as
> well, just in a much narrower case -- specifically:
>if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))

Yeah, I just found it, and you can add 'bypass_insert = false' under
that condition.

-- 
Ming


Re: [PATCH] blk mq: don't blk_mq_request_bypass_insert _and_ return BLK_STS_RESOURCE

2018-01-17 Thread Ming Lei
On Wed, Jan 17, 2018 at 10:33:35PM -0500, Mike Snitzer wrote:
> On Wed, Jan 17 2018 at  7:54P -0500,
> Mike Snitzer  wrote:
>  
> > But sure, I suppose there is something I missed when refactoring Ming's
> > change to get it acceptable for upstream.  I went over the mechanical
> > nature of what I did many times (comparing Ming's v4 to my v5).
> 
> And yes there is one subtlety that I missed.
> 
> > The call to blk_mq_request_bypass_insert will only occur via
> > __blk_mq_fallback_to_insert, which, as the name implies, is not the
> > fast path.  This will occur if the underlying blk-mq device cannot get
> > resources it needs in order to issue the request.  Specifically: if/when
> > in __blk_mq_try_issue_directly() the hctx is stopped, or queue is
> > quiesced, or it cannot get the driver tag or dispatch_budget (in the
> > case of scsi-mq).
> > 
> > The same fallback, via call to blk_mq_request_bypass_insert, occurred
> > with Ming's v4 though.
> 
> Turns out Ming's v4 doesn't fallback to insert for the "or it cannot get
> the driver tag or dispatch_budget" case.
> 
> This patch should fix it (Laurence, please report back on if this fixes
> your list_add corruption, pretty sure it will):
> 
> From: Mike Snitzer 
> Date: Wed, 17 Jan 2018 22:02:07 -0500
> Subject: [PATCH] blk mq: don't blk_mq_request_bypass_insert _and_ return 
> BLK_STS_RESOURCE
> 
> It isn't ever valid to call blk_mq_request_bypass_insert() and return
> BLK_STS_RESOURCE.
> 
> Unfortunately after commit 396eaf21ee ("blk-mq: improve DM's blk-mq IO
> merging via blk_insert_cloned_request feedback") we do just that if
> blk_mq_request_direct_issue() cannot get the resources (driver_tag or
> dispatch_budget) needed to directly issue a request.  This will lead to
> "list_add corruption" because blk-mq submits the IO but then reports
> that it didn't (BLK_STS_RESOURCE in this case).
> 
> Fix this by simply returning BLK_STS_RESOURCE for this case.
> 
> Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via 
> blk_insert_cloned_request feedback")
> Reported-by: Laurence Oberman 
> Signed-off-by: Mike Snitzer 
> ---
>  block/blk-mq.c | 12 +---
>  1 file changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index c418858a60ef..8bee37239255 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1799,20 +1799,18 @@ static blk_status_t 
> __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
>   if (q->elevator && !bypass_insert)
>   goto insert;
>  
> - if (!blk_mq_get_driver_tag(rq, NULL, false))
> - goto insert;
> -
> - if (!blk_mq_get_dispatch_budget(hctx)) {
> + if (!blk_mq_get_driver_tag(rq, NULL, false) ||
> + !blk_mq_get_dispatch_budget(hctx)) {
> + /* blk_mq_put_driver_tag() is idempotent */
>   blk_mq_put_driver_tag(rq);
> + if (bypass_insert)
> + return BLK_STS_RESOURCE;
>   goto insert;
>   }
>  
>   return __blk_mq_issue_directly(hctx, rq, cookie);
>  insert:
>   __blk_mq_fallback_to_insert(rq, run_queue, bypass_insert);
> - if (bypass_insert)
> - return BLK_STS_RESOURCE;
> -
>   return BLK_STS_OK;
>  }

Hi Mike,

I'd suggest to use the following one, which is simple and clean:


diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4d4af8d712da..816ff5d6bc88 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1856,15 +1856,6 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
return ret;
 }
 
-static void __blk_mq_fallback_to_insert(struct request *rq,
-   bool run_queue, bool bypass_insert)
-{
-   if (!bypass_insert)
-   blk_mq_sched_insert_request(rq, false, run_queue, false);
-   else
-   blk_mq_request_bypass_insert(rq, run_queue);
-}
-
 static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
struct request *rq,
blk_qc_t *cookie,
@@ -1892,10 +1883,10 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
return __blk_mq_issue_directly(hctx, rq, cookie);
 insert:
-   __blk_mq_fallback_to_insert(rq, run_queue, bypass_insert);
if (bypass_insert)
return BLK_STS_RESOURCE;
 
+   blk_mq_sched_insert_request(rq, false, run_queue, false);
return BLK_STS_OK;
 }
 
@@ -1911,7 +1902,7 @@ static void blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
if (ret == BLK_STS_RESOURCE)
-   __blk_mq_fallback_to_insert(rq, true, false);
+   blk_mq_sched_insert_request(rq, false, true, false);
else if (ret != BLK_STS_OK)
blk_mq_end_request(rq, ret);
 


-- 

Re: [for-4.16 PATCH v6 2/3] blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

2018-01-17 Thread Mike Snitzer
On Wed, Jan 17 2018 at 10:25pm -0500,
Ming Lei  wrote:

> Hi Mike,
> 
> On Wed, Jan 17, 2018 at 11:25:57AM -0500, Mike Snitzer wrote:
> > From: Ming Lei 
> > 
> > blk_insert_cloned_request() is called in the fast path of a dm-rq driver
> > (e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
> > blk_mq_request_bypass_insert() to directly append the request to the
> > blk-mq hctx->dispatch_list of the underlying queue.
> > 
> > 1) This way isn't efficient enough because the hctx spinlock is always
> > used.
> > 
> > 2) With blk_insert_cloned_request(), we completely bypass underlying
> > queue's elevator and depend on the upper-level dm-rq driver's elevator
> > to schedule IO.  But dm-rq currently can't get the underlying queue's
> > dispatch feedback at all.  Without knowing whether a request was issued
> > or not (e.g. due to underlying queue being busy) the dm-rq elevator will
> > not be able to provide effective IO merging (as a side-effect of dm-rq
> > currently blindly destaging a request from its elevator only to requeue
> > it after a delay, which kills any opportunity for merging).  This
> > obviously causes very bad sequential IO performance.
> > 
> > Fix this by updating blk_insert_cloned_request() to use
> > blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
> > request to be issued directly to the underlying queue and returns the
> > dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
> > returns BLK_SYS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
> > to _not_ destage the request.  Whereby preserving the opportunity to
> > merge IO.
> > 
> > With this, request-based DM's blk-mq sequential IO performance is vastly
> > improved (as much as 3X in mpath/virtio-scsi testing).
> > 
> > Signed-off-by: Ming Lei 
> > [blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
> > they were refactored to make them less fragile and easier to read/review]
> > Signed-off-by: Mike Snitzer 
...
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index c117c2baf2c9..f5f0d8456713 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1731,15 +1731,19 @@ static blk_status_t __blk_mq_issue_directly(struct 
> > blk_mq_hw_ctx *hctx,
> >  
> >  static void __blk_mq_fallback_to_insert(struct blk_mq_hw_ctx *hctx,
> > struct request *rq,
> > -   bool run_queue)
> > +   bool run_queue, bool bypass_insert)
> >  {
> > -   blk_mq_sched_insert_request(rq, false, run_queue, false,
> > -   hctx->flags & BLK_MQ_F_BLOCKING);
> > +   if (!bypass_insert)
> > +   blk_mq_sched_insert_request(rq, false, run_queue, false,
> > +   hctx->flags & BLK_MQ_F_BLOCKING);
> > +   else
> > +   blk_mq_request_bypass_insert(rq, run_queue);
> >  }
> 
> If 'bypass_insert' is true, we must not insert the request into
> hctx->dispatch_list for dm-rq, since doing so causes the issue (use
> after free) reported by Bart and Laurence.
> 
> Also, this way is the exact opposite of the idea of the improvement:
> we do not want to dispatch the request if the underlying queue is busy.

Yeap, please see the patch I just posted to fix it.

But your v4 does fallback to using blk_mq_request_bypass_insert() as
well, just in a much narrower case -- specifically:
   if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))

Thanks,
Mike


[PATCH] blk mq: don't blk_mq_request_bypass_insert _and_ return BLK_STS_RESOURCE

2018-01-17 Thread Mike Snitzer
On Wed, Jan 17 2018 at  7:54P -0500,
Mike Snitzer  wrote:
 
> But sure, I suppose there is something I missed when refactoring Ming's
> change to get it acceptable for upstream.  I went over the mechanical
> nature of what I did many times (comparing Ming's v4 to my v5).

And yes there is one subtlety that I missed.

> The call to blk_mq_request_bypass_insert will only occur via
> __blk_mq_fallback_to_insert, which, as the name implies, is not the
> fast path.  This will occur if the underlying blk-mq device cannot get
> resources it needs in order to issue the request.  Specifically: if/when
> in __blk_mq_try_issue_directly() the hctx is stopped, or queue is
> quiesced, or it cannot get the driver tag or dispatch_budget (in the
> case of scsi-mq).
> 
> The same fallback, via call to blk_mq_request_bypass_insert, occurred
> with Ming's v4 though.

Turns out Ming's v4 doesn't fallback to insert for the "or it cannot get
the driver tag or dispatch_budget" case.

This patch should fix it (Laurence, please report back on if this fixes
your list_add corruption, pretty sure it will):

From: Mike Snitzer 
Date: Wed, 17 Jan 2018 22:02:07 -0500
Subject: [PATCH] blk mq: don't blk_mq_request_bypass_insert _and_ return 
BLK_STS_RESOURCE

It isn't ever valid to call blk_mq_request_bypass_insert() and return
BLK_STS_RESOURCE.

Unfortunately after commit 396eaf21ee ("blk-mq: improve DM's blk-mq IO
merging via blk_insert_cloned_request feedback") we do just that if
blk_mq_request_direct_issue() cannot get the resources (driver_tag or
dispatch_budget) needed to directly issue a request.  This will lead to
"list_add corruption" because blk-mq submits the IO but then reports
that it didn't (BLK_STS_RESOURCE in this case).

Fix this by simply returning BLK_STS_RESOURCE for this case.

Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via 
blk_insert_cloned_request feedback")
Reported-by: Laurence Oberman 
Signed-off-by: Mike Snitzer 
---
 block/blk-mq.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c418858a60ef..8bee37239255 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1799,20 +1799,18 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
if (q->elevator && !bypass_insert)
goto insert;
 
-   if (!blk_mq_get_driver_tag(rq, NULL, false))
-   goto insert;
-
-   if (!blk_mq_get_dispatch_budget(hctx)) {
+   if (!blk_mq_get_driver_tag(rq, NULL, false) ||
+   !blk_mq_get_dispatch_budget(hctx)) {
+   /* blk_mq_put_driver_tag() is idempotent */
blk_mq_put_driver_tag(rq);
+   if (bypass_insert)
+   return BLK_STS_RESOURCE;
goto insert;
}
 
return __blk_mq_issue_directly(hctx, rq, cookie);
 insert:
__blk_mq_fallback_to_insert(rq, run_queue, bypass_insert);
-   if (bypass_insert)
-   return BLK_STS_RESOURCE;
-
return BLK_STS_OK;
 }
 
-- 
2.15.0



Re: [for-4.16 PATCH v6 2/3] blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

2018-01-17 Thread Ming Lei
Hi Mike,

On Wed, Jan 17, 2018 at 11:25:57AM -0500, Mike Snitzer wrote:
> From: Ming Lei 
> 
> blk_insert_cloned_request() is called in the fast path of a dm-rq driver
> (e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
> blk_mq_request_bypass_insert() to directly append the request to the
> blk-mq hctx->dispatch_list of the underlying queue.
> 
> 1) This way isn't efficient enough because the hctx spinlock is always
> used.
> 
> 2) With blk_insert_cloned_request(), we completely bypass underlying
> queue's elevator and depend on the upper-level dm-rq driver's elevator
> to schedule IO.  But dm-rq currently can't get the underlying queue's
> dispatch feedback at all.  Without knowing whether a request was issued
> or not (e.g. due to underlying queue being busy) the dm-rq elevator will
> not be able to provide effective IO merging (as a side-effect of dm-rq
> currently blindly destaging a request from its elevator only to requeue
> it after a delay, which kills any opportunity for merging).  This
> obviously causes very bad sequential IO performance.
> 
> Fix this by updating blk_insert_cloned_request() to use
> blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
> request to be issued directly to the underlying queue and returns the
> dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
> returns BLK_SYS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
> to _not_ destage the request.  Whereby preserving the opportunity to
> merge IO.
> 
> With this, request-based DM's blk-mq sequential IO performance is vastly
> improved (as much as 3X in mpath/virtio-scsi testing).
> 
> Signed-off-by: Ming Lei 
> [blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
> they were refactored to make them less fragile and easier to read/review]
> Signed-off-by: Mike Snitzer 
> ---
>  block/blk-core.c   |  3 +--
>  block/blk-mq.c | 37 +
>  block/blk-mq.h |  3 +++
>  drivers/md/dm-rq.c | 19 ---
>  4 files changed, 49 insertions(+), 13 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 7ba607527487..55f338020254 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2500,8 +2500,7 @@ blk_status_t blk_insert_cloned_request(struct 
> request_queue *q, struct request *
>* bypass a potential scheduler on the bottom device for
>* insert.
>*/
> - blk_mq_request_bypass_insert(rq, true);
> - return BLK_STS_OK;
> + return blk_mq_request_direct_issue(rq);
>   }
>  
>   spin_lock_irqsave(q->queue_lock, flags);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index c117c2baf2c9..f5f0d8456713 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1731,15 +1731,19 @@ static blk_status_t __blk_mq_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>  
>  static void __blk_mq_fallback_to_insert(struct blk_mq_hw_ctx *hctx,
>   struct request *rq,
> - bool run_queue)
> + bool run_queue, bool bypass_insert)
>  {
> - blk_mq_sched_insert_request(rq, false, run_queue, false,
> - hctx->flags & BLK_MQ_F_BLOCKING);
> + if (!bypass_insert)
> + blk_mq_sched_insert_request(rq, false, run_queue, false,
> + hctx->flags & BLK_MQ_F_BLOCKING);
> + else
> + blk_mq_request_bypass_insert(rq, run_queue);
>  }

If 'bypass_insert' is true, we must not insert the request into
hctx->dispatch_list for dm-rq, since doing so causes the issue (use
after free) reported by Bart and Laurence.

Also, this way is the exact opposite of the idea of the improvement:
we do not want to dispatch the request if the underlying queue is busy.

Thanks,
Ming
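
(The dm-rq.c hunk of this patch is not shown above.  For context, here is a
hedged sketch of what the commit message describes on the dm-rq side: the
feedback from blk_insert_cloned_request()/blk_mq_request_direct_issue() lets
map_request() keep the original request in DM's elevator instead of
destaging it.  Names follow dm-rq conventions, but this is not the actual
diff.)

/* Illustrative only: on BLK_STS_RESOURCE from the direct issue, undo the
 * clone and return DM_MAPIO_REQUEUE, so the original request is requeued
 * by dm-rq and can still be merged with later IO. */
static int map_request_sketch(struct dm_rq_target_io *tio)
{
        struct request *clone = tio->clone;
        blk_status_t ret;

        ret = blk_insert_cloned_request(clone->q, clone);
        if (ret == BLK_STS_RESOURCE) {
                blk_rq_unprep_clone(clone);
                tio->ti->type->release_clone_rq(clone);
                tio->clone = NULL;
                return DM_MAPIO_REQUEUE;
        }
        return DM_MAPIO_REMAPPED;
}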


[RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-17 Thread Ming Lei
BLK_STS_RESOURCE can be returned from a driver when it runs out of
any resource, and the resource may not be related to tags, e.g. a
failed kmalloc(GFP_ATOMIC). When the queue goes idle under this kind
of BLK_STS_RESOURCE, the restart mechanism can't work any more, and
an IO hang may be caused.

Most drivers may call kmalloc(GFP_ATOMIC) in the IO path, and almost
all of them return BLK_STS_RESOURCE in this situation. For dm-mpath
it may be triggered a bit more easily, since the request pool of the
underlying queue is consumed up much more easily. But in reality it
is still not easy to trigger: I ran all kinds of tests on
dm-mpath/scsi-debug with all kinds of scsi_debug parameters and
couldn't trigger this issue at all. It was finally triggered in
Bart's SRP test, which seems to have been made by a genius, :-)

This patch deals with this situation by running the queue again when
the queue is found idle in the timeout handler.

Signed-off-by: Ming Lei 
---

Another approach is to do the check after BLK_STS_RESOURCE is returned
from .queue_rq() and BLK_MQ_S_SCHED_RESTART is set. That way may
introduce a bit of cost in the hot path; it was actually V1 of this
patch, please see it in the following link:


https://github.com/ming1/linux/commit/68a66900f3647ea6751aab2848b1e5eef508feaa

Or other better ways?

 block/blk-mq.c | 83 +-
 1 file changed, 82 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6e3f77829dcc..4d4af8d712da 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -896,6 +896,85 @@ static void blk_mq_terminate_expired(struct blk_mq_hw_ctx 
*hctx,
blk_mq_rq_timed_out(rq, reserved);
 }
 
+struct hctx_busy_data {
+   struct blk_mq_hw_ctx *hctx;
+   bool reserved;
+   bool busy;
+};
+
+static bool check_busy_hctx(struct sbitmap *sb, unsigned int bitnr, void *data)
+{
+   struct hctx_busy_data *busy_data = data;
+   struct blk_mq_hw_ctx *hctx = busy_data->hctx;
+   struct request *rq;
+
+   if (busy_data->reserved)
+   bitnr += hctx->tags->nr_reserved_tags;
+
+   rq = hctx->tags->static_rqs[bitnr];
+   if (blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT) {
+   busy_data->busy = true;
+   return false;
+   }
+   return true;
+}
+
+/* Check if there is any in-flight request */
+static bool blk_mq_hctx_is_busy(struct blk_mq_hw_ctx *hctx)
+{
+   struct hctx_busy_data data = {
+   .hctx = hctx,
+   .busy = false,
+   .reserved = true,
+   };
+
+   sbitmap_for_each_set(&hctx->tags->breserved_tags.sb,
+   check_busy_hctx, &data);
+   if (data.busy)
+   return true;
+
+   data.reserved = false;
+   sbitmap_for_each_set(&hctx->tags->bitmap_tags.sb,
+   check_busy_hctx, &data);
+   if (data.busy)
+   return true;
+
+   return false;
+}
+
+static void blk_mq_fixup_restart(struct blk_mq_hw_ctx *hctx)
+{
+   if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
+   bool busy;
+
+   /*
+* If this hctx is still marked as RESTART, and there
+* isn't any in-flight requests, we have to run queue
+* here to prevent IO from hanging.
+*
+* BLK_STS_RESOURCE can be returned from driver when any
+* resource is running out of. And the resource may not
+* be related with tags, such as kmalloc(GFP_ATOMIC), when
+* queue is idle under this kind of BLK_STS_RESOURCE, restart
+* can't work any more, then IO hang may be caused.
+*
+* The counter-pair of the following barrier is the one
+* in blk_mq_put_driver_tag() after returning BLK_STS_RESOURCE
+* from ->queue_rq().
+*/
+   smp_mb();
+
+   busy = blk_mq_hctx_is_busy(hctx);
+   if (!busy) {
+   printk(KERN_WARNING "blk-mq: fixup RESTART\n");
+   printk(KERN_WARNING "\t If this message is shown"
+  " a bit often, please report the issue to"
+  " linux-block@vger.kernel.org\n");
+   blk_mq_run_hw_queue(hctx, true);
+   }
+   }
+}
+
 static void blk_mq_timeout_work(struct work_struct *work)
 {
struct request_queue *q =
@@ -966,8 +1045,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
 */
queue_for_each_hw_ctx(q, hctx, i) {
/* the hctx may be unmapped, so check it here */
-   if (blk_mq_hw_queue_mapped(hctx))
+   if (blk_mq_hw_queue_mapped(hctx)) {
blk_mq_tag_idle(hctx);
+   blk_mq_fixup_restart(hctx);
+   }
}
}
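
To make the failure mode concrete, here is a hedged sketch of the driver
pattern the changelog describes -- a made-up ->queue_rq() (not from any real
driver) that runs out of a non-tag resource.  If the failed request was the
only one in flight, no completion ever arrives to act on
BLK_MQ_S_SCHED_RESTART, which is the hang the timeout-handler check above is
meant to break:

/* Hypothetical driver: the exhausted resource is a per-command atomic
 * allocation, not a tag, so no later tag completion is guaranteed to
 * come along and restart the queue. */
static blk_status_t demo_queue_rq(struct blk_mq_hw_ctx *hctx,
                                  const struct blk_mq_queue_data *bd)
{
        struct request *rq = bd->rq;
        void *cmd_buf;

        cmd_buf = kmalloc(64, GFP_ATOMIC);
        if (!cmd_buf)
                return BLK_STS_RESOURCE;        /* queue waits for a restart */

        blk_mq_start_request(rq);
        /* ... issue rq to hardware; cmd_buf is freed on completion ... */
        return BLK_STS_OK;
}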
   

Re: [for-4.16 PATCH v6 0/3] blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

2018-01-17 Thread Mike Snitzer
On Wed, Jan 17 2018 at  6:53pm -0500,
Bart Van Assche  wrote:

> On Wed, 2018-01-17 at 18:43 -0500, Laurence Oberman wrote:
> > On Wed, 2018-01-17 at 23:31 +, Bart Van Assche wrote:
> > > On Wed, 2018-01-17 at 11:58 -0500, Mike Snitzer wrote:
> > > > On Wed, Jan 17 2018 at 11:50am -0500,
> > > > Jens Axboe  wrote:
> > > > 
> > > > > On 1/17/18 9:25 AM, Mike Snitzer wrote:
> > > > > > Hi Jens,
> > > > > > 
> > > > > > Think this finally takes care of it! ;)
> > > > > > 
> > > > > > Thanks,
> > > > > > Mike
> > > > > > 
> > > > > > Mike Snitzer (2):
> > > > > >   blk-mq: factor out a few helpers from
> > > > > > __blk_mq_try_issue_directly
> > > > > >   blk-mq-sched: remove unused 'can_block' arg from
> > > > > > blk_mq_sched_insert_request
> > > > > > 
> > > > > > Ming Lei (1):
> > > > > >   blk-mq: improve DM's blk-mq IO merging via
> > > > > > blk_insert_cloned_request feedback
> > > > > 
> > > > > Applied - added actual commit message to patch 3.
> > > > 
> > > > Great, thanks.
> > > 
> > > Hello Mike,
> > > 
> > > Laurence hit the following while retesting the SRP initiator code:
> > > 
> > > [ 2223.797129] list_add corruption. prev->next should be next
> > > (e0ddd5dd), but was 3defe5cd.
> > > (prev=3defe5cd).
> > > [ 2223.862168] WARNING: CPU: 14 PID: 577 at lib/list_debug.c:28
> > > __list_add_valid+0x6a/0x70
> > > [ 2224.481151] CPU: 14 PID: 577 Comm: kworker/14:1H Tainted:
> > > G  I  4.15.0-rc8.bart3+ #1
> > > [ 2224.531193] Hardware name: HP ProLiant DL380 G7, BIOS P67
> > > 08/16/2015
> > > [ 2224.567150] Workqueue: kblockd blk_mq_run_work_fn
> > > [ 2224.593182] RIP: 0010:__list_add_valid+0x6a/0x70
> > > [ 2224.967002] Call Trace:
> > > [ 2224.980941]  blk_mq_request_bypass_insert+0x57/0xa0
> > > [ 2225.009044]  __blk_mq_try_issue_directly+0x56/0x1e0
> > > [ 2225.037007]  blk_mq_request_direct_issue+0x5d/0xc0
> > > [ 2225.090608]  map_request+0x142/0x260 [dm_mod]
> > > [ 2225.114756]  dm_mq_queue_rq+0xa4/0x120 [dm_mod]
> > > [ 2225.140812]  blk_mq_dispatch_rq_list+0x90/0x5b0
> > > [ 2225.211769]  blk_mq_sched_dispatch_requests+0x107/0x1a0
> > > [ 2225.240825]  __blk_mq_run_hw_queue+0x5f/0xf0
> > > [ 2225.264852]  process_one_work+0x141/0x340
> > > [ 2225.287872]  worker_thread+0x47/0x3e0
> > > [ 2225.308354]  kthread+0xf5/0x130
> > > [ 2225.396405]  ret_from_fork+0x32/0x40
> > > 
> > > That call trace did not show up before this patch series was added to
> > > Jens'
> > > tree. This is a regression. Could this have been introduced by this
> > > patch
> > > series?
> > > 
> > > Thanks,
> > > 
> > > Bart.
> > 
> > Hi Bart
> > One thing to note.
> > 
> > I fully tested Mike's combined dm-4.16/block-4.16 tree over the weekend
> > and did not see this.
> > This was with Mike's combined tree and SRPT running 4.13-rc2.
> > 
> > I also tested your tree on Monday with the revert of the scatter/gather
> > patches, with both SRP and SRPT running your tree, and it was fine.
> > 
> > So it's a combination of what you provided me before and what has been
> > added to your tree.
> > 
> > Mike's combined tree seemed to be fine; I can revisit that if needed. I
> > still have that kernel in place.
> > 
> > I was not running the latest SRPT when I tested Mike's tree.
> 
> Hello Laurence,
> 
> The tree I sent you this morning did not only include Mike's latest dm code
> but also Jens' latest for-next branch. So what you wrote above does not
> contradict what I wrote in my e-mail, namely that I suspect that a regression
> was introduced by the patches in the series "blk-mq: improve DM's blk-mq IO
> merging via blk_insert_cloned_request feedback". These changes namely went in
> through the block tree and not through the dm tree. Additionally, these
> changes were not present in the block-scsi-for-next branch I sent you on
> Monday.

Functionality shouldn't be any different than the variant (Ming's v4)
that Laurence tested on Sunday/Monday (once we got past the genirq issue
on HPSA).

But sure, I suppose there is something I missed when refactoring Ming's
change to get it acceptable for upstream.  I went over the mechanical
nature of what I did many times (comparing Ming's v4 to my v5).

Anyway, we'll see how Laurence fairs with this tree (but with the revert
of 84676c1 added, so his HPSA server will boot):
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=block-4.16_dm-4.16
(which is the same as linux-dm.git's 'for-next' at the moment)

The call to blk_mq_request_bypass_insert will only occur via
__blk_mq_fallback_to_insert, which, as the name implies, is not the
fast path.  This will occur if the underlying blk-mq device cannot get
resources it needs in order to issue the request.  Specifically: if/when
in __blk_mq_try_issue_directly() the hctx is stopped, or queue is
quiesced, or it cannot get the driver tag or dispatch_budget (in the
case of scsi-mq).

The same fallback, via call to blk_mq_request_bypass_insert, occurred
with 

Re: [for-4.16 PATCH v6 0/3] blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

2018-01-17 Thread Laurence Oberman
On Wed, 2018-01-17 at 23:53 +, Bart Van Assche wrote:
> On Wed, 2018-01-17 at 18:43 -0500, Laurence Oberman wrote:
> > On Wed, 2018-01-17 at 23:31 +, Bart Van Assche wrote:
> > > On Wed, 2018-01-17 at 11:58 -0500, Mike Snitzer wrote:
> > > > On Wed, Jan 17 2018 at 11:50am -0500,
> > > > Jens Axboe  wrote:
> > > > 
> > > > > On 1/17/18 9:25 AM, Mike Snitzer wrote:
> > > > > > Hi Jens,
> > > > > > 
> > > > > > Think this finally takes care of it! ;)
> > > > > > 
> > > > > > Thanks,
> > > > > > Mike
> > > > > > 
> > > > > > Mike Snitzer (2):
> > > > > >   blk-mq: factor out a few helpers from
> > > > > > __blk_mq_try_issue_directly
> > > > > >   blk-mq-sched: remove unused 'can_block' arg from
> > > > > > blk_mq_sched_insert_request
> > > > > > 
> > > > > > Ming Lei (1):
> > > > > >   blk-mq: improve DM's blk-mq IO merging via
> > > > > > blk_insert_cloned_request feedback
> > > > > 
> > > > > Applied - added actual commit message to patch 3.
> > > > 
> > > > Great, thanks.
> > > 
> > > Hello Mike,
> > > 
> > > Laurence hit the following while retesting the SRP initiator
> > > code:
> > > 
> > > [ 2223.797129] list_add corruption. prev->next should be next
> > > (e0ddd5dd), but was 3defe5cd.
> > > (prev=3defe5cd).
> > > [ 2223.862168] WARNING: CPU: 14 PID: 577 at lib/list_debug.c:28
> > > __list_add_valid+0x6a/0x70
> > > [ 2224.481151] CPU: 14 PID: 577 Comm: kworker/14:1H Tainted:
> > > G  I  4.15.0-rc8.bart3+ #1
> > > [ 2224.531193] Hardware name: HP ProLiant DL380 G7, BIOS P67
> > > 08/16/2015
> > > [ 2224.567150] Workqueue: kblockd blk_mq_run_work_fn
> > > [ 2224.593182] RIP: 0010:__list_add_valid+0x6a/0x70
> > > [ 2224.967002] Call Trace:
> > > [ 2224.980941]  blk_mq_request_bypass_insert+0x57/0xa0
> > > [ 2225.009044]  __blk_mq_try_issue_directly+0x56/0x1e0
> > > [ 2225.037007]  blk_mq_request_direct_issue+0x5d/0xc0
> > > [ 2225.090608]  map_request+0x142/0x260 [dm_mod]
> > > [ 2225.114756]  dm_mq_queue_rq+0xa4/0x120 [dm_mod]
> > > [ 2225.140812]  blk_mq_dispatch_rq_list+0x90/0x5b0
> > > [ 2225.211769]  blk_mq_sched_dispatch_requests+0x107/0x1a0
> > > [ 2225.240825]  __blk_mq_run_hw_queue+0x5f/0xf0
> > > [ 2225.264852]  process_one_work+0x141/0x340
> > > [ 2225.287872]  worker_thread+0x47/0x3e0
> > > [ 2225.308354]  kthread+0xf5/0x130
> > > [ 2225.396405]  ret_from_fork+0x32/0x40
> > > 
> > > That call trace did not show up before this patch series was
> > > added to
> > > Jens'
> > > tree. This is a regression. Could this have been introduced by
> > > this
> > > patch
> > > series?
> > > 
> > > Thanks,
> > > 
> > > Bart.
> > 
> > Hi Bart
> > One thing to note.
> > 
> > I fully tested Mike's combined dm-4.16/block-4.16 tree over the
> > weekend and did not see this.
> > This was with Mike's combined tree and SRPT running 4.13-rc2.
> > 
> > I also tested your tree on Monday with the revert of the
> > scatter/gather patches, with both SRP and SRPT running your tree, and
> > it was fine.
> > 
> > So it's a combination of what you provided me before and what has
> > been added to your tree.
> > 
> > Mike's combined tree seemed to be fine; I can revisit that if needed.
> > I still have that kernel in place.
> > 
> > I was not running the latest SRPT when I tested Mike's tree.
> 
> Hello Laurence,
> 
> The tree I sent you this morning did not only include Mike's latest
> dm code
> but also Jens' latest for-next branch. So what you wrote above does
> not
> contradict what I wrote in my e-mail, namely that I suspect that a
> regression
> was introduced by the patches in the series "blk-mq: improve DM's
> blk-mq IO
> merging via blk_insert_cloned_request feedback". These changes namely
> went in
> through the block tree and not through the dm tree. Additionally,
> these
> changes were not present in the block-scsi-for-next branch I sent you
> on
> Monday.
> 
> Bart.

Hi Bart
Thank you

I probably don't have the latest stuff Mike added to his tree on Monday.
I have not gone back to test that; I've been busy with yours all week.

I will get with Mike, pull his latest, and test it with SRPT on
4.13-rc2 like before.

That should be the best way to isolate this.

Regards
Laurence


Re: [for-4.16 PATCH v6 0/3] blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

2018-01-17 Thread Bart Van Assche
On Wed, 2018-01-17 at 18:43 -0500, Laurence Oberman wrote:
> On Wed, 2018-01-17 at 23:31 +, Bart Van Assche wrote:
> > On Wed, 2018-01-17 at 11:58 -0500, Mike Snitzer wrote:
> > > On Wed, Jan 17 2018 at 11:50am -0500,
> > > Jens Axboe  wrote:
> > > 
> > > > On 1/17/18 9:25 AM, Mike Snitzer wrote:
> > > > > Hi Jens,
> > > > > 
> > > > > Think this finally takes care of it! ;)
> > > > > 
> > > > > Thanks,
> > > > > Mike
> > > > > 
> > > > > Mike Snitzer (2):
> > > > >   blk-mq: factor out a few helpers from
> > > > > __blk_mq_try_issue_directly
> > > > >   blk-mq-sched: remove unused 'can_block' arg from
> > > > > blk_mq_sched_insert_request
> > > > > 
> > > > > Ming Lei (1):
> > > > >   blk-mq: improve DM's blk-mq IO merging via
> > > > > blk_insert_cloned_request feedback
> > > > 
> > > > Applied - added actual commit message to patch 3.
> > > 
> > > Great, thanks.
> > 
> > Hello Mike,
> > 
> > Laurence hit the following while retesting the SRP initiator code:
> > 
> > [ 2223.797129] list_add corruption. prev->next should be next
> > (e0ddd5dd), but was 3defe5cd.
> > (prev=3defe5cd).
> > [ 2223.862168] WARNING: CPU: 14 PID: 577 at lib/list_debug.c:28
> > __list_add_valid+0x6a/0x70
> > [ 2224.481151] CPU: 14 PID: 577 Comm: kworker/14:1H Tainted:
> > G  I  4.15.0-rc8.bart3+ #1
> > [ 2224.531193] Hardware name: HP ProLiant DL380 G7, BIOS P67
> > 08/16/2015
> > [ 2224.567150] Workqueue: kblockd blk_mq_run_work_fn
> > [ 2224.593182] RIP: 0010:__list_add_valid+0x6a/0x70
> > [ 2224.967002] Call Trace:
> > [ 2224.980941]  blk_mq_request_bypass_insert+0x57/0xa0
> > [ 2225.009044]  __blk_mq_try_issue_directly+0x56/0x1e0
> > [ 2225.037007]  blk_mq_request_direct_issue+0x5d/0xc0
> > [ 2225.090608]  map_request+0x142/0x260 [dm_mod]
> > [ 2225.114756]  dm_mq_queue_rq+0xa4/0x120 [dm_mod]
> > [ 2225.140812]  blk_mq_dispatch_rq_list+0x90/0x5b0
> > [ 2225.211769]  blk_mq_sched_dispatch_requests+0x107/0x1a0
> > [ 2225.240825]  __blk_mq_run_hw_queue+0x5f/0xf0
> > [ 2225.264852]  process_one_work+0x141/0x340
> > [ 2225.287872]  worker_thread+0x47/0x3e0
> > [ 2225.308354]  kthread+0xf5/0x130
> > [ 2225.396405]  ret_from_fork+0x32/0x40
> > 
> > That call trace did not show up before this patch series was added to
> > Jens'
> > tree. This is a regression. Could this have been introduced by this
> > patch
> > series?
> > 
> > Thanks,
> > 
> > Bart.
> 
> Hi Bart
> One thing to note.
> 
> I fully tested Mike's combined dm-4.16/block-4.16 tree over the weekend
> and did not see this.
> This was with Mike's combined tree and SRPT running 4.13-rc2.
> 
> I also tested your tree on Monday with the revert of the scatter/gather
> patches, with both SRP and SRPT running your tree, and it was fine.
> 
> So it's a combination of what you provided me before and what has been
> added to your tree.
> 
> Mike's combined tree seemed to be fine; I can revisit that if needed. I
> still have that kernel in place.
> 
> I was not running the latest SRPT when I tested Mike's tree.

Hello Laurence,

The tree I sent you this morning did not only include Mike's latest dm code
but also Jens' latest for-next branch. So what you wrote above does not
contradict what I wrote in my e-mail, namely that I suspect that a regression
was introduced by the patches in the series "blk-mq: improve DM's blk-mq IO
merging via blk_insert_cloned_request feedback". These changes namely went in
through the block tree and not through the dm tree. Additionally, these
changes were not present in the block-scsi-for-next branch I sent you on
Monday.

Bart.

Re: [for-4.16 PATCH v6 0/3] blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

2018-01-17 Thread Laurence Oberman
On Wed, 2018-01-17 at 23:31 +, Bart Van Assche wrote:
> On Wed, 2018-01-17 at 11:58 -0500, Mike Snitzer wrote:
> > On Wed, Jan 17 2018 at 11:50am -0500,
> > Jens Axboe  wrote:
> > 
> > > On 1/17/18 9:25 AM, Mike Snitzer wrote:
> > > > Hi Jens,
> > > > 
> > > > Think this finally takes care of it! ;)
> > > > 
> > > > Thanks,
> > > > Mike
> > > > 
> > > > Mike Snitzer (2):
> > > >   blk-mq: factor out a few helpers from
> > > > __blk_mq_try_issue_directly
> > > >   blk-mq-sched: remove unused 'can_block' arg from
> > > > blk_mq_sched_insert_request
> > > > 
> > > > Ming Lei (1):
> > > >   blk-mq: improve DM's blk-mq IO merging via
> > > > blk_insert_cloned_request feedback
> > > 
> > > Applied - added actual commit message to patch 3.
> > 
> > Great, thanks.
> 
> Hello Mike,
> 
> Laurence hit the following while retesting the SRP initiator code:
> 
> [ 2223.797129] list_add corruption. prev->next should be next
> (e0ddd5dd), but was 3defe5cd.
> (prev=3defe5cd).
> [ 2223.862168] WARNING: CPU: 14 PID: 577 at lib/list_debug.c:28
> __list_add_valid+0x6a/0x70
> [ 2224.481151] CPU: 14 PID: 577 Comm: kworker/14:1H Tainted:
> G  I  4.15.0-rc8.bart3+ #1
> [ 2224.531193] Hardware name: HP ProLiant DL380 G7, BIOS P67
> 08/16/2015
> [ 2224.567150] Workqueue: kblockd blk_mq_run_work_fn
> [ 2224.593182] RIP: 0010:__list_add_valid+0x6a/0x70
> [ 2224.967002] Call Trace:
> [ 2224.980941]  blk_mq_request_bypass_insert+0x57/0xa0
> [ 2225.009044]  __blk_mq_try_issue_directly+0x56/0x1e0
> [ 2225.037007]  blk_mq_request_direct_issue+0x5d/0xc0
> [ 2225.090608]  map_request+0x142/0x260 [dm_mod]
> [ 2225.114756]  dm_mq_queue_rq+0xa4/0x120 [dm_mod]
> [ 2225.140812]  blk_mq_dispatch_rq_list+0x90/0x5b0
> [ 2225.211769]  blk_mq_sched_dispatch_requests+0x107/0x1a0
> [ 2225.240825]  __blk_mq_run_hw_queue+0x5f/0xf0
> [ 2225.264852]  process_one_work+0x141/0x340
> [ 2225.287872]  worker_thread+0x47/0x3e0
> [ 2225.308354]  kthread+0xf5/0x130
> [ 2225.396405]  ret_from_fork+0x32/0x40
> 
> That call trace did not show up before this patch series was added to
> Jens'
> tree. This is a regression. Could this have been introduced by this
> patch
> series?
> 
> Thanks,
> 
> Bart.

Hi Bart
One thing to note.

I tested Mike's combined tree on the weekend fully dm4.16-block4.16 and
did not see this.
This was with Mike combined tree and SRPT running 4.13-rc2.

I also tested your tree Monday with the revert of the scatter/gather
patches with both SRP and SRPT running your tree and it was fine.

So it's a combination of what you provided me before and what has been
added to your tree.

Mike's combined tree seemed to be fine; I can revisit that if needed. I
still have that kernel in place.

I was not running the latest SRPT when I tested Mike's tree.


Regards
Laurence



Re: [for-4.16 PATCH v6 0/3] blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

2018-01-17 Thread Bart Van Assche
On Wed, 2018-01-17 at 11:58 -0500, Mike Snitzer wrote:
> On Wed, Jan 17 2018 at 11:50am -0500,
> Jens Axboe  wrote:
> 
> > On 1/17/18 9:25 AM, Mike Snitzer wrote:
> > > Hi Jens,
> > > 
> > > Think this finally takes care of it! ;)
> > > 
> > > Thanks,
> > > Mike
> > > 
> > > Mike Snitzer (2):
> > >   blk-mq: factor out a few helpers from __blk_mq_try_issue_directly
> > >   blk-mq-sched: remove unused 'can_block' arg from 
> > > blk_mq_sched_insert_request
> > > 
> > > Ming Lei (1):
> > >   blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request 
> > > feedback
> > 
> > Applied - added actual commit message to patch 3.
> 
> Great, thanks.

Hello Mike,

Laurence hit the following while retesting the SRP initiator code:

[ 2223.797129] list_add corruption. prev->next should be next 
(e0ddd5dd), but was 3defe5cd. (prev=3defe5cd).
[ 2223.862168] WARNING: CPU: 14 PID: 577 at lib/list_debug.c:28 
__list_add_valid+0x6a/0x70
[ 2224.481151] CPU: 14 PID: 577 Comm: kworker/14:1H Tainted: G  I  
4.15.0-rc8.bart3+ #1
[ 2224.531193] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
[ 2224.567150] Workqueue: kblockd blk_mq_run_work_fn
[ 2224.593182] RIP: 0010:__list_add_valid+0x6a/0x70
[ 2224.967002] Call Trace:
[ 2224.980941]  blk_mq_request_bypass_insert+0x57/0xa0
[ 2225.009044]  __blk_mq_try_issue_directly+0x56/0x1e0
[ 2225.037007]  blk_mq_request_direct_issue+0x5d/0xc0
[ 2225.090608]  map_request+0x142/0x260 [dm_mod]
[ 2225.114756]  dm_mq_queue_rq+0xa4/0x120 [dm_mod]
[ 2225.140812]  blk_mq_dispatch_rq_list+0x90/0x5b0
[ 2225.211769]  blk_mq_sched_dispatch_requests+0x107/0x1a0
[ 2225.240825]  __blk_mq_run_hw_queue+0x5f/0xf0
[ 2225.264852]  process_one_work+0x141/0x340
[ 2225.287872]  worker_thread+0x47/0x3e0
[ 2225.308354]  kthread+0xf5/0x130
[ 2225.396405]  ret_from_fork+0x32/0x40

That call trace did not show up before this patch series was added to Jens'
tree. This is a regression. Could this have been introduced by this patch
series?

Thanks,

Bart.

Re: [LSF/MM TOPIC] A high-performance userspace block driver

2018-01-17 Thread Matthew Wilcox
On Wed, Jan 17, 2018 at 10:49:24AM +0800, Ming Lei wrote:
> Userfaultfd might be another choice:
> 
> 1) map the block LBA space into a range of process vm space

That would limit the size of a block device to ~200TB (with my laptop's
CPU).  That's probably OK for most users, but I suspect there are some
who would chafe at such a restriction (before the 57-bit CPUs arrive).

> 2) when READ/WRITE req comes, convert it to page fault on the
> mapped range, and let userland to take control of it, and meantime
> kernel req context is slept

You don't want to sleep the request; you want it to be able to submit
more I/O.  But we have infrastructure in place to inform the submitter
when I/Os have completed.

> 3) IO req context in kernel side is waken up after userspace completed
> the IO request via userfaultfd
> 
> 4) kernel side continue to complete the IO, such as copying page from
> storage range to req(bio) pages.
> 
> Seems READ should be fine since it is very similar with the use case
> of QEMU postcopy live migration, WRITE can be a bit different, and
> maybe need some change on userfaultfd.

I like this idea, and maybe extending UFFD is the way to solve this
problem.  Perhaps I should explain a little more what the requirements
are.  At the point the driver gets the I/O, pages to copy data into (for
a read) or copy data from (for a write) have already been allocated.
At all costs, we need to avoid playing VM tricks (because TLB flushes
are expensive).  So one copy is probably OK, but we'd like to avoid it
if reasonable.

Let's assume that the userspace program looks at the request metadata and
decides that it needs to send a network request.  Ideally, it would find
a way to have the data from the response land in the pre-allocated pages
(for a read) or send the data straight from the pages in the request
(for a write).  I'm not sure UFFD helps us with that part of the problem.
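For concreteness, step 1) of the proposal above (map a block device's LBA
space into the process and have faults delivered to userspace) would look
roughly like the sketch below. This is a hedged illustration only: it assumes
a userfaultfd-capable kernel, the sizes and mapping flags are placeholders,
and the write-side behaviour discussed above would still need new userfaultfd
support.

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Illustrative only: reserve a window for the LBA range and register it
 * with userfaultfd so missing-page faults are delivered to userspace. */
static void *map_lba_range(size_t size, int *uffd_out)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = { 0 };
	void *base;
	int uffd;

	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0)
		return NULL;
	if (ioctl(uffd, UFFDIO_API, &api))
		goto fail;

	base = mmap(NULL, size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		goto fail;

	reg.range.start = (uintptr_t)base;
	reg.range.len = size;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg)) {
		munmap(base, size);
		goto fail;
	}

	*uffd_out = uffd;
	return base;	/* a service thread then polls uffd and answers with UFFDIO_COPY */
fail:
	close(uffd);
	return NULL;
}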


two more nvme fixes for Linux 4.15

2018-01-17 Thread Christoph Hellwig
Hi Jens,

below are the two fixes for the new SGL support in 4.15 as a patch
series due to the issues with the git server.



Re: two more nvme fixes for Linux 4.15

2018-01-17 Thread Jens Axboe
On 1/17/18 2:04 PM, Christoph Hellwig wrote:
> Hi Jens,
> 
> below are the two fixes for the new SGL support in 4.15 as a patch
> series due to the issues with the git server.

Applied, thanks.

-- 
Jens Axboe



[PATCH 2/2] nvme-pci: take sglist coalescing in dma_map_sg into account

2018-01-17 Thread Christoph Hellwig
Some iommu implementations can merge physically and/or virtually
contiguous segments inside dma_map_sg().  The NVMe SGL support does not take
this into account and will warn because of falling off a loop.  Pass the
number of mapped segments to nvme_pci_setup_sgls so that the SGL setup
can take the number of mapped segments into account.

Reported-by: Fangjian (Turing) 
Fixes: a7a7cbe3 ("nvme-pci: add SGL support")
Signed-off-by: Christoph Hellwig 
Reviewed-by: Keith Busch 
Reviewed-by: Sagi Grimberg 
---
 drivers/nvme/host/pci.c | 21 +
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a7e94cc3c70e..4276ebfff22b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -725,20 +725,19 @@ static void nvme_pci_sgl_set_seg(struct nvme_sgl_desc 
*sge,
 }
 
 static blk_status_t nvme_pci_setup_sgls(struct nvme_dev *dev,
-   struct request *req, struct nvme_rw_command *cmd)
+   struct request *req, struct nvme_rw_command *cmd, int entries)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-   int length = blk_rq_payload_bytes(req);
struct dma_pool *pool;
struct nvme_sgl_desc *sg_list;
struct scatterlist *sg = iod->sg;
-   int entries = iod->nents, i = 0;
dma_addr_t sgl_dma;
+   int i = 0;
 
/* setting the transfer type as SGL */
cmd->flags = NVME_CMD_SGL_METABUF;
 
-   if (length == sg_dma_len(sg)) {
+   if (entries == 1) {
nvme_pci_sgl_set_data(&cmd->dptr.sgl, sg);
return BLK_STS_OK;
}
@@ -778,13 +777,9 @@ static blk_status_t nvme_pci_setup_sgls(struct nvme_dev 
*dev,
}
 
nvme_pci_sgl_set_data(&sg_list[i++], sg);
-
-   length -= sg_dma_len(sg);
sg = sg_next(sg);
-   entries--;
-   } while (length > 0);
+   } while (--entries > 0);
 
-   WARN_ON(entries > 0);
return BLK_STS_OK;
 }
 
@@ -796,6 +791,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, 
struct request *req,
enum dma_data_direction dma_dir = rq_data_dir(req) ?
DMA_TO_DEVICE : DMA_FROM_DEVICE;
blk_status_t ret = BLK_STS_IOERR;
+   int nr_mapped;
 
sg_init_table(iod->sg, blk_rq_nr_phys_segments(req));
iod->nents = blk_rq_map_sg(q, req, iod->sg);
@@ -803,12 +799,13 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, 
struct request *req,
goto out;
 
ret = BLK_STS_RESOURCE;
-   if (!dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, dma_dir,
-   DMA_ATTR_NO_WARN))
+   nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents, dma_dir,
+   DMA_ATTR_NO_WARN);
+   if (!nr_mapped)
goto out;
 
if (iod->use_sgl)
-   ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw);
+   ret = nvme_pci_setup_sgls(dev, req, &cmnd->rw, nr_mapped);
else
ret = nvme_pci_setup_prps(dev, req, &cmnd->rw);
 
-- 
2.14.2



[PATCH 1/2] nvme-pci: check segement valid for SGL use

2018-01-17 Thread Christoph Hellwig
From: Keith Busch 

The driver needs to verify there is a payload with a command before
seeing if it should use SGLs to map it.

Fixes: 955b1b5a00ba ("nvme-pci: move use_sgl initialization to nvme_init_iod()")
Reported-by: Paul Menzel 
Reviewed-by: Paul Menzel 
Signed-off-by: Keith Busch 
Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/host/pci.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d53550e612bc..a7e94cc3c70e 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -451,10 +451,13 @@ static void **nvme_pci_iod_list(struct request *req)
 static inline bool nvme_pci_use_sgls(struct nvme_dev *dev, struct request *req)
 {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+   int nseg = blk_rq_nr_phys_segments(req);
unsigned int avg_seg_size;
 
-   avg_seg_size = DIV_ROUND_UP(blk_rq_payload_bytes(req),
-   blk_rq_nr_phys_segments(req));
+   if (nseg == 0)
+   return false;
+
+   avg_seg_size = DIV_ROUND_UP(blk_rq_payload_bytes(req), nseg);
 
if (!(dev->ctrl.sgls & ((1 << 0) | (1 << 1))))
return false;
-- 
2.14.2



Re: [GIT PULL] two more nvme fixes for Linux 4.15

2018-01-17 Thread Christoph Hellwig
On Wed, Jan 17, 2018 at 01:30:44PM -0700, Jens Axboe wrote:
> On 1/17/18 11:36 AM, Christoph Hellwig wrote:
> > Hi Jens,
> > 
> > two more last minute fixes for regressions due to the SGL support that
> > is new in Linux 4.15.
> > 
> > The following changes since commit 32835a074171e7b83c1cefbbf9d681bb9518bbd5:
> > 
> >   Merge branch 'nvme-4.15' of git://git.infradead.org/nvme into for-linus 
> > (2018-01-12 10:42:36 -0700)
> > 
> > are available in the git repository at:
> > 
> >   git://git.infradead.org/nvme.git nvme-4.15
> 
> Something appears to be funky at your end:
> 
> git pull --no-ff  git://git.infradead.org/nvme.git nvme-4.15
> remote: Counting objects: 12, done.
> remote: Compressing objects: 100% (11/11), done.
> remote: fatal: unable to read 4276ebfff22ba00fd90e8c241cc7edd56deca353
> remote: aborting due to possible repository corruption on the remote side.
> fatal: protocol error: bad pack header

The repo did have corruption due to an unclean shutdown of the
git.infradead.org machine and git's unsafe defaults.  But I'm totally
lost on how to fix this, so I'll just send the patch series to you
instead.


Re: [GIT PULL] two more nvme fixes for Linux 4.15

2018-01-17 Thread Jens Axboe
On 1/17/18 11:36 AM, Christoph Hellwig wrote:
> Hi Jens,
> 
> two more last minute fixes for regressions due to the SGL support that
> is new in Linux 4.15.
> 
> The following changes since commit 32835a074171e7b83c1cefbbf9d681bb9518bbd5:
> 
>   Merge branch 'nvme-4.15' of git://git.infradead.org/nvme into for-linus 
> (2018-01-12 10:42:36 -0700)
> 
> are available in the git repository at:
> 
>   git://git.infradead.org/nvme.git nvme-4.15

Something appears to be funky at your end:

git pull --no-ff  git://git.infradead.org/nvme.git nvme-4.15
remote: Counting objects: 12, done.
remote: Compressing objects: 100% (11/11), done.
remote: fatal: unable to read 4276ebfff22ba00fd90e8c241cc7edd56deca353
remote: aborting due to possible repository corruption on the remote side.
fatal: protocol error: bad pack header

-- 
Jens Axboe



[PATCH v2 3/3] block: Protect less code with sysfs_lock in blk_{un,}register_queue()

2018-01-17 Thread Bart Van Assche
The __blk_mq_register_dev(), blk_mq_unregister_dev(),
elv_register_queue() and elv_unregister_queue() calls need to be
protected with sysfs_lock but other code in these functions not.
Hence protect only this code with sysfs_lock. This patch fixes a
locking inversion issue in blk_unregister_queue() and also in an
error path of blk_register_queue(): it is not allowed to hold
sysfs_lock around the kobject_del(&q->kobj) call.

Signed-off-by: Bart Van Assche 
---
 block/blk-sysfs.c | 37 -
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 4a6a40ffd78e..cbea895a5547 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -853,6 +853,10 @@ struct kobj_type blk_queue_ktype = {
.release= blk_release_queue,
 };
 
+/**
+ * blk_register_queue - register a block layer queue with sysfs
+ * @disk: Disk of which the request queue should be registered with sysfs.
+ */
 int blk_register_queue(struct gendisk *disk)
 {
int ret;
@@ -909,11 +913,12 @@ int blk_register_queue(struct gendisk *disk)
if (q->request_fn || (q->mq_ops && q->elevator)) {
ret = elv_register_queue(q);
if (ret) {
+   mutex_unlock(&q->sysfs_lock);
kobject_uevent(&q->kobj, KOBJ_REMOVE);
kobject_del(&q->kobj);
blk_trace_remove_sysfs(dev);
kobject_put(&dev->kobj);
-   goto unlock;
+   return ret;
}
}
ret = 0;
@@ -923,6 +928,13 @@ int blk_register_queue(struct gendisk *disk)
 }
 EXPORT_SYMBOL_GPL(blk_register_queue);
 
+/**
+ * blk_unregister_queue - counterpart of blk_register_queue()
+ * @disk: Disk of which the request queue should be unregistered from sysfs.
+ *
+ * Note: the caller is responsible for guaranteeing that this function is 
called
+ * after blk_register_queue() has finished.
+ */
 void blk_unregister_queue(struct gendisk *disk)
 {
struct request_queue *q = disk->queue;
@@ -935,8 +947,9 @@ void blk_unregister_queue(struct gendisk *disk)
return;
 
/*
-* Protect against the 'queue' kobj being accessed
-* while/after it is removed.
+* Since sysfs_remove_dir() prevents adding new directory entries
+* before removal of existing entries starts, protect against
+* concurrent elv_iosched_store() calls.
 */
mutex_lock(&q->sysfs_lock);
 
@@ -944,18 +957,24 @@ void blk_unregister_queue(struct gendisk *disk)
queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
spin_unlock_irq(q->queue_lock);
 
-   wbt_exit(q);
-
+   /*
+* Remove the sysfs attributes before unregistering the queue data
+* structures that can be modified through sysfs.
+*/
if (q->mq_ops)
blk_mq_unregister_dev(disk_to_dev(disk), q);
-
-   if (q->request_fn || (q->mq_ops && q->elevator))
-   elv_unregister_queue(q);
+   mutex_unlock(&q->sysfs_lock);
 
kobject_uevent(&q->kobj, KOBJ_REMOVE);
kobject_del(&q->kobj);
blk_trace_remove_sysfs(disk_to_dev(disk));
-   kobject_put(&disk_to_dev(disk)->kobj);
 
+   wbt_exit(q);
+
+   mutex_lock(&q->sysfs_lock);
+   if (q->request_fn || (q->mq_ops && q->elevator))
+   elv_unregister_queue(q);
mutex_unlock(&q->sysfs_lock);
+
+   kobject_put(&disk_to_dev(disk)->kobj);
 }
-- 
2.15.1



[PATCH v2 0/3] Avoid that blk_{un,}register_queue() trigger lock inversion

2018-01-17 Thread Bart Van Assche
Hello Jens,

The three patches in this series are what I came up with after having
analyzed a lockdep complaint triggered by blk_unregister_queue(). Please
consider these patches for kernel v4.16.

Thanks,

Bart.

Changes compared to v1:
- Restored the code that protects against races between changing the scheduler
  from sysfs and unregistering a block layer queue.

Bart Van Assche (3):
  block: Unexport elv_register_queue() and elv_unregister_queue()
  block: Document scheduler modification locking requirements
  block: Protect less code with sysfs_lock in blk_{un,}register_queue()

 block/blk-sysfs.c| 37 -
 block/blk.h  |  3 +++
 block/elevator.c | 10 --
 include/linux/elevator.h |  2 --
 4 files changed, 39 insertions(+), 13 deletions(-)

-- 
2.15.1



[PATCH v2 2/3] block: Document scheduler modification locking requirements

2018-01-17 Thread Bart Van Assche
This patch does not change any functionality.

Signed-off-by: Bart Van Assche 
---
 block/elevator.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/block/elevator.c b/block/elevator.c
index 4f00b53cd5fd..e87e9b43aba0 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -869,6 +869,8 @@ int elv_register_queue(struct request_queue *q)
struct elevator_queue *e = q->elevator;
int error;
 
+   lockdep_assert_held(&q->sysfs_lock);
+
error = kobject_add(&e->kobj, &q->kobj, "%s", "iosched");
if (!error) {
struct elv_fs_entry *attr = e->type->elevator_attrs;
@@ -889,6 +891,8 @@ int elv_register_queue(struct request_queue *q)
 
 void elv_unregister_queue(struct request_queue *q)
 {
+   lockdep_assert_held(&q->sysfs_lock);
+
if (q) {
struct elevator_queue *e = q->elevator;
 
@@ -965,6 +969,8 @@ static int elevator_switch_mq(struct request_queue *q,
 {
int ret;
 
+   lockdep_assert_held(&q->sysfs_lock);
+
blk_mq_freeze_queue(q);
blk_mq_quiesce_queue(q);
 
@@ -1010,6 +1016,8 @@ static int elevator_switch(struct request_queue *q, 
struct elevator_type *new_e)
bool old_registered = false;
int err;
 
+   lockdep_assert_held(&q->sysfs_lock);
+
if (q->mq_ops)
return elevator_switch_mq(q, new_e);
 
-- 
2.15.1



[PATCH v2 1/3] block: Unexport elv_register_queue() and elv_unregister_queue()

2018-01-17 Thread Bart Van Assche
These two functions are only called from inside the block layer so
unexport them.

Signed-off-by: Bart Van Assche 
---
 block/blk.h  | 3 +++
 block/elevator.c | 2 --
 include/linux/elevator.h | 2 --
 3 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index 662005e46560..46db5dc83dcb 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -162,6 +162,9 @@ static inline void elv_deactivate_rq(struct request_queue 
*q, struct request *rq
e->type->ops.sq.elevator_deactivate_req_fn(q, rq);
 }
 
+int elv_register_queue(struct request_queue *q);
+void elv_unregister_queue(struct request_queue *q);
+
 struct hd_struct *__disk_get_part(struct gendisk *disk, int partno);
 
 #ifdef CONFIG_FAIL_IO_TIMEOUT
diff --git a/block/elevator.c b/block/elevator.c
index 138faeb08a7c..4f00b53cd5fd 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -886,7 +886,6 @@ int elv_register_queue(struct request_queue *q)
}
return error;
 }
-EXPORT_SYMBOL(elv_register_queue);
 
 void elv_unregister_queue(struct request_queue *q)
 {
@@ -900,7 +899,6 @@ void elv_unregister_queue(struct request_queue *q)
wbt_enable_default(q);
}
 }
-EXPORT_SYMBOL(elv_unregister_queue);
 
 int elv_register(struct elevator_type *e)
 {
diff --git a/include/linux/elevator.h b/include/linux/elevator.h
index 3d794b3dc532..6d9e230dffd2 100644
--- a/include/linux/elevator.h
+++ b/include/linux/elevator.h
@@ -198,8 +198,6 @@ extern bool elv_attempt_insert_merge(struct request_queue 
*, struct request *);
 extern void elv_requeue_request(struct request_queue *, struct request *);
 extern struct request *elv_former_request(struct request_queue *, struct 
request *);
 extern struct request *elv_latter_request(struct request_queue *, struct 
request *);
-extern int elv_register_queue(struct request_queue *q);
-extern void elv_unregister_queue(struct request_queue *q);
 extern int elv_may_queue(struct request_queue *, unsigned int);
 extern void elv_completed_request(struct request_queue *, struct request *);
 extern int elv_set_request(struct request_queue *q, struct request *rq,
-- 
2.15.1



WARNING: bi_status >= ARRAY_SIZE(blk_errors): bi_status=201

2018-01-17 Thread Eric Wheeler
Hello all,

We are getting the following on 4.14.13:

[33418.520838] bi_status >= ARRAY_SIZE(blk_errors): bi_status=201

The line above is our debug output showing the value of "idx" in 
block/blk-core.c:174 blk_status_to_errno where the WARN_ON is located.

Why would bi_status have a value of 201?

-Eric

[33418.523591] [ cut here ]
[33418.526319] WARNING: CPU: 2 PID: 8484 at block/blk-core.c:174 
blk_status_to_errno+0x41/0x50
[33418.529326] Modules linked in: btrfs xor zstd_decompress zstd_compress 
xxhash raid6_pq dm_crypt binfmt_misc dm_snapshot lz4 lz4_compress zram drbd 
lru_cache ebtable_filter ebtables ip6table_filter ip6_tables iptable_nat 
nf_nat_ipv4 nf_nat xt_connbytes xt_DSCP iptable_mangle ipt_REJECT 
nf_reject_ipv4 xt_limit nf_log_ipv4 nf_log_common xt_LOG nf_conntrack_ipv4 
nf_defrag_ipv4 xt_comment xt_owner mpt3sas raid_class scsi_transport_sas mptctl 
mptbase xt_recent xt_conntrack nf_conntrack xt_set iptable_filter 
ip_set_hash_net ip_set nfnetlink netconsole dm_thin_pool dm_persistent_data 
dm_bio_prison dm_bufio libcrc32c 8021q garp mrp bridge stp llc sb_edac 
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass bcache 
crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel
[33418.553475]  mxm_wmi dcdbas pcspkr sg ipmi_si lpc_ich ipmi_devintf mfd_core 
ipmi_msghandler wmi mei_me mei shpchp acpi_power_meter nfsd auth_rpcgss nfs_acl 
lockd grace sunrpc ip_tables ext4 mbcache jbd2 mgag200 drm_kms_helper 
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm igb crc32c_intel ptp drm 
pps_core dca megaraid_sas i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log 
dm_mod dax bonding
[33418.568770] CPU: 2 PID: 8484 Comm: kworker/2:0 Not tainted 4.14.13 #66
[33418.573012] Hardware name: Dell Inc. PowerEdge R730xd/0599V5, BIOS 2.7.0 
12/005/2017
[33418.577148] Workqueue: bcache bch_data_insert_keys [bcache]
[33418.581315] task: 977748e54380 task.stack: a42a0fd58000
[33418.585623] RIP: 0010:blk_status_to_errno+0x41/0x50
[33418.589742] RSP: 0018:a42a0fd5bd70 EFLAGS: 00010246
[33418.594016] RAX: 0032 RBX: 977721daa800 RCX: 
[33418.598305] RDX:  RSI: 9777bfc96938 RDI: 9777bfc96938
[33418.602638] RBP: 9776e6a25a40 R08: 0001 R09: 6d9a
[33418.606801] R10: 006c R11: a42a06e1c160 R12: 9775c22e1300
[33418.611008] R13: 0015 R14:  R15: 977734664cf0
[33418.615231] FS:  () GS:9777bfc8() 
knlGS:
[33418.619818] CS:  0010 DS:  ES:  CR0: 80050033
[33418.624355] CR2: ff600400 CR3: 000a2800a006 CR4: 001626e0
[33418.628596] Call Trace:
[33418.632290]  drbd_request_endio+0x5d/0x290 [drbd]
[33418.636713]  clone_endio+0x8e/0x130 [dm_mod]
[33418.640879]  clone_endio+0x8e/0x130 [dm_mod]
[33418.646070]  clone_endio+0x8e/0x130 [dm_mod]
[33418.651681]  bio_complete+0x74/0xd0 [bcache]
[33418.657139]  search_free+0xe/0x40 [bcache]
[33418.663005]  cached_dev_write_complete+0x2c/0x60 [bcache]
[33418.668419]  process_one_work+0x141/0x340
[33418.673814]  worker_thread+0x47/0x3e0
[33418.679164]  kthread+0xfc/0x130
[33418.684400]  ? rescuer_thread+0x380/0x380
[33418.689763]  ? kthread_park+0x60/0x60
[33418.694955]  ? do_syscall_64+0x61/0x1a0
[33418.700193]  ? SyS_exit_group+0x10/0x10
[33418.705528]  ret_from_fork+0x1f/0x30
[33418.710839] Code: 8b 80 40 ba c7 ba c3 80 3d cc 55 e8 00 00 74 06 b8 fb ff 
ff ff c3 48 c7 c7 c8 3b e8 ba 31 c0 c6 05 b4 55 e8 00 01 e8 bc 87 da ff <0f> ff 
eb e1 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 
[33418.722086] ---[ end trace dc46edef7ad4baac ]---


--
Eric Wheeler
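For context, bi_status is used as a direct index into the blk_errors[] table,
and the check that fires is essentially the following (trimmed sketch of the
4.14 code around block/blk-core.c:174):

/* Trimmed sketch: bi_status indexes blk_errors[] directly. */
static const struct {
	int		errno;
	const char	*name;
} blk_errors[] = {
	[BLK_STS_OK]		= { 0,		 "" },
	[BLK_STS_NOTSUPP]	= { -EOPNOTSUPP, "operation not supported" },
	/* ... roughly a dozen entries in total ... */
	[BLK_STS_IOERR]		= { -EIO,	 "I/O" },
};

int blk_status_to_errno(blk_status_t status)
{
	int idx = (__force int)status;

	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(blk_errors)))	/* fires for 201 */
		return -EIO;
	return blk_errors[idx].errno;
}

So a bi_status of 201 is simply out of range for the table, which suggests a
bio whose bi_status was never set from a valid BLK_STS_* value, or was
overwritten somewhere in the drbd/dm/bcache stack shown in the trace, rather
than a new status code.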


[GIT PULL] two more nvme fixes for Linux 4.15

2018-01-17 Thread Christoph Hellwig
Hi Jens,

two more last minute fixes for regressions due to the SGL support that
is new in Linux 4.15.

The following changes since commit 32835a074171e7b83c1cefbbf9d681bb9518bbd5:

  Merge branch 'nvme-4.15' of git://git.infradead.org/nvme into for-linus 
(2018-01-12 10:42:36 -0700)

are available in the git repository at:

  git://git.infradead.org/nvme.git nvme-4.15

for you to fetch changes up to 0eb89a5061873dcb69570b1571e12eb6f00f4ae6:

  nvme-pci: take sglist coalescing in dma_map_sg into account (2018-01-17 
17:41:28 +0100)


Christoph Hellwig (1):
  nvme-pci: take sglist coalescing in dma_map_sg into account

Keith Busch (1):
  nvme-pci: check segement valid for SGL use

 drivers/nvme/host/pci.c | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)


[ANNOUNCE] Participate at the Linux FAST Summit '18

2018-01-17 Thread Christoph Hellwig
We are pleased to announce that the Call for Participation [1] for the
2018 USENIX Research in Linux File and Storage Technologies Summit (Linux
FAST Summit '18) [2] is now available.  Linux FAST Summit '18 will take
place at the Oakland Marriott City Center on February 15 and will be
co-located with the 16th USENIX Conference on File and Storage Technologies
(FAST '18) [3].

Linux is one of the most frequently used platforms in file and storage
research, and FAST is the premier conference for publishing research
results from the file and storage research community.  This summit will
bring together key Linux kernel developers and academic researchers.  The
goal is to provide a forum for presenting and discussing current ideas in
research with a group of active practitioners in the Linux Kernel.

The format will be a single session of short 25-minute talks with
5-minute questions and answers, as well as 45-minute Birds-of-a-Feather
sessions (BoFs).

If you have an idea you'd like to share, or a project in the Linux Kernel
that you'd like to talk about, please send a 400-word summary of the idea
or project and a short description of the benefits that would be achieved
by discussion with kernel practitioners and/or academics to
linuxfast18cha...@usenix.org.  Submissions are due by Friday, January 26,
2018.

We look forward to receiving your submissions!

Christoph Hellwig
Erik Riedel, Dell EMC
Linux FAST Summit '18 Program Co-Chairs
linuxfast18cha...@usenix.org

[1] https://www.usenix.org/conference/linuxfastsummit18/call-for-participation
[2] https://www.usenix.org/conference/linuxfastsummit18
[3] https://www.usenix.org/conference/fast18


Re: [PATCH 3/3] block: Protect less code with sysfs_lock in blk_{un,}register_queue()

2018-01-17 Thread Bart Van Assche
On Wed, 2018-01-17 at 21:04 +0800, Ming Lei wrote:
> On Wed, Jan 17, 2018 at 10:17 AM, Bart Van Assche
>  wrote:
> > Sorry but I think what you wrote is wrong. kobject_del(&q->kobj) waits 
> > until all
> > ongoing sysfs callback methods, including elv_iosched_store(), have 
> > finished and
> > prevents that any new elv_iosched_store() calls start. That is why I think 
> > the
> > above changes do not reintroduce the race fixed by commit e9a823fb34a8 
> > ("block:
> > fix warning when I/O elevator is changed as request_queue is being 
> > removed").
> 
> But your patch basically reverts commit e9a823fb34a8, and I just saw the 
> warning
> again after applying your patch in my stress test of switching elevator:
> 
> [  225.999505] kobject_add_internal failed for iosched (error: -2 parent: 
> queue)
> [  225.999521] WARNING: CPU: 0 PID: 12707 at lib/kobject.c:244
> kobject_add_internal+0x236/0x24c
> [  225.999566] Call Trace:
> [  225.999570]  kobject_add+0x9e/0xc5
> [  225.999573]  elv_register_queue+0x35/0xa2
> [  225.999575]  elevator_switch+0x7a/0x1a4
> [  225.999577]  elv_iosched_store+0xd2/0x103
> [  225.999579]  queue_attr_store+0x66/0x82
> [  225.999581]  kernfs_fop_write+0xf3/0x135
> [  225.999583]  __vfs_write+0x31/0x142
> [  225.999591]  vfs_write+0xcb/0x16e
> [  225.999593]  SyS_write+0x5d/0xab
> [  225.999595]  entry_SYSCALL_64_fastpath+0x1a/0x7d

The above call stack means that sysfs_create_dir_ns() returned -ENOENT because
kobj->parent->sd was NULL (where kobj is the elevator kernel object). That's
probably because sysfs_remove_dir() clears the sd pointer before performing the
actual removal. From fs/sysfs/dir.c:

spin_lock(&sysfs_symlink_target_lock);
kobj->sd = NULL;
spin_unlock(&sysfs_symlink_target_lock);

if (kn) {
WARN_ON_ONCE(kernfs_type(kn) != KERNFS_DIR);
kernfs_remove(kn);
}

In the past these two actions were performed in the opposite order. See also
commit aecdcedaab49 ("sysfs: implement kobj_sysfs_assoc_lock"). Anyway, I will
rework this patch.

Bart.
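For reference, the -ENOENT in the trace comes from the parent check at the top
of sysfs_create_dir_ns(); a trimmed sketch of fs/sysfs/dir.c:

int sysfs_create_dir_ns(struct kobject *kobj, const void *ns)
{
	struct kernfs_node *parent;

	if (kobj->parent)
		parent = kobj->parent->sd;	/* NULLed early by sysfs_remove_dir() */
	else
		parent = sysfs_root_kn;

	if (!parent)
		return -ENOENT;			/* -> kobject_add_internal() warning above */

	/* kernfs_create_dir_ns() and the rest of the success path elided */
	return 0;
}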

Re: [PATCH] block: Fix __bio_integrity_endio() documentation

2018-01-17 Thread Jens Axboe
On 1/16/18 11:31 AM, Bart Van Assche wrote:
> Fixes: 4246a0b63bd8 ("block: add a bi_error field to struct bio")
> Signed-off-by: Bart Van Assche 

Applied, thanks.

-- 
Jens Axboe



Re: [for-4.16 PATCH v6 0/3] blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

2018-01-17 Thread Mike Snitzer
On Wed, Jan 17 2018 at 11:50am -0500,
Jens Axboe  wrote:

> On 1/17/18 9:25 AM, Mike Snitzer wrote:
> > Hi Jens,
> > 
> > Think this finally takes care of it! ;)
> > 
> > Thanks,
> > Mike
> > 
> > Mike Snitzer (2):
> >   blk-mq: factor out a few helpers from __blk_mq_try_issue_directly
> >   blk-mq-sched: remove unused 'can_block' arg from 
> > blk_mq_sched_insert_request
> > 
> > Ming Lei (1):
> >   blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request 
> > feedback
> 
> Applied - added actual commit message to patch 3.

Great, thanks.


Re: [for-4.16 PATCH v6 0/3] blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

2018-01-17 Thread Jens Axboe
On 1/17/18 9:25 AM, Mike Snitzer wrote:
> Hi Jens,
> 
> Think this finally takes care of it! ;)
> 
> Thanks,
> Mike
> 
> Mike Snitzer (2):
>   blk-mq: factor out a few helpers from __blk_mq_try_issue_directly
>   blk-mq-sched: remove unused 'can_block' arg from blk_mq_sched_insert_request
> 
> Ming Lei (1):
>   blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request 
> feedback

Applied - added actual commit message to patch 3.

-- 
Jens Axboe



Re: [PATCH V2 0/2] blk-mq: two patches related with CPU hotplug

2018-01-17 Thread Jens Axboe
On 1/17/18 9:41 AM, Ming Lei wrote:
> Hi Jens,
> 
> The 1st one fixes one regression caused by 20e4d81393 ("blk-mq: simplify
> queue mapping & schedule with each possisble CPU").
> 
> The 2nd one add comments for current two races, and convert the WARN_ON
> into printk(KERN_WARN) with dump_stack() since this warning is harmless.
> 
> V2:
>   - fix comment and commit log in patch 1
>   - use dump_stack() with printk to replace WARN_ON() in patch 2

Applied, thanks.

-- 
Jens Axboe



[PATCH V2 1/2] blk-mq: make sure hctx->next_cpu is set correctly

2018-01-17 Thread Ming Lei
When hctx->next_cpu is set from possible online CPUs, there is one
race in which hctx->next_cpu may be set as >= nr_cpu_ids, and finally
break workqueue.

The race can be triggered in the following two situations:

1) when one CPU is becoming DEAD, blk_mq_hctx_notify_dead() is called
to dispatch requests from the DEAD cpu context, but at that
time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
CPUs in hctx->cpumask may become offline, causing hctx->next_cpu to be set to
a bad value.

2) blk_mq_delay_run_hw_queue() is called from CPU B, and found the queue
should be run on the other CPU A, then CPU A may become offline at the
same time and all CPUs in hctx->cpumask become offline.

This patch deals with this issue by re-selecting the next CPU, and makes
sure it is set correctly.

Cc: Christian Borntraeger 
Cc: Stefan Haberland 
Cc: Christoph Hellwig 
Cc: Thomas Gleixner 
Reported-by: "jianchao.wang" 
Tested-by: "jianchao.wang" 
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each 
possisble CPU")
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 30 --
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c376d1b6309a..69e73b4e32f3 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1416,21 +1416,47 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx 
*hctx)
  */
 static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 {
+   bool tried = false;
+
if (hctx->queue->nr_hw_queues == 1)
return WORK_CPU_UNBOUND;
 
if (--hctx->next_cpu_batch <= 0) {
int next_cpu;
-
+select_cpu:
next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
cpu_online_mask);
if (next_cpu >= nr_cpu_ids)
next_cpu = 
cpumask_first_and(hctx->cpumask,cpu_online_mask);
 
-   hctx->next_cpu = next_cpu;
+   /*
+* No online CPU is found, so have to make sure hctx->next_cpu
+* is set correctly for not breaking workqueue.
+*/
+   if (next_cpu >= nr_cpu_ids)
+   hctx->next_cpu = cpumask_first(hctx->cpumask);
+   else
+   hctx->next_cpu = next_cpu;
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
}
 
+   /*
+* Do unbound schedule if we can't find a online CPU for this hctx,
+* and it should only happen in the path of handling CPU DEAD.
+*/
+   if (!cpu_online(hctx->next_cpu)) {
+   if (!tried) {
+   tried = true;
+   goto select_cpu;
+   }
+
+   /*
+* Make sure to re-select CPU next time once after CPUs
+* in hctx->cpumask become online again.
+*/
+   hctx->next_cpu_batch = 1;
+   return WORK_CPU_UNBOUND;
+   }
return hctx->next_cpu;
 }
 
-- 
2.9.5



[PATCH V2 0/2] blk-mq: two patches related with CPU hotplug

2018-01-17 Thread Ming Lei
Hi Jens,

The 1st one fixes one regression caused by 20e4d81393 ("blk-mq: simplify
queue mapping & schedule with each possisble CPU").

The 2nd one adds comments about the current two races, and converts the WARN_ON
into printk(KERN_WARNING) with dump_stack() since this warning is harmless.

V2:
- fix comment and commit log in patch 1
- use dump_stack() with printk to replace WARN_ON() in patch 2

Thanks,

Ming Lei (2):
  blk-mq: make sure hctx->next_cpu is set correctly
  blk-mq: avoid one WARN_ON in __blk_mq_run_hw_queue to printk

 block/blk-mq.c | 52 
 1 file changed, 48 insertions(+), 4 deletions(-)

-- 
2.9.5



[PATCH V2 2/2] blk-mq: avoid one WARN_ON in __blk_mq_run_hw_queue to printk

2018-01-17 Thread Ming Lei
We know this WARN_ON is harmless and in reality it may be triggered,
so convert it to printk() and dump_stack() to avoid confusing people.

Also add a comment about the two related races here.

Cc: Christian Borntraeger 
Cc: Stefan Haberland 
Cc: Christoph Hellwig 
Cc: Thomas Gleixner 
Cc: "jianchao.wang" 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 69e73b4e32f3..480fc7623a5e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1391,9 +1391,27 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx 
*hctx)
/*
 * We should be running this queue from one of the CPUs that
 * are mapped to it.
+*
+* There are at least two related races now between setting
+* hctx->next_cpu from blk_mq_hctx_next_cpu() and running
+* __blk_mq_run_hw_queue():
+*
+* - hctx->next_cpu is found offline in blk_mq_hctx_next_cpu(),
+*   but later it becomes online, then this warning is harmless
+*   at all
+*
+* - hctx->next_cpu is found online in blk_mq_hctx_next_cpu(),
+*   but later it becomes offline, then the warning can't be
+*   triggered, and we depend on blk-mq timeout handler to
+*   handle dispatched requests to this hctx
 */
-   WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
-   cpu_online(hctx->next_cpu));
+   if (!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
+   cpu_online(hctx->next_cpu)) {
+   printk(KERN_WARNING "run queue from wrong CPU %d, hctx %s\n",
+   raw_smp_processor_id(),
+   cpumask_empty(hctx->cpumask) ? "inactive": "active");
+   dump_stack();
+   }
 
/*
 * We can't run the queue inline with ints disabled. Ensure that
-- 
2.9.5



[for-4.16 PATCH v6 2/3] blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

2018-01-17 Thread Mike Snitzer
From: Ming Lei 

blk_insert_cloned_request() is called in the fast path of a dm-rq driver
(e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
blk_mq_request_bypass_insert() to directly append the request to the
blk-mq hctx->dispatch_list of the underlying queue.

1) This way isn't efficient enough because the hctx spinlock is always
used.

2) With blk_insert_cloned_request(), we completely bypass underlying
queue's elevator and depend on the upper-level dm-rq driver's elevator
to schedule IO.  But dm-rq currently can't get the underlying queue's
dispatch feedback at all.  Without knowing whether a request was issued
or not (e.g. due to underlying queue being busy) the dm-rq elevator will
not be able to provide effective IO merging (as a side-effect of dm-rq
currently blindly destaging a request from its elevator only to requeue
it after a delay, which kills any opportunity for merging).  This
obviously causes very bad sequential IO performance.

Fix this by updating blk_insert_cloned_request() to use
blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
request to be issued directly to the underlying queue and returns the
dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
returns BLK_SYS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
to _not_ destage the request.  Whereby preserving the opportunity to
merge IO.

With this, request-based DM's blk-mq sequential IO performance is vastly
improved (as much as 3X in mpath/virtio-scsi testing).

Signed-off-by: Ming Lei 
[blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
they were refactored to make them less fragile and easier to read/review]
Signed-off-by: Mike Snitzer 
---
 block/blk-core.c   |  3 +--
 block/blk-mq.c | 37 +
 block/blk-mq.h |  3 +++
 drivers/md/dm-rq.c | 19 ---
 4 files changed, 49 insertions(+), 13 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 7ba607527487..55f338020254 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2500,8 +2500,7 @@ blk_status_t blk_insert_cloned_request(struct 
request_queue *q, struct request *
 * bypass a potential scheduler on the bottom device for
 * insert.
 */
-   blk_mq_request_bypass_insert(rq, true);
-   return BLK_STS_OK;
+   return blk_mq_request_direct_issue(rq);
}
 
spin_lock_irqsave(q->queue_lock, flags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c117c2baf2c9..f5f0d8456713 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1731,15 +1731,19 @@ static blk_status_t __blk_mq_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
 static void __blk_mq_fallback_to_insert(struct blk_mq_hw_ctx *hctx,
struct request *rq,
-   bool run_queue)
+   bool run_queue, bool bypass_insert)
 {
-   blk_mq_sched_insert_request(rq, false, run_queue, false,
-   hctx->flags & BLK_MQ_F_BLOCKING);
+   if (!bypass_insert)
+   blk_mq_sched_insert_request(rq, false, run_queue, false,
+   hctx->flags & BLK_MQ_F_BLOCKING);
+   else
+   blk_mq_request_bypass_insert(rq, run_queue);
 }
 
 static blk_status_t __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
struct request *rq,
-   blk_qc_t *cookie)
+   blk_qc_t *cookie,
+   bool bypass_insert)
 {
struct request_queue *q = rq->q;
bool run_queue = true;
@@ -1750,7 +1754,7 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
goto insert;
}
 
-   if (q->elevator)
+   if (q->elevator && !bypass_insert)
goto insert;
 
if (!blk_mq_get_driver_tag(rq, NULL, false))
@@ -1763,7 +1767,9 @@ static blk_status_t __blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
return __blk_mq_issue_directly(hctx, rq, cookie);
 insert:
-   __blk_mq_fallback_to_insert(hctx, rq, run_queue);
+   __blk_mq_fallback_to_insert(hctx, rq, run_queue, bypass_insert);
+   if (bypass_insert)
+   return BLK_STS_RESOURCE;
 
return BLK_STS_OK;
 }
@@ -1778,15 +1784,30 @@ static void blk_mq_try_issue_directly(struct 
blk_mq_hw_ctx *hctx,
 
hctx_lock(hctx, &srcu_idx);
 
-   ret = __blk_mq_try_issue_directly(hctx, rq, cookie);
+   ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
if (ret == BLK_STS_RESOURCE)
-   __blk_mq_fallback_to_insert(hctx, rq, true);
+   __blk_mq_fallback_to_insert(hctx, rq, true, false);
 

Re: [PATCH 2/2] blk-mq: convert WARN_ON in __blk_mq_run_hw_queue to printk

2018-01-17 Thread Ming Lei
On Wed, Jan 17, 2018 at 08:46:40AM -0700, Jens Axboe wrote:
> On 1/17/18 5:34 AM, Ming Lei wrote:
> > We know this WARN_ON is harmless and the stack trace isn't useful too,
> > so convert it to printk(), and avoid to confuse people.
> 
> I disagree, it is useful to know the exact path it happened from,
> in case it's a valid warning. It could be an inline run and

inline run is only possible in theory, since the window is too
small.

> we screwed up the logic, or it could be from a workqueue and
> the reason would be entirely different.

OK, if you don't agree, I will just add the comment on the races
we found.

-- 
Ming


Re: [PATCH 2/2] blk-mq: convert WARN_ON in __blk_mq_run_hw_queue to printk

2018-01-17 Thread Jens Axboe
On 1/17/18 9:05 AM, Christian Borntraeger wrote:
> 
> 
> On 01/17/2018 04:46 PM, Jens Axboe wrote:
>> On 1/17/18 5:34 AM, Ming Lei wrote:
>>> We know this WARN_ON is harmless and the stack trace isn't useful too,
>>> so convert it to printk(), and avoid to confuse people.
>>
>> I disagree, it is useful to know the exact path it happened from,
>> in case it's a valid warning. It could be an inline run and
>> we screwed up the logic, or it could be from a workqueue and
>> the reason would be entirely different.
> 
> Then add a dump_stack or whatever, but WARN_ON does have fatal effects
> for some setups. If it can happen then WARN_ON is just wrong.

dump_stack() is fine - and the intent is for it to never happen, it
would be nice to close those holes so we're only catching cases that
are due to bad code.

-- 
Jens Axboe



Re: [PATCH 2/2] blk-mq: convert WARN_ON in __blk_mq_run_hw_queue to printk

2018-01-17 Thread Jens Axboe
On 1/17/18 5:34 AM, Ming Lei wrote:
> We know this WARN_ON is harmless and the stack trace isn't useful too,
> so convert it to printk(), and avoid to confuse people.

I disagree, it is useful to know the exact path it happened from,
in case it's a valid warning. It could be an inline run and
we screwed up the logic, or it could be from a workqueue and
the reason would be entirely different.

-- 
Jens Axboe



Re: [PATCH 1/2] blk-mq: make sure hctx->next_cpu is set correctly

2018-01-17 Thread Jens Axboe
On 1/17/18 5:34 AM, Ming Lei wrote:
> When hctx->next_cpu is set from possible online CPUs, there is one
> race in which hctx->next_cpu may be set as >= nr_cpu_ids, and finally
> break workqueue.
> 
> The race is triggered when one CPU is becoming DEAD, blk_mq_hctx_notify_dead()
> is called to dispatch requests from the DEAD cpu context, but at that
> time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
> CPUs in hctx->cpumask may become offline, and cause hctx->next_cpu set
> a bad value.
> 
> This patch deals with this issue by re-selecting next CPU, and make
> sure it is set correctly.
> 
> Cc: Christian Borntraeger 
> Cc: Stefan Haberland 
> Cc: Christoph Hellwig 
> Cc: Thomas Gleixner 
> Reported-by: "jianchao.wang" 
> Tested-by: "jianchao.wang" 
> Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each 
> possisble CPU")
> Signed-off-by: Ming Lei 
> ---
>  block/blk-mq.c | 31 +--
>  1 file changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index c376d1b6309a..dc4066d28323 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1416,21 +1416,48 @@ static void __blk_mq_run_hw_queue(struct 
> blk_mq_hw_ctx *hctx)
>   */
>  static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
>  {
> + bool tried = false;
> +
>   if (hctx->queue->nr_hw_queues == 1)
>   return WORK_CPU_UNBOUND;
>  
>   if (--hctx->next_cpu_batch <= 0) {
>   int next_cpu;
> -
> +select_cpu:
>   next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
>   cpu_online_mask);
>   if (next_cpu >= nr_cpu_ids)
>   next_cpu = 
> cpumask_first_and(hctx->cpumask,cpu_online_mask);
>  
> - hctx->next_cpu = next_cpu;
> + /*
> +  * No online CPU is found here, and this may happen when
> +  * running from blk_mq_hctx_notify_dead(), and make sure
> +  * hctx->next_cpu is set correctly for not breaking workqueue.
> +  */
> + if (next_cpu >= nr_cpu_ids)
> + hctx->next_cpu = cpumask_first(hctx->cpumask);
> + else
> + hctx->next_cpu = next_cpu;
>   hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;

Since this should _only_ happen from the offline notification, why don't
we move the reset/update logic in that path and out of the hot path?

-- 
Jens Axboe
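A hedged sketch of that suggestion (hypothetical helper, not part of the
posted series): re-seed next_cpu once from the CPU-dead notification path,
so the per-dispatch path never has to special-case an all-offline cpumask.

/* Hypothetical helper, for illustration only: called from the CPU-dead
 * notification path instead of re-checking in the hot path. */
static void blk_mq_hctx_reseed_next_cpu(struct blk_mq_hw_ctx *hctx)
{
	int next_cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);

	/* the workqueue needs a valid CPU id even if everything is offline */
	if (next_cpu >= nr_cpu_ids)
		next_cpu = cpumask_first(hctx->cpumask);

	hctx->next_cpu = next_cpu;
	hctx->next_cpu_batch = 1;	/* force a fresh pick on the next run */
}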



Re: [for-4.16 PATCH v5 2/3] blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

2018-01-17 Thread Jens Axboe
On 1/16/18 9:33 PM, Mike Snitzer wrote:
> From: Ming Lei 
> 
> blk_insert_cloned_request() is called in the fast path of a dm-rq driver
> (e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
> blk_mq_request_bypass_insert() to directly append the request to the
> blk-mq hctx->dispatch_list of the underlying queue.
> 
> 1) This way isn't efficient enough because the hctx spinlock is always
> used.
> 
> 2) With blk_insert_cloned_request(), we completely bypass underlying
> queue's elevator and depend on the upper-level dm-rq driver's elevator
> to schedule IO.  But dm-rq currently can't get the underlying queue's
> dispatch feedback at all.  Without knowing whether a request was issued
> or not (e.g. due to underlying queue being busy) the dm-rq elevator will
> not be able to provide effective IO merging (as a side-effect of dm-rq
> currently blindly destaging a request from its elevator only to requeue
> it after a delay, which kills any opportunity for merging).  This
> obviously causes very bad sequential IO performance.
> 
> Fix this by updating blk_insert_cloned_request() to use
> blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
> request to be issued directly to the underlying queue and returns the
> dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
> returns BLK_SYS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
> to _not_ destage the request.  Whereby preserving the opportunity to
> merge IO.
> 
> With this, request-based DM's blk-mq sequential IO performance is vastly
> improved (as much as 3X in mpath/virtio-scsi testing).

This looks better. Two minor nit picks:

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index c117c2baf2c9..0b64f7210a89 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1706,7 +1706,8 @@ static blk_status_t __blk_mq_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   blk_qc_t new_cookie;
>   blk_status_t ret;
>  
> - new_cookie = request_to_qc_t(hctx, rq);
> + if (cookie)
> + new_cookie = request_to_qc_t(hctx, rq);
>  
>   /*
>* For OK queue, we are done. For error, caller may kill it.
> @@ -1716,13 +1717,15 @@ static blk_status_t __blk_mq_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>   ret = q->mq_ops->queue_rq(hctx, &bd);
>   switch (ret) {
>   case BLK_STS_OK:
> - *cookie = new_cookie;
> + if (cookie)
> + *cookie = new_cookie;
>   break;
>   case BLK_STS_RESOURCE:
>   __blk_mq_requeue_request(rq);
>   break;
>   default:
> - *cookie = BLK_QC_T_NONE;
> + if (cookie)
> + *cookie = BLK_QC_T_NONE;
>   break;
>   }

Instead of adding these three conditions, always pass in a valid pointer
to a cookie and get rid of them.

> @@ -1731,15 +1734,20 @@ static blk_status_t __blk_mq_issue_directly(struct 
> blk_mq_hw_ctx *hctx,
>  
>  static void __blk_mq_fallback_to_insert(struct blk_mq_hw_ctx *hctx,
>   struct request *rq,
> - bool run_queue)
> + bool run_queue, bool bypass_insert)
>  {
> + if (bypass_insert) {
> + blk_mq_request_bypass_insert(rq, run_queue);
> + return;
> + }
>   blk_mq_sched_insert_request(rq, false, run_queue, false,
>   hctx->flags & BLK_MQ_F_BLOCKING);
>  }

Lose the return and just make it an if/else.

-- 
Jens Axboe
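To make the two nits concrete: callers that do not care about the cookie can
pass a dummy blk_qc_t of their own instead of NULL, and the fallback helper
becomes a plain if/else, roughly as below (sketch only; the v6 posting
reworked these helpers further):

static void __blk_mq_fallback_to_insert(struct blk_mq_hw_ctx *hctx,
					struct request *rq,
					bool run_queue, bool bypass_insert)
{
	if (bypass_insert)
		blk_mq_request_bypass_insert(rq, run_queue);
	else
		blk_mq_sched_insert_request(rq, false, run_queue, false,
					    hctx->flags & BLK_MQ_F_BLOCKING);
}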



Re: [PATCH v2 06/12] bcache: set error_limit correctly

2018-01-17 Thread Coly Li
On 17/01/2018 2:09 PM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> Hello Coly:
> 
> Then in bch_count_io_errors(), why do we still keep this code:
>> 92 unsigned errors = atomic_add_return(1 << IO_ERROR_SHIFT,
>> 93 &ca->io_errors);
>> 94 errors >>= IO_ERROR_SHIFT;
> 
> why not just modify the code as below:
>> 92 unsigned errors = atomic_add_return(1,
>> 93 &ca->io_errors);
> 
> 

Hi Junhui,

It is because of ca->set->error_decay. When error_decay is set, bcache
tries to do an exponential decay for error count. That is, error numbers
is decaying against the quantity of io count, this is to avoid long time
accumulated errors exceeds error_limit and trigger bch_cache_set_error().

The I/O error counter uses the most significant 12 bits to record the real
number of errors. Too many I/Os may decay the error count too quickly, so it
is left-shifted by 20 bits to make sure there is still a meaningful error
count left after a while (e.g. after the half-life).

If we didn't use the left shifting, then when error_decay is set and many
I/Os happen, after a short period of time the number of I/O errors would be
decayed too fast to ever reach error_limit. Therefore IMHO the left shifting
is necessary.
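To make the fixed-point argument concrete, a small self-contained example
(plain userspace C, values invented): with the 20-bit shift a single error
keeps a usable residue through many 127/128 decay steps, while a plain
integer counter of 1 collapses to 0 after the very first step.

#include <stdio.h>

#define IO_ERROR_SHIFT	20	/* bcache counts errors in units of 1 << 20 */

/* one exponential decay step, roughly errors = errors * 127 / 128 */
static unsigned int decay(unsigned int v)
{
	return (unsigned int)(((unsigned long long)v * 127) / 128);
}

int main(void)
{
	unsigned int shifted = 1u << IO_ERROR_SHIFT;	/* one error, fixed point */
	unsigned int plain = 1;				/* one error, plain integer */
	int step;

	for (step = 0; step < 64; step++) {
		shifted = decay(shifted);
		plain = decay(plain);
	}

	/* plain hits 0 after the first step (1 * 127 / 128 == 0); the shifted
	 * counter still holds roughly 60% of the original error's weight. */
	printf("shifted=%u (%.2f errors), plain=%u\n",
	       shifted, (double)shifted / (1u << IO_ERROR_SHIFT), plain);
	return 0;
}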

BTW, I doubt whether the current exponential error decay in
bch_count_io_errors() works properly, because I don't see the
connection between the IO counter (ca->io_count) and the error counter
(ca->io_errors). By default ca->set->error_decay is 0, so I doubt how
many people use or test this piece of code. (Maybe I am wrong.)


Coly Li

[code snipped]




Re: [PATCH 3/3] block: Protect less code with sysfs_lock in blk_{un,}register_queue()

2018-01-17 Thread Ming Lei
On Wed, Jan 17, 2018 at 10:17 AM, Bart Van Assche
 wrote:
> On Wed, 2018-01-17 at 09:23 +0800, Ming Lei wrote:
>> On Wed, Jan 17, 2018 at 8:03 AM, Bart Van Assche  
>> wrote:
>> > On Tue, 2018-01-16 at 17:32 -0500, Mike Snitzer wrote:
>> > > Therefore it seems to me that all queue_attr_{show,store} are racey vs
>> > > blk_unregister_queue() removing the 'queue' kobject.
>> > >
>> > > And it was just that __elevator_change() was myopicly fixed to address
>> > > the race whereas a more generic solution was/is needed.  But short of
>> > > that more generic fix your change will reintroduce the potential for
>> > > hitting the issue that commit e9a823fb34a8b fixed.
>> > >
>> > > In that light, think it best to leave blk_unregister_queue()'s
>> > > mutex_lock() above the QUEUE_FLAG_REGISTERED clearing _and_ update
>> > > queue_attr_{show,store} to test for QUEUE_FLAG_REGISTERED while holding
>> > > sysfs_lock.
>> > >
>> > > Then remove the unicorn test_bit for QUEUE_FLAG_REGISTERED from
>> > > __elevator_change().
>> >
>> > Thanks Mike for the feedback. However, I think a simpler approach exists 
>> > than
>> > what has been described above, namely by unregistering the sysfs attributes
>> > earlier. How about the patch below?
>> >
>> > [PATCH] block: Protect less code with sysfs_lock in 
>> > blk_{un,}register_queue()
>> > ---
>> >  block/blk-sysfs.c | 39 ++-
>> >  block/elevator.c  |  4 
>> >  2 files changed, 26 insertions(+), 17 deletions(-)
>> >
>> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> > index 4a6a40ffd78e..ce32366c6db7 100644
>> > --- a/block/blk-sysfs.c
>> > +++ b/block/blk-sysfs.c
>> > @@ -853,6 +853,10 @@ struct kobj_type blk_queue_ktype = {
>> > .release= blk_release_queue,
>> >  };
>> >
>> > +/**
>> > + * blk_register_queue - register a block layer queue with sysfs
>> > + * @disk: Disk of which the request queue should be registered with sysfs.
>> > + */
>> >  int blk_register_queue(struct gendisk *disk)
>> >  {
>> > int ret;
>> > @@ -909,11 +913,12 @@ int blk_register_queue(struct gendisk *disk)
>> > if (q->request_fn || (q->mq_ops && q->elevator)) {
>> > ret = elv_register_queue(q);
>> > if (ret) {
>> > +   mutex_unlock(&q->sysfs_lock);
>> > kobject_uevent(&q->kobj, KOBJ_REMOVE);
>> > kobject_del(&q->kobj);
>> > blk_trace_remove_sysfs(dev);
>> > kobject_put(&dev->kobj);
>> > -   goto unlock;
>> > +   return ret;
>> > }
>> > }
>> > ret = 0;
>> > @@ -923,6 +928,13 @@ int blk_register_queue(struct gendisk *disk)
>> >  }
>> >  EXPORT_SYMBOL_GPL(blk_register_queue);
>> >
>> > +/**
>> > + * blk_unregister_queue - counterpart of blk_register_queue()
>> > + * @disk: Disk of which the request queue should be unregistered from 
>> > sysfs.
>> > + *
>> > + * Note: the caller is responsible for guaranteeing that this function is 
>> > called
>> > + * after blk_register_queue() has finished.
>> > + */
>> >  void blk_unregister_queue(struct gendisk *disk)
>> >  {
>> > struct request_queue *q = disk->queue;
>> > @@ -934,28 +946,29 @@ void blk_unregister_queue(struct gendisk *disk)
>> > if (!test_bit(QUEUE_FLAG_REGISTERED, &q->queue_flags))
>> > return;
>> >
>> > -   /*
>> > -* Protect against the 'queue' kobj being accessed
>> > -* while/after it is removed.
>> > -*/
>> > -   mutex_lock(&q->sysfs_lock);
>> > -
>> > spin_lock_irq(q->queue_lock);
>> > queue_flag_clear(QUEUE_FLAG_REGISTERED, q);
>> > spin_unlock_irq(q->queue_lock);
>> >
>> > -   wbt_exit(q);
>> > -
>> > +   /*
>> > +* Remove the sysfs attributes before unregistering the queue data
>> > +* structures that can be modified through sysfs.
>> > +*/
>> > +   mutex_lock(&q->sysfs_lock);
>> > if (q->mq_ops)
>> > blk_mq_unregister_dev(disk_to_dev(disk), q);
>> > -
>> > -   if (q->request_fn || (q->mq_ops && q->elevator))
>> > -   elv_unregister_queue(q);
>> > +   mutex_unlock(&q->sysfs_lock);
>> >
>> > kobject_uevent(&q->kobj, KOBJ_REMOVE);
>> > kobject_del(&q->kobj);
>>
>> elevator switch still can come just after the above line code is completed,
>> so the previous warning addressed in e9a823fb34a8b can be triggered
>> again.
>>
>> > blk_trace_remove_sysfs(disk_to_dev(disk));
>> > -   kobject_put(&disk_to_dev(disk)->kobj);
>> >
>> > +   wbt_exit(q);
>> > +
>> > +   mutex_lock(&q->sysfs_lock);
>> > +   if (q->request_fn || (q->mq_ops && q->elevator))
>> > +   elv_unregister_queue(q);
>> > mutex_unlock(&q->sysfs_lock);
>> > +
>> > +   kobject_put(&disk_to_dev(disk)->kobj);
>> >  }
>> > diff --git a/block/elevator.c 

[PATCH 2/2] blk-mq: convert WARN_ON in __blk_mq_run_hw_queue to printk

2018-01-17 Thread Ming Lei
We know this WARN_ON is harmless and the stack trace isn't useful either,
so convert it to printk() to avoid confusing people.

Also add a comment about the two related races here.

Cc: Christian Borntraeger 
Cc: Stefan Haberland 
Cc: Christoph Hellwig 
Cc: Thomas Gleixner 
Cc: "jianchao.wang" 
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index dc4066d28323..6562360bf108 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1391,9 +1391,25 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx 
*hctx)
/*
 * We should be running this queue from one of the CPUs that
 * are mapped to it.
+*
+* There are at least two related races now between setting
+* hctx->next_cpu from blk_mq_hctx_next_cpu() and running
+* __blk_mq_run_hw_queue():
+*
+* - hctx->next_cpu is found offline in blk_mq_hctx_next_cpu(),
+*   but later it becomes online, then this warning is harmless
+*   at all
+*
+* - hctx->next_cpu is found online in blk_mq_hctx_next_cpu(),
+*   but later it becomes offline, then the warning can't be
+*   triggered, and we depend on blk-mq timeout handler to
+*   handle dispatched requests to this hctx
 */
-   WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
-   cpu_online(hctx->next_cpu));
+   if (!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
+   cpu_online(hctx->next_cpu))
+   printk(KERN_WARNING "run queue from wrong CPU %d, hctx %s\n",
+   raw_smp_processor_id(),
+   cpumask_empty(hctx->cpumask) ? "inactive": "active");
 
/*
 * We can't run the queue inline with ints disabled. Ensure that
-- 
2.9.5



[PATCH 1/2] blk-mq: make sure hctx->next_cpu is set correctly

2018-01-17 Thread Ming Lei
When hctx->next_cpu is set from possible online CPUs, there is one
race in which hctx->next_cpu may be set as >= nr_cpu_ids, and finally
break workqueue.

The race is triggered when one CPU is becoming DEAD, blk_mq_hctx_notify_dead()
is called to dispatch requests from the DEAD cpu context, but at that
time, this DEAD CPU has been cleared from 'cpu_online_mask', so all
CPUs in hctx->cpumask may become offline, causing hctx->next_cpu to be set to
a bad value.

This patch deals with this issue by re-selecting the next CPU, and makes
sure it is set correctly.

Cc: Christian Borntraeger 
Cc: Stefan Haberland 
Cc: Christoph Hellwig 
Cc: Thomas Gleixner 
Reported-by: "jianchao.wang" 
Tested-by: "jianchao.wang" 
Fixes: 20e4d81393 ("blk-mq: simplify queue mapping & schedule with each 
possisble CPU")
Signed-off-by: Ming Lei 
---
 block/blk-mq.c | 31 +--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c376d1b6309a..dc4066d28323 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1416,21 +1416,48 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx 
*hctx)
  */
 static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 {
+   bool tried = false;
+
if (hctx->queue->nr_hw_queues == 1)
return WORK_CPU_UNBOUND;
 
if (--hctx->next_cpu_batch <= 0) {
int next_cpu;
-
+select_cpu:
next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
cpu_online_mask);
if (next_cpu >= nr_cpu_ids)
next_cpu = 
cpumask_first_and(hctx->cpumask,cpu_online_mask);
 
-   hctx->next_cpu = next_cpu;
+   /*
+* No online CPU is found here, and this may happen when
+* running from blk_mq_hctx_notify_dead(), and make sure
+* hctx->next_cpu is set correctly for not breaking workqueue.
+*/
+   if (next_cpu >= nr_cpu_ids)
+   hctx->next_cpu = cpumask_first(hctx->cpumask);
+   else
+   hctx->next_cpu = next_cpu;
hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
}
 
+   /*
+* Do unbound schedule if we can't find a online CPU for this hctx,
+* and it should only happen in the path of handling CPU DEAD.
+*/
+   if (!cpu_online(hctx->next_cpu)) {
+   if (!tried) {
+   tried = true;
+   goto select_cpu;
+   }
+
+   /*
+* Make sure to re-select CPU next time once after CPUs
+* in hctx->cpumask become online again.
+*/
+   hctx->next_cpu_batch = 1;
+   return WORK_CPU_UNBOUND;
+   }
return hctx->next_cpu;
 }
 
-- 
2.9.5



[PATCH 0/2] blk-mq: two patches related with CPU hotplug

2018-01-17 Thread Ming Lei
Hi Jens,

The 1st one fixes a regression caused by 20e4d81393 ("blk-mq: simplify
queue mapping & schedule with each possisble CPU").

The 2nd one adds comments about the current races, and converts the WARN_ON
into printk(KERN_WARNING) since this warning is harmless.

Thanks,

Ming Lei (2):
  blk-mq: make sure hctx->next_cpu is set correctly
  blk-mq: convert WARN_ON in __blk_mq_run_hw_queue to printk

 block/blk-mq.c | 51 +++
 1 file changed, 47 insertions(+), 4 deletions(-)

-- 
2.9.5



Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU

2018-01-17 Thread Ming Lei
On Wed, Jan 17, 2018 at 11:07:48AM +0100, Christian Borntraeger wrote:
> 
> 
> On 01/17/2018 10:57 AM, Ming Lei wrote:
> > Hi Jianchao,
> > 
> > On Wed, Jan 17, 2018 at 04:09:11PM +0800, jianchao.wang wrote:
> >> Hi ming 
> >>
> >> Thanks for your kindly response.
> >>
> >> On 01/17/2018 02:22 PM, Ming Lei wrote:
> >>> This warning can't be removed completely, for example, the CPU figured
> >>> in blk_mq_hctx_next_cpu(hctx) can be put on again just after the
> >>> following call returns and before __blk_mq_run_hw_queue() is scheduled
> >>> to run.
> >>>
> >>>   kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), 
> >>> &hctx->run_work, msecs_to_jiffies(msecs))
> >> We could use cpu_active in __blk_mq_run_hw_queue() to narrow the window.
> >> There is a big gap between cpu_online and cpu_active. rebind_workers is 
> >> also between them.
> > 
> > This warning is harmless, also you can't reproduce it without help of your
> > special patch, I guess, :-) So the window shouldn't be a big deal. 
> 
> FWIW, every WARN_ON is problematic since there are people running with 
> panic_on_warn.
> If a condition can happen we should not use WARN_ON but something else.

Agree, printk() should be fine, IMO.


-- 
Ming


Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU

2018-01-17 Thread Christian Borntraeger


On 01/17/2018 11:07 AM, Christian Borntraeger wrote:
> 
> 
> On 01/17/2018 10:57 AM, Ming Lei wrote:
>> Hi Jianchao,
>>
>> On Wed, Jan 17, 2018 at 04:09:11PM +0800, jianchao.wang wrote:
>>> Hi ming 
>>>
>>> Thanks for your kindly response.
>>>
>>> On 01/17/2018 02:22 PM, Ming Lei wrote:
 This warning can't be removed completely, for example, the CPU figured
 in blk_mq_hctx_next_cpu(hctx) can be put on again just after the
 following call returns and before __blk_mq_run_hw_queue() is scheduled
 to run.

kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), 
 &hctx->run_work, msecs_to_jiffies(msecs))
>>> We could use cpu_active in __blk_mq_run_hw_queue() to narrow the window.
>>> There is a big gap between cpu_online and cpu_active. rebind_workers is 
>>> also between them.
>>
>> This warning is harmless, also you can't reproduce it without help of your
>> special patch, I guess, :-) So the window shouldn't be a big deal. 
> 
> FWIW, every WARN_ON is problematic since there are people running with 
> panic_on_warn.

To make it more clear: every WARN_ON that can happen in real life without
actually being an error is problematic.

> If a condition can happen we should not use WARN_ON but something else.
> 



Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU

2018-01-17 Thread Christian Borntraeger


On 01/17/2018 10:57 AM, Ming Lei wrote:
> Hi Jianchao,
> 
> On Wed, Jan 17, 2018 at 04:09:11PM +0800, jianchao.wang wrote:
>> Hi ming 
>>
>> Thanks for your kindly response.
>>
>> On 01/17/2018 02:22 PM, Ming Lei wrote:
>>> This warning can't be removed completely, for example, the CPU figured
>>> in blk_mq_hctx_next_cpu(hctx) can be put on again just after the
>>> following call returns and before __blk_mq_run_hw_queue() is scheduled
>>> to run.
>>>
>>> kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), 
>>> &hctx->run_work, msecs_to_jiffies(msecs))
>> We could use cpu_active in __blk_mq_run_hw_queue() to narrow the window.
>> There is a big gap between cpu_online and cpu_active. rebind_workers is also 
>> between them.
> 
> This warning is harmless, also you can't reproduce it without help of your
> special patch, I guess, :-) So the window shouldn't be a big deal. 

FWIW, every WARN_ON is problematic since there are people running with 
panic_on_warn.
If a condition can happen we should not use WARN_ON but something else.



Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU

2018-01-17 Thread Ming Lei
Hi Jianchao,

On Wed, Jan 17, 2018 at 04:09:11PM +0800, jianchao.wang wrote:
> Hi ming 
> 
> Thanks for your kindly response.
> 
> On 01/17/2018 02:22 PM, Ming Lei wrote:
> > This warning can't be removed completely, for example, the CPU figured
> > in blk_mq_hctx_next_cpu(hctx) can be put on again just after the
> > following call returns and before __blk_mq_run_hw_queue() is scheduled
> > to run.
> > 
> > kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), 
> > &hctx->run_work, msecs_to_jiffies(msecs))
> We could use cpu_active in __blk_mq_run_hw_queue() to narrow the window.
> There is a big gap between cpu_online and cpu_active. rebind_workers is also 
> between them.

This warning is harmless, also you can't reproduce it without help of your
special patch, I guess, :-) So the window shouldn't be a big deal. 

But it can be a problem because of the delay (msecs_to_jiffies(msecs)) passed
to kblockd_mod_delayed_work_on(), because during that period:

1) hctx->next_cpu can go from offline to online before __blk_mq_run_hw_queue()
is run; your warning is triggered, but it is harmless

2) hctx->next_cpu can go from online to offline before __blk_mq_run_hw_queue()
is run; there is no warning, but once the IO is submitted to hardware and
completed, how does the HBA/hw queue notify a CPU, since the CPUs assigned
to this hw queue (irq vector) are offline? blk-mq's timeout handler may
cover that, but it looks too tricky.
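
A minimal sketch of that window, assuming the call chain in block/blk-mq.c
of this period where __blk_mq_delay_run_hw_queue() queues the work (names
taken from mainline; the timeline comments only illustrate the two cases
above):

	int cpu = blk_mq_hctx_next_cpu(hctx);	/* CPU looks usable here */

	/*
	 * During the msecs delay the CPU can change state:
	 *
	 * 1) offline -> online: __blk_mq_run_hw_queue() fires the harmless
	 *    "run queue from wrong CPU" warning.
	 *
	 * 2) online -> offline: no warning, the work runs unbound, and a
	 *    request already sent to hardware has to rely on the blk-mq
	 *    timeout handler once no CPU mapped to this hw queue (irq
	 *    vector) is online to take the completion.
	 */
	kblockd_mod_delayed_work_on(cpu, &hctx->run_work,
			msecs_to_jiffies(msecs));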

> 
> > 
> > Just be curious how you trigger this issue? And is it triggered in CPU
> > hotplug stress test? Or in a normal use case?
> 
> In fact, this is my own investigation about whether the .queue_rq to one 
> hardware queue could be executed on
> the cpu where it is not mapped. Finally, found this hole when cpu hotplug.
> I did the test on NVMe device which has 1-to-1 mapping between cpu and hctx.
>  - A special patch that could hold some requests on ctx->rq_list though 
> .get_budget
>  - A script issues IOs with fio
>  - A script online/offline the cpus continuously

Thanks for sharing your reproduction approach.

Without a handler for CPU hotplug, it isn't easy to avoid the warning
completely in __blk_mq_run_hw_queue().

> At first, just the warning above. Then after this patch was introduced, panic 
> came up.

We have to fix the panic, so I will post the patch you tested in this thread.

Thanks,
Ming


Re: [PATCH 2/2] blk-mq: simplify queue mapping & schedule with each possisble CPU

2018-01-17 Thread jianchao.wang
Hi ming 

Thanks for your kindly response.

On 01/17/2018 02:22 PM, Ming Lei wrote:
> This warning can't be removed completely, for example, the CPU figured
> in blk_mq_hctx_next_cpu(hctx) can be put on again just after the
> following call returns and before __blk_mq_run_hw_queue() is scheduled
> to run.
> 
>   kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), 
> &hctx->run_work, msecs_to_jiffies(msecs))
We could use cpu_active in __blk_mq_run_hw_queue() to narrow the window.
There is a big gap between cpu_online and cpu_active. rebind_workers is also 
between them.
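
A minimal sketch of that cpu_active() idea, as it might look inside
__blk_mq_run_hw_queue() in block/blk-mq.c; this is only an illustration of
the suggestion, not code from a posted patch:

	/*
	 * cpu_active() is set later than cpu_online() when a CPU comes up
	 * (rebind_workers() runs in between), so testing it here keeps the
	 * warning quiet while next_cpu is still on its way up, narrowing
	 * the false-positive window described above.
	 */
	if (!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
	    cpu_active(hctx->next_cpu))
		printk(KERN_WARNING "run queue from wrong CPU %d\n",
				raw_smp_processor_id());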

> 
> Just be curious how you trigger this issue? And is it triggered in CPU
> hotplug stress test? Or in a normal use case?

In fact, this is my own investigation about whether the .queue_rq to one 
hardware queue could be executed on
the cpu where it is not mapped. Finally, found this hole when cpu hotplug.
I did the test on NVMe device which has 1-to-1 mapping between cpu and hctx.
 - A special patch that could hold some requests on ctx->rq_list though 
.get_budget
 - A script issues IOs with fio
 - A script online/offline the cpus continuously
At first, just the warning above. Then after this patch was introduced, panic 
came up.

Thanks
Jianchao


Re: [PATCH] lightnvm/pblk-gc: Delete an error message for a failed memory allocation in pblk_gc_line_prepare_ws()

2018-01-17 Thread Javier González
> On 16 Jan 2018, at 22.10, SF Markus Elfring  
> wrote:
> 
> From: Markus Elfring 
> Date: Tue, 16 Jan 2018 22:00:15 +0100
> 
> Omit an extra message for a memory allocation failure in this function.
> 
> This issue was detected by using the Coccinelle software.
> 
> Signed-off-by: Markus Elfring 
> ---
> drivers/lightnvm/pblk-gc.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/drivers/lightnvm/pblk-gc.c b/drivers/lightnvm/pblk-gc.c
> index 9c8e114c8a54..54cdb4360366 100644
> --- a/drivers/lightnvm/pblk-gc.c
> +++ b/drivers/lightnvm/pblk-gc.c
> @@ -147,10 +147,8 @@ static void pblk_gc_line_prepare_ws(struct work_struct 
> *work)
>   int ret;
> 
>   invalid_bitmap = kmalloc(lm->sec_bitmap_len, GFP_KERNEL);
> - if (!invalid_bitmap) {
> - pr_err("pblk: could not allocate GC invalid bitmap\n");
> + if (!invalid_bitmap)
>   goto fail_free_ws;
> - }
> 
>   emeta_buf = pblk_malloc(lm->emeta_len[0], l_mg->emeta_alloc_type,
>   GFP_KERNEL);
> --
> 2.15.1

Looks good to me.

Reviewed-by: Javier González 


