Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-23 Thread Keith Busch
On Wed, May 23, 2018 at 08:02:31AM -0600, Keith Busch wrote:
> Looks like the cmpxchg is also needed if old_state is MQ_RQ_TIMED_OUT,
> otherwise it's guaranteed to return 'true' and there's no point in the
> loop and 'if' check.

And I see v14 is already posted with that fix! :)


Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-23 Thread Keith Busch
On Tue, May 22, 2018 at 09:25:15AM -0700, Bart Van Assche wrote:
> +static bool blk_mq_change_rq_state(struct request *rq,
> +enum mq_rq_state old_state,
> +enum mq_rq_state new_state)
> +{
> + union blk_generation_and_state gstate = READ_ONCE(rq->gstate);
> + union blk_generation_and_state old_val = gstate;
> + union blk_generation_and_state new_val = gstate;
> +
> + old_val.state = old_state;
> + new_val.state = new_state;
> + if (new_state == MQ_RQ_IN_FLIGHT)
> + new_val.generation++;
> + /*
> +  * For transitions from state in-flight to another state cmpxchg()
> +  * must be used. For other state transitions it is safe to use
> +  * WRITE_ONCE().
> +  */
> + if (old_state != MQ_RQ_IN_FLIGHT) {
> + WRITE_ONCE(rq->gstate.val, new_val.val);
> + return true;
> + }
> + return blk_mq_set_rq_state(rq, old_val, new_val);
> +}



>  void blk_mq_complete_request(struct request *rq)
>  {
>   struct request_queue *q = rq->q;
> - struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
> - int srcu_idx;
>  
>   if (unlikely(blk_should_fake_timeout(q)))
>   return;
>  
> - /*
> -  * If @rq->aborted_gstate equals the current instance, timeout is
> -  * claiming @rq and we lost.  This is synchronized through
> -  * hctx_lock().  See blk_mq_timeout_work() for details.
> -  *
> -  * Completion path never blocks and we can directly use RCU here
> -  * instead of hctx_lock() which can be either RCU or SRCU.
> -  * However, that would complicate paths which want to synchronize
> -  * against us.  Let stay in sync with the issue path so that
> -  * hctx_lock() covers both issue and completion paths.
> -  */
> - hctx_lock(hctx, &srcu_idx);
> - if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
> - __blk_mq_complete_request(rq);
> - hctx_unlock(hctx, srcu_idx);
> + /* The loop is for the unlikely case of a race with the timeout code. */
> + while (true) {
> + if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT,
> +MQ_RQ_COMPLETE)) {
> + __blk_mq_complete_request(rq);
> + break;
> + }
> + if (blk_mq_change_rq_state(rq, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE))
> + break;
> + }
>  }

Looks like the cmpxchg is also needed if old_state is MQ_RQ_TIMED_OUT,
otherwise it's guaranteed to return 'true' and there's no point in the
loop and 'if' check.


Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Ming Lei
On Tue, May 22, 2018 at 02:38:37PM -0600, Jens Axboe wrote:
> On 5/22/18 2:33 PM, Bart Van Assche wrote:
> > On Tue, 2018-05-22 at 13:38 -0600, Jens Axboe wrote:
> >> On 5/22/18 1:03 PM, Jens Axboe wrote:
> >>> On 5/22/18 12:47 PM, Jens Axboe wrote:
>  Ran into this, running block/014 from blktests:
> 
>  [ 5744.949839] run blktests block/014 at 2018-05-22 12:41:25
>  [ 5750.723000] null: rq ff68f103 timed out
>  [ 5750.728181] WARNING: CPU: 45 PID: 2480 at block/blk-mq.c:585 
 __blk_mq_complete_request+0xa6/0x110
>  [ 5750.738187] Modules linked in: null_blk(+) configfs nvme nvme_core 
>  sb_edac x86_pkg_temp_therma]
>  [ 5750.765509] CPU: 45 PID: 2480 Comm: kworker/45:1H Not tainted 
>  4.17.0-rc6+ #712
>  [ 5750.774087] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 
>  2.3.4 11/09/2016
>  [ 5750.783369] Workqueue: kblockd blk_mq_timeout_work
>  [ 5750.789223] RIP: 0010:__blk_mq_complete_request+0xa6/0x110
>  [ 5750.795850] RSP: 0018:883ffb417d68 EFLAGS: 00010202
>  [ 5750.802187] RAX: 0003 RBX: 881ff100d800 RCX: 
>  
>  [ 5750.810649] RDX: 88407fd9e040 RSI: 88407fd956b8 RDI: 
>  881ff100d800
>  [ 5750.819119] RBP: e8d91800 R08:  R09: 
>  82066eb8
>  [ 5750.827588] R10: 883ffa386138 R11: 883ffa385900 R12: 
>  0001
>  [ 5750.836050] R13: 881fe7da6000 R14: 0020 R15: 
>  0002
>  [ 5750.844529] FS:  () GS:88407fd8() 
>  knlGS:
>  [ 5750.854482] CS:  0010 DS:  ES:  CR0: 80050033
>  [ 5750.861397] CR2: 7ffc92f97f68 CR3: 0201d005 CR4: 
>  003606e0
>  [ 5750.869861] DR0:  DR1:  DR2: 
>  
>  [ 5750.878333] DR3:  DR6: fffe0ff0 DR7: 
>  0400
>  [ 5750.886805] Call Trace:
>  [ 5750.890033]  bt_iter+0x42/0x50
>  [ 5750.894000]  blk_mq_queue_tag_busy_iter+0x12b/0x220
>  [ 5750.899941]  ? blk_mq_tag_to_rq+0x20/0x20
>  [ 5750.904913]  ? __rcu_read_unlock+0x50/0x50
>  [ 5750.909978]  ? blk_mq_tag_to_rq+0x20/0x20
>  [ 5750.914948]  blk_mq_timeout_work+0x14b/0x240
>  [ 5750.920220]  process_one_work+0x21b/0x510
>  [ 5750.925197]  worker_thread+0x3a/0x390
>  [ 5750.929781]  ? process_one_work+0x510/0x510
>  [ 5750.934944]  kthread+0x11c/0x140
>  [ 5750.939028]  ? kthread_create_worker_on_cpu+0x50/0x50
>  [ 5750.945169]  ret_from_fork+0x1f/0x30
>  [ 5750.949656] Code: 48 02 00 00 80 e6 80 74 29 8b 95 80 00 00 00 44 39 
>  e2 75 3b 48 89 df ff 90 2 
>  [ 5750.972139] ---[ end trace 40065cb1764bf500 ]---
> 
>  which is this:
> 
>  WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
> >>>
> >>> That check looks wrong, since TIMED_OUT -> COMPLETE is also a valid
> >>> state transition. So that check should be:
> >>>
> >>> WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE &&
> >>>  blk_mq_rq_state(rq) != MQ_RQ_TIMED_OUT);
> >>
> >> I guess it would be cleaner to actually do the transition, in
> >> blk_mq_rq_timed_out():
> >>
> >> 	case BLK_EH_HANDLED:
> >> 		if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,
> >> 					   MQ_RQ_COMPLETE))
> >> 			__blk_mq_complete_request(req);
> >> 		break;
> >>
> >> This works for me.
> > 
> > Hello Jens,
> > 
> > Thanks for having reported this. How about using the following change to 
> > suppress
> > that warning:
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index bb99c03e7a34..84e55ea55baf 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -844,6 +844,7 @@ static void blk_mq_rq_timed_out(struct request *req, 
> > bool reserved)
> >  
> > switch (ret) {
> > case BLK_EH_HANDLED:
> > +   blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE);
> > __blk_mq_complete_request(req);
> > break;
> > case BLK_EH_RESET_TIMER:
> > 
> > I think this will work better than what was proposed in your last e-mail. 
> > I'm afraid
> > that with that change a completion that occurs while the timeout 
> > handler is
> > running can be ignored.
> 
> What if that races with eg requeue? We get the completion from IRQ, decide we
> need to requeue/restart the request rather than complete it.

If the driver needs to requeue the request, that should have been done
in its .complete handler.

But if the requeue is done directly from the IRQ handler, that would be a
pre-existing race between normal completion and timeout, and Bart's patch
doesn't change that situation either.

thanks,

Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Christoph Hellwig
On Tue, May 22, 2018 at 02:29:21PM -0600, Jens Axboe wrote:
> > of BLK_EH_HANDLED. Is the driver supposed to return BLK_EH_NOT_HANDLED
> > when the driver actually knows the request has been completed before
> > returning the status?
> 
> If the driver knows it's completed, it'd have to return BLK_EH_NOT_HANDLED.
> Or BLK_EH_HANDLED would work too, since the above state transition would
> then fail.

Btw, I think we should just kill off BLK_EH_HANDLED.  WIP totally
untested progress toward that is here:

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-eh-ret

The only real missing bit is SCSI overloading the value for internal
use.



Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Bart Van Assche
On Tue, 2018-05-22 at 14:44 -0600, Keith Busch wrote:
> On Tue, May 22, 2018 at 08:36:27PM +, Bart Van Assche wrote:
> > 
> > Have you noticed that if blk_mq_complete_request() encounters a request with
> > state MQ_RQ_TIMED_OUT that it doesn't call __blk_mq_complete_request()? I 
> > think
> > the code in blk_mq_complete_request() together with the above code 
> > guarantees
> > that __blk_mq_complete_request() is only called once per request generation.
> 
> Okay, now to the BLK_EH_NOT_HANDLED case: that's supposedly the correct
> status to return if the driver knows blk_mq_complete_request() was called
> prior to returning from the timeout handler, so we need a similar check
> there, right?

Good catch. To me that seems like the right place to handle that case.

Bart.

Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Bart Van Assche
On Tue, 2018-05-22 at 14:38 -0600, Jens Axboe wrote:
> On 5/22/18 2:33 PM, Bart Van Assche wrote:
> > Thanks for having reported this. How about using the following change to 
> > suppress
> > that warning:
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index bb99c03e7a34..84e55ea55baf 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -844,6 +844,7 @@ static void blk_mq_rq_timed_out(struct request *req, 
> > bool reserved)
> >  
> > switch (ret) {
> > case BLK_EH_HANDLED:
> > +   blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE);
> > __blk_mq_complete_request(req);
> > break;
> > case BLK_EH_RESET_TIMER:
> > 
> > I think this will work better than what was proposed in your last e-mail. 
> > I'm afraid
> > that with that change a completion that occurs while the timeout 
> > handler is
> > running can be ignored.
> 
> What if that races with eg requeue? We get the completion from IRQ, decide we
> need to requeue/restart the request rather than complete it.

Shouldn't block drivers that requeue a request from inside their timeout handler
return BLK_EH_NOT_HANDLED instead of BLK_EH_HANDLED?

Thanks,

Bart.




Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Keith Busch
On Tue, May 22, 2018 at 08:36:27PM +, Bart Van Assche wrote:
> 
> Have you noticed that if blk_mq_complete_request() encounters a request with
> state MQ_RQ_TIMED_OUT that it doesn't call __blk_mq_complete_request()? I 
> think
> the code in blk_mq_complete_request() together with the above code guarantees
> that __blk_mq_complete_request() is only called once per request generation.

Okay, now to the BLK_EH_NOT_HANDLED case: that's supposedly the correct
status to return if the driver knows blk_mq_complete_request() was called
prior to returning from the timeout handler, so we need a similar check
there, right?


Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Jens Axboe
On 5/22/18 2:33 PM, Bart Van Assche wrote:
> On Tue, 2018-05-22 at 13:38 -0600, Jens Axboe wrote:
>> On 5/22/18 1:03 PM, Jens Axboe wrote:
>>> On 5/22/18 12:47 PM, Jens Axboe wrote:
 Ran into this, running block/014 from blktests:

 [ null_blk timeout warning trace snipped; identical to the copy quoted earlier in this digest ]

 which is this:

 WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
>>>
>>> That check looks wrong, since TIMED_OUT -> COMPLETE is also a valid
>>> state transition. So that check should be:
>>>
>>> WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE &&
>>>  blk_mq_rq_state(rq) != MQ_RQ_TIMED_OUT);
>>
>> I guess it would be cleaner to actually do the transition, in
>> blk_mq_rq_timed_out():
>>
>> 	case BLK_EH_HANDLED:
>> 		if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,
>> 					   MQ_RQ_COMPLETE))
>> 			__blk_mq_complete_request(req);
>> 		break;
>>
>> This works for me.
> 
> Hello Jens,
> 
> Thanks for having reported this. How about using the following change to 
> suppress
> that warning:
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index bb99c03e7a34..84e55ea55baf 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -844,6 +844,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool 
> reserved)
>  
>   switch (ret) {
>   case BLK_EH_HANDLED:
> + blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE);
>   __blk_mq_complete_request(req);
>   break;
>   case BLK_EH_RESET_TIMER:
> 
> I think this will work better than what was proposed in your last e-mail. I'm 
> afraid
> that with that change a completion that occurs while the timeout handler 
> is
> running can be ignored.

What if that races with eg requeue? We get the completion from IRQ, decide we
need to requeue/restart the request rather than complete it.

-- 
Jens Axboe



Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Keith Busch
On Tue, May 22, 2018 at 08:36:27PM +, Bart Van Assche wrote:
> Have you noticed that if blk_mq_complete_request() encounters a request with
> state MQ_RQ_TIMED_OUT that it doesn't call __blk_mq_complete_request()? I 
> think
> the code in blk_mq_complete_request() together with the above code guarantees
> that __blk_mq_complete_request() is only called once per request generation.

Right, my mistake. I noticed that when I saw your reply on the EH_HANDLED
case, so this looks fine.


Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Bart Van Assche
On Tue, 2018-05-22 at 14:33 -0600, Keith Busch wrote:
> On Tue, May 22, 2018 at 09:25:15AM -0700, Bart Van Assche wrote:
> > @@ -848,13 +843,22 @@ static void blk_mq_rq_timed_out(struct request *req, 
> > bool reserved)
> > case BLK_EH_RESET_TIMER:
> > +   blk_mq_add_timer(req);
> > /*
> > +* The loop is for the unlikely case of a race with the
> > +* completion code. There is no need to reset the deadline
> > +* if the transition to the in-flight state fails because
> > +* the deadline only matters in the in-flight state.
> >  */
> > -   blk_mq_rq_update_aborted_gstate(req, 0);
> > -   blk_add_timer(req);
> > +   while (true) {
> > +   if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,
> > +  MQ_RQ_IN_FLIGHT))
> > +   break;
> > +   if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) {
> > +   __blk_mq_complete_request(req);
> > +   break;
> > +   }
> > +   }
> 
> I'm having some trouble triggering this case where the state is already
> MQ_RQ_COMPLETE, so I'm just trying to figure this out through inspection.
> 
> It looks like the new blk_mq_complete_request() already called
> __blk_mq_complete_request() when it gets the state to MQ_RQ_COMPLETE,
> so doing that again completes it a second time.

Hello Keith,

Have you noticed that if blk_mq_complete_request() encounters a request with
state MQ_RQ_TIMED_OUT that it doesn't call __blk_mq_complete_request()? I think
the code in blk_mq_complete_request() together with the above code guarantees
that __blk_mq_complete_request() is only called once per request generation.

Bart.





Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Bart Van Assche
On Tue, 2018-05-22 at 13:38 -0600, Jens Axboe wrote:
> On 5/22/18 1:03 PM, Jens Axboe wrote:
> > On 5/22/18 12:47 PM, Jens Axboe wrote:
> > > Ran into this, running block/014 from blktests:
> > > 
> > > [ null_blk timeout warning trace snipped; identical to the copy quoted earlier in this digest ]
> > > 
> > > which is this:
> > > 
> > > WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
> > 
> > That check looks wrong, since TIMED_OUT -> COMPLETE is also a valid
> > state transition. So that check should be:
> > 
> > WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE &&
> >  blk_mq_rq_state(rq) != MQ_RQ_TIMED_OUT);
> 
> I guess it would be cleaner to actually do the transition, in
> blk_mq_rq_timed_out():
> 
> 	case BLK_EH_HANDLED:
> 		if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,
> 					   MQ_RQ_COMPLETE))
> 			__blk_mq_complete_request(req);
> 		break;
> 
> This works for me.

Hello Jens,

Thanks for having reported this. How about using the following change to 
suppress
that warning:

diff --git a/block/blk-mq.c b/block/blk-mq.c
index bb99c03e7a34..84e55ea55baf 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -844,6 +844,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool 
reserved)
 
switch (ret) {
case BLK_EH_HANDLED:
+   blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE);
__blk_mq_complete_request(req);
break;
case BLK_EH_RESET_TIMER:

I think this will work better than what was proposed in your last e-mail. I'm 
afraid
that with that change a completion that occurs while the timeout handler is
running can be ignored.

Thanks,

Bart.



Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Keith Busch
On Tue, May 22, 2018 at 09:25:15AM -0700, Bart Van Assche wrote:
> @@ -848,13 +843,22 @@ static void blk_mq_rq_timed_out(struct request *req, 
> bool reserved)
>   case BLK_EH_RESET_TIMER:
> + blk_mq_add_timer(req);
>   /*
> +  * The loop is for the unlikely case of a race with the
> +  * completion code. There is no need to reset the deadline
> +  * if the transition to the in-flight state fails because
> +  * the deadline only matters in the in-flight state.
>*/
> - blk_mq_rq_update_aborted_gstate(req, 0);
> - blk_add_timer(req);
> + while (true) {
> + if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,
> +MQ_RQ_IN_FLIGHT))
> + break;
> + if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) {
> + __blk_mq_complete_request(req);
> + break;
> + }
> + }

I'm having some trouble triggering this case where the state is already
MQ_RQ_COMPLETE, so I'm just trying to figure this out through inspection.

It looks like the new blk_mq_complete_request() already called
__blk_mq_complete_request() when it gets the state to MQ_RQ_COMPLETE,
so doing that again completes it a second time.


Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Jens Axboe
On 5/22/18 2:26 PM, Keith Busch wrote:
> On Tue, May 22, 2018 at 01:38:06PM -0600, Jens Axboe wrote:
>> I guess it would be cleaner to actually do the transition, in
>> blk_mq_rq_timed_out():
>>
>> 	case BLK_EH_HANDLED:
>> 		if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,
>> 					   MQ_RQ_COMPLETE))
>> 			__blk_mq_complete_request(req);
>> 		break;
>>
>> This works for me.
> 
> Works for me as well on manual fault injection tests.
> 
> I think this change above goes back to Christoph's point earlier on usage
> of BLK_EH_HANDLED. Is the driver supposed to return BLK_EH_NOT_HANDLED
> when the driver actually knows the request has been completed before
> returning the status?

If the driver knows it's completed, it'd have to return BLK_EH_NOT_HANDLED.
Or BLK_EH_HANDLED would work too, since the above state transition would
then fail.

-- 
Jens Axboe



Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Keith Busch
On Tue, May 22, 2018 at 01:38:06PM -0600, Jens Axboe wrote:
> I guess it would be cleaner to actually do the transition, in
> blk_mq_rq_timed_out():
> 
> 	case BLK_EH_HANDLED:
> 		if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,
> 					   MQ_RQ_COMPLETE))
> 			__blk_mq_complete_request(req);
> 		break;
> 
> This works for me.

Works for me as well on manual fault injection tests.

I think this change above goes back to Christoph's point earlier on usage
of BLK_EH_HANDLED. Is the driver supposed to return BLK_EH_NOT_HANDLED
when the driver actually knows the request has been completed before
returning the status?


Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Jens Axboe
On 5/22/18 1:03 PM, Jens Axboe wrote:
> On 5/22/18 12:47 PM, Jens Axboe wrote:
>> On 5/22/18 11:17 AM, Jens Axboe wrote:
>>> On 5/22/18 10:44 AM, Jens Axboe wrote:

 On 5/22/18 10:25 AM, Bart Van Assche wrote:
> Recently the blk-mq timeout handling code was reworked. See also Tejun
> Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
> This patch reworks the blk-mq timeout handling code again. The timeout
> handling code is simplified by introducing a state machine per request.
> This change avoids that the blk-mq timeout handling code ignores
> completions that occur after blk_mq_check_expired() has been called and
> before blk_mq_rq_timed_out() has been called.

 I'll take a look at this again, getting rid of cmpxchg64 makes me
 much more comfortable.
>>>
>>> FWIW, a quick pass on runtime testing works fine. As expected, it's
>>> more efficient than what's currently in the kernel, testing with both
>>> null_blk (1 and nr_cpus worth of queues), and nvme as well. A huge win
>>> is that we shrink the request size from 360 bytes to 312, and I did
>>> a small followup patch that brings that to 304. That's a 15% reduction,
>>> massive.
>>
>> Ran into this, running block/014 from blktests:
>>
>> [ null_blk timeout warning trace snipped; identical to the copy quoted earlier in this digest ]
>>
>> which is this:
>>
>> WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
> 
> That check looks wrong, since TIMED_OUT -> COMPLETE is also a valid
> state transition. So that check should be:
> 
> WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE &&
>  blk_mq_rq_state(rq) != MQ_RQ_TIMED_OUT);

I guess it would be cleaner to actually do the transition, in
blk_mq_rq_timed_out():

	case BLK_EH_HANDLED:
		if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,
					   MQ_RQ_COMPLETE))
			__blk_mq_complete_request(req);
		break;

This works for me.

-- 
Jens Axboe



Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Jens Axboe
On 5/22/18 12:47 PM, Jens Axboe wrote:
> On 5/22/18 11:17 AM, Jens Axboe wrote:
>> On 5/22/18 10:44 AM, Jens Axboe wrote:
>>>
>>> On 5/22/18 10:25 AM, Bart Van Assche wrote:
 Recently the blk-mq timeout handling code was reworked. See also Tejun
 Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
 This patch reworks the blk-mq timeout handling code again. The timeout
 handling code is simplified by introducing a state machine per request.
 This change avoids that the blk-mq timeout handling code ignores
 completions that occur after blk_mq_check_expired() has been called and
 before blk_mq_rq_timed_out() has been called.
>>>
>>> I'll take a look at this again, getting rid of cmpxchg64 makes me
>>> much more comfortable.
>>
>> FWIW, a quick pass on runtime testing works fine. As expected, it's
>> more efficient than what's currently in the kernel, testing with both
>> null_blk (1 and nr_cpus worth of queues), and nvme as well. A huge win
>> is that we shrink the request size from 360 bytes to 312, and I did
>> a small followup patch that brings that to 304. That's a 15% reduction,
>> massive.
> 
> Ran into this, running block/014 from blktests:
> 
> [ null_blk timeout warning trace snipped; identical to the copy quoted earlier in this digest ]
> 
> which is this:
> 
> WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);

That check looks wrong, since TIMED_OUT -> COMPLETE is also a valid
state transition. So that check should be:

WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE &&
 blk_mq_rq_state(rq) != MQ_RQ_TIMED_OUT);

-- 
Jens Axboe



Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Jens Axboe
On 5/22/18 11:17 AM, Jens Axboe wrote:
> On 5/22/18 10:44 AM, Jens Axboe wrote:
>>
>> On 5/22/18 10:25 AM, Bart Van Assche wrote:
>>> Recently the blk-mq timeout handling code was reworked. See also Tejun
>>> Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
>>> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
>>> This patch reworks the blk-mq timeout handling code again. The timeout
>>> handling code is simplified by introducing a state machine per request.
>>> This change avoids that the blk-mq timeout handling code ignores
>>> completions that occur after blk_mq_check_expired() has been called and
>>> before blk_mq_rq_timed_out() has been called.
>>
>> I'll take a look at this again, getting rid of cmpxchg64 makes me
>> much more comfortable.
> 
> FWIW, a quick pass on runtime testing works fine. As expected, it's
> more efficient than what's currently in the kernel, testing with both
> null_blk (1 and nr_cpus worth of queues), and nvme as well. A huge win
> is that we shrink the request size from 360 bytes to 312, and I did
> a small followup patch that brings that to 304. That's a 15% reduction,
> massive.

Ran into this, running block/014 from blktests:

[ 5744.949839] run blktests block/014 at 2018-05-22 12:41:25
[ 5750.723000] null: rq ff68f103 timed out
[ 5750.728181] WARNING: CPU: 45 PID: 2480 at block/blk-mq.c:585 __blk_mq_complete_request+0xa6/0x0
[ 5750.738187] Modules linked in: null_blk(+) configfs nvme nvme_core sb_edac x86_pkg_temp_therma]
[ 5750.765509] CPU: 45 PID: 2480 Comm: kworker/45:1H Not tainted 4.17.0-rc6+ #712
[ 5750.774087] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
[ 5750.783369] Workqueue: kblockd blk_mq_timeout_work
[ 5750.789223] RIP: 0010:__blk_mq_complete_request+0xa6/0x110
[ 5750.795850] RSP: 0018:883ffb417d68 EFLAGS: 00010202
[ 5750.802187] RAX: 0003 RBX: 881ff100d800 RCX: 
[ 5750.810649] RDX: 88407fd9e040 RSI: 88407fd956b8 RDI: 881ff100d800
[ 5750.819119] RBP: e8d91800 R08:  R09: 82066eb8
[ 5750.827588] R10: 883ffa386138 R11: 883ffa385900 R12: 0001
[ 5750.836050] R13: 881fe7da6000 R14: 0020 R15: 0002
[ 5750.844529] FS:  () GS:88407fd8() knlGS:
[ 5750.854482] CS:  0010 DS:  ES:  CR0: 80050033
[ 5750.861397] CR2: 7ffc92f97f68 CR3: 0201d005 CR4: 003606e0
[ 5750.869861] DR0:  DR1:  DR2: 
[ 5750.878333] DR3:  DR6: fffe0ff0 DR7: 0400
[ 5750.886805] Call Trace:
[ 5750.890033]  bt_iter+0x42/0x50
[ 5750.894000]  blk_mq_queue_tag_busy_iter+0x12b/0x220
[ 5750.899941]  ? blk_mq_tag_to_rq+0x20/0x20
[ 5750.904913]  ? __rcu_read_unlock+0x50/0x50
[ 5750.909978]  ? blk_mq_tag_to_rq+0x20/0x20
[ 5750.914948]  blk_mq_timeout_work+0x14b/0x240
[ 5750.920220]  process_one_work+0x21b/0x510
[ 5750.925197]  worker_thread+0x3a/0x390
[ 5750.929781]  ? process_one_work+0x510/0x510
[ 5750.934944]  kthread+0x11c/0x140
[ 5750.939028]  ? kthread_create_worker_on_cpu+0x50/0x50
[ 5750.945169]  ret_from_fork+0x1f/0x30
[ 5750.949656] Code: 48 02 00 00 80 e6 80 74 29 8b 95 80 00 00 00 44 39 e2 75 3b 48 89 df ff 90 2 
[ 5750.972139] ---[ end trace 40065cb1764bf500 ]---

which is this:

WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);

hitting when blk_mq_terminate_expired() completes the request through 
BLK_EH_HANDLED.

-- 
Jens Axboe



Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Jens Axboe
On 5/22/18 10:44 AM, Jens Axboe wrote:
> 
> On 5/22/18 10:25 AM, Bart Van Assche wrote:
>> Recently the blk-mq timeout handling code was reworked. See also Tejun
>> Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
>> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
>> This patch reworks the blk-mq timeout handling code again. The timeout
>> handling code is simplified by introducing a state machine per request.
>> This change prevents the blk-mq timeout handling code from ignoring
>> completions that occur after blk_mq_check_expired() has been called and
>> before blk_mq_rq_timed_out() has been called.
> 
> I'll take a look at this again, getting rid of cmpxchg64 makes me
> much more comfortable.

FWIW, a quick pass on runtime testing works fine. As expected, it's
more efficient than what's currently in the kernel, testing with both
null_blk (1 and nr_cpus worth of queues), and nvme as well. A huge win
is that we shrink the request size from 360 bytes to 312, and I did
a small followup patch that brings that to 304. That's a 15% reduction,
massive.

-- 
Jens Axboe



Re: [PATCH v13] blk-mq: Rework blk-mq timeout handling again

2018-05-22 Thread Jens Axboe

On 5/22/18 10:25 AM, Bart Van Assche wrote:
> Recently the blk-mq timeout handling code was reworked. See also Tejun
> Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
> This patch reworks the blk-mq timeout handling code again. The timeout
> handling code is simplified by introducing a state machine per request.
> This change prevents the blk-mq timeout handling code from ignoring
> completions that occur after blk_mq_check_expired() has been called and
> before blk_mq_rq_timed_out() has been called.

I'll take a look at this again, getting rid of cmpxchg64 makes me
much more comfortable.

-- 
Jens Axboe