Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-13 Thread Ming Lei
On Tue, Apr 10, 2018 at 03:01:57PM -0600, Bart Van Assche wrote:
> The blk-mq timeout handling code ignores completions that occur after
> blk_mq_check_expired() has been called and before blk_mq_rq_timed_out()
> has reset rq->aborted_gstate. If a block driver timeout handler always
> returns BLK_EH_RESET_TIMER then the result will be that the request
> never terminates.

We should fix this issue in block layer because:

1) when normal completion is done after rq's state is updated
to COMPLETE during timeout handling, this patch still follows previous
behaviour to reset timer, instead of complete this request immediately.

2) if driver's .timeout() deals with this situation by returning
RESET_TIMER at the first time, it can be very possible to deal with
this situation as RESET_TIMER in next timeout, because both hw and sw state
wrt. this request isn't changed compared with 1st .timeout.
So it is very possible for this request to not be completed for ever.

3) normal completion may be done just after returning from .timeout(),
so this issue may not be avoided by fixing driver

And the simple approach in the following link can fix the issue
without introducing any cost in fast path:

https://marc.info/?l=linux-block=152353441722852=2

> 
> Since the request state can be updated from two different contexts,
> namely regular completion and request timeout, this race cannot be
> fixed with RCU synchronization only. Fix this race as follows:
> - Use the deadline instead of the request generation to detect whether
>   or not a request timer fired after reinitialization of a request.
> - Store the request state in the lowest two bits of the deadline instead
>   of the lowest two bits of 'gstate'.
> - Rename MQ_RQ_STATE_MASK into RQ_STATE_MASK and change it from an
>   enumeration member into a #define such that its type can be changed
>   into unsigned long. That allows to write & ~RQ_STATE_MASK instead of
>   ~(unsigned long)RQ_STATE_MASK.
> - Remove all request member variables that became superfluous due to
>   this change: gstate, aborted_gstate, gstate_seq and aborted_gstate_sync.
> - Remove the request state information that became superfluous due to this
>   patch, namely RQF_MQ_TIMEOUT_EXPIRED.
> - Remove the hctx member that became superfluous due to these changes,
>   namely nr_expired.
> - Remove the code that became superfluous due to this change, namely
>   the RCU lock and unlock statements in blk_mq_complete_request() and
>   also the synchronize_rcu() call in the timeout handler.
> 
> Signed-off-by: Bart Van Assche 
> Cc: Tejun Heo 
> Cc: Christoph Hellwig 
> Cc: Ming Lei 
> Cc: Sagi Grimberg 
> Cc: Israel Rukshin ,
> Cc: Max Gurtovoy 
> Cc:  # v4.16
> ---
> 
> Changes compared to v4:
> - Addressed multiple review comments from Christoph. The most important are
>   that atomic_long_cmpxchg() has been changed into cmpxchg() and also that
>   there is now a nice and clean split between the legacy and blk-mq versions
>   of blk_add_timer().
> - Changed the patch name and modified the patch description because there is
>   disagreement about whether or not the v4.16 blk-mq core can complete a
>   single request twice. Kept the "Cc: stable" tag because of
>   https://bugzilla.kernel.org/show_bug.cgi?id=199077.
> 
> Changes compared to v3 (see also 
> https://www.mail-archive.com/linux-block@vger.kernel.org/msg20073.html):
> - Removed the spinlock again that was introduced to protect the request state.
>   v4 uses atomic_long_cmpxchg() instead.
> - Split __deadline into two variables - one for the legacy block layer and one
>   for blk-mq.
> 
> Changes compared to v2 
> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18338.html):
> - Rebased and retested on top of kernel v4.16.
> 
> Changes compared to v1 
> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html):
> - Removed the gstate and aborted_gstate members of struct request and used
>   the __deadline member to encode both the generation and state information.
> 
>  block/blk-core.c   |   2 -
>  block/blk-mq-debugfs.c |   1 -
>  block/blk-mq.c | 174 
> +
>  block/blk-mq.h |  65 ++
>  block/blk-timeout.c|  89 ++---
>  block/blk.h|  13 ++--
>  include/linux/blk-mq.h |   1 -
>  include/linux/blkdev.h |  30 ++---
>  8 files changed, 118 insertions(+), 257 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 8625ec929fe5..181b1a688a5b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -200,8 +200,6 @@ void blk_rq_init(struct request_queue *q, struct request 
> *rq)
>   rq->start_time = jiffies;
>   set_start_time_ns(rq);
>   rq->part = NULL;
> - seqcount_init(>gstate_seq);
> - 

Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-12 Thread Christoph Hellwig
Looks good:

Reviewed-by: Christoph Hellwig 

In addition to all the arguments in the changelog the diffstat is
a pretty clear indicator that a straight forward state machine is
exactly what we want.


Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-12 Thread Christoph Hellwig
On Wed, Apr 11, 2018 at 10:11:05AM +0800, Ming Lei wrote:
> On Tue, Apr 10, 2018 at 03:01:57PM -0600, Bart Van Assche wrote:
> > The blk-mq timeout handling code ignores completions that occur after
> > blk_mq_check_expired() has been called and before blk_mq_rq_timed_out()
> > has reset rq->aborted_gstate. If a block driver timeout handler always
> > returns BLK_EH_RESET_TIMER then the result will be that the request
> > never terminates.
> 
> Under this situation:
> 
> IMO, if this request has been handled by driver's irq handler, and if
> driver's .timeout still returns BLK_EH_RESET_TIMER, it is driver's bug,
> and the correct return value should be BLK_EH_HANDLED.

We have plenty drivers that do that, so we'll need to audit all the
drivers first.  I guess a start would be to find a way that disables
timeouts entirely.


Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-11 Thread Christoph Hellwig
On Wed, Apr 11, 2018 at 04:19:18PM +0300, Sagi Grimberg wrote:
>
>>   static void __blk_mq_requeue_request(struct request *rq)
>>   {
>>  struct request_queue *q = rq->q;
>> +enum mq_rq_state old_state = blk_mq_rq_state(rq);
>>  blk_mq_put_driver_tag(rq);
>>  trace_block_rq_requeue(q, rq);
>>  wbt_requeue(q->rq_wb, >issue_stat);
>>   -  if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
>> -blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
>> +if (old_state != MQ_RQ_IDLE) {
>> +if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE))
>> +WARN_ON_ONCE(true);
>>  if (q->dma_drain_size && blk_rq_bytes(rq))
>>  rq->nr_phys_segments--;
>>  }
>
> Can you explain why was old_state kept as a local variable?

As it was me who added this:  just to not read it again as no one
else can change the state at this point.

>> +static inline bool blk_mq_change_rq_state(struct request *rq,
>> +  enum mq_rq_state old_state,
>> +  enum mq_rq_state new_state)
>>   {
>> -u64 old_val = READ_ONCE(rq->gstate);
>> -u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
>> -
>> -if (state == MQ_RQ_IN_FLIGHT) {
>> -WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
>> -new_val += MQ_RQ_GEN_INC;
>> -}
>> +unsigned long old_val = (READ_ONCE(rq->__deadline) & ~RQ_STATE_MASK) |
>> +old_state;
>> +unsigned long new_val = (old_val & ~RQ_STATE_MASK) | new_state;
>>   -  /* avoid exposing interim values */
>> -WRITE_ONCE(rq->gstate, new_val);
>> +return cmpxchg(>__deadline, old_val, new_val) == old_val;
>>   }
>
> Can you explain why this takes the old_state of the request?
>
> Otherwise this looks good to me,

Because that is how cmpxchg works?


Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-11 Thread Sagi Grimberg



   static void __blk_mq_requeue_request(struct request *rq)
   {
struct request_queue *q = rq->q;
+   enum mq_rq_state old_state = blk_mq_rq_state(rq);
   
   	blk_mq_put_driver_tag(rq);
   
   	trace_block_rq_requeue(q, rq);

wbt_requeue(q->rq_wb, >issue_stat);
   
-	if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {

-   blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+   if (old_state != MQ_RQ_IDLE) {
+   if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE))
+   WARN_ON_ONCE(true);
if (q->dma_drain_size && blk_rq_bytes(rq))
rq->nr_phys_segments--;
}


Can you explain why was old_state kept as a local variable?


Hello Sagi,

Since blk_mq_requeue_request() must be called after a request has completed
the timeout handler will ignore requests that are being requeued. Hence it
is safe in this function to cache the request state in a local variable.


I understand why it is safe, I just didn't understand why it is needed.


+static inline bool blk_mq_change_rq_state(struct request *rq,
+ enum mq_rq_state old_state,
+ enum mq_rq_state new_state)
   {
-   u64 old_val = READ_ONCE(rq->gstate);
-   u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
-
-   if (state == MQ_RQ_IN_FLIGHT) {
-   WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
-   new_val += MQ_RQ_GEN_INC;
-   }
+   unsigned long old_val = (READ_ONCE(rq->__deadline) & ~RQ_STATE_MASK) |
+   old_state;
+   unsigned long new_val = (old_val & ~RQ_STATE_MASK) | new_state;
   
-	/* avoid exposing interim values */

-   WRITE_ONCE(rq->gstate, new_val);
+   return cmpxchg(>__deadline, old_val, new_val) == old_val;
   }


Can you explain why this takes the old_state of the request?


Can you clarify your question? The purpose of this function is to change
the request state only into @new_state if it matches @old_state. I think
that is also what the implementation of this function does.


I misread the documentation of this. never mind. thanks.


Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-11 Thread Bart Van Assche
On Wed, 2018-04-11 at 10:11 +0800, Ming Lei wrote:
> On Tue, Apr 10, 2018 at 03:01:57PM -0600, Bart Van Assche wrote:
> > The blk-mq timeout handling code ignores completions that occur after
> > blk_mq_check_expired() has been called and before blk_mq_rq_timed_out()
> > has reset rq->aborted_gstate. If a block driver timeout handler always
> > returns BLK_EH_RESET_TIMER then the result will be that the request
> > never terminates.
> 
> Under this situation:
> 
> IMO, if this request has been handled by driver's irq handler, and if
> driver's .timeout still returns BLK_EH_RESET_TIMER, it is driver's bug,
> and the correct return value should be BLK_EH_HANDLED.

Maybe. In the virtio-scsi driver I found the following:

/*
 * The host guarantees to respond to each command, although I/O
 * latencies might be higher than on bare metal.  Reset the timer
 * unconditionally to give the host a chance to perform EH.
 */
static enum blk_eh_timer_return virtscsi_eh_timed_out(struct scsi_cmnd *scmnd)
{
return BLK_EH_RESET_TIMER;
}

Bart.





Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-11 Thread Bart Van Assche
On Wed, 2018-04-11 at 16:19 +0300, Sagi Grimberg wrote:
> >   static void __blk_mq_requeue_request(struct request *rq)
> >   {
> > struct request_queue *q = rq->q;
> > +   enum mq_rq_state old_state = blk_mq_rq_state(rq);
> >   
> > blk_mq_put_driver_tag(rq);
> >   
> > trace_block_rq_requeue(q, rq);
> > wbt_requeue(q->rq_wb, >issue_stat);
> >   
> > -   if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
> > -   blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
> > +   if (old_state != MQ_RQ_IDLE) {
> > +   if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE))
> > +   WARN_ON_ONCE(true);
> > if (q->dma_drain_size && blk_rq_bytes(rq))
> > rq->nr_phys_segments--;
> > }
> 
> Can you explain why was old_state kept as a local variable?

Hello Sagi,

Since blk_mq_requeue_request() must be called after a request has completed
the timeout handler will ignore requests that are being requeued. Hence it
is safe in this function to cache the request state in a local variable.

> > +static inline bool blk_mq_change_rq_state(struct request *rq,
> > + enum mq_rq_state old_state,
> > + enum mq_rq_state new_state)
> >   {
> > -   u64 old_val = READ_ONCE(rq->gstate);
> > -   u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
> > -
> > -   if (state == MQ_RQ_IN_FLIGHT) {
> > -   WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
> > -   new_val += MQ_RQ_GEN_INC;
> > -   }
> > +   unsigned long old_val = (READ_ONCE(rq->__deadline) & ~RQ_STATE_MASK) |
> > +   old_state;
> > +   unsigned long new_val = (old_val & ~RQ_STATE_MASK) | new_state;
> >   
> > -   /* avoid exposing interim values */
> > -   WRITE_ONCE(rq->gstate, new_val);
> > +   return cmpxchg(>__deadline, old_val, new_val) == old_val;
> >   }
> 
> Can you explain why this takes the old_state of the request?

Can you clarify your question? The purpose of this function is to change
the request state only into @new_state if it matches @old_state. I think
that is also what the implementation of this function does.

Thanks,

Bart.






Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-11 Thread Sagi Grimberg



  static void __blk_mq_requeue_request(struct request *rq)
  {
struct request_queue *q = rq->q;
+   enum mq_rq_state old_state = blk_mq_rq_state(rq);
  
  	blk_mq_put_driver_tag(rq);
  
  	trace_block_rq_requeue(q, rq);

wbt_requeue(q->rq_wb, >issue_stat);
  
-	if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {

-   blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+   if (old_state != MQ_RQ_IDLE) {
+   if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE))
+   WARN_ON_ONCE(true);
if (q->dma_drain_size && blk_rq_bytes(rq))
rq->nr_phys_segments--;
}


Can you explain why was old_state kept as a local variable?


+static inline bool blk_mq_change_rq_state(struct request *rq,
+ enum mq_rq_state old_state,
+ enum mq_rq_state new_state)
  {
-   u64 old_val = READ_ONCE(rq->gstate);
-   u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
-
-   if (state == MQ_RQ_IN_FLIGHT) {
-   WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
-   new_val += MQ_RQ_GEN_INC;
-   }
+   unsigned long old_val = (READ_ONCE(rq->__deadline) & ~RQ_STATE_MASK) |
+   old_state;
+   unsigned long new_val = (old_val & ~RQ_STATE_MASK) | new_state;
  
-	/* avoid exposing interim values */

-   WRITE_ONCE(rq->gstate, new_val);
+   return cmpxchg(>__deadline, old_val, new_val) == old_val;
  }


Can you explain why this takes the old_state of the request?

Otherwise this looks good to me,

Reviewed-by: Sagi Grimberg 


Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-11 Thread Sagi Grimberg



The blk-mq timeout handling code ignores completions that occur after
blk_mq_check_expired() has been called and before blk_mq_rq_timed_out()
has reset rq->aborted_gstate. If a block driver timeout handler always
returns BLK_EH_RESET_TIMER then the result will be that the request
never terminates.


Under this situation:

IMO, if this request has been handled by driver's irq handler, and if
driver's .timeout still returns BLK_EH_RESET_TIMER, it is driver's bug,
and the correct return value should be BLK_EH_HANDLED.


Is it possible to guarantee this efficiently if the irq handler
can run concurrently with the timeout handler?


Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-10 Thread Ming Lei
On Tue, Apr 10, 2018 at 03:01:57PM -0600, Bart Van Assche wrote:
> The blk-mq timeout handling code ignores completions that occur after
> blk_mq_check_expired() has been called and before blk_mq_rq_timed_out()
> has reset rq->aborted_gstate. If a block driver timeout handler always
> returns BLK_EH_RESET_TIMER then the result will be that the request
> never terminates.

Under this situation:

IMO, if this request has been handled by driver's irq handler, and if
driver's .timeout still returns BLK_EH_RESET_TIMER, it is driver's bug,
and the correct return value should be BLK_EH_HANDLED.

-- 
Ming