Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-13 Thread Ming Lei
On Thu, Apr 12, 2018 at 06:57:12AM -0700, Tejun Heo wrote:
> On Thu, Apr 12, 2018 at 07:05:13AM +0800, Ming Lei wrote:
> > > Not really, because aborted_gstate currently doesn't have any memory
> > > barriers around it, so nothing ensures that blk_add_timer() actually
> > > appears first.  We can either add the matching barriers to the
> > > aborted_gstate update and to where it's read in the normal completion
> > > path, or we can wait for the update to be visible everywhere by
> > > waiting for an RCU grace period (because the reader is RCU protected).
> > 
> > Seems unnecessary.
> > 
> > Suppose it is reordered: the only side effect is that the newly
> > recycled request times out a bit late, and I think that is something
> > we can survive, right?
> 
It can at least mess up the timeout duration for the next recycle
instance, because there can be two competing blk_add_timer() instances.
I'm not sure whether there can be other consequences.  When ownership
isn't clear, it becomes really difficult to reason about these things,
and that can lead to subtle failures.  I think it'd be best to always
establish who owns what.

Please see the code of blk_add_timer() for blk-mq:

	blk_rq_set_deadline(req, jiffies + req->timeout);
	req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
	...
	if (!timer_pending(&q->timeout) ||
	    time_before(expiry, q->timeout.expires))
		mod_timer(&q->timeout, expiry);

If this rq has been recycled, blk_add_timer() only touches rq->deadline
and the EXPIRED flag, so the only effect is that the timeout may be
handled a bit late, but the timeout monitoring itself won't be lost.

And this shouldn't be difficult to avoid: as you mentioned, a
synchronize_rcu() can be inserted between blk_add_timer() and the
resetting of aborted_gstate.


thanks,
Ming


Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-12 Thread Tejun Heo
On Thu, Apr 12, 2018 at 07:05:13AM +0800, Ming Lei wrote:
> > Not really, because aborted_gstate currently doesn't have any memory
> > barriers around it, so nothing ensures that blk_add_timer() actually
> > appears first.  We can either add the matching barriers to the
> > aborted_gstate update and to where it's read in the normal completion
> > path, or we can wait for the update to be visible everywhere by
> > waiting for an RCU grace period (because the reader is RCU protected).
> 
> Seems unnecessary.
> 
> Suppose it is reordered: the only side effect is that the newly
> recycled request times out a bit late, and I think that is something
> we can survive, right?

It can at least mess up the timeout duration for the next recycle
instance, because there can be two competing blk_add_timer() instances.
I'm not sure whether there can be other consequences.  When ownership
isn't clear, it becomes really difficult to reason about these things,
and that can lead to subtle failures.  I think it'd be best to always
establish who owns what.

Thanks.

-- 
tejun


Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-11 Thread Ming Lei
On Wed, Apr 11, 2018 at 10:49:51PM +, Bart Van Assche wrote:
> On Thu, 2018-04-12 at 04:55 +0800, Ming Lei wrote:
> > +again:
> > switch (ret) {
> > case BLK_EH_HANDLED:
> > __blk_mq_complete_request(req);
> > break;
> > case BLK_EH_RESET_TIMER:
> > [ ... ]
> > +   spin_lock_irqsave(req->q->queue_lock, flags);
> > +   if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
> > +   blk_mq_rq_update_aborted_gstate(req, 0);
> > +   blk_add_timer(req);
> > +   } else {
> > +   blk_mq_rq_update_state(req, MQ_RQ_IN_FLIGHT);
> > +   ret = BLK_EH_HANDLED;
> > +   goto again;
> > +   }
> > +   spin_unlock_irqrestore(req->q->queue_lock, flags);
> 
> Does the above chunk introduce a backwards goto from inside a region around
> which a spinlock is held to outside that region? Can such a goto result in
> anything else than a deadlock?

Yes, it is being fixed in my local V2 :-)

-- 
Ming


Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-11 Thread Ming Lei
On Wed, Apr 11, 2018 at 03:47:12PM -0700, Tejun Heo wrote:
> Hello,
> 
> On Thu, Apr 12, 2018 at 06:43:45AM +0800, Ming Lei wrote:
> > On Wed, Apr 11, 2018 at 02:30:07PM -0700, Tejun Heo wrote:
> > > Hello, Ming.
> > > 
> > > On Thu, Apr 12, 2018 at 04:55:29AM +0800, Ming Lei wrote:
> > > ...
> > > > +   spin_lock_irqsave(req->q->queue_lock, flags);
> > > > +   if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
> > > > +   blk_mq_rq_update_aborted_gstate(req, 0);
> > > > +   blk_add_timer(req);
> > > 
> > > Nothing prevents the above blk_add_timer() racing against the next
> > > recycle instance of the request, so this still leaves a small race
> > > window.
> > 
> > OK.
> > 
> > But this small race window can be avoided by running blk_add_timer(req)
> > before blk_mq_rq_update_aborted_gstate(req, 0), can't it?
> 
> Not really, because aborted_gstate currently doesn't have any memory
> barriers around it, so nothing ensures that blk_add_timer() actually
> appears first.  We can either add the matching barriers to the
> aborted_gstate update and to where it's read in the normal completion
> path, or we can wait for the update to be visible everywhere by
> waiting for an RCU grace period (because the reader is RCU protected).

Seems unnecessary.

Suppose it is reordered: the only side effect is that the newly
recycled request times out a bit late, and I think that is something
we can survive, right?

But it needs to be documented.

--
Ming


Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-11 Thread Bart Van Assche
On Thu, 2018-04-12 at 04:55 +0800, Ming Lei wrote:
> +again:
>   switch (ret) {
>   case BLK_EH_HANDLED:
>   __blk_mq_complete_request(req);
>   break;
>   case BLK_EH_RESET_TIMER:
>   [ ... ]
> + spin_lock_irqsave(req->q->queue_lock, flags);
> + if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
> + blk_mq_rq_update_aborted_gstate(req, 0);
> + blk_add_timer(req);
> + } else {
> + blk_mq_rq_update_state(req, MQ_RQ_IN_FLIGHT);
> + ret = BLK_EH_HANDLED;
> + goto again;
> + }
> + spin_unlock_irqrestore(req->q->queue_lock, flags);

Does the above chunk introduce a backwards goto from inside a region around
which a spinlock is held to outside that region? Can such a goto result in
anything else than a deadlock?
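
Annotated, the flow I am worried about looks like this:

	spin_lock_irqsave(req->q->queue_lock, flags);
	if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
		...
	} else {
		blk_mq_rq_update_state(req, MQ_RQ_IN_FLIGHT);
		ret = BLK_EH_HANDLED;
		goto again;	/* jumps back to the switch with the
				 * queue lock held and IRQs disabled */
	}
	spin_unlock_irqrestore(req->q->queue_lock, flags);

The BLK_EH_HANDLED case then completes the request and leaves the
switch without this unlock ever being reached.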

Thanks,

Bart.


Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-11 Thread Tejun Heo
Hello,

On Thu, Apr 12, 2018 at 06:43:45AM +0800, Ming Lei wrote:
> On Wed, Apr 11, 2018 at 02:30:07PM -0700, Tejun Heo wrote:
> > Hello, Ming.
> > 
> > On Thu, Apr 12, 2018 at 04:55:29AM +0800, Ming Lei wrote:
> > ...
> > > + spin_lock_irqsave(req->q->queue_lock, flags);
> > > + if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
> > > + blk_mq_rq_update_aborted_gstate(req, 0);
> > > + blk_add_timer(req);
> > 
> > Nothing prevents the above blk_add_timer() racing against the next
> > recycle instance of the request, so this still leaves a small race
> > window.
> 
> OK.
> 
> But this small race window can be avoided by running blk_add_timer(req)
> before blk_mq_rq_update_aborted_gstate(req, 0), can't it?

Not really, because aborted_gstate currently doesn't have any memory
barriers around it, so nothing ensures that blk_add_timer() actually
appears first.  We can either add the matching barriers to the
aborted_gstate update and to where it's read in the normal completion
path, or we can wait for the update to be visible everywhere by
waiting for an RCU grace period (because the reader is RCU protected).
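
Roughly, the first option would be something like (just a sketch,
untested):

	/* timeout path: make the re-armed timer visible first */
	blk_add_timer(req);
	smp_wmb();	/* pairs with an smp_rmb() in the read path */
	blk_mq_rq_update_aborted_gstate(req, 0);

with a matching smp_rmb() after aborted_gstate is read in the normal
completion path; the second option is simply a synchronize_rcu()
between the two calls.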

Thanks.

-- 
tejun


Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-11 Thread Ming Lei
On Wed, Apr 11, 2018 at 02:30:07PM -0700, Tejun Heo wrote:
> Hello, Ming.
> 
> On Thu, Apr 12, 2018 at 04:55:29AM +0800, Ming Lei wrote:
> ...
> > +   spin_lock_irqsave(req->q->queue_lock, flags);
> > +   if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
> > +   blk_mq_rq_update_aborted_gstate(req, 0);
> > +   blk_add_timer(req);
> 
> Nothing prevents the above blk_add_timer() racing against the next
> recycle instance of the request, so this still leaves a small race
> window.

OK.

But this small race window can be avoided by running blk_add_timer(req)
before blk_mq_rq_update_aborted_gstate(req, 0), can't it?

-- 
Ming


Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-11 Thread Tejun Heo
Hello, Ming.

On Thu, Apr 12, 2018 at 04:55:29AM +0800, Ming Lei wrote:
...
> + spin_lock_irqsave(req->q->queue_lock, flags);
> + if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
> + blk_mq_rq_update_aborted_gstate(req, 0);
> + blk_add_timer(req);

Nothing prevents the above blk_add_timer() racing against the next
recycle instance of the request, so this still leaves a small race
window.

Thanks.

-- 
tejun


[PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-11 Thread Ming Lei
The normal request completion can happen before or while
BLK_EH_RESET_TIMER is being handled, and this race may cause the
request to never be completed, since the driver's .timeout() may
keep returning BLK_EH_RESET_TIMER.

This issue can't be fixed completely by the driver, because the
normal completion can happen between .timeout() returning and the
handling of BLK_EH_RESET_TIMER.

This patch fixes the race by introducing the rq state
MQ_RQ_COMPLETE_IN_RESET, and by reading/writing the rq's state while
holding the queue lock. The lock could actually be per-request, but
it isn't worth introducing a dedicated lock for such an unusual event.

Cc: Bart Van Assche 
Cc: Tejun Heo 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Sagi Grimberg 
Cc: Israel Rukshin 
Cc: Max Gurtovoy 
Cc: sta...@vger.kernel.org
Signed-off-by: Ming Lei 
---

This is another way to fix this long-standing issue, and it turns out
this solution is much simpler.
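
Concretely, one interleaving that triggers the race:

	CPU0 (timeout handling)            CPU1 (normal completion)
	-----------------------            ------------------------
	sets ->aborted_gstate
	.timeout() returns
	  BLK_EH_RESET_TIMER
	                                   blk_mq_complete_request()
	                                     sees aborted_gstate set,
	                                     drops the completion
	clears ->aborted_gstate,
	re-arms the timer

The completion has now been lost; on every following expiry the
driver may return BLK_EH_RESET_TIMER again, so the request is never
completed.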

 block/blk-mq.c | 44 +++-
 block/blk-mq.h |  1 +
 2 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0dc9e341c2a7..12e8850e3905 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -630,10 +630,27 @@ void blk_mq_complete_request(struct request *rq)
 * However, that would complicate paths which want to synchronize
 * against us.  Let stay in sync with the issue path so that
 * hctx_lock() covers both issue and completion paths.
+*
+* Cover the complete vs BLK_EH_RESET_TIMER race in the slow path
+* by holding the queue lock.
 */
hctx_lock(hctx, &srcu_idx);
if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
+   else {
+   unsigned long flags;
+   bool need_complete = false;
+
+   spin_lock_irqsave(q->queue_lock, flags);
+   if (!blk_mq_rq_aborted_gstate(rq))
+   need_complete = true;
+   else
+   blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE_IN_RESET);
+   spin_unlock_irqrestore(q->queue_lock, flags);
+
+   if (need_complete)
+   __blk_mq_complete_request(rq);
+   }
hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -814,24 +831,41 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 {
const struct blk_mq_ops *ops = req->q->mq_ops;
enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
+   unsigned long flags;
 
req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
 
if (ops->timeout)
ret = ops->timeout(req, reserved);
 
+again:
switch (ret) {
case BLK_EH_HANDLED:
__blk_mq_complete_request(req);
break;
case BLK_EH_RESET_TIMER:
/*
-* As nothing prevents from completion happening while
-* ->aborted_gstate is set, this may lead to ignored
-* completions and further spurious timeouts.
+* The normal completion may happen while the timeout is
+* being handled, or even after .timeout() has returned, so
+* once the request has been completed we must not reset
+* the timer any more: this request may be seen as
+* BLK_EH_RESET_TIMER again in the next timeout handling,
+* and it has to be completed in this situation.
+*
+* Hold the queue lock to cover reads/writes of the rq's
+* aborted_gstate and normal state, so the race can be
+* avoided completely.
 */
-   blk_mq_rq_update_aborted_gstate(req, 0);
-   blk_add_timer(req);
+   spin_lock_irqsave(req->q->queue_lock, flags);
+   if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
+   blk_mq_rq_update_aborted_gstate(req, 0);
+   blk_add_timer(req);
+   } else {
+   blk_mq_rq_update_state(req, MQ_RQ_IN_FLIGHT);
+   ret = BLK_EH_HANDLED;
+   goto again;
+   }
+   spin_unlock_irqrestore(req->q->queue_lock, flags);
break;
case BLK_EH_NOT_HANDLED:
break;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 88c558f71819..6dc242fc785a 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -35,6 +35,7 @@ enum mq_rq_state {
MQ_RQ_IDLE  = 0,
MQ_RQ_IN_FLIGHT = 1,
MQ_RQ_COMPLETE  = 2,
+   MQ_RQ_COMPLETE_IN_RESET = 3,
 
MQ_RQ_STATE_BITS= 2,
MQ_RQ_STATE_MASK= (1 << MQ_RQ_STATE_BITS) - 1,
-- 
2.9.5