Re: [PATCH v9 2/2] blk-mq: Rework blk-mq timeout handling again

2018-05-16 Thread Bart Van Assche
On Wed, 2018-05-16 at 19:31 +0200, h...@lst.de wrote:
> On Wed, May 16, 2018 at 04:47:54PM +, Bart Van Assche wrote:
> > I think your patch changes the order of changing the request state and
> > calling mod_timer(). In my patch the request state and the deadline are
> > updated first and mod_timer() is called afterwards. I think your patch
> > changes the order of these operations into the following:
> > (1) __blk_mq_start_request() sets the request deadline.
> > (2) __blk_mq_start_request() calls __blk_add_timer() which in turn calls
> > mod_timer().
> > (3) __blk_mq_start_request() changes the request state into MQ_RQ_IN_FLIGHT.
> > 
> > In the unlikely event of a significant delay between (2) and (3) it can
> > happen that the timer fires and examines and ignores the request because
> > its state differs from MQ_RQ_IN_FLIGHT. If the request for which this
> > happened times out its timeout will only be handled the next time
> > blk_mq_timeout_work() is called. Is this the behavior you intended?
> 
> We can move the timer manipulation after the change easily I think.
> It would make sense to add comments explaining the ordering.

Hello Christoph,

I'm afraid that could lead to mod_timer() being called in another order than
intended. If e.g. the code that handles BLK_EH_RESET_TIMER changes the request
state first to in-flight and next calls mod_timer() then it can happen that
another context completes and restarts the request, resulting in a concurrent
mod_timer() call. I'm not sure reordering of the mod_timer() calls would result
in correct behavior.

Bart.
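[Editor's note: the interleaving Bart describes can be modeled in a few lines of userspace C. This is a sketch, not the kernel code: `mod_timer_stub()` and the two path functions are invented names, and the race is replayed as a fixed sequential interleaving rather than real concurrency.]

```c
#include <assert.h>

enum mq_rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE };

struct request {
	enum mq_rq_state state;
	int timer_arms;		/* how many times mod_timer() ran */
	int last_armer;		/* 1 = timeout path, 2 = restart path */
};

/* Stand-in for mod_timer(): records who armed the timer last. */
static void mod_timer_stub(struct request *rq, int who)
{
	rq->timer_arms++;
	rq->last_armer = who;
}

/* BLK_EH_RESET_TIMER path with the proposed reordering: the state is
 * published first, and mod_timer() is deferred until after a window. */
static void eh_reset_timer_state_first(struct request *rq)
{
	rq->state = MQ_RQ_IN_FLIGHT;
	/* window: another context can complete and restart the request here */
}

/* Completion + restart running inside that window. */
static void complete_and_restart(struct request *rq)
{
	if (rq->state == MQ_RQ_IN_FLIGHT) {
		rq->state = MQ_RQ_COMPLETE;
		rq->state = MQ_RQ_IN_FLIGHT;	/* requeued and restarted */
		mod_timer_stub(rq, 2);
	}
}
```

Replaying the interleaving shows two mod_timer() calls with the stale timeout-path call landing last, which is the out-of-order arming Bart is worried about.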


Re: [PATCH v9 2/2] blk-mq: Rework blk-mq timeout handling again

2018-05-16 Thread h...@lst.de
On Wed, May 16, 2018 at 04:47:54PM +, Bart Van Assche wrote:
> I think your patch changes the order of changing the request state and
> calling mod_timer(). In my patch the request state and the deadline are
> updated first and mod_timer() is called afterwards. I think your patch
> changes the order of these operations into the following:
> (1) __blk_mq_start_request() sets the request deadline.
> (2) __blk_mq_start_request() calls __blk_add_timer() which in turn calls
> mod_timer().
> (3) __blk_mq_start_request() changes the request state into MQ_RQ_IN_FLIGHT.
> 
> In the unlikely event of a significant delay between (2) and (3) it can
> happen that the timer fires and examines and ignores the request because
> its state differs from MQ_RQ_IN_FLIGHT. If the request for which this
> happened times out its timeout will only be handled the next time
> blk_mq_timeout_work() is called. Is this the behavior you intended?

We can move the timer manipulation after the change easily I think.
It would make sense to add comments explaining the ordering.


Re: [PATCH v9 2/2] blk-mq: Rework blk-mq timeout handling again

2018-05-16 Thread Bart Van Assche
On Wed, 2018-05-16 at 18:24 +0200, h...@lst.de wrote:
> On Wed, May 16, 2018 at 04:17:42PM +, Bart Van Assche wrote:
> > There is another reason the deadline is included in the atomic operation,
> > namely to handle races between the BLK_EH_RESET_TIMER case in 
> > blk_mq_rq_timed_out()
> > and blk_mq_complete_request(). I don't think that race is addressed 
> > properly by
> > your patch. I will see what I can do to address that race without using 
> > 64-bit
> > atomic operations.
> 
> I might be missing something here, so please help me understand
> what is missing.
> 
> If we restart the timer in blk_mq_rq_timed_out we also bump the
> generation at the same time as we reset the deadline in your old
> patch.  With this patch we only bump the generation, but that should
> be enough to address the rest, or not?

Hello Christoph,

I think your patch changes the order of changing the request state and
calling mod_timer(). In my patch the request state and the deadline are
updated first and mod_timer() is called afterwards. I think your patch
changes the order of these operations into the following:
(1) __blk_mq_start_request() sets the request deadline.
(2) __blk_mq_start_request() calls __blk_add_timer() which in turn calls
mod_timer().
(3) __blk_mq_start_request() changes the request state into MQ_RQ_IN_FLIGHT.

In the unlikely event of a significant delay between (2) and (3) it can
happen that the timer fires and examines and ignores the request because
its state differs from MQ_RQ_IN_FLIGHT. If the request for which this
happened times out its timeout will only be handled the next time
blk_mq_timeout_work() is called. Is this the behavior you intended?

Thanks,

Bart.
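[Editor's note: the window between steps (2) and (3) can be sketched as a userspace model of the per-request check done by the timeout scan. Names and the deadline comparison are illustrative, not copied from the kernel.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum mq_rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE };

struct request {
	enum mq_rq_state state;
	uint32_t deadline;	/* jiffies */
};

/* Model of the timeout scan's per-request test: a request only counts
 * as expired if it is observed in MQ_RQ_IN_FLIGHT, so a request whose
 * timer was armed before step (3) is silently skipped. */
static bool check_expired(const struct request *rq, uint32_t now)
{
	if (rq->state != MQ_RQ_IN_FLIGHT)
		return false;
	return (int32_t)(now - rq->deadline) >= 0;	/* time_after_eq() */
}
```

If the timer fires inside the window, the scan ignores the late request; only a later blk_mq_timeout_work() run, after step (3) has landed, picks it up.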


Re: [PATCH v9 2/2] blk-mq: Rework blk-mq timeout handling again

2018-05-16 Thread h...@lst.de
On Wed, May 16, 2018 at 04:17:42PM +, Bart Van Assche wrote:
> There is another reason the deadline is included in the atomic operation,
> namely to handle races between the BLK_EH_RESET_TIMER case in
> blk_mq_rq_timed_out() and blk_mq_complete_request(). I don't think that
> race is addressed properly by your patch. I will see what I can do to
> address that race without using 64-bit atomic operations.

I might be missing something here, so please help me understand
what is missing.

If we restart the timer in blk_mq_rq_timed_out we also bump the
generation at the same time as we reset the deadline in your old
patch.  With this patch we only bump the generation, but that should
be enough to address the rest, or not?


Re: [PATCH v9 2/2] blk-mq: Rework blk-mq timeout handling again

2018-05-16 Thread Bart Van Assche
On Wed, 2018-05-16 at 14:51 +0200, Christoph Hellwig wrote:
> I've been looking at this carefully, and I don't think we need cmpxchg64
> at all, and we don't need anywhere near as many cmpxchg operations either.
> 
> The only reason to include the deadline in the atomic operation is the
> blk_abort_request case, as the blk_mq_add_timer never modifies the
> deadline of a request that someone could be racing with.  So if we
> introduce a new aborted state for use by blk_abort_request we can modify
> the deadline separately (and in fact have a common field with the legacy
> path).

There is another reason the deadline is included in the atomic operation,
namely to handle races between the BLK_EH_RESET_TIMER case in
blk_mq_rq_timed_out() and blk_mq_complete_request(). I don't think that
race is addressed properly by your patch. I will see what I can do to
address that race without using 64-bit atomic operations.

Bart.



Re: [PATCH v9 2/2] blk-mq: Rework blk-mq timeout handling again

2018-05-16 Thread Christoph Hellwig
I've been looking at this carefully, and I don't think we need cmpxchg64
at all, and we don't need anywhere near as many cmpxchg operations either.

The only reason to include the deadline in the atomic operation is the
blk_abort_request case, as the blk_mq_add_timer never modifies the
deadline of a request that someone could be racing with.  So if we
introduce a new aborted state for use by blk_abort_request we can modify
the deadline separately (and in fact have a common field with the legacy
path).

Also anytime we call blk_mq_change_rq_state and don't care about the
previous state it is a pretty clear indicator that we aren't racing
with anyone else.  For blk_mq_free_request that obviously is the case
as the request must have completed before, and for the requeue case
we assume that.  If it isn't the case we have a deeper problem going
way back.

Below is a lightly tested patch on top of yours to implement these
suggestions, and also making sure we only access the generation/state
using WRITE_ONCE, READ_ONCE and cmpxchg to avoid any races due to
reordering.
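[Editor's note: a rough userspace model of the generation/state word the patch below introduces, with C11 stdatomic standing in for the kernel's cmpxchg()/READ_ONCE()/WRITE_ONCE(). The 2-bit/30-bit field split is an assumption for illustration.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

enum mq_rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE };

/* State and generation share one 32-bit word so both are read and
 * updated together in a single atomic operation. */
union request_sag {
	struct {
		uint32_t state      :  2;
		uint32_t generation : 30;
	};
	uint32_t val;
};

struct request { _Atomic uint32_t sag; };

/* Model of blk_mq_change_rq_state(): succeeds only if the state is
 * still old_state and the generation has not moved under us. */
static bool change_rq_state(struct request *rq, unsigned int old_state,
			    unsigned int new_state)
{
	union request_sag old = { .val = atomic_load(&rq->sag) };
	union request_sag new = old;

	old.state = old_state;
	new.state = new_state;
	return atomic_compare_exchange_strong(&rq->sag, &old.val, new.val);
}

/* Model of blk_mq_set_rq_state(): a plain store for the paths that
 * cannot race, bumping the generation on each transition to in-flight. */
static void set_rq_state(struct request *rq, unsigned int new_state)
{
	union request_sag sag = { .val = atomic_load(&rq->sag) };

	if (new_state == MQ_RQ_IN_FLIGHT)
		sag.generation++;
	sag.state = new_state;
	atomic_store(&rq->sag, sag.val);
}
```

A transition based on a stale state loses the compare-and-exchange, which is what lets the timeout code detect a completion that slipped in between its reads.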

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 40c9aa085613..5960e14025e6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -274,6 +274,40 @@ bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
 }
 EXPORT_SYMBOL(blk_mq_can_queue);
 
+/**
+ * blk_mq_change_rq_state - atomically test and set request state
+ * @rq: Request pointer.
+ * @old_state: Old request state.
+ * @new_state: New request state.
+ *
+ * Returns %true if and only if the old state was @old and if the state has
+ * been changed into @new.
+ */
+static inline bool blk_mq_change_rq_state(struct request *rq,
+   enum mq_rq_state old_state, enum mq_rq_state new_state)
+{
+   union request_sag sag = READ_ONCE(rq->sag);
+   union request_sag old = sag, new = sag;
+
+   BUILD_BUG_ON(sizeof(union request_sag) != sizeof(u32));
+
+   old.state = old_state;
+   new.state = new_state;
+   return cmpxchg(&rq->sag.val, old.val, new.val) == old.val;
+}
+
+static inline void blk_mq_set_rq_state(struct request *rq,
+   enum mq_rq_state new_state)
+{
+   union request_sag sag = READ_ONCE(rq->sag);
+
+   if (new_state == MQ_RQ_IN_FLIGHT)
+   sag.generation++;
+   sag.state = new_state;
+
+   WRITE_ONCE(rq->sag, sag);
+}
+
 static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
unsigned int tag, unsigned int op)
 {
@@ -318,7 +352,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
rq->special = NULL;
/* tag was already set */
rq->extra_len = 0;
-   WARN_ON_ONCE(rq->das.state != MQ_RQ_IDLE);
+   WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
 
INIT_LIST_HEAD(&rq->timeout_list);
rq->timeout = 0;
@@ -494,8 +528,7 @@ void blk_mq_free_request(struct request *rq)
if (blk_rq_rl(rq))
blk_put_rl(blk_rq_rl(rq));
 
-   if (!blk_mq_change_rq_state(rq, blk_mq_rq_state(rq), MQ_RQ_IDLE))
-   WARN_ON_ONCE(true);
+   blk_mq_set_rq_state(rq, MQ_RQ_IDLE);
if (rq->tag != -1)
blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
if (sched_tag != -1)
@@ -619,6 +652,14 @@ int blk_mq_request_started(struct request *rq)
 }
 EXPORT_SYMBOL_GPL(blk_mq_request_started);
 
+static inline void __blk_mq_start_request(struct request *rq)
+{
+   rq->__deadline = jiffies +
+   (rq->timeout ? rq->timeout : rq->q->rq_timeout);
+   __blk_add_timer(rq, rq->__deadline);
+   blk_mq_set_rq_state(rq, MQ_RQ_IN_FLIGHT);
+}
+
 void blk_mq_start_request(struct request *rq)
 {
struct request_queue *q = rq->q;
@@ -636,7 +677,7 @@ void blk_mq_start_request(struct request *rq)
wbt_issue(q->rq_wb, rq);
}
 
-   blk_mq_add_timer(rq, MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT);
+   __blk_mq_start_request(rq);
 
if (q->dma_drain_size && blk_rq_bytes(rq)) {
/*
@@ -652,16 +693,18 @@ EXPORT_SYMBOL(blk_mq_start_request);
 static void __blk_mq_requeue_request(struct request *rq)
 {
struct request_queue *q = rq->q;
-   enum mq_rq_state old_state = blk_mq_rq_state(rq);
 
blk_mq_put_driver_tag(rq);
 
trace_block_rq_requeue(q, rq);
wbt_requeue(q->rq_wb, rq);
 
-   if (old_state != MQ_RQ_IDLE) {
-   if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE))
-   WARN_ON_ONCE(true);
+   /*
+* Note: the driver must ensure this doesn't race with completions or
+* timeouts.
+*/
+   if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
+   blk_mq_set_rq_state(rq, MQ_RQ_IDLE);
if (q->dma_drain_size && blk_rq_bytes(rq))
rq->nr_phys_segments--;
}
@@ -778,7 +821,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
__blk_mq_complete_request(req);
break;
  

Re: [PATCH v9 2/2] blk-mq: Rework blk-mq timeout handling again

2018-05-15 Thread Sebastian Ott
On Mon, 14 May 2018, Bart Van Assche wrote:
> Recently the blk-mq timeout handling code was reworked. See also Tejun
> Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
> This patch reworks the blk-mq timeout handling code again. The timeout
> handling code is simplified by introducing a state machine per request.
> This change avoids that the blk-mq timeout handling code ignores
> completions that occur after blk_mq_check_expired() has been called and
> before blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block
> driver timeout handler always returns BLK_EH_RESET_TIMER then the result
> will be that the request never terminates.
> 
> Fix this race as follows:
> - Replace the __deadline member of struct request by a new member
>   called das that contains the generation number, state and deadline.
>   Only 32 bits are used for the deadline field such that all three
>   fields occupy only 64 bits. This change reduces the maximum supported
>   request timeout value from (2**63/HZ) to (2**31/HZ).
> - Remove all request member variables that became superfluous due to
>   this change: gstate, gstate_seq and aborted_gstate_sync.
> - Remove the request state information that became superfluous due to
>   this patch, namely RQF_MQ_TIMEOUT_EXPIRED.
> - Remove the code that became superfluous due to this change, namely
>   the RCU lock and unlock statements in blk_mq_complete_request() and
>   also the synchronize_rcu() call in the timeout handler.
> 
> Notes:
> - A spinlock is used to protect atomic changes of rq->das on those
>   architectures that do not provide a cmpxchg64() implementation.
> - Atomic instructions are only used to update the request state if
>   a concurrent request state change could be in progress.
> - blk_add_timer() has been split into two functions - one for the
>   legacy block layer and one for blk-mq.
> 

I tested your patch on top of block/for-next (with forced timeouts) -
works as expected. The lockdep warnings with regard to gstate_seq are
gone (no surprise, with gstate_seq itself gone) - thanks for that!

Regards,
Sebastian



[PATCH v9 2/2] blk-mq: Rework blk-mq timeout handling again

2018-05-14 Thread Bart Van Assche
Recently the blk-mq timeout handling code was reworked. See also Tejun
Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
This patch reworks the blk-mq timeout handling code again. The timeout
handling code is simplified by introducing a state machine per request.
This change avoids that the blk-mq timeout handling code ignores
completions that occur after blk_mq_check_expired() has been called and
before blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block
driver timeout handler always returns BLK_EH_RESET_TIMER then the result
will be that the request never terminates.

Fix this race as follows:
- Replace the __deadline member of struct request by a new member
  called das that contains the generation number, state and deadline.
  Only 32 bits are used for the deadline field such that all three
  fields occupy only 64 bits. This change reduces the maximum supported
  request timeout value from (2**63/HZ) to (2**31/HZ).
- Remove all request member variables that became superfluous due to
  this change: gstate, gstate_seq and aborted_gstate_sync.
- Remove the request state information that became superfluous due to
  this patch, namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the code that became superfluous due to this change, namely
  the RCU lock and unlock statements in blk_mq_complete_request() and
  also the synchronize_rcu() call in the timeout handler.

Notes:
- A spinlock is used to protect atomic changes of rq->das on those
  architectures that do not provide a cmpxchg64() implementation.
- Atomic instructions are only used to update the request state if
  a concurrent request state change could be in progress.
- blk_add_timer() has been split into two functions - one for the
  legacy block layer and one for blk-mq.
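[Editor's note: a minimal userspace sketch of the "das" word described above, packing generation, state and a 32-bit deadline into one 64-bit value. The exact field widths and layout are assumptions based on this cover letter, not copied from the patch.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

enum mq_rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE };

/* Only 32 bits are used for the deadline, which is what limits the
 * maximum request timeout to 2**31/HZ as noted above. */
union blk_deadline_and_state {
	struct {
		uint64_t generation : 30;
		uint64_t state      :  2;
		uint64_t deadline   : 32;	/* jiffies */
	};
	uint64_t val;
};

/* One 64-bit compare-and-exchange covers state, generation and deadline
 * together, so the BLK_EH_RESET_TIMER path and blk_mq_complete_request()
 * racing on the same request resolve atomically; per the notes, archs
 * without cmpxchg64() would take a spinlock around the same
 * read-modify-write instead. */
static bool das_change(_Atomic uint64_t *das,
		       union blk_deadline_and_state old,
		       union blk_deadline_and_state new)
{
	return atomic_compare_exchange_strong(das, &old.val, new.val);
}
```

An update built from a stale snapshot of the word fails, which is the property the cover letter relies on to close the reset-timer vs. completion race.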

Signed-off-by: Bart Van Assche 
Cc: Tejun Heo 
Cc: Christoph Hellwig 
Cc: Jianchao Wang 
Cc: Ming Lei 
Cc: Sebastian Ott 
Cc: Sagi Grimberg 
Cc: Israel Rukshin 
Cc: Max Gurtovoy 
---
 block/blk-core.c   |   6 --
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c | 183 ++---
 block/blk-mq.h | 117 +++
 block/blk-timeout.c|  95 ++---
 block/blk.h|  14 ++--
 include/linux/blkdev.h |  47 ++---
 7 files changed, 230 insertions(+), 233 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 341501c5e239..a7301fcda734 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -198,12 +198,6 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
rq->internal_tag = -1;
rq->start_time_ns = ktime_get_ns();
rq->part = NULL;
-   seqcount_init(&rq->gstate_seq);
-   u64_stats_init(&rq->aborted_gstate_sync);
-   /*
-* See comment of blk_mq_init_request
-*/
-   WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 3080e18cb859..ffa622366922 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -344,7 +344,6 @@ static const char *const rqf_name[] = {
RQF_NAME(STATS),
RQF_NAME(SPECIAL_PAYLOAD),
RQF_NAME(ZONE_WRITE_LOCKED),
-   RQF_NAME(MQ_TIMEOUT_EXPIRED),
RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 64630caaf27e..40c9aa085613 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -318,7 +318,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
rq->special = NULL;
/* tag was already set */
rq->extra_len = 0;
-   rq->__deadline = 0;
+   WARN_ON_ONCE(rq->das.state != MQ_RQ_IDLE);
 
INIT_LIST_HEAD(&rq->timeout_list);
rq->timeout = 0;
@@ -494,7 +494,8 @@ void blk_mq_free_request(struct request *rq)
if (blk_rq_rl(rq))
blk_put_rl(blk_rq_rl(rq));
 
-   blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+   if (!blk_mq_change_rq_state(rq, blk_mq_rq_state(rq), MQ_RQ_IDLE))
+   WARN_ON_ONCE(true);
if (rq->tag != -1)
blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
if (sched_tag != -1)
@@ -547,8 +548,7 @@ static void __blk_mq_complete_request(struct request *rq)
bool shared = false;
int cpu;
 
-   WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
-   blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
+   WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
 
if (rq->internal_tag != -1)
blk_mq_sched_completed_request(rq);
@@ -593,36 +593,6 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
*srcu_idx = srcu_read_lock(hctx->srcu);
 }
 
-static