Re: [PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-21 Thread Jens Axboe
On 4/21/18 9:05 AM, Bart Van Assche wrote:
> On Sat, 2018-04-21 at 22:55 +0800, jianchao.wang wrote:
>> So we have a __deadline as below:
>>  
>> deadline gen state   
>>> __||_|
>>
>>\___ __/ 
>>V
>>   granularity 
>>
>> and even we usually have a 30~60s timeout,
>> but we should support a 1s granularity at least.
>> How to decide the granularity ?
> 
> Your diagram shows that the uppermost bits are used for the deadline. What I
> proposed is to use the lower 32 bits for the deadline such that we again have
> a resolution of 1/HZ.

It honestly doesn't matter how we do it, granularity could be exactly
the same. But yeah, storing it at the lower bits is perfectly fine,
either way works. Only difference is in the encoding, functionally
they can be identical.

-- 
Jens Axboe



Re: [PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-21 Thread Bart Van Assche
On Sat, 2018-04-21 at 22:55 +0800, jianchao.wang wrote:
> So we have a __deadline as below:
>  
> deadline gen state   
> > __||_|
> 
>\___ __/ 
>V
>   granularity 
> 
> and even we usually have a 30~60s timeout,
> but we should support a 1s granularity at least.
> How to decide the granularity ?

Your diagram shows that the uppermost bits are used for the deadline. What I
proposed is to use the lower 32 bits for the deadline such that we again have
a resolution of 1/HZ.

Bart.




Re: [PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-21 Thread jianchao.wang


On 04/21/2018 10:10 PM, Jens Axboe wrote:
> On 4/21/18 7:34 AM, jianchao.wang wrote:
>> Hi Bart
>>
>> Thanks for your kindly response.
>>
>> On 04/20/2018 10:11 PM, Bart Van Assche wrote:
>>> On Fri, 2018-04-20 at 14:55 +0800, jianchao.wang wrote:
 Hi Bart

 On 04/20/2018 12:43 AM, Bart Van Assche wrote:
> Use the deadline instead of the request generation to detect whether
>   or not a request timer fired after reinitialization of a request

 Maybe use deadline to do this is not suitable.

 Let's think of the following scenario.

 T1/T2 times in nano seconds
 J1/J2 times in jiffies

 rq is started at T1,J1
 blk_mq_timeout_work
   -> blk_mq_check_expired
 -> save rq->__deadline (J1 + MQ_RQ_IN_FLIGHT)

  rq is completed and freed 
  rq is allocated and started 
 again on T2
  rq->__deadline = J2 + 
 MQ_RQ_IN_FLIGHT
   -> synchronize srcu/rcu
   -> blk_mq_terminate_expired
 -> rq->__deadline (J2 + MQ_RQ_IN_FLIGHT)

 1. deadline = rq->__deadline & ~RQ_STATE_MASK (0x3)
if J2-J1 < 4 jiffies, we will get the same deadline value.

 2. even if we do some bit shift when save and get deadline
if T2 - T1 < time of 1 jiffies, we sill get the same deadline.
>>>
>>> Hello Jianchao,
>>>
>>> How about using the upper 16 or 32 bits of rq->__deadline to store a 
>>> generation
>>> number? I don't know any block driver for which the product of (deadline in
>>> seconds) and HZ exceeds (1 << 32).
>>>
>> yes, we don't need so long timeout value.
>> However, req->__deadline is an absolute time, not a relative one, 
>> its type is unsigned long variable which is same with jiffies.
>> If we reserve some bits (not just 1 or 2 bits) of req->__deadline for 
>> generation number,
>> how to handle it when mod_timer ?
> 
> We don't need to support timeouts to the full granularity of jiffies.
> Most normal commands are in the 30-60s range, and some lower level
> commands may take minutes. We're well within the range of chopping
> off some bits and not having to worry about that at all.
> 
Yes.
So we have a __deadline as below:
 
deadline gen state   
|__||_|
   \___ __/ 
   V
  granularity 

and even we usually have a 30~60s timeout,
but we should support a 1s granularity at least.
How to decide the granularity ?

Thanks
Jianchao


Re: [PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-21 Thread Jens Axboe
On 4/21/18 7:34 AM, jianchao.wang wrote:
> Hi Bart
> 
> Thanks for your kindly response.
> 
> On 04/20/2018 10:11 PM, Bart Van Assche wrote:
>> On Fri, 2018-04-20 at 14:55 +0800, jianchao.wang wrote:
>>> Hi Bart
>>>
>>> On 04/20/2018 12:43 AM, Bart Van Assche wrote:
 Use the deadline instead of the request generation to detect whether
   or not a request timer fired after reinitialization of a request
>>>
>>> Maybe use deadline to do this is not suitable.
>>>
>>> Let's think of the following scenario.
>>>
>>> T1/T2 times in nano seconds
>>> J1/J2 times in jiffies
>>>
>>> rq is started at T1,J1
>>> blk_mq_timeout_work
>>>   -> blk_mq_check_expired
>>> -> save rq->__deadline (J1 + MQ_RQ_IN_FLIGHT)
>>>
>>>  rq is completed and freed 
>>>  rq is allocated and started 
>>> again on T2
>>>  rq->__deadline = J2 + 
>>> MQ_RQ_IN_FLIGHT
>>>   -> synchronize srcu/rcu
>>>   -> blk_mq_terminate_expired
>>> -> rq->__deadline (J2 + MQ_RQ_IN_FLIGHT)
>>>
>>> 1. deadline = rq->__deadline & ~RQ_STATE_MASK (0x3)
>>>if J2-J1 < 4 jiffies, we will get the same deadline value.
>>>
>>> 2. even if we do some bit shift when save and get deadline
>>>if T2 - T1 < time of 1 jiffies, we sill get the same deadline.
>>
>> Hello Jianchao,
>>
>> How about using the upper 16 or 32 bits of rq->__deadline to store a 
>> generation
>> number? I don't know any block driver for which the product of (deadline in
>> seconds) and HZ exceeds (1 << 32).
>>
> yes, we don't need so long timeout value.
> However, req->__deadline is an absolute time, not a relative one, 
> its type is unsigned long variable which is same with jiffies.
> If we reserve some bits (not just 1 or 2 bits) of req->__deadline for 
> generation number,
> how to handle it when mod_timer ?

We don't need to support timeouts to the full granularity of jiffies.
Most normal commands are in the 30-60s range, and some lower level
commands may take minutes. We're well within the range of chopping
off some bits and not having to worry about that at all.

-- 
Jens Axboe



Re: [PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-21 Thread jianchao.wang
Hi Bart

Thanks for your kindly response.

On 04/20/2018 10:11 PM, Bart Van Assche wrote:
> On Fri, 2018-04-20 at 14:55 +0800, jianchao.wang wrote:
>> Hi Bart
>>
>> On 04/20/2018 12:43 AM, Bart Van Assche wrote:
>>> Use the deadline instead of the request generation to detect whether
>>>   or not a request timer fired after reinitialization of a request
>>
>> Maybe use deadline to do this is not suitable.
>>
>> Let's think of the following scenario.
>>
>> T1/T2 times in nano seconds
>> J1/J2 times in jiffies
>>
>> rq is started at T1,J1
>> blk_mq_timeout_work
>>   -> blk_mq_check_expired
>> -> save rq->__deadline (J1 + MQ_RQ_IN_FLIGHT)
>>
>>  rq is completed and freed 
>>  rq is allocated and started 
>> again on T2
>>  rq->__deadline = J2 + 
>> MQ_RQ_IN_FLIGHT
>>   -> synchronize srcu/rcu
>>   -> blk_mq_terminate_expired
>> -> rq->__deadline (J2 + MQ_RQ_IN_FLIGHT)
>>
>> 1. deadline = rq->__deadline & ~RQ_STATE_MASK (0x3)
>>if J2-J1 < 4 jiffies, we will get the same deadline value.
>>
>> 2. even if we do some bit shift when save and get deadline
>>if T2 - T1 < time of 1 jiffies, we sill get the same deadline.
> 
> Hello Jianchao,
> 
> How about using the upper 16 or 32 bits of rq->__deadline to store a 
> generation
> number? I don't know any block driver for which the product of (deadline in
> seconds) and HZ exceeds (1 << 32).
> 
yes, we don't need so long timeout value.
However, req->__deadline is an absolute time, not a relative one, 
its type is unsigned long variable which is same with jiffies.
If we reserve some bits (not just 1 or 2 bits) of req->__deadline for 
generation number,
how to handle it when mod_timer ?

Thanks
Jianchao
 


Re: [PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-20 Thread Bart Van Assche
On Fri, 2018-04-20 at 14:55 +0800, jianchao.wang wrote:
> Hi Bart
> 
> On 04/20/2018 12:43 AM, Bart Van Assche wrote:
> > Use the deadline instead of the request generation to detect whether
> >   or not a request timer fired after reinitialization of a request
> 
> Maybe use deadline to do this is not suitable.
> 
> Let's think of the following scenario.
> 
> T1/T2 times in nano seconds
> J1/J2 times in jiffies
> 
> rq is started at T1,J1
> blk_mq_timeout_work
>   -> blk_mq_check_expired
> -> save rq->__deadline (J1 + MQ_RQ_IN_FLIGHT)
> 
>  rq is completed and freed 
>  rq is allocated and started 
> again on T2
>  rq->__deadline = J2 + 
> MQ_RQ_IN_FLIGHT
>   -> synchronize srcu/rcu
>   -> blk_mq_terminate_expired
> -> rq->__deadline (J2 + MQ_RQ_IN_FLIGHT)
> 
> 1. deadline = rq->__deadline & ~RQ_STATE_MASK (0x3)
>if J2-J1 < 4 jiffies, we will get the same deadline value.
> 
> 2. even if we do some bit shift when save and get deadline
>if T2 - T1 < time of 1 jiffies, we sill get the same deadline.

Hello Jianchao,

How about using the upper 16 or 32 bits of rq->__deadline to store a generation
number? I don't know any block driver for which the product of (deadline in
seconds) and HZ exceeds (1 << 32).

Thanks,

Bart.





Re: [PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-20 Thread jianchao.wang
Hi Bart

On 04/20/2018 12:43 AM, Bart Van Assche wrote:
> Use the deadline instead of the request generation to detect whether
>   or not a request timer fired after reinitialization of a request

Maybe use deadline to do this is not suitable.

Let's think of the following scenario.

T1/T2 times in nano seconds
J1/J2 times in jiffies

rq is started at T1,J1
blk_mq_timeout_work
  -> blk_mq_check_expired
-> save rq->__deadline (J1 + MQ_RQ_IN_FLIGHT)

 rq is completed and freed 
 rq is allocated and started again 
on T2
 rq->__deadline = J2 + 
MQ_RQ_IN_FLIGHT
  -> synchronize srcu/rcu
  -> blk_mq_terminate_expired
-> rq->__deadline (J2 + MQ_RQ_IN_FLIGHT)

1. deadline = rq->__deadline & ~RQ_STATE_MASK (0x3)
   if J2-J1 < 4 jiffies, we will get the same deadline value.

2. even if we do some bit shift when save and get deadline
   if T2 - T1 < time of 1 jiffies, we sill get the same deadline.

Thanks
Jianchao


Re: [PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-19 Thread Jens Axboe
On 4/19/18 10:43 AM, Bart Van Assche wrote:
> The blk-mq timeout handling code ignores completions that occur after
> blk_mq_check_expired() has been called and before blk_mq_rq_timed_out()
> has reset rq->aborted_gstate. If a block driver timeout handler always
> returns BLK_EH_RESET_TIMER then the result will be that the request
> never terminates.
> 
> Fix this race as follows:
> - Use the deadline instead of the request generation to detect whether
>   or not a request timer fired after reinitialization of a request.
> - Store the request state in the lowest two bits of the deadline instead
>   of the lowest two bits of 'gstate'.
> - Rename MQ_RQ_STATE_MASK into RQ_STATE_MASK and change it from an
>   enumeration member into a #define such that its type can be changed
>   into unsigned long. That allows to write & ~RQ_STATE_MASK instead of
>   ~(unsigned long)RQ_STATE_MASK.
> - Remove all request member variables that became superfluous due to
>   this change: gstate, gstate_seq and aborted_gstate_sync.
> - Remove the request state information that became superfluous due to this
>   patch, namely RQF_MQ_TIMEOUT_EXPIRED.
> - Remove the code that became superfluous due to this change, namely
>   the RCU lock and unlock statements in blk_mq_complete_request() and
>   also the synchronize_rcu() call in the timeout handler.

I like this approach, it's simpler and easier to understand. Net
reduction in code is nice too. I've run it through the testing and it
passes. Unless people holler loudly, I'd like to include this in 4.17.

-- 
Jens Axboe



[PATCH v6] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-19 Thread Bart Van Assche
The blk-mq timeout handling code ignores completions that occur after
blk_mq_check_expired() has been called and before blk_mq_rq_timed_out()
has reset rq->aborted_gstate. If a block driver timeout handler always
returns BLK_EH_RESET_TIMER then the result will be that the request
never terminates.

Fix this race as follows:
- Use the deadline instead of the request generation to detect whether
  or not a request timer fired after reinitialization of a request.
- Store the request state in the lowest two bits of the deadline instead
  of the lowest two bits of 'gstate'.
- Rename MQ_RQ_STATE_MASK into RQ_STATE_MASK and change it from an
  enumeration member into a #define such that its type can be changed
  into unsigned long. That allows to write & ~RQ_STATE_MASK instead of
  ~(unsigned long)RQ_STATE_MASK.
- Remove all request member variables that became superfluous due to
  this change: gstate, gstate_seq and aborted_gstate_sync.
- Remove the request state information that became superfluous due to this
  patch, namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the code that became superfluous due to this change, namely
  the RCU lock and unlock statements in blk_mq_complete_request() and
  also the synchronize_rcu() call in the timeout handler.

Signed-off-by: Bart Van Assche 
Cc: Tejun Heo 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Sagi Grimberg 
Cc: Israel Rukshin ,
Cc: Max Gurtovoy 
Cc:  # v4.16
---

Changes compared to v5:
- Restored the synchronize_rcu() call between marking a request for timeout
  handling and the actual timeout handling to avoid that timeout handling
  starts while .queue_rq() is still in progress if the timeout is very short.
- Only use cmpxchg() if another context could attempt to change the request
  state concurrently. Use WRITE_ONCE() otherwise.

Changes compared to v4:
- Addressed multiple review comments from Christoph. The most important are
  that atomic_long_cmpxchg() has been changed into cmpxchg() and also that
  there is now a nice and clean split between the legacy and blk-mq versions
  of blk_add_timer().
- Changed the patch name and modified the patch description because there is
  disagreement about whether or not the v4.16 blk-mq core can complete a
  single request twice. Kept the "Cc: stable" tag because of
  https://bugzilla.kernel.org/show_bug.cgi?id=199077.

Changes compared to v3 (see also 
https://www.mail-archive.com/linux-block@vger.kernel.org/msg20073.html):
- Removed the spinlock again that was introduced to protect the request state.
  v4 uses atomic_long_cmpxchg() instead.
- Split __deadline into two variables - one for the legacy block layer and one
  for blk-mq.

Changes compared to v2 
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg18338.html):
- Rebased and retested on top of kernel v4.16.

Changes compared to v1 
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html):
- Removed the gstate and aborted_gstate members of struct request and used
  the __deadline member to encode both the generation and state information.


 block/blk-core.c   |   6 --
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c | 158 ++---
 block/blk-mq.h |  85 +-
 block/blk-timeout.c|  89 
 block/blk.h|  13 ++--
 include/linux/blkdev.h |  29 +++--
 7 files changed, 154 insertions(+), 227 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index de90ecab61cd..730a8e3be7ce 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -200,12 +200,6 @@ void blk_rq_init(struct request_queue *q, struct request 
*rq)
rq->start_time = jiffies;
set_start_time_ns(rq);
rq->part = NULL;
-   seqcount_init(>gstate_seq);
-   u64_stats_init(>aborted_gstate_sync);
-   /*
-* See comment of blk_mq_init_request
-*/
-   WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index adb8d6f00098..529383841b3b 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -346,7 +346,6 @@ static const char *const rqf_name[] = {
RQF_NAME(STATS),
RQF_NAME(SPECIAL_PAYLOAD),
RQF_NAME(ZONE_WRITE_LOCKED),
-   RQF_NAME(MQ_TIMEOUT_EXPIRED),
RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index bb7f59d319fa..6f20845827f4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -481,7 +481,8 @@ void blk_mq_free_request(struct request *rq)
if (blk_rq_rl(rq))
blk_put_rl(blk_rq_rl(rq));
 
-   blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+   if (!blk_mq_change_rq_state(rq, blk_mq_rq_state(rq), MQ_RQ_IDLE))
+   WARN_ON_ONCE(true);
if (rq->tag !=