Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-18 Thread Bart Van Assche

On 04/10/18 14:54, t...@kernel.org wrote:

On Tue, Apr 10, 2018 at 09:46:23PM +, Bart Van Assche wrote:

On Tue, 2018-04-10 at 14:33 -0700, t...@kernel.org wrote:

+   else
+   rq->missed_completion = true;


In this patch I found code that sets rq->missed_completion but no code that
clears it. Did I perhaps miss something?


Ah, yeah, I was moving it out of add_timer but forgot to actully add
it to the issue path.  Fixed patch below.

BTW, no matter what we do w/ the request handover between normal and
timeout paths, we'd need something similar.  Otherwise, while we can
reduce the window, we can't get rid of it.

[ ... ]


Hello Tejun,

This patch no longer applies cleanly on Jens' tree. Can you rebase and 
repost it?


Thanks,

Bart.





Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-15 Thread Israel Rukshin

Hi,

On 4/12/2018 4:35 PM, t...@kernel.org wrote:

Hello, Israel.

On Thu, Apr 12, 2018 at 11:59:11AM +0300, Israel Rukshin wrote:

On 4/12/2018 12:31 AM, t...@kernel.org wrote:

Hey, again.

On Wed, Apr 11, 2018 at 10:07:33AM -0700, t...@kernel.org wrote:

Hello, Israel.

On Wed, Apr 11, 2018 at 07:16:14PM +0300, Israel Rukshin wrote:

Just noticed this one, this looks interesting to me as well. Israel,
can you run your test with this patch?

Yes, I just did and it looks good.

Awesome.

Just to be sure, you tested the combined patch and saw the XXX debug
messages, right?

I am not sure I understand.
What is the combined patch?
What are the debug messages that I need to see?

There are total of three patches and I posted the combined patch + a
debug message in another post.  Can you please test the following
patch?  It'll print out something like the following (possibly many of
them).


I tested this combined patch with your debug print and I got the 
following call trace:


Apr 15 11:42:35 nvme nvme1: I/O 55 QID 10 timeout, reset controller
Apr 15 11:42:35 XXX blk_mq_timeout_reset_cleanup(): executed 1 missed 
completions
Apr 15 11:42:35 XXX blk_mq_timeout_reset_cleanup(): executed 1 missed 
completions
Apr 15 11:42:35 XXX blk_mq_timeout_reset_cleanup(): executed 1 missed 
completions
Apr 15 11:42:35 WARNING: CPU: 1 PID: 648 at block/blk-mq.c:534 
__blk_mq_complete_request+0x154/0x160
Apr 15 11:42:35 Modules linked in: nvme_rdma rdma_cm iw_cm ib_cm 
nvme_fabrics nfsv3 rpcsec_gss_krb5 nfsv4 nfs fscache iscsi_tcp 
libiscsi_tcp libiscsi scsi_transport_iscsi netconsole mlx4_ib ib_core 
mlx4_en mlx4_core xfs libcrc32c dm_mirror dm_region_hash dm_log 
dm_multipath scsi_dh_rdac scsi_dh_emc dm_mod dax scsi_dh_alua 
x86_pkg_temp_thermal coretemp kvm_intel kvm irqbypass 
ghash_clmulni_intel pcbc iTCO_wdt aesni_intel nfsd iTCO_vendor_support 
aes_x86_64 ipmi_si crypto_simd lpc_ich cryptd ipmi_devintf shpchp wmi 
acpi_pad glue_helper ipmi_msghandler pcspkr sg i2c_i801 mfd_core ioatdma 
auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 
sd_mod igb i2c_algo_bit ahci i2c_core xhci_pci crc32c_intel xhci_hcd 
libahci dca configfs nvme nvme_core ipv6 crc_ccitt autofs4 [last 
unloaded: nvme_fabrics]

Apr 15 11:42:35 CPU: 1 PID: 648 Comm: kworker/1:1H Not tainted 4.16.0+ #8
Apr 15 11:42:35 Hardware name: Supermicro SYS-6018R-WTR/X10DRW-i, BIOS 
2.0 12/17/2015

Apr 15 11:42:35 Workqueue: kblockd blk_mq_timeout_work
Apr 15 11:42:35 RIP: 0010:__blk_mq_complete_request+0x154/0x160
Apr 15 11:42:35 RSP: 0018:c9000407bd68 EFLAGS: 00010297
Apr 15 11:42:35 RAX:  RBX: 8808310236c0 RCX: 
0009
Apr 15 11:42:35 RDX: 0009 RSI:  RDI: 
8808310236c0
Apr 15 11:42:35 RBP: e8ae12c0 R08:  R09: 
04ea
Apr 15 11:42:35 R10: 006c R11: c90003dfda80 R12: 

Apr 15 11:42:35 R13: 0010 R14: 0010 R15: 

Apr 15 11:42:35 FS:  () GS:88047fa4() 
knlGS:

Apr 15 11:42:35 CS:  0010 DS:  ES:  CR0: 80050033
Apr 15 11:42:35 CR2: 7f6ab2db3140 CR3: 01e0a006 CR4: 
001606e0

Apr 15 11:42:35 Call Trace:
Apr 15 11:42:35  blk_mq_complete_request+0x4b/0x90
Apr 15 11:42:35  blk_mq_timeout_reset_cleanup+0x27/0x40
Apr 15 11:42:35  bt_iter+0x43/0x50
Apr 15 11:42:35  blk_mq_queue_tag_busy_iter+0xfb/0x230
Apr 15 11:42:35  ? blk_mq_complete_request+0x90/0x90
Apr 15 11:42:35  ? blk_mq_complete_request+0x90/0x90
Apr 15 11:42:35  ? __call_rcu.constprop.72+0x170/0x1c0
Apr 15 11:42:35  blk_mq_timeout_work+0x191/0x1f0
Apr 15 11:42:35  process_one_work+0x140/0x2a0
Apr 15 11:42:35  worker_thread+0x3f/0x3c0
Apr 15 11:42:35  kthread+0xeb/0x120
Apr 15 11:42:35  ? process_one_work+0x2a0/0x2a0
Apr 15 11:42:35  ? kthread_bind+0x10/0x10
Apr 15 11:42:35  ? SyS_exit_group+0xb/0x10
Apr 15 11:42:35  ret_from_fork+0x35/0x40
Apr 15 11:42:35 Code: 00 00 00 00 00 8b 7d 40 e8 0a 20 e7 ff e9 6e ff ff 
ff 48 8b 35 1e ca ba 00 48 83 c7 10 48 83 c6 64 e8 71 ee e5 ff e9 77 ff 
ff ff <0f> 0b e9 c3 fe ff ff 0f 1f 44 00 00 41 54 45 31 e4 55 53 48 8b

Apr 15 11:42:35 ---[ end trace fbf397c4b27ea0b8 ]---
Apr 15 11:42:35 XXX blk_mq_timeout_reset_cleanup(): executed 1 missed 
completions
Apr 15 11:42:35 BUG: unable to handle kernel NULL pointer dereference at 
0018


Regards,
Israel


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-12 Thread t...@kernel.org
Hello, Israel.

On Thu, Apr 12, 2018 at 11:59:11AM +0300, Israel Rukshin wrote:
> On 4/12/2018 12:31 AM, t...@kernel.org wrote:
> >Hey, again.
> >
> >On Wed, Apr 11, 2018 at 10:07:33AM -0700, t...@kernel.org wrote:
> >>Hello, Israel.
> >>
> >>On Wed, Apr 11, 2018 at 07:16:14PM +0300, Israel Rukshin wrote:
> Just noticed this one, this looks interesting to me as well. Israel,
> can you run your test with this patch?
> >>>Yes, I just did and it looks good.
> >>Awesome.
> >Just to be sure, you tested the combined patch and saw the XXX debug
> >messages, right?
> 
> I am not sure I understand.
> What is the combined patch?
> What are the debug messages that I need to see?

There are total of three patches and I posted the combined patch + a
debug message in another post.  Can you please test the following
patch?  It'll print out something like the following (possibly many of
them).

 XXX blk_mq_timeout_reset_cleanup(): executed %d missed completions

Thanks.

Index: work/block/blk-mq.c
===
--- work.orig/block/blk-mq.c
+++ work/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
hctx_lock(hctx, _idx);
if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
+   else
+   rq->missed_completion = true;
hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
 
blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
blk_add_timer(rq);
+   rq->missed_completion = false;
 
write_seqcount_end(>gstate_seq);
preempt_enable();
@@ -818,7 +821,8 @@ struct blk_mq_timeout_data {
unsigned int nr_expired;
 };
 
-static void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_timed_out(struct blk_mq_hw_ctx *hctx, struct request 
*req,
+   int *nr_resets, bool reserved)
 {
const struct blk_mq_ops *ops = req->q->mq_ops;
enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
@@ -833,13 +837,10 @@ static void blk_mq_rq_timed_out(struct r
__blk_mq_complete_request(req);
break;
case BLK_EH_RESET_TIMER:
-   /*
-* As nothing prevents from completion happening while
-* ->aborted_gstate is set, this may lead to ignored
-* completions and further spurious timeouts.
-*/
-   blk_mq_rq_update_aborted_gstate(req, 0);
blk_add_timer(req);
+   req->rq_flags |= RQF_MQ_TIMEOUT_RESET;
+   (*nr_resets)++;
+   hctx->need_sync_rcu = true;
break;
case BLK_EH_NOT_HANDLED:
break;
@@ -876,13 +877,35 @@ static void blk_mq_check_expired(struct
time_after_eq(jiffies, deadline)) {
blk_mq_rq_update_aborted_gstate(rq, gstate);
data->nr_expired++;
-   hctx->nr_expired++;
+   hctx->need_sync_rcu = true;
} else if (!data->next_set || time_after(data->next, deadline)) {
data->next = deadline;
data->next_set = 1;
}
 }
 
+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
+{
+   struct blk_mq_hw_ctx *hctx;
+   bool has_rcu = false;
+   int i;
+
+   queue_for_each_hw_ctx(q, hctx, i) {
+   if (!hctx->need_sync_rcu)
+   continue;
+
+   if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+   has_rcu = true;
+   else
+   synchronize_srcu(hctx->srcu);
+
+   if (clear)
+   hctx->need_sync_rcu = false;
+   }
+   if (has_rcu)
+   synchronize_rcu();
+}
+
 static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
struct request *rq, void *priv, bool reserved)
 {
@@ -895,7 +918,45 @@ static void blk_mq_terminate_expired(str
 */
if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
READ_ONCE(rq->gstate) == rq->aborted_gstate)
-   blk_mq_rq_timed_out(rq, reserved);
+   blk_mq_rq_timed_out(hctx, rq, priv, reserved);
+}
+
+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
+   struct request *rq, void *priv, bool reserved)
+{
+   /*
+* @rq's timer reset has gone through rcu synchronization and is
+* visible now.  Allow normal completions again by resetting
+* ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
+* blk_mq_timeout_reset_cleanup() needs it again and there's no
+* memory ordering around ->aborted_gstate making it the only field
+* safe to update.  Let blk_add_timer() clear it later when the
+* request is recycled or times out again.
+*/
+   if 

Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-12 Thread Israel Rukshin

On 4/12/2018 12:31 AM, t...@kernel.org wrote:

Hey, again.

On Wed, Apr 11, 2018 at 10:07:33AM -0700, t...@kernel.org wrote:

Hello, Israel.

On Wed, Apr 11, 2018 at 07:16:14PM +0300, Israel Rukshin wrote:

Just noticed this one, this looks interesting to me as well. Israel,
can you run your test with this patch?

Yes, I just did and it looks good.

Awesome.

Just to be sure, you tested the combined patch and saw the XXX debug
messages, right?


I am not sure I understand.
What is the combined patch?
What are the debug messages that I need to see?


Thanks.





Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-11 Thread t...@kernel.org
Hey, again.

On Wed, Apr 11, 2018 at 10:07:33AM -0700, t...@kernel.org wrote:
> Hello, Israel.
> 
> On Wed, Apr 11, 2018 at 07:16:14PM +0300, Israel Rukshin wrote:
> > >Just noticed this one, this looks interesting to me as well. Israel,
> > >can you run your test with this patch?
> > 
> > Yes, I just did and it looks good.
> 
> Awesome.

Just to be sure, you tested the combined patch and saw the XXX debug
messages, right?

Thanks.

-- 
tejun


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-11 Thread Martin Steigerwald
Bart Van Assche - 11.04.18, 14:50:
> On Tue, 2018-04-10 at 14:54 -0700, t...@kernel.org wrote:
> > Ah, yeah, I was moving it out of add_timer but forgot to actully add
> > it to the issue path.  Fixed patch below.
> > 
> > BTW, no matter what we do w/ the request handover between normal and
> > timeout paths, we'd need something similar.  Otherwise, while we can
> > reduce the window, we can't get rid of it.
> 
> (+Martin Steigerwald)
[…]
> Thank you for having shared this patch. It looks interesting to me.
> What I know about the blk-mq timeout handling is as follows:
> * Nobody who has reviewed the blk-mq timeout handling code with this
> patch applied has reported any shortcomings for that code.
> * However, several people have reported kernel crashes that disappear
> when the blk-mq timeout code is reworked. I'm referring to "nvme-rdma
> corrupts memory upon timeout"
>  
> (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848
> .html) and also to a "RIP: scsi_times_out+0x17" crash during boot
> (https://bugzilla.kernel.org/show_bug.cgi?id=199077).

Yes, with the three patches:

- '[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a 
request.mbox'

- '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into 
rcu_read_{lock,unlock}().mbox'

- '[PATCH v4] blk-mq_Fix race conditions in request timeout 
handling.mbox'

the occasional hangs on some boots / resumes from hibernation appear to 
be gone.

Also it appears that the error loading SMART data issue is gone as well 
(see my bug report). However it is still to early to say for sure. I 
think I need at least 2-3 days of additional testing with this kernel to 
be sufficiently sure about it.

However… I could also test another patch, but from reading the rest of 
this thread so far I have no clear on whether to try one of the new 
patches and if so which one and whether adding it on top of some of the 
patches I already applied or using it as a replacement of it.

So while doing a training this and next week I can apply a patch here 
and then, but I won´t have much time to read the complete discussion to 
figure out what to apply.

Personally as a stable kernel has been released with those issues, I 
think its good to fix it up soon. On the other hand it may take quite 
some time til popular distros carry 4.16 for regular users. And I have 
no idea how frequent the reported issues are, i.e. how many users would 
be affected.

Thanks,
-- 
Martin




Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-11 Thread Israel Rukshin

On 4/11/2018 5:24 PM, Sagi Grimberg wrote:



Ah, yeah, I was moving it out of add_timer but forgot to actully add
it to the issue path.  Fixed patch below.

BTW, no matter what we do w/ the request handover between normal and
timeout paths, we'd need something similar.  Otherwise, while we can
reduce the window, we can't get rid of it.

Thanks.


Just noticed this one, this looks interesting to me as well. Israel,
can you run your test with this patch?


Yes, I just did and it looks good.





---
  block/blk-mq.c |   45 
-

  include/linux/blkdev.h |    2 ++
  2 files changed, 34 insertions(+), 13 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
  hctx_lock(hctx, _idx);
  if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
  __blk_mq_complete_request(rq);
+    else
+    rq->missed_completion = true;
  hctx_unlock(hctx, srcu_idx);
  }
  EXPORT_SYMBOL(blk_mq_complete_request);
@@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
    blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
  blk_add_timer(rq);
+    rq->missed_completion = false;
    write_seqcount_end(>gstate_seq);
  preempt_enable();
@@ -881,7 +884,7 @@ static void blk_mq_check_expired(struct
  }
  }
  -static void blk_mq_timeout_sync_rcu(struct request_queue *q)
+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool 
clear)

  {
  struct blk_mq_hw_ctx *hctx;
  bool has_rcu = false;
@@ -896,7 +899,8 @@ static void blk_mq_timeout_sync_rcu(stru
  else
  synchronize_srcu(hctx->srcu);
  -    hctx->need_sync_rcu = false;
+    if (clear)
+    hctx->need_sync_rcu = false;
  }
  if (has_rcu)
  synchronize_rcu();
@@ -917,25 +921,37 @@ static void blk_mq_terminate_expired(str
  blk_mq_rq_timed_out(hctx, rq, priv, reserved);
  }
  -static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
  struct request *rq, void *priv, bool reserved)
  {
  /*
   * @rq's timer reset has gone through rcu synchronization and is
   * visible now.  Allow normal completions again by resetting
   * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
- * there's no memory ordering around ->aborted_gstate making it the
- * only field safe to update.  Let blk_add_timer() clear it later
- * when the request is recycled or times out again.
- *
- * As nothing prevents from completion happening while
- * ->aborted_gstate is set, this may lead to ignored completions
- * and further spurious timeouts.
+ * blk_mq_timeout_reset_cleanup() needs it again and there's no
+ * memory ordering around ->aborted_gstate making it the only field
+ * safe to update.  Let blk_add_timer() clear it later when the
+ * request is recycled or times out again.
   */
  if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
  blk_mq_rq_update_aborted_gstate(rq, 0);
  }
  +static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
+    struct request *rq, void *priv, bool reserved)
+{
+    /*
+ * @rq is now fully returned to the normal path.  If normal
+ * completion raced timeout handling, execute the missed completion
+ * here.  This is safe because 1. ->missed_completion can no longer
+ * be asserted because nothing is timing out right now and 2. if
+ * ->missed_completion is set, @rq is safe from recycling because
+ * nobody could have completed it.
+ */
+    if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion)
+    blk_mq_complete_request(rq);
+}
+
  static void blk_mq_timeout_work(struct work_struct *work)
  {
  struct request_queue *q =
@@ -976,7 +992,7 @@ static void blk_mq_timeout_work(struct w
   * becomes a problem, we can add per-hw_ctx rcu_head and
   * wait in parallel.
   */
-    blk_mq_timeout_sync_rcu(q);
+    blk_mq_timeout_sync_rcu(q, true);
    /* terminate the ones we won */
  blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
@@ -988,9 +1004,12 @@ static void blk_mq_timeout_work(struct w
   * reset racing against recycling.
   */
  if (nr_resets) {
-    blk_mq_timeout_sync_rcu(q);
+    blk_mq_timeout_sync_rcu(q, false);
+    blk_mq_queue_tag_busy_iter(q,
+    blk_mq_timeout_reset_return, NULL);
+    blk_mq_timeout_sync_rcu(q, true);
  blk_mq_queue_tag_busy_iter(q,
-    blk_mq_finish_timeout_reset, NULL);
+    blk_mq_timeout_reset_cleanup, NULL);
  }
  }
  --- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -227,6 +227,8 @@ struct request {
    unsigned int extra_len;    /* length of alignment and padding */
  +    bool missed_completion;
+


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-11 Thread t...@kernel.org
Hello,

On Wed, Apr 11, 2018 at 05:24:49PM +0300, Sagi Grimberg wrote:
> 
> >Ah, yeah, I was moving it out of add_timer but forgot to actully add
> >it to the issue path.  Fixed patch below.
> >
> >BTW, no matter what we do w/ the request handover between normal and
> >timeout paths, we'd need something similar.  Otherwise, while we can
> >reduce the window, we can't get rid of it.
> >
> >Thanks.
> 
> Just noticed this one, this looks interesting to me as well. Israel,
> can you run your test with this patch?

The following is the combined patch with one debug message to see
whether missed completions are actually happening.

Thanks.

Index: work/block/blk-mq.c
===
--- work.orig/block/blk-mq.c
+++ work/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
hctx_lock(hctx, _idx);
if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
+   else
+   rq->missed_completion = true;
hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
 
blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
blk_add_timer(rq);
+   rq->missed_completion = false;
 
write_seqcount_end(>gstate_seq);
preempt_enable();
@@ -818,7 +821,8 @@ struct blk_mq_timeout_data {
unsigned int nr_expired;
 };
 
-static void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_timed_out(struct blk_mq_hw_ctx *hctx, struct request 
*req,
+   int *nr_resets, bool reserved)
 {
const struct blk_mq_ops *ops = req->q->mq_ops;
enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
@@ -833,13 +837,10 @@ static void blk_mq_rq_timed_out(struct r
__blk_mq_complete_request(req);
break;
case BLK_EH_RESET_TIMER:
-   /*
-* As nothing prevents from completion happening while
-* ->aborted_gstate is set, this may lead to ignored
-* completions and further spurious timeouts.
-*/
-   blk_mq_rq_update_aborted_gstate(req, 0);
blk_add_timer(req);
+   req->rq_flags |= RQF_MQ_TIMEOUT_RESET;
+   (*nr_resets)++;
+   hctx->need_sync_rcu = true;
break;
case BLK_EH_NOT_HANDLED:
break;
@@ -876,13 +877,35 @@ static void blk_mq_check_expired(struct
time_after_eq(jiffies, deadline)) {
blk_mq_rq_update_aborted_gstate(rq, gstate);
data->nr_expired++;
-   hctx->nr_expired++;
+   hctx->need_sync_rcu = true;
} else if (!data->next_set || time_after(data->next, deadline)) {
data->next = deadline;
data->next_set = 1;
}
 }
 
+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
+{
+   struct blk_mq_hw_ctx *hctx;
+   bool has_rcu = false;
+   int i;
+
+   queue_for_each_hw_ctx(q, hctx, i) {
+   if (!hctx->need_sync_rcu)
+   continue;
+
+   if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+   has_rcu = true;
+   else
+   synchronize_srcu(hctx->srcu);
+
+   if (clear)
+   hctx->need_sync_rcu = false;
+   }
+   if (has_rcu)
+   synchronize_rcu();
+}
+
 static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
struct request *rq, void *priv, bool reserved)
 {
@@ -895,7 +918,45 @@ static void blk_mq_terminate_expired(str
 */
if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
READ_ONCE(rq->gstate) == rq->aborted_gstate)
-   blk_mq_rq_timed_out(rq, reserved);
+   blk_mq_rq_timed_out(hctx, rq, priv, reserved);
+}
+
+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
+   struct request *rq, void *priv, bool reserved)
+{
+   /*
+* @rq's timer reset has gone through rcu synchronization and is
+* visible now.  Allow normal completions again by resetting
+* ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
+* blk_mq_timeout_reset_cleanup() needs it again and there's no
+* memory ordering around ->aborted_gstate making it the only field
+* safe to update.  Let blk_add_timer() clear it later when the
+* request is recycled or times out again.
+*/
+   if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
+   blk_mq_rq_update_aborted_gstate(rq, 0);
+}
+
+static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
+   struct request *rq, void *priv, bool reserved)
+{
+   int cnt = 0;
+
+   /*
+* @rq is now fully returned to the normal path.  If normal
+ 

Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-11 Thread Sagi Grimberg



Ah, yeah, I was moving it out of add_timer but forgot to actully add
it to the issue path.  Fixed patch below.

BTW, no matter what we do w/ the request handover between normal and
timeout paths, we'd need something similar.  Otherwise, while we can
reduce the window, we can't get rid of it.

Thanks.


Just noticed this one, this looks interesting to me as well. Israel,
can you run your test with this patch?



---
  block/blk-mq.c |   45 -
  include/linux/blkdev.h |2 ++
  2 files changed, 34 insertions(+), 13 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
hctx_lock(hctx, _idx);
if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
+   else
+   rq->missed_completion = true;
hctx_unlock(hctx, srcu_idx);
  }
  EXPORT_SYMBOL(blk_mq_complete_request);
@@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
  
  	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);

blk_add_timer(rq);
+   rq->missed_completion = false;
  
  	write_seqcount_end(>gstate_seq);

preempt_enable();
@@ -881,7 +884,7 @@ static void blk_mq_check_expired(struct
}
  }
  
-static void blk_mq_timeout_sync_rcu(struct request_queue *q)

+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
  {
struct blk_mq_hw_ctx *hctx;
bool has_rcu = false;
@@ -896,7 +899,8 @@ static void blk_mq_timeout_sync_rcu(stru
else
synchronize_srcu(hctx->srcu);
  
-		hctx->need_sync_rcu = false;

+   if (clear)
+   hctx->need_sync_rcu = false;
}
if (has_rcu)
synchronize_rcu();
@@ -917,25 +921,37 @@ static void blk_mq_terminate_expired(str
blk_mq_rq_timed_out(hctx, rq, priv, reserved);
  }
  
-static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,

+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
struct request *rq, void *priv, bool reserved)
  {
/*
 * @rq's timer reset has gone through rcu synchronization and is
 * visible now.  Allow normal completions again by resetting
 * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
-* there's no memory ordering around ->aborted_gstate making it the
-* only field safe to update.  Let blk_add_timer() clear it later
-* when the request is recycled or times out again.
-*
-* As nothing prevents from completion happening while
-* ->aborted_gstate is set, this may lead to ignored completions
-* and further spurious timeouts.
+* blk_mq_timeout_reset_cleanup() needs it again and there's no
+* memory ordering around ->aborted_gstate making it the only field
+* safe to update.  Let blk_add_timer() clear it later when the
+* request is recycled or times out again.
 */
if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
blk_mq_rq_update_aborted_gstate(rq, 0);
  }
  
+static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,

+   struct request *rq, void *priv, bool reserved)
+{
+   /*
+* @rq is now fully returned to the normal path.  If normal
+* completion raced timeout handling, execute the missed completion
+* here.  This is safe because 1. ->missed_completion can no longer
+* be asserted because nothing is timing out right now and 2. if
+* ->missed_completion is set, @rq is safe from recycling because
+* nobody could have completed it.
+*/
+   if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion)
+   blk_mq_complete_request(rq);
+}
+
  static void blk_mq_timeout_work(struct work_struct *work)
  {
struct request_queue *q =
@@ -976,7 +992,7 @@ static void blk_mq_timeout_work(struct w
 * becomes a problem, we can add per-hw_ctx rcu_head and
 * wait in parallel.
 */
-   blk_mq_timeout_sync_rcu(q);
+   blk_mq_timeout_sync_rcu(q, true);
  
  		/* terminate the ones we won */

blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
@@ -988,9 +1004,12 @@ static void blk_mq_timeout_work(struct w
 * reset racing against recycling.
 */
if (nr_resets) {
-   blk_mq_timeout_sync_rcu(q);
+   blk_mq_timeout_sync_rcu(q, false);
+   blk_mq_queue_tag_busy_iter(q,
+   blk_mq_timeout_reset_return, NULL);
+   blk_mq_timeout_sync_rcu(q, true);
blk_mq_queue_tag_busy_iter(q,
-   blk_mq_finish_timeout_reset, NULL);
+   

Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-11 Thread t...@kernel.org
Hello, Bart.

On Wed, Apr 11, 2018 at 12:50:51PM +, Bart Van Assche wrote:
> Thank you for having shared this patch. It looks interesting to me. What I
> know about the blk-mq timeout handling is as follows:
> * Nobody who has reviewed the blk-mq timeout handling code with this patch
>   applied has reported any shortcomings for that code.
> * However, several people have reported kernel crashes that disappear when
>   the blk-mq timeout code is reworked. I'm referring to "nvme-rdma corrupts
>   memory upon timeout"
>   (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html)
>   and also to a "RIP: scsi_times_out+0x17" crash during boot
>   (https://bugzilla.kernel.org/show_bug.cgi?id=199077).
> 
> So we have the choice between two approaches:
> (1) apply the patch from your previous e-mail and root-cause and fix the
> crashes referred to above.
> (2) apply a patch that makes the crashes reported against v4.16 disappear and
> remove the atomic instructions introduced by such a patch at a later time.
> 
> Since crashes have been reported for kernel v4.16 I think we should follow
> approach (2). That will remove the time pressure from root-causing and fixing
> the crashes reported for the NVMeOF initiator and SCSI initiator drivers.

So, it really bothers me how blind we're going about this.  It isn't
an insurmountable emergency that we have to adopt whatever solution
which passed a couple tests this minute.  We can debug and root cause
this properly and pick the right solution.  We even have two most
likely causes already analysed and patches proposed, one of them
months ago.  If we wanna change the handover model, let's do that
because the new one is better, not because of vague fear.

Thanks.

-- 
tejun


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-11 Thread Bart Van Assche
On Tue, 2018-04-10 at 14:54 -0700, t...@kernel.org wrote:
> Ah, yeah, I was moving it out of add_timer but forgot to actully add
> it to the issue path.  Fixed patch below.
> 
> BTW, no matter what we do w/ the request handover between normal and
> timeout paths, we'd need something similar.  Otherwise, while we can
> reduce the window, we can't get rid of it.

(+Martin Steigerwald)

Hello Tejun,

Thank you for having shared this patch. It looks interesting to me. What I
know about the blk-mq timeout handling is as follows:
* Nobody who has reviewed the blk-mq timeout handling code with this patch
  applied has reported any shortcomings for that code.
* However, several people have reported kernel crashes that disappear when
  the blk-mq timeout code is reworked. I'm referring to "nvme-rdma corrupts
  memory upon timeout"
  (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html)
  and also to a "RIP: scsi_times_out+0x17" crash during boot
  (https://bugzilla.kernel.org/show_bug.cgi?id=199077).

So we have the choice between two approaches:
(1) apply the patch from your previous e-mail and root-cause and fix the
crashes referred to above.
(2) apply a patch that makes the crashes reported against v4.16 disappear and
remove the atomic instructions introduced by such a patch at a later time.

Since crashes have been reported for kernel v4.16 I think we should follow
approach (2). That will remove the time pressure from root-causing and fixing
the crashes reported for the NVMeOF initiator and SCSI initiator drivers.

Thanks,

Bart.





Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread t...@kernel.org
On Tue, Apr 10, 2018 at 09:46:23PM +, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 14:33 -0700, t...@kernel.org wrote:
> > +   else
> > +   rq->missed_completion = true;
> 
> In this patch I found code that sets rq->missed_completion but no code that
> clears it. Did I perhaps miss something?

Ah, yeah, I was moving it out of add_timer but forgot to actully add
it to the issue path.  Fixed patch below.

BTW, no matter what we do w/ the request handover between normal and
timeout paths, we'd need something similar.  Otherwise, while we can
reduce the window, we can't get rid of it.

Thanks.

---
 block/blk-mq.c |   45 -
 include/linux/blkdev.h |2 ++
 2 files changed, 34 insertions(+), 13 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
hctx_lock(hctx, _idx);
if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
+   else
+   rq->missed_completion = true;
hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
 
blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
blk_add_timer(rq);
+   rq->missed_completion = false;
 
write_seqcount_end(>gstate_seq);
preempt_enable();
@@ -881,7 +884,7 @@ static void blk_mq_check_expired(struct
}
 }
 
-static void blk_mq_timeout_sync_rcu(struct request_queue *q)
+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
 {
struct blk_mq_hw_ctx *hctx;
bool has_rcu = false;
@@ -896,7 +899,8 @@ static void blk_mq_timeout_sync_rcu(stru
else
synchronize_srcu(hctx->srcu);
 
-   hctx->need_sync_rcu = false;
+   if (clear)
+   hctx->need_sync_rcu = false;
}
if (has_rcu)
synchronize_rcu();
@@ -917,25 +921,37 @@ static void blk_mq_terminate_expired(str
blk_mq_rq_timed_out(hctx, rq, priv, reserved);
 }
 
-static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
struct request *rq, void *priv, bool reserved)
 {
/*
 * @rq's timer reset has gone through rcu synchronization and is
 * visible now.  Allow normal completions again by resetting
 * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
-* there's no memory ordering around ->aborted_gstate making it the
-* only field safe to update.  Let blk_add_timer() clear it later
-* when the request is recycled or times out again.
-*
-* As nothing prevents from completion happening while
-* ->aborted_gstate is set, this may lead to ignored completions
-* and further spurious timeouts.
+* blk_mq_timeout_reset_cleanup() needs it again and there's no
+* memory ordering around ->aborted_gstate making it the only field
+* safe to update.  Let blk_add_timer() clear it later when the
+* request is recycled or times out again.
 */
if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
blk_mq_rq_update_aborted_gstate(rq, 0);
 }
 
+static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
+   struct request *rq, void *priv, bool reserved)
+{
+   /*
+* @rq is now fully returned to the normal path.  If normal
+* completion raced timeout handling, execute the missed completion
+* here.  This is safe because 1. ->missed_completion can no longer
+* be asserted because nothing is timing out right now and 2. if
+* ->missed_completion is set, @rq is safe from recycling because
+* nobody could have completed it.
+*/
+   if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion)
+   blk_mq_complete_request(rq);
+}
+
 static void blk_mq_timeout_work(struct work_struct *work)
 {
struct request_queue *q =
@@ -976,7 +992,7 @@ static void blk_mq_timeout_work(struct w
 * becomes a problem, we can add per-hw_ctx rcu_head and
 * wait in parallel.
 */
-   blk_mq_timeout_sync_rcu(q);
+   blk_mq_timeout_sync_rcu(q, true);
 
/* terminate the ones we won */
blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
@@ -988,9 +1004,12 @@ static void blk_mq_timeout_work(struct w
 * reset racing against recycling.
 */
if (nr_resets) {
-   blk_mq_timeout_sync_rcu(q);
+   blk_mq_timeout_sync_rcu(q, false);
+   blk_mq_queue_tag_busy_iter(q,
+   blk_mq_timeout_reset_return, NULL);
+

Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Bart Van Assche
On Tue, 2018-04-10 at 14:33 -0700, t...@kernel.org wrote:
> +   else
> +   rq->missed_completion = true;

In this patch I found code that sets rq->missed_completion but no code that
clears it. Did I perhaps miss something?

Thanks,

Bart.




Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread t...@kernel.org
Hello,

On Tue, Apr 10, 2018 at 08:40:43AM -0700, t...@kernel.org wrote:
> On Tue, Apr 10, 2018 at 11:38:18PM +0800, Ming Lei wrote:
> > I agree, the issue should be in driver's irq handler and .timeout in
> > theory.
> > 
> > For example, even though one request has been done by irq handler, .timeout
> > still may return RESET_TIMER.
> 
> blk-mq can use a separate flag to track normal completions during
> timeout and complete the request normally on RESET_TIMER if the flag
> is set while EH was in progress.  With a bit of care, we'd be able to
> plug the race completely.

So, something like the following.  Just compile tested.  The extra
dedicated field is kinda unfortunate but we need something like that
no matter how we do this because the involved completion marking
violates the ownership (we want to record the normail completion while
timeout handling is in progress).  Alternatively, we can add an atomic
flags field.  The addition doesn't change the size of struct request
because it fits in a hole but that hole seems removeable, so...

Anyways, it'd be great if someone can test this combined with the
previous two rcu sync patches.

Thanks.
---
 block/blk-mq.c |   44 +++-
 include/linux/blkdev.h |2 ++
 2 files changed, 33 insertions(+), 13 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
hctx_lock(hctx, _idx);
if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
+   else
+   rq->missed_completion = true;
hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -881,7 +883,7 @@ static void blk_mq_check_expired(struct
}
 }
 
-static void blk_mq_timeout_sync_rcu(struct request_queue *q)
+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
 {
struct blk_mq_hw_ctx *hctx;
bool has_rcu = false;
@@ -896,7 +898,8 @@ static void blk_mq_timeout_sync_rcu(stru
else
synchronize_srcu(hctx->srcu);
 
-   hctx->need_sync_rcu = false;
+   if (clear)
+   hctx->need_sync_rcu = false;
}
if (has_rcu)
synchronize_rcu();
@@ -917,25 +920,37 @@ static void blk_mq_terminate_expired(str
blk_mq_rq_timed_out(hctx, rq, priv, reserved);
 }
 
-static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
struct request *rq, void *priv, bool reserved)
 {
/*
 * @rq's timer reset has gone through rcu synchronization and is
 * visible now.  Allow normal completions again by resetting
 * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
-* there's no memory ordering around ->aborted_gstate making it the
-* only field safe to update.  Let blk_add_timer() clear it later
-* when the request is recycled or times out again.
-*
-* As nothing prevents from completion happening while
-* ->aborted_gstate is set, this may lead to ignored completions
-* and further spurious timeouts.
+* blk_mq_timeout_reset_cleanup() needs it again and there's no
+* memory ordering around ->aborted_gstate making it the only field
+* safe to update.  Let blk_add_timer() clear it later when the
+* request is recycled or times out again.
 */
if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
blk_mq_rq_update_aborted_gstate(rq, 0);
 }
 
+static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
+   struct request *rq, void *priv, bool reserved)
+{
+   /*
+* @rq is now fully returned to the normal path.  If normal
+* completion raced timeout handling, execute the missed completion
+* here.  This is safe because 1. ->missed_completion can no longer
+* be asserted because nothing is timing out right now and 2. if
+* ->missed_completion is set, @rq is safe from recycling because
+* nobody could have completed it.
+*/
+   if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion)
+   blk_mq_complete_request(rq);
+}
+
 static void blk_mq_timeout_work(struct work_struct *work)
 {
struct request_queue *q =
@@ -976,7 +991,7 @@ static void blk_mq_timeout_work(struct w
 * becomes a problem, we can add per-hw_ctx rcu_head and
 * wait in parallel.
 */
-   blk_mq_timeout_sync_rcu(q);
+   blk_mq_timeout_sync_rcu(q, true);
 
/* terminate the ones we won */
blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
@@ -988,9 +1003,12 @@ static void blk_mq_timeout_work(struct w
 * reset racing against recycling.
 */

Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread t...@kernel.org
Hello,

On Tue, Apr 10, 2018 at 11:38:18PM +0800, Ming Lei wrote:
> I agree, the issue should be in driver's irq handler and .timeout in
> theory.
> 
> For example, even though one request has been done by irq handler, .timeout
> still may return RESET_TIMER.

blk-mq can use a separate flag to track normal completions during
timeout and complete the request normally on RESET_TIMER if the flag
is set while EH was in progress.  With a bit of care, we'd be able to
plug the race completely.

Thanks.

-- 
tejun


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Ming Lei
Hi Tejun,

On Tue, Apr 10, 2018 at 08:30:31AM -0700, t...@kernel.org wrote:
> Hello, Ming.
> 
> On Tue, Apr 10, 2018 at 11:25:54PM +0800, Ming Lei wrote:
> > +   if (time_after_eq(jiffies, deadline) &&
> > +   blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE)) {
> > +   blk_mq_rq_timed_out(rq, reserved);
> > 
> > Normal completion still can happen between blk_mq_change_rq_state()
> > and blk_mq_rq_timed_out().
> > 
> > In tj's approach, there is synchronize_rcu() between writing aborted_gstate
> > and blk_mq_rq_timed_out, it is easier for normal completion to happen during
> > the big window.
> 
> I don't think plugging this hole is all that difficult, but this
> shouldn't lead to any critical failures.  If so, that'd be a driver
> bug.

I agree, the issue should be in driver's irq handler and .timeout in
theory.

For example, even though one request has been done by irq handler, .timeout
still may return RESET_TIMER.

Thanks,
Ming


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread t...@kernel.org
Hello, Ming.

On Tue, Apr 10, 2018 at 11:25:54PM +0800, Ming Lei wrote:
> +   if (time_after_eq(jiffies, deadline) &&
> +   blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE)) {
> +   blk_mq_rq_timed_out(rq, reserved);
> 
> Normal completion still can happen between blk_mq_change_rq_state()
> and blk_mq_rq_timed_out().
> 
> In tj's approach, there is synchronize_rcu() between writing aborted_gstate
> and blk_mq_rq_timed_out, it is easier for normal completion to happen during
> the big window.

I don't think plugging this hole is all that difficult, but this
shouldn't lead to any critical failures.  If so, that'd be a driver
bug.

Thanks.

-- 
tejun


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Ming Lei
On Tue, Apr 10, 2018 at 03:02:11PM +, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 22:30 +0800, Ming Lei wrote:
> > On Tue, Apr 10, 2018 at 02:09:33PM +, Bart Van Assche wrote:
> > > Please keep in mind that all synchronize_rcu() does is to wait for pre-
> > > existing RCU readers to finish. synchronize_rcu() does not prevent that 
> > > new
> > > rcu_read_lock() calls happen. It is e.g. possible that after
> > 
> > That is right, and I also mentioned normal completion can be done between
> > 1) and reset aborted_gstate in 3).
> > 
> > > blk_mq_rq_update_aborted_gstate(req, 0) has been executed that a regular
> > > completion occurs. If that request is not reused before the timer that was
> > > restarted by the timeout code expires, that request will be completed 
> > > twice.
> > 
> > In this patch, blk_mq_add_timer(req, MQ_RQ_COMPLETE, MQ_RQ_IN_FLIGHT) is
> > called for handling BLK_EH_RESET_TIMER. And after rq's state is changed
> > to MQ_RQ_IN_FLIGHT, normal completion still can come and complete this rq,
> > just like the above you described, right?
> 
> I should have added the following in my previous e-mail: "if the completion
> occurs after blk_mq_check_expired() examined rq->gstate and before it updated
> rq->aborted_gstate".

That is the difference between tj's approach and your patch, but the
difference is just in the timing.

See this patch:

+   if (time_after_eq(jiffies, deadline) &&
+   blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE)) {
+   blk_mq_rq_timed_out(rq, reserved);

Normal completion still can happen between blk_mq_change_rq_state()
and blk_mq_rq_timed_out().

In tj's approach, there is synchronize_rcu() between writing aborted_gstate
and blk_mq_rq_timed_out, it is easier for normal completion to happen during
the big window.

> That race can occur with the current upstream blk-mq
> timeout handling code but not after my patch has been applied.

In theory, the 'race' can occur with your patch too, but the window
is just so small.

If you think my comment is correct, please update your commit log.

-- 
Ming


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Bart Van Assche
On Tue, 2018-04-10 at 22:30 +0800, Ming Lei wrote:
> On Tue, Apr 10, 2018 at 02:09:33PM +, Bart Van Assche wrote:
> > Please keep in mind that all synchronize_rcu() does is to wait for pre-
> > existing RCU readers to finish. synchronize_rcu() does not prevent that new
> > rcu_read_lock() calls happen. It is e.g. possible that after
> 
> That is right, and I also mentioned normal completion can be done between
> 1) and reset aborted_gstate in 3).
> 
> > blk_mq_rq_update_aborted_gstate(req, 0) has been executed that a regular
> > completion occurs. If that request is not reused before the timer that was
> > restarted by the timeout code expires, that request will be completed twice.
> 
> In this patch, blk_mq_add_timer(req, MQ_RQ_COMPLETE, MQ_RQ_IN_FLIGHT) is
> called for handling BLK_EH_RESET_TIMER. And after rq's state is changed
> to MQ_RQ_IN_FLIGHT, normal completion still can come and complete this rq,
> just like the above you described, right?

I should have added the following in my previous e-mail: "if the completion
occurs after blk_mq_check_expired() examined rq->gstate and before it updated
rq->aborted_gstate". That race can occur with the current upstream blk-mq
timeout handling code but not after my patch has been applied.

Bart.




Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread h...@lst.de
On Tue, Apr 10, 2018 at 01:26:39PM +, Bart Van Assche wrote:
> Can you explain why you think that using cmpxchg() is safe in this context?
> The regular completion path and the timeout code can both execute e.g. the
> following code concurrently:
> 
>   blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE);
> 
> That's why I think that we need an atomic compare and exchange instead of
> cmpxchg(). What I found in the Intel Software Developer Manual seems to
> confirm that:

The Linux cmpxchg() helper is supposed to always be atomic, you only
need atomic_cmpxchg and friends if you want to operate on an atomic_t.
(same for the _long variant).

In fact if you look at the x86 implementation of the cmpxchg() macro
you'll see that it will use the lock prefix if the kernel has been
built with CONFIG_SMP enabled.


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Jens Axboe
On 4/10/18 3:55 AM, Christoph Hellwig wrote:
> On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
>> If a completion occurs after blk_mq_rq_timed_out() has reset
>> rq->aborted_gstate and the request is again in flight when the timeout
>> expires then a request will be completed twice: a first time by the
>> timeout handler and a second time when the regular completion occurs.
>>
>> Additionally, the blk-mq timeout handling code ignores completions that
>> occur after blk_mq_check_expired() has been called and before
>> blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver
>> timeout handler always returns BLK_EH_RESET_TIMER then the result will
>> be that the request never terminates.
>>
>> Since the request state can be updated from two different contexts,
>> namely regular completion and request timeout, this race cannot be
>> fixed with RCU synchronization only. Fix this race as follows:
>> - Split __deadline in two variables, namely lq_deadline for legacy
>>   request queues and mq_deadline for blk-mq request queues. Use atomic
>>   operations to update mq_deadline.
> 
> I don't think we need the atomic_long_cmpxchg, and can do with a plain
> cmpxhg.  Also unterminated cmpxchg loops are a bad idea, but I think
> both callers are protected from other changes so we can turn that
> into a WARN_ON().  That plus some minor cleanups in the version below,
> only boot tested:

This version looks reasonable, let me throw it through the testing
machinery.

-- 
Jens Axboe



Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread t...@kernel.org
Hello,

On Tue, Apr 10, 2018 at 02:30:26PM +, Bart Van Assche wrote:
> > Switching to another model might be better but let's please do that
> > with the right rationales.  A good portion of this seems to be built
> > on misunderstandings.
> 
> Which misunderstandings? I'm not aware of any misunderstandings at my side.
> Additionally, tests with two different block drivers (NVMeOF initiator and
> the SRP initiator driver) have shown that the current blk-mq timeout
> implementation with or without your two patches applied result in subtle and
> hard to debug crashes and/or memory corruption. That is not the case for the

I must have missed that part.  Which tests were they?

> patch at the start of this thread. The latest report of a crash I ran into
> myself and that is fixed by the patch at the start of this thread is
> available here: https://www.spinics.net/lists/linux-rdma/msg63240.html.
> 
> Please also keep in mind that if this patch would be accepted that that does
> not prevent this patch to be replaced with an RCU-based solution later on.
> If anyone comes up any time with a reliably working RCU-based solution I
> will be happy to accept a revert of this patch and I will help reviewing that
> RCU-based solution.

Oh, switching is fine but let's get in sync first.  Who have the repro
cases and what were tested?

Thanks.

-- 
tejun


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread jianchao.wang
Hi Bart

Thanks for your kindly response.

On 04/10/2018 09:01 PM, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 15:59 +0800, jianchao.wang wrote:
>> If yes, how does the timeout handler get the freed request when the tag has 
>> been freed ?
> 
> Hello Jianchao,
> 
> Have you noticed that the timeout handler does not check whether or not the 
> request
> tag is freed? Additionally, I don't think it would be possible to add such a 
> check
> to the timeout code without introducing a new race condition.

Doesn't blk_mq_queue_tag_busy_iter only iterate the tags that has been 
allocated/set ?
When the request is freed, the tag will be cleared through 
blk_mq_put_tag->sbitmap_queue_clear
Do I miss something else ?

Thanks
Jianchao

> 
> 


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Ming Lei
On Tue, Apr 10, 2018 at 02:09:33PM +, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 21:55 +0800, Ming Lei wrote:
> > Then I have same question with Jianchao, what is the actual double
> > complete in linus tree between BLK_EH_RESET_TIMER and normal completion?
> > 
> > Follows my understanding:
> > 
> > 1) when timeout is detected on one request, its aborted_gstate is
> > updated in blk_mq_check_expired().
> > 
> > 2) run synchronize_rcu(), and make sure all pending completion is done
> > 
> > 3) run blk_mq_rq_timed_out()
> > - ret = ops->timeout
> > - blk_mq_rq_update_aborted_gstate(req, 0)
> > - blk_add_timer(req);
> > 
> > If normal completion is done between 1) and reset aborted_gstate in 3),
> > blk_mq_complete_request() will be called, and found that aborted_gstate
> > is set, then the rq won't be completed really.
> > 
> > If normal completion is done after reset aborted_gstate in 3), it should
> > be same with applying this patch.
> 
> Hello Ming,
> 
> Please keep in mind that all synchronize_rcu() does is to wait for pre-
> existing RCU readers to finish. synchronize_rcu() does not prevent that new
> rcu_read_lock() calls happen. It is e.g. possible that after

That is right, and I also mentioned normal completion can be done between
1) and reset aborted_gstate in 3).

> blk_mq_rq_update_aborted_gstate(req, 0) has been executed that a regular
> completion occurs. If that request is not reused before the timer that was
> restarted by the timeout code expires, that request will be completed twice.

In this patch, blk_mq_add_timer(req, MQ_RQ_COMPLETE, MQ_RQ_IN_FLIGHT) is
called for handling BLK_EH_RESET_TIMER. And after rq's state is changed
to MQ_RQ_IN_FLIGHT, normal completion still can come and complete this rq,
just like the above you described, right?


Thanks,
Ming


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Bart Van Assche
On Tue, 2018-04-10 at 07:20 -0700, Tejun Heo wrote:
> On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> > Since the request state can be updated from two different contexts,
> > namely regular completion and request timeout, this race cannot be
> > fixed with RCU synchronization only. Fix this race as follows:
> 
> Well, it can be and the patches have been posted months ago.

That's not correct. I have explained you in detail that the two patches you
posted do not fix all the races fixed by the patch at the start of this
e-mail thread.

> Switching to another model might be better but let's please do that
> with the right rationales.  A good portion of this seems to be built
> on misunderstandings.

Which misunderstandings? I'm not aware of any misunderstandings at my side.
Additionally, tests with two different block drivers (NVMeOF initiator and
the SRP initiator driver) have shown that the current blk-mq timeout
implementation with or without your two patches applied result in subtle and
hard to debug crashes and/or memory corruption. That is not the case for the
patch at the start of this thread. The latest report of a crash I ran into
myself and that is fixed by the patch at the start of this thread is
available here: https://www.spinics.net/lists/linux-rdma/msg63240.html.

Please also keep in mind that if this patch would be accepted that that does
not prevent this patch to be replaced with an RCU-based solution later on.
If anyone comes up any time with a reliably working RCU-based solution I
will be happy to accept a revert of this patch and I will help reviewing that
RCU-based solution.

Bart.




Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Tejun Heo
Hello, Bart.

On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> Since the request state can be updated from two different contexts,
> namely regular completion and request timeout, this race cannot be
> fixed with RCU synchronization only. Fix this race as follows:

Well, it can be and the patches have been posted months ago.  It just
needed a repro case to confirm the fix, which we now seem to have.

Switching to another model might be better but let's please do that
with the right rationales.  A good portion of this seems to be built
on misunderstandings.

Thanks.

-- 
tejun


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Bart Van Assche
On Tue, 2018-04-10 at 21:55 +0800, Ming Lei wrote:
> Then I have same question with Jianchao, what is the actual double
> complete in linus tree between BLK_EH_RESET_TIMER and normal completion?
> 
> Follows my understanding:
> 
> 1) when timeout is detected on one request, its aborted_gstate is
> updated in blk_mq_check_expired().
> 
> 2) run synchronize_rcu(), and make sure all pending completion is done
> 
> 3) run blk_mq_rq_timed_out()
> - ret = ops->timeout
> - blk_mq_rq_update_aborted_gstate(req, 0)
> - blk_add_timer(req);
> 
> If normal completion is done between 1) and reset aborted_gstate in 3),
> blk_mq_complete_request() will be called, and found that aborted_gstate
> is set, then the rq won't be completed really.
> 
> If normal completion is done after reset aborted_gstate in 3), it should
> be same with applying this patch.

Hello Ming,

Please keep in mind that all synchronize_rcu() does is to wait for pre-
existing RCU readers to finish. synchronize_rcu() does not prevent that new
rcu_read_lock() calls happen. It is e.g. possible that after
blk_mq_rq_update_aborted_gstate(req, 0) has been executed that a regular
completion occurs. If that request is not reused before the timer that was
restarted by the timeout code expires, that request will be completed twice.

Bart.





Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Ming Lei
On Tue, Apr 10, 2018 at 12:58:04PM +, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 16:41 +0800, Ming Lei wrote:
> > On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> > > If a completion occurs after blk_mq_rq_timed_out() has reset
> > > rq->aborted_gstate and the request is again in flight when the timeout
> > 
> > Given rq->aborted_gstate is reset only for BLK_EH_RESET_TIMER, I
> > think you are addressing the race between normal completion and
> > the timeout of BLK_EH_RESET_TIMER.
> 
> Yes, that's correct. I will make this more explicit.
> 
> > > expires then a request will be completed twice: a first time by the
> > > timeout handler and a second time when the regular completion occurs.
> > 
> > This patch looks simpler, and more like the previous model of
> > using blk_mark_rq_complete().
> 
> That's also correct.
> 
> > I have one question:
> > 
> > - with this patch, rq's state is updated atomically as cmpxchg. Suppose
> > one rq is timed out, the rq's state is updated as MQ_RQ_COMPLETE by
> > blk_mq_check_expired(), then ops->timeout() is run and
> > BLK_EH_RESET_TIMER is returned, so blk_mq_add_timer(req, MQ_RQ_COMPLETE,
> > MQ_RQ_IN_FLIGHT) is called to update rq's state into MQ_RQ_IN_FLIGHT.
> > 
> > Now the original normal completion still may occur after rq's state
> > becomes MQ_RQ_IN_FLIGHT, and seems there is still the double completion
> > with this patch? Maybe I am wrong, please explain a bit.
> 
> That scenario won't result in a double completion. After the timer has
> been reset the block driver blk_mq_complete_request() call will change
> the request state from MQ_RQ_IN_FLIGHT into MQ_RQ_COMPLETE. The next
> time blk_mq_check_expired() is called it will execute the following code:
> 
>   blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE);
> 
> That function call only changes the request state if the current state is
> IN_FLIGHT. However, the blk_mq_complete_request() call changed the request
> state into COMPLETE. Hence, the above blk_mq_change_rq_state() call will
> return false and the blk-mq timeout code will skip this request. If the
> request would already have been reused and would have been marked again as
> IN_FLIGHT then its deadline will also have been updated and the request
> will be skipped by the timeout code because its deadline has not yet
> expired.

OK.

Then I have same question with Jianchao, what is the actual double
complete in linus tree between BLK_EH_RESET_TIMER and normal completion?

Follows my understanding:

1) when timeout is detected on one request, its aborted_gstate is
updated in blk_mq_check_expired().

2) run synchronize_rcu(), and make sure all pending completion is done

3) run blk_mq_rq_timed_out()
- ret = ops->timeout
- blk_mq_rq_update_aborted_gstate(req, 0)
- blk_add_timer(req);

If normal completion is done between 1) and reset aborted_gstate in 3),
blk_mq_complete_request() will be called, and found that aborted_gstate
is set, then the rq won't be completed really.

If normal completion is done after reset aborted_gstate in 3), it should
be same with applying this patch.

> 
> > And synchronize_rcu() is needed by Tejun's approach between marking
> > COMPLETE and handling this rq's timeout, and the time can be quite long,
> > I guess it might be easier to trigger?
> 
> I have done what I could to trigger races between the regular completion
> path and the timeout code in my tests. Without this patch if I run the
> srp-test software I see crashes being reported in the rdma_rxe driver but
> with this patch applied I don't see any crashes with my tests.

I believe this patch may fix this issue, but I think the idea behind
should be understood.

Thanks,
Ming


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Bart Van Assche
On Tue, 2018-04-10 at 11:55 +0200, Christoph Hellwig wrote:
> I don't think we need the atomic_long_cmpxchg, and can do with a plain
> cmpxhg.  Also unterminated cmpxchg loops are a bad idea, but I think
> both callers are protected from other changes so we can turn that
> into a WARN_ON().

Hello Christoph,

I will remove the blk_mq_rq_update_state() function so that's one while
loop less.

Can you explain why you think that using cmpxchg() is safe in this context?
The regular completion path and the timeout code can both execute e.g. the
following code concurrently:

blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE);

That's why I think that we need an atomic compare and exchange instead of
cmpxchg(). What I found in the Intel Software Developer Manual seems to
confirm that:

Description

Compares the value in the AL, AX, EAX, or RAX register with the first
operand (destination operand). If the two values are equal, the second
operand (source operand) is loaded into the destination operand. Otherwise,
the destination operand is loaded into the AL, AX, EAX or RAX register. RAX
register is available only in 64-bit mode. This instruction can be used
with a LOCK prefix to allow the instruction to be executed atomically. To
simplify the interface to the processor’s bus, the destination operand
receives a write cycle without regard to the result of the comparison. The
destination operand is written back if the comparison fails; otherwise, the
source operand is written into the destination. (The processor never
produces a locked read without also producing a locked write.)

Thanks,

Bart.





Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Bart Van Assche
On Tue, 2018-04-10 at 15:59 +0800, jianchao.wang wrote:
> If yes, how does the timeout handler get the freed request when the tag has 
> been freed ?

Hello Jianchao,

Have you noticed that the timeout handler does not check whether or not the 
request
tag is freed? Additionally, I don't think it would be possible to add such a 
check
to the timeout code without introducing a new race condition.

Bart.




Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Bart Van Assche
On Tue, 2018-04-10 at 16:41 +0800, Ming Lei wrote:
> On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> > If a completion occurs after blk_mq_rq_timed_out() has reset
> > rq->aborted_gstate and the request is again in flight when the timeout
> 
> Given rq->aborted_gstate is reset only for BLK_EH_RESET_TIMER, I
> think you are addressing the race between normal completion and
> the timeout of BLK_EH_RESET_TIMER.

Yes, that's correct. I will make this more explicit.

> > expires then a request will be completed twice: a first time by the
> > timeout handler and a second time when the regular completion occurs.
> 
> This patch looks simpler, and more like the previous model of
> using blk_mark_rq_complete().

That's also correct.

> I have one question:
> 
> - with this patch, rq's state is updated atomically as cmpxchg. Suppose
> one rq is timed out, the rq's state is updated as MQ_RQ_COMPLETE by
> blk_mq_check_expired(), then ops->timeout() is run and
> BLK_EH_RESET_TIMER is returned, so blk_mq_add_timer(req, MQ_RQ_COMPLETE,
> MQ_RQ_IN_FLIGHT) is called to update rq's state into MQ_RQ_IN_FLIGHT.
> 
> Now the original normal completion still may occur after rq's state
> becomes MQ_RQ_IN_FLIGHT, and seems there is still the double completion
> with this patch? Maybe I am wrong, please explain a bit.

That scenario won't result in a double completion. After the timer has
been reset the block driver blk_mq_complete_request() call will change
the request state from MQ_RQ_IN_FLIGHT into MQ_RQ_COMPLETE. The next
time blk_mq_check_expired() is called it will execute the following code:

blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE);

That function call only changes the request state if the current state is
IN_FLIGHT. However, the blk_mq_complete_request() call changed the request
state into COMPLETE. Hence, the above blk_mq_change_rq_state() call will
return false and the blk-mq timeout code will skip this request. If the
request would already have been reused and would have been marked again as
IN_FLIGHT then its deadline will also have been updated and the request
will be skipped by the timeout code because its deadline has not yet
expired.

> And synchronize_rcu() is needed by Tejun's approach between marking
> COMPLETE and handling this rq's timeout, and the time can be quite long,
> I guess it might be easier to trigger?

I have done what I could to trigger races between the regular completion
path and the timeout code in my tests. Without this patch if I run the
srp-test software I see crashes being reported in the rdma_rxe driver but
with this patch applied I don't see any crashes with my tests.

Bart.




Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Shan Hai



On 2018年04月10日 18:04, Ming Lei wrote:

On Tue, Apr 10, 2018 at 03:59:30PM +0800, jianchao.wang wrote:

Hi Bart

On 04/10/2018 09:34 AM, Bart Van Assche wrote:

If a completion occurs after blk_mq_rq_timed_out() has reset
rq->aborted_gstate and the request is again in flight when the timeout
expires then a request will be completed twice: a first time by the
timeout handler and a second time when the regular completion occurs

Would you please detail more here about why the request could be completed 
twice ?

Is it the scenario you described as below in 
https://marc.info/?l=linux-block=151796816127318

The following race can occur between the code that resets the timer
and completion handling:
- The code that handles BLK_EH_RESET_TIMER resets aborted_gstate.
- A completion occurs and blk_mq_complete_request() calls
   __blk_mq_complete_request().
- The timeout code calls blk_add_timer() and that function sets the
   request deadline and adjusts the timer.
- __blk_mq_complete_request() frees the request tag.
- The timer fires and the timeout handler gets called for a freed
   request.
If yes, how does the timeout handler get the freed request when the tag has 
been freed ?

Thinking of this patch further.

The issue may not be a double completion issue, and it may be the
following behaviour which breaks NVMe or other drivers easily:

1) there is long delay(synchronize_rcu()) between setting rq->aborted_gstate
and handling the timeout by blk_mq_rq_timed_out().

2) during the long delay, the rq may be completed by hardware, then
if the following timeout is handled as BLK_EH_RESET_TIMER, it is
driver's bug, and driver's .timeout() may be confused about this
behaviour, I guess.

In theory this behaviour should exist in all these approaches,
but just easier to trigger if long delay is introduced before handling
timeout.


Or think it as below?


   C   C    C
+-+---+---+
S   T F  E

Request life time line:
S: start
T: timed out
F: found (by timer), inflight but timed out because of a long delay
E: completed by timeout handler
C: regular completion

normal request completion time range: (S, T)
completion in (T, F) is fine since it's not inflight anymore
race window time range: (F, E)

if the delayed request is completed in (F, E) range then it could be 
completed

twice by the regular completion first and then the timeout handler.

Thanks
Shan Hai

Thanks,
Ming




Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Ming Lei
On Tue, Apr 10, 2018 at 03:59:30PM +0800, jianchao.wang wrote:
> Hi Bart
> 
> On 04/10/2018 09:34 AM, Bart Van Assche wrote:
> > If a completion occurs after blk_mq_rq_timed_out() has reset
> > rq->aborted_gstate and the request is again in flight when the timeout
> > expires then a request will be completed twice: a first time by the
> > timeout handler and a second time when the regular completion occurs
> 
> Would you please detail more here about why the request could be completed 
> twice ?
> 
> Is it the scenario you described as below in 
> https://marc.info/?l=linux-block=151796816127318
> 
> The following race can occur between the code that resets the timer
> and completion handling:
> - The code that handles BLK_EH_RESET_TIMER resets aborted_gstate.
> - A completion occurs and blk_mq_complete_request() calls
>   __blk_mq_complete_request().
> - The timeout code calls blk_add_timer() and that function sets the
>   request deadline and adjusts the timer.
> - __blk_mq_complete_request() frees the request tag.
> - The timer fires and the timeout handler gets called for a freed
>   request.
> If yes, how does the timeout handler get the freed request when the tag has 
> been freed ?

Thinking of this patch further.

The issue may not be a double completion issue, and it may be the
following behaviour which breaks NVMe or other drivers easily:

1) there is long delay(synchronize_rcu()) between setting rq->aborted_gstate
and handling the timeout by blk_mq_rq_timed_out().

2) during the long delay, the rq may be completed by hardware, then
if the following timeout is handled as BLK_EH_RESET_TIMER, it is
driver's bug, and driver's .timeout() may be confused about this
behaviour, I guess.

In theory this behaviour should exist in all these approaches,
but just easier to trigger if long delay is introduced before handling
timeout.

Thanks,
Ming


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Christoph Hellwig
On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> If a completion occurs after blk_mq_rq_timed_out() has reset
> rq->aborted_gstate and the request is again in flight when the timeout
> expires then a request will be completed twice: a first time by the
> timeout handler and a second time when the regular completion occurs.
> 
> Additionally, the blk-mq timeout handling code ignores completions that
> occur after blk_mq_check_expired() has been called and before
> blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver
> timeout handler always returns BLK_EH_RESET_TIMER then the result will
> be that the request never terminates.
> 
> Since the request state can be updated from two different contexts,
> namely regular completion and request timeout, this race cannot be
> fixed with RCU synchronization only. Fix this race as follows:
> - Split __deadline in two variables, namely lq_deadline for legacy
>   request queues and mq_deadline for blk-mq request queues. Use atomic
>   operations to update mq_deadline.

I don't think we need the atomic_long_cmpxchg, and can do with a plain
cmpxhg.  Also unterminated cmpxchg loops are a bad idea, but I think
both callers are protected from other changes so we can turn that
into a WARN_ON().  That plus some minor cleanups in the version below,
only boot tested:

diff --git a/block/blk-core.c b/block/blk-core.c
index abcb8684ba67..7e88e4a9133d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -199,8 +199,6 @@ void blk_rq_init(struct request_queue *q, struct request 
*rq)
rq->start_time = jiffies;
set_start_time_ns(rq);
rq->part = NULL;
-   seqcount_init(>gstate_seq);
-   u64_stats_init(>aborted_gstate_sync);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 58b3b79cbe83..ecb934bafb29 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -345,7 +345,6 @@ static const char *const rqf_name[] = {
RQF_NAME(STATS),
RQF_NAME(SPECIAL_PAYLOAD),
RQF_NAME(ZONE_WRITE_LOCKED),
-   RQF_NAME(MQ_TIMEOUT_EXPIRED),
RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f5c7dbcb954f..0f5c56789fde 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -477,7 +477,9 @@ void blk_mq_free_request(struct request *rq)
if (blk_rq_rl(rq))
blk_put_rl(blk_rq_rl(rq));
 
-   blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+   if (!blk_mq_change_rq_state(rq, blk_mq_rq_state(rq), MQ_RQ_IDLE))
+   WARN_ON_ONCE(1);
+
if (rq->tag != -1)
blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
if (sched_tag != -1)
@@ -523,8 +525,7 @@ static void __blk_mq_complete_request(struct request *rq)
bool shared = false;
int cpu;
 
-   WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
-   blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
+   WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
 
if (rq->internal_tag != -1)
blk_mq_sched_completed_request(rq);
@@ -573,36 +574,6 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int 
*srcu_idx)
*srcu_idx = srcu_read_lock(hctx->srcu);
 }
 
-static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
-{
-   unsigned long flags;
-
-   /*
-* blk_mq_rq_aborted_gstate() is used from the completion path and
-* can thus be called from irq context.  u64_stats_fetch in the
-* middle of update on the same CPU leads to lockup.  Disable irq
-* while updating.
-*/
-   local_irq_save(flags);
-   u64_stats_update_begin(>aborted_gstate_sync);
-   rq->aborted_gstate = gstate;
-   u64_stats_update_end(>aborted_gstate_sync);
-   local_irq_restore(flags);
-}
-
-static u64 blk_mq_rq_aborted_gstate(struct request *rq)
-{
-   unsigned int start;
-   u64 aborted_gstate;
-
-   do {
-   start = u64_stats_fetch_begin(>aborted_gstate_sync);
-   aborted_gstate = rq->aborted_gstate;
-   } while (u64_stats_fetch_retry(>aborted_gstate_sync, start));
-
-   return aborted_gstate;
-}
-
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:the request being processed
@@ -614,27 +585,12 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 void blk_mq_complete_request(struct request *rq)
 {
struct request_queue *q = rq->q;
-   struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
-   int srcu_idx;
 
if (unlikely(blk_should_fake_timeout(q)))
return;
 
-   /*
-* If @rq->aborted_gstate equals the current instance, timeout is
-* claiming @rq and we lost.  This is synchronized through
-* hctx_lock().  See blk_mq_timeout_work() for details.
-*
-* Completion path never blocks and we can directly use RCU here
-* instead of 

Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread Ming Lei
On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> If a completion occurs after blk_mq_rq_timed_out() has reset
> rq->aborted_gstate and the request is again in flight when the timeout

Given rq->aborted_gstate is reset only for BLK_EH_RESET_TIMER, I
think you are addressing the race between normal completion and
the timeout of BLK_EH_RESET_TIMER.

> expires then a request will be completed twice: a first time by the
> timeout handler and a second time when the regular completion occurs.

This patch looks simpler, and more like the previous model of
using blk_mark_rq_complete().

I have one question:

- with this patch, rq's state is updated atomically as cmpxchg. Suppose
one rq is timed out, the rq's state is updated as MQ_RQ_COMPLETE by
blk_mq_check_expired(), then ops->timeout() is run and
BLK_EH_RESET_TIMER is returned, so blk_mq_add_timer(req, MQ_RQ_COMPLETE,
MQ_RQ_IN_FLIGHT) is called to update rq's state into MQ_RQ_IN_FLIGHT.

Now the original normal completion still may occur after rq's state
becomes MQ_RQ_IN_FLIGHT, and seems there is still the double completion
with this patch? Maybe I am wrong, please explain a bit.

And synchronize_rcu() is needed by Tejun's approach between marking
COMPLETE and handling this rq's timeout, and the time can be quite long,
I guess it might be easier to trigger?

Thanks,
Ming


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-10 Thread jianchao.wang
Hi Bart

On 04/10/2018 09:34 AM, Bart Van Assche wrote:
> If a completion occurs after blk_mq_rq_timed_out() has reset
> rq->aborted_gstate and the request is again in flight when the timeout
> expires then a request will be completed twice: a first time by the
> timeout handler and a second time when the regular completion occurs

Would you please detail more here about why the request could be completed 
twice ?

Is it the scenario you described as below in 
https://marc.info/?l=linux-block=151796816127318

The following race can occur between the code that resets the timer
and completion handling:
- The code that handles BLK_EH_RESET_TIMER resets aborted_gstate.
- A completion occurs and blk_mq_complete_request() calls
  __blk_mq_complete_request().
- The timeout code calls blk_add_timer() and that function sets the
  request deadline and adjusts the timer.
- __blk_mq_complete_request() frees the request tag.
- The timer fires and the timeout handler gets called for a freed
  request.
If yes, how does the timeout handler get the freed request when the tag has 
been freed ?

Thanks
Jianchao