Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-21 Thread Bart Van Assche
On Wed, 2018-02-21 at 11:21 -0800, t...@kernel.org wrote: > Hello, Bart. > > On Wed, Feb 21, 2018 at 06:53:05PM +, Bart Van Assche wrote: > > On Sun, 2018-02-18 at 05:11 -0800, t...@kernel.org wrote: > > > On Wed, Feb 14, 2018 at 04:58:56PM +, Bart Van Assche wrote: > > > > With this

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-21 Thread Bart Van Assche
On Sun, 2018-02-18 at 05:11 -0800, t...@kernel.org wrote: > On Wed, Feb 14, 2018 at 04:58:56PM +, Bart Van Assche wrote: > > With this patch applied the tests I ran so far pass. > > Ah, great to hear. Thanks a lot for testing. Can you please verify > the following? It's the same approach

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-21 Thread t...@kernel.org
Hello, Bart. On Wed, Feb 21, 2018 at 06:53:05PM +, Bart Van Assche wrote: > On Sun, 2018-02-18 at 05:11 -0800, t...@kernel.org wrote: > > On Wed, Feb 14, 2018 at 04:58:56PM +, Bart Van Assche wrote: > > > With this patch applied the tests I ran so far pass. > > > > Ah, great to hear.

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-18 Thread t...@kernel.org
Hello, Bart. On Wed, Feb 14, 2018 at 04:58:56PM +, Bart Van Assche wrote: > With this patch applied the tests I ran so far pass. Ah, great to hear. Thanks a lot for testing. Can you please verify the following? It's the same approach but with RCU sync batching. Thanks. Index:

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-14 Thread Bart Van Assche
On Tue, 2018-02-13 at 13:20 -0800, t...@kernel.org wrote: > On Thu, Feb 08, 2018 at 04:31:43PM +, Bart Van Assche wrote: > > The crash is reported at address scsi_times_out+0x17 == scsi_times_out+23. > > The > > instruction at that address tries to dereference scsi_cmnd.device (%rax). > >

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-13 Thread t...@kernel.org
Hello, Bart. Sorry about the delay. On Thu, Feb 08, 2018 at 04:31:43PM +, Bart Van Assche wrote: > The crash is reported at address scsi_times_out+0x17 == scsi_times_out+23. The > instruction at that address tries to dereference scsi_cmnd.device (%rax). The > register dump shows that that

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-08 Thread t...@kernel.org
On Thu, Feb 08, 2018 at 05:48:32PM +, Bart Van Assche wrote: > On Thu, 2018-02-08 at 09:40 -0800, t...@kernel.org wrote: > > Heh, sorry about not being clear. What I'm trying to say is that > > scmd->device != NULL && device->host == NULL. Or was this what you > > were saying all along? > >

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-08 Thread Bart Van Assche
On Thu, 2018-02-08 at 09:40 -0800, t...@kernel.org wrote: > Heh, sorry about not being clear. What I'm trying to say is that > scmd->device != NULL && device->host == NULL. Or was this what you > were saying all along? What I agree with is that the request pointer (req argument) is stored in

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-08 Thread t...@kernel.org
On Thu, Feb 08, 2018 at 05:37:46PM +, Bart Van Assche wrote: > On Thu, 2018-02-08 at 09:19 -0800, t...@kernel.org wrote: > > Hello, Bart. > > > > On Thu, Feb 08, 2018 at 05:10:45PM +, Bart Van Assche wrote: > > > I think "dereferencing a pointer" means reading the memory location that >

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-08 Thread Bart Van Assche
On Thu, 2018-02-08 at 09:19 -0800, t...@kernel.org wrote: > Hello, Bart. > > On Thu, Feb 08, 2018 at 05:10:45PM +, Bart Van Assche wrote: > > I think "dereferencing a pointer" means reading the memory location that > > pointer points > > at? Anyway, I think we both interpret the crash report

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-08 Thread t...@kernel.org
Hello, Bart. On Thu, Feb 08, 2018 at 05:10:45PM +, Bart Van Assche wrote: > I think "dereferencing a pointer" means reading the memory location that > pointer points > at? Anyway, I think we both interpret the crash report in the same way, > namely that it > means that scmd->device == NULL.

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-08 Thread Bart Van Assche
On Thu, 2018-02-08 at 09:00 -0800, t...@kernel.org wrote: > On Thu, Feb 08, 2018 at 04:31:43PM +, Bart Van Assche wrote: > > The crash is reported at address scsi_times_out+0x17 == scsi_times_out+23. > > The > > instruction at that address tries to dereference scsi_cmnd.device (%rax). > >

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-08 Thread t...@kernel.org
Hello, Bart. On Thu, Feb 08, 2018 at 04:31:43PM +, Bart Van Assche wrote: > > That sounds more like a scsi hotplug bug than an issue in the timeout > > code unless we messed up @req pointer to begin with. > > I don't think that this is related to SCSI hotplugging: this crash does not > occur

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-08 Thread Bart Van Assche
On Thu, 2018-02-08 at 07:39 -0800, t...@kernel.org wrote: > On Thu, Feb 08, 2018 at 01:09:57AM +, Bart Van Assche wrote: > > On Wed, 2018-02-07 at 23:48 +, Bart Van Assche wrote: > > > With this patch applied I see requests for which it seems like the > > > timeout handler > > > did not

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-08 Thread t...@kernel.org
On Thu, Feb 08, 2018 at 07:39:40AM -0800, t...@kernel.org wrote: > That sounds more like a scsi hotplug but than an issue in the timeout ^bug > code unless we messed up @req pointer to begin with. -- tejun

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-08 Thread t...@kernel.org
On Thu, Feb 08, 2018 at 01:09:57AM +, Bart Van Assche wrote: > On Wed, 2018-02-07 at 23:48 +, Bart Van Assche wrote: > > With this patch applied I see requests for which it seems like the timeout > > handler > > did not get invoked: [ ... ] > > I just noticed the following in the system

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread Bart Van Assche
On Wed, 2018-02-07 at 23:48 +, Bart Van Assche wrote: > With this patch applied I see requests for which it seems like the timeout > handler > did not get invoked: [ ... ] I just noticed the following in the system log, which is probably the reason why some requests got stuck: Feb 7

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread Bart Van Assche
On Wed, 2018-02-07 at 12:07 -0800, t...@kernel.org wrote: > Ah, you're right. u64_stat_sync doesn't imply barriers, so we want > something like the following. > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index df93102..d6edf3b 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread t...@kernel.org
Hello, On Wed, Feb 07, 2018 at 09:02:21PM +, Bart Van Assche wrote: > The patch that I used in my test had an smp_wmb() call (see also below). > Anyway, > I will see whether I can extract more state information through debugfs. > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread Bart Van Assche
On Wed, 2018-02-07 at 12:09 -0800, t...@kernel.org wrote: > Hello, > > On Wed, Feb 07, 2018 at 07:03:56PM +, Bart Van Assche wrote: > > I tried the above patch but already during the first iteration of the test I > > noticed that the test hung, probably due to the following request that got

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread t...@kernel.org
Hello, On Wed, Feb 07, 2018 at 07:03:56PM +, Bart Van Assche wrote: > I tried the above patch but already during the first iteration of the test I > noticed that the test hung, probably due to the following request that got > stuck: > > $ (cd /sys/kernel/debug/block && grep -aH .

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread t...@kernel.org
Hello, Bart. On Wed, Feb 07, 2018 at 06:14:13PM +, Bart Van Assche wrote: > When I wrote my comment I was not sure whether or not non-reentrancy is > guaranteed for work queue items. However, according to what I found in the > workqueue implementation I think that is guaranteed. So it

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread Bart Van Assche
On Wed, 2018-02-07 at 09:06 -0800, Tejun Heo wrote: > Can you see whether by any chance the following patch fixes the issue? > If not, can you share the repro case? > > Thanks. > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index df93102..651d18c 100644 > --- a/block/blk-mq.c > +++

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread Bart Van Assche
On Wed, 2018-02-07 at 09:35 -0800, t...@kernel.org wrote: > On Wed, Feb 07, 2018 at 05:27:10PM +, Bart Van Assche wrote: > > Even with the above change I think that there is still a race between the > > code that handles timer resets and the completion handler. > > Can you elaborate the

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread t...@kernel.org
Hello, Bart. On Wed, Feb 07, 2018 at 05:27:10PM +, Bart Van Assche wrote: > Even with the above change I think that there is still a race between the > code that handles timer resets and the completion handler. Anyway, the test Can you elaborate the scenario a bit further? If you're

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread Bart Van Assche
On Wed, 2018-02-07 at 09:06 -0800, Tejun Heo wrote: > On Tue, Feb 06, 2018 at 05:11:33PM -0800, Bart Van Assche wrote: > > The following race can occur between the code that resets the timer > > and completion handling: > > - The code that handles BLK_EH_RESET_TIMER resets aborted_gstate. > > - A

Re: [PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-07 Thread Tejun Heo
Hello, Bart. On Tue, Feb 06, 2018 at 05:11:33PM -0800, Bart Van Assche wrote: > The following race can occur between the code that resets the timer > and completion handling: > - The code that handles BLK_EH_RESET_TIMER resets aborted_gstate. > - A completion occurs and blk_mq_complete_request()

[PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-06 Thread Bart Van Assche
The following race can occur between the code that resets the timer and completion handling: - The code that handles BLK_EH_RESET_TIMER resets aborted_gstate. - A completion occurs and blk_mq_complete_request() calls __blk_mq_complete_request(). - The timeout code calls blk_add_timer() and that