Re: [PATCH V3] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-15 Thread Ming Lei
On Sat, Apr 14, 2018 at 03:22:07PM +, Bart Van Assche wrote:
> On Fri, 2018-04-13 at 21:06 -0600, Jens Axboe wrote:
> > I like this approach since it keeps the cost outside of the fast
> > path. And it's fine to reuse the queue lock for this, instead of
> > adding a special lock for something we consider a rare occurrence.
> > 
> > From a quick look this looks sane, but I'll take a closer look
> > tomorrow and add some testing too.

Jens, please hold on. I will post V4 soon, which improves on V3 a bit.

> 
> Shouldn't we know the root cause of the "RIP: scsi_times_out+0x17" crash
> reported in https://bugzilla.kernel.org/show_bug.cgi?id=199077 before we
> decide how to proceed?

I will ask Martin to test V4 once it is posted.


Thanks,
Ming


Re: [PATCH V3] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-14 Thread Bart Van Assche
On Fri, 2018-04-13 at 21:06 -0600, Jens Axboe wrote:
> I like this approach since it keeps the cost outside of the fast
> path. And it's fine to reuse the queue lock for this, instead of
> adding a special lock for something we consider a rare occurrence.
> 
> From a quick look this looks sane, but I'll take a closer look
> tomorrow and add some testing too.

Shouldn't we know the root cause of the "RIP: scsi_times_out+0x17" crash
reported in https://bugzilla.kernel.org/show_bug.cgi?id=199077 before we
decide how to proceed?

Thanks,

Bart.


Re: [PATCH V3] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-13 Thread Jens Axboe
On 4/12/18 5:59 AM, Ming Lei wrote:
> The normal request completion can be done before or during handling
> BLK_EH_RESET_TIMER, and this race may cause the request to never be
> completed since driver's .timeout() may always return
> BLK_EH_RESET_TIMER.
> 
> This issue can't be fixed completely by the driver, since the normal
> completion can happen between .timeout() returning and the handling
> of BLK_EH_RESET_TIMER.
> 
> This patch fixes the race by introducing the rq state
> MQ_RQ_COMPLETE_IN_RESET and by reading and writing the rq's state
> under the queue lock. The lock could be per-request, but introducing
> a new lock is not necessary for such an unusual event.
> 
> Also, when .timeout() returns BLK_EH_HANDLED, sync with the normal
> completion path before finally completing the timed-out rq, so that
> the normal completion path does not touch this rq's state.
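
The interleaving described above can be sketched as a small userspace
model in C. This is not the patch and not kernel code: only
MQ_RQ_COMPLETE_IN_RESET and the BLK_EH_* names come from the description.
The MQ_RQ_IN_TIMEOUT state, the struct layout, the helper names, and the
pthread mutex standing in for the queue lock are all assumptions made for
illustration. The model also takes the lock on every completion for
simplicity, which the real patch is described as avoiding on the fast path.

```c
#include <assert.h>
#include <pthread.h>

/* Toy model only. MQ_RQ_IN_TIMEOUT is a hypothetical state. */
enum rq_state {
    MQ_RQ_IN_FLIGHT,
    MQ_RQ_IN_TIMEOUT,        /* hypothetical: .timeout() is being handled */
    MQ_RQ_COMPLETE_IN_RESET, /* completion arrived during timeout handling */
    MQ_RQ_COMPLETE,
};

enum eh_ret { BLK_EH_HANDLED, BLK_EH_RESET_TIMER };

struct request {
    enum rq_state state;
    int end_io_calls;  /* how many times the request was finished */
    int timer_rearms;  /* how many times the timer was re-armed */
};

/* Stand-in for the queue lock that guards rq->state in this model. */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

/* Caller must hold queue_lock. */
static void finish(struct request *rq)
{
    assert(rq->state != MQ_RQ_COMPLETE); /* never finish twice */
    rq->state = MQ_RQ_COMPLETE;
    rq->end_io_calls++;
}

/* Normal completion: if the timeout path owns the request, only record
 * that a completion happened and let the timeout path finish it. */
static void complete_request(struct request *rq)
{
    pthread_mutex_lock(&queue_lock);
    if (rq->state == MQ_RQ_IN_TIMEOUT)
        rq->state = MQ_RQ_COMPLETE_IN_RESET;
    else if (rq->state == MQ_RQ_IN_FLIGHT)
        finish(rq);
    pthread_mutex_unlock(&queue_lock);
}

/* Timeout path, step 1: claim the request before calling .timeout(). */
static void timeout_begin(struct request *rq)
{
    pthread_mutex_lock(&queue_lock);
    if (rq->state == MQ_RQ_IN_FLIGHT)
        rq->state = MQ_RQ_IN_TIMEOUT;
    pthread_mutex_unlock(&queue_lock);
}

/* Timeout path, step 2: act on the value returned by .timeout(). */
static void timeout_end(struct request *rq, enum eh_ret ret)
{
    pthread_mutex_lock(&queue_lock);
    if (rq->state == MQ_RQ_COMPLETE_IN_RESET) {
        finish(rq);                  /* do not re-arm: it already completed */
    } else if (rq->state == MQ_RQ_IN_TIMEOUT) {
        if (ret == BLK_EH_HANDLED) {
            finish(rq);
        } else {
            rq->state = MQ_RQ_IN_FLIGHT;
            rq->timer_rearms++;      /* a real timer would be re-armed here */
        }
    }
    pthread_mutex_unlock(&queue_lock);
}

/* Replays the racy order: completion lands while .timeout() is handled. */
static struct request simulate_race(enum eh_ret ret)
{
    struct request rq = { MQ_RQ_IN_FLIGHT, 0, 0 };

    timeout_begin(&rq);
    complete_request(&rq);
    timeout_end(&rq, ret);
    return rq;
}
```

Calling simulate_race(BLK_EH_RESET_TIMER) in this model ends with the
request finished exactly once and the timer not re-armed, which is the
outcome the description asks for: a completion that lands during timeout
handling is no longer lost when .timeout() keeps returning
BLK_EH_RESET_TIMER.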

I like this approach since it keeps the cost outside of the fast
path. And it's fine to reuse the queue lock for this, instead of
adding a special lock for something we consider a rare occurrence.

From a quick look this looks sane, but I'll take a closer look
tomorrow and add some testing too.

-- 
Jens Axboe