Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-19 Thread Keith Busch
On Tue, Sep 19, 2017 at 03:18:45PM +, Bart Van Assche wrote: > On Tue, 2017-09-19 at 11:07 -0400, Keith Busch wrote: > > The problem is when blk-mq's timeout handler prevents the request from > > completing, and doesn't leave any indication the driver requested to > > complete it. Who is

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-19 Thread Keith Busch
On Tue, Sep 19, 2017 at 11:22:20PM +0800, Ming Lei wrote: > On Tue, Sep 19, 2017 at 11:07 PM, Keith Busch wrote: > > On Tue, Sep 19, 2017 at 12:16:31PM +0800, Ming Lei wrote: > >> On Tue, Sep 19, 2017 at 7:08 AM, Keith Busch wrote: > >> > > >> >

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-19 Thread Ming Lei
On Tue, Sep 19, 2017 at 11:07 PM, Keith Busch wrote: > On Tue, Sep 19, 2017 at 12:16:31PM +0800, Ming Lei wrote: >> On Tue, Sep 19, 2017 at 7:08 AM, Keith Busch wrote: >> > >> > Indeed that prevents .complete from running concurrently with the >> >

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-19 Thread Bart Van Assche
On Tue, 2017-09-19 at 11:07 -0400, Keith Busch wrote: > The problem is when blk-mq's timeout handler prevents the request from > completing, and doesn't leave any indication the driver requested to > complete it. Who is responsible for completing that request now? Hello Keith, My understanding

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-19 Thread Bart Van Assche
On Mon, 2017-09-18 at 21:55 -0400, Keith Busch wrote: > The only way to complete that request now is if the timeout > handler returns BLK_EH_HANDLED, but the scsi-mq abort path returns > BLK_EH_NOT_HANDLED on success (very few drivers actually return > BLK_EH_HANDLED). > > After the timeout

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-19 Thread Keith Busch
On Tue, Sep 19, 2017 at 12:16:31PM +0800, Ming Lei wrote: > On Tue, Sep 19, 2017 at 7:08 AM, Keith Busch wrote: > > > > Indeed that prevents .complete from running concurrently with the > > timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete > >

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Ming Lei
On Tue, Sep 19, 2017 at 7:08 AM, Keith Busch wrote: > On Mon, Sep 18, 2017 at 10:53:12PM +, Bart Van Assche wrote: >> On Mon, 2017-09-18 at 18:39 -0400, Keith Busch wrote: >> > The nvme driver's use of blk_mq_reinit_tagset only happens during >> > controller

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Keith Busch
On Mon, Sep 18, 2017 at 11:14:38PM +, Bart Van Assche wrote: > On Mon, 2017-09-18 at 19:08 -0400, Keith Busch wrote: > > On Mon, Sep 18, 2017 at 10:53:12PM +, Bart Van Assche wrote: > > > Are you sure that scenario can happen? The blk-mq core calls > > > test_and_set_bit() > > > for the

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Bart Van Assche
On Mon, 2017-09-18 at 19:08 -0400, Keith Busch wrote: > On Mon, Sep 18, 2017 at 10:53:12PM +, Bart Van Assche wrote: > > Are you sure that scenario can happen? The blk-mq core calls > > test_and_set_bit() > > for the REQ_ATOM_COMPLETE flag before any completion or timeout handler is > >

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Keith Busch
On Mon, Sep 18, 2017 at 10:53:12PM +, Bart Van Assche wrote: > On Mon, 2017-09-18 at 18:39 -0400, Keith Busch wrote: > > The nvme driver's use of blk_mq_reinit_tagset only happens during > > controller initialisation, but I'm seeing lost commands well after that > > during normal and stable

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Bart Van Assche
On Mon, 2017-09-18 at 18:39 -0400, Keith Busch wrote: > The nvme driver's use of blk_mq_reinit_tagset only happens during > controller initialisation, but I'm seeing lost commands well after that > during normal and stable running. > > The timing is pretty narrow to hit, but I'm pretty sure this

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Keith Busch
On Mon, Sep 18, 2017 at 10:07:58PM +, Bart Van Assche wrote: > On Mon, 2017-09-18 at 18:03 -0400, Keith Busch wrote: > > I think we've always known it's possible to lose a request during timeout > > handling, but just accepted that possibility. It seems to be causing > > problems, though,

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Bart Van Assche
On Mon, 2017-09-18 at 18:03 -0400, Keith Busch wrote: > I think we've always known it's possible to lose a request during timeout > handling, but just accepted that possibility. It seems to be causing > problems, though, leading to unnecessary error escalation and IO failures. > > The possibility

[RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Keith Busch
I think we've always known it's possible to lose a request during timeout handling, but just accepted that possibility. It seems to be causing problems, though, leading to unnecessary error escalation and IO failures. The possibility arises when the block layer marks the request complete prior to