Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-19 Thread Keith Busch
On Tue, Sep 19, 2017 at 03:18:45PM +, Bart Van Assche wrote: > On Tue, 2017-09-19 at 11:07 -0400, Keith Busch wrote: > > The problem is when blk-mq's timeout handler prevents the request from > > completing, and doesn't leave any indication the driver requested to > > complete it. Who is

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-19 Thread Keith Busch
On Tue, Sep 19, 2017 at 11:22:20PM +0800, Ming Lei wrote: > On Tue, Sep 19, 2017 at 11:07 PM, Keith Busch wrote: > > On Tue, Sep 19, 2017 at 12:16:31PM +0800, Ming Lei wrote: > >> On Tue, Sep 19, 2017 at 7:08 AM, Keith Busch wrote: > >> > > >> >

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-19 Thread Ming Lei
On Tue, Sep 19, 2017 at 11:07 PM, Keith Busch wrote: > On Tue, Sep 19, 2017 at 12:16:31PM +0800, Ming Lei wrote: >> On Tue, Sep 19, 2017 at 7:08 AM, Keith Busch wrote: >> > >> > Indeed that prevents .complete from running concurrently with the >> >

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-19 Thread Bart Van Assche
On Tue, 2017-09-19 at 11:07 -0400, Keith Busch wrote: > The problem is when blk-mq's timeout handler prevents the request from > completing, and doesn't leave any indication the driver requested to > complete it. Who is responsible for completing that request now? Hello Keith, My understanding

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-19 Thread Bart Van Assche
On Mon, 2017-09-18 at 21:55 -0400, Keith Busch wrote: > The only way to complete that request now is if the timeout > handler returns BLK_EH_HANDLED, but the scsi-mq abort path returns > BLK_EH_NOT_HANDLED on success (very few drivers actually return > BLK_EH_HANDLED). > > After the timeout

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-19 Thread Keith Busch
On Tue, Sep 19, 2017 at 12:16:31PM +0800, Ming Lei wrote: > On Tue, Sep 19, 2017 at 7:08 AM, Keith Busch wrote: > > > > Indeed that prevents .complete from running concurrently with the > > timeout handler, but scsi_mq_done and nvme_handle_cqe are not .complete > >

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Ming Lei
On Tue, Sep 19, 2017 at 7:08 AM, Keith Busch wrote: > On Mon, Sep 18, 2017 at 10:53:12PM +, Bart Van Assche wrote: >> On Mon, 2017-09-18 at 18:39 -0400, Keith Busch wrote: >> > The nvme driver's use of blk_mq_reinit_tagset only happens during >> > controller

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Keith Busch
On Mon, Sep 18, 2017 at 11:14:38PM +, Bart Van Assche wrote: > On Mon, 2017-09-18 at 19:08 -0400, Keith Busch wrote: > > On Mon, Sep 18, 2017 at 10:53:12PM +, Bart Van Assche wrote: > > > Are you sure that scenario can happen? The blk-mq core calls > > > test_and_set_bit() > > > for the

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Bart Van Assche
On Mon, 2017-09-18 at 19:08 -0400, Keith Busch wrote: > On Mon, Sep 18, 2017 at 10:53:12PM +, Bart Van Assche wrote: > > Are you sure that scenario can happen? The blk-mq core calls > > test_and_set_bit() > > for the REQ_ATOM_COMPLETE flag before any completion or timeout handler is > >

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Keith Busch
On Mon, Sep 18, 2017 at 10:53:12PM +, Bart Van Assche wrote: > On Mon, 2017-09-18 at 18:39 -0400, Keith Busch wrote: > > The nvme driver's use of blk_mq_reinit_tagset only happens during > > controller initialisation, but I'm seeing lost commands well after that > > during normal and stable

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Bart Van Assche
On Mon, 2017-09-18 at 18:39 -0400, Keith Busch wrote: > The nvme driver's use of blk_mq_reinit_tagset only happens during > controller initialisation, but I'm seeing lost commands well after that > during normal and stable running. > > The timing is pretty narrow to hit, but I'm pretty sure this

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Keith Busch
On Mon, Sep 18, 2017 at 10:07:58PM +, Bart Van Assche wrote: > On Mon, 2017-09-18 at 18:03 -0400, Keith Busch wrote: > > I think we've always known it's possible to lose a request during timeout > > handling, but just accepted that possibility. It seems to be causing > > problems, though,

Re: [RFC PATCH] blk-mq: Fix lost request during timeout

2017-09-18 Thread Bart Van Assche
On Mon, 2017-09-18 at 18:03 -0400, Keith Busch wrote: > I think we've always known it's possible to lose a request during timeout > handling, but just accepted that possibility. It seems to be causing > problems, though, leading to unnecessary error escalation and IO failures. > > The possibility