Re: [PATCH 4/6] nvme-pci: break up nvme_timeout and nvme_dev_disable

2018-02-04 Thread jianchao.wang
Hi Keith

Thanks for your kind response and comments.
They are really appreciated.

On 02/03/2018 02:31 AM, Keith Busch wrote:
> On Fri, Feb 02, 2018 at 03:00:47PM +0800, Jianchao Wang wrote:
>> Currently, the complicated relationship between nvme_dev_disable
>> and nvme_timeout has become a devil that introduces many
>> circular patterns which may trigger deadlock or IO hang. Let's
>> enumerate the tangles between them:
>>  - nvme_timeout has to invoke nvme_dev_disable to stop the
>>    controller's DMA access before freeing the request.
>>  - nvme_dev_disable depends on nvme_timeout to complete
>>    adminq requests that set the HMB or delete sq/cq when the
>>    controller has no response.
>>  - nvme_dev_disable races with nvme_timeout when cancelling the
>>    outstanding requests.
> 
> Your patch is releasing a command back to the OS with the
> PCI controller bus master still enabled. This could lead to data or
> memory corruption.
> 

There are two cases in which nvme_timeout will return:
BLK_EH_HANDLED
BLK_EH_NOT_HANDLED

For the 1st case, the patch will disable the controller.
Then the controller will stop processing any outstanding commands and
delete the sq/cq queues, as the protocol requires.
It looks like that is still not enough; I will also disable the _pci_ in
nvme_pci_disable_dev_directly in the next version.
Really, thanks for your guidance here.

For the 2nd case, it will return BLK_EH_NOT_HANDLED.
blk_mq_rq_timed_out will do nothing for this case.
All the commands will be handled after everything is disabled.
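The consequence of each return value can be sketched with a small user-space model (illustrative only: the enum mirrors the kernel's blk_eh_timer_return values named above, but the struct and handler here are hypothetical stand-ins, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of what the block layer does with the two timeout verdicts
 * discussed above. */
enum blk_eh_timer_return { BLK_EH_NOT_HANDLED, BLK_EH_HANDLED };

struct model_req {
	bool completed;   /* completed and freed by the block layer */
};

/* BLK_EH_HANDLED: the driver has already stopped the controller, so the
 * request is completed right away.  BLK_EH_NOT_HANDLED: the block layer
 * does nothing; the request stays outstanding until nvme_dev_disable
 * completes it after the controller is fully disabled. */
static void model_rq_timed_out(struct model_req *req,
			       enum blk_eh_timer_return verdict)
{
	if (verdict == BLK_EH_HANDLED)
		req->completed = true;
}
```

In this model a request that gets BLK_EH_NOT_HANDLED remains untouched, which is exactly why the disabling path must complete it later.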

> In any case, it's not as complicated as you're making it out to
> be. It'd be easier to just enforce the existing rule that commands
> issued in the disabling path not depend on completions or timeout
> handling. All of the commands issued in this path already do this
> except for HMB disabling. Let's just fix that command, right?
> 
We will still meet the case where nvme_timeout invokes nvme_dev_disable
and cannot synchronize on the outstanding requests. This is really a
devil and will block other improvements.
This patch does just two things:
1. Grab all the previous outstanding requests with blk_abort_request,
then release them after the controller is totally disabled/shutdown.
   Consequently, during the disable/shutdown and initialization
procedure, the nvme_timeout path only needs to serve them. This also
   ensures there will be _no_ outstanding requests after nvme_dev_disable.
2. Fail the adminq commands issued during the disable/shutdown and
initialization procedure when the controller gives no response. We need
   two steps for this: disable the controller/pci, then complete the
command. Then nvme_timeout will not need to invoke nvme_dev_disable, and
   nvme_dev_disable will be independent.
 
Please consider this.

Many thanks
Jianchao




Re: [PATCH 4/6] nvme-pci: break up nvme_timeout and nvme_dev_disable

2018-02-02 Thread Keith Busch
On Fri, Feb 02, 2018 at 03:00:47PM +0800, Jianchao Wang wrote:
> Currently, the complicated relationship between nvme_dev_disable
> and nvme_timeout has become a devil that introduces many
> circular patterns which may trigger deadlock or IO hang. Let's
> enumerate the tangles between them:
>  - nvme_timeout has to invoke nvme_dev_disable to stop the
>    controller's DMA access before freeing the request.
>  - nvme_dev_disable depends on nvme_timeout to complete
>    adminq requests that set the HMB or delete sq/cq when the
>    controller has no response.
>  - nvme_dev_disable races with nvme_timeout when cancelling the
>    outstanding requests.

Your patch is releasing a command back to the OS with the
PCI controller bus master still enabled. This could lead to data or
memory corruption.

In any case, it's not as complicated as you're making it out to
be. It'd be easier to just enforce the existing rule that commands
issued in the disabling path not depend on completions or timeout
handling. All of the commands issued in this path already do this
except for HMB disabling. Let's just fix that command, right?



[PATCH 4/6] nvme-pci: break up nvme_timeout and nvme_dev_disable

2018-02-01 Thread Jianchao Wang
Currently, the complicated relationship between nvme_dev_disable
and nvme_timeout has become a devil that introduces many
circular patterns which may trigger deadlock or IO hang. Let's
enumerate the tangles between them:
 - nvme_timeout has to invoke nvme_dev_disable to stop the
   controller's DMA access before freeing the request.
 - nvme_dev_disable depends on nvme_timeout to complete
   adminq requests that set the HMB or delete sq/cq when the
   controller has no response.
 - nvme_dev_disable races with nvme_timeout when cancelling the
   outstanding requests.

To untangle them, let's first look at what kinds of requests
nvme_timeout has to handle.

RESETTING      previous adminq/IOq requests
or shutdown    adminq requests from nvme_dev_disable

RECONNECTING   adminq requests from nvme_reset_work

nvme_timeout has to invoke nvme_dev_disable before completing
all the expired requests above. We avoid this as follows.

For the previous adminq/IOq requests:
use blk_abort_request to force all the outstanding requests to expire
in nvme_dev_disable. In nvme_timeout, set NVME_REQ_CANCELLED and
return BLK_EH_NOT_HANDLED. Then the request will not be completed and
freed, and we need not invoke nvme_dev_disable any more.

blk_abort_request is safe when racing with the irq completion path,
since we have already grabbed all the outstanding requests. This
eliminates the race between nvme_timeout and nvme_dev_disable.

We use NVME_REQ_CANCELLED to identify them. After the controller is
totally disabled/shutdown, we invoke blk_mq_rq_update_aborted_gstate
to clear the requests' aborted state and invoke blk_mq_complete_request
to complete them.

In addition, to distinguish the previous adminq/IOq requests from the
adminq requests issued by nvme_dev_disable, we introduce
NVME_PCI_OUTSTANDING_GRABBING and NVME_PCI_OUTSTANDING_GRABBED so that
nvme_timeout can tell them apart.
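The role of the two flags can be sketched as a tiny user-space state model (the flag values follow the patch, but the helper below is a hypothetical illustration of the distinction, not a function from the patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as introduced by the patch (0 = no grab in progress). */
#define NVME_PCI_OUTSTANDING_GRABBING 1
#define NVME_PCI_OUTSTANDING_GRABBED  2

/* Hypothetical helper modeling what nvme_timeout needs to decide:
 * before the grab has finished, an expired request may be one of the
 * previous outstanding adminq/IOq requests; once GRABBED, any new
 * expiry must come from the disable/reset path's own adminq commands. */
static bool timeout_sees_previous_request(int grab_flag)
{
	return grab_flag != NVME_PCI_OUTSTANDING_GRABBED;
}
```

The point of the two distinct values is that GRABBING marks the window in which both kinds of requests can still expire concurrently.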

For the adminq requests from nvme_dev_disable/nvme_reset_work:
invoke nvme_disable_ctrl directly, then set NVME_REQ_CANCELLED and
return BLK_EH_HANDLED. nvme_dev_disable/nvme_reset_work will
see the error.

With this patch, nvme_dev_disable is no longer invoked by nvme_timeout,
and the race between nvme_timeout and nvme_dev_disable on outstanding
requests is eliminated.

Signed-off-by: Jianchao Wang 
---
 drivers/nvme/host/pci.c | 146 
 1 file changed, 123 insertions(+), 23 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a7fa397..5b192b0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -70,6 +70,8 @@ struct nvme_queue;
 
 static void nvme_process_cq(struct nvme_queue *nvmeq);
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
+#define NVME_PCI_OUTSTANDING_GRABBING 1
+#define NVME_PCI_OUTSTANDING_GRABBED 2
 
 /*
  * Represents an NVM Express device.  Each nvme_dev is a PCI function.
@@ -80,6 +82,7 @@ struct nvme_dev {
struct blk_mq_tag_set admin_tagset;
u32 __iomem *dbs;
struct device *dev;
+   int grab_flag;
struct dma_pool *prp_page_pool;
struct dma_pool *prp_small_pool;
unsigned online_queues;
@@ -1130,6 +1133,23 @@ static void abort_endio(struct request *req, blk_status_t error)
blk_mq_free_request(req);
 }
 
+static void nvme_pci_disable_ctrl_directly(struct nvme_dev *dev)
+{
+   struct pci_dev *pdev = to_pci_dev(dev->dev);
+   u32 csts;
+   bool dead;
+
+   if (!pci_is_enabled(pdev))
+   return;
+
+   csts = readl(dev->bar + NVME_REG_CSTS);
+   dead = !!((csts & NVME_CSTS_CFS) ||
+   !(csts & NVME_CSTS_RDY) ||
+   pdev->error_state != pci_channel_io_normal);
+   if (!dead)
+   nvme_disable_ctrl(&dev->ctrl, dev->ctrl.cap);
+}
+
 static bool nvme_should_reset(struct nvme_dev *dev, u32 csts)
 {
 
@@ -1191,12 +1211,13 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 
/*
 * Reset immediately if the controller is failed
+* nvme_dev_disable will take over the expired requests.
 */
if (nvme_should_reset(dev, csts)) {
+   nvme_req(req)->flags |= NVME_REQ_CANCELLED;
nvme_warn_reset(dev, csts);
-   nvme_dev_disable(dev, false);
	nvme_reset_ctrl(&dev->ctrl);
-   return BLK_EH_HANDLED;
+   return BLK_EH_NOT_HANDLED;
}
 
/*
@@ -1210,38 +1231,51 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
}
 
/*
-* Shutdown immediately if controller times out while starting. The
-* reset work will see the pci device disabled when it gets the forced
-* cancellation error. All outstanding requests are completed on
-* shutdown, so we return BLK_EH_HANDLED.
+* The previous outstanding requests on adminq and ioq have been
+* grabbed or drained for RECONNECTING
