Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-16 Thread Ming Lei
On Wed, May 16, 2018 at 09:18:26AM -0600, Keith Busch wrote:
> On Wed, May 16, 2018 at 12:31:28PM +0800, Ming Lei wrote:
> > Hi Keith,
> > 
> > This issue can probably be fixed by Jianchao's patch 'nvme: pci: set
> > nvmeq->cq_vector after alloc cq/sq'[1] and my other patch 'nvme: pci:
> > unquiesce admin queue after controller is shutdown'[2]; both have been
> > included in the posted V6.
> 
> No, it's definitely not related to that patch. The link is down in this
> test, I can assure you we're bailing out long before we ever even try to
> create an IO queue. The failing condition is detected by nvme_pci_enable's
> check for all 1's completions at the very beginning.

OK, this kind of failure during reset can be triggered easily in my test, and
nvme_remove_dead_ctrl() is called there too, but I don't see any IO hang in the
remove path.

As we discussed, that shouldn't happen: since the queues are unquiesced and
killed, all IO should fail immediately. Also, the controller has been shut
down and the queues are frozen, so blk_mq_freeze_queue_wait() shouldn't end up
waiting on an unfrozen queue.
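
For reference, the remove path we are reasoning about looks roughly like this
(a sketch pieced together from the stack trace and the patches quoted elsewhere
in this thread, not verbatim source):

/*
 * nvme_remove_dead_ctrl_work()
 *   -> nvme_remove()
 *        -> nvme_remove_namespaces()
 *             -> nvme_kill_queues(ctrl)  // ctrl is DEAD: mark queues dying
 *                                        // and unquiesce them
 *             -> nvme_ns_remove()
 *                  -> blk_cleanup_queue()
 *                       -> blk_mq_freeze_queue_wait()
 *                          // should return quickly, because every request
 *                          // dispatched to a dying queue fails immediately
 */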

So could you post the debugfs log when the hang happens so that we may
find some clue?

Also, I don't think your issue is caused by this patchset, since
nvme_remove_dead_ctrl_work() and nvme_remove() aren't touched by it.
That means the issue can probably be triggered without this patchset too,
so could we start reviewing the patchset in the meantime?


Thanks,
Ming


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-16 Thread Keith Busch
On Wed, May 16, 2018 at 12:31:28PM +0800, Ming Lei wrote:
> Hi Keith,
> 
> This issue can probably be fixed by Jianchao's patch 'nvme: pci: set
> nvmeq->cq_vector after alloc cq/sq'[1] and my other patch 'nvme: pci:
> unquiesce admin queue after controller is shutdown'[2]; both have been
> included in the posted V6.

No, it's definitely not related to that patch. The link is down in this
test, I can assure you we're bailing out long before we ever even try to
create an IO queue. The failing condition is detected by nvme_pci_enable's
check for all 1's completions at the very beginning.
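
For reference, the check at the top of nvme_pci_enable() is roughly the
following (paraphrased from the v4.17-era driver, not a verbatim copy):

	/*
	 * A dead or surprise-removed device returns all 1's for MMIO reads,
	 * so bail out before any queue setup is attempted.
	 */
	if (readl(dev->bar + NVME_REG_CSTS) == -1) {
		result = -ENODEV;
		goto disable;
	}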


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-15 Thread Ming Lei
On Mon, May 14, 2018 at 06:33:35PM -0600, Keith Busch wrote:
> On Tue, May 15, 2018 at 07:47:07AM +0800, Ming Lei wrote:
> > > > > [  760.727485] nvme nvme1: EH 0: after recovery -19
> > > > > [  760.727488] nvme nvme1: EH: fail controller
> > > > 
> > > > The above issue(hang in nvme_remove()) is still an old issue, which
> > > > is because queues are kept as quiesce during remove, so could you
> > > > please test the following change?
> > > > 
> > > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > > > index 1dec353388be..c78e5a0cde06 100644
> > > > --- a/drivers/nvme/host/core.c
> > > > +++ b/drivers/nvme/host/core.c
> > > > @@ -3254,6 +3254,11 @@ void nvme_remove_namespaces(struct nvme_ctrl 
> > > > *ctrl)
> > > >  */
> > > > if (ctrl->state == NVME_CTRL_DEAD)
> > > > nvme_kill_queues(ctrl);
> > > > +   else {
> > > > +   if (ctrl->admin_q)
> > > > +   blk_mq_unquiesce_queue(ctrl->admin_q);
> > > > +   nvme_start_queues(ctrl);
> > > > +   }
> > > > 
> > > > down_write(&ctrl->namespaces_rwsem);
> > > > list_splice_init(&ctrl->namespaces, &ns_list);
> > > 
> > > The above won't actually do anything here since the broken link puts the
> > > controller in the DEAD state, so we've killed the queues which also
> > > unquiesces them.
> > 
> > I suggest you double-check whether the controller is set as DEAD in
> > nvme_remove(), since no log is dumped when that happens.
> 
> Yes, it's dead. pci_device_is_present returns false when the link is
> broken.
> 
> Also, the logs showed the capacity was set to 0, which only happens when
> we kill the namespace queues, which supposedly restarts the queues too.
>  
> > If the controller is set as DEAD and the queues are killed, all IO should
> > have been dispatched to the driver and nvme_queue_rq() will fail it all,
> > so there isn't any reason to see the hang in your stack trace log.
> 
> Right, that's the idea. It just doesn't appear to be working here.

Hi Keith,

This issue can probably be fixed by Jianchao's patch 'nvme: pci: set
nvmeq->cq_vector after alloc cq/sq'[1] and my other patch 'nvme: pci:
unquiesce admin queue after controller is shutdown'[2]; both have been
included in the posted V6.

[1] https://marc.info/?l=linux-block&m=152644346006002&w=2
[2] https://marc.info/?l=linux-block&m=152644344805995&w=2

So please retest with V6 and see whether this issue is addressed; if not,
could you share the debugfs log, so that we might see from it where things
go wrong?

Thanks,
Ming


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-15 Thread jianchao.wang
Hi Ming

On 05/15/2018 08:56 PM, Ming Lei wrote:
> That looks like a nice fix for nvme_create_queue(), but it seems the change to
> adapter_alloc_cq() is missing from the above patch.
> 
> Could you prepare a formal one so that I can integrate it into V6?

Please refer to the patch below:

Thanks
Jianchao
From 9bb6db79901ef303cd40c4c911604bc053b1ad95 Mon Sep 17 00:00:00 2001
From: Jianchao Wang 
Date: Wed, 16 May 2018 09:45:45 +0800
Subject: [PATCH] nvme-pci: set nvmeq->cq_vector after alloc cq/sq

Currently nvmeq->cq_vector is set before the cq/sq is allocated. If the
alloc cq/sq command times out, nvme_suspend_queue will invoke free_irq
for the nvmeq because the cq_vector looks valid, which causes the warning
'Trying to free already-free IRQ xxx'. Set nvmeq->cq_vector only after
the cq/sq allocation succeeds to fix this.

Signed-off-by: Jianchao Wang 
---
 drivers/nvme/host/pci.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fbc71fa..c830092 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1070,7 +1070,7 @@ static int adapter_delete_queue(struct nvme_dev *dev, u8 opcode, u16 id)
 }
 
 static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
-		struct nvme_queue *nvmeq)
+		struct nvme_queue *nvmeq, int cq_vector)
 {
 	struct nvme_command c;
 	int flags = NVME_QUEUE_PHYS_CONTIG | NVME_CQ_IRQ_ENABLED;
@@ -1085,7 +1085,7 @@ static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
 	c.create_cq.cqid = cpu_to_le16(qid);
 	c.create_cq.qsize = cpu_to_le16(nvmeq->q_depth - 1);
 	c.create_cq.cq_flags = cpu_to_le16(flags);
-	c.create_cq.irq_vector = cpu_to_le16(nvmeq->cq_vector);
+	c.create_cq.irq_vector = cpu_to_le16(cq_vector);
 
 	return nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
 }
@@ -1450,6 +1450,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 {
 	struct nvme_dev *dev = nvmeq->dev;
 	int result;
+	int cq_vector;
 
 	if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
 		unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
@@ -1462,15 +1463,21 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 	 * A queue's vector matches the queue identifier unless the controller
 	 * has only one vector available.
 	 */
-	nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
-	result = adapter_alloc_cq(dev, qid, nvmeq);
+	cq_vector = dev->num_vecs == 1 ? 0 : qid;
+	result = adapter_alloc_cq(dev, qid, nvmeq, cq_vector);
 	if (result < 0)
-		goto release_vector;
+		goto out;
 
 	result = adapter_alloc_sq(dev, qid, nvmeq);
 	if (result < 0)
 		goto release_cq;
 
+	/*
+	 * set cq_vector after alloc cq/sq, otherwise, if alloc cq/sq command
+	 * timeout, nvme_suspend_queue will invoke free_irq for it and cause warning
+	 * 'Trying to free already-free IRQ xxx'
+	 */
+	nvmeq->cq_vector = cq_vector;
 	nvme_init_queue(nvmeq, qid);
 	result = queue_request_irq(nvmeq);
 	if (result < 0)
@@ -1478,13 +1485,13 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 
 	return result;
 
- release_sq:
+release_sq:
+	nvmeq->cq_vector = -1;
 	dev->online_queues--;
 	adapter_delete_sq(dev, qid);
- release_cq:
+release_cq:
 	adapter_delete_cq(dev, qid);
- release_vector:
-	nvmeq->cq_vector = -1;
+out:
 	return result;
 }
 
-- 
2.7.4



Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-15 Thread jianchao.wang
Hi ming

On 05/16/2018 10:09 AM, Ming Lei wrote:
> So could you check if only the patch("unquiesce admin queue after shutdown
> controller") can fix your IO hang issue?

I did test this before fixing the warning.
It does fix the IO hang issue. :)

Thanks
Jianchao


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-15 Thread Ming Lei
On Wed, May 16, 2018 at 10:04:20AM +0800, Ming Lei wrote:
> On Tue, May 15, 2018 at 05:56:14PM +0800, jianchao.wang wrote:
> > Hi ming
> > 
> > On 05/15/2018 08:33 AM, Ming Lei wrote:
> > > We still have to quiesce admin queue before canceling request, so looks
> > > the following patch is better, so please ignore the above patch and try
> > > the following one and see if your hang can be addressed:
> > > 
> > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > > index f509d37b2fb8..c2adc76472a8 100644
> > > --- a/drivers/nvme/host/pci.c
> > > +++ b/drivers/nvme/host/pci.c
> > > @@ -1741,8 +1741,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev 
> > > *dev)
> > >   dev->ctrl.admin_q = NULL;
> > >   return -ENODEV;
> > >   }
> > > - } else
> > > - blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> > > + }
> > >  
> > >   return 0;
> > >  }
> > > @@ -2520,6 +2519,12 @@ static void nvme_dev_disable(struct nvme_dev *dev, 
> > > bool shutdown, bool
> > >*/
> > >   if (shutdown)
> > >   nvme_start_queues(&dev->ctrl);
> > > +
> > > + /*
> > > +  * Avoid getting stuck in reset: timeout may happen during reset and
> > > +  * reset may hang forever if the admin queue is kept quiesced
> > > +  */
> > > + blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> > >   mutex_unlock(&dev->shutdown_lock);
> > >  }
> > 
> > With the patch above and the patch below, neither the warning nor the IO hang
> > issue has reproduced so far.
> > 
> > 
> > @@ -1450,6 +1648,7 @@ static int nvme_create_queue(struct nvme_queue 
> > *nvmeq, int qid)
> >  {
> > struct nvme_dev *dev = nvmeq->dev;
> > int result;
> > +   int cq_vector;
> >  
> > if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
> > unsigned offset = (qid - 1) * 
> > roundup(SQ_SIZE(nvmeq->q_depth),
> > @@ -1462,15 +1661,16 @@ static int nvme_create_queue(struct nvme_queue 
> > *nvmeq, int qid)
> >  * A queue's vector matches the queue identifier unless the 
> > controller
> >  * has only one vector available.
> >  */
> > -   nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
> > -   result = adapter_alloc_cq(dev, qid, nvmeq);
> > +   cq_vector = dev->num_vecs == 1 ? 0 : qid;
> > +   result = adapter_alloc_cq(dev, qid, nvmeq, cq_vector);
> > if (result < 0)
> > -   goto release_vector;
> > +   goto out;
> 
> Thinking about this issue further, the above change will cause adapter_alloc_cq()
> to fail immediately, because nvmeq->cq_vector isn't set before this admin IO is
> submitted.
> 
> So could you check if only the patch("unquiesce admin queue after shutdown
> controller") can fix your IO hang issue?
> 
> BTW, the warning from genirq can be left alone, that is another issue.

Oops, there is no such issue at all since the admin queue is ready; please
ignore the noise, sorry. :-(

Thanks,
Ming


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-15 Thread Ming Lei
On Tue, May 15, 2018 at 05:56:14PM +0800, jianchao.wang wrote:
> Hi ming
> 
> On 05/15/2018 08:33 AM, Ming Lei wrote:
> > We still have to quiesce admin queue before canceling request, so looks
> > the following patch is better, so please ignore the above patch and try
> > the following one and see if your hang can be addressed:
> > 
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index f509d37b2fb8..c2adc76472a8 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -1741,8 +1741,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
> > dev->ctrl.admin_q = NULL;
> > return -ENODEV;
> > }
> > -   } else
> > -   blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> > +   }
> >  
> > return 0;
> >  }
> > @@ -2520,6 +2519,12 @@ static void nvme_dev_disable(struct nvme_dev *dev, 
> > bool shutdown, bool
> >  */
> > if (shutdown)
> > nvme_start_queues(&dev->ctrl);
> > +
> > +   /*
> > +* Avoid getting stuck in reset: timeout may happen during reset and
> > +* reset may hang forever if the admin queue is kept quiesced
> > +*/
> > +   blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> > mutex_unlock(&dev->shutdown_lock);
> >  }
> 
> With the patch above and the patch below, neither the warning nor the IO hang
> issue has reproduced so far.
> 
> 
> @@ -1450,6 +1648,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, 
> int qid)
>  {
> struct nvme_dev *dev = nvmeq->dev;
> int result;
> +   int cq_vector;
>  
> if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
> unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
> @@ -1462,15 +1661,16 @@ static int nvme_create_queue(struct nvme_queue 
> *nvmeq, int qid)
>  * A queue's vector matches the queue identifier unless the controller
>  * has only one vector available.
>  */
> -   nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
> -   result = adapter_alloc_cq(dev, qid, nvmeq);
> +   cq_vector = dev->num_vecs == 1 ? 0 : qid;
> +   result = adapter_alloc_cq(dev, qid, nvmeq, cq_vector);
> if (result < 0)
> -   goto release_vector;
> +   goto out;

Thinking about this issue further, the above change will cause adapter_alloc_cq()
to fail immediately, because nvmeq->cq_vector isn't set before this admin IO is
submitted.

So could you check if only the patch("unquiesce admin queue after shutdown
controller") can fix your IO hang issue?

BTW, the warning from genirq can be left alone, that is another issue.

Thanks,
Ming


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-15 Thread Ming Lei
On Tue, May 15, 2018 at 05:56:14PM +0800, jianchao.wang wrote:
> Hi ming
> 
> On 05/15/2018 08:33 AM, Ming Lei wrote:
> > We still have to quiesce admin queue before canceling request, so looks
> > the following patch is better, so please ignore the above patch and try
> > the following one and see if your hang can be addressed:
> > 
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index f509d37b2fb8..c2adc76472a8 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -1741,8 +1741,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
> > dev->ctrl.admin_q = NULL;
> > return -ENODEV;
> > }
> > -   } else
> > -   blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> > +   }
> >  
> > return 0;
> >  }
> > @@ -2520,6 +2519,12 @@ static void nvme_dev_disable(struct nvme_dev *dev, 
> > bool shutdown, bool
> >  */
> > if (shutdown)
> > nvme_start_queues(&dev->ctrl);
> > +
> > +   /*
> > +* Avoid getting stuck in reset: timeout may happen during reset and
> > +* reset may hang forever if the admin queue is kept quiesced
> > +*/
> > +   blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> > mutex_unlock(&dev->shutdown_lock);
> >  }
> 
> With the patch above and the patch below, neither the warning nor the IO hang
> issue has reproduced so far.
> 

It is great to see that another issue can be covered now. :-)

> 
> @@ -1450,6 +1648,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, 
> int qid)
>  {
> struct nvme_dev *dev = nvmeq->dev;
> int result;
> +   int cq_vector;
>  
> if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
> unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
> @@ -1462,15 +1661,16 @@ static int nvme_create_queue(struct nvme_queue 
> *nvmeq, int qid)
>  * A queue's vector matches the queue identifier unless the controller
>  * has only one vector available.
>  */
> -   nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
> -   result = adapter_alloc_cq(dev, qid, nvmeq);
> +   cq_vector = dev->num_vecs == 1 ? 0 : qid;
> +   result = adapter_alloc_cq(dev, qid, nvmeq, cq_vector);
> if (result < 0)
> -   goto release_vector;
> +   goto out;
>  
> result = adapter_alloc_sq(dev, qid, nvmeq);
> if (result < 0)
> goto release_cq;
> -
> +   
> +   nvmeq->cq_vector = cq_vector;
> nvme_init_queue(nvmeq, qid);
> result = queue_request_irq(nvmeq);
> if (result < 0)
> @@ -1479,12 +1679,12 @@ static int nvme_create_queue(struct nvme_queue 
> *nvmeq, int qid)
> return result;
>  
>   release_sq:
> +   nvmeq->cq_vector = -1;
> dev->online_queues--;
> adapter_delete_sq(dev, qid);
>   release_cq:
> adapter_delete_cq(dev, qid);
> - release_vector:
> -   nvmeq->cq_vector = -1;
> + out:
> return result;
>  }
> 

That looks like a nice fix for nvme_create_queue(), but it seems the change to
adapter_alloc_cq() is missing from the above patch.

Could you prepare a formal one so that I can integrate it into V6?

Thanks,
Ming


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-15 Thread jianchao.wang
Hi ming

On 05/15/2018 08:33 AM, Ming Lei wrote:
> We still have to quiesce admin queue before canceling request, so looks
> the following patch is better, so please ignore the above patch and try
> the following one and see if your hang can be addressed:
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index f509d37b2fb8..c2adc76472a8 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1741,8 +1741,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
>   dev->ctrl.admin_q = NULL;
>   return -ENODEV;
>   }
> - } else
> - blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> + }
>  
>   return 0;
>  }
> @@ -2520,6 +2519,12 @@ static void nvme_dev_disable(struct nvme_dev *dev, 
> bool shutdown, bool
>*/
>   if (shutdown)
>   nvme_start_queues(&dev->ctrl);
> +
> + /*
> +  * Avoid getting stuck in reset: timeout may happen during reset and
> +  * reset may hang forever if the admin queue is kept quiesced
> +  */
> + blk_mq_unquiesce_queue(dev->ctrl.admin_q);
>   mutex_unlock(&dev->shutdown_lock);
>  }

With the patch above and the patch below, neither the warning nor the IO hang
issue has reproduced so far.


@@ -1450,6 +1648,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, 
int qid)
 {
struct nvme_dev *dev = nvmeq->dev;
int result;
+   int cq_vector;
 
if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
@@ -1462,15 +1661,16 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, 
int qid)
 * A queue's vector matches the queue identifier unless the controller
 * has only one vector available.
 */
-   nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
-   result = adapter_alloc_cq(dev, qid, nvmeq);
+   cq_vector = dev->num_vecs == 1 ? 0 : qid;
+   result = adapter_alloc_cq(dev, qid, nvmeq, cq_vector);
if (result < 0)
-   goto release_vector;
+   goto out;
 
result = adapter_alloc_sq(dev, qid, nvmeq);
if (result < 0)
goto release_cq;
-
+   
+   nvmeq->cq_vector = cq_vector;
nvme_init_queue(nvmeq, qid);
result = queue_request_irq(nvmeq);
if (result < 0)
@@ -1479,12 +1679,12 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, 
int qid)
return result;
 
  release_sq:
+   nvmeq->cq_vector = -1;
dev->online_queues--;
adapter_delete_sq(dev, qid);
  release_cq:
adapter_delete_cq(dev, qid);
- release_vector:
-   nvmeq->cq_vector = -1;
+ out:
return result;
 }

Thanks
Jianchao


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-15 Thread Ming Lei
On Tue, May 15, 2018 at 8:33 AM, Keith Busch wrote:
> On Tue, May 15, 2018 at 07:47:07AM +0800, Ming Lei wrote:
>> > > > [  760.727485] nvme nvme1: EH 0: after recovery -19
>> > > > [  760.727488] nvme nvme1: EH: fail controller
>> > >
>> > > The above issue(hang in nvme_remove()) is still an old issue, which
>> > > is because queues are kept as quiesce during remove, so could you
>> > > please test the following change?
>> > >
>> > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> > > index 1dec353388be..c78e5a0cde06 100644
>> > > --- a/drivers/nvme/host/core.c
>> > > +++ b/drivers/nvme/host/core.c
>> > > @@ -3254,6 +3254,11 @@ void nvme_remove_namespaces(struct nvme_ctrl 
>> > > *ctrl)
>> > >  */
>> > > if (ctrl->state == NVME_CTRL_DEAD)
>> > > nvme_kill_queues(ctrl);
>> > > +   else {
>> > > +   if (ctrl->admin_q)
>> > > +   blk_mq_unquiesce_queue(ctrl->admin_q);
>> > > +   nvme_start_queues(ctrl);
>> > > +   }
>> > >
>> > > down_write(&ctrl->namespaces_rwsem);
>> > > list_splice_init(&ctrl->namespaces, &ns_list);
>> >
>> > The above won't actually do anything here since the broken link puts the
>> > controller in the DEAD state, so we've killed the queues which also
>> > unquiesces them.
>>
>> I suggest you double-check whether the controller is set as DEAD in
>> nvme_remove(), since no log is dumped when that happens.
>
> Yes, it's dead. pci_device_is_present returns false when the link is
> broken.
>
> Also, the logs showed the capacity was set to 0, which only happens when
> we kill the namespace queues, which supposedly restarts the queues too.
>

Right, nvme_kill_queues() may trigger that. In my block/019 test I don't see
pci_device_is_present() returning false, but nvme_kill_queues() has been
called in nvme_remove_dead_ctrl_work(), and I still can't reproduce the hang
in blk_cleanup_queue().

Looks a bit weird, but the debugfs log may show some clue. :-)

Thanks,
Ming Lei


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-14 Thread Ming Lei
On Mon, May 14, 2018 at 08:22:11PM +0800, Ming Lei wrote:
> Hi Jianchao,
> 
> On Mon, May 14, 2018 at 06:05:50PM +0800, jianchao.wang wrote:
> > Hi ming
> > 
> > On 05/14/2018 05:38 PM, Ming Lei wrote:
> > >> Here is the deadlock scenario.
> > >>
> > >> nvme_eh_work // EH0
> > >>   -> nvme_reset_dev //hold reset_lock
> > >> -> nvme_setup_io_queues
> > >>   -> nvme_create_io_queues
> > >> -> nvme_create_queue
> > >>   -> set nvmeq->cq_vector
> > >>   -> adapter_alloc_cq
> > >>   -> adapter_alloc_sq
> > >>  irq has not been requested
> > >>  io timeout 
> > >> nvme_eh_work //EH1
> > >>   -> nvme_dev_disable
> > >> -> quiesce the adminq //> 
> > >> here !
> > >> -> nvme_suspend_queue
> > >>   print out warning Trying to 
> > >> free already-free IRQ 133
> > >> -> nvme_cancel_request // 
> > >> complete the timeout admin request
> > >>   -> require reset_lock
> > >>   -> adapter_delete_cq
> > > If the admin IO submitted in adapter_alloc_sq() is timed out,
> > > nvme_dev_disable() in EH1 will complete it which is set as 
> > > REQ_FAILFAST_DRIVER,
> > > then adapter_alloc_sq() should return error, and the whole reset in EH0
> > > should have been terminated immediately.
> > 
> > Please refer to the following segment:
> > 
> > static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
> > {
> > struct nvme_dev *dev = nvmeq->dev;
> > int result;
> > ...
> > nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
> > result = adapter_alloc_cq(dev, qid, nvmeq);
> > if (result < 0)
> > goto release_vector;
> > 
> > result = adapter_alloc_sq(dev, qid, nvmeq);   // if timeout and failed 
> > here
> > if (result < 0)
> > goto release_cq;
> > 
> > nvme_init_queue(nvmeq, qid);
> > result = queue_request_irq(nvmeq);
> > if (result < 0)
> > goto release_sq;
> > 
> > return result;
> > 
> >  release_sq:
> > dev->online_queues--;
> > adapter_delete_sq(dev, qid);
> >  release_cq:// we will be here !
> > adapter_delete_cq(dev, qid);// another cq delete admin 
> > command will be sent out.
> >  release_vector:
> > nvmeq->cq_vector = -1;
> > return result;
> > }
> 
> Given admin queue has been suspended, all admin IO should have
> been terminated immediately, so could you test the following patch?
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index f509d37b2fb8..44e38be259f2 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1515,9 +1515,6 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
>   nvmeq->cq_vector = -1;
>   spin_unlock_irq(&nvmeq->q_lock);
>  
> - if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
> - blk_mq_quiesce_queue(nvmeq->dev->ctrl.admin_q);
> -
>   pci_free_irq(to_pci_dev(nvmeq->dev->dev), vector, nvmeq);
>  
>   return 0;
> @@ -1741,8 +1738,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
>   dev->ctrl.admin_q = NULL;
>   return -ENODEV;
>   }
> - } else
> - blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> + }
>  
>   return 0;
>  }

We still have to quiesce the admin queue before canceling requests, so the
following patch looks better. Please ignore the above patch, try the one
below, and see if your hang is addressed:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f509d37b2fb8..c2adc76472a8 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1741,8 +1741,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
dev->ctrl.admin_q = NULL;
return -ENODEV;
}
-   } else
-   blk_mq_unquiesce_queue(dev->ctrl.admin_q);
+   }
 
return 0;
 }
@@ -2520,6 +2519,12 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool 
shutdown, bool
 */
if (shutdown)
nvme_start_queues(&dev->ctrl);
+
+   /*
+* Avoid getting stuck in reset: timeout may happen during reset and
+* reset may hang forever if the admin queue is kept quiesced
+*/
+   blk_mq_unquiesce_queue(dev->ctrl.admin_q);
mutex_unlock(&dev->shutdown_lock);
 }
 

-- 
Ming


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-14 Thread Keith Busch
On Tue, May 15, 2018 at 07:47:07AM +0800, Ming Lei wrote:
> > > > [  760.727485] nvme nvme1: EH 0: after recovery -19
> > > > [  760.727488] nvme nvme1: EH: fail controller
> > > 
> > > The above issue(hang in nvme_remove()) is still an old issue, which
> > > is because queues are kept as quiesce during remove, so could you
> > > please test the following change?
> > > 
> > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > > index 1dec353388be..c78e5a0cde06 100644
> > > --- a/drivers/nvme/host/core.c
> > > +++ b/drivers/nvme/host/core.c
> > > @@ -3254,6 +3254,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
> > >  */
> > > if (ctrl->state == NVME_CTRL_DEAD)
> > > nvme_kill_queues(ctrl);
> > > +   else {
> > > +   if (ctrl->admin_q)
> > > +   blk_mq_unquiesce_queue(ctrl->admin_q);
> > > +   nvme_start_queues(ctrl);
> > > +   }
> > > 
> > > down_write(&ctrl->namespaces_rwsem);
> > > list_splice_init(&ctrl->namespaces, &ns_list);
> > 
> > The above won't actually do anything here since the broken link puts the
> > controller in the DEAD state, so we've killed the queues which also
> > unquiesces them.
> 
> I suggest you double-check whether the controller is set as DEAD in
> nvme_remove(), since no log is dumped when that happens.

Yes, it's dead. pci_device_is_present returns false when the link is
broken.

Also, the logs showed the capacity was set to 0, which only happens when
we kill the namespace queues, which supposedly restarts the queues too.
 
> If the controller is set as DEAD and the queues are killed, all IO should
> have been dispatched to the driver and nvme_queue_rq() will fail it all,
> so there isn't any reason to see the hang in your stack trace log.

Right, that's the idea. It just doesn't appear to be working here.


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-14 Thread Ming Lei
On Mon, May 14, 2018 at 09:18:21AM -0600, Keith Busch wrote:
> Hi Ming,
> 
> On Sat, May 12, 2018 at 08:21:22AM +0800, Ming Lei wrote:
> > > [  760.679960] nvme nvme1: controller is down; will reset: 
> > > CSTS=0x, PCI_STATUS=0x
> > > [  760.701468] nvme nvme1: EH 0: after shutdown, top eh: 1
> > > [  760.727099] pci_raw_set_power_state: 62 callbacks suppressed
> > > [  760.727103] nvme :86:00.0: Refused to change power state, 
> > > currently in D3
> > 
> > EH may not cover this kind of failure, so it fails in the 1st try.
> 
> Indeed, the test is simulating a permanently broken link, so recovery is
> not expected. A success in this case is just completing driver
> unbinding.
>  
> > > [  760.727483] nvme nvme1: EH 0: state 4, eh_done -19, top eh 1
> > > [  760.727485] nvme nvme1: EH 0: after recovery -19
> > > [  760.727488] nvme nvme1: EH: fail controller
> > 
> > The above issue(hang in nvme_remove()) is still an old issue, which
> > is because queues are kept as quiesce during remove, so could you
> > please test the following change?
> > 
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index 1dec353388be..c78e5a0cde06 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -3254,6 +3254,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
> >  */
> > if (ctrl->state == NVME_CTRL_DEAD)
> > nvme_kill_queues(ctrl);
> > +   else {
> > +   if (ctrl->admin_q)
> > +   blk_mq_unquiesce_queue(ctrl->admin_q);
> > +   nvme_start_queues(ctrl);
> > +   }
> > 
> > down_write(&ctrl->namespaces_rwsem);
> > list_splice_init(&ctrl->namespaces, &ns_list);
> 
> The above won't actually do anything here since the broken link puts the
> controller in the DEAD state, so we've killed the queues which also
> unquiesces them.

I suggest you double-check whether the controller is set as DEAD in
nvme_remove(), since no log is dumped when that happens.

If the controller is set as DEAD and the queues are killed, all IO should
have been dispatched to the driver and nvme_queue_rq() will fail it all,
so there isn't any reason to see the hang in your stack trace log.
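
For reference, the fast-fail path in nvme_queue_rq() is roughly the following
(paraphrased from the v4.17-era driver, only as an illustration):

	/*
	 * Once a queue is suspended/killed its cq_vector is -1, so any
	 * request dispatched to the driver is failed right away instead
	 * of being queued to dead hardware.
	 */
	if (unlikely(nvmeq->cq_vector < 0))
		return BLK_STS_IOERR;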

> 
> > BTW, in my environment, it is hard to trigger this failure, so not see
> > this issue, but I did verify the nested EH which can recover from error
> > in reset.
> 
> It's actually pretty easy to trigger this one. I just modify block/019 to
> remove the check for a hotplug slot then run it on a block device that's
> not hot-pluggable.

I will try this test, and hope I can reproduce it in my environment.

Thanks,
Ming


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-14 Thread Keith Busch
Hi Ming,

On Sat, May 12, 2018 at 08:21:22AM +0800, Ming Lei wrote:
> > [  760.679960] nvme nvme1: controller is down; will reset: CSTS=0x, 
> > PCI_STATUS=0x
> > [  760.701468] nvme nvme1: EH 0: after shutdown, top eh: 1
> > [  760.727099] pci_raw_set_power_state: 62 callbacks suppressed
> > [  760.727103] nvme :86:00.0: Refused to change power state, currently 
> > in D3
> 
> EH may not cover this kind of failure, so it fails in the 1st try.

Indeed, the test is simulating a permanently broken link, so recovery is
not expected. A success in this case is just completing driver
unbinding.
 
> > [  760.727483] nvme nvme1: EH 0: state 4, eh_done -19, top eh 1
> > [  760.727485] nvme nvme1: EH 0: after recovery -19
> > [  760.727488] nvme nvme1: EH: fail controller
> 
> The above issue(hang in nvme_remove()) is still an old issue, which
> is because queues are kept as quiesce during remove, so could you
> please test the following change?
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 1dec353388be..c78e5a0cde06 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3254,6 +3254,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>  */
> if (ctrl->state == NVME_CTRL_DEAD)
> nvme_kill_queues(ctrl);
> +   else {
> +   if (ctrl->admin_q)
> +   blk_mq_unquiesce_queue(ctrl->admin_q);
> +   nvme_start_queues(ctrl);
> +   }
> 
> down_write(&ctrl->namespaces_rwsem);
> list_splice_init(&ctrl->namespaces, &ns_list);

The above won't actually do anything here since the broken link puts the
controller in the DEAD state, so we've killed the queues which also
unquiesces them.

> BTW, in my environment, it is hard to trigger this failure, so not see
> this issue, but I did verify the nested EH which can recover from error
> in reset.

It's actually pretty easy to trigger this one. I just modify block/019 to
remove the check for a hotplug slot then run it on a block device that's
not hot-pluggable.


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-14 Thread Ming Lei
Hi Jianchao,

On Mon, May 14, 2018 at 06:05:50PM +0800, jianchao.wang wrote:
> Hi ming
> 
> On 05/14/2018 05:38 PM, Ming Lei wrote:
> >> Here is the deadlock scenario.
> >>
> >> nvme_eh_work // EH0
> >>   -> nvme_reset_dev //hold reset_lock
> >> -> nvme_setup_io_queues
> >>   -> nvme_create_io_queues
> >> -> nvme_create_queue
> >>   -> set nvmeq->cq_vector
> >>   -> adapter_alloc_cq
> >>   -> adapter_alloc_sq
> >>  irq has not been requested
> >>  io timeout 
> >> nvme_eh_work //EH1
> >>   -> nvme_dev_disable
> >> -> quiesce the adminq //> here 
> >> !
> >> -> nvme_suspend_queue
> >>   print out warning Trying to free 
> >> already-free IRQ 133
> >> -> nvme_cancel_request // complete 
> >> the timeout admin request
> >>   -> require reset_lock
> >>   -> adapter_delete_cq
> > If the admin IO submitted in adapter_alloc_sq() is timed out,
> > nvme_dev_disable() in EH1 will complete it which is set as 
> > REQ_FAILFAST_DRIVER,
> > then adapter_alloc_sq() should return error, and the whole reset in EH0
> > should have been terminated immediately.
> 
> Please refer to the following segment:
> 
> static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
> {
>   struct nvme_dev *dev = nvmeq->dev;
>   int result;
> ...
>   nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
>   result = adapter_alloc_cq(dev, qid, nvmeq);
>   if (result < 0)
>   goto release_vector;
> 
>   result = adapter_alloc_sq(dev, qid, nvmeq);   // if timeout and failed 
> here
>   if (result < 0)
>   goto release_cq;
> 
>   nvme_init_queue(nvmeq, qid);
>   result = queue_request_irq(nvmeq);
>   if (result < 0)
>   goto release_sq;
> 
>   return result;
> 
>  release_sq:
>   dev->online_queues--;
>   adapter_delete_sq(dev, qid);
>  release_cq:// we will be here !
>   adapter_delete_cq(dev, qid);// another cq delete admin 
> command will be sent out.
>  release_vector:
>   nvmeq->cq_vector = -1;
>   return result;
> }

Given admin queue has been suspended, all admin IO should have
been terminated immediately, so could you test the following patch?

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f509d37b2fb8..44e38be259f2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1515,9 +1515,6 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
nvmeq->cq_vector = -1;
spin_unlock_irq(&nvmeq->q_lock);
 
-   if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
-   blk_mq_quiesce_queue(nvmeq->dev->ctrl.admin_q);
-
pci_free_irq(to_pci_dev(nvmeq->dev->dev), vector, nvmeq);
 
return 0;
@@ -1741,8 +1738,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
dev->ctrl.admin_q = NULL;
return -ENODEV;
}
-   } else
-   blk_mq_unquiesce_queue(dev->ctrl.admin_q);
+   }
 
return 0;
 }

> 
> 
> > 
> > I guess the issue should be that nvme_create_io_queues() ignores the 
> > failure.
> > 
> > Could you dump the stack trace of EH0 reset task? So that we may see
> > where EH0 reset kthread hangs.
> 
> root@will-ThinkCentre-M910s:/home/will/Desktop# cat /proc/2273/stack 
> [<0>] blk_execute_rq+0xf7/0x150
> [<0>] __nvme_submit_sync_cmd+0x94/0x110
> [<0>] nvme_submit_sync_cmd+0x1b/0x20
> [<0>] adapter_delete_queue+0xad/0xf0
> [<0>] nvme_reset_dev+0x1b67/0x2450
> [<0>] nvme_eh_work+0x19c/0x4b0
> [<0>] process_one_work+0x3ca/0xaa0
> [<0>] worker_thread+0x89/0x6c0
> [<0>] kthread+0x18d/0x1e0
> [<0>] ret_from_fork+0x24/0x30
> [<0>] 0x

Even without this patch, the above hang can happen in the reset context, so
this issue isn't related to the introduced 'reset_lock'.

> root@will-ThinkCentre-M910s:/home/will/Desktop# cat /proc/2275/stack 
> [<0>] nvme_eh_work+0x11a/0x4b0
> [<0>] process_one_work+0x3ca/0xaa0
> [<0>] worker_thread+0x89/0x6c0
> [<0>] kthread+0x18d/0x1e0
> [<0>] ret_from_fork+0x24/0x30
> [<0>] 0x
> 
> > 
> >> -> adapter_delete_queue // submit to the adminq which has been 
> >> quiesced.
> >>   -> nvme_submit_sync_cmd
> >> -> blk_execute_rq
> >>   -> wait_for_completion_io_timeout
> >>   hang_check is true, so there is no hung task warning for 
> >> this context
> >>
> >> EH0 submit cq delete admin command, but it will never be completed or 
> >> timed out, because the admin request queue has
> >> been quiesced, so the reset_lock cannot be released, and EH1 cannot get 
> >> reset_lock and make things forward.

Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-14 Thread jianchao.wang
Hi ming

On 05/14/2018 05:38 PM, Ming Lei wrote:
>> Here is the deadlock scenario.
>>
>> nvme_eh_work // EH0
>>   -> nvme_reset_dev //hold reset_lock
>> -> nvme_setup_io_queues
>>   -> nvme_create_io_queues
>> -> nvme_create_queue
>>   -> set nvmeq->cq_vector
>>   -> adapter_alloc_cq
>>   -> adapter_alloc_sq
>>  irq has not been requested
>>  io timeout 
>> nvme_eh_work //EH1
>>   -> nvme_dev_disable
>> -> quiesce the adminq //> here !
>> -> nvme_suspend_queue
>>   print out warning Trying to free 
>> already-free IRQ 133
>> -> nvme_cancel_request // complete 
>> the timeout admin request
>>   -> require reset_lock
>>   -> adapter_delete_cq
> If the admin IO submitted in adapter_alloc_sq() is timed out,
> nvme_dev_disable() in EH1 will complete it which is set as 
> REQ_FAILFAST_DRIVER,
> then adapter_alloc_sq() should return error, and the whole reset in EH0
> should have been terminated immediately.

Please refer to the following segment:

static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
{
struct nvme_dev *dev = nvmeq->dev;
int result;
...
nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
result = adapter_alloc_cq(dev, qid, nvmeq);
if (result < 0)
goto release_vector;

result = adapter_alloc_sq(dev, qid, nvmeq);   // if timeout and failed 
here
if (result < 0)
goto release_cq;

nvme_init_queue(nvmeq, qid);
result = queue_request_irq(nvmeq);
if (result < 0)
goto release_sq;

return result;

 release_sq:
dev->online_queues--;
adapter_delete_sq(dev, qid);
 release_cq:// we will be here !
adapter_delete_cq(dev, qid);// another cq delete admin 
command will be sent out.
 release_vector:
nvmeq->cq_vector = -1;
return result;
}


> 
> I guess the issue should be that nvme_create_io_queues() ignores the failure.
> 
> Could you dump the stack trace of EH0 reset task? So that we may see
> where EH0 reset kthread hangs.

root@will-ThinkCentre-M910s:/home/will/Desktop# cat /proc/2273/stack 
[<0>] blk_execute_rq+0xf7/0x150
[<0>] __nvme_submit_sync_cmd+0x94/0x110
[<0>] nvme_submit_sync_cmd+0x1b/0x20
[<0>] adapter_delete_queue+0xad/0xf0
[<0>] nvme_reset_dev+0x1b67/0x2450
[<0>] nvme_eh_work+0x19c/0x4b0
[<0>] process_one_work+0x3ca/0xaa0
[<0>] worker_thread+0x89/0x6c0
[<0>] kthread+0x18d/0x1e0
[<0>] ret_from_fork+0x24/0x30
[<0>] 0x
root@will-ThinkCentre-M910s:/home/will/Desktop# cat /proc/2275/stack 
[<0>] nvme_eh_work+0x11a/0x4b0
[<0>] process_one_work+0x3ca/0xaa0
[<0>] worker_thread+0x89/0x6c0
[<0>] kthread+0x18d/0x1e0
[<0>] ret_from_fork+0x24/0x30
[<0>] 0x

> 
>> -> adapter_delete_queue // submit to the adminq which has been 
>> quiesced.
>>   -> nvme_submit_sync_cmd
>> -> blk_execute_rq
>>   -> wait_for_completion_io_timeout
>>   hang_check is true, so there is no hung task warning for 
>> this context
>>
>> EH0 submit cq delete admin command, but it will never be completed or timed 
>> out, because the admin request queue has
>> been quiesced, so the reset_lock cannot be released, and EH1 cannot get 
>> reset_lock and make things forward.
> The nvme_dev_disable() in outer EH(EH1 in above log) will complete all
> admin command, which won't be retried because it is set as
> REQ_FAILFAST_DRIVER, so nvme_cancel_request() will complete it in
> nvme_dev_disable().

This cq delete admin command is sent out after EH1's nvme_dev_disable() has
completed and failed the previously timed-out sq alloc admin command. Please
refer to the code segment above.

Thanks
jianchao


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-14 Thread Ming Lei
On Mon, May 14, 2018 at 04:21:04PM +0800, jianchao.wang wrote:
> Hi ming
> 
> Please refer to my test log and analysis.
> 
> [  229.872622] nvme nvme0: I/O 164 QID 1 timeout, reset controller
> [  229.872649] nvme nvme0: EH 0: before shutdown
> [  229.872683] nvme nvme0: I/O 165 QID 1 timeout, reset controller
> [  229.872700] nvme nvme0: I/O 166 QID 1 timeout, reset controller
> [  229.872716] nvme nvme0: I/O 167 QID 1 timeout, reset controller
> [  229.872733] nvme nvme0: I/O 169 QID 1 timeout, reset controller
> [  229.872750] nvme nvme0: I/O 173 QID 1 timeout, reset controller
> [  229.872766] nvme nvme0: I/O 174 QID 1 timeout, reset controller
> [  229.872783] nvme nvme0: I/O 178 QID 1 timeout, reset controller
> [  229.872800] nvme nvme0: I/O 182 QID 1 timeout, aborting
> [  229.872900] nvme nvme0: I/O 185 QID 1 timeout, aborting
> [  229.872975] nvme nvme0: I/O 190 QID 1 timeout, reset controller
> [  229.872990] nvme nvme0: I/O 191 QID 1 timeout, aborting
> [  229.873021] nvme nvme0: I/O 5 QID 2 timeout, aborting
> [  229.873096] nvme nvme0: I/O 40 QID 2 timeout, aborting
> [  229.874041] nvme nvme0: Abort status: 0x0
> [  229.874064] nvme nvme0: Abort status: 0x0
> [  229.874078] nvme nvme0: Abort status: 0x0
> [  229.874092] nvme nvme0: Abort status: 0x0
> [  229.874106] nvme nvme0: Abort status: 0x0
> [  230.060698] nvme nvme0: EH 0: after shutdown, top eh: 1
> [  290.840592] nvme nvme0: I/O 18 QID 0 timeout, disable controller
> [  290.840746] nvme nvme0: EH 1: before shutdown
> 
> [  290.992650] [ cut here ]
> [  290.992751] Trying to free already-free IRQ 133
> [  290.992783] WARNING: CPU: 6 PID: 227 at 
> /home/will/u04/source_code/linux-stable/kernel/irq/manage.c:1549 
> __free_irq+0xf6/0x420
> [  290.993394] CPU: 6 PID: 227 Comm: kworker/u16:4 Kdump: loaded Not tainted 
> 4.17.0-rc3+ #37
> [  290.993402] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
> [  290.993418] Workqueue: nvme-reset-wq nvme_eh_work
> [  290.993436] RIP: 0010:__free_irq+0xf6/0x420
> ...
> [  290.993541] Call Trace:
> [  290.993581]  free_irq+0x4c/0xa0
> [  290.993600]  pci_free_irq+0x18/0x30
> [  290.993615]  nvme_dev_disable+0x20b/0x720
> [  290.993745]  nvme_eh_work+0xdd/0x4b0
> [  290.993866]  process_one_work+0x3ca/0xaa0
> [  290.993960]  worker_thread+0x89/0x6c0
> [  290.994017]  kthread+0x18d/0x1e0
> [  290.994061]  ret_from_fork+0x24/0x30
> [  338.912379] INFO: task kworker/u16:4:227 blocked for more than 30 seconds.
> [  338.913348]   Tainted: GW 4.17.0-rc3+ #37
> [  338.913549] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [  338.913809] kworker/u16:4   D0   227  2 0x8000
> [  338.913842] Workqueue: nvme-reset-wq nvme_eh_work
> [  338.913857] Call Trace:
> [  338.914002]  schedule+0x52/0xe0
> [  338.914019]  schedule_preempt_disabled+0x14/0x20
> [  338.914029]  __mutex_lock+0x5b4/0xca0
> [  338.914228]  nvme_eh_work+0x11a/0x4b0
> [  338.914344]  process_one_work+0x3ca/0xaa0
> [  338.914440]  worker_thread+0x89/0x6c0
> [  338.914496]  kthread+0x18d/0x1e0
> [  338.914542]  ret_from_fork+0x24/0x30
> [  338.914648]

The above isn't a new issue either, since nvme_dev_disable() can be
run before the controller is recovered (nvme_setup_io_queues() done)
even without this patchset.

This can happen in several cases, such as the one you listed below, or when
two or more timed-out requests are triggered from different queues.

The issue is that genirq doesn't work well if the same IRQ is freed twice.
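
For reference, nvme_suspend_queue() does guard against suspending the same
queue twice, roughly like this (paraphrased from the v4.17-era driver); the
guard can be defeated when cq_vector is made valid before the IRQ has actually
been requested (or after it has already been freed), which is how the warning
above shows up:

	spin_lock_irq(&nvmeq->q_lock);
	if (nvmeq->cq_vector == -1) {		/* already suspended */
		spin_unlock_irq(&nvmeq->q_lock);
		return 1;
	}
	vector = nvmeq->cq_vector;
	nvmeq->cq_vector = -1;
	spin_unlock_irq(&nvmeq->q_lock);

	/* (intervening lines omitted) if cq_vector was set before the IRQ was
	 * ever requested, this frees an IRQ that was never allocated: */
	pci_free_irq(to_pci_dev(nvmeq->dev->dev), vector, nvmeq);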

> 
> Here is the deadlock scenario.
> 
> nvme_eh_work // EH0
>   -> nvme_reset_dev //hold reset_lock
> -> nvme_setup_io_queues
>   -> nvme_create_io_queues
> -> nvme_create_queue
>   -> set nvmeq->cq_vector
>   -> adapter_alloc_cq
>   -> adapter_alloc_sq
>  irq has not been requested
>  io timeout 
> nvme_eh_work //EH1
>   -> nvme_dev_disable
> -> quiesce the adminq //> here !
> -> nvme_suspend_queue
>   print out warning Trying to free 
> already-free IRQ 133
> -> nvme_cancel_request // complete 
> the timeout admin request
>   -> require reset_lock
>   -> adapter_delete_cq

If the admin IO submitted in adapter_alloc_sq() times out, nvme_dev_disable()
in EH1 will complete it, since it is set as REQ_FAILFAST_DRIVER; then
adapter_alloc_sq() should return an error, and the whole reset in EH0
should be terminated immediately.
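
For reference, the two pieces of that argument, roughly as they appear in the
v4.17-era code (paraphrased, only as an illustration):

	/* nvme_alloc_request(), used for sync admin commands: the request is
	 * marked failfast, so it is not retried once it has been cancelled. */
	req->cmd_flags |= REQ_FAILFAST_DRIVER;

	/* nvme_cancel_request(), run for each started request from
	 * nvme_dev_disable(): completes the request with an error status. */
	nvme_req(req)->status = NVME_SC_ABORT_REQ;
	blk_mq_complete_request(req);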

I guess the issue should be that nvme_create_io_queues() ignores the failure.

Could you dump the stack trace of EH0 reset task? So that we may see
where EH0 reset kthread hangs.

> -> adapter_delete_queue // submit to the adminq which has been 
> quiesced.

Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-14 Thread jianchao.wang
Hi ming

Please refer to my test log and analysis.

[  229.872622] nvme nvme0: I/O 164 QID 1 timeout, reset controller
[  229.872649] nvme nvme0: EH 0: before shutdown
[  229.872683] nvme nvme0: I/O 165 QID 1 timeout, reset controller
[  229.872700] nvme nvme0: I/O 166 QID 1 timeout, reset controller
[  229.872716] nvme nvme0: I/O 167 QID 1 timeout, reset controller
[  229.872733] nvme nvme0: I/O 169 QID 1 timeout, reset controller
[  229.872750] nvme nvme0: I/O 173 QID 1 timeout, reset controller
[  229.872766] nvme nvme0: I/O 174 QID 1 timeout, reset controller
[  229.872783] nvme nvme0: I/O 178 QID 1 timeout, reset controller
[  229.872800] nvme nvme0: I/O 182 QID 1 timeout, aborting
[  229.872900] nvme nvme0: I/O 185 QID 1 timeout, aborting
[  229.872975] nvme nvme0: I/O 190 QID 1 timeout, reset controller
[  229.872990] nvme nvme0: I/O 191 QID 1 timeout, aborting
[  229.873021] nvme nvme0: I/O 5 QID 2 timeout, aborting
[  229.873096] nvme nvme0: I/O 40 QID 2 timeout, aborting
[  229.874041] nvme nvme0: Abort status: 0x0
[  229.874064] nvme nvme0: Abort status: 0x0
[  229.874078] nvme nvme0: Abort status: 0x0
[  229.874092] nvme nvme0: Abort status: 0x0
[  229.874106] nvme nvme0: Abort status: 0x0
[  230.060698] nvme nvme0: EH 0: after shutdown, top eh: 1
[  290.840592] nvme nvme0: I/O 18 QID 0 timeout, disable controller
[  290.840746] nvme nvme0: EH 1: before shutdown

[  290.992650] [ cut here ]
[  290.992751] Trying to free already-free IRQ 133
[  290.992783] WARNING: CPU: 6 PID: 227 at 
/home/will/u04/source_code/linux-stable/kernel/irq/manage.c:1549 
__free_irq+0xf6/0x420
[  290.993394] CPU: 6 PID: 227 Comm: kworker/u16:4 Kdump: loaded Not tainted 
4.17.0-rc3+ #37
[  290.993402] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
[  290.993418] Workqueue: nvme-reset-wq nvme_eh_work
[  290.993436] RIP: 0010:__free_irq+0xf6/0x420
...
[  290.993541] Call Trace:
[  290.993581]  free_irq+0x4c/0xa0
[  290.993600]  pci_free_irq+0x18/0x30
[  290.993615]  nvme_dev_disable+0x20b/0x720
[  290.993745]  nvme_eh_work+0xdd/0x4b0
[  290.993866]  process_one_work+0x3ca/0xaa0
[  290.993960]  worker_thread+0x89/0x6c0
[  290.994017]  kthread+0x18d/0x1e0
[  290.994061]  ret_from_fork+0x24/0x30
[  338.912379] INFO: task kworker/u16:4:227 blocked for more than 30 seconds.
[  338.913348]   Tainted: GW 4.17.0-rc3+ #37
[  338.913549] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  338.913809] kworker/u16:4   D0   227  2 0x8000
[  338.913842] Workqueue: nvme-reset-wq nvme_eh_work
[  338.913857] Call Trace:
[  338.914002]  schedule+0x52/0xe0
[  338.914019]  schedule_preempt_disabled+0x14/0x20
[  338.914029]  __mutex_lock+0x5b4/0xca0
[  338.914228]  nvme_eh_work+0x11a/0x4b0
[  338.914344]  process_one_work+0x3ca/0xaa0
[  338.914440]  worker_thread+0x89/0x6c0
[  338.914496]  kthread+0x18d/0x1e0
[  338.914542]  ret_from_fork+0x24/0x30
[  338.914648]

Here is the deadlock scenario.

nvme_eh_work // EH0
  -> nvme_reset_dev //hold reset_lock
-> nvme_setup_io_queues
  -> nvme_create_io_queues
-> nvme_create_queue
  -> set nvmeq->cq_vector
  -> adapter_alloc_cq
  -> adapter_alloc_sq
 irq has not been requested
 io timeout 
nvme_eh_work //EH1
  -> nvme_dev_disable
-> quiesce the adminq //> here !
-> nvme_suspend_queue
  print out warning Trying to free 
already-free IRQ 133
-> nvme_cancel_request // complete the 
timeout admin request
  -> require reset_lock
  -> adapter_delete_cq
-> adapter_delete_queue // submit to the adminq which has been 
quiesced.
  -> nvme_submit_sync_cmd
-> blk_execute_rq
  -> wait_for_completion_io_timeout
  hang_check is true, so there is no hung task warning for this 
context

EH0 submits the cq delete admin command, but it will never be completed or
timed out, because the admin request queue has been quiesced; so the
reset_lock cannot be released, and EH1 cannot get the reset_lock and make
any progress.
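
For reference, this is also why there is no hung-task warning here:
blk_execute_rq() waits roughly like this (paraphrased from block/blk-exec.c),
so with hang_check enabled it simply re-arms the wait forever:

	/* Prevent hang_check timer from firing at us during very long I/O */
	hang_check = sysctl_hung_task_timeout_secs;
	if (hang_check)
		while (!wait_for_completion_io_timeout(&wait,
						hang_check * (HZ/2)));
	else
		wait_for_completion_io(&wait);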

Thanks
Jianchao


Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-11 Thread Ming Lei
Hi Keith,

On Fri, May 11, 2018 at 02:50:28PM -0600, Keith Busch wrote:
> On Fri, May 11, 2018 at 08:29:24PM +0800, Ming Lei wrote:
> > Hi,
> > 
> > The 1st patch introduces blk_quiesce_timeout() and blk_unquiesce_timeout()
> > for NVMe, meantime fixes blk_sync_queue().
> > 
> > The 2nd patch covers timeout for admin commands for recovering controller
> > for avoiding possible deadlock.
> > 
> > The 3rd and 4th patches avoid to wait_freeze on queues which aren't frozen.
> > 
> > The last 5 patches fix several races wrt. the NVMe timeout handler, and
> > finally make blktests block/011 pass. Meantime the NVMe PCI timeout
> > mechanism becomes much more robust than before.
> > 
> > gitweb:
> > https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V5
> 
> Hi Ming,
> 
> First test with simulated broken links is unsuccessful. I'm getting
> stuck here:
> 
> [<0>] blk_mq_freeze_queue_wait+0x46/0xb0
> [<0>] blk_cleanup_queue+0x78/0x170
> [<0>] nvme_ns_remove+0x137/0x1a0 [nvme_core]
> [<0>] nvme_remove_namespaces+0x86/0xc0 [nvme_core]
> [<0>] nvme_remove+0x6b/0x130 [nvme]
> [<0>] pci_device_remove+0x36/0xb0
> [<0>] device_release_driver_internal+0x157/0x220
> [<0>] nvme_remove_dead_ctrl_work+0x29/0x40 [nvme]
> [<0>] process_one_work+0x170/0x350
> [<0>] worker_thread+0x2e/0x380
> [<0>] kthread+0x111/0x130
> [<0>] ret_from_fork+0x1f/0x30
> 
> 
> Here's the last parts of the kernel logs capturing the failure:
> 
> [  760.679105] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679116] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679120] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679124] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679127] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679131] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679135] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679138] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679141] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679144] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679148] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679151] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679155] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679158] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679161] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679164] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679169] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679172] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679176] nvme nvme1: EH 0: before shutdown
> [  760.679177] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679181] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679185] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679189] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679192] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679196] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679199] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679202] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679240] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679243] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.679246] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> 
> ( above repeats a few more hundred times )
> 
> [  760.679960] nvme nvme1: controller is down; will reset: CSTS=0x, 
> PCI_STATUS=0x
> [  760.701468] nvme nvme1: EH 0: after shutdown, top eh: 1
> [  760.727099] pci_raw_set_power_state: 62 callbacks suppressed
> [  760.727103] nvme :86:00.0: Refused to change power state, currently in 
> D3

EH may not cover this kind of failure, so it fails in the 1st try.

> [  760.727483] nvme nvme1: EH 0: state 4, eh_done -19, top eh 1
> [  760.727485] nvme nvme1: EH 0: after recovery -19
> [  760.727488] nvme nvme1: EH: fail controller

The above issue (hang in nvme_remove()) is still an old issue, which is because
the queues are kept quiesced during remove.

Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling

2018-05-11 Thread Keith Busch
On Fri, May 11, 2018 at 08:29:24PM +0800, Ming Lei wrote:
> Hi,
> 
> The 1st patch introduces blk_quiesce_timeout() and blk_unquiesce_timeout()
> for NVMe, meantime fixes blk_sync_queue().
> 
> The 2nd patch covers timeout for admin commands for recovering controller
> for avoiding possible deadlock.
> 
> The 3rd and 4th patches avoid to wait_freeze on queues which aren't frozen.
> 
> The last 5 patches fix several races wrt. the NVMe timeout handler, and
> finally make blktests block/011 pass. Meantime the NVMe PCI timeout
> mechanism becomes much more robust than before.
> 
> gitweb:
>   https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V5

Hi Ming,

First test with simulated broken links is unsuccessful. I'm getting
stuck here:

[<0>] blk_mq_freeze_queue_wait+0x46/0xb0
[<0>] blk_cleanup_queue+0x78/0x170
[<0>] nvme_ns_remove+0x137/0x1a0 [nvme_core]
[<0>] nvme_remove_namespaces+0x86/0xc0 [nvme_core]
[<0>] nvme_remove+0x6b/0x130 [nvme]
[<0>] pci_device_remove+0x36/0xb0
[<0>] device_release_driver_internal+0x157/0x220
[<0>] nvme_remove_dead_ctrl_work+0x29/0x40 [nvme]
[<0>] process_one_work+0x170/0x350
[<0>] worker_thread+0x2e/0x380
[<0>] kthread+0x111/0x130
[<0>] ret_from_fork+0x1f/0x30


Here's the last parts of the kernel logs capturing the failure:

[  760.679105] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679116] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679120] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679124] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679127] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679131] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679135] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679138] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679141] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679144] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679148] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679151] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679155] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679158] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679161] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679164] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679169] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679172] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679176] nvme nvme1: EH 0: before shutdown
[  760.679177] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679181] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679185] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679189] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679192] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679196] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679199] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679202] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679240] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679243] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.679246] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x

( above repeats a few more hundred times )

[  760.679960] nvme nvme1: controller is down; will reset: CSTS=0x, 
PCI_STATUS=0x
[  760.701468] nvme nvme1: EH 0: after shutdown, top eh: 1
[  760.727099] pci_raw_set_power_state: 62 callbacks suppressed
[  760.727103] nvme :86:00.0: Refused to change power state, currently in D3
[  760.727483] nvme nvme1: EH 0: state 4, eh_done -19, top eh 1
[  760.727485] nvme nvme1: EH 0: after recovery -19
[  760.727488] nvme nvme1: EH: fail controller
[  760.727491] nvme nvme1: Removing after probe failure status: 0
[  760.735138] nvme1n1: detected capacity change from 1200243695616 to 0