Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
On Wed, May 16, 2018 at 09:18:26AM -0600, Keith Busch wrote:
> On Wed, May 16, 2018 at 12:31:28PM +0800, Ming Lei wrote:
> > Hi Keith,
> >
> > This issue may well be fixed by Jianchao's patch 'nvme: pci: set
> > nvmeq->cq_vector after alloc cq/sq' [1] and my other patch 'nvme: pci:
> > unquiesce admin queue after controller is shutdown' [2]; both have been
> > included in the posted V6.
>
> No, it's definitely not related to that patch. The link is down in this
> test, I can assure you we're bailing out long before we ever even try to
> create an IO queue. The failing condition is detected by nvme_pci_enable's
> check for all 1's completions at the very beginning.

OK, this kind of failure during reset can be triggered easily in my test,
and nvme_remove_dead_ctrl() gets called too, but I don't see an IO hang in
the remove path.

As we discussed, that shouldn't happen: the queues are unquiesced and
killed, so all IO should fail immediately. Also, the controller has been
shut down and the queues are frozen, so blk_mq_freeze_queue_wait() won't
wait on an unfrozen queue.

So could you post the debugfs log from when the hang happens, so that we
can look for a clue?

Also, I don't think your issue is caused by this patchset, since
nvme_remove_dead_ctrl_work() and nvme_remove() aren't touched by it. That
means the issue can likely be triggered without this patchset too, so could
we start reviewing the patchset in the meantime?

Thanks,
Ming
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
On Wed, May 16, 2018 at 12:31:28PM +0800, Ming Lei wrote:
> Hi Keith,
>
> This issue may probably be fixed by Jianchao's patch 'nvme: pci: set
> nvmeq->cq_vector after alloc cq/sq' [1] and my other patch 'nvme: pci:
> unquiesce admin queue after controller is shutdown' [2]; both have been
> included in the posted V6.

No, it's definitely not related to that patch. The link is down in this
test, I can assure you we're bailing out long before we ever even try to
create an IO queue. The failing condition is detected by nvme_pci_enable's
check for all 1's completions at the very beginning.
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
On Mon, May 14, 2018 at 06:33:35PM -0600, Keith Busch wrote:
> On Tue, May 15, 2018 at 07:47:07AM +0800, Ming Lei wrote:
> > > > > [ 760.727485] nvme nvme1: EH 0: after recovery -19
> > > > > [ 760.727488] nvme nvme1: EH: fail controller
> > > >
> > > > The above issue (hang in nvme_remove()) is still an old issue, which
> > > > is because queues are kept quiesced during remove, so could you
> > > > please test the following change?
> > > >
> > > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > > > index 1dec353388be..c78e5a0cde06 100644
> > > > --- a/drivers/nvme/host/core.c
> > > > +++ b/drivers/nvme/host/core.c
> > > > @@ -3254,6 +3254,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
> > > >  	 */
> > > >  	if (ctrl->state == NVME_CTRL_DEAD)
> > > >  		nvme_kill_queues(ctrl);
> > > > +	else {
> > > > +		if (ctrl->admin_q)
> > > > +			blk_mq_unquiesce_queue(ctrl->admin_q);
> > > > +		nvme_start_queues(ctrl);
> > > > +	}
> > > >
> > > >  	down_write(&ctrl->namespaces_rwsem);
> > > >  	list_splice_init(&ctrl->namespaces, &ns_list);
> > >
> > > The above won't actually do anything here since the broken link puts the
> > > controller in the DEAD state, so we've killed the queues which also
> > > unquiesces them.
> >
> > I suggest you double-check whether the controller is set as DEAD
> > in nvme_remove(), since there won't be any log dumped when this happens.
>
> Yes, it's dead. pci_device_is_present returns false when the link is
> broken.
>
> Also, the logs showed the capacity was set to 0, which only happens when
> we kill the namespace queues, which supposedly restarts the queues too.
>
> > If the controller is set as DEAD and the queues are killed, all IO should
> > have been dispatched to the driver and nvme_queue_rq() will fail it all,
> > so there isn't any reason to see the hang in your stack trace log.
>
> Right, that's the idea. It just doesn't appear to be working here.
Hi Keith,

This issue may probably be fixed by Jianchao's patch 'nvme: pci: set
nvmeq->cq_vector after alloc cq/sq' [1] and my other patch 'nvme: pci:
unquiesce admin queue after controller is shutdown' [2]; both have been
included in the posted V6.

[1] https://marc.info/?l=linux-block&m=152644346006002&w=2
[2] https://marc.info/?l=linux-block&m=152644344805995&w=2

So please retest with V6 and see if this issue is addressed. If not, could
you share the debugfs log, so that we might see what is wrong?

Thanks,
Ming
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
Hi Ming,

On 05/15/2018 08:56 PM, Ming Lei wrote:
> Looks a nice fix on nvme_create_queue(), but seems the change on
> adapter_alloc_cq() is missed in above patch.
>
> Could you prepare a formal one so that I may integrate it to V6?

Please refer to the patch below.

Thanks,
Jianchao

>From 9bb6db79901ef303cd40c4c911604bc053b1ad95 Mon Sep 17 00:00:00 2001
From: Jianchao Wang
Date: Wed, 16 May 2018 09:45:45 +0800
Subject: [PATCH] nvme-pci: set nvmeq->cq_vector after alloc cq/sq

Currently nvmeq->cq_vector is set before the cq/sq is allocated. If the
alloc cq/sq command times out, nvme_suspend_queue will invoke free_irq
for the nvmeq because the cq_vector is valid, which causes the warning
'Trying to free already-free IRQ xxx'. Set nvmeq->cq_vector only after
the cq/sq allocation succeeds to fix this.

Signed-off-by: Jianchao Wang
---
 drivers/nvme/host/pci.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index fbc71fa..c830092 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1070,7 +1070,7 @@ static int adapter_delete_queue(struct nvme_dev *dev, u8 opcode, u16 id)
 }
 
 static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
-		struct nvme_queue *nvmeq)
+		struct nvme_queue *nvmeq, int cq_vector)
 {
 	struct nvme_command c;
 	int flags = NVME_QUEUE_PHYS_CONTIG | NVME_CQ_IRQ_ENABLED;
@@ -1085,7 +1085,7 @@ static int adapter_alloc_cq(struct nvme_dev *dev, u16 qid,
 	c.create_cq.cqid = cpu_to_le16(qid);
 	c.create_cq.qsize = cpu_to_le16(nvmeq->q_depth - 1);
 	c.create_cq.cq_flags = cpu_to_le16(flags);
-	c.create_cq.irq_vector = cpu_to_le16(nvmeq->cq_vector);
+	c.create_cq.irq_vector = cpu_to_le16(cq_vector);
 
 	return nvme_submit_sync_cmd(dev->ctrl.admin_q, &c, NULL, 0);
 }
@@ -1450,6 +1450,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 {
 	struct nvme_dev *dev = nvmeq->dev;
 	int result;
+	int cq_vector;
 
 	if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
 		unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
@@ -1462,15 +1463,21 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 	 * A queue's vector matches the queue identifier unless the controller
 	 * has only one vector available.
 	 */
-	nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
-	result = adapter_alloc_cq(dev, qid, nvmeq);
+	cq_vector = dev->num_vecs == 1 ? 0 : qid;
+	result = adapter_alloc_cq(dev, qid, nvmeq, cq_vector);
 	if (result < 0)
-		goto release_vector;
+		goto out;
 
 	result = adapter_alloc_sq(dev, qid, nvmeq);
 	if (result < 0)
 		goto release_cq;
 
+	/*
+	 * Set cq_vector after alloc cq/sq; otherwise, if the alloc cq/sq
+	 * command times out, nvme_suspend_queue will invoke free_irq for it
+	 * and cause the warning 'Trying to free already-free IRQ xxx'
+	 */
+	nvmeq->cq_vector = cq_vector;
 	nvme_init_queue(nvmeq, qid);
 	result = queue_request_irq(nvmeq);
 	if (result < 0)
@@ -1478,13 +1485,13 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 
 	return result;
 
- release_sq:
+release_sq:
+	nvmeq->cq_vector = -1;
 	dev->online_queues--;
 	adapter_delete_sq(dev, qid);
- release_cq:
+release_cq:
 	adapter_delete_cq(dev, qid);
- release_vector:
-	nvmeq->cq_vector = -1;
+out:
 	return result;
 }
-- 
2.7.4
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
Hi Ming,

On 05/16/2018 10:09 AM, Ming Lei wrote:
> So could you check if only the patch ("unquiesce admin queue after shutdown
> controller") can fix your IO hang issue?

I did indeed test this before fixing the warning. It fixes the IO hang
issue. :)

Thanks,
Jianchao
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
On Wed, May 16, 2018 at 10:04:20AM +0800, Ming Lei wrote:
> On Tue, May 15, 2018 at 05:56:14PM +0800, jianchao.wang wrote:
> > Hi ming
> >
> > On 05/15/2018 08:33 AM, Ming Lei wrote:
> > > We still have to quiesce the admin queue before canceling requests, so
> > > the following patch looks better. Please ignore the above patch, try
> > > the following one, and see if your hang is addressed:
> > >
> > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > > index f509d37b2fb8..c2adc76472a8 100644
> > > --- a/drivers/nvme/host/pci.c
> > > +++ b/drivers/nvme/host/pci.c
> > > @@ -1741,8 +1741,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
> > >  			dev->ctrl.admin_q = NULL;
> > >  			return -ENODEV;
> > >  		}
> > > -	} else
> > > -		blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> > > +	}
> > >
> > >  	return 0;
> > >  }
> > > @@ -2520,6 +2519,12 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
> > >  	 */
> > >  	if (shutdown)
> > >  		nvme_start_queues(&dev->ctrl);
> > > +
> > > +	/*
> > > +	 * Avoid stalling the reset: a timeout may happen during reset, and
> > > +	 * the reset may hang forever if the admin queue is kept quiesced
> > > +	 */
> > > +	blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> > >  	mutex_unlock(&dev->shutdown_lock);
> > >  }
> >
> > w/ the patch above and the patch below, neither the warning nor the IO
> > hang issue has reproduced so far.
> >
> > @@ -1450,6 +1648,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
> >  {
> >  	struct nvme_dev *dev = nvmeq->dev;
> >  	int result;
> > +	int cq_vector;
> >
> >  	if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
> >  		unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
> > @@ -1462,15 +1661,16 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
> >  	 * A queue's vector matches the queue identifier unless the controller
> >  	 * has only one vector available.
> >  	 */
> > -	nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
> > -	result = adapter_alloc_cq(dev, qid, nvmeq);
> > +	cq_vector = dev->num_vecs == 1 ? 0 : qid;
> > +	result = adapter_alloc_cq(dev, qid, nvmeq, cq_vector);
> >  	if (result < 0)
> > -		goto release_vector;
> > +		goto out;
>
> Thinking about this issue further: the above change will cause
> adapter_alloc_cq() to fail immediately, because nvmeq->cq_vector isn't set
> before submitting this admin IO.
>
> So could you check if only the patch ("unquiesce admin queue after shutdown
> controller") can fix your IO hang issue?
>
> BTW, the warning from genirq can be left alone; that is another issue.

Ooops, no such issue at all, since the admin queue is ready. Please ignore
the noise, sorry. :-(

Thanks,
Ming
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
On Tue, May 15, 2018 at 05:56:14PM +0800, jianchao.wang wrote:
> Hi ming
>
> On 05/15/2018 08:33 AM, Ming Lei wrote:
> > We still have to quiesce the admin queue before canceling requests, so
> > the following patch looks better. Please ignore the above patch, try
> > the following one, and see if your hang is addressed:
> >
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index f509d37b2fb8..c2adc76472a8 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -1741,8 +1741,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
> >  			dev->ctrl.admin_q = NULL;
> >  			return -ENODEV;
> >  		}
> > -	} else
> > -		blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> > +	}
> >
> >  	return 0;
> >  }
> > @@ -2520,6 +2519,12 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
> >  	 */
> >  	if (shutdown)
> >  		nvme_start_queues(&dev->ctrl);
> > +
> > +	/*
> > +	 * Avoid stalling the reset: a timeout may happen during reset, and
> > +	 * the reset may hang forever if the admin queue is kept quiesced
> > +	 */
> > +	blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> >  	mutex_unlock(&dev->shutdown_lock);
> >  }
>
> w/ the patch above and the patch below, neither the warning nor the IO
> hang issue has reproduced so far.
>
> @@ -1450,6 +1648,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
>  {
>  	struct nvme_dev *dev = nvmeq->dev;
>  	int result;
> +	int cq_vector;
>
>  	if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
>  		unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
> @@ -1462,15 +1661,16 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
>  	 * A queue's vector matches the queue identifier unless the controller
>  	 * has only one vector available.
>  	 */
> -	nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
> -	result = adapter_alloc_cq(dev, qid, nvmeq);
> +	cq_vector = dev->num_vecs == 1 ? 0 : qid;
> +	result = adapter_alloc_cq(dev, qid, nvmeq, cq_vector);
>  	if (result < 0)
> -		goto release_vector;
> +		goto out;

Thinking about this issue further: the above change will cause
adapter_alloc_cq() to fail immediately, because nvmeq->cq_vector isn't set
before submitting this admin IO.

So could you check if only the patch ("unquiesce admin queue after shutdown
controller") can fix your IO hang issue?

BTW, the warning from genirq can be left alone; that is another issue.

Thanks,
Ming
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
On Tue, May 15, 2018 at 05:56:14PM +0800, jianchao.wang wrote:
> Hi ming
>
> On 05/15/2018 08:33 AM, Ming Lei wrote:
> > We still have to quiesce the admin queue before canceling requests, so
> > the following patch looks better. Please ignore the above patch, try
> > the following one, and see if your hang is addressed:
> >
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index f509d37b2fb8..c2adc76472a8 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -1741,8 +1741,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
> >  			dev->ctrl.admin_q = NULL;
> >  			return -ENODEV;
> >  		}
> > -	} else
> > -		blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> > +	}
> >
> >  	return 0;
> >  }
> > @@ -2520,6 +2519,12 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
> >  	 */
> >  	if (shutdown)
> >  		nvme_start_queues(&dev->ctrl);
> > +
> > +	/*
> > +	 * Avoid stalling the reset: a timeout may happen during reset, and
> > +	 * the reset may hang forever if the admin queue is kept quiesced
> > +	 */
> > +	blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> >  	mutex_unlock(&dev->shutdown_lock);
> >  }
>
> w/ the patch above and the patch below, neither the warning nor the IO
> hang issue has reproduced so far.
>

That is great to see another issue being covered now. :-)

> @@ -1450,6 +1648,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
>  {
>  	struct nvme_dev *dev = nvmeq->dev;
>  	int result;
> +	int cq_vector;
>
>  	if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
>  		unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
> @@ -1462,15 +1661,16 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
>  	 * A queue's vector matches the queue identifier unless the controller
>  	 * has only one vector available.
>  	 */
> -	nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
> -	result = adapter_alloc_cq(dev, qid, nvmeq);
> +	cq_vector = dev->num_vecs == 1 ? 0 : qid;
> +	result = adapter_alloc_cq(dev, qid, nvmeq, cq_vector);
>  	if (result < 0)
> -		goto release_vector;
> +		goto out;
>
>  	result = adapter_alloc_sq(dev, qid, nvmeq);
>  	if (result < 0)
>  		goto release_cq;
> -
> +
> +	nvmeq->cq_vector = cq_vector;
>  	nvme_init_queue(nvmeq, qid);
>  	result = queue_request_irq(nvmeq);
>  	if (result < 0)
> @@ -1479,12 +1679,12 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
>  	return result;
>
>  release_sq:
> +	nvmeq->cq_vector = -1;
>  	dev->online_queues--;
>  	adapter_delete_sq(dev, qid);
>  release_cq:
>  	adapter_delete_cq(dev, qid);
> -	release_vector:
> -	nvmeq->cq_vector = -1;
> + out:
>  	return result;
>  }

Looks a nice fix on nvme_create_queue(), but it seems the change to
adapter_alloc_cq() is missing from the above patch.

Could you prepare a formal one so that I may integrate it into V6?

Thanks,
Ming
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
Hi Ming,

On 05/15/2018 08:33 AM, Ming Lei wrote:
> We still have to quiesce the admin queue before canceling requests, so
> the following patch looks better. Please ignore the above patch, try
> the following one, and see if your hang is addressed:
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index f509d37b2fb8..c2adc76472a8 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1741,8 +1741,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
>  			dev->ctrl.admin_q = NULL;
>  			return -ENODEV;
>  		}
> -	} else
> -		blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> +	}
>
>  	return 0;
>  }
> @@ -2520,6 +2519,12 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
>  	 */
>  	if (shutdown)
>  		nvme_start_queues(&dev->ctrl);
> +
> +	/*
> +	 * Avoid stalling the reset: a timeout may happen during reset, and
> +	 * the reset may hang forever if the admin queue is kept quiesced
> +	 */
> +	blk_mq_unquiesce_queue(dev->ctrl.admin_q);
>  	mutex_unlock(&dev->shutdown_lock);
>  }

w/ the patch above and the patch below, neither the warning nor the IO
hang issue has reproduced so far.

@@ -1450,6 +1648,7 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 {
 	struct nvme_dev *dev = nvmeq->dev;
 	int result;
+	int cq_vector;
 
 	if (dev->cmb && use_cmb_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
 		unsigned offset = (qid - 1) * roundup(SQ_SIZE(nvmeq->q_depth),
@@ -1462,15 +1661,16 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 	 * A queue's vector matches the queue identifier unless the controller
 	 * has only one vector available.
 	 */
-	nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
-	result = adapter_alloc_cq(dev, qid, nvmeq);
+	cq_vector = dev->num_vecs == 1 ? 0 : qid;
+	result = adapter_alloc_cq(dev, qid, nvmeq, cq_vector);
 	if (result < 0)
-		goto release_vector;
+		goto out;
 
 	result = adapter_alloc_sq(dev, qid, nvmeq);
 	if (result < 0)
 		goto release_cq;
-
+
+	nvmeq->cq_vector = cq_vector;
 	nvme_init_queue(nvmeq, qid);
 	result = queue_request_irq(nvmeq);
 	if (result < 0)
@@ -1479,12 +1679,12 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 	return result;
 
 release_sq:
+	nvmeq->cq_vector = -1;
 	dev->online_queues--;
 	adapter_delete_sq(dev, qid);
 release_cq:
 	adapter_delete_cq(dev, qid);
-	release_vector:
-	nvmeq->cq_vector = -1;
+ out:
 	return result;
 }

Thanks,
Jianchao
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
On Tue, May 15, 2018 at 8:33 AM, Keith Busch wrote:
> On Tue, May 15, 2018 at 07:47:07AM +0800, Ming Lei wrote:
>> > > > [ 760.727485] nvme nvme1: EH 0: after recovery -19
>> > > > [ 760.727488] nvme nvme1: EH: fail controller
>> > >
>> > > The above issue (hang in nvme_remove()) is still an old issue, which
>> > > is because queues are kept quiesced during remove, so could you
>> > > please test the following change?
>> > >
>> > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> > > index 1dec353388be..c78e5a0cde06 100644
>> > > --- a/drivers/nvme/host/core.c
>> > > +++ b/drivers/nvme/host/core.c
>> > > @@ -3254,6 +3254,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>> > >  	 */
>> > >  	if (ctrl->state == NVME_CTRL_DEAD)
>> > >  		nvme_kill_queues(ctrl);
>> > > +	else {
>> > > +		if (ctrl->admin_q)
>> > > +			blk_mq_unquiesce_queue(ctrl->admin_q);
>> > > +		nvme_start_queues(ctrl);
>> > > +	}
>> > >
>> > >  	down_write(&ctrl->namespaces_rwsem);
>> > >  	list_splice_init(&ctrl->namespaces, &ns_list);
>> >
>> > The above won't actually do anything here since the broken link puts the
>> > controller in the DEAD state, so we've killed the queues which also
>> > unquiesces them.
>>
>> I suggest you double-check whether the controller is set as DEAD
>> in nvme_remove(), since there won't be any log dumped when this happens.
>
> Yes, it's dead. pci_device_is_present returns false when the link is
> broken.
>
> Also, the logs showed the capacity was set to 0, which only happens when
> we kill the namespace queues, which supposedly restarts the queues too.
>

Right, nvme_kill_queues() may trigger that. In my 019 test I don't see
pci_device_is_present() returning false, but nvme_kill_queues() has been
called in nvme_remove_dead_ctrl_work(), and I still haven't reproduced the
hang in blk_cleanup_queue(). Looks a bit weird, but debugfs may show some
clue. :-)

Thanks,
Ming Lei
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
On Mon, May 14, 2018 at 08:22:11PM +0800, Ming Lei wrote:
> Hi Jianchao,
>
> On Mon, May 14, 2018 at 06:05:50PM +0800, jianchao.wang wrote:
> > Hi ming
> >
> > On 05/14/2018 05:38 PM, Ming Lei wrote:
> > >> Here is the deadlock scenario.
> > >>
> > >> nvme_eh_work // EH0
> > >>   -> nvme_reset_dev // hold reset_lock
> > >>     -> nvme_setup_io_queues
> > >>       -> nvme_create_io_queues
> > >>         -> nvme_create_queue
> > >>           -> set nvmeq->cq_vector
> > >>           -> adapter_alloc_cq
> > >>           -> adapter_alloc_sq
> > >>              irq has not been requested
> > >>              io timeout
> > >>                             nvme_eh_work // EH1
> > >>                               -> nvme_dev_disable
> > >>                                 -> quiesce the adminq // here !
> > >>                                 -> nvme_suspend_queue
> > >>                                    print out warning "Trying to
> > >>                                    free already-free IRQ 133"
> > >>                                 -> nvme_cancel_request // complete
> > >>                                    the timeout admin request
> > >>                               -> require reset_lock
> > >>           -> adapter_delete_cq
> > > If the admin IO submitted in adapter_alloc_sq() is timed out,
> > > nvme_dev_disable() in EH1 will complete it, which is set as
> > > REQ_FAILFAST_DRIVER, then adapter_alloc_sq() should return an error,
> > > and the whole reset in EH0 should have been terminated immediately.
> >
> > Please refer to the following segment:
> >
> > static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
> > {
> > 	struct nvme_dev *dev = nvmeq->dev;
> > 	int result;
> > 	...
> > 	nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
> > 	result = adapter_alloc_cq(dev, qid, nvmeq);
> > 	if (result < 0)
> > 		goto release_vector;
> >
> > 	result = adapter_alloc_sq(dev, qid, nvmeq); // if timeout and failed here
> > 	if (result < 0)
> > 		goto release_cq;
> >
> > 	nvme_init_queue(nvmeq, qid);
> > 	result = queue_request_irq(nvmeq);
> > 	if (result < 0)
> > 		goto release_sq;
> >
> > 	return result;
> >
> > release_sq:
> > 	dev->online_queues--;
> > 	adapter_delete_sq(dev, qid);
> > release_cq:	// we will be here !
> > 	adapter_delete_cq(dev, qid);	// another cq delete admin command will be sent out
> > release_vector:
> > 	nvmeq->cq_vector = -1;
> > 	return result;
> > }
>
> Given the admin queue has been suspended, all admin IO should have
> been terminated immediately, so could you test the following patch?
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index f509d37b2fb8..44e38be259f2 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1515,9 +1515,6 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
>  	nvmeq->cq_vector = -1;
>  	spin_unlock_irq(&nvmeq->q_lock);
>
> -	if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
> -		blk_mq_quiesce_queue(nvmeq->dev->ctrl.admin_q);
> -
>  	pci_free_irq(to_pci_dev(nvmeq->dev->dev), vector, nvmeq);
>
>  	return 0;
> @@ -1741,8 +1738,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
>  		dev->ctrl.admin_q = NULL;
>  		return -ENODEV;
>  	}
> -	} else
> -		blk_mq_unquiesce_queue(dev->ctrl.admin_q);
> +	}
>
>  	return 0;
>  }

We still have to quiesce the admin queue before canceling requests, so the
following patch looks better. Please ignore the above patch, try the
following one, and see if your hang is addressed:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f509d37b2fb8..c2adc76472a8 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1741,8 +1741,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
 			dev->ctrl.admin_q = NULL;
 			return -ENODEV;
 		}
-	} else
-		blk_mq_unquiesce_queue(dev->ctrl.admin_q);
+	}
 
 	return 0;
 }
@@ -2520,6 +2519,12 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
 	 */
 	if (shutdown)
 		nvme_start_queues(&dev->ctrl);
+
+	/*
+	 * Avoid stalling the reset: a timeout may happen during reset, and
+	 * the reset may hang forever if the admin queue is kept quiesced
+	 */
+	blk_mq_unquiesce_queue(dev->ctrl.admin_q);
 	mutex_unlock(&dev->shutdown_lock);
 }

-- 
Ming
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
On Tue, May 15, 2018 at 07:47:07AM +0800, Ming Lei wrote:
> > > > [ 760.727485] nvme nvme1: EH 0: after recovery -19
> > > > [ 760.727488] nvme nvme1: EH: fail controller
> > >
> > > The above issue (hang in nvme_remove()) is still an old issue, which
> > > is because queues are kept quiesced during remove, so could you
> > > please test the following change?
> > >
> > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > > index 1dec353388be..c78e5a0cde06 100644
> > > --- a/drivers/nvme/host/core.c
> > > +++ b/drivers/nvme/host/core.c
> > > @@ -3254,6 +3254,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
> > >  	 */
> > >  	if (ctrl->state == NVME_CTRL_DEAD)
> > >  		nvme_kill_queues(ctrl);
> > > +	else {
> > > +		if (ctrl->admin_q)
> > > +			blk_mq_unquiesce_queue(ctrl->admin_q);
> > > +		nvme_start_queues(ctrl);
> > > +	}
> > >
> > >  	down_write(&ctrl->namespaces_rwsem);
> > >  	list_splice_init(&ctrl->namespaces, &ns_list);
> >
> > The above won't actually do anything here since the broken link puts the
> > controller in the DEAD state, so we've killed the queues which also
> > unquiesces them.
>
> I suggest you double-check whether the controller is set as DEAD
> in nvme_remove(), since there won't be any log dumped when this happens.

Yes, it's dead. pci_device_is_present returns false when the link is
broken.

Also, the logs showed the capacity was set to 0, which only happens when
we kill the namespace queues, which supposedly restarts the queues too.

> If the controller is set as DEAD and the queues are killed, all IO should
> have been dispatched to the driver and nvme_queue_rq() will fail it all,
> so there isn't any reason to see the hang in your stack trace log.

Right, that's the idea. It just doesn't appear to be working here.
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
On Mon, May 14, 2018 at 09:18:21AM -0600, Keith Busch wrote:
> Hi Ming,
>
> On Sat, May 12, 2018 at 08:21:22AM +0800, Ming Lei wrote:
> > > [ 760.679960] nvme nvme1: controller is down; will reset:
> > > CSTS=0x, PCI_STATUS=0x
> > > [ 760.701468] nvme nvme1: EH 0: after shutdown, top eh: 1
> > > [ 760.727099] pci_raw_set_power_state: 62 callbacks suppressed
> > > [ 760.727103] nvme :86:00.0: Refused to change power state, currently in D3
> >
> > EH may not cover this kind of failure, so it fails in the 1st try.
>
> Indeed, the test is simulating a permanently broken link, so recovery is
> not expected. A success in this case is just completing driver
> unbinding.
>
> > > [ 760.727483] nvme nvme1: EH 0: state 4, eh_done -19, top eh 1
> > > [ 760.727485] nvme nvme1: EH 0: after recovery -19
> > > [ 760.727488] nvme nvme1: EH: fail controller
> >
> > The above issue (hang in nvme_remove()) is still an old issue, which
> > is because queues are kept quiesced during remove, so could you
> > please test the following change?
> >
> > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > index 1dec353388be..c78e5a0cde06 100644
> > --- a/drivers/nvme/host/core.c
> > +++ b/drivers/nvme/host/core.c
> > @@ -3254,6 +3254,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
> >  	 */
> >  	if (ctrl->state == NVME_CTRL_DEAD)
> >  		nvme_kill_queues(ctrl);
> > +	else {
> > +		if (ctrl->admin_q)
> > +			blk_mq_unquiesce_queue(ctrl->admin_q);
> > +		nvme_start_queues(ctrl);
> > +	}
> >
> >  	down_write(&ctrl->namespaces_rwsem);
> >  	list_splice_init(&ctrl->namespaces, &ns_list);
>
> The above won't actually do anything here since the broken link puts the
> controller in the DEAD state, so we've killed the queues which also
> unquiesces them.

I suggest you double-check whether the controller is set as DEAD in
nvme_remove(), since there won't be any log dumped when this happens.

If the controller is set as DEAD and the queues are killed, all IO should
have been dispatched to the driver and nvme_queue_rq() will fail it all,
so there isn't any reason to see the hang in your stack trace log.

> > BTW, in my environment, it is hard to trigger this failure, so I haven't
> > seen this issue, but I did verify the nested EH, which can recover from
> > an error during reset.
>
> It's actually pretty easy to trigger this one. I just modify block/019 to
> remove the check for a hotplug slot then run it on a block device that's
> not hot-pluggable.

I will try this test, and hope I can reproduce it in my environment.

Thanks,
Ming
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
Hi Ming,

On Sat, May 12, 2018 at 08:21:22AM +0800, Ming Lei wrote:
> > [ 760.679960] nvme nvme1: controller is down; will reset: CSTS=0x,
> > PCI_STATUS=0x
> > [ 760.701468] nvme nvme1: EH 0: after shutdown, top eh: 1
> > [ 760.727099] pci_raw_set_power_state: 62 callbacks suppressed
> > [ 760.727103] nvme :86:00.0: Refused to change power state, currently in D3
>
> EH may not cover this kind of failure, so it fails in the 1st try.

Indeed, the test is simulating a permanently broken link, so recovery is
not expected. A success in this case is just completing driver unbinding.

> > [ 760.727483] nvme nvme1: EH 0: state 4, eh_done -19, top eh 1
> > [ 760.727485] nvme nvme1: EH 0: after recovery -19
> > [ 760.727488] nvme nvme1: EH: fail controller
>
> The above issue (hang in nvme_remove()) is still an old issue, which
> is because queues are kept quiesced during remove, so could you
> please test the following change?
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 1dec353388be..c78e5a0cde06 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3254,6 +3254,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>  	 */
>  	if (ctrl->state == NVME_CTRL_DEAD)
>  		nvme_kill_queues(ctrl);
> +	else {
> +		if (ctrl->admin_q)
> +			blk_mq_unquiesce_queue(ctrl->admin_q);
> +		nvme_start_queues(ctrl);
> +	}
>
>  	down_write(&ctrl->namespaces_rwsem);
>  	list_splice_init(&ctrl->namespaces, &ns_list);

The above won't actually do anything here since the broken link puts the
controller in the DEAD state, so we've killed the queues which also
unquiesces them.

> BTW, in my environment, it is hard to trigger this failure, so not see
> this issue, but I did verify the nested EH which can recover from error
> in reset.

It's actually pretty easy to trigger this one. I just modify block/019 to
remove the check for a hotplug slot then run it on a block device that's
not hot-pluggable.
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
Hi Jianchao, On Mon, May 14, 2018 at 06:05:50PM +0800, jianchao.wang wrote: > Hi ming > > On 05/14/2018 05:38 PM, Ming Lei wrote: > >> Here is the deadlock scenario. > >> > >> nvme_eh_work // EH0 > >> -> nvme_reset_dev //hold reset_lock > >> -> nvme_setup_io_queues > >> -> nvme_create_io_queues > >> -> nvme_create_queue > >> -> set nvmeq->cq_vector > >> -> adapter_alloc_cq > >> -> adapter_alloc_sq > >> irq has not been requested > >> io timeout > >> nvme_eh_work //EH1 > >> -> nvme_dev_disable > >> -> quiesce the adminq //> here > >> ! > >> -> nvme_suspend_queue > >> print out warning Trying to free > >> already-free IRQ 133 > >> -> nvme_cancel_request // complete > >> the timeout admin request > >> -> require reset_lock > >> -> adapter_delete_cq > > If the admin IO submitted in adapter_alloc_sq() is timed out, > > nvme_dev_disable() in EH1 will complete it which is set as > > REQ_FAILFAST_DRIVER, > > then adapter_alloc_sq() should return error, and the whole reset in EH0 > > should have been terminated immediately. > > Please refer to the following segment: > > static int nvme_create_queue(struct nvme_queue *nvmeq, int qid) > { > struct nvme_dev *dev = nvmeq->dev; > int result; > ... > nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid; > result = adapter_alloc_cq(dev, qid, nvmeq); > if (result < 0) > goto release_vector; > > result = adapter_alloc_sq(dev, qid, nvmeq); // if timeout and failed > here > if (result < 0) > goto release_cq; > > nvme_init_queue(nvmeq, qid); > result = queue_request_irq(nvmeq); > if (result < 0) > goto release_sq; > > return result; > > release_sq: > dev->online_queues--; > adapter_delete_sq(dev, qid); > release_cq:// we will be here ! > adapter_delete_cq(dev, qid);// another cq delete admin > command will be sent out. > release_vector: > nvmeq->cq_vector = -1; > return result; > } Given admin queue has been suspended, all admin IO should have been terminated immediately, so could you test the following patch? 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f509d37b2fb8..44e38be259f2 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1515,9 +1515,6 @@ static int nvme_suspend_queue(struct nvme_queue *nvmeq)
 	nvmeq->cq_vector = -1;
 	spin_unlock_irq(&nvmeq->q_lock);
 
-	if (!nvmeq->qid && nvmeq->dev->ctrl.admin_q)
-		blk_mq_quiesce_queue(nvmeq->dev->ctrl.admin_q);
-
 	pci_free_irq(to_pci_dev(nvmeq->dev->dev), vector, nvmeq);
 
 	return 0;
@@ -1741,8 +1738,7 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
 			dev->ctrl.admin_q = NULL;
 			return -ENODEV;
 		}
-	} else
-		blk_mq_unquiesce_queue(dev->ctrl.admin_q);
+	}
 
 	return 0;
 }

> >
> > I guess the issue should be that nvme_create_io_queues() ignores the
> > failure.
> >
> > Could you dump the stack trace of EH0 reset task? So that we may see
> > where EH0 reset kthread hangs.
>
> root@will-ThinkCentre-M910s:/home/will/Desktop# cat /proc/2273/stack
> [<0>] blk_execute_rq+0xf7/0x150
> [<0>] __nvme_submit_sync_cmd+0x94/0x110
> [<0>] nvme_submit_sync_cmd+0x1b/0x20
> [<0>] adapter_delete_queue+0xad/0xf0
> [<0>] nvme_reset_dev+0x1b67/0x2450
> [<0>] nvme_eh_work+0x19c/0x4b0
> [<0>] process_one_work+0x3ca/0xaa0
> [<0>] worker_thread+0x89/0x6c0
> [<0>] kthread+0x18d/0x1e0
> [<0>] ret_from_fork+0x24/0x30
> [<0>] 0x

Even without this patch, the above hang can happen in reset context, so this issue isn't related with the introduced 'reset_lock'.

> root@will-ThinkCentre-M910s:/home/will/Desktop# cat /proc/2275/stack
> [<0>] nvme_eh_work+0x11a/0x4b0
> [<0>] process_one_work+0x3ca/0xaa0
> [<0>] worker_thread+0x89/0x6c0
> [<0>] kthread+0x18d/0x1e0
> [<0>] ret_from_fork+0x24/0x30
> [<0>] 0x
>
> >
> >> -> adapter_delete_queue // submit to the adminq which has been quiesced.
> >>   -> nvme_submit_sync_cmd
> >>     -> blk_execute_rq
> >>       -> wait_for_completion_io_timeout
> >>          hang_check is true, so there is no hung task warning for this context
> >>
> >> EH0 submit cq delete admin command, but it will never be completed or
> >> timed out, because the admin request queue has been quiesced, so the
> >> reset_lock cannot be released, and EH1 cannot get reset_lock and make
> >> things forward.
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
Hi ming

On 05/14/2018 05:38 PM, Ming Lei wrote:
>> Here is the deadlock scenario.
>>
>> nvme_eh_work // EH0
>>   -> nvme_reset_dev //hold reset_lock
>>     -> nvme_setup_io_queues
>>       -> nvme_create_io_queues
>>         -> nvme_create_queue
>>           -> set nvmeq->cq_vector
>>           -> adapter_alloc_cq
>>           -> adapter_alloc_sq
>>              irq has not been requested
>>              io timeout
>>                         nvme_eh_work //EH1
>>                           -> nvme_dev_disable
>>                             -> quiesce the adminq //> here !
>>                             -> nvme_suspend_queue
>>                                print out warning Trying to free already-free IRQ 133
>>                             -> nvme_cancel_request // complete the timeout admin request
>>                           -> require reset_lock
>>           -> adapter_delete_cq
> If the admin IO submitted in adapter_alloc_sq() is timed out,
> nvme_dev_disable() in EH1 will complete it which is set as REQ_FAILFAST_DRIVER,
> then adapter_alloc_sq() should return error, and the whole reset in EH0
> should have been terminated immediately.

Please refer to the following segment:

static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
{
        struct nvme_dev *dev = nvmeq->dev;
        int result;
        ...
        nvmeq->cq_vector = dev->num_vecs == 1 ? 0 : qid;
        result = adapter_alloc_cq(dev, qid, nvmeq);
        if (result < 0)
                goto release_vector;

        result = adapter_alloc_sq(dev, qid, nvmeq); // if timeout and failed here
        if (result < 0)
                goto release_cq;

        nvme_init_queue(nvmeq, qid);
        result = queue_request_irq(nvmeq);
        if (result < 0)
                goto release_sq;

        return result;

 release_sq:
        dev->online_queues--;
        adapter_delete_sq(dev, qid);
 release_cq: // we will be here !
        adapter_delete_cq(dev, qid); // another cq delete admin command will be sent out.
 release_vector:
        nvmeq->cq_vector = -1;
        return result;
}

>
> I guess the issue should be that nvme_create_io_queues() ignores the failure.
>
> Could you dump the stack trace of EH0 reset task? So that we may see
> where EH0 reset kthread hangs.
root@will-ThinkCentre-M910s:/home/will/Desktop# cat /proc/2273/stack
[<0>] blk_execute_rq+0xf7/0x150
[<0>] __nvme_submit_sync_cmd+0x94/0x110
[<0>] nvme_submit_sync_cmd+0x1b/0x20
[<0>] adapter_delete_queue+0xad/0xf0
[<0>] nvme_reset_dev+0x1b67/0x2450
[<0>] nvme_eh_work+0x19c/0x4b0
[<0>] process_one_work+0x3ca/0xaa0
[<0>] worker_thread+0x89/0x6c0
[<0>] kthread+0x18d/0x1e0
[<0>] ret_from_fork+0x24/0x30
[<0>] 0x

root@will-ThinkCentre-M910s:/home/will/Desktop# cat /proc/2275/stack
[<0>] nvme_eh_work+0x11a/0x4b0
[<0>] process_one_work+0x3ca/0xaa0
[<0>] worker_thread+0x89/0x6c0
[<0>] kthread+0x18d/0x1e0
[<0>] ret_from_fork+0x24/0x30
[<0>] 0x

>
>> -> adapter_delete_queue // submit to the adminq which has been quiesced.
>>   -> nvme_submit_sync_cmd
>>     -> blk_execute_rq
>>       -> wait_for_completion_io_timeout
>>          hang_check is true, so there is no hung task warning for this context
>>
>> EH0 submit cq delete admin command, but it will never be completed or timed
>> out, because the admin request queue has been quiesced, so the reset_lock
>> cannot be released, and EH1 cannot get reset_lock and make things forward.
> The nvme_dev_disable() in outer EH (EH1 in above log) will complete all
> admin commands, which won't be retried because they are set as
> REQ_FAILFAST_DRIVER, so nvme_cancel_request() will complete them in
> nvme_dev_disable().

This cq delete admin command is sent out after the EH1 nvme_dev_disable() completed and failed the previously timed-out sq alloc admin command. Please refer to the code segment above.

Thanks
jianchao
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
On Mon, May 14, 2018 at 04:21:04PM +0800, jianchao.wang wrote:
> Hi ming
>
> Please refer to my test log and analysis.
>
> [ 229.872622] nvme nvme0: I/O 164 QID 1 timeout, reset controller
> [ 229.872649] nvme nvme0: EH 0: before shutdown
> [ 229.872683] nvme nvme0: I/O 165 QID 1 timeout, reset controller
> [ 229.872700] nvme nvme0: I/O 166 QID 1 timeout, reset controller
> [ 229.872716] nvme nvme0: I/O 167 QID 1 timeout, reset controller
> [ 229.872733] nvme nvme0: I/O 169 QID 1 timeout, reset controller
> [ 229.872750] nvme nvme0: I/O 173 QID 1 timeout, reset controller
> [ 229.872766] nvme nvme0: I/O 174 QID 1 timeout, reset controller
> [ 229.872783] nvme nvme0: I/O 178 QID 1 timeout, reset controller
> [ 229.872800] nvme nvme0: I/O 182 QID 1 timeout, aborting
> [ 229.872900] nvme nvme0: I/O 185 QID 1 timeout, aborting
> [ 229.872975] nvme nvme0: I/O 190 QID 1 timeout, reset controller
> [ 229.872990] nvme nvme0: I/O 191 QID 1 timeout, aborting
> [ 229.873021] nvme nvme0: I/O 5 QID 2 timeout, aborting
> [ 229.873096] nvme nvme0: I/O 40 QID 2 timeout, aborting
> [ 229.874041] nvme nvme0: Abort status: 0x0
> [ 229.874064] nvme nvme0: Abort status: 0x0
> [ 229.874078] nvme nvme0: Abort status: 0x0
> [ 229.874092] nvme nvme0: Abort status: 0x0
> [ 229.874106] nvme nvme0: Abort status: 0x0
> [ 230.060698] nvme nvme0: EH 0: after shutdown, top eh: 1
> [ 290.840592] nvme nvme0: I/O 18 QID 0 timeout, disable controller
> [ 290.840746] nvme nvme0: EH 1: before shutdown
>
> [ 290.992650] [ cut here ]
> [ 290.992751] Trying to free already-free IRQ 133
> [ 290.992783] WARNING: CPU: 6 PID: 227 at /home/will/u04/source_code/linux-stable/kernel/irq/manage.c:1549 __free_irq+0xf6/0x420
> [ 290.993394] CPU: 6 PID: 227 Comm: kworker/u16:4 Kdump: loaded Not tainted 4.17.0-rc3+ #37
> [ 290.993402] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
> [ 290.993418] Workqueue: nvme-reset-wq nvme_eh_work
> [ 290.993436] RIP: 0010:__free_irq+0xf6/0x420
> ...
> [ 290.993541] Call Trace:
> [ 290.993581] free_irq+0x4c/0xa0
> [ 290.993600] pci_free_irq+0x18/0x30
> [ 290.993615] nvme_dev_disable+0x20b/0x720
> [ 290.993745] nvme_eh_work+0xdd/0x4b0
> [ 290.993866] process_one_work+0x3ca/0xaa0
> [ 290.993960] worker_thread+0x89/0x6c0
> [ 290.994017] kthread+0x18d/0x1e0
> [ 290.994061] ret_from_fork+0x24/0x30
> [ 338.912379] INFO: task kworker/u16:4:227 blocked for more than 30 seconds.
> [ 338.913348] Tainted: G W 4.17.0-rc3+ #37
> [ 338.913549] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 338.913809] kworker/u16:4 D 0 227 2 0x8000
> [ 338.913842] Workqueue: nvme-reset-wq nvme_eh_work
> [ 338.913857] Call Trace:
> [ 338.914002] schedule+0x52/0xe0
> [ 338.914019] schedule_preempt_disabled+0x14/0x20
> [ 338.914029] __mutex_lock+0x5b4/0xca0
> [ 338.914228] nvme_eh_work+0x11a/0x4b0
> [ 338.914344] process_one_work+0x3ca/0xaa0
> [ 338.914440] worker_thread+0x89/0x6c0
> [ 338.914496] kthread+0x18d/0x1e0
> [ 338.914542] ret_from_fork+0x24/0x30
> [ 338.914648]

The above isn't a new issue either, since nvme_dev_disable() can run before the controller is recovered (nvme_setup_io_queues() done) even without this patchset. This can happen in several cases, such as the one you listed below, or when two or more timed-out requests are triggered from different queues. The issue is that genirq won't work well if the same IRQ is freed twice.

>
> Here is the deadlock scenario.
>
> nvme_eh_work // EH0
>   -> nvme_reset_dev //hold reset_lock
>     -> nvme_setup_io_queues
>       -> nvme_create_io_queues
>         -> nvme_create_queue
>           -> set nvmeq->cq_vector
>           -> adapter_alloc_cq
>           -> adapter_alloc_sq
>              irq has not been requested
>              io timeout
>                         nvme_eh_work //EH1
>                           -> nvme_dev_disable
>                             -> quiesce the adminq //> here !
>                             -> nvme_suspend_queue
>                                print out warning Trying to free already-free IRQ 133
>                             -> nvme_cancel_request // complete the timeout admin request
>                           -> require reset_lock
>           -> adapter_delete_cq

If the admin IO submitted in adapter_alloc_sq() is timed out, nvme_dev_disable() in EH1 will complete it which is set as REQ_FAILFAST_DRIVER, then adapter_alloc_sq() should return error, and the whole reset in EH0 should have been terminated immediately.

I guess the issue should be that nvme_create_io_queues() ignores the failure.

Could you dump the stack trace of EH0 reset task? So that we may see where EH0 reset kthread hangs.

> -> adapter_delete_queue // submit to the adminq which has been quiesced.
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
Hi ming

Please refer to my test log and analysis.

[ 229.872622] nvme nvme0: I/O 164 QID 1 timeout, reset controller
[ 229.872649] nvme nvme0: EH 0: before shutdown
[ 229.872683] nvme nvme0: I/O 165 QID 1 timeout, reset controller
[ 229.872700] nvme nvme0: I/O 166 QID 1 timeout, reset controller
[ 229.872716] nvme nvme0: I/O 167 QID 1 timeout, reset controller
[ 229.872733] nvme nvme0: I/O 169 QID 1 timeout, reset controller
[ 229.872750] nvme nvme0: I/O 173 QID 1 timeout, reset controller
[ 229.872766] nvme nvme0: I/O 174 QID 1 timeout, reset controller
[ 229.872783] nvme nvme0: I/O 178 QID 1 timeout, reset controller
[ 229.872800] nvme nvme0: I/O 182 QID 1 timeout, aborting
[ 229.872900] nvme nvme0: I/O 185 QID 1 timeout, aborting
[ 229.872975] nvme nvme0: I/O 190 QID 1 timeout, reset controller
[ 229.872990] nvme nvme0: I/O 191 QID 1 timeout, aborting
[ 229.873021] nvme nvme0: I/O 5 QID 2 timeout, aborting
[ 229.873096] nvme nvme0: I/O 40 QID 2 timeout, aborting
[ 229.874041] nvme nvme0: Abort status: 0x0
[ 229.874064] nvme nvme0: Abort status: 0x0
[ 229.874078] nvme nvme0: Abort status: 0x0
[ 229.874092] nvme nvme0: Abort status: 0x0
[ 229.874106] nvme nvme0: Abort status: 0x0
[ 230.060698] nvme nvme0: EH 0: after shutdown, top eh: 1
[ 290.840592] nvme nvme0: I/O 18 QID 0 timeout, disable controller
[ 290.840746] nvme nvme0: EH 1: before shutdown

[ 290.992650] [ cut here ]
[ 290.992751] Trying to free already-free IRQ 133
[ 290.992783] WARNING: CPU: 6 PID: 227 at /home/will/u04/source_code/linux-stable/kernel/irq/manage.c:1549 __free_irq+0xf6/0x420
[ 290.993394] CPU: 6 PID: 227 Comm: kworker/u16:4 Kdump: loaded Not tainted 4.17.0-rc3+ #37
[ 290.993402] Hardware name: LENOVO 10MLS0E339/3106, BIOS M1AKT22A 06/27/2017
[ 290.993418] Workqueue: nvme-reset-wq nvme_eh_work
[ 290.993436] RIP: 0010:__free_irq+0xf6/0x420
...
[ 290.993541] Call Trace:
[ 290.993581] free_irq+0x4c/0xa0
[ 290.993600] pci_free_irq+0x18/0x30
[ 290.993615] nvme_dev_disable+0x20b/0x720
[ 290.993745] nvme_eh_work+0xdd/0x4b0
[ 290.993866] process_one_work+0x3ca/0xaa0
[ 290.993960] worker_thread+0x89/0x6c0
[ 290.994017] kthread+0x18d/0x1e0
[ 290.994061] ret_from_fork+0x24/0x30
[ 338.912379] INFO: task kworker/u16:4:227 blocked for more than 30 seconds.
[ 338.913348] Tainted: G W 4.17.0-rc3+ #37
[ 338.913549] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 338.913809] kworker/u16:4 D 0 227 2 0x8000
[ 338.913842] Workqueue: nvme-reset-wq nvme_eh_work
[ 338.913857] Call Trace:
[ 338.914002] schedule+0x52/0xe0
[ 338.914019] schedule_preempt_disabled+0x14/0x20
[ 338.914029] __mutex_lock+0x5b4/0xca0
[ 338.914228] nvme_eh_work+0x11a/0x4b0
[ 338.914344] process_one_work+0x3ca/0xaa0
[ 338.914440] worker_thread+0x89/0x6c0
[ 338.914496] kthread+0x18d/0x1e0
[ 338.914542] ret_from_fork+0x24/0x30
[ 338.914648]

Here is the deadlock scenario.

nvme_eh_work // EH0
  -> nvme_reset_dev //hold reset_lock
    -> nvme_setup_io_queues
      -> nvme_create_io_queues
        -> nvme_create_queue
          -> set nvmeq->cq_vector
          -> adapter_alloc_cq
          -> adapter_alloc_sq
             irq has not been requested
             io timeout
                        nvme_eh_work //EH1
                          -> nvme_dev_disable
                            -> quiesce the adminq //> here !
                            -> nvme_suspend_queue
                               print out warning Trying to free already-free IRQ 133
                            -> nvme_cancel_request // complete the timeout admin request
                          -> require reset_lock
          -> adapter_delete_cq
            -> adapter_delete_queue // submit to the adminq which has been quiesced.
              -> nvme_submit_sync_cmd
                -> blk_execute_rq
                  -> wait_for_completion_io_timeout
                     hang_check is true, so there is no hung task warning for this context

EH0 submit cq delete admin command, but it will never be completed or timed out, because the admin request queue has been quiesced, so the reset_lock cannot be released, and EH1 cannot get reset_lock and make things forward.

Thanks
Jianchao
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
Hi Keith,

On Fri, May 11, 2018 at 02:50:28PM -0600, Keith Busch wrote:
> On Fri, May 11, 2018 at 08:29:24PM +0800, Ming Lei wrote:
> > Hi,
> >
> > The 1st patch introduces blk_quiesce_timeout() and blk_unquiesce_timeout()
> > for NVMe, meantime fixes blk_sync_queue().
> >
> > The 2nd patch covers timeout of admin commands used for recovering the
> > controller, to avoid possible deadlock.
> >
> > The 3rd and 4th patches avoid wait_freeze on queues which aren't frozen.
> >
> > The last 5 patches fix several races w.r.t. the NVMe timeout handler, and
> > finally can make blktests block/011 pass. Meantime the NVMe PCI timeout
> > mechanism becomes much more robust than before.
> >
> > gitweb: https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V5
>
> Hi Ming,
>
> First test with simulated broken links is unsuccessful. I'm getting
> stuck here:
>
> [<0>] blk_mq_freeze_queue_wait+0x46/0xb0
> [<0>] blk_cleanup_queue+0x78/0x170
> [<0>] nvme_ns_remove+0x137/0x1a0 [nvme_core]
> [<0>] nvme_remove_namespaces+0x86/0xc0 [nvme_core]
> [<0>] nvme_remove+0x6b/0x130 [nvme]
> [<0>] pci_device_remove+0x36/0xb0
> [<0>] device_release_driver_internal+0x157/0x220
> [<0>] nvme_remove_dead_ctrl_work+0x29/0x40 [nvme]
> [<0>] process_one_work+0x170/0x350
> [<0>] worker_thread+0x2e/0x380
> [<0>] kthread+0x111/0x130
> [<0>] ret_from_fork+0x1f/0x30
>
> Here's the last parts of the kernel logs capturing the failure:
>
> [ 760.679105] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679116] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679120] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679124] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679127] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679131] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679135] nvme nvme1: controller is down; will reset: CSTS=0x,
> PCI_STATUS=0x
> [ 760.679138] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679141] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679144] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679148] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679151] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679155] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679158] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679161] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679164] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679169] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679172] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679176] nvme nvme1: EH 0: before shutdown
> [ 760.679177] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679181] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679185] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679189] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679192] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679196] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679199] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679202] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679240] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679243] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.679246] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
>
> ( above repeats a few more hundred times )
>
> [
760.679960] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
> [ 760.701468] nvme nvme1: EH 0: after shutdown, top eh: 1
> [ 760.727099] pci_raw_set_power_state: 62 callbacks suppressed
> [ 760.727103] nvme :86:00.0: Refused to change power state, currently in D3

EH may not cover this kind of failure, so it fails in the 1st try.

> [ 760.727483] nvme nvme1: EH 0: state 4, eh_done -19, top eh 1
> [ 760.727485] nvme nvme1: EH 0: after recovery -19
> [ 760.727488] nvme nvme1: EH: fail controller

The above issue (hang in
Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
On Fri, May 11, 2018 at 08:29:24PM +0800, Ming Lei wrote:
> Hi,
>
> The 1st patch introduces blk_quiesce_timeout() and blk_unquiesce_timeout()
> for NVMe, meantime fixes blk_sync_queue().
>
> The 2nd patch covers timeout of admin commands used for recovering the
> controller, to avoid possible deadlock.
>
> The 3rd and 4th patches avoid wait_freeze on queues which aren't frozen.
>
> The last 5 patches fix several races w.r.t. the NVMe timeout handler, and
> finally can make blktests block/011 pass. Meantime the NVMe PCI timeout
> mechanism becomes much more robust than before.
>
> gitweb: https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V5

Hi Ming,

First test with simulated broken links is unsuccessful. I'm getting stuck here:

[<0>] blk_mq_freeze_queue_wait+0x46/0xb0
[<0>] blk_cleanup_queue+0x78/0x170
[<0>] nvme_ns_remove+0x137/0x1a0 [nvme_core]
[<0>] nvme_remove_namespaces+0x86/0xc0 [nvme_core]
[<0>] nvme_remove+0x6b/0x130 [nvme]
[<0>] pci_device_remove+0x36/0xb0
[<0>] device_release_driver_internal+0x157/0x220
[<0>] nvme_remove_dead_ctrl_work+0x29/0x40 [nvme]
[<0>] process_one_work+0x170/0x350
[<0>] worker_thread+0x2e/0x380
[<0>] kthread+0x111/0x130
[<0>] ret_from_fork+0x1f/0x30

Here's the last parts of the kernel logs capturing the failure:

[ 760.679105] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679116] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679120] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679124] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679127] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679131] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679135] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679138] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679141] nvme nvme1: controller is down; will reset: CSTS=0x,
PCI_STATUS=0x
[ 760.679144] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679148] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679151] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679155] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679158] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679161] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679164] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679169] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679172] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679176] nvme nvme1: EH 0: before shutdown
[ 760.679177] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679181] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679185] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679189] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679192] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679196] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679199] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679202] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679240] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679243] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.679246] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x

( above repeats a few more hundred times )

[ 760.679960] nvme nvme1: controller is down; will reset: CSTS=0x, PCI_STATUS=0x
[ 760.701468] nvme nvme1: EH 0: after shutdown, top eh: 1
[ 760.727099] pci_raw_set_power_state: 62 callbacks suppressed
[ 760.727103] nvme :86:00.0: Refused to change power state, currently in
D3
[ 760.727483] nvme nvme1: EH 0: state 4, eh_done -19, top eh 1
[ 760.727485] nvme nvme1: EH 0: after recovery -19
[ 760.727488] nvme nvme1: EH: fail controller
[ 760.727491] nvme nvme1: Removing after probe failure status: 0
[ 760.735138] nvme1n1: detected capacity change from 1200243695616 to 0