Re: [PATCH] PCI: ERR: Don't override the status returned by error_detect()
On Wed, Jun 10, 2020 at 04:07:35AM +, Z.q. Hou wrote: > Hi Kuppuswamy, > > Thanks a lot for your comments and sorry for my late response! > > > -Original Message- > > From: Kuppuswamy, Sathyanarayanan > > > > Sent: 2020年5月29日 12:25 > > To: Z.q. Hou ; linux-...@vger.kernel.org; > > linux-kernel@vger.kernel.org; rus...@russell.cc; sbobr...@linux.ibm.com; > > ooh...@gmail.com; bhelg...@google.com > > Subject: Re: [PATCH] PCI: ERR: Don't override the status returned by > > error_detect() > > > > > > > > On 5/28/20 9:04 PM, Z.q. Hou wrote: > > > Hi Kuppuswamy, > > > > > >> -Original Message- > > >> From: Kuppuswamy, Sathyanarayanan > > >> > > >> Sent: 2020年5月29日 5:19 > > >> To: Z.q. Hou ; linux-...@vger.kernel.org; > > >> linux-kernel@vger.kernel.org; rus...@russell.cc; > > >> sbobr...@linux.ibm.com; ooh...@gmail.com; bhelg...@google.com > > >> Subject: Re: [PATCH] PCI: ERR: Don't override the status returned by > > >> error_detect() > > >> > > >> Hi, > > >> > > >> On 5/27/20 1:31 AM, Zhiqiang Hou wrote: > > >>> From: Hou Zhiqiang > > >>> > > >>> The commit 6d2c89441571 ("PCI/ERR: Update error status after > > >>> reset_link()") overrode the 'status' returned by the error_detect() > > >>> call back function, which is depended on by the next step. This > > >>> overriding makes the Endpoint driver's required info (kept in the > > >>> var > > >>> status) lost, so it results in the fatal errors' recovery failed and > > >>> then kernel > > >> panic. > > >> Can you explain why updating status affects the recovery ? > > > > > > Take the e1000e as an example: > > > Once a fatal error is reported by e1000e, the e1000e's error_detect() > > > will be called and it will return PCI_ERS_RESULT_NEED_RESET to request > > > a slot reset, the return value is stored in the '' of the > > > calling pci_walk_bus(bus,report_frozen_detected, ). > > > If you update the 'status' with the reset_link()'s return value > > > (PCI_ERS_RESULT_RECOVERED if the reset link succeed), then the > > > 'status' has the value PCI_ERS_RESULT_RECOVERED and e1000e's request > > > PCI_ERS_RESULT_NEED_RESET lost, then e1000e's callback function > > > .slot_reset() will be skipped and directly call the .resume(). > > I believe you are working with non hotplug capable device. If yes, then this > > issue will be addressed by the following patch. > > https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.org > > %2Flkml%2F2020%2F5%2F6%2F1545data=02%7C01%7Czhiqiang.hou%4 > > 0nxp.com%7C4f0ad836e4384f40400f08d80388383c%7C686ea1d3bc2b4c6fa92 > > cd99c5c301635%7C0%7C1%7C637263230875781678sdata=ap0PUMzse > > xIuCnOpBCFPW%2BrUEwMWgAYzT7yxGP8pjio%3Dreserved=0 > > I'll try this patch, seems it also override the 'status' in the > pci_channel_io_frozen > Case but it make sense for me. I'm not sure what the resolution of this was. Please repost this patch if it's still needed. > > > So this is how the update of 'status' break the handshake between RC's > > > AER driver and the Endpoint's protocol driver error_handlers, then result > > > in > > the recovery failure. > > > > > >>> > > >>> In the e1000e case, the error logs: > > >>> pcieport 0002:00:00.0: AER: Uncorrected (Fatal) error received: > > >>> 0002:01:00.0 e1000e 0002:01:00.0: AER: PCIe Bus Error: > > >>> severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent > > >>> ID) pcieport 0002:00:00.0: AER: Root Port link has been reset > > >> As per above commit log, it looks like link is reset correctly. > > > > > > Yes, see my comments above. > > > > > > Thanks, > > > Zhiqiang > > > > > >>> SError Interrupt on CPU0, code 0xbf02 -- SError > > >>> CPU: 0 PID: 111 Comm: irq/76-aerdrv Not tainted > > >>> 5.7.0-rc7-next-20200526 #8 Hardware name: LS1046A RDB Board (DT) > > >>> pstate: 8005 (Nzcv daif -PAN -UAO BTYPE=--) pc : > > >>> __pci_enable_msix_range+0x4c8/0x5b8 > > >>> lr : __pci_enable_msix_range+0x480/0x5b8 > > >>> sp : 80001116bb30 > > >>> x29: 80001116bb30 x28: 0003 > > >>> x27: 0003 x26: 000
RE: [PATCH] PCI: ERR: Don't override the status returned by error_detect()
Hi Kuppuswamy, Thanks a lot for your comments and sorry for my late response! > -Original Message- > From: Kuppuswamy, Sathyanarayanan > > Sent: 2020年5月29日 12:25 > To: Z.q. Hou ; linux-...@vger.kernel.org; > linux-kernel@vger.kernel.org; rus...@russell.cc; sbobr...@linux.ibm.com; > ooh...@gmail.com; bhelg...@google.com > Subject: Re: [PATCH] PCI: ERR: Don't override the status returned by > error_detect() > > > > On 5/28/20 9:04 PM, Z.q. Hou wrote: > > Hi Kuppuswamy, > > > >> -Original Message- > >> From: Kuppuswamy, Sathyanarayanan > >> > >> Sent: 2020年5月29日 5:19 > >> To: Z.q. Hou ; linux-...@vger.kernel.org; > >> linux-kernel@vger.kernel.org; rus...@russell.cc; > >> sbobr...@linux.ibm.com; ooh...@gmail.com; bhelg...@google.com > >> Subject: Re: [PATCH] PCI: ERR: Don't override the status returned by > >> error_detect() > >> > >> Hi, > >> > >> On 5/27/20 1:31 AM, Zhiqiang Hou wrote: > >>> From: Hou Zhiqiang > >>> > >>> The commit 6d2c89441571 ("PCI/ERR: Update error status after > >>> reset_link()") overrode the 'status' returned by the error_detect() > >>> call back function, which is depended on by the next step. This > >>> overriding makes the Endpoint driver's required info (kept in the > >>> var > >>> status) lost, so it results in the fatal errors' recovery failed and > >>> then kernel > >> panic. > >> Can you explain why updating status affects the recovery ? > > > > Take the e1000e as an example: > > Once a fatal error is reported by e1000e, the e1000e's error_detect() > > will be called and it will return PCI_ERS_RESULT_NEED_RESET to request > > a slot reset, the return value is stored in the '' of the > > calling pci_walk_bus(bus,report_frozen_detected, ). > > If you update the 'status' with the reset_link()'s return value > > (PCI_ERS_RESULT_RECOVERED if the reset link succeed), then the > > 'status' has the value PCI_ERS_RESULT_RECOVERED and e1000e's request > > PCI_ERS_RESULT_NEED_RESET lost, then e1000e's callback function > > .slot_reset() will be skipped and directly call the .resume(). > I believe you are working with non hotplug capable device. If yes, then this > issue will be addressed by the following patch. > https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.org > %2Flkml%2F2020%2F5%2F6%2F1545data=02%7C01%7Czhiqiang.hou%4 > 0nxp.com%7C4f0ad836e4384f40400f08d80388383c%7C686ea1d3bc2b4c6fa92 > cd99c5c301635%7C0%7C1%7C637263230875781678sdata=ap0PUMzse > xIuCnOpBCFPW%2BrUEwMWgAYzT7yxGP8pjio%3Dreserved=0 I'll try this patch, seems it also override the 'status' in the pci_channel_io_frozen Case but it make sense for me. Thanks, Zhiqiang > > > > So this is how the update of 'status' break the handshake between RC's > > AER driver and the Endpoint's protocol driver error_handlers, then result in > the recovery failure. > > > >>> > >>> In the e1000e case, the error logs: > >>> pcieport 0002:00:00.0: AER: Uncorrected (Fatal) error received: > >>> 0002:01:00.0 e1000e 0002:01:00.0: AER: PCIe Bus Error: > >>> severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent > >>> ID) pcieport 0002:00:00.0: AER: Root Port link has been reset > >> As per above commit log, it looks like link is reset correctly. > > > > Yes, see my comments above. > > > > Thanks, > > Zhiqiang > > > >>> SError Interrupt on CPU0, code 0xbf02 -- SError > >>> CPU: 0 PID: 111 Comm: irq/76-aerdrv Not tainted > >>> 5.7.0-rc7-next-20200526 #8 Hardware name: LS1046A RDB Board (DT) > >>> pstate: 8005 (Nzcv daif -PAN -UAO BTYPE=--) pc : > >>> __pci_enable_msix_range+0x4c8/0x5b8 > >>> lr : __pci_enable_msix_range+0x480/0x5b8 > >>> sp : 80001116bb30 > >>> x29: 80001116bb30 x28: 0003 > >>> x27: 0003 x26: > >>> x25: 00097243e0a8 x24: 0001 > >>> x23: 00097243e2d8 x22: > >>> x21: 0003 x20: 00095bd46080 > >>> x19: 00097243e000 x18: > >>> x17: x16: > >>> x15: b958fa0e9948 x14: 00095bd46303 > >>> x13: 00095bd46302 x12: 0038 > >>> x11: 0040 x10: b958fa101e68 > >>> x9 : b958fa101e60 x8 : 0908 > >>> x7 : 0908 x6
Re: [PATCH] PCI: ERR: Don't override the status returned by error_detect()
On 5/28/20 9:04 PM, Z.q. Hou wrote: Hi Kuppuswamy, -Original Message- From: Kuppuswamy, Sathyanarayanan Sent: 2020年5月29日 5:19 To: Z.q. Hou ; linux-...@vger.kernel.org; linux-kernel@vger.kernel.org; rus...@russell.cc; sbobr...@linux.ibm.com; ooh...@gmail.com; bhelg...@google.com Subject: Re: [PATCH] PCI: ERR: Don't override the status returned by error_detect() Hi, On 5/27/20 1:31 AM, Zhiqiang Hou wrote: From: Hou Zhiqiang The commit 6d2c89441571 ("PCI/ERR: Update error status after reset_link()") overrode the 'status' returned by the error_detect() call back function, which is depended on by the next step. This overriding makes the Endpoint driver's required info (kept in the var status) lost, so it results in the fatal errors' recovery failed and then kernel panic. Can you explain why updating status affects the recovery ? Take the e1000e as an example: Once a fatal error is reported by e1000e, the e1000e's error_detect() will be called and it will return PCI_ERS_RESULT_NEED_RESET to request a slot reset, the return value is stored in the '' of the calling pci_walk_bus(bus,report_frozen_detected, ). If you update the 'status' with the reset_link()'s return value (PCI_ERS_RESULT_RECOVERED if the reset link succeed), then the 'status' has the value PCI_ERS_RESULT_RECOVERED and e1000e's request PCI_ERS_RESULT_NEED_RESET lost, then e1000e's callback function .slot_reset() will be skipped and directly call the .resume(). I believe you are working with non hotplug capable device. If yes, then this issue will be addressed by the following patch. https://lkml.org/lkml/2020/5/6/1545 So this is how the update of 'status' break the handshake between RC's AER driver and the Endpoint's protocol driver error_handlers, then result in the recovery failure. In the e1000e case, the error logs: pcieport 0002:00:00.0: AER: Uncorrected (Fatal) error received: 0002:01:00.0 e1000e 0002:01:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID) pcieport 0002:00:00.0: AER: Root Port link has been reset As per above commit log, it looks like link is reset correctly. Yes, see my comments above. Thanks, Zhiqiang SError Interrupt on CPU0, code 0xbf02 -- SError CPU: 0 PID: 111 Comm: irq/76-aerdrv Not tainted 5.7.0-rc7-next-20200526 #8 Hardware name: LS1046A RDB Board (DT) pstate: 8005 (Nzcv daif -PAN -UAO BTYPE=--) pc : __pci_enable_msix_range+0x4c8/0x5b8 lr : __pci_enable_msix_range+0x480/0x5b8 sp : 80001116bb30 x29: 80001116bb30 x28: 0003 x27: 0003 x26: x25: 00097243e0a8 x24: 0001 x23: 00097243e2d8 x22: x21: 0003 x20: 00095bd46080 x19: 00097243e000 x18: x17: x16: x15: b958fa0e9948 x14: 00095bd46303 x13: 00095bd46302 x12: 0038 x11: 0040 x10: b958fa101e68 x9 : b958fa101e60 x8 : 0908 x7 : 0908 x6 : 80001160 x5 : 00095bd46800 x4 : 00096e7f6080 x3 : x2 : x1 : x0 : Kernel panic - not syncing: Asynchronous SError Interrupt CPU: 0 PID: 111 Comm: irq/76-aerdrv Not tainted 5.7.0-rc7-next-20200526 #8 I think it's the expected result that "if the initial value of error status is PCI_ERS_RESULT_DISCONNECT or PCI_ERS_RESULT_NO_AER_DRIVER then even after successful recovery (using reset_link()) pcie_do_recovery() will report the recovery result as failure" which is described in commit 6d2c89441571 ("PCI/ERR: Update error status after reset_link()"). Refer to the Documentation/PCI/pci-error-recovery.rst. As the error_detect() is mandatory callback if the pci_err_handlers is implemented, if it return the PCI_ERS_RESULT_DISCONNECT, it means the driver doesn't want to recover at all; For the case PCI_ERS_RESULT_NO_AER_DRIVER, if the pci_err_handlers is not implemented, the failure is more expected. Fixes: commit 6d2c89441571 ("PCI/ERR: Update error status after reset_link()") Signed-off-by: Hou Zhiqiang --- drivers/pci/pcie/err.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index 14bb8f54723e..84f72342259c 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -165,8 +165,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, pci_dbg(dev, "broadcast error_detected message\n"); if (state == pci_channel_io_frozen) { pci_walk_bus(bus, report_frozen_detected, ); - status = reset_link(dev); - if (status != PCI_ERS_RESULT_RECOVERED) { + if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) { pci_warn(dev, "link reset failed\n"); goto failed; } -- Sathyanarayanan Kuppu
RE: [PATCH] PCI: ERR: Don't override the status returned by error_detect()
Hi Kuppuswamy, > -Original Message- > From: Kuppuswamy, Sathyanarayanan > > Sent: 2020年5月29日 5:19 > To: Z.q. Hou ; linux-...@vger.kernel.org; > linux-kernel@vger.kernel.org; rus...@russell.cc; sbobr...@linux.ibm.com; > ooh...@gmail.com; bhelg...@google.com > Subject: Re: [PATCH] PCI: ERR: Don't override the status returned by > error_detect() > > Hi, > > On 5/27/20 1:31 AM, Zhiqiang Hou wrote: > > From: Hou Zhiqiang > > > > The commit 6d2c89441571 ("PCI/ERR: Update error status after > > reset_link()") overrode the 'status' returned by the error_detect() > > call back function, which is depended on by the next step. This > > overriding makes the Endpoint driver's required info (kept in the var > > status) lost, so it results in the fatal errors' recovery failed and then > > kernel > panic. > Can you explain why updating status affects the recovery ? Take the e1000e as an example: Once a fatal error is reported by e1000e, the e1000e's error_detect() will be called and it will return PCI_ERS_RESULT_NEED_RESET to request a slot reset, the return value is stored in the '' of the calling pci_walk_bus(bus,report_frozen_detected, ). If you update the 'status' with the reset_link()'s return value (PCI_ERS_RESULT_RECOVERED if the reset link succeed), then the 'status' has the value PCI_ERS_RESULT_RECOVERED and e1000e's request PCI_ERS_RESULT_NEED_RESET lost, then e1000e's callback function .slot_reset() will be skipped and directly call the .resume(). So this is how the update of 'status' break the handshake between RC's AER driver and the Endpoint's protocol driver error_handlers, then result in the recovery failure. > > > > In the e1000e case, the error logs: > > pcieport 0002:00:00.0: AER: Uncorrected (Fatal) error received: > > 0002:01:00.0 e1000e 0002:01:00.0: AER: PCIe Bus Error: > > severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent > > ID) pcieport 0002:00:00.0: AER: Root Port link has been reset > As per above commit log, it looks like link is reset correctly. Yes, see my comments above. Thanks, Zhiqiang > > SError Interrupt on CPU0, code 0xbf02 -- SError > > CPU: 0 PID: 111 Comm: irq/76-aerdrv Not tainted > > 5.7.0-rc7-next-20200526 #8 Hardware name: LS1046A RDB Board (DT) > > pstate: 8005 (Nzcv daif -PAN -UAO BTYPE=--) pc : > > __pci_enable_msix_range+0x4c8/0x5b8 > > lr : __pci_enable_msix_range+0x480/0x5b8 > > sp : 80001116bb30 > > x29: 80001116bb30 x28: 0003 > > x27: 0003 x26: > > x25: 00097243e0a8 x24: 0001 > > x23: 00097243e2d8 x22: > > x21: 0003 x20: 00095bd46080 > > x19: 00097243e000 x18: > > x17: x16: > > x15: b958fa0e9948 x14: 00095bd46303 > > x13: 00095bd46302 x12: 0038 > > x11: 0040 x10: b958fa101e68 > > x9 : b958fa101e60 x8 : 0908 > > x7 : 0908 x6 : 80001160 > > x5 : 00095bd46800 x4 : 00096e7f6080 > > x3 : x2 : > > x1 : x0 : Kernel panic - not > > syncing: Asynchronous SError Interrupt > > CPU: 0 PID: 111 Comm: irq/76-aerdrv Not tainted > > 5.7.0-rc7-next-20200526 #8 > > > > I think it's the expected result that "if the initial value of error > > status is PCI_ERS_RESULT_DISCONNECT or > PCI_ERS_RESULT_NO_AER_DRIVER > > then even after successful recovery (using reset_link()) > > pcie_do_recovery() will report the recovery result as failure" which > > is described in commit 6d2c89441571 ("PCI/ERR: Update error status after > reset_link()"). > > > > Refer to the Documentation/PCI/pci-error-recovery.rst. > > As the error_detect() is mandatory callback if the pci_err_handlers is > > implemented, if it return the PCI_ERS_RESULT_DISCONNECT, it means the > > driver doesn't want to recover at all; For the case > > PCI_ERS_RESULT_NO_AER_DRIVER, if the pci_err_handlers is not > > implemented, the failure is more expected. > > > > Fixes: commit 6d2c89441571 ("PCI/ERR: Update error status after > > reset_link()") > > Signed-off-by: Hou Zhiqiang > > --- > > drivers/pci/pcie/err.c | 3 +-- > > 1 file changed, 1 insertion(+), 2 deletions(-) > > > > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index > > 14bb8f54723e..84f72342259c 100644 > > --- a/drivers/pci/pcie/err.c > > +++ b/drivers/pci/pcie/err.c > > @@ -165,8 +165,7 @@ pci_
Re: [PATCH] PCI: ERR: Don't override the status returned by error_detect()
Hi, On 5/27/20 1:31 AM, Zhiqiang Hou wrote: From: Hou Zhiqiang The commit 6d2c89441571 ("PCI/ERR: Update error status after reset_link()") overrode the 'status' returned by the error_detect() call back function, which is depended on by the next step. This overriding makes the Endpoint driver's required info (kept in the var status) lost, so it results in the fatal errors' recovery failed and then kernel panic. Can you explain why updating status affects the recovery ? In the e1000e case, the error logs: pcieport 0002:00:00.0: AER: Uncorrected (Fatal) error received: 0002:01:00.0 e1000e 0002:01:00.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID) pcieport 0002:00:00.0: AER: Root Port link has been reset As per above commit log, it looks like link is reset correctly. SError Interrupt on CPU0, code 0xbf02 -- SError CPU: 0 PID: 111 Comm: irq/76-aerdrv Not tainted 5.7.0-rc7-next-20200526 #8 Hardware name: LS1046A RDB Board (DT) pstate: 8005 (Nzcv daif -PAN -UAO BTYPE=--) pc : __pci_enable_msix_range+0x4c8/0x5b8 lr : __pci_enable_msix_range+0x480/0x5b8 sp : 80001116bb30 x29: 80001116bb30 x28: 0003 x27: 0003 x26: x25: 00097243e0a8 x24: 0001 x23: 00097243e2d8 x22: x21: 0003 x20: 00095bd46080 x19: 00097243e000 x18: x17: x16: x15: b958fa0e9948 x14: 00095bd46303 x13: 00095bd46302 x12: 0038 x11: 0040 x10: b958fa101e68 x9 : b958fa101e60 x8 : 0908 x7 : 0908 x6 : 80001160 x5 : 00095bd46800 x4 : 00096e7f6080 x3 : x2 : x1 : x0 : Kernel panic - not syncing: Asynchronous SError Interrupt CPU: 0 PID: 111 Comm: irq/76-aerdrv Not tainted 5.7.0-rc7-next-20200526 #8 I think it's the expected result that "if the initial value of error status is PCI_ERS_RESULT_DISCONNECT or PCI_ERS_RESULT_NO_AER_DRIVER then even after successful recovery (using reset_link()) pcie_do_recovery() will report the recovery result as failure" which is described in commit 6d2c89441571 ("PCI/ERR: Update error status after reset_link()"). Refer to the Documentation/PCI/pci-error-recovery.rst. As the error_detect() is mandatory callback if the pci_err_handlers is implemented, if it return the PCI_ERS_RESULT_DISCONNECT, it means the driver doesn't want to recover at all; For the case PCI_ERS_RESULT_NO_AER_DRIVER, if the pci_err_handlers is not implemented, the failure is more expected. Fixes: commit 6d2c89441571 ("PCI/ERR: Update error status after reset_link()") Signed-off-by: Hou Zhiqiang --- drivers/pci/pcie/err.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index 14bb8f54723e..84f72342259c 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -165,8 +165,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, pci_dbg(dev, "broadcast error_detected message\n"); if (state == pci_channel_io_frozen) { pci_walk_bus(bus, report_frozen_detected, ); - status = reset_link(dev); - if (status != PCI_ERS_RESULT_RECOVERED) { + if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) { pci_warn(dev, "link reset failed\n"); goto failed; } -- Sathyanarayanan Kuppuswamy Linux Kernel Developer