RE: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset
Bjorn, >> diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c >> index ed4d094..7abefd9 100644 >> --- a/drivers/pci/pcie/portdrv_pci.c >> +++ b/drivers/pci/pcie/portdrv_pci.c >> @@ -332,13 +332,11 @@ static pci_ers_result_t pcie_portdrv_slot_reset(struct >> pci_dev *dev) >> pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED; >> int retval; >> >> -/* If fatal, restore cfg space for possible link reset at upstream */ >> -if (dev->error_state == pci_channel_io_frozen) { >> -dev->state_saved = true; >> -pci_restore_state(dev); >> -pcie_portdrv_restore_config(dev); >> -pci_enable_pcie_error_reporting(dev); >> -} >> +/* restore cfg space for possible link reset at upstream */ >> +dev->state_saved = true; >> +pci_restore_state(dev); >> +pcie_portdrv_restore_config(dev); >> +pci_enable_pcie_error_reporting(dev); >> >> /* get true return value from */ >> retval = device_for_each_child(>dev, , slot_reset_iter); > >I think this patch changes the behavior in the case of a non-fatal error >where one of the .error_detected() methods returned >PCI_ERS_RESULT_NEED_RESET. In that case, pcie_portdrv_slot_reset() >previously did not restore config space, but after your patch, it *will* >restore it. We need an explanation of why this is safe. Here is my understanding of this part of the patch: I think your concern here should not be an issue. Whether it is a non-fatal error or a fatal error, as long as one of the .error_detected() method from the downstream drivers involved returns a PCI_ERS_RESULT_NEED_RESET, it should trump all others and a slot reset should be performed, even though it was originally due to a non-fatal error reported or only one of the downstream drivers requests it. In the case of AER driver, this should be implemented in the broadcast_error_message() with pci_walk_bus() by passing in the report_error_detected() function, and merging the results into the result_data->result... In any case, the fact that this pcie_portdrv_slot_reset() being invoked should be considered as a aftermath of a slot reset has been performed, thus the restore of config space should be performed after the reset. I suppose the restore should be to the same state as fresh power-on states, right? Thanks, Joe Liu, Ph.D. Senior Principal Engineer Emulex Corporation -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset
Bjorn, diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c index ed4d094..7abefd9 100644 --- a/drivers/pci/pcie/portdrv_pci.c +++ b/drivers/pci/pcie/portdrv_pci.c @@ -332,13 +332,11 @@ static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev) pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED; int retval; -/* If fatal, restore cfg space for possible link reset at upstream */ -if (dev-error_state == pci_channel_io_frozen) { -dev-state_saved = true; -pci_restore_state(dev); -pcie_portdrv_restore_config(dev); -pci_enable_pcie_error_reporting(dev); -} +/* restore cfg space for possible link reset at upstream */ +dev-state_saved = true; +pci_restore_state(dev); +pcie_portdrv_restore_config(dev); +pci_enable_pcie_error_reporting(dev); /* get true return value from status */ retval = device_for_each_child(dev-dev, status, slot_reset_iter); I think this patch changes the behavior in the case of a non-fatal error where one of the .error_detected() methods returned PCI_ERS_RESULT_NEED_RESET. In that case, pcie_portdrv_slot_reset() previously did not restore config space, but after your patch, it *will* restore it. We need an explanation of why this is safe. Here is my understanding of this part of the patch: I think your concern here should not be an issue. Whether it is a non-fatal error or a fatal error, as long as one of the .error_detected() method from the downstream drivers involved returns a PCI_ERS_RESULT_NEED_RESET, it should trump all others and a slot reset should be performed, even though it was originally due to a non-fatal error reported or only one of the downstream drivers requests it. In the case of AER driver, this should be implemented in the broadcast_error_message() with pci_walk_bus() by passing in the report_error_detected() function, and merging the results into the result_data-result... In any case, the fact that this pcie_portdrv_slot_reset() being invoked should be considered as a aftermath of a slot reset has been performed, thus the restore of config space should be performed after the reset. I suppose the restore should be to the same state as fresh power-on states, right? Thanks, Joe Liu, Ph.D. Senior Principal Engineer Emulex Corporation -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset
Bjorn, I have been working with the low level device drivers for supporting both EEH and AER and were involved in working with Yanmin for verifying this AER patch. Please see some of my responses to your questions inline, prefixed with [jzl]. Thanks, Joe Liu, Ph.D. Senior Principal Engineer Emulex Corporation -Original Message- From: Bjorn Helgaas [mailto:bhelg...@google.com] Sent: Friday, May 17, 2013 7:44 PM To: Zhang, LongX Cc: linasveps...@gmail.com; linux-...@vger.kernel.org; linux-kernel@vger.kernel.org; yanmin_zh...@linux.intel.com; Liu, Joseph; Rafael J. Wysocki Subject: Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset [+cc Rafael because he knows about dev->state_saved] Sorry, I'm not very familiar with AER, so please excuse some naive questions below. On Fri, Apr 26, 2013 at 12:28 AM, Zhang, LongX wrote: > From: Zhang Long > > Specific pci device drivers might have many functions to call > pci_channel_offline to check device states. When slot_reset happens, > drivers' slot_reset callback might call such functions and eventually > abort the reset. |Where does this happen? I looked at all the references to |dev->error_state and all the callers of pci_channel_offline(), and I |didn't see any in .slot_reset() methods. | |(There are *assignments* to dev->error_state in qlcnic_attach_func(), |qlge_io_slot_reset(), and qla2xxx_pci_slot_reset(). You might be able |to remove those assignments after this patch, but this patch wouldn't |really change anything for those paths.) [jzl] Although low level driver might not call pci_channel_offline() directly in .slot_reset() method itself, it does not mean it will not call it during the device error recovery execution of .slot_reset() method. The .slot_reset() method needs to call some of its bottom layer routines interfacing with device PCI interface for preparing the PCI device from recovery of PCI error (EEH or AER). As a matter of fact, it seems that qla2xxx_pci_slot_reset() routine's assignments to dev->error was an attempt to work around this AER driver issue that this patch is trying to resolve. From upstream kernel 3.9.2's qla2xxx_pci_slot_reset(), it has the assignment and comment below: static pci_ers_result_t qla2xxx_pci_slot_reset(struct pci_dev *pdev) { pci_ers_result_t ret = PCI_ERS_RESULT_DISCONNECT; scsi_qla_host_t *base_vha = pci_get_drvdata(pdev); struct qla_hw_data *ha = base_vha->hw; struct rsp_que *rsp; int rc, retries = 10; ql_dbg(ql_dbg_aer, base_vha, 0x9004, "Slot Reset.\n"); /* Workaround: qla2xxx driver which access hardware earlier * needs error state to be pci_channel_io_online. * Otherwise mailbox command timesout. */ pdev->error_state = pci_channel_io_normal; > The patch resets pdev->error_state to pci_channel_io_normal at > the begining of report_slot_reset. > Signed-off-by: Zhang Yanmin > Signed-off-by: Zhang Long > --- > drivers/pci/pcie/aer/aerdrv_core.c |1 + > drivers/pci/pcie/portdrv_pci.c | 12 +--- > 2 files changed, 6 insertions(+), 7 deletions(-) > > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c > b/drivers/pci/pcie/aer/aerdrv_core.c > index 564d97f..c61fd44 100644 > --- a/drivers/pci/pcie/aer/aerdrv_core.c > +++ b/drivers/pci/pcie/aer/aerdrv_core.c > @@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void > *data) > result_data = (struct aer_broadcast_data *) data; > > device_lock(>dev); > + dev->error_state = pci_channel_io_normal; |The device's error_state might be pci_channel_io_frozen when we get |here. We haven't touched anything in the hardware yet. What makes |the device unfrozen now? Did anything actually change as far as the |hardware device is concerned? [jzl] When the AER driver gets to invoking report_slot_reset(), it has already performed PCI slot reset. Currently, with PCIe, it is the PCIe link reset it has been performed. But with proper AER driver implementation, it should be either PCI slot hot reset or even fundamental reset that has already been performed. As such, after platform performing the PCI slot reset, it should move the device's error_state out of pci_channel_io_fronzen before calling report_slot_reset() to low level device drivers allows them to access the corresponding device's PCI function in preparation for recovery. |I agree it looks like report_slot_reset() should be made more like |eeh_report_reset(). I'm just wondering if the error_state should be |changed *after* calling the .slot_reset() method instead of before. [jzl] No, you should not set the error_state after calling the .slot_reset() method because the AER driver calling the low level driver's .slot_reset() method is to "report&quo
RE: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset
Bjorn, I have been working with the low level device drivers for supporting both EEH and AER and were involved in working with Yanmin for verifying this AER patch. Please see some of my responses to your questions inline, prefixed with [jzl]. Thanks, Joe Liu, Ph.D. Senior Principal Engineer Emulex Corporation -Original Message- From: Bjorn Helgaas [mailto:bhelg...@google.com] Sent: Friday, May 17, 2013 7:44 PM To: Zhang, LongX Cc: linasveps...@gmail.com; linux-...@vger.kernel.org; linux-kernel@vger.kernel.org; yanmin_zh...@linux.intel.com; Liu, Joseph; Rafael J. Wysocki Subject: Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset [+cc Rafael because he knows about dev-state_saved] Sorry, I'm not very familiar with AER, so please excuse some naive questions below. On Fri, Apr 26, 2013 at 12:28 AM, Zhang, LongX longx.zh...@intel.com wrote: From: Zhang Long longx.zh...@intel.com Specific pci device drivers might have many functions to call pci_channel_offline to check device states. When slot_reset happens, drivers' slot_reset callback might call such functions and eventually abort the reset. |Where does this happen? I looked at all the references to |dev-error_state and all the callers of pci_channel_offline(), and I |didn't see any in .slot_reset() methods. | |(There are *assignments* to dev-error_state in qlcnic_attach_func(), |qlge_io_slot_reset(), and qla2xxx_pci_slot_reset(). You might be able |to remove those assignments after this patch, but this patch wouldn't |really change anything for those paths.) [jzl] Although low level driver might not call pci_channel_offline() directly in .slot_reset() method itself, it does not mean it will not call it during the device error recovery execution of .slot_reset() method. The .slot_reset() method needs to call some of its bottom layer routines interfacing with device PCI interface for preparing the PCI device from recovery of PCI error (EEH or AER). As a matter of fact, it seems that qla2xxx_pci_slot_reset() routine's assignments to dev-error was an attempt to work around this AER driver issue that this patch is trying to resolve. From upstream kernel 3.9.2's qla2xxx_pci_slot_reset(), it has the assignment and comment below: static pci_ers_result_t qla2xxx_pci_slot_reset(struct pci_dev *pdev) { pci_ers_result_t ret = PCI_ERS_RESULT_DISCONNECT; scsi_qla_host_t *base_vha = pci_get_drvdata(pdev); struct qla_hw_data *ha = base_vha-hw; struct rsp_que *rsp; int rc, retries = 10; ql_dbg(ql_dbg_aer, base_vha, 0x9004, Slot Reset.\n); /* Workaround: qla2xxx driver which access hardware earlier * needs error state to be pci_channel_io_online. * Otherwise mailbox command timesout. */ pdev-error_state = pci_channel_io_normal; The patch resets pdev-error_state to pci_channel_io_normal at the begining of report_slot_reset. Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com Signed-off-by: Zhang Long longx.zh...@intel.com --- drivers/pci/pcie/aer/aerdrv_core.c |1 + drivers/pci/pcie/portdrv_pci.c | 12 +--- 2 files changed, 6 insertions(+), 7 deletions(-) diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c index 564d97f..c61fd44 100644 --- a/drivers/pci/pcie/aer/aerdrv_core.c +++ b/drivers/pci/pcie/aer/aerdrv_core.c @@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void *data) result_data = (struct aer_broadcast_data *) data; device_lock(dev-dev); + dev-error_state = pci_channel_io_normal; |The device's error_state might be pci_channel_io_frozen when we get |here. We haven't touched anything in the hardware yet. What makes |the device unfrozen now? Did anything actually change as far as the |hardware device is concerned? [jzl] When the AER driver gets to invoking report_slot_reset(), it has already performed PCI slot reset. Currently, with PCIe, it is the PCIe link reset it has been performed. But with proper AER driver implementation, it should be either PCI slot hot reset or even fundamental reset that has already been performed. As such, after platform performing the PCI slot reset, it should move the device's error_state out of pci_channel_io_fronzen before calling report_slot_reset() to low level device drivers allows them to access the corresponding device's PCI function in preparation for recovery. |I agree it looks like report_slot_reset() should be made more like |eeh_report_reset(). I'm just wondering if the error_state should be |changed *after* calling the .slot_reset() method instead of before. [jzl] No, you should not set the error_state after calling the .slot_reset() method because the AER driver calling the low level driver's .slot_reset() method is to report that the platform PCI slot reset has been done and inform the low level