RE: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-21 Thread Liu, Joseph
Bjorn,

>> diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
>> index ed4d094..7abefd9 100644
>> --- a/drivers/pci/pcie/portdrv_pci.c
>> +++ b/drivers/pci/pcie/portdrv_pci.c
>> @@ -332,13 +332,11 @@ static pci_ers_result_t pcie_portdrv_slot_reset(struct 
>> pci_dev *dev)
>>  pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED;
>>  int retval;
>>  
>> -/* If fatal, restore cfg space for possible link reset at upstream */
>> -if (dev->error_state == pci_channel_io_frozen) {
>> -dev->state_saved = true;
>> -pci_restore_state(dev);
>> -pcie_portdrv_restore_config(dev);
>> -pci_enable_pcie_error_reporting(dev);
>> -}
>> +/* restore cfg space for possible link reset at upstream */
>> +dev->state_saved = true;
>> +pci_restore_state(dev);
>> +pcie_portdrv_restore_config(dev);
>> +pci_enable_pcie_error_reporting(dev);
>>  
>>  /* get true return value from  */
>>  retval = device_for_each_child(>dev, , slot_reset_iter);
>
>I think this patch changes the behavior in the case of a non-fatal error
>where one of the .error_detected() methods returned
>PCI_ERS_RESULT_NEED_RESET.  In that case, pcie_portdrv_slot_reset()
>previously did not restore config space, but after your patch, it *will*
>restore it.  We need an explanation of why this is safe. 

Here is my understanding of this part of the patch:

I think your concern here should not be an issue. Whether it is a non-fatal 
error or a fatal error, as long as one of the .error_detected() method from the 
downstream drivers involved returns a PCI_ERS_RESULT_NEED_RESET, it should 
trump all others and a slot reset should be performed, even though it was 
originally due to a non-fatal error reported or only one of the downstream 
drivers requests it. In the case of AER driver, this should be implemented in 
the broadcast_error_message() with pci_walk_bus() by passing in the 
report_error_detected() function, and merging the results into the 
result_data->result...

In any case, the fact that this pcie_portdrv_slot_reset() being invoked should 
be considered as a aftermath of a slot reset has been performed, thus the 
restore of config space should be performed after the reset. I suppose the 
restore should be to the same state as fresh power-on states, right?

Thanks,
Joe Liu, Ph.D.
Senior Principal Engineer
Emulex Corporation

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-21 Thread Liu, Joseph
Bjorn,

 diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
 index ed4d094..7abefd9 100644
 --- a/drivers/pci/pcie/portdrv_pci.c
 +++ b/drivers/pci/pcie/portdrv_pci.c
 @@ -332,13 +332,11 @@ static pci_ers_result_t pcie_portdrv_slot_reset(struct 
 pci_dev *dev)
  pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED;
  int retval;
  
 -/* If fatal, restore cfg space for possible link reset at upstream */
 -if (dev-error_state == pci_channel_io_frozen) {
 -dev-state_saved = true;
 -pci_restore_state(dev);
 -pcie_portdrv_restore_config(dev);
 -pci_enable_pcie_error_reporting(dev);
 -}
 +/* restore cfg space for possible link reset at upstream */
 +dev-state_saved = true;
 +pci_restore_state(dev);
 +pcie_portdrv_restore_config(dev);
 +pci_enable_pcie_error_reporting(dev);
  
  /* get true return value from status */
  retval = device_for_each_child(dev-dev, status, slot_reset_iter);

I think this patch changes the behavior in the case of a non-fatal error
where one of the .error_detected() methods returned
PCI_ERS_RESULT_NEED_RESET.  In that case, pcie_portdrv_slot_reset()
previously did not restore config space, but after your patch, it *will*
restore it.  We need an explanation of why this is safe. 

Here is my understanding of this part of the patch:

I think your concern here should not be an issue. Whether it is a non-fatal 
error or a fatal error, as long as one of the .error_detected() method from the 
downstream drivers involved returns a PCI_ERS_RESULT_NEED_RESET, it should 
trump all others and a slot reset should be performed, even though it was 
originally due to a non-fatal error reported or only one of the downstream 
drivers requests it. In the case of AER driver, this should be implemented in 
the broadcast_error_message() with pci_walk_bus() by passing in the 
report_error_detected() function, and merging the results into the 
result_data-result...

In any case, the fact that this pcie_portdrv_slot_reset() being invoked should 
be considered as a aftermath of a slot reset has been performed, thus the 
restore of config space should be performed after the reset. I suppose the 
restore should be to the same state as fresh power-on states, right?

Thanks,
Joe Liu, Ph.D.
Senior Principal Engineer
Emulex Corporation

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-20 Thread Liu, Joseph
Bjorn,

I have been working with the low level device drivers for supporting both EEH 
and AER and were involved in working with Yanmin for verifying this AER patch. 
Please see some of my responses to your questions inline, prefixed with [jzl]. 


Thanks,
Joe Liu, Ph.D.
Senior Principal Engineer
Emulex Corporation

-Original Message-
From: Bjorn Helgaas [mailto:bhelg...@google.com] 
Sent: Friday, May 17, 2013 7:44 PM
To: Zhang, LongX
Cc: linasveps...@gmail.com; linux-...@vger.kernel.org; 
linux-kernel@vger.kernel.org; yanmin_zh...@linux.intel.com; Liu, Joseph; Rafael 
J. Wysocki
Subject: Re: Subject : [ PATCH ] 
pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

[+cc Rafael because he knows about dev->state_saved]

Sorry, I'm not very familiar with AER, so please excuse some naive
questions below.

On Fri, Apr 26, 2013 at 12:28 AM, Zhang, LongX  wrote:
> From: Zhang Long 
>
> Specific pci device drivers might have many functions to call
> pci_channel_offline to check device states. When slot_reset happens,
> drivers' slot_reset callback might call such functions and eventually
> abort the reset.

|Where does this happen?  I looked at all the references to
|dev->error_state and all the callers of pci_channel_offline(), and I
|didn't see any in .slot_reset() methods.
|
|(There are *assignments* to dev->error_state in qlcnic_attach_func(),
|qlge_io_slot_reset(), and qla2xxx_pci_slot_reset().  You might be able
|to remove those assignments after this patch, but this patch wouldn't
|really change anything for those paths.)

[jzl] Although low level driver might not call pci_channel_offline() directly 
in .slot_reset() method itself, it does not mean it will not call it during the 
device error recovery execution of .slot_reset() method. The .slot_reset() 
method needs to call some of its bottom layer routines interfacing with device 
PCI interface for preparing the PCI device from recovery of PCI error (EEH or 
AER). As a matter of fact, it seems that qla2xxx_pci_slot_reset() routine's 
assignments to dev->error was an attempt to work around this AER driver issue 
that this patch is trying to resolve. From upstream kernel 3.9.2's 
qla2xxx_pci_slot_reset(), it has the assignment and comment below:

static pci_ers_result_t
qla2xxx_pci_slot_reset(struct pci_dev *pdev)
{
pci_ers_result_t ret = PCI_ERS_RESULT_DISCONNECT;
scsi_qla_host_t *base_vha = pci_get_drvdata(pdev);
struct qla_hw_data *ha = base_vha->hw;
struct rsp_que *rsp;
int rc, retries = 10;

ql_dbg(ql_dbg_aer, base_vha, 0x9004,
"Slot Reset.\n");

/* Workaround: qla2xxx driver which access hardware earlier
 * needs error state to be pci_channel_io_online.
 * Otherwise mailbox command timesout.
 */
pdev->error_state = pci_channel_io_normal;



> The patch resets pdev->error_state to pci_channel_io_normal at
> the begining of report_slot_reset.

> Signed-off-by: Zhang Yanmin 
> Signed-off-by: Zhang Long 
> ---
>  drivers/pci/pcie/aer/aerdrv_core.c |1 +
>  drivers/pci/pcie/portdrv_pci.c |   12 +---
>  2 files changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
> b/drivers/pci/pcie/aer/aerdrv_core.c
> index 564d97f..c61fd44 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void 
> *data)
> result_data = (struct aer_broadcast_data *) data;
>
> device_lock(>dev);
> +   dev->error_state = pci_channel_io_normal;

|The device's error_state might be pci_channel_io_frozen when we get
|here.  We haven't touched anything in the hardware yet.  What makes
|the device unfrozen now?  Did anything actually change as far as the
|hardware device is concerned?

[jzl] When the AER driver gets to invoking report_slot_reset(), it has already 
performed PCI slot reset. Currently, with PCIe, it is the PCIe link reset it 
has been performed. But with proper AER driver implementation, it should be 
either PCI slot hot reset or even fundamental reset that has already been 
performed. As such, after platform performing the PCI slot reset, it should 
move the device's error_state out of pci_channel_io_fronzen before calling 
report_slot_reset() to low level device drivers allows them to access the 
corresponding device's PCI function in preparation for recovery.


|I agree it looks like report_slot_reset() should be made more like
|eeh_report_reset().  I'm just wondering if the error_state should be
|changed *after* calling the .slot_reset() method instead of before.

[jzl] No, you should not set the error_state after calling the .slot_reset() 
method because the AER driver calling the low level driver's .slot_reset() 
method is to "report&quo

RE: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-20 Thread Liu, Joseph
Bjorn,

I have been working with the low level device drivers for supporting both EEH 
and AER and were involved in working with Yanmin for verifying this AER patch. 
Please see some of my responses to your questions inline, prefixed with [jzl]. 


Thanks,
Joe Liu, Ph.D.
Senior Principal Engineer
Emulex Corporation

-Original Message-
From: Bjorn Helgaas [mailto:bhelg...@google.com] 
Sent: Friday, May 17, 2013 7:44 PM
To: Zhang, LongX
Cc: linasveps...@gmail.com; linux-...@vger.kernel.org; 
linux-kernel@vger.kernel.org; yanmin_zh...@linux.intel.com; Liu, Joseph; Rafael 
J. Wysocki
Subject: Re: Subject : [ PATCH ] 
pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

[+cc Rafael because he knows about dev-state_saved]

Sorry, I'm not very familiar with AER, so please excuse some naive
questions below.

On Fri, Apr 26, 2013 at 12:28 AM, Zhang, LongX longx.zh...@intel.com wrote:
 From: Zhang Long longx.zh...@intel.com

 Specific pci device drivers might have many functions to call
 pci_channel_offline to check device states. When slot_reset happens,
 drivers' slot_reset callback might call such functions and eventually
 abort the reset.

|Where does this happen?  I looked at all the references to
|dev-error_state and all the callers of pci_channel_offline(), and I
|didn't see any in .slot_reset() methods.
|
|(There are *assignments* to dev-error_state in qlcnic_attach_func(),
|qlge_io_slot_reset(), and qla2xxx_pci_slot_reset().  You might be able
|to remove those assignments after this patch, but this patch wouldn't
|really change anything for those paths.)

[jzl] Although low level driver might not call pci_channel_offline() directly 
in .slot_reset() method itself, it does not mean it will not call it during the 
device error recovery execution of .slot_reset() method. The .slot_reset() 
method needs to call some of its bottom layer routines interfacing with device 
PCI interface for preparing the PCI device from recovery of PCI error (EEH or 
AER). As a matter of fact, it seems that qla2xxx_pci_slot_reset() routine's 
assignments to dev-error was an attempt to work around this AER driver issue 
that this patch is trying to resolve. From upstream kernel 3.9.2's 
qla2xxx_pci_slot_reset(), it has the assignment and comment below:

static pci_ers_result_t
qla2xxx_pci_slot_reset(struct pci_dev *pdev)
{
pci_ers_result_t ret = PCI_ERS_RESULT_DISCONNECT;
scsi_qla_host_t *base_vha = pci_get_drvdata(pdev);
struct qla_hw_data *ha = base_vha-hw;
struct rsp_que *rsp;
int rc, retries = 10;

ql_dbg(ql_dbg_aer, base_vha, 0x9004,
Slot Reset.\n);

/* Workaround: qla2xxx driver which access hardware earlier
 * needs error state to be pci_channel_io_online.
 * Otherwise mailbox command timesout.
 */
pdev-error_state = pci_channel_io_normal;



 The patch resets pdev-error_state to pci_channel_io_normal at
 the begining of report_slot_reset.

 Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
 Signed-off-by: Zhang Long longx.zh...@intel.com
 ---
  drivers/pci/pcie/aer/aerdrv_core.c |1 +
  drivers/pci/pcie/portdrv_pci.c |   12 +---
  2 files changed, 6 insertions(+), 7 deletions(-)

 diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
 b/drivers/pci/pcie/aer/aerdrv_core.c
 index 564d97f..c61fd44 100644
 --- a/drivers/pci/pcie/aer/aerdrv_core.c
 +++ b/drivers/pci/pcie/aer/aerdrv_core.c
 @@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void 
 *data)
 result_data = (struct aer_broadcast_data *) data;

 device_lock(dev-dev);
 +   dev-error_state = pci_channel_io_normal;

|The device's error_state might be pci_channel_io_frozen when we get
|here.  We haven't touched anything in the hardware yet.  What makes
|the device unfrozen now?  Did anything actually change as far as the
|hardware device is concerned?

[jzl] When the AER driver gets to invoking report_slot_reset(), it has already 
performed PCI slot reset. Currently, with PCIe, it is the PCIe link reset it 
has been performed. But with proper AER driver implementation, it should be 
either PCI slot hot reset or even fundamental reset that has already been 
performed. As such, after platform performing the PCI slot reset, it should 
move the device's error_state out of pci_channel_io_fronzen before calling 
report_slot_reset() to low level device drivers allows them to access the 
corresponding device's PCI function in preparation for recovery.


|I agree it looks like report_slot_reset() should be made more like
|eeh_report_reset().  I'm just wondering if the error_state should be
|changed *after* calling the .slot_reset() method instead of before.

[jzl] No, you should not set the error_state after calling the .slot_reset() 
method because the AER driver calling the low level driver's .slot_reset() 
method is to report that the platform PCI slot reset has been done and 
inform the low level