Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-06-10 Thread Bjorn Helgaas
I opened https://bugzilla.kernel.org/show_bug.cgi?id=59531 for this issue.

Unfortunately, I don't have an army of minions to assign things like this to.

It's easy to change the dev state from frozen to normal before calling
the slot_reset() callback, but we first have to unravel what that
means for pcie_portdrv_slot_reset(), which restores the device state
if it is frozen.

Bjorn

On Thu, Jun 6, 2013 at 12:29 AM, Yanmin Zhang
 wrote:
> On Wed, 2013-06-05 at 07:30 -0600, Bjorn Helgaas wrote:
>> On Tue, Jun 4, 2013 at 6:38 PM, Yanmin Zhang
>>  wrote:
>> > On Tue, 2013-06-04 at 12:04 -0600, Bjorn Helgaas wrote:
>> >> I'm not sure where we are with this patch.  I think Joseph initially
>> >> reported a problem (though I haven't actually seen that), and this
>> >> patch fixed it, so it seems like there's something we want to do here.
>> > Yes, indeed. We checked Powerpc error handling and plan to use the
>> > similar method to deal with the issue.
>> >
>> > Sorry for replying very late. I move to Android smartphone area and am
>> > very busy on many top issues.
>>
>> OK.  Can you point us at a bugzilla or email archive of Joseph's
>> original problem report, or post it to this thread?  Then maybe
>> somebody else can pick it up and make progress on this.
>
> Joseph sent email to my outlook emailbox. I changed it a little to delete 
> company
> sensitive product name.
>
> ===Joseph's original email to me
> Hi Tom and Yanmin,
>
> Hope this email can reach you well. I am working on a driver's PCI error 
> recovery with AER. I have a question with one of my observations from 
> platform AER driver's behavior. As your names and emails are listed to the 
> PCIe AER driver coming with the kernel, I send this question to you:
>
> During injecting AER Non-Correctable/Fatal error and PCIe error recovery 
> process, I observed the following from our Low Level Driver (LLD):
>
> 1. The LLD's error_detected() callback got called and the PCI channel state 
> passed in as "pci_channel_io_frozen", as expected;
>
> 2. The LLD's error_detected() callback function returned with 
> PCI_ERS_RESULT_NEED_RESET, requesting a PCI slot reset;
>
> 3. The LLD's slot_reset() callback got called and it attempted to 
> re-initialize the device hardware for the recovery. However, the PCI slot 
> state was still in "pci_channel_io_frozen" and the pci_channel_offline() 
> helper routine returned 1. This is not expected, and it actually prevented 
> LLD in performing needed access to memory mapped I/O space for preparing the 
> device for recovery;
>
> 4. Later, the LLD's resume() callback got called and the PCI slot state was 
> set to "pci_channel_io_normal";
>
> In the upstream Linux kernel 3.7.0's pci-error-recovery.txt, "STEP 4 Slot 
> Reset", it stated after platform has performed PCI slot reset and then calls 
> LLD's slot_reset() callback:
>
> "This call gives drivers the chance to re-initialize the hardware 
> (re-download firmware, etc.).  At this point, the driver may assume that the 
> card is in a fresh state and is fully functional. The slot is unfrozen and 
> the driver has full access to PCI config space, memory mapped I/O space and 
> DMA. Interrupts (Legacy, MSI, or MSI-X) will also be available."
>
> I looked into the kernel 3.7.0's AER driver's aerdrv_core.c and see that the 
> PCI slot state was set to "pci_channel_io_frozen" on AER_FATAL serverity, and 
> only set back to "pci_channel_io_normal" in report_resume() function. The PCI 
> slot state was not set to "pci_channel_io_normal" when invoking 
> report_slot_reset().
>
> As a comparison, I also perform the EEH error recovery with the LLD driver on 
> a PPC platform, which indeed set the PCI slot state to 
> "pci_channel_io_normal" when calling LLD's slot_reset() callback function.
>
> Can you please verify this platform AER driver's behavior is intended and 
> will stay with the AER platform support, or it should be changed to be 
> consistent with the PCI error recovery procedure described in the 
> pci-error-recovery.txt documentation? I also noticed that before invoking the 
> broadcast_error_message() function with "slot_reset", there is a comment in 
> the 3.7.0 kernel source:
>
> /*
>  * TODO: Should call platform-specific
>  * functions to reset slot before calling
>  * drivers' slot_reset callbacks?
>  */
> And I do see that only the PCIe Link_Reset was performed, no PCI fundamental 
> reset was performed even the LLD set the PCIe device's "pdev->needs_freset = 
> 1", is this the area that later will be added for performing needed PCI hot 
> reset or fundamental reset at this stage?
>
> Your timely response is very appreciated. Thanks in advance and please let me 
> know if you think I should redirect the question to someone else.
>
> Best Regards,
> Joseph Liu, Ph.D.
> Senior Principal Engineer
> Emulex Corporation
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to 

Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-06-10 Thread Bjorn Helgaas
I opened https://bugzilla.kernel.org/show_bug.cgi?id=59531 for this issue.

Unfortunately, I don't have an army of minions to assign things like this to.

It's easy to change the dev state from frozen to normal before calling
the slot_reset() callback, but we first have to unravel what that
means for pcie_portdrv_slot_reset(), which restores the device state
if it is frozen.

Bjorn

On Thu, Jun 6, 2013 at 12:29 AM, Yanmin Zhang
yanmin_zh...@linux.intel.com wrote:
 On Wed, 2013-06-05 at 07:30 -0600, Bjorn Helgaas wrote:
 On Tue, Jun 4, 2013 at 6:38 PM, Yanmin Zhang
 yanmin_zh...@linux.intel.com wrote:
  On Tue, 2013-06-04 at 12:04 -0600, Bjorn Helgaas wrote:
  I'm not sure where we are with this patch.  I think Joseph initially
  reported a problem (though I haven't actually seen that), and this
  patch fixed it, so it seems like there's something we want to do here.
  Yes, indeed. We checked Powerpc error handling and plan to use the
  similar method to deal with the issue.
 
  Sorry for replying very late. I move to Android smartphone area and am
  very busy on many top issues.

 OK.  Can you point us at a bugzilla or email archive of Joseph's
 original problem report, or post it to this thread?  Then maybe
 somebody else can pick it up and make progress on this.

 Joseph sent email to my outlook emailbox. I changed it a little to delete 
 company
 sensitive product name.

 ===Joseph's original email to me
 Hi Tom and Yanmin,

 Hope this email can reach you well. I am working on a driver's PCI error 
 recovery with AER. I have a question with one of my observations from 
 platform AER driver's behavior. As your names and emails are listed to the 
 PCIe AER driver coming with the kernel, I send this question to you:

 During injecting AER Non-Correctable/Fatal error and PCIe error recovery 
 process, I observed the following from our Low Level Driver (LLD):

 1. The LLD's error_detected() callback got called and the PCI channel state 
 passed in as pci_channel_io_frozen, as expected;

 2. The LLD's error_detected() callback function returned with 
 PCI_ERS_RESULT_NEED_RESET, requesting a PCI slot reset;

 3. The LLD's slot_reset() callback got called and it attempted to 
 re-initialize the device hardware for the recovery. However, the PCI slot 
 state was still in pci_channel_io_frozen and the pci_channel_offline() 
 helper routine returned 1. This is not expected, and it actually prevented 
 LLD in performing needed access to memory mapped I/O space for preparing the 
 device for recovery;

 4. Later, the LLD's resume() callback got called and the PCI slot state was 
 set to pci_channel_io_normal;

 In the upstream Linux kernel 3.7.0's pci-error-recovery.txt, STEP 4 Slot 
 Reset, it stated after platform has performed PCI slot reset and then calls 
 LLD's slot_reset() callback:

 This call gives drivers the chance to re-initialize the hardware 
 (re-download firmware, etc.).  At this point, the driver may assume that the 
 card is in a fresh state and is fully functional. The slot is unfrozen and 
 the driver has full access to PCI config space, memory mapped I/O space and 
 DMA. Interrupts (Legacy, MSI, or MSI-X) will also be available.

 I looked into the kernel 3.7.0's AER driver's aerdrv_core.c and see that the 
 PCI slot state was set to pci_channel_io_frozen on AER_FATAL serverity, and 
 only set back to pci_channel_io_normal in report_resume() function. The PCI 
 slot state was not set to pci_channel_io_normal when invoking 
 report_slot_reset().

 As a comparison, I also perform the EEH error recovery with the LLD driver on 
 a PPC platform, which indeed set the PCI slot state to 
 pci_channel_io_normal when calling LLD's slot_reset() callback function.

 Can you please verify this platform AER driver's behavior is intended and 
 will stay with the AER platform support, or it should be changed to be 
 consistent with the PCI error recovery procedure described in the 
 pci-error-recovery.txt documentation? I also noticed that before invoking the 
 broadcast_error_message() function with slot_reset, there is a comment in 
 the 3.7.0 kernel source:

 /*
  * TODO: Should call platform-specific
  * functions to reset slot before calling
  * drivers' slot_reset callbacks?
  */
 And I do see that only the PCIe Link_Reset was performed, no PCI fundamental 
 reset was performed even the LLD set the PCIe device's pdev-needs_freset = 
 1, is this the area that later will be added for performing needed PCI hot 
 reset or fundamental reset at this stage?

 Your timely response is very appreciated. Thanks in advance and please let me 
 know if you think I should redirect the question to someone else.

 Best Regards,
 Joseph Liu, Ph.D.
 Senior Principal Engineer
 Emulex Corporation


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-06-06 Thread Yanmin Zhang
On Wed, 2013-06-05 at 07:30 -0600, Bjorn Helgaas wrote:
> On Tue, Jun 4, 2013 at 6:38 PM, Yanmin Zhang
>  wrote:
> > On Tue, 2013-06-04 at 12:04 -0600, Bjorn Helgaas wrote:
> >> I'm not sure where we are with this patch.  I think Joseph initially
> >> reported a problem (though I haven't actually seen that), and this
> >> patch fixed it, so it seems like there's something we want to do here.
> > Yes, indeed. We checked Powerpc error handling and plan to use the
> > similar method to deal with the issue.
> >
> > Sorry for replying very late. I move to Android smartphone area and am
> > very busy on many top issues.
> 
> OK.  Can you point us at a bugzilla or email archive of Joseph's
> original problem report, or post it to this thread?  Then maybe
> somebody else can pick it up and make progress on this.

Joseph sent email to my outlook emailbox. I changed it a little to delete 
company
sensitive product name.

===Joseph's original email to me
Hi Tom and Yanmin,
 
Hope this email can reach you well. I am working on a driver's PCI error 
recovery with AER. I have a question with one of my observations from platform 
AER driver's behavior. As your names and emails are listed to the PCIe AER 
driver coming with the kernel, I send this question to you:
 
During injecting AER Non-Correctable/Fatal error and PCIe error recovery 
process, I observed the following from our Low Level Driver (LLD):
 
1. The LLD's error_detected() callback got called and the PCI channel state 
passed in as "pci_channel_io_frozen", as expected;
 
2. The LLD's error_detected() callback function returned with 
PCI_ERS_RESULT_NEED_RESET, requesting a PCI slot reset;
 
3. The LLD's slot_reset() callback got called and it attempted to re-initialize 
the device hardware for the recovery. However, the PCI slot state was still in 
"pci_channel_io_frozen" and the pci_channel_offline() helper routine returned 
1. This is not expected, and it actually prevented LLD in performing needed 
access to memory mapped I/O space for preparing the device for recovery;
 
4. Later, the LLD's resume() callback got called and the PCI slot state was set 
to "pci_channel_io_normal";
 
In the upstream Linux kernel 3.7.0's pci-error-recovery.txt, "STEP 4 Slot 
Reset", it stated after platform has performed PCI slot reset and then calls 
LLD's slot_reset() callback:
 
"This call gives drivers the chance to re-initialize the hardware (re-download 
firmware, etc.).  At this point, the driver may assume that the card is in a 
fresh state and is fully functional. The slot is unfrozen and the driver has 
full access to PCI config space, memory mapped I/O space and DMA. Interrupts 
(Legacy, MSI, or MSI-X) will also be available."
 
I looked into the kernel 3.7.0's AER driver's aerdrv_core.c and see that the 
PCI slot state was set to "pci_channel_io_frozen" on AER_FATAL serverity, and 
only set back to "pci_channel_io_normal" in report_resume() function. The PCI 
slot state was not set to "pci_channel_io_normal" when invoking 
report_slot_reset().
 
As a comparison, I also perform the EEH error recovery with the LLD driver on a 
PPC platform, which indeed set the PCI slot state to "pci_channel_io_normal" 
when calling LLD's slot_reset() callback function.
 
Can you please verify this platform AER driver's behavior is intended and will 
stay with the AER platform support, or it should be changed to be consistent 
with the PCI error recovery procedure described in the pci-error-recovery.txt 
documentation? I also noticed that before invoking the 
broadcast_error_message() function with "slot_reset", there is a comment in the 
3.7.0 kernel source:
 
/*
 * TODO: Should call platform-specific
 * functions to reset slot before calling
 * drivers' slot_reset callbacks?
 */
And I do see that only the PCIe Link_Reset was performed, no PCI fundamental 
reset was performed even the LLD set the PCIe device's "pdev->needs_freset = 
1", is this the area that later will be added for performing needed PCI hot 
reset or fundamental reset at this stage?
 
Your timely response is very appreciated. Thanks in advance and please let me 
know if you think I should redirect the question to someone else.
 
Best Regards,
Joseph Liu, Ph.D.
Senior Principal Engineer
Emulex Corporation


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-06-06 Thread Yanmin Zhang
On Wed, 2013-06-05 at 07:30 -0600, Bjorn Helgaas wrote:
 On Tue, Jun 4, 2013 at 6:38 PM, Yanmin Zhang
 yanmin_zh...@linux.intel.com wrote:
  On Tue, 2013-06-04 at 12:04 -0600, Bjorn Helgaas wrote:
  I'm not sure where we are with this patch.  I think Joseph initially
  reported a problem (though I haven't actually seen that), and this
  patch fixed it, so it seems like there's something we want to do here.
  Yes, indeed. We checked Powerpc error handling and plan to use the
  similar method to deal with the issue.
 
  Sorry for replying very late. I move to Android smartphone area and am
  very busy on many top issues.
 
 OK.  Can you point us at a bugzilla or email archive of Joseph's
 original problem report, or post it to this thread?  Then maybe
 somebody else can pick it up and make progress on this.

Joseph sent email to my outlook emailbox. I changed it a little to delete 
company
sensitive product name.

===Joseph's original email to me
Hi Tom and Yanmin,
 
Hope this email can reach you well. I am working on a driver's PCI error 
recovery with AER. I have a question with one of my observations from platform 
AER driver's behavior. As your names and emails are listed to the PCIe AER 
driver coming with the kernel, I send this question to you:
 
During injecting AER Non-Correctable/Fatal error and PCIe error recovery 
process, I observed the following from our Low Level Driver (LLD):
 
1. The LLD's error_detected() callback got called and the PCI channel state 
passed in as pci_channel_io_frozen, as expected;
 
2. The LLD's error_detected() callback function returned with 
PCI_ERS_RESULT_NEED_RESET, requesting a PCI slot reset;
 
3. The LLD's slot_reset() callback got called and it attempted to re-initialize 
the device hardware for the recovery. However, the PCI slot state was still in 
pci_channel_io_frozen and the pci_channel_offline() helper routine returned 
1. This is not expected, and it actually prevented LLD in performing needed 
access to memory mapped I/O space for preparing the device for recovery;
 
4. Later, the LLD's resume() callback got called and the PCI slot state was set 
to pci_channel_io_normal;
 
In the upstream Linux kernel 3.7.0's pci-error-recovery.txt, STEP 4 Slot 
Reset, it stated after platform has performed PCI slot reset and then calls 
LLD's slot_reset() callback:
 
This call gives drivers the chance to re-initialize the hardware (re-download 
firmware, etc.).  At this point, the driver may assume that the card is in a 
fresh state and is fully functional. The slot is unfrozen and the driver has 
full access to PCI config space, memory mapped I/O space and DMA. Interrupts 
(Legacy, MSI, or MSI-X) will also be available.
 
I looked into the kernel 3.7.0's AER driver's aerdrv_core.c and see that the 
PCI slot state was set to pci_channel_io_frozen on AER_FATAL serverity, and 
only set back to pci_channel_io_normal in report_resume() function. The PCI 
slot state was not set to pci_channel_io_normal when invoking 
report_slot_reset().
 
As a comparison, I also perform the EEH error recovery with the LLD driver on a 
PPC platform, which indeed set the PCI slot state to pci_channel_io_normal 
when calling LLD's slot_reset() callback function.
 
Can you please verify this platform AER driver's behavior is intended and will 
stay with the AER platform support, or it should be changed to be consistent 
with the PCI error recovery procedure described in the pci-error-recovery.txt 
documentation? I also noticed that before invoking the 
broadcast_error_message() function with slot_reset, there is a comment in the 
3.7.0 kernel source:
 
/*
 * TODO: Should call platform-specific
 * functions to reset slot before calling
 * drivers' slot_reset callbacks?
 */
And I do see that only the PCIe Link_Reset was performed, no PCI fundamental 
reset was performed even the LLD set the PCIe device's pdev-needs_freset = 
1, is this the area that later will be added for performing needed PCI hot 
reset or fundamental reset at this stage?
 
Your timely response is very appreciated. Thanks in advance and please let me 
know if you think I should redirect the question to someone else.
 
Best Regards,
Joseph Liu, Ph.D.
Senior Principal Engineer
Emulex Corporation


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-06-05 Thread Bjorn Helgaas
On Tue, Jun 4, 2013 at 6:38 PM, Yanmin Zhang
 wrote:
> On Tue, 2013-06-04 at 12:04 -0600, Bjorn Helgaas wrote:
>> I'm not sure where we are with this patch.  I think Joseph initially
>> reported a problem (though I haven't actually seen that), and this
>> patch fixed it, so it seems like there's something we want to do here.
> Yes, indeed. We checked Powerpc error handling and plan to use the
> similar method to deal with the issue.
>
> Sorry for replying very late. I move to Android smartphone area and am
> very busy on many top issues.

OK.  Can you point us at a bugzilla or email archive of Joseph's
original problem report, or post it to this thread?  Then maybe
somebody else can pick it up and make progress on this.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-06-05 Thread Bjorn Helgaas
On Tue, Jun 4, 2013 at 6:38 PM, Yanmin Zhang
yanmin_zh...@linux.intel.com wrote:
 On Tue, 2013-06-04 at 12:04 -0600, Bjorn Helgaas wrote:
 I'm not sure where we are with this patch.  I think Joseph initially
 reported a problem (though I haven't actually seen that), and this
 patch fixed it, so it seems like there's something we want to do here.
 Yes, indeed. We checked Powerpc error handling and plan to use the
 similar method to deal with the issue.

 Sorry for replying very late. I move to Android smartphone area and am
 very busy on many top issues.

OK.  Can you point us at a bugzilla or email archive of Joseph's
original problem report, or post it to this thread?  Then maybe
somebody else can pick it up and make progress on this.

Bjorn
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-06-04 Thread Yanmin Zhang
On Tue, 2013-06-04 at 12:04 -0600, Bjorn Helgaas wrote:
> I'm not sure where we are with this patch.  I think Joseph initially
> reported a problem (though I haven't actually seen that), and this
> patch fixed it, so it seems like there's something we want to do here.
Yes, indeed. We checked Powerpc error handling and plan to use the
similar method to deal with the issue.

Sorry for replying very late. I move to Android smartphone area and am
very busy on many top issues.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-06-04 Thread Bjorn Helgaas
I'm not sure where we are with this patch.  I think Joseph initially
reported a problem (though I haven't actually seen that), and this
patch fixed it, so it seems like there's something we want to do here.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-06-04 Thread Bjorn Helgaas
I'm not sure where we are with this patch.  I think Joseph initially
reported a problem (though I haven't actually seen that), and this
patch fixed it, so it seems like there's something we want to do here.

Bjorn
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-06-04 Thread Yanmin Zhang
On Tue, 2013-06-04 at 12:04 -0600, Bjorn Helgaas wrote:
 I'm not sure where we are with this patch.  I think Joseph initially
 reported a problem (though I haven't actually seen that), and this
 patch fixed it, so it seems like there's something we want to do here.
Yes, indeed. We checked Powerpc error handling and plan to use the
similar method to deal with the issue.

Sorry for replying very late. I move to Android smartphone area and am
very busy on many top issues.


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-21 Thread Bjorn Helgaas
On Tue, May 21, 2013 at 9:41 AM, Liu, Joseph  wrote:
> Bjorn,
>
>>> diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
>>> index ed4d094..7abefd9 100644
>>> --- a/drivers/pci/pcie/portdrv_pci.c
>>> +++ b/drivers/pci/pcie/portdrv_pci.c
>>> @@ -332,13 +332,11 @@ static pci_ers_result_t 
>>> pcie_portdrv_slot_reset(struct pci_dev *dev)
>>>  pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED;
>>>  int retval;
>>>
>>> -/* If fatal, restore cfg space for possible link reset at upstream */
>>> -if (dev->error_state == pci_channel_io_frozen) {
>>> -dev->state_saved = true;
>>> -pci_restore_state(dev);
>>> -pcie_portdrv_restore_config(dev);
>>> -pci_enable_pcie_error_reporting(dev);
>>> -}
>>> +/* restore cfg space for possible link reset at upstream */
>>> +dev->state_saved = true;
>>> +pci_restore_state(dev);
>>> +pcie_portdrv_restore_config(dev);
>>> +pci_enable_pcie_error_reporting(dev);
>>>
>>>  /* get true return value from  */
>>>  retval = device_for_each_child(>dev, , slot_reset_iter);
>>
>>I think this patch changes the behavior in the case of a non-fatal error
>>where one of the .error_detected() methods returned
>>PCI_ERS_RESULT_NEED_RESET.  In that case, pcie_portdrv_slot_reset()
>>previously did not restore config space, but after your patch, it *will*
>>restore it.  We need an explanation of why this is safe.
>
> Here is my understanding of this part of the patch:
>
> I think your concern here should not be an issue. Whether it is a non-fatal 
> error or a fatal error, as long as one of the .error_detected() method from 
> the downstream drivers involved returns a PCI_ERS_RESULT_NEED_RESET, it 
> should trump all others and a slot reset should be performed, even though it 
> was originally due to a non-fatal error reported or only one of the 
> downstream drivers requests it. In the case of AER driver, this should be 
> implemented in the broadcast_error_message() with pci_walk_bus() by passing 
> in the report_error_detected() function, and merging the results into the 
> result_data->result...

Yes, I understand all this.  (Though as I pointed out, the current AER
code does not actually perform a reset based on
PCI_ERS_RESULT_NEED_RESET being returned.  The only time we currently
do a reset is for AER_FATAL errors.)

> In any case, the fact that this pcie_portdrv_slot_reset() being invoked 
> should be considered as a aftermath of a slot reset has been performed, thus 
> the restore of config space should be performed after the reset.

The intention has always been that .slot_reset() would be called to
inform the driver that a reset has been performed.  That was the case
even when Yanmin added pcie_portdrv_slot_reset() with commit
4bf3392e0.  Yet in that commit, pcie_portdrv_slot_reset() only does
the restore when the channel is frozen, i.e., when we're handling an
AER_FATAL error.

So far, I haven't seen any explanation of what changed to make us want
to do the restore always, even for non-fatal errors.  Maybe the
original test for the channel being frozen was just a mistake, but it
would have been much simpler to omit the test to begin with, so
obviously Yanmin thought it was necessary at the time.

Maybe the "error_state == pci_channel_io_frozen" test was a back-door
way to express "we only want to restore state when we've actually done
a reset."  That would sort of make sense, although there's no
documented connection between the frozen state and a device reset, and
no driver should rely on one.

If the idea of "only do a restore after a reset" is still valid, we
don't want to make this change to pcie_portdrv_slot_reset() because it
IS called in some cases when a reset has not been performed, namely
non-fatal errors when an .error_detected() method returns
PCI_ERS_RESULT_NEED_RESET.

> I suppose the restore should be to the same state as fresh power-on states, 
> right?

I *think* that in pcie_portdrv_slot_reset(), we're probably restoring
the state saved by pcie_portdrv_probe(), i.e., the state saved by the
PCIe port driver after it has claimed and initialized the port.  This
is not fresh power-on state; a fresh power-on state would have BARs
and other config registers cleared, and we'd have to assign resources
to the device just like we do to a hot-added device.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-21 Thread Linas Vepstas
Zhang, Bjorn,

On 21 May 2013 10:41, Liu, Joseph  wrote:
> Bjorn,
>
>>> diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
>>> index ed4d094..7abefd9 100644
>>> --- a/drivers/pci/pcie/portdrv_pci.c
>>> +++ b/drivers/pci/pcie/portdrv_pci.c
>>> @@ -332,13 +332,11 @@ static pci_ers_result_t 
>>> pcie_portdrv_slot_reset(struct pci_dev *dev)
>>>  pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED;
>>>  int retval;
>>>
>>> -/* If fatal, restore cfg space for possible link reset at upstream */
>>> -if (dev->error_state == pci_channel_io_frozen) {
>>> -dev->state_saved = true;
>>> -pci_restore_state(dev);
>>> -pcie_portdrv_restore_config(dev);
>>> -pci_enable_pcie_error_reporting(dev);
>>> -}
>>> +/* restore cfg space for possible link reset at upstream */
>>> +dev->state_saved = true;
>>> +pci_restore_state(dev);
>>> +pcie_portdrv_restore_config(dev);
>>> +pci_enable_pcie_error_reporting(dev);
>>>
>>>  /* get true return value from  */
>>>  retval = device_for_each_child(>dev, , slot_reset_iter);
>>
>>I think this patch changes the behavior in the case of a non-fatal error
>>where one of the .error_detected() methods returned
>>PCI_ERS_RESULT_NEED_RESET.  In that case, pcie_portdrv_slot_reset()
>>previously did not restore config space, but after your patch, it *will*
>>restore it.  We need an explanation of why this is safe.
>
> Here is my understanding of this part of the patch:
>
> I think your concern here should not be an issue. Whether it is a non-fatal 
> error or a fatal error, as long as one of the .error_detected() method from 
> the downstream drivers involved returns a PCI_ERS_RESULT_NEED_RESET, it 
> should trump all others and a slot reset should be performed, even though it 
> was originally due to a non-fatal error reported or only one of the 
> downstream drivers requests it. In the case of AER driver, this should be 
> implemented in the broadcast_error_message() with pci_walk_bus() by passing 
> in the report_error_detected() function, and merging the results into the 
> result_data->result...
>
> In any case, the fact that this pcie_portdrv_slot_reset() being invoked 
> should be considered as a aftermath of a slot reset has been performed, thus 
> the restore of config space should be performed after the reset. I suppose 
> the restore should be to the same state as fresh power-on states, right?


Again, I think Joe has it exactly right.   The patch, I'm not so sure.
 In my earlier emails, I was assuming that the device has just gotten
either a hard reset (power has been turned off-n-on, e.g. using
pci-hotplug, or a 'soft reset' (whatever that means :-)).   If the
adapter has been reset, then it is appropriate to restore pci config
space to something resembling a fresh power-on.

If the adapter has NOT been reset, then, ummm .. changing
('restoring') the config space would be wrong ... if I recall
correctly, a pci link reset does not whack the config space, and if
the device itself has not been whacked, then all the contents of the
config space (dma mappings, etc) are all still valid, and should not
be changed!

So, the restore of the config space should be conditional, depending
on whether the device has been reset or not.

-- Linas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-21 Thread Bjorn Helgaas
On Tue, May 21, 2013 at 1:40 AM, Yanmin Zhang
 wrote:
> On Mon, 2013-05-20 at 16:48 -0600, Bjorn Helgaas wrote:
>> On Fri, Apr 26, 2013 at 06:28:59AM +, Zhang, LongX wrote:
>> > From: Zhang Long 
>> >
>> > Specific pci device drivers might have many functions to call
>> > pci_channel_offline to check device states. When slot_reset happens,
>> > drivers' slot_reset callback might call such functions and eventually
>> > abort the reset.
>> >
>> > The patch resets pdev->error_state to pci_channel_io_normal at
>> > the begining of report_slot_reset.
>> >
>> > Thank Liu Joseph for pointing it out.
>> >
>> > Signed-off-by: Zhang Yanmin 
>> > Signed-off-by: Zhang Long 
>> > ---
>> >  drivers/pci/pcie/aer/aerdrv_core.c |1 +
>> >  drivers/pci/pcie/portdrv_pci.c |   12 +---
>> >  2 files changed, 6 insertions(+), 7 deletions(-)
> Thank all for the kind comments.
>
>> >
>> > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
>> > b/drivers/pci/pcie/aer/aerdrv_core.c
>> > index 564d97f..c61fd44 100644
>> > --- a/drivers/pci/pcie/aer/aerdrv_core.c
>> > +++ b/drivers/pci/pcie/aer/aerdrv_core.c
>> > @@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void 
>> > *data)
>> > result_data = (struct aer_broadcast_data *) data;
>> >
>> > device_lock(>dev);
>> > +   dev->error_state = pci_channel_io_normal;
>> > if (!dev->driver ||
>> > !dev->driver->err_handler ||
>> > !dev->driver->err_handler->slot_reset)
>> > diff --git a/drivers/pci/pcie/portdrv_pci.c 
>> > b/drivers/pci/pcie/portdrv_pci.c
>> > index ed4d094..7abefd9 100644
>> > --- a/drivers/pci/pcie/portdrv_pci.c
>> > +++ b/drivers/pci/pcie/portdrv_pci.c
>> > @@ -332,13 +332,11 @@ static pci_ers_result_t 
>> > pcie_portdrv_slot_reset(struct pci_dev *dev)
>> > pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED;
>> > int retval;
>> >
>> > -   /* If fatal, restore cfg space for possible link reset at upstream */
>> > -   if (dev->error_state == pci_channel_io_frozen) {
>> > -   dev->state_saved = true;
>> > -   pci_restore_state(dev);
>> > -   pcie_portdrv_restore_config(dev);
>> > -   pci_enable_pcie_error_reporting(dev);
>> > -   }
>> > +   /* restore cfg space for possible link reset at upstream */
>> > +   dev->state_saved = true;
> It's indeed a dirty trick to change it to true. The reason is suspend. The 
> port would
> suspend/resume at suspend-to-ram. When the port is suspended, PCI power 
> framework calls
> pci_save_state. When the port is resumed, PCI framework calls 
> pci_restore_state and
> dev->state_saved is set to false.

I can read the code as well as you can.  The above does nothing to
explain why dev->saved_config_space is still valid even when
state_saved is false.

But I want to drop this issue because it was there before your patch,
and we're clearly not going to resolve it here.

> A better solution is to add double space in 
> pci_dev->saved_config_space/saved_cap_space,
> which seems consume unnecessary resource.
>
>> > +   pci_restore_state(dev);
>> > +   pcie_portdrv_restore_config(dev);
>> > +   pci_enable_pcie_error_reporting(dev);
>> >
>> > /* get true return value from  */
>> > retval = device_for_each_child(>dev, , slot_reset_iter);
>> I think this patch changes the behavior in the case of a non-fatal error
>> where one of the .error_detected() methods returned
>> PCI_ERS_RESULT_NEED_RESET.  In that case, pcie_portdrv_slot_reset()
>> previously did not restore config space, but after your patch, it *will*
>> restore it.  We need an explanation of why this is safe.
> AER standard doesn't define such behavior like if we need reset a slot.
> When we implemented the feature in kernel, we assumed at non-fatal error,
> config space is still good.
>
> With the new patch, we change the behavior a little.
>
>>
>> I think you should split this into two patches: the first would remove the
>> "if (dev->error_state == pci_channel_io_frozen)" test from portdrv_pci.c
> The first patch would depend on the 2nd patch as 2nd patch resets error_state
> to pci_channel_io_normal.

Not true.  The pcie_portdrv_slot_reset() change does not depend on the
report_slot_reset() change.  If the first patch makes
pcie_portdrv_slot_reset() do the restore without even looking at
dev->error_state, there's no dependency on the report_slot_reset()
change.

>> and explain the reason, and the second would make the aerdrv_core.c change.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-21 Thread Liu, Joseph
Bjorn,

>> diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
>> index ed4d094..7abefd9 100644
>> --- a/drivers/pci/pcie/portdrv_pci.c
>> +++ b/drivers/pci/pcie/portdrv_pci.c
>> @@ -332,13 +332,11 @@ static pci_ers_result_t pcie_portdrv_slot_reset(struct 
>> pci_dev *dev)
>>  pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED;
>>  int retval;
>>  
>> -/* If fatal, restore cfg space for possible link reset at upstream */
>> -if (dev->error_state == pci_channel_io_frozen) {
>> -dev->state_saved = true;
>> -pci_restore_state(dev);
>> -pcie_portdrv_restore_config(dev);
>> -pci_enable_pcie_error_reporting(dev);
>> -}
>> +/* restore cfg space for possible link reset at upstream */
>> +dev->state_saved = true;
>> +pci_restore_state(dev);
>> +pcie_portdrv_restore_config(dev);
>> +pci_enable_pcie_error_reporting(dev);
>>  
>>  /* get true return value from  */
>>  retval = device_for_each_child(>dev, , slot_reset_iter);
>
>I think this patch changes the behavior in the case of a non-fatal error
>where one of the .error_detected() methods returned
>PCI_ERS_RESULT_NEED_RESET.  In that case, pcie_portdrv_slot_reset()
>previously did not restore config space, but after your patch, it *will*
>restore it.  We need an explanation of why this is safe. 

Here is my understanding of this part of the patch:

I think your concern here should not be an issue. Whether it is a non-fatal 
error or a fatal error, as long as one of the .error_detected() method from the 
downstream drivers involved returns a PCI_ERS_RESULT_NEED_RESET, it should 
trump all others and a slot reset should be performed, even though it was 
originally due to a non-fatal error reported or only one of the downstream 
drivers requests it. In the case of AER driver, this should be implemented in 
the broadcast_error_message() with pci_walk_bus() by passing in the 
report_error_detected() function, and merging the results into the 
result_data->result...

In any case, the fact that this pcie_portdrv_slot_reset() being invoked should 
be considered as a aftermath of a slot reset has been performed, thus the 
restore of config space should be performed after the reset. I suppose the 
restore should be to the same state as fresh power-on states, right?

Thanks,
Joe Liu, Ph.D.
Senior Principal Engineer
Emulex Corporation

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-21 Thread Linas Vepstas
Hi,

On 21 May 2013 02:49, Yanmin Zhang  wrote:
> On Mon, 2013-05-20 at 10:37 -0500, Linas Vepstas wrote:

>>  My impression
>> is that maybe the AER driver had been doing not quite the right thing
>> for a long time.
> Pls. provide evidence/facts. The new patch is to facilitate device driver
> implementation. It doesn't mean current AER driver is incorrect. We need
> a tradeoff.
>
> Just like what Bjoin says, we shouldn't change error_state to 
> pci_channel_io_normal
> before we really recover the hardware. The patch changes it just because
> drivers might call some functions to recover the devices, while such functions
> need (error_state==pci_channel_io_normal).

Perhaps we are talking past each other.  One needs to set error_state
== pci_channel_io_normal before calling slot_reset().  If the aer
driver wasn't doing this all along, then it seems like an old bug to
me.   The error_state flag indicates the status of the PCI channel,
and not the status of the attached device.  Once the channel has been
reset, the error state is "normal" even if the card itself hasn't yet
been brought up.

Whatever it is that the aer driver is doing, it should be doing
something similar to what the eeh driver is doing, in
arch/powerpc/platforms/pseries/eeh_driver.c  -- This is the "reference
implementation" -- Its known right, it was and continues to be heavily
tested, has found its way into sriov, etc.

The one thing that eeh does NOT do is to handle suspend/sleep states.
The basic design assumption back then was that no one would ever
suspend/sleep their server.  Since suspend/sleep messes with PCI
config space, then, yes, one would need to somehow save a second,
pristine copy of config space for device recovery.

-- Linas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-21 Thread Yanmin Zhang
On Mon, 2013-05-20 at 10:37 -0500, Linas Vepstas wrote:
> I think Joe Liu has it right.  I'm going to top-post because things
> are a bit tangled below.  I urge a review of
> /Documentation/PCI/pci-error-recovery.txt, as that gives the details.
> 
> The intended sequence is that, after an error, the device driver gets
> a shot at running some diagnostics & dumps, and then the pci
> bridges/controllers/ports/links are reset (by platform code, viz. aer
> in this case) to a state resembling a fresh power-on.  Then the
> .slot_reset() callback is called on the device driver, to tell the
> driver "hey everything upstream is now working, go set yourself up for
> normal ops."  Thus, in particular, one should have pdev->error_state =
> pci_channel_io_normal; before .slot_reset() is called, and the PCI
> config space contents should resemble a fresh power-on state (and
> **NOT** some other saved state!)
> 
> If the device driver wanted to leave the card in a dead state, it had
> several opportunities to say so, earlier in the callback sequence.  If
> the driver wanted to fiddle with the card with the old PCI config
> space, it already had a chance to do that too, before the
> bridges/controllers/ports/links were fully reset by the platform.   By
> the time that .slot_reset() is being called, both the platform and the
> device driver are expecting smooth sailing ahead.
Yes. It's flexible for drivers to do that in many callbacks. AER framework
provides such flexibility.

> 
> So, looking at the original patch, it seems reasonable. 
I agree.

>  My impression
> is that maybe the AER driver had been doing not quite the right thing
> for a long time.
Pls. provide evidence/facts. The new patch is to facilitate device driver
implementation. It doesn't mean current AER driver is incorrect. We need
a tradeoff.

Just like what Bjoin says, we shouldn't change error_state to 
pci_channel_io_normal
before we really recover the hardware. The patch changes it just because
drivers might call some functions to recover the devices, while such functions
need (error_state==pci_channel_io_normal).

> 
> -- Linas Vepstas


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-21 Thread Yanmin Zhang
On Mon, 2013-05-20 at 16:48 -0600, Bjorn Helgaas wrote:
> On Fri, Apr 26, 2013 at 06:28:59AM +, Zhang, LongX wrote:
> > From: Zhang Long 
> > 
> > Specific pci device drivers might have many functions to call
> > pci_channel_offline to check device states. When slot_reset happens,
> > drivers' slot_reset callback might call such functions and eventually
> > abort the reset.
> > 
> > The patch resets pdev->error_state to pci_channel_io_normal at
> > the begining of report_slot_reset.
> > 
> > Thank Liu Joseph for pointing it out.
> > 
> > Signed-off-by: Zhang Yanmin 
> > Signed-off-by: Zhang Long 
> > ---
> >  drivers/pci/pcie/aer/aerdrv_core.c |1 +
> >  drivers/pci/pcie/portdrv_pci.c |   12 +---
> >  2 files changed, 6 insertions(+), 7 deletions(-)
Thank all for the kind comments.

> > 
> > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
> > b/drivers/pci/pcie/aer/aerdrv_core.c
> > index 564d97f..c61fd44 100644
> > --- a/drivers/pci/pcie/aer/aerdrv_core.c
> > +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> > @@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void 
> > *data)
> > result_data = (struct aer_broadcast_data *) data;
> >  
> > device_lock(>dev);
> > +   dev->error_state = pci_channel_io_normal;
> > if (!dev->driver ||
> > !dev->driver->err_handler ||
> > !dev->driver->err_handler->slot_reset)
> > diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
> > index ed4d094..7abefd9 100644
> > --- a/drivers/pci/pcie/portdrv_pci.c
> > +++ b/drivers/pci/pcie/portdrv_pci.c
> > @@ -332,13 +332,11 @@ static pci_ers_result_t 
> > pcie_portdrv_slot_reset(struct pci_dev *dev)
> > pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED;
> > int retval;
> >  
> > -   /* If fatal, restore cfg space for possible link reset at upstream */
> > -   if (dev->error_state == pci_channel_io_frozen) {
> > -   dev->state_saved = true;
> > -   pci_restore_state(dev);
> > -   pcie_portdrv_restore_config(dev);
> > -   pci_enable_pcie_error_reporting(dev);
> > -   }
> > +   /* restore cfg space for possible link reset at upstream */
> > +   dev->state_saved = true;
It's indeed a dirty trick to change it to true. The reason is suspend. The port 
would
suspend/resume at suspend-to-ram. When the port is suspended, PCI power 
framework calls
pci_save_state. When the port is resumed, PCI framework calls pci_restore_state 
and
dev->state_saved is set to false.
A better solution is to add double space in 
pci_dev->saved_config_space/saved_cap_space,
which seems consume unnecessary resource.

> > +   pci_restore_state(dev);
> > +   pcie_portdrv_restore_config(dev);
> > +   pci_enable_pcie_error_reporting(dev);
> >  
> > /* get true return value from  */
> > retval = device_for_each_child(>dev, , slot_reset_iter);
> I think this patch changes the behavior in the case of a non-fatal error
> where one of the .error_detected() methods returned
> PCI_ERS_RESULT_NEED_RESET.  In that case, pcie_portdrv_slot_reset()
> previously did not restore config space, but after your patch, it *will*
> restore it.  We need an explanation of why this is safe.
AER standard doesn't define such behavior like if we need reset a slot.
When we implemented the feature in kernel, we assumed at non-fatal error,
config space is still good.

With the new patch, we change the behavior a little.

> 
> I think you should split this into two patches: the first would remove the
> "if (dev->error_state == pci_channel_io_frozen)" test from portdrv_pci.c
The first patch would depend on the 2nd patch as 2nd patch resets error_state 
to pci_channel_io_normal.

> and explain the reason, and the second would make the aerdrv_core.c change.
> 
> I'm also concerned that in that same case (a non-fatal error where one of
> the .error_detected() methods returned PCI_ERS_RESULT_NEED_RESET), I don't
> think we actually *do* any kind of device reset.  This isn't related to
> your patch, of course, so if you resolve the config space restore question,
> we can deal with the reset question later.
It's a good question. At the beginning when we enabled AER feature in kernel,
we didn't really implement the real device reset. It's in the TODO list.

Yanmin


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-21 Thread Yanmin Zhang
On Mon, 2013-05-20 at 16:48 -0600, Bjorn Helgaas wrote:
 On Fri, Apr 26, 2013 at 06:28:59AM +, Zhang, LongX wrote:
  From: Zhang Long longx.zh...@intel.com
  
  Specific pci device drivers might have many functions to call
  pci_channel_offline to check device states. When slot_reset happens,
  drivers' slot_reset callback might call such functions and eventually
  abort the reset.
  
  The patch resets pdev-error_state to pci_channel_io_normal at
  the begining of report_slot_reset.
  
  Thank Liu Joseph for pointing it out.
  
  Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
  Signed-off-by: Zhang Long longx.zh...@intel.com
  ---
   drivers/pci/pcie/aer/aerdrv_core.c |1 +
   drivers/pci/pcie/portdrv_pci.c |   12 +---
   2 files changed, 6 insertions(+), 7 deletions(-)
Thank all for the kind comments.

  
  diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
  b/drivers/pci/pcie/aer/aerdrv_core.c
  index 564d97f..c61fd44 100644
  --- a/drivers/pci/pcie/aer/aerdrv_core.c
  +++ b/drivers/pci/pcie/aer/aerdrv_core.c
  @@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void 
  *data)
  result_data = (struct aer_broadcast_data *) data;
   
  device_lock(dev-dev);
  +   dev-error_state = pci_channel_io_normal;
  if (!dev-driver ||
  !dev-driver-err_handler ||
  !dev-driver-err_handler-slot_reset)
  diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
  index ed4d094..7abefd9 100644
  --- a/drivers/pci/pcie/portdrv_pci.c
  +++ b/drivers/pci/pcie/portdrv_pci.c
  @@ -332,13 +332,11 @@ static pci_ers_result_t 
  pcie_portdrv_slot_reset(struct pci_dev *dev)
  pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED;
  int retval;
   
  -   /* If fatal, restore cfg space for possible link reset at upstream */
  -   if (dev-error_state == pci_channel_io_frozen) {
  -   dev-state_saved = true;
  -   pci_restore_state(dev);
  -   pcie_portdrv_restore_config(dev);
  -   pci_enable_pcie_error_reporting(dev);
  -   }
  +   /* restore cfg space for possible link reset at upstream */
  +   dev-state_saved = true;
It's indeed a dirty trick to change it to true. The reason is suspend. The port 
would
suspend/resume at suspend-to-ram. When the port is suspended, PCI power 
framework calls
pci_save_state. When the port is resumed, PCI framework calls pci_restore_state 
and
dev-state_saved is set to false.
A better solution is to add double space in 
pci_dev-saved_config_space/saved_cap_space,
which seems consume unnecessary resource.

  +   pci_restore_state(dev);
  +   pcie_portdrv_restore_config(dev);
  +   pci_enable_pcie_error_reporting(dev);
   
  /* get true return value from status */
  retval = device_for_each_child(dev-dev, status, slot_reset_iter);
 I think this patch changes the behavior in the case of a non-fatal error
 where one of the .error_detected() methods returned
 PCI_ERS_RESULT_NEED_RESET.  In that case, pcie_portdrv_slot_reset()
 previously did not restore config space, but after your patch, it *will*
 restore it.  We need an explanation of why this is safe.
AER standard doesn't define such behavior like if we need reset a slot.
When we implemented the feature in kernel, we assumed at non-fatal error,
config space is still good.

With the new patch, we change the behavior a little.

 
 I think you should split this into two patches: the first would remove the
 if (dev-error_state == pci_channel_io_frozen) test from portdrv_pci.c
The first patch would depend on the 2nd patch as 2nd patch resets error_state 
to pci_channel_io_normal.

 and explain the reason, and the second would make the aerdrv_core.c change.
 
 I'm also concerned that in that same case (a non-fatal error where one of
 the .error_detected() methods returned PCI_ERS_RESULT_NEED_RESET), I don't
 think we actually *do* any kind of device reset.  This isn't related to
 your patch, of course, so if you resolve the config space restore question,
 we can deal with the reset question later.
It's a good question. At the beginning when we enabled AER feature in kernel,
we didn't really implement the real device reset. It's in the TODO list.

Yanmin


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-21 Thread Yanmin Zhang
On Mon, 2013-05-20 at 10:37 -0500, Linas Vepstas wrote:
 I think Joe Liu has it right.  I'm going to top-post because things
 are a bit tangled below.  I urge a review of
 /Documentation/PCI/pci-error-recovery.txt, as that gives the details.
 
 The intended sequence is that, after an error, the device driver gets
 a shot at running some diagnostics  dumps, and then the pci
 bridges/controllers/ports/links are reset (by platform code, viz. aer
 in this case) to a state resembling a fresh power-on.  Then the
 .slot_reset() callback is called on the device driver, to tell the
 driver hey everything upstream is now working, go set yourself up for
 normal ops.  Thus, in particular, one should have pdev-error_state =
 pci_channel_io_normal; before .slot_reset() is called, and the PCI
 config space contents should resemble a fresh power-on state (and
 **NOT** some other saved state!)
 
 If the device driver wanted to leave the card in a dead state, it had
 several opportunities to say so, earlier in the callback sequence.  If
 the driver wanted to fiddle with the card with the old PCI config
 space, it already had a chance to do that too, before the
 bridges/controllers/ports/links were fully reset by the platform.   By
 the time that .slot_reset() is being called, both the platform and the
 device driver are expecting smooth sailing ahead.
Yes. It's flexible for drivers to do that in many callbacks. AER framework
provides such flexibility.

 
 So, looking at the original patch, it seems reasonable. 
I agree.

  My impression
 is that maybe the AER driver had been doing not quite the right thing
 for a long time.
Pls. provide evidence/facts. The new patch is to facilitate device driver
implementation. It doesn't mean current AER driver is incorrect. We need
a tradeoff.

Just like what Bjoin says, we shouldn't change error_state to 
pci_channel_io_normal
before we really recover the hardware. The patch changes it just because
drivers might call some functions to recover the devices, while such functions
need (error_state==pci_channel_io_normal).

 
 -- Linas Vepstas


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-21 Thread Linas Vepstas
Hi,

On 21 May 2013 02:49, Yanmin Zhang yanmin_zh...@linux.intel.com wrote:
 On Mon, 2013-05-20 at 10:37 -0500, Linas Vepstas wrote:

  My impression
 is that maybe the AER driver had been doing not quite the right thing
 for a long time.
 Pls. provide evidence/facts. The new patch is to facilitate device driver
 implementation. It doesn't mean current AER driver is incorrect. We need
 a tradeoff.

 Just like what Bjoin says, we shouldn't change error_state to 
 pci_channel_io_normal
 before we really recover the hardware. The patch changes it just because
 drivers might call some functions to recover the devices, while such functions
 need (error_state==pci_channel_io_normal).

Perhaps we are talking past each other.  One needs to set error_state
== pci_channel_io_normal before calling slot_reset().  If the aer
driver wasn't doing this all along, then it seems like an old bug to
me.   The error_state flag indicates the status of the PCI channel,
and not the status of the attached device.  Once the channel has been
reset, the error state is normal even if the card itself hasn't yet
been brought up.

Whatever it is that the aer driver is doing, it should be doing
something similar to what the eeh driver is doing, in
arch/powerpc/platforms/pseries/eeh_driver.c  -- This is the reference
implementation -- Its known right, it was and continues to be heavily
tested, has found its way into sriov, etc.

The one thing that eeh does NOT do is to handle suspend/sleep states.
The basic design assumption back then was that no one would ever
suspend/sleep their server.  Since suspend/sleep messes with PCI
config space, then, yes, one would need to somehow save a second,
pristine copy of config space for device recovery.

-- Linas
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-21 Thread Liu, Joseph
Bjorn,

 diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
 index ed4d094..7abefd9 100644
 --- a/drivers/pci/pcie/portdrv_pci.c
 +++ b/drivers/pci/pcie/portdrv_pci.c
 @@ -332,13 +332,11 @@ static pci_ers_result_t pcie_portdrv_slot_reset(struct 
 pci_dev *dev)
  pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED;
  int retval;
  
 -/* If fatal, restore cfg space for possible link reset at upstream */
 -if (dev-error_state == pci_channel_io_frozen) {
 -dev-state_saved = true;
 -pci_restore_state(dev);
 -pcie_portdrv_restore_config(dev);
 -pci_enable_pcie_error_reporting(dev);
 -}
 +/* restore cfg space for possible link reset at upstream */
 +dev-state_saved = true;
 +pci_restore_state(dev);
 +pcie_portdrv_restore_config(dev);
 +pci_enable_pcie_error_reporting(dev);
  
  /* get true return value from status */
  retval = device_for_each_child(dev-dev, status, slot_reset_iter);

I think this patch changes the behavior in the case of a non-fatal error
where one of the .error_detected() methods returned
PCI_ERS_RESULT_NEED_RESET.  In that case, pcie_portdrv_slot_reset()
previously did not restore config space, but after your patch, it *will*
restore it.  We need an explanation of why this is safe. 

Here is my understanding of this part of the patch:

I think your concern here should not be an issue. Whether it is a non-fatal 
error or a fatal error, as long as one of the .error_detected() method from the 
downstream drivers involved returns a PCI_ERS_RESULT_NEED_RESET, it should 
trump all others and a slot reset should be performed, even though it was 
originally due to a non-fatal error reported or only one of the downstream 
drivers requests it. In the case of AER driver, this should be implemented in 
the broadcast_error_message() with pci_walk_bus() by passing in the 
report_error_detected() function, and merging the results into the 
result_data-result...

In any case, the fact that this pcie_portdrv_slot_reset() being invoked should 
be considered as a aftermath of a slot reset has been performed, thus the 
restore of config space should be performed after the reset. I suppose the 
restore should be to the same state as fresh power-on states, right?

Thanks,
Joe Liu, Ph.D.
Senior Principal Engineer
Emulex Corporation

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-21 Thread Bjorn Helgaas
On Tue, May 21, 2013 at 1:40 AM, Yanmin Zhang
yanmin_zh...@linux.intel.com wrote:
 On Mon, 2013-05-20 at 16:48 -0600, Bjorn Helgaas wrote:
 On Fri, Apr 26, 2013 at 06:28:59AM +, Zhang, LongX wrote:
  From: Zhang Long longx.zh...@intel.com
 
  Specific pci device drivers might have many functions to call
  pci_channel_offline to check device states. When slot_reset happens,
  drivers' slot_reset callback might call such functions and eventually
  abort the reset.
 
  The patch resets pdev-error_state to pci_channel_io_normal at
  the begining of report_slot_reset.
 
  Thank Liu Joseph for pointing it out.
 
  Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
  Signed-off-by: Zhang Long longx.zh...@intel.com
  ---
   drivers/pci/pcie/aer/aerdrv_core.c |1 +
   drivers/pci/pcie/portdrv_pci.c |   12 +---
   2 files changed, 6 insertions(+), 7 deletions(-)
 Thank all for the kind comments.

 
  diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
  b/drivers/pci/pcie/aer/aerdrv_core.c
  index 564d97f..c61fd44 100644
  --- a/drivers/pci/pcie/aer/aerdrv_core.c
  +++ b/drivers/pci/pcie/aer/aerdrv_core.c
  @@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void 
  *data)
  result_data = (struct aer_broadcast_data *) data;
 
  device_lock(dev-dev);
  +   dev-error_state = pci_channel_io_normal;
  if (!dev-driver ||
  !dev-driver-err_handler ||
  !dev-driver-err_handler-slot_reset)
  diff --git a/drivers/pci/pcie/portdrv_pci.c 
  b/drivers/pci/pcie/portdrv_pci.c
  index ed4d094..7abefd9 100644
  --- a/drivers/pci/pcie/portdrv_pci.c
  +++ b/drivers/pci/pcie/portdrv_pci.c
  @@ -332,13 +332,11 @@ static pci_ers_result_t 
  pcie_portdrv_slot_reset(struct pci_dev *dev)
  pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED;
  int retval;
 
  -   /* If fatal, restore cfg space for possible link reset at upstream */
  -   if (dev-error_state == pci_channel_io_frozen) {
  -   dev-state_saved = true;
  -   pci_restore_state(dev);
  -   pcie_portdrv_restore_config(dev);
  -   pci_enable_pcie_error_reporting(dev);
  -   }
  +   /* restore cfg space for possible link reset at upstream */
  +   dev-state_saved = true;
 It's indeed a dirty trick to change it to true. The reason is suspend. The 
 port would
 suspend/resume at suspend-to-ram. When the port is suspended, PCI power 
 framework calls
 pci_save_state. When the port is resumed, PCI framework calls 
 pci_restore_state and
 dev-state_saved is set to false.

I can read the code as well as you can.  The above does nothing to
explain why dev-saved_config_space is still valid even when
state_saved is false.

But I want to drop this issue because it was there before your patch,
and we're clearly not going to resolve it here.

 A better solution is to add double space in 
 pci_dev-saved_config_space/saved_cap_space,
 which seems consume unnecessary resource.

  +   pci_restore_state(dev);
  +   pcie_portdrv_restore_config(dev);
  +   pci_enable_pcie_error_reporting(dev);
 
  /* get true return value from status */
  retval = device_for_each_child(dev-dev, status, slot_reset_iter);
 I think this patch changes the behavior in the case of a non-fatal error
 where one of the .error_detected() methods returned
 PCI_ERS_RESULT_NEED_RESET.  In that case, pcie_portdrv_slot_reset()
 previously did not restore config space, but after your patch, it *will*
 restore it.  We need an explanation of why this is safe.
 AER standard doesn't define such behavior like if we need reset a slot.
 When we implemented the feature in kernel, we assumed at non-fatal error,
 config space is still good.

 With the new patch, we change the behavior a little.


 I think you should split this into two patches: the first would remove the
 if (dev-error_state == pci_channel_io_frozen) test from portdrv_pci.c
 The first patch would depend on the 2nd patch as 2nd patch resets error_state
 to pci_channel_io_normal.

Not true.  The pcie_portdrv_slot_reset() change does not depend on the
report_slot_reset() change.  If the first patch makes
pcie_portdrv_slot_reset() do the restore without even looking at
dev-error_state, there's no dependency on the report_slot_reset()
change.

 and explain the reason, and the second would make the aerdrv_core.c change.

Bjorn
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-21 Thread Linas Vepstas
Zhang, Bjorn,

On 21 May 2013 10:41, Liu, Joseph joseph@emulex.com wrote:
 Bjorn,

 diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
 index ed4d094..7abefd9 100644
 --- a/drivers/pci/pcie/portdrv_pci.c
 +++ b/drivers/pci/pcie/portdrv_pci.c
 @@ -332,13 +332,11 @@ static pci_ers_result_t 
 pcie_portdrv_slot_reset(struct pci_dev *dev)
  pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED;
  int retval;

 -/* If fatal, restore cfg space for possible link reset at upstream */
 -if (dev-error_state == pci_channel_io_frozen) {
 -dev-state_saved = true;
 -pci_restore_state(dev);
 -pcie_portdrv_restore_config(dev);
 -pci_enable_pcie_error_reporting(dev);
 -}
 +/* restore cfg space for possible link reset at upstream */
 +dev-state_saved = true;
 +pci_restore_state(dev);
 +pcie_portdrv_restore_config(dev);
 +pci_enable_pcie_error_reporting(dev);

  /* get true return value from status */
  retval = device_for_each_child(dev-dev, status, slot_reset_iter);

I think this patch changes the behavior in the case of a non-fatal error
where one of the .error_detected() methods returned
PCI_ERS_RESULT_NEED_RESET.  In that case, pcie_portdrv_slot_reset()
previously did not restore config space, but after your patch, it *will*
restore it.  We need an explanation of why this is safe.

 Here is my understanding of this part of the patch:

 I think your concern here should not be an issue. Whether it is a non-fatal 
 error or a fatal error, as long as one of the .error_detected() method from 
 the downstream drivers involved returns a PCI_ERS_RESULT_NEED_RESET, it 
 should trump all others and a slot reset should be performed, even though it 
 was originally due to a non-fatal error reported or only one of the 
 downstream drivers requests it. In the case of AER driver, this should be 
 implemented in the broadcast_error_message() with pci_walk_bus() by passing 
 in the report_error_detected() function, and merging the results into the 
 result_data-result...

 In any case, the fact that this pcie_portdrv_slot_reset() being invoked 
 should be considered as a aftermath of a slot reset has been performed, thus 
 the restore of config space should be performed after the reset. I suppose 
 the restore should be to the same state as fresh power-on states, right?


Again, I think Joe has it exactly right.   The patch, I'm not so sure.
 In my earlier emails, I was assuming that the device has just gotten
either a hard reset (power has been turned off-n-on, e.g. using
pci-hotplug, or a 'soft reset' (whatever that means :-)).   If the
adapter has been reset, then it is appropriate to restore pci config
space to something resembling a fresh power-on.

If the adapter has NOT been reset, then, ummm .. changing
('restoring') the config space would be wrong ... if I recall
correctly, a pci link reset does not whack the config space, and if
the device itself has not been whacked, then all the contents of the
config space (dma mappings, etc) are all still valid, and should not
be changed!

So, the restore of the config space should be conditional, depending
on whether the device has been reset or not.

-- Linas
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-21 Thread Bjorn Helgaas
On Tue, May 21, 2013 at 9:41 AM, Liu, Joseph joseph@emulex.com wrote:
 Bjorn,

 diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
 index ed4d094..7abefd9 100644
 --- a/drivers/pci/pcie/portdrv_pci.c
 +++ b/drivers/pci/pcie/portdrv_pci.c
 @@ -332,13 +332,11 @@ static pci_ers_result_t 
 pcie_portdrv_slot_reset(struct pci_dev *dev)
  pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED;
  int retval;

 -/* If fatal, restore cfg space for possible link reset at upstream */
 -if (dev-error_state == pci_channel_io_frozen) {
 -dev-state_saved = true;
 -pci_restore_state(dev);
 -pcie_portdrv_restore_config(dev);
 -pci_enable_pcie_error_reporting(dev);
 -}
 +/* restore cfg space for possible link reset at upstream */
 +dev-state_saved = true;
 +pci_restore_state(dev);
 +pcie_portdrv_restore_config(dev);
 +pci_enable_pcie_error_reporting(dev);

  /* get true return value from status */
  retval = device_for_each_child(dev-dev, status, slot_reset_iter);

I think this patch changes the behavior in the case of a non-fatal error
where one of the .error_detected() methods returned
PCI_ERS_RESULT_NEED_RESET.  In that case, pcie_portdrv_slot_reset()
previously did not restore config space, but after your patch, it *will*
restore it.  We need an explanation of why this is safe.

 Here is my understanding of this part of the patch:

 I think your concern here should not be an issue. Whether it is a non-fatal 
 error or a fatal error, as long as one of the .error_detected() method from 
 the downstream drivers involved returns a PCI_ERS_RESULT_NEED_RESET, it 
 should trump all others and a slot reset should be performed, even though it 
 was originally due to a non-fatal error reported or only one of the 
 downstream drivers requests it. In the case of AER driver, this should be 
 implemented in the broadcast_error_message() with pci_walk_bus() by passing 
 in the report_error_detected() function, and merging the results into the 
 result_data-result...

Yes, I understand all this.  (Though as I pointed out, the current AER
code does not actually perform a reset based on
PCI_ERS_RESULT_NEED_RESET being returned.  The only time we currently
do a reset is for AER_FATAL errors.)

 In any case, the fact that this pcie_portdrv_slot_reset() being invoked 
 should be considered as a aftermath of a slot reset has been performed, thus 
 the restore of config space should be performed after the reset.

The intention has always been that .slot_reset() would be called to
inform the driver that a reset has been performed.  That was the case
even when Yanmin added pcie_portdrv_slot_reset() with commit
4bf3392e0.  Yet in that commit, pcie_portdrv_slot_reset() only does
the restore when the channel is frozen, i.e., when we're handling an
AER_FATAL error.

So far, I haven't seen any explanation of what changed to make us want
to do the restore always, even for non-fatal errors.  Maybe the
original test for the channel being frozen was just a mistake, but it
would have been much simpler to omit the test to begin with, so
obviously Yanmin thought it was necessary at the time.

Maybe the error_state == pci_channel_io_frozen test was a back-door
way to express we only want to restore state when we've actually done
a reset.  That would sort of make sense, although there's no
documented connection between the frozen state and a device reset, and
no driver should rely on one.

If the idea of only do a restore after a reset is still valid, we
don't want to make this change to pcie_portdrv_slot_reset() because it
IS called in some cases when a reset has not been performed, namely
non-fatal errors when an .error_detected() method returns
PCI_ERS_RESULT_NEED_RESET.

 I suppose the restore should be to the same state as fresh power-on states, 
 right?

I *think* that in pcie_portdrv_slot_reset(), we're probably restoring
the state saved by pcie_portdrv_probe(), i.e., the state saved by the
PCIe port driver after it has claimed and initialized the port.  This
is not fresh power-on state; a fresh power-on state would have BARs
and other config registers cleared, and we'd have to assign resources
to the device just like we do to a hot-added device.

Bjorn
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-20 Thread Bjorn Helgaas
On Fri, Apr 26, 2013 at 06:28:59AM +, Zhang, LongX wrote:
> From: Zhang Long 
> 
> Specific pci device drivers might have many functions to call
> pci_channel_offline to check device states. When slot_reset happens,
> drivers' slot_reset callback might call such functions and eventually
> abort the reset.
> 
> The patch resets pdev->error_state to pci_channel_io_normal at
> the begining of report_slot_reset.
> 
> Thank Liu Joseph for pointing it out.
> 
> Signed-off-by: Zhang Yanmin 
> Signed-off-by: Zhang Long 
> ---
>  drivers/pci/pcie/aer/aerdrv_core.c |1 +
>  drivers/pci/pcie/portdrv_pci.c |   12 +---
>  2 files changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
> b/drivers/pci/pcie/aer/aerdrv_core.c
> index 564d97f..c61fd44 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void 
> *data)
>   result_data = (struct aer_broadcast_data *) data;
>  
>   device_lock(>dev);
> + dev->error_state = pci_channel_io_normal;
>   if (!dev->driver ||
>   !dev->driver->err_handler ||
>   !dev->driver->err_handler->slot_reset)
> diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
> index ed4d094..7abefd9 100644
> --- a/drivers/pci/pcie/portdrv_pci.c
> +++ b/drivers/pci/pcie/portdrv_pci.c
> @@ -332,13 +332,11 @@ static pci_ers_result_t pcie_portdrv_slot_reset(struct 
> pci_dev *dev)
>   pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED;
>   int retval;
>  
> - /* If fatal, restore cfg space for possible link reset at upstream */
> - if (dev->error_state == pci_channel_io_frozen) {
> - dev->state_saved = true;
> - pci_restore_state(dev);
> - pcie_portdrv_restore_config(dev);
> - pci_enable_pcie_error_reporting(dev);
> - }
> + /* restore cfg space for possible link reset at upstream */
> + dev->state_saved = true;
> + pci_restore_state(dev);
> + pcie_portdrv_restore_config(dev);
> + pci_enable_pcie_error_reporting(dev);
>  
>   /* get true return value from  */
>   retval = device_for_each_child(>dev, , slot_reset_iter);

I think this patch changes the behavior in the case of a non-fatal error
where one of the .error_detected() methods returned
PCI_ERS_RESULT_NEED_RESET.  In that case, pcie_portdrv_slot_reset()
previously did not restore config space, but after your patch, it *will*
restore it.  We need an explanation of why this is safe.

I think you should split this into two patches: the first would remove the
"if (dev->error_state == pci_channel_io_frozen)" test from portdrv_pci.c
and explain the reason, and the second would make the aerdrv_core.c change.

I'm also concerned that in that same case (a non-fatal error where one of
the .error_detected() methods returned PCI_ERS_RESULT_NEED_RESET), I don't
think we actually *do* any kind of device reset.  This isn't related to
your patch, of course, so if you resolve the config space restore question,
we can deal with the reset question later.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-20 Thread Linas Vepstas
On 20 May 2013 12:21, Bjorn Helgaas  wrote:
> On Fri, May 17, 2013 at 5:56 PM, Rafael J. Wysocki  wrote:
>> On Friday, May 17, 2013 05:43:33 PM Bjorn Helgaas wrote:
>>> On Fri, Apr 26, 2013 at 12:28 AM, Zhang, LongX  
>>> wrote:
>
>>> > +   /* restore cfg space for possible link reset at upstream */
>>> > +   dev->state_saved = true;
>>>
>>> "dev->state_saved == true" means that the dev->saved_config_space
>>> contains valid data.  Why do we know that's the case here?  I see that
>>> pcie_portdrv_probe() calls pci_save_state() when we first claim the
>>> port, and I guess we're assuming the state saved then is still valid.

Yes, see below.

>>> But why do we need to actually set dev->state_saved here?  Shouldn't
>>> it be already set to true anyway?
>>
>> This is a dirty trick to make pci_restore_state(dev) always work here
>> (because it checks dev->state_saved and does nothing if it isn't set).
>> I suppose.
>
> Yes, I did investigate enough to see that this is a dirty trick.  My
> question is how we know it's safe to do this dirty trick.
>
>>> > +   pci_restore_state(dev);
>>> > +   pcie_portdrv_restore_config(dev);

Lets review what is supposed to happen:

-- poweron
-- BIOS/firmware/bootloader maybe fiddles with PCI config space
-- kernel fiddles with PCI config space during boot
-- device driver sets up PCI config space correctly for normal i/o
-- PCI error occurs
-- PCI port/link is reset

At this point, we want to set the PCI config space to something
resembling what it was just before the device driver first touched it
after poweron.  Why?  Because the device driver will typically set up
DMA segments, and tweak other settings as it initializes the card.  It
needs to do almost exactly the same things again, when re-initializing
the device after the error reset.  Thus, the PCI config needs to be
appropriate for a fresh initialization of the card.

If  some other variant of PCI config was loaded here, it might contain
incorrect DMA mappings or other settings.  In particular, while the
driver is initializing the card, you risk having the card run away and
start doing things based on these incorrect settings -- provoking
another error or maybe just silently corrupting data.   The config
that you want to load should be more-or-less the same one that was
there during kernel boot, before the device-driver started touching
things.

The "dirty hack" is weird:  either there's valid data, and the flag is
set, or the data is garbage and the flag isn't set ... how could it be
otherwise?

-- Linas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-20 Thread Bjorn Helgaas
On Fri, May 17, 2013 at 5:56 PM, Rafael J. Wysocki  wrote:
> On Friday, May 17, 2013 05:43:33 PM Bjorn Helgaas wrote:
>> On Fri, Apr 26, 2013 at 12:28 AM, Zhang, LongX  wrote:

>> > +   /* restore cfg space for possible link reset at upstream */
>> > +   dev->state_saved = true;
>>
>> "dev->state_saved == true" means that the dev->saved_config_space
>> contains valid data.  Why do we know that's the case here?  I see that
>> pcie_portdrv_probe() calls pci_save_state() when we first claim the
>> port, and I guess we're assuming the state saved then is still valid.
>> But why do we need to actually set dev->state_saved here?  Shouldn't
>> it be already set to true anyway?
>
> This is a dirty trick to make pci_restore_state(dev) always work here
> (because it checks dev->state_saved and does nothing if it isn't set).
> I suppose.

Yes, I did investigate enough to see that this is a dirty trick.  My
question is how we know it's safe to do this dirty trick.

>> > +   pci_restore_state(dev);
>> > +   pcie_portdrv_restore_config(dev);
>> > +   pci_enable_pcie_error_reporting(dev);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-20 Thread Linas Vepstas
I think Joe Liu has it right.  I'm going to top-post because things
are a bit tangled below.  I urge a review of
/Documentation/PCI/pci-error-recovery.txt, as that gives the details.

The intended sequence is that, after an error, the device driver gets
a shot at running some diagnostics & dumps, and then the pci
bridges/controllers/ports/links are reset (by platform code, viz. aer
in this case) to a state resembling a fresh power-on.  Then the
.slot_reset() callback is called on the device driver, to tell the
driver "hey everything upstream is now working, go set yourself up for
normal ops."  Thus, in particular, one should have pdev->error_state =
pci_channel_io_normal; before .slot_reset() is called, and the PCI
config space contents should resemble a fresh power-on state (and
**NOT** some other saved state!)

If the device driver wanted to leave the card in a dead state, it had
several opportunities to say so, earlier in the callback sequence.  If
the driver wanted to fiddle with the card with the old PCI config
space, it already had a chance to do that too, before the
bridges/controllers/ports/links were fully reset by the platform.   By
the time that .slot_reset() is being called, both the platform and the
device driver are expecting smooth sailing ahead.

So, looking at the original patch, it seems reasonable.  My impression
is that maybe the AER driver had been doing not quite the right thing
for a long time.

-- Linas Vepstas

On 20 May 2013 09:38, Liu, Joseph  wrote:
> Bjorn,
>
> I have been working with the low level device drivers for supporting both EEH 
> and AER and were involved in working with Yanmin for verifying this AER 
> patch. Please see some of my responses to your questions inline, prefixed 
> with [jzl].
>
>
> Thanks,
> Joe Liu, Ph.D.
> Senior Principal Engineer
> Emulex Corporation
>
> -Original Message-
> From: Bjorn Helgaas [mailto:bhelg...@google.com]
> Sent: Friday, May 17, 2013 7:44 PM
> To: Zhang, LongX
> Cc: linasveps...@gmail.com; linux-...@vger.kernel.org; 
> linux-kernel@vger.kernel.org; yanmin_zh...@linux.intel.com; Liu, Joseph; 
> Rafael J. Wysocki
> Subject: Re: Subject : [ PATCH ] 
> pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset
>
> [+cc Rafael because he knows about dev->state_saved]
>
> Sorry, I'm not very familiar with AER, so please excuse some naive
> questions below.
>
> On Fri, Apr 26, 2013 at 12:28 AM, Zhang, LongX  wrote:
>> From: Zhang Long 
>>
>> Specific pci device drivers might have many functions to call
>> pci_channel_offline to check device states. When slot_reset happens,
>> drivers' slot_reset callback might call such functions and eventually
>> abort the reset.
>
> |Where does this happen?  I looked at all the references to
> |dev->error_state and all the callers of pci_channel_offline(), and I
> |didn't see any in .slot_reset() methods.
> |
> |(There are *assignments* to dev->error_state in qlcnic_attach_func(),
> |qlge_io_slot_reset(), and qla2xxx_pci_slot_reset().  You might be able
> |to remove those assignments after this patch, but this patch wouldn't
> |really change anything for those paths.)
>
> [jzl] Although low level driver might not call pci_channel_offline() directly 
> in .slot_reset() method itself, it does not mean it will not call it during 
> the device error recovery execution of .slot_reset() method. The 
> .slot_reset() method needs to call some of its bottom layer routines 
> interfacing with device PCI interface for preparing the PCI device from 
> recovery of PCI error (EEH or AER). As a matter of fact, it seems that 
> qla2xxx_pci_slot_reset() routine's assignments to dev->error was an attempt 
> to work around this AER driver issue that this patch is trying to resolve. 
> From upstream kernel 3.9.2's qla2xxx_pci_slot_reset(), it has the assignment 
> and comment below:
>
> static pci_ers_result_t
> qla2xxx_pci_slot_reset(struct pci_dev *pdev)
> {
> pci_ers_result_t ret = PCI_ERS_RESULT_DISCONNECT;
> scsi_qla_host_t *base_vha = pci_get_drvdata(pdev);
> struct qla_hw_data *ha = base_vha->hw;
> struct rsp_que *rsp;
> int rc, retries = 10;
>
> ql_dbg(ql_dbg_aer, base_vha, 0x9004,
> "Slot Reset.\n");
>
> /* Workaround: qla2xxx driver which access hardware earlier
>  * needs error state to be pci_channel_io_online.
>  * Otherwise mailbox command timesout.
>  */
> pdev->error_state = pci_channel_io_normal;
>
>
>
>> The patch resets pdev->error_state to pci_channel_io_normal at
>> the begining of report_slot_reset.
>
>> Signed-off-by: Zhang Yanmin 
>> Signed-off-by: Zhang Long 
>

RE: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-20 Thread Liu, Joseph
Bjorn,

I have been working with the low level device drivers for supporting both EEH 
and AER and were involved in working with Yanmin for verifying this AER patch. 
Please see some of my responses to your questions inline, prefixed with [jzl]. 


Thanks,
Joe Liu, Ph.D.
Senior Principal Engineer
Emulex Corporation

-Original Message-
From: Bjorn Helgaas [mailto:bhelg...@google.com] 
Sent: Friday, May 17, 2013 7:44 PM
To: Zhang, LongX
Cc: linasveps...@gmail.com; linux-...@vger.kernel.org; 
linux-kernel@vger.kernel.org; yanmin_zh...@linux.intel.com; Liu, Joseph; Rafael 
J. Wysocki
Subject: Re: Subject : [ PATCH ] 
pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

[+cc Rafael because he knows about dev->state_saved]

Sorry, I'm not very familiar with AER, so please excuse some naive
questions below.

On Fri, Apr 26, 2013 at 12:28 AM, Zhang, LongX  wrote:
> From: Zhang Long 
>
> Specific pci device drivers might have many functions to call
> pci_channel_offline to check device states. When slot_reset happens,
> drivers' slot_reset callback might call such functions and eventually
> abort the reset.

|Where does this happen?  I looked at all the references to
|dev->error_state and all the callers of pci_channel_offline(), and I
|didn't see any in .slot_reset() methods.
|
|(There are *assignments* to dev->error_state in qlcnic_attach_func(),
|qlge_io_slot_reset(), and qla2xxx_pci_slot_reset().  You might be able
|to remove those assignments after this patch, but this patch wouldn't
|really change anything for those paths.)

[jzl] Although low level driver might not call pci_channel_offline() directly 
in .slot_reset() method itself, it does not mean it will not call it during the 
device error recovery execution of .slot_reset() method. The .slot_reset() 
method needs to call some of its bottom layer routines interfacing with device 
PCI interface for preparing the PCI device from recovery of PCI error (EEH or 
AER). As a matter of fact, it seems that qla2xxx_pci_slot_reset() routine's 
assignments to dev->error was an attempt to work around this AER driver issue 
that this patch is trying to resolve. From upstream kernel 3.9.2's 
qla2xxx_pci_slot_reset(), it has the assignment and comment below:

static pci_ers_result_t
qla2xxx_pci_slot_reset(struct pci_dev *pdev)
{
pci_ers_result_t ret = PCI_ERS_RESULT_DISCONNECT;
scsi_qla_host_t *base_vha = pci_get_drvdata(pdev);
struct qla_hw_data *ha = base_vha->hw;
struct rsp_que *rsp;
int rc, retries = 10;

ql_dbg(ql_dbg_aer, base_vha, 0x9004,
"Slot Reset.\n");

/* Workaround: qla2xxx driver which access hardware earlier
 * needs error state to be pci_channel_io_online.
 * Otherwise mailbox command timesout.
 */
pdev->error_state = pci_channel_io_normal;



> The patch resets pdev->error_state to pci_channel_io_normal at
> the begining of report_slot_reset.

> Signed-off-by: Zhang Yanmin 
> Signed-off-by: Zhang Long 
> ---
>  drivers/pci/pcie/aer/aerdrv_core.c |1 +
>  drivers/pci/pcie/portdrv_pci.c |   12 +---
>  2 files changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
> b/drivers/pci/pcie/aer/aerdrv_core.c
> index 564d97f..c61fd44 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void 
> *data)
> result_data = (struct aer_broadcast_data *) data;
>
> device_lock(>dev);
> +   dev->error_state = pci_channel_io_normal;

|The device's error_state might be pci_channel_io_frozen when we get
|here.  We haven't touched anything in the hardware yet.  What makes
|the device unfrozen now?  Did anything actually change as far as the
|hardware device is concerned?

[jzl] When the AER driver gets to invoking report_slot_reset(), it has already 
performed PCI slot reset. Currently, with PCIe, it is the PCIe link reset it 
has been performed. But with proper AER driver implementation, it should be 
either PCI slot hot reset or even fundamental reset that has already been 
performed. As such, after platform performing the PCI slot reset, it should 
move the device's error_state out of pci_channel_io_fronzen before calling 
report_slot_reset() to low level device drivers allows them to access the 
corresponding device's PCI function in preparation for recovery.


|I agree it looks like report_slot_reset() should be made more like
|eeh_report_reset().  I'm just wondering if the error_state should be
|changed *after* calling the .slot_reset() method instead of before.

[jzl] No, you should not set the error_state after calling the .slot_reset() 
method because the AER driver calling the low level driver's .slot_reset() 
method is to "report&quo

RE: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-20 Thread Liu, Joseph
Bjorn,

I have been working with the low level device drivers for supporting both EEH 
and AER and were involved in working with Yanmin for verifying this AER patch. 
Please see some of my responses to your questions inline, prefixed with [jzl]. 


Thanks,
Joe Liu, Ph.D.
Senior Principal Engineer
Emulex Corporation

-Original Message-
From: Bjorn Helgaas [mailto:bhelg...@google.com] 
Sent: Friday, May 17, 2013 7:44 PM
To: Zhang, LongX
Cc: linasveps...@gmail.com; linux-...@vger.kernel.org; 
linux-kernel@vger.kernel.org; yanmin_zh...@linux.intel.com; Liu, Joseph; Rafael 
J. Wysocki
Subject: Re: Subject : [ PATCH ] 
pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

[+cc Rafael because he knows about dev-state_saved]

Sorry, I'm not very familiar with AER, so please excuse some naive
questions below.

On Fri, Apr 26, 2013 at 12:28 AM, Zhang, LongX longx.zh...@intel.com wrote:
 From: Zhang Long longx.zh...@intel.com

 Specific pci device drivers might have many functions to call
 pci_channel_offline to check device states. When slot_reset happens,
 drivers' slot_reset callback might call such functions and eventually
 abort the reset.

|Where does this happen?  I looked at all the references to
|dev-error_state and all the callers of pci_channel_offline(), and I
|didn't see any in .slot_reset() methods.
|
|(There are *assignments* to dev-error_state in qlcnic_attach_func(),
|qlge_io_slot_reset(), and qla2xxx_pci_slot_reset().  You might be able
|to remove those assignments after this patch, but this patch wouldn't
|really change anything for those paths.)

[jzl] Although low level driver might not call pci_channel_offline() directly 
in .slot_reset() method itself, it does not mean it will not call it during the 
device error recovery execution of .slot_reset() method. The .slot_reset() 
method needs to call some of its bottom layer routines interfacing with device 
PCI interface for preparing the PCI device from recovery of PCI error (EEH or 
AER). As a matter of fact, it seems that qla2xxx_pci_slot_reset() routine's 
assignments to dev-error was an attempt to work around this AER driver issue 
that this patch is trying to resolve. From upstream kernel 3.9.2's 
qla2xxx_pci_slot_reset(), it has the assignment and comment below:

static pci_ers_result_t
qla2xxx_pci_slot_reset(struct pci_dev *pdev)
{
pci_ers_result_t ret = PCI_ERS_RESULT_DISCONNECT;
scsi_qla_host_t *base_vha = pci_get_drvdata(pdev);
struct qla_hw_data *ha = base_vha-hw;
struct rsp_que *rsp;
int rc, retries = 10;

ql_dbg(ql_dbg_aer, base_vha, 0x9004,
Slot Reset.\n);

/* Workaround: qla2xxx driver which access hardware earlier
 * needs error state to be pci_channel_io_online.
 * Otherwise mailbox command timesout.
 */
pdev-error_state = pci_channel_io_normal;



 The patch resets pdev-error_state to pci_channel_io_normal at
 the begining of report_slot_reset.

 Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
 Signed-off-by: Zhang Long longx.zh...@intel.com
 ---
  drivers/pci/pcie/aer/aerdrv_core.c |1 +
  drivers/pci/pcie/portdrv_pci.c |   12 +---
  2 files changed, 6 insertions(+), 7 deletions(-)

 diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
 b/drivers/pci/pcie/aer/aerdrv_core.c
 index 564d97f..c61fd44 100644
 --- a/drivers/pci/pcie/aer/aerdrv_core.c
 +++ b/drivers/pci/pcie/aer/aerdrv_core.c
 @@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void 
 *data)
 result_data = (struct aer_broadcast_data *) data;

 device_lock(dev-dev);
 +   dev-error_state = pci_channel_io_normal;

|The device's error_state might be pci_channel_io_frozen when we get
|here.  We haven't touched anything in the hardware yet.  What makes
|the device unfrozen now?  Did anything actually change as far as the
|hardware device is concerned?

[jzl] When the AER driver gets to invoking report_slot_reset(), it has already 
performed PCI slot reset. Currently, with PCIe, it is the PCIe link reset it 
has been performed. But with proper AER driver implementation, it should be 
either PCI slot hot reset or even fundamental reset that has already been 
performed. As such, after platform performing the PCI slot reset, it should 
move the device's error_state out of pci_channel_io_fronzen before calling 
report_slot_reset() to low level device drivers allows them to access the 
corresponding device's PCI function in preparation for recovery.


|I agree it looks like report_slot_reset() should be made more like
|eeh_report_reset().  I'm just wondering if the error_state should be
|changed *after* calling the .slot_reset() method instead of before.

[jzl] No, you should not set the error_state after calling the .slot_reset() 
method because the AER driver calling the low level driver's .slot_reset() 
method is to report that the platform PCI slot reset has been done and 
inform the low level

Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-20 Thread Linas Vepstas
I think Joe Liu has it right.  I'm going to top-post because things
are a bit tangled below.  I urge a review of
/Documentation/PCI/pci-error-recovery.txt, as that gives the details.

The intended sequence is that, after an error, the device driver gets
a shot at running some diagnostics  dumps, and then the pci
bridges/controllers/ports/links are reset (by platform code, viz. aer
in this case) to a state resembling a fresh power-on.  Then the
.slot_reset() callback is called on the device driver, to tell the
driver hey everything upstream is now working, go set yourself up for
normal ops.  Thus, in particular, one should have pdev-error_state =
pci_channel_io_normal; before .slot_reset() is called, and the PCI
config space contents should resemble a fresh power-on state (and
**NOT** some other saved state!)

If the device driver wanted to leave the card in a dead state, it had
several opportunities to say so, earlier in the callback sequence.  If
the driver wanted to fiddle with the card with the old PCI config
space, it already had a chance to do that too, before the
bridges/controllers/ports/links were fully reset by the platform.   By
the time that .slot_reset() is being called, both the platform and the
device driver are expecting smooth sailing ahead.

So, looking at the original patch, it seems reasonable.  My impression
is that maybe the AER driver had been doing not quite the right thing
for a long time.

-- Linas Vepstas

On 20 May 2013 09:38, Liu, Joseph joseph@emulex.com wrote:
 Bjorn,

 I have been working with the low level device drivers for supporting both EEH 
 and AER and were involved in working with Yanmin for verifying this AER 
 patch. Please see some of my responses to your questions inline, prefixed 
 with [jzl].


 Thanks,
 Joe Liu, Ph.D.
 Senior Principal Engineer
 Emulex Corporation

 -Original Message-
 From: Bjorn Helgaas [mailto:bhelg...@google.com]
 Sent: Friday, May 17, 2013 7:44 PM
 To: Zhang, LongX
 Cc: linasveps...@gmail.com; linux-...@vger.kernel.org; 
 linux-kernel@vger.kernel.org; yanmin_zh...@linux.intel.com; Liu, Joseph; 
 Rafael J. Wysocki
 Subject: Re: Subject : [ PATCH ] 
 pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

 [+cc Rafael because he knows about dev-state_saved]

 Sorry, I'm not very familiar with AER, so please excuse some naive
 questions below.

 On Fri, Apr 26, 2013 at 12:28 AM, Zhang, LongX longx.zh...@intel.com wrote:
 From: Zhang Long longx.zh...@intel.com

 Specific pci device drivers might have many functions to call
 pci_channel_offline to check device states. When slot_reset happens,
 drivers' slot_reset callback might call such functions and eventually
 abort the reset.

 |Where does this happen?  I looked at all the references to
 |dev-error_state and all the callers of pci_channel_offline(), and I
 |didn't see any in .slot_reset() methods.
 |
 |(There are *assignments* to dev-error_state in qlcnic_attach_func(),
 |qlge_io_slot_reset(), and qla2xxx_pci_slot_reset().  You might be able
 |to remove those assignments after this patch, but this patch wouldn't
 |really change anything for those paths.)

 [jzl] Although low level driver might not call pci_channel_offline() directly 
 in .slot_reset() method itself, it does not mean it will not call it during 
 the device error recovery execution of .slot_reset() method. The 
 .slot_reset() method needs to call some of its bottom layer routines 
 interfacing with device PCI interface for preparing the PCI device from 
 recovery of PCI error (EEH or AER). As a matter of fact, it seems that 
 qla2xxx_pci_slot_reset() routine's assignments to dev-error was an attempt 
 to work around this AER driver issue that this patch is trying to resolve. 
 From upstream kernel 3.9.2's qla2xxx_pci_slot_reset(), it has the assignment 
 and comment below:

 static pci_ers_result_t
 qla2xxx_pci_slot_reset(struct pci_dev *pdev)
 {
 pci_ers_result_t ret = PCI_ERS_RESULT_DISCONNECT;
 scsi_qla_host_t *base_vha = pci_get_drvdata(pdev);
 struct qla_hw_data *ha = base_vha-hw;
 struct rsp_que *rsp;
 int rc, retries = 10;

 ql_dbg(ql_dbg_aer, base_vha, 0x9004,
 Slot Reset.\n);

 /* Workaround: qla2xxx driver which access hardware earlier
  * needs error state to be pci_channel_io_online.
  * Otherwise mailbox command timesout.
  */
 pdev-error_state = pci_channel_io_normal;



 The patch resets pdev-error_state to pci_channel_io_normal at
 the begining of report_slot_reset.

 Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
 Signed-off-by: Zhang Long longx.zh...@intel.com
 ---
  drivers/pci/pcie/aer/aerdrv_core.c |1 +
  drivers/pci/pcie/portdrv_pci.c |   12 +---
  2 files changed, 6 insertions(+), 7 deletions(-)

 diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
 b/drivers/pci/pcie/aer/aerdrv_core.c
 index 564d97f..c61fd44 100644
 --- a/drivers/pci/pcie/aer/aerdrv_core.c

Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-20 Thread Bjorn Helgaas
On Fri, May 17, 2013 at 5:56 PM, Rafael J. Wysocki r...@sisk.pl wrote:
 On Friday, May 17, 2013 05:43:33 PM Bjorn Helgaas wrote:
 On Fri, Apr 26, 2013 at 12:28 AM, Zhang, LongX longx.zh...@intel.com wrote:

  +   /* restore cfg space for possible link reset at upstream */
  +   dev-state_saved = true;

 dev-state_saved == true means that the dev-saved_config_space
 contains valid data.  Why do we know that's the case here?  I see that
 pcie_portdrv_probe() calls pci_save_state() when we first claim the
 port, and I guess we're assuming the state saved then is still valid.
 But why do we need to actually set dev-state_saved here?  Shouldn't
 it be already set to true anyway?

 This is a dirty trick to make pci_restore_state(dev) always work here
 (because it checks dev-state_saved and does nothing if it isn't set).
 I suppose.

Yes, I did investigate enough to see that this is a dirty trick.  My
question is how we know it's safe to do this dirty trick.

  +   pci_restore_state(dev);
  +   pcie_portdrv_restore_config(dev);
  +   pci_enable_pcie_error_reporting(dev);
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-20 Thread Linas Vepstas
On 20 May 2013 12:21, Bjorn Helgaas bhelg...@google.com wrote:
 On Fri, May 17, 2013 at 5:56 PM, Rafael J. Wysocki r...@sisk.pl wrote:
 On Friday, May 17, 2013 05:43:33 PM Bjorn Helgaas wrote:
 On Fri, Apr 26, 2013 at 12:28 AM, Zhang, LongX longx.zh...@intel.com 
 wrote:

  +   /* restore cfg space for possible link reset at upstream */
  +   dev-state_saved = true;

 dev-state_saved == true means that the dev-saved_config_space
 contains valid data.  Why do we know that's the case here?  I see that
 pcie_portdrv_probe() calls pci_save_state() when we first claim the
 port, and I guess we're assuming the state saved then is still valid.

Yes, see below.

 But why do we need to actually set dev-state_saved here?  Shouldn't
 it be already set to true anyway?

 This is a dirty trick to make pci_restore_state(dev) always work here
 (because it checks dev-state_saved and does nothing if it isn't set).
 I suppose.

 Yes, I did investigate enough to see that this is a dirty trick.  My
 question is how we know it's safe to do this dirty trick.

  +   pci_restore_state(dev);
  +   pcie_portdrv_restore_config(dev);

Lets review what is supposed to happen:

-- poweron
-- BIOS/firmware/bootloader maybe fiddles with PCI config space
-- kernel fiddles with PCI config space during boot
-- device driver sets up PCI config space correctly for normal i/o
-- PCI error occurs
-- PCI port/link is reset

At this point, we want to set the PCI config space to something
resembling what it was just before the device driver first touched it
after poweron.  Why?  Because the device driver will typically set up
DMA segments, and tweak other settings as it initializes the card.  It
needs to do almost exactly the same things again, when re-initializing
the device after the error reset.  Thus, the PCI config needs to be
appropriate for a fresh initialization of the card.

If  some other variant of PCI config was loaded here, it might contain
incorrect DMA mappings or other settings.  In particular, while the
driver is initializing the card, you risk having the card run away and
start doing things based on these incorrect settings -- provoking
another error or maybe just silently corrupting data.   The config
that you want to load should be more-or-less the same one that was
there during kernel boot, before the device-driver started touching
things.

The dirty hack is weird:  either there's valid data, and the flag is
set, or the data is garbage and the flag isn't set ... how could it be
otherwise?

-- Linas
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-20 Thread Bjorn Helgaas
On Fri, Apr 26, 2013 at 06:28:59AM +, Zhang, LongX wrote:
 From: Zhang Long longx.zh...@intel.com
 
 Specific pci device drivers might have many functions to call
 pci_channel_offline to check device states. When slot_reset happens,
 drivers' slot_reset callback might call such functions and eventually
 abort the reset.
 
 The patch resets pdev-error_state to pci_channel_io_normal at
 the begining of report_slot_reset.
 
 Thank Liu Joseph for pointing it out.
 
 Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
 Signed-off-by: Zhang Long longx.zh...@intel.com
 ---
  drivers/pci/pcie/aer/aerdrv_core.c |1 +
  drivers/pci/pcie/portdrv_pci.c |   12 +---
  2 files changed, 6 insertions(+), 7 deletions(-)
 
 diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
 b/drivers/pci/pcie/aer/aerdrv_core.c
 index 564d97f..c61fd44 100644
 --- a/drivers/pci/pcie/aer/aerdrv_core.c
 +++ b/drivers/pci/pcie/aer/aerdrv_core.c
 @@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void 
 *data)
   result_data = (struct aer_broadcast_data *) data;
  
   device_lock(dev-dev);
 + dev-error_state = pci_channel_io_normal;
   if (!dev-driver ||
   !dev-driver-err_handler ||
   !dev-driver-err_handler-slot_reset)
 diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
 index ed4d094..7abefd9 100644
 --- a/drivers/pci/pcie/portdrv_pci.c
 +++ b/drivers/pci/pcie/portdrv_pci.c
 @@ -332,13 +332,11 @@ static pci_ers_result_t pcie_portdrv_slot_reset(struct 
 pci_dev *dev)
   pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED;
   int retval;
  
 - /* If fatal, restore cfg space for possible link reset at upstream */
 - if (dev-error_state == pci_channel_io_frozen) {
 - dev-state_saved = true;
 - pci_restore_state(dev);
 - pcie_portdrv_restore_config(dev);
 - pci_enable_pcie_error_reporting(dev);
 - }
 + /* restore cfg space for possible link reset at upstream */
 + dev-state_saved = true;
 + pci_restore_state(dev);
 + pcie_portdrv_restore_config(dev);
 + pci_enable_pcie_error_reporting(dev);
  
   /* get true return value from status */
   retval = device_for_each_child(dev-dev, status, slot_reset_iter);

I think this patch changes the behavior in the case of a non-fatal error
where one of the .error_detected() methods returned
PCI_ERS_RESULT_NEED_RESET.  In that case, pcie_portdrv_slot_reset()
previously did not restore config space, but after your patch, it *will*
restore it.  We need an explanation of why this is safe.

I think you should split this into two patches: the first would remove the
if (dev-error_state == pci_channel_io_frozen) test from portdrv_pci.c
and explain the reason, and the second would make the aerdrv_core.c change.

I'm also concerned that in that same case (a non-fatal error where one of
the .error_detected() methods returned PCI_ERS_RESULT_NEED_RESET), I don't
think we actually *do* any kind of device reset.  This isn't related to
your patch, of course, so if you resolve the config space restore question,
we can deal with the reset question later.

Bjorn
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-17 Thread Rafael J. Wysocki
On Friday, May 17, 2013 05:43:33 PM Bjorn Helgaas wrote:
> [+cc Rafael because he knows about dev->state_saved]
> 
> Sorry, I'm not very familiar with AER, so please excuse some naive
> questions below.
> 
> On Fri, Apr 26, 2013 at 12:28 AM, Zhang, LongX  wrote:
> > From: Zhang Long 
> >
> > Specific pci device drivers might have many functions to call
> > pci_channel_offline to check device states. When slot_reset happens,
> > drivers' slot_reset callback might call such functions and eventually
> > abort the reset.
> 
> Where does this happen?  I looked at all the references to
> dev->error_state and all the callers of pci_channel_offline(), and I
> didn't see any in .slot_reset() methods.
> 
> (There are *assignments* to dev->error_state in qlcnic_attach_func(),
> qlge_io_slot_reset(), and qla2xxx_pci_slot_reset().  You might be able
> to remove those assignments after this patch, but this patch wouldn't
> really change anything for those paths.)
> 
> > The patch resets pdev->error_state to pci_channel_io_normal at
> > the begining of report_slot_reset.
> 
> > Signed-off-by: Zhang Yanmin 
> > Signed-off-by: Zhang Long 
> > ---
> >  drivers/pci/pcie/aer/aerdrv_core.c |1 +
> >  drivers/pci/pcie/portdrv_pci.c |   12 +---
> >  2 files changed, 6 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
> > b/drivers/pci/pcie/aer/aerdrv_core.c
> > index 564d97f..c61fd44 100644
> > --- a/drivers/pci/pcie/aer/aerdrv_core.c
> > +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> > @@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void 
> > *data)
> > result_data = (struct aer_broadcast_data *) data;
> >
> > device_lock(>dev);
> > +   dev->error_state = pci_channel_io_normal;
> 
> The device's error_state might be pci_channel_io_frozen when we get
> here.  We haven't touched anything in the hardware yet.  What makes
> the device unfrozen now?  Did anything actually change as far as the
> hardware device is concerned?
> 
> I agree it looks like report_slot_reset() should be made more like
> eeh_report_reset().  I'm just wondering if the error_state should be
> changed *after* calling the .slot_reset() method instead of before.
> 
> > if (!dev->driver ||
> > !dev->driver->err_handler ||
> > !dev->driver->err_handler->slot_reset)
> > diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
> > index ed4d094..7abefd9 100644
> > --- a/drivers/pci/pcie/portdrv_pci.c
> > +++ b/drivers/pci/pcie/portdrv_pci.c
> > @@ -332,13 +332,11 @@ static pci_ers_result_t 
> > pcie_portdrv_slot_reset(struct pci_dev *dev)
> > pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED;
> > int retval;
> >
> > -   /* If fatal, restore cfg space for possible link reset at upstream 
> > */
> > -   if (dev->error_state == pci_channel_io_frozen) {
> > -   dev->state_saved = true;
> > -   pci_restore_state(dev);
> > -   pcie_portdrv_restore_config(dev);
> > -   pci_enable_pcie_error_reporting(dev);
> > -   }
> 
> Previously we only restored state for the pci_channel_io_frozen state,
> i.e., when handling an AER_FATAL error.  Now we restore it always.
> Why?
> 
> > +   /* restore cfg space for possible link reset at upstream */
> > +   dev->state_saved = true;
> 
> "dev->state_saved == true" means that the dev->saved_config_space
> contains valid data.  Why do we know that's the case here?  I see that
> pcie_portdrv_probe() calls pci_save_state() when we first claim the
> port, and I guess we're assuming the state saved then is still valid.
> But why do we need to actually set dev->state_saved here?  Shouldn't
> it be already set to true anyway?

This is a dirty trick to make pci_restore_state(dev) always work here
(because it checks dev->state_saved and does nothing if it isn't set).
I suppose.

> > +   pci_restore_state(dev);
> > +   pcie_portdrv_restore_config(dev);
> > +   pci_enable_pcie_error_reporting(dev);
> >
> > /* get true return value from  */
> > retval = device_for_each_child(>dev, , slot_reset_iter);
> > --

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-17 Thread Bjorn Helgaas
[+cc Rafael because he knows about dev->state_saved]

Sorry, I'm not very familiar with AER, so please excuse some naive
questions below.

On Fri, Apr 26, 2013 at 12:28 AM, Zhang, LongX  wrote:
> From: Zhang Long 
>
> Specific pci device drivers might have many functions to call
> pci_channel_offline to check device states. When slot_reset happens,
> drivers' slot_reset callback might call such functions and eventually
> abort the reset.

Where does this happen?  I looked at all the references to
dev->error_state and all the callers of pci_channel_offline(), and I
didn't see any in .slot_reset() methods.

(There are *assignments* to dev->error_state in qlcnic_attach_func(),
qlge_io_slot_reset(), and qla2xxx_pci_slot_reset().  You might be able
to remove those assignments after this patch, but this patch wouldn't
really change anything for those paths.)

> The patch resets pdev->error_state to pci_channel_io_normal at
> the begining of report_slot_reset.

> Signed-off-by: Zhang Yanmin 
> Signed-off-by: Zhang Long 
> ---
>  drivers/pci/pcie/aer/aerdrv_core.c |1 +
>  drivers/pci/pcie/portdrv_pci.c |   12 +---
>  2 files changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
> b/drivers/pci/pcie/aer/aerdrv_core.c
> index 564d97f..c61fd44 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void 
> *data)
> result_data = (struct aer_broadcast_data *) data;
>
> device_lock(>dev);
> +   dev->error_state = pci_channel_io_normal;

The device's error_state might be pci_channel_io_frozen when we get
here.  We haven't touched anything in the hardware yet.  What makes
the device unfrozen now?  Did anything actually change as far as the
hardware device is concerned?

I agree it looks like report_slot_reset() should be made more like
eeh_report_reset().  I'm just wondering if the error_state should be
changed *after* calling the .slot_reset() method instead of before.

> if (!dev->driver ||
> !dev->driver->err_handler ||
> !dev->driver->err_handler->slot_reset)
> diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
> index ed4d094..7abefd9 100644
> --- a/drivers/pci/pcie/portdrv_pci.c
> +++ b/drivers/pci/pcie/portdrv_pci.c
> @@ -332,13 +332,11 @@ static pci_ers_result_t pcie_portdrv_slot_reset(struct 
> pci_dev *dev)
> pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED;
> int retval;
>
> -   /* If fatal, restore cfg space for possible link reset at upstream */
> -   if (dev->error_state == pci_channel_io_frozen) {
> -   dev->state_saved = true;
> -   pci_restore_state(dev);
> -   pcie_portdrv_restore_config(dev);
> -   pci_enable_pcie_error_reporting(dev);
> -   }

Previously we only restored state for the pci_channel_io_frozen state,
i.e., when handling an AER_FATAL error.  Now we restore it always.
Why?

> +   /* restore cfg space for possible link reset at upstream */
> +   dev->state_saved = true;

"dev->state_saved == true" means that the dev->saved_config_space
contains valid data.  Why do we know that's the case here?  I see that
pcie_portdrv_probe() calls pci_save_state() when we first claim the
port, and I guess we're assuming the state saved then is still valid.
But why do we need to actually set dev->state_saved here?  Shouldn't
it be already set to true anyway?

> +   pci_restore_state(dev);
> +   pcie_portdrv_restore_config(dev);
> +   pci_enable_pcie_error_reporting(dev);
>
> /* get true return value from  */
> retval = device_for_each_child(>dev, , slot_reset_iter);
> --
> 1.7.4.1
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-17 Thread Bjorn Helgaas
[+cc Rafael because he knows about dev-state_saved]

Sorry, I'm not very familiar with AER, so please excuse some naive
questions below.

On Fri, Apr 26, 2013 at 12:28 AM, Zhang, LongX longx.zh...@intel.com wrote:
 From: Zhang Long longx.zh...@intel.com

 Specific pci device drivers might have many functions to call
 pci_channel_offline to check device states. When slot_reset happens,
 drivers' slot_reset callback might call such functions and eventually
 abort the reset.

Where does this happen?  I looked at all the references to
dev-error_state and all the callers of pci_channel_offline(), and I
didn't see any in .slot_reset() methods.

(There are *assignments* to dev-error_state in qlcnic_attach_func(),
qlge_io_slot_reset(), and qla2xxx_pci_slot_reset().  You might be able
to remove those assignments after this patch, but this patch wouldn't
really change anything for those paths.)

 The patch resets pdev-error_state to pci_channel_io_normal at
 the begining of report_slot_reset.

 Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
 Signed-off-by: Zhang Long longx.zh...@intel.com
 ---
  drivers/pci/pcie/aer/aerdrv_core.c |1 +
  drivers/pci/pcie/portdrv_pci.c |   12 +---
  2 files changed, 6 insertions(+), 7 deletions(-)

 diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
 b/drivers/pci/pcie/aer/aerdrv_core.c
 index 564d97f..c61fd44 100644
 --- a/drivers/pci/pcie/aer/aerdrv_core.c
 +++ b/drivers/pci/pcie/aer/aerdrv_core.c
 @@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void 
 *data)
 result_data = (struct aer_broadcast_data *) data;

 device_lock(dev-dev);
 +   dev-error_state = pci_channel_io_normal;

The device's error_state might be pci_channel_io_frozen when we get
here.  We haven't touched anything in the hardware yet.  What makes
the device unfrozen now?  Did anything actually change as far as the
hardware device is concerned?

I agree it looks like report_slot_reset() should be made more like
eeh_report_reset().  I'm just wondering if the error_state should be
changed *after* calling the .slot_reset() method instead of before.

 if (!dev-driver ||
 !dev-driver-err_handler ||
 !dev-driver-err_handler-slot_reset)
 diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
 index ed4d094..7abefd9 100644
 --- a/drivers/pci/pcie/portdrv_pci.c
 +++ b/drivers/pci/pcie/portdrv_pci.c
 @@ -332,13 +332,11 @@ static pci_ers_result_t pcie_portdrv_slot_reset(struct 
 pci_dev *dev)
 pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED;
 int retval;

 -   /* If fatal, restore cfg space for possible link reset at upstream */
 -   if (dev-error_state == pci_channel_io_frozen) {
 -   dev-state_saved = true;
 -   pci_restore_state(dev);
 -   pcie_portdrv_restore_config(dev);
 -   pci_enable_pcie_error_reporting(dev);
 -   }

Previously we only restored state for the pci_channel_io_frozen state,
i.e., when handling an AER_FATAL error.  Now we restore it always.
Why?

 +   /* restore cfg space for possible link reset at upstream */
 +   dev-state_saved = true;

dev-state_saved == true means that the dev-saved_config_space
contains valid data.  Why do we know that's the case here?  I see that
pcie_portdrv_probe() calls pci_save_state() when we first claim the
port, and I guess we're assuming the state saved then is still valid.
But why do we need to actually set dev-state_saved here?  Shouldn't
it be already set to true anyway?

 +   pci_restore_state(dev);
 +   pcie_portdrv_restore_config(dev);
 +   pci_enable_pcie_error_reporting(dev);

 /* get true return value from status */
 retval = device_for_each_child(dev-dev, status, slot_reset_iter);
 --
 1.7.4.1



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-17 Thread Rafael J. Wysocki
On Friday, May 17, 2013 05:43:33 PM Bjorn Helgaas wrote:
 [+cc Rafael because he knows about dev-state_saved]
 
 Sorry, I'm not very familiar with AER, so please excuse some naive
 questions below.
 
 On Fri, Apr 26, 2013 at 12:28 AM, Zhang, LongX longx.zh...@intel.com wrote:
  From: Zhang Long longx.zh...@intel.com
 
  Specific pci device drivers might have many functions to call
  pci_channel_offline to check device states. When slot_reset happens,
  drivers' slot_reset callback might call such functions and eventually
  abort the reset.
 
 Where does this happen?  I looked at all the references to
 dev-error_state and all the callers of pci_channel_offline(), and I
 didn't see any in .slot_reset() methods.
 
 (There are *assignments* to dev-error_state in qlcnic_attach_func(),
 qlge_io_slot_reset(), and qla2xxx_pci_slot_reset().  You might be able
 to remove those assignments after this patch, but this patch wouldn't
 really change anything for those paths.)
 
  The patch resets pdev-error_state to pci_channel_io_normal at
  the begining of report_slot_reset.
 
  Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
  Signed-off-by: Zhang Long longx.zh...@intel.com
  ---
   drivers/pci/pcie/aer/aerdrv_core.c |1 +
   drivers/pci/pcie/portdrv_pci.c |   12 +---
   2 files changed, 6 insertions(+), 7 deletions(-)
 
  diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
  b/drivers/pci/pcie/aer/aerdrv_core.c
  index 564d97f..c61fd44 100644
  --- a/drivers/pci/pcie/aer/aerdrv_core.c
  +++ b/drivers/pci/pcie/aer/aerdrv_core.c
  @@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void 
  *data)
  result_data = (struct aer_broadcast_data *) data;
 
  device_lock(dev-dev);
  +   dev-error_state = pci_channel_io_normal;
 
 The device's error_state might be pci_channel_io_frozen when we get
 here.  We haven't touched anything in the hardware yet.  What makes
 the device unfrozen now?  Did anything actually change as far as the
 hardware device is concerned?
 
 I agree it looks like report_slot_reset() should be made more like
 eeh_report_reset().  I'm just wondering if the error_state should be
 changed *after* calling the .slot_reset() method instead of before.
 
  if (!dev-driver ||
  !dev-driver-err_handler ||
  !dev-driver-err_handler-slot_reset)
  diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
  index ed4d094..7abefd9 100644
  --- a/drivers/pci/pcie/portdrv_pci.c
  +++ b/drivers/pci/pcie/portdrv_pci.c
  @@ -332,13 +332,11 @@ static pci_ers_result_t 
  pcie_portdrv_slot_reset(struct pci_dev *dev)
  pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED;
  int retval;
 
  -   /* If fatal, restore cfg space for possible link reset at upstream 
  */
  -   if (dev-error_state == pci_channel_io_frozen) {
  -   dev-state_saved = true;
  -   pci_restore_state(dev);
  -   pcie_portdrv_restore_config(dev);
  -   pci_enable_pcie_error_reporting(dev);
  -   }
 
 Previously we only restored state for the pci_channel_io_frozen state,
 i.e., when handling an AER_FATAL error.  Now we restore it always.
 Why?
 
  +   /* restore cfg space for possible link reset at upstream */
  +   dev-state_saved = true;
 
 dev-state_saved == true means that the dev-saved_config_space
 contains valid data.  Why do we know that's the case here?  I see that
 pcie_portdrv_probe() calls pci_save_state() when we first claim the
 port, and I guess we're assuming the state saved then is still valid.
 But why do we need to actually set dev-state_saved here?  Shouldn't
 it be already set to true anyway?

This is a dirty trick to make pci_restore_state(dev) always work here
(because it checks dev-state_saved and does nothing if it isn't set).
I suppose.

  +   pci_restore_state(dev);
  +   pcie_portdrv_restore_config(dev);
  +   pci_enable_pcie_error_reporting(dev);
 
  /* get true return value from status */
  retval = device_for_each_child(dev-dev, status, slot_reset_iter);
  --

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-06 Thread Yanmin Zhang
On Fri, 2013-05-03 at 13:00 -0500, Linas Vepstas wrote:
> Greg,
> 
> I've been receiving (and reading!) these messages; I replied that I am
> not maintaining a tree, and have no way of testing these patches (no
> access to hardware.)  I believe it unlikely that my situation will
> change, and so I should probably be removed from the maintainers file.
> 
> No acked-by or anything; the patches are not "obviously correct" by
> visual inspection; they may be right, but would require deeper
> thinking about what is actually happening than I'm capable of at this
> time; I'm a bit rusty with this code base, and might miss something
> important
powerpc uses the similar method. See function eeh_report_reset.
We worked out the patch after checking powerpc codes.

Joseph Liu helped test the patch and the patch does work well.

> .
> 
> -- Linas
> 
> p.s. you didn't see my earlier reply because I forgot to hit 'plain text 
> reply'
> 
> 
> On 2 May 2013 22:13, Yanmin Zhang  wrote:
> >
> > On Thu, 2013-05-02 at 19:00 -0700, Greg Kroah-Hartman wrote:
> > > On Fri, May 03, 2013 at 08:33:00AM +0800, Yanmin Zhang wrote:
> > > > On Wed, 2013-05-01 at 20:20 -0500, Linas Vepstas wrote:
> > > > > Hi,
> > > > >
> > > > > On 1 May 2013 19:30, Yanmin Zhang 
> > > > > wrote:
> > > > > On Fri, 2013-04-26 at 06:28 +, Zhang, LongX wrote:
> > > > > > From: Zhang Long 
> > > > > >
> > > > > > Specific pci device drivers might have many functions to
> > > > > call
> > > > > > pci_channel_offline to check device states. When slot_reset
> > > > > happens,
> > > > > > drivers' slot_reset callback might call such functions and
> > > > > eventually
> > > > > > abort the reset.
> > > > > >
> > > > > > The patch resets pdev->error_state to pci_channel_io_normal
> > > > > at
> > > > > > the begining of report_slot_reset.
> > > > > >
> > > > > > Thank Liu Joseph for pointing it out.
> > > > >
> > > > > Linas, Bjorn,
> > > > >
> > > > > Would you like to merge the patch to your testing tree?
> > > > > The patch resolves one issue pointed out by Joseph.
> > > > >
> > > > >
> > > > > I'm not maintaining a tree at this time, and am not able to test.
> > > > > Sorry.
> > > > Thanks Linas.
> > > >
> > > > Greg,
> > > >
> > > > Would you like to merge it into your testing tree? Joseph Liu tested
> > > > the patch and it does work.
> > >
> > > You do know about the scripts/get_maintainer.pl script, right?
> > >
> > > Hint, try it out :)
> > Greg,
> >
> > Thanks. We did send to the right people, Linas and Bjorn.
> >
> > scripts/get_maintainer.pl 
> > 0001-pci-reset-error_state-to-pci_channel_io_normal-at-re.patch
> > Bjorn Helgaas  (supporter:PCI 
> > SUBSYSTEM,commit_signer:7/8=88%,commit_signer:10/11=91%)
> > Linas Vepstas  (commit_signer:2/8=25%)
> > Yijing Wang  
> > (commit_signer:2/8=25%,commit_signer:2/11=18%)
> > Jiang Liu  
> > (commit_signer:2/8=25%,commit_signer:2/11=18%)
> > Stephen Hemminger  (commit_signer:1/8=12%)
> > "Rafael J. Wysocki"  (commit_signer:6/11=55%)
> > Huang Ying  (commit_signer:5/11=45%)
> > linux-...@vger.kernel.org (open list:PCI SUBSYSTEM)
> > linux-kernel@vger.kernel.org (open list)
> >
> >
> > I remember Jesse was maintaining PCI subsystem and he responded quickly.
> >
> > Yanmin
> >
> >


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-06 Thread Yanmin Zhang
On Fri, 2013-05-03 at 13:00 -0500, Linas Vepstas wrote:
 Greg,
 
 I've been receiving (and reading!) these messages; I replied that I am
 not maintaining a tree, and have no way of testing these patches (no
 access to hardware.)  I believe it unlikely that my situation will
 change, and so I should probably be removed from the maintainers file.
 
 No acked-by or anything; the patches are not obviously correct by
 visual inspection; they may be right, but would require deeper
 thinking about what is actually happening than I'm capable of at this
 time; I'm a bit rusty with this code base, and might miss something
 important
powerpc uses the similar method. See function eeh_report_reset.
We worked out the patch after checking powerpc codes.

Joseph Liu helped test the patch and the patch does work well.

 .
 
 -- Linas
 
 p.s. you didn't see my earlier reply because I forgot to hit 'plain text 
 reply'
 
 
 On 2 May 2013 22:13, Yanmin Zhang yanmin_zh...@linux.intel.com wrote:
 
  On Thu, 2013-05-02 at 19:00 -0700, Greg Kroah-Hartman wrote:
   On Fri, May 03, 2013 at 08:33:00AM +0800, Yanmin Zhang wrote:
On Wed, 2013-05-01 at 20:20 -0500, Linas Vepstas wrote:
 Hi,

 On 1 May 2013 19:30, Yanmin Zhang yanmin_zh...@linux.intel.com
 wrote:
 On Fri, 2013-04-26 at 06:28 +, Zhang, LongX wrote:
  From: Zhang Long longx.zh...@intel.com
 
  Specific pci device drivers might have many functions to
 call
  pci_channel_offline to check device states. When slot_reset
 happens,
  drivers' slot_reset callback might call such functions and
 eventually
  abort the reset.
 
  The patch resets pdev-error_state to pci_channel_io_normal
 at
  the begining of report_slot_reset.
 
  Thank Liu Joseph for pointing it out.

 Linas, Bjorn,

 Would you like to merge the patch to your testing tree?
 The patch resolves one issue pointed out by Joseph.


 I'm not maintaining a tree at this time, and am not able to test.
 Sorry.
Thanks Linas.
   
Greg,
   
Would you like to merge it into your testing tree? Joseph Liu tested
the patch and it does work.
  
   You do know about the scripts/get_maintainer.pl script, right?
  
   Hint, try it out :)
  Greg,
 
  Thanks. We did send to the right people, Linas and Bjorn.
 
  scripts/get_maintainer.pl 
  0001-pci-reset-error_state-to-pci_channel_io_normal-at-re.patch
  Bjorn Helgaas bhelg...@google.com (supporter:PCI 
  SUBSYSTEM,commit_signer:7/8=88%,commit_signer:10/11=91%)
  Linas Vepstas linasveps...@gmail.com (commit_signer:2/8=25%)
  Yijing Wang wangyij...@huawei.com 
  (commit_signer:2/8=25%,commit_signer:2/11=18%)
  Jiang Liu jiang@huawei.com 
  (commit_signer:2/8=25%,commit_signer:2/11=18%)
  Stephen Hemminger shemmin...@vyatta.com (commit_signer:1/8=12%)
  Rafael J. Wysocki rafael.j.wyso...@intel.com (commit_signer:6/11=55%)
  Huang Ying ying.hu...@intel.com (commit_signer:5/11=45%)
  linux-...@vger.kernel.org (open list:PCI SUBSYSTEM)
  linux-kernel@vger.kernel.org (open list)
 
 
  I remember Jesse was maintaining PCI subsystem and he responded quickly.
 
  Yanmin
 
 


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-03 Thread Linas Vepstas
Greg,

I've been receiving (and reading!) these messages; I replied that I am
not maintaining a tree, and have no way of testing these patches (no
access to hardware.)  I believe it unlikely that my situation will
change, and so I should probably be removed from the maintainers file.

No acked-by or anything; the patches are not "obviously correct" by
visual inspection; they may be right, but would require deeper
thinking about what is actually happening than I'm capable of at this
time; I'm a bit rusty with this code base, and might miss something
important.

-- Linas

p.s. you didn't see my earlier reply because I forgot to hit 'plain text reply'


On 2 May 2013 22:13, Yanmin Zhang  wrote:
>
> On Thu, 2013-05-02 at 19:00 -0700, Greg Kroah-Hartman wrote:
> > On Fri, May 03, 2013 at 08:33:00AM +0800, Yanmin Zhang wrote:
> > > On Wed, 2013-05-01 at 20:20 -0500, Linas Vepstas wrote:
> > > > Hi,
> > > >
> > > > On 1 May 2013 19:30, Yanmin Zhang 
> > > > wrote:
> > > > On Fri, 2013-04-26 at 06:28 +, Zhang, LongX wrote:
> > > > > From: Zhang Long 
> > > > >
> > > > > Specific pci device drivers might have many functions to
> > > > call
> > > > > pci_channel_offline to check device states. When slot_reset
> > > > happens,
> > > > > drivers' slot_reset callback might call such functions and
> > > > eventually
> > > > > abort the reset.
> > > > >
> > > > > The patch resets pdev->error_state to pci_channel_io_normal
> > > > at
> > > > > the begining of report_slot_reset.
> > > > >
> > > > > Thank Liu Joseph for pointing it out.
> > > >
> > > > Linas, Bjorn,
> > > >
> > > > Would you like to merge the patch to your testing tree?
> > > > The patch resolves one issue pointed out by Joseph.
> > > >
> > > >
> > > > I'm not maintaining a tree at this time, and am not able to test.
> > > > Sorry.
> > > Thanks Linas.
> > >
> > > Greg,
> > >
> > > Would you like to merge it into your testing tree? Joseph Liu tested
> > > the patch and it does work.
> >
> > You do know about the scripts/get_maintainer.pl script, right?
> >
> > Hint, try it out :)
> Greg,
>
> Thanks. We did send to the right people, Linas and Bjorn.
>
> scripts/get_maintainer.pl 
> 0001-pci-reset-error_state-to-pci_channel_io_normal-at-re.patch
> Bjorn Helgaas  (supporter:PCI 
> SUBSYSTEM,commit_signer:7/8=88%,commit_signer:10/11=91%)
> Linas Vepstas  (commit_signer:2/8=25%)
> Yijing Wang  
> (commit_signer:2/8=25%,commit_signer:2/11=18%)
> Jiang Liu  
> (commit_signer:2/8=25%,commit_signer:2/11=18%)
> Stephen Hemminger  (commit_signer:1/8=12%)
> "Rafael J. Wysocki"  (commit_signer:6/11=55%)
> Huang Ying  (commit_signer:5/11=45%)
> linux-...@vger.kernel.org (open list:PCI SUBSYSTEM)
> linux-kernel@vger.kernel.org (open list)
>
>
> I remember Jesse was maintaining PCI subsystem and he responded quickly.
>
> Yanmin
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-03 Thread Linas Vepstas
Greg,

I've been receiving (and reading!) these messages; I replied that I am
not maintaining a tree, and have no way of testing these patches (no
access to hardware.)  I believe it unlikely that my situation will
change, and so I should probably be removed from the maintainers file.

No acked-by or anything; the patches are not obviously correct by
visual inspection; they may be right, but would require deeper
thinking about what is actually happening than I'm capable of at this
time; I'm a bit rusty with this code base, and might miss something
important.

-- Linas

p.s. you didn't see my earlier reply because I forgot to hit 'plain text reply'


On 2 May 2013 22:13, Yanmin Zhang yanmin_zh...@linux.intel.com wrote:

 On Thu, 2013-05-02 at 19:00 -0700, Greg Kroah-Hartman wrote:
  On Fri, May 03, 2013 at 08:33:00AM +0800, Yanmin Zhang wrote:
   On Wed, 2013-05-01 at 20:20 -0500, Linas Vepstas wrote:
Hi,
   
On 1 May 2013 19:30, Yanmin Zhang yanmin_zh...@linux.intel.com
wrote:
On Fri, 2013-04-26 at 06:28 +, Zhang, LongX wrote:
 From: Zhang Long longx.zh...@intel.com

 Specific pci device drivers might have many functions to
call
 pci_channel_offline to check device states. When slot_reset
happens,
 drivers' slot_reset callback might call such functions and
eventually
 abort the reset.

 The patch resets pdev-error_state to pci_channel_io_normal
at
 the begining of report_slot_reset.

 Thank Liu Joseph for pointing it out.
   
Linas, Bjorn,
   
Would you like to merge the patch to your testing tree?
The patch resolves one issue pointed out by Joseph.
   
   
I'm not maintaining a tree at this time, and am not able to test.
Sorry.
   Thanks Linas.
  
   Greg,
  
   Would you like to merge it into your testing tree? Joseph Liu tested
   the patch and it does work.
 
  You do know about the scripts/get_maintainer.pl script, right?
 
  Hint, try it out :)
 Greg,

 Thanks. We did send to the right people, Linas and Bjorn.

 scripts/get_maintainer.pl 
 0001-pci-reset-error_state-to-pci_channel_io_normal-at-re.patch
 Bjorn Helgaas bhelg...@google.com (supporter:PCI 
 SUBSYSTEM,commit_signer:7/8=88%,commit_signer:10/11=91%)
 Linas Vepstas linasveps...@gmail.com (commit_signer:2/8=25%)
 Yijing Wang wangyij...@huawei.com 
 (commit_signer:2/8=25%,commit_signer:2/11=18%)
 Jiang Liu jiang@huawei.com 
 (commit_signer:2/8=25%,commit_signer:2/11=18%)
 Stephen Hemminger shemmin...@vyatta.com (commit_signer:1/8=12%)
 Rafael J. Wysocki rafael.j.wyso...@intel.com (commit_signer:6/11=55%)
 Huang Ying ying.hu...@intel.com (commit_signer:5/11=45%)
 linux-...@vger.kernel.org (open list:PCI SUBSYSTEM)
 linux-kernel@vger.kernel.org (open list)


 I remember Jesse was maintaining PCI subsystem and he responded quickly.

 Yanmin


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-02 Thread Yanmin Zhang
On Thu, 2013-05-02 at 19:00 -0700, Greg Kroah-Hartman wrote:
> On Fri, May 03, 2013 at 08:33:00AM +0800, Yanmin Zhang wrote:
> > On Wed, 2013-05-01 at 20:20 -0500, Linas Vepstas wrote:
> > > Hi,
> > > 
> > > On 1 May 2013 19:30, Yanmin Zhang 
> > > wrote:
> > > On Fri, 2013-04-26 at 06:28 +, Zhang, LongX wrote:
> > > > From: Zhang Long 
> > > >
> > > > Specific pci device drivers might have many functions to
> > > call
> > > > pci_channel_offline to check device states. When slot_reset
> > > happens,
> > > > drivers' slot_reset callback might call such functions and
> > > eventually
> > > > abort the reset.
> > > >
> > > > The patch resets pdev->error_state to pci_channel_io_normal
> > > at
> > > > the begining of report_slot_reset.
> > > >
> > > > Thank Liu Joseph for pointing it out.
> > > 
> > > Linas, Bjorn,
> > > 
> > > Would you like to merge the patch to your testing tree?
> > > The patch resolves one issue pointed out by Joseph.
> > > 
> > > 
> > > I'm not maintaining a tree at this time, and am not able to test.
> > > Sorry.
> > Thanks Linas.
> > 
> > Greg,
> > 
> > Would you like to merge it into your testing tree? Joseph Liu tested
> > the patch and it does work.
> 
> You do know about the scripts/get_maintainer.pl script, right?
> 
> Hint, try it out :)
Greg,

Thanks. We did send to the right people, Linas and Bjorn.

scripts/get_maintainer.pl 
0001-pci-reset-error_state-to-pci_channel_io_normal-at-re.patch
Bjorn Helgaas  (supporter:PCI 
SUBSYSTEM,commit_signer:7/8=88%,commit_signer:10/11=91%)
Linas Vepstas  (commit_signer:2/8=25%)
Yijing Wang  
(commit_signer:2/8=25%,commit_signer:2/11=18%)
Jiang Liu  (commit_signer:2/8=25%,commit_signer:2/11=18%)
Stephen Hemminger  (commit_signer:1/8=12%)
"Rafael J. Wysocki"  (commit_signer:6/11=55%)
Huang Ying  (commit_signer:5/11=45%)
linux-...@vger.kernel.org (open list:PCI SUBSYSTEM)
linux-kernel@vger.kernel.org (open list)


I remember Jesse was maintaining PCI subsystem and he responded quickly.

Yanmin


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-02 Thread Greg Kroah-Hartman
On Fri, May 03, 2013 at 08:33:00AM +0800, Yanmin Zhang wrote:
> On Wed, 2013-05-01 at 20:20 -0500, Linas Vepstas wrote:
> > Hi,
> > 
> > On 1 May 2013 19:30, Yanmin Zhang 
> > wrote:
> > On Fri, 2013-04-26 at 06:28 +, Zhang, LongX wrote:
> > > From: Zhang Long 
> > >
> > > Specific pci device drivers might have many functions to
> > call
> > > pci_channel_offline to check device states. When slot_reset
> > happens,
> > > drivers' slot_reset callback might call such functions and
> > eventually
> > > abort the reset.
> > >
> > > The patch resets pdev->error_state to pci_channel_io_normal
> > at
> > > the begining of report_slot_reset.
> > >
> > > Thank Liu Joseph for pointing it out.
> > 
> > Linas, Bjorn,
> > 
> > Would you like to merge the patch to your testing tree?
> > The patch resolves one issue pointed out by Joseph.
> > 
> > 
> > I'm not maintaining a tree at this time, and am not able to test.
> > Sorry.
> Thanks Linas.
> 
> Greg,
> 
> Would you like to merge it into your testing tree? Joseph Liu tested
> the patch and it does work.

You do know about the scripts/get_maintainer.pl script, right?

Hint, try it out :)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-02 Thread Yanmin Zhang
On Wed, 2013-05-01 at 20:20 -0500, Linas Vepstas wrote:
> Hi,
> 
> On 1 May 2013 19:30, Yanmin Zhang 
> wrote:
> On Fri, 2013-04-26 at 06:28 +, Zhang, LongX wrote:
> > From: Zhang Long 
> >
> > Specific pci device drivers might have many functions to
> call
> > pci_channel_offline to check device states. When slot_reset
> happens,
> > drivers' slot_reset callback might call such functions and
> eventually
> > abort the reset.
> >
> > The patch resets pdev->error_state to pci_channel_io_normal
> at
> > the begining of report_slot_reset.
> >
> > Thank Liu Joseph for pointing it out.
> 
> Linas, Bjorn,
> 
> Would you like to merge the patch to your testing tree?
> The patch resolves one issue pointed out by Joseph.
> 
> 
> I'm not maintaining a tree at this time, and am not able to test.
> Sorry.
Thanks Linas.

Greg,

Would you like to merge it into your testing tree? Joseph Liu tested
the patch and it does work.

Thanks,
Yanmin



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-02 Thread Yanmin Zhang
On Wed, 2013-05-01 at 20:20 -0500, Linas Vepstas wrote:
 Hi,
 
 On 1 May 2013 19:30, Yanmin Zhang yanmin_zh...@linux.intel.com
 wrote:
 On Fri, 2013-04-26 at 06:28 +, Zhang, LongX wrote:
  From: Zhang Long longx.zh...@intel.com
 
  Specific pci device drivers might have many functions to
 call
  pci_channel_offline to check device states. When slot_reset
 happens,
  drivers' slot_reset callback might call such functions and
 eventually
  abort the reset.
 
  The patch resets pdev-error_state to pci_channel_io_normal
 at
  the begining of report_slot_reset.
 
  Thank Liu Joseph for pointing it out.
 
 Linas, Bjorn,
 
 Would you like to merge the patch to your testing tree?
 The patch resolves one issue pointed out by Joseph.
 
 
 I'm not maintaining a tree at this time, and am not able to test.
 Sorry.
Thanks Linas.

Greg,

Would you like to merge it into your testing tree? Joseph Liu tested
the patch and it does work.

Thanks,
Yanmin



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-02 Thread Greg Kroah-Hartman
On Fri, May 03, 2013 at 08:33:00AM +0800, Yanmin Zhang wrote:
 On Wed, 2013-05-01 at 20:20 -0500, Linas Vepstas wrote:
  Hi,
  
  On 1 May 2013 19:30, Yanmin Zhang yanmin_zh...@linux.intel.com
  wrote:
  On Fri, 2013-04-26 at 06:28 +, Zhang, LongX wrote:
   From: Zhang Long longx.zh...@intel.com
  
   Specific pci device drivers might have many functions to
  call
   pci_channel_offline to check device states. When slot_reset
  happens,
   drivers' slot_reset callback might call such functions and
  eventually
   abort the reset.
  
   The patch resets pdev-error_state to pci_channel_io_normal
  at
   the begining of report_slot_reset.
  
   Thank Liu Joseph for pointing it out.
  
  Linas, Bjorn,
  
  Would you like to merge the patch to your testing tree?
  The patch resolves one issue pointed out by Joseph.
  
  
  I'm not maintaining a tree at this time, and am not able to test.
  Sorry.
 Thanks Linas.
 
 Greg,
 
 Would you like to merge it into your testing tree? Joseph Liu tested
 the patch and it does work.

You do know about the scripts/get_maintainer.pl script, right?

Hint, try it out :)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-02 Thread Yanmin Zhang
On Thu, 2013-05-02 at 19:00 -0700, Greg Kroah-Hartman wrote:
 On Fri, May 03, 2013 at 08:33:00AM +0800, Yanmin Zhang wrote:
  On Wed, 2013-05-01 at 20:20 -0500, Linas Vepstas wrote:
   Hi,
   
   On 1 May 2013 19:30, Yanmin Zhang yanmin_zh...@linux.intel.com
   wrote:
   On Fri, 2013-04-26 at 06:28 +, Zhang, LongX wrote:
From: Zhang Long longx.zh...@intel.com
   
Specific pci device drivers might have many functions to
   call
pci_channel_offline to check device states. When slot_reset
   happens,
drivers' slot_reset callback might call such functions and
   eventually
abort the reset.
   
The patch resets pdev-error_state to pci_channel_io_normal
   at
the begining of report_slot_reset.
   
Thank Liu Joseph for pointing it out.
   
   Linas, Bjorn,
   
   Would you like to merge the patch to your testing tree?
   The patch resolves one issue pointed out by Joseph.
   
   
   I'm not maintaining a tree at this time, and am not able to test.
   Sorry.
  Thanks Linas.
  
  Greg,
  
  Would you like to merge it into your testing tree? Joseph Liu tested
  the patch and it does work.
 
 You do know about the scripts/get_maintainer.pl script, right?
 
 Hint, try it out :)
Greg,

Thanks. We did send to the right people, Linas and Bjorn.

scripts/get_maintainer.pl 
0001-pci-reset-error_state-to-pci_channel_io_normal-at-re.patch
Bjorn Helgaas bhelg...@google.com (supporter:PCI 
SUBSYSTEM,commit_signer:7/8=88%,commit_signer:10/11=91%)
Linas Vepstas linasveps...@gmail.com (commit_signer:2/8=25%)
Yijing Wang wangyij...@huawei.com 
(commit_signer:2/8=25%,commit_signer:2/11=18%)
Jiang Liu jiang@huawei.com (commit_signer:2/8=25%,commit_signer:2/11=18%)
Stephen Hemminger shemmin...@vyatta.com (commit_signer:1/8=12%)
Rafael J. Wysocki rafael.j.wyso...@intel.com (commit_signer:6/11=55%)
Huang Ying ying.hu...@intel.com (commit_signer:5/11=45%)
linux-...@vger.kernel.org (open list:PCI SUBSYSTEM)
linux-kernel@vger.kernel.org (open list)


I remember Jesse was maintaining PCI subsystem and he responded quickly.

Yanmin


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-01 Thread Yanmin Zhang
On Fri, 2013-04-26 at 06:28 +, Zhang, LongX wrote:
> From: Zhang Long 
> 
> Specific pci device drivers might have many functions to call
> pci_channel_offline to check device states. When slot_reset happens,
> drivers' slot_reset callback might call such functions and eventually
> abort the reset.
> 
> The patch resets pdev->error_state to pci_channel_io_normal at
> the begining of report_slot_reset.
> 
> Thank Liu Joseph for pointing it out.
Linas, Bjorn,

Would you like to merge the patch to your testing tree?
The patch resolves one issue pointed out by Joseph.

Thanks,
Yanmin


> 
> Signed-off-by: Zhang Yanmin 
> Signed-off-by: Zhang Long 
> ---
>  drivers/pci/pcie/aer/aerdrv_core.c |1 +
>  drivers/pci/pcie/portdrv_pci.c |   12 +---
>  2 files changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
> b/drivers/pci/pcie/aer/aerdrv_core.c
> index 564d97f..c61fd44 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void 
> *data)
>   result_data = (struct aer_broadcast_data *) data;
>  
>   device_lock(>dev);
> + dev->error_state = pci_channel_io_normal;
>   if (!dev->driver ||
>   !dev->driver->err_handler ||
>   !dev->driver->err_handler->slot_reset)
> diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
> index ed4d094..7abefd9 100644
> --- a/drivers/pci/pcie/portdrv_pci.c
> +++ b/drivers/pci/pcie/portdrv_pci.c
> @@ -332,13 +332,11 @@ static pci_ers_result_t pcie_portdrv_slot_reset(struct 
> pci_dev *dev)
>   pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED;
>   int retval;
>  
> - /* If fatal, restore cfg space for possible link reset at upstream */
> - if (dev->error_state == pci_channel_io_frozen) {
> - dev->state_saved = true;
> - pci_restore_state(dev);
> - pcie_portdrv_restore_config(dev);
> - pci_enable_pcie_error_reporting(dev);
> - }
> + /* restore cfg space for possible link reset at upstream */
> + dev->state_saved = true;
> + pci_restore_state(dev);
> + pcie_portdrv_restore_config(dev);
> + pci_enable_pcie_error_reporting(dev);
>  
>   /* get true return value from  */
>   retval = device_for_each_child(>dev, , slot_reset_iter);


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-05-01 Thread Yanmin Zhang
On Fri, 2013-04-26 at 06:28 +, Zhang, LongX wrote:
 From: Zhang Long longx.zh...@intel.com
 
 Specific pci device drivers might have many functions to call
 pci_channel_offline to check device states. When slot_reset happens,
 drivers' slot_reset callback might call such functions and eventually
 abort the reset.
 
 The patch resets pdev-error_state to pci_channel_io_normal at
 the begining of report_slot_reset.
 
 Thank Liu Joseph for pointing it out.
Linas, Bjorn,

Would you like to merge the patch to your testing tree?
The patch resolves one issue pointed out by Joseph.

Thanks,
Yanmin


 
 Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
 Signed-off-by: Zhang Long longx.zh...@intel.com
 ---
  drivers/pci/pcie/aer/aerdrv_core.c |1 +
  drivers/pci/pcie/portdrv_pci.c |   12 +---
  2 files changed, 6 insertions(+), 7 deletions(-)
 
 diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
 b/drivers/pci/pcie/aer/aerdrv_core.c
 index 564d97f..c61fd44 100644
 --- a/drivers/pci/pcie/aer/aerdrv_core.c
 +++ b/drivers/pci/pcie/aer/aerdrv_core.c
 @@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void 
 *data)
   result_data = (struct aer_broadcast_data *) data;
  
   device_lock(dev-dev);
 + dev-error_state = pci_channel_io_normal;
   if (!dev-driver ||
   !dev-driver-err_handler ||
   !dev-driver-err_handler-slot_reset)
 diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
 index ed4d094..7abefd9 100644
 --- a/drivers/pci/pcie/portdrv_pci.c
 +++ b/drivers/pci/pcie/portdrv_pci.c
 @@ -332,13 +332,11 @@ static pci_ers_result_t pcie_portdrv_slot_reset(struct 
 pci_dev *dev)
   pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED;
   int retval;
  
 - /* If fatal, restore cfg space for possible link reset at upstream */
 - if (dev-error_state == pci_channel_io_frozen) {
 - dev-state_saved = true;
 - pci_restore_state(dev);
 - pcie_portdrv_restore_config(dev);
 - pci_enable_pcie_error_reporting(dev);
 - }
 + /* restore cfg space for possible link reset at upstream */
 + dev-state_saved = true;
 + pci_restore_state(dev);
 + pcie_portdrv_restore_config(dev);
 + pci_enable_pcie_error_reporting(dev);
  
   /* get true return value from status */
   retval = device_for_each_child(dev-dev, status, slot_reset_iter);


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-04-26 Thread Zhang, LongX
From: Zhang Long 

Specific pci device drivers might have many functions to call
pci_channel_offline to check device states. When slot_reset happens,
drivers' slot_reset callback might call such functions and eventually
abort the reset.

The patch resets pdev->error_state to pci_channel_io_normal at
the begining of report_slot_reset.

Thank Liu Joseph for pointing it out.

Signed-off-by: Zhang Yanmin 
Signed-off-by: Zhang Long 
---
 drivers/pci/pcie/aer/aerdrv_core.c |1 +
 drivers/pci/pcie/portdrv_pci.c |   12 +---
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
b/drivers/pci/pcie/aer/aerdrv_core.c
index 564d97f..c61fd44 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void 
*data)
result_data = (struct aer_broadcast_data *) data;
 
device_lock(>dev);
+   dev->error_state = pci_channel_io_normal;
if (!dev->driver ||
!dev->driver->err_handler ||
!dev->driver->err_handler->slot_reset)
diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
index ed4d094..7abefd9 100644
--- a/drivers/pci/pcie/portdrv_pci.c
+++ b/drivers/pci/pcie/portdrv_pci.c
@@ -332,13 +332,11 @@ static pci_ers_result_t pcie_portdrv_slot_reset(struct 
pci_dev *dev)
pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED;
int retval;
 
-   /* If fatal, restore cfg space for possible link reset at upstream */
-   if (dev->error_state == pci_channel_io_frozen) {
-   dev->state_saved = true;
-   pci_restore_state(dev);
-   pcie_portdrv_restore_config(dev);
-   pci_enable_pcie_error_reporting(dev);
-   }
+   /* restore cfg space for possible link reset at upstream */
+   dev->state_saved = true;
+   pci_restore_state(dev);
+   pcie_portdrv_restore_config(dev);
+   pci_enable_pcie_error_reporting(dev);
 
/* get true return value from  */
retval = device_for_each_child(>dev, , slot_reset_iter);
-- 
1.7.4.1





binAiQYlSzxNH.bin
Description: 0001-pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset.patch


Subject : [ PATCH ] pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset

2013-04-26 Thread Zhang, LongX
From: Zhang Long longx.zh...@intel.com

Specific pci device drivers might have many functions to call
pci_channel_offline to check device states. When slot_reset happens,
drivers' slot_reset callback might call such functions and eventually
abort the reset.

The patch resets pdev-error_state to pci_channel_io_normal at
the begining of report_slot_reset.

Thank Liu Joseph for pointing it out.

Signed-off-by: Zhang Yanmin yanmin_zh...@linux.intel.com
Signed-off-by: Zhang Long longx.zh...@intel.com
---
 drivers/pci/pcie/aer/aerdrv_core.c |1 +
 drivers/pci/pcie/portdrv_pci.c |   12 +---
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/pcie/aer/aerdrv_core.c 
b/drivers/pci/pcie/aer/aerdrv_core.c
index 564d97f..c61fd44 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -286,6 +286,7 @@ static int report_slot_reset(struct pci_dev *dev, void 
*data)
result_data = (struct aer_broadcast_data *) data;
 
device_lock(dev-dev);
+   dev-error_state = pci_channel_io_normal;
if (!dev-driver ||
!dev-driver-err_handler ||
!dev-driver-err_handler-slot_reset)
diff --git a/drivers/pci/pcie/portdrv_pci.c b/drivers/pci/pcie/portdrv_pci.c
index ed4d094..7abefd9 100644
--- a/drivers/pci/pcie/portdrv_pci.c
+++ b/drivers/pci/pcie/portdrv_pci.c
@@ -332,13 +332,11 @@ static pci_ers_result_t pcie_portdrv_slot_reset(struct 
pci_dev *dev)
pci_ers_result_t status = PCI_ERS_RESULT_RECOVERED;
int retval;
 
-   /* If fatal, restore cfg space for possible link reset at upstream */
-   if (dev-error_state == pci_channel_io_frozen) {
-   dev-state_saved = true;
-   pci_restore_state(dev);
-   pcie_portdrv_restore_config(dev);
-   pci_enable_pcie_error_reporting(dev);
-   }
+   /* restore cfg space for possible link reset at upstream */
+   dev-state_saved = true;
+   pci_restore_state(dev);
+   pcie_portdrv_restore_config(dev);
+   pci_enable_pcie_error_reporting(dev);
 
/* get true return value from status */
retval = device_for_each_child(dev-dev, status, slot_reset_iter);
-- 
1.7.4.1





binAiQYlSzxNH.bin
Description: 0001-pci-reset-error_state-to-pci_channel_io_normal-at-report_slot_reset.patch