Re: [PATCH v14 0/9] Address error and recovery for AER and DPC

2018-05-01 Thread poza

On 2018-05-01 04:10, Bjorn Helgaas wrote:

On Thu, Apr 26, 2018 at 11:00:52AM +0530, p...@codeaurora.org wrote:

On 2018-04-23 20:53, Oza Pawandeep wrote:
> This patch set brings in error handling support for DPC
>
> The current implementation of AER and error message broadcasting to the
> EP driver is tightly coupled and limited to AER service driver.
> It is important to factor out broadcasting and other link handling
> callbacks. So that not only when AER gets triggered, but also when DPC
> get
> triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.
>
> The goal of the patch-set is:
> DPC should handle the error handling and recovery similar to AER,
> because
> finally both are attempting recovery in some or the other way,
> and for that error handling and recovery framework has to be loosely
> coupled.
> ...



Hi Bjorn,

I know I need to rebase this whole patch-set to 4.17 now.

But before I do that, can you please help to comment.


My overall comment is that I think the series will be simpler and read
better if you first change AER to do remove/re-enumerate, before doing
anything with DPC.

This could be done by extracting just the AER part of "PCI/AER/DPC:
Align FATAL error handling for AER and DPC" (i.e., adding
pcie_do_fatal_recovery()) and moving that to be the very first patch.

It's a small change in terms of code size, but significant to drivers,
and it's really the core of the series, so it would be good to clearly
establish the policy of:

  ERR_NONFATAL => call driver recovery entry points
  ERR_FATAL=> remove and re-enumerate

before bringing DPC into the picture.

Then the subsequent patches would all be more or less mechanical
changes to make DPC follow the same model.


ok I have taken care of you comment, please follow v15, coming next.
I could not make that the first patch, because I needed to unify 
pci_wait_for_link function.
hence it is the second patch, but now the order looks quiet obvious and 
simplified.


Regards,
Oza.



Bjorn


Re: [PATCH v14 0/9] Address error and recovery for AER and DPC

2018-04-30 Thread Bjorn Helgaas
On Thu, Apr 26, 2018 at 11:00:52AM +0530, p...@codeaurora.org wrote:
> On 2018-04-23 20:53, Oza Pawandeep wrote:
> > This patch set brings in error handling support for DPC
> > 
> > The current implementation of AER and error message broadcasting to the
> > EP driver is tightly coupled and limited to AER service driver.
> > It is important to factor out broadcasting and other link handling
> > callbacks. So that not only when AER gets triggered, but also when DPC
> > get
> > triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.
> > 
> > The goal of the patch-set is:
> > DPC should handle the error handling and recovery similar to AER,
> > because
> > finally both are attempting recovery in some or the other way,
> > and for that error handling and recovery framework has to be loosely
> > coupled.
> > ...

> Hi Bjorn,
> 
> I know I need to rebase this whole patch-set to 4.17 now.
> 
> But before I do that, can you please help to comment.

My overall comment is that I think the series will be simpler and read
better if you first change AER to do remove/re-enumerate, before doing
anything with DPC.

This could be done by extracting just the AER part of "PCI/AER/DPC:
Align FATAL error handling for AER and DPC" (i.e., adding
pcie_do_fatal_recovery()) and moving that to be the very first patch.

It's a small change in terms of code size, but significant to drivers,
and it's really the core of the series, so it would be good to clearly
establish the policy of:

  ERR_NONFATAL => call driver recovery entry points
  ERR_FATAL=> remove and re-enumerate

before bringing DPC into the picture.

Then the subsequent patches would all be more or less mechanical
changes to make DPC follow the same model.

Bjorn


Re: [PATCH v14 0/9] Address error and recovery for AER and DPC

2018-04-25 Thread poza

On 2018-04-23 20:53, Oza Pawandeep wrote:

This patch set brings in error handling support for DPC

The current implementation of AER and error message broadcasting to the
EP driver is tightly coupled and limited to AER service driver.
It is important to factor out broadcasting and other link handling
callbacks. So that not only when AER gets triggered, but also when DPC 
get

triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.

The goal of the patch-set is:
DPC should handle the error handling and recovery similar to AER, 
because

finally both are attempting recovery in some or the other way,
and for that error handling and recovery framework has to be loosely
coupled.

It achieves uniformity and transparency to the error handling agents 
such

as AER, DPC, with respect to recovery and error handling.

So, this patch-set tries to unify lot of things between error agents 
and

make them behave in a well defined way. (be it error (FATAL, NON_FATAL)
handling or recovery).

The FATAL error handling is handled with remove/reset_link/re-enumerate
sequence while the NON_FATAL follows the default path.
Documentation/PCI/pci-error-recovery.txt talks more on that.

Changes since v13:
Bjorn's comments addressed
> handke FATAL errors with remove devices followed by 
re-enumeration.

> changes in AER and DPC along with required Documentation.
Changes since v12:
Bjorn's and Keith's Comments addressed.
> Made DPC and AER error handling identical 
> hanldled cases for hotplug enabled system differently.
Changes since v11:
Bjorn's comments addressed.
> rename pcie-err.c to err.c
> removed EXPORT_SYMBOL
> made generic find_serivce function in port driver.
> removed mutex patch as no need to have mutex in pcie_do_recovery
> brough in DPC_FATAL in aer.h
> so now all the error codes (AER and DPC) are unified in aer.h
Changes since v10:
Christoph Hellwig's, David Laight's and Randy Dunlap's
comments addressed.
> renamed pci_do_recovery to pcie_do_recovery
> removed inner braces in conditional statements.
> restrctured the code in pci_wait_for_link
> EXPORT_SYMBOL_GPL
Changes since v9:
Sinan's comments addressed.
> bool active = true; unnecessary variable removed.
Changes since v8:
Fixed Kbuild errors.
Changes since v7:
Rebased the code on pci master
> 
https://kernel.googlesource.com/pub/scm/linux/kernel/git/helgaas/pci

Changes since v6:
Sinan's and Stefan's comments implemented.
> reordered patch 6 and 7
> cleaned up
Changes since v5:
Sinan's and Keith's comments incorporated.
> made separate patch for mutex
> unified error repotting codes into driver/pci/pci.h
> got rid of wait link active/inactive and
  made generic function in driver/pci/pci.c
Changes since v4:
Bjorn's comments incorporated.
> Renamed only do_recovery.
> moved the things more locally to drivers/pci/pci.h
Changes since v3:
Bjorn's comments incorporated.
> Made separate patch renaming generic pci_err.c
> Introduce pci_err.h to contain all the error types and 
recovery

> removed all the dependencies on pci.h
Changes since v2:
Based on feedback from Keith:
"
When DPC is triggered due to receipt of an uncorrectable error 
Message,

the Requester ID from the Message is recorded in the DPC Error
Source ID register and that Message is discarded and not forwarded 
Upstream.

"
Removed the patch where AER checks if DPC service is active
Changes since v1:
Kbuild errors fixed:
> pci_find_dpc_dev made static
> ras_event.h updated
> pci_find_aer_service call with CONFIG check
> pci_find_dpc_service call with CONFIG check

Oza Pawandeep (9):
  PCI/AER: Rename error recovery to generic PCI naming
  PCI/AER: Factor out error reporting from AER
  PCI/PORTDRV: Implement generic find service
  PCI/PORTDRV: Implement generic find device
  PCI/DPC: Unify and plumb error handling into DPC
  PCI: Unify wait for link active into generic PCI
  PCI/DPC: Disable ERR_NONFATAL for DPC
  PCI/AER/DPC: Align FATAL error handling for AER and DPC
  pci-error-recovery: Add AER_FATAL handling

 Documentation/PCI/pci-error-recovery.txt |  35 ++-
 drivers/pci/hotplug/pciehp_hpc.c |  20 +-
 drivers/pci/pci.c|  30 +++
 drivers/pci/pci.h|   5 +
 drivers/pci/pcie/Makefile|   2 +-
 drivers/pci/pcie/aer/aerdrv.c|   2 +
 drivers/pci/pcie/aer/aerdrv.h|  30 ---
 drivers/pci/pcie/aer/aerdrv_core.c   | 317 
+-
 drivers/pci/pcie/err.c   | 374 
+++

 drivers/pci/pcie/pcie-dpc.c  |  63 +++---
 drivers/pci/pcie/portdrv.h   |   4 +
 drivers/pci/pcie/portdrv_core.c  |  69 ++
 include/linux/aer.h 

[PATCH v14 0/9] Address error and recovery for AER and DPC

2018-04-23 Thread Oza Pawandeep
This patch set brings in error handling support for DPC

The current implementation of AER and error message broadcasting to the
EP driver is tightly coupled and limited to AER service driver.
It is important to factor out broadcasting and other link handling
callbacks. So that not only when AER gets triggered, but also when DPC get
triggered (for e.g. ERR_FATAL), callbacks are handled appropriately.

The goal of the patch-set is:
DPC should handle the error handling and recovery similar to AER, because 
finally both are attempting recovery in some or the other way,
and for that error handling and recovery framework has to be loosely
coupled.

It achieves uniformity and transparency to the error handling agents such
as AER, DPC, with respect to recovery and error handling.

So, this patch-set tries to unify lot of things between error agents and
make them behave in a well defined way. (be it error (FATAL, NON_FATAL)
handling or recovery).

The FATAL error handling is handled with remove/reset_link/re-enumerate
sequence while the NON_FATAL follows the default path.
Documentation/PCI/pci-error-recovery.txt talks more on that.

Changes since v13:
Bjorn's comments addressed
> handke FATAL errors with remove devices followed by re-enumeration.
> changes in AER and DPC along with required Documentation.
Changes since v12:
Bjorn's and Keith's Comments addressed.
> Made DPC and AER error handling identical 
> hanldled cases for hotplug enabled system differently.
Changes since v11:
Bjorn's comments addressed.
> rename pcie-err.c to err.c
> removed EXPORT_SYMBOL
> made generic find_serivce function in port driver.
> removed mutex patch as no need to have mutex in pcie_do_recovery
> brough in DPC_FATAL in aer.h
> so now all the error codes (AER and DPC) are unified in aer.h
Changes since v10:
Christoph Hellwig's, David Laight's and Randy Dunlap's
comments addressed.
> renamed pci_do_recovery to pcie_do_recovery
> removed inner braces in conditional statements.
> restrctured the code in pci_wait_for_link
> EXPORT_SYMBOL_GPL
Changes since v9:
Sinan's comments addressed.
> bool active = true; unnecessary variable removed.
Changes since v8:
Fixed Kbuild errors.
Changes since v7:
Rebased the code on pci master
> https://kernel.googlesource.com/pub/scm/linux/kernel/git/helgaas/pci
Changes since v6:
Sinan's and Stefan's comments implemented.
> reordered patch 6 and 7
> cleaned up
Changes since v5:
Sinan's and Keith's comments incorporated.
> made separate patch for mutex
> unified error repotting codes into driver/pci/pci.h
> got rid of wait link active/inactive and
  made generic function in driver/pci/pci.c
Changes since v4:
Bjorn's comments incorporated.
> Renamed only do_recovery.
> moved the things more locally to drivers/pci/pci.h
Changes since v3:
Bjorn's comments incorporated.
> Made separate patch renaming generic pci_err.c
> Introduce pci_err.h to contain all the error types and recovery
> removed all the dependencies on pci.h
Changes since v2:
Based on feedback from Keith:
"
When DPC is triggered due to receipt of an uncorrectable error Message,
the Requester ID from the Message is recorded in the DPC Error
Source ID register and that Message is discarded and not forwarded Upstream.
"
Removed the patch where AER checks if DPC service is active
Changes since v1:
Kbuild errors fixed:
> pci_find_dpc_dev made static
> ras_event.h updated
> pci_find_aer_service call with CONFIG check
> pci_find_dpc_service call with CONFIG check

Oza Pawandeep (9):
  PCI/AER: Rename error recovery to generic PCI naming
  PCI/AER: Factor out error reporting from AER
  PCI/PORTDRV: Implement generic find service
  PCI/PORTDRV: Implement generic find device
  PCI/DPC: Unify and plumb error handling into DPC
  PCI: Unify wait for link active into generic PCI
  PCI/DPC: Disable ERR_NONFATAL for DPC
  PCI/AER/DPC: Align FATAL error handling for AER and DPC
  pci-error-recovery: Add AER_FATAL handling

 Documentation/PCI/pci-error-recovery.txt |  35 ++-
 drivers/pci/hotplug/pciehp_hpc.c |  20 +-
 drivers/pci/pci.c|  30 +++
 drivers/pci/pci.h|   5 +
 drivers/pci/pcie/Makefile|   2 +-
 drivers/pci/pcie/aer/aerdrv.c|   2 +
 drivers/pci/pcie/aer/aerdrv.h|  30 ---
 drivers/pci/pcie/aer/aerdrv_core.c   | 317 +-
 drivers/pci/pcie/err.c   | 374 +++
 drivers/pci/pcie/pcie-dpc.c  |  63 +++---
 drivers/pci/pcie/portdrv.h   |   4 +
 drivers/pci/pcie/portdrv_core.c  |  69 ++
 include/linux/aer.h  |   2 +
 include/uapi/linux/pci_regs.h|   3