Re: [PATCH 0/5] s390/pci: automatic error recovery

2021-09-08 Thread Niklas Schnelle
On Wed, 2021-09-08 at 11:37 +1000, Oliver O'Halloran wrote:
> On Tue, Sep 7, 2021 at 10:21 PM Niklas Schnelle  
> wrote:
> > On Tue, 2021-09-07 at 10:45 +0200, Niklas Schnelle wrote:
> > > On Tue, 2021-09-07 at 12:04 +1000, Oliver O'Halloran wrote:
> > > > On Mon, Sep 6, 2021 at 7:49 PM Niklas Schnelle  
> > > > wrote:
> > > > > Patch 3 I already sent separately resulting in the discussion below 
> > > > > but without
> > > > > a final conclusion.
> > > > > 
> > > > > https://lore.kernel.org/lkml/20210720150145.640727-1-schne...@linux.ibm.com/
> > > > > 
> > > > > I believe that even though there were some doubts about the use of
> > > > > pci_dev_is_added() by arch code, the existing uses as well as the
> > > > > use in the final patch of this series warrant this export.
> > > > 
> > > > The use of pci_dev_is_added() in arch/powerpc was because in the past
> > > > pci_bus_add_device() could be called before pci_device_add(). That was
> > > > fixed a while ago so it should be safe to remove those calls now.
> > > 
> > > Hmm, ok that confirms Bjorn's suspicion and explains how it came to be.
> > > I can certainly send a patch for that. This would then leave only the
> > > existing use in s390, which I added to prevent a deadlock, as explained
> > > here:
> > > https://lore.kernel.org/lkml/87d15d5eead35c9eaa667958d057cf4a81a8bf13.ca...@linux.ibm.com/
> > > 
> > > Plus the need to use it in the recovery code of this series. I think in
> > > the EEH code the need for a similar check is alleviated by the checks
> > > in the beginning of
> > > arch/powerpc/kernel/eeh_driver.c:eeh_handle_normal_event() especially
> > > eeh_slot_presence_check() which checks presence via the hotplug slot.
> > > I guess we could use our own state tracking in a similar way but felt
> > > like pci_dev_is_added() is the more logical choice.
> 
> The slot check is mainly there to prevent attempts to "recover"
> devices that have been surprise removed (i.e. NVMe hot-unplug). The
> actual recovery process operates off the eeh_pe tree which is frozen
> in place when an error is detected. If a pci_dev is added or removed
> it's not really a problem since those are only ever looked at when
> notifying drivers which is done with the rescan_remove lock held. 

Thanks for the explanation.

> That
> said, I wouldn't really encourage anyone to follow the EEH model since
> it's pretty byzantine.
> 
> > Looking into this again, I think we actually can't easily track this
> > state ourselves outside struct pci_dev. The reason for this is that
> > when e.g. arch/s390/pci/pci_sysfs.c:recover_store() removes the struct
> > pci_dev and scans it again the new struct pci_dev re-uses the same
> > struct zpci_dev because from a platform point of view the PCI device
> > was never removed but only disabled and re-enabled. Thus we can only
> > distinguish the stale struct pci_dev by looking at things stored in
> > struct pci_dev itself.
> 
> IMO the real problem is removing and re-adding the pci_dev. I think
> it's something that's done largely because the PCI core doesn't really
> provide any better mechanism for getting a device back into a
> known-good state so it's abused to implement error recovery. This is
> something that's always annoyed me since it conflates recovery with
> hotplug. After a hot-(un)plug we might have a different device or no
> device. In the recovery case we expect to start and end with the same
> device. Why not apply the same logic to the pci_dev?

For us there are two cases. The first is the existing
/sys/bus/pci/devices//recover attribute. This does the pci_dev
remove and re-add that you mention, and thus we end up with a new
pci_dev afterwards; I agree that is kind of a dumb way to recover which
(too?) closely resembles unplug/re-plug.

The second is the automatic error recovery added in this series. Here we
only attempt recovery if we have a driver bound that supports the error
callbacks, thus always keeping the same pci_dev. If there is no driver
we give up automatic recovery and are back at the situation without
this series.
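
As a rough illustration, that "driver with error callbacks bound" check
could look something like the following. This is a hypothetical helper,
not code from the series; only pdev->driver and its err_handler field
are real kernel structures:

static bool zpci_recovery_possible(struct pci_dev *pdev)
{
	/* Only attempt automatic recovery when a bound driver implements
	 * the PCI error callbacks; otherwise leave the device in the
	 * error state for manual recovery. */
	return pdev->driver && pdev->driver->err_handler &&
	       pdev->driver->err_handler->error_detected;
}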

> 
> Something I was tinkering with before I left IBM was re-working the
> way EEH handles recovering devices that don't have a driver with error
> handling callbacks to something like:
> 
> 1. unbind the driver
> 2. pci_save_state()
> 3. do the reset
> 4. pci_restore_state()
> 5. re-bind the driver
> 
> That would allow keeping the pci_dev around and let me delete a pile
> of confusing code which handles binding the eeh_dev to the new
> pci_dev.

This sounds like an interesting future approach for us too. Thankfully
our binding of the zpci_dev to the new pci_dev is pretty simple by now.
The main trouble with removing and re-adding a pci_dev is then that
upper layers like block devices are also re-created which really only
happens if we have a driver bound.

>  The obvious problem with that approach is the assumption the
> device is functional enough to allow saving the config space, but I
> don't think that's a deal breaker. We could stash a copy of the device
> state before we allow drivers to attach and use that to restore the
> device after the reset. The end result would be the same known-good
> state that we'd get after a re-scan.

Re: [PATCH 0/5] s390/pci: automatic error recovery

2021-09-07 Thread Oliver O'Halloran
On Tue, Sep 7, 2021 at 10:21 PM Niklas Schnelle  wrote:
>
> On Tue, 2021-09-07 at 10:45 +0200, Niklas Schnelle wrote:
> > On Tue, 2021-09-07 at 12:04 +1000, Oliver O'Halloran wrote:
> > > On Mon, Sep 6, 2021 at 7:49 PM Niklas Schnelle  
> > > wrote:
> > > > Patch 3 I already sent separately resulting in the discussion below but 
> > > > without
> > > > a final conclusion.
> > > >
> > > > https://lore.kernel.org/lkml/20210720150145.640727-1-schne...@linux.ibm.com/
> > > >
> > > I believe that even though there were some doubts about the use of
> > > pci_dev_is_added() by arch code, the existing uses as well as the
> > > use in the final patch of this series warrant this export.
> > >
> > > The use of pci_dev_is_added() in arch/powerpc was because in the past
> > > pci_bus_add_device() could be called before pci_device_add(). That was
> > > fixed a while ago so it should be safe to remove those calls now.
> >
> > Hmm, ok that confirms Bjorn's suspicion and explains how it came to be.
> > I can certainly send a patch for that. This would then leave only the
> > existing use in s390, which I added to prevent a deadlock, as explained
> > here:
> > https://lore.kernel.org/lkml/87d15d5eead35c9eaa667958d057cf4a81a8bf13.ca...@linux.ibm.com/
> >
> > Plus the need to use it in the recovery code of this series. I think in
> > the EEH code the need for a similar check is alleviated by the checks
> > in the beginning of
> > arch/powerpc/kernel/eeh_driver.c:eeh_handle_normal_event() especially
> > eeh_slot_presence_check() which checks presence via the hotplug slot.
> > I guess we could use our own state tracking in a similar way but felt
> > like pci_dev_is_added() is the more logical choice.

The slot check is mainly there to prevent attempts to "recover"
devices that have been surprise removed (i.e. NVMe hot-unplug). The
actual recovery process operates off the eeh_pe tree which is frozen
in place when an error is detected. If a pci_dev is added or removed
it's not really a problem since those are only ever looked at when
notifying drivers which is done with the rescan_remove lock held. That
said, I wouldn't really encourage anyone to follow the EEH model since
it's pretty byzantine.

> Looking into this again, I think we actually can't easily track this
> state ourselves outside struct pci_dev. The reason for this is that
> when e.g. arch/s390/pci/pci_sysfs.c:recover_store() removes the struct
> pci_dev and scans it again the new struct pci_dev re-uses the same
> struct zpci_dev because from a platform point of view the PCI device
> was never removed but only disabled and re-enabled. Thus we can only
> distinguish the stale struct pci_dev by looking at things stored in
> struct pci_dev itself.

IMO the real problem is removing and re-adding the pci_dev. I think
it's something that's done largely because the PCI core doesn't really
provide any better mechanism for getting a device back into a
known-good state so it's abused to implement error recovery. This is
something that's always annoyed me since it conflates recovery with
hotplug. After a hot-(un)plug we might have a different device or no
device. In the recovery case we expect to start and end with the same
device. Why not apply the same logic to the pci_dev?

Something I was tinkering with before I left IBM was re-working the
way EEH handles recovering devices that don't have a driver with error
handling callbacks to something like:

1. unbind the driver
2. pci_save_state()
3. do the reset
4. pci_restore_state()
5. re-bind the driver

That would allow keeping the pci_dev around and let me delete a pile
of confusing code which handles binding the eeh_dev to the new
pci_dev. The obvious problem with that approach is the assumption the
device is functional enough to allow saving the config space, but I
don't think that's a deal breaker. We could stash a copy of the device
state before we allow drivers to attach and use that to restore the
device after the reset. The end result would be the same known-good
state that we'd get after a re-scan.
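
As a sketch, that sequence maps onto existing driver-core and PCI-core
calls roughly like this. The function name is made up, locking and most
error handling are omitted, and pci_reset_function() merely stands in
for whatever platform/PE reset EEH would actually perform:

static int recover_without_error_callbacks(struct pci_dev *pdev)
{
	int rc;

	device_release_driver(&pdev->dev);	/* 1. unbind the driver */

	pci_save_state(pdev);			/* 2. save config space */

	/* 3. do the reset - stand-in for the platform/PE reset */
	rc = pci_reset_function(pdev);
	if (rc)
		return rc;

	pci_restore_state(pdev);		/* 4. restore config space */

	return device_attach(&pdev->dev);	/* 5. re-bind the driver */
}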

> That said, I think for the recovery case we might be able to drop the
> pci_dev_is_added() and rely on pdev->driver != NULL which we check
> anyway and that should catch any PCI device that was already removed.

Would that work if there was an error on a device without a driver
bound? If you're just trying to stop races between recovery and device
removal then pci_dev_is_added() is probably the right tool for the
job. Trying to substitute it with a proxy seems like a bad idea.


Re: [PATCH 0/5] s390/pci: automatic error recovery

2021-09-07 Thread Niklas Schnelle
On Tue, 2021-09-07 at 10:45 +0200, Niklas Schnelle wrote:
> On Tue, 2021-09-07 at 12:04 +1000, Oliver O'Halloran wrote:
> > On Mon, Sep 6, 2021 at 7:49 PM Niklas Schnelle  
> > wrote:
> > > Patch 3 I already sent separately resulting in the discussion below but 
> > > without
> > > a final conclusion.
> > > 
> > > https://lore.kernel.org/lkml/20210720150145.640727-1-schne...@linux.ibm.com/
> > > 
> > I believe that even though there were some doubts about the use of
> > pci_dev_is_added() by arch code, the existing uses as well as the
> > use in the final patch of this series warrant this export.
> > 
> > The use of pci_dev_is_added() in arch/powerpc was because in the past
> > pci_bus_add_device() could be called before pci_device_add(). That was
> > fixed a while ago so it should be safe to remove those calls now.
> 
> Hmm, ok that confirms Bjorn's suspicion and explains how it came to be.
> I can certainly send a patch for that. This would then leave only the
> existing use in s390, which I added to prevent a deadlock, as explained
> here:
> https://lore.kernel.org/lkml/87d15d5eead35c9eaa667958d057cf4a81a8bf13.ca...@linux.ibm.com/
> 
> Plus the need to use it in the recovery code of this series. I think in
> the EEH code the need for a similar check is alleviated by the checks
> in the beginning of
> arch/powerpc/kernel/eeh_driver.c:eeh_handle_normal_event() especially
> eeh_slot_presence_check() which checks presence via the hotplug slot.
> I guess we could use our own state tracking in a similar way but felt
> like pci_dev_is_added() is the more logical choice.

Looking into this again, I think we actually can't easily track this
state ourselves outside struct pci_dev. The reason for this is that
when e.g. arch/s390/pci/pci_sysfs.c:recover_store() removes the struct
pci_dev and scans it again the new struct pci_dev re-uses the same
struct zpci_dev because from a platform point of view the PCI device
was never removed but only disabled and re-enabled. Thus we can only
distinguish the stale struct pci_dev by looking at things stored in
struct pci_dev itself.

That said, I think for the recovery case we might be able to drop the
pci_dev_is_added() and rely on pdev->driver != NULL which we check
anyway and that should catch any PCI device that was already removed.



Re: [PATCH 0/5] s390/pci: automatic error recovery

2021-09-07 Thread Niklas Schnelle
On Tue, 2021-09-07 at 12:04 +1000, Oliver O'Halloran wrote:
> On Mon, Sep 6, 2021 at 7:49 PM Niklas Schnelle  wrote:
> > Patch 3 I already sent separately resulting in the discussion below but 
> > without
> > a final conclusion.
> > 
> > https://lore.kernel.org/lkml/20210720150145.640727-1-schne...@linux.ibm.com/
> > 
> I believe that even though there were some doubts about the use of
> pci_dev_is_added() by arch code, the existing uses as well as the use
> in the final patch of this series warrant this export.
> 
> The use of pci_dev_is_added() in arch/powerpc was because in the past
> pci_bus_add_device() could be called before pci_device_add(). That was
> fixed a while ago so it should be safe to remove those calls now.

Hmm, ok that confirms Bjorn's suspicion and explains how it came to be.
I can certainly send a patch for that. This would then leave only the
existing use in s390, which I added to prevent a deadlock, as explained
here:
https://lore.kernel.org/lkml/87d15d5eead35c9eaa667958d057cf4a81a8bf13.ca...@linux.ibm.com/

Plus the need to use it in the recovery code of this series. I think in
the EEH code the need for a similar check is alleviated by the checks
in the beginning of
arch/powerpc/kernel/eeh_driver.c:eeh_handle_normal_event() especially
eeh_slot_presence_check() which checks presence via the hotplug slot.
I guess we could use our own state tracking in a similar way but felt
like pci_dev_is_added() is the more logical choice.

> 
> > Patch 4 "PCI: Export pci_dev_lock()" is basically an extension to commit
> > e3a9b1212b9d ("PCI: Export pci_dev_trylock() and pci_dev_unlock()") which
> > already exported pci_dev_trylock(). In the final patch we make use of
> > pci_dev_lock() to wait for any other exclusive uses of the pdev to be 
> > finished
> > before starting recovery.
> 
> Hmm, I noticed the EEH
> (arch/powerpc/kernel/eeh_driver.c:eeh_pe_report_edev())  and the
> generic PCIe error recovery code (see
> drivers/pci/pcie/err.c:report_error_detected()) only call
> device_lock() before entering the driver's error handling callbacks. I
> wonder if they should be using pci_dev_lock() instead. The only real
> difference is that pci_dev_lock() will also block user space from
> accessing the device's config space while error recovery is in
> progress which seems sensible enough.

I agree that sounds reasonable.



Re: [PATCH 0/5] s390/pci: automatic error recovery

2021-09-07 Thread Niklas Schnelle
On Mon, 2021-09-06 at 21:05 -0500, Linas Vepstas wrote:
> On Mon, Sep 6, 2021 at 4:49 AM Niklas Schnelle 
> wrote:
> 
> >  I believe we might be the first
> > implementation of PCI device recovery in a virtualized setting requiring
> > us to
> > coordinate the device reset with the hypervisor platform by issuing a
> > disable
> > and re-enable to the platform as well as starting the recovery following
> > a platform event.
> > 
> 
> I recall none of the details, but SRIOV is a standardized system for
> sharing a PCI device across multiple virtual machines. It has detailed info
> on what the hypervisor must do, and what the local OS instance must do to
> accomplish this.  

Yes and in fact on s390 we make heavy use of SR-IOV.

> It's part of the PCI standard, and it's more than a decade
> old now, maybe two. Being a part of the PCI standard, it was interoperable
> with error recovery, to the best of my recollection. 

Maybe I worded that with a bit too much sensationalism, and it might
even be that POWER supports error recovery under virtualization as
well, though I'm not sure how far that goes.

I believe you are right that SR-IOV supports error recovery; after all,
this patch set also has to work together with SR-IOV enabled devices.
On s390, though, until this patch set the error recovery performed by
the hypervisor stopped at the hypervisor.

The missing part, added by this patch set, is coordinating with device
drivers in Linux to determine where use of a recovered device can pick
up after the PCIe-level error recovery is done.
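
For reference, the driver-side contract this coordination relies on is
the set of callbacks from Documentation/PCI/pci-error-recovery.rst. A
skeleton of what a participating driver provides looks roughly like the
following; the foo_* names are made up:

static pci_ers_result_t foo_error_detected(struct pci_dev *pdev,
					   pci_channel_state_t state)
{
	if (state == pci_channel_io_perm_failure)
		return PCI_ERS_RESULT_DISCONNECT;
	/* stop I/O and don't touch the device until it has been reset */
	return PCI_ERS_RESULT_NEED_RESET;
}

static pci_ers_result_t foo_slot_reset(struct pci_dev *pdev)
{
	/* re-initialize the device after the platform reset */
	return PCI_ERS_RESULT_RECOVERED;
}

static void foo_resume(struct pci_dev *pdev)
{
	/* restart normal I/O */
}

static const struct pci_error_handlers foo_err_handler = {
	.error_detected	= foo_error_detected,
	.slot_reset	= foo_slot_reset,
	.resume		= foo_resume,
};

/* hooked up via the driver's struct pci_driver: .err_handler = &foo_err_handler */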

As for virtualization, this coordination of course needs to cross the
hypervisor/guest boundary, and at least for KVM+QEMU I know for a fact
that reporting a PCI error to the guest is currently just a stub that
completely stops the guest, so you definitely don't get smooth error
recovery there yet.

> At the time it was
> introduced, it got pushed very aggressively.  The x86 hypervisor vendors
> were aiming at the heart of zseries, and were militant about it.

And yet we're still here; we use SR-IOV ourselves and even support
Linux + KVM as a hypervisor that you can use just the same on a
mainframe as on an x86, POWER, or ARM system.

> 
> -- Linas
> 



Re: [PATCH 0/5] s390/pci: automatic error recovery

2021-09-06 Thread Linas Vepstas
On Mon, Sep 6, 2021 at 4:49 AM Niklas Schnelle 
wrote:

>  I believe we might be the first
> implementation of PCI device recovery in a virtualized setting requiring
> us to
> coordinate the device reset with the hypervisor platform by issuing a
> disable
> and re-enable to the platform as well as starting the recovery following
> a platform event.
>

I recall none of the details, but SRIOV is a standardized system for
sharing a PCI device across multiple virtual machines. It has detailed info
on what the hypervisor must do, and what the local OS instance must do to
accomplish this.  It's part of the PCI standard, and it's more than a decade
old now, maybe two. Being a part of the PCI standard, it was interoperable
with error recovery, to the best of my recollection. At the time it was
introduced, it got pushed very aggressively.  The x86 hypervisor vendors
were aiming at the heart of zseries, and were militant about it.

-- Linas

-- 
Patrick: Are they laughing at us?
Sponge Bob: No, Patrick, they are laughing next to us.


Re: [PATCH 0/5] s390/pci: automatic error recovery

2021-09-06 Thread Oliver O'Halloran
On Mon, Sep 6, 2021 at 7:49 PM Niklas Schnelle  wrote:
>
> Patch 3 I already sent separately resulting in the discussion below but 
> without
> a final conclusion.
>
> https://lore.kernel.org/lkml/20210720150145.640727-1-schne...@linux.ibm.com/
>
> I believe that even though there were some doubts about the use of
> pci_dev_is_added() by arch code, the existing uses as well as the use
> in the final patch of this series warrant this export.

The use of pci_dev_is_added() in arch/powerpc was because in the past
pci_bus_add_device() could be called before pci_device_add(). That was
fixed a while ago so it should be safe to remove those calls now.

> Patch 4 "PCI: Export pci_dev_lock()" is basically an extension to commit
> e3a9b1212b9d ("PCI: Export pci_dev_trylock() and pci_dev_unlock()") which
> already exported pci_dev_trylock(). In the final patch we make use of
> pci_dev_lock() to wait for any other exclusive uses of the pdev to be finished
> before starting recovery.

Hmm, I noticed the EEH
(arch/powerpc/kernel/eeh_driver.c:eeh_pe_report_edev())  and the
generic PCIe error recovery code (see
drivers/pci/pcie/err.c:report_error_detected()) only call
device_lock() before entering the driver's error handling callbacks. I
wonder if they should be using pci_dev_lock() instead. The only real
difference is that pci_dev_lock() will also block user space from
accessing the device's config space while error recovery is in
progress which seems sensible enough.
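
For context, pci_dev_lock() is roughly just device_lock() plus blocking
config space access, which is where the extra protection against user
space comes from. Sketched from drivers/pci/pci.c around this time
frame:

void pci_dev_lock(struct pci_dev *dev)
{
	pci_cfg_access_lock(dev);
	/* block PM suspend, driver probe, etc. */
	device_lock(&dev->dev);
}

void pci_dev_unlock(struct pci_dev *dev)
{
	device_unlock(&dev->dev);
	pci_cfg_access_unlock(dev);
}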


[PATCH 0/5] s390/pci: automatic error recovery

2021-09-06 Thread Niklas Schnelle
Hello,

This series implements automatic error recovery for PCI devices on s390
following the scheme outlined in Documentation/PCI/pci-error-recovery.rst.
It applies on top of current master.

The patches are almost completely s390 specific, except for two patches
exporting existing functionality for use by arch/s390/pci/ code. Nevertheless
I would also appreciate any feedback, especially on the last patch, concerning
the implementation of the error recovery flow. I believe we might be the first
implementation of PCI device recovery in a virtualized setting requiring us to
coordinate the device reset with the hypervisor platform by issuing a disable
and re-enable to the platform as well as starting the recovery following
a platform event.

The outline of the patches is as follows:

Patches 1 and 2 add s390 specific code implementing a reset mechanism that
takes the PCI function out of the platform specific error state.

Patches 3 and 4 export existing common code functions for use by the s390
specific recovery code.

Patch 3 I already sent separately resulting in the discussion below but without
a final conclusion.

https://lore.kernel.org/lkml/20210720150145.640727-1-schne...@linux.ibm.com/

I believe that even though there were some doubts about the use of
pci_dev_is_added() by arch code, the existing uses as well as the use
in the final patch of this series warrant this export.
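
For reference, the helpers Patch 3 moves are (roughly, as they look in
drivers/pci/pci.h at this point) just accessors for a private flag bit
on the pci_dev:

static inline void pci_dev_assign_added(struct pci_dev *dev, bool added)
{
	if (added)
		set_bit(PCI_DEV_ADDED, &dev->priv_flags);
	else
		clear_bit(PCI_DEV_ADDED, &dev->priv_flags);
}

static inline bool pci_dev_is_added(const struct pci_dev *dev)
{
	return test_bit(PCI_DEV_ADDED, &dev->priv_flags);
}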

Patch 4 "PCI: Export pci_dev_lock()" is basically an extension to commit
e3a9b1212b9d ("PCI: Export pci_dev_trylock() and pci_dev_unlock()") which
already exported pci_dev_trylock(). In the final patch we make use of
pci_dev_lock() to wait for any other exclusive uses of the pdev to be finished
before starting recovery.

Finally, Patch 5 implements the recovery flow as part of the existing s390
specific PCI availability and error event mechanism. Previously the error
case only set an error indication, requiring manual intervention to make the
device usable again. Now we handle the case where firmware has already reset
a PCI function after an error was encountered, informing the OS that it
should be ready to be used again. Note that the same event is also issued by
the hypervisor if the function was manually taken into a service mode, for
example for a firmware upgrade via the hypervisor, and is now ready to be
used again.
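
To make the intended flow concrete, here is a much-simplified sketch of
the recovery path described above. The helper names
zpci_event_do_error_recovery() and zpci_platform_reset() are
illustrative assumptions rather than the names used in the series, and
all retry and state handling is omitted:

static pci_ers_result_t zpci_event_do_error_recovery(struct pci_dev *pdev)
{
	const struct pci_error_handlers *err_handler;
	pci_ers_result_t res;

	if (!pdev->driver || !pdev->driver->err_handler ||
	    !pdev->driver->err_handler->error_detected)
		return PCI_ERS_RESULT_DISCONNECT;	/* leave it to manual recovery */
	err_handler = pdev->driver->err_handler;

	pci_dev_lock(pdev);	/* wait for other exclusive users of the pdev */

	res = err_handler->error_detected(pdev, pci_channel_io_frozen);
	if (res != PCI_ERS_RESULT_DISCONNECT) {
		/* disable and re-enable the function via the platform */
		if (zpci_platform_reset(pdev))
			res = PCI_ERS_RESULT_DISCONNECT;
		else if (err_handler->slot_reset)
			res = err_handler->slot_reset(pdev);
	}
	if (res == PCI_ERS_RESULT_RECOVERED && err_handler->resume)
		err_handler->resume(pdev);

	pci_dev_unlock(pdev);
	return res;
}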

Thanks,
Niklas Schnelle

Niklas Schnelle (5):
  s390/pci: refresh function handle in iomap
  s390/pci: implement reset_slot for hotplug slot
  PCI: Move pci_dev_is/assign_added() to pci.h
  PCI: Export pci_dev_lock()
  s390/pci: implement minimal PCI error recovery

 arch/powerpc/platforms/powernv/pci-sriov.c |   3 -
 arch/powerpc/platforms/pseries/setup.c     |   1 -
 arch/s390/include/asm/pci.h                |   6 +-
 arch/s390/pci/pci.c                        | 143 ++-
 arch/s390/pci/pci_event.c                  | 196 -
 arch/s390/pci/pci_insn.c                   |   4 +-
 arch/s390/pci/pci_irq.c                    |   9 +
 arch/s390/pci/pci_sysfs.c                  |   2 -
 drivers/pci/hotplug/acpiphp_glue.c         |   1 -
 drivers/pci/hotplug/s390_pci_hpc.c         |  24 +++
 drivers/pci/pci.c                          |   3 +-
 drivers/pci/pci.h                          |  15 --
 include/linux/pci.h                        |  16 ++
 13 files changed, 389 insertions(+), 34 deletions(-)

-- 
2.25.1