Re: Real-life pci errors (Was: Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Benjamin Herrenschmidt
On Fri, 2005-03-18 at 18:35 -0600, Linas Vepstas wrote:
> On Sat, Mar 19, 2005 at 10:13:02AM +1100, Benjamin Herrenschmidt was heard to 
> remark:
> > 
> > Additionally, in "real life", very few errors are caused by known errata.
> > If the drivers know about the errata, they usually already work around
> > them. Afaik, most of the errors are caused by transient conditions on
> > the bus or the device, like a bit being flipped, or thermal
> > conditions... 
> 
> 
> Heh. Let me describe "real life" a bit more accurately.
> 
> We've been running with pci error detection enabled here for the last
> two years.  Based on this experience, the ballpark figures are:
> 
> 90% of all detected errors were device driver bugs coupled to 
> pci card hardware errata

Well, this has been in-lab testing to fight driver bugs/errata on early
release kernels; I'm talking about the context of a released solution
with stable drivers/hw.

> 9% poorly seated pci cards (remove/reseat will make problem go away)
> 
> 1% transient/other.

Ok.

> *EVERY*, and I mean *EVERY*, device driver that we've put
> under stress tests (e.g. peak i/o rates for > 72 hours,
> massive tcp/nfs traffic, massive disk i/o traffic, etc.)
> tripped on an EEH error detect that was traced back to
> a device driver bug.  Not to blame the drivers: a lot of these
> were related to pci card hardware/firmware bugs.  For example,
> I think grepping for "split completion" and "NAPI" in the 
> patches/errata for e100 and e1000 for the last year will reveal 
> some of the stuff that was found.  As far as I know,
> for every bug found, a patch made it into mainline.

Yah, those are a pain. But then, it isn't the context described by
Nguyen where the driver "knows" about the errata and how to recover.
It's the context of a bug where the driver does not know what's going on
and/or doesn't have the proper workaround. My point was more that there
are very few cases where a driver will have to recover from a PCI error
that it actually expects to happen.

> As a rule, it seems that finding these device driver bugs was
> very hard; we had some people work on these for months, and in 
> the case of the e1000, we managed to get Intel engineers to fly
> out here and stare at PCI bus traces for a few days.  (Thanks Intel!)
> Ditto for Emulex.  For ipr, we had inhouse people.
> 
> So overall, PCI error detection did have the expected effect 
> (protecting the kernel from corruption, e.g. due to DMA's going 
> to wild addresses), but I don't think anybody expected that the
> vast majority would be software/hardware bugs, instead of transient 
> effects.
> 
> What's ironic in all of this is that by adding error recovery,
> device driver bugs will be able to hide more effectively ... 
> if there's a pci bus error due to a driver bug, the pci card
> will get rebooted, the kernel will burp for 3 seconds, and 
> things will keep going, and most sysadmins won't notice or 
> won't care.

Yes, but it will be logged at least, so we'll spot a lot of these during
our tests.

Ben.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Real-life pci errors (Was: Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Linas Vepstas
On Sat, Mar 19, 2005 at 10:13:02AM +1100, Benjamin Herrenschmidt was heard to 
remark:
> 
> Additionally, in "real life", very few errors are caused by known errata.
> If the drivers know about the errata, they usually already work around
> them. Afaik, most of the errors are caused by transient conditions on
> the bus or the device, like a bit being flipped, or thermal
> conditions... 


Heh. Let me describe "real life" a bit more accurately.

We've been running with pci error detection enabled here for the last
two years.  Based on this experience, the ballpark figures are:

90% of all detected errors were device driver bugs coupled to 
pci card hardware errata

9% poorly seated pci cards (remove/reseat will make problem go away)

1% transient/other.


*EVERY*, and I mean *EVERY*, device driver that we've put
under stress tests (e.g. peak i/o rates for > 72 hours,
massive tcp/nfs traffic, massive disk i/o traffic, etc.)
tripped on an EEH error detect that was traced back to
a device driver bug.  Not to blame the drivers: a lot of these
were related to pci card hardware/firmware bugs.  For example,
I think grepping for "split completion" and "NAPI" in the 
patches/errata for e100 and e1000 for the last year will reveal 
some of the stuff that was found.  As far as I know,
for every bug found, a patch made it into mainline.

As a rule, it seems that finding these device driver bugs was
very hard; we had some people work on these for months, and in 
the case of the e1000, we managed to get Intel engineers to fly
out here and stare at PCI bus traces for a few days.  (Thanks Intel!)
Ditto for Emulex.  For ipr, we had inhouse people.

So overall, PCI error detection did have the expected effect 
(protecting the kernel from corruption, e.g. due to DMA's going 
to wild addresses), but I don't think anybody expected that the
vast majority would be software/hardware bugs, instead of transient 
effects.

What's ironic in all of this is that by adding error recovery,
device driver bugs will be able to hide more effectively ... 
if there's a pci bus error due to a driver bug, the pci card
will get rebooted, the kernel will burp for 3 seconds, and 
things will keep going, and most sysadmins won't notice or 
won't care.

--linas



Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Benjamin Herrenschmidt
On Fri, 2005-03-18 at 11:10 -0700, Grant Grundler wrote:
> On Fri, Mar 18, 2005 at 09:24:02AM -0800, Nguyen, Tom L wrote:
> > >Likewise, with EEH the device driver could take recovery action on its
> > >own.  But we don't want to end up with multiple sets of recovery code
> > >in drivers, if possible.  Also we want the recovery code to be as
> > >simple as possible, otherwise driver authors will get it wrong.
> > 
> > Drivers own their devices' register sets.  Therefore if there are any
> > vendor-unique actions that can be taken by the driver to recover, we
> > expect the driver to do so.
> ...
> 
> All drivers also need to clean up driver state if they can't
> simply recover (and restart pending IOs), i.e. they need to release
> DMA resources and return suitable errors for pending requests.

Additionally, in "real life", very few errors are caused by known errata.
If the drivers know about the errata, they usually already work around
them. Afaik, most of the errors are caused by transient conditions on
the bus or the device, like a bit being flipped, or thermal
conditions... 

> To the driver writer, it's all "platform" code.
> Folks who maintain PCI (and other) services differentiate between
> "generic" and "arch/platform" specific. Think first like a driver
> writer and then worry about if/how that can be divided between platform
> generic and platform/arch specific code.
> 
> Even PCI-Express has *some* arch specific component. At a minimum each
> architecture has its own chipset and firmware to deal with
> for PCI Express bus discovery and initialization. But driver writers
> don't have to worry about that and they shouldn't for error
> recovery either.

Exactly. A given platform could use Intel's code as-is, or may choose to
do things differently while still showing the same interface to drivers.
Eventually we may end up adding platform hooks to the generic PCIE code
like we have in the PCI code if some platforms require them.





RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Nguyen, Tom L
On Thursday, March 17, 2005 2:58 PM Benjamin Herrenschmidt wrote:
> Does the link side of PCIE provide a way to trigger a hard reset of the
> rest of the card?  If not, then it's dodgy, as there may be no way to
> consistently "reset" the card if it's in a bad state.

The PCI Express spec does not make clear whether an in-band mechanism,
called a hot-reset, triggers a hard reset of the rest of the card.  I
agree that if not, then it's dodgy.

Thanks,
Long


RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Nguyen, Tom L
On Friday, March 18, 2005 10:10 AM Grant Grundler wrote:
>A port bus driver does NOT sound like a normal device driver.
>If PCI Express defines a standard register set for a bridge
>device (like PCI Config space for PCI-PCI Bridges), then I
>don't see a problem with PCI-Express error handling code mucking
>with those registers. Look at how PCI-PCI bridges are supported
>today and which bits of code poke registers on PCI-PCI Bridges.

Please refer to PCIEBUS-HOWTO.txt for how the port bus driver works.

Thanks,
Long


Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Grant Grundler
On Fri, Mar 18, 2005 at 09:24:02AM -0800, Nguyen, Tom L wrote:
> >Likewise, with EEH the device driver could take recovery action on its
> >own.  But we don't want to end up with multiple sets of recovery code
> >in drivers, if possible.  Also we want the recovery code to be as
> >simple as possible, otherwise driver authors will get it wrong.
> 
> Drivers own their devices' register sets.  Therefore if there are any
> vendor-unique actions that can be taken by the driver to recover, we
> expect the driver to do so.
...

All drivers also need to clean up driver state if they can't
simply recover (and restart pending IOs), i.e. they need to release
DMA resources and return suitable errors for pending requests.
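Sketched in plain C, that cleanup path might look like the following. This is a userspace sketch with stubbed-out types; every name in it (fake_driver_state, driver_error_teardown, and so on) is an illustrative stand-in, not a real kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for kernel objects; not a real kernel API. */
#define EIO 5

struct fake_request {
    int status;                 /* 0 = in flight, negative = completed with error */
};

struct fake_driver_state {
    int dma_mapped;             /* nonzero while DMA mappings are held */
    int dead;                   /* set once the device is given up on */
    struct fake_request *pending;
    size_t npending;
};

static void release_dma(struct fake_driver_state *st)
{
    st->dma_mapped = 0;         /* stand-in for unmapping/tearing down DMA */
}

static void fail_pending_io(struct fake_driver_state *st)
{
    /* Complete every outstanding request with a suitable error so
     * upper layers (block, net, ...) see a clean failure, not a hang. */
    for (size_t i = 0; i < st->npending; i++)
        st->pending[i].status = -EIO;
    st->npending = 0;
}

/* Called when the driver cannot simply recover and restart I/O. */
static void driver_error_teardown(struct fake_driver_state *st)
{
    release_dma(st);
    fail_pending_io(st);
    st->dead = 1;
}
```

The point of the ordering is that DMA is quiesced before requests are completed, so no buffer is handed back to a caller while the device could still write to it.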


> >I would see the AER driver as being included in the "platform" code.
> >The AER driver would be closely involved in the recovery process.
> 
> Our goal is to have the AER driver be part of the general code base
> because it is based on a PCI SIG specification that can be implemented
> across all architectures.   

To the driver writer, it's all "platform" code.
Folks who maintain PCI (and other) services differentiate between
"generic" and "arch/platform" specific. Think first like a driver
writer and then worry about if/how that can be divided between platform
generic and platform/arch specific code.

Even PCI-Express has *some* arch specific component. At a minimum each
architecture has its own chipset and firmware to deal with
for PCI Express bus discovery and initialization. But driver writers
don't have to worry about that and they shouldn't for error
recovery either.

> For a FATAL error the link is "unreliable".  This means MMIO operations
> may or may not succeed.  That is why the reset is performed by the
> upstream port driver.  The interface to that is reliable.  A reset of an
> upstream port will propagate to all downstream links.  So we need an
> interface to the bus/port driver to request a reset on its downstream
> link.  We don't want the AER driver writing port bus driver bridge
> control registers.  We are trying to keep ownership of a device's
> register reads/writes within the domain of that device's driver.  In
> our case, that is the port bus driver.

A port bus driver does NOT sound like a normal device driver.
If PCI Express defines a standard register set for a bridge
device (like PCI Config space for PCI-PCI Bridges), then I
don't see a problem with PCI-Express error handling code mucking
with those registers. Look at how PCI-PCI bridges are supported
today and which bits of code poke registers on PCI-PCI Bridges.

hth,
grant


RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Nguyen, Tom L
On Thursday, March 17, 2005 8:01 PM Paul Mackerras wrote:
> Does the PCI Express AER specification define an API for drivers?

No. That is why we agreed on a general API that works for all platforms.

>Likewise, with EEH the device driver could take recovery action on its
>own.  But we don't want to end up with multiple sets of recovery code
>in drivers, if possible.  Also we want the recovery code to be as
>simple as possible, otherwise driver authors will get it wrong.

Drivers own their devices' register sets.  Therefore, if there are any
vendor-unique actions that the driver can take to recover, we expect the
driver to do so; for example, if the driver sees an "xyz" error for which
there is a known erratum whose workaround involves resetting some
registers on the card.  From our perspective, drivers take care
of their own cards while the AER driver and your platform code take
care of the bus/link interfaces.

>I would see the AER driver as being included in the "platform" code.
>The AER driver would be closely involved in the recovery process.

Our goal is to have the AER driver be part of the general code base
because it is based on a PCI SIG specification that can be implemented
across all architectures.   

>What is the state of a link during the time between when an error is
>detected and when a link reset is done?  Is the link usable?  What
>happens if you try to do a MMIO read from a device downstream of the
>link?

For a FATAL error the link is "unreliable".  This means MMIO operations
may or may not succeed.  That is why the reset is performed by the
upstream port driver.  The interface to that is reliable.  A reset of an
upstream port will propagate to all downstream links.  So we need an
interface to the bus/port driver to request a reset on its downstream
link.  We don't want the AER driver writing port bus driver bridge
control registers.  We are trying to keep ownership of a device's
register reads/writes within the domain of that device's driver.  In our
case, that is the port bus driver.
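As a rough sketch of that division of labour: the AER core delegates the reset rather than writing the bridge registers itself. All names below are hypothetical stand-ins; only the flow (fatal error, link becomes unreliable, ask the port's own driver to reset the downstream link) comes from the description above.

```c
#include <assert.h>

/* Hedged sketch of the flow described above: on a fatal error the AER
 * core asks the upstream port's own driver to reset the downstream
 * link, instead of poking bridge control registers itself.
 * Every name here is a hypothetical stand-in. */

enum link_state { LINK_UP, LINK_UNRELIABLE };

struct fake_port {
    enum link_state link;
    int resets_performed;
    /* Supplied by the port bus driver; only it touches bridge registers. */
    int (*reset_downstream_link)(struct fake_port *port);
};

/* Port-bus-driver side: actually perform the downstream link reset. */
static int port_driver_reset_link(struct fake_port *port)
{
    port->resets_performed++;
    port->link = LINK_UP;       /* assume the reset brings the link back */
    return 0;                   /* 0 = reset succeeded */
}

/* AER-core side: mark the link unreliable, then delegate the reset. */
static int aer_handle_fatal_error(struct fake_port *upstream)
{
    upstream->link = LINK_UNRELIABLE;   /* MMIO through it may now fail */
    return upstream->reset_downstream_link(upstream);
}
```

The function pointer is the interface boundary: the AER code never knows how the reset is done, only whether it worked.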

Thanks,
Long


RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Nguyen, Tom L
On Thursday, March 17, 2005 6:44 PM Benjamin Herrenschmidt wrote:
>I have difficulties following all of your previous explanations, I must
>admit. My point here is I'd like you to find out if the API can fit on
>the driver side, and if not, what would need to be changed.

In summary, we agreed that the API you propose should be: 

int (*error_handler)(struct pci_dev *dev, union error_src *);

I believe this API works for most of PCI Express's needs.  The only
addition PCI Express needs is a mechanism for the AER code to request
that a port bus driver perform a downstream link reset when an error
occurs on that downstream link.  For example, you could add a
PCIERR_ERROR_PORT_RESET message whose return value is either
PCIERR_RESULT_RECOVERED or PCIERR_RESULT_DISCONNECT to fit PCI
Express's needs.
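Spelled out as code, the agreed callback plus that addition might look like this. The PCIERR_* names come from this thread; the struct layouts and the handler body are illustrative assumptions only.

```c
#include <assert.h>

/* Result and message names are taken from the thread; everything else
 * (struct layouts, handler body) is an illustrative assumption. */
enum pcierr_result {
    PCIERR_RESULT_RECOVERED,
    PCIERR_RESULT_DISCONNECT,
};

enum pcierr_message {
    PCIERR_ERROR_DETECTED,
    PCIERR_ERROR_PORT_RESET,    /* AER asks a port driver to reset its downstream link */
};

struct pci_dev { int stub; };   /* stand-in for the kernel's struct pci_dev */

union error_src {
    enum pcierr_message message;
};

/* The agreed per-driver callback shape. */
typedef int (*pci_error_handler_t)(struct pci_dev *dev, union error_src *src);

/* Example of how a port bus driver might answer a PORT_RESET request. */
static int port_bus_error_handler(struct pci_dev *dev, union error_src *src)
{
    (void)dev;
    if (src->message == PCIERR_ERROR_PORT_RESET) {
        int link_reset_ok = 1;              /* pretend the downstream reset worked */
        return link_reset_ok ? PCIERR_RESULT_RECOVERED
                             : PCIERR_RESULT_DISCONNECT;
    }
    return PCIERR_RESULT_RECOVERED;
}
```

A single callback taking a discriminated error source keeps drivers to one recovery entry point, which was the stated goal of avoiding multiple sets of recovery code.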

Thanks,
Long


RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Nguyen, Tom L
On Thursday, March 17, 2005 6:44 PM Benjamin Herrenschmidt wrote:
I have difficulties following all of your previous explanations, I must
admit. My point here is I'd like you to find out if the API can fit on
the driver side, and if not, what would need to be changed.

In summary, we agreed that the API you propose should be: 

int (*error_handler)(struct pci_dev *dev, union error_src *);

I believe this API works for most of PCI Express needs.  The only
addition PCI Express needs is a mechanism for the AER code to request a
port bus driver to perform a downstream link reset when an error occurs
on that downstream link. For example, you can add the
PCIERR_ERROR_PORT_RESET message with the return is either
PCIERR_RESULT_RECOVERED or PCIERR_RESULT_DISCONNECT to fit PCI Express
needs.

Thanks,
Long
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Nguyen, Tom L
On Thursday, March 17, 2005 8:01 PM Paul Mackerras wrote:
 Does the PCI Express AER specification define an API for drivers?

No. That is why we agree a general API that works for all platforms.

Likewise, with EEH the device driver could take recovery action on its
own.  But we don't want to end up with multiple sets of recovery code
in drivers, if possible.  Also we want the recovery code to be as
simple as possible, otherwise driver authors will get it wrong.

Drivers own their devices register sets.  Therefore if there are any
vendor unique actions that can be taken by the driver to recovery we
expect the driver to do so.  For example, if the drivers see xyz error
and there is a known errata and workaround that involves resetting some
registers on the card.  From our perspective we see drivers taking care
of their own cards but the AER driver and your platform code will take
care of the bus/link interfaces.

I would see the AER driver as being included in the platform code.
The AER driver would be be closely involved in the recovery process.

Our goal is to have the AER driver be part of the general code base
because it is based on a PCI SIG specification that can be implemented
across all architectures.   

What is the state of a link during the time between when an error is
detected and when a link reset is done?  Is the link usable?  What
happens if you try to do a MMIO read from a device downstream of the
link?

For a FATAL error the link is unreliable.  This means MMIO operations
may or may not succeed.  That is why the reset is performed by the
upstream port driver.  The interface to that is reliable.  A reset of an
upstream port will propagate to all downstream links.  So we need an
interface to the bus/port driver to request a reset on its downstream
link.  We don't want the AER driver writing port bus driver bridge
control registers.  We are trying to keep the ownership of the devices
register read/write within the domain of the devices driver.  In our
case the port bus driver.

Thanks,
Long
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Grant Grundler
On Fri, Mar 18, 2005 at 09:24:02AM -0800, Nguyen, Tom L wrote:
 Likewise, with EEH the device driver could take recovery action on its
 own.  But we don't want to end up with multiple sets of recovery code
 in drivers, if possible.  Also we want the recovery code to be as
 simple as possible, otherwise driver authors will get it wrong.
 
 Drivers own their devices register sets.  Therefore if there are any
 vendor unique actions that can be taken by the driver to recovery we
 expect the driver to do so.
...

All drivers also need to cleanup driver state if they can't
simply recover (and restart pending IOs). ie they need to release
DMA resources and return suitable errors for pending requests.


 I would see the AER driver as being included in the platform code.
 The AER driver would be be closely involved in the recovery process.
 
 Our goal is to have the AER driver be part of the general code base
 because it is based on a PCI SIG specification that can be implemented
 across all architectures.   

To the driver writer, it's all platform code.
Folks who maintain PCI (and other) services differentiate between
generic and arch/platform specific. Think first like a driver
writer and then worry about if/how that can be divided between platform
generic and platform/arch specific code.

Even PCI-Express has *some* arch specific component. At a minimum each
architecture has it's own chipset and firmware to deal with
for PCI Express bus discovery and initialization. But driver writers
don't have to worry about that and they shouldn't for error
recovery either.

 For a FATAL error the link is unreliable.  This means MMIO operations
 may or may not succeed.  That is why the reset is performed by the
 upstream port driver.  The interface to that is reliable.  A reset of an
 upstream port will propagate to all downstream links.  So we need an
 interface to the bus/port driver to request a reset on its downstream
 link.  We don't want the AER driver writing port bus driver bridge
 control registers.  We are trying to keep the ownership of the devices
 register read/write within the domain of the devices driver.  In our
 case the port bus driver.

A port bus driver does NOT sound like a normal device driver.
If PCI Express defines a standard register set for a bridge
device (like PCI COnfig space for PCI-PCI Bridges), then I
don't see a problem with PCI-Express error handling code mucking
with those registers. Look at how PCI-PCI bridges are supported
today and which bits of code poke registers on PCI-PCI Bridges.

hth,
grant
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Nguyen, Tom L
On Friday, March 18, 2005 10:10 AM Grant Grundler wrote:
A port bus driver does NOT sound like a normal device driver.
If PCI Express defines a standard register set for a bridge
device (like PCI COnfig space for PCI-PCI Bridges), then I
don't see a problem with PCI-Express error handling code mucking
with those registers. Look at how PCI-PCI bridges are supported
today and which bits of code poke registers on PCI-PCI Bridges.

Please refer to PCIEBUS-HOWTO.txt for how port bus driver works.

Thanks,
Long
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Nguyen, Tom L
On Thursday, March 17, 2005 2:58 PM Benjamin Herrenschmidt wrote:
 Does the link side of PCIE provides a way to trigger a hard reset of
the
 rest of the card ? If not, then it's dodgy as there may be no way to
 consistently reset the card if it's in a bad state. 

The PCI Express spec does not make it clear of whether an in-band
mechanism, called a hot-reset, triggers a hard reset of the rest of the
card. I agree that if not, then it's dodgy.

Thanks,
Long
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Benjamin Herrenschmidt
On Fri, 2005-03-18 at 11:10 -0700, Grant Grundler wrote:
 On Fri, Mar 18, 2005 at 09:24:02AM -0800, Nguyen, Tom L wrote:
  Likewise, with EEH the device driver could take recovery action on its
  own.  But we don't want to end up with multiple sets of recovery code
  in drivers, if possible.  Also we want the recovery code to be as
  simple as possible, otherwise driver authors will get it wrong.
  
  Drivers own their devices register sets.  Therefore if there are any
  vendor unique actions that can be taken by the driver to recovery we
  expect the driver to do so.
 ...
 
 All drivers also need to cleanup driver state if they can't
 simply recover (and restart pending IOs). ie they need to release
 DMA resources and return suitable errors for pending requests.

Additionally, in real life, very few errors are cause by known errata.
If the drivers know about the errata, they usually already work around
them. Afaik, most of the errors are caused by transcient conditions on
the bus or the device, like a bit beeing flipped, or thermal
conditions... 

 To the driver writer, it's all platform code.
 Folks who maintain PCI (and other) services differentiate between
 generic and arch/platform specific. Think first like a driver
 writer and then worry about if/how that can be divided between platform
 generic and platform/arch specific code.
 
 Even PCI-Express has *some* arch specific component. At a minimum each
 architecture has it's own chipset and firmware to deal with
 for PCI Express bus discovery and initialization. But driver writers
 don't have to worry about that and they shouldn't for error
 recovery either.

Exactly. A given platform could use Intel's code as-is, or may choose to
do things differently while still showing the same interface to drivers.
Eventually we may end up adding platform hooks to the generic PCIE code
like we have in the PCI code if some platforms require them.



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Real-life pci errors (Was: Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Linas Vepstas
On Sat, Mar 19, 2005 at 10:13:02AM +1100, Benjamin Herrenschmidt was heard to 
remark:
 
 Additionally, in real life, very few errors are cause by known errata.
 If the drivers know about the errata, they usually already work around
 them. Afaik, most of the errors are caused by transcient conditions on
 the bus or the device, like a bit beeing flipped, or thermal
 conditions... 


Heh. Let me describe real life a bit more accurately.

We've been running with pci error detection enabled here for the last
two years.  Based on this experience, the ballpark figures are:

90% of all detected errors were device driver bugs coupled to 
pci card hardware errata

9% poorly seated pci cards (remove/reseat will make problem go away)

1% transient/other.


We've seen *EVERY* and I mean *EVERY* device driver that we've put
under stress tests (e.g. peak i/o rates for  72 hours, e.g. 
massive tcp/nfs traffic, massive disk i/o traffic, etc), *EVERY*
driver tripped on an EEH error detect that was traced back to 
a device driver bug.  Not to blame the drivers, a lot of these
were related to pci card hardware/foirmware bugs.  For example, 
I think grepping for split completion and NAPI in the 
patches/errata for e100 and e1000 for the last year will reveal 
some of the stuff that was found.  As far as I know,
for every bug found, a patch made it into mainline.

As a rule, it seems that finding these device driver bugs was
very hard; we had some people work on these for months, and in 
the case of the e1000, we managed to get Intel engineers to fly
out here and stare at PCI bus traces for a few days.  (Thanks Intel!)
Ditto for Emulex.  For ipr, we had inhouse people.

So overall, PCI error detection did have the expected effect 
(protecting the kernel from corruption, e.g. due to DMA's going 
to wild addresses), but I don't think anybody expected that the
vast majority would be software/hardware bugs, instead of transient 
effects.

What's ironic in all of this is that by adding error recovery,
device driver bugs will be able to hide more effectively ... 
if there's a pci bus error due to a driver bug, the pci card
will get rebooted, the kernel will burp for 3 seconds, and 
things will keep going, and most sysadmins won't notice or 
won't care.

--linas

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Real-life pci errors (Was: Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Benjamin Herrenschmidt
On Fri, 2005-03-18 at 18:35 -0600, Linas Vepstas wrote:
 On Sat, Mar 19, 2005 at 10:13:02AM +1100, Benjamin Herrenschmidt was heard to 
 remark:
  
  Additionally, in real life, very few errors are cause by known errata.
  If the drivers know about the errata, they usually already work around
  them. Afaik, most of the errors are caused by transcient conditions on
  the bus or the device, like a bit beeing flipped, or thermal
  conditions... 
 
 
 Heh. Let me describe real life a bit more accurately.
 
 We've been running with pci error detection enabled here for the last
 two years.  Based on this experience, the ballpark figures are:
 
 90% of all detected errors were device driver bugs coupled to 
 pci card hardware errata

Well, this have been in-lab testing to fight driver bugs/errata on early
rlease kernels, I'm talking about the context of a released solution
with stable drivers/hw.

 9% poorly seated pci cards (remove/reseat will make problem go away)
 
 1% transient/other.

Ok.

 We've seen *EVERY* and I mean *EVERY* device driver that we've put
 under stress tests (e.g. peak i/o rates for  72 hours, e.g. 
 massive tcp/nfs traffic, massive disk i/o traffic, etc), *EVERY*
 driver tripped on an EEH error detect that was traced back to 
 a device driver bug.  Not to blame the drivers, a lot of these
 were related to pci card hardware/foirmware bugs.  For example, 
 I think grepping for split completion and NAPI in the 
 patches/errata for e100 and e1000 for the last year will reveal 
 some of the stuff that was found.  As far as I know,
 for every bug found, a patch made it into mainline.

Yah, those are a pain. But then, it isn't the context described by
Nguyen where the driver knows about the errata and how to recover.
It's the context of a bug where the driver does not know what's going on
and/or doesn't have the proper workaround. My point was more that there
are very few cases where a driver will have to do recovery of PCI error
in known cases where it actually expect an error to happen.

> As a rule, it seems that finding these device driver bugs was
> very hard; we had some people work on these for months, and in
> the case of the e1000, we managed to get Intel engineers to fly
> out here and stare at PCI bus traces for a few days.  (Thanks Intel!)
> Ditto for Emulex.  For ipr, we had inhouse people.
>
> So overall, PCI error detection did have the expected effect
> (protecting the kernel from corruption, e.g. due to DMAs going
> to wild addresses), but I don't think anybody expected that the
> vast majority would be software/hardware bugs, instead of transient
> effects.
>
> What's ironic in all of this is that by adding error recovery,
> device driver bugs will be able to hide more effectively ...
> if there's a pci bus error due to a driver bug, the pci card
> will get rebooted, the kernel will burp for 3 seconds, and
> things will keep going, and most sysadmins won't notice or
> won't care.

Yes, but it will be logged at least, so we'll spot a lot of these during
our tests.

Ben.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)

2005-03-17 Thread Paul Mackerras
Nguyen, Tom L writes:

> We decided to implement PCI Express error handling based on the PCI
> Express specification in a platform-independent manner.  This allows any
> platform that implements PCI Express AER per the PCI SIG specification
> to take advantage of the advanced features, much like SHPC hot-plug or
> PCI Express hot-plug implementations.

Does the PCI Express AER specification define an API for drivers?

> For PCI Express the endpoint device driver can take recovery action on
> its own, depending on the nature of the error so long as it does not
> affect the upstream device.  This can include endpoint device resets.

Likewise, with EEH the device driver could take recovery action on its
own.  But we don't want to end up with multiple sets of recovery code
in drivers, if possible.  Also we want the recovery code to be as
simple as possible, otherwise driver authors will get it wrong.

> To support the AER driver calling an upstream device to initiate a reset
> of the link we need a specific callback since the driver doing the reset
> is not the driver who got the error.  In the case of general PCI this

I would see the AER driver as being included in the "platform" code.
The AER driver would be closely involved in the recovery process.

What is the state of a link during the time between when an error is
detected and when a link reset is done?  Is the link usable?  What
happens if you try to do a MMIO read from a device downstream of the
link?

Regards,
Paul.


RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)

2005-03-17 Thread Paul Mackerras
Nguyen, Tom L writes:

> Is EEH a PCI-SIG specification? Is EEH specs available in public?

No and no (not yet anyway).

> It seems that a PCI-PCI bridge per slot is hardware implementation
> specific. The fact that the PCI-PCI Bridge can isolate the slot is
> hardware feature specific.

Well, it's a common feature across all current IBM PPC64 machines.

> PCI Express AER driver uses similar concept of determining whether the
> driver is AER-aware or not except that PCI Express AER is independent
> from firmware support.

Don't worry about the firmware; the driver won't have to interact with
firmware itself, that's the job of the ppc64-specific platform code.

> Where does the platform code reside and where does it log the error?

By platform code I meant the code under the arch directory that knows
the details of the I/O topology of the machine, how to access the PCI
host bridges, etc.  How and where it logs the error is a platform
policy; on IBM ppc64 machines we have an error log daemon for this
purpose, which can do things like log the error to a file or send it
to another machine.

> In PCI Express if the driver is not AER-aware the fatal error message is
> reported by its upstream switch, the AER driver obtains comprehensive
> error information from the upstream switch (like EEH platform code
> obtains error information from the firmware). Since the driver is not
> AER-aware, the fatal error is reported to user to make a policy decision
> since the PCI Express does not have a hot-plug event for the slot like
> EEH platform. 

If there is a permanent failure of an upstream link, then maybe
generating unplug events for the devices below it would be a useful
thing to do.

> So it looks like the hot-plug capability of the driver is being used in
> lieu of specific callbacks to freeze and thaw IO in the case of a
> non-aware driver.  If the driver does not support hot-plug then the
> error is just logged.  Do you leave the slot isolated or perform error
> recovery anyway?

The choice is really to leave the slot isolated or to panic the
system.  Leaving the slot isolated risks having the driver loop in an
interrupt routine or deliver bad data to userspace, so we currently
panic the system.

> On a fatal error the interface is down.  No matter what the driver

Which interface do you mean here?

> supports (AER aware, EEH aware, unaware) all IO is likely to fail.
> Resetting a bus in a point-to-point environment like PCI Express or EEH
> (as you describe) should have little adverse effect.  The risk is the
> bus reset will cause a card reset and the driver must understand to
> re-initialize the card.  A link reset in PCI Express will not cause a
> card reset.  We assume the driver will reset its card if necessary.

How will the driver reset its card?

> In PCI Express the AER driver obtains fatal error information from the
> upstream switch driver. We can use the same API with message =
> PCIERR_ERROR_RECOVER to notify the endpoint driver, which is maybe
> unaware of the fatal error reported by its upstream device. Mostly the
> driver will respond with PCIERR_RESULT_NEED_RESET.

Sounds fine.

Paul.


RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)

2005-03-17 Thread Benjamin Herrenschmidt
On Thu, 2005-03-17 at 10:53 -0800, Nguyen, Tom L wrote:

> To support the AER driver calling an upstream device to initiate a reset
> of the link we need a specific callback since the driver doing the reset
> is not the driver who got the error.  In the case of general PCI this
> could be useful if a PCI bus driver were available to support the
> callback for a bridge device.  This would also support specific error
> recovery calls to reset an endpoint adapter.  We need a call to request
> a driver to perform a reset on a link or device.  

That is quite implementation specific, it doesn't need to be part of the
API (the way the general error management is implemented in PCIE could
be completely done within the bus drivers I suppose). Again, I'm not
trying to define or force a given implementation. I'm trying to define
the driver-side API, that's all.

I have difficulties following all of your previous explanations, I must
admit. My point here is I'd like you to find out if the API can fit on
the driver side, and if not, what would need to be changed. For example,
we might want to distinguish between slot reset (full hard reset) and
link reset, that sort of thing (thus adding a new state for link reset
and a new return code for requesting a link reset if possible; platforms
that don't support it, like IBM's EEH, would just fall back to a full
reset).

Again, the goal here is to have a way for drivers to be mostly bus
agnostic (that is, not have to care whether they are running on PCI,
PCI-X, PCIE, with or without IBM's EEH mechanism, or whatever other
mechanism another vendor might provide) and still implement basic error
recovery.

A driver _designed_ for a PCI-Express device that knows it's on PCI
Express can perfectly well use additional APIs to gather more error
details, etc... but it would be nice to fit the "common needs" as much
as possible in a common and _SIMPLE_ API. The simplicity here is a
requirement, I'm very serious about it, because if it's not simple,
drivers either won't implement it or won't get it right.

Ben.




RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)

2005-03-17 Thread Nguyen, Tom L
On Wednesday, March 16, 2005 7:20 PM Benjamin Herrenschmidt wrote:
>> What mechanism (message??) is used to perform the bus and/or link
>> level reset?  For PCI Express the reset is performed by the upstream
>> port driver.  My API takes this into account.  Are you assuming the PCI
>> device on the bus does the reset or will there be a PCI bus driver that
>> will do the reset?  How will the PCI error handling code initiate a
>> reset?
>
>The "caller", that is the error management framework. I'm defining the
>API at the driver level, not the implementation at the core level.
>
>For example, on IBM pSeries with PCI-Express, we will probably not have
>an AER driver. This will be all dealt by the firmware which will mimmic
>that to the existing EEH error management. We'll have the same API to
do
>the reset that we have today for resetting a slot.

We decided to implement PCI Express error handling based on the PCI
Express specification in a platform-independent manner.  This allows any
platform that implements PCI Express AER per the PCI SIG specification
to take advantage of the advanced features, much like SHPC hot-plug or
PCI Express hot-plug implementations.

>You may have noticed in general that I didn't define either who is
>calling those callbacks. It's all implicit that this is done by platform
>error management code. For example, on ppc64, even the recovery step
>requires action from the platform since the slot has been physically
>isolated. After we have notified all drivers with the "error detected"
>callback, if we decide we can try the "recover" step (all drivers
>returned they could try it and we decided the error wasn't too fatal) we
>will call the firmware to re-enable IOs on the slot and call the
>"recover" step.

For PCI Express the endpoint device driver can take recovery action on
its own, depending on the nature of the error so long as it does not
affect the upstream device.  This can include endpoint device resets.
We expect the driver to do this upon error notification, if possible.
In PCI Express since the driver will have the most knowledge regarding
the error it will have the best ability to do device dependent recovery
and IO retry.  If its recovery fails then the AER driver will ask the
upstream device driver to perform the link reset.  Since this is more of
a side effect, an explicit call to recover is not necessary.  However, we
understand and agree that it is needed to support the general error
recovery cases for PCI.

To support the AER driver calling an upstream device to initiate a reset
of the link we need a specific callback since the driver doing the reset
is not the driver who got the error.  In the case of general PCI this
could be useful if a PCI bus driver were available to support the
callback for a bridge device.  This would also support specific error
recovery calls to reset an endpoint adapter.  We need a call to request
a driver to perform a reset on a link or device.  

Thanks,
Long


RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)

2005-03-17 Thread Benjamin Herrenschmidt

> On a fatal error the interface is down.  No matter what the driver
> supports (AER aware, EEH aware, unaware) all IO is likely to fail.
> Resetting a bus in a point-to-point environment like PCI Express or EEH
> (as you describe) should have little adverse effect.  The risk is the
> bus reset will cause a card reset and the driver must understand to
> re-initialize the card.  A link reset in PCI Express will not cause a
> card reset.  We assume the driver will reset its card if necessary.

Does the link side of PCIE provide a way to trigger a hard reset of the
rest of the card?  If not, then it's dodgy, as there may be no way to
consistently "reset" the card if it's in a bad state. I have to double
check, but I suspect that IBM's implementation of EEH-compliant PCIE
will add a full hard reset, not just a link reset.





RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)

2005-03-17 Thread Nguyen, Tom L
On Wednesday, March 16, 2005 7:52 PM Paul Mackerras wrote:
>> We need some PCI
>> based error flows to understand the details of the flow so we can
>> develop an interface compatible with both.
>
>Here is a basic outline of what happens with EEH (Enhanced Error
>Handling) on IBM PPC64 platforms.  This applies to PCI, PCI-X and
>PCI-Express devices.

Is EEH a PCI-SIG specification? Is EEH specs available in public?

>We have a PCI-PCI bridge per slot.  The bridge (and the PCI fabric
>generally) look for errors such as address parity errors,
>out-of-bounds DMA accesses by the device, or anything that would
>normally cause SERR to be set.  If such an error occurs, the bridge
>immediately isolates the device, meaning that writes by the CPU to the
>device are discarded, reads by the CPU are returned with all 1s data,
>and DMA accesses by the device are blocked.

It seems that a PCI-PCI bridge per slot is hardware implementation
specific. The fact that the PCI-PCI Bridge can isolate the slot is
hardware feature specific.

>What happens at the driver level depends on whether the driver is
>EEH-aware or not.  (This description is more what we would like to
>have rather than what is necessarily implemented at present).

PCI Express AER driver uses similar concept of determining whether the
driver is AER-aware or not except that PCI Express AER is independent
from firmware support.

>If the driver is not EEH-aware but is hot-plug capable, then the
>platform code will notice that reads from the device are returning all
>1s and query firmware about the state of the slot.  Firmware will
>indicate that the slot has been isolated.  Platform code can obtain
>more specific information about the error from firmware and log it.
>Then, platform code will generate a hot-unplug event for the slot.
>After the driver has cleaned up and notified higher levels that its
>device has gone away, platform code will call firmware to reset and
>unisolate the slot, and then generate a hotplug event to tell the
>driver that it can use the device - but as far as the driver is
>concerned, it is a new device.

Where does the platform code reside and where does it log the error?

In PCI Express if the driver is not AER-aware the fatal error message is
reported by its upstream switch, the AER driver obtains comprehensive
error information from the upstream switch (like EEH platform code
obtains error information from the firmware). Since the driver is not
AER-aware, the fatal error is reported to user to make a policy decision
since the PCI Express does not have a hot-plug event for the slot like
EEH platform. 

So it looks like the hot-plug capability of the driver is being used in
lieu of specific callbacks to freeze and thaw IO in the case of a
non-aware driver.  If the driver does not support hot-plug then the
error is just logged.  Do you leave the slot isolated or perform error
recovery anyway?

On a fatal error the interface is down.  No matter what the driver
supports (AER aware, EEH aware, unaware) all IO is likely to fail.
Resetting a bus in a point-to-point environment like PCI Express or EEH
(as you describe) should have little adverse effect.  The risk is the
bus reset will cause a card reset and the driver must understand to
re-initialize the card.  A link reset in PCI Express will not cause a
card reset.  We assume the driver will reset its card if necessary.

>If the driver is EEH-aware, then we use the API that Ben has
>proposed.  Platform code can either reset the slot (by calling
>firmware) or not, depending on what the driver asks for, and also
>depending on any other information the platform code has available to
>it, such as specific information about the error that has occurred.
>Platform code then unisolates the slot and then informs the driver
>that it can reinitialize the device and restart any transfers that
>were in progress.

In PCI Express the AER driver obtains fatal error information from the
upstream switch driver. We can use the same API with message =
PCIERR_ERROR_RECOVER to notify the endpoint driver, which is maybe
unaware of the fatal error reported by its upstream device. Mostly the
driver will respond with PCIERR_RESULT_NEED_RESET.

>Ben's API is aimed at supporting the code flows that we need for EEH
>as well as those needed for recovery from errors on PCI Express.  Part
>of the reason for not just requiring the driver to do everything
>itself is that a slot isolation event can affect multiple drivers,
>because the card in the slot could have a PCI-PCI bridge with multiple
>devices behind it.  Thus the recovery process potentially requires a
>degree of coordination between multiple drivers, and Ben's API
>addresses that.  The same coordination could be required on PCI
>Express, if I understand correctly, because a fault on an upstream
>link could affect many devices downstream of that link.

Yes the same case applies to PCI Express upstream links.  So halting IO
is desired when other devices are affected.

Thanks,
Long

RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)

2005-03-17 Thread Nguyen, Tom L
On Wednesday, March 16, 2005 7:52 PM Paul Mackerras wrote:
 We need some PCI
 based error flows to understand the details of the flow so we can
 develop an interface compatible with both.

Here is a basic outline of what happens with EEH (Enhanced Error
Handling) on IBM PPC64 platforms.  This applies to PCI, PCI-X and
PCI-Express devices.

Is EEH a PCI-SIG specification? Is EEH specs available in public?

We have a PCI-PCI bridge per slot.  The bridge (and the PCI fabric
generally) look for errors such as address parity errors,
out-of-bounds DMA accesses by the device, or anything that would
normally cause SERR to be set.  If such an error occurs, the bridge
immediately isolates the device, meaning that writes by the CPU to the
device are discarded, reads by the CPU are returned with all 1s data,
and DMA accesses by the device are blocked.

It seems that a PCI-PCI bridge per slot is hardware implementation
specific. The fact that the PCI-PCI Bridge can isolate the slot is
hardware feature specific.

What happens at the driver level depends on whether the driver is
EEH-aware or not.  (This description is more what we would like to
have rather than what is necessarily implemented at present).

PCI Express AER driver uses similar concept of determining whether the
driver is AER-aware or not except that PCI Express AER is independent
from firmware support.

If the driver is not EEH-aware but is hot-plug capable, then the
platform code will notice that reads from the device are returning all
1s and query firmware about the state of the slot.  Firmware will
indicate that the slot has been isolated.  Platform code can obtain
more specific information about the error from firmware and log it.
Then, platform code will generate a hot-unplug event for the slot.
After the driver has cleaned up and notified higher levels that its
device has gone away, platform code will call firmware to reset and
unisolate the slot, and then generate a hotplug event to tell the
driver that it can use the device - but as far as the driver is
concerned, it is a new device.

Where does the platform code reside and where does it log the error?

In PCI Express if the driver is not AER-aware the fatal error message is
reported by its upstream switch, the AER driver obtains comprehensive
error information from the upstream switch (like EEH platform code
obtains error information from the firmware). Since the driver is not
AER-aware, the fatal error is reported to user to make a policy decision
since the PCI Express does not have a hot-plug event for the slot like
EEH platform. 

So it looks like the hot-plug capability of the driver is being used in
lieu of specific callbacks to freeze and thaw IO in the case of a
non-aware driver.  If the driver does not support hot-plug then the
error is just logged.  Do you leave the slot isolated or perform error
recovery anyway?

On a fatal error the interface is down.  No matter what the driver
supports (AER aware, EEH aware, unaware) all IO is likely to fail.
Resetting a bus in a point-to-point environment like PCI Express or EEH
(as you describe) should have little adverse effect.  The risk is the
bus reset will cause a card reset and the driver must understand to
re-initialize the card.  A link reset in PCI Express will not cause a
card reset.  We assume the driver will reset its card if necessary.

If the driver is EEH-aware, then we use the API that Ben has
proposed.  Platform code can either reset the slot (by calling
firmware) or not, depending on what the driver asks for, and also
depending on any other information the platform code has available to
it, such as specific information about the error that has occurred.
Platform code then unisolates the slot and then informs the driver
that it can reinitialize the device and restart any transfers that
were in progress.

In PCI Express the AER driver obtains fatal error information from the
upstream switch driver. We can use the same API with message =
PCIERR_ERROR_RECOVER to notify the endpoint driver, which is maybe
unaware of the fatal error reported by its upstream device. Mostly the
driver will respond with PCIERR_RESULT_NEED_RESET.

Ben's API is aimed at supporting the code flows that we need for EEH
as well as those needed for recovery from errors on PCI Express.  Part
of the reason for not just requiring the driver to do everything
itself is that a slot isolation event can affect multiple drivers,
because the card in the slot could have a PCI-PCI bridge with multiple
devices behind it.  Thus the recovery process potentially requires a
degree of coordination between multiple drivers, and Ben's API
addresses that.  The same coordination could be required on PCI
Express, if I understand correctly, because a fault on an upstream
link could affect many devices downstream of that link.

Yes the same case applies to PCI Express upstream links.  So halting IO
is desired when other devices are affected.

Thanks,
Long
-
To unsubscribe from this list: 

RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)

2005-03-17 Thread Benjamin Herrenschmidt

 On a fatal error the interface is down.  No matter what the driver
 supports (AER aware, EEH aware, unaware) all IO is likely to fail.
 Resetting a bus in a point-to-point environment like PCI Express or EEH
 (as you describe) should have little adverse effect.  The risk is the
 bus reset will cause a card reset and the driver must understand to
 re-initialize the card.  A link reset in PCI Express will not cause a
 card reset.  We assume the driver will reset its card if necessary.

Does the link side of PCIE provides a way to trigger a hard reset of the
rest of the card ? If not, then it's dodgy as there may be no way to
consistently reset the card if it's in a bad state. I have to double
check, but I suspect that IBM's implementation of EEH-compliant PCIE
will add a full hard reset not just a link reset.



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)

2005-03-17 Thread Nguyen, Tom L
On Wednesday, March 16, 2005 7:20 PM Benjamin Herrenschmidt wrote:
 What mechanism (message??) is used to perform the bus and/or link
 level reset?  For PCI Express the reset is performed by the upstream
 port driver.  My API takes this into account.  Are you assuming the
PCI
 device on the bus does the reset or will there be a PCI bus driver
that
 will do the reset?  How will the PCI error handling code initiate a
 reset?

The caller, that is the error management framework. I'm defining the
API at the driver level, not the implementation at the core level.

For example, on IBM pSeries with PCI-Express, we will probably not have
an AER driver. This will be all dealt by the firmware which will mimmic
that to the existing EEH error management. We'll have the same API to
do
the reset that we have today for resetting a slot.

We decided to implement PCI Express error handling based on the PCI
Express specification in a platform independent manner.  This allows any
platform that implements PCI Express AER per the PCI SIG specification
can take advantage of the advanced features, much like SHPC hot-plug or
PCI Express hot-plug implementations.

You may have noticed in general that I didn't either define who is
callign those callbacks. It's all implicit that this is done by
platform
error management code. For example, on ppc64, even the recovery step
requires action from the platform since the slot has been physically
isolated. After we have notified all drivers  with the error detected
callback, if we decide we can try the recover step (all drivers
returned they could try it and we decided the error wasn't too fatal)
we
will call the firmware to re-enable IOs on the slot and call the
recover step.

For PCI Express the endpoint device driver can take recovery action on
its own, depending on the nature of the error so long as it does not
affect the upstream device.  This can include endpoint device resets.
We expect the driver to do this upon error notification, if possible.
In PCI Express since the driver will have the most knowledge regarding
the error it will have the best ability to do device dependent recovery
and IO retry.  If its recovery fails then the AER driver will ask the
upstream device driver to perform the link reset.  Since this is more of
a side effect an explicit call to recover is not necessary.  However, we
understand and agree that it is needed to support the general error
recovery cases for PCI.

To support the AER driver calling an upstream device to initiate a reset
of the link we need a specific callback since the driver doing the reset
is not the driver who got the error.  In the case of general PCI this
could be useful if a PCI bus driver were available to support the
callback for a bridge device.  This would also support specific error
recovery calls to reset an endpoint adapter.  We need a call to request
a driver to perform a reset on a link or device.  

Thanks,
Long
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)

2005-03-17 Thread Benjamin Herrenschmidt
On Thu, 2005-03-17 at 10:53 -0800, Nguyen, Tom L wrote:

 To support the AER driver calling an upstream device to initiate a reset
 of the link we need a specific callback since the driver doing the reset
 is not the driver who got the error.  In the case of general PCI this
 could be useful if a PCI bus driver were available to support the
 callback for a bridge device.  This would also support specific error
 recovery calls to reset an endpoint adapter.  We need a call to request
 a driver to perform a reset on a link or device.  

That is quite implementation specific, it doesn't need to be part of the
API (the way the general error management is implemented in PCIE could
be completely done within the bus drivers I suppose). Again, I'm not
trying to define or force a given implementation. I'm trying to define
the driver-side API, that's all.

I have difficulties following all of your previous explanations, I must
admit. My point here is I'd like you to find out if the API can fit on
the driver side, and if not, what would need to be changed. For example,
we might want to distinguish between slot reset (full hard reset) and
link reset, that sort of thing (thus adding a new state for link reset
and a new return code for the others for requesting a link reset if
possible, platforms that don't do it, like IBM EEH PCI would just
fallback to full reset).

Again, the goal here is to have a way for drivers to be mostly bus
agnostic (that is not have to care if they are running on PCI, PCI-X,
PCIE, with or without IBM EEH mecanism, and whatever other mecanism
another vendor might provide) and still implement basic error recovery.

A driver _designed_ for a PCI-Express deviec that knows it's on PCI
Express can perfectly use additional APIs to gather more error details,
etc... but it would be nice to fit the common needs as much as
possible in a common and _SIMPLE_ API. The simplicity here is a
requirement, I'm very serious about it, because if it's not simple,
drivers either won't implement it or won't get it right.

Ben.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)

2005-03-17 Thread Paul Mackerras
Nguyen, Tom L writes:

 Is EEH a PCI-SIG specification? Is EEH specs available in public?

No and no (not yet anyway).

 It seems that a PCI-PCI bridge per slot is hardware implementation
 specific. The fact that the PCI-PCI Bridge can isolate the slot is
 hardware feature specific.

Well, it's a common feature across all current IBM PPC64 machines.

 PCI Express AER driver uses similar concept of determining whether the
 driver is AER-aware or not except that PCI Express AER is independent
 from firmware support.

Don't worry about the firmware; the driver won't have to interact with
firmware itself, that's the job of the ppc64-specific platform code.

 Where does the platform code reside and where does it log the error?

By platform code I meant the code under the arch directory that knows
the details of the I/O topology of the machine, how to access the PCI
host bridges, etc.  How and where it logs the error is a platform
policy; on IBM ppc64 machines we have an error log daemon for this
purpose, which can do things like log the error to a file or send it
to another machine.

 In PCI Express, if the driver is not AER-aware, the fatal error is
 reported by its upstream switch, and the AER driver obtains comprehensive
 error information from the upstream switch (much as the EEH platform code
 obtains error information from the firmware). Since the driver is not
 AER-aware, the fatal error is reported to the user to make a policy
 decision, since PCI Express does not have a hot-plug event for the slot
 like the EEH platform does.

If there is a permanent failure of an upstream link, then maybe
generating unplug events for the devices below it would be a useful
thing to do.

 So it looks like the hot-plug capability of the driver is being used in
 lieu of specific callbacks to freeze and thaw IO in the case of a
 non-aware driver.  If the driver does not support hot-plug then the
 error is just logged.  Do you leave the slot isolated or perform error
 recovery anyway?

The choice is really to leave the slot isolated or to panic the
system.  Leaving the slot isolated risks having the driver loop in an
interrupt routine or deliver bad data to userspace, so we currently
panic the system.

 On a fatal error the interface is down.  No matter what the driver

Which interface do you mean here?

 supports (AER-aware, EEH-aware, unaware), all IO is likely to fail.
 Resetting a bus in a point-to-point environment like PCI Express or EEH
 (as you describe) should have little adverse effect. The risk is that the
 bus reset will cause a card reset, and the driver must know to
 re-initialize the card. A link reset in PCI Express will not cause a
 card reset. We assume the driver will reset its card if necessary.

How will the driver reset its card?

 In PCI Express the AER driver obtains fatal error information from the
 upstream switch driver. We can use the same API with message =
 PCIERR_ERROR_RECOVER to notify the endpoint driver, which may be
 unaware of the fatal error reported by its upstream device. Most often
 the driver will respond with PCIERR_RESULT_NEED_RESET.

Sounds fine.

Paul.


RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)

2005-03-17 Thread Paul Mackerras
Nguyen, Tom L writes:

 We decided to implement PCI Express error handling based on the PCI
 Express specification in a platform-independent manner. This allows any
 platform that implements PCI Express AER per the PCI-SIG specification
 to take advantage of the advanced features, much like the SHPC hot-plug
 or PCI Express hot-plug implementations.

Does the PCI Express AER specification define an API for drivers?

 For PCI Express the endpoint device driver can take recovery action on
 its own, depending on the nature of the error so long as it does not
 affect the upstream device.  This can include endpoint device resets.

Likewise, with EEH the device driver could take recovery action on its
own.  But we don't want to end up with multiple sets of recovery code
in drivers, if possible.  Also we want the recovery code to be as
simple as possible, otherwise driver authors will get it wrong.

 To support the AER driver calling an upstream device to initiate a reset
 of the link, we need a specific callback, since the driver doing the
 reset is not the driver that got the error. In the case of general PCI this

I would see the AER driver as being included in the platform code.
The AER driver would be closely involved in the recovery process.

What is the state of a link during the time between when an error is
detected and when a link reset is done?  Is the link usable?  What
happens if you try to do an MMIO read from a device downstream of the
link?

Regards,
Paul.