Re: Real-life pci errors (Was: Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Benjamin Herrenschmidt
On Fri, 2005-03-18 at 18:35 -0600, Linas Vepstas wrote: > On Sat, Mar 19, 2005 at 10:13:02AM +1100, Benjamin Herrenschmidt was heard to > remark: > > > > Additionally, in "real life", very few errors are caused by known errata. > > If the drivers know about the errata, they usually already work

Real-life pci errors (Was: Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Linas Vepstas
On Sat, Mar 19, 2005 at 10:13:02AM +1100, Benjamin Herrenschmidt was heard to remark: > > Additionally, in "real life", very few errors are caused by known errata. > If the drivers know about the errata, they usually already work around > them. Afaik, most of the errors are caused by transient

Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Benjamin Herrenschmidt
On Fri, 2005-03-18 at 11:10 -0700, Grant Grundler wrote: > On Fri, Mar 18, 2005 at 09:24:02AM -0800, Nguyen, Tom L wrote: > > >Likewise, with EEH the device driver could take recovery action on its > > >own. But we don't want to end up with multiple sets of recovery code > > >in drivers, if

RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Nguyen, Tom L
On Thursday, March 17, 2005 2:58 PM Benjamin Herrenschmidt wrote: > Does the link side of PCIE provide a way to trigger a hard reset of the > rest of the card? If not, then it's dodgy as there may be no way to > consistently "reset" the card if it's in a bad state. The PCI Express spec does

RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Nguyen, Tom L
On Friday, March 18, 2005 10:10 AM Grant Grundler wrote: >A port bus driver does NOT sound like a normal device driver. >If PCI Express defines a standard register set for a bridge >device (like PCI Config space for PCI-PCI Bridges), then I >don't see a problem with PCI-Express error handling code

Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Grant Grundler
On Fri, Mar 18, 2005 at 09:24:02AM -0800, Nguyen, Tom L wrote: > >Likewise, with EEH the device driver could take recovery action on its > >own. But we don't want to end up with multiple sets of recovery code > >in drivers, if possible. Also we want the recovery code to be as > >simple as

RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Nguyen, Tom L
On Thursday, March 17, 2005 8:01 PM Paul Mackerras wrote: > Does the PCI Express AER specification define an API for drivers? No. That is why we agree a general API that works for all platforms. >Likewise, with EEH the device driver could take recovery action on its >own. But we don't want to

RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Nguyen, Tom L
On Thursday, March 17, 2005 6:44 PM Benjamin Herrenschmidt wrote: >I have difficulties following all of your previous explanations, I must >admit. My point here is I'd like you to find out if the API can fit on >the driver side, and if not, what would need to be changed. In summary, we agreed

RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)

2005-03-17 Thread Paul Mackerras
Nguyen, Tom L writes: > We decided to implement PCI Express error handling based on the PCI > Express specification in a platform independent manner. This allows any > platform that implements PCI Express AER per the PCI SIG specification > to take advantage of the advanced features, much like

RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)

2005-03-17 Thread Paul Mackerras
Nguyen, Tom L writes: > Is EEH a PCI-SIG specification? Is EEH specs available in public? No and no (not yet anyway). > It seems that a PCI-PCI bridge per slot is hardware implementation > specific. The fact that the PCI-PCI Bridge can isolate the slot is > hardware feature specific. Well,

RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)

2005-03-17 Thread Benjamin Herrenschmidt
On Thu, 2005-03-17 at 10:53 -0800, Nguyen, Tom L wrote: > To support the AER driver calling an upstream device to initiate a reset > of the link we need a specific callback since the driver doing the reset > is not the driver who got the error. In the case of general PCI this > could be useful

RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)

2005-03-17 Thread Nguyen, Tom L
On Wednesday, March 16, 2005 7:20 PM Benjamin Herrenschmidt wrote: >> What mechanism (message??) is used to perform the bus and/or link >> level reset? For PCI Express the reset is performed by the upstream >> port driver. My API takes this into account. Are you assuming the PCI >> device on

RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)

2005-03-17 Thread Benjamin Herrenschmidt
> On a fatal error the interface is down. No matter what the driver > supports (AER aware, EEH aware, unaware) all IO is likely to fail. > Resetting a bus in a point-to-point environment like PCI Express or EEH > (as you describe) should have little adverse effect. The risk is the > bus reset

RE: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC] PCIErrorRecovery)

2005-03-17 Thread Nguyen, Tom L
On Wednesday, March 16, 2005 7:52 PM Paul Mackerras wrote: >> We need some PCI >> based error flows to understand the details of the flow so we can >> develop an interface compatible with both. > >Here is a basic outline of what happens with EEH (Enhanced Error >Handling) on IBM PPC64 platforms.
