Re: [PATCH] pci-error-recover: doc cleanup

2016-12-18 Thread Cao jin
Sorry for the late reply.

On 12/09/2016 10:37 PM, Jonathan Corbet wrote:
> On Fri, 9 Dec 2016 14:37:47 +0800
> Cao jin  wrote:
> 
>> I am little confused too, even not sure if we are talking the same
>> *fatal error*, I am talking the fatal error defined in PCI Express spec,
>> chapter 6.2.2.2.1:
> 
> Therein lies my original discomfort with the change; it didn't seem to
> make sense to talk about recovering from a fatal error.  Perhaps making
> it "is done whenever a fatal error (as defined in section 6.2.2.2.1) has
> been detected that can be "solved" by resetting the link" or something
> like that to make it clear how the term is being used?
> 

I find that the .link_reset callback of struct pci_error_handlers isn't
called by anyone (if I didn't miss anything); only a few drivers
implement this callback, and their implementations seem meaningless.

Also, the reset_link() provided by the AER driver seems to be a different
thing from the .link_reset callback. So I am guessing this patch is probably
not quite suitable, and the doc may need a complete update.
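
To make the observation concrete, here is a minimal user-space sketch (a
model, not kernel code). Only .link_reset, struct pci_error_handlers and
reset_link() come from this discussion; every other name (model_dev,
model_error_handlers, model_do_recovery, model_reset_link, the demo_*
callbacks) is a hypothetical stand-in. It assumes, as described in this
thread, that the recovery core resets the link itself on a fatal error and
never invokes the driver's .link_reset callback:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; not the real definitions. */
enum ers_result { ERS_RECOVERED, ERS_NEED_RESET, ERS_DISCONNECT };
enum severity { SEV_NONFATAL, SEV_FATAL };

struct model_dev;

/* Reduced model of struct pci_error_handlers: only the callbacks used here. */
struct model_error_handlers {
	enum ers_result (*error_detected)(struct model_dev *dev);
	enum ers_result (*link_reset)(struct model_dev *dev);
	enum ers_result (*slot_reset)(struct model_dev *dev);
	void (*resume)(struct model_dev *dev);
};

struct model_dev {
	const struct model_error_handlers *handlers;
	int link_reset_calls;	/* times the driver's .link_reset ran */
	int port_resets;	/* times the core's own link reset ran */
	int resumed;		/* set once .resume has been called */
};

/* Demo driver callbacks (assumed behaviour, for illustration only). */
static enum ers_result demo_error_detected(struct model_dev *dev)
{
	(void)dev;
	return ERS_NEED_RESET;
}

static enum ers_result demo_link_reset(struct model_dev *dev)
{
	dev->link_reset_calls++;
	return ERS_RECOVERED;
}

static enum ers_result demo_slot_reset(struct model_dev *dev)
{
	(void)dev;
	return ERS_RECOVERED;
}

static void demo_resume(struct model_dev *dev)
{
	dev->resumed = 1;
}

static const struct model_error_handlers demo_handlers = {
	.error_detected = demo_error_detected,
	.link_reset = demo_link_reset,
	.slot_reset = demo_slot_reset,
	.resume = demo_resume,
};

/* Models reset_link(): a reset done by the recovery core, not the driver. */
static enum ers_result model_reset_link(struct model_dev *dev)
{
	dev->port_resets++;
	return ERS_RECOVERED;
}

/* The link is reset only for fatal errors; .link_reset is never called. */
enum ers_result model_do_recovery(struct model_dev *dev, enum severity sev)
{
	enum ers_result status;

	assert(dev != NULL && dev->handlers != NULL);
	status = dev->handlers->error_detected(dev);

	if (sev == SEV_FATAL && model_reset_link(dev) != ERS_RECOVERED)
		return ERS_DISCONNECT;

	if (status == ERS_NEED_RESET)
		status = dev->handlers->slot_reset(dev);
	if (status != ERS_RECOVERED)
		return ERS_DISCONNECT;

	dev->handlers->resume(dev);
	return ERS_RECOVERED;
}
```

In this model a driver can fill in .link_reset, but no path through
model_do_recovery() ever reaches it, which is the mismatch with the
documentation text under discussion.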

-- 
Sincerely,
Cao jin




Re: [PATCH] pci-error-recover: doc cleanup

2016-12-13 Thread Gavin Shan
On Fri, Dec 09, 2016 at 05:50:17PM +1100, Andrew Donnellan wrote:
>On 09/12/16 17:24, Linas Vepstas wrote:
>>I suppose I'm confused, but I recall that link resets are non-fatal.
>>Fatal errors typically require that the pci adapter be completely
>>reset, any adapter firmware to be reloaded from scratch, the device
>>driver has to kill all device state and start from scratch. It's huge.
>
>Is there a difference in terminology between an AER fatal error and what
>EEH/IBM people think of as a fatal error?
>

They are different things. An AER fatal error can lead to a frozen PE error,
not a fenced PHB error, depending on the configuration of the PHB.

>>If the fatal error is on pci device that is under a block device
>>holding a file system, then (usually) there is no way to recover,
>>because the block layer (and file system) cannot deal with a block
>>device that disappeared and then reappeared some few seconds later.
>>(maybe some future zfs or lvm or btrfs might be able to deal with
>>this, but not today)
>
>Is this still true? I'm not at all familiar with the block device side of it,
>but the cxlflash driver has reasonably full EEH support, including surviving
>a full PHB fence and complete reset.
>

It's still true, especially when the recovery is going to affect the
rootfs. On completion of error recovery, the driver (if necessary)
and the filesystem need to be reloaded, which depends on a script or
daemon, and those are unavailable in this scenario.

Thanks,
Gavin



Re: [PATCH] pci-error-recover: doc cleanup

2016-12-09 Thread Alex Williamson
On Fri, 9 Dec 2016 14:44:25 +0800
Linas Vepstas  wrote:

> On Fri, Dec 9, 2016 at 2:37 PM, Cao jin  wrote:
> >
> >
> > On 12/09/2016 02:24 PM, Linas Vepstas wrote:  
> >> I suppose I'm confused, but I recall that link resets are non-fatal.
> >> Fatal errors typically require that the pci adapter be completely
> >> reset, any adapter firmware to be reloaded from scratch, the device
> >> driver has to kill all device state and start from scratch. It's huge.
> >> If the fatal error is on pci device that is under a block device
> >> holding a file system, then (usually) there is no way to recover,
> >> because the block layer (and file system) cannot deal with a block
> >> device that disappeared and then reappeared some few seconds later.
> >> (maybe some future zfs or lvm or btrfs might be able to deal with
> >> this, but not today)
> >>
> >> By contrast, link resets are far more gentle: the device driver might
> >> have to discard some half-full FIFO's, or cancel some in-flight
> >> commands, but can otherwise gracefully recover without telling the
> >> higher layers that there were any problems.
> >>
> >> --linas
> >>  
> >
> > I am little confused too, even not sure if we are talking the same
> > *fatal error*, I am talking the fatal error defined in PCI Express spec,
> > chapter 6.2.2.2.1:
> >
> > Fatal errors are uncorrectable error conditions which render the
> > particular Link and related hardware unreliable. For Fatal errors, a
> > reset of the components on the Link may be required to return to
> > reliable operation. Platform handling of Fatal errors, and any efforts
> > to limit the effects of these errors, is platform implementation specific.
> >
> > Link reset means set *secondary bus reset* bit in pci bridge config
> > space, can reset the link and device simultaneously, is the strongest
> > kind of reset as I know.  
> 
> OK, well, it's been far too many years, and I don't have the PCI spec
> at my fingertips.
> Isn't there a link reset that can be performed, without forcing a device 
> reset?
> 
> The intent was that some PCI link errors are due to vibration,
> ground-bounce, humidity, etc. and that these errors can be detected
> and do not corrupt the device state or the device driver state.  Since
> they are not associated with data corruption (or rather, the
> corruption is local to the link), these can be recovered by resetting
> just the link, without resetting the whole adapter. They may require
> resetting some device-driver state, but not all of it.
> 
> However, this was all decided before the PCI-E spec was written, so
> maybe the newer PCI-E specs now say something different.

Perhaps you're thinking of link retraining?  That sort of error would
be considered correctable, not fatal.  Fatal errors are uncorrected
errors and a bigger hammer is needed to deal with them, such as a link
reset.  Thanks,

Alex
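
The distinction drawn above can be sketched as a small classification
table. This is only an illustration under the assumptions stated in this
thread: the enum and function names are hypothetical (they are not the
kernel's AER definitions), and the action strings merely summarise the
discussion.

```c
/* Hypothetical error classes, following the terminology used in this thread. */
enum err_class { ERR_CORRECTABLE, ERR_NONFATAL, ERR_FATAL };

/* Summarise the kind of handling each class calls for. */
const char *recovery_action(enum err_class c)
{
	switch (c) {
	case ERR_CORRECTABLE:
		/* e.g. link retraining; no software recovery needed */
		return "corrected by hardware";
	case ERR_NONFATAL:
		/* uncorrected, but the link itself is still usable */
		return "driver-level recovery without a link reset";
	case ERR_FATAL:
		/* uncorrected and the link is unreliable: the bigger hammer */
		return "link reset";
	}
	return "unknown";
}

/* Only fatal errors call for resetting the link in this model. */
int needs_link_reset(enum err_class c)
{
	return c == ERR_FATAL;
}
```

Under this reading, the gentle recovery described earlier in the thread
corresponds to the correctable class, not to a link reset.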


Re: [PATCH] pci-error-recover: doc cleanup

2016-12-09 Thread Jonathan Corbet
On Fri, 9 Dec 2016 14:37:47 +0800
Cao jin  wrote:

> I am little confused too, even not sure if we are talking the same
> *fatal error*, I am talking the fatal error defined in PCI Express spec,
> chapter 6.2.2.2.1:

Therein lies my original discomfort with the change; it didn't seem to
make sense to talk about recovering from a fatal error.  Perhaps making
it "is done whenever a fatal error (as defined in section 6.2.2.2.1) has
been detected that can be "solved" by resetting the link" or something
like that to make it clear how the term is being used?

Thanks,

jon


Re: [PATCH] pci-error-recover: doc cleanup

2016-12-08 Thread Cao jin


On 12/09/2016 02:44 PM, Linas Vepstas wrote:
> On Fri, Dec 9, 2016 at 2:37 PM, Cao jin  wrote:
>>
>>
>> On 12/09/2016 02:24 PM, Linas Vepstas wrote:
>>> I suppose I'm confused, but I recall that link resets are non-fatal.
>>> Fatal errors typically require that the pci adapter be completely
>>> reset, any adapter firmware to be reloaded from scratch, the device
>>> driver has to kill all device state and start from scratch. It's huge.
>>> If the fatal error is on pci device that is under a block device
>>> holding a file system, then (usually) there is no way to recover,
>>> because the block layer (and file system) cannot deal with a block
>>> device that disappeared and then reappeared some few seconds later.
>>> (maybe some future zfs or lvm or btrfs might be able to deal with
>>> this, but not today)
>>>
>>> By contrast, link resets are far more gentle: the device driver might
>>> have to discard some half-full FIFO's, or cancel some in-flight
>>> commands, but can otherwise gracefully recover without telling the
>>> higher layers that there were any problems.
>>>
>>> --linas
>>>
>>
>> I am little confused too, even not sure if we are talking the same
>> *fatal error*, I am talking the fatal error defined in PCI Express spec,
>> chapter 6.2.2.2.1:
>>
>> Fatal errors are uncorrectable error conditions which render the
>> particular Link and related hardware unreliable. For Fatal errors, a
>> reset of the components on the Link may be required to return to
>> reliable operation. Platform handling of Fatal errors, and any efforts
>> to limit the effects of these errors, is platform implementation specific.
>>
>> Link reset means set *secondary bus reset* bit in pci bridge config
>> space, can reset the link and device simultaneously, is the strongest
>> kind of reset as I know.
> 
> OK, well, it's been far too many years, and I don't have the PCI spec
> at my fingertips.
> Isn't there a link reset that can be performed, without forcing a device 
> reset?
> 

At least I can't find the exact words saying that.

-- 
Sincerely,
Cao jin

> The intent was that some PCI link errors are due to vibration,
> ground-bounce, humidity, etc. and that these errors can be detected
> and do not corrupt the device state or the device driver state.  Since
> they are not associated with data corruption (or rather, the
> corruption is local to the link), these can be recovered by resetting
> just the link, without resetting the whole adapter. They may require
> resetting some device-driver state, but not all of it.
> 
> However, this was all decided before the PCI-E spec was written, so
> maybe the newer PCI-E specs now say something different.
> 
> --linas
> 
>>
>>> On Thu, Dec 8, 2016 at 10:13 PM, Cao jin  wrote:


>>>> On 12/08/2016 10:05 PM, Jonathan Corbet wrote:
>>>>> On Thu, 8 Dec 2016 16:16:14 +0800
>>>>> Cao jin wrote:
>>>>>
>>>>>>  The platform resets the link, and then calls the link_reset() callback
>>>>>>  on all affected device drivers.  This is a PCI-Express specific state
>>>>>> -and is done whenever a non-fatal error has been detected that can be
>>>>>> +and is done whenever a fatal error has been detected that can be
>>>>>>  "solved" by resetting the link. This call informs the driver of the
>>>>>
>>>>> As far as I can tell, the original text was correct here; why do you
>>>>> think this change needs to be made?
>>>>>
>>>>
>>>> See do_recovery() in aer core, reset_link() is called only seeing fatal
>>>> error.
>>>>
>>>> --
>>>> Sincerely,
>>>> Cao jin


>>>
>>>
>>>
>>
>> --
>> Sincerely,
>> Cao jin
>>
>>
> 
> 
> .
> 






Re: [PATCH] pci-error-recover: doc cleanup

2016-12-08 Thread Cao jin


On 12/09/2016 02:24 PM, Linas Vepstas wrote:
> I suppose I'm confused, but I recall that link resets are non-fatal.
> Fatal errors typically require that the pci adapter be completely
> reset, any adapter firmware to be reloaded from scratch, the device
> driver has to kill all device state and start from scratch. It's huge.
> If the fatal error is on pci device that is under a block device
> holding a file system, then (usually) there is no way to recover,
> because the block layer (and file system) cannot deal with a block
> device that disappeared and then reappeared some few seconds later.
> (maybe some future zfs or lvm or btrfs might be able to deal with
> this, but not today)
> 
> By contrast, link resets are far more gentle: the device driver might
> have to discard some half-full FIFO's, or cancel some in-flight
> commands, but can otherwise gracefully recover without telling the
> higher layers that there were any problems.
> 
> --linas
> 

I am a little confused too, and not even sure we are talking about the same
*fatal error*. I am talking about the fatal error defined in the PCI Express
spec, chapter 6.2.2.2.1:

Fatal errors are uncorrectable error conditions which render the
particular Link and related hardware unreliable. For Fatal errors, a
reset of the components on the Link may be required to return to
reliable operation. Platform handling of Fatal errors, and any efforts
to limit the effects of these errors, is platform implementation specific.

Link reset means setting the *secondary bus reset* bit in the PCI bridge
config space, which resets the link and the device simultaneously; it is the
strongest kind of reset as far as I know.
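
As a rough illustration of what setting that bit means, here is a
user-space sketch that operates on an in-memory copy of a bridge's config
space. The offset 0x3E (Bridge Control register) and bit value 0x0040
(Secondary Bus Reset) are the standard values for a PCI-to-PCI bridge
header; struct fake_bridge and the cfg_*/sbr_* helpers are hypothetical
names made up for this example, and real code would also have to hold the
bit for a while and wait for the devices below the bridge to come back.

```c
#include <assert.h>
#include <stdint.h>

#define BRIDGE_CONTROL_OFFSET	0x3E	/* 16-bit Bridge Control register */
#define BRIDGE_CTL_BUS_RESET	0x0040	/* Secondary Bus Reset bit */

/* Hypothetical in-memory copy of a bridge's 256-byte config space. */
struct fake_bridge {
	uint8_t cfg[256];
};

/* Config space is little-endian: low byte first. */
static uint16_t cfg_read16(const struct fake_bridge *b, unsigned int off)
{
	assert(off + 1 < sizeof(b->cfg));
	return (uint16_t)(b->cfg[off] | (b->cfg[off + 1] << 8));
}

static void cfg_write16(struct fake_bridge *b, unsigned int off, uint16_t val)
{
	assert(off + 1 < sizeof(b->cfg));
	b->cfg[off] = (uint8_t)(val & 0xff);
	b->cfg[off + 1] = (uint8_t)(val >> 8);
}

/* Put the secondary bus (the link and the devices below it) into reset. */
void sbr_set(struct fake_bridge *b)
{
	uint16_t ctrl = cfg_read16(b, BRIDGE_CONTROL_OFFSET);

	cfg_write16(b, BRIDGE_CONTROL_OFFSET, ctrl | BRIDGE_CTL_BUS_RESET);
}

/* Release the reset, leaving the other control bits untouched. */
void sbr_clear(struct fake_bridge *b)
{
	uint16_t ctrl = cfg_read16(b, BRIDGE_CONTROL_OFFSET);

	cfg_write16(b, BRIDGE_CONTROL_OFFSET,
		    (uint16_t)(ctrl & ~BRIDGE_CTL_BUS_RESET));
}

int sbr_is_set(const struct fake_bridge *b)
{
	return (cfg_read16(b, BRIDGE_CONTROL_OFFSET) & BRIDGE_CTL_BUS_RESET) != 0;
}
```

The read-modify-write in sbr_set() and sbr_clear() is deliberate: the
Bridge Control register carries other bits that must be preserved.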

> On Thu, Dec 8, 2016 at 10:13 PM, Cao jin  wrote:
>>
>>
>> On 12/08/2016 10:05 PM, Jonathan Corbet wrote:
>>> On Thu, 8 Dec 2016 16:16:14 +0800
>>> Cao jin  wrote:
>>>
>>>>  The platform resets the link, and then calls the link_reset() callback
>>>>  on all affected device drivers.  This is a PCI-Express specific state
>>>> -and is done whenever a non-fatal error has been detected that can be
>>>> +and is done whenever a fatal error has been detected that can be
>>>>  "solved" by resetting the link. This call informs the driver of the
>>>
>>> As far as I can tell, the original text was correct here; why do you
>>> think this change needs to be made?
>>>
>>
>> See do_recovery() in aer core, reset_link() is called only seeing fatal
>> error.
>>
>> --
>> Sincerely,
>> Cao jin
>>
>>
> 
> 
> 

-- 
Sincerely,
Cao jin




Re: [PATCH] pci-error-recover: doc cleanup

2016-12-08 Thread Andrew Donnellan

On 09/12/16 17:24, Linas Vepstas wrote:

> I suppose I'm confused, but I recall that link resets are non-fatal.
> Fatal errors typically require that the pci adapter be completely
> reset, any adapter firmware to be reloaded from scratch, the device
> driver has to kill all device state and start from scratch. It's huge.


Is there a difference in terminology between an AER fatal error and what 
EEH/IBM people think of as a fatal error?



> If the fatal error is on pci device that is under a block device
> holding a file system, then (usually) there is no way to recover,
> because the block layer (and file system) cannot deal with a block
> device that disappeared and then reappeared some few seconds later.
> (maybe some future zfs or lvm or btrfs might be able to deal with
> this, but not today)


Is this still true? I'm not at all familiar with the block device side 
of it, but the cxlflash driver has reasonably full EEH support, 
including surviving a full PHB fence and complete reset.


--
Andrew Donnellan  OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited



Re: [PATCH] pci-error-recover: doc cleanup

2016-12-08 Thread Linas Vepstas
On Fri, Dec 9, 2016 at 2:37 PM, Cao jin  wrote:
>
>
> On 12/09/2016 02:24 PM, Linas Vepstas wrote:
>> I suppose I'm confused, but I recall that link resets are non-fatal.
>> Fatal errors typically require that the pci adapter be completely
>> reset, any adapter firmware to be reloaded from scratch, the device
>> driver has to kill all device state and start from scratch. It's huge.
>> If the fatal error is on pci device that is under a block device
>> holding a file system, then (usually) there is no way to recover,
>> because the block layer (and file system) cannot deal with a block
>> device that disappeared and then reappeared some few seconds later.
>> (maybe some future zfs or lvm or btrfs might be able to deal with
>> this, but not today)
>>
>> By contrast, link resets are far more gentle: the device driver might
>> have to discard some half-full FIFO's, or cancel some in-flight
>> commands, but can otherwise gracefully recover without telling the
>> higher layers that there were any problems.
>>
>> --linas
>>
>
> I am little confused too, even not sure if we are talking the same
> *fatal error*, I am talking the fatal error defined in PCI Express spec,
> chapter 6.2.2.2.1:
>
> Fatal errors are uncorrectable error conditions which render the
> particular Link and related hardware unreliable. For Fatal errors, a
> reset of the components on the Link may be required to return to
> reliable operation. Platform handling of Fatal errors, and any efforts
> to limit the effects of these errors, is platform implementation specific.
>
> Link reset means set *secondary bus reset* bit in pci bridge config
> space, can reset the link and device simultaneously, is the strongest
> kind of reset as I know.

OK, well, it's been far too many years, and I don't have the PCI spec
at my fingertips.
Isn't there a link reset that can be performed, without forcing a device reset?

The intent was that some PCI link errors are due to vibration,
ground-bounce, humidity, etc. and that these errors can be detected
and do not corrupt the device state or the device driver state.  Since
they are not associated with data corruption (or rather, the
corruption is local to the link), these can be recovered by resetting
just the link, without resetting the whole adapter. They may require
resetting some device-driver state, but not all of it.

However, this was all decided before the PCI-E spec was written, so
maybe the newer PCI-E specs now say something different.

--linas

>
>> On Thu, Dec 8, 2016 at 10:13 PM, Cao jin  wrote:
>>>
>>>
>>> On 12/08/2016 10:05 PM, Jonathan Corbet wrote:
>>>> On Thu, 8 Dec 2016 16:16:14 +0800
>>>> Cao jin wrote:
>>>>
>>>>>  The platform resets the link, and then calls the link_reset() callback
>>>>>  on all affected device drivers.  This is a PCI-Express specific state
>>>>> -and is done whenever a non-fatal error has been detected that can be
>>>>> +and is done whenever a fatal error has been detected that can be
>>>>>  "solved" by resetting the link. This call informs the driver of the
>>>>
>>>> As far as I can tell, the original text was correct here; why do you
>>>> think this change needs to be made?
>>>>
>>>
>>> See do_recovery() in aer core, reset_link() is called only seeing fatal
>>> error.
>>>
>>> --
>>> Sincerely,
>>> Cao jin
>>>
>>>
>>
>>
>>
>
> --
> Sincerely,
> Cao jin
>
>


Re: [PATCH] pci-error-recover: doc cleanup

2016-12-08 Thread Linas Vepstas
On Fri, Dec 9, 2016 at 2:37 PM, Cao jin  wrote:
>
>
> On 12/09/2016 02:24 PM, Linas Vepstas wrote:
>> I suppose I'm confused, but I recall that link resets are non-fatal.
>> Fatal errors typically require that the the pci adapter be completely
>> reset, any adapter firmware to be reloaded from scratch, the device
>> driver has to kill all device state and start from scratch. Its huge.
>> If the fatal error is on pci device that is under a block device
>> holding a file system, then (usually) there is no way to recover,
>> because the block layer (and file system) cannot deal with a block
>> device that disappeared and then reappeared some few seconds later.
>> (maybe some future zfs or lvm or btrfs might be able to deal with
>> this, but not today)
>>
>> By contrast, link resets are far more gentle: the device driver might
>> have to discard some half-full FIFO's, or cancel some in-flight
>> commands, but can otherwise gracefully recover without telling the
>> higher layers that there were any problems.
>>
>> --linas
>>
>
> I am little confused too, even not sure if we are talking the same
> *fatal error*, I am talking the fatal error defined in PCI Express spec,
> chapter 6.2.2.2.1:
>
> Fatal errors are uncorrectable error conditions which render the
> particular Link and related hardware unreliable. For Fatal errors, a
> reset of the components on the Link may be required to return to
> reliable operation. Platform handling of Fatal errors, and any efforts
> to limit the effects of these errors, is platform implementation specific.
>
> Link reset means setting the *secondary bus reset* bit in the PCI bridge
> config space. It resets the link and the device simultaneously, and is
> the strongest kind of reset that I know of.
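
A minimal sketch of what setting that bit looks like, using a simulated
config space rather than real hardware. The register offset (0x3e) and bit
mask (0x40) are the Bridge Control values from the kernel's pci_regs.h; the
fake_* and cfg_* names are hypothetical helpers made up for this
illustration, and the delay that real hardware needs between asserting and
clearing the bit is left out.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_BRIDGE_CONTROL        0x3e  /* Bridge Control register offset */
#define PCI_BRIDGE_CTL_BUS_RESET  0x40  /* Secondary Bus Reset bit */

/* Simulated 256-byte config space of a bridge (assumption: no hardware). */
struct fake_bridge {
    uint8_t cfg[256];
};

/* Read a little-endian 16-bit register from the simulated config space. */
static uint16_t cfg_read16(const struct fake_bridge *b, int off)
{
    return (uint16_t)(b->cfg[off] | (b->cfg[off + 1] << 8));
}

/* Write a little-endian 16-bit register to the simulated config space. */
static void cfg_write16(struct fake_bridge *b, int off, uint16_t val)
{
    b->cfg[off] = (uint8_t)(val & 0xff);
    b->cfg[off + 1] = (uint8_t)(val >> 8);
}

/* Set the reset bit with a read-modify-write, preserving other bits. */
static void fake_assert_bus_reset(struct fake_bridge *b)
{
    uint16_t ctrl = cfg_read16(b, PCI_BRIDGE_CONTROL);

    cfg_write16(b, PCI_BRIDGE_CONTROL,
                (uint16_t)(ctrl | PCI_BRIDGE_CTL_BUS_RESET));
}

/* Clear the reset bit again, preserving other bits. */
static void fake_deassert_bus_reset(struct fake_bridge *b)
{
    uint16_t ctrl = cfg_read16(b, PCI_BRIDGE_CONTROL);

    cfg_write16(b, PCI_BRIDGE_CONTROL,
                (uint16_t)(ctrl & ~PCI_BRIDGE_CTL_BUS_RESET));
}

/* Usage: pulse the bit and check that unrelated bits are untouched. */
static void fake_bus_reset_demo(void)
{
    struct fake_bridge b = { { 0 } };

    cfg_write16(&b, PCI_BRIDGE_CONTROL, 0x0003);
    fake_assert_bus_reset(&b);
    assert(cfg_read16(&b, PCI_BRIDGE_CONTROL) == 0x0043);
    fake_deassert_bus_reset(&b);
    assert(cfg_read16(&b, PCI_BRIDGE_CONTROL) == 0x0003);
}
```

The read-modify-write keeps the other Bridge Control bits intact, so the
only thing that changes is the reset state of the secondary bus.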

OK, well, it's been far too many years, and I don't have the PCI spec
at my fingertips.
Isn't there a link reset that can be performed, without forcing a device reset?

The intent was that some PCI link errors are due to vibration,
ground-bounce, humidity, etc. and that these errors can be detected
and do not corrupt the device state or the device driver state.  Since
they are not associated with data corruption (or rather, the
corruption is local to the link), these can be recovered by resetting
just the link, without resetting the whole adapter. They may require
resetting some device-driver state, but not all of it.

However, this was all decided before the PCI-E spec was written, so
maybe the newer PCI-E specs now say something different.

--linas

>
>> On Thu, Dec 8, 2016 at 10:13 PM, Cao jin  wrote:
>>>
>>>
>>> On 12/08/2016 10:05 PM, Jonathan Corbet wrote:
>>>> On Thu, 8 Dec 2016 16:16:14 +0800
>>>> Cao jin  wrote:
>>>>
>>>>>  The platform resets the link, and then calls the link_reset() callback
>>>>>  on all affected device drivers.  This is a PCI-Express specific state
>>>>> -and is done whenever a non-fatal error has been detected that can be
>>>>> +and is done whenever a fatal error has been detected that can be
>>>>>  "solved" by resetting the link. This call informs the driver of the
>>>>
>>>> As far as I can tell, the original text was correct here; why do you
>>>> think this change needs to be made?
>>>>
>>>
>>> See do_recovery() in the AER core: reset_link() is called only when a
>>> fatal error is seen.
>>>
>>> --
>>> Sincerely,
>>> Cao jin
>>>
>>>
>>
>>
>>
>
> --
> Sincerely,
> Cao jin
>
>


Re: [PATCH] pci-error-recover: doc cleanup

2016-12-08 Thread Linas Vepstas
I suppose I'm confused, but I recall that link resets are non-fatal.
Fatal errors typically require that the PCI adapter be completely
reset, that any adapter firmware be reloaded from scratch, and that the
device driver kill all device state and start from scratch. It's huge.
If the fatal error is on a PCI device that is under a block device
holding a file system, then (usually) there is no way to recover,
because the block layer (and file system) cannot deal with a block
device that disappeared and then reappeared some few seconds later.
(maybe some future zfs or lvm or btrfs might be able to deal with
this, but not today)

By contrast, link resets are far more gentle: the device driver might
have to discard some half-full FIFOs, or cancel some in-flight
commands, but can otherwise gracefully recover without telling the
higher layers that there were any problems.

--linas

On Thu, Dec 8, 2016 at 10:13 PM, Cao jin  wrote:
>
>
> On 12/08/2016 10:05 PM, Jonathan Corbet wrote:
>> On Thu, 8 Dec 2016 16:16:14 +0800
>> Cao jin  wrote:
>>
>>>  The platform resets the link, and then calls the link_reset() callback
>>>  on all affected device drivers.  This is a PCI-Express specific state
>>> -and is done whenever a non-fatal error has been detected that can be
>>> +and is done whenever a fatal error has been detected that can be
>>>  "solved" by resetting the link. This call informs the driver of the
>>
>> As far as I can tell, the original text was correct here; why do you
>> think this change needs to be made?
>>
>
> See do_recovery() in the AER core: reset_link() is called only when a
> fatal error is seen.
>
> --
> Sincerely,
> Cao jin
>
>


Re: [PATCH] pci-error-recover: doc cleanup

2016-12-08 Thread Cao jin


On 12/08/2016 10:05 PM, Jonathan Corbet wrote:
> On Thu, 8 Dec 2016 16:16:14 +0800
> Cao jin  wrote:
> 
>>  The platform resets the link, and then calls the link_reset() callback
>>  on all affected device drivers.  This is a PCI-Express specific state
>> -and is done whenever a non-fatal error has been detected that can be
>> +and is done whenever a fatal error has been detected that can be
>>  "solved" by resetting the link. This call informs the driver of the
> 
> As far as I can tell, the original text was correct here; why do you
> think this change needs to be made?
> 

See do_recovery() in the AER core: reset_link() is called only when a
fatal error is seen.
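
To make the claim concrete, here is a simplified user-space model of that
flow. It is not the kernel code: the model_* names are hypothetical, and
everything except the severity check is left out. It only illustrates the
point above, that the link reset step runs for fatal errors and is skipped
otherwise.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical severity classes, named after the PCIe spec terms. */
enum model_severity {
    MODEL_CORRECTABLE,
    MODEL_NONFATAL,
    MODEL_FATAL
};

/* Counts link resets so the behaviour can be observed from outside. */
static int model_link_resets;

/* Stand-in for the link reset step; a real one would touch hardware. */
static void model_reset_link(void)
{
    model_link_resets++;
}

/*
 * Simplified recovery pass. Returns true if the link was reset.
 * Driver callbacks and error handling are deliberately not modelled.
 */
static bool model_do_recovery(enum model_severity sev)
{
    if (sev == MODEL_FATAL) {
        model_reset_link();
        return true;
    }
    return false;
}

/* Usage: only the fatal case bumps the reset counter. */
static void model_recovery_demo(void)
{
    model_link_resets = 0;
    assert(!model_do_recovery(MODEL_NONFATAL));
    assert(model_link_resets == 0);
    assert(model_do_recovery(MODEL_FATAL));
    assert(model_link_resets == 1);
}
```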

-- 
Sincerely,
Cao jin




Re: [PATCH] pci-error-recover: doc cleanup

2016-12-08 Thread Jonathan Corbet
On Thu, 8 Dec 2016 16:16:14 +0800
Cao jin  wrote:

>  The platform resets the link, and then calls the link_reset() callback
>  on all affected device drivers.  This is a PCI-Express specific state
> -and is done whenever a non-fatal error has been detected that can be
> +and is done whenever a fatal error has been detected that can be
>  "solved" by resetting the link. This call informs the driver of the

As far as I can tell, the original text was correct here; why do you
think this change needs to be made?

Thanks,

jon

