Re: [PATCH] pci-error-recover: doc cleanup
Sorry for the late reply.

On 12/09/2016 10:37 PM, Jonathan Corbet wrote:
> On Fri, 9 Dec 2016 14:37:47 +0800
> Cao jin wrote:
>
>> I am little confused too, even not sure if we are talking the same
>> *fatal error*, I am talking the fatal error defined in PCI Express spec,
>> chapter 6.2.2.2.1:
>
> Therein lies my original discomfort with the change; it didn't seem to
> make sense to talk about recovering from a fatal error. Perhaps making
> it "is done whenever a fatal error (as defined in section 6.2.2.2.1) has
> been detected that can be "solved" by resetting the link" or something
> like that to make it clear how the term is being used?
>

I find that the .link_reset callback of struct pci_error_handlers isn't
called by anyone (if I didn't miss anything). Only a few drivers
implement this callback, and their implementations seem meaningless.
The reset_link() provided by the AER driver also seems to be a different
thing from the .link_reset callback.

So I am guessing this patch is probably not quite suitable, and the doc
may need a complete update.

--
Sincerely,
Cao jin
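The observation in the message above (a callback that drivers fill in but that no core path ever dispatches) can be pictured with a small user-space model. This is only a sketch under stated assumptions: struct mock_err_handlers, struct mock_dev and mock_core_recover() are hypothetical names invented for illustration, and the code is not the kernel's struct pci_error_handlers or its AER core.

```c
#include <stddef.h>

/* Hypothetical device state: counts how often each callback ran. */
struct mock_dev {
    int link_reset_calls;
    int slot_reset_calls;
};

/* Hypothetical, pared-down callback table (not the kernel structure). */
struct mock_err_handlers {
    void (*link_reset)(struct mock_dev *dev);
    void (*slot_reset)(struct mock_dev *dev);
};

/* A driver may implement both callbacks... */
static void drv_link_reset(struct mock_dev *dev) { dev->link_reset_calls++; }
static void drv_slot_reset(struct mock_dev *dev) { dev->slot_reset_calls++; }

static const struct mock_err_handlers drv_handlers = {
    .link_reset = drv_link_reset,
    .slot_reset = drv_slot_reset,
};

/*
 * ...but in this model the core recovery path only dispatches
 * slot_reset, so the driver's link_reset is dead code, which is the
 * situation the message describes.
 */
static void mock_core_recover(const struct mock_err_handlers *h,
                              struct mock_dev *dev)
{
    if (h != NULL && h->slot_reset != NULL)
        h->slot_reset(dev);
}
```

Calling mock_core_recover(&drv_handlers, &dev) on a zeroed dev leaves link_reset_calls at 0, which is why documenting when the callback "is done" only makes sense once some caller actually exists.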
Re: [PATCH] pci-error-recover: doc cleanup
On Fri, Dec 09, 2016 at 05:50:17PM +1100, Andrew Donnellan wrote:
>On 09/12/16 17:24, Linas Vepstas wrote:
>>I suppose I'm confused, but I recall that link resets are non-fatal.
>>Fatal errors typically require that the the pci adapter be completely
>>reset, any adapter firmware to be reloaded from scratch, the device
>>driver has to kill all device state and start from scratch. Its huge.
>
>Is there a difference in terminology between an AER fatal error and what
>EEH/IBM people think of as a fatal error?
>

They are different things. An AER fatal error can lead to a frozen PE
error, not a fenced PHB error, based on the configuration of the PHB.

>>If the fatal error is on pci device that is under a block device
>>holding a file system, then (usually) there is no way to recover,
>>because the block layer (and file system) cannot deal with a block
>>device that disappeared and then reappeared some few seconds later.
>>(maybe some future zfs or lvm or btrfs might be able to deal with
>>this, but not today)
>
>Is this still true? I'm not at all familiar with the block device side of it,
>but the cxlflash driver has reasonably full EEH support, including surviving
>a full PHB fence and complete reset.
>

It's still true, especially when the recovery is going to affect the
rootfs. On completion of error recovery, the driver (if necessary) and
the filesystem need to be reloaded, which depends on a script or daemon,
and those are unavailable in this scenario.

Thanks,
Gavin
Re: [PATCH] pci-error-recover: doc cleanup
On Fri, 9 Dec 2016 14:44:25 +0800
Linas Vepstas wrote:

> On Fri, Dec 9, 2016 at 2:37 PM, Cao jin wrote:
> >
> >
> > On 12/09/2016 02:24 PM, Linas Vepstas wrote:
> >> I suppose I'm confused, but I recall that link resets are non-fatal.
> >> Fatal errors typically require that the the pci adapter be completely
> >> reset, any adapter firmware to be reloaded from scratch, the device
> >> driver has to kill all device state and start from scratch. Its huge.
> >> If the fatal error is on pci device that is under a block device
> >> holding a file system, then (usually) there is no way to recover,
> >> because the block layer (and file system) cannot deal with a block
> >> device that disappeared and then reappeared some few seconds later.
> >> (maybe some future zfs or lvm or btrfs might be able to deal with
> >> this, but not today)
> >>
> >> By contrast, link resets are far more gentle: the device driver might
> >> have to discard some half-full FIFO's, or cancel some in-flight
> >> commands, but can otherwise gracefully recover without telling the
> >> higher layers that there were any problems.
> >>
> >> --linas
> >>
> >
> > I am little confused too, even not sure if we are talking the same
> > *fatal error*, I am talking the fatal error defined in PCI Express spec,
> > chapter 6.2.2.2.1:
> >
> > Fatal errors are uncorrectable error conditions which render the
> > particular Link and related hardware unreliable. For Fatal errors, a
> > reset of the components on the Link may be required to return to
> > reliable operation. Platform handling of Fatal errors, and any efforts
> > to limit the effects of these errors, is platform implementation specific.
> >
> > Link reset means set *secondary bus reset* bit in pci bridge config
> > space, can reset the link and device simultaneously, is the strongest
> > kind of reset as I know.
>
> OK, well, its been far too many years, and I don't have the PCI spec
> at my fingertips.
> Isn't there a link reset that can be performed, without forcing a device
> reset?
>
> The intent was that some PCI link errors are due to vibration,
> ground-bounce, humidity, etc. and that these errors can be detected
> and do not corrupt the device state or the device driver state. Since
> they are not associated with data corruption (or rather, the
> corruption is local to the link), these can be recovered by reseting
> just the link, without resetting the whole adapter. They may require
> reseting some device-driver state, but not all of it.
>
> However, this was all decided before the PCI-E spec was written, so
> maybe the newer PCI-E specs now say something different.

Perhaps you're thinking of link retraining? That sort of error would be
considered correctable, not fatal. Fatal errors are uncorrected errors
and a bigger hammer is needed to deal with them, such as a link reset.

Thanks,
Alex
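The distinction drawn in the reply above, together with the quoted remark that reset_link() is only called on a fatal error, can be summarised as a small decision table. The sketch below is a simplified user-space model, not the AER core: the enum names and choose_action() are hypothetical, and the "driver recovery" step for non-fatal errors is an assumption made for illustration.

```c
#include <stdbool.h>

/* Hypothetical severity classes mirroring the categories in the thread. */
enum err_severity { ERR_CORRECTABLE, ERR_NONFATAL, ERR_FATAL };

/* Hypothetical recovery actions used only by this sketch. */
enum recovery_action { ACT_LOG_ONLY, ACT_DRIVER_RECOVERY, ACT_LINK_RESET };

/*
 * Map a severity to the heaviest action this model applies:
 *  - correctable: already handled by hardware (e.g. link retraining),
 *    so the model only logs it;
 *  - non-fatal: left to driver-level recovery (assumption of the sketch);
 *  - fatal: the "bigger hammer", a link reset.
 */
static enum recovery_action choose_action(enum err_severity sev)
{
    switch (sev) {
    case ERR_CORRECTABLE:
        return ACT_LOG_ONLY;
    case ERR_NONFATAL:
        return ACT_DRIVER_RECOVERY;
    case ERR_FATAL:
        return ACT_LINK_RESET;
    }
    return ACT_LOG_ONLY; /* unreachable for valid enum values */
}

/* True only for fatal errors, matching the quoted do_recovery() remark. */
static bool needs_link_reset(enum err_severity sev)
{
    return choose_action(sev) == ACT_LINK_RESET;
}
```

In this model needs_link_reset() is true for ERR_FATAL only, which is the behaviour the proposed doc change ("whenever a fatal error has been detected") is trying to describe.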
Re: [PATCH] pci-error-recover: doc cleanup
On Fri, 9 Dec 2016 14:37:47 +0800
Cao jin wrote:

> I am little confused too, even not sure if we are talking the same
> *fatal error*, I am talking the fatal error defined in PCI Express spec,
> chapter 6.2.2.2.1:

Therein lies my original discomfort with the change; it didn't seem to
make sense to talk about recovering from a fatal error. Perhaps making
it "is done whenever a fatal error (as defined in section 6.2.2.2.1) has
been detected that can be "solved" by resetting the link" or something
like that to make it clear how the term is being used?

Thanks,

jon
Re: [PATCH] pci-error-recover: doc cleanup
On 12/09/2016 02:44 PM, Linas Vepstas wrote:
> On Fri, Dec 9, 2016 at 2:37 PM, Cao jin wrote:
>>
>>
>> On 12/09/2016 02:24 PM, Linas Vepstas wrote:
>>> I suppose I'm confused, but I recall that link resets are non-fatal.
>>> Fatal errors typically require that the the pci adapter be completely
>>> reset, any adapter firmware to be reloaded from scratch, the device
>>> driver has to kill all device state and start from scratch. Its huge.
>>> If the fatal error is on pci device that is under a block device
>>> holding a file system, then (usually) there is no way to recover,
>>> because the block layer (and file system) cannot deal with a block
>>> device that disappeared and then reappeared some few seconds later.
>>> (maybe some future zfs or lvm or btrfs might be able to deal with
>>> this, but not today)
>>>
>>> By contrast, link resets are far more gentle: the device driver might
>>> have to discard some half-full FIFO's, or cancel some in-flight
>>> commands, but can otherwise gracefully recover without telling the
>>> higher layers that there were any problems.
>>>
>>> --linas
>>>
>>
>> I am little confused too, even not sure if we are talking the same
>> *fatal error*, I am talking the fatal error defined in PCI Express spec,
>> chapter 6.2.2.2.1:
>>
>> Fatal errors are uncorrectable error conditions which render the
>> particular Link and related hardware unreliable. For Fatal errors, a
>> reset of the components on the Link may be required to return to
>> reliable operation. Platform handling of Fatal errors, and any efforts
>> to limit the effects of these errors, is platform implementation specific.
>>
>> Link reset means set *secondary bus reset* bit in pci bridge config
>> space, can reset the link and device simultaneously, is the strongest
>> kind of reset as I know.
>
> OK, well, its been far too many years, and I don't have the PCI spec
> at my fingertips.
> Isn't there a link reset that can be performed, without forcing a device
> reset?
>

At least I don't find the exact words saying that.

--
Sincerely,
Cao jin

> The intent was that some PCI link errors are due to vibration,
> ground-bounce, humidity, etc. and that these errors can be detected
> and do not corrupt the device state or the device driver state. Since
> they are not associated with data corruption (or rather, the
> corruption is local to the link), these can be recovered by reseting
> just the link, without resetting the whole adapter. They may require
> reseting some device-driver state, but not all of it.
>
> However, this was all decided before the PCI-E spec was written, so
> maybe the newer PCI-E specs now say something different.
>
> --linas
>
>>
>>> On Thu, Dec 8, 2016 at 10:13 PM, Cao jin wrote:
>>>>
>>>>
>>>> On 12/08/2016 10:05 PM, Jonathan Corbet wrote:
>>>>> On Thu, 8 Dec 2016 16:16:14 +0800
>>>>> Cao jin wrote:
>>>>>
>>>>>> The platform resets the link, and then calls the link_reset() callback
>>>>>> on all affected device drivers. This is a PCI-Express specific state
>>>>>> -and is done whenever a non-fatal error has been detected that can be
>>>>>> +and is done whenever a fatal error has been detected that can be
>>>>>> "solved" by resetting the link. This call informs the driver of the
>>>>>
>>>>> As far as I can tell, the original text was correct here; why do you
>>>>> think this change needs to be made?
>>>>>
>>>>
>>>> See do_recovery() in aer core, reset_link() is called only seeing fatal
>>>> error.
>>>>
>>>> --
>>>> Sincerely,
>>>> Cao jin
>>>
>>>
>>>
>>
>> --
>> Sincerely,
>> Cao jin
>>
>>
>
>
> .
>
Re: [PATCH] pci-error-recover: doc cleanup
On 12/09/2016 02:24 PM, Linas Vepstas wrote:
> I suppose I'm confused, but I recall that link resets are non-fatal.
> Fatal errors typically require that the the pci adapter be completely
> reset, any adapter firmware to be reloaded from scratch, the device
> driver has to kill all device state and start from scratch. Its huge.
> If the fatal error is on pci device that is under a block device
> holding a file system, then (usually) there is no way to recover,
> because the block layer (and file system) cannot deal with a block
> device that disappeared and then reappeared some few seconds later.
> (maybe some future zfs or lvm or btrfs might be able to deal with
> this, but not today)
>
> By contrast, link resets are far more gentle: the device driver might
> have to discard some half-full FIFO's, or cancel some in-flight
> commands, but can otherwise gracefully recover without telling the
> higher layers that there were any problems.
>
> --linas
>

I am a little confused too; I am not even sure we are talking about the
same *fatal error*. I am talking about the fatal error defined in the
PCI Express spec, chapter 6.2.2.2.1:

Fatal errors are uncorrectable error conditions which render the
particular Link and related hardware unreliable. For Fatal errors, a
reset of the components on the Link may be required to return to
reliable operation. Platform handling of Fatal errors, and any efforts
to limit the effects of these errors, is platform implementation specific.

Link reset means setting the *secondary bus reset* bit in the PCI bridge
config space. It resets the link and the device simultaneously, and is
the strongest kind of reset as far as I know.

> On Thu, Dec 8, 2016 at 10:13 PM, Cao jin wrote:
>>
>>
>> On 12/08/2016 10:05 PM, Jonathan Corbet wrote:
>>> On Thu, 8 Dec 2016 16:16:14 +0800
>>> Cao jin wrote:
>>>
>>>> The platform resets the link, and then calls the link_reset() callback
>>>> on all affected device drivers. This is a PCI-Express specific state
>>>> -and is done whenever a non-fatal error has been detected that can be
>>>> +and is done whenever a fatal error has been detected that can be
>>>> "solved" by resetting the link. This call informs the driver of the
>>>
>>> As far as I can tell, the original text was correct here; why do you
>>> think this change needs to be made?
>>>
>>
>> See do_recovery() in aer core, reset_link() is called only seeing fatal
>> error.
>>
>> --
>> Sincerely,
>> Cao jin
>>
>>
>
>
>

--
Sincerely,
Cao jin
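To make the "secondary bus reset bit in the bridge config space" concrete, here is a minimal user-space sketch that toggles the bit in a mock config space. The register offset (Bridge Control at 0x3E) and the bit mask (0x40) follow the standard type 1 header layout; struct mock_bridge and the helper functions are hypothetical and stand in for real config-space accessors. Real hardware also needs the reset held for a minimum time and a settling delay after release, which this sketch omits.

```c
#include <stdint.h>

/* Bridge Control register offset and Secondary Bus Reset bit (type 1 header). */
#define PCI_BRIDGE_CONTROL        0x3E
#define PCI_BRIDGE_CTL_BUS_RESET  0x40

/* Hypothetical stand-in for a bridge's 256-byte configuration space. */
struct mock_bridge {
    uint8_t cfg[256];
};

/* Read a little-endian 16-bit register from the mock config space. */
static uint16_t cfg_read16(const struct mock_bridge *b, unsigned int off)
{
    return (uint16_t)(b->cfg[off] | (b->cfg[off + 1] << 8));
}

/* Write a little-endian 16-bit register to the mock config space. */
static void cfg_write16(struct mock_bridge *b, unsigned int off, uint16_t val)
{
    b->cfg[off] = (uint8_t)(val & 0xff);
    b->cfg[off + 1] = (uint8_t)(val >> 8);
}

/* Set the reset bit: everything below the bridge is held in reset. */
static void sbr_set(struct mock_bridge *b)
{
    uint16_t ctrl = cfg_read16(b, PCI_BRIDGE_CONTROL);
    cfg_write16(b, PCI_BRIDGE_CONTROL, ctrl | PCI_BRIDGE_CTL_BUS_RESET);
}

/* Clear the reset bit, leaving the other control bits untouched. */
static void sbr_clear(struct mock_bridge *b)
{
    uint16_t ctrl = cfg_read16(b, PCI_BRIDGE_CONTROL);
    cfg_write16(b, PCI_BRIDGE_CONTROL,
                (uint16_t)(ctrl & ~PCI_BRIDGE_CTL_BUS_RESET));
}
```

Because the bit lives in the upstream bridge rather than in the device, setting it resets the link and every device behind it at once, which is why the message calls it the strongest kind of reset.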
Re: [PATCH] pci-error-recover: doc cleanup
On 09/12/16 17:24, Linas Vepstas wrote:
> I suppose I'm confused, but I recall that link resets are non-fatal.
> Fatal errors typically require that the the pci adapter be completely
> reset, any adapter firmware to be reloaded from scratch, the device
> driver has to kill all device state and start from scratch. Its huge.

Is there a difference in terminology between an AER fatal error and what
EEH/IBM people think of as a fatal error?

> If the fatal error is on pci device that is under a block device
> holding a file system, then (usually) there is no way to recover,
> because the block layer (and file system) cannot deal with a block
> device that disappeared and then reappeared some few seconds later.
> (maybe some future zfs or lvm or btrfs might be able to deal with
> this, but not today)

Is this still true? I'm not at all familiar with the block device side of it,
but the cxlflash driver has reasonably full EEH support, including surviving
a full PHB fence and complete reset.

--
Andrew Donnellan              OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited
Re: [PATCH] pci-error-recover: doc cleanup
On Fri, Dec 9, 2016 at 2:37 PM, Cao jin wrote:
>
>
> On 12/09/2016 02:24 PM, Linas Vepstas wrote:
>> I suppose I'm confused, but I recall that link resets are non-fatal.
>> Fatal errors typically require that the the pci adapter be completely
>> reset, any adapter firmware to be reloaded from scratch, the device
>> driver has to kill all device state and start from scratch. Its huge.
>> If the fatal error is on pci device that is under a block device
>> holding a file system, then (usually) there is no way to recover,
>> because the block layer (and file system) cannot deal with a block
>> device that disappeared and then reappeared some few seconds later.
>> (maybe some future zfs or lvm or btrfs might be able to deal with
>> this, but not today)
>>
>> By contrast, link resets are far more gentle: the device driver might
>> have to discard some half-full FIFO's, or cancel some in-flight
>> commands, but can otherwise gracefully recover without telling the
>> higher layers that there were any problems.
>>
>> --linas
>>
>
> I am little confused too, even not sure if we are talking the same
> *fatal error*, I am talking the fatal error defined in PCI Express spec,
> chapter 6.2.2.2.1:
>
> Fatal errors are uncorrectable error conditions which render the
> particular Link and related hardware unreliable. For Fatal errors, a
> reset of the components on the Link may be required to return to
> reliable operation. Platform handling of Fatal errors, and any efforts
> to limit the effects of these errors, is platform implementation specific.
>
> Link reset means set *secondary bus reset* bit in pci bridge config
> space, can reset the link and device simultaneously, is the strongest
> kind of reset as I know.

OK, well, it's been far too many years, and I don't have the PCI spec
at my fingertips.
Isn't there a link reset that can be performed without forcing a device
reset?

The intent was that some PCI link errors are due to vibration,
ground-bounce, humidity, etc. and that these errors can be detected
and do not corrupt the device state or the device driver state. Since
they are not associated with data corruption (or rather, the
corruption is local to the link), these can be recovered by resetting
just the link, without resetting the whole adapter. They may require
resetting some device-driver state, but not all of it.

However, this was all decided before the PCI-E spec was written, so
maybe the newer PCI-E specs now say something different.

--linas

>
>> On Thu, Dec 8, 2016 at 10:13 PM, Cao jin wrote:
>>>
>>>
>>> On 12/08/2016 10:05 PM, Jonathan Corbet wrote:
>>>> On Thu, 8 Dec 2016 16:16:14 +0800
>>>> Cao jin wrote:
>>>>
>>>>> The platform resets the link, and then calls the link_reset() callback
>>>>> on all affected device drivers. This is a PCI-Express specific state
>>>>> -and is done whenever a non-fatal error has been detected that can be
>>>>> +and is done whenever a fatal error has been detected that can be
>>>>> "solved" by resetting the link. This call informs the driver of the
>>>>
>>>> As far as I can tell, the original text was correct here; why do you
>>>> think this change needs to be made?
>>>>
>>>
>>> See do_recovery() in aer core, reset_link() is called only seeing fatal
>>> error.
>>>
>>> --
>>> Sincerely,
>>> Cao jin
>>>
>>>
>>
>>
>>
>
> --
> Sincerely,
> Cao jin
>
>
Re: [PATCH] pci-error-recover: doc cleanup
I suppose I'm confused, but I recall that link resets are non-fatal.
Fatal errors typically require that the PCI adapter be completely
reset, any adapter firmware be reloaded from scratch, and the device
driver kill all device state and start from scratch. It's huge.
If the fatal error is on a PCI device that is under a block device
holding a file system, then (usually) there is no way to recover,
because the block layer (and file system) cannot deal with a block
device that disappeared and then reappeared a few seconds later.
(Maybe some future zfs or lvm or btrfs might be able to deal with
this, but not today.)

By contrast, link resets are far more gentle: the device driver might
have to discard some half-full FIFOs, or cancel some in-flight
commands, but can otherwise gracefully recover without telling the
higher layers that there were any problems.

--linas

On Thu, Dec 8, 2016 at 10:13 PM, Cao jin wrote:
>
>
> On 12/08/2016 10:05 PM, Jonathan Corbet wrote:
>> On Thu, 8 Dec 2016 16:16:14 +0800
>> Cao jin wrote:
>>
>>> The platform resets the link, and then calls the link_reset() callback
>>> on all affected device drivers. This is a PCI-Express specific state
>>> -and is done whenever a non-fatal error has been detected that can be
>>> +and is done whenever a fatal error has been detected that can be
>>> "solved" by resetting the link. This call informs the driver of the
>>
>> As far as I can tell, the original text was correct here; why do you
>> think this change needs to be made?
>>
>
> See do_recovery() in aer core, reset_link() is called only seeing fatal
> error.
>
> --
> Sincerely,
> Cao jin
>
Re: [PATCH] pci-error-recover: doc cleanup
On 12/08/2016 10:05 PM, Jonathan Corbet wrote:
> On Thu, 8 Dec 2016 16:16:14 +0800
> Cao jin wrote:
>
>> The platform resets the link, and then calls the link_reset() callback
>> on all affected device drivers. This is a PCI-Express specific state
>> -and is done whenever a non-fatal error has been detected that can be
>> +and is done whenever a fatal error has been detected that can be
>> "solved" by resetting the link. This call informs the driver of the
>
> As far as I can tell, the original text was correct here; why do you
> think this change needs to be made?
>

See do_recovery() in the AER core: reset_link() is called only when a
fatal error is seen.

--
Sincerely,
Cao jin
Re: [PATCH] pci-error-recover: doc cleanup
On Thu, 8 Dec 2016 16:16:14 +0800
Cao jin wrote:

> The platform resets the link, and then calls the link_reset() callback
> on all affected device drivers. This is a PCI-Express specific state
> -and is done whenever a non-fatal error has been detected that can be
> +and is done whenever a fatal error has been detected that can be
> "solved" by resetting the link. This call informs the driver of the

As far as I can tell, the original text was correct here; why do you
think this change needs to be made?

Thanks,

jon