Re: [PATCH v7 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback
On Wed, Nov 25, 2020 at 2:47 AM Guilherme G. Piccoli wrote: > > Hi Kuppuswamy Sathyanarayanan (and all involved here), thanks for the > patch! I'd like to ask what is the status of this patchset - I just > "parachuted" in the issue, and by tracking the linux-pci ML, I found > this V7 (and all previous versions since V2). Also, noticed that Jay's > email might have gotten lost in translation (he's not CCed in latest > versions of the patchset). > > I was able to find even another interesting thread that might be > related, Ethan's patchset. So, if any of the developers can clarify the Thank you for asking. > current status of this patchset or if the functionality hereby proposed > ended-up being implemented in another patch, I appreciate a lot. > > Thanks in advance! Below, some references to lore archives. > Cheers, > > > Guilherme > > > References: > > This V7 link: > https://lore.kernel.org/linux-pci/546d346644654915877365b19ea534378db0894d.1602788209.git.sathyanarayanan.kuppusw...@linux.intel.com/ > > V6: > https://lore.kernel.org/linux-pci/546d346644654915877365b19ea534378db0894d.1602663397.git.sathyanarayanan.kuppusw...@linux.intel.com/#t > > V5: > https://lore.kernel.org/linux-pci/162495c76c391de6e021919e2b69c5cd2dbbc22a.1602632140.git.sathyanarayanan.kuppusw...@linux.intel.com/ > > V4: > https://lore.kernel.org/linux-pci/5c5bca0bdb958e456176fe6ede10ba8f838fbafc.1602263264.git.sathyanarayanan.kuppusw...@linux.intel.com/ > > V3: > https://lore.kernel.org/linux-pci/cbba08a5e9ca62778c8937f44eda2192a2045da7.1595617529.git.sathyanarayanan.kuppusw...@linux.intel.com/ > > V2: > https://lore.kernel.org/linux-pci/ce417fbf81a8a46a89535f44b9224ee9fbb55a29.1591307288.git.sathyanarayanan.kuppusw...@linux.intel.com/#t > > Ethan's related(?) patchset, V8 : > https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/#t This patchset I think need more discussion and maintainer's (Bjorn's) suggestion or advice. You are welcome to try it and give me the feedback. Thanks, Ethan >
Re: [PATCH v11 4/5] PCI/portdrv: Remove redundant pci_aer_available() check in DPC enable logic
The current logic is if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) && pci_aer_available() && (pcie_ports_dpc_native || (services & PCIE_PORT_SERVICE_AER))) services |= PCIE_PORT_SERVICE_DPC; if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) && pci_aer_available() && (pcie_ports_dpc_native)) services |= PCIE_PORT_SERVICE_DPC; or if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) && pci_aer_available() &&(services & PCIE_PORT_SERVICE_AER)) services |= PCIE_PORT_SERVICE_DPC; do you mean one of the possible is if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) && (pcie_ports_dpc_native)) services |= PCIE_PORT_SERVICE_DPC; after your patch ? nothing about AER ? Thanks, Ethan On Thu, Oct 29, 2020 at 1:14 AM Kuppuswamy, Sathyanarayanan wrote: > > > > On 10/27/20 11:00 PM, Ethan Zhao wrote: > > On Tue, Oct 27, 2020 at 10:00 PM Kuppuswamy Sathyanarayanan > > wrote: > >> > >> In DPC service enable logic, check for > >> services & PCIE_PORT_SERVICE_AER implies pci_aer_available() > > How about PCIE_PORT_SERVICE_AER is not configured, but > > pcie_aer_disable == 0 ? > Its not possible in current code flow. DPC service is configured > following AER service configuration. > >> is true. So there is no need to explicitly check it again. > >> > >> Also, passing pcie_ports=dpc-native in kernel command line > >> implies DPC needs to be enabled in native mode irrespective > >> of AER ownership status. So checking for pci_aer_available() > >> without checking for pcie_ports status is incorrect. > >> > >> Signed-off-by: Kuppuswamy Sathyanarayanan > >> > >> --- > >> drivers/pci/pcie/portdrv_core.c | 1 - > >> 1 file changed, 1 deletion(-) > >> > >> diff --git a/drivers/pci/pcie/portdrv_core.c > >> b/drivers/pci/pcie/portdrv_core.c > >> index 2c0278f0fdcc..e257a2ca3595 100644 > >> --- a/drivers/pci/pcie/portdrv_core.c > >> +++ b/drivers/pci/pcie/portdrv_core.c > >> @@ -252,7 +252,6 @@ static int get_port_device_capability(struct pci_dev > >> *dev) > >> * permission to use AER. > >> */ > >> if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) && > >> - pci_aer_available() && > >> (host->native_dpc || (services & PCIE_PORT_SERVICE_AER))) > >> services |= PCIE_PORT_SERVICE_DPC; > >> > >> -- > >> 2.17.1 > >> > > -- > Sathyanarayanan Kuppuswamy > Linux Kernel Developer
Re: [PATCH] AER: aer_root_reset() non-native handling
On Sat, Oct 31, 2020 at 6:36 AM Sean V Kelley wrote: > > If an OS has not been granted AER control via _OSC, then > the OS should not make changes to PCI_ERR_ROOT_COMMAND and > PCI_ERR_ROOT_STATUS related registers. Per section 4.5.1 of > the System Firmware Intermediary (SFI) _OSC and DPC Updates > ECN [1], this bit also covers these aspects of the PCI > Express Advanced Error Reporting. Further, the handling of > clear/enable of PCI_ERROR_ROOT_COMMAND when wrapped around > PCI_ERR_ROOT_STATUS should have no effect and be removed. > Based on the above and earlier discussion [2], make the > following changes: > > Add a check for the native case (i.e., AER control via _OSC) > Re-order and remove some of the handling: > - clear PCI_ERR_ROOT_COMMAND ROOT_PORT_INTR_ON_MESG_MASK > - do reset > - clear PCI_ERR_ROOT_STATUS > - enable PCI_ERR_ROOT_COMMAND ROOT_PORT_INTR_ON_MESG_MASK > > to this: > > - clear PCI_ERR_ROOT_STATUS > - do reset > > The current "clear, reset, enable" order suggests that the reset > might cause errors that we should ignore. But I am unable to find a > reference and the clearing of PCI_ERR_ROOT_STATUS does not require them. > > [1] System Firmware Intermediary (SFI) _OSC and DPC Updates ECN, Feb 24, > 2020, affecting PCI Firmware Specification, Rev. 3.2 > https://members.pcisig.com/wg/PCI-SIG/document/14076 > [2] > https://lore.kernel.org/linux-pci/20201020162820.GA370938@bjorn-Precision-5520/ > > Signed-off-by: Sean V Kelley > --- > drivers/pci/pcie/aer.c | 21 ++--- > 1 file changed, 6 insertions(+), 15 deletions(-) > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > index 65dff5f3457a..bbfb07842d89 100644 > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c > @@ -1361,23 +1361,14 @@ static pci_ers_result_t aer_root_reset(struct pci_dev > *dev) > u32 reg32; > int rc; > > - > - /* Disable Root's interrupt in response to error messages */ > - pci_read_config_dword(dev, aer + PCI_ERR_ROOT_COMMAND, ); > - reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK; > - pci_write_config_dword(dev, aer + PCI_ERR_ROOT_COMMAND, reg32); There may be some reasons to disable interrupt first and then do resetting, clear status and re-enable interrupt. Perhaps to avoid error noise, what would happen if the resetting causes errors itself ? Thanks, Ethan > + if (pcie_aer_is_native(dev)) { > + /* Clear Root Error Status */ > + pci_read_config_dword(dev, aer + PCI_ERR_ROOT_STATUS, ); > + pci_write_config_dword(dev, aer + PCI_ERR_ROOT_STATUS, reg32); > + } > > rc = pci_bus_error_reset(dev); > - pci_info(dev, "Root Port link has been reset\n"); > - > - /* Clear Root Error Status */ > - pci_read_config_dword(dev, aer + PCI_ERR_ROOT_STATUS, ); > - pci_write_config_dword(dev, aer + PCI_ERR_ROOT_STATUS, reg32); > - > - /* Enable Root Port's interrupt in response to error messages */ > - pci_read_config_dword(dev, aer + PCI_ERR_ROOT_COMMAND, ); > - reg32 |= ROOT_PORT_INTR_ON_MESG_MASK; > - pci_write_config_dword(dev, aer + PCI_ERR_ROOT_COMMAND, reg32); > + pci_info(dev, "Root Port link has been reset (%d)\n", rc); > > return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED; > } > -- > 2.29.2 >
Re: [PATCH v11 1/5] PCI: Conditionally initialize host bridge native_* members
On Tue, Oct 27, 2020 at 10:00 PM Kuppuswamy Sathyanarayanan wrote: > > If CONFIG_PCIEPORTBUS is not enabled in kernel then initialing > struct pci_host_bridge PCIe specific native_* members to "1" is > incorrect. So protect the PCIe specific member initialization > with CONFIG_PCIEPORTBUS. > > Signed-off-by: Kuppuswamy Sathyanarayanan > > --- > drivers/pci/probe.c | 6 -- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c > index 4289030b0fff..756fa60ca708 100644 > --- a/drivers/pci/probe.c > +++ b/drivers/pci/probe.c > @@ -588,12 +588,14 @@ static void pci_init_host_bridge(struct pci_host_bridge > *bridge) > * may implement its own AER handling and use _OSC to prevent the > * OS from interfering. > */ > +#ifdef CONFIG_PCIEPORTBUS > bridge->native_aer = 1; > bridge->native_pcie_hotplug = 1; > - bridge->native_shpc_hotplug = 1; > bridge->native_pme = 1; > - bridge->native_ltr = 1; > bridge->native_dpc = 1; > +#endif If CONFIG_PCIEPORTBUS wasn't defined, leave them to "unknown" value ? > + bridge->native_ltr = 1; > + bridge->native_shpc_hotplug = 1; > > device_initialize(>dev); > } > -- > 2.17.1 >
Re: [PATCH v10 5/5] PCI/DPC: Move AER/DPC dependency checks out of DPC driver
On Tue, Oct 27, 2020 at 11:12 AM Kuppuswamy Sathyanarayanan wrote: > > Currently, AER and DPC Capabilities dependency checks is > distributed between DPC and portdrv service drivers. So move > them out of DPC driver. > > Also, since services & PCIE_PORT_SERVICE_AER check already > ensures AER native ownership, no need to add additional > pcie_aer_is_native() check. > > Signed-off-by: Kuppuswamy Sathyanarayanan > > --- > drivers/pci/pcie/dpc.c | 4 > drivers/pci/pcie/portdrv_core.c | 1 + > 2 files changed, 1 insertion(+), 4 deletions(-) > > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c > index 21f77420632b..a8b92207 100644 > --- a/drivers/pci/pcie/dpc.c > +++ b/drivers/pci/pcie/dpc.c > @@ -283,14 +283,10 @@ void pci_dpc_init(struct pci_dev *pdev) > static int dpc_probe(struct pcie_device *dev) > { > struct pci_dev *pdev = dev->port; > - struct pci_host_bridge *host = pci_find_host_bridge(pdev->bus); > struct device *device = >device; > int status; > u16 ctl, cap; > > - if (!pcie_aer_is_native(pdev) && !host->native_dpc) > - return -ENOTSUPP; > - > status = devm_request_threaded_irq(device, dev->irq, dpc_irq, >dpc_handler, IRQF_SHARED, >"pcie-dpc", pdev); > diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c > index e257a2ca3595..ffa1d9fc458e 100644 > --- a/drivers/pci/pcie/portdrv_core.c > +++ b/drivers/pci/pcie/portdrv_core.c > @@ -252,6 +252,7 @@ static int get_port_device_capability(struct pci_dev *dev) > * permission to use AER. > */ > if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) && > + host->native_dpc && What hell the logic is here ? > (host->native_dpc || (services & PCIE_PORT_SERVICE_AER))) > services |= PCIE_PORT_SERVICE_DPC; > > -- > 2.17.1 >
Re: [PATCH v11 4/5] PCI/portdrv: Remove redundant pci_aer_available() check in DPC enable logic
On Tue, Oct 27, 2020 at 10:00 PM Kuppuswamy Sathyanarayanan wrote: > > In DPC service enable logic, check for > services & PCIE_PORT_SERVICE_AER implies pci_aer_available() How about PCIE_PORT_SERVICE_AER is not configured, but pcie_aer_disable == 0 ? > is true. So there is no need to explicitly check it again. > > Also, passing pcie_ports=dpc-native in kernel command line > implies DPC needs to be enabled in native mode irrespective > of AER ownership status. So checking for pci_aer_available() > without checking for pcie_ports status is incorrect. > > Signed-off-by: Kuppuswamy Sathyanarayanan > > --- > drivers/pci/pcie/portdrv_core.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c > index 2c0278f0fdcc..e257a2ca3595 100644 > --- a/drivers/pci/pcie/portdrv_core.c > +++ b/drivers/pci/pcie/portdrv_core.c > @@ -252,7 +252,6 @@ static int get_port_device_capability(struct pci_dev *dev) > * permission to use AER. > */ > if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) && > - pci_aer_available() && > (host->native_dpc || (services & PCIE_PORT_SERVICE_AER))) > services |= PCIE_PORT_SERVICE_DPC; > > -- > 2.17.1 >
Re: [PATCH V2 1/2] PCI/AER: Add pcie_is_ecrc_enabled() API
On Wed, Oct 28, 2020 at 9:48 AM Vidya Sagar wrote: > > Adds pcie_is_ecrc_enabled() API to let other sub-systems (like DesignWare) > to query if ECRC policy is enabled and perform any configuration > required in those respective sub-systems. > > Signed-off-by: Vidya Sagar > --- > V2: > * None from V1 > > drivers/pci/pci.h | 2 ++ > drivers/pci/pcie/aer.c | 11 +++ > 2 files changed, 13 insertions(+) > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h > index fa12f7cbc1a0..325fdbf91dde 100644 > --- a/drivers/pci/pci.h > +++ b/drivers/pci/pci.h > @@ -575,9 +575,11 @@ static inline void > pcie_aspm_powersave_config_link(struct pci_dev *pdev) { } > #ifdef CONFIG_PCIE_ECRC > void pcie_set_ecrc_checking(struct pci_dev *dev); > void pcie_ecrc_get_policy(char *str); > +bool pcie_is_ecrc_enabled(void); > #else > static inline void pcie_set_ecrc_checking(struct pci_dev *dev) { } > static inline void pcie_ecrc_get_policy(char *str) { } > +static inline bool pcie_is_ecrc_enabled(void) { return false; } > #endif > > #ifdef CONFIG_PCIE_PTM > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > index 65dff5f3457a..24363c895aba 100644 > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c > @@ -207,6 +207,17 @@ void pcie_ecrc_get_policy(char *str) > > ecrc_policy = i; > } > + > +/** > + * pcie_is_ecrc_enabled - returns if ECRC is enabled in the system or not > + * > + * Returns 1 if ECRC policy is enabled and 0 otherwise How about 'Returns true if ECRC policy is enabled and false otherwise' ? > + */ > +bool pcie_is_ecrc_enabled(void) > +{ > + return ecrc_policy == ECRC_POLICY_ON; > +} > +EXPORT_SYMBOL(pcie_is_ecrc_enabled); > #endif /* CONFIG_PCIE_ECRC */ > > #definePCI_EXP_AER_FLAGS (PCI_EXP_DEVCTL_CERE | > PCI_EXP_DEVCTL_NFERE | \ > -- > 2.17.1 >
Re: [PATCH v2] pciehp: Add check for DL_ACTIVE bit in pciehp_check_link_status()
On Tue, Oct 20, 2020 at 2:33 PM Sanjay R Mehta wrote: > > From: Sanjay R Mehta > > if DL_ACTIVE bit is set it means that there is no need to check > PCI_EXP_LNKSTA_LT bit, as DL_ACTIVE would have set only if the link > is already trained. Hence adding a check which takes care of this > scenario. Sorry, I couldn't understand the logic here. if DL_ACTIVE was set, PCI_EXP_LNKSTA_LT is to be cleared, vice versa. why need (lnk_status & PCI_EXP_LNKSTA_LT) && !(lnk_status & PCI_EXP_LNKSTA_DLLLA) Double safe ? Thanks, Ethan > > Signed-off-by: Sanjay R Mehta > --- > drivers/pci/hotplug/pciehp_hpc.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/pci/hotplug/pciehp_hpc.c > b/drivers/pci/hotplug/pciehp_hpc.c > index 53433b37e181..8ab2f6a2f388 100644 > --- a/drivers/pci/hotplug/pciehp_hpc.c > +++ b/drivers/pci/hotplug/pciehp_hpc.c > @@ -309,7 +309,8 @@ int pciehp_check_link_status(struct controller *ctrl) > > pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, _status); > ctrl_dbg(ctrl, "%s: lnk_status = %x\n", __func__, lnk_status); > - if ((lnk_status & PCI_EXP_LNKSTA_LT) || > + if (((lnk_status & PCI_EXP_LNKSTA_LT) && > +!(lnk_status & PCI_EXP_LNKSTA_DLLLA)) || > !(lnk_status & PCI_EXP_LNKSTA_NLW)) { > ctrl_err(ctrl, "link training error: status %#06x\n", > lnk_status); > -- > 2.25.1 >
Re: [PATCH v9 12/15] PCI/RCEC: Add RCiEP's linked RCEC to AER/ERR
On Sat, Oct 17, 2020 at 6:29 AM Bjorn Helgaas wrote: > > [+cc Christoph, Ethan, Sinan, Keith; sorry should have cc'd you to > begin with since you're looking at this code too. Particularly > interested in your thoughts about whether we should be touching > PCI_ERR_ROOT_COMMAND and PCI_ERR_ROOT_STATUS when we don't own AER.] aer_root_reset() function has a prefix 'aer_', looks like it's a function of aer driver, will only be called by aer driver at runtime. if so it's up to the owner/aer to know if OSPM is granted to init. while actually some of the functions and runtime service of aer driver is also shared by GHES driver (running time) and DPC driver (compiling time ?) etc. then it is confused now. Shall we move some of the shared functions and running time service to pci/err.c ? if so , just like pcie_do_recovery(), it's share by firmware_first mode GHES ghes_probe() ->ghes_irq_func ->ghes_proc ->ghes_do_proc() ->ghes_handle_aer() ->aer_recover_work_func() ->pcie_do_recovery() ->aer_root_reset() and aer driver etc. if aer wants to do some access might conflict with firmware(or firmware in embedded controller) should check _OSC_ etc first. blindly issue PCI_ERR_ROOT_COMMAND or clear PCI_ERR_ROOT_STATUS *likely* cause errors by error handling itself. Thanks, Ethan > > On Fri, Oct 16, 2020 at 03:30:37PM -0500, Bjorn Helgaas wrote: > > [+to Jonathan] > > > > On Thu, Oct 15, 2020 at 05:11:10PM -0700, Sean V Kelley wrote: > > > From: Qiuxu Zhuo > > > > > > When attempting error recovery for an RCiEP associated with an RCEC > > > device, > > > there needs to be a way to update the Root Error Status, the Uncorrectable > > > Error Status and the Uncorrectable Error Severity of the parent RCEC. In > > > some non-native cases in which there is no OS-visible device associated > > > with the RCiEP, there is nothing to act upon as the firmware is acting > > > before the OS. > > > > > > Add handling for the linked RCEC in AER/ERR while taking into account > > > non-native cases. > > > > > > Co-developed-by: Sean V Kelley > > > Link: > > > https://lore.kernel.org/r/20201002184735.1229220-12-seanvk@oregontracks.org > > > Signed-off-by: Sean V Kelley > > > Signed-off-by: Qiuxu Zhuo > > > Signed-off-by: Bjorn Helgaas > > > Reviewed-by: Jonathan Cameron > > > --- > > > drivers/pci/pcie/aer.c | 53 ++ > > > drivers/pci/pcie/err.c | 20 > > > 2 files changed, 48 insertions(+), 25 deletions(-) > > > > > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > > > index 65dff5f3457a..083f69b67bfd 100644 > > > --- a/drivers/pci/pcie/aer.c > > > +++ b/drivers/pci/pcie/aer.c > > > @@ -1357,27 +1357,50 @@ static int aer_probe(struct pcie_device *dev) > > > */ > > > static pci_ers_result_t aer_root_reset(struct pci_dev *dev) > > > { > > > - int aer = dev->aer_cap; > > > + int type = pci_pcie_type(dev); > > > + struct pci_dev *root; > > > + int aer = 0; > > > + int rc = 0; > > > u32 reg32; > > > - int rc; > > > > > > + if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_END) > > > > "type == PCI_EXP_TYPE_RC_END" > > > > > + /* > > > +* The reset should only clear the Root Error Status > > > +* of the RCEC. Only perform this for the > > > +* native case, i.e., an RCEC is present. > > > +*/ > > > + root = dev->rcec; > > > + else > > > + root = dev; > > > > > > - /* Disable Root's interrupt in response to error messages */ > > > - pci_read_config_dword(dev, aer + PCI_ERR_ROOT_COMMAND, ); > > > - reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK; > > > - pci_write_config_dword(dev, aer + PCI_ERR_ROOT_COMMAND, reg32); > > > + if (root) > > > + aer = dev->aer_cap; > > > > > > - rc = pci_bus_error_reset(dev); > > > - pci_info(dev, "Root Port link has been reset\n"); > > > + if (aer) { > > > + /* Disable Root's interrupt in response to error messages */ > > > + pci_read_config_dword(root, aer + PCI_ERR_ROOT_COMMAND, > > > ); > > > + reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK; > > > + pci_write_config_dword(root, aer + PCI_ERR_ROOT_COMMAND, > > > reg32); > > > > Not directly related to *this* patch, but my assumption was that in > > the APEI case, the firmware should retain ownership of the AER > > Capability, so the OS should not touch PCI_ERR_ROOT_COMMAND and > > PCI_ERR_ROOT_STATUS. > > > > But this code appears to ignore that ownership. Jonathan, you must > > have looked at this recently for 068c29a248b6 ("PCI/ERR: Clear PCIe > > Device Status errors only if OS owns AER"). Do you have any insight > > about this? > > > > > - /* Clear Root Error Status */ > > > - pci_read_config_dword(dev, aer + PCI_ERR_ROOT_STATUS, ); > > > - pci_write_config_dword(dev, aer + PCI_ERR_ROOT_STATUS, reg32); > > > + /* Clear Root Error Status */ > > > +
Re: [PATCH v6 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling
On Thu, Oct 15, 2020 at 1:53 PM Kuppuswamy, Sathyanarayanan wrote: > > > > On 10/14/20 10:05 PM, Ethan Zhao wrote: > > On Thu, Oct 15, 2020 at 11:04 AM Kuppuswamy, Sathyanarayanan > > wrote: > >> > >> > >> > >> On 10/14/20 6:58 PM, Ethan Zhao wrote: > >>> On Thu, Oct 15, 2020 at 1:06 AM Kuppuswamy, Sathyanarayanan > >>> wrote: > >>>> > >>>> > >>>> > >>>> On 10/14/20 8:07 AM, Ethan Zhao wrote: > >>>>> On Wed, Oct 14, 2020 at 5:00 PM Kuppuswamy Sathyanarayanan > >>>>> wrote: > >>>>>> > >>>>>> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") > >>>>>> merged fatal and non-fatal error recovery paths, and also made > >>>>>> recovery code depend on hotplug handler for "remove affected > >>>>>> device + rescan" support. But this change also complicated the > >>>>>> error recovery path and which in turn led to the following > >>>>>> issues. > >>>>>> > >>>>>> 1. We depend on hotplug handler for removing the affected > >>>>>> devices/drivers on DLLSC LINK down event (on DPC event > >>>>>> trigger) and DPC handler for handling the error recovery. Since > >>>>>> both handlers operate on same set of affected devices, it leads > >>>>>> to race condition, which in turn leads to NULL pointer > >>>>>> exceptions or error recovery failures.You can find more details > >>>>>> about this issue in following link. > >>>>>> > >>>>>> https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t > >>>>>> > >>>>>> 2. For non-hotplug capable devices fatal (DPC) error recovery > >>>>>> is currently broken. Current fatal error recovery implementation > >>>>>> relies on PCIe hotplug (pciehp) handler for detaching and > >>>>>> re-enumerating the affected devices/drivers. So when dealing with > >>>>>> non-hotplug capable devices, recovery code does not restore the state > >>>>>> of the affected devices correctly. You can find more details about > >>>>>> this issue in the following links. > >>>>>> > >>>>>> https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/ > >>>>>> https://lore.kernel.org/linux-pci/12115.1588207324@famine/ > >>>>>> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/ > >>>>>> > >>>>>> In order to fix the above two issues, we should stop relying on hotplug > >>>>> Yes, it doesn't rely on hotplug handler to remove and rescan the > >>>>> device, > >>>>> but it couldn't prevent hotplug drivers from doing another replicated > >>>>> removal/rescanning. > >>>>> it doesn't make sense to leave another useless removal/rescanning there. > >>>>> Maybe that's why these two paths were merged to one and made it rely on > >>>>> hotplug. > >>>> No, as per PCIe spec, hotplug and DPC has no functional dependency. Hence > >>>> depending on it to handle some of its recovery function is in-correct and > >>>> would lead to issues in non-hotplug capable platforms (which is true > >>>> currently). > >>>>> > >> > > > > >>>Though pciehp is not so hot/scalable and performance critical, but > >>> there > >>>is per cpu thread to handle hot-plug operation. synchronize all threads > >>>make them walk backwards for scalability. > >> DPC events does not happen in high frequency. So I don't think we should > > It's holding global lock, once malfunction happens to one device and > > it's driver, > > the whole system, everyone holds it, would be blocked to work. > >> worry about the performance here. Even hotplug handler will hold this lock > >> when adding/removing the devices. So adding/removing devices is a > >> serialized > > You don't worry about performance, but if there is a requirement needs > > more scalable > > and reliable hotplug, the effect will be much harder. what to do then ? > > choo
Re: [PATCH v6 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling
On Wed, Oct 14, 2020 at 5:00 PM Kuppuswamy Sathyanarayanan wrote: > > Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") > merged fatal and non-fatal error recovery paths, and also made > recovery code depend on hotplug handler for "remove affected > device + rescan" support. But this change also complicated the > error recovery path and which in turn led to the following > issues. > > 1. We depend on hotplug handler for removing the affected > devices/drivers on DLLSC LINK down event (on DPC event > trigger) and DPC handler for handling the error recovery. Since > both handlers operate on same set of affected devices, it leads > to race condition, which in turn leads to NULL pointer > exceptions or error recovery failures.You can find more details > about this issue in following link. > > https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t > > 2. For non-hotplug capable devices fatal (DPC) error recovery > is currently broken. Current fatal error recovery implementation > relies on PCIe hotplug (pciehp) handler for detaching and > re-enumerating the affected devices/drivers. So when dealing with > non-hotplug capable devices, recovery code does not restore the state > of the affected devices correctly. You can find more details about > this issue in the following links. > > https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/ > https://lore.kernel.org/linux-pci/12115.1588207324@famine/ > https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/ > > In order to fix the above two issues, we should stop relying on hotplug > handler for cleaning the affected devices/drivers and let error recovery > handler own this functionality. So this patch reverts Commit bdb5ac85777d > ("PCI/ERR: Handle fatal error recovery") and re-introduce the "remove > affected device + rescan" functionality in fatal error recovery handler. > > Also holding pci_lock_rescan_remove() will prevent the race between hotplug > and DPC handler. > > Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") > Signed-off-by: Kuppuswamy Sathyanarayanan > > Reviewed-by: Sinan Kaya > --- > Changes since v4: > * Added new interfaces for error recovery (pcie_do_fatal_recovery() > and pcie_do_nonfatal_recovery()). > > Changes since v5: > * Fixed static/non-static declartion issue. > > Documentation/PCI/pci-error-recovery.rst | 47 ++-- > Documentation/PCI/pcieaer-howto.rst | 2 +- > drivers/pci/pci.h| 5 +- > drivers/pci/pcie/aer.c | 10 ++-- > drivers/pci/pcie/dpc.c | 2 +- > drivers/pci/pcie/edr.c | 2 +- > drivers/pci/pcie/err.c | 70 ++-- > 7 files changed, 94 insertions(+), 44 deletions(-) > > diff --git a/Documentation/PCI/pci-error-recovery.rst > b/Documentation/PCI/pci-error-recovery.rst > index 84ceebb08cac..830c8af5838b 100644 > --- a/Documentation/PCI/pci-error-recovery.rst > +++ b/Documentation/PCI/pci-error-recovery.rst > @@ -115,7 +115,7 @@ The actual steps taken by a platform to recover from a > PCI error > event will be platform-dependent, but will follow the general > sequence described below. > > -STEP 0: Error Event > +STEP 0: Error Event: ERR_NONFATAL > --- > A PCI bus error is detected by the PCI hardware. On powerpc, the slot > is isolated, in that all I/O is blocked: all reads return 0x, > @@ -160,10 +160,10 @@ particular, if the platform doesn't isolate slots), and > recovery > proceeds to STEP 2 (MMIO Enable). > > If any driver requested a slot reset (by returning > PCI_ERS_RESULT_NEED_RESET), > -then recovery proceeds to STEP 4 (Slot Reset). > +then recovery proceeds to STEP 3 (Slot Reset). > > If the platform is unable to recover the slot, the next step > -is STEP 6 (Permanent Failure). > +is STEP 5 (Permanent Failure). > > .. note:: > > @@ -198,7 +198,7 @@ reset or some such, but not restart operations. This > callback is made if > all drivers on a segment agree that they can try to recover and if no > automatic > link reset was performed by the HW. If the platform can't just re-enable IOs > without a slot reset or a link reset, it will not call this callback, and > -instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset) > +instead will have gone directly to STEP 3 (Slot Reset) > > .. note:: > > @@ -233,18 +233,12 @@ The driver should return one of the following result > codes: > > The next step taken depends on the results returned by the drivers. > If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform > -proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations). > +proceeds to STEP 4 (Resume Operations). > > If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform > -proceeds to STEP 4 (Slot Reset) > +proceeds to
Re: [PATCH v6 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling
On Thu, Oct 15, 2020 at 11:04 AM Kuppuswamy, Sathyanarayanan wrote: > > > > On 10/14/20 6:58 PM, Ethan Zhao wrote: > > On Thu, Oct 15, 2020 at 1:06 AM Kuppuswamy, Sathyanarayanan > > wrote: > >> > >> > >> > >> On 10/14/20 8:07 AM, Ethan Zhao wrote: > >>> On Wed, Oct 14, 2020 at 5:00 PM Kuppuswamy Sathyanarayanan > >>> wrote: > >>>> > >>>> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") > >>>> merged fatal and non-fatal error recovery paths, and also made > >>>> recovery code depend on hotplug handler for "remove affected > >>>> device + rescan" support. But this change also complicated the > >>>> error recovery path and which in turn led to the following > >>>> issues. > >>>> > >>>> 1. We depend on hotplug handler for removing the affected > >>>> devices/drivers on DLLSC LINK down event (on DPC event > >>>> trigger) and DPC handler for handling the error recovery. Since > >>>> both handlers operate on same set of affected devices, it leads > >>>> to race condition, which in turn leads to NULL pointer > >>>> exceptions or error recovery failures.You can find more details > >>>> about this issue in following link. > >>>> > >>>> https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t > >>>> > >>>> 2. For non-hotplug capable devices fatal (DPC) error recovery > >>>> is currently broken. Current fatal error recovery implementation > >>>> relies on PCIe hotplug (pciehp) handler for detaching and > >>>> re-enumerating the affected devices/drivers. So when dealing with > >>>> non-hotplug capable devices, recovery code does not restore the state > >>>> of the affected devices correctly. You can find more details about > >>>> this issue in the following links. > >>>> > >>>> https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/ > >>>> https://lore.kernel.org/linux-pci/12115.1588207324@famine/ > >>>> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/ > >>>> > >>>> In order to fix the above two issues, we should stop relying on hotplug > >>> Yes, it doesn't rely on hotplug handler to remove and rescan the > >>> device, > >>> but it couldn't prevent hotplug drivers from doing another replicated > >>> removal/rescanning. > >>> it doesn't make sense to leave another useless removal/rescanning there. > >>> Maybe that's why these two paths were merged to one and made it rely on > >>> hotplug. > >> No, as per PCIe spec, hotplug and DPC has no functional dependency. Hence > >> depending on it to handle some of its recovery function is in-correct and > >> would lead to issues in non-hotplug capable platforms (which is true > >> currently). > >>> > > > pci_lock_rescan_remove() is global lock for PCIe, the mal-functional > > device's port holds this lock, it prevents the whole system from doing > > hot-plug operation. > It does not prevent the hotplug operation, but it might delay it. Since both > DPC and hotplug operates on same set of devices, it must be synchronized. Think about a large system with some PCI domains, every domain has some ports and devices attached. why DPC and hotplug *must* operate on the same set of devices from different domains ? if it must be synchronized, why make the hotplug handlers threadable ? > > Though pciehp is not so hot/scalable and performance critical, but there > > is per cpu thread to handle hot-plug operation. synchronize all threads > > make them walk backwards for scalability. > DPC events does not happen in high frequency. So I don't think we should It's holding global lock, once malfunction happens to one device and it's driver, the whole system, everyone holds it, would be blocked to work. > worry about the performance here. Even hotplug handler will hold this lock > when adding/removing the devices. So adding/removing devices is a serialized You don't worry about performance, but if there is a requirement needs more scalable and reliable hotplug, the effect will be much harder. what to do then ? choose another OS ? To be honest, I don't like the global lock/ pci_lock_rescan_remove(). BTW, I didn't try the FATAL errors brute force injection on your patch, duplicated removal will work naturally because it was removed ? Thanks, Ethan > operation. > > > > >> > >>>> -- > >>>> 2.17.1 > >>>> > >> > >> -- > >> Sathyanarayanan Kuppuswamy > >> Linux Kernel Developer > > -- > Sathyanarayanan Kuppuswamy > Linux Kernel Developer
Re: [PATCH v6 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling
On Thu, Oct 15, 2020 at 1:06 AM Kuppuswamy, Sathyanarayanan wrote: > > > > On 10/14/20 8:07 AM, Ethan Zhao wrote: > > On Wed, Oct 14, 2020 at 5:00 PM Kuppuswamy Sathyanarayanan > > wrote: > >> > >> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") > >> merged fatal and non-fatal error recovery paths, and also made > >> recovery code depend on hotplug handler for "remove affected > >> device + rescan" support. But this change also complicated the > >> error recovery path and which in turn led to the following > >> issues. > >> > >> 1. We depend on hotplug handler for removing the affected > >> devices/drivers on DLLSC LINK down event (on DPC event > >> trigger) and DPC handler for handling the error recovery. Since > >> both handlers operate on same set of affected devices, it leads > >> to race condition, which in turn leads to NULL pointer > >> exceptions or error recovery failures.You can find more details > >> about this issue in following link. > >> > >> https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t > >> > >> 2. For non-hotplug capable devices fatal (DPC) error recovery > >> is currently broken. Current fatal error recovery implementation > >> relies on PCIe hotplug (pciehp) handler for detaching and > >> re-enumerating the affected devices/drivers. So when dealing with > >> non-hotplug capable devices, recovery code does not restore the state > >> of the affected devices correctly. You can find more details about > >> this issue in the following links. > >> > >> https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/ > >> https://lore.kernel.org/linux-pci/12115.1588207324@famine/ > >> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/ > >> > >> In order to fix the above two issues, we should stop relying on hotplug > >Yes, it doesn't rely on hotplug handler to remove and rescan the device, > > but it couldn't prevent hotplug drivers from doing another replicated > > removal/rescanning. > > it doesn't make sense to leave another useless removal/rescanning there. > > Maybe that's why these two paths were merged to one and made it rely on > > hotplug. > No, as per PCIe spec, hotplug and DPC has no functional dependency. Hence > depending on it to handle some of its recovery function is in-correct and > would lead to issues in non-hotplug capable platforms (which is true > currently). > > > > >> + else > >> + udev = dev->bus->self; > >> + > >> + parent = udev->subordinate; > >> + pci_walk_bus(parent, pci_dev_set_disconnected, NULL); > >> + > >> +pci_lock_rescan_remove(); > > Though here you have lock, but hotplug will do another > > 'pci_stop_and_remove_bus_device()' > > without merging it with the hotplug driver, you have no way to > > remove the replicated actions in > >hotplug handler. > No, the core operation (remove/add device) is syncronzied and done in > only one thread. Please check the following flow. Even in hotplug pci_lock_rescan_remove() is global lock for PCIe, the mal-functional device's port holds this lock, it prevents the whole system from doing hot-plug operation. Though pciehp is not so hot/scalable and performance critical, but there is per cpu thread to handle hot-plug operation. synchronize all threads make them walk backwards for scalability. > handler, before removing the device, it attempts to hold > pci_lock_rescan_remove() > lock. So holding the same lock in DPC handler will syncronize the DPC/hotplug > handlers. Also if one of the thread (DPC or hotplug) removes/adds the > affected devices, > other thread will not repeat the same action (since the device is already > removed/added). > > ->pciehp_ist() >->pciehp_handle_presence_or_link_change() > ->pciehp_disable_slot() >->__pciehp_disable_slot() > ->remove_board() >->pciehp_unconfigure_device() > ->pci_lock_rescan_remove() > > > > > >Thanks, > >Ethan > >> +pci_dev_get(dev); > >> +list_for_each_entry_safe_reverse(pdev, temp, >devices, > >> +bus_list) { > >> + pci_stop_and_remove_bus_device(pdev); > >> + } > >> + > >> +
Re: [PATCH v6 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling
On Wed, Oct 14, 2020 at 5:00 PM Kuppuswamy Sathyanarayanan wrote: > > Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") > merged fatal and non-fatal error recovery paths, and also made > recovery code depend on hotplug handler for "remove affected > device + rescan" support. But this change also complicated the > error recovery path and which in turn led to the following > issues. > > 1. We depend on hotplug handler for removing the affected > devices/drivers on DLLSC LINK down event (on DPC event > trigger) and DPC handler for handling the error recovery. Since > both handlers operate on same set of affected devices, it leads > to race condition, which in turn leads to NULL pointer > exceptions or error recovery failures.You can find more details > about this issue in following link. > > https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t > > 2. For non-hotplug capable devices fatal (DPC) error recovery > is currently broken. Current fatal error recovery implementation > relies on PCIe hotplug (pciehp) handler for detaching and > re-enumerating the affected devices/drivers. So when dealing with > non-hotplug capable devices, recovery code does not restore the state > of the affected devices correctly. You can find more details about > this issue in the following links. > > https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/ > https://lore.kernel.org/linux-pci/12115.1588207324@famine/ > https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/ > > In order to fix the above two issues, we should stop relying on hotplug Yes, it doesn't rely on hotplug handler to remove and rescan the device, but it couldn't prevent hotplug drivers from doing another replicated removal/rescanning. it doesn't make sense to leave another useless removal/rescanning there. Maybe that's why these two paths were merged to one and made it rely on hotplug. > handler for cleaning the affected devices/drivers and let error recovery > handler own this functionality. So this patch reverts Commit bdb5ac85777d > ("PCI/ERR: Handle fatal error recovery") and re-introduce the "remove > affected device + rescan" functionality in fatal error recovery handler. > > Also holding pci_lock_rescan_remove() will prevent the race between hotplug > and DPC handler. > > Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") > Signed-off-by: Kuppuswamy Sathyanarayanan > > Reviewed-by: Sinan Kaya > --- > Changes since v4: > * Added new interfaces for error recovery (pcie_do_fatal_recovery() > and pcie_do_nonfatal_recovery()). > > Changes since v5: > * Fixed static/non-static declartion issue. > > Documentation/PCI/pci-error-recovery.rst | 47 ++-- > Documentation/PCI/pcieaer-howto.rst | 2 +- > drivers/pci/pci.h| 5 +- > drivers/pci/pcie/aer.c | 10 ++-- > drivers/pci/pcie/dpc.c | 2 +- > drivers/pci/pcie/edr.c | 2 +- > drivers/pci/pcie/err.c | 70 ++-- > 7 files changed, 94 insertions(+), 44 deletions(-) > > diff --git a/Documentation/PCI/pci-error-recovery.rst > b/Documentation/PCI/pci-error-recovery.rst > index 84ceebb08cac..830c8af5838b 100644 > --- a/Documentation/PCI/pci-error-recovery.rst > +++ b/Documentation/PCI/pci-error-recovery.rst > @@ -115,7 +115,7 @@ The actual steps taken by a platform to recover from a > PCI error > event will be platform-dependent, but will follow the general > sequence described below. > > -STEP 0: Error Event > +STEP 0: Error Event: ERR_NONFATAL > --- > A PCI bus error is detected by the PCI hardware. On powerpc, the slot > is isolated, in that all I/O is blocked: all reads return 0x, > @@ -160,10 +160,10 @@ particular, if the platform doesn't isolate slots), and > recovery > proceeds to STEP 2 (MMIO Enable). > > If any driver requested a slot reset (by returning > PCI_ERS_RESULT_NEED_RESET), > -then recovery proceeds to STEP 4 (Slot Reset). > +then recovery proceeds to STEP 3 (Slot Reset). > > If the platform is unable to recover the slot, the next step > -is STEP 6 (Permanent Failure). > +is STEP 5 (Permanent Failure). > > .. note:: > > @@ -198,7 +198,7 @@ reset or some such, but not restart operations. This > callback is made if > all drivers on a segment agree that they can try to recover and if no > automatic > link reset was performed by the HW. If the platform can't just re-enable IOs > without a slot reset or a link reset, it will not call this callback, and > -instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset) > +instead will have gone directly to STEP 3 (Slot Reset) > > .. note:: > > @@ -233,18 +233,12 @@ The driver should return one of the following result > codes: > > The next step taken depends on the results returned by the
Re: [PATCH v4 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback
Please fix the building issue. drivers/pci/pcie/err.c:144:25: error: static declaration of ‘pcie_do_fatal_recovery’ follows non-static declaration static pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev, ^~ In file included from drivers/pci/pcie/err.c:21: drivers/pci/pcie/../pci.h:560:18: note: previous declaration of ‘pcie_do_fatal_recovery’ was here pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev, ^~ drivers/pci/pcie/err.c:144:25: warning: ‘pcie_do_fatal_recovery’ defined but not used [-Wunused-function] static pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev, Thanks, Ethan On Tue, Oct 13, 2020 at 10:18 PM Kuppuswamy, Sathyanarayanan wrote: > > > > On 10/12/20 2:05 PM, Raj, Ashok wrote: > > On Sun, Oct 11, 2020 at 10:03:40PM -0700, > > sathyanarayanan.nkuppusw...@gmail.com wrote: > >> From: Kuppuswamy Sathyanarayanan > >> > >> > >> Currently if report_error_detected() or report_mmio_enabled() > >> functions requests PCI_ERS_RESULT_NEED_RESET, current > >> pcie_do_recovery() implementation does not do the requested > >> explicit device reset, but instead just calls the > >> report_slot_reset() on all affected devices. Notifying about the > >> reset via report_slot_reset() without doing the actual device > >> reset is incorrect. So call pci_bus_reset() before triggering > >> ->slot_reset() callback. > >> > >> Signed-off-by: Kuppuswamy Sathyanarayanan > >> > >> --- > >> drivers/pci/pcie/err.c | 6 +- > >> 1 file changed, 1 insertion(+), 5 deletions(-) > >> > >> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c > >> index c543f419d8f9..067c58728b88 100644 > >> --- a/drivers/pci/pcie/err.c > >> +++ b/drivers/pci/pcie/err.c > >> @@ -181,11 +181,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, > >> } > >> > >> if (status == PCI_ERS_RESULT_NEED_RESET) { > >> -/* > >> - * TODO: Should call platform-specific > >> - * functions to reset slot before calling > >> - * drivers' slot_reset callbacks? > >> - */ > >> +pci_reset_bus(dev); > > > > pci_reset_bus() returns an error, do you need to consult that before > > unconditionally setting PCI_ERS_RESULT_RECOVERED? > Good point. I will fix this in next version. > > > >> status = PCI_ERS_RESULT_RECOVERED; > >> pci_dbg(dev, "broadcast slot_reset message\n"); > >> pci_walk_bus(bus, report_slot_reset, ); > >> -- > >> 2.17.1 > >> > > -- > Sathyanarayanan Kuppuswamy > Linux Kernel Developer
Re: [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling
On Mon, Oct 12, 2020 at 1:10 PM wrote: > > From: Kuppuswamy Sathyanarayanan > > Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") > merged fatal and non-fatal error recovery paths, and also made > recovery code depend on hotplug handler for "remove affected > device + rescan" support. But this change also complicated the > error recovery path and which in turn led to the following > issues. > > 1. We depend on hotplug handler for removing the affected > devices/drivers on DLLSC LINK down event (on DPC event > trigger) and DPC handler for handling the error recovery. Since > both handlers operate on same set of affected devices, it leads > to race condition, which in turn leads to NULL pointer > exceptions or error recovery failures.You can find more details > about this issue in following link. > > https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t > > 2. For non-hotplug capable devices fatal (DPC) error recovery > is currently broken. Current fatal error recovery implementation > relies on PCIe hotplug (pciehp) handler for detaching and > re-enumerating the affected devices/drivers. So when dealing with > non-hotplug capable devices, recovery code does not restore the state > of the affected devices correctly. You can find more details about > this issue in the following links. > > https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/ > https://lore.kernel.org/linux-pci/12115.1588207324@famine/ > https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/ > > In order to fix the above two issues, we should stop relying on hotplug > handler for cleaning the affected devices/drivers and let error recovery > handler own this functionality. So this patch reverts Commit bdb5ac85777d > ("PCI/ERR: Handle fatal error recovery") and re-introduce the "remove > affected device + rescan" functionality in fatal error recovery handler. This patch only reverts the commit bdb5ac85777d ? or you'd better separate the revert and code you added. Thanks, Ethan > > Also holding pci_lock_rescan_remove() will prevent the race between hotplug > and DPC handler. > > Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") > Signed-off-by: Kuppuswamy Sathyanarayanan > > --- > Documentation/PCI/pci-error-recovery.rst | 47 ++-- > drivers/pci/pcie/err.c | 71 +++- > 2 files changed, 87 insertions(+), 31 deletions(-) > > diff --git a/Documentation/PCI/pci-error-recovery.rst > b/Documentation/PCI/pci-error-recovery.rst > index 84ceebb08cac..830c8af5838b 100644 > --- a/Documentation/PCI/pci-error-recovery.rst > +++ b/Documentation/PCI/pci-error-recovery.rst > @@ -115,7 +115,7 @@ The actual steps taken by a platform to recover from a > PCI error > event will be platform-dependent, but will follow the general > sequence described below. > > -STEP 0: Error Event > +STEP 0: Error Event: ERR_NONFATAL > --- > A PCI bus error is detected by the PCI hardware. On powerpc, the slot > is isolated, in that all I/O is blocked: all reads return 0x, > @@ -160,10 +160,10 @@ particular, if the platform doesn't isolate slots), and > recovery > proceeds to STEP 2 (MMIO Enable). > > If any driver requested a slot reset (by returning > PCI_ERS_RESULT_NEED_RESET), > -then recovery proceeds to STEP 4 (Slot Reset). > +then recovery proceeds to STEP 3 (Slot Reset). > > If the platform is unable to recover the slot, the next step > -is STEP 6 (Permanent Failure). > +is STEP 5 (Permanent Failure). > > .. note:: > > @@ -198,7 +198,7 @@ reset or some such, but not restart operations. This > callback is made if > all drivers on a segment agree that they can try to recover and if no > automatic > link reset was performed by the HW. If the platform can't just re-enable IOs > without a slot reset or a link reset, it will not call this callback, and > -instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset) > +instead will have gone directly to STEP 3 (Slot Reset) > > .. note:: > > @@ -233,18 +233,12 @@ The driver should return one of the following result > codes: > > The next step taken depends on the results returned by the drivers. > If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform > -proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations). > +proceeds to STEP 4 (Resume Operations). > > If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform > -proceeds to STEP 4 (Slot Reset) > +proceeds to STEP 3 (Slot Reset) > > -STEP 3: Link Reset > --- > -The platform resets the link. This is a PCI-Express specific step > -and is done whenever a fatal error has been detected that can be > -"solved" by resetting the link. > - > -STEP 4: Slot Reset > +STEP 3: Slot Reset > -- > > In response to a return value of
Re: [PATCH v8 2/6] PCI/DPC: define a function to check and wait till port finish DPC handling
On Thu, Oct 8, 2020 at 2:16 AM Kuppuswamy, Sathyanarayanan wrote: > > > On 10/7/20 4:31 AM, Ethan Zhao wrote: > > Once root port DPC capability is enabled and triggered, at the beginning > > of DPC is triggered, the DPC status bits are set by hardware and then > > sends DPC/DLLSC/PDC interrupts to OS DPC and pciehp drivers, it will > > take the port and software DPC interrupt handler 10ms to 50ms (test data > > on ICS(Ice Lake SP platform, see > > https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server) > > & stable 5.9-rc6) to complete the DPC containment procedure > This data is based on one particular architecture. So using this > to create a timed loop in pci_wait_port_outdpc() looks incorrect. To clarify there, it is not random to wait for specific 1000ms for specific architecture. Though there is no specification to say how many ms totally cost by hardware DPC containment, plus its interrupt handling. but you could read the whole PCIe specification to know how many ms cost at most by power state transition, link training etc, you may know the max single delay in hardware state transition is 200ms. Other small delay in hardware state transition is 100ms, 48ms, 32ms etc. If the DPC containment hardware procedure is pure resetting (or cold power on) without software access configuration as the worst case. we wait its handling process from 10ms (actually 20ms is the minimum delay we could do with msleep() ) till 1000ms timeout is a reasonable value. Thanks, Ethan > > I still recommend looking for some locking model to fix this > issue (may be atomic state flag or lock). > > till the DPC status is cleared at the end of the DPC interrupt handler. > > > > We use this function to check if the root port is in DPC handling status > > and wait till the hardware and software completed the procedure. > > > > Signed-off-by: Ethan Zhao > > Tested-by: Wen Jin > > Tested-by: Shanshan Zhang > > --- > > changes: > > v2:align ICS code name to public doc. > > v3: no change. > > v4: response to Christoph's (Christoph Hellwig ) > > tip, move pci_wait_port_outdpc() to DPC driver and its declaration > > to pci.h. > > v5: fix building issue reported by l...@intel.com with some config. > > v6: move from [1/5] to [2/5]. > > v7: no change. > > v8: no change. > > > > drivers/pci/pci.h | 2 ++ > > drivers/pci/pcie/dpc.c | 27 +++ > > 2 files changed, 29 insertions(+) > > > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h > > index fa12f7cbc1a0..455b32187abd 100644 > > --- a/drivers/pci/pci.h > > +++ b/drivers/pci/pci.h > > @@ -455,10 +455,12 @@ void pci_restore_dpc_state(struct pci_dev *dev); > > void pci_dpc_init(struct pci_dev *pdev); > > void dpc_process_error(struct pci_dev *pdev); > > pci_ers_result_t dpc_reset_link(struct pci_dev *pdev); > > +bool pci_wait_port_outdpc(struct pci_dev *pdev); > > #else > > static inline void pci_save_dpc_state(struct pci_dev *dev) {} > > static inline void pci_restore_dpc_state(struct pci_dev *dev) {} > > static inline void pci_dpc_init(struct pci_dev *pdev) {} > > +static inline bool pci_wait_port_outdpc(struct pci_dev *pdev) { return > > false; } > > #endif > > > > #ifdef CONFIG_PCI_ATS > > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c > > index daa9a4153776..2e0e091ce923 100644 > > --- a/drivers/pci/pcie/dpc.c > > +++ b/drivers/pci/pcie/dpc.c > > @@ -71,6 +71,33 @@ void pci_restore_dpc_state(struct pci_dev *dev) > > pci_write_config_word(dev, dev->dpc_cap + PCI_EXP_DPC_CTL, *cap); > > } > > > > +bool pci_wait_port_outdpc(struct pci_dev *pdev) > > +{ > > + u16 cap = pdev->dpc_cap, status; > > + u16 loop = 0; > > + > > + if (!cap) { > > + pci_WARN_ONCE(pdev, !cap, "No DPC capability initiated\n"); > > + return false; > > + } > > + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, ); > > + pci_dbg(pdev, "DPC status %x, cap %x\n", status, cap); > > + > > + while (status & PCI_EXP_DPC_STATUS_TRIGGER && loop < 100) { > > + msleep(10); > > + loop++; > > + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, ); > > + } > > + > > + if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) { > > + pci_dbg(pdev, "Out of DPC %x, cost %d ms\n", status, loop*10); > > + return true; > > + } > > + > > + pci_dbg(pdev, "Timeout to wait port out of DPC status\n"); > > + return false; > > +} > > + > > static int dpc_wait_rp_inactive(struct pci_dev *pdev) > > { > > unsigned long timeout = jiffies + HZ; > > -- > Sathyanarayanan Kuppuswamy > Linux Kernel Developer >
Re: [PATCH v8 2/6] PCI/DPC: define a function to check and wait till port finish DPC handling
On Thu, Oct 8, 2020 at 2:16 AM Kuppuswamy, Sathyanarayanan wrote: > > > On 10/7/20 4:31 AM, Ethan Zhao wrote: > > Once root port DPC capability is enabled and triggered, at the beginning > > of DPC is triggered, the DPC status bits are set by hardware and then > > sends DPC/DLLSC/PDC interrupts to OS DPC and pciehp drivers, it will > > take the port and software DPC interrupt handler 10ms to 50ms (test data > > on ICS(Ice Lake SP platform, see > > https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server) > > & stable 5.9-rc6) to complete the DPC containment procedure > This data is based on one particular architecture. So using this > to create a timed loop in pci_wait_port_outdpc() looks incorrect. > > I still recommend looking for some locking model to fix this > issue (may be atomic state flag or lock). It is actually a device semaphore. DLLSC/PDC handler needs to wait for the critical area is clear to enter by monitoring the DPC triggered status is cleaned or not. Another problem is, DPC reset/interrupt is initiated by hardware, you couldn't place a software lock between interrupt handler and device resetting. While device semaphore--- DPC triggered status is the right one to wait for. Better idea ? Thanks, Ethan > > till the DPC status is cleared at the end of the DPC interrupt handler. > > > > We use this function to check if the root port is in DPC handling status > > and wait till the hardware and software completed the procedure. > > > > Signed-off-by: Ethan Zhao > > Tested-by: Wen Jin > > Tested-by: Shanshan Zhang > > --- > > changes: > > v2:align ICS code name to public doc. > > v3: no change. > > v4: response to Christoph's (Christoph Hellwig ) > > tip, move pci_wait_port_outdpc() to DPC driver and its declaration > > to pci.h. > > v5: fix building issue reported by l...@intel.com with some config. > > v6: move from [1/5] to [2/5]. > > v7: no change. > > v8: no change. > > > > drivers/pci/pci.h | 2 ++ > > drivers/pci/pcie/dpc.c | 27 +++ > > 2 files changed, 29 insertions(+) > > > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h > > index fa12f7cbc1a0..455b32187abd 100644 > > --- a/drivers/pci/pci.h > > +++ b/drivers/pci/pci.h > > @@ -455,10 +455,12 @@ void pci_restore_dpc_state(struct pci_dev *dev); > > void pci_dpc_init(struct pci_dev *pdev); > > void dpc_process_error(struct pci_dev *pdev); > > pci_ers_result_t dpc_reset_link(struct pci_dev *pdev); > > +bool pci_wait_port_outdpc(struct pci_dev *pdev); > > #else > > static inline void pci_save_dpc_state(struct pci_dev *dev) {} > > static inline void pci_restore_dpc_state(struct pci_dev *dev) {} > > static inline void pci_dpc_init(struct pci_dev *pdev) {} > > +static inline bool pci_wait_port_outdpc(struct pci_dev *pdev) { return > > false; } > > #endif > > > > #ifdef CONFIG_PCI_ATS > > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c > > index daa9a4153776..2e0e091ce923 100644 > > --- a/drivers/pci/pcie/dpc.c > > +++ b/drivers/pci/pcie/dpc.c > > @@ -71,6 +71,33 @@ void pci_restore_dpc_state(struct pci_dev *dev) > > pci_write_config_word(dev, dev->dpc_cap + PCI_EXP_DPC_CTL, *cap); > > } > > > > +bool pci_wait_port_outdpc(struct pci_dev *pdev) > > +{ > > + u16 cap = pdev->dpc_cap, status; > > + u16 loop = 0; > > + > > + if (!cap) { > > + pci_WARN_ONCE(pdev, !cap, "No DPC capability initiated\n"); > > + return false; > > + } > > + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, ); > > + pci_dbg(pdev, "DPC status %x, cap %x\n", status, cap); > > + > > + while (status & PCI_EXP_DPC_STATUS_TRIGGER && loop < 100) { > > + msleep(10); > > + loop++; > > + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, ); > > + } > > + > > + if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) { > > + pci_dbg(pdev, "Out of DPC %x, cost %d ms\n", status, loop*10); > > + return true; > > + } > > + > > + pci_dbg(pdev, "Timeout to wait port out of DPC status\n"); > > + return false; > > +} > > + > > static int dpc_wait_rp_inactive(struct pci_dev *pdev) > > { > > unsigned long timeout = jiffies + HZ; > > -- > Sathyanarayanan Kuppuswamy > Linux Kernel Developer >
Re: [PATCH v8 1/6] PCI/ERR: get device before call device driver to avoid NULL pointer dereference
On Thu, Oct 8, 2020 at 1:24 AM Kuppuswamy, Sathyanarayanan wrote: > > > On 10/7/20 4:31 AM, Ethan Zhao wrote: > > During DPC error injection test we found there is race condition between > > pciehp and DPC driver, NULL pointer dereference caused panic as following > > > > # setpci -s 64:02.0 0x196.w=000a > >// 64:02.0 is rootport has DPC capability > > # setpci -s 65:00.0 0x04.w=0544 > >// 65:00.0 is NVMe SSD populated in above port > > # mount /dev/nvme0n1p1 nvme > > > > (tested on stable 5.8 & ICS(Ice Lake SP platform, see > > https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)) > > > > BUG: kernel NULL pointer dereference, address: 0050 > > ... > > CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted > > 5.8.0-0.0.7.el8.x86_64+ #1 > > RIP: 0010:report_error_detected.cold.4+0x7d/0xe6 > > Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3 > > 65 ff > > ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50 50 48 83 > > 3a 00 > > 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0 > > RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246 > > RAX: RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01 > > RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138 > > RBP: ff8e06cf8762fdd0 R08: 00bf R09: > > R10: 00eb8ebeab53 R11: 93453258 R12: 0002 > > R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828 > > FS: () GS:ff4e3eab1fd0() > > knlGS: > > CS: 0010 DS: ES: CR0: 80050033 > > CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0 > > DR0: DR1: DR2: > > DR3: DR6: fffe0ff0 DR7: 0400 > > PKRU: 5554 > > Call Trace: > > ? report_normal_detected+0x20/0x20 > > report_frozen_detected+0x16/0x20 > > pci_walk_bus+0x75/0x90 > > ? dpc_irq+0x90/0x90 > > pcie_do_recovery+0x157/0x201 > > ? irq_finalize_oneshot.part.47+0xe0/0xe0 > > dpc_handler+0x29/0x40 > > irq_thread_fn+0x24/0x60 > > ... > > > > Debug shows when port DPC feature was enabled and triggered by errors, > > DLLSC/PDC/DPC interrupts will be sent to pciehp and DPC driver almost > > at the same time, and no delay between them is required by specification. > > so DPC driver and pciehp drivers may handle these interrupts cocurrently. > > > > While DPC driver is doing pci_walk_bus() and calling device driver's > > callback without pci_dev_get() to increase device reference count, the > > device and its driver instance are likely being freed by > > pci_stop_and_removed_bus_device() > > -> pci_dev_put(). > > > > So does pci_dev_get() before using the device instance to avoid NULL > > pointer dereference. > Won't it be better if you get this in pcie_do_recovery()? Don't think so, just like lock, we should keep the scope with lock protected as small as possible. Locking a big area unnecessarily isn't acceptable. Thanks, Ethan > > > > Signed-off-by: Ethan Zhao > > Tested-by: Wen Jin > > Tested-by: Shanshan Zhang > > --- > > Changes: > > v2: revise doc according to Andy's suggestion. > > v3: no change. > > v4: no change. > > v5: no change. > > v6: moved to [1/5] from [3/5] and revised comment according to Lukas' > > suggestion. > > v7: no change. > > v8: no change. > > > > drivers/pci/pcie/err.c | 12 > > 1 file changed, 12 insertions(+) > > > > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c > > index c543f419d8f9..e35c4480c86b 100644 > > --- a/drivers/pci/pcie/err.c > > +++ b/drivers/pci/pcie/err.c > > @@ -52,6 +52,8 @@ static int report_error_detected(struct pci_dev *dev, > > pci_ers_result_t vote; > > const struct pci_error_handlers *err_handler; > > > > + if (!pci_dev_get(dev)) > > + return 0; > > device_lock(>dev); > > if (!pci_dev_set_io_state(dev, state) || > > !dev->driver || > > @@ -76,6 +78,7 @@ static int report_error_detected(struct pci_dev *dev, > > pci_uevent_ers(dev, vote); > > *result = merge_result(*result, vote); > > device_unlock(>dev); > > + pci_dev_put(dev); > > return 0; > > } > > > > @@ -94,6 +97,8 @@ static int report_mmio_enabl
[PATCH v8 6/6] PCI/ERR: don't mix io state not changed and no driver together
When we see 'can't recover (no error_detected callback)' on console, Maybe the reason is io state is not changed by calling pci_dev_set_io_state(), that is confused. fix it. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang --- Chagnes: v2: no change. v3: no change. v4: no change. v5: no change. v6: no change. v7: change debug output information. v8: no change. drivers/pci/pcie/err.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index e35c4480c86b..2ca2723f3b34 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -55,8 +55,10 @@ static int report_error_detected(struct pci_dev *dev, if (!pci_dev_get(dev)) return 0; device_lock(>dev); - if (!pci_dev_set_io_state(dev, state) || - !dev->driver || + if (!pci_dev_set_io_state(dev, state)) { + pci_dbg(dev, "Device was in that state or not allowed setting.\n"); + vote = PCI_ERS_RESULT_NONE; + } else if (!dev->driver || !dev->driver->err_handler || !dev->driver->err_handler->error_detected) { /* -- 2.18.4
[PATCH v8 0/6] Fix DPC hotplug race and enhance error handling
Hi,folks, This simple patch set fixed some serious security issues found when DPC error injection and NVMe SSD hotplug brute force test were doing -- race condition between DPC handler and pciehp, AER interrupt handlers, caused system hang and system with DPC feature couldn't recover to normal working state as expected (NVMe instance lost, mount operation hang, race PCIe access caused uncorrectable errors reported alternatively etc). The fundamental premise is that when due to error conditions (NON-FATAL/ FATAL) when events are processed by both DPC handler and hotplug handling of DLLSC/PDC both operating on the same device object ends up with crashes (from Ashok). Debug shows when port DPC feature was enabled and triggered by errors, DLLSC/PDC/DPC interrupts will be sent to pciehp and DPC driver almost at the same time, and no delay between them is required by specification. so DPC driver and pciehp drivers may handle these interrupts cocurrently, thus introduces the possibility of race condition, other details see every commit description part. With this patch set applied, stable 5.9-rc6 on ICS (Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)) could pass the PCIe Gen4 NVMe SSD brute force hotplug test with any time interval between hot-remove and plug-in operation tens of times without any errors occur and system works normal. With this patch set applied, system with DPC feature could recover from NON-FATAL and FATAL errors injection test and works as expected. System works smoothly when errors happen while hotplug is doing, no uncorrectable errors found. Brute DPC error injection script: for i in {0..100} do setpci -s 64:02.0 0x196.w=000a setpci -s 65:00.0 0x04.w=0544 mount /dev/nvme0n1p1 /root/nvme sleep 1 done This patch set could be applied to stable 5.9-rc6/rc7/rc8 directly. Help to review and test. v2: changed according to review by Andy Shevchenko. v3: changed patch 4/5 to simpler coding. v4: move function pci_wait_port_outdpc() to DPC driver and its declaration to pci.h. (tip from Christoph Hellwig ). v5: fix building issue reported by l...@intel.com with some config. v6: move patch[3/5] as the first patch according to Lukas's suggestion. and rewrite the comment part of patch[3/5]. v7: change the patch[4/5], based on Bjorn's code and truth table. change the patch[5/5] about the debug output information. v8: according Bjorn's suggestion, put the pci_dev_set_io_state() simplification but no function code in one patch.(almost copy of Bjorn's code and truth table, understood). patch 5/6 re-based the function change code of pci_dev_set_io_state(). per Ashok's request, add more description to this cover-letter part. Thanks, Ethan Ethan Zhao (6): PCI/ERR: get device before call device driver to avoid NULL pointer dereference PCI/DPC: define a function to check and wait till port finish DPC handling PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC PCI/ERR: simplify function pci_dev_set_io_state() with if PCI/ERR: only return true when dev io state is really changed PCI/ERR: don't mix io state not changed and no driver together drivers/pci/hotplug/pciehp_hpc.c | 4 ++- drivers/pci/pci.h| 55 +--- drivers/pci/pcie/dpc.c | 27 drivers/pci/pcie/err.c | 18 +-- 4 files changed, 69 insertions(+), 35 deletions(-) base-commit: 549738f15da0e5a00275977623be199fbbf7df50 -- 2.18.4
[PATCH v8 5/6] PCI/ERR: only return true when dev io state is really changed
When uncorrectable error happens, AER driver and DPC driver interrupt handlers likely call pcie_do_recovery() ->pci_walk_bus() ->report_frozen_detected() with pci_channel_io_frozen the same time. If pci_dev_set_io_state() return true even if the original state is pci_channel_io_frozen, that will cause AER or DPC handler re-enter the error detecting and recovery procedure one after another. The result is the recovery flow mixed between AER and DPC. So change the pci_dev_set_io_state() function to only return true when dev->error_state is really changed. Signed-off-by: Ethan Zhao --- Changnes: v2: revise description and code according to suggestion from Andy. v3: change code to simpler. v4: no change. v5: no change. v6: no change. v7: changed based on Bjorn's code and truth table. v8: according to Bjorn's suggestion, rebase on another simplification patch. drivers/pci/pci.h | 13 + 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index bceb3f108744..a11e0f9d9bdf 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -371,17 +371,14 @@ static inline bool pci_dev_set_io_state(struct pci_dev *dev, * perm_failure | perm_failure* perm_failure* perm_failure */ - /* Can always put a device in perm_failure state */ - if (new == pci_channel_io_perm_failure) { - dev->error_state = pci_channel_io_perm_failure; - return true; - } - - /* If already in perm_failure, can't set to normal or frozen */ + /* If already in perm_failure, can't change it's state */ if (dev->error_state == pci_channel_io_perm_failure) return false; + /* not change at all */ + else if (dev->error_state == new) + return false; -/* Can always change normal to frozen or vice versa */ +/* Can always change from normal/frozen to other different state */ dev->error_state = new; return true; } -- 2.18.4
[PATCH v8 4/6] PCI/ERR: simplify function pci_dev_set_io_state() with if
No function change. Signed-off-by: Ethan Zhao --- Changes: v8: based on Bjorn's code and truth table, simplify the logic of function pci_dev_set_io_state(), no function change. drivers/pci/pci.h | 54 --- 1 file changed, 23 insertions(+), 31 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 455b32187abd..bceb3f108744 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -359,39 +359,31 @@ struct pci_sriov { static inline bool pci_dev_set_io_state(struct pci_dev *dev, pci_channel_state_t new) { - bool changed = false; - device_lock_assert(>dev); - switch (new) { - case pci_channel_io_perm_failure: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - case pci_channel_io_perm_failure: - changed = true; - break; - } - break; - case pci_channel_io_frozen: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - changed = true; - break; - } - break; - case pci_channel_io_normal: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - changed = true; - break; - } - break; +/* + * Truth table: + * requested new state + * current -- + * statenormal frozen perm_failure + * + - - + * normal| normal frozen perm_failure + * frozen| normal frozen perm_failure + * perm_failure | perm_failure* perm_failure* perm_failure + */ + + /* Can always put a device in perm_failure state */ + if (new == pci_channel_io_perm_failure) { + dev->error_state = pci_channel_io_perm_failure; + return true; } - if (changed) - dev->error_state = new; - return changed; + + /* If already in perm_failure, can't set to normal or frozen */ + if (dev->error_state == pci_channel_io_perm_failure) + return false; + +/* Can always change normal to frozen or vice versa */ + dev->error_state = new; + return true; } static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused) -- 2.18.4
[PATCH v8 1/6] PCI/ERR: get device before call device driver to avoid NULL pointer dereference
During DPC error injection test we found there is race condition between pciehp and DPC driver, NULL pointer dereference caused panic as following # setpci -s 64:02.0 0x196.w=000a // 64:02.0 is rootport has DPC capability # setpci -s 65:00.0 0x04.w=0544 // 65:00.0 is NVMe SSD populated in above port # mount /dev/nvme0n1p1 nvme (tested on stable 5.8 & ICS(Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)) BUG: kernel NULL pointer dereference, address: 0050 ... CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0-0.0.7.el8.x86_64+ #1 RIP: 0010:report_error_detected.cold.4+0x7d/0xe6 Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3 65 ff ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50 50 48 83 3a 00 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0 RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246 RAX: RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01 RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138 RBP: ff8e06cf8762fdd0 R08: 00bf R09: R10: 00eb8ebeab53 R11: 93453258 R12: 0002 R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828 FS: () GS:ff4e3eab1fd0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 PKRU: 5554 Call Trace: ? report_normal_detected+0x20/0x20 report_frozen_detected+0x16/0x20 pci_walk_bus+0x75/0x90 ? dpc_irq+0x90/0x90 pcie_do_recovery+0x157/0x201 ? irq_finalize_oneshot.part.47+0xe0/0xe0 dpc_handler+0x29/0x40 irq_thread_fn+0x24/0x60 ... Debug shows when port DPC feature was enabled and triggered by errors, DLLSC/PDC/DPC interrupts will be sent to pciehp and DPC driver almost at the same time, and no delay between them is required by specification. so DPC driver and pciehp drivers may handle these interrupts cocurrently. While DPC driver is doing pci_walk_bus() and calling device driver's callback without pci_dev_get() to increase device reference count, the device and its driver instance are likely being freed by pci_stop_and_removed_bus_device() -> pci_dev_put(). So does pci_dev_get() before using the device instance to avoid NULL pointer dereference. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang --- Changes: v2: revise doc according to Andy's suggestion. v3: no change. v4: no change. v5: no change. v6: moved to [1/5] from [3/5] and revised comment according to Lukas' suggestion. v7: no change. v8: no change. drivers/pci/pcie/err.c | 12 1 file changed, 12 insertions(+) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index c543f419d8f9..e35c4480c86b 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -52,6 +52,8 @@ static int report_error_detected(struct pci_dev *dev, pci_ers_result_t vote; const struct pci_error_handlers *err_handler; + if (!pci_dev_get(dev)) + return 0; device_lock(>dev); if (!pci_dev_set_io_state(dev, state) || !dev->driver || @@ -76,6 +78,7 @@ static int report_error_detected(struct pci_dev *dev, pci_uevent_ers(dev, vote); *result = merge_result(*result, vote); device_unlock(>dev); + pci_dev_put(dev); return 0; } @@ -94,6 +97,8 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data) pci_ers_result_t vote, *result = data; const struct pci_error_handlers *err_handler; + if (!pci_dev_get(dev)) + return 0; device_lock(>dev); if (!dev->driver || !dev->driver->err_handler || @@ -105,6 +110,7 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data) *result = merge_result(*result, vote); out: device_unlock(>dev); + pci_dev_put(dev); return 0; } @@ -113,6 +119,8 @@ static int report_slot_reset(struct pci_dev *dev, void *data) pci_ers_result_t vote, *result = data; const struct pci_error_handlers *err_handler; + if (!pci_dev_get(dev)) + return 0; device_lock(>dev); if (!dev->driver || !dev->driver->err_handler || @@ -124,6 +132,7 @@ static int report_slot_reset(struct pci_dev *dev, void *data) *result = merge_result(*result, vote); out: device_unlock(>dev); + pci_dev_put(dev); return 0; } @@ -131,6 +140,8 @@ static int report_resume(struct pci_dev *dev, void *data) { const struct pci_error_handlers *err_handler; + if (!pci_dev_get(dev)) + return 0; device_lock(>dev);
[PATCH v8 2/6] PCI/DPC: define a function to check and wait till port finish DPC handling
Once root port DPC capability is enabled and triggered, at the beginning of DPC is triggered, the DPC status bits are set by hardware and then sends DPC/DLLSC/PDC interrupts to OS DPC and pciehp drivers, it will take the port and software DPC interrupt handler 10ms to 50ms (test data on ICS(Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server) & stable 5.9-rc6) to complete the DPC containment procedure till the DPC status is cleared at the end of the DPC interrupt handler. We use this function to check if the root port is in DPC handling status and wait till the hardware and software completed the procedure. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang --- changes: v2???align ICS code name to public doc. v3: no change. v4: response to Christoph's (Christoph Hellwig ) tip, move pci_wait_port_outdpc() to DPC driver and its declaration to pci.h. v5: fix building issue reported by l...@intel.com with some config. v6: move from [1/5] to [2/5]. v7: no change. v8: no change. drivers/pci/pci.h | 2 ++ drivers/pci/pcie/dpc.c | 27 +++ 2 files changed, 29 insertions(+) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index fa12f7cbc1a0..455b32187abd 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -455,10 +455,12 @@ void pci_restore_dpc_state(struct pci_dev *dev); void pci_dpc_init(struct pci_dev *pdev); void dpc_process_error(struct pci_dev *pdev); pci_ers_result_t dpc_reset_link(struct pci_dev *pdev); +bool pci_wait_port_outdpc(struct pci_dev *pdev); #else static inline void pci_save_dpc_state(struct pci_dev *dev) {} static inline void pci_restore_dpc_state(struct pci_dev *dev) {} static inline void pci_dpc_init(struct pci_dev *pdev) {} +static inline bool pci_wait_port_outdpc(struct pci_dev *pdev) { return false; } #endif #ifdef CONFIG_PCI_ATS diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c index daa9a4153776..2e0e091ce923 100644 --- a/drivers/pci/pcie/dpc.c +++ b/drivers/pci/pcie/dpc.c @@ -71,6 +71,33 @@ void pci_restore_dpc_state(struct pci_dev *dev) pci_write_config_word(dev, dev->dpc_cap + PCI_EXP_DPC_CTL, *cap); } +bool pci_wait_port_outdpc(struct pci_dev *pdev) +{ + u16 cap = pdev->dpc_cap, status; + u16 loop = 0; + + if (!cap) { + pci_WARN_ONCE(pdev, !cap, "No DPC capability initiated\n"); + return false; + } + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, ); + pci_dbg(pdev, "DPC status %x, cap %x\n", status, cap); + + while (status & PCI_EXP_DPC_STATUS_TRIGGER && loop < 100) { + msleep(10); + loop++; + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, ); + } + + if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) { + pci_dbg(pdev, "Out of DPC %x, cost %d ms\n", status, loop*10); + return true; + } + + pci_dbg(pdev, "Timeout to wait port out of DPC status\n"); + return false; +} + static int dpc_wait_rp_inactive(struct pci_dev *pdev) { unsigned long timeout = jiffies + HZ; -- 2.18.4
[PATCH v8 3/6] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC
When root port has DPC capability and it is enabled, then triggered by errors, DPC DLLSC and PDC etc interrupts will be sent to DPC driver, pciehp drivers almost at the same time. Thus will cause following messed and confused errors handling/recovery/ removal/plugin procedure. 1. Port and device are in error recovery resetting initiated by DPC hardware, pciehp driver treats them as device is doing hot-remove or hot-plugin the same time. 2. While DPC handler calling device driver->err_handler callback( error_detected/resume etc), but the slot may be powered off by pciehp -> remove_board() -> pciehp_power_off_slot(). 3. While DPC handler -> pci_do_recovery is doing different action to detect error and recover based on device->error_state, pciehp driver could change it on the fly by: pciehp_unconfigure_device() ->pci_walk_bus() -> pci_dev_set_disconnected() 4. While DPC handler is calling device driver err_handler callback to detect error and recover, pciehp driver could is doing device unbind and release its driver. ... While NON-FATAL/FATAL errors happen while hotplug is(is not)doing, result is not determinate. So we need some kind of synchronization between pciehp DLLSC/PDC handling and DPC driver error recover handling. we need a determinate result of DPC error containment, link is recovered, link isn't recovered, device is still there, device is removed, then do pciehp hot-remove and hot-plugin procudure, don't mix them together. Per our test on ICS platform, DPC error containment and software handler will take 10ms up to 50ms till clean the DPC triggered status. it is quick enough for pciehp compared with its 1000ms waiting to ignore DLLSC/PDC after doing power off. With this patch, the handling flow of DPC containment and hotplug is partly ordered and serialized, let hardware DPC do the controller reset etc recovery action first, then DPC driver handling the call-back from device drivers, clear the DPC status, at the end, pciehp handle the DLLSC and PDC etc. After tens of PCIe Gen4 NVMe SSD brute force hot-remove and hot-plugin with any time internval between the two actions, also stressed with the DPC injection test. system recovered to normal working state from NON-FATAL/FATAL errors as expected. hotplug works well without any random undeterminate errors or malfunction. Brute DPC error injection script: for i in {0..100} do setpci -s 64:02.0 0x196.w=000a setpci -s 65:00.0 0x04.w=0544 mount /dev/nvme0n1p1 /root/nvme sleep 1 done Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang --- Changes: v2: revise doc according to Andy's suggestion. v3: no change. v4: no change. v5: no change. v6: moved to [3/5] from [2/5] and re-wrote description. v7: no change. v8: no change. drivers/pci/hotplug/pciehp_hpc.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 53433b37e181..6f271160f18d 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id) down_read(>reset_lock); if (events & DISABLE_SLOT) pciehp_handle_disable_request(ctrl); - else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) + else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) { + pci_wait_port_outdpc(pdev); pciehp_handle_presence_or_link_change(ctrl, events); + } up_read(>reset_lock); ret = IRQ_HANDLED; -- 2.18.4
Re: [PATCH v7 4/5] PCI: only return true when dev io state is really changed
Bjorn, On Sun, Oct 4, 2020 at 12:44 AM Bjorn Helgaas wrote: > > On Sat, Oct 03, 2020 at 03:55:13AM -0400, Ethan Zhao wrote: > > When uncorrectable error happens, AER driver and DPC driver interrupt > > handlers likely call > > > >pcie_do_recovery() > >->pci_walk_bus() > > ->report_frozen_detected() > > > > with pci_channel_io_frozen the same time. > >If pci_dev_set_io_state() return true even if the original state is > > pci_channel_io_frozen, that will cause AER or DPC handler re-enter > > the error detecting and recovery procedure one after another. > >The result is the recovery flow mixed between AER and DPC. > > So simplify the pci_dev_set_io_state() function to only return true > > when dev->error_state is really changed. > > > > Signed-off-by: Ethan Zhao > > Tested-by: Wen Jin > > Tested-by: Shanshan Zhang > > Reviewed-by: Alexandru Gagniuc > > Reviewed-by: Andy Shevchenko > > --- > > Changnes: > > v2: revise description and code according to suggestion from Andy. > > v3: change code to simpler. > > v4: no change. > > v5: no change. > > v6: no change. > > v7: changed based on Bjorn's code and truth table. > > > > drivers/pci/pci.h | 53 ++- > > 1 file changed, 20 insertions(+), 33 deletions(-) > > > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h > > index 455b32187abd..47af1ff2a286 100644 > > --- a/drivers/pci/pci.h > > +++ b/drivers/pci/pci.h > > @@ -354,44 +354,31 @@ struct pci_sriov { > > * > > * Must be called with device_lock held. > > * > > - * Returns true if state has been changed to the requested state. > > + * Returns true if state has been really changed to the requested state. > > */ > > static inline bool pci_dev_set_io_state(struct pci_dev *dev, > > pci_channel_state_t new) > > { > > - bool changed = false; > > - > > device_lock_assert(>dev); > > - switch (new) { > > - case pci_channel_io_perm_failure: > > - switch (dev->error_state) { > > - case pci_channel_io_frozen: > > - case pci_channel_io_normal: > > - case pci_channel_io_perm_failure: > > - changed = true; > > - break; > > - } > > - break; > > - case pci_channel_io_frozen: > > - switch (dev->error_state) { > > - case pci_channel_io_frozen: > > - case pci_channel_io_normal: > > - changed = true; > > - break; > > - } > > - break; > > - case pci_channel_io_normal: > > - switch (dev->error_state) { > > - case pci_channel_io_frozen: > > - case pci_channel_io_normal: > > - changed = true; > > - break; > > - } > > - break; > > - } > > - if (changed) > > - dev->error_state = new; > > - return changed; > > + > > +/* > > + * Truth table: > > + * requested new state > > + * current -- > > + * statenormal frozen perm_failure > > + * + - - > > + * normal| normal frozen perm_failure > > + * frozen| normal frozen perm_failure > > + * perm_failure | perm_failure* perm_failure* perm_failure > > + */ > > + > > + if (dev->error_state == pci_channel_io_perm_failure) > > + return false; > > + else if (dev->error_state == new) > > + return false; > > + > > + dev->error_state = new; > > + return true; > > No, you missed the point. I want > > 1) One patch that converts the "switch" to the shorter "if" > statements. This one will be big and ugly, but should not change > the functionality at all, and it should be pretty easy to verify > that since there aren't very many states involved. > > Since this one is pure code simplification, the commit log won't > say anything at all about AER or DPC or their requirements > because it's not changing any behavior. > > 2) A separate patch that's tiny and makes whatever functional change > you need. Make sense, clear, this time. Thanks, Ethan > > > } > > > > static inline int pci_dev_set_disconnected(struct pci_dev *dev, void > > *unused) > > -- > > 2.18.4 > >
Re: [PATCH v7 3/5] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC
Lukas, On Mon, Oct 5, 2020 at 3:13 AM Lukas Wunner wrote: > > On Sat, Oct 03, 2020 at 03:55:12AM -0400, Ethan Zhao wrote: > > When root port has DPC capability and it is enabled, then triggered by > > errors, DPC DLLSC and PDC etc interrupts will be sent to DPC driver, pciehp > > drivers almost at the same time. > > Do the DLLSC and PDC events occur as a result of handling the error > or do they occur independently? They could happen independently if links were recovered then the card was removed. They could happen as a result of handling the errors the same time. So don't assume DLLSC and PDC all occur at the same time. > > If the latter, I don't see how we can tell whether the card in the > slot is still the same. If PDC happens, the card in the slot might not be the same. so hot-removal /hot -plugin handling follows the PDC event. > > If the former, holding the hotplug slot's reset_lock and doing something > along the lines of pciehp_reset_slot() (or calling it directly) might > solve the race. DPC reset is done by hardware, only AER calls pciehp_reset_slot() as recovery handling initiated by software. Thanks, Ethan > > Thanks, > > Lukas
Re: [PATCH v7 0/5] Fix DPC hotplug race and enhance error handling
Raj, On Sun, Oct 4, 2020 at 12:57 PM Raj, Ashok wrote: > > Hi Ethan > > On Sat, Oct 03, 2020 at 03:55:09AM -0400, Ethan Zhao wrote: > > Hi,folks, > > > > This simple patch set fixed some serious security issues found when DPC > > error injection and NVMe SSD hotplug brute force test were doing -- race > > condition between DPC handler and pciehp, AER interrupt handlers, caused > > system hang and system with DPC feature couldn't recover to normal > > working state as expected (NVMe instance lost, mount operation hang, > > race PCIe access caused uncorrectable errors reported alternatively etc). > > I think maybe picking from other commit messages to make this description in > cover letter bit clear. The fundamental premise is that when due to error > conditions when events are processed by both DPC handler and hotplug handling > of > DLLSC both operating on the same device object ends up with crashes. Yep, that's right. Thanks, Ethan > > > > > > With this patch set applied, stable 5.9-rc6 on ICS (Ice Lake SP platform, > > see > > https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)) > > > > could pass the PCIe Gen4 NVMe SSD brute force hotplug test with any time > > interval between hot-remove and plug-in operation tens of times without > > any errors occur and system works normal. > > > > > With this patch set applied, system with DPC feature could recover from > > NON-FATAL and FATAL errors injection test and works as expected. > > > > System works smoothly when errors happen while hotplug is doing, no > > uncorrectable errors found. > > > > Brute DPC error injection script: > > > > for i in {0..100} > > do > > setpci -s 64:02.0 0x196.w=000a > > setpci -s 65:00.0 0x04.w=0544 > > mount /dev/nvme0n1p1 /root/nvme > > sleep 1 > > done > > > > Other details see every commits description part. > > > > This patch set could be applied to stable 5.9-rc6/rc7 directly. > > > > Help to review and test. > > > > v2: changed according to review by Andy Shevchenko. > > v3: changed patch 4/5 to simpler coding. > > v4: move function pci_wait_port_outdpc() to DPC driver and its > >declaration to pci.h. (tip from Christoph Hellwig ). > > v5: fix building issue reported by l...@intel.com with some config. > > v6: move patch[3/5] as the first patch according to Lukas's suggestion. > > and rewrite the comment part of patch[3/5]. > > v7: change the patch[4/5], based on Bjorn's code and truth table. > > change the patch[5/5] about the debug output information. > > > > Thanks, > > Ethan > > > > > > Ethan Zhao (5): > > PCI/ERR: get device before call device driver to avoid NULL pointer > > dereference > > PCI/DPC: define a function to check and wait till port finish DPC > > handling > > PCI: pciehp: check and wait port status out of DPC before handling > > DLLSC and PDC > > PCI: only return true when dev io state is really changed > > PCI/ERR: don't mix io state not changed and no driver together > > > > drivers/pci/hotplug/pciehp_hpc.c | 4 ++- > > drivers/pci/pci.h| 55 +--- > > drivers/pci/pcie/dpc.c | 27 > > drivers/pci/pcie/err.c | 18 +-- > > 4 files changed, 68 insertions(+), 36 deletions(-) > > > > > > base-commit: a1b8638ba1320e6684aa98233c15255eb803fac7 > > -- > > 2.18.4 > >
[PATCH v7 5/5] PCI/ERR: don't mix io state not changed and no driver together
When we see 'can't recover (no error_detected callback)' on console, Maybe the reason is io state is not changed by calling pci_dev_set_io_state(), that is confused. fix it. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang --- Chagnes: v2: no change. v3: no change. v4: no change. v5: no change. v6: no change. v7: change debug output information. drivers/pci/pcie/err.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index e35c4480c86b..2ca2723f3b34 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -55,8 +55,10 @@ static int report_error_detected(struct pci_dev *dev, if (!pci_dev_get(dev)) return 0; device_lock(>dev); - if (!pci_dev_set_io_state(dev, state) || - !dev->driver || + if (!pci_dev_set_io_state(dev, state)) { + pci_dbg(dev, "Device was in that state or not allowed setting.\n"); + vote = PCI_ERS_RESULT_NONE; + } else if (!dev->driver || !dev->driver->err_handler || !dev->driver->err_handler->error_detected) { /* -- 2.18.4
[PATCH v7 4/5] PCI: only return true when dev io state is really changed
When uncorrectable error happens, AER driver and DPC driver interrupt handlers likely call pcie_do_recovery() ->pci_walk_bus() ->report_frozen_detected() with pci_channel_io_frozen the same time. If pci_dev_set_io_state() return true even if the original state is pci_channel_io_frozen, that will cause AER or DPC handler re-enter the error detecting and recovery procedure one after another. The result is the recovery flow mixed between AER and DPC. So simplify the pci_dev_set_io_state() function to only return true when dev->error_state is really changed. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang Reviewed-by: Alexandru Gagniuc Reviewed-by: Andy Shevchenko --- Changnes: v2: revise description and code according to suggestion from Andy. v3: change code to simpler. v4: no change. v5: no change. v6: no change. v7: changed based on Bjorn's code and truth table. drivers/pci/pci.h | 53 ++- 1 file changed, 20 insertions(+), 33 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 455b32187abd..47af1ff2a286 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -354,44 +354,31 @@ struct pci_sriov { * * Must be called with device_lock held. * - * Returns true if state has been changed to the requested state. + * Returns true if state has been really changed to the requested state. */ static inline bool pci_dev_set_io_state(struct pci_dev *dev, pci_channel_state_t new) { - bool changed = false; - device_lock_assert(>dev); - switch (new) { - case pci_channel_io_perm_failure: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - case pci_channel_io_perm_failure: - changed = true; - break; - } - break; - case pci_channel_io_frozen: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - changed = true; - break; - } - break; - case pci_channel_io_normal: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - changed = true; - break; - } - break; - } - if (changed) - dev->error_state = new; - return changed; + +/* + * Truth table: + * requested new state + * current -- + * statenormal frozen perm_failure + * + - - + * normal| normal frozen perm_failure + * frozen| normal frozen perm_failure + * perm_failure | perm_failure* perm_failure* perm_failure + */ + + if (dev->error_state == pci_channel_io_perm_failure) + return false; + else if (dev->error_state == new) + return false; + + dev->error_state = new; + return true; } static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused) -- 2.18.4
[PATCH v7 1/5] PCI/ERR: get device before call device driver to avoid NULL pointer dereference
During DPC error injection test we found there is race condition between pciehp and DPC driver, NULL pointer dereference caused panic as following # setpci -s 64:02.0 0x196.w=000a // 64:02.0 is rootport has DPC capability # setpci -s 65:00.0 0x04.w=0544 // 65:00.0 is NVMe SSD populated in above port # mount /dev/nvme0n1p1 nvme (tested on stable 5.8 & ICS(Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)) BUG: kernel NULL pointer dereference, address: 0050 ... CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0-0.0.7.el8.x86_64+ #1 RIP: 0010:report_error_detected.cold.4+0x7d/0xe6 Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3 65 ff ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50 50 48 83 3a 00 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0 RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246 RAX: RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01 RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138 RBP: ff8e06cf8762fdd0 R08: 00bf R09: R10: 00eb8ebeab53 R11: 93453258 R12: 0002 R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828 FS: () GS:ff4e3eab1fd0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 PKRU: 5554 Call Trace: ? report_normal_detected+0x20/0x20 report_frozen_detected+0x16/0x20 pci_walk_bus+0x75/0x90 ? dpc_irq+0x90/0x90 pcie_do_recovery+0x157/0x201 ? irq_finalize_oneshot.part.47+0xe0/0xe0 dpc_handler+0x29/0x40 irq_thread_fn+0x24/0x60 ... Debug shows when port DPC feature was enabled and triggered by errors, DLLSC/PDC/DPC interrupts will be sent to pciehp and DPC driver almost at the same time, and no delay between them is required by specification. so DPC driver and pciehp drivers may handle these interrupts cocurrently. While DPC driver is doing pci_walk_bus() and calling device driver's callback without pci_dev_get() to increase device reference count, the device and its driver instance are likely being freed by pci_stop_and_removed_bus_device() -> pci_dev_put(). So does pci_dev_get() before using the device instance to avoid NULL pointer dereference. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang --- Changes: v2: revise doc according to Andy's suggestion. v3: no change. v4: no change. v5: no change. v6: moved to [1/5] from [3/5] and revised comment according to Lukas' suggestion. v7: no change. drivers/pci/pcie/err.c | 12 1 file changed, 12 insertions(+) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index c543f419d8f9..e35c4480c86b 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -52,6 +52,8 @@ static int report_error_detected(struct pci_dev *dev, pci_ers_result_t vote; const struct pci_error_handlers *err_handler; + if (!pci_dev_get(dev)) + return 0; device_lock(>dev); if (!pci_dev_set_io_state(dev, state) || !dev->driver || @@ -76,6 +78,7 @@ static int report_error_detected(struct pci_dev *dev, pci_uevent_ers(dev, vote); *result = merge_result(*result, vote); device_unlock(>dev); + pci_dev_put(dev); return 0; } @@ -94,6 +97,8 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data) pci_ers_result_t vote, *result = data; const struct pci_error_handlers *err_handler; + if (!pci_dev_get(dev)) + return 0; device_lock(>dev); if (!dev->driver || !dev->driver->err_handler || @@ -105,6 +110,7 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data) *result = merge_result(*result, vote); out: device_unlock(>dev); + pci_dev_put(dev); return 0; } @@ -113,6 +119,8 @@ static int report_slot_reset(struct pci_dev *dev, void *data) pci_ers_result_t vote, *result = data; const struct pci_error_handlers *err_handler; + if (!pci_dev_get(dev)) + return 0; device_lock(>dev); if (!dev->driver || !dev->driver->err_handler || @@ -124,6 +132,7 @@ static int report_slot_reset(struct pci_dev *dev, void *data) *result = merge_result(*result, vote); out: device_unlock(>dev); + pci_dev_put(dev); return 0; } @@ -131,6 +140,8 @@ static int report_resume(struct pci_dev *dev, void *data) { const struct pci_error_handlers *err_handler; + if (!pci_dev_get(dev)) + return 0; device_lock(>dev); if (!pci_dev_set_io_state(dev,
[PATCH v7 3/5] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC
When root port has DPC capability and it is enabled, then triggered by errors, DPC DLLSC and PDC etc interrupts will be sent to DPC driver, pciehp drivers almost at the same time. That will cause following messed and confused errors handling/recovery/removal /plugin procedure. 1. Port and device are in error recovery reseting initiated by DPC hardware, pciehp driver treats them as device is doing hot-remove or hot-plugin the same time. 2. While DPC handler calling device driver->err_handler callback( error_detected/resume etc), but the slot may be powered off by pciehp -> remove_board() -> pciehp_power_off_slot(). 3. While DPC handler -> pci_do_recovery is doing different action to detect error and recover based on device->error_state, pciehp driver could change it on the fly by: pciehp_unconfigure_device() ->pci_walk_bus() -> pci_dev_set_disconnected() 4. While DPC handler is calling device driver err_handler callback to detect error and recover, pciehp driver could is doing device unbind and release its driver. ... While NON-FATAL/FATAL errors happen while hotplug is(is not)doing, result is not determinate. So we need some kind of synchronization between pciehp DLLSC/PDC handling and DPC driver error recover handling. we need a determinate result of DPC error containment, link is recovered, link isn't recovered, device is still there, device is removed, then do pciehp hot-remove and hot-plugin procudure, don't mix them together. Per our test on ICS platform, DPC error containment and software handler will take 10ms up to 50ms till clean the DPC triggered status. it is quick enough for pciehp compared with its 1000ms waiting to ignore DLLSC/PDC after doing power off. With this patch, the handling flow of DPC containment and hotplug is partly ordered and serialized, let hardware DPC do the controller reset etc recovery action first, then DPC driver handling the call-back from device drivers, clear the DPC status, at the end, pciehp handle the DLLSC and PDC etc. After tens of PCIe Gen4 NVMe SSD brute force hot-remove and hot-plugin with any time internval between the two actions, also stressed with the DPC injection test. system recovered to normal working state from NON-FATAL/FATAL errors as expected. hotplug works well without any random undeterminate errors or malfunction. Brute DPC error injection script: for i in {0..100} do setpci -s 64:02.0 0x196.w=000a setpci -s 65:00.0 0x04.w=0544 mount /dev/nvme0n1p1 /root/nvme sleep 1 done Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang --- Changes: v2: revise doc according to Andy's suggestion. v3: no change. v4: no change. v5: no change. v6: moved to [3/5] from [2/5] and re-wrote description. v7: no change. drivers/pci/hotplug/pciehp_hpc.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 53433b37e181..6f271160f18d 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id) down_read(>reset_lock); if (events & DISABLE_SLOT) pciehp_handle_disable_request(ctrl); - else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) + else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) { + pci_wait_port_outdpc(pdev); pciehp_handle_presence_or_link_change(ctrl, events); + } up_read(>reset_lock); ret = IRQ_HANDLED; -- 2.18.4
[PATCH v7 2/5] PCI/DPC: define a function to check and wait till port finish DPC handling
Once root port DPC capability is enabled and triggered, at the beginning of DPC is triggered, the DPC status bits are set by hardware and then sends DPC/DLLSC/PDC interrupts to OS DPC and pciehp drivers, it will take the port and software DPC interrupt handler 10ms to 50ms (test data on ICS(Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server) & stable 5.9-rc6) to complete the DPC containment procedure till the DPC status is cleared at the end of the DPC interrupt handler. We use this function to check if the root port is in DPC handling status and wait till the hardware and software completed the procedure. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang --- changes: v2:align ICS code name to public doc. v3: no change. v4: response to Christoph's (Christoph Hellwig ) tip, move pci_wait_port_outdpc() to DPC driver and its declaration to pci.h. v5: fix building issue reported by l...@intel.com with some config. v6: move from [1/5] to [2/5]. v7: no change. drivers/pci/pci.h | 2 ++ drivers/pci/pcie/dpc.c | 27 +++ 2 files changed, 29 insertions(+) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index fa12f7cbc1a0..455b32187abd 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -455,10 +455,12 @@ void pci_restore_dpc_state(struct pci_dev *dev); void pci_dpc_init(struct pci_dev *pdev); void dpc_process_error(struct pci_dev *pdev); pci_ers_result_t dpc_reset_link(struct pci_dev *pdev); +bool pci_wait_port_outdpc(struct pci_dev *pdev); #else static inline void pci_save_dpc_state(struct pci_dev *dev) {} static inline void pci_restore_dpc_state(struct pci_dev *dev) {} static inline void pci_dpc_init(struct pci_dev *pdev) {} +static inline bool pci_wait_port_outdpc(struct pci_dev *pdev) { return false; } #endif #ifdef CONFIG_PCI_ATS diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c index daa9a4153776..2e0e091ce923 100644 --- a/drivers/pci/pcie/dpc.c +++ b/drivers/pci/pcie/dpc.c @@ -71,6 +71,33 @@ void pci_restore_dpc_state(struct pci_dev *dev) pci_write_config_word(dev, dev->dpc_cap + PCI_EXP_DPC_CTL, *cap); } +bool pci_wait_port_outdpc(struct pci_dev *pdev) +{ + u16 cap = pdev->dpc_cap, status; + u16 loop = 0; + + if (!cap) { + pci_WARN_ONCE(pdev, !cap, "No DPC capability initiated\n"); + return false; + } + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, ); + pci_dbg(pdev, "DPC status %x, cap %x\n", status, cap); + + while (status & PCI_EXP_DPC_STATUS_TRIGGER && loop < 100) { + msleep(10); + loop++; + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, ); + } + + if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) { + pci_dbg(pdev, "Out of DPC %x, cost %d ms\n", status, loop*10); + return true; + } + + pci_dbg(pdev, "Timeout to wait port out of DPC status\n"); + return false; +} + static int dpc_wait_rp_inactive(struct pci_dev *pdev) { unsigned long timeout = jiffies + HZ; -- 2.18.4
[PATCH v7 0/5] Fix DPC hotplug race and enhance error handling
Hi,folks, This simple patch set fixed some serious security issues found when DPC error injection and NVMe SSD hotplug brute force test were doing -- race condition between DPC handler and pciehp, AER interrupt handlers, caused system hang and system with DPC feature couldn't recover to normal working state as expected (NVMe instance lost, mount operation hang, race PCIe access caused uncorrectable errors reported alternatively etc). With this patch set applied, stable 5.9-rc6 on ICS (Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)) could pass the PCIe Gen4 NVMe SSD brute force hotplug test with any time interval between hot-remove and plug-in operation tens of times without any errors occur and system works normal. With this patch set applied, system with DPC feature could recover from NON-FATAL and FATAL errors injection test and works as expected. System works smoothly when errors happen while hotplug is doing, no uncorrectable errors found. Brute DPC error injection script: for i in {0..100} do setpci -s 64:02.0 0x196.w=000a setpci -s 65:00.0 0x04.w=0544 mount /dev/nvme0n1p1 /root/nvme sleep 1 done Other details see every commits description part. This patch set could be applied to stable 5.9-rc6/rc7 directly. Help to review and test. v2: changed according to review by Andy Shevchenko. v3: changed patch 4/5 to simpler coding. v4: move function pci_wait_port_outdpc() to DPC driver and its declaration to pci.h. (tip from Christoph Hellwig ). v5: fix building issue reported by l...@intel.com with some config. v6: move patch[3/5] as the first patch according to Lukas's suggestion. and rewrite the comment part of patch[3/5]. v7: change the patch[4/5], based on Bjorn's code and truth table. change the patch[5/5] about the debug output information. Thanks, Ethan Ethan Zhao (5): PCI/ERR: get device before call device driver to avoid NULL pointer dereference PCI/DPC: define a function to check and wait till port finish DPC handling PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC PCI: only return true when dev io state is really changed PCI/ERR: don't mix io state not changed and no driver together drivers/pci/hotplug/pciehp_hpc.c | 4 ++- drivers/pci/pci.h| 55 +--- drivers/pci/pcie/dpc.c | 27 drivers/pci/pcie/err.c | 18 +-- 4 files changed, 68 insertions(+), 36 deletions(-) base-commit: a1b8638ba1320e6684aa98233c15255eb803fac7 -- 2.18.4
Re: [PATCH v6 4/5] PCI: only return true when dev io state is really changed
Bjorn, On Sat, Oct 3, 2020 at 1:29 AM Bjorn Helgaas wrote: > > [+cc Sinan] > > On Wed, Sep 30, 2020 at 03:05:36AM -0400, Ethan Zhao wrote: > > When uncorrectable error happens, AER driver and DPC driver interrupt > > handlers likely call > > > >pcie_do_recovery() > >->pci_walk_bus() > > ->report_frozen_detected() > > > > with pci_channel_io_frozen the same time. > >If pci_dev_set_io_state() return true even if the original state is > > pci_channel_io_frozen, that will cause AER or DPC handler re-enter > > the error detecting and recovery procedure one after another. > >The result is the recovery flow mixed between AER and DPC. > > So simplify the pci_dev_set_io_state() function to only return true > > when dev->error_state is changed. > > > > Signed-off-by: Ethan Zhao > > Tested-by: Wen Jin > > Tested-by: Shanshan Zhang > > Reviewed-by: Alexandru Gagniuc > > Reviewed-by: Andy Shevchenko > > --- > > Changnes: > > v2: revise description and code according to suggestion from Andy. > > v3: change code to simpler. > > v4: no change. > > v5: no change. > > v6: no change. > > > > drivers/pci/pci.h | 37 + > > 1 file changed, 5 insertions(+), 32 deletions(-) > > > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h > > index 455b32187abd..f2beeaeda321 100644 > > --- a/drivers/pci/pci.h > > +++ b/drivers/pci/pci.h > > @@ -359,39 +359,12 @@ struct pci_sriov { > > static inline bool pci_dev_set_io_state(struct pci_dev *dev, > > pci_channel_state_t new) > > { > > - bool changed = false; > > - > > device_lock_assert(>dev); > > - switch (new) { > > - case pci_channel_io_perm_failure: > > - switch (dev->error_state) { > > - case pci_channel_io_frozen: > > - case pci_channel_io_normal: > > - case pci_channel_io_perm_failure: > > - changed = true; > > - break; > > - } > > - break; > > - case pci_channel_io_frozen: > > - switch (dev->error_state) { > > - case pci_channel_io_frozen: > > - case pci_channel_io_normal: > > - changed = true; > > - break; > > - } > > - break; > > - case pci_channel_io_normal: > > - switch (dev->error_state) { > > - case pci_channel_io_frozen: > > - case pci_channel_io_normal: > > - changed = true; > > - break; > > - } > > - break; > > - } > > - if (changed) > > - dev->error_state = new; > > - return changed; > > + if (dev->error_state == new) > > + return false; > > + > > + dev->error_state = new; > > + return true; > > } > > IIUC this changes the behavior of the function, but it's difficult to > analyze because it does a lot of simplification at the same time. > > Please consider the following, which is intended to simplify the > function while preserving the behavior (but please verify; it's been a > long time since I looked at this). Then maybe see how your patch > could be done on top of this? > > Alternatively, come up with your own simplification patch + the > functionality change. > > > commit 983d9b1f8177 ("PCI/ERR: Simplify pci_dev_set_io_state()") > Author: Bjorn Helgaas > Date: Tue May 19 12:28:57 2020 -0500 > > PCI/ERR: Simplify pci_dev_set_io_state() > > Truth table: > > requested new state > current -- > statenormal frozen perm_failure > + - - > normal| normal frozen perm_failure > frozen| normal frozen perm_failure > perm_failure | perm_failure* perm_failure* perm_failure > > * "not changed", returns false > > No functional change intended. > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h > index 6d3f75867106..81408552f7c9 100644 > --- a/drivers/pci/pci.h > +++ b/drivers/pci/pci.h > @@ -358,39 +358,21 @@ struct pci_sriov { > static inline bool pci_dev_set_io_state(str
Re: [PATCH v6 4/5] PCI: only return true when dev io state is really changed
Sinan, On Sat, Oct 3, 2020 at 12:08 AM Sinan Kaya wrote: > > On 9/30/2020 3:05 AM, Ethan Zhao wrote: > > When uncorrectable error happens, AER driver and DPC driver interrupt > > handlers likely call > > > >pcie_do_recovery() > >->pci_walk_bus() > > ->report_frozen_detected() > > > > with pci_channel_io_frozen the same time. > > We need some more data on this. If DPC is supported by HW, errors > should be triggered by DPC not AER. > > If I remember right, there is a register that tells which AER errors > should be handled by DPC. When uncorrectable errors happen, non-fatal severity level, AER and DPC could be triggered at the same time. Thanks, Ethan >
[PATCH v6 3/5] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC
When root port has DPC capability and it is enabled, then triggered by errors, DPC DLLSC and PDC etc interrupts will be sent to DPC driver, pciehp drivers almost at the same time. That will cause following messed and confused errors handling/recovery/ removal/plugin procedure. 1. Port and device are in error recovery resetting initiated by DPC hardware, pciehp driver treats them as device is doing hot-remove or hot-plugin the same time. 2. While DPC handler calling device driver->err_handler callback( error_detected/resume etc), but the slot may be powered off by pciehp -> remove_board() -> pciehp_power_off_slot(). 3. While DPC handler -> pci_do_recovery is doing different action to detect error and recover based on device->error_state, pciehp driver could change it on the fly by: pciehp_unconfigure_device() ->pci_walk_bus() -> pci_dev_set_disconnected() 4. While DPC handler is calling device driver err_handler callback to detect error and recover, pciehp driver could is doing device unbind and release its driver. ... While NON-FATAL/FATAL errors happen while hotplug is(is not)doing, result is not determinate. So we need some kind of synchronization between pciehp DLLSC/PDC handling and DPC driver error recover handling. we need a determinate result of DPC error containment, link is recovered, link isn't recovered, device is still there, device is removed, then do pciehp hot-remove and hot-plugin procudure, don't mix them together. Per our test on ICS platform, DPC error containment and software handler will take 10ms up to 50ms till clean the DPC triggered status. it is quick enough for pciehp compared with its 1000ms waiting to ignore DLLSC/PDC after doing power off. With this patch, the handling flow of DPC containment and hotplug is partly ordered and serialized, let hardware DPC do the controller reset etc recovery action first, then DPC driver handling the call-back from device drivers, clear the DPC status, at the end, pciehp handle the DLLSC and PDC etc. After tens of PCIe Gen4 NVMe SSD brute force hot-remove and hot-plugin with any time internval between the two actions, also stressed with the DPC injection test. system recovered to normal working state from NON-FATAL/ FATAL errors as expected. hotplug works well without any random undeterminate errors or malfunction. Brute DPC error injection script: for i in {0..100} do setpci -s 64:02.0 0x196.w=000a setpci -s 65:00.0 0x04.w=0544 mount /dev/nvme0n1p1 /root/nvme sleep 1 done Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang --- Changes: v2: revise doc according to Andy's suggestion. v3: no change. v4: no change. v5: no change. v6: moved to [3/5] from [2/5] and re-wrote description. drivers/pci/hotplug/pciehp_hpc.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 53433b37e181..6f271160f18d 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id) down_read(>reset_lock); if (events & DISABLE_SLOT) pciehp_handle_disable_request(ctrl); - else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) + else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) { + pci_wait_port_outdpc(pdev); pciehp_handle_presence_or_link_change(ctrl, events); + } up_read(>reset_lock); ret = IRQ_HANDLED; -- 2.18.4
[PATCH v6 0/5] Fix DPC hotplug race and enhance error handling
Hi,folks, This simple patch set fixed some serious security issues found when DPC error injection and NVMe SSD hotplug brute force test were doing -- race condition between DPC handler and pciehp, AER interrupt handlers, caused system hang and system with DPC feature couldn't recover to normal working state as expected (NVMe instance lost, mount operation hang, race PCIe access caused uncorrectable errors reported alternatively etc). With this patch set applied, stable 5.9-rc6 on ICS (Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)) could pass the PCIe Gen4 NVMe SSD brute force hotplug test with any time interval between hot-remove and plug-in operation tens of times without any errors occur and system works normal. With this patch set applied, system with DPC feature could recover from NON-FATAL and FATAL errors injection test and works as expected. System works smoothly when errors happen while hotplug is doing, no uncorrectable errors found. Brute DPC error injection script: for i in {0..100} do setpci -s 64:02.0 0x196.w=000a setpci -s 65:00.0 0x04.w=0544 mount /dev/nvme0n1p1 /root/nvme sleep 1 done Other details see every commits description part. This patch set could be applied to stable 5.9-rc6/rc7 directly. Help to review and test. v2: changed according to review by Andy Shevchenko. v3: changed patch 4/5 to simpler coding. v4: move function pci_wait_port_outdpc() to DPC driver and its declaration to pci.h. (tip from Christoph Hellwig ). v5: fix building issue reported by l...@intel.com with some config. v6: move patch[3/5] as the first patch according to Lukas's suggestion. and rewrite the comment part of patch[3/5]. Ethan Zhao (5): PCI/ERR: get device before call device driver to avoid NULL pointer dereference PCI/DPC: define a function to check and wait till port finish DPC handling PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC PCI: only return true when dev io state is really changed PCI/ERR: don't mix io state not changed and no driver together drivers/pci/hotplug/pciehp_hpc.c | 4 +++- drivers/pci/pci.h| 39 ++-- drivers/pci/pcie/dpc.c | 27 ++ drivers/pci/pcie/err.c | 18 +-- 4 files changed, 53 insertions(+), 35 deletions(-) base-commit: a1b8638ba1320e6684aa98233c15255eb803fac7 -- 2.18.4
[PATCH v6 5/5] PCI/ERR: don't mix io state not changed and no driver together
When we see 'can't recover (no error_detected callback)' on console, Maybe the reason is io state is not changed by calling pci_dev_set_io_state(), that is confused. fix it. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang --- Chagnes: v2: no change. v3: no change. v4: no change. v5: no change. v6: no change. drivers/pci/pcie/err.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index e35c4480c86b..d85f27c90c26 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -55,8 +55,10 @@ static int report_error_detected(struct pci_dev *dev, if (!pci_dev_get(dev)) return 0; device_lock(>dev); - if (!pci_dev_set_io_state(dev, state) || - !dev->driver || + if (!pci_dev_set_io_state(dev, state)) { + pci_dbg(dev, "Device might already being in error handling ...\n"); + vote = PCI_ERS_RESULT_NONE; + } else if (!dev->driver || !dev->driver->err_handler || !dev->driver->err_handler->error_detected) { /* -- 2.18.4
[PATCH v6 2/5] PCI/DPC: define a function to check and wait till port finish DPC handling
Once root port DPC capability is enabled and triggered, at the beginning of DPC is triggered, the DPC status bits are set by hardware and then sends DPC/DLLSC/PDC interrupts to OS DPC and pciehp drivers, it will take the port and software DPC interrupt handler 10ms to 50ms (test data on ICS(Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server) & stable 5.9-rc6) to complete the DPC containment procedure till the DPC status is cleared at the end of the DPC interrupt handler. We use this function to check if the root port is in DPC handling status and wait till the hardware and software completed the procedure. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang --- changes: v2:align ICS code name to public doc. v3: no change. v4: response to Christoph's (Christoph Hellwig ) tip, move pci_wait_port_outdpc() to DPC driver and its declaration to pci.h. v5: fix building issue reported by l...@intel.com with some config. v6: move from [1/5] to [2/5]. drivers/pci/pci.h | 2 ++ drivers/pci/pcie/dpc.c | 27 +++ 2 files changed, 29 insertions(+) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index fa12f7cbc1a0..455b32187abd 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -455,10 +455,12 @@ void pci_restore_dpc_state(struct pci_dev *dev); void pci_dpc_init(struct pci_dev *pdev); void dpc_process_error(struct pci_dev *pdev); pci_ers_result_t dpc_reset_link(struct pci_dev *pdev); +bool pci_wait_port_outdpc(struct pci_dev *pdev); #else static inline void pci_save_dpc_state(struct pci_dev *dev) {} static inline void pci_restore_dpc_state(struct pci_dev *dev) {} static inline void pci_dpc_init(struct pci_dev *pdev) {} +static inline bool pci_wait_port_outdpc(struct pci_dev *pdev) { return false; } #endif #ifdef CONFIG_PCI_ATS diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c index daa9a4153776..2e0e091ce923 100644 --- a/drivers/pci/pcie/dpc.c +++ b/drivers/pci/pcie/dpc.c @@ -71,6 +71,33 @@ void pci_restore_dpc_state(struct pci_dev *dev) pci_write_config_word(dev, dev->dpc_cap + PCI_EXP_DPC_CTL, *cap); } +bool pci_wait_port_outdpc(struct pci_dev *pdev) +{ + u16 cap = pdev->dpc_cap, status; + u16 loop = 0; + + if (!cap) { + pci_WARN_ONCE(pdev, !cap, "No DPC capability initiated\n"); + return false; + } + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, ); + pci_dbg(pdev, "DPC status %x, cap %x\n", status, cap); + + while (status & PCI_EXP_DPC_STATUS_TRIGGER && loop < 100) { + msleep(10); + loop++; + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, ); + } + + if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) { + pci_dbg(pdev, "Out of DPC %x, cost %d ms\n", status, loop*10); + return true; + } + + pci_dbg(pdev, "Timeout to wait port out of DPC status\n"); + return false; +} + static int dpc_wait_rp_inactive(struct pci_dev *pdev) { unsigned long timeout = jiffies + HZ; -- 2.18.4
[PATCH v6 4/5] PCI: only return true when dev io state is really changed
When uncorrectable error happens, AER driver and DPC driver interrupt handlers likely call pcie_do_recovery() ->pci_walk_bus() ->report_frozen_detected() with pci_channel_io_frozen the same time. If pci_dev_set_io_state() return true even if the original state is pci_channel_io_frozen, that will cause AER or DPC handler re-enter the error detecting and recovery procedure one after another. The result is the recovery flow mixed between AER and DPC. So simplify the pci_dev_set_io_state() function to only return true when dev->error_state is changed. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang Reviewed-by: Alexandru Gagniuc Reviewed-by: Andy Shevchenko --- Changnes: v2: revise description and code according to suggestion from Andy. v3: change code to simpler. v4: no change. v5: no change. v6: no change. drivers/pci/pci.h | 37 + 1 file changed, 5 insertions(+), 32 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 455b32187abd..f2beeaeda321 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -359,39 +359,12 @@ struct pci_sriov { static inline bool pci_dev_set_io_state(struct pci_dev *dev, pci_channel_state_t new) { - bool changed = false; - device_lock_assert(>dev); - switch (new) { - case pci_channel_io_perm_failure: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - case pci_channel_io_perm_failure: - changed = true; - break; - } - break; - case pci_channel_io_frozen: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - changed = true; - break; - } - break; - case pci_channel_io_normal: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - changed = true; - break; - } - break; - } - if (changed) - dev->error_state = new; - return changed; + if (dev->error_state == new) + return false; + + dev->error_state = new; + return true; } static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused) -- 2.18.4
[PATCH v6 1/5] PCI/ERR: get device before call device driver to avoid NULL pointer dereference
During DPC error injection test we found there is race condition between pciehp and DPC driver, NULL pointer dereference caused panic as following # setpci -s 64:02.0 0x196.w=000a // 64:02.0 is rootport has DPC capability # setpci -s 65:00.0 0x04.w=0544 // 65:00.0 is NVMe SSD populated in above port # mount /dev/nvme0n1p1 nvme (tested on stable 5.8 & ICS(Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)) BUG: kernel NULL pointer dereference, address: 0050 ... CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0-0.0.7.el8.x86_64+ #1 RIP: 0010:report_error_detected.cold.4+0x7d/0xe6 Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3 65 ff ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50 50 48 83 3a 00 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0 RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246 RAX: RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01 RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138 RBP: ff8e06cf8762fdd0 R08: 00bf R09: R10: 00eb8ebeab53 R11: 93453258 R12: 0002 R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828 FS: () GS:ff4e3eab1fd0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 PKRU: 5554 Call Trace: ? report_normal_detected+0x20/0x20 report_frozen_detected+0x16/0x20 pci_walk_bus+0x75/0x90 ? dpc_irq+0x90/0x90 pcie_do_recovery+0x157/0x201 ? irq_finalize_oneshot.part.47+0xe0/0xe0 dpc_handler+0x29/0x40 irq_thread_fn+0x24/0x60 ... Debug shows when port DPC feature was enabled and triggered by errors, DLLSC/PDC/DPC interrupts will be sent to pciehp and DPC driver almost at the same time, and no delay between them is required by specification. so DPC driver and pciehp drivers may handle these interrupts cocurrently. While DPC driver is doing pci_walk_bus() and calling device driver's callback without pci_dev_get() to increase device reference count, the device and its driver instance are likely being freed by pci_stop_and_removed_bus_device() -> pci_dev_put(). So does pci_dev_get() before using the device instance to avoid NULL pointer dereference. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang --- v2: revise doc according to Andy's suggestion. v3: no change. v4: no change. v5: no change. v6: moved to [1/5] from [3/5] and revised comment according to Lukas' suggestion. drivers/pci/pcie/err.c | 12 1 file changed, 12 insertions(+) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index c543f419d8f9..e35c4480c86b 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -52,6 +52,8 @@ static int report_error_detected(struct pci_dev *dev, pci_ers_result_t vote; const struct pci_error_handlers *err_handler; + if (!pci_dev_get(dev)) + return 0; device_lock(>dev); if (!pci_dev_set_io_state(dev, state) || !dev->driver || @@ -76,6 +78,7 @@ static int report_error_detected(struct pci_dev *dev, pci_uevent_ers(dev, vote); *result = merge_result(*result, vote); device_unlock(>dev); + pci_dev_put(dev); return 0; } @@ -94,6 +97,8 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data) pci_ers_result_t vote, *result = data; const struct pci_error_handlers *err_handler; + if (!pci_dev_get(dev)) + return 0; device_lock(>dev); if (!dev->driver || !dev->driver->err_handler || @@ -105,6 +110,7 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data) *result = merge_result(*result, vote); out: device_unlock(>dev); + pci_dev_put(dev); return 0; } @@ -113,6 +119,8 @@ static int report_slot_reset(struct pci_dev *dev, void *data) pci_ers_result_t vote, *result = data; const struct pci_error_handlers *err_handler; + if (!pci_dev_get(dev)) + return 0; device_lock(>dev); if (!dev->driver || !dev->driver->err_handler || @@ -124,6 +132,7 @@ static int report_slot_reset(struct pci_dev *dev, void *data) *result = merge_result(*result, vote); out: device_unlock(>dev); + pci_dev_put(dev); return 0; } @@ -131,6 +140,8 @@ static int report_resume(struct pci_dev *dev, void *data) { const struct pci_error_handlers *err_handler; + if (!pci_dev_get(dev)) + return 0; device_lock(>dev); if (!pci_dev_set_io_state(dev, pci_channel_io_norm
Re: [PATCH 2/5 V2] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC
On Tue, Sep 29, 2020 at 6:08 PM Lukas Wunner wrote: > > On Tue, Sep 29, 2020 at 05:46:41PM +0800, Ethan Zhao wrote: > > On Tue, Sep 29, 2020 at 4:29 PM Lukas Wunner wrote: > > > On Sun, Sep 27, 2020 at 11:27:46AM -0400, Sinan Kaya wrote: > > > > On 9/26/2020 11:28 PM, Ethan Zhao wrote: > > > > > --- a/drivers/pci/hotplug/pciehp_hpc.c > > > > > +++ b/drivers/pci/hotplug/pciehp_hpc.c > > > > > @@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void > > > > > *dev_id) > > > > > down_read(>reset_lock); > > > > > if (events & DISABLE_SLOT) > > > > > pciehp_handle_disable_request(ctrl); > > > > > - else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) > > > > > + else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) { > > > > > + pci_wait_port_outdpc(pdev); > > > > > pciehp_handle_presence_or_link_change(ctrl, events); > > > > > + } > > > > > up_read(>reset_lock); > > > > > > > > This looks like a hack TBH. > [...] > > > > Why is device lock not protecting this situation? > > > > Is there a lock missing in hotplug driver? > > > > > > According to Ethan's commit message, there are two issues here: > > > One, that pciehp may remove a device even though DPC recovered the error, > > > and two, that a null pointer deref occurs. > > > > > > The latter is most certainly not a locking issue but failure of DPC > > > to hold a reference on the pci_dev. > > > > This is what patch 3/5 proposed to fix. > > Please reorder the series to fix the null pointer deref first, > i.e. move patch 3 before patch 2. If the null pointer deref is > fixed by patch 3, do not mention it in patch 2. Make sense. Thanks, Ethan > > Thanks, > > Lukas
Re: [PATCH 3/5] PCI/ERR: get device before call device driver to avoid null pointer reference
On Tue, Sep 29, 2020 at 6:48 PM Andy Shevchenko wrote: > > On Tue, Sep 29, 2020 at 05:38:00PM +0800, Ethan Zhao wrote: > > On Tue, Sep 29, 2020 at 4:51 PM Andy Shevchenko > > wrote: > > > On Tue, Sep 29, 2020 at 10:35:14AM +0800, Ethan Zhao wrote: > > > > Preferred style, there will be cleared comment in v6. > > > > > > Avoid top postings. > > > > > > > On Sat, Sep 26, 2020 at 12:42 AM Andy Shevchenko > > > > wrote: > > > > > On Thu, Sep 24, 2020 at 10:34:21PM -0400, Ethan Zhao wrote: > > ... > > > > > > > Buffer I/O error on dev nvme0n1p1, logical block 468843328, > > > > > > async page read > > > > > > BUG: kernel NULL pointer dereference, address: 0050 > > > > > > #PF: supervisor read access in kernel mode > > > > > > #PF: error_code(0x) - not-present page > > > > > > > > > > Same comment about Oops. > > > > > > In another thread it was a good advice to move the full Oops (if you > > > think it's > > > very useful to have) after the cutter '---' line, so it will be in email > > > archives but Git history. > > > > So git history wouldn't give any of the Oops context, and he/she has > > to access LKML, > > if offline, then ...lost. > > Tell me, do you really think the line: >#PF: error_code(0x) - not-present page > makes any sense in the commit message? > > I do not think so. And so on, for almost 60% of the Oops. > Really, to help one person you will make millions suffering. It's not okay. > If you and millions feel so suffered, why not try to simplify the Oops code to not output nonsense too much to console and log. :) They might not be so important to old birds as you. but it is easy to cut off the road for newcomers. Anyway, it is not the focus of this patchset. help to take a look and try the code. Thanks, Ethan > -- > With Best Regards, > Andy Shevchenko > >
Re: [PATCH 2/5 V2] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC
On Tue, Sep 29, 2020 at 4:29 PM Lukas Wunner wrote: > > On Sun, Sep 27, 2020 at 11:27:46AM -0400, Sinan Kaya wrote: > > On 9/26/2020 11:28 PM, Ethan Zhao wrote: > > > --- a/drivers/pci/hotplug/pciehp_hpc.c > > > +++ b/drivers/pci/hotplug/pciehp_hpc.c > > > @@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id) > > > down_read(>reset_lock); > > > if (events & DISABLE_SLOT) > > > pciehp_handle_disable_request(ctrl); > > > - else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) > > > + else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) { > > > + pci_wait_port_outdpc(pdev); > > > pciehp_handle_presence_or_link_change(ctrl, events); > > > + } > > > up_read(>reset_lock); > > > > This looks like a hack TBH. > > > > Lukas, Keith; > > > > What is your take on this? > > Why is device lock not protecting this situation? > > > > Is there a lock missing in hotplug driver? > > According to Ethan's commit message, there are two issues here: > One, that pciehp may remove a device even though DPC recovered the error, > and two, that a null pointer deref occurs. > > The latter is most certainly not a locking issue but failure of DPC > to hold a reference on the pci_dev. This is what patch 3/5 proposed to fix. while this one is to re-order the mixed DPC recovery procedure and DLLSC/PDC event handling, to make pciehp to know the exact recovered result of DPC to malfunctional device link recovered, still there, or is removed from the slot. Thanks, Ethan > > Thanks, > > Lukas
Re: [PATCH 3/5] PCI/ERR: get device before call device driver to avoid null pointer reference
Andy, On Tue, Sep 29, 2020 at 4:51 PM Andy Shevchenko wrote: > > On Tue, Sep 29, 2020 at 10:35:14AM +0800, Ethan Zhao wrote: > > Preferred style, there will be cleared comment in v6. > > Avoid top postings. > > > On Sat, Sep 26, 2020 at 12:42 AM Andy Shevchenko > > wrote: > > > > > > On Thu, Sep 24, 2020 at 10:34:21PM -0400, Ethan Zhao wrote: > > > > During DPC error injection test we found there is race condition between > > > > pciehp and DPC driver, null pointer reference caused panic as following > > > > > > null -> NULL > > > > > > > > > > > # setpci -s 64:02.0 0x196.w=000a > > > > // 64:02.0 is rootport has DPC capability > > > > # setpci -s 65:00.0 0x04.w=0544 > > > > // 65:00.0 is NVMe SSD populated in above port > > > > # mount /dev/nvme0n1p1 nvme > > > > > > > > (tested on stable 5.8 & ICX platform) > > > > > > > > Buffer I/O error on dev nvme0n1p1, logical block 468843328, > > > > async page read > > > > BUG: kernel NULL pointer dereference, address: 0050 > > > > #PF: supervisor read access in kernel mode > > > > #PF: error_code(0x) - not-present page > > > > > > Same comment about Oops. > > In another thread it was a good advice to move the full Oops (if you think > it's > very useful to have) after the cutter '---' line, so it will be in email > archives but Git history. So git history wouldn't give any of the Oops context, and he/she has to access LKML, if offline, then ...lost. Thanks, Ethan > > -- > With Best Regards, > Andy Shevchenko > >
Re: [PATCH 2/5 V2] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC
On Tue, Sep 29, 2020 at 12:44 AM Sinan Kaya wrote: > > On 9/28/2020 7:10 AM, Sinan Kaya wrote: > > On 9/27/2020 10:01 PM, Zhao, Haifeng wrote: > >> Sinan, > >>I explained the reason why locks don't protect this case in the patch > >> description part. > >> Write side and read side hold different semaphore and mutex. > >> > > I have been thinking about it some time but is there any reason why we > > have to handle all port AER/DPC/HP events in different threads? > > > > Can we go to single threaded event loop for all port drivers events? > > > > This will require some refactoring but it wlll eliminate the lock > > nightmares we are having. > > > > This means no sleeping. All sleeps need to happen outside of the loop. > > > > I wanted to see what you all are thinking about this. > > > > It might become a performance problem if the system is > > continuously observing a hotplug/aer/dpc events. > > > > I always think that these should be rare events. > > If restructuring would be too costly, the preferred solution should be > to fix the locks in hotplug driver rather than throwing there a random > wait call. My first though is to unify the pci_bus_sem & pci_rescan_remove_lock to one sleepable lock, but verifying every locking scenario to sort out dead lock warning, it is horrible job. I gave up and then played the device status waiting trick to workaround it. index 03d37128a24f..477d4c499f87 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -3223,17 +3223,19 @@ EXPORT_SYMBOL_GPL(pci_rescan_bus); * pci_rescan_bus(), pci_rescan_bus_bridge_resize() and PCI device removal * routines should always be executed under this mutex. */ -static DEFINE_MUTEX(pci_rescan_remove_lock); +/* static DEFINE_MUTEX(pci_rescan_remove_lock); */ void pci_lock_rescan_remove(void) { - mutex_lock(_rescan_remove_lock); + /*mutex_lock(_rescan_remove_lock); */ + down_write(_bus_sem); } EXPORT_SYMBOL_GPL(pci_lock_rescan_remove); void pci_unlock_rescan_remove(void) { - mutex_unlock(_rescan_remove_lock); + /*mutex_unlock(_rescan_remove_lock); */ + up_write(_bus_sem); } EXPORT_SYMBOL_GPL(pci_unlock_rescan_remove); Thanks, Ethan
Re: [PATCH 3/5] PCI/ERR: get device before call device driver to avoid null pointer reference
Preferred style, there will be cleared comment in v6. Thanks, Ethan On Sat, Sep 26, 2020 at 12:42 AM Andy Shevchenko wrote: > > On Thu, Sep 24, 2020 at 10:34:21PM -0400, Ethan Zhao wrote: > > During DPC error injection test we found there is race condition between > > pciehp and DPC driver, null pointer reference caused panic as following > > null -> NULL > > > > > # setpci -s 64:02.0 0x196.w=000a > > // 64:02.0 is rootport has DPC capability > > # setpci -s 65:00.0 0x04.w=0544 > > // 65:00.0 is NVMe SSD populated in above port > > # mount /dev/nvme0n1p1 nvme > > > > (tested on stable 5.8 & ICX platform) > > > > Buffer I/O error on dev nvme0n1p1, logical block 468843328, > > async page read > > BUG: kernel NULL pointer dereference, address: 0050 > > #PF: supervisor read access in kernel mode > > #PF: error_code(0x) - not-present page > > Same comment about Oops. > > -- > With Best Regards, > Andy Shevchenko > >
Re: [PATCH 1/5 V2] PCI: define a function to check and wait till port finish DPC handling
Fixed this concern by moving the function to DPC driver and its declaration to pci.h. see v5 Thanks, Ethan On Sun, Sep 27, 2020 at 2:27 PM Christoph Hellwig wrote: > > > +#ifdef CONFIG_PCIE_DPC > > +static inline bool pci_wait_port_outdpc(struct pci_dev *pdev) > > +{ > > + u16 cap = pdev->dpc_cap, status; > > + u16 loop = 0; > > + > > + if (!cap) { > > + pci_WARN_ONCE(pdev, !cap, "No DPC capability initiated\n"); > > + return false; > > + } > > + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, ); > > + pci_dbg(pdev, "DPC status %x, cap %x\n", status, cap); > > + while (status & PCI_EXP_DPC_STATUS_TRIGGER && loop < 100) { > > + msleep(10); > > + loop++; > > + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, ); > > + } > > + if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) { > > + pci_dbg(pdev, "Out of DPC %x, cost %d ms\n", status, loop*10); > > + return true; > > + } > > + pci_dbg(pdev, "Timeout to wait port out of DPC status\n"); > > + return false; > > +} > > I don't think that there is any good reason to have this as an > inline function.
Re: [PATCH 2/5 V2] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC
On Tue, Sep 29, 2020 at 12:45 AM Kuppuswamy, Sathyanarayanan wrote: > > > On 9/28/20 9:43 AM, Sinan Kaya wrote: > > On 9/28/2020 7:10 AM, Sinan Kaya wrote: > >> On 9/27/2020 10:01 PM, Zhao, Haifeng wrote: > >>> Sinan, > >>> I explained the reason why locks don't protect this case in the patch > >>> description part. > >>> Write side and read side hold different semaphore and mutex. > >>> > >> I have been thinking about it some time but is there any reason why we > >> have to handle all port AER/DPC/HP events in different threads? > >> > >> Can we go to single threaded event loop for all port drivers events? > >> > >> This will require some refactoring but it wlll eliminate the lock > >> nightmares we are having. > >> > >> This means no sleeping. All sleeps need to happen outside of the loop. > >> > >> I wanted to see what you all are thinking about this. > >> > >> It might become a performance problem if the system is > >> continuously observing a hotplug/aer/dpc events. > >> > >> I always think that these should be rare events. > > If restructuring would be too costly, the preferred solution should be > > to fix the locks in hotplug driver rather than throwing there a random > > wait call. > Since the current race condition is detected between DPC and > hotplug, I recommend synchronizing them. The locks are the first place to root cause and try to fix. but not so easy to refactor the remove-scan-semaphore and the bus-walk-mutex. too expensive work. --- rework every piece of code that uses them. Thanks, Ethan > > -- > Sathyanarayanan Kuppuswamy > Linux Kernel Developer >
Re: [PATCH 3/5 V55555] PCI/ERR: get device before call device driver to avoid NULL pointer reference
On Mon, Sep 28, 2020 at 4:46 PM Andy Shevchenko wrote: > > On Mon, Sep 28, 2020 at 7:13 AM Ethan Zhao wrote: > > Same comments as per v4. > Also you have an issue in versioning here. Use -v parameter to `git > format-patch`, it will do it for you nicely. Aha, git has got this function. I thought it was still manual work. great tip.! Thanks, Ethan > > -- > With Best Regards, > Andy Shevchenko
Re: [PATCH 2/5 V5] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC
On Mon, Sep 28, 2020 at 4:45 PM Andy Shevchenko wrote: > > On Mon, Sep 28, 2020 at 7:10 AM Ethan Zhao wrote: > > We didn't settle on the v4, why v5? We could fix it with v6, v5 is used to fix other things. > > > When root port has DPC capability and it is enabled, then triggered by > > errors, DPC DLLSC and PDC interrupts will be sent to DPC driver, pciehp > > driver at the same time. > > That will cause following result: > > > > 1. Link and device are recovered by hardware DPC and software DPC driver, > >device > >isn't removed, but the pciehp might treat it as device was hot removed. > > > > 2. Race condition happens bettween pciehp_unconfigure_device() called by > >pciehp_ist() in pciehp driver and pci_do_recovery() called by > >dpc_handler in DPC driver. no luck, there is no lock to protect > >pci_stop_and_remove_bus_device() > >against pci_walk_bus(), they hold different samphore and mutex, > >pci_stop_and_remove_bus_device holds pci_rescan_remove_lock, and > >pci_walk_bus() holds pci_bus_sem. > > > > This race condition is not purely code analysis, it could be triggered by > > following command series: > > > > # setpci -s 64:02.0 0x196.w=000a // 64:02.0 rootport has DPC capability > > # setpci -s 65:00.0 0x04.w=0544 // 65:00.0 NVMe SSD populated in port > > # mount /dev/nvme0n1p1 nvme > > > > One shot will cause system panic and NULL pointer reference happened. > > (tested on stable 5.8 & ICS(Ice Lake SP platform, see > > https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)) > > ... > > >Buffer I/O error on dev nvme0n1p1, logical block 3328, async page read > >BUG: kernel NULL pointer dereference, address: 0050 > >#PF: supervisor read access in kernel mode > >#PF: error_code(0x) - not-present page > >PGD 0 > >Oops: [#1] SMP NOPTI > >CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0 el8.x86_64+ #1 > >RIP: 0010:report_error_detected.cold.4+0x7d/0xe6 > >Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3 > >65 ff ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50 > >50 48 83 3a 00 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0 > >RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246 > >RAX: RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01 > >RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138 > >RBP: ff8e06cf8762fdd0 R08: 00bf R09: > >R10: 00eb8ebeab53 R11: 93453258 R12: 0002 > >R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828 > >FS: () GS:ff4e3eab1fd0() knl > >GS: > >CS: 0010 DS: ES: CR0: 80050033 > >CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0 > >DR0: DR1: DR2: > >DR3: DR6: fffe0ff0 DR7: 0400 > >PKRU: 5554 > >Call Trace: > >? report_normal_detected+0x20/0x20 > >report_frozen_detected+0x16/0x20 > >pci_walk_bus+0x75/0x90 > >? dpc_irq+0x90/0x90 > >pcie_do_recovery+0x157/0x201 > >? irq_finalize_oneshot.part.47+0xe0/0xe0 > >dpc_handler+0x29/0x40 > >irq_thread_fn+0x24/0x60 > >irq_thread+0xea/0x170 > >? irq_forced_thread_fn+0x80/0x80 > >? irq_thread_check_affinity+0xf0/0xf0 > >kthread+0x124/0x140 > >? kthread_park+0x90/0x90 > >ret_from_fork+0x1f/0x30 > >Modules linked in: nft_fib_inet. > >CR2: 0050 > > Do not pollute the commit messages with irrelevant information. That is whole panic, Andy, someone like the whole Oops, If I removed something, Will someone say 'why not integral,why removed somehting ‘ > > -- > With Best Regards, > Andy Shevchenko
Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call
Sathyanarayanan, On Mon, Sep 28, 2020 at 10:44 AM Kuppuswamy, Sathyanarayanan wrote: > > Hi, > > On 9/25/20 11:30 AM, Sinan Kaya wrote: > > On 9/25/2020 2:16 PM, Kuppuswamy, Sathyanarayanan wrote: > >>> > >>> If this is a too involved change, DPC driver should restore state > >>> when hotplug is not supported. > >> Yes. we can add a condition for hotplug capability check. > >>> > >>> DPC driver should be self-sufficient by itself. > >>> > > > > Sounds good. > > > Also for non-fatal errors, if reset is requested then we still need > some kind of bus reset call here > >>> > >>> DPC should handle both fatal and non-fatal cases > >> Currently DPC is only triggered for FATAL errors. > >> and cause a bus reset > > > > Thanks for the heads up. > > This seems to have changed since I looked at the DPC code. > > > >>> in hardware already before triggering an interrupt. > >> Error recovery is not triggered only DPC driver. AER also uses the > >> same error recovery code. If DPC is not supported, then we still need > >> reset logic. > > > > It sounds like we are cross-talking two issues. > > > > 1. no state restore on DPC after FATAL error. > > Let's fix this. > Agree. Few more detail about the above issue is, > > There are two cases under FATAL error. > > FATAL + hotplug - In this case, link will be reseted. And hotplug handler > will remove the driver state. This case works well with current code. > > FATAL + no-hotplug - In this case, link will still be reseted. But > currently driver state is not properly restored. So I attempted > to restore it using pci_reset_bus(). Seems you should fix something at device driver side, not do double-reset in DPC driver, one reset is done by hardware, and you want to do another by DPC driver ? Why hardware initiated reset is not enough for you ? Thanks, Ethan > status = reset_link(dev); > - if (status != PCI_ERS_RESULT_RECOVERED) { > + if (status == PCI_ERS_RESULT_RECOVERED) { > + status = PCI_ERS_RESULT_NEED_RESET; > > ... > > if (status == PCI_ERS_RESULT_NEED_RESET) { > /* > -* TODO: Should call platform-specific > -* functions to reset slot before calling > -* drivers' slot_reset callbacks? > +* TODO: Optimize the call to pci_reset_bus() > +* > +* There are two components to pci_reset_bus(). > +* > +* 1. Do platform specific slot/bus reset. > +* 2. Save/Restore all devices in the bus. > +* > +* For hotplug capable devices and fatal errors, > +* device is already in reset state due to link > +* reset. So repeating platform specific slot/bus > +* reset via pci_reset_bus() call is redundant. So > +* can optimize this logic and conditionally call > +* pci_reset_bus(). > */ > + pci_reset_bus(dev); > > > > > 2. no bus reset on NON_FATAL error through AER driver path. > > This already tells me that you need to split your change into > > multiple patches. > > > > Let's talk about this too. bus reset should be triggered via > > AER driver before informing the recovery. > But as per error recovery documentation, any call to > ->error_detected() or ->mmio_enabled() can request > PCI_ERS_RESULT_NEED_RESET. So we need to add code > to do the actual reset before calling ->slot_reset() > callback. So call to pci_reset_bus() fixes this > issue. > > if (status == PCI_ERS_RESULT_NEED_RESET) { > + pci_reset_bus(dev); > > > > > > if (status == PCI_ERS_RESULT_NEED_RESET) { > > /* > >* TODO: Should call platform-specific > >* functions to reset slot before calling > >* drivers' slot_reset callbacks? > >*/ > > status = PCI_ERS_RESULT_RECOVERED; > > pci_dbg(dev, "broadcast slot_reset message\n"); > > pci_walk_bus(bus, report_slot_reset, ); > > } > > > > -- > Sathyanarayanan Kuppuswamy > Linux Kernel Developer
[PATCH 5/5 V5] PCI/ERR: don't mix io state not changed and no driver together
When we see 'can't recover (no error_detected callback)' on console, Maybe the reason is io state is not changed by calling pci_dev_set_io_state(), that is confused. fix it. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang --- Chagnes: V2: no change. V3: no change. V4: no change. V5: no change. drivers/pci/pcie/err.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index e35c4480c86b..d85f27c90c26 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -55,8 +55,10 @@ static int report_error_detected(struct pci_dev *dev, if (!pci_dev_get(dev)) return 0; device_lock(>dev); - if (!pci_dev_set_io_state(dev, state) || - !dev->driver || + if (!pci_dev_set_io_state(dev, state)) { + pci_dbg(dev, "Device might already being in error handling ...\n"); + vote = PCI_ERS_RESULT_NONE; + } else if (!dev->driver || !dev->driver->err_handler || !dev->driver->err_handler->error_detected) { /* -- 2.18.4
[PATCH 2/5 V5] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC
When root port has DPC capability and it is enabled, then triggered by errors, DPC DLLSC and PDC interrupts will be sent to DPC driver, pciehp driver at the same time. That will cause following result: 1. Link and device are recovered by hardware DPC and software DPC driver, device isn't removed, but the pciehp might treat it as device was hot removed. 2. Race condition happens bettween pciehp_unconfigure_device() called by pciehp_ist() in pciehp driver and pci_do_recovery() called by dpc_handler in DPC driver. no luck, there is no lock to protect pci_stop_and_remove_bus_device() against pci_walk_bus(), they hold different samphore and mutex, pci_stop_and_remove_bus_device holds pci_rescan_remove_lock, and pci_walk_bus() holds pci_bus_sem. This race condition is not purely code analysis, it could be triggered by following command series: # setpci -s 64:02.0 0x196.w=000a // 64:02.0 rootport has DPC capability # setpci -s 65:00.0 0x04.w=0544 // 65:00.0 NVMe SSD populated in port # mount /dev/nvme0n1p1 nvme One shot will cause system panic and NULL pointer reference happened. (tested on stable 5.8 & ICS(Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)) Buffer I/O error on dev nvme0n1p1, logical block 3328, async page read BUG: kernel NULL pointer dereference, address: 0050 #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 0 Oops: [#1] SMP NOPTI CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0 el8.x86_64+ #1 RIP: 0010:report_error_detected.cold.4+0x7d/0xe6 Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3 65 ff ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50 50 48 83 3a 00 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0 RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246 RAX: RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01 RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138 RBP: ff8e06cf8762fdd0 R08: 00bf R09: R10: 00eb8ebeab53 R11: 93453258 R12: 0002 R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828 FS: () GS:ff4e3eab1fd0() knl GS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 PKRU: 5554 Call Trace: ? report_normal_detected+0x20/0x20 report_frozen_detected+0x16/0x20 pci_walk_bus+0x75/0x90 ? dpc_irq+0x90/0x90 pcie_do_recovery+0x157/0x201 ? irq_finalize_oneshot.part.47+0xe0/0xe0 dpc_handler+0x29/0x40 irq_thread_fn+0x24/0x60 irq_thread+0xea/0x170 ? irq_forced_thread_fn+0x80/0x80 ? irq_thread_check_affinity+0xf0/0xf0 kthread+0x124/0x140 ? kthread_park+0x90/0x90 ret_from_fork+0x1f/0x30 Modules linked in: nft_fib_inet. CR2: 0050 With this patch, the handling flow of DPC containment and hotplug is partly ordered and serialized, let hardware DPC do the controller reset etc recovery action first, then DPC driver handling the call-back from device drivers, clear the DPC status, at the end, pciehp handle the DLLSC and PDC etc. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang --- Changes: V2: revise doc according to Andy's suggestion. V3: no change. V4: no change. V5: no change. drivers/pci/hotplug/pciehp_hpc.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 53433b37e181..6f271160f18d 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id) down_read(>reset_lock); if (events & DISABLE_SLOT) pciehp_handle_disable_request(ctrl); - else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) + else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) { + pci_wait_port_outdpc(pdev); pciehp_handle_presence_or_link_change(ctrl, events); + } up_read(>reset_lock); ret = IRQ_HANDLED; -- 2.18.4
[PATCH 1/5 V5] PCI: define a function to check and wait till port finish DPC handling
Once root port DPC capability is enabled and triggered, at the beginning of DPC is triggered, the DPC status bits are set by hardware and then sends DPC/DLLSC/PDC interrupts to OS DPC and pciehp drivers, it will take the port and software DPC interrupt handler 10ms to 50ms (test data on ICS(Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server) & stable 5.9-rc6) to complete the DPC containment procedure till the DPC status is cleared at the end of the DPC interrupt handler. We use this function to check if the root port is in DPC handling status and wait till the hardware and software completed the procedure. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang --- changes: V2:align ICS code name to public doc. V3: no change. V4: response to Christoph's (Christoph Hellwig ) tip, move pci_wait_port_outdpc() to DPC driver and its declaration to pci.h. V5: fix building issue reported by l...@intel.com with some config. drivers/pci/pci.h | 2 ++ drivers/pci/pcie/dpc.c | 27 +++ 2 files changed, 29 insertions(+) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index fa12f7cbc1a0..8fdb0d823d5a 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -455,10 +455,12 @@ void pci_restore_dpc_state(struct pci_dev *dev); void pci_dpc_init(struct pci_dev *pdev); void dpc_process_error(struct pci_dev *pdev); pci_ers_result_t dpc_reset_link(struct pci_dev *pdev); +bool pci_wait_port_outdpc(struct pci_dev *pdev); #else static inline void pci_save_dpc_state(struct pci_dev *dev) {} static inline void pci_restore_dpc_state(struct pci_dev *dev) {} static inline void pci_dpc_init(struct pci_dev *pdev) {} +static inline bool pci_wait_port_outdpc(struct pci_dev *pdev) { return false; } #endif #ifdef CONFIG_PCI_ATS diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c index daa9a4153776..2e0e091ce923 100644 --- a/drivers/pci/pcie/dpc.c +++ b/drivers/pci/pcie/dpc.c @@ -71,6 +71,33 @@ void pci_restore_dpc_state(struct pci_dev *dev) pci_write_config_word(dev, dev->dpc_cap + PCI_EXP_DPC_CTL, *cap); } +bool pci_wait_port_outdpc(struct pci_dev *pdev) +{ + u16 cap = pdev->dpc_cap, status; + u16 loop = 0; + + if (!cap) { + pci_WARN_ONCE(pdev, !cap, "No DPC capability initiated\n"); + return false; + } + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, ); + pci_dbg(pdev, "DPC status %x, cap %x\n", status, cap); + + while (status & PCI_EXP_DPC_STATUS_TRIGGER && loop < 100) { + msleep(10); + loop++; + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, ); + } + + if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) { + pci_dbg(pdev, "Out of DPC %x, cost %d ms\n", status, loop*10); + return true; + } + + pci_dbg(pdev, "Timeout to wait port out of DPC status\n"); + return false; +} + static int dpc_wait_rp_inactive(struct pci_dev *pdev) { unsigned long timeout = jiffies + HZ; -- 2.18.4
[PATCH 0/5 V5] Fix DPC hotplug race and enhance error handling
This simple patch set fixed some serious security issues found when DPC error injection and NVMe SSD hotplug brute force test were doing -- race condition between DPC handler and pciehp, AER interrupt handlers, caused system hang and system with DPC feature couldn't recover to normal working state as expected (NVMe instance lost, mount operation hang, race PCIe access caused uncorrectable errors reported alternatively etc). With this patch set applied, stable 5.9-rc6 on ICS (Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)) could pass the PCIe Gen4 NVMe SSD brute force hotplug test with any time interval between hot-remove and plug-in operation tens of times without any errors occur and system works normal. With this patch set applied, system with DPC feature could recover from NON-FATAL and FATAL errors injection test and works as expected. System works smoothly when errors happen while hotplug is doing, no uncorrectable errors found. Brute DPC error injection script: for i in {0..100} do setpci -s 64:02.0 0x196.w=000a setpci -s 65:00.0 0x04.w=0544 mount /dev/nvme0n1p1 /root/nvme sleep 1 done Other details see every commits description part. This patch set could be applied to stable 5.9-rc6 directly. Help to review and test. V2: changed according to review by Andy Shevchenko. V3: changed patch 4/5 to simpler coding. V4: move function pci_wait_port_outdpc() to DPC driver and its declaration to pci.h. (tip from Christoph Hellwig ). V5: fix building issue reported by l...@intel.com with some config. Thanks, Ethan Ethan Zhao (5): PCI: define a function to check and wait till port finish DPC handling PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC PCI/ERR: get device before call device driver to avoid NULL pointer reference PCI: only return true when dev io state is really changed PCI/ERR: don't mix io state not changed and no driver together drivers/pci/hotplug/pciehp_hpc.c | 4 +++- drivers/pci/pci.h| 34 +--- drivers/pci/pcie/err.c | 18 +++-- include/linux/pci.h | 31 + 4 files changed, 55 insertions(+), 32 deletions(-) -- 2.18.4
[PATCH 4/5 V4] PCI: only return true when dev io state is really changed
When uncorrectable error happens, AER driver and DPC driver interrupt handlers likely call pcie_do_recovery() ->pci_walk_bus() ->report_frozen_detected() with pci_channel_io_frozen the same time. If pci_dev_set_io_state() return true even if the original state is pci_channel_io_frozen, that will cause AER or DPC handler re-enter the error detecting and recovery procedure one after another. The result is the recovery flow mixed between AER and DPC. So simplify the pci_dev_set_io_state() function to only return true when dev->error_state is changed. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang Reviewed-by: Alexandru Gagniuc --- Changnes: V2: revise description and code according to suggestion from Andy. V3: change code to simpler. V4: no change. V5: no change. drivers/pci/pci.h | 37 + 1 file changed, 5 insertions(+), 32 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index fa12f7cbc1a0..a2c1c7d5f494 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -359,39 +359,12 @@ struct pci_sriov { static inline bool pci_dev_set_io_state(struct pci_dev *dev, pci_channel_state_t new) { - bool changed = false; - device_lock_assert(>dev); - switch (new) { - case pci_channel_io_perm_failure: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - case pci_channel_io_perm_failure: - changed = true; - break; - } - break; - case pci_channel_io_frozen: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - changed = true; - break; - } - break; - case pci_channel_io_normal: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - changed = true; - break; - } - break; - } - if (changed) - dev->error_state = new; - return changed; + if (dev->error_state == new) + return false; + + dev->error_state = new; + return true; } static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused) -- 2.18.4
[PATCH 3/5 V55555] PCI/ERR: get device before call device driver to avoid NULL pointer reference
During DPC error injection test we found there is race condition between pciehp and DPC driver, NULL pointer reference caused panic as following # setpci -s 64:02.0 0x196.w=000a // 64:02.0 is rootport has DPC capability # setpci -s 65:00.0 0x04.w=0544 // 65:00.0 is NVMe SSD populated in above port # mount /dev/nvme0n1p1 nvme (tested on stable 5.8 & ICS(Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)) Buffer I/O error on dev nvme0n1p1, logical block 468843328, async page read BUG: kernel NULL pointer dereference, address: 0050 #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 0 Oops: [#1] SMP NOPTI CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0-0.0.7.el8.x86_64+ #1 RIP: 0010:report_error_detected.cold.4+0x7d/0xe6 Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3 65 ff ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50 50 48 83 3a 00 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0 RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246 RAX: RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01 RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138 RBP: ff8e06cf8762fdd0 R08: 00bf R09: R10: 00eb8ebeab53 R11: 93453258 R12: 0002 R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828 FS: () GS:ff4e3eab1fd0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 PKRU: 5554 Call Trace: ? report_normal_detected+0x20/0x20 report_frozen_detected+0x16/0x20 pci_walk_bus+0x75/0x90 ? dpc_irq+0x90/0x90 pcie_do_recovery+0x157/0x201 ? irq_finalize_oneshot.part.47+0xe0/0xe0 dpc_handler+0x29/0x40 irq_thread_fn+0x24/0x60 irq_thread+0xea/0x170 ? irq_forced_thread_fn+0x80/0x80 ? irq_thread_check_affinity+0xf0/0xf0 kthread+0x124/0x140 ? kthread_park+0x90/0x90 ret_from_fork+0x1f/0x30 Modules linked in: nft_fib_inet. CR2: 0050 Though we partly close the race condition with patch 'PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC', but there is no hardware spec or software sequence to guarantee the pcie_ist() run into pci_wait_port_outdpc() first or DPC triggered status bits being set first when errors triggered DPC containment procedure, so device still could be removed by function pci_stop_and_removed_bus_device() then freed by pci_dev_put() in pciehp driver first during pcie_do_recover()/ pci_walk_bus() is called by dpc_handler() in DPC driver. Maybe unify pci_bus_sem and pci_rescan_remove_lock to serialize the removal and walking operation is the right way, but here we use pci_dev_get() to increase the reference count of device before using the device to avoid it is freed in use. With this patch and patch 'PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC', stable 5.9-rc6 could pass the error injection test and no panic happened. Brute DPC error injection script: for i in {0..100} do setpci -s 64:02.0 0x196.w=000a setpci -s 65:00.0 0x04.w=0544 mount /dev/nvme0n1p1 /root/nvme sleep 1 done Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang --- Changes: V2: revise doc according to Andy's suggestion. V3: no change. V4: no change. V5: no change. drivers/pci/pcie/err.c | 12 1 file changed, 12 insertions(+) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index c543f419d8f9..e35c4480c86b 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -52,6 +52,8 @@ static int report_error_detected(struct pci_dev *dev, pci_ers_result_t vote; const struct pci_error_handlers *err_handler; + if (!pci_dev_get(dev)) + return 0; device_lock(>dev); if (!pci_dev_set_io_state(dev, state) || !dev->driver || @@ -76,6 +78,7 @@ static int report_error_detected(struct pci_dev *dev, pci_uevent_ers(dev, vote); *result = merge_result(*result, vote); device_unlock(>dev); + pci_dev_put(dev); return 0; } @@ -94,6 +97,8 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data) pci_ers_result_t vote, *result = data; const struct pci_error_handlers *err_handler; + if (!pci_dev_get(dev)) + return 0; device_lock(>dev); if (!dev->driver || !dev->driver->err_handler || @@ -105,6 +110,7 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data) *result = merge_result(*result, vote); out: device_unlock(>dev); +
[PATCH 5/5 V4] PCI/ERR: don't mix io state not changed and no driver together
When we see 'can't recover (no error_detected callback)' on console, Maybe the reason is io state is not changed by calling pci_dev_set_io_state(), that is confused. fix it. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang --- Chagnes: V2: no change. V3: no change. V4: no change. drivers/pci/pcie/err.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index e35c4480c86b..d85f27c90c26 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -55,8 +55,10 @@ static int report_error_detected(struct pci_dev *dev, if (!pci_dev_get(dev)) return 0; device_lock(>dev); - if (!pci_dev_set_io_state(dev, state) || - !dev->driver || + if (!pci_dev_set_io_state(dev, state)) { + pci_dbg(dev, "Device might already being in error handling ...\n"); + vote = PCI_ERS_RESULT_NONE; + } else if (!dev->driver || !dev->driver->err_handler || !dev->driver->err_handler->error_detected) { /* -- 2.18.4
[PATCH 3/5 V4] PCI/ERR: get device before call device driver to avoid NULL pointer reference
During DPC error injection test we found there is race condition between pciehp and DPC driver, NULL pointer reference caused panic as following # setpci -s 64:02.0 0x196.w=000a // 64:02.0 is rootport has DPC capability # setpci -s 65:00.0 0x04.w=0544 // 65:00.0 is NVMe SSD populated in above port # mount /dev/nvme0n1p1 nvme (tested on stable 5.8 & ICS(Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)) Buffer I/O error on dev nvme0n1p1, logical block 468843328, async page read BUG: kernel NULL pointer dereference, address: 0050 #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 0 Oops: [#1] SMP NOPTI CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0-0.0.7.el8.x86_64+ #1 RIP: 0010:report_error_detected.cold.4+0x7d/0xe6 Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3 65 ff ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50 50 48 83 3a 00 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0 RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246 RAX: RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01 RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138 RBP: ff8e06cf8762fdd0 R08: 00bf R09: R10: 00eb8ebeab53 R11: 93453258 R12: 0002 R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828 FS: () GS:ff4e3eab1fd0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 PKRU: 5554 Call Trace: ? report_normal_detected+0x20/0x20 report_frozen_detected+0x16/0x20 pci_walk_bus+0x75/0x90 ? dpc_irq+0x90/0x90 pcie_do_recovery+0x157/0x201 ? irq_finalize_oneshot.part.47+0xe0/0xe0 dpc_handler+0x29/0x40 irq_thread_fn+0x24/0x60 irq_thread+0xea/0x170 ? irq_forced_thread_fn+0x80/0x80 ? irq_thread_check_affinity+0xf0/0xf0 kthread+0x124/0x140 ? kthread_park+0x90/0x90 ret_from_fork+0x1f/0x30 Modules linked in: nft_fib_inet. CR2: 0050 Though we partly close the race condition with patch 'PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC', but there is no hardware spec or software sequence to guarantee the pcie_ist() run into pci_wait_port_outdpc() first or DPC triggered status bits being set first when errors triggered DPC containment procedure, so device still could be removed by function pci_stop_and_removed_bus_device() then freed by pci_dev_put() in pciehp driver first during pcie_do_recover()/ pci_walk_bus() is called by dpc_handler() in DPC driver. Maybe unify pci_bus_sem and pci_rescan_remove_lock to serialize the removal and walking operation is the right way, but here we use pci_dev_get() to increase the reference count of device before using the device to avoid it is freed in use. With this patch and patch 'PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC', stable 5.9-rc6 could pass the error injection test and no panic happened. Brute DPC error injection script: for i in {0..100} do setpci -s 64:02.0 0x196.w=000a setpci -s 65:00.0 0x04.w=0544 mount /dev/nvme0n1p1 /root/nvme sleep 1 done Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang Reviewed-by: Andy Shevchenko --- Changes: V2: revise doc according to Andy's suggestion. V3: no change. V4: no change. drivers/pci/pcie/err.c | 12 1 file changed, 12 insertions(+) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index c543f419d8f9..e35c4480c86b 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -52,6 +52,8 @@ static int report_error_detected(struct pci_dev *dev, pci_ers_result_t vote; const struct pci_error_handlers *err_handler; + if (!pci_dev_get(dev)) + return 0; device_lock(>dev); if (!pci_dev_set_io_state(dev, state) || !dev->driver || @@ -76,6 +78,7 @@ static int report_error_detected(struct pci_dev *dev, pci_uevent_ers(dev, vote); *result = merge_result(*result, vote); device_unlock(>dev); + pci_dev_put(dev); return 0; } @@ -94,6 +97,8 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data) pci_ers_result_t vote, *result = data; const struct pci_error_handlers *err_handler; + if (!pci_dev_get(dev)) + return 0; device_lock(>dev); if (!dev->driver || !dev->driver->err_handler || @@ -105,6 +110,7 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data) *result = merge_result(*result, vote); out: device_unlock(>dev); +
[PATCH 4/5 V4] PCI: only return true when dev io state is really changed
When uncorrectable error happens, AER driver and DPC driver interrupt handlers likely call pcie_do_recovery() ->pci_walk_bus() ->report_frozen_detected() with pci_channel_io_frozen the same time. If pci_dev_set_io_state() return true even if the original state is pci_channel_io_frozen, that will cause AER or DPC handler re-enter the error detecting and recovery procedure one after another. The result is the recovery flow mixed between AER and DPC. So simplify the pci_dev_set_io_state() function to only return true when dev->error_state is changed. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang Reviewed-by: Andy Shevchenko Reviewed-by: Alexandru Gagniuc Reviewed-by: Joe Perches --- Changnes: V2: revise description and code according to suggestion from Andy. V3: change code to simpler. V4: no change. drivers/pci/pci.h | 37 + 1 file changed, 5 insertions(+), 32 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index fa12f7cbc1a0..a2c1c7d5f494 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -359,39 +359,12 @@ struct pci_sriov { static inline bool pci_dev_set_io_state(struct pci_dev *dev, pci_channel_state_t new) { - bool changed = false; - device_lock_assert(>dev); - switch (new) { - case pci_channel_io_perm_failure: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - case pci_channel_io_perm_failure: - changed = true; - break; - } - break; - case pci_channel_io_frozen: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - changed = true; - break; - } - break; - case pci_channel_io_normal: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - changed = true; - break; - } - break; - } - if (changed) - dev->error_state = new; - return changed; + if (dev->error_state == new) + return false; + + dev->error_state = new; + return true; } static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused) -- 2.18.4
[PATCH 2/5 V4] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC
When root port has DPC capability and it is enabled, then triggered by errors, DPC DLLSC and PDC interrupts will be sent to DPC driver, pciehp driver at the same time. That will cause following result: 1. Link and device are recovered by hardware DPC and software DPC driver, device isn't removed, but the pciehp might treat it as device was hot removed. 2. Race condition happens bettween pciehp_unconfigure_device() called by pciehp_ist() in pciehp driver and pci_do_recovery() called by dpc_handler in DPC driver. no luck, there is no lock to protect pci_stop_and_remove_bus_device() against pci_walk_bus(), they hold different samphore and mutex, pci_stop_and_remove_bus_device holds pci_rescan_remove_lock, and pci_walk_bus() holds pci_bus_sem. This race condition is not purely code analysis, it could be triggered by following command series: # setpci -s 64:02.0 0x196.w=000a // 64:02.0 rootport has DPC capability # setpci -s 65:00.0 0x04.w=0544 // 65:00.0 NVMe SSD populated in port # mount /dev/nvme0n1p1 nvme One shot will cause system panic and NULL pointer reference happened. (tested on stable 5.8 & ICS(Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)) Buffer I/O error on dev nvme0n1p1, logical block 3328, async page read BUG: kernel NULL pointer dereference, address: 0050 #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 0 Oops: [#1] SMP NOPTI CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0 el8.x86_64+ #1 RIP: 0010:report_error_detected.cold.4+0x7d/0xe6 Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3 65 ff ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50 50 48 83 3a 00 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0 RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246 RAX: RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01 RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138 RBP: ff8e06cf8762fdd0 R08: 00bf R09: R10: 00eb8ebeab53 R11: 93453258 R12: 0002 R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828 FS: () GS:ff4e3eab1fd0() knl GS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 PKRU: 5554 Call Trace: ? report_normal_detected+0x20/0x20 report_frozen_detected+0x16/0x20 pci_walk_bus+0x75/0x90 ? dpc_irq+0x90/0x90 pcie_do_recovery+0x157/0x201 ? irq_finalize_oneshot.part.47+0xe0/0xe0 dpc_handler+0x29/0x40 irq_thread_fn+0x24/0x60 irq_thread+0xea/0x170 ? irq_forced_thread_fn+0x80/0x80 ? irq_thread_check_affinity+0xf0/0xf0 kthread+0x124/0x140 ? kthread_park+0x90/0x90 ret_from_fork+0x1f/0x30 Modules linked in: nft_fib_inet. CR2: 0050 With this patch, the handling flow of DPC containment and hotplug is partly ordered and serialized, let hardware DPC do the controller reset etc recovery action first, then DPC driver handling the call-back from device drivers, clear the DPC status, at the end, pciehp handle the DLLSC and PDC etc. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang Reviewed-by: Andy Shevchenko --- Changes: V2: revise doc according to Andy's suggestion. V3: no change. V4: no change. drivers/pci/hotplug/pciehp_hpc.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 53433b37e181..6f271160f18d 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id) down_read(>reset_lock); if (events & DISABLE_SLOT) pciehp_handle_disable_request(ctrl); - else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) + else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) { + pci_wait_port_outdpc(pdev); pciehp_handle_presence_or_link_change(ctrl, events); + } up_read(>reset_lock); ret = IRQ_HANDLED; -- 2.18.4
[PATCH 0/5 V4] Fix DPC hotplug race and enhance error handling
This simple patch set fixed some serious security issues found when DPC error injection and NVMe SSD hotplug brute force test were doing -- race condition between DPC handler and pciehp, AER interrupt handlers, caused system hang and system with DPC feature couldn't recover to normal working state as expected (NVMe instance lost, mount operation hang, race PCIe access caused uncorrectable errors reported alternatively etc). With this patch set applied, stable 5.9-rc6 on ICS (Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)) could pass the PCIe Gen4 NVMe SSD brute force hotplug test with any time interval between hot-remove and plug-in operation tens of times without any errors occur and system works normal. With this patch set applied, system with DPC feature could recover from NON-FATAL and FATAL errors injection test and works as expected. System works smoothly when errors happen while hotplug is doing, no uncorrectable errors found. Brute DPC error injection script: for i in {0..100} do setpci -s 64:02.0 0x196.w=000a setpci -s 65:00.0 0x04.w=0544 mount /dev/nvme0n1p1 /root/nvme sleep 1 done Other details see every commits description part. This patch set could be applied to stable 5.9-rc6 directly. Help to review and test. V2: changed according to review by Andy Shevchenko. V3: changed patch 4/5 to simpler coding. V4: move function pci_wait_port_outdpc() to DPC driver and its declaration to pci.h. (tip from Christoph Hellwig ). Thanks, Ethan Ethan Zhao (5): PCI: define a function to check and wait till port finish DPC handling PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC PCI/ERR: get device before call device driver to avoid NULL pointer reference PCI: only return true when dev io state is really changed PCI/ERR: don't mix io state not changed and no driver together drivers/pci/hotplug/pciehp_hpc.c | 4 +++- drivers/pci/pci.h| 34 +--- drivers/pci/pcie/err.c | 18 +++-- include/linux/pci.h | 31 + 4 files changed, 55 insertions(+), 32 deletions(-) -- 2.18.4
[PATCH 1/5 V4] PCI: define a function to check and wait till port finish DPC handling
Once root port DPC capability is enabled and triggered, at the beginning of DPC is triggered, the DPC status bits are set by hardware and then sends DPC/DLLSC/PDC interrupts to OS DPC and pciehp drivers, it will take the port and software DPC interrupt handler 10ms to 50ms (test data on ICS(Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server) & stable 5.9-rc6) to complete the DPC containment procedure till the DPC status is cleared at the end of the DPC interrupt handler. We use this function to check if the root port is in DPC handling status and wait till the hardware and software completed the procedure. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang Reviewed-by: Andy Shevchenko Reviewed-by: Christoph Hellwig --- changes: V2:align ICS code name to public doc. V3: no change. V4: response to Christoph's (Christoph Hellwig ) tip, move pci_wait_port_outdpc() to DPC driver and its declaration to pci.h. drivers/pci/pci.h | 2 ++ drivers/pci/pcie/dpc.c | 27 +++ 2 files changed, 29 insertions(+) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index fa12f7cbc1a0..8fdb0d823d5a 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -455,10 +455,12 @@ void pci_restore_dpc_state(struct pci_dev *dev); void pci_dpc_init(struct pci_dev *pdev); void dpc_process_error(struct pci_dev *pdev); pci_ers_result_t dpc_reset_link(struct pci_dev *pdev); +bool pci_wait_port_outdpc(struct pci_dev *pdev); #else static inline void pci_save_dpc_state(struct pci_dev *dev) {} static inline void pci_restore_dpc_state(struct pci_dev *dev) {} static inline void pci_dpc_init(struct pci_dev *pdev) {} +inline bool pci_wait_port_outdpc(struct pci_dev *pdev) { return false; } #endif #ifdef CONFIG_PCI_ATS diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c index daa9a4153776..2e0e091ce923 100644 --- a/drivers/pci/pcie/dpc.c +++ b/drivers/pci/pcie/dpc.c @@ -71,6 +71,33 @@ void pci_restore_dpc_state(struct pci_dev *dev) pci_write_config_word(dev, dev->dpc_cap + PCI_EXP_DPC_CTL, *cap); } +bool pci_wait_port_outdpc(struct pci_dev *pdev) +{ + u16 cap = pdev->dpc_cap, status; + u16 loop = 0; + + if (!cap) { + pci_WARN_ONCE(pdev, !cap, "No DPC capability initiated\n"); + return false; + } + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, ); + pci_dbg(pdev, "DPC status %x, cap %x\n", status, cap); + + while (status & PCI_EXP_DPC_STATUS_TRIGGER && loop < 100) { + msleep(10); + loop++; + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, ); + } + + if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) { + pci_dbg(pdev, "Out of DPC %x, cost %d ms\n", status, loop*10); + return true; + } + + pci_dbg(pdev, "Timeout to wait port out of DPC status\n"); + return false; +} + static int dpc_wait_rp_inactive(struct pci_dev *pdev) { unsigned long timeout = jiffies + HZ; -- 2.18.4
[PATCH 5/5 V3] PCI/ERR: don't mix io state not changed and no driver together
When we see 'can't recover (no error_detected callback)' on console, Maybe the reason is io state is not changed by calling pci_dev_set_io_state(), that is confused. fix it. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang --- Chagnes: V2: no change. V3: no change. drivers/pci/pcie/err.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index e35c4480c86b..d85f27c90c26 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -55,8 +55,10 @@ static int report_error_detected(struct pci_dev *dev, if (!pci_dev_get(dev)) return 0; device_lock(>dev); - if (!pci_dev_set_io_state(dev, state) || - !dev->driver || + if (!pci_dev_set_io_state(dev, state)) { + pci_dbg(dev, "Device might already being in error handling ...\n"); + vote = PCI_ERS_RESULT_NONE; + } else if (!dev->driver || !dev->driver->err_handler || !dev->driver->err_handler->error_detected) { /* -- 2.18.4
[PATCH 3/5 V3] PCI/ERR: get device before call device driver to avoid NULL pointer reference
During DPC error injection test we found there is race condition between pciehp and DPC driver, NULL pointer reference caused panic as following # setpci -s 64:02.0 0x196.w=000a // 64:02.0 is rootport has DPC capability # setpci -s 65:00.0 0x04.w=0544 // 65:00.0 is NVMe SSD populated in above port # mount /dev/nvme0n1p1 nvme (tested on stable 5.8 & ICS(Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)) Buffer I/O error on dev nvme0n1p1, logical block 468843328, async page read BUG: kernel NULL pointer dereference, address: 0050 #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 0 Oops: [#1] SMP NOPTI CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0-0.0.7.el8.x86_64+ #1 RIP: 0010:report_error_detected.cold.4+0x7d/0xe6 Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3 65 ff ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50 50 48 83 3a 00 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0 RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246 RAX: RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01 RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138 RBP: ff8e06cf8762fdd0 R08: 00bf R09: R10: 00eb8ebeab53 R11: 93453258 R12: 0002 R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828 FS: () GS:ff4e3eab1fd0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 PKRU: 5554 Call Trace: ? report_normal_detected+0x20/0x20 report_frozen_detected+0x16/0x20 pci_walk_bus+0x75/0x90 ? dpc_irq+0x90/0x90 pcie_do_recovery+0x157/0x201 ? irq_finalize_oneshot.part.47+0xe0/0xe0 dpc_handler+0x29/0x40 irq_thread_fn+0x24/0x60 irq_thread+0xea/0x170 ? irq_forced_thread_fn+0x80/0x80 ? irq_thread_check_affinity+0xf0/0xf0 kthread+0x124/0x140 ? kthread_park+0x90/0x90 ret_from_fork+0x1f/0x30 Modules linked in: nft_fib_inet. CR2: 0050 Though we partly close the race condition with patch 'PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC', but there is no hardware spec or software sequence to guarantee the pcie_ist() run into pci_wait_port_outdpc() first or DPC triggered status bits being set first when errors triggered DPC containment procedure, so device still could be removed by function pci_stop_and_removed_bus_device() then freed by pci_dev_put() in pciehp driver first during pcie_do_recover()/ pci_walk_bus() is called by dpc_handler() in DPC driver. Maybe unify pci_bus_sem and pci_rescan_remove_lock to serialize the removal and walking operation is the right way, but here we use pci_dev_get() to increase the reference count of device before using the device to avoid it is freed in use. With this patch and patch 'PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC', stable 5.9-rc6 could pass the error injection test and no panic happened. Brute DPC error injection script: for i in {0..100} do setpci -s 64:02.0 0x196.w=000a setpci -s 65:00.0 0x04.w=0544 mount /dev/nvme0n1p1 /root/nvme sleep 1 done Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang Reviewed-by: Andy Shevchenko --- Changes: V2: revise doc according to Andy's suggestion. V3: no change. drivers/pci/pcie/err.c | 12 1 file changed, 12 insertions(+) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index c543f419d8f9..e35c4480c86b 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -52,6 +52,8 @@ static int report_error_detected(struct pci_dev *dev, pci_ers_result_t vote; const struct pci_error_handlers *err_handler; + if (!pci_dev_get(dev)) + return 0; device_lock(>dev); if (!pci_dev_set_io_state(dev, state) || !dev->driver || @@ -76,6 +78,7 @@ static int report_error_detected(struct pci_dev *dev, pci_uevent_ers(dev, vote); *result = merge_result(*result, vote); device_unlock(>dev); + pci_dev_put(dev); return 0; } @@ -94,6 +97,8 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data) pci_ers_result_t vote, *result = data; const struct pci_error_handlers *err_handler; + if (!pci_dev_get(dev)) + return 0; device_lock(>dev); if (!dev->driver || !dev->driver->err_handler || @@ -105,6 +110,7 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data) *result = merge_result(*result, vote); out: device_unlock(>dev); + pci_dev_put(dev)
[PATCH 4/5 V3] PCI: only return true when dev io state is really changed
When uncorrectable error happens, AER driver and DPC driver interrupt handlers likely call pcie_do_recovery() ->pci_walk_bus() ->report_frozen_detected() with pci_channel_io_frozen the same time. If pci_dev_set_io_state() return true even if the original state is pci_channel_io_frozen, that will cause AER or DPC handler re-enter the error detecting and recovery procedure one after another. The result is the recovery flow mixed between AER and DPC. So simplify the pci_dev_set_io_state() function to only return true when dev->error_state is changed. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang Reviewed-by: Andy Shevchenko Reviewed-by: Alexandru Gagniuc Reviewed-by: Joe Perches --- Changnes: V2: revise description and code according to suggestion from Andy. V3: Change code to simpler. drivers/pci/pci.h | 37 + 1 file changed, 5 insertions(+), 32 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index fa12f7cbc1a0..a2c1c7d5f494 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -359,39 +359,12 @@ struct pci_sriov { static inline bool pci_dev_set_io_state(struct pci_dev *dev, pci_channel_state_t new) { - bool changed = false; - device_lock_assert(>dev); - switch (new) { - case pci_channel_io_perm_failure: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - case pci_channel_io_perm_failure: - changed = true; - break; - } - break; - case pci_channel_io_frozen: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - changed = true; - break; - } - break; - case pci_channel_io_normal: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - changed = true; - break; - } - break; - } - if (changed) - dev->error_state = new; - return changed; + if (dev->error_state == new) + return false; + + dev->error_state = new; + return true; } static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused) -- 2.18.4
[PATCH 0/5 V3] Fix DPC hotplug race and enhance error handling
This simple patch set fixed some serious security issues found when DPC error injection and NVMe SSD hotplug brute force test were doing -- race condition between DPC handler and pciehp, AER interrupt handlers, caused system hang and system with DPC feature couldn't recover to normal working state as expected (NVMe instance lost, mount operation hang, race PCIe access caused uncorrectable errors reported alternatively etc). With this patch set applied, stable 5.9-rc6 on ICS (Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)) could pass the PCIe Gen4 NVMe SSD brute force hotplug test with any time interval between hot-remove and plug-in operation tens of times without any errors occur and system works normal. With this patch set applied, system with DPC feature could recover from NON-FATAL and FATAL errors injection test and works as expected. System works smoothly when errors happen while hotplug is doing, no uncorrectable errors found. Brute DPC error injection script: for i in {0..100} do setpci -s 64:02.0 0x196.w=000a setpci -s 65:00.0 0x04.w=0544 mount /dev/nvme0n1p1 /root/nvme sleep 1 done Other details see every commits description part. This patch set could be applied to stable 5.9-rc6 directly. Help to review and test. V2: changed according to review by Andy Shevchenko. V3: changed patch 4/5 to simpler coding. Thanks, Ethan Ethan Zhao (5): PCI: define a function to check and wait till port finish DPC handling PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC PCI/ERR: get device before call device driver to avoid NULL pointer reference PCI: only return true when dev io state is really changed PCI/ERR: don't mix io state not changed and no driver together drivers/pci/hotplug/pciehp_hpc.c | 4 +++- drivers/pci/pci.h| 34 +--- drivers/pci/pcie/err.c | 18 +++-- include/linux/pci.h | 31 + 4 files changed, 55 insertions(+), 32 deletions(-) -- 2.18.4
[PATCH 2/5 V3] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC
When root port has DPC capability and it is enabled, then triggered by errors, DPC DLLSC and PDC interrupts will be sent to DPC driver, pciehp driver at the same time. That will cause following result: 1. Link and device are recovered by hardware DPC and software DPC driver, device isn't removed, but the pciehp might treat it as device was hot removed. 2. Race condition happens bettween pciehp_unconfigure_device() called by pciehp_ist() in pciehp driver and pci_do_recovery() called by dpc_handler in DPC driver. no luck, there is no lock to protect pci_stop_and_remove_bus_device() against pci_walk_bus(), they hold different samphore and mutex, pci_stop_and_remove_bus_device holds pci_rescan_remove_lock, and pci_walk_bus() holds pci_bus_sem. This race condition is not purely code analysis, it could be triggered by following command series: # setpci -s 64:02.0 0x196.w=000a // 64:02.0 rootport has DPC capability # setpci -s 65:00.0 0x04.w=0544 // 65:00.0 NVMe SSD populated in port # mount /dev/nvme0n1p1 nvme One shot will cause system panic and NULL pointer reference happened. (tested on stable 5.8 & ICS(Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)) Buffer I/O error on dev nvme0n1p1, logical block 3328, async page read BUG: kernel NULL pointer dereference, address: 0050 #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 0 Oops: [#1] SMP NOPTI CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0 el8.x86_64+ #1 RIP: 0010:report_error_detected.cold.4+0x7d/0xe6 Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3 65 ff ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50 50 48 83 3a 00 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0 RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246 RAX: RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01 RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138 RBP: ff8e06cf8762fdd0 R08: 00bf R09: R10: 00eb8ebeab53 R11: 93453258 R12: 0002 R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828 FS: () GS:ff4e3eab1fd0() knl GS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 PKRU: 5554 Call Trace: ? report_normal_detected+0x20/0x20 report_frozen_detected+0x16/0x20 pci_walk_bus+0x75/0x90 ? dpc_irq+0x90/0x90 pcie_do_recovery+0x157/0x201 ? irq_finalize_oneshot.part.47+0xe0/0xe0 dpc_handler+0x29/0x40 irq_thread_fn+0x24/0x60 irq_thread+0xea/0x170 ? irq_forced_thread_fn+0x80/0x80 ? irq_thread_check_affinity+0xf0/0xf0 kthread+0x124/0x140 ? kthread_park+0x90/0x90 ret_from_fork+0x1f/0x30 Modules linked in: nft_fib_inet. CR2: 0050 With this patch, the handling flow of DPC containment and hotplug is partly ordered and serialized, let hardware DPC do the controller reset etc recovery action first, then DPC driver handling the call-back from device drivers, clear the DPC status, at the end, pciehp handle the DLLSC and PDC etc. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang Reviewed-by: Andy Shevchenko --- Changes: V2: revise doc according to Andy's suggestion. V3: no change. drivers/pci/hotplug/pciehp_hpc.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 53433b37e181..6f271160f18d 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id) down_read(>reset_lock); if (events & DISABLE_SLOT) pciehp_handle_disable_request(ctrl); - else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) + else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) { + pci_wait_port_outdpc(pdev); pciehp_handle_presence_or_link_change(ctrl, events); + } up_read(>reset_lock); ret = IRQ_HANDLED; -- 2.18.4
[PATCH 1/5 V3] PCI: define a function to check and wait till port finish DPC handling
Once root port DPC capability is enabled and triggered, at the beginning of DPC is triggered, the DPC status bits are set by hardware and then sends DPC/DLLSC/PDC interrupts to OS DPC and pciehp drivers, it will take the port and software DPC interrupt handler 10ms to 50ms (test data on ICS(Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server) & stable 5.9-rc6) to complete the DPC containment procedure till the DPC status is cleared at the end of the DPC interrupt handler. We use this function to check if the root port is in DPC handling status and wait till the hardware and software completed the procedure. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang Reviewed-by: Andy Shevchenko --- changes: V2:align ICS code name to public doc. V3: no change. include/linux/pci.h | 31 +++ 1 file changed, 31 insertions(+) diff --git a/include/linux/pci.h b/include/linux/pci.h index 835530605c0d..5beb76c6ae26 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -38,6 +38,7 @@ #include #include #include +#include #include #include @@ -2427,4 +2428,34 @@ void pci_uevent_ers(struct pci_dev *pdev, enum pci_ers_result err_type); WARN_ONCE(condition, "%s %s: " fmt, \ dev_driver_string(&(pdev)->dev), pci_name(pdev), ##arg) +#ifdef CONFIG_PCIE_DPC +static inline bool pci_wait_port_outdpc(struct pci_dev *pdev) +{ + u16 cap = pdev->dpc_cap, status; + u16 loop = 0; + + if (!cap) { + pci_WARN_ONCE(pdev, !cap, "No DPC capability initiated\n"); + return false; + } + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, ); + pci_dbg(pdev, "DPC status %x, cap %x\n", status, cap); + while (status & PCI_EXP_DPC_STATUS_TRIGGER && loop < 100) { + msleep(10); + loop++; + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, ); + } + if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) { + pci_dbg(pdev, "Out of DPC %x, cost %d ms\n", status, loop*10); + return true; + } + pci_dbg(pdev, "Timeout to wait port out of DPC status\n"); + return false; +} +#else +static inline bool pci_wait_port_outdpc(struct pci_dev *pdev) +{ + return true; +} +#endif #endif /* LINUX_PCI_H */ -- 2.18.4
[PATCH 2/5 V2] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC
When root port has DPC capability and it is enabled, then triggered by errors, DPC DLLSC and PDC interrupts will be sent to DPC driver, pciehp driver at the same time. That will cause following result: 1. Link and device are recovered by hardware DPC and software DPC driver, device isn't removed, but the pciehp might treat it as device was hot removed. 2. Race condition happens bettween pciehp_unconfigure_device() called by pciehp_ist() in pciehp driver and pci_do_recovery() called by dpc_handler in DPC driver. no luck, there is no lock to protect pci_stop_and_remove_bus_device() against pci_walk_bus(), they hold different samphore and mutex, pci_stop_and_remove_bus_device holds pci_rescan_remove_lock, and pci_walk_bus() holds pci_bus_sem. This race condition is not purely code analysis, it could be triggered by following command series: # setpci -s 64:02.0 0x196.w=000a // 64:02.0 rootport has DPC capability # setpci -s 65:00.0 0x04.w=0544 // 65:00.0 NVMe SSD populated in port # mount /dev/nvme0n1p1 nvme One shot will cause system panic and NULL pointer reference happened. (tested on stable 5.8 & ICS(Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)) Buffer I/O error on dev nvme0n1p1, logical block 3328, async page read BUG: kernel NULL pointer dereference, address: 0050 #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 0 Oops: [#1] SMP NOPTI CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0 el8.x86_64+ #1 RIP: 0010:report_error_detected.cold.4+0x7d/0xe6 Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3 65 ff ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50 50 48 83 3a 00 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0 RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246 RAX: RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01 RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138 RBP: ff8e06cf8762fdd0 R08: 00bf R09: R10: 00eb8ebeab53 R11: 93453258 R12: 0002 R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828 FS: () GS:ff4e3eab1fd0() knl GS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 PKRU: 5554 Call Trace: ? report_normal_detected+0x20/0x20 report_frozen_detected+0x16/0x20 pci_walk_bus+0x75/0x90 ? dpc_irq+0x90/0x90 pcie_do_recovery+0x157/0x201 ? irq_finalize_oneshot.part.47+0xe0/0xe0 dpc_handler+0x29/0x40 irq_thread_fn+0x24/0x60 irq_thread+0xea/0x170 ? irq_forced_thread_fn+0x80/0x80 ? irq_thread_check_affinity+0xf0/0xf0 kthread+0x124/0x140 ? kthread_park+0x90/0x90 ret_from_fork+0x1f/0x30 Modules linked in: nft_fib_inet. CR2: 0050 With this patch, the handling flow of DPC containment and hotplug is partly ordered and serialized, let hardware DPC do the controller reset etc recovery action first, then DPC driver handling the call-back from device drivers, clear the DPC status, at the end, pciehp handle the DLLSC and PDC etc. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang Reviewed-by: Andy Shevchenko --- Changes: V2: revise doc according to Andy's suggestion. drivers/pci/hotplug/pciehp_hpc.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c index 53433b37e181..6f271160f18d 100644 --- a/drivers/pci/hotplug/pciehp_hpc.c +++ b/drivers/pci/hotplug/pciehp_hpc.c @@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id) down_read(>reset_lock); if (events & DISABLE_SLOT) pciehp_handle_disable_request(ctrl); - else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) + else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) { + pci_wait_port_outdpc(pdev); pciehp_handle_presence_or_link_change(ctrl, events); + } up_read(>reset_lock); ret = IRQ_HANDLED; -- 2.18.4
[PATCH 3/5 V2] PCI/ERR: get device before call device driver to avoid NULL pointer reference
During DPC error injection test we found there is race condition between pciehp and DPC driver, NULL pointer reference caused panic as following # setpci -s 64:02.0 0x196.w=000a // 64:02.0 is rootport has DPC capability # setpci -s 65:00.0 0x04.w=0544 // 65:00.0 is NVMe SSD populated in above port # mount /dev/nvme0n1p1 nvme (tested on stable 5.8 & ICS(Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)) Buffer I/O error on dev nvme0n1p1, logical block 468843328, async page read BUG: kernel NULL pointer dereference, address: 0050 #PF: supervisor read access in kernel mode #PF: error_code(0x) - not-present page PGD 0 Oops: [#1] SMP NOPTI CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0-0.0.7.el8.x86_64+ #1 RIP: 0010:report_error_detected.cold.4+0x7d/0xe6 Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3 65 ff ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50 50 48 83 3a 00 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0 RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246 RAX: RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01 RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138 RBP: ff8e06cf8762fdd0 R08: 00bf R09: R10: 00eb8ebeab53 R11: 93453258 R12: 0002 R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828 FS: () GS:ff4e3eab1fd0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 PKRU: 5554 Call Trace: ? report_normal_detected+0x20/0x20 report_frozen_detected+0x16/0x20 pci_walk_bus+0x75/0x90 ? dpc_irq+0x90/0x90 pcie_do_recovery+0x157/0x201 ? irq_finalize_oneshot.part.47+0xe0/0xe0 dpc_handler+0x29/0x40 irq_thread_fn+0x24/0x60 irq_thread+0xea/0x170 ? irq_forced_thread_fn+0x80/0x80 ? irq_thread_check_affinity+0xf0/0xf0 kthread+0x124/0x140 ? kthread_park+0x90/0x90 ret_from_fork+0x1f/0x30 Modules linked in: nft_fib_inet. CR2: 0050 Though we partly close the race condition with patch 'PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC', but there is no hardware spec or software sequence to guarantee the pcie_ist() run into pci_wait_port_outdpc() first or DPC triggered status bits being set first when errors triggered DPC containment procedure, so device still could be removed by function pci_stop_and_removed_bus_device() then freed by pci_dev_put() in pciehp driver first during pcie_do_recover()/ pci_walk_bus() is called by dpc_handler() in DPC driver. Maybe unify pci_bus_sem and pci_rescan_remove_lock to serialize the removal and walking operation is the right way, but here we use pci_dev_get() to increase the reference count of device before using the device to avoid it is freed in use. With this patch and patch 'PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC', stable 5.9-rc6 could pass the error injection test and no panic happened. Brute DPC error injection script: for i in {0..100} do setpci -s 64:02.0 0x196.w=000a setpci -s 65:00.0 0x04.w=0544 mount /dev/nvme0n1p1 /root/nvme sleep 1 done Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang Reviewed-by: Andy Shevchenko --- Changes: V2: revise doc according to Andy's suggestion. drivers/pci/pcie/err.c | 12 1 file changed, 12 insertions(+) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index c543f419d8f9..e35c4480c86b 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -52,6 +52,8 @@ static int report_error_detected(struct pci_dev *dev, pci_ers_result_t vote; const struct pci_error_handlers *err_handler; + if (!pci_dev_get(dev)) + return 0; device_lock(>dev); if (!pci_dev_set_io_state(dev, state) || !dev->driver || @@ -76,6 +78,7 @@ static int report_error_detected(struct pci_dev *dev, pci_uevent_ers(dev, vote); *result = merge_result(*result, vote); device_unlock(>dev); + pci_dev_put(dev); return 0; } @@ -94,6 +97,8 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data) pci_ers_result_t vote, *result = data; const struct pci_error_handlers *err_handler; + if (!pci_dev_get(dev)) + return 0; device_lock(>dev); if (!dev->driver || !dev->driver->err_handler || @@ -105,6 +110,7 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data) *result = merge_result(*result, vote); out: device_unlock(>dev); + pci_dev_put(dev); return
[PATCH 5/5 V2] PCI/ERR: don't mix io state not changed and no driver together
When we see 'can't recover (no error_detected callback)' on console, Maybe the reason is io state is not changed by calling pci_dev_set_io_state(), that is confused. fix it. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang --- Chagnes: V2: no change. drivers/pci/pcie/err.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index e35c4480c86b..d85f27c90c26 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -55,8 +55,10 @@ static int report_error_detected(struct pci_dev *dev, if (!pci_dev_get(dev)) return 0; device_lock(>dev); - if (!pci_dev_set_io_state(dev, state) || - !dev->driver || + if (!pci_dev_set_io_state(dev, state)) { + pci_dbg(dev, "Device might already being in error handling ...\n"); + vote = PCI_ERS_RESULT_NONE; + } else if (!dev->driver || !dev->driver->err_handler || !dev->driver->err_handler->error_detected) { /* -- 2.18.4
[PATCH 4/5 V2] PCI: only return true when dev io state is really changed
When uncorrectable error happens, AER driver and DPC driver interrupt handlers likely call pcie_do_recovery() ->pci_walk_bus() ->report_frozen_detected() with pci_channel_io_frozen the same time. If pci_dev_set_io_state() return true even if the original state is pci_channel_io_frozen, that will cause AER or DPC handler re-enter the error detecting and recovery procedure one after another. The result is the recovery flow mixed between AER and DPC. So simplify the pci_dev_set_io_state() function to only return true when dev->error_state is changed. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang Reviewed-by: Andy Shevchenko Reviewed-by: Alexandru Gagniuc --- Changes: V2: revise doc and code flow according to Andy's suggestion. drivers/pci/pci.h | 34 +- 1 file changed, 5 insertions(+), 29 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index fa12f7cbc1a0..387f891ce6a1 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -362,35 +362,11 @@ static inline bool pci_dev_set_io_state(struct pci_dev *dev, bool changed = false; device_lock_assert(>dev); - switch (new) { - case pci_channel_io_perm_failure: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - case pci_channel_io_perm_failure: - changed = true; - break; - } - break; - case pci_channel_io_frozen: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - changed = true; - break; - } - break; - case pci_channel_io_normal: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - changed = true; - break; - } - break; - } - if (changed) - dev->error_state = new; + if (dev->error_state == new) + return changed; + + dev->error_state = new; + changed = true; return changed; } -- 2.18.4
[PATCH 1/5 V2] PCI: define a function to check and wait till port finish DPC handling
Once root port DPC capability is enabled and triggered, at the beginning of DPC is triggered, the DPC status bits are set by hardware and then sends DPC/DLLSC/PDC interrupts to OS DPC and pciehp drivers, it will take the port and software DPC interrupt handler 10ms to 50ms (test data on ICS(Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server) & stable 5.9-rc6) to complete the DPC containment procedure till the DPC status is cleared at the end of the DPC interrupt handler. We use this function to check if the root port is in DPC handling status and wait till the hardware and software completed the procedure. Signed-off-by: Ethan Zhao Tested-by: Wen Jin Tested-by: Shanshan Zhang Reviewed-by: Andy Shevchenko --- changes: V2:align ICS code name to public doc. include/linux/pci.h | 31 +++ 1 file changed, 31 insertions(+) diff --git a/include/linux/pci.h b/include/linux/pci.h index 835530605c0d..5beb76c6ae26 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -38,6 +38,7 @@ #include #include #include +#include #include #include @@ -2427,4 +2428,34 @@ void pci_uevent_ers(struct pci_dev *pdev, enum pci_ers_result err_type); WARN_ONCE(condition, "%s %s: " fmt, \ dev_driver_string(&(pdev)->dev), pci_name(pdev), ##arg) +#ifdef CONFIG_PCIE_DPC +static inline bool pci_wait_port_outdpc(struct pci_dev *pdev) +{ + u16 cap = pdev->dpc_cap, status; + u16 loop = 0; + + if (!cap) { + pci_WARN_ONCE(pdev, !cap, "No DPC capability initiated\n"); + return false; + } + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, ); + pci_dbg(pdev, "DPC status %x, cap %x\n", status, cap); + while (status & PCI_EXP_DPC_STATUS_TRIGGER && loop < 100) { + msleep(10); + loop++; + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, ); + } + if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) { + pci_dbg(pdev, "Out of DPC %x, cost %d ms\n", status, loop*10); + return true; + } + pci_dbg(pdev, "Timeout to wait port out of DPC status\n"); + return false; +} +#else +static inline bool pci_wait_port_outdpc(struct pci_dev *pdev) +{ + return true; +} +#endif #endif /* LINUX_PCI_H */ -- 2.18.4
[PATCH 0/5 V2] Fix DPC hotplug race and enhance error handling
This simple patch set fixed some serious security issues found when DPC error injection and NVMe SSD hotplug brute force test were doing -- race condition between DPC handler and pciehp, AER interrupt handlers, caused system hang and system with DPC feature couldn't recover to normal working state as expected (NVMe instance lost, mount operation hang, race PCIe access caused uncorrectable errors reported alternatively etc). With this patch set applied, stable 5.9-rc6 on ICS (Ice Lake SP platform, see https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)) could pass the PCIe Gen4 NVMe SSD brute force hotplug test with any time interval between hot-remove and plug-in operation tens of times without any errors occur and system works normal. With this patch set applied, system with DPC feature could recover from NON-FATAL and FATAL errors injection test and works as expected. System works smoothly when errors happen while hotplug is doing, no uncorrectable errors found. Brute DPC error injection script: for i in {0..100} do setpci -s 64:02.0 0x196.w=000a setpci -s 65:00.0 0x04.w=0544 mount /dev/nvme0n1p1 /root/nvme sleep 1 done Other details see every commits description part. This patch set could be applied to stable 5.9-rc6 directly. Help to review and test. V2: changed according to review by Andy Shevchenko. Thanks, Ethan Ethan Zhao (5): PCI: define a function to check and wait till port finish DPC handling PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC PCI/ERR: get device before call device driver to avoid NULL pointer reference PCI: only return true when dev io state is really changed PCI/ERR: don't mix io state not changed and no driver together drivers/pci/hotplug/pciehp_hpc.c | 4 +++- drivers/pci/pci.h| 34 +--- drivers/pci/pcie/err.c | 18 +++-- include/linux/pci.h | 31 + 4 files changed, 55 insertions(+), 32 deletions(-) -- 2.18.4
[PATCH 5/5] PCI/ERR: don't mix io state not changed and no driver together
When we see 'can't recover (no error_detected callback)' on console, Maybe the reason is io state is not changed by calling pci_dev_set_io_state(), that is confused. fix it. Signed-off-by: Ethan Zhao Tested-by: Wen jin Tested-by: Shanshan Zhang --- drivers/pci/pcie/err.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index e35c4480c86b..d85f27c90c26 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c @@ -55,8 +55,10 @@ static int report_error_detected(struct pci_dev *dev, if (!pci_dev_get(dev)) return 0; device_lock(>dev); - if (!pci_dev_set_io_state(dev, state) || - !dev->driver || + if (!pci_dev_set_io_state(dev, state)) { + pci_dbg(dev, "Device might already being in error handling ...\n"); + vote = PCI_ERS_RESULT_NONE; + } else if (!dev->driver || !dev->driver->err_handler || !dev->driver->err_handler->error_detected) { /* -- 2.18.4
[PATCH 4/5] PCI: only return true when dev io state is really changed
When uncorrectable error happens, AER driver and DPC driver interrupt handlers likely call pcie_do_recovery()->pci_walk_bus()->report_frozen_detected() with pci_channel_io_frozen the same time. If pci_dev_set_io_state() return true even if the original state is pci_channel_io_frozen, that will cause AER or DPC handler re-enter the error detecting and recovery procedure one after another. The result is the recovery flow mixed between AER and DPC. So simplify the pci_dev_set_io_state() function to only return true when dev->error_state is changed. Signed-off-by: Ethan Zhao Tested-by: Wen jin Tested-by: Shanshan Zhang --- drivers/pci/pci.h | 31 +++ 1 file changed, 3 insertions(+), 28 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index fa12f7cbc1a0..d420bb977f3b 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -362,35 +362,10 @@ static inline bool pci_dev_set_io_state(struct pci_dev *dev, bool changed = false; device_lock_assert(>dev); - switch (new) { - case pci_channel_io_perm_failure: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - case pci_channel_io_perm_failure: - changed = true; - break; - } - break; - case pci_channel_io_frozen: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - changed = true; - break; - } - break; - case pci_channel_io_normal: - switch (dev->error_state) { - case pci_channel_io_frozen: - case pci_channel_io_normal: - changed = true; - break; - } - break; - } - if (changed) + if (dev->error_state != new) { dev->error_state = new; + changed = true; + } return changed; } -- 2.18.4
[PATCH 0/5] Fix DPC hotplug race and enhance error hanlding
This simple patch set fixed some serious security issues found when DPC error injection and NVMe SSD hotplug brute force test were doing -- race condition between DPC handler and pciehp, AER interrupt handlers, caused system hang and system with DPC feature couldn't recover to normal working state as expected (NVMe instance lost, mount operation hang, race PCIe access caused uncorrectable errors reported alternativly etc). With this patch set applied, stable 5.9-rc6 could pass the PCIe Gen4 NVMe SSD brute force hotplug test with any time interval between hot-remove and plug-in operation tens of times without any errors occur and system works normal. With this patch set applied, system with DPC feature could recover from NON-FATAL and FATAL errors injection test and works as expected. System works smoothly when errors happen while hotplug is doing, no uncorrectable errors found. Brute DPC error injection script: for i in {0..100} do setpci -s 64:02.0 0x196.w=000a setpci -s 65:00.0 0x04.w=0544 mount /dev/nvme0n1p1 /root/nvme sleep 1 done Other details see every commits description part. This patch set could be applied to stable 5.9-rc6 directly. Help to review and test. Thanks, Ethan Ethan Zhao (5): PCI: define a function to check and wait till port finish DPC handling PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC PCI/ERR: get device before call device driver to avoid null pointer reference PCI: only return true when dev io state is really changed PCI/ERR: don't mix io state not changed and no driver together drivers/pci/hotplug/pciehp_hpc.c | 4 +++- drivers/pci/pci.h| 31 +++ drivers/pci/pcie/err.c | 18 -- include/linux/pci.h | 31 +++ 4 files changed, 53 insertions(+), 31 deletions(-) -- 2.18.4
[PATCH] Revert "block: revert back to synchronous request_queue removal"
From: Ethan Zhao 'commit e8c7d14ac6c3 ("block: revert back to synchronous request_queue removal")' introduced panic issue to NVMe hotplug as following(hit after just 2 times NVMe SSD hotplug under stable 5.9-RC2): BUG: sleeping function called from invalid context at block/genhd.c:1563 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 0, name: swapper/30 INFO: lockdep is turned off. CPU: 30 PID: 0 Comm: swapper/30 Tainted: G S W 5.9.0-RC2 #3 Hardware name: Intel Corporation Call Trace: dump_stack+0x9d/0xe0 ___might_sleep.cold.79+0x17f/0x1af disk_release+0x26/0x200 device_release+0x6d/0x1c0 kobject_put+0x14d/0x430 hd_struct_free+0xfb/0x260 percpu_ref_switch_to_atomic_rcu+0x3d1/0x580 ? rcu_nocb_unlock_irqrestore+0xb6/0xf0 ? trace_hardirqs_on+0x20/0x1b5 ? rcu_do_batch+0x3ff/0xb50 rcu_do_batch+0x47c/0xb50 ? rcu_accelerate_cbs+0xa9/0x740 ? rcu_spawn_one_nocb_kthread+0x3d0/0x3d0 rcu_core+0x945/0xd90 ? __do_softirq+0x182/0xac0 __do_softirq+0x1ca/0xac0 asm_call_on_stack+0xf/0x20 do_softirq_own_stack+0x7f/0x90 irq_exit_rcu+0x1e3/0x230 sysvec_apic_timer_interrupt+0x48/0xb0 asm_sysvec_apic_timer_interrupt+0x12/0x20 RIP: 0010:cpuidle_enter_state+0x116/0xe90 Code: 00 31 ff e8 ac c8 a4 fe 80 7c 24 10 00 74 12 9c 58 f6 c4 02 0f 85 7e 08 00 00 31 ff e8 43 ca be fe e8 ae a3 d5 fe fb 45 85 ed <0f> 88 4e 05 00 00 4d 63 f5 49 83 fe 09 0f 87 29 0b 00 00 4b 8d 04 RSP: 0018:ff110001040dfd78 EFLAGS: 0206 RAX: 0007 RBX: ffd1fff7b1a01e00 RCX: 001f RDX: RSI: 4000 RDI: b7c5c0f2 RBP: b9a416a0 R08: R09: R10: ff110001040d0007 R11: ffe21c002081a000 R12: 0003 R13: 0003 R14: 0138 ... ... BUG: kernel NULL pointer dereference, address: PGD 0 Oops: 0010 [#1] SMP KASAN NOPTI CPU: 30 PID: 500 Comm: irq/124-pciehp Tainted: G S W 5.9.0-RC2 #3 Hardware name: Intel Corporation RIP: 0010:0x0 Code: Bad RIP value. RSP: 0018:ff110007d5ba75e8 EFLAGS: 00010096 RAX: RBX: ff110001040d RCX: ff110007d5ba7668 RDX: 0009 RSI: ff110001040d RDI: ff110008119f59c0 RBP: ff110008119f59c0 R08: fbfff73f7b4d R09: fbfff73f7b4d R10: b9fbda67 R11: fbfff73f7b4c R12: R13: ff110007d5ba7668 R14: ff110001040d R15: ff110008119f59c0 FS: () GS:ff1100081180() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: ffd6 CR3: 0007cea16002 CR4: 00771ee0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 PKRU: 5554 Call Trace: ttwu_do_activate+0xd3/0x160 try_to_wake_up+0x652/0x1850 ? migrate_swap_stop+0xad0/0xad0 ? lock_contended+0xd70/0xd70 ? rcu_read_unlock+0x50/0x50 ? rcu_do_batch+0x3ff/0xb50 swake_up_locked+0x85/0x1c0 complete+0x4d/0x70 rcu_do_batch+0x47c/0xb50 ? rcu_spawn_one_nocb_kthread+0x3d0/0x3d0 rcu_core+0x945/0xd90 ? __do_softirq+0x182/0xac0 __do_softirq+0x1ca/0xac0 irq_exit_rcu+0x1e3/0x230 sysvec_apic_timer_interrupt+0x48/0xb0 asm_sysvec_apic_timer_interrupt+0x12/0x20 RIP: 0010:_raw_spin_unlock_irqrestore+0x40/0x50 Code: e8 35 ad 36 fe 48 89 ef e8 cd ce 37 fe f6 c7 02 75 11 53 9d e8 91 1f 5c fe 65 ff 0d ba af c2 47 5b 5d c3 e8 d2 22 5c fe 53 9d ed 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53 RSP: 0018:ff110007d5ba79d0 EFLAGS: 0293 RAX: 0007 RBX: 0293 RCX: b67710d4 RDX: RSI: 0004 RDI: b83f41ce RBP: bb77e740 R08: R09: R10: bb77e743 R11: fbfff76efce8 R12: 198e R13: ff11001031d7a0b0 R14: bb77e740 R15: bb77e788 ? do_raw_spin_unlock+0x54/0x230 ? _raw_spin_unlock_irqrestore+0x3e/0x50 dma_debug_device_change+0x150/0x5e0 notifier_call_chain+0x90/0x160 __blocking_notifier_call_chain+0x6d/0xa0 device_release_driver_internal+0x37d/0x490 pci_stop_bus_device+0x123/0x190 pci_stop_and_remove_bus_device+0xe/0x20 pciehp_unconfigure_device+0x17e/0x330 ? pciehp_configure_device+0x3e0/0x3e0 ? trace_hardirqs_on+0x20/0x1b5 pciehp_disable_slot+0x101/0x360 ? pme_is_native.cold.2+0x29/0x29 pciehp_handle_presence_or_link_change+0x1ac/0xee0 ? pciehp_handle_disable_request+0x110/0x110 pciehp_ist.cold.11+0x39/0x54 ? pciehp_set_indicators+0x190/0x190 ? alloc_desc+0x510/0xa30 ? irq_set_affinity_notifier+0x380/0x380 ? pciehp_set_indicators+0x190/0x190 ? irq_thread+0x137/0x420 irq_thread_fn+0x86/0x150 irq_thread+0x21f/0x420 ? irq_forced_thread_fn+0x170/0x170 ? irq_thread_check_affinity+0x210/0x210 ? __kthread_parkme+0x52/0x1a0 ? lockdep_hardirqs_on_prepare+0x33e/0x4e0 ? _raw_spin_unlock_irqrestore+0x3e/0x50 ? trace_hardirqs_on+0x20/0x1b5 ? wake_threads_waitq+0x40/0x40 ? __kthread_parkme+0xd1/0x1a0 ? irq_thread_check_affinity+0x210/0x210 kthread+0x36a/0x430 ? kthread_create_worker_
Re: [PATCH 04/27] Restrict /dev/mem and /dev/kmem when the kernel is locked down
David, May I ask a question here -- Is it intentionally enabling the read-only mode, so userspace tools like dmidecode could work with kernel_is_locked_down ? while it was impossible to work with the attached patch applied. Is it a security policy change with secure boot ? Thanks, Ethan On Mon, Oct 23, 2017 at 10:34 PM, David Howellswrote: > I think I should replace this patch with the attached. This will prevent > /dev/mem, /dev/kmem and /dev/port from being *opened*, and thereby preventing > read, write and ioctl. > > David > --- > commit e68daa2256986932b9a7d6709cf9e24b30d93583 > Author: Matthew Garrett > Date: Wed May 24 14:56:02 2017 +0100 > > Restrict /dev/{mem,kmem,port} when the kernel is locked down > > Allowing users to read and write to core kernel memory makes it possible > for the kernel to be subverted, avoiding module loading restrictions, and > also to steal cryptographic information. > > Disallow /dev/mem and /dev/kmem from being opened this when the kernel has > been locked down to prevent this. > > Also disallow /dev/port from being opened to prevent raw ioport access and > thus DMA from being used to accomplish the same thing. > > Signed-off-by: Matthew Garrett > Signed-off-by: David Howells > Reviewed-by: "Lee, Chun-Yi" > > diff --git a/drivers/char/mem.c b/drivers/char/mem.c > index 593a8818aca9..0ce5ac0a5c6b 100644 > --- a/drivers/char/mem.c > +++ b/drivers/char/mem.c > @@ -762,6 +762,8 @@ static loff_t memory_lseek(struct file *file, loff_t > offset, int orig) > > static int open_port(struct inode *inode, struct file *filp) > { > + if (kernel_is_locked_down("/dev/mem,kmem,port")) > + return -EPERM; > return capable(CAP_SYS_RAWIO) ? 0 : -EPERM; > } >
Re: [PATCH 04/27] Restrict /dev/mem and /dev/kmem when the kernel is locked down
David, May I ask a question here -- Is it intentionally enabling the read-only mode, so userspace tools like dmidecode could work with kernel_is_locked_down ? while it was impossible to work with the attached patch applied. Is it a security policy change with secure boot ? Thanks, Ethan On Mon, Oct 23, 2017 at 10:34 PM, David Howells wrote: > I think I should replace this patch with the attached. This will prevent > /dev/mem, /dev/kmem and /dev/port from being *opened*, and thereby preventing > read, write and ioctl. > > David > --- > commit e68daa2256986932b9a7d6709cf9e24b30d93583 > Author: Matthew Garrett > Date: Wed May 24 14:56:02 2017 +0100 > > Restrict /dev/{mem,kmem,port} when the kernel is locked down > > Allowing users to read and write to core kernel memory makes it possible > for the kernel to be subverted, avoiding module loading restrictions, and > also to steal cryptographic information. > > Disallow /dev/mem and /dev/kmem from being opened this when the kernel has > been locked down to prevent this. > > Also disallow /dev/port from being opened to prevent raw ioport access and > thus DMA from being used to accomplish the same thing. > > Signed-off-by: Matthew Garrett > Signed-off-by: David Howells > Reviewed-by: "Lee, Chun-Yi" > > diff --git a/drivers/char/mem.c b/drivers/char/mem.c > index 593a8818aca9..0ce5ac0a5c6b 100644 > --- a/drivers/char/mem.c > +++ b/drivers/char/mem.c > @@ -762,6 +762,8 @@ static loff_t memory_lseek(struct file *file, loff_t > offset, int orig) > > static int open_port(struct inode *inode, struct file *filp) > { > + if (kernel_is_locked_down("/dev/mem,kmem,port")) > + return -EPERM; > return capable(CAP_SYS_RAWIO) ? 0 : -EPERM; > } >
[tip:sched/core] sched/sysctl: Check user input value of sysctl_sched_time_avg
Commit-ID: 5ccba44ba118a500050076b0344632459779 Gitweb: https://git.kernel.org/tip/5ccba44ba118a500050076b0344632459779 Author: Ethan Zhao <ethan.z...@oracle.com> AuthorDate: Mon, 4 Sep 2017 13:59:34 +0800 Committer: Ingo Molnar <mi...@kernel.org> CommitDate: Fri, 29 Sep 2017 13:20:13 +0200 sched/sysctl: Check user input value of sysctl_sched_time_avg System will hang if user set sysctl_sched_time_avg to 0: [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0 Stack traceback for pid 0 0x883f6406c600 0 0 1 3 R 0x883f6406cf50 *swapper/3 883f7ccc3ae8 0018 810c4dd0 00017800 883f7ccc3d78 0003 883f7ccc3bf8 810c4fc9 883f7ccc3c08 810c5043 883f7ccc3c08 Call Trace: [] ? update_group_capacity+0x110/0x200 [] ? update_sd_lb_stats+0x109/0x600 [] ? find_busiest_group+0x47/0x530 [] ? load_balance+0x194/0x900 [] ? update_rq_clock.part.83+0x1a/0xe0 [] ? rebalance_domains+0x152/0x290 [] ? run_rebalance_domains+0xdc/0x1d0 [] ? __do_softirq+0xfb/0x320 [] ? irq_exit+0x125/0x130 [] ? scheduler_ipi+0x97/0x160 [] ? smp_reschedule_interrupt+0x29/0x30 [] ? reschedule_interrupt+0x6e/0x80 [] ? cpuidle_enter_state+0xcc/0x230 [] ? cpuidle_enter_state+0x9c/0x230 [] ? cpuidle_enter+0x17/0x20 [] ? cpu_startup_entry+0x38c/0x420 [] ? start_secondary+0x173/0x1e0 Because divide-by-zero error happens in function: update_group_capacity() update_cpu_capacity() scale_rt_capacity() { ... total = sched_avg_period() + delta; used = div_u64(avg, total); ... } To fix this issue, check user input value of sysctl_sched_time_avg, keep it unchanged when hitting invalid input, and set the minimum limit of sysctl_sched_time_avg to 1 ms. Reported-by: James Puthukattukaran <james.puthukattuka...@oracle.com> Signed-off-by: Ethan Zhao <ethan.z...@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org> Cc: Linus Torvalds <torva...@linux-foundation.org> Cc: Peter Zijlstra <pet...@infradead.org> Cc: Thomas Gleixner <t...@linutronix.de> Cc: efa...@gmx.de Cc: ethan.ker...@gmail.com Cc: keesc...@chromium.org Cc: mcg...@kernel.org Cc: <sta...@vger.kernel.org> Link: http://lkml.kernel.org/r/1504504774-18253-1-git-send-email-ethan.z...@oracle.com Signed-off-by: Ingo Molnar <mi...@kernel.org> --- kernel/sysctl.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 6648fbb..423554a 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -367,7 +367,8 @@ static struct ctl_table kern_table[] = { .data = _sched_time_avg, .maxlen = sizeof(unsigned int), .mode = 0644, - .proc_handler = proc_dointvec, + .proc_handler = proc_dointvec_minmax, + .extra1 = , }, #ifdef CONFIG_SCHEDSTATS {
[tip:sched/core] sched/sysctl: Check user input value of sysctl_sched_time_avg
Commit-ID: 5ccba44ba118a500050076b0344632459779 Gitweb: https://git.kernel.org/tip/5ccba44ba118a500050076b0344632459779 Author: Ethan Zhao AuthorDate: Mon, 4 Sep 2017 13:59:34 +0800 Committer: Ingo Molnar CommitDate: Fri, 29 Sep 2017 13:20:13 +0200 sched/sysctl: Check user input value of sysctl_sched_time_avg System will hang if user set sysctl_sched_time_avg to 0: [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0 Stack traceback for pid 0 0x883f6406c600 0 0 1 3 R 0x883f6406cf50 *swapper/3 883f7ccc3ae8 0018 810c4dd0 00017800 883f7ccc3d78 0003 883f7ccc3bf8 810c4fc9 883f7ccc3c08 810c5043 883f7ccc3c08 Call Trace: [] ? update_group_capacity+0x110/0x200 [] ? update_sd_lb_stats+0x109/0x600 [] ? find_busiest_group+0x47/0x530 [] ? load_balance+0x194/0x900 [] ? update_rq_clock.part.83+0x1a/0xe0 [] ? rebalance_domains+0x152/0x290 [] ? run_rebalance_domains+0xdc/0x1d0 [] ? __do_softirq+0xfb/0x320 [] ? irq_exit+0x125/0x130 [] ? scheduler_ipi+0x97/0x160 [] ? smp_reschedule_interrupt+0x29/0x30 [] ? reschedule_interrupt+0x6e/0x80 [] ? cpuidle_enter_state+0xcc/0x230 [] ? cpuidle_enter_state+0x9c/0x230 [] ? cpuidle_enter+0x17/0x20 [] ? cpu_startup_entry+0x38c/0x420 [] ? start_secondary+0x173/0x1e0 Because divide-by-zero error happens in function: update_group_capacity() update_cpu_capacity() scale_rt_capacity() { ... total = sched_avg_period() + delta; used = div_u64(avg, total); ... } To fix this issue, check user input value of sysctl_sched_time_avg, keep it unchanged when hitting invalid input, and set the minimum limit of sysctl_sched_time_avg to 1 ms. Reported-by: James Puthukattukaran Signed-off-by: Ethan Zhao Signed-off-by: Peter Zijlstra (Intel) Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: efa...@gmx.de Cc: ethan.ker...@gmail.com Cc: keesc...@chromium.org Cc: mcg...@kernel.org Cc: Link: http://lkml.kernel.org/r/1504504774-18253-1-git-send-email-ethan.z...@oracle.com Signed-off-by: Ingo Molnar --- kernel/sysctl.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 6648fbb..423554a 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -367,7 +367,8 @@ static struct ctl_table kern_table[] = { .data = _sched_time_avg, .maxlen = sizeof(unsigned int), .mode = 0644, - .proc_handler = proc_dointvec, + .proc_handler = proc_dointvec_minmax, + .extra1 = , }, #ifdef CONFIG_SCHEDSTATS {
Re: [PATCH v2] sched: check user input value of sysctl_sched_time_avg
On 2017/9/7 3:50, Luis R. Rodriguez wrote: On Mon, Sep 04, 2017 at 03:54:23PM +0800, Ethan Zhao wrote: Peter, On 2017/9/4 15:49, Peter Zijlstra wrote: On Sat, Sep 02, 2017 at 02:57:32PM +0800, Ethan Zhao wrote: diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 6648fbb..609bed2 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -367,7 +367,7 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write, .data = _sched_time_avg, .maxlen = sizeof(unsigned int), .mode = 0644, - .proc_handler = proc_dointvec, + .proc_handler = sched_time_avg_handler, *sigh*, what's wrong with the below? Too easy? :), seems I walked zigzag several cycles to get the right point. It sounds like negative values are not possible too, right? If so then you can use proc_douintvec_minmax(). V4 has been sent days ago, please see v4. Thank, Ethan Luis
Re: [PATCH v2] sched: check user input value of sysctl_sched_time_avg
On 2017/9/7 3:50, Luis R. Rodriguez wrote: On Mon, Sep 04, 2017 at 03:54:23PM +0800, Ethan Zhao wrote: Peter, On 2017/9/4 15:49, Peter Zijlstra wrote: On Sat, Sep 02, 2017 at 02:57:32PM +0800, Ethan Zhao wrote: diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 6648fbb..609bed2 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -367,7 +367,7 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write, .data = _sched_time_avg, .maxlen = sizeof(unsigned int), .mode = 0644, - .proc_handler = proc_dointvec, + .proc_handler = sched_time_avg_handler, *sigh*, what's wrong with the below? Too easy? :), seems I walked zigzag several cycles to get the right point. It sounds like negative values are not possible too, right? If so then you can use proc_douintvec_minmax(). V4 has been sent days ago, please see v4. Thank, Ethan Luis
Re: [PATCH v2] sched: check user input value of sysctl_sched_time_avg
Peter, On 2017/9/4 15:49, Peter Zijlstra wrote: On Sat, Sep 02, 2017 at 02:57:32PM +0800, Ethan Zhao wrote: diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 6648fbb..609bed2 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -367,7 +367,7 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write, .data = _sched_time_avg, .maxlen = sizeof(unsigned int), .mode = 0644, - .proc_handler = proc_dointvec, + .proc_handler = sched_time_avg_handler, *sigh*, what's wrong with the below? Too easy? :), seems I walked zigzag several cycles to get the right point. Thanks, Ethan diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 6648fbbb8157..bbbc6a17c15e 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -367,7 +367,8 @@ static struct ctl_table kern_table[] = { .data = _sched_time_avg, .maxlen = sizeof(unsigned int), .mode = 0644, - .proc_handler = proc_dointvec, + .proc_handler = proc_dointvec_min_max, + .extra1 = , }, #ifdef CONFIG_SCHEDSTATS {
Re: [PATCH v2] sched: check user input value of sysctl_sched_time_avg
Peter, On 2017/9/4 15:49, Peter Zijlstra wrote: On Sat, Sep 02, 2017 at 02:57:32PM +0800, Ethan Zhao wrote: diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 6648fbb..609bed2 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -367,7 +367,7 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write, .data = _sched_time_avg, .maxlen = sizeof(unsigned int), .mode = 0644, - .proc_handler = proc_dointvec, + .proc_handler = sched_time_avg_handler, *sigh*, what's wrong with the below? Too easy? :), seems I walked zigzag several cycles to get the right point. Thanks, Ethan diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 6648fbbb8157..bbbc6a17c15e 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -367,7 +367,8 @@ static struct ctl_table kern_table[] = { .data = _sched_time_avg, .maxlen = sizeof(unsigned int), .mode = 0644, - .proc_handler = proc_dointvec, + .proc_handler = proc_dointvec_min_max, + .extra1 = , }, #ifdef CONFIG_SCHEDSTATS {
Re: [PATCH] sched: reset sysctl_sched_time_avg to default when
Peter, On 2017/9/4 15:32, Peter Zijlstra wrote: On Sat, Sep 02, 2017 at 08:33:03AM +0800, Ethan Zhao wrote: Yep, that is the first place I considered to set the limit, but that would break KABI ? nah.. V4 sent, please ignore v2 & v3. Thanks, Ethan
Re: [PATCH] sched: reset sysctl_sched_time_avg to default when
Peter, On 2017/9/4 15:32, Peter Zijlstra wrote: On Sat, Sep 02, 2017 at 08:33:03AM +0800, Ethan Zhao wrote: Yep, that is the first place I considered to set the limit, but that would break KABI ? nah.. V4 sent, please ignore v2 & v3. Thanks, Ethan
[PATCH v4] sched: check user input value of sysctl_sched_time_avg
System will hang if user set sysctl_sched_time_avg to 0 by [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0 Stack traceback for pid 0 0x883f6406c600 0 0 1 3 R 0x883f6406cf50 *swapper/3 883f7ccc3ae8 0018 810c4dd0 00017800 883f7ccc3d78 0003 883f7ccc3bf8 810c4fc9 883f7ccc3c08 810c5043 883f7ccc3c08 Call Trace: [] ? update_group_capacity+0x110/0x200 [] ? update_sd_lb_stats+0x109/0x600 [] ? find_busiest_group+0x47/0x530 [] ? load_balance+0x194/0x900 [] ? update_rq_clock.part.83+0x1a/0xe0 [] ? rebalance_domains+0x152/0x290 [] ? run_rebalance_domains+0xdc/0x1d0 [] ? __do_softirq+0xfb/0x320 [] ? irq_exit+0x125/0x130 [] ? scheduler_ipi+0x97/0x160 [] ? smp_reschedule_interrupt+0x29/0x30 [] ? reschedule_interrupt+0x6e/0x80 [] ? cpuidle_enter_state+0xcc/0x230 [] ? cpuidle_enter_state+0x9c/0x230 [] ? cpuidle_enter+0x17/0x20 [] ? cpu_startup_entry+0x38c/0x420 [] ? start_secondary+0x173/0x1e0 Because divide-by-zero error happens in function update_group_capacity() update_cpu_capacity() scale_rt_capacity() { ... total = sched_avg_period() + delta; used = div_u64(avg, total); ... } Seems this issue could be reproduced on all I tried stable 4.1 - last kernel. To fix this issue, check user input value of sysctl_sched_time_avg, keep it unchanged when hit invalid input, set the min limit of sysctl_sched_time_avg to 1 ms. Reported-by: James Puthukattukaran <james.puthukattuka...@oracle.com> Signed-off-by: Ethan Zhao <ethan.z...@oracle.com> --- v2: Check it at user input side in sysctl table (peterz). v3: Use proc_dointvec_minmax(). v4: Fix a too long line in descripton part. kernel/sysctl.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 6648fbb..423554a 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -367,7 +367,8 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write, .data = _sched_time_avg, .maxlen = sizeof(unsigned int), .mode = 0644, - .proc_handler = proc_dointvec, + .proc_handler = proc_dointvec_minmax, + .extra1 = , }, #ifdef CONFIG_SCHEDSTATS { -- 1.8.3.1
[PATCH v4] sched: check user input value of sysctl_sched_time_avg
System will hang if user set sysctl_sched_time_avg to 0 by [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0 Stack traceback for pid 0 0x883f6406c600 0 0 1 3 R 0x883f6406cf50 *swapper/3 883f7ccc3ae8 0018 810c4dd0 00017800 883f7ccc3d78 0003 883f7ccc3bf8 810c4fc9 883f7ccc3c08 810c5043 883f7ccc3c08 Call Trace: [] ? update_group_capacity+0x110/0x200 [] ? update_sd_lb_stats+0x109/0x600 [] ? find_busiest_group+0x47/0x530 [] ? load_balance+0x194/0x900 [] ? update_rq_clock.part.83+0x1a/0xe0 [] ? rebalance_domains+0x152/0x290 [] ? run_rebalance_domains+0xdc/0x1d0 [] ? __do_softirq+0xfb/0x320 [] ? irq_exit+0x125/0x130 [] ? scheduler_ipi+0x97/0x160 [] ? smp_reschedule_interrupt+0x29/0x30 [] ? reschedule_interrupt+0x6e/0x80 [] ? cpuidle_enter_state+0xcc/0x230 [] ? cpuidle_enter_state+0x9c/0x230 [] ? cpuidle_enter+0x17/0x20 [] ? cpu_startup_entry+0x38c/0x420 [] ? start_secondary+0x173/0x1e0 Because divide-by-zero error happens in function update_group_capacity() update_cpu_capacity() scale_rt_capacity() { ... total = sched_avg_period() + delta; used = div_u64(avg, total); ... } Seems this issue could be reproduced on all I tried stable 4.1 - last kernel. To fix this issue, check user input value of sysctl_sched_time_avg, keep it unchanged when hit invalid input, set the min limit of sysctl_sched_time_avg to 1 ms. Reported-by: James Puthukattukaran Signed-off-by: Ethan Zhao --- v2: Check it at user input side in sysctl table (peterz). v3: Use proc_dointvec_minmax(). v4: Fix a too long line in descripton part. kernel/sysctl.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 6648fbb..423554a 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -367,7 +367,8 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write, .data = _sched_time_avg, .maxlen = sizeof(unsigned int), .mode = 0644, - .proc_handler = proc_dointvec, + .proc_handler = proc_dointvec_minmax, + .extra1 = , }, #ifdef CONFIG_SCHEDSTATS { -- 1.8.3.1
[PATCH v3] sched: check user input value of sysctl_sched_time_avg
System will hang if user set sysctl_sched_time_avg to 0 by [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0 Stack traceback for pid 0 0x883f6406c600 0 0 1 3 R 0x883f6406cf50 *swapper/3 883f7ccc3ae8 0018 810c4dd0 00017800 883f7ccc3d78 0003 883f7ccc3bf8 810c4fc9 883f7ccc3c08 810c5043 883f7ccc3c08 Call Trace: [] ? update_group_capacity+0x110/0x200 [] ? update_sd_lb_stats+0x109/0x600 [] ? find_busiest_group+0x47/0x530 [] ? load_balance+0x194/0x900 [] ? update_rq_clock.part.83+0x1a/0xe0 [] ? rebalance_domains+0x152/0x290 [] ? run_rebalance_domains+0xdc/0x1d0 [] ? __do_softirq+0xfb/0x320 [] ? irq_exit+0x125/0x130 [] ? scheduler_ipi+0x97/0x160 [] ? smp_reschedule_interrupt+0x29/0x30 [] ? reschedule_interrupt+0x6e/0x80 [] ? cpuidle_enter_state+0xcc/0x230 [] ? cpuidle_enter_state+0x9c/0x230 [] ? cpuidle_enter+0x17/0x20 [] ? cpu_startup_entry+0x38c/0x420 [] ? start_secondary+0x173/0x1e0 Because divide-by-zero error happens in function update_group_capacity() update_cpu_capacity() scale_rt_capacity() { ... total = sched_avg_period() + delta; used = div_u64(avg, total); ... } Seems this issue could be reproduced on all I tried stable 4.1 - last kernel. To fix this issue, check user input value of sysctl_sched_time_avg, keep it unchanged when hit invalid input, set the min limit of sysctl_sched_time_avg to 1 ms. Reported-by: James Puthukattukaran <james.puthukattuka...@oracle.com> Signed-off-by: Ethan Zhao <ethan.z...@oracle.com> --- v2: Check it at user input side in sysctl table (peterz). v3: Use proc_dointvec_minmax(). kernel/sysctl.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 6648fbb..423554a 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -367,7 +367,8 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write, .data = _sched_time_avg, .maxlen = sizeof(unsigned int), .mode = 0644, - .proc_handler = proc_dointvec, + .proc_handler = proc_dointvec_minmax, + .extra1 = , }, #ifdef CONFIG_SCHEDSTATS { -- 1.8.3.1
[PATCH v3] sched: check user input value of sysctl_sched_time_avg
System will hang if user set sysctl_sched_time_avg to 0 by [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0 Stack traceback for pid 0 0x883f6406c600 0 0 1 3 R 0x883f6406cf50 *swapper/3 883f7ccc3ae8 0018 810c4dd0 00017800 883f7ccc3d78 0003 883f7ccc3bf8 810c4fc9 883f7ccc3c08 810c5043 883f7ccc3c08 Call Trace: [] ? update_group_capacity+0x110/0x200 [] ? update_sd_lb_stats+0x109/0x600 [] ? find_busiest_group+0x47/0x530 [] ? load_balance+0x194/0x900 [] ? update_rq_clock.part.83+0x1a/0xe0 [] ? rebalance_domains+0x152/0x290 [] ? run_rebalance_domains+0xdc/0x1d0 [] ? __do_softirq+0xfb/0x320 [] ? irq_exit+0x125/0x130 [] ? scheduler_ipi+0x97/0x160 [] ? smp_reschedule_interrupt+0x29/0x30 [] ? reschedule_interrupt+0x6e/0x80 [] ? cpuidle_enter_state+0xcc/0x230 [] ? cpuidle_enter_state+0x9c/0x230 [] ? cpuidle_enter+0x17/0x20 [] ? cpu_startup_entry+0x38c/0x420 [] ? start_secondary+0x173/0x1e0 Because divide-by-zero error happens in function update_group_capacity() update_cpu_capacity() scale_rt_capacity() { ... total = sched_avg_period() + delta; used = div_u64(avg, total); ... } Seems this issue could be reproduced on all I tried stable 4.1 - last kernel. To fix this issue, check user input value of sysctl_sched_time_avg, keep it unchanged when hit invalid input, set the min limit of sysctl_sched_time_avg to 1 ms. Reported-by: James Puthukattukaran Signed-off-by: Ethan Zhao --- v2: Check it at user input side in sysctl table (peterz). v3: Use proc_dointvec_minmax(). kernel/sysctl.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 6648fbb..423554a 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -367,7 +367,8 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write, .data = _sched_time_avg, .maxlen = sizeof(unsigned int), .mode = 0644, - .proc_handler = proc_dointvec, + .proc_handler = proc_dointvec_minmax, + .extra1 = , }, #ifdef CONFIG_SCHEDSTATS { -- 1.8.3.1
[PATCH v2] sched: check user input value of sysctl_sched_time_avg
System will hang if user set sysctl_sched_time_avg to 0 by [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0 Stack traceback for pid 0 0x883f6406c600 0 0 1 3 R 0x883f6406cf50 *swapper/3 883f7ccc3ae8 0018 810c4dd0 00017800 883f7ccc3d78 0003 883f7ccc3bf8 810c4fc9 883f7ccc3c08 810c5043 883f7ccc3c08 Call Trace: [] ? update_group_capacity+0x110/0x200 [] ? update_sd_lb_stats+0x109/0x600 [] ? find_busiest_group+0x47/0x530 [] ? load_balance+0x194/0x900 [] ? update_rq_clock.part.83+0x1a/0xe0 [] ? rebalance_domains+0x152/0x290 [] ? run_rebalance_domains+0xdc/0x1d0 [] ? __do_softirq+0xfb/0x320 [] ? irq_exit+0x125/0x130 [] ? scheduler_ipi+0x97/0x160 [] ? smp_reschedule_interrupt+0x29/0x30 [] ? reschedule_interrupt+0x6e/0x80 [] ? cpuidle_enter_state+0xcc/0x230 [] ? cpuidle_enter_state+0x9c/0x230 [] ? cpuidle_enter+0x17/0x20 [] ? cpu_startup_entry+0x38c/0x420 [] ? start_secondary+0x173/0x1e0 Because divide-by-zero error happens in function update_group_capacity() update_cpu_capacity() scale_rt_capacity() { ... total = sched_avg_period() + delta; used = div_u64(avg, total); ... } Seems this issue could be reproduced on all I tried stable 4.1 - last kernel. To fix this issue, check user input value of sysctl_sched_time_avg, keep it unchanged when hit invalid input. Reported-by: James Puthukattukaran <james.puthukattuka...@oracle.com> Signed-off-by: Ethan Zhao <ethan.z...@oracle.com> --- v2: check it in sysctl table (input side) as Peter suggested. Tested on stable 4.1, applied on stable 4.13-rc5 okay. include/linux/sched/sysctl.h | 3 +++ kernel/sched/fair.c | 26 ++ kernel/sysctl.c | 2 +- 3 files changed, 30 insertions(+), 1 deletion(-) diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 0f5ecd4..cc25807 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -44,6 +44,9 @@ enum sched_tunable_scaling { int sched_proc_update_handler(struct ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos); +int sched_time_avg_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos); #endif /* diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c95880e..61155dc 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -650,6 +650,32 @@ int sched_proc_update_handler(struct ctl_table *table, int write, return 0; } + +int sched_time_avg_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos) +{ + struct ctl_table tbl; + int new_value, ret; + + memset(, 0, sizeof(struct ctl_table)); + tbl.maxlen = sizeof(unsigned int); + + if (write) + tbl.data = _value; + else + tbl.data = _sched_time_avg; + + ret = proc_dointvec(, write, buffer, lenp, ppos); + if (write && ret == 0) { + if (new_value <= 0) + ret = -EINVAL; + else + sysctl_sched_time_avg = new_value; + } + + return ret; +} #endif /* diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 6648fbb..609bed2 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -367,7 +367,7 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write, .data = _sched_time_avg, .maxlen = sizeof(unsigned int), .mode = 0644, - .proc_handler = proc_dointvec, + .proc_handler = sched_time_avg_handler, }, #ifdef CONFIG_SCHEDSTATS { -- 1.8.3.1
[PATCH v2] sched: check user input value of sysctl_sched_time_avg
System will hang if user set sysctl_sched_time_avg to 0 by [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0 Stack traceback for pid 0 0x883f6406c600 0 0 1 3 R 0x883f6406cf50 *swapper/3 883f7ccc3ae8 0018 810c4dd0 00017800 883f7ccc3d78 0003 883f7ccc3bf8 810c4fc9 883f7ccc3c08 810c5043 883f7ccc3c08 Call Trace: [] ? update_group_capacity+0x110/0x200 [] ? update_sd_lb_stats+0x109/0x600 [] ? find_busiest_group+0x47/0x530 [] ? load_balance+0x194/0x900 [] ? update_rq_clock.part.83+0x1a/0xe0 [] ? rebalance_domains+0x152/0x290 [] ? run_rebalance_domains+0xdc/0x1d0 [] ? __do_softirq+0xfb/0x320 [] ? irq_exit+0x125/0x130 [] ? scheduler_ipi+0x97/0x160 [] ? smp_reschedule_interrupt+0x29/0x30 [] ? reschedule_interrupt+0x6e/0x80 [] ? cpuidle_enter_state+0xcc/0x230 [] ? cpuidle_enter_state+0x9c/0x230 [] ? cpuidle_enter+0x17/0x20 [] ? cpu_startup_entry+0x38c/0x420 [] ? start_secondary+0x173/0x1e0 Because divide-by-zero error happens in function update_group_capacity() update_cpu_capacity() scale_rt_capacity() { ... total = sched_avg_period() + delta; used = div_u64(avg, total); ... } Seems this issue could be reproduced on all I tried stable 4.1 - last kernel. To fix this issue, check user input value of sysctl_sched_time_avg, keep it unchanged when hit invalid input. Reported-by: James Puthukattukaran Signed-off-by: Ethan Zhao --- v2: check it in sysctl table (input side) as Peter suggested. Tested on stable 4.1, applied on stable 4.13-rc5 okay. include/linux/sched/sysctl.h | 3 +++ kernel/sched/fair.c | 26 ++ kernel/sysctl.c | 2 +- 3 files changed, 30 insertions(+), 1 deletion(-) diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h index 0f5ecd4..cc25807 100644 --- a/include/linux/sched/sysctl.h +++ b/include/linux/sched/sysctl.h @@ -44,6 +44,9 @@ enum sched_tunable_scaling { int sched_proc_update_handler(struct ctl_table *table, int write, void __user *buffer, size_t *length, loff_t *ppos); +int sched_time_avg_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos); #endif /* diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index c95880e..61155dc 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -650,6 +650,32 @@ int sched_proc_update_handler(struct ctl_table *table, int write, return 0; } + +int sched_time_avg_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos) +{ + struct ctl_table tbl; + int new_value, ret; + + memset(, 0, sizeof(struct ctl_table)); + tbl.maxlen = sizeof(unsigned int); + + if (write) + tbl.data = _value; + else + tbl.data = _sched_time_avg; + + ret = proc_dointvec(, write, buffer, lenp, ppos); + if (write && ret == 0) { + if (new_value <= 0) + ret = -EINVAL; + else + sysctl_sched_time_avg = new_value; + } + + return ret; +} #endif /* diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 6648fbb..609bed2 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -367,7 +367,7 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write, .data = _sched_time_avg, .maxlen = sizeof(unsigned int), .mode = 0644, - .proc_handler = proc_dointvec, + .proc_handler = sched_time_avg_handler, }, #ifdef CONFIG_SCHEDSTATS { -- 1.8.3.1
Re: [PATCH] sched: reset sysctl_sched_time_avg to default when
Peter, V2 sent, tested okay on stable 4.1, re-factor a little, could apply to v4.13-rc5, please review. [root@ ~]# sysctl kernel.sched_time_avg_ms kernel.sched_time_avg_ms = 1000 [root@ ~]# sysctl kernel.sched_time_avg_ms=0 sysctl: setting key "kernel.sched_time_avg_ms": Invalid argument kernel.sched_time_avg_ms = 0 [root@ ~]# sysctl kernel.sched_time_avg_ms kernel.sched_time_avg_ms = 1000 [root@ ~]# sysctl kernel.sched_time_avg_ms=900 kernel.sched_time_avg_ms = 900 [root@ ~]# sysctl kernel.sched_time_avg_ms kernel.sched_time_avg_ms = 900 Thanks, Ethan On Sat, Sep 2, 2017 at 8:33 AM, Ethan Zhao <ethan.ker...@gmail.com> wrote: > Yep, that is the first place I considered to set the limit, but that would > break KABI ? > > Thanks, > Ethan > > On Fri, Sep 1, 2017 at 8:32 PM, Peter Zijlstra <pet...@infradead.org> wrote: >> On Fri, Sep 01, 2017 at 07:31:54PM +0800, Ethan Zhao wrote: >>> System will hang if user set sysctl_sched_time_avg to 0 by >>> >>> [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0 >>> >>> Stack traceback for pid 0 >>> 0x883f6406c600 0 0 1 3 R 0x883f6406cf50 *swapper/3 >>> 883f7ccc3ae8 0018 810c4dd0 >>> 00017800 883f7ccc3d78 0003 883f7ccc3bf8 >>> 810c4fc9 883f7ccc3c08 810c5043 883f7ccc3c08 >>> Call Trace: >>> [] ? update_group_capacity+0x110/0x200 >>> [] ? update_sd_lb_stats+0x109/0x600 >>> [] ? find_busiest_group+0x47/0x530 >>> [] ? load_balance+0x194/0x900 >>> [] ? update_rq_clock.part.83+0x1a/0xe0 >>> [] ? rebalance_domains+0x152/0x290 >>> [] ? run_rebalance_domains+0xdc/0x1d0 >>> [] ? __do_softirq+0xfb/0x320 >>> [] ? irq_exit+0x125/0x130 >>> [] ? scheduler_ipi+0x97/0x160 >>> [] ? smp_reschedule_interrupt+0x29/0x30 >>> [] ? reschedule_interrupt+0x6e/0x80 >>> [] ? cpuidle_enter_state+0xcc/0x230 >>> [] ? cpuidle_enter_state+0x9c/0x230 >>> [] ? cpuidle_enter+0x17/0x20 >>> [] ? cpu_startup_entry+0x38c/0x420 >>> [] ? start_secondary+0x173/0x1e0 >>> >>> Because divide-by-zero error happens in function >>> >>> update_group_capacity() >>> update_cpu_capacity() >>> scale_rt_capacity() >>> { >>> ... >>> total = sched_avg_period() + delta; >>> used = div_u64(avg, total); >>> ... >>> } >>> >>> Seems this issue could be reproduced on all I tried stable 4.1 - last >>> kernel. >>> >>> To fix this issue, reset sysctl_sched_time_avg to default value >>> MSEC_PER_SEC if user input invalid value in function >>> >>> sched_avg_period() >>> >>> Reported-by: James Puthukattukaran <james.puthukattuka...@oracle.com> >>> Signed-off-by: Ethan Zhao <ethan.z...@oracle.com> >>> --- >>> Tested on stable 4.1, compiled on stable 4.13-rc5 >>> kernel/sched/sched.h | 3 +++ >>> 1 file changed, 3 insertions(+) >>> >>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >>> index eeef1a3..b398560 100644 >>> --- a/kernel/sched/sched.h >>> +++ b/kernel/sched/sched.h >>> @@ -1620,6 +1620,9 @@ static inline void rq_last_tick_reset(struct rq *rq) >>> >>> static inline u64 sched_avg_period(void) >>> { >>> + if (unlikely(sysctl_sched_time_avg <= 0)) >>> + sysctl_sched_time_avg = MSEC_PER_SEC; >>> + >> >> Sight, no.. you can set limits in the sysctl table.
Re: [PATCH] sched: reset sysctl_sched_time_avg to default when
Peter, V2 sent, tested okay on stable 4.1, re-factor a little, could apply to v4.13-rc5, please review. [root@ ~]# sysctl kernel.sched_time_avg_ms kernel.sched_time_avg_ms = 1000 [root@ ~]# sysctl kernel.sched_time_avg_ms=0 sysctl: setting key "kernel.sched_time_avg_ms": Invalid argument kernel.sched_time_avg_ms = 0 [root@ ~]# sysctl kernel.sched_time_avg_ms kernel.sched_time_avg_ms = 1000 [root@ ~]# sysctl kernel.sched_time_avg_ms=900 kernel.sched_time_avg_ms = 900 [root@ ~]# sysctl kernel.sched_time_avg_ms kernel.sched_time_avg_ms = 900 Thanks, Ethan On Sat, Sep 2, 2017 at 8:33 AM, Ethan Zhao wrote: > Yep, that is the first place I considered to set the limit, but that would > break KABI ? > > Thanks, > Ethan > > On Fri, Sep 1, 2017 at 8:32 PM, Peter Zijlstra wrote: >> On Fri, Sep 01, 2017 at 07:31:54PM +0800, Ethan Zhao wrote: >>> System will hang if user set sysctl_sched_time_avg to 0 by >>> >>> [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0 >>> >>> Stack traceback for pid 0 >>> 0x883f6406c600 0 0 1 3 R 0x883f6406cf50 *swapper/3 >>> 883f7ccc3ae8 0018 810c4dd0 >>> 00017800 883f7ccc3d78 0003 883f7ccc3bf8 >>> 810c4fc9 883f7ccc3c08 810c5043 883f7ccc3c08 >>> Call Trace: >>> [] ? update_group_capacity+0x110/0x200 >>> [] ? update_sd_lb_stats+0x109/0x600 >>> [] ? find_busiest_group+0x47/0x530 >>> [] ? load_balance+0x194/0x900 >>> [] ? update_rq_clock.part.83+0x1a/0xe0 >>> [] ? rebalance_domains+0x152/0x290 >>> [] ? run_rebalance_domains+0xdc/0x1d0 >>> [] ? __do_softirq+0xfb/0x320 >>> [] ? irq_exit+0x125/0x130 >>> [] ? scheduler_ipi+0x97/0x160 >>> [] ? smp_reschedule_interrupt+0x29/0x30 >>> [] ? reschedule_interrupt+0x6e/0x80 >>> [] ? cpuidle_enter_state+0xcc/0x230 >>> [] ? cpuidle_enter_state+0x9c/0x230 >>> [] ? cpuidle_enter+0x17/0x20 >>> [] ? cpu_startup_entry+0x38c/0x420 >>> [] ? start_secondary+0x173/0x1e0 >>> >>> Because divide-by-zero error happens in function >>> >>> update_group_capacity() >>> update_cpu_capacity() >>> scale_rt_capacity() >>> { >>> ... >>> total = sched_avg_period() + delta; >>> used = div_u64(avg, total); >>> ... >>> } >>> >>> Seems this issue could be reproduced on all I tried stable 4.1 - last >>> kernel. >>> >>> To fix this issue, reset sysctl_sched_time_avg to default value >>> MSEC_PER_SEC if user input invalid value in function >>> >>> sched_avg_period() >>> >>> Reported-by: James Puthukattukaran >>> Signed-off-by: Ethan Zhao >>> --- >>> Tested on stable 4.1, compiled on stable 4.13-rc5 >>> kernel/sched/sched.h | 3 +++ >>> 1 file changed, 3 insertions(+) >>> >>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h >>> index eeef1a3..b398560 100644 >>> --- a/kernel/sched/sched.h >>> +++ b/kernel/sched/sched.h >>> @@ -1620,6 +1620,9 @@ static inline void rq_last_tick_reset(struct rq *rq) >>> >>> static inline u64 sched_avg_period(void) >>> { >>> + if (unlikely(sysctl_sched_time_avg <= 0)) >>> + sysctl_sched_time_avg = MSEC_PER_SEC; >>> + >> >> Sight, no.. you can set limits in the sysctl table.