from:"Ethan Zhao"

Re: [PATCH v7 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback

2020-11-26 Thread Ethan Zhao

On Wed, Nov 25, 2020 at 2:47 AM Guilherme G. Piccoli
 wrote:
>
> Hi Kuppuswamy Sathyanarayanan (and all involved here), thanks for the
> patch! I'd like to ask what is the status of this patchset - I just
> "parachuted" in the issue, and by tracking the linux-pci ML, I found
> this V7 (and all previous versions since V2). Also, noticed that Jay's
> email might have gotten lost in translation (he's not CCed in latest
> versions of the patchset).
>
> I was able to find even another interesting thread that might be
> related, Ethan's patchset. So, if any of the developers can clarify the

Thank you for asking.

> current status of this patchset or if the functionality hereby proposed
> ended-up being implemented in another patch, I appreciate a lot.
>
> Thanks in advance! Below, some references to lore archives.
> Cheers,
>
>
> Guilherme
>
>
> References:
>
> This V7 link:
> https://lore.kernel.org/linux-pci/546d346644654915877365b19ea534378db0894d.1602788209.git.sathyanarayanan.kuppusw...@linux.intel.com/
>
> V6:
> https://lore.kernel.org/linux-pci/546d346644654915877365b19ea534378db0894d.1602663397.git.sathyanarayanan.kuppusw...@linux.intel.com/#t
>
> V5:
> https://lore.kernel.org/linux-pci/162495c76c391de6e021919e2b69c5cd2dbbc22a.1602632140.git.sathyanarayanan.kuppusw...@linux.intel.com/
>
> V4:
> https://lore.kernel.org/linux-pci/5c5bca0bdb958e456176fe6ede10ba8f838fbafc.1602263264.git.sathyanarayanan.kuppusw...@linux.intel.com/
>
> V3:
> https://lore.kernel.org/linux-pci/cbba08a5e9ca62778c8937f44eda2192a2045da7.1595617529.git.sathyanarayanan.kuppusw...@linux.intel.com/
>
> V2:
> https://lore.kernel.org/linux-pci/ce417fbf81a8a46a89535f44b9224ee9fbb55a29.1591307288.git.sathyanarayanan.kuppusw...@linux.intel.com/#t
>
> Ethan's related(?) patchset, V8 :
> https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/#t

This patchset I think need more discussion and maintainer's (Bjorn's)
suggestion or advice.
You are welcome to try it and give me the feedback.

Thanks,
Ethan

>

Re: [PATCH v11 4/5] PCI/portdrv: Remove redundant pci_aer_available() check in DPC enable logic

2020-11-02 Thread Ethan Zhao

The current logic is
if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) &&
pci_aer_available() &&
(pcie_ports_dpc_native || (services & PCIE_PORT_SERVICE_AER)))
services |= PCIE_PORT_SERVICE_DPC;


if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) &&
pci_aer_available() &&
(pcie_ports_dpc_native))
 services |= PCIE_PORT_SERVICE_DPC;

or

if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) &&
pci_aer_available() &&(services & PCIE_PORT_SERVICE_AER)）
  services |= PCIE_PORT_SERVICE_DPC;

do you mean one of the possible is
if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) &&
(pcie_ports_dpc_native))
 services |= PCIE_PORT_SERVICE_DPC;

after your patch ? nothing about AER ?

Thanks,
Ethan

On Thu, Oct 29, 2020 at 1:14 AM Kuppuswamy, Sathyanarayanan
 wrote:
>
>
>
> On 10/27/20 11:00 PM, Ethan Zhao wrote:
> > On Tue, Oct 27, 2020 at 10:00 PM Kuppuswamy Sathyanarayanan
> >  wrote:
> >>
> >> In DPC service enable logic, check for
> >> services & PCIE_PORT_SERVICE_AER implies pci_aer_available()
> > How about PCIE_PORT_SERVICE_AER is not configured, but
> > pcie_aer_disable == 0 ？
> Its not possible in current code flow. DPC service is configured
> following AER service configuration.
> >> is true. So there is no need to explicitly check it again.
> >>
> >> Also, passing pcie_ports=dpc-native in kernel command line
> >> implies DPC needs to be enabled in native mode irrespective
> >> of AER ownership status. So checking for pci_aer_available()
> >> without checking for pcie_ports status is incorrect.
> >>
> >> Signed-off-by: Kuppuswamy Sathyanarayanan 
> >> 
> >> ---
> >>   drivers/pci/pcie/portdrv_core.c | 1 -
> >>   1 file changed, 1 deletion(-)
> >>
> >> diff --git a/drivers/pci/pcie/portdrv_core.c 
> >> b/drivers/pci/pcie/portdrv_core.c
> >> index 2c0278f0fdcc..e257a2ca3595 100644
> >> --- a/drivers/pci/pcie/portdrv_core.c
> >> +++ b/drivers/pci/pcie/portdrv_core.c
> >> @@ -252,7 +252,6 @@ static int get_port_device_capability(struct pci_dev 
> >> *dev)
> >>   * permission to use AER.
> >>   */
> >>  if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) &&
> >> -   pci_aer_available() &&
> >>  (host->native_dpc || (services & PCIE_PORT_SERVICE_AER)))
> >>  services |= PCIE_PORT_SERVICE_DPC;
> >>
> >> --
> >> 2.17.1
> >>
>
> --
> Sathyanarayanan Kuppuswamy
> Linux Kernel Developer

Re: [PATCH] AER: aer_root_reset() non-native handling

2020-11-01 Thread Ethan Zhao

On Sat, Oct 31, 2020 at 6:36 AM Sean V Kelley  wrote:
>
> If an OS has not been granted AER control via _OSC, then
> the OS should not make changes to PCI_ERR_ROOT_COMMAND and
> PCI_ERR_ROOT_STATUS related registers. Per section 4.5.1 of
> the System Firmware Intermediary (SFI) _OSC and DPC Updates
> ECN [1], this bit also covers these aspects of the PCI
> Express Advanced Error Reporting. Further, the handling of
> clear/enable of PCI_ERROR_ROOT_COMMAND when wrapped around
> PCI_ERR_ROOT_STATUS should have no effect and be removed.
> Based on the above and earlier discussion [2], make the
> following changes:
>
> Add a check for the native case (i.e., AER control via _OSC)
> Re-order and remove some of the handling:
> - clear PCI_ERR_ROOT_COMMAND ROOT_PORT_INTR_ON_MESG_MASK
> - do reset
> - clear PCI_ERR_ROOT_STATUS
> - enable PCI_ERR_ROOT_COMMAND ROOT_PORT_INTR_ON_MESG_MASK
>
> to this:
>
> - clear PCI_ERR_ROOT_STATUS
> - do reset
>
> The current "clear, reset, enable" order suggests that the reset
> might cause errors that we should ignore. But I am unable to find a
> reference and the clearing of PCI_ERR_ROOT_STATUS does not require them.
>
> [1] System Firmware Intermediary (SFI) _OSC and DPC Updates ECN, Feb 24,
> 2020, affecting PCI Firmware Specification, Rev. 3.2
> https://members.pcisig.com/wg/PCI-SIG/document/14076
> [2] 
> https://lore.kernel.org/linux-pci/20201020162820.GA370938@bjorn-Precision-5520/
>
> Signed-off-by: Sean V Kelley 
> ---
>  drivers/pci/pcie/aer.c | 21 ++---
>  1 file changed, 6 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 65dff5f3457a..bbfb07842d89 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1361,23 +1361,14 @@ static pci_ers_result_t aer_root_reset(struct pci_dev 
> *dev)
> u32 reg32;
> int rc;
>
> -
> -   /* Disable Root's interrupt in response to error messages */
> -   pci_read_config_dword(dev, aer + PCI_ERR_ROOT_COMMAND, );
> -   reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK;
> -   pci_write_config_dword(dev, aer + PCI_ERR_ROOT_COMMAND, reg32);
There may be some reasons to disable interrupt first and then do resetting,
clear status and re-enable interrupt.
Perhaps to avoid error noise,  what would happen if the resetting
causes errors itself ?

Thanks,
Ethan

> +   if (pcie_aer_is_native(dev)) {
> +   /* Clear Root Error Status */
> +   pci_read_config_dword(dev, aer + PCI_ERR_ROOT_STATUS, );
> +   pci_write_config_dword(dev, aer + PCI_ERR_ROOT_STATUS, reg32);
> +   }
>
> rc = pci_bus_error_reset(dev);
> -   pci_info(dev, "Root Port link has been reset\n");
> -
> -   /* Clear Root Error Status */
> -   pci_read_config_dword(dev, aer + PCI_ERR_ROOT_STATUS, );
> -   pci_write_config_dword(dev, aer + PCI_ERR_ROOT_STATUS, reg32);
> -
> -   /* Enable Root Port's interrupt in response to error messages */
> -   pci_read_config_dword(dev, aer + PCI_ERR_ROOT_COMMAND, );
> -   reg32 |= ROOT_PORT_INTR_ON_MESG_MASK;
> -   pci_write_config_dword(dev, aer + PCI_ERR_ROOT_COMMAND, reg32);
> +   pci_info(dev, "Root Port link has been reset (%d)\n", rc);
>
> return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
>  }
> --
> 2.29.2
>

Re: [PATCH v11 1/5] PCI: Conditionally initialize host bridge native_* members

2020-10-28 Thread Ethan Zhao

On Tue, Oct 27, 2020 at 10:00 PM Kuppuswamy Sathyanarayanan
 wrote:
>
> If CONFIG_PCIEPORTBUS is not enabled in kernel then initialing
> struct pci_host_bridge PCIe specific native_* members to "1" is
> incorrect. So protect the PCIe specific member initialization
> with CONFIG_PCIEPORTBUS.
>
> Signed-off-by: Kuppuswamy Sathyanarayanan 
> 
> ---
>  drivers/pci/probe.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 4289030b0fff..756fa60ca708 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -588,12 +588,14 @@ static void pci_init_host_bridge(struct pci_host_bridge 
> *bridge)
>  * may implement its own AER handling and use _OSC to prevent the
>  * OS from interfering.
>  */
> +#ifdef CONFIG_PCIEPORTBUS
> bridge->native_aer = 1;
> bridge->native_pcie_hotplug = 1;
> -   bridge->native_shpc_hotplug = 1;
> bridge->native_pme = 1;
> -   bridge->native_ltr = 1;
> bridge->native_dpc = 1;
> +#endif
  If CONFIG_PCIEPORTBUS wasn't defined, leave them to "unknown" value ?

> +   bridge->native_ltr = 1;
> +   bridge->native_shpc_hotplug = 1;
>
> device_initialize(>dev);
>  }
> --
> 2.17.1
>

Re: [PATCH v10 5/5] PCI/DPC: Move AER/DPC dependency checks out of DPC driver

2020-10-28 Thread Ethan Zhao

On Tue, Oct 27, 2020 at 11:12 AM Kuppuswamy Sathyanarayanan
 wrote:
>
> Currently, AER and DPC Capabilities dependency checks is
> distributed between DPC and portdrv service drivers. So move
> them out of DPC driver.
>
> Also, since services & PCIE_PORT_SERVICE_AER check already
> ensures AER native ownership, no need to add additional
> pcie_aer_is_native() check.
>
> Signed-off-by: Kuppuswamy Sathyanarayanan 
> 
> ---
>  drivers/pci/pcie/dpc.c  | 4 
>  drivers/pci/pcie/portdrv_core.c | 1 +
>  2 files changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index 21f77420632b..a8b92207 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -283,14 +283,10 @@ void pci_dpc_init(struct pci_dev *pdev)
>  static int dpc_probe(struct pcie_device *dev)
>  {
> struct pci_dev *pdev = dev->port;
> -   struct pci_host_bridge *host = pci_find_host_bridge(pdev->bus);
> struct device *device = >device;
> int status;
> u16 ctl, cap;
>
> -   if (!pcie_aer_is_native(pdev) && !host->native_dpc)
> -   return -ENOTSUPP;
> -
> status = devm_request_threaded_irq(device, dev->irq, dpc_irq,
>dpc_handler, IRQF_SHARED,
>"pcie-dpc", pdev);
> diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> index e257a2ca3595..ffa1d9fc458e 100644
> --- a/drivers/pci/pcie/portdrv_core.c
> +++ b/drivers/pci/pcie/portdrv_core.c
> @@ -252,6 +252,7 @@ static int get_port_device_capability(struct pci_dev *dev)
>  * permission to use AER.
>  */
> if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) &&
> +   host->native_dpc &&
What hell the logic is here ?
> (host->native_dpc || (services & PCIE_PORT_SERVICE_AER)))
> services |= PCIE_PORT_SERVICE_DPC;
>
> --
> 2.17.1
>

Re: [PATCH v11 4/5] PCI/portdrv: Remove redundant pci_aer_available() check in DPC enable logic

2020-10-28 Thread Ethan Zhao

On Tue, Oct 27, 2020 at 10:00 PM Kuppuswamy Sathyanarayanan
 wrote:
>
> In DPC service enable logic, check for
> services & PCIE_PORT_SERVICE_AER implies pci_aer_available()
How about PCIE_PORT_SERVICE_AER is not configured, but
pcie_aer_disable == 0 ？
> is true. So there is no need to explicitly check it again.
>
> Also, passing pcie_ports=dpc-native in kernel command line
> implies DPC needs to be enabled in native mode irrespective
> of AER ownership status. So checking for pci_aer_available()
> without checking for pcie_ports status is incorrect.
>
> Signed-off-by: Kuppuswamy Sathyanarayanan 
> 
> ---
>  drivers/pci/pcie/portdrv_core.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> index 2c0278f0fdcc..e257a2ca3595 100644
> --- a/drivers/pci/pcie/portdrv_core.c
> +++ b/drivers/pci/pcie/portdrv_core.c
> @@ -252,7 +252,6 @@ static int get_port_device_capability(struct pci_dev *dev)
>  * permission to use AER.
>  */
> if (pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DPC) &&
> -   pci_aer_available() &&
> (host->native_dpc || (services & PCIE_PORT_SERVICE_AER)))
> services |= PCIE_PORT_SERVICE_DPC;
>
> --
> 2.17.1
>

Re: [PATCH V2 1/2] PCI/AER: Add pcie_is_ecrc_enabled() API

2020-10-28 Thread Ethan Zhao

On Wed, Oct 28, 2020 at 9:48 AM Vidya Sagar  wrote:
>
> Adds pcie_is_ecrc_enabled() API to let other sub-systems (like DesignWare)
> to query if ECRC policy is enabled and perform any configuration
> required in those respective sub-systems.
>
> Signed-off-by: Vidya Sagar 
> ---
> V2:
> * None from V1
>
>  drivers/pci/pci.h  |  2 ++
>  drivers/pci/pcie/aer.c | 11 +++
>  2 files changed, 13 insertions(+)
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index fa12f7cbc1a0..325fdbf91dde 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -575,9 +575,11 @@ static inline void 
> pcie_aspm_powersave_config_link(struct pci_dev *pdev) { }
>  #ifdef CONFIG_PCIE_ECRC
>  void pcie_set_ecrc_checking(struct pci_dev *dev);
>  void pcie_ecrc_get_policy(char *str);
> +bool pcie_is_ecrc_enabled(void);
>  #else
>  static inline void pcie_set_ecrc_checking(struct pci_dev *dev) { }
>  static inline void pcie_ecrc_get_policy(char *str) { }
> +static inline bool pcie_is_ecrc_enabled(void) { return false; }
>  #endif
>
>  #ifdef CONFIG_PCIE_PTM
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 65dff5f3457a..24363c895aba 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -207,6 +207,17 @@ void pcie_ecrc_get_policy(char *str)
>
> ecrc_policy = i;
>  }
> +
> +/**
> + * pcie_is_ecrc_enabled - returns if ECRC is enabled in the system or not
> + *
> + * Returns 1 if ECRC policy is enabled and 0 otherwise
 How about 'Returns true if ECRC policy is enabled and false otherwise'  ?
> + */
> +bool pcie_is_ecrc_enabled(void)
> +{
> +   return ecrc_policy == ECRC_POLICY_ON;
> +}
> +EXPORT_SYMBOL(pcie_is_ecrc_enabled);
>  #endif /* CONFIG_PCIE_ECRC */
>
>  #definePCI_EXP_AER_FLAGS   (PCI_EXP_DEVCTL_CERE | 
> PCI_EXP_DEVCTL_NFERE | \
> --
> 2.17.1
>

Re: [PATCH v2] pciehp: Add check for DL_ACTIVE bit in pciehp_check_link_status()

2020-10-20 Thread Ethan Zhao

On Tue, Oct 20, 2020 at 2:33 PM Sanjay R Mehta  wrote:
>
> From: Sanjay R Mehta 
>
> if DL_ACTIVE bit is set it means that there is no need to check
> PCI_EXP_LNKSTA_LT bit, as DL_ACTIVE would have set only if the link
> is already trained. Hence adding a check which takes care of this
> scenario.
Sorry, I couldn't understand the logic here. if DL_ACTIVE was set,
PCI_EXP_LNKSTA_LT is to be cleared, vice versa.  why need

(lnk_status & PCI_EXP_LNKSTA_LT) &&
 !(lnk_status & PCI_EXP_LNKSTA_DLLLA)

Double safe ?

Thanks,
Ethan

>
> Signed-off-by: Sanjay R Mehta 
> ---
>  drivers/pci/hotplug/pciehp_hpc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/hotplug/pciehp_hpc.c 
> b/drivers/pci/hotplug/pciehp_hpc.c
> index 53433b37e181..8ab2f6a2f388 100644
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -309,7 +309,8 @@ int pciehp_check_link_status(struct controller *ctrl)
>
> pcie_capability_read_word(pdev, PCI_EXP_LNKSTA, _status);
> ctrl_dbg(ctrl, "%s: lnk_status = %x\n", __func__, lnk_status);
> -   if ((lnk_status & PCI_EXP_LNKSTA_LT) ||
> +   if (((lnk_status & PCI_EXP_LNKSTA_LT) &&
> +!(lnk_status & PCI_EXP_LNKSTA_DLLLA)) ||
> !(lnk_status & PCI_EXP_LNKSTA_NLW)) {
> ctrl_err(ctrl, "link training error: status %#06x\n",
>  lnk_status);
> --
> 2.25.1
>

Re: [PATCH v9 12/15] PCI/RCEC: Add RCiEP's linked RCEC to AER/ERR

2020-10-19 Thread Ethan Zhao

On Sat, Oct 17, 2020 at 6:29 AM Bjorn Helgaas  wrote:
>
> [+cc Christoph, Ethan, Sinan, Keith; sorry should have cc'd you to
> begin with since you're looking at this code too.  Particularly
> interested in your thoughts about whether we should be touching
> PCI_ERR_ROOT_COMMAND and PCI_ERR_ROOT_STATUS when we don't own AER.]

aer_root_reset() function has a prefix  'aer_', looks like it's a
function of aer driver, will
only be called by aer driver at runtime. if so it's up to the
owner/aer to know if OSPM is
granted to init. while actually some of the functions and runtime service of
aer driver is also shared by GHES driver (running time) and DPC driver
(compiling time ?)
etc. then it is confused now.

Shall we move some of the shared functions and running time service to
pci/err.c ?
if so , just like pcie_do_recovery(), it's share by firmware_first  mode GHES
ghes_probe()
->ghes_irq_func
  ->ghes_proc
->ghes_do_proc()
  ->ghes_handle_aer()
->aer_recover_work_func()
  ->pcie_do_recovery()
->aer_root_reset()

and aer driver etc.  if aer wants to do some access might conflict
with firmware(or
firmware in embedded controller) should check _OSC_ etc first.  blindly issue
PCI_ERR_ROOT_COMMAND  or clear PCI_ERR_ROOT_STATUS *likely*
cause errors by error handling itself.

Thanks,
Ethan

>
> On Fri, Oct 16, 2020 at 03:30:37PM -0500, Bjorn Helgaas wrote:
> > [+to Jonathan]
> >
> > On Thu, Oct 15, 2020 at 05:11:10PM -0700, Sean V Kelley wrote:
> > > From: Qiuxu Zhuo 
> > >
> > > When attempting error recovery for an RCiEP associated with an RCEC 
> > > device,
> > > there needs to be a way to update the Root Error Status, the Uncorrectable
> > > Error Status and the Uncorrectable Error Severity of the parent RCEC.  In
> > > some non-native cases in which there is no OS-visible device associated
> > > with the RCiEP, there is nothing to act upon as the firmware is acting
> > > before the OS.
> > >
> > > Add handling for the linked RCEC in AER/ERR while taking into account
> > > non-native cases.
> > >
> > > Co-developed-by: Sean V Kelley 
> > > Link: 
> > > https://lore.kernel.org/r/20201002184735.1229220-12-seanvk@oregontracks.org
> > > Signed-off-by: Sean V Kelley 
> > > Signed-off-by: Qiuxu Zhuo 
> > > Signed-off-by: Bjorn Helgaas 
> > > Reviewed-by: Jonathan Cameron 
> > > ---
> > >  drivers/pci/pcie/aer.c | 53 ++
> > >  drivers/pci/pcie/err.c | 20 
> > >  2 files changed, 48 insertions(+), 25 deletions(-)
> > >
> > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > > index 65dff5f3457a..083f69b67bfd 100644
> > > --- a/drivers/pci/pcie/aer.c
> > > +++ b/drivers/pci/pcie/aer.c
> > > @@ -1357,27 +1357,50 @@ static int aer_probe(struct pcie_device *dev)
> > >   */
> > >  static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
> > >  {
> > > -   int aer = dev->aer_cap;
> > > +   int type = pci_pcie_type(dev);
> > > +   struct pci_dev *root;
> > > +   int aer = 0;
> > > +   int rc = 0;
> > > u32 reg32;
> > > -   int rc;
> > >
> > > +   if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_END)
> >
> > "type == PCI_EXP_TYPE_RC_END"
> >
> > > +   /*
> > > +* The reset should only clear the Root Error Status
> > > +* of the RCEC. Only perform this for the
> > > +* native case, i.e., an RCEC is present.
> > > +*/
> > > +   root = dev->rcec;
> > > +   else
> > > +   root = dev;
> > >
> > > -   /* Disable Root's interrupt in response to error messages */
> > > -   pci_read_config_dword(dev, aer + PCI_ERR_ROOT_COMMAND, );
> > > -   reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK;
> > > -   pci_write_config_dword(dev, aer + PCI_ERR_ROOT_COMMAND, reg32);
> > > +   if (root)
> > > +   aer = dev->aer_cap;
> > >
> > > -   rc = pci_bus_error_reset(dev);
> > > -   pci_info(dev, "Root Port link has been reset\n");
> > > +   if (aer) {
> > > +   /* Disable Root's interrupt in response to error messages */
> > > +   pci_read_config_dword(root, aer + PCI_ERR_ROOT_COMMAND, 
> > > );
> > > +   reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK;
> > > +   pci_write_config_dword(root, aer + PCI_ERR_ROOT_COMMAND, 
> > > reg32);
> >
> > Not directly related to *this* patch, but my assumption was that in
> > the APEI case, the firmware should retain ownership of the AER
> > Capability, so the OS should not touch PCI_ERR_ROOT_COMMAND and
> > PCI_ERR_ROOT_STATUS.
> >
> > But this code appears to ignore that ownership.  Jonathan, you must
> > have looked at this recently for 068c29a248b6 ("PCI/ERR: Clear PCIe
> > Device Status errors only if OS owns AER").  Do you have any insight
> > about this?
> >
> > > -   /* Clear Root Error Status */
> > > -   pci_read_config_dword(dev, aer + PCI_ERR_ROOT_STATUS, );
> > > -   pci_write_config_dword(dev, aer + PCI_ERR_ROOT_STATUS, reg32);
> > > +   /* Clear Root Error Status */
> > > +

Re: [PATCH v6 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling

2020-10-15 Thread Ethan Zhao

On Thu, Oct 15, 2020 at 1:53 PM Kuppuswamy, Sathyanarayanan
 wrote:
>
>
>
> On 10/14/20 10:05 PM, Ethan Zhao wrote:
> > On Thu, Oct 15, 2020 at 11:04 AM Kuppuswamy, Sathyanarayanan
> >  wrote:
> >>
> >>
> >>
> >> On 10/14/20 6:58 PM, Ethan Zhao wrote:
> >>> On Thu, Oct 15, 2020 at 1:06 AM Kuppuswamy, Sathyanarayanan
> >>>  wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 10/14/20 8:07 AM, Ethan Zhao wrote:
> >>>>> On Wed, Oct 14, 2020 at 5:00 PM Kuppuswamy Sathyanarayanan
> >>>>>  wrote:
> >>>>>>
> >>>>>> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> >>>>>> merged fatal and non-fatal error recovery paths, and also made
> >>>>>> recovery code depend on hotplug handler for "remove affected
> >>>>>> device + rescan" support. But this change also complicated the
> >>>>>> error recovery path and which in turn led to the following
> >>>>>> issues.
> >>>>>>
> >>>>>> 1. We depend on hotplug handler for removing the affected
> >>>>>> devices/drivers on DLLSC LINK down event (on DPC event
> >>>>>> trigger) and DPC handler for handling the error recovery. Since
> >>>>>> both handlers operate on same set of affected devices, it leads
> >>>>>> to race condition, which in turn leads to  NULL pointer
> >>>>>> exceptions or error recovery failures.You can find more details
> >>>>>> about this issue in following link.
> >>>>>>
> >>>>>> https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t
> >>>>>>
> >>>>>> 2. For non-hotplug capable devices fatal (DPC) error recovery
> >>>>>> is currently broken. Current fatal error recovery implementation
> >>>>>> relies on PCIe hotplug (pciehp) handler for detaching and
> >>>>>> re-enumerating the affected devices/drivers. So when dealing with
> >>>>>> non-hotplug capable devices, recovery code does not restore the state
> >>>>>> of the affected devices correctly. You can find more details about
> >>>>>> this issue in the following links.
> >>>>>>
> >>>>>> https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/
> >>>>>> https://lore.kernel.org/linux-pci/12115.1588207324@famine/
> >>>>>> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/
> >>>>>>
> >>>>>> In order to fix the above two issues, we should stop relying on hotplug
> >>>>>  Yes, it doesn't rely on hotplug handler to remove and rescan the 
> >>>>> device,
> >>>>> but it couldn't prevent hotplug drivers from doing another replicated
> >>>>> removal/rescanning.
> >>>>> it doesn't make sense to leave another useless removal/rescanning there.
> >>>>> Maybe that's why these two paths were merged to one and made it rely on
> >>>>> hotplug.
> >>>> No, as per PCIe spec, hotplug and DPC has no functional dependency. Hence
> >>>> depending on it to handle some of its recovery function is in-correct and
> >>>> would lead to issues in non-hotplug capable platforms (which is true
> >>>> currently).
> >>>>>
> >>
>
> >
> >>>Though pciehp is not so hot/scalable and performance critical, but 
> >>> there
> >>>is per cpu thread to handle hot-plug operation. synchronize all threads
> >>>make them walk backwards for scalability.
> >> DPC events does not happen in high frequency. So I don't think we should
> >   It's holding global lock, once malfunction happens to one device and
> > it's driver,
> > the whole system, everyone holds it, would be blocked to work.
> >> worry about the performance here. Even hotplug handler will hold this lock
> >> when adding/removing the devices. So adding/removing devices is a 
> >> serialized
> > You don't worry about performance, but if there is a requirement needs
> > more scalable
> > and reliable hotplug, the effect will be much harder. what to do then ? 
> > choo

Re: [PATCH v6 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling

2020-10-15 Thread Ethan Zhao

On Wed, Oct 14, 2020 at 5:00 PM Kuppuswamy Sathyanarayanan
 wrote:
>
> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> merged fatal and non-fatal error recovery paths, and also made
> recovery code depend on hotplug handler for "remove affected
> device + rescan" support. But this change also complicated the
> error recovery path and which in turn led to the following
> issues.
>
> 1. We depend on hotplug handler for removing the affected
> devices/drivers on DLLSC LINK down event (on DPC event
> trigger) and DPC handler for handling the error recovery. Since
> both handlers operate on same set of affected devices, it leads
> to race condition, which in turn leads to  NULL pointer
> exceptions or error recovery failures.You can find more details
> about this issue in following link.
>
> https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t
>
> 2. For non-hotplug capable devices fatal (DPC) error recovery
> is currently broken. Current fatal error recovery implementation
> relies on PCIe hotplug (pciehp) handler for detaching and
> re-enumerating the affected devices/drivers. So when dealing with
> non-hotplug capable devices, recovery code does not restore the state
> of the affected devices correctly. You can find more details about
> this issue in the following links.
>
> https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/
> https://lore.kernel.org/linux-pci/12115.1588207324@famine/
> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/
>
> In order to fix the above two issues, we should stop relying on hotplug
> handler for cleaning the affected devices/drivers and let error recovery
> handler own this functionality. So this patch reverts Commit bdb5ac85777d
> ("PCI/ERR: Handle fatal error recovery") and re-introduce the  "remove
> affected device + rescan"  functionality in fatal error recovery handler.
>
> Also holding pci_lock_rescan_remove() will prevent the race between hotplug
> and DPC handler.
>
> Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> Signed-off-by: Kuppuswamy Sathyanarayanan 
> 
> Reviewed-by: Sinan Kaya 
> ---
>  Changes since v4:
>   * Added new interfaces for error recovery (pcie_do_fatal_recovery()
> and pcie_do_nonfatal_recovery()).
>
>  Changes since v5:
>   * Fixed static/non-static declartion issue.
>
>  Documentation/PCI/pci-error-recovery.rst | 47 ++--
>  Documentation/PCI/pcieaer-howto.rst  |  2 +-
>  drivers/pci/pci.h|  5 +-
>  drivers/pci/pcie/aer.c   | 10 ++--
>  drivers/pci/pcie/dpc.c   |  2 +-
>  drivers/pci/pcie/edr.c   |  2 +-
>  drivers/pci/pcie/err.c   | 70 ++--
>  7 files changed, 94 insertions(+), 44 deletions(-)
>
> diff --git a/Documentation/PCI/pci-error-recovery.rst 
> b/Documentation/PCI/pci-error-recovery.rst
> index 84ceebb08cac..830c8af5838b 100644
> --- a/Documentation/PCI/pci-error-recovery.rst
> +++ b/Documentation/PCI/pci-error-recovery.rst
> @@ -115,7 +115,7 @@ The actual steps taken by a platform to recover from a 
> PCI error
>  event will be platform-dependent, but will follow the general
>  sequence described below.
>
> -STEP 0: Error Event
> +STEP 0: Error Event: ERR_NONFATAL
>  ---
>  A PCI bus error is detected by the PCI hardware.  On powerpc, the slot
>  is isolated, in that all I/O is blocked: all reads return 0x,
> @@ -160,10 +160,10 @@ particular, if the platform doesn't isolate slots), and 
> recovery
>  proceeds to STEP 2 (MMIO Enable).
>
>  If any driver requested a slot reset (by returning 
> PCI_ERS_RESULT_NEED_RESET),
> -then recovery proceeds to STEP 4 (Slot Reset).
> +then recovery proceeds to STEP 3 (Slot Reset).
>
>  If the platform is unable to recover the slot, the next step
> -is STEP 6 (Permanent Failure).
> +is STEP 5 (Permanent Failure).
>
>  .. note::
>
> @@ -198,7 +198,7 @@ reset or some such, but not restart operations. This 
> callback is made if
>  all drivers on a segment agree that they can try to recover and if no 
> automatic
>  link reset was performed by the HW. If the platform can't just re-enable IOs
>  without a slot reset or a link reset, it will not call this callback, and
> -instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
> +instead will have gone directly to STEP 3 (Slot Reset)
>
>  .. note::
>
> @@ -233,18 +233,12 @@ The driver should return one of the following result 
> codes:
>
>  The next step taken depends on the results returned by the drivers.
>  If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
> -proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
> +proceeds to STEP 4 (Resume Operations).
>
>  If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
> -proceeds to STEP 4 (Slot Reset)
> +proceeds to

Re: [PATCH v6 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling

2020-10-14 Thread Ethan Zhao

On Thu, Oct 15, 2020 at 11:04 AM Kuppuswamy, Sathyanarayanan
 wrote:
>
>
>
> On 10/14/20 6:58 PM, Ethan Zhao wrote:
> > On Thu, Oct 15, 2020 at 1:06 AM Kuppuswamy, Sathyanarayanan
> >  wrote:
> >>
> >>
> >>
> >> On 10/14/20 8:07 AM, Ethan Zhao wrote:
> >>> On Wed, Oct 14, 2020 at 5:00 PM Kuppuswamy Sathyanarayanan
> >>>  wrote:
> >>>>
> >>>> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> >>>> merged fatal and non-fatal error recovery paths, and also made
> >>>> recovery code depend on hotplug handler for "remove affected
> >>>> device + rescan" support. But this change also complicated the
> >>>> error recovery path and which in turn led to the following
> >>>> issues.
> >>>>
> >>>> 1. We depend on hotplug handler for removing the affected
> >>>> devices/drivers on DLLSC LINK down event (on DPC event
> >>>> trigger) and DPC handler for handling the error recovery. Since
> >>>> both handlers operate on same set of affected devices, it leads
> >>>> to race condition, which in turn leads to  NULL pointer
> >>>> exceptions or error recovery failures.You can find more details
> >>>> about this issue in following link.
> >>>>
> >>>> https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t
> >>>>
> >>>> 2. For non-hotplug capable devices fatal (DPC) error recovery
> >>>> is currently broken. Current fatal error recovery implementation
> >>>> relies on PCIe hotplug (pciehp) handler for detaching and
> >>>> re-enumerating the affected devices/drivers. So when dealing with
> >>>> non-hotplug capable devices, recovery code does not restore the state
> >>>> of the affected devices correctly. You can find more details about
> >>>> this issue in the following links.
> >>>>
> >>>> https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/
> >>>> https://lore.kernel.org/linux-pci/12115.1588207324@famine/
> >>>> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/
> >>>>
> >>>> In order to fix the above two issues, we should stop relying on hotplug
> >>> Yes, it doesn't rely on hotplug handler to remove and rescan the 
> >>> device,
> >>> but it couldn't prevent hotplug drivers from doing another replicated
> >>> removal/rescanning.
> >>> it doesn't make sense to leave another useless removal/rescanning there.
> >>> Maybe that's why these two paths were merged to one and made it rely on
> >>> hotplug.
> >> No, as per PCIe spec, hotplug and DPC has no functional dependency. Hence
> >> depending on it to handle some of its recovery function is in-correct and
> >> would lead to issues in non-hotplug capable platforms (which is true
> >> currently).
> >>>
>
> >   pci_lock_rescan_remove() is global lock for PCIe, the mal-functional
> >   device's port holds this lock, it prevents the whole system from doing
> >   hot-plug operation.
> It does not prevent the hotplug operation, but it might delay it. Since both
> DPC and hotplug operates on same set of devices, it must be synchronized.
 Think about a large system with some PCI domains, every domain has some
 ports and devices attached. why DPC and hotplug *must* operate on the
 same set of devices from different domains ? if it must be synchronized, why
 make the hotplug handlers threadable ?

> >   Though pciehp is not so hot/scalable and performance critical, but there
> >   is per cpu thread to handle hot-plug operation. synchronize all threads
> >   make them walk backwards for scalability.
> DPC events does not happen in high frequency. So I don't think we should
 It's holding global lock, once malfunction happens to one device and
it's driver,
the whole system, everyone holds it, would be blocked to work.
> worry about the performance here. Even hotplug handler will hold this lock
> when adding/removing the devices. So adding/removing devices is a serialized
You don't worry about performance, but if there is a requirement needs
more scalable
and reliable hotplug, the effect will be much harder. what to do then ? choose
another OS ?
To be honest, I don't like the global lock/ pci_lock_rescan_remove().

BTW, I didn't try the FATAL errors brute force injection on your
patch, duplicated
removal will work naturally because it was removed ?

Thanks,
Ethan
> operation.
> >
>
> >>
> >>>> --
> >>>> 2.17.1
> >>>>
> >>
> >> --
> >> Sathyanarayanan Kuppuswamy
> >> Linux Kernel Developer
>
> --
> Sathyanarayanan Kuppuswamy
> Linux Kernel Developer

Re: [PATCH v6 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling

2020-10-14 Thread Ethan Zhao

On Thu, Oct 15, 2020 at 1:06 AM Kuppuswamy, Sathyanarayanan
 wrote:
>
>
>
> On 10/14/20 8:07 AM, Ethan Zhao wrote:
> > On Wed, Oct 14, 2020 at 5:00 PM Kuppuswamy Sathyanarayanan
> >  wrote:
> >>
> >> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> >> merged fatal and non-fatal error recovery paths, and also made
> >> recovery code depend on hotplug handler for "remove affected
> >> device + rescan" support. But this change also complicated the
> >> error recovery path and which in turn led to the following
> >> issues.
> >>
> >> 1. We depend on hotplug handler for removing the affected
> >> devices/drivers on DLLSC LINK down event (on DPC event
> >> trigger) and DPC handler for handling the error recovery. Since
> >> both handlers operate on same set of affected devices, it leads
> >> to race condition, which in turn leads to  NULL pointer
> >> exceptions or error recovery failures.You can find more details
> >> about this issue in following link.
> >>
> >> https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t
> >>
> >> 2. For non-hotplug capable devices fatal (DPC) error recovery
> >> is currently broken. Current fatal error recovery implementation
> >> relies on PCIe hotplug (pciehp) handler for detaching and
> >> re-enumerating the affected devices/drivers. So when dealing with
> >> non-hotplug capable devices, recovery code does not restore the state
> >> of the affected devices correctly. You can find more details about
> >> this issue in the following links.
> >>
> >> https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/
> >> https://lore.kernel.org/linux-pci/12115.1588207324@famine/
> >> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/
> >>
> >> In order to fix the above two issues, we should stop relying on hotplug
> >Yes, it doesn't rely on hotplug handler to remove and rescan the device,
> > but it couldn't prevent hotplug drivers from doing another replicated
> > removal/rescanning.
> > it doesn't make sense to leave another useless removal/rescanning there.
> > Maybe that's why these two paths were merged to one and made it rely on
> > hotplug.
> No, as per PCIe spec, hotplug and DPC has no functional dependency. Hence
> depending on it to handle some of its recovery function is in-correct and
> would lead to issues in non-hotplug capable platforms (which is true
> currently).
> >
>
> >> +   else
> >> +   udev = dev->bus->self;
> >> +
> >> +   parent = udev->subordinate;
> >> +   pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
> >> +
> >> +pci_lock_rescan_remove();
> > Though here you have lock, but hotplug will do another
> > 'pci_stop_and_remove_bus_device()'
> > without merging it with the hotplug driver, you have no way to
> > remove the replicated actions in
> >hotplug handler.
> No, the core operation (remove/add device) is syncronzied and done in
> only one thread. Please check the following flow. Even in hotplug
 pci_lock_rescan_remove() is global lock for PCIe, the mal-functional
 device's port holds this lock, it prevents the whole system from doing
 hot-plug operation.
 Though pciehp is not so hot/scalable and performance critical, but there
 is per cpu thread to handle hot-plug operation. synchronize all threads
 make them walk backwards for scalability.

> handler, before removing the device, it attempts to hold 
> pci_lock_rescan_remove()
> lock. So holding the same lock in DPC handler will syncronize the DPC/hotplug
> handlers. Also if one of the thread (DPC or hotplug) removes/adds the 
> affected devices,
> other thread will not repeat the same action (since the device is already 
> removed/added).
>
> ->pciehp_ist()
>->pciehp_handle_presence_or_link_change()
>  ->pciehp_disable_slot()
>->__pciehp_disable_slot()
>  ->remove_board()
>->pciehp_unconfigure_device()
>  ->pci_lock_rescan_remove()
> >
> >
> >Thanks,
> >Ethan
> >> +pci_dev_get(dev);
> >> +list_for_each_entry_safe_reverse(pdev, temp, >devices,
> >> +bus_list) {
> >> +   pci_stop_and_remove_bus_device(pdev);
> >> +   }
> >> +
> >> +

Re: [PATCH v6 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling

2020-10-14 Thread Ethan Zhao

On Wed, Oct 14, 2020 at 5:00 PM Kuppuswamy Sathyanarayanan
 wrote:
>
> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> merged fatal and non-fatal error recovery paths, and also made
> recovery code depend on hotplug handler for "remove affected
> device + rescan" support. But this change also complicated the
> error recovery path and which in turn led to the following
> issues.
>
> 1. We depend on hotplug handler for removing the affected
> devices/drivers on DLLSC LINK down event (on DPC event
> trigger) and DPC handler for handling the error recovery. Since
> both handlers operate on same set of affected devices, it leads
> to race condition, which in turn leads to  NULL pointer
> exceptions or error recovery failures.You can find more details
> about this issue in following link.
>
> https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t
>
> 2. For non-hotplug capable devices fatal (DPC) error recovery
> is currently broken. Current fatal error recovery implementation
> relies on PCIe hotplug (pciehp) handler for detaching and
> re-enumerating the affected devices/drivers. So when dealing with
> non-hotplug capable devices, recovery code does not restore the state
> of the affected devices correctly. You can find more details about
> this issue in the following links.
>
> https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/
> https://lore.kernel.org/linux-pci/12115.1588207324@famine/
> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/
>
> In order to fix the above two issues, we should stop relying on hotplug
  Yes, it doesn't rely on hotplug handler to remove and rescan the device,
but it couldn't prevent hotplug drivers from doing another replicated
removal/rescanning.
it doesn't make sense to leave another useless removal/rescanning there.
Maybe that's why these two paths were merged to one and made it rely on
hotplug.

> handler for cleaning the affected devices/drivers and let error recovery
> handler own this functionality. So this patch reverts Commit bdb5ac85777d
> ("PCI/ERR: Handle fatal error recovery") and re-introduce the  "remove
> affected device + rescan"  functionality in fatal error recovery handler.
>
> Also holding pci_lock_rescan_remove() will prevent the race between hotplug
> and DPC handler.
>
> Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> Signed-off-by: Kuppuswamy Sathyanarayanan 
> 
> Reviewed-by: Sinan Kaya 
> ---
>  Changes since v4:
>   * Added new interfaces for error recovery (pcie_do_fatal_recovery()
> and pcie_do_nonfatal_recovery()).
>
>  Changes since v5:
>   * Fixed static/non-static declartion issue.
>
>  Documentation/PCI/pci-error-recovery.rst | 47 ++--
>  Documentation/PCI/pcieaer-howto.rst  |  2 +-
>  drivers/pci/pci.h|  5 +-
>  drivers/pci/pcie/aer.c   | 10 ++--
>  drivers/pci/pcie/dpc.c   |  2 +-
>  drivers/pci/pcie/edr.c   |  2 +-
>  drivers/pci/pcie/err.c   | 70 ++--
>  7 files changed, 94 insertions(+), 44 deletions(-)
>
> diff --git a/Documentation/PCI/pci-error-recovery.rst 
> b/Documentation/PCI/pci-error-recovery.rst
> index 84ceebb08cac..830c8af5838b 100644
> --- a/Documentation/PCI/pci-error-recovery.rst
> +++ b/Documentation/PCI/pci-error-recovery.rst
> @@ -115,7 +115,7 @@ The actual steps taken by a platform to recover from a 
> PCI error
>  event will be platform-dependent, but will follow the general
>  sequence described below.
>
> -STEP 0: Error Event
> +STEP 0: Error Event: ERR_NONFATAL
>  ---
>  A PCI bus error is detected by the PCI hardware.  On powerpc, the slot
>  is isolated, in that all I/O is blocked: all reads return 0x,
> @@ -160,10 +160,10 @@ particular, if the platform doesn't isolate slots), and 
> recovery
>  proceeds to STEP 2 (MMIO Enable).
>
>  If any driver requested a slot reset (by returning 
> PCI_ERS_RESULT_NEED_RESET),
> -then recovery proceeds to STEP 4 (Slot Reset).
> +then recovery proceeds to STEP 3 (Slot Reset).
>
>  If the platform is unable to recover the slot, the next step
> -is STEP 6 (Permanent Failure).
> +is STEP 5 (Permanent Failure).
>
>  .. note::
>
> @@ -198,7 +198,7 @@ reset or some such, but not restart operations. This 
> callback is made if
>  all drivers on a segment agree that they can try to recover and if no 
> automatic
>  link reset was performed by the HW. If the platform can't just re-enable IOs
>  without a slot reset or a link reset, it will not call this callback, and
> -instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
> +instead will have gone directly to STEP 3 (Slot Reset)
>
>  .. note::
>
> @@ -233,18 +233,12 @@ The driver should return one of the following result 
> codes:
>
>  The next step taken depends on the results returned by the

Re: [PATCH v4 1/2] PCI/ERR: Call pci_bus_reset() before calling ->slot_reset() callback

2020-10-14 Thread Ethan Zhao

Please fix the building issue.

drivers/pci/pcie/err.c:144:25: error: static declaration of
‘pcie_do_fatal_recovery’ follows non-static declaration
 static pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev,
 ^~
In file included from drivers/pci/pcie/err.c:21:
drivers/pci/pcie/../pci.h:560:18: note: previous declaration of
‘pcie_do_fatal_recovery’ was here
 pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev,
  ^~
drivers/pci/pcie/err.c:144:25: warning: ‘pcie_do_fatal_recovery’
defined but not used [-Wunused-function]
 static pci_ers_result_t pcie_do_fatal_recovery(struct pci_dev *dev,


Thanks,
Ethan

On Tue, Oct 13, 2020 at 10:18 PM Kuppuswamy, Sathyanarayanan
 wrote:
>
>
>
> On 10/12/20 2:05 PM, Raj, Ashok wrote:
> > On Sun, Oct 11, 2020 at 10:03:40PM -0700, 
> > sathyanarayanan.nkuppusw...@gmail.com wrote:
> >> From: Kuppuswamy Sathyanarayanan 
> >> 
> >>
> >> Currently if report_error_detected() or report_mmio_enabled()
> >> functions requests PCI_ERS_RESULT_NEED_RESET, current
> >> pcie_do_recovery() implementation does not do the requested
> >> explicit device reset, but instead just calls the
> >> report_slot_reset() on all affected devices. Notifying about the
> >> reset via report_slot_reset() without doing the actual device
> >> reset is incorrect. So call pci_bus_reset() before triggering
> >> ->slot_reset() callback.
> >>
> >> Signed-off-by: Kuppuswamy Sathyanarayanan 
> >> 
> >> ---
> >>   drivers/pci/pcie/err.c | 6 +-
> >>   1 file changed, 1 insertion(+), 5 deletions(-)
> >>
> >> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> >> index c543f419d8f9..067c58728b88 100644
> >> --- a/drivers/pci/pcie/err.c
> >> +++ b/drivers/pci/pcie/err.c
> >> @@ -181,11 +181,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> >>  }
> >>
> >>  if (status == PCI_ERS_RESULT_NEED_RESET) {
> >> -/*
> >> - * TODO: Should call platform-specific
> >> - * functions to reset slot before calling
> >> - * drivers' slot_reset callbacks?
> >> - */
> >> +pci_reset_bus(dev);
> >
> > pci_reset_bus() returns an error, do you need to consult that before
> > unconditionally setting PCI_ERS_RESULT_RECOVERED?
> Good point. I will fix this in next version.
> >
> >>  status = PCI_ERS_RESULT_RECOVERED;
> >>  pci_dbg(dev, "broadcast slot_reset message\n");
> >>  pci_walk_bus(bus, report_slot_reset, );
> >> --
> >> 2.17.1
> >>
>
> --
> Sathyanarayanan Kuppuswamy
> Linux Kernel Developer

Re: [PATCH v4 2/2] PCI/ERR: Split the fatal and non-fatal error recovery handling

2020-10-14 Thread Ethan Zhao

On Mon, Oct 12, 2020 at 1:10 PM  wrote:
>
> From: Kuppuswamy Sathyanarayanan 
>
> Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> merged fatal and non-fatal error recovery paths, and also made
> recovery code depend on hotplug handler for "remove affected
> device + rescan" support. But this change also complicated the
> error recovery path and which in turn led to the following
> issues.
>
> 1. We depend on hotplug handler for removing the affected
> devices/drivers on DLLSC LINK down event (on DPC event
> trigger) and DPC handler for handling the error recovery. Since
> both handlers operate on same set of affected devices, it leads
> to race condition, which in turn leads to  NULL pointer
> exceptions or error recovery failures.You can find more details
> about this issue in following link.
>
> https://lore.kernel.org/linux-pci/20201007113158.48933-1-haifeng.z...@intel.com/T/#t
>
> 2. For non-hotplug capable devices fatal (DPC) error recovery
> is currently broken. Current fatal error recovery implementation
> relies on PCIe hotplug (pciehp) handler for detaching and
> re-enumerating the affected devices/drivers. So when dealing with
> non-hotplug capable devices, recovery code does not restore the state
> of the affected devices correctly. You can find more details about
> this issue in the following links.
>
> https://lore.kernel.org/linux-pci/20200527083130.4137-1-zhiqiang@nxp.com/
> https://lore.kernel.org/linux-pci/12115.1588207324@famine/
> https://lore.kernel.org/linux-pci/0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.158584.git.sathyanarayanan.kuppusw...@linux.intel.com/
>
> In order to fix the above two issues, we should stop relying on hotplug
> handler for cleaning the affected devices/drivers and let error recovery
> handler own this functionality. So this patch reverts Commit bdb5ac85777d
> ("PCI/ERR: Handle fatal error recovery") and re-introduce the  "remove
> affected device + rescan"  functionality in fatal error recovery handler.

This patch only reverts the commit bdb5ac85777d ?
or you'd better separate the revert and code you added.

Thanks,
Ethan

>
> Also holding pci_lock_rescan_remove() will prevent the race between hotplug
> and DPC handler.
>
> Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
> Signed-off-by: Kuppuswamy Sathyanarayanan 
> 
> ---
>  Documentation/PCI/pci-error-recovery.rst | 47 ++--
>  drivers/pci/pcie/err.c   | 71 +++-
>  2 files changed, 87 insertions(+), 31 deletions(-)
>
> diff --git a/Documentation/PCI/pci-error-recovery.rst 
> b/Documentation/PCI/pci-error-recovery.rst
> index 84ceebb08cac..830c8af5838b 100644
> --- a/Documentation/PCI/pci-error-recovery.rst
> +++ b/Documentation/PCI/pci-error-recovery.rst
> @@ -115,7 +115,7 @@ The actual steps taken by a platform to recover from a 
> PCI error
>  event will be platform-dependent, but will follow the general
>  sequence described below.
>
> -STEP 0: Error Event
> +STEP 0: Error Event: ERR_NONFATAL
>  ---
>  A PCI bus error is detected by the PCI hardware.  On powerpc, the slot
>  is isolated, in that all I/O is blocked: all reads return 0x,
> @@ -160,10 +160,10 @@ particular, if the platform doesn't isolate slots), and 
> recovery
>  proceeds to STEP 2 (MMIO Enable).
>
>  If any driver requested a slot reset (by returning 
> PCI_ERS_RESULT_NEED_RESET),
> -then recovery proceeds to STEP 4 (Slot Reset).
> +then recovery proceeds to STEP 3 (Slot Reset).
>
>  If the platform is unable to recover the slot, the next step
> -is STEP 6 (Permanent Failure).
> +is STEP 5 (Permanent Failure).
>
>  .. note::
>
> @@ -198,7 +198,7 @@ reset or some such, but not restart operations. This 
> callback is made if
>  all drivers on a segment agree that they can try to recover and if no 
> automatic
>  link reset was performed by the HW. If the platform can't just re-enable IOs
>  without a slot reset or a link reset, it will not call this callback, and
> -instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
> +instead will have gone directly to STEP 3 (Slot Reset)
>
>  .. note::
>
> @@ -233,18 +233,12 @@ The driver should return one of the following result 
> codes:
>
>  The next step taken depends on the results returned by the drivers.
>  If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
> -proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
> +proceeds to STEP 4 (Resume Operations).
>
>  If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
> -proceeds to STEP 4 (Slot Reset)
> +proceeds to STEP 3 (Slot Reset)
>
> -STEP 3: Link Reset
> ---
> -The platform resets the link.  This is a PCI-Express specific step
> -and is done whenever a fatal error has been detected that can be
> -"solved" by resetting the link.
> -
> -STEP 4: Slot Reset
> +STEP 3: Slot Reset
>  --
>
>  In response to a return value of

Re: [PATCH v8 2/6] PCI/DPC: define a function to check and wait till port finish DPC handling

2020-10-08 Thread Ethan Zhao

On Thu, Oct 8, 2020 at 2:16 AM Kuppuswamy, Sathyanarayanan
 wrote:
>
>
> On 10/7/20 4:31 AM, Ethan Zhao wrote:
> > Once root port DPC capability is enabled and triggered, at the beginning
> > of DPC is triggered, the DPC status bits are set by hardware and then
> > sends DPC/DLLSC/PDC interrupts to OS DPC and pciehp drivers, it will
> > take the port and software DPC interrupt handler 10ms to 50ms (test data
> > on ICS(Ice Lake SP platform, see
> > https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)
> > & stable 5.9-rc6) to complete the DPC containment procedure
> This data is based on one particular architecture. So using this
> to create a timed loop in pci_wait_port_outdpc() looks incorrect.

To clarify there, it is not random to wait for specific 1000ms for
specific architecture.
Though there is no specification to say how many ms totally cost by
hardware DPC containment,
plus its interrupt handling.  but you could read the whole PCIe
specification to know how
many ms cost at most by power state transition, link training etc, you
may know the max single delay
in hardware state transition is 200ms. Other small delay in hardware
state transition is 100ms,
48ms, 32ms etc.

If the DPC containment hardware procedure is pure resetting (or cold
power on) without
software access configuration as the worst case. we wait its handling
process from 10ms
(actually 20ms is the minimum delay we could do with msleep() )  till
1000ms timeout is a
reasonable value.

Thanks,
Ethan




>
> I still recommend looking for some locking model to fix this
> issue (may be atomic state flag or lock).
> > till the DPC status is cleared at the end of the DPC interrupt handler.
> >
> > We use this function to check if the root port is in DPC handling status
> > and wait till the hardware and software completed the procedure.
> >
> > Signed-off-by: Ethan Zhao 
> > Tested-by: Wen Jin 
> > Tested-by: Shanshan Zhang 
> > ---
> > changes:
> >   v2：align ICS code name to public doc.
> >   v3: no change.
> >   v4: response to Christoph's (Christoph Hellwig )
> >   tip, move pci_wait_port_outdpc() to DPC driver and its declaration
> >   to pci.h.
> >   v5: fix building issue reported by l...@intel.com with some config.
> >   v6: move from [1/5] to [2/5].
> >   v7: no change.
> >   v8: no change.
> >
> >   drivers/pci/pci.h  |  2 ++
> >   drivers/pci/pcie/dpc.c | 27 +++
> >   2 files changed, 29 insertions(+)
> >
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index fa12f7cbc1a0..455b32187abd 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -455,10 +455,12 @@ void pci_restore_dpc_state(struct pci_dev *dev);
> >   void pci_dpc_init(struct pci_dev *pdev);
> >   void dpc_process_error(struct pci_dev *pdev);
> >   pci_ers_result_t dpc_reset_link(struct pci_dev *pdev);
> > +bool pci_wait_port_outdpc(struct pci_dev *pdev);
> >   #else
> >   static inline void pci_save_dpc_state(struct pci_dev *dev) {}
> >   static inline void pci_restore_dpc_state(struct pci_dev *dev) {}
> >   static inline void pci_dpc_init(struct pci_dev *pdev) {}
> > +static inline bool pci_wait_port_outdpc(struct pci_dev *pdev) { return 
> > false; }
> >   #endif
> >
> >   #ifdef CONFIG_PCI_ATS
> > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> > index daa9a4153776..2e0e091ce923 100644
> > --- a/drivers/pci/pcie/dpc.c
> > +++ b/drivers/pci/pcie/dpc.c
> > @@ -71,6 +71,33 @@ void pci_restore_dpc_state(struct pci_dev *dev)
> >   pci_write_config_word(dev, dev->dpc_cap + PCI_EXP_DPC_CTL, *cap);
> >   }
> >
> > +bool pci_wait_port_outdpc(struct pci_dev *pdev)
> > +{
> > + u16 cap = pdev->dpc_cap, status;
> > + u16 loop = 0;
> > +
> > + if (!cap) {
> > + pci_WARN_ONCE(pdev, !cap, "No DPC capability initiated\n");
> > + return false;
> > + }
> > + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
> > + pci_dbg(pdev, "DPC status %x, cap %x\n", status, cap);
> > +
> > + while (status & PCI_EXP_DPC_STATUS_TRIGGER && loop < 100) {
> > + msleep(10);
> > + loop++;
> > + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
> > + }
> > +
> > + if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) {
> > + pci_dbg(pdev, "Out of DPC %x, cost %d ms\n", status, loop*10);
> > + return true;
> > + }
> > +
> > + pci_dbg(pdev, "Timeout to wait port out of DPC status\n");
> > + return false;
> > +}
> > +
> >   static int dpc_wait_rp_inactive(struct pci_dev *pdev)
> >   {
> >   unsigned long timeout = jiffies + HZ;
>
> --
> Sathyanarayanan Kuppuswamy
> Linux Kernel Developer
>

Re: [PATCH v8 2/6] PCI/DPC: define a function to check and wait till port finish DPC handling

2020-10-07 Thread Ethan Zhao

On Thu, Oct 8, 2020 at 2:16 AM Kuppuswamy, Sathyanarayanan
 wrote:
>
>
> On 10/7/20 4:31 AM, Ethan Zhao wrote:
> > Once root port DPC capability is enabled and triggered, at the beginning
> > of DPC is triggered, the DPC status bits are set by hardware and then
> > sends DPC/DLLSC/PDC interrupts to OS DPC and pciehp drivers, it will
> > take the port and software DPC interrupt handler 10ms to 50ms (test data
> > on ICS(Ice Lake SP platform, see
> > https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)
> > & stable 5.9-rc6) to complete the DPC containment procedure
> This data is based on one particular architecture. So using this
> to create a timed loop in pci_wait_port_outdpc() looks incorrect.
>
> I still recommend looking for some locking model to fix this
> issue (may be atomic state flag or lock).
 It is actually a device semaphore. DLLSC/PDC handler needs to wait
for the critical
area is clear to enter by monitoring the DPC triggered status is
cleaned or not.

 Another  problem is,  DPC reset/interrupt is initiated by hardware,
you couldn't place
 a software lock between interrupt handler and device resetting.

While device semaphore--- DPC triggered status is the right one to wait for.

Better idea ?

Thanks,
Ethan

> > till the DPC status is cleared at the end of the DPC interrupt handler.
> >
> > We use this function to check if the root port is in DPC handling status
> > and wait till the hardware and software completed the procedure.
> >
> > Signed-off-by: Ethan Zhao 
> > Tested-by: Wen Jin 
> > Tested-by: Shanshan Zhang 
> > ---
> > changes:
> >   v2：align ICS code name to public doc.
> >   v3: no change.
> >   v4: response to Christoph's (Christoph Hellwig )
> >   tip, move pci_wait_port_outdpc() to DPC driver and its declaration
> >   to pci.h.
> >   v5: fix building issue reported by l...@intel.com with some config.
> >   v6: move from [1/5] to [2/5].
> >   v7: no change.
> >   v8: no change.
> >
> >   drivers/pci/pci.h  |  2 ++
> >   drivers/pci/pcie/dpc.c | 27 +++
> >   2 files changed, 29 insertions(+)
> >
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index fa12f7cbc1a0..455b32187abd 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -455,10 +455,12 @@ void pci_restore_dpc_state(struct pci_dev *dev);
> >   void pci_dpc_init(struct pci_dev *pdev);
> >   void dpc_process_error(struct pci_dev *pdev);
> >   pci_ers_result_t dpc_reset_link(struct pci_dev *pdev);
> > +bool pci_wait_port_outdpc(struct pci_dev *pdev);
> >   #else
> >   static inline void pci_save_dpc_state(struct pci_dev *dev) {}
> >   static inline void pci_restore_dpc_state(struct pci_dev *dev) {}
> >   static inline void pci_dpc_init(struct pci_dev *pdev) {}
> > +static inline bool pci_wait_port_outdpc(struct pci_dev *pdev) { return 
> > false; }
> >   #endif
> >
> >   #ifdef CONFIG_PCI_ATS
> > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> > index daa9a4153776..2e0e091ce923 100644
> > --- a/drivers/pci/pcie/dpc.c
> > +++ b/drivers/pci/pcie/dpc.c
> > @@ -71,6 +71,33 @@ void pci_restore_dpc_state(struct pci_dev *dev)
> >   pci_write_config_word(dev, dev->dpc_cap + PCI_EXP_DPC_CTL, *cap);
> >   }
> >
> > +bool pci_wait_port_outdpc(struct pci_dev *pdev)
> > +{
> > + u16 cap = pdev->dpc_cap, status;
> > + u16 loop = 0;
> > +
> > + if (!cap) {
> > + pci_WARN_ONCE(pdev, !cap, "No DPC capability initiated\n");
> > + return false;
> > + }
> > + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
> > + pci_dbg(pdev, "DPC status %x, cap %x\n", status, cap);
> > +
> > + while (status & PCI_EXP_DPC_STATUS_TRIGGER && loop < 100) {
> > + msleep(10);
> > + loop++;
> > + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
> > + }
> > +
> > + if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) {
> > + pci_dbg(pdev, "Out of DPC %x, cost %d ms\n", status, loop*10);
> > + return true;
> > + }
> > +
> > + pci_dbg(pdev, "Timeout to wait port out of DPC status\n");
> > + return false;
> > +}
> > +
> >   static int dpc_wait_rp_inactive(struct pci_dev *pdev)
> >   {
> >   unsigned long timeout = jiffies + HZ;
>
> --
> Sathyanarayanan Kuppuswamy
> Linux Kernel Developer
>

Re: [PATCH v8 1/6] PCI/ERR: get device before call device driver to avoid NULL pointer dereference

2020-10-07 Thread Ethan Zhao

On Thu, Oct 8, 2020 at 1:24 AM Kuppuswamy, Sathyanarayanan
 wrote:
>
>
> On 10/7/20 4:31 AM, Ethan Zhao wrote:
> > During DPC error injection test we found there is race condition between
> > pciehp and DPC driver, NULL pointer dereference caused panic as following
> >
> >   # setpci -s 64:02.0 0x196.w=000a
> >// 64:02.0 is rootport has DPC capability
> >   # setpci -s 65:00.0 0x04.w=0544
> >// 65:00.0 is NVMe SSD populated in above port
> >   # mount /dev/nvme0n1p1 nvme
> >
> >   (tested on stable 5.8 & ICS(Ice Lake SP platform, see
> >   https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server))
> >
> >   BUG: kernel NULL pointer dereference, address: 0050
> >   ...
> >   CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 
> > 5.8.0-0.0.7.el8.x86_64+ #1
> >   RIP: 0010:report_error_detected.cold.4+0x7d/0xe6
> >   Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3 
> > 65 ff
> >   ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50 50 48 83 
> > 3a 00
> >   41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0
> >   RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246
> >   RAX:  RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01
> >   RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138
> >   RBP: ff8e06cf8762fdd0 R08: 00bf R09: 
> >   R10: 00eb8ebeab53 R11: 93453258 R12: 0002
> >   R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828
> >   FS:  () GS:ff4e3eab1fd0() 
> > knlGS:
> >   CS:  0010 DS:  ES:  CR0: 80050033
> >   CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0
> >   DR0:  DR1:  DR2: 
> >   DR3:  DR6: fffe0ff0 DR7: 0400
> >   PKRU: 5554
> >   Call Trace:
> >   ? report_normal_detected+0x20/0x20
> >   report_frozen_detected+0x16/0x20
> >   pci_walk_bus+0x75/0x90
> >   ? dpc_irq+0x90/0x90
> >   pcie_do_recovery+0x157/0x201
> >   ? irq_finalize_oneshot.part.47+0xe0/0xe0
> >   dpc_handler+0x29/0x40
> >   irq_thread_fn+0x24/0x60
> >   ...
> >
> > Debug shows when port DPC feature was enabled and triggered by errors,
> > DLLSC/PDC/DPC interrupts will be sent to pciehp and DPC driver almost
> > at the same time, and no delay between them is required by specification.
> > so DPC driver and pciehp drivers may handle these interrupts cocurrently.
> >
> > While DPC driver is doing pci_walk_bus() and calling device driver's
> > callback without pci_dev_get() to increase device reference count, the
> > device and its driver instance are likely being freed by
> > pci_stop_and_removed_bus_device()
> > -> pci_dev_put().
> >
> > So does pci_dev_get() before using the device instance to avoid NULL
> > pointer dereference.
> Won't it be better if you get this in pcie_do_recovery()?

Don't think so, just like lock, we should keep the scope with lock
protected as small as possible.
Locking a big area unnecessarily isn't acceptable.

Thanks,
Ethan
> >
> > Signed-off-by: Ethan Zhao 
> > Tested-by: Wen Jin 
> > Tested-by: Shanshan Zhang 
> > ---
> > Changes:
> >   v2: revise doc according to Andy's suggestion.
> >   v3: no change.
> >   v4: no change.
> >   v5: no change.
> >   v6: moved to [1/5] from [3/5] and revised comment according to Lukas'
> >   suggestion.
> >   v7: no change.
> >   v8: no change.
> >
> >   drivers/pci/pcie/err.c | 12 
> >   1 file changed, 12 insertions(+)
> >
> > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> > index c543f419d8f9..e35c4480c86b 100644
> > --- a/drivers/pci/pcie/err.c
> > +++ b/drivers/pci/pcie/err.c
> > @@ -52,6 +52,8 @@ static int report_error_detected(struct pci_dev *dev,
> >   pci_ers_result_t vote;
> >   const struct pci_error_handlers *err_handler;
> >
> > + if (!pci_dev_get(dev))
> > + return 0;
> >   device_lock(>dev);
> >   if (!pci_dev_set_io_state(dev, state) ||
> >   !dev->driver ||
> > @@ -76,6 +78,7 @@ static int report_error_detected(struct pci_dev *dev,
> >   pci_uevent_ers(dev, vote);
> >   *result = merge_result(*result, vote);
> >   device_unlock(>dev);
> > + pci_dev_put(dev);
> >   return 0;
> >   }
> >
> > @@ -94,6 +97,8 @@ static int report_mmio_enabl

[PATCH v8 6/6] PCI/ERR: don't mix io state not changed and no driver together

2020-10-07 Thread Ethan Zhao

When we see 'can't recover (no error_detected callback)' on console,
Maybe the reason is io state is not changed by calling
pci_dev_set_io_state(), that is confused. fix it.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
---
Chagnes:
 v2: no change.
 v3: no change.
 v4: no change.
 v5: no change.
 v6: no change.
 v7: change debug output information.
 v8: no change.

 drivers/pci/pcie/err.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index e35c4480c86b..2ca2723f3b34 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -55,8 +55,10 @@ static int report_error_detected(struct pci_dev *dev,
if (!pci_dev_get(dev))
return 0;
device_lock(>dev);
-   if (!pci_dev_set_io_state(dev, state) ||
-   !dev->driver ||
+   if (!pci_dev_set_io_state(dev, state)) {
+   pci_dbg(dev, "Device was in that state or not allowed 
setting.\n");
+   vote = PCI_ERS_RESULT_NONE;
+   } else if (!dev->driver ||
!dev->driver->err_handler ||
!dev->driver->err_handler->error_detected) {
/*
-- 
2.18.4

[PATCH v8 0/6] Fix DPC hotplug race and enhance error handling

2020-10-07 Thread Ethan Zhao

Hi,folks,

This simple patch set fixed some serious security issues found when DPC
error injection and NVMe SSD hotplug brute force test were doing -- race
condition between DPC handler and pciehp, AER interrupt handlers, caused
system hang and system with DPC feature couldn't recover to normal
working state as expected (NVMe instance lost, mount operation hang,
race PCIe access caused uncorrectable errors reported alternatively etc).

The fundamental premise is that when due to error conditions (NON-FATAL/
FATAL) when events are processed by both DPC handler and hotplug handling
of DLLSC/PDC both operating on the same device object ends up with crashes 
(from  Ashok).

Debug shows when port DPC feature was enabled and triggered by errors,
DLLSC/PDC/DPC interrupts will be sent to pciehp and DPC driver almost
at the same time, and no delay between them is required by specification.
so DPC driver and pciehp drivers may handle these interrupts cocurrently,
thus introduces the possibility of race condition, other details see every
commit description part.

With this patch set applied, stable 5.9-rc6 on ICS (Ice Lake SP platform,
see
https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server))

could pass the PCIe Gen4 NVMe SSD brute force hotplug test with any time
interval between hot-remove and plug-in operation tens of times without
any errors occur and system works normal.

With this patch set applied, system with DPC feature could recover from
NON-FATAL and FATAL errors injection test and works as expected.

System works smoothly when errors happen while hotplug is doing, no
uncorrectable errors found.

Brute DPC error injection script:

for i in {0..100}
do
setpci -s 64:02.0 0x196.w=000a
setpci -s 65:00.0 0x04.w=0544
mount /dev/nvme0n1p1 /root/nvme
sleep 1
done

This patch set could be applied to stable 5.9-rc6/rc7/rc8 directly.

Help to review and test.

v2: changed according to review by Andy Shevchenko.
v3: changed patch 4/5 to simpler coding.
v4: move function pci_wait_port_outdpc() to DPC driver and its
   declaration to pci.h. (tip from Christoph Hellwig ).
v5: fix building issue reported by l...@intel.com with some config.
v6: move patch[3/5] as the first patch according to Lukas's suggestion.
and rewrite the comment part of patch[3/5].
v7: change the patch[4/5], based on Bjorn's code and truth table.
change the patch[5/5] about the debug output information.
v8: according Bjorn's suggestion, put the pci_dev_set_io_state()
simplification but no function code in one patch.(almost copy of
Bjorn's code and truth table, understood).
patch 5/6 re-based the function change code of pci_dev_set_io_state().
per Ashok's request, add more description to this cover-letter part.


Thanks,
Ethan


Ethan Zhao (6):
  PCI/ERR: get device before call device driver to avoid NULL pointer
dereference
  PCI/DPC: define a function to check and wait till port finish DPC
handling
  PCI: pciehp: check and wait port status out of DPC before handling
DLLSC and PDC
  PCI/ERR: simplify function pci_dev_set_io_state() with if
  PCI/ERR: only return true when dev io state is really changed
  PCI/ERR: don't mix io state not changed and no driver together

 drivers/pci/hotplug/pciehp_hpc.c |  4 ++-
 drivers/pci/pci.h| 55 +---
 drivers/pci/pcie/dpc.c   | 27 
 drivers/pci/pcie/err.c   | 18 +--
 4 files changed, 69 insertions(+), 35 deletions(-)


base-commit: 549738f15da0e5a00275977623be199fbbf7df50
-- 
2.18.4

[PATCH v8 5/6] PCI/ERR: only return true when dev io state is really changed

2020-10-07 Thread Ethan Zhao

When uncorrectable error happens, AER driver and DPC driver interrupt
handlers likely call

   pcie_do_recovery()
   ->pci_walk_bus()
 ->report_frozen_detected()

with pci_channel_io_frozen the same time.
   If pci_dev_set_io_state() return true even if the original state is
pci_channel_io_frozen, that will cause AER or DPC handler re-enter
the error detecting and recovery procedure one after another.
   The result is the recovery flow mixed between AER and DPC.
So change the pci_dev_set_io_state() function to only return true
when dev->error_state is really changed.

Signed-off-by: Ethan Zhao 
---
Changnes:
 v2: revise description and code according to suggestion from Andy.
 v3: change code to simpler.
 v4: no change.
 v5: no change.
 v6: no change.
 v7: changed based on Bjorn's code and truth table.
 v8: according to Bjorn's suggestion, rebase on another simplification
 patch.   

 drivers/pci/pci.h | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index bceb3f108744..a11e0f9d9bdf 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -371,17 +371,14 @@ static inline bool pci_dev_set_io_state(struct pci_dev 
*dev,
  * perm_failure  |  perm_failure*  perm_failure*  perm_failure
  */
 
-   /* Can always put a device in perm_failure state */
-   if (new == pci_channel_io_perm_failure) {
-   dev->error_state = pci_channel_io_perm_failure;
-   return true;
-   }
-
-   /* If already in perm_failure, can't set to normal or frozen */
+   /* If already in perm_failure, can't change it's state */
if (dev->error_state == pci_channel_io_perm_failure)
return false;
+   /* not change at all */
+   else if (dev->error_state == new)
+   return false;
 
-/* Can always change normal to frozen or vice versa */
+/* Can always change from normal/frozen to other different state */
dev->error_state = new;
return true;
 }
-- 
2.18.4

[PATCH v8 4/6] PCI/ERR: simplify function pci_dev_set_io_state() with if

2020-10-07 Thread Ethan Zhao

No function change.

Signed-off-by: Ethan Zhao 
---
Changes:
 v8: based on Bjorn's code and truth table, simplify the logic of
function pci_dev_set_io_state(), no function change.

 drivers/pci/pci.h | 54 ---
 1 file changed, 23 insertions(+), 31 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 455b32187abd..bceb3f108744 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -359,39 +359,31 @@ struct pci_sriov {
 static inline bool pci_dev_set_io_state(struct pci_dev *dev,
pci_channel_state_t new)
 {
-   bool changed = false;
-
device_lock_assert(>dev);
-   switch (new) {
-   case pci_channel_io_perm_failure:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   case pci_channel_io_perm_failure:
-   changed = true;
-   break;
-   }
-   break;
-   case pci_channel_io_frozen:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   changed = true;
-   break;
-   }
-   break;
-   case pci_channel_io_normal:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   changed = true;
-   break;
-   }
-   break;
+/*
+ * Truth table:
+ * requested new state
+ * current  --
+ * statenormal frozen perm_failure
+ *   +  -  -  
+ * normal|  normal frozen perm_failure
+ * frozen|  normal frozen perm_failure
+ * perm_failure  |  perm_failure*  perm_failure*  perm_failure
+ */
+
+   /* Can always put a device in perm_failure state */
+   if (new == pci_channel_io_perm_failure) {
+   dev->error_state = pci_channel_io_perm_failure;
+   return true;
}
-   if (changed)
-   dev->error_state = new;
-   return changed;
+
+   /* If already in perm_failure, can't set to normal or frozen */
+   if (dev->error_state == pci_channel_io_perm_failure)
+   return false;
+
+/* Can always change normal to frozen or vice versa */
+   dev->error_state = new;
+   return true;
 }
 
 static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
-- 
2.18.4

[PATCH v8 1/6] PCI/ERR: get device before call device driver to avoid NULL pointer dereference

2020-10-07 Thread Ethan Zhao

During DPC error injection test we found there is race condition between
pciehp and DPC driver, NULL pointer dereference caused panic as following

 # setpci -s 64:02.0 0x196.w=000a
  // 64:02.0 is rootport has DPC capability
 # setpci -s 65:00.0 0x04.w=0544
  // 65:00.0 is NVMe SSD populated in above port
 # mount /dev/nvme0n1p1 nvme

 (tested on stable 5.8 & ICS(Ice Lake SP platform, see
 https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server))

 BUG: kernel NULL pointer dereference, address: 0050
 ...
 CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0-0.0.7.el8.x86_64+ #1
 RIP: 0010:report_error_detected.cold.4+0x7d/0xe6
 Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3 65 ff
 ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50 50 48 83 3a 00
 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0
 RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246
 RAX:  RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01
 RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138
 RBP: ff8e06cf8762fdd0 R08: 00bf R09: 
 R10: 00eb8ebeab53 R11: 93453258 R12: 0002
 R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828
 FS:  () GS:ff4e3eab1fd0() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0
 DR0:  DR1:  DR2: 
 DR3:  DR6: fffe0ff0 DR7: 0400
 PKRU: 5554
 Call Trace:
 ? report_normal_detected+0x20/0x20
 report_frozen_detected+0x16/0x20
 pci_walk_bus+0x75/0x90
 ? dpc_irq+0x90/0x90
 pcie_do_recovery+0x157/0x201
 ? irq_finalize_oneshot.part.47+0xe0/0xe0
 dpc_handler+0x29/0x40
 irq_thread_fn+0x24/0x60
 ...

Debug shows when port DPC feature was enabled and triggered by errors,
DLLSC/PDC/DPC interrupts will be sent to pciehp and DPC driver almost
at the same time, and no delay between them is required by specification.
so DPC driver and pciehp drivers may handle these interrupts cocurrently.

While DPC driver is doing pci_walk_bus() and calling device driver's 
callback without pci_dev_get() to increase device reference count, the
device and its driver instance are likely being freed by
pci_stop_and_removed_bus_device()
-> pci_dev_put().

So does pci_dev_get() before using the device instance to avoid NULL
pointer dereference.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
---
Changes:
 v2: revise doc according to Andy's suggestion.
 v3: no change.
 v4: no change.
 v5: no change.
 v6: moved to [1/5] from [3/5] and revised comment according to Lukas'
 suggestion.
 v7: no change.
 v8: no change.

 drivers/pci/pcie/err.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index c543f419d8f9..e35c4480c86b 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -52,6 +52,8 @@ static int report_error_detected(struct pci_dev *dev,
pci_ers_result_t vote;
const struct pci_error_handlers *err_handler;
 
+   if (!pci_dev_get(dev))
+   return 0;
device_lock(>dev);
if (!pci_dev_set_io_state(dev, state) ||
!dev->driver ||
@@ -76,6 +78,7 @@ static int report_error_detected(struct pci_dev *dev,
pci_uevent_ers(dev, vote);
*result = merge_result(*result, vote);
device_unlock(>dev);
+   pci_dev_put(dev);
return 0;
 }
 
@@ -94,6 +97,8 @@ static int report_mmio_enabled(struct pci_dev *dev, void 
*data)
pci_ers_result_t vote, *result = data;
const struct pci_error_handlers *err_handler;
 
+   if (!pci_dev_get(dev))
+   return 0;
device_lock(>dev);
if (!dev->driver ||
!dev->driver->err_handler ||
@@ -105,6 +110,7 @@ static int report_mmio_enabled(struct pci_dev *dev, void 
*data)
*result = merge_result(*result, vote);
 out:
device_unlock(>dev);
+   pci_dev_put(dev);
return 0;
 }
 
@@ -113,6 +119,8 @@ static int report_slot_reset(struct pci_dev *dev, void 
*data)
pci_ers_result_t vote, *result = data;
const struct pci_error_handlers *err_handler;
 
+   if (!pci_dev_get(dev))
+   return 0;
device_lock(>dev);
if (!dev->driver ||
!dev->driver->err_handler ||
@@ -124,6 +132,7 @@ static int report_slot_reset(struct pci_dev *dev, void 
*data)
*result = merge_result(*result, vote);
 out:
device_unlock(>dev);
+   pci_dev_put(dev);
return 0;
 }
 
@@ -131,6 +140,8 @@ static int report_resume(struct pci_dev *dev, void *data)
 {
const struct pci_error_handlers *err_handler;
 
+   if (!pci_dev_get(dev))
+   return 0;
device_lock(>dev);

[PATCH v8 2/6] PCI/DPC: define a function to check and wait till port finish DPC handling

2020-10-07 Thread Ethan Zhao

Once root port DPC capability is enabled and triggered, at the beginning
of DPC is triggered, the DPC status bits are set by hardware and then
sends DPC/DLLSC/PDC interrupts to OS DPC and pciehp drivers, it will
take the port and software DPC interrupt handler 10ms to 50ms (test data
on ICS(Ice Lake SP platform, see
https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)
& stable 5.9-rc6) to complete the DPC containment procedure
till the DPC status is cleared at the end of the DPC interrupt handler.

We use this function to check if the root port is in DPC handling status
and wait till the hardware and software completed the procedure.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
---
changes:
 v2???align ICS code name to public doc.
 v3: no change.
 v4: response to Christoph's (Christoph Hellwig )
 tip, move pci_wait_port_outdpc() to DPC driver and its declaration
 to pci.h.
 v5: fix building issue reported by l...@intel.com with some config.
 v6: move from [1/5] to [2/5].
 v7: no change.
 v8: no change.

 drivers/pci/pci.h  |  2 ++
 drivers/pci/pcie/dpc.c | 27 +++
 2 files changed, 29 insertions(+)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fa12f7cbc1a0..455b32187abd 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -455,10 +455,12 @@ void pci_restore_dpc_state(struct pci_dev *dev);
 void pci_dpc_init(struct pci_dev *pdev);
 void dpc_process_error(struct pci_dev *pdev);
 pci_ers_result_t dpc_reset_link(struct pci_dev *pdev);
+bool pci_wait_port_outdpc(struct pci_dev *pdev);
 #else
 static inline void pci_save_dpc_state(struct pci_dev *dev) {}
 static inline void pci_restore_dpc_state(struct pci_dev *dev) {}
 static inline void pci_dpc_init(struct pci_dev *pdev) {}
+static inline bool pci_wait_port_outdpc(struct pci_dev *pdev) { return false; }
 #endif
 
 #ifdef CONFIG_PCI_ATS
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index daa9a4153776..2e0e091ce923 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -71,6 +71,33 @@ void pci_restore_dpc_state(struct pci_dev *dev)
pci_write_config_word(dev, dev->dpc_cap + PCI_EXP_DPC_CTL, *cap);
 }
 
+bool pci_wait_port_outdpc(struct pci_dev *pdev)
+{
+   u16 cap = pdev->dpc_cap, status;
+   u16 loop = 0;
+
+   if (!cap) {
+   pci_WARN_ONCE(pdev, !cap, "No DPC capability initiated\n");
+   return false;
+   }
+   pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
+   pci_dbg(pdev, "DPC status %x, cap %x\n", status, cap);
+
+   while (status & PCI_EXP_DPC_STATUS_TRIGGER && loop < 100) {
+   msleep(10);
+   loop++;
+   pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
+   }
+
+   if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) {
+   pci_dbg(pdev, "Out of DPC %x, cost %d ms\n", status, loop*10);
+   return true;
+   }
+
+   pci_dbg(pdev, "Timeout to wait port out of DPC status\n");
+   return false;
+}
+
 static int dpc_wait_rp_inactive(struct pci_dev *pdev)
 {
unsigned long timeout = jiffies + HZ;
-- 
2.18.4

[PATCH v8 3/6] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC

2020-10-07 Thread Ethan Zhao

When root port has DPC capability and it is enabled, then triggered by
errors, DPC DLLSC and PDC etc interrupts will be sent to DPC driver, 
pciehp drivers almost at the same time.
Thus will cause following messed and confused errors handling/recovery/
removal/plugin procedure.

1. Port and device are in error recovery resetting initiated by DPC
   hardware, pciehp driver treats them as device is doing hot-remove or 
   hot-plugin the same time.

2. While DPC handler calling device driver->err_handler callback(
   error_detected/resume etc), but the slot may be powered off by

   pciehp
   -> remove_board()
  -> pciehp_power_off_slot().

3. While DPC handler -> pci_do_recovery is doing different action to
   detect error and recover based on device->error_state, pciehp driver
   could change it on the fly by:

   pciehp_unconfigure_device()
   ->pci_walk_bus()
 -> pci_dev_set_disconnected()

4. While DPC handler is calling device driver err_handler callback to 
   detect error and recover, pciehp driver could is doing device unbind
   and release its driver.

   ...

While NON-FATAL/FATAL errors happen while hotplug is(is not)doing, result
is not determinate.

So we need some kind of synchronization between pciehp DLLSC/PDC handling
and DPC driver error recover handling.  we need a determinate result
of DPC error containment, link is recovered, link isn't recovered, device
is still there, device is removed, then do pciehp hot-remove and
hot-plugin procudure, don't mix them together.

Per our test on ICS platform, DPC error containment and software handler
will take 10ms up to 50ms till clean the DPC triggered status. it is quick
enough for pciehp compared with its 1000ms waiting to ignore DLLSC/PDC
after doing power off.

With this patch, the handling flow of DPC containment and hotplug is
partly ordered and serialized, let hardware DPC do the controller reset
etc recovery action first, then DPC driver handling the call-back from
device drivers, clear the DPC status, at the end, pciehp handle the DLLSC
and PDC etc.

After tens of PCIe Gen4 NVMe SSD brute force hot-remove and hot-plugin with
any time internval between the two actions, also stressed with the DPC
injection test. system recovered to normal working state from
NON-FATAL/FATAL errors as expected. hotplug works well without any random
undeterminate errors or malfunction.

Brute DPC error injection script:

for i in {0..100}
do
setpci -s 64:02.0 0x196.w=000a
setpci -s 65:00.0 0x04.w=0544
mount /dev/nvme0n1p1 /root/nvme
sleep 1
done

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
---
Changes:
 v2: revise doc according to Andy's suggestion.
 v3: no change.
 v4: no change.
 v5: no change.
 v6: moved to [3/5] from [2/5] and re-wrote description.
 v7: no change.
 v8: no change.

 drivers/pci/hotplug/pciehp_hpc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 53433b37e181..6f271160f18d 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
down_read(>reset_lock);
if (events & DISABLE_SLOT)
pciehp_handle_disable_request(ctrl);
-   else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
+   else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) {
+   pci_wait_port_outdpc(pdev);
pciehp_handle_presence_or_link_change(ctrl, events);
+   }
up_read(>reset_lock);
 
ret = IRQ_HANDLED;
-- 
2.18.4

Re: [PATCH v7 4/5] PCI: only return true when dev io state is really changed

2020-10-07 Thread Ethan Zhao

Bjorn,

On Sun, Oct 4, 2020 at 12:44 AM Bjorn Helgaas  wrote:
>
> On Sat, Oct 03, 2020 at 03:55:13AM -0400, Ethan Zhao wrote:
> > When uncorrectable error happens, AER driver and DPC driver interrupt
> > handlers likely call
> >
> >pcie_do_recovery()
> >->pci_walk_bus()
> >  ->report_frozen_detected()
> >
> > with pci_channel_io_frozen the same time.
> >If pci_dev_set_io_state() return true even if the original state is
> > pci_channel_io_frozen, that will cause AER or DPC handler re-enter
> > the error detecting and recovery procedure one after another.
> >The result is the recovery flow mixed between AER and DPC.
> > So simplify the pci_dev_set_io_state() function to only return true
> > when dev->error_state is really changed.
> >
> > Signed-off-by: Ethan Zhao 
> > Tested-by: Wen Jin 
> > Tested-by: Shanshan Zhang 
> > Reviewed-by: Alexandru Gagniuc 
> > Reviewed-by: Andy Shevchenko 
> > ---
> > Changnes:
> >  v2: revise description and code according to suggestion from Andy.
> >  v3: change code to simpler.
> >  v4: no change.
> >  v5: no change.
> >  v6: no change.
> >  v7: changed based on Bjorn's code and truth table.
> >
> >  drivers/pci/pci.h | 53 ++-
> >  1 file changed, 20 insertions(+), 33 deletions(-)
> >
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index 455b32187abd..47af1ff2a286 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -354,44 +354,31 @@ struct pci_sriov {
> >   *
> >   * Must be called with device_lock held.
> >   *
> > - * Returns true if state has been changed to the requested state.
> > + * Returns true if state has been really changed to the requested state.
> >   */
> >  static inline bool pci_dev_set_io_state(struct pci_dev *dev,
> >   pci_channel_state_t new)
> >  {
> > - bool changed = false;
> > -
> >   device_lock_assert(>dev);
> > - switch (new) {
> > - case pci_channel_io_perm_failure:
> > - switch (dev->error_state) {
> > - case pci_channel_io_frozen:
> > - case pci_channel_io_normal:
> > - case pci_channel_io_perm_failure:
> > - changed = true;
> > - break;
> > - }
> > - break;
> > - case pci_channel_io_frozen:
> > - switch (dev->error_state) {
> > - case pci_channel_io_frozen:
> > - case pci_channel_io_normal:
> > - changed = true;
> > - break;
> > - }
> > - break;
> > - case pci_channel_io_normal:
> > - switch (dev->error_state) {
> > - case pci_channel_io_frozen:
> > - case pci_channel_io_normal:
> > - changed = true;
> > - break;
> > - }
> > - break;
> > - }
> > - if (changed)
> > - dev->error_state = new;
> > - return changed;
> > +
> > +/*
> > + *   Truth table:
> > + *   requested new state
> > + * current  --
> > + * statenormal frozen perm_failure
> > + *   +  -  -  
> > + * normal|  normal frozen perm_failure
> > + * frozen|  normal frozen perm_failure
> > + * perm_failure  |  perm_failure*  perm_failure*  perm_failure
> > + */
> > +
> > + if (dev->error_state == pci_channel_io_perm_failure)
> > + return false;
> > + else if (dev->error_state == new)
> > + return false;
> > +
> > + dev->error_state = new;
> > + return true;
>
> No, you missed the point.  I want
>
>   1) One patch that converts the "switch" to the shorter "if"
>  statements.  This one will be big and ugly, but should not change
>  the functionality at all, and it should be pretty easy to verify
>  that since there aren't very many states involved.
>
>  Since this one is pure code simplification, the commit log won't
>  say anything at all about AER or DPC or their requirements
>  because it's not changing any behavior.
>
>   2) A separate patch that's tiny and makes whatever functional change
>  you need.

   Make sense, clear,  this time.

 Thanks,
 Ethan
>
> >  }
> >
> >  static inline int pci_dev_set_disconnected(struct pci_dev *dev, void 
> > *unused)
> > --
> > 2.18.4
> >

Re: [PATCH v7 3/5] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC

2020-10-07 Thread Ethan Zhao

Lukas,

On Mon, Oct 5, 2020 at 3:13 AM Lukas Wunner  wrote:
>
> On Sat, Oct 03, 2020 at 03:55:12AM -0400, Ethan Zhao wrote:
> > When root port has DPC capability and it is enabled, then triggered by
> > errors, DPC DLLSC and PDC etc interrupts will be sent to DPC driver, pciehp
> > drivers almost at the same time.
>
> Do the DLLSC and PDC events occur as a result of handling the error
> or do they occur independently?
They could happen independently if links were recovered then the card
was removed.
They could happen as a result of handling the errors the same time.

So don't assume DLLSC and PDC all occur at the same time.

>
> If the latter, I don't see how we can tell whether the card in the
> slot is still the same.
If PDC happens, the card in the slot might not be the same.  so
hot-removal /hot -plugin handling follows the PDC event.
>
> If the former, holding the hotplug slot's reset_lock and doing something
> along the lines of pciehp_reset_slot() (or calling it directly) might
> solve the race.

DPC reset is done by hardware, only AER calls pciehp_reset_slot() as recovery
handling initiated by software.

Thanks,
Ethan
>
> Thanks,
>
> Lukas

Re: [PATCH v7 0/5] Fix DPC hotplug race and enhance error handling

2020-10-07 Thread Ethan Zhao

Raj,

On Sun, Oct 4, 2020 at 12:57 PM Raj, Ashok  wrote:
>
> Hi Ethan
>
> On Sat, Oct 03, 2020 at 03:55:09AM -0400, Ethan Zhao wrote:
> > Hi,folks,
> >
> > This simple patch set fixed some serious security issues found when DPC
> > error injection and NVMe SSD hotplug brute force test were doing -- race
> > condition between DPC handler and pciehp, AER interrupt handlers, caused
> > system hang and system with DPC feature couldn't recover to normal
> > working state as expected (NVMe instance lost, mount operation hang,
> > race PCIe access caused uncorrectable errors reported alternatively etc).
>
> I think maybe picking from other commit messages to make this description in
> cover letter bit clear. The fundamental premise is that when due to error
> conditions when events are processed by both DPC handler and hotplug handling 
> of
> DLLSC both operating on the same device object ends up with crashes.

Yep, that's right.

Thanks,
Ethan
>
>
> >
> > With this patch set applied, stable 5.9-rc6 on ICS (Ice Lake SP platform,
> > see
> > https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server))
> >
> > could pass the PCIe Gen4 NVMe SSD brute force hotplug test with any time
> > interval between hot-remove and plug-in operation tens of times without
> > any errors occur and system works normal.
>
> >
> > With this patch set applied, system with DPC feature could recover from
> > NON-FATAL and FATAL errors injection test and works as expected.
> >
> > System works smoothly when errors happen while hotplug is doing, no
> > uncorrectable errors found.
> >
> > Brute DPC error injection script:
> >
> > for i in {0..100}
> > do
> > setpci -s 64:02.0 0x196.w=000a
> > setpci -s 65:00.0 0x04.w=0544
> > mount /dev/nvme0n1p1 /root/nvme
> > sleep 1
> > done
> >
> > Other details see every commits description part.
> >
> > This patch set could be applied to stable 5.9-rc6/rc7 directly.
> >
> > Help to review and test.
> >
> > v2: changed according to review by Andy Shevchenko.
> > v3: changed patch 4/5 to simpler coding.
> > v4: move function pci_wait_port_outdpc() to DPC driver and its
> >declaration to pci.h. (tip from Christoph Hellwig ).
> > v5: fix building issue reported by l...@intel.com with some config.
> > v6: move patch[3/5] as the first patch according to Lukas's suggestion.
> > and rewrite the comment part of patch[3/5].
> > v7: change the patch[4/5], based on Bjorn's code and truth table.
> > change the patch[5/5] about the debug output information.
> >
> > Thanks,
> > Ethan
> >
> >
> > Ethan Zhao (5):
> >   PCI/ERR: get device before call device driver to avoid NULL pointer
> > dereference
> >   PCI/DPC: define a function to check and wait till port finish DPC
> > handling
> >   PCI: pciehp: check and wait port status out of DPC before handling
> > DLLSC and PDC
> >   PCI: only return true when dev io state is really changed
> >   PCI/ERR: don't mix io state not changed and no driver together
> >
> >  drivers/pci/hotplug/pciehp_hpc.c |  4 ++-
> >  drivers/pci/pci.h| 55 +---
> >  drivers/pci/pcie/dpc.c   | 27 
> >  drivers/pci/pcie/err.c   | 18 +--
> >  4 files changed, 68 insertions(+), 36 deletions(-)
> >
> >
> > base-commit: a1b8638ba1320e6684aa98233c15255eb803fac7
> > --
> > 2.18.4
> >

[PATCH v7 5/5] PCI/ERR: don't mix io state not changed and no driver together

2020-10-03 Thread Ethan Zhao

When we see 'can't recover (no error_detected callback)' on console,
Maybe the reason is io state is not changed by calling
pci_dev_set_io_state(), that is confused. fix it.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
---
Chagnes:
 v2: no change.
 v3: no change.
 v4: no change.
 v5: no change.
 v6: no change.
 v7: change debug output information.

 drivers/pci/pcie/err.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index e35c4480c86b..2ca2723f3b34 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -55,8 +55,10 @@ static int report_error_detected(struct pci_dev *dev,
if (!pci_dev_get(dev))
return 0;
device_lock(>dev);
-   if (!pci_dev_set_io_state(dev, state) ||
-   !dev->driver ||
+   if (!pci_dev_set_io_state(dev, state)) {
+   pci_dbg(dev, "Device was in that state or not allowed 
setting.\n");
+   vote = PCI_ERS_RESULT_NONE;
+   } else if (!dev->driver ||
!dev->driver->err_handler ||
!dev->driver->err_handler->error_detected) {
/*
-- 
2.18.4

[PATCH v7 4/5] PCI: only return true when dev io state is really changed

2020-10-03 Thread Ethan Zhao

When uncorrectable error happens, AER driver and DPC driver interrupt
handlers likely call

   pcie_do_recovery()
   ->pci_walk_bus()
 ->report_frozen_detected()

with pci_channel_io_frozen the same time.
   If pci_dev_set_io_state() return true even if the original state is
pci_channel_io_frozen, that will cause AER or DPC handler re-enter
the error detecting and recovery procedure one after another.
   The result is the recovery flow mixed between AER and DPC.
So simplify the pci_dev_set_io_state() function to only return true
when dev->error_state is really changed.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
Reviewed-by: Alexandru Gagniuc 
Reviewed-by: Andy Shevchenko 
---
Changnes:
 v2: revise description and code according to suggestion from Andy.
 v3: change code to simpler.
 v4: no change.
 v5: no change.
 v6: no change.
 v7: changed based on Bjorn's code and truth table.

 drivers/pci/pci.h | 53 ++-
 1 file changed, 20 insertions(+), 33 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 455b32187abd..47af1ff2a286 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -354,44 +354,31 @@ struct pci_sriov {
  *
  * Must be called with device_lock held.
  *
- * Returns true if state has been changed to the requested state.
+ * Returns true if state has been really changed to the requested state.
  */
 static inline bool pci_dev_set_io_state(struct pci_dev *dev,
pci_channel_state_t new)
 {
-   bool changed = false;
-
device_lock_assert(>dev);
-   switch (new) {
-   case pci_channel_io_perm_failure:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   case pci_channel_io_perm_failure:
-   changed = true;
-   break;
-   }
-   break;
-   case pci_channel_io_frozen:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   changed = true;
-   break;
-   }
-   break;
-   case pci_channel_io_normal:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   changed = true;
-   break;
-   }
-   break;
-   }
-   if (changed)
-   dev->error_state = new;
-   return changed;
+
+/*
+ * Truth table:
+ * requested new state
+ * current  --
+ * statenormal frozen perm_failure
+ *   +  -  -  
+ * normal|  normal frozen perm_failure
+ * frozen|  normal frozen perm_failure
+ * perm_failure  |  perm_failure*  perm_failure*  perm_failure
+ */
+
+   if (dev->error_state == pci_channel_io_perm_failure)
+   return false;
+   else if (dev->error_state == new)
+   return false;
+
+   dev->error_state = new;
+   return true;
 }
 
 static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
-- 
2.18.4

[PATCH v7 1/5] PCI/ERR: get device before call device driver to avoid NULL pointer dereference

2020-10-03 Thread Ethan Zhao

During DPC error injection test we found there is race condition between
pciehp and DPC driver, NULL pointer dereference caused panic as following

 # setpci -s 64:02.0 0x196.w=000a
  // 64:02.0 is rootport has DPC capability
 # setpci -s 65:00.0 0x04.w=0544
  // 65:00.0 is NVMe SSD populated in above port
 # mount /dev/nvme0n1p1 nvme

 (tested on stable 5.8 & ICS(Ice Lake SP platform, see
 https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server))

 BUG: kernel NULL pointer dereference, address: 0050
 ...
 CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0-0.0.7.el8.x86_64+ #1
 RIP: 0010:report_error_detected.cold.4+0x7d/0xe6
 Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3 65 ff
 ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50 50 48 83 3a 00
 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0
 RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246
 RAX:  RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01
 RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138
 RBP: ff8e06cf8762fdd0 R08: 00bf R09: 
 R10: 00eb8ebeab53 R11: 93453258 R12: 0002
 R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828
 FS:  () GS:ff4e3eab1fd0() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0
 DR0:  DR1:  DR2: 
 DR3:  DR6: fffe0ff0 DR7: 0400
 PKRU: 5554
 Call Trace:
 ? report_normal_detected+0x20/0x20
 report_frozen_detected+0x16/0x20
 pci_walk_bus+0x75/0x90
 ? dpc_irq+0x90/0x90
 pcie_do_recovery+0x157/0x201
 ? irq_finalize_oneshot.part.47+0xe0/0xe0
 dpc_handler+0x29/0x40
 irq_thread_fn+0x24/0x60
 ...

Debug shows when port DPC feature was enabled and triggered by errors,
DLLSC/PDC/DPC interrupts will be sent to pciehp and DPC driver almost
at the same time, and no delay between them is required by specification.
so DPC driver and pciehp drivers may handle these interrupts cocurrently.

While DPC driver is doing pci_walk_bus() and calling device driver's callback
without pci_dev_get() to increase device reference count, the device and its
driver instance are likely being freed by pci_stop_and_removed_bus_device()
-> pci_dev_put().

So does pci_dev_get() before using the device instance to avoid NULL pointer
dereference.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
---
Changes:
 v2: revise doc according to Andy's suggestion.
 v3: no change.
 v4: no change.
 v5: no change.
 v6: moved to [1/5] from [3/5] and revised comment according to Lukas'
 suggestion.
 v7: no change.

 drivers/pci/pcie/err.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index c543f419d8f9..e35c4480c86b 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -52,6 +52,8 @@ static int report_error_detected(struct pci_dev *dev,
pci_ers_result_t vote;
const struct pci_error_handlers *err_handler;
 
+   if (!pci_dev_get(dev))
+   return 0;
device_lock(>dev);
if (!pci_dev_set_io_state(dev, state) ||
!dev->driver ||
@@ -76,6 +78,7 @@ static int report_error_detected(struct pci_dev *dev,
pci_uevent_ers(dev, vote);
*result = merge_result(*result, vote);
device_unlock(>dev);
+   pci_dev_put(dev);
return 0;
 }
 
@@ -94,6 +97,8 @@ static int report_mmio_enabled(struct pci_dev *dev, void 
*data)
pci_ers_result_t vote, *result = data;
const struct pci_error_handlers *err_handler;
 
+   if (!pci_dev_get(dev))
+   return 0;
device_lock(>dev);
if (!dev->driver ||
!dev->driver->err_handler ||
@@ -105,6 +110,7 @@ static int report_mmio_enabled(struct pci_dev *dev, void 
*data)
*result = merge_result(*result, vote);
 out:
device_unlock(>dev);
+   pci_dev_put(dev);
return 0;
 }
 
@@ -113,6 +119,8 @@ static int report_slot_reset(struct pci_dev *dev, void 
*data)
pci_ers_result_t vote, *result = data;
const struct pci_error_handlers *err_handler;
 
+   if (!pci_dev_get(dev))
+   return 0;
device_lock(>dev);
if (!dev->driver ||
!dev->driver->err_handler ||
@@ -124,6 +132,7 @@ static int report_slot_reset(struct pci_dev *dev, void 
*data)
*result = merge_result(*result, vote);
 out:
device_unlock(>dev);
+   pci_dev_put(dev);
return 0;
 }
 
@@ -131,6 +140,8 @@ static int report_resume(struct pci_dev *dev, void *data)
 {
const struct pci_error_handlers *err_handler;
 
+   if (!pci_dev_get(dev))
+   return 0;
device_lock(>dev);
if (!pci_dev_set_io_state(dev,

[PATCH v7 3/5] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC

2020-10-03 Thread Ethan Zhao

When root port has DPC capability and it is enabled, then triggered by
errors, DPC DLLSC and PDC etc interrupts will be sent to DPC driver, pciehp
drivers almost at the same time.
That will cause following messed and confused errors handling/recovery/removal
/plugin procedure.

1. Port and device are in error recovery reseting initiated by DPC hardware,
   pciehp driver treats them as device is doing hot-remove or hot-plugin the
   same time.

2. While DPC handler calling device driver->err_handler callback(
   error_detected/resume etc), but the slot may be powered off by

   pciehp
   -> remove_board()
  -> pciehp_power_off_slot().

3. While DPC handler -> pci_do_recovery is doing different action to detect
   error and recover based on device->error_state, pciehp driver could change
   it on the fly by:

   pciehp_unconfigure_device()
   ->pci_walk_bus()
 -> pci_dev_set_disconnected()

4. While DPC handler is calling device driver err_handler callback to detect
   error and recover, pciehp driver could is doing device unbind and release
   its driver.

   ...

While NON-FATAL/FATAL errors happen while hotplug is(is not)doing, result
is not determinate.

So we need some kind of synchronization between pciehp DLLSC/PDC handling
and DPC driver error recover handling.  we need a determinate result
of DPC error containment, link is recovered, link isn't recovered, device
is still there, device is removed, then do pciehp hot-remove and hot-plugin
procudure, don't mix them together.

Per our test on ICS platform, DPC error containment and software handler will
take 10ms up to 50ms till clean the DPC triggered status. it is quick enough
for pciehp compared with its 1000ms waiting to ignore DLLSC/PDC after doing
power off.

With this patch, the handling flow of DPC containment and hotplug is
partly ordered and serialized, let hardware DPC do the controller reset
etc recovery action first, then DPC driver handling the call-back from
device drivers, clear the DPC status, at the end, pciehp handle the DLLSC
and PDC etc.

After tens of PCIe Gen4 NVMe SSD brute force hot-remove and hot-plugin with
any time internval between the two actions, also stressed with the DPC
injection test. system recovered to normal working state from NON-FATAL/FATAL
errors as expected. hotplug works well without any random undeterminate errors
or malfunction.

Brute DPC error injection script:

for i in {0..100}
do
setpci -s 64:02.0 0x196.w=000a
setpci -s 65:00.0 0x04.w=0544
mount /dev/nvme0n1p1 /root/nvme
sleep 1
done

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
---
Changes:
 v2: revise doc according to Andy's suggestion.
 v3: no change.
 v4: no change.
 v5: no change.
 v6: moved to [3/5] from [2/5] and re-wrote description.
 v7: no change.

 drivers/pci/hotplug/pciehp_hpc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 53433b37e181..6f271160f18d 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
down_read(>reset_lock);
if (events & DISABLE_SLOT)
pciehp_handle_disable_request(ctrl);
-   else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
+   else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) {
+   pci_wait_port_outdpc(pdev);
pciehp_handle_presence_or_link_change(ctrl, events);
+   }
up_read(>reset_lock);
 
ret = IRQ_HANDLED;
-- 
2.18.4

[PATCH v7 2/5] PCI/DPC: define a function to check and wait till port finish DPC handling

2020-10-03 Thread Ethan Zhao

Once root port DPC capability is enabled and triggered, at the beginning
of DPC is triggered, the DPC status bits are set by hardware and then
sends DPC/DLLSC/PDC interrupts to OS DPC and pciehp drivers, it will
take the port and software DPC interrupt handler 10ms to 50ms (test data
on ICS(Ice Lake SP platform, see
https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)
& stable 5.9-rc6) to complete the DPC containment procedure
till the DPC status is cleared at the end of the DPC interrupt handler.

We use this function to check if the root port is in DPC handling status
and wait till the hardware and software completed the procedure.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
---
changes:
 v2：align ICS code name to public doc.
 v3: no change.
 v4: response to Christoph's (Christoph Hellwig )
 tip, move pci_wait_port_outdpc() to DPC driver and its declaration
 to pci.h.
 v5: fix building issue reported by l...@intel.com with some config.
 v6: move from [1/5] to [2/5].
 v7: no change.

 drivers/pci/pci.h  |  2 ++
 drivers/pci/pcie/dpc.c | 27 +++
 2 files changed, 29 insertions(+)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fa12f7cbc1a0..455b32187abd 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -455,10 +455,12 @@ void pci_restore_dpc_state(struct pci_dev *dev);
 void pci_dpc_init(struct pci_dev *pdev);
 void dpc_process_error(struct pci_dev *pdev);
 pci_ers_result_t dpc_reset_link(struct pci_dev *pdev);
+bool pci_wait_port_outdpc(struct pci_dev *pdev);
 #else
 static inline void pci_save_dpc_state(struct pci_dev *dev) {}
 static inline void pci_restore_dpc_state(struct pci_dev *dev) {}
 static inline void pci_dpc_init(struct pci_dev *pdev) {}
+static inline bool pci_wait_port_outdpc(struct pci_dev *pdev) { return false; }
 #endif
 
 #ifdef CONFIG_PCI_ATS
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index daa9a4153776..2e0e091ce923 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -71,6 +71,33 @@ void pci_restore_dpc_state(struct pci_dev *dev)
pci_write_config_word(dev, dev->dpc_cap + PCI_EXP_DPC_CTL, *cap);
 }
 
+bool pci_wait_port_outdpc(struct pci_dev *pdev)
+{
+   u16 cap = pdev->dpc_cap, status;
+   u16 loop = 0;
+
+   if (!cap) {
+   pci_WARN_ONCE(pdev, !cap, "No DPC capability initiated\n");
+   return false;
+   }
+   pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
+   pci_dbg(pdev, "DPC status %x, cap %x\n", status, cap);
+
+   while (status & PCI_EXP_DPC_STATUS_TRIGGER && loop < 100) {
+   msleep(10);
+   loop++;
+   pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
+   }
+
+   if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) {
+   pci_dbg(pdev, "Out of DPC %x, cost %d ms\n", status, loop*10);
+   return true;
+   }
+
+   pci_dbg(pdev, "Timeout to wait port out of DPC status\n");
+   return false;
+}
+
 static int dpc_wait_rp_inactive(struct pci_dev *pdev)
 {
unsigned long timeout = jiffies + HZ;
-- 
2.18.4

[PATCH v7 0/5] Fix DPC hotplug race and enhance error handling

2020-10-03 Thread Ethan Zhao

Hi,folks,

This simple patch set fixed some serious security issues found when DPC
error injection and NVMe SSD hotplug brute force test were doing -- race
condition between DPC handler and pciehp, AER interrupt handlers, caused
system hang and system with DPC feature couldn't recover to normal
working state as expected (NVMe instance lost, mount operation hang,
race PCIe access caused uncorrectable errors reported alternatively etc).

With this patch set applied, stable 5.9-rc6 on ICS (Ice Lake SP platform,
see
https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server))

could pass the PCIe Gen4 NVMe SSD brute force hotplug test with any time
interval between hot-remove and plug-in operation tens of times without
any errors occur and system works normal.

With this patch set applied, system with DPC feature could recover from
NON-FATAL and FATAL errors injection test and works as expected.

System works smoothly when errors happen while hotplug is doing, no
uncorrectable errors found.

Brute DPC error injection script:

for i in {0..100}
do
setpci -s 64:02.0 0x196.w=000a
setpci -s 65:00.0 0x04.w=0544
mount /dev/nvme0n1p1 /root/nvme
sleep 1
done

Other details see every commits description part.

This patch set could be applied to stable 5.9-rc6/rc7 directly.

Help to review and test.

v2: changed according to review by Andy Shevchenko.
v3: changed patch 4/5 to simpler coding.
v4: move function pci_wait_port_outdpc() to DPC driver and its
   declaration to pci.h. (tip from Christoph Hellwig ).
v5: fix building issue reported by l...@intel.com with some config.
v6: move patch[3/5] as the first patch according to Lukas's suggestion.
and rewrite the comment part of patch[3/5].
v7: change the patch[4/5], based on Bjorn's code and truth table.
change the patch[5/5] about the debug output information.

Thanks,
Ethan 


Ethan Zhao (5):
  PCI/ERR: get device before call device driver to avoid NULL pointer
dereference
  PCI/DPC: define a function to check and wait till port finish DPC
handling
  PCI: pciehp: check and wait port status out of DPC before handling
DLLSC and PDC
  PCI: only return true when dev io state is really changed
  PCI/ERR: don't mix io state not changed and no driver together

 drivers/pci/hotplug/pciehp_hpc.c |  4 ++-
 drivers/pci/pci.h| 55 +---
 drivers/pci/pcie/dpc.c   | 27 
 drivers/pci/pcie/err.c   | 18 +--
 4 files changed, 68 insertions(+), 36 deletions(-)


base-commit: a1b8638ba1320e6684aa98233c15255eb803fac7
-- 
2.18.4

Re: [PATCH v6 4/5] PCI: only return true when dev io state is really changed

2020-10-03 Thread Ethan Zhao

Bjorn,

On Sat, Oct 3, 2020 at 1:29 AM Bjorn Helgaas  wrote:
>
> [+cc Sinan]
>
> On Wed, Sep 30, 2020 at 03:05:36AM -0400, Ethan Zhao wrote:
> > When uncorrectable error happens, AER driver and DPC driver interrupt
> > handlers likely call
> >
> >pcie_do_recovery()
> >->pci_walk_bus()
> >  ->report_frozen_detected()
> >
> > with pci_channel_io_frozen the same time.
> >If pci_dev_set_io_state() return true even if the original state is
> > pci_channel_io_frozen, that will cause AER or DPC handler re-enter
> > the error detecting and recovery procedure one after another.
> >The result is the recovery flow mixed between AER and DPC.
> > So simplify the pci_dev_set_io_state() function to only return true
> > when dev->error_state is changed.
> >
> > Signed-off-by: Ethan Zhao 
> > Tested-by: Wen Jin 
> > Tested-by: Shanshan Zhang 
> > Reviewed-by: Alexandru Gagniuc 
> > Reviewed-by: Andy Shevchenko 
> > ---
> > Changnes:
> >  v2: revise description and code according to suggestion from Andy.
> >  v3: change code to simpler.
> >  v4: no change.
> >  v5: no change.
> >  v6: no change.
> >
> >  drivers/pci/pci.h | 37 +
> >  1 file changed, 5 insertions(+), 32 deletions(-)
> >
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index 455b32187abd..f2beeaeda321 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -359,39 +359,12 @@ struct pci_sriov {
> >  static inline bool pci_dev_set_io_state(struct pci_dev *dev,
> >   pci_channel_state_t new)
> >  {
> > - bool changed = false;
> > -
> >   device_lock_assert(>dev);
> > - switch (new) {
> > - case pci_channel_io_perm_failure:
> > - switch (dev->error_state) {
> > - case pci_channel_io_frozen:
> > - case pci_channel_io_normal:
> > - case pci_channel_io_perm_failure:
> > - changed = true;
> > - break;
> > - }
> > - break;
> > - case pci_channel_io_frozen:
> > - switch (dev->error_state) {
> > - case pci_channel_io_frozen:
> > - case pci_channel_io_normal:
> > - changed = true;
> > - break;
> > - }
> > - break;
> > - case pci_channel_io_normal:
> > - switch (dev->error_state) {
> > - case pci_channel_io_frozen:
> > - case pci_channel_io_normal:
> > - changed = true;
> > - break;
> > - }
> > - break;
> > - }
> > - if (changed)
> > - dev->error_state = new;
> > - return changed;
> > + if (dev->error_state == new)
> > + return false;
> > +
> > + dev->error_state = new;
> > + return true;
> >  }
>
> IIUC this changes the behavior of the function, but it's difficult to
> analyze because it does a lot of simplification at the same time.
>
> Please consider the following, which is intended to simplify the
> function while preserving the behavior (but please verify; it's been a
> long time since I looked at this).  Then maybe see how your patch
> could be done on top of this?
>
> Alternatively, come up with your own simplification patch + the
> functionality change.
>
>
> commit 983d9b1f8177 ("PCI/ERR: Simplify pci_dev_set_io_state()")
> Author: Bjorn Helgaas 
> Date:   Tue May 19 12:28:57 2020 -0500
>
> PCI/ERR: Simplify pci_dev_set_io_state()
>
> Truth table:
>
>   requested new state
>   current  --
>   statenormal frozen perm_failure
>     +  -  -  
>   normal|  normal frozen perm_failure
>   frozen|  normal frozen perm_failure
>   perm_failure  |  perm_failure*  perm_failure*  perm_failure
>
>   * "not changed", returns false
>
> No functional change intended.
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 6d3f75867106..81408552f7c9 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -358,39 +358,21 @@ struct pci_sriov {
>  static inline bool pci_dev_set_io_state(str

Re: [PATCH v6 4/5] PCI: only return true when dev io state is really changed

2020-10-02 Thread Ethan Zhao

Sinan,

On Sat, Oct 3, 2020 at 12:08 AM Sinan Kaya  wrote:
>
> On 9/30/2020 3:05 AM, Ethan Zhao wrote:
> > When uncorrectable error happens, AER driver and DPC driver interrupt
> > handlers likely call
> >
> >pcie_do_recovery()
> >->pci_walk_bus()
> >  ->report_frozen_detected()
> >
> > with pci_channel_io_frozen the same time.
>
> We need some more data on this. If DPC is supported by HW, errors
> should be triggered by DPC not AER.
>
> If I remember right, there is a register that tells which AER errors
> should be handled by DPC.

When uncorrectable errors happen, non-fatal severity level, AER and
DPC could be triggered at the same time.


Thanks,
Ethan

>

[PATCH v6 3/5] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC

2020-09-30 Thread Ethan Zhao

When root port has DPC capability and it is enabled, then triggered by
errors, DPC DLLSC and PDC etc interrupts will be sent to DPC driver, pciehp
drivers almost at the same time.
That will cause following messed and confused errors handling/recovery/
removal/plugin procedure.

1. Port and device are in error recovery resetting initiated by DPC
   hardware, pciehp driver treats them as device is doing hot-remove or
   hot-plugin the same time.

2. While DPC handler calling device driver->err_handler callback(
   error_detected/resume etc), but the slot may be powered off by

   pciehp
   -> remove_board()
  -> pciehp_power_off_slot().

3. While DPC handler -> pci_do_recovery is doing different action to detect
   error and recover based on device->error_state, pciehp driver could
   change it on the fly by:

   pciehp_unconfigure_device()
   ->pci_walk_bus()
 -> pci_dev_set_disconnected()

4. While DPC handler is calling device driver err_handler callback to
   detect error and recover, pciehp driver could is doing device unbind and
   release its driver.

   ...

While NON-FATAL/FATAL errors happen while hotplug is(is not)doing, result
is not determinate.

So we need some kind of synchronization between pciehp DLLSC/PDC handling
and DPC driver error recover handling.  we need a determinate result
of DPC error containment, link is recovered, link isn't recovered, device
is still there, device is removed, then do pciehp hot-remove and hot-plugin
procudure, don't mix them together.

Per our test on ICS platform, DPC error containment and software handler
will take 10ms up to 50ms till clean the DPC triggered status. it is quick
enough for pciehp compared with its 1000ms waiting to ignore DLLSC/PDC
after doing power off.

With this patch, the handling flow of DPC containment and hotplug is
partly ordered and serialized, let hardware DPC do the controller reset
etc recovery action first, then DPC driver handling the call-back from
device drivers, clear the DPC status, at the end, pciehp handle the DLLSC
and PDC etc.

After tens of PCIe Gen4 NVMe SSD brute force hot-remove and hot-plugin with
any time internval between the two actions, also stressed with the DPC
injection test. system recovered to normal working state from NON-FATAL/
FATAL errors as expected. hotplug works well without any random
undeterminate errors or malfunction.

Brute DPC error injection script:

for i in {0..100}
do
setpci -s 64:02.0 0x196.w=000a
setpci -s 65:00.0 0x04.w=0544
mount /dev/nvme0n1p1 /root/nvme
sleep 1
done

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
---
Changes:
 v2: revise doc according to Andy's suggestion.
 v3: no change.
 v4: no change.
 v5: no change.
 v6: moved to [3/5] from [2/5] and re-wrote description.

 drivers/pci/hotplug/pciehp_hpc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 53433b37e181..6f271160f18d 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
down_read(>reset_lock);
if (events & DISABLE_SLOT)
pciehp_handle_disable_request(ctrl);
-   else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
+   else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) {
+   pci_wait_port_outdpc(pdev);
pciehp_handle_presence_or_link_change(ctrl, events);
+   }
up_read(>reset_lock);
 
ret = IRQ_HANDLED;
-- 
2.18.4

[PATCH v6 0/5] Fix DPC hotplug race and enhance error handling

2020-09-30 Thread Ethan Zhao

Hi,folks,

This simple patch set fixed some serious security issues found when DPC
error injection and NVMe SSD hotplug brute force test were doing -- race
condition between DPC handler and pciehp, AER interrupt handlers, caused
system hang and system with DPC feature couldn't recover to normal
working state as expected (NVMe instance lost, mount operation hang,
race PCIe access caused uncorrectable errors reported alternatively etc).

With this patch set applied, stable 5.9-rc6 on ICS (Ice Lake SP platform,
see
https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server))

could pass the PCIe Gen4 NVMe SSD brute force hotplug test with any time
interval between hot-remove and plug-in operation tens of times without
any errors occur and system works normal.

With this patch set applied, system with DPC feature could recover from
NON-FATAL and FATAL errors injection test and works as expected.

System works smoothly when errors happen while hotplug is doing, no
uncorrectable errors found.

Brute DPC error injection script:

for i in {0..100}
do
setpci -s 64:02.0 0x196.w=000a
setpci -s 65:00.0 0x04.w=0544
mount /dev/nvme0n1p1 /root/nvme
sleep 1
done

Other details see every commits description part.

This patch set could be applied to stable 5.9-rc6/rc7 directly.

Help to review and test.

v2: changed according to review by Andy Shevchenko.
v3: changed patch 4/5 to simpler coding.
v4: move function pci_wait_port_outdpc() to DPC driver and its
   declaration to pci.h. (tip from Christoph Hellwig ).
v5: fix building issue reported by l...@intel.com with some config.
v6: move patch[3/5] as the first patch according to Lukas's suggestion.
and rewrite the comment part of patch[3/5].

Ethan Zhao (5):
  PCI/ERR: get device before call device driver to avoid NULL pointer
dereference
  PCI/DPC: define a function to check and wait till port finish DPC
handling
  PCI: pciehp: check and wait port status out of DPC before handling
DLLSC and PDC
  PCI: only return true when dev io state is really changed
  PCI/ERR: don't mix io state not changed and no driver together

 drivers/pci/hotplug/pciehp_hpc.c |  4 +++-
 drivers/pci/pci.h| 39 ++--
 drivers/pci/pcie/dpc.c   | 27 ++
 drivers/pci/pcie/err.c   | 18 +--
 4 files changed, 53 insertions(+), 35 deletions(-)


base-commit: a1b8638ba1320e6684aa98233c15255eb803fac7
-- 
2.18.4

[PATCH v6 5/5] PCI/ERR: don't mix io state not changed and no driver together

2020-09-30 Thread Ethan Zhao

When we see 'can't recover (no error_detected callback)' on console,
Maybe the reason is io state is not changed by calling
pci_dev_set_io_state(), that is confused. fix it.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
---
Chagnes:
 v2: no change.
 v3: no change.
 v4: no change.
 v5: no change.
 v6: no change.

 drivers/pci/pcie/err.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index e35c4480c86b..d85f27c90c26 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -55,8 +55,10 @@ static int report_error_detected(struct pci_dev *dev,
if (!pci_dev_get(dev))
return 0;
device_lock(>dev);
-   if (!pci_dev_set_io_state(dev, state) ||
-   !dev->driver ||
+   if (!pci_dev_set_io_state(dev, state)) {
+   pci_dbg(dev, "Device might already being in error handling 
...\n");
+   vote = PCI_ERS_RESULT_NONE;
+   } else if (!dev->driver ||
!dev->driver->err_handler ||
!dev->driver->err_handler->error_detected) {
/*
-- 
2.18.4

[PATCH v6 2/5] PCI/DPC: define a function to check and wait till port finish DPC handling

2020-09-30 Thread Ethan Zhao

Once root port DPC capability is enabled and triggered, at the beginning
of DPC is triggered, the DPC status bits are set by hardware and then
sends DPC/DLLSC/PDC interrupts to OS DPC and pciehp drivers, it will
take the port and software DPC interrupt handler 10ms to 50ms (test data
on ICS(Ice Lake SP platform, see
https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)
& stable 5.9-rc6) to complete the DPC containment procedure
till the DPC status is cleared at the end of the DPC interrupt handler.

We use this function to check if the root port is in DPC handling status
and wait till the hardware and software completed the procedure.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
---
changes:
 v2：align ICS code name to public doc.
 v3: no change.
 v4: response to Christoph's (Christoph Hellwig )
 tip, move pci_wait_port_outdpc() to DPC driver and its declaration
 to pci.h.
 v5: fix building issue reported by l...@intel.com with some config.
 v6: move from [1/5] to [2/5].

 drivers/pci/pci.h  |  2 ++
 drivers/pci/pcie/dpc.c | 27 +++
 2 files changed, 29 insertions(+)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fa12f7cbc1a0..455b32187abd 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -455,10 +455,12 @@ void pci_restore_dpc_state(struct pci_dev *dev);
 void pci_dpc_init(struct pci_dev *pdev);
 void dpc_process_error(struct pci_dev *pdev);
 pci_ers_result_t dpc_reset_link(struct pci_dev *pdev);
+bool pci_wait_port_outdpc(struct pci_dev *pdev);
 #else
 static inline void pci_save_dpc_state(struct pci_dev *dev) {}
 static inline void pci_restore_dpc_state(struct pci_dev *dev) {}
 static inline void pci_dpc_init(struct pci_dev *pdev) {}
+static inline bool pci_wait_port_outdpc(struct pci_dev *pdev) { return false; }
 #endif
 
 #ifdef CONFIG_PCI_ATS
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index daa9a4153776..2e0e091ce923 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -71,6 +71,33 @@ void pci_restore_dpc_state(struct pci_dev *dev)
pci_write_config_word(dev, dev->dpc_cap + PCI_EXP_DPC_CTL, *cap);
 }
 
+bool pci_wait_port_outdpc(struct pci_dev *pdev)
+{
+   u16 cap = pdev->dpc_cap, status;
+   u16 loop = 0;
+
+   if (!cap) {
+   pci_WARN_ONCE(pdev, !cap, "No DPC capability initiated\n");
+   return false;
+   }
+   pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
+   pci_dbg(pdev, "DPC status %x, cap %x\n", status, cap);
+
+   while (status & PCI_EXP_DPC_STATUS_TRIGGER && loop < 100) {
+   msleep(10);
+   loop++;
+   pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
+   }
+
+   if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) {
+   pci_dbg(pdev, "Out of DPC %x, cost %d ms\n", status, loop*10);
+   return true;
+   }
+
+   pci_dbg(pdev, "Timeout to wait port out of DPC status\n");
+   return false;
+}
+
 static int dpc_wait_rp_inactive(struct pci_dev *pdev)
 {
unsigned long timeout = jiffies + HZ;
-- 
2.18.4

[PATCH v6 4/5] PCI: only return true when dev io state is really changed

2020-09-30 Thread Ethan Zhao

When uncorrectable error happens, AER driver and DPC driver interrupt
handlers likely call

   pcie_do_recovery()
   ->pci_walk_bus()
 ->report_frozen_detected()

with pci_channel_io_frozen the same time.
   If pci_dev_set_io_state() return true even if the original state is
pci_channel_io_frozen, that will cause AER or DPC handler re-enter
the error detecting and recovery procedure one after another.
   The result is the recovery flow mixed between AER and DPC.
So simplify the pci_dev_set_io_state() function to only return true
when dev->error_state is changed.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
Reviewed-by: Alexandru Gagniuc 
Reviewed-by: Andy Shevchenko 
---
Changnes:
 v2: revise description and code according to suggestion from Andy.
 v3: change code to simpler.
 v4: no change.
 v5: no change.
 v6: no change.

 drivers/pci/pci.h | 37 +
 1 file changed, 5 insertions(+), 32 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 455b32187abd..f2beeaeda321 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -359,39 +359,12 @@ struct pci_sriov {
 static inline bool pci_dev_set_io_state(struct pci_dev *dev,
pci_channel_state_t new)
 {
-   bool changed = false;
-
device_lock_assert(>dev);
-   switch (new) {
-   case pci_channel_io_perm_failure:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   case pci_channel_io_perm_failure:
-   changed = true;
-   break;
-   }
-   break;
-   case pci_channel_io_frozen:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   changed = true;
-   break;
-   }
-   break;
-   case pci_channel_io_normal:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   changed = true;
-   break;
-   }
-   break;
-   }
-   if (changed)
-   dev->error_state = new;
-   return changed;
+   if (dev->error_state == new)
+   return false;
+
+   dev->error_state = new;
+   return true;
 }
 
 static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
-- 
2.18.4

[PATCH v6 1/5] PCI/ERR: get device before call device driver to avoid NULL pointer dereference

2020-09-30 Thread Ethan Zhao

During DPC error injection test we found there is race condition between
pciehp and DPC driver, NULL pointer dereference caused panic as following

 # setpci -s 64:02.0 0x196.w=000a
  // 64:02.0 is rootport has DPC capability
 # setpci -s 65:00.0 0x04.w=0544
  // 65:00.0 is NVMe SSD populated in above port
 # mount /dev/nvme0n1p1 nvme

 (tested on stable 5.8 & ICS(Ice Lake SP platform, see
 https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server))

 BUG: kernel NULL pointer dereference, address: 0050
 ...
 CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0-0.0.7.el8.x86_64+ #1
 RIP: 0010:report_error_detected.cold.4+0x7d/0xe6
 Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3 65 ff
 ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50 50 48 83 3a 00
 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0
 RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246
 RAX:  RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01
 RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138
 RBP: ff8e06cf8762fdd0 R08: 00bf R09: 
 R10: 00eb8ebeab53 R11: 93453258 R12: 0002
 R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828
 FS:  () GS:ff4e3eab1fd0() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0
 DR0:  DR1:  DR2: 
 DR3:  DR6: fffe0ff0 DR7: 0400
 PKRU: 5554
 Call Trace:
 ? report_normal_detected+0x20/0x20
 report_frozen_detected+0x16/0x20
 pci_walk_bus+0x75/0x90
 ? dpc_irq+0x90/0x90
 pcie_do_recovery+0x157/0x201
 ? irq_finalize_oneshot.part.47+0xe0/0xe0
 dpc_handler+0x29/0x40
 irq_thread_fn+0x24/0x60
 ...

Debug shows when port DPC feature was enabled and triggered by errors,
DLLSC/PDC/DPC interrupts will be sent to pciehp and DPC driver almost
at the same time, and no delay between them is required by specification.
so DPC driver and pciehp drivers may handle these interrupts cocurrently.

While DPC driver is doing pci_walk_bus() and calling device driver's 
callback without pci_dev_get() to increase device reference count, the
device and its driver instance are likely being freed by 

pci_stop_and_removed_bus_device()
-> pci_dev_put().

So does pci_dev_get() before using the device instance to avoid NULL
pointer dereference.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
---
 v2: revise doc according to Andy's suggestion.
 v3: no change.
 v4: no change.
 v5: no change.
 v6: moved to [1/5] from [3/5] and revised comment according to Lukas'
 suggestion.

 drivers/pci/pcie/err.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index c543f419d8f9..e35c4480c86b 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -52,6 +52,8 @@ static int report_error_detected(struct pci_dev *dev,
pci_ers_result_t vote;
const struct pci_error_handlers *err_handler;
 
+   if (!pci_dev_get(dev))
+   return 0;
device_lock(>dev);
if (!pci_dev_set_io_state(dev, state) ||
!dev->driver ||
@@ -76,6 +78,7 @@ static int report_error_detected(struct pci_dev *dev,
pci_uevent_ers(dev, vote);
*result = merge_result(*result, vote);
device_unlock(>dev);
+   pci_dev_put(dev);
return 0;
 }
 
@@ -94,6 +97,8 @@ static int report_mmio_enabled(struct pci_dev *dev, void 
*data)
pci_ers_result_t vote, *result = data;
const struct pci_error_handlers *err_handler;
 
+   if (!pci_dev_get(dev))
+   return 0;
device_lock(>dev);
if (!dev->driver ||
!dev->driver->err_handler ||
@@ -105,6 +110,7 @@ static int report_mmio_enabled(struct pci_dev *dev, void 
*data)
*result = merge_result(*result, vote);
 out:
device_unlock(>dev);
+   pci_dev_put(dev);
return 0;
 }
 
@@ -113,6 +119,8 @@ static int report_slot_reset(struct pci_dev *dev, void 
*data)
pci_ers_result_t vote, *result = data;
const struct pci_error_handlers *err_handler;
 
+   if (!pci_dev_get(dev))
+   return 0;
device_lock(>dev);
if (!dev->driver ||
!dev->driver->err_handler ||
@@ -124,6 +132,7 @@ static int report_slot_reset(struct pci_dev *dev, void 
*data)
*result = merge_result(*result, vote);
 out:
device_unlock(>dev);
+   pci_dev_put(dev);
return 0;
 }
 
@@ -131,6 +140,8 @@ static int report_resume(struct pci_dev *dev, void *data)
 {
const struct pci_error_handlers *err_handler;
 
+   if (!pci_dev_get(dev))
+   return 0;
device_lock(>dev);
if (!pci_dev_set_io_state(dev, pci_channel_io_norm

Re: [PATCH 2/5 V2] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC

2020-09-29 Thread Ethan Zhao

On Tue, Sep 29, 2020 at 6:08 PM Lukas Wunner  wrote:
>
> On Tue, Sep 29, 2020 at 05:46:41PM +0800, Ethan Zhao wrote:
> > On Tue, Sep 29, 2020 at 4:29 PM Lukas Wunner  wrote:
> > > On Sun, Sep 27, 2020 at 11:27:46AM -0400, Sinan Kaya wrote:
> > > > On 9/26/2020 11:28 PM, Ethan Zhao wrote:
> > > > > --- a/drivers/pci/hotplug/pciehp_hpc.c
> > > > > +++ b/drivers/pci/hotplug/pciehp_hpc.c
> > > > > @@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void 
> > > > > *dev_id)
> > > > > down_read(>reset_lock);
> > > > > if (events & DISABLE_SLOT)
> > > > > pciehp_handle_disable_request(ctrl);
> > > > > -   else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
> > > > > +   else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) {
> > > > > +   pci_wait_port_outdpc(pdev);
> > > > > pciehp_handle_presence_or_link_change(ctrl, events);
> > > > > +   }
> > > > > up_read(>reset_lock);
> > > >
> > > > This looks like a hack TBH.
> [...]
> > > > Why is device lock not protecting this situation?
> > > > Is there a lock missing in hotplug driver?
> > >
> > > According to Ethan's commit message, there are two issues here:
> > > One, that pciehp may remove a device even though DPC recovered the error,
> > > and two, that a null pointer deref occurs.
> > >
> > > The latter is most certainly not a locking issue but failure of DPC
> > > to hold a reference on the pci_dev.
> >
> > This is what patch 3/5 proposed to fix.
>
> Please reorder the series to fix the null pointer deref first,
> i.e. move patch 3 before patch 2.  If the null pointer deref is
> fixed by patch 3, do not mention it in patch 2.

Make sense.

Thanks,
Ethan
>
> Thanks,
>
> Lukas

Re: [PATCH 3/5] PCI/ERR: get device before call device driver to avoid null pointer reference

2020-09-29 Thread Ethan Zhao

On Tue, Sep 29, 2020 at 6:48 PM Andy Shevchenko
 wrote:
>
> On Tue, Sep 29, 2020 at 05:38:00PM +0800, Ethan Zhao wrote:
> > On Tue, Sep 29, 2020 at 4:51 PM Andy Shevchenko
> >  wrote:
> > > On Tue, Sep 29, 2020 at 10:35:14AM +0800, Ethan Zhao wrote:
> > > > Preferred style, there will be cleared comment in v6.
> > >
> > > Avoid top postings.
> > >
> > > > On Sat, Sep 26, 2020 at 12:42 AM Andy Shevchenko
> > > >  wrote:
> > > > > On Thu, Sep 24, 2020 at 10:34:21PM -0400, Ethan Zhao wrote:
>
> ...
>
> > > > > >  Buffer I/O error on dev nvme0n1p1, logical block 468843328,
> > > > > >  async page read
> > > > > >  BUG: kernel NULL pointer dereference, address: 0050
> > > > > >  #PF: supervisor read access in kernel mode
> > > > > >  #PF: error_code(0x) - not-present page
> > > > >
> > > > > Same comment about Oops.
> > >
> > > In another thread it was a good advice to move the full Oops (if you 
> > > think it's
> > > very useful to have) after the cutter '---' line, so it will be in email
> > > archives but Git history.
> >
> > So git history wouldn't give any of the Oops context, and he/she has
> > to access LKML,
> > if offline, then ...lost.
>
> Tell me, do you really think the line:
>#PF: error_code(0x) - not-present page
> makes any sense in the commit message?
>
> I do not think so. And so on, for almost 60% of the Oops.
> Really, to help one person you will make millions suffering. It's not okay.
>
If you and millions  feel so suffered,  why not try to simplify the
Oops code to not
output nonsense too much to console and log. :)

They might not be so important to old birds as you. but it is easy to cut off
the road for newcomers.

Anyway, it is not the focus of this patchset. help to take a look and
try the code.


Thanks,
Ethan

> --
> With Best Regards,
> Andy Shevchenko
>
>

Re: [PATCH 2/5 V2] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC

2020-09-29 Thread Ethan Zhao

On Tue, Sep 29, 2020 at 4:29 PM Lukas Wunner  wrote:
>
> On Sun, Sep 27, 2020 at 11:27:46AM -0400, Sinan Kaya wrote:
> > On 9/26/2020 11:28 PM, Ethan Zhao wrote:
> > > --- a/drivers/pci/hotplug/pciehp_hpc.c
> > > +++ b/drivers/pci/hotplug/pciehp_hpc.c
> > > @@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
> > > down_read(>reset_lock);
> > > if (events & DISABLE_SLOT)
> > > pciehp_handle_disable_request(ctrl);
> > > -   else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
> > > +   else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) {
> > > +   pci_wait_port_outdpc(pdev);
> > > pciehp_handle_presence_or_link_change(ctrl, events);
> > > +   }
> > > up_read(>reset_lock);
> >
> > This looks like a hack TBH.
> >
> > Lukas, Keith;
> >
> > What is your take on this?
> > Why is device lock not protecting this situation?
> >
> > Is there a lock missing in hotplug driver?
>
> According to Ethan's commit message, there are two issues here:
> One, that pciehp may remove a device even though DPC recovered the error,
> and two, that a null pointer deref occurs.
>
> The latter is most certainly not a locking issue but failure of DPC
> to hold a reference on the pci_dev.

This is what patch 3/5 proposed to fix. while this one is to re-order
the mixed DPC
recovery procedure and DLLSC/PDC event handling, to make pciehp to know the
exact recovered result of DPC to malfunctional device  link
recovered, still there,
or is removed from the slot.

Thanks,
Ethan

>
> Thanks,
>
> Lukas

Re: [PATCH 3/5] PCI/ERR: get device before call device driver to avoid null pointer reference

2020-09-29 Thread Ethan Zhao

Andy,

On Tue, Sep 29, 2020 at 4:51 PM Andy Shevchenko
 wrote:
>
> On Tue, Sep 29, 2020 at 10:35:14AM +0800, Ethan Zhao wrote:
> > Preferred style, there will be cleared comment in v6.
>
> Avoid top postings.
>
> > On Sat, Sep 26, 2020 at 12:42 AM Andy Shevchenko
> >  wrote:
> > >
> > > On Thu, Sep 24, 2020 at 10:34:21PM -0400, Ethan Zhao wrote:
> > > > During DPC error injection test we found there is race condition between
> > > > pciehp and DPC driver, null pointer reference caused panic as following
> > >
> > > null -> NULL
> > >
> > > >
> > > >  # setpci -s 64:02.0 0x196.w=000a
> > > >   // 64:02.0 is rootport has DPC capability
> > > >  # setpci -s 65:00.0 0x04.w=0544
> > > >   // 65:00.0 is NVMe SSD populated in above port
> > > >  # mount /dev/nvme0n1p1 nvme
> > > >
> > > >  (tested on stable 5.8 & ICX platform)
> > > >
> > > >  Buffer I/O error on dev nvme0n1p1, logical block 468843328,
> > > >  async page read
> > > >  BUG: kernel NULL pointer dereference, address: 0050
> > > >  #PF: supervisor read access in kernel mode
> > > >  #PF: error_code(0x) - not-present page
> > >
> > > Same comment about Oops.
>
> In another thread it was a good advice to move the full Oops (if you think 
> it's
> very useful to have) after the cutter '---' line, so it will be in email
> archives but Git history.

So git history wouldn't give any of the Oops context, and he/she has
to access LKML,
if offline, then ...lost.

Thanks,
Ethan
>
> --
> With Best Regards,
> Andy Shevchenko
>
>

Re: [PATCH 2/5 V2] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC

2020-09-28 Thread Ethan Zhao

On Tue, Sep 29, 2020 at 12:44 AM Sinan Kaya  wrote:
>
> On 9/28/2020 7:10 AM, Sinan Kaya wrote:
> > On 9/27/2020 10:01 PM, Zhao, Haifeng wrote:
> >> Sinan,
> >>I explained the reason why locks don't protect this case in the patch 
> >> description part.
> >> Write side and read side hold different semaphore and mutex.
> >>
> > I have been thinking about it some time but is there any reason why we
> > have to handle all port AER/DPC/HP events in different threads?
> >
> > Can we go to single threaded event loop for all port drivers events?
> >
> > This will require some refactoring but it wlll eliminate the lock
> > nightmares we are having.
> >
> > This means no sleeping. All sleeps need to happen outside of the loop.
> >
> > I wanted to see what you all are thinking about this.
> >
> > It might become a performance problem if the system is
> > continuously observing a hotplug/aer/dpc events.
> >
> > I always think that these should be rare events.
>
> If restructuring would be too costly, the preferred solution should be
> to fix the locks in hotplug driver rather than throwing there a random
> wait call.

  My first though is to unify the pci_bus_sem & pci_rescan_remove_lock
to one sleepable lock, but verifying every
locking scenario to sort out dead lock warning, it is horrible job. I
gave up and then played the device status waiting trick
to workaround it.

index 03d37128a24f..477d4c499f87 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -3223,17 +3223,19 @@ EXPORT_SYMBOL_GPL(pci_rescan_bus);
  * pci_rescan_bus(), pci_rescan_bus_bridge_resize() and PCI device removal
  * routines should always be executed under this mutex.
  */
-static DEFINE_MUTEX(pci_rescan_remove_lock);
+/* static DEFINE_MUTEX(pci_rescan_remove_lock); */

 void pci_lock_rescan_remove(void)
 {
- mutex_lock(_rescan_remove_lock);
+ /*mutex_lock(_rescan_remove_lock); */
+ down_write(_bus_sem);
 }
 EXPORT_SYMBOL_GPL(pci_lock_rescan_remove);

 void pci_unlock_rescan_remove(void)
 {
- mutex_unlock(_rescan_remove_lock);
+ /*mutex_unlock(_rescan_remove_lock); */
+ up_write(_bus_sem);
 }
 EXPORT_SYMBOL_GPL(pci_unlock_rescan_remove);

Thanks，
Ethan

Re: [PATCH 3/5] PCI/ERR: get device before call device driver to avoid null pointer reference

2020-09-28 Thread Ethan Zhao

Preferred style, there will be cleared comment in v6.

Thanks,
Ethan

On Sat, Sep 26, 2020 at 12:42 AM Andy Shevchenko
 wrote:
>
> On Thu, Sep 24, 2020 at 10:34:21PM -0400, Ethan Zhao wrote:
> > During DPC error injection test we found there is race condition between
> > pciehp and DPC driver, null pointer reference caused panic as following
>
> null -> NULL
>
> >
> >  # setpci -s 64:02.0 0x196.w=000a
> >   // 64:02.0 is rootport has DPC capability
> >  # setpci -s 65:00.0 0x04.w=0544
> >   // 65:00.0 is NVMe SSD populated in above port
> >  # mount /dev/nvme0n1p1 nvme
> >
> >  (tested on stable 5.8 & ICX platform)
> >
> >  Buffer I/O error on dev nvme0n1p1, logical block 468843328,
> >  async page read
> >  BUG: kernel NULL pointer dereference, address: 0050
> >  #PF: supervisor read access in kernel mode
> >  #PF: error_code(0x) - not-present page
>
> Same comment about Oops.
>
> --
> With Best Regards,
> Andy Shevchenko
>
>

Re: [PATCH 1/5 V2] PCI: define a function to check and wait till port finish DPC handling

2020-09-28 Thread Ethan Zhao

Fixed this concern by moving the function to DPC driver and its
declaration to pci.h.  see v5

Thanks,
Ethan

On Sun, Sep 27, 2020 at 2:27 PM Christoph Hellwig  wrote:
>
> > +#ifdef CONFIG_PCIE_DPC
> > +static inline bool pci_wait_port_outdpc(struct pci_dev *pdev)
> > +{
> > + u16 cap = pdev->dpc_cap, status;
> > + u16 loop = 0;
> > +
> > + if (!cap) {
> > + pci_WARN_ONCE(pdev, !cap, "No DPC capability initiated\n");
> > + return false;
> > + }
> > + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
> > + pci_dbg(pdev, "DPC status %x, cap %x\n", status, cap);
> > + while (status & PCI_EXP_DPC_STATUS_TRIGGER && loop < 100) {
> > + msleep(10);
> > + loop++;
> > + pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
> > + }
> > + if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) {
> > + pci_dbg(pdev, "Out of DPC %x, cost %d ms\n", status, loop*10);
> > + return true;
> > + }
> > + pci_dbg(pdev, "Timeout to wait port out of DPC status\n");
> > + return false;
> > +}
>
> I don't think that there is any good reason to have this as an
> inline function.

Re: [PATCH 2/5 V2] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC

2020-09-28 Thread Ethan Zhao

On Tue, Sep 29, 2020 at 12:45 AM Kuppuswamy, Sathyanarayanan
 wrote:
>
>
> On 9/28/20 9:43 AM, Sinan Kaya wrote:
> > On 9/28/2020 7:10 AM, Sinan Kaya wrote:
> >> On 9/27/2020 10:01 PM, Zhao, Haifeng wrote:
> >>> Sinan,
> >>> I explained the reason why locks don't protect this case in the patch 
> >>> description part.
> >>> Write side and read side hold different semaphore and mutex.
> >>>
> >> I have been thinking about it some time but is there any reason why we
> >> have to handle all port AER/DPC/HP events in different threads?
> >>
> >> Can we go to single threaded event loop for all port drivers events?
> >>
> >> This will require some refactoring but it wlll eliminate the lock
> >> nightmares we are having.
> >>
> >> This means no sleeping. All sleeps need to happen outside of the loop.
> >>
> >> I wanted to see what you all are thinking about this.
> >>
> >> It might become a performance problem if the system is
> >> continuously observing a hotplug/aer/dpc events.
> >>
> >> I always think that these should be rare events.
> > If restructuring would be too costly, the preferred solution should be
> > to fix the locks in hotplug driver rather than throwing there a random
> > wait call.
> Since the current race condition is detected between DPC and
> hotplug, I recommend synchronizing them.

The locks are the first place to root cause and try to fix. but not so easy to
refactor the remove-scan-semaphore and the bus-walk-mutex. too expensive
work. --- rework every piece of code that uses them.

Thanks,
Ethan

>
> --
> Sathyanarayanan Kuppuswamy
> Linux Kernel Developer
>

Re: [PATCH 3/5 V55555] PCI/ERR: get device before call device driver to avoid NULL pointer reference

2020-09-28 Thread Ethan Zhao

On Mon, Sep 28, 2020 at 4:46 PM Andy Shevchenko
 wrote:
>
> On Mon, Sep 28, 2020 at 7:13 AM Ethan Zhao  wrote:
>
> Same comments as per v4.
> Also you have an issue in versioning here. Use -v parameter to `git
> format-patch`, it will do it for you nicely.

Aha, git has got this function. I thought it was still manual
work. great tip.!

Thanks,
Ethan

>
> --
> With Best Regards,
> Andy Shevchenko

Re: [PATCH 2/5 V5] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC

2020-09-28 Thread Ethan Zhao

On Mon, Sep 28, 2020 at 4:45 PM Andy Shevchenko
 wrote:
>
> On Mon, Sep 28, 2020 at 7:10 AM Ethan Zhao  wrote:
>
> We didn't settle on the v4, why v5?

We could fix it with v6, v5 is used to fix other things.

>
> > When root port has DPC capability and it is enabled, then triggered by
> > errors, DPC DLLSC and PDC interrupts will be sent to DPC driver, pciehp
> > driver at the same time.
> > That will cause following result:
> >
> > 1. Link and device are recovered by hardware DPC and software DPC driver,
> >device
> >isn't removed, but the pciehp might treat it as device was hot removed.
> >
> > 2. Race condition happens bettween pciehp_unconfigure_device() called by
> >pciehp_ist() in pciehp driver and pci_do_recovery() called by
> >dpc_handler in DPC driver. no luck, there is no lock to protect
> >pci_stop_and_remove_bus_device()
> >against pci_walk_bus(), they hold different samphore and mutex,
> >pci_stop_and_remove_bus_device holds pci_rescan_remove_lock, and
> >pci_walk_bus() holds pci_bus_sem.
> >
> > This race condition is not purely code analysis, it could be triggered by
> > following command series:
> >
> >   # setpci -s 64:02.0 0x196.w=000a // 64:02.0 rootport has DPC capability
> >   # setpci -s 65:00.0 0x04.w=0544  // 65:00.0 NVMe SSD populated in port
> >   # mount /dev/nvme0n1p1 nvme
> >
> > One shot will cause system panic and NULL pointer reference happened.
> > (tested on stable 5.8 & ICS(Ice Lake SP platform, see
> > https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server))
>
> ...
>
> >Buffer I/O error on dev nvme0n1p1, logical block 3328, async page read
> >BUG: kernel NULL pointer dereference, address: 0050
> >#PF: supervisor read access in kernel mode
> >#PF: error_code(0x) - not-present page
> >PGD 0
> >Oops:  [#1] SMP NOPTI
> >CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0 el8.x86_64+ #1
> >RIP: 0010:report_error_detected.cold.4+0x7d/0xe6
> >Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3
> >65 ff ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50
> >50 48 83 3a 00 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0
> >RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246
> >RAX:  RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01
> >RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138
> >RBP: ff8e06cf8762fdd0 R08: 00bf R09: 
> >R10: 00eb8ebeab53 R11: 93453258 R12: 0002
> >R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828
> >FS:  () GS:ff4e3eab1fd0() knl
> >GS:
> >CS:  0010 DS:  ES:  CR0: 80050033
> >CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0
> >DR0:  DR1:  DR2: 
> >DR3:  DR6: fffe0ff0 DR7: 0400
> >PKRU: 5554
> >Call Trace:
> >? report_normal_detected+0x20/0x20
> >report_frozen_detected+0x16/0x20
> >pci_walk_bus+0x75/0x90
> >? dpc_irq+0x90/0x90
> >pcie_do_recovery+0x157/0x201
> >? irq_finalize_oneshot.part.47+0xe0/0xe0
> >dpc_handler+0x29/0x40
> >irq_thread_fn+0x24/0x60
> >irq_thread+0xea/0x170
> >? irq_forced_thread_fn+0x80/0x80
> >? irq_thread_check_affinity+0xf0/0xf0
> >kthread+0x124/0x140
> >? kthread_park+0x90/0x90
> >ret_from_fork+0x1f/0x30
> >Modules linked in: nft_fib_inet.
> >CR2: 0050
>
> Do not pollute the commit messages with irrelevant information.

That is whole panic, Andy, someone like the whole Oops, If I removed something,
Will someone say 'why not integral，why removed somehting  ‘
>
> --
> With Best Regards,
> Andy Shevchenko

Re: [PATCH v3 1/1] PCI/ERR: Fix reset logic in pcie_do_recovery() call

2020-09-28 Thread Ethan Zhao

Sathyanarayanan,

On Mon, Sep 28, 2020 at 10:44 AM Kuppuswamy, Sathyanarayanan
 wrote:
>
> Hi,
>
> On 9/25/20 11:30 AM, Sinan Kaya wrote:
> > On 9/25/2020 2:16 PM, Kuppuswamy, Sathyanarayanan wrote:
> >>>
> >>> If this is a too involved change, DPC driver should restore state
> >>> when hotplug is not supported.
> >> Yes. we can add a condition for hotplug capability check.
> >>>
> >>> DPC driver should be self-sufficient by itself.
> >>>
> >
> > Sounds good.
> >
>  Also for non-fatal errors, if reset is requested then we still need
>  some kind of bus reset call here
> >>>
> >>> DPC should handle both fatal and non-fatal cases
> >> Currently DPC is only triggered for FATAL errors.
> >>   and cause a bus reset
> >
> > Thanks for the heads up.
> > This seems to have changed since I looked at the DPC code.
> >
> >>> in hardware already before triggering an interrupt.
> >> Error recovery is not triggered only DPC driver. AER also uses the
> >> same error recovery code. If DPC is not supported, then we still need
> >> reset logic.
> >
> > It sounds like we are cross-talking two issues.
> >
> > 1. no state restore on DPC after FATAL error.
> > Let's fix this.
> Agree. Few more detail about the above issue is,
>
> There are two cases under FATAL error.
>
> FATAL + hotplug - In this case, link will be reseted. And hotplug handler
> will remove the driver state. This case works well with current code.
>
> FATAL + no-hotplug - In this case, link will still be reseted. But
> currently driver state is not properly restored. So I attempted
> to restore it using pci_reset_bus().

Seems you should fix something at device driver side, not do double-reset in
DPC driver, one reset is done by hardware, and you want to do another by
DPC driver ?

Why hardware initiated reset is not enough for you ?

Thanks,
Ethan

> status = reset_link(dev);
> -   if (status != PCI_ERS_RESULT_RECOVERED) {
> +   if (status == PCI_ERS_RESULT_RECOVERED) {
> +   status = PCI_ERS_RESULT_NEED_RESET;
>
> ...
>
> if (status == PCI_ERS_RESULT_NEED_RESET) {
> /*
> -* TODO: Should call platform-specific
> -* functions to reset slot before calling
> -* drivers' slot_reset callbacks?
> +* TODO: Optimize the call to pci_reset_bus()
> +*
> +* There are two components to pci_reset_bus().
> +*
> +* 1. Do platform specific slot/bus reset.
> +* 2. Save/Restore all devices in the bus.
> +*
> +* For hotplug capable devices and fatal errors,
> +* device is already in reset state due to link
> +* reset. So repeating platform specific slot/bus
> +* reset via pci_reset_bus() call is redundant. So
> +* can optimize this logic and conditionally call
> +* pci_reset_bus().
>  */
> +   pci_reset_bus(dev);
>
> >
> > 2. no bus reset on NON_FATAL error through AER driver path.
> > This already tells me that you need to split your change into
> > multiple patches.
> >
> > Let's talk about this too. bus reset should be triggered via
> > AER driver before informing the recovery.
> But as per error recovery documentation, any call to
> ->error_detected() or ->mmio_enabled() can request
> PCI_ERS_RESULT_NEED_RESET. So we need to add code
> to do the actual reset before calling ->slot_reset()
> callback. So call to pci_reset_bus() fixes this
> issue.
>
> if (status == PCI_ERS_RESULT_NEED_RESET) {
> +   pci_reset_bus(dev);
>
>
> >
> >   if (status == PCI_ERS_RESULT_NEED_RESET) {
> >   /*
> >* TODO: Should call platform-specific
> >* functions to reset slot before calling
> >* drivers' slot_reset callbacks?
> >*/
> >   status = PCI_ERS_RESULT_RECOVERED;
> >   pci_dbg(dev, "broadcast slot_reset message\n");
> >   pci_walk_bus(bus, report_slot_reset, );
> >   }
> >
>
> --
> Sathyanarayanan Kuppuswamy
> Linux Kernel Developer

[PATCH 5/5 V5] PCI/ERR: don't mix io state not changed and no driver together

2020-09-27 Thread Ethan Zhao

When we see 'can't recover (no error_detected callback)' on console,
Maybe the reason is io state is not changed by calling
pci_dev_set_io_state(), that is confused. fix it.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
---
Chagnes:
 V2: no change.
 V3: no change.
 V4: no change.
 V5: no change.

 drivers/pci/pcie/err.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index e35c4480c86b..d85f27c90c26 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -55,8 +55,10 @@ static int report_error_detected(struct pci_dev *dev,
if (!pci_dev_get(dev))
return 0;
device_lock(>dev);
-   if (!pci_dev_set_io_state(dev, state) ||
-   !dev->driver ||
+   if (!pci_dev_set_io_state(dev, state)) {
+   pci_dbg(dev, "Device might already being in error handling 
...\n");
+   vote = PCI_ERS_RESULT_NONE;
+   } else if (!dev->driver ||
!dev->driver->err_handler ||
!dev->driver->err_handler->error_detected) {
/*
-- 
2.18.4

[PATCH 2/5 V5] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC

2020-09-27 Thread Ethan Zhao

When root port has DPC capability and it is enabled, then triggered by
errors, DPC DLLSC and PDC interrupts will be sent to DPC driver, pciehp
driver at the same time.
That will cause following result:

1. Link and device are recovered by hardware DPC and software DPC driver, 
   device
   isn't removed, but the pciehp might treat it as device was hot removed.

2. Race condition happens bettween pciehp_unconfigure_device() called by
   pciehp_ist() in pciehp driver and pci_do_recovery() called by
   dpc_handler in DPC driver. no luck, there is no lock to protect 
   pci_stop_and_remove_bus_device()
   against pci_walk_bus(), they hold different samphore and mutex,
   pci_stop_and_remove_bus_device holds pci_rescan_remove_lock, and
   pci_walk_bus() holds pci_bus_sem.

This race condition is not purely code analysis, it could be triggered by
following command series:

  # setpci -s 64:02.0 0x196.w=000a // 64:02.0 rootport has DPC capability
  # setpci -s 65:00.0 0x04.w=0544  // 65:00.0 NVMe SSD populated in port
  # mount /dev/nvme0n1p1 nvme

One shot will cause system panic and NULL pointer reference happened.
(tested on stable 5.8 & ICS(Ice Lake SP platform, see
https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server))

   Buffer I/O error on dev nvme0n1p1, logical block 3328, async page read
   BUG: kernel NULL pointer dereference, address: 0050
   #PF: supervisor read access in kernel mode
   #PF: error_code(0x) - not-present page
   PGD 0
   Oops:  [#1] SMP NOPTI
   CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0 el8.x86_64+ #1
   RIP: 0010:report_error_detected.cold.4+0x7d/0xe6
   Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3
   65 ff ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50
   50 48 83 3a 00 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0
   RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246
   RAX:  RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01
   RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138
   RBP: ff8e06cf8762fdd0 R08: 00bf R09: 
   R10: 00eb8ebeab53 R11: 93453258 R12: 0002
   R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828
   FS:  () GS:ff4e3eab1fd0() knl
   GS:
   CS:  0010 DS:  ES:  CR0: 80050033
   CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0
   DR0:  DR1:  DR2: 
   DR3:  DR6: fffe0ff0 DR7: 0400
   PKRU: 5554
   Call Trace:
   ? report_normal_detected+0x20/0x20
   report_frozen_detected+0x16/0x20
   pci_walk_bus+0x75/0x90
   ? dpc_irq+0x90/0x90
   pcie_do_recovery+0x157/0x201
   ? irq_finalize_oneshot.part.47+0xe0/0xe0
   dpc_handler+0x29/0x40
   irq_thread_fn+0x24/0x60
   irq_thread+0xea/0x170
   ? irq_forced_thread_fn+0x80/0x80
   ? irq_thread_check_affinity+0xf0/0xf0
   kthread+0x124/0x140
   ? kthread_park+0x90/0x90
   ret_from_fork+0x1f/0x30
   Modules linked in: nft_fib_inet.
   CR2: 0050

With this patch, the handling flow of DPC containment and hotplug is
partly ordered and serialized, let hardware DPC do the controller reset
etc recovery action first, then DPC driver handling the call-back from
device drivers, clear the DPC status, at the end, pciehp handle the DLLSC
and PDC etc.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
---
Changes:
 V2: revise doc according to Andy's suggestion.
 V3: no change.
 V4: no change.
 V5: no change.

 drivers/pci/hotplug/pciehp_hpc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 53433b37e181..6f271160f18d 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
down_read(>reset_lock);
if (events & DISABLE_SLOT)
pciehp_handle_disable_request(ctrl);
-   else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
+   else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) {
+   pci_wait_port_outdpc(pdev);
pciehp_handle_presence_or_link_change(ctrl, events);
+   }
up_read(>reset_lock);
 
ret = IRQ_HANDLED;
-- 
2.18.4

[PATCH 1/5 V5] PCI: define a function to check and wait till port finish DPC handling

2020-09-27 Thread Ethan Zhao

Once root port DPC capability is enabled and triggered, at the beginning
of DPC is triggered, the DPC status bits are set by hardware and then
sends DPC/DLLSC/PDC interrupts to OS DPC and pciehp drivers, it will
take the port and software DPC interrupt handler 10ms to 50ms (test data
on ICS(Ice Lake SP platform, see
https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)
& stable 5.9-rc6) to complete the DPC containment procedure
till the DPC status is cleared at the end of the DPC interrupt handler.

We use this function to check if the root port is in DPC handling status
and wait till the hardware and software completed the procedure.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
---
changes:
 V2：align ICS code name to public doc.
 V3: no change.
 V4: response to Christoph's (Christoph Hellwig ) 
 tip, move pci_wait_port_outdpc() to DPC driver and its declaration
 to pci.h.
 V5: fix building issue reported by l...@intel.com with some config.

 drivers/pci/pci.h  |  2 ++
 drivers/pci/pcie/dpc.c | 27 +++
 2 files changed, 29 insertions(+)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fa12f7cbc1a0..8fdb0d823d5a 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -455,10 +455,12 @@ void pci_restore_dpc_state(struct pci_dev *dev);
 void pci_dpc_init(struct pci_dev *pdev);
 void dpc_process_error(struct pci_dev *pdev);
 pci_ers_result_t dpc_reset_link(struct pci_dev *pdev);
+bool pci_wait_port_outdpc(struct pci_dev *pdev);
 #else
 static inline void pci_save_dpc_state(struct pci_dev *dev) {}
 static inline void pci_restore_dpc_state(struct pci_dev *dev) {}
 static inline void pci_dpc_init(struct pci_dev *pdev) {}
+static inline bool pci_wait_port_outdpc(struct pci_dev *pdev) { return false; }
 #endif
 
 #ifdef CONFIG_PCI_ATS
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index daa9a4153776..2e0e091ce923 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -71,6 +71,33 @@ void pci_restore_dpc_state(struct pci_dev *dev)
pci_write_config_word(dev, dev->dpc_cap + PCI_EXP_DPC_CTL, *cap);
 }
 
+bool pci_wait_port_outdpc(struct pci_dev *pdev)
+{
+   u16 cap = pdev->dpc_cap, status;
+   u16 loop = 0;
+
+   if (!cap) {
+   pci_WARN_ONCE(pdev, !cap, "No DPC capability initiated\n");
+   return false;
+   }
+   pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
+   pci_dbg(pdev, "DPC status %x, cap %x\n", status, cap);
+
+   while (status & PCI_EXP_DPC_STATUS_TRIGGER && loop < 100) {
+   msleep(10);
+   loop++;
+   pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
+   }
+
+   if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) {
+   pci_dbg(pdev, "Out of DPC %x, cost %d ms\n", status, loop*10);
+   return true;
+   }
+
+   pci_dbg(pdev, "Timeout to wait port out of DPC status\n");
+   return false;
+}
+
 static int dpc_wait_rp_inactive(struct pci_dev *pdev)
 {
unsigned long timeout = jiffies + HZ;
-- 
2.18.4

[PATCH 0/5 V5] Fix DPC hotplug race and enhance error handling

2020-09-27 Thread Ethan Zhao

This simple patch set fixed some serious security issues found when DPC
error injection and NVMe SSD hotplug brute force test were doing -- race
condition between DPC handler and pciehp, AER interrupt handlers, caused
system hang and system with DPC feature couldn't recover to normal
working state as expected (NVMe instance lost, mount operation hang,
race PCIe access caused uncorrectable errors reported alternatively etc).

With this patch set applied, stable 5.9-rc6 on ICS (Ice Lake SP platform,
see
https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server))

could pass the PCIe Gen4 NVMe SSD brute force hotplug test with any time 
interval between hot-remove and plug-in operation tens of times without
any errors occur and system works normal.

With this patch set applied, system with DPC feature could recover from
NON-FATAL and FATAL errors injection test and works as expected.

System works smoothly when errors happen while hotplug is doing, no
uncorrectable errors found.

Brute DPC error injection script:

for i in {0..100}
do
setpci -s 64:02.0 0x196.w=000a 
setpci -s 65:00.0 0x04.w=0544 
mount /dev/nvme0n1p1 /root/nvme
sleep 1
done

Other details see every commits description part.

This patch set could be applied to stable 5.9-rc6 directly.

Help to review and test.

V2: changed according to review by Andy Shevchenko.
V3: changed patch 4/5 to simpler coding.
V4: move function pci_wait_port_outdpc() to DPC driver and its
declaration to pci.h. (tip from Christoph Hellwig ).
V5: fix building issue reported by l...@intel.com with some config.

Thanks,
Ethan


Ethan Zhao (5):
  PCI: define a function to check and wait till port finish DPC handling
  PCI: pciehp: check and wait port status out of DPC before handling
DLLSC and PDC
  PCI/ERR: get device before call device driver to avoid NULL pointer
reference
  PCI: only return true when dev io state is really changed
  PCI/ERR: don't mix io state not changed and no driver together

 drivers/pci/hotplug/pciehp_hpc.c |  4 +++-
 drivers/pci/pci.h| 34 +---
 drivers/pci/pcie/err.c   | 18 +++--
 include/linux/pci.h  | 31 +
 4 files changed, 55 insertions(+), 32 deletions(-)

-- 
2.18.4

[PATCH 4/5 V4] PCI: only return true when dev io state is really changed

2020-09-27 Thread Ethan Zhao

When uncorrectable error happens, AER driver and DPC driver interrupt
handlers likely call

   pcie_do_recovery()
   ->pci_walk_bus()
 ->report_frozen_detected()

with pci_channel_io_frozen the same time.
   If pci_dev_set_io_state() return true even if the original state is
pci_channel_io_frozen, that will cause AER or DPC handler re-enter
the error detecting and recovery procedure one after another.
   The result is the recovery flow mixed between AER and DPC.
So simplify the pci_dev_set_io_state() function to only return true
when dev->error_state is changed.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
Reviewed-by: Alexandru Gagniuc 
---
Changnes:
 V2: revise description and code according to suggestion from Andy.
 V3: change code to simpler.
 V4: no change.
 V5: no change.
 
 drivers/pci/pci.h | 37 +
 1 file changed, 5 insertions(+), 32 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fa12f7cbc1a0..a2c1c7d5f494 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -359,39 +359,12 @@ struct pci_sriov {
 static inline bool pci_dev_set_io_state(struct pci_dev *dev,
pci_channel_state_t new)
 {
-   bool changed = false;
-
device_lock_assert(>dev);
-   switch (new) {
-   case pci_channel_io_perm_failure:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   case pci_channel_io_perm_failure:
-   changed = true;
-   break;
-   }
-   break;
-   case pci_channel_io_frozen:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   changed = true;
-   break;
-   }
-   break;
-   case pci_channel_io_normal:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   changed = true;
-   break;
-   }
-   break;
-   }
-   if (changed)
-   dev->error_state = new;
-   return changed;
+   if (dev->error_state == new)
+   return false;
+
+   dev->error_state = new;
+   return true;
 }
 
 static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
-- 
2.18.4

[PATCH 3/5 V55555] PCI/ERR: get device before call device driver to avoid NULL pointer reference

2020-09-27 Thread Ethan Zhao

During DPC error injection test we found there is race condition between
pciehp and DPC driver, NULL pointer reference caused panic as following

 # setpci -s 64:02.0 0x196.w=000a
  // 64:02.0 is rootport has DPC capability
 # setpci -s 65:00.0 0x04.w=0544
  // 65:00.0 is NVMe SSD populated in above port
 # mount /dev/nvme0n1p1 nvme

 (tested on stable 5.8 & ICS(Ice Lake SP platform, see
 https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server))

 Buffer I/O error on dev nvme0n1p1, logical block 468843328,
 async page read
 BUG: kernel NULL pointer dereference, address: 0050
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x) - not-present page
 PGD 0
 Oops:  [#1] SMP NOPTI
 CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0-0.0.7.el8.x86_64+ #1
 RIP: 0010:report_error_detected.cold.4+0x7d/0xe6
 Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3 65 ff
 ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50 50 48 83 3a 00
 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0
 RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246
 RAX:  RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01
 RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138
 RBP: ff8e06cf8762fdd0 R08: 00bf R09: 
 R10: 00eb8ebeab53 R11: 93453258 R12: 0002
 R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828
 FS:  () GS:ff4e3eab1fd0() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0
 DR0:  DR1:  DR2: 
 DR3:  DR6: fffe0ff0 DR7: 0400
 PKRU: 5554
 Call Trace:
 ? report_normal_detected+0x20/0x20
 report_frozen_detected+0x16/0x20
 pci_walk_bus+0x75/0x90
 ? dpc_irq+0x90/0x90
 pcie_do_recovery+0x157/0x201
 ? irq_finalize_oneshot.part.47+0xe0/0xe0
 dpc_handler+0x29/0x40
 irq_thread_fn+0x24/0x60
 irq_thread+0xea/0x170
 ? irq_forced_thread_fn+0x80/0x80
 ? irq_thread_check_affinity+0xf0/0xf0
 kthread+0x124/0x140
 ? kthread_park+0x90/0x90
 ret_from_fork+0x1f/0x30
 Modules linked in: nft_fib_inet.
 CR2: 0050

Though we partly close the race condition with patch 'PCI: pciehp: check
and wait port status out of DPC before handling DLLSC and PDC', but there
is no hardware spec or software sequence to guarantee the pcie_ist() run
into pci_wait_port_outdpc() first or DPC triggered status bits being set
first when errors triggered DPC containment procedure, so device still
could be removed by function pci_stop_and_removed_bus_device() then freed
by pci_dev_put() in pciehp driver first during pcie_do_recover()/
pci_walk_bus() is called by dpc_handler() in DPC driver.

Maybe unify pci_bus_sem and pci_rescan_remove_lock to serialize the
removal and walking operation is the right way, but here we use
pci_dev_get() to increase the reference count of device before using the
device to avoid it is freed in use.

With this patch and patch 'PCI: pciehp: check and wait port status out of
DPC before handling DLLSC and PDC', stable 5.9-rc6 could pass the error
injection test and no panic happened.

Brute DPC error injection script:

for i in {0..100}
do
setpci -s 64:02.0 0x196.w=000a
setpci -s 65:00.0 0x04.w=0544
mount /dev/nvme0n1p1 /root/nvme
sleep 1
done

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
---
Changes:
 V2: revise doc according to Andy's suggestion.
 V3: no change.
 V4: no change.
 V5: no change.

 drivers/pci/pcie/err.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index c543f419d8f9..e35c4480c86b 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -52,6 +52,8 @@ static int report_error_detected(struct pci_dev *dev,
pci_ers_result_t vote;
const struct pci_error_handlers *err_handler;
 
+   if (!pci_dev_get(dev))
+   return 0;
device_lock(>dev);
if (!pci_dev_set_io_state(dev, state) ||
!dev->driver ||
@@ -76,6 +78,7 @@ static int report_error_detected(struct pci_dev *dev,
pci_uevent_ers(dev, vote);
*result = merge_result(*result, vote);
device_unlock(>dev);
+   pci_dev_put(dev);
return 0;
 }
 
@@ -94,6 +97,8 @@ static int report_mmio_enabled(struct pci_dev *dev, void 
*data)
pci_ers_result_t vote, *result = data;
const struct pci_error_handlers *err_handler;
 
+   if (!pci_dev_get(dev))
+   return 0;
device_lock(>dev);
if (!dev->driver ||
!dev->driver->err_handler ||
@@ -105,6 +110,7 @@ static int report_mmio_enabled(struct pci_dev *dev, void 
*data)
*result = merge_result(*result, vote);
 out:
device_unlock(>dev);
+

[PATCH 5/5 V4] PCI/ERR: don't mix io state not changed and no driver together

2020-09-27 Thread Ethan Zhao

When we see 'can't recover (no error_detected callback)' on console,
Maybe the reason is io state is not changed by calling
pci_dev_set_io_state(), that is confused. fix it.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
---
Chagnes:
 V2: no change.
 V3: no change.
 V4: no change.

 drivers/pci/pcie/err.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index e35c4480c86b..d85f27c90c26 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -55,8 +55,10 @@ static int report_error_detected(struct pci_dev *dev,
if (!pci_dev_get(dev))
return 0;
device_lock(>dev);
-   if (!pci_dev_set_io_state(dev, state) ||
-   !dev->driver ||
+   if (!pci_dev_set_io_state(dev, state)) {
+   pci_dbg(dev, "Device might already being in error handling 
...\n");
+   vote = PCI_ERS_RESULT_NONE;
+   } else if (!dev->driver ||
!dev->driver->err_handler ||
!dev->driver->err_handler->error_detected) {
/*
-- 
2.18.4

[PATCH 3/5 V4] PCI/ERR: get device before call device driver to avoid NULL pointer reference

2020-09-27 Thread Ethan Zhao

During DPC error injection test we found there is race condition between
pciehp and DPC driver, NULL pointer reference caused panic as following

 # setpci -s 64:02.0 0x196.w=000a
  // 64:02.0 is rootport has DPC capability
 # setpci -s 65:00.0 0x04.w=0544
  // 65:00.0 is NVMe SSD populated in above port
 # mount /dev/nvme0n1p1 nvme

 (tested on stable 5.8 & ICS(Ice Lake SP platform, see
 https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server))

 Buffer I/O error on dev nvme0n1p1, logical block 468843328,
 async page read
 BUG: kernel NULL pointer dereference, address: 0050
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x) - not-present page
 PGD 0
 Oops:  [#1] SMP NOPTI
 CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0-0.0.7.el8.x86_64+ #1
 RIP: 0010:report_error_detected.cold.4+0x7d/0xe6
 Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3 65 ff
 ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50 50 48 83 3a 00
 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0
 RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246
 RAX:  RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01
 RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138
 RBP: ff8e06cf8762fdd0 R08: 00bf R09: 
 R10: 00eb8ebeab53 R11: 93453258 R12: 0002
 R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828
 FS:  () GS:ff4e3eab1fd0() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0
 DR0:  DR1:  DR2: 
 DR3:  DR6: fffe0ff0 DR7: 0400
 PKRU: 5554
 Call Trace:
 ? report_normal_detected+0x20/0x20
 report_frozen_detected+0x16/0x20
 pci_walk_bus+0x75/0x90
 ? dpc_irq+0x90/0x90
 pcie_do_recovery+0x157/0x201
 ? irq_finalize_oneshot.part.47+0xe0/0xe0
 dpc_handler+0x29/0x40
 irq_thread_fn+0x24/0x60
 irq_thread+0xea/0x170
 ? irq_forced_thread_fn+0x80/0x80
 ? irq_thread_check_affinity+0xf0/0xf0
 kthread+0x124/0x140
 ? kthread_park+0x90/0x90
 ret_from_fork+0x1f/0x30
 Modules linked in: nft_fib_inet.
 CR2: 0050

Though we partly close the race condition with patch 'PCI: pciehp: check
and wait port status out of DPC before handling DLLSC and PDC', but there
is no hardware spec or software sequence to guarantee the pcie_ist() run
into pci_wait_port_outdpc() first or DPC triggered status bits being set
first when errors triggered DPC containment procedure, so device still
could be removed by function pci_stop_and_removed_bus_device() then freed
by pci_dev_put() in pciehp driver first during pcie_do_recover()/
pci_walk_bus() is called by dpc_handler() in DPC driver.

Maybe unify pci_bus_sem and pci_rescan_remove_lock to serialize the
removal and walking operation is the right way, but here we use
pci_dev_get() to increase the reference count of device before using the
device to avoid it is freed in use.

With this patch and patch 'PCI: pciehp: check and wait port status out of
DPC before handling DLLSC and PDC', stable 5.9-rc6 could pass the error
injection test and no panic happened.

Brute DPC error injection script:

for i in {0..100}
do
setpci -s 64:02.0 0x196.w=000a
setpci -s 65:00.0 0x04.w=0544
mount /dev/nvme0n1p1 /root/nvme
sleep 1
done

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
Reviewed-by: Andy Shevchenko 
---
Changes:
 V2: revise doc according to Andy's suggestion.
 V3: no change.
 V4: no change.

 drivers/pci/pcie/err.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index c543f419d8f9..e35c4480c86b 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -52,6 +52,8 @@ static int report_error_detected(struct pci_dev *dev,
pci_ers_result_t vote;
const struct pci_error_handlers *err_handler;
 
+   if (!pci_dev_get(dev))
+   return 0;
device_lock(>dev);
if (!pci_dev_set_io_state(dev, state) ||
!dev->driver ||
@@ -76,6 +78,7 @@ static int report_error_detected(struct pci_dev *dev,
pci_uevent_ers(dev, vote);
*result = merge_result(*result, vote);
device_unlock(>dev);
+   pci_dev_put(dev);
return 0;
 }
 
@@ -94,6 +97,8 @@ static int report_mmio_enabled(struct pci_dev *dev, void 
*data)
pci_ers_result_t vote, *result = data;
const struct pci_error_handlers *err_handler;
 
+   if (!pci_dev_get(dev))
+   return 0;
device_lock(>dev);
if (!dev->driver ||
!dev->driver->err_handler ||
@@ -105,6 +110,7 @@ static int report_mmio_enabled(struct pci_dev *dev, void 
*data)
*result = merge_result(*result, vote);
 out:
device_unlock(>dev);
+

[PATCH 4/5 V4] PCI: only return true when dev io state is really changed

2020-09-27 Thread Ethan Zhao

When uncorrectable error happens, AER driver and DPC driver interrupt
handlers likely call

   pcie_do_recovery()
   ->pci_walk_bus()
 ->report_frozen_detected()

with pci_channel_io_frozen the same time.
   If pci_dev_set_io_state() return true even if the original state is
pci_channel_io_frozen, that will cause AER or DPC handler re-enter
the error detecting and recovery procedure one after another.
   The result is the recovery flow mixed between AER and DPC.
So simplify the pci_dev_set_io_state() function to only return true
when dev->error_state is changed.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
Reviewed-by: Andy Shevchenko 
Reviewed-by: Alexandru Gagniuc 
Reviewed-by: Joe Perches 
---
Changnes:
 V2: revise description and code according to suggestion from Andy.
 V3: change code to simpler.
 V4: no change.
 
 drivers/pci/pci.h | 37 +
 1 file changed, 5 insertions(+), 32 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fa12f7cbc1a0..a2c1c7d5f494 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -359,39 +359,12 @@ struct pci_sriov {
 static inline bool pci_dev_set_io_state(struct pci_dev *dev,
pci_channel_state_t new)
 {
-   bool changed = false;
-
device_lock_assert(>dev);
-   switch (new) {
-   case pci_channel_io_perm_failure:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   case pci_channel_io_perm_failure:
-   changed = true;
-   break;
-   }
-   break;
-   case pci_channel_io_frozen:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   changed = true;
-   break;
-   }
-   break;
-   case pci_channel_io_normal:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   changed = true;
-   break;
-   }
-   break;
-   }
-   if (changed)
-   dev->error_state = new;
-   return changed;
+   if (dev->error_state == new)
+   return false;
+
+   dev->error_state = new;
+   return true;
 }
 
 static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
-- 
2.18.4

[PATCH 2/5 V4] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC

2020-09-27 Thread Ethan Zhao

When root port has DPC capability and it is enabled, then triggered by
errors, DPC DLLSC and PDC interrupts will be sent to DPC driver, pciehp
driver at the same time.
That will cause following result:

1. Link and device are recovered by hardware DPC and software DPC driver, 
   device
   isn't removed, but the pciehp might treat it as device was hot removed.

2. Race condition happens bettween pciehp_unconfigure_device() called by
   pciehp_ist() in pciehp driver and pci_do_recovery() called by
   dpc_handler in DPC driver. no luck, there is no lock to protect 
   pci_stop_and_remove_bus_device()
   against pci_walk_bus(), they hold different samphore and mutex,
   pci_stop_and_remove_bus_device holds pci_rescan_remove_lock, and
   pci_walk_bus() holds pci_bus_sem.

This race condition is not purely code analysis, it could be triggered by
following command series:

  # setpci -s 64:02.0 0x196.w=000a // 64:02.0 rootport has DPC capability
  # setpci -s 65:00.0 0x04.w=0544  // 65:00.0 NVMe SSD populated in port
  # mount /dev/nvme0n1p1 nvme

One shot will cause system panic and NULL pointer reference happened.
(tested on stable 5.8 & ICS(Ice Lake SP platform, see
https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server))

   Buffer I/O error on dev nvme0n1p1, logical block 3328, async page read
   BUG: kernel NULL pointer dereference, address: 0050
   #PF: supervisor read access in kernel mode
   #PF: error_code(0x) - not-present page
   PGD 0
   Oops:  [#1] SMP NOPTI
   CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0 el8.x86_64+ #1
   RIP: 0010:report_error_detected.cold.4+0x7d/0xe6
   Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3
   65 ff ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50
   50 48 83 3a 00 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0
   RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246
   RAX:  RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01
   RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138
   RBP: ff8e06cf8762fdd0 R08: 00bf R09: 
   R10: 00eb8ebeab53 R11: 93453258 R12: 0002
   R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828
   FS:  () GS:ff4e3eab1fd0() knl
   GS:
   CS:  0010 DS:  ES:  CR0: 80050033
   CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0
   DR0:  DR1:  DR2: 
   DR3:  DR6: fffe0ff0 DR7: 0400
   PKRU: 5554
   Call Trace:
   ? report_normal_detected+0x20/0x20
   report_frozen_detected+0x16/0x20
   pci_walk_bus+0x75/0x90
   ? dpc_irq+0x90/0x90
   pcie_do_recovery+0x157/0x201
   ? irq_finalize_oneshot.part.47+0xe0/0xe0
   dpc_handler+0x29/0x40
   irq_thread_fn+0x24/0x60
   irq_thread+0xea/0x170
   ? irq_forced_thread_fn+0x80/0x80
   ? irq_thread_check_affinity+0xf0/0xf0
   kthread+0x124/0x140
   ? kthread_park+0x90/0x90
   ret_from_fork+0x1f/0x30
   Modules linked in: nft_fib_inet.
   CR2: 0050

With this patch, the handling flow of DPC containment and hotplug is
partly ordered and serialized, let hardware DPC do the controller reset
etc recovery action first, then DPC driver handling the call-back from
device drivers, clear the DPC status, at the end, pciehp handle the DLLSC
and PDC etc.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
Reviewed-by: Andy Shevchenko 
---
Changes:
 V2: revise doc according to Andy's suggestion.
 V3: no change.
 V4: no change.

 drivers/pci/hotplug/pciehp_hpc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 53433b37e181..6f271160f18d 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
down_read(>reset_lock);
if (events & DISABLE_SLOT)
pciehp_handle_disable_request(ctrl);
-   else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
+   else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) {
+   pci_wait_port_outdpc(pdev);
pciehp_handle_presence_or_link_change(ctrl, events);
+   }
up_read(>reset_lock);
 
ret = IRQ_HANDLED;
-- 
2.18.4

[PATCH 0/5 V4] Fix DPC hotplug race and enhance error handling

2020-09-27 Thread Ethan Zhao

This simple patch set fixed some serious security issues found when DPC
error injection and NVMe SSD hotplug brute force test were doing -- race
condition between DPC handler and pciehp, AER interrupt handlers, caused
system hang and system with DPC feature couldn't recover to normal
working state as expected (NVMe instance lost, mount operation hang,
race PCIe access caused uncorrectable errors reported alternatively etc).

With this patch set applied, stable 5.9-rc6 on ICS (Ice Lake SP platform,
see
https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server))

could pass the PCIe Gen4 NVMe SSD brute force hotplug test with any time 
interval between hot-remove and plug-in operation tens of times without
any errors occur and system works normal.

With this patch set applied, system with DPC feature could recover from
NON-FATAL and FATAL errors injection test and works as expected.

System works smoothly when errors happen while hotplug is doing, no
uncorrectable errors found.

Brute DPC error injection script:

for i in {0..100}
do
setpci -s 64:02.0 0x196.w=000a 
setpci -s 65:00.0 0x04.w=0544 
mount /dev/nvme0n1p1 /root/nvme
sleep 1
done

Other details see every commits description part.

This patch set could be applied to stable 5.9-rc6 directly.

Help to review and test.

V2: changed according to review by Andy Shevchenko.
V3: changed patch 4/5 to simpler coding.
V4: move function pci_wait_port_outdpc() to DPC driver and its
declaration to pci.h. (tip from Christoph Hellwig ).

Thanks,
Ethan


Ethan Zhao (5):
  PCI: define a function to check and wait till port finish DPC handling
  PCI: pciehp: check and wait port status out of DPC before handling
DLLSC and PDC
  PCI/ERR: get device before call device driver to avoid NULL pointer
reference
  PCI: only return true when dev io state is really changed
  PCI/ERR: don't mix io state not changed and no driver together

 drivers/pci/hotplug/pciehp_hpc.c |  4 +++-
 drivers/pci/pci.h| 34 +---
 drivers/pci/pcie/err.c   | 18 +++--
 include/linux/pci.h  | 31 +
 4 files changed, 55 insertions(+), 32 deletions(-)

-- 
2.18.4

[PATCH 1/5 V4] PCI: define a function to check and wait till port finish DPC handling

2020-09-27 Thread Ethan Zhao

Once root port DPC capability is enabled and triggered, at the beginning
of DPC is triggered, the DPC status bits are set by hardware and then
sends DPC/DLLSC/PDC interrupts to OS DPC and pciehp drivers, it will
take the port and software DPC interrupt handler 10ms to 50ms (test data
on ICS(Ice Lake SP platform, see
https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)
& stable 5.9-rc6) to complete the DPC containment procedure
till the DPC status is cleared at the end of the DPC interrupt handler.

We use this function to check if the root port is in DPC handling status
and wait till the hardware and software completed the procedure.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
Reviewed-by: Andy Shevchenko 
Reviewed-by: Christoph Hellwig 
---
changes:
 V2：align ICS code name to public doc.
 V3: no change.
 V4: response to Christoph's (Christoph Hellwig ) 
 tip, move pci_wait_port_outdpc() to DPC driver and its declaration
 to pci.h.

 drivers/pci/pci.h  |  2 ++
 drivers/pci/pcie/dpc.c | 27 +++
 2 files changed, 29 insertions(+)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fa12f7cbc1a0..8fdb0d823d5a 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -455,10 +455,12 @@ void pci_restore_dpc_state(struct pci_dev *dev);
 void pci_dpc_init(struct pci_dev *pdev);
 void dpc_process_error(struct pci_dev *pdev);
 pci_ers_result_t dpc_reset_link(struct pci_dev *pdev);
+bool pci_wait_port_outdpc(struct pci_dev *pdev);
 #else
 static inline void pci_save_dpc_state(struct pci_dev *dev) {}
 static inline void pci_restore_dpc_state(struct pci_dev *dev) {}
 static inline void pci_dpc_init(struct pci_dev *pdev) {}
+inline bool pci_wait_port_outdpc(struct pci_dev *pdev) { return false; }
 #endif
 
 #ifdef CONFIG_PCI_ATS
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index daa9a4153776..2e0e091ce923 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -71,6 +71,33 @@ void pci_restore_dpc_state(struct pci_dev *dev)
pci_write_config_word(dev, dev->dpc_cap + PCI_EXP_DPC_CTL, *cap);
 }
 
+bool pci_wait_port_outdpc(struct pci_dev *pdev)
+{
+   u16 cap = pdev->dpc_cap, status;
+   u16 loop = 0;
+
+   if (!cap) {
+   pci_WARN_ONCE(pdev, !cap, "No DPC capability initiated\n");
+   return false;
+   }
+   pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
+   pci_dbg(pdev, "DPC status %x, cap %x\n", status, cap);
+
+   while (status & PCI_EXP_DPC_STATUS_TRIGGER && loop < 100) {
+   msleep(10);
+   loop++;
+   pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
+   }
+
+   if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) {
+   pci_dbg(pdev, "Out of DPC %x, cost %d ms\n", status, loop*10);
+   return true;
+   }
+
+   pci_dbg(pdev, "Timeout to wait port out of DPC status\n");
+   return false;
+}
+
 static int dpc_wait_rp_inactive(struct pci_dev *pdev)
 {
unsigned long timeout = jiffies + HZ;
-- 
2.18.4

[PATCH 5/5 V3] PCI/ERR: don't mix io state not changed and no driver together

2020-09-26 Thread Ethan Zhao

When we see 'can't recover (no error_detected callback)' on console,
Maybe the reason is io state is not changed by calling
pci_dev_set_io_state(), that is confused. fix it.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
---
Chagnes:
 V2: no change.
 V3: no change.

 drivers/pci/pcie/err.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index e35c4480c86b..d85f27c90c26 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -55,8 +55,10 @@ static int report_error_detected(struct pci_dev *dev,
if (!pci_dev_get(dev))
return 0;
device_lock(>dev);
-   if (!pci_dev_set_io_state(dev, state) ||
-   !dev->driver ||
+   if (!pci_dev_set_io_state(dev, state)) {
+   pci_dbg(dev, "Device might already being in error handling 
...\n");
+   vote = PCI_ERS_RESULT_NONE;
+   } else if (!dev->driver ||
!dev->driver->err_handler ||
!dev->driver->err_handler->error_detected) {
/*
-- 
2.18.4

[PATCH 3/5 V3] PCI/ERR: get device before call device driver to avoid NULL pointer reference

2020-09-26 Thread Ethan Zhao

During DPC error injection test we found there is race condition between
pciehp and DPC driver, NULL pointer reference caused panic as following

 # setpci -s 64:02.0 0x196.w=000a
  // 64:02.0 is rootport has DPC capability
 # setpci -s 65:00.0 0x04.w=0544
  // 65:00.0 is NVMe SSD populated in above port
 # mount /dev/nvme0n1p1 nvme

 (tested on stable 5.8 & ICS(Ice Lake SP platform, see
 https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server))

 Buffer I/O error on dev nvme0n1p1, logical block 468843328,
 async page read
 BUG: kernel NULL pointer dereference, address: 0050
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x) - not-present page
 PGD 0
 Oops:  [#1] SMP NOPTI
 CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0-0.0.7.el8.x86_64+ #1
 RIP: 0010:report_error_detected.cold.4+0x7d/0xe6
 Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3 65 ff
 ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50 50 48 83 3a 00
 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0
 RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246
 RAX:  RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01
 RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138
 RBP: ff8e06cf8762fdd0 R08: 00bf R09: 
 R10: 00eb8ebeab53 R11: 93453258 R12: 0002
 R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828
 FS:  () GS:ff4e3eab1fd0() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0
 DR0:  DR1:  DR2: 
 DR3:  DR6: fffe0ff0 DR7: 0400
 PKRU: 5554
 Call Trace:
 ? report_normal_detected+0x20/0x20
 report_frozen_detected+0x16/0x20
 pci_walk_bus+0x75/0x90
 ? dpc_irq+0x90/0x90
 pcie_do_recovery+0x157/0x201
 ? irq_finalize_oneshot.part.47+0xe0/0xe0
 dpc_handler+0x29/0x40
 irq_thread_fn+0x24/0x60
 irq_thread+0xea/0x170
 ? irq_forced_thread_fn+0x80/0x80
 ? irq_thread_check_affinity+0xf0/0xf0
 kthread+0x124/0x140
 ? kthread_park+0x90/0x90
 ret_from_fork+0x1f/0x30
 Modules linked in: nft_fib_inet.
 CR2: 0050

Though we partly close the race condition with patch 'PCI: pciehp: check
and wait port status out of DPC before handling DLLSC and PDC', but there
is no hardware spec or software sequence to guarantee the pcie_ist() run
into pci_wait_port_outdpc() first or DPC triggered status bits being set
first when errors triggered DPC containment procedure, so device still
could be removed by function pci_stop_and_removed_bus_device() then freed
by pci_dev_put() in pciehp driver first during pcie_do_recover()/
pci_walk_bus() is called by dpc_handler() in DPC driver.

Maybe unify pci_bus_sem and pci_rescan_remove_lock to serialize the
removal and walking operation is the right way, but here we use
pci_dev_get() to increase the reference count of device before using the
device to avoid it is freed in use.

With this patch and patch 'PCI: pciehp: check and wait port status out of
DPC before handling DLLSC and PDC', stable 5.9-rc6 could pass the error
injection test and no panic happened.

Brute DPC error injection script:

for i in {0..100}
do
setpci -s 64:02.0 0x196.w=000a
setpci -s 65:00.0 0x04.w=0544
mount /dev/nvme0n1p1 /root/nvme
sleep 1
done

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
Reviewed-by: Andy Shevchenko 
---
Changes:
 V2: revise doc according to Andy's suggestion.
 V3: no change.

 drivers/pci/pcie/err.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index c543f419d8f9..e35c4480c86b 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -52,6 +52,8 @@ static int report_error_detected(struct pci_dev *dev,
pci_ers_result_t vote;
const struct pci_error_handlers *err_handler;
 
+   if (!pci_dev_get(dev))
+   return 0;
device_lock(>dev);
if (!pci_dev_set_io_state(dev, state) ||
!dev->driver ||
@@ -76,6 +78,7 @@ static int report_error_detected(struct pci_dev *dev,
pci_uevent_ers(dev, vote);
*result = merge_result(*result, vote);
device_unlock(>dev);
+   pci_dev_put(dev);
return 0;
 }
 
@@ -94,6 +97,8 @@ static int report_mmio_enabled(struct pci_dev *dev, void 
*data)
pci_ers_result_t vote, *result = data;
const struct pci_error_handlers *err_handler;
 
+   if (!pci_dev_get(dev))
+   return 0;
device_lock(>dev);
if (!dev->driver ||
!dev->driver->err_handler ||
@@ -105,6 +110,7 @@ static int report_mmio_enabled(struct pci_dev *dev, void 
*data)
*result = merge_result(*result, vote);
 out:
device_unlock(>dev);
+   pci_dev_put(dev)

[PATCH 4/5 V3] PCI: only return true when dev io state is really changed

2020-09-26 Thread Ethan Zhao

When uncorrectable error happens, AER driver and DPC driver interrupt
handlers likely call

   pcie_do_recovery()
   ->pci_walk_bus()
 ->report_frozen_detected()

with pci_channel_io_frozen the same time.
   If pci_dev_set_io_state() return true even if the original state is
pci_channel_io_frozen, that will cause AER or DPC handler re-enter
the error detecting and recovery procedure one after another.
   The result is the recovery flow mixed between AER and DPC.
So simplify the pci_dev_set_io_state() function to only return true
when dev->error_state is changed.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
Reviewed-by: Andy Shevchenko 
Reviewed-by: Alexandru Gagniuc 
Reviewed-by: Joe Perches 
---
Changnes:
 V2: revise description and code according to suggestion from Andy.
 V3: Change code to simpler.
 
 drivers/pci/pci.h | 37 +
 1 file changed, 5 insertions(+), 32 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fa12f7cbc1a0..a2c1c7d5f494 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -359,39 +359,12 @@ struct pci_sriov {
 static inline bool pci_dev_set_io_state(struct pci_dev *dev,
pci_channel_state_t new)
 {
-   bool changed = false;
-
device_lock_assert(>dev);
-   switch (new) {
-   case pci_channel_io_perm_failure:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   case pci_channel_io_perm_failure:
-   changed = true;
-   break;
-   }
-   break;
-   case pci_channel_io_frozen:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   changed = true;
-   break;
-   }
-   break;
-   case pci_channel_io_normal:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   changed = true;
-   break;
-   }
-   break;
-   }
-   if (changed)
-   dev->error_state = new;
-   return changed;
+   if (dev->error_state == new)
+   return false;
+
+   dev->error_state = new;
+   return true;
 }
 
 static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
-- 
2.18.4

[PATCH 0/5 V3] Fix DPC hotplug race and enhance error handling

2020-09-26 Thread Ethan Zhao

This simple patch set fixed some serious security issues found when DPC
error injection and NVMe SSD hotplug brute force test were doing -- race
condition between DPC handler and pciehp, AER interrupt handlers, caused
system hang and system with DPC feature couldn't recover to normal
working state as expected (NVMe instance lost, mount operation hang,
race PCIe access caused uncorrectable errors reported alternatively etc).

With this patch set applied, stable 5.9-rc6 on ICS (Ice Lake SP platform,
see
https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server))

could pass the PCIe Gen4 NVMe SSD brute force hotplug test with any time 
interval between hot-remove and plug-in operation tens of times without
any errors occur and system works normal.

With this patch set applied, system with DPC feature could recover from
NON-FATAL and FATAL errors injection test and works as expected.

System works smoothly when errors happen while hotplug is doing, no
uncorrectable errors found.

Brute DPC error injection script:

for i in {0..100}
do
setpci -s 64:02.0 0x196.w=000a 
setpci -s 65:00.0 0x04.w=0544 
mount /dev/nvme0n1p1 /root/nvme
sleep 1
done

Other details see every commits description part.

This patch set could be applied to stable 5.9-rc6 directly.

Help to review and test.

V2: changed according to review by Andy Shevchenko.
V3: changed patch 4/5 to simpler coding.

Thanks,
Ethan


Ethan Zhao (5):
  PCI: define a function to check and wait till port finish DPC handling
  PCI: pciehp: check and wait port status out of DPC before handling
DLLSC and PDC
  PCI/ERR: get device before call device driver to avoid NULL pointer
reference
  PCI: only return true when dev io state is really changed
  PCI/ERR: don't mix io state not changed and no driver together

 drivers/pci/hotplug/pciehp_hpc.c |  4 +++-
 drivers/pci/pci.h| 34 +---
 drivers/pci/pcie/err.c   | 18 +++--
 include/linux/pci.h  | 31 +
 4 files changed, 55 insertions(+), 32 deletions(-)

-- 
2.18.4

[PATCH 2/5 V3] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC

2020-09-26 Thread Ethan Zhao

When root port has DPC capability and it is enabled, then triggered by
errors, DPC DLLSC and PDC interrupts will be sent to DPC driver, pciehp
driver at the same time.
That will cause following result:

1. Link and device are recovered by hardware DPC and software DPC driver, 
   device
   isn't removed, but the pciehp might treat it as device was hot removed.

2. Race condition happens bettween pciehp_unconfigure_device() called by
   pciehp_ist() in pciehp driver and pci_do_recovery() called by
   dpc_handler in DPC driver. no luck, there is no lock to protect 
   pci_stop_and_remove_bus_device()
   against pci_walk_bus(), they hold different samphore and mutex,
   pci_stop_and_remove_bus_device holds pci_rescan_remove_lock, and
   pci_walk_bus() holds pci_bus_sem.

This race condition is not purely code analysis, it could be triggered by
following command series:

  # setpci -s 64:02.0 0x196.w=000a // 64:02.0 rootport has DPC capability
  # setpci -s 65:00.0 0x04.w=0544  // 65:00.0 NVMe SSD populated in port
  # mount /dev/nvme0n1p1 nvme

One shot will cause system panic and NULL pointer reference happened.
(tested on stable 5.8 & ICS(Ice Lake SP platform, see
https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server))

   Buffer I/O error on dev nvme0n1p1, logical block 3328, async page read
   BUG: kernel NULL pointer dereference, address: 0050
   #PF: supervisor read access in kernel mode
   #PF: error_code(0x) - not-present page
   PGD 0
   Oops:  [#1] SMP NOPTI
   CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0 el8.x86_64+ #1
   RIP: 0010:report_error_detected.cold.4+0x7d/0xe6
   Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3
   65 ff ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50
   50 48 83 3a 00 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0
   RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246
   RAX:  RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01
   RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138
   RBP: ff8e06cf8762fdd0 R08: 00bf R09: 
   R10: 00eb8ebeab53 R11: 93453258 R12: 0002
   R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828
   FS:  () GS:ff4e3eab1fd0() knl
   GS:
   CS:  0010 DS:  ES:  CR0: 80050033
   CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0
   DR0:  DR1:  DR2: 
   DR3:  DR6: fffe0ff0 DR7: 0400
   PKRU: 5554
   Call Trace:
   ? report_normal_detected+0x20/0x20
   report_frozen_detected+0x16/0x20
   pci_walk_bus+0x75/0x90
   ? dpc_irq+0x90/0x90
   pcie_do_recovery+0x157/0x201
   ? irq_finalize_oneshot.part.47+0xe0/0xe0
   dpc_handler+0x29/0x40
   irq_thread_fn+0x24/0x60
   irq_thread+0xea/0x170
   ? irq_forced_thread_fn+0x80/0x80
   ? irq_thread_check_affinity+0xf0/0xf0
   kthread+0x124/0x140
   ? kthread_park+0x90/0x90
   ret_from_fork+0x1f/0x30
   Modules linked in: nft_fib_inet.
   CR2: 0050

With this patch, the handling flow of DPC containment and hotplug is
partly ordered and serialized, let hardware DPC do the controller reset
etc recovery action first, then DPC driver handling the call-back from
device drivers, clear the DPC status, at the end, pciehp handle the DLLSC
and PDC etc.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
Reviewed-by: Andy Shevchenko 
---
Changes:
 V2: revise doc according to Andy's suggestion.
 V3: no change.

 drivers/pci/hotplug/pciehp_hpc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 53433b37e181..6f271160f18d 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
down_read(>reset_lock);
if (events & DISABLE_SLOT)
pciehp_handle_disable_request(ctrl);
-   else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
+   else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) {
+   pci_wait_port_outdpc(pdev);
pciehp_handle_presence_or_link_change(ctrl, events);
+   }
up_read(>reset_lock);
 
ret = IRQ_HANDLED;
-- 
2.18.4

[PATCH 1/5 V3] PCI: define a function to check and wait till port finish DPC handling

2020-09-26 Thread Ethan Zhao

Once root port DPC capability is enabled and triggered, at the beginning
of DPC is triggered, the DPC status bits are set by hardware and then
sends DPC/DLLSC/PDC interrupts to OS DPC and pciehp drivers, it will
take the port and software DPC interrupt handler 10ms to 50ms (test data
on ICS(Ice Lake SP platform, see
https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)
& stable 5.9-rc6) to complete the DPC containment procedure
till the DPC status is cleared at the end of the DPC interrupt handler.

We use this function to check if the root port is in DPC handling status
and wait till the hardware and software completed the procedure.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
Reviewed-by: Andy Shevchenko 
---
changes:
 V2：align ICS code name to public doc.
 V3: no change. 
 include/linux/pci.h | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 835530605c0d..5beb76c6ae26 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -2427,4 +2428,34 @@ void pci_uevent_ers(struct pci_dev *pdev, enum  
pci_ers_result err_type);
WARN_ONCE(condition, "%s %s: " fmt, \
  dev_driver_string(&(pdev)->dev), pci_name(pdev), ##arg)
 
+#ifdef CONFIG_PCIE_DPC
+static inline bool pci_wait_port_outdpc(struct pci_dev *pdev)
+{
+   u16 cap = pdev->dpc_cap, status;
+   u16 loop = 0;
+
+   if (!cap) {
+   pci_WARN_ONCE(pdev, !cap, "No DPC capability initiated\n");
+   return false;
+   }
+   pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
+   pci_dbg(pdev, "DPC status %x, cap %x\n", status, cap);
+   while (status & PCI_EXP_DPC_STATUS_TRIGGER && loop < 100) {
+   msleep(10);
+   loop++;
+   pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
+   }
+   if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) {
+   pci_dbg(pdev, "Out of DPC %x, cost %d ms\n", status, loop*10);
+   return true;
+   }
+   pci_dbg(pdev, "Timeout to wait port out of DPC status\n");
+   return false;
+}
+#else
+static inline bool pci_wait_port_outdpc(struct pci_dev *pdev)
+{
+   return true;
+}
+#endif
 #endif /* LINUX_PCI_H */
-- 
2.18.4

[PATCH 2/5 V2] PCI: pciehp: check and wait port status out of DPC before handling DLLSC and PDC

2020-09-26 Thread Ethan Zhao

When root port has DPC capability and it is enabled, then triggered by
errors, DPC DLLSC and PDC interrupts will be sent to DPC driver, pciehp
driver at the same time.
That will cause following result:

1. Link and device are recovered by hardware DPC and software DPC driver, 
   device
   isn't removed, but the pciehp might treat it as device was hot removed.

2. Race condition happens bettween pciehp_unconfigure_device() called by
   pciehp_ist() in pciehp driver and pci_do_recovery() called by
   dpc_handler in DPC driver. no luck, there is no lock to protect 
   pci_stop_and_remove_bus_device()
   against pci_walk_bus(), they hold different samphore and mutex,
   pci_stop_and_remove_bus_device holds pci_rescan_remove_lock, and
   pci_walk_bus() holds pci_bus_sem.

This race condition is not purely code analysis, it could be triggered by
following command series:

  # setpci -s 64:02.0 0x196.w=000a // 64:02.0 rootport has DPC capability
  # setpci -s 65:00.0 0x04.w=0544  // 65:00.0 NVMe SSD populated in port
  # mount /dev/nvme0n1p1 nvme

One shot will cause system panic and NULL pointer reference happened.
(tested on stable 5.8 & ICS(Ice Lake SP platform, see
https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server))

   Buffer I/O error on dev nvme0n1p1, logical block 3328, async page read
   BUG: kernel NULL pointer dereference, address: 0050
   #PF: supervisor read access in kernel mode
   #PF: error_code(0x) - not-present page
   PGD 0
   Oops:  [#1] SMP NOPTI
   CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0 el8.x86_64+ #1
   RIP: 0010:report_error_detected.cold.4+0x7d/0xe6
   Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3
   65 ff ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50
   50 48 83 3a 00 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0
   RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246
   RAX:  RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01
   RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138
   RBP: ff8e06cf8762fdd0 R08: 00bf R09: 
   R10: 00eb8ebeab53 R11: 93453258 R12: 0002
   R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828
   FS:  () GS:ff4e3eab1fd0() knl
   GS:
   CS:  0010 DS:  ES:  CR0: 80050033
   CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0
   DR0:  DR1:  DR2: 
   DR3:  DR6: fffe0ff0 DR7: 0400
   PKRU: 5554
   Call Trace:
   ? report_normal_detected+0x20/0x20
   report_frozen_detected+0x16/0x20
   pci_walk_bus+0x75/0x90
   ? dpc_irq+0x90/0x90
   pcie_do_recovery+0x157/0x201
   ? irq_finalize_oneshot.part.47+0xe0/0xe0
   dpc_handler+0x29/0x40
   irq_thread_fn+0x24/0x60
   irq_thread+0xea/0x170
   ? irq_forced_thread_fn+0x80/0x80
   ? irq_thread_check_affinity+0xf0/0xf0
   kthread+0x124/0x140
   ? kthread_park+0x90/0x90
   ret_from_fork+0x1f/0x30
   Modules linked in: nft_fib_inet.
   CR2: 0050

With this patch, the handling flow of DPC containment and hotplug is
partly ordered and serialized, let hardware DPC do the controller reset
etc recovery action first, then DPC driver handling the call-back from
device drivers, clear the DPC status, at the end, pciehp handle the DLLSC
and PDC etc.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
Reviewed-by: Andy Shevchenko 
---
Changes:
 V2: revise doc according to Andy's suggestion.
 drivers/pci/hotplug/pciehp_hpc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 53433b37e181..6f271160f18d 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -710,8 +710,10 @@ static irqreturn_t pciehp_ist(int irq, void *dev_id)
down_read(>reset_lock);
if (events & DISABLE_SLOT)
pciehp_handle_disable_request(ctrl);
-   else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC))
+   else if (events & (PCI_EXP_SLTSTA_PDC | PCI_EXP_SLTSTA_DLLSC)) {
+   pci_wait_port_outdpc(pdev);
pciehp_handle_presence_or_link_change(ctrl, events);
+   }
up_read(>reset_lock);
 
ret = IRQ_HANDLED;
-- 
2.18.4

[PATCH 3/5 V2] PCI/ERR: get device before call device driver to avoid NULL pointer reference

2020-09-26 Thread Ethan Zhao

During DPC error injection test we found there is race condition between
pciehp and DPC driver, NULL pointer reference caused panic as following

 # setpci -s 64:02.0 0x196.w=000a
  // 64:02.0 is rootport has DPC capability
 # setpci -s 65:00.0 0x04.w=0544
  // 65:00.0 is NVMe SSD populated in above port
 # mount /dev/nvme0n1p1 nvme

 (tested on stable 5.8 & ICS(Ice Lake SP platform, see
 https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server))

 Buffer I/O error on dev nvme0n1p1, logical block 468843328,
 async page read
 BUG: kernel NULL pointer dereference, address: 0050
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x) - not-present page
 PGD 0
 Oops:  [#1] SMP NOPTI
 CPU: 12 PID: 513 Comm: irq/124-pcie-dp Not tainted 5.8.0-0.0.7.el8.x86_64+ #1
 RIP: 0010:report_error_detected.cold.4+0x7d/0xe6
 Code: b6 d0 e8 e8 fe 11 00 e8 16 c5 fb ff be 06 00 00 00 48 89 df e8 d3 65 ff
 ff b8 06 00 00 00 e9 75 fc ff ff 48 8b 43 68 45 31 c9 <48> 8b 50 50 48 83 3a 00
 41 0f 94 c1 45 31 c0 48 85 d2 41 0f 94 c0
 RSP: 0018:ff8e06cf8762fda8 EFLAGS: 00010246
 RAX:  RBX: ff4e3eaacf42a000 RCX: ff4e3eb31f223c01
 RDX: ff4e3eaacf42a140 RSI: ff4e3eb31f223c00 RDI: ff4e3eaacf42a138
 RBP: ff8e06cf8762fdd0 R08: 00bf R09: 
 R10: 00eb8ebeab53 R11: 93453258 R12: 0002
 R13: ff4e3eaacf42a130 R14: ff8e06cf8762fe2c R15: ff4e3eab44733828
 FS:  () GS:ff4e3eab1fd0() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 0050 CR3: 000f8f80a004 CR4: 00761ee0
 DR0:  DR1:  DR2: 
 DR3:  DR6: fffe0ff0 DR7: 0400
 PKRU: 5554
 Call Trace:
 ? report_normal_detected+0x20/0x20
 report_frozen_detected+0x16/0x20
 pci_walk_bus+0x75/0x90
 ? dpc_irq+0x90/0x90
 pcie_do_recovery+0x157/0x201
 ? irq_finalize_oneshot.part.47+0xe0/0xe0
 dpc_handler+0x29/0x40
 irq_thread_fn+0x24/0x60
 irq_thread+0xea/0x170
 ? irq_forced_thread_fn+0x80/0x80
 ? irq_thread_check_affinity+0xf0/0xf0
 kthread+0x124/0x140
 ? kthread_park+0x90/0x90
 ret_from_fork+0x1f/0x30
 Modules linked in: nft_fib_inet.
 CR2: 0050

Though we partly close the race condition with patch 'PCI: pciehp: check
and wait port status out of DPC before handling DLLSC and PDC', but there
is no hardware spec or software sequence to guarantee the pcie_ist() run
into pci_wait_port_outdpc() first or DPC triggered status bits being set
first when errors triggered DPC containment procedure, so device still
could be removed by function pci_stop_and_removed_bus_device() then freed
by pci_dev_put() in pciehp driver first during pcie_do_recover()/
pci_walk_bus() is called by dpc_handler() in DPC driver.

Maybe unify pci_bus_sem and pci_rescan_remove_lock to serialize the
removal and walking operation is the right way, but here we use
pci_dev_get() to increase the reference count of device before using the
device to avoid it is freed in use.

With this patch and patch 'PCI: pciehp: check and wait port status out of
DPC before handling DLLSC and PDC', stable 5.9-rc6 could pass the error
injection test and no panic happened.

Brute DPC error injection script:

for i in {0..100}
do
setpci -s 64:02.0 0x196.w=000a
setpci -s 65:00.0 0x04.w=0544
mount /dev/nvme0n1p1 /root/nvme
sleep 1
done

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
Reviewed-by: Andy Shevchenko 
---
Changes:
 V2: revise doc according to Andy's suggestion.

 drivers/pci/pcie/err.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index c543f419d8f9..e35c4480c86b 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -52,6 +52,8 @@ static int report_error_detected(struct pci_dev *dev,
pci_ers_result_t vote;
const struct pci_error_handlers *err_handler;
 
+   if (!pci_dev_get(dev))
+   return 0;
device_lock(>dev);
if (!pci_dev_set_io_state(dev, state) ||
!dev->driver ||
@@ -76,6 +78,7 @@ static int report_error_detected(struct pci_dev *dev,
pci_uevent_ers(dev, vote);
*result = merge_result(*result, vote);
device_unlock(>dev);
+   pci_dev_put(dev);
return 0;
 }
 
@@ -94,6 +97,8 @@ static int report_mmio_enabled(struct pci_dev *dev, void 
*data)
pci_ers_result_t vote, *result = data;
const struct pci_error_handlers *err_handler;
 
+   if (!pci_dev_get(dev))
+   return 0;
device_lock(>dev);
if (!dev->driver ||
!dev->driver->err_handler ||
@@ -105,6 +110,7 @@ static int report_mmio_enabled(struct pci_dev *dev, void 
*data)
*result = merge_result(*result, vote);
 out:
device_unlock(>dev);
+   pci_dev_put(dev);
return

[PATCH 5/5 V2] PCI/ERR: don't mix io state not changed and no driver together

2020-09-26 Thread Ethan Zhao

When we see 'can't recover (no error_detected callback)' on console,
Maybe the reason is io state is not changed by calling
pci_dev_set_io_state(), that is confused. fix it.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
---
Chagnes:
 V2: no change.
 drivers/pci/pcie/err.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index e35c4480c86b..d85f27c90c26 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -55,8 +55,10 @@ static int report_error_detected(struct pci_dev *dev,
if (!pci_dev_get(dev))
return 0;
device_lock(>dev);
-   if (!pci_dev_set_io_state(dev, state) ||
-   !dev->driver ||
+   if (!pci_dev_set_io_state(dev, state)) {
+   pci_dbg(dev, "Device might already being in error handling 
...\n");
+   vote = PCI_ERS_RESULT_NONE;
+   } else if (!dev->driver ||
!dev->driver->err_handler ||
!dev->driver->err_handler->error_detected) {
/*
-- 
2.18.4

[PATCH 4/5 V2] PCI: only return true when dev io state is really changed

2020-09-26 Thread Ethan Zhao

When uncorrectable error happens, AER driver and DPC driver interrupt
handlers likely call

   pcie_do_recovery()
   ->pci_walk_bus()
 ->report_frozen_detected()

with pci_channel_io_frozen the same time.
   If pci_dev_set_io_state() return true even if the original state is
pci_channel_io_frozen, that will cause AER or DPC handler re-enter
the error detecting and recovery procedure one after another.
   The result is the recovery flow mixed between AER and DPC.
So simplify the pci_dev_set_io_state() function to only return true
when dev->error_state is changed.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
Reviewed-by: Andy Shevchenko 
Reviewed-by: Alexandru Gagniuc 
---
Changes:
 V2: revise doc and code flow according to Andy's suggestion.
 drivers/pci/pci.h | 34 +-
 1 file changed, 5 insertions(+), 29 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fa12f7cbc1a0..387f891ce6a1 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -362,35 +362,11 @@ static inline bool pci_dev_set_io_state(struct pci_dev 
*dev,
bool changed = false;
 
device_lock_assert(>dev);
-   switch (new) {
-   case pci_channel_io_perm_failure:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   case pci_channel_io_perm_failure:
-   changed = true;
-   break;
-   }
-   break;
-   case pci_channel_io_frozen:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   changed = true;
-   break;
-   }
-   break;
-   case pci_channel_io_normal:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   changed = true;
-   break;
-   }
-   break;
-   }
-   if (changed)
-   dev->error_state = new;
+   if (dev->error_state == new)
+   return changed;
+
+   dev->error_state = new;
+   changed = true;
return changed;
 }
 
-- 
2.18.4

[PATCH 1/5 V2] PCI: define a function to check and wait till port finish DPC handling

2020-09-26 Thread Ethan Zhao

Once root port DPC capability is enabled and triggered, at the beginning
of DPC is triggered, the DPC status bits are set by hardware and then
sends DPC/DLLSC/PDC interrupts to OS DPC and pciehp drivers, it will
take the port and software DPC interrupt handler 10ms to 50ms (test data
on ICS(Ice Lake SP platform, see
https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server)
& stable 5.9-rc6) to complete the DPC containment procedure
till the DPC status is cleared at the end of the DPC interrupt handler.

We use this function to check if the root port is in DPC handling status
and wait till the hardware and software completed the procedure.

Signed-off-by: Ethan Zhao 
Tested-by: Wen Jin 
Tested-by: Shanshan Zhang 
Reviewed-by: Andy Shevchenko 
---
changes:
 V2：align ICS code name to public doc. 
 include/linux/pci.h | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 835530605c0d..5beb76c6ae26 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -2427,4 +2428,34 @@ void pci_uevent_ers(struct pci_dev *pdev, enum  
pci_ers_result err_type);
WARN_ONCE(condition, "%s %s: " fmt, \
  dev_driver_string(&(pdev)->dev), pci_name(pdev), ##arg)
 
+#ifdef CONFIG_PCIE_DPC
+static inline bool pci_wait_port_outdpc(struct pci_dev *pdev)
+{
+   u16 cap = pdev->dpc_cap, status;
+   u16 loop = 0;
+
+   if (!cap) {
+   pci_WARN_ONCE(pdev, !cap, "No DPC capability initiated\n");
+   return false;
+   }
+   pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
+   pci_dbg(pdev, "DPC status %x, cap %x\n", status, cap);
+   while (status & PCI_EXP_DPC_STATUS_TRIGGER && loop < 100) {
+   msleep(10);
+   loop++;
+   pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, );
+   }
+   if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) {
+   pci_dbg(pdev, "Out of DPC %x, cost %d ms\n", status, loop*10);
+   return true;
+   }
+   pci_dbg(pdev, "Timeout to wait port out of DPC status\n");
+   return false;
+}
+#else
+static inline bool pci_wait_port_outdpc(struct pci_dev *pdev)
+{
+   return true;
+}
+#endif
 #endif /* LINUX_PCI_H */
-- 
2.18.4

[PATCH 0/5 V2] Fix DPC hotplug race and enhance error handling

2020-09-26 Thread Ethan Zhao

This simple patch set fixed some serious security issues found when DPC
error injection and NVMe SSD hotplug brute force test were doing -- race
condition between DPC handler and pciehp, AER interrupt handlers, caused
system hang and system with DPC feature couldn't recover to normal
working state as expected (NVMe instance lost, mount operation hang,
race PCIe access caused uncorrectable errors reported alternatively etc).

With this patch set applied, stable 5.9-rc6 on ICS (Ice Lake SP platform,
see
https://en.wikichip.org/wiki/intel/microarchitectures/ice_lake_(server))

could pass the PCIe Gen4 NVMe SSD brute force hotplug test with any time 
interval between hot-remove and plug-in operation tens of times without
any errors occur and system works normal.

With this patch set applied, system with DPC feature could recover from
NON-FATAL and FATAL errors injection test and works as expected.

System works smoothly when errors happen while hotplug is doing, no
uncorrectable errors found.

Brute DPC error injection script:

for i in {0..100}
do
setpci -s 64:02.0 0x196.w=000a 
setpci -s 65:00.0 0x04.w=0544 
mount /dev/nvme0n1p1 /root/nvme
sleep 1
done

Other details see every commits description part.

This patch set could be applied to stable 5.9-rc6 directly.

Help to review and test.

V2: changed according to review by Andy Shevchenko.

Thanks,
Ethan


Ethan Zhao (5):
  PCI: define a function to check and wait till port finish DPC handling
  PCI: pciehp: check and wait port status out of DPC before handling
DLLSC and PDC
  PCI/ERR: get device before call device driver to avoid NULL pointer
reference
  PCI: only return true when dev io state is really changed
  PCI/ERR: don't mix io state not changed and no driver together

 drivers/pci/hotplug/pciehp_hpc.c |  4 +++-
 drivers/pci/pci.h| 34 +---
 drivers/pci/pcie/err.c   | 18 +++--
 include/linux/pci.h  | 31 +
 4 files changed, 55 insertions(+), 32 deletions(-)

-- 
2.18.4

[PATCH 5/5] PCI/ERR: don't mix io state not changed and no driver together

2020-09-24 Thread Ethan Zhao

When we see 'can't recover (no error_detected callback)' on console,
Maybe the reason is io state is not changed by calling
pci_dev_set_io_state(), that is confused. fix it.

Signed-off-by: Ethan Zhao 
Tested-by: Wen jin 
Tested-by: Shanshan Zhang 
---
 drivers/pci/pcie/err.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index e35c4480c86b..d85f27c90c26 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -55,8 +55,10 @@ static int report_error_detected(struct pci_dev *dev,
if (!pci_dev_get(dev))
return 0;
device_lock(>dev);
-   if (!pci_dev_set_io_state(dev, state) ||
-   !dev->driver ||
+   if (!pci_dev_set_io_state(dev, state)) {
+   pci_dbg(dev, "Device might already being in error handling 
...\n");
+   vote = PCI_ERS_RESULT_NONE;
+   } else if (!dev->driver ||
!dev->driver->err_handler ||
!dev->driver->err_handler->error_detected) {
/*
-- 
2.18.4

[PATCH 4/5] PCI: only return true when dev io state is really changed

2020-09-24 Thread Ethan Zhao

When uncorrectable error happens, AER driver and DPC driver interrupt
handlers likely call
   pcie_do_recovery()->pci_walk_bus()->report_frozen_detected() with
pci_channel_io_frozen the same time.
   If pci_dev_set_io_state() return true even if the original state is
pci_channel_io_frozen, that will cause AER or DPC handler re-enter
the error detecting and recovery procedure one after another.
   The result is the recovery flow mixed between AER and DPC.
So simplify the pci_dev_set_io_state() function to only return true
when dev->error_state is changed.

Signed-off-by: Ethan Zhao 
Tested-by: Wen jin 
Tested-by: Shanshan Zhang 
---
 drivers/pci/pci.h | 31 +++
 1 file changed, 3 insertions(+), 28 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fa12f7cbc1a0..d420bb977f3b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -362,35 +362,10 @@ static inline bool pci_dev_set_io_state(struct pci_dev 
*dev,
bool changed = false;
 
device_lock_assert(>dev);
-   switch (new) {
-   case pci_channel_io_perm_failure:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   case pci_channel_io_perm_failure:
-   changed = true;
-   break;
-   }
-   break;
-   case pci_channel_io_frozen:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   changed = true;
-   break;
-   }
-   break;
-   case pci_channel_io_normal:
-   switch (dev->error_state) {
-   case pci_channel_io_frozen:
-   case pci_channel_io_normal:
-   changed = true;
-   break;
-   }
-   break;
-   }
-   if (changed)
+   if (dev->error_state != new) {
dev->error_state = new;
+   changed = true;
+   }
return changed;
 }
 
-- 
2.18.4

[PATCH 0/5] Fix DPC hotplug race and enhance error hanlding

2020-09-24 Thread Ethan Zhao

This simple patch set fixed some serious security issues found when DPC
error injection and NVMe SSD hotplug brute force test were doing -- race
condition between DPC handler and pciehp, AER interrupt handlers, caused
system hang and system with DPC feature couldn't recover to normal
working state as expected (NVMe instance lost, mount operation hang,
race PCIe access caused uncorrectable errors reported alternativly etc). 

With this patch set applied, stable 5.9-rc6 could pass the PCIe Gen4 NVMe
SSD brute force hotplug test with any time interval between hot-remove and
plug-in operation tens of times without any errors occur and system works
normal.

With this patch set applied, system with DPC feature could recover from
NON-FATAL and FATAL errors injection test and works as expected.

System works smoothly when errors happen while hotplug is doing, no 
uncorrectable errors found.

Brute DPC error injection script:

for i in {0..100}
do
setpci -s 64:02.0 0x196.w=000a
setpci -s 65:00.0 0x04.w=0544
mount /dev/nvme0n1p1 /root/nvme
sleep 1
done

Other details see every commits description part.

This patch set could be applied to stable 5.9-rc6 directly.

Help to review and test.

Thanks,
Ethan


Ethan Zhao (5):
  PCI: define a function to check and wait till port finish DPC handling
  PCI: pciehp: check and wait port status out of DPC before handling
DLLSC and PDC
  PCI/ERR: get device before call device driver to avoid null pointer
reference
  PCI: only return true when dev io state is really changed
  PCI/ERR: don't mix io state not changed and no driver together

 drivers/pci/hotplug/pciehp_hpc.c |  4 +++-
 drivers/pci/pci.h| 31 +++
 drivers/pci/pcie/err.c   | 18 --
 include/linux/pci.h  | 31 +++
 4 files changed, 53 insertions(+), 31 deletions(-)

-- 
2.18.4

[PATCH] Revert "block: revert back to synchronous request_queue removal"

2020-09-08 Thread Ethan Zhao

From: Ethan Zhao 

'commit e8c7d14ac6c3 ("block: revert back to synchronous request_queue
removal")' introduced panic issue to NVMe hotplug as following(hit
after just 2 times NVMe SSD hotplug under stable 5.9-RC2):

BUG: sleeping function called from invalid context at block/genhd.c:1563
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 0, name: swapper/30
INFO: lockdep is turned off.
CPU: 30 PID: 0 Comm: swapper/30 Tainted: G S  W 5.9.0-RC2 #3
Hardware name: Intel Corporation 
Call Trace:

dump_stack+0x9d/0xe0
___might_sleep.cold.79+0x17f/0x1af
disk_release+0x26/0x200
device_release+0x6d/0x1c0
kobject_put+0x14d/0x430
hd_struct_free+0xfb/0x260
percpu_ref_switch_to_atomic_rcu+0x3d1/0x580
? rcu_nocb_unlock_irqrestore+0xb6/0xf0
? trace_hardirqs_on+0x20/0x1b5
? rcu_do_batch+0x3ff/0xb50
rcu_do_batch+0x47c/0xb50
? rcu_accelerate_cbs+0xa9/0x740
? rcu_spawn_one_nocb_kthread+0x3d0/0x3d0
rcu_core+0x945/0xd90
? __do_softirq+0x182/0xac0
__do_softirq+0x1ca/0xac0
asm_call_on_stack+0xf/0x20

do_softirq_own_stack+0x7f/0x90
irq_exit_rcu+0x1e3/0x230
sysvec_apic_timer_interrupt+0x48/0xb0
asm_sysvec_apic_timer_interrupt+0x12/0x20
RIP: 0010:cpuidle_enter_state+0x116/0xe90
Code: 00 31 ff e8 ac c8 a4 fe 80 7c 24 10 00 74 12 9c 58 f6 c4 02
 0f 85 7e 08 00 00 31 ff e8 43 ca be fe e8 ae a3 d5 fe fb 45 85 ed
 <0f> 88 4e 05 00 00 4d 63 f5 49 83 fe 09 0f 87 29 0b 00 00 4b 8d 04
RSP: 0018:ff110001040dfd78 EFLAGS: 0206
RAX: 0007 RBX: ffd1fff7b1a01e00 RCX: 001f
RDX:  RSI: 4000 RDI: b7c5c0f2
RBP: b9a416a0 R08:  R09: 
R10: ff110001040d0007 R11: ffe21c002081a000 R12: 0003
R13: 0003 R14: 0138
... ...
BUG: kernel NULL pointer dereference, address: 
PGD 0
Oops: 0010 [#1] SMP KASAN NOPTI
CPU: 30 PID: 500 Comm: irq/124-pciehp Tainted: G S  W  5.9.0-RC2 #3
Hardware name: Intel Corporation 
RIP: 0010:0x0
Code: Bad RIP value.
RSP: 0018:ff110007d5ba75e8 EFLAGS: 00010096
RAX:  RBX: ff110001040d RCX: ff110007d5ba7668
RDX: 0009 RSI: ff110001040d RDI: ff110008119f59c0
RBP: ff110008119f59c0 R08: fbfff73f7b4d R09: fbfff73f7b4d
R10: b9fbda67 R11: fbfff73f7b4c R12: 
R13: ff110007d5ba7668 R14: ff110001040d R15: ff110008119f59c0
FS:  () GS:ff1100081180() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: ffd6 CR3: 0007cea16002 CR4: 00771ee0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
PKRU: 5554
Call Trace:
ttwu_do_activate+0xd3/0x160
try_to_wake_up+0x652/0x1850
? migrate_swap_stop+0xad0/0xad0
? lock_contended+0xd70/0xd70
? rcu_read_unlock+0x50/0x50
? rcu_do_batch+0x3ff/0xb50
swake_up_locked+0x85/0x1c0
complete+0x4d/0x70
rcu_do_batch+0x47c/0xb50
? rcu_spawn_one_nocb_kthread+0x3d0/0x3d0
rcu_core+0x945/0xd90
? __do_softirq+0x182/0xac0
__do_softirq+0x1ca/0xac0
irq_exit_rcu+0x1e3/0x230
sysvec_apic_timer_interrupt+0x48/0xb0
asm_sysvec_apic_timer_interrupt+0x12/0x20
RIP: 0010:_raw_spin_unlock_irqrestore+0x40/0x50
Code: e8 35 ad 36 fe 48 89 ef e8 cd ce 37 fe f6 c7 02 75 11 53 9d
 e8 91 1f 5c fe 65 ff 0d ba af c2 47 5b 5d c3 e8 d2 22 5c fe 53 9d
  ed 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53
RSP: 0018:ff110007d5ba79d0 EFLAGS: 0293
RAX: 0007 RBX: 0293 RCX: b67710d4
RDX:  RSI: 0004 RDI: b83f41ce
RBP: bb77e740 R08:  R09: 
R10: bb77e743 R11: fbfff76efce8 R12: 198e
R13: ff11001031d7a0b0 R14: bb77e740 R15: bb77e788
? do_raw_spin_unlock+0x54/0x230
? _raw_spin_unlock_irqrestore+0x3e/0x50
dma_debug_device_change+0x150/0x5e0
notifier_call_chain+0x90/0x160
__blocking_notifier_call_chain+0x6d/0xa0
device_release_driver_internal+0x37d/0x490
pci_stop_bus_device+0x123/0x190
pci_stop_and_remove_bus_device+0xe/0x20
pciehp_unconfigure_device+0x17e/0x330
? pciehp_configure_device+0x3e0/0x3e0
? trace_hardirqs_on+0x20/0x1b5
pciehp_disable_slot+0x101/0x360
? pme_is_native.cold.2+0x29/0x29
pciehp_handle_presence_or_link_change+0x1ac/0xee0
? pciehp_handle_disable_request+0x110/0x110
pciehp_ist.cold.11+0x39/0x54
? pciehp_set_indicators+0x190/0x190
? alloc_desc+0x510/0xa30
? irq_set_affinity_notifier+0x380/0x380
? pciehp_set_indicators+0x190/0x190
? irq_thread+0x137/0x420
irq_thread_fn+0x86/0x150
irq_thread+0x21f/0x420
? irq_forced_thread_fn+0x170/0x170
? irq_thread_check_affinity+0x210/0x210
? __kthread_parkme+0x52/0x1a0
? lockdep_hardirqs_on_prepare+0x33e/0x4e0
? _raw_spin_unlock_irqrestore+0x3e/0x50
? trace_hardirqs_on+0x20/0x1b5
? wake_threads_waitq+0x40/0x40
? __kthread_parkme+0xd1/0x1a0
? irq_thread_check_affinity+0x210/0x210
kthread+0x36a/0x430
? kthread_create_worker_

Re: [PATCH 04/27] Restrict /dev/mem and /dev/kmem when the kernel is locked down

2017-10-24 Thread Ethan Zhao

David，

May I ask a question here -- Is it intentionally enabling the
read-only mode, so userspace
tools like dmidecode could work with kernel_is_locked_down ?  while it
was impossible to work
with the attached patch applied. Is it a security policy change with
secure boot ?

Thanks,
Ethan

On Mon, Oct 23, 2017 at 10:34 PM, David Howells  wrote:
> I think I should replace this patch with the attached.  This will prevent
> /dev/mem, /dev/kmem and /dev/port from being *opened*, and thereby preventing
> read, write and ioctl.
>
> David
> ---
> commit e68daa2256986932b9a7d6709cf9e24b30d93583
> Author: Matthew Garrett 
> Date:   Wed May 24 14:56:02 2017 +0100
>
> Restrict /dev/{mem,kmem,port} when the kernel is locked down
>
> Allowing users to read and write to core kernel memory makes it possible
> for the kernel to be subverted, avoiding module loading restrictions, and
> also to steal cryptographic information.
>
> Disallow /dev/mem and /dev/kmem from being opened this when the kernel has
> been locked down to prevent this.
>
> Also disallow /dev/port from being opened to prevent raw ioport access and
> thus DMA from being used to accomplish the same thing.
>
> Signed-off-by: Matthew Garrett 
> Signed-off-by: David Howells 
> Reviewed-by: "Lee, Chun-Yi" 
>
> diff --git a/drivers/char/mem.c b/drivers/char/mem.c
> index 593a8818aca9..0ce5ac0a5c6b 100644
> --- a/drivers/char/mem.c
> +++ b/drivers/char/mem.c
> @@ -762,6 +762,8 @@ static loff_t memory_lseek(struct file *file, loff_t 
> offset, int orig)
>
>  static int open_port(struct inode *inode, struct file *filp)
>  {
> +   if (kernel_is_locked_down("/dev/mem,kmem,port"))
> +   return -EPERM;
> return capable(CAP_SYS_RAWIO) ? 0 : -EPERM;
>  }
>

Re: [PATCH 04/27] Restrict /dev/mem and /dev/kmem when the kernel is locked down

2017-10-24 Thread Ethan Zhao

David，

May I ask a question here -- Is it intentionally enabling the
read-only mode, so userspace
tools like dmidecode could work with kernel_is_locked_down ?  while it
was impossible to work
with the attached patch applied. Is it a security policy change with
secure boot ?

Thanks,
Ethan

On Mon, Oct 23, 2017 at 10:34 PM, David Howells  wrote:
> I think I should replace this patch with the attached.  This will prevent
> /dev/mem, /dev/kmem and /dev/port from being *opened*, and thereby preventing
> read, write and ioctl.
>
> David
> ---
> commit e68daa2256986932b9a7d6709cf9e24b30d93583
> Author: Matthew Garrett 
> Date:   Wed May 24 14:56:02 2017 +0100
>
> Restrict /dev/{mem,kmem,port} when the kernel is locked down
>
> Allowing users to read and write to core kernel memory makes it possible
> for the kernel to be subverted, avoiding module loading restrictions, and
> also to steal cryptographic information.
>
> Disallow /dev/mem and /dev/kmem from being opened this when the kernel has
> been locked down to prevent this.
>
> Also disallow /dev/port from being opened to prevent raw ioport access and
> thus DMA from being used to accomplish the same thing.
>
> Signed-off-by: Matthew Garrett 
> Signed-off-by: David Howells 
> Reviewed-by: "Lee, Chun-Yi" 
>
> diff --git a/drivers/char/mem.c b/drivers/char/mem.c
> index 593a8818aca9..0ce5ac0a5c6b 100644
> --- a/drivers/char/mem.c
> +++ b/drivers/char/mem.c
> @@ -762,6 +762,8 @@ static loff_t memory_lseek(struct file *file, loff_t 
> offset, int orig)
>
>  static int open_port(struct inode *inode, struct file *filp)
>  {
> +   if (kernel_is_locked_down("/dev/mem,kmem,port"))
> +   return -EPERM;
> return capable(CAP_SYS_RAWIO) ? 0 : -EPERM;
>  }
>

[tip:sched/core] sched/sysctl: Check user input value of sysctl_sched_time_avg

2017-09-29 Thread tip-bot for Ethan Zhao

Commit-ID:  5ccba44ba118a500050076b0344632459779
Gitweb: https://git.kernel.org/tip/5ccba44ba118a500050076b0344632459779
Author: Ethan Zhao <ethan.z...@oracle.com>
AuthorDate: Mon, 4 Sep 2017 13:59:34 +0800
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Fri, 29 Sep 2017 13:20:13 +0200

sched/sysctl: Check user input value of sysctl_sched_time_avg

System will hang if user set sysctl_sched_time_avg to 0:

  [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0

  Stack traceback for pid 0
  0x883f6406c600 0 0 1 3 R 0x883f6406cf50 *swapper/3
  883f7ccc3ae8 0018 810c4dd0 
  00017800 883f7ccc3d78 0003 883f7ccc3bf8
  810c4fc9 883f7ccc3c08 810c5043 883f7ccc3c08
  Call Trace:
   [] ? update_group_capacity+0x110/0x200
  [] ? update_sd_lb_stats+0x109/0x600
  [] ? find_busiest_group+0x47/0x530
  [] ? load_balance+0x194/0x900
  [] ? update_rq_clock.part.83+0x1a/0xe0
  [] ? rebalance_domains+0x152/0x290
  [] ? run_rebalance_domains+0xdc/0x1d0
  [] ? __do_softirq+0xfb/0x320
  [] ? irq_exit+0x125/0x130
  [] ? scheduler_ipi+0x97/0x160
  [] ? smp_reschedule_interrupt+0x29/0x30
  [] ? reschedule_interrupt+0x6e/0x80
[] ? cpuidle_enter_state+0xcc/0x230
  [] ? cpuidle_enter_state+0x9c/0x230
  [] ? cpuidle_enter+0x17/0x20
  [] ? cpu_startup_entry+0x38c/0x420
  [] ? start_secondary+0x173/0x1e0

Because divide-by-zero error happens in function:

update_group_capacity()
  update_cpu_capacity()
scale_rt_capacity()
 {
  ...
  total = sched_avg_period() + delta;
  used = div_u64(avg, total);
  ...
 }

To fix this issue, check user input value of sysctl_sched_time_avg, keep
it unchanged when hitting invalid input, and set the minimum limit of
sysctl_sched_time_avg to 1 ms.

Reported-by: James Puthukattukaran <james.puthukattuka...@oracle.com>
Signed-off-by: Ethan Zhao <ethan.z...@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: efa...@gmx.de
Cc: ethan.ker...@gmail.com
Cc: keesc...@chromium.org
Cc: mcg...@kernel.org
Cc: <sta...@vger.kernel.org>
Link: 
http://lkml.kernel.org/r/1504504774-18253-1-git-send-email-ethan.z...@oracle.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 kernel/sysctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbb..423554a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -367,7 +367,8 @@ static struct ctl_table kern_table[] = {
.data   = _sched_time_avg,
.maxlen = sizeof(unsigned int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = ,
},
 #ifdef CONFIG_SCHEDSTATS
{

[tip:sched/core] sched/sysctl: Check user input value of sysctl_sched_time_avg

2017-09-29 Thread tip-bot for Ethan Zhao

Commit-ID:  5ccba44ba118a500050076b0344632459779
Gitweb: https://git.kernel.org/tip/5ccba44ba118a500050076b0344632459779
Author: Ethan Zhao 
AuthorDate: Mon, 4 Sep 2017 13:59:34 +0800
Committer:  Ingo Molnar 
CommitDate: Fri, 29 Sep 2017 13:20:13 +0200

sched/sysctl: Check user input value of sysctl_sched_time_avg

System will hang if user set sysctl_sched_time_avg to 0:

  [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0

  Stack traceback for pid 0
  0x883f6406c600 0 0 1 3 R 0x883f6406cf50 *swapper/3
  883f7ccc3ae8 0018 810c4dd0 
  00017800 883f7ccc3d78 0003 883f7ccc3bf8
  810c4fc9 883f7ccc3c08 810c5043 883f7ccc3c08
  Call Trace:
   [] ? update_group_capacity+0x110/0x200
  [] ? update_sd_lb_stats+0x109/0x600
  [] ? find_busiest_group+0x47/0x530
  [] ? load_balance+0x194/0x900
  [] ? update_rq_clock.part.83+0x1a/0xe0
  [] ? rebalance_domains+0x152/0x290
  [] ? run_rebalance_domains+0xdc/0x1d0
  [] ? __do_softirq+0xfb/0x320
  [] ? irq_exit+0x125/0x130
  [] ? scheduler_ipi+0x97/0x160
  [] ? smp_reschedule_interrupt+0x29/0x30
  [] ? reschedule_interrupt+0x6e/0x80
[] ? cpuidle_enter_state+0xcc/0x230
  [] ? cpuidle_enter_state+0x9c/0x230
  [] ? cpuidle_enter+0x17/0x20
  [] ? cpu_startup_entry+0x38c/0x420
  [] ? start_secondary+0x173/0x1e0

Because divide-by-zero error happens in function:

update_group_capacity()
  update_cpu_capacity()
scale_rt_capacity()
 {
  ...
  total = sched_avg_period() + delta;
  used = div_u64(avg, total);
  ...
 }

To fix this issue, check user input value of sysctl_sched_time_avg, keep
it unchanged when hitting invalid input, and set the minimum limit of
sysctl_sched_time_avg to 1 ms.

Reported-by: James Puthukattukaran 
Signed-off-by: Ethan Zhao 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: efa...@gmx.de
Cc: ethan.ker...@gmail.com
Cc: keesc...@chromium.org
Cc: mcg...@kernel.org
Cc: 
Link: 
http://lkml.kernel.org/r/1504504774-18253-1-git-send-email-ethan.z...@oracle.com
Signed-off-by: Ingo Molnar 
---
 kernel/sysctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbb..423554a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -367,7 +367,8 @@ static struct ctl_table kern_table[] = {
.data   = _sched_time_avg,
.maxlen = sizeof(unsigned int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = ,
},
 #ifdef CONFIG_SCHEDSTATS
{

Re: [PATCH v2] sched: check user input value of sysctl_sched_time_avg

2017-09-06 Thread Ethan Zhao




On 2017/9/7 3:50, Luis R. Rodriguez wrote:

On Mon, Sep 04, 2017 at 03:54:23PM +0800, Ethan Zhao wrote:

Peter,


On 2017/9/4 15:49, Peter Zijlstra wrote:

On Sat, Sep 02, 2017 at 02:57:32PM +0800, Ethan Zhao wrote:

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbb..609bed2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -367,7 +367,7 @@ static int sysrq_sysctl_handler(struct ctl_table *table, 
int write,
.data   = _sched_time_avg,
.maxlen = sizeof(unsigned int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec,
+   .proc_handler   = sched_time_avg_handler,

*sigh*, what's wrong with the below? Too easy?

  :),  seems I walked zigzag several cycles to get the right point.

It sounds like negative values are not possible too, right? If so then
you can use proc_douintvec_minmax().

 V4 has been sent days ago, please see v4.

 Thank,
 Ethan


   Luis

Re: [PATCH v2] sched: check user input value of sysctl_sched_time_avg

2017-09-06 Thread Ethan Zhao




On 2017/9/7 3:50, Luis R. Rodriguez wrote:

On Mon, Sep 04, 2017 at 03:54:23PM +0800, Ethan Zhao wrote:

Peter,


On 2017/9/4 15:49, Peter Zijlstra wrote:

On Sat, Sep 02, 2017 at 02:57:32PM +0800, Ethan Zhao wrote:

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbb..609bed2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -367,7 +367,7 @@ static int sysrq_sysctl_handler(struct ctl_table *table, 
int write,
.data   = _sched_time_avg,
.maxlen = sizeof(unsigned int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec,
+   .proc_handler   = sched_time_avg_handler,

*sigh*, what's wrong with the below? Too easy?

  :),  seems I walked zigzag several cycles to get the right point.

It sounds like negative values are not possible too, right? If so then
you can use proc_douintvec_minmax().

 V4 has been sent days ago, please see v4.

 Thank,
 Ethan


   Luis

Re: [PATCH v2] sched: check user input value of sysctl_sched_time_avg

2017-09-04 Thread Ethan Zhao


Peter,


On 2017/9/4 15:49, Peter Zijlstra wrote:

On Sat, Sep 02, 2017 at 02:57:32PM +0800, Ethan Zhao wrote:

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbb..609bed2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -367,7 +367,7 @@ static int sysrq_sysctl_handler(struct ctl_table *table, 
int write,
.data   = _sched_time_avg,
.maxlen = sizeof(unsigned int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec,
+   .proc_handler   = sched_time_avg_handler,

*sigh*, what's wrong with the below? Too easy?

 :),  seems I walked zigzag several cycles to get the right point.

 Thanks,
 Ethan


diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbbb8157..bbbc6a17c15e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -367,7 +367,8 @@ static struct ctl_table kern_table[] = {
.data   = _sched_time_avg,
.maxlen = sizeof(unsigned int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec,
+   .proc_handler   = proc_dointvec_min_max,
+   .extra1 = ,
},
  #ifdef CONFIG_SCHEDSTATS
{

Re: [PATCH v2] sched: check user input value of sysctl_sched_time_avg

2017-09-04 Thread Ethan Zhao


Peter,


On 2017/9/4 15:49, Peter Zijlstra wrote:

On Sat, Sep 02, 2017 at 02:57:32PM +0800, Ethan Zhao wrote:

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbb..609bed2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -367,7 +367,7 @@ static int sysrq_sysctl_handler(struct ctl_table *table, 
int write,
.data   = _sched_time_avg,
.maxlen = sizeof(unsigned int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec,
+   .proc_handler   = sched_time_avg_handler,

*sigh*, what's wrong with the below? Too easy?

 :),  seems I walked zigzag several cycles to get the right point.

 Thanks,
 Ethan


diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbbb8157..bbbc6a17c15e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -367,7 +367,8 @@ static struct ctl_table kern_table[] = {
.data   = _sched_time_avg,
.maxlen = sizeof(unsigned int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec,
+   .proc_handler   = proc_dointvec_min_max,
+   .extra1 = ,
},
  #ifdef CONFIG_SCHEDSTATS
{

Re: [PATCH] sched: reset sysctl_sched_time_avg to default when

2017-09-04 Thread Ethan Zhao


Peter,


On 2017/9/4 15:32, Peter Zijlstra wrote:

On Sat, Sep 02, 2017 at 08:33:03AM +0800, Ethan Zhao wrote:

Yep, that is the first place I considered to set the limit, but that would
break KABI ?

nah..

  V4 sent, please ignore v2 & v3.

  Thanks,
  Ethan

Re: [PATCH] sched: reset sysctl_sched_time_avg to default when

2017-09-04 Thread Ethan Zhao


Peter,


On 2017/9/4 15:32, Peter Zijlstra wrote:

On Sat, Sep 02, 2017 at 08:33:03AM +0800, Ethan Zhao wrote:

Yep, that is the first place I considered to set the limit, but that would
break KABI ?

nah..

  V4 sent, please ignore v2 & v3.

  Thanks,
  Ethan

[PATCH v4] sched: check user input value of sysctl_sched_time_avg

2017-09-04 Thread Ethan Zhao

System will hang if user set sysctl_sched_time_avg to 0 by

[root@XXX ~]# sysctl kernel.sched_time_avg_ms=0

Stack traceback for pid 0
0x883f6406c600 0 0 1 3 R 0x883f6406cf50 *swapper/3
883f7ccc3ae8 0018 810c4dd0 
00017800 883f7ccc3d78 0003 883f7ccc3bf8
810c4fc9 883f7ccc3c08 810c5043 883f7ccc3c08
Call Trace:
 [] ? update_group_capacity+0x110/0x200
[] ? update_sd_lb_stats+0x109/0x600
[] ? find_busiest_group+0x47/0x530
[] ? load_balance+0x194/0x900
[] ? update_rq_clock.part.83+0x1a/0xe0
[] ? rebalance_domains+0x152/0x290
[] ? run_rebalance_domains+0xdc/0x1d0
[] ? __do_softirq+0xfb/0x320
[] ? irq_exit+0x125/0x130
[] ? scheduler_ipi+0x97/0x160
[] ? smp_reschedule_interrupt+0x29/0x30
[] ? reschedule_interrupt+0x6e/0x80
  [] ? cpuidle_enter_state+0xcc/0x230
[] ? cpuidle_enter_state+0x9c/0x230
[] ? cpuidle_enter+0x17/0x20
[] ? cpu_startup_entry+0x38c/0x420
[] ? start_secondary+0x173/0x1e0

Because divide-by-zero error happens in function

update_group_capacity()
  update_cpu_capacity()
scale_rt_capacity()
 {
  ...
  total = sched_avg_period() + delta;
  used = div_u64(avg, total);
  ...
 }

Seems this issue could be reproduced on all I tried stable 4.1 - last
kernel.

To fix this issue, check user input value of sysctl_sched_time_avg, keep
it unchanged when hit invalid input, set the min limit of 
sysctl_sched_time_avg to 1 ms.

Reported-by: James Puthukattukaran <james.puthukattuka...@oracle.com>
Signed-off-by: Ethan Zhao <ethan.z...@oracle.com>
---
 v2: Check it at user input side in sysctl table (peterz).
 v3: Use proc_dointvec_minmax().
 v4: Fix a too long line in descripton part.

 kernel/sysctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbb..423554a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -367,7 +367,8 @@ static int sysrq_sysctl_handler(struct ctl_table *table, 
int write,
.data   = _sched_time_avg,
.maxlen = sizeof(unsigned int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = ,
},
 #ifdef CONFIG_SCHEDSTATS
{
-- 
1.8.3.1

[PATCH v4] sched: check user input value of sysctl_sched_time_avg

2017-09-04 Thread Ethan Zhao

System will hang if user set sysctl_sched_time_avg to 0 by

[root@XXX ~]# sysctl kernel.sched_time_avg_ms=0

Stack traceback for pid 0
0x883f6406c600 0 0 1 3 R 0x883f6406cf50 *swapper/3
883f7ccc3ae8 0018 810c4dd0 
00017800 883f7ccc3d78 0003 883f7ccc3bf8
810c4fc9 883f7ccc3c08 810c5043 883f7ccc3c08
Call Trace:
 [] ? update_group_capacity+0x110/0x200
[] ? update_sd_lb_stats+0x109/0x600
[] ? find_busiest_group+0x47/0x530
[] ? load_balance+0x194/0x900
[] ? update_rq_clock.part.83+0x1a/0xe0
[] ? rebalance_domains+0x152/0x290
[] ? run_rebalance_domains+0xdc/0x1d0
[] ? __do_softirq+0xfb/0x320
[] ? irq_exit+0x125/0x130
[] ? scheduler_ipi+0x97/0x160
[] ? smp_reschedule_interrupt+0x29/0x30
[] ? reschedule_interrupt+0x6e/0x80
  [] ? cpuidle_enter_state+0xcc/0x230
[] ? cpuidle_enter_state+0x9c/0x230
[] ? cpuidle_enter+0x17/0x20
[] ? cpu_startup_entry+0x38c/0x420
[] ? start_secondary+0x173/0x1e0

Because divide-by-zero error happens in function

update_group_capacity()
  update_cpu_capacity()
scale_rt_capacity()
 {
  ...
  total = sched_avg_period() + delta;
  used = div_u64(avg, total);
  ...
 }

Seems this issue could be reproduced on all I tried stable 4.1 - last
kernel.

To fix this issue, check user input value of sysctl_sched_time_avg, keep
it unchanged when hit invalid input, set the min limit of 
sysctl_sched_time_avg to 1 ms.

Reported-by: James Puthukattukaran 
Signed-off-by: Ethan Zhao 
---
 v2: Check it at user input side in sysctl table (peterz).
 v3: Use proc_dointvec_minmax().
 v4: Fix a too long line in descripton part.

 kernel/sysctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbb..423554a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -367,7 +367,8 @@ static int sysrq_sysctl_handler(struct ctl_table *table, 
int write,
.data   = _sched_time_avg,
.maxlen = sizeof(unsigned int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = ,
},
 #ifdef CONFIG_SCHEDSTATS
{
-- 
1.8.3.1

[PATCH v3] sched: check user input value of sysctl_sched_time_avg

2017-09-04 Thread Ethan Zhao

System will hang if user set sysctl_sched_time_avg to 0 by

[root@XXX ~]# sysctl kernel.sched_time_avg_ms=0

Stack traceback for pid 0
0x883f6406c600 0 0 1 3 R 0x883f6406cf50 *swapper/3
883f7ccc3ae8 0018 810c4dd0 
00017800 883f7ccc3d78 0003 883f7ccc3bf8
810c4fc9 883f7ccc3c08 810c5043 883f7ccc3c08
Call Trace:
 [] ? update_group_capacity+0x110/0x200
[] ? update_sd_lb_stats+0x109/0x600
[] ? find_busiest_group+0x47/0x530
[] ? load_balance+0x194/0x900
[] ? update_rq_clock.part.83+0x1a/0xe0
[] ? rebalance_domains+0x152/0x290
[] ? run_rebalance_domains+0xdc/0x1d0
[] ? __do_softirq+0xfb/0x320
[] ? irq_exit+0x125/0x130
[] ? scheduler_ipi+0x97/0x160
[] ? smp_reschedule_interrupt+0x29/0x30
[] ? reschedule_interrupt+0x6e/0x80
  [] ? cpuidle_enter_state+0xcc/0x230
[] ? cpuidle_enter_state+0x9c/0x230
[] ? cpuidle_enter+0x17/0x20
[] ? cpu_startup_entry+0x38c/0x420
[] ? start_secondary+0x173/0x1e0

Because divide-by-zero error happens in function

update_group_capacity()
  update_cpu_capacity()
scale_rt_capacity()
 {
  ...
  total = sched_avg_period() + delta;
  used = div_u64(avg, total);
  ...
 }

Seems this issue could be reproduced on all I tried stable 4.1 - last
kernel.

To fix this issue, check user input value of sysctl_sched_time_avg, keep
it unchanged when hit invalid input, set the min limit of sysctl_sched_time_avg
to 1 ms.

Reported-by: James Puthukattukaran <james.puthukattuka...@oracle.com>
Signed-off-by: Ethan Zhao <ethan.z...@oracle.com>
---
 v2: Check it at user input side in sysctl table (peterz).
 v3: Use proc_dointvec_minmax().

 kernel/sysctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbb..423554a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -367,7 +367,8 @@ static int sysrq_sysctl_handler(struct ctl_table *table, 
int write,
.data   = _sched_time_avg,
.maxlen = sizeof(unsigned int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = ,
},
 #ifdef CONFIG_SCHEDSTATS
{
-- 
1.8.3.1

[PATCH v3] sched: check user input value of sysctl_sched_time_avg

2017-09-04 Thread Ethan Zhao

System will hang if user set sysctl_sched_time_avg to 0 by

[root@XXX ~]# sysctl kernel.sched_time_avg_ms=0

Stack traceback for pid 0
0x883f6406c600 0 0 1 3 R 0x883f6406cf50 *swapper/3
883f7ccc3ae8 0018 810c4dd0 
00017800 883f7ccc3d78 0003 883f7ccc3bf8
810c4fc9 883f7ccc3c08 810c5043 883f7ccc3c08
Call Trace:
 [] ? update_group_capacity+0x110/0x200
[] ? update_sd_lb_stats+0x109/0x600
[] ? find_busiest_group+0x47/0x530
[] ? load_balance+0x194/0x900
[] ? update_rq_clock.part.83+0x1a/0xe0
[] ? rebalance_domains+0x152/0x290
[] ? run_rebalance_domains+0xdc/0x1d0
[] ? __do_softirq+0xfb/0x320
[] ? irq_exit+0x125/0x130
[] ? scheduler_ipi+0x97/0x160
[] ? smp_reschedule_interrupt+0x29/0x30
[] ? reschedule_interrupt+0x6e/0x80
  [] ? cpuidle_enter_state+0xcc/0x230
[] ? cpuidle_enter_state+0x9c/0x230
[] ? cpuidle_enter+0x17/0x20
[] ? cpu_startup_entry+0x38c/0x420
[] ? start_secondary+0x173/0x1e0

Because divide-by-zero error happens in function

update_group_capacity()
  update_cpu_capacity()
scale_rt_capacity()
 {
  ...
  total = sched_avg_period() + delta;
  used = div_u64(avg, total);
  ...
 }

Seems this issue could be reproduced on all I tried stable 4.1 - last
kernel.

To fix this issue, check user input value of sysctl_sched_time_avg, keep
it unchanged when hit invalid input, set the min limit of sysctl_sched_time_avg
to 1 ms.

Reported-by: James Puthukattukaran 
Signed-off-by: Ethan Zhao 
---
 v2: Check it at user input side in sysctl table (peterz).
 v3: Use proc_dointvec_minmax().

 kernel/sysctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbb..423554a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -367,7 +367,8 @@ static int sysrq_sysctl_handler(struct ctl_table *table, 
int write,
.data   = _sched_time_avg,
.maxlen = sizeof(unsigned int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = ,
},
 #ifdef CONFIG_SCHEDSTATS
{
-- 
1.8.3.1

[PATCH v2] sched: check user input value of sysctl_sched_time_avg

2017-09-02 Thread Ethan Zhao

System will hang if user set sysctl_sched_time_avg to 0 by

[root@XXX ~]# sysctl kernel.sched_time_avg_ms=0

Stack traceback for pid 0
0x883f6406c600 0 0 1 3 R 0x883f6406cf50 *swapper/3
883f7ccc3ae8 0018 810c4dd0 
00017800 883f7ccc3d78 0003 883f7ccc3bf8
810c4fc9 883f7ccc3c08 810c5043 883f7ccc3c08
Call Trace:
 [] ? update_group_capacity+0x110/0x200
[] ? update_sd_lb_stats+0x109/0x600
[] ? find_busiest_group+0x47/0x530
[] ? load_balance+0x194/0x900
[] ? update_rq_clock.part.83+0x1a/0xe0
[] ? rebalance_domains+0x152/0x290
[] ? run_rebalance_domains+0xdc/0x1d0
[] ? __do_softirq+0xfb/0x320
[] ? irq_exit+0x125/0x130
[] ? scheduler_ipi+0x97/0x160
[] ? smp_reschedule_interrupt+0x29/0x30
[] ? reschedule_interrupt+0x6e/0x80
  [] ? cpuidle_enter_state+0xcc/0x230
[] ? cpuidle_enter_state+0x9c/0x230
[] ? cpuidle_enter+0x17/0x20
[] ? cpu_startup_entry+0x38c/0x420
[] ? start_secondary+0x173/0x1e0

Because divide-by-zero error happens in function

update_group_capacity()
  update_cpu_capacity()
scale_rt_capacity()
{
...
total = sched_avg_period() + delta;
used = div_u64(avg, total);
...
}

Seems this issue could be reproduced on all I tried stable 4.1 - last
kernel.

To fix this issue, check user input value of sysctl_sched_time_avg, keep
it unchanged when hit invalid input.

Reported-by: James Puthukattukaran <james.puthukattuka...@oracle.com>
Signed-off-by: Ethan Zhao <ethan.z...@oracle.com>
---
 v2: check it in sysctl table (input side) as Peter suggested.
 Tested on stable 4.1, applied on stable 4.13-rc5 okay.

 include/linux/sched/sysctl.h |  3 +++
 kernel/sched/fair.c  | 26 ++
 kernel/sysctl.c  |  2 +-
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 0f5ecd4..cc25807 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -44,6 +44,9 @@ enum sched_tunable_scaling {
 int sched_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length,
loff_t *ppos);
+int sched_time_avg_handler(struct ctl_table *table, int write,
+   void __user *buffer, size_t *lenp,
+   loff_t *ppos);
 #endif
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c95880e..61155dc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -650,6 +650,32 @@ int sched_proc_update_handler(struct ctl_table *table, int 
write,
 
return 0;
 }
+
+int sched_time_avg_handler(struct ctl_table *table, int write,
+   void __user *buffer, size_t *lenp,
+   loff_t *ppos)
+{
+   struct ctl_table tbl;
+   int new_value, ret;
+
+   memset(, 0, sizeof(struct ctl_table));
+   tbl.maxlen = sizeof(unsigned int);
+
+   if (write)
+   tbl.data = _value;
+   else
+   tbl.data = _sched_time_avg;
+
+   ret = proc_dointvec(, write, buffer, lenp, ppos);
+   if (write && ret == 0) {
+   if (new_value <= 0)
+   ret =  -EINVAL;
+   else
+   sysctl_sched_time_avg = new_value;
+   }
+
+   return ret;
+}
 #endif
 
 /*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbb..609bed2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -367,7 +367,7 @@ static int sysrq_sysctl_handler(struct ctl_table *table, 
int write,
.data   = _sched_time_avg,
.maxlen = sizeof(unsigned int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec,
+   .proc_handler   = sched_time_avg_handler,
},
 #ifdef CONFIG_SCHEDSTATS
{
-- 
1.8.3.1

[PATCH v2] sched: check user input value of sysctl_sched_time_avg

2017-09-02 Thread Ethan Zhao

System will hang if user set sysctl_sched_time_avg to 0 by

[root@XXX ~]# sysctl kernel.sched_time_avg_ms=0

Stack traceback for pid 0
0x883f6406c600 0 0 1 3 R 0x883f6406cf50 *swapper/3
883f7ccc3ae8 0018 810c4dd0 
00017800 883f7ccc3d78 0003 883f7ccc3bf8
810c4fc9 883f7ccc3c08 810c5043 883f7ccc3c08
Call Trace:
 [] ? update_group_capacity+0x110/0x200
[] ? update_sd_lb_stats+0x109/0x600
[] ? find_busiest_group+0x47/0x530
[] ? load_balance+0x194/0x900
[] ? update_rq_clock.part.83+0x1a/0xe0
[] ? rebalance_domains+0x152/0x290
[] ? run_rebalance_domains+0xdc/0x1d0
[] ? __do_softirq+0xfb/0x320
[] ? irq_exit+0x125/0x130
[] ? scheduler_ipi+0x97/0x160
[] ? smp_reschedule_interrupt+0x29/0x30
[] ? reschedule_interrupt+0x6e/0x80
  [] ? cpuidle_enter_state+0xcc/0x230
[] ? cpuidle_enter_state+0x9c/0x230
[] ? cpuidle_enter+0x17/0x20
[] ? cpu_startup_entry+0x38c/0x420
[] ? start_secondary+0x173/0x1e0

Because divide-by-zero error happens in function

update_group_capacity()
  update_cpu_capacity()
scale_rt_capacity()
{
...
total = sched_avg_period() + delta;
used = div_u64(avg, total);
...
}

Seems this issue could be reproduced on all I tried stable 4.1 - last
kernel.

To fix this issue, check user input value of sysctl_sched_time_avg, keep
it unchanged when hit invalid input.

Reported-by: James Puthukattukaran 
Signed-off-by: Ethan Zhao 
---
 v2: check it in sysctl table (input side) as Peter suggested.
 Tested on stable 4.1, applied on stable 4.13-rc5 okay.

 include/linux/sched/sysctl.h |  3 +++
 kernel/sched/fair.c  | 26 ++
 kernel/sysctl.c  |  2 +-
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 0f5ecd4..cc25807 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -44,6 +44,9 @@ enum sched_tunable_scaling {
 int sched_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length,
loff_t *ppos);
+int sched_time_avg_handler(struct ctl_table *table, int write,
+   void __user *buffer, size_t *lenp,
+   loff_t *ppos);
 #endif
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c95880e..61155dc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -650,6 +650,32 @@ int sched_proc_update_handler(struct ctl_table *table, int 
write,
 
return 0;
 }
+
+int sched_time_avg_handler(struct ctl_table *table, int write,
+   void __user *buffer, size_t *lenp,
+   loff_t *ppos)
+{
+   struct ctl_table tbl;
+   int new_value, ret;
+
+   memset(, 0, sizeof(struct ctl_table));
+   tbl.maxlen = sizeof(unsigned int);
+
+   if (write)
+   tbl.data = _value;
+   else
+   tbl.data = _sched_time_avg;
+
+   ret = proc_dointvec(, write, buffer, lenp, ppos);
+   if (write && ret == 0) {
+   if (new_value <= 0)
+   ret =  -EINVAL;
+   else
+   sysctl_sched_time_avg = new_value;
+   }
+
+   return ret;
+}
 #endif
 
 /*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6648fbb..609bed2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -367,7 +367,7 @@ static int sysrq_sysctl_handler(struct ctl_table *table, 
int write,
.data   = _sched_time_avg,
.maxlen = sizeof(unsigned int),
.mode   = 0644,
-   .proc_handler   = proc_dointvec,
+   .proc_handler   = sched_time_avg_handler,
},
 #ifdef CONFIG_SCHEDSTATS
{
-- 
1.8.3.1

Re: [PATCH] sched: reset sysctl_sched_time_avg to default when

2017-09-02 Thread Ethan Zhao

Peter,

 V2 sent, tested okay on stable 4.1, re-factor a little, could apply
to v4.13-rc5,
please review.

[root@ ~]# sysctl kernel.sched_time_avg_ms
kernel.sched_time_avg_ms = 1000
[root@ ~]# sysctl kernel.sched_time_avg_ms=0
sysctl: setting key "kernel.sched_time_avg_ms": Invalid argument
kernel.sched_time_avg_ms = 0
[root@ ~]# sysctl kernel.sched_time_avg_ms
kernel.sched_time_avg_ms = 1000
[root@ ~]# sysctl kernel.sched_time_avg_ms=900
kernel.sched_time_avg_ms = 900
[root@ ~]# sysctl kernel.sched_time_avg_ms
kernel.sched_time_avg_ms = 900

Thanks,
Ethan

On Sat, Sep 2, 2017 at 8:33 AM, Ethan Zhao <ethan.ker...@gmail.com> wrote:
> Yep, that is the first place I considered to set the limit, but that would
> break KABI ?
>
> Thanks,
> Ethan
>
> On Fri, Sep 1, 2017 at 8:32 PM, Peter Zijlstra <pet...@infradead.org> wrote:
>> On Fri, Sep 01, 2017 at 07:31:54PM +0800, Ethan Zhao wrote:
>>> System will hang if user set sysctl_sched_time_avg to 0 by
>>>
>>> [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0
>>>
>>> Stack traceback for pid 0
>>> 0x883f6406c600 0 0 1 3 R 0x883f6406cf50 *swapper/3
>>> 883f7ccc3ae8 0018 810c4dd0 
>>> 00017800 883f7ccc3d78 0003 883f7ccc3bf8
>>> 810c4fc9 883f7ccc3c08 810c5043 883f7ccc3c08
>>> Call Trace:
>>>  [] ? update_group_capacity+0x110/0x200
>>> [] ? update_sd_lb_stats+0x109/0x600
>>> [] ? find_busiest_group+0x47/0x530
>>> [] ? load_balance+0x194/0x900
>>> [] ? update_rq_clock.part.83+0x1a/0xe0
>>> [] ? rebalance_domains+0x152/0x290
>>> [] ? run_rebalance_domains+0xdc/0x1d0
>>> [] ? __do_softirq+0xfb/0x320
>>> [] ? irq_exit+0x125/0x130
>>> [] ? scheduler_ipi+0x97/0x160
>>> [] ? smp_reschedule_interrupt+0x29/0x30
>>> [] ? reschedule_interrupt+0x6e/0x80
>>>  [] ? cpuidle_enter_state+0xcc/0x230
>>> [] ? cpuidle_enter_state+0x9c/0x230
>>> [] ? cpuidle_enter+0x17/0x20
>>> [] ? cpu_startup_entry+0x38c/0x420
>>> [] ? start_secondary+0x173/0x1e0
>>>
>>> Because divide-by-zero error happens in function
>>>
>>> update_group_capacity()
>>>   update_cpu_capacity()
>>> scale_rt_capacity()
>>>   {
>>>   ...
>>>   total = sched_avg_period() + delta;
>>>   used = div_u64(avg, total);
>>>   ...
>>>   }
>>>
>>> Seems this issue could be reproduced on all I tried stable 4.1 - last
>>> kernel.
>>>
>>> To fix this issue, reset sysctl_sched_time_avg to default value
>>> MSEC_PER_SEC if user input invalid value in function
>>>
>>> sched_avg_period()
>>>
>>> Reported-by: James Puthukattukaran <james.puthukattuka...@oracle.com>
>>> Signed-off-by: Ethan Zhao <ethan.z...@oracle.com>
>>> ---
>>>  Tested on stable 4.1, compiled on stable 4.13-rc5
>>>  kernel/sched/sched.h | 3 +++
>>>  1 file changed, 3 insertions(+)
>>>
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index eeef1a3..b398560 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -1620,6 +1620,9 @@ static inline void rq_last_tick_reset(struct rq *rq)
>>>
>>>  static inline u64 sched_avg_period(void)
>>>  {
>>> + if (unlikely(sysctl_sched_time_avg <= 0))
>>> + sysctl_sched_time_avg = MSEC_PER_SEC;
>>> +
>>
>> Sight, no.. you can set limits in the sysctl table.

Re: [PATCH] sched: reset sysctl_sched_time_avg to default when

2017-09-02 Thread Ethan Zhao

Peter,

 V2 sent, tested okay on stable 4.1, re-factor a little, could apply
to v4.13-rc5,
please review.

[root@ ~]# sysctl kernel.sched_time_avg_ms
kernel.sched_time_avg_ms = 1000
[root@ ~]# sysctl kernel.sched_time_avg_ms=0
sysctl: setting key "kernel.sched_time_avg_ms": Invalid argument
kernel.sched_time_avg_ms = 0
[root@ ~]# sysctl kernel.sched_time_avg_ms
kernel.sched_time_avg_ms = 1000
[root@ ~]# sysctl kernel.sched_time_avg_ms=900
kernel.sched_time_avg_ms = 900
[root@ ~]# sysctl kernel.sched_time_avg_ms
kernel.sched_time_avg_ms = 900

Thanks,
Ethan

On Sat, Sep 2, 2017 at 8:33 AM, Ethan Zhao  wrote:
> Yep, that is the first place I considered to set the limit, but that would
> break KABI ?
>
> Thanks,
> Ethan
>
> On Fri, Sep 1, 2017 at 8:32 PM, Peter Zijlstra  wrote:
>> On Fri, Sep 01, 2017 at 07:31:54PM +0800, Ethan Zhao wrote:
>>> System will hang if user set sysctl_sched_time_avg to 0 by
>>>
>>> [root@XXX ~]# sysctl kernel.sched_time_avg_ms=0
>>>
>>> Stack traceback for pid 0
>>> 0x883f6406c600 0 0 1 3 R 0x883f6406cf50 *swapper/3
>>> 883f7ccc3ae8 0018 810c4dd0 
>>> 00017800 883f7ccc3d78 0003 883f7ccc3bf8
>>> 810c4fc9 883f7ccc3c08 810c5043 883f7ccc3c08
>>> Call Trace:
>>>  [] ? update_group_capacity+0x110/0x200
>>> [] ? update_sd_lb_stats+0x109/0x600
>>> [] ? find_busiest_group+0x47/0x530
>>> [] ? load_balance+0x194/0x900
>>> [] ? update_rq_clock.part.83+0x1a/0xe0
>>> [] ? rebalance_domains+0x152/0x290
>>> [] ? run_rebalance_domains+0xdc/0x1d0
>>> [] ? __do_softirq+0xfb/0x320
>>> [] ? irq_exit+0x125/0x130
>>> [] ? scheduler_ipi+0x97/0x160
>>> [] ? smp_reschedule_interrupt+0x29/0x30
>>> [] ? reschedule_interrupt+0x6e/0x80
>>>  [] ? cpuidle_enter_state+0xcc/0x230
>>> [] ? cpuidle_enter_state+0x9c/0x230
>>> [] ? cpuidle_enter+0x17/0x20
>>> [] ? cpu_startup_entry+0x38c/0x420
>>> [] ? start_secondary+0x173/0x1e0
>>>
>>> Because divide-by-zero error happens in function
>>>
>>> update_group_capacity()
>>>   update_cpu_capacity()
>>> scale_rt_capacity()
>>>   {
>>>   ...
>>>   total = sched_avg_period() + delta;
>>>   used = div_u64(avg, total);
>>>   ...
>>>   }
>>>
>>> Seems this issue could be reproduced on all I tried stable 4.1 - last
>>> kernel.
>>>
>>> To fix this issue, reset sysctl_sched_time_avg to default value
>>> MSEC_PER_SEC if user input invalid value in function
>>>
>>> sched_avg_period()
>>>
>>> Reported-by: James Puthukattukaran 
>>> Signed-off-by: Ethan Zhao 
>>> ---
>>>  Tested on stable 4.1, compiled on stable 4.13-rc5
>>>  kernel/sched/sched.h | 3 +++
>>>  1 file changed, 3 insertions(+)
>>>
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index eeef1a3..b398560 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -1620,6 +1620,9 @@ static inline void rq_last_tick_reset(struct rq *rq)
>>>
>>>  static inline u64 sched_avg_period(void)
>>>  {
>>> + if (unlikely(sysctl_sched_time_avg <= 0))
>>> + sysctl_sched_time_avg = MSEC_PER_SEC;
>>> +
>>
>> Sight, no.. you can set limits in the sysctl table.

1 2 3 4 5 6 7 >

1 - 100 of 646 matches

Mail list logo