Re: [PATCH 0/2] PCI/AER: Remove/unexport error reporting enable/disable

2023-07-10 Thread Sathyanarayanan Kuppuswamy



On 7/10/23 4:21 PM, Bjorn Helgaas wrote:
> From: Bjorn Helgaas 
> 
> pci_disable_pcie_error_reporting() is unused; remove it.
> pci_enable_pcie_error_reporting() is used only inside aer.c; make it
> static.

Looks fine to me.

Reviewed-by: Kuppuswamy Sathyanarayanan 


> 
> Bjorn Helgaas (2):
>   PCI/AER: Drop unused pci_disable_pcie_error_reporting()
>   PCI/AER: Unexport pci_enable_pcie_error_reporting()
> 
>  drivers/pci/pcie/aer.c | 15 +--
>  include/linux/aer.h| 11 ---
>  2 files changed, 1 insertion(+), 25 deletions(-)
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v5 2/3] PCI/AER: Disable AER interrupt on suspend

2023-05-11 Thread Sathyanarayanan Kuppuswamy



On 5/11/23 6:36 AM, Kai-Heng Feng wrote:
> PCIe service that shares IRQ with PME may cause spurious wakeup on
> system suspend.
> 
> This is very similar to previous attempts to suspend AER and DPC [1],
> but this time disabling AER IRQ is to prevent immediate PME wakeup when
> AER shares the same IRQ line with PME.

IMHO, you don't need to mention the previous submission reason.

> 
> It's okay to disable AER because PCIe Base Spec 5.0, section 5.2 "Link
> State Power Management" states that TLP and DLLP transmission is
> disabled for a Link in L2/L3 Ready (D3hot), L2 (D3cold with aux power)
> and L3 (D3cold), hence we don't lose much here to disable AER IRQ during
> system suspend.

May be something like below?

PCIe services that share an IRQ with PME, such as AER or DPC, may cause a
spurious wakeup on system suspend. To prevent this, disable the AER
interrupt notification during the system suspend process.

As Per PCIe Base Spec 5.0, section 5.2, titled "Link State Power Management",
TLP and DLLP transmission are disabled for a Link in L2/L3 Ready (D3hot), L2
(D3cold with aux power) and L3 (D3cold) states. So disabling the AER 
notification
during suspend and re-enabling them during the resume process should not affect
the basic functionality.

> 
> [1] 
> https://lore.kernel.org/linux-pci/20220408153159.106741-1-kai.heng.f...@canonical.com/
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216295
> 
> Reviewed-by: Mika Westerberg 
> Signed-off-by: Kai-Heng Feng 
> ---
> v5:
>  - Wording.
> 
> v4:
> v3:
>  - No change.
> 
> v2:
>  - Only disable AER IRQ.
>  - No more check on PME IRQ#.
>  - Use helper.
> 
>  drivers/pci/pcie/aer.c | 22 ++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 1420e1f27105..9c07fdbeb52d 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1356,6 +1356,26 @@ static int aer_probe(struct pcie_device *dev)
>   return 0;
>  }
>  
> +static int aer_suspend(struct pcie_device *dev)
> +{
> + struct aer_rpc *rpc = get_service_data(dev);
> + struct pci_dev *pdev = rpc->rpd;
> +
> + aer_disable_irq(pdev);
> +
> + return 0;
> +}
> +
> +static int aer_resume(struct pcie_device *dev)
> +{
> + struct aer_rpc *rpc = get_service_data(dev);
> + struct pci_dev *pdev = rpc->rpd;
> +
> + aer_enable_irq(pdev);
> +
> + return 0;
> +}
> +
>  /**
>   * aer_root_reset - reset Root Port hierarchy, RCEC, or RCiEP
>   * @dev: pointer to Root Port, RCEC, or RCiEP
> @@ -1420,6 +1440,8 @@ static struct pcie_port_service_driver aerdriver = {
>   .service= PCIE_PORT_SERVICE_AER,
>  
>   .probe  = aer_probe,
> + .suspend= aer_suspend,
> + .resume = aer_resume,
>   .remove = aer_remove,
>  };
>  

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v4 2/3] PCI/AER: Disable AER interrupt on suspend

2023-04-24 Thread Sathyanarayanan Kuppuswamy
Hi,

On 4/24/23 10:55 PM, Kai-Heng Feng wrote:
> On Tue, Apr 25, 2023 at 7:47 AM Sathyanarayanan Kuppuswamy
>  wrote:
>>
>>
>>
>> On 4/23/23 10:52 PM, Kai-Heng Feng wrote:
>>> PCIe service that shares IRQ with PME may cause spurious wakeup on
>>> system suspend.
>>>
>>> PCIe Base Spec 5.0, section 5.2 "Link State Power Management" states
>>> that TLP and DLLP transmission is disabled for a Link in L2/L3 Ready
>>> (D3hot), L2 (D3cold with aux power) and L3 (D3cold), so we don't lose
>>> much here to disable AER during system suspend.
>>>
>>> This is very similar to previous attempts to suspend AER and DPC [1],
>>> but with a different reason.
>>>
>>> [1] 
>>> https://lore.kernel.org/linux-pci/20220408153159.106741-1-kai.heng.f...@canonical.com/
>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216295
>>>
>>> Reviewed-by: Mika Westerberg 
>>> Signed-off-by: Kai-Heng Feng 
>>> ---
>>
>> IIUC, you encounter AER errors during the suspend/resume process, which
>> results in AER IRQ. Because AER and PME share an IRQ, it is regarded as a
>> spurious wake-up IRQ. So to fix it, you want to disable AER reporting,
>> right?
> 
> Yes. That's exactly what happened.
> 
>>
>> It looks like it is harmless to disable the AER during the suspend/resume
>> path. But, I am wondering why we get these errors? Did you check what errors
>> you get during the suspend/resume path? Are these errors valid?
> 
> I really don't know. I think it's similar to the reasoning in commit
> b07461a8e45b ("PCI/AER: Clear error status registers during
> enumeration and restore"): "AER errors might be recorded when
> powering-on devices. These errors can be ignored, ...".
> For this case, it happens when powering-off the device (D3cold) via
> turning off power resources.

Got it.

Reviewed-by: Kuppuswamy Sathyanarayanan 


> 
> Kai-Heng
> 
>>
>>
>>>  drivers/pci/pcie/aer.c | 22 ++
>>>  1 file changed, 22 insertions(+)
>>>
>>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>>> index 1420e1f27105..9c07fdbeb52d 100644
>>> --- a/drivers/pci/pcie/aer.c
>>> +++ b/drivers/pci/pcie/aer.c
>>> @@ -1356,6 +1356,26 @@ static int aer_probe(struct pcie_device *dev)
>>>   return 0;
>>>  }
>>>
>>> +static int aer_suspend(struct pcie_device *dev)
>>> +{
>>> + struct aer_rpc *rpc = get_service_data(dev);
>>> + struct pci_dev *pdev = rpc->rpd;
>>> +
>>> + aer_disable_irq(pdev);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static int aer_resume(struct pcie_device *dev)
>>> +{
>>> + struct aer_rpc *rpc = get_service_data(dev);
>>> + struct pci_dev *pdev = rpc->rpd;
>>> +
>>> + aer_enable_irq(pdev);
>>> +
>>> + return 0;
>>> +}
>>> +
>>>  /**
>>>   * aer_root_reset - reset Root Port hierarchy, RCEC, or RCiEP
>>>   * @dev: pointer to Root Port, RCEC, or RCiEP
>>> @@ -1420,6 +1440,8 @@ static struct pcie_port_service_driver aerdriver = {
>>>   .service= PCIE_PORT_SERVICE_AER,
>>>
>>>   .probe  = aer_probe,
>>> + .suspend= aer_suspend,
>>> + .resume = aer_resume,
>>>   .remove = aer_remove,
>>>  };
>>>
>>
>> --
>> Sathyanarayanan Kuppuswamy
>> Linux Kernel Developer

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v4 2/3] PCI/AER: Disable AER interrupt on suspend

2023-04-24 Thread Sathyanarayanan Kuppuswamy



On 4/23/23 10:52 PM, Kai-Heng Feng wrote:
> PCIe service that shares IRQ with PME may cause spurious wakeup on
> system suspend.
> 
> PCIe Base Spec 5.0, section 5.2 "Link State Power Management" states
> that TLP and DLLP transmission is disabled for a Link in L2/L3 Ready
> (D3hot), L2 (D3cold with aux power) and L3 (D3cold), so we don't lose
> much here to disable AER during system suspend.
> 
> This is very similar to previous attempts to suspend AER and DPC [1],
> but with a different reason.
> 
> [1] 
> https://lore.kernel.org/linux-pci/20220408153159.106741-1-kai.heng.f...@canonical.com/
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216295
> 
> Reviewed-by: Mika Westerberg 
> Signed-off-by: Kai-Heng Feng 
> ---

IIUC, you encounter AER errors during the suspend/resume process, which
results in AER IRQ. Because AER and PME share an IRQ, it is regarded as a
spurious wake-up IRQ. So to fix it, you want to disable AER reporting,
right?

It looks like it is harmless to disable the AER during the suspend/resume
path. But, I am wondering why we get these errors? Did you check what errors
you get during the suspend/resume path? Are these errors valid?


>  drivers/pci/pcie/aer.c | 22 ++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 1420e1f27105..9c07fdbeb52d 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1356,6 +1356,26 @@ static int aer_probe(struct pcie_device *dev)
>   return 0;
>  }
>  
> +static int aer_suspend(struct pcie_device *dev)
> +{
> + struct aer_rpc *rpc = get_service_data(dev);
> + struct pci_dev *pdev = rpc->rpd;
> +
> + aer_disable_irq(pdev);
> +
> + return 0;
> +}
> +
> +static int aer_resume(struct pcie_device *dev)
> +{
> + struct aer_rpc *rpc = get_service_data(dev);
> + struct pci_dev *pdev = rpc->rpd;
> +
> + aer_enable_irq(pdev);
> +
> + return 0;
> +}
> +
>  /**
>   * aer_root_reset - reset Root Port hierarchy, RCEC, or RCiEP
>   * @dev: pointer to Root Port, RCEC, or RCiEP
> @@ -1420,6 +1440,8 @@ static struct pcie_port_service_driver aerdriver = {
>   .service= PCIE_PORT_SERVICE_AER,
>  
>   .probe  = aer_probe,
> + .suspend= aer_suspend,
> + .resume = aer_resume,
>   .remove = aer_remove,
>  };
>  

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v3 3/4] PCI/AER: Disable AER interrupt on suspend

2023-04-20 Thread Sathyanarayanan Kuppuswamy



On 4/20/23 5:59 AM, Kai-Heng Feng wrote:
> PCIe service that shares IRQ with PME may cause spurious wakeup on
> system suspend.
> 
> PCIe Base Spec 5.0, section 5.2 "Link State Power Management" states
> that TLP and DLLP transmission is disabled for a Link in L2/L3 Ready
> (D3hot), L2 (D3cold with aux power) and L3 (D3cold), so we don't lose
> much here to disable AER during system suspend.
> 
> This is very similar to previous attempts to suspend AER and DPC [1],
> but with a different reason.
> 
> [1] 
> https://lore.kernel.org/linux-pci/20220408153159.106741-1-kai.heng.f...@canonical.com/
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216295
> 
> Reviewed-by: Mika Westerberg 
> Signed-off-by: Kai-Heng Feng 
> ---

In Patch #1, you skip clearing AER errors in the resume path, right? So if we 
disable
interrupts here, will AER driver not be notified on resume path error?

> v3:
>  - No change.
> 
> v2:
>  - Only disable AER IRQ.
>  - No more check on PME IRQ#.
>  - Use helper.
> 
>  drivers/pci/pcie/aer.c | 22 ++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 1420e1f27105..9c07fdbeb52d 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1356,6 +1356,26 @@ static int aer_probe(struct pcie_device *dev)
>   return 0;
>  }
>  
> +static int aer_suspend(struct pcie_device *dev)
> +{
> + struct aer_rpc *rpc = get_service_data(dev);
> + struct pci_dev *pdev = rpc->rpd;
> +
> + aer_disable_irq(pdev);
> +
> + return 0;
> +}
> +
> +static int aer_resume(struct pcie_device *dev)
> +{
> + struct aer_rpc *rpc = get_service_data(dev);
> + struct pci_dev *pdev = rpc->rpd;
> +
> + aer_enable_irq(pdev);
> +
> + return 0;
> +}
> +
>  /**
>   * aer_root_reset - reset Root Port hierarchy, RCEC, or RCiEP
>   * @dev: pointer to Root Port, RCEC, or RCiEP
> @@ -1420,6 +1440,8 @@ static struct pcie_port_service_driver aerdriver = {
>   .service    = PCIE_PORT_SERVICE_AER,
>  
>   .probe  = aer_probe,
> + .suspend= aer_suspend,
> + .resume = aer_resume,
>   .remove = aer_remove,
>  };
>  

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v3 2/4] PCI/AER: Factor out interrupt toggling into helpers

2023-04-20 Thread Sathyanarayanan Kuppuswamy



On 4/20/23 5:59 AM, Kai-Heng Feng wrote:
> There are many places that enable and disable AER interrput, so move
> them into helpers.

Add "No functional changes intended"

> 
> Signed-off-by: Kai-Heng Feng 
> ---

Looks good to me.

Reviewed-by: Kuppuswamy Sathyanarayanan 


> v3:
>  - Correct subject.
> 
> v2:
>  - New patch.
> 
>  drivers/pci/pcie/aer.c | 45 +-
>  1 file changed, 27 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index f6c24ded134c..1420e1f27105 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1227,6 +1227,28 @@ static irqreturn_t aer_irq(int irq, void *context)
>   return IRQ_WAKE_THREAD;
>  }
>  
> +static void aer_enable_irq(struct pci_dev *pdev)
> +{
> + int aer = pdev->aer_cap;
> + u32 reg32;
> +
> + /* Enable Root Port's interrupt in response to error messages */
> + pci_read_config_dword(pdev, aer + PCI_ERR_ROOT_COMMAND, );
> + reg32 |= ROOT_PORT_INTR_ON_MESG_MASK;
> + pci_write_config_dword(pdev, aer + PCI_ERR_ROOT_COMMAND, reg32);
> +}
> +
> +static void aer_disable_irq(struct pci_dev *pdev)
> +{
> + int aer = pdev->aer_cap;
> + u32 reg32;
> +
> + /* Disable Root's interrupt in response to error messages */
> + pci_read_config_dword(pdev, aer + PCI_ERR_ROOT_COMMAND, );
> + reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK;
> + pci_write_config_dword(pdev, aer + PCI_ERR_ROOT_COMMAND, reg32);
> +}
> +
>  /**
>   * aer_enable_rootport - enable Root Port's interrupts when receiving 
> messages
>   * @rpc: pointer to a Root Port data structure
> @@ -1256,10 +1278,7 @@ static void aer_enable_rootport(struct aer_rpc *rpc)
>   pci_read_config_dword(pdev, aer + PCI_ERR_UNCOR_STATUS, );
>   pci_write_config_dword(pdev, aer + PCI_ERR_UNCOR_STATUS, reg32);
>  
> - /* Enable Root Port's interrupt in response to error messages */
> - pci_read_config_dword(pdev, aer + PCI_ERR_ROOT_COMMAND, );
> - reg32 |= ROOT_PORT_INTR_ON_MESG_MASK;
> - pci_write_config_dword(pdev, aer + PCI_ERR_ROOT_COMMAND, reg32);
> + aer_enable_irq(pdev);
>  }
>  
>  /**
> @@ -1274,10 +1293,7 @@ static void aer_disable_rootport(struct aer_rpc *rpc)
>   int aer = pdev->aer_cap;
>   u32 reg32;
>  
> - /* Disable Root's interrupt in response to error messages */
> - pci_read_config_dword(pdev, aer + PCI_ERR_ROOT_COMMAND, );
> - reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK;
> - pci_write_config_dword(pdev, aer + PCI_ERR_ROOT_COMMAND, reg32);
> + aer_disable_irq(pdev);
>  
>   /* Clear Root's error status reg */
>   pci_read_config_dword(pdev, aer + PCI_ERR_ROOT_STATUS, );
> @@ -1372,12 +1388,8 @@ static pci_ers_result_t aer_root_reset(struct pci_dev 
> *dev)
>*/
>   aer = root ? root->aer_cap : 0;
>  
> - if ((host->native_aer || pcie_ports_native) && aer) {
> - /* Disable Root's interrupt in response to error messages */
> - pci_read_config_dword(root, aer + PCI_ERR_ROOT_COMMAND, );
> - reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK;
> - pci_write_config_dword(root, aer + PCI_ERR_ROOT_COMMAND, reg32);
> - }
> + if ((host->native_aer || pcie_ports_native) && aer)
> + aer_disable_irq(root);
>  
>   if (type == PCI_EXP_TYPE_RC_EC || type == PCI_EXP_TYPE_RC_END) {
>   rc = pcie_reset_flr(dev, PCI_RESET_DO_RESET);
> @@ -1396,10 +1408,7 @@ static pci_ers_result_t aer_root_reset(struct pci_dev 
> *dev)
>   pci_read_config_dword(root, aer + PCI_ERR_ROOT_STATUS, );
>   pci_write_config_dword(root, aer + PCI_ERR_ROOT_STATUS, reg32);
>  
> - /* Enable Root Port's interrupt in response to error messages */
> - pci_read_config_dword(root, aer + PCI_ERR_ROOT_COMMAND, );
> - reg32 |= ROOT_PORT_INTR_ON_MESG_MASK;
> - pci_write_config_dword(root, aer + PCI_ERR_ROOT_COMMAND, reg32);
> + aer_enable_irq(root);
>   }
>  
>   return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCHv2 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO

2023-03-17 Thread Sathyanarayanan Kuppuswamy



On 3/17/23 10:51 AM, Grant Grundler wrote:
> Since correctable errors have been corrected (and counted), the dmesg output
> should not be reported as a warning, but rather as "informational".
> 
> Otherwise, using a certain well known vendor's PCIe parts in a USB4 docking
> station, the dmesg buffer can be spammed with correctable errors, 717 bytes
> per instance, potentially many MB per day.

Why don't you investigate why you are getting so many correctable errors?
Isn't solving the problem preferable to hiding the logs?

> 
> Given the "WARN" priority, these messages have already confused the typical
> user that stumbles across them, support staff (triaging feedback reports),
> and more than a few linux kernel devs. Changing to INFO will hide these
> messages from most audiences.
> 
> Signed-off-by: Grant Grundler 
> ---
>  drivers/pci/pcie/aer.c | 29 +++--
>  1 file changed, 19 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index f6c24ded134c..cb6b96233967 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -687,23 +687,29 @@ static void __aer_print_error(struct pci_dev *dev,
>  {
>   const char **strings;
>   unsigned long status = info->status & ~info->mask;
> - const char *level, *errmsg;
>   int i;
>  
>   if (info->severity == AER_CORRECTABLE) {
>   strings = aer_correctable_error_string;
> - level = KERN_WARNING;
> + pci_info(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n",
> + info->status, info->mask);
>   } else {
>   strings = aer_uncorrectable_error_string;
> - level = KERN_ERR;
> + pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n",
> + info->status, info->mask);
>   }
>  
>   for_each_set_bit(i, , 32) {
> - errmsg = strings[i];
> + const char *errmsg = strings[i];
> +
>   if (!errmsg)
>   errmsg = "Unknown Error Bit";
>  
> - pci_printk(level, dev, "   [%2d] %-22s%s\n", i, errmsg,
> + if (info->severity == AER_CORRECTABLE)
> + pci_info(dev, "   [%2d] %-22s%s\n", i, errmsg,
> + info->first_error == i ? " (First)" : "");
> + else
> + pci_err(dev, "   [%2d] %-22s%s\n", i, errmsg,
>   info->first_error == i ? " (First)" : "");
>   }
>   pci_dev_aer_stats_incr(dev, info);
> @@ -724,7 +730,7 @@ void aer_print_error(struct pci_dev *dev, struct 
> aer_err_info *info)
>   layer = AER_GET_LAYER_ERROR(info->severity, info->status);
>   agent = AER_GET_AGENT(info->severity, info->status);
>  
> - level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
> + level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR;
>  
>   pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
>  aer_error_severity_string[info->severity],
> @@ -797,14 +803,17 @@ void cper_print_aer(struct pci_dev *dev, int 
> aer_severity,
>   info.mask = mask;
>   info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
>  
> - pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
>   __aer_print_error(dev, );
> - pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
> - aer_error_layer[layer], aer_agent_string[agent]);
>  
> - if (aer_severity != AER_CORRECTABLE)
> + if (aer_severity == AER_CORRECTABLE) {
> + pci_info(dev, "aer_layer=%s, aer_agent=%s\n",
> + aer_error_layer[layer], aer_agent_string[agent]);
> + } else {
> + pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
> + aer_error_layer[layer], aer_agent_string[agent]);
>   pci_err(dev, "aer_uncor_severity: 0x%08x\n",
>   aer->uncor_severity);
> + }
>  
>   if (tlp_header_valid)
>   __print_tlp_header(dev, >header_log);

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH V1] PCI/AER: Configure ECRC only AER is native

2023-01-11 Thread Sathyanarayanan Kuppuswamy



On 1/11/23 8:59 PM, Vidya Sagar wrote:
> 
> 
> On 1/12/2023 9:18 AM, Sathyanarayanan Kuppuswamy wrote:
>> External email: Use caution opening links or attachments
>>
>>
>> On 1/11/23 7:33 PM, Vidya Sagar wrote:
>>> I think we still need bios option. For example, consider a system where 
>>> BIOS needs to keep ECRC enabled for integrity reasons but if kernel doesn't 
>>> want it for perf reasons, then, kernel can always use 'ecrc=off' option.
>>
>> I agree that "on" and "off" option makes sense. Since the kernel defaults 
>> ecrc setting to "bios", why again allow it as a command line option?
> 
> Agree. "on" and "off" are fine but "default" is redundant. Do you want me to 
> push a change to remove that as part of this patch itself? I think
> it is more like a cleanup and should go separately.

IMO, the "bios" option cleanup and command line update from Bjorn can be in one 
patch, and your change could be a separate patch. But it is
up to you and Bjorn.

> 
>>
>> -- 
>> Sathyanarayanan Kuppuswamy
>> Linux Kernel Developer

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH V1] PCI/AER: Configure ECRC only AER is native

2023-01-11 Thread Sathyanarayanan Kuppuswamy



On 1/11/23 7:33 PM, Vidya Sagar wrote:
> I think we still need bios option. For example, consider a system where BIOS 
> needs to keep ECRC enabled for integrity reasons but if kernel doesn't want 
> it for perf reasons, then, kernel can always use 'ecrc=off' option.

I agree that "on" and "off" option makes sense. Since the kernel defaults ecrc 
setting to "bios", why again allow it as a command line option?

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH V1] PCI/AER: Configure ECRC only AER is native

2023-01-11 Thread Sathyanarayanan Kuppuswamy
Hi,

On 1/11/23 3:10 PM, Bjorn Helgaas wrote:
> On Wed, Jan 11, 2023 at 01:42:21PM -0800, Sathyanarayanan Kuppuswamy wrote:
>> On 1/11/23 12:31 PM, Vidya Sagar wrote:
>>> As the ECRC configuration bits are part of AER registers, configure
>>> ECRC only if AER is natively owned by the kernel.
>>
>> ecrc command line option takes "bios/on/off" as possible options. It
>> does not clarify whether "on/off" choices can only be used if AER is
>> owned by OS or it can override the ownership of ECRC configuration 
>> similar to pcie_ports=native option. Maybe that needs to be clarified.
> 
> Good point, what do you think of an update like this:
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index 6cfa6e3996cf..f7b40a439194 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -4296,7 +4296,9 @@
>   specified, e.g., 12@pci:8086:9c22:103c:198f
>   for 4096-byte alignment.
>   ecrc=   Enable/disable PCIe ECRC (transaction layer
> - end-to-end CRC checking).
> + end-to-end CRC checking).  Only effective
> + if OS has native AER control (either granted by
> + ACPI _OSC or forced via "pcie_ports=native").
>   bios: Use BIOS/firmware settings. This is the
>   the default.
>   off: Turn ECRC off

Looks fine. But do we even need "bios" option? Since it is the default
value, I am not sure why we need to list that as an option again. IMO
this could be removed.

> 
> I don't know whether the "ecrc=" parameter is really needed.  If we
> were adding it today, I would ask "why not enable ECRC wherever it is
> supported?"  If there are devices where it's broken, we could always
> add quirks to disable it on a case-by-case basis.

Checking the original patch which added it, it looks like the intention
is to give option to boost performance over integrity.

commit 43c16408842b0eeb367c23a6fa540ce69f99e347
Author: Andrew Patterson 
Date:   Wed Apr 22 16:52:09 2009 -0600

PCI: Add support for turning PCIe ECRC on or off

Adds support for PCI Express transaction layer end-to-end CRC checking
(ECRC).  This patch will enable/disable ECRC checking by setting/clearing
the ECRC Check Enable and/or ECRC Generation Enable bits for devices that
support ECRC.

The ECRC setting is controlled by the "pci=ecrc=" command-line
option. If this option is not set or is set to 'bios", the enable and
generation bits are left in whatever state that firmware/BIOS set them to.
The "off" setting turns them off, and the "on" option turns them on (if the
device supports it).

Turning ECRC on or off can be a data integrity versus performance
tradeoff.  In theory, turning it on will catch more data errors, turning
it off means possibly better performance since CRC does not need to be
calculated by the PCIe hardware and packet sizes are reduced.


> 
> But I think the patch below is the right thing to do for now.  Vidya,

Agree.

> did you trip over an issue because of this, e.g., a conflict between
> firmware use of AER and Linux use of it?  If so, maybe we could
> mention a symptom on the commit log.  But my guess is you probably
> found this by inspection.
> 
> Bjorn
> 
>>> Signed-off-by: Vidya Sagar 
>>> ---
>>>  drivers/pci/pcie/aer.c | 3 +++
>>>  1 file changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>>> index e2d8a74f83c3..730b47bdcdef 100644
>>> --- a/drivers/pci/pcie/aer.c
>>> +++ b/drivers/pci/pcie/aer.c
>>> @@ -184,6 +184,9 @@ static int disable_ecrc_checking(struct pci_dev *dev)
>>>   */
>>>  void pcie_set_ecrc_checking(struct pci_dev *dev)
>>>  {
>>> +   if (!pcie_aer_is_native(dev))
>>> +   return;
>>> +
>>> switch (ecrc_policy) {
>>> case ECRC_POLICY_DEFAULT:
>>> return;
>>
>> -- 
>> Sathyanarayanan Kuppuswamy
>> Linux Kernel Developer

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH V1] PCI/AER: Configure ECRC only AER is native

2023-01-11 Thread Sathyanarayanan Kuppuswamy



On 1/11/23 12:31 PM, Vidya Sagar wrote:
> As the ECRC configuration bits are part of AER registers, configure
> ECRC only if AER is natively owned by the kernel.

ecrc command line option takes "bios/on/off" as possible options. It
does not clarify whether "on/off" choices can only be used if AER is
owned by OS or it can override the ownership of ECRC configuration 
similar to pcie_ports=native option. Maybe that needs to be clarified.

> 
> Signed-off-by: Vidya Sagar 
> ---
>  drivers/pci/pcie/aer.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index e2d8a74f83c3..730b47bdcdef 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -184,6 +184,9 @@ static int disable_ecrc_checking(struct pci_dev *dev)
>   */
>  void pcie_set_ecrc_checking(struct pci_dev *dev)
>  {
> + if (!pcie_aer_is_native(dev))
> + return;
> +
>   switch (ecrc_policy) {
>   case ECRC_POLICY_DEFAULT:
>   return;

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [External] Re: [PATCH v2 3/9] NTB: Change to use pci_aer_clear_uncorrect_error_status()

2022-09-27 Thread Sathyanarayanan Kuppuswamy



On 9/27/22 9:20 PM, Zhuo Chen wrote:
> 
> 
> On 9/28/22 3:39 AM, Sathyanarayanan Kuppuswamy wrote:
>>
>>
>> On 9/27/22 8:35 AM, Zhuo Chen wrote:
>>> Status bits for ERR_NONFATAL errors only are cleared in
>>> pci_aer_clear_nonfatal_status(), but we want clear uncorrectable
>>> error status in idt_init_pci(), so we change to use
>>> pci_aer_clear_uncorrect_error_status().
>>
>> You mean currently driver does not clear fatal errors now, and it is
>> a problem? Any error reported?
>>
> Hi Sathyanarayanan,
> 
> No error reports yet, I just changes the behavior back to what it was before 
> commit e7b0b847de6d ("PCI/AER: Clear only ERR_NONFATAL bits during non-fatal 
> recovery"), because this commit change the original function in commit 
> bf2a952d31d2 ("NTB: Add IDT 89HPESxNTx PCIe-switches support").
> 

Ok. Thanks for clarifying.

>> Also, I am wondering why is it required to clear errors during init
>> code. Is it a norm?
>>
> I think there is no need to clear errors during init code.
>>>
>>> Signed-off-by: Zhuo Chen 
>>> ---
>>>   drivers/ntb/hw/idt/ntb_hw_idt.c | 4 ++--
>>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/ntb/hw/idt/ntb_hw_idt.c 
>>> b/drivers/ntb/hw/idt/ntb_hw_idt.c
>>> index 0ed6f809ff2e..d5f0aa87f817 100644
>>> --- a/drivers/ntb/hw/idt/ntb_hw_idt.c
>>> +++ b/drivers/ntb/hw/idt/ntb_hw_idt.c
>>> @@ -2657,8 +2657,8 @@ static int idt_init_pci(struct idt_ntb_dev *ndev)
>>>   ret = pci_enable_pcie_error_reporting(pdev);
>>>   if (ret != 0)
>>>   dev_warn(>dev, "PCIe AER capability disabled\n");
>>> -    else /* Cleanup nonfatal error status before getting to init */
>>> -    pci_aer_clear_nonfatal_status(pdev);
>>> +    else /* Cleanup uncorrectable error status before getting to init */
>>> +    pci_aer_clear_uncorrect_error_status(pdev);
>>>     /* First enable the PCI device */
>>>   ret = pcim_enable_device(pdev);
>>
> 

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v2 5/9] PCI/AER: Unexport pci_aer_clear_nonfatal_status()

2022-09-27 Thread Sathyanarayanan Kuppuswamy



On 9/27/22 8:35 AM, Zhuo Chen wrote:
> Since pci_aer_clear_nonfatal_status() is used only internally, move
> its declaration to the PCI internal header file. Also, no one cares
> about return value of pci_aer_clear_nonfatal_status(), so make it void.
> 
> Signed-off-by: Zhuo Chen 
> ---

Looks good to me.

>  drivers/pci/pci.h  | 2 ++
>  drivers/pci/pcie/aer.c | 7 ++-
>  include/linux/aer.h| 5 -
>  3 files changed, 4 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 785f31086313..a114175d08e4 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -684,6 +684,7 @@ void pci_aer_init(struct pci_dev *dev);
>  void pci_aer_exit(struct pci_dev *dev);
>  extern const struct attribute_group aer_stats_attr_group;
>  void pci_aer_clear_fatal_status(struct pci_dev *dev);
> +void pci_aer_clear_nonfatal_status(struct pci_dev *dev);
>  int pci_aer_clear_status(struct pci_dev *dev);
>  int pci_aer_raw_clear_status(struct pci_dev *dev);
>  #else
> @@ -691,6 +692,7 @@ static inline void pci_no_aer(void) { }
>  static inline void pci_aer_init(struct pci_dev *d) { }
>  static inline void pci_aer_exit(struct pci_dev *d) { }
>  static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
> +static inline void pci_aer_clear_nonfatal_status(struct pci_dev *dev) { }
>  static inline int pci_aer_clear_status(struct pci_dev *dev) { return 
> -EINVAL; }
>  static inline int pci_aer_raw_clear_status(struct pci_dev *dev) { return 
> -EINVAL; }
>  #endif
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 4e637121be23..e2ebd108339d 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -251,13 +251,13 @@ int pci_disable_pcie_error_reporting(struct pci_dev 
> *dev)
>  }
>  EXPORT_SYMBOL_GPL(pci_disable_pcie_error_reporting);
>  
> -int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
> +void pci_aer_clear_nonfatal_status(struct pci_dev *dev)
>  {
>   int aer = dev->aer_cap;
>   u32 status, sev;
>  
>   if (!pcie_aer_is_native(dev))
> - return -EIO;
> + return;
>  
>   /* Clear status bits for ERR_NONFATAL errors only */
>   pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, );
> @@ -265,10 +265,7 @@ int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
>   status &= ~sev;
>   if (status)
>   pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, status);
> -
> - return 0;
>  }
> -EXPORT_SYMBOL_GPL(pci_aer_clear_nonfatal_status);
>  
>  void pci_aer_clear_fatal_status(struct pci_dev *dev)
>  {
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 154690c278cb..f638ad955deb 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -44,7 +44,6 @@ struct aer_capability_regs {
>  /* PCIe port driver needs this function to enable AER */
>  int pci_enable_pcie_error_reporting(struct pci_dev *dev);
>  int pci_disable_pcie_error_reporting(struct pci_dev *dev);
> -int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
>  int pci_aer_clear_uncorrect_error_status(struct pci_dev *dev);
>  void pci_save_aer_state(struct pci_dev *dev);
>  void pci_restore_aer_state(struct pci_dev *dev);
> @@ -57,10 +56,6 @@ static inline int pci_disable_pcie_error_reporting(struct 
> pci_dev *dev)
>  {
>   return -EINVAL;
>  }
> -static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
> -{
> - return -EINVAL;
> -}
>  static inline int pci_aer_clear_uncorrect_error_status(struct pci_dev *dev)
>  {
>   return -EINVAL;

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v2 4/9] scsi: lpfc: Change to use pci_aer_clear_uncorrect_error_status()

2022-09-27 Thread Sathyanarayanan Kuppuswamy



On 9/27/22 8:35 AM, Zhuo Chen wrote:
> Status bits for ERR_NONFATAL errors only are cleared in
> pci_aer_clear_nonfatal_status(), but we want clear uncorrectable
> error status in lpfc_aer_cleanup_state(), so we change to use
> pci_aer_clear_uncorrect_error_status().

I think you don't need to mention status bits here. Just use terms
"fatal" and "non-fatal" errors.

lpfc_aer_cleanup_state() requires clearing both fatal and non-fatal
uncorrectable error status. But using  pci_aer_clear_nonfatal_status()
will only clear non-fatal error status. To clear both fatal and non-fatal
error status, use pci_aer_clear_uncorrect_error_status().

> 
> Signed-off-by: Zhuo Chen 
> ---
>  drivers/scsi/lpfc/lpfc_attr.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/lpfc/lpfc_attr.c b/drivers/scsi/lpfc/lpfc_attr.c
> index 09cf2cd0ae60..d835cc0ba153 100644
> --- a/drivers/scsi/lpfc/lpfc_attr.c
> +++ b/drivers/scsi/lpfc/lpfc_attr.c
> @@ -4689,7 +4689,7 @@ static DEVICE_ATTR_RW(lpfc_aer_support);
>   * Description:
>   * If the @buf contains 1 and the device currently has the AER support
>   * enabled, then invokes the kernel AER helper routine
> - * pci_aer_clear_nonfatal_status() to clean up the uncorrectable
> + * pci_aer_clear_uncorrect_error_status() to clean up the uncorrectable
>   * error status register.
>   *
>   * Notes:
> @@ -4715,7 +4715,7 @@ lpfc_aer_cleanup_state(struct device *dev, struct 
> device_attribute *attr,
>   return -EINVAL;
>  
>   if (phba->hba_flag & HBA_AER_ENABLED)
> - rc = pci_aer_clear_nonfatal_status(phba->pcidev);
> + rc = pci_aer_clear_uncorrect_error_status(phba->pcidev);
>  
>   if (rc == 0)
>   return strlen(buf);

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v2 3/9] NTB: Change to use pci_aer_clear_uncorrect_error_status()

2022-09-27 Thread Sathyanarayanan Kuppuswamy



On 9/27/22 8:35 AM, Zhuo Chen wrote:
> Status bits for ERR_NONFATAL errors only are cleared in
> pci_aer_clear_nonfatal_status(), but we want clear uncorrectable
> error status in idt_init_pci(), so we change to use
> pci_aer_clear_uncorrect_error_status().

You mean currently driver does not clear fatal errors now, and it is
a problem? Any error reported?

Also, I am wondering why is it required to clear errors during init
code. Is it a norm?

> 
> Signed-off-by: Zhuo Chen 
> ---
>  drivers/ntb/hw/idt/ntb_hw_idt.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/ntb/hw/idt/ntb_hw_idt.c b/drivers/ntb/hw/idt/ntb_hw_idt.c
> index 0ed6f809ff2e..d5f0aa87f817 100644
> --- a/drivers/ntb/hw/idt/ntb_hw_idt.c
> +++ b/drivers/ntb/hw/idt/ntb_hw_idt.c
> @@ -2657,8 +2657,8 @@ static int idt_init_pci(struct idt_ntb_dev *ndev)
>   ret = pci_enable_pcie_error_reporting(pdev);
>   if (ret != 0)
>   dev_warn(>dev, "PCIe AER capability disabled\n");
> - else /* Cleanup nonfatal error status before getting to init */
> - pci_aer_clear_nonfatal_status(pdev);
> + else /* Cleanup uncorrectable error status before getting to init */
> + pci_aer_clear_uncorrect_error_status(pdev);
>  
>   /* First enable the PCI device */
>   ret = pcim_enable_device(pdev);

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v2 1/9] PCI/AER: Add pci_aer_clear_uncorrect_error_status() to PCI core

2022-09-27 Thread Sathyanarayanan Kuppuswamy
Hi,

On 9/27/22 8:35 AM, Zhuo Chen wrote:
> Sometimes we need to clear aer uncorrectable error status, so we add

Adding n actual use case will help.

> pci_aer_clear_uncorrect_error_status() to PCI core.

If possible, try to avoid "we" usage in commit log. Just say "so add
pci_aer_clear_uncorrect_error_status() function" 

> 
> Signed-off-by: Zhuo Chen 
> ---
>  drivers/pci/pcie/aer.c | 16 
>  include/linux/aer.h|  5 +
>  2 files changed, 21 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index e2d8a74f83c3..4e637121be23 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -286,6 +286,22 @@ void pci_aer_clear_fatal_status(struct pci_dev *dev)
>   pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, status);
>  }
>  
> +int pci_aer_clear_uncorrect_error_status(struct pci_dev *dev)
> +{
> + int aer = dev->aer_cap;
> + u32 status;
> +
> + if (!pcie_aer_is_native(dev))
> + return -EIO;
> +
> + pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, );
> + if (status)
> + pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, status);

Why not just write all '1' and clear it? Why read and write?

> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(pci_aer_clear_uncorrect_error_status);

Add details about why you want to export in commit log.

> +
>  /**
>   * pci_aer_raw_clear_status - Clear AER error registers.
>   * @dev: the PCI device
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 97f64ba1b34a..154690c278cb 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -45,6 +45,7 @@ struct aer_capability_regs {
>  int pci_enable_pcie_error_reporting(struct pci_dev *dev);
>  int pci_disable_pcie_error_reporting(struct pci_dev *dev);
>  int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
> +int pci_aer_clear_uncorrect_error_status(struct pci_dev *dev);
>  void pci_save_aer_state(struct pci_dev *dev);
>  void pci_restore_aer_state(struct pci_dev *dev);
>  #else
> @@ -60,6 +61,10 @@ static inline int pci_aer_clear_nonfatal_status(struct 
> pci_dev *dev)
>  {
>   return -EINVAL;
>  }
> +static inline int pci_aer_clear_uncorrect_error_status(struct pci_dev *dev)
> +{
> + return -EINVAL;
> +}
>  static inline void pci_save_aer_state(struct pci_dev *dev) {}
>  static inline void pci_restore_aer_state(struct pci_dev *dev) {}
>  #endif

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v2 2/9] PCI/DPC: Use pci_aer_clear_uncorrect_error_status() to clear uncorrectable error status

2022-09-27 Thread Sathyanarayanan Kuppuswamy
Hi,

On 9/27/22 8:35 AM, Zhuo Chen wrote:
> Use pci_aer_clear_nonfatal_status() in dpc_process_error(), which has
> no functional changes.

Just say pci_aer_clear_uncorrect_error_status() clears both fatal and
non-fatal errors. So use it in place of pci_aer_clear_nonfatal_status()
and pci_aer_clear_fatal_status().

> 
> Signed-off-by: Zhuo Chen 
> ---
>  drivers/pci/pcie/dpc.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index 3e9afee02e8d..7942073fbb34 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -288,8 +288,7 @@ void dpc_process_error(struct pci_dev *pdev)
>dpc_get_aer_uncorrect_severity(pdev, ) &&
>aer_get_device_error_info(pdev, )) {
>   aer_print_error(pdev, );
> - pci_aer_clear_nonfatal_status(pdev);
> - pci_aer_clear_fatal_status(pdev);
> + pci_aer_clear_uncorrect_error_status(pdev);
>   }
>  }
>  

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v3] PCI/ERR: Use pcie_aer_is_native() to judge whether OS owns AER

2022-08-02 Thread Sathyanarayanan Kuppuswamy



On 7/27/22 2:37 AM, Zhuo Chen wrote:
>>
> Do you mean changing "if ((host->native_aer || pcie_ports_native) && aer)" 
> into "if (pcie_aer_is_native(dev) && aer)" ?
> I thought changing into "if (pcie_aer_is_native(dev))" before.
> 
> One another doubt. Not every pci device support aer. When dev->aer_cap is 
> NULL and root->aer_cap is not NULL in aer_root_reset(), pcie_aer_is_native() 
> will return false and OS cannot operate root register. It's different from 
> just using "(host->native_aer || pcie_ports_native)".
> 
> Or we can change "if ((host->native_aer || pcie_ports_native) && aer)" into 
> "if (pcie_aer_is_native(root))". But in this way, argument NULL pointer check 
> should be added in pcie_aer_is_native().

Looking into it again, I think it is better to leave it as it is. Please ignore 
my comment.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v3] PCI/ERR: Use pcie_aer_is_native() to judge whether OS owns AER

2022-07-26 Thread Sathyanarayanan Kuppuswamy



On 7/26/22 8:53 PM, Zhuo Chen wrote:
> Use pcie_aer_is_native() in place of "host->native_aer ||
> pcie_ports_native" to judge whether OS has native control of AER
> in pcie_do_recovery().
> 
> Replace "dev->aer_cap && (pcie_ports_native || host->native_aer)" in
> get_port_device_capability() with pcie_aer_is_native(), which has no
> functional changes.
> 
> Signed-off-by: Zhuo Chen 
> ---

Patch looks better now. It looks like following two changes
can also be replaced with pcie_aer_is_native() check.

drivers/pci/pcie/aer.c:1407:if ((host->native_aer || pcie_ports_native) && 
aer) {
drivers/pci/pcie/aer.c:1426:if ((host->native_aer || pcie_ports_native) && 
aer) {



> Changelog:
> v3:
> - Simplify why we use pcie_aer_is_native().
> - Revert modification of pci_aer_clear_nonfatal_status() and comments.
> v2:
> - Add details and note in commit log.
> ---
>  drivers/pci/pcie/err.c  | 3 +--
>  drivers/pci/pcie/portdrv_core.c | 3 +--
>  2 files changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 0c5a143025af..121a53338e44 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -184,7 +184,6 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>   int type = pci_pcie_type(dev);
>   struct pci_dev *bridge;
>   pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> - struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>  
>   /*
>* If the error was detected by a Root Port, Downstream Port, RCEC,
> @@ -243,7 +242,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>* it is responsible for clearing this status.  In that case, the
>* signaling device may not even be visible to the OS.
>*/
> - if (host->native_aer || pcie_ports_native) {
> + if (pcie_aer_is_native(dev)) {
>   pcie_clear_device_status(dev);
>   pci_aer_clear_nonfatal_status(dev);
>   }
> diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> index 604feeb84ee4..98c18f4a01b2 100644
> --- a/drivers/pci/pcie/portdrv_core.c
> +++ b/drivers/pci/pcie/portdrv_core.c
> @@ -221,8 +221,7 @@ static int get_port_device_capability(struct pci_dev *dev)
>   }
>  
>  #ifdef CONFIG_PCIEAER
> - if (dev->aer_cap && pci_aer_available() &&
> - (pcie_ports_native || host->native_aer)) {
> + if (pcie_aer_is_native(dev) && pci_aer_available()) {
>   services |= PCIE_PORT_SERVICE_AER;
>  
>   /*

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v2] PCI/ERR: Use pcie_aer_is_native() to judge whether OS owns AER

2022-07-26 Thread Sathyanarayanan Kuppuswamy



On 7/25/22 7:05 PM, Zhuo Chen wrote:
> The AER status of the device that reported the error rather than
> the first downstream port is cleared after commit 7d7cbeaba5b7
> ("PCI/ERR: Clear status of the reporting device"). So "a bridge
> may not exist" which commit aa344bc8b727 ("PCI/ERR: Clear AER
> status only when we control AER") referring to is no longer
> existent, and we just use pcie_aer_is_native() in stead of
> "host->native_aer || pcie_ports_native".

IMO, above history is not required to justify using pcie_aer_is_native()
in place of "host->native_aer || pcie_ports_native".

> 
> pci_aer_clear_nonfatal_status() already has pcie_aer_is_native(),
> so we move pci_aer_clear_nonfatal_status() out of
> pcie_aer_is_native().

Moving it outside (pcie_aer_is_native()) does not optimize the
code. So I think it is better to leave it inside.

> 
> Replace statements that judge whether OS owns AER in
> get_port_device_capability() with pcie_aer_is_native(), which has
> no functional changes.
> 
> Signed-off-by: Zhuo Chen 
> ---
> v2:
> - Add details and note in commit log
> ---
>  drivers/pci/pcie/err.c  | 12 ++--
>  drivers/pci/pcie/portdrv_core.c |  3 +--
>  2 files changed, 3 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 0c5a143025af..28339c741555 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -184,7 +184,6 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>   int type = pci_pcie_type(dev);
>   struct pci_dev *bridge;
>   pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> - struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>  
>   /*
>* If the error was detected by a Root Port, Downstream Port, RCEC,
> @@ -237,16 +236,9 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>   pci_dbg(bridge, "broadcast resume message\n");
>   pci_walk_bridge(bridge, report_resume, );
>  
> - /*
> -  * If we have native control of AER, clear error status in the device
> -  * that detected the error.  If the platform retained control of AER,
> -  * it is responsible for clearing this status.  In that case, the
> -  * signaling device may not even be visible to the OS.
> -  */

The above comment is still applicable. So I think you don't need to remove it.

> - if (host->native_aer || pcie_ports_native) {
> + if (pcie_aer_is_native(dev))
>   pcie_clear_device_status(dev);
> - pci_aer_clear_nonfatal_status(dev);
> - }
> + pci_aer_clear_nonfatal_status(dev);
>   pci_info(bridge, "device recovery successful\n");
>   return status;
>  
> diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> index 604feeb84ee4..98c18f4a01b2 100644
> --- a/drivers/pci/pcie/portdrv_core.c
> +++ b/drivers/pci/pcie/portdrv_core.c
> @@ -221,8 +221,7 @@ static int get_port_device_capability(struct pci_dev *dev)
>   }
>  
>  #ifdef CONFIG_PCIEAER
> - if (dev->aer_cap && pci_aer_available() &&
> - (pcie_ports_native || host->native_aer)) {
> + if (pcie_aer_is_native(dev) && pci_aer_available()) {
>   services |= PCIE_PORT_SERVICE_AER;
>  
>   /*

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH] PCI/ERR: Use pcie_aer_is_native() to judge whether OS owns AER

2022-07-25 Thread Sathyanarayanan Kuppuswamy



On 7/25/22 9:01 AM, Zhuo Chen wrote:
> After commit 7d7cbeaba5b7 ("PCI/ERR: Clear status of the reporting
> device"), the AER status of the device that reported the error
> rather than the first downstream port is cleared. So the problem
> in commit aa344bc8b727 ("PCI/ERR: Clear AER status only when we
> control AER") is no longer existent, and we change to use
> pcie_aer_is_native() here.
Can you add the details of the problem you are referring to? Also
include details about how this problem relates to your commit.

IIUC, your commit replaces "host->native_aer || pcie_ports_native"
with pcie_aer_is_native(dev, correct? If so, add a note in commit
log that it has no functional changes.

> 
> pci_aer_clear_nonfatal_status() already has pcie_aer_is_native(),
> so we move pci_aer_clear_nonfatal_status() out of
> pcie_aer_is_native().
> 
> Replace statements that judge whether OS owns AER in
> get_port_device_capability() with pcie_aer_is_native().
> 
> Signed-off-by: Zhuo Chen 
> ---
>  drivers/pci/pcie/err.c  | 12 ++--
>  drivers/pci/pcie/portdrv_core.c |  3 +--
>  2 files changed, 3 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 0c5a143025af..28339c741555 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -184,7 +184,6 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>   int type = pci_pcie_type(dev);
>   struct pci_dev *bridge;
>   pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> - struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>  
>   /*
>* If the error was detected by a Root Port, Downstream Port, RCEC,
> @@ -237,16 +236,9 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>   pci_dbg(bridge, "broadcast resume message\n");
>   pci_walk_bridge(bridge, report_resume, );
>  
> - /*
> -  * If we have native control of AER, clear error status in the device
> -  * that detected the error.  If the platform retained control of AER,
> -  * it is responsible for clearing this status.  In that case, the
> -  * signaling device may not even be visible to the OS.
> -  */
> - if (host->native_aer || pcie_ports_native) {
> + if (pcie_aer_is_native(dev))
>   pcie_clear_device_status(dev);
> - pci_aer_clear_nonfatal_status(dev);
> - }
> + pci_aer_clear_nonfatal_status(dev);
>   pci_info(bridge, "device recovery successful\n");
>   return status;
>  
> diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> index 604feeb84ee4..98c18f4a01b2 100644
> --- a/drivers/pci/pcie/portdrv_core.c
> +++ b/drivers/pci/pcie/portdrv_core.c
> @@ -221,8 +221,7 @@ static int get_port_device_capability(struct pci_dev *dev)
>   }
>  
>  #ifdef CONFIG_PCIEAER
> - if (dev->aer_cap && pci_aer_available() &&
> - (pcie_ports_native || host->native_aer)) {
> + if (pcie_aer_is_native(dev) && pci_aer_available()) {
>   services |= PCIE_PORT_SERVICE_AER;
>  
>   /*

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH] PCI/ERR: handle disconnected devices in report_error_detected

2022-06-07 Thread Sathyanarayanan Kuppuswamy

Hi,

On 6/1/22 12:40 AM, Christoph Hellwig wrote:

When a device is already unplugged by pciehp by the time that the AER
handler is invoked, the PCIe device will lready be in the


/s/lready/already


pci_channel_io_perm_failure state.  In that case we should simply
return PCI_ERS_RESULT_DISCONNECT instead of trying to do a state
transition that will fail.

Also untangle the state transition failure from the lack of methods to
improve the debugging output in case it will happen ever again.


Otherwise, it looks good to me.

Reviewed-by: Kuppuswamy Sathyanarayanan 





Signed-off-by: Christoph Hellwig 
---
  drivers/pci/pcie/err.c | 12 
  1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 0c5a143025af4..59c90d04a609a 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -55,10 +55,14 @@ static int report_error_detected(struct pci_dev *dev,
  
  	device_lock(>dev);

pdrv = dev->driver;
-   if (!pci_dev_set_io_state(dev, state) ||
-   !pdrv ||
-   !pdrv->err_handler ||
-   !pdrv->err_handler->error_detected) {
+   if (pci_dev_is_disconnected(dev)) {
+   vote = PCI_ERS_RESULT_DISCONNECT;
+   } else if (!pci_dev_set_io_state(dev, state)) {
+   pci_info(dev, "can't recover (state transition %u -> %u 
invalid)\n",
+   dev->error_state, state);
+   vote = PCI_ERS_RESULT_NONE;
+   } else if (!pdrv || !pdrv->err_handler ||
+  !pdrv->err_handler->error_detected) {
/*
 * If any device in the subtree does not have an error_detected
 * callback, PCI_ERS_RESULT_NO_AER_DRIVER prevents subsequent


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v3] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly

2022-05-11 Thread Sathyanarayanan Kuppuswamy




On 5/11/22 4:40 PM, Bjorn Helgaas wrote:

On Mon, Apr 18, 2022 at 03:02:37PM +, Kuppuswamy Sathyanarayanan wrote:

Currently the aer_irq() handler returns IRQ_NONE for cases without bits
PCI_ERR_ROOT_UNCOR_RCV or PCI_ERR_ROOT_COR_RCV are set. But this
assumption is incorrect.

Consider a scenario where aer_irq() is triggered for a correctable
error, and while we process the error and before we clear the error
status in "Root Error Status" register, if the same kind of error
is triggered again, since aer_irq() only clears events it saw, the
multi-bit error is left in tact. This will cause the interrupt to fire
again, resulting in entering aer_irq() with just the multi-bit error
logged in the "Root Error Status" register.

Repeated AER recovery test has revealed this condition does happen
and this prevents any new interrupt from being triggered. Allow to
process interrupt even if only multi-correctable (BIT 1) or
multi-uncorrectable bit (BIT 3) is set.

Also note that, for cases with only multi-bit error is set, since this
is not the first occurrence of the error, PCI_ERR_ROOT_ERR_SRC may have
zero or some junk value. So we cannot cleanly process this error
information using aer_isr_one_error(). All we are attempting with this
fix is to make sure error interrupt processing can continue in this
scenario.

This error can be reproduced by making following changes to the
aer_irq() function and by executing the given test commands.

  static irqreturn_t aer_irq(int irq, void *context)
  struct aer_err_source e_src = {};

  pci_read_config_dword(rp, aer + PCI_ERR_ROOT_STATUS,
_src.status);
  +   pci_dbg(pdev->port, "Root Error Status: %04x\n",
  + e_src.status);
  if (!(e_src.status & AER_ERR_STATUS_MASK))


Do you mean

   if (!(e_src.status & (PCI_ERR_ROOT_UNCOR_RCV|PCI_ERR_ROOT_COR_RCV)))

here?  AER_ERR_STATUS_MASK would be after this fix.


Yes. You are correct. Do you want me to update it and Fixes tag
and send next version?




  return IRQ_NONE;

  +   mdelay(5000);


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v3] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly

2022-05-11 Thread Sathyanarayanan Kuppuswamy




On 5/11/22 4:27 PM, Bjorn Helgaas wrote:

[Eric: proposed reproducing steps]
Fixes: 4696b828ca37 ("PCI/AER: Hoist aerdrv.c, aer_inject.c up to 
drivers/pci/pcie/")

4696b828ca37 only*moves*  drivers/pci/pcie/aer/aerdrv.c to
drivers/pci/pcie/aer.c, so I don't think it's related.

I think the actual change of interest is e167bfcaa4cd ("PCI: aerdrv:
remove magical ROOT_ERR_STATUS_MASKS") [1].  It looks like we did
exactly what you propose before that commit.

I can update this unless you disagree.

[1]https://git.kernel.org/linus/e167bfcaa4cd



Agree. Please update it.

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v3] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly

2022-05-11 Thread Sathyanarayanan Kuppuswamy

Hi Bjorn,

On 4/18/22 8:02 AM, Kuppuswamy Sathyanarayanan wrote:

Currently the aer_irq() handler returns IRQ_NONE for cases without bits
PCI_ERR_ROOT_UNCOR_RCV or PCI_ERR_ROOT_COR_RCV are set. But this
assumption is incorrect.

Consider a scenario where aer_irq() is triggered for a correctable
error, and while we process the error and before we clear the error
status in "Root Error Status" register, if the same kind of error
is triggered again, since aer_irq() only clears events it saw, the
multi-bit error is left in tact. This will cause the interrupt to fire
again, resulting in entering aer_irq() with just the multi-bit error
logged in the "Root Error Status" register.

Repeated AER recovery test has revealed this condition does happen
and this prevents any new interrupt from being triggered. Allow to
process interrupt even if only multi-correctable (BIT 1) or
multi-uncorrectable bit (BIT 3) is set.

Also note that, for cases with only multi-bit error is set, since this
is not the first occurrence of the error, PCI_ERR_ROOT_ERR_SRC may have
zero or some junk value. So we cannot cleanly process this error
information using aer_isr_one_error(). All we are attempting with this
fix is to make sure error interrupt processing can continue in this
scenario.

This error can be reproduced by making following changes to the
aer_irq() function and by executing the given test commands.

  static irqreturn_t aer_irq(int irq, void *context)
  struct aer_err_source e_src = {};

  pci_read_config_dword(rp, aer + PCI_ERR_ROOT_STATUS,
_src.status);
  +   pci_dbg(pdev->port, "Root Error Status: %04x\n",
  + e_src.status);
  if (!(e_src.status & AER_ERR_STATUS_MASK))
  return IRQ_NONE;

  +   mdelay(5000);

  # Prep injection data for a correctable error.
  $ cd /sys/kernel/debug/apei/einj
  $ echo 0x0040 > error_type
  $ echo 0x4 > flags
  $ echo 0x891000 > param4

  # Root Error Status is initially clear
  $ setpci -s  ECAP0001+0x30.w
  

  # Inject one error
  $ echo 1 > error_inject

  # Interrupt received
  pcieport : AER: Root Error Status 0001

  # Inject another error (within 5 seconds)
  $ echo 1 > error_inject

  # You will get a new IRQ with only multiple ERR_COR bit set
  pcieport : AER: Root Error Status 0002

Currently, the above issue has been only reproduced in the ICL server
platform.

[Eric: proposed reproducing steps]
Fixes: 4696b828ca37 ("PCI/AER: Hoist aerdrv.c, aer_inject.c up to 
drivers/pci/pcie/")
Reported-by: Eric Badger 
Reviewed-by: Ashok Raj 
Signed-off-by: Kuppuswamy Sathyanarayanan 

---



Any comments on this patch? I'm wondering whether you are expecting any
changes to be done to it. Please let me know.


Changes since v2:
  * Added more details to the commit log.
  * Rebased on v5.18-rc1.

Changes since v1:
  * Added Fixes tag.
  * Included reproducing steps proposed by Eric.

  drivers/pci/pcie/aer.c | 7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9fa1f97e5b27..7952e5efd6cf 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -101,6 +101,11 @@ struct aer_stats {
  #define ERR_COR_ID(d) (d & 0x)
  #define ERR_UNCOR_ID(d)   (d >> 16)
  
+#define AER_ERR_STATUS_MASK		(PCI_ERR_ROOT_UNCOR_RCV |	\

+   PCI_ERR_ROOT_COR_RCV |  \
+   PCI_ERR_ROOT_MULTI_COR_RCV |\
+   PCI_ERR_ROOT_MULTI_UNCOR_RCV)
+
  static int pcie_aer_disable;
  static pci_ers_result_t aer_root_reset(struct pci_dev *dev);
  
@@ -1196,7 +1201,7 @@ static irqreturn_t aer_irq(int irq, void *context)

struct aer_err_source e_src = {};
  
  	pci_read_config_dword(rp, aer + PCI_ERR_ROOT_STATUS, _src.status);

-   if (!(e_src.status & (PCI_ERR_ROOT_UNCOR_RCV|PCI_ERR_ROOT_COR_RCV)))
+   if (!(e_src.status & AER_ERR_STATUS_MASK))
return IRQ_NONE;
  
  	pci_read_config_dword(rp, aer + PCI_ERR_ROOT_ERR_SRC, _src.id);


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v4 2/2] PCI/DPC: Disable DPC service when link is in L2/L3 ready, L2 and L3 state

2022-04-17 Thread Sathyanarayanan Kuppuswamy




On 4/8/22 8:31 AM, Kai-Heng Feng wrote:

On Intel Alder Lake platforms, Thunderbolt entering D3cold can cause
some errors reported by AER:
[   30.100211] pcieport :00:1d.0: AER: Uncorrected (Non-Fatal) error 
received: :00:1d.0
[   30.100251] pcieport :00:1d.0: PCIe Bus Error: severity=Uncorrected 
(Non-Fatal), type=Transaction Layer, (Requester ID)
[   30.100256] pcieport :00:1d.0:   device [8086:7ab0] error 
status/mask=0010/4000
[   30.100262] pcieport :00:1d.0:[20] UnsupReq   (First)
[   30.100267] pcieport :00:1d.0: AER:   TLP Header: 3400 0852 
 
[   30.100372] thunderbolt :0a:00.0: AER: can't recover (no error_detected 
callback)
[   30.100401] xhci_hcd :3e:00.0: AER: can't recover (no error_detected 
callback)
[   30.100427] pcieport :00:1d.0: AER: device recovery failed

Since AER is disabled in previous patch for a Link in L2/L3 Ready, L2
and L3, also disable DPC here as DPC depends on AER to work.

Bugzilla:https://bugzilla.kernel.org/show_bug.cgi?id=215453
Reviewed-by: Mika Westerberg
Signed-off-by: Kai-Heng Feng


Reviewed-by: Kuppuswamy Sathyanarayanan 


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v3 1/2] PCI/AER: Disable AER service when link is in L2/L3 ready, L2 and L3 state

2022-03-30 Thread Sathyanarayanan Kuppuswamy




On 3/29/22 1:31 AM, Kai-Heng Feng wrote:

On some Intel AlderLake platforms, Thunderbolt entering D3cold can cause
some errors reported by AER:
[   30.100211] pcieport :00:1d.0: AER: Uncorrected (Non-Fatal) error 
received: :00:1d.0
[   30.100251] pcieport :00:1d.0: PCIe Bus Error: severity=Uncorrected 
(Non-Fatal), type=Transaction Layer, (Requester ID)
[   30.100256] pcieport :00:1d.0:   device [8086:7ab0] error 
status/mask=0010/4000
[   30.100262] pcieport :00:1d.0:[20] UnsupReq   (First)
[   30.100267] pcieport :00:1d.0: AER:   TLP Header: 3400 0852 
 
[   30.100372] thunderbolt :0a:00.0: AER: can't recover (no error_detected 
callback)
[   30.100401] xhci_hcd :3e:00.0: AER: can't recover (no error_detected 
callback)
[   30.100427] pcieport :00:1d.0: AER: device recovery failed


Include details about in which platform you have seen it and whether
this is a generic power issue?



So disable AER service to avoid the noises from turning power rails
on/off when the device is in low power states (D3hot and D3cold), as
PCIe spec "5.2 Link State Power Management" states that TLP and DLLP


Also include PCIe specification version number.


transmission is disabled for a Link in L2/L3 Ready (D3hot), L2 (D3cold
with aux power) and L3 (D3cold).

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215453
Reviewed-by: Mika Westerberg 
Signed-off-by: Kai-Heng Feng 
---
v3:
  - Remove reference to ACS.
  - Wording change.

v2:
  - Wording change.

  drivers/pci/pcie/aer.c | 31 +--
  1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9fa1f97e5b270..e4e9d4a3098d7 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1367,6 +1367,22 @@ static int aer_probe(struct pcie_device *dev)
return 0;
  }
  
+static int aer_suspend(struct pcie_device *dev)

+{
+   struct aer_rpc *rpc = get_service_data(dev);
+
+   aer_disable_rootport(rpc);
+   return 0;
+}
+
+static int aer_resume(struct pcie_device *dev)
+{
+   struct aer_rpc *rpc = get_service_data(dev);
+
+   aer_enable_rootport(rpc);
+   return 0;
+}
+
  /**
   * aer_root_reset - reset Root Port hierarchy, RCEC, or RCiEP
   * @dev: pointer to Root Port, RCEC, or RCiEP
@@ -1433,12 +1449,15 @@ static pci_ers_result_t aer_root_reset(struct pci_dev 
*dev)
  }
  
  static struct pcie_port_service_driver aerdriver = {

-   .name   = "aer",
-   .port_type  = PCIE_ANY_PORT,
-   .service= PCIE_PORT_SERVICE_AER,
-
-   .probe  = aer_probe,
-   .remove = aer_remove,
+   .name   = "aer",
+   .port_type  = PCIE_ANY_PORT,
+   .service= PCIE_PORT_SERVICE_AER,
+   .probe  = aer_probe,
+   .suspend= aer_suspend,
+   .resume = aer_resume,
+   .runtime_suspend= aer_suspend,
+   .runtime_resume = aer_resume,
+   .remove = aer_remove,
  };
  
  /**


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v2 1/2] PCI/AER: Disable AER service when link is in L2/L3 ready, L2 and L3 state

2022-03-20 Thread Sathyanarayanan Kuppuswamy




On 3/20/22 7:38 PM, Kai-Heng Feng wrote:

On Sun, Mar 20, 2022 at 4:38 AM Sathyanarayanan Kuppuswamy
 wrote:




On 1/26/22 6:54 PM, Kai-Heng Feng wrote:

Commit 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in
hint") enables ACS, and some platforms lose its NVMe after resume from


Why enabling ACS makes platform lose NVMe? Can you add more details
about the problem?


I don't have a hardware analyzer, so the only detail I can provide is
the symptom.
I believe the affected system was sent Intel, and there wasn't any
feedback since then.


Since your commit log refers to ACS, I think first we need to understand
following points.

1. Why we get ACSViol during S3 resume. Is this just a noise?
2. Why AER recovery fails?
3. Is this common for all platforms, or only happens in your test
   platform?

If you are not clear about above points, I think you can submit this
patch as adding suspend/resume support to AER/DPC driver and not include
the issue about ACS.

From your commit log, the problem is not very clear.






S3:
[   50.947816] pcieport :00:1b.0: DPC: containment event, status:0x1f01 
source:0x
[   50.947817] pcieport :00:1b.0: DPC: unmasked uncorrectable error detected
[   50.947829] pcieport :00:1b.0: PCIe Bus Error: severity=Uncorrected 
(Non-Fatal), type=Transaction Layer, (Receiver ID)
[   50.947830] pcieport :00:1b.0:   device [8086:06ac] error 
status/mask=0020/0001
[   50.947831] pcieport :00:1b.0:[21] ACSViol(First)
[   50.947841] pcieport :00:1b.0: AER: broadcast error_detected message
[   50.947843] nvme nvme0: frozen state error detected, reset controller

It happens right after ACS gets enabled during resume.

There's another case, when Thunderbolt reaches D3cold:
[   30.100211] pcieport :00:1d.0: AER: Uncorrected (Non-Fatal) error 
received: :00:1d.0
[   30.100251] pcieport :00:1d.0: PCIe Bus Error: severity=Uncorrected 
(Non-Fatal), type=Transaction Layer, (Requester ID)
[   30.100256] pcieport :00:1d.0:   device [8086:7ab0] error 
status/mask=0010/4000
[   30.100262] pcieport :00:1d.0:[20] UnsupReq   (First)
[   30.100267] pcieport :00:1d.0: AER:   TLP Header: 3400 0852 
 
[   30.100372] thunderbolt :0a:00.0: AER: can't recover (no error_detected 
callback)


no callback message means one or more devices in the given port does not
support error handler. How is this related to ACS?


This case is about D3cold, not related to ACS.
And no error_detected is just part of the message. The whole AER
message is more important.

Kai-Heng




[   30.100401] xhci_hcd :3e:00.0: AER: can't recover (no error_detected 
callback)
[   30.100427] pcieport :00:1d.0: AER: device recovery failed

So disable AER service to avoid the noises from turning power rails
on/off when the device is in low power states (D3hot and D3cold), as
PCIe spec "5.2 Link State Power Management" states that TLP and DLLP
transmission is disabled for a Link in L2/L3 Ready (D3hot), L2 (D3cold
with aux power) and L3 (D3cold).

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=209149
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215453
Fixes: 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in hint")
Signed-off-by: Kai-Heng Feng 
---
v2:
   - Wording change.

   drivers/pci/pcie/aer.c | 31 +--
   1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9fa1f97e5b270..e4e9d4a3098d7 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1367,6 +1367,22 @@ static int aer_probe(struct pcie_device *dev)
   return 0;
   }

+static int aer_suspend(struct pcie_device *dev)
+{
+ struct aer_rpc *rpc = get_service_data(dev);
+
+ aer_disable_rootport(rpc);
+ return 0;
+}
+
+static int aer_resume(struct pcie_device *dev)
+{
+ struct aer_rpc *rpc = get_service_data(dev);
+
+ aer_enable_rootport(rpc);
+ return 0;
+}
+
   /**
* aer_root_reset - reset Root Port hierarchy, RCEC, or RCiEP
* @dev: pointer to Root Port, RCEC, or RCiEP
@@ -1433,12 +1449,15 @@ static pci_ers_result_t aer_root_reset(struct pci_dev 
*dev)
   }

   static struct pcie_port_service_driver aerdriver = {
- .name   = "aer",
- .port_type  = PCIE_ANY_PORT,
- .service= PCIE_PORT_SERVICE_AER,
-
- .probe  = aer_probe,
- .remove = aer_remove,
+ .name   = "aer",
+ .port_type  = PCIE_ANY_PORT,
+ .service= PCIE_PORT_SERVICE_AER,
+ .probe  = aer_probe,
+ .suspend= aer_suspend,
+ .resume = aer_resume,
+ .runtime_suspend= aer_suspend,
+ .runtime_resume = aer_resume,
+ .remove     = aer_remove,
   };

   /**


--

Re: [PATCH v2 2/2] PCI/DPC: Disable DPC service when link is in L2/L3 ready, L2 and L3 state

2022-03-19 Thread Sathyanarayanan Kuppuswamy




On 1/26/22 6:54 PM, Kai-Heng Feng wrote:

Since TLP and DLLP transmission is disabled for a Link in L2/L3 Ready,
L2 and L3 (i.e. device in D3hot and D3cold), and DPC depends on AER, so


Better description about the problem would be helpful. I know you
have included a log in AER patch. But a quick summary of the problem
in this patch will make it easier to read the patch.


also disable DPC here.

Signed-off-by: Kai-Heng Feng 
---
v2:
  - Wording change.
  - Empty line dropped.

  drivers/pci/pcie/dpc.c | 60 +++---
  1 file changed, 44 insertions(+), 16 deletions(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index 3e9afee02e8d1..414258967f08e 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -343,13 +343,33 @@ void pci_dpc_init(struct pci_dev *pdev)
}
  }
  
+static void dpc_enable(struct pcie_device *dev)

+{
+   struct pci_dev *pdev = dev->port;
+   u16 ctl;
+
+   pci_read_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_CTL, );
+   ctl = (ctl & 0xfff4) | PCI_EXP_DPC_CTL_EN_FATAL | 
PCI_EXP_DPC_CTL_INT_EN;
+   pci_write_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_CTL, ctl);
+}
+
+static void dpc_disable(struct pcie_device *dev)
+{
+   struct pci_dev *pdev = dev->port;
+   u16 ctl;
+
+   pci_read_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_CTL, );
+   ctl &= ~(PCI_EXP_DPC_CTL_EN_FATAL | PCI_EXP_DPC_CTL_INT_EN);
+   pci_write_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_CTL, ctl);
+}
+
  #define FLAG(x, y) (((x) & (y)) ? '+' : '-')
  static int dpc_probe(struct pcie_device *dev)
  {
struct pci_dev *pdev = dev->port;
struct device *device = >device;
int status;
-   u16 ctl, cap;
+   u16 cap;
  
  	if (!pcie_aer_is_native(pdev) && !pcie_ports_dpc_native)

return -ENOTSUPP;
@@ -364,10 +384,7 @@ static int dpc_probe(struct pcie_device *dev)
}
  
  	pci_read_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_CAP, );

-   pci_read_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_CTL, );
-
-   ctl = (ctl & 0xfff4) | PCI_EXP_DPC_CTL_EN_FATAL | 
PCI_EXP_DPC_CTL_INT_EN;
-   pci_write_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_CTL, ctl);
+   dpc_enable(dev);
pci_info(pdev, "enabled with IRQ %d\n", dev->irq);
  
  	pci_info(pdev, "error containment capabilities: Int Msg #%d, RPExt%c PoisonedTLP%c SwTrigger%c RP PIO Log %d, DL_ActiveErr%c\n",

@@ -380,22 +397,33 @@ static int dpc_probe(struct pcie_device *dev)
return status;
  }
  
-static void dpc_remove(struct pcie_device *dev)

+static int dpc_suspend(struct pcie_device *dev)
  {
-   struct pci_dev *pdev = dev->port;
-   u16 ctl;
+   dpc_disable(dev);
+   return 0;
+}
  
-	pci_read_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_CTL, );

-   ctl &= ~(PCI_EXP_DPC_CTL_EN_FATAL | PCI_EXP_DPC_CTL_INT_EN);
-   pci_write_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_CTL, ctl);
+static int dpc_resume(struct pcie_device *dev)
+{
+   dpc_enable(dev);
+   return 0;
+}
+
+static void dpc_remove(struct pcie_device *dev)
+{
+   dpc_disable(dev);
  }
  
  static struct pcie_port_service_driver dpcdriver = {

-   .name   = "dpc",
-   .port_type  = PCIE_ANY_PORT,
-   .service= PCIE_PORT_SERVICE_DPC,
-   .probe  = dpc_probe,
-   .remove = dpc_remove,
+   .name   = "dpc",
+   .port_type  = PCIE_ANY_PORT,
+   .service= PCIE_PORT_SERVICE_DPC,
+   .probe  = dpc_probe,
+   .suspend= dpc_suspend,
+   .resume = dpc_resume,
+   .runtime_suspend= dpc_suspend,
+   .runtime_resume     = dpc_resume,
+   .remove = dpc_remove,
  };
  
  int __init pcie_dpc_init(void)


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v2 1/2] PCI/AER: Disable AER service when link is in L2/L3 ready, L2 and L3 state

2022-03-19 Thread Sathyanarayanan Kuppuswamy




On 1/26/22 6:54 PM, Kai-Heng Feng wrote:

Commit 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in
hint") enables ACS, and some platforms lose its NVMe after resume from


Why enabling ACS makes platform lose NVMe? Can you add more details
about the problem?


S3:
[   50.947816] pcieport :00:1b.0: DPC: containment event, status:0x1f01 
source:0x
[   50.947817] pcieport :00:1b.0: DPC: unmasked uncorrectable error detected
[   50.947829] pcieport :00:1b.0: PCIe Bus Error: severity=Uncorrected 
(Non-Fatal), type=Transaction Layer, (Receiver ID)
[   50.947830] pcieport :00:1b.0:   device [8086:06ac] error 
status/mask=0020/0001
[   50.947831] pcieport :00:1b.0:[21] ACSViol(First)
[   50.947841] pcieport :00:1b.0: AER: broadcast error_detected message
[   50.947843] nvme nvme0: frozen state error detected, reset controller

It happens right after ACS gets enabled during resume.

There's another case, when Thunderbolt reaches D3cold:
[   30.100211] pcieport :00:1d.0: AER: Uncorrected (Non-Fatal) error 
received: :00:1d.0
[   30.100251] pcieport :00:1d.0: PCIe Bus Error: severity=Uncorrected 
(Non-Fatal), type=Transaction Layer, (Requester ID)
[   30.100256] pcieport :00:1d.0:   device [8086:7ab0] error 
status/mask=0010/4000
[   30.100262] pcieport :00:1d.0:[20] UnsupReq   (First)
[   30.100267] pcieport :00:1d.0: AER:   TLP Header: 3400 0852 
 
[   30.100372] thunderbolt :0a:00.0: AER: can't recover (no error_detected 
callback)


no callback message means one or more devices in the given port does not
support error handler. How is this related to ACS?


[   30.100401] xhci_hcd :3e:00.0: AER: can't recover (no error_detected 
callback)
[   30.100427] pcieport :00:1d.0: AER: device recovery failed

So disable AER service to avoid the noises from turning power rails
on/off when the device is in low power states (D3hot and D3cold), as
PCIe spec "5.2 Link State Power Management" states that TLP and DLLP
transmission is disabled for a Link in L2/L3 Ready (D3hot), L2 (D3cold
with aux power) and L3 (D3cold).

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=209149
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=215453
Fixes: 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in hint")
Signed-off-by: Kai-Heng Feng 
---
v2:
  - Wording change.

  drivers/pci/pcie/aer.c | 31 +--
  1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9fa1f97e5b270..e4e9d4a3098d7 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1367,6 +1367,22 @@ static int aer_probe(struct pcie_device *dev)
return 0;
  }
  
+static int aer_suspend(struct pcie_device *dev)

+{
+   struct aer_rpc *rpc = get_service_data(dev);
+
+   aer_disable_rootport(rpc);
+   return 0;
+}
+
+static int aer_resume(struct pcie_device *dev)
+{
+   struct aer_rpc *rpc = get_service_data(dev);
+
+   aer_enable_rootport(rpc);
+   return 0;
+}
+
  /**
   * aer_root_reset - reset Root Port hierarchy, RCEC, or RCiEP
   * @dev: pointer to Root Port, RCEC, or RCiEP
@@ -1433,12 +1449,15 @@ static pci_ers_result_t aer_root_reset(struct pci_dev 
*dev)
  }
  
  static struct pcie_port_service_driver aerdriver = {

-   .name   = "aer",
-   .port_type  = PCIE_ANY_PORT,
-   .service= PCIE_PORT_SERVICE_AER,
-
-   .probe  = aer_probe,
-   .remove = aer_remove,
+   .name   = "aer",
+   .port_type  = PCIE_ANY_PORT,
+   .service= PCIE_PORT_SERVICE_AER,
+   .probe  = aer_probe,
+   .suspend= aer_suspend,
+   .resume = aer_resume,
+   .runtime_suspend= aer_suspend,
+   .runtime_resume = aer_resume,
+   .remove     = aer_remove,
  };
  
  /**


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v1] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly

2022-03-18 Thread Sathyanarayanan Kuppuswamy




On 3/17/22 3:59 PM, Bjorn Helgaas wrote:

Thanks for the additional details!

After this patch, I guess aer_irq() still reads 0x2
(PCI_ERR_ROOT_MULTI_COR_RCV), but now it writes 0x2 back which clears
PCI_ERR_ROOT_MULTI_COR_RCV.

In addition, aer_irq() will continue on to read PCI_ERR_ROOT_ERR_SRC,
which probably contains either 0 or junk left over from being captured
when PCI_ERR_ROOT_COR_RCV was set.

And aer_irq() will queue an e_src record with status ==
PCI_ERR_ROOT_MULTI_COR_RCV.  But since PCI_ERR_ROOT_COR_RCV is not set
in status, aer_isr_one_error() will do nothing, right?

That might not be*terrible*  and is definitely better than not being
able to handle future interrupts.  But we basically threw away the
information that multiple errors occurred, and we queued an e_src
record that occupies space without being used for anything.


Yes, you are correct.  One other way to minimize this race window is to
clear the Root status register as soon as possible. Maybe we can move
it before source ID read as below.

--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1204,8 +1204,8 @@ static irqreturn_t aer_irq(int irq, void *context)
if (!(e_src.status & AER_ERR_STATUS_MASK))
return IRQ_NONE;

-   pci_read_config_dword(rp, aer + PCI_ERR_ROOT_ERR_SRC, _src.id);
pci_write_config_dword(rp, aer + PCI_ERR_ROOT_STATUS, 
e_src.status);

+   pci_read_config_dword(rp, aer + PCI_ERR_ROOT_ERR_SRC, _src.id)

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v2] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly

2022-03-15 Thread Sathyanarayanan Kuppuswamy




On 3/15/22 12:52 PM, Eric Badger wrote:

On Tue, Mar 15, 2022 at 10:26:46AM -0700, Sathyanarayanan Kuppuswamy wrote:

On 3/15/22 10:14 AM, Eric Badger wrote:

   # Prep injection data for a correctable error.
   $ cd /sys/kernel/debug/apei/einj
   $ echo 0x0040 > error_type
   $ echo 0x4 > flags
   $ echo 0x891000 > param4

   # Root Error Status is initially clear
   $ setpci -s  ECAP0001+0x30.w
   

   # Inject one error
   $ echo 1 > error_inject

   # Interrupt received
   pcieport : AER: Root Error Status 0001

   # Inject another error (within 5 seconds)
   $ echo 1 > error_inject

   # No interrupt received, but "multiple ERR_COR" is now set
   $ setpci -s  ECAP0001+0x30.w
   0003

   # Wait for a while, then clear ERR_COR. A new interrupt immediately
 fires.
   $ setpci -s  ECAP0001+0x30.w=0x1
   pcieport : AER: Root Error Status 0002

Currently, the above issue has been only reproduced in the ICL server
platform.

[Eric: proposed reproducing steps]

Hmm, this differs from the procedure I described on v1, and I don't
think will work as described here.


I have attempted to modify the steps to reproduce it without returning
IRQ_NONE for all cases (which will break the functionality). But I
think I did not correct the last few steps.


Well, the thinking in always returning IRQ_NONE was so that only setpci
modified the register and we could clearly see how writes to the
register affect interrupt generation.


Got it. Makes sense.




How about replacing the last 3 steps with following?

  # Inject another error (within 5 seconds)
  $ echo 1 > error_inject

  # You will get a new IRQ with only multiple ERR_COR bit set
  pcieport : AER: Root Error Status 0002


This seems accurate. Though it does muddy a detail that I think was
clearer in the original procedure: was the second interrupt triggered by
the second error, or by the write of 0x1 to Root Error Status?


I think you are talking about the following command, right?

setpci -s  ECAP0001+0x30.w=0x1

If yes, my previously modified instructions already removed it. So
no confusion.

To summarize,

In your case, you have controlled both register read/write of Root
error status register to simulate the interrupt with only multi
ERR_COR bit set.

In my case, I have attempted to simulate it without changing the
default behavior of aer_irq() in the kernel.

Both seem ok to me. Although my personal preference is to trigger
the error without changing the code behavior, if both you and Bjorn
prefer to revert to old instructions, I will fix this in the next version.




Cheers,
Eric


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v2] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly

2022-03-15 Thread Sathyanarayanan Kuppuswamy




On 3/15/22 10:14 AM, Eric Badger wrote:

  # Prep injection data for a correctable error.
  $ cd /sys/kernel/debug/apei/einj
  $ echo 0x0040 > error_type
  $ echo 0x4 > flags
  $ echo 0x891000 > param4

  # Root Error Status is initially clear
  $ setpci -s  ECAP0001+0x30.w
  

  # Inject one error
  $ echo 1 > error_inject

  # Interrupt received
  pcieport : AER: Root Error Status 0001

  # Inject another error (within 5 seconds)
  $ echo 1 > error_inject

  # No interrupt received, but "multiple ERR_COR" is now set
  $ setpci -s  ECAP0001+0x30.w
  0003

  # Wait for a while, then clear ERR_COR. A new interrupt immediately
fires.
  $ setpci -s  ECAP0001+0x30.w=0x1
  pcieport : AER: Root Error Status 0002

Currently, the above issue has been only reproduced in the ICL server
platform.

[Eric: proposed reproducing steps]

Hmm, this differs from the procedure I described on v1, and I don't
think will work as described here.


I have attempted to modify the steps to reproduce it without returning
IRQ_NONE for all cases (which will break the functionality). But I
think I did not correct the last few steps.

How about replacing the last 3 steps with following?

 # Inject another error (within 5 seconds)
 $ echo 1 > error_inject

 # You will get a new IRQ with only multiple ERR_COR bit set
 pcieport : AER: Root Error Status 0002

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer