Re: Enabling internal errors for VH CXL devices: [was: Re: Questions about CXL RAS injection test in qemu]

2024-03-06 Thread Terry Bowman
Hi Yuquan,

For your test, the first log output should come from the AER driver if
everything is working correctly.

You may want to check whether the upstream PCI bridge's AER UIE/CIE
masks are set. That could prevent the error from being handled by the OS's
AER driver.
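
A rough sketch of that check, assuming the two mask values have already been
read from the bridge (for example with setpci): the bit values below match
PCI_ERR_UNC_INTN and PCI_ERR_COR_INTERNAL in the kernel's pci_regs.h, while
the helper names are hypothetical and only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values as defined in include/uapi/linux/pci_regs.h. */
#define PCI_ERR_UNC_INTN     0x00400000u /* Uncorrectable Internal Error (UIE) */
#define PCI_ERR_COR_INTERNAL 0x00004000u /* Corrected Internal Error (CIE) */

/* Hypothetical helper: true if UIE is masked in an Uncorrectable Error Mask value. */
static bool aer_uie_masked(uint32_t uncor_mask)
{
	return (uncor_mask & PCI_ERR_UNC_INTN) != 0;
}

/* Hypothetical helper: true if CIE is masked in a Correctable Error Mask value. */
static bool aer_cie_masked(uint32_t cor_mask)
{
	return (cor_mask & PCI_ERR_COR_INTERNAL) != 0;
}
```

With the correctable mask value 0xe000 reported further down in this thread,
aer_cie_masked() returns true; with an uncorrectable mask of 0 (after the
setpci write), aer_uie_masked() returns false.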

Regards,
Terry

On 3/6/24 11:12, Terry Bowman wrote:
> Hi Yuquan and Jon,
> 
> I added responses inline below.
> 
> On 3/6/24 07:23, Jonathan Cameron wrote:
>> On Wed, 6 Mar 2024 19:27:07 +0800
>> Yuquan Wang  wrote:
>>
>>> Hello, Jonathan
>>>
>>> I recently ran into some problems with CXL RAS tests.
>>>
>>> I tried to use the "cxl-inject-uncorrectable-errors" and
>>> "cxl-inject-correctable-error" QMP commands to inject CXL errors, but no
>>> kernel messages appeared in my QEMU machine. The QMP connection was also
>>> unstable, and the machine kept "terminating on signal 2".
>>
>> The qmp connection being unstable is odd - might be related to the CXL code, 
>> but
>> I'm not sure how..
>>
>>>
>>> In addition, I successfully used the hmp "pcie_aer_inject_error" in the 
>>> same conditions.
>>> The kernel showed relevant print information.
>>
>> IIRC the AER paths print under all circumstances whereas CXL errors do not;
>> they simply trigger tracepoints - but you should have seen device resets.
>>
>> However, I spun up a test and I think the issue is more straightforward.
>> The uncorrectable internal error and correctable internal error are masked
>> on the device.
>> I thought we changed the default on this in Linux but maybe not :(
>>
> 
> The device AER UIE/CIE mask can be set and we can still expect to handle
> device AER errors. The device reports AER UIE/CIE to the root port/RCEC on
> behalf of device AER CRC, TLP, etc. errors.
> 
> In earlier changes we added logic to clear the RCEC UIE/CIE mask in order to
> properly receive AER UIE/CIE notifications from devices and RCH dports.
> 
> "CXL Protocol and Link errors detected by components that are part of a CXL 
> VH are
> escalated and reported using standard PCIe error reporting mechanisms over 
> CXL.io as
> UIEs and/or CIEs. See PCIe Base Specification for details."[1]
> 
> [1] CXL3.1 12.2.1 - Protocol and Link Layer Error Reporting
> 
>> The hack is to find the relevant device with lspci -tv and then use
>> setpci -s 0d:00.0 0x208.l=0
>> to clear all the mask bits for uncorrectable errors.
>>
>> Note I tested this on a convenient arm64 setup so always possible there is 
>> yet
>> another problem on x86.
>>
>> Robert / Terry, I tracked down the patch where you enabled this for RCHs and 
>> there was
>> some discussion on walking out on VH as well to enable this, but seems it
>> never happened. Can you remember why?  Just kicked back for a future 
>> occasion?
>>
>> Jonathan
>>
>>
> 
> I tested (qemu x86) using the aer-inject tool and found it to work. The output
> below shows that the endpoint CIE is masked (0xe000 @ AER+0x14) and that the
> injected error is properly handled, with root port logging and cxl_pci handler
> trace logs.
> 
> # lspci | grep -i cxl
> 0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
> 
> # lspci -s 0d:00.0 -vvv | grep Advanced
> Capabilities: [200 v2] Advanced Error Reporting
> 
> # setpci -s 0d:00.0 0x208.l

Re: Enabling internal errors for VH CXL devices: [was: Re: Questions about CXL RAS injection test in qemu]

2024-03-06 Thread Terry Bowman
Hi Jon,

This appears to partially address the same problem Robert and I are working
on. We are working to add support for CXL port devices, including root ports,
RCECs, USPs, and DSPs. This was covered in an LPC presentation and discussion.

We did not originally include RCEC error handling support because the same is
needed for all CXL port devices. Also, we wanted to avoid adding more CXL
specifics to aer.c and were looking for a more general solution. This led to
the discussion about changes to the PCIe port bus driver.

Regards,
Terry

On 3/6/24 11:16, Dan Williams wrote:
> [ add Li Ming ]
> 
> Jonathan Cameron wrote:
> [..]
>> Robert / Terry, I tracked down the patch where you enabled this for RCHs and 
>> there was
>> some discussion on walking out on VH as well to enable this, but seems it
>> never happened. Can you remember why?  Just kicked back for a future 
>> occasion?
>>
> 
> Li Ming has this patch below waiting in the wings. Li Ming, this patch is
> timely for this discussion, care to send out the full series? I expect it
> needs to be an RFC given concerns with integrating with the pending port
> switch error handling work.
> 
> -- 8< --
> From: Li Ming 
> Subject: [PATCH RFC v3 3/6] PCI/AER: Enable RCEC to report internal error for CXL root port
> Date: Thu, 1 Feb 2024 05:58:08 +
> 
> Per CXL r3.1 section 12.2.2, the RCEC may log the CXL.cachemem
> protocol errors detected by a CXL root port as PCI_ERR_UNC_INTN or
> PCI_ERR_COR_INTERNAL in its AER Capability. So unmask PCI_ERR_UNC_INTN and
> PCI_ERR_COR_INTERNAL for that case.
> 
> Signed-off-by: Li Ming 
> ---
>  drivers/pci/pcie/aer.c | 25 ++---
>  1 file changed, 18 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 42a3bd35a3e1..ef8fd77cb920 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -985,7 +985,7 @@ static bool cxl_error_is_native(struct pci_dev *dev)
>  {
>  	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>  
> -	return (pcie_ports_native || host->native_aer);
> +	return (pcie_ports_native || host->native_aer) && host->is_cxl;
>  }
>  
>  static bool is_internal_error(struct aer_err_info *info)
> @@ -1041,8 +1041,14 @@ static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
>  {
>  	bool *handles_cxl = data;
>  
> -	if (!*handles_cxl)
> -		*handles_cxl = is_cxl_mem_dev(dev) && cxl_error_is_native(dev);
> +	if (!*handles_cxl) {
> +		if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_END &&
> +		    is_cxl_mem_dev(dev) && cxl_error_is_native(dev))
> +			*handles_cxl = true;
> +		if (pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT &&
> +		    cxl_error_is_native(dev))
> +			*handles_cxl = true;
> +	}
>  
>  	/* Non-zero terminates iteration */
>  	return *handles_cxl;
> @@ -1054,13 +1060,18 @@ static bool handles_cxl_errors(struct pci_dev *rcec)
>  
>  	if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC &&
>  	    pcie_aer_is_native(rcec))
> -		pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
> +		pcie_walk_rcec_all(rcec, handles_cxl_error_iter, &handles_cxl);
>  
>  	return handles_cxl;
>  }
>  
> -static void cxl_rch_enable_rcec(struct pci_dev *rcec)
> +static void cxl_enable_rcec(struct pci_dev *rcec)
>  {
> +	/*
> +	 * Enable RCEC's internal error report for two cases:
> +	 * 1. RCiEP detected CXL.cachemem protocol errors
> +	 * 2. CXL root port detected CXL.cachemem protocol errors.
> +	 */
>  	if (!handles_cxl_errors(rcec))
>  		return;
>  
> @@ -1069,7 +1080,7 @@ static void cxl_rch_enable_rcec(struct pci_dev *rcec)
>  }
>  
>  #else
> -static inline void cxl_rch_enable_rcec(struct pci_dev *dev) { }
> +static inline void cxl_enable_rcec(struct pci_dev *dev) { }
>  static inline void cxl_rch_handle_error(struct pci_dev *dev,
>  					struct aer_err_info *info) { }
>  #endif
> @@ -1494,7 +1505,7 @@ static int aer_probe(struct pcie_device *dev)
>  		return status;
>  	}
>  
> -	cxl_rch_enable_rcec(port);
> +	cxl_enable_rcec(port);
>  	aer_enable_rootport(rpc);
>  	pci_info(port, "enabled with IRQ %d\n", dev->irq);
>  	return 0;
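
To make the decision in the patched handles_cxl_error_iter() easier to follow,
here is a small user-space model of it. This is only a sketch: struct
dev_model, enum port_type and the model_* functions are hypothetical stand-ins
for the kernel state the patch consults, not kernel APIs.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the value pci_pcie_type() would report. */
enum port_type { TYPE_RC_END, TYPE_ROOT_PORT, TYPE_OTHER };

/* Hypothetical snapshot of the per-device state the patch looks at. */
struct dev_model {
	enum port_type type;	/* device/port type */
	bool is_cxl_mem;	/* what is_cxl_mem_dev() would return */
	bool native_aer;	/* pcie_ports_native || host->native_aer */
	bool host_is_cxl;	/* host->is_cxl */
};

/* Models the patched cxl_error_is_native(): native AER and a CXL host bridge. */
static bool model_error_is_native(const struct dev_model *d)
{
	return d->native_aer && d->host_is_cxl;
}

/* Models the per-device test inside the patched handles_cxl_error_iter(). */
static bool model_handles_cxl(const struct dev_model *d)
{
	if (d->type == TYPE_RC_END)
		return d->is_cxl_mem && model_error_is_native(d);
	if (d->type == TYPE_ROOT_PORT)
		return model_error_is_native(d);
	return false;
}
```

In this model a root port under a CXL host bridge with native AER qualifies
even though it is not a CXL memory device, which is the case the patch adds on
top of the existing RCiEP check.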



Re: Enabling internal errors for VH CXL devices: [was: Re: Questions about CXL RAS injection test in qemu]

2024-03-06 Thread Terry Bowman
Hi Yuquan and Jon,

I added responses inline below.

On 3/6/24 07:23, Jonathan Cameron wrote:
> On Wed, 6 Mar 2024 19:27:07 +0800
> Yuquan Wang  wrote:
> 
>> Hello, Jonathan
>>
>> I recently ran into some problems with CXL RAS tests.
>>
>> I tried to use the "cxl-inject-uncorrectable-errors" and
>> "cxl-inject-correctable-error" QMP commands to inject CXL errors, but no
>> kernel messages appeared in my QEMU machine. The QMP connection was also
>> unstable, and the machine kept "terminating on signal 2".
> 
> The qmp connection being unstable is odd - might be related to the CXL code, 
> but
> I'm not sure how..
> 
>>
>> In addition, I successfully used the hmp "pcie_aer_inject_error" in the same 
>> conditions.
>> The kernel showed relevant print information.
> 
> IIRC the AER paths print under all circumstances whereas CXL errors do not;
> they simply trigger tracepoints - but you should have seen device resets.
> 
> However, I spun up a test and I think the issue is more straightforward.
> The uncorrectable internal error and correctable internal error are masked
> on the device.
> I thought we changed the default on this in Linux but maybe not :(
> 

The device AER UIE/CIE mask can be set and we can still expect to handle
device AER errors. The device reports AER UIE/CIE to the root port/RCEC on
behalf of device AER CRC, TLP, etc. errors.

In earlier changes we added logic to clear the RCEC UIE/CIE mask in order to
properly receive AER UIE/CIE notifications from devices and RCH dports.

"CXL Protocol and Link errors detected by components that are part of a CXL VH 
are
escalated and reported using standard PCIe error reporting mechanisms over 
CXL.io as
UIEs and/or CIEs. See PCIe Base Specification for details."[1]

[1] CXL3.1 12.2.1 - Protocol and Link Layer Error Reporting
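
Clearing only the UIE/CIE mask bits, rather than writing 0 to the whole mask
register, comes down to a read-modify-write on two AER registers. A minimal
sketch of the arithmetic, assuming the AER capability sits at 0x200 as in the
lspci output in this thread. The register offsets and bit values match the
kernel's pci_regs.h; the helper names are hypothetical and this is not the
kernel's implementation.

```c
#include <stdint.h>

/* Offsets and bit values as defined in include/uapi/linux/pci_regs.h. */
#define PCI_ERR_UNCOR_MASK   0x08        /* Uncorrectable Error Mask register */
#define PCI_ERR_COR_MASK     0x14        /* Correctable Error Mask register */
#define PCI_ERR_UNC_INTN     0x00400000u /* Uncorrectable Internal Error (UIE) */
#define PCI_ERR_COR_INTERNAL 0x00004000u /* Corrected Internal Error (CIE) */

/* Config-space address of the uncorrectable mask, given the AER capability base. */
static uint16_t aer_uncor_mask_addr(uint16_t aer_base)
{
	return (uint16_t)(aer_base + PCI_ERR_UNCOR_MASK);
}

/* Config-space address of the correctable mask, given the AER capability base. */
static uint16_t aer_cor_mask_addr(uint16_t aer_base)
{
	return (uint16_t)(aer_base + PCI_ERR_COR_MASK);
}

/* New uncorrectable mask value with only the UIE bit cleared. */
static uint32_t aer_unmask_uie(uint32_t uncor_mask)
{
	return uncor_mask & ~PCI_ERR_UNC_INTN;
}

/* New correctable mask value with only the CIE bit cleared. */
static uint32_t aer_unmask_cie(uint32_t cor_mask)
{
	return cor_mask & ~PCI_ERR_COR_INTERNAL;
}
```

With an AER base of 0x200 the two addresses come out as 0x208 and 0x214, the
same registers used with setpci in this thread, and a correctable mask of
0xe000 becomes 0xa000 once CIE is unmasked.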

> The hack is to find the relevant device with lspci -tv and then use
> setpci -s 0d:00.0 0x208.l=0
> to clear all the mask bits for uncorrectable errors.
> 
> Note I tested this on a convenient arm64 setup so always possible there is yet
> another problem on x86.
> 
> Robert / Terry, I tracked down the patch where you enabled this for RCHs and 
> there was
> some discussion on walking out on VH as well to enable this, but seems it
> never happened. Can you remember why?  Just kicked back for a future occasion?
> 
> Jonathan
> 
> 

I tested (qemu x86) using the aer-inject tool and found it to work. The output
below shows that the endpoint CIE is masked (0xe000 @ AER+0x14) and that the
injected error is properly handled, with root port logging and cxl_pci handler
trace logs.

# lspci | grep -i cxl
0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)

# lspci -s 0d:00.0 -vvv | grep Advanced
Capabilities: [200 v2] Advanced Error Reporting

# setpci -s 0d:00.0 0x208.l
0240

# setpci -s 0d:00.0 0x214.l
e000

# cat aer-input.txt
# Inject a correctable bad TLP error into the device with header log
# words 0 1 2 3.
#