Re: [PATCH v10 13/15] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler

2023-09-26 Thread Dan Williams
Terry Bowman wrote:
> Hi Dan,
> 
> On 8/31/23 15:35, Dan Williams wrote:
> > Terry Bowman wrote:
> >> From: Robert Richter 
> >>
> >> In Restricted CXL Device (RCD) mode a CXL device is exposed as an
> >> RCiEP, but CXL downstream and upstream ports are not enumerated and
> >> not visible in the PCIe hierarchy. [1] Protocol and link errors from
> >> these non-enumerated ports are signaled as internal AER errors, either
> >> Uncorrectable Internal Error (UIE) or Corrected Internal Errors (CIE)
> >> via an RCEC.
> >>
> >> Restricted CXL host (RCH) downstream port-detected errors have the
> >> Requester ID of the RCEC set in the RCEC's AER Error Source ID
> >> register. A CXL handler must then inspect the error status in various
> >> CXL registers residing in the dport's component register space (CXL
> >> RAS capability) or the dport's RCRB (PCIe AER extended
> >> capability). [2]
> >>
> >> Errors showing up in the RCEC's error handler must be handled and
> >> connected to the CXL subsystem. Implement this by forwarding the error
> >> to all CXL devices below the RCEC. Since the entire CXL device is
> >> controlled only using PCIe Configuration Space of device 0, function
> >> 0, only pass it there [3]. The error handling is limited to currently
> >> supported devices with the Memory Device class code set (CXL Type 3
> >> Device, PCI_CLASS_MEMORY_CXL, 502h), handle downstream port errors in
> >> the device's cxl_pci driver. Support for other CXL Device Types
> >> (e.g. a CXL.cache Device) can be added later.
> >>
> >> To handle downstream port errors in addition to errors directed to the
> >> CXL endpoint device, a handler must also inspect the CXL RAS and PCIe
> >> AER capabilities of the CXL downstream port the device is connected
> >> to.
> >>
> >> Since CXL downstream port errors are signaled using internal errors,
> >> the handler requires those errors to be unmasked. This is subject of a
> >> follow-on patch.
> >>
> >> The reason for choosing this implementation is that the AER service
> >> driver claims the RCEC device, but does not allow it to register a
> >> custom specific handler to support CXL. Connecting the RCEC hard-wired
> >> with a CXL handler does not work, as the CXL subsystem might not be
> >> present all the time. The alternative to add an implementation to the
> >> portdrv to allow the registration of a custom RCEC error handler isn't
> >> worth doing it as CXL would be its only user. Instead, just check for
> >> an CXL RCEC and pass it down to the connected CXL device's error
> >> handler. With this approach the code can entirely be implemented in
> >> the PCIe AER driver and is independent of the CXL subsystem. The CXL
> >> driver only provides the handler.
> >>
> >> [1] CXL 3.0 spec: 9.11.8 CXL Devices Attached to an RCH
> >> [2] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors
> >> [3] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices
> >>
> >> Co-developed-by: Terry Bowman 
> >> Signed-off-by: Terry Bowman 
> >> Signed-off-by: Robert Richter 
> >> Cc: "Oliver O'Halloran" 
> >> Cc: Bjorn Helgaas 
> >> Cc: linuxppc-dev@lists.ozlabs.org
> >> Cc: linux-...@vger.kernel.org
> >> Acked-by: Bjorn Helgaas 
> >> Reviewed-by: Jonathan Cameron 
> >> Reviewed-by: Dave Jiang 
> >> ---
> >>  drivers/pci/pcie/Kconfig | 12 +
> >>  drivers/pci/pcie/aer.c   | 96 +++-
> >>  2 files changed, 106 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
> >> index 228652a59f27..4f0e70fafe2d 100644
> >> --- a/drivers/pci/pcie/Kconfig
> >> +++ b/drivers/pci/pcie/Kconfig
> >> @@ -49,6 +49,18 @@ config PCIEAER_INJECT
> >>  gotten from:
> >> 
> >> https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/
> >>  
> >> +config PCIEAER_CXL
> >> +  bool "PCI Express CXL RAS support for Restricted Hosts (RCH)"
> > 
> > Why the "for Restricted Hosts (RCH)" clarification? I am seeing nothing
> > that prevents this from working with RCECs on VH topologies.
> > 
> 
> The same option can be used in VH mode. Will remove the RCH reference.
> 
> > 
> >> +  default y
> > 
> > Minor, but I think "default PCIEAER" makes it slightly clearer that CXL
> > error handling comes along for the ride with PCIE AER.
> >
> 
> We found Kconfig entries do not typically list a dependancy and the default 
> to be the same. We prefer to leave as 'default y'. If you want we can make 
> your requested change. 

The tie breaker would be to follow whatever the local precedent is.
Indeed, it looks like "depends with default y" is consistent with other
drivers/pci/pcie/Kconfig entries. So I retract my comment.

> 
> >> +  depends on PCIEAER && CXL_PCI
> >> +  help
> >> +Enables error handling of downstream ports of a CXL host
> >> +that is operating in RCD mode (Restricted CXL Host, RCH).
> >> +The downstream port reports AER errors to a given RCEC.
> >> +Errors are handled by the CXL memory device driver.
> >> +

Re: [PATCH v10 13/15] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler

2023-09-20 Thread Terry Bowman
Hi Dan,

I adde danothe comment below.

On 9/19/23 15:58, Terry Bowman wrote:
> Hi Dan,
> 
> On 8/31/23 15:35, Dan Williams wrote:
>> Terry Bowman wrote:
>>> From: Robert Richter 
>>>
>>> In Restricted CXL Device (RCD) mode a CXL device is exposed as an
>>> RCiEP, but CXL downstream and upstream ports are not enumerated and
>>> not visible in the PCIe hierarchy. [1] Protocol and link errors from
>>> these non-enumerated ports are signaled as internal AER errors, either
>>> Uncorrectable Internal Error (UIE) or Corrected Internal Errors (CIE)
>>> via an RCEC.
>>>
>>> Restricted CXL host (RCH) downstream port-detected errors have the
>>> Requester ID of the RCEC set in the RCEC's AER Error Source ID
>>> register. A CXL handler must then inspect the error status in various
>>> CXL registers residing in the dport's component register space (CXL
>>> RAS capability) or the dport's RCRB (PCIe AER extended
>>> capability). [2]
>>>
>>> Errors showing up in the RCEC's error handler must be handled and
>>> connected to the CXL subsystem. Implement this by forwarding the error
>>> to all CXL devices below the RCEC. Since the entire CXL device is
>>> controlled only using PCIe Configuration Space of device 0, function
>>> 0, only pass it there [3]. The error handling is limited to currently
>>> supported devices with the Memory Device class code set (CXL Type 3
>>> Device, PCI_CLASS_MEMORY_CXL, 502h), handle downstream port errors in
>>> the device's cxl_pci driver. Support for other CXL Device Types
>>> (e.g. a CXL.cache Device) can be added later.
>>>
>>> To handle downstream port errors in addition to errors directed to the
>>> CXL endpoint device, a handler must also inspect the CXL RAS and PCIe
>>> AER capabilities of the CXL downstream port the device is connected
>>> to.
>>>
>>> Since CXL downstream port errors are signaled using internal errors,
>>> the handler requires those errors to be unmasked. This is subject of a
>>> follow-on patch.
>>>
>>> The reason for choosing this implementation is that the AER service
>>> driver claims the RCEC device, but does not allow it to register a
>>> custom specific handler to support CXL. Connecting the RCEC hard-wired
>>> with a CXL handler does not work, as the CXL subsystem might not be
>>> present all the time. The alternative to add an implementation to the
>>> portdrv to allow the registration of a custom RCEC error handler isn't
>>> worth doing it as CXL would be its only user. Instead, just check for
>>> an CXL RCEC and pass it down to the connected CXL device's error
>>> handler. With this approach the code can entirely be implemented in
>>> the PCIe AER driver and is independent of the CXL subsystem. The CXL
>>> driver only provides the handler.
>>>
>>> [1] CXL 3.0 spec: 9.11.8 CXL Devices Attached to an RCH
>>> [2] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors
>>> [3] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices
>>>
>>> Co-developed-by: Terry Bowman 
>>> Signed-off-by: Terry Bowman 
>>> Signed-off-by: Robert Richter 
>>> Cc: "Oliver O'Halloran" 
>>> Cc: Bjorn Helgaas 
>>> Cc: linuxppc-dev@lists.ozlabs.org
>>> Cc: linux-...@vger.kernel.org
>>> Acked-by: Bjorn Helgaas 
>>> Reviewed-by: Jonathan Cameron 
>>> Reviewed-by: Dave Jiang 
>>> ---
>>>  drivers/pci/pcie/Kconfig | 12 +
>>>  drivers/pci/pcie/aer.c   | 96 +++-
>>>  2 files changed, 106 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
>>> index 228652a59f27..4f0e70fafe2d 100644
>>> --- a/drivers/pci/pcie/Kconfig
>>> +++ b/drivers/pci/pcie/Kconfig
>>> @@ -49,6 +49,18 @@ config PCIEAER_INJECT
>>>   gotten from:
>>>  
>>> https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/
>>>  
>>> +config PCIEAER_CXL
>>> +   bool "PCI Express CXL RAS support for Restricted Hosts (RCH)"
>>
>> Why the "for Restricted Hosts (RCH)" clarification? I am seeing nothing
>> that prevents this from working with RCECs on VH topologies.
>>
> 
> The same option can be used in VH mode. Will remove the RCH reference.
> 
>>
>>> +   default y
>>
>> Minor, but I think "default PCIEAER" makes it slightly clearer that CXL
>> error handling comes along for the ride with PCIE AER.
>>
> 
> We found Kconfig entries do not typically list a dependancy and the default 
> to be the same. We prefer to leave as 'default y'. If you want we can make 
> your requested change. 
> 
>>> +   depends on PCIEAER && CXL_PCI
>>> +   help
>>> + Enables error handling of downstream ports of a CXL host
>>> + that is operating in RCD mode (Restricted CXL Host, RCH).
>>> + The downstream port reports AER errors to a given RCEC.
>>> + Errors are handled by the CXL memory device driver.
>>> +
>>> + If unsure, say Y.
>>> +
>>>  #
>>>  # PCI Express ECRC
>>>  #
>>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>>> index d3344fcf1f79..c354ca5e8f2b 100644
>>> --- a/drivers/pci/pcie/aer.c
>>> +++ 

Re: [PATCH v10 13/15] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler

2023-09-19 Thread Terry Bowman
Hi Dan,

On 8/31/23 15:35, Dan Williams wrote:
> Terry Bowman wrote:
>> From: Robert Richter 
>>
>> In Restricted CXL Device (RCD) mode a CXL device is exposed as an
>> RCiEP, but CXL downstream and upstream ports are not enumerated and
>> not visible in the PCIe hierarchy. [1] Protocol and link errors from
>> these non-enumerated ports are signaled as internal AER errors, either
>> Uncorrectable Internal Error (UIE) or Corrected Internal Errors (CIE)
>> via an RCEC.
>>
>> Restricted CXL host (RCH) downstream port-detected errors have the
>> Requester ID of the RCEC set in the RCEC's AER Error Source ID
>> register. A CXL handler must then inspect the error status in various
>> CXL registers residing in the dport's component register space (CXL
>> RAS capability) or the dport's RCRB (PCIe AER extended
>> capability). [2]
>>
>> Errors showing up in the RCEC's error handler must be handled and
>> connected to the CXL subsystem. Implement this by forwarding the error
>> to all CXL devices below the RCEC. Since the entire CXL device is
>> controlled only using PCIe Configuration Space of device 0, function
>> 0, only pass it there [3]. The error handling is limited to currently
>> supported devices with the Memory Device class code set (CXL Type 3
>> Device, PCI_CLASS_MEMORY_CXL, 502h), handle downstream port errors in
>> the device's cxl_pci driver. Support for other CXL Device Types
>> (e.g. a CXL.cache Device) can be added later.
>>
>> To handle downstream port errors in addition to errors directed to the
>> CXL endpoint device, a handler must also inspect the CXL RAS and PCIe
>> AER capabilities of the CXL downstream port the device is connected
>> to.
>>
>> Since CXL downstream port errors are signaled using internal errors,
>> the handler requires those errors to be unmasked. This is subject of a
>> follow-on patch.
>>
>> The reason for choosing this implementation is that the AER service
>> driver claims the RCEC device, but does not allow it to register a
>> custom specific handler to support CXL. Connecting the RCEC hard-wired
>> with a CXL handler does not work, as the CXL subsystem might not be
>> present all the time. The alternative to add an implementation to the
>> portdrv to allow the registration of a custom RCEC error handler isn't
>> worth doing it as CXL would be its only user. Instead, just check for
>> an CXL RCEC and pass it down to the connected CXL device's error
>> handler. With this approach the code can entirely be implemented in
>> the PCIe AER driver and is independent of the CXL subsystem. The CXL
>> driver only provides the handler.
>>
>> [1] CXL 3.0 spec: 9.11.8 CXL Devices Attached to an RCH
>> [2] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors
>> [3] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices
>>
>> Co-developed-by: Terry Bowman 
>> Signed-off-by: Terry Bowman 
>> Signed-off-by: Robert Richter 
>> Cc: "Oliver O'Halloran" 
>> Cc: Bjorn Helgaas 
>> Cc: linuxppc-dev@lists.ozlabs.org
>> Cc: linux-...@vger.kernel.org
>> Acked-by: Bjorn Helgaas 
>> Reviewed-by: Jonathan Cameron 
>> Reviewed-by: Dave Jiang 
>> ---
>>  drivers/pci/pcie/Kconfig | 12 +
>>  drivers/pci/pcie/aer.c   | 96 +++-
>>  2 files changed, 106 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
>> index 228652a59f27..4f0e70fafe2d 100644
>> --- a/drivers/pci/pcie/Kconfig
>> +++ b/drivers/pci/pcie/Kconfig
>> @@ -49,6 +49,18 @@ config PCIEAER_INJECT
>>gotten from:
>>   
>> https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/
>>  
>> +config PCIEAER_CXL
>> +bool "PCI Express CXL RAS support for Restricted Hosts (RCH)"
> 
> Why the "for Restricted Hosts (RCH)" clarification? I am seeing nothing
> that prevents this from working with RCECs on VH topologies.
> 

The same option can be used in VH mode. Will remove the RCH reference.

> 
>> +default y
> 
> Minor, but I think "default PCIEAER" makes it slightly clearer that CXL
> error handling comes along for the ride with PCIE AER.
>

We found Kconfig entries do not typically list a dependancy and the default 
to be the same. We prefer to leave as 'default y'. If you want we can make 
your requested change. 

>> +depends on PCIEAER && CXL_PCI
>> +help
>> +  Enables error handling of downstream ports of a CXL host
>> +  that is operating in RCD mode (Restricted CXL Host, RCH).
>> +  The downstream port reports AER errors to a given RCEC.
>> +  Errors are handled by the CXL memory device driver.
>> +
>> +  If unsure, say Y.
>> +
>>  #
>>  # PCI Express ECRC
>>  #
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index d3344fcf1f79..c354ca5e8f2b 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -946,14 +946,100 @@ static bool find_source_device(struct pci_dev *parent,
>>  return true;
>>  }
>>  
>> +#ifdef CONFIG_PCIEAER_CXL
>> +
>> +static bool 

RE: [PATCH v10 13/15] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler

2023-08-31 Thread Dan Williams
Terry Bowman wrote:
> From: Robert Richter 
> 
> In Restricted CXL Device (RCD) mode a CXL device is exposed as an
> RCiEP, but CXL downstream and upstream ports are not enumerated and
> not visible in the PCIe hierarchy. [1] Protocol and link errors from
> these non-enumerated ports are signaled as internal AER errors, either
> Uncorrectable Internal Error (UIE) or Corrected Internal Errors (CIE)
> via an RCEC.
> 
> Restricted CXL host (RCH) downstream port-detected errors have the
> Requester ID of the RCEC set in the RCEC's AER Error Source ID
> register. A CXL handler must then inspect the error status in various
> CXL registers residing in the dport's component register space (CXL
> RAS capability) or the dport's RCRB (PCIe AER extended
> capability). [2]
> 
> Errors showing up in the RCEC's error handler must be handled and
> connected to the CXL subsystem. Implement this by forwarding the error
> to all CXL devices below the RCEC. Since the entire CXL device is
> controlled only using PCIe Configuration Space of device 0, function
> 0, only pass it there [3]. The error handling is limited to currently
> supported devices with the Memory Device class code set (CXL Type 3
> Device, PCI_CLASS_MEMORY_CXL, 502h), handle downstream port errors in
> the device's cxl_pci driver. Support for other CXL Device Types
> (e.g. a CXL.cache Device) can be added later.
> 
> To handle downstream port errors in addition to errors directed to the
> CXL endpoint device, a handler must also inspect the CXL RAS and PCIe
> AER capabilities of the CXL downstream port the device is connected
> to.
> 
> Since CXL downstream port errors are signaled using internal errors,
> the handler requires those errors to be unmasked. This is subject of a
> follow-on patch.
> 
> The reason for choosing this implementation is that the AER service
> driver claims the RCEC device, but does not allow it to register a
> custom specific handler to support CXL. Connecting the RCEC hard-wired
> with a CXL handler does not work, as the CXL subsystem might not be
> present all the time. The alternative to add an implementation to the
> portdrv to allow the registration of a custom RCEC error handler isn't
> worth doing it as CXL would be its only user. Instead, just check for
> an CXL RCEC and pass it down to the connected CXL device's error
> handler. With this approach the code can entirely be implemented in
> the PCIe AER driver and is independent of the CXL subsystem. The CXL
> driver only provides the handler.
> 
> [1] CXL 3.0 spec: 9.11.8 CXL Devices Attached to an RCH
> [2] CXL 3.0 spec, 12.2.1.1 RCH Downstream Port-detected Errors
> [3] CXL 3.0 spec, 8.1.3 PCIe DVSEC for CXL Devices
> 
> Co-developed-by: Terry Bowman 
> Signed-off-by: Terry Bowman 
> Signed-off-by: Robert Richter 
> Cc: "Oliver O'Halloran" 
> Cc: Bjorn Helgaas 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-...@vger.kernel.org
> Acked-by: Bjorn Helgaas 
> Reviewed-by: Jonathan Cameron 
> Reviewed-by: Dave Jiang 
> ---
>  drivers/pci/pcie/Kconfig | 12 +
>  drivers/pci/pcie/aer.c   | 96 +++-
>  2 files changed, 106 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
> index 228652a59f27..4f0e70fafe2d 100644
> --- a/drivers/pci/pcie/Kconfig
> +++ b/drivers/pci/pcie/Kconfig
> @@ -49,6 +49,18 @@ config PCIEAER_INJECT
> gotten from:
>
> https://git.kernel.org/cgit/linux/kernel/git/gong.chen/aer-inject.git/
>  
> +config PCIEAER_CXL
> + bool "PCI Express CXL RAS support for Restricted Hosts (RCH)"

Why the "for Restricted Hosts (RCH)" clarification? I am seeing nothing
that prevents this from working with RCECs on VH topologies.


> + default y

Minor, but I think "default PCIEAER" makes it slightly clearer that CXL
error handling comes along for the ride with PCIE AER.

> + depends on PCIEAER && CXL_PCI
> + help
> +   Enables error handling of downstream ports of a CXL host
> +   that is operating in RCD mode (Restricted CXL Host, RCH).
> +   The downstream port reports AER errors to a given RCEC.
> +   Errors are handled by the CXL memory device driver.
> +
> +   If unsure, say Y.
> +
>  #
>  # PCI Express ECRC
>  #
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index d3344fcf1f79..c354ca5e8f2b 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -946,14 +946,100 @@ static bool find_source_device(struct pci_dev *parent,
>   return true;
>  }
>  
> +#ifdef CONFIG_PCIEAER_CXL
> +
> +static bool is_cxl_mem_dev(struct pci_dev *dev)
> +{
> + /*
> +  * The capability, status, and control fields in Device 0,
> +  * Function 0 DVSEC control the CXL functionality of the
> +  * entire device (CXL 3.0, 8.1.3).
> +  */
> + if (dev->devfn != PCI_DEVFN(0, 0))
> + return false;
> +
> + /*
> +  * CXL Memory Devices must have the 502h class code set (CXL
> +  *