Re: [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem

2023-07-17 Thread Oded Gabbay
On Wed, Jun 21, 2023 at 8:24 PM Sebastian Wick wrote:
>
> On Fri, May 26, 2023 at 6:21 PM Aravind Iddamsetty wrote:
> >
> > Our hardware supports RAS (Reliability, Availability, Serviceability) by
> > exposing a set of error counters which can be used by observability
> > tools to take corrective actions or repairs. Traditionally these were
> > exposed via PMU (for relative counters) and a sysfs interface (for
> > absolute values) in our internal branch. However, this approach needs
> > two interfaces and supports neither event-based reporting nor
> > configurability, so the community suggested trying netlink as a drm
> > subsystem wide UAPI for RAS and telemetry, as discussed in [1].
> >
> > The discussion in [1] is the inspiration for this series. It uses the
> > generic netlink (genl) family subsystem and exposes a set of commands
> > that can be used by every drm driver; the framework also provides a
> > means to add custom commands. Each drm driver instance (in this example,
> > the xe driver instance) registers a family and operations with the genl
> > subsystem, through which it enumerates and reports the error counters.
> > Event-based notification is also supported: userspace can subscribe and
> > be notified when any error occurs, then read the error counter, which
> > avoids continuous polling of the error counters. This can also be
> > extended to threshold-based notification.
>
> Be aware that netlink can be quite awkward in user space because it's
> attached to the netns while the device is in the mount ns and there
> are special rules for netlink regarding namespacing.
I agree. We need to be sure this works in all common deployments,
mainly Docker and Kubernetes, before deciding to go down this path.

Oded

>
> > [1]: 
> > https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
> >
> > this series is on top of https://patchwork.freedesktop.org/series/116181/

Re: [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem

2023-06-21 Thread Sebastian Wick
On Fri, May 26, 2023 at 6:21 PM Aravind Iddamsetty wrote:
>
> Our hardware supports RAS (Reliability, Availability, Serviceability) by
> exposing a set of error counters which can be used by observability
> tools to take corrective actions or repairs. Traditionally these were
> exposed via PMU (for relative counters) and a sysfs interface (for
> absolute values) in our internal branch. However, this approach needs
> two interfaces and supports neither event-based reporting nor
> configurability, so the community suggested trying netlink as a drm
> subsystem wide UAPI for RAS and telemetry, as discussed in [1].
>
> The discussion in [1] is the inspiration for this series. It uses the
> generic netlink (genl) family subsystem and exposes a set of commands
> that can be used by every drm driver; the framework also provides a
> means to add custom commands. Each drm driver instance (in this example,
> the xe driver instance) registers a family and operations with the genl
> subsystem, through which it enumerates and reports the error counters.
> Event-based notification is also supported: userspace can subscribe and
> be notified when any error occurs, then read the error counter, which
> avoids continuous polling of the error counters. This can also be
> extended to threshold-based notification.

Be aware that netlink can be quite awkward in user space because it's
attached to the netns while the device is in the mount ns and there
are special rules for netlink regarding namespacing.
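
To make the namespace concern concrete, here is a minimal Python sketch
(not part of the series) that reads the network and mount namespace
identifiers of a process from /proc. A netlink socket belongs to the
network namespace it was created in, while a node such as /dev/dri/card1
is found through the mount namespace, so a container can see the device
node and still sit in a different netns than the host. As far as I know,
a genl family that does not set netnsok is only reachable from the
initial network namespace. The helper names below are made up for
illustration.

```python
import os
import re

# Link targets under /proc/<pid>/ns/ look like "net:[4026531992]".
_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(link):
    """Split a namespace link target into (kind, inode number)."""
    match = _NS_LINK.match(link)
    if match is None:
        raise ValueError(f"unexpected namespace link: {link!r}")
    return match.group(1), int(match.group(2))


def namespace_ids(pid="self"):
    """Return {'net': inode, 'mnt': inode}; None where the link is unreadable."""
    ids = {}
    for ns in ("net", "mnt"):
        try:
            ids[ns] = parse_ns_link(os.readlink(f"/proc/{pid}/ns/{ns}"))[1]
        except (OSError, ValueError):
            ids[ns] = None
    return ids


def same_netns(pid_a="self", pid_b="1"):
    """True/False if both netns ids are readable, None if either is not."""
    a = namespace_ids(pid_a)["net"]
    b = namespace_ids(pid_b)["net"]
    return None if a is None or b is None else a == b


if __name__ == "__main__":
    print(namespace_ids())
    print("same netns as pid 1:", same_netns())
```

Inside a typical container the mnt id differs from the host's, and the
net id differs too unless the container runs with host networking, which
is the case a deployment check would want to look at.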

> [1]: 
> https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>
> this series is on top of https://patchwork.freedesktop.org/series/116181/

Re: [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem

2023-06-06 Thread Iddamsetty, Aravind



On 05-06-2023 22:17, Alex Deucher wrote:
> Adding the relevant AMD folks for RAS.  We currently expose RAS via
> sysfs, but also have an event interface in KFD which may be somewhat
> similar to this.
> 
> If we were to converge on a common RAS interface, would we want to
> look at any commonality in bad page storage/reporting for device
> memory?

Could you please elaborate a bit on this?

Thanks,
Aravind.
> 
> Alex
> 
> On Fri, May 26, 2023 at 12:21 PM Aravind Iddamsetty wrote:
>>
>> Our hardware supports RAS (Reliability, Availability, Serviceability) by
>> exposing a set of error counters which can be used by observability
>> tools to take corrective actions or repairs. Traditionally these were
>> exposed via PMU (for relative counters) and a sysfs interface (for
>> absolute values) in our internal branch. However, this approach needs
>> two interfaces and supports neither event-based reporting nor
>> configurability, so the community suggested trying netlink as a drm
>> subsystem wide UAPI for RAS and telemetry, as discussed in [1].
>>
>> The discussion in [1] is the inspiration for this series. It uses the
>> generic netlink (genl) family subsystem and exposes a set of commands
>> that can be used by every drm driver; the framework also provides a
>> means to add custom commands. Each drm driver instance (in this example,
>> the xe driver instance) registers a family and operations with the genl
>> subsystem, through which it enumerates and reports the error counters.
>> Event-based notification is also supported: userspace can subscribe and
>> be notified when any error occurs, then read the error counter, which
>> avoids continuous polling of the error counters. This can also be
>> extended to threshold-based notification.
>>
>> [1]: 
>> https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>>
>> this series is on top of https://patchwork.freedesktop.org/series/116181/

Re: [Intel-xe] [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem

2023-06-05 Thread Iddamsetty, Aravind



On 04-06-2023 22:37, Tomer Tayar wrote:
> On 26/05/2023 19:20, Aravind Iddamsetty wrote:
>> Our hardware supports RAS (Reliability, Availability, Serviceability) by
>> exposing a set of error counters which can be used by observability
>> tools to take corrective actions or repairs. Traditionally these were
>> exposed via PMU (for relative counters) and a sysfs interface (for
>> absolute values) in our internal branch. However, this approach needs
>> two interfaces and supports neither event-based reporting nor
>> configurability, so the community suggested trying netlink as a drm
>> subsystem wide UAPI for RAS and telemetry, as discussed in [1].
>>
>> The discussion in [1] is the inspiration for this series. It uses the
>> generic netlink (genl) family subsystem and exposes a set of commands
>> that can be used by every drm driver; the framework also provides a
>> means to add custom commands. Each drm driver instance (in this example,
>> the xe driver instance) registers a family and operations with the genl
>> subsystem, through which it enumerates and reports the error counters.
>> Event-based notification is also supported: userspace can subscribe and
>> be notified when any error occurs, then read the error counter, which
>> avoids continuous polling of the error counters. This can also be
>> extended to threshold-based notification.
> 
> Hi Aravind,

Hi Tomer,

Thanks a lot for your review.
> 
> The habanalabs driver is another candidate to use this netlink-based drm 
> framework.
> As a single-user device, we have an additional "control" device that 
> allows multiple applications to query for information and to monitor the 
> "compute" device.
> And while we are about to move the compute device to the accel nodes, we 
> don't have a real replacement there for the control device.
> 
> Another possible usage of this framework for habanalabs is event
> notification.
> Currently we have an eventfd-based mechanism, and after being notified
> about an event, the user starts querying about the event and the relevant
> info, usually in several requests.
> With this framework it should supposedly be possible to gather all
> relevant info together with the event itself.

That is right; with the multicast event we can pack data too.
> 
> The current implementation seems intended more for errors (and quite
> "tailored" to Xe needs ...), while in habanalabs we would need it also
> for non-error static/dynamic info.
> Maybe we should revise the existing commands/attributes to be more generic?

Correct. At present that is the use case the xe driver has, and at least
the error part is, I believe, generic; if not, we can make it so, since
the framework is extensible. The idea I had was that generic commands
which every driver can use will be part of the drm framework, while
driver-specific commands or attributes will be part of the driver. But
some thought is needed here, since userspace needs to know the MAX
attribute, and on how to define the attribute policy, etc.
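
To illustrate why the MAX attribute and the policy matter to userspace,
here is a small Python sketch of policy-driven validation of already
parsed attributes. It is only an illustration under stated assumptions:
the attribute ids, the policy table and ATTR_MAX are hypothetical and do
not come from the series, and unknown attributes are rejected here
(strict style) rather than ignored.

```python
import struct

# Hypothetical attribute ids; a real UAPI header would define these.
ATTR_UNSPEC, ATTR_ERROR_ID, ATTR_COUNTER, ATTR_NAME = range(4)
ATTR_MAX = ATTR_NAME

# Hypothetical policy: attribute id -> (kind, size limit in bytes).
POLICY = {
    ATTR_ERROR_ID: ("u64", 8),
    ATTR_COUNTER: ("u64", 8),
    ATTR_NAME: ("string", 64),  # maximum length including the NUL byte
}


def validate(attrs, policy=POLICY, attr_max=ATTR_MAX):
    """Check {attr_id: payload_bytes} against the policy and decode it."""
    decoded = {}
    for attr_id, payload in attrs.items():
        if attr_id > attr_max or attr_id not in policy:
            raise ValueError(f"attribute {attr_id} is not covered by the policy")
        kind, size = policy[attr_id]
        if kind == "u64":
            if len(payload) != size:
                raise ValueError(f"attribute {attr_id}: expected {size} bytes")
            decoded[attr_id] = struct.unpack("=Q", payload)[0]
        else:  # NUL-terminated string
            if len(payload) > size or not payload.endswith(b"\0"):
                raise ValueError(f"attribute {attr_id}: bad string payload")
            decoded[attr_id] = payload[:-1].decode()
    return decoded
```

If common attributes live in the drm framework and private ones in the
driver, both sides need to agree on where the common range ends, which
is the open question raised above.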

> 
> Moreover, the drm part is very small, while most of the netlink "mess" 
> is still done by the specific driver.
> So what is the added value in making it a "drm framework"? Do we enforce 
> something here for drm drivers that use it? Do we help them with simpler 
> APIs and hiding the internals of netlink?
> Maybe it would be worth moving some functionality from the Xe driver
> into drm helpers?

Your suggestion sounds good and interesting, but it needs some
analysis: for example, if we move the registration parts to the drm
framework, how would we register driver-private commands and attributes,
if there are any? But yes, having most of it at the drm level helps all
the drivers. I'll do some analysis and come back on this.

Thanks,
Aravind.

> 
> Thanks,
> Tomer
> 
>> [1]: 
>> https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>>
>> this series is on top of https://patchwork.freedesktop.org/series/116181/

Re: [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem

2023-06-05 Thread Alex Deucher
Adding the relevant AMD folks for RAS.  We currently expose RAS via
sysfs, but also have an event interface in KFD which may be somewhat
similar to this.

If we were to converge on a common RAS interface, would we want to
look at any commonality in bad page storage/reporting for device
memory?

Alex

On Fri, May 26, 2023 at 12:21 PM Aravind Iddamsetty wrote:
>
> Our hardware supports RAS (Reliability, Availability, Serviceability) by
> exposing a set of error counters which can be used by observability
> tools to take corrective actions or repairs. Traditionally these were
> exposed via PMU (for relative counters) and a sysfs interface (for
> absolute values) in our internal branch. However, this approach needs
> two interfaces and supports neither event-based reporting nor
> configurability, so the community suggested trying netlink as a drm
> subsystem wide UAPI for RAS and telemetry, as discussed in [1].
>
> The discussion in [1] is the inspiration for this series. It uses the
> generic netlink (genl) family subsystem and exposes a set of commands
> that can be used by every drm driver; the framework also provides a
> means to add custom commands. Each drm driver instance (in this example,
> the xe driver instance) registers a family and operations with the genl
> subsystem, through which it enumerates and reports the error counters.
> Event-based notification is also supported: userspace can subscribe and
> be notified when any error occurs, then read the error counter, which
> avoids continuous polling of the error counters. This can also be
> extended to threshold-based notification.
>
> [1]: 
> https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>
> this series is on top of https://patchwork.freedesktop.org/series/116181/

Re: [Intel-xe] [RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem

2023-06-04 Thread Tomer Tayar
On 26/05/2023 19:20, Aravind Iddamsetty wrote:
> Our hardware supports RAS (Reliability, Availability, Serviceability) by
> exposing a set of error counters which can be used by observability
> tools to take corrective actions or repairs. Traditionally these were
> exposed via PMU (for relative counters) and a sysfs interface (for
> absolute values) in our internal branch. However, this approach needs
> two interfaces and supports neither event-based reporting nor
> configurability, so the community suggested trying netlink as a drm
> subsystem wide UAPI for RAS and telemetry, as discussed in [1].
>
> The discussion in [1] is the inspiration for this series. It uses the
> generic netlink (genl) family subsystem and exposes a set of commands
> that can be used by every drm driver; the framework also provides a
> means to add custom commands. Each drm driver instance (in this example,
> the xe driver instance) registers a family and operations with the genl
> subsystem, through which it enumerates and reports the error counters.
> Event-based notification is also supported: userspace can subscribe and
> be notified when any error occurs, then read the error counter, which
> avoids continuous polling of the error counters. This can also be
> extended to threshold-based notification.

Hi Aravind,

The habanalabs driver is another candidate to use this netlink-based drm 
framework.
As a single-user device, we have an additional "control" device that 
allows multiple applications to query for information and to monitor the 
"compute" device.
And while we are about to move the compute device to the accel nodes, we 
don't have a real replacement there for the control device.

Another possible usage of this framework for habanalabs is event
notification.
Currently we have an eventfd-based mechanism, and after being notified
about an event, the user starts querying about the event and the relevant
info, usually in several requests.
With this framework it should supposedly be possible to gather all
relevant info together with the event itself.
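
As a rough illustration of an event that carries its data with it, the
Python sketch below packs and parses netlink attributes (the TLV format
with a 4-byte header and 4-byte alignment) as they might appear in a
multicast event payload. The attribute ids and the choice of fields are
hypothetical; the series may define them differently.

```python
import struct

NLA_HDRLEN = 4         # struct nlattr: u16 nla_len, u16 nla_type
NLA_ALIGNTO = 4        # attributes are padded to 4-byte boundaries
NLA_TYPE_MASK = 0x3FFF  # strips the NESTED / NET_BYTEORDER flag bits

# Hypothetical attribute ids for an error event.
EVT_ATTR_ERROR_ID = 1
EVT_ATTR_COUNTER = 2
EVT_ATTR_NAME = 3


def nla_align(length):
    """Round length up to the next multiple of NLA_ALIGNTO."""
    return (length + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)


def pack_attr(attr_type, payload):
    """Encode one attribute: header, payload, zero padding."""
    length = NLA_HDRLEN + len(payload)
    pad = nla_align(length) - length
    return struct.pack("=HH", length, attr_type) + payload + b"\0" * pad


def parse_attrs(buf):
    """Walk a buffer of attributes and return {attr_type: payload}."""
    attrs = {}
    offset = 0
    while offset + NLA_HDRLEN <= len(buf):
        length, attr_type = struct.unpack_from("=HH", buf, offset)
        if length < NLA_HDRLEN or offset + length > len(buf):
            raise ValueError("truncated or malformed attribute")
        attrs[attr_type & NLA_TYPE_MASK] = buf[offset + NLA_HDRLEN:offset + length]
        offset += nla_align(length)
    return attrs


def build_error_event(error_id, counter, name):
    """Pack id, counter value and name into a single event payload."""
    return (pack_attr(EVT_ATTR_ERROR_ID, struct.pack("=Q", error_id))
            + pack_attr(EVT_ATTR_COUNTER, struct.pack("=Q", counter))
            + pack_attr(EVT_ATTR_NAME, name.encode() + b"\0"))
```

A listener that receives such a payload gets the error id, the current
counter value and the name in one message, instead of being woken up by
an eventfd and then issuing several follow-up queries.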

The current implementation seems intended more for errors (and quite
"tailored" to Xe needs ...), while in habanalabs we would need it also
for non-error static/dynamic info.
Maybe we should revise the existing commands/attributes to be more generic?

Moreover, the drm part is very small, while most of the netlink "mess" 
is still done by the specific driver.
So what is the added value in making it a "drm framework"? Do we enforce 
something here for drm drivers that use it? Do we help them with simpler 
APIs and hiding the internals of netlink?
Maybe it would be worth moving some functionality from the Xe driver 
into drm helpers?

Thanks,
Tomer

> [1]: 
> https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html
>
> this series is on top of https://patchwork.freedesktop.org/series/116181/

[RFC 0/5] Proposal to use netlink for RAS and Telemetry across drm subsystem

2023-05-26 Thread Aravind Iddamsetty
Our hardware supports RAS (Reliability, Availability, Serviceability) by
exposing a set of error counters which can be used by observability
tools to take corrective actions or repairs. Traditionally these were
exposed via PMU (for relative counters) and a sysfs interface (for
absolute values) in our internal branch. However, this approach needs
two interfaces and supports neither event-based reporting nor
configurability, so the community suggested trying netlink as a drm
subsystem wide UAPI for RAS and telemetry, as discussed in [1].

The discussion in [1] is the inspiration for this series. It uses the
generic netlink (genl) family subsystem and exposes a set of commands
that can be used by every drm driver; the framework also provides a
means to add custom commands. Each drm driver instance (in this example,
the xe driver instance) registers a family and operations with the genl
subsystem, through which it enumerates and reports the error counters.
Event-based notification is also supported: userspace can subscribe and
be notified when any error occurs, then read the error counter, which
avoids continuous polling of the error counters. This can also be
extended to threshold-based notification.
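
For readers unfamiliar with genl, the first step userspace takes is to
resolve the family name to a numeric id through the "nlctrl" controller
family. The Python sketch below builds that CTRL_CMD_GETFAMILY request by
hand. It is a sketch under assumptions rather than the tool from this
series: the family name used in the usage note is hypothetical, and the
function names are made up. The numeric constants are the ones from the
Linux netlink and genetlink UAPI headers.

```python
import socket
import struct

NETLINK_GENERIC = 16       # netlink protocol number for generic netlink
GENL_ID_CTRL = 0x10        # fixed id of the "nlctrl" controller family
NLM_F_REQUEST = 0x01
CTRL_CMD_GETFAMILY = 3
CTRL_ATTR_FAMILY_NAME = 2
NLMSG_HDRLEN = 16          # struct nlmsghdr: u32 len, u16 type, u16 flags, u32 seq, u32 pid


def build_getfamily_request(family_name, seq=1, pid=0):
    """Return the bytes of a CTRL_CMD_GETFAMILY request for family_name."""
    name = family_name.encode() + b"\0"
    attr_len = 4 + len(name)                      # nlattr header + payload
    attr = struct.pack("=HH", attr_len, CTRL_ATTR_FAMILY_NAME) + name
    attr += b"\0" * (-attr_len % 4)               # pad to a 4-byte boundary
    genl_hdr = struct.pack("=BBH", CTRL_CMD_GETFAMILY, 1, 0)  # cmd, version, reserved
    payload = genl_hdr + attr
    nl_hdr = struct.pack("=IHHII", NLMSG_HDRLEN + len(payload),
                         GENL_ID_CTRL, NLM_F_REQUEST, seq, pid)
    return nl_hdr + payload


def query_family(family_name):
    """Send the request on a generic netlink socket and return the raw reply (Linux only)."""
    with socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, NETLINK_GENERIC) as sock:
        sock.bind((0, 0))  # let the kernel assign the port id, no multicast groups
        sock.send(build_getfamily_request(family_name))
        return sock.recv(65536)
```

For example, query_family("drm_ras_example") would return either a reply
carrying the family id or a netlink error message if no such family is
registered; "drm_ras_example" is a placeholder, since the name each
driver instance registers is defined by the patches, not here.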

[1]: https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html

This series is on top of https://patchwork.freedesktop.org/series/116181/

Below is an example tool drm_ras which demonstrates the use of the
supported commands. The tool will be sent to ML with the subject
"[RFC i-g-t 0/1] A tool to demonstrate use of netlink sockets to read RAS error 
counters"

read single error counter:

$ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 --error_id=0x0005
counter value 0

read all error counters:

$ ./drm_ras READ_ALL --device=drm:/dev/dri/card1
name                              config-id   counter

error-gt0-correctable-guc         0x0001      0
error-gt0-correctable-slm         0x0003      0
error-gt0-correctable-eu-ic       0x0004      0
error-gt0-correctable-eu-grf      0x0005      0
error-gt0-fatal-guc               0x0009      0
error-gt0-fatal-slm               0x000d      0
error-gt0-fatal-eu-grf            0x000f      0
error-gt0-fatal-fpu               0x0010      0
error-gt0-fatal-tlb               0x0011      0
error-gt0-fatal-l3-fabric         0x0012      0
error-gt0-correctable-subslice    0x0013      0
error-gt0-correctable-l3bank      0x0014      0
error-gt0-fatal-subslice          0x0015      0
error-gt0-fatal-l3bank            0x0016      0
error-gt0-sgunit-correctable      0x0017      0
error-gt0-sgunit-nonfatal         0x0018      0
error-gt0-sgunit-fatal            0x0019      0
error-gt0-soc-fatal-psf-csc-0     0x001a      0
error-gt0-soc-fatal-psf-csc-1     0x001b      0
error-gt0-soc-fatal-psf-csc-2     0x001c      0
error-gt0-soc-fatal-punit         0x001d      0
error-gt0-soc-fatal-psf-0         0x001e      0
error-gt0-soc-fatal-psf-1         0x001f      0
error-gt0-soc-fatal-psf-2         0x0020      0
error-gt0-soc-fatal-cd0           0x0021      0
error-gt0-soc-fatal-cd0-mdfi      0x0022      0
error-gt0-soc-fatal-mdfi-east     0x0023      0
error-gt0-soc-fatal-mdfi-south    0x0024      0
error-gt0-soc-fatal-hbm-ss0-0     0x0025      0
error-gt0-soc-fatal-hbm-ss0-1     0x0026      0
error-gt0-soc-fatal-hbm-ss0-2     0x0027      0
error-gt0-soc-fatal-hbm-ss0-3     0x0028      0
error-gt0-soc-fatal-hbm-ss0-4     0x0029      0
error-gt0-soc-fatal-hbm-ss0-5     0x002a      0
error-gt0-soc-fatal-hbm-ss0-6     0x002b      0
error-gt0-soc-fatal-hbm-ss0-7     0x002c      0
error-gt0-soc-fatal-hbm-ss1-0     0x002d      0
error-gt0-soc-fatal-hbm-ss1-1