Re: [PATCH 5/5] Documentation/PCI: Add details of PCI AER statistics

2018-05-23 Thread Greg Kroah-Hartman
On Tue, May 22, 2018 at 03:28:05PM -0700, Rajat Jain wrote:
> Add the PCI AER statistics details to
> Documentation/PCI/pcieaer-howto.txt
> 
> Signed-off-by: Rajat Jain 
> ---
>  Documentation/PCI/pcieaer-howto.txt | 35 +
>  1 file changed, 35 insertions(+)
> 
> diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt
> index acd06bb8..86ee9f9ff5e1 100644
> --- a/Documentation/PCI/pcieaer-howto.txt
> +++ b/Documentation/PCI/pcieaer-howto.txt
> @@ -73,6 +73,41 @@ In the example, 'Requester ID' means the ID of the device who sends
>  the error message to root port. Pls. refer to pci express specs for
>  other fields.
>  
> +2.4 AER statistics
> +
> +When AER messages are captured, the statistics are exposed via the following
> +sysfs attributes under the "aer_stats" folder for the device:
> +
> +2.4.1 Device sysfs Attributes
> +
> +These attributes show up under all the devices that are AER capable. These
> +indicate the errors "as seen by the device". Note that this may mean that if
> +an end point is causing problems, the AER counters may increment at its link
> +partner (e.g. root port) because the errors will be "seen" by the link partner
> +and not the problematic end point itself (which may report all counters
> +as 0 as it never saw any problems).
> +
> + * dev_total_cor_errs: number of correctable errors seen by the device.
> + * dev_total_fatal_errs: number of fatal uncorrectable errors seen by the device.
> + * dev_total_nonfatal_errs: number of nonfatal uncorrectable errors seen by the device.
> + * dev_breakdown_correctable: Provides a breakdown of different type of
> +  correctable errors seen.
> + * dev_breakdown_uncorrectable: Provides a breakdown of different type of
> +  uncorrectable errors seen.
> +
> +2.4.2 Rootport sysfs Attributes
> +
> +These attributes show up only under the rootports that are AER capable. These
> +indicate the number of error messages as "reported to" the rootport. Please note
> +that the rootports also transmit (internally) the ERR_* messages for errors seen
> +by the internal rootport PCI device, so these counters include them and are
> +thus cumulative of all the error messages on the PCI hierarchy originating
> +at that root port.
> +
> + * rootport_total_cor_errs: number of ERR_COR messages reported to rootport.
> + * rootport_total_fatal_errs: number of ERR_FATAL messages reported to rootport.
> + * rootport_total_nonfatal_errs: number of ERR_NONFATAL messages reported to
> + rootport.

These all belong in Documentation/ABI/ please.

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5] Documentation/PCI: Add details of PCI AER statistics

2018-05-22 Thread Rajat Jain
Hi,

On Tue, May 22, 2018 at 3:52 PM, Alex G.  wrote:
> On 05/22/2018 05:28 PM, Rajat Jain wrote:
>> Add the PCI AER statistics details to
>> Documentation/PCI/pcieaer-howto.txt
>>
>> Signed-off-by: Rajat Jain 
>> ---
>>  Documentation/PCI/pcieaer-howto.txt | 35 +
>>  1 file changed, 35 insertions(+)
>>
>> diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt
>> index acd06bb8..86ee9f9ff5e1 100644
>> --- a/Documentation/PCI/pcieaer-howto.txt
>> +++ b/Documentation/PCI/pcieaer-howto.txt
>> @@ -73,6 +73,41 @@ In the example, 'Requester ID' means the ID of the device who sends
>>  the error message to root port. Pls. refer to pci express specs for
>>  other fields.
>>
>> +2.4 AER statistics
>> +
>> +When AER messages are captured, the statistics are exposed via the following
>> +sysfs attributes under the "aer_stats" folder for the device:
>> +
>> +2.4.1 Device sysfs Attributes
>> +
>> +These attributes show up under all the devices that are AER capable. These
>> +indicate the errors "as seen by the device". Note that this may mean that if
>> +an end point is causing problems, the AER counters may increment at its link
>> +partner (e.g. root port) because the errors will be "seen" by the link partner
>> +and not the problematic end point itself (which may report all counters
>> +as 0 as it never saw any problems).
>
> I was afraid of that. Is there a way to look at the requester ID to log
> AER errors to the correct device?

I do not think it is possible to pinpoint the source of the problem.
Errors may be caused by suboptimal link tuning, signal integrity
issues, or either of the link partners. Both link partners will
detect and report the errors that they "see".

The bits and errors defined by the PCIe spec follow the same semantics, i.e.
   => the spec defines the different error conditions "as
      seen/encountered by the device",
   => thus the device reports those errors to the root port,
   => which is what we are counting and reporting here.

IMHO, any interpretation / analysis of this error data / counters
should be left to the user, who can look at different devices
and the errors they see, and then conclude what might be the
problem.
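
As a rough illustration of that kind of user-side comparison, a minimal
Python sketch could collect the totals for every device and list the
ones with non-zero counters. It assumes the attributes live in an
"aer_stats" directory under each entry of /sys/bus/pci/devices and that
each total reads as a single integer; both are assumptions about this
patch series, not a confirmed interface. The function names are
hypothetical.

```python
from pathlib import Path

# Attribute names as proposed in this patch. The sysfs root and the
# "one integer per file" format are assumptions made for illustration.
DEVICE_COUNTERS = (
    "dev_total_cor_errs",
    "dev_total_fatal_errs",
    "dev_total_nonfatal_errs",
)


def read_aer_totals(sysfs_root="/sys/bus/pci/devices"):
    """Return {device_name: {counter_name: value}} for devices with aer_stats."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}
    totals = {}
    for dev in sorted(root.iterdir()):
        stats_dir = dev / "aer_stats"
        if not stats_dir.is_dir():
            continue  # device is not AER capable, or the attributes are absent
        counters = {}
        for name in DEVICE_COUNTERS:
            try:
                counters[name] = int((stats_dir / name).read_text().strip())
            except (OSError, ValueError):
                continue  # attribute missing or not a plain integer
        totals[dev.name] = counters
    return totals


def devices_with_errors(totals):
    """Keep only the devices where at least one counter is non-zero."""
    return {dev: c for dev, c in totals.items() if any(c.values())}


if __name__ == "__main__":
    for dev, counters in devices_with_errors(read_aer_totals()).items():
        print(dev, counters)
```

Comparing the output for an end point and for its link partner is left
to the reader, since a non-zero counter only says where an error was
seen, not where it was caused.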

Thanks,
Rajat

>
> Alex


Re: [PATCH 5/5] Documentation/PCI: Add details of PCI AER statistics

2018-05-22 Thread Alex G.
On 05/22/2018 05:28 PM, Rajat Jain wrote:
> Add the PCI AER statistics details to
> Documentation/PCI/pcieaer-howto.txt
> 
> Signed-off-by: Rajat Jain 
> ---
>  Documentation/PCI/pcieaer-howto.txt | 35 +
>  1 file changed, 35 insertions(+)
> 
> diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt
> index acd06bb8..86ee9f9ff5e1 100644
> --- a/Documentation/PCI/pcieaer-howto.txt
> +++ b/Documentation/PCI/pcieaer-howto.txt
> @@ -73,6 +73,41 @@ In the example, 'Requester ID' means the ID of the device who sends
>  the error message to root port. Pls. refer to pci express specs for
>  other fields.
>  
> +2.4 AER statistics
> +
> +When AER messages are captured, the statistics are exposed via the following
> +sysfs attributes under the "aer_stats" folder for the device:
> +
> +2.4.1 Device sysfs Attributes
> +
> +These attributes show up under all the devices that are AER capable. These
> +indicate the errors "as seen by the device". Note that this may mean that if
> +an end point is causing problems, the AER counters may increment at its link
> +partner (e.g. root port) because the errors will be "seen" by the link partner
> +and not the problematic end point itself (which may report all counters
> +as 0 as it never saw any problems).

I was afraid of that. Is there a way to look at the requester ID to log
AER errors to the correct device?

Alex


[PATCH 5/5] Documentation/PCI: Add details of PCI AER statistics

2018-05-22 Thread Rajat Jain
Add the PCI AER statistics details to
Documentation/PCI/pcieaer-howto.txt

Signed-off-by: Rajat Jain 
---
 Documentation/PCI/pcieaer-howto.txt | 35 +
 1 file changed, 35 insertions(+)

diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt
index acd06bb8..86ee9f9ff5e1 100644
--- a/Documentation/PCI/pcieaer-howto.txt
+++ b/Documentation/PCI/pcieaer-howto.txt
@@ -73,6 +73,41 @@ In the example, 'Requester ID' means the ID of the device who sends
 the error message to root port. Pls. refer to pci express specs for
 other fields.
 
+2.4 AER statistics
+
+When AER messages are captured, the statistics are exposed via the following
+sysfs attributes under the "aer_stats" folder for the device:
+
+2.4.1 Device sysfs Attributes
+
+These attributes show up under all the devices that are AER capable. These
+indicate the errors "as seen by the device". Note that this may mean that if
+an end point is causing problems, the AER counters may increment at its link
+partner (e.g. root port) because the errors will be "seen" by the link partner
+and not the problematic end point itself (which may report all counters
+as 0 as it never saw any problems).
+
+ * dev_total_cor_errs: number of correctable errors seen by the device.
+ * dev_total_fatal_errs: number of fatal uncorrectable errors seen by the device.
+ * dev_total_nonfatal_errs: number of nonfatal uncorrectable errors seen by the device.
+ * dev_breakdown_correctable: Provides a breakdown of different type of
+  correctable errors seen.
+ * dev_breakdown_uncorrectable: Provides a breakdown of different type of
+  uncorrectable errors seen.
+
+2.4.2 Rootport sysfs Attributes
+
+These attributes show up only under the rootports that are AER capable. These
+indicate the number of error messages as "reported to" the rootport. Please note
+that the rootports also transmit (internally) the ERR_* messages for errors seen
+by the internal rootport PCI device, so these counters include them and are
+thus cumulative of all the error messages on the PCI hierarchy originating
+at that root port.
+
+ * rootport_total_cor_errs: number of ERR_COR messages reported to rootport.
+ * rootport_total_fatal_errs: number of ERR_FATAL messages reported to rootport.
+ * rootport_total_nonfatal_errs: number of ERR_NONFATAL messages reported to
+ rootport.
 
 3. Developer Guide
 
-- 
2.17.0.441.gb46fe60e1d-goog
