Re: [PATCHv3 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO

2024-04-01 Thread Xi Ruoyao
On Wed, 2024-03-27 at 11:49 +0800, Ethan Zhao wrote:
> so, yup, basically, the signal integrity is not good enough.
> Though the function could work, its performance will be impacted.

FWIW I've replaced the motherboard and this is gone.  So it's likely a
signal integrity issue of the motherboard.

-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University


Re: [PATCHv3 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO

2024-03-26 Thread Ethan Zhao

On 3/27/2024 5:17 AM, Bjorn Helgaas wrote:

On Tue, Mar 26, 2024 at 09:39:54AM +0800, Ethan Zhao wrote:

On 3/25/2024 6:15 PM, Xi Ruoyao wrote:

On Mon, 2024-03-25 at 16:45 +0800, Ethan Zhao wrote:

On 3/25/2024 1:19 AM, Xi Ruoyao wrote:

On Mon, 2023-09-18 at 14:39 -0500, Bjorn Helgaas wrote:

On Mon, Sep 18, 2023 at 07:42:30PM +0800, Xi Ruoyao wrote:

...
My workstation suffers from too much correctable AER reporting as well
(related to Intel's errata "RPL013: Incorrectly Formed PCIe Packets May
Generate Correctable Errors" and/or the motherboard design, I guess).

We should rate-limit correctable error reporting so it's not
overwhelming.

At the same time, I'm *also* interested in the cause of these errors,
in case there's a Linux defect or a hardware erratum that we can work
around.  Do you have a bug report with any more details, e.g., a dmesg
log and "sudo lspci -vv" output?

Hi Bjorn,

Sorry for the *very* late reply (somehow I didn't see the reply at all
before it was removed by my cron job, and now I just salvaged it from
lore.kernel.org...)

The dmesg is like:

[  882.456994] pcieport :00:1c.1: AER: Multiple Correctable error message 
received from :00:1c.1
[  882.457002] pcieport :00:1c.1: AER: found no error details for 
:00:1c.1
[  882.457003] pcieport :00:1c.1: AER: Multiple Correctable error message 
received from :06:00.0
[  883.545763] pcieport :00:1c.1: AER: Multiple Correctable error message 
received from :00:1c.1
[  883.545789] pcieport :00:1c.1: PCIe Bus Error: severity=Correctable, 
type=Physical Layer, (Receiver ID)
[  883.545790] pcieport :00:1c.1:   device [8086:7a39] error 
status/mask=0001/2000
[  883.545792] pcieport :00:1c.1:    [ 0] RxErr  (First)
[  883.545794] pcieport :00:1c.1: AER:   Error of this Agent is reported 
first
[  883.545798] r8169 :06:00.0: PCIe Bus Error: severity=Correctable, 
type=Physical Layer, (Transmitter ID)
[  883.545799] r8169 :06:00.0:   device [10ec:8125] error 
status/mask=1101/e000
[  883.545800] r8169 :06:00.0:    [ 0] RxErr  (First)
[  883.545801] r8169 :06:00.0:    [ 8] Rollover
[  883.545802] r8169 :06:00.0:    [12] Timeout
[  883.545815] pcieport :00:1c.1: AER: Correctable error message received 
from :00:1c.1
[  883.545823] pcieport :00:1c.1: AER: found no error details for 
:00:1c.1
[  883.545824] pcieport :00:1c.1: AER: Multiple Correctable error message 
received from :06:00.0

lspci output attached.

Intel has issued an errata "RPL013" saying:

"Under complex microarchitectural conditions, the PCIe controller may
transmit an incorrectly formed Transaction Layer Packet (TLP), which
will fail CRC checks. When this erratum occurs, the PCIe end point may
record correctable errors resulting in either a NAK or link recovery.
Intel® has not observed any functional impact due to this erratum."

But I'm really unsure if it describes my issue.

Do you think I have some broken hardware and I should replace the CPU
and/or the motherboard (where the r8169 is soldered)?  I've noticed that
my 13900K is almost impossible to overclock (despite it's a K), but I've
not encountered any issue other than these AER reporting so far after I
gave up overclocking.

Seems there are two r8169 nics on your board, only :06:00.0 reports
aer errors, how about another one the :07:00.0 nic ?

It never happens to :07:00.0, even if I plug the ethernet cable into
it instead of :06:00.0.

So something is wrong with the physical layer, I guess.


Maybe I should just use :07:00.0 and blacklist :06:00.0 as I
don't need two NICs?

Yup,
ratelimit the AER warning is another choice instead of change WARN to INFO.
if corrected error flood happens, even the function is working, suggests
something was already wrong, likely will be worse, that is the meaning of
WARN I think.

We should fix this.  IMHO Correctable Errors should be "info" level,
non-alarming, and rate-limited.  They're basically hints about link
integrity.


This case hit the following errors:

[  883.545800] r8169 :06:00.0:    [ 0] RxErr
[  883.545801] r8169 :06:00.0:    [ 8] Rollover
[  883.545802] r8169 :06:00.0:    [12] Timeout

#1 Timeout -- replay timer timeout. The endpoint didn't respond with an ACK
or NAK DLLP in time, so the replay timer expired and the sender re-sent the
packet.

#2 Rollover -- the retransmission counter rolled over to 0 (from 11b to 00b),
which means the sender had already retried 3 times. That triggers link
retraining to recover.

#1 and #2 happened together, but no uncorrectable errors were reported, which
means the link recovered. The issue is most likely caused by improper TxEQ,
receiver equalization, or bad signal integrity.

#3 RxErr -- bad DLLP, bad TLP, clock issue, signal integrity issue etc.

so, yup, basically, the signal integrity is not good enough.
Though the function could work, its performance will be impacted.
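
For reference, the status value in the log above (1101) can be decoded bit
by bit. The following is a minimal userspace sketch, not kernel code: the bit
positions follow the AER Correctable Error Status register layout and the
short names follow the strings seen in the dmesg output, while cor_bits and
decode_cor_status() are names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bit positions in the AER Correctable Error Status register. */
struct cor_bit {
	unsigned int bit;
	const char *name;
};

static const struct cor_bit cor_bits[] = {
	{  0, "RxErr"       },	/* Receiver Error */
	{  6, "BadTLP"      },
	{  7, "BadDLLP"     },
	{  8, "Rollover"    },	/* REPLAY_NUM Rollover */
	{ 12, "Timeout"     },	/* Replay Timer Timeout */
	{ 13, "NonFatalErr" },	/* Advisory Non-Fatal Error */
	{ 14, "CorrIntErr"  },	/* Corrected Internal Error */
	{ 15, "HeaderOF"    },	/* Header Log Overflow */
};

/*
 * Write the names of the bits set in `status` into `out`, separated by
 * single spaces. Returns the number of names written; stops early if the
 * next name would not fit, without writing a partial name.
 */
static int decode_cor_status(uint32_t status, char *out, size_t len)
{
	size_t used = 0;
	int found = 0;

	assert(out != NULL && len > 0);
	out[0] = '\0';

	for (size_t i = 0; i < sizeof(cor_bits) / sizeof(cor_bits[0]); i++) {
		const char *name = cor_bits[i].name;
		size_t n = strlen(name);

		if (!(status & (UINT32_C(1) << cor_bits[i].bit)))
			continue;
		/* Need room for an optional separator, the name, and the NUL. */
		if (used + n + (found ? 1 : 0) >= len)
			break;
		if (found)
			out[used++] = ' ';
		memcpy(out + used, name, n + 1);
		used += n;
		found++;
	}
	return found;
}
```

Decoding 0x1101 gives "RxErr Rollover Timeout", matching the three lines
printed for the r8169 device. Decoding the mask value 0xe000 gives
"NonFatalErr CorrIntErr HeaderOF", so those three sources are the masked ones.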

If we change 

Re: [PATCHv3 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO

2024-03-26 Thread Bjorn Helgaas
On Tue, Mar 26, 2024 at 09:39:54AM +0800, Ethan Zhao wrote:
> On 3/25/2024 6:15 PM, Xi Ruoyao wrote:
> > On Mon, 2024-03-25 at 16:45 +0800, Ethan Zhao wrote:
> > > On 3/25/2024 1:19 AM, Xi Ruoyao wrote:
> > > > On Mon, 2023-09-18 at 14:39 -0500, Bjorn Helgaas wrote:
> > > > > On Mon, Sep 18, 2023 at 07:42:30PM +0800, Xi Ruoyao wrote:
> > > > > > ...
> > > > > > My workstation suffers from too much correctable AER reporting as 
> > > > > > well
> > > > > > (related to Intel's errata "RPL013: Incorrectly Formed PCIe Packets 
> > > > > > May
> > > > > > Generate Correctable Errors" and/or the motherboard design, I 
> > > > > > guess).
> > > > > We should rate-limit correctable error reporting so it's not
> > > > > overwhelming.
> > > > > 
> > > > > At the same time, I'm *also* interested in the cause of these errors,
> > > > > in case there's a Linux defect or a hardware erratum that we can work
> > > > > around.  Do you have a bug report with any more details, e.g., a dmesg
> > > > > log and "sudo lspci -vv" output?
> > > > Hi Bjorn,
> > > > 
> > > > Sorry for the *very* late reply (somehow I didn't see the reply at all
> > > > before it was removed by my cron job, and now I just salvaged it from
> > > > lore.kernel.org...)
> > > > 
> > > > The dmesg is like:
> > > > 
> > > > [  882.456994] pcieport :00:1c.1: AER: Multiple Correctable error 
> > > > message received from :00:1c.1
> > > > [  882.457002] pcieport :00:1c.1: AER: found no error details for 
> > > > :00:1c.1
> > > > [  882.457003] pcieport :00:1c.1: AER: Multiple Correctable error 
> > > > message received from :06:00.0
> > > > [  883.545763] pcieport :00:1c.1: AER: Multiple Correctable error 
> > > > message received from :00:1c.1
> > > > [  883.545789] pcieport :00:1c.1: PCIe Bus Error: 
> > > > severity=Correctable, type=Physical Layer, (Receiver ID)
> > > > [  883.545790] pcieport :00:1c.1:   device [8086:7a39] error 
> > > > status/mask=0001/2000
> > > > [  883.545792] pcieport :00:1c.1:    [ 0] RxErr  
> > > > (First)
> > > > [  883.545794] pcieport :00:1c.1: AER:   Error of this Agent is 
> > > > reported first
> > > > [  883.545798] r8169 :06:00.0: PCIe Bus Error: 
> > > > severity=Correctable, type=Physical Layer, (Transmitter ID)
> > > > [  883.545799] r8169 :06:00.0:   device [10ec:8125] error 
> > > > status/mask=1101/e000
> > > > [  883.545800] r8169 :06:00.0:    [ 0] RxErr  
> > > > (First)
> > > > [  883.545801] r8169 :06:00.0:    [ 8] Rollover
> > > > [  883.545802] r8169 :06:00.0:    [12] Timeout
> > > > [  883.545815] pcieport :00:1c.1: AER: Correctable error message 
> > > > received from :00:1c.1
> > > > [  883.545823] pcieport :00:1c.1: AER: found no error details for 
> > > > :00:1c.1
> > > > [  883.545824] pcieport :00:1c.1: AER: Multiple Correctable error 
> > > > message received from :06:00.0
> > > > 
> > > > lspci output attached.
> > > > 
> > > > Intel has issued an errata "RPL013" saying:
> > > > 
> > > > "Under complex microarchitectural conditions, the PCIe controller may
> > > > transmit an incorrectly formed Transaction Layer Packet (TLP), which
> > > > will fail CRC checks. When this erratum occurs, the PCIe end point may
> > > > record correctable errors resulting in either a NAK or link recovery.
> > > > Intel® has not observed any functional impact due to this erratum."
> > > > 
> > > > But I'm really unsure if it describes my issue.
> > > > 
> > > > Do you think I have some broken hardware and I should replace the CPU
> > > > and/or the motherboard (where the r8169 is soldered)?  I've noticed that
> > > > my 13900K is almost impossible to overclock (despite it's a K), but I've
> > > > not encountered any issue other than these AER reporting so far after I
> > > > gave up overclocking.
> > > Seems there are two r8169 nics on your board, only :06:00.0 reports
> > > aer errors, how about another one the :07:00.0 nic ?
> > It never happens to :07:00.0, even if I plug the ethernet cable into
> > it instead of :06:00.0.
> 
> So something is wrong with the physical layer, I guess.
> 
> > Maybe I should just use :07:00.0 and blacklist :06:00.0 as I
> > don't need two NICs?
> 
> Yup,
> ratelimit the AER warning is another choice instead of change WARN to INFO.
> if corrected error flood happens, even the function is working, suggests
> something was already wrong, likely will be worse, that is the meaning of
> WARN I think.

We should fix this.  IMHO Correctable Errors should be "info" level,
non-alarming, and rate-limited.  They're basically hints about link
integrity.

Bjorn


Re: [PATCHv3 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO

2024-03-25 Thread Ethan Zhao

On 3/25/2024 6:15 PM, Xi Ruoyao wrote:

On Mon, 2024-03-25 at 16:45 +0800, Ethan Zhao wrote:

On 3/25/2024 1:19 AM, Xi Ruoyao wrote:

On Mon, 2023-09-18 at 14:39 -0500, Bjorn Helgaas wrote:

On Mon, Sep 18, 2023 at 07:42:30PM +0800, Xi Ruoyao wrote:

...
My workstation suffers from too much correctable AER reporting as well
(related to Intel's errata "RPL013: Incorrectly Formed PCIe Packets May
Generate Correctable Errors" and/or the motherboard design, I guess).

We should rate-limit correctable error reporting so it's not
overwhelming.

At the same time, I'm *also* interested in the cause of these errors,
in case there's a Linux defect or a hardware erratum that we can work
around.  Do you have a bug report with any more details, e.g., a dmesg
log and "sudo lspci -vv" output?

Hi Bjorn,

Sorry for the *very* late reply (somehow I didn't see the reply at all
before it was removed by my cron job, and now I just salvaged it from
lore.kernel.org...)

The dmesg is like:

[  882.456994] pcieport :00:1c.1: AER: Multiple Correctable error message 
received from :00:1c.1
[  882.457002] pcieport :00:1c.1: AER: found no error details for 
:00:1c.1
[  882.457003] pcieport :00:1c.1: AER: Multiple Correctable error message 
received from :06:00.0
[  883.545763] pcieport :00:1c.1: AER: Multiple Correctable error message 
received from :00:1c.1
[  883.545789] pcieport :00:1c.1: PCIe Bus Error: severity=Correctable, 
type=Physical Layer, (Receiver ID)
[  883.545790] pcieport :00:1c.1:   device [8086:7a39] error 
status/mask=0001/2000
[  883.545792] pcieport :00:1c.1:    [ 0] RxErr  (First)
[  883.545794] pcieport :00:1c.1: AER:   Error of this Agent is reported 
first
[  883.545798] r8169 :06:00.0: PCIe Bus Error: severity=Correctable, 
type=Physical Layer, (Transmitter ID)
[  883.545799] r8169 :06:00.0:   device [10ec:8125] error 
status/mask=1101/e000
[  883.545800] r8169 :06:00.0:    [ 0] RxErr  (First)
[  883.545801] r8169 :06:00.0:    [ 8] Rollover
[  883.545802] r8169 :06:00.0:    [12] Timeout
[  883.545815] pcieport :00:1c.1: AER: Correctable error message received 
from :00:1c.1
[  883.545823] pcieport :00:1c.1: AER: found no error details for 
:00:1c.1
[  883.545824] pcieport :00:1c.1: AER: Multiple Correctable error message 
received from :06:00.0

lspci output attached.

Intel has issued an errata "RPL013" saying:

"Under complex microarchitectural conditions, the PCIe controller may
transmit an incorrectly formed Transaction Layer Packet (TLP), which
will fail CRC checks. When this erratum occurs, the PCIe end point may
record correctable errors resulting in either a NAK or link recovery.
Intel® has not observed any functional impact due to this erratum."

But I'm really unsure if it describes my issue.

Do you think I have some broken hardware and I should replace the CPU
and/or the motherboard (where the r8169 is soldered)?  I've noticed that
my 13900K is almost impossible to overclock (despite it's a K), but I've
not encountered any issue other than these AER reporting so far after I
gave up overclocking.

Seems there are two r8169 nics on your board, only :06:00.0 reports
aer errors, how about another one the :07:00.0 nic ?

It never happens to :07:00.0, even if I plug the ethernet cable into
it instead of :06:00.0.


So something is wrong with the physical layer, I guess.



Maybe I should just use :07:00.0 and blacklist :06:00.0 as I
don't need two NICs?


Yup,
rate-limiting the AER warning is another choice, instead of changing WARN to
INFO. If a flood of corrected errors happens, even while the function is
working, it suggests something is already wrong and will likely get worse.
That is the meaning of WARN, I think.


Thanks,
Ethan





Re: [PATCHv3 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO

2024-03-25 Thread Xi Ruoyao
On Mon, 2024-03-25 at 18:15 +0800, Xi Ruoyao wrote:
> On Mon, 2024-03-25 at 16:45 +0800, Ethan Zhao wrote:
> > On 3/25/2024 1:19 AM, Xi Ruoyao wrote:
> > > On Mon, 2023-09-18 at 14:39 -0500, Bjorn Helgaas wrote:
> > > > On Mon, Sep 18, 2023 at 07:42:30PM +0800, Xi Ruoyao wrote:
> > > > > ...
> > > > > My workstation suffers from too much correctable AER reporting as well
> > > > > (related to Intel's errata "RPL013: Incorrectly Formed PCIe Packets 
> > > > > May
> > > > > Generate Correctable Errors" and/or the motherboard design, I guess).
> > > > We should rate-limit correctable error reporting so it's not
> > > > overwhelming.
> > > > 
> > > > At the same time, I'm *also* interested in the cause of these errors,
> > > > in case there's a Linux defect or a hardware erratum that we can work
> > > > around.  Do you have a bug report with any more details, e.g., a dmesg
> > > > log and "sudo lspci -vv" output?
> > > Hi Bjorn,
> > > 
> > > Sorry for the *very* late reply (somehow I didn't see the reply at all
> > > before it was removed by my cron job, and now I just salvaged it from
> > > lore.kernel.org...)
> > > 
> > > The dmesg is like:
> > > 
> > > [  882.456994] pcieport :00:1c.1: AER: Multiple Correctable error 
> > > message received from :00:1c.1
> > > [  882.457002] pcieport :00:1c.1: AER: found no error details for 
> > > :00:1c.1
> > > [  882.457003] pcieport :00:1c.1: AER: Multiple Correctable error 
> > > message received from :06:00.0
> > > [  883.545763] pcieport :00:1c.1: AER: Multiple Correctable error 
> > > message received from :00:1c.1
> > > [  883.545789] pcieport :00:1c.1: PCIe Bus Error: 
> > > severity=Correctable, type=Physical Layer, (Receiver ID)
> > > [  883.545790] pcieport :00:1c.1:   device [8086:7a39] error 
> > > status/mask=0001/2000
> > > [  883.545792] pcieport :00:1c.1:    [ 0] RxErr  
> > > (First)
> > > [  883.545794] pcieport :00:1c.1: AER:   Error of this Agent is 
> > > reported first
> > > [  883.545798] r8169 :06:00.0: PCIe Bus Error: severity=Correctable, 
> > > type=Physical Layer, (Transmitter ID)
> > > [  883.545799] r8169 :06:00.0:   device [10ec:8125] error 
> > > status/mask=1101/e000
> > > [  883.545800] r8169 :06:00.0:    [ 0] RxErr  (First)
> > > [  883.545801] r8169 :06:00.0:    [ 8] Rollover
> > > [  883.545802] r8169 :06:00.0:    [12] Timeout
> > > [  883.545815] pcieport :00:1c.1: AER: Correctable error message 
> > > received from :00:1c.1
> > > [  883.545823] pcieport :00:1c.1: AER: found no error details for 
> > > :00:1c.1
> > > [  883.545824] pcieport :00:1c.1: AER: Multiple Correctable error 
> > > message received from :06:00.0
> > > 
> > > lspci output attached.
> > > 
> > > Intel has issued an errata "RPL013" saying:
> > > 
> > > "Under complex microarchitectural conditions, the PCIe controller may
> > > transmit an incorrectly formed Transaction Layer Packet (TLP), which
> > > will fail CRC checks. When this erratum occurs, the PCIe end point may
> > > record correctable errors resulting in either a NAK or link recovery.
> > > Intel® has not observed any functional impact due to this erratum."
> > > 
> > > But I'm really unsure if it describes my issue.
> > > 
> > > Do you think I have some broken hardware and I should replace the CPU
> > > and/or the motherboard (where the r8169 is soldered)?  I've noticed that
> > > my 13900K is almost impossible to overclock (despite it's a K), but I've
> > > not encountered any issue other than these AER reporting so far after I
> > > gave up overclocking.
> > 
> > Seems there are two r8169 nics on your board, only :06:00.0 reports
> > aer errors, how about another one the :07:00.0 nic ?
> 
> It never happens to :07:00.0, even if I plug the ethernet cable into
> it instead of :06:00.0.
> 
> Maybe I should just use :07:00.0 and blacklist :06:00.0 as I
> don't need two NICs?

Plugging the ethernet cable into :07:00.0 and then
"echo 1 > /sys/bus/pci/devices/:00:1c.1/remove" works for me...

-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University
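
The sysfs removal described in the message above can also be done from a
small program. This is only a sketch under stated assumptions: the archived
text does not show the PCI domain, so the usage note assumes domain 0000, and
pci_remove_path() and pci_remove() are hypothetical helper names. Writing to
the remove file needs root and detaches the device until the next rescan or
reboot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the sysfs "remove" path for a PCI function.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int pci_remove_path(char *out, size_t len, unsigned int domain,
			   unsigned int bus, unsigned int dev, unsigned int fn)
{
	int n;

	assert(out != NULL && len > 0);
	n = snprintf(out, len, "/sys/bus/pci/devices/%04x:%02x:%02x.%x/remove",
		     domain, bus, dev, fn);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write "1" to the given remove file. Returns 0 on success, -1 on error. */
static int pci_remove(const char *path)
{
	static const char one[] = "1";
	FILE *f = fopen(path, "w");
	size_t written;

	if (!f)
		return -1;
	written = fwrite(one, 1, strlen(one), f);
	if (fclose(f) != 0)
		return -1;
	return written == strlen(one) ? 0 : -1;
}
```

Calling pci_remove_path(path, sizeof(path), 0, 0x00, 0x1c, 1) produces
"/sys/bus/pci/devices/0000:00:1c.1/remove", which can then be passed to
pci_remove().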


Re: [PATCHv3 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO

2024-03-25 Thread Ethan Zhao

On 3/25/2024 1:19 AM, Xi Ruoyao wrote:

On Mon, 2023-09-18 at 14:39 -0500, Bjorn Helgaas wrote:

On Mon, Sep 18, 2023 at 07:42:30PM +0800, Xi Ruoyao wrote:

...
My workstation suffers from too much correctable AER reporting as well
(related to Intel's errata "RPL013: Incorrectly Formed PCIe Packets May
Generate Correctable Errors" and/or the motherboard design, I guess).

We should rate-limit correctable error reporting so it's not
overwhelming.

At the same time, I'm *also* interested in the cause of these errors,
in case there's a Linux defect or a hardware erratum that we can work
around.  Do you have a bug report with any more details, e.g., a dmesg
log and "sudo lspci -vv" output?

Hi Bjorn,

Sorry for the *very* late reply (somehow I didn't see the reply at all
before it was removed by my cron job, and now I just salvaged it from
lore.kernel.org...)

The dmesg is like:

[  882.456994] pcieport :00:1c.1: AER: Multiple Correctable error message 
received from :00:1c.1
[  882.457002] pcieport :00:1c.1: AER: found no error details for 
:00:1c.1
[  882.457003] pcieport :00:1c.1: AER: Multiple Correctable error message 
received from :06:00.0
[  883.545763] pcieport :00:1c.1: AER: Multiple Correctable error message 
received from :00:1c.1
[  883.545789] pcieport :00:1c.1: PCIe Bus Error: severity=Correctable, 
type=Physical Layer, (Receiver ID)
[  883.545790] pcieport :00:1c.1:   device [8086:7a39] error 
status/mask=0001/2000
[  883.545792] pcieport :00:1c.1:[ 0] RxErr  (First)
[  883.545794] pcieport :00:1c.1: AER:   Error of this Agent is reported 
first
[  883.545798] r8169 :06:00.0: PCIe Bus Error: severity=Correctable, 
type=Physical Layer, (Transmitter ID)
[  883.545799] r8169 :06:00.0:   device [10ec:8125] error 
status/mask=1101/e000
[  883.545800] r8169 :06:00.0:[ 0] RxErr  (First)
[  883.545801] r8169 :06:00.0:[ 8] Rollover
[  883.545802] r8169 :06:00.0:[12] Timeout
[  883.545815] pcieport :00:1c.1: AER: Correctable error message received 
from :00:1c.1
[  883.545823] pcieport :00:1c.1: AER: found no error details for 
:00:1c.1
[  883.545824] pcieport :00:1c.1: AER: Multiple Correctable error message 
received from :06:00.0

lspci output attached.

Intel has issued an errata "RPL013" saying:

"Under complex microarchitectural conditions, the PCIe controller may
transmit an incorrectly formed Transaction Layer Packet (TLP), which
will fail CRC checks. When this erratum occurs, the PCIe end point may
record correctable errors resulting in either a NAK or link recovery.
Intel® has not observed any functional impact due to this erratum."

But I'm really unsure if it describes my issue.

Do you think I have some broken hardware and I should replace the CPU
and/or the motherboard (where the r8169 is soldered)?  I've noticed that
my 13900K is almost impossible to overclock (despite it's a K), but I've
not encountered any issue other than these AER reporting so far after I
gave up overclocking.


It seems there are two r8169 NICs on your board, and only :06:00.0 reports
AER errors. How about the other one, the :07:00.0 NIC?


Thanks,
Ethan



Re: [PATCHv3 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO

2024-03-25 Thread Xi Ruoyao
On Mon, 2024-03-25 at 16:45 +0800, Ethan Zhao wrote:
> On 3/25/2024 1:19 AM, Xi Ruoyao wrote:
> > On Mon, 2023-09-18 at 14:39 -0500, Bjorn Helgaas wrote:
> > > On Mon, Sep 18, 2023 at 07:42:30PM +0800, Xi Ruoyao wrote:
> > > > ...
> > > > My workstation suffers from too much correctable AER reporting as well
> > > > (related to Intel's errata "RPL013: Incorrectly Formed PCIe Packets May
> > > > Generate Correctable Errors" and/or the motherboard design, I guess).
> > > We should rate-limit correctable error reporting so it's not
> > > overwhelming.
> > > 
> > > At the same time, I'm *also* interested in the cause of these errors,
> > > in case there's a Linux defect or a hardware erratum that we can work
> > > around.  Do you have a bug report with any more details, e.g., a dmesg
> > > log and "sudo lspci -vv" output?
> > Hi Bjorn,
> > 
> > Sorry for the *very* late reply (somehow I didn't see the reply at all
> > before it was removed by my cron job, and now I just salvaged it from
> > lore.kernel.org...)
> > 
> > The dmesg is like:
> > 
> > [  882.456994] pcieport :00:1c.1: AER: Multiple Correctable error 
> > message received from :00:1c.1
> > [  882.457002] pcieport :00:1c.1: AER: found no error details for 
> > :00:1c.1
> > [  882.457003] pcieport :00:1c.1: AER: Multiple Correctable error 
> > message received from :06:00.0
> > [  883.545763] pcieport :00:1c.1: AER: Multiple Correctable error 
> > message received from :00:1c.1
> > [  883.545789] pcieport :00:1c.1: PCIe Bus Error: severity=Correctable, 
> > type=Physical Layer, (Receiver ID)
> > [  883.545790] pcieport :00:1c.1:   device [8086:7a39] error 
> > status/mask=0001/2000
> > [  883.545792] pcieport :00:1c.1:    [ 0] RxErr  (First)
> > [  883.545794] pcieport :00:1c.1: AER:   Error of this Agent is 
> > reported first
> > [  883.545798] r8169 :06:00.0: PCIe Bus Error: severity=Correctable, 
> > type=Physical Layer, (Transmitter ID)
> > [  883.545799] r8169 :06:00.0:   device [10ec:8125] error 
> > status/mask=1101/e000
> > [  883.545800] r8169 :06:00.0:    [ 0] RxErr  (First)
> > [  883.545801] r8169 :06:00.0:    [ 8] Rollover
> > [  883.545802] r8169 :06:00.0:    [12] Timeout
> > [  883.545815] pcieport :00:1c.1: AER: Correctable error message 
> > received from :00:1c.1
> > [  883.545823] pcieport :00:1c.1: AER: found no error details for 
> > :00:1c.1
> > [  883.545824] pcieport :00:1c.1: AER: Multiple Correctable error 
> > message received from :06:00.0
> > 
> > lspci output attached.
> > 
> > Intel has issued an errata "RPL013" saying:
> > 
> > "Under complex microarchitectural conditions, the PCIe controller may
> > transmit an incorrectly formed Transaction Layer Packet (TLP), which
> > will fail CRC checks. When this erratum occurs, the PCIe end point may
> > record correctable errors resulting in either a NAK or link recovery.
> > Intel® has not observed any functional impact due to this erratum."
> > 
> > But I'm really unsure if it describes my issue.
> > 
> > Do you think I have some broken hardware and I should replace the CPU
> > and/or the motherboard (where the r8169 is soldered)?  I've noticed that
> > my 13900K is almost impossible to overclock (despite it's a K), but I've
> > not encountered any issue other than these AER reporting so far after I
> > gave up overclocking.
> 
> Seems there are two r8169 nics on your board, only :06:00.0 reports
> aer errors, how about another one the :07:00.0 nic ?

It never happens to :07:00.0, even if I plug the ethernet cable into
it instead of :06:00.0.

Maybe I should just use :07:00.0 and blacklist :06:00.0 as I
don't need two NICs?

-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University


Re: [PATCHv3 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO

2024-03-24 Thread Xi Ruoyao
On Mon, 2023-09-18 at 14:39 -0500, Bjorn Helgaas wrote:
> On Mon, Sep 18, 2023 at 07:42:30PM +0800, Xi Ruoyao wrote:
> > ...
> 
> > My workstation suffers from too much correctable AER reporting as well
> > (related to Intel's errata "RPL013: Incorrectly Formed PCIe Packets May
> > Generate Correctable Errors" and/or the motherboard design, I guess).
> 
> We should rate-limit correctable error reporting so it's not
> overwhelming.
> 
> At the same time, I'm *also* interested in the cause of these errors,
> in case there's a Linux defect or a hardware erratum that we can work
> around.  Do you have a bug report with any more details, e.g., a dmesg
> log and "sudo lspci -vv" output?

Hi Bjorn,

Sorry for the *very* late reply (somehow I didn't see the reply at all
before it was removed by my cron job, and now I just salvaged it from
lore.kernel.org...)

The dmesg is like:

[  882.456994] pcieport :00:1c.1: AER: Multiple Correctable error message 
received from :00:1c.1
[  882.457002] pcieport :00:1c.1: AER: found no error details for 
:00:1c.1
[  882.457003] pcieport :00:1c.1: AER: Multiple Correctable error message 
received from :06:00.0
[  883.545763] pcieport :00:1c.1: AER: Multiple Correctable error message 
received from :00:1c.1
[  883.545789] pcieport :00:1c.1: PCIe Bus Error: severity=Correctable, 
type=Physical Layer, (Receiver ID)
[  883.545790] pcieport :00:1c.1:   device [8086:7a39] error 
status/mask=0001/2000
[  883.545792] pcieport :00:1c.1:[ 0] RxErr  (First)
[  883.545794] pcieport :00:1c.1: AER:   Error of this Agent is reported 
first
[  883.545798] r8169 :06:00.0: PCIe Bus Error: severity=Correctable, 
type=Physical Layer, (Transmitter ID)
[  883.545799] r8169 :06:00.0:   device [10ec:8125] error 
status/mask=1101/e000
[  883.545800] r8169 :06:00.0:[ 0] RxErr  (First)
[  883.545801] r8169 :06:00.0:[ 8] Rollover  
[  883.545802] r8169 :06:00.0:[12] Timeout   
[  883.545815] pcieport :00:1c.1: AER: Correctable error message received 
from :00:1c.1
[  883.545823] pcieport :00:1c.1: AER: found no error details for 
:00:1c.1
[  883.545824] pcieport :00:1c.1: AER: Multiple Correctable error message 
received from :06:00.0

lspci output attached.

Intel has issued an erratum, "RPL013", saying:

"Under complex microarchitectural conditions, the PCIe controller may
transmit an incorrectly formed Transaction Layer Packet (TLP), which
will fail CRC checks. When this erratum occurs, the PCIe end point may
record correctable errors resulting in either a NAK or link recovery.
Intel® has not observed any functional impact due to this erratum."

But I'm really unsure if it describes my issue.

Do you think I have some broken hardware and I should replace the CPU
and/or the motherboard (where the r8169 is soldered)?  I've noticed that
my 13900K is almost impossible to overclock (despite being a K part), but
I've not encountered any issue other than this AER reporting since I gave
up overclocking.

-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University
00:00.0 Host bridge: Intel Corporation Device a700 (rev 01)
DeviceName: Onboard - Other
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- SERR- TAbort- SERR- 
Capabilities: [70] Express (v2) Root Complex Integrated Endpoint, 
IntMsgNum 0
DevCap: MaxPayload 128 bytes, PhantFunc 0
ExtTag- RBE+ FLReset+
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
MaxPayload 128 bytes, MaxReadReq 128 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- 
TransPend-
DevCap2: Completion Timeout: Not Supported, TimeoutDis- 
NROPrPrP- LTR-
 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- 
EETLPPrefix-
 EmergencyPowerReduction Not Supported, 
EmergencyPowerReductionInit-
 FRS-
 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
 AtomicOpsCtl: ReqEn-
 IDOReq- IDOCompl- LTR- EmergencyPowerReductionReq-
 10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit-
Address: fee00018  Data: 
Masking:   Pending: 
Capabilities: [d0] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA 
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 

Re: [PATCHv3 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO

2023-09-18 Thread Bjorn Helgaas
On Mon, Sep 18, 2023 at 07:42:30PM +0800, Xi Ruoyao wrote:
> ...

> My workstation suffers from too much correctable AER reporting as well
> (related to Intel's errata "RPL013: Incorrectly Formed PCIe Packets May
> Generate Correctable Errors" and/or the motherboard design, I guess).

We should rate-limit correctable error reporting so it's not
overwhelming.

At the same time, I'm *also* interested in the cause of these errors,
in case there's a Linux defect or a hardware erratum that we can work
around.  Do you have a bug report with any more details, e.g., a dmesg
log and "sudo lspci -vv" output?

Bjorn


Re: [PATCHv3 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO

2023-09-18 Thread Grant Grundler
On Mon, Sep 18, 2023 at 4:42 AM Xi Ruoyao  wrote:
>
> On Mon, 2023-08-14 at 08:40 -0700, Grant Grundler wrote:
> > On Sat, Aug 12, 2023 at 5:45 PM David Heidelberg 
> > wrote:
> > >
> > > Tested-by: David Heidelberg 
> >
> > Thanks David!
> >
> > > For PATCH v4 please fix the typo reported by the bot :)
> >
> > Sorry - I'll do that today.
>
> Hi Grant,
>
> Is there an update of this series?

Sorry, while I had good intentions, my work has completely derailed my
attempts to make time for this. :(

I'll give this another run.

I'm also not offended if someone else picks this up and improves the situation.

cheers,
grant

>
> My workstation suffers from too much correctable AER reporting as well
> (related to Intel's errata "RPL013: Incorrectly Formed PCIe Packets May
> Generate Correctable Errors" and/or the motherboard design, I guess).
>
> --
> Xi Ruoyao 
> School of Aerospace Science and Technology, Xidian University


Re: [PATCHv3 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO

2023-09-18 Thread Xi Ruoyao
On Mon, 2023-08-14 at 08:40 -0700, Grant Grundler wrote:
> On Sat, Aug 12, 2023 at 5:45 PM David Heidelberg 
> wrote:
> > 
> > Tested-by: David Heidelberg 
> 
> Thanks David!
> 
> > For PATCH v4 please fix the typo reported by the bot :)
> 
> Sorry - I'll do that today.

Hi Grant,

Is there an update of this series?

My workstation suffers from too much correctable AER reporting as well
(related to Intel's errata "RPL013: Incorrectly Formed PCIe Packets May
Generate Correctable Errors" and/or the motherboard design, I guess).

-- 
Xi Ruoyao 
School of Aerospace Science and Technology, Xidian University


Re: [PATCHv3 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO

2023-08-14 Thread Grant Grundler
On Sat, Aug 12, 2023 at 5:45 PM David Heidelberg  wrote:
>
> Tested-by: David Heidelberg 

Thanks David!

> For PATCH v4 please fix the typo reported by the bot :)

Sorry - I'll do that today.

> Seeing messages as
>
> __aer_print_error: 72 callbacks suppressed
>
> but it still prints many errors on my laptop. Anyway, the log is less
> filled with this patch, so great!

Awesome! That's what I'm hoping for. :)

cheers,
grant

>
>
> Thank you
> David
>
> --
> David Heidelberg
> Certified Linux Magician
>


Re: [PATCHv3 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO

2023-08-12 Thread David Heidelberg

Tested-by: David Heidelberg 

For PATCH v4 please fix the typo reported by the bot :)

I'm seeing messages such as

__aer_print_error: 72 callbacks suppressed

but it still prints many errors on my laptop. Anyway, the log is less 
filled with this patch, so great!
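
A "callbacks suppressed" line is what an interval/burst rate limiter reports
after it has dropped messages. A rough userspace model of that idea is
sketched below. It is not the kernel's implementation; the struct, the
function name, and the tick values in the usage note are assumptions chosen
for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model of an interval/burst rate limiter. */
struct ratelimit {
	uint64_t interval;	/* window length, in caller-defined ticks */
	unsigned int burst;	/* messages allowed per window */
	uint64_t begin;		/* start tick of the current window */
	unsigned int printed;	/* messages allowed so far in this window */
	unsigned int missed;	/* messages dropped so far in this window */
};

/*
 * Returns 1 if a message at time `now` may be printed, 0 if it is dropped.
 * When a new window opens, the number of messages dropped in the previous
 * window is stored in *suppressed so the caller can report it.
 */
static int ratelimit_allow(struct ratelimit *rl, uint64_t now,
			   unsigned int *suppressed)
{
	assert(rl != NULL && suppressed != NULL);
	*suppressed = 0;

	if (now - rl->begin >= rl->interval) {
		/* Window expired: hand back the drop count and start over. */
		*suppressed = rl->missed;
		rl->begin = now;
		rl->printed = 0;
		rl->missed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return 1;
	}
	rl->missed++;
	return 0;
}
```

With burst = 10 and interval = 5 ticks, 82 messages arriving inside one
window result in 10 printed and 72 dropped. The first message of the next
window then reports those 72 as suppressed, which is why errors can keep
appearing in the log even though most of them are being dropped.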


Thank you
David

--
David Heidelberg
Certified Linux Magician



[PATCHv3 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO

2023-06-05 Thread Grant Grundler
Since correctable errors have been corrected (and counted), the dmesg output
should not be reported as a warning, but rather as "informational".

Otherwise, using a certain well known vendor's PCIe parts in a USB4 docking
station, the dmesg buffer can be spammed with correctable errors, 717 bytes
per instance, potentially many MB per day.

Given the "WARN" priority, these messages have already confused the typical
user that stumbles across them, support staff (triaging feedback reports),
and more than a few linux kernel devs. Changing to INFO will hide these
messages from most audiences.

Signed-off-by: Grant Grundler 
---
 drivers/pci/pcie/aer.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index f6c24ded134c..d7bfc6070ddb 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -692,7 +692,7 @@ static void __aer_print_error(struct pci_dev *dev,
 
 	if (info->severity == AER_CORRECTABLE) {
 		strings = aer_correctable_error_string;
-		level = KERN_WARNING;
+		level = KERN_INFO;
 	} else {
 		strings = aer_uncorrectable_error_string;
 		level = KERN_ERR;
@@ -724,7 +724,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
 	agent = AER_GET_AGENT(info->severity, info->status);
 
-	level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
+	level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR;
 
 	pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
 		   aer_error_severity_string[info->severity],
@@ -797,14 +797,22 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
 	info.mask = mask;
 	info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
 
-	pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
+	if (aer_severity == AER_CORRECTABLE)
+		pci_info(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
+	else
+		pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
+
 	__aer_print_error(dev, &info);
-	pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
-		aer_error_layer[layer], aer_agent_string[agent]);
 
-	if (aer_severity != AER_CORRECTABLE)
+	if (aer_severity == AER_CORRECTABLE) {
+		pci_info(dev, "aer_layer=%s, aer_agent=%s\n",
+			aer_error_layer[layer], aer_agent_string[agent]);
+	} else {
+		pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
+			aer_error_layer[layer], aer_agent_string[agent]);
 		pci_err(dev, "aer_uncor_severity: 0x%08x\n",
 			aer->uncor_severity);
+	}
 
 	if (tlp_header_valid)
 		__print_tlp_header(dev, &aer->header_log);
-- 
2.41.0.rc0.172.g3f132b7071-goog