Re: pcieport AER error spam on Intel Skylake

2016-08-05 Thread Alexander Duyck
On Fri, Aug 5, 2016 at 11:15 AM, Daniel Drake  wrote:
> Hi Alexander,
>
> Reviving an old topic here...
>
> We are seeing this "problem" on an increasing number of units from the
> vendor, and searching around it can also be seen on Dell and HP
> products. Always with the same Realtek b723 wifi device. e.g.
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173
>
> The amount of error spam is problematic in that it slows down boot
> really significantly, while printing lots of scary messages for the
> user.
> We tried doing a PCI MSI blacklist for affected laptops but we are
> struggling to keep that blacklist updated with the increasing number
> of affected models.
>
> Enough hacks, I am wondering what we can do to solve this problem in
> the mainline kernel...
>
> On Thu, Sep 3, 2015 at 12:05 PM, Alexander Duyck
>  wrote:
>> On 09/03/2015 06:32 AM, Daniel Drake wrote:
>>>
>>> On Wed, Sep 2, 2015 at 7:57 PM, Alexander Duyck
>>>  wrote:

>>>> Since it is correctable errors it is likely some sort of signalling
>>>> issue.
>>>> Could we get the output of something like an lspci -vt? Then you would be
>>>> able to tell what the device is on the other side of the link from
>>>> 00:1c.5
>>>> and then we could probably check to see if there has been any changes for
>>>> the device driver on the other end of the link.
>>>
>>> "lspci -vt" reliably causes one occurrence of the message, which is
>>> logged by the kernel before lspci has written anything to stdout.
>>>   pcieport :00:1c.5: AER: Corrected error received: id=00e5
>>>   pcieport :00:1c.5: PCIe Bus Error: severity=Corrected,
>>> type=Physical Layer, id=00e5(Receiver ID)
>>>   pcieport :00:1c.5:   device [8086:9d15] error
>>> status/mask=0001/2000
>>>   pcieport :00:1c.5:[ 0] Receiver Error
>>>
>>> -[:00]-+-00.0  Intel Corporation Device 1904
>>> +-02.0  Intel Corporation Device 1916
>>> +-04.0  Intel Corporation Device 1903
>>> +-08.0  Intel Corporation Device 1911
>>> +-14.0  Intel Corporation Device 9d2f
>>> +-14.2  Intel Corporation Device 9d31
>>> +-15.0  Intel Corporation Device 9d60
>>> +-15.1  Intel Corporation Device 9d61
>>> +-16.0  Intel Corporation Device 9d3a
>>> +-17.0  Intel Corporation Device 9d03
>>> +-1c.0-[01]--
>>> +-1c.4-[02]00.0  Realtek Semiconductor Co., Ltd.
>>> RTL8111/8168 PCI Express Gigabit Ethernet controller
>>> +-1c.5-[03]00.0  Realtek Semiconductor Co., Ltd. Device
>>> b723
>>> +-1f.0  Intel Corporation Device 9d48
>>> +-1f.2  Intel Corporation Device 9d21
>>> +-1f.3  Intel Corporation Device 9d70
>>> \-1f.4  Intel Corporation Device 9d23
>>>
>>> Does this mean these messages are somehow related to the Realtek b723
>>> device? That is the wifi card.
>>> Using x86_64_defconfig there is not even any driver loaded for this
>>> device, yet the messages appear quite a bit.
>>> If I use a full config with all the relevant drivers including
>>> rtlwifi, the frequency of these messages goes up a lot though.
>>
>>
>> The correctable errors are likely a result of some sort of link error
>> between the root port 00:1c.5 and the wireless adapter at 3:00.0.  What is
>> likely happening is that when the device is unused it transitions down to a
>> lower power link state like L0s or L1, and when it comes out of that state
>> it is likely triggering the PCIe error most likely as a result of something
>> during the PCIe link training sequence.
>>
>> You might want to notify the manufacturer of the laptop as they may need to
>> address an issue in their hardware, firmware, or possibly add  a workaround
>> to mask off Receiver Error reporting for their part via either a PCIe quirk
>> or driver fix.
>>
>>>> My suspicion since this is a laptop is that something like a power
>>>> management change might be responsible if this is a regression as I have
>>>> seen messages like this pop up as a result of ASPM being enabled before.
>>>
>>> It's likely not a regression, this is brand new hardware and this
>>> message is seen on all kernels that we have tried (4.1, 4.2, master).
>>> pcie_aspm=off also makes these messages go away.
>>
>>
>> Correctable errors are considered a sign of the PCIe link health. In theory
>> they can be ignored since by definition they can be corrected by the
>> hardware.  One thing you could do if you aren't using the wireless card
>> would be to simply switch off the correctable error reporting by setting the
>> mask bit for it in configuration space using setpci.
>>
>> To do that what you could do is find the offset for the PCIe AER
>> configuration register for your port by doing a "lspci -vvv -s 0:1c.5" and
>> what you should get will be a dump listing the capabilities and their
>> current settings.  In there you should find a line like:
>> Capabilities: [148 v1] Advanced 

Re: pcieport AER error spam on Intel Skylake

2016-08-05 Thread Bjorn Helgaas
On Fri, Aug 05, 2016 at 12:15:53PM -0600, Daniel Drake wrote:
> Hi Alexander,
> 
> Reviving an old topic here...
> 
> We are seeing this "problem" on an increasing number of units from the
> vendor, and searching around it can also be seen on Dell and HP
> products. Always with the same Realtek b723 wifi device. e.g.
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173
> 
> The amount of error spam is problematic in that it slows down boot
> really significantly, while printing lots of scary messages for the
> user.
> We tried doing a PCI MSI blacklist for affected laptops but we are
> struggling to keep that blacklist updated with the increasing number
> of affected models.
> 
> Enough hacks, I am wondering what we can do to solve this problem in
> the mainline kernel...

I think this is a bug in AER:
https://bugzilla.kernel.org/show_bug.cgi?id=109691

I think I understand the problem, but I haven't had time to fix it.
The bugzilla has a pointer to more details, and it would be awesome if
somebody would jump in.

> On Thu, Sep 3, 2015 at 12:05 PM, Alexander Duyck
>  wrote:
> > On 09/03/2015 06:32 AM, Daniel Drake wrote:
> >>
> >> On Wed, Sep 2, 2015 at 7:57 PM, Alexander Duyck
> >>  wrote:
> >>>
> >>> Since it is correctable errors it is likely some sort of signalling
> >>> issue.
> >>> Could we get the output of something like an lspci -vt? Then you would be
> >>> able to tell what the device is on the other side of the link from
> >>> 00:1c.5
> >>> and then we could probably check to see if there has been any changes for
> >>> the device driver on the other end of the link.
> >>
> >> "lspci -vt" reliably causes one occurrence of the message, which is
> >> logged by the kernel before lspci has written anything to stdout.
> >>   pcieport :00:1c.5: AER: Corrected error received: id=00e5
> >>   pcieport :00:1c.5: PCIe Bus Error: severity=Corrected,
> >> type=Physical Layer, id=00e5(Receiver ID)
> >>   pcieport :00:1c.5:   device [8086:9d15] error
> >> status/mask=0001/2000
> >>   pcieport :00:1c.5:[ 0] Receiver Error
> >>
> >> -[:00]-+-00.0  Intel Corporation Device 1904
> >> +-02.0  Intel Corporation Device 1916
> >> +-04.0  Intel Corporation Device 1903
> >> +-08.0  Intel Corporation Device 1911
> >> +-14.0  Intel Corporation Device 9d2f
> >> +-14.2  Intel Corporation Device 9d31
> >> +-15.0  Intel Corporation Device 9d60
> >> +-15.1  Intel Corporation Device 9d61
> >> +-16.0  Intel Corporation Device 9d3a
> >> +-17.0  Intel Corporation Device 9d03
> >> +-1c.0-[01]--
> >> +-1c.4-[02]00.0  Realtek Semiconductor Co., Ltd.
> >> RTL8111/8168 PCI Express Gigabit Ethernet controller
> >> +-1c.5-[03]00.0  Realtek Semiconductor Co., Ltd. Device
> >> b723
> >> +-1f.0  Intel Corporation Device 9d48
> >> +-1f.2  Intel Corporation Device 9d21
> >> +-1f.3  Intel Corporation Device 9d70
> >> \-1f.4  Intel Corporation Device 9d23
> >>
> >> Does this mean these messages are somehow related to the Realtek b723
> >> device? That is the wifi card.
> >> Using x86_64_defconfig there is not even any driver loaded for this
> >> device, yet the messages appear quite a bit.
> >> If I use a full config with all the relevant drivers including
> >> rtlwifi, the frequency of these messages goes up a lot though.
> >
> >
> > The correctable errors are likely a result of some sort of link error
> > between the root port 00:1c.5 and the wireless adapter at 3:00.0.  What is
> > likely happening is that when the device is unused it transitions down to a
> > lower power link state like L0s or L1, and when it comes out of that state
> > it is likely triggering the PCIe error most likely as a result of something
> > during the PCIe link training sequence.
> >
> > You might want to notify the manufacturer of the laptop as they may need to
> > address an issue in their hardware, firmware, or possibly add  a workaround
> > to mask off Receiver Error reporting for their part via either a PCIe quirk
> > or driver fix.
> >
> >>> My suspicion since this is a laptop is that something like a power
> >>> management change might be responsible if this is a regression as I have
> >>> seen messages like this pop up as a result of ASPM being enabled before.
> >>
> >> It's likely not a regression, this is brand new hardware and this
> >> message is seen on all kernels that we have tried (4.1, 4.2, master).
> >> pcie_aspm=off also makes these messages go away.
> >
> >
> > Correctable errors are considered a sign of the PCIe link health. In theory
> > they can be ignored since by definition they can be corrected by the
> > hardware.  One thing you could do if you aren't using the wireless card
> > would be to simply switch off the correctable error reporting by setting the
> > mask bit for it in configuration space using 

Re: pcieport AER error spam on Intel Skylake

2016-08-05 Thread Daniel Drake
Hi Alexander,

Reviving an old topic here...

We are seeing this "problem" on an increasing number of units from the
vendor, and searching around it can also be seen on Dell and HP
products. Always with the same Realtek b723 wifi device. e.g.
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173

The amount of error spam is problematic in that it slows down boot
really significantly, while printing lots of scary messages for the
user.
We tried doing a PCI MSI blacklist for affected laptops but we are
struggling to keep that blacklist updated with the increasing number
of affected models.

Enough hacks, I am wondering what we can do to solve this problem in
the mainline kernel...

On Thu, Sep 3, 2015 at 12:05 PM, Alexander Duyck
 wrote:
> On 09/03/2015 06:32 AM, Daniel Drake wrote:
>>
>> On Wed, Sep 2, 2015 at 7:57 PM, Alexander Duyck
>>  wrote:
>>>
>>> Since it is correctable errors it is likely some sort of signalling
>>> issue.
>>> Could we get the output of something like an lspci -vt? Then you would be
>>> able to tell what the device is on the other side of the link from
>>> 00:1c.5
>>> and then we could probably check to see if there has been any changes for
>>> the device driver on the other end of the link.
>>
>> "lspci -vt" reliably causes one occurrence of the message, which is
>> logged by the kernel before lspci has written anything to stdout.
>>   pcieport :00:1c.5: AER: Corrected error received: id=00e5
>>   pcieport :00:1c.5: PCIe Bus Error: severity=Corrected,
>> type=Physical Layer, id=00e5(Receiver ID)
>>   pcieport :00:1c.5:   device [8086:9d15] error
>> status/mask=0001/2000
>>   pcieport :00:1c.5:[ 0] Receiver Error
>>
>> -[:00]-+-00.0  Intel Corporation Device 1904
>> +-02.0  Intel Corporation Device 1916
>> +-04.0  Intel Corporation Device 1903
>> +-08.0  Intel Corporation Device 1911
>> +-14.0  Intel Corporation Device 9d2f
>> +-14.2  Intel Corporation Device 9d31
>> +-15.0  Intel Corporation Device 9d60
>> +-15.1  Intel Corporation Device 9d61
>> +-16.0  Intel Corporation Device 9d3a
>> +-17.0  Intel Corporation Device 9d03
>> +-1c.0-[01]--
>> +-1c.4-[02]00.0  Realtek Semiconductor Co., Ltd.
>> RTL8111/8168 PCI Express Gigabit Ethernet controller
>> +-1c.5-[03]00.0  Realtek Semiconductor Co., Ltd. Device
>> b723
>> +-1f.0  Intel Corporation Device 9d48
>> +-1f.2  Intel Corporation Device 9d21
>> +-1f.3  Intel Corporation Device 9d70
>> \-1f.4  Intel Corporation Device 9d23
>>
>> Does this mean these messages are somehow related to the Realtek b723
>> device? That is the wifi card.
>> Using x86_64_defconfig there is not even any driver loaded for this
>> device, yet the messages appear quite a bit.
>> If I use a full config with all the relevant drivers including
>> rtlwifi, the frequency of these messages goes up a lot though.
>
>
> The correctable errors are likely a result of some sort of link error
> between the root port 00:1c.5 and the wireless adapter at 3:00.0.  What is
> likely happening is that when the device is unused it transitions down to a
> lower power link state like L0s or L1, and when it comes out of that state
> it is likely triggering the PCIe error most likely as a result of something
> during the PCIe link training sequence.
>
> You might want to notify the manufacturer of the laptop as they may need to
> address an issue in their hardware, firmware, or possibly add  a workaround
> to mask off Receiver Error reporting for their part via either a PCIe quirk
> or driver fix.
>
>>> My suspicion since this is a laptop is that something like a power
>>> management change might be responsible if this is a regression as I have
>>> seen messages like this pop up as a result of ASPM being enabled before.
>>
>> It's likely not a regression, this is brand new hardware and this
>> message is seen on all kernels that we have tried (4.1, 4.2, master).
>> pcie_aspm=off also makes these messages go away.
>
>
> Correctable errors are considered a sign of the PCIe link health. In theory
> they can be ignored since by definition they can be corrected by the
> hardware.  One thing you could do if you aren't using the wireless card
> would be to simply switch off the correctable error reporting by setting the
> mask bit for it in configuration space using setpci.
>
> To do that what you could do is find the offset for the PCIe AER
> configuration register for your port by doing a "lspci -vvv -s 0:1c.5" and
> what you should get will be a dump listing the capabilities and their
> current settings.  In there you should find a line like:
> Capabilities: [148 v1] Advanced Error Reporting
>
> The 148 is the hex offset of the configuration space.  The Correctable Error
> mask is located at a hex offset of 0x14 from there.  So adding the hex
> 

Re: pcieport AER error spam on Intel Skylake

2015-09-03 Thread Alexander Duyck

On 09/03/2015 06:32 AM, Daniel Drake wrote:
> On Wed, Sep 2, 2015 at 7:57 PM, Alexander Duyck wrote:
>> Since it is correctable errors it is likely some sort of signalling issue.
>> Could we get the output of something like an lspci -vt? Then you would be
>> able to tell what the device is on the other side of the link from 00:1c.5
>> and then we could probably check to see if there has been any changes for
>> the device driver on the other end of the link.
>
> "lspci -vt" reliably causes one occurrence of the message, which is
> logged by the kernel before lspci has written anything to stdout.
>   pcieport :00:1c.5: AER: Corrected error received: id=00e5
>   pcieport :00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
>   pcieport :00:1c.5:   device [8086:9d15] error status/mask=0001/2000
>   pcieport :00:1c.5:[ 0] Receiver Error
>
> -[:00]-+-00.0  Intel Corporation Device 1904
>        +-02.0  Intel Corporation Device 1916
>        +-04.0  Intel Corporation Device 1903
>        +-08.0  Intel Corporation Device 1911
>        +-14.0  Intel Corporation Device 9d2f
>        +-14.2  Intel Corporation Device 9d31
>        +-15.0  Intel Corporation Device 9d60
>        +-15.1  Intel Corporation Device 9d61
>        +-16.0  Intel Corporation Device 9d3a
>        +-17.0  Intel Corporation Device 9d03
>        +-1c.0-[01]--
>        +-1c.4-[02]00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168 PCI Express Gigabit Ethernet controller
>        +-1c.5-[03]00.0  Realtek Semiconductor Co., Ltd. Device b723
>        +-1f.0  Intel Corporation Device 9d48
>        +-1f.2  Intel Corporation Device 9d21
>        +-1f.3  Intel Corporation Device 9d70
>        \-1f.4  Intel Corporation Device 9d23
>
> Does this mean these messages are somehow related to the Realtek b723
> device? That is the wifi card.
> Using x86_64_defconfig there is not even any driver loaded for this
> device, yet the messages appear quite a bit.
> If I use a full config with all the relevant drivers including
> rtlwifi, the frequency of these messages goes up a lot though.


The correctable errors are likely the result of some sort of link error
between the root port 00:1c.5 and the wireless adapter at 03:00.0.  What
is probably happening is that when the device is unused the link
transitions down to a lower power state like L0s or L1, and when it comes
out of that state it triggers the PCIe error, most likely as a result of
something during the PCIe link training sequence.
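
As a cross-check that the error is being logged against the root port itself, the id=00e5 field from the quoted log can be unpacked into bus, device and function. This is only a minimal Python sketch, assuming the usual 16-bit PCI requester ID layout (8-bit bus, 5-bit device, 3-bit function); the helper name is made up for illustration:

```python
def decode_requester_id(req_id):
    """Unpack a 16-bit PCI requester ID into bus:device.function notation."""
    bus = (req_id >> 8) & 0xFF       # upper 8 bits: bus number
    device = (req_id >> 3) & 0x1F    # next 5 bits: device number
    function = req_id & 0x07         # low 3 bits: function number
    return f"{bus:02x}:{device:02x}.{function:x}"


# id=00e5 is the value reported in the AER log quoted earlier in this message
print(decode_requester_id(0x00E5))  # prints 00:1c.5
```

The result matches the root port address 00:1c.5, which is consistent with the receiver error being detected on the root port side of the link.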


You might want to notify the manufacturer of the laptop, as they may need
to address an issue in their hardware or firmware, or possibly add a
workaround to mask off Receiver Error reporting for their part via
either a PCIe quirk or a driver fix.



>> My suspicion since this is a laptop is that something like a power
>> management change might be responsible if this is a regression as I have
>> seen messages like this pop up as a result of ASPM being enabled before.
>
> It's likely not a regression, this is brand new hardware and this
> message is seen on all kernels that we have tried (4.1, 4.2, master).
> pcie_aspm=off also makes these messages go away.
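
Since booting with pcie_aspm=off changes the behaviour, it can help to record which ASPM policy a given boot is actually using. A rough Python sketch follows, assuming the kernel exposes the policy in /sys/module/pcie_aspm/parameters/policy with the active entry in square brackets; the function name and the sample string are illustrative only:

```python
# Where the policy text is expected to come from on a running system
# (assumption: kernel built with ASPM support).
ASPM_POLICY_PATH = "/sys/module/pcie_aspm/parameters/policy"


def active_aspm_policy(policy_text):
    """Return the bracketed (currently active) entry from the policy text."""
    for word in policy_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError("no active policy marked in: " + repr(policy_text))


# Sample contents in the format the policy file uses; not captured from the laptop
sample = "default performance [powersave] powersupersave"
print(active_aspm_policy(sample))  # prints powersave
```

On a real machine the text would be read from ASPM_POLICY_PATH instead of the sample string.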


Correctable errors are an indicator of PCIe link health. In theory they
can be ignored since, by definition, the hardware corrects them. If you
aren't using the wireless card, one option is to simply switch off
reporting of this correctable error by setting its mask bit in
configuration space using setpci.

To do that, find the offset of the PCIe AER capability for your port by
running "lspci -vvv -s 0:1c.5". The output lists the capabilities and
their current settings, and should include a line like:

Capabilities: [148 v1] Advanced Error Reporting

The 148 is the hex offset of the AER capability in configuration space.
The Correctable Error Mask register is located at a hex offset of 0x14
from there, so adding the hex values 0x148 and 0x14 gives us 0x15C. To
disable reporting of correctable Receiver Errors, read the current value
with "setpci -s 0:1c.5 0x15C.l", set bit 0 (that is, add 1 if the bit is
not already set), and write the result back. For example, on my system
the read returned 2000, so I ended up with
"setpci -s 0:1c.5 0x15C.l=2001".
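
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, it does not touch the hardware, and it just builds the two setpci command strings from the capability offset and the value read back. It uses a bitwise OR rather than a plain add so that it stays correct if bit 0 is already set.

```python
# Illustrative helper names; only the offsets and bit position come from the text above.
AER_COR_MASK_OFFSET = 0x14   # Correctable Error Mask, relative to the AER capability
RECEIVER_ERROR_BIT = 0x1     # bit 0 of the mask register: Receiver Error


def cor_mask_register(aer_cap_offset):
    """Config-space address of the Correctable Error Mask register."""
    return aer_cap_offset + AER_COR_MASK_OFFSET


def with_receiver_error_masked(current_mask):
    """Return the mask value with the Receiver Error bit set."""
    return current_mask | RECEIVER_ERROR_BIT


def setpci_commands(device, aer_cap_offset, current_mask):
    """Build the read command and the matching write-back command."""
    reg = cor_mask_register(aer_cap_offset)
    new_mask = with_receiver_error_masked(current_mask)
    read_cmd = f"setpci -s {device} 0x{reg:X}.l"
    write_cmd = f"{read_cmd}={new_mask:x}"
    return read_cmd, write_cmd


if __name__ == "__main__":
    # Values from this thread: AER capability at 0x148, mask read back as 2000.
    for cmd in setpci_commands("0:1c.5", 0x148, 0x2000):
        print(cmd)
```

With the values from this thread it prints the same two commands quoted above. Note that a setting made this way is not persistent, so it would have to be reapplied after each boot.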


- Alex


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: pcieport AER error spam on Intel Skylake

2015-09-03 Thread Daniel Drake
On Wed, Sep 2, 2015 at 7:57 PM, Alexander Duyck wrote:
> Since it is correctable errors it is likely some sort of signalling issue.
> Could we get the output of something like an lspci -vt? Then you would be
> able to tell what the device is on the other side of the link from 00:1c.5
> and then we could probably check to see if there has been any changes for
> the device driver on the other end of the link.

"lspci -vt" reliably causes one occurrence of the message, which is
logged by the kernel before lspci has written anything to stdout.
 pcieport :00:1c.5: AER: Corrected error received: id=00e5
 pcieport :00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
 pcieport :00:1c.5:   device [8086:9d15] error status/mask=0001/2000
 pcieport :00:1c.5:[ 0] Receiver Error

-[:00]-+-00.0  Intel Corporation Device 1904
   +-02.0  Intel Corporation Device 1916
   +-04.0  Intel Corporation Device 1903
   +-08.0  Intel Corporation Device 1911
   +-14.0  Intel Corporation Device 9d2f
   +-14.2  Intel Corporation Device 9d31
   +-15.0  Intel Corporation Device 9d60
   +-15.1  Intel Corporation Device 9d61
   +-16.0  Intel Corporation Device 9d3a
   +-17.0  Intel Corporation Device 9d03
   +-1c.0-[01]--
   +-1c.4-[02]00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168 PCI Express Gigabit Ethernet controller
   +-1c.5-[03]00.0  Realtek Semiconductor Co., Ltd. Device b723
   +-1f.0  Intel Corporation Device 9d48
   +-1f.2  Intel Corporation Device 9d21
   +-1f.3  Intel Corporation Device 9d70
   \-1f.4  Intel Corporation Device 9d23

Does this mean these messages are somehow related to the Realtek b723
device? That is the wifi card.
Using x86_64_defconfig there is not even any driver loaded for this
device, yet the messages appear quite a bit.
If I use a full config with all the relevant drivers including
rtlwifi, the frequency of these messages goes up a lot though.

> My suspicion since this is a laptop is that something like a power
> management change might be responsible if this is a regression as I have
> seen messages like this pop up as a result of ASPM being enabled before.

It's likely not a regression, this is brand new hardware and this
message is seen on all kernels that we have tried (4.1, 4.2, master).
pcie_aspm=off also makes these messages go away.

Thanks
Daniel


Re: pcieport AER error spam on Intel Skylake

2015-09-02 Thread Alexander Duyck

On 09/02/2015 03:53 PM, Bjorn Helgaas wrote:

> On Wed, Sep 2, 2015 at 5:01 PM, Daniel Drake wrote:
>> Hi,
>>
>> Working with a sample for a new laptop based on Intel Skylake, the
>> kernel logs are full of these messages:
>>
>>  pcieport :00:1c.5: AER: Corrected error received: id=00e5
>>  pcieport :00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
>>  pcieport :00:1c.5:   device [8086:9d15] error status/mask=0001/2000
>>  pcieport :00:1c.5:[ 0] Receiver Error (First)
>>  pcieport :00:1c.5: AER: Corrected error received: id=00e5
>>  pcieport :00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
>>  pcieport :00:1c.5:   device [8086:9d15] error status/mask=0001/2000
>>  pcieport :00:1c.5:[ 0] Receiver Error (First)
>>  pcieport :00:1c.5: AER: Corrected error received: id=00e5
>>  pcieport :00:1c.5: can't find device of ID00e5
>>
>> Reproduced on 4.2 and on linus master as of today, using x86_64_defconfig.
>>
>> Apart from the log spam, there is no user-visible effect that I'm
>> aware of. Booting with pci=nomsi makes the messages go away.
>>
>> Any thoughts, is this something worth looking into in more detail?
>>
>> full dmesg: https://gist.github.com/dsd/1d7f738e917465edf2ae
>> lspci dump: https://gist.github.com/dsd/dc2481d64aadd520b0b3
>
> Thanks, Daniel, this is indeed really annoying and worth looking into.
> Do you happen to know whether it's a regression?  We haven't changed
> much in AER recently, but it's possible we broke something.
>
> Even if it's not a regression, the output seems a bit wordy and redundant.
>
> Bjorn


Since it is correctable errors it is likely some sort of signalling 
issue.  Could we get the output of something like an lspci -vt? Then you 
would be able to tell what the device is on the other side of the link 
from 00:1c.5 and then we could probably check to see if there has been 
any changes for the device driver on the other end of the link.


My suspicion since this is a laptop is that something like a power 
management change might be responsible if this is a regression as I have 
seen messages like this pop up as a result of ASPM being enabled before.


- Alex


Re: pcieport AER error spam on Intel Skylake

2015-09-02 Thread Bjorn Helgaas
On Wed, Sep 2, 2015 at 5:01 PM, Daniel Drake wrote:
> Hi,
>
> Working with a sample for a new laptop based on Intel Skylake, the
> kernel logs are full of these messages:
>
>  pcieport :00:1c.5: AER: Corrected error received: id=00e5
>  pcieport :00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
>  pcieport :00:1c.5:   device [8086:9d15] error status/mask=0001/2000
>  pcieport :00:1c.5:[ 0] Receiver Error (First)
>  pcieport :00:1c.5: AER: Corrected error received: id=00e5
>  pcieport :00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
>  pcieport :00:1c.5:   device [8086:9d15] error status/mask=0001/2000
>  pcieport :00:1c.5:[ 0] Receiver Error (First)
>  pcieport :00:1c.5: AER: Corrected error received: id=00e5
>  pcieport :00:1c.5: can't find device of ID00e5
>
> Reproduced on 4.2 and on linus master as of today, using x86_64_defconfig.
>
> Apart from the log spam, there is no user-visible effect that I'm
> aware of. Booting with pci=nomsi makes the messages go away.
>
> Any thoughts, is this something worth looking into in more detail?
>
> full dmesg: https://gist.github.com/dsd/1d7f738e917465edf2ae
> lspci dump: https://gist.github.com/dsd/dc2481d64aadd520b0b3

Thanks, Daniel, this is indeed really annoying and worth looking into.
Do you happen to know whether it's a regression?  We haven't changed
much in AER recently, but it's possible we broke something.

Even if it's not a regression, the output seems a bit wordy and redundant.

Bjorn


pcieport AER error spam on Intel Skylake

2015-09-02 Thread Daniel Drake
Hi,

Working with a sample for a new laptop based on Intel Skylake, the
kernel logs are full of these messages:

 pcieport :00:1c.5: AER: Corrected error received: id=00e5
 pcieport :00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
 pcieport :00:1c.5:   device [8086:9d15] error status/mask=0001/2000
 pcieport :00:1c.5:[ 0] Receiver Error (First)
 pcieport :00:1c.5: AER: Corrected error received: id=00e5
 pcieport :00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
 pcieport :00:1c.5:   device [8086:9d15] error status/mask=0001/2000
 pcieport :00:1c.5:[ 0] Receiver Error (First)
 pcieport :00:1c.5: AER: Corrected error received: id=00e5
 pcieport :00:1c.5: can't find device of ID00e5
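
For anyone reading these lines, here is a rough Python sketch of how the fields decode. It assumes the standard PCIe requester ID layout (bus in bits 15:8, device in bits 7:3, function in bits 2:0) and the correctable error bit positions from the AER capability. The function names are invented for this example, and the bit table lists only the commonly documented bits.

```python
# Correctable error status/mask bits (partial table, names as in the PCIe AER capability).
COR_ERROR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}


def decode_requester_id(req_id):
    """Split a 16-bit ID such as 0x00e5 into bus:device.function."""
    bus = (req_id >> 8) & 0xFF
    dev = (req_id >> 3) & 0x1F
    fn = req_id & 0x7
    return f"{bus:02x}:{dev:02x}.{fn}"


def decode_cor_bits(value):
    """List the names of the correctable error bits set in value."""
    return [name for bit, name in sorted(COR_ERROR_BITS.items())
            if value & (1 << bit)]


if __name__ == "__main__":
    print(decode_requester_id(0x00E5))   # the id= field from the log
    print(decode_cor_bits(0x0001))       # the status field
    print(decode_cor_bits(0x2000))       # the mask field
```

Under those assumptions, id=00e5 decodes to 00:1c.5, i.e. the root port itself. The status value has only the Receiver Error bit set, and the mask value has only the Advisory Non-Fatal Error bit set, so Receiver Errors are currently unmasked.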

Reproduced on 4.2 and on linus master as of today, using x86_64_defconfig.

Apart from the log spam, there is no user-visible effect that I'm
aware of. Booting with pci=nomsi makes the messages go away.

Any thoughts, is this something worth looking into in more detail?

full dmesg: https://gist.github.com/dsd/1d7f738e917465edf2ae
lspci dump: https://gist.github.com/dsd/dc2481d64aadd520b0b3

Thanks,
Daniel

