Re: Fwd: [WARNING AND ERROR] may be system slow and audio and video breaking

2020-10-18 Thread Borislav Petkov
On Mon, Oct 19, 2020 at 03:27:48AM +0530, Jeffrin Jose T wrote:
> On Sun, 2020-10-18 at 23:03 +0200, Borislav Petkov wrote:
> > On Mon, Oct 19, 2020 at 01:51:34AM +0530, Jeffrin Jose T wrote:
> > > On Sun, 2020-10-18 at 19:49 +0200, Borislav Petkov wrote:
> > > > On Sun, Oct 18, 2020 at 10:42:39PM +0530, Jeffrin Jose T wrote:
> > > > > smpboot: Scheduler frequency invariance went wobbly, disabling!
> > > > > [ 1112.592866] unchecked MSR access error: RDMSR from 0x123 at rIP:
> > > > > 0xb5c9a184 (native_read_msr+0x4/0x30)
> > 
> > Ok, you forgot to say in your initial mail that this happens when you
> > suspend your laptop.
> > 
> > Now, this unchecked MSR error thing happens only once because that early
> > during resume the microcode on CPU1 is not updated yet - and that needs
> > to be debugged separately and I'll try to reproduce that on my machine -
> > so the microcode is not updated yet and therefore the 0x123 MSR is not
> > "emulated" by the microcode, so to speak, thus the warning.
> > 
> > That warning doesn't happen anymore, though, once the microcode is
> > updated.
> > 
> > But what happens after that is you get a flood of correctable PCIe
> > errors about a transaction to a device timing out:
> > 
> > pcieport :00:1c.5: AER: Corrected error received: :00:1c.5
> > pcieport :00:1c.5: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> > pcieport :00:1c.5:   device [8086:9d15] error status/mask=1000/2000
> > pcieport :00:1c.5:[12] Timeout
> > 
> > and it looks like that flood is slowing down the machine because it is
> > busy logging them.
> > 
> > Do
> > 
> > # lspci -nn -xxx
> > 
> > as root. It'll show us which device that 8086:9d15 is.
> > 
> > Thx.
> > 
> 
> $ sudo lspci -nn -xxx | grep 9d15
> 00:1c.5 PCI bridge [0604]: Intel Corporation Sunrise Point-LP PCI
> Express Root Port #6 [8086:9d15] (rev f1)

Hm, looks like a built-in PCI Express port can't stomach suspend/resume
and starts throwing AER errors.

Adding Bjorn for a comment and leaving in the rest for reference.

> file lspci.txt is attached
> -- 
> software engineer
> rajagiri school of engineering and technology

> 00:00.0 Host bridge [0600]: Intel Corporation Xeon E3-1200 v6/7th Gen Core 
> Processor Host Bridge/DRAM Registers [8086:5904] (rev 03)
> 00: 86 80 04 59 06 00 90 20 03 00 00 06 00 00 00 00
> 10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 20: 00 00 00 00 00 00 00 00 00 00 00 00 43 10 11 13
> 30: 00 00 00 00 e0 00 00 00 00 00 00 00 00 00 00 00
> 40: 01 90 d1 fe 00 00 00 00 01 00 d1 fe 00 00 00 00
> 50: c1 02 00 00 b1 00 00 00 47 00 f0 9f 01 00 00 9b
> 60: 03 00 00 f0 00 00 00 00 01 80 d1 fe 00 00 00 00
> 70: 00 00 00 ff 01 00 00 00 00 0c 00 ff 7f 00 00 00
> 80: 01 00 00 00 00 00 00 00 1a 00 00 00 00 00 00 00
> 90: 01 00 00 ff 01 00 00 00 01 00 f0 5e 02 00 00 00
> a0: 01 00 00 00 02 00 00 00 01 00 00 5f 02 00 00 00
> b0: 01 00 00 9c 01 00 80 9b 01 00 00 9b 01 00 00 a0
> c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> e0: 09 00 10 01 31 60 61 7a dc 80 15 94 00 c0 06 00
> f0: 00 00 00 00 c8 0f 09 00 00 00 00 00 00 00 00 00
> 
> 00:02.0 VGA compatible controller [0300]: Intel Corporation Device 
> [8086:5921] (rev 06)
> 00: 86 80 21 59 07 04 10 00 06 00 00 03 10 00 00 00
> 10: 04 00 00 ee 00 00 00 00 0c 00 00 d0 00 00 00 00
> 20: 01 f0 00 00 00 00 00 00 00 00 00 00 43 10 11 13
> 30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
> 40: 09 70 0c 01 31 60 61 7a dc 80 15 94 00 00 00 00
> 50: c1 02 00 00 b1 00 00 00 00 00 00 00 01 00 00 9c
> 60: 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
> 70: 10 ac 92 00 00 80 00 10 00 00 00 00 00 00 00 00
> 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> a0: 00 00 00 00 00 00 00 00 00 00 00 00 05 d0 01 00
> b0: 18 00 e0 fe 00 00 00 00 00 00 00 00 00 00 00 00
> c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> d0: 01 00 22 00 00 00 00 00 00 00 00 00 00 00 00 00
> e0: 00 00 00 00 00 00 00 00 00 80 00 00 00 00 00 00
> f0: 00 00 00 00 00 00 00 00 00 00 00 00 18 50 90 9a
> 
> 00:04.0 Signal processing controller [1180]: Intel Corporation Xeon E3-1200 
> v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem [8086:1903] (rev 03)
> 00: 86 80 03 19 02 00 90 00 03 00 80 11 00 00 00 00
> 10: 04 80 1a ef 00 00 00 00 00 00 00 00 00 00 00 00
> 20: 00 00 00 00 00 00 00 00 00 00 00 00 43 10 11 13
> 30: 00 00 00 00 90 00 00 00 00 00 00 00 ff 01 00 00
> 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 50: 00 00 00 00 b1 00 00 00 00 00 00 00 00 00 00 00
> 60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 90: 05 d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Re: Fwd: [WARNING AND ERROR] may be system slow and audio and video breaking

2020-10-18 Thread Jeffrin Jose T
On Sun, 2020-10-18 at 23:03 +0200, Borislav Petkov wrote:
> On Mon, Oct 19, 2020 at 01:51:34AM +0530, Jeffrin Jose T wrote:
> > On Sun, 2020-10-18 at 19:49 +0200, Borislav Petkov wrote:
> > > On Sun, Oct 18, 2020 at 10:42:39PM +0530, Jeffrin Jose T wrote:
> > > > smpboot: Scheduler frequency invariance went wobbly, disabling!
> > > > [ 1112.592866] unchecked MSR access error: RDMSR from 0x123 at rIP:
> > > > 0xb5c9a184 (native_read_msr+0x4/0x30)
> 
> Ok, you forgot to say in your initial mail that this happens when you
> suspend your laptop.
> 
> Now, this unchecked MSR error thing happens only once because that early
> during resume the microcode on CPU1 is not updated yet - and that needs
> to be debugged separately and I'll try to reproduce that on my machine -
> so the microcode is not updated yet and therefore the 0x123 MSR is not
> "emulated" by the microcode, so to speak, thus the warning.
> 
> That warning doesn't happen anymore, though, once the microcode is
> updated.
> 
> But what happens after that is you get a flood of correctable PCIe
> errors about a transaction to a device timing out:
> 
> pcieport :00:1c.5: AER: Corrected error received: :00:1c.5
> pcieport :00:1c.5: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> pcieport :00:1c.5:   device [8086:9d15] error status/mask=1000/2000
> pcieport :00:1c.5:[12] Timeout
> 
> and it looks like that flood is slowing down the machine because it is
> busy logging them.
> 
> Do
> 
> # lspci -nn -xxx
> 
> as root. It'll show us which device that 8086:9d15 is.
> 
> Thx.
> 

$ sudo lspci -nn -xxx | grep 9d15
00:1c.5 PCI bridge [0604]: Intel Corporation Sunrise Point-LP PCI
Express Root Port #6 [8086:9d15] (rev f1)


file lspci.txt is attached
-- 
software engineer
rajagiri school of engineering and technology
00:00.0 Host bridge [0600]: Intel Corporation Xeon E3-1200 v6/7th Gen Core 
Processor Host Bridge/DRAM Registers [8086:5904] (rev 03)
00: 86 80 04 59 06 00 90 20 03 00 00 06 00 00 00 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 43 10 11 13
30: 00 00 00 00 e0 00 00 00 00 00 00 00 00 00 00 00
40: 01 90 d1 fe 00 00 00 00 01 00 d1 fe 00 00 00 00
50: c1 02 00 00 b1 00 00 00 47 00 f0 9f 01 00 00 9b
60: 03 00 00 f0 00 00 00 00 01 80 d1 fe 00 00 00 00
70: 00 00 00 ff 01 00 00 00 00 0c 00 ff 7f 00 00 00
80: 01 00 00 00 00 00 00 00 1a 00 00 00 00 00 00 00
90: 01 00 00 ff 01 00 00 00 01 00 f0 5e 02 00 00 00
a0: 01 00 00 00 02 00 00 00 01 00 00 5f 02 00 00 00
b0: 01 00 00 9c 01 00 80 9b 01 00 00 9b 01 00 00 a0
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 09 00 10 01 31 60 61 7a dc 80 15 94 00 c0 06 00
f0: 00 00 00 00 c8 0f 09 00 00 00 00 00 00 00 00 00

00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:5921] 
(rev 06)
00: 86 80 21 59 07 04 10 00 06 00 00 03 10 00 00 00
10: 04 00 00 ee 00 00 00 00 0c 00 00 d0 00 00 00 00
20: 01 f0 00 00 00 00 00 00 00 00 00 00 43 10 11 13
30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 00 00
40: 09 70 0c 01 31 60 61 7a dc 80 15 94 00 00 00 00
50: c1 02 00 00 b1 00 00 00 00 00 00 00 01 00 00 9c
60: 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 10 ac 92 00 00 80 00 10 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 05 d0 01 00
b0: 18 00 e0 fe 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 01 00 22 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 80 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 18 50 90 9a

00:04.0 Signal processing controller [1180]: Intel Corporation Xeon E3-1200 
v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem [8086:1903] (rev 03)
00: 86 80 03 19 02 00 90 00 03 00 80 11 00 00 00 00
10: 04 80 1a ef 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 43 10 11 13
30: 00 00 00 00 90 00 00 00 00 00 00 00 ff 01 00 00
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
50: 00 00 00 00 b1 00 00 00 00 00 00 00 00 00 00 00
60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 05 d0 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 01 e0 03 00 08 00 00 00 00 00 00 00 00 00 00 00
e0: 09 00 0c 01 31 60 61 7a dc 80 15 94 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

00:14.0 USB controller [0c03]: Intel Corporation Sunrise Point-LP USB 3.0 xHCI 
Controller [8086:9d2f] (rev 21)
00: 86 80 2f 9d 02 04 90 02 21 30 03 0c 00 00 80 00
10: 04 00 19 ef 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 

Re: Fwd: [WARNING AND ERROR] may be system slow and audio and video breaking

2020-10-18 Thread Borislav Petkov
On Mon, Oct 19, 2020 at 01:51:34AM +0530, Jeffrin Jose T wrote:
> On Sun, 2020-10-18 at 19:49 +0200, Borislav Petkov wrote:
> > On Sun, Oct 18, 2020 at 10:42:39PM +0530, Jeffrin Jose T wrote:
> > > smpboot: Scheduler frequency invariance went wobbly, disabling!
> > > [ 1112.592866] unchecked MSR access error: RDMSR from 0x123 at rIP:
> > > 0xb5c9a184 (native_read_msr+0x4/0x30)

Ok, you forgot to say in your initial mail that this happens when you
suspend your laptop.

Now, this unchecked MSR error thing happens only once because that early
during resume the microcode on CPU1 is not updated yet - and that needs
to be debugged separately and I'll try to reproduce that on my machine -
so the microcode is not updated yet and therefore the 0x123 MSR is not
"emulated" by the microcode, so to speak, thus the warning.
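
To check from userspace whether MSR 0x123 is readable on each CPU after resume has finished, something like the following Python sketch could be used. It assumes the msr kernel module is loaded (so that /dev/cpu/N/msr exists) and that it runs as root; the helper name read_msr and its dev_template parameter are made up for illustration and are not part of any kernel tooling.

```python
import os
import struct


def read_msr(cpu, reg, dev_template="/dev/cpu/{}/msr"):
    """Return the 64-bit value of MSR `reg` on `cpu`, or None if it cannot be read.

    The msr character device is read by using the register number as the
    file offset and reading 8 bytes. A register the CPU does not implement
    makes the read fail with an I/O error, which is reported here as None.
    """
    path = dev_template.format(cpu)
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError:
        return None  # msr module not loaded, no such CPU, or not root
    try:
        data = os.pread(fd, 8, reg)
    except OSError:
        return None  # the CPU rejected the read
    finally:
        os.close(fd)
    if len(data) != 8:
        return None
    return struct.unpack("<Q", data)[0]


if __name__ == "__main__":
    for cpu in range(os.cpu_count() or 1):
        value = read_msr(cpu, 0x123)
        print(cpu, "unreadable" if value is None else hex(value))
```

A CPU that still prints "unreadable" well after resume would suggest its microcode was never reloaded.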

That warning doesn't happen anymore, though, once the microcode is
updated.

But what happens after that is you get a flood of correctable PCIe
errors about a transaction to a device timing out:

pcieport :00:1c.5: AER: Corrected error received: :00:1c.5
pcieport :00:1c.5: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
pcieport :00:1c.5:   device [8086:9d15] error status/mask=1000/2000
pcieport :00:1c.5:[12] Timeout 

and it looks like that flood is slowing down the machine because it is
busy logging them.
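
For reference, the status/mask field in those log lines can be decoded by hand. The Python sketch below is only an illustration: the bit positions are those of the PCIe AER Correctable Error Status register, while the function names and the assumption that both values are printed in hex (as they appear in the log) are mine.

```python
# Bit positions of the PCIe AER Correctable Error Status/Mask registers.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode(value_hex):
    """Return the names of all correctable-error bits set in a hex string."""
    value = int(value_hex, 16)
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if value & (1 << bit)]


def decode_status_mask(field):
    """Split a 'status/mask' field and decode both halves."""
    status, mask = field.split("/")
    return decode(status), decode(mask)


if __name__ == "__main__":
    reported, masked = decode_status_mask("1000/2000")
    print("reported:", reported)
    print("masked:  ", masked)
```

For 1000/2000 this gives a Replay Timer Timeout as the reported error, matching the "[12] Timeout" line, with only the Advisory Non-Fatal Error bit masked.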

Do

# lspci -nn -xxx

as root. It'll show us which device that 8086:9d15 is.
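
If the full listing is saved to a file, the matching device can also be picked out programmatically. This is a minimal Python sketch that assumes the usual "lspci -nn" line layout (slot, class name, [class id], device name, [vendor:device]); the names LSPCI_LINE and find_device are hypothetical.

```python
import re

# Matches the header line of one device in "lspci -nn" output, e.g.
# "00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:5921] (rev 06)"
LSPCI_LINE = re.compile(
    r"^(?P<slot>[0-9a-f:.]+)\s+(?P<cls>.+?)\s+\[(?P<cls_id>[0-9a-f]{4})\]:\s+"
    r"(?P<name>.+?)\s+\[(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})\]"
)


def find_device(lspci_text, vendor_device):
    """Return (slot, class, name) for every device whose ID equals 'vvvv:dddd'."""
    wanted = vendor_device.lower()
    matches = []
    for line in lspci_text.splitlines():
        m = LSPCI_LINE.match(line.strip())
        if m and f"{m['vendor']}:{m['device']}" == wanted:
            matches.append((m["slot"], m["cls"], m["name"]))
    return matches
```

Hex dump lines from -xxx contain no bracketed IDs, so they are skipped. Note that the pattern expects each device header on a single line, so output that a mail client has re-wrapped needs to be joined first.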

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


Re: Fwd: [WARNING AND ERROR] may be system slow and audio and video breaking

2020-10-18 Thread Borislav Petkov
On Sun, Oct 18, 2020 at 10:42:39PM +0530, Jeffrin Jose T wrote:
> smpboot: Scheduler frequency invariance went wobbly, disabling!
> [ 1112.592866] unchecked MSR access error: RDMSR from 0x123 at rIP:
> 0xb5c9a184 (native_read_msr+0x4/0x30)
> [ 1112.592869] Call Trace:
> [ 1112.592876]  update_srbds_msr+0x6f/0xb0
> [ 1112.592880]  smp_store_cpu_info+0x8e/0xb0
> [ 1112.592883]  start_secondary+0x93/0x200
> [ 1112.592887]  ? set_cpu_sibling_map+0xcb0/0xcb0
> [ 1112.592891]  secondary_startup_64+0xa4/0xb0
> [ 1112.592898] unchecked MSR access error: WRMSR to 0x123 (tried to write 0x)
> at rIP: 0xb5c9a264 (native_write_msr+0x4/0x20)
> [ 1112.592899] Call Trace:
> [ 1112.592902]  update_srbds_msr+0x98/0xb0
> [ 1112.592904]  smp_store_cpu_info+0x8e/0xb0
> [ 1112.592907]  start_secondary+0x93/0x200
> [ 1112.592911]  ? set_cpu_sibling_map+0xcb0/0xcb0
> [ 1112.592914]  secondary_startup_64+0xa4/0xb0
> [ 2915.106879] show_signal: 6 callbacks suppressed
> [ 6089.209343] WARNING: stack going in the wrong direction? at
> i915_gem_close_object+0x2fb/0x560 [i915]

This looks strange.

Please send

- full dmesg
- output from the "grep -r . /sys/devices/system/cpu/vulnerabilities/" command
- /proc/cpuinfo
- .config

Privately is fine too.
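
If it is easier than pasting grep output, the vulnerabilities listing can be gathered with a few lines of Python. This is just a sketch of an equivalent to the grep command above; the function name read_vulnerabilities and its base parameter are made up for illustration.

```python
from pathlib import Path


def read_vulnerabilities(base="/sys/devices/system/cpu/vulnerabilities"):
    """Return {file name: contents} for every readable file under `base`."""
    result = {}
    root = Path(base)
    if not root.is_dir():
        return result  # directory missing, e.g. on a very old kernel
    for entry in sorted(root.iterdir()):
        if not entry.is_file():
            continue
        try:
            result[entry.name] = entry.read_text().strip()
        except OSError:
            continue  # skip anything that cannot be read
    return result


if __name__ == "__main__":
    for name, status in read_vulnerabilities().items():
        print(f"{name}: {status}")
```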

> -x---xx
> Linux debian 5.9.1-rc1+ #4 SMP Fri Oct 16 16:48:04 IST 2020 x86_64

What kernel is that exactly?

Can you reproduce with plain v5.9 too?

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


Fwd: [WARNING AND ERROR] may be system slow and audio and video breaking

2020-10-18 Thread Jeffrin Jose T


--- Begin Message ---
Hello,

The system was slow, and audio and video playback kept breaking up. At the
time I was compiling a kernel and also playing a YouTube video in Firefox,
with Evolution possibly running as well.

The files meminfo.txt and lscpu.txt are attached.

The following is an excerpt from "dmesg -l warn":

---x--x-x
smpboot: Scheduler frequency invariance went wobbly, disabling!
[ 1112.592866] unchecked MSR access error: RDMSR from 0x123 at rIP:
0xb5c9a184 (native_read_msr+0x4/0x30)
[ 1112.592869] Call Trace:
[ 1112.592876]  update_srbds_msr+0x6f/0xb0
[ 1112.592880]  smp_store_cpu_info+0x8e/0xb0
[ 1112.592883]  start_secondary+0x93/0x200
[ 1112.592887]  ? set_cpu_sibling_map+0xcb0/0xcb0
[ 1112.592891]  secondary_startup_64+0xa4/0xb0
[ 1112.592898] unchecked MSR access error: WRMSR to 0x123 (tried to write 0x)
at rIP: 0xb5c9a264 (native_write_msr+0x4/0x20)
[ 1112.592899] Call Trace:
[ 1112.592902]  update_srbds_msr+0x98/0xb0
[ 1112.592904]  smp_store_cpu_info+0x8e/0xb0
[ 1112.592907]  start_secondary+0x93/0x200
[ 1112.592911]  ? set_cpu_sibling_map+0xcb0/0xcb0
[ 1112.592914]  secondary_startup_64+0xa4/0xb0
[ 2915.106879] show_signal: 6 callbacks suppressed
[ 6089.209343] WARNING: stack going in the wrong direction? at
i915_gem_close_object+0x2fb/0x560 [i915]
$
-x---xx
Linux debian 5.9.1-rc1+ #4 SMP Fri Oct 16 16:48:04 IST 2020 x86_64
GNU/Linux

GNU Make                4.3
Binutils                2.35.1
Util-linux              2.36
Mount                   2.36
Bison                   3.7.2
Flex                    2.6.4
Dynamic linker (ldd)    2.30
Procps                  3.3.16
Kbd                     2.3.0
Console-tools           2.3.0
Sh-utils                8.32
Udev                    246


Reported-by: Jeffrin Jose T

MemTotal:        6952576 kB
MemFree:          299000 kB
MemAvailable:    2964748 kB
Buffers:          169024 kB
Cached:          3447336 kB
SwapCached:          108 kB
Active:          1996356 kB
Inactive:        3315820 kB
Active(anon):     643748 kB
Inactive(anon):  1950892 kB
Active(file):    1352608 kB
Inactive(file):  1364928 kB
Unevictable:       57668 kB
Mlocked:             160 kB
SwapTotal:       8263676 kB
SwapFree:        8261092 kB
Dirty:             16632 kB
Writeback:             0 kB
AnonPages:       1736772 kB
Mapped:           520572 kB
Shmem:            898760 kB
KReclaimable:     264048 kB
Slab:            1045968 kB
SReclaimable:     264048 kB
SUnreclaim:       781920 kB
KernelStack:       28320 kB
PageTables:        27856 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    11739964 kB
Committed_AS:    7741848 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       60544 kB
VmallocChunk:          0 kB
Percpu:             2672 kB
HardwareCorrupted:     0 kB
AnonHugePages:    763904 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      891756 kB
DirectMap2M:     7372800 kB
DirectMap1G:     1048576 kB

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              2
Core(s) per socket:              2
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           142
Model name:                      Intel(R) Core(TM) i3-7020U CPU @ 2.30GHz
Stepping:                        9
CPU MHz:                         2299.999
CPU max MHz:                     2300.
CPU min MHz:                     400.
BogoMIPS:                        4599.93
Virtualization:                  VT-x
L1d cache:                       64 KiB
L1i cache:                       64 KiB
L2 cache:                        512 KiB
L3 cache:                        3 MiB
NUMA node0 CPU(s):               0-3
Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds:             Mitigation; Microcode