Hello everyone,

TLDR: If you have a Ryzen 7000 series CPU it would be great if you could boot 
the test case we repaired at:

        https://dl.t2sde.org/amd-zen4-smt-c0fefe/

The full story:

During cross compiling our Linux Distribution. I observers some random illegal 
instruction build errors since we got ourselves a Ryzen 7950x on launch day a 
year ago:

vendor_id : AuthenticAMD
cpu family : 25
model : 97
model name : AMD Ryzen 9 7950X 16-Core Processor
stepping : 2
microcode : 0xa601203

Initially I thought it must surely be some early system instability, and some 
DDR5 AGESA and microcode updates will eventually take care of that. Month after 
month passed and so far no BIOS update helped. So I finally started to 
investigate this over the last months, run the DDR5 memory at base clock, then 
disabled Precision Boost, and at the end run the CPU and RAM even below 
advertised base clock and the pseudo random illegal instructions at some gcc 
instances where still observed and I started to realized they where actually 
99% quite identical around gcc user-space address 0xc0e0c0 (sometimes slightly 
off like 0xc08aaf):

during GIMPLE pass: switchlower
../src/intel/vulkan/anv_nir_lower_ubo_loads.c: In function 
'lower_ubo_load_instr':
../src/intel/vulkan/anv_nir_lower_ubo_loads.c:28:1: internal compiler error: 
Illegal instruction
  28 | lower_ubo_load_instr(nir_builder *b, nir_instr *instr, UNUSED void 
*_data)
     | ^~~~~~~~~~~~~~~~~~~~
0x1435c95 internal_error(char const*, ...)
       ???:0
0xc0e0c0 
tree_switch_conversion::switch_decision_tree::try_switch_expansion(vec<tree_switch_conversion::cluster*,
 va_heap, vl_ptr>&)
       ???:0
0xc0eb69 
tree_switch_conversion::switch_decision_tree::analyze_switch_statement()
       ???:0

  0x0000000000c0e0b6 <+246>:   cmp    $0xffffffff,%r14d
  0x0000000000c0e0ba <+250>:   je     0xc0e190 
<_ZN22tree_switch_conversion20switch_decision_tree20try_switch_expansionER3vecIPNS_7clusterE7va_heap6vl_ptrE+464>
  0x0000000000c0e0c0 <+256>:   mov    %rax,%rcx # <----- HERE -----!
  0x0000000000c0e0c3 <+259>:   cmpb   $0x0,0xa8(%r13)
  0x0000000000c0e0cb <+267>:   jne    0xc0e060 
<_ZN22tree_switch_conversion20switch_decision_tree20try_switch_expansionER3vecIPNS_7clusterE7va_heap6vl_ptrE+160>
  0x0000000000c0e0cd <+269>:   mov    0xa0(%r13),%rsi
  0x0000000000c0e0d4 <+276>:   mov    $0x8,%eax

The illegal instructions only occur sometimes, so rebuilding a package is 
usually successful.
To rule out any software inherited instability, I booted the bit identical 
system copy (using rsync) on a Ryzen 5950x and could build all the system using 
the identical kernel and gcc binaries without any such spurious illegal 
instructions.

This appeared to mostly show up with gcc as cross compiled for sparc64, but 
that should not matter, as this ist just generic x86-64(-v1) code that in 
similar sequence, memory access and I/O pattern could likely appear in any 
other sophisticated and complex enough user-space program.

Trying to further narrow this down, and wether it is just one defect core, I 
patched the kernel to show the likely CPU. Not sure if this is the most 
reliable, but that is the patch:

--- linux-6.5/arch/x86/kernel/traps.c.orig      2023-10-02 11:53:47.413623693 
+0200
+++ linux-6.5/arch/x86/kernel/traps.c   2023-10-02 11:53:58.580624927 +0200
@@ -294,8 +294,12 @@
static inline void handle_invalid_op(struct pt_regs regs)
#endif
{
+       void __useraddr = error_get_trap_addr(regs);
+       int cpu = raw_smp_processor_id();
+       printk("INVALID OPCODE: %lx likely on CPU %d (core %d, socket %d)\n",
+               cpu, addr, topology_core_id(cpu), 
topology_physical_package_id(cpu));
       do_error_trap(regs, 0, "invalid opcode", X86_TRAP_UD, SIGILL,
ILL_ILLOPN, error_get_trap_addr(regs));
+                     ILL_ILLOPN, addr);
}

This showed number over all cores and CCX to be affected:

[ 1901.688448] INVALID OPCODE: c0e0c0 likely on CPU 26 (core 10, socket 0)
[ 1930.529211] INVALID OPCODE: c0e0c0 likely on CPU 21 (core 5, socket 0)
[ 1971.898911] INVALID OPCODE: c0e0c0 likely on CPU 27 (core 11, socket 0)
[ 2006.781557] INVALID OPCODE: c0e0c0 likely on CPU 19 (core 3, socket 0)
[ 2054.672900] INVALID OPCODE: c0e0c0 likely on CPU 30 (core 14, socket 0)
[ 2097.180969] INVALID OPCODE: c0e0c0 likely on CPU 27 (core 11, socket 0)
[ 2140.558150] INVALID OPCODE: c0e0c0 likely on CPU 23 (core 7, socket 0)
[ 2168.601674] INVALID OPCODE: c0e0c0 likely on CPU 15 (core 15, socket 0)
…

I sorted the result # dmesg | grep INVALID| sed 's/.*://' | cut -d ' ' -f 6| 
sort -n| uniq -c
     4 0
     2 1
     2 2
     2 3
     5 4
     5 5 
     3 7 
     4 8 
     4 9 
     2 10
     4 11
     3 12
     5 13
     5 14
     2 15
     6 16
     8 17
     2 18
     5 19
     5 20
     3 21
     7 22
     5 23
     3 24
     2 25
     2 26
     3 27
     4 28
     4 29
     6 30
     5 31

Already discussing this issue with some other folks and kernel developer it was 
suggested it could be TLB related, and we realized we were booting with 
mitigations=off for a little higher all system compilation performance and I 
can report that without mitigations=off this spurious illegal instructions do 
not appear. Also disabling SMT makes the problem disappear, too.

So I iterated over all the mitigation options and found spectre_v2_user=off to 
be enough to make this bug reproducibly appear when loading most cores running 
this sparc64-t2-linux-gcc.

Now the good news is: running with modern security mitigations enabled hides 
this what to me looks like a Zen 4 SMT sibling processor state corruption bug 
or mis-speculation. However, I would argue, non malicious user-more programs 
should not exhibit spurious illegal instructions with an operating system 
running in a classic, high performance mode without any special security 
mitigations in place.

As this is very reproducible with GCC for sparc64 for me, I created an initrd 
with a pre-processed source file (from Mesa IIRC) setup to boot into a loop 
running sparc64-t2-linux-gcc on all cores (all grouped in usr/local) for others 
to test how widespread this issue is:

https://dl.t2sde.org/amd-zen4-smt-c0fefe/

Boot with:
spectre_v2_user=off or mitigations=off

It is even reproducible in qemu/kvm running on a host with this 
spectre_v2_user=off:
qemu-system-x86_64 --enable-kvm -smp 32 -cpu host -m 4G -kernel 
vmlinuz-6.5.5-t2 -initrd initrd-6.5.5-t2.gz -nographic -append "console=ttyS0"

To test on your system with chroot:
mkdir bug; cd bug; gunzip ../initrd-6.5.5-t2 | cpio -i
chroot . usr/local/init

With this reduced test case illegal instructions appear within an average of 
just 5 seconds on my Ryzen 7950x.

To rule out that this is some random linux kernel config and optimization 
fluke, I built the kernel with clang and gcc, without any change, and also 
downloaded the latest Intel Clear Linux kernel binary to double checked that it 
is affected in the same way, and sure it does.

After all this research, to me this looks like an Zen 4 CPU bug, but any other 
comments, hints, patches welcome!

I realize AMD has never microcode for Epyc server CPUs, if this is already 
fixed in some newer microcode, it would really be amazing (hint) if AMD would 
release microcode updates for $999 consumer CPUs in a more timely manner, and 
not only high end server SKUs via linux-firmware, ...

Thank you so much,

        René Rebe

--
ExactCODE GmbH, Lietzenburger Str. 42, DE-10789 Berlin
http://exactcode.com | http://exactscan.com | http://ocrkit.com

----------------------------------------------------------- 
If you wish to unsubscribe from this mailing, send mail to
li...@t2-project.org with a subject of: unsubscribe t2

Reply via email to