Bug#1033398: linux-image-amd64: reproducible kernel freeze on 5.19+
Hi Florian,

On Mon, Jun 05, 2023 at 08:49:25PM +0200, Florian Lehner wrote:
> Hi all,
>
> the fix was merged upstream with
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/mm/maccess.c?id=d319f344561de23e810515d109c7278919bff7b0

And so it landed in 6.4-rc1. Great, thanks!

Would you mind proposing the change for inclusion in the relevant stable versions as well? Please make sure to CC the relevant maintainers in the request for stable.

Regards,
Salvatore
Bug#1033398: linux-image-amd64: reproducible kernel freeze on 5.19+
Hi all,

the fix was merged upstream with
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/mm/maccess.c?id=d319f344561de23e810515d109c7278919bff7b0

- florian

On 3/25/23 16:58, Diederik de Haas wrote:
> Control: found -1 5.19~rc4-1~exp1
> Control: forwarded -1 https://lore.kernel.org/bpf/20230118051443.78988-1-alexei.starovoi...@gmail.com/
>
> On Saturday, 25 March 2023 16:00:47 CET Florian Lehner wrote:
> > > Via https://snapshot.debian.org/binary/linux-image-amd64/ you can easily
> > > test various kernel versions. Could you try whether 5.19~rc4-1~exp1
> > > indeed produces the problem?
> >
> > Yes - I can reproduce the total system freeze with 5.19~rc4-1~exp1
>
> Thanks. Then the most likely case is that it was introduced in the 5.19
> merge window and is thus also present in 5.19-rc1, but there isn't a
> prebuilt kernel to verify.
>
> > > > Since the running program is rather complex, it is not easily possible
> > > > to carve out a small reproducer. We can provide gdb backtraces from
> > > > freezes inside qemu.
> > >
> > > Someone else would have to chime in for the backtraces; that's beyond my
> > > skill set.
> >
> > I just learned about
> > https://lore.kernel.org/bpf/20230118051443.78988-1-alexei.starovoitov@gmail.com/.
> > With the provided patch applied I no longer manage to freeze the system.
>
> I see you already responded to that thread, excellent :-)
> Hopefully they'll read this whole bug report, but mentioning that your
> actual problem was NOT triggered up to 5.18, but did trigger from 5.19-rc4
> onwards, could be useful. I may not fully understand what upstream talked
> about, but I only saw a reference to a 6.0.0 kernel.
>
> Thanks for testing and reporting back :-)
Bug#1033398: linux-image-amd64: reproducible kernel freeze on 5.19+
Control: found -1 5.19~rc4-1~exp1
Control: forwarded -1 https://lore.kernel.org/bpf/20230118051443.78988-1-alexei.starovoi...@gmail.com/

On Saturday, 25 March 2023 16:00:47 CET Florian Lehner wrote:
> > Via https://snapshot.debian.org/binary/linux-image-amd64/ you can easily
> > test various kernel versions. Could you try whether 5.19~rc4-1~exp1
> > indeed produces the problem?
>
> Yes - I can reproduce the total system freeze with 5.19~rc4-1~exp1

Thanks. Then the most likely case is that it was introduced in the 5.19 merge window and is thus also present in 5.19-rc1, but there isn't a prebuilt kernel to verify.

> > > Since the running program is rather complex, it is not easily possible
> > > to carve out a small reproducer. We can provide gdb backtraces from
> > > freezes inside qemu.
> >
> > Someone else would have to chime in for the backtraces; that's beyond my
> > skill set.
>
> I just learned about
> https://lore.kernel.org/bpf/20230118051443.78988-1-alexei.starovoitov@gmail.com/.
> With the provided patch applied I no longer manage to freeze the system.

I see you already responded to that thread, excellent :-)
Hopefully they'll read this whole bug report, but mentioning that your actual problem was NOT triggered up to 5.18, but did trigger from 5.19-rc4 onwards, could be useful. I may not fully understand what upstream talked about, but I only saw a reference to a 6.0.0 kernel.

Thanks for testing and reporting back :-)
Bug#1033398: linux-image-amd64: reproducible kernel freeze on 5.19+
On Fri, 24 Mar 2023 13:50:15 +0100 Diederik de Haas wrote:
> On Friday, 24 March 2023 12:44:33 CET Tim Rühsen wrote:
> > Package: linux-image-amd64
> > Version: 6.1.20-1
> >
> > We run a privileged eBPF-based tool with communication between kernel and
> > user space. It runs without issues on kernels 4.15 to 5.18.
> > On kernels 5.19+, the whole system freezes after a few minutes.
>
> Via https://snapshot.debian.org/binary/linux-image-amd64/ you can easily
> test various kernel versions. Could you try whether 5.19~rc4-1~exp1
> indeed produces the problem?

Yes - I can reproduce the total system freeze with 5.19~rc4-1~exp1 (2022-07-01) from https://snapshot.debian.org/package/linux-signed-amd64/5.19~rc4%2B1~exp1/.

> > Since the running program is rather complex, it is not easily possible to
> > carve out a small reproducer. We can provide gdb backtraces from freezes
> > inside qemu.
>
> Someone else would have to chime in for the backtraces; that's beyond my
> skill set.

I just learned about https://lore.kernel.org/bpf/20230118051443.78988-1-alexei.starovoi...@gmail.com/. With the provided patch applied I no longer manage to freeze the system.

- florian
Bug#1033398: linux-image-amd64: reproducible kernel freeze on 5.19+
Hi,

maybe some additional information.

The eBPF program is of type BPF_PROG_TYPE_PERF_EVENT and is attached to all CPUs via the perf subsystem, using PERF_COUNT_SW_CPU_CLOCK. It is executed at a constant sampling frequency (usually 20 Hz).

We also have QEMU guest memory dumps available if this would help investigate the issue.

- florian

On Fri, 24 Mar 2023 12:44:33 +0100 Tim Rühsen wrote:
> Package: linux-image-amd64
> Version: 6.1.20-1
> Severity: important
> X-Debbugs-Cc: tim.rueh...@gmx.de
>
> Dear Maintainer,
>
> * What led up to the situation?
>
> We run a privileged eBPF-based tool with communication between kernel and
> user space. It runs without issues on kernels 4.15 to 5.18.
> On kernels 5.19+, the whole system freezes after a few minutes.
>
> It seems that with more system activity (load, forks) the freeze happens
> earlier. The underlying hardware seems to play no role; we could reproduce
> this on different bare metal systems as well as within a qemu-based VM.
>
> Since the running program is rather complex, it is not easily possible to
> carve out a small reproducer. We can provide gdb backtraces from freezes
> inside qemu.
>
> -- System Information:
> Debian Release: 12.0
>   APT prefers testing-security
>   APT policy: (500, 'testing-security'), (500, 'testing-debug'), (500, 'unstable'), (500, 'testing'), (1, 'experimental')
> Architecture: amd64 (x86_64)
>
> Kernel: Linux 6.1.0-7-amd64 (SMP w/20 CPU threads; PREEMPT)
> Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=locale: Cannot set LC_ALL to default locale: No such file or directory UTF-8), LANGUAGE=en_US:en
> Shell: /bin/sh linked to /usr/bin/dash
> Init: systemd (via /run/systemd/system)
> LSM: AppArmor: enabled
>
> Versions of packages linux-image-amd64 depends on:
> ii  linux-image-6.1.0-7-amd64  6.1.20-1
>
> linux-image-amd64 recommends no packages.
>
> linux-image-amd64 suggests no packages.
>
> -- debconf information:
> perl: warning: Setting locale failed.
> perl: warning: Please check that your locale settings:
>     LANGUAGE = "en_US:en",
>     LC_ALL = (unset),
>     LC_TIME = "en_DE.UTF-8",
>     LC_MONETARY = "en_DE.UTF-8",
>     LC_COLLATE = "en_DE.UTF-8",
>     LANG = "en_US.UTF-8"
>     are supported and installed on your system.
> perl: warning: Falling back to a fallback locale ("en_US.UTF-8").
> locale: Cannot set LC_ALL to default locale: No such file or directory
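
For readers unfamiliar with the attachment Florian describes at the top of this message, the following is a rough sketch of what "attached to all CPUs via the perf subsystem at 20 Hz" amounts to. It only models the arguments that one perf_event_open(2) call per CPU would plausibly receive; it does not open any events or load a BPF program. The two constants are the values from `<linux/perf_event.h>`. The class and function names are hypothetical, and `pid = -1` (system-wide sampling) is an assumption that the thread does not state.

```python
from dataclasses import dataclass
from typing import List

# Values as defined in the Linux UAPI header <linux/perf_event.h>.
PERF_TYPE_SOFTWARE = 1
PERF_COUNT_SW_CPU_CLOCK = 0


@dataclass(frozen=True)
class PerfEventRequest:
    """Arguments one perf_event_open(2) call would receive (illustration only)."""
    type: int
    config: int
    freq: bool        # True: sample_freq is a rate in Hz, not a fixed period
    sample_freq: int
    pid: int          # -1 = every process (ASSUMPTION: system-wide sampling)
    cpu: int


def plan_perf_events(num_cpus: int, sample_hz: int = 20) -> List[PerfEventRequest]:
    """Build one software cpu-clock sampling event per CPU."""
    return [
        PerfEventRequest(
            type=PERF_TYPE_SOFTWARE,
            config=PERF_COUNT_SW_CPU_CLOCK,
            freq=True,
            sample_freq=sample_hz,
            pid=-1,
            cpu=cpu,
        )
        for cpu in range(num_cpus)
    ]


def total_samples_per_second(requests: List[PerfEventRequest]) -> int:
    """Nominal number of BPF program runs per second across all CPUs."""
    return sum(r.sample_freq for r in requests)


# The reporting machine lists 20 CPU threads in its system information.
plan = plan_perf_events(num_cpus=20)
print(len(plan), "events,", total_samples_per_second(plan), "nominal samples/s")
```

The nominal rate is only an approximation: in frequency mode the kernel adjusts the sampling period to approach the requested rate, so the actual number of program runs can differ.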
Bug#1033398: linux-image-amd64: reproducible kernel freeze on 5.19+
Control: reassign -1 src:linux 6.1.20-1

On Friday, 24 March 2023 12:44:33 CET Tim Rühsen wrote:
> Package: linux-image-amd64
> Version: 6.1.20-1
>
> We run a privileged eBPF-based tool with communication between kernel and
> user space. It runs without issues on kernels 4.15 to 5.18.
> On kernels 5.19+, the whole system freezes after a few minutes.

Via https://snapshot.debian.org/binary/linux-image-amd64/ you can easily test various kernel versions. Could you try whether 5.19~rc4-1~exp1 indeed produces the problem?

> Since the running program is rather complex, it is not easily possible to
> carve out a small reproducer. We can provide gdb backtraces from freezes
> inside qemu.

Someone else would have to chime in for the backtraces; that's beyond my skill set.
Verifying in which kernel version/commit the issue started is still useful.
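
Narrowing down the version in which the issue started can be done as a binary search over an ordered list of kernels, which is the idea behind `git bisect`. The sketch below is a minimal illustration, not a tool used in this report. `first_bad` and `simulated_freezes` are hypothetical names, the candidate list uses upstream tag names purely as placeholders, and the simulated outcome (regression entering at 5.19-rc1) is an assumption for demonstration rather than a test result.

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], freezes: Callable[[str], bool]) -> str:
    """Return the first version for which freezes() is True.

    Assumes versions are ordered oldest to newest, versions[0] is known good,
    versions[-1] is known bad, and every version after the first bad one is
    bad as well.
    """
    lo, hi = 0, len(versions) - 1  # invariant: versions[lo] good, versions[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if freezes(versions[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return versions[hi]


# Placeholder candidates between the last good and the first known-bad kernel.
candidates = ["5.18", "5.19-rc1", "5.19-rc2", "5.19-rc3", "5.19-rc4"]


def simulated_freezes(version: str) -> bool:
    # SIMULATED outcome, not measured: pretend everything after 5.18 freezes.
    # In practice this step means booting the kernel and running the workload.
    return version != "5.18"


print(first_bad(candidates, simulated_freezes))
```

Each call to the predicate stands for one boot-and-test cycle, so the number of cycles grows only logarithmically with the number of candidate kernels.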
Bug#1033398: linux-image-amd64: reproducible kernel freeze on 5.19+
Package: linux-image-amd64
Version: 6.1.20-1
Severity: important
X-Debbugs-Cc: tim.rueh...@gmx.de

Dear Maintainer,

* What led up to the situation?

We run a privileged eBPF-based tool with communication between kernel and user space. It runs without issues on kernels 4.15 to 5.18.
On kernels 5.19+, the whole system freezes after a few minutes.

It seems that with more system activity (load, forks) the freeze happens earlier. The underlying hardware seems to play no role; we could reproduce this on different bare metal systems as well as within a qemu-based VM.

Since the running program is rather complex, it is not easily possible to carve out a small reproducer. We can provide gdb backtraces from freezes inside qemu.

-- System Information:
Debian Release: 12.0
  APT prefers testing-security
  APT policy: (500, 'testing-security'), (500, 'testing-debug'), (500, 'unstable'), (500, 'testing'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 6.1.0-7-amd64 (SMP w/20 CPU threads; PREEMPT)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=locale: Cannot set LC_ALL to default locale: No such file or directory UTF-8), LANGUAGE=en_US:en
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages linux-image-amd64 depends on:
ii  linux-image-6.1.0-7-amd64  6.1.20-1

linux-image-amd64 recommends no packages.

linux-image-amd64 suggests no packages.

-- debconf information:
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
    LANGUAGE = "en_US:en",
    LC_ALL = (unset),
    LC_TIME = "en_DE.UTF-8",
    LC_MONETARY = "en_DE.UTF-8",
    LC_COLLATE = "en_DE.UTF-8",
    LANG = "en_US.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("en_US.UTF-8").
locale: Cannot set LC_ALL to default locale: No such file or directory

(gdb) thread apply all bt full

Thread 8 (Thread 1.8 (CPU#7 [running])):
#0  arch_atomic_read (v=0x837c2b4c ) at arch/x86/include/asm/atomic.h:29
No locals.
#1  atomic_read (v=0x837c2b4c ) at include/linux/atomic/atomic-instrumented.h:28
No locals.
#2  pv_hybrid_queued_unfair_trylock (lock=0x837c2b4c ) at kernel/locking/qspinlock_paravirt.h:88
        val =
#3  __pv_queued_spin_lock_slowpath (lock=0x837c2b4c , val=) at kernel/locking/qspinlock.c:446
        prev =
        next =
        node = 0x88813bdf1b40
        old =
        tail = 2097152
        idx = 0
        queue =
        cnt =
        __PTR =
        VAL =
        _val =
        __PTR =
        VAL =
        __vpp_verify =
        _val =
        __PTR =
        VAL =
        __vpp_verify =
        pao_ID__ =
        pao_tmp__ =
        pto_val__ =
        pto_tmp__ =
        pao_ID__ =
        pao_tmp__ =
        pto_val__ =
        pto_tmp__ =
        pao_ID__ =
        pao_tmp__ =
        pto_val__ =
        pto_tmp__ =
#4  0x81a2b6f0 in pv_queued_spin_lock_slowpath (val=7, lock=0x837c2b4c ) at arch/x86/include/asm/paravirt.h:591
        __esi =
        __edx =
        __edi =
        __ecx =
        __eax =
#5  queued_spin_lock_slowpath (val=7, lock=0x837c2b4c ) at arch/x86/include/asm/qspinlock.h:51
No locals.
#6  queued_spin_lock (lock=0x837c2b4c ) at include/asm-generic/qspinlock.h:114
        val = 7
        val =
#7  do_raw_spin_lock (lock=0x837c2b4c ) at include/linux/spinlock.h:186
No locals.
#8  __raw_spin_lock (lock=0x837c2b4c ) at include/linux/spinlock_api_smp.h:134
No locals.
#9  _raw_spin_lock (lock=lock@entry=0x837c2b4c ) at kernel/locking/spinlock.c:154
No locals.
#10 0x812e1ba7 in spin_lock (lock=0x837c2b4c ) at include/linux/spinlock.h:350
No locals.
#11 alloc_vmap_area (size=size@entry=20480, align=align@entry=16384, vstart=vstart@entry=18446683600570023936, vend=vend@entry=18446718784942112767, node=node@entry=-1, gfp_mask=3264, gfp_mask@entry=3520) at mm/vmalloc.c:1634
        va = 0x88802dbb05c0
        freed = 0
        addr =
        purged = 0
        ret =
        retry =
        __func__ = "alloc_vmap_area"
#12 0x812e2111 in __get_vm_area_node (size=20480, size@entry=16384, align=align@entry=16384, shift=shift@entry=12, flags=flags@entry=34, start=start@entry=18446683600570023936, end=end@entry=18446718784942112767, node=-1, gfp_mask=3520, caller=0x8109ad0f ) at mm/vmalloc.c:2501
        va =
        area = 0x888113d8dfc0
        requested_size = 16384
#13 0x812e52c4 in __vmalloc_node_range (size=, size@entry=16384, align=align@entry=16384, start=, end=, gfp_mask=gfp_mask@entry=3520, prot=..., vm_flags=, node=, caller=) at mm/vmalloc.c:3173
        area =
        ret =
        kasan_flags =
        real_size = 16384
        real_align = 16384
        shift = 12
        again =
#14
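
Backtraces like the one above are easier to compare across threads once the frame lines are reduced to function and source location. The following is a small sketch of such a reduction, assuming the `#N [ADDR in] FUNC (ARGS) at FILE:LINE` frame layout seen in this dump. The names `Frame`, `parse_frames` and `caller_of` are hypothetical helpers and not part of gdb. The sample text is copied from frames #9 to #11 of the trace.

```python
import re
from typing import List, NamedTuple, Optional

# Matches frame lines such as
#   "#10 0x812e1ba7 in spin_lock (lock=0x837c2b4c ) at include/linux/spinlock.h:350"
# The "ADDR in" part is optional (gdb omits it for inlined frames).
FRAME_RE = re.compile(
    r"^#(?P<num>\d+)\s+"
    r"(?:(?P<pc>0x[0-9a-f]+) in )?"
    r"(?P<func>[A-Za-z_]\w*) "
    r"\((?P<args>.*)\)"
    r" at (?P<file>\S+):(?P<line>\d+)$"
)


class Frame(NamedTuple):
    num: int
    func: str
    file: str
    line: int


def parse_frames(text: str) -> List[Frame]:
    """Extract frame lines, skipping locals and 'No locals.' lines."""
    frames = []
    for raw in text.splitlines():
        m = FRAME_RE.match(raw.strip())
        if m:
            frames.append(
                Frame(int(m["num"]), m["func"], m["file"], int(m["line"]))
            )
    return frames


def caller_of(frames: List[Frame], func: str) -> Optional[Frame]:
    """Return the frame directly above the first frame named func, if any."""
    for i, frame in enumerate(frames[:-1]):
        if frame.func == func:
            return frames[i + 1]
    return None


SAMPLE = """\
#9  _raw_spin_lock (lock=lock@entry=0x837c2b4c ) at kernel/locking/spinlock.c:154
No locals.
#10 0x812e1ba7 in spin_lock (lock=0x837c2b4c ) at include/linux/spinlock.h:350
No locals.
#11 alloc_vmap_area (size=size@entry=20480, align=align@entry=16384, vstart=vstart@entry=18446683600570023936, vend=vend@entry=18446718784942112767, node=node@entry=-1, gfp_mask=3264, gfp_mask@entry=3520) at mm/vmalloc.c:1634
        va = 0x88802dbb05c0
"""

frames = parse_frames(SAMPLE)
print(caller_of(frames, "spin_lock"))
```

Applied to the full dump, the same lookup on each thread shows which function each CPU was in when it tried to take the spinlock.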