Bug#1033398: linux-image-amd64: reproducible kernel freeze on 5.19+

2023-06-07 Thread Salvatore Bonaccorso
Hi Florian,

On Mon, Jun 05, 2023 at 08:49:25PM +0200, Florian Lehner wrote:
> Hi all,
> 
> the fix was merged upstream with 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/mm/maccess.c?id=d319f344561de23e810515d109c7278919bff7b0

And so landed in 6.4-rc1.

Great, thanks! Would you mind also proposing the change for
inclusion in the relevant stable versions? Please make sure to CC the
relevant maintainers in the request for stable.

Regards,
Salvatore



Bug#1033398: linux-image-amd64: reproducible kernel freeze on 5.19+

2023-06-05 Thread Florian Lehner

Hi all,

the fix was merged upstream with 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/mm/maccess.c?id=d319f344561de23e810515d109c7278919bff7b0


- florian





Bug#1033398: linux-image-amd64: reproducible kernel freeze on 5.19+

2023-03-25 Thread Diederik de Haas
Control: found -1 5.19~rc4-1~exp1
Control: forwarded -1 
https://lore.kernel.org/bpf/20230118051443.78988-1-alexei.starovoi...@gmail.com/

On Saturday, 25 March 2023 16:00:47 CET Florian Lehner wrote:
> > Via https://snapshot.debian.org/binary/linux-image-amd64/ you can easily
> > test various kernel versions. Could you try whether 5.19~rc4-1~exp1
> > indeed produces the problem?
> 
> Yes - I can reproduce the total system freeze with 5.19~rc4-1~exp1

Thanks. So it was most likely introduced in the 5.19 merge window and is
thus also present in 5.19-rc1, but there isn't a prebuilt kernel to
verify that.

> > > Since the running program is rather complex, it is not easily possible
> > > to carve out a small reproducer. We can provide gdb backtraces from
> > > freezes inside qemu.
> > 
> > Someone else would have to chime in for the backtraces; that's beyond my
> > skill set.
> 
> I just learned about
> https://lore.kernel.org/bpf/20230118051443.78988-1-alexei.starovoitov@gmail.com/.
> With the provided patch applied I no longer manage to freeze the
> system.

I see you already responded to that thread, excellent :-)
Hopefully they'll read this whole bug report, but it could be useful to mention
that your actual problem was NOT triggered up to 5.18, but did trigger from
5.19-rc4 onward. I may not fully understand what upstream talked about, but I
only saw a reference to a 6.0.0 kernel.

Thanks for testing and reporting back :-)



Bug#1033398: linux-image-amd64: reproducible kernel freeze on 5.19+

2023-03-25 Thread Florian Lehner



On Fri, 24 Mar 2023 13:50:15 +0100 Diederik de Haas wrote:
> On Friday, 24 March 2023 12:44:33 CET Tim Rühsen wrote:
> > Package: linux-image-amd64
> > Version: 6.1.20-1
> >
> > We run a privileged eBPF based tool with a communication between kernel and
> > user space. It runs without issues on kernels 4.15 to 5.18.
> > On kernels 5.19+, the whole system freezes after a few minutes.
>
> Via https://snapshot.debian.org/binary/linux-image-amd64/ you can easily test
> various kernel versions. Could you try whether 5.19~rc4-1~exp1 indeed produces
> the problem?


Yes - I can reproduce the total system freeze with 5.19~rc4-1~exp1 
(2022-07-01) from 
https://snapshot.debian.org/package/linux-signed-amd64/5.19~rc4%2B1~exp1/.




> > Since the running program is rather complex, it is not easily possible to
> > carve out a small reproducer. We can provide gdb backtraces from freezes
> > inside qemu.
>
> Someone else would have to chime in for the backtraces; that's beyond my skill
> set.


I just learned about
https://lore.kernel.org/bpf/20230118051443.78988-1-alexei.starovoi...@gmail.com/.
With the provided patch applied I no longer manage to freeze the system.


- florian



Bug#1033398: linux-image-amd64: reproducible kernel freeze on 5.19+

2023-03-24 Thread Florian Lehner

Hi,

here is some additional information that may help.

The eBPF program is of type BPF_PROG_TYPE_PERF_EVENT and is attached to all
CPUs via the perf subsystem, using PERF_COUNT_SW_CPU_CLOCK. It
is executed at a constant sampling frequency (usually 20 Hz).
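
For readers unfamiliar with this kind of setup, the attachment described
above can be sketched roughly as follows. This is only a minimal
illustration and not the code of the actual tool: the function names
make_cpu_clock_attr and attach_on_cpu are hypothetical, and prog_fd is
assumed to be the file descriptor of an already loaded
BPF_PROG_TYPE_PERF_EVENT program.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: describe a CPU-clock software event that is
 * sampled at freq_hz samples per second (e.g. 20). */
struct perf_event_attr make_cpu_clock_attr(unsigned long long freq_hz)
{
    struct perf_event_attr attr;

    assert(freq_hz > 0);
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_SOFTWARE;
    attr.config = PERF_COUNT_SW_CPU_CLOCK;
    attr.freq = 1;              /* sample_freq is a frequency, not a period */
    attr.sample_freq = freq_hz;
    attr.disabled = 1;          /* enabled only after the program is attached */
    return attr;
}

/* Hypothetical helper: open the event for all tasks (pid = -1) on one
 * CPU and attach the BPF program to it. Returns the perf event fd, or -1
 * on failure. prog_fd is assumed to refer to a loaded
 * BPF_PROG_TYPE_PERF_EVENT program. */
int attach_on_cpu(int cpu, int prog_fd, unsigned long long freq_hz)
{
    struct perf_event_attr attr = make_cpu_clock_attr(freq_hz);
    int fd = (int)syscall(SYS_perf_event_open, &attr, -1, cpu, -1,
                          PERF_FLAG_FD_CLOEXEC);

    if (fd < 0)
        return -1;
    if (ioctl(fd, PERF_EVENT_IOC_SET_BPF, prog_fd) < 0 ||
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A caller would invoke attach_on_cpu(cpu, prog_fd, 20) once for every
online CPU and keep the returned descriptors open for as long as sampling
should continue. Opening a CPU-wide event like this needs sufficient
privileges (for example CAP_PERFMON or CAP_SYS_ADMIN).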


We also have QEMU guest memory dumps available, if that would help
investigate the issue.


- florian





Bug#1033398: linux-image-amd64: reproducible kernel freeze on 5.19+

2023-03-24 Thread Diederik de Haas
Control: reassign -1 src:linux 6.1.20-1

On Friday, 24 March 2023 12:44:33 CET Tim Rühsen wrote:
> Package: linux-image-amd64
> Version: 6.1.20-1
> 
> We run a privileged eBPF based tool with a communication between kernel and
> user space. It runs without issues on kernels 4.15 to 5.18.
> On kernels 5.19+, the whole system freezes after a few minutes.

Via https://snapshot.debian.org/binary/linux-image-amd64/ you can easily test 
various kernel versions. Could you try whether 5.19~rc4-1~exp1 indeed produces 
the problem?

> Since the running program is rather complex, it is not easily possible to
> carve out a small reproducer. We can provide gdb backtraces from freezes
> inside qemu.

Someone else would have to chime in for the backtraces; that's beyond my skill 
set.
Verifying in which kernel version/commit the issue started is still useful.



Bug#1033398: linux-image-amd64: reproducible kernel freeze on 5.19+

2023-03-24 Thread Tim Rühsen
Package: linux-image-amd64
Version: 6.1.20-1
Severity: important
X-Debbugs-Cc: tim.rueh...@gmx.de

Dear Maintainer,

   * What led up to the situation?

We run a privileged eBPF-based tool that communicates between kernel and
user space.
It runs without issues on kernels 4.15 to 5.18.
On kernels 5.19+, the whole system freezes after a few minutes.
The freeze seems to happen earlier with more system activity (load, forks).
The underlying hardware seems to play no role: we could reproduce this on
different bare metal systems as well as within a QEMU-based VM.

Since the running program is rather complex, it is not easy to carve
out a small reproducer.
We can provide gdb backtraces from freezes inside QEMU.


-- System Information:
Debian Release: 12.0
  APT prefers testing-security
  APT policy: (500, 'testing-security'), (500, 'testing-debug'), (500, 'unstable'), (500, 'testing'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 6.1.0-7-amd64 (SMP w/20 CPU threads; PREEMPT)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=locale: Cannot set 
LC_ALL to default locale: No such file or directory
UTF-8), LANGUAGE=en_US:en
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages linux-image-amd64 depends on:
ii  linux-image-6.1.0-7-amd64  6.1.20-1

linux-image-amd64 recommends no packages.

linux-image-amd64 suggests no packages.

-- debconf information:
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = "en_US:en",
LC_ALL = (unset),
LC_TIME = "en_DE.UTF-8",
LC_MONETARY = "en_DE.UTF-8",
LC_COLLATE = "en_DE.UTF-8",
LANG = "en_US.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to a fallback locale ("en_US.UTF-8").
locale: Cannot set LC_ALL to default locale: No such file or directory
(gdb) thread apply all bt full

Thread 8 (Thread 1.8 (CPU#7 [running])):
#0  arch_atomic_read (v=0x837c2b4c ) at arch/x86/include/asm/atomic.h:29
No locals.
#1  atomic_read (v=0x837c2b4c ) at include/linux/atomic/atomic-instrumented.h:28
No locals.
#2  pv_hybrid_queued_unfair_trylock (lock=0x837c2b4c ) at kernel/locking/qspinlock_paravirt.h:88
        val =
#3  __pv_queued_spin_lock_slowpath (lock=0x837c2b4c , val=) at kernel/locking/qspinlock.c:446
        prev =
        next =
        node = 0x88813bdf1b40
        old =
        tail = 2097152
        idx = 0
        queue =
        cnt =
        __PTR =
        VAL =
        _val =
        __PTR =
        VAL =
        __vpp_verify =
        _val =
        __PTR =
        VAL =
        __vpp_verify =
        pao_ID__ =
        pao_tmp__ =
        pto_val__ =
        pto_tmp__ =
        pao_ID__ =
        pao_tmp__ =
        pto_val__ =
        pto_tmp__ =
        pao_ID__ =
        pao_tmp__ =
        pto_val__ =
        pto_tmp__ =
#4  0x81a2b6f0 in pv_queued_spin_lock_slowpath (val=7, lock=0x837c2b4c ) at arch/x86/include/asm/paravirt.h:591
        __esi =
        __edx =
        __edi =
        __ecx =
        __eax =
#5  queued_spin_lock_slowpath (val=7, lock=0x837c2b4c ) at arch/x86/include/asm/qspinlock.h:51
No locals.
#6  queued_spin_lock (lock=0x837c2b4c ) at include/asm-generic/qspinlock.h:114
        val = 7
        val =
#7  do_raw_spin_lock (lock=0x837c2b4c ) at include/linux/spinlock.h:186
No locals.
#8  __raw_spin_lock (lock=0x837c2b4c ) at include/linux/spinlock_api_smp.h:134
No locals.
#9  _raw_spin_lock (lock=lock@entry=0x837c2b4c ) at kernel/locking/spinlock.c:154
No locals.
#10 0x812e1ba7 in spin_lock (lock=0x837c2b4c ) at include/linux/spinlock.h:350
No locals.
#11 alloc_vmap_area (size=size@entry=20480, align=align@entry=16384, vstart=vstart@entry=18446683600570023936, vend=vend@entry=18446718784942112767, node=node@entry=-1, gfp_mask=3264, gfp_mask@entry=3520) at mm/vmalloc.c:1634
        va = 0x88802dbb05c0
        freed = 0
        addr =
        purged = 0
        ret =
        retry =
        __func__ = "alloc_vmap_area"
#12 0x812e2111 in __get_vm_area_node (size=20480, size@entry=16384, align=align@entry=16384, shift=shift@entry=12, flags=flags@entry=34, start=start@entry=18446683600570023936, end=end@entry=18446718784942112767, node=-1, gfp_mask=3520, caller=0x8109ad0f ) at mm/vmalloc.c:2501
        va =
        area = 0x888113d8dfc0
        requested_size = 16384
#13 0x812e52c4 in __vmalloc_node_range (size=, size@entry=16384, align=align@entry=16384, start=, end=, gfp_mask=gfp_mask@entry=3520, prot=..., vm_flags=, node=, caller=) at mm/vmalloc.c:3173
        area =
        ret =
        kasan_flags =
        real_size = 16384
        real_align = 16384
        shift = 12
        again =
#14