Re: crash in timerfd building pandoc / ghc94 related

2023-02-05 Thread PHO

On 2/6/23 8:54 AM, matthew green wrote:
> hi folks.
>
>
> i saw a report about ghc94 related crashes, and while it's easy
> to build ghc94 itself, it's also easy to trigger a crash by having
> packages use it.  for me 'pandoc' wants a bunch of hs-* pkgs and
> i had crashes in 2 separate ones.
>
> i added some additional logging to the failed assert to confirm
> what part of it is failing.  here's the panic and stack:
>
> [ 2875.6028592] panic: kernel diagnostic assertion "c->c_cpu->cc_lwp == curlwp
> || c->c_cpu->cc_active != c" failed: file "/usr/src/sys/kern/kern_timeout.c",
> line 381 running callout 0xfaa403b50e00: c_func (0x80f53893)
> c_flags (0x100) c_active (0xfaa403b50e00) cc_lwp (0xfab1b4bba080)
> destroyed from 0x80fa0d89
>
> breakpoint() at netbsd:breakpoint+0x5
> vpanic() at netbsd:vpanic+0x183
> kern_assert() at netbsd:kern_assert+0x4b
> callout_destroy() at netbsd:callout_destroy+0xbc
> timerfd_fop_close() at netbsd:timerfd_fop_close+0x36
> closef() at netbsd:closef+0x60
> fd_close() at netbsd:fd_close+0x138
> sys_close() at netbsd:sys_close+0x22
> syscall() at netbsd:syscall+0x196
> --- syscall (number 6) ---
>
>
> as you can see, "c_active" is "c", and cc_lwp is not curlwp, so
> the assert triggers.  the active lwp is a softint thread:
>
> db{1}> bt/a 0xfab1b4bba080
> trace: pid 0 lid 5 at 0xa990969120e0
> softint_dispatch() at netbsd:softint_dispatch+0x1ba
> DDB lost frame for netbsd:Xsoftintr+0x4c, trying 0xa990969120f0
> Xsoftintr() at netbsd:Xsoftintr+0x4c
> --- interrupt ---
>
> this softint_dispatch() address is:
>
> (gdb) l *(softint_dispatch+0x1ba)
> 0x80f45c4b is in softint_dispatch (/usr/src/sys/kern/kern_softint.c:623).
> 621 PSREF_DEBUG_BARRIER();
> 622
> 623 CPU_COUNT(CPU_COUNT_NSOFT, 1);
>
> and the actual address is a "test" instruction, so it seems that
> this lwp was interrupted by the panic and saved at this point of
> execution.  so the assert is firing because the callout is both
> currently about to run _and_ being destroyed.

Thank you for your analysis. I tried to make a small test case to
reproduce the issue, but so far without success. This is basically
what GHC 9.4 does:


https://gist.github.com/depressed-pho/5d117dbca872ef7c28ee7786e0ad8a8a

But this code does not trigger the panic.


crash in timerfd building pandoc / ghc94 related

2023-02-05 Thread matthew green
hi folks.


i saw a report about ghc94 related crashes, and while it's easy
to build ghc94 itself, it's also easy to trigger a crash by having
packages use it.  for me 'pandoc' wants a bunch of hs-* pkgs and
i had crashes in 2 separate ones.

i added some additional logging to the failed assert to confirm
what part of it is failing.  here's the panic and stack:

[ 2875.6028592] panic: kernel diagnostic assertion "c->c_cpu->cc_lwp == curlwp 
|| c->c_cpu->cc_active != c" failed: file "/usr/src/sys/kern/kern_timeout.c", 
line 381 running callout 0xfaa403b50e00: c_func (0x80f53893) 
c_flags (0x100) c_active (0xfaa403b50e00) cc_lwp (0xfab1b4bba080) 
destroyed from 0x80fa0d89

breakpoint() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x183
kern_assert() at netbsd:kern_assert+0x4b
callout_destroy() at netbsd:callout_destroy+0xbc
timerfd_fop_close() at netbsd:timerfd_fop_close+0x36
closef() at netbsd:closef+0x60
fd_close() at netbsd:fd_close+0x138
sys_close() at netbsd:sys_close+0x22
syscall() at netbsd:syscall+0x196
--- syscall (number 6) ---


as you can see, "c_active" is "c", and cc_lwp is not curlwp, so
the assert triggers.  the active lwp is a softint thread:

db{1}> bt/a 0xfab1b4bba080
trace: pid 0 lid 5 at 0xa990969120e0
softint_dispatch() at netbsd:softint_dispatch+0x1ba
DDB lost frame for netbsd:Xsoftintr+0x4c, trying 0xa990969120f0
Xsoftintr() at netbsd:Xsoftintr+0x4c
--- interrupt ---

this softint_dispatch() address is:

(gdb) l *(softint_dispatch+0x1ba)
0x80f45c4b is in softint_dispatch (/usr/src/sys/kern/kern_softint.c:623).
621 PSREF_DEBUG_BARRIER();
622
623 CPU_COUNT(CPU_COUNT_NSOFT, 1);

and the actual address is a "test" instruction, so it seems that
this lwp was interrupted by the panic and saved at this point of
execution.  so the assert is firing because the callout is both
currently about to run _and_ being destroyed.
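
to make the failing condition concrete, here is a small userland model of the assertion quoted in the panic message. the struct definitions are simplified, hypothetical stand-ins that carry only the fields named in the assertion (they are not the real kernel definitions), and the lwp ids are only labels.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for the kernel structures. */
struct lwp {
	int id;
};

struct callout;

struct callout_cpu {
	struct lwp *cc_lwp;		/* lwp running callouts on this CPU */
	struct callout *cc_active;	/* callout currently active, or NULL */
};

struct callout {
	struct callout_cpu *c_cpu;
};

/* The condition from the panic message, evaluated for a given curlwp. */
static bool
destroy_assert_holds(const struct callout *c, const struct lwp *curlwp)
{
	assert(c != NULL && c->c_cpu != NULL);
	return c->c_cpu->cc_lwp == curlwp || c->c_cpu->cc_active != c;
}

/*
 * Model of the reported state: cc_active is the callout being destroyed
 * and cc_lwp is the softint lwp, while the lwp calling close() is a
 * different one.  Returns whether the assertion would hold.
 */
static bool
reported_state_holds(void)
{
	struct lwp softint = { 5 };	/* "lid 5" in the ddb trace */
	struct lwp closer = { 1 };	/* hypothetical id for the closing lwp */
	struct callout_cpu cc;
	struct callout c = { &cc };

	cc.cc_lwp = &softint;
	cc.cc_active = &c;
	return destroy_assert_holds(&c, &closer);
}
```

in this model reported_state_holds() returns false, matching the panic: both halves of the "||" are false at once. if cc_active were anything other than c, or if the destroying lwp were the softint lwp itself, the condition would hold.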


this is what i've learned about this so far.


.mrg.