Re: crash in timerfd building pandoc / ghc94 related
On Mon, Feb 06, 2023 at 09:19:26PM -0500, Mouse wrote:
 > Perhaps the simplest is
 >
 > dd if=/dev/urandom bs=65536 of=/dev/mem
 >
 > but there are others.
 >
 > Yet I can't help feeling that there is some sense in which it *is* fair
 > to say that userland should never be able to crash the kernel. I have
 > been mulling over this paradox for some time but have not come up with
 > an alternative phrasing that avoids the reasonable crashes while still
 > capturing a significant fraction of the useful meaning.

Arguably, /dev/mem just shouldn't exist...

I'm not sure there are actually reasonable crashes.

--
David A. Holland
dholl...@netbsd.org
re: crash in timerfd building pandoc / ghc94 related
> dd if=/dev/urandom bs=65536 of=/dev/mem

FWIW, securelevel > 0 fixes this issue.

so, perhaps you can rephrase by including something about
correct separation of privs, since root write-access to
/dev/mem is literally giving it kernel-level privs.

.mrg.
Re: crash in timerfd building pandoc / ghc94 related
>> It seems so far, from not really paying attention, that there is
>> nothing wrong with ghc but that there is a bug in the kernel.
> Yes of course no userland code should be able to crash the kernel :D

I used to think so. Then it occurred to me that there are various ways
for userland to crash the kernel which are perfectly reasonable, where
of course "reasonable" is a vague term, meaning maybe something like "I
don't think they indicate anything in need of fixing".

Perhaps the simplest is

dd if=/dev/urandom bs=65536 of=/dev/mem

but there are others.

Yet I can't help feeling that there is some sense in which it *is* fair
to say that userland should never be able to crash the kernel. I have
been mulling over this paradox for some time but have not come up with
an alternative phrasing that avoids the reasonable crashes while still
capturing a significant fraction of the useful meaning.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
Re: crash in timerfd building pandoc / ghc94 related
Committed a workaround to lang/ghc94. I hope it can avoid the panic. You
can remove the workaround simply by deleting lang/ghc94/hacks.mk.

On 2/7/23 12:36 AM, Greg Troxel wrote:
> PHO writes:
>
>> On 2/6/23 5:27 PM, Nikita Ronja Gillmann wrote:
>>> I encountered this on some version of 10.99.2 and last night again on
>>> 10.99.2 from Friday morning.
>>> This is an obvious blocker for me for making 9.4.4 the default.
>>> I propose to either revert to the last version or make the default GHC
>>> version settable.
>>
>> I wish I could do the latter, but unfortunately not all Haskell
>> packages are buildable with 2 major versions of GHC at the same time
>> (most are, but there are a few exceptions).
>>
>> Alternatively, I think I can patch GHC 9.4 so that it won't use
>> timerfd. It appears to be an optional feature after all; if its
>> ./configure doesn't find timerfd it won't use it. Let me try that.
>
> If it's possible to only do this on NetBSD 10.99, that would be good.

Yeah I did exactly that.

> It seems so far, from not really paying attention, that there is
> nothing wrong with ghc but that there is a bug in the kernel.
>
> It would also be good to get a reproduction recipe without haskell.

Yes of course no userland code should be able to crash the kernel :D
Re: crash in timerfd building pandoc / ghc94 related
PHO writes:

> On 2/6/23 5:27 PM, Nikita Ronja Gillmann wrote:
>> I encountered this on some version of 10.99.2 and last night again on
>> 10.99.2 from Friday morning.
>> This is an obvious blocker for me for making 9.4.4 the default.
>> I propose to either revert to the last version or make the default GHC
>> version settable.
>
> I wish I could do the latter, but unfortunately not all Haskell
> packages are buildable with 2 major versions of GHC at the same time
> (most are, but there are a few exceptions).
>
> Alternatively, I think I can patch GHC 9.4 so that it won't use
> timerfd. It appears to be an optional feature after all; if its
> ./configure doesn't find timerfd it won't use it. Let me try that.

If it's possible to only do this on NetBSD 10.99, that would be good.

It seems so far, from not really paying attention, that there is
nothing wrong with ghc but that there is a bug in the kernel.

It would also be good to get a reproduction recipe without haskell.
Re: crash in timerfd building pandoc / ghc94 related
PHO transcribed 0.7K bytes:
> On 2/6/23 5:27 PM, Nikita Ronja Gillmann wrote:
>> I encountered this on some version of 10.99.2 and last night again on
>> 10.99.2 from Friday morning.
>> This is an obvious blocker for me for making 9.4.4 the default.
>> I propose to either revert to the last version or make the default GHC
>> version settable.
>
> I wish I could do the latter, but unfortunately not all Haskell
> packages are buildable with 2 major versions of GHC at the same time
> (most are, but there are a few exceptions).

Okay that makes sense.

> Alternatively, I think I can patch GHC 9.4 so that it won't use
> timerfd. It appears to be an optional feature after all; if its
> ./configure doesn't find timerfd it won't use it. Let me try that.

thanks!
Re: crash in timerfd building pandoc / ghc94 related
On 2/6/23 5:27 PM, Nikita Ronja Gillmann wrote:
> I encountered this on some version of 10.99.2 and last night again on
> 10.99.2 from Friday morning.
> This is an obvious blocker for me for making 9.4.4 the default.
> I propose to either revert to the last version or make the default GHC
> version settable.

I wish I could do the latter, but unfortunately not all Haskell packages
are buildable with 2 major versions of GHC at the same time (most are,
but there are a few exceptions).

Alternatively, I think I can patch GHC 9.4 so that it won't use timerfd.
It appears to be an optional feature after all; if its ./configure
doesn't find timerfd it won't use it. Let me try that.
Re: crash in timerfd building pandoc / ghc94 related
I encountered this on some version of 10.99.2 and last night again on
10.99.2 from Friday morning.
This is an obvious blocker for me for making 9.4.4 the default.
I propose to either revert to the last version or make the default GHC
version settable.

PHO transcribed 2.3K bytes:
> On 2/6/23 8:54 AM, matthew green wrote:
>> hi folks.
>>
>> i saw a report about ghc94 related crashes, and while it's easy
>> to build ghc94 itself, it's easy to trigger a crash by having
>> packages use it. for me 'pandoc' wants a bunch of hs-* pkgs and
>> i had crashes in 2 separate ones.
>>
>> i added some additional logging to the failed assert to confirm
>> what part of it is failing. here's the panic and stack:
>>
>> [ 2875.6028592] panic: kernel diagnostic assertion "c->c_cpu->cc_lwp == curlwp || c->c_cpu->cc_active != c" failed: file "/usr/src/sys/kern/kern_timeout.c", line 381 running callout 0xfaa403b50e00: c_func (0x80f53893) c_flags (0x100) c_active (0xfaa403b50e00) cc_lwp (0xfab1b4bba080) destroyed from 0x80fa0d89
>>
>> breakpoint() at netbsd:breakpoint+0x5
>> vpanic() at netbsd:vpanic+0x183
>> kern_assert() at netbsd:kern_assert+0x4b
>> callout_destroy() at netbsd:callout_destroy+0xbc
>> timerfd_fop_close() at netbsd:timerfd_fop_close+0x36
>> closef() at netbsd:closef+0x60
>> fd_close() at netbsd:fd_close+0x138
>> sys_close() at netbsd:sys_close+0x22
>> syscall() at netbsd:syscall+0x196
>> --- syscall (number 6) ---
>>
>> as you can see, "c_active" is "c", and cc_lwp is not curlwp, so
>> the assert triggers. the active lwp is a softint thread:
>>
>> db{1}> bt/a 0xfab1b4bba080
>> trace: pid 0 lid 5 at 0xa990969120e0
>> softint_dispatch() at netbsd:softint_dispatch+0x1ba
>> DDB lost frame for netbsd:Xsoftintr+0x4c, trying 0xa990969120f0
>> Xsoftintr() at netbsd:Xsoftintr+0x4c
>> --- interrupt ---
>>
>> this softint_dispatch() address is:
>>
>> (gdb) l *(softint_dispatch+0x1ba)
>> 0x80f45c4b is in softint_dispatch (/usr/src/sys/kern/kern_softint.c:623).
>> 621             PSREF_DEBUG_BARRIER();
>> 622
>> 623             CPU_COUNT(CPU_COUNT_NSOFT, 1);
>>
>> and the actual address is a "test" instruction, so it seems that
>> this lwp was interrupted by the panic and saved at this point of
>> execution. so the assert is firing because the callout is both
>> currently about to run _and_ being destroyed.
>
> Thank you for your analysis. I tried to make a small test case to
> reproduce the issue but so far without success. This is what GHC 9.4
> basically does:
> https://gist.github.com/depressed-pho/5d117dbca872ef7c28ee7786e0ad8a8a
>
> But this code does not trigger the panic.
Re: crash in timerfd building pandoc / ghc94 related
On 2/6/23 8:54 AM, matthew green wrote:
> hi folks.
>
> i saw a report about ghc94 related crashes, and while it's easy
> to build ghc94 itself, it's easy to trigger a crash by having
> packages use it. for me 'pandoc' wants a bunch of hs-* pkgs and
> i had crashes in 2 separate ones.
>
> i added some additional logging to the failed assert to confirm
> what part of it is failing. here's the panic and stack:
>
> [ 2875.6028592] panic: kernel diagnostic assertion "c->c_cpu->cc_lwp == curlwp || c->c_cpu->cc_active != c" failed: file "/usr/src/sys/kern/kern_timeout.c", line 381 running callout 0xfaa403b50e00: c_func (0x80f53893) c_flags (0x100) c_active (0xfaa403b50e00) cc_lwp (0xfab1b4bba080) destroyed from 0x80fa0d89
>
> breakpoint() at netbsd:breakpoint+0x5
> vpanic() at netbsd:vpanic+0x183
> kern_assert() at netbsd:kern_assert+0x4b
> callout_destroy() at netbsd:callout_destroy+0xbc
> timerfd_fop_close() at netbsd:timerfd_fop_close+0x36
> closef() at netbsd:closef+0x60
> fd_close() at netbsd:fd_close+0x138
> sys_close() at netbsd:sys_close+0x22
> syscall() at netbsd:syscall+0x196
> --- syscall (number 6) ---
>
> as you can see, "c_active" is "c", and cc_lwp is not curlwp, so
> the assert triggers. the active lwp is a softint thread:
>
> db{1}> bt/a 0xfab1b4bba080
> trace: pid 0 lid 5 at 0xa990969120e0
> softint_dispatch() at netbsd:softint_dispatch+0x1ba
> DDB lost frame for netbsd:Xsoftintr+0x4c, trying 0xa990969120f0
> Xsoftintr() at netbsd:Xsoftintr+0x4c
> --- interrupt ---
>
> this softint_dispatch() address is:
>
> (gdb) l *(softint_dispatch+0x1ba)
> 0x80f45c4b is in softint_dispatch (/usr/src/sys/kern/kern_softint.c:623).
> 621             PSREF_DEBUG_BARRIER();
> 622
> 623             CPU_COUNT(CPU_COUNT_NSOFT, 1);
>
> and the actual address is a "test" instruction, so it seems that
> this lwp was interrupted by the panic and saved at this point of
> execution. so the assert is firing because the callout is both
> currently about to run _and_ being destroyed.

Thank you for your analysis. I tried to make a small test case to
reproduce the issue but so far without success. This is what GHC 9.4
basically does:
https://gist.github.com/depressed-pho/5d117dbca872ef7c28ee7786e0ad8a8a

But this code does not trigger the panic.
crash in timerfd building pandoc / ghc94 related
hi folks.

i saw a report about ghc94 related crashes, and while it's easy
to build ghc94 itself, it's easy to trigger a crash by having
packages use it. for me 'pandoc' wants a bunch of hs-* pkgs and
i had crashes in 2 separate ones.

i added some additional logging to the failed assert to confirm
what part of it is failing. here's the panic and stack:

[ 2875.6028592] panic: kernel diagnostic assertion "c->c_cpu->cc_lwp == curlwp || c->c_cpu->cc_active != c" failed: file "/usr/src/sys/kern/kern_timeout.c", line 381 running callout 0xfaa403b50e00: c_func (0x80f53893) c_flags (0x100) c_active (0xfaa403b50e00) cc_lwp (0xfab1b4bba080) destroyed from 0x80fa0d89

breakpoint() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x183
kern_assert() at netbsd:kern_assert+0x4b
callout_destroy() at netbsd:callout_destroy+0xbc
timerfd_fop_close() at netbsd:timerfd_fop_close+0x36
closef() at netbsd:closef+0x60
fd_close() at netbsd:fd_close+0x138
sys_close() at netbsd:sys_close+0x22
syscall() at netbsd:syscall+0x196
--- syscall (number 6) ---

as you can see, "c_active" is "c", and cc_lwp is not curlwp, so
the assert triggers. the active lwp is a softint thread:

db{1}> bt/a 0xfab1b4bba080
trace: pid 0 lid 5 at 0xa990969120e0
softint_dispatch() at netbsd:softint_dispatch+0x1ba
DDB lost frame for netbsd:Xsoftintr+0x4c, trying 0xa990969120f0
Xsoftintr() at netbsd:Xsoftintr+0x4c
--- interrupt ---

this softint_dispatch() address is:

(gdb) l *(softint_dispatch+0x1ba)
0x80f45c4b is in softint_dispatch (/usr/src/sys/kern/kern_softint.c:623).
621             PSREF_DEBUG_BARRIER();
622
623             CPU_COUNT(CPU_COUNT_NSOFT, 1);

and the actual address is a "test" instruction, so it seems that
this lwp was interrupted by the panic and saved at this point of
execution. so the assert is firing because the callout is both
currently about to run _and_ being destroyed.

this is what i've learned about this so far.

.mrg.
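
To make the failing condition concrete, the following userland sketch evaluates the asserted expression against the state reported in the panic. The structures are simplified stand-ins that model only the fields named in the assertion, not the real definitions from kern_timeout.c, and the lid used for the closing thread is a made-up value.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins: only the fields the assertion mentions. */
struct lwp {
	int lid;
};

struct callout;

struct callout_cpu {
	struct callout *cc_active;	/* callout currently being run, if any */
	struct lwp *cc_lwp;		/* lwp that is running it */
};

struct callout {
	struct callout_cpu *c_cpu;
};

/*
 * The asserted condition from the panic message:
 *   c->c_cpu->cc_lwp == curlwp || c->c_cpu->cc_active != c
 * Returns true when the assertion would hold.
 */
bool
destroy_assertion_holds(const struct callout *c, const struct lwp *curlwp)
{
	return c->c_cpu->cc_lwp == curlwp || c->c_cpu->cc_active != c;
}

/*
 * State from the report: cc_active is the callout being destroyed, and
 * cc_lwp is the softint lwp (lid 5), not the thread calling close().
 */
bool
panic_state_holds(void)
{
	struct lwp softint = { 5 };
	struct lwp closer = { 100 };	/* hypothetical lid */
	struct callout_cpu cc;
	struct callout c = { &cc };

	cc.cc_active = &c;
	cc.cc_lwp = &softint;
	return destroy_assertion_holds(&c, &closer);
}

/* Same callout with nothing running on its CPU: destroying it is fine. */
bool
idle_state_holds(void)
{
	struct lwp softint = { 5 };
	struct lwp closer = { 100 };	/* hypothetical lid */
	struct callout_cpu cc;
	struct callout c = { &cc };

	cc.cc_active = NULL;
	cc.cc_lwp = &softint;
	return destroy_assertion_holds(&c, &closer);
}
```

panic_state_holds() evaluates to false, matching the report: both halves of the "or" fail because the callout is the active one and a different lwp is running it. idle_state_holds() evaluates to true.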