Re: crash in timerfd building pandoc / ghc94 related

2023-02-07 Thread David Holland
On Mon, Feb 06, 2023 at 09:19:26PM -0500, Mouse wrote:
 > Perhaps the simplest is
 > 
 > dd if=/dev/urandom bs=65536 of=/dev/mem
 > 
 > but there are others.
 > 
 > Yet I can't help feeling that there is some sense in which it *is* fair
 > to say that userland should never be able to crash the kernel.  I have
 > been mulling over this paradox for some time but have not come up with
 > an alternative phrasing that avoids the reasonable crashes while still
 > capturing a significant fraction of the useful meaning.

Arguably, /dev/mem just shouldn't exist... I'm not sure there are
actually reasonable crashes.

-- 
David A. Holland
dholl...@netbsd.org


re: crash in timerfd building pandoc / ghc94 related

2023-02-06 Thread matthew green
> dd if=/dev/urandom bs=65536 of=/dev/mem

FWIW, securelevel > 0 fixes this issue.  so, perhaps you
can rephrase by including something about correct separation
of privs, since root write-access to /dev/mem is literally
giving it kernel-level privs.


.mrg.


Re: crash in timerfd building pandoc / ghc94 related

2023-02-06 Thread Mouse
>> It seems so far, from not really paying attention, that there is
>> nothing wrong with ghc but that there is a bug in the kernel.
> Yes of course no userland code should be able to crash the kernel :D

I used to think so.  Then it occurred to me that there are various ways
for userland to crash the kernel which are perfectly reasonable, where
of course "reasonable" is a vague term, meaning maybe something like "I
don't think they indicate anything in need of fixing".  Perhaps the
simplest is

dd if=/dev/urandom bs=65536 of=/dev/mem

but there are others.

Yet I can't help feeling that there is some sense in which it *is* fair
to say that userland should never be able to crash the kernel.  I have
been mulling over this paradox for some time but have not come up with
an alternative phrasing that avoids the reasonable crashes while still
capturing a significant fraction of the useful meaning.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: crash in timerfd building pandoc / ghc94 related

2023-02-06 Thread PHO
Committed a workaround to lang/ghc94. I hope it can avoid the panic. You 
can remove the workaround simply by deleting lang/ghc94/hacks.mk.



On 2/7/23 12:36 AM, Greg Troxel wrote:
> PHO writes:
>
>> On 2/6/23 5:27 PM, Nikita Ronja Gillmann wrote:
>>> I encountered this on some version of 10.99.2 and last night again on
>>> 10.99.2 from Friday morning.
>>> This is an obvious blocker for me for making 9.4.4 the default.
>>> I propose to either revert to the last version or make the default GHC
>>> version settable.
>>
>> I wish I could do the latter, but unfortunately not all Haskell
>> packages are buildable with 2 major versions of GHC at the same time
>> (most are, but there are a few exceptions).
>>
>> Alternatively, I think I can patch GHC 9.4 so that it won't use
>> timerfd. It appears to be an optional feature after all; if its
>> ./configure doesn't find timerfd it won't use it. Let me try that.
>
> If it's possible to only do this on NetBSD 10.99, that would be good.

Yeah I did exactly that.

> It seems so far, from not really paying attention, that there is nothing
> wrong with ghc but that there is a bug in the kernel.   It would also
> be good to get a reproduction recipe without haskell.

Yes of course no userland code should be able to crash the kernel :D


Re: crash in timerfd building pandoc / ghc94 related

2023-02-06 Thread Greg Troxel
PHO writes:

> On 2/6/23 5:27 PM, Nikita Ronja Gillmann wrote:
>> I encountered this on some version of 10.99.2 and last night again on
>> 10.99.2 from Friday morning.
>> This is an obvious blocker for me for making 9.4.4 the default.
>> I propose to either revert to the last version or make the default GHC
>> version settable.
>
> I wish I could do the latter, but unfortunately not all Haskell
> packages are buildable with 2 major versions of GHC at the same time
> (most are, but there are a few exceptions).
>
> Alternatively, I think I can patch GHC 9.4 so that it won't use
> timerfd. It appears to be an optional feature after all; if its
> ./configure doesn't find timerfd it won't use it. Let me try that.

If it's possible to only do this on NetBSD 10.99, that would be good.
It seems so far, from not really paying attention, that there is nothing
wrong with ghc but that there is a bug in the kernel.   It would also
be good to get a reproduction recipe without haskell.


Re: crash in timerfd building pandoc / ghc94 related

2023-02-06 Thread Nikita Ronja Gillmann

PHO transcribed 0.7K bytes:

> On 2/6/23 5:27 PM, Nikita Ronja Gillmann wrote:
>> I encountered this on some version of 10.99.2 and last night again on
>> 10.99.2 from Friday morning.
>> This is an obvious blocker for me for making 9.4.4 the default.
>> I propose to either revert to the last version or make the default GHC
>> version settable.
>
> I wish I could do the latter, but unfortunately not all Haskell
> packages are buildable with 2 major versions of GHC at the same time
> (most are, but there are a few exceptions).

Okay that makes sense.

> Alternatively, I think I can patch GHC 9.4 so that it won't use
> timerfd. It appears to be an optional feature after all; if its
> ./configure doesn't find timerfd it won't use it. Let me try that.


thanks!


Re: crash in timerfd building pandoc / ghc94 related

2023-02-06 Thread PHO

On 2/6/23 5:27 PM, Nikita Ronja Gillmann wrote:
> I encountered this on some version of 10.99.2 and last night again on
> 10.99.2 from Friday morning.
> This is an obvious blocker for me for making 9.4.4 the default.
> I propose to either revert to the last version or make the default GHC
> version settable.

I wish I could do the latter, but unfortunately not all Haskell packages 
are buildable with 2 major versions of GHC at the same time (most are, 
but there are a few exceptions).


Alternatively, I think I can patch GHC 9.4 so that it won't use timerfd. 
It appears to be an optional feature after all; if its ./configure 
doesn't find timerfd it won't use it. Let me try that.


Re: crash in timerfd building pandoc / ghc94 related

2023-02-06 Thread Nikita Ronja Gillmann
I encountered this on some version of 10.99.2 and last night again on 
10.99.2 from Friday morning.

This is an obvious blocker for me for making 9.4.4 the default.
I propose to either revert to the last version or make the default GHC
version settable.

PHO transcribed 2.3K bytes:

> On 2/6/23 8:54 AM, matthew green wrote:
>> hi folks.
>>
>> i saw a report about ghc94 related crashes, and while it's easy
>> to build ghc94 itself, it's easy to trigger a crash by having
>> packages use it.  for me 'pandoc' wants a bunch of hs-* pkgs and
>> i had crashes in 2 separate ones.
>>
>> i added some additional logging to the failed assert to confirm
>> what part of it is failing.  here's the panic and stack:
>>
>> [ 2875.6028592] panic: kernel diagnostic assertion "c->c_cpu->cc_lwp
>> == curlwp || c->c_cpu->cc_active != c" failed: file
>> "/usr/src/sys/kern/kern_timeout.c", line 381 running callout
>> 0xfaa403b50e00: c_func (0x80f53893) c_flags (0x100)
>> c_active (0xfaa403b50e00) cc_lwp (0xfab1b4bba080) destroyed
>> from 0x80fa0d89
>>
>> breakpoint() at netbsd:breakpoint+0x5
>> vpanic() at netbsd:vpanic+0x183
>> kern_assert() at netbsd:kern_assert+0x4b
>> callout_destroy() at netbsd:callout_destroy+0xbc
>> timerfd_fop_close() at netbsd:timerfd_fop_close+0x36
>> closef() at netbsd:closef+0x60
>> fd_close() at netbsd:fd_close+0x138
>> sys_close() at netbsd:sys_close+0x22
>> syscall() at netbsd:syscall+0x196
>> --- syscall (number 6) ---
>>
>> as you can see, "c_active" is "c", and cc_lwp is not curlwp, so
>> the assert triggers.  the active lwp is a softint thread:
>>
>> db{1}> bt/a 0xfab1b4bba080
>> trace: pid 0 lid 5 at 0xa990969120e0
>> softint_dispatch() at netbsd:softint_dispatch+0x1ba
>> DDB lost frame for netbsd:Xsoftintr+0x4c, trying 0xa990969120f0
>> Xsoftintr() at netbsd:Xsoftintr+0x4c
>> --- interrupt ---
>>
>> this softint_dispatch() address is:
>>
>> (gdb) l *(softint_dispatch+0x1ba)
>> 0x80f45c4b is in softint_dispatch
>> (/usr/src/sys/kern/kern_softint.c:623).
>> 621 PSREF_DEBUG_BARRIER();
>> 622
>> 623 CPU_COUNT(CPU_COUNT_NSOFT, 1);
>>
>> and the actual address is a "test" instruction, so it seems that
>> this lwp was interrupted by the panic and saved at this point of
>> execution.  so the assert is firing because the callout is both
>> currently about to run _and_ being destroyed.
>
> Thank you for your analysis. I tried to make a small test case to
> reproduce the issue but so far without success. This is what GHC 9.4
> basically does:
>
> https://gist.github.com/depressed-pho/5d117dbca872ef7c28ee7786e0ad8a8a
>
> But this code does not trigger the panic.


Re: crash in timerfd building pandoc / ghc94 related

2023-02-05 Thread PHO

On 2/6/23 8:54 AM, matthew green wrote:
> hi folks.
>
>
> i saw a report about ghc94 related crashes, and while it's easy
> to build ghc94 itself, it's easy to trigger a crash by having
> packages use it.  for me 'pandoc' wants a bunch of hs-* pkgs and
> i had crashes in 2 separate ones.
>
> i added some additional logging to the failed assert to confirm
> what part of it is failing.  here's the panic and stack:
>
> [ 2875.6028592] panic: kernel diagnostic assertion "c->c_cpu->cc_lwp
> == curlwp || c->c_cpu->cc_active != c" failed: file
> "/usr/src/sys/kern/kern_timeout.c", line 381 running callout
> 0xfaa403b50e00: c_func (0x80f53893) c_flags (0x100) c_active
> (0xfaa403b50e00) cc_lwp (0xfab1b4bba080) destroyed from
> 0x80fa0d89
>
> breakpoint() at netbsd:breakpoint+0x5
> vpanic() at netbsd:vpanic+0x183
> kern_assert() at netbsd:kern_assert+0x4b
> callout_destroy() at netbsd:callout_destroy+0xbc
> timerfd_fop_close() at netbsd:timerfd_fop_close+0x36
> closef() at netbsd:closef+0x60
> fd_close() at netbsd:fd_close+0x138
> sys_close() at netbsd:sys_close+0x22
> syscall() at netbsd:syscall+0x196
> --- syscall (number 6) ---
>
>
> as you can see, "c_active" is "c", and cc_lwp is not curlwp, so
> the assert triggers.  the active lwp is a softint thread:
>
> db{1}> bt/a 0xfab1b4bba080
> trace: pid 0 lid 5 at 0xa990969120e0
> softint_dispatch() at netbsd:softint_dispatch+0x1ba
> DDB lost frame for netbsd:Xsoftintr+0x4c, trying 0xa990969120f0
> Xsoftintr() at netbsd:Xsoftintr+0x4c
> --- interrupt ---
>
> this softint_dispatch() address is:
>
> (gdb) l *(softint_dispatch+0x1ba)
> 0x80f45c4b is in softint_dispatch
> (/usr/src/sys/kern/kern_softint.c:623).
> 621 PSREF_DEBUG_BARRIER();
> 622
> 623 CPU_COUNT(CPU_COUNT_NSOFT, 1);
>
> and the actual address is a "test" instruction, so it seems that
> this lwp was interrupted by the panic and saved at this point of
> execution.  so the assert is firing because the callout is both
> currently about to run _and_ being destroyed.

Thank you for your analysis. I tried to make a small test case to
reproduce the issue but so far without success. This is what GHC 9.4
basically does:


https://gist.github.com/depressed-pho/5d117dbca872ef7c28ee7786e0ad8a8a

But this code does not trigger the panic.


crash in timerfd building pandoc / ghc94 related

2023-02-05 Thread matthew green
hi folks.


i saw a report about ghc94 related crashes, and while it's easy
to build ghc94 itself, it's easy to trigger a crash by having
packages use it.  for me 'pandoc' wants a bunch of hs-* pkgs and
i had crashes in 2 separate ones.

i added some additional logging to the failed assert to confirm
what part of it is failing.  here's the panic and stack:

[ 2875.6028592] panic: kernel diagnostic assertion "c->c_cpu->cc_lwp == curlwp 
|| c->c_cpu->cc_active != c" failed: file "/usr/src/sys/kern/kern_timeout.c", 
line 381 running callout 0xfaa403b50e00: c_func (0x80f53893) 
c_flags (0x100) c_active (0xfaa403b50e00) cc_lwp (0xfab1b4bba080) 
destroyed from 0x80fa0d89

breakpoint() at netbsd:breakpoint+0x5
vpanic() at netbsd:vpanic+0x183
kern_assert() at netbsd:kern_assert+0x4b
callout_destroy() at netbsd:callout_destroy+0xbc
timerfd_fop_close() at netbsd:timerfd_fop_close+0x36
closef() at netbsd:closef+0x60
fd_close() at netbsd:fd_close+0x138
sys_close() at netbsd:sys_close+0x22
syscall() at netbsd:syscall+0x196
--- syscall (number 6) ---


as you can see, "c_active" is "c", and cc_lwp is not curlwp, so
the assert triggers.  the active lwp is a softint thread:

db{1}> bt/a 0xfab1b4bba080
trace: pid 0 lid 5 at 0xa990969120e0
softint_dispatch() at netbsd:softint_dispatch+0x1ba
DDB lost frame for netbsd:Xsoftintr+0x4c, trying 0xa990969120f0
Xsoftintr() at netbsd:Xsoftintr+0x4c
--- interrupt ---

this softint_dispatch() address is:

(gdb) l *(softint_dispatch+0x1ba)
0x80f45c4b is in softint_dispatch 
(/usr/src/sys/kern/kern_softint.c:623).
621 PSREF_DEBUG_BARRIER();
622
623 CPU_COUNT(CPU_COUNT_NSOFT, 1);

and the actual address is a "test" instruction, so it seems that
this lwp was interrupted by the panic and saved at this point of
execution.  so the assert is firing because the callout is both
currently about to run _and_ being destroyed.
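
To make the failing condition concrete, here is a small userland model of
the assertion, fed with the state from the log above.  The struct layouts
are simplified stand-ins that carry only the fields named in the panic
message, not the real kernel definitions, and the helper names
(destroy_assertion_holds, model_reported_state) are made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins; NOT the real kernel structures. */
struct lwp { int l_lid; };

struct callout;

struct callout_cpu {
	struct lwp     *cc_lwp;     /* lwp recorded for callout processing */
	struct callout *cc_active;  /* callout currently being run, or NULL */
};

struct callout {
	struct callout_cpu *c_cpu;
};

/*
 * The condition quoted in the panic message: destroying c passes the
 * check if the caller is the recorded callout lwp, or if c is not the
 * callout currently marked active on its CPU.
 */
bool
destroy_assertion_holds(const struct callout *c, const struct lwp *curlwp)
{
	return c->c_cpu->cc_lwp == curlwp || c->c_cpu->cc_active != c;
}

/*
 * Rebuild the reported state: cc_active points at c, and cc_lwp is the
 * softint thread ("lid 5") rather than the lwp calling close().
 * The lid of the closing lwp is an arbitrary placeholder.
 */
bool
model_reported_state(void)
{
	struct lwp softint = { 5 };
	struct lwp closer = { 1 };
	struct callout_cpu cc = { &softint, NULL };
	struct callout c = { &cc };

	cc.cc_active = &c;
	assert(c.c_cpu->cc_active == &c);  /* model matches "c_active is c" */

	return destroy_assertion_holds(&c, &closer);
}
```

With both halves of the || false, model_reported_state() returns false,
which corresponds to the KASSERT firing in callout_destroy().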


this is what i've learned about this so far.


.mrg.