Re: sys_pselect assertion "timo || _kernel_lock_held()" failed

2023-02-13 Thread Mark Kettenis
> Date: Tue, 14 Feb 2023 07:25:18 +0100
> From: Anton Lindqvist 
> 
> On Tue, Feb 14, 2023 at 01:08:54AM +0100, Alexander Bluhm wrote:
> > Hi,
> > 
> > Today I saw this panic on my i386 regress machine.  
> > 
> > panic: kernel diagnostic assertion "timo || _kernel_lock_held()" failed: 
> > file "/usr/src/sys/kern/kern_synch.c", line 127
> 
> Same here on my amd64 and arm64 machines. Was the poll/select unlock
> ever tested beyond compile-time?

Clearly not.  Please back it out.

> > Looks like src/regress/lib/libc/sys/ triggered it.  Kernel was built
> > from some 2023-02-13 source checkout.  I will keep watching if it
> > happens again.
> > 
> >  run-t_select 
> > cc -O2 -pipe  -std=gnu99  -MD -MP  -c 
> > /usr/src/regress/lib/libc/sys/t_select.c
> > cc   -o t_select t_select.o atf-c.o 
> > ulimit -c unlimited &&  ntests="`./t_select -n`" &&  echo "1..$ntests" &&  
> > tnumbers="`jot -ns' ' - 1 $ntests`" &&  make -C 
> > /usr/src/regress/lib/libc/sys PROG=t_select NUMBERS="$tnumbers" regress
> > 1..2
> >  run-t_select-1 
> > 1 Checks pselect's temporary mask setting when a signal is received (PR 
> > lib/43625)
> > ./t_select -r 1
> > Timeout, server ot4 not responding.
> > 
> > panic: kernel diagnostic assertion "timo || _kernel_lock_held()" failed: 
> > file "/usr/src/sys/kern/kern_synch.c", line 127
> > Stopped at  db_enter+0x4:   popl%ebp
> > TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
> > *494368  36805  0   00x89  t_select
> >  223467  53838  0 0x14000 0x42000  softclock
> > db_enter() at db_enter+0x4
> > panic(d0c60cae) at panic+0x7a
> > __assert(d0cccada,d0c6daa7,7f,d0ce58fd) at __assert+0x19
> > tsleep(d0fcefd0,118,d0d0cf14,0) at tsleep+0x117
> > tsleep_nsec(d0fcefd0,118,d0d0cf14,) at tsleep_nsec+0xcc
> > dopselect(d621da8c,1,cf7db3f8,0,0,0,f5b0ab74,f5b0abc8) at dopselect+0x49c
> > sys_pselect(d621da8c,f5b0abd0,f5b0abc8) at sys_pselect+0xa9
> > syscall(f5b0ac10) at syscall+0x301
> > Xsyscall_untramp() at Xsyscall_untramp+0xa9
> > end of kernel
> > https://www.openbsd.org/ddb.html describes the minimum info required in bug
> > reports.  Insufficient info makes it difficult to find and fix bugs.
> > 
> > ddb{9}> show panic
> > *cpu9: kernel diagnostic assertion "timo || _kernel_lock_held()" failed: 
> > file "/usr/src/sys/kern/kern_synch.c", line 127
> > 
> > ddb{9}> trace
> > db_enter() at db_enter+0x4
> > panic(d0c60cae) at panic+0x7a
> > __assert(d0cccada,d0c6daa7,7f,d0ce58fd) at __assert+0x19
> > tsleep(d0fcefd0,118,d0d0cf14,0) at tsleep+0x117
> > tsleep_nsec(d0fcefd0,118,d0d0cf14,) at tsleep_nsec+0xcc
> > dopselect(d621da8c,1,cf7db3f8,0,0,0,f5b0ab74,f5b0abc8) at dopselect+0x49c
> > sys_pselect(d621da8c,f5b0abd0,f5b0abc8) at sys_pselect+0xa9
> > syscall(f5b0ac10) at syscall+0x301
> > Xsyscall_untramp() at Xsyscall_untramp+0xa9
> > end of kernel
> > 
> > ddb{9}> ps
> >PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
> > *36805  494368  81530  0  7 0x8t_select
> >  81530  403270  40998  0  30x82  nanoslp   t_select
> >  40998   94085  15069  0  30x10008a  sigsusp   make
> >  15069  146328  39109  0  30x10008a  sigsusp   sh
> >  39109  344737  51331  0  30x10008a  sigsusp   make
> >  51331  138704  34007  0  30x10008a  sigsusp   sh
> >  34007   93629  19868  0  30x10008a  sigsusp   make
> >  19868  317272  38049  0  30x10008a  sigsusp   sh
> >  38049  440823   5232  0  30x10008a  sigsusp   make
> >  37773  470707   5644  0  30x100082  piperdgzip
> >   5644   31299   5232  0  30x100082  piperdpax
> >   5232  392807  48080  0  30x82  piperdperl
> >  48080  317705   5394  0  30x10008a  sigsusp   ksh
> >   5394   64950  41772  0  30x9a  kqreadsshd
> >  21685  469765  1  0  30x100083  ttyin getty
> >  54886  396067  1  0  30x100083  ttyin getty
> >  30334   46910  1  0  30x100083  ttyin getty
> >  32613  460954  1  0  30x100083  ttyin getty
> >  98276  290762  1  0  30x100083  ttyin getty
> >  83393  276624  1  0  30x100083  ttyin getty
> >  58262  416339  1  0  30x100098  kqreadcron
> >  63115  378072  1 99  3   0x1100090  kqreadsndiod
> >  16973  434562  1110  30x100090  kqreadsndiod
> >  31914  352102  1  0  30x100090  kqreadinetd
> >  29012  259544  75755 95  3   0x1100092  kqreadsmtpd
> >  75423   90965  75755103  3   0x1100092  kqreadsmtpd
> >  21079  436365  75755 95  3   0x1100092  kqreadsmtpd
> >  50673  244082  75755 95  30x100092  kqreadsmtpd
> >  29210  305613  75755 95  3   0x1100092  kqreadsmtpd
> >  23149  

Re: sys_pselect assertion "timo || _kernel_lock_held()" failed

2023-02-13 Thread Anton Lindqvist
On Tue, Feb 14, 2023 at 01:08:54AM +0100, Alexander Bluhm wrote:
> Hi,
> 
> Today I saw this panic on my i386 regress machine.  
> 
> panic: kernel diagnostic assertion "timo || _kernel_lock_held()" failed: file 
> "/usr/src/sys/kern/kern_synch.c", line 127

Same here on my amd64 and arm64 machines. Was the poll/select unlock
ever tested beyond compile-time?

> Looks like src/regress/lib/libc/sys/ triggered it.  Kernel was built
> from some 2023-02-13 source checkout.  I will keep watching if it
> happens again.
> 
>  run-t_select 
> cc -O2 -pipe  -std=gnu99  -MD -MP  -c /usr/src/regress/lib/libc/sys/t_select.c
> cc   -o t_select t_select.o atf-c.o 
> ulimit -c unlimited &&  ntests="`./t_select -n`" &&  echo "1..$ntests" &&  
> tnumbers="`jot -ns' ' - 1 $ntests`" &&  make -C /usr/src/regress/lib/libc/sys 
> PROG=t_select NUMBERS="$tnumbers" regress
> 1..2
>  run-t_select-1 
> 1 Checks pselect's temporary mask setting when a signal is received (PR 
> lib/43625)
> ./t_select -r 1
> Timeout, server ot4 not responding.
> 
> panic: kernel diagnostic assertion "timo || _kernel_lock_held()" failed: file 
> "/usr/src/sys/kern/kern_synch.c", line 127
> Stopped at  db_enter+0x4:   popl%ebp
> TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
> *494368  36805  0   00x89  t_select
>  223467  53838  0 0x14000 0x42000  softclock
> db_enter() at db_enter+0x4
> panic(d0c60cae) at panic+0x7a
> __assert(d0cccada,d0c6daa7,7f,d0ce58fd) at __assert+0x19
> tsleep(d0fcefd0,118,d0d0cf14,0) at tsleep+0x117
> tsleep_nsec(d0fcefd0,118,d0d0cf14,) at tsleep_nsec+0xcc
> dopselect(d621da8c,1,cf7db3f8,0,0,0,f5b0ab74,f5b0abc8) at dopselect+0x49c
> sys_pselect(d621da8c,f5b0abd0,f5b0abc8) at sys_pselect+0xa9
> syscall(f5b0ac10) at syscall+0x301
> Xsyscall_untramp() at Xsyscall_untramp+0xa9
> end of kernel
> https://www.openbsd.org/ddb.html describes the minimum info required in bug
> reports.  Insufficient info makes it difficult to find and fix bugs.
> 
> ddb{9}> show panic
> *cpu9: kernel diagnostic assertion "timo || _kernel_lock_held()" failed: file 
> "/usr/src/sys/kern/kern_synch.c", line 127
> 
> ddb{9}> trace
> db_enter() at db_enter+0x4
> panic(d0c60cae) at panic+0x7a
> __assert(d0cccada,d0c6daa7,7f,d0ce58fd) at __assert+0x19
> tsleep(d0fcefd0,118,d0d0cf14,0) at tsleep+0x117
> tsleep_nsec(d0fcefd0,118,d0d0cf14,) at tsleep_nsec+0xcc
> dopselect(d621da8c,1,cf7db3f8,0,0,0,f5b0ab74,f5b0abc8) at dopselect+0x49c
> sys_pselect(d621da8c,f5b0abd0,f5b0abc8) at sys_pselect+0xa9
> syscall(f5b0ac10) at syscall+0x301
> Xsyscall_untramp() at Xsyscall_untramp+0xa9
> end of kernel
> 
> ddb{9}> ps
>PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
> *36805  494368  81530  0  7 0x8t_select
>  81530  403270  40998  0  30x82  nanoslp   t_select
>  40998   94085  15069  0  30x10008a  sigsusp   make
>  15069  146328  39109  0  30x10008a  sigsusp   sh
>  39109  344737  51331  0  30x10008a  sigsusp   make
>  51331  138704  34007  0  30x10008a  sigsusp   sh
>  34007   93629  19868  0  30x10008a  sigsusp   make
>  19868  317272  38049  0  30x10008a  sigsusp   sh
>  38049  440823   5232  0  30x10008a  sigsusp   make
>  37773  470707   5644  0  30x100082  piperdgzip
>   5644   31299   5232  0  30x100082  piperdpax
>   5232  392807  48080  0  30x82  piperdperl
>  48080  317705   5394  0  30x10008a  sigsusp   ksh
>   5394   64950  41772  0  30x9a  kqreadsshd
>  21685  469765  1  0  30x100083  ttyin getty
>  54886  396067  1  0  30x100083  ttyin getty
>  30334   46910  1  0  30x100083  ttyin getty
>  32613  460954  1  0  30x100083  ttyin getty
>  98276  290762  1  0  30x100083  ttyin getty
>  83393  276624  1  0  30x100083  ttyin getty
>  58262  416339  1  0  30x100098  kqreadcron
>  63115  378072  1 99  3   0x1100090  kqreadsndiod
>  16973  434562  1110  30x100090  kqreadsndiod
>  31914  352102  1  0  30x100090  kqreadinetd
>  29012  259544  75755 95  3   0x1100092  kqreadsmtpd
>  75423   90965  75755103  3   0x1100092  kqreadsmtpd
>  21079  436365  75755 95  3   0x1100092  kqreadsmtpd
>  50673  244082  75755 95  30x100092  kqreadsmtpd
>  29210  305613  75755 95  3   0x1100092  kqreadsmtpd
>  23149  255222  75755 95  3   0x1100092  kqreadsmtpd
>  75755   17291  1  0  30x100080  kqreadsmtpd
>  93529  264256  66818 91  30x92  kqreadsnmpd_metrics
>  66818  486393  1  0  30x100080  kqreadsnmpd
>  44889  252400  1 91  3  

sys_pselect assertion "timo || _kernel_lock_held()" failed

2023-02-13 Thread Alexander Bluhm
Hi,

Today I saw this panic on my i386 regress machine.  

panic: kernel diagnostic assertion "timo || _kernel_lock_held()" failed: file 
"/usr/src/sys/kern/kern_synch.c", line 127

Looks like src/regress/lib/libc/sys/ triggered it.  Kernel was built
from some 2023-02-13 source checkout.  I will keep watching if it
happens again.

 run-t_select 
cc -O2 -pipe  -std=gnu99  -MD -MP  -c /usr/src/regress/lib/libc/sys/t_select.c
cc   -o t_select t_select.o atf-c.o 
ulimit -c unlimited &&  ntests="`./t_select -n`" &&  echo "1..$ntests" &&  
tnumbers="`jot -ns' ' - 1 $ntests`" &&  make -C /usr/src/regress/lib/libc/sys 
PROG=t_select NUMBERS="$tnumbers" regress
1..2
 run-t_select-1 
1 Checks pselect's temporary mask setting when a signal is received (PR 
lib/43625)
./t_select -r 1
Timeout, server ot4 not responding.

panic: kernel diagnostic assertion "timo || _kernel_lock_held()" failed: file 
"/usr/src/sys/kern/kern_synch.c", line 127
Stopped at  db_enter+0x4:   popl%ebp
TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
*494368  36805  0   00x89  t_select
 223467  53838  0 0x14000 0x42000  softclock
db_enter() at db_enter+0x4
panic(d0c60cae) at panic+0x7a
__assert(d0cccada,d0c6daa7,7f,d0ce58fd) at __assert+0x19
tsleep(d0fcefd0,118,d0d0cf14,0) at tsleep+0x117
tsleep_nsec(d0fcefd0,118,d0d0cf14,) at tsleep_nsec+0xcc
dopselect(d621da8c,1,cf7db3f8,0,0,0,f5b0ab74,f5b0abc8) at dopselect+0x49c
sys_pselect(d621da8c,f5b0abd0,f5b0abc8) at sys_pselect+0xa9
syscall(f5b0ac10) at syscall+0x301
Xsyscall_untramp() at Xsyscall_untramp+0xa9
end of kernel
https://www.openbsd.org/ddb.html describes the minimum info required in bug
reports.  Insufficient info makes it difficult to find and fix bugs.

ddb{9}> show panic
*cpu9: kernel diagnostic assertion "timo || _kernel_lock_held()" failed: file 
"/usr/src/sys/kern/kern_synch.c", line 127

ddb{9}> trace
db_enter() at db_enter+0x4
panic(d0c60cae) at panic+0x7a
__assert(d0cccada,d0c6daa7,7f,d0ce58fd) at __assert+0x19
tsleep(d0fcefd0,118,d0d0cf14,0) at tsleep+0x117
tsleep_nsec(d0fcefd0,118,d0d0cf14,) at tsleep_nsec+0xcc
dopselect(d621da8c,1,cf7db3f8,0,0,0,f5b0ab74,f5b0abc8) at dopselect+0x49c
sys_pselect(d621da8c,f5b0abd0,f5b0abc8) at sys_pselect+0xa9
syscall(f5b0ac10) at syscall+0x301
Xsyscall_untramp() at Xsyscall_untramp+0xa9
end of kernel

ddb{9}> ps
   PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
*36805  494368  81530  0  7 0x8t_select
 81530  403270  40998  0  30x82  nanoslp   t_select
 40998   94085  15069  0  30x10008a  sigsusp   make
 15069  146328  39109  0  30x10008a  sigsusp   sh
 39109  344737  51331  0  30x10008a  sigsusp   make
 51331  138704  34007  0  30x10008a  sigsusp   sh
 34007   93629  19868  0  30x10008a  sigsusp   make
 19868  317272  38049  0  30x10008a  sigsusp   sh
 38049  440823   5232  0  30x10008a  sigsusp   make
 37773  470707   5644  0  30x100082  piperdgzip
  5644   31299   5232  0  30x100082  piperdpax
  5232  392807  48080  0  30x82  piperdperl
 48080  317705   5394  0  30x10008a  sigsusp   ksh
  5394   64950  41772  0  30x9a  kqreadsshd
 21685  469765  1  0  30x100083  ttyin getty
 54886  396067  1  0  30x100083  ttyin getty
 30334   46910  1  0  30x100083  ttyin getty
 32613  460954  1  0  30x100083  ttyin getty
 98276  290762  1  0  30x100083  ttyin getty
 83393  276624  1  0  30x100083  ttyin getty
 58262  416339  1  0  30x100098  kqreadcron
 63115  378072  1 99  3   0x1100090  kqreadsndiod
 16973  434562  1110  30x100090  kqreadsndiod
 31914  352102  1  0  30x100090  kqreadinetd
 29012  259544  75755 95  3   0x1100092  kqreadsmtpd
 75423   90965  75755103  3   0x1100092  kqreadsmtpd
 21079  436365  75755 95  3   0x1100092  kqreadsmtpd
 50673  244082  75755 95  30x100092  kqreadsmtpd
 29210  305613  75755 95  3   0x1100092  kqreadsmtpd
 23149  255222  75755 95  3   0x1100092  kqreadsmtpd
 75755   17291  1  0  30x100080  kqreadsmtpd
 93529  264256  66818 91  30x92  kqreadsnmpd_metrics
 66818  486393  1  0  30x100080  kqreadsnmpd
 44889  252400  1 91  3   0x1100092  kqreadsnmpd
 41772  245133  1  0  30x88  kqreadsshd
 66620  411509  0  0  3 0x14280  nfsidlnfsio
 80915  345110  0  0  3 0x14280  nfsidlnfsio
 10608  431262  0  0  3 0x14280  nfsidlnfsio
 11489  201187  0  0  3 0x14280  nfsidlnfsio

Re: iwx(4) panic on resume after hibernation

2023-02-13 Thread Vitaliy Makkoveev
On Mon, Feb 13, 2023 at 01:31:30PM +0100, Stefan Sperling wrote:
> On Sat, Feb 11, 2023 at 06:32:31PM +0300, Vitaliy Makkoveev wrote:
> > Unfortunately the diff doesn't help. I attached photos with debug
> > information and panic report.
> 
> Based on the command code in the firmware trace the error seems to be
> caused by the driver sending this command:
> 
> #define IWX_ADD_STA_KEY   0x17
> 
> It seems the set_key task is running while the driver is in INIT state?
> That is certainly not intended, and it is unclear why this is happening.
> 
> What is the effect of this diff? Do you see the printfs added here?
> 

Still panics. The only printf from your diff I see is "iwx_setkey_task:
in state RUN".

I suspect I found the problem. I made the diff below. It introduced some
debug output to catch thread which sets dram->paging concurrently with
iwx_init_task(). Please note, during this race we have IFF_FLAG set, but
IFF_RUNNING flag not set, so the iwx_init_task() will not call
iwx_stop(), but will call iwx_init().

[ cut ]

Index: sys/dev/pci/if_iwx.c
===
RCS file: /cvs/src/sys/dev/pci/if_iwx.c,v
retrieving revision 1.152
diff -u -p -r1.152 if_iwx.c
--- sys/dev/pci/if_iwx.c24 Jan 2023 16:18:22 -  1.152
+++ sys/dev/pci/if_iwx.c13 Feb 2023 23:37:19 -
@@ -8767,6 +8767,7 @@ iwx_init(struct ifnet *ifp)
struct ieee80211com *ic = >sc_ic;
int err, generation;
 
+   printf("%s\n", __func__);
rw_assert_wrlock(>ioctl_rwl);
 
generation = ++sc->sc_generation;
@@ -8901,6 +8902,7 @@ iwx_stop(struct ifnet *ifp)
struct iwx_node *in = (void *)ic->ic_bss;
int i, s = splnet();
 
+   printf("%s\n", __func__);
rw_assert_wrlock(>ioctl_rwl);
 
sc->sc_flags |= IWX_FLAG_SHUTDOWN; /* Disallow new tasks. */
@@ -10444,6 +10446,7 @@ iwx_attach_hook(struct device *self)
struct iwx_softc *sc = (void *)self;
 
KASSERT(!cold);
+   printf("%s\n", __func__);
 
iwx_preinit(sc);
 }
@@ -10944,6 +10947,7 @@ iwx_init_task(void *arg1)
int fatal = (sc->sc_flags & (IWX_FLAG_HW_ERR | IWX_FLAG_RFKILL));
 
rw_enter_write(>ioctl_rwl);
+   printf("%s\n", __func__);
if (generation != sc->sc_generation) {
rw_exit(>ioctl_rwl);
splx(s);
@@ -11030,10 +11034,12 @@ iwx_activate(struct device *self, int ac
}
break;
case DVACT_RESUME:
+   printf("%s - resume\n", __func__);
iwx_resume(sc);
break;
case DVACT_WAKEUP:
if ((ifp->if_flags & (IFF_UP | IFF_RUNNING)) == IFF_UP) {
+   printf("%s - wakeup\n", __func__);
err = iwx_wakeup(sc);
if (err)
printf("%s: could not initialize hardware\n",

[ cut ]

The output on resume was:

iwx_stop
iwx_activate - resume
iwx_activate - wakeup
iwx0: fatal firmware error
iwx_init_task
iwx_init
panic: kernel diagnostic assertion "dram->

So iwx_activate():

iwx_activate(struct device *self, int act)
{
/* ... */
switch (act) {
/* ... */
case DVACT_RESUME:
printf("%s - resume\n", __func__);
iwx_resume(sc);
break;
case DVACT_WAKEUP:
if ((ifp->if_flags & (IFF_UP | IFF_RUNNING)) == IFF_UP) {
printf("%s - wakeup\n", __func__);
err = iwx_wakeup(sc);


iwx_resume() doesn't do any software context allocation, but
iwx_wakeup() does. It calls iwx_init_hw(), ..., iwx_init_fw_sec() which
allocates "dram->paging". So, when iwx_init_task() scheduled by
interrupt started to run, iwx_init_fw_sec() will be entered again, but
with "dram->paging" set.

I hope this information will be useful to you. I'm not familiar with iwm
code, but existing `if_flags' checks looks wrong to me. May be firmware
errors should be handled with special IWX_FLAG_SW_ERR instead.

I could provide panic screenshot if required.



Re: iwx(4) panic on resume after hibernation

2023-02-13 Thread Stefan Sperling
On Sat, Feb 11, 2023 at 06:32:31PM +0300, Vitaliy Makkoveev wrote:
> Unfortunately the diff doesn't help. I attached photos with debug
> information and panic report.

Based on the command code in the firmware trace the error seems to be
caused by the driver sending this command:

#define IWX_ADD_STA_KEY 0x17

It seems the set_key task is running while the driver is in INIT state?
That is certainly not intended, and it is unclear why this is happening.

What is the effect of this diff? Do you see the printfs added here?

diff /usr/src
commit - 272fae57df2955410d97e29c5923cd4948623da1
path + /usr/src
blob - fe1bee0e59b08e6de8b7e3c00e20b930a29a6bb1
file + sys/dev/pci/if_iwx.c
--- sys/dev/pci/if_iwx.c
+++ sys/dev/pci/if_iwx.c
@@ -8554,12 +8554,18 @@ iwx_setkey_task(void *arg)
struct iwx_softc *sc = arg;
struct iwx_setkey_task_arg *a;
int err = 0, s = splnet();
+   enum ieee80211_state state = sc->sc_ic.ic_state;
 
-   while (sc->setkey_nkeys > 0) {
-   if (err || (sc->sc_flags & IWX_FLAG_SHUTDOWN))
-   break;
+   printf("%s: in state %s\n", __func__, ieee80211_state_name[state]);
+   while (state == IEEE80211_S_RUN && sc->setkey_nkeys > 0 &&
+   err == 0 && (sc->sc_flags & IWX_FLAG_SHUTDOWN) == 0) {
a = >setkey_arg[sc->setkey_tail];
err = iwx_add_sta_key(sc, a->sta_id, a->ni, a->k);
+   if (err) {
+   printf("%s: could not set %s key (error %d)\n",
+   DEVNAME(sc), (a->k->k_flags & IEEE80211_KEY_GROUP) ?
+   "group" : "pairwise", err);
+   }
a->sta_id = 0;
a->ni = NULL;
a->k = NULL;
@@ -8568,6 +8574,10 @@ iwx_setkey_task(void *arg)
sc->setkey_nkeys--;
}
 
+   /* Reset everything if key installation failed. */
+   if (err && (sc->sc_flags & IWX_FLAG_SHUTDOWN) == 0)
+   task_add(systq, >init_task);
+
refcnt_rele_wake(>task_refs);
splx(s);
 }