gdb in uninterruptible wait
FYI: I ran into this egdb step/next issue: http://openbsd-archive.7691.n7.nabble.com/gdb-in-uninterruptible-wait-td395449.html

Does anyone know the status? Do I have to patch the kernel manually?

NOTE: I am a fairly recent convert to OpenBSD. I have migrated one of my laptops and dedicated it to OpenBSD, so I am willing to learn how to hack the kernel if needed. From my reading, however, a custom kernel would no longer be supported. I would be on my own, and that does not sit well with me just yet.

6.7 GENERIC.MP#6
Re: gdb in uninterruptible wait
On Tue, 21 Jul 2020 19:23:44 +0100
Julian Smith wrote:

> On Mon, 20 Jul 2020 17:18:19 +0100
> Julian Smith wrote:
>
> > On Mon, 20 Jul 2020 15:26:11 +0000
> > Visa Hankala wrote:
> >
> > > On Mon, Jul 20, 2020 at 04:35:12AM +0000, Visa Hankala wrote:
> > > > On Sun, Jul 19, 2020 at 09:47:54PM +0100, Julian Smith wrote:
> > > > > I've been finding egdb and gdb rather easily get stuck in an
> > > > > uninterruptible wait, e.g. when running the 'next' command
> > > > > after hitting a breakpoint.
> >
> > [...]
> >
> > > > The single-thread check done by wait4() is non-interruptible.
> > > > When the debugger gets stuck, is it blocked in "suspend" state?
> >
> > ps reports it to be in state 'D'.
> >
> > > > However, I think there is a bug in the single-thread switch
> > > > code. It looks like ps_singlecount can be decremented too many
> > > > times. This is probably a regression from making ps_singlecount
> > > > unsigned and letting single_thread_check() run without the
> > > > kernel lock.
> > > >
> > > > The bug might go away if single_thread_check() made sure that
> > > > P_SUSPSINGLE is set before the thread suspends.
> > >
> > > Below is an updated patch for testing. It extends the scope of
> > > SCHED_LOCK() so that there are fewer chances of interleaving of
> > > single_thread_set() and single_thread_check().
> >
> > Many thanks for these patches. I'll try to test in the next couple
> > of days. Though the last time I built an OpenBSD kernel was well
> > over a decade ago, so it might take me a little longer.
>
> I managed to build a patched kernel, and it seems to fix the problem -
> I haven't been able to get egdb into an uninterruptible wait state.
>
> Also, I've been running the patched kernel all day now and it doesn't
> seem to be causing any problems elsewhere.

Unfortunately the same problem has just occurred again.

I've run egdb quite a few times since I updated the kernel, so the patch
has definitely reduced the problem, but it doesn't seem to have
eliminated it.

Let me know if there is anything I could do to find out more information.

Thanks,

- Jules

--
http://op59.net
Re: gdb in uninterruptible wait
On Mon, 20 Jul 2020 17:18:19 +0100
Julian Smith wrote:

> On Mon, 20 Jul 2020 15:26:11 +0000
> Visa Hankala wrote:
>
> > On Mon, Jul 20, 2020 at 04:35:12AM +0000, Visa Hankala wrote:
> > > On Sun, Jul 19, 2020 at 09:47:54PM +0100, Julian Smith wrote:
> > > > I've been finding egdb and gdb rather easily get stuck in an
> > > > uninterruptible wait, e.g. when running the 'next' command after
> > > > hitting a breakpoint.
>
> [...]
>
> > > The single-thread check done by wait4() is non-interruptible.
> > > When the debugger gets stuck, is it blocked in "suspend" state?
>
> ps reports it to be in state 'D'.
>
> > > However, I think there is a bug in the single-thread switch code.
> > > It looks like ps_singlecount can be decremented too many times.
> > > This is probably a regression from making ps_singlecount unsigned
> > > and letting single_thread_check() run without the kernel lock.
> > >
> > > The bug might go away if single_thread_check() made sure that
> > > P_SUSPSINGLE is set before the thread suspends.
> >
> > Below is an updated patch for testing. It extends the scope of
> > SCHED_LOCK() so that there are fewer chances of interleaving of
> > single_thread_set() and single_thread_check().
>
> Many thanks for these patches. I'll try to test in the next couple of
> days. Though the last time I built an OpenBSD kernel was well over a
> decade ago, so it might take me a little longer.

I managed to build a patched kernel, and it seems to fix the problem -
I haven't been able to get egdb into an uninterruptible wait state.

Also, I've been running the patched kernel all day now and it doesn't
seem to be causing any problems elsewhere.

Thanks,

- Jules

--
http://op59.net
Re: gdb in uninterruptible wait
On Mon, 20 Jul 2020 15:26:11 +0000
Visa Hankala wrote:

> On Mon, Jul 20, 2020 at 04:35:12AM +0000, Visa Hankala wrote:
> > On Sun, Jul 19, 2020 at 09:47:54PM +0100, Julian Smith wrote:
> > > I've been finding egdb and gdb rather easily get stuck in an
> > > uninterruptible wait, e.g. when running the 'next' command after
> > > hitting a breakpoint.

[...]

> > The single-thread check done by wait4() is non-interruptible.
> > When the debugger gets stuck, is it blocked in "suspend" state?

ps reports it to be in state 'D'.

> > However, I think there is a bug in the single-thread switch code.
> > It looks like ps_singlecount can be decremented too many times.
> > This is probably a regression from making ps_singlecount unsigned
> > and letting single_thread_check() run without the kernel lock.
> >
> > The bug might go away if single_thread_check() made sure that
> > P_SUSPSINGLE is set before the thread suspends.
>
> Below is an updated patch for testing. It extends the scope of
> SCHED_LOCK() so that there are fewer chances of interleaving of
> single_thread_set() and single_thread_check().

Many thanks for these patches. I'll try to test in the next couple of
days. Though the last time I built an OpenBSD kernel was well over a
decade ago, so it might take me a little longer.

Thanks,

- Jules

> One problem is that once single_thread_set() sets ps_single, the other
> threads can enter the suspension code in single_thread_check(). The
> extended lock scope prevents the threads from taking action before
> single_thread_set() has finished with the state updates.
>
> Index: kern/kern_sig.c
> ===================================================================
> RCS file: src/sys/kern/kern_sig.c,v
> retrieving revision 1.258
> diff -u -p -r1.258 kern_sig.c
> --- kern/kern_sig.c	15 Jun 2020 13:18:33 -0000	1.258
> +++ kern/kern_sig.c	20 Jul 2020 13:29:50 -0000
> @@ -1915,16 +1915,17 @@ single_thread_check(struct proc *p, int
>  				return (EINTR);
>  			}
>  
> +			SCHED_LOCK(s);
>  			if (atomic_dec_int_nv(&pr->ps_singlecount) == 0)
>  				wakeup(&pr->ps_singlecount);
>  			if (pr->ps_flags & PS_SINGLEEXIT) {
> +				SCHED_UNLOCK(s);
>  				KERNEL_LOCK();
>  				exit1(p, 0, 0, EXIT_THREAD_NOCHECK);
> -				KERNEL_UNLOCK();
> +				/* NOTREACHED */
>  			}
>  
>  			/* not exiting and don't need to unwind, so suspend */
> -			SCHED_LOCK(s);
>  			p->p_stat = SSTOP;
>  			mi_switch();
>  			SCHED_UNLOCK(s);
> @@ -1950,7 +1951,7 @@ single_thread_set(struct proc *p, enum s
>  {
>  	struct process *pr = p->p_p;
>  	struct proc *q;
> -	int error;
> +	int error, s;
>  
>  	KERNEL_ASSERT_LOCKED();
>  	KASSERT(curproc == p);
> @@ -1974,26 +1975,22 @@ single_thread_set(struct proc *p, enum s
>  		panic("single_thread_mode = %d", mode);
>  #endif
>  	}
> +	SCHED_LOCK(s);
>  	pr->ps_singlecount = 0;
>  	membar_producer();
>  	pr->ps_single = p;
>  	TAILQ_FOREACH(q, &pr->ps_threads, p_thr_link) {
> -		int s;
> -
>  		if (q == p)
>  			continue;
>  		if (q->p_flag & P_WEXIT) {
>  			if (mode == SINGLE_EXIT) {
> -				SCHED_LOCK(s);
>  				if (q->p_stat == SSTOP) {
>  					setrunnable(q);
>  					atomic_inc_int(&pr->ps_singlecount);
>  				}
> -				SCHED_UNLOCK(s);
>  			}
>  			continue;
>  		}
> -		SCHED_LOCK(s);
>  		atomic_setbits_int(&q->p_flag, P_SUSPSINGLE);
>  		switch (q->p_stat) {
>  		case SIDL:
> @@ -2027,8 +2024,8 @@ single_thread_set(struct proc *p, enum s
>  			signotify(q);
>  			break;
>  		}
> -		SCHED_UNLOCK(s);
>  	}
> +	SCHED_UNLOCK(s);
>  
>  	if (mode != SINGLE_PTRACE)
>  		single_thread_wait(pr, 1);

--
http://op59.net
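Editorial illustration of the lock-scope idea in the patch above. This is a user-space sketch only, not OpenBSD code: the struct, the worker threads and the function names are hypothetical, a pthread mutex stands in for SCHED_LOCK(), and a condition variable stands in for sleeping on and waking up the counter. The setter holds the lock across all of its state updates, so a worker that has already noticed the request cannot check in until the count is complete.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define NWORKERS 4

/* Hypothetical model state; the names only echo the kernel fields. */
struct model {
	pthread_mutex_t	lock;		/* plays the role of SCHED_LOCK() */
	pthread_cond_t	cv;		/* plays the role of sleep/wakeup */
	atomic_int	single;		/* nonzero once the request is published */
	unsigned int	count;		/* workers still expected to check in */
	int		underflow;	/* set if a check-in found count == 0 */
};

/* Models a thread reaching its single-thread check. */
static void *
worker(void *arg)
{
	struct model *m = arg;

	/* Spin until the setter publishes the request. */
	while (atomic_load(&m->single) == 0)
		sched_yield();

	/* Blocks here until the setter has finished counting. */
	pthread_mutex_lock(&m->lock);
	if (m->count == 0)
		m->underflow = 1;	/* would wrap an unsigned counter */
	else if (--m->count == 0)
		pthread_cond_signal(&m->cv);
	pthread_mutex_unlock(&m->lock);
	return NULL;
}

/* Returns 1 if every worker checked in exactly once without underflow. */
int
run_lock_scope_model(void)
{
	struct model m;
	pthread_t tid[NWORKERS];
	int i, rc, ok;

	pthread_mutex_init(&m.lock, NULL);
	pthread_cond_init(&m.cv, NULL);
	atomic_init(&m.single, 0);
	m.count = 0;
	m.underflow = 0;

	for (i = 0; i < NWORKERS; i++) {
		rc = pthread_create(&tid[i], NULL, worker, &m);
		assert(rc == 0);
	}

	/* Extended lock scope: publish and count under one lock hold. */
	pthread_mutex_lock(&m.lock);
	atomic_store(&m.single, 1);
	for (i = 0; i < NWORKERS; i++)
		m.count++;
	/* pthread_cond_wait() releases the lock while waiting. */
	while (m.count > 0)
		pthread_cond_wait(&m.cv, &m.lock);
	pthread_mutex_unlock(&m.lock);

	for (i = 0; i < NWORKERS; i++) {
		rc = pthread_join(tid[i], NULL);
		assert(rc == 0);
	}
	(void)rc;

	ok = (m.underflow == 0 && m.count == 0);
	pthread_cond_destroy(&m.cv);
	pthread_mutex_destroy(&m.lock);
	return ok;
}
```

Calling run_lock_scope_model() from a small test program should return 1. If the lock were instead taken and dropped around each individual update, a worker could run between publishing the request and finishing the count, which is the kind of interleaving the patch tries to make less likely.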
Re: gdb in uninterruptible wait
On Mon, Jul 20, 2020 at 04:35:12AM +0000, Visa Hankala wrote:
> On Sun, Jul 19, 2020 at 09:47:54PM +0100, Julian Smith wrote:
> > I've been finding egdb and gdb rather easily get stuck in an
> > uninterruptible wait, e.g. when running the 'next' command after
> > hitting a breakpoint.
> >
> > So it's not possible to kill the debuggee or gdb and the only way to
> > kill the debuggee process and free up its listening sockets seems to be
> > to reboot the entire system.
> >
> > Perhaps unsurprisingly one cannot attach a second invocation of gdb to
> > the uninterruptible gdb, so I don't know for sure what syscall is being
> > run that is getting stuck.
> >
> > The debuggee is a local build of the flightgear flight simulator.
> >
> > Here's the output of ps for the debugger and debuggee:
> >
> > 12419 p0 D    0:34.37 egdb -ex handle SIGPIPE noprint nostop -ex set print thread-events off -ex set print pretty on -ex run --args build-walk/fgfs,clang,debug,opt,co
> > 63921 p0 TX+  0:42.45 /home/jules/flightgear/build-walk/fgfs,clang,debug,opt,compositor,osg.exe --airport=egtk (fgfs,clang,debug)
> >
> > I've tried using ktrace on egdb, and the kdump output ends like this:
> >
> > 53950 egdb CALL  wait4(WAIT_ANY,0x7f7e8efc,0<>,0)
> > 53950 egdb RET   wait4 97562/0x17d1a
> > 53950 egdb CALL  ptrace(PT_GET_PROCESS_STATE,97562,0x7f7e8ef0,12)
> > 53950 egdb RET   ptrace 0
> > 53950 egdb CALL  ptrace(PT_GETREGS,161560,0x7f7e8b40,0)
> > 53950 egdb RET   ptrace 0
> > 53950 egdb CALL  futex(0x6444e37c490,0x82,1,0,0)
> > 53950 egdb RET   futex 0
> > 53950 egdb CALL  futex(0x644bef12740,0x82,1,0,0)
> > 53950 egdb RET   futex 0
> > 53950 egdb CALL  ptrace(PT_IO,97562,0x7f7e8a30,0)
> > 53950 egdb RET   ptrace 0
> > 53950 egdb CALL  ptrace(PT_IO,97562,0x7f7e8a30,0)
> > 53950 egdb RET   ptrace 0
> > 53950 egdb CALL  ptrace(PT_STEP,97562,0x1,0)
> > 53950 egdb RET   ptrace 0
> > 53950 egdb CALL  read(6,0x7f7e9187,0x1)
> > 53950 egdb RET   read -1 errno 35 Resource temporarily unavailable
> > 53950 egdb CALL  poll(0x6441581e720,3,0)
> > 53950 egdb STRU  struct pollfd [3] { fd=4, events=0x1, revents=0<> } { fd=6, events=0x1, revents=0<> } { fd=10, events=0x1, revents=0<> }
> > 53950 egdb RET   poll 0
> > 53950 egdb CALL  wait4(WAIT_ANY,0x7f7e8efc,0<>,0)
> >
> > Assuming that this is the actual end of the ktrace output and there
> > isn't some missing ktrace output in a buffer somewhere, this looks
> > like egdb is simply blocked in wait4(), which should be harmless and
> > certainly not uninterruptible?
>
> The single-thread check done by wait4() is non-interruptible.
> When the debugger gets stuck, is it blocked in "suspend" state?
>
> However, I think there is a bug in the single-thread switch code.
> It looks like ps_singlecount can be decremented too many times. This
> is probably a regression from making ps_singlecount unsigned and
> letting single_thread_check() run without the kernel lock.
>
> The bug might go away if single_thread_check() made sure that
> P_SUSPSINGLE is set before the thread suspends.

Below is an updated patch for testing. It extends the scope of
SCHED_LOCK() so that there are fewer chances of interleaving of
single_thread_set() and single_thread_check().

One problem is that once single_thread_set() sets ps_single, the other
threads can enter the suspension code in single_thread_check(). The
extended lock scope prevents the threads from taking action before
single_thread_set() has finished with the state updates.

Index: kern/kern_sig.c
===================================================================
RCS file: src/sys/kern/kern_sig.c,v
retrieving revision 1.258
diff -u -p -r1.258 kern_sig.c
--- kern/kern_sig.c	15 Jun 2020 13:18:33 -0000	1.258
+++ kern/kern_sig.c	20 Jul 2020 13:29:50 -0000
@@ -1915,16 +1915,17 @@ single_thread_check(struct proc *p, int
 				return (EINTR);
 			}
 
+			SCHED_LOCK(s);
 			if (atomic_dec_int_nv(&pr->ps_singlecount) == 0)
 				wakeup(&pr->ps_singlecount);
 			if (pr->ps_flags & PS_SINGLEEXIT) {
+				SCHED_UNLOCK(s);
 				KERNEL_LOCK();
 				exit1(p, 0, 0, EXIT_THREAD_NOCHECK);
-				KERNEL_UNLOCK();
+				/* NOTREACHED */
 			}
 
 			/* not exiting and don't need to unwind, so suspend */
-			SCHED_LOCK(s);
 			p->p_stat = SSTOP;
 			mi_switch();
 			SCHED_UNLOCK(s);
@@ -1950,7 +1951,7 @@ single_thread_set(struct proc *p, enum s
 {
 	struct process *pr = p->p_p;
 	struct proc *q;
-
Re: gdb in uninterruptible wait
On Sun, Jul 19, 2020 at 09:47:54PM +0100, Julian Smith wrote:
> I've been finding egdb and gdb rather easily get stuck in an
> uninterruptible wait, e.g. when running the 'next' command after
> hitting a breakpoint.
>
> So it's not possible to kill the debuggee or gdb and the only way to
> kill the debuggee process and free up its listening sockets seems to be
> to reboot the entire system.
>
> Perhaps unsurprisingly one cannot attach a second invocation of gdb to
> the uninterruptible gdb, so I don't know for sure what syscall is being
> run that is getting stuck.
>
> The debuggee is a local build of the flightgear flight simulator.
>
> Here's the output of ps for the debugger and debuggee:
>
> 12419 p0 D    0:34.37 egdb -ex handle SIGPIPE noprint nostop -ex set print thread-events off -ex set print pretty on -ex run --args build-walk/fgfs,clang,debug,opt,co
> 63921 p0 TX+  0:42.45 /home/jules/flightgear/build-walk/fgfs,clang,debug,opt,compositor,osg.exe --airport=egtk (fgfs,clang,debug)
>
> I've tried using ktrace on egdb, and the kdump output ends like this:
>
> 53950 egdb CALL  wait4(WAIT_ANY,0x7f7e8efc,0<>,0)
> 53950 egdb RET   wait4 97562/0x17d1a
> 53950 egdb CALL  ptrace(PT_GET_PROCESS_STATE,97562,0x7f7e8ef0,12)
> 53950 egdb RET   ptrace 0
> 53950 egdb CALL  ptrace(PT_GETREGS,161560,0x7f7e8b40,0)
> 53950 egdb RET   ptrace 0
> 53950 egdb CALL  futex(0x6444e37c490,0x82,1,0,0)
> 53950 egdb RET   futex 0
> 53950 egdb CALL  futex(0x644bef12740,0x82,1,0,0)
> 53950 egdb RET   futex 0
> 53950 egdb CALL  ptrace(PT_IO,97562,0x7f7e8a30,0)
> 53950 egdb RET   ptrace 0
> 53950 egdb CALL  ptrace(PT_IO,97562,0x7f7e8a30,0)
> 53950 egdb RET   ptrace 0
> 53950 egdb CALL  ptrace(PT_STEP,97562,0x1,0)
> 53950 egdb RET   ptrace 0
> 53950 egdb CALL  read(6,0x7f7e9187,0x1)
> 53950 egdb RET   read -1 errno 35 Resource temporarily unavailable
> 53950 egdb CALL  poll(0x6441581e720,3,0)
> 53950 egdb STRU  struct pollfd [3] { fd=4, events=0x1, revents=0<> } { fd=6, events=0x1, revents=0<> } { fd=10, events=0x1, revents=0<> }
> 53950 egdb RET   poll 0
> 53950 egdb CALL  wait4(WAIT_ANY,0x7f7e8efc,0<>,0)
>
> Assuming that this is the actual end of the ktrace output and there
> isn't some missing ktrace output in a buffer somewhere, this looks
> like egdb is simply blocked in wait4(), which should be harmless and
> certainly not uninterruptible?

The single-thread check done by wait4() is non-interruptible.
When the debugger gets stuck, is it blocked in "suspend" state?

However, I think there is a bug in the single-thread switch code.
It looks like ps_singlecount can be decremented too many times. This
is probably a regression from making ps_singlecount unsigned and
letting single_thread_check() run without the kernel lock.

The bug might go away if single_thread_check() made sure that
P_SUSPSINGLE is set before the thread suspends.

Does the following patch help? Even if it does, it probably needs
some refining.

Index: kern/kern_sig.c
===================================================================
RCS file: src/sys/kern/kern_sig.c,v
retrieving revision 1.258
diff -u -p -r1.258 kern_sig.c
--- kern/kern_sig.c	15 Jun 2020 13:18:33 -0000	1.258
+++ kern/kern_sig.c	20 Jul 2020 04:27:30 -0000
@@ -1915,16 +1915,23 @@ single_thread_check(struct proc *p, int
 				return (EINTR);
 			}
 
-			if (atomic_dec_int_nv(&pr->ps_singlecount) == 0)
-				wakeup(&pr->ps_singlecount);
+			SCHED_LOCK(s);
+			if (p->p_flag & P_SUSPSINGLE) {
+				if (atomic_dec_int_nv(&pr->ps_singlecount) == 0)
+					wakeup(&pr->ps_singlecount);
+			} else if ((p->p_flag & P_WEXIT) == 0) {
+				SCHED_UNLOCK(s);
+				CPU_BUSY_CYCLE();
+				continue;
+			}
 			if (pr->ps_flags & PS_SINGLEEXIT) {
+				SCHED_UNLOCK(s);
 				KERNEL_LOCK();
 				exit1(p, 0, 0, EXIT_THREAD_NOCHECK);
-				KERNEL_UNLOCK();
+				/* NOTREACHED */
 			}
 
 			/* not exiting and don't need to unwind, so suspend */
-			SCHED_LOCK(s);
 			p->p_stat = SSTOP;
 			mi_switch();
 			SCHED_UNLOCK(s);
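Editorial illustration of the "decremented too many times" concern above. This is a minimal user-space sketch, not kernel code: the struct and function names are hypothetical, and C11 atomics stand in for atomic_dec_int_nv(). It assumes only what the message says, namely that the counter is unsigned and that a thread can decrement it once more than it was counted.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the process-wide ps_singlecount field. */
struct model_process {
	atomic_uint singlecount;
};

/*
 * Models `atomic_dec_int_nv(&pr->ps_singlecount) == 0`: decrement the
 * counter and report whether the new value is zero, which is the point
 * at which the waiter would be woken.
 */
int
model_check_in(struct model_process *pr)
{
	/* atomic_fetch_sub() returns the old value; unsigned math wraps. */
	unsigned int newval = atomic_fetch_sub(&pr->singlecount, 1) - 1;

	return newval == 0;
}

/* A waiter in this model keeps waiting while the counter is nonzero. */
int
model_still_waiting(struct model_process *pr)
{
	return atomic_load(&pr->singlecount) > 0;
}

/*
 * Replays the bad case: one thread was counted, but two check-ins
 * happen. Stores the number of wakeups in *wakeups and returns the
 * final counter value.
 */
unsigned int
model_extra_decrement(int *wakeups)
{
	struct model_process pr;

	atomic_init(&pr.singlecount, 1);	/* one thread expected */
	*wakeups = 0;

	*wakeups += model_check_in(&pr);	/* 1 -> 0: wakeup */
	assert(!model_still_waiting(&pr));

	*wakeups += model_check_in(&pr);	/* 0 -> UINT_MAX: no wakeup */
	assert(model_still_waiting(&pr));

	return atomic_load(&pr.singlecount);
}
```

In this model the second check-in leaves the counter at UINT_MAX. A waiter that looks at the counter only after that point sees a nonzero value and has nothing left to wake it, which is at least consistent with a debugger stuck in an uninterruptible state.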
gdb in uninterruptible wait
I've been finding egdb and gdb rather easily get stuck in an
uninterruptible wait, e.g. when running the 'next' command after
hitting a breakpoint.

So it's not possible to kill the debuggee or gdb and the only way to
kill the debuggee process and free up its listening sockets seems to be
to reboot the entire system.

Perhaps unsurprisingly one cannot attach a second invocation of gdb to
the uninterruptible gdb, so I don't know for sure what syscall is being
run that is getting stuck.

The debuggee is a local build of the flightgear flight simulator.

Here's the output of ps for the debugger and debuggee:

12419 p0 D    0:34.37 egdb -ex handle SIGPIPE noprint nostop -ex set print thread-events off -ex set print pretty on -ex run --args build-walk/fgfs,clang,debug,opt,co
63921 p0 TX+  0:42.45 /home/jules/flightgear/build-walk/fgfs,clang,debug,opt,compositor,osg.exe --airport=egtk (fgfs,clang,debug)

I've tried using ktrace on egdb, and the kdump output ends like this:

53950 egdb CALL  wait4(WAIT_ANY,0x7f7e8efc,0<>,0)
53950 egdb RET   wait4 97562/0x17d1a
53950 egdb CALL  ptrace(PT_GET_PROCESS_STATE,97562,0x7f7e8ef0,12)
53950 egdb RET   ptrace 0
53950 egdb CALL  ptrace(PT_GETREGS,161560,0x7f7e8b40,0)
53950 egdb RET   ptrace 0
53950 egdb CALL  futex(0x6444e37c490,0x82,1,0,0)
53950 egdb RET   futex 0
53950 egdb CALL  futex(0x644bef12740,0x82,1,0,0)
53950 egdb RET   futex 0
53950 egdb CALL  ptrace(PT_IO,97562,0x7f7e8a30,0)
53950 egdb RET   ptrace 0
53950 egdb CALL  ptrace(PT_IO,97562,0x7f7e8a30,0)
53950 egdb RET   ptrace 0
53950 egdb CALL  ptrace(PT_STEP,97562,0x1,0)
53950 egdb RET   ptrace 0
53950 egdb CALL  read(6,0x7f7e9187,0x1)
53950 egdb RET   read -1 errno 35 Resource temporarily unavailable
53950 egdb CALL  poll(0x6441581e720,3,0)
53950 egdb STRU  struct pollfd [3] { fd=4, events=0x1, revents=0<> } { fd=6, events=0x1, revents=0<> } { fd=10, events=0x1, revents=0<> }
53950 egdb RET   poll 0
53950 egdb CALL  wait4(WAIT_ANY,0x7f7e8efc,0<>,0)

Assuming that this is the actual end of the ktrace output and there
isn't some missing ktrace output in a buffer somewhere, this looks
like egdb is simply blocked in wait4(), which should be harmless and
certainly not uninterruptible?

Does anyone have any suggestions about how to investigate this further?

I'm running OpenBSD 6.7 GENERIC.MP#182 amd64.

Thanks,

- Jules

--
http://op59.net