gdb in uninterruptible wait

2020-09-16 Thread flint pyrite
FYI: I ran into this egdb step/next issue:

http://openbsd-archive.7691.n7.nabble.com/gdb-in-uninterruptible-wait-td395449.html

Does anyone know the status?
Do I have to manually patch the kernel?

NOTE: I am a fairly recent convert to OpenBSD. I have migrated one of
my laptops and dedicated it to OpenBSD, so I am willing to learn how to
hack the kernel if needed. From my reading, however, a custom kernel
is not supported: I would be on my own, and that does not appeal to
me just yet.

6.7 GENERIC.MP#6



Re: gdb in uninterruptible wait

2020-08-08 Thread Julian Smith
On Tue, 21 Jul 2020 19:23:44 +0100
Julian Smith  wrote:

> On Mon, 20 Jul 2020 17:18:19 +0100
> Julian Smith  wrote:
> 
> > On Mon, 20 Jul 2020 15:26:11 +0000
> > Visa Hankala wrote:
> >   
> > > On Mon, Jul 20, 2020 at 04:35:12AM +0000, Visa Hankala wrote:
> > > > On Sun, Jul 19, 2020 at 09:47:54PM +0100, Julian Smith wrote:
> > > >
> > > > > I've been finding egdb and gdb rather easily get stuck in an
> > > > > uninterruptible wait, e.g. when running the 'next' command
> > > > > after hitting a breakpoint.
> > 
> > [...]
> >   
> > > > The single-thread check done by wait4() is non-interruptible.
> > > > When the debugger gets stuck, is it blocked in "suspend" state?
> > > >
> > 
> > ps reports it to be in state 'D'.
> >   
> > > > 
> > > > However, I think there is a bug in the single-thread switch
> > > > code. It looks like ps_singlecount can be decremented too much.
> > > > This probably is a regression from making ps_singlecount unsigned
> > > > and letting single_thread_check() run without the kernel lock.
> > > > 
> > > > The bug might go away if single_thread_check() made sure that
> > > > P_SUSPSINGLE is set before the thread suspends.   
> > > 
> > > Below is an updated patch for testing. It extends the scope of
> > > SCHED_LOCK() so that there are fewer chances of interleaving of
> > > single_thread_set() and single_thread_check().
> > 
> > Many thanks for these patches. I'll try to test in the next couple
> > of days. Though the last time i built an OpenBSD kernel was well
> > over a decade ago, so it might take me a little longer.  
> 
> I managed to build a patched kernel, and it seems to fix the problem -
> i haven't been able to get egdb into an uninterruptible wait state.
> 
> Also, i've been running the patched kernel all day now and it doesn't
> seem to be causing any problems elsewhere.

Unfortunately the same problem has just occurred again. I've run egdb
quite a few times since i updated the kernel, so the patch has
definitely reduced the problem, but it doesn't seem to have eliminated
it.

Let me know if there is anything i could do to find out more information.

Thanks,

- Jules

-- 
http://op59.net




Re: gdb in uninterruptible wait

2020-07-21 Thread Julian Smith
On Mon, 20 Jul 2020 17:18:19 +0100
Julian Smith  wrote:

> On Mon, 20 Jul 2020 15:26:11 +0000
> Visa Hankala wrote:
> 
> > On Mon, Jul 20, 2020 at 04:35:12AM +0000, Visa Hankala wrote:
> > > On Sun, Jul 19, 2020 at 09:47:54PM +0100, Julian Smith wrote:
> > > > I've been finding egdb and gdb rather easily get stuck in an
> > > > uninterruptible wait, e.g. when running the 'next' command after
> > > > hitting a breakpoint.  
> 
> [...]
> 
> > > The single-thread check done by wait4() is non-interruptible.
> > > When the debugger gets stuck, is it blocked in "suspend" state?  
> 
> ps reports it to be in state 'D'.
> 
> > > 
> > > However, I think there is a bug in the single-thread switch code.
> > > It looks like ps_singlecount can be decremented too much. This
> > > probably is a regression from making ps_singlecount unsigned and
> > > letting single_thread_check() run without the kernel lock.
> > > 
> > > The bug might go away if single_thread_check() made sure that
> > > P_SUSPSINGLE is set before the thread suspends. 
> > 
> > Below is an updated patch for testing. It extends the scope of
> > SCHED_LOCK() so that there are fewer chances of interleaving of
> > single_thread_set() and single_thread_check().  
> 
> Many thanks for these patches. I'll try to test in the next couple of
> days. Though the last time i built an OpenBSD kernel was well over a
> decade ago, so it might take me a little longer.

I managed to build a patched kernel, and it seems to fix the problem -
i haven't been able to get egdb into an uninterruptible wait state.

Also, i've been running the patched kernel all day now and it doesn't
seem to be causing any problems elsewhere.

Thanks,

- Jules

-- 
http://op59.net




Re: gdb in uninterruptible wait

2020-07-20 Thread Julian Smith
On Mon, 20 Jul 2020 15:26:11 +0000
Visa Hankala wrote:

> On Mon, Jul 20, 2020 at 04:35:12AM +0000, Visa Hankala wrote:
> > On Sun, Jul 19, 2020 at 09:47:54PM +0100, Julian Smith wrote:  
> > > I've been finding egdb and gdb rather easily get stuck in an
> > > uninterruptible wait, e.g. when running the 'next' command after
> > > hitting a breakpoint.

[...]

> > The single-thread check done by wait4() is non-interruptible.
> > When the debugger gets stuck, is it blocked in "suspend" state?

ps reports it to be in state 'D'.

> > 
> > However, I think there is a bug in the single-thread switch code.
> > It looks like ps_singlecount can be decremented too much. This
> > probably is a regression from making ps_singlecount unsigned and
> > letting single_thread_check() run without the kernel lock.
> > 
> > The bug might go away if single_thread_check() made sure that
> > P_SUSPSINGLE is set before the thread suspends.   
> 
> Below is an updated patch for testing. It extends the scope of
> SCHED_LOCK() so that there are fewer chances of interleaving of
> single_thread_set() and single_thread_check().

Many thanks for these patches. I'll try to test in the next couple of
days. Though the last time i built an OpenBSD kernel was well over a
decade ago, so it might take me a little longer.

Thanks,

- Jules


> 
> One problem is that once single_thread_set() sets ps_single, the other
> threads can enter the suspension code in single_thread_check(). The
> extended lock scope prevents the threads from taking action before
> single_thread_set() has finished with the state updates.
> 
> Index: kern/kern_sig.c
> ===================================================================
> RCS file: src/sys/kern/kern_sig.c,v
> retrieving revision 1.258
> diff -u -p -r1.258 kern_sig.c
> --- kern/kern_sig.c	15 Jun 2020 13:18:33 -0000	1.258
> +++ kern/kern_sig.c	20 Jul 2020 13:29:50 -0000
> @@ -1915,16 +1915,17 @@ single_thread_check(struct proc *p, int 
>  					return (EINTR);
>  			}
>  
> +			SCHED_LOCK(s);
>  			if (atomic_dec_int_nv(&pr->ps_singlecount) == 0)
>  				wakeup(&pr->ps_singlecount);
>  			if (pr->ps_flags & PS_SINGLEEXIT) {
> +				SCHED_UNLOCK(s);
>  				KERNEL_LOCK();
>  				exit1(p, 0, 0, EXIT_THREAD_NOCHECK);
> -				KERNEL_UNLOCK();
> +				/* NOTREACHED */
>  			}
>  
>  			/* not exiting and don't need to unwind, so suspend */
> -			SCHED_LOCK(s);
>  			p->p_stat = SSTOP;
>  			mi_switch();
>  			SCHED_UNLOCK(s);
> @@ -1950,7 +1951,7 @@ single_thread_set(struct proc *p, enum s
>  {
>  	struct process *pr = p->p_p;
>  	struct proc *q;
> -	int error;
> +	int error, s;
>  
>  	KERNEL_ASSERT_LOCKED();
>  	KASSERT(curproc == p);
> @@ -1974,26 +1975,22 @@ single_thread_set(struct proc *p, enum s
>  		panic("single_thread_mode = %d", mode);
>  #endif
>  	}
> +	SCHED_LOCK(s);
>  	pr->ps_singlecount = 0;
>  	membar_producer();
>  	pr->ps_single = p;
>  	TAILQ_FOREACH(q, &pr->ps_threads, p_thr_link) {
> -		int s;
> -
>  		if (q == p)
>  			continue;
>  		if (q->p_flag & P_WEXIT) {
>  			if (mode == SINGLE_EXIT) {
> -				SCHED_LOCK(s);
>  				if (q->p_stat == SSTOP) {
>  					setrunnable(q);
>  					atomic_inc_int(&pr->ps_singlecount);
>  				}
> -				SCHED_UNLOCK(s);
>  			}
>  			continue;
>  		}
> -		SCHED_LOCK(s);
>  		atomic_setbits_int(&q->p_flag, P_SUSPSINGLE);
>  		switch (q->p_stat) {
>  		case SIDL:
> @@ -2027,8 +2024,8 @@ single_thread_set(struct proc *p, enum s
>  			signotify(q);
>  			break;
>  		}
> -		SCHED_UNLOCK(s);
>  	}
> +	SCHED_UNLOCK(s);
>  
>  	if (mode != SINGLE_PTRACE)
>  		single_thread_wait(pr, 1);
> 
> 
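
As a rough userland analogy of the locking change in the patch quoted
above, the sketch below uses a pthread mutex in the role of
SCHED_LOCK(). All names (struct proc_model, model_init(),
model_single_set(), model_single_check(), NTHREADS) are hypothetical,
and the per-thread flag check is borrowed from the earlier patch in
this thread, so this illustrates the idea under those assumptions
rather than the kernel's actual implementation. The point is that the
requester publishes the owner, the flags and the count inside one
critical section, so a checking thread never acts on a half-updated
state.

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4	/* hypothetical fixed thread count for the model */

/* Hypothetical userland model; field names are illustrative only. */
struct proc_model {
	pthread_mutex_t lock;		/* plays the role of SCHED_LOCK() */
	int single;			/* requesting thread, -1 if none */
	unsigned int singlecount;	/* threads that still have to park */
	int must_park[NTHREADS];	/* per-thread flag, like P_SUSPSINGLE */
};

static void
model_init(struct proc_model *pm)
{
	pthread_mutex_init(&pm->lock, NULL);
	pm->single = -1;
	pm->singlecount = 0;
	for (int i = 0; i < NTHREADS; i++)
		pm->must_park[i] = 0;
}

/*
 * Requester side: publish the owner, flag every other thread and
 * count it, all while holding the lock for the whole update.
 */
static void
model_single_set(struct proc_model *pm, int self)
{
	pthread_mutex_lock(&pm->lock);
	pm->singlecount = 0;
	pm->single = self;
	for (int i = 0; i < NTHREADS; i++) {
		if (i == self)
			continue;
		pm->must_park[i] = 1;
		pm->singlecount++;
	}
	pthread_mutex_unlock(&pm->lock);
}

/*
 * Other threads: decrement only if this thread was actually flagged.
 * Returns 1 if the thread accounted for itself, 0 otherwise.
 */
static int
model_single_check(struct proc_model *pm, int self)
{
	int parked = 0;

	pthread_mutex_lock(&pm->lock);
	if (pm->single != -1 && pm->single != self && pm->must_park[self]) {
		pm->must_park[self] = 0;
		pm->singlecount--;
		parked = 1;
	}
	pthread_mutex_unlock(&pm->lock);
	return (parked);
}
```

In this model a thread that calls model_single_check() twice only
decrements the count once, because its flag is cleared under the same
lock that protects the count.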



-- 
http://op59.net



Re: gdb in uninterruptible wait

2020-07-20 Thread Visa Hankala
On Mon, Jul 20, 2020 at 04:35:12AM +0000, Visa Hankala wrote:
> On Sun, Jul 19, 2020 at 09:47:54PM +0100, Julian Smith wrote:
> > I've been finding egdb and gdb rather easily get stuck in an
> > uninterruptible wait, e.g. when running the 'next' command after
> > hitting a breakpoint.
> > 
> > So it's not possible to kill the debuggee or gdb and the only way to
> > kill the debuggee process and free up its listening sockets seems to be
> > to reboot the entire system.
> > 
> > Perhaps unsurprisingly one cannot attach a second invocation of gdb to
> > the uninterruptible gdb, so i don't know for sure what syscall is being
> > run that is getting stuck.
> > 
> > The debuggee is a local build of the flightgear flight simulator.
> > 
> > Here's the output of ps for the debugger and debuggee:
> > 
> > 12419 p0  D    0:34.37 egdb -ex handle SIGPIPE noprint nostop -ex set print thread-events off -ex set print pretty on -ex run --args build-walk/fgfs,clang,debug,opt,co
> > 63921 p0  TX+  0:42.45 /home/jules/flightgear/build-walk/fgfs,clang,debug,opt,compositor,osg.exe --airport=egtk (fgfs,clang,debug)
> > 
> > I've tried using ktrace on egdb, and the kdump output ends like this:
> > 
> >  53950 egdb CALL  wait4(WAIT_ANY,0x7f7e8efc,0<>,0)
> >  53950 egdb RET   wait4 97562/0x17d1a
> >  53950 egdb CALL  ptrace(PT_GET_PROCESS_STATE,97562,0x7f7e8ef0,12)
> >  53950 egdb RET   ptrace 0
> >  53950 egdb CALL  ptrace(PT_GETREGS,161560,0x7f7e8b40,0)
> >  53950 egdb RET   ptrace 0
> >  53950 egdb CALL  futex(0x6444e37c490,0x82,1,0,0)
> >  53950 egdb RET   futex 0
> >  53950 egdb CALL  futex(0x644bef12740,0x82,1,0,0)
> >  53950 egdb RET   futex 0
> >  53950 egdb CALL  ptrace(PT_IO,97562,0x7f7e8a30,0)
> >  53950 egdb RET   ptrace 0
> >  53950 egdb CALL  ptrace(PT_IO,97562,0x7f7e8a30,0)
> >  53950 egdb RET   ptrace 0
> >  53950 egdb CALL  ptrace(PT_STEP,97562,0x1,0)
> >  53950 egdb RET   ptrace 0
> >  53950 egdb CALL  read(6,0x7f7e9187,0x1)
> >  53950 egdb RET   read -1 errno 35 Resource temporarily unavailable
> >  53950 egdb CALL  poll(0x6441581e720,3,0)
> >  53950 egdb STRU  struct pollfd [3] { fd=4, events=0x1, revents=0<> } { fd=6, events=0x1, revents=0<> } { fd=10, events=0x1, revents=0<> }
> >  53950 egdb RET   poll 0
> >  53950 egdb CALL  wait4(WAIT_ANY,0x7f7e8efc,0<>,0)
> > 
> > Assuming that this is the actual end of the ktrace output and there
> > isn't some missing ktrace output in a buffer somewhere, this looks
> > like egdb is simply blocked in wait4(), which should be harmless and
> > certainly not uninterruptible?
> 
> The single-thread check done by wait4() is non-interruptible.
> When the debugger gets stuck, is it blocked in "suspend" state?
> 
> However, I think there is a bug in the single-thread switch code.
> It looks like ps_singlecount can be decremented too much. This probably
> is a regression from making ps_singlecount unsigned and letting
> single_thread_check() run without the kernel lock.
> 
> The bug might go away if single_thread_check() made sure that
> P_SUSPSINGLE is set before the thread suspends. 

Below is an updated patch for testing. It extends the scope of
SCHED_LOCK() so that there are fewer chances of interleaving of
single_thread_set() and single_thread_check().

One problem is that once single_thread_set() sets ps_single, the other
threads can enter the suspension code in single_thread_check(). The
extended lock scope prevents the threads from taking action before
single_thread_set() has finished with the state updates.

Index: kern/kern_sig.c
===================================================================
RCS file: src/sys/kern/kern_sig.c,v
retrieving revision 1.258
diff -u -p -r1.258 kern_sig.c
--- kern/kern_sig.c	15 Jun 2020 13:18:33 -0000	1.258
+++ kern/kern_sig.c	20 Jul 2020 13:29:50 -0000
@@ -1915,16 +1915,17 @@ single_thread_check(struct proc *p, int 
 					return (EINTR);
 			}
 
+			SCHED_LOCK(s);
 			if (atomic_dec_int_nv(&pr->ps_singlecount) == 0)
 				wakeup(&pr->ps_singlecount);
 			if (pr->ps_flags & PS_SINGLEEXIT) {
+				SCHED_UNLOCK(s);
 				KERNEL_LOCK();
 				exit1(p, 0, 0, EXIT_THREAD_NOCHECK);
-				KERNEL_UNLOCK();
+				/* NOTREACHED */
 			}
 
 			/* not exiting and don't need to unwind, so suspend */
-			SCHED_LOCK(s);
 			p->p_stat = SSTOP;
 			mi_switch();
 			SCHED_UNLOCK(s);
@@ -1950,7 +1951,7 @@ single_thread_set(struct proc *p, enum s
 {
 	struct process *pr = p->p_p;
 	struct proc *q;
- 

Re: gdb in uninterruptible wait

2020-07-19 Thread Visa Hankala
On Sun, Jul 19, 2020 at 09:47:54PM +0100, Julian Smith wrote:
> I've been finding egdb and gdb rather easily get stuck in an
> uninterruptible wait, e.g. when running the 'next' command after
> hitting a breakpoint.
> 
> So it's not possible to kill the debuggee or gdb and the only way to
> kill the debuggee process and free up its listening sockets seems to be
> to reboot the entire system.
> 
> Perhaps unsurprisingly one cannot attach a second invocation of gdb to
> the uninterruptible gdb, so i don't know for sure what syscall is being
> run that is getting stuck.
> 
> The debuggee is a local build of the flightgear flight simulator.
> 
> Here's the output of ps for the debugger and debuggee:
> 
> 12419 p0  D    0:34.37 egdb -ex handle SIGPIPE noprint nostop -ex set print thread-events off -ex set print pretty on -ex run --args build-walk/fgfs,clang,debug,opt,co
> 63921 p0  TX+  0:42.45 /home/jules/flightgear/build-walk/fgfs,clang,debug,opt,compositor,osg.exe --airport=egtk (fgfs,clang,debug)
> 
> I've tried using ktrace on egdb, and the kdump output ends like this:
> 
>  53950 egdb CALL  wait4(WAIT_ANY,0x7f7e8efc,0<>,0)
>  53950 egdb RET   wait4 97562/0x17d1a
>  53950 egdb CALL  ptrace(PT_GET_PROCESS_STATE,97562,0x7f7e8ef0,12)
>  53950 egdb RET   ptrace 0
>  53950 egdb CALL  ptrace(PT_GETREGS,161560,0x7f7e8b40,0)
>  53950 egdb RET   ptrace 0
>  53950 egdb CALL  futex(0x6444e37c490,0x82,1,0,0)
>  53950 egdb RET   futex 0
>  53950 egdb CALL  futex(0x644bef12740,0x82,1,0,0)
>  53950 egdb RET   futex 0
>  53950 egdb CALL  ptrace(PT_IO,97562,0x7f7e8a30,0)
>  53950 egdb RET   ptrace 0
>  53950 egdb CALL  ptrace(PT_IO,97562,0x7f7e8a30,0)
>  53950 egdb RET   ptrace 0
>  53950 egdb CALL  ptrace(PT_STEP,97562,0x1,0)
>  53950 egdb RET   ptrace 0
>  53950 egdb CALL  read(6,0x7f7e9187,0x1)
>  53950 egdb RET   read -1 errno 35 Resource temporarily unavailable
>  53950 egdb CALL  poll(0x6441581e720,3,0)
>  53950 egdb STRU  struct pollfd [3] { fd=4, events=0x1, revents=0<> } { fd=6, events=0x1, revents=0<> } { fd=10, events=0x1, revents=0<> }
>  53950 egdb RET   poll 0
>  53950 egdb CALL  wait4(WAIT_ANY,0x7f7e8efc,0<>,0)
> 
> Assuming that this is the actual end of the ktrace output and there
> isn't some missing ktrace output in a buffer somewhere, this looks
> like egdb is simply blocked in wait4(), which should be harmless and
> certainly not uninterruptible?

The single-thread check done by wait4() is non-interruptible.
When the debugger gets stuck, is it blocked in "suspend" state?

However, I think there is a bug in the single-thread switch code.
It looks like ps_singlecount can be decremented too much. This probably
is a regression from making ps_singlecount unsigned and letting
single_thread_check() run without the kernel lock.

The bug might go away if single_thread_check() made sure that
P_SUSPSINGLE is set before the thread suspends. 

Does the following patch help? Even if it does, it probably needs
some refining.

Index: kern/kern_sig.c
===================================================================
RCS file: src/sys/kern/kern_sig.c,v
retrieving revision 1.258
diff -u -p -r1.258 kern_sig.c
--- kern/kern_sig.c	15 Jun 2020 13:18:33 -0000	1.258
+++ kern/kern_sig.c	20 Jul 2020 04:27:30 -0000
@@ -1915,16 +1915,23 @@ single_thread_check(struct proc *p, int 
 					return (EINTR);
 			}
 
-			if (atomic_dec_int_nv(&pr->ps_singlecount) == 0)
-				wakeup(&pr->ps_singlecount);
+			SCHED_LOCK(s);
+			if (p->p_flag & P_SUSPSINGLE) {
+				if (atomic_dec_int_nv(&pr->ps_singlecount) == 0)
+					wakeup(&pr->ps_singlecount);
+			} else if ((p->p_flag & P_WEXIT) == 0) {
+				SCHED_UNLOCK(s);
+				CPU_BUSY_CYCLE();
+				continue;
+			}
 			if (pr->ps_flags & PS_SINGLEEXIT) {
+				SCHED_UNLOCK(s);
 				KERNEL_LOCK();
 				exit1(p, 0, 0, EXIT_THREAD_NOCHECK);
-				KERNEL_UNLOCK();
+				/* NOTREACHED */
 			}
 
 			/* not exiting and don't need to unwind, so suspend */
-			SCHED_LOCK(s);
 			p->p_stat = SSTOP;
 			mi_switch();
 			SCHED_UNLOCK(s);
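
The underflow suspected in this message can be illustrated outside
the kernel. The following is a minimal userland sketch in C, not
kernel code: struct counter, thread_parks() and has_underflowed() are
hypothetical names invented for this illustration. It only shows why
one decrement too many is harmful for an unsigned count: the value
wraps to UINT_MAX instead of going negative, so a "reached zero"
wakeup cannot fire again.

```c
#include <limits.h>

/*
 * Hypothetical stand-in for the ps_singlecount bookkeeping; this is
 * an illustration, not kernel code.
 */
struct counter {
	unsigned int remaining;	/* threads that still have to park */
	int wakeups;		/* times the waiter was woken */
};

/* Decrement and wake the waiter only when the count reaches zero. */
static void
thread_parks(struct counter *c)
{
	if (--c->remaining == 0)
		c->wakeups++;
}

/* True once the count has wrapped past zero. */
static int
has_underflowed(const struct counter *c)
{
	return (c->remaining > UINT_MAX / 2);
}
```

Starting with remaining = 1, the first call to thread_parks() brings
the count to 0 and records one wakeup. A second, spurious call wraps
the count to UINT_MAX, and a waiter that sleeps until the count is
zero would then stay asleep.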



gdb in uninterruptible wait

2020-07-19 Thread Julian Smith
I've been finding egdb and gdb rather easily get stuck in an
uninterruptible wait, e.g. when running the 'next' command after
hitting a breakpoint.

So it's not possible to kill the debuggee or gdb and the only way to
kill the debuggee process and free up its listening sockets seems to be
to reboot the entire system.

Perhaps unsurprisingly one cannot attach a second invocation of gdb to
the uninterruptible gdb, so i don't know for sure what syscall is being
run that is getting stuck.

The debuggee is a local build of the flightgear flight simulator.

Here's the output of ps for the debugger and debuggee:

12419 p0  D    0:34.37 egdb -ex handle SIGPIPE noprint nostop -ex set print thread-events off -ex set print pretty on -ex run --args build-walk/fgfs,clang,debug,opt,co
63921 p0  TX+  0:42.45 /home/jules/flightgear/build-walk/fgfs,clang,debug,opt,compositor,osg.exe --airport=egtk (fgfs,clang,debug)

I've tried using ktrace on egdb, and the kdump output ends like this:

 53950 egdb CALL  wait4(WAIT_ANY,0x7f7e8efc,0<>,0)
 53950 egdb RET   wait4 97562/0x17d1a
 53950 egdb CALL  ptrace(PT_GET_PROCESS_STATE,97562,0x7f7e8ef0,12)
 53950 egdb RET   ptrace 0
 53950 egdb CALL  ptrace(PT_GETREGS,161560,0x7f7e8b40,0)
 53950 egdb RET   ptrace 0
 53950 egdb CALL  futex(0x6444e37c490,0x82,1,0,0)
 53950 egdb RET   futex 0
 53950 egdb CALL  futex(0x644bef12740,0x82,1,0,0)
 53950 egdb RET   futex 0
 53950 egdb CALL  ptrace(PT_IO,97562,0x7f7e8a30,0)
 53950 egdb RET   ptrace 0
 53950 egdb CALL  ptrace(PT_IO,97562,0x7f7e8a30,0)
 53950 egdb RET   ptrace 0
 53950 egdb CALL  ptrace(PT_STEP,97562,0x1,0)
 53950 egdb RET   ptrace 0
 53950 egdb CALL  read(6,0x7f7e9187,0x1)
 53950 egdb RET   read -1 errno 35 Resource temporarily unavailable
 53950 egdb CALL  poll(0x6441581e720,3,0)
 53950 egdb STRU  struct pollfd [3] { fd=4, events=0x1, revents=0<> } { fd=6, events=0x1, revents=0<> } { fd=10, events=0x1, revents=0<> }
 53950 egdb RET   poll 0
 53950 egdb CALL  wait4(WAIT_ANY,0x7f7e8efc,0<>,0)

Assuming that this is the actual end of the ktrace output and there
isn't some missing ktrace output in a buffer somewhere, this looks
like egdb is simply blocked in wait4(), which should be harmless and
certainly not uninterruptible?

Does anyone have any suggestions about how to investigate this further?

I'm running OpenBSD 6.7 GENERIC.MP#182 amd64.

Thanks,

- Jules

-- 
http://op59.net