> Yes. In particular, ptrace(PTRACE_DETACH, SIGKILL) should cancel
> SIGNAL_STOP_STOPPED, yes?
Yes.

> > > - sig->flags = SIGNAL_STOP_STOPPED;
> > > + sig->flags = SIGNAL_STOP_STOPPED | SIGNAL_STOP_DEQUEUED;
> >
> > Boy, do I not understand why that does anything about this at all!
> > But I am barely awake tonight. Ok, I guess I do sort of if it goes
> > along with some other patch to set SIGNAL_STOP_STOPPED. But since
> > you've verified you really understand what happens, you can tell us!

I actually thought of it right after I sent this, but I was too tired to follow up then. It's good that you've posted this particular concrete scenario to document it more fully.

Here's the way I think about it: SIGNAL_STOP_DEQUEUED exists for one purpose. It's to ensure that SIGCONT and SIGKILL can clear it to complete their required effect of clearing all pending stop signals. (It fills the hole when another thread has dequeued a stop signal and then dropped the siglock to make its call to is_current_pgrp_orphaned(); that half-delivered signal is still considered "pending" and thus must be cancelled by SIGCONT or SIGKILL.)

In the debugger case, a far larger hole is possible: a thread has dequeued a stop signal and then dropped the siglock to block for an arbitrary period while the debugger contemplates the signal. But to me this is really the same case as far as the signal semantics are concerned. When the debugger decides to send the signal on, the thread picks up in the same "half-delivered" situation and goes the rest of the way.

What I've just described is a simple "race" with an external SIGCONT or SIGKILL. It maps exactly to the is_current_pgrp_orphaned() window; it's just a window that can easily be far larger. It can be held open indefinitely, so a debugger user with a global perspective can observe it as a "non-racy" hole (hit SIGTSTP in the debugger, send SIGCONT from another terminal, continue in the debugger).

Now, the case we are considering really is different from that race.
But I think the same essential logic applies: you have a half-delivered stop signal "in flight", so either there has been a SIGCONT or SIGKILL to cancel it, or there hasn't. Since there hasn't, nothing should prevent the normal operation of that stop signal's final delivery. It's a bug that something does.

Another way to put it is that the "exists for one purpose" statement above implies that only an actual SIGCONT or SIGKILL should ever clear SIGNAL_STOP_DEQUEUED. In fact, only one place clears the flag explicitly, but six others do so implicitly by assigning ->flags outright.

The one explicit place and one of the implicit places are in the case that clearly should clear it: the SIGCONT case in prepare_signal().

Three implicit places are the ->flags = SIGNAL_GROUP_EXIT cases (zap_process, do_group_exit, complete_signal). These are harmless because they are already effectively mutually exclusive with it, since the one check of SIGNAL_STOP_DEQUEUED is:

	if (!likely(sig->flags & SIGNAL_STOP_DEQUEUED) ||
	    unlikely(signal_group_exit(sig)))
		return 0;

The remaining two places are the ->flags = SIGNAL_STOP_STOPPED cases in do_signal_stop and exit_signals. Since SIGNAL_STOP_DEQUEUED must already have been set whenever you can get to those paths, it is harmless to use "->flags = SIGNAL_STOP_STOPPED | SIGNAL_STOP_DEQUEUED" instead of:

	sig->flags &= SIGNAL_STOP_DEQUEUED;
	sig->flags |= SIGNAL_STOP_STOPPED;

or anything like that.

It's probably cleanest to consolidate those two cases into calls to a single subroutine that does the tracehook_notify_jctl logic, the unlock, and do_notify_parent_cldstop. It can take a caller flag, or just check PF_EXITING, to omit the ->exit_code and ->state changes of the do_signal_stop version of the code. That one subroutine can have a clear comment about the nonobvious flag usage.

> But please remember, the patch above is not complete of course and currently
> I do not see the good solution.

What's incomplete aside from handling the exit_signals case the same way?
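To make the flag argument above concrete, here is a small user-space model of it in C. It is a sketch under simplifying assumptions, not kernel code: the flag values, struct model_signal and every model_* helper are hypothetical; a single thread stands in for the racing threads; only SIGCONT (not SIGKILL) is modeled as a canceller, and its real prepare_signal() handling is reduced to one assignment; and the tracehook_notify_jctl, unlock and do_notify_parent_cldstop steps of the proposed subroutine are left out.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative model only: the flag names mirror the ones discussed above,
 * but these values, the struct and all model_* helpers are hypothetical.
 */
#define SIGNAL_STOP_STOPPED	0x1u	/* group stop has completed */
#define SIGNAL_STOP_DEQUEUED	0x2u	/* a stop signal is "half-delivered" */
#define SIGNAL_STOP_CONTINUED	0x4u	/* SIGCONT was received */
#define SIGNAL_GROUP_EXIT	0x8u	/* group exit in progress */

struct model_signal {
	unsigned int flags;
};

/* A thread dequeues a stop signal, then would drop the siglock. */
static void model_dequeue_stop(struct model_signal *sig)
{
	sig->flags |= SIGNAL_STOP_DEQUEUED;
}

/* Simplified SIGCONT: assigning ->flags implicitly clears DEQUEUED. */
static void model_sigcont(struct model_signal *sig)
{
	sig->flags = SIGNAL_STOP_CONTINUED;
}

/* The final-delivery check quoted above, without likely()/unlikely(). */
static bool model_stop_allowed(const struct model_signal *sig)
{
	if (!(sig->flags & SIGNAL_STOP_DEQUEUED) ||
	    (sig->flags & SIGNAL_GROUP_EXIT))
		return false;
	return true;
}

/* Old assignment in the stop-completion paths: wipes DEQUEUED. */
static void model_finish_stop_old(struct model_signal *sig)
{
	sig->flags = SIGNAL_STOP_STOPPED;
}

/*
 * Hypothetical consolidated helper for both stop-completion paths
 * (the do_signal_stop and exit_signals cases). DEQUEUED must already be
 * set to get here, so this assignment is equivalent to
 *	sig->flags &= SIGNAL_STOP_DEQUEUED;
 *	sig->flags |= SIGNAL_STOP_STOPPED;
 */
static void model_finish_stop(struct model_signal *sig)
{
	assert(sig->flags & SIGNAL_STOP_DEQUEUED);
	sig->flags = SIGNAL_STOP_STOPPED | SIGNAL_STOP_DEQUEUED;
}

/*
 * Scenario: a stop signal is in flight when a stop completes on another
 * path; optionally a SIGCONT arrives before the in-flight signal's final
 * delivery. Returns whether that final delivery may still stop.
 */
static bool model_in_flight_stop(bool use_new_helper, bool sigcont)
{
	struct model_signal sig = { 0 };

	model_dequeue_stop(&sig);
	if (use_new_helper)
		model_finish_stop(&sig);
	else
		model_finish_stop_old(&sig);
	if (sigcont)
		model_sigcont(&sig);
	return model_stop_allowed(&sig);
}
```

With the old assignment, model_in_flight_stop(false, false) returns false even though no SIGCONT was sent, which is the bug described above; with the consolidated helper, only the SIGCONT case cancels the in-flight stop.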
> I am starting to think we should forget
> about these bugs, merge utrace-ptrace, and then try to fix them.

If we can have utrace-ptrace code whose corner-case behavior matches the old code and is itself clean, then I don't care about the order in which the changes go in. But it's not really clear to me that we can even describe the old behavior in terms clean enough to make an exact work-alike implementation that could possibly be clean.

Thanks,
Roland