> I do not know. I'd leave this to Roland. I mean, if he thinks this
> should be fixed - I'll try to fix.
>
> But. This all looks unfixeable to me. In my opinion, the kernel is
> obviously wrong, and test-case are wrong too. And any fix in this
> area is user-visible and can break the current expectations.

My view is that the handling of a tracee that was already job-stopped
(left by ^Z, etc.), or of a tracee that you want to leave job-stopped
for its natural parent to see on detach, has never really worked. In
the past the kernel and GDB (let alone strace) have both been wrong and
even internally inconsistent, and each changed incompatibly with itself
(not to mention with each other) over the years. So for those cases as
such, I don't think any particular compatibility matters so much as just
settling on something acceptable now. I expect that no single behavior
for this case is actually compatible with the expectations of a
significant range of tools/versions, so nearly any change at all will be
a regression for some and an improvement for others.

However, we unfortunately cannot say the same for all the quirks that we
would probably like to change. That is, the treatment of SIGSTOP, which
wakeups get done, and so forth is an intricate dependency of the plain
attach/detach logic in all manner of debuggeresque tools for numerous
corner cases (many MT-related). So I think the main constraint on what
we can come up with as "newly almost sane" semantics is that we really
should not regress any other corner of attach/detach et al logic in any
current or past version of gdb, or strace, or anything else.

> Firstly, I think we should un-revert edaba2c5334492f82d39ec35637c6dea5176a977.

Yes. That unconditional wakeup was one of the very first things years
ago that made me shake my head and marvel about what sucker they would
ever get to maintain what all the purported ABI semantics corners of
this ludicrous ptrace() implementation were supposed to be, glad that it
was not my problem to touch that wormy can of flaming obvious wrong with
a ten foot pole. At the time it was probably already obvious to most
people other than me that I would be that very sucker for years to come.

Hmm, perhaps I have not set this up very well now for asking you to
think up all the details of matching the established semantics (that
really sounds better than "purported", doesn't it?) of this insane
implementation. :-)

> We had a lengthy discussion about this.

Yes. I only ever wanted that revert then because it was too late in the
2.6.30 cycle to hash this all out and get it really right. I meant that
we should leave wrong enough alone in 2.6.30 but get it all worked out
more properly in 2.6.31, but I forgot to follow up on it. If we can
iron out the behavior now and the upstream version of implementing it is
not big new hair, it might still be possible to get it fixed in 2.6.32.

That piece of implementation is 100% wrong. But we have to figure out
what the manifest semantics are today from the userland perspective and
decide what exactly we want them to be before we implement those precise
semantics in some sensible way. We may have to settle on something
that, while more consistent than today's kernel, is still somewhat wrong
in the abstract, when we need to preserve application compatibility.

> -        sig->flags = SIGNAL_STOP_STOPPED;
> +        sig->flags = SIGNAL_STOP_STOPPED | SIGNAL_STOP_DEQUEUED;

Boy, do I not understand why that does anything about this at all!
But I am barely awake tonight. Ok, I guess I do sort of if it goes
along with some other patch to set SIGNAL_STOP_STOPPED. But since
you've verified you really understand what happens, you can tell us!

> But as I said, I do not really understand what this test-case tries
> to do. What ptrace(PTRACE_DETACH, SIGSTOP) should mean? I think that
> ptrace(PTRACE_DETACH, signr) should mean the tracee should proceed
> with this signal, as if it was sent by, say, kill.

Except for not being possibly intercepted again, roughly yes. But note
that this really is only delivery, not generation (POSIX signal terms).
(In the kludge cases for blocked signals and non-signal stops, sometimes
it really is generation like kill, but here I mean resuming from real
signal stops.) That means that magic generation-time effects, such as
SIGSTOP clearing pending SIGCONT and vice versa, do not happen (if
passing on the signal, they happened before at original generation).
Internally, "dequeue-time" effects, i.e. SIGNAL_STOP_DEQUEUED and timer
rearming, also don't happen (because they happened before). But since
SIGNAL_STOP_DEQUEUED is pure internal bookkeeping, that is something we
could change as an implementation detail for the semantics we want.

> In this case, I don't understand why stopped-attach-transparency
> "sends" SIGSTOP to every sub-thread. If the tracer wants to stop
> the thread group after detach, it can do
>
>         ptrace(PTRACE_DETACH, anythread, SIGSTOP);
>         for_each_other_thread(pid)
>                 ptrace(PTRACE_DETACH, anythread, 0);

That is racy. Each thread could resume and run a little before the
first thread gets scheduled, processes the signal delivery, and
interrupts all the other threads. That's always been the case.

AFAIK the test case's pattern of passing SIGSTOP on every thread's
detach has been consistently reliable at leaving the whole process
stopped for a long time, so any existing debugger code that works this
way will want to stay as it is for the foreseeable future so it works
with a wide range of kernel vintages.


Thanks,
Roland
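
For concreteness, here is a rough sketch of the detach pattern discussed above, where SIGSTOP is passed on the detach of every traced thread rather than on just one of them. It is only an illustration under assumptions, not code from the test case or from any debugger: the function names `detach_all_stopped` and `demo_detach_leaves_stopped` are hypothetical, and the demo uses a single-threaded child of the tracer so that the result can be observed with plain waitpid().

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: detach every traced thread, passing SIGSTOP as the
 * signal to deliver on each detach, so that no thread is resumed with
 * signal 0 and left free to run before the group stop takes hold.
 * Returns 0 if every detach succeeded, -1 otherwise.
 */
int detach_all_stopped(const pid_t *tids, size_t n)
{
	int ret = 0;

	for (size_t i = 0; i < n; i++) {
		if (ptrace(PTRACE_DETACH, tids[i], NULL,
			   (void *)(long)SIGSTOP) == -1)
			ret = -1;
	}
	return ret;
}

/*
 * Hypothetical demo: attach to a freshly forked child, detach it with
 * SIGSTOP, and check that its natural parent (this process) then sees
 * it as job-stopped. Returns 0 on success, -1 otherwise.
 */
int demo_detach_leaves_stopped(void)
{
	int status;
	int ok = -1;
	pid_t child = fork();

	if (child == -1)
		return -1;
	if (child == 0) {
		for (;;)
			pause();	/* child just waits for signals */
	}

	if (ptrace(PTRACE_ATTACH, child, NULL, NULL) == -1)
		goto out;
	/* Wait for the stop caused by the attach itself. */
	if (waitpid(child, &status, 0) != child || !WIFSTOPPED(status))
		goto out;
	if (detach_all_stopped(&child, 1) == -1)
		goto out;
	/* After detach, the job stop should be visible via WUNTRACED. */
	if (waitpid(child, &status, WUNTRACED) == child &&
	    WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP)
		ok = 0;
out:
	kill(child, SIGKILL);
	waitpid(child, &status, 0);
	return ok;
}
```

For a multithreaded tracee, `tids` would list every thread the tracer had attached to. Whether the natural parent actually observes the stop depends on the kernel semantics under discussion here.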