> I do not know. I'd leave this to Roland. I mean, if he thinks this
> should be fixed - I'll try to fix.
> 
> But. This all looks unfixable to me. In my opinion, the kernel is
> obviously wrong, and the test-cases are wrong too. And any fix in this
> area is user-visible and can break the current expectations.

My view is that the handling of a tracee that was already job-stopped
(left by ^Z, etc.) or of a tracee that you want to leave job-stopped for
its natural parent to see on detach, has never really worked.  In the
past the kernel and GDB (let alone strace) have both been wrong and even
internally inconsistent, and each changed incompatibly with itself (not
to mention with each other) over the years.  So for those cases as such,
I don't think any particular compatibility matters so much as just
settling on something acceptable now.  I expect that no single behavior
for this case is actually compatible with the expectations of a
significant range of tools/versions, so nearly any change at all will be
a regression for some and an improvement for others.

However, we unfortunately cannot say the same for all quirks that we
would probably like to change.  That is, the treatment of SIGSTOP and
what wakeups get done and so forth is an intricate dependency of plain
attach/detach logic in all manner of debuggeresque tools for numerous
corner cases (many MT-related).  So I think the main constraint on what
we can come up with as "newly almost sane" semantics is that we really
should not regress for any other corner of attach/detach et al logic in
any current or past version of gdb, or strace, or anything else.

> Firstly, I think we should un-revert edaba2c5334492f82d39ec35637c6dea5176a977.

Yes.  That unconditional wakeup was one of the very first things years
ago that made me shake my head and marvel about what sucker they would
ever get to maintain what all the purported ABI semantics corners of
this ludicrous ptrace() implementation were supposed to be, glad that it
was not my problem to touch that wormy can of flaming obvious wrong with
a ten foot pole.  At the time it was probably already obvious to most
people other than me that I would be that very sucker for years to come.
Hmm, perhaps I have not set this up very well now for asking you to
think up all the details of matching the established semantics (that
really sounds better than "purported", doesn't it?) of this insane
implementation. :-)

> We had a lengthy discussion about this.

Yes.  I only ever wanted that revert then because it was too late in the
2.6.30 cycle to hash this all out and get it really right.  I meant that
we should leave wrong enough alone in 2.6.30 but get it all worked out
more properly in 2.6.31, but I forgot to follow up on it.  If we can
iron out the behavior now and the upstream version of implementing it is
not big new hair, it might still be possible to get it fixed in 2.6.32.

That piece of implementation is 100% wrong.  But we have to figure out
what the manifest semantics are today from the userland perspective and
decide what exactly we want them to be before we implement those precise
semantics in some sensible way.  We may have to settle on something
that, while more consistent than today's kernel, is still somewhat wrong
in the abstract, when we need to preserve application compatibility.

>       -                       sig->flags = SIGNAL_STOP_STOPPED;
>       +                       sig->flags = SIGNAL_STOP_STOPPED | SIGNAL_STOP_DEQUEUED;

Boy, do I not understand why that does anything about this at all!
But I am barely awake tonight.  Ok, I guess I do sort of if it goes
along with some other patch to set SIGNAL_STOP_STOPPED.  But since
you've verified you really understand what happens, you can tell us!

> But as I said, I do not really understand what this test-case tries
> to do. What should ptrace(PTRACE_DETACH, SIGSTOP) mean? I think that
> ptrace(PTRACE_DETACH, signr) should mean the tracee should proceed
> with this signal, as if it was sent by, say, kill.

Except for not being possibly intercepted again, roughly yes.  But note
that this really is only delivery, not generation (POSIX signal terms).
(In the kludge cases for blocked signals and non-signal stops, sometimes
it really is generation like kill, but here I mean resuming from real
signal stops.)  That means that magic generation-time effects, such as
SIGSTOP clearing pending SIGCONT and vice versa, do not happen (if
passing on the signal, they happened before at original generation).

Internally, "dequeue-time" effects, i.e. SIGNAL_STOP_DEQUEUED and timer
rearming, also don't happen (because they happened before).  But since
SIGNAL_STOP_DEQUEUED is pure internal bookkeeping, that is something we
could change as an implementation detail for the semantics we want.

> In this case, I don't understand why stopped-attach-transparency
> "sends" SIGSTOP to every sub-thread. If the tracer wants to stop
> the thread group after detach, it can do
> 
>       ptrace(PTRACE_DETACH, anythread, SIGSTOP);
>       for_each_other_thread(pid)
>               ptrace(PTRACE_DETACH, anythread, 0);

That is racy.  Each thread could resume and run a little before the
first thread gets scheduled, processes the signal delivery, and
interrupts all the other threads.  That's always been the case.
AFAIK the pattern the test uses (passing SIGSTOP on every thread's
detach) has long been consistently reliable for leaving the whole
process stopped, so any existing debugger code that works this way will
want to stay as it is for the foreseeable future so it works with a
wide range of kernel vintages.
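
For illustration, here is a minimal sketch of detaching with SIGSTOP on
every thread rather than on only one.  It assumes Linux, and it assumes
each tracee is sitting in a signal-delivery-stop, which is where the data
argument of PTRACE_DETACH is reliably taken as the signal to deliver.  The
helper and demo function names are hypothetical.  The demo exercises only
a single-threaded child, so it shows the call sequence and the resulting
job stop, not the multi-thread race itself.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: detach every traced thread in `tids`, passing
 * SIGSTOP each time so that no thread resumes without a stop to deliver.
 */
int detach_all_with_sigstop(const pid_t *tids, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (ptrace(PTRACE_DETACH, tids[i], NULL, (void *)(long)SIGSTOP) == -1)
            return -1;
    }
    return 0;
}

/*
 * Demo: trace a child, detach it with SIGSTOP, and check as its natural
 * parent that it is then reported as job-stopped.  Returns 0 on success.
 */
int demo_detach_leaves_child_stopped(void)
{
    int status;
    int rc;
    pid_t child = fork();

    if (child == -1)
        return -1;

    if (child == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(1);                   /* tracing not permitted here */
        raise(SIGSTOP);                 /* enter signal-delivery-stop */
        for (;;)
            pause();
    }

    /* Wait for the ptrace stop caused by the child's SIGSTOP. */
    if (waitpid(child, &status, 0) != child || !WIFSTOPPED(status)) {
        kill(child, SIGKILL);
        waitpid(child, &status, 0);
        return -1;
    }

    rc = detach_all_with_sigstop(&child, 1);

    /* No longer the tracer: look for an ordinary job-control stop. */
    if (rc == 0 &&
        (waitpid(child, &status, WUNTRACED) != child ||
         !WIFSTOPPED(status) || WSTOPSIG(status) != SIGSTOP))
        rc = -1;

    kill(child, SIGKILL);               /* clean up the stopped child */
    waitpid(child, &status, 0);
    return rc;
}
```

A real debugger would pass the IDs of all attached threads to the helper.
With SIGSTOP on every detach, a thread that gets scheduled early still has
a stop of its own to process instead of running freely until some other
thread's stop interrupts it.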


Thanks,
Roland
