This has a woeful lack of introduction or explanation for readers of the posting, and of appropriate comments in the code about the whole plan here.
Firstly, I'll explain the background to this for the benefit of other reviewers and mailing list readers. The Linux kernel implementation of wait4 works by iterating over a list of all children and tracees, checking each for being dead or stopped or whatever so as it's interesting to report it. In Linux kernels prior to 2.6.25, when this logic found a stopped child to report (i.e. when yielding a WIFSTOPPED status), it would move that child to the end of the list. Hence, the next wait4 call would check each other child first, before the one returned last time. This reordering had a useful effect in some scenarios. A simple-minded tracer will use wait4 to see a tracee stop, then use ptrace to resume that tracee, and then loop to call wait4 for the next tracee to stop. This is exactly what strace does. When tracing multiple threads, if the first thread to be examined is busy-waiting, or just doing very quick system calls with little computation between them, that thread might already have stopped again for its next event by the time strace calls wait4 again. When that happens, without the reordering behavior, strace will report and resume that some one thread over and over again, leaving the other threads still stopped and never getting around to reporting and resuming the. This is called a "starvation" scenario. But, because of the old reordering behavior, reporting would always be round-robin among stopped candidates, and hence so would strace's resuming. So before 2.6.25, the "starvation" scenario did not show up in practice. Since version 2.6.25, Linux no longer does this reordering. The list stays in its original order throughout (roughly, the order of child creation, followed by the order of tracee attach). The kernel maintainers decided that there was never any guarantee of any sort of about the order in which available children or tracees are seen in wait4 calls--it's always been subject to timing, and userland always should have been sophisticated enough to drain all the pending reports before resuming anybody. (In fact, I was the kernel maintainer who decided that. It definitely is the right thing for the kernel to do, I can say with my kernel hat on. With my strace hat on, I can fret that the kernel's change has made it more likely that cases where strace's lack of sophistication in this regard matters will arise in practice. But the kernel change has already happened and it won't go back.) Now, to this patch. There are magic global variables with no explanation why--something like that always just obviously needs more scrutiny. There are also nits like wrong comment formatting that everybody here should know well for this code base by now. It is a bit hard to follow the changes to the control flow being made, in the absence of any comments about it all. I think it is doing it sort of inside out of how I would approach the problem. Here is what I think is the way to do this that is fairly clean and the least invasive to strace's existing code and control flow. In trace(), replace the three places PTRACE_SYSCALL is used with a common subroutine that doesn't call ptrace immediately. Instead, it stores the resumption signal and adds the tcb to the tail of a linked list. At the top of the loop, where we call wait4, check that list. If it's nonempty, then use WNOHANG. When wait4 returns 0, then consume the linked list by doing each PTRACE_SYSCALL in the order they were reported, and short-circuit back to the top of the loop. There the list will be empty, so you'll call wait4 without WNOHANG. For the !WIFSTOPPED (death) cases, you need to be sure not to get to droptcb while the linked list still points to the tcb. (You can get a death report without having resuming it yet, for various sudden-death cases.) Probably the easiest thing is to use a new flag TCB_RESUMING to keep track of being on the list (or you might just use the list pointer field in a way where you can check it, i.e. tail element's pointer is not to NULL so it looks like an unlisted tcb). Check that in droptcb and have it just clear ->pid but leave ->flags TCB_INUSE|TCB_RESUMING. Then the resumption loop can notice these ->pid==0 tcb's and just clear their ->flags (droptcb already having done the rest of the work). Does that make sense? Thanks, Roland ------------------------------------------------------------------------------ _______________________________________________ Strace-devel mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/strace-devel
