Greetings all.
One further observation: grandchild process times are included if an
application terminates normally. If they were not, the user and system
times would be near 0 for the locale tests.
The likely reason is that the child process waits on those grandchild
processes, either directly (using wait() or waitpid()) or indirectly
(using system()).
I suspect a better description of the time values returned on platforms
other than Solaris would be 'times for all immediate children, plus
times for all descendants that have been waited on by their parent
processes.' It should be possible to alter the test case to confirm or
refute this hypothesis, but doing so would involve wait()ing for
terminated grandchild processes in the child process before the child
process is killed.
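A minimal sketch of such a test follows. It is not Martin's test program;
the names burn_cpu and children_ticks are hypothetical, and the 0.3 second
CPU burn is an arbitrary choice. The child either waits for its grandchild
or exits without waiting, and the parent compares its own children's times
(tms_cutime + tms_cstime) before and after reaping the child.

```cpp
#include <sys/times.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

// Hypothetical helper: spin until roughly `seconds` of CPU time
// have been consumed by the calling process.
static void burn_cpu (double seconds)
{
    const clock_t start = clock ();
    volatile unsigned long sink = 0;
    while (double (clock () - start) / CLOCKS_PER_SEC < seconds)
        ++sink;
}

// Hypothetical helper: fork a child that forks a CPU-bound grandchild.
// If child_waits is true the child reaps the grandchild before exiting,
// otherwise it exits at once and leaves the grandchild unreaped.
// Returns the growth, in clock ticks, of the caller's children's
// user + system times, or -1 on error.
long children_ticks (bool child_waits)
{
    struct tms before, after;
    if ((clock_t)-1 == times (&before))
        return -1;

    const pid_t child = fork ();
    if (child < 0)
        return -1;

    if (0 == child) {
        const pid_t grandchild = fork ();
        if (0 == grandchild) {
            burn_cpu (0.3);
            _exit (0);
        }
        if (child_waits && 0 < grandchild)
            waitpid (grandchild, 0, 0);
        _exit (0);
    }

    if (child != waitpid (child, 0, 0))
        return -1;
    if ((clock_t)-1 == times (&after))
        return -1;

    return long (  (after.tms_cutime - before.tms_cutime)
                 + (after.tms_cstime - before.tms_cstime));
}
```

If the hypothesis holds, children_ticks (true) should report roughly the
grandchild's CPU time on every platform, while children_ticks (false)
should stay near 0 everywhere except Solaris.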
Attached is a refinement of version C of the patches I sent to the
list earlier. The main changes in this version are a refinement of the
handling of kill failures and the correction of a pair of off-by-one
errors.
--Andrew Black
Log:
* exec.cpp [!_WIN32 && !WIN64] (wait_for_child): Evaluate return value
when sending signals to child process group, correct off-by-one issue
when checking for end of signal array, try to kill off any grandchildren
left in the child process group after the child process terminates.
Martin Sebor wrote:
I took a closer look at the output produced by my little test program
(after making a small change to it where I moved the sleep(1) call in
the parent branch immediately above the waitpid call). Here's the
behavior I have observed on each of the following operating systems:
Here's a corrected interpretation of the results (the corrected
program is attached):
AIX: only immediate children's times are returned
HP-UX: only immediate children's times are returned
IRIX 6.5: only immediate children's times are returned
Linux: only immediate children's times are returned
Solaris: cumulative times of children and all their
descendants are returned
Tru64: only immediate children's times are returned
I was misled by the rapidly decreasing user times in test runs
that created increasing numbers of grandchildren. The decreasing
numbers actually make sense since more processes compete for the
CPU and thus each gets to use it for less time (with the OS spending
more of its own time switching between them).
So I guess the only odd duck is Solaris, which accumulates the
time used up by the child's children despite the fact that
they were never waited on.
Martin
Index: exec.cpp
===================================================================
--- exec.cpp (revision 449032)
+++ exec.cpp (working copy)
@@ -467,8 +467,33 @@
break;
}
- /* ignore kill errors (perhaps should record them)*/
- (void)kill (-child_pid, signals [siginx]);
+ if (0 != kill (-child_pid, signals [siginx])) {
+ if (ESRCH == errno)
+ /* ESRCH means 'No process (group) found'. Since
+ there aren't any processes in the process group,
+ we'll continue so we can collect the return value
+ if needed.
+ */
+ continue;
+ /* In addition to ESRCH, kill () may also set errno to
+ EINVAL or EPERM, according to the POSIX spec, in
+ addition to any platform specific extensions.
+ EPERM means 'No permissions to signal any receiving
+ process'. It is unlikely that this situation will
+ change, but we will try the remaining signals in the
+ signals array, in the same manner as if the signal had
+ been sent correctly.
+ EINVAL means 'The signal is an invalid or unsupported
+ signal number'. As the signal number macros used in
+ the signal array are hard coded, issues should be
+ detected at compile time, not run time. This is not a
+ fatal situation, so the remainder of signals in the
+ signal array will be tried, as if this transmission
+ had been successful.
+ The correct behavior for any platform-specific
+ extensions needs to be evaluated, but we are treating
+ them like EPERM or EINVAL at this time. */
+ }
/* Record the signal used*/
state.killed = signals [siginx];
@@ -476,7 +501,7 @@
++siginx;
/* Step to the next signal */
- if (siginx > sigcount) {
+ if (siginx >= sigcount) {
/* Still running, but we've run out of signals to try
Therefore, we'll set error flags and break out of
the loop.
@@ -522,6 +547,14 @@
/* Clear alarm */
alarm (0);
+ /* Kill/cleanup any grandchildren. */
+ /* On Solaris, this logic tries to avoid the situation where grandchild
+ process times are rolled into the timing of a later process */
+ while (siginx < sigcount && 0 == kill (-child_pid, signals [siginx])) {
+ ++siginx;
+ sleep (1);
+ }
+
return state;
}