Greetings all.

One further observation: grandchild process times are included when an application terminates normally. If they were not, the user and system times for the locale tests would be near 0.

The likely reason is that the child process waits on those grandchild processes, either directly (using wait() or waitpid()) or indirectly (using system()).

I suspect a better description of the time values returned for platforms other than Solaris would be 'times for all immediate children, plus times for all descendants who have been waited upon by their parent processes.' It should be possible to alter the test case to confirm or refute this hypothesis, but doing so would involve wait()ing for terminated grandchild processes in the child process before the child process is killed.
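A minimal sketch of such an experiment follows. It is not Martin's test program; the helper names burn_cpu() and child_ticks() are made up for illustration, and it assumes the child times are read with times(). The child forks a CPU-bound grandchild and either reaps it or exits without waiting; the parent then reports how many clock ticks were charged to its children.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/times.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: spin until roughly `ticks` clock ticks of CPU
   time (user + system) have been charged to the calling process. */
static void burn_cpu (clock_t ticks)
{
    struct tms t0, t1;
    volatile unsigned long sink = 0;

    times (&t0);
    do {
        for (int i = 0; i < 100000; ++i)
            sink += (unsigned long)i;
        times (&t1);
    } while (  (t1.tms_utime + t1.tms_stime)
             - (t0.tms_utime + t0.tms_stime) < ticks);
}

/* Hypothetical helper: fork a child which forks a CPU-bound grandchild.
   If reap_grandchild is nonzero the child waits for the grandchild
   before exiting, otherwise it exits at once and orphans it.
   Returns the clock ticks added to the caller's children times
   (tms_cutime + tms_cstime) by waiting for the child, or -1 on error. */
long child_ticks (int reap_grandchild)
{
    struct tms before, after;
    const clock_t work = (clock_t)(sysconf (_SC_CLK_TCK) / 5); /* ~0.2s */

    assert (work > 0);
    times (&before);

    const pid_t child = fork ();
    if (child < 0)
        return -1;

    if (0 == child) {
        const pid_t grandchild = fork ();
        if (0 == grandchild) {
            burn_cpu (work);
            _exit (0);
        }
        if (grandchild > 0 && reap_grandchild)
            waitpid (grandchild, NULL, 0);
        _exit (0);
    }

    if (waitpid (child, NULL, 0) < 0)
        return -1;

    times (&after);
    return (long)(  (after.tms_cutime + after.tms_cstime)
                  - (before.tms_cutime + before.tms_cstime));
}
```

If the hypothesis holds, child_ticks (1) should report roughly the grandchild's CPU time on the non-Solaris platforms, while child_ticks (0) should report close to 0 there. This sketch lets the child exit on its own rather than killing it, so it does not reproduce the Solaris case from Martin's results.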

Attached is a refinement of version C of the patches I sent to the list earlier. The main changes in this version are refined handling of kill failures and the correction of a pair of off-by-one errors.

--Andrew Black

Log:
* exec.cpp [!_WIN32 && !WIN64] (wait_for_child): Evaluate return value when sending signals to child process group. Correct off-by-one issue when checking for end of signal array. Try to kill off any grandchildren left in the child process group after the child process terminates.
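For readers without the source at hand, here is a rough standalone sketch of the escalation idea described in the log. It is not the code from exec.cpp: kill_group() is a hypothetical helper, the one second grace period is an assumption, and the child is assumed to lead its own process group.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: send each of signals [0 .. sigcount) in turn to
   the process group led by `child` until `child` can be reaped.
   Returns the signal after which the child was reaped, 0 if the group
   had already vanished and the child was reaped without signalling,
   or -1 if the child could not be reaped. */
int kill_group (pid_t child, const int *signals, size_t sigcount)
{
    assert (child > 0 && signals != NULL);

    /* '<' rather than '<=': valid indices stop at sigcount - 1 */
    for (size_t siginx = 0; siginx < sigcount; ++siginx) {

        if (0 != kill (-child, signals [siginx])) {
            if (ESRCH == errno) {
                /* no process (group) found: nothing left to signal,
                   so just try to collect the exit status */
                return waitpid (child, NULL, WNOHANG) == child ? 0 : -1;
            }
            /* EPERM, EINVAL or a platform extension: carry on with
               the remaining signals as if this one had been sent */
        }

        sleep (1);   /* assumed grace period before checking */

        const pid_t reaped = waitpid (child, NULL, WNOHANG);
        if (reaped == child)
            return signals [siginx];
        if (reaped < 0)
            return -1;
    }

    return -1;   /* still running and out of signals to try */
}
```

A caller would typically pass an array ordered from gentlest to harshest, for example SIGTERM followed by SIGKILL.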

Martin Sebor wrote:
Martin Sebor wrote:
I took a closer look at the output produced by my little test program
(after making a small change to it where I moved the sleep(1) call in
the parent branch immediately above the waitpid call). Here's the
behavior I have observed on each of the following operating systems:


Here's a corrected interpretation of the results (the corrected
program is attached):

AIX:      only immediate children's times are returned
HP-UX:    only immediate children's times are returned
IRIX 6.5: only immediate children's times are returned
Linux:    only immediate children's times are returned
Solaris:  cumulative times of children and all their
          descendants are returned
Tru64:    only immediate children's times are returned

I was misled by the rapidly decreasing user times in test runs
that created increasing numbers of grandchildren. The decreasing
numbers actually make sense since more processes compete for the
CPU and thus get to use it less time (with the OS spending more
of its own time switching between them).

So I guess the only odd duck is Solaris which accumulates the
time used up by the child's children despite the fact that
they were never waited on.

Martin
Index: exec.cpp
===================================================================
--- exec.cpp	(revision 449032)
+++ exec.cpp	(working copy)
@@ -467,8 +467,33 @@
                     break;
                 }
 
-                /* ignore kill errors (perhaps should record them)*/
-                (void)kill (-child_pid, signals [siginx]);
+                if(0 != kill (-child_pid, signals [siginx])) {
+                    if (ESRCH == errno)
+                        /* ESRCH means 'No process (group) found'.  Since 
+                           there aren't any processes in the process group, 
+                           we'll continue so we can collect the return value
+                           if needed.
+                        */
+                        continue;
+                    /* In addition to ESRCH, kill () may also set errno to 
+                       EINVAL or EPERM, according to the POSIX spec, in 
+                       addition to any platform specific extensions.
+                       EPERM means 'No permissions to signal any receiving 
+                       process'.  It is unlikely that this situation will
+                       change, but we will try the remaining signals in the
+                       signals array, in the same manner as if the signal had
+                       been sent correctly.
+                       EINVAL means 'The signal is an invalid or unsupported
+                       signal number'.  As the signal number macros used in 
+                       the signal array are hard coded, issues should be 
+                       detected at compile time, not run time.  This is not a
+                       fatal situation, so the remainder of signals in the
+                       signal array will be tried, as if this transmission
+                       had been successful.
+                       The correct behavior for any platform-specific 
+                       extensions needs to be evaluated, but we are treating
+                       them like EPERM or EINVAL at this time. */
+                }
 
                 /* Record the signal used*/
                 state.killed = signals [siginx];
@@ -476,7 +501,7 @@
                 ++siginx;
 
                 /* Step to the next signal */
-                if (siginx > sigcount) {
+                if (siginx >= sigcount) {
                     /* Still running, but we've run out of signals to try
                        Therefore, we'll set error flags and break out of 
                        the loop.
@@ -522,6 +547,14 @@
     /* Clear alarm */
     alarm (0);
 
+    /* Kill/cleanup any grandchildren. */
+    /* On Solaris, this logic tries to avoid the situation where grandchild
+       process times are rolled into the timing of a later process */
+    while (siginx < sigcount && 0 == kill (-child_pid, signals [siginx])) {
+        ++siginx;
+        sleep(1);
+    }
+
     return state;
 }
 
