Greetings all.
During an offline brainstorming session, Martin and I determined that
the cause of this observed problem is a race condition when a child
(monitored) process creates grandchild processes. If a child process
terminates prior to the grandchild process, it is possible for the
call to getrusage() to occur prior to the termination of the
grandchild process, causing the time consumed by that process to be
deferred to the next call to getrusage(), and the time to be merged
with that of the following process.
Attached are some possible solutions I've put together for this issue.
With solution A, I receive a usable status code on Linux, along with
an account of the amount of user/system time spent in the child
process (but not the grandchild processes). On Solaris, I receive an
ERROR status code, which may be caused by an inability of the Solaris
waitpid syscall to wait on process groups.
With solution B, the status code on Linux is inaccurate, though the
user and system times appear to reflect that of the child process
(again excluding the grandchild processes). On Solaris, I observe the
same behavior as in solution A, for the same probable reason.
With solution C, the status code on Linux is accurate, though the user
and system times appear to be inaccurate. On Solaris, the status and
user/system times appear to be correct.
At this point, I believe that solution C is the best choice. If one
of these patches is to be applied, the following change log could be
used.
Log:
* exec.cpp [!_WIN32 && !_WIN64] (wait_for_child): Resolve race
condition resulting in incorrect child process times.
--Andrew Black
Martin Sebor wrote:
Yo Andrew,
I have another issue with the reported times (this one seems real).
The times reported for a killed process are all zero and the time
seems to be added to the next process that runs after it. I noticed
it while running the locale tests:
$ make run
NAME STATUS ASSERTS FAILED PERCNT USER SYS
sanity_test.sh 0 46 0 100% 0.380 0.880
af_ZA.ISO-8859-1.sh 0 7 0 100% 39.570 1.110
ar_AE.ISO-8859-6.sh 0 7 0 100% 39.000 1.250
ar_BH.ISO-8859-6.sh 0 7 0 100% 39.010 1.300
ar_DZ.ISO-8859-6.sh 0 7 0 100% 38.970 1.220
ar_EG.ISO-8859-6.sh 0 7 0 100% 39.010 1.350
ar_IN.UTF-8.sh HUP 0.000 0.010
ar_IQ.ISO-8859-6.sh 0 7 0 100% 217.890 2.200
Could you see what's going on?
Thanks
Martin
------------------------------------------------------------------------
Index: exec.cpp
===================================================================
--- exec.cpp (revision 447484)
+++ exec.cpp (working copy)
@@ -419,7 +419,7 @@
unsigned siginx = 0;
- int stopped = 0;
+ int status, stopped = 0;
#ifdef WCONTINUED
int waitopts = WUNTRACED | WCONTINUED;
@@ -442,14 +442,14 @@
alarm (timeout);
while (1) {
- const pid_t wait_pid = waitpid (child_pid, &state.status,
waitopts);
+ const pid_t wait_pid = waitpid (-child_pid, &status, waitopts);
if (child_pid == wait_pid) {
- if (WIFEXITED (state.status) || WIFSIGNALED (state.status))
- break; /*we've got an exit state, so let's bail*/
- else if (WIFSTOPPED (state.status))
- stopped = state.status;
+ if (WIFEXITED (status) || WIFSIGNALED (status))
+ state.status = status;
+ else if (WIFSTOPPED (status))
+ stopped = status;
#ifdef WIFCONTINUED /*Perhaps we should guard WIFSTOPPED with this
guard also */
- else if (WIFCONTINUED (state.status))
+ else if (WIFCONTINUED (status))
stopped = 0;
#endif
else
@@ -460,15 +460,34 @@
/* timeout elapsed, so send a signal to the child */
if (stopped) {
/* If the child process is stopped, it is
incapable of
- recieving signals. Therefore, we'll record this
- and break out of the loop.
+ recieving signals. Therefore, we'll record this.
*/
state.status = stopped;
- break;
}
- /* ignore kill errors (perhaps should record them)*/
- (void)kill (-child_pid, signals [siginx]);
+ if(0 != kill (-child_pid, signals [siginx])) {
+ if (EPERM == errno){
+ /* EPERM means 'No permissions to signal any
recieving + process. Since we can't send
the signal, we'll
+ just move on.
+ */
+ state.status = -1;
+ state.killed = -1;
+ break;
+ }
+ if (ESRCH == errno)
+ /* ESRCH means 'No process (group) found'.
Since + there aren't any processes in the
process group, + we'll continue so we can
collect the return value
+ if needed.
+ */
+ continue;
+ /* The remaining error value is EINVAL, which
should never + happen. If one of our signals
were unsupported, we + should get a compile
error. Regardess, we tried to + send it, and
will move on.
+ */
+ }
/* Record the signal used*/
state.killed = signals [siginx];
@@ -501,9 +520,8 @@
; /* Now what? */
}
else if (ECHILD == errno) {
- /* should not happen */
- warn ("waitpid (%d) error: %s\n", (int)child_pid,
- strerror (errno));
+ /* There are no children left in the process group,
so exit */
+ break;
}
else {
/* waitpid () error */
@@ -511,12 +529,10 @@
strerror (errno));
}
}
- else if ((pid_t)0 == wait_pid) {
- /* should not happen */
- }
- else {
- /* what the heck? */
- }
+ /* Other possible values for wait_pid are 0, denoting no
valid status
+ when polling the process group, and the pid of a
grandchild process.
+ The former should never happen, and we don't care about
the later.
+ */
}
/* Clear alarm */
------------------------------------------------------------------------
Index: exec.cpp
===================================================================
--- exec.cpp (revision 447484)
+++ exec.cpp (working copy)
@@ -419,8 +419,6 @@
unsigned siginx = 0;
- int stopped = 0;
-
#ifdef WCONTINUED
int waitopts = WUNTRACED | WCONTINUED;
#else
@@ -442,34 +440,34 @@
alarm (timeout);
while (1) {
- const pid_t wait_pid = waitpid (child_pid, &state.status,
waitopts);
- if (child_pid == wait_pid) {
- if (WIFEXITED (state.status) || WIFSIGNALED (state.status))
- break; /*we've got an exit state, so let's bail*/
- else if (WIFSTOPPED (state.status))
- stopped = state.status;
-#ifdef WIFCONTINUED /*Perhaps we should guard WIFSTOPPED with this
guard also */
- else if (WIFCONTINUED (state.status))
- stopped = 0;
-#endif
- else
- ; /* huh? */
- }
- else if ((pid_t)-1 == wait_pid) {
+ const pid_t wait_pid = waitpid (-child_pid, 0, waitopts);
+ if ((pid_t)-1 == wait_pid) {
if (EINTR == errno && alarm_timeout) {
/* timeout elapsed, so send a signal to the child */
- if (stopped) {
- /* If the child process is stopped, it is
incapable of
- recieving signals. Therefore, we'll record this
- and break out of the loop.
+ if(0 != kill (-child_pid, signals [siginx])) {
+ if (EPERM == errno){
+ /* EPERM means 'No permissions to signal any
recieving + process. Since we can't send
the signal, we'll
+ just move on.
+ */
+ state.status = -1;
+ state.killed = -1;
+ break;
+ }
+ if (ESRCH == errno)
+ /* ESRCH means 'No process (group) found'.
Since + there aren't any processes in the
process group, + we'll continue so we can
collect the return value
+ if needed.
+ */
+ continue;
+ /* The remaining error value is EINVAL, which
should never + happen. If one of our signals
were unsupported, we + should get a compile
error. Regardess, we tried to + send it, and
will move on.
*/
- state.status = stopped;
- break;
}
- /* ignore kill errors (perhaps should record them)*/
- (void)kill (-child_pid, signals [siginx]);
-
/* Record the signal used*/
state.killed = signals [siginx];
@@ -501,9 +499,8 @@
; /* Now what? */
}
else if (ECHILD == errno) {
- /* should not happen */
- warn ("waitpid (%d) error: %s\n", (int)child_pid,
- strerror (errno));
+ /* There are no children left in the process group,
so exit */
+ break;
}
else {
/* waitpid () error */
@@ -511,14 +508,14 @@
strerror (errno));
}
}
- else if ((pid_t)0 == wait_pid) {
- /* should not happen */
- }
- else {
- /* what the heck? */
- }
+ /* Other possible values for wait_pid are 0, denoting no
valid status
+ when polling the process group, and the pid of a
grandchild process.
+ The former should never happen, and we don't care about
the later.
+ */
}
+ if (-1 != state.killed)
+ (void)waitpid (-child_pid, &state.status, waitopts);
/* Clear alarm */
alarm (0);
------------------------------------------------------------------------
Index: exec.cpp
===================================================================
--- exec.cpp (revision 447484)
+++ exec.cpp (working copy)
@@ -467,8 +467,29 @@
break;
}
- /* ignore kill errors (perhaps should record them)*/
- (void)kill (-child_pid, signals [siginx]);
+ if(0 != kill (-child_pid, signals [siginx])) {
+ if (EPERM == errno){
+ /* EPERM means 'No permissions to signal any
recieving + process. Since we can't send
the signal, we'll
+ just move on.
+ */
+ state.status = -1;
+ state.killed = -1;
+ break;
+ }
+ if (ESRCH == errno)
+ /* ESRCH means 'No process (group) found'.
Since + there aren't any processes in the
process group, + we'll continue so we can
collect the return value
+ if needed.
+ */
+ continue;
+ /* The remaining error value is EINVAL, which
should never + happen. If one of our signals
were unsupported, we + should get a compile
error. Regardess, we tried to + send it, and
will move on.
+ */
+ }
/* Record the signal used*/
state.killed = signals [siginx];
@@ -522,6 +543,12 @@
/* Clear alarm */
alarm (0);
+ /*Kill/cleanup any grandchildren.*/
+ while (siginx <= sigcount && 0 == kill (-child_pid, signals
[siginx])) {
+ ++siginx;
+ sleep(1);
+ }
+
return state;
}