Recently Jenkins unit test sometimes fails on PidStatTests.test, for
example http://gerrit.ovirt.org/#/c/14670/

After it execCmd a "sleep" command with sync=False, in
/proc/[xxxpid]/stat we should see the name is "sleep", but in this case
we get "python", which means there is a possible race condition. The
most possible situation is that execCmd returns before the child process
execvp the "sleep", then the parent process reads the stat and sees the
process name is still "python".

cpopen is designed to avoid this kind of race. It uses a pipe to
synchronize the child and parent. It sets the FD_CLOEXEC on the child
side of the pipe, so that once execvp succeeds, the child pipe is
closed. If execvp fails, child writes error code to the pipe. The parent
reads the other end of the pipe like follow.

if (read(errnofd[0], &childErrno, sizeof(int)) == sizeof(int)) {
    PyErr_SetString(PyExc_OSError, strerror(childErrno));
    goto fail;

It assumes that when read returns, the return value is either
sizeof(int), which indicates an error in the child side, or 0, which
indicates the child side of pipe is closed and execvp succeeds. However
this assumption may not always be true. If the parent process gets a
signal, the read invocation would be interrupt, and the code treats this
interruption the same as the case of execvp succeeds. If the system is
very busy like our Jenkins slave concurrently executing jobs, it's
possible the parent gets interrupted before the child execvp succeeds,
so cpopen returns to execCmd and cause the race.

To produce this problem, we can add a "sleep(10);" before the exec
invocation in cpopen.c, it simulates a busy/slow system. Then in a
Python interpreter, register a signal handler and calls execCmd.

>>> from vdsm.utils import execCmd
>>> from vdsm import utils
>>> def handler(signum, frame):
...     print 'Signal handler called with signal', signum
>>> import signal
>>> signal.signal(signal.SIGALRM, handler)
>>> signal.signal(signal.SIGCHLD, handler)
>>> p = execCmd(['sleep', '3'], sync=False, sudo=False); s =
utils.pidStat(p.pid); print s; p.wait()

Then kill -SIGCHLD or kill -SIGALRM to this Python interpreter process,
we can see the output.

Signal handler called with signal 17
(9541, 'python', 'S', 9509, 9509, 6843, 34817, 9509, 4218944, 66, 0, 0,
0, 0, 0, 0, 0, 20, 0, 1, 0, 663767, 217919488, 1676,
18446744073709551615L, 4194304, 4197004, 140735665131088,
140735665124456, 268226504400, 0, 0, 16781312, 73730,
18446744071579398713L, 0, 0, 17, 3, 0, 0, 0, 0, 0, 6294952, 6297616,
19152896, 140735665132425, 140735665132432, 140735665132432,
140735665135592, 0)

I have not found other ways to produce it, just found this method. Not
sure it is a bug. Is it reasonable to check the read() return value and
retry on EAGAIN/EINTR to fix this?
Thanks and best regards!

Zhou Zheng Sheng / 周征晟
E-mail: zhshz...@linux.vnet.ibm.com
Telephone: 86-10-82454397

vdsm-devel mailing list

Reply via email to