Re: [vdsm] Potential Bug in cpopen

2013-05-29 Thread Mark Wu

On Wed 29 May 2013 12:02:47 PM CST, Zhou Zheng Sheng wrote:


Hi,

Recently Jenkins unit test sometimes fails on PidStatTests.test, for
example http://gerrit.ovirt.org/#/c/14670/

After it execCmd a sleep command with sync=False, in
/proc/[xxxpid]/stat we should see the name is sleep, but in this case
we get python, which means there is a possible race condition. The
most possible situation is that execCmd returns before the child process
execvp the sleep, then the parent process reads the stat and sees the
process name is still python.

cpopen is designed to avoid this kind of race. It uses a pipe to
synchronize the child and parent. It sets the FD_CLOEXEC on the child
side of the pipe, so that once execvp succeeds, the child pipe is
closed. If execvp fails, child writes error code to the pipe. The parent
reads the other end of the pipe like follow.

if (read(errnofd[0], childErrno, sizeof(int)) == sizeof(int)) {
 PyErr_SetString(PyExc_OSError, strerror(childErrno));
 goto fail;
}

It assumes that when read returns, the return value is either
sizeof(int), which indicates an error in the child side, or 0, which
indicates the child side of pipe is closed and execvp succeeds. However
this assumption may not always be true. If the parent process gets a
signal, the read invocation would be interrupt, and the code treats this
interruption the same as the case of execvp succeeds. If the system is
very busy like our Jenkins slave concurrently executing jobs, it's
possible the parent gets interrupted before the child execvp succeeds,
so cpopen returns to execCmd and cause the race.

To produce this problem, we can add a sleep(10); before the exec
invocation in cpopen.c, it simulates a busy/slow system. Then in a
Python interpreter, register a signal handler and calls execCmd.


from vdsm.utils import execCmd
from vdsm import utils
def handler(signum, frame):

... print 'Signal handler called with signal', signum
...

import signal
signal.signal(signal.SIGALRM, handler)
signal.signal(signal.SIGCHLD, handler)
p = execCmd(['sleep', '3'], sync=False, sudo=False); s =

utils.pidStat(p.pid); print s; p.wait()

Then kill -SIGCHLD or kill -SIGALRM to this Python interpreter process,
we can see the output.

Signal handler called with signal 17
(9541, 'python', 'S', 9509, 9509, 6843, 34817, 9509, 4218944, 66, 0, 0,
0, 0, 0, 0, 0, 20, 0, 1, 0, 663767, 217919488, 1676,
18446744073709551615L, 4194304, 4197004, 140735665131088,
140735665124456, 268226504400, 0, 0, 16781312, 73730,
18446744071579398713L, 0, 0, 17, 3, 0, 0, 0, 0, 0, 6294952, 6297616,
19152896, 140735665132425, 140735665132432, 140735665132432,
140735665135592, 0)
True

I have not found other ways to produce it, just found this method. Not
sure it is a bug. Is it reasonable to check the read() return value and
retry on EAGAIN/EINTR to fix this?


Good catch!  If you can't find out how the race is invoked on jenkins 
slave,  you could just add a patch to fix the  EAGAIN/EINTR issue in 
cpopen, and add some fake patches to invoke multiple jenkins jobs.  
Then you could find if the fix of EAGAIN/EINTR helps.



___
vdsm-devel mailing list
vdsm-devel@lists.fedorahosted.org
https://lists.fedorahosted.org/mailman/listinfo/vdsm-devel


[vdsm] Potential Bug in cpopen

2013-05-28 Thread Zhou Zheng Sheng

Hi,

Recently Jenkins unit test sometimes fails on PidStatTests.test, for
example http://gerrit.ovirt.org/#/c/14670/

After it execCmd a sleep command with sync=False, in
/proc/[xxxpid]/stat we should see the name is sleep, but in this case
we get python, which means there is a possible race condition. The
most possible situation is that execCmd returns before the child process
execvp the sleep, then the parent process reads the stat and sees the
process name is still python.

cpopen is designed to avoid this kind of race. It uses a pipe to
synchronize the child and parent. It sets the FD_CLOEXEC on the child
side of the pipe, so that once execvp succeeds, the child pipe is
closed. If execvp fails, child writes error code to the pipe. The parent
reads the other end of the pipe like follow.

if (read(errnofd[0], childErrno, sizeof(int)) == sizeof(int)) {
PyErr_SetString(PyExc_OSError, strerror(childErrno));
goto fail;
}

It assumes that when read returns, the return value is either
sizeof(int), which indicates an error in the child side, or 0, which
indicates the child side of pipe is closed and execvp succeeds. However
this assumption may not always be true. If the parent process gets a
signal, the read invocation would be interrupt, and the code treats this
interruption the same as the case of execvp succeeds. If the system is
very busy like our Jenkins slave concurrently executing jobs, it's
possible the parent gets interrupted before the child execvp succeeds,
so cpopen returns to execCmd and cause the race.

To produce this problem, we can add a sleep(10); before the exec
invocation in cpopen.c, it simulates a busy/slow system. Then in a
Python interpreter, register a signal handler and calls execCmd.

 from vdsm.utils import execCmd
 from vdsm import utils
 def handler(signum, frame):
... print 'Signal handler called with signal', signum
...
 import signal
 signal.signal(signal.SIGALRM, handler)
 signal.signal(signal.SIGCHLD, handler)
 p = execCmd(['sleep', '3'], sync=False, sudo=False); s =
utils.pidStat(p.pid); print s; p.wait()

Then kill -SIGCHLD or kill -SIGALRM to this Python interpreter process,
we can see the output.

Signal handler called with signal 17
(9541, 'python', 'S', 9509, 9509, 6843, 34817, 9509, 4218944, 66, 0, 0,
0, 0, 0, 0, 0, 20, 0, 1, 0, 663767, 217919488, 1676,
18446744073709551615L, 4194304, 4197004, 140735665131088,
140735665124456, 268226504400, 0, 0, 16781312, 73730,
18446744071579398713L, 0, 0, 17, 3, 0, 0, 0, 0, 0, 6294952, 6297616,
19152896, 140735665132425, 140735665132432, 140735665132432,
140735665135592, 0)
True

I have not found other ways to produce it, just found this method. Not
sure it is a bug. Is it reasonable to check the read() return value and
retry on EAGAIN/EINTR to fix this?
-- 
Thanks and best regards!

Zhou Zheng Sheng / 周征晟
E-mail: zhshz...@linux.vnet.ibm.com
Telephone: 86-10-82454397

___
vdsm-devel mailing list
vdsm-devel@lists.fedorahosted.org
https://lists.fedorahosted.org/mailman/listinfo/vdsm-devel