Saggi helped and took control on it :)

please check - http://gerrit.ovirt.org/#/c/15165/

Thanks for the report!

Yaniv Bronhaim.

----- Original Message -----
> From: "Yaniv Bronheim" <ybron...@redhat.com>
> To: "Mark Wu" <wu...@linux.vnet.ibm.com>
> Cc: "VDSM Project Development" <vdsm-devel@lists.fedorahosted.org>
> Sent: Wednesday, May 29, 2013 5:37:39 PM
> Subject: Re: [vdsm] Potential Bug in cpopen
> 
> 
> 
> ----- Original Message -----
> > From: "Mark Wu" <wu...@linux.vnet.ibm.com>
> > To: "Zhou Zheng Sheng" <zhshz...@linux.vnet.ibm.com>
> > Cc: "VDSM Project Development" <vdsm-devel@lists.fedorahosted.org>
> > Sent: Wednesday, May 29, 2013 11:38:40 AM
> > Subject: Re: [vdsm] Potential Bug in cpopen
> > 
> > On Wed 29 May 2013 12:02:47 PM CST, Zhou Zheng Sheng wrote:
> > >
> > > Hi,
> > >
> > > Recently Jenkins unit test sometimes fails on PidStatTests.test, for
> > > example http://gerrit.ovirt.org/#/c/14670/
> > >
> > > After it execCmd a "sleep" command with sync=False, in
> > > /proc/[xxxpid]/stat we should see the name is "sleep", but in this case
> > > we get "python", which means there is a possible race condition. The
> > > most possible situation is that execCmd returns before the child process
> > > execvp the "sleep", then the parent process reads the stat and sees the
> > > process name is still "python".
> > >
> > > cpopen is designed to avoid this kind of race. It uses a pipe to
> > > synchronize the child and parent. It sets the FD_CLOEXEC on the child
> > > side of the pipe, so that once execvp succeeds, the child pipe is
> > > closed. If execvp fails, child writes error code to the pipe. The parent
> > > reads the other end of the pipe like follow.
> > >
> > > if (read(errnofd[0], &childErrno, sizeof(int)) == sizeof(int)) {
> > >      PyErr_SetString(PyExc_OSError, strerror(childErrno));
> > >      goto fail;
> > > }
> > >
> > > It assumes that when read returns, the return value is either
> > > sizeof(int), which indicates an error in the child side, or 0, which
> > > indicates the child side of pipe is closed and execvp succeeds. However
> > > this assumption may not always be true. If the parent process gets a
> > > signal, the read invocation would be interrupt, and the code treats this
> > > interruption the same as the case of execvp succeeds. If the system is
> > > very busy like our Jenkins slave concurrently executing jobs, it's
> > > possible the parent gets interrupted before the child execvp succeeds,
> > > so cpopen returns to execCmd and cause the race.
> > >
> > > To produce this problem, we can add a "sleep(10);" before the exec
> > > invocation in cpopen.c, it simulates a busy/slow system. Then in a
> > > Python interpreter, register a signal handler and calls execCmd.
> > >
> > >>>> from vdsm.utils import execCmd
> > >>>> from vdsm import utils
> > >>>> def handler(signum, frame):
> > > ...     print 'Signal handler called with signal', signum
> > > ...
> > >>>> import signal
> > >>>> signal.signal(signal.SIGALRM, handler)
> > >>>> signal.signal(signal.SIGCHLD, handler)
> > >>>> p = execCmd(['sleep', '3'], sync=False, sudo=False); s =
> > > utils.pidStat(p.pid); print s; p.wait()
> > >
> > > Then kill -SIGCHLD or kill -SIGALRM to this Python interpreter process,
> > > we can see the output.
> > >
> > > Signal handler called with signal 17
> > > (9541, 'python', 'S', 9509, 9509, 6843, 34817, 9509, 4218944, 66, 0, 0,
> > > 0, 0, 0, 0, 0, 20, 0, 1, 0, 663767, 217919488, 1676,
> > > 18446744073709551615L, 4194304, 4197004, 140735665131088,
> > > 140735665124456, 268226504400, 0, 0, 16781312, 73730,
> > > 18446744071579398713L, 0, 0, 17, 3, 0, 0, 0, 0, 0, 6294952, 6297616,
> > > 19152896, 140735665132425, 140735665132432, 140735665132432,
> > > 140735665135592, 0)
> > > True
> > >
> > > I have not found other ways to produce it, just found this method. Not
> > > sure it is a bug. Is it reasonable to check the read() return value and
> > > retry on EAGAIN/EINTR to fix this?
> It's surely a bug...
> > 
> > Good catch!  If you can't find out how the race is invoked on jenkins
> > slave,  you could just add a patch to fix the  EAGAIN/EINTR issue in
> > cpopen, and add some fake patches to invoke multiple jenkins jobs.
> > Then you could find if the fix of EAGAIN/EINTR helps.
> > 
> 
> good catch alright.
> The returncode of execCmd also None, so you have no way to catch the failure,
> Seems like Mark is right, Im checking it and I'll upload a patch to fix it
> soon as possible.
> 
> Thanks!!
> > 
> > _______________________________________________
> > vdsm-devel mailing list
> > vdsm-devel@lists.fedorahosted.org
> > https://lists.fedorahosted.org/mailman/listinfo/vdsm-devel
> > 
> _______________________________________________
> vdsm-devel mailing list
> vdsm-devel@lists.fedorahosted.org
> https://lists.fedorahosted.org/mailman/listinfo/vdsm-devel
> 
_______________________________________________
vdsm-devel mailing list
vdsm-devel@lists.fedorahosted.org
https://lists.fedorahosted.org/mailman/listinfo/vdsm-devel

Reply via email to