Re: [MTT devel] fix zombie commit

2013-02-26 Thread Jeff Squyres (jsquyres)
On Feb 26, 2013, at 2:11 AM, Mike Dubman  wrote:

> On Mon, Feb 25, 2013 at 6:24 PM, Jeff Squyres (jsquyres)  
> wrote:
> >Looking at the code, you're checking for zombie status before MTT kills the 
> >proc.  Am I reading that right?
> I don't think the order matters. If the process is not a zombie yet and is 
> about to be killed by MTT later, that is a good flow.
> If the process is already a zombie, MTT will not be able to kill it anyway and 
> can stop waiting and switch to the new task.

No, the _kill_proc() routine does both a kill() and a waitpid().  The waitpid() 
should reap the zombie.

I.e., if the process has died, MTT simply hasn't reaped it yet.  Hence, 
it's a zombie.
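
A minimal sketch of that behavior, assuming a POSIX system and using Python rather than MTT's Perl (the function name zombie_then_reap is made up for this illustration and is not part of MTT). It forks a child that exits at once, checks that the PID still exists before waitpid() is called, and checks again after the child has been reaped:

```python
import os


def zombie_then_reap():
    """Fork a child that exits at once.

    Returns (existed_before_reap, exists_after_reap).
    """
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately, skipping the parent's cleanup handlers

    # Parent: the child has exited (or is about to), but nobody has waited for it.
    # Signal 0 delivers nothing; it only checks that the PID exists.
    # The call succeeds for a zombie as well as for a running process.
    os.kill(pid, 0)
    existed_before = True

    os.waitpid(pid, 0)  # blocks until the child has exited, then reaps it

    try:
        os.kill(pid, 0)
        exists_after = True
    except ProcessLookupError:
        exists_after = False  # the process table entry is gone after the reap
    return existed_before, exists_after
```

On a typical system this is expected to return (True, False): the entry stays in the process table until the parent waits for it, which is what a kill() followed by a waitpid() would clean up.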

> >If so, then it could well be that the process has exited but not yet been 
> >reaped (because _kill_proc() hasn't been invoked yet).  If this is the case, 
> >is the real cause of the problem that the OUTread and ERRread aren't being 
> >closed when the child process exits, and therefore we keep looping looking 
> >for new output from them?
> yep, sounds like it can be the cause, need to look into this code.

Ok.  It would be interesting to see if the process dies, but:

1) MTT is still blocking in select() (i.e., OUTread and ERRread aren't returning 
0 from sysread upon process death)

2) $done is somehow not getting set to 0, and therefore MTT is still looping 
until the timeout expires
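
For reference, a sketch of the kind of loop being discussed, written in Python under the assumption of a POSIX system; it is not MTT's DoCommand.pm code, and the names drain, open_fds and idle_timeout are hypothetical. It selects on the child's stdout and stderr pipes, treats a zero-length read as end-of-file for that pipe, leaves the loop once both pipes are closed, and only then reaps the child:

```python
import os
import select
import subprocess


def drain(cmd, idle_timeout=10.0):
    """Run cmd, collect stdout/stderr until both pipes reach EOF, then reap the child."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out_fd, err_fd = proc.stdout.fileno(), proc.stderr.fileno()
    chunks = {out_fd: [], err_fd: []}
    open_fds = {out_fd, err_fd}  # plays the role of the loop's "done" bookkeeping

    while open_fds:
        # Wait until a pipe is readable, or until idle_timeout seconds pass with no activity.
        ready, _, _ = select.select(list(open_fds), [], [], idle_timeout)
        if not ready:
            proc.kill()  # no output and no EOF in time: give up on the child
            break
        for fd in ready:
            data = os.read(fd, 4096)
            if data:
                chunks[fd].append(data)
            else:
                # A zero-length read means EOF: every writer has closed this pipe.
                open_fds.discard(fd)

    returncode = proc.wait()  # waitpid() under the hood, so no zombie is left behind
    out = b"".join(chunks[out_fd])
    err = b"".join(chunks[err_fd])
    proc.stdout.close()
    proc.stderr.close()
    return out, err, returncode
```

If either condition listed above holds, a loop like this never empties open_fds and keeps going until the timeout path fires, even though the child itself has already exited.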

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [MTT devel] fix zombie commit

2013-02-26 Thread Mike Dubman
On Mon, Feb 25, 2013 at 6:24 PM, Jeff Squyres (jsquyres)  wrote:

> >Looking at the code, you're checking for zombie status before MTT kills
> the proc.  Am I reading that right?
>
I don't think the order matters. If the process is not a zombie yet and is
about to be killed by MTT later, that is a good flow.
If the process is already a zombie, MTT will not be able to kill it anyway and
can stop waiting and switch to the new task.


> >If so, then it could well be that the process has exited but not yet been
> reaped (because _kill_proc() hasn't been invoked yet).  If this is the
> case, is the real cause of the problem that the OUTread and ERRread aren't
> being closed when the child process exits, and therefore we keep looping
> looking for new output from them?
>
yep, sounds like it can be the cause, need to look into this code.


Re: [MTT devel] fix zombie commit

2013-02-25 Thread Jeff Squyres (jsquyres)
On Feb 24, 2013, at 6:59 AM, Mike Dubman  wrote:

> What protection do you mean? Check that /proc/pid/status exists? It is done 
> in Grep()

Ah, excellent -- I hadn't noticed that.

> We observe that a process which was launched by mtt and hangs (mtt detects the 
> timeout and starts the do_command procedure) later enters the "defunct" state.

Looking at the code, you're checking for zombie status before MTT kills the 
proc.  Am I reading that right?

If so, then it could well be that the process has exited but not yet been 
reaped (because _kill_proc() hasn't been invoked yet).  If this is the case, is 
the real cause of the problem that the OUTread and ERRread aren't being closed 
when the child process exits, and therefore we keep looping looking for new 
output from them?





Re: [MTT devel] fix zombie commit

2013-02-24 Thread Mike Dubman
Hi Jeff,

What protection do you mean? Check that /proc/pid/status exists? It is done
in Grep()



We observe that a process which was launched by mtt and hangs (mtt detects the
timeout and starts the do_command procedure) later enters the "defunct" state.



mtt sends an email saying that the process hangs, and when we check the reason,
it turns out that the process has basically finished and mtt is monitoring the
"defunct" process, which is the only one left.



This fix will let mtt detect that it is monitoring such a process and proceed
to the next test.
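
A guarded version of such a check might look like the sketch below. It is written in Python for illustration only and is not the code committed to MTT. It assumes the Linux procfs layout, where /proc/<pid>/status contains a line such as "State:	Z (zombie)"; the function name is_zombie and the proc_root parameter are hypothetical. On systems without that layout it returns None instead of guessing:

```python
import os


def is_zombie(pid, proc_root="/proc"):
    """Return True or False based on the State: line of <proc_root>/<pid>/status.

    Returns None when the file cannot be read or has no State: line,
    for example on a system without a Linux-style /proc.
    """
    status_path = os.path.join(proc_root, str(pid), "status")
    try:
        # errors="replace" keeps a non-text status file from raising a decode error.
        with open(status_path, errors="replace") as status:
            for line in status:
                if line.startswith("State:"):
                    fields = line.split()
                    # The second field is the one-letter state code; "Z" means zombie.
                    return len(fields) > 1 and fields[1] == "Z"
    except OSError:
        return None  # no such file: no Linux-style /proc, or the process is already gone
    return None  # file was readable but had no State: line
```

A caller can treat None as "unknown" and fall back to the existing timeout handling, so the check does no harm on platforms where /proc is missing or laid out differently.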



I don't know yet which part of mtt causes the "defunct" state, but I am
looking into it.

After some googling, I found that fork from perl (used in mtt) can have such
a side effect.



This is an example, based on a true story:



miked     1362  0.0  0.0      0     0 ?        Z    13:36   0:00 [sh]




My guess: inside mtt.ini we use mpi details like this, which calls "sh"
from the shebang. Somehow, it can sometimes become a zombie.



exec  =<

On Sun, Feb 24, 2013 at 1:09 PM, Jeff Squyres (jsquyres)  wrote:

> Mike --
>
> Please protect this code better; MTT is also run on Solaris and OS X.
>
> Also, can you describe more fully the case where zombies are being left
> behind by MTT?
>
>
> On Feb 24, 2013, at 1:44 AM,  wrote:
>
> > Author: miked (Mike Dubman)
> > Date: 2013-02-24 01:44:31 EST (Sun, 24 Feb 2013)
> > New Revision: 1589
> > URL: https://svn.open-mpi.org/trac/mtt/changeset/1589
> >
> > Log:
> > * fix: fork leaves zombie processes sometimes. temp fix: detect zombie and proceed with tests.
> >
> > Text files modified:
> >   trunk/lib/MTT/DoCommand.pm | 6 ++
> >   1 files changed, 6 insertions(+), 0 deletions(-)
> >
> > Modified: trunk/lib/MTT/DoCommand.pm
> >
> > ==
> > --- trunk/lib/MTT/DoCommand.pm    Wed Feb 20 12:41:12 2013    (r1588)
> > +++ trunk/lib/MTT/DoCommand.pm    2013-02-24 01:44:31 EST (Sun, 24 Feb 2013)    (r1589)
> > @@ -641,6 +641,12 @@
> >      if (!$pid_exists) {
> >          Verbose("--> Process completed somehow at " . time() . ", proceeding with tests\n");
> >          $resume_tests++;
> > +    } else {
> > +        my $matches = MTT::Files::Grep("zombie", "/proc/$pid/status");
> > +        if (@$matches) {
> > +            Verbose("--> Process become Zombie at " . time() . ", proceeding with tests\n");
> > +            $resume_tests++;
> > +        }
> >      }
> >      # Remove the timeout sentinel file, if a timeout notify timeout value is set
> >      if (defined($end_time) and time() > $end_time) {
> > ___
> > mtt-svn mailing list
> > mtt-...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/mtt-svn
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal informatio