Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594

2014-06-23 Thread Jeff Squyres (jsquyres)
Ok, just got in to Chicago from my flight and am back online.

Mike: you are still not providing very much information.  :-\

Your first mails make it seem like MTT is continuing to run, but leaving 
"launchers" (assumedly mpirun processes) still running, but they have no 
children.  Which would be very weird for mpirun to do, if it has no children 
left.  This could be both an MTT and an ORTE bug, in this case.

But your last mail seems to imply that MTT is hanging indefinitely.

Can you please provide a clear, precise description of what is happening?

FWIW: Yes, we are killing the parent first now, to give mpirun a chance to 
clean up / tell remote orteds to die / kill child processes / etc.  Killing 
the children first both doesn't test the common case of how people kill MPI 
processes (i.e., they kill mpirun), and it also doesn't allow mpirun to tell 
remote processes to die.
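
The parent-first ordering described above can be sketched roughly as follows. This is a minimal Python illustration, not MTT's Perl code: the names `is_alive`, `kill_one`, `kill_tree`, the `grace` parameter, and the caller-supplied list of descendant PIDs are all assumptions made for the example (how MTT discovers descendants is not shown in this thread).

```python
import os
import signal
import time


def is_alive(pid):
    """Probe pid with signal 0 (nothing is delivered, only existence is checked).

    Note: an exited but not yet reaped child (a zombie) still counts as present.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def kill_one(pid, grace=0.5):
    """Send SIGTERM, then SIGKILL if pid is still present after `grace` seconds."""
    for sig in (signal.SIGTERM, signal.SIGKILL):
        if not is_alive(pid):
            return
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            return  # it went away between the probe and the signal
        time.sleep(grace)


def kill_tree(parent_pid, descendant_pids, grace=0.5):
    """Kill the parent first so it can clean up, then sweep any leftovers.

    `descendant_pids` is supplied by the caller (hypothetical interface).
    Returns the order in which the PIDs were handled.
    """
    order = [parent_pid] + list(descendant_pids)
    for pid in order:
        kill_one(pid, grace)
    return order
```

If the parent behaves well, the sweep over the descendants finds nothing left to kill; it is only there as a safety net for a parent that does not clean up after itself.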

Do you run with --verbose output?  MTT should output messages like "*** Killing 
mpirun with SIGTERM", and the like.  Do you see timeout messages at all?  I.e., 
is MTT not entering the timeout code at all?

...etc.



On Jun 23, 2014, at 12:16 PM, Dave Goodell (dgoodell) wrote:

> On Jun 23, 2014, at 8:48 AM, Mike Dubman  wrote:
> 
>> btw, i think now, when parent process is killed before child, OS makes child 
>> as "<defunct>" which stick around for good.
> 
> The grandparent should inherit the child.  If the grandparent then does not 
> wait(2) on the child, then the child will remain a zombie / defunct.  So in 
> our specific case, this behavior will depend on what the parent process of 
> mpirun is and whether it is waiting on child processes appropriately.
> 
> -Dave
> 
> ___
> mtt-devel mailing list
> mtt-de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/mtt-devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/mtt-devel/2014/06/0633.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594

2014-06-23 Thread Dave Goodell (dgoodell)
On Jun 23, 2014, at 8:48 AM, Mike Dubman  wrote:

> btw, i think now, when parent process is killed before child, OS makes child 
> as "<defunct>" which stick around for good.

The grandparent should inherit the child.  If the grandparent then does not 
wait(2) on the child, then the child will remain a zombie / defunct.  So in our 
specific case, this behavior will depend on what the parent process of mpirun 
is and whether it is waiting on child processes appropriately.

-Dave
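
The wait(2) point above can be illustrated with a small sketch. It is a Python example under stated assumptions, not anything from MTT or ORTE: `exists` and `zombie_demo` are hypothetical names, and the 0.2 second sleep is just an arbitrary pause so the child has time to exit. It only shows the "not waited on, so it stays a zombie" half of the story; it does not reproduce the reparenting case.

```python
import os
import time


def exists(pid):
    """Probe pid with signal 0; a zombie still counts as existing."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False


def zombie_demo():
    """Fork a child that exits at once; check it before and after reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)       # child: exit immediately
    time.sleep(0.2)       # parent: let the child exit; nobody has waited yet
    before = exists(pid)  # the exited child is still in the process table
    os.waitpid(pid, 0)    # reap it
    after = exists(pid)   # now the entry is gone
    return before, after
```

Until `waitpid` is called, the exited child is still a valid target for the signal-0 probe, which is why a parent that never waits leaves defunct entries behind.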



Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594

2014-06-23 Thread Mike Dubman
it seems that mpirun got no signal (no evidence in the log). mtt was
spinning and mpirun was the only process left on the node.
It was unclear why mtt did not kill mpirun.
will try to extract a perl stack trace from mtt on tomorrow's nightly run.


On Mon, Jun 23, 2014 at 2:59 PM, Jeff Squyres (jsquyres)  wrote:

> On Jun 23, 2014, at 7:47 AM, Mike Dubman  wrote:
>
> > after patch, it killed child processes but kept mpirun ... itself.
>
> What does that mean -- are you saying that mpirun is still running?  Was
> mpirun sent a signal at all?  What kind of messages are being displayed?
>  ...etc.
>
> The commits fix important bugs for me and others.  Clearly, there's still
> something not right.  And of course I'm willing to track it down.  But I
> can't help you if you just say "it doesn't work."
>
> > before that patch - all processes were killed (and you are right,
> > "mpirun died right at the end of the timeout" was reported)
>
> ...which led to many months of misleading ORTE debugging, BTW.  :-\
>  That's why this commit was introduced into MTT -- in the quest of finally
> fixing both the mysterious ORTE hangs and the erroneous timeouts/"mpirun
> died right at the end" messages.
>
> > but at least it left the cluster in the clean state w/o leftovers.
> > now many "orphan" launchers  are alive from previous invocations.
>
> Does "launchers" = mpirun?
>


Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594

2014-06-23 Thread Jeff Squyres (jsquyres)
On Jun 23, 2014, at 7:47 AM, Mike Dubman  wrote:

> after patch, it killed child processes but kept mpirun ... itself.

What does that mean -- are you saying that mpirun is still running?  Was mpirun 
sent a signal at all?  What kind of messages are being displayed?  ...etc.

The commits fix important bugs for me and others.  Clearly, there's still 
something not right.  And of course I'm willing to track it down.  But I can't 
help you if you just say "it doesn't work."

> before that patch - all processes were killed (and you are right, "mpirun 
> died right at the end of the timeout" was reported)

...which led to many months of misleading ORTE debugging, BTW.  :-\  That's why 
this commit was introduced into MTT -- in the quest of finally fixing both the 
mysterious ORTE hangs and the erroneous timeouts/"mpirun died right at the end" 
messages.

> but at least it left the cluster in the clean state w/o leftovers.
> now many "orphan" launchers  are alive from previous invocations.

Does "launchers" = mpirun?




Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594

2014-06-23 Thread Mike Dubman
after patch, it killed child processes but kept mpirun ... itself.

before that patch - all processes were killed (and you are right, "mpirun
died right at the end of the timeout" was reported) but at least it left
the cluster in the clean state w/o leftovers.
now many "orphan" launchers  are alive from previous invocations.


On Mon, Jun 23, 2014 at 2:18 PM, Jeff Squyres (jsquyres)  wrote:

> There was actually quite a bit of testing before this was committed.  This
> commit resolved a lot of hangs across multiple organizations.
>
> Can you be more specific as to what is happening?
>
> The prior code was killing child processes before mpirun itself, for
> example, which has led MTT to erroneously report that mpirun died right at
> the end of the timeout without being killed.  This has been ongoing for
> many months, at a minimum.
>
>
>
>
> On Jun 23, 2014, at 4:37 AM, Mike Dubman  wrote:
>
> > this commit does more harm than good.
> > we experience the following:
> >
> > - some child processes still running after timeout and mtt killed the
> > job.
> >
> > before this commit - it worked fine.
> > please revert and test more.
> >

Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594

2014-06-23 Thread Jeff Squyres (jsquyres)
There was actually quite a bit of testing before this was committed.  This 
commit resolved a lot of hangs across multiple organizations.

Can you be more specific as to what is happening?

The prior code was killing child processes before mpirun itself, for example, 
which has led MTT to erroneously report that mpirun died right at the end of 
the timeout without being killed.  This has been ongoing for many months, at a 
minimum.
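
To make the distinction concrete, here is a rough sketch of a timeout wrapper that reports whether the command finished on its own, was found already dead when the timeout fired, or actually had to be killed. It is a Python illustration only, not MTT's DoCommand code: `run_with_timeout`, the `grace` parameter, and the outcome labels are assumptions for the example (the middle label just echoes the wording of the MTT message).

```python
import signal
import subprocess


def run_with_timeout(argv, timeout, grace=5.0):
    """Run argv and return (outcome, returncode).

    outcome is "exited", "died right at end of timeout", or "killed".
    """
    proc = subprocess.Popen(argv)
    try:
        proc.wait(timeout=timeout)
        return "exited", proc.returncode
    except subprocess.TimeoutExpired:
        pass

    # Timed out. Probe again before signalling: the process may have exited
    # in the meantime (for example because its children were killed first).
    if proc.poll() is not None:
        return "died right at end of timeout", proc.returncode

    # Still running: signal the command itself, escalating if it ignores SIGTERM.
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
    return "killed", proc.returncode
```

For example, `run_with_timeout(["sleep", "60"], 0.2)` should take the "killed" path, while a command that finishes within the timeout takes the "exited" path. If something else kills the command's children before the probe, the command may exit by itself and land in the middle case, which is the misleading report discussed above.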




On Jun 23, 2014, at 4:37 AM, Mike Dubman  wrote:

> this commit does more harm than good.
> we experience the following:
> 
> - some child processes still running after timeout and mtt killed the job.
> 
> before this commit - it worked fine.
> please revert and test more.
>  
> 
> 
> On Sat, Jun 21, 2014 at 3:30 PM, MPI Team  wrote:
> The branch, master has been updated
>        via  016088f2a0831b32ab5fd6f60f4cabe67e92e594 (commit)
>        via  7fb4c6a4c9d71be127ea53bd463178510577f71f (commit)
>        via  381ba177d835a54c3197d846f5a4edfc314efe27 (commit)
>        via  cfdd29de2012eeb7592706f00dd07a52dd48cf6b (commit)
>        via  940030ca20eb1eaf256e898b83866c1cb83aca5c (commit)
>       from  c99ed7c7b159a2cab58a251bd7c0dad8972ff901 (commit)
> 
> Those revisions listed above that are new to this repository have
> not appeared on any other notification email; so we list those
> revisions in full, below.
> 
> - Log -
> https://github.com/open-mpi/mtt/commit/016088f2a0831b32ab5fd6f60f4cabe67e92e594
> 
> commit 016088f2a0831b32ab5fd6f60f4cabe67e92e594
> Author: Jeff Squyres 
> Date:   Sat Jun 21 04:58:45 2014 -0700
> 
> DoCommand: several fixes to kill_proc logic
> 
> 1. Fix the kill(0, $pid) test to see if the process was still alive.
> 
> 2. Rename _kill_proc() to _kill_proc_tree() to indicate that it's
> really killing not only the PID in question, but also all of its
> descendants.
> 
> 3. In _kill_proc_tree(), change the order to kill the main PID first,
> and ''then'' kill all the descendants.
> 
> The main use case is when killing mpirun: if we kill mpirun's
> descendants ''first'', mpirun will detect its childrens' deaths and
> then cleanup and exit.  Later, when MTT finally gets around to killing
> mpirun, MTT will detect that mpirun is already dead and therefore emit
> a confusing "mpirun died right at end of timeout" message.  This is
> misleading at best; it doesn't indicate what actually happened.
> 
> However, if we kill mpirun first, it will take care of killing all of
> its descendants.  MTT will therefore emit the right messages about
> killing mpirun.  MTT will then redundantly try to kill a bunch of
> now-nonexistent descendant processes of mpirun, but that's ok/safe.
> We actually ''want'' this try-to-kill-mpirun's-descendants behavior to
> handle the case when mpirun is misbehaving / not cleaning up its
> descendants.
> 
> 4. DoCommand() is used for more than launching mpirun, so pass down
> $argv0 so that we can print the actual command name that is being
> killed in various Verbose/Debug messages, not the hard-coded "mpirun"
> string (which, in practice, was probably almost always correct, but
> still...).
> ---
>  lib/MTT/DoCommand.pm | 78
>  1 file changed, 55 insertions(+), 23 deletions(-)
> 
> diff --git a/lib/MTT/DoCommand.pm b/lib/MTT/DoCommand.pm
> index 02cdb94..646ca31 100644
> --- a/lib/MTT/DoCommand.pm
> +++ b/lib/MTT/DoCommand.pm
> @@ -2,7 +2,7 @@
>  #
>  # Copyright (c) 2005-2006 The Trustees of Indiana University.
>  # All rights reserved.
> -# Copyright (c) 2006-2013 Cisco Systems, Inc.  All rights reserved.
> +# Copyright (c) 2006-2014 Cisco Systems, Inc.  All rights reserved.
>  # Copyright (c) 2007-2008 Sun Microsystems, Inc.  All rights reserved.
>  # Copyright (c) 2007-2012 High Performance Computing Center Stuttgart,
>  # University of Stuttgart.  All rights reserved.
> @@ -57,23 +57,27 @@ sub DoCommand {
>  ($time_arg, $no_execute) = @_;
>  }
> 
> +# This function is called for killing both mpirun and each of its
> +# descendants.  We really only want to see verbose output for when we
> +# kill mpirun itself, so only show output when the caller provides us
> +# with a $argv0 value.
>  sub _kill_proc_one {
> -my ($pid) = @_;
> +my ($pid, $argv0) = @_;
> 
>  # How long to wait after each kill()
>  my $wait_time = 5;
> 
>  # See if the proc is alive first
> -my $kid;
> -kill(0, $pid);
> -$kid = waitpid($pid, WNOHANG);
> -return "mpirun died right at end of timeout (MTT did not have to kill it)"
> -if (-1 == $kid);
> +my $num_alive = kill(0, $pid);
> +return "$argv0 died right at end of timeout (MTT did not have to kill it)"
> +if (0 ==