There was actually quite a bit of testing before this was committed. This
commit resolved a lot of hangs across multiple organizations.
Can you be more specific as to what is happening?
The prior code was killing child processes before mpirun itself, for example,
which has led MTT to
after patch, it killed child processes but kept mpirun ... itself.
Before that patch, all processes were killed (and you are right, "mpirun
died right at the end of the timeout" was reported), but at least it left
the cluster in a clean state without leftovers.
now many "orphan" launchers are
On Jun 23, 2014, at 7:47 AM, Mike Dubman wrote:
> after patch, it killed child processes but kept mpirun ... itself.
What does that mean -- are you saying that mpirun is still running? Was mpirun
sent a signal at all? What kind of messages are being displayed?
It seems that mpirun got no signal (there is no evidence of one in the log).
MTT was spinning, and mpirun was the only process left on the node.
It is unclear why MTT did not kill mpirun.
I will try to extract a Perl stack trace from MTT during tomorrow's nightly run.
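
For reference while debugging, here is a minimal sketch of one way a harness
can enforce a timeout without leaving either the launcher or its children
behind. It is not MTT's code (MTT is written in Perl) and not what the patch
does; the function name run_with_timeout and its parameters are hypothetical,
and it assumes a POSIX system. The idea is to start the launcher in its own
process group, signal the whole group on timeout, and then wait on the
launcher so it is reaped.

```python
import os
import signal
import subprocess
import sys
import time


def run_with_timeout(cmd, timeout_s, grace_s=2.0):
    """Run cmd in its own process group; on timeout, signal the whole group.

    Hypothetical helper for illustration only. Returns (returncode, timed_out).
    """
    # start_new_session=True makes the child a session and group leader,
    # so its process group ID equals its PID.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        pass

    pgid = proc.pid
    try:
        # Polite request first, sent to the launcher and its children together.
        os.killpg(pgid, signal.SIGTERM)
        # The launcher is not reaped yet, so its PID (and the group ID)
        # cannot be reused while we sleep.
        time.sleep(grace_s)
        # Force-kill whatever is still alive in the group.
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        # The group is already gone; nothing left to signal.
        pass

    # Reap the launcher so it does not linger as a zombie.
    return proc.wait(), True


if __name__ == "__main__":
    # Stand-in for a hung launcher: a process that sleeps past the timeout.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(hung, timeout_s=1.0, grace_s=0.5))
```

Signaling the group rather than individual PIDs means the kill order between
launcher and children stops mattering. Processes that move themselves into a
different process group or session would escape this, so it is only a sketch
under that assumption.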
On Mon, Jun 23, 2014 at 2:59 PM, Jeff Squyres
On Jun 23, 2014, at 8:48 AM, Mike Dubman wrote:
> btw, i think now, when parent process is killed before child, OS makes child
> as "" which stick around for good.
The grandparent should inherit the child. If the grandparent then does not
wait(2) on the child, then
Ok, just got in to Chicago from my flight and am back online.
Mike: you are still not providing very much information. :-\
Your first mails make it seem like MTT is continuing to run, but leaving
"launchers" (presumably mpirun processes) still running, but they have no
children. Which would