Ok, thanks. In the meantime, please roll back to the v3.0.0 tag and you should
be good. Sorry for the hassle. :-(
On Jun 25, 2014, at 12:19 AM, Mike Dubman wrote:
> Hi,
> Sorry for the incomplete description. I will trace the problem more closely
> later next week and provide details.
>
> M
On Mon, Jun 23, 2014 at 10:13 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
Ok, just got in to Chicago from my flight and am back online.
Mike: you are still not providing very much information. :-\
Your first mails make it seem like MTT is continuing to run, but leaving
"launchers" (presumably mpirun processes) still running, but they have no
children.
On Jun 23, 2014, at 8:48 AM, Mike Dubman wrote:
> btw, i think now, when parent process is killed before child, OS makes the
> child a "zombie" which sticks around for good.
The grandparent should inherit the child. If the grandparent then does not
wait(2) on the child, then the child will remain a zombie.
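For concreteness, the zombie behavior described above can be sketched in C (a minimal standalone demo, not MTT or Open MPI code; the function name `spawn_and_reap` is an illustration): a child that exits before its parent calls wait(2) stays in the process table as a zombie until it is reaped.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits immediately; until the parent calls
 * waitpid(2), the dead child shows up in ps as "Z" / <defunct>.
 * Returns 0 once the zombie has been reaped, -1 on error. */
int spawn_and_reap(void) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(0);                     /* child exits; parent has not waited yet */
    sleep(1);                         /* during this second the child is a zombie */
    if (waitpid(pid, NULL, 0) != pid) /* reap it; the zombie disappears */
        return -1;
    return 0;
}
```

The converse case is the one in the mail above: if the *parent* dies first, the still-running child is not a zombie but an orphan, reparented to init (or, on Linux, to a "subreaper" ancestor) — it only becomes a zombie when it exits without its new parent waiting on it.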
On Mon, Jun 23, 2014 at 4:11 PM, Mike Dubman wrote:
it seems that mpirun got no signal (no evidence in the log). mtt was
spinning and mpirun was the only process left on the node.
It was unclear why mtt did not kill mpirun.
will try to extract a perl stacktrace from mtt on tomorrow's nightly run.
On Mon, Jun 23, 2014 at 2:59 PM, Jeff Squyres (jsquyres) wrote:
On Jun 23, 2014, at 7:47 AM, Mike Dubman wrote:
> after patch, it killed child processes but kept mpirun ... itself.
What does that mean -- are you saying that mpirun is still running? Was mpirun
sent a signal at all? What kind of messages are being displayed? ...etc.
The commits fix import
after the patch, it killed child processes but kept mpirun ... itself.
before that patch - all processes were killed (and you are right, "mpirun
died right at the end of the timeout" was reported) but at least it left
the cluster in a clean state w/o leftovers.
now many "orphan" launchers are alive
There was actually quite a bit of testing before this was committed. This
commit resolved a lot of hangs across multiple organizations.
Can you be more specific as to what is happening?
The prior code was killing child processes before mpirun itself, for example,
which has led MTT to erroneous
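The kill-ordering question at issue here can be sketched in C (a hypothetical illustration, not MTT's actual Perl code; `stop_launcher` and its grace period are assumptions): signal the launcher first so it can tear down its own children, and only escalate to SIGKILL on the whole process group if it does not exit in time.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Ask the launcher to exit cleanly before killing anything else.
 * `launcher` is assumed to be the leader of its own process group.
 * Returns 0 on a clean exit within grace_sec seconds, 1 if the whole
 * process group had to be SIGKILLed, -1 on error. */
int stop_launcher(pid_t launcher, int grace_sec) {
    if (kill(launcher, SIGTERM) != 0)   /* launcher gets first chance to clean up */
        return -1;
    for (int i = 0; i < grace_sec * 10; i++) {
        if (waitpid(launcher, NULL, WNOHANG) == launcher)
            return 0;                   /* exited on its own: no leftovers */
        usleep(100 * 1000);             /* poll every 100 ms */
    }
    kill(-launcher, SIGKILL);           /* escalate to the whole process group */
    waitpid(launcher, NULL, 0);         /* reap so the launcher is not a zombie */
    return 1;
}
```

Killing the children out from under the launcher first, by contrast, leaves the launcher observing child deaths it did not cause — which matches the point above about the prior code misleading MTT.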
this commit does more harm than good.
we experience the following:
- some child processes are still running after the timeout and mtt killed the job.
before this commit - it worked fine.
please revert and test more.
On Sat, Jun 21, 2014 at 3:30 PM, MPI Team wrote:
> The branch, master has been updated