Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594

2014-06-23 Thread Jeff Squyres (jsquyres)
Ok, just got in to Chicago from my flight and am back online. Mike: you are still not providing very much information. :-\ Your first mails make it seem like MTT is continuing to run, but leaving "launchers" (assumedly mpirun processes) still running, but they have no children. Which would

Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594

2014-06-23 Thread Dave Goodell (dgoodell)
On Jun 23, 2014, at 8:48 AM, Mike Dubman wrote: > btw, i think now, when parent process is killed before child, OS makes child > as "<defunct>" which stick around for good. The grandparent should inherit the child. If the grandparent then does not wait(2) on the child, then
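A minimal sketch of the wait(2) point in this message, written in Python purely for illustration (it is not MTT or Open MPI code, and the helper name run_and_reap is hypothetical). An exited process stays in the process table as a zombie, which ps shows as "<defunct>", until its current parent waits on it. Waiting returns the exit status and removes the entry, so a second wait on the same pid fails.

```python
import os


def run_and_reap(exit_code=3):
    """Fork a child that exits at once, then reap it with waitpid().

    Hypothetical helper for illustration only. Returns a tuple:
    (waitpid returned the child's pid, child's exit code, second wait failed).
    """
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately without running any Python cleanup handlers.
        os._exit(exit_code)

    # Parent: between the child's exit and this call, the child is a zombie.
    reaped_pid, status = os.waitpid(pid, 0)
    code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else None

    # Once reaped, the pid is no longer our child, so waiting again
    # raises ChildProcessError (ECHILD).
    try:
        os.waitpid(pid, 0)
        already_reaped = False
    except ChildProcessError:
        already_reaped = True

    return reaped_pid == pid, code, already_reaped


if __name__ == "__main__":
    print(run_and_reap())
```

If the parent never made the first waitpid() call and kept running, the child would remain listed as "<defunct>" for as long as the parent lived.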

Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594

2014-06-23 Thread Mike Dubman
it seems that mpirun got no signal (no evidence in the log). mtt was spinning and mpirun was the only process left on the node. It was unclear why mtt did not kill mpirun. will try to extract a perl stack trace from mtt on tomorrow's nightly run. On Mon, Jun 23, 2014 at 2:59 PM, Jeff Squyres

Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594

2014-06-23 Thread Jeff Squyres (jsquyres)
On Jun 23, 2014, at 7:47 AM, Mike Dubman wrote: > after patch, it killed child processes but kept mpirun ... itself. What does that mean -- are you saying that mpirun is still running? Was mpirun sent a signal at all? What kind of messages are being displayed?

Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594

2014-06-23 Thread Mike Dubman
after patch, it killed child processes but kept mpirun ... itself. before that patch - all processes were killed (and you are right, "mpirun died right at the end of the timeout" was reported) but at least it left the cluster in a clean state w/o leftovers. now many "orphan" launchers are

Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594

2014-06-23 Thread Jeff Squyres (jsquyres)
There was actually quite a bit of testing before this was committed. This commit resolved a lot of hangs across multiple organizations. Can you be more specific as to what is happening? The prior code was killing child processes before mpirun itself, for example, which has led MTT to