Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594

2014-06-25 Thread Jeff Squyres (jsquyres)
Ok, thanks. In the meantime, please roll back to the v3.0.0 tag and you should be good. Sorry for the hassle. :-( On Jun 25, 2014, at 12:19 AM, Mike Dubman wrote: > Hi > sorry for incomplete description. will trace problem more closely later next > week and provide. > > M > > > On Mon,

Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594

2014-06-25 Thread Mike Dubman
Hi sorry for incomplete description. will trace problem more closely later next week and provide. M On Mon, Jun 23, 2014 at 10:13 PM, Jeff Squyres (jsquyres) < jsquy...@cisco.com> wrote: > Ok, just got in to Chicago from my flight and am back online. > > Mike: you are still not providing very m

Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594

2014-06-23 Thread Jeff Squyres (jsquyres)
Ok, just got in to Chicago from my flight and am back online. Mike: you are still not providing very much information. :-\ Your first mails make it seem like MTT is continuing to run, but leaving "launchers" (assumedly mpirun processes) still running, but they have no children. Which would be

Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594

2014-06-23 Thread Dave Goodell (dgoodell)
On Jun 23, 2014, at 8:48 AM, Mike Dubman wrote: > btw, i think now, when parent process is killed before child, OS makes child > as "<defunct>" which sticks around for good. The grandparent should inherit the child. If the grandparent then does not wait(2) on the child, then the child will remain a zom
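
The wait(2) point above can be illustrated with a minimal sketch. It assumes a POSIX system with default SIGCHLD handling and is not taken from MTT or Open MPI; the helper names (`spawn_child`, `entry_exists`, `reap`) are hypothetical. It shows that an exited child keeps its process-table entry until whichever process is its parent collects the exit status.

```python
import os
import time


def spawn_child(exit_code):
    """Fork a child that exits immediately with exit_code; return its PID."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: exit without running parent cleanup
    return pid


def entry_exists(pid):
    """True while the kernel still holds a process-table entry for pid."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return False
    return True


def reap(pid):
    """Collect the child's exit status via waitpid, releasing its entry."""
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else None


if __name__ == "__main__":
    child = spawn_child(7)
    time.sleep(0.2)  # give the child time to exit
    print("before wait, entry exists:", entry_exists(child))
    print("exit code collected:", reap(child))
    print("after wait, entry exists:", entry_exists(child))
```

Between the child's exit and the `reap` call, the child is in the state that `ps` typically reports as defunct. A parent that never calls `waitpid` leaves it there for as long as the parent itself lives.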

Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594

2014-06-23 Thread Mike Dubman
btw, i think now, when parent process is killed before child, OS makes child as "<defunct>" which sticks around for good. On Mon, Jun 23, 2014 at 4:11 PM, Mike Dubman wrote: > it seems that mpirun got no signal (no evidence in the log). mtt was > spinning and mpirun was the only process left on the nod

Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594

2014-06-23 Thread Mike Dubman
it seems that mpirun got no signal (no evidence in the log). mtt was spinning and mpirun was the only process left on the node. It was unclear why mtt did not kill mpirun. will try to extract perl stacktrace from mtt on tomorrow's nightly run. On Mon, Jun 23, 2014 at 2:59 PM, Jeff Squyres (jsqu

Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594

2014-06-23 Thread Jeff Squyres (jsquyres)
On Jun 23, 2014, at 7:47 AM, Mike Dubman wrote: > after patch, it killed child processes but kept mpirun ... itself. What does that mean -- are you saying that mpirun is still running? Was mpirun sent a signal at all? What kind of messages are being displayed? ...etc. The commits fix import

Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594

2014-06-23 Thread Mike Dubman
after patch, it killed child processes but kept mpirun ... itself. before that patch - all processes were killed (and you are right, "mpirun died right at the end of the timeout" was reported) but at least it left the cluster in the clean state w/o leftovers. now many "orphan" launchers are alive

Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594

2014-06-23 Thread Jeff Squyres (jsquyres)
There was actually quite a bit of testing before this was committed. This commit resolved a lot of hangs across multiple organizations. Can you be more specific as to what is happening? The prior code was killing child processes before mpirun itself, for example, which has led MTT to erroneous
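
The ordering question raised here (signal the launcher first, or its children first) can be sketched as follows. This is not MTT's actual code, which is written in Perl; it is a hypothetical Python illustration, assuming a POSIX system, in which the launcher is started in its own process group, is signaled first, and any leftover group members are swept afterwards. The names `start_in_own_group` and `stop_launcher` and the `grace` parameter are assumptions made for the example.

```python
import os
import signal
import subprocess
import sys


def start_in_own_group(argv):
    """Start argv in a new session, so its process group ID equals its PID."""
    return subprocess.Popen(argv, start_new_session=True)


def stop_launcher(proc, grace=2.0):
    """Signal the launcher first, then sweep whatever is left in its group."""
    proc.send_signal(signal.SIGTERM)  # let the launcher clean up its own children
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        pass  # launcher ignored SIGTERM; the group-wide SIGKILL below covers it
    try:
        os.killpg(proc.pid, signal.SIGKILL)  # leftovers in the launcher's group
    except ProcessLookupError:
        pass  # the group is already empty
    return proc.wait()  # reap the launcher so it does not linger as a zombie


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    launcher = start_in_own_group(sleeper)
    print("launcher return code:", stop_launcher(launcher))
```

The final `proc.wait()` matters as much as the signals: without it, a launcher that has already exited would still show up on the node as a defunct process.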

Re: [MTT devel] [MTT svn] GIT: MTT branch master updated. 016088f2a0831b32ab5fd6f60f4cabe67e92e594

2014-06-23 Thread Mike Dubman
this commit does more harm than good. we experience the following: - some child processes still running after timeout and mtt killed the job. before this commit - it worked fine. please revert and test more. On Sat, Jun 21, 2014 at 3:30 PM, MPI Team wrote: > The branch, master has been updated >