Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread r...@open-mpi.org
Just in the FWIW category: the HNP used to send the singleton’s name down the pipe at startup, which eliminated the code line you identified. Now, we are pushing the name into the environment as a PMIx envar, and having the PMIx component pick it up. Roundabout way of getting it, and that’s

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread r...@open-mpi.org
Ah...I take that back. We changed this and now we _do_ indeed go down that code path. Not good. So yes, we need that putenv so it gets the jobid from the HNP that was launched, like it used to do. You want to throw that in? Thanks Ralph > On Sep 14, 2016, at 8:18 PM, r...@open-mpi.org wrote:
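
The handoff described here can be sketched roughly as follows: the singleton, once it knows the jobid of the HNP it launched, pushes it into its own environment, and client init prefers that value over anything it would compute itself. This is only an illustration under stated assumptions. The variable names `EXAMPLE_launched_by_rte` and `EXAMPLE_jobid` and both helper functions are hypothetical and are not the identifiers used in Open MPI or PMIx. The sketch also uses `setenv()` rather than the `putenv()` mentioned in the message, because `setenv()` copies its arguments.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for setenv() */
#endif
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical variable names, chosen for this sketch only. */
#define EXAMPLE_LAUNCH_ENV "EXAMPLE_launched_by_rte"
#define EXAMPLE_JOBID_ENV  "EXAMPLE_jobid"

/*
 * Singleton side: record that a runtime daemon was launched and which
 * jobid it assigned. Returns 0 on success, -1 on failure.
 */
static int export_jobid(unsigned long jobid)
{
    char buf[32];
    snprintf(buf, sizeof buf, "%lu", jobid);
    if (0 != setenv(EXAMPLE_LAUNCH_ENV, "1", 1)) {
        return -1;
    }
    return setenv(EXAMPLE_JOBID_ENV, buf, 1);
}

/*
 * Client init side: use the exported jobid when the marker is present,
 * otherwise fall back to the locally derived value.
 */
static unsigned long client_jobid(unsigned long fallback)
{
    const char *value;
    if (NULL == getenv(EXAMPLE_LAUNCH_ENV)) {
        return fallback;         /* not launched by the runtime */
    }
    value = getenv(EXAMPLE_JOBID_ENV);
    if (NULL == value) {
        return fallback;         /* marker set but no jobid exported */
    }
    return strtoul(value, NULL, 10);
}
```

Calling `export_jobid()` before client init means `client_jobid()` returns the launcher's value instead of the fallback, which is the behaviour the message asks to restore.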

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread r...@open-mpi.org
Nah, something isn’t right here. The singleton doesn’t go thru that code line, or it isn’t supposed to do so. I think the problem lies in the way the singleton in 2.x is starting up. Let me take a look at how singletons are working over there. > On Sep 14, 2016, at 8:10 PM, Gilles Gouaillardet

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Gilles Gouaillardet
Ralph, i think i just found the root cause :-) from pmix1_client_init() in opal/mca/pmix/pmix112/pmix1_client.c /* store our jobid and rank */ if (NULL != getenv(OPAL_MCA_PREFIX"orte_launch")) { /* if we were launched by the OMPI RTE, then * the jobid is in a special

Re: [OMPI devel] toward a unique session directory

2016-09-14 Thread r...@open-mpi.org
If we are going to make a change, then let’s do it only once. Since we introduced PMIx and the concept of the string namespace, the plan has been to switch away from a numerical jobid and to the namespace. This eliminates the issue of the hash altogether. If we are going to make a disruptive

[OMPI devel] toward a unique session directory

2016-09-14 Thread Gilles Gouaillardet
Ralph, On 9/15/2016 12:11 AM, r...@open-mpi.org wrote: Many things are possible, given infinite time :-) i could not agree more :-D The issue with this notion lies in direct launch scenarios - i.e., when procs are launched directly by the RM and not via mpirun. In this case, there is

[hwloc-devel] Create success (hwloc git dev-1242-g45371f6)

2016-09-14 Thread Ralph H Castain
Creating nightly hwloc snapshot git tarball was a success. Snapshot: hwloc dev-1242-g45371f6 Start time: Wed Sep 14 18:01:08 PDT 2016 End time: Wed Sep 14 18:04:47 PDT 2016 Your friendly daemon, Cyrador

Re: [OMPI devel] Lots of new features rolled out on github.com today

2016-09-14 Thread r...@open-mpi.org
The problem I hit, and the reason I’m pushing back, was that it required me to have a smart phone handy. Not everyone has a smart phone, nor do they always have it sitting next to them. In the case I hit, I was sitting somewhere that (a) had poor cell reception, and (b) didn’t have my cell

Re: [OMPI devel] Lots of new features rolled out on github.com today

2016-09-14 Thread Pritchard Jr., Howard
Ralph, I know with older versions of git you may have problems since you can’t use https. I think with newer versions it will prompt not just for password but also 2-factor. That’s one problem I hit anyway when first enabling 2-factor. Howard -- Howard Pritchard HPC-DES Los Alamos National

Re: [OMPI devel] Lots of new features rolled out on github.com today

2016-09-14 Thread Jeff Squyres (jsquyres)
Sure. There's no rush at all; in fact, this is probably a decent topic for our next face-to-face. > On Sep 14, 2016, at 2:46 PM, r...@open-mpi.org wrote: > > I’d want to _fully_ understand the implications before forcing something on > everyone that might prove burdensome, especially when it

Re: [OMPI devel] Lots of new features rolled out on github.com today

2016-09-14 Thread r...@open-mpi.org
I’d want to _fully_ understand the implications before forcing something on everyone that might prove burdensome, especially when it “solves” a currently non-existent problem > On Sep 14, 2016, at 11:43 AM, Jeff Squyres (jsquyres) > wrote: > > On Sep 14, 2016, at 2:40

Re: [OMPI devel] Lots of new features rolled out on github.com today

2016-09-14 Thread Jeff Squyres (jsquyres)
On Sep 14, 2016, at 2:40 PM, r...@open-mpi.org wrote: > >> - Code reviews got better / more organized >> - Some project management tools now available >> - We can enforce the use of 2-factor authentication > > Please don’t do that... Certainly wouldn't do the last one without talking it through

Re: [OMPI devel] Lots of new features rolled out on github.com today

2016-09-14 Thread r...@open-mpi.org
> On Sep 14, 2016, at 11:37 AM, Jeff Squyres (jsquyres) > wrote: > > - Code reviews got better / more organized > - Some project management tools now available > - We can enforce the use of 2-factor authentication Please don’t do that... > >

[OMPI devel] Lots of new features rolled out on github.com today

2016-09-14 Thread Jeff Squyres (jsquyres)
- Code reviews got better / more organized - Some project management tools now available - We can enforce the use of 2-factor authentication https://github.com/blog/2256-a-whole-new-github-universe-announcing-new-tools-forums-and-features Sweet! -- Jeff Squyres jsquy...@cisco.com For corporate

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Eric Chamberland
Ok, one test segfaulted *but* I can't tell if it is the *same* bug because there has been a segfault: stderr: http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path

[MTT devel] Performance Measurement

2016-09-14 Thread Josh Hursey
(I'm not going to be on the concall today, so maybe we can talk about this on the Issue and in next week's meeting) There is an Issue under discussion here: https://github.com/open-mpi/mtt/issues/445 I'd like us to keep discussing this issue with the hope to get started on it in the near term

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread r...@open-mpi.org
Many things are possible, given infinite time :-) The issue with this notion lies in direct launch scenarios - i.e., when procs are launched directly by the RM and not via mpirun. In this case, there is nobody who can give us the session directory (well, until PMIx becomes universal), and so

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Gilles Gouaillardet
Ralph, is there any reason to use a session directory based on the jobid (or job family) ? I mean, could we use mkstemp to generate a unique directory, and then propagate the path via orted comm or the environment ? Cheers, Gilles On Wednesday, September 14, 2016, r...@open-mpi.org
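
A minimal sketch of that idea, assuming a POSIX system: create a uniquely named directory and publish its path through the environment so that child processes inherit it. Note that the sketch uses `mkdtemp()` rather than the `mkstemp()` named in the message, since `mkstemp()` creates a file and a directory is what is wanted here. The variable name `EXAMPLE_SESSION_DIR`, the `example-session-` prefix, and the helper `make_session_dir()` are hypothetical and do not come from Open MPI.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for mkdtemp() and setenv() */
#endif
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variable name, chosen for this sketch only. */
#define EXAMPLE_SESSION_DIR_ENV "EXAMPLE_SESSION_DIR"

/*
 * Create a unique directory under `base`, store its path in `out`, and
 * export the path through the environment so child processes inherit it.
 * Returns 0 on success, -1 on failure.
 */
static int make_session_dir(const char *base, char *out, size_t outlen)
{
    int n = snprintf(out, outlen, "%s/example-session-XXXXXX", base);
    if (n < 0 || (size_t)n >= outlen) {
        return -1;               /* path did not fit in the buffer */
    }
    if (NULL == mkdtemp(out)) {  /* fills in XXXXXX and creates the directory */
        return -1;
    }
    if (0 != setenv(EXAMPLE_SESSION_DIR_ENV, out, 1)) {
        return -1;
    }
    return 0;
}
```

Because the name is chosen by `mkdtemp()` rather than derived from a jobid, two jobs started under the same base directory get different paths. The sketch does not address the direct-launch case, where no launcher exists to create and propagate the path.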

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Eric Chamberland
On 14/09/16 10:27 AM, Gilles Gouaillardet wrote: Eric, do you mean you have a unique $TMP per a.out ? No or a unique $TMP per "batch" of run ? Yes. I was happy because each nightly batch has its own TMP, so I can check afterward for problems related to a specific night without

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Gilles Gouaillardet
Eric, do you mean you have a unique $TMP per a.out ? or a unique $TMP per "batch" of run ? in the first case, my understanding is that conflicts cannot happen ... once you hit the bug, can you please please post the output of the failed a.out, and run egrep 'jobfam|stop' on all your logs, so we

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread r...@open-mpi.org
This has nothing to do with PMIx, Josh - the error is coming out of the usock OOB component. > On Sep 14, 2016, at 7:17 AM, Joshua Ladd wrote: > > Eric, > > We are looking into the PMIx code path that sets up the jobid. The session > directories are created based on

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Joshua Ladd
Eric, We are looking into the PMIx code path that sets up the jobid. The session directories are created based on the jobid. It might be the case that the jobids (generated with rand) happen to be the same for different jobs resulting in multiple jobs sharing the same session directory, but we
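
To get a feel for how likely such a clash is, the standard birthday-problem calculation can be sketched as below. It assumes identifiers drawn uniformly and independently from a fixed space. The 16-bit space size used in the usage note is an assumption made for illustration only and is not taken from this thread.

```c
#include <stdio.h>

/*
 * Probability that at least two of `n_jobs` identifiers collide when each
 * is drawn uniformly at random from `space` possible values.
 */
static double collision_probability(unsigned n_jobs, double space)
{
    double p_unique = 1.0;
    for (unsigned i = 1; i < n_jobs; i++) {
        if ((double)i >= space) {
            return 1.0;          /* more jobs than values: a clash is certain */
        }
        p_unique *= 1.0 - (double)i / space;
    }
    return 1.0 - p_unique;
}

/* Print the collision odds for a few batch sizes. */
static void print_collision_table(double space)
{
    static const unsigned sizes[] = {10, 100, 300, 1000};
    for (size_t k = 0; k < sizeof sizes / sizeof sizes[0]; k++) {
        printf("%4u jobs: %.4f\n", sizes[k],
               collision_probability(sizes[k], space));
    }
}
```

Calling `print_collision_table(65536.0)` tabulates the odds for an assumed 16-bit space. Under that assumption, roughly 300 concurrent jobs already give about even odds of at least one shared identifier, which is why a nightly batch of many short sequential runs can hit the problem occasionally.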

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Gilles Gouaillardet
Thanks Eric, the goal of the patch is simply not to output info that is not needed (by both orted and a.out) /* since you ./a.out, an orted is forked under the hood */ so the patch is really optional, though convenient. Cheers, Gilles On Wednesday, September 14, 2016, Eric Chamberland <

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Eric Chamberland
Lucky! Since each run has a specific TMP, I still have it on disk. For the faulty run, the TMP variable was: TMP=/tmp/tmp.wOv5dkNaSI and in $TMP I have: openmpi-sessions-40031@lorien_0 and in this subdirectory I have a bunch of empty dirs:

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Eric Chamberland
On 14/09/16 01:36 AM, Gilles Gouaillardet wrote: Eric, can you please provide more information on how your tests are launched ? Yes! do you mpirun -np 1 ./a.out or do you simply ./a.out For all sequential tests, we do ./a.out. do you use a batch manager ? if yes, which one ?

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Joshua Ladd
Hi, Eric I **think** this might be related to the following: https://github.com/pmix/master/pull/145 I'm wondering if you can look into the /tmp directory and see if you have a bunch of stale usock files. Best, Josh On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet

Re: [OMPI devel] link issue on master with --disable-shared --enable-static --disable-dlopen

2016-09-14 Thread Gilles Gouaillardet
Thanks Ralph, i investigated this a bit deeper, and found the $enable_dlopen variable is not correctly used in pmix3x. /* my understanding of pmix3x is that --disable-dlopen implies --disable-pdl-dlopen, but that did not happen */ i opened https://github.com/open-mpi/ompi/pull/2079 so