Just in the FWIW category: the HNP used to send the singleton’s name down the
pipe at startup, which eliminated the code line you identified. Now, we are
pushing the name into the environment as a PMIx envar, and having the PMIx
component pick it up. Roundabout way of getting it, and that’s
Ah...I take that back. We changed this and now we _do_ indeed go down that code
path. Not good.
So yes, we need that putenv so it gets the jobid from the HNP that was
launched, like it used to do. You want to throw that in?
Thanks
Ralph
> On Sep 14, 2016, at 8:18 PM, r...@open-mpi.org wrote:
Nah, something isn’t right here. The singleton doesn’t go thru that code line,
or it isn’t supposed to do so. I think the problem lies in the way the
singleton in 2.x is starting up. Let me take a look at how singletons are
working over there.
> On Sep 14, 2016, at 8:10 PM, Gilles Gouaillardet
Ralph,
I think I just found the root cause :-)
from pmix1_client_init() in opal/mca/pmix/pmix112/pmix1_client.c:

    /* store our jobid and rank */
    if (NULL != getenv(OPAL_MCA_PREFIX"orte_launch")) {
        /* if we were launched by the OMPI RTE, then
         * the jobid is in a special
If we are going to make a change, then let’s do it only once. Since we
introduced PMIx and the concept of the string namespace, the plan has been to
switch away from a numerical jobid and to the namespace. This eliminates the
issue of the hash altogether. If we are going to make a disruptive
Ralph,
On 9/15/2016 12:11 AM, r...@open-mpi.org wrote:
Many things are possible, given infinite time :-)
I could not agree more :-D
The issue with this notion lies in direct launch scenarios - i.e.,
when procs are launched directly by the RM and not via mpirun. In this
case, there is
Creating nightly hwloc snapshot git tarball was a success.
Snapshot: hwloc dev-1242-g45371f6
Start time: Wed Sep 14 18:01:08 PDT 2016
End time: Wed Sep 14 18:04:47 PDT 2016
Your friendly daemon,
Cyrador
___
hwloc-devel mailing list
The problem I hit, and the reason I’m pushing back, was that it required me to
have a smart phone handy. Not everyone has a smart phone, nor do they always
have it sitting next to them. In the case I hit, I was sitting somewhere that
(a) had poor cell reception, and (b) didn’t have my cell
Ralph,
I know with older versions of git you may have problems since you can't use
https. I think with newer versions it will prompt not just for password but
also 2-factor.
That’s one problem I hit anyway when first enabling 2-factor.
Howard
--
Howard Pritchard
HPC-DES
Los Alamos National
Sure. There's no rush at all; in fact, this is probably a decent topic for our
next face-to-face.
> On Sep 14, 2016, at 2:46 PM, r...@open-mpi.org wrote:
>
> I’d want to _fully_ understand the implications before forcing something on
> everyone that might prove burdensome, especially when it
I’d want to _fully_ understand the implications before forcing something on
everyone that might prove burdensome, especially when it “solves” a currently
non-existent problem
> On Sep 14, 2016, at 11:43 AM, Jeff Squyres (jsquyres)
> wrote:
>
> On Sep 14, 2016, at 2:40
On Sep 14, 2016, at 2:40 PM, r...@open-mpi.org wrote:
>
>> - Code reviews got better / more organized
>> - Some project management tools now available
>> - We can enforce the use of 2-factor authentication
>
> Please don’t do that...
Certainly wouldn't do the last one without talking it through
> On Sep 14, 2016, at 11:37 AM, Jeff Squyres (jsquyres)
> wrote:
>
> - Code reviews got better / more organized
> - Some project management tools now available
> - We can enforce the use of 2-factor authentication
Please don’t do that...
>
>
- Code reviews got better / more organized
- Some project management tools now available
- We can enforce the use of 2-factor authentication
https://github.com/blog/2256-a-whole-new-github-universe-announcing-new-tools-forums-and-features
Sweet!
--
Jeff Squyres
jsquy...@cisco.com
For corporate
Ok,
one test segfaulted *but* I can't tell if it is the *same* bug just because
there has been a segfault:
stderr:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
[lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh
path
(I'm not going to be on the concall today, so maybe we can talk about this
on the Issue and in next week's meeting)
There is an Issue under discussion here:
https://github.com/open-mpi/mtt/issues/445
I'd like us to keep discussing this issue with the hope of getting started
on it in the near term
Many things are possible, given infinite time :-)
The issue with this notion lies in direct launch scenarios - i.e., when procs
are launched directly by the RM and not via mpirun. In this case, there is
nobody who can give us the session directory (well, until PMIx becomes
universal), and so
Ralph,
is there any reason to use a session directory based on the jobid (or job
family)?
I mean, could we use mkstemp to generate a unique directory, and then
propagate the path via orted comm or the environment?
Cheers,
Gilles
On Wednesday, September 14, 2016, r...@open-mpi.org
On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:
Eric,
do you mean you have a unique $TMP per a.out?
No
or a unique $TMP per "batch" of runs?
Yes.
I was happy because each nightly batch has its own TMP, so I can check
afterward for problems related to a specific night without
Eric,
do you mean you have a unique $TMP per a.out?
or a unique $TMP per "batch" of runs?
in the first case, my understanding is that conflicts cannot happen ...
once you hit the bug, can you please post the output of the failed
a.out,
and run
egrep 'jobfam|stop'
on all your logs, so we
This has nothing to do with PMIx, Josh - the error is coming out of the usock
OOB component.
> On Sep 14, 2016, at 7:17 AM, Joshua Ladd wrote:
>
> Eric,
>
> We are looking into the PMIx code path that sets up the jobid. The session
> directories are created based on
Eric,
We are looking into the PMIx code path that sets up the jobid. The session
directories are created based on the jobid. It might be the case that the
jobids (generated with rand) happen to be the same for different jobs
resulting in multiple jobs sharing the same session directory, but we
Thanks Eric,
the goal of the patch is simply not to output info that is not needed (by
both orted and a.out)
/* since you ./a.out, an orted is forked under the hood */
so the patch is really optional, though convenient.
Cheers,
Gilles
On Wednesday, September 14, 2016, Eric Chamberland <
Lucky!
Since each run has a specific TMP, I still have it on disk.
for the faulty run, the TMP variable was:
TMP=/tmp/tmp.wOv5dkNaSI
and into $TMP I have:
openmpi-sessions-40031@lorien_0
and into this subdirectory I have a bunch of empty dirs:
On 14/09/16 01:36 AM, Gilles Gouaillardet wrote:
Eric,
can you please provide more information on how your tests are launched?
Yes!
do you
mpirun -np 1 ./a.out
or do you simply
./a.out
For all sequential tests, we do ./a.out.
do you use a batch manager? If yes, which one?
Hi, Eric
I **think** this might be related to the following:
https://github.com/pmix/master/pull/145
I'm wondering if you can look into the /tmp directory and see if you have a
bunch of stale usock files.
Best,
Josh
On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet
Thanks Ralph,
I investigated this a bit deeper, and found the $enable_dlopen variable
is not correctly used in pmix3x.
/* my understanding of pmix3x is that --disable-dlopen implies
--disable-pdl-dlopen,
but that did not happen */
I opened https://github.com/open-mpi/ompi/pull/2079 so