Re: [OMPI devel] toward a unique session directory

2016-09-15 Thread r...@open-mpi.org
Actually, you just use the envar that was previously cited on a different email 
thread: 

if (NULL != getenv(OPAL_MCA_PREFIX"orte_launch")) {
    /* you were launched by mpirun */
} else {
    /* you were direct launched */
}

This is available from the time of the first instruction, so no worries about 
when you look.
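
For reference, a minimal self-contained sketch of the same check, usable outside 
the ORTE tree; it assumes OPAL_MCA_PREFIX expands to the usual "OMPI_MCA_" 
environment prefix:

#include <stdio.h>
#include <stdlib.h>

/* Outside the Open MPI tree we spell out the usual MCA environment prefix
 * instead of including the OPAL headers (assumption: OPAL_MCA_PREFIX expands
 * to "OMPI_MCA_"). */
#define OPAL_MCA_PREFIX "OMPI_MCA_"

int main(void)
{
    if (NULL != getenv(OPAL_MCA_PREFIX "orte_launch")) {
        printf("launched by mpirun/orted\n");
    } else {
        printf("direct launched\n");
    }
    return 0;
}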


> On Sep 15, 2016, at 7:50 AM, Pritchard Jr., Howard  wrote:
> 
> HI Gilles,
> 
> From what point in the job launch do you need to determine whether
> or not the job was direct launched?
> 
> Howard
> 
> -- 
> Howard Pritchard
> 
> HPC-DES
> Los Alamos National Laboratory
> 
> 
> 
> 
> 
> On 9/15/16, 7:38 AM, "devel on behalf of Gilles Gouaillardet"
> <gilles.gouaillar...@gmail.com> wrote:
> 
>> Ralph,
>> 
>> that looks good to me.
>> 
>> can you please remind me how to test if an app was launched by
>> mpirun/orted or direct launched by the RM ?
>> 
>> right now, which direct launch methods are supported ?
>> i am aware of srun (SLURM) and aprun (CRAY), are there any others ?
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On Thu, Sep 15, 2016 at 7:10 PM, r...@open-mpi.org 
>> wrote:
>>> 
>>> On Sep 15, 2016, at 12:51 AM, Gilles Gouaillardet 
>>> wrote:
>>> 
>>> Ralph,
>>> 
>>> 
>>> my reply is in the text
>>> 
>>> 
>>> On 9/15/2016 11:11 AM, r...@open-mpi.org wrote:
>>> 
>>> If we are going to make a change, then let’s do it only once. Since we
>>> introduced PMIx and the concept of the string namespace, the plan has
>>> been
>>> to switch away from a numerical jobid and to the namespace. This
>>> eliminates
>>> the issue of the hash altogether. If we are going to make a disruptive
>>> change, then let’s do that one. Either way, this isn’t something that
>>> could
>>> go into the 2.x series. It is far too invasive, and would have to be
>>> delayed
>>> until a 3.x at the earliest.
>>> 
>>> got it !
>>> 
>>> Note that I am not yet convinced that is the issue here. We’ve had this
>>> hash
>>> for 12 years, and this is the first time someone has claimed to see a
>>> problem. That makes me very suspicious that the root cause isn’t what
>>> you
>>> are pursuing. This is only being reported for _singletons_, and that is
>>> a
>>> very unique code path. The only reason for launching the orted is to
>>> support
>>> PMIx operations such as notification and comm_spawn. If those aren’t
>>> being
>>> used, then we could use the “isolated” mode where the usock OOB isn’t
>>> even
>>> activated, thus eliminating the problem. This would be a much smaller
>>> “fix”
>>> and could potentially fit into 2.x
>>> 
>>> a bug has been identified and fixed, let's wait and see how things go
>>> 
>>> how can i use the isolated mode ?
>>> shall i simply
>>> export OMPI_MCA_pmix=isolated
>>> export OMPI_MCA_plm=isolated
>>> ?
>>> 
>>> out of curiosity, does "isolated" mean we would not even need to fork
>>> the HNP ?
>>> 
>>> 
>>> Yes - that’s the idea. Simplify and make things faster. All you have to
>>> do
>>> is set OMPI_MCA_ess_singleton_isolated=1 on master, and I believe that
>>> code
>>> is in 2.x as well
>>> 
>>> 
>>> 
>>> FWIW: every organization I’ve worked with has an epilog script that
>>> blows
>>> away temp dirs. It isn’t the RM-based environment that is of concern -
>>> it’s
>>> the non-RM one where epilog scripts don’t exist that is the problem.
>>> 
>>> well, i was looking at this the other way around.
>>> if mpirun/orted creates the session directory with mkstemp(), then
>>> there is
>>> no more need to do any cleanup
>>> (as long as you do not run out of disk space)
>>> but with direct run, there is always a little risk that a previous
>>> session
>>> directory is used, hence the requirement for an epilogue.
>>> also, if the RM is configured to run one job at a time on a given node,
>>> the epilog can be quite trivial.
>>> but if several jobs can run on a given node at the same time, the epilog
>>> becomes less trivial
>>> 
>>> 
>>> Yeah, this session directory thing has always been problematic. We’ve
>>> had
>>> litter problems since day one, and tried multiple solutions over the
>>> years.
>>> Obviously, none of those has proven fully successful :-(
>>> 
>>> Artem came up with a good solution using PMIx that allows the RM to
>>> control
>>> the session directory location for both direct launch and mpirun launch,
>>> thus ensuring the RM can cleanup the correct place upon session
>>> termination.
>>> As we get better adoption of that method out there, then the RM-based
>>> solution (even for multiple jobs sharing a node) should be resolved.
>>> 
>>> This leaves the non-RM (i.e., ssh-based launch using mpirun) problem.
>>> Your
>>> proposal would resolve that one as (a) we always have orted’s in that
>>> scenario, and (b) the orted’s pass the session directory down to the
>>> apps.
>>> So maybe the right approach is to use mkstemp() in the scenario where
>>> we are
>>> launched via orted and the RM has not specified a session directory.
>>> 
>>> I’m not sure we can resolve the direct launch without PMIx problem - I
>>> think that’s best left as another incentive for RMs to get on-board the
>>> PMIx bus.

Re: [OMPI devel] toward a unique session directory

2016-09-15 Thread Pritchard Jr., Howard
HI Gilles,

From what point in the job launch do you need to determine whether
or not the job was direct launched?

Howard

-- 
Howard Pritchard

HPC-DES
Los Alamos National Laboratory





On 9/15/16, 7:38 AM, "devel on behalf of Gilles Gouaillardet"
<gilles.gouaillar...@gmail.com> wrote:

>Ralph,
>
>that looks good to me.
>
>can you please remind me how to test if an app was launched by
>mpirun/orted or direct launched by the RM ?
>
>right now, which direct launch methods are supported ?
>i am aware of srun (SLURM) and aprun (CRAY), are there any others ?
>
>Cheers,
>
>Gilles
>
>On Thu, Sep 15, 2016 at 7:10 PM, r...@open-mpi.org 
>wrote:
>>
>> On Sep 15, 2016, at 12:51 AM, Gilles Gouaillardet 
>>wrote:
>>
>> Ralph,
>>
>>
>> my reply is in the text
>>
>>
>> On 9/15/2016 11:11 AM, r...@open-mpi.org wrote:
>>
>> If we are going to make a change, then let’s do it only once. Since we
>> introduced PMIx and the concept of the string namespace, the plan has
>>been
>> to switch away from a numerical jobid and to the namespace. This
>>eliminates
>> the issue of the hash altogether. If we are going to make a disruptive
>> change, then let’s do that one. Either way, this isn’t something that
>>could
>> go into the 2.x series. It is far too invasive, and would have to be
>>delayed
>> until a 3.x at the earliest.
>>
>> got it !
>>
>> Note that I am not yet convinced that is the issue here. We’ve had this
>>hash
>> for 12 years, and this is the first time someone has claimed to see a
>> problem. That makes me very suspicious that the root cause isn’t what
>>you
>> are pursuing. This is only being reported for _singletons_, and that is
>>a
>> very unique code path. The only reason for launching the orted is to
>>support
>> PMIx operations such as notification and comm_spawn. If those aren’t
>>being
>> used, then we could use the “isolated” mode where the usock OOB isn’t
>>even
>> activated, thus eliminating the problem. This would be a much smaller
>>“fix”
>> and could potentially fit into 2.x
>>
>> a bug has been identified and fixed, let's wait and see how things go
>>
>> how can i use the isolated mode ?
>> shall i simply
>> export OMPI_MCA_pmix=isolated
>> export OMPI_MCA_plm=isolated
>> ?
>>
>> out of curiosity, does "isolated" mean we would not even need to fork
>> the HNP ?
>>
>>
>> Yes - that’s the idea. Simplify and make things faster. All you have to
>>do
>> is set OMPI_MCA_ess_singleton_isolated=1 on master, and I believe that
>>code
>> is in 2.x as well
>>
>>
>>
>> FWIW: every organization I’ve worked with has an epilog script that
>>blows
>> away temp dirs. It isn’t the RM-based environment that is of concern -
>>it’s
>> the non-RM one where epilog scripts don’t exist that is the problem.
>>
>> well, i was looking at this the other way around.
>> if mpirun/orted creates the session directory with mkstemp(), then
>>there is
>> no more need to do any cleanup
>> (as long as you do not run out of disk space)
>> but with direct run, there is always a little risk that a previous
>>session
>> directory is used, hence the requirement for an epilogue.
>> also, if the RM is configured to run one job at a time on a given node,
>> the epilog can be quite trivial.
>> but if several jobs can run on a given node at the same time, the epilog
>> becomes less trivial
>>
>>
>> Yeah, this session directory thing has always been problematic. We’ve
>>had
>> litter problems since day one, and tried multiple solutions over the
>>years.
>> Obviously, none of those has proven fully successful :-(
>>
>> Artem came up with a good solution using PMIx that allows the RM to
>>control
>> the session directory location for both direct launch and mpirun launch,
>> thus ensuring the RM can cleanup the correct place upon session
>>termination.
>> As we get better adoption of that method out there, then the RM-based
>> solution (even for multiple jobs sharing a node) should be resolved.
>>
>> This leaves the non-RM (i.e., ssh-based launch using mpirun) problem.
>>Your
>> proposal would resolve that one as (a) we always have orted’s in that
>> scenario, and (b) the orted’s pass the session directory down to the
>>apps.
>> So maybe the right approach is to use mkstemp() in the scenario where
>>we are
>> launched via orted and the RM has not specified a session directory.
>>
>> I’m not sure we can resolve the direct launch without PMIx problem - I
>>think
>> that’s best left as another incentive for RMs to get on-board the PMIx
>>bus.
>>
>> Make sense?
>>
>>
>>
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On Sep 14, 2016, at 6:05 PM, Gilles Gouaillardet 
>>wrote:
>>
>> Ralph,
>>
>>
>> On 9/15/2016 12:11 AM, r...@open-mpi.org wrote:
>>
>> Many things are possible, given infinite time :-)
>>
>> i could not agree more :-D
>>
>> The issue with this notion lies in direct launch scenarios - i.e., when
>> procs are launched directly by the RM and not via mpirun. In this case,
>> there is nobody who can give us the session directory (well, until PMIx
>> becomes universal), and so the apps must be able to generate a name that
>> they all can know. Otherwise, we lose shared memory support because they
>> can’t rendezvous.

Re: [OMPI devel] toward a unique session directory

2016-09-15 Thread Gilles Gouaillardet
Ralph,

that looks good to me.

can you please remind me how to test if an app was launched by
mpirun/orted or direct launched by the RM ?

right now, which direct launch methods are supported ?
i am aware of srun (SLURM) and aprun (CRAY), are there any others ?

Cheers,

Gilles

On Thu, Sep 15, 2016 at 7:10 PM, r...@open-mpi.org  wrote:
>
> On Sep 15, 2016, at 12:51 AM, Gilles Gouaillardet  wrote:
>
> Ralph,
>
>
> my reply is in the text
>
>
> On 9/15/2016 11:11 AM, r...@open-mpi.org wrote:
>
> If we are going to make a change, then let’s do it only once. Since we
> introduced PMIx and the concept of the string namespace, the plan has been
> to switch away from a numerical jobid and to the namespace. This eliminates
> the issue of the hash altogether. If we are going to make a disruptive
> change, then let’s do that one. Either way, this isn’t something that could
> go into the 2.x series. It is far too invasive, and would have to be delayed
> until a 3.x at the earliest.
>
> got it !
>
> Note that I am not yet convinced that is the issue here. We’ve had this hash
> for 12 years, and this is the first time someone has claimed to see a
> problem. That makes me very suspicious that the root cause isn’t what you
> are pursuing. This is only being reported for _singletons_, and that is a
> very unique code path. The only reason for launching the orted is to support
> PMIx operations such as notification and comm_spawn. If those aren’t being
> used, then we could use the “isolated” mode where the usock OOB isn’t even
> activated, thus eliminating the problem. This would be a much smaller “fix”
> and could potentially fit into 2.x
>
> a bug has been identified and fixed, let's wait and see how things go
>
> how can i use the isolated mode ?
> shall i simply
> export OMPI_MCA_pmix=isolated
> export OMPI_MCA_plm=isolated
> ?
>
> out of curiosity, does "isolated" mean we would not even need to fork the
> HNP ?
>
>
> Yes - that’s the idea. Simplify and make things faster. All you have to do
> is set OMPI_MCA_ess_singleton_isolated=1 on master, and I believe that code
> is in 2.x as well
>
>
>
> FWIW: every organization I’ve worked with has an epilog script that blows
> away temp dirs. It isn’t the RM-based environment that is of concern - it’s
> the non-RM one where epilog scripts don’t exist that is the problem.
>
> well, i was looking at this the other way around.
> if mpirun/orted creates the session directory with mkstemp(), then there is
> no more need to do any cleanup
> (as long as you do not run out of disk space)
> but with direct run, there is always a little risk that a previous session
> directory is used, hence the requirement for an epilogue.
> also, if the RM is configured to run one job at a time on a given node,
> the epilog can be quite trivial.
> but if several jobs can run on a given node at the same time, the epilog
> becomes less trivial
>
>
> Yeah, this session directory thing has always been problematic. We’ve had
> litter problems since day one, and tried multiple solutions over the years.
> Obviously, none of those has proven fully successful :-(
>
> Artem came up with a good solution using PMIx that allows the RM to control
> the session directory location for both direct launch and mpirun launch,
> thus ensuring the RM can cleanup the correct place upon session termination.
> As we get better adoption of that method out there, then the RM-based
> solution (even for multiple jobs sharing a node) should be resolved.
>
> This leaves the non-RM (i.e., ssh-based launch using mpirun) problem. Your
> proposal would resolve that one as (a) we always have orted’s in that
> scenario, and (b) the orted’s pass the session directory down to the apps.
> So maybe the right approach is to use mkstemp() in the scenario where we are
> launched via orted and the RM has not specified a session directory.
>
> I’m not sure we can resolve the direct launch without PMIx problem - I think
> that’s best left as another incentive for RMs to get on-board the PMIx bus.
>
> Make sense?
>
>
>
>
> Cheers,
>
> Gilles
>
>
> On Sep 14, 2016, at 6:05 PM, Gilles Gouaillardet  wrote:
>
> Ralph,
>
>
> On 9/15/2016 12:11 AM, r...@open-mpi.org wrote:
>
> Many things are possible, given infinite time :-)
>
> i could not agree more :-D
>
> The issue with this notion lies in direct launch scenarios - i.e., when
> procs are launched directly by the RM and not via mpirun. In this case,
> there is nobody who can give us the session directory (well, until PMIx
> becomes universal), and so the apps must be able to generate a name that
> they all can know. Otherwise, we lose shared memory support because they
> can’t rendezvous.
>
> thanks for the explanation,
> now let me rephrase that
> "a MPI task must be able to rebuild the path to the session directory, based
> on the information it has when launched.
> if mpirun is used, we have several options to pass this option to the MPI
> tasks.
> in case of direct run, this info is unlikely (PMIx is n

Re: [OMPI devel] toward a unique session directory

2016-09-15 Thread r...@open-mpi.org

> On Sep 15, 2016, at 12:51 AM, Gilles Gouaillardet  wrote:
> 
> Ralph,
> 
> 
> 
> my reply is in the text
> 
> 
> On 9/15/2016 11:11 AM, r...@open-mpi.org  wrote:
>> If we are going to make a change, then let’s do it only once. Since we 
>> introduced PMIx and the concept of the string namespace, the plan has been 
>> to switch away from a numerical jobid and to the namespace. This eliminates 
>> the issue of the hash altogether. If we are going to make a disruptive 
>> change, then let’s do that one. Either way, this isn’t something that could 
>> go into the 2.x series. It is far too invasive, and would have to be delayed 
>> until a 3.x at the earliest.
>> 
> got it !
>> Note that I am not yet convinced that is the issue here. We’ve had this hash 
>> for 12 years, and this is the first time someone has claimed to see a 
>> problem. That makes me very suspicious that the root cause isn’t what you 
>> are pursuing. This is only being reported for _singletons_, and that is a 
>> very unique code path. The only reason for launching the orted is to support 
>> PMIx operations such as notification and comm_spawn. If those aren’t being 
>> used, then we could use the “isolated” mode where the usock OOB isn’t even 
>> activated, thus eliminating the problem. This would be a much smaller “fix” 
>> and could potentially fit into 2.x
>> 
> a bug has been identified and fixed, let's wait and see how things go
> 
> how can i use the isolated mode ?
> shall i simply
> export OMPI_MCA_pmix=isolated
> export OMPI_MCA_plm=isolated
> ?
> 
> out of curiosity, does "isolated" mean we would not even need to fork the 
> HNP ?

Yes - that’s the idea. Simplify and make things faster. All you have to do is 
set OMPI_MCA_ess_singleton_isolated=1 on master, and I believe that code is in 
2.x as well
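
As a concrete illustration, a hedged sketch of setting that MCA parameter from
the application itself rather than the shell; it assumes the parameter name
quoted above, and that setting it in the environment before MPI_Init is
equivalent to the export shown earlier:

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    /* Sketch: request the "isolated" singleton mode discussed above.
     * Must be set before MPI_Init, and is only appropriate when the
     * singleton will not use comm_spawn or PMIx notification. */
    setenv("OMPI_MCA_ess_singleton_isolated", "1", 1);

    MPI_Init(&argc, &argv);
    /* ... purely local singleton work ... */
    MPI_Finalize();
    return 0;
}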

> 
> 
>> FWIW: every organization I’ve worked with has an epilog script that blows 
>> away temp dirs. It isn’t the RM-based environment that is of concern - it’s 
>> the non-RM one where epilog scripts don’t exist that is the problem.
> well, i was looking at this the other way around.
> if mpirun/orted creates the session directory with mkstemp(), then there is 
> no more need to do any cleanup
> (as long as you do not run out of disk space)
> but with direct run, there is always a little risk that a previous session 
> directory is used, hence the requirement for an epilogue.
> also, if the RM is configured to run one job at a time on a given node, 
> the epilog can be quite trivial.
> but if several jobs can run on a given node at the same time, the epilog 
> becomes less trivial

Yeah, this session directory thing has always been problematic. We’ve had 
litter problems since day one, and tried multiple solutions over the years. 
Obviously, none of those has proven fully successful :-(

Artem came up with a good solution using PMIx that allows the RM to control the 
session directory location for both direct launch and mpirun launch, thus 
ensuring the RM can cleanup the correct place upon session termination. As we 
get better adoption of that method out there, then the RM-based solution (even 
for multiple jobs sharing a node) should be resolved.

This leaves the non-RM (i.e., ssh-based launch using mpirun) problem. Your 
proposal would resolve that one as (a) we always have orted’s in that scenario, 
and (b) the orted’s pass the session directory down to the apps. So maybe the 
right approach is to use mkstemp() in the scenario where we are launched via 
orted and the RM has not specified a session directory.
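
As a rough sketch of that approach (not the actual ORTE code): for a directory,
the relevant libc call is mkdtemp(), the directory-creating sibling of
mkstemp(); the launcher could create the unique directory once and hand the path
down through the environment. The variable name below is made up for
illustration:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Sketch only: mkdtemp() is the directory-creating sibling of mkstemp().
     * OMPI_SESSION_DIR_DEMO is a made-up variable name for illustration,
     * not an actual Open MPI environment variable. */
    const char *base = getenv("TMPDIR");
    char tmpl[4096];

    snprintf(tmpl, sizeof(tmpl), "%s/ompi-session-XXXXXX",
             (NULL != base) ? base : "/tmp");

    if (NULL == mkdtemp(tmpl)) {
        perror("mkdtemp");
        return 1;
    }

    /* The launcher (mpirun/orted) would export the resulting unique path so
     * that every child process can rendezvous in the same directory. */
    setenv("OMPI_SESSION_DIR_DEMO", tmpl, 1);
    printf("session directory: %s\n", tmpl);
    return 0;
}

Because mkdtemp() guarantees the chosen name does not already exist, a leftover
directory from an earlier job can never be picked up by accident.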

I’m not sure we can resolve the direct launch without PMIx problem - I think 
that’s best left as another incentive for RMs to get on-board the PMIx bus.

Make sense?


> 
> 
> Cheers,
> 
> Gilles
>> 
>>> On Sep 14, 2016, at 6:05 PM, Gilles Gouaillardet wrote:
>>> 
>>> Ralph,
>>> 
>>> 
>>> On 9/15/2016 12:11 AM, r...@open-mpi.org  wrote:
 Many things are possible, given infinite time :-)
 
>>> i could not agree more :-D
 The issue with this notion lies in direct launch scenarios - i.e., when 
 procs are launched directly by the RM and not via mpirun. In this case, 
 there is nobody who can give us the session directory (well, until PMIx 
 becomes universal), and so the apps must be able to generate a name that 
 they all can know. Otherwise, we lose shared memory support because they 
 can’t rendezvous.
>>> thanks for the explanation,
>>> now let me rephrase that
>>> "a MPI task must be able to rebuild the path to the session directory, 
>>> based on the information it has when launched.
>>> if mpirun is used, we have several options to pass this option to the MPI 
>>> tasks.
>>> in case of direct run, this info is unlikely (PMIx is not universal (yet)) 
>>> passed by the batch manager, so we have to use what is available"
>>> 
>>> my concern is that, to keep things simple, session directory is based on 
>>> the Open MPI jobid, and since stepid is zero most of the time, jobid 
>>> really means job family, which is stored in 16 bits.

Re: [OMPI devel] toward a unique session directory

2016-09-15 Thread Gilles Gouaillardet

Ralph,


my reply is in the text


On 9/15/2016 11:11 AM, r...@open-mpi.org wrote:
If we are going to make a change, then let’s do it only once. Since we 
introduced PMIx and the concept of the string namespace, the plan has 
been to switch away from a numerical jobid and to the namespace. This 
eliminates the issue of the hash altogether. If we are going to make a 
disruptive change, then let’s do that one. Either way, this isn’t 
something that could go into the 2.x series. It is far too invasive, 
and would have to be delayed until a 3.x at the earliest.



got it !
Note that I am not yet convinced that is the issue here. We’ve had 
this hash for 12 years, and this is the first time someone has claimed 
to see a problem. That makes me very suspicious that the root cause 
isn’t what you are pursuing. This is only being reported for 
_singletons_, and that is a very unique code path. The only reason for 
launching the orted is to support PMIx operations such as notification 
and comm_spawn. If those aren’t being used, then we could use the 
“isolated” mode where the usock OOB isn’t even activated, thus 
eliminating the problem. This would be a much smaller “fix” and could 
potentially fit into 2.x



a bug has been identified and fixed, let's wait and see how things go

how can i use the isolated mode ?
shall i simply
export OMPI_MCA_pmix=isolated
export OMPI_MCA_plm=isolated
?

out of curiosity, does "isolated" mean we would not even need to fork 
the HNP ?



FWIW: every organization I’ve worked with has an epilog script that 
blows away temp dirs. It isn’t the RM-based environment that is of 
concern - it’s the non-RM one where epilog scripts don’t exist that is 
the problem.

well, i was looking at this the other way around.
if mpirun/orted creates the session directory with mkstemp(), then there 
is no more need to do any cleanup

(as long as you do not run out of disk space)
but with direct run, there is always a little risk that a previous 
session directory is used, hence the requirement for an epilogue.
also, if the RM is configured to run one job at a time on a given node, 
the epilog can be quite trivial.
but if several jobs can run on a given node at the same time, the epilog 
becomes less trivial



Cheers,

Gilles


On Sep 14, 2016, at 6:05 PM, Gilles Gouaillardet wrote:


Ralph,


On 9/15/2016 12:11 AM, r...@open-mpi.org wrote:

Many things are possible, given infinite time :-)


i could not agree more :-D
The issue with this notion lies in direct launch scenarios - i.e., 
when procs are launched directly by the RM and not via mpirun. In 
this case, there is nobody who can give us the session directory 
(well, until PMIx becomes universal), and so the apps must be able 
to generate a name that they all can know. Otherwise, we lose shared 
memory support because they can’t rendezvous.

thanks for the explanation,
now let me rephrase that
"a MPI task must be able to rebuild the path to the session 
directory, based on the information it has when launched.
if mpirun is used, we have several options to pass this option to the 
MPI tasks.
in case of direct run, this info is unlikely (PMIx is not universal 
(yet)) passed by the batch manager, so we have to use what is available"


my concern is that, to keep things simple, the session directory is based 
on the Open MPI jobid, and since stepid is zero most of the time, 
jobid really means job family, which is stored in 16 bits.

in the case of mpirun, jobfam is a 16 bit hash of the hostname 
(a reasonably sized string) and the mpirun pid (32 bits on Linux).
if several mpiruns are invoked on the same host at a given time, there 
is a risk that two distinct jobs are assigned the same jobfam (since we 
hash from 32 bits down to 16 bits).
also, there is a risk that the session directory already exists from a 
previous job, with some/all files and unix sockets from that previous 
job, leading to undefined behavior 
(an early crash if we are lucky, odd things otherwise).
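
To make the collision concern concrete, here is a toy stand-in (not the real 
ORTE hash) that folds the hostname and pid down to 16 bits; with only 65536 
possible values, a few hundred concurrent mpirun invocations already give a 
non-trivial chance that two of them collide:

#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Toy stand-in for the jobfam hash discussed above (NOT the real ORTE code):
 * fold a hostname string plus a pid down to 16 bits. With only 65536 possible
 * values, distinct (hostname, pid) pairs can easily collide. */
static uint16_t toy_jobfam(const char *hostname, pid_t pid)
{
    uint32_t h = 5381;                       /* djb2-style string hash */
    for (const char *p = hostname; '\0' != *p; p++) {
        h = h * 33u + (unsigned char)*p;
    }
    h ^= (uint32_t)pid;
    return (uint16_t)(h ^ (h >> 16));        /* fold 32 bits down to 16 */
}

int main(void)
{
    char host[256] = "unknown-host";
    pid_t pid = getpid();

    gethostname(host, sizeof(host));
    printf("toy jobfam for (%s, %ld) = %u\n",
           host, (long)pid, (unsigned)toy_jobfam(host, pid));
    return 0;
}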

in the case of direct run, i guess jobfam is a 16 bit hash of the 
jobid passed by the RM, and once again, there is a risk of conflict 
and/or the re-use of a previous session directory.


to me, the issue here is we are using the Open MPI jobfam in order to 
build the session directory path

instead, what if we
1) when using mpirun, use a session directory created by mkstemp(), and 
pass it to the MPI tasks via the environment, or retrieve it from 
orted/mpirun right after the communication has been established.
2) for direct run, use a session directory based on the full jobid 
(which might be a string or a number) as passed by the RM


in case of 1), there is no more risk of a hash conflict, or re-using 
a previous session directory
in case of 2), there is no more risk of a hash conflict, but there is 
still a risk of re-using a session directory from a previous (e.g. 
terminated) job.
that being said, once we document how the session directory is built 
from the jobid, sysadmins will be able to write an RM epilog that removes 
the session directory.
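
A hedged sketch of option 2) for a direct launch, assuming SLURM: every task 
derives the same path from the RM-provided jobid (SLURM_JOB_ID here), so they 
can rendezvous, and an epilog can remove exactly that path:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

int main(void)
{
    /* Sketch only, assuming a SLURM direct launch: build the session
     * directory path from the RM's full jobid instead of a 16-bit hash,
     * so every task on the node computes the same path. */
    const char *jobid = getenv("SLURM_JOB_ID");
    const char *base = getenv("TMPDIR");
    char path[4096];

    if (NULL == jobid) {
        fprintf(stderr, "not running under a SLURM allocation\n");
        return 1;
    }
    snprintf(path, sizeof(path), "%s/ompi-session-%s",
             (NULL != base) ? base : "/tmp", jobid);

    /* Tolerate the race where another local task created the directory first. */
    if (0 != mkdir(path, 0700) && EEXIST != errno) {
        perror("mkdir");
        return 1;
    }
    printf("session directory: %s\n", path);
    return 0;
}

An epilog then only needs to remove the directory named after the finished 
job's id, which stays unambiguous even when several jobs share a node.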

Re: [OMPI devel] toward a unique session directory

2016-09-14 Thread r...@open-mpi.org
If we are going to make a change, then let’s do it only once. Since we 
introduced PMIx and the concept of the string namespace, the plan has been to 
switch away from a numerical jobid and to the namespace. This eliminates the 
issue of the hash altogether. If we are going to make a disruptive change, then 
let’s do that one. Either way, this isn’t something that could go into the 2.x 
series. It is far too invasive, and would have to be delayed until a 3.x at the 
earliest.

Note that I am not yet convinced that is the issue here. We’ve had this hash 
for 12 years, and this is the first time someone has claimed to see a problem. 
That makes me very suspicious that the root cause isn’t what you are pursuing. 
This is only being reported for _singletons_, and that is a very unique code 
path. The only reason for launching the orted is to support PMIx operations 
such as notification and comm_spawn. If those aren’t being used, then we could 
use the “isolated” mode where the usock OOB isn’t even activated, thus 
eliminating the problem. This would be a much smaller “fix” and could 
potentially fit into 2.x

FWIW: every organization I’ve worked with has an epilog script that blows away 
temp dirs. It isn’t the RM-based environment that is of concern - it’s the 
non-RM one where epilog scripts don’t exist that is the problem.


> On Sep 14, 2016, at 6:05 PM, Gilles Gouaillardet  wrote:
> 
> Ralph,
> 
> On 9/15/2016 12:11 AM, r...@open-mpi.org  wrote:
>> Many things are possible, given infinite time :-)
>> 
> i could not agree more :-D
>> The issue with this notion lies in direct launch scenarios - i.e., when 
>> procs are launched directly by the RM and not via mpirun. In this case, 
>> there is nobody who can give us the session directory (well, until PMIx 
>> becomes universal), and so the apps must be able to generate a name that 
>> they all can know. Otherwise, we lose shared memory support because they 
>> can’t rendezvous.
> thanks for the explanation,
> now let me rephrase that
> "a MPI task must be able to rebuild the path to the session directory, based 
> on the information it has when launched.
> if mpirun is used, we have several options to pass this option to the MPI 
> tasks.
> in case of direct run, this info is unlikely (PMIx is not universal (yet)) 
> passed by the batch manager, so we have to use what is available"
> 
> my concern is that, to keep things simple, session directory is based on the 
> Open MPI jobid, and since stepid is zero most of the time, jobid really means 
> job family, which is stored in 16 bits.
> 
> in the case of mpirun, jobfam is a 16 bit hash of the hostname (a reasonably 
> sized string) and the mpirun pid (32 bits on Linux).
> if several mpiruns are invoked on the same host at a given time, there is a 
> risk that two distinct jobs are assigned the same jobfam (since we hash from 32 
> bits down to 16 bits).
> also, there is a risk that the session directory already exists from a previous 
> job, with some/all files and unix sockets from that previous job, leading to 
> undefined behavior 
> (an early crash if we are lucky, odd things otherwise).
> 
> in the case of direct run, i guess jobfam is a 16 bit hash of the jobid 
> passed by the RM, and once again, there is a risk of conflict and/or the 
> re-use of a previous session directory.
> 
> to me, the issue here is we are using the Open MPI jobfam in order to build 
> the session directory path
> instead, what if we
> 1) when using mpirun, use a session directory created by mkstemp(), and pass it 
> to the MPI tasks via the environment, or retrieve it from orted/mpirun right 
> after the communication has been established.
> 2) for direct run, use a session directory based on the full jobid (which 
> might be a string or a number) as passed by the RM
> 
> in case of 1), there is no more risk of a hash conflict, or re-using a 
> previous session directory
> in case of 2), there is no more risk of a hash conflict, but there is still a 
> risk of re-using a session directory from a previous (e.g. terminated) job.
> that being said, once we document how the session directory is built from the 
> jobid, sysadmins will be able to write an RM epilog that removes the session 
> directory.
> 
>  does that make sense ?
>> 
>> However, that doesn’t seem to be the root problem here. I suspect there is a 
>> bug in the code that spawns the orted from the singleton, and subsequently 
>> parses the returned connection info. If you look at the error, you’ll see 
>> that both jobid’s have “zero” for their local jobid. This means that the two 
>> procs attempting to communicate both think they are daemons, which is 
>> impossible in this scenario.
>> 
>> So something garbled the string that the orted returns on startup to the 
>> singleton, and/or the singleton is parsing it incorrectly. IIRC, the 
>> singleton gets its name from that string, and so I expect it is getting the 
>> wrong name - and hence the error.
>> 
> i will i