Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-10-06 Thread Gilles Gouaillardet

Eric,


2.0.2 is scheduled to happen.

2.1.0 will bring some new features, whereas v2.0.2 is a bug-fix release.

My guess is that v2.0.2 will come first, but this is just a guess.

(Even if v2.1.0 comes first, v2.0.2 will be released anyway.)


Cheers,


Gilles


On 10/7/2016 2:34 AM, Eric Chamberland wrote:

Hi Gilles,

just to mention that since PR 2091 has been merged into 2.0.x, I 
haven't had any failures!


Since 2.0.0 and 2.0.1 aren't usable for us, the next version should be 
a good one... So will there be a 2.0.2 release or will it go to 2.1.0 
directly?


Thanks,

Eric

On 16/09/16 10:01 AM, Gilles Gouaillardet wrote:

Eric,

I expect the PR will fix this bug.
The crash occurs after the unexpected process identifier error, and this
error should not happen in the first place. So at this stage, I would
not worry too much about that crash (to me, it is undefined behavior 
anyway).


Cheers,

Gilles

On Friday, September 16, 2016, Eric Chamberland
> wrote:

Hi,

I know the pull request has not (yet) been merged, but here is a
somewhat "different" output from a single sequential test
(automatically) launched without mpirun last night:

[lorien:172229] [[INVALID],INVALID] plm:rsh_lookup on agent ssh :
rsh path NULL
[lorien:172229] plm:base:set_hnp_name: initial bias 172229 nodename
hash 1366255883
[lorien:172229] plm:base:set_hnp_name: final jobfam 39075
[lorien:172229] [[39075,0],0] plm:rsh_setup on agent ssh : rsh 
path NULL

[lorien:172229] [[39075,0],0] plm:base:receive start comm
[lorien:172229] [[39075,0],0] plm:base:launch [39075,1] registered
[lorien:172229] [[39075,0],0] plm:base:launch job [39075,1] is not a
dynamic spawn
[lorien:172218] [[41545,589],0] usock_peer_recv_connect_ack:
received unexpected process identifier [[41545,0],0] from 
[[39075,0],0]

[lorien:172218] *** Process received signal ***
[lorien:172218] Signal: Segmentation fault (11)
[lorien:172218] Signal code: Invalid permissions (2)
[lorien:172218] Failing at address: 0x2d07e00
[lorien:172218] [ 0] [lorien:172229] [[39075,0],0] plm:base:receive
stop comm


unfortunately, I didn't get any coredump (???). The line:

[lorien:172218] Signal code: Invalid permissions (2)

is that curious or not?

as usual, here are the build logs:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_config.log


http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_ompi_info_all.txt


Will PR #1376 prevent or fix this too?

Thanks again!

Eric



On 15/09/16 09:32 AM, Eric Chamberland wrote:

Hi Gilles,

On 15/09/16 03:38 AM, Gilles Gouaillardet wrote:

Eric,


a bug has been identified, and a patch is available at
https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch




the bug is specific to singleton mode (e.g. ./a.out vs
mpirun -np 1 ./a.out), so if applying a patch does not fit your test
workflow,

it might be easier for you to update your workflow and run
mpirun -np 1 ./a.out instead of ./a.out


basically, increasing verbosity runs some extra code, which
includes sprintf.
so yes, it is possible to crash an app by increasing verbosity, by
running into a bug that is hidden under normal operation.
my intuition suggests this is quite unlikely ... if you can get a core
file and a backtrace, we will soon find out

Damn! I did get one, but it got erased last night when the automatic
process started again... (which erases all directories before
starting) :/

I would like to put core files in a user-specific directory, but it
seems that has to be a system-wide configuration... :/  I will
trick this by changing the "pwd" to a path outside the erased directory...

So as of tonight I should be able to retrieve core files even after I
relaunch the process.

Thanks for all the support!

Eric


Cheers,

Gilles



On 9/15/2016 2:58 AM, Eric Chamberland wrote:

Ok,

one test segfaulted *but* I can't tell if it is the
*same* bug because
there has been a segfault:

stderr:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-10-06 Thread Eric Chamberland

Hi Gilles,

just to mention that since PR 2091 has been merged into 2.0.x, I 
haven't had any failures!


Since 2.0.0 and 2.0.1 aren't usable for us, the next version should be a 
good one... So will there be a 2.0.2 release or will it go to 2.1.0 
directly?


Thanks,

Eric

On 16/09/16 10:01 AM, Gilles Gouaillardet wrote:

Eric,

I expect the PR will fix this bug.
The crash occurs after the unexpected process identifier error, and this
error should not happen in the first place. So at this stage, I would
not worry too much about that crash (to me, it is undefined behavior anyway).

Cheers,

Gilles

On Friday, September 16, 2016, Eric Chamberland
> wrote:

Hi,

I know the pull request has not (yet) been merged, but here is a
somewhat "different" output from a single sequential test
(automatically) launched without mpirun last night:

[lorien:172229] [[INVALID],INVALID] plm:rsh_lookup on agent ssh :
rsh path NULL
[lorien:172229] plm:base:set_hnp_name: initial bias 172229 nodename
hash 1366255883
[lorien:172229] plm:base:set_hnp_name: final jobfam 39075
[lorien:172229] [[39075,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:172229] [[39075,0],0] plm:base:receive start comm
[lorien:172229] [[39075,0],0] plm:base:launch [39075,1] registered
[lorien:172229] [[39075,0],0] plm:base:launch job [39075,1] is not a
dynamic spawn
[lorien:172218] [[41545,589],0] usock_peer_recv_connect_ack:
received unexpected process identifier [[41545,0],0] from [[39075,0],0]
[lorien:172218] *** Process received signal ***
[lorien:172218] Signal: Segmentation fault (11)
[lorien:172218] Signal code: Invalid permissions (2)
[lorien:172218] Failing at address: 0x2d07e00
[lorien:172218] [ 0] [lorien:172229] [[39075,0],0] plm:base:receive
stop comm


unfortunately, I didn't get any coredump (???). The line:

[lorien:172218] Signal code: Invalid permissions (2)

is that curious or not?

as usual, here are the build logs:


http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_config.log




http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_ompi_info_all.txt



Will PR #1376 prevent or fix this too?

Thanks again!

Eric



On 15/09/16 09:32 AM, Eric Chamberland wrote:

Hi Gilles,

On 15/09/16 03:38 AM, Gilles Gouaillardet wrote:

Eric,


a bug has been identified, and a patch is available at

https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch





the bug is specific to singleton mode (e.g. ./a.out vs
mpirun -np 1 ./a.out), so if applying a patch does not fit your test
workflow,

it might be easier for you to update your workflow and run
mpirun -np 1 ./a.out instead of ./a.out


basically, increasing verbosity runs some extra code, which
includes sprintf.
so yes, it is possible to crash an app by increasing verbosity, by
running into a bug that is hidden under normal operation.
my intuition suggests this is quite unlikely ... if you can get a core
file and a backtrace, we will soon find out

Damn! I did get one, but it got erased last night when the automatic
process started again... (which erases all directories before
starting) :/

I would like to put core files in a user-specific directory, but it
seems that has to be a system-wide configuration... :/  I will
trick this by changing the "pwd" to a path outside the erased directory...

So as of tonight I should be able to retrieve core files even after I
relaunch the process.

Thanks for all the support!

Eric


Cheers,

Gilles



On 9/15/2016 2:58 AM, Eric Chamberland wrote:

Ok,

one test segfaulted *but* I can't tell if it is the
*same* bug because
there has been a segfault:

stderr:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt





[lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on
agent ssh : rsh
path NULL
[lorien:190552] plm:base:set_hnp_name: 

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-16 Thread Gilles Gouaillardet
Eric,

I expect the PR will fix this bug.
The crash occurs after the unexpected process identifier error, and this
error should not happen in the first place. So at this stage, I would not
worry too much about that crash (to me, it is undefined behavior anyway).

Cheers,

Gilles

On Friday, September 16, 2016, Eric Chamberland <
eric.chamberl...@giref.ulaval.ca> wrote:

> Hi,
>
> I know the pull request has not (yet) been merged, but here is a somewhat
> "different" output from a single sequential test (automatically) laucnhed
> without mpirun last night:
>
> [lorien:172229] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path
> NULL
> [lorien:172229] plm:base:set_hnp_name: initial bias 172229 nodename hash
> 1366255883
> [lorien:172229] plm:base:set_hnp_name: final jobfam 39075
> [lorien:172229] [[39075,0],0] plm:rsh_setup on agent ssh : rsh path NULL
> [lorien:172229] [[39075,0],0] plm:base:receive start comm
> [lorien:172229] [[39075,0],0] plm:base:launch [39075,1] registered
> [lorien:172229] [[39075,0],0] plm:base:launch job [39075,1] is not a
> dynamic spawn
> [lorien:172218] [[41545,589],0] usock_peer_recv_connect_ack: received
> unexpected process identifier [[41545,0],0] from [[39075,0],0]
> [lorien:172218] *** Process received signal ***
> [lorien:172218] Signal: Segmentation fault (11)
> [lorien:172218] Signal code: Invalid permissions (2)
> [lorien:172218] Failing at address: 0x2d07e00
> [lorien:172218] [ 0] [lorien:172229] [[39075,0],0] plm:base:receive stop
> comm
>
>
> unfortunately, I didn't got any coredump (???)  The line:
>
> [lorien:172218] Signal code: Invalid permissions (2)
>
> is curious or not?
>
> as usual, here are the build logs:
>
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16
> .01h16m01s_config.log
>
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16
> .01h16m01s_ompi_info_all.txt
>
> Does the PR #1376 will prevent or fix this too?
>
> Thanks again!
>
> Eric
>
>
>
> On 15/09/16 09:32 AM, Eric Chamberland wrote:
>
>> Hi Gilles,
>>
>> On 15/09/16 03:38 AM, Gilles Gouaillardet wrote:
>>
>>> Eric,
>>>
>>>
>>> a bug has been identified, and a patch is available at
>>> https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-
>>> release/pull/1376.patch
>>>
>>>
>>>
>>> the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1
>>> ./a.out), so if applying a patch does not fit your test workflow,
>>>
>>> it might be easier for you to update it and mpirun -np 1 ./a.out instead
>>> of ./a.out
>>>
>>>
>>> basically, increasing verbosity runs some extra code, which include
>>> sprintf.
>>> so yes, it is possible to crash an app by increasing verbosity by
>>> running into a bug that is hidden under normal operation.
>>> my intuition suggests this is quite unlikely ... if you can get a core
>>> file and a backtrace, we will soon find out
>>>
>>> Damn! I did got one but it got erased last night when the automatic
>> process started again... (which erase all directories before starting) :/
>>
>> I would like to put core files in a user specific directory, but it
>> seems it has to be a system-wide configuration... :/  I will trick this
>> by changing the "pwd" to a path outside the erased directory...
>>
>> So as of tonight I should be able to retrieve core files even after I
>> relaunched the process..
>>
>> Thanks for all the support!
>>
>> Eric
>>
>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>>
>>> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
>>>
 Ok,

 one test segfaulted *but* I can't tell if it is the *same* bug because
 there has been a segfault:

 stderr:
 http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14
 .10h38m52s.faultyCerr.Triangle.h_cte_1.txt



 [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh
 path NULL
 [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename
 hash 1366255883
 [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
 [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
 [lorien:190552] [[53310,0],0] plm:base:receive start comm
 *** Error in `orted': realloc(): invalid next size: 0x01e58770
 ***
 ...
 ...
 [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a
 daemon on the local node in file ess_singleton_module.c at line 573
 [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a
 daemon on the local node in file ess_singleton_module.c at line 163
 *** An error occurred in MPI_Init_thread
 *** on a NULL communicator
 *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
 ***and potentially your MPI job)
 [lorien:190306] Local abort before MPI_INIT completed completed
 successfully, but am not able to aggregate error messages, and not
 able to guarantee that all other processes were killed!

 stdout:

 

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-16 Thread Eric Chamberland

Hi,

I know the pull request has not (yet) been merged, but here is a 
somewhat "different" output from a single sequential test 
(automatically) launched without mpirun last night:


[lorien:172229] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh 
path NULL
[lorien:172229] plm:base:set_hnp_name: initial bias 172229 nodename hash 
1366255883

[lorien:172229] plm:base:set_hnp_name: final jobfam 39075
[lorien:172229] [[39075,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:172229] [[39075,0],0] plm:base:receive start comm
[lorien:172229] [[39075,0],0] plm:base:launch [39075,1] registered
[lorien:172229] [[39075,0],0] plm:base:launch job [39075,1] is not a 
dynamic spawn
[lorien:172218] [[41545,589],0] usock_peer_recv_connect_ack: received 
unexpected process identifier [[41545,0],0] from [[39075,0],0]

[lorien:172218] *** Process received signal ***
[lorien:172218] Signal: Segmentation fault (11)
[lorien:172218] Signal code: Invalid permissions (2)
[lorien:172218] Failing at address: 0x2d07e00
[lorien:172218] [ 0] [lorien:172229] [[39075,0],0] plm:base:receive stop 
comm



unfortunately, I didn't get any coredump (???). The line:

[lorien:172218] Signal code: Invalid permissions (2)

is that curious or not?

as usual, here are the build logs:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_config.log

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_ompi_info_all.txt

Will PR #1376 prevent or fix this too?

Thanks again!

Eric



On 15/09/16 09:32 AM, Eric Chamberland wrote:

Hi Gilles,

On 15/09/16 03:38 AM, Gilles Gouaillardet wrote:

Eric,


a bug has been identified, and a patch is available at
https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch



the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1
./a.out), so if applying a patch does not fit your test workflow,

it might be easier for you to update your workflow and run mpirun -np 1
./a.out instead of ./a.out


basically, increasing verbosity runs some extra code, which includes
sprintf.
so yes, it is possible to crash an app by increasing verbosity, by
running into a bug that is hidden under normal operation.
my intuition suggests this is quite unlikely ... if you can get a core
file and a backtrace, we will soon find out


Damn! I did get one, but it got erased last night when the automatic
process started again... (which erases all directories before starting) :/

I would like to put core files in a user-specific directory, but it
seems that has to be a system-wide configuration... :/  I will trick this
by changing the "pwd" to a path outside the erased directory...

So as of tonight I should be able to retrieve core files even after I
relaunch the process.

Thanks for all the support!

Eric



Cheers,

Gilles



On 9/15/2016 2:58 AM, Eric Chamberland wrote:

Ok,

one test segfaulted *but* I can't tell if it is the *same* bug because
there has been a segfault:

stderr:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt



[lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh
path NULL
[lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename
hash 1366255883
[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
[lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:190552] [[53310,0],0] plm:base:receive start comm
*** Error in `orted': realloc(): invalid next size: 0x01e58770
***
...
...
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a
daemon on the local node in file ess_singleton_module.c at line 573
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a
daemon on the local node in file ess_singleton_module.c at line 163
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[lorien:190306] Local abort before MPI_INIT completed completed
successfully, but am not able to aggregate error messages, and not
able to guarantee that all other processes were killed!

stdout:

--


It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127)
instead of ORTE_SUCCESS
--


--


It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons 

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-15 Thread r...@open-mpi.org
It’s okay - it was just confusing

This actually wound up having nothing to do with how the jobid is generated. 
The root cause of the problem was that we took an mpirun-generated jobid, and 
then mistakenly passed it back thru a hash function instead of just using it. 
So we hashed a perfectly good jobid.

What is puzzling is how it could ever have worked, yet the user said it only 
occasionally messed things up enough to cause breakage. You would think that 
hashing a valid jobid would create an unusable mess, but that doesn’t appear to 
be a definitive result.

 probably indicative of the weakness of the hash :-)
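
To make the point concrete, here is a minimal, self-contained sketch (hypothetical 
helper names; the djb2 hash below is only a stand-in and is not claimed to be the 
real OPAL_HASH_STR). Parsing the jobid back out of the nspace string is lossless, 
while hashing that same string can map two distinct jobids to the same value; the 
jobfam values 9325 and 5590 are the ones reported elsewhere in this thread.

    /* Illustrative sketch only -- not the actual Open MPI code paths.
     * parse_jobid() mimics "just use the jobid"
     * (opal_convert_string_to_jobid-style); hash_jobid() mimics
     * "hash the nspace string" (djb2 here, for illustration only). */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    static uint32_t parse_jobid(const char *nspace)
    {
        /* lossless: two distinct jobids always stay distinct */
        return (uint32_t)strtoul(nspace, NULL, 10);
    }

    static uint32_t hash_jobid(const char *nspace)
    {
        uint32_t h = 5381;
        for (const char *p = nspace; *p != '\0'; p++) {
            h = (h << 5) + h + (uint8_t)*p;
        }
        return h & ~0x80000000u;   /* "keep it from being negative" */
    }

    int main(void)
    {
        /* jobids built from the two jobfams seen in this thread,
         * shifted into the upper 16 bits: (jobfam << 16) + 1 */
        char a[16], b[16];
        snprintf(a, sizeof(a), "%u", (9325u << 16) + 1);   /* 611123201 */
        snprintf(b, sizeof(b), "%u", (5590u << 16) + 1);   /* 366346241 */

        printf("parsed: %u vs %u\n",
               (unsigned)parse_jobid(a), (unsigned)parse_jobid(b));
        printf("hashed: %u vs %u\n",
               (unsigned)hash_jobid(a), (unsigned)hash_jobid(b));
        /* the parsed values are guaranteed distinct; the hashed values are
         * not, and with the real hash this thread shows they did collide */
        return 0;
    }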


> On Sep 15, 2016, at 8:34 AM, Joshua Ladd  wrote:
> 
> Ralph,
> 
> We love PMIx :). In this context, when I say PMIx, I am referring to the PMIx 
> framework in OMPI/OPAL, not the standalone PMIx library. Sorry that wasn't 
> clear.
> 
> Josh 
> 
> On Thu, Sep 15, 2016 at 10:07 AM, r...@open-mpi.org 
>  > 
> wrote:
> I don’t understand this fascination with PMIx. PMIx didn’t calculate this 
> jobid - OMPI did. Yes, it is in the opal/pmix layer, but it had -nothing- to 
> do with PMIx.
> 
> So why do you want to continue to blame PMIx for this problem??
> 
> 
>> On Sep 15, 2016, at 4:29 AM, Joshua Ladd > > wrote:
>> 
>> Great catch, Gilles! Not much of a surprise though. 
>> 
>> Indeed, this issue has EVERYTHING to do with how PMIx is calculating the 
>> jobid, which, in this case, results in hash collisions. ;-P
>> 
>> Josh
>> 
>> On Thursday, September 15, 2016, Gilles Gouaillardet > > wrote:
>> Eric,
>> 
>> 
>> a bug has been identified, and a patch is available at 
>> https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch
>>  
>> 
>> 
>> the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1 
>> ./a.out), so if applying a patch does not fit your test workflow,
>> 
>> it might be easier for you to update it and mpirun -np 1 ./a.out instead of 
>> ./a.out
>> 
>> 
>> basically, increasing verbosity runs some extra code, which include sprintf.
>> so yes, it is possible to crash an app by increasing verbosity by running 
>> into a bug that is hidden under normal operation.
>> my intuition suggests this is quite unlikely ... if you can get a core file 
>> and a backtrace, we will soon find out
>> 
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> 
>> 
>> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
>> Ok,
>> 
>> one test segfaulted *but* I can't tell if it is the *same* bug because there 
>> has been a segfault:
>> 
>> stderr:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
>>  
>> 
>>  
>> 
>> [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path 
>> NULL
>> [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 
>> 1366255883
>> [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>> [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>> [lorien:190552] [[53310,0],0] plm:base:receive start comm
>> *** Error in `orted': realloc(): invalid next size: 0x01e58770 ***
>> ...
>> ...
>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon 
>> on the local node in file ess_singleton_module.c at line 573
>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon 
>> on the local node in file ess_singleton_module.c at line 163
>> *** An error occurred in MPI_Init_thread
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***and potentially your MPI job)
>> [lorien:190306] Local abort before MPI_INIT completed completed 
>> successfully, but am not able to aggregate error messages, and not able to 
>> guarantee that all other processes were killed!
>> 
>> stdout:
>> 
>> -- 
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>> 
>>   orte_ess_init failed
>>   --> Returned value Unable to start a daemon on the local node (-127) 
>> instead of ORTE_SUCCESS
>> -- 
>> -- 
>> It looks like MPI_INIT failed for some reason; your 

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-15 Thread Joshua Ladd
Ralph,

We love PMIx :). In this context, when I say PMIx, I am referring to the
PMIx framework in OMPI/OPAL, not the standalone PMIx library. Sorry that
wasn't clear.

Josh

On Thu, Sep 15, 2016 at 10:07 AM, r...@open-mpi.org  wrote:

> I don’t understand this fascination with PMIx. PMIx didn’t calculate this
> jobid - OMPI did. Yes, it is in the opal/pmix layer, but it had -nothing-
> to do with PMIx.
>
> So why do you want to continue to blame PMIx for this problem??
>
>
> On Sep 15, 2016, at 4:29 AM, Joshua Ladd  wrote:
>
> Great catch, Gilles! Not much of a surprise though.
>
> Indeed, this issue has EVERYTHING to do with how PMIx is calculating the
> jobid, which, in this case, results in hash collisions. ;-P
>
> Josh
>
> On Thursday, September 15, 2016, Gilles Gouaillardet 
> wrote:
>
>> Eric,
>>
>>
>> a bug has been identified, and a patch is available at
>> https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-r
>> elease/pull/1376.patch
>>
>> the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1
>> ./a.out), so if applying a patch does not fit your test workflow,
>>
>> it might be easier for you to update it and mpirun -np 1 ./a.out instead
>> of ./a.out
>>
>>
>> basically, increasing verbosity runs some extra code, which include
>> sprintf.
>> so yes, it is possible to crash an app by increasing verbosity by running
>> into a bug that is hidden under normal operation.
>> my intuition suggests this is quite unlikely ... if you can get a core
>> file and a backtrace, we will soon find out
>>
>>
>> Cheers,
>>
>> Gilles
>>
>>
>>
>> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
>>
>>> Ok,
>>>
>>> one test segfaulted *but* I can't tell if it is the *same* bug because
>>> there has been a segfault:
>>>
>>> stderr:
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14
>>> .10h38m52s.faultyCerr.Triangle.h_cte_1.txt
>>>
>>> [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh
>>> path NULL
>>> [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash
>>> 1366255883
>>> [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>>> [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>>> [lorien:190552] [[53310,0],0] plm:base:receive start comm
>>> *** Error in `orted': realloc(): invalid next size: 0x01e58770
>>> ***
>>> ...
>>> ...
>>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a
>>> daemon on the local node in file ess_singleton_module.c at line 573
>>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a
>>> daemon on the local node in file ess_singleton_module.c at line 163
>>> *** An error occurred in MPI_Init_thread
>>> *** on a NULL communicator
>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> ***and potentially your MPI job)
>>> [lorien:190306] Local abort before MPI_INIT completed completed
>>> successfully, but am not able to aggregate error messages, and not able to
>>> guarantee that all other processes were killed!
>>>
>>> stdout:
>>>
>>> --
>>>
>>> It looks like orte_init failed for some reason; your parallel process is
>>> likely to abort.  There are many reasons that a parallel process can
>>> fail during orte_init; some of which are due to configuration or
>>> environment problems.  This failure appears to be an internal failure;
>>> here's some additional information (which may only be relevant to an
>>> Open MPI developer):
>>>
>>>   orte_ess_init failed
>>>   --> Returned value Unable to start a daemon on the local node (-127)
>>> instead of ORTE_SUCCESS
>>> --
>>>
>>> --
>>>
>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>> likely to abort.  There are many reasons that a parallel process can
>>> fail during MPI_INIT; some of which are due to configuration or
>>> environment
>>> problems.  This failure appears to be an internal failure; here's some
>>> additional information (which may only be relevant to an Open MPI
>>> developer):
>>>
>>>   ompi_mpi_init: ompi_rte_init failed
>>>   --> Returned "Unable to start a daemon on the local node" (-127)
>>> instead of "Success" (0)
>>> --
>>>
>>>
>>> openmpi content of $TMP:
>>>
>>> /tmp/tmp.GoQXICeyJl> ls -la
>>> total 1500
>>> drwx--3 cmpbib bib 250 Sep 14 13:34 .
>>> drwxrwxrwt  356 root   root  61440 Sep 14 13:45 ..
>>> ...
>>> drwx-- 1848 cmpbib bib   45056 Sep 14 13:34
>>> openmpi-sessions-40031@lorien_0
>>> srw-rw-r--1 cmpbib bib   0 Sep 14 12:24 pmix-190552
>>>
>>> cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find
>>> . -type f
>>> ./53310/contact.txt

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-15 Thread r...@open-mpi.org
I don’t understand this fascination with PMIx. PMIx didn’t calculate this jobid 
- OMPI did. Yes, it is in the opal/pmix layer, but it had -nothing- to do with 
PMIx.

So why do you want to continue to blame PMIx for this problem??


> On Sep 15, 2016, at 4:29 AM, Joshua Ladd  wrote:
> 
> Great catch, Gilles! Not much of a surprise though. 
> 
> Indeed, this issue has EVERYTHING to do with how PMIx is calculating the 
> jobid, which, in this case, results in hash collisions. ;-P
> 
> Josh
> 
> On Thursday, September 15, 2016, Gilles Gouaillardet  > wrote:
> Eric,
> 
> 
> a bug has been identified, and a patch is available at 
> https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch
>  
> 
> 
> the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1 ./a.out), 
> so if applying a patch does not fit your test workflow,
> 
> it might be easier for you to update it and mpirun -np 1 ./a.out instead of 
> ./a.out
> 
> 
> basically, increasing verbosity runs some extra code, which include sprintf.
> so yes, it is possible to crash an app by increasing verbosity by running 
> into a bug that is hidden under normal operation.
> my intuition suggests this is quite unlikely ... if you can get a core file 
> and a backtrace, we will soon find out
> 
> 
> Cheers,
> 
> Gilles
> 
> 
> 
> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
> Ok,
> 
> one test segfaulted *but* I can't tell if it is the *same* bug because there 
> has been a segfault:
> 
> stderr:
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
>  
> 
>  
> 
> [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path 
> NULL
> [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 
> 1366255883
> [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
> [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
> [lorien:190552] [[53310,0],0] plm:base:receive start comm
> *** Error in `orted': realloc(): invalid next size: 0x01e58770 ***
> ...
> ...
> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon 
> on the local node in file ess_singleton_module.c at line 573
> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon 
> on the local node in file ess_singleton_module.c at line 163
> *** An error occurred in MPI_Init_thread
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
> [lorien:190306] Local abort before MPI_INIT completed completed successfully, 
> but am not able to aggregate error messages, and not able to guarantee that 
> all other processes were killed!
> 
> stdout:
> 
> -- 
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>   orte_ess_init failed
>   --> Returned value Unable to start a daemon on the local node (-127) 
> instead of ORTE_SUCCESS
> -- 
> -- 
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
>   ompi_mpi_init: ompi_rte_init failed
>   --> Returned "Unable to start a daemon on the local node" (-127) instead of 
> "Success" (0)
> -- 
> 
> openmpi content of $TMP:
> 
> /tmp/tmp.GoQXICeyJl> ls -la
> total 1500
> drwx--3 cmpbib bib 250 Sep 14 13:34 .
> drwxrwxrwt  356 root   root  61440 Sep 14 13:45 ..
> ...
> drwx-- 1848 cmpbib bib   45056 Sep 14 13:34 
> openmpi-sessions-40031@lorien_0
> srw-rw-r--1 cmpbib bib   0 Sep 14 12:24 pmix-190552
> 
> cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find . 
> -type f
> ./53310/contact.txt
> 
> cat 53310/contact.txt
> 3493724160.0;usock;tcp://132.203.7.36:54605 
> 190552
> 
> egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr*|grep 53310
> 

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-15 Thread Eric Chamberland

Hi Gilles,

On 15/09/16 03:38 AM, Gilles Gouaillardet wrote:

Eric,


a bug has been identified, and a patch is available at
https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch


the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1
./a.out), so if applying a patch does not fit your test workflow,

it might be easier for you to update your workflow and run mpirun -np 1
./a.out instead of ./a.out


basically, increasing verbosity runs some extra code, which includes
sprintf.
so yes, it is possible to crash an app by increasing verbosity, by
running into a bug that is hidden under normal operation.
my intuition suggests this is quite unlikely ... if you can get a core
file and a backtrace, we will soon find out

Damn! I did get one, but it got erased last night when the automatic 
process started again... (which erases all directories before starting) :/


I would like to put core files in a user-specific directory, but it 
seems that has to be a system-wide configuration... :/  I will trick this 
by changing the "pwd" to a path outside the erased directory...


So as of tonight I should be able to retrieve core files even after I 
relaunch the process.


Thanks for all the support!

Eric



Cheers,

Gilles



On 9/15/2016 2:58 AM, Eric Chamberland wrote:

Ok,

one test segfaulted *but* I can't tell if it is the *same* bug because
there has been a segfault:

stderr:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt


[lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh
path NULL
[lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename
hash 1366255883
[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
[lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:190552] [[53310,0],0] plm:base:receive start comm
*** Error in `orted': realloc(): invalid next size: 0x01e58770
***
...
...
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a
daemon on the local node in file ess_singleton_module.c at line 573
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a
daemon on the local node in file ess_singleton_module.c at line 163
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[lorien:190306] Local abort before MPI_INIT completed completed
successfully, but am not able to aggregate error messages, and not
able to guarantee that all other processes were killed!

stdout:

--

It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127)
instead of ORTE_SUCCESS
--

--

It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127)
instead of "Success" (0)
--


openmpi content of $TMP:

/tmp/tmp.GoQXICeyJl> ls -la
total 1500
drwx--3 cmpbib bib 250 Sep 14 13:34 .
drwxrwxrwt  356 root   root  61440 Sep 14 13:45 ..
...
drwx-- 1848 cmpbib bib   45056 Sep 14 13:34
openmpi-sessions-40031@lorien_0
srw-rw-r--1 cmpbib bib   0 Sep 14 12:24 pmix-190552

cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0>
find . -type f
./53310/contact.txt

cat 53310/contact.txt
3493724160.0;usock;tcp://132.203.7.36:54605
190552

egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr*|grep 53310
dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552]
plm:base:set_hnp_name: final jobfam 53310

(this is the faulty test)
full egrep:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt


config.log:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log


ompi_info:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt


Maybe it aborted (instead of printing the other message) while reporting the
error, because of export OMPI_MCA_plm_base_verbose=5 ?


Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-15 Thread Joshua Ladd
Great catch, Gilles! Not much of a surprise though.

Indeed, this issue has EVERYTHING to do with how PMIx is calculating the
jobid, which, in this case, results in hash collisions. ;-P

Josh

On Thursday, September 15, 2016, Gilles Gouaillardet 
wrote:

> Eric,
>
>
> a bug has been identified, and a patch is available at
> https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-
> release/pull/1376.patch
>
> the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1
> ./a.out), so if applying a patch does not fit your test workflow,
>
> it might be easier for you to update it and mpirun -np 1 ./a.out instead
> of ./a.out
>
>
> basically, increasing verbosity runs some extra code, which include
> sprintf.
> so yes, it is possible to crash an app by increasing verbosity by running
> into a bug that is hidden under normal operation.
> my intuition suggests this is quite unlikely ... if you can get a core
> file and a backtrace, we will soon find out
>
>
> Cheers,
>
> Gilles
>
>
>
> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
>
>> Ok,
>>
>> one test segfaulted *but* I can't tell if it is the *same* bug because
>> there has been a segfault:
>>
>> stderr:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14
>> .10h38m52s.faultyCerr.Triangle.h_cte_1.txt
>>
>> [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh
>> path NULL
>> [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash
>> 1366255883
>> [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>> [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>> [lorien:190552] [[53310,0],0] plm:base:receive start comm
>> *** Error in `orted': realloc(): invalid next size: 0x01e58770 ***
>> ...
>> ...
>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a
>> daemon on the local node in file ess_singleton_module.c at line 573
>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a
>> daemon on the local node in file ess_singleton_module.c at line 163
>> *** An error occurred in MPI_Init_thread
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***and potentially your MPI job)
>> [lorien:190306] Local abort before MPI_INIT completed completed
>> successfully, but am not able to aggregate error messages, and not able to
>> guarantee that all other processes were killed!
>>
>> stdout:
>>
>> --
>>
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   orte_ess_init failed
>>   --> Returned value Unable to start a daemon on the local node (-127)
>> instead of ORTE_SUCCESS
>> --
>>
>> --
>>
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or
>> environment
>> problems.  This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>>   ompi_mpi_init: ompi_rte_init failed
>>   --> Returned "Unable to start a daemon on the local node" (-127)
>> instead of "Success" (0)
>> --
>>
>>
>> openmpi content of $TMP:
>>
>> /tmp/tmp.GoQXICeyJl> ls -la
>> total 1500
>> drwx--3 cmpbib bib 250 Sep 14 13:34 .
>> drwxrwxrwt  356 root   root  61440 Sep 14 13:45 ..
>> ...
>> drwx-- 1848 cmpbib bib   45056 Sep 14 13:34
>> openmpi-sessions-40031@lorien_0
>> srw-rw-r--1 cmpbib bib   0 Sep 14 12:24 pmix-190552
>>
>> cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find
>> . -type f
>> ./53310/contact.txt
>>
>> cat 53310/contact.txt
>> 3493724160.0;usock;tcp://132.203.7.36:54605
>> 190552
>>
>> egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr*|grep 53310
>> dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552]
>> plm:base:set_hnp_name: final jobfam 53310
>>
>> (this is the faulty test)
>> full egrep:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14
>> .10h38m52s.egrep.txt
>>
>> config.log:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14
>> .10h38m52s_config.log
>>
>> ompi_info:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14
>> .10h38m52s_ompi_info_all.txt
>>
>> Maybe it aborted (instead of giving the other message) while doing the
>> 

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-15 Thread Gilles Gouaillardet

Eric,


a bug has been identified, and a patch is available at 
https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch


the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1 
./a.out), so if applying a patch does not fit your test workflow,


it might be easier for you to update your workflow and run mpirun -np 1 
./a.out instead of ./a.out



basically, increasing verbosity runs some extra code, which includes sprintf.
so yes, it is possible to crash an app by increasing verbosity, by 
running into a bug that is hidden under normal operation.
my intuition suggests this is quite unlikely ... if you can get a core 
file and a backtrace, we will soon find out
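
As a concrete (and deliberately simplified) illustration of that last point, here 
is a small stand-alone toy program, not Open MPI code: the buggy sprintf() only 
runs when a verbosity flag is set, so with the flag off the overflow never happens, 
and with it on the damage typically only surfaces later at the next heap call, 
much like the "realloc(): invalid next size" abort in the orted output quoted below.

    /* Toy example -- not Open MPI code. The overflow below is intentional,
     * to show how a verbose-only sprintf() into an undersized heap buffer
     * can stay hidden until a later realloc()/free() trips over the
     * corrupted allocator metadata. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int verbose = (argc > 1);        /* stand-in for a verbosity MCA knob */
        char *msg   = malloc(8);         /* far too small for the verbose text */
        char *other = malloc(16);        /* a neighbouring heap allocation */

        if (verbose) {
            /* bug: writes ~40 bytes into an 8-byte buffer */
            sprintf(msg, "plm:base:set_hnp_name: final jobfam %d", 53310);
        } else {
            strcpy(msg, "ok");           /* fits, so the bug stays hidden */
        }

        /* the damage is only detected here, at the next heap operation;
         * glibc may abort with e.g. "realloc(): invalid next size" */
        other = realloc(other, 64);
        printf("%s / %s\n", msg, other ? "realloc ok" : "realloc failed");
        free(msg);
        free(other);
        return 0;
    }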



Cheers,

Gilles



On 9/15/2016 2:58 AM, Eric Chamberland wrote:

Ok,

one test segfaulted *but* I can't tell if it is the *same* bug because 
there has been a segfault:


stderr:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt 



[lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh 
path NULL
[lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename 
hash 1366255883

[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
[lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:190552] [[53310,0],0] plm:base:receive start comm
*** Error in `orted': realloc(): invalid next size: 0x01e58770 
***

...
...
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a 
daemon on the local node in file ess_singleton_module.c at line 573
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a 
daemon on the local node in file ess_singleton_module.c at line 163

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[lorien:190306] Local abort before MPI_INIT completed completed 
successfully, but am not able to aggregate error messages, and not 
able to guarantee that all other processes were killed!


stdout:

-- 


It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127) 
instead of ORTE_SUCCESS
-- 

-- 


It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or 
environment

problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127) 
instead of "Success" (0)
-- 



openmpi content of $TMP:

/tmp/tmp.GoQXICeyJl> ls -la
total 1500
drwx--3 cmpbib bib 250 Sep 14 13:34 .
drwxrwxrwt  356 root   root  61440 Sep 14 13:45 ..
...
drwx-- 1848 cmpbib bib   45056 Sep 14 13:34 
openmpi-sessions-40031@lorien_0

srw-rw-r--1 cmpbib bib   0 Sep 14 12:24 pmix-190552

cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> 
find . -type f

./53310/contact.txt

cat 53310/contact.txt
3493724160.0;usock;tcp://132.203.7.36:54605
190552

egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr*|grep 53310
dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552] 
plm:base:set_hnp_name: final jobfam 53310


(this is the faulty test)
full egrep:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt 



config.log:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log 



ompi_info:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt 



Maybe it aborted (instead of printing the other message) while reporting the 
error, because of export OMPI_MCA_plm_base_verbose=5 ?


Thanks,

Eric


On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:

Eric,

do you mean you have a unique $TMP per a.out?
or a unique $TMP per "batch" of runs?

in the first case, my understanding is that conflicts cannot happen ...

once you hit the bug, can you please post the output of the
failed a.out,
and run
egrep 'jobfam|stop'
on all your logs, so we might spot a conflict

Cheers,

Gilles

On Wednesday, September 14, 2016, Eric Chamberland


Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-15 Thread Gilles Gouaillardet

Ralph,


I fixed master at 
https://github.com/open-mpi/ompi/commit/11ebf3ab23bdaeb0ec96818c119364c6d837cd3b


and the PR for v2.x is at https://github.com/open-mpi/ompi-release/pull/1376


Cheers,


Gilles


On 9/15/2016 12:26 PM, r...@open-mpi.org wrote:

Ah...I take that back. We changed this and now we _do_ indeed go down that code 
path. Not good.

So yes, we need that putenv so it gets the jobid from the HNP that was 
launched, like it used to do. You want to throw that in?

Thanks
Ralph


On Sep 14, 2016, at 8:18 PM, r...@open-mpi.org wrote:

Nah, something isn’t right here. The singleton doesn’t go thru that code line, 
or it isn’t supposed to do so. I think the problem lies in the way the 
singleton in 2.x is starting up. Let me take a look at how singletons are 
working over there.


On Sep 14, 2016, at 8:10 PM, Gilles Gouaillardet  wrote:

Ralph,


I think I just found the root cause :-)


From pmix1_client_init() in opal/mca/pmix/pmix112/pmix1_client.c:

    /* store our jobid and rank */
    if (NULL != getenv(OPAL_MCA_PREFIX"orte_launch")) {
        /* if we were launched by the OMPI RTE, then
         * the jobid is in a special format - so get it */
        mca_pmix_pmix112_component.native_launch = true;
        opal_convert_string_to_jobid(&pname.jobid, my_proc.nspace);
    } else {
        /* we were launched by someone else, so make the
         * jobid just be the hash of the nspace */
        OPAL_HASH_STR(my_proc.nspace, pname.jobid);
        /* keep it from being negative */
        pname.jobid &= ~(0x8000);
    }


In this case there is no OPAL_MCA_PREFIX"orte_launch" in the environment,
so we end up using a jobid that is a hash of the namespace.

The initial error message was
[lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received 
unexpected process identifier [[9325,0],0] from [[5590,0],0]

and, with little surprise,
((9325 << 16) + 1) and ((5590 << 16) + 1) have the same hash as calculated by 
OPAL_HASH_STR.


I am now thinking the right fix is to simply
putenv(OPAL_MCA_PREFIX"orte_launch=1");
in ess/singleton.

Makes sense?
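
For reference, a minimal sketch of what that one-liner could look like in the 
singleton ess module (the helper name and the fallback #define are assumptions 
for this sketch; the actual change is the master commit / v2.x PR referenced 
elsewhere in this thread and may differ in detail):

    /* Sketch only: advertise an OMPI RTE launch before the PMIx component
     * initializes, so pmix1_client_init() takes the
     * opal_convert_string_to_jobid() branch instead of hashing the nspace. */
    #include <stdlib.h>

    #ifndef OPAL_MCA_PREFIX
    #define OPAL_MCA_PREFIX "OMPI_MCA_"   /* assumed expansion, for this sketch */
    #endif

    static int singleton_advertise_orte_launch(void)
    {
        /* putenv() keeps a pointer to the string, so it must stay alive */
        static char orte_launch[] = OPAL_MCA_PREFIX "orte_launch=1";
        return putenv(orte_launch);       /* 0 on success */
    }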


Cheers,

Gilles

On 9/15/2016 2:58 AM, Eric Chamberland wrote:

Ok,

one test segfaulted *but* I can't tell if it is the *same* bug because there 
has been a segfault:

stderr:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt

[lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 
1366255883
[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
[lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:190552] [[53310,0],0] plm:base:receive start comm
*** Error in `orted': realloc(): invalid next size: 0x01e58770 ***
...
...
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on 
the local node in file ess_singleton_module.c at line 573
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on 
the local node in file ess_singleton_module.c at line 163
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[lorien:190306] Local abort before MPI_INIT completed completed successfully, 
but am not able to aggregate error messages, and not able to guarantee that all 
other processes were killed!

stdout:

--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

orte_ess_init failed
--> Returned value Unable to start a daemon on the local node (-127) instead of 
ORTE_SUCCESS
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: ompi_rte_init failed
--> Returned "Unable to start a daemon on the local node" (-127) instead of 
"Success" (0)
--

openmpi content of $TMP:

/tmp/tmp.GoQXICeyJl> ls -la
total 1500
drwx--3 cmpbib bib 250 Sep 14 13:34 .
drwxrwxrwt  356 root   root  61440 Sep 14 13:45 ..
...
drwx-- 1848 cmpbib bib   45056 Sep 14 13:34 openmpi-sessions-40031@lorien_0

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread r...@open-mpi.org
Just in the FWIW category: the HNP used to send the singleton’s name down the 
pipe at startup, which eliminated the code line you identified. Now, we are 
pushing the name into the environment as a PMIx envar, and having the PMIx 
component pick it up. Roundabout way of getting it, and that’s what is causing 
the trouble.
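
In other words (a conceptual sketch with hypothetical function and envar names, 
not the actual Open MPI code): the old scheme handed the singleton its name over 
the pipe from the fork'ed HNP, while the new scheme exports it into the 
environment and relies on the PMIx component to read it back at init, falling 
back to hashing the nspace when the variable is missing.

    /* Conceptual sketch only -- hypothetical function and envar names. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* old scheme: the HNP wrote "jobid.vpid" down the pipe at startup */
    static int name_from_pipe(int fd, char *name, size_t len)
    {
        ssize_t n = read(fd, name, len - 1);
        if (n <= 0) {
            return -1;
        }
        name[n] = '\0';
        return 0;
    }

    /* new scheme: the launcher exports the name; the PMIx component reads it */
    static int name_from_env(char *name, size_t len)
    {
        const char *v = getenv("OMPI_SINGLETON_NAME");  /* hypothetical envar */
        if (NULL == v) {
            return -1;   /* missing -> fall back to hashing the nspace */
        }
        snprintf(name, len, "%s", v);
        return 0;
    }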

> On Sep 14, 2016, at 8:26 PM, r...@open-mpi.org wrote:
> 
> Ah...I take that back. We changed this and now we _do_ indeed go down that 
> code path. Not good.
> 
> So yes, we need that putenv so it gets the jobid from the HNP that was 
> launched, like it used to do. You want to throw that in?
> 
> Thanks
> Ralph
> 
>> On Sep 14, 2016, at 8:18 PM, r...@open-mpi.org wrote:
>> 
>> Nah, something isn’t right here. The singleton doesn’t go thru that code 
>> line, or it isn’t supposed to do so. I think the problem lies in the way the 
>> singleton in 2.x is starting up. Let me take a look at how singletons are 
>> working over there.
>> 
>>> On Sep 14, 2016, at 8:10 PM, Gilles Gouaillardet  wrote:
>>> 
>>> Ralph,
>>> 
>>> 
>>> i think i just found the root cause :-)
>>> 
>>> 
>>> from pmix1_client_init() in opal/mca/pmix/pmix112/pmix1_client.c
>>> 
>>> /* store our jobid and rank */
>>> if (NULL != getenv(OPAL_MCA_PREFIX"orte_launch")) {
>>>  /* if we were launched by the OMPI RTE, then
>>>   * the jobid is in a special format - so get it */
>>>  mca_pmix_pmix112_component.native_launch = true;
>>>  opal_convert_string_to_jobid(, my_proc.nspace);
>>>  } else {
>>>  /* we were launched by someone else, so make the
>>>   * jobid just be the hash of the nspace */
>>>  OPAL_HASH_STR(my_proc.nspace, pname.jobid);
>>>  /* keep it from being negative */
>>>  pname.jobid &= ~(0x8000);
>>>  }
>>> 
>>> 
>>> in this case, there is no OPAL_MCA_PREFIX"orte_launch" in the environment,
>>> so we end up using a jobid that is a hash of the namespace
>>> 
>>> the initial error message was
>>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received 
>>> unexpected process identifier [[9325,0],0] from [[5590,0],0]
>>> 
>>> and with little surprise
>>> ((9325 << 16) + 1) and ((5590 << 16) + 1) have the same hash as calculated 
>>> by OPAL_HASH_STR
>>> 
>>> 
>>> i am now thinking the right fix is to simply
>>> putenv(OPAL_MCA_PREFIX"orte_launch=1");
>>> in ess/singleton
>>> 
>>> makes sense ?
>>> 
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
 Ok,
 
 one test segfaulted *but* I can't tell if it is the *same* bug because 
 there has been a segfault:
 
 stderr:
 http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
  
 
 [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path 
 NULL
 [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 
 1366255883
 [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
 [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
 [lorien:190552] [[53310,0],0] plm:base:receive start comm
 *** Error in `orted': realloc(): invalid next size: 0x01e58770 ***
 ...
 ...
 [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a 
 daemon on the local node in file ess_singleton_module.c at line 573
 [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a 
 daemon on the local node in file ess_singleton_module.c at line 163
 *** An error occurred in MPI_Init_thread
 *** on a NULL communicator
 *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
 ***and potentially your MPI job)
 [lorien:190306] Local abort before MPI_INIT completed completed 
 successfully, but am not able to aggregate error messages, and not able to 
 guarantee that all other processes were killed!
 
 stdout:
 
 -- 
 It looks like orte_init failed for some reason; your parallel process is
 likely to abort.  There are many reasons that a parallel process can
 fail during orte_init; some of which are due to configuration or
 environment problems.  This failure appears to be an internal failure;
 here's some additional information (which may only be relevant to an
 Open MPI developer):
 
 orte_ess_init failed
 --> Returned value Unable to start a daemon on the local node (-127) 
 instead of ORTE_SUCCESS
 -- 
 -- 
 It looks like MPI_INIT failed for some reason; your parallel process is
 likely to abort.  There are many reasons that a parallel process can
 fail during MPI_INIT; some of which are due to configuration or environment
 problems.  This 

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread r...@open-mpi.org
Ah...I take that back. We changed this and now we _do_ indeed go down that code 
path. Not good.

So yes, we need that putenv so it gets the jobid from the HNP that was 
launched, like it used to do. You want to throw that in?

Thanks
Ralph

> On Sep 14, 2016, at 8:18 PM, r...@open-mpi.org wrote:
> 
> Nah, something isn’t right here. The singleton doesn’t go thru that code 
> line, or it isn’t supposed to do so. I think the problem lies in the way the 
> singleton in 2.x is starting up. Let me take a look at how singletons are 
> working over there.
> 
>> On Sep 14, 2016, at 8:10 PM, Gilles Gouaillardet  wrote:
>> 
>> Ralph,
>> 
>> 
>> i think i just found the root cause :-)
>> 
>> 
>> from pmix1_client_init() in opal/mca/pmix/pmix112/pmix1_client.c
>> 
>>  /* store our jobid and rank */
>>  if (NULL != getenv(OPAL_MCA_PREFIX"orte_launch")) {
>>   /* if we were launched by the OMPI RTE, then
>>* the jobid is in a special format - so get it */
>>   mca_pmix_pmix112_component.native_launch = true;
>>   opal_convert_string_to_jobid(, my_proc.nspace);
>>   } else {
>>   /* we were launched by someone else, so make the
>>* jobid just be the hash of the nspace */
>>   OPAL_HASH_STR(my_proc.nspace, pname.jobid);
>>   /* keep it from being negative */
>>   pname.jobid &= ~(0x8000);
>>   }
>> 
>> 
>> in this case, there is no OPAL_MCA_PREFIX"orte_launch" in the environment,
>> so we end up using a jobid that is a hash of the namespace
>> 
>> the initial error message was
>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received 
>> unexpected process identifier [[9325,0],0] from [[5590,0],0]
>> 
>> and with little surprise
>> ((9325 << 16) + 1) and ((5590 << 16) + 1) have the same hash as calculated 
>> by OPAL_HASH_STR
>> 
>> 
>> i am now thinking the right fix is to simply
>> putenv(OPAL_MCA_PREFIX"orte_launch=1");
>> in ess/singleton
>> 
>> makes sense ?
>> 
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
>>> Ok,
>>> 
>>> one test segfaulted *but* I can't tell if it is the *same* bug because 
>>> there has been a segfault:
>>> 
>>> stderr:
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
>>>  
>>> 
>>> [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path 
>>> NULL
>>> [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 
>>> 1366255883
>>> [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>>> [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>>> [lorien:190552] [[53310,0],0] plm:base:receive start comm
>>> *** Error in `orted': realloc(): invalid next size: 0x01e58770 ***
>>> ...
>>> ...
>>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a 
>>> daemon on the local node in file ess_singleton_module.c at line 573
>>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a 
>>> daemon on the local node in file ess_singleton_module.c at line 163
>>> *** An error occurred in MPI_Init_thread
>>> *** on a NULL communicator
>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> ***and potentially your MPI job)
>>> [lorien:190306] Local abort before MPI_INIT completed completed 
>>> successfully, but am not able to aggregate error messages, and not able to 
>>> guarantee that all other processes were killed!
>>> 
>>> stdout:
>>> 
>>> -- 
>>> It looks like orte_init failed for some reason; your parallel process is
>>> likely to abort.  There are many reasons that a parallel process can
>>> fail during orte_init; some of which are due to configuration or
>>> environment problems.  This failure appears to be an internal failure;
>>> here's some additional information (which may only be relevant to an
>>> Open MPI developer):
>>> 
>>> orte_ess_init failed
>>> --> Returned value Unable to start a daemon on the local node (-127) 
>>> instead of ORTE_SUCCESS
>>> -- 
>>> -- 
>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>> likely to abort.  There are many reasons that a parallel process can
>>> fail during MPI_INIT; some of which are due to configuration or environment
>>> problems.  This failure appears to be an internal failure; here's some
>>> additional information (which may only be relevant to an Open MPI
>>> developer):
>>> 
>>> ompi_mpi_init: ompi_rte_init failed
>>> --> Returned "Unable to start a daemon on the local node" (-127) instead of 
>>> "Success" (0)
>>> -- 
>>> 
>>> openmpi content of $TMP:
>>> 
>>> /tmp/tmp.GoQXICeyJl> ls -la
>>> total 1500
>>> drwx--3 cmpbib bib 250 Sep 14 

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread r...@open-mpi.org
Nah, something isn’t right here. The singleton doesn’t go thru that code line, 
or it isn’t supposed to do so. I think the problem lies in the way the 
singleton in 2.x is starting up. Let me take a look at how singletons are 
working over there.

> On Sep 14, 2016, at 8:10 PM, Gilles Gouaillardet  wrote:
> 
> Ralph,
> 
> 
> i think i just found the root cause :-)
> 
> 
> from pmix1_client_init() in opal/mca/pmix/pmix112/pmix1_client.c
> 
>   /* store our jobid and rank */
>   if (NULL != getenv(OPAL_MCA_PREFIX"orte_launch")) {
>/* if we were launched by the OMPI RTE, then
> * the jobid is in a special format - so get it */
>mca_pmix_pmix112_component.native_launch = true;
>opal_convert_string_to_jobid(&pname.jobid, my_proc.nspace);
>} else {
>/* we were launched by someone else, so make the
> * jobid just be the hash of the nspace */
>OPAL_HASH_STR(my_proc.nspace, pname.jobid);
>/* keep it from being negative */
>pname.jobid &= ~(0x8000);
>}
> 
> 
> in this case, there is no OPAL_MCA_PREFIX"orte_launch" in the environment,
> so we end up using a jobid that is a hash of the namespace
> 
> the initial error message was
> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received 
> unexpected process identifier [[9325,0],0] from [[5590,0],0]
> 
> and with little surprise
> ((9325 << 16) + 1) and ((5590 << 16) + 1) have the same hash as calculated by 
> OPAL_HASH_STR
> 
> 
> i am now thinking the right fix is to simply
> putenv(OPAL_MCA_PREFIX"orte_launch=1");
> in ess/singleton
> 
> makes sense ?
> 
> 
> Cheers,
> 
> Gilles
> 
> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
>> Ok,
>> 
>> one test segfaulted *but* I can't tell if it is the *same* bug because there 
>> has been a segfault:
>> 
>> stderr:
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
>>  
>> 
>> [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path 
>> NULL
>> [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 
>> 1366255883
>> [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>> [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>> [lorien:190552] [[53310,0],0] plm:base:receive start comm
>> *** Error in `orted': realloc(): invalid next size: 0x01e58770 ***
>> ...
>> ...
>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon 
>> on the local node in file ess_singleton_module.c at line 573
>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon 
>> on the local node in file ess_singleton_module.c at line 163
>> *** An error occurred in MPI_Init_thread
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>> ***and potentially your MPI job)
>> [lorien:190306] Local abort before MPI_INIT completed completed 
>> successfully, but am not able to aggregate error messages, and not able to 
>> guarantee that all other processes were killed!
>> 
>> stdout:
>> 
>> -- 
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>> 
>>  orte_ess_init failed
>>  --> Returned value Unable to start a daemon on the local node (-127) 
>> instead of ORTE_SUCCESS
>> -- 
>> -- 
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or environment
>> problems.  This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>> 
>>  ompi_mpi_init: ompi_rte_init failed
>>  --> Returned "Unable to start a daemon on the local node" (-127) instead of 
>> "Success" (0)
>> -- 
>> 
>> openmpi content of $TMP:
>> 
>> /tmp/tmp.GoQXICeyJl> ls -la
>> total 1500
>> drwx--3 cmpbib bib 250 Sep 14 13:34 .
>> drwxrwxrwt  356 root   root  61440 Sep 14 13:45 ..
>> ...
>> drwx-- 1848 cmpbib bib   45056 Sep 14 13:34 
>> openmpi-sessions-40031@lorien_0
>> srw-rw-r--1 cmpbib bib   0 Sep 14 12:24 pmix-190552
>> 
>> cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find . 
>> -type f
>> ./53310/contact.txt
>> 
>> cat 53310/contact.txt
>> 

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Gilles Gouaillardet

Ralph,


i think i just found the root cause :-)


from pmix1_client_init() in opal/mca/pmix/pmix112/pmix1_client.c

    /* store our jobid and rank */
    if (NULL != getenv(OPAL_MCA_PREFIX"orte_launch")) {
        /* if we were launched by the OMPI RTE, then
         * the jobid is in a special format - so get it */
        mca_pmix_pmix112_component.native_launch = true;
        opal_convert_string_to_jobid(&pname.jobid, my_proc.nspace);
    } else {
        /* we were launched by someone else, so make the
         * jobid just be the hash of the nspace */
        OPAL_HASH_STR(my_proc.nspace, pname.jobid);
        /* keep it from being negative */
        pname.jobid &= ~(0x8000);
    }


in this case, there is no OPAL_MCA_PREFIX"orte_launch" in the environment,
so we end up using a jobid that is a hash of the namespace

the initial error message was
[lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received 
unexpected process identifier [[9325,0],0] from [[5590,0],0]


and with little surprise
((9325 << 16) + 1) and ((5590 << 16) + 1) have the same hash as 
calculated by OPAL_HASH_STR
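
a quick standalone way to check that kind of clash is sketched below (it
assumes OPAL_HASH_STR is the usual djb2-style string hash and that the nspace
is the decimal jobid string, as the contact.txt shown elsewhere in this thread
suggests; it only tests two candidates, it is not the OPAL code):

    #include <stdint.h>
    #include <stdio.h>

    /* djb2-style hash, assumed to match OPAL_HASH_STR for illustration */
    static uint32_t hash_str(const char *str)
    {
        uint32_t h = 5381;
        for (const char *p = str; '\0' != *p; ++p) {
            h = (h << 5) + h + (unsigned char)*p;   /* h * 33 + c */
        }
        return h;
    }

    int main(void)
    {
        char a[32], b[32];
        /* nspaces for [[9325,1]] and [[5590,1]]: (jobfam << 16) + 1 */
        snprintf(a, sizeof(a), "%u", (9325u << 16) + 1u);
        snprintf(b, sizeof(b), "%u", (5590u << 16) + 1u);

        /* same masking as in the snippet above */
        uint32_t ha = hash_str(a) & ~(uint32_t)0x8000;
        uint32_t hb = hash_str(b) & ~(uint32_t)0x8000;

        printf("%s -> %u\n%s -> %u\n%s\n", a, (unsigned)ha, b, (unsigned)hb,
               ha == hb ? "=> same jobid after masking" : "=> different jobids");
        return 0;
    }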



i am now thinking the right fix is to simply
putenv(OPAL_MCA_PREFIX"orte_launch=1");
in ess/singleton

makes sense ?


Cheers,

Gilles

On 9/15/2016 2:58 AM, Eric Chamberland wrote:

Ok,

one test segfaulted *but* I can't tell if it is the *same* bug because 
there has been a segfault:


stderr:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt 



[lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh 
path NULL
[lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename 
hash 1366255883

[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
[lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:190552] [[53310,0],0] plm:base:receive start comm
*** Error in `orted': realloc(): invalid next size: 0x01e58770 
***

...
...
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a 
daemon on the local node in file ess_singleton_module.c at line 573
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a 
daemon on the local node in file ess_singleton_module.c at line 163

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[lorien:190306] Local abort before MPI_INIT completed completed 
successfully, but am not able to aggregate error messages, and not 
able to guarantee that all other processes were killed!


stdout:

-- 


It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127) 
instead of ORTE_SUCCESS
-- 

-- 


It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or 
environment

problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127) 
instead of "Success" (0)
-- 



openmpi content of $TMP:

/tmp/tmp.GoQXICeyJl> ls -la
total 1500
drwx--3 cmpbib bib 250 Sep 14 13:34 .
drwxrwxrwt  356 root   root  61440 Sep 14 13:45 ..
...
drwx-- 1848 cmpbib bib   45056 Sep 14 13:34 
openmpi-sessions-40031@lorien_0

srw-rw-r--1 cmpbib bib   0 Sep 14 12:24 pmix-190552

cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> 
find . -type f

./53310/contact.txt

cat 53310/contact.txt
3493724160.0;usock;tcp://132.203.7.36:54605
190552

egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr*|grep 53310
dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552] 
plm:base:set_hnp_name: final jobfam 53310


(this is the faulty test)
full egrep:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt 



config.log:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log 



ompi_info:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt 



Maybe it aborted (instead of giving the other message) while doing the 
error because of export 

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Eric Chamberland

Ok,

one test segfaulted *but* I can't tell if it is the *same* bug because 
there has been a segfault:


stderr:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt

[lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh 
path NULL
[lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 
1366255883

[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
[lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:190552] [[53310,0],0] plm:base:receive start comm
*** Error in `orted': realloc(): invalid next size: 0x01e58770 ***
...
...
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a 
daemon on the local node in file ess_singleton_module.c at line 573
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a 
daemon on the local node in file ess_singleton_module.c at line 163

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***and potentially your MPI job)
[lorien:190306] Local abort before MPI_INIT completed completed 
successfully, but am not able to aggregate error messages, and not able 
to guarantee that all other processes were killed!


stdout:

--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127) 
instead of ORTE_SUCCESS

--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127) 
instead of "Success" (0)

--

openmpi content of $TMP:

/tmp/tmp.GoQXICeyJl> ls -la
total 1500
drwx--3 cmpbib bib 250 Sep 14 13:34 .
drwxrwxrwt  356 root   root  61440 Sep 14 13:45 ..
...
drwx-- 1848 cmpbib bib   45056 Sep 14 13:34 
openmpi-sessions-40031@lorien_0

srw-rw-r--1 cmpbib bib   0 Sep 14 12:24 pmix-190552

cmpbib@lorien:/tmp/tmp.GoQXICeyJl/openmpi-sessions-40031@lorien_0> find 
. -type f

./53310/contact.txt

cat 53310/contact.txt
3493724160.0;usock;tcp://132.203.7.36:54605
190552

egrep 'jobfam|stop' */*/Cerr* ../BIBTV/*/*/*/Cerr*|grep 53310
dev/Test.FonctionsSUPG/Cerr.Triangle.h_cte_1.txt:[lorien:190552] 
plm:base:set_hnp_name: final jobfam 53310


(this is the faulty test)
full egrep:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.egrep.txt

config.log:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_config.log

ompi_info:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s_ompi_info_all.txt

Maybe it aborted (instead of giving the other error message) while reporting 
the error, because of export OMPI_MCA_plm_base_verbose=5?


Thanks,

Eric


On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:

Eric,

do you mean you have a unique $TMP per a.out ?
or a unique $TMP per "batch" of run ?

in the first case, my understanding is that conflicts cannot happen ...

once you hit the bug, can you please please post the output of the
failed a.out,
and run
egrep 'jobfam|stop'
on all your logs, so we might spot a conflict

Cheers,

Gilles

On Wednesday, September 14, 2016, Eric Chamberland
> wrote:

Lucky!

Since each runs have a specific TMP, I still have it on disc.

for the faulty run, the TMP variable was:

TMP=/tmp/tmp.wOv5dkNaSI

and into $TMP I have:

openmpi-sessions-40031@lorien_0

and into this subdirectory I have a bunch of empty dirs:

cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0>
ls -la |wc -l
1841

cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0>
ls -la |more
total 68
drwx-- 1840 cmpbib bib 45056 Sep 13 03:49 .
drwx--3 cmpbib bib   231 Sep 13 03:50 ..
drwx--2 cmpbib bib 6 Sep 13 02:10 10015
drwx--2 cmpbib bib 6 Sep 13 03:05 10049
drwx--2 cmpbib bib 6 Sep 13 

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread r...@open-mpi.org
Many things are possible, given infinite time :-)

The issue with this notion lies in direct launch scenarios - i.e., when procs 
are launched directly by the RM and not via mpirun. In this case, there is 
nobody who can give us the session directory (well, until PMIx becomes 
universal), and so the apps must be able to generate a name that they all can 
know. Otherwise, we lose shared memory support because they can’t rendezvous.

However, that doesn’t seem to be the root problem here. I suspect there is a 
bug in the code that spawns the orted from the singleton, and subsequently 
parses the returned connection info. If you look at the error, you’ll see that 
both jobids have “zero” for their local jobid. This means that the two procs 
attempting to communicate both think they are daemons, which is impossible in 
this scenario.
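
Roughly, a name like [[9325,0],0] decodes as below (illustrative helpers only,
not the actual ORTE macros):

    #include <stdint.h>
    #include <stdio.h>

    /* a jobid such as [[9325,0]] packs a 16-bit job family (upper bits)
     * and a 16-bit local jobid (lower bits); daemons/HNPs carry local
     * jobid 0, application procs carry 1, 2, ... */
    static uint16_t job_family(uint32_t jobid)  { return (uint16_t)(jobid >> 16); }
    static uint16_t local_jobid(uint32_t jobid) { return (uint16_t)(jobid & 0xffffu); }

    int main(void)
    {
        const uint32_t ids[] = { (9325u << 16) | 0u, (5590u << 16) | 0u, (9325u << 16) | 1u };
        for (size_t i = 0; i < sizeof(ids) / sizeof(ids[0]); ++i) {
            printf("[[%u,%u]] -> %s\n",
                   (unsigned)job_family(ids[i]), (unsigned)local_jobid(ids[i]),
                   0 == local_jobid(ids[i]) ? "daemon/HNP" : "application process");
        }
        return 0;
    }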

So something garbled the string that the orted returns on startup to the 
singleton, and/or the singleton is parsing it incorrectly. IIRC, the singleton 
gets its name from that string, and so I expect it is getting the wrong name - 
and hence the error.

As you may recall, you made a change a little while back where we modified the 
code in ess/singleton to be a little less strict in its checking of that 
returned string. I wonder if that is biting us here? It wouldn’t fix the 
problem, but might generate a different error at a more obvious place.


> On Sep 14, 2016, at 8:00 AM, Gilles Gouaillardet 
>  wrote:
> 
> Ralph,
> 
> is there any reason to use a session directory based on the jobid (or job 
> family) ?
> I mean, could we use mkstemp to generate a unique directory, and then 
> propagate the path via orted comm or the environment ?
> 
> Cheers,
> 
> Gilles
> 
> On Wednesday, September 14, 2016, r...@open-mpi.org 
>  > 
> wrote:
> This has nothing to do with PMIx, Josh - the error is coming out of the usock 
> OOB component.
> 
> 
>> On Sep 14, 2016, at 7:17 AM, Joshua Ladd > > wrote:
>> 
>> Eric,
>> 
>> We are looking into the PMIx code path that sets up the jobid. The session 
>> directories are created based on the jobid. It might be the case that the 
>> jobids (generated with rand) happen to be the same for different jobs 
>> resulting in multiple jobs sharing the same session directory, but we need 
>> to check. We will update.
>> 
>> Josh
>> 
>> On Wed, Sep 14, 2016 at 9:33 AM, Eric Chamberland 
>> > > wrote:
>> Lucky!
>> 
>> Since each runs have a specific TMP, I still have it on disc.
>> 
>> for the faulty run, the TMP variable was:
>> 
>> TMP=/tmp/tmp.wOv5dkNaSI
>> 
>> and into $TMP I have:
>> 
>> openmpi-sessions-40031@lorien_0
>> 
>> and into this subdirectory I have a bunch of empty dirs:
>> 
>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la 
>> |wc -l
>> 1841
>> 
>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la 
>> |more
>> total 68
>> drwx-- 1840 cmpbib bib 45056 Sep 13 03:49 .
>> drwx--3 cmpbib bib   231 Sep 13 03:50 ..
>> drwx--2 cmpbib bib 6 Sep 13 02:10 10015
>> drwx--2 cmpbib bib 6 Sep 13 03:05 10049
>> drwx--2 cmpbib bib 6 Sep 13 03:15 10052
>> drwx--2 cmpbib bib 6 Sep 13 02:22 10059
>> drwx--2 cmpbib bib 6 Sep 13 02:22 10110
>> drwx--2 cmpbib bib 6 Sep 13 02:41 10114
>> ...
>> 
>> If I do:
>> 
>> lsof |grep "openmpi-sessions-40031"
>> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>>   Output information may be incomplete.
>> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>>   Output information may be incomplete.
>> 
>> nothing...
>> 
>> What else may I check?
>> 
>> Eric
>> 
>> 
>> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>> Hi, Eric
>> 
>> I **think** this might be related to the following:
>> 
>> https://github.com/pmix/master/pull/145 
>> 
>> 
>> I'm wondering if you can look into the /tmp directory and see if you
>> have a bunch of stale usock files.
>> 
>> Best,
>> 
>> Josh
>> 
>> 
>> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet > 
>> > >> wrote:
>> 
>> Eric,
>> 
>> 
>> can you please provide more information on how your tests are launched ?
>> 
>> do you
>> 
>> mpirun -np 1 ./a.out
>> 
>> or do you simply
>> 
>> ./a.out
>> 
>> 
>> do you use a batch manager ? if yes, which one ?
>> 
>> do you run one test per job ? or multiple tests per job ?
>> 
>> how are these tests launched ?
>> 
>> 
>> do the test that crashes use 

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Gilles Gouaillardet
Ralph,

is there any reason to use a session directory based on the jobid (or job
family) ?
I mean, could we use mkstemp to generate a unique directory, and then
propagate the path via orted comm or the environment ?
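
something along these lines, i.e. a sketch only (mkdtemp(3) rather than
mkstemp, since a directory is needed, and the environment variable name is
made up for illustration):

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const char *tmp = getenv("TMPDIR");
        if (NULL == tmp) {
            tmp = "/tmp";
        }

        /* unique, race-free per-job directory instead of one named after the jobid */
        char path[4096];
        snprintf(path, sizeof(path), "%s/ompi-session-XXXXXX", tmp);
        if (NULL == mkdtemp(path)) {
            perror("mkdtemp");
            return 1;
        }

        /* hand the path to children via the environment (variable name is hypothetical) */
        setenv("OMPI_EXAMPLE_SESSION_DIR", path, 1);
        printf("session dir: %s\n", path);
        return 0;
    }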

Cheers,

Gilles

On Wednesday, September 14, 2016, r...@open-mpi.org  wrote:

> This has nothing to do with PMIx, Josh - the error is coming out of the
> usock OOB component.
>
>
> On Sep 14, 2016, at 7:17 AM, Joshua Ladd  > wrote:
>
> Eric,
>
> We are looking into the PMIx code path that sets up the jobid. The session
> directories are created based on the jobid. It might be the case that the
> jobids (generated with rand) happen to be the same for different jobs
> resulting in multiple jobs sharing the same session directory, but we need
> to check. We will update.
>
> Josh
>
> On Wed, Sep 14, 2016 at 9:33 AM, Eric Chamberland  ulaval.ca
> > wrote:
>
>> Lucky!
>>
>> Since each runs have a specific TMP, I still have it on disc.
>>
>> for the faulty run, the TMP variable was:
>>
>> TMP=/tmp/tmp.wOv5dkNaSI
>>
>> and into $TMP I have:
>>
>> openmpi-sessions-40031@lorien_0
>>
>> and into this subdirectory I have a bunch of empty dirs:
>>
>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls
>> -la |wc -l
>> 1841
>>
>> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls
>> -la |more
>> total 68
>> drwx-- 1840 cmpbib bib 45056 Sep 13 03:49 .
>> drwx--3 cmpbib bib   231 Sep 13 03:50 ..
>> drwx--2 cmpbib bib 6 Sep 13 02:10 10015
>> drwx--2 cmpbib bib 6 Sep 13 03:05 10049
>> drwx--2 cmpbib bib 6 Sep 13 03:15 10052
>> drwx--2 cmpbib bib 6 Sep 13 02:22 10059
>> drwx--2 cmpbib bib 6 Sep 13 02:22 10110
>> drwx--2 cmpbib bib 6 Sep 13 02:41 10114
>> ...
>>
>> If I do:
>>
>> lsof |grep "openmpi-sessions-40031"
>> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system
>> /run/user/1000/gvfs
>>   Output information may be incomplete.
>> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>>   Output information may be incomplete.
>>
>> nothing...
>>
>> What else may I check?
>>
>> Eric
>>
>>
>> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>>
>>> Hi, Eric
>>>
>>> I **think** this might be related to the following:
>>>
>>> https://github.com/pmix/master/pull/145
>>>
>>> I'm wondering if you can look into the /tmp directory and see if you
>>> have a bunch of stale usock files.
>>>
>>> Best,
>>>
>>> Josh
>>>
>>>
>>> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet >> 
>>> >> >> wrote:
>>>
>>> Eric,
>>>
>>>
>>> can you please provide more information on how your tests are
>>> launched ?
>>>
>>> do you
>>>
>>> mpirun -np 1 ./a.out
>>>
>>> or do you simply
>>>
>>> ./a.out
>>>
>>>
>>> do you use a batch manager ? if yes, which one ?
>>>
>>> do you run one test per job ? or multiple tests per job ?
>>>
>>> how are these tests launched ?
>>>
>>>
>>> do the test that crashes use MPI_Comm_spawn ?
>>>
>>> i am surprised by the process name [[9325,5754],0], which suggests
>>> there MPI_Comm_spawn was called 5753 times (!)
>>>
>>>
>>> can you also run
>>>
>>> hostname
>>>
>>> on the 'lorien' host ?
>>>
>>> if you configure'd Open MPI with --enable-debug, can you
>>>
>>> export OMPI_MCA_plm_base_verbose 5
>>>
>>> then run one test and post the logs ?
>>>
>>>
>>> from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
>>> produce job family 5576 (but you get 9325)
>>>
>>> the discrepancy could be explained by the use of a batch manager
>>> and/or a full hostname i am unaware of.
>>>
>>>
>>> orte_plm_base_set_hnp_name() generate a 16 bits job family from the
>>> (32 bits hash of the) hostname and the mpirun (32 bits ?) pid.
>>>
>>> so strictly speaking, it is possible two jobs launched on the same
>>> node are assigned the same 16 bits job family.
>>>
>>>
>>> the easiest way to detect this could be to
>>>
>>> - edit orte/mca/plm/base/plm_base_jobid.c
>>>
>>> and replace
>>>
>>> OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framew
>>> ork_output,
>>>  "plm:base:set_hnp_name: final jobfam %lu",
>>>  (unsigned long)jobfam));
>>>
>>> with
>>>
>>> OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framew
>>> ork_output,
>>>  "plm:base:set_hnp_name: final jobfam %lu",
>>>  (unsigned long)jobfam));
>>>
>>> configure Open MPI with --enable-debug and rebuild
>>>
>>> and then
>>>
>>> export 

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Eric Chamberland



On 14/09/16 10:27 AM, Gilles Gouaillardet wrote:

Eric,

do you mean you have a unique $TMP per a.out ?


No


or a unique $TMP per "batch" of run ?


Yes.

I was happy because each nightly batch has its own TMP, so I can check 
afterward for problems related to a specific night without interference 
from another nightly batch of tests... if a bug ever happens... ;)




in the first case, my understanding is that conflicts cannot happen ...

once you hit the bug, can you please please post the output of the
failed a.out,
and run
egrep 'jobfam|stop'
on all your logs, so we might spot a conflict



ok, I will launch it manually later today, but it will be automatic 
tonight (with export OMPI_MCA_plm_base_verbose=5).


Thanks!

Eric



Cheers,

Gilles

On Wednesday, September 14, 2016, Eric Chamberland
> wrote:

Lucky!

Since each runs have a specific TMP, I still have it on disc.

for the faulty run, the TMP variable was:

TMP=/tmp/tmp.wOv5dkNaSI

and into $TMP I have:

openmpi-sessions-40031@lorien_0

and into this subdirectory I have a bunch of empty dirs:

cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0>
ls -la |wc -l
1841

cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0>
ls -la |more
total 68
drwx-- 1840 cmpbib bib 45056 Sep 13 03:49 .
drwx--3 cmpbib bib   231 Sep 13 03:50 ..
drwx--2 cmpbib bib 6 Sep 13 02:10 10015
drwx--2 cmpbib bib 6 Sep 13 03:05 10049
drwx--2 cmpbib bib 6 Sep 13 03:15 10052
drwx--2 cmpbib bib 6 Sep 13 02:22 10059
drwx--2 cmpbib bib 6 Sep 13 02:22 10110
drwx--2 cmpbib bib 6 Sep 13 02:41 10114
...

If I do:

lsof |grep "openmpi-sessions-40031"
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system
/run/user/1000/gvfs
  Output information may be incomplete.
lsof: WARNING: can't stat() tracefs file system
/sys/kernel/debug/tracing
  Output information may be incomplete.

nothing...

What else may I check?

Eric


On 14/09/16 08:47 AM, Joshua Ladd wrote:

Hi, Eric

I **think** this might be related to the following:

https://github.com/pmix/master/pull/145


I'm wondering if you can look into the /tmp directory and see if you
have a bunch of stale usock files.

Best,

Josh


On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet
> wrote:

Eric,


can you please provide more information on how your tests
are launched ?

do you

mpirun -np 1 ./a.out

or do you simply

./a.out


do you use a batch manager ? if yes, which one ?

do you run one test per job ? or multiple tests per job ?

how are these tests launched ?


do the test that crashes use MPI_Comm_spawn ?

i am surprised by the process name [[9325,5754],0], which
suggests
there MPI_Comm_spawn was called 5753 times (!)


can you also run

hostname

on the 'lorien' host ?

if you configure'd Open MPI with --enable-debug, can you

export OMPI_MCA_plm_base_verbose 5

then run one test and post the logs ?


from orte_plm_base_set_hnp_name(), "lorien" and pid 142766
should
produce job family 5576 (but you get 9325)

the discrepancy could be explained by the use of a batch manager
and/or a full hostname i am unaware of.


orte_plm_base_set_hnp_name() generate a 16 bits job family
from the
(32 bits hash of the) hostname and the mpirun (32 bits ?) pid.

so strictly speaking, it is possible two jobs launched on
the same
node are assigned the same 16 bits job family.


the easiest way to detect this could be to

- edit orte/mca/plm/base/plm_base_jobid.c

and replace

OPAL_OUTPUT_VERBOSE((5,
orte_plm_base_framework.framework_output,
 "plm:base:set_hnp_name: final
jobfam %lu",
 (unsigned long)jobfam));

with

OPAL_OUTPUT_VERBOSE((4,
orte_plm_base_framework.framework_output,
 "plm:base:set_hnp_name: final
jobfam %lu",
 (unsigned long)jobfam));

configure Open MPI with --enable-debug and rebuild

and then

export OMPI_MCA_plm_base_verbose=4

and run your tests.


when the problem occurs, 

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Gilles Gouaillardet
Eric,

do you mean you have a unique $TMP per a.out ?
or a unique $TMP per "batch" of run ?

in the first case, my understanding is that conflicts cannot happen ...

once you hit the bug, can you please please post the output of the failed
a.out,
and run
egrep 'jobfam|stop'
on all your logs, so we might spot a conflict

Cheers,

Gilles

On Wednesday, September 14, 2016, Eric Chamberland <
eric.chamberl...@giref.ulaval.ca> wrote:

> Lucky!
>
> Since each runs have a specific TMP, I still have it on disc.
>
> for the faulty run, the TMP variable was:
>
> TMP=/tmp/tmp.wOv5dkNaSI
>
> and into $TMP I have:
>
> openmpi-sessions-40031@lorien_0
>
> and into this subdirectory I have a bunch of empty dirs:
>
> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la
> |wc -l
> 1841
>
> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la
> |more
> total 68
> drwx-- 1840 cmpbib bib 45056 Sep 13 03:49 .
> drwx--3 cmpbib bib   231 Sep 13 03:50 ..
> drwx--2 cmpbib bib 6 Sep 13 02:10 10015
> drwx--2 cmpbib bib 6 Sep 13 03:05 10049
> drwx--2 cmpbib bib 6 Sep 13 03:15 10052
> drwx--2 cmpbib bib 6 Sep 13 02:22 10059
> drwx--2 cmpbib bib 6 Sep 13 02:22 10110
> drwx--2 cmpbib bib 6 Sep 13 02:41 10114
> ...
>
> If I do:
>
> lsof |grep "openmpi-sessions-40031"
> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>   Output information may be incomplete.
> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>   Output information may be incomplete.
>
> nothing...
>
> What else may I check?
>
> Eric
>
>
> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>
>> Hi, Eric
>>
>> I **think** this might be related to the following:
>>
>> https://github.com/pmix/master/pull/145
>>
>> I'm wondering if you can look into the /tmp directory and see if you
>> have a bunch of stale usock files.
>>
>> Best,
>>
>> Josh
>>
>>
>> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet > > wrote:
>>
>> Eric,
>>
>>
>> can you please provide more information on how your tests are
>> launched ?
>>
>> do you
>>
>> mpirun -np 1 ./a.out
>>
>> or do you simply
>>
>> ./a.out
>>
>>
>> do you use a batch manager ? if yes, which one ?
>>
>> do you run one test per job ? or multiple tests per job ?
>>
>> how are these tests launched ?
>>
>>
>> do the test that crashes use MPI_Comm_spawn ?
>>
>> i am surprised by the process name [[9325,5754],0], which suggests
>> there MPI_Comm_spawn was called 5753 times (!)
>>
>>
>> can you also run
>>
>> hostname
>>
>> on the 'lorien' host ?
>>
>> if you configure'd Open MPI with --enable-debug, can you
>>
>> export OMPI_MCA_plm_base_verbose 5
>>
>> then run one test and post the logs ?
>>
>>
>> from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
>> produce job family 5576 (but you get 9325)
>>
>> the discrepancy could be explained by the use of a batch manager
>> and/or a full hostname i am unaware of.
>>
>>
>> orte_plm_base_set_hnp_name() generate a 16 bits job family from the
>> (32 bits hash of the) hostname and the mpirun (32 bits ?) pid.
>>
>> so strictly speaking, it is possible two jobs launched on the same
>> node are assigned the same 16 bits job family.
>>
>>
>> the easiest way to detect this could be to
>>
>> - edit orte/mca/plm/base/plm_base_jobid.c
>>
>> and replace
>>
>> OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>  "plm:base:set_hnp_name: final jobfam %lu",
>>  (unsigned long)jobfam));
>>
>> with
>>
>> OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>  "plm:base:set_hnp_name: final jobfam %lu",
>>  (unsigned long)jobfam));
>>
>> configure Open MPI with --enable-debug and rebuild
>>
>> and then
>>
>> export OMPI_MCA_plm_base_verbose=4
>>
>> and run your tests.
>>
>>
>> when the problem occurs, you will be able to check which pids
>> produced the faulty jobfam, and that could hint to a conflict.
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
>>
>>
>> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>
>> Hi,
>>
>> It is the third time this happened into the last 10 days.
>>
>> While running nighlty tests (~2200), we have one or two tests
>> that fails at the very beginning with this strange error:
>>
>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack:
>> received unexpected process identifier [[9325,0],0] from
>> [[5590,0],0]
>>
>> But I can't reproduce the problem right now... ie: If I launch
>> this test alone "by hand", it is successful... the same test was
>> successful yesterday...
>>

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread r...@open-mpi.org
This has nothing to do with PMIx, Josh - the error is coming out of the usock 
OOB component.


> On Sep 14, 2016, at 7:17 AM, Joshua Ladd  wrote:
> 
> Eric,
> 
> We are looking into the PMIx code path that sets up the jobid. The session 
> directories are created based on the jobid. It might be the case that the 
> jobids (generated with rand) happen to be the same for different jobs 
> resulting in multiple jobs sharing the same session directory, but we need to 
> check. We will update.
> 
> Josh
> 
> On Wed, Sep 14, 2016 at 9:33 AM, Eric Chamberland 
> > 
> wrote:
> Lucky!
> 
> Since each runs have a specific TMP, I still have it on disc.
> 
> for the faulty run, the TMP variable was:
> 
> TMP=/tmp/tmp.wOv5dkNaSI
> 
> and into $TMP I have:
> 
> openmpi-sessions-40031@lorien_0
> 
> and into this subdirectory I have a bunch of empty dirs:
> 
> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la |wc 
> -l
> 1841
> 
> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la 
> |more
> total 68
> drwx-- 1840 cmpbib bib 45056 Sep 13 03:49 .
> drwx--3 cmpbib bib   231 Sep 13 03:50 ..
> drwx--2 cmpbib bib 6 Sep 13 02:10 10015
> drwx--2 cmpbib bib 6 Sep 13 03:05 10049
> drwx--2 cmpbib bib 6 Sep 13 03:15 10052
> drwx--2 cmpbib bib 6 Sep 13 02:22 10059
> drwx--2 cmpbib bib 6 Sep 13 02:22 10110
> drwx--2 cmpbib bib 6 Sep 13 02:41 10114
> ...
> 
> If I do:
> 
> lsof |grep "openmpi-sessions-40031"
> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>   Output information may be incomplete.
> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>   Output information may be incomplete.
> 
> nothing...
> 
> What else may I check?
> 
> Eric
> 
> 
> On 14/09/16 08:47 AM, Joshua Ladd wrote:
> Hi, Eric
> 
> I **think** this might be related to the following:
> 
> https://github.com/pmix/master/pull/145 
> 
> 
> I'm wondering if you can look into the /tmp directory and see if you
> have a bunch of stale usock files.
> 
> Best,
> 
> Josh
> 
> 
> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet  
> >> wrote:
> 
> Eric,
> 
> 
> can you please provide more information on how your tests are launched ?
> 
> do you
> 
> mpirun -np 1 ./a.out
> 
> or do you simply
> 
> ./a.out
> 
> 
> do you use a batch manager ? if yes, which one ?
> 
> do you run one test per job ? or multiple tests per job ?
> 
> how are these tests launched ?
> 
> 
> do the test that crashes use MPI_Comm_spawn ?
> 
> i am surprised by the process name [[9325,5754],0], which suggests
> there MPI_Comm_spawn was called 5753 times (!)
> 
> 
> can you also run
> 
> hostname
> 
> on the 'lorien' host ?
> 
> if you configure'd Open MPI with --enable-debug, can you
> 
> export OMPI_MCA_plm_base_verbose 5
> 
> then run one test and post the logs ?
> 
> 
> from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
> produce job family 5576 (but you get 9325)
> 
> the discrepancy could be explained by the use of a batch manager
> and/or a full hostname i am unaware of.
> 
> 
> orte_plm_base_set_hnp_name() generate a 16 bits job family from the
> (32 bits hash of the) hostname and the mpirun (32 bits ?) pid.
> 
> so strictly speaking, it is possible two jobs launched on the same
> node are assigned the same 16 bits job family.
> 
> 
> the easiest way to detect this could be to
> 
> - edit orte/mca/plm/base/plm_base_jobid.c
> 
> and replace
> 
> OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>  "plm:base:set_hnp_name: final jobfam %lu",
>  (unsigned long)jobfam));
> 
> with
> 
> OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>  "plm:base:set_hnp_name: final jobfam %lu",
>  (unsigned long)jobfam));
> 
> configure Open MPI with --enable-debug and rebuild
> 
> and then
> 
> export OMPI_MCA_plm_base_verbose=4
> 
> and run your tests.
> 
> 
> when the problem occurs, you will be able to check which pids
> produced the faulty jobfam, and that could hint to a conflict.
> 
> 
> Cheers,
> 
> 
> Gilles
> 
> 
> 
> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
> 
> Hi,
> 
> It is the third time this happened into the last 10 days.
> 
> While running nighlty tests (~2200), we have one or two tests
> that fails at the very beginning with this strange error:
> 
> [lorien:142766] [[9325,5754],0] 

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Joshua Ladd
Eric,

We are looking into the PMIx code path that sets up the jobid. The session
directories are created based on the jobid. It might be the case that the
jobids (generated with rand) happen to be the same for different jobs
resulting in multiple jobs sharing the same session directory, but we need
to check. We will update.
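
Based on the layout Eric posted (e.g. /tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0/10015),
the per-job part of the path is keyed only on the job family, so equal jobids
mean equal paths; a sketch of that derivation (not the real session-dir code):

    #include <stdint.h>
    #include <stdio.h>

    /* builds the per-job session path the way the listings in this thread
     * suggest: $TMP/openmpi-sessions-<uid>@<nodename>_0/<jobfam>
     * (illustrative only, not the actual ORTE implementation) */
    static void job_session_dir(char *buf, size_t len, const char *tmp,
                                unsigned uid, const char *nodename, uint16_t jobfam)
    {
        snprintf(buf, len, "%s/openmpi-sessions-%u@%s_0/%u",
                 tmp, uid, nodename, (unsigned)jobfam);
    }

    int main(void)
    {
        char a[4096], b[4096];
        /* two different jobs that happen to draw the same 16-bit job family */
        job_session_dir(a, sizeof(a), "/tmp/tmp.wOv5dkNaSI", 40031, "lorien", 9325);
        job_session_dir(b, sizeof(b), "/tmp/tmp.wOv5dkNaSI", 40031, "lorien", 9325);
        printf("%s\n%s\n-> identical, so both jobs rendezvous on the same files\n", a, b);
        return 0;
    }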

Josh

On Wed, Sep 14, 2016 at 9:33 AM, Eric Chamberland <
eric.chamberl...@giref.ulaval.ca> wrote:

> Lucky!
>
> Since each runs have a specific TMP, I still have it on disc.
>
> for the faulty run, the TMP variable was:
>
> TMP=/tmp/tmp.wOv5dkNaSI
>
> and into $TMP I have:
>
> openmpi-sessions-40031@lorien_0
>
> and into this subdirectory I have a bunch of empty dirs:
>
> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la
> |wc -l
> 1841
>
> cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls -la
> |more
> total 68
> drwx-- 1840 cmpbib bib 45056 Sep 13 03:49 .
> drwx--3 cmpbib bib   231 Sep 13 03:50 ..
> drwx--2 cmpbib bib 6 Sep 13 02:10 10015
> drwx--2 cmpbib bib 6 Sep 13 03:05 10049
> drwx--2 cmpbib bib 6 Sep 13 03:15 10052
> drwx--2 cmpbib bib 6 Sep 13 02:22 10059
> drwx--2 cmpbib bib 6 Sep 13 02:22 10110
> drwx--2 cmpbib bib 6 Sep 13 02:41 10114
> ...
>
> If I do:
>
> lsof |grep "openmpi-sessions-40031"
> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
>   Output information may be incomplete.
> lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
>   Output information may be incomplete.
>
> nothing...
>
> What else may I check?
>
> Eric
>
>
> On 14/09/16 08:47 AM, Joshua Ladd wrote:
>
>> Hi, Eric
>>
>> I **think** this might be related to the following:
>>
>> https://github.com/pmix/master/pull/145
>>
>> I'm wondering if you can look into the /tmp directory and see if you
>> have a bunch of stale usock files.
>>
>> Best,
>>
>> Josh
>>
>>
>> On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet > > wrote:
>>
>> Eric,
>>
>>
>> can you please provide more information on how your tests are
>> launched ?
>>
>> do you
>>
>> mpirun -np 1 ./a.out
>>
>> or do you simply
>>
>> ./a.out
>>
>>
>> do you use a batch manager ? if yes, which one ?
>>
>> do you run one test per job ? or multiple tests per job ?
>>
>> how are these tests launched ?
>>
>>
>> do the test that crashes use MPI_Comm_spawn ?
>>
>> i am surprised by the process name [[9325,5754],0], which suggests
>> there MPI_Comm_spawn was called 5753 times (!)
>>
>>
>> can you also run
>>
>> hostname
>>
>> on the 'lorien' host ?
>>
>> if you configure'd Open MPI with --enable-debug, can you
>>
>> export OMPI_MCA_plm_base_verbose 5
>>
>> then run one test and post the logs ?
>>
>>
>> from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
>> produce job family 5576 (but you get 9325)
>>
>> the discrepancy could be explained by the use of a batch manager
>> and/or a full hostname i am unaware of.
>>
>>
>> orte_plm_base_set_hnp_name() generate a 16 bits job family from the
>> (32 bits hash of the) hostname and the mpirun (32 bits ?) pid.
>>
>> so strictly speaking, it is possible two jobs launched on the same
>> node are assigned the same 16 bits job family.
>>
>>
>> the easiest way to detect this could be to
>>
>> - edit orte/mca/plm/base/plm_base_jobid.c
>>
>> and replace
>>
>> OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>  "plm:base:set_hnp_name: final jobfam %lu",
>>  (unsigned long)jobfam));
>>
>> with
>>
>> OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>  "plm:base:set_hnp_name: final jobfam %lu",
>>  (unsigned long)jobfam));
>>
>> configure Open MPI with --enable-debug and rebuild
>>
>> and then
>>
>> export OMPI_MCA_plm_base_verbose=4
>>
>> and run your tests.
>>
>>
>> when the problem occurs, you will be able to check which pids
>> produced the faulty jobfam, and that could hint to a conflict.
>>
>>
>> Cheers,
>>
>>
>> Gilles
>>
>>
>>
>> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>
>> Hi,
>>
>> It is the third time this happened into the last 10 days.
>>
>> While running nighlty tests (~2200), we have one or two tests
>> that fails at the very beginning with this strange error:
>>
>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack:
>> received unexpected process identifier [[9325,0],0] from
>> [[5590,0],0]
>>
>> But I can't reproduce the problem right now... ie: If I launch
>> this test alone "by hand", it is successful... the same test was
>> successful yesterday...
>>
>> 

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Gilles Gouaillardet
Thanks Eric,

the goal of the patch is simply not to output info that is not needed (by
both orted and a.out)
/* since you ./a.out, an orted is forked under the hood */
so the patch is really optional, though convenient.


Cheers,

Gilles

On Wednesday, September 14, 2016, Eric Chamberland <
eric.chamberl...@giref.ulaval.ca> wrote:

>
>
> On 14/09/16 01:36 AM, Gilles Gouaillardet wrote:
>
>> Eric,
>>
>>
>> can you please provide more information on how your tests are launched ?
>>
>>
> Yes!
>
> do you
>>
>> mpirun -np 1 ./a.out
>>
>> or do you simply
>>
>> ./a.out
>>
>>
> For all sequential tests, we do ./a.out.
>
>
>> do you use a batch manager ? if yes, which one ?
>>
>
> No.
>
>
>> do you run one test per job ? or multiple tests per job ?
>>
>
> On this automatic compilation, up to 16 tests are launched together.
>
>
>> how are these tests launched ?
>>
> For sequential ones, the special thing is that they are launched via
> python Popen call, which launches "time" which launches the code.
>
> So the "full" commande line is:
>
> /usr/bin/time -v -o /users/cmpbib/compilations/lor
> ien/linux_dernier_ompi_leap/TV2016-09-14_03h03m15sEDT/opt/
> Test.Laplacien/Time.Laplacien3D.Dirichlet.mixte_tetra_prismetri.scalhier.txt
> /pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.Laplacien.opt
> mpi_v=2 verbose=True Beowulf=False outilMassif=False outilPerfRecord=False
> verifValgrind=False outilPerfStat=False outilCallgrind=False
> RepertoireDestination=/users/cmpbib/compilations/lorien/linu
> x_dernier_ompi_leap/TV2016-09-14_03h03m15sEDT/opt/Test.Laplacien
> RepertoireTest=/pmi/cmpbib/compilation_BIB_dernier_ompi/COMP
> ILE_AUTO/TestValidation/Ressources/opt/Test.Laplacien
> Prefixe=Laplacien3D.Dirichlet.mixte_tetra_prismetri.scalhier
>
>
>
>>
>> do the test that crashes use MPI_Comm_spawn ?
>>
>> i am surprised by the process name [[9325,5754],0], which suggests there
>> MPI_Comm_spawn was called 5753 times (!)
>>
>>
>> can you also run
>>
>> hostname
>>
>> on the 'lorien' host ?
>>
>>
> [eric@lorien] Scripts (master $ u+1)> hostname
> lorien
>
> if you configure'd Open MPI with --enable-debug, can you
>>
> Yes.
>
>
>> export OMPI_MCA_plm_base_verbose 5
>>
>> then run one test and post the logs ?
>>
>>
> Hmmm, strange?
>
> [lorien:93841] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path
> NULL
> [lorien:93841] plm:base:set_hnp_name: initial bias 93841 nodename hash
> 1366255883
> [lorien:93841] plm:base:set_hnp_name: final jobfam 22260
> [lorien:93841] [[22260,0],0] plm:rsh_setup on agent ssh : rsh path NULL
> [lorien:93841] [[22260,0],0] plm:base:receive start comm
> [lorien:93841] [[22260,0],0] plm:base:launch [22260,1] registered
> [lorien:93841] [[22260,0],0] plm:base:launch job [22260,1] is not a
> dynamic spawn
> [lorien:93841] [[22260,0],0] plm:base:receive stop comm
>
> ~
>
>>
>> from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
>> produce job family 5576 (but you get 9325)
>>
>> the discrepancy could be explained by the use of a batch manager and/or
>> a full hostname i am unaware of.
>>
>>
>> orte_plm_base_set_hnp_name() generate a 16 bits job family from the (32
>> bits hash of the) hostname and the mpirun (32 bits ?) pid.
>>
>> so strictly speaking, it is possible two jobs launched on the same node
>> are assigned the same 16 bits job family.
>>
>>
>> the easiest way to detect this could be to
>>
>> - edit orte/mca/plm/base/plm_base_jobid.c
>>
>> and replace
>>
>> OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>>  "plm:base:set_hnp_name: final jobfam %lu",
>>  (unsigned long)jobfam));
>>
>> with
>>
>> OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>>  "plm:base:set_hnp_name: final jobfam %lu",
>>  (unsigned long)jobfam));
>>
>> configure Open MPI with --enable-debug and rebuild
>>
>> and then
>>
>> export OMPI_MCA_plm_base_verbose=4
>>
>> and run your tests.
>>
>>
>> when the problem occurs, you will be able to check which pids produced
>> the faulty jobfam, and that could hint to a conflict.
>>
>> Does this gives the same output as with export
> OMPI_MCA_plm_base_verbose=5 without the patch?
>
> If so, beacause all is automated, applying a patch is "harder" than doing
> a simple
> export OMPI_MCA_plm_base_verbose=5 for me, so maybe I could just add
> OMPI_MCA_plm_base_verbose=5 to all tests and wait until it hangs?
>
> Thanks!
>
> Eric
>
>
>
>> Cheers,
>>
>>
>> Gilles
>>
>>
>> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>>
>>> Hi,
>>>
>>> It is the third time this happened into the last 10 days.
>>>
>>> While running nighlty tests (~2200), we have one or two tests that
>>> fails at the very beginning with this strange error:
>>>
>>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received
>>> unexpected process identifier [[9325,0],0] from [[5590,0],0]
>>>
>>> But I can't reproduce the problem right 

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Eric Chamberland

Lucky!

Since each runs have a specific TMP, I still have it on disc.

for the faulty run, the TMP variable was:

TMP=/tmp/tmp.wOv5dkNaSI

and into $TMP I have:

openmpi-sessions-40031@lorien_0

and into this subdirectory I have a bunch of empty dirs:

cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls 
-la |wc -l

1841

cmpbib@lorien:/tmp/tmp.wOv5dkNaSI/openmpi-sessions-40031@lorien_0> ls 
-la |more

total 68
drwx-- 1840 cmpbib bib 45056 Sep 13 03:49 .
drwx--3 cmpbib bib   231 Sep 13 03:50 ..
drwx--2 cmpbib bib 6 Sep 13 02:10 10015
drwx--2 cmpbib bib 6 Sep 13 03:05 10049
drwx--2 cmpbib bib 6 Sep 13 03:15 10052
drwx--2 cmpbib bib 6 Sep 13 02:22 10059
drwx--2 cmpbib bib 6 Sep 13 02:22 10110
drwx--2 cmpbib bib 6 Sep 13 02:41 10114
...

If I do:

lsof |grep "openmpi-sessions-40031"
lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1000/gvfs
  Output information may be incomplete.
lsof: WARNING: can't stat() tracefs file system /sys/kernel/debug/tracing
  Output information may be incomplete.

nothing...

What else may I check?

Eric


On 14/09/16 08:47 AM, Joshua Ladd wrote:

Hi, Eric

I **think** this might be related to the following:

https://github.com/pmix/master/pull/145

I'm wondering if you can look into the /tmp directory and see if you
have a bunch of stale usock files.

Best,

Josh


On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet > wrote:

Eric,


can you please provide more information on how your tests are launched ?

do you

mpirun -np 1 ./a.out

or do you simply

./a.out


do you use a batch manager ? if yes, which one ?

do you run one test per job ? or multiple tests per job ?

how are these tests launched ?


do the test that crashes use MPI_Comm_spawn ?

i am surprised by the process name [[9325,5754],0], which suggests
there MPI_Comm_spawn was called 5753 times (!)


can you also run

hostname

on the 'lorien' host ?

if you configure'd Open MPI with --enable-debug, can you

export OMPI_MCA_plm_base_verbose 5

then run one test and post the logs ?


from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
produce job family 5576 (but you get 9325)

the discrepancy could be explained by the use of a batch manager
and/or a full hostname i am unaware of.


orte_plm_base_set_hnp_name() generate a 16 bits job family from the
(32 bits hash of the) hostname and the mpirun (32 bits ?) pid.

so strictly speaking, it is possible two jobs launched on the same
node are assigned the same 16 bits job family.


the easiest way to detect this could be to

- edit orte/mca/plm/base/plm_base_jobid.c

and replace

OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
 "plm:base:set_hnp_name: final jobfam %lu",
 (unsigned long)jobfam));

with

OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
 "plm:base:set_hnp_name: final jobfam %lu",
 (unsigned long)jobfam));

configure Open MPI with --enable-debug and rebuild

and then

export OMPI_MCA_plm_base_verbose=4

and run your tests.


when the problem occurs, you will be able to check which pids
produced the faulty jobfam, and that could hint to a conflict.


Cheers,


Gilles



On 9/14/2016 12:35 AM, Eric Chamberland wrote:

Hi,

It is the third time this happened into the last 10 days.

While running nighlty tests (~2200), we have one or two tests
that fails at the very beginning with this strange error:

[lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack:
received unexpected process identifier [[9325,0],0] from
[[5590,0],0]

But I can't reproduce the problem right now... ie: If I launch
this test alone "by hand", it is successful... the same test was
successful yesterday...

Is there some kind of "race condition" that can happen on the
creation of "tmp" files if many tests runs together on the same
node? (we are oversubcribing even sequential runs...)

Here are the build logs:


http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log




http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt




Thanks,

Eric
___
devel mailing list
devel@lists.open-mpi.org 

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Eric Chamberland



On 14/09/16 01:36 AM, Gilles Gouaillardet wrote:

Eric,


can you please provide more information on how your tests are launched ?



Yes!


do you

mpirun -np 1 ./a.out

or do you simply

./a.out



For all sequential tests, we do ./a.out.



do you use a batch manager ? if yes, which one ?


No.



do you run one test per job ? or multiple tests per job ?


On this automatic compilation, up to 16 tests are launched together.



how are these tests launched ?
For sequential ones, the special thing is that they are launched via a 
python Popen call, which launches "time", which launches the code.


So the "full" command line is:

/usr/bin/time -v -o 
/users/cmpbib/compilations/lorien/linux_dernier_ompi_leap/TV2016-09-14_03h03m15sEDT/opt/Test.Laplacien/Time.Laplacien3D.Dirichlet.mixte_tetra_prismetri.scalhier.txt 
/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/GIREF/bin/Test.Laplacien.opt 
mpi_v=2 verbose=True Beowulf=False outilMassif=False 
outilPerfRecord=False verifValgrind=False outilPerfStat=False 
outilCallgrind=False 
RepertoireDestination=/users/cmpbib/compilations/lorien/linux_dernier_ompi_leap/TV2016-09-14_03h03m15sEDT/opt/Test.Laplacien 
RepertoireTest=/pmi/cmpbib/compilation_BIB_dernier_ompi/COMPILE_AUTO/TestValidation/Ressources/opt/Test.Laplacien 
Prefixe=Laplacien3D.Dirichlet.mixte_tetra_prismetri.scalhier






do the test that crashes use MPI_Comm_spawn ?

i am surprised by the process name [[9325,5754],0], which suggests there
MPI_Comm_spawn was called 5753 times (!)


can you also run

hostname

on the 'lorien' host ?



[eric@lorien] Scripts (master $ u+1)> hostname
lorien


if you configure'd Open MPI with --enable-debug, can you

Yes.



export OMPI_MCA_plm_base_verbose 5

then run one test and post the logs ?



Hmmm, strange?

[lorien:93841] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh 
path NULL
[lorien:93841] plm:base:set_hnp_name: initial bias 93841 nodename hash 
1366255883

[lorien:93841] plm:base:set_hnp_name: final jobfam 22260
[lorien:93841] [[22260,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:93841] [[22260,0],0] plm:base:receive start comm
[lorien:93841] [[22260,0],0] plm:base:launch [22260,1] registered
[lorien:93841] [[22260,0],0] plm:base:launch job [22260,1] is not a 
dynamic spawn

[lorien:93841] [[22260,0],0] plm:base:receive stop comm

~ 



from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should
produce job family 5576 (but you get 9325)

the discrepancy could be explained by the use of a batch manager and/or
a full hostname i am unaware of.


orte_plm_base_set_hnp_name() generate a 16 bits job family from the (32
bits hash of the) hostname and the mpirun (32 bits ?) pid.

so strictly speaking, it is possible two jobs launched on the same node
are assigned the same 16 bits job family.


the easiest way to detect this could be to

- edit orte/mca/plm/base/plm_base_jobid.c

and replace

OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
 "plm:base:set_hnp_name: final jobfam %lu",
 (unsigned long)jobfam));

with

OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
 "plm:base:set_hnp_name: final jobfam %lu",
 (unsigned long)jobfam));

configure Open MPI with --enable-debug and rebuild

and then

export OMPI_MCA_plm_base_verbose=4

and run your tests.


when the problem occurs, you will be able to check which pids produced
the faulty jobfam, and that could hint to a conflict.

Does this give the same output as with export 
OMPI_MCA_plm_base_verbose=5 without the patch?


If so, because everything is automated, applying a patch is "harder" for me
than doing a simple export OMPI_MCA_plm_base_verbose=5, so maybe I could
just add OMPI_MCA_plm_base_verbose=5 to all tests and wait until it hangs?
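
Something like the following rough sketch could then scan the collected
logs for a jobfam produced by more than one pid (the log-file glob is a
placeholder; the line format is taken from the verbose output above):

import glob
import re
from collections import defaultdict

# Matches e.g. "[lorien:93841] plm:base:set_hnp_name: final jobfam 22260"
pattern = re.compile(r"\[(\S+):(\d+)\] plm:base:set_hnp_name: final jobfam (\d+)")

by_jobfam = defaultdict(set)
for logfile in glob.glob("/path/to/test/logs/*.txt"):   # placeholder path
    with open(logfile, errors="replace") as f:
        for line in f:
            m = pattern.search(line)
            if m:
                host, pid, jobfam = m.groups()
                by_jobfam[jobfam].add((host, pid))

for jobfam, procs in sorted(by_jobfam.items()):
    if len(procs) > 1:
        print("jobfam", jobfam, "was produced by", sorted(procs))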


Thanks!

Eric




Cheers,


Gilles


On 9/14/2016 12:35 AM, Eric Chamberland wrote:

Hi,

It is the third time this has happened in the last 10 days.

While running nightly tests (~2200), we have one or two tests that
fail at the very beginning with this strange error:

[lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received
unexpected process identifier [[9325,0],0] from [[5590,0],0]

But I can't reproduce the problem right now... ie: If I launch this
test alone "by hand", it is successful... the same test was successful
yesterday...

Is there some kind of "race condition" that can happen on the creation
of "tmp" files if many tests run together on the same node? (we are
oversubscribing even sequential runs...)

Here are the build logs:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt


Thanks,

Eric
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel




Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-14 Thread Joshua Ladd
Hi, Eric

I **think** this might be related to the following:

https://github.com/pmix/master/pull/145

I'm wondering if you can look into the /tmp directory and see if you have a
bunch of stale usock files.
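
A quick, hedged way to do that check (the glob patterns below are only
guesses at what Open MPI/PMIx leaves under /tmp; adjust them to whatever
actually shows up there):

import glob
import os
import time

# List /tmp leftovers older than a day; the patterns are assumptions,
# not the exact names Open MPI/PMIx uses on every system.
cutoff = time.time() - 24 * 3600
for pattern in ("/tmp/openmpi-sessions-*", "/tmp/pmix-*", "/tmp/*usock*"):
    for path in glob.glob(pattern):
        try:
            if os.path.getmtime(path) < cutoff:
                print("possibly stale:", path)
        except OSError:
            pass   # entry disappeared while scanning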

Best,

Josh


On Wed, Sep 14, 2016 at 1:36 AM, Gilles Gouaillardet 
wrote:

> Eric,
>
>
> can you please provide more information on how your tests are launched ?
>
> do you
>
> mpirun -np 1 ./a.out
>
> or do you simply
>
> ./a.out
>
>
> do you use a batch manager ? if yes, which one ?
>
> do you run one test per job ? or multiple tests per job ?
>
> how are these tests launched ?
>
>
> does the test that crashes use MPI_Comm_spawn ?
>
> I am surprised by the process name [[9325,5754],0], which suggests
> MPI_Comm_spawn was called 5753 times (!)
>
>
> can you also run
>
> hostname
>
> on the 'lorien' host ?
>
> if you configure'd Open MPI with --enable-debug, can you
>
> export OMPI_MCA_plm_base_verbose=5
>
> then run one test and post the logs ?
>
>
> from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should produce
> job family 5576 (but you get 9325)
>
> the discrepancy could be explained by the use of a batch manager and/or a
> full hostname I am unaware of.
>
>
> orte_plm_base_set_hnp_name() generates a 16-bit job family from the (32-bit
> hash of the) hostname and the mpirun (32-bit?) pid.
>
> so strictly speaking, it is possible that two jobs launched on the same node
> are assigned the same 16-bit job family.
>
>
> the easiest way to detect this could be to
>
> - edit orte/mca/plm/base/plm_base_jobid.c
>
> and replace
>
> OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
>  "plm:base:set_hnp_name: final jobfam %lu",
>  (unsigned long)jobfam));
>
> with
>
> OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
>  "plm:base:set_hnp_name: final jobfam %lu",
>  (unsigned long)jobfam));
>
> configure Open MPI with --enable-debug and rebuild
>
> and then
>
> export OMPI_MCA_plm_base_verbose=4
>
> and run your tests.
>
>
> when the problem occurs, you will be able to check which pids produced the
> faulty jobfam, and that could hint at a conflict.
>
>
> Cheers,
>
>
> Gilles
>
>
>
> On 9/14/2016 12:35 AM, Eric Chamberland wrote:
>
>> Hi,
>>
>> It is the third time this has happened in the last 10 days.
>>
>> While running nightly tests (~2200), we have one or two tests that fail
>> at the very beginning with this strange error:
>>
>> [lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received
>> unexpected process identifier [[9325,0],0] from [[5590,0],0]
>>
>> But I can't reproduce the problem right now... ie: If I launch this test
>> alone "by hand", it is successful... the same test was successful
>> yesterday...
>>
>> Is there some kind of "race condition" that can happen on the creation of
>> "tmp" files if many tests run together on the same node? (we are
>> oversubscribing even sequential runs...)
>>
>> Here are the build logs:
>>
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log
>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt
>>
>> Thanks,
>>
>> Eric
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>
>>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-13 Thread Gilles Gouaillardet

Eric,


can you please provide more information on how your tests are launched ?

do you

mpirun -np 1 ./a.out

or do you simply

./a.out


do you use a batch manager ? if yes, which one ?

do you run one test per job ? or multiple tests per job ?

how are these tests launched ?


does the test that crashes use MPI_Comm_spawn ?

I am surprised by the process name [[9325,5754],0], which suggests
MPI_Comm_spawn was called 5753 times (!)



can you also run

hostname

on the 'lorien' host ?

if you configure'd Open MPI with --enable-debug, can you

export OMPI_MCA_plm_base_verbose=5

then run one test and post the logs ?


from orte_plm_base_set_hnp_name(), "lorien" and pid 142766 should 
produce job family 5576 (but you get 9325)


the discrepancy could be explained by the use of a batch manager and/or
a full hostname I am unaware of.



orte_plm_base_set_hnp_name() generates a 16-bit job family from the (32-bit
hash of the) hostname and the mpirun (32-bit?) pid.


so strictly speaking, it is possible that two jobs launched on the same node
are assigned the same 16-bit job family.



the easiest way to detect this could be to

- edit orte/mca/plm/base/plm_base_jobid.c

and replace

OPAL_OUTPUT_VERBOSE((5, orte_plm_base_framework.framework_output,
 "plm:base:set_hnp_name: final jobfam %lu",
 (unsigned long)jobfam));

with

OPAL_OUTPUT_VERBOSE((4, orte_plm_base_framework.framework_output,
 "plm:base:set_hnp_name: final jobfam %lu",
 (unsigned long)jobfam));

configure Open MPI with --enable-debug and rebuild

and then

export OMPI_MCA_plm_base_verbose=4

and run your tests.


when the problem occurs, you will be able to check which pids produced
the faulty jobfam, and that could hint at a conflict.



Cheers,


Gilles


On 9/14/2016 12:35 AM, Eric Chamberland wrote:

Hi,

It is the third time this has happened in the last 10 days.

While running nightly tests (~2200), we have one or two tests that
fail at the very beginning with this strange error:


[lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received 
unexpected process identifier [[9325,0],0] from [[5590,0],0]


But I can't reproduce the problem right now... ie: If I launch this 
test alone "by hand", it is successful... the same test was successful 
yesterday...


Is there some kind of "race condition" that can happen on the creation
of "tmp" files if many tests run together on the same node? (we are
oversubscribing even sequential runs...)


Here are the build logs:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log 

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt 



Thanks,

Eric
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel



___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-13 Thread Eric Chamberland



On 13/09/16 12:11 PM, Pritchard Jr., Howard wrote:

Hello Eric,

Is the failure seen with the same two tests?  Or is it random
which tests fail?  If it's not random, would you be able to post


No, the tests that failed were different ones...


the tests to the list?

Also,  if possible, it would be great if you could test against a master
snapshot:

https://www.open-mpi.org/nightly/master/


Yes I can, but since the bug appears only from time to time, I think I can't
get relevant info from a single run on master... I will have to wait, let's
say, 10 or 15 days before it crashes... but that may be hard since master
is less stable than the release branch and will have normal failures... :/



Eric
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-13 Thread Eric Chamberland
Other relevant info: I never saw this problem with OpenMPI 1.6.5, 1.8.4
and 1.10.[3,4], which run the same test suite...


thanks,

Eric


On 13/09/16 11:35 AM, Eric Chamberland wrote:

Hi,

It is the third time this has happened in the last 10 days.

While running nightly tests (~2200), we have one or two tests that fail
at the very beginning with this strange error:

[lorien:142766] [[9325,5754],0] usock_peer_recv_connect_ack: received
unexpected process identifier [[9325,0],0] from [[5590,0],0]

But I can't reproduce the problem right now... ie: If I launch this test
alone "by hand", it is successful... the same test was successful
yesterday...

Is there some kind of "race condition" that can happen on the creation
of "tmp" files if many tests run together on the same node? (we are
oversubscribing even sequential runs...)

Here are the build logs:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_config.log

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.13.01h16m01s_ompi_info_all.txt


Thanks,

Eric
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel