Re: [OMPI users] Fwd: Minimum time between MPI_Bcast or MPI_Reduce calls?

2019-01-18 Thread Gilles Gouaillardet
Jeff,

that could be a copy/paste error and/or an email client issue.

The syntax is
mpirun --mca variable value ...

(short hyphen, short hyphen, m, c, a)

The error message is about the missing —-mca executable
(long hyphen, short hyphen, m, c, a)

This is most likely the root cause of this issue.

Another option is to set these parameters via the environment:
export OMPI_MCA_coll_sync_priority=100
export OMPI_MCA_coll_sync_barrier_after=10
and then invoke mpirun without the --mca options.
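
Concretely, either form should work, e.g. retyped with plain ASCII hyphens
(keeping the executable and rank count from your command):

% mpirun --mca coll_sync_priority 100 --mca coll_sync_barrier_after 10 -q -np 2 a.out

or, with the two exports above in place, simply:

% mpirun -q -np 2 a.out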

Cheers,

Gilles

On Sat, Jan 19, 2019 at 11:28 AM Jeff Wentworth via users
 wrote:
>
> Hi,
>
> Thanks for the quick response.  But it looks like I am missing something 
> because neither -mca nor --mca is being recognized by my mpirun command.
>
> % mpirun --mca coll_sync_priority 100 --mca coll_sync_barrier_after 10 -q -np 
> 2 a.out
> --
> mpirun was unable to find the specified executable file, and therefore
> did not launch the job.  This error was first reported for process
> rank 0; it may have occurred for other processes as well.
>
> NOTE: A common cause for this error is misspelling a mpirun command
>   line parameter option (remember that mpirun interprets the first
>   unrecognized command line token as the executable).
>
> Node:   mia
> Executable: —-mca
> --
> 4 total processes failed to start
>
> % which mpirun
> /usr/local/bin/mpirun
> % ls -l /usr/local/bin/mpirun
> lrwxrwxrwx. 1 root root 7 Jan 15 20:50 /usr/local/bin/mpirun -> orterun
>
> jw2002
>
> 
> On Fri, 1/18/19, Nathan Hjelm via users  wrote:
>
>  Subject: [OMPI users] Fwd: Minimum time between MPI_Bcast or MPI_Reduce 
> calls?
>  To: "Open MPI Users" 
>  Cc: "Nathan Hjelm" 
>  Date: Friday, January 18, 2019, 9:00 PM
>
>
>  Since neither bcast nor reduce acts as
>  a barrier it is possible to run out of resources if either
>  of these calls (or both) are used in a tight loop. The sync
>  coll component exists for this scenario. You can enable it
>  by  adding the following to mpirun (or setting these
>  variables through the environment or a file):
>
>  —mca coll_sync_priority 100 —mca
>  coll_sync_barrier_after 10
>
>
>  This will effectively throttle the
>  collective calls for you. You can also change the reduce to
>  an allreduce.
>
>
>  -Nathan
>
>  > On Jan 18, 2019, at 6:31 PM, Jeff
>  Wentworth via users 
>  wrote:
>  >
>  > Greetings everyone,
>  >
>  > I have a scientific code using
>  Open MPI (v3.1.3) that seems to work fine when MPI_Bcast()
>  and MPI_Reduce() calls are well spaced out in time.
>  Yet if the time between these calls is short, eventually one
>  of the nodes hangs at some random point, never returning
>  from the broadcast or reduce call.  Is there some
>  minimum time between calls that needs to be obeyed in order
>  for Open MPI to process these reliably?
>  >
>  > The reason this has come up is
>  because I am trying to run in a multi-node environment some
>  established acceptance tests in order to verify that the
>  Open MPI configured version of the code yields the same
>  baseline result as the original single node version of the
>  code.  These acceptance tests must pass in order for
>  the code to be considered validated and deliverable to the
>  customer.  One of these acceptance tests that hangs
>  does involve 90 broadcasts and 90 reduces in a short period
>  of time (less than .01 cpu sec), as in:
>  >
>  > Broadcast #89 in
>  >  Broadcast #89 out 8 bytes
>  >  Calculate angle #89
>  >  Reduce #89 in
>  >  Reduce #89 out 208 bytes
>  > Write result #89 to file on
>  service node
>  > Broadcast #90 in
>  >  Broadcast #90 out 8 bytes
>  >  Calculate angle #89
>  >  Reduce #90 in
>  >  Reduce #90 out 208 bytes
>  > Write result #90 to file on
>  service node
>  >
>  > If I slow down the above
>  acceptance test, for example by running it under valgrind,
>  then it runs to completion and yields the correct
>  result.  So it seems to suggest that something internal
>  to Open MPI is getting swamped.  I understand that
>  these acceptance tests might be pushing the limit, given
>  that they involve so many short calculations combined with
>  frequent, yet tiny, transfers of data among nodes.
>  >
>  > Would it be worthwhile for me to
>  enforce with some minimum wait time between the MPI calls,
>  say 0.01 or 0.001 sec via nanosleep()?  The only time
>  it would matter would be when acceptance tests are run, as
>  the situation doesn't arise when beefier runs are performed.
>
>  >
>  > Thanks.
>  >
>  > jw2002
>  >

Re: [OMPI users] Fwd: Minimum time between MPI_Bcast or MPI_Reduce calls?

2019-01-18 Thread Jeff Wentworth via users
Hi,

Thanks for the quick response.  But it looks like I am missing something 
because neither -mca nor --mca is being recognized by my mpirun command.  

% mpirun --mca coll_sync_priority 100 --mca coll_sync_barrier_after 10 -q -np 2 
a.out
--
mpirun was unable to find the specified executable file, and therefore
did not launch the job.  This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
  line parameter option (remember that mpirun interprets the first
  unrecognized command line token as the executable).

Node:   mia
Executable: —-mca
--
4 total processes failed to start

% which mpirun
/usr/local/bin/mpirun
% ls -l /usr/local/bin/mpirun
lrwxrwxrwx. 1 root root 7 Jan 15 20:50 /usr/local/bin/mpirun -> orterun

jw2002


On Fri, 1/18/19, Nathan Hjelm via users  wrote:

 Subject: [OMPI users] Fwd: Minimum time between MPI_Bcast or MPI_Reduce calls?
 To: "Open MPI Users" 
 Cc: "Nathan Hjelm" 
 Date: Friday, January 18, 2019, 9:00 PM
 
 
 Since neither bcast nor reduce acts as
 a barrier it is possible to run out of resources if either
 of these calls (or both) are used in a tight loop. The sync
 coll component exists for this scenario. You can enable it
 by  adding the following to mpirun (or setting these
 variables through the environment or a file):
 
 —mca coll_sync_priority 100 —mca
 coll_sync_barrier_after 10
 
 
 This will effectively throttle the
 collective calls for you. You can also change the reduce to
 an allreduce.
 
 
 -Nathan
 
 > On Jan 18, 2019, at 6:31 PM, Jeff
 Wentworth via users 
 wrote:
 > 
 > Greetings everyone,
 > 
 > I have a scientific code using
 Open MPI (v3.1.3) that seems to work fine when MPI_Bcast()
 and MPI_Reduce() calls are well spaced out in time. 
 Yet if the time between these calls is short, eventually one
 of the nodes hangs at some random point, never returning
 from the broadcast or reduce call.  Is there some
 minimum time between calls that needs to be obeyed in order
 for Open MPI to process these reliably?
 > 
 > The reason this has come up is
 because I am trying to run in a multi-node environment some
 established acceptance tests in order to verify that the
 Open MPI configured version of the code yields the same
 baseline result as the original single node version of the
 code.  These acceptance tests must pass in order for
 the code to be considered validated and deliverable to the
 customer.  One of these acceptance tests that hangs
 does involve 90 broadcasts and 90 reduces in a short period
 of time (less than .01 cpu sec), as in:
 > 
 > Broadcast #89 in
 >  Broadcast #89 out 8 bytes
 >  Calculate angle #89
 >  Reduce #89 in
 >  Reduce #89 out 208 bytes
 > Write result #89 to file on
 service node
 > Broadcast #90 in
 >  Broadcast #90 out 8 bytes
 >  Calculate angle #89
 >  Reduce #90 in
 >  Reduce #90 out 208 bytes
 > Write result #90 to file on
 service node
 > 
 > If I slow down the above
 acceptance test, for example by running it under valgrind,
 then it runs to completion and yields the correct
 result.  So it seems to suggest that something internal
 to Open MPI is getting swamped.  I understand that
 these acceptance tests might be pushing the limit, given
 that they involve so many short calculations combined with
 frequent, yet tiny, transfers of data among nodes.  
 > 
 > Would it be worthwhile for me to
 enforce with some minimum wait time between the MPI calls,
 say 0.01 or 0.001 sec via nanosleep()?  The only time
 it would matter would be when acceptance tests are run, as
 the situation doesn't arise when beefier runs are performed.
 
 > 
 > Thanks.
 > 
 > jw2002
 >

[OMPI users] Fwd: Minimum time between MPI_Bcast or MPI_Reduce calls?

2019-01-18 Thread Nathan Hjelm via users

Since neither bcast nor reduce acts as a barrier, it is possible to run out of 
resources if either of these calls (or both) are used in a tight loop. The sync 
coll component exists for this scenario. You can enable it by adding the 
following to mpirun (or setting these variables through the environment or a 
file):

—mca coll_sync_priority 100 —mca coll_sync_barrier_after 10


This will effectively throttle the collective calls for you. You can also 
change the reduce to an allreduce.
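
For the file option, these can typically go in an MCA parameter file such as 
$HOME/.openmpi/mca-params.conf, one name = value pair per line:

coll_sync_priority = 100
coll_sync_barrier_after = 10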


-Nathan

> On Jan 18, 2019, at 6:31 PM, Jeff Wentworth via users 
>  wrote:
> 
> Greetings everyone,
> 
> I have a scientific code using Open MPI (v3.1.3) that seems to work fine when 
> MPI_Bcast() and MPI_Reduce() calls are well spaced out in time.  Yet if the 
> time between these calls is short, eventually one of the nodes hangs at some 
> random point, never returning from the broadcast or reduce call.  Is there 
> some minimum time between calls that needs to be obeyed in order for Open MPI 
> to process these reliably?
> 
> The reason this has come up is because I am trying to run in a multi-node 
> environment some established acceptance tests in order to verify that the 
> Open MPI configured version of the code yields the same baseline result as 
> the original single node version of the code.  These acceptance tests must 
> pass in order for the code to be considered validated and deliverable to the 
> customer.  One of these acceptance tests that hangs does involve 90 
> broadcasts and 90 reduces in a short period of time (less than .01 cpu sec), 
> as in:
> 
> Broadcast #89 in
>  Broadcast #89 out 8 bytes
>  Calculate angle #89
>  Reduce #89 in
>  Reduce #89 out 208 bytes
> Write result #89 to file on service node
> Broadcast #90 in
>  Broadcast #90 out 8 bytes
>  Calculate angle #89
>  Reduce #90 in
>  Reduce #90 out 208 bytes
> Write result #90 to file on service node
> 
> If I slow down the above acceptance test, for example by running it under 
> valgrind, then it runs to completion and yields the correct result.  So it 
> seems to suggest that something internal to Open MPI is getting swamped.  I 
> understand that these acceptance tests might be pushing the limit, given that 
> they involve so many short calculations combined with frequent, yet tiny, 
> transfers of data among nodes.  
> 
> Would it be worthwhile for me to enforce with some minimum wait time between 
> the MPI calls, say 0.01 or 0.001 sec via nanosleep()?  The only time it would 
> matter would be when acceptance tests are run, as the situation doesn't arise 
> when beefier runs are performed. 
> 
> Thanks.
> 
> jw2002

[OMPI users] Minimum time between MPI_Bcast or MPI_Reduce calls?

2019-01-18 Thread Jeff Wentworth via users
Greetings everyone,

I have a scientific code using Open MPI (v3.1.3) that seems to work fine when 
MPI_Bcast() and MPI_Reduce() calls are well spaced out in time.  Yet if the 
time between these calls is short, eventually one of the nodes hangs at some 
random point, never returning from the broadcast or reduce call.  Is there some 
minimum time between calls that needs to be obeyed in order for Open MPI to 
process these reliably?

The reason this has come up is that I am trying to run some established 
acceptance tests in a multi-node environment in order to verify that the Open 
MPI configured version of the code yields the same baseline result as the 
original single-node version of the code.  These acceptance tests must pass in 
order for the code to be considered validated and deliverable to the customer.  
One of the acceptance tests that hangs involves 90 broadcasts and 90 reduces 
in a short period of time (less than 0.01 CPU sec), as in:

 Broadcast #89 in
   Broadcast #89 out 8 bytes
   Calculate angle #89
   Reduce #89 in
   Reduce #89 out 208 bytes
 Write result #89 to file on service node
 Broadcast #90 in
   Broadcast #90 out 8 bytes
   Calculate angle #89
   Reduce #90 in
   Reduce #90 out 208 bytes
 Write result #90 to file on service node

If I slow down the above acceptance test, for example by running it under 
valgrind, then it runs to completion and yields the correct result.  This 
suggests that something internal to Open MPI is getting swamped.  I understand 
that these acceptance tests might be pushing the limit, given that they involve 
so many short calculations combined with frequent, yet tiny, transfers of data 
among nodes.

Would it be worthwhile for me to enforce some minimum wait time between the 
MPI calls, say 0.01 or 0.001 sec, via nanosleep()?  The only time it would 
matter would be when the acceptance tests are run, as the situation doesn't 
arise when beefier runs are performed.
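
For concreteness, here is a minimal sketch of the kind of throttle I have in 
mind (the buffer names, the MPI_SUM operation and the 1 ms pause are made up 
for illustration; only the loop count and message sizes match the real test):

#include <mpi.h>
#include <time.h>

/* Hypothetical helper: pause roughly 1 ms between collective pairs. */
static void throttle_1ms(void)
{
    struct timespec ts = { 0, 1000000L };   /* 0 s + 1,000,000 ns = 1 ms */
    nanosleep(&ts, NULL);
}

int main(int argc, char **argv)
{
    double in = 0.0;                /* 8-byte broadcast payload             */
    double angle[26], sum[26];      /* 26 doubles = 208-byte reduce payload */

    MPI_Init(&argc, &argv);
    for (int i = 1; i <= 90; i++) {
        MPI_Bcast(&in, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        for (int j = 0; j < 26; j++)
            angle[j] = in + j;      /* stand-in for "calculate angle #i"    */
        MPI_Reduce(angle, sum, 26, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        throttle_1ms();             /* the proposed minimum wait            */
    }
    MPI_Finalize();
    return 0;
}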

Thanks.

jw2002


Re: [OMPI users] pmix and srun

2019-01-18 Thread Ralph H Castain
Good - thanks!

> On Jan 18, 2019, at 3:25 PM, Michael Di Domenico  
> wrote:
> 
> seems to be better now.  jobs are running
> 
> On Fri, Jan 18, 2019 at 6:17 PM Ralph H Castain  wrote:
>> 
>> I have pushed a fix to the v2.2 branch - could you please confirm it?
>> 
>> 
>>> On Jan 18, 2019, at 2:23 PM, Ralph H Castain  wrote:
>>> 
>>> Aha - I found it. It’s a typo in the v2.2.1 release. Sadly, our Slurm 
>>> plugin folks seem to be off somewhere for awhile and haven’t been testing 
>>> it. Sigh.
>>> 
>>> I’ll patch the branch and let you know - we’d appreciate the feedback.
>>> Ralph
>>> 
>>> 
 On Jan 18, 2019, at 2:09 PM, Michael Di Domenico  
 wrote:
 
 here's the branches i'm using.  i did a git clone on the repo's and
 then a git checkout
 
 [ec2-user@labhead bin]$ cd /hpc/src/pmix/
 [ec2-user@labhead pmix]$ git branch
 master
 * v2.2
 [ec2-user@labhead pmix]$ cd ../slurm/
 [ec2-user@labhead slurm]$ git branch
 * (detached from origin/slurm-18.08)
 master
 [ec2-user@labhead slurm]$ cd ../ompi/
 [ec2-user@labhead ompi]$ git branch
 * (detached from origin/v3.1.x)
 master
 
 
 attached is the debug out from the run with the debugging turned on
 
 On Fri, Jan 18, 2019 at 4:30 PM Ralph H Castain  wrote:
> 
> Looks strange. I’m pretty sure Mellanox didn’t implement the event 
> notification system in the Slurm plugin, but you should only be trying to 
> call it if OMPI is registering a system-level event code - which OMPI 3.1 
> definitely doesn’t do.
> 
> If you are using PMIx v2.2.0, then please note that there is a bug in it 
> that slipped through our automated testing. I replaced it today with 
> v2.2.1 - you probably should update if that’s the case. However, that 
> wouldn’t necessarily explain this behavior. I’m not that familiar with 
> the Slurm plugin, but you might try adding
> 
> PMIX_MCA_pmix_client_event_verbose=5
> PMIX_MCA_pmix_server_event_verbose=5
> OMPI_MCA_pmix_base_verbose=10
> 
> to your environment and see if that provides anything useful.
> 
>> On Jan 18, 2019, at 12:09 PM, Michael Di Domenico 
>>  wrote:
>> 
>> i compiled pmix slurm openmpi
>> 
>> ---pmix
>> ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13
>> --disable-debug
>> ---slurm
>> ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13
>> --with-pmix=/hpc/pmix/2.2
>> ---openmpi
>> ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external
>> --with-libevent=external --with-slurm=/hpc/slurm/18.08
>> --with-pmix=/hpc/pmix/2.2
>> 
>> everything seemed to compile fine, but when i do an srun i get the
>> below errors, however, if i salloc and then mpirun it seems to work
>> fine.  i'm not quite sure where the breakdown is or how to debug it
>> 
>> ---
>> 
>> [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl
>> [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file
>> event/pmix_event_registration.c at line 101
>> [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file
>> event/pmix_event_registration.c at line 101
>> [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file
>> event/pmix_event_registration.c at line 101
>> --
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or 
>> environment
>> problems.  This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>> 
>> ompi_interlib_declare
>> --> Returned "Would block" (-10) instead of "Success" (0)
>> ...snipped...
>> [labcmp6:18355] *** An error occurred in MPI_Init
>> [labcmp6:18355] *** reported by process [140726281390153,15]
>> [labcmp6:18355] *** on a NULL communicator
>> [labcmp6:18355] *** Unknown error
>> [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this
>> communicator will now abort,
>> [labcmp6:18355] ***and potentially your MPI job)
>> [labcmp6:18352] *** An error occurred in MPI_Init
>> [labcmp6:18352] *** reported by process [1677936713,12]
>> [labcmp6:18352] *** on a NULL communicator
>> [labcmp6:18352] *** Unknown error
>> [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this
>> communicator will now abort,
>> [labcmp6:18352] ***and potentially your MPI job)
>> [labcmp6:18354] *** An error occurred in MPI_Init
>> [labcmp6:18354] *** reported by process [140726281390153,14]
>> [labcmp6:18354] *** on a NULL communicator
>> [labcmp6:18354] *** Unknown error
>> [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL 

Re: [OMPI users] pmix and srun

2019-01-18 Thread Michael Di Domenico
seems to be better now.  jobs are running

On Fri, Jan 18, 2019 at 6:17 PM Ralph H Castain  wrote:
>
> I have pushed a fix to the v2.2 branch - could you please confirm it?
>
>
> > On Jan 18, 2019, at 2:23 PM, Ralph H Castain  wrote:
> >
> > Aha - I found it. It’s a typo in the v2.2.1 release. Sadly, our Slurm 
> > plugin folks seem to be off somewhere for awhile and haven’t been testing 
> > it. Sigh.
> >
> > I’ll patch the branch and let you know - we’d appreciate the feedback.
> > Ralph
> >
> >
> >> On Jan 18, 2019, at 2:09 PM, Michael Di Domenico  
> >> wrote:
> >>
> >> here's the branches i'm using.  i did a git clone on the repo's and
> >> then a git checkout
> >>
> >> [ec2-user@labhead bin]$ cd /hpc/src/pmix/
> >> [ec2-user@labhead pmix]$ git branch
> >> master
> >> * v2.2
> >> [ec2-user@labhead pmix]$ cd ../slurm/
> >> [ec2-user@labhead slurm]$ git branch
> >> * (detached from origin/slurm-18.08)
> >> master
> >> [ec2-user@labhead slurm]$ cd ../ompi/
> >> [ec2-user@labhead ompi]$ git branch
> >> * (detached from origin/v3.1.x)
> >> master
> >>
> >>
> >> attached is the debug out from the run with the debugging turned on
> >>
> >> On Fri, Jan 18, 2019 at 4:30 PM Ralph H Castain  wrote:
> >>>
> >>> Looks strange. I’m pretty sure Mellanox didn’t implement the event 
> >>> notification system in the Slurm plugin, but you should only be trying to 
> >>> call it if OMPI is registering a system-level event code - which OMPI 3.1 
> >>> definitely doesn’t do.
> >>>
> >>> If you are using PMIx v2.2.0, then please note that there is a bug in it 
> >>> that slipped through our automated testing. I replaced it today with 
> >>> v2.2.1 - you probably should update if that’s the case. However, that 
> >>> wouldn’t necessarily explain this behavior. I’m not that familiar with 
> >>> the Slurm plugin, but you might try adding
> >>>
> >>> PMIX_MCA_pmix_client_event_verbose=5
> >>> PMIX_MCA_pmix_server_event_verbose=5
> >>> OMPI_MCA_pmix_base_verbose=10
> >>>
> >>> to your environment and see if that provides anything useful.
> >>>
>  On Jan 18, 2019, at 12:09 PM, Michael Di Domenico 
>   wrote:
> 
>  i compiled pmix slurm openmpi
> 
>  ---pmix
>  ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13
>  --disable-debug
>  ---slurm
>  ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13
>  --with-pmix=/hpc/pmix/2.2
>  ---openmpi
>  ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external
>  --with-libevent=external --with-slurm=/hpc/slurm/18.08
>  --with-pmix=/hpc/pmix/2.2
> 
>  everything seemed to compile fine, but when i do an srun i get the
>  below errors, however, if i salloc and then mpirun it seems to work
>  fine.  i'm not quite sure where the breakdown is or how to debug it
> 
>  ---
> 
>  [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl
>  [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file
>  event/pmix_event_registration.c at line 101
>  [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file
>  event/pmix_event_registration.c at line 101
>  [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file
>  event/pmix_event_registration.c at line 101
>  --
>  It looks like MPI_INIT failed for some reason; your parallel process is
>  likely to abort.  There are many reasons that a parallel process can
>  fail during MPI_INIT; some of which are due to configuration or 
>  environment
>  problems.  This failure appears to be an internal failure; here's some
>  additional information (which may only be relevant to an Open MPI
>  developer):
> 
>  ompi_interlib_declare
>  --> Returned "Would block" (-10) instead of "Success" (0)
>  ...snipped...
>  [labcmp6:18355] *** An error occurred in MPI_Init
>  [labcmp6:18355] *** reported by process [140726281390153,15]
>  [labcmp6:18355] *** on a NULL communicator
>  [labcmp6:18355] *** Unknown error
>  [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this
>  communicator will now abort,
>  [labcmp6:18355] ***and potentially your MPI job)
>  [labcmp6:18352] *** An error occurred in MPI_Init
>  [labcmp6:18352] *** reported by process [1677936713,12]
>  [labcmp6:18352] *** on a NULL communicator
>  [labcmp6:18352] *** Unknown error
>  [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this
>  communicator will now abort,
>  [labcmp6:18352] ***and potentially your MPI job)
>  [labcmp6:18354] *** An error occurred in MPI_Init
>  [labcmp6:18354] *** reported by process [140726281390153,14]
>  [labcmp6:18354] *** on a NULL communicator
>  [labcmp6:18354] *** Unknown error
>  [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this
>  communicator will now abort,
>  [labcmp6:18354] ***and potentially your MPI job)
>  

Re: [OMPI users] pmix and srun

2019-01-18 Thread Ralph H Castain
I have pushed a fix to the v2.2 branch - could you please confirm it?
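
Roughly, in your existing checkout (paths taken from your earlier message; 
re-run autogen/configure first if the build tree asks for it):

cd /hpc/src/pmix
git checkout v2.2
git pull
make && make install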


> On Jan 18, 2019, at 2:23 PM, Ralph H Castain  wrote:
> 
> Aha - I found it. It’s a typo in the v2.2.1 release. Sadly, our Slurm plugin 
> folks seem to be off somewhere for awhile and haven’t been testing it. Sigh.
> 
> I’ll patch the branch and let you know - we’d appreciate the feedback.
> Ralph
> 
> 
>> On Jan 18, 2019, at 2:09 PM, Michael Di Domenico  
>> wrote:
>> 
>> here's the branches i'm using.  i did a git clone on the repo's and
>> then a git checkout
>> 
>> [ec2-user@labhead bin]$ cd /hpc/src/pmix/
>> [ec2-user@labhead pmix]$ git branch
>> master
>> * v2.2
>> [ec2-user@labhead pmix]$ cd ../slurm/
>> [ec2-user@labhead slurm]$ git branch
>> * (detached from origin/slurm-18.08)
>> master
>> [ec2-user@labhead slurm]$ cd ../ompi/
>> [ec2-user@labhead ompi]$ git branch
>> * (detached from origin/v3.1.x)
>> master
>> 
>> 
>> attached is the debug out from the run with the debugging turned on
>> 
>> On Fri, Jan 18, 2019 at 4:30 PM Ralph H Castain  wrote:
>>> 
>>> Looks strange. I’m pretty sure Mellanox didn’t implement the event 
>>> notification system in the Slurm plugin, but you should only be trying to 
>>> call it if OMPI is registering a system-level event code - which OMPI 3.1 
>>> definitely doesn’t do.
>>> 
>>> If you are using PMIx v2.2.0, then please note that there is a bug in it 
>>> that slipped through our automated testing. I replaced it today with v2.2.1 
>>> - you probably should update if that’s the case. However, that wouldn’t 
>>> necessarily explain this behavior. I’m not that familiar with the Slurm 
>>> plugin, but you might try adding
>>> 
>>> PMIX_MCA_pmix_client_event_verbose=5
>>> PMIX_MCA_pmix_server_event_verbose=5
>>> OMPI_MCA_pmix_base_verbose=10
>>> 
>>> to your environment and see if that provides anything useful.
>>> 
 On Jan 18, 2019, at 12:09 PM, Michael Di Domenico  
 wrote:
 
 i compiled pmix slurm openmpi
 
 ---pmix
 ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13
 --disable-debug
 ---slurm
 ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13
 --with-pmix=/hpc/pmix/2.2
 ---openmpi
 ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external
 --with-libevent=external --with-slurm=/hpc/slurm/18.08
 --with-pmix=/hpc/pmix/2.2
 
 everything seemed to compile fine, but when i do an srun i get the
 below errors, however, if i salloc and then mpirun it seems to work
 fine.  i'm not quite sure where the breakdown is or how to debug it
 
 ---
 
 [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl
 [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file
 event/pmix_event_registration.c at line 101
 [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file
 event/pmix_event_registration.c at line 101
 [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file
 event/pmix_event_registration.c at line 101
 --
 It looks like MPI_INIT failed for some reason; your parallel process is
 likely to abort.  There are many reasons that a parallel process can
 fail during MPI_INIT; some of which are due to configuration or environment
 problems.  This failure appears to be an internal failure; here's some
 additional information (which may only be relevant to an Open MPI
 developer):
 
 ompi_interlib_declare
 --> Returned "Would block" (-10) instead of "Success" (0)
 ...snipped...
 [labcmp6:18355] *** An error occurred in MPI_Init
 [labcmp6:18355] *** reported by process [140726281390153,15]
 [labcmp6:18355] *** on a NULL communicator
 [labcmp6:18355] *** Unknown error
 [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this
 communicator will now abort,
 [labcmp6:18355] ***and potentially your MPI job)
 [labcmp6:18352] *** An error occurred in MPI_Init
 [labcmp6:18352] *** reported by process [1677936713,12]
 [labcmp6:18352] *** on a NULL communicator
 [labcmp6:18352] *** Unknown error
 [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this
 communicator will now abort,
 [labcmp6:18352] ***and potentially your MPI job)
 [labcmp6:18354] *** An error occurred in MPI_Init
 [labcmp6:18354] *** reported by process [140726281390153,14]
 [labcmp6:18354] *** on a NULL communicator
 [labcmp6:18354] *** Unknown error
 [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this
 communicator will now abort,
 [labcmp6:18354] ***and potentially your MPI job)
 srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
 slurmstepd: error: *** STEP 24.0 ON labcmp3 CANCELLED AT 
 2019-01-18T20:03:33 ***
 [labcmp5:18358] PMIX ERROR: NOT-SUPPORTED in file
 event/pmix_event_registration.c at line 101
 

Re: [OMPI users] Fwd: pmix and srun

2019-01-18 Thread Ralph H Castain
Aha - I found it. It’s a typo in the v2.2.1 release. Sadly, our Slurm plugin 
folks seem to be off somewhere for awhile and haven’t been testing it. Sigh.

I’ll patch the branch and let you know - we’d appreciate the feedback.
Ralph


> On Jan 18, 2019, at 2:09 PM, Michael Di Domenico  
> wrote:
> 
> here's the branches i'm using.  i did a git clone on the repo's and
> then a git checkout
> 
> [ec2-user@labhead bin]$ cd /hpc/src/pmix/
> [ec2-user@labhead pmix]$ git branch
>  master
> * v2.2
> [ec2-user@labhead pmix]$ cd ../slurm/
> [ec2-user@labhead slurm]$ git branch
> * (detached from origin/slurm-18.08)
>  master
> [ec2-user@labhead slurm]$ cd ../ompi/
> [ec2-user@labhead ompi]$ git branch
> * (detached from origin/v3.1.x)
>  master
> 
> 
> attached is the debug out from the run with the debugging turned on
> 
> On Fri, Jan 18, 2019 at 4:30 PM Ralph H Castain  wrote:
>> 
>> Looks strange. I’m pretty sure Mellanox didn’t implement the event 
>> notification system in the Slurm plugin, but you should only be trying to 
>> call it if OMPI is registering a system-level event code - which OMPI 3.1 
>> definitely doesn’t do.
>> 
>> If you are using PMIx v2.2.0, then please note that there is a bug in it 
>> that slipped through our automated testing. I replaced it today with v2.2.1 
>> - you probably should update if that’s the case. However, that wouldn’t 
>> necessarily explain this behavior. I’m not that familiar with the Slurm 
>> plugin, but you might try adding
>> 
>> PMIX_MCA_pmix_client_event_verbose=5
>> PMIX_MCA_pmix_server_event_verbose=5
>> OMPI_MCA_pmix_base_verbose=10
>> 
>> to your environment and see if that provides anything useful.
>> 
>>> On Jan 18, 2019, at 12:09 PM, Michael Di Domenico  
>>> wrote:
>>> 
>>> i compiled pmix slurm openmpi
>>> 
>>> ---pmix
>>> ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13
>>> --disable-debug
>>> ---slurm
>>> ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13
>>> --with-pmix=/hpc/pmix/2.2
>>> ---openmpi
>>> ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external
>>> --with-libevent=external --with-slurm=/hpc/slurm/18.08
>>> --with-pmix=/hpc/pmix/2.2
>>> 
>>> everything seemed to compile fine, but when i do an srun i get the
>>> below errors, however, if i salloc and then mpirun it seems to work
>>> fine.  i'm not quite sure where the breakdown is or how to debug it
>>> 
>>> ---
>>> 
>>> [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl
>>> [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file
>>> event/pmix_event_registration.c at line 101
>>> [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file
>>> event/pmix_event_registration.c at line 101
>>> [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file
>>> event/pmix_event_registration.c at line 101
>>> --
>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>> likely to abort.  There are many reasons that a parallel process can
>>> fail during MPI_INIT; some of which are due to configuration or environment
>>> problems.  This failure appears to be an internal failure; here's some
>>> additional information (which may only be relevant to an Open MPI
>>> developer):
>>> 
>>> ompi_interlib_declare
>>> --> Returned "Would block" (-10) instead of "Success" (0)
>>> ...snipped...
>>> [labcmp6:18355] *** An error occurred in MPI_Init
>>> [labcmp6:18355] *** reported by process [140726281390153,15]
>>> [labcmp6:18355] *** on a NULL communicator
>>> [labcmp6:18355] *** Unknown error
>>> [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this
>>> communicator will now abort,
>>> [labcmp6:18355] ***and potentially your MPI job)
>>> [labcmp6:18352] *** An error occurred in MPI_Init
>>> [labcmp6:18352] *** reported by process [1677936713,12]
>>> [labcmp6:18352] *** on a NULL communicator
>>> [labcmp6:18352] *** Unknown error
>>> [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this
>>> communicator will now abort,
>>> [labcmp6:18352] ***and potentially your MPI job)
>>> [labcmp6:18354] *** An error occurred in MPI_Init
>>> [labcmp6:18354] *** reported by process [140726281390153,14]
>>> [labcmp6:18354] *** on a NULL communicator
>>> [labcmp6:18354] *** Unknown error
>>> [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this
>>> communicator will now abort,
>>> [labcmp6:18354] ***and potentially your MPI job)
>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>> slurmstepd: error: *** STEP 24.0 ON labcmp3 CANCELLED AT 
>>> 2019-01-18T20:03:33 ***
>>> [labcmp5:18358] PMIX ERROR: NOT-SUPPORTED in file
>>> event/pmix_event_registration.c at line 101
>>> --
>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>> likely to abort.  There are many reasons that a parallel process can
>>> fail during MPI_INIT; some of which are due to configuration or 

Re: [OMPI users] Fwd: pmix and srun

2019-01-18 Thread Michael Di Domenico
here are the branches i'm using.  i did a git clone on the repos and
then a git checkout

[ec2-user@labhead bin]$ cd /hpc/src/pmix/
[ec2-user@labhead pmix]$ git branch
  master
* v2.2
[ec2-user@labhead pmix]$ cd ../slurm/
[ec2-user@labhead slurm]$ git branch
* (detached from origin/slurm-18.08)
  master
[ec2-user@labhead slurm]$ cd ../ompi/
[ec2-user@labhead ompi]$ git branch
* (detached from origin/v3.1.x)
  master


attached is the debug out from the run with the debugging turned on

On Fri, Jan 18, 2019 at 4:30 PM Ralph H Castain  wrote:
>
> Looks strange. I’m pretty sure Mellanox didn’t implement the event 
> notification system in the Slurm plugin, but you should only be trying to 
> call it if OMPI is registering a system-level event code - which OMPI 3.1 
> definitely doesn’t do.
>
> If you are using PMIx v2.2.0, then please note that there is a bug in it that 
> slipped through our automated testing. I replaced it today with v2.2.1 - you 
> probably should update if that’s the case. However, that wouldn’t necessarily 
> explain this behavior. I’m not that familiar with the Slurm plugin, but you 
> might try adding
>
> PMIX_MCA_pmix_client_event_verbose=5
> PMIX_MCA_pmix_server_event_verbose=5
> OMPI_MCA_pmix_base_verbose=10
>
> to your environment and see if that provides anything useful.
>
> > On Jan 18, 2019, at 12:09 PM, Michael Di Domenico  
> > wrote:
> >
> > i compiled pmix slurm openmpi
> >
> > ---pmix
> > ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13
> > --disable-debug
> > ---slurm
> > ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13
> > --with-pmix=/hpc/pmix/2.2
> > ---openmpi
> > ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external
> > --with-libevent=external --with-slurm=/hpc/slurm/18.08
> > --with-pmix=/hpc/pmix/2.2
> >
> > everything seemed to compile fine, but when i do an srun i get the
> > below errors, however, if i salloc and then mpirun it seems to work
> > fine.  i'm not quite sure where the breakdown is or how to debug it
> >
> > ---
> >
> > [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl
> > [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file
> > event/pmix_event_registration.c at line 101
> > [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file
> > event/pmix_event_registration.c at line 101
> > [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file
> > event/pmix_event_registration.c at line 101
> > --
> > It looks like MPI_INIT failed for some reason; your parallel process is
> > likely to abort.  There are many reasons that a parallel process can
> > fail during MPI_INIT; some of which are due to configuration or environment
> > problems.  This failure appears to be an internal failure; here's some
> > additional information (which may only be relevant to an Open MPI
> > developer):
> >
> >  ompi_interlib_declare
> >  --> Returned "Would block" (-10) instead of "Success" (0)
> > ...snipped...
> > [labcmp6:18355] *** An error occurred in MPI_Init
> > [labcmp6:18355] *** reported by process [140726281390153,15]
> > [labcmp6:18355] *** on a NULL communicator
> > [labcmp6:18355] *** Unknown error
> > [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this
> > communicator will now abort,
> > [labcmp6:18355] ***and potentially your MPI job)
> > [labcmp6:18352] *** An error occurred in MPI_Init
> > [labcmp6:18352] *** reported by process [1677936713,12]
> > [labcmp6:18352] *** on a NULL communicator
> > [labcmp6:18352] *** Unknown error
> > [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this
> > communicator will now abort,
> > [labcmp6:18352] ***and potentially your MPI job)
> > [labcmp6:18354] *** An error occurred in MPI_Init
> > [labcmp6:18354] *** reported by process [140726281390153,14]
> > [labcmp6:18354] *** on a NULL communicator
> > [labcmp6:18354] *** Unknown error
> > [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this
> > communicator will now abort,
> > [labcmp6:18354] ***and potentially your MPI job)
> > srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> > slurmstepd: error: *** STEP 24.0 ON labcmp3 CANCELLED AT 
> > 2019-01-18T20:03:33 ***
> > [labcmp5:18358] PMIX ERROR: NOT-SUPPORTED in file
> > event/pmix_event_registration.c at line 101
> > --
> > It looks like MPI_INIT failed for some reason; your parallel process is
> > likely to abort.  There are many reasons that a parallel process can
> > fail during MPI_INIT; some of which are due to configuration or environment
> > problems.  This failure appears to be an internal failure; here's some
> > additional information (which may only be relevant to an Open MPI
> > developer):
> >
> >  ompi_interlib_declare
> >  --> Returned "Would block" (-10) instead of "Success" (0)
> > --
> > [labcmp5:18357] PMIX 

Re: [OMPI users] Fwd: pmix and srun

2019-01-18 Thread Ralph H Castain
Looks strange. I’m pretty sure Mellanox didn’t implement the event notification 
system in the Slurm plugin, but you should only be trying to call it if OMPI is 
registering a system-level event code - which OMPI 3.1 definitely doesn’t do.

If you are using PMIx v2.2.0, then please note that there is a bug in it that 
slipped through our automated testing. I replaced it today with v2.2.1 - you 
probably should update if that’s the case. However, that wouldn’t necessarily 
explain this behavior. I’m not that familiar with the Slurm plugin, but you 
might try adding

PMIX_MCA_pmix_client_event_verbose=5
PMIX_MCA_pmix_server_event_verbose=5
OMPI_MCA_pmix_base_verbose=10

to your environment and see if that provides anything useful.
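
For example (assuming a bash-like shell, and reusing your srun line):

export PMIX_MCA_pmix_client_event_verbose=5
export PMIX_MCA_pmix_server_event_verbose=5
export OMPI_MCA_pmix_base_verbose=10
srun --mpi=pmix_v2 -n 16 xhpl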

> On Jan 18, 2019, at 12:09 PM, Michael Di Domenico  
> wrote:
> 
> i compiled pmix slurm openmpi
> 
> ---pmix
> ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13
> --disable-debug
> ---slurm
> ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13
> --with-pmix=/hpc/pmix/2.2
> ---openmpi
> ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external
> --with-libevent=external --with-slurm=/hpc/slurm/18.08
> --with-pmix=/hpc/pmix/2.2
> 
> everything seemed to compile fine, but when i do an srun i get the
> below errors, however, if i salloc and then mpirun it seems to work
> fine.  i'm not quite sure where the breakdown is or how to debug it
> 
> ---
> 
> [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl
> [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file
> event/pmix_event_registration.c at line 101
> [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file
> event/pmix_event_registration.c at line 101
> [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file
> event/pmix_event_registration.c at line 101
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
>  ompi_interlib_declare
>  --> Returned "Would block" (-10) instead of "Success" (0)
> ...snipped...
> [labcmp6:18355] *** An error occurred in MPI_Init
> [labcmp6:18355] *** reported by process [140726281390153,15]
> [labcmp6:18355] *** on a NULL communicator
> [labcmp6:18355] *** Unknown error
> [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this
> communicator will now abort,
> [labcmp6:18355] ***and potentially your MPI job)
> [labcmp6:18352] *** An error occurred in MPI_Init
> [labcmp6:18352] *** reported by process [1677936713,12]
> [labcmp6:18352] *** on a NULL communicator
> [labcmp6:18352] *** Unknown error
> [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this
> communicator will now abort,
> [labcmp6:18352] ***and potentially your MPI job)
> [labcmp6:18354] *** An error occurred in MPI_Init
> [labcmp6:18354] *** reported by process [140726281390153,14]
> [labcmp6:18354] *** on a NULL communicator
> [labcmp6:18354] *** Unknown error
> [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this
> communicator will now abort,
> [labcmp6:18354] ***and potentially your MPI job)
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> slurmstepd: error: *** STEP 24.0 ON labcmp3 CANCELLED AT 2019-01-18T20:03:33 
> ***
> [labcmp5:18358] PMIX ERROR: NOT-SUPPORTED in file
> event/pmix_event_registration.c at line 101
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 
>  ompi_interlib_declare
>  --> Returned "Would block" (-10) instead of "Success" (0)
> --
> [labcmp5:18357] PMIX ERROR: NOT-SUPPORTED in file
> event/pmix_event_registration.c at line 101
> [labcmp5:18356] PMIX ERROR: NOT-SUPPORTED in file
> event/pmix_event_registration.c at line 101
> srun: error: labcmp6: tasks 12-15: Exited with exit code 1
> srun: error: labcmp3: tasks 0-3: Killed
> srun: error: labcmp4: tasks 4-7: Killed
> srun: error: labcmp5: tasks 8-11: Killed

[OMPI users] Fwd: pmix and srun

2019-01-18 Thread Michael Di Domenico
i compiled pmix slurm openmpi

---pmix
./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13
--disable-debug
---slurm
./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13
--with-pmix=/hpc/pmix/2.2
---openmpi
./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external
--with-libevent=external --with-slurm=/hpc/slurm/18.08
--with-pmix=/hpc/pmix/2.2

everything seemed to compile fine, but when i do an srun i get the
below errors; however, if i salloc and then mpirun it seems to work
fine.  i'm not quite sure where the breakdown is or how to debug it
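
(one thing i can still check is whether the pmix plugin is registered with this
slurm build at all, e.g.:

srun --mpi=list

but beyond that i'm not sure what to look at)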

---

[ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl
[labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file
event/pmix_event_registration.c at line 101
[labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file
event/pmix_event_registration.c at line 101
[labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file
event/pmix_event_registration.c at line 101
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_interlib_declare
  --> Returned "Would block" (-10) instead of "Success" (0)
...snipped...
[labcmp6:18355] *** An error occurred in MPI_Init
[labcmp6:18355] *** reported by process [140726281390153,15]
[labcmp6:18355] *** on a NULL communicator
[labcmp6:18355] *** Unknown error
[labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this
communicator will now abort,
[labcmp6:18355] ***and potentially your MPI job)
[labcmp6:18352] *** An error occurred in MPI_Init
[labcmp6:18352] *** reported by process [1677936713,12]
[labcmp6:18352] *** on a NULL communicator
[labcmp6:18352] *** Unknown error
[labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this
communicator will now abort,
[labcmp6:18352] ***and potentially your MPI job)
[labcmp6:18354] *** An error occurred in MPI_Init
[labcmp6:18354] *** reported by process [140726281390153,14]
[labcmp6:18354] *** on a NULL communicator
[labcmp6:18354] *** Unknown error
[labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this
communicator will now abort,
[labcmp6:18354] ***and potentially your MPI job)
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 24.0 ON labcmp3 CANCELLED AT 2019-01-18T20:03:33 ***
[labcmp5:18358] PMIX ERROR: NOT-SUPPORTED in file
event/pmix_event_registration.c at line 101
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_interlib_declare
  --> Returned "Would block" (-10) instead of "Success" (0)
--
[labcmp5:18357] PMIX ERROR: NOT-SUPPORTED in file
event/pmix_event_registration.c at line 101
[labcmp5:18356] PMIX ERROR: NOT-SUPPORTED in file
event/pmix_event_registration.c at line 101
srun: error: labcmp6: tasks 12-15: Exited with exit code 1
srun: error: labcmp3: tasks 0-3: Killed
srun: error: labcmp4: tasks 4-7: Killed
srun: error: labcmp5: tasks 8-11: Killed


Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-18 Thread Cabral, Matias A
Hi Matt,

Few comments/questions:

-  If your cluster has Omni-Path, you won’t need UCX. Instead you can run 
using PSM2, or alternatively OFI (a.k.a. Libfabric); see the sketch after this list.

-  With the command you shared below (4 ranks on the local node) a shared-memory 
transport (vader?) is, I think, being selected. So, if the job is not starting, this 
seems to be a runtime issue rather than a transport one... PMIx? slurm?
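
For reference, if you later want to force the PSM2 path for an Omni-Path run, 
the selection would look something like this (assuming the psm2 MTL was built 
in; it is not needed for the 4-rank local test):

mpirun --mca pml cm --mca mtl psm2 -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe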
Thanks
_MAC


From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Matt Thompson
Sent: Friday, January 18, 2019 10:27 AM
To: Open MPI Users 
Subject: Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users 
<users@lists.open-mpi.org> wrote:
On Jan 18, 2019, at 12:43 PM, Matt Thompson <fort...@gmail.com> wrote:
>
> With some help, I managed to build an Open MPI 4.0.0 with:

We can discuss each of these params to let you know what they are.

> ./configure --disable-wrapper-rpath --disable-wrapper-runpath

Did you have a reason for disabling these?  They're generally good things.  
What they do is add linker flags to the wrapper compilers (i.e., mpicc and 
friends) that basically put a default path to find libraries at run time (that 
can/will in most cases override LD_LIBRARY_PATH -- but you can override these 
linked-in-default-paths if you want/need to).

I've had these in my Open MPI builds for a while now. The reason was one of the 
libraries I need for the climate model I work on went nuts if both of them 
weren't there. It was originally the rpath one but then eventually (Open MPI 
3?) I had to add the runpath one. But I have been updating the libraries more 
aggressively recently (due to OS upgrades) so it's possible this is no longer 
needed.


> --with-psm2

Ensure that Open MPI can include support for the PSM2 library, and abort 
configure if it cannot.

> --with-slurm

Ensure that Open MPI can include support for SLURM, and abort configure if it 
cannot.

> --enable-mpi1-compatibility

Add support for MPI_Address and other MPI-1 functions that have since been 
deleted from the MPI 3.x specification.

> --with-ucx

Ensure that Open MPI can include support for UCX, and abort configure if it 
cannot.

> --with-pmix=/usr/nlocal/pmix/2.1

Tells Open MPI to use the PMIx that is installed at /usr/nlocal/pmix/2.1 
(instead of using the PMIx that is bundled internally to Open MPI's source code 
tree/expanded tarball).

Unless you have a reason to use the external PMIx, the internal/bundled PMIx is 
usually sufficient.

Ah. I did not know that. I figured if our SLURM was built linked to a specific 
PMIx v2 that I should build Open MPI with the same PMIx. I'll build an Open MPI 
4 without specifying this.


> --with-libevent=/usr

Same as previous; change "pmix" to "libevent" (i.e., use the external libevent 
instead of the bundled libevent).

> CC=icc CXX=icpc FC=ifort

Specify the exact compilers to use.

> The MPI 1 is because I need to build HDF5 eventually and I added psm2 because 
> it's an Omnipath cluster. The libevent was probably a red herring as 
> libevent-devel wasn't installed on the system. It was eventually, and I just 
> didn't remove the flag. And I saw no errors in the build!

Might as well remove the --with-libevent if you don't need it.

> However, I seem to have built an Open MPI that doesn't work:
>
> (1099)(master) $ mpirun --version
> mpirun (Open MPI) 4.0.0
>
> Report bugs to http://www.open-mpi.org/community/help/
> (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
>
> It just sits there...forever. Can the gurus here help me figure out what I 
> managed to break? Perhaps I added too much to my configure line? Not enough?

There could be a few things going on here.

Are you running inside a SLURM job?  E.g., in a "salloc" job, or in an "sbatch" 
script?

I have salloc'd 8 nodes of 40 cores each. Intel MPI 18 and 19 work just fine 
(as you'd hope on an Omnipath cluster), but for some reason Open MPI is twitchy 
on this cluster. I once managed to get Open MPI 3.0.1 working (a few months 
ago), and it had some interesting startup scaling I liked (slow at low core 
count, but getting close to Intel MPI at high core count), though it seemed to 
not work after about 100 nodes (4000 processes) or so.

--
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna 
Rampton

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-18 Thread Matt Thompson
On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org> wrote:

> On Jan 18, 2019, at 12:43 PM, Matt Thompson  wrote:
> >
> > With some help, I managed to build an Open MPI 4.0.0 with:
>
> We can discuss each of these params to let you know what they are.
>
> > ./configure --disable-wrapper-rpath --disable-wrapper-runpath
>
> Did you have a reason for disabling these?  They're generally good
> things.  What they do is add linker flags to the wrapper compilers (i.e.,
> mpicc and friends) that basically put a default path to find libraries at
> run time (that can/will in most cases override LD_LIBRARY_PATH -- but you
> can override these linked-in-default-paths if you want/need to).
>

I've had these in my Open MPI builds for a while now. The reason was one of
the libraries I need for the climate model I work on went nuts if both of
them weren't there. It was originally the rpath one but then eventually
(Open MPI 3?) I had to add the runpath one. But I have been updating the
libraries more aggressively recently (due to OS upgrades) so it's possible
this is no longer needed.


>
> > --with-psm2
>
> Ensure that Open MPI can include support for the PSM2 library, and abort
> configure if it cannot.
>
> > --with-slurm
>
> Ensure that Open MPI can include support for SLURM, and abort configure if
> it cannot.
>
> > --enable-mpi1-compatibility
>
> Add support for MPI_Address and other MPI-1 functions that have since been
> deleted from the MPI 3.x specification.
>
> > --with-ucx
>
> Ensure that Open MPI can include support for UCX, and abort configure if
> it cannot.
>
> > --with-pmix=/usr/nlocal/pmix/2.1
>
> Tells Open MPI to use the PMIx that is installed at /usr/nlocal/pmix/2.1
> (instead of using the PMIx that is bundled internally to Open MPI's source
> code tree/expanded tarball).
>
> Unless you have a reason to use the external PMIx, the internal/bundled
> PMIx is usually sufficient.
>

Ah. I did not know that. I figured that, since our SLURM was built against a
specific PMIx v2, I should build Open MPI with the same PMIx. I'll
build an Open MPI 4 without specifying this.
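
i.e., something like this (dropping the external PMIx, and the libevent flag
per the note below):

./configure --disable-wrapper-rpath --disable-wrapper-runpath --with-psm2
--with-slurm --enable-mpi1-compatibility --with-ucx CC=icc CXX=icpc FC=ifort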


>
> > --with-libevent=/usr
>
> Same as previous; change "pmix" to "libevent" (i.e., use the external
> libevent instead of the bundled libevent).
>
> > CC=icc CXX=icpc FC=ifort
>
> Specify the exact compilers to use.
>
> > The MPI 1 is because I need to build HDF5 eventually and I added psm2
> because it's an Omnipath cluster. The libevent was probably a red herring
> as libevent-devel wasn't installed on the system. It was eventually, and I
> just didn't remove the flag. And I saw no errors in the build!
>
> Might as well remove the --with-libevent if you don't need it.
>
> > However, I seem to have built an Open MPI that doesn't work:
> >
> > (1099)(master) $ mpirun --version
> > mpirun (Open MPI) 4.0.0
> >
> > Report bugs to http://www.open-mpi.org/community/help/
> > (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
> >
> > It just sits there...forever. Can the gurus here help me figure out what
> I managed to break? Perhaps I added too much to my configure line? Not
> enough?
>
> There could be a few things going on here.
>
> Are you running inside a SLURM job?  E.g., in a "salloc" job, or in an
> "sbatch" script?
>

I have salloc'd 8 nodes of 40 cores each. Intel MPI 18 and 19 work just
fine (as you'd hope on an Omnipath cluster), but for some reason Open MPI
is twitchy on this cluster. I once managed to get Open MPI 3.0.1 working (a
few months ago), and it had some interesting startup scaling I liked (slow
at low core count, but getting close to Intel MPI at high core count),
though it seemed to not work after about 100 nodes (4000 processes) or so.

-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-18 Thread Jeff Squyres (jsquyres) via users
On Jan 18, 2019, at 12:43 PM, Matt Thompson  wrote:
> 
> With some help, I managed to build an Open MPI 4.0.0 with:

We can discuss each of these params to let you know what they are.

> ./configure --disable-wrapper-rpath --disable-wrapper-runpath

Did you have a reason for disabling these?  They're generally good things.  
What they do is add linker flags to the wrapper compilers (i.e., mpicc and 
friends) that basically put a default path to find libraries at run time (that 
can/will in most cases override LD_LIBRARY_PATH -- but you can override these 
linked-in-default-paths if you want/need to).
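
(As an aside, you can see exactly what flags the wrappers add with:

mpicc --showme:compile
mpicc --showme:link

which just print the command line instead of invoking the underlying compiler.)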

> --with-psm2

Ensure that Open MPI can include support for the PSM2 library, and abort 
configure if it cannot.

> --with-slurm 

Ensure that Open MPI can include support for SLURM, and abort configure if it 
cannot.

> --enable-mpi1-compatibility

Add support for MPI_Address and other MPI-1 functions that have since been 
deleted from the MPI 3.x specification.

> --with-ucx

Ensure that Open MPI can include support for UCX, and abort configure if it 
cannot.

> --with-pmix=/usr/nlocal/pmix/2.1

Tells Open MPI to use the PMIx that is installed at /usr/nlocal/pmix/2.1 
(instead of using the PMIx that is bundled internally to Open MPI's source code 
tree/expanded tarball).

Unless you have a reason to use the external PMIx, the internal/bundled PMIx is 
usually sufficient.

> --with-libevent=/usr

Same as previous; change "pmix" to "libevent" (i.e., use the external libevent 
instead of the bundled libevent).

> CC=icc CXX=icpc FC=ifort

Specify the exact compilers to use.

> The MPI 1 is because I need to build HDF5 eventually and I added psm2 because 
> it's an Omnipath cluster. The libevent was probably a red herring as 
> libevent-devel wasn't installed on the system. It was eventually, and I just 
> didn't remove the flag. And I saw no errors in the build!

Might as well remove the --with-libevent if you don't need it.

> However, I seem to have built an Open MPI that doesn't work:
> 
> (1099)(master) $ mpirun --version
> mpirun (Open MPI) 4.0.0
> 
> Report bugs to http://www.open-mpi.org/community/help/
> (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
> 
> It just sits there...forever. Can the gurus here help me figure out what I 
> managed to break? Perhaps I added too much to my configure line? Not enough?

There could be a few things going on here.

Are you running inside a SLURM job?  E.g., in a "salloc" job, or in an "sbatch" 
script?

-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-18 Thread Matt Thompson
All,

With some help, I managed to build an Open MPI 4.0.0 with:

./configure --disable-wrapper-rpath --disable-wrapper-runpath --with-psm2
--with-slurm --enable-mpi1-compatibility --with-ucx
--with-pmix=/usr/nlocal/pmix/2.1 --with-libevent=/usr CC=icc CXX=icpc
FC=ifort

The MPI 1 is because I need to build HDF5 eventually and I added psm2
because it's an Omnipath cluster. The libevent was probably a red herring
as libevent-devel wasn't installed on the system. It was eventually, and I
just didn't remove the flag. And I saw no errors in the build!

However, I seem to have built an Open MPI that doesn't work:

(1099)(master) $ mpirun --version
mpirun (Open MPI) 4.0.0

Report bugs to http://www.open-mpi.org/community/help/
(1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe

It just sits there...forever. Can the gurus here help me figure out what I
managed to break? Perhaps I added too much to my configure line? Not enough?

Thanks,
Matt

On Thu, Jan 17, 2019 at 11:10 AM Matt Thompson  wrote:

> Dear Open MPI Gurus,
>
> A cluster I use recently updated their SLURM to have support for UCX and
> PMIx. These are names I've seen and heard often at SC BoFs and posters, but
> now is my first time to play with them.
>
> So, my first question is how exactly should I build Open MPI to try these
> features out. I'm guessing I'll need things like "--with-ucx" to test UCX,
> but is anything needed for PMIx?
>
> Second, when it comes to running Open MPI, are there new MCA parameters I
> need to look out for when testing?
>
> Sorry for the generic questions, but I'm more on the user end of the
> cluster than the administrator end, so I tend to get lost in the detailed
> presentations, etc. I see online.
>
> Thanks,
> Matt
> --
> Matt Thompson
>“The fact is, this is about us identifying what we do best and
>finding more ways of doing less of it better” -- Director of Better
> Anna Rampton
>


-- 
Matt Thompson
   “The fact is, this is about us identifying what we do best and
   finding more ways of doing less of it better” -- Director of Better Anna
Rampton