Re: [OMPI users] Fwd: Minimum time between MPI_Bcast or MPI_Reduce calls?
Jeff, that could be a copy/paste error and/or an email client issue. The syntax is mpirun --mca variable value ... (short hyphen, short hyphen, m, c, a). The error message is about the missing —-mca executable (long hyphen, short hyphen, m, c, a); this is most likely the root cause of this issue. Another option is to set these parameters via the environment:
export OMPI_MCA_coll_sync_priority=100
export OMPI_MCA_coll_sync_barrier_after=10
and then invoke mpirun without the --mca options.
Cheers,
Gilles
On Sat, Jan 19, 2019 at 11:28 AM Jeff Wentworth via users wrote: > > Hi, > > Thanks for the quick response. But it looks like I am missing something > because neither -mca nor --mca is being recognized by my mpirun command. > > % mpirun --mca coll_sync_priority 100 --mca coll_sync_barrier_after 10 -q -np > 2 a.out > -- > mpirun was unable to find the specified executable file, and therefore > did not launch the job. This error was first reported for process > rank 0; it may have occurred for other processes as well. > > NOTE: A common cause for this error is misspelling a mpirun command > line parameter option (remember that mpirun interprets the first > unrecognized command line token as the executable). > > Node: mia > Executable: —-mca > -- > 4 total processes failed to start > > % which mpirun > /usr/local/bin/mpirun > % ls -l /usr/local/bin/mpirun > lrwxrwxrwx. 1 root root 7 Jan 15 20:50 /usr/local/bin/mpirun -> orterun > > jw2002 > > > On Fri, 1/18/19, Nathan Hjelm via users wrote: > > Subject: [OMPI users] Fwd: Minimum time between MPI_Bcast or MPI_Reduce > calls? > To: "Open MPI Users" > Cc: "Nathan Hjelm" > Date: Friday, January 18, 2019, 9:00 PM > > > Since neither bcast nor reduce acts as > a barrier it is possible to run out of resources if either > of these calls (or both) are used in a tight loop. The sync > coll component exists for this scenario. You can enable it > by adding the following to mpirun (or setting these > variables through the environment or a file): > > —mca coll_sync_priority 100 —mca > coll_sync_barrier_after 10 > > > This will effectively throttle the > collective calls for you. You can also change the reduce to > an allreduce. > > > -Nathan > > > On Jan 18, 2019, at 6:31 PM, Jeff > Wentworth via users > wrote: > > > > Greetings everyone, > > > > I have a scientific code using > Open MPI (v3.1.3) that seems to work fine when MPI_Bcast() > and MPI_Reduce() calls are well spaced out in time. > Yet if the time between these calls is short, eventually one > of the nodes hangs at some random point, never returning > from the broadcast or reduce call. Is there some > minimum time between calls that needs to be obeyed in order > for Open MPI to process these reliably? > > > > The reason this has come up is > because I am trying to run in a multi-node environment some > established acceptance tests in order to verify that the > Open MPI configured version of the code yields the same > baseline result as the original single node version of the > code. These acceptance tests must pass in order for > the code to be considered validated and deliverable to the > customer. 
One of these acceptance tests that hangs > does involve 90 broadcasts and 90 reduces in a short period > of time (less than .01 cpu sec), as in: > > > > Broadcast #89 in > > Broadcast #89 out 8 bytes > > Calculate angle #89 > > Reduce #89 in > > Reduce #89 out 208 bytes > > Write result #89 to file on > service node > > Broadcast #90 in > > Broadcast #90 out 8 bytes > > Calculate angle #89 > > Reduce #90 in > > Reduce #90 out 208 bytes > > Write result #90 to file on > service node > > > > If I slow down the above > acceptance test, for example by running it under valgrind, > then it runs to completion and yields the correct > result. So it seems to suggest that something internal > to Open MPI is getting swamped. I understand that > these acceptance tests might be pushing the limit, given > that they involve so many short calculations combined with > frequent, yet tiny, transfers of data among nodes. > > > > Would it be worthwhile for me to > enforce with some minimum wait time between the MPI calls, > say 0.01 or 0.001 sec via nanosleep()? The only time > it would matter would be when acceptance tests are run, as > the situation doesn't arise when beefier runs are performed. > > > > > Thanks. > > > > jw2002 > > > ___ > > users mailing list > > users@lists.open-mpi.org > > https://lists.open-mpi.org/mailman/listinfo/users > > > ___ > users mailing list > users@lists.open-mpi.org >
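For completeness, here is what Nathan's suggestion looks like with plain ASCII double hyphens, in both the command-line and environment-variable forms Gilles describes above. This is a sketch reusing the 2-rank a.out invocation from Jeff's post; adjust -np and the executable to your own test case:

# Form 1: pass the MCA parameters on the mpirun command line (two ASCII hyphens before "mca")
% mpirun --mca coll_sync_priority 100 --mca coll_sync_barrier_after 10 -q -np 2 a.out

# Form 2: export the same parameters as OMPI_MCA_* environment variables
% export OMPI_MCA_coll_sync_priority=100
% export OMPI_MCA_coll_sync_barrier_after=10
% mpirun -q -np 2 a.out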
Re: [OMPI users] Fwd: Minimum time between MPI_Bcast or MPI_Reduce calls?
Hi, Thanks for the quick response. But it looks like I am missing something because neither -mca nor --mca is being recognized by my mpirun command. % mpirun --mca coll_sync_priority 100 --mca coll_sync_barrier_after 10 -q -np 2 a.out -- mpirun was unable to find the specified executable file, and therefore did not launch the job. This error was first reported for process rank 0; it may have occurred for other processes as well. NOTE: A common cause for this error is misspelling a mpirun command line parameter option (remember that mpirun interprets the first unrecognized command line token as the executable). Node: mia Executable: —-mca -- 4 total processes failed to start % which mpirun /usr/local/bin/mpirun % ls -l /usr/local/bin/mpirun lrwxrwxrwx. 1 root root 7 Jan 15 20:50 /usr/local/bin/mpirun -> orterun jw2002 On Fri, 1/18/19, Nathan Hjelm via users wrote: Subject: [OMPI users] Fwd: Minimum time between MPI_Bcast or MPI_Reduce calls? To: "Open MPI Users" Cc: "Nathan Hjelm" Date: Friday, January 18, 2019, 9:00 PM Since neither bcast nor reduce acts as a barrier it is possible to run out of resources if either of these calls (or both) are used in a tight loop. The sync coll component exists for this scenario. You can enable it by adding the following to mpirun (or setting these variables through the environment or a file): —mca coll_sync_priority 100 —mca coll_sync_barrier_after 10 This will effectively throttle the collective calls for you. You can also change the reduce to an allreduce. -Nathan > On Jan 18, 2019, at 6:31 PM, Jeff Wentworth via users wrote: > > Greetings everyone, > > I have a scientific code using Open MPI (v3.1.3) that seems to work fine when MPI_Bcast() and MPI_Reduce() calls are well spaced out in time. Yet if the time between these calls is short, eventually one of the nodes hangs at some random point, never returning from the broadcast or reduce call. Is there some minimum time between calls that needs to be obeyed in order for Open MPI to process these reliably? > > The reason this has come up is because I am trying to run in a multi-node environment some established acceptance tests in order to verify that the Open MPI configured version of the code yields the same baseline result as the original single node version of the code. These acceptance tests must pass in order for the code to be considered validated and deliverable to the customer. One of these acceptance tests that hangs does involve 90 broadcasts and 90 reduces in a short period of time (less than .01 cpu sec), as in: > > Broadcast #89 in > Broadcast #89 out 8 bytes > Calculate angle #89 > Reduce #89 in > Reduce #89 out 208 bytes > Write result #89 to file on service node > Broadcast #90 in > Broadcast #90 out 8 bytes > Calculate angle #89 > Reduce #90 in > Reduce #90 out 208 bytes > Write result #90 to file on service node > > If I slow down the above acceptance test, for example by running it under valgrind, then it runs to completion and yields the correct result. So it seems to suggest that something internal to Open MPI is getting swamped. I understand that these acceptance tests might be pushing the limit, given that they involve so many short calculations combined with frequent, yet tiny, transfers of data among nodes. > > Would it be worthwhile for me to enforce with some minimum wait time between the MPI calls, say 0.01 or 0.001 sec via nanosleep()? The only time it would matter would be when acceptance tests are run, as the situation doesn't arise when beefier runs are performed. 
> > Thanks. > > jw2002 > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
[OMPI users] Fwd: Minimum time between MPI_Bcast or MPI_Reduce calls?
Since neither bcast nor reduce acts as a barrier it is possible to run out of resources if either of these calls (or both) are used in a tight loop. The sync coll component exists for this scenario. You can enable it by adding the following to mpirun (or setting these variables through the environment or a file): —mca coll_sync_priority 100 —mca coll_sync_barrier_after 10 This will effectively throttle the collective calls for you. You can also change the reduce to an allreduce. -Nathan > On Jan 18, 2019, at 6:31 PM, Jeff Wentworth via users > wrote: > > Greetings everyone, > > I have a scientific code using Open MPI (v3.1.3) that seems to work fine when > MPI_Bcast() and MPI_Reduce() calls are well spaced out in time. Yet if the > time between these calls is short, eventually one of the nodes hangs at some > random point, never returning from the broadcast or reduce call. Is there > some minimum time between calls that needs to be obeyed in order for Open MPI > to process these reliably? > > The reason this has come up is because I am trying to run in a multi-node > environment some established acceptance tests in order to verify that the > Open MPI configured version of the code yields the same baseline result as > the original single node version of the code. These acceptance tests must > pass in order for the code to be considered validated and deliverable to the > customer. One of these acceptance tests that hangs does involve 90 > broadcasts and 90 reduces in a short period of time (less than .01 cpu sec), > as in: > > Broadcast #89 in > Broadcast #89 out 8 bytes > Calculate angle #89 > Reduce #89 in > Reduce #89 out 208 bytes > Write result #89 to file on service node > Broadcast #90 in > Broadcast #90 out 8 bytes > Calculate angle #89 > Reduce #90 in > Reduce #90 out 208 bytes > Write result #90 to file on service node > > If I slow down the above acceptance test, for example by running it under > valgrind, then it runs to completion and yields the correct result. So it > seems to suggest that something internal to Open MPI is getting swamped. I > understand that these acceptance tests might be pushing the limit, given that > they involve so many short calculations combined with frequent, yet tiny, > transfers of data among nodes. > > Would it be worthwhile for me to enforce with some minimum wait time between > the MPI calls, say 0.01 or 0.001 sec via nanosleep()? The only time it would > matter would be when acceptance tests are run, as the situation doesn't arise > when beefier runs are performed. > > Thanks. > > jw2002 > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
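Nathan's note that the variables can also be set "through the environment or a file" refers to Open MPI's MCA parameter files; a minimal sketch of the per-user file approach is below. The file path is the standard per-user Open MPI location and the values are the ones suggested above; the 2-rank a.out run is just the example from this thread:

# record the coll_sync settings in the per-user MCA parameter file
% mkdir -p $HOME/.openmpi
% cat >> $HOME/.openmpi/mca-params.conf <<'EOF'
# raise the sync component's priority and insert a barrier after every 10 collective operations
coll_sync_priority = 100
coll_sync_barrier_after = 10
EOF
# subsequent runs pick these settings up without any extra mpirun flags
% mpirun -q -np 2 a.out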
[OMPI users] Minimum time between MPI_Bcast or MPI_Reduce calls?
Greetings everyone,

I have a scientific code using Open MPI (v3.1.3) that seems to work fine when MPI_Bcast() and MPI_Reduce() calls are well spaced out in time. Yet if the time between these calls is short, eventually one of the nodes hangs at some random point, never returning from the broadcast or reduce call. Is there some minimum time between calls that needs to be obeyed in order for Open MPI to process these reliably?

The reason this has come up is because I am trying to run in a multi-node environment some established acceptance tests in order to verify that the Open MPI configured version of the code yields the same baseline result as the original single node version of the code. These acceptance tests must pass in order for the code to be considered validated and deliverable to the customer. One of these acceptance tests that hangs does involve 90 broadcasts and 90 reduces in a short period of time (less than .01 cpu sec), as in:

Broadcast #89 in
Broadcast #89 out 8 bytes
Calculate angle #89
Reduce #89 in
Reduce #89 out 208 bytes
Write result #89 to file on service node
Broadcast #90 in
Broadcast #90 out 8 bytes
Calculate angle #89
Reduce #90 in
Reduce #90 out 208 bytes
Write result #90 to file on service node

If I slow down the above acceptance test, for example by running it under valgrind, then it runs to completion and yields the correct result. So it seems to suggest that something internal to Open MPI is getting swamped. I understand that these acceptance tests might be pushing the limit, given that they involve so many short calculations combined with frequent, yet tiny, transfers of data among nodes.

Would it be worthwhile for me to enforce some minimum wait time between the MPI calls, say 0.01 or 0.001 sec via nanosleep()? The only time it would matter would be when acceptance tests are run, as the situation doesn't arise when beefier runs are performed.

Thanks.

jw2002
___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] pmix and srun
Good - thanks! > On Jan 18, 2019, at 3:25 PM, Michael Di Domenico > wrote: > > seems to be better now. jobs are running > > On Fri, Jan 18, 2019 at 6:17 PM Ralph H Castain wrote: >> >> I have pushed a fix to the v2.2 branch - could you please confirm it? >> >> >>> On Jan 18, 2019, at 2:23 PM, Ralph H Castain wrote: >>> >>> Aha - I found it. It’s a typo in the v2.2.1 release. Sadly, our Slurm >>> plugin folks seem to be off somewhere for awhile and haven’t been testing >>> it. Sigh. >>> >>> I’ll patch the branch and let you know - we’d appreciate the feedback. >>> Ralph >>> >>> On Jan 18, 2019, at 2:09 PM, Michael Di Domenico wrote: here's the branches i'm using. i did a git clone on the repo's and then a git checkout [ec2-user@labhead bin]$ cd /hpc/src/pmix/ [ec2-user@labhead pmix]$ git branch master * v2.2 [ec2-user@labhead pmix]$ cd ../slurm/ [ec2-user@labhead slurm]$ git branch * (detached from origin/slurm-18.08) master [ec2-user@labhead slurm]$ cd ../ompi/ [ec2-user@labhead ompi]$ git branch * (detached from origin/v3.1.x) master attached is the debug out from the run with the debugging turned on On Fri, Jan 18, 2019 at 4:30 PM Ralph H Castain wrote: > > Looks strange. I’m pretty sure Mellanox didn’t implement the event > notification system in the Slurm plugin, but you should only be trying to > call it if OMPI is registering a system-level event code - which OMPI 3.1 > definitely doesn’t do. > > If you are using PMIx v2.2.0, then please note that there is a bug in it > that slipped through our automated testing. I replaced it today with > v2.2.1 - you probably should update if that’s the case. However, that > wouldn’t necessarily explain this behavior. I’m not that familiar with > the Slurm plugin, but you might try adding > > PMIX_MCA_pmix_client_event_verbose=5 > PMIX_MCA_pmix_server_event_verbose=5 > OMPI_MCA_pmix_base_verbose=10 > > to your environment and see if that provides anything useful. > >> On Jan 18, 2019, at 12:09 PM, Michael Di Domenico >> wrote: >> >> i compilied pmix slurm openmpi >> >> ---pmix >> ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13 >> --disable-debug >> ---slurm >> ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13 >> --with-pmix=/hpc/pmix/2.2 >> ---openmpi >> ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external >> --with-libevent=external --with-slurm=/hpc/slurm/18.08 >> --with-pmix=/hpc/pmix/2.2 >> >> everything seemed to compile fine, but when i do an srun i get the >> below errors, however, if i salloc and then mpirun it seems to work >> fine. i'm not quite sure where the breakdown is or how to debug it >> >> --- >> >> [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl >> [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file >> event/pmix_event_registration.c at line 101 >> [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file >> event/pmix_event_registration.c at line 101 >> [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file >> event/pmix_event_registration.c at line 101 >> -- >> It looks like MPI_INIT failed for some reason; your parallel process is >> likely to abort. There are many reasons that a parallel process can >> fail during MPI_INIT; some of which are due to configuration or >> environment >> problems. This failure appears to be an internal failure; here's some >> additional information (which may only be relevant to an Open MPI >> developer): >> >> ompi_interlib_declare >> --> Returned "Would block" (-10) instead of "Success" (0) >> ...snipped... 
>> [labcmp6:18355] *** An error occurred in MPI_Init >> [labcmp6:18355] *** reported by process [140726281390153,15] >> [labcmp6:18355] *** on a NULL communicator >> [labcmp6:18355] *** Unknown error >> [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this >> communicator will now abort, >> [labcmp6:18355] ***and potentially your MPI job) >> [labcmp6:18352] *** An error occurred in MPI_Init >> [labcmp6:18352] *** reported by process [1677936713,12] >> [labcmp6:18352] *** on a NULL communicator >> [labcmp6:18352] *** Unknown error >> [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this >> communicator will now abort, >> [labcmp6:18352] ***and potentially your MPI job) >> [labcmp6:18354] *** An error occurred in MPI_Init >> [labcmp6:18354] *** reported by process [140726281390153,14] >> [labcmp6:18354] *** on a NULL communicator >> [labcmp6:18354] *** Unknown error >> [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL
Re: [OMPI users] pmix and srun
seems to be better now. jobs are running On Fri, Jan 18, 2019 at 6:17 PM Ralph H Castain wrote: > > I have pushed a fix to the v2.2 branch - could you please confirm it? > > > > On Jan 18, 2019, at 2:23 PM, Ralph H Castain wrote: > > > > Aha - I found it. It’s a typo in the v2.2.1 release. Sadly, our Slurm > > plugin folks seem to be off somewhere for awhile and haven’t been testing > > it. Sigh. > > > > I’ll patch the branch and let you know - we’d appreciate the feedback. > > Ralph > > > > > >> On Jan 18, 2019, at 2:09 PM, Michael Di Domenico > >> wrote: > >> > >> here's the branches i'm using. i did a git clone on the repo's and > >> then a git checkout > >> > >> [ec2-user@labhead bin]$ cd /hpc/src/pmix/ > >> [ec2-user@labhead pmix]$ git branch > >> master > >> * v2.2 > >> [ec2-user@labhead pmix]$ cd ../slurm/ > >> [ec2-user@labhead slurm]$ git branch > >> * (detached from origin/slurm-18.08) > >> master > >> [ec2-user@labhead slurm]$ cd ../ompi/ > >> [ec2-user@labhead ompi]$ git branch > >> * (detached from origin/v3.1.x) > >> master > >> > >> > >> attached is the debug out from the run with the debugging turned on > >> > >> On Fri, Jan 18, 2019 at 4:30 PM Ralph H Castain wrote: > >>> > >>> Looks strange. I’m pretty sure Mellanox didn’t implement the event > >>> notification system in the Slurm plugin, but you should only be trying to > >>> call it if OMPI is registering a system-level event code - which OMPI 3.1 > >>> definitely doesn’t do. > >>> > >>> If you are using PMIx v2.2.0, then please note that there is a bug in it > >>> that slipped through our automated testing. I replaced it today with > >>> v2.2.1 - you probably should update if that’s the case. However, that > >>> wouldn’t necessarily explain this behavior. I’m not that familiar with > >>> the Slurm plugin, but you might try adding > >>> > >>> PMIX_MCA_pmix_client_event_verbose=5 > >>> PMIX_MCA_pmix_server_event_verbose=5 > >>> OMPI_MCA_pmix_base_verbose=10 > >>> > >>> to your environment and see if that provides anything useful. > >>> > On Jan 18, 2019, at 12:09 PM, Michael Di Domenico > wrote: > > i compilied pmix slurm openmpi > > ---pmix > ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13 > --disable-debug > ---slurm > ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13 > --with-pmix=/hpc/pmix/2.2 > ---openmpi > ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external > --with-libevent=external --with-slurm=/hpc/slurm/18.08 > --with-pmix=/hpc/pmix/2.2 > > everything seemed to compile fine, but when i do an srun i get the > below errors, however, if i salloc and then mpirun it seems to work > fine. i'm not quite sure where the breakdown is or how to debug it > > --- > > [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl > [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file > event/pmix_event_registration.c at line 101 > [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file > event/pmix_event_registration.c at line 101 > [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file > event/pmix_event_registration.c at line 101 > -- > It looks like MPI_INIT failed for some reason; your parallel process is > likely to abort. There are many reasons that a parallel process can > fail during MPI_INIT; some of which are due to configuration or > environment > problems. 
This failure appears to be an internal failure; here's some > additional information (which may only be relevant to an Open MPI > developer): > > ompi_interlib_declare > --> Returned "Would block" (-10) instead of "Success" (0) > ...snipped... > [labcmp6:18355] *** An error occurred in MPI_Init > [labcmp6:18355] *** reported by process [140726281390153,15] > [labcmp6:18355] *** on a NULL communicator > [labcmp6:18355] *** Unknown error > [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this > communicator will now abort, > [labcmp6:18355] ***and potentially your MPI job) > [labcmp6:18352] *** An error occurred in MPI_Init > [labcmp6:18352] *** reported by process [1677936713,12] > [labcmp6:18352] *** on a NULL communicator > [labcmp6:18352] *** Unknown error > [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this > communicator will now abort, > [labcmp6:18352] ***and potentially your MPI job) > [labcmp6:18354] *** An error occurred in MPI_Init > [labcmp6:18354] *** reported by process [140726281390153,14] > [labcmp6:18354] *** on a NULL communicator > [labcmp6:18354] *** Unknown error > [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this > communicator will now abort, > [labcmp6:18354] ***and potentially your MPI job) >
Re: [OMPI users] pmix and srun
I have pushed a fix to the v2.2 branch - could you please confirm it? > On Jan 18, 2019, at 2:23 PM, Ralph H Castain wrote: > > Aha - I found it. It’s a typo in the v2.2.1 release. Sadly, our Slurm plugin > folks seem to be off somewhere for awhile and haven’t been testing it. Sigh. > > I’ll patch the branch and let you know - we’d appreciate the feedback. > Ralph > > >> On Jan 18, 2019, at 2:09 PM, Michael Di Domenico >> wrote: >> >> here's the branches i'm using. i did a git clone on the repo's and >> then a git checkout >> >> [ec2-user@labhead bin]$ cd /hpc/src/pmix/ >> [ec2-user@labhead pmix]$ git branch >> master >> * v2.2 >> [ec2-user@labhead pmix]$ cd ../slurm/ >> [ec2-user@labhead slurm]$ git branch >> * (detached from origin/slurm-18.08) >> master >> [ec2-user@labhead slurm]$ cd ../ompi/ >> [ec2-user@labhead ompi]$ git branch >> * (detached from origin/v3.1.x) >> master >> >> >> attached is the debug out from the run with the debugging turned on >> >> On Fri, Jan 18, 2019 at 4:30 PM Ralph H Castain wrote: >>> >>> Looks strange. I’m pretty sure Mellanox didn’t implement the event >>> notification system in the Slurm plugin, but you should only be trying to >>> call it if OMPI is registering a system-level event code - which OMPI 3.1 >>> definitely doesn’t do. >>> >>> If you are using PMIx v2.2.0, then please note that there is a bug in it >>> that slipped through our automated testing. I replaced it today with v2.2.1 >>> - you probably should update if that’s the case. However, that wouldn’t >>> necessarily explain this behavior. I’m not that familiar with the Slurm >>> plugin, but you might try adding >>> >>> PMIX_MCA_pmix_client_event_verbose=5 >>> PMIX_MCA_pmix_server_event_verbose=5 >>> OMPI_MCA_pmix_base_verbose=10 >>> >>> to your environment and see if that provides anything useful. >>> On Jan 18, 2019, at 12:09 PM, Michael Di Domenico wrote: i compilied pmix slurm openmpi ---pmix ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13 --disable-debug ---slurm ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13 --with-pmix=/hpc/pmix/2.2 ---openmpi ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external --with-libevent=external --with-slurm=/hpc/slurm/18.08 --with-pmix=/hpc/pmix/2.2 everything seemed to compile fine, but when i do an srun i get the below errors, however, if i salloc and then mpirun it seems to work fine. i'm not quite sure where the breakdown is or how to debug it --- [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101 [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101 [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101 -- It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): ompi_interlib_declare --> Returned "Would block" (-10) instead of "Success" (0) ...snipped... 
[labcmp6:18355] *** An error occurred in MPI_Init [labcmp6:18355] *** reported by process [140726281390153,15] [labcmp6:18355] *** on a NULL communicator [labcmp6:18355] *** Unknown error [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, [labcmp6:18355] ***and potentially your MPI job) [labcmp6:18352] *** An error occurred in MPI_Init [labcmp6:18352] *** reported by process [1677936713,12] [labcmp6:18352] *** on a NULL communicator [labcmp6:18352] *** Unknown error [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, [labcmp6:18352] ***and potentially your MPI job) [labcmp6:18354] *** An error occurred in MPI_Init [labcmp6:18354] *** reported by process [140726281390153,14] [labcmp6:18354] *** on a NULL communicator [labcmp6:18354] *** Unknown error [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, [labcmp6:18354] ***and potentially your MPI job) srun: Job step aborted: Waiting up to 32 seconds for job step to finish. slurmstepd: error: *** STEP 24.0 ON labcmp3 CANCELLED AT 2019-01-18T20:03:33 *** [labcmp5:18358] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101
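For anyone following along, confirming a fix like this is essentially a rebuild of the external PMIx from the updated v2.2 branch into the same prefix, then a re-run of the failing step. A rough sketch using the paths from Michael's configure lines; note this assumes the existing, already-configured build tree, and that if the pull touches the PMIx build system files you may need to re-run autogen/configure first:

$ cd /hpc/src/pmix
$ git checkout v2.2
$ git pull                        # pick up the fix pushed to the v2.2 branch
$ make -j && make install         # reinstall into the existing --prefix=/hpc/pmix/2.2
$ srun --mpi=pmix_v2 -n 16 xhpl   # re-run the case that was failing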
Re: [OMPI users] Fwd: pmix and srun
Aha - I found it. It’s a typo in the v2.2.1 release. Sadly, our Slurm plugin folks seem to be off somewhere for awhile and haven’t been testing it. Sigh. I’ll patch the branch and let you know - we’d appreciate the feedback. Ralph > On Jan 18, 2019, at 2:09 PM, Michael Di Domenico > wrote: > > here's the branches i'm using. i did a git clone on the repo's and > then a git checkout > > [ec2-user@labhead bin]$ cd /hpc/src/pmix/ > [ec2-user@labhead pmix]$ git branch > master > * v2.2 > [ec2-user@labhead pmix]$ cd ../slurm/ > [ec2-user@labhead slurm]$ git branch > * (detached from origin/slurm-18.08) > master > [ec2-user@labhead slurm]$ cd ../ompi/ > [ec2-user@labhead ompi]$ git branch > * (detached from origin/v3.1.x) > master > > > attached is the debug out from the run with the debugging turned on > > On Fri, Jan 18, 2019 at 4:30 PM Ralph H Castain wrote: >> >> Looks strange. I’m pretty sure Mellanox didn’t implement the event >> notification system in the Slurm plugin, but you should only be trying to >> call it if OMPI is registering a system-level event code - which OMPI 3.1 >> definitely doesn’t do. >> >> If you are using PMIx v2.2.0, then please note that there is a bug in it >> that slipped through our automated testing. I replaced it today with v2.2.1 >> - you probably should update if that’s the case. However, that wouldn’t >> necessarily explain this behavior. I’m not that familiar with the Slurm >> plugin, but you might try adding >> >> PMIX_MCA_pmix_client_event_verbose=5 >> PMIX_MCA_pmix_server_event_verbose=5 >> OMPI_MCA_pmix_base_verbose=10 >> >> to your environment and see if that provides anything useful. >> >>> On Jan 18, 2019, at 12:09 PM, Michael Di Domenico >>> wrote: >>> >>> i compilied pmix slurm openmpi >>> >>> ---pmix >>> ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13 >>> --disable-debug >>> ---slurm >>> ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13 >>> --with-pmix=/hpc/pmix/2.2 >>> ---openmpi >>> ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external >>> --with-libevent=external --with-slurm=/hpc/slurm/18.08 >>> --with-pmix=/hpc/pmix/2.2 >>> >>> everything seemed to compile fine, but when i do an srun i get the >>> below errors, however, if i salloc and then mpirun it seems to work >>> fine. i'm not quite sure where the breakdown is or how to debug it >>> >>> --- >>> >>> [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl >>> [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file >>> event/pmix_event_registration.c at line 101 >>> [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file >>> event/pmix_event_registration.c at line 101 >>> [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file >>> event/pmix_event_registration.c at line 101 >>> -- >>> It looks like MPI_INIT failed for some reason; your parallel process is >>> likely to abort. There are many reasons that a parallel process can >>> fail during MPI_INIT; some of which are due to configuration or environment >>> problems. This failure appears to be an internal failure; here's some >>> additional information (which may only be relevant to an Open MPI >>> developer): >>> >>> ompi_interlib_declare >>> --> Returned "Would block" (-10) instead of "Success" (0) >>> ...snipped... 
>>> [labcmp6:18355] *** An error occurred in MPI_Init >>> [labcmp6:18355] *** reported by process [140726281390153,15] >>> [labcmp6:18355] *** on a NULL communicator >>> [labcmp6:18355] *** Unknown error >>> [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this >>> communicator will now abort, >>> [labcmp6:18355] ***and potentially your MPI job) >>> [labcmp6:18352] *** An error occurred in MPI_Init >>> [labcmp6:18352] *** reported by process [1677936713,12] >>> [labcmp6:18352] *** on a NULL communicator >>> [labcmp6:18352] *** Unknown error >>> [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this >>> communicator will now abort, >>> [labcmp6:18352] ***and potentially your MPI job) >>> [labcmp6:18354] *** An error occurred in MPI_Init >>> [labcmp6:18354] *** reported by process [140726281390153,14] >>> [labcmp6:18354] *** on a NULL communicator >>> [labcmp6:18354] *** Unknown error >>> [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this >>> communicator will now abort, >>> [labcmp6:18354] ***and potentially your MPI job) >>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish. >>> slurmstepd: error: *** STEP 24.0 ON labcmp3 CANCELLED AT >>> 2019-01-18T20:03:33 *** >>> [labcmp5:18358] PMIX ERROR: NOT-SUPPORTED in file >>> event/pmix_event_registration.c at line 101 >>> -- >>> It looks like MPI_INIT failed for some reason; your parallel process is >>> likely to abort. There are many reasons that a parallel process can >>> fail during MPI_INIT; some of which are due to configuration or
Re: [OMPI users] Fwd: pmix and srun
here's the branches i'm using. i did a git clone on the repo's and then a git checkout [ec2-user@labhead bin]$ cd /hpc/src/pmix/ [ec2-user@labhead pmix]$ git branch master * v2.2 [ec2-user@labhead pmix]$ cd ../slurm/ [ec2-user@labhead slurm]$ git branch * (detached from origin/slurm-18.08) master [ec2-user@labhead slurm]$ cd ../ompi/ [ec2-user@labhead ompi]$ git branch * (detached from origin/v3.1.x) master attached is the debug out from the run with the debugging turned on On Fri, Jan 18, 2019 at 4:30 PM Ralph H Castain wrote: > > Looks strange. I’m pretty sure Mellanox didn’t implement the event > notification system in the Slurm plugin, but you should only be trying to > call it if OMPI is registering a system-level event code - which OMPI 3.1 > definitely doesn’t do. > > If you are using PMIx v2.2.0, then please note that there is a bug in it that > slipped through our automated testing. I replaced it today with v2.2.1 - you > probably should update if that’s the case. However, that wouldn’t necessarily > explain this behavior. I’m not that familiar with the Slurm plugin, but you > might try adding > > PMIX_MCA_pmix_client_event_verbose=5 > PMIX_MCA_pmix_server_event_verbose=5 > OMPI_MCA_pmix_base_verbose=10 > > to your environment and see if that provides anything useful. > > > On Jan 18, 2019, at 12:09 PM, Michael Di Domenico > > wrote: > > > > i compilied pmix slurm openmpi > > > > ---pmix > > ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13 > > --disable-debug > > ---slurm > > ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13 > > --with-pmix=/hpc/pmix/2.2 > > ---openmpi > > ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external > > --with-libevent=external --with-slurm=/hpc/slurm/18.08 > > --with-pmix=/hpc/pmix/2.2 > > > > everything seemed to compile fine, but when i do an srun i get the > > below errors, however, if i salloc and then mpirun it seems to work > > fine. i'm not quite sure where the breakdown is or how to debug it > > > > --- > > > > [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl > > [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file > > event/pmix_event_registration.c at line 101 > > [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file > > event/pmix_event_registration.c at line 101 > > [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file > > event/pmix_event_registration.c at line 101 > > -- > > It looks like MPI_INIT failed for some reason; your parallel process is > > likely to abort. There are many reasons that a parallel process can > > fail during MPI_INIT; some of which are due to configuration or environment > > problems. This failure appears to be an internal failure; here's some > > additional information (which may only be relevant to an Open MPI > > developer): > > > > ompi_interlib_declare > > --> Returned "Would block" (-10) instead of "Success" (0) > > ...snipped... 
> > [labcmp6:18355] *** An error occurred in MPI_Init > > [labcmp6:18355] *** reported by process [140726281390153,15] > > [labcmp6:18355] *** on a NULL communicator > > [labcmp6:18355] *** Unknown error > > [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this > > communicator will now abort, > > [labcmp6:18355] ***and potentially your MPI job) > > [labcmp6:18352] *** An error occurred in MPI_Init > > [labcmp6:18352] *** reported by process [1677936713,12] > > [labcmp6:18352] *** on a NULL communicator > > [labcmp6:18352] *** Unknown error > > [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this > > communicator will now abort, > > [labcmp6:18352] ***and potentially your MPI job) > > [labcmp6:18354] *** An error occurred in MPI_Init > > [labcmp6:18354] *** reported by process [140726281390153,14] > > [labcmp6:18354] *** on a NULL communicator > > [labcmp6:18354] *** Unknown error > > [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this > > communicator will now abort, > > [labcmp6:18354] ***and potentially your MPI job) > > srun: Job step aborted: Waiting up to 32 seconds for job step to finish. > > slurmstepd: error: *** STEP 24.0 ON labcmp3 CANCELLED AT > > 2019-01-18T20:03:33 *** > > [labcmp5:18358] PMIX ERROR: NOT-SUPPORTED in file > > event/pmix_event_registration.c at line 101 > > -- > > It looks like MPI_INIT failed for some reason; your parallel process is > > likely to abort. There are many reasons that a parallel process can > > fail during MPI_INIT; some of which are due to configuration or environment > > problems. This failure appears to be an internal failure; here's some > > additional information (which may only be relevant to an Open MPI > > developer): > > > > ompi_interlib_declare > > --> Returned "Would block" (-10) instead of "Success" (0) > > -- > > [labcmp5:18357] PMIX
Re: [OMPI users] Fwd: pmix and srun
Looks strange. I’m pretty sure Mellanox didn’t implement the event notification system in the Slurm plugin, but you should only be trying to call it if OMPI is registering a system-level event code - which OMPI 3.1 definitely doesn’t do. If you are using PMIx v2.2.0, then please note that there is a bug in it that slipped through our automated testing. I replaced it today with v2.2.1 - you probably should update if that’s the case. However, that wouldn’t necessarily explain this behavior. I’m not that familiar with the Slurm plugin, but you might try adding PMIX_MCA_pmix_client_event_verbose=5 PMIX_MCA_pmix_server_event_verbose=5 OMPI_MCA_pmix_base_verbose=10 to your environment and see if that provides anything useful. > On Jan 18, 2019, at 12:09 PM, Michael Di Domenico > wrote: > > i compilied pmix slurm openmpi > > ---pmix > ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13 > --disable-debug > ---slurm > ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13 > --with-pmix=/hpc/pmix/2.2 > ---openmpi > ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external > --with-libevent=external --with-slurm=/hpc/slurm/18.08 > --with-pmix=/hpc/pmix/2.2 > > everything seemed to compile fine, but when i do an srun i get the > below errors, however, if i salloc and then mpirun it seems to work > fine. i'm not quite sure where the breakdown is or how to debug it > > --- > > [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl > [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file > event/pmix_event_registration.c at line 101 > [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file > event/pmix_event_registration.c at line 101 > [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file > event/pmix_event_registration.c at line 101 > -- > It looks like MPI_INIT failed for some reason; your parallel process is > likely to abort. There are many reasons that a parallel process can > fail during MPI_INIT; some of which are due to configuration or environment > problems. This failure appears to be an internal failure; here's some > additional information (which may only be relevant to an Open MPI > developer): > > ompi_interlib_declare > --> Returned "Would block" (-10) instead of "Success" (0) > ...snipped... > [labcmp6:18355] *** An error occurred in MPI_Init > [labcmp6:18355] *** reported by process [140726281390153,15] > [labcmp6:18355] *** on a NULL communicator > [labcmp6:18355] *** Unknown error > [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this > communicator will now abort, > [labcmp6:18355] ***and potentially your MPI job) > [labcmp6:18352] *** An error occurred in MPI_Init > [labcmp6:18352] *** reported by process [1677936713,12] > [labcmp6:18352] *** on a NULL communicator > [labcmp6:18352] *** Unknown error > [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this > communicator will now abort, > [labcmp6:18352] ***and potentially your MPI job) > [labcmp6:18354] *** An error occurred in MPI_Init > [labcmp6:18354] *** reported by process [140726281390153,14] > [labcmp6:18354] *** on a NULL communicator > [labcmp6:18354] *** Unknown error > [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this > communicator will now abort, > [labcmp6:18354] ***and potentially your MPI job) > srun: Job step aborted: Waiting up to 32 seconds for job step to finish. 
> slurmstepd: error: *** STEP 24.0 ON labcmp3 CANCELLED AT 2019-01-18T20:03:33 > *** > [labcmp5:18358] PMIX ERROR: NOT-SUPPORTED in file > event/pmix_event_registration.c at line 101 > -- > It looks like MPI_INIT failed for some reason; your parallel process is > likely to abort. There are many reasons that a parallel process can > fail during MPI_INIT; some of which are due to configuration or environment > problems. This failure appears to be an internal failure; here's some > additional information (which may only be relevant to an Open MPI > developer): > > ompi_interlib_declare > --> Returned "Would block" (-10) instead of "Success" (0) > -- > [labcmp5:18357] PMIX ERROR: NOT-SUPPORTED in file > event/pmix_event_registration.c at line 101 > [labcmp5:18356] PMIX ERROR: NOT-SUPPORTED in file > event/pmix_event_registration.c at line 101 > srun: error: labcmp6: tasks 12-15: Exited with exit code 1 > srun: error: labcmp3: tasks 0-3: Killed > srun: error: labcmp4: tasks 4-7: Killed > srun: error: labcmp5: tasks 8-11: Killed > ___ > users mailing list > users@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/users ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
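A sketch of how Ralph's verbosity settings could be applied to a single test run, using the same srun invocation from Michael's report. srun normally forwards the caller's environment to the tasks; the log file name here is just an example:

$ export PMIX_MCA_pmix_client_event_verbose=5
$ export PMIX_MCA_pmix_server_event_verbose=5
$ export OMPI_MCA_pmix_base_verbose=10
$ srun --mpi=pmix_v2 -n 16 xhpl 2>&1 | tee pmix-debug.log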
[OMPI users] Fwd: pmix and srun
i compilied pmix slurm openmpi ---pmix ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13 --disable-debug ---slurm ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13 --with-pmix=/hpc/pmix/2.2 ---openmpi ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external --with-libevent=external --with-slurm=/hpc/slurm/18.08 --with-pmix=/hpc/pmix/2.2 everything seemed to compile fine, but when i do an srun i get the below errors, however, if i salloc and then mpirun it seems to work fine. i'm not quite sure where the breakdown is or how to debug it --- [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101 [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101 [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101 -- It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): ompi_interlib_declare --> Returned "Would block" (-10) instead of "Success" (0) ...snipped... [labcmp6:18355] *** An error occurred in MPI_Init [labcmp6:18355] *** reported by process [140726281390153,15] [labcmp6:18355] *** on a NULL communicator [labcmp6:18355] *** Unknown error [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, [labcmp6:18355] ***and potentially your MPI job) [labcmp6:18352] *** An error occurred in MPI_Init [labcmp6:18352] *** reported by process [1677936713,12] [labcmp6:18352] *** on a NULL communicator [labcmp6:18352] *** Unknown error [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, [labcmp6:18352] ***and potentially your MPI job) [labcmp6:18354] *** An error occurred in MPI_Init [labcmp6:18354] *** reported by process [140726281390153,14] [labcmp6:18354] *** on a NULL communicator [labcmp6:18354] *** Unknown error [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, [labcmp6:18354] ***and potentially your MPI job) srun: Job step aborted: Waiting up to 32 seconds for job step to finish. slurmstepd: error: *** STEP 24.0 ON labcmp3 CANCELLED AT 2019-01-18T20:03:33 *** [labcmp5:18358] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101 -- It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): ompi_interlib_declare --> Returned "Would block" (-10) instead of "Success" (0) -- [labcmp5:18357] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101 [labcmp5:18356] PMIX ERROR: NOT-SUPPORTED in file event/pmix_event_registration.c at line 101 srun: error: labcmp6: tasks 12-15: Exited with exit code 1 srun: error: labcmp3: tasks 0-3: Killed srun: error: labcmp4: tasks 4-7: Killed srun: error: labcmp5: tasks 8-11: Killed ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
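Before digging into the NOT-SUPPORTED error itself, it can help to confirm what each layer was actually built against. A quick sanity check along these lines, using the install prefixes from the configure commands above (output will vary by installation):

$ srun --mpi=list                              # PMI/PMIx plugin types this Slurm build offers (should include pmix_v2)
$ srun --version                               # Slurm version actually being invoked
$ /hpc/ompi/3.1/bin/ompi_info | grep -i pmix   # which PMIx support Open MPI was configured with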
Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX
Hi Matt, Few comments/questions: - If your cluster has Omni-Path, you won’t need UCX. Instead you can run using PSM2, or alternatively OFI (a.k.a. Libfabric) - With the command you shared below (4 ranks on the local node) (I think) a shared mem transport is being selected (vader?). So, if the job is not starting this seems to be a runtime issue rather than transport…. Pmix? slurm? Thanks _MAC From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Matt Thompson Sent: Friday, January 18, 2019 10:27 AM To: Open MPI Users Subject: Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users mailto:users@lists.open-mpi.org>> wrote: On Jan 18, 2019, at 12:43 PM, Matt Thompson mailto:fort...@gmail.com>> wrote: > > With some help, I managed to build an Open MPI 4.0.0 with: We can discuss each of these params to let you know what they are. > ./configure --disable-wrapper-rpath --disable-wrapper-runpath Did you have a reason for disabling these? They're generally good things. What they do is add linker flags to the wrapper compilers (i.e., mpicc and friends) that basically put a default path to find libraries at run time (that can/will in most cases override LD_LIBRARY_PATH -- but you can override these linked-in-default-paths if you want/need to). I've had these in my Open MPI builds for a while now. The reason was one of the libraries I need for the climate model I work on went nuts if both of them weren't there. It was originally the rpath one but then eventually (Open MPI 3?) I had to add the runpath one. But I have been updating the libraries more aggressively recently (due to OS upgrades) so it's possible this is no longer needed. > --with-psm2 Ensure that Open MPI can include support for the PSM2 library, and abort configure if it cannot. > --with-slurm Ensure that Open MPI can include support for SLURM, and abort configure if it cannot. > --enable-mpi1-compatibility Add support for MPI_Address and other MPI-1 functions that have since been deleted from the MPI 3.x specification. > --with-ucx Ensure that Open MPI can include support for UCX, and abort configure if it cannot. > --with-pmix=/usr/nlocal/pmix/2.1 Tells Open MPI to use the PMIx that is installed at /usr/nlocal/pmix/2.1 (instead of using the PMIx that is bundled internally to Open MPI's source code tree/expanded tarball). Unless you have a reason to use the external PMIx, the internal/bundled PMIx is usually sufficient. Ah. I did not know that. I figured if our SLURM was built linked to a specific PMIx v2 that I should build Open MPI with the same PMIx. I'll build an Open MPI 4 without specifying this. > --with-libevent=/usr Same as previous; change "pmix" to "libevent" (i.e., use the external libevent instead of the bundled libevent). > CC=icc CXX=icpc FC=ifort Specify the exact compilers to use. > The MPI 1 is because I need to build HDF5 eventually and I added psm2 because > it's an Omnipath cluster. The libevent was probably a red herring as > libevent-devel wasn't installed on the system. It was eventually, and I just > didn't remove the flag. And I saw no errors in the build! Might as well remove the --with-libevent if you don't need it. > However, I seem to have built an Open MPI that doesn't work: > > (1099)(master) $ mpirun --version > mpirun (Open MPI) 4.0.0 > > Report bugs to http://www.open-mpi.org/community/help/ > (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe > > It just sits there...forever. 
Can the gurus here help me figure out what I > managed to break? Perhaps I added too much to my configure line? Not enough? There could be a few things going on here. Are you running inside a SLURM job? E.g., in a "salloc" job, or in an "sbatch" script? I have salloc'd 8 nodes of 40 cores each. Intel MPI 18 and 19 work just fine (as you'd hope on an Omnipath cluster), but for some reason Open MPI is twitchy on this cluster. I once managed to get Open MPI 3.0.1 working (a few months ago), and it had some interesting startup scaling I liked (slow at low core count, but getting close to Intel MPI at high core count), though it seemed to not work after about 100 nodes (4000 processes) or so. -- Matt Thompson “The fact is, this is about us identifying what we do best and finding more ways of doing less of it better” -- Director of Better Anna Rampton ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
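If UCX is dropped on an Omni-Path system as _MAC suggests, and the bundled PMIx and libevent are used as recommended in the quoted reply, the configure line from earlier in the thread reduces to something along these lines. This is a sketch, not a tested recipe, and the install prefix is just an example:

$ ./configure --prefix=$HOME/sw/openmpi-4.0.0 \
      --with-psm2 --with-slurm --enable-mpi1-compatibility \
      CC=icc CXX=icpc FC=ifort
$ make -j install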
Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX
On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users < users@lists.open-mpi.org> wrote: > On Jan 18, 2019, at 12:43 PM, Matt Thompson wrote: > > > > With some help, I managed to build an Open MPI 4.0.0 with: > > We can discuss each of these params to let you know what they are. > > > ./configure --disable-wrapper-rpath --disable-wrapper-runpath > > Did you have a reason for disabling these? They're generally good > things. What they do is add linker flags to the wrapper compilers (i.e., > mpicc and friends) that basically put a default path to find libraries at > run time (that can/will in most cases override LD_LIBRARY_PATH -- but you > can override these linked-in-default-paths if you want/need to). > I've had these in my Open MPI builds for a while now. The reason was one of the libraries I need for the climate model I work on went nuts if both of them weren't there. It was originally the rpath one but then eventually (Open MPI 3?) I had to add the runpath one. But I have been updating the libraries more aggressively recently (due to OS upgrades) so it's possible this is no longer needed. > > > --with-psm2 > > Ensure that Open MPI can include support for the PSM2 library, and abort > configure if it cannot. > > > --with-slurm > > Ensure that Open MPI can include support for SLURM, and abort configure if > it cannot. > > > --enable-mpi1-compatibility > > Add support for MPI_Address and other MPI-1 functions that have since been > deleted from the MPI 3.x specification. > > > --with-ucx > > Ensure that Open MPI can include support for UCX, and abort configure if > it cannot. > > > --with-pmix=/usr/nlocal/pmix/2.1 > > Tells Open MPI to use the PMIx that is installed at /usr/nlocal/pmix/2.1 > (instead of using the PMIx that is bundled internally to Open MPI's source > code tree/expanded tarball). > > Unless you have a reason to use the external PMIx, the internal/bundled > PMIx is usually sufficient. > Ah. I did not know that. I figured if our SLURM was built linked to a specific PMIx v2 that I should build Open MPI with the same PMIx. I'll build an Open MPI 4 without specifying this. > > > --with-libevent=/usr > > Same as previous; change "pmix" to "libevent" (i.e., use the external > libevent instead of the bundled libevent). > > > CC=icc CXX=icpc FC=ifort > > Specify the exact compilers to use. > > > The MPI 1 is because I need to build HDF5 eventually and I added psm2 > because it's an Omnipath cluster. The libevent was probably a red herring > as libevent-devel wasn't installed on the system. It was eventually, and I > just didn't remove the flag. And I saw no errors in the build! > > Might as well remove the --with-libevent if you don't need it. > > > However, I seem to have built an Open MPI that doesn't work: > > > > (1099)(master) $ mpirun --version > > mpirun (Open MPI) 4.0.0 > > > > Report bugs to http://www.open-mpi.org/community/help/ > > (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe > > > > It just sits there...forever. Can the gurus here help me figure out what > I managed to break? Perhaps I added too much to my configure line? Not > enough? > > There could be a few things going on here. > > Are you running inside a SLURM job? E.g., in a "salloc" job, or in an > "sbatch" script? > I have salloc'd 8 nodes of 40 cores each. Intel MPI 18 and 19 work just fine (as you'd hope on an Omnipath cluster), but for some reason Open MPI is twitchy on this cluster. 
I once managed to get Open MPI 3.0.1 working (a few months ago), and it had some interesting startup scaling I liked (slow at low core count, but getting close to Intel MPI at high core count), though it seemed to not work after about 100 nodes (4000 processes) or so. -- Matt Thompson “The fact is, this is about us identifying what we do best and finding more ways of doing less of it better” -- Director of Better Anna Rampton ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX
On Jan 18, 2019, at 12:43 PM, Matt Thompson wrote: > > With some help, I managed to build an Open MPI 4.0.0 with: We can discuss each of these params to let you know what they are. > ./configure --disable-wrapper-rpath --disable-wrapper-runpath Did you have a reason for disabling these? They're generally good things. What they do is add linker flags to the wrapper compilers (i.e., mpicc and friends) that basically put a default path to find libraries at run time (that can/will in most cases override LD_LIBRARY_PATH -- but you can override these linked-in-default-paths if you want/need to). > --with-psm2 Ensure that Open MPI can include support for the PSM2 library, and abort configure if it cannot. > --with-slurm Ensure that Open MPI can include support for SLURM, and abort configure if it cannot. > --enable-mpi1-compatibility Add support for MPI_Address and other MPI-1 functions that have since been deleted from the MPI 3.x specification. > --with-ucx Ensure that Open MPI can include support for UCX, and abort configure if it cannot. > --with-pmix=/usr/nlocal/pmix/2.1 Tells Open MPI to use the PMIx that is installed at /usr/nlocal/pmix/2.1 (instead of using the PMIx that is bundled internally to Open MPI's source code tree/expanded tarball). Unless you have a reason to use the external PMIx, the internal/bundled PMIx is usually sufficient. > --with-libevent=/usr Same as previous; change "pmix" to "libevent" (i.e., use the external libevent instead of the bundled libevent). > CC=icc CXX=icpc FC=ifort Specify the exact compilers to use. > The MPI 1 is because I need to build HDF5 eventually and I added psm2 because > it's an Omnipath cluster. The libevent was probably a red herring as > libevent-devel wasn't installed on the system. It was eventually, and I just > didn't remove the flag. And I saw no errors in the build! Might as well remove the --with-libevent if you don't need it. > However, I seem to have built an Open MPI that doesn't work: > > (1099)(master) $ mpirun --version > mpirun (Open MPI) 4.0.0 > > Report bugs to http://www.open-mpi.org/community/help/ > (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe > > It just sits there...forever. Can the gurus here help me figure out what I > managed to break? Perhaps I added too much to my configure line? Not enough? There could be a few things going on here. Are you running inside a SLURM job? E.g., in a "salloc" job, or in an "sbatch" script? -- Jeff Squyres jsquy...@cisco.com ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
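A quick way to check what the wrapper compilers will actually do with (or without) the rpath/runpath flags Jeff describes is to ask them to print the underlying command line; this is standard Open MPI wrapper behavior, though the exact output depends on the build:

$ mpicc --showme          # full compile/link command the wrapper would run
$ mpicc --showme:link     # just the link flags; any -Wl,-rpath entries added by the wrapper appear here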
Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX
All, With some help, I managed to build an Open MPI 4.0.0 with: ./configure --disable-wrapper-rpath --disable-wrapper-runpath --with-psm2 --with-slurm --enable-mpi1-compatibility --with-ucx --with-pmix=/usr/nlocal/pmix/2.1 --with-libevent=/usr CC=icc CXX=icpc FC=ifort The MPI 1 is because I need to build HDF5 eventually and I added psm2 because it's an Omnipath cluster. The libevent was probably a red herring as libevent-devel wasn't installed on the system. It was eventually, and I just didn't remove the flag. And I saw no errors in the build! However, I seem to have built an Open MPI that doesn't work: (1099)(master) $ mpirun --version mpirun (Open MPI) 4.0.0 Report bugs to http://www.open-mpi.org/community/help/ (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe It just sits there...forever. Can the gurus here help me figure out what I managed to break? Perhaps I added too much to my configure line? Not enough? Thanks, Matt On Thu, Jan 17, 2019 at 11:10 AM Matt Thompson wrote: > Dear Open MPI Gurus, > > A cluster I use recently updated their SLURM to have support for UCX and > PMIx. These are names I've seen and heard often at SC BoFs and posters, but > now is my first time to play with them. > > So, my first question is how exactly should I build Open MPI to try these > features out. I'm guessing I'll need things like "--with-ucx" to test UCX, > but is anything needed for PMIx? > > Second, when it comes to running Open MPI, are there new MCA parameters I > need to look out for when testing? > > Sorry for the generic questions, but I'm more on the user end of the > cluster than the administrator end, so I tend to get lost in the detailed > presentations, etc. I see online. > > Thanks, > Matt > -- > Matt Thompson >“The fact is, this is about us identifying what we do best and >finding more ways of doing less of it better” -- Director of Better > Anna Rampton > -- Matt Thompson “The fact is, this is about us identifying what we do best and finding more ways of doing less of it better” -- Director of Better Anna Rampton ___ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users
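When mpirun hangs silently inside an allocation like this, raising the launch-side and PMIx verbosity usually shows how far startup gets before it stalls. A sketch using standard Open MPI MCA verbosity parameters and the executable name from Matt's post:

# run inside the existing salloc allocation
$ mpirun -np 4 --mca plm_base_verbose 10 --mca pmix_base_verbose 10 \
      ./helloWorld.mpi3.SLES12.OMPI400.exe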