Re: [OMPI users] pmix and srun
Good - thanks!

> On Jan 18, 2019, at 3:25 PM, Michael Di Domenico wrote:
>
> seems to be better now. jobs are running
>
> On Fri, Jan 18, 2019 at 6:17 PM Ralph H Castain wrote:
>>
>> I have pushed a fix to the v2.2 branch - could you please confirm it?
>>
>> ...snipped...
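For anyone landing on this thread with the same NOT-SUPPORTED error from pmix_event_registration.c, a quick sanity check is to confirm that Slurm actually exposes a PMIx plugin and that Open MPI picked up the external PMIx it was configured against. A minimal sketch, assuming the install prefixes used in this thread (adjust the paths to your own layout):

    # list the MPI plugin types Slurm was built with; pmix/pmix_v2 should appear
    /hpc/slurm/18.08/bin/srun --mpi=list

    # check which pmix component Open MPI will use (external vs. internal)
    /hpc/ompi/3.1/bin/ompi_info | grep -i pmix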
Re: [OMPI users] pmix and srun
seems to be better now. jobs are running

On Fri, Jan 18, 2019 at 6:17 PM Ralph H Castain wrote:
>
> I have pushed a fix to the v2.2 branch - could you please confirm it?
>
> ...snipped...
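If the failure comes back, the verbosity settings suggested earlier in the thread can be applied just for the failing launch rather than set system-wide; srun exports the caller's environment to the tasks by default, so the PMIX_MCA_* variables reach the application processes as well. A rough sketch reusing the srun invocation from the original report (the pmix-debug.log name is arbitrary):

    PMIX_MCA_pmix_client_event_verbose=5 \
    PMIX_MCA_pmix_server_event_verbose=5 \
    OMPI_MCA_pmix_base_verbose=10 \
    srun --mpi=pmix_v2 -n 16 xhpl 2>&1 | tee pmix-debug.log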
Re: [OMPI users] pmix and srun
I have pushed a fix to the v2.2 branch - could you please confirm it?

> On Jan 18, 2019, at 2:23 PM, Ralph H Castain wrote:
>
> Aha - I found it. It’s a typo in the v2.2.1 release. Sadly, our Slurm plugin
> folks seem to be off somewhere for a while and haven’t been testing it. Sigh.
>
> I’ll patch the branch and let you know - we’d appreciate the feedback.
> Ralph
>
>> On Jan 18, 2019, at 2:09 PM, Michael Di Domenico wrote:
>>
>> here are the branches i'm using. i did a git clone on the repos and
>> then a git checkout
>>
>> [ec2-user@labhead bin]$ cd /hpc/src/pmix/
>> [ec2-user@labhead pmix]$ git branch
>>   master
>> * v2.2
>> [ec2-user@labhead pmix]$ cd ../slurm/
>> [ec2-user@labhead slurm]$ git branch
>> * (detached from origin/slurm-18.08)
>>   master
>> [ec2-user@labhead slurm]$ cd ../ompi/
>> [ec2-user@labhead ompi]$ git branch
>> * (detached from origin/v3.1.x)
>>   master
>>
>> attached is the debug output from the run with the debugging turned on
>>
>> On Fri, Jan 18, 2019 at 4:30 PM Ralph H Castain wrote:
>>>
>>> Looks strange. I’m pretty sure Mellanox didn’t implement the event
>>> notification system in the Slurm plugin, but you should only be trying to
>>> call it if OMPI is registering a system-level event code - which OMPI 3.1
>>> definitely doesn’t do.
>>>
>>> If you are using PMIx v2.2.0, then please note that there is a bug in it
>>> that slipped through our automated testing. I replaced it today with v2.2.1
>>> - you probably should update if that’s the case. However, that wouldn’t
>>> necessarily explain this behavior. I’m not that familiar with the Slurm
>>> plugin, but you might try adding
>>>
>>> PMIX_MCA_pmix_client_event_verbose=5
>>> PMIX_MCA_pmix_server_event_verbose=5
>>> OMPI_MCA_pmix_base_verbose=10
>>>
>>> to your environment and see if that provides anything useful.
>>>
>>>> On Jan 18, 2019, at 12:09 PM, Michael Di Domenico wrote:
>>>>
>>>> i compiled pmix, slurm, and openmpi:
>>>>
>>>> ---pmix
>>>> ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13
>>>> --disable-debug
>>>> ---slurm
>>>> ./configure --prefix=/hpc/slurm/18.08 --with-munge=/hpc/munge/0.5.13
>>>> --with-pmix=/hpc/pmix/2.2
>>>> ---openmpi
>>>> ./configure --prefix=/hpc/ompi/3.1 --with-hwloc=external
>>>> --with-libevent=external --with-slurm=/hpc/slurm/18.08
>>>> --with-pmix=/hpc/pmix/2.2
>>>>
>>>> everything seemed to compile fine, but when i do an srun i get the
>>>> errors below; however, if i salloc and then mpirun it seems to work
>>>> fine. i'm not quite sure where the breakdown is or how to debug it
>>>>
>>>> ---
>>>>
>>>> [ec2-user@labcmp1 linux]$ srun --mpi=pmix_v2 -n 16 xhpl
>>>> [labcmp6:18353] PMIX ERROR: NOT-SUPPORTED in file
>>>> event/pmix_event_registration.c at line 101
>>>> [labcmp6:18355] PMIX ERROR: NOT-SUPPORTED in file
>>>> event/pmix_event_registration.c at line 101
>>>> [labcmp5:18355] PMIX ERROR: NOT-SUPPORTED in file
>>>> event/pmix_event_registration.c at line 101
>>>> --
>>>> It looks like MPI_INIT failed for some reason; your parallel process is
>>>> likely to abort. There are many reasons that a parallel process can
>>>> fail during MPI_INIT; some of which are due to configuration or
>>>> environment problems. This failure appears to be an internal failure;
>>>> here's some additional information (which may only be relevant to an
>>>> Open MPI developer):
>>>>
>>>>   ompi_interlib_declare
>>>>   --> Returned "Would block" (-10) instead of "Success" (0)
>>>> ...snipped...
>>>> [labcmp6:18355] *** An error occurred in MPI_Init
>>>> [labcmp6:18355] *** reported by process [140726281390153,15]
>>>> [labcmp6:18355] *** on a NULL communicator
>>>> [labcmp6:18355] *** Unknown error
>>>> [labcmp6:18355] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>> [labcmp6:18355] ***    and potentially your MPI job)
>>>> [labcmp6:18352] *** An error occurred in MPI_Init
>>>> [labcmp6:18352] *** reported by process [1677936713,12]
>>>> [labcmp6:18352] *** on a NULL communicator
>>>> [labcmp6:18352] *** Unknown error
>>>> [labcmp6:18352] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>> [labcmp6:18352] ***    and potentially your MPI job)
>>>> [labcmp6:18354] *** An error occurred in MPI_Init
>>>> [labcmp6:18354] *** reported by process [140726281390153,14]
>>>> [labcmp6:18354] *** on a NULL communicator
>>>> [labcmp6:18354] *** Unknown error
>>>> [labcmp6:18354] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>>> [labcmp6:18354] ***    and potentially your MPI job)
>>>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>>>> slurmstepd: error: *** STEP 24.0 ON labcmp3 CANCELLED AT 2019-01-18T20:03:33 ***
>>>> [labcmp5:18358] PMIX ERROR: NOT-SUPPORTED in file
>>>> event/pmix_event_registration.c at line 101
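One way to pick up the fix on the v2.2 branch is to rebuild the PMIx install that Slurm and Open MPI were configured against and reinstall it into the same prefix. A rough sketch, assuming the source layout shown earlier in the thread:

    cd /hpc/src/pmix
    git checkout v2.2 && git pull      # pull in the fix pushed to the v2.2 branch
    ./autogen.pl                       # needed when building from a git checkout
    ./configure --prefix=/hpc/pmix/2.2 --with-munge=/hpc/munge/0.5.13 --disable-debug
    make && make install

Since both the Slurm mpi/pmix plugin and Open MPI link against the shared library under /hpc/pmix/2.2, reinstalling into the same prefix is normally sufficient; restarting slurmd on the compute nodes afterwards is a cautious way to make sure no stale copy of the library is still loaded.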