Ryan,

With srun it's fine; the problem appears only with mpirun, and that is both on a single node and on multiple nodes. SLURM was built against PMIx 2.0.2, and I am pretty sure that SLURM's default MPI plugin is pmix. We are running a recent patch level of SLURM, I think. SLURM and OMPI are both built against the same PMIx installation.
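Since SLURM and OMPI are supposed to resolve the same external PMIx, one quick sanity check is to run `ldd` on SLURM's pmix plugin and on OMPI's `libmpi` and compare where `libpmix` actually resolves. A minimal sketch (the plugin and library paths in the demo are guesses for this site's layout; substitute your own):

```python
import os
import re
import subprocess

def resolved_lib(ldd_output, libname):
    """Return the resolved path of `libname` from `ldd` output, or None.

    ldd lines look like: "libpmix.so.2 => /opt/pmix/2.0.2/lib/libpmix.so.2 (0x...)"
    """
    pattern = re.compile(r"\s*" + re.escape(libname) + r"\S*\s+=>\s+(\S+)")
    for line in ldd_output.splitlines():
        m = pattern.match(line)
        if m:
            return m.group(1)
    return None

def check(binary_or_plugin, libname="libpmix"):
    """Run ldd on a binary or plugin and report where `libname` resolves."""
    out = subprocess.run(["ldd", binary_or_plugin],
                         capture_output=True, text=True).stdout
    return resolved_lib(out, libname)

if __name__ == "__main__":
    # Hypothetical paths -- adjust to your SLURM plugin dir and OMPI lib dir.
    for f in ["/opt/slurm/lib64/slurm/mpi_pmix.so",
              "/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib/libmpi.so"]:
        if os.path.exists(f):
            print(f, "->", check(f))
```

If the two report different `libpmix` paths, the launcher and the MPI library are not actually sharing one PMIx installation.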
[bennet@cavium-hpc etc]$ srun --version
slurm 17.11.7

[bennet@cavium-hpc etc]$ grep pmi slurm.conf
MpiDefault=pmix

[bennet@cavium-hpc pmix]$ srun --mpi=list
srun: MPI types are...
srun: pmix_v2
srun: openmpi
srun: none
srun: pmi2
srun: pmix

I think I said that I was pretty sure I had gotten this to work with both mpirun and srun at one point, but I am unable to find the magic a second time.

On Mon, Jun 18, 2018 at 4:44 PM Ryan Novosielski <novos...@rutgers.edu> wrote:
>
> What MPI is SLURM set to use, and how was that compiled? Out of the box,
> the SLURM MPI is set to “none”, or was last I checked, and so isn’t
> necessarily doing MPI. Now, I did try this with OpenMPI 2.1.1 and it
> looked right either way (OpenMPI built with “--with-pmi”), but for
> MVAPICH2 this definitely made a difference:
>
> [novosirj@amarel1 novosirj]$ srun --mpi=none -N 4 -n 16 --ntasks-per-node=4 ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
> Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
> [ the same "rank 0 out of 1 processors" message repeats for 14 more tasks on slepner028, slepner031, slepner032, and slepner035 ]
> [slepner032.amarel.rutgers.edu:mpi_rank_0][error_sighandler] Caught error: Bus error (signal 7)
> srun: error: slepner032: task 10: Bus error
>
> [novosirj@amarel1 novosirj]$ srun --mpi=pmi2 -N 4 -n 16 --ntasks-per-node=4 ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
> Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 16 processors
> [ ranks 1 through 15 report "out of 16 processors" correctly across slepner028, slepner031, slepner032, and slepner035 ]
>
>
> On Jun 17, 2018, at 5:51 PM, Bennet Fauber <ben...@umich.edu> wrote:
> >
> > I rebuilt with --enable-debug, then ran with
> >
> > [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> > salloc: Pending job allocation 158
> > salloc: job 158 queued and waiting for resources
> > salloc: job 158 has been allocated resources
> > salloc: Granted job allocation 158
> >
> > [bennet@cavium-hpc ~]$ srun ./test_mpi
> > The sum = 0.866386
> > Elapsed time is: 5.426759
> > [ ten more sum/elapsed-time pairs elided ]
> > Total time is: 59.677830
> >
> > [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi 2>&1 | tee debug2.log
> >
> > The zipped debug log should be attached.
> >
> > I did that after using systemctl to turn off the firewall on the login
> > node from which the mpirun is executed, as well as on the host on
> > which it runs.
> >
> > [bennet@cavium-hpc ~]$ mpirun hostname
> > --------------------------------------------------------------------------
> > An ORTE daemon has unexpectedly failed after launch and before
> > communicating back to mpirun. This could be caused by a number
> > of factors, including an inability to create a connection back
> > to mpirun due to a lack of common network interfaces and/or no
> > route found between them. Please check network connectivity
> > (including firewalls and network routing requirements).
> > --------------------------------------------------------------------------
> >
> > [bennet@cavium-hpc ~]$ squeue
> >  JOBID PARTITION  NAME    USER ST   TIME NODES NODELIST(REASON)
> >    158  standard  bash  bennet  R  14:30     1 cav01
> >
> > [bennet@cavium-hpc ~]$ srun hostname
> > cav01.arc-ts.umich.edu
> > [ repeated 23 more times ]
> >
> > As always, your help is much appreciated,
> >
> > -- bennet
> >
> > On Sun, Jun 17, 2018 at 1:06 PM r...@open-mpi.org <r...@open-mpi.org> wrote:
> >>
> >> Add --enable-debug to your OMPI configure cmd line, and then add --mca
> >> plm_base_verbose 10 to your mpirun cmd line. For some reason, the remote
> >> daemon isn’t starting - this will give you some info as to why.
> >>
> >>
> >>> On Jun 17, 2018, at 9:07 AM, Bennet Fauber <ben...@umich.edu> wrote:
> >>>
> >>> I have a compiled binary that will run with srun but not with mpirun.
> >>> The attempts to run with mpirun all result in failures to initialize.
> >>> I have tried this on one node and on two nodes, with the firewall
> >>> turned on and with it off.
> >>>
> >>> Am I missing some command line option for mpirun?
> >>>
> >>> OMPI was built from this configure command:
> >>>
> >>> $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b
> >>> --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/share/man
> >>> --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
> >>> --with-hwloc=external --with-slurm --disable-dlopen CC=gcc CXX=g++
> >>> FC=gfortran
> >>>
> >>> All tests from `make check` passed; see below.
> >>>
> >>> [bennet@cavium-hpc ~]$ mpicc --show
> >>> gcc -I/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/include -pthread
> >>> -L/opt/pmix/2.0.2/lib -Wl,-rpath -Wl,/opt/pmix/2.0.2/lib -Wl,-rpath
> >>> -Wl,/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib
> >>> -Wl,--enable-new-dtags
> >>> -L/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib -lmpi
> >>>
> >>> The test_mpi was compiled with
> >>>
> >>> $ gcc -o test_mpi test_mpi.c -lm
> >>>
> >>> This is the runtime library path:
> >>>
> >>> [bennet@cavium-hpc ~]$ echo $LD_LIBRARY_PATH
> >>> /opt/slurm/lib64:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/hpc-utils/lib
> >>>
> >>> These commands are given in the exact sequence in which they were
> >>> entered at a console.
> >>>
> >>> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> >>> salloc: Pending job allocation 156
> >>> salloc: job 156 queued and waiting for resources
> >>> salloc: job 156 has been allocated resources
> >>> salloc: Granted job allocation 156
> >>>
> >>> [bennet@cavium-hpc ~]$ mpirun ./test_mpi
> >>> --------------------------------------------------------------------------
> >>> An ORTE daemon has unexpectedly failed after launch and before
> >>> communicating back to mpirun. This could be caused by a number
> >>> of factors, including an inability to create a connection back
> >>> to mpirun due to a lack of common network interfaces and/or no
> >>> route found between them. Please check network connectivity
> >>> (including firewalls and network routing requirements).
> >>> --------------------------------------------------------------------------
> >>>
> >>> [bennet@cavium-hpc ~]$ srun ./test_mpi
> >>> The sum = 0.866386
> >>> Elapsed time is: 5.425439
> >>> [ ten more sum/elapsed-time pairs elided ]
> >>> Total time is: 59.668622
> >>>
> >>> Thanks,
> >>> -- bennet
> >>>
> >>> make check results
> >>> ----------------------------------------------
> >>>
> >>> make check-TESTS
> >>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
> >>> PASS: predefined_gap_test
> >>> PASS: predefined_pad_test
> >>> SKIP: dlopen_test
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 3
> >>> # PASS:  2
> >>> # SKIP:  1
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> PASS: atomic_cmpset_noinline
> >>>   - 5 threads: Passed
> >>> PASS: atomic_cmpset_noinline
> >>>   - 8 threads: Passed
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 8
> >>> # PASS:  8
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/class'
> >>> PASS: ompi_rb_tree
> >>> PASS: opal_bitmap
> >>> PASS: opal_hash_table
> >>> PASS: opal_proc_table
> >>> PASS: opal_tree
> >>> PASS: opal_list
> >>> PASS: opal_value_array
> >>> PASS: opal_pointer_array
> >>> PASS: opal_lifo
> >>> PASS: opal_fifo
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 10
> >>> # PASS:  10
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> make opal_thread opal_condition
> >>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>>   CC       opal_thread.o
> >>>   CCLD     opal_thread
> >>>   CC       opal_condition.o
> >>>   CCLD     opal_condition
> >>> make[3]: Leaving directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>> make check-TESTS
> >>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 0
> >>> # PASS:  0
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/datatype'
> >>> PASS: opal_datatype_test
> >>> PASS: unpack_hetero
> >>> PASS: checksum
> >>> PASS: position
> >>> PASS: position_noncontig
> >>> PASS: ddt_test
> >>> PASS: ddt_raw
> >>> PASS: unpack_ooo
> >>> PASS: ddt_pack
> >>> PASS: external32
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 10
> >>> # PASS:  10
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/util'
> >>> PASS: opal_bit_ops
> >>> PASS: opal_path_nfs
> >>> PASS: bipartite_graph
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 3
> >>> # PASS:  3
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/dss'
> >>> PASS: dss_buffer
> >>> PASS: dss_cmp
> >>> PASS: dss_payload
> >>> PASS: dss_print
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 4
> >>> # PASS:  4
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> _______________________________________________
> >>> users mailing list
> >>> users@lists.open-mpi.org
> >>> https://lists.open-mpi.org/mailman/listinfo/users
> >
> > <debug2.log.gz>
>
> --
> ____
> || \\UTGERS,     |---------------------------*O*---------------------------
> ||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
>      `'

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
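The thread's symptom is that ranks started by `srun --mpi=pmix` wire up through slurmstepd's PMIx server, while `mpirun` must first launch its own orted daemons, and that launch is what fails here. When debugging which path a process actually took, it can help to inspect the marker environment variables the launchers leave behind. A small heuristic sketch (these are the variable names these launchers commonly set; exact names can vary by version):

```python
import os

# Marker environment variables set by common launchers, most specific first.
# This is a heuristic aid for debugging, not an exhaustive list.
_LAUNCHER_VARS = [
    ("OMPI_COMM_WORLD_RANK", "Open MPI mpirun/orted"),
    ("PMIX_RANK", "a PMIx server (e.g. srun --mpi=pmix)"),
    ("PMI_RANK", "a PMI-1/PMI-2 server (e.g. srun --mpi=pmi2)"),
    ("SLURM_PROCID", "SLURM srun"),
]

def detect_launcher(env=None):
    """Return (variable, description) for the first launcher marker found,
    or None if the process does not look like a managed MPI rank."""
    env = os.environ if env is None else env
    for var, desc in _LAUNCHER_VARS:
        if var in env:
            return var, desc
    return None

if __name__ == "__main__":
    hit = detect_launcher()
    print("launched by:", hit[1] if hit else "no recognized launcher")
```

Running this under `srun` versus `mpirun` (e.g. `srun python detect.py`) shows which bootstrap mechanism each rank actually saw, which can narrow down where the handoff is breaking.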