Ryan,

With srun it's fine; the problem only shows up with mpirun, both on a single
node and across multiple nodes.  SLURM was built against PMIx 2.0.2, and its
default MPI plugin is pmix (see slurm.conf below).  We are running a recent
SLURM patch release, 17.11.7.  SLURM and OMPI are both built against the same
PMIx installation.

[bennet@cavium-hpc etc]$ srun --version
slurm 17.11.7

[bennet@cavium-hpc etc]$ grep pmi slurm.conf
MpiDefault=pmix

[bennet@cavium-hpc pmix]$ srun --mpi=list
srun: MPI types are...
srun: pmix_v2
srun: openmpi
srun: none
srun: pmi2
srun: pmix

I believe I mentioned earlier that I had this working with both mpirun and
srun at one point, but I have not been able to reproduce that configuration
since.
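
In case it helps narrow this down, here is how I can double-check the PMIx
pieces on this end (ompi_info should list the pmix component OMPI was built
with, and pmix_v2 is one of the types srun reports above, so forcing it
explicitly seems like a fair test):

[bennet@cavium-hpc ~]$ ompi_info | grep -i pmix
[bennet@cavium-hpc ~]$ srun --mpi=pmix_v2 ./test_mpi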




On Mon, Jun 18, 2018 at 4:44 PM Ryan Novosielski <novos...@rutgers.edu> wrote:
>
> What MPI is SLURM set to use, and how was it compiled?  Out of the box, the SLURM
> MPI default is "none" (or was, last I checked), so srun isn't necessarily setting up
> an MPI environment at all.  I did try this with OpenMPI 2.1.1 and it looked right
> either way (OpenMPI built with "--with-pmi"), but for MVAPICH2 it definitely made a
> difference:
>
> [novosirj@amarel1 novosirj]$ srun --mpi=none -N 4 -n 16 --ntasks-per-node=4 ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
> Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
> [slepner032.amarel.rutgers.edu:mpi_rank_0][error_sighandler] Caught error: Bus error (signal 7)
> srun: error: slepner032: task 10: Bus error
>
> [novosirj@amarel1 novosirj]$ srun --mpi=pmi2 -N 4 -n 16 --ntasks-per-node=4 ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
> Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 16 processors
> Hello world from processor slepner028.amarel.rutgers.edu, rank 1 out of 16 processors
> Hello world from processor slepner028.amarel.rutgers.edu, rank 2 out of 16 processors
> Hello world from processor slepner028.amarel.rutgers.edu, rank 3 out of 16 processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 12 out of 16 processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 13 out of 16 processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 14 out of 16 processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 15 out of 16 processors
> Hello world from processor slepner031.amarel.rutgers.edu, rank 4 out of 16 processors
> Hello world from processor slepner031.amarel.rutgers.edu, rank 5 out of 16 processors
> Hello world from processor slepner031.amarel.rutgers.edu, rank 6 out of 16 processors
> Hello world from processor slepner031.amarel.rutgers.edu, rank 7 out of 16 processors
> Hello world from processor slepner032.amarel.rutgers.edu, rank 8 out of 16 processors
> Hello world from processor slepner032.amarel.rutgers.edu, rank 9 out of 16 processors
> Hello world from processor slepner032.amarel.rutgers.edu, rank 10 out of 16 processors
> Hello world from processor slepner032.amarel.rutgers.edu, rank 11 out of 16 processors
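>
> (If it helps, a quick way to see what the site default is without opening
> slurm.conf -- assuming scontrol is available to you -- is:
>
> scontrol show config | grep -i mpidefault
>
> which should print the MpiDefault line.)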
>
> > On Jun 17, 2018, at 5:51 PM, Bennet Fauber <ben...@umich.edu> wrote:
> >
> > I rebuilt with --enable-debug, then ran with
> >
> > [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> > salloc: Pending job allocation 158
> > salloc: job 158 queued and waiting for resources
> > salloc: job 158 has been allocated resources
> > salloc: Granted job allocation 158
> >
> > [bennet@cavium-hpc ~]$ srun ./test_mpi
> > The sum = 0.866386
> > Elapsed time is:  5.426759
> > The sum = 0.866386
> > Elapsed time is:  5.424068
> > The sum = 0.866386
> > Elapsed time is:  5.426195
> > The sum = 0.866386
> > Elapsed time is:  5.426059
> > The sum = 0.866386
> > Elapsed time is:  5.423192
> > The sum = 0.866386
> > Elapsed time is:  5.426252
> > The sum = 0.866386
> > Elapsed time is:  5.425444
> > The sum = 0.866386
> > Elapsed time is:  5.423647
> > The sum = 0.866386
> > Elapsed time is:  5.426082
> > The sum = 0.866386
> > Elapsed time is:  5.425936
> > The sum = 0.866386
> > Elapsed time is:  5.423964
> > Total time is:  59.677830
> >
> > [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi 2>&1 | tee debug2.log
> >
> > The zipped debug log should be attached.
> >
> > I did that after using systemctl to turn off the firewall on both the login
> > node from which mpirun is executed and the compute node on which the job
> > runs.
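> >
> > (Concretely, that was something along these lines -- assuming firewalld is
> > what is in use on these hosts:
> >
> > sudo systemctl stop firewalld
> >
> > run on both the login node and the compute node.)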
> >
> > [bennet@cavium-hpc ~]$ mpirun hostname
> > --------------------------------------------------------------------------
> > An ORTE daemon has unexpectedly failed after launch and before
> > communicating back to mpirun. This could be caused by a number
> > of factors, including an inability to create a connection back
> > to mpirun due to a lack of common network interfaces and/or no
> > route found between them. Please check network connectivity
> > (including firewalls and network routing requirements).
> > --------------------------------------------------------------------------
> >
> > [bennet@cavium-hpc ~]$ squeue
> >             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
> >               158  standard     bash   bennet  R      14:30      1 cav01
> > [bennet@cavium-hpc ~]$ srun hostname
> > cav01.arc-ts.umich.edu
> > [ repeated 23 more times ]
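> >
> > One thing I have not tried yet is pinning mpirun to a single interface.  A
> > sketch (the interface name here is just a placeholder and would need to be
> > whichever network the login and compute nodes actually share):
> >
> > [bennet@cavium-hpc ~]$ mpirun --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 hostname
> >
> > That should at least show whether the daemons are choosing the wrong
> > interface for the connection back to mpirun.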
> >
> > As always, your help is much appreciated,
> >
> > -- bennet
> >
> > On Sun, Jun 17, 2018 at 1:06 PM r...@open-mpi.org <r...@open-mpi.org> wrote:
> >>
> >> Add --enable-debug to your OMPI configure cmd line, and then add --mca 
> >> plm_base_verbose 10 to your mpirun cmd line. For some reason, the remote 
> >> daemon isn’t starting - this will give you some info as to why.
> >>
> >>
> >>> On Jun 17, 2018, at 9:07 AM, Bennet Fauber <ben...@umich.edu> wrote:
> >>>
> >>> I have a compiled binary that will run with srun but not with mpirun.
> >>> The attempts to run with mpirun all result in failures to initialize.
> >>> I have tried this on one node, and on two nodes, with firewall turned
> >>> on and with it off.
> >>>
> >>> Am I missing some command line option for mpirun?
> >>>
> >>> OMPI built from this configure command
> >>>
> >>> $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b
> >>> --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/share/man
> >>> --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
> >>> --with-hwloc=external --with-slurm --disable-dlopen CC=gcc CXX=g++
> >>> FC=gfortran
> >>>
> >>> All tests from `make check` passed, see below.
> >>>
> >>> [bennet@cavium-hpc ~]$ mpicc --show
> >>> gcc -I/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/include -pthread
> >>> -L/opt/pmix/2.0.2/lib -Wl,-rpath -Wl,/opt/pmix/2.0.2/lib -Wl,-rpath
> >>> -Wl,/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib
> >>> -Wl,--enable-new-dtags
> >>> -L/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib -lmpi
> >>>
> >>> The test_mpi was compiled with
> >>>
> >>> $ gcc -o test_mpi test_mpi.c -lm
> >>>
> >>> This is the runtime library path
> >>>
> >>> [bennet@cavium-hpc ~]$ echo $LD_LIBRARY_PATH
> >>> /opt/slurm/lib64:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/hpc-utils/lib
> >>>
> >>>
> >>> These commands are shown in the exact sequence in which they were entered
> >>> at the console.
> >>>
> >>> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> >>> salloc: Pending job allocation 156
> >>> salloc: job 156 queued and waiting for resources
> >>> salloc: job 156 has been allocated resources
> >>> salloc: Granted job allocation 156
> >>>
> >>> [bennet@cavium-hpc ~]$ mpirun ./test_mpi
> >>> --------------------------------------------------------------------------
> >>> An ORTE daemon has unexpectedly failed after launch and before
> >>> communicating back to mpirun. This could be caused by a number
> >>> of factors, including an inability to create a connection back
> >>> to mpirun due to a lack of common network interfaces and/or no
> >>> route found between them. Please check network connectivity
> >>> (including firewalls and network routing requirements).
> >>> --------------------------------------------------------------------------
> >>>
> >>> [bennet@cavium-hpc ~]$ srun ./test_mpi
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.425439
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.427427
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.422579
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.424168
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.423951
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.422414
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.427156
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.424834
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.425103
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.422415
> >>> The sum = 0.866386
> >>> Elapsed time is:  5.422948
> >>> Total time is:  59.668622
> >>>
> >>> Thanks,    -- bennet
> >>>
> >>>
> >>> make check results
> >>> ----------------------------------------------
> >>>
> >>> make  check-TESTS
> >>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
> >>> PASS: predefined_gap_test
> >>> PASS: predefined_pad_test
> >>> SKIP: dlopen_test
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 3
> >>> # PASS:  2
> >>> # SKIP:  1
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> PASS: atomic_cmpset_noinline
> >>>   - 5 threads: Passed
> >>> PASS: atomic_cmpset_noinline
> >>>   - 8 threads: Passed
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 8
> >>> # PASS:  8
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/class'
> >>> PASS: ompi_rb_tree
> >>> PASS: opal_bitmap
> >>> PASS: opal_hash_table
> >>> PASS: opal_proc_table
> >>> PASS: opal_tree
> >>> PASS: opal_list
> >>> PASS: opal_value_array
> >>> PASS: opal_pointer_array
> >>> PASS: opal_lifo
> >>> PASS: opal_fifo
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 10
> >>> # PASS:  10
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> make  opal_thread opal_condition
> >>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>> CC       opal_thread.o
> >>> CCLD     opal_thread
> >>> CC       opal_condition.o
> >>> CCLD     opal_condition
> >>> make[3]: Leaving directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>> make  check-TESTS
> >>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 0
> >>> # PASS:  0
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/datatype'
> >>> PASS: opal_datatype_test
> >>> PASS: unpack_hetero
> >>> PASS: checksum
> >>> PASS: position
> >>> PASS: position_noncontig
> >>> PASS: ddt_test
> >>> PASS: ddt_raw
> >>> PASS: unpack_ooo
> >>> PASS: ddt_pack
> >>> PASS: external32
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 10
> >>> # PASS:  10
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/util'
> >>> PASS: opal_bit_ops
> >>> PASS: opal_path_nfs
> >>> PASS: bipartite_graph
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 3
> >>> # PASS:  3
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> >>> [ elided ]
> >>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/dss'
> >>> PASS: dss_buffer
> >>> PASS: dss_cmp
> >>> PASS: dss_payload
> >>> PASS: dss_print
> >>> ============================================================================
> >>> Testsuite summary for Open MPI 3.1.0
> >>> ============================================================================
> >>> # TOTAL: 4
> >>> # PASS:  4
> >>> # SKIP:  0
> >>> # XFAIL: 0
> >>> # FAIL:  0
> >>> # XPASS: 0
> >>> # ERROR: 0
> >>> ============================================================================
> > <debug2.log.gz>
>
> --
> ____
> || \\UTGERS,     |---------------------------*O*---------------------------
> ||_// the State  |         Ryan Novosielski - novos...@rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
>      `'
>
