Re: [OMPI users] Fwd: srun works, mpirun does not
Well, this is kind of interesting. I can strip the configure line back and get mpirun to work on one node, but then neither srun nor mpirun within a SLURM job will run. I can add back configure options to get to

./configure \
  --prefix=${PREFIX} \
  --mandir=${PREFIX}/share/man \
  --with-pmix=/opt/pmix/2.0.2 \
  --with-slurm

and the situation does not seem to change. Then I add libevent,

./configure \
  --prefix=${PREFIX} \
  --mandir=${PREFIX}/share/man \
  --with-pmix=/opt/pmix/2.0.2 \
  --with-libevent=external \
  --with-slurm

and it works again with srun but fails to run the binary with mpirun. It is late, and I am baffled.

On Mon, Jun 18, 2018 at 9:02 PM Bennet Fauber wrote:
> [ elided ]
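When cycling through configure permutations like this, it may help to confirm what each installed build actually linked against before retrying srun and mpirun. A possible check, not from the thread itself (assumes orted lives under ${PREFIX}/bin; ompi_info output details vary by version):

$ ompi_info | grep -iE 'configure|pmix|libevent'
$ ldd ${PREFIX}/bin/orted | grep -iE 'pmix|libevent'

If the two builds report different PMIx or libevent libraries, that would narrow down which component the --with-libevent=external option is changing.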
Re: [OMPI users] Fwd: srun works, mpirun does not
Ryan,

With srun it's fine. Only with mpirun is there a problem, and that is both on a single node and on multiple nodes. SLURM was built against pmix 2.0.2, and I am pretty sure that SLURM's default is pmix. We are running a recent patch of SLURM, I think. SLURM and OMPI are both being built using the same installation of pmix.

[bennet@cavium-hpc etc]$ srun --version
slurm 17.11.7

[bennet@cavium-hpc etc]$ grep pmi slurm.conf
MpiDefault=pmix

[bennet@cavium-hpc pmix]$ srun --mpi=list
srun: MPI types are...
srun: pmix_v2
srun: openmpi
srun: none
srun: pmi2
srun: pmix

I think I said that I was pretty sure I had got this to work with both mpirun and srun at one point, but I am unable to find the magic a second time.

On Mon, Jun 18, 2018 at 4:44 PM Ryan Novosielski wrote:
> [ elided ]
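Since pmix_v2 shows up in that --mpi=list output, one additional sanity check (a sketch, not something tried in this thread) would be to force each plugin explicitly and see whether srun behavior changes:

$ srun --mpi=pmix ./test_mpi
$ srun --mpi=pmix_v2 ./test_mpi

If both behave the same as the default, the srun side is probably fine and the problem really is confined to mpirun's daemon launch.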
Re: [OMPI users] Fwd: srun works, mpirun does not
What MPI is SLURM set to use/how was that compiled? Out of the box, the SLURM MPI is set to “none”, or was last I checked, and so isn’t necessarily doing MPI. Now, I did try this with OpenMPI 2.1.1 and it looked right either way (OpenMPI built with “--with-pmi"), but for MVAPICH2 this definitely made a difference:

[novosirj@amarel1 novosirj]$ srun --mpi=none -N 4 -n 16 --ntasks-per-node=4 ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
[slepner032.amarel.rutgers.edu:mpi_rank_0][error_sighandler] Caught error: Bus error (signal 7)
srun: error: slepner032: task 10: Bus error

[novosirj@amarel1 novosirj]$ srun --mpi=pmi2 -N 4 -n 16 --ntasks-per-node=4 ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 16 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 1 out of 16 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 2 out of 16 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 3 out of 16 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 12 out of 16 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 13 out of 16 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 14 out of 16 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 15 out of 16 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 4 out of 16 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 5 out of 16 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 6 out of 16 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 7 out of 16 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 8 out of 16 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 9 out of 16 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 10 out of 16 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 11 out of 16 processors

> On Jun 17, 2018, at 5:51 PM, Bennet Fauber wrote:
> [ elided ]
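If slurm.conf is not directly readable on a given node, the effective default can usually also be queried back from the controller; a sketch (exact output formatting depends on the SLURM version):

$ scontrol show config | grep -i mpidefault
$ srun --mpi=list

Together these show both what SLURM will do by default and which MPI plugins are available to override it per job.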
Re: [OMPI users] Fwd: srun works, mpirun does not
I rebuilt with --enable-debug, then ran with

[bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
salloc: Pending job allocation 158
salloc: job 158 queued and waiting for resources
salloc: job 158 has been allocated resources
salloc: Granted job allocation 158

[bennet@cavium-hpc ~]$ srun ./test_mpi
The sum = 0.866386
Elapsed time is: 5.426759
The sum = 0.866386
Elapsed time is: 5.424068
The sum = 0.866386
Elapsed time is: 5.426195
The sum = 0.866386
Elapsed time is: 5.426059
The sum = 0.866386
Elapsed time is: 5.423192
The sum = 0.866386
Elapsed time is: 5.426252
The sum = 0.866386
Elapsed time is: 5.425444
The sum = 0.866386
Elapsed time is: 5.423647
The sum = 0.866386
Elapsed time is: 5.426082
The sum = 0.866386
Elapsed time is: 5.425936
The sum = 0.866386
Elapsed time is: 5.423964
Total time is: 59.677830

[bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi 2>&1 | tee debug2.log

The zipped debug log should be attached.

I did that after using systemctl to turn off the firewall on the login node from which the mpirun is executed, as well as on the host on which it runs.

[bennet@cavium-hpc ~]$ mpirun hostname
--
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--

[bennet@cavium-hpc ~]$ squeue
JOBID PARTITION NAME   USER ST  TIME NODES NODELIST(REASON)
  158  standard bash bennet  R 14:30     1 cav01

[bennet@cavium-hpc ~]$ srun hostname
cav01.arc-ts.umich.edu
[ repeated 23 more times ]

As always, your help is much appreciated,

-- bennet

On Sun, Jun 17, 2018 at 1:06 PM r...@open-mpi.org wrote:
> [ elided ]
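Since the ORTE error text points at "common network interfaces," another experiment worth noting here (a sketch, not something run in this thread; eth0 is a placeholder for whatever interface the login and compute nodes actually share) is to pin mpirun's out-of-band and TCP traffic to a specific interface:

$ mpirun --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 hostname

If that succeeds where plain mpirun fails, the daemon launch is probably picking the wrong interface for the callback connection rather than failing to start at all.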
Re: [OMPI users] Fwd: srun works, mpirun does not
Add --enable-debug to your OMPI configure cmd line, and then add --mca plm_base_verbose 10 to your mpirun cmd line. For some reason, the remote daemon isn’t starting - this will give you some info as to why.

> On Jun 17, 2018, at 9:07 AM, Bennet Fauber wrote:
> [ elided ]
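A minimal sketch of what that combination looks like in practice, reusing the prefix and PMIx paths from the original report below (some options omitted; the log file name is just a placeholder):

$ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b \
    --with-pmix=/opt/pmix/2.0.2 --with-libevent=external \
    --with-hwloc=external --with-slurm --enable-debug
$ make install
$ mpirun --mca plm_base_verbose 10 ./test_mpi 2>&1 | tee debug.log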
[OMPI users] Fwd: srun works, mpirun does not
I have a compiled binary that will run with srun but not with mpirun. The attempts to run with mpirun all result in failures to initialize. I have tried this on one node, and on two nodes, with the firewall turned on and with it off.

Am I missing some command line option for mpirun?

OMPI was built with this configure command:

$ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/share/man --with-pmix=/opt/pmix/2.0.2 --with-libevent=external --with-hwloc=external --with-slurm --disable-dlopen CC=gcc CXX=g++ FC=gfortran

All tests from `make check` passed, see below.

[bennet@cavium-hpc ~]$ mpicc --show
gcc -I/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/include -pthread -L/opt/pmix/2.0.2/lib -Wl,-rpath -Wl,/opt/pmix/2.0.2/lib -Wl,-rpath -Wl,/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib -Wl,--enable-new-dtags -L/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib -lmpi

test_mpi was compiled with

$ gcc -o test_mpi test_mpi.c -lm

This is the runtime library path:

[bennet@cavium-hpc ~]$ echo $LD_LIBRARY_PATH
/opt/slurm/lib64:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/hpc-utils/lib

These commands are given in the exact sequence in which they were entered at a console.

[bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
salloc: Pending job allocation 156
salloc: job 156 queued and waiting for resources
salloc: job 156 has been allocated resources
salloc: Granted job allocation 156

[bennet@cavium-hpc ~]$ mpirun ./test_mpi
--
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--

[bennet@cavium-hpc ~]$ srun ./test_mpi
The sum = 0.866386
Elapsed time is: 5.425439
The sum = 0.866386
Elapsed time is: 5.427427
The sum = 0.866386
Elapsed time is: 5.422579
The sum = 0.866386
Elapsed time is: 5.424168
The sum = 0.866386
Elapsed time is: 5.423951
The sum = 0.866386
Elapsed time is: 5.422414
The sum = 0.866386
Elapsed time is: 5.427156
The sum = 0.866386
Elapsed time is: 5.424834
The sum = 0.866386
Elapsed time is: 5.425103
The sum = 0.866386
Elapsed time is: 5.422415
The sum = 0.866386
Elapsed time is: 5.422948
Total time is: 59.668622

Thanks,

-- bennet

make check results
--

make check-TESTS
make[3]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
make[4]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
PASS: predefined_gap_test
PASS: predefined_pad_test
SKIP: dlopen_test

Testsuite summary for Open MPI 3.1.0
# TOTAL: 3
# PASS: 2
# SKIP: 1
# XFAIL: 0
# FAIL: 0
# XPASS: 0
# ERROR: 0

[ elided ]

PASS: atomic_cmpset_noinline
   - 5 threads: Passed
PASS: atomic_cmpset_noinline
   - 8 threads: Passed

Testsuite summary for Open MPI 3.1.0
# TOTAL: 8
# PASS: 8
# SKIP: 0
# XFAIL: 0
# FAIL: 0
# XPASS: 0
# ERROR: 0

[ elided ]

make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/class'
PASS: ompi_rb_tree
PASS: opal_bitmap
PASS: opal_hash_table
PASS: opal_proc_table
PASS: opal_tree
PASS: opal_list
PASS: opal_value_array
PASS: opal_pointer_array
PASS: opal_lifo
PASS: opal_fifo

Testsuite summary for Open MPI 3.1.0
# TOTAL: 10
# PASS: 10
# SKIP: 0
# XFAIL: 0
# FAIL: 0
# XPASS: 0
# ERROR: 0

[ elided ]

make opal_thread opal_condition
make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
CC opal_thread.o
CCLD opal_thread
CC opal_condition.o
CCLD opal_condition
make[3]: Leaving directory `/tmp/build/openmpi-3.1.0/test/threads'
make check-TESTS
make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'