Re: [OMPI users] Fwd: srun works, mpirun does not

2018-06-18 Thread Bennet Fauber
Well, this is kind of interesting.  I can strip the configure line
back and get mpirun to work on one node, but then neither srun nor
mpirun within a SLURM job will run.  I can add back configure options
to get to

./configure \
--prefix=${PREFIX} \
--mandir=${PREFIX}/share/man \
--with-pmix=/opt/pmix/2.0.2 \
--with-slurm

and the situation does not seem to change.  Then I add libevent,

./configure \
--prefix=${PREFIX} \
--mandir=${PREFIX}/share/man \
--with-pmix=/opt/pmix/2.0.2 \
--with-libevent=external \
--with-slurm

and it works again with srun but fails to run the binary with mpirun.

It is late, and I am baffled.
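
(For later comparison of the two builds, one quick check, with ${PREFIX} standing for whichever install is being tested, is to ask ompi_info which configure options and which PMIx/libevent support were actually baked in; a sketch:)

$ ${PREFIX}/bin/ompi_info | grep -i -E 'configure command|pmix|libevent'

Comparing that output between the build that runs under srun and the build that runs under mpirun should show whether the internal or the external libevent and PMIx ended up in each.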

On Mon, Jun 18, 2018 at 9:02 PM Bennet Fauber  wrote:
>
> Ryan,
>
> With srun it's fine.  Only with mpirun is there a problem, and that is
> both on a single node and on multiple nodes.  SLURM was built against
> pmix 2.0.2, and I am pretty sure that SLURM's default is pmix.  We are
> running a recent patch of SLURM, I think.  SLURM and OMPI are both
> being built using the same installation of pmix.
>
> [bennet@cavium-hpc etc]$ srun --version
> slurm 17.11.7
>
> [bennet@cavium-hpc etc]$ grep pmi slurm.conf
> MpiDefault=pmix
>
> [bennet@cavium-hpc pmix]$ srun --mpi=list
> srun: MPI types are...
> srun: pmix_v2
> srun: openmpi
> srun: none
> srun: pmi2
> srun: pmix
>
> I think I said that I was pretty sure I had got this to work with both
> mpirun and srun at one point, but I am unable to find the magic a
> second time.
>
>
>
>
> On Mon, Jun 18, 2018 at 4:44 PM Ryan Novosielski  wrote:
> >
> > What MPI is SLURM set to use/how was that compiled? Out of the box, the 
> > SLURM MPI is set to “none”, or was last I checked, and so isn’t necessarily 
> > doing MPI. Now, I did try this with OpenMPI 2.1.1 and it looked right 
> > either way (OpenMPI built with "--with-pmi"), but for MVAPICH2 this
> > definitely made a difference:
> >
> > [novosirj@amarel1 novosirj]$ srun --mpi=none -N 4 -n 16 --ntasks-per-node=4 ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
> > Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
> > [slepner032.amarel.rutgers.edu:mpi_rank_0][error_sighandler] Caught error: Bus error (signal 7)
> > srun: error: slepner032: task 10: Bus error
> >
> > [novosirj@amarel1 novosirj]$ srun --mpi=pmi2 -N 4 -n 16 --ntasks-per-node=4 ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 16 processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 1 out of 16 processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 2 out of 16 processors
> > Hello world from processor slepner028.amarel.rutgers.edu, rank 3 out of 16 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 12 out of 16 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 13 out of 16 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 14 out of 16 processors
> > Hello world from processor slepner035.amarel.rutgers.edu, rank 15 out of 16 processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 4 out of 16 processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 5 out of 16 processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 6 out of 16 processors
> > Hello world from processor slepner031.amarel.rutgers.edu, rank 7 out of 16 processors
> > Hello world from processor slepner032.amarel.rutgers.edu, rank 8 out of 16 processors
> > Hello world from 

Re: [OMPI users] Fwd: srun works, mpirun does not

2018-06-18 Thread Bennet Fauber
Ryan,

With srun it's fine.  Only with mpirun is there a problem, and that is
both on a single node and on multiple nodes.  SLURM was built against
pmix 2.0.2, and I am pretty sure that SLURM's default is pmix.  We are
running a recent patch of SLURM, I think.  SLURM and OMPI are both
being built using the same installation of pmix.

[bennet@cavium-hpc etc]$ srun --version
slurm 17.11.7

[bennet@cavium-hpc etc]$ grep pmi slurm.conf
MpiDefault=pmix

[bennet@cavium-hpc pmix]$ srun --mpi=list
srun: MPI types are...
srun: pmix_v2
srun: openmpi
srun: none
srun: pmi2
srun: pmix

I think I said that I was pretty sure I had got this to work with both
mpirun and srun at one point, but I am unable to find the magic a
second time.
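
(Since slurm.conf sets MpiDefault=pmix and srun --mpi=list shows both pmix and pmix_v2, one sanity check, sketched here with the test binary used later in this thread, is to request the plugin explicitly and compare with plain srun:)

[bennet@cavium-hpc ~]$ srun --mpi=pmix ./test_mpi
[bennet@cavium-hpc ~]$ srun --mpi=pmix_v2 ./test_mpi

Both invocations should behave the same as plain srun if the default really resolves to the PMIx v2 plugin.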




On Mon, Jun 18, 2018 at 4:44 PM Ryan Novosielski  wrote:
>
> What MPI is SLURM set to use/how was that compiled? Out of the box, the SLURM 
> MPI is set to “none”, or was last I checked, and so isn’t necessarily doing 
> MPI. Now, I did try this with OpenMPI 2.1.1 and it looked right either way 
> (OpenMPI built with "--with-pmi"), but for MVAPICH2 this definitely made a
> difference:
>
> [novosirj@amarel1 novosirj]$ srun --mpi=none -N 4 -n 16 --ntasks-per-node=4 ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
> Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
> Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
> [slepner032.amarel.rutgers.edu:mpi_rank_0][error_sighandler] Caught error: Bus error (signal 7)
> srun: error: slepner032: task 10: Bus error
>
> [novosirj@amarel1 novosirj]$ srun --mpi=pmi2 -N 4 -n 16 --ntasks-per-node=4 ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
> Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 16 processors
> Hello world from processor slepner028.amarel.rutgers.edu, rank 1 out of 16 processors
> Hello world from processor slepner028.amarel.rutgers.edu, rank 2 out of 16 processors
> Hello world from processor slepner028.amarel.rutgers.edu, rank 3 out of 16 processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 12 out of 16 processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 13 out of 16 processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 14 out of 16 processors
> Hello world from processor slepner035.amarel.rutgers.edu, rank 15 out of 16 processors
> Hello world from processor slepner031.amarel.rutgers.edu, rank 4 out of 16 processors
> Hello world from processor slepner031.amarel.rutgers.edu, rank 5 out of 16 processors
> Hello world from processor slepner031.amarel.rutgers.edu, rank 6 out of 16 processors
> Hello world from processor slepner031.amarel.rutgers.edu, rank 7 out of 16 processors
> Hello world from processor slepner032.amarel.rutgers.edu, rank 8 out of 16 processors
> Hello world from processor slepner032.amarel.rutgers.edu, rank 9 out of 16 processors
> Hello world from processor slepner032.amarel.rutgers.edu, rank 10 out of 16 processors
> Hello world from processor slepner032.amarel.rutgers.edu, rank 11 out of 16 processors
>
> > On Jun 17, 2018, at 5:51 PM, Bennet Fauber  wrote:
> >
> > I rebuilt with --enable-debug, then ran with
> >
> > [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> > salloc: Pending job allocation 158
> > salloc: job 158 queued and waiting for resources
> > salloc: job 158 has been allocated resources
> > salloc: Granted job allocation 158
> >
> > [bennet@cavium-hpc ~]$ srun ./test_mpi
> > The sum = 0.866386
> > Elapsed time is:  5.426759
> > The sum = 0.866386
> > Elapsed time is:  5.424068
> > The sum = 0.866386
> > Elapsed time is:  5.426195
> > The sum = 0.866386
> > Elapsed time is:  5.426059
> > The sum = 0.866386
> > Elapsed time is:  

Re: [OMPI users] Fwd: srun works, mpirun does not

2018-06-18 Thread Ryan Novosielski
What MPI is SLURM set to use/how was that compiled? Out of the box, the SLURM 
MPI is set to “none”, or was last I checked, and so isn’t necessarily doing 
MPI. Now, I did try this with OpenMPI 2.1.1 and it looked right either way 
(OpenMPI built with "--with-pmi"), but for MVAPICH2 this definitely made a
difference:

[novosirj@amarel1 novosirj]$ srun --mpi=none -N 4 -n 16 --ntasks-per-node=4 ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 0 out of 1 processors
[slepner032.amarel.rutgers.edu:mpi_rank_0][error_sighandler] Caught error: Bus error (signal 7)
srun: error: slepner032: task 10: Bus error

[novosirj@amarel1 novosirj]$ srun --mpi=pmi2 -N 4 -n 16 --ntasks-per-node=4 ./mpi_hello_world-intel-17.0.4-mvapich2-2.2
Hello world from processor slepner028.amarel.rutgers.edu, rank 0 out of 16 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 1 out of 16 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 2 out of 16 processors
Hello world from processor slepner028.amarel.rutgers.edu, rank 3 out of 16 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 12 out of 16 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 13 out of 16 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 14 out of 16 processors
Hello world from processor slepner035.amarel.rutgers.edu, rank 15 out of 16 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 4 out of 16 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 5 out of 16 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 6 out of 16 processors
Hello world from processor slepner031.amarel.rutgers.edu, rank 7 out of 16 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 8 out of 16 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 9 out of 16 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 10 out of 16 processors
Hello world from processor slepner032.amarel.rutgers.edu, rank 11 out of 16 processors
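
(For reference, the plugin srun uses when --mpi is not given is the site-wide default from slurm.conf; a minimal excerpt, using only values that appear in this thread:)

# slurm.conf (excerpt)
MpiDefault=pmix    # pmi2 and none are among the other values shown by srun --mpi=list

Individual jobs can still override the default per invocation, e.g. srun --mpi=pmi2 as above.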

> On Jun 17, 2018, at 5:51 PM, Bennet Fauber  wrote:
> 
> I rebuilt with --enable-debug, then ran with
> 
> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> salloc: Pending job allocation 158
> salloc: job 158 queued and waiting for resources
> salloc: job 158 has been allocated resources
> salloc: Granted job allocation 158
> 
> [bennet@cavium-hpc ~]$ srun ./test_mpi
> The sum = 0.866386
> Elapsed time is:  5.426759
> The sum = 0.866386
> Elapsed time is:  5.424068
> The sum = 0.866386
> Elapsed time is:  5.426195
> The sum = 0.866386
> Elapsed time is:  5.426059
> The sum = 0.866386
> Elapsed time is:  5.423192
> The sum = 0.866386
> Elapsed time is:  5.426252
> The sum = 0.866386
> Elapsed time is:  5.425444
> The sum = 0.866386
> Elapsed time is:  5.423647
> The sum = 0.866386
> Elapsed time is:  5.426082
> The sum = 0.866386
> Elapsed time is:  5.425936
> The sum = 0.866386
> Elapsed time is:  5.423964
> Total time is:  59.677830
> 
> [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi 2>&1 | tee debug2.log
> 
> The zipped debug log should be attached.
> 
> I did that after using systemctl to turn off the firewall on the login
> node from which the mpirun is executed, as well as on the host on
> which it runs.
> 
> [bennet@cavium-hpc ~]$ mpirun hostname
> --
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack 

Re: [OMPI users] Fwd: srun works, mpirun does not

2018-06-17 Thread Bennet Fauber
I rebuilt with --enable-debug, then ran with

[bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
salloc: Pending job allocation 158
salloc: job 158 queued and waiting for resources
salloc: job 158 has been allocated resources
salloc: Granted job allocation 158

[bennet@cavium-hpc ~]$ srun ./test_mpi
The sum = 0.866386
Elapsed time is:  5.426759
The sum = 0.866386
Elapsed time is:  5.424068
The sum = 0.866386
Elapsed time is:  5.426195
The sum = 0.866386
Elapsed time is:  5.426059
The sum = 0.866386
Elapsed time is:  5.423192
The sum = 0.866386
Elapsed time is:  5.426252
The sum = 0.866386
Elapsed time is:  5.425444
The sum = 0.866386
Elapsed time is:  5.423647
The sum = 0.866386
Elapsed time is:  5.426082
The sum = 0.866386
Elapsed time is:  5.425936
The sum = 0.866386
Elapsed time is:  5.423964
Total time is:  59.677830

[bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi 2>&1 | tee debug2.log

The zipped debug log should be attached.

I did that after using systemctl to turn off the firewall on the login
node from which the mpirun is executed, as well as on the host on
which it runs.
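
(The systemctl step would be roughly the following on a CentOS 7 node, assuming firewalld is the firewall service in use, run on both the login node and the allocated compute node:)

$ sudo systemctl stop firewalld
$ systemctl status firewalld   # should now report inactive (dead)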

[bennet@cavium-hpc ~]$ mpirun hostname
--
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--

[bennet@cavium-hpc ~]$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   158  standard     bash   bennet  R      14:30      1 cav01
[bennet@cavium-hpc ~]$ srun hostname
cav01.arc-ts.umich.edu
[ repeated 23 more times ]

As always, your help is much appreciated,

-- bennet

On Sun, Jun 17, 2018 at 1:06 PM r...@open-mpi.org  wrote:
>
> Add --enable-debug to your OMPI configure cmd line, and then add --mca 
> plm_base_verbose 10 to your mpirun cmd line. For some reason, the remote 
> daemon isn’t starting - this will give you some info as to why.
>
>
> > On Jun 17, 2018, at 9:07 AM, Bennet Fauber  wrote:
> >
> > I have a compiled binary that will run with srun but not with mpirun.
> > The attempts to run with mpirun all result in failures to initialize.
> > I have tried this on one node, and on two nodes, with firewall turned
> > on and with it off.
> >
> > Am I missing some command line option for mpirun?
> >
> > OMPI built from this configure command
> >
> >  $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b
> > --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/share/man
> > --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
> > --with-hwloc=external --with-slurm --disable-dlopen CC=gcc CXX=g++
> > FC=gfortran
> >
> > All tests from `make check` passed, see below.
> >
> > [bennet@cavium-hpc ~]$ mpicc --show
> > gcc -I/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/include -pthread
> > -L/opt/pmix/2.0.2/lib -Wl,-rpath -Wl,/opt/pmix/2.0.2/lib -Wl,-rpath
> > -Wl,/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib
> > -Wl,--enable-new-dtags
> > -L/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib -lmpi
> >
> > The test_mpi was compiled with
> >
> > $ gcc -o test_mpi test_mpi.c -lm
> >
> > This is the runtime library path
> >
> > [bennet@cavium-hpc ~]$ echo $LD_LIBRARY_PATH
> > /opt/slurm/lib64:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/hpc-utils/lib
> >
> >
> > These commands are given in exact sequence in which they were entered
> > at a console.
> >
> > [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> > salloc: Pending job allocation 156
> > salloc: job 156 queued and waiting for resources
> > salloc: job 156 has been allocated resources
> > salloc: Granted job allocation 156
> >
> > [bennet@cavium-hpc ~]$ mpirun ./test_mpi
> > --
> > An ORTE daemon has unexpectedly failed after launch and before
> > communicating back to mpirun. This could be caused by a number
> > of factors, including an inability to create a connection back
> > to mpirun due to a lack of common network interfaces and/or no
> > route found between them. Please check network connectivity
> > (including firewalls and network routing requirements).
> > --
> >
> > [bennet@cavium-hpc ~]$ srun ./test_mpi
> > The sum = 0.866386
> > Elapsed time is:  5.425439
> > The sum = 0.866386
> > Elapsed time is:  5.427427
> > The sum = 0.866386
> > Elapsed time is:  5.422579
> > The sum = 0.866386
> > Elapsed 

Re: [OMPI users] Fwd: srun works, mpirun does not

2018-06-17 Thread r...@open-mpi.org
Add --enable-debug to your OMPI configure cmd line, and then add --mca 
plm_base_verbose 10 to your mpirun cmd line. For some reason, the remote daemon 
isn’t starting - this will give you some info as to why.
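
(Spelled out against the configure line used earlier in this thread, those two steps would look roughly like the sketch below; ${PREFIX} and the log file name are placeholders:)

$ ./configure --prefix=${PREFIX} --with-pmix=/opt/pmix/2.0.2 --with-slurm --enable-debug
$ make && make install
$ mpirun --mca plm_base_verbose 10 ./test_mpi 2>&1 | tee plm_debug.log

The plm framework is the piece that launches the remote daemons, so its verbose output should show why the daemon fails to start.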


> On Jun 17, 2018, at 9:07 AM, Bennet Fauber  wrote:
> 
> I have a compiled binary that will run with srun but not with mpirun.
> The attempts to run with mpirun all result in failures to initialize.
> I have tried this on one node, and on two nodes, with firewall turned
> on and with it off.
> 
> Am I missing some command line option for mpirun?
> 
> OMPI built from this configure command
> 
>  $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b
> --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/share/man
> --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
> --with-hwloc=external --with-slurm --disable-dlopen CC=gcc CXX=g++
> FC=gfortran
> 
> All tests from `make check` passed, see below.
> 
> [bennet@cavium-hpc ~]$ mpicc --show
> gcc -I/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/include -pthread
> -L/opt/pmix/2.0.2/lib -Wl,-rpath -Wl,/opt/pmix/2.0.2/lib -Wl,-rpath
> -Wl,/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib
> -Wl,--enable-new-dtags
> -L/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib -lmpi
> 
> The test_mpi was compiled with
> 
> $ gcc -o test_mpi test_mpi.c -lm
> 
> This is the runtime library path
> 
> [bennet@cavium-hpc ~]$ echo $LD_LIBRARY_PATH
> /opt/slurm/lib64:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/hpc-utils/lib
> 
> 
> These commands are given in exact sequence in which they were entered
> at a console.
> 
> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> salloc: Pending job allocation 156
> salloc: job 156 queued and waiting for resources
> salloc: job 156 has been allocated resources
> salloc: Granted job allocation 156
> 
> [bennet@cavium-hpc ~]$ mpirun ./test_mpi
> --
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --
> 
> [bennet@cavium-hpc ~]$ srun ./test_mpi
> The sum = 0.866386
> Elapsed time is:  5.425439
> The sum = 0.866386
> Elapsed time is:  5.427427
> The sum = 0.866386
> Elapsed time is:  5.422579
> The sum = 0.866386
> Elapsed time is:  5.424168
> The sum = 0.866386
> Elapsed time is:  5.423951
> The sum = 0.866386
> Elapsed time is:  5.422414
> The sum = 0.866386
> Elapsed time is:  5.427156
> The sum = 0.866386
> Elapsed time is:  5.424834
> The sum = 0.866386
> Elapsed time is:  5.425103
> The sum = 0.866386
> Elapsed time is:  5.422415
> The sum = 0.866386
> Elapsed time is:  5.422948
> Total time is:  59.668622
> 
> Thanks,
> -- bennet
> 
> 
> make check results
> --
> 
> make  check-TESTS
> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
> PASS: predefined_gap_test
> PASS: predefined_pad_test
> SKIP: dlopen_test
> 
> Testsuite summary for Open MPI 3.1.0
> 
> # TOTAL: 3
> # PASS:  2
> # SKIP:  1
> # XFAIL: 0
> # FAIL:  0
> # XPASS: 0
> # ERROR: 0
> 
> [ elided ]
> PASS: atomic_cmpset_noinline
>- 5 threads: Passed
> PASS: atomic_cmpset_noinline
>- 8 threads: Passed
> 
> Testsuite summary for Open MPI 3.1.0
> 
> # TOTAL: 8
> # PASS:  8
> # SKIP:  0
> # XFAIL: 0
> # FAIL:  0
> # XPASS: 0
> # ERROR: 0
> 
> [ elided ]
> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/class'
> PASS: ompi_rb_tree
> PASS: opal_bitmap
> PASS: opal_hash_table
> PASS: opal_proc_table
> PASS: opal_tree
> PASS: opal_list
> PASS: opal_value_array
> PASS: opal_pointer_array
> PASS: opal_lifo
> PASS: opal_fifo
> 
> Testsuite summary for Open MPI 3.1.0
> 
> # TOTAL: 10
> # PASS:  10
> # SKIP:  0
> # XFAIL: 0
> # FAIL:  0
> # XPASS: 0
> # ERROR: 0
> 

[OMPI users] Fwd: srun works, mpirun does not

2018-06-17 Thread Bennet Fauber
I have a compiled binary that will run with srun but not with mpirun.
The attempts to run with mpirun all result in failures to initialize.
I have tried this on one node, and on two nodes, with firewall turned
on and with it off.

Am I missing some command line option for mpirun?

OMPI built from this configure command

  $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b
--mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/share/man
--with-pmix=/opt/pmix/2.0.2 --with-libevent=external
--with-hwloc=external --with-slurm --disable-dlopen CC=gcc CXX=g++
FC=gfortran

All tests from `make check` passed, see below.

[bennet@cavium-hpc ~]$ mpicc --show
gcc -I/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/include -pthread
-L/opt/pmix/2.0.2/lib -Wl,-rpath -Wl,/opt/pmix/2.0.2/lib -Wl,-rpath
-Wl,/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib
-Wl,--enable-new-dtags
-L/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib -lmpi

The test_mpi was compiled with

$ gcc -o test_mpi test_mpi.c -lm
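
(For comparison, compiling the same source through the wrapper whose expansion is shown by mpicc --show above would be, as a sketch:)

$ mpicc -o test_mpi test_mpi.c -lm

The wrapper adds the Open MPI include and library paths and -lmpi automatically.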

This is the runtime library path

[bennet@cavium-hpc ~]$ echo $LD_LIBRARY_PATH
/opt/slurm/lib64:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/hpc-utils/lib
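
(Given that LD_LIBRARY_PATH mixes the Slurm, PMIx, and Open MPI trees, one illustrative check is which shared libraries the launcher and the test binary actually resolve at run time:)

$ ldd /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/bin/mpirun | grep -i -E 'pmix|event'
$ ldd ./test_mpi

A library resolving from an unexpected directory is one way srun and mpirun can end up behaving differently.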


These commands are given in exact sequence in which they were entered
at a console.

[bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
salloc: Pending job allocation 156
salloc: job 156 queued and waiting for resources
salloc: job 156 has been allocated resources
salloc: Granted job allocation 156

[bennet@cavium-hpc ~]$ mpirun ./test_mpi
--
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--

[bennet@cavium-hpc ~]$ srun ./test_mpi
The sum = 0.866386
Elapsed time is:  5.425439
The sum = 0.866386
Elapsed time is:  5.427427
The sum = 0.866386
Elapsed time is:  5.422579
The sum = 0.866386
Elapsed time is:  5.424168
The sum = 0.866386
Elapsed time is:  5.423951
The sum = 0.866386
Elapsed time is:  5.422414
The sum = 0.866386
Elapsed time is:  5.427156
The sum = 0.866386
Elapsed time is:  5.424834
The sum = 0.866386
Elapsed time is:  5.425103
The sum = 0.866386
Elapsed time is:  5.422415
The sum = 0.866386
Elapsed time is:  5.422948
Total time is:  59.668622

Thanks,
-- bennet


make check results
--

make  check-TESTS
make[3]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
make[4]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
PASS: predefined_gap_test
PASS: predefined_pad_test
SKIP: dlopen_test

Testsuite summary for Open MPI 3.1.0

# TOTAL: 3
# PASS:  2
# SKIP:  1
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0

[ elided ]
PASS: atomic_cmpset_noinline
- 5 threads: Passed
PASS: atomic_cmpset_noinline
- 8 threads: Passed

Testsuite summary for Open MPI 3.1.0

# TOTAL: 8
# PASS:  8
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0

[ elided ]
make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/class'
PASS: ompi_rb_tree
PASS: opal_bitmap
PASS: opal_hash_table
PASS: opal_proc_table
PASS: opal_tree
PASS: opal_list
PASS: opal_value_array
PASS: opal_pointer_array
PASS: opal_lifo
PASS: opal_fifo

Testsuite summary for Open MPI 3.1.0

# TOTAL: 10
# PASS:  10
# SKIP:  0
# XFAIL: 0
# FAIL:  0
# XPASS: 0
# ERROR: 0

[ elided ]
make  opal_thread opal_condition
make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
  CC   opal_thread.o
  CCLD opal_thread
  CC   opal_condition.o
  CCLD opal_condition
make[3]: Leaving directory `/tmp/build/openmpi-3.1.0/test/threads'
make  check-TESTS
make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'