I doubt Slurm is the issue. For grins, let's try adding “--mca plm rsh” to your mpirun cmd line and see if that works.
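For example (a sketch, with test_mpi standing in for whatever binary you are launching):

    mpirun --mca plm rsh ./test_mpi

The same selection can also be made through the environment, since any MCA parameter can be passed as an OMPI_MCA_-prefixed variable:

    export OMPI_MCA_plm=rsh
    mpirun ./test_mpi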
> On Jun 18, 2018, at 12:57 PM, Bennet Fauber <ben...@umich.edu> wrote:
>
> To eliminate possibilities, I removed all other versions of OpenMPI
> from the system, and rebuilt using the same build script as was used
> to generate the prior report.
>
> [bennet@cavium-hpc bennet]$ ./ompi-3.1.0bd.sh
> Checking compilers and things
> OMPI is ompi
> COMP_NAME is gcc_7_1_0
> SRC_ROOT is /sw/arcts/centos7/src
> PREFIX_ROOT is /sw/arcts/centos7
> PREFIX is /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
> CONFIGURE_FLAGS are
> COMPILERS are CC=gcc CXX=g++ FC=gfortran
>
> Currently Loaded Modules:
>   1) gcc/7.1.0
>
> gcc (ARM-build-14) 7.1.0
> Copyright (C) 2017 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
>
> Using the following configure command
>
> ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
>     --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd/share/man
>     --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
>     --with-hwloc=external --with-slurm --disable-dlopen
>     --enable-debug CC=gcc CXX=g++ FC=gfortran
>
> The tar ball is
>
> 2e783873f6b206aa71f745762fa15da5  /sw/arcts/centos7/src/ompi/openmpi-3.1.0.tar.gz
>
> I still get
>
> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> salloc: Pending job allocation 165
> salloc: job 165 queued and waiting for resources
> salloc: job 165 has been allocated resources
> salloc: Granted job allocation 165
> [bennet@cavium-hpc ~]$ srun ./test_mpi
> The sum = 0.866386
> Elapsed time is: 5.425549
> The sum = 0.866386
> Elapsed time is: 5.422826
> The sum = 0.866386
> Elapsed time is: 5.427676
> The sum = 0.866386
> Elapsed time is: 5.424928
> The sum = 0.866386
> Elapsed time is: 5.422060
> The sum = 0.866386
> Elapsed time is: 5.425431
> The sum = 0.866386
> Elapsed time is: 5.424350
> The sum = 0.866386
> Elapsed time is: 5.423037
> The sum = 0.866386
> Elapsed time is: 5.427727
> The sum = 0.866386
> Elapsed time is: 5.424922
> The sum = 0.866386
> Elapsed time is: 5.424279
> Total time is: 59.672992
>
> [bennet@cavium-hpc ~]$ mpirun ./test_mpi
> --------------------------------------------------------------------------
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --------------------------------------------------------------------------
>
> I reran with
>
> [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi 2>&1 | tee debug3.log
>
> and the gzipped log is attached.
>
> I thought to try it with a different test program, which spits out the error
>
> [cavium-hpc.arc-ts.umich.edu:42853] [[58987,1],0] ORTE_ERROR_LOG: Not
> found in file base/ess_base_std_app.c at line 219
> [cavium-hpc.arc-ts.umich.edu:42854] [[58987,1],1] ORTE_ERROR_LOG: Not
> found in file base/ess_base_std_app.c at line 219
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   store DAEMON URI failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
>
> At one point, I am almost certain that OMPI mpirun did work, and I am
> at a loss to explain why it no longer does.
>
> I have also tried the 3.1.1rc1 version. I am now going to try 3.0.0,
> and we'll try downgrading SLURM to a prior version.
>
> -- bennet
>
> On Mon, Jun 18, 2018 at 10:56 AM r...@open-mpi.org <r...@open-mpi.org> wrote:
>>
>> Hmmm...well, the error has changed from your initial report. Turning off the
>> firewall was the solution to that problem.
>>
>> This problem is different - it isn’t the orted that failed in the log you
>> sent, but the application proc that couldn’t initialize. It looks like that
>> app was compiled against some earlier version of OMPI? It is looking for
>> something that no longer exists. I saw that you compiled it with a simple
>> “gcc” instead of our wrapper compiler “mpicc” - any particular reason? My
>> guess is that your compile picked up some older version of OMPI on the
>> system.
>>
>> Ralph
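A quick way to act on that suggestion, sketched here with the file names used later in this thread, is to rebuild with the wrapper and then confirm which libmpi the binary resolves at run time:

    mpicc -o test_mpi test_mpi.c -lm
    ldd ./test_mpi | grep libmpi

The ldd output should point into the lib directory of the intended OpenMPI installation rather than any older one on the system.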
>>> On Jun 17, 2018, at 2:51 PM, Bennet Fauber <ben...@umich.edu> wrote:
>>>
>>> I rebuilt with --enable-debug, then ran with
>>>
>>> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
>>> salloc: Pending job allocation 158
>>> salloc: job 158 queued and waiting for resources
>>> salloc: job 158 has been allocated resources
>>> salloc: Granted job allocation 158
>>>
>>> [bennet@cavium-hpc ~]$ srun ./test_mpi
>>> The sum = 0.866386
>>> Elapsed time is: 5.426759
>>> The sum = 0.866386
>>> Elapsed time is: 5.424068
>>> The sum = 0.866386
>>> Elapsed time is: 5.426195
>>> The sum = 0.866386
>>> Elapsed time is: 5.426059
>>> The sum = 0.866386
>>> Elapsed time is: 5.423192
>>> The sum = 0.866386
>>> Elapsed time is: 5.426252
>>> The sum = 0.866386
>>> Elapsed time is: 5.425444
>>> The sum = 0.866386
>>> Elapsed time is: 5.423647
>>> The sum = 0.866386
>>> Elapsed time is: 5.426082
>>> The sum = 0.866386
>>> Elapsed time is: 5.425936
>>> The sum = 0.866386
>>> Elapsed time is: 5.423964
>>> Total time is: 59.677830
>>>
>>> [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi 2>&1 | tee debug2.log
>>>
>>> The zipped debug log should be attached.
>>>
>>> I did that after using systemctl to turn off the firewall on the login
>>> node from which the mpirun is executed, as well as on the host on
>>> which it runs.
>>>
>>> [bennet@cavium-hpc ~]$ mpirun hostname
>>> --------------------------------------------------------------------------
>>> An ORTE daemon has unexpectedly failed after launch and before
>>> communicating back to mpirun. This could be caused by a number
>>> of factors, including an inability to create a connection back
>>> to mpirun due to a lack of common network interfaces and/or no
>>> route found between them. Please check network connectivity
>>> (including firewalls and network routing requirements).
>>> --------------------------------------------------------------------------
>>>
>>> [bennet@cavium-hpc ~]$ squeue
>>>   JOBID PARTITION  NAME    USER ST   TIME NODES NODELIST(REASON)
>>>     158  standard  bash  bennet  R  14:30     1 cav01
>>> [bennet@cavium-hpc ~]$ srun hostname
>>> cav01.arc-ts.umich.edu
>>> [ repeated 23 more times ]
>>>
>>> As always, your help is much appreciated,
>>>
>>> -- bennet
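The exact firewall commands are not shown in the thread; on a stock CentOS 7 host (which these paths suggest), turning the firewall off would presumably have been something like:

    sudo systemctl stop firewalld
    systemctl is-active firewalld    # expect "inactive"

run on both the login node and the compute node.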
>>> On Sun, Jun 17, 2018 at 1:06 PM r...@open-mpi.org <r...@open-mpi.org> wrote:
>>>>
>>>> Add --enable-debug to your OMPI configure cmd line, and then add --mca
>>>> plm_base_verbose 10 to your mpirun cmd line. For some reason, the remote
>>>> daemon isn’t starting - this will give you some info as to why.
>>>>
>>>>> On Jun 17, 2018, at 9:07 AM, Bennet Fauber <ben...@umich.edu> wrote:
>>>>>
>>>>> I have a compiled binary that will run with srun but not with mpirun.
>>>>> The attempts to run with mpirun all result in failures to initialize.
>>>>> I have tried this on one node, and on two nodes, with the firewall
>>>>> turned on and with it off.
>>>>>
>>>>> Am I missing some command line option for mpirun?
>>>>>
>>>>> OMPI was built from this configure command
>>>>>
>>>>> $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b
>>>>>     --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/share/man
>>>>>     --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
>>>>>     --with-hwloc=external --with-slurm --disable-dlopen CC=gcc CXX=g++
>>>>>     FC=gfortran
>>>>>
>>>>> All tests from `make check` passed; see below.
>>>>>
>>>>> [bennet@cavium-hpc ~]$ mpicc --show
>>>>> gcc -I/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/include -pthread
>>>>> -L/opt/pmix/2.0.2/lib -Wl,-rpath -Wl,/opt/pmix/2.0.2/lib -Wl,-rpath
>>>>> -Wl,/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib
>>>>> -Wl,--enable-new-dtags
>>>>> -L/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib -lmpi
>>>>>
>>>>> The test_mpi was compiled with
>>>>>
>>>>> $ gcc -o test_mpi test_mpi.c -lm
>>>>>
>>>>> This is the runtime library path
>>>>>
>>>>> [bennet@cavium-hpc ~]$ echo $LD_LIBRARY_PATH
>>>>> /opt/slurm/lib64:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/hpc-utils/lib
>>>>>
>>>>> These commands are given in the exact sequence in which they were
>>>>> entered at a console.
>>>>>
>>>>> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
>>>>> salloc: Pending job allocation 156
>>>>> salloc: job 156 queued and waiting for resources
>>>>> salloc: job 156 has been allocated resources
>>>>> salloc: Granted job allocation 156
>>>>>
>>>>> [bennet@cavium-hpc ~]$ mpirun ./test_mpi
>>>>> --------------------------------------------------------------------------
>>>>> An ORTE daemon has unexpectedly failed after launch and before
>>>>> communicating back to mpirun. This could be caused by a number
>>>>> of factors, including an inability to create a connection back
>>>>> to mpirun due to a lack of common network interfaces and/or no
>>>>> route found between them. Please check network connectivity
>>>>> (including firewalls and network routing requirements).
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> [bennet@cavium-hpc ~]$ srun ./test_mpi
>>>>> The sum = 0.866386
>>>>> Elapsed time is: 5.425439
>>>>> The sum = 0.866386
>>>>> Elapsed time is: 5.427427
>>>>> The sum = 0.866386
>>>>> Elapsed time is: 5.422579
>>>>> The sum = 0.866386
>>>>> Elapsed time is: 5.424168
>>>>> The sum = 0.866386
>>>>> Elapsed time is: 5.423951
>>>>> The sum = 0.866386
>>>>> Elapsed time is: 5.422414
>>>>> The sum = 0.866386
>>>>> Elapsed time is: 5.427156
>>>>> The sum = 0.866386
>>>>> Elapsed time is: 5.424834
>>>>> The sum = 0.866386
>>>>> Elapsed time is: 5.425103
>>>>> The sum = 0.866386
>>>>> Elapsed time is: 5.422415
>>>>> The sum = 0.866386
>>>>> Elapsed time is: 5.422948
>>>>> Total time is: 59.668622
>>>>>
>>>>> Thanks,
>>>>> -- bennet
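test_mpi.c itself never appears in the thread; as a rough sketch, a minimal MPI program producing this style of output (the computation and constants below are invented placeholders, not Bennet's actual code) could look like:

    #include <stdio.h>
    #include <math.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double start = MPI_Wtime();

        /* Placeholder numerical busy work, split across ranks. */
        double local = 0.0;
        for (long i = rank; i < 100000000L; i += size)
            local += sin((double)i) * 1.0e-8;

        /* Combine the partial sums on rank 0. */
        double sum = 0.0;
        MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            printf("The sum = %f\n", sum);
            printf("Elapsed time is: %f\n", MPI_Wtime() - start);
        }

        MPI_Finalize();
        return 0;
    }

Built with the wrapper compiler this is just "mpicc -o test_mpi test_mpi.c -lm", which also guarantees it links against the same Open MPI that provides mpirun.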
>>>>>
>>>>> make check results
>>>>> ----------------------------------------------
>>>>>
>>>>> make check-TESTS
>>>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
>>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
>>>>> PASS: predefined_gap_test
>>>>> PASS: predefined_pad_test
>>>>> SKIP: dlopen_test
>>>>> ============================================================================
>>>>> Testsuite summary for Open MPI 3.1.0
>>>>> ============================================================================
>>>>> # TOTAL: 3
>>>>> # PASS:  2
>>>>> # SKIP:  1
>>>>> # XFAIL: 0
>>>>> # FAIL:  0
>>>>> # XPASS: 0
>>>>> # ERROR: 0
>>>>> ============================================================================
>>>>> [ elided ]
>>>>> PASS: atomic_cmpset_noinline
>>>>>   - 5 threads: Passed
>>>>> PASS: atomic_cmpset_noinline
>>>>>   - 8 threads: Passed
>>>>> ============================================================================
>>>>> Testsuite summary for Open MPI 3.1.0
>>>>> ============================================================================
>>>>> # TOTAL: 8
>>>>> # PASS:  8
>>>>> # SKIP:  0
>>>>> # XFAIL: 0
>>>>> # FAIL:  0
>>>>> # XPASS: 0
>>>>> # ERROR: 0
>>>>> ============================================================================
>>>>> [ elided ]
>>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/class'
>>>>> PASS: ompi_rb_tree
>>>>> PASS: opal_bitmap
>>>>> PASS: opal_hash_table
>>>>> PASS: opal_proc_table
>>>>> PASS: opal_tree
>>>>> PASS: opal_list
>>>>> PASS: opal_value_array
>>>>> PASS: opal_pointer_array
>>>>> PASS: opal_lifo
>>>>> PASS: opal_fifo
>>>>> ============================================================================
>>>>> Testsuite summary for Open MPI 3.1.0
>>>>> ============================================================================
>>>>> # TOTAL: 10
>>>>> # PASS:  10
>>>>> # SKIP:  0
>>>>> # XFAIL: 0
>>>>> # FAIL:  0
>>>>> # XPASS: 0
>>>>> # ERROR: 0
>>>>> ============================================================================
>>>>> [ elided ]
>>>>> make opal_thread opal_condition
>>>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
>>>>>   CC       opal_thread.o
>>>>>   CCLD     opal_thread
>>>>>   CC       opal_condition.o
>>>>>   CCLD     opal_condition
>>>>> make[3]: Leaving directory `/tmp/build/openmpi-3.1.0/test/threads'
>>>>> make check-TESTS
>>>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
>>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
>>>>> ============================================================================
>>>>> Testsuite summary for Open MPI 3.1.0
>>>>> ============================================================================
>>>>> # TOTAL: 0
>>>>> # PASS:  0
>>>>> # SKIP:  0
>>>>> # XFAIL: 0
>>>>> # FAIL:  0
>>>>> # XPASS: 0
>>>>> # ERROR: 0
>>>>> ============================================================================
>>>>> [ elided ]
>>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/datatype'
>>>>> PASS: opal_datatype_test
>>>>> PASS: unpack_hetero
>>>>> PASS: checksum
>>>>> PASS: position
>>>>> PASS: position_noncontig
>>>>> PASS: ddt_test
>>>>> PASS: ddt_raw
>>>>> PASS: unpack_ooo
>>>>> PASS: ddt_pack
>>>>> PASS: external32
>>>>> ============================================================================
>>>>> Testsuite summary for Open MPI 3.1.0
>>>>> ============================================================================
>>>>> # TOTAL: 10
>>>>> # PASS:  10
>>>>> # SKIP:  0
>>>>> # XFAIL: 0
>>>>> # FAIL:  0
>>>>> # XPASS: 0
>>>>> # ERROR: 0
>>>>> ============================================================================
>>>>> [ elided ]
>>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/util'
>>>>> PASS: opal_bit_ops
>>>>> PASS: opal_path_nfs
>>>>> PASS: bipartite_graph
>>>>> ============================================================================
>>>>> Testsuite summary for Open MPI 3.1.0
>>>>> ============================================================================
>>>>> # TOTAL: 3
>>>>> # PASS:  3
>>>>> # SKIP:  0
>>>>> # XFAIL: 0
>>>>> # FAIL:  0
>>>>> # XPASS: 0
>>>>> # ERROR: 0
>>>>> ============================================================================
>>>>> [ elided ]
>>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/dss'
>>>>> PASS: dss_buffer
>>>>> PASS: dss_cmp
>>>>> PASS: dss_payload
>>>>> PASS: dss_print
>>>>> ============================================================================
>>>>> Testsuite summary for Open MPI 3.1.0
>>>>> ============================================================================
>>>>> # TOTAL: 4
>>>>> # PASS:  4
>>>>> # SKIP:  0
>>>>> # XFAIL: 0
>>>>> # FAIL:  0
>>>>> # XPASS: 0
>>>>> # ERROR: 0
>>>>> ============================================================================
>>> <debug2.log.gz>
> <debug3.log.gz>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users