I doubt Slurm is the issue. For grins, let's try adding "--mca plm rsh" to your
mpirun cmd line and see if that works.
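
For example, something along these lines (reusing the test_mpi binary from
your earlier runs) forces mpirun to use the rsh/ssh launcher instead of the
Slurm one, so it should tell us whether the Slurm launch path is involved:

  mpirun --mca plm rsh ./test_mpi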


> On Jun 18, 2018, at 12:57 PM, Bennet Fauber <ben...@umich.edu> wrote:
> 
> To eliminate possibilities, I removed all other versions of OpenMPI
> from the system, and rebuilt using the same build script as was used
> to generate the prior report.
> 
> [bennet@cavium-hpc bennet]$ ./ompi-3.1.0bd.sh
> Checking compilers and things
> OMPI is ompi
> COMP_NAME is gcc_7_1_0
> SRC_ROOT is /sw/arcts/centos7/src
> PREFIX_ROOT is /sw/arcts/centos7
> PREFIX is /sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
> CONFIGURE_FLAGS are
> COMPILERS are CC=gcc CXX=g++ FC=gfortran
> 
> Currently Loaded Modules:
>  1) gcc/7.1.0
> 
> gcc (ARM-build-14) 7.1.0
> Copyright (C) 2017 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions.  There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> 
> Using the following configure command
> 
> ./configure     --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd
>   --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-bd/share/man
> --with-pmix=/opt/pmix/2.0.2     --with-libevent=external
> --with-hwloc=external     --with-slurm     --disable-dlopen
> --enable-debug          CC=gcc CXX=g++ FC=gfortran
> 
> The tar ball is
> 
> 2e783873f6b206aa71f745762fa15da5
> /sw/arcts/centos7/src/ompi/openmpi-3.1.0.tar.gz
> 
> I still get
> 
> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
> salloc: Pending job allocation 165
> salloc: job 165 queued and waiting for resources
> salloc: job 165 has been allocated resources
> salloc: Granted job allocation 165
> [bennet@cavium-hpc ~]$ srun ./test_mpi
> The sum = 0.866386
> Elapsed time is:  5.425549
> The sum = 0.866386
> Elapsed time is:  5.422826
> The sum = 0.866386
> Elapsed time is:  5.427676
> The sum = 0.866386
> Elapsed time is:  5.424928
> The sum = 0.866386
> Elapsed time is:  5.422060
> The sum = 0.866386
> Elapsed time is:  5.425431
> The sum = 0.866386
> Elapsed time is:  5.424350
> The sum = 0.866386
> Elapsed time is:  5.423037
> The sum = 0.866386
> Elapsed time is:  5.427727
> The sum = 0.866386
> Elapsed time is:  5.424922
> The sum = 0.866386
> Elapsed time is:  5.424279
> Total time is:  59.672992
> 
> [bennet@cavium-hpc ~]$ mpirun ./test_mpi
> --------------------------------------------------------------------------
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --------------------------------------------------------------------------
> 
> I reran with
> 
> [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi
> 2>&1 | tee debug3.log
> 
> and the gzipped log is attached.
> 
> I then tried a different test program, which produces the following error:
> [cavium-hpc.arc-ts.umich.edu:42853] [[58987,1],0] ORTE_ERROR_LOG: Not
> found in file base/ess_base_std_app.c at line 219
> [cavium-hpc.arc-ts.umich.edu:42854] [[58987,1],1] ORTE_ERROR_LOG: Not
> found in file base/ess_base_std_app.c at line 219
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>  store DAEMON URI failed
>  --> Returned value Not found (-13) instead of ORTE_SUCCESS
> 
> 
> I am almost certain that OMPI mpirun did work at one point, and I am
> at a loss to explain why it no longer does.
> 
> I have also tried the 3.1.1rc1 version.  I am now going to try 3.0.0,
> and we'll try downgrading SLURM to a prior version.
> 
> -- bennet
> 
> 
> On Mon, Jun 18, 2018 at 10:56 AM r...@open-mpi.org
> <r...@open-mpi.org> wrote:
>> 
>> Hmmm...well, the error has changed from your initial report. Turning off the 
>> firewall was the solution to that problem.
>> 
>> This problem is different - it isn’t the orted that failed in the log you 
>> sent, but the application proc that couldn’t initialize. It looks like that 
>> app was compiled against some earlier version of OMPI? It is looking for 
>> something that no longer exists. I saw that you compiled it with a simple 
>> “gcc” instead of our wrapper compiler “mpicc” - any particular reason? My 
>> guess is that your compile picked up some older version of OMPI on the 
>> system.
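>> 
>> As a quick check, something like this (recompiling the test with the
>> wrapper and then looking at which libmpi the binary actually resolves)
>> would show whether it picked up an older install:
>> 
>>   mpicc -o test_mpi test_mpi.c -lm
>>   ldd ./test_mpi | grep -i mpi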
>> 
>> Ralph
>> 
>> 
>>> On Jun 17, 2018, at 2:51 PM, Bennet Fauber <ben...@umich.edu> wrote:
>>> 
>>> I rebuilt with --enable-debug, then ran with
>>> 
>>> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
>>> salloc: Pending job allocation 158
>>> salloc: job 158 queued and waiting for resources
>>> salloc: job 158 has been allocated resources
>>> salloc: Granted job allocation 158
>>> 
>>> [bennet@cavium-hpc ~]$ srun ./test_mpi
>>> The sum = 0.866386
>>> Elapsed time is:  5.426759
>>> The sum = 0.866386
>>> Elapsed time is:  5.424068
>>> The sum = 0.866386
>>> Elapsed time is:  5.426195
>>> The sum = 0.866386
>>> Elapsed time is:  5.426059
>>> The sum = 0.866386
>>> Elapsed time is:  5.423192
>>> The sum = 0.866386
>>> Elapsed time is:  5.426252
>>> The sum = 0.866386
>>> Elapsed time is:  5.425444
>>> The sum = 0.866386
>>> Elapsed time is:  5.423647
>>> The sum = 0.866386
>>> Elapsed time is:  5.426082
>>> The sum = 0.866386
>>> Elapsed time is:  5.425936
>>> The sum = 0.866386
>>> Elapsed time is:  5.423964
>>> Total time is:  59.677830
>>> 
>>> [bennet@cavium-hpc ~]$ mpirun --mca plm_base_verbose 10 ./test_mpi
>>> 2>&1 | tee debug2.log
>>> 
>>> The zipped debug log should be attached.
>>> 
>>> I did that after using systemctl to turn off the firewall on the login
>>> node from which the mpirun is executed, as well as on the host on
>>> which it runs.
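>>> 
>>> For the record, that was essentially the following on both machines
>>> (assuming firewalld is the active firewall on these CentOS 7 hosts):
>>> 
>>>   sudo systemctl stop firewalld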
>>> 
>>> [bennet@cavium-hpc ~]$ mpirun hostname
>>> --------------------------------------------------------------------------
>>> An ORTE daemon has unexpectedly failed after launch and before
>>> communicating back to mpirun. This could be caused by a number
>>> of factors, including an inability to create a connection back
>>> to mpirun due to a lack of common network interfaces and/or no
>>> route found between them. Please check network connectivity
>>> (including firewalls and network routing requirements).
>>> --------------------------------------------------------------------------
>>> 
>>> [bennet@cavium-hpc ~]$ squeue
>>>            JOBID PARTITION     NAME     USER ST       TIME  NODES
>>> NODELIST(REASON)
>>>              158  standard     bash   bennet  R      14:30      1 cav01
>>> [bennet@cavium-hpc ~]$ srun hostname
>>> cav01.arc-ts.umich.edu
>>> [ repeated 23 more times ]
>>> 
>>> As always, your help is much appreciated,
>>> 
>>> -- bennet
>>> 
>>> On Sun, Jun 17, 2018 at 1:06 PM r...@open-mpi.org <r...@open-mpi.org> wrote:
>>>> 
>>>> Add --enable-debug to your OMPI configure cmd line, and then add --mca 
>>>> plm_base_verbose 10 to your mpirun cmd line. For some reason, the remote 
>>>> daemon isn’t starting - this will give you some info as to why.
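>>>> 
>>>> In other words, something along these lines (with your other configure
>>>> options left as they were):
>>>> 
>>>>   ./configure --enable-debug <existing options>
>>>>   mpirun --mca plm_base_verbose 10 ./test_mpi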
>>>> 
>>>> 
>>>>> On Jun 17, 2018, at 9:07 AM, Bennet Fauber <ben...@umich.edu> wrote:
>>>>> 
>>>>> I have a compiled binary that will run with srun but not with mpirun.
>>>>> The attempts to run with mpirun all result in failures to initialize.
>>>>> I have tried this on one node, and on two nodes, with firewall turned
>>>>> on and with it off.
>>>>> 
>>>>> Am I missing some command line option for mpirun?
>>>>> 
>>>>> OMPI built from this configure command
>>>>> 
>>>>> $ ./configure --prefix=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b
>>>>> --mandir=/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/share/man
>>>>> --with-pmix=/opt/pmix/2.0.2 --with-libevent=external
>>>>> --with-hwloc=external --with-slurm --disable-dlopen CC=gcc CXX=g++
>>>>> FC=gfortran
>>>>> 
>>>>> All tests from `make check` passed; see below.
>>>>> 
>>>>> [bennet@cavium-hpc ~]$ mpicc --show
>>>>> gcc -I/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/include -pthread
>>>>> -L/opt/pmix/2.0.2/lib -Wl,-rpath -Wl,/opt/pmix/2.0.2/lib -Wl,-rpath
>>>>> -Wl,/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib
>>>>> -Wl,--enable-new-dtags
>>>>> -L/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib -lmpi
>>>>> 
>>>>> The test_mpi was compiled with
>>>>> 
>>>>> $ gcc -o test_mpi test_mpi.c -lm
>>>>> 
>>>>> This is the runtime library path
>>>>> 
>>>>> [bennet@cavium-hpc ~]$ echo $LD_LIBRARY_PATH
>>>>> /opt/slurm/lib64:/sw/arcts/centos7/gcc_7_1_0/openmpi/3.1.0-b/lib:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib64:/opt/arm/gcc-7.1.0_Generic-AArch64_RHEL-7_aarch64-linux/lib:/opt/slurm/lib64:/opt/pmix/2.0.2/lib:/sw/arcts/centos7/hpc-utils/lib
>>>>> 
>>>>> 
>>>>> These commands are given in the exact sequence in which they were
>>>>> entered at the console.
>>>>> 
>>>>> [bennet@cavium-hpc ~]$ salloc -N 1 --ntasks-per-node=24
>>>>> salloc: Pending job allocation 156
>>>>> salloc: job 156 queued and waiting for resources
>>>>> salloc: job 156 has been allocated resources
>>>>> salloc: Granted job allocation 156
>>>>> 
>>>>> [bennet@cavium-hpc ~]$ mpirun ./test_mpi
>>>>> --------------------------------------------------------------------------
>>>>> An ORTE daemon has unexpectedly failed after launch and before
>>>>> communicating back to mpirun. This could be caused by a number
>>>>> of factors, including an inability to create a connection back
>>>>> to mpirun due to a lack of common network interfaces and/or no
>>>>> route found between them. Please check network connectivity
>>>>> (including firewalls and network routing requirements).
>>>>> --------------------------------------------------------------------------
>>>>> 
>>>>> [bennet@cavium-hpc ~]$ srun ./test_mpi
>>>>> The sum = 0.866386
>>>>> Elapsed time is:  5.425439
>>>>> The sum = 0.866386
>>>>> Elapsed time is:  5.427427
>>>>> The sum = 0.866386
>>>>> Elapsed time is:  5.422579
>>>>> The sum = 0.866386
>>>>> Elapsed time is:  5.424168
>>>>> The sum = 0.866386
>>>>> Elapsed time is:  5.423951
>>>>> The sum = 0.866386
>>>>> Elapsed time is:  5.422414
>>>>> The sum = 0.866386
>>>>> Elapsed time is:  5.427156
>>>>> The sum = 0.866386
>>>>> Elapsed time is:  5.424834
>>>>> The sum = 0.866386
>>>>> Elapsed time is:  5.425103
>>>>> The sum = 0.866386
>>>>> Elapsed time is:  5.422415
>>>>> The sum = 0.866386
>>>>> Elapsed time is:  5.422948
>>>>> Total time is:  59.668622
>>>>> 
>>>>> Thanks,    -- bennet
>>>>> 
>>>>> 
>>>>> make check results
>>>>> ----------------------------------------------
>>>>> 
>>>>> make  check-TESTS
>>>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
>>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/ompi/debuggers'
>>>>> PASS: predefined_gap_test
>>>>> PASS: predefined_pad_test
>>>>> SKIP: dlopen_test
>>>>> ============================================================================
>>>>> Testsuite summary for Open MPI 3.1.0
>>>>> ============================================================================
>>>>> # TOTAL: 3
>>>>> # PASS:  2
>>>>> # SKIP:  1
>>>>> # XFAIL: 0
>>>>> # FAIL:  0
>>>>> # XPASS: 0
>>>>> # ERROR: 0
>>>>> ============================================================================
>>>>> [ elided ]
>>>>> PASS: atomic_cmpset_noinline
>>>>>  - 5 threads: Passed
>>>>> PASS: atomic_cmpset_noinline
>>>>>  - 8 threads: Passed
>>>>> ============================================================================
>>>>> Testsuite summary for Open MPI 3.1.0
>>>>> ============================================================================
>>>>> # TOTAL: 8
>>>>> # PASS:  8
>>>>> # SKIP:  0
>>>>> # XFAIL: 0
>>>>> # FAIL:  0
>>>>> # XPASS: 0
>>>>> # ERROR: 0
>>>>> ============================================================================
>>>>> [ elided ]
>>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/class'
>>>>> PASS: ompi_rb_tree
>>>>> PASS: opal_bitmap
>>>>> PASS: opal_hash_table
>>>>> PASS: opal_proc_table
>>>>> PASS: opal_tree
>>>>> PASS: opal_list
>>>>> PASS: opal_value_array
>>>>> PASS: opal_pointer_array
>>>>> PASS: opal_lifo
>>>>> PASS: opal_fifo
>>>>> ============================================================================
>>>>> Testsuite summary for Open MPI 3.1.0
>>>>> ============================================================================
>>>>> # TOTAL: 10
>>>>> # PASS:  10
>>>>> # SKIP:  0
>>>>> # XFAIL: 0
>>>>> # FAIL:  0
>>>>> # XPASS: 0
>>>>> # ERROR: 0
>>>>> ============================================================================
>>>>> [ elided ]
>>>>> make  opal_thread opal_condition
>>>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
>>>>> CC       opal_thread.o
>>>>> CCLD     opal_thread
>>>>> CC       opal_condition.o
>>>>> CCLD     opal_condition
>>>>> make[3]: Leaving directory `/tmp/build/openmpi-3.1.0/test/threads'
>>>>> make  check-TESTS
>>>>> make[3]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
>>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/threads'
>>>>> ============================================================================
>>>>> Testsuite summary for Open MPI 3.1.0
>>>>> ============================================================================
>>>>> # TOTAL: 0
>>>>> # PASS:  0
>>>>> # SKIP:  0
>>>>> # XFAIL: 0
>>>>> # FAIL:  0
>>>>> # XPASS: 0
>>>>> # ERROR: 0
>>>>> ============================================================================
>>>>> [ elided ]
>>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/datatype'
>>>>> PASS: opal_datatype_test
>>>>> PASS: unpack_hetero
>>>>> PASS: checksum
>>>>> PASS: position
>>>>> PASS: position_noncontig
>>>>> PASS: ddt_test
>>>>> PASS: ddt_raw
>>>>> PASS: unpack_ooo
>>>>> PASS: ddt_pack
>>>>> PASS: external32
>>>>> ============================================================================
>>>>> Testsuite summary for Open MPI 3.1.0
>>>>> ============================================================================
>>>>> # TOTAL: 10
>>>>> # PASS:  10
>>>>> # SKIP:  0
>>>>> # XFAIL: 0
>>>>> # FAIL:  0
>>>>> # XPASS: 0
>>>>> # ERROR: 0
>>>>> ============================================================================
>>>>> [ elided ]
>>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/util'
>>>>> PASS: opal_bit_ops
>>>>> PASS: opal_path_nfs
>>>>> PASS: bipartite_graph
>>>>> ============================================================================
>>>>> Testsuite summary for Open MPI 3.1.0
>>>>> ============================================================================
>>>>> # TOTAL: 3
>>>>> # PASS:  3
>>>>> # SKIP:  0
>>>>> # XFAIL: 0
>>>>> # FAIL:  0
>>>>> # XPASS: 0
>>>>> # ERROR: 0
>>>>> ============================================================================
>>>>> [ elided ]
>>>>> make[4]: Entering directory `/tmp/build/openmpi-3.1.0/test/dss'
>>>>> PASS: dss_buffer
>>>>> PASS: dss_cmp
>>>>> PASS: dss_payload
>>>>> PASS: dss_print
>>>>> ============================================================================
>>>>> Testsuite summary for Open MPI 3.1.0
>>>>> ============================================================================
>>>>> # TOTAL: 4
>>>>> # PASS:  4
>>>>> # SKIP:  0
>>>>> # XFAIL: 0
>>>>> # FAIL:  0
>>>>> # XPASS: 0
>>>>> # ERROR: 0
>>>>> ============================================================================
>>>> 
>>> <debug2.log.gz>
>> 
> <debug3.log.gz>

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
