We are primarily using OpenMPI 3.1.4 but also have 4.0.1 installed.
On our cluster, we were running CentOS 7.5 with updates, alongside
MLNX_OFED 4.5.x. OpenMPI was compiled with the GCC, Intel, PGI and AOCC
compilers. Jobs ran with no issues in that configuration.
To accommodate updates needed to get all of our IB gear running at HDR100
(EDR50 previously), we upgraded to CentOS 7.6.1810 and the current
MLNX_OFED 4.6.x.
Since the upgrade, we can no longer reliably run on more than two nodes.
We see errors like:
[epyc-compute-3-2.local:42447] pml_ucx.c:380 Error: ucp_ep_create(proc=276) failed: Destination is unreachable
[epyc-compute-3-2.local:42447] pml_ucx.c:447 Error: Failed to resolve UCX endpoint for rank 276
[epyc-compute-3-2:42447] *** An error occurred in MPI_Allreduce
[epyc-compute-3-2:42447] *** reported by process [47894553493505,47893180318004]
[epyc-compute-3-2:42447] *** on communicator MPI_COMM_WORLD
[epyc-compute-3-2:42447] *** MPI_ERR_OTHER: known error not in list
[epyc-compute-3-2:42447] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[epyc-compute-3-2:42447] *** and potentially your MPI job)
[epyc-compute-3-17.local:36637] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2079
[epyc-compute-3-17.local:37008] pml_ucx.c:380 Error: ucp_ep_create(proc=147) failed: Destination is unreachable
[epyc-compute-3-17.local:37008] pml_ucx.c:447 Error: Failed to resolve UCX endpoint for rank 147
[epyc-compute-3-7.local:39776] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[epyc-compute-3-7.local:39776] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
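With larger jobs these messages pile up quickly, so it may help to tally which hosts report which unreachable peer ranks. The following is only a rough sketch, assuming the log format shown above; the function name unreachable_peers and the regular expression are illustrative and not part of OpenMPI or UCX.

```python
import re
from collections import defaultdict

# Matches the pml_ucx.c endpoint-creation failures in the log format above.
# \s+ also spans line breaks, so hard-wrapped log lines still match.
UCX_EP_FAILURE = re.compile(
    r"\[(?P<host>[^\]:]+):(?P<pid>\d+)\]\s+pml_ucx\.c:\d+\s+Error:\s+"
    r"ucp_ep_create\(proc=(?P<rank>\d+)\)\s+failed:\s+(?P<reason>.+)"
)


def unreachable_peers(log_text):
    """Map each reporting host to the sorted peer ranks it could not reach."""
    peers = defaultdict(set)
    for match in UCX_EP_FAILURE.finditer(log_text):
        peers[match.group("host")].add(int(match.group("rank")))
    return {host: sorted(ranks) for host, ranks in peers.items()}


if __name__ == "__main__":
    # Two of the failures from the log above, one of them hard-wrapped.
    sample = (
        "[epyc-compute-3-2.local:42447] pml_ucx.c:380 Error:\n"
        "ucp_ep_create(proc=276) failed: Destination is unreachable\n"
        "[epyc-compute-3-17.local:37008] pml_ucx.c:380 Error: "
        "ucp_ep_create(proc=147) failed: Destination is unreachable\n"
    )
    for host, ranks in unreachable_peers(sample).items():
        print(host, ranks)
```

Mapping the reported ranks back to their nodes (for example, via the job's hostfile) would then show whether the failures cluster on particular nodes or links.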
UCX appears to be part of the MLNX_OFED release, and is version 1.6.0.
OpenMPI is built on the same OS and MLNX_OFED release that we are
running on the compute nodes.
I have a case open with Mellanox, but it is not clear where this error is
coming from.
--
Ray Muno
IT Manager
e-mail:m...@aem.umn.edu
Phone: (612) 625-9531
University of Minnesota
Aerospace Engineering and Mechanics
110 Union St. S.E.
Minneapolis, MN 55455

Mechanical Engineering
111 Church Street SE
Minneapolis, MN 55455