UCX 1.8 or UCX 1.18? Your application does not exchange any data, so it is possible that MPICH's behavior differs from OMPI's (i.e., not creating connections at all vs. creating them during MPI_Init). That's why running a slightly different version of the hello_world with a barrier would clarify the status of the connections.

George.
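[For concreteness, here is a minimal sketch of the barrier variant George suggests; it uses only standard MPI calls, and the printf wording is illustrative:]

----------------------------------------------------------------------
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* The barrier cannot complete until every rank participates, so it
     * forces connections to be established before MPI_Finalize runs. */
    MPI_Barrier(MPI_COMM_WORLD);

    printf("Rank %d of %d passed the barrier\n", world_rank, world_size);

    MPI_Finalize();
    return 0;
}
----------------------------------------------------------------------

[If this version finishes cleanly while the original does not, that would support the timing/teardown hypothesis rather than a fabric misconfiguration.]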
On Tue, Jul 1, 2025 at 10:30 PM Achilles Vassilicos <avass...@gmail.com> wrote:

> When I use openmpi5, I get the same behavior even with a very small number of processes per node. However, when I use mpich-ofi it runs fine (see below). That gives me confidence that the network is set up correctly. The nodes are connected via InfiniBand ConnectX-3 adapters, and all ib tests show no problems.
> I found an older post about ucx1.18 having possible issues with openmpi5. I had assumed that ucx1.18 is now fully compatible with openmpi5. Could this be the cause? Does anyone use ucx1.8 with openmpi5? If not ucx1.18, what version is confirmed to work with openmpi5?
>
> My test code:
> ----------------------------------------------------------------------
> [av@c12 test]$ cat mpi_hello_world.c
> #include <mpi.h>
> #include <stdio.h>
>
> int main(int argc, char** argv) {
>     // Initialize the MPI environment
>     MPI_Init(NULL, NULL);
>
>     // Get the number of processes
>     int world_size;
>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>
>     // Get the rank of the process
>     int world_rank;
>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>
>     // Get the name of the processor
>     char processor_name[MPI_MAX_PROCESSOR_NAME];
>     int name_len;
>     MPI_Get_processor_name(processor_name, &name_len);
>
>     // Print off a hello world message
>     printf("Hello world from processor %s, rank %d out of %d processors\n",
>            processor_name, world_rank, world_size);
>
>     // Finalize the MPI environment.
>     MPI_Finalize();
> }
> -------------------------------------------------------------------------
> [av@c12 test]$ which mpirun
> /opt/ohpc/pub/mpi/openmpi5-gnu14/5.0.7/bin/mpirun
> [av@sms test]$ mpicc -o openmpi5_hello_world mpi_hello_world.c
> [av@sms test]$ salloc -n 4 -N 2
> salloc: Granted job allocation 63
> salloc: Nodes c[12-13] are ready for job
> [av@c12 test]$ mpirun ./openmpi5_hello_world
> Hello world from processor c12, rank 0 out of 4 processors
> Hello world from processor c12, rank 1 out of 4 processors
> Hello world from processor c13, rank 3 out of 4 processors
> Hello world from processor c13, rank 2 out of 4 processors
> [c12:1709 :0:1709] ud_ep.c:278 Fatal: UD endpoint 0x117ae80 to <no debug data>: unhandled timeout error
> ==== backtrace (tid: 1709) ====
>  0 /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(ucs_handle_error+0x294) [0x7f200b4f3ee4]
> ................
> -----------------------------------------------------------------------
> [av@sms test]$ which mpicc
> /opt/ohpc/pub/mpi/mpich-ofi-gnu14-ohpc/3.4.3/bin/mpicc
> [av@sms test]$ which mpirun
> /opt/ohpc/pub/mpi/mpich-ofi-gnu14-ohpc/3.4.3/bin/mpirun
> [av@sms test]$ mpicc -o mpich-ofi_hello_world mpi_hello_world.c
> [av@sms test]$ salloc -n 4 -N 2
> salloc: Granted job allocation 66
> salloc: Nodes c[12-13] are ready for job
> [av@c12 test]$ mpirun ./mpich-ofi_hello_world
> Hello world from processor c13, rank 2 out of 4 processors
> Hello world from processor c13, rank 3 out of 4 processors
> Hello world from processor c12, rank 0 out of 4 processors
> Hello world from processor c12, rank 1 out of 4 processors
> [av@c12 test]$
> ------------------------------------------------------------------------
> Achilles
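[One way to settle the 1.8-vs-1.18 question is to check which UCX actually resolves at run time, either with the ucx_info -v utility that ships with UCX or with a tiny program using the public ucp_get_version() call. A minimal sketch, assuming the ucx/1.18.0 module's include and library paths are in effect:]

----------------------------------------------------------------------
/* ucx_version.c -- print the UCX version resolved at run time.
 * Build (paths supplied by the loaded ucx module):
 *   cc ucx_version.c -lucp -o ucx_version */
#include <stdio.h>
#include <ucp/api/ucp.h>

int main(void) {
    unsigned major, minor, release;
    ucp_get_version(&major, &minor, &release);
    printf("UCX %u.%u.%u\n", major, minor, release);
    return 0;
}
----------------------------------------------------------------------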
> On Tuesday, July 1, 2025 at 7:14:06 AM UTC-4 George Bosilca wrote:

>> This error message is usually due to a misconfiguration of the network. However, I don't think this is the case here, because the output contains messages from both odd and even ranks (which, according to your binding policy, were placed on different nodes), suggesting at least some of the processes were able to connect (and thus the network configuration is correct).
>>
>> So I'm thinking of some timing issue during network setup, due to the fact that you have many processes per node and an application that does nothing except create and then shut down the network layer. Does this happen if you have fewer processes per node? Does it happen if you add anything else to the application (such as an MPI_Barrier(MPI_COMM_WORLD))?
>>
>> George.
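[Going one step beyond the barrier George proposed -- an extension of this editor's, not something suggested in the thread -- a ring-style exchange forces actual point-to-point traffic through the transports, not just connection wireup. A sketch using only standard MPI:]

----------------------------------------------------------------------
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every rank sends its rank to its right neighbor and receives from
     * its left neighbor, so each neighboring pair must open a real
     * point-to-point connection before MPI_Finalize. */
    int token = rank;
    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;
    MPI_Sendrecv(&rank, 1, MPI_INT, next, 0,
                 &token, 1, MPI_INT, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("Rank %d received token %d from rank %d\n", rank, token, prev);

    MPI_Finalize();
    return 0;
}
----------------------------------------------------------------------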
>> On Mon, Jun 30, 2025 at 10:00 PM Achilles Vassilicos <avas...@gmail.com> wrote:
>>
>>> Hello all, new to the list.
>>> While testing my openmpi5.0.7 installation using the simple mpi_hello_world.c code, I am experiencing unexpected behavior where execution on the last processor rank hangs with a "fatal unhandled timeout error", which leads to core dumps. I confirmed that it happens regardless of the compiler I use, i.e., gnu14 or intel2024.0. Moreover, it does not happen when I use mpich3.4.3-ofi. Below I am including the settings I am using and the runtime error. You will notice that the error happened on node c11, which may suggest that there is something wrong with this node. However, it turns out that any other node that happens to execute the last processor rank leads to the same error. I must be missing something. Any thoughts?
>>> Sorry about the length of the post.
>>>
>>> -----------------------------------------------------
>>> ]$ module list
>>> Currently Loaded Modules:
>>>   1) cmake/4.0.0        6) spack/0.23.1          11) mkl/2024.0         16) ifort/2024.0.0              21) EasyBuild/5.0.0
>>>   2) autotools          7) oclfpga/2024.0.0      12) intel/2024.0.0     17) inspector/2024.2            22) valgrind/3.24.0
>>>   3) hwloc/2.12.0       8) tbb/2021.11           13) debugger/2024.0.0  18) intel_ipp_intel64/2021.10   23) openmpi5/5.0.7
>>>   4) libfabric/1.18.0   9) compiler-rt/2024.0.0  14) dpl/2022.3         19) intel_ippcp_intel64/2021.9  24) ucx/1.18.0
>>>   5) prun/2.2          10) compiler/2024.0.0     15) icc/2023.2.1       20) vtune/2025.3
>>> ----------------------------------------------------------
>>> $ sinfo
>>> PARTITION AVAIL  TIMELIMIT  NODES  STATE  NODELIST
>>> normal*   up     infinite      10  idle*  c[2-10,12]
>>> normal*   up     infinite       3  idle   c[1,11,13]
>>> [av@sms test]$ salloc -n 24 -N 2
>>> salloc: Granted job allocation 61
>>> salloc: Nodes c[1,11] are ready for job
>>> [av@c1 test]$ mpirun --display-map --map-by node -x MXM_RDMA_PORTS=mlx4_0:1 -mca btl_openib_if_include mlx4_0:1 mpi_hello_world
>>>
>>> ========================   JOB MAP   ========================
>>> Data for JOB prterun-c1-1575@1 offset 0 Total slots allocated 24
>>> Mapping policy: BYNODE:NOOVERSUBSCRIBE Ranking policy: NODE Binding policy: NUMA:IF-SUPPORTED
>>> Cpu set: N/A PPR: N/A Cpus-per-rank: N/A Cpu Type: CORE
>>>
>>> Data for node: c1 Num slots: 12 Max slots: 0 Num procs: 12
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 0 Bound: package[0][core:0-17]
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 2 Bound: package[0][core:0-17]
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 4 Bound: package[0][core:0-17]
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 6 Bound: package[0][core:0-17]
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 8 Bound: package[0][core:0-17]
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 10 Bound: package[0][core:0-17]
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 12 Bound: package[0][core:0-17]
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 14 Bound: package[0][core:0-17]
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 16 Bound: package[0][core:0-17]
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 18 Bound: package[0][core:0-17]
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 20 Bound: package[0][core:0-17]
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 22 Bound: package[0][core:0-17]
>>>
>>> Data for node: c11 Num slots: 12 Max slots: 0 Num procs: 12
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 1 Bound: package[0][core:0-17]
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 3 Bound: package[0][core:0-17]
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 5 Bound: package[0][core:0-17]
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 7 Bound: package[0][core:0-17]
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 9 Bound: package[0][core:0-17]
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 11 Bound: package[0][core:0-17]
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 13 Bound: package[0][core:0-17]
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 15 Bound: package[0][core:0-17]
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 17 Bound: package[0][core:0-17]
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 19 Bound: package[0][core:0-17]
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 21 Bound: package[0][core:0-17]
>>> Process jobid: prterun-c1-1575@1 App: 0 Process rank: 23 Bound: package[0][core:0-17]
>>>
>>> =============================================================
>>> Hello world from processor c1, rank 6 out of 24 processors
>>> Hello world from processor c1, rank 20 out of 24 processors
>>> Hello world from processor c1, rank 16 out of 24 processors
>>> Hello world from processor c1, rank 12 out of 24 processors
>>> Hello world from processor c1, rank 0 out of 24 processors
>>> Hello world from processor c1, rank 2 out of 24 processors
>>> Hello world from processor c1, rank 14 out of 24 processors
>>> Hello world from processor c1, rank 10 out of 24 processors
>>> Hello world from processor c1, rank 4 out of 24 processors
>>> Hello world from processor c1, rank 22 out of 24 processors
>>> Hello world from processor c1, rank 18 out of 24 processors
>>> Hello world from processor c1, rank 8 out of 24 processors
>>> Hello world from processor c11, rank 11 out of 24 processors
>>> Hello world from processor c11, rank 1 out of 24 processors
>>> Hello world from processor c11, rank 3 out of 24 processors
>>> Hello world from processor c11, rank 13 out of 24 processors
>>> Hello world from processor c11, rank 19 out of 24 processors
>>> Hello world from processor c11, rank 7 out of 24 processors
>>> Hello world from processor c11, rank 17 out of 24 processors
>>> Hello world from processor c11, rank 21 out of 24 processors
>>> Hello world from processor c11, rank 15 out of 24 processors
>>> Hello world from processor c11, rank 23 out of 24 processors
>>> Hello world from processor c11, rank 9 out of 24 processors
>>> Hello world from processor c11, rank 5 out of 24 processors
>>> [c11:2028 :0:2028] ud_ep.c:278 Fatal: UD endpoint 0x1c8da90 to <no debug data>: unhandled timeout error
>>> [c11:2035 :0:2035] ud_ep.c:278 Fatal: UD endpoint 0x722a90 to <no debug data>: unhandled timeout error
>>> [c11:2025 :0:2025] ud_ep.c:278 Fatal: UD endpoint 0xc52a90 to <no debug data>: unhandled timeout error
>>> ==== backtrace (tid: 2028) ====
>>>  0 /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(ucs_handle_error+0x294) [0x7fade4326ee4]
>>>  1 /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(ucs_fatal_error_message+0xb2) [0x7fade4324292]
>>>  2 /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(+0x2f369) [0x7fade4324369]
>>>  3 /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/ucx/libuct_ib.so.0(+0x263f0) [0x7fade110d3f0]
>>>  4 /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(+0x24987) [0x7fade4319987]
>>>  5 /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucp.so.0(ucp_worker_progress+0x2a) [0x7fade43abc9a]
>>>  6 /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(+0xa09bc) [0x7fade471b9bc]
>>>  7 /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_common_ucx_del_procs_nofence+0x6a) [0x7fade471b79a]
>>>  8 /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_common_ucx_del_procs+0x20) [0x7fade471baf0]
>>>  9 /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(mca_pml_ucx_del_procs+0x140) [0x7fade4d1cd70]
>>> 10 /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(+0xac837) [0x7fade4b27837]
>>> 11 /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_finalize_cleanup_domain+0x53) [0x7fade46aebd3]
>>> 12 /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_finalize+0x2e) [0x7fade46a22be]
>>> 13 /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(ompi_rte_finalize+0x1f9) [0x7fade4b21909]
>>> 14 /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(+0xab304) [0x7fade4b26304]
>>> 15 /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(ompi_mpi_instance_finalize+0xe5) [0x7fade4b26935]
>>> 16 /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(ompi_mpi_finalize+0x3d1) [0x7fade4b1e091]
>>> 17 mpi_hello_world() [0x40258f]
>>> 18 /lib64/libc.so.6(+0x295d0) [0x7fade47b95d0]
>>> 19 /lib64/libc.so.6(__libc_start_main+0x80) [0x7fade47b9680]
>>> 20 mpi_hello_world() [0x402455]
>>> =================================
>>> [c11:02028] *** Process received signal ***
>>> [c11:02028] Signal: Aborted (6)
>>> [c11:02028] Signal code: (-6)
>>> [c11:02028] [ 0] /lib64/libc.so.6(+0x3ebf0)[0x7fade47cebf0]
>>> [c11:02028] [ 1] /lib64/libc.so.6(+0x8bedc)[0x7fade481bedc]
>>> [c11:02028] [ 2] /lib64/libc.so.6(raise+0x16)[0x7fade47ceb46]
>>> [c11:02028] [ 3] /lib64/libc.so.6(abort+0xd3)[0x7fade47b8833]
>>> [c11:02028] [ 4] /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(+0x2f297)[0x7fade4324297]
>>> [c11:02028] [ 5] /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(+0x2f369)[0x7fade4324369]
>>> [c11:02028] [ 6] /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/ucx/libuct_ib.so.0(+0x263f0)[0x7fade110d3f0]
>>> [c11:02028] [ 7] /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(+0x24987)[0x7fade4319987]
>>> [c11:02028] [ 8] /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucp.so.0(ucp_worker_progress+0x2a)[0x7fade43abc9a]
>>> [c11:02028] [ 9] /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(+0xa09bc)[0x7fade471b9bc]
>>> [c11:02028] [10] /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_common_ucx_del_procs_nofence+0x6a)[0x7fade471b79a]
>>> [c11:02028] [11] /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_common_ucx_del_procs+0x20)[0x7fade471baf0]
>>> [c11:02028] [12] ==== backtrace (tid: 2035) ====
>>> ..................
>>>
>>> --------------------------------------------------------------------------------
>>> Achilles