UCX 1.8 or UCX 1.18?

Your application does not exchange any data, so it is possible that MPICH's
behavior differs from Open MPI's (i.e., not creating connections at all
versus creating them during MPI_Init). That's why running a slightly
different version of the hello_world with a barrier would clarify the
status of the connections.
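
For reference, a minimal sketch of that variant (the only change to your
test is the added MPI_Barrier, which forces actual communication between
ranks before the network layer is torn down):

----------------------------------------------------------------------
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int world_size, world_rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* The barrier is the only addition: it is a collective operation,
     * so connections must be established and used before finalize. */
    MPI_Barrier(MPI_COMM_WORLD);

    printf("Rank %d of %d passed the barrier\n", world_rank, world_size);

    MPI_Finalize();
    return 0;
}
----------------------------------------------------------------------

If this version also fails in MPI_Finalize, the connections themselves
likely work and the problem is in the teardown path; if it hangs or fails
inside the barrier, the connections were probably never usable at all.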

  George.


On Tue, Jul 1, 2025 at 10:30 PM Achilles Vassilicos <avass...@gmail.com>
wrote:

> When I use openmpi5, I get the same behavior even with a very small number
> of processes per node. However, when I use mpich-ofi it runs fine (see
> below). That gives me confidence that the network is set up correctly. The
> nodes are connected via InfiniBand ConnectX-3 adapters, and all ib tests
> show no problems.
> I found an older post about ucx1.18 having possible issues with openmpi5.
> I have assumed that ucx1.18 is now fully compatible with openmpi5. Could
> this be the cause? Does anyone use ucx1.8 with openmpi5? If not ucx1.18,
> what version is confirmed to work with openmpi5?
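>
> For reference, one way to check which UCX build this Open MPI actually
> resolves at run time (a sketch; the library path assumes the install
> shown below, and assumes UCX support is linked into libmpi rather than
> dlopened):
> ----------------------------------------------------------------------
> $ ucx_info -v                # prints the UCX version and build info
> $ ldd /opt/ohpc/pub/mpi/openmpi5-gnu14/5.0.7/lib/libmpi.so | grep ucx
> ----------------------------------------------------------------------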
>
> My test code:
> ----------------------------------------------------------------------
> [av@c12 test]$ cat mpi_hello_world.c
> #include <mpi.h>
> #include <stdio.h>
>
> int main(int argc, char** argv) {
>     // Initialize the MPI environment
>     MPI_Init(NULL, NULL);
>
>     // Get the number of processes
>     int world_size;
>     MPI_Comm_size(MPI_COMM_WORLD, &world_size);
>
>     // Get the rank of the process
>     int world_rank;
>     MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
>
>     // Get the name of the processor
>     char processor_name[MPI_MAX_PROCESSOR_NAME];
>     int name_len;
>     MPI_Get_processor_name(processor_name, &name_len);
>
>     // Print off a hello world message
>     printf("Hello world from processor %s, rank %d out of %d processors\n",
>            processor_name, world_rank, world_size);
>
>     // Finalize the MPI environment.
>     MPI_Finalize();
>
>     return 0;
> }
> -------------------------------------------------------------------------
> [av@c12 test]$ which mpirun
> /opt/ohpc/pub/mpi/openmpi5-gnu14/5.0.7/bin/mpirun
> [av@sms test]$ mpicc -o openmpi5_hello_world mpi_hello_world.c
> [av@sms test]$ salloc -n 4 -N 2
> salloc: Granted job allocation 63
> salloc: Nodes c[12-13] are ready for job
> [av@c12 test]$ mpirun ./openmpi5_hello_world
> Hello world from processor c12, rank 0 out of 4 processors
> Hello world from processor c12, rank 1 out of 4 processors
> Hello world from processor c13, rank 3 out of 4 processors
> Hello world from processor c13, rank 2 out of 4 processors
> [c12:1709 :0:1709]       ud_ep.c:278  Fatal: UD endpoint 0x117ae80 to <no debug data>: unhandled timeout error
> ==== backtrace (tid:   1709) ====
>  0  /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(ucs_handle_error+0x294) [0x7f200b4f3ee4]
> ................
> -----------------------------------------------------------------------
> [av@sms test]$ which mpicc
> /opt/ohpc/pub/mpi/mpich-ofi-gnu14-ohpc/3.4.3/bin/mpicc
> [av@sms test]$ which mpirun
> /opt/ohpc/pub/mpi/mpich-ofi-gnu14-ohpc/3.4.3/bin/mpirun
> [av@sms test]$ mpicc -o mpich-ofi_hello_world mpi_hello_world.c
> [av@sms test]$ salloc -n 4 -N 2
> salloc: Granted job allocation 66
> salloc: Nodes c[12-13] are ready for job
> [av@c12 test]$ mpirun ./mpich-ofi_hello_world
> Hello world from processor c13, rank 2 out of 4 processors
> Hello world from processor c13, rank 3 out of 4 processors
> Hello world from processor c12, rank 0 out of 4 processors
> Hello world from processor c12, rank 1 out of 4 processors
> [av@c12 test]$
> ------------------------------------------------------------------------
> Achilles
> On Tuesday, July 1, 2025 at 7:14:06 AM UTC-4 George Bosilca wrote:
>
>> This error message is usually due to a misconfiguration of the network.
>> However, I don't think this is the case here, because the output contains
>> messages from both odd and even ranks (which, according to your binding
>> policy, were placed on different nodes), suggesting at least some of the
>> processes were able to connect (and thus the network configuration is
>> correct).
>>
>> So I'm thinking about some timing issue during network setup, due to the
>> fact that you have many processes per node and an application that does
>> nothing except create and then shut down the network layer. Does this
>> happen if you have fewer processes per node? Does it happen if you add
>> anything else to the application (such as an
>> `MPI_Barrier(MPI_COMM_WORLD)`)?
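>>
>> If the UD transport specifically is implicated (the fatal error comes
>> from ud_ep.c), one hedged experiment is to steer UCX away from UD for a
>> run via the standard UCX_TLS environment variable, e.g.:
>>
>> ----------------------------------------------------------------------
>> # restrict UCX to RC, shared-memory, and self transports for one run
>> $ mpirun -x UCX_TLS=rc,sm,self ./mpi_hello_world
>> ----------------------------------------------------------------------
>>
>> This narrows the transport selection rather than fixing anything, but
>> it can help confirm whether the UD endpoint timeout is the culprit.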
>>
>>    George.
>>
>>
>> On Mon, Jun 30, 2025 at 10:00 PM Achilles Vassilicos <avas...@gmail.com>
>> wrote:
>>
>>> Hello all, new to the list.
>>> While testing my openmpi5.0.7 installation using the simple
>>> mpi_hello_world.c code, I am experiencing an unexpected behavior where
>>> execution on the last processor rank hangs with a "fatal unhandled
>>> timeout error", which leads to core dumps. I confirmed that it happens
>>> regardless of the compiler I use, i.e., gnu14 or intel2024.0. Moreover,
>>> it does not happen when I use mpich3.4.3-ofi. Below I am including the
>>> settings I am using and the runtime error. You will notice that the
>>> error happened on node c11, which may suggest that there is something
>>> wrong with this node. However, it turns out that any other node that
>>> happens to execute the last processor rank leads to the same error. I
>>> must be missing something. Any thoughts?
>>> Sorry about the length of the post.
>>>
>>> -----------------------------------------------------
>>> ]$ module list
>>> Currently Loaded Modules:
>>>   1) cmake/4.0.0       6) spack/0.23.1          11) mkl/2024.0         16) ifort/2024.0.0              21) EasyBuild/5.0.0
>>>   2) autotools         7) oclfpga/2024.0.0      12) intel/2024.0.0     17) inspector/2024.2            22) valgrind/3.24.0
>>>   3) hwloc/2.12.0      8) tbb/2021.11           13) debugger/2024.0.0  18) intel_ipp_intel64/2021.10   23) openmpi5/5.0.7
>>>   4) libfabric/1.18.0  9) compiler-rt/2024.0.0  14) dpl/2022.3         19) intel_ippcp_intel64/2021.9  24) ucx/1.18.0
>>>   5) prun/2.2         10) compiler/2024.0.0     15) icc/2023.2.1       20) vtune/2025.3
>>> ----------------------------------------------------------
>>> $ sinfo
>>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>>> normal*      up   infinite     10  idle* c[2-10,12]
>>> normal*      up   infinite      3   idle c[1,11,13]
>>> [av@sms test]$ salloc -n 24 -N 2
>>> salloc: Granted job allocation 61
>>> salloc: Nodes c[1,11] are ready for job
>>> [av@c1 test]$ mpirun --display-map --map-by node -x MXM_RDMA_PORTS=mlx4_0:1 -mca btl_openib_if_include mlx4_0:1 mpi_hello_world
>>>
>>> ========================   JOB MAP   ========================
>>> Data for JOB prterun-c1-1575@1 offset 0 Total slots allocated 24
>>>     Mapping policy: BYNODE:NOOVERSUBSCRIBE  Ranking policy: NODE  Binding policy: NUMA:IF-SUPPORTED
>>>     Cpu set: N/A  PPR: N/A  Cpus-per-rank: N/A  Cpu Type: CORE
>>>
>>>
>>> Data for node: c1 Num slots: 12 Max slots: 0 Num procs: 12
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 0 Bound: package[0][core:0-17]
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 2 Bound: package[0][core:0-17]
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 4 Bound: package[0][core:0-17]
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 6 Bound: package[0][core:0-17]
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 8 Bound: package[0][core:0-17]
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 10 Bound: package[0][core:0-17]
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 12 Bound: package[0][core:0-17]
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 14 Bound: package[0][core:0-17]
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 16 Bound: package[0][core:0-17]
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 18 Bound: package[0][core:0-17]
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 20 Bound: package[0][core:0-17]
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 22 Bound: package[0][core:0-17]
>>>
>>> Data for node: c11 Num slots: 12 Max slots: 0 Num procs: 12
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 1 Bound: package[0][core:0-17]
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 3 Bound: package[0][core:0-17]
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 5 Bound: package[0][core:0-17]
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 7 Bound: package[0][core:0-17]
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 9 Bound: package[0][core:0-17]
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 11 Bound: package[0][core:0-17]
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 13 Bound: package[0][core:0-17]
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 15 Bound: package[0][core:0-17]
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 17 Bound: package[0][core:0-17]
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 19 Bound: package[0][core:0-17]
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 21 Bound: package[0][core:0-17]
>>>         Process jobid: prterun-c1-1575@1 App: 0 Process rank: 23 Bound: package[0][core:0-17]
>>>
>>> =============================================================
>>> Hello world from processor c1, rank 6 out of 24 processors
>>> Hello world from processor c1, rank 20 out of 24 processors
>>> Hello world from processor c1, rank 16 out of 24 processors
>>> Hello world from processor c1, rank 12 out of 24 processors
>>> Hello world from processor c1, rank 0 out of 24 processors
>>> Hello world from processor c1, rank 2 out of 24 processors
>>> Hello world from processor c1, rank 14 out of 24 processors
>>> Hello world from processor c1, rank 10 out of 24 processors
>>> Hello world from processor c1, rank 4 out of 24 processors
>>> Hello world from processor c1, rank 22 out of 24 processors
>>> Hello world from processor c1, rank 18 out of 24 processors
>>> Hello world from processor c1, rank 8 out of 24 processors
>>> Hello world from processor c11, rank 11 out of 24 processors
>>> Hello world from processor c11, rank 1 out of 24 processors
>>> Hello world from processor c11, rank 3 out of 24 processors
>>> Hello world from processor c11, rank 13 out of 24 processors
>>> Hello world from processor c11, rank 19 out of 24 processors
>>> Hello world from processor c11, rank 7 out of 24 processors
>>> Hello world from processor c11, rank 17 out of 24 processors
>>> Hello world from processor c11, rank 21 out of 24 processors
>>> Hello world from processor c11, rank 15 out of 24 processors
>>> Hello world from processor c11, rank 23 out of 24 processors
>>> Hello world from processor c11, rank 9 out of 24 processors
>>> Hello world from processor c11, rank 5 out of 24 processors
>>> [c11:2028 :0:2028]       ud_ep.c:278  Fatal: UD endpoint 0x1c8da90 to <no debug data>: unhandled timeout error
>>> [c11:2035 :0:2035]       ud_ep.c:278  Fatal: UD endpoint 0x722a90 to <no debug data>: unhandled timeout error
>>> [c11:2025 :0:2025]       ud_ep.c:278  Fatal: UD endpoint 0xc52a90 to <no debug data>: unhandled timeout error
>>> ==== backtrace (tid:   2028) ====
>>>  0  /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(ucs_handle_error+0x294) [0x7fade4326ee4]
>>>  1  /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(ucs_fatal_error_message+0xb2) [0x7fade4324292]
>>>  2  /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(+0x2f369) [0x7fade4324369]
>>>  3  /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/ucx/libuct_ib.so.0(+0x263f0) [0x7fade110d3f0]
>>>  4  /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(+0x24987) [0x7fade4319987]
>>>  5  /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucp.so.0(ucp_worker_progress+0x2a) [0x7fade43abc9a]
>>>  6  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(+0xa09bc) [0x7fade471b9bc]
>>>  7  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_common_ucx_del_procs_nofence+0x6a) [0x7fade471b79a]
>>>  8  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_common_ucx_del_procs+0x20) [0x7fade471baf0]
>>>  9  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(mca_pml_ucx_del_procs+0x140) [0x7fade4d1cd70]
>>> 10  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(+0xac837) [0x7fade4b27837]
>>> 11  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_finalize_cleanup_domain+0x53) [0x7fade46aebd3]
>>> 12  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_finalize+0x2e) [0x7fade46a22be]
>>> 13  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(ompi_rte_finalize+0x1f9) [0x7fade4b21909]
>>> 14  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(+0xab304) [0x7fade4b26304]
>>> 15  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(ompi_mpi_instance_finalize+0xe5) [0x7fade4b26935]
>>> 16  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(ompi_mpi_finalize+0x3d1) [0x7fade4b1e091]
>>> 17  mpi_hello_world() [0x40258f]
>>> 18  /lib64/libc.so.6(+0x295d0) [0x7fade47b95d0]
>>> 19  /lib64/libc.so.6(__libc_start_main+0x80) [0x7fade47b9680]
>>> 20  mpi_hello_world() [0x402455]
>>> =================================
>>> [c11:02028] *** Process received signal ***
>>> [c11:02028] Signal: Aborted (6)
>>> [c11:02028] Signal code:  (-6)
>>> [c11:02028] [ 0] /lib64/libc.so.6(+0x3ebf0)[0x7fade47cebf0]
>>> [c11:02028] [ 1] /lib64/libc.so.6(+0x8bedc)[0x7fade481bedc]
>>> [c11:02028] [ 2] /lib64/libc.so.6(raise+0x16)[0x7fade47ceb46]
>>> [c11:02028] [ 3] /lib64/libc.so.6(abort+0xd3)[0x7fade47b8833]
>>> [c11:02028] [ 4] /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(+0x2f297)[0x7fade4324297]
>>> [c11:02028] [ 5] /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(+0x2f369)[0x7fade4324369]
>>> [c11:02028] [ 6] /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/ucx/libuct_ib.so.0(+0x263f0)[0x7fade110d3f0]
>>> [c11:02028] [ 7] /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(+0x24987)[0x7fade4319987]
>>> [c11:02028] [ 8] /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucp.so.0(ucp_worker_progress+0x2a)[0x7fade43abc9a]
>>> [c11:02028] [ 9] /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(+0xa09bc)[0x7fade471b9bc]
>>> [c11:02028] [10] /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_common_ucx_del_procs_nofence+0x6a)[0x7fade471b79a]
>>> [c11:02028] [11] /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_common_ucx_del_procs+0x20)[0x7fade471baf0]
>>> [c11:02028] [12] ==== backtrace (tid:   2035) ====
>>> ..................
>>>
>>> --------------------------------------------------------------------------------
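>>> (A note on the mpirun invocation above, as an assumption worth
>>> checking: in Open MPI 5.x the openib BTL has been removed and MXM is
>>> no longer supported, so the MXM_RDMA_PORTS and btl_openib_if_include
>>> settings are most likely ignored. If the intent is to pin the HCA and
>>> port, the UCX-era equivalent would be something like:)
>>>
>>> -----------------------------------------------------
>>> $ mpirun -x UCX_NET_DEVICES=mlx4_0:1 ./mpi_hello_world
>>> -----------------------------------------------------
>>>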
>>> Achilles