Hello all, new to the list.
While testing my OpenMPI 5.0.7 installation with the simple 
mpi_hello_world.c code, I am seeing an unexpected behavior: all ranks 
print their message, but execution then hangs on the node holding the 
last ranks and aborts with a fatal "unhandled timeout error", which 
leads to core dumps. I confirmed that it happens regardless of the 
compiler I use, i.e., gnu14 or intel2024.0. Moreover, it does not 
happen when I use mpich3.4.3-ofi. Below I am including the test 
program, the settings I am using, and the runtime error. You will 
notice that the error happened on node c11, which may suggest that 
something is wrong with this node. However, it turns out that any 
other node that happens to execute the last processor ranks leads to 
the same error. I must be missing something. Any thoughts?
Sorry about the length of the post.
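For completeness, the test program is just the standard textbook hello 
world; what I compiled is essentially the following (the printf format 
matches the output further down):

-----------------------------------------------------
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_size, world_rank, name_len;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    /* All 24 ranks print fine; the abort happens inside MPI_Finalize
       (see ompi_mpi_finalize -> mca_pml_ucx_del_procs in the backtrace). */
    MPI_Finalize();
    return 0;
}
-----------------------------------------------------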

-----------------------------------------------------
]$ module list
Currently Loaded Modules:
  1) cmake/4.0.0        6) spack/0.23.1          11) mkl/2024.0         16) ifort/2024.0.0              21) EasyBuild/5.0.0
  2) autotools          7) oclfpga/2024.0.0      12) intel/2024.0.0     17) inspector/2024.2            22) valgrind/3.24.0
  3) hwloc/2.12.0       8) tbb/2021.11           13) debugger/2024.0.0  18) intel_ipp_intel64/2021.10   23) openmpi5/5.0.7
  4) libfabric/1.18.0   9) compiler-rt/2024.0.0  14) dpl/2022.3         19) intel_ippcp_intel64/2021.9  24) ucx/1.18.0
  5) prun/2.2          10) compiler/2024.0.0     15) icc/2023.2.1       20) vtune/2025.3
----------------------------------------------------------
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up   infinite     10  idle* c[2-10,12]
normal*      up   infinite      3   idle c[1,11,13]
[av@sms test]$ salloc -n 24 -N 2 
salloc: Granted job allocation 61
salloc: Nodes c[1,11] are ready for job
[av@c1 test]$ mpirun --display-map --map-by node -x MXM_RDMA_PORTS=mlx4_0:1 -mca btl_openib_if_include mlx4_0:1 mpi_hello_world

========================   JOB MAP   ========================
Data for JOB prterun-c1-1575@1 offset 0 Total slots allocated 24
    Mapping policy: BYNODE:NOOVERSUBSCRIBE  Ranking policy: NODE Binding policy: NUMA:IF-SUPPORTED
    Cpu set: N/A  PPR: N/A  Cpus-per-rank: N/A  Cpu Type: CORE


Data for node: c1 Num slots: 12 Max slots: 0 Num procs: 12
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 0 Bound: package[0][core:0-17]
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 2 Bound: package[0][core:0-17]
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 4 Bound: package[0][core:0-17]
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 6 Bound: package[0][core:0-17]
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 8 Bound: package[0][core:0-17]
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 10 Bound: package[0][core:0-17]
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 12 Bound: package[0][core:0-17]
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 14 Bound: package[0][core:0-17]
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 16 Bound: package[0][core:0-17]
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 18 Bound: package[0][core:0-17]
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 20 Bound: package[0][core:0-17]
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 22 Bound: package[0][core:0-17]

Data for node: c11 Num slots: 12 Max slots: 0 Num procs: 12
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 1 Bound: package[0][core:0-17]
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 3 Bound: package[0][core:0-17]
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 5 Bound: package[0][core:0-17]
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 7 Bound: package[0][core:0-17]
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 9 Bound: package[0][core:0-17]
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 11 Bound: package[0][core:0-17]
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 13 Bound: package[0][core:0-17]
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 15 Bound: package[0][core:0-17]
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 17 Bound: package[0][core:0-17]
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 19 Bound: package[0][core:0-17]
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 21 Bound: package[0][core:0-17]
        Process jobid: prterun-c1-1575@1 App: 0 Process rank: 23 Bound: package[0][core:0-17]

=============================================================
Hello world from processor c1, rank 6 out of 24 processors
Hello world from processor c1, rank 20 out of 24 processors
Hello world from processor c1, rank 16 out of 24 processors
Hello world from processor c1, rank 12 out of 24 processors
Hello world from processor c1, rank 0 out of 24 processors
Hello world from processor c1, rank 2 out of 24 processors
Hello world from processor c1, rank 14 out of 24 processors
Hello world from processor c1, rank 10 out of 24 processors
Hello world from processor c1, rank 4 out of 24 processors
Hello world from processor c1, rank 22 out of 24 processors
Hello world from processor c1, rank 18 out of 24 processors
Hello world from processor c1, rank 8 out of 24 processors
Hello world from processor c11, rank 11 out of 24 processors
Hello world from processor c11, rank 1 out of 24 processors
Hello world from processor c11, rank 3 out of 24 processors
Hello world from processor c11, rank 13 out of 24 processors
Hello world from processor c11, rank 19 out of 24 processors
Hello world from processor c11, rank 7 out of 24 processors
Hello world from processor c11, rank 17 out of 24 processors
Hello world from processor c11, rank 21 out of 24 processors
Hello world from processor c11, rank 15 out of 24 processors
Hello world from processor c11, rank 23 out of 24 processors
Hello world from processor c11, rank 9 out of 24 processors
Hello world from processor c11, rank 5 out of 24 processors
[c11:2028 :0:2028]       ud_ep.c:278  Fatal: UD endpoint 0x1c8da90 to <no debug data>: unhandled timeout error
[c11:2035 :0:2035]       ud_ep.c:278  Fatal: UD endpoint 0x722a90 to <no debug data>: unhandled timeout error
[c11:2025 :0:2025]       ud_ep.c:278  Fatal: UD endpoint 0xc52a90 to <no debug data>: unhandled timeout error
==== backtrace (tid:   2028) ====
 0  /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(ucs_handle_error+0x294) [0x7fade4326ee4]
 1  /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(ucs_fatal_error_message+0xb2) [0x7fade4324292]
 2  /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(+0x2f369) [0x7fade4324369]
 3  /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/ucx/libuct_ib.so.0(+0x263f0) [0x7fade110d3f0]
 4  /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(+0x24987) [0x7fade4319987]
 5  /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucp.so.0(ucp_worker_progress+0x2a) [0x7fade43abc9a]
 6  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(+0xa09bc) [0x7fade471b9bc]
 7  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_common_ucx_del_procs_nofence+0x6a) [0x7fade471b79a]
 8  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_common_ucx_del_procs+0x20) [0x7fade471baf0]
 9  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(mca_pml_ucx_del_procs+0x140) [0x7fade4d1cd70]
10  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(+0xac837) [0x7fade4b27837]
11  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_finalize_cleanup_domain+0x53) [0x7fade46aebd3]
12  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_finalize+0x2e) [0x7fade46a22be]
13  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(ompi_rte_finalize+0x1f9) [0x7fade4b21909]
14  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(+0xab304) [0x7fade4b26304]
15  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(ompi_mpi_instance_finalize+0xe5) [0x7fade4b26935]
16  /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libmpi.so.40(ompi_mpi_finalize+0x3d1) [0x7fade4b1e091]
17  mpi_hello_world() [0x40258f]
18  /lib64/libc.so.6(+0x295d0) [0x7fade47b95d0]
19  /lib64/libc.so.6(__libc_start_main+0x80) [0x7fade47b9680]
20  mpi_hello_world() [0x402455]
=================================
[c11:02028] *** Process received signal ***
[c11:02028] Signal: Aborted (6)
[c11:02028] Signal code:  (-6)
[c11:02028] [ 0] /lib64/libc.so.6(+0x3ebf0)[0x7fade47cebf0]
[c11:02028] [ 1] /lib64/libc.so.6(+0x8bedc)[0x7fade481bedc]
[c11:02028] [ 2] /lib64/libc.so.6(raise+0x16)[0x7fade47ceb46]
[c11:02028] [ 3] /lib64/libc.so.6(abort+0xd3)[0x7fade47b8833]
[c11:02028] [ 4] /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(+0x2f297)[0x7fade4324297]
[c11:02028] [ 5] /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(+0x2f369)[0x7fade4324369]
[c11:02028] [ 6] /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/ucx/libuct_ib.so.0(+0x263f0)[0x7fade110d3f0]
[c11:02028] [ 7] /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucs.so.0(+0x24987)[0x7fade4319987]
[c11:02028] [ 8] /opt/ohpc/pub/mpi/ucx-ohpc/1.18.0/lib/libucp.so.0(ucp_worker_progress+0x2a)[0x7fade43abc9a]
[c11:02028] [ 9] /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(+0xa09bc)[0x7fade471b9bc]
[c11:02028] [10] /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_common_ucx_del_procs_nofence+0x6a)[0x7fade471b79a]
[c11:02028] [11] /opt/ohpc/pub/mpi/openmpi5-intel/5.0.7/lib/libopen-pal.so.80(opal_common_ucx_del_procs+0x20)[0x7fade471baf0]
[c11:02028] [12] ==== backtrace (tid:   2035) ====
..................
--------------------------------------------------------------------------------
Achilles
