I can't comment much on UCX; you'll need to ask Nvidia for support on that.

But "transport retry count exceeded" errors mean that the underlying IB network 
tried to send a message a bunch of times but never received the corresponding 
ACK from the receiver indicating that the receiver successfully got the 
message.  From back in my IB days, the typical first place to look for errors 
like this is the layer 0 and layer 1 networking, using Nvidia-level 
diagnostics, to ensure that the network itself is healthy.
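As a concrete starting point, here is a sketch of the kind of fabric checks I mean, assuming the standard infiniband-diags / Nvidia (Mellanox) OFED tools are installed on your nodes (your cluster admins may prefer their own equivalents, and the grep pattern below assumes a 4X link width):

```shell
# Check the local HCA: the port should show State: Active, Physical state: LinkUp
ibstat mlx5_0

# Look for links running degraded (wrong speed/width) anywhere on the fabric
iblinkinfo | grep -iv "4X.*(Active)"

# Dump error counters on the local port; nonzero symbol-error or
# retransmit-related counters point at a bad cable, connector, or switch port
perfquery

# Full fabric sweep (usually needs root and visibility of the subnet manager)
ibdiagnet
```

If the error counters climb again after being cleared (`perfquery -R` resets them), that is a strong sign of a layer-1 problem on a specific link.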

--
Jeff Squyres
jsquy...@cisco.com

________________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Feng Wade via users 
<users@lists.open-mpi.org>
Sent: Saturday, February 19, 2022 4:04 PM
To: users@lists.open-mpi.org
Cc: Feng Wade
Subject: [OMPI users] Unknown breakdown (Transport retry count exceeded on 
mlx5_0:1/IB)

Hi,

Good afternoon.

I am using openmpi/4.0.3 on Compute Canada to run a 3D flow simulation. It worked 
quite well at lower Reynolds numbers. However, after increasing the Reynolds 
number from 3600 to 9000, Open MPI reported the errors shown below:

[gra1288:149104:0:149104] ib_mlx5_log.c:132  Transport retry count exceeded on 
mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[gra1288:149104:0:149104] ib_mlx5_log.c:132  DCI QP 0x2ecc1 wqe[475]: SEND s-e 
[rqpn 0xd7b7 rlid 1406] [va 0x2b6140d4ca80 len 8256 lkey 0x2e1bb1]
==== backtrace (tid: 149102) ====
 0 0x0000000000020753 ucs_debug_print_backtrace()  
/tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/debug/debug.c:653
 1 0x000000000001dfa8 uct_ib_mlx5_completion_with_err()  
/tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/mlx5/ib_mlx5_log.c:132
 2 0x0000000000056fae uct_ib_mlx5_poll_cq()  
/tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/mlx5/ib_mlx5.inl:81
 3 0x0000000000056fae uct_dc_mlx5_iface_progress()  
/tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/dc/dc_mlx5.c:238
 4 0x00000000000263ca ucs_callbackq_dispatch()  
/tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/datastruct/callbackq.h:211
 5 0x00000000000263ca uct_worker_progress()  
/tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/api/uct.h:2221
 6 0x00000000000263ca ucp_worker_progress()  
/tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucp/core/ucp_worker.c:1951
 7 0x00000000000036b7 mca_pml_ucx_progress()  ???:0
 8 0x00000000000566bb opal_progress()  ???:0
 9 0x000000000007acf5 ompi_request_default_wait()  ???:0
10 0x00000000000b3ad9 MPI_Sendrecv()  ???:0
11 0x0000000000009c86 transpose_chunks()  transpose-pairwise.c:0
12 0x0000000000009d0f apply()  transpose-pairwise.c:0
13 0x0000000000422b5f channelflow::FlowFieldFD::transposeX1Y0()  ???:0
14 0x0000000000438d50 channelflow::grad_uDalpha()  ???:0
15 0x0000000000434a47 channelflow::VE_NL()  ???:0
16 0x0000000000432783 channelflow::MultistepVEDNSFD::advance()  ???:0
17 0x0000000000413767 main()  ???:0
18 0x0000000000023e1b __libc_start_main()  
/cvmfs/soft.computecanada.ca/gentoo/2020/usr/src/debug/sys-libs/glibc-2.30-r8/glibc-2.30/csu/../csu/libc-start.c:308
19 0x00000000004109aa _start()  ???:0
=================================
[gra1288:149102] *** Process received signal ***
[gra1288:149102] Signal: Aborted (6)
[gra1288:149102] Signal code:  (-6)
[gra1288:149102] [ 0] 
/cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(+0x38980)[0x2addb0310980]
[gra1288:149102] [ 1] 
/cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(gsignal+0x141)[0x2addb0310901]
[gra1288:149102] [ 2] 
/cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(abort+0x127)[0x2addb02fa56b]
[gra1288:149102] [ 3] 
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucs.so.0(+0x1f435)[0x2addb6cd7435]
[gra1288:149102] [ 4] 
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucs.so.0(+0x236b5)[0x2addb6cdb6b5]
[gra1288:149102] [ 5] 
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucs.so.0(ucs_log_dispatch+0xc9)[0x2addb6cdb7d9]
[gra1288:149102] [ 6] 
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x528)[0x2addb6ec1fa8]
[gra1288:149102] [ 7] 
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/ucx/libuct_ib.so.0(+0x56fae)[0x2addb6efafae]
[gra1288:149102] [ 8] 
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucp.so.0(ucp_worker_progress+0x6a)[0x2addb6c193ca]
[gra1288:149102] [ 9] 
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17)[0x2addafad36b7]
[gra1288:149102] [10] 
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/libopen-pal.so.40(opal_progress+0x2b)[0x2addb33cb6bb]
[gra1288:149102] [11] 
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/libmpi.so.40(ompi_request_default_wait+0x105)[0x2addaf8c5cf5]
[gra1288:149102] [12] 
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/libmpi.so.40(PMPI_Sendrecv+0x219)[0x2addaf8fead9]
[gra1288:149102] [13] 
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/fftw-mpi/3.3.8/lib/libfftw3_mpi.so.3(+0x9c86)[0x2addaf5a2c86]
[gra1288:149102] [14] 
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/fftw-mpi/3.3.8/lib/libfftw3_mpi.so.3(+0x9d0f)[0x2addaf5a2d0f]
[gra1288:149102] [15] ./vepoiseuilleFD_5.x[0x422b5f]
[gra1288:149102] [16] ./vepoiseuilleFD_5.x[0x438d50]
[gra1288:149102] [17] ./vepoiseuilleFD_5.x[0x434a47]
[gra1288:149102] [18] ./vepoiseuilleFD_5.x[0x432783]
[gra1288:149102] [19] ./vepoiseuilleFD_5.x[0x413767]
[gra1288:149102] [20] 
/cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(__libc_start_main+0xeb)[0x2addb02fbe1b]
[gra1288:149102] [21] ./vepoiseuilleFD_5.x[0x4109aa]
[gra1288:149102] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 116 with PID 149096 on node gra1288 exited on 
signal 6 (Aborted).
--------------------------------------------------------------------------
2 total processes killed (some possibly by mpirun during cleanup)

These are my computation parameters and the command I use to run Open MPI:
#!/bin/bash
#SBATCH --time=0-24:00:00
#SBATCH --job-name=Wi45_Re9000
#SBATCH --output=log-%j
#SBATCH --ntasks=256
#SBATCH --nodes=8
#SBATCH --mem-per-cpu=4000M
mpirun ./vepoiseuilleFD_5.x

I have no idea what is going wrong. Please give me some hints if possible. 
Thank you very much!

Wade Feng
