Hi, good afternoon.
I am using openmpi/4.0.3 on Compute Canada to run a 3D flow simulation. It worked well at lower Reynolds numbers. However, after I increased the Reynolds number from 3600 to 9000, Open MPI reported the errors shown below:

[gra1288:149104:0:149104] ib_mlx5_log.c:132 Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[gra1288:149104:0:149104] ib_mlx5_log.c:132 DCI QP 0x2ecc1 wqe[475]: SEND s-e [rqpn 0xd7b7 rlid 1406] [va 0x2b6140d4ca80 len 8256 lkey 0x2e1bb1]
==== backtrace (tid: 149102) ====
 0 0x0000000000020753 ucs_debug_print_backtrace() /tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/debug/debug.c:653
 1 0x000000000001dfa8 uct_ib_mlx5_completion_with_err() /tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/mlx5/ib_mlx5_log.c:132
 2 0x0000000000056fae uct_ib_mlx5_poll_cq() /tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/mlx5/ib_mlx5.inl:81
 3 0x0000000000056fae uct_dc_mlx5_iface_progress() /tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/dc/dc_mlx5.c:238
 4 0x00000000000263ca ucs_callbackq_dispatch() /tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/datastruct/callbackq.h:211
 5 0x00000000000263ca uct_worker_progress() /tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/api/uct.h:2221
 6 0x00000000000263ca ucp_worker_progress() /tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucp/core/ucp_worker.c:1951
 7 0x00000000000036b7 mca_pml_ucx_progress() ???:0
 8 0x00000000000566bb opal_progress() ???:0
 9 0x000000000007acf5 ompi_request_default_wait() ???:0
10 0x00000000000b3ad9 MPI_Sendrecv() ???:0
11 0x0000000000009c86 transpose_chunks() transpose-pairwise.c:0
12 0x0000000000009d0f apply() transpose-pairwise.c:0
13 0x0000000000422b5f channelflow::FlowFieldFD::transposeX1Y0() ???:0
14 0x0000000000438d50 channelflow::grad_uDalpha() ???:0
15 0x0000000000434a47 channelflow::VE_NL() ???:0
16 0x0000000000432783 channelflow::MultistepVEDNSFD::advance() ???:0
17 0x0000000000413767 main() ???:0
18 0x0000000000023e1b __libc_start_main() /cvmfs/soft.computecanada.ca/gentoo/2020/usr/src/debug/sys-libs/glibc-2.30-r8/glibc-2.30/csu/../csu/libc-start.c:308
19 0x00000000004109aa _start() ???:0
=================================
[gra1288:149102] *** Process received signal ***
[gra1288:149102] Signal: Aborted (6)
[gra1288:149102] Signal code: (-6)
[gra1288:149102] [ 0] /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(+0x38980)[0x2addb0310980]
[gra1288:149102] [ 1] /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(gsignal+0x141)[0x2addb0310901]
[gra1288:149102] [ 2] /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(abort+0x127)[0x2addb02fa56b]
[gra1288:149102] [ 3] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucs.so.0(+0x1f435)[0x2addb6cd7435]
[gra1288:149102] [ 4] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucs.so.0(+0x236b5)[0x2addb6cdb6b5]
[gra1288:149102] [ 5] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucs.so.0(ucs_log_dispatch+0xc9)[0x2addb6cdb7d9]
[gra1288:149102] [ 6] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x528)[0x2addb6ec1fa8]
[gra1288:149102] [ 7] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/ucx/libuct_ib.so.0(+0x56fae)[0x2addb6efafae]
[gra1288:149102] [ 8] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucp.so.0(ucp_worker_progress+0x6a)[0x2addb6c193ca]
[gra1288:149102] [ 9] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17)[0x2addafad36b7]
[gra1288:149102] [10] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/libopen-pal.so.40(opal_progress+0x2b)[0x2addb33cb6bb]
[gra1288:149102] [11] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/libmpi.so.40(ompi_request_default_wait+0x105)[0x2addaf8c5cf5]
[gra1288:149102] [12] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/libmpi.so.40(PMPI_Sendrecv+0x219)[0x2addaf8fead9]
[gra1288:149102] [13] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/fftw-mpi/3.3.8/lib/libfftw3_mpi.so.3(+0x9c86)[0x2addaf5a2c86]
[gra1288:149102] [14] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/fftw-mpi/3.3.8/lib/libfftw3_mpi.so.3(+0x9d0f)[0x2addaf5a2d0f]
[gra1288:149102] [15] ./vepoiseuilleFD_5.x[0x422b5f]
[gra1288:149102] [16] ./vepoiseuilleFD_5.x[0x438d50]
[gra1288:149102] [17] ./vepoiseuilleFD_5.x[0x434a47]
[gra1288:149102] [18] ./vepoiseuilleFD_5.x[0x432783]
[gra1288:149102] [19] ./vepoiseuilleFD_5.x[0x413767]
[gra1288:149102] [20] /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(__libc_start_main+0xeb)[0x2addb02fbe1b]
[gra1288:149102] [21] ./vepoiseuilleFD_5.x[0x4109aa]
[gra1288:149102] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 116 with PID 149096 on node gra1288 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
2 total processes killed (some possibly by mpirun during cleanup)

These are my job parameters and the command I use to run Open MPI:

#!/bin/bash
#SBATCH --time=0-24:00:00
#SBATCH --job-name=Wi45_Re9000
#SBATCH --output=log-%j
#SBATCH --ntasks=256
#SBATCH --nodes=8
#SBATCH --mem-per-cpu=4000M
mpirun ./vepoiseuilleFD_5.x

I have no idea what is going wrong. Please give me some hints if possible.

Thank you very much!
Wade Feng