Hi,

Good afternoon.

I am using openmpi/4.0.3 on Compute Canada to run a 3D flow simulation. It
worked well at lower Reynolds numbers. However, after I increased the
Reynolds number from 3600 to 9000, Open MPI reported the errors shown below:

[gra1288:149104:0:149104] ib_mlx5_log.c:132  Transport retry count exceeded on mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[gra1288:149104:0:149104] ib_mlx5_log.c:132  DCI QP 0x2ecc1 wqe[475]: SEND s-e [rqpn 0xd7b7 rlid 1406] [va 0x2b6140d4ca80 len 8256 lkey 0x2e1bb1]
==== backtrace (tid: 149102) ====
 0 0x0000000000020753 ucs_debug_print_backtrace()  /tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/debug/debug.c:653
 1 0x000000000001dfa8 uct_ib_mlx5_completion_with_err()  /tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/mlx5/ib_mlx5_log.c:132
 2 0x0000000000056fae uct_ib_mlx5_poll_cq()  /tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/mlx5/ib_mlx5.inl:81
 3 0x0000000000056fae uct_dc_mlx5_iface_progress()  /tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/dc/dc_mlx5.c:238
 4 0x00000000000263ca ucs_callbackq_dispatch()  /tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/datastruct/callbackq.h:211
 5 0x00000000000263ca uct_worker_progress()  /tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/api/uct.h:2221
 6 0x00000000000263ca ucp_worker_progress()  /tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucp/core/ucp_worker.c:1951
 7 0x00000000000036b7 mca_pml_ucx_progress()  ???:0
 8 0x00000000000566bb opal_progress()  ???:0
 9 0x000000000007acf5 ompi_request_default_wait()  ???:0
10 0x00000000000b3ad9 MPI_Sendrecv()  ???:0
11 0x0000000000009c86 transpose_chunks()  transpose-pairwise.c:0
12 0x0000000000009d0f apply()  transpose-pairwise.c:0
13 0x0000000000422b5f channelflow::FlowFieldFD::transposeX1Y0()  ???:0
14 0x0000000000438d50 channelflow::grad_uDalpha()  ???:0
15 0x0000000000434a47 channelflow::VE_NL()  ???:0
16 0x0000000000432783 channelflow::MultistepVEDNSFD::advance()  ???:0
17 0x0000000000413767 main()  ???:0
18 0x0000000000023e1b __libc_start_main()  /cvmfs/soft.computecanada.ca/gentoo/2020/usr/src/debug/sys-libs/glibc-2.30-r8/glibc-2.30/csu/../csu/libc-start.c:308
19 0x00000000004109aa _start()  ???:0
=================================
[gra1288:149102] *** Process received signal ***
[gra1288:149102] Signal: Aborted (6)
[gra1288:149102] Signal code:  (-6)
[gra1288:149102] [ 0] /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(+0x38980)[0x2addb0310980]
[gra1288:149102] [ 1] /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(gsignal+0x141)[0x2addb0310901]
[gra1288:149102] [ 2] /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(abort+0x127)[0x2addb02fa56b]
[gra1288:149102] [ 3] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucs.so.0(+0x1f435)[0x2addb6cd7435]
[gra1288:149102] [ 4] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucs.so.0(+0x236b5)[0x2addb6cdb6b5]
[gra1288:149102] [ 5] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucs.so.0(ucs_log_dispatch+0xc9)[0x2addb6cdb7d9]
[gra1288:149102] [ 6] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x528)[0x2addb6ec1fa8]
[gra1288:149102] [ 7] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/ucx/libuct_ib.so.0(+0x56fae)[0x2addb6efafae]
[gra1288:149102] [ 8] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucp.so.0(ucp_worker_progress+0x6a)[0x2addb6c193ca]
[gra1288:149102] [ 9] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17)[0x2addafad36b7]
[gra1288:149102] [10] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/libopen-pal.so.40(opal_progress+0x2b)[0x2addb33cb6bb]
[gra1288:149102] [11] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/libmpi.so.40(ompi_request_default_wait+0x105)[0x2addaf8c5cf5]
[gra1288:149102] [12] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/libmpi.so.40(PMPI_Sendrecv+0x219)[0x2addaf8fead9]
[gra1288:149102] [13] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/fftw-mpi/3.3.8/lib/libfftw3_mpi.so.3(+0x9c86)[0x2addaf5a2c86]
[gra1288:149102] [14] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/fftw-mpi/3.3.8/lib/libfftw3_mpi.so.3(+0x9d0f)[0x2addaf5a2d0f]
[gra1288:149102] [15] ./vepoiseuilleFD_5.x[0x422b5f]
[gra1288:149102] [16] ./vepoiseuilleFD_5.x[0x438d50]
[gra1288:149102] [17] ./vepoiseuilleFD_5.x[0x434a47]
[gra1288:149102] [18] ./vepoiseuilleFD_5.x[0x432783]
[gra1288:149102] [19] ./vepoiseuilleFD_5.x[0x413767]
[gra1288:149102] [20] /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(__libc_start_main+0xeb)[0x2addb02fbe1b]
[gra1288:149102] [21] ./vepoiseuilleFD_5.x[0x4109aa]
[gra1288:149102] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 116 with PID 149096 on node gra1288 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
2 total processes killed (some possibly by mpirun during cleanup)

These are my job parameters and the command I use to launch the run:
#!/bin/bash
#SBATCH --time=0-24:00:00
#SBATCH --job-name=Wi45_Re9000
#SBATCH --output=log-%j
#SBATCH --ntasks=256
#SBATCH --nodes=8
#SBATCH --mem-per-cpu=4000M
mpirun ./vepoiseuilleFD_5.x

I have no idea what is going wrong. Please give me some hints if possible.
Thank you very much!

Wade Feng
