Hi again, and thank you to Florent for answering my questions last time. The answers were very helpful!
We are seeing strange errors that occur randomly when running MPI jobs. We use Open MPI 4.0.3 with UCX and GPUDirect RDMA, and we run multi-node applications with SLURM on a cluster. We only recently got GPUDirect RDMA to work, and performance has improved, but since RDMA started working we have begun to see errors like the one below at random. It looks as if MPI_Init is unable to establish a connection between all ranks (?).

Can somebody help me make sense of the core dump output below? Where should we start digging to find the cause, and does anyone have experience with similar cases?

Best regards,
Oskar

=================================Begin output==================================================
Lmod is automatically replacing "hpcx-mpi/2.5.0-cuda" with "openmpi/4.0.3-cuda".
[r14g04:78340:0:78340] ib_mlx5_log.c:139 Transport retry count exceeded on mlx5_1:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[r14g04:78340:0:78340] ib_mlx5_log.c:139 RC QP 0x89c2 wqe[0]: SEND --e [inl len 18]
[r14g05:35369:0:35369] ib_mlx5_log.c:139 Transport retry count exceeded on mlx5_1:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[r14g05:35369:0:35369] ib_mlx5_log.c:139 RC QP 0x79df wqe[0]: SEND --e [inl len 18]
==== backtrace (tid: 78340) ====
0 0x000000000004ec80 ucs_fatal_error_message() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/assert.c:33
1 0x00000000000532b5 ucs_log_default_handler() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/log.c:140
2 0x00000000000533e4 ucs_log_dispatch() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/log.c:191
3 0x000000000001c793 uct_ib_mlx5_completion_with_err() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/mlx5/ib_mlx5_log.c:132
4 0x0000000000029f7f uct_rc_mlx5_iface_handle_failure() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/rc/accel/rc_mlx5_iface.c:216
5 0x000000000002b17b uct_ib_mlx5_poll_cq() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/mlx5/ib_mlx5.inl:38
6 0x000000000002b17b uct_rc_mlx5_iface_progress() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/rc/accel/rc_mlx5_iface.c:133
7 0x000000000001fcf2 ucs_callbackq_dispatch() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/datastruct/callbackq.h:211
8 0x000000000001fcf2 uct_worker_progress() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/api/uct.h:2203
9 0x000000000001fcf2 ucp_worker_progress() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucp/core/ucp_worker.c:1897
10 0x0000000000004877 mca_pml_ucx_progress() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/pml/ucx/pml_ucx.c:515
11 0x0000000000036bdc opal_progress() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/opal/runtime/opal_progress.c:231
12 0x00000000000bf179 wait_completion() hcoll_collectives.c:0
13 0x000000000001d0f4 comm_allreduce_hcolrte_generic() common_allreduce.c:0
14 0x000000000001d72b comm_allreduce_hcolrte() ???:0
15 0x000000000001380b hmca_bcol_ucx_p2p_init_query.part.4() bcol_ucx_p2p_component.c:0
16 0x00000000000cb86c hmca_bcol_base_init() ???:0
17 0x000000000004a328 hmca_coll_ml_init_query() ???:0
18 0x00000000000bff37 hcoll_init_with_opts() ???:0
19 0x0000000000005f90 mca_coll_hcoll_comm_query() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/hcoll/coll_hcoll_module.c:292
20 0x00000000000837ca query_2_0_0() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:449
21 0x00000000000837ca query() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:432
22 0x00000000000837ca check_one_component() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:394
23 0x00000000000837ca check_components() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:344
24 0x00000000000837ca mca_coll_base_comm_select() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:126
25 0x00000000000beb66 ompi_mpi_init() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/runtime/ompi_mpi_init.c:958
26 0x0000000000074fb1 PMPI_Init() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mpi/c/profile/pinit.c:69
27 0x000000000040250c main() ???:0
28 0x0000000000022545 __libc_start_main() ???:0
29 0x0000000000402e98 _start() ???:0
=================================
[r14g04:78340] *** Process received signal ***
[r14g04:78340] Signal: Aborted (6)
[r14g04:78340] Signal code: (-6)
[r14g04:78340] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7f96c4788630]
[r14g04:78340] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f96c3d4f377]
[r14g04:78340] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f96c3d50a68]
[r14g04:78340] [ 3] /appl/opt/ucx/1.7.0-mlnx/lib/libucs.so.0(ucs_fatal_error_message+0x55)[0x7f9641493c85]
[r14g04:78340] [ 4] /appl/opt/ucx/1.7.0-mlnx/lib/libucs.so.0(+0x532b5)[0x7f96414982b5]
[r14g04:78340] [ 5] /appl/opt/ucx/1.7.0-mlnx/lib/libucs.so.0(ucs_log_dispatch+0xc4)[0x7f96414983e4]
[r14g04:78340] [ 6] /appl/opt/ucx/1.7.0-mlnx/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x683)[0x7f96404f1793]
[r14g04:78340] [ 7] /appl/opt/ucx/1.7.0-mlnx/lib/ucx/libuct_ib.so.0(+0x29f7f)[0x7f96404fef7f]
[r14g04:78340] [ 8] /appl/opt/ucx/1.7.0-mlnx/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_iface_progress+0x41b)[0x7f964050017b]
[r14g04:78340] [ 9] /appl/opt/ucx/1.7.0-mlnx/lib/libucp.so.0(ucp_worker_progress+0x22)[0x7f9641c04cf2]
[r14g04:78340] [10] /appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17)[0x7f9641e42877]
[r14g04:78340] [11] /appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f96c3c40bdc]
[r14g04:78340] [12] /appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(+0xbf179)[0x7f963a9e6179]
[r14g04:78340] [13] /appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(+0x1d0f4)[0x7f963a9440f4]
[r14g04:78340] [14] /appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(comm_allreduce_hcolrte+0x4b)[0x7f963a94472b]
[r14g04:78340] [15] /appl/opt/hcoll/4.4.2938/lib/hcoll/hmca_bcol_ucx_p2p.so(+0x1380b)[0x7f962479980b]
[r14g04:78340] [16] /appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(hmca_bcol_base_init+0x4c)[0x7f963a9f286c]
[r14g04:78340] [17] /appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(hmca_coll_ml_init_query+0x68)[0x7f963a971328]
[r14g04:78340] [18] /appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(hcoll_init_with_opts+0x307)[0x7f963a9e6f37]
[r14g04:78340] [19] /appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/openmpi/mca_coll_hcoll.so(mca_coll_hcoll_comm_query+0x450)[0x7f963ac6ef90]
[r14g04:78340] [20] /appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libmpi.so.40(mca_coll_base_comm_select+0x13a)[0x7f96c618a7ca]
[r14g04:78340] [21] /appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libmpi.so.40(ompi_mpi_init+0xec6)[0x7f96c61c5b66]
[r14g04:78340] [22] /appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libmpi.so.40(MPI_Init+0x81)[0x7f96c617bfb1]
[r14g04:78340] [23] /scratch/project_XXXXXXX/astaroth_openmpi/prefix_meshsize_256/./benchmark[0x40250c]
[r14g04:78340] [24] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f96c3d3b545]
[r14g04:78340] [25] /scratch/project_XXXXXXX/astaroth_openmpi/prefix_meshsize_256/./benchmark[0x402e98]
[r14g04:78340] *** End of error message ***
==== backtrace (tid: 35369) ====
0 0x000000000004ec80 ucs_fatal_error_message() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/assert.c:33
1 0x00000000000532b5 ucs_log_default_handler() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/log.c:140
2 0x00000000000533e4 ucs_log_dispatch() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/log.c:191
3 0x000000000001c793 uct_ib_mlx5_completion_with_err() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/mlx5/ib_mlx5_log.c:132
4 0x0000000000029f7f uct_rc_mlx5_iface_handle_failure() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/rc/accel/rc_mlx5_iface.c:216
5 0x000000000002b17b uct_ib_mlx5_poll_cq() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/mlx5/ib_mlx5.inl:38
6 0x000000000002b17b uct_rc_mlx5_iface_progress() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/rc/accel/rc_mlx5_iface.c:133
7 0x000000000001fcf2 ucs_callbackq_dispatch() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/datastruct/callbackq.h:211
8 0x000000000001fcf2 uct_worker_progress() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/api/uct.h:2203
9 0x000000000001fcf2 ucp_worker_progress() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucp/core/ucp_worker.c:1897
10 0x0000000000004877 mca_pml_ucx_progress() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/pml/ucx/pml_ucx.c:515
11 0x0000000000036bdc opal_progress() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/opal/runtime/opal_progress.c:231
12 0x00000000000bf179 wait_completion() hcoll_collectives.c:0
13 0x000000000001e52e comm_allgather_hcolrte() ???:0
14 0x00000000000138b7 hmca_bcol_ucx_p2p_init_query.part.4() bcol_ucx_p2p_component.c:0
15 0x00000000000cb86c hmca_bcol_base_init() ???:0
16 0x000000000004a328 hmca_coll_ml_init_query() ???:0
17 0x00000000000bff37 hcoll_init_with_opts() ???:0
18 0x0000000000005f90 mca_coll_hcoll_comm_query() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/hcoll/coll_hcoll_module.c:292
19 0x00000000000837ca query_2_0_0() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:449
20 0x00000000000837ca query() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:432
21 0x00000000000837ca check_one_component() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:394
22 0x00000000000837ca check_components() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:344
23 0x00000000000837ca mca_coll_base_comm_select() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:126
24 0x00000000000beb66 ompi_mpi_init() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/runtime/ompi_mpi_init.c:958
25 0x0000000000074fb1 PMPI_Init() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mpi/c/profile/pinit.c:69
26 0x000000000040250c main() ???:0
27 0x0000000000022545 __libc_start_main() ???:0
28 0x0000000000402e98 _start() ???:0
=================================
[r14g05:35369] *** Process received signal ***
[r14g05:35369] Signal: Aborted (6)
[r14g05:35369] Signal code: (-6)
[r14g05:35369] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7f39274da630]
[r14g05:35369] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f3926aa1377]
[r14g05:35369] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f3926aa2a68]
[r14g05:35369] [ 3] /appl/opt/ucx/1.7.0-mlnx/lib/libucs.so.0(ucs_fatal_error_message+0x55)[0x7f38a41e5c85]
[r14g05:35369] [ 4] /appl/opt/ucx/1.7.0-mlnx/lib/libucs.so.0(+0x532b5)[0x7f38a41ea2b5]
[r14g05:35369] [ 5] /appl/opt/ucx/1.7.0-mlnx/lib/libucs.so.0(ucs_log_dispatch+0xc4)[0x7f38a41ea3e4]
[r14g05:35369] [ 6] /appl/opt/ucx/1.7.0-mlnx/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x683)[0x7f389f0de793]
[r14g05:35369] [ 7] /appl/opt/ucx/1.7.0-mlnx/lib/ucx/libuct_ib.so.0(+0x29f7f)[0x7f389f0ebf7f]
[r14g05:35369] [ 8] /appl/opt/ucx/1.7.0-mlnx/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_iface_progress+0x41b)[0x7f389f0ed17b]
[r14g05:35369] [ 9] /appl/opt/ucx/1.7.0-mlnx/lib/libucp.so.0(ucp_worker_progress+0x22)[0x7f38a4956cf2]
[r14g05:35369] [10] /appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17)[0x7f38a4b94877]
[r14g05:35369] [11] /appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f3926992bdc]
[r14g05:35369] [12] /appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(+0xbf179)[0x7f389d6fa179]
[r14g05:35369] [13] /appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(comm_allgather_hcolrte+0xcae)[0x7f389d65952e]
[r14g05:35369] [14] /appl/opt/hcoll/4.4.2938/lib/hcoll/hmca_bcol_ucx_p2p.so(+0x138b7)[0x7f38907118b7]
[r14g05:35369] [15] /appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(hmca_bcol_base_init+0x4c)[0x7f389d70686c]
[r14g05:35369] [16] /appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(hmca_coll_ml_init_query+0x68)[0x7f389d685328]
[r14g05:35369] [17] /appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(hcoll_init_with_opts+0x307)[0x7f389d6faf37]
[r14g05:35369] [18] /appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/openmpi/mca_coll_hcoll.so(mca_coll_hcoll_comm_query+0x450)[0x7f38a4050f90]
[r14g05:35369] [19] /appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libmpi.so.40(mca_coll_base_comm_select+0x13a)[0x7f3928edc7ca]
[r14g05:35369] [20] /appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libmpi.so.40(ompi_mpi_init+0xec6)[0x7f3928f17b66]
[r14g05:35369] [21] /appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libmpi.so.40(MPI_Init+0x81)[0x7f3928ecdfb1]
[r14g05:35369] [22] /scratch/project_XXXXXXX/astaroth_openmpi/prefix_meshsize_256/./benchmark[0x40250c]
[r14g05:35369] [23] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f3926a8d545]
[r14g05:35369] [24] /scratch/project_XXXXXXX/astaroth_openmpi/prefix_meshsize_256/./benchmark[0x402e98]
[r14g05:35369] *** End of error message ***
[r03g01:128574:0:128574] ib_mlx5_log.c:139 Transport retry count exceeded on mlx5_1:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[r03g01:128574:0:128574] ib_mlx5_log.c:139 RC QP 0xdf69 wqe[0]: SEND --e [inl len 18]
==== backtrace (tid: 128574) ====
0 0x000000000004ec80 ucs_fatal_error_message() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/assert.c:33
1 0x00000000000532b5 ucs_log_default_handler() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/log.c:140
2 0x00000000000533e4 ucs_log_dispatch() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/debug/log.c:191
3 0x000000000001c793 uct_ib_mlx5_completion_with_err() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/mlx5/ib_mlx5_log.c:132
4 0x0000000000029f7f uct_rc_mlx5_iface_handle_failure() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/rc/accel/rc_mlx5_iface.c:216
5 0x000000000002b17b uct_ib_mlx5_poll_cq() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/mlx5/ib_mlx5.inl:38
6 0x000000000002b17b uct_rc_mlx5_iface_progress() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/ib/rc/accel/rc_mlx5_iface.c:133
7 0x000000000001fcf2 ucs_callbackq_dispatch() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucs/datastruct/callbackq.h:211
8 0x000000000001fcf2 uct_worker_progress() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/uct/api/uct.h:2203
9 0x000000000001fcf2 ucp_worker_progress() /build-result/src/hpcx-v2.5.0-gcc-MLNX_OFED_LINUX-4.6-1.0.1.1-redhat7.6-x86_64/ucx-v1.7.x/src/ucp/core/ucp_worker.c:1897
10 0x0000000000004877 mca_pml_ucx_progress() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/pml/ucx/pml_ucx.c:515
11 0x0000000000036bdc opal_progress() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/opal/runtime/opal_progress.c:231
12 0x00000000000bf179 wait_completion() hcoll_collectives.c:0
13 0x000000000001e52e comm_allgather_hcolrte() ???:0
14 0x00000000000138b7 hmca_bcol_ucx_p2p_init_query.part.4() bcol_ucx_p2p_component.c:0
15 0x00000000000cb86c hmca_bcol_base_init() ???:0
16 0x000000000004a328 hmca_coll_ml_init_query() ???:0
17 0x00000000000bff37 hcoll_init_with_opts() ???:0
18 0x0000000000005f90 mca_coll_hcoll_comm_query() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/hcoll/coll_hcoll_module.c:292
19 0x00000000000837ca query_2_0_0() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:449
20 0x00000000000837ca query() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:432
21 0x00000000000837ca check_one_component() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:394
22 0x00000000000837ca check_components() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:344
23 0x00000000000837ca mca_coll_base_comm_select() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mca/coll/base/coll_base_comm_select.c:126
24 0x00000000000beb66 ompi_mpi_init() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/runtime/ompi_mpi_init.c:958
25 0x0000000000074fb1 PMPI_Init() /local_scratch/build/spack-stage/ilvonens/spack-stage/spack-stage-5aafho8w/openmpi-gitclone/ompi/mpi/c/profile/pinit.c:69
26 0x000000000040250c main() ???:0
27 0x0000000000022545 __libc_start_main() ???:0
28 0x0000000000402e98 _start() ???:0
=================================
[r03g01:128574] *** Process received signal ***
[r03g01:128574] Signal: Aborted (6)
[r03g01:128574] Signal code: (-6)
[r03g01:128574] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7f75e9284630]
[r03g01:128574] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f75e884b377]
[r03g01:128574] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f75e884ca68]
[r03g01:128574] [ 3] /appl/opt/ucx/1.7.0-mlnx/lib/libucs.so.0(ucs_fatal_error_message+0x55)[0x7f7565f8fc85]
[r03g01:128574] [ 4] /appl/opt/ucx/1.7.0-mlnx/lib/libucs.so.0(+0x532b5)[0x7f7565f942b5]
[r03g01:128574] [ 5] /appl/opt/ucx/1.7.0-mlnx/lib/libucs.so.0(ucs_log_dispatch+0xc4)[0x7f7565f943e4]
[r03g01:128574] [ 6] /appl/opt/ucx/1.7.0-mlnx/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x683)[0x7f7564fed793]
[r03g01:128574] [ 7] /appl/opt/ucx/1.7.0-mlnx/lib/ucx/libuct_ib.so.0(+0x29f7f)[0x7f7564ffaf7f]
[r03g01:128574] [ 8] /appl/opt/ucx/1.7.0-mlnx/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_iface_progress+0x41b)[0x7f7564ffc17b]
[r03g01:128574] [ 9] /appl/opt/ucx/1.7.0-mlnx/lib/libucp.so.0(ucp_worker_progress+0x22)[0x7f7566700cf2]
[r03g01:128574] [10] /appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_progress+0x17)[0x7f756693e877]
[r03g01:128574] [11] /appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f75e873cbdc]
[r03g01:128574] [12] /appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(+0xbf179)[0x7f755f4ef179]
[r03g01:128574] [13] /appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(comm_allgather_hcolrte+0xcae)[0x7f755f44e52e]
[r03g01:128574] [14] /appl/opt/hcoll/4.4.2938/lib/hcoll/hmca_bcol_ucx_p2p.so(+0x138b7)[0x7f755c4bb8b7]
[r03g01:128574] [15] /appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(hmca_bcol_base_init+0x4c)[0x7f755f4fb86c]
[r03g01:128574] [16] /appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(hmca_coll_ml_init_query+0x68)[0x7f755f47a328]
[r03g01:128574] [17] /appl/opt/hcoll/4.4.2938/lib/libhcoll.so.1(hcoll_init_with_opts+0x307)[0x7f755f4eff37]
[r03g01:128574] [18] /appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/openmpi/mca_coll_hcoll.so(mca_coll_hcoll_comm_query+0x450)[0x7f755f777f90]
[r03g01:128574] [19] /appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libmpi.so.40(mca_coll_base_comm_select+0x13a)[0x7f75eac867ca]
[r03g01:128574] [20] /appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libmpi.so.40(ompi_mpi_init+0xec6)[0x7f75eacc1b66]
[r03g01:128574] [21] /appl/spack/install-tree/gcc-8.3.0/hpcx-mpi-2.5.0-elyo5a/lib/libmpi.so.40(MPI_Init+0x81)[0x7f75eac77fb1]
[r03g01:128574] [22] /scratch/project_XXXXXXX/astaroth_openmpi/prefix_meshsize_256/./benchmark[0x40250c]
[r03g01:128574] [23] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f75e8837545]
[r03g01:128574] [24] /scratch/project_XXXXXXX/astaroth_openmpi/prefix_meshsize_256/./benchmark[0x402e98]
[r03g01:128574] *** End of error message ***
srun: error: r14g04: task 19: Aborted (core dumped)
srun: Terminating job step 2991688.0
slurmstepd: error: *** STEP 2991688.0 ON r03g01 CANCELLED AT 2020-07-20T18:24:23 ***
srun: error: r14g05: task 22: Aborted (core dumped)
srun: error: r03g01: tasks 0-2: Terminated
srun: error: r14g06: tasks 24-27: Terminated
srun: error: r03g03: tasks 8-11: Terminated
srun: error: r14g03: tasks 12-15: Terminated
srun: error: r03g02: tasks 4-7: Terminated
srun: error: r14g07: tasks 28-31: Terminated
srun: error: r14g05: tasks 20-21,23: Terminated
srun: error: r14g04: tasks 16-18: Terminated
srun: error: r03g01: task 3: Aborted (core dumped)
srun: Force Terminated job step 2991688.0
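
To see whether the failures cluster on particular nodes or on one HCA port, the "Transport retry count exceeded" lines in output like the above can be tallied per host and device. Below is a minimal Python sketch of that idea. It is not part of our setup: the function name `summarize_retry_errors`, the regular expression, and the `SAMPLE_LOG` string are my own assumptions based only on the log format shown here, so other UCX versions may need a different pattern.

```python
import re
from collections import Counter

# Assumed line format, taken from the output above:
# [host:pid:x:tid] ib_mlx5_log.c:139 Transport retry count exceeded on mlx5_1:1/IB (...)
RETRY_RE = re.compile(
    r"\[(?P<host>[^:\]]+):\d+:\d+:\d+\]\s+\S+\s+"
    r"Transport retry count exceeded on (?P<dev>\S+)"
)

def summarize_retry_errors(log_text):
    """Count retry-exceeded errors per (host, device) pair."""
    counts = Counter()
    for match in RETRY_RE.finditer(log_text):
        counts[(match.group("host"), match.group("dev"))] += 1
    return counts

# Three error lines copied from the job output, used as a small sample.
SAMPLE_LOG = """\
[r14g04:78340:0:78340] ib_mlx5_log.c:139 Transport retry count exceeded on mlx5_1:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[r14g05:35369:0:35369] ib_mlx5_log.c:139 Transport retry count exceeded on mlx5_1:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[r03g01:128574:0:128574] ib_mlx5_log.c:139 Transport retry count exceeded on mlx5_1:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
"""

if __name__ == "__main__":
    # Print one line per (host, device) with the number of errors seen.
    for (host, dev), n in sorted(summarize_retry_errors(SAMPLE_LOG).items()):
        print(f"{host}\t{dev}\t{n}")
```

Running the same function over the concatenated output of several failed jobs would show whether the same nodes, or only the mlx5_1 port, keep appearing.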