We've encountered OpenMPI crashing in handle_wc(), with the following error message:

[.../opal/mca/btl/openib/btl_openib_component.c:3610:handle_wc] Unhandled work completion opcode is 136
Our setup is admittedly a little tricky, but I'm still worried that this may be a genuine problem, so please bear with me while I try to explain.

The OpenMPI version is 3.1.2, built from source. Here is the relevant ompi_info excerpt:

Configure command line: '--prefix=/opt/openmpi/3.1.2' '--disable-silent-rules' '--with-tm=/opt/pbs' '--enable-static=yes' '--enable-shared=yes' '--with-cuda'

Our nodes initially had open-source OFED installed. On a couple of nodes we then replaced it with a recent MLNX_OFED (version 5.5-1.0.3.2), with the idea of testing it for some time, then upgrading all the nodes, and then switching to OpenMPI 4.x. However, the system is still in use in this intermediate state, and our code sometimes crashes with the error message above. FWIW, the configuration used for the runs in question is 2 nodes with 3 MPI ranks each, and crashes only occur if at least one of the nodes used is one of those upgraded to MLNX_OFED.

We also have OpenMPI 4.1.2, built after MLNX_OFED was installed, and when our code is linked against this version the crash does not occur. However, we built that one with UCX (1.12.0) and with openib disabled, so the code path for handling this completion opcode (if it occurs at all) is different.

Looking into /usr/include/infiniband/verbs.h, I was able to see that opcode 136 in this context means IBV_WC_DRIVER2. However, this opcode, as well as some others, is not present in the /usr/include/infiniband/verbs.h from the open-source OFED installation that we have used so far. On the other hand, in /usr/include/infiniband from MLNX_OFED, MLX5DV_WC_RAW_WQE is set to IBV_WC_DRIVER2 in /usr/include/infiniband/mlx5dv.h, so I conclude that the opcode 136 that OpenMPI reports as an error comes from the MLNX_OFED driver returning MLX5DV_WC_RAW_WQE.
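To make the opcode arithmetic explicit, here is a minimal sketch of how 136 maps to IBV_WC_DRIVER2. It does not include the real headers: the WC_* enum below is a hypothetical local mirror of the tail of enum ibv_wc_opcode as I read it in the MLNX_OFED verbs.h (IBV_WC_RECV defined as 1 << 7, with the following entries numbered consecutively), and the two helper functions are illustrative names of mine, not part of libibverbs or OpenMPI.

```c
#include <stddef.h>

/* ASSUMPTION: local mirror of the receive-side tail of enum ibv_wc_opcode
 * from a recent verbs.h. The names are prefixed WC_ instead of IBV_WC_ to
 * make clear that this is not the real header. */
enum wc_opcode_mirror {
    WC_RECV = 1 << 7,        /* 128 */
    WC_RECV_RDMA_WITH_IMM,   /* 129 */
    WC_TM_ADD,               /* 130 */
    WC_TM_DEL,               /* 131 */
    WC_TM_SYNC,              /* 132 */
    WC_TM_RECV,              /* 133 */
    WC_TM_NO_TAG,            /* 134 */
    WC_DRIVER1,              /* 135 */
    WC_DRIVER2,              /* 136: mlx5dv.h sets MLX5DV_WC_RAW_WQE to this */
    WC_DRIVER3               /* 137 */
};

/* Hypothetical helper: map a numeric opcode to the header's name for it.
 * Returns NULL for values that are not in the mirrored range. */
const char *wc_opcode_name(int opcode)
{
    switch (opcode) {
    case WC_RECV:               return "IBV_WC_RECV";
    case WC_RECV_RDMA_WITH_IMM: return "IBV_WC_RECV_RDMA_WITH_IMM";
    case WC_TM_ADD:             return "IBV_WC_TM_ADD";
    case WC_TM_DEL:             return "IBV_WC_TM_DEL";
    case WC_TM_SYNC:            return "IBV_WC_TM_SYNC";
    case WC_TM_RECV:            return "IBV_WC_TM_RECV";
    case WC_TM_NO_TAG:          return "IBV_WC_TM_NO_TAG";
    case WC_DRIVER1:            return "IBV_WC_DRIVER1";
    case WC_DRIVER2:            return "IBV_WC_DRIVER2";
    case WC_DRIVER3:            return "IBV_WC_DRIVER3";
    default:                    return NULL;
    }
}

/* Hypothetical helper: returns 1 when the opcode has the receive bit
 * (1 << 7) set, 0 otherwise. */
int wc_is_recv_family(int opcode)
{
    return (opcode & WC_RECV) != 0;
}
```

With this mirror, wc_opcode_name(136) gives "IBV_WC_DRIVER2", and 136 has the receive bit set (128 + 8). Against the real headers one would include infiniband/verbs.h and compare wc.opcode with IBV_WC_DRIVER2 directly, which only compiles where the installed verbs.h defines that enumerator.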
Apparently, handle_wc() in opal/mca/btl/openib/btl_openib_component.c deals with only 6 completion codes and reports a fatal error for the rest; this does not seem to have changed between OpenMPI 3.1.2 and 4.1.2.

So my question is: is anyone able to shed some light on the MLX5DV_WC_RAW_WQE completion code, and on what kind of problem could cause it to be returned? Or is it really just a matter of our having built OpenMPI before the MLNX_OFED upgrade, i.e. is it to be expected that with OpenMPI rebuilt now (with the same configure flags as initially, which means with openib kept) the problem won't occur?

Thanks.