The short answer is likely that UCX and Open MPI v4.1.x are your way forward.
openib has basically been unmaintained for quite a while -- Nvidia (Mellanox) made it quite clear long ago that UCX was their path forward. openib was kept around until UCX became stable enough to become the preferred IB network transport -- which it now is. Due to Open MPI's backwards compatibility guarantees, we can't remove openib from the 4.0.x and 4.1.x series, but it won't be present in the upcoming Open MPI v5.0.x -- IB will be supported solely via UCX.

What I suspect you're seeing is that you've got new firmware and/or drivers on some nodes, and those are reporting a new completion opcode up to Open MPI's old openib code. The openib code hasn't been updated to handle that new opcode, so it treats it as an error and aborts. UCX and/or Open MPI v4.1.x have presumably been updated to handle that new opcode, and therefore things run smoothly.

This is just an educated guess. But if you're running in an effectively heterogeneous scenario (i.e., some nodes with old OFED, some nodes with new MLNX_OFED), weird backwards/forwards compatibility issues like this can occur.
--
Jeff Squyres
jsquy...@cisco.com

________________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Crni Gorac via users <users@lists.open-mpi.org>
Sent: Tuesday, February 22, 2022 7:37 AM
To: users@lists.open-mpi.org
Cc: Crni Gorac
Subject: [OMPI users] handle_wc() in openib and IBV_WC_DRIVER2/MLX5DV_WC_RAW_WQE completion code

We've encountered OpenMPI crashing in handle_wc(), with the following error message:

    [.../opal/mca/btl/openib/btl_openib_component.c:3610:handle_wc] Unhandled work completion opcode is 136

Our setup is admittedly a little tricky, but I'm still worried that it may be a genuine problem, so please bear with me while I try to explain.

The OpenMPI version is 3.1.2, built from source. Here is the relevant ompi_info excerpt:

    Configure command line: '--prefix=/opt/openmpi/3.1.2' '--disable-silent-rules' '--with-tm=/opt/pbs' '--enable-static=yes' '--enable-shared=yes' '--with-cuda'

Our nodes initially had open-source OFED installed, and then on a couple of nodes we replaced it with a recent MLNX_OFED (version 5.5-1.0.3.2), with the idea of testing for some time, then upgrading them all, and then switching to OpenMPI 4.x. However, the system is still in use in this intermediate state, and our code sometimes crashes with the error message mentioned above. FWIW, the configuration used for the runs in question is 2 nodes with 3 MPI ranks each, and crashes only occur if at least one of the nodes used is one of those upgraded to MLNX_OFED.

We also have OpenMPI 4.1.2, built after MLNX_OFED was installed, and when our code runs linked against this version, the crash doesn't occur. However, we built this one with UCX (1.12.0) and with openib disabled, so the code path for handling this completion opcode (if it occurs at all) is different.

When I looked into /usr/include/infiniband/verbs.h, I was able to see that opcode 136 in this context means IBV_WC_DRIVER2.
However, this opcode, as well as some other opcodes, is not present in the /usr/include/infiniband/verbs.h from the open-source OFED installation that we have used so far. On the other hand, in /usr/include/infiniband from MLNX_OFED, MLX5DV_WC_RAW_WQE is set to IBV_WC_DRIVER2 in /usr/include/infiniband/mlx5dv.h, so I conclude that the opcode 136 that OpenMPI reports as an error comes from the MLNX_OFED driver returning MLX5DV_WC_RAW_WQE.

Apparently, handle_wc() in opal/mca/btl/openib/btl_openib_component.c deals with 6 completion codes only, and reports a fatal error for the rest of them; this doesn't seem to have changed between OpenMPI 3.1.2 and 4.1.2.

So my question is: is anyone able to shed some light on the MLX5DV_WC_RAW_WQE completion code, and on what kind of problem could cause it to be returned? Or is it really just about us having built OpenMPI before the MLNX_OFED upgrade, i.e., is it to be expected that with OpenMPI rebuilt now (with the same configure flags as initially, that is, with openib kept) the problem won't occur?

Thanks.
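
For illustration only, the failure mode discussed in this thread can be sketched in a few lines of C: a completion handler that switches over a fixed set of opcodes and treats anything else as fatal. This is not the Open MPI source. The names sketch_wc_opcode and sketch_handle_wc are hypothetical, the choice of which six opcodes are handled is an assumption, and of the numeric values only 136 for IBV_WC_DRIVER2 comes from the post above; check the rest against your installed verbs.h.

```c
#include <stdio.h>

/* Hypothetical stand-ins for a few enum ibv_wc_opcode values.
 * Only 136 (IBV_WC_DRIVER2) is taken from the thread; the other
 * values are assumptions made for this sketch. */
enum sketch_wc_opcode {
    SKETCH_WC_SEND       = 0,
    SKETCH_WC_RDMA_WRITE = 1,
    SKETCH_WC_RDMA_READ  = 2,
    SKETCH_WC_COMP_SWAP  = 3,
    SKETCH_WC_FETCH_ADD  = 4,
    SKETCH_WC_RECV       = 128,
    SKETCH_WC_DRIVER2    = 136  /* reported as opcode 136 in the thread */
};

/* Returns 0 for an opcode the handler knows about, -1 otherwise.
 * The set of six handled opcodes is assumed, not copied from openib. */
int sketch_handle_wc(int opcode)
{
    switch (opcode) {
    case SKETCH_WC_SEND:
    case SKETCH_WC_RDMA_WRITE:
    case SKETCH_WC_RDMA_READ:
    case SKETCH_WC_COMP_SWAP:
    case SKETCH_WC_FETCH_ADD:
    case SKETCH_WC_RECV:
        /* A real handler would progress the matching request here. */
        return 0;
    default:
        /* A handler written before newer opcodes existed lands here. */
        fprintf(stderr, "Unhandled work completion opcode is %d\n", opcode);
        return -1;
    }
}
```

Calling sketch_handle_wc(136) takes the default branch, prints the "Unhandled work completion opcode" message to stderr, and returns -1. A handler of this shape reports any opcode it was not written for in the same way, whatever headers it was compiled against.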