We've encountered OpenMPI crashing in handle_wc(), with the following error message:
[.../opal/mca/btl/openib/btl_openib_component.c:3610:handle_wc]
Unhandled work completion opcode is 136

Our setup is admittedly a little tricky, but I'm still worried that it
may be a genuine problem, so please bear with me while I try to
explain.  The OpenMPI version is 3.1.2, built from source; here is the
relevant ompi_info excerpt:
 Configure command line: '--prefix=/opt/openmpi/3.1.2'
'--disable-silent-rules' '--with-tm=/opt/pbs' '--enable-static=yes'
'--enable-shared=yes' '--with-cuda'

Our nodes initially had the open-source OFED installed, and then on a
couple of nodes we replaced it with a recent MLNX_OFED (version
5.5-1.0.3.2), with the idea to test it for some time, then upgrade all
the nodes, and then switch to OpenMPI 4.x.  However, the system is
still in use in this intermediate state, and our code sometimes
crashes with the error message mentioned above.  FWIW, the
configuration used for the runs in question is 2 nodes with 3 MPI
ranks each, and the crashes only occur if at least one of the nodes
used is among those upgraded to MLNX_OFED.  We also have OpenMPI 4.1.2,
built after MLNX_OFED was installed, and when our code is linked
against this version the crash doesn't occur.  But we've built that
one with UCX (1.12.0) and with openib disabled, so the code path for
handling this completion opcode (if it occurs at all) is different.

When I looked into /usr/include/infiniband/verbs.h, I saw that opcode
136 in this context means IBV_WC_DRIVER2.  However, this opcode, as
well as some other opcodes, is not present in the
/usr/include/infiniband/verbs.h from the open-source OFED installation
that we have used so far.  On the other hand, in the
/usr/include/infiniband from MLNX_OFED, MLX5DV_WC_RAW_WQE is set to
IBV_WC_DRIVER2 in /usr/include/infiniband/mlx5dv.h, so I conclude that
the opcode 136 that OpenMPI reports as an error comes from the
MLNX_OFED driver returning MLX5DV_WC_RAW_WQE.

Apparently, handle_wc() in opal/mca/btl/openib/btl_openib_component.c
deals with only 6 completion codes and reports a fatal error for the
rest of them; this doesn't seem to have changed between OpenMPI 3.1.2
and 4.1.2.  So my question is: can anyone shed some light on the
MLX5DV_WC_RAW_WQE completion code, and on what kind of problem could
cause it to be returned?  Or is it really just about us having built
OpenMPI before the MLNX_OFED upgrade, i.e. is it to be expected that
with OpenMPI rebuilt now (with the same configure flags as initially,
that is, with openib kept) the problem won't occur?
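
For reference, the pattern I'm describing looks roughly like the
sketch below.  This is not the actual OpenMPI source: the function and
enum names are hypothetical, and the handled opcodes are just a few
picked for illustration (not the real list of 6).  It only shows why
any opcode outside the known set, such as 136, ends up on the fatal
path.

```c
/* Hypothetical stand-ins, not OpenMPI or verbs identifiers. */
enum sketch_action { SKETCH_HANDLE, SKETCH_FATAL };

/* Classify a work completion opcode the way a "known set plus
 * default" switch would: known opcodes are handled, everything
 * else is treated as a fatal error. */
enum sketch_action sketch_classify_opcode(int opcode)
{
    switch (opcode) {
    case 0:   /* IBV_WC_SEND */
    case 1:   /* IBV_WC_RDMA_WRITE */
    case 2:   /* IBV_WC_RDMA_READ */
    case 128: /* IBV_WC_RECV */
        return SKETCH_HANDLE;
    default:
        /* Anything not listed above, including 136
         * (IBV_WC_DRIVER2), lands here. */
        return SKETCH_FATAL;
    }
}
```

With this shape, a driver-specific opcode that the older headers did
not define can never be handled, no matter what it actually means.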

Thanks.
