The short answer is likely that UCX and Open MPI v4.1.x are your way forward.

openib has basically been unmaintained for quite a while -- Nvidia (Mellanox) 
made it quite clear long ago that UCX was their path forward.  openib was kept 
around until UCX became stable enough to become the preferred IB network 
transport -- which it now is.  Due to Open MPI's backwards compatibility 
guarantees, we can't remove openib from the 4.0.x and 4.1.x series, but it 
won't be present in the upcoming Open MPI v5.0.x -- IB will be solely supported 
via UCX.

What I suspect you're seeing is that you've got new firmware and/or drivers on 
some nodes, and those are reporting a new opcode error up to Open MPI's old 
openib code.  The openib code hasn't been updated to handle that new opcode, 
and it gets confused and throws an error, and therefore aborts.  UCX and/or 
Open MPI v4.1.x, presumably, have been updated to handle that new opcode, and 
therefore things run smoothly.
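
To make that failure mode concrete, here is a minimal C sketch.  It is not 
the actual openib code: the enum, the function name sketch_handle_wc, and 
every value except 136 (the opcode reported in this thread) are stand-ins 
invented for illustration.  The only point is that a switch written against 
an older set of opcodes drops anything newer into its error branch.

```c
#include <stdio.h>

/* Stand-in constants for this sketch only (hypothetical names).
 * Real code would use enum ibv_wc_opcode from <infiniband/verbs.h>. */
enum sketch_wc_opcode {
    SKETCH_WC_SEND       = 0,
    SKETCH_WC_RDMA_WRITE = 1,
    SKETCH_WC_RDMA_READ  = 2,
    SKETCH_WC_RECV       = 128,
    SKETCH_WC_DRIVER2    = 136  /* the value reported in this thread */
};

/* Returns 0 if the opcode is one this code knows about, -1 otherwise. */
int sketch_handle_wc(int opcode)
{
    switch (opcode) {
    case SKETCH_WC_SEND:
    case SKETCH_WC_RDMA_WRITE:
    case SKETCH_WC_RDMA_READ:
    case SKETCH_WC_RECV:
        /* Known completion types would be processed here. */
        return 0;
    default:
        /* Anything added to the headers later lands here and is
         * reported as an error, even if it is a legitimate completion. */
        fprintf(stderr, "Unhandled work completion opcode is %d\n", opcode);
        return -1;
    }
}
```

With this sketch, sketch_handle_wc(SKETCH_WC_SEND) returns 0, while 
sketch_handle_wc(136) prints the "Unhandled" message and returns -1, which a 
caller would then treat as fatal.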

This is just an educated guess.  But if you're running in an 
effectively-heterogeneous scenario (i.e., some nodes with old OFED and some 
nodes with new MLNX_OFED), weird backwards/forwards compatibility issues like 
this can occur.

--
Jeff Squyres
jsquy...@cisco.com

________________________________________
From: users <users-boun...@lists.open-mpi.org> on behalf of Crni Gorac via 
users <users@lists.open-mpi.org>
Sent: Tuesday, February 22, 2022 7:37 AM
To: users@lists.open-mpi.org
Cc: Crni Gorac
Subject: [OMPI users] handle_wc() in openib and 
IBV_WC_DRIVER2/MLX5DV_WC_RAW_WQE completion code

We've encountered OpenMPI crashing in handle_wc(), with the following error message:
[.../opal/mca/btl/openib/btl_openib_component.c:3610:handle_wc]
Unhandled work completion opcode is 136

Our setup is admittedly a little tricky, but I'm still worried that it
may be a genuine problem, so please bear with me while I try to
explain:  The OpenMPI version is 3.1.2, it is built from source, here
is the relevant ompi_info excerpt:
 Configure command line: '--prefix=/opt/openmpi/3.1.2'
'--disable-silent-rules' '--with-tm=/opt/pbs' '--enable-static=yes'
'--enable-shared=yes' '--with-cuda'

Our nodes initially had open-source OFED installed, and then on a
couple of nodes we replaced it with a recent MLNX_OFED (version
5.5-1.0.3.2), with the idea to test for some time, then upgrade them
all and then switch to OpenMPI 4.x.  However, the system is still
in use in this intermediate state, and our code sometimes crashes
with the error message mentioned above.  FWIW, the configuration
used for the runs in question is 2 nodes with 3 MPI ranks each, and
crashes only occur if at least one of the nodes used is one of those
upgraded to MLNX_OFED.  We also have OpenMPI 4.1.2, built after
MLNX_OFED was installed, and when our code is linked against this
version, the crash doesn't occur.  However, we built that one with UCX
(1.12.0) and with openib disabled, so the code path for handling this
completion opcode (if it occurs at all) is different.

So when I looked into /usr/include/infiniband/verbs.h, I was able to
see that opcode 136 in this context means IBV_WC_DRIVER2.  However,
this opcode, as well as some other opcodes, is not present in the
/usr/include/infiniband/verbs.h from the open-source OFED installation
that we have used so far.  On the other hand, in /usr/include/infiniband
from MLNX_OFED, MLX5DV_WC_RAW_WQE is defined as IBV_WC_DRIVER2 in
/usr/include/infiniband/mlx5dv.h, so I conclude that the opcode 136
that OpenMPI reports as an error comes from the MLNX_OFED driver
returning MLX5DV_WC_RAW_WQE.

Apparently, handle_wc() in opal/mca/btl/openib/btl_openib_component.c
deals with only 6 completion codes, and reports a fatal error for the
rest of them; this doesn't seem to have changed between OpenMPI 3.1.2
and 4.1.2.  So my question is: can anyone shed some light on the
MLX5DV_WC_RAW_WQE completion code, and what kind of problem could
cause it to be returned?  Or is it really just about us having built
OpenMPI before the MLNX_OFED upgrade, i.e. is it to be expected that
with OpenMPI rebuilt now (with the same configure flags as initially,
which means with openib kept) the problem won't occur?

Thanks.
