Re: [OMPI users] Verbose logging options to track IB communication issues

2022-02-23 Thread Shan-ho Tsai via users

Hi John,

Thank you so much for your detailed response, I really appreciate it.  It was 
very helpful. We had recently updated the IB card firmware on the compute 
nodes. It appears that downgrading the firmware resolves the issue.

Thank you again!
Best regards,
Shan-Ho


Shan-Ho Tsai
University of Georgia, Athens GA




From: John Hearns 
Sent: Thursday, February 17, 2022 3:10 AM
To: Open MPI Users 
Cc: Shan-ho Tsai 
Subject: Re: [OMPI users] Verbose logging options to track IB communication 
issues

[EXTERNAL SENDER - PROCEED CAUTIOUSLY]

I would start at a lower level.  Clear your error counters, then run some traffic 
over the fabric, maybe using an IMB or OSU benchmark.
Then look to see whether any ports are very noisy - that usually indicates a cable 
needing a reseat or replacement.

Next, run IMB or OSU bandwidth or latency tests between pairs of nodes. Are any 
nodes particularly slow?
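
For example, something along these lines (a sketch only; the hostnames, 
benchmark binaries and counter-reset options are placeholders and will differ 
per site):

    # reset the port error counters on the local HCA (infiniband-diags)
    perfquery -R

    # OSU latency test between one pair of nodes
    mpirun -np 2 --host nodeA,nodeB ./osu_latency

    # or the IMB equivalent
    mpirun -np 2 --host nodeA,nodeB ./IMB-MPI1 PingPong

    # then look for ports with non-zero error counters
    ibqueryerrors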

Now run tests between groups of nodes which share a leaf switch.

Finally, if this is really a problem that is being triggered by an application, 
start by bisecting your network: run the application on half the nodes, then on 
the other half.  My hunch is that you will find faulty cables.
I can of course be very wrong and it may be something that this particular 
application triggers.






On Wed, 16 Feb 2022 at 19:28, Shan-ho Tsai via users 
<users@lists.open-mpi.org> wrote:

Greetings,

We are troubleshooting an IB network fabric issue that is causing some of our 
MPI applications to fail with errors like this:


--
The InfiniBand retry count between two MPI processes has been
exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):

The total number of times that the sender wishes the receiver to
retry timeout, packet sequence, etc. errors before posting a
completion error.

This error typically means that there is something awry within the
InfiniBand fabric itself.  You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will
  attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
  to 20).  The actual timeout value used is calculated as:

 4.096 microseconds * (2^btl_openib_ib_timeout)

  See the InfiniBand spec 1.2 (section 12.7.34) for more details.

Below is some information about the host that raised the error and the
peer to which it was connected:

  Local host:   a3-6
  Local device: mlx5_0
  Peer host:a3-14

You may need to consult with your system administrator to get this
problem fixed.
--
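
(For reference, a sketch of how the timeout parameter above could be raised on 
the mpirun command line; the value and application name are placeholders, and 
btl_openib_ib_retry_count already defaults to its maximum of 7:)

    # raise the local ACK timeout from the default 20 to 24, i.e. from
    # 4.096 us * 2^20 ~= 4.3 s  to  4.096 us * 2^24 ~= 68.7 s per retry
    mpirun --mca btl_openib_ib_timeout 24 -np 64 ./my_mpi_app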

I would like to enable verbose logging for the MPI application to see if that 
could help us pinpoint the IB communication issue (or the nodes with the issue).

I see many verbose logging options reported by "ompi_info -a | grep verbose", 
but I am not sure which one(s) could be helpful. Would any of them be useful 
here, or are there other ways to enable verbose logging to help track down the 
issue?
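
For instance, I could imagine turning up framework-level verbosity with 
something like the following, but I'm not sure whether it would surface the IB 
problem (the parameters and levels here are just guesses):

    # list the available verbosity knobs
    ompi_info -a | grep verbose

    # run with the BTL and PML frameworks at a high verbosity level
    mpirun --mca btl_base_verbose 100 --mca pml_base_verbose 100 \
           -np 64 ./my_mpi_app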

Thank you so much in advance.

Best regards,


Shan-Ho Tsai
University of Georgia, Athens GA




Re: [OMPI users] handle_wc() in openib and IBV_WC_DRIVER2/MLX5DV_WC_RAW_WQE completion code

2022-02-23 Thread Jeff Squyres (jsquyres) via users
The short answer is likely that UCX and Open MPI v4.1.x are your way forward.

openib has basically been unmaintained for quite a while -- Nvidia (Mellanox) 
made it quite clear long ago that UCX was their path forward.  openib was kept 
around until UCX became stable enough to become the preferred IB network 
transport -- which it now is.  Due to Open MPI's backwards compatibility 
guarantees, we can't remove openib from the 4.0.x and 4.1.x series, but it 
won't be present in the upcoming Open MPI v5.0.x -- IB will be solely supported 
via UCX.
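
In practice that means selecting the UCX PML explicitly and excluding the 
legacy openib BTL when launching; a sketch, assuming a UCX-enabled build:

    # use UCX for the IB transport and disable the unmaintained openib BTL
    mpirun --mca pml ucx --mca btl ^openib -np 64 ./my_mpi_app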

What I suspect you're seeing is that you've got new firmware and/or drivers on 
some nodes, and those are reporting a new completion opcode up to Open MPI's old 
openib code.  The openib code hasn't been updated to handle that new opcode, 
so it gets confused, throws an error, and aborts.  UCX and/or 
Open MPI v4.1.x, presumably, have been updated to handle that new opcode, and 
therefore things run smoothly.

This is just an educated guess.  But if you're running in an 
effectively-heterogeneous scenario (i.e., some nodes with old OFED some nodes 
with new MLNX OFED), weird backwards/forwards compatibility issues like this 
can occur.

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Crni Gorac via 
users 
Sent: Tuesday, February 22, 2022 7:37 AM
To: users@lists.open-mpi.org
Cc: Crni Gorac
Subject: [OMPI users] handle_wc() in openib and 
IBV_WC_DRIVER2/MLX5DV_WC_RAW_WQE completion code

We've encountered OpenMPI crashing in handle_wc(), with following error message:
[.../opal/mca/btl/openib/btl_openib_component.c:3610:handle_wc]
Unhandled work completion opcode is 136

Our setup is admittedly a little tricky, but I'm still worried that it
may be a genuine problem, so please bear with me while I try to
explain. The OpenMPI version is 3.1.2, built from source; here
is the relevant ompi_info excerpt:
 Configure command line: '--prefix=/opt/openmpi/3.1.2'
'--disable-silent-rules' '--with-tm=/opt/pbs' '--enable-static=yes'
'--enable-shared=yes' '--with-cuda'

Our nodes initially had the open-source OFED installed, and then on a
couple of nodes we replaced it with a recent MLNX_OFED (version
5.5-1.0.3.2), with the idea of testing for some time, then upgrading them
all and switching to OpenMPI 4.x.  However, the system is still
in use in this intermediate state, and our code sometimes crashes
with the error message mentioned above.  FWIW, the
configuration used for the runs in question is 2 nodes with 3 MPI ranks
each, and crashes only occur if at least one of the nodes used is among
those upgraded to MLNX_OFED.  We also have OpenMPI 4.1.2,
built after MLNX_OFED was installed, and when our code runs linked with
that version, the crash does not occur - but we built that one with UCX
(1.12.0) and openib disabled, so the code path for handling this
completion opcode (if it occurs at all) is different.

Looking into /usr/include/infiniband/verbs.h, I was able to
see that opcode 136 in this context means IBV_WC_DRIVER2.  However,
this opcode, along with some others, is not present in the
/usr/include/infiniband/verbs.h from the open-source OFED installation
that we have used so far.  On the other hand, in the /usr/include/infiniband
headers from MLNX_OFED, MLX5DV_WC_RAW_WQE is set to
IBV_WC_DRIVER2 in /usr/include/infiniband/mlx5dv.h, so I'm concluding
that the opcode 136 that OpenMPI reports as an error comes from the
MLNX_OFED driver returning MLX5DV_WC_RAW_WQE.
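
(For completeness, the header check amounts to roughly the following; the 
paths are the MLNX_OFED defaults and may differ on other installations:)

    # opcode 136 (0x88) is IBV_WC_DRIVER2 in the MLNX_OFED verbs header
    grep -n 'IBV_WC_DRIVER2' /usr/include/infiniband/verbs.h

    # and the mlx5 provider aliases MLX5DV_WC_RAW_WQE to it
    grep -n 'MLX5DV_WC_RAW_WQE' /usr/include/infiniband/mlx5dv.h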

Apparently, handle_wc() in opal/mca/btl/openib/btl_openib_component.c
handles only 6 completion codes and reports a fatal error for the
rest; this doesn't seem to have changed between OpenMPI 3.1.2
and 4.1.2.  So my question is: can anyone shed some light on the
MLX5DV_WC_RAW_WQE completion code, and what kind of problem could
cause it to be returned?  Or is it really just that we built OpenMPI
before the MLNX_OFED upgrade, i.e. is it to be expected that with OpenMPI
rebuilt now (with the same configure flags as initially, that is, with
openib kept) the problem won't occur?

Thanks.


Re: [OMPI users] Unknown breakdown (Transport retry count exceeded on mlx5_0:1/IB)

2022-02-23 Thread Jeff Squyres (jsquyres) via users
I can't comment much on UCX; you'll need to ask Nvidia for support on that.

But transport retry count exceeded errors mean that the underlying IB network 
tried to send a message a bunch of times but never received the corresponding 
ACK from the receiver indicating that the receiver successfully got the 
message.  From back in my IB days, the typical first place to look for errors 
like this is to check the layer 0 and layer 1 networking with Nvidia-level 
diagnostics to ensure that the network itself is healthy.
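
For example, the usual fabric-level tools (a sketch; availability and options 
depend on the installed MLNX_OFED / infiniband-diags packages):

    # overall fabric health report: links, error counters, cables
    ibdiagnet

    # per-link state, speed and width, to spot degraded links
    iblinkinfo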

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Feng Wade via users 

Sent: Saturday, February 19, 2022 4:04 PM
To: users@lists.open-mpi.org
Cc: Feng Wade
Subject: [OMPI users] Unknown breakdown (Transport retry count exceeded on 
mlx5_0:1/IB)

Hi,

Good afternoon.

I am using openmpi/4.0.3 on Compute Canada to do a 3D flow simulation. It worked 
quite well at lower Reynolds numbers. However, after increasing the Reynolds 
number from 3600 to 9000, openmpi reported the errors shown below:

[gra1288:149104:0:149104] ib_mlx5_log.c:132  Transport retry count exceeded on 
mlx5_0:1/IB (synd 0x15 vend 0x81 hw_synd 0/0)
[gra1288:149104:0:149104] ib_mlx5_log.c:132  DCI QP 0x2ecc1 wqe[475]: SEND s-e 
[rqpn 0xd7b7 rlid 1406] [va 0x2b6140d4ca80 len 8256 lkey 0x2e1bb1]
==== backtrace (tid: 149102) ====
 0 0x00020753 ucs_debug_print_backtrace()  
/tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/debug/debug.c:653
 1 0x0001dfa8 uct_ib_mlx5_completion_with_err()  
/tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/mlx5/ib_mlx5_log.c:132
 2 0x00056fae uct_ib_mlx5_poll_cq()  
/tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/mlx5/ib_mlx5.inl:81
 3 0x00056fae uct_dc_mlx5_iface_progress()  
/tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/ib/dc/dc_mlx5.c:238
 4 0x000263ca ucs_callbackq_dispatch()  
/tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucs/datastruct/callbackq.h:211
 5 0x000263ca uct_worker_progress()  
/tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/uct/api/uct.h:2221
 6 0x000263ca ucp_worker_progress()  
/tmp/ebuser/avx2/UCX/1.8.0/GCCcore-9.3.0/ucx-1.8.0/src/ucp/core/ucp_worker.c:1951
 7 0x36b7 mca_pml_ucx_progress()  ???:0
 8 0x000566bb opal_progress()  ???:0
 9 0x0007acf5 ompi_request_default_wait()  ???:0
10 0x000b3ad9 MPI_Sendrecv()  ???:0
11 0x9c86 transpose_chunks()  transpose-pairwise.c:0
12 0x9d0f apply()  transpose-pairwise.c:0
13 0x00422b5f channelflow::FlowFieldFD::transposeX1Y0()  ???:0
14 0x00438d50 channelflow::grad_uDalpha()  ???:0
15 0x00434a47 channelflow::VE_NL()  ???:0
16 0x00432783 channelflow::MultistepVEDNSFD::advance()  ???:0
17 0x00413767 main()  ???:0
18 0x00023e1b __libc_start_main()  
/cvmfs/soft.computecanada.ca/gentoo/2020/usr/src/debug/sys-libs/glibc-2.30-r8/glibc-2.30/csu/../csu/libc-start.c:308
19 0x004109aa _start()  ???:0
=
[gra1288:149102] *** Process received signal ***
[gra1288:149102] Signal: Aborted (6)
[gra1288:149102] Signal code:  (-6)
[gra1288:149102] [ 0] 
/cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(+0x38980)[0x2addb0310980]
[gra1288:149102] [ 1] 
/cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(gsignal+0x141)[0x2addb0310901]
[gra1288:149102] [ 2] 
/cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(abort+0x127)[0x2addb02fa56b]
[gra1288:149102] [ 3] 
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucs.so.0(+0x1f435)[0x2addb6cd7435]
[gra1288:149102] [ 4] 
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucs.so.0(+0x236b5)[0x2addb6cdb6b5]
[gra1288:149102] [ 5] 
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/libucs.so.0(ucs_log_dispatch+0xc9)[0x2addb6cdb7d9]
[gra1288:149102] [ 6] 
/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/ucx/1.8.0/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x528)[0x2addb6ec1fa8]

Re: [OMPI users] Trouble compiling OpenMPI with Infiniband support

2022-02-23 Thread Jeff Squyres (jsquyres) via users
I'd recommend against using Open MPI v3.1.0 -- it's quite old.  If you have to 
use Open MPI v3.1.x, I'd at least suggest using v3.1.6, which has all the 
rolled-up bug fixes on the v3.1.x series.

That being said, Open MPI v4.1.2 is the most current.  Open MPI v4.1.2 does 
restrict which versions of UCX it uses because there are bugs in the older 
versions of UCX.  I am not intimately familiar with UCX -- you'll need to ask 
Nvidia for support there -- but I was under the impression that it's just a 
user-level library, and you could certainly install your own copy of UCX to use 
with your compilation of Open MPI.  I.e., you're not restricted to whatever UCX 
is installed in the cluster system-default locations.
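
E.g., something like this (a sketch; the prefixes and any other configure 
options are placeholders):

    # build a private UCX, then point Open MPI's configure at it
    ./configure --prefix=$HOME/sw/openmpi-4.1.2 \
                --with-ucx=$HOME/sw/ucx-1.12.0
    make -j 8 && make install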

I don't know why you're getting MXM-specific error messages; those don't appear 
to be coming from Open MPI (especially since you configured Open MPI with 
--without-mxm).  If you can upgrade to Open MPI v4.1.2 and the latest UCX, see 
if you are still getting those MXM error messages.

--
Jeff Squyres
jsquy...@cisco.com


From: users  on behalf of Angel de Vicente 
via users 
Sent: Friday, February 18, 2022 5:46 PM
To: Gilles Gouaillardet via users
Cc: Angel de Vicente
Subject: Re: [OMPI users] Trouble compiling OpenMPI with Infiniband support

Hello,

Gilles Gouaillardet via users  writes:

> Infiniband detection likely fails before checking expanded verbs.

thanks for this. In the end, after playing a bit with different options,
I managed to install OpenMPI 3.1.0 in our cluster using UCX (I wanted
4.1.1, but that would not compile cleanly against the old version of UCX
installed in the cluster). The configure command line (as
reported by ompi_info) was:

,
|   Configure command line: 
'--prefix=/storage/projects/can30/angelv/spack/opt/spack/linux-sles12-sandybridge/gcc-9.3.0/openmpi-3.1.0-g5a7szwxcsgmyibqvwwavfkz5b4i2ym7'
|   '--enable-shared' '--disable-silent-rules'
|   '--disable-builtin-atomics' '--with-pmi=/usr'
|   
'--with-zlib=/storage/projects/can30/angelv/spack/opt/spack/linux-sles12-sandybridge/gcc-9.3.0/zlib-1.2.11-hrstx5ffrg4f4k3xc2anyxed3mmgdcoz'
|   '--without-knem' '--with-hcoll=/opt/mellanox/hcoll'
|   '--without-psm' '--without-ofi' '--without-cma'
|   '--with-ucx=/opt/ucx' '--without-fca'
|   '--without-mxm' '--without-verbs' '--without-xpmem'
|   '--without-psm2' '--without-alps' '--without-lsf'
|   '--without-sge' '--with-slurm' '--without-tm'
|   '--without-loadleveler' '--disable-memchecker'
|   
'--with-hwloc=/storage/projects/can30/angelv/spack/opt/spack/linux-sles12-sandybridge/gcc-9.3.0/hwloc-1.11.13-kpjkidab37wn25h2oyh3eva43ycjb6c5'
|   '--disable-java' '--disable-mpi-java'
|   '--without-cuda' '--enable-wrapper-rpath'
|   '--disable-wrapper-runpath' '--disable-mpi-cxx'
|   '--disable-cxx-exceptions'
|   
'--with-wrapper-ldflags=-Wl,-rpath,/storage/projects/can30/angelv/spack/opt/spack/linux-sles12-sandybridge/gcc-7.2.0/gcc-9.3.0-ghr2jekwusoa4zip36xsa3okgp3bylqm/lib/gcc/x86_\
| 64-pc-linux-gnu/9.3.0
|   
-Wl,-rpath,/storage/projects/can30/angelv/spack/opt/spack/linux-sles12-sandybridge/gcc-7.2.0/gcc-9.3.0-ghr2jekwusoa4zip36xsa3okgp3bylqm/lib64'
`


The versions that I'm using are:

gcc:   9.3.0
mxm:   3.6.3102  (though I configure OpenMPI --without-mxm)
hcoll: 3.8.1649
knem:  1.1.2.90mlnx2 (though I configure OpenMPI --without-knem)
ucx:   1.2.2947
slurm: 18.08.7


It looks like everything executes fine, but I have a couple of warnings,
and I'm not sure how much I should worry and what I could do about them:

1) Conflicting CPU frequencies detected:

[1645221586.038838] [s01r3b78:11041:0] sys.c:744  MXM  WARN  
Conflicting CPU frequencies detected, using: 3151.41
[1645221585.740595] [s01r3b79:11484:0] sys.c:744  MXM  WARN  
Conflicting CPU frequencies detected, using: 2998.76

2) Won't use knem. In a previous try I was specifying --with-knem, but
I was getting this warning about not being able to open /dev/knem. I
guess our cluster is not properly configured w.r.t. knem, so I rebuilt
OpenMPI --without-knem, but I still get this message:

[1645221587.091122] [s01r3b74:9054 :0] shm.c:65   MXM  WARN  Could not 
open the KNEM device file at /dev/knem : No such file or directory. Won't use 
knem.
[1645221587.104807] [s01r3b76:8610 :0] shm.c:65   MXM  WARN  Could not 
open the KNEM device file at /dev/knem : No such file or directory. Won't use 
knem.


Any help/pointers appreciated. Many thanks,
--
Ángel de Vicente

Tel.: +34 922 605 747
Web.: http://research.iac.es/proyecto/p