Hi Jeff
I installed UCX as you suggested, but I can't get even the simplest code
(ucp_client_server) to work across the network. I can compile Open MPI
with UCX, but it has the same problem: MPI codes will not execute and
there are no messages. Really, UCX is not helping. It adds another
(not so well documented) software layer, which does not offer better
diagnostics as far as I can see. It's also unclear to me how to control
which drivers are being loaded; UCX wants to make that decision for you.
With Open MPI I can see that (for instance) the tcp module works both
locally and over the network; it must be using the Mellanox NIC, given
the bandwidth it reports on IMB-MPI1 even with tcp protocols. But if I
try to use openib (or allow UCX or Open MPI to choose the transport
layer) it just hangs. Annoyingly, I have this one server where everything
works just fine: I can run locally over openib and it's fine. All the
other nodes cannot seem to load openib, so even local jobs fail.
The only good diagnostic (as best I can tell) is from Open MPI. ibv_obj
(from v2.x) complains that openib returns a NULL object, whereas on my
server it returns logical_index=1. Can we not try to diagnose the
problem with openib not loading (see my original post for details)? I am
pretty sure that if we can, that would fix the problem.
Thanks
Tony
PS I tried configuring two nodes back to back to see if it was a switch
issue, but the result was the same.
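
One way to narrow down the difference between the working server and a failing node might be to collect the same low-level information on each and compare the output. The following is only a sketch: it assumes the standard `ibv_devinfo`, `ucx_info`, and `ompi_info` tools are on the PATH, and the `report` helper is a hypothetical name introduced here. Tools that are missing are reported and skipped.

```shell
#!/bin/sh
# Sketch: gather InfiniBand/UCX/Open MPI facts on one node for comparison
# with another node. `report` is a hypothetical helper, not part of any tool.
report() {
    tool="$1"
    if command -v "$tool" >/dev/null 2>&1; then
        echo "== $*"
        "$@" 2>&1          # run the tool with its arguments, keep stderr
    else
        echo "== $tool: not installed"
    fi
}

report ibv_devinfo                               # verbs view: look for PORT_ACTIVE
report ucx_info -d                               # devices and transports UCX sees
report ompi_info --param btl openib --level 9    # openib BTL parameters, if built
```

Running it on both machines and diffing the two outputs may show whether the failing nodes lack an active port, a verbs device, or the openib component itself.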
On 8/19/20 1:27 PM, Jeff Squyres (jsquyres) wrote:
Tony --
Have you tried compiling Open MPI with UCX support? This is Mellanox's
(NVIDIA's) preferred mechanism for InfiniBand support these days -- the openib
BTL is legacy.
You can run: mpirun --mca pml ucx ...
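
A minimal sketch of how that invocation might be expanded to restrict the transports UCX may use and to get some logging out of it. `UCX_TLS`, `UCX_NET_DEVICES`, and `UCX_LOG_LEVEL` are UCX environment variables; the hostnames `node1,node2`, the device name `mlx5_0:1`, and the benchmark arguments are placeholders assumed for illustration, not taken from this thread. The script only prints the command unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch only: hostnames, device name, and benchmark are assumed placeholders.
HOSTS="${HOSTS:-node1,node2}"      # hypothetical hostnames
NET_DEV="${NET_DEV:-mlx5_0:1}"     # hypothetical device:port; check `ucx_info -d`
TLS="${TLS:-rc,sm,self}"           # UCX transports to allow

build_cmd() {
    # Select the UCX PML and export the UCX settings to all ranks with -x.
    echo "mpirun -np 2 --host $HOSTS --mca pml ucx" \
         "-x UCX_TLS=$TLS -x UCX_NET_DEVICES=$NET_DEV" \
         "-x UCX_LOG_LEVEL=info ./IMB-MPI1 PingPong"
}

CMD="$(build_cmd)"
echo "$CMD"
# Execute only when explicitly requested, so the default is a dry run.
if [ "${RUN:-0}" = "1" ]; then
    $CMD
fi
```

Setting `TLS=tcp,self` instead would force UCX onto TCP, which could help separate a UCX problem from an InfiniBand transport problem.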
On Aug 19, 2020, at 12:46 PM, Tony Ladd via users <users@lists.open-mpi.org>
wrote:
One other update. I compiled OpenMPI-4.0.4. The outcome was the same, but there
is no mention of ibv_obj this time.
Tony
--
Tony Ladd
Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA
Email: tladd-"(AT)"-che.ufl.edu
Web http://ladd.che.ufl.edu
Tel: (352)-392-6509
FAX: (352)-392-9514
<outf34-4.0><outfoam-4.0>
--
Jeff Squyres
jsquy...@cisco.com