Hi Jeff

I installed UCX as you suggested, but I can't get even the simplest code (ucp_client_server) to work across the network. I can compile Open MPI with UCX, but it has the same problem: MPI codes will not execute and there are no messages. Really, UCX is not helping. It adds another (not so well documented) software layer, which does not offer better diagnostics as far as I can see. It's also unclear to me how to control which drivers are loaded; UCX wants to make that decision for you.

With Open MPI I can see that (for instance) the tcp module works both locally and over the network. Judging by the bandwidth it reports on IMB-MPI1, it must be using the Mellanox NIC even with the TCP protocol. But if I try to use openib (or allow UCX or Open MPI to choose the transport layer) it just hangs. Annoyingly, I have one server where everything works just fine: I can run locally over openib without problems. None of the other nodes seem able to load openib, so even local jobs fail.

The only good diagnostic (as best I can tell) is from Open MPI. ibv_obj (from v2.x) complains that openib returns a NULL object, whereas on my server it returns logical_index=1. Can we try to diagnose why openib is not loading (see my original post for details)? I am pretty sure that if we can, that would fix the problem.

Thanks

Tony

PS I tried configuring two nodes back to back to see if it was a switch issue, but the result was the same.


On 8/19/20 1:27 PM, Jeff Squyres (jsquyres) wrote:

Tony --

Have you tried compiling Open MPI with UCX support?  This is Mellanox's
(NVIDIA's) preferred mechanism for InfiniBand support these days -- the
openib BTL is legacy.

You can run: mpirun --mca pml ucx ...
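
To compare transports one at a time, here is a minimal dry-run sketch (not from the original thread). It only prints candidate command lines and does not launch anything. The helper name mpirun_cmd and the program ./a.out are hypothetical. The MCA parameters (pml, btl, btl_base_verbose) and the UCX_LOG_LEVEL variable are standard Open MPI and UCX settings, but which components are actually available depends on how Open MPI was built.

```shell
#!/bin/sh
# Dry-run helper (hypothetical name): prints an mpirun command line that pins
# one transport, so each can be tried in turn. It does not run anything.
mpirun_cmd() {
  transport=$1
  shift
  case "$transport" in
    ucx)    # UCX PML, with UCX's own logging turned up
      echo "mpirun --mca pml ucx -x UCX_LOG_LEVEL=debug $*" ;;
    tcp)    # ob1 PML over the TCP BTL (self = loopback within one process)
      echo "mpirun --mca pml ob1 --mca btl tcp,self $*" ;;
    openib) # legacy openib BTL, with BTL selection messages enabled
      echo "mpirun --mca pml ob1 --mca btl openib,self --mca btl_base_verbose 100 $*" ;;
    *)
      echo "unknown transport: $transport" >&2
      return 1 ;;
  esac
}

# Example: show the three command lines for a two-rank run of ./a.out
for t in ucx tcp openib; do
  mpirun_cmd "$t" -np 2 ./a.out
done
```

Running the printed commands in turn on a failing node and on the working server should show whether the component selection messages differ between the two.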


On Aug 19, 2020, at 12:46 PM, Tony Ladd via users <users@lists.open-mpi.org> wrote:

One other update. I compiled OpenMPI-4.0.4. The outcome was the same, but
there is no mention of ibv_obj this time.

Tony

--

Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web    http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514

<outf34-4.0><outfoam-4.0>

--
Jeff Squyres
jsquy...@cisco.com
