Hi Jeff

I appreciate your help (and John's as well). At this point I don't think it is an OMPI problem - my mistake. I think the communication with RDMA is somehow disabled (perhaps it's the verbs layer - I am not very knowledgeable about this). It used to work like a dream, but Mellanox has apparently disabled some of the ConnectX-2 components, because neither OMPI nor UCX (with or without OMPI) could connect with the RDMA layer. Some of the InfiniBand tools are also not working on the X2 (mstflint, mstconfig).

In fact OMPI always tries to access the openib module; I have to explicitly disable it even to run on one node. So I think the problem lies in initialization, not communication. This is why (I think) ibv_obj returns NULL. The better news is that with the TCP stack everything works fine (OMPI, UCX, one node, many nodes) - the bandwidth is similar to RDMA, so for large messages it is semi-OK. It's a partial solution - not all I wanted, of course. The direct RDMA tests (ib_read_lat etc.) also work fine, with the expected results. I suspect this disabling of the driver is a commercial rather than a technical decision.
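
For the record, the runs I am describing are along these lines (IMB-MPI1 is just the benchmark I have been using, and the hostfile name is a placeholder):

mpirun -np 2 --mca btl ^openib ./IMB-MPI1 Sendrecv                            # openib excluded, single node
mpirun -np 16 -hostfile hosts --mca btl tcp,vader,self ./IMB-MPI1 Sendrecv    # TCP stack across nodes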

I am going to try going back to Ubuntu 16.04 - there is a version of OFED that still supports the X2. But I think it may still get messed up by kernel upgrades (it does on 18.04, I found), so it's not an easy path.

Thanks again.

Tony

On 8/24/20 11:35 AM, Jeff Squyres (jsquyres) wrote:

I'm afraid I don't have many better answers for you.

I can't quite tell from your machines, but are you running IMB-MPI1 Sendrecv 
*on a single node* with `--mca btl openib,self`?

I don't remember offhand, but I didn't think that openib was supposed to do loopback 
communication.  E.g., if both MPI processes are on the same node, `--mca btl 
openib,vader,self` should do the trick (where "vader" = shared memory support).
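
In other words, something like this for a 2-process, single-node run:

mpirun -np 2 --mca btl openib,vader,self ./IMB-MPI1 Sendrecv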

More specifically: are you running into a problem running openib (and/or UCX) 
across multiple nodes?

I can't speak to Nvidia support on various models of [older] hardware 
(including UCX support on that hardware).  But be aware that openib is 
definitely going away; it is wholly being replaced by UCX.  It may be that your 
only option is to stick with older software stacks in these hardware 
environments.


On Aug 23, 2020, at 9:46 PM, Tony Ladd via users <users@lists.open-mpi.org> 
wrote:

Hi John

Thanks for the response. I have run all those diagnostics, and as best I can 
tell the IB fabric is OK. I have a cluster of 49 nodes (48 clients + server) 
and the fabric passes all the tests. There is 1 warning:

I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:40Gbps > group-rate:10Gbps

but according to a number of sources this is harmless.

I have run Mellanox's P2P performance tests (ib_write_bw) between different pairs of nodes, and it reports 3.22 GB/s, which is reasonable (it's a PCIe 2 x8 interface, i.e. 4 GB/s). I have also configured two nodes back to back to check that the switch is not the problem - it makes no difference.
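
For reference, the ib_write_bw runs were just the standard two-sided pattern ("nodeA" below is a stand-in for the first node's hostname):

ib_write_bw          # run with no arguments on nodeA (server side)
ib_write_bw nodeA    # run on the second node (client side)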

I have been playing with the btl params with OpenMPI (v2.1.1, which is what is released in Ubuntu 18.04). With tcp as the transport layer everything works fine - one-node or two-node communication - and I have tested up to 16 processes (8+8). Of course the latency is much higher on the tcp interface, so I would still like to access the RDMA layer. But unless I exclude the openib module, it always hangs. Same with OpenMPI v4 compiled from source.
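
To be concrete, a hanging run looks something like this (the benchmark is just the one I have been using, and --mca btl_base_verbose 100 is only there to show which BTLs get opened and selected):

mpirun -np 2 --mca btl openib,vader,self --mca btl_base_verbose 100 ./IMB-MPI1 Sendrecv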

I think an important component is that Mellanox has not supported the ConnectX-2 for some time. This is really infuriating; a $500 network card with no supported drivers - but that is business for you, I suppose. I have 50 NICs and I can't afford to replace them all. The other component is that MLNX-OFED is tied to specific software versions, so I can't just run an older set of drivers. I have not seen source files for the Mellanox drivers - I would take a crack at compiling them if I had. In the past I have used the OFED drivers (on CentOS 5) with no problem, but I don't think this is an option now.

Ubuntu claims to support the ConnectX-2 with its drivers (Mellanox confirms this), but of course this is community support and the number of cases is obviously small. I use the Ubuntu drivers right now because the OFED install seems broken and there is no help with it. It's not supported! Neat, huh?

The only handle I have is with OpenMPI v2, where there is a message (see my original post) that ibv_obj returns a NULL result. But I don't understand the significance of the message (if any).

I am not enthused about UCX - the documentation has several obvious typos in it, which is not encouraging when you are floundering. I know it's a newish project, but I have used openib for 10+ years and it never had a problem until now. I think the issue is not so much openib as the software below it. One other thing I should say is that if I run any recent version of mstflint, it always complains:

Failed to identify the device - Can not create SignatureManager!

Going back to my original OFED 1.5 this did not happen, but they are at v5 now.

Everything else works as far as I can see. But I could not burn new firmware except by going back to the 1.5 OS. Perhaps this is connected with the ibv_obj = NULL result.

Thanks for helping out. As you can see I am rather stuck.

Best

Tony

On 8/23/20 3:01 AM, John Hearns via users wrote:

Tony, start at a low level. Is the InfiniBand fabric healthy?
Run
ibstatus   on every node
sminfo on one node
ibdiagnet on one node
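
If those all look clean, it can also be worth checking that the verbs layer itself sees the card, e.g.
ibv_devinfo   on every node (it should list the mlx4 device with its port in the PORT_ACTIVE state)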

On Sun, 23 Aug 2020 at 05:02, Tony Ladd via users <users@lists.open-mpi.org> wrote:

    Hi Jeff

    I installed UCX as you suggested. But I can't get even the simplest code
    (ucp_client_server) to work across the network. I can compile OpenMPI with
    UCX, but it has the same problem - MPI codes will not execute and there
    are no messages. Really, UCX is not helping. It is adding another (not so
    well documented) software layer, which does not offer better diagnostics
    as far as I can see. It's also unclear to me how to control which drivers
    are being loaded - UCX wants to make that decision for you. With OpenMPI
    I can see that (for instance) the tcp module works both locally and over
    the network - it must be using the Mellanox NIC for the bandwidth it is
    reporting with IMB-MPI1, even with tcp protocols. But if I try to use
    openib (or allow UCX or OpenMPI to choose the transport layer) it just
    hangs. Annoyingly, I have this server where everything works just fine -
    I can run locally over openib and it's fine. All the other nodes cannot
    seem to load openib, so even local jobs fail.
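
    (As far as I can tell, the only handles for steering UCX are ucx_info -d,
    which lists the transports and devices it can see, and the UCX_TLS
    environment variable, e.g. UCX_TLS=tcp,self,sm, to restrict which ones it
    uses - I mention these only as the knobs I am aware of, not as a fix.)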

    The only good diagnostic (as best I can tell) is from OpenMPI: ibv_obj
    (from v2.x) complains that openib returns a NULL object, whereas on my
    server it returns logical_index=1. Can we not try to diagnose why openib
    is not loading (see my original post for details)? I am pretty sure that
    if we can, it would fix the problem.

    Thanks

    Tony

    PS I tried configuring two nodes back to back to see if it was a switch
    issue, but the result was the same.


    On 8/19/20 1:27 PM, Jeff Squyres (jsquyres) wrote:
    >
    > Tony --
    >
    > Have you tried compiling Open MPI with UCX support? This is Mellanox's
    > (NVIDIA's) preferred mechanism for InfiniBand support these days -- the
    > openib BTL is legacy.
    >
    > You can run: mpirun --mca pml ucx ...
    >
    >
    >> On Aug 19, 2020, at 12:46 PM, Tony Ladd via users
    >> <users@lists.open-mpi.org> wrote:
    >>
    >> One other update. I compiled OpenMPI-4.0.4. The outcome was the same,
    >> but there is no mention of ibv_obj this time.
    >>
    >> Tony
    >>
    >> --
    >>
    >> Tony Ladd
    >>
    >> Chemical Engineering Department
    >> University of Florida
    >> Gainesville, Florida 32611-6005
    >> USA
    >>
    >> Email: tladd-"(AT)"-che.ufl.edu
    >> Web http://ladd.che.ufl.edu
    >>
    >> Tel:   (352)-392-6509
    >> FAX:   (352)-392-9514
    >>
    >
    > --
    > Jeff Squyres
    > jsquy...@cisco.com
    >
    --
    Tony Ladd

    Chemical Engineering Department
    University of Florida
    Gainesville, Florida 32611-6005
    USA

    Email: tladd-"(AT)"-che.ufl.edu
    Web http://ladd.che.ufl.edu

    Tel:   (352)-392-6509
    FAX:   (352)-392-9514

--
Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web    http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514


--
Jeff Squyres
jsquy...@cisco.com

--
Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web    http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514
