Aha. I dimly remember a problem with the ibverbs /dev device - maybe the permissions, or more likely the owner account for that device.
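Worth checking directly on a node that fails (paths assume the standard
uverbs setup; the udev rule is only an illustration):

    $ ls -l /dev/infiniband/
    # a normal user needs read/write on uverbs* (and rdma_cm), or
    # ibv_open_device() will fail; if the files are root-only, a udev
    # rule along these lines opens them up:
    #   KERNEL=="uverbs*", MODE="0666"
    #   KERNEL=="rdma_cm", MODE="0666"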
On Tue, 25 Aug 2020 at 02:44, Tony Ladd <tl...@che.ufl.edu> wrote:

> Hi Jeff
>
> I appreciate your help (and John's as well). At this point I don't
> think it is an OMPI problem - my mistake. I think the communication
> with RDMA is somehow disabled (perhaps it's the verbs layer - I am not
> very knowledgeable about this). It used to work like a dream, but
> Mellanox has apparently disabled some of the Connect X2 components,
> because neither OMPI nor UCX (with or without OMPI) could connect with
> the RDMA layer. Some of the InfiniBand tools are also not working on
> the X2 (mstflint, mstconfig).
>
> In fact OMPI always tries to access the openib module; I have to
> explicitly disable it even to run on one node. So I think the problem
> lies in initialization, not communication. This is why (I think)
> ibv_obj returns NULL. The better news is that with the TCP stack
> everything works fine (OMPI, UCX, one node, many nodes) - the
> bandwidth is similar to RDMA, so for large messages it's semi-OK. It's
> a partial solution - not all I wanted, of course. The direct RDMA
> tests (ib_read_lat etc.) also work fine, with the expected results. I
> suspect this disabling of the driver is a commercial more than a
> technical decision.
>
> I am going to try going back to Ubuntu 16.04 - there is a version of
> OFED that still supports the X2. But I think it may still get messed
> up by kernel upgrades (it does for 18.04, I found). So it's not an
> easy path.
>
> Thanks again.
>
> Tony
>
> On 8/24/20 11:35 AM, Jeff Squyres (jsquyres) wrote:
> > [External Email]
> >
> > I'm afraid I don't have many better answers for you.
> >
> > I can't quite tell from your machines, but are you running IMB-MPI1
> > Sendrecv *on a single node* with `--mca btl openib,self`?
> >
> > I don't remember offhand, but I didn't think that openib was
> > supposed to do loopback communication. E.g., if both MPI processes
> > are on the same node, `--mca btl openib,vader,self` should do the
> > trick (where "vader" = shared memory support).
> >
> > More specifically: are you running into a problem running openib
> > (and/or UCX) across multiple nodes?
> >
> > I can't speak to Nvidia support on various models of [older]
> > hardware (including UCX support on that hardware). But be aware that
> > openib is definitely going away; it is wholly being replaced by UCX.
> > It may be that your only option is to stick with older software
> > stacks in these hardware environments.
> >
> >
> >> On Aug 23, 2020, at 9:46 PM, Tony Ladd via users
> >> <users@lists.open-mpi.org> wrote:
> >>
> >> Hi John
> >>
> >> Thanks for the response. I have run all those diagnostics, and as
> >> best I can tell the IB fabric is OK. I have a cluster of 49 nodes
> >> (48 clients + server) and the fabric passes all the tests. There is
> >> one warning:
> >>
> >> I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte
> >> rate:10Gbps SL:0x00
> >> -W- Suboptimal rate for group. Lowest member rate:40Gbps >
> >> group-rate:10Gbps
> >>
> >> but according to a number of sources this is harmless.
> >>
> >> I have run Mellanox's P2P performance tests (ib_write_bw) between
> >> different pairs of nodes and it reports 3.22 GB/s, which is
> >> reasonable (it's a PCIe 2 x8 interface, i.e. 4 GB/s). I have also
> >> configured two nodes back to back to check that the switch is not
> >> the problem - it makes no difference.
> >>
> >> I have been playing with the btl params with OpenMPI (v. 2.1.1,
> >> which is what is released in Ubuntu 18.04).
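> >> For reference, the kinds of invocations I have been testing look
> >> roughly like this (host names and the IMB-MPI1 path are
> >> placeholders):
> >>
> >>     # single node, shared memory only - works:
> >>     $ mpirun -np 2 --mca btl vader,self ./IMB-MPI1 Sendrecv
> >>     # single node with the verbs BTL - hangs unless openib is excluded:
> >>     $ mpirun -np 2 --mca btl openib,vader,self ./IMB-MPI1 Sendrecv
> >>     # two nodes over openib - also hangs:
> >>     $ mpirun -np 2 -H node1,node2 --mca btl openib,vader,self ./IMB-MPI1 Sendrecv
> >>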
> >> So with TCP as the transport layer everything works fine - one-node
> >> or two-node communication - I have tested up to 16 processes (8+8)
> >> and it seems fine. Of course the latency is much higher on the TCP
> >> interface, so I would still like to access the RDMA layer. But
> >> unless I exclude the openib module, it always hangs. Same with
> >> OpenMPI v4 compiled from source.
> >>
> >> I think an important component is that Mellanox has not supported
> >> Connect X2 for some time. This is really infuriating; a $500 network
> >> card with no supported drivers, but that is business for you, I
> >> suppose. I have 50 NICs and I can't afford to replace them all. The
> >> other component is that MLNX-OFED is tied to specific software
> >> versions, so I can't just run an older set of drivers. I have not
> >> seen source files for the Mellanox drivers - I would take a crack at
> >> compiling them if I had them. In the past I have used the OFED
> >> drivers (on CentOS 5) with no problem, but I don't think this is an
> >> option now.
> >>
> >> Ubuntu claims to support Connect X2 with their drivers (Mellanox
> >> confirms this), but of course this is community support and the
> >> number of cases is obviously small. I use the Ubuntu drivers right
> >> now because the OFED install seems broken and there is no help with
> >> it. It's not supported! Neat, huh?
> >>
> >> The only handle I have is with OpenMPI v2, where there is a message
> >> (see my original post) that ibv_obj returns a NULL result. But I
> >> don't understand the significance of the message (if any).
> >>
> >> I am not enthused about UCX - the documentation has several obvious
> >> typos in it, which is not encouraging when you are floundering. I
> >> know it's a newish project, but I have used openib for 10+ years and
> >> it never had a problem until now. I think this is not so much openib
> >> as the software below it. One other thing I should say is that if I
> >> run any recent version of mstflint, it always complains:
> >>
> >> Failed to identify the device - Can not create SignatureManager!
> >>
> >> Going back to my original OFED 1.5 this did not happen, but they are
> >> at v5 now.
> >>
> >> Everything else works as far as I can see. But I could not burn new
> >> firmware except by going back to the 1.5 OS. Perhaps this is
> >> connected with the ibv_obj = NULL result.
> >>
> >> Thanks for helping out. As you can see I am rather stuck.
> >>
> >> Best
> >>
> >> Tony
> >>
> >> On 8/23/20 3:01 AM, John Hearns via users wrote:
> >>> [External Email]
> >>>
> >>> Tony, start at a low level. Is the InfiniBand fabric healthy?
> >>> Run
> >>> ibstatus on every node
> >>> sminfo on one node
> >>> ibdiagnet on one node
> >>>
> >>> On Sun, 23 Aug 2020 at 05:02, Tony Ladd via users
> >>> <users@lists.open-mpi.org> wrote:
> >>>
> >>> Hi Jeff
> >>>
> >>> I installed UCX as you suggested. But I can't get even the
> >>> simplest code (ucp_client_server) to work across the network. I
> >>> can compile OpenMPI with UCX, but it has the same problem - MPI
> >>> codes will not execute and there are no messages. Really, UCX is
> >>> not helping. It is adding another (not so well documented)
> >>> software layer, which does not offer better diagnostics as far as
> >>> I can see. It's also unclear to me how to control what drivers are
> >>> being loaded - UCX wants to make that decision for you.
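> >>> From what I can tell from the UCX docs, the choice can at least be
> >>> constrained and logged with environment variables - a sketch, with
> >>> illustrative values:
> >>>
> >>>     # list the devices and transports UCX can see on this host:
> >>>     $ ucx_info -d
> >>>     # restrict UCX to verbs RC + shared memory and log its choices:
> >>>     $ UCX_TLS=rc,sm UCX_LOG_LEVEL=info mpirun -np 2 --mca pml ucx ./a.out
> >>>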
> >>> With OpenMPI I can see that (for instance) the tcp module works
> >>> both locally and over the network - it must be using the Mellanox
> >>> NIC for the bandwidth it is reporting on IMB-MPI1, even with TCP
> >>> protocols. But if I try to use openib (or allow UCX or OpenMPI to
> >>> choose the transport layer) it just hangs. Annoyingly, I have this
> >>> server where everything works just fine - I can run locally over
> >>> openib and it's fine. All the other nodes cannot seem to load
> >>> openib, so even local jobs fail.
> >>>
> >>> The only good diagnostic (as best I can tell) is from OpenMPI:
> >>> ibv_obj (from v2.x) complains that openib returns a NULL object,
> >>> whereas on my server it returns logical_index=1. Can we not try to
> >>> diagnose the problem of openib not loading (see my original post
> >>> for details)? I am pretty sure that if we can, that would fix the
> >>> problem.
> >>>
> >>> Thanks
> >>>
> >>> Tony
> >>>
> >>> PS I tried configuring two nodes back to back to see if it was a
> >>> switch issue, but the result was the same.
> >>>
> >>>
> >>> On 8/19/20 1:27 PM, Jeff Squyres (jsquyres) wrote:
> >>> > [External Email]
> >>> >
> >>> > Tony --
> >>> >
> >>> > Have you tried compiling Open MPI with UCX support? This is
> >>> > Mellanox's (NVIDIA's) preferred mechanism for InfiniBand support
> >>> > these days -- the openib BTL is legacy.
> >>> >
> >>> > You can run: mpirun --mca pml ucx ...
> >>> >
> >>> >
> >>> >> On Aug 19, 2020, at 12:46 PM, Tony Ladd via users
> >>> >> <users@lists.open-mpi.org> wrote:
> >>> >>
> >>> >> One other update. I compiled OpenMPI-4.0.4. The outcome was the
> >>> >> same, but there is no mention of ibv_obj this time.
> >>> >>
> >>> >> Tony
> >>> >>
> >>> >> --
> >>> >> Tony Ladd
> >>> >> Chemical Engineering Department
> >>> >> University of Florida
> >>> >> Gainesville, Florida 32611-6005
> >>> >> USA
> >>> >>
> >>> >> Email: tladd-"(AT)"-che.ufl.edu
> >>> >> Web http://ladd.che.ufl.edu
> >>> >> Tel: (352)-392-6509
> >>> >> FAX: (352)-392-9514
> >>> >>
> >>> >> <outf34-4.0><outfoam-4.0>
> >>> >
> >>> > --
> >>> > Jeff Squyres
> >>> > jsquy...@cisco.com
> >>>
> >>> --
> >>> Tony Ladd
> >>> Chemical Engineering Department
> >>> University of Florida
> >>> Gainesville, Florida 32611-6005
> >>> USA
> >>>
> >>> Email: tladd-"(AT)"-che.ufl.edu
> >>> Web http://ladd.che.ufl.edu
> >>> Tel: (352)-392-6509
> >>> FAX: (352)-392-9514
> >>
> >> --
> >> Tony Ladd
> >> Chemical Engineering Department
> >> University of Florida
> >> Gainesville, Florida 32611-6005
> >> USA
> >>
> >> Email: tladd-"(AT)"-che.ufl.edu
> >> Web http://ladd.che.ufl.edu
> >> Tel: (352)-392-6509
> >> FAX: (352)-392-9514
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
>
> --
> Tony Ladd
>
> Chemical Engineering Department
> University of Florida
> Gainesville, Florida 32611-6005
> USA
>
> Email: tladd-"(AT)"-che.ufl.edu
> Web http://ladd.che.ufl.edu
> Tel: (352)-392-6509
> FAX: (352)-392-9514
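PS If the /dev permissions look sane, it may also be worth pinning the
transport explicitly to narrow down where the hang occurs - roughly
(host names and the benchmark path are placeholders):

    # known-good baseline: TCP + shared memory, openib excluded:
    $ mpirun -np 2 -H node1,node2 --mca btl tcp,vader,self ./IMB-MPI1 Sendrecv
    # same thing by exclusion:
    $ mpirun -np 2 -H node1,node2 --mca btl '^openib' ./IMB-MPI1 Sendrecv
    # then force the UCX PML on its own:
    $ mpirun -np 2 -H node1,node2 --mca pml ucx ./IMB-MPI1 Sendrecv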