Hi Jeff
I appreciate your help (and John's as well). At this point I don't think it
is an OMPI problem - my mistake. I think the communication with RDMA is
somehow disabled (perhaps it's the verbs layer - I am not very
knowledgeable about this). It used to work like a dream, but Mellanox has
apparently disabled some of the ConnectX-2 components, because neither
OMPI nor UCX (with or without OMPI) could connect with the RDMA layer. Some
of the InfiniBand utilities are also not working on the X2 (mstflint,
mstconfig).
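As a sanity check on the verbs layer itself (independent of MPI), the
libibverbs pingpong example can be run between two nodes - a minimal sketch,
with "node01" standing in for the server node's hostname:

  # on one node (acts as the server)
  ibv_rc_pingpong

  # on a second node, pointing at the first (node01 is a placeholder)
  ibv_rc_pingpong node01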
In fact OMPI always tries to access the openib module; I have to
explicitly disable it even to run on 1 node. So I think the problem lies in
initialization, not communication. This is why (I think) ibv_obj returns
NULL. The better news is that with the tcp stack everything works fine
(OMPI, UCX, 1 node, many nodes) - the bandwidth is similar to RDMA, so for
large messages it's semi-OK. It's a partial solution - not all I wanted, of
course. The direct RDMA functions (ib_read_lat etc.) also work fine, with
expected results. I suspect this disabling of the driver is a commercial
rather than a technical decision.
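For reference, excluding the openib BTL looks roughly like this (the
executable name is just a placeholder):

  # the ^ prefix tells Open MPI to exclude the listed BTL component(s)
  mpirun -np 2 --mca btl ^openib ./my_mpi_app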
I am going to try going back to Ubuntu 16.04 - there is a version of
OFED that still supports the X2. But I think it may still get messed up
by kernel upgrades (it does for 18.04, I found), so it's not an easy path.
Thanks again.
Tony
On 8/24/20 11:35 AM, Jeff Squyres (jsquyres) wrote:
I'm afraid I don't have many better answers for you.
I can't quite tell from your machines, but are you running IMB-MPI1 Sendrecv
*on a single node* with `--mca btl openib,self`?
I don't remember offhand, but I didn't think that openib was supposed to do loopback
communication. E.g., if both MPI processes are on the same node, `--mca btl
openib,vader,self` should do the trick (where "vader" = shared memory support).
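A concrete single-node invocation along those lines might look like the
following (the process count and benchmark path are illustrative):

  # both ranks on one node; vader supplies the shared-memory transport
  mpirun -np 2 --mca btl openib,vader,self ./IMB-MPI1 Sendrecv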
More specifically: are you running into a problem running openib (and/or UCX)
across multiple nodes?
I can't speak to Nvidia support on various models of [older] hardware
(including UCX support on that hardware). But be aware that openib is
definitely going away; it is wholly being replaced by UCX. It may be that your
only option is to stick with older software stacks in these hardware
environments.
On Aug 23, 2020, at 9:46 PM, Tony Ladd via users <users@lists.open-mpi.org>
wrote:
Hi John
Thanks for the response. I have run all those diagnostics, and as best I can
tell the IB fabric is OK. I have a cluster of 49 nodes (48 clients + server)
and the fabric passes all the tests. There is 1 warning:
I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:40Gbps > group-rate:10Gbps
but according to a number of sources this is harmless.
I have run Mellanox's P2P performance tests (ib_write_bw) between different
pairs of nodes and it reports 3.22 GB/sec, which is reasonable (it's a PCIe 2
x8 interface, i.e. 4 GB/s). I have also configured 2 nodes back to back to
check that the switch is not the problem - it makes no difference.
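A typical invocation of that test looks roughly like this (the device name
mlx4_0 and the hostname node01 are placeholders):

  # on the server node
  ib_write_bw -d mlx4_0

  # on the client node, connecting to the server
  ib_write_bw -d mlx4_0 node01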
I have been playing with the btl params in OpenMPI (v2.1.1, which is what is
released with Ubuntu 18.04). With tcp as the transport layer everything works
fine - 1-node or 2-node communication - I have tested up to 16 processes (8+8)
and it seems fine. Of course the latency is much higher on the tcp interface,
so I would still like to access the RDMA layer. But unless I exclude the openib
module, it always hangs. Same with OpenMPI v4 compiled from source.
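A run of that kind looks roughly like the following (the hostfile contents and
benchmark path are placeholders):

  # 16 processes spread over 2 nodes, forcing the tcp BTL
  mpirun -np 16 -hostfile hosts --mca btl tcp,vader,self ./IMB-MPI1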
I think an important factor is that Mellanox has not supported the Connect X2
for some time. This is really infuriating; a $500 network card with no
supported drivers, but that is business for you, I suppose. I have 50 NICs and
I can't afford to replace them all. The other factor is that MLNX-OFED is tied
to specific software versions, so I can't just run an older set of drivers. I
have not seen source files for the Mellanox drivers - I would take a crack at
compiling them if I had them. In the past I have used the OFED drivers (on
CentOS 5) with no problem, but I don't think this is an option now.
Ubuntu claims to support the Connect X2 with their drivers (Mellanox confirms
this), but of course this is community support and the number of cases is
obviously small. I use the Ubuntu drivers right now because the OFED install
seems broken and there is no help with it. It's not supported! Neat, huh?
The only handle I have is with OpenMPI v2, where there is a message (see my
original post) that ibv_obj returns a NULL result. But I don't understand the
significance of the message (if any).
I am not enthused about UCX - the documentation has several obvious typos in
it, which is not encouraging when you are floundering. I know it's a newish
project, but I have used openib for 10+ years and it has never had a problem
until now. I think this is not so much openib as the software below it. One
other thing I should say is that if I run any recent version of mstflint, it
always complains:
Failed to identify the device - Can not create SignatureManager!
Going back to my original OFED 1.5 this did not happen, but they are at v5
now. Everything else works as far as I can see. But I could not burn new
firmware except by going back to the 1.5 OS. Perhaps this is connected with
the ibv_obj = NULL result.
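For context, the complaint shows up on something as simple as a firmware
query, along these lines (the PCI address is a placeholder for the card's
actual address):

  # query the HCA firmware; 04:00.0 is a placeholder PCI address
  mstflint -d 04:00.0 q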
Thanks for helping out. As you can see I am rather stuck.
Best
Tony
On 8/23/20 3:01 AM, John Hearns via users wrote:
Tony, start at a low level. Is the InfiniBand fabric healthy?
Run:
ibstatus on every node
sminfo on one node
ibdiagnet on one node
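A rough way to sweep these checks across a cluster (assuming passwordless ssh
and a plain hosts file listing the nodes):

  # ibstatus on every node
  for h in $(cat hosts); do ssh "$h" ibstatus; done

  # subnet-manager and fabric checks from any one node
  sminfo
  ibdiagnet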
On Sun, 23 Aug 2020 at 05:02, Tony Ladd via users <users@lists.open-mpi.org>
wrote:
Hi Jeff
I installed UCX as you suggested. But I can't get even the simplest code
(ucp_client_server) to work across the network. I can compile OpenMPI
with UCX, but it has the same problem - MPI codes will not execute and
there are no messages. Really, UCX is not helping. It is adding another
(not so well documented) software layer, which does not offer better
diagnostics as far as I can see. It's also unclear to me how to control
what drivers are being loaded - UCX wants to make that decision for you.
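(One way to at least see and constrain what UCX picks is via its environment
variables - a sketch, with illustrative values:

  # list the devices and transports UCX detects
  ucx_info -d

  # pin UCX to RC verbs on a specific HCA/port (values are illustrative)
  UCX_TLS=rc,self,sm UCX_NET_DEVICES=mlx4_0:1 mpirun --mca pml ucx -np 2 ./IMB-MPI1 PingPong
)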
With OpenMPI I can see that (for instance) the tcp module works both
locally and over the network - it must be using the Mellanox NIC for the
bandwidth it is reporting on IMB-MPI1, even with tcp protocols. But if I
try to use openib (or allow UCX or OpenMPI to choose the transport
layer) it just hangs. Annoyingly, I have this one server where everything
works just fine - I can run locally over openib and it's fine. All the
other nodes cannot seem to load openib, so even local jobs fail.
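One way to get more detail on why openib fails to initialize on those nodes
is to turn up the BTL verbosity, roughly (the benchmark name is a
placeholder):

  # force openib and print component open/initialization details
  mpirun -np 2 --mca btl openib,self --mca btl_base_verbose 100 ./IMB-MPI1 PingPong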
The only good (as best I can tell) diagnostic is from OpenMPI. ibv_obj
(from v2.x) complains that openib returns a NULL object, whereas on my
server it returns logical_index=1. Can we not try to diagnose the
problem with openib not loading (see my original post for details)? I am
pretty sure that if we can, it would fix the problem.
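It might also help to compare what the verbs layer itself reports on a
failing node versus the working server, e.g.:

  # list the HCAs visible to libibverbs, their state, and firmware level
  ibv_devinfo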
Thanks
Tony
PS I tried configuring two nodes back to back to see if it was a switch
issue, but the result was the same.
On 8/19/20 1:27 PM, Jeff Squyres (jsquyres) wrote:
>
> Tony --
>
> Have you tried compiling Open MPI with UCX support? This is Mellanox
> (NVIDIA's) preferred mechanism for InfiniBand support these days -- the
> openib BTL is legacy.
>
> You can run: mpirun --mca pml ucx ...
>
>
>> On Aug 19, 2020, at 12:46 PM, Tony Ladd via users
>> <users@lists.open-mpi.org> wrote:
>>
>> One other update. I compiled OpenMPI-4.0.4. The outcome was the same,
>> but there is no mention of ibv_obj this time.
>>
>> Tony
--
Jeff Squyres
jsquy...@cisco.com
--
Tony Ladd
Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA
Email: tladd-"(AT)"-che.ufl.edu
Web http://ladd.che.ufl.edu
Tel: (352)-392-6509
FAX: (352)-392-9514