Jeff
I found the solution - RDMA needs to lock (pin) a significant amount of memory,
so the shell's locked-memory limits have to be increased. I needed to add the lines
* soft memlock unlimited
* hard memlock unlimited
to the end of the file /etc/security/limits.conf. After that the openib
driver loads and everything is fine - proper IB latency again.
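To double-check that the new limit has actually taken effect, one can log out and
back in (limits.conf is only re-read for new login sessions) and run something like
the following; the mpirun line is just a sketch to confirm what a launched MPI
process itself sees:

    ulimit -l                         # should now report "unlimited"
    mpirun -np 1 bash -c 'ulimit -l'  # limit seen by a launched process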
I see that #16 of the tuning FAQ discusses the same issue, but in my
case there was no error or warning message. I am posting this in case
anyone else runs into this issue.
The Mellanox OFED install adds those lines automatically, so I had not
run into this before.
Tony
On 8/25/20 10:42 AM, Jeff Squyres (jsquyres) wrote:
On Aug 24, 2020, at 9:44 PM, Tony Ladd <tl...@che.ufl.edu> wrote:
I appreciate your help (and John's as well). At this point I don't think it is an
OMPI problem - my mistake. I think the communication with RDMA is somehow
disabled (perhaps it's the verbs layer - I am not very knowledgeable about this).
It used to work like a dream, but Mellanox has apparently disabled some of the
ConnectX-2 components, because neither ompi nor ucx (with/without ompi) could
connect with the RDMA layer. Some of the InfiniBand tools are also not
working on the X2 (mstflint, mstconfig).
If the IB stack itself is not functioning, then you're right: Open MPI won't
work, either (with openib or UCX).
You can try to keep poking with the low-layer diagnostic tools like ibv_devinfo
and ibv_rc_pingpong. If those don't work, Open MPI won't work over IB, either.
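For reference, a minimal low-level check could look like this; the device name
mlx4_0 is an assumption (use whatever ibv_devinfo lists), and server-node is a
placeholder hostname:

    ibv_devinfo                            # port state should be PORT_ACTIVE
    ibv_rc_pingpong -d mlx4_0              # on the first node (server)
    ibv_rc_pingpong -d mlx4_0 server-node  # on the second node (client)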
In fact ompi always tries to access the openib module. I have to explicitly
disable it even to run on 1 node.
Yes, that makes sense: Open MPI will aggressively try to use every possible
mechanism.
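For the record, excluding the openib BTL explicitly looks something like the
following sketch; ./a.out and the process count are placeholders:

    mpirun --mca btl ^openib -np 4 ./a.out   # run without the openib BTL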
So I think the problem lies in initialization, not communication.
I'm not sure that's correct.
From your initial emails, it looks like openib thinks it initialized properly.
This is why (I think) ibv_obj returns NULL.
I'm not sure if that's a problem or not. That section of output is where Open
MPI is measuring the distance from the current process to the PCI bus where the
device lives. I don't remember offhand if returning NULL in that area is
actually a problem or just an indication of some kind of non-error condition.
Specifically: if returning NULL there were actually a problem, we *probably* would have
aborted at that point. I have not looked at the code to verify that, though.
The better news is that with the tcp stack everything works fine (ompi, ucx, 1
node, many nodes) - the bandwidth is similar to RDMA, so for large messages it's
semi OK. It's a partial solution - not all I wanted, of course. The direct RDMA
tests (ib_read_lat, etc.) also work fine, with the expected results. I suspect
this disabling of the driver is a commercial rather than a technical decision.
I am going to try going back to Ubuntu 16.04 - there is a version of OFED that
still supports the X2. But I think it may still get messed up by kernel
upgrades (it does for 18.04, I found). So it's not an easy path.
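As a footnote, forcing the TCP fallback described above, and the raw perftest
latency check, would look roughly like this; hostnames, process counts, and
./a.out are placeholders, and the vader shared-memory BTL name assumes Open MPI
3.x/4.x:

    mpirun --mca btl tcp,self,vader -np 16 ./a.out  # TCP + shared memory only
    ib_read_lat                # on the server node (perftest package)
    ib_read_lat server-node    # on the client node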
I can't speak for Nvidia here, sorry.
--
Jeff Squyres
jsquy...@cisco.com
--
Tony Ladd
Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA
Email: tladd-"(AT)"-che.ufl.edu
Web http://ladd.che.ufl.edu
Tel: (352)-392-6509
FAX: (352)-392-9514