Jeff
I found the solution - RDMA needs to lock significant memory, so the
memlock limits on the shell have to be increased. I needed to add the lines

* soft memlock unlimited
* hard memlock unlimited

to the end of the file /etc/security/limits.conf. After that the openib
driver loads and everything is fine.
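For anyone who hits the same thing: limits.conf is applied at login by
pam_limits, so the change only takes effect in new sessions. Log out and
back in (or reboot the node) and check with

  ulimit -l

which should now report "unlimited".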
On Aug 24, 2020, at 9:44 PM, Tony Ladd wrote:
> I appreciate your help (and John's as well). At this point I don't think
> this is an OMPI problem - my mistake. I think the communication with RDMA
> is somehow disabled (perhaps it's the verbs layer - I am not very
> knowledgeable about this).
I apologise. That was an Omni-Path issue:
https://www.beowulf.org/pipermail/beowulf/2017-March/034214.html
On Tue, 25 Aug 2020 at 08:17, John Hearns wrote:
Aha. I dimly remember a problem with the ibverbs /dev device - maybe the
permissions, or more likely the owner account for that device.
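If it is that again, a quick check (assuming the standard kernel device
nodes) is

  ls -l /dev/infiniband/

run on a couple of nodes, comparing owner, group and mode of the uverbs*
entries.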
On Tue, 25 Aug 2020 at 02:44, Tony Ladd wrote:
Hi Jeff

I appreciate your help (and John's as well). At this point I don't think
this is an OMPI problem - my mistake. I think the communication with RDMA
is somehow disabled (perhaps it's the verbs layer - I am not very
knowledgeable about this). It used to work like a dream, but Mellanox has ...
I'm afraid I don't have many better answers for you.

I can't quite tell from your machines, but are you running IMB-MPI1 Sendrecv
*on a single node* with `--mca btl openib,self`? I don't remember offhand,
but I didn't think that openib was supposed to do loopback communication.
E.g., if both processes are on the same node, openib may have no way to
connect them.
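For comparison, the two cases would be run something like this (the path to
the IMB binary is assumed; foam and f34 are the node names mentioned
elsewhere in the thread):

  # two ranks on one node - with only openib,self enabled there may be
  # no BTL that can connect two ranks on the same host
  mpirun --mca btl openib,self -np 2 ./IMB-MPI1 Sendrecv

  # two ranks on two nodes - this actually exercises the IB fabric
  mpirun --mca btl openib,self --host foam,f34 -np 2 ./IMB-MPI1 Sendrecv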
Hi John
Thanks for the response. I have run all those diagnostics, and as best I
can tell the IB fabric is OK. I have a cluster of 49 nodes (48 clients +
server) and the fabric passes all the tests. There is 1 warning:
I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps
Tony, start at a low level. Is the InfiniBand fabric healthy? Run (a quick
loop for this is sketched below):

ibstatus on every node
sminfo on one node
ibdiagnet on one node
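A minimal sketch, assuming passwordless ssh and a nodes.txt file listing
the hostnames:

  for h in $(cat nodes.txt); do echo "== $h =="; ssh "$h" ibstatus; done
  sminfo
  ibdiagnet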
On Sun, 23 Aug 2020 at 05:02, Tony Ladd via users wrote:
Hi Jeff

I installed UCX as you suggested. But I can't get even the simplest code
(ucp_client_server) to work across the network. I can compile Open MPI
with UCX, but it has the same problem - MPI codes will not execute and
there are no messages. Really, UCX is not helping. It is adding another
layer.
Tony --

Have you tried compiling Open MPI with UCX support? This is Mellanox's
(now NVIDIA's) preferred mechanism for InfiniBand support these days -- the
openib BTL is legacy.

You can run: mpirun --mca pml ucx ...
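To check that UCX itself sees the HCA (assuming ucx_info from your UCX
install is on the PATH):

  ucx_info -d

should list the IB devices and transports. Open MPI is pointed at UCX at
configure time, e.g.:

  ./configure --with-ucx=/path/to/ucx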
On Aug 19, 2020, at 12:46 PM, Tony Ladd via users wrote:
One other update. I compiled OpenMPI-4.0.4. The outcome was the same, but
there is no mention of ibv_obj this time.
Tony
--
Tony Ladd
Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA
Email: tladd-"(AT)"-che.ufl.edu
Web: http://ladd.che.ufl.edu
My apologies - I did not read the FAQs carefully enough. With regard to
item 14:
1. openib
2. Ubuntu-supplied drivers etc.
3. Ubuntu 18.04, kernel 4.15.0-112-generic
4. opensm-3.3.5_mlnx-0.1.g6b18e73
5. Attached
6. Attached
7. unlimited on foam and 16384 on f34
I changed the ulimit to unlimited on f34 as well (a quick way to compare
the limits across nodes is sketched below).
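A minimal check, assuming passwordless ssh to each node:

  for h in foam f34; do printf '%s: ' "$h"; ssh "$h" 'ulimit -l'; done

The value a non-interactive ssh shell reports is the one an ssh-launched
MPI process on that node will actually get.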