Yalla works because MXM defaults to using unconnected datagrams (I don’t think
it uses RC unless you ask). Is this a fully connected algorithm? I ask because
(3584 - 28) * 28 * 3 (default number of QPs/remote process in btl/openib) =
298704 > 262144. This is the problem with RC. Mellanox solved
Hi,
One of our users is having trouble scaling his code up to 3584 cores (i.e. 128
28-core nodes). It runs fine on 1792 cores (64 nodes), but fails with this at
3584:
--
A process failed to create a queue pair. This usually