On Wed, Mar 16, 2016 at 3:37 PM, Cabral, Matias A
<matias.a.cab...@intel.com> wrote:
> Hi Michael,
>
> I may be missing some context, but if you are using the QLogic cards you will
> always want to use the PSM MTL (-mca pml cm -mca mtl psm) and not the openib BTL.
> As Tom suggested, confirm the limits are set up on every node: could it be that
> the alltoall is reaching a node that "others" are not? Please share the command
> line and the error message.



Yes, under normal circumstances I use PSM.  I only disabled it to see
whether it made any difference.

The test I'm running is:

mpirun -n 512 ./IMB-MPI1 alltoallv

When the run gets to 128 ranks, it freezes and errors out with:

---

A process failed to create a queue pair. This usually means either
the device has run out of queue pairs (too many connections) or
there are insufficient resources available to allocate a queue pair
(out of memory). The latter can happen if either 1) insufficient
memory is available, or 2) no more physical memory can be registered
with the device.

For more information on memory registration see the Open MPI FAQs at:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host:             node001
Local device:           qib0
Queue pair type:        Reliable connected (RC)

---
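
For a sense of scale on the "too many connections" case in that message,
here is a rough back-of-the-envelope sketch. It assumes every rank ends up
with a reliable-connected queue pair to every rank on another node, which is
what a full all-to-all pattern would need. The function name, the 16 ranks
per node, and the qps_per_peer multiplier are all assumptions made for
illustration, not values taken from this cluster or from the Open MPI
defaults.

```python
def estimate_rc_qps_per_node(total_ranks, ranks_per_node, qps_per_peer=1):
    """Rough count of RC queue pairs one node needs for a full all-to-all.

    Assumes ranks on the same node do not use the HCA to talk to each
    other, so only off-node peers are counted.
    """
    if ranks_per_node <= 0 or total_ranks < ranks_per_node:
        raise ValueError("need 0 < ranks_per_node <= total_ranks")
    remote_peers = total_ranks - ranks_per_node
    # Each local rank holds qps_per_peer queue pairs per remote peer.
    return ranks_per_node * remote_peers * qps_per_peer


if __name__ == "__main__":
    # 16 ranks per node is a hypothetical layout, not the real one.
    for total in (64, 128, 256, 512):
        print(total, estimate_rc_qps_per_node(total, ranks_per_node=16))
```

The count grows with the product of local ranks and remote peers, so it
climbs quickly as the benchmark steps up in rank count. Whether that
actually exceeds what the qib0 device can provide depends on the hardware
limit and on how many queue pairs the BTL opens per peer, neither of which
is known here.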

I've also tried various nodes across the cluster (200+).  I think I
have ruled out an errant switch (a single QLogic 12800-120), bad
cables, and bad nodes.  That's not to say such problems aren't present;
I just haven't been able to find them.
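
On the suggestion to confirm the limits on every node: a minimal sketch of
such a check is below, assuming Linux nodes with Python 3 available. It
prints the locked-memory limit that the process itself sees, which is the
limit the FAQ entry linked in the error message is about. The helper names
are made up for this example.

```python
import resource
import socket


def format_limit(value):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value} bytes"


def describe_memlock_limit():
    """Return one line with this host's soft and hard memlock limits."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    host = socket.gethostname()
    return f"{host}: memlock soft={format_limit(soft)} hard={format_limit(hard)}"


if __name__ == "__main__":
    print(describe_memlock_limit())
```

The limits a process gets when started through the job launcher can differ
from those of an interactive login shell, so a check like this is most
useful when it is started the same way as the MPI job and the output is
compared across nodes.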
