On Wed, Mar 16, 2016 at 3:37 PM, Cabral, Matias A <matias.a.cab...@intel.com> wrote:
> Hi Michael,
>
> I may be missing some context, if you are using the qlogic cards you will
> always want to use the psm mtl (-mca pml cm -mca mtl psm) and not openib btl.
> As Tom suggest, confirm the limits are setup on every node: could it be the
> alltoall is reaching a node that "others" are not? Please share the command
> line and the error message.

Yes, under normal circumstances I use PSM. I only disabled it to see if it
effected any kind of change.

The test I'm running is:

    mpirun -n 512 ./IMB-MPI1 alltoallv

When the system gets to 128 ranks, it freezes and errors out with:

---
A process failed to create a queue pair. This usually means either
the device has run out of queue pairs (too many connections) or
there are insufficient resources available to allocate a queue pair
(out of memory). The latter can happen if either 1) insufficient
memory is available, or 2) no more physical memory can be registered
with the device.

For more information on memory registration see the Open MPI FAQs at:
http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

Local host:      node001
Local device:    qib0
Queue pair type: Reliable connected (RC)
---

I've also tried various nodes across the cluster (200+). I think I have ruled
out an errant switch (a single QLogic 12800-120), bad cables, and bad nodes.
That's not to say they may not be present; I've just not been able to find
them.
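
For the "confirm the limits on every node" step, a small script along these lines could report the locked-memory limit that the error text points to. This is only a sketch: it uses Python's standard `resource` module, and the helper names `memlock_limits` and `describe` are made up for illustration. It should be run on each node the same way the MPI ranks are launched, since the limits in an interactive login shell can differ from those a launched process inherits.

```python
import resource
import socket


def memlock_limits():
    """Return the (soft, hard) locked-memory limits of this process."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


def describe(value):
    """Render one limit value as text (hypothetical helper)."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value} bytes"


if __name__ == "__main__":
    soft, hard = memlock_limits()
    # Print the host name so output from many nodes can be compared.
    print(f"{socket.gethostname()}: memlock soft={describe(soft)} "
          f"hard={describe(hard)}")
```

A node that prints a small finite soft limit while the others print "unlimited" would be a candidate for the memory-registration failure described in the FAQ linked in the error message.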