Hi Michael,

I may be missing some context, but if you are using the QLogic cards you will always want to use the PSM MTL (-mca pml cm -mca mtl psm) and not the openib BTL. As Tom suggests, confirm the limits are set up on every node: could it be that the alltoall is reaching a node that the "other" collectives are not? Please share the command line and the error message.
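One way to confirm the limits on every node is to have each rank report what it actually sees. The following is only a minimal sketch, assuming Python is available on the compute nodes; the helper names (format_limit, memlock_report) and the script itself are hypothetical and not part of Open MPI. Launched under mpirun with the same options as the failing job, it prints the locked-memory limit each process inherits, so a node that differs from the rest should stand out.

```python
import resource
import socket


def format_limit(value):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def memlock_report():
    """Return (hostname, soft, hard) for this process's locked-memory limit.

    Values come from getrlimit(RLIMIT_MEMLOCK) and are in bytes,
    whereas `ulimit -l` in the shell reports kilobytes.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return socket.gethostname(), format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    # One line per process; sort the output by hostname to compare nodes.
    host, soft, hard = memlock_report()
    print(f"{host}: memlock soft={soft} hard={hard}")
```

Any host that reports a number rather than "unlimited" is the place to look first.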
Thanks,
_MAC

>> Begin forwarded message:
>>
>> From: Michael Di Domenico <mdidomeni...@gmail.com>
>> Subject: Re: [OMPI users] locked memory and queue pairs
>> Date: March 16, 2016 at 11:32:01 AM EDT
>> To: Open MPI Users <us...@open-mpi.org>
>> Reply-To: Open MPI Users <us...@open-mpi.org>
>>
>> On Thu, Mar 10, 2016 at 11:54 AM, Michael Di Domenico
>> <mdidomeni...@gmail.com> wrote:
>>> when i try to run an openmpi job with >128 ranks (16 ranks per node)
>>> using alltoall or alltoallv, i'm getting an error that the process
>>> was unable to get a queue pair.
>>>
>>> i've checked the max locked memory settings across my machines;
>>>
>>> using ulimit -l in and outside of mpirun and they're all set to
>>> unlimited pam modules to ensure pam_limits.so is loaded and working
>>> the /etc/security/limits.conf is set for soft/hard mem to unlimited
>>>
>>> i tried a couple of quick mpi config settings i could think of;
>>>
>>> -mca mtl ^psm no affect
>>> -mca btl_openib_flags 1 no affect
>>>
>>> the openmpi faq says to tweak some mtt values in /sys, but since i'm
>>> not on mellanox that doesn't apply to me
>>>
>>> the machines are rhel 6.7, kernel 2.6.32-573.12.1(with bundled
>>> ofed), running on qlogic single-port infiniband cards, psm is
>>> enabled
>>>
>>> other collectives seem to run okay, it seems to only be alltoall
>>> comms that fail and only at scale
>>>
>>> i believe (but can't prove) that this worked at one point, but i
>>> can't recall when i last tested it. so it's reasonable to assume
>>> that some change to the system is preventing this.
>>>
>>> the question is, where should i start poking to find it?
>>
>> bump?