On Thu, Mar 10, 2016 at 11:54 AM, Michael Di Domenico
<mdidomeni...@gmail.com> wrote:
> when i try to run an openmpi job with >128 ranks (16 ranks per node)
> using alltoall or alltoallv, i'm getting an error saying the process
> was unable to get a queue pair.
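>
> the failing call boils down to a plain alltoall; a minimal sketch of
> what i'm running (buffer sizes, counts, and the file name are just
> placeholders):
>
>     /* alltoall_test.c -- build with: mpicc alltoall_test.c -o alltoall_test */
>     #include <mpi.h>
>     #include <stdlib.h>
>
>     int main(int argc, char **argv)
>     {
>         int rank, size, i;
>         int *sbuf, *rbuf;
>
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &size);
>
>         /* one int per peer; the failure seems tied to the rank
>            count, not the message size */
>         sbuf = malloc(size * sizeof(int));
>         rbuf = malloc(size * sizeof(int));
>         for (i = 0; i < size; i++)
>             sbuf[i] = rank;
>
>         MPI_Alltoall(sbuf, 1, MPI_INT, rbuf, 1, MPI_INT,
>                      MPI_COMM_WORLD);
>
>         free(sbuf);
>         free(rbuf);
>         MPI_Finalize();
>         return 0;
>     }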
>
> i've checked the max locked memory settings across my machines:
>
> ulimit -l inside and outside of mpirun reports unlimited everywhere
> the pam config has pam_limits.so loaded and working
> /etc/security/limits.conf sets soft/hard memlock to unlimited
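>
> to verify through the launcher itself, i also ran something like
> "mpirun --pernode bash -c 'ulimit -l'" (one rank per node; option
> spelling from memory) and every node reported unlimited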
>
> i tried a couple of quick mpi config settings i could think of:
>
> -mca mtl ^psm (no effect)
> -mca btl_openib_flags 1 (no effect)
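>
> for completeness, the invocations look roughly like
> "mpirun -np 256 -mca mtl ^psm ./alltoall_test" (rank count and binary
> name are placeholders); i can also dump the openib btl defaults with
> "ompi_info --param btl openib" if any of those settings would help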
>
> the openmpi faq says to tweak some mtt values in /sys, but since i'm
> not on mellanox that doesn't apply to me
>
> the machines are rhel 6.7, kernel 2.6.32-573.12.1 (with the bundled
> ofed), running on qlogic single-port infiniband cards, with psm enabled
>
> other collectives seem to run okay; it seems to be only the alltoall
> comms that fail, and only at scale
>
> i believe (but can't prove) that this worked at one point, though i
> can't recall when i last tested it, so it's reasonable to assume that
> some change to the system is now preventing it.
>
> the question is, where should i start poking to find it?

bump?
