I'm running OpenMPI 2.1.0, built from source, on RHEL 7.  I'm using the
default ssh-based launcher, where I have my private ssh key on rank 0 and
the associated public key on all ranks.  I create a hosts file with a list
of unique IPs, with the host that I'm running mpirun from on the first
line, and run this command:

mpirun -N 1 --bind-to none --hostfile hosts.txt hostname

This works fine up to 64 machines.  At 65 or more, I get ssh errors:

Permission denied (publickey,gssapi-keyex,gssapi-with-mic)

though today another user got

Host key verification failed.

I have confirmed I can successfully manually ssh into these instances.
I've also written a loop in bash which will background an ssh sleep command
to > 64 instances and this succeeds.
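
For reference, the loop was along these lines.  This is a minimal sketch
rather than the exact script: the function name fan_out, the SSH_CMD
override, the 10-second sleep, and the BatchMode option are all
illustrative, and it assumes the first field of each line in the hosts file
is the address to connect to.

```sh
# Sketch: background one ssh per host in the hosts file, then wait for all
# of them and report how many failed.  SSH_CMD is a hypothetical override
# so the loop can be exercised without real hosts.
fan_out() {
    hostfile="$1"
    pids=""
    launched=0
    failed=0
    while read -r host rest || [ -n "$host" ]; do
        # Skip blank lines and comment lines.
        case "$host" in
            ''|'#'*) continue ;;
        esac
        # -n keeps ssh from consuming the hosts file on stdin;
        # BatchMode=yes makes ssh fail instead of prompting.
        "${SSH_CMD:-ssh}" -n -o BatchMode=yes "$host" sleep 10 &
        pids="$pids $!"
        launched=$((launched + 1))
    done < "$hostfile"

    # Collect every background ssh and count non-zero exit statuses.
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done
    echo "launched=$launched failed=$failed"
}
```

Calling fan_out hosts.txt from the same machine I run mpirun on keeps all
of the connections open at once for the length of the sleep, so a limit on
concurrent outbound ssh sessions should show up as a non-zero failed count.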

From what I can tell, the /etc/ssh/ssh*config settings that limit ssh
connections have to do with inbound, not outbound limits, and I can prove
by running straight ssh commands that I'm not hitting a limit.

Is there something wrong with my mpirun syntax (I've run this way thousands
of times without issues with fewer than 64 hosts, and I know MPI is
frequently used on orders of magnitude more hosts than this)?  Or is this
a known bug that's addressed in a later MPI release?

Thanks for the help.