I'm running Open MPI 2.1.0, built from source, on RHEL 7. I'm using the
default ssh-based launcher, with my private ssh key on rank 0 and the
associated public key on all ranks. I create a hosts file with a list
of unique IPs, with the host that I'm running mpirun from on the first
line, and run this command:

    mpirun -N 1 --bind-to none --hostfile hosts.txt hostname

This works fine up to 64 machines. At 65 or more, I get ssh errors:

    Permission denied (publickey,gssapi-keyex,gssapi-with-mic)

though today another user got

    Host key verification failed.

I have confirmed that I can manually ssh into each of these instances.
I've also written a bash loop that backgrounds an ssh sleep command to
more than 64 instances at once, and it succeeds.
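
A loop of that kind can be sketched roughly as follows. This is a minimal illustration, not the exact script: the function name fan_out_ssh, the SSH_BIN override, and the 30-second sleep are assumptions made for the example, and the hosts file is assumed to have one host per line with the host name or IP as the first field.

```shell
# Sketch: start one backgrounded ssh session per host, then count failures.
# fan_out_ssh and SSH_BIN are hypothetical names, not part of Open MPI or OpenSSH.
# Written in POSIX sh syntax (also runs under bash); variables are global.
fan_out_ssh() {
    hostfile=$1
    ssh_bin=${SSH_BIN:-ssh}   # allows substituting a stub for ssh
    pids=""
    count=0
    failed=0

    # The "|| [ -n ... ]" keeps a final line that lacks a trailing newline.
    while read -r host rest || [ -n "$host" ]; do
        # Skip blank lines and comment lines in the hosts file.
        case $host in
            ''|'#'*) continue ;;
        esac
        # BatchMode=yes makes ssh fail instead of prompting for a password;
        # stdin is /dev/null so ssh cannot swallow the rest of the hosts file.
        "$ssh_bin" -o BatchMode=yes "$host" sleep 30 </dev/null &
        pids="$pids $!"
        count=$((count + 1))
    done < "$hostfile"

    # Wait on each session separately so every failure is counted.
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done

    echo "sessions=$count failed=$failed"
    [ "$failed" -eq 0 ]
}
```

Calling `fan_out_ssh hosts.txt` opens all sessions concurrently from the one machine, which is the closest plain-ssh equivalent of a launcher starting every remote daemon from the head node. It does not reproduce any tree-based launch, where remote hosts would themselves ssh onward.
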
From what I can tell, the /etc/ssh/ssh*config settings that limit ssh
connections apply to inbound rather than outbound connections, and the
plain ssh commands above show that I'm not hitting such a limit.
Is there something wrong with my mpirun syntax? I've run this way thousands
of times without issues on 64 or fewer hosts, and I know MPI is frequently
used on orders of magnitude more hosts than this. Or is this a known bug
that's addressed in a later Open MPI release?
Thanks for the help.