By default, when more than 64 hosts are involved, mpirun uses a tree
spawn to remotely launch the orted daemons, so some daemons are started
via ssh from other compute nodes rather than from the node running mpirun.
That means you have two options here:
- allow all compute nodes to ssh to each other (i.e. the ssh public key
of *all* the nodes should be in *all* the authorized_keys files)
- do not use a tree spawn (e.g. mpirun --mca plm_rsh_no_tree_spawn true ...)
I recommend the first option; otherwise mpirun would fork&exec a large
number of ssh processes (one per remote host) and hence use quite a lot of
resources on the node running mpirun.
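To make the two options concrete, here is a minimal sketch. The helper name
append_key_once and the .pub file name in the comments are hypothetical, and
how you collect each node's public key depends on your site (shared home
directory, provisioning tool, etc.), so treat this as an illustration rather
than a recipe. The option 2 command simply adds the flag above to the command
from your message.

```shell
#!/bin/sh
# Option 1 (sketch): make sure a public key line is present in an
# authorized_keys file, without duplicating it when run again.
# The function name and argument order are assumptions for illustration.
append_key_once() {
    key_line=$1
    auth_file=$2
    # sshd usually expects restrictive permissions on these paths
    mkdir -p "$(dirname "$auth_file")"
    chmod 700 "$(dirname "$auth_file")"
    touch "$auth_file"
    chmod 600 "$auth_file"
    # -q quiet, -x whole-line match, -F fixed string (no regex)
    grep -qxF -e "$key_line" "$auth_file" || printf '%s\n' "$key_line" >> "$auth_file"
}

# Run on every node, once per public key of every other node
# (the .pub file name below is hypothetical):
#   append_key_once "$(cat node42_id_rsa.pub)" "$HOME/.ssh/authorized_keys"

# Option 2: disable the tree spawn for the original command:
#   mpirun --mca plm_rsh_no_tree_spawn true -N 1 --bind-to none \
#       --hostfile hosts.txt hostname
```

Note that the "Host key verification failed" error is a separate check on the
connecting side (known_hosts), so with option 1 each node may also need the
other nodes' host keys.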
On Tue, Feb 13, 2018 at 8:23 AM, Adam Sylvester <op8...@gmail.com> wrote:
> I'm running OpenMPI 2.1.0, built from source, on RHEL 7. I'm using the
> default ssh-based launcher, where I have my private ssh key on rank 0 and
> the associated public key on all ranks. I create a hosts file with a list
> of unique IPs, with the host that I'm running mpirun from on the first line,
> and run this command:
> mpirun -N 1 --bind-to none --hostfile hosts.txt hostname
> This works fine up to 64 machines. At 65 or greater, I get ssh errors.
> Permission denied (publickey,gssapi-keyex,gssapi-with-mic)
> though today another user got
> Host key verification failed.
> I have confirmed I can successfully manually ssh into these instances. I've
> also written a loop in bash which will background an ssh sleep command to
> more than 64 instances and this succeeds.
> From what I can tell, the /etc/ssh/ssh*config settings that limit ssh
> connections have to do with inbound, not outbound limits, and I can prove by
> running straight ssh commands that I'm not hitting a limit.
> Is there something wrong with my mpirun syntax (I've run this way thousands
> of times without issues with fewer than 64 hosts, and I know MPI is
> frequently used on orders of magnitude more hosts than this)? Or is this a
> known bug that's addressed in a later MPI release?
> Thanks for the help.