Ahhhh... thanks Gilles. That makes sense. I was stuck thinking there was an ssh problem on rank 0; it never occurred to me mpirun was doing something clever there and that those ssh errors were from a different instance altogether.
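In case it helps anyone who finds this thread later: since a node other than rank 0 can end up launching ssh, a quick way to look for the failing hop is to test ssh between every pair of hosts, not just from rank 0 outward. Below is a rough sketch of that, not anything official. It assumes hosts.txt has one host per line (optionally followed by something like slots=N) and that the machine I run it from can already ssh to every host, including itself. The function names list_pairs, check_pair and check_all are just ones I made up; BatchMode and ConnectTimeout are standard OpenSSH options.

```shell
# list_pairs FILE: print "src dst" for every ordered pair of distinct hosts.
# Assumes one host per line; blank lines and lines starting with # are skipped,
# and only the first field of each line is used.
list_pairs() {
    hosts=$(awk 'NF && $1 !~ /^#/ { print $1 }' "$1")
    for src in $hosts; do
        for dst in $hosts; do
            if [ "$src" != "$dst" ]; then
                echo "$src $dst"
            fi
        done
    done
}

# check_pair SRC DST: log in to SRC, and from there try to ssh to DST.
# BatchMode=yes makes ssh fail instead of prompting for a password or
# asking to confirm an unknown host key.
check_pair() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" \
        ssh -o BatchMode=yes -o ConnectTimeout=5 "$2" true
}

# check_all FILE: report every src -> dst hop that fails.
check_all() {
    list_pairs "$1" | while read -r src dst; do
        # stdin comes from /dev/null so ssh does not eat the pair list
        if ! check_pair "$src" "$dst" < /dev/null; then
            echo "FAIL: $src -> $dst"
        fi
    done
}
```

Running `check_all hosts.txt` from the node where mpirun is started prints one FAIL line per broken hop. With 65 hosts that is 4160 sequential checks, so it is slow, but I don't know which hops the tree spawn actually uses, so checking all of them is the conservative choice.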
It's no problem to put my private key on all instances - I'll go that route.

-Adam

On Mon, Feb 12, 2018 at 7:12 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

> Adam,
>
> by default, when more than 64 hosts are involved, mpirun uses a tree
> spawn in order to remote launch the orted daemons.
>
> That means you have two options here:
> - allow all compute nodes to ssh to each other (e.g. the ssh public key
>   of *all* the nodes should be in *all* the authorized_keys files)
> - do not use a tree spawn (e.g. mpirun --mca plm_rsh_no_tree_spawn true ...)
>
> I recommend the first option, otherwise mpirun would fork&exec a large
> number of ssh processes and hence use quite a lot of
> resources on the node running mpirun.
>
> Cheers,
>
> Gilles
>
> On Tue, Feb 13, 2018 at 8:23 AM, Adam Sylvester <op8...@gmail.com> wrote:
> > I'm running OpenMPI 2.1.0, built from source, on RHEL 7. I'm using the
> > default ssh-based launcher, where I have my private ssh key on rank 0
> > and the associated public key on all ranks. I create a hosts file with
> > a list of unique IPs, with the host that I'm running mpirun from on the
> > first line, and run this command:
> >
> > mpirun -N 1 --bind-to none --hostfile hosts.txt hostname
> >
> > This works fine up to 64 machines. At 65 or greater, I get ssh errors.
> > Frequently
> >
> > Permission denied (publickey,gssapi-keyex,gssapi-with-mic)
> >
> > though today another user got
> >
> > Host key verification failed.
> >
> > I have confirmed I can successfully manually ssh into these instances.
> > I've also written a loop in bash which will background an ssh sleep
> > command to more than 64 instances and this succeeds.
> >
> > From what I can tell, the /etc/ssh/ssh*config settings that limit ssh
> > connections have to do with inbound, not outbound limits, and I can
> > prove by running straight ssh commands that I'm not hitting a limit.
> >
> > Is there something wrong with my mpirun syntax (I've run this way
> > thousands of times without issues with fewer than 64 hosts, and I know
> > MPI is frequently used on orders of magnitude more hosts than this)?
> > Or is this a known bug that's addressed in a later MPI release?
> >
> > Thanks for the help.
> > -Adam
> >
> > _______________________________________________
> > users mailing list
> > users@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/users