Ahhhh... thanks Gilles.  That makes sense.  I was stuck thinking there was
an ssh problem on rank 0; it never occurred to me mpirun was doing
something clever there and that those ssh errors were from a different
instance altogether.
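
For anyone debugging the same symptom, a rough sketch of how node-to-node ssh could be checked, rather than only ssh from the mpirun host. The function names and the HOP_DRY_RUN switch are made up for illustration, and walking consecutive hostfile entries is only an assumption for coverage; it does not reproduce the exact tree mpirun builds. BatchMode=yes makes ssh fail instead of prompting, which should surface the same kind of "Permission denied" or "Host key verification failed" message.

```shell
# Sketch only: test ssh between compute nodes, not just from the mpirun host.
# HOP_DRY_RUN, hop_cmd and check_hops are illustrative names, not Open MPI tools.
HOP_DRY_RUN="${HOP_DRY_RUN:-1}"   # 1 = print the commands, 0 = run them

# Build the command that makes host $1 attempt a non-interactive ssh to host $2.
hop_cmd() {
    echo "ssh $1 ssh -o BatchMode=yes $2 true"
}

# Walk the hostfile ($1) and check each host's ssh access to the next host.
# Blank lines and lines starting with '#' are skipped.
check_hops() {
    prev=""
    awk 'NF && $1 !~ /^#/ { print $1 }' "$1" | while read -r host; do
        if [ -n "$prev" ]; then
            cmd=$(hop_cmd "$prev" "$host")
            if [ "$HOP_DRY_RUN" = "1" ]; then
                echo "$cmd"
            else
                # stdin is redirected so ssh does not eat the host list
                $cmd </dev/null || echo "FAILED: $prev -> $host"
            fi
        fi
        prev=$host
    done
}
```

Calling check_hops hosts.txt with HOP_DRY_RUN=1 only prints the ssh commands; setting HOP_DRY_RUN=0 runs them and reports each hop that fails.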

It's no problem to put my private key on all instances - I'll go that route.
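
A minimal sketch of what that could look like, assuming the matching public key is already in authorized_keys on every instance (as described in the original question) and that ~/.ssh exists there. The key path, the DRY_RUN switch, and the function names are assumptions for illustration, not anything from Open MPI itself.

```shell
# Sketch only: copy one private key to every host in the hostfile so that
# any node can ssh to any other node during mpirun's tree spawn.
# KEY, DRY_RUN, list_hosts and distribute_key are illustrative names.
KEY="${KEY:-$HOME/.ssh/id_rsa}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands, 0 = run them

# Print the first field of every hostfile line that is not blank or a comment.
list_hosts() {
    awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

# Copy the key into ~/.ssh on each host and restrict its permissions.
distribute_key() {
    name=$(basename "$KEY")
    list_hosts "$1" | while read -r host; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "scp -p $KEY ${host}:.ssh/$name"
            echo "ssh $host chmod 600 .ssh/$name"
        else
            # stdin is redirected so scp/ssh do not eat the host list
            scp -p "$KEY" "${host}:.ssh/$name" </dev/null
            ssh "$host" chmod 600 ".ssh/$name" </dev/null
        fi
    done
}
```

Running distribute_key hosts.txt with the default DRY_RUN=1 prints the commands for review; DRY_RUN=0 performs the copy. Sharing one private key across instances is a trade-off that may not suit every environment.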

-Adam

On Mon, Feb 12, 2018 at 7:12 PM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Adam,
>
> by default, when more than 64 hosts are involved, mpirun uses a tree
> spawn in order to remote launch the orted daemons.
>
> That means you have two options here:
>  - allow all compute nodes to ssh to each other (e.g. the ssh public key
> of *all* the nodes should be in *all* the authorized_keys files)
>  - do not use a tree spawn (e.g. mpirun --mca plm_rsh_no_tree_spawn true
> ...)
>
> I recommend the first option; otherwise mpirun would fork and exec a large
> number of ssh processes and hence use quite a lot of
> resources on the node running mpirun.
>
> Cheers,
>
> Gilles
>
> On Tue, Feb 13, 2018 at 8:23 AM, Adam Sylvester <op8...@gmail.com> wrote:
> > I'm running OpenMPI 2.1.0, built from source, on RHEL 7.  I'm using the
> > default ssh-based launcher, where I have my private ssh key on rank 0 and
> > the associated public key on all ranks.  I create a hosts file with a list
> > of unique IPs, with the host that I'm running mpirun from on the first line,
> > and run this command:
> >
> > mpirun -N 1 --bind-to none --hostfile hosts.txt hostname
> >
> > This works fine up to 64 machines.  At 65 or greater, I get ssh errors.
> > Frequently
> >
> > Permission denied (publickey,gssapi-keyex,gssapi-with-mic)
> >
> > though today another user got
> >
> > Host key verification failed.
> >
> > I have confirmed I can successfully manually ssh into these instances.  I've
> > also written a loop in bash which will background an ssh sleep command to
> > more than 64 instances and this succeeds.
> >
> > From what I can tell, the /etc/ssh/ssh*config settings that limit ssh
> > connections have to do with inbound, not outbound limits, and I can prove by
> > running straight ssh commands that I'm not hitting a limit.
> >
> > Is there something wrong with my mpirun syntax (I've run this way thousands
> > of times without issues with fewer than 64 hosts, and I know MPI is
> > frequently used on orders of magnitude more hosts than this)?  Or is this a
> > known bug that's addressed in a later MPI release?
> >
> > Thanks for the help.
> > -Adam
> >
> > _______________________________________________
> > users mailing list
> > users@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/users
