Of course! I blame the two-node test for distracting me from checking all the
FQDN-related parameters. :)
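
For reference, the "equivalent MCA param" mentioned in the quoted reply can
also be set without touching the mpirun line. A minimal sketch, assuming a
Bourne-style shell and the default per-user parameter file location
($HOME/.openmpi/mca-params.conf); either option alone should be enough:

```shell
# Option A: environment variable. Open MPI treats any variable named
# OMPI_MCA_<param> as an MCA parameter, so this corresponds to passing
# "--mca orte_keep_fqdn_hostnames 1" to every mpirun started from this shell.
export OMPI_MCA_orte_keep_fqdn_hostnames=1

# Option B: per-user parameter file, read by Open MPI at startup.
# Appending keeps any parameters that are already in the file.
mkdir -p "$HOME/.openmpi"
echo "orte_keep_fqdn_hostnames = 1" >> "$HOME/.openmpi/mca-params.conf"

# With either option in place, the original command line stays unchanged:
#   mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
```

Option A only affects the current shell session, while option B applies to
every later run by the same user, so the file is the one to remember to
clean up if the setting is no longer wanted.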

So why does the two-node test pass in this case, even without keeping the
FQDN?

Thanks,

On Tue, Aug 25, 2015 at 2:35 PM, Ralph Castain <r...@open-mpi.org> wrote:

> You need to set --mca orte_keep_fqdn_hostnames 1 on your mpirun line, or
> set the equivalent MCA param
>
>
> > On Aug 25, 2015, at 2:24 PM, Yong Qin <yong....@gmail.com> wrote:
> >
> > Hi,
> >
> > This has been bothering me for a while, but I never got a chance to
> > identify the root cause. I know this issue is often due to a
> > misconfigured network or ssh, but I'm pretty sure that we don't fall into
> > that category. Proof? The issue doesn't happen in 1.6.x and earlier
> > releases, only in 1.8+, including 1.10.0, which was released today.
> >
> > [yqin@n0009.scs00 ~]$ mpirun -V
> > mpirun (Open MPI) 1.6.5
> >
> > [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> > n0233.mako0
> > n0189.mako0
> > n0198.mako0
> >
> > [yqin@n0009.scs00 ~]$ mpirun -V
> > mpirun (Open MPI) 1.8.8
> >
> > [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> > ssh: Could not resolve hostname n0198: Name or service not known
> > --------------------------------------------------------------------------
> > ORTE was unable to reliably start one or more daemons.
> > This usually is caused by:
> >
> > * not finding the required libraries and/or binaries on
> >   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
> >   settings, or configure OMPI with --enable-orterun-prefix-by-default
> >
> > * lack of authority to execute on one or more specified nodes.
> >   Please verify your allocation and authorities.
> >
> > * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
> >   Please check with your sys admin to determine the correct location to use.
> >
> > *  compilation of the orted with dynamic libraries when static are required
> >   (e.g., on Cray). Please check your configure cmd line and consider using
> >   one of the contrib/platform definitions for your system type.
> >
> > * an inability to create a connection back to mpirun due to a
> >   lack of common network interfaces and/or no route found between
> >   them. Please check network connectivity (including firewalls
> >   and network routing requirements).
> > --------------------------------------------------------------------------
> >
> > [yqin@n0009.scs00 ~]$ mpirun -V
> > mpirun (Open MPI) 1.10.0
> >
> > [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> > ssh: Could not resolve hostname n0198: Name or service not known
> > --------------------------------------------------------------------------
> > ORTE was unable to reliably start one or more daemons.
> > This usually is caused by:
> >
> > * not finding the required libraries and/or binaries on
> >   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
> >   settings, or configure OMPI with --enable-orterun-prefix-by-default
> >
> > * lack of authority to execute on one or more specified nodes.
> >   Please verify your allocation and authorities.
> >
> > * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
> >   Please check with your sys admin to determine the correct location to use.
> >
> > *  compilation of the orted with dynamic libraries when static are required
> >   (e.g., on Cray). Please check your configure cmd line and consider using
> >   one of the contrib/platform definitions for your system type.
> >
> > * an inability to create a connection back to mpirun due to a
> >   lack of common network interfaces and/or no route found between
> >   them. Please check network connectivity (including firewalls
> >   and network routing requirements).
> > --------------------------------------------------------------------------
> >
> >
> > Note that I was running mpirun from "n0009.scs00". If I run it from
> > a native "mako0" node instead, it passes.
> >
> > [yqin@n0198.mako0 ~]$ mpirun -V
> > mpirun (Open MPI) 1.10.0
> >
> > [yqin@n0198.mako0 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> > n0189.mako0
> > n0198.mako0
> > n0233.mako0
> >
> > The network connection between n0009.scs00 and all the "mako0" nodes is
> > clear, with no firewall at all, and they are all on the same subnet. The
> > only thing that I can think of is the hostname itself. From the error
> > message, mpirun was trying to resolve n0198, etc., even though the proper
> > hostname passed to it was n0198.mako0. "n0198" by itself would not resolve
> > because we have multiple clusters configured within the same subnet and we
> > need the cluster name suffix as an identifier. But the failure does not
> > always occur: for example, if only two nodes are involved, then it passes.
> >
> > [yqin@n0009.scs00 ~]$ mpirun -V
> > mpirun (Open MPI) 1.10.0
> >
> > [yqin@n0009.scs00 ~]$ mpirun -np 2 -H n0189.mako0,n0233.mako0 hostname
> > n0189.mako0
> > n0233.mako0
> >
> > The issue only shows up when more than 2 nodes are involved. Any
> > thoughts?
> >
> > Thanks,
> >
> > Yong Qin
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post: http://www.open-mpi.org/community/lists/users/2015/08/27489.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/08/27490.php
