You need to set --mca orte_keep_fqdn_hostnames 1 on your mpirun line, or set the equivalent MCA param.
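For reference, here is a rough sketch of the usual ways to set that parameter. The host list is copied from the report below. The OMPI_MCA_ environment prefix and the per-user mca-params.conf file are the general Open MPI mechanisms for MCA params rather than anything specific to this thread, so treat this as an outline under those assumptions and check it against your install.

```shell
# Option 1: set the param through the environment. Open MPI reads any
# variable named OMPI_MCA_<param_name> as an MCA parameter.
export OMPI_MCA_orte_keep_fqdn_hostnames=1

# Option 2: pass it directly on the mpirun line (hosts taken from the report).
# The guard only skips the launch on machines that have no mpirun in PATH.
if command -v mpirun >/dev/null 2>&1; then
    mpirun --mca orte_keep_fqdn_hostnames 1 -np 3 \
        -H n0189.mako0,n0233.mako0,n0198.mako0 hostname || true
fi

# Option 3: make it persistent for your user by adding this line to
# $HOME/.openmpi/mca-params.conf (shown as a comment so nothing is written):
#   orte_keep_fqdn_hostnames = 1
```

Any one of the three should be enough. The command-line form is the easiest way to confirm the fix before making it permanent.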
> On Aug 25, 2015, at 2:24 PM, Yong Qin <yong....@gmail.com> wrote:
>
> Hi,
>
> This has been bothering me for a while but I never got a chance to identify
> the root cause. I know this issue could be due to misconfig of network or ssh
> in many cases. But I'm pretty sure that we don't fall into that category at
> all. Proof? This issue doesn't happen in 1.6.x and earlier releases, but only
> in 1.8+ including the 1.10.0 which was released today.
>
> [yqin@n0009.scs00 ~]$ mpirun -V
> mpirun (Open MPI) 1.6.5
>
> [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> n0233.mako0
> n0189.mako0
> n0198.mako0
>
> [yqin@n0009.scs00 ~]$ mpirun -V
> mpirun (Open MPI) 1.8.8
>
> [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> ssh: Could not resolve hostname n0198: Name or service not known
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
>
> * compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
>
> [yqin@n0009.scs00 ~]$ mpirun -V
> mpirun (Open MPI) 1.10.0
>
> [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> ssh: Could not resolve hostname n0198: Name or service not known
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
>
> * compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
>
>
> Note that I was running the mpirun from "n0009.scs00". If I ran it from a
> native "mako0" node, it would pass as well.
>
> [yqin@n0198.mako0 ~]$ mpirun -V
> mpirun (Open MPI) 1.10.0
>
> [yqin@n0198.mako0 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> n0189.mako0
> n0198.mako0
> n0233.mako0
>
> The network connection between n0009.scs00 and all the "mako0" nodes are
> clear and no firewall at all, and all on the same subnet. The only thing that
> I can think of is the hostname itself.
> From the error message, mpirun was
> trying to resolve n0198, etc., even though the proper hostname that's passed
> to it was n0198.mako0. "n0198" by itself would not resolve because we have
> multiple clusters configured within the same subnet and we do need the
> cluster name suffix as identifier. But this is also not always true, for
> example, if I only have two nodes involved then it would pass as well.
>
> [yqin@n0009.scs00 ~]$ mpirun -V
> mpirun (Open MPI) 1.10.0
>
> [yqin@n0009.scs00 ~]$ mpirun -np 2 -H n0189.mako0,n0233.mako0 hostname
> n0189.mako0
> n0233.mako0
>
> The issue only exposes itself when more than 2 nodes are involved. Any
> thoughts?
>
> Thanks,
>
> Yong Qin
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/08/27489.php