Of course! I blame the two-node test for distracting me from checking all the FQDN-related parameters. :)
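For anyone landing on this thread later, here is a minimal sketch of two ways to apply the setting Ralph suggests below. The parameter name comes from his reply, and the OMPI_MCA_ environment prefix is Open MPI's usual convention for passing MCA parameters. The wrapper name mpirun_fqdn is made up for illustration. I have only reasoned about this against the 1.8/1.10 behavior described in this thread, so treat it as a starting point rather than a verified recipe.

```shell
# Option 1: set the MCA parameter through the environment, so every
# mpirun started from this shell keeps fully qualified hostnames.
export OMPI_MCA_orte_keep_fqdn_hostnames=1

# Option 2: pass it explicitly on the command line (note the two ASCII
# hyphens before "mca"). mpirun_fqdn is a hypothetical convenience wrapper.
mpirun_fqdn() {
    mpirun --mca orte_keep_fqdn_hostnames 1 "$@"
}

# A third option, not executed here: put the line
#   orte_keep_fqdn_hostnames = 1
# in $HOME/.openmpi/mca-params.conf to make it the per-user default.
```

With the wrapper defined, the failing case from the original report would be invoked as: mpirun_fqdn -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname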
So why would the two-node test pass in this case without allowing the FQDN then?

Thanks,

On Tue, Aug 25, 2015 at 2:35 PM, Ralph Castain <r...@open-mpi.org> wrote:

> You need to set --mca orte_keep_fqdn_hostnames 1 on your mpirun line, or
> set the equivalent MCA param
>
> > On Aug 25, 2015, at 2:24 PM, Yong Qin <yong....@gmail.com> wrote:
> >
> > Hi,
> >
> > This has been bothering me for a while but I never got a chance to
> > identify the root cause. I know this issue could be due to misconfig of
> > network or ssh in many cases. But I'm pretty sure that we don't fall into
> > that category at all. Proof? This issue doesn't happen in 1.6.x and earlier
> > releases, but only in 1.8+ including the 1.10.0 which was released today.
> >
> > [yqin@n0009.scs00 ~]$ mpirun -V
> > mpirun (Open MPI) 1.6.5
> >
> > [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> > n0233.mako0
> > n0189.mako0
> > n0198.mako0
> >
> > [yqin@n0009.scs00 ~]$ mpirun -V
> > mpirun (Open MPI) 1.8.8
> >
> > [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> > ssh: Could not resolve hostname n0198: Name or service not known
> > --------------------------------------------------------------------------
> > ORTE was unable to reliably start one or more daemons.
> > This usually is caused by:
> >
> > * not finding the required libraries and/or binaries on
> >   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
> >   settings, or configure OMPI with --enable-orterun-prefix-by-default
> >
> > * lack of authority to execute on one or more specified nodes.
> >   Please verify your allocation and authorities.
> >
> > * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
> >   Please check with your sys admin to determine the correct location to use.
> >
> > * compilation of the orted with dynamic libraries when static are required
> >   (e.g., on Cray). Please check your configure cmd line and consider using
> >   one of the contrib/platform definitions for your system type.
> >
> > * an inability to create a connection back to mpirun due to a
> >   lack of common network interfaces and/or no route found between
> >   them. Please check network connectivity (including firewalls
> >   and network routing requirements).
> > --------------------------------------------------------------------------
> >
> > [yqin@n0009.scs00 ~]$ mpirun -V
> > mpirun (Open MPI) 1.10.0
> >
> > [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> > ssh: Could not resolve hostname n0198: Name or service not known
> > --------------------------------------------------------------------------
> > ORTE was unable to reliably start one or more daemons.
> > This usually is caused by:
> >
> > * not finding the required libraries and/or binaries on
> >   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
> >   settings, or configure OMPI with --enable-orterun-prefix-by-default
> >
> > * lack of authority to execute on one or more specified nodes.
> >   Please verify your allocation and authorities.
> >
> > * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
> >   Please check with your sys admin to determine the correct location to use.
> >
> > * compilation of the orted with dynamic libraries when static are required
> >   (e.g., on Cray). Please check your configure cmd line and consider using
> >   one of the contrib/platform definitions for your system type.
> >
> > * an inability to create a connection back to mpirun due to a
> >   lack of common network interfaces and/or no route found between
> >   them. Please check network connectivity (including firewalls
> >   and network routing requirements).
> > --------------------------------------------------------------------------
> >
> > Note that I was running the mpirun from "n0009.scs00". If I ran it from
> > a native "mako0" node, it would pass as well.
> >
> > [yqin@n0198.mako0 ~]$ mpirun -V
> > mpirun (Open MPI) 1.10.0
> >
> > [yqin@n0198.mako0 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> > n0189.mako0
> > n0198.mako0
> > n0233.mako0
> >
> > The network connection between n0009.scs00 and all the "mako0" nodes are
> > clear and no firewall at all, and all on the same subnet. The only thing
> > that I can think of is the hostname itself. From the error message mpirun
> > was trying to resolve n0198, etc., even though the proper hostname that's
> > passed to it was n0198.mako0. "n0198" by itself would not resolve because
> > we have multiple clusters configured within the same subnet and we do need
> > the cluster name suffix as identifier. But this is also not always true,
> > for example, if I only have two nodes involved than it would pass as well.
> >
> > [yqin@n0009.scs00 ~]$ mpirun -V
> > mpirun (Open MPI) 1.10.0
> >
> > [yqin@n0009.scs00 ~]$ mpirun -np 2 -H n0189.mako0,n0233.mako0 hostname
> > n0189.mako0
> > n0233.mako0
> >
> > The issue only exposes itself when more than 2 nodes are involved. Any
> > thoughts?
> >
> > Thanks,
> >
> > Yong Qin
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post: http://www.open-mpi.org/community/lists/users/2015/08/27489.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/08/27490.php