Your earlier message indicates that it works fine as long as the domain name (DN) is the same, regardless of the number of nodes. It only failed when the DNs of the nodes differed.
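In case it helps, here is a minimal sketch of the fix suggested earlier in the thread, set through the environment instead of on the mpirun line. It assumes the usual Open MPI convention that an MCA parameter can be supplied as an OMPI_MCA_<name> environment variable. The HOSTS variable and the check for mpirun are only there for illustration and are not part of the original report.

```shell
#!/bin/sh
# Hosts taken from the original report; the variable name is illustrative.
HOSTS="n0189.mako0,n0233.mako0,n0198.mako0"

# Environment form of "--mca orte_keep_fqdn_hostnames 1":
# keep the fully qualified names instead of the short ones.
export OMPI_MCA_orte_keep_fqdn_hostnames=1

# Only launch if mpirun is actually on the PATH of this machine.
if command -v mpirun >/dev/null 2>&1; then
    mpirun -np 3 -H "$HOSTS" hostname
else
    echo "mpirun not found; would run: mpirun -np 3 -H $HOSTS hostname"
fi
```

If you would rather not touch the environment, the same setting can be written as "orte_keep_fqdn_hostnames = 1" in $HOME/.openmpi/mca-params.conf, which should then apply to every run for that user.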
> On Aug 25, 2015, at 3:31 PM, Yong Qin <yong....@gmail.com> wrote:
>
> Of course! I blame that two-node test distracted me from checking all the
> FQDN relevant parameters. :)
>
> So why would the two-node test pass in this case without allowing the FQDN
> then?
>
> Thanks,
>
> On Tue, Aug 25, 2015 at 2:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
> You need to set --mca orte_keep_fqdn_hostnames 1 on your mpirun line, or set
> the equivalent MCA param
>
>
> > On Aug 25, 2015, at 2:24 PM, Yong Qin <yong....@gmail.com> wrote:
> >
> > Hi,
> >
> > This has been bothering me for a while but I never got a chance to identify
> > the root cause. I know this issue could be due to misconfig of network or
> > ssh in many cases. But I'm pretty sure that we don't fall into that
> > category at all. Proof? This issue doesn't happen in 1.6.x and earlier
> > releases, but only in 1.8+ including the 1.10.0 which was released today.
> >
> > [yqin@n0009.scs00 ~]$ mpirun -V
> > mpirun (Open MPI) 1.6.5
> >
> > [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> > n0233.mako0
> > n0189.mako0
> > n0198.mako0
> >
> > [yqin@n0009.scs00 ~]$ mpirun -V
> > mpirun (Open MPI) 1.8.8
> >
> > [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> > ssh: Could not resolve hostname n0198: Name or service not known
> > --------------------------------------------------------------------------
> > ORTE was unable to reliably start one or more daemons.
> > This usually is caused by:
> >
> > * not finding the required libraries and/or binaries on
> >   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
> >   settings, or configure OMPI with --enable-orterun-prefix-by-default
> >
> > * lack of authority to execute on one or more specified nodes.
> >   Please verify your allocation and authorities.
> >
> > * the inability to write startup files into /tmp
> >   (--tmpdir/orte_tmpdir_base).
> >   Please check with your sys admin to determine the correct location to use.
> >
> > * compilation of the orted with dynamic libraries when static are required
> >   (e.g., on Cray). Please check your configure cmd line and consider using
> >   one of the contrib/platform definitions for your system type.
> >
> > * an inability to create a connection back to mpirun due to a
> >   lack of common network interfaces and/or no route found between
> >   them. Please check network connectivity (including firewalls
> >   and network routing requirements).
> > --------------------------------------------------------------------------
> >
> > [yqin@n0009.scs00 ~]$ mpirun -V
> > mpirun (Open MPI) 1.10.0
> >
> > [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> > ssh: Could not resolve hostname n0198: Name or service not known
> > --------------------------------------------------------------------------
> > ORTE was unable to reliably start one or more daemons.
> > This usually is caused by:
> >
> > * not finding the required libraries and/or binaries on
> >   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
> >   settings, or configure OMPI with --enable-orterun-prefix-by-default
> >
> > * lack of authority to execute on one or more specified nodes.
> >   Please verify your allocation and authorities.
> >
> > * the inability to write startup files into /tmp
> >   (--tmpdir/orte_tmpdir_base).
> >   Please check with your sys admin to determine the correct location to use.
> >
> > * compilation of the orted with dynamic libraries when static are required
> >   (e.g., on Cray). Please check your configure cmd line and consider using
> >   one of the contrib/platform definitions for your system type.
> >
> > * an inability to create a connection back to mpirun due to a
> >   lack of common network interfaces and/or no route found between
> >   them. Please check network connectivity (including firewalls
> >   and network routing requirements).
> > --------------------------------------------------------------------------
> >
> >
> > Note that I was running the mpirun from "n0009.scs00". If I ran it from a
> > native "mako0" node, it would pass as well.
> >
> > [yqin@n0198.mako0 ~]$ mpirun -V
> > mpirun (Open MPI) 1.10.0
> >
> > [yqin@n0198.mako0 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> > n0189.mako0
> > n0198.mako0
> > n0233.mako0
> >
> > The network connection between n0009.scs00 and all the "mako0" nodes are
> > clear and no firewall at all, and all on the same subnet. The only thing
> > that I can think of is the hostname itself. From the error message mpirun
> > was trying to resolve n0198, etc., even though the proper hostname that's
> > passed to it was n0198.mako0. "n0198" by itself would not resolve because
> > we have multiple clusters configured within the same subnet and we do need
> > the cluster name suffix as identifier. But this is also not always true,
> > for example, if I only have two nodes involved then it would pass as well.
> >
> > [yqin@n0009.scs00 ~]$ mpirun -V
> > mpirun (Open MPI) 1.10.0
> >
> > [yqin@n0009.scs00 ~]$ mpirun -np 2 -H n0189.mako0,n0233.mako0 hostname
> > n0189.mako0
> > n0233.mako0
> >
> > The issue only exposes itself when more than 2 nodes are involved. Any
> > thoughts?
> >
> > Thanks,
> >
> > Yong Qin
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post:
> > http://www.open-mpi.org/community/lists/users/2015/08/27489.php