Yes, all cross-node ssh works perfectly, and this is our production system, which has been running for years. I've done all of this testing and was puzzled by the inconsistent behavior that I observed. Enabling FQDN resolves the issue, so now I am just trying to understand why the inconsistency exists.
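For the record, since keeping the FQDN is what fixed it for us: the knob can be set in a few equivalent ways (a minimal sketch; the per-user config file path assumes a stock Open MPI install):

    # on the mpirun command line
    mpirun --mca orte_keep_fqdn_hostnames 1 -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname

    # via the environment (any MCA param can be set with an OMPI_MCA_ prefix)
    export OMPI_MCA_orte_keep_fqdn_hostnames=1

    # or persistently, in the per-user MCA parameter file
    echo "orte_keep_fqdn_hostnames = 1" >> ~/.openmpi/mca-params.conf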
[yqin@n0009.scs00 ~]$ ssh n0189.mako0 ssh n0198.mako0 echo ok
ok
[yqin@n0009.scs00 ~]$ ssh n0233.mako0 ssh n0198.mako0 echo ok
ok

Both work. The stripped variants you mention below wouldn't, because "n0198" by itself, without a domain name, doesn't resolve.
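A quick way to confirm which name forms resolve on each node (a sketch, assuming getent is available on the compute nodes):

    # getent prints a matching hosts entry on success and nothing on failure
    for h in n0189.mako0 n0233.mako0 n0198.mako0; do
        echo "== $h =="
        ssh "$h" 'getent hosts n0198; getent hosts n0198.mako0'
    done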
On Wed, Aug 26, 2015 at 3:48 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

> is name resolution working on *all* the nodes ?
> orted might be ssh'ed in a tree fashion.
> that means orted can either be ssh'ed by the node running mpirun or any other node.
> from n0009.scs00, can you make sure
>
> ssh n0189.mako0 ssh n0198.mako0 echo ok
> ssh n0233.mako0 ssh n0198.mako0 echo ok
>
> both work ?
>
> per your log, mpirun might remove the domain name from the ssh command under the hood, e.g.
>
> ssh n0189.mako0 ssh n0198 echo ok
> or
> ssh n0198 ssh n0198.mako0 echo ok
>
> if that is the case, then I have no idea why we are doing this ...
>
> Cheers,
>
> Gilles
>
> On Thursday, August 27, 2015, Yong Qin <yong....@gmail.com> wrote:
>
>> > regardless of number of nodes
>>
>> No, this is not true. I was referring to this specific test, which was the one that prevented me from thinking about FQDN, and the DN is different in this case. As I clearly stated in my original question, "The issue only exposes itself when more than 2 nodes are involved."
>>
>> [yqin@n0009.scs00 ~]$ mpirun -V
>> mpirun (Open MPI) 1.10.0
>>
>> [yqin@n0009.scs00 ~]$ mpirun -np 2 -H n0189.mako0,n0233.mako0 hostname
>> n0189.mako0
>> n0233.mako0
>>
>> On Tue, Aug 25, 2015 at 4:39 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Your earlier message indicates that it works fine so long as the DN is the same, regardless of the number of nodes. It only failed when the DNs of the nodes differed.
>>>
>>> On Aug 25, 2015, at 3:31 PM, Yong Qin <yong....@gmail.com> wrote:
>>>
>>> Of course! I blame that two-node test for distracting me from checking all the FQDN-relevant parameters. :)
>>>
>>> So why would the two-node test pass in this case without allowing the FQDN, then?
>>>
>>> Thanks,
>>>
>>> On Tue, Aug 25, 2015 at 2:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> You need to set --mca orte_keep_fqdn_hostnames 1 on your mpirun line, or set the equivalent MCA param.
>>>>
>>>>> On Aug 25, 2015, at 2:24 PM, Yong Qin <yong....@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> This has been bothering me for a while, but I never got a chance to identify the root cause. I know this issue could be due to a misconfiguration of the network or ssh in many cases, but I'm pretty sure that we don't fall into that category at all. Proof? This issue doesn't happen in 1.6.x and earlier releases, only in 1.8+, including the 1.10.0 that was released today.
>>>>>
>>>>> [yqin@n0009.scs00 ~]$ mpirun -V
>>>>> mpirun (Open MPI) 1.6.5
>>>>>
>>>>> [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
>>>>> n0233.mako0
>>>>> n0189.mako0
>>>>> n0198.mako0
>>>>>
>>>>> [yqin@n0009.scs00 ~]$ mpirun -V
>>>>> mpirun (Open MPI) 1.8.8
>>>>>
>>>>> [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
>>>>> ssh: Could not resolve hostname n0198: Name or service not known
>>>>> --------------------------------------------------------------------------
>>>>> ORTE was unable to reliably start one or more daemons.
>>>>> This usually is caused by:
>>>>>
>>>>> * not finding the required libraries and/or binaries on
>>>>>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>>>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>>>>>
>>>>> * lack of authority to execute on one or more specified nodes.
>>>>>   Please verify your allocation and authorities.
>>>>>
>>>>> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>>>>>   Please check with your sys admin to determine the correct location to use.
>>>>>
>>>>> * compilation of the orted with dynamic libraries when static are required
>>>>>   (e.g., on Cray). Please check your configure cmd line and consider using
>>>>>   one of the contrib/platform definitions for your system type.
>>>>>
>>>>> * an inability to create a connection back to mpirun due to a
>>>>>   lack of common network interfaces and/or no route found between
>>>>>   them. Please check network connectivity (including firewalls
>>>>>   and network routing requirements).
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> [yqin@n0009.scs00 ~]$ mpirun -V
>>>>> mpirun (Open MPI) 1.10.0
>>>>>
>>>>> [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
>>>>> ssh: Could not resolve hostname n0198: Name or service not known
>>>>> [the same "ORTE was unable to reliably start one or more daemons" help message as above]
>>>>>
>>>>> Note that I was running mpirun from "n0009.scs00". If I ran it from a native "mako0" node, it would pass as well.
>>>>>
>>>>> [yqin@n0198.mako0 ~]$ mpirun -V
>>>>> mpirun (Open MPI) 1.10.0
>>>>>
>>>>> [yqin@n0198.mako0 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
>>>>> n0189.mako0
>>>>> n0198.mako0
>>>>> n0233.mako0
>>>>>
>>>>> The network connection between n0009.scs00 and all the "mako0" nodes is clear, there is no firewall at all, and everything is on the same subnet. The only thing that I can think of is the hostname itself. From the error message, mpirun was trying to resolve "n0198", etc., even though the proper hostname passed to it was "n0198.mako0". "n0198" by itself would not resolve, because we have multiple clusters configured within the same subnet and we do need the cluster name suffix as an identifier.
>>>>> But this is also not always true; for example, if only two nodes are involved, then it passes as well.
>>>>>
>>>>> [yqin@n0009.scs00 ~]$ mpirun -V
>>>>> mpirun (Open MPI) 1.10.0
>>>>>
>>>>> [yqin@n0009.scs00 ~]$ mpirun -np 2 -H n0189.mako0,n0233.mako0 hostname
>>>>> n0189.mako0
>>>>> n0233.mako0
>>>>>
>>>>> The issue only exposes itself when more than 2 nodes are involved. Any thoughts?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Yong Qin
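If Gilles's tree-spawn theory is right, it would also explain the two-node exception reported above: with only two remote nodes there is no intermediate hop, so only the node running mpirun ever has to resolve the names. One hedged way to test it (plm_rsh_no_tree_spawn is an rsh-launcher MCA param in the 1.8/1.10 series, but worth confirming with ompi_info on your build):

    # force mpirun to ssh every orted directly, bypassing the tree launch
    mpirun --mca plm_rsh_no_tree_spawn 1 -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname

If that three-node run passes without orte_keep_fqdn_hostnames, the domain name is being lost somewhere on the intermediate hop.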