Yes, all cross-node ssh works perfectly, and this is our production system,
which has been running for years. I've done all of this testing and was
puzzled by the inconsistent behavior that I observed. But enabling FQDN
hostnames resolves the issue, so I am just trying to understand why the
inconsistency exists now.
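
(Concretely, the fix suggested further down the thread is what we enabled,
i.e., something along the lines of:
mpirun --mca orte_keep_fqdn_hostnames 1 -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
with the host list being whatever the job needs.)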

[yqin@n0009.scs00 ~]$ ssh n0189.mako0 ssh n0198.mako0 echo ok
ok
[yqin@n0009.scs00 ~]$ ssh n0233.mako0 ssh n0198.mako0 echo ok
ok

The latter, stripped-hostname commands you give below would not work,
because n0198 by itself, without a domain name, does not resolve.
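
For example, the stripped variant would fail from here with the same
resolver error that appears in the mpirun output further down:

[yqin@n0009.scs00 ~]$ ssh n0189.mako0 ssh n0198 echo ok
ssh: Could not resolve hostname n0198: Name or service not known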

On Wed, Aug 26, 2015 at 3:48 PM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> is name resolution working on *all* the nodes?
> orted might be ssh'ed in a tree fashion.
> that means orted can be ssh'ed either by the node running mpirun or by any
> other node.
> from n0009.scs00, can you make sure
> ssh n0189.mako0 ssh n0198.mako0 echo ok
> ssh n0233.mako0 ssh n0198.mako0 echo ok
> both work?
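> you can also take tree spawn out of the picture (assuming the
> plm_rsh_no_tree_spawn param is available in your release) and check whether
> the 3-node case then behaves like the 2-node one:
> mpirun --mca plm_rsh_no_tree_spawn 1 -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname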
>
> per your log, mpirun might remove the domain name from the ssh command
> under the hood.
> e.g.
> ssh n0189.mako0 ssh n0198 echo ok
> or
> ssh n0198 ssh n0198.mako0 echo ok
> if that is the case, then I have no idea why we are doing this ...
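> one way to see what is exec'ed under the hood is to raise the launcher
> verbosity (a standard debug knob, though the exact output format varies by
> release):
> mpirun --mca plm_base_verbose 5 -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> that should print the ssh command line used to start each orted.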
>
> Cheers,
>
> Gilles
>
> On Thursday, August 27, 2015, Yong Qin <yong....@gmail.com> wrote:
>
>> > regardless of number of nodes
>>
>> No, this is not true. I was referring to this specific test, which was
>> the one that prevented me from thinking about FQDN, and the DN is
>> different in this case. As I clearly stated in my original question: "The
>> issue only exposes itself when more than 2 nodes are involved."
>>
>> [yqin@n0009.scs00 ~]$ mpirun -V
>> mpirun (Open MPI) 1.10.0
>>
>> [yqin@n0009.scs00 ~]$ mpirun -np 2 -H n0189.mako0,n0233.mako0 hostname
>> n0189.mako0
>> n0233.mako0
>>
>> On Tue, Aug 25, 2015 at 4:39 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Your earlier message indicates that it works fine as long as the DN is
>>> the same, regardless of the number of nodes. It only failed when the DNs
>>> of the nodes differed.
>>>
>>>
>>> On Aug 25, 2015, at 3:31 PM, Yong Qin <yong....@gmail.com> wrote:
>>>
>>> Of course! I blame that two-node test for distracting me from checking
>>> all the FQDN-relevant parameters. :)
>>>
>>> So why would the two-node test pass in this case without allowing FQDNs,
>>> then?
>>>
>>> Thanks,
>>>
>>> On Tue, Aug 25, 2015 at 2:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> You need to set --mca orte_keep_fqdn_hostnames 1 on your mpirun line, or
>>>> set the equivalent MCA param.
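>>>>
>>>> For example, any of the usual MCA mechanisms should work (standard Open
>>>> MPI conventions; the config-file path shown is the per-user default):
>>>>
>>>>   # on the mpirun command line
>>>>   mpirun --mca orte_keep_fqdn_hostnames 1 ...
>>>>
>>>>   # via the environment (OMPI_MCA_ prefix)
>>>>   export OMPI_MCA_orte_keep_fqdn_hostnames=1
>>>>
>>>>   # or persistently, one line in $HOME/.openmpi/mca-params.conf
>>>>   orte_keep_fqdn_hostnames = 1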
>>>>
>>>>
>>>> > On Aug 25, 2015, at 2:24 PM, Yong Qin <yong....@gmail.com> wrote:
>>>> >
>>>> > Hi,
>>>> >
> This has been bothering me for a while, but I never got a chance to
>>>> > identify the root cause. I know this issue can be due to a misconfigured
>>>> > network or ssh in many cases, but I'm pretty sure that we don't fall
>>>> > into that category at all. Proof? This issue doesn't happen in 1.6.x and
>>>> > earlier releases, only in 1.8+, including the 1.10.0 that was released
>>>> > today.
>>>> >
>>>> > [yqin@n0009.scs00 ~]$ mpirun -V
>>>> > mpirun (Open MPI) 1.6.5
>>>> >
>>>> > [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
>>>> > n0233.mako0
>>>> > n0189.mako0
>>>> > n0198.mako0
>>>> >
>>>> > [yqin@n0009.scs00 ~]$ mpirun -V
>>>> > mpirun (Open MPI) 1.8.8
>>>> >
>>>> > [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
>>>> > ssh: Could not resolve hostname n0198: Name or service not known
>>>> >
>>>> > --------------------------------------------------------------------------
>>>> > ORTE was unable to reliably start one or more daemons.
>>>> > This usually is caused by:
>>>> >
>>>> > * not finding the required libraries and/or binaries on
>>>> >   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>> >   settings, or configure OMPI with --enable-orterun-prefix-by-default
>>>> >
>>>> > * lack of authority to execute on one or more specified nodes.
>>>> >   Please verify your allocation and authorities.
>>>> >
>>>> > * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>>>> >   Please check with your sys admin to determine the correct location to use.
>>>> >
>>>> > * compilation of the orted with dynamic libraries when static are required
>>>> >   (e.g., on Cray). Please check your configure cmd line and consider using
>>>> >   one of the contrib/platform definitions for your system type.
>>>> >
>>>> > * an inability to create a connection back to mpirun due to a
>>>> >   lack of common network interfaces and/or no route found between
>>>> >   them. Please check network connectivity (including firewalls
>>>> >   and network routing requirements).
>>>> > --------------------------------------------------------------------------
>>>> >
>>>> > [yqin@n0009.scs00 ~]$ mpirun -V
>>>> > mpirun (Open MPI) 1.10.0
>>>> >
>>>> > [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
>>>> > ssh: Could not resolve hostname n0198: Name or service not known
>>>> >
>>>> > --------------------------------------------------------------------------
>>>> > ORTE was unable to reliably start one or more daemons.
>>>> > This usually is caused by:
>>>> >
>>>> > * not finding the required libraries and/or binaries on
>>>> >   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>>> >   settings, or configure OMPI with --enable-orterun-prefix-by-default
>>>> >
>>>> > * lack of authority to execute on one or more specified nodes.
>>>> >   Please verify your allocation and authorities.
>>>> >
>>>> > * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>>>> >   Please check with your sys admin to determine the correct location to use.
>>>> >
>>>> > * compilation of the orted with dynamic libraries when static are required
>>>> >   (e.g., on Cray). Please check your configure cmd line and consider using
>>>> >   one of the contrib/platform definitions for your system type.
>>>> >
>>>> > * an inability to create a connection back to mpirun due to a
>>>> >   lack of common network interfaces and/or no route found between
>>>> >   them. Please check network connectivity (including firewalls
>>>> >   and network routing requirements).
>>>> > --------------------------------------------------------------------------
>>>> >
>>>> >
>>>> > Note that I was running mpirun from "n0009.scs00". If I run it from a
>>>> > native "mako0" node, it passes as well.
>>>> >
>>>> > [yqin@n0198.mako0 ~]$ mpirun -V
>>>> > mpirun (Open MPI) 1.10.0
>>>> >
>>>> > [yqin@n0198.mako0 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
>>>> > n0189.mako0
>>>> > n0198.mako0
>>>> > n0233.mako0
>>>> >
>>>> > The network connection between n0009.scs00 and all the "mako0" nodes is
>>>> > clear, with no firewall at all, and all nodes are on the same subnet.
>>>> > The only thing that I can think of is the hostname itself. From the
>>>> > error message, mpirun was trying to resolve n0198, etc., even though the
>>>> > proper hostname passed to it was n0198.mako0. "n0198" by itself would
>>>> > not resolve, because we have multiple clusters configured within the
>>>> > same subnet and we need the cluster-name suffix as an identifier. But
>>>> > this is also not always true; for example, if I only have two nodes
>>>> > involved, then it passes as well, as shown below.
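>>>> >
>>>> > A quick way to see the resolver behavior from the launch node (standard
>>>> > glibc tooling; per the above, the first lookup succeeds and the second
>>>> > returns nothing):
>>>> >
>>>> > [yqin@n0009.scs00 ~]$ getent hosts n0198.mako0
>>>> > [yqin@n0009.scs00 ~]$ getent hosts n0198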
>>>> >
>>>> > [yqin@n0009.scs00 ~]$ mpirun -V
>>>> > mpirun (Open MPI) 1.10.0
>>>> >
>>>> > [yqin@n0009.scs00 ~]$ mpirun -np 2 -H n0189.mako0,n0233.mako0 hostname
>>>> > n0189.mako0
>>>> > n0233.mako0
>>>> >
>>>> > The issue only exposes itself when more than 2 nodes are involved. Any
>>>> > thoughts?
>>>> >
>>>> > Thanks,
>>>> >
>>>> > Yong Qin