Can you try
mpirun --mca plm_rsh_no_tree_spawn 1 ...
without the FQDN and see if it helps?
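
For example, with the three hosts from your earlier test (just a sketch; this parameter makes mpirun ssh every orted directly from the node running mpirun instead of launching through intermediate nodes):
mpirun --mca plm_rsh_no_tree_spawn 1 -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname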

Just to be clear, I can understand that the following command
ssh n0189.mako0 ssh n0198 echo ok
does not work and has never worked before.

What about the following command:
ssh n0189 ssh n0198.mako0 echo ok
My guess is that it should work, or if it does not work today, it used to work before.


Or maybe I am all wrong ...
Are you using any batch manager? If yes, which one?
The issue could be that OMPI is not using the batch-manager integration plugin as it should
(e.g., it used the batch manager rather than ssh in the past, so you never ran into this issue).
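
A quick way to see which launch components your build actually contains (the component names vary with how OMPI was configured; rsh, slurm and tm are typical examples):
ompi_info | grep plm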
With Open MPI 1.6.5, you can run
strace -f -e execve -s 1024 -- mpirun ...
and see whether
1) ssh is invoked
2) ssh is using the FQDN or not
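
For example (the trace file name here is arbitrary):
strace -f -e execve -s 1024 -o /tmp/mpirun.trace -- mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
grep ssh /tmp/mpirun.trace
The execve lines show the exact hostname argument handed to ssh.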

Another, less likely, option is that your ssh config has changed
(/etc/ssh/ssh_config or $HOME/.ssh/config).
It is possible to do some tweaking with hostnames, so that
ssh n0198 ...
really does
ssh n0198.mako0 ...
under the hood.
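
For example, a stanza like this in either file would do that rewrite silently (a hypothetical illustration; %h expands to the name given on the command line, and the !*.* entry keeps already-qualified names from being rewritten again):

Host n0* !*.*
    HostName %h.mako0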

Cheers,

Gilles

On 8/27/2015 8:08 AM, Yong Qin wrote:
Yes, all cross-node ssh works perfectly, and this is our production system, which has been running for years. I've done all of this testing and was puzzled by the inconsistent behavior that I observed. But enabling FQDN resolves the issue, so I am just trying to understand why the inconsistency exists now.

[yqin@n0009.scs00 ~]$ ssh n0189.mako0 ssh n0198.mako0 echo ok
ok
[yqin@n0009.scs00 ~]$ ssh n0233.mako0 ssh n0198.mako0 echo ok
ok

The latter one (ssh n0198 ssh n0198.mako0) wouldn't work because n0198 by itself, without a domain name, wouldn't resolve.
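
This can be seen with getent, which goes through the same NSS lookup ssh uses; from n0009.scs00:
getent hosts n0198
getent hosts n0198.mako0
only the second returns an entry here.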

On Wed, Aug 26, 2015 at 3:48 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

    Is name resolution working on *all* the nodes?
    orted might be ssh'ed in a tree fashion.
    That means orted can be ssh'ed either by the node running mpirun
    or by any other node.
    From n0009.scs00, can you make sure
    ssh n0189.mako0 ssh n0198.mako0 echo ok
    ssh n0233.mako0 ssh n0198.mako0 echo ok
    both work?

    Per your log, mpirun might remove the domain name from the ssh
    command under the hood,
    e.g.
    ssh n0189.mako0 ssh n0198 echo ok
    or
    ssh n0198 ssh n0198.mako0 echo ok
    If that is the case, then I have no idea why we are doing this ...
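
    One way to see the exact ssh command line mpirun builds is the
    standard plm verbosity knob (level 10 is just a reasonably chatty
    choice):
    mpirun --mca plm_base_verbose 10 -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname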

    Cheers,

    Gilles

    On Thursday, August 27, 2015, Yong Qin <yong....@gmail.com> wrote:

        > regardless of number of nodes

        No, this is not true. I was referring to this specific test,
        which was the one that kept me from thinking about FQDN,
        and the DN is different in this case. As I clearly stated in
        my original question: "The issue only exposes itself when
        more than 2 nodes are involved."

        [yqin@n0009.scs00 ~]$ mpirun -V
        mpirun (Open MPI) 1.10.0

        [yqin@n0009.scs00 ~]$ mpirun -np 2 -H n0189.mako0,n0233.mako0 hostname
        n0189.mako0
        n0233.mako0

        On Tue, Aug 25, 2015 at 4:39 PM, Ralph Castain <r...@open-mpi.org> wrote:

            Your earlier message indicated that it worked fine as long
            as the DN was the same, regardless of the number of nodes. It
            only failed when the DNs of the nodes differed.


            On Aug 25, 2015, at 3:31 PM, Yong Qin <yong....@gmail.com> wrote:

            Of course! I blame that two-node test for distracting me from
            checking all the FQDN-relevant parameters. :)

            So why would the two-node test pass in this case without
            allowing the FQDN, then?

            Thanks,

            On Tue, Aug 25, 2015 at 2:35 PM, Ralph Castain <r...@open-mpi.org> wrote:

                You need to set --mca orte_keep_fqdn_hostnames 1 on
                your mpirun line, or set the equivalent MCA param.
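
                E.g., two standard ways to set the equivalent param
                (either works; the file path is the usual per-user
                location):

                export OMPI_MCA_orte_keep_fqdn_hostnames=1

                or, in $HOME/.openmpi/mca-params.conf:

                orte_keep_fqdn_hostnames = 1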


                > On Aug 25, 2015, at 2:24 PM, Yong Qin <yong....@gmail.com> wrote:
                >
                > Hi,
                >
                > This has been bothering me for a while but I never got a chance to
                > identify the root cause. I know this issue can be due to a
                > misconfigured network or ssh in many cases, but I'm pretty sure that
                > we don't fall into that category at all. Proof? This issue doesn't
                > happen in 1.6.x and earlier releases, only in 1.8+, including the
                > 1.10.0 that was released today.
                >
                > [yqin@n0009.scs00 ~]$ mpirun -V
                > mpirun (Open MPI) 1.6.5
                >
                > [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
                > n0233.mako0
                > n0189.mako0
                > n0198.mako0
                >
                > [yqin@n0009.scs00 ~]$ mpirun -V
                > mpirun (Open MPI) 1.8.8
                >
                > [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
                > ssh: Could not resolve hostname n0198: Name or service not known
                >
                > --------------------------------------------------------------------------
                > ORTE was unable to reliably start one or more daemons.
                > This usually is caused by:
                >
                > * not finding the required libraries and/or binaries on
                >   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
                >   settings, or configure OMPI with --enable-orterun-prefix-by-default
                >
                > * lack of authority to execute on one or more specified nodes.
                >   Please verify your allocation and authorities.
                >
                > * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
                >   Please check with your sys admin to determine the correct location to use.
                >
                > * compilation of the orted with dynamic libraries when static are required
                >   (e.g., on Cray). Please check your configure cmd line and consider using
                >   one of the contrib/platform definitions for your system type.
                >
                > * an inability to create a connection back to mpirun due to a
                >   lack of common network interfaces and/or no route found between
                >   them. Please check network connectivity (including firewalls
                >   and network routing requirements).
                > --------------------------------------------------------------------------
                >
                > [yqin@n0009.scs00 ~]$ mpirun -V
                > mpirun (Open MPI) 1.10.0
                >
                > [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
                > ssh: Could not resolve hostname n0198: Name or service not known
                >
                > [same ORTE "unable to reliably start one or more daemons" error as above]
                >
                >
                > Note that I was running mpirun from "n0009.scs00".
                > If I ran it from a native "mako0" node, it would pass as well.
                >
                > [yqin@n0198.mako0 ~]$ mpirun -V
                > mpirun (Open MPI) 1.10.0
                >
                > [yqin@n0198.mako0 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
                > n0189.mako0
                > n0198.mako0
                > n0233.mako0
                >
                > The network connection between n0009.scs00 and all the "mako0" nodes
                > is clear, with no firewall at all, and all are on the same subnet.
                > The only thing that I can think of is the hostname itself. From the
                > error message, mpirun was trying to resolve n0198, etc., even though
                > the proper hostname passed to it was n0198.mako0. "n0198" by itself
                > would not resolve because we have multiple clusters configured within
                > the same subnet and we need the cluster-name suffix as an identifier.
                > But this is not always the case; for example, if only two nodes are
                > involved then it passes as well.
                >
                > [yqin@n0009.scs00 ~]$ mpirun -V
                > mpirun (Open MPI) 1.10.0
                >
                > [yqin@n0009.scs00 ~]$ mpirun -np 2 -H n0189.mako0,n0233.mako0 hostname
                > n0189.mako0
                > n0233.mako0
                >
                > The issue only exposes itself when more than 2 nodes are involved.
                > Any thoughts?
                >
                > Thanks,
                >
                > Yong Qin