Can you try
mpirun --mca plm_rsh_no_tree_spawn 1 ...
without the FQDN and see if it helps?
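To spell that out, here is a sketch of the full command, reusing your three-node reproducer with the short hostnames (adjust the host list if the short names do not resolve from the head node at all; the only point is to add the plm_rsh_no_tree_spawn flag):

mpirun --mca plm_rsh_no_tree_spawn 1 -np 3 -H n0189,n0233,n0198 hostname

With plm_rsh_no_tree_spawn set to 1, mpirun ssh's every orted directly from the node it runs on instead of routing some launches through intermediate nodes.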
Just to be clear, I understand that the following command
ssh n0189.mako0 ssh n0198 echo ok
does not work and has never worked before.
What about the following command:
ssh n0189 ssh n0198.mako0 echo ok
My guess is that this should work, or if it does not work today, that it used to work before.
Or maybe I am all wrong ...
Are you using any batch manager? If yes, which one?
The issue could be that Open MPI is not using the batch manager integration plugin as it should
(e.g. it did not use ssh in the past, so you never ran into this issue). One way to check which launcher is picked is sketched below.
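A minimal sketch of such a check (this only raises the launcher framework's debug verbosity, and the exact messages vary between releases):

mpirun --mca plm_base_verbose 10 -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname 2>&1 | grep plm

If the rsh/ssh component is selected instead of the batch manager one (e.g. slurm or tm), that would point at the integration plugin not being used.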
With openmpi 1.6.5, you can run
strace -f -e execve -s 1024 -- mpirun ...
and see if
1) ssh is invoked
2) ssh is using the FQDN or not
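For example, a sketch reusing your reproducer (-f follows the children mpirun forks, and -s 1024 keeps the long ssh argument lists from being truncated):

strace -f -e execve -s 1024 -- mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname 2>&1 | grep ssh

If ssh shows up at all, the execve lines contain its full argument list, so you can see the exact hostname being passed, with or without the .mako0 suffix.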
Another, less likely option is that your ssh config has changed
(/etc/ssh/ssh_config or $HOME/.ssh/config).
It is possible to do some tweaking with hostnames there, so that
ssh n0198 ...
really does
ssh n0198.mako0 ...
under the hood. A minimal sketch of such a rule is below.
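For illustration only, a hypothetical $HOME/.ssh/config stanza (I am not saying this is what you have, just showing the kind of rewriting that can happen):

Host n0*
    HostName %h.mako0

With such a rule in place, a plain "ssh n0198" would silently connect to n0198.mako0, which would have hidden a missing domain name until now.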
Cheers,
Gilles
On 8/27/2015 8:08 AM, Yong Qin wrote:
Yes, all cross-node ssh works perfectly, and this is our production system, which has been running for years. I've done all of this testing and was puzzled by the inconsistent behavior that I observed. But enabling FQDN resolves the issue, so I am just trying to understand why the inconsistency exists now.
[yqin@n0009.scs00 ~]$ ssh n0189.mako0 ssh n0198.mako0 echo ok
ok
[yqin@n0009.scs00 ~]$ ssh n0233.mako0 ssh n0198.mako0 echo ok
ok
The ones with the bare n0198 wouldn't work, because n0198 by itself, without a domain name, doesn't resolve.
On Wed, Aug 26, 2015 at 3:48 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
Is name resolution working on *all* the nodes?
orted might be ssh'ed in a tree fashion;
that means orted can be ssh'ed either by the node running mpirun
or by any other node.
From n0009.scs00, can you make sure that
ssh n0189.mako0 ssh n0198.mako0 echo ok
ssh n0233.mako0 ssh n0198.mako0 echo ok
both work?
Per your log, mpirun might remove the domain name from the ssh command under the hood, e.g.
ssh n0189.mako0 ssh n0198 echo ok
or
ssh n0198 ssh n0198.mako0 echo ok
If that is the case, then I have no idea why we are doing this ...
Cheers,
Gilles
On Thursday, August 27, 2015, Yong Qin <yong....@gmail.com> wrote:
> regardless of number of nodes
No, this is not true. I was referring to this specific test, which was the one that prevented me from thinking about FQDN, and the DN is different in this case. As I clearly stated in my original question: "The issue only exposes itself when more than 2 nodes are involved."
[yqin@n0009.scs00 ~]$ mpirun -V
mpirun (Open MPI) 1.10.0
[yqin@n0009.scs00 ~]$ mpirun -np 2 -H n0189.mako0,n0233.mako0 hostname
n0189.mako0
n0233.mako0
On Tue, Aug 25, 2015 at 4:39 PM, Ralph Castain <r...@open-mpi.org> wrote:
Your earlier message indicated that it worked fine so long as the DN was the same, regardless of the number of nodes. It only failed when the DNs of the nodes differed.
On Aug 25, 2015, at 3:31 PM, Yong Qin <yong....@gmail.com> wrote:
Of course! I blame that two-node test for distracting me from checking all the FQDN-relevant parameters. :)
So why would the two-node test pass in this case without allowing the FQDN, then?
Thanks,
On Tue, Aug 25, 2015 at 2:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
You need to set --mca orte_keep_fqdn_hostnames 1 on your mpirun line, or set the equivalent MCA param.
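For instance, a sketch of both forms (the file path is the standard per-user MCA params file; a system-wide openmpi-mca-params.conf works too):

mpirun --mca orte_keep_fqdn_hostnames 1 -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname

or add this line to $HOME/.openmpi/mca-params.conf:

orte_keep_fqdn_hostnames = 1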
> On Aug 25, 2015, at 2:24 PM, Yong Qin <yong....@gmail.com> wrote:
>
> Hi,
>
> This has been bothering me for a while, but I never got a chance to identify the root cause. I know this issue could be due to a misconfiguration of the network or ssh in many cases, but I'm pretty sure that we don't fall into that category at all. Proof? This issue doesn't happen in 1.6.x and earlier releases, only in 1.8+, including 1.10.0, which was released today.
>
> [yqin@n0009.scs00 ~]$ mpirun -V
> mpirun (Open MPI) 1.6.5
>
> [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> n0233.mako0
> n0189.mako0
> n0198.mako0
>
> [yqin@n0009.scs00 ~]$ mpirun -V
> mpirun (Open MPI) 1.8.8
>
> [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> ssh: Could not resolve hostname n0198: Name or service not known
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
>
> * compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
>
> --------------------------------------------------------------------------
>
> [yqin@n0009.scs00 ~]$ mpirun -V
> mpirun (Open MPI) 1.10.0
>
> [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> ssh: Could not resolve hostname n0198: Name or service not known
> [... followed by the same "ORTE was unable to reliably start one or more daemons" message as above ...]
>
> Note that I was running mpirun from "n0009.scs00". If I ran it from a native "mako0" node, it would pass as well.
>
> [yqin@n0198.mako0 ~]$ mpirun -V
> mpirun (Open MPI) 1.10.0
>
> [yqin@n0198.mako0 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> n0189.mako0
> n0198.mako0
> n0233.mako0
>
> The network connection between n0009.scs00 and all the "mako0" nodes is clear, there is no firewall at all, and they are all on the same subnet. The only thing that I can think of is the hostname itself. From the error message, mpirun was trying to resolve n0198, etc., even though the proper hostname that was passed to it was n0198.mako0. "n0198" by itself would not resolve, because we have multiple clusters configured within the same subnet and we need the cluster name suffix as an identifier. But this is also not always true; for example, if I only have two nodes involved then it would pass as well.
>
> [yqin@n0009.scs00 ~]$ mpirun -V
> mpirun (Open MPI) 1.10.0
>
> [yqin@n0009.scs00 ~]$ mpirun -np 2 -H n0189.mako0,n0233.mako0 hostname
> n0189.mako0
> n0233.mako0
>
> The issue only exposes itself when more than 2 nodes are involved. Any thoughts?
>
> Thanks,
>
> Yong Qin
_______________________________________________
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post:
http://www.open-mpi.org/community/lists/users/2015/08/27499.php