Hi,

This has been bothering me for a while, but I never got a chance to
identify the root cause. I know this kind of issue is often caused by a
misconfigured network or ssh setup, but I'm fairly sure we don't fall
into that category at all. The proof: the issue doesn't happen in 1.6.x
and earlier releases, only in 1.8+, including 1.10.0, which was released
today.

[yqin@n0009.scs00 ~]$ mpirun -V
mpirun (Open MPI) 1.6.5

[yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
n0233.mako0
n0189.mako0
n0198.mako0

[yqin@n0009.scs00 ~]$ mpirun -V
mpirun (Open MPI) 1.8.8

[yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
ssh: Could not resolve hostname n0198: Name or service not known
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

[yqin@n0009.scs00 ~]$ mpirun -V
mpirun (Open MPI) 1.10.0

[yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
ssh: Could not resolve hostname n0198: Name or service not known
[... same "ORTE was unable to reliably start one or more daemons" help
message as above ...]


Note that I was running mpirun from "n0009.scs00". If I run it from a
native "mako0" node, it passes as well.

[yqin@n0198.mako0 ~]$ mpirun -V
mpirun (Open MPI) 1.10.0

[yqin@n0198.mako0 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
n0189.mako0
n0198.mako0
n0233.mako0
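
If it helps with debugging, I can rerun the failing case with the
launcher verbosity raised; plm_base_verbose is the standard ORTE
launch-framework verbosity knob and should show the exact ssh command
line that ORTE builds (output omitted here since I haven't captured it
yet):

[yqin@n0009.scs00 ~]$ mpirun --mca plm_base_verbose 5 -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname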

The network connection between n0009.scs00 and all the "mako0" nodes is
clear, there is no firewall at all, and everything is on the same
subnet. The only thing I can think of is the hostname itself. From the
error message, mpirun was trying to resolve "n0198", etc., even though
the proper hostname passed to it was "n0198.mako0". "n0198" by itself
does not resolve because we have multiple clusters configured within the
same subnet, and we need the cluster-name suffix as an identifier.
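
A quick way to see the difference from n0009.scs00 (getent exit code 2
means the name was not found; the actual address of the suffixed name is
elided here):

[yqin@n0009.scs00 ~]$ getent hosts n0198; echo $?
2
[yqin@n0009.scs00 ~]$ getent hosts n0198.mako0 >/dev/null; echo $?
0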
But this is also not always true: for example, if only two nodes are
involved, it passes as well.

[yqin@n0009.scs00 ~]$ mpirun -V
mpirun (Open MPI) 1.10.0

[yqin@n0009.scs00 ~]$ mpirun -np 2 -H n0189.mako0,n0233.mako0 hostname
n0189.mako0
n0233.mako0

The issue only shows up when more than two nodes are involved. Any
thoughts?
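
One knob that looks related, in case anyone can confirm (untested on my
side, and I'm only assuming the truncation happens inside ORTE's
hostname handling rather than in ssh itself): orte_keep_fqdn_hostnames,
which is supposed to keep ORTE from shortening names at the first dot.

[yqin@n0009.scs00 ~]$ mpirun --mca orte_keep_fqdn_hostnames 1 -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname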

Thanks,

Yong Qin
