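You need to set --mca orte_keep_fqdn_hostnames 1 on your mpirun line, or set the 
equivalent MCA param. Otherwise the 1.8 series and later shorten each hostname 
to its first component, which is why ssh is being handed n0198 rather than 
n0198.mako0.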
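
For reference, here is a rough sketch of the usual ways to set that parameter. The host names are the ones from your report, and the per-user file location is the standard default, which may differ on a customized install, so treat this as a starting point rather than the definitive recipe.

```shell
# Option 1: set it through the environment. Open MPI treats any variable
# named OMPI_MCA_<param> as an MCA parameter for every later mpirun in
# this shell session.
export OMPI_MCA_orte_keep_fqdn_hostnames=1

# Option 2: pass it on the command line for a single run
# (left commented out here because it needs the actual cluster):
#   mpirun --mca orte_keep_fqdn_hostnames 1 -np 3 \
#       -H n0189.mako0,n0233.mako0,n0198.mako0 hostname

# Option 3: make it persistent by adding this line to the per-user file
# $HOME/.openmpi/mca-params.conf (assumed default location):
#   orte_keep_fqdn_hostnames = 1
```

Any one of the three should be enough. The command-line form is the easiest way to confirm the fix before making it permanent.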


> On Aug 25, 2015, at 2:24 PM, Yong Qin <yong....@gmail.com> wrote:
> 
> Hi,
> 
> This has been bothering me for a while but I never got a chance to identify 
> the root cause. I know this issue could be due to misconfig of network or ssh 
> in many cases. But I'm pretty sure that we don't fall into that category at 
> all. Proof? This issue doesn't happen in 1.6.x and earlier releases, but only 
> in 1.8+ including the 1.10.0 which was released today.
> 
> [yqin@n0009.scs00 ~]$ mpirun -V
> mpirun (Open MPI) 1.6.5
> 
> [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> n0233.mako0
> n0189.mako0
> n0198.mako0
> 
> [yqin@n0009.scs00 ~]$ mpirun -V
> mpirun (Open MPI) 1.8.8
> 
> [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> ssh: Could not resolve hostname n0198: Name or service not known
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> 
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
> 
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
> 
> *  compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
> 
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
> 
> [yqin@n0009.scs00 ~]$ mpirun -V
> mpirun (Open MPI) 1.10.0
> 
> [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> ssh: Could not resolve hostname n0198: Name or service not known
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
> 
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
> 
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
> 
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
> 
> *  compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
> 
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
> --------------------------------------------------------------------------
> 
> 
> Note that I was running the mpirun from "n0009.scs00". If I ran it from a 
> native "mako0" node, it would pass as well.
> 
> [yqin@n0198.mako0 ~]$ mpirun -V
> mpirun (Open MPI) 1.10.0
> 
> [yqin@n0198.mako0 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
> n0189.mako0
> n0198.mako0
> n0233.mako0
> 
> Connectivity between n0009.scs00 and all the "mako0" nodes is clear: there 
> is no firewall at all, and they are all on the same subnet. The only thing 
> that I can think of is the hostname itself. From the error message, mpirun 
> was trying to resolve n0198, etc., even though the proper hostname that was 
> passed to it was n0198.mako0. "n0198" by itself would not resolve because we 
> have multiple clusters configured within the same subnet and we do need the 
> cluster name suffix as an identifier. But this is also not always true; for 
> example, if I only have two nodes involved, then it would pass as well.
> 
> [yqin@n0009.scs00 ~]$ mpirun -V
> mpirun (Open MPI) 1.10.0
> 
> [yqin@n0009.scs00 ~]$ mpirun -np 2 -H n0189.mako0,n0233.mako0 hostname
> n0189.mako0
> n0233.mako0
> 
> The issue only exposes itself when more than 2 nodes are involved. Any 
> thoughts?
> 
> Thanks,
> 
> Yong Qin
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/08/27489.php
