Is name resolution working on *all* the nodes? orted might be ssh'ed in a tree fashion, which means orted can be ssh'ed either by the node running mpirun or by any other node. From n0009.scs00, can you make sure both of these work?

ssh n0189.mako0 ssh n0198.mako0 echo ok
ssh n0233.mako0 ssh n0198.mako0 echo ok
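For example, something like this would exercise every first-hop/second-hop pair in one go (just a sketch, assuming a POSIX shell and passwordless ssh between all the nodes):

# Verify that every node can resolve and ssh to every other node,
# not just that n0009.scs00 can reach each of them directly.
NODES="n0189.mako0 n0233.mako0 n0198.mako0"
for hop1 in $NODES; do
  for hop2 in $NODES; do
    # skip the trivial self-to-self case
    [ "$hop1" = "$hop2" ] && continue
    ssh "$hop1" ssh "$hop2" echo "ok from $hop2 via $hop1" \
      || echo "FAILED: $hop1 to $hop2"
  done
done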
Per your log, mpirun might remove the domain name from the ssh command under the hood, e.g.

ssh n0189.mako0 ssh n0198 echo ok

or

ssh n0198 ssh n0198.mako0 echo ok

If that is the case, then I have no idea why we are doing this ...

Cheers,

Gilles

On Thursday, August 27, 2015, Yong Qin <yong....@gmail.com> wrote:

> > regardless of number of nodes
>
> No, this is not true. I was referring to this specific test, which was
> the one that prevented me from thinking about FQDN, and the DN is
> different in this case. As I clearly stated in my original question -
> "The issue only exposes itself when more than 2 nodes are involved."
>
> [yqin@n0009.scs00 ~]$ mpirun -V
> mpirun (Open MPI) 1.10.0
>
> [yqin@n0009.scs00 ~]$ mpirun -np 2 -H n0189.mako0,n0233.mako0 hostname
> n0189.mako0
> n0233.mako0
>
> On Tue, Aug 25, 2015 at 4:39 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Your earlier message indicates that it works fine so long as the DN was
>> the same, regardless of the number of nodes. It only failed when the DNs
>> of the nodes differed.
>>
>> On Aug 25, 2015, at 3:31 PM, Yong Qin <yong....@gmail.com> wrote:
>>
>> Of course! I blame that two-node test for distracting me from checking
>> all the FQDN-relevant parameters. :)
>>
>> So why would the two-node test pass in this case without allowing the
>> FQDN then?
>>
>> Thanks,
>>
>> On Tue, Aug 25, 2015 at 2:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> You need to set --mca orte_keep_fqdn_hostnames 1 on your mpirun line,
>>> or set the equivalent MCA param.
>>>
>>> > On Aug 25, 2015, at 2:24 PM, Yong Qin <yong....@gmail.com> wrote:
>>> >
>>> > Hi,
>>> >
>>> > This has been bothering me for a while, but I never got a chance to
>>> > identify the root cause. I know this issue can be due to a misconfigured
>>> > network or ssh in many cases, but I'm pretty sure we don't fall into
>>> > that category at all. Proof? This issue doesn't happen in 1.6.x and
>>> > earlier releases, only in 1.8+, including the 1.10.0 that was released
>>> > today.
>>> >
>>> > [yqin@n0009.scs00 ~]$ mpirun -V
>>> > mpirun (Open MPI) 1.6.5
>>> >
>>> > [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
>>> > n0233.mako0
>>> > n0189.mako0
>>> > n0198.mako0
>>> >
>>> > [yqin@n0009.scs00 ~]$ mpirun -V
>>> > mpirun (Open MPI) 1.8.8
>>> >
>>> > [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
>>> > ssh: Could not resolve hostname n0198: Name or service not known
>>> > --------------------------------------------------------------------------
>>> > ORTE was unable to reliably start one or more daemons.
>>> > This usually is caused by:
>>> >
>>> > * not finding the required libraries and/or binaries on
>>> >   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>>> >   settings, or configure OMPI with --enable-orterun-prefix-by-default
>>> >
>>> > * lack of authority to execute on one or more specified nodes.
>>> >   Please verify your allocation and authorities.
>>> >
>>> > * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>>> >   Please check with your sys admin to determine the correct location to use.
>>> >
>>> > * compilation of the orted with dynamic libraries when static are required
>>> >   (e.g., on Cray).
>>> >   Please check your configure cmd line and consider using
>>> >   one of the contrib/platform definitions for your system type.
>>> >
>>> > * an inability to create a connection back to mpirun due to a
>>> >   lack of common network interfaces and/or no route found between
>>> >   them. Please check network connectivity (including firewalls
>>> >   and network routing requirements).
>>> > --------------------------------------------------------------------------
>>> >
>>> > [yqin@n0009.scs00 ~]$ mpirun -V
>>> > mpirun (Open MPI) 1.10.0
>>> >
>>> > [yqin@n0009.scs00 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
>>> > ssh: Could not resolve hostname n0198: Name or service not known
>>> > [... same "ORTE was unable to reliably start one or more daemons" help text as above ...]
>>> >
>>> > Note that I was running mpirun from "n0009.scs00". If I run it from a
>>> > native "mako0" node, it passes as well.
>>> >
>>> > [yqin@n0198.mako0 ~]$ mpirun -V
>>> > mpirun (Open MPI) 1.10.0
>>> >
>>> > [yqin@n0198.mako0 ~]$ mpirun -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname
>>> > n0189.mako0
>>> > n0198.mako0
>>> > n0233.mako0
>>> >
>>> > The network connections between n0009.scs00 and all the "mako0" nodes
>>> > are clear, there is no firewall at all, and everything is on the same
>>> > subnet. The only thing I can think of is the hostname itself. From the
>>> > error message, mpirun was trying to resolve n0198, etc., even though
>>> > the proper hostname passed to it was n0198.mako0. "n0198" by itself
>>> > would not resolve because we have multiple clusters configured within
>>> > the same subnet and we need the cluster-name suffix as an identifier.
>>> > But this is also not always true; for example, if only two nodes are
>>> > involved then it passes as well.
>>> >
>>> > [yqin@n0009.scs00 ~]$ mpirun -V
>>> > mpirun (Open MPI) 1.10.0
>>> >
>>> > [yqin@n0009.scs00 ~]$ mpirun -np 2 -H n0189.mako0,n0233.mako0 hostname
>>> > n0189.mako0
>>> > n0233.mako0
>>> >
>>> > The issue only exposes itself when more than 2 nodes are involved.
>>> > Any thoughts?
>>> >
>>> > Thanks,
>>> >
>>> > Yong Qin
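For reference, the workaround Ralph suggested would look something like the following on the failing launch node (a sketch based on his reply; orte_keep_fqdn_hostnames is the parameter he named, while the OMPI_MCA_* environment form and the mca-params.conf file are Open MPI's standard ways of setting an MCA parameter):

[yqin@n0009.scs00 ~]$ mpirun --mca orte_keep_fqdn_hostnames 1 -np 3 -H n0189.mako0,n0233.mako0,n0198.mako0 hostname

# Equivalent via the environment, so it applies to every mpirun invocation:
[yqin@n0009.scs00 ~]$ export OMPI_MCA_orte_keep_fqdn_hostnames=1

# Or persistently, in the per-user MCA parameter file:
[yqin@n0009.scs00 ~]$ echo "orte_keep_fqdn_hostnames = 1" >> ~/.openmpi/mca-params.conf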