Jeff, Gus, Gilles,

On 14.11.2014 at 15:56, Jeff Squyres (jsquyres) wrote:

> I lurked on this thread for a while, but I have some thoughts on the many 
> issues that were discussed here (sorry, I'm still pretty underwater 
> trying to get ready for SC next week...).

I appreciate your replies and will read them thoroughly. I think it's best to 
continue with the discussion after SC14. I don't want to put any burden on 
anyone when time is tight.

-- Reuti


>  These points are in no particular order...
> 
> 0. Two fundamental points have been missed in this thread:
> 
>   - A hostname technically has nothing to do with the resolvable name of an 
> IP interface.  By convention, many people set the hostname to be the same as 
> some "primary" IP interface (for some definition of "primary", e.g., eth0).  
> But they are actually unrelated concepts.
> 
>   - Open MPI uses host specifications only to specify a remote server, *NOT* 
> an interface.  E.g., when you list names in a hostfile or the --host CLI 
> option, those only specify the server -- not the interface(s).  This was an 
> intentional design choice because there tends to be confusion and different 
> schools of thought about the question "What's the [resolvable] name of that 
> remote server?"  Hence, OMPI will take any old name you throw at it to 
> identify that remote server, but then we have separate controls for 
> specifying which interface(s) to use to communicate with that server.
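> 
> As a quick illustration (a sketch only; the host and interface names 
> are made up), the remote servers and the interfaces used to reach them 
> are given with separate options:
> 
>    shell$ mpirun --host node01,node02 \
>        --mca oob_tcp_if_include eth0 \
>        --mca btl_tcp_if_include eth0 \
>        ./my_mpi_app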
> 
> 1. Remember that there is at least one, and possibly two, uses of TCP 
> communications in Open MPI -- and they are used differently:
> 
>   - Command/control (sometimes referred to as "oob"): used for things like 
> mpirun control messages, shuttling IO from remote processes back to mpirun, 
> etc.  Generally, unless you have a mountain of stdout/stderr from your 
> launched processes, this isn't a huge amount of traffic.
> 
>   - MPI messages: kernel-based TCP is the fallback if you don't have some 
> kind of faster off-server network -- i.e., the TCP BTL.  Like all BTLs, the 
> TCP BTL carries all MPI traffic when it is used.  How much traffic is 
> sent/received depends on your application.
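> 
> For example, to force MPI traffic over the TCP BTL even on a machine 
> with a faster network (a sketch; "sm" and "self" are kept for on-node 
> and loopback communication):
> 
>    shell$ mpirun --mca btl tcp,sm,self ./my_mpi_app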
> 
> 2. For OOB, I believe that the current ORTE mechanism is that it will try all 
> available IP interfaces and use the *first* one that succeeds.  Meaning: 
> after some negotiation, only one IP interface will be used to communicate 
> with a given peer.
> 
> 3. The TCP BTL will examine all local IP interfaces and determine all that 
> can be used to reach each peer according to the algorithm described here: 
> http://www.open-mpi.org/faq/?category=tcp#tcp-routability-1.3.  It will use 
> *all* IP interfaces to reach a given peer in order to maximize the available 
> bandwidth.
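> 
> E.g., restricting the BTL to two interfaces that it will then stripe 
> across (a sketch with made-up interface names):
> 
>    shell$ mpirun --mca btl_tcp_if_include eth0,eth2 ./my_mpi_app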
> 
> 4. The usNIC BTL uses UDP as its wire transport, and therefore has the same 
> reachability issues as both the TCP OOB and BTL.  However, we use a different 
> mechanism than the algorithm described in the above-cited FAQ item: we simply 
> query the Linux routing table.  This can cause ARP requests, but the kernel 
> caches them (e.g., for multiple MPI procs on the same server making the 
> same/similar requests), and for a properly-segmented L3 network, each MPI 
> process will effectively end up querying about its local gateway (vs. the 
> actual peer), and therefore the chances of having that ARP already cached are 
> quite high.
> 
> --> I want to make this clear: there's nothing magic about the 
> usNIC/check-the-routing-table approach.  It's actually a very standard 
> IP/datacenter method.  With a proper routing table, you can know fairly 
> quickly whether local IP interface X can reach remote IP interface Y.
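> 
> You can do the same lookup by hand on Linux to see what the kernel 
> would choose (a sketch; the addresses and the exact output format are 
> illustrative):
> 
>    shell$ ip route get 10.10.10.42
>    10.10.10.42 via 10.10.0.1 dev eth0 src 10.10.0.5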
> 
> 5. The original problem cited in this thread was about the TCP OOB, not the 
> TCP BTL.  It's important to keep straight that the OOB, with no guidance from 
> the user, was trying to probe the different IP interfaces and find one that 
> would reach a peer.  Using the check-the-routing-table approach cited in #4, 
> we might be able to make this better (that's what Ralph and I are going to 
> talk about in December / post-SC / post-US Thanksgiving holiday).
> 
> 6. As a sidenote to #5, the TCP OOB and TCP BTL determine reachability in 
> different ways.  Remember that the TCP BTL has the benefit of having all the 
> ORTE infrastructure up and running.  Meaning: MPI processes can exchange IP 
> interface information and then use that information to compute which peer IP 
> interfaces can be reached.  The TCP OOB doesn't have this benefit -- it's 
> being used to establish initial connectivity.  Hence, it probes each IP 
> interface to see if it can reach a given peer.
> 
> --> We apparently need to do that probe better (vs. blocking in a serial 
> fashion, and eventually timing out on "bad" interfaces and then trying the 
> next one). 
> 
> Having a bad route or gateway listed in a server's IP setup, however, will 
> make the process take an artificially long time.  This is a user error that 
> Open MPI cannot compensate for.  If prior versions of OMPI tried interfaces 
> in a different order that luckily worked nicely, cool.  But as Gilles 
> mentioned, that was luck -- there was still a user config error that was the 
> real underlying issue.
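> 
> A quick way to spot (and, as root, remove) such a bad entry on Linux 
> (a sketch; the gateway address is made up):
> 
>    shell$ ip route show
>    default via 192.168.0.254 dev eth0    <-- is this gateway actually reachable?
>    shell$ ip route del default via 192.168.0.254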
> 
> 7. Someone asked: does it matter in which order you specify interfaces in 
> btl_tcp_if_include?  No, it effectively does not.  Open MPI will use all 
> the listed interfaces.  If you only send one short MPI message to a peer, 
> then yes, OMPI will only use one of those interfaces, but that's not the 
> usual case.  Open MPI will effectively round-robin multiplex across all 
> the interfaces that you list (or all the interfaces that are not 
> excluded).  They're all used equally 
> unless you specify a weighting factor (i.e., bandwidth) for each interface.
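> 
> The exact names of the weighting parameters can vary between releases, 
> so rather than guess them here, list what your build actually supports:
> 
>    shell$ ompi_info --param btl tcp | grep -E 'bandwidth|latency'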
> 
> 8. Don't forget that you can use CIDR notation to specify which interfaces to 
> use, too.  E.g., "--mca btl_tcp_if_include 10.10.10.0/24".  That way, you 
> don't have to know which interface a given network uses (and it might even be 
> different on different servers).  Same goes for the oob_tcp_if_*clude MCA 
> params, too.
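> 
> E.g., applying the same CIDR spec to both layers (a sketch; the subnet 
> is made up):
> 
>    shell$ mpirun --mca oob_tcp_if_include 10.10.10.0/24 \
>        --mca btl_tcp_if_include 10.10.10.0/24 ./my_mpi_app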
> 
> 9. If I followed the thread properly (and I might not have?), I think Reuti 
> eliminated a bad route/gateway and reduced the dead time during startup to be 
> much shorter.  But there still seems to be a 30 second timeout in there when 
> no sysadmin-specified oob_tcp_if_include param is provided.  If this is 
> correct, Reuti, can you send the full "ifconfig -a" output from two servers 
> in question (i.e., 2 servers where you can reproduce the problem), and the 
> full routing tables between those two servers?  (make sure to show all 
> routing tables on each server - fun fact, did you know that you can have a 
> different routing table for each IP interface in Linux?)  Include any 
> relevant network routing tables (e.g., from intermediate switches), if 
> they're not just pass thru.
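> 
> Something like the following, run on both servers, should capture all 
> of that (Linux commands; "table all" and "ip rule" cover the 
> per-interface routing tables mentioned above):
> 
>    shell$ ifconfig -a
>    shell$ ip route show table all
>    shell$ ip rule show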
> 
> 
> 
> 
> On Nov 13, 2014, at 9:17 PM, Gilles Gouaillardet 
> <gilles.gouaillar...@iferc.org> wrote:
> 
>> My 0.02 US$
>> 
>> First, the root cause of the problem was that a default gateway was
>> configured on the node, but this gateway was unreachable.
>> imho, this is an incorrect system setting that can lead to unpredictable
>> results:
>> - openmpi 1.8.1 works (you are lucky, good for you)
>> - openmpi 1.8.3 fails (no luck this time, too bad)
>> so I believe it is incorrect to blame openmpi for this.
>> 
>> That being said, you raise some good points about how to improve
>> user-friendliness for end users who have limited skills and/or
>> interest in Open MPI and system administration.
>> 
>> Basically, I agree with Gus. HPC is complex, not all clusters are the
>> same, and imho some minimal config/tuning cannot be avoided to get
>> Open MPI working, or operating at full speed.
>> 
>> 
>> let me give a few examples :
>> 
>> You recommend that Open MPI use only the interfaces that match the
>> hostnames in the machinefile.
>> What if you submit from the head node? Should you use the interface
>> that matches its hostname?
>> What if this interface is the public interface, there is a firewall,
>> and/or the compute nodes have no default gateway?
>> That will simply not work ...
>> So mpirun needs to pass orted all of its interfaces.
>> Which one should orted pick?
>> - the first one? It might be the unreachable public interface ...
>> - the one on the same subnet? What if none is on the same subnet?
>> On the cluster I am working on, the eth0 interfaces are in different
>> subnets and ib0 is on a single subnet,
>> and I do *not* want to use ib0. But on some other clusters, the
>> ethernet network is so cheap
>> that they *want* to use ib0.
>> 
>> On your cluster, you want to use eth0 for oob and MPI, and eth1 for NFS.
>> That is legitimate.
>> In my case, I want to use eth0 (GigE) for oob and eth2 (10GigE) for MPI.
>> That is legitimate too.
>> 
>> We both want Open MPI to work *and* deliver the best performance out
>> of the box.
>> It is a good thing to have high expectations, but they might not all be met.
>> 
>> I'd rather implement some pre-defined policies that rule how ethernet
>> interfaces should be picked,
>> and add a FAQ entry that says: if it does not work (or does not work as
>> fast as expected) out of the box, you should
>> first try another policy.
>> 
>> Then the next legitimate question will be "what is the default policy"?
>> Regardless of the answer, it will be good for some and bad for others.
>> 
>> 
>> imho, posting a mail to the OMPI users mailing list was the right thing
>> to do:
>> - you got help on how to troubleshoot and fix the issue
>> - we got some valuable feedback on end users' expectations.
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> On 2014/11/14 3:36, Gus Correa wrote:
>>> On 11/13/2014 11:14 AM, Ralph Castain wrote:
>>>> Hmmm…I’m beginning to grok the issue. It is a tad unusual for people to
>>>> assign different hostnames to their interfaces - I’ve seen it in the
>>>> Hadoop world, but not in HPC. Still, no law against it.
>>> 
>>> No, not so unusual.
>>> I have clusters from respectable vendors that come with
>>> /etc/hosts for name resolution of the various interfaces.
>>> If I remember right, Rocks clusters also do that (or actually
>>> allow the sysadmin to set up additional networks, and at that point
>>> will append the additional names to /etc/hosts, or perhaps put those
>>> names in DHCP).
>>> I am not so familiar with xCAT, but I think it has similar DHCP
>>> functionality, or maybe DNS on the head node.
>>> 
>>> Having said that, I don't think this is an obstacle to setting up the
>>> right "if_include/if_exclude" choices (along with the btl, oob, etc.)
>>> for each particular cluster in the MCA parameter configuration file.
>>> That is what my parallel conversation with Reuti was about.
>>> 
>>> I believe the current approach w.r.t. interfaces:
>>> "use everything, let the sysadmin/user restrict as
>>> (s)he sees fit" is both a wise and flexible way to do it.
>>> Guessing the "right interface to use" sounds risky to me (wrong
>>> choices may happen), and a bit of a gamble.
>>> 
>>>> 
>>>> This will take a little thought to figure out a solution. One problem
>>>> that immediately occurs is if someone includes a hostfile that has lines
>>>> which refer to the same physical server, but using different interface
>>>> names. We’ll think those are completely distinct servers, and so the
>>>> process placement will be totally messed up.
>>>> 
>>> 
>>> Sure, and besides this, there will be machines with
>>> inconsistent/wrong/conflicting name resolution schemes
>>> that the current OMPI approach simply (and wisely) ignores.
>>> 
>>> 
>>>> We’ll also encounter issues with the daemon when it reports back, as the
>>>> hostname it gets will almost certainly differ from the hostname we were
>>>> expecting. Not as critical, but need to check to see where that will
>>>> impact the code base.
>>>> 
>>> 
>>> I'm sure that will happen.
>>> Torque uses the hostname by default for several things, and it can be a
>>> configuration nightmare to work around that when what hostname reports
>>> is not what you want.
>>> 
>>> IMHO, you may face a daunting guesswork task to get this right,
>>> to pick the
>>> interfaces that are best for a particular computer or cluster.
>>> It is so much easier to let the sysadmin/user, who presumably knows
>>> his/her machine, write an MCA parameter config file,
>>> as it is now in OMPI.
>>> 
>>>> We can look at the hostfile changes at that time - no real objection to
>>>> them, but would need to figure out how to pass that info to the
>>>> appropriate subsystems. I assume you want this to apply to both the oob
>>>> and tcp/btl?
>>>> 
>>>> Obviously, this won’t make it for 1.8 as it is going to be fairly
>>>> intrusive, but we can probably do something for 1.9
>>>> 
>>> 
>>> The status quo is good.
>>> Long live the OMPI status quo.
>>> (You don't know how reluctant I am to support the status quo, any
>>> status quo.  :) )
>>> My vote (... well, I don't have voting rights on that, but I'll vote
>>> anyway ...) is to keep the current approach.
>>> It is wise and flexible, and easy to adjust and configure to specific
>>> machines with their own oddities, via MCA parameters, as I tried to
>>> explain in previous postings.
>>> 
>>> My two cents,
>>> Gus Correa
>>> 
>>>> 
>>>>> On Nov 13, 2014, at 4:23 AM, Reuti <re...@staff.uni-marburg.de> wrote:
>>>>> 
>>>>> On 13.11.2014 at 00:34, Ralph Castain wrote:
>>>>> 
>>>>>>> On Nov 12, 2014, at 2:45 PM, Reuti <re...@staff.uni-marburg.de> wrote:
>>>>>>> 
>>>>>>> On 12.11.2014 at 17:27, Reuti wrote:
>>>>>>> 
>>>>>>>> On 11.11.2014 at 02:25, Ralph Castain wrote:
>>>>>>>> 
>>>>>>>>> Another thing you can do is (a) ensure you built with
>>>>>>>>> --enable-debug, and then (b) run it with -mca oob_base_verbose 100
>>>>>>>>> (without the tcp_if_include option) so we can watch the
>>>>>>>>> connection handshake and see what it is doing. The --hetero-nodes
>>>>>>>>> option will have no effect here and can be ignored.
>>>>>>>> 
>>>>>>>> Done. It really tries to connect to the outside interface of the
>>>>>>>> headnode. But firewall or not: the nodes have no clue
>>>>>>>> how to reach 137.248.0.0 - they have no gateway to this network
>>>>>>>> at all.
>>>>>>> 
>>>>>>> I have to revert this. They think that there is a gateway although
>>>>>>> there isn't one. When I remove the gateway entry from the routing
>>>>>>> table by hand, it starts up instantly too.
>>>>>>> 
>>>>>>> While I can do this on my own cluster, I still have the 30-second
>>>>>>> delay on a cluster where I'm not root, though this may be because of
>>>>>>> the firewall there. The gateway on this cluster is indeed going to
>>>>>>> the outside world.
>>>>>>> 
>>>>>>> Personally, I find this behavior of using all interfaces a little
>>>>>>> bit too aggressive. If you don't check this carefully beforehand and
>>>>>>> start a long-running application, you might not even notice the
>>>>>>> delay during the startup.
>>>>>> 
>>>>>> Agreed - do you have any suggestions on how we should choose the
>>>>>> order in which to try them? I haven’t been able to come up with
>>>>>> anything yet. Jeff has some fancy algo in his usnic BTL that we are
>>>>>> going to discuss after SC that I’m hoping will help, but I’d be open
>>>>>> to doing something better in the interim for 1.8.4
>>>>> 
>>>>> The plain `mpiexec` should just use the interface specified in the
>>>>> hostfile, be it hand-crafted or prepared by any queuing system.
>>>>> 
>>>>> 
>>>>> Option: could a single entry for a machine in the hostfile contain a
>>>>> list of interfaces? I mean something like:
>>>>> 
>>>>> node01,node01-extra-eth1,node01-extra-eth2 slots=4
>>>>> 
>>>>> or
>>>>> 
>>>>> node01* slots=4
>>>>> 
>>>>> Means: use exactly these interfaces or even try to find all available
>>>>> interfaces on/between the machines.
>>>>> 
>>>>> If all interfaces have the same name, then it's up to the admin
>>>>> to correct this.
>>>>> 
>>>>> -- Reuti
>>>>> 
>>>>> 
>>>>>>> -- Reuti
>>>>>>> 
>>>>>>> 
>>>>>>>> It tries this regardless of whether the internal or the external
>>>>>>>> name of the headnode is given in the machinefile - I hit ^C then.
>>>>>>>> I attached the output of Open MPI 1.8.1 for this setup too.
>>>>>>>> 
>>>>>>>> -- Reuti
>>>>>>>> 
>>>>>>>> <openmpi1.8.3.txt><openmpi1.8.1.txt>
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>> 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 