On Mar 17, 2014, at 9:37 AM, Gus Correa <g...@ldeo.columbia.edu> wrote:
> On 03/17/2014 10:52 AM, Jeff Squyres (jsquyres) wrote:
>> To add on to what Ralph said:
>>
>> 1. There are two different message passing paths in OMPI:
>>    - "OOB" (out of band): used for control messages
>>    - "BTL" (byte transfer layer): used for MPI traffic
>>    (there are actually others, but these seem to be the relevant 2 for your setup)
>>
>> 2. If you don't specify which OOB interfaces to use, OMPI will (basically)
>> just pick one. It doesn't really matter which one it uses; the OOB channel
>> doesn't use much bandwidth, and is mostly used just during startup and shutdown.
>>
>> The one exception to this is stdout/stderr routing. If your MPI app writes
>> to stdout/stderr, this also uses the OOB path. So if you output a LOT to
>> stdout, then the OOB interface choice might matter.
>
> Hi All
>
> Not trying to hijack Jianyu's very interesting and informative questions
> and thread, I have two questions and one note about it.
> I promise to shut up after this.
>
> Is the interface that OOB picks and uses somehow related to how the
> host/node names listed in a "hostfile" (or in the mpiexec -host option,
> or in the Torque/SGE/Slurm node file) are resolved into IP addresses
> (via /etc/hosts, DNS, or some other mechanism)?
>
> In other words, does OOB pick the interface associated with the IP address
> that resolves the specific node name, or does OOB have a will of its own
> and pick whatever interface it wants?

The OOB on each node gets the list of available interfaces from the kernel
on that node. When it needs to talk to someone on a remote node, it uses the
standard mechanisms to resolve that node name to an IP address *if* it isn't
one already - i.e., it checks the provided info to see if it is an IP
address, and attempts to resolve the name if not. Once it has an IP address
for the remote host, it checks its interfaces to see if one is on the same
subnet as the remote IP. If so, it uses that interface to create the
connection.
If none of the interfaces share the same subnet as the remote IP, then the
OOB picks the first kernel-ordered interface and attempts to connect via
that one, in the hope that there is a router in the system capable of
passing the connection to the remote subnet. The OOB will cycle across all
its interfaces in that manner until one indicates that it was indeed able
to connect - if not, then we error out.

> At some early point during startup I suppose mpiexec needs to touch base
> for the first time with each node, and I would guess the node's IP address
> (and the corresponding interface) plays a role then.
> Does OOB piggy-back on that same interface to do its job?

Yes - once we establish that connection, we use it for whatever OOB
communication is required.

>> 3. If you don't specify which MPI interfaces to use, OMPI will basically
>> find the "best" set of interfaces and use those. IP interfaces are always
>> rated lower than OS-bypass interfaces (e.g., verbs/IB).
>
> In a node outfitted with more than one InfiniBand interface, can one
> choose which one OMPI is going to use (say, if one wants to reserve the
> other IB interface for IO)?
>
> In other words, is there a verbs/RDMA syntax equivalent to
>
>    --mca btl_tcp_if_include
>
> and to
>
>    --mca oob_tcp_if_include ?
>
> [Perhaps something like --mca btl_openib_if_include ...?]

Yes - exactly as you describe.

> Forgive me if this question doesn't make sense, for maybe in its guts
> verbs/RDMA already has a greedy policy of using everything available,
> but I don't know anything about it.
>
>> Or, as you noticed, you can give a comma-delimited list of BTLs to use.
>> OMPI will then use -- at most -- exactly those BTLs, but definitely no
>> others. Each BTL typically has an additional parameter or parameters
>> that can be used to specify which interfaces to use for the network
>> interface type that that BTL uses. For example, btl_tcp_if_include
>> tells the TCP BTL which interface(s) to use.
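Ralph's selection logic above (resolve the peer's name if it isn't already an IP address, prefer a local interface whose subnet contains the peer's address, otherwise fall back to the first kernel-ordered interface and hope a router can forward the connection) can be sketched roughly as follows. This is only an illustration of the described policy, not OMPI's actual code, and the function name and interface list format are made up:

```python
import ipaddress
import socket

def pick_oob_interface(peer, local_ifaces):
    """Pick a local interface for an OOB connection to `peer`.

    `peer` is a hostname or IP address string; `local_ifaces` is a
    kernel-ordered list of (name, "address/prefix") tuples. Hypothetical
    helper illustrating the policy described in the thread, not OMPI code.
    """
    # Resolve the peer to an IP address if it isn't one already.
    try:
        peer_ip = ipaddress.ip_address(peer)
    except ValueError:
        peer_ip = ipaddress.ip_address(socket.gethostbyname(peer))

    # Prefer an interface whose subnet contains the peer's address.
    for name, cidr in local_ifaces:
        if peer_ip in ipaddress.ip_interface(cidr).network:
            return name

    # No shared subnet: fall back to the first kernel-ordered interface
    # and hope a router passes the connection along (the real OOB would
    # then cycle through the remaining interfaces until one connects).
    return local_ifaces[0][0]
```

For example, with `[("eth0", "192.168.1.10/18"), ("ib0", "172.20.1.2/16")]`, a peer at 172.20.3.4 selects ib0 (same /16), while a peer on an unknown subnet falls back to eth0.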
>> Also, note that you seem to have missed a BTL: sm (shared memory).
>> sm is the preferred BTL to use for same-server communication.
>
> This may be because several FAQs skip the sm BTL, even when it would be an
> appropriate/recommended choice to include in the BTL list. For instance:
>
> http://www.open-mpi.org/faq/?category=all#selecting-components
> http://www.open-mpi.org/faq/?category=all#tcp-selection
>
> The command line examples with an ellipsis "..." don't actually exclude
> the use of "sm", but IMHO are too vague and somewhat misleading.
>
> I think this issue was reported/discussed before on the list, but somehow
> the FAQ was not fixed.

I can try to do something about it - largely a question of time :-/

> Thank you,
> Gus Correa
>
>> It is much faster than both the TCP loopback device (which OMPI excludes
>> by default, BTW, which is probably why you got reachability errors when
>> you specified "--mca btl tcp,self") and the verbs (i.e., "openib") BTL
>> for same-server communication.
>>
>> 4. If you don't specify anything, OMPI usually picks the best thing for
>> you. In your case, it'll probably be equivalent to:
>>
>>    mpirun --mca btl openib,sm,self ...
>>
>> And the control messages will flow across one of your IP interfaces.
>>
>> 5. If you want to be specific about which one it uses, you can specify
>> oob_tcp_if_include. For example:
>>
>>    mpirun --mca oob_tcp_if_include eth0 ...
>>
>> Make sense?
>>
>>
>> On Mar 15, 2014, at 1:18 AM, Jianyu Liu <jerry_...@msn.com> wrote:
>>
>>>> On Mar 14, 2014, at 10:16:34 AM, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>>>>
>>>>> On Mar 14, 2014, at 10:11 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>>>
>>>>>> 1. If specified '--mca btl tcp,self', which interface will the
>>>>>> application use: the GigE adapter OR the OpenFabrics interface in
>>>>>> IP over IB mode (just like a high performance GigE adapter)?
>>>>>
>>>>> Both - IP over IB looks just like an Ethernet adapter
>>>>
>>>> To be clear: the TCP BTL will use all TCP interfaces (regardless of
>>>> underlying physical transport). Your GigE adapter and your IB adapter
>>>> both present IP interfaces to the OS, and both support TCP. So the TCP
>>>> BTL will use them, because it just sees the TCP/IP interfaces.
>>>
>>> Thanks for your kind input.
>>>
>>> Please see if I have understood correctly.
>>>
>>> Assume there are two networks:
>>>
>>> Gigabit Ethernet
>>>
>>>    eth0-renamed : 192.168.[1-22].[1-14] / 255.255.192.0
>>>
>>> InfiniBand network
>>>
>>>    ib0 : 172.20.[1-22].[1-4] / 255.255.0.0
>>>
>>> 1. If specified '--mca btl tcp,self'
>>>
>>>    The control information (such as setup and teardown) is routed to
>>>    and passed by Gigabit Ethernet in TCP/IP mode.
>>>    The MPI messages are routed to and passed by the InfiniBand network
>>>    in IP over IB mode.
>>>    On the same machine, the TCP loopback device will be used for
>>>    passing control and MPI messages.
>>>
>>> 2. If specified '--mca btl tcp,self --mca btl_tcp_if_include ib0'
>>>
>>>    Both the control information (such as setup and teardown) and the
>>>    MPI messages are routed to and passed by the InfiniBand network in
>>>    IP over IB mode.
>>>    On the same machine, the TCP loopback device will be used for
>>>    passing control and MPI messages.
>>>
>>> 3. If specified '--mca btl openib,self'
>>>
>>>    The control information (such as setup and teardown) is routed to
>>>    and passed by the InfiniBand network in IP over IB mode.
>>>    The MPI messages are routed to and passed by the InfiniBand network
>>>    in RDMA mode.
>>>    On the same machine, the TCP loopback device will be used for
>>>    passing control and MPI messages.
>>>
>>> 4. If no 'mca btl' parameters are specified
>>>
>>>    The control information (such as setup and teardown) is routed to
>>>    and passed by Gigabit Ethernet in TCP/IP mode.
>>>    The MPI messages are routed to and passed by the InfiniBand network
>>>    in RDMA mode.
>>>    On the same machine, the shared memory (sm) BTL will be used for
>>>    passing control and MPI messages.
>>>
>>> Appreciating your kind input
>>>
>>> Jianyu
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
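Pulling the flags discussed in this thread together, the main command-line
combinations look roughly like the sketch below. The application name
(./my_app), interface names (eth0, ib0), process count, and the HCA name
(mlx4_0) are placeholders for your own setup, not values from the thread:

```shell
# Default: OMPI picks the "best" BTLs itself; on an IB cluster this is
# typically equivalent to "--mca btl openib,sm,self".
mpirun -np 16 ./my_app

# Force TCP (here restricted to IPoIB) for MPI traffic, and pin the OOB
# control channel to eth0.
mpirun -np 16 \
    --mca btl tcp,sm,self \
    --mca btl_tcp_if_include ib0 \
    --mca oob_tcp_if_include eth0 \
    ./my_app

# Native verbs/RDMA, restricted to one HCA via btl_openib_if_include
# (the openib analogue of btl_tcp_if_include confirmed above).
mpirun -np 16 \
    --mca btl openib,sm,self \
    --mca btl_openib_if_include mlx4_0 \
    ./my_app
```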