Sorry -- I got distracted all afternoon...

In addition to what Ralph said (i.e., I'm not sure if the CIDR notation stuff 
made it over to the v1.5 branch or not, but it is available from the nightly 
SVN trunk tarballs: http://www.open-mpi.org/nightly/trunk/), here are a few 
points from other mails in this thread...
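(For reference, once you have a build with the CIDR support, the idea is that 
you could write something like the following -- the /24 here is just inferred 
from the 0xffffff00 netmasks in your ifconfig output:

  mpiexec --mca btl_tcp_if_include 192.168.0.0/24 -machinefile mf1 ./z

Each node would then use whichever of its interfaces sits on that subnet, 
regardless of what the interface is named -- which neatly sidesteps the 
bge0-vs-bge1 problem below.)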

1. Gus is correct that OMPI is complaining that bge1 doesn't exist on all 
nodes.  The MCA parameters that you pass on the command line get shipped to 
*all* MPI processes, and therefore generally need to work on all of them.  If 
you have per-host MCA parameter values, you can set them a few different ways:

- have a per-host MCA param file, usually in $prefix/etc/openmpi-mca-params.conf
- have your shell startup files intelligently determine which host you're on 
and set the corresponding MCA environment variable as appropriate (e.g., on the 
head node, set the env variable OMPI_MCA_btl_tcp_if_include to bge1, and set it 
to bge0 on the others -- see the sketches of both approaches just below)
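
For example, on the head node, $prefix/etc/openmpi-mca-params.conf could 
contain:

  # hpc/node10: the 192.168.0.x subnet is on bge1
  btl_tcp_if_include = bge1

while on node11-node21 it would contain:

  # compute nodes: the 192.168.0.x subnet is on bge0
  btl_tcp_if_include = bge0

Or, a rough sketch for a Bourne-style shell startup file (this assumes the 
head node's short hostname is "hpc" -- adjust to taste):

  case "`hostname -s`" in
    hpc) OMPI_MCA_btl_tcp_if_include=bge1 ;;
    *)   OMPI_MCA_btl_tcp_if_include=bge0 ;;
  esac
  export OMPI_MCA_btl_tcp_if_include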

Those are a little clunky, but having a heterogeneous setup like this is not 
common, so we haven't really optimized the ability to set different MCA params 
on different servers.

2. I am curious to figure out why the automatic reachability computation isn't 
working for you.  Unfortunately, the code to compute the reachability is pretty 
gnarly.  :-\  The code to find the IP interfaces on your machines is in 
opal/util/if.c.  That *should* be working -- there's *BSD-specific code in 
there that has been verified by others in the past... but who knows?  Perhaps 
it has bit-rotted...?  The code to take these IP interfaces and figure out if a 
given peer is reachable is in 
ompi/mca/btl/tcp/btl_tcp_proc.c:mca_btl_tcp_proc_insert().  This requires a 
little explanation...

- There is one TCP BTL "component".  Think of this as the plugin that is 
dlopen'd into the process itself.  It contains some high-level information 
about the plugin itself (e.g., the version number, etc.).

- There is one TCP BTL "module" per IP interface that is used for MPI 
communications.  So your head node will have 2 TCP BTL modules and the others 
will only have one TCP BTL module.  A module is a struct with a bunch of 
function pointers and some metadata (e.g., which IP interface it "owns", etc.).
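
If a concrete picture helps, here's a toy C sketch of the *idea* -- this is 
emphatically not the real mca_btl_tcp_module_t, just the general shape of 
"function pointers plus per-interface metadata":

  #include <sys/types.h>
  #include <sys/socket.h>
  #include <net/if.h>        /* IFNAMSIZ */
  #include <netinet/in.h>    /* struct in_addr */
  #include <stddef.h>        /* size_t */

  /* Toy illustration only -- not the actual Open MPI struct. */
  struct toy_tcp_btl_module {
      /* function pointers implementing the BTL interface */
      int (*add_procs)(struct toy_tcp_btl_module *btl,
                       void **procs, size_t nprocs);
      int (*send)(struct toy_tcp_btl_module *btl, void *frag);
      /* metadata: the one IP interface this module "owns" */
      char           if_name[IFNAMSIZ];   /* e.g., "bge0" or "bge1" */
      struct in_addr if_addr;             /* its IPv4 address */
  };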

- During the BTL module's initialization, btl_tcp.c:mca_btl_tcp_add_procs() is 
called to notify the module of all of its peers (an ompi_proc_t instance is 
used to describe a peer process -- note: a *process*, not any particular 
communications method or IP address of that process).  mca_btl_tcp_add_procs() 
takes the array of ompi_proc_t instances (that correspond to all the MPI 
processes in MPI_COMM_WORLD) and tries to figure out if this particular TCP BTL 
module can "reach" that peer, per the algorithm described in the FAQ that I 
cited earlier.

- mca_btl_tcp_add_procs() calls mca_btl_tcp_proc_insert() to do the 
reachability computation.  If _insert() succeeds, then _add_procs() assumes 
that this module can reach that process and proceeds accordingly.  If _insert() 
fails, then _add_procs() assumes that this module cannot reach that peer and 
proceeds accordingly.
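
In other words, the contract looks roughly like this (a simplified, 
hypothetical sketch -- not the actual Open MPI source):

  #include <stdbool.h>
  #include <stddef.h>

  /* Hypothetical stand-in for mca_btl_tcp_proc_insert(): returns
     true iff this module can reach the given peer (the weighing
     that really decides this is sketched a bit further below). */
  static bool toy_proc_insert(int peer_rank)
  {
      (void) peer_rank;
      return true;   /* pretend */
  }

  /* Sketch of what mca_btl_tcp_add_procs() does with its peer
     array: one reachability verdict per peer process. */
  static void toy_add_procs(size_t nprocs, bool reachable[])
  {
      for (size_t i = 0; i < nprocs; ++i)
          /* success => this module will handle traffic to peer i;
             failure => this module simply ignores peer i */
          reachable[i] = toy_proc_insert((int) i);
  }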

- mca_btl_tcp_proc_insert() has previously learned about all the IP addresses 
of all the peer MPI processes via a different mechanism called the modex (which 
I won't go into here).  It checks the one peer process in question, looks up 
that peer's IP addresses (aka "endpoints", from that peer's TCP BTL modules), 
and tries to find the best-quality match that it can: it builds a 2D graph of 
weights describing how "good" the connection to each of the peer process's 
endpoints would be, then picks the best one and uses it.
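
As a toy model of that matching step (the quality levels, names, and /24 
logic here are all made up for illustration -- the real grading in 
mca_btl_tcp_proc_insert() is more involved):

  #include <stdio.h>

  /* Hypothetical quality levels, higher is better.  The real code
     has its own graded scheme; these are illustrative only. */
  enum quality { CQ_UNREACHABLE, CQ_ROUTABLE, CQ_SAME_SUBNET };

  /* Toy weight: same /24 beats "maybe routable" beats unreachable. */
  static enum quality weigh(unsigned local_ip, unsigned peer_ip)
  {
      if ((local_ip >> 8) == (peer_ip >> 8))     /* same /24 */
          return CQ_SAME_SUBNET;
      if ((local_ip >> 24) == (peer_ip >> 24))   /* same /8 */
          return CQ_ROUTABLE;
      return CQ_UNREACHABLE;
  }

  int main(void)
  {
      /* head node's interfaces: bge0=10.208.78.111, bge1=192.168.0.10 */
      unsigned local[2] = { 0x0AD04E6Fu, 0xC0A8000Au };
      /* one endpoint of the peer process on node11: 192.168.0.11 */
      unsigned peer[1]  = { 0xC0A8000Bu };

      /* build the 2D "graph" of weights; remember the best pairing */
      enum quality best = CQ_UNREACHABLE;
      int bi = -1, bj = -1;
      for (int i = 0; i < 2; ++i)
          for (int j = 0; j < 1; ++j) {
              enum quality q = weigh(local[i], peer[j]);
              if (q > best) { best = q; bi = i; bj = j; }
          }

      if (best == CQ_UNREACHABLE)
          printf("this module cannot reach this peer\n");
      else
          printf("best pair: local if %d <-> peer endpoint %d\n", bi, bj);
      return 0;
  }

Run against your addresses, the toy picks bge1 <-> 192.168.0.11 and finds no 
usable pairing for the 10.x interface -- which is what the real code *should* 
conclude, too.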

- We unfortunately do not have good debugging output in _proc_insert(), so you 
might need to step through this with a debugger.  :-(  I have a long-languished 
branch that adds lots of debugging output in this reachability computation 
area, but I have never finished it (it has some kind of bug in it that prevents 
it from working, which is why I haven't merged it into the mainline).  
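
One (admittedly clunky) way to do that, assuming gdb is installed and the 
nodes can open X11 windows back to your display:

  # launch each MPI process under gdb in its own xterm
  mpiexec -np 2 -machinefile mf1 xterm -e gdb ./z

  # in each gdb session -- the TCP BTL is dlopen'd, so answer "y"
  # when gdb asks to make the breakpoint pending:
  (gdb) break mca_btl_tcp_proc_insert
  (gdb) run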

This was a long explanation -- I hope it helps...  Is there any chance you 
could dig into this to see what's going on?  The short version is that all this 
code *should* automatically figure out that the 10.x interface effectively ends 
up getting ignored, because it can't be used to communicate with any of the TCP 
BTL module peers in the other processes on the other servers.

We unfortunately don't have access to any BSD machines to test this on, 
ourselves.  It works on other OSes, so I'm curious as to why it doesn't seem to 
work for you.  :-(


On Jul 8, 2011, at 5:37 PM, Ralph Castain wrote:

> We've been moving to provide support for including values as CIDR notation 
> instead of names - e.g., 192.168.0/16 instead of bge0 or bge1 - but I don't 
> think that has been put into the 1.4 release series. If you need it now, you 
> might try using the developer's trunk - I know it works there.
> 
> 
> On Jul 8, 2011, at 2:49 PM, Steve Kargl wrote:
> 
>> On Fri, Jul 08, 2011 at 04:26:35PM -0400, Gus Correa wrote:
>>> Steve Kargl wrote:
>>>> On Fri, Jul 08, 2011 at 02:19:27PM -0400, Jeff Squyres wrote:
>>>>> The easiest way to fix this is likely to use the btl_tcp_if_include
>>>>> or btl_tcp_if_exclude MCA parameters -- i.e., tell OMPI exactly
>>>>> which interfaces to use:
>>>>> 
>>>>>  http://www.open-mpi.org/faq/?category=tcp#tcp-selection
>>>>> 
>>>> 
>>>> Perhaps, I'm again misreading the output, but it appears that
>>>> 1.4.4rc2 does not even see the 2nd nic.
>>>> 
>>>> hpc:kargl[317] ifconfig bge0 
>>>> bge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>>>>  options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
>>>>  ether 00:e0:81:40:48:92
>>>>  inet 10.208.78.111 netmask 0xffffff00 broadcast 10.208.78.255
>>>> hpc:kargl[318] ifconfig bge1
>>>> bge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>>>>  options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
>>>>  ether 00:e0:81:40:48:93
>>>>  inet 192.168.0.10 netmask 0xffffff00 broadcast 192.168.0.255
>>>> 
>>>> kargl[319] /usr/local/openmpi-1.4.4/bin/mpiexec --mca btl_base_verbose 30 \
>>>> --mca btl_tcp_if_include bge1 -machinefile mf1 ./z
>>>> 
>>>> hpc:kargl[320] /usr/local/openmpi-1.4.4/bin/mpiexec --mca btl_base_verbose 
>>>> 10 --mca btl_tcp_if_include bge1 -machinefile mf1 ./z
>>>> [hpc.apl.washington.edu:12295] mca: base: components_open: Looking for btl 
>>>> [node11.cimu.org:21878] select: init of component self returned success
>>>> [node11.cimu.org:21878] select: initializing btl component sm
>>>> [node11.cimu.org:21878] select: init of component sm returned success
>>>> [node11.cimu.org:21878] select: initializing btl component tcp
>>>> [node11.cimu.org][[13916,1],1][btl_tcp_component.c:468:mca_btl_tcp_component_create_instances]
>>>>  invalid interface "bge1"
>>>> [node11.cimu.org:21878] select: init of component tcp returned success
>>>> --------------------------------------------------------------------------
>>>> At least one pair of MPI processes are unable to reach each other for
>>>> MPI communications.  This means that no Open MPI device has indicated
>>>> that it can be used to communicate between these processes.  This is
>>>> an error; Open MPI requires that all MPI processes be able to reach
>>>> each other.  This error can sometimes be the result of forgetting to
>>>> specify the "self" BTL.
>>>> 
>>> Hi Steve
>>> 
>>> It is complaining that bge1 is not valid on node11, not on node10/hpc,
>>> where you ran ifconfig.
>>> 
>>> Would the names of the interfaces and the matching subnet/IP
>>> vary from node to node?
>>> (E.g. bge0 be associated to 192.168.0.11 on node11, instead of bge1.)
>>> 
>>> Would it be possible that only on node10 bge1 is on the 192.168.0.0 
>>> subnet, but on the other nodes it is bge0 that connects
>>> to the 192.168.0.0 subnet perhaps?
>> 
>> node10 has bge0 = 10.208.x.y and bge1 = 192.168.0.10.
>> node11 through node21 use bge0 = 192.168.0.N where N = 11, ..., 21.
>> 
>>> If you're including only bge1 on your mca btl switch,
>>> supposedly all nodes are able to reach
>>> each other via an interface called bge1.
>>> Is this really the case?
>>> You may want to run ifconfig on all nodes to check.
>>> 
>>> Alternatively, you could exclude node10 from your host file
>>> and try to run the job on the remaining nodes
>>> (and maybe not restrict the interface names with any btl switch).
>> 
>> Completely excluding node10 would appear to work.  Of course,
>> this then loses the 4 cpus and 16 GB of memory that are
>> in that node.
>> 
>> The question to me is why does 1.4.2 work without a
>> problem, and 1.4.3 and 1.4.4 have problems with a
>> node with 2 NICs.
>> 
>> I suppose a follow-on question is: Is there some
>> way to get 1.4.4 to exclusively use bge1 on node10
>> while using bge0 on the other nodes?
>> 
>> -- 
>> Steve
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

