Re: [OMPI devel] New address selection for btl-tcp (was Re: [OMPI svn] svn:open-mpi r17307)

2008-02-22 Thread Brian W. Barrett

On Fri, 22 Feb 2008, Adrian Knoth wrote:



I see three approaches:

  a) remove lo globally (in if.c). I expect objections. ;)


I object!  :).  But for a good reason -- it'll break things.  Someone 
tried this before. The problem arises when a node (like a laptop) has 
only lo: no interfaces are reported, so either the oob / btl needs lots 
of extra code or things break.  So let's not go down this path again.



  b) print a warning from BTL/TCP if the interfaces in use contain lo.
     Like "Warning: You've included the loopback for communication.
     This may cause hanging processes due to unreachable peers."


I like this one.
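
A minimal sketch of the check behind option (b), written in Python for brevity rather than as BTL/TCP code. The function name and the name-to-addresses mapping are assumptions made for illustration; they are not part of Open MPI. It only shows how loopback addresses could be detected among the selected interfaces so that the warning can be printed.

```python
import ipaddress

# Warning text as proposed in the thread.
LOOPBACK_WARNING = (
    "Warning: You've included the loopback for communication.\n"
    "This may cause hanging processes due to unreachable peers."
)

def loopback_interfaces(selected):
    """Return the names of selected interfaces that carry a loopback address.

    `selected` is a hypothetical mapping of interface name to a list of
    address strings (IPv4 or IPv6).
    """
    return [
        name
        for name, addrs in selected.items()
        if any(ipaddress.ip_address(a).is_loopback for a in addrs)
    ]

if __name__ == "__main__":
    in_use = {"eth0": ["129.79.240.101"], "lo": ["127.0.0.1", "::1"]}
    if loopback_interfaces(in_use):
        print(LOOPBACK_WARNING)
```

Checking the address rather than the interface name avoids relying on the loopback always being called lo.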


  c) Throw away 127.0.0.1 on the remote side. But when doing so, what's
     the use of including it at all?


This seems hard.

Brian


Re: [OMPI devel] New address selection for btl-tcp (was Re: [OMPI svn] svn:open-mpi r17307)

2008-02-15 Thread Tim Prins

Adrian Knoth wrote:

On Fri, Feb 01, 2008 at 11:40:20AM -0500, Tim Prins wrote:


Adrian,


Hi!

Sorry for the late reply and thanks for your testing.


1. There are some warnings when compiling:


I've fixed these issues.

Thanks.


2. If I exclude all my tcp interfaces, the connection fails properly, 
but I do get a malloc request for 0 bytes:
[tprins@odin examples]$ mpirun -mca btl tcp,self -mca btl_tcp_if_exclude eth0,ib0,lo -np 2 ./ring_c

malloc debug: Request for 0 bytes (btl_tcp_component.c, 844)
malloc debug: Request for 0 bytes (btl_tcp_component.c, 844)



Not my fault, but I guess we could fix it anyway. Should we?

It probably should be fixed. But I've noticed that other BTLs (such as 
MX) do not properly handle the case where there are no available 
interfaces either...




3. If the exclude list does not contain 'lo', or the include list 
contains 'lo', the job hangs when using multiple nodes:


That's weird. Loopback interfaces should automatically be excluded right
from the beginning. See opal/util/if.c.

I haven't checked, so I don't know where things go wrong. Do you want to
investigate? As already mentioned, this should not happen.

I took a quick glance at this file, and I'd be lying if I said I 
understood what was going on in it. One thing I did notice is that the 
parameter btl_tcp_if_exclude defaults to 'lo', but the user can of 
course override it.


It might be worth looking into this further. If doing something wrong 
with 'lo' produced an error or aborted the job, I would not worry about 
it at all. But the fact that it causes a hang is worrisome to me.
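
To illustrate why overriding the default matters, here is a rough sketch of exclude-list filtering. It is not the logic in opal/util/if.c or the TCP BTL; the function name and the comma-separated parsing are assumptions. The point is only that a user-supplied exclude list replaces the default 'lo' rather than adding to it.

```python
def usable_interfaces(all_ifs, if_exclude="lo"):
    """Return the interfaces left after applying a comma-separated exclude list.

    Hypothetical helper: the default of "lo" mirrors the default of
    btl_tcp_if_exclude mentioned above; everything else is illustrative.
    """
    excluded = [name for name in if_exclude.split(",") if name]
    return [name for name in all_ifs if name not in excluded]

if __name__ == "__main__":
    up = ["eth0", "ib0", "lo"]
    print(usable_interfaces(up))                            # default exclude list
    print(usable_interfaces(up, if_exclude="ib0"))          # override drops 'lo' from the list
    print(usable_interfaces(up, if_exclude="eth0,ib0,lo"))  # nothing left (case 2 above)
```

With the override "ib0", lo is back among the usable interfaces, which matches the multi-node hang in item 3. With every interface excluded the list is empty, the situation in which the 0-byte malloc request showed up.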




Can you post the output of "ip a s" or "ifconfig -a"?

It is at the end of the email.



However, the great news about this patch is that it appears to fix 
https://svn.open-mpi.org/trac/ompi/ticket/1027 for me.


It also fixes my #1206. I'd like to merge tmp-public/btl-tcp into the
trunk, especially before the 1.3 code freeze. Any objections?

Not from me, especially now that it is already in the trunk :).

Tim


--
ifconfig -a:
eth0  Link encap:Ethernet  HWaddr 00:E0:81:2D:0B:08
  inet addr:129.79.240.101  Bcast:129.79.240.255  Mask:255.255.255.0
  inet6 addr: 2001:18e8:2:240:2e0:81ff:fe2d:b08/64 Scope:Global
  inet6 addr: fe80::2e0:81ff:fe2d:b08/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:555918407 errors:0 dropped:2122 overruns:0 frame:0
  TX packets:569928551 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:448936694980 (418.1 GiB)  TX bytes:486030858441 (452.6 GiB)
  Interrupt:193

eth1  Link encap:Ethernet  HWaddr 00:E0:81:2D:0B:09
  BROADCAST MULTICAST  MTU:1500  Metric:1
  RX packets:0 errors:0 dropped:0 overruns:0 frame:0
  TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
  Interrupt:201

ib0   Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
  inet addr:192.168.0.101  Bcast:192.168.0.255  Mask:255.255.255.0
  inet6 addr: fe80::202:c902:0:5d71/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
  RX packets:6304819 errors:0 dropped:0 overruns:0 frame:0
  TX packets:6355094 errors:0 dropped:2 overruns:0 carrier:0
  collisions:0 txqueuelen:128
  RX bytes:26794850321 (24.9 GiB)  TX bytes:35448899645 (33.0 GiB)

ib1   Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
  BROADCAST MULTICAST  MTU:2044  Metric:1
  RX packets:0 errors:0 dropped:0 overruns:0 frame:0
  TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:128
  RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

lo    Link encap:Local Loopback
  inet addr:127.0.0.1  Mask:255.0.0.0
  inet6 addr: ::1/128 Scope:Host
  UP LOOPBACK RUNNING  MTU:16436  Metric:1
  RX packets:182055033 errors:0 dropped:0 overruns:0 frame:0
  TX packets:182055033 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0
  RX bytes:997605665018 (929.0 GiB)  TX bytes:997605665018 (929.0 GiB)

sit0  Link encap:IPv6-in-IPv4
  NOARP  MTU:1480  Metric:1
  RX packets:0 errors:0 dropped:0 overruns:0 frame:0
  TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0
  RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

ip a s:
1: lo:  mtu 16436 qdisc noqueue
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
inet6 ::1/128 scope host
   valid_lft forever preferred_lft forever
2: eth0:  mtu 1500 qdisc pfifo_fast qlen 1000
link/ether 00:e0:81:2d:0b:08 brd ff:ff:ff:ff:ff:ff
inet