Hi,

I am seeing problems with a small Linux cluster when running Open MPI
jobs. The error message I get is:

[frontend][0,1,0][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=110

Following the FAQ, I looked up what this error code corresponds to:

$ perl -e 'die$!=110'
Connection timed out at -e line 1.
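
(For what it's worth, the same answer comes back if you ask strerror()
directly via Perl's standard POSIX module:

$ perl -MPOSIX -e 'print strerror(110), "\n"'
Connection timed out
)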

This error message appears the first time one of the compute nodes,
which are on a private network, attempts to send data to the frontend
(from where the job was started with mpirun). More precisely, it seems
that the error occurs the first time a process on the frontend tries
to send data to another process on the frontend.
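
For reference, a trivial point-to-point test should be enough to
trigger it. A minimal sketch (the file and executable names are just
placeholders):

/* sendrecv.c: rank 1 sends one int to rank 0 over MPI_COMM_WORLD.
 * Run with two processes on the frontend to force the first TCP
 * connection between them. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 1) {
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 0 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}

$ mpicc sendrecv.c -o sendrecv
$ mpirun -np 2 ./sendrecv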

I tried playing with options like --mca btl_tcp_if_exclude lo,eth0,
but that didn't help. Nothing in the FAQ section on TCP and routing
seemed to make any difference either.
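
Concretely, the invocations I mean look like this (./myprog and the
process count are placeholders; note that, as far as I understand, the
same interface list is applied on every node, which is awkward here
because the private interface is eth1 on the frontend but eth0 on the
compute nodes):

$ mpirun --mca btl_tcp_if_exclude lo,eth0 -np 4 ./myprog
$ mpirun --mca btl_tcp_if_include eth1 -np 4 ./myprog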


Any advice would be very welcome.


The network configurations are:

a) Frontend (two network adapters; eth1 is private to the cluster):

$ /sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 00:E0:81:30:A1:CE
         inet addr:128.40.5.39  Bcast:128.40.5.255  Mask:255.255.255.0
         inet6 addr: fe80::2e0:81ff:fe30:a1ce/64 Scope:Link
         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
         RX packets:3496038 errors:0 dropped:0 overruns:0 frame:0
         TX packets:2833685 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:500939570 (477.7 MiB)  TX bytes:671589665 (640.4 MiB)
         Interrupt:193

eth1      Link encap:Ethernet  HWaddr 00:E0:81:30:A1:CF
         inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
         inet6 addr: fe80::2e0:81ff:fe30:a1cf/64 Scope:Link
         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
         RX packets:2201778 errors:0 dropped:0 overruns:0 frame:0
         TX packets:2046572 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:188615778 (179.8 MiB)  TX bytes:247305804 (235.8 MiB)
         Interrupt:201

lo        Link encap:Local Loopback
         inet addr:127.0.0.1  Mask:255.0.0.0
         inet6 addr: ::1/128 Scope:Host
         UP LOOPBACK RUNNING  MTU:16436  Metric:1
         RX packets:1528 errors:0 dropped:0 overruns:0 frame:0
         TX packets:1528 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:0
         RX bytes:363101 (354.5 KiB)  TX bytes:363101 (354.5 KiB)



$ /sbin/route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     *               255.255.255.0   U     0      0        0 eth1
128.40.5.0      *               255.255.255.0   U     0      0        0 eth0
default         128.40.5.245    0.0.0.0         UG    0      0        0 eth0



b) Compute nodes:

$ /sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 00:E0:81:30:A0:72
         inet addr:192.168.1.2  Bcast:192.168.1.255  Mask:255.255.255.0
         inet6 addr: fe80::2e0:81ff:fe30:a072/64 Scope:Link
         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
         RX packets:189207 errors:0 dropped:0 overruns:0 frame:0
         TX packets:203507 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:23075241 (22.0 MiB)  TX bytes:17693363 (16.8 MiB)
         Interrupt:193

lo        Link encap:Local Loopback
         inet addr:127.0.0.1  Mask:255.0.0.0
         inet6 addr: ::1/128 Scope:Host
         UP LOOPBACK RUNNING  MTU:16436  Metric:1
         RX packets:185 errors:0 dropped:0 overruns:0 frame:0
         TX packets:185 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:0
         RX bytes:12644 (12.3 KiB)  TX bytes:12644 (12.3 KiB)


$ /sbin/route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     *               255.255.255.0   U     0      0        0 eth0
default         frontend.cluste 0.0.0.0         UG    0      0        0 eth0
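
In case it helps, raw TCP connectivity between the private interfaces
can be sanity-checked by hand with netcat (the port number is
arbitrary, and the -l syntax varies between netcat variants):

node$     nc -l -p 5000
frontend$ echo hello | nc 192.168.1.2 5000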

TIA
Jonathan
