Hi,
I am seeing problems on a small Linux cluster when running Open MPI
jobs. The error message I get is:
[frontend][0,1,0][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=110
Following the FAQ, I looked to see what this error code corresponds to:
$ perl -e 'die$!=110'
Connection timed out at -e line 1.
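The same lookup can be sketched in Python, which also reports the symbolic name of the error code. This is only an illustration: the helper name describe_errno is made up for this example, and errno values are platform-specific, so 110 maps to ETIMEDOUT on Linux but may mean something else on other systems.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform.

    describe_errno is a hypothetical helper, not part of Open MPI.
    """
    # errno.errorcode maps numeric values to names such as "ETIMEDOUT".
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror gives the same human-readable text as perl's $!.
    return name, os.strerror(code)


if __name__ == "__main__":
    # 110 is the value reported in the btl_tcp_endpoint.c error message.
    print(describe_errno(110))
```

Run on the machine that printed the error, so that the number is interpreted with that system's errno table.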
At first, the error appeared to occur the first time one of the compute
nodes, which are on a private network, attempts to send data to the
frontend (from where the job was started with mpirun).
On closer inspection, however, it seems that the error occurs the first
time a process on the frontend tries to send data to another process on
the frontend.
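One way to check whether the timeout can be reproduced outside of MPI is a plain TCP connect with a timeout. The following is a minimal sketch, not what Open MPI does internally: try_connect is a hypothetical helper, and the port used in the usage comment is a placeholder that would have to be replaced by a port known to be listening on the target.

```python
import socket


def try_connect(host, port, timeout=5.0):
    """Attempt a TCP connection.

    Returns None on success, or the OSError raised on failure
    (a timeout, a refused connection, a name lookup error, ...).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:
        return exc


# Example usage (address from this post, port 22 is an assumption):
#   result = try_connect("192.168.1.1", 22)
#   print("connected" if result is None else "failed: %r" % result)
```

Running the probe from the frontend against each of its own addresses, and from a compute node against the frontend, would show which address and direction actually times out.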
I experimented with options such as --mca btl_tcp_if_exclude lo,eth0,
but that did not help. Nothing in the FAQ section on TCP and routing
seemed to help either.
Any advice would be very welcome.
The network configurations are:
a) Frontend (two network adapters; eth1 is the private cluster interface):
$ /sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 00:E0:81:30:A1:CE
          inet addr:128.40.5.39  Bcast:128.40.5.255  Mask:255.255.255.0
          inet6 addr: fe80::2e0:81ff:fe30:a1ce/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3496038 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2833685 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:500939570 (477.7 MiB)  TX bytes:671589665 (640.4 MiB)
          Interrupt:193

eth1      Link encap:Ethernet  HWaddr 00:E0:81:30:A1:CF
          inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::2e0:81ff:fe30:a1cf/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2201778 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2046572 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:188615778 (179.8 MiB)  TX bytes:247305804 (235.8 MiB)
          Interrupt:201

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:1528 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1528 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:363101 (354.5 KiB)  TX bytes:363101 (354.5 KiB)

$ /sbin/route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     *               255.255.255.0   U     0      0        0 eth1
128.40.5.0      *               255.255.255.0   U     0      0        0 eth0
default         128.40.5.245    0.0.0.0         UG    0      0        0 eth0

b) Compute nodes:
$ /sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 00:E0:81:30:A0:72
          inet addr:192.168.1.2  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::2e0:81ff:fe30:a072/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:189207 errors:0 dropped:0 overruns:0 frame:0
          TX packets:203507 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:23075241 (22.0 MiB)  TX bytes:17693363 (16.8 MiB)
          Interrupt:193

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:185 errors:0 dropped:0 overruns:0 frame:0
          TX packets:185 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:12644 (12.3 KiB)  TX bytes:12644 (12.3 KiB)

$ /sbin/route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     *               255.255.255.0   U     0      0        0 eth0
default         frontend.cluste 0.0.0.0         UG    0      0        0 eth0
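
The interface tables in this post can be compared with a short script to see which networks the frontend and a compute node share, and what is left after an exclude list such as lo,eth0 is applied. This is a sketch under one loud assumption: that the exclude list is matched by interface name on every node separately. That assumption about Open MPI's behaviour is not confirmed here. The helper names remaining and shared_networks are made up for the example; the addresses are copied from the ifconfig output above.

```python
import ipaddress

# Addresses and netmasks copied from the ifconfig output in this post.
HOSTS = {
    "frontend": {
        "lo": "127.0.0.1/8",
        "eth0": "128.40.5.39/24",
        "eth1": "192.168.1.1/24",
    },
    "compute": {
        "lo": "127.0.0.1/8",
        "eth0": "192.168.1.2/24",
    },
}


def remaining(interfaces, excluded):
    """Interfaces left after dropping the excluded names (assumed per-node)."""
    return {name: addr for name, addr in interfaces.items()
            if name not in excluded}


def shared_networks(a, b):
    """Networks that appear in both interface tables."""
    nets_a = {ipaddress.ip_interface(addr).network for addr in a.values()}
    nets_b = {ipaddress.ip_interface(addr).network for addr in b.values()}
    return nets_a & nets_b


if __name__ == "__main__":
    for excluded in ({"lo"}, {"lo", "eth0"}):
        front = remaining(HOSTS["frontend"], excluded)
        comp = remaining(HOSTS["compute"], excluded)
        print(sorted(excluded), sorted(front), sorted(comp),
              shared_networks(front, comp))
```

Under that assumption, excluding eth0 by name leaves eth1 on the frontend but no interface at all on the compute nodes, because the private network is eth1 on the frontend and eth0 on the compute nodes.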
TIA,
Jonathan