Hi, I am seeing problems with a small Linux cluster when running Open MPI jobs. The error message I get is:

[frontend][0,1,0][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=110

Following the FAQ, I looked to see what this error code corresponds to:

$ perl -e 'die$!=110'
Connection timed out at -e line 1.

This error message occurs the first time one of the compute nodes, which are on a private network, attempts to send data to the frontend (from where the job was started with mpirun). In actual fact, it seems that the error occurs the first time a process on the frontend tries to send data to another process on the frontend.

I tried to play about with things like --mca btl_tcp_if_exclude lo,eth0, but that didn't help matters. Nothing in the FAQ section on TCP and routing actually seemed to help.

Any advice would be very welcome.

The network configurations are:

a) frontend (2 network adapters, eth1 private for the cluster):

$ /sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 00:E0:81:30:A1:CE
          inet addr:128.40.5.39  Bcast:128.40.5.255  Mask:255.255.255.0
          inet6 addr: fe80::2e0:81ff:fe30:a1ce/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3496038 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2833685 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:500939570 (477.7 MiB)  TX bytes:671589665 (640.4 MiB)
          Interrupt:193

eth1      Link encap:Ethernet  HWaddr 00:E0:81:30:A1:CF
          inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::2e0:81ff:fe30:a1cf/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2201778 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2046572 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:188615778 (179.8 MiB)  TX bytes:247305804 (235.8 MiB)
          Interrupt:201

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:1528 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1528 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:363101 (354.5 KiB)  TX bytes:363101 (354.5 KiB)

$ /sbin/route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     *               255.255.255.0   U     0      0        0 eth1
128.40.5.0      *               255.255.255.0   U     0      0        0 eth0
default         128.40.5.245    0.0.0.0         UG    0      0        0 eth0

b) Compute nodes:

$ /sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 00:E0:81:30:A0:72
          inet addr:192.168.1.2  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::2e0:81ff:fe30:a072/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:189207 errors:0 dropped:0 overruns:0 frame:0
          TX packets:203507 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:23075241 (22.0 MiB)  TX bytes:17693363 (16.8 MiB)
          Interrupt:193

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:185 errors:0 dropped:0 overruns:0 frame:0
          TX packets:185 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:12644 (12.3 KiB)  TX bytes:12644 (12.3 KiB)

$ /sbin/route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     *               255.255.255.0   U     0      0        0 eth0
default         frontend.cluste 0.0.0.0         UG    0      0        0 eth0

TIS
Jonathan