Re: [OMPI users] TCP connection errors
Thanks Adrian - that's a useful suggestion, I'll explore that. Jonathan.
Re: [OMPI users] TCP connection errors
On 12/06/07, George Bosilca <bosi...@cs.utk.edu> wrote:
> Jonathan Underwood wrote:
> > Presumably switching the two interfaces on the frontend (eth0<->eth1)
> > would also solve this problem?
> If you have root privileges this seems to be another good approach.

I don't, but will explain the issue to the sysadmin. Thanks again. Jonathan.
Re: [OMPI users] TCP connection errors
On 12/06/07, George Bosilca wrote:
> Jonathan,
> It will be difficult to make it work in this configuration. The problem
> is that on the head node the network interface that has to be used is
> eth1, while on the compute nodes it is eth0. Therefore, tcp_if_include
> will not help ... Now, if you only start processes on the compute nodes
> you will not face this problem. Right now, I think this is the safest
> approach. We have a patch for this kind of problem, but it's not yet in
> the trunk. I'll let you know as soon as we commit it, and then you will
> have to use the unstable version until the patch makes its way into a
> stable version.

OK, thanks very much for letting me know. Presumably switching the two interfaces on the frontend (eth0<->eth1) would also solve this problem?

Cheers, Jonathan
Re: [OMPI users] TCP connection errors
Hi Adrian,

On 11/06/07, Adrian Knoth wrote:
> Which OMPI version?

1.2.2

> > $ perl -e 'die$!=110'
> > Connection timed out at -e line 1.
> Looks pretty much like a routing issue. Can you sniff on eth1 on the
> frontend node?

I don't have root access, so I'm afraid not.

> > This error message occurs the first time one of the compute nodes,
> > which are on a private network, attempts to send data to the frontend
> > In actual fact, it seems that the error occurs the first time a
> > process on the frontend tries to send data to another process on the
> > frontend.
> What's the exact problem? compute-node -> frontend? I don't think you
> have two processes on the frontend node, and even if you do, they
> should use shared memory.

> > Any advice would be very welcome
> Use tcpdump and/or recompile with debug enabled. In addition, set
> WANT_PEER_DUMP in ompi/mca/btl/tcp/btl_tcp_endpoint.c to 1 (line 120)
> and recompile, thus giving you more debug output. Depending on your
> OMPI version, you can also add mpi_preconnect_all=1 to your
> ~/.openmpi/mca-params.conf, by this establishing all connections during
> MPI_Init().

OK, will try these things.

> If nothing helps, exclude the frontend from computation.

OK. Thanks for the suggestions!

Jonathan
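To illustrate Adrian's mpi_preconnect_all suggestion, the parameter file entry would look something like this (a sketch only; whether this parameter is honoured depends on the OMPI version, as he notes):

```
# ~/.openmpi/mca-params.conf
# Establish all connections eagerly during MPI_Init(), so a
# routing problem surfaces at startup rather than at first send.
mpi_preconnect_all = 1
```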
[OMPI users] TCP connection errors
Hi, I am seeing problems with a small linux cluster when running OpenMPI jobs. The error message I get is:

[frontend][0,1,0][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=110

Following the FAQ, I looked to see what this error code corresponds to:

$ perl -e 'die$!=110'
Connection timed out at -e line 1.

This error message occurs the first time one of the compute nodes, which are on a private network, attempts to send data to the frontend (from where the job was started with mpirun). In actual fact, it seems that the error occurs the first time a process on the frontend tries to send data to another process on the frontend. I tried to play about with things like --mca btl_tcp_if_exclude lo,eth0, but that didn't help matters. Nothing in the FAQ section on TCP and routing actually seemed to help. Any advice would be very welcome.

The network configurations are:

a) frontend (2 network adapters, eth1 private for the cluster):

$ /sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 00:E0:81:30:A1:CE
          inet addr:128.40.5.39  Bcast:128.40.5.255  Mask:255.255.255.0
          inet6 addr: fe80::2e0:81ff:fe30:a1ce/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3496038 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2833685 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:500939570 (477.7 MiB)  TX bytes:671589665 (640.4 MiB)
          Interrupt:193

eth1      Link encap:Ethernet  HWaddr 00:E0:81:30:A1:CF
          inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::2e0:81ff:fe30:a1cf/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2201778 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2046572 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:188615778 (179.8 MiB)  TX bytes:247305804 (235.8 MiB)
          Interrupt:201

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:1528 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1528 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:363101 (354.5 KiB)  TX bytes:363101 (354.5 KiB)

$ /sbin/route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     *               255.255.255.0   U     0      0        0 eth1
128.40.5.0      *               255.255.255.0   U     0      0        0 eth0
default         128.40.5.245    0.0.0.0         UG    0      0        0 eth0

b) Compute nodes:

$ /sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 00:E0:81:30:A0:72
          inet addr:192.168.1.2  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::2e0:81ff:fe30:a072/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:189207 errors:0 dropped:0 overruns:0 frame:0
          TX packets:203507 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:23075241 (22.0 MiB)  TX bytes:17693363 (16.8 MiB)
          Interrupt:193

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:185 errors:0 dropped:0 overruns:0 frame:0
          TX packets:185 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:12644 (12.3 KiB)  TX bytes:12644 (12.3 KiB)

$ /sbin/route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     *               255.255.255.0   U     0      0        0 eth0
default         frontend.cluste 0.0.0.0         UG    0      0        0 eth0

TIA, Jonathan
Re: [OMPI users] mpi.h - not conforming to C90 spec
On 18/08/06, Brian Barrett <brbar...@open-mpi.org> wrote:
> On Aug 17, 2006, at 4:43 PM, Jonathan Underwood wrote:
> > Compiling an mpi program with gcc options -pedantic -Wall gives the
> > following warning:
> >
> > mpi.h:147: warning: ISO C90 does not support 'long long'
> >
> > So it seems that the openmpi implementation doesn't conform to C90.
> > Is this by design, or should it be reported as a bug?
> Well, MPI_LONG_LONG is a type we're supposed to support, and that means
> having 'long long' in the mpi.h file. I'm not really sure how to get
> around this, especially since there are a bunch of users out there that
> rely on MPI_LONG_LONG to send 64 bit integers around on 32 bit
> platforms. So I suppose that it's by design.

OK, that seems reasonable. I wonder then if the non-C90-conforming parts should be surrounded with #ifndef __STRICT_ANSI__ - this is predefined when gcc is expecting C90 conforming code. I am not sure if this is portable to other compilers, however. Probably not.

Best wishes,
Jonathan