On 2/1/07, Galen Shipman <gship...@lanl.gov> wrote:
What does ifconfig report on both nodes?
Hi Galen,

On headnode:

# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:11:43:EF:5D:6C
          inet addr:10.1.1.11  Bcast:10.1.1.255  Mask:255.255.255.0
          inet6 addr: fe80::211:43ff:feef:5d6c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:279965 errors:0 dropped:0 overruns:0 frame:0
          TX packets:785652 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:28422663 (27.1 MiB)  TX bytes:999981228 (953.6 MiB)
          Base address:0xecc0 Memory:dfae0000-dfb00000

eth1      Link encap:Ethernet  HWaddr 00:11:43:EF:5D:6D
          inet addr:<public IP>  Bcast:172.25.238.255  Mask:255.255.255.0
          inet6 addr: fe80::211:43ff:feef:5d6d/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1763252 errors:0 dropped:0 overruns:0 frame:0
          TX packets:133260 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1726135418 (1.6 GiB)  TX bytes:40990369 (39.0 MiB)
          Base address:0xdcc0 Memory:df8e0000-df900000

ib0       Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:20.1.0.11  Bcast:20.1.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:9746 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9746 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:576988 (563.4 KiB)  TX bytes:462432 (451.5 KiB)

ib1       Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:30.5.0.11  Bcast:30.5.0.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:2044  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

On COMPUTE node:

# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:11:43:D1:C0:80
          inet addr:10.1.1.254  Bcast:10.1.1.255  Mask:255.255.255.0
          inet6 addr: fe80::211:43ff:fed1:c080/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:145725 errors:0 dropped:0 overruns:0 frame:0
          TX packets:85136 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:46506800 (44.3 MiB)  TX bytes:14722190 (14.0 MiB)
          Base address:0xbcc0 Memory:df7e0000-df800000

ib0       Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:20.1.0.254  Bcast:20.1.0.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:9773 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9773 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:424624 (414.6 KiB)  TX bytes:617676 (603.1 KiB)

ib1       Link encap:UNSPEC  HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
          inet addr:30.5.0.254  Bcast:30.5.0.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:2044  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

Additionally, I've discovered that this problem appears to be specific to either the Dell hardware or Gig-E, because I cannot reproduce it in my VMware cluster. Output of lspci for the Ethernet devices:

[headnode]# lspci | grep -i "ether"; ssh -x compute-0-0 'lspci | grep -i ether'
06:07.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller (rev 05)
07:08.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller (rev 05)
07:07.0 Ethernet controller: Intel Corporation 82541GI/PI Gigabit Ethernet Controller (rev 05)

In other words, the headnode has two Gig-E interfaces and the compute node has one, and all three are the same controller.

Thanks,
Alex.
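One avenue worth trying here, offered only as a sketch and not a confirmed fix: the headnode is multi-homed (private eth0 on 10.1.1.x, public eth1, and two IPoIB interfaces ib0/ib1), while the compute node has only one Ethernet interface, so the TCP BTL may be trying to pair addresses the peer cannot reach. Open MPI's standard btl_tcp_if_include / btl_tcp_if_exclude MCA parameters can pin it to the cluster-private Ethernet; the interface names below are taken from the ifconfig output above:

# mpirun -hostfile ~/testdir/hosts --mca btl tcp,self \
    --mca btl_tcp_if_include eth0 ~/testdir/hello    # use only the private 10.1.1.x network

or, alternatively, keep eth0 and the IPoIB interfaces but drop loopback and the public interface:

# mpirun -hostfile ~/testdir/hosts --mca btl tcp,self \
    --mca btl_tcp_if_exclude lo,eth1 ~/testdir/hello

If the failure at finalize disappears once the interface list is restricted, that would point at the multi-homed TCP setup rather than the Dell hardware itself.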
On 2/1/07, Galen Shipman <gship...@lanl.gov> wrote:
What does ifconfig report on both nodes?

- Galen

On Feb 1, 2007, at 2:50 PM, Alex Tumanov wrote:

> Hi,
>
> I have kept doing my own investigation and recompiled OpenMPI to have
> only the barebones functionality with no support for any interconnects
> other than ethernet:
>
> # rpmbuild --rebuild --define="configure_options --prefix=/opt/openmpi/1.1.4" --define="install_in_opt 1" --define="mflags all" openmpi-1.1.4-1.src.rpm
>
> The error detailed in my previous message persisted, which eliminates
> the possibility of interconnect support interfering with ethernet
> support. Here's an excerpt from ompi_info:
>
> # ompi_info
> Open MPI: 1.1.4
> Open MPI SVN revision: r13362
> Open RTE: 1.1.4
> Open RTE SVN revision: r13362
> OPAL: 1.1.4
> OPAL SVN revision: r13362
> Prefix: /opt/openmpi/1.1.4
> Configured architecture: x86_64-redhat-linux-gnu
> . . .
> Thread support: posix (mpi: no, progress: no)
> Internal debug support: no
> MPI parameter check: runtime
> . . .
> MCA btl: self (MCA v1.0, API v1.0, Component v1.1.4)
> MCA btl: sm (MCA v1.0, API v1.0, Component v1.1.4)
> MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
>
> Again, to replicate the error, I ran
>
> # mpirun -hostfile ~/testdir/hosts --mca btl tcp,self ~/testdir/hello
>
> In this case, you can even omit the runtime mca param specifications:
>
> # mpirun -hostfile ~/testdir/hosts ~/testdir/hello
>
> Thanks for reading this. I hope I've provided enough information.
>
> Sincerely,
> Alex.
>
> On 2/1/07, Alex Tumanov <atuma...@gmail.com> wrote:
>> Hello,
>>
>> I have tried a very basic test on a 2-node "cluster" consisting of 2
>> Dell boxes. One of them is a dual-CPU Intel(R) Xeon(TM) CPU 2.80GHz
>> with 1GB of RAM and the slave node is a quad-CPU Intel(R) Xeon(TM) CPU
>> 3.40GHz with 2GB of RAM. Both have Infiniband cards and Gig-E. The
>> slave node is connected directly to the headnode.
>>
>> OpenMPI version 1.1.4 was compiled with support for the following
>> btl's: openib, mx, gm, and mvapi. I got it to work over openib, but,
>> ironically, the same trivial hello world job fails over tcp (please
>> see the log below). I found that the same problem was already
>> discussed on this list here:
>> http://www.open-mpi.org/community/lists/users/2006/06/1347.php
>> The discussion mentioned that there could be something wrong with the
>> TCP setup of the nodes. Unfortunately it was taken offline. Could
>> someone help me with this?
>>
>> Thanks,
>> Alex.
>>
>> # mpirun -hostfile ~/testdir/hosts --mca btl tcp,self ~/testdir/hello
>> Hello from Alex' MPI test program
>> Process 0 on headnode out of 2
>> Hello from Alex' MPI test program
>> Process 1 on compute-0-0.local out of 2
>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>> Failing at addr:0xdebdf8
>> [0] func:/opt/openmpi/1.1.4/lib/libopal.so.0 [0x2a9587e0e5]
>> [1] func:/lib64/tls/libpthread.so.0 [0x3d1a00c430]
>> [2] func:/opt/openmpi/1.1.4/lib/libopal.so.0 [0x2a95880729]
>> [3] func:/opt/openmpi/1.1.4/lib/libopal.so.0(_int_free+0x24a) [0x2a95880d7a]
>> [4] func:/opt/openmpi/1.1.4/lib/libopal.so.0(free+0xbf) [0x2a9588303f]
>> [5] func:/opt/openmpi/1.1.4/lib/libmpi.so.0 [0x2a955949ca]
>> [6] func:/opt/openmpi/1.1.4/lib/openmpi/mca_btl_tcp.so(mca_btl_tcp_component_close+0x34f) [0x2a988ee8ef]
>> [7] func:/opt/openmpi/1.1.4/lib/libopal.so.0(mca_base_components_close+0xde) [0x2a95872e1e]
>> [8] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(mca_btl_base_close+0xe9) [0x2a955e5159]
>> [9] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(mca_bml_base_close+0x9) [0x2a955e5029]
>> [10] func:/opt/openmpi/1.1.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_component_close+0x25) [0x2a97f4dc55]
>> [11] func:/opt/openmpi/1.1.4/lib/libopal.so.0(mca_base_components_close+0xde) [0x2a95872e1e]
>> [12] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(mca_pml_base_close+0x69) [0x2a955ea3e9]
>> [13] func:/opt/openmpi/1.1.4/lib/libmpi.so.0(ompi_mpi_finalize+0xfe) [0x2a955ab57e]
>> [14] func:/root/testdir/hello(main+0x7b) [0x4009d3]
>> [15] func:/lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x3d1951c3fb]
>> [16] func:/root/testdir/hello [0x4008ca]
>> *** End of error message ***
>> mpirun noticed that job rank 0 with PID 15573 on node "dr11.local"
>> exited on signal 11.
>> 2 additional processes aborted (not shown)
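For what it's worth, the trace above shows both ranks printing their hello output and the crash only happening later, inside mca_btl_tcp_component_close called from ompi_mpi_finalize, i.e. during teardown of the TCP BTL rather than during communication. A small diagnostic sketch, assuming this 1.1.4 build exposes the usual framework verbosity parameter (ompi_info will confirm which parameters the tcp component actually has):

# ompi_info --param btl tcp                      # list the tcp BTL's MCA parameters,
                                                 # including the if_include/if_exclude settings
# mpirun -hostfile ~/testdir/hosts --mca btl tcp,self \
    --mca btl_base_verbose 30 ~/testdir/hello    # rerun with BTL setup/teardown logged

The verbose rerun should show which interfaces and peer addresses the tcp component selects on each node, which is the first thing to compare against the ifconfig output earlier in the thread.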