It's the failure on readv that's the source of the trouble. What happens if you if_include only eth2? Does it work then?
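Something along these lines should do it (same install path and hosts as in your report, with only the if_include value changed):

  /opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 --mca btl tcp,sm,self \
      --mca btl_tcp_if_include eth2 --host bro127,bro128 ./a.out

If that hangs the same way while eth0 alone keeps working, the problem is isolated to the eth2 path itself.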
On Jan 23, 2014, at 5:38 PM, Doug Roberts <robe...@sharcnet.ca> wrote:
>
>> Date: Fri, 17 Jan 2014 19:24:50 -0800
>> From: Ralph Castain <r...@open-mpi.org>
>>
>> The most common cause of this problem is a firewall between the
>> nodes - you can ssh across, but not communicate. Have you checked
>> to see that the firewall is turned off?
>
> Turns out some iptables rules (typical on our clusters) were active.
> They are now turned off for continued testing as suggested. I have
> rerun the mpi_test code, this time using a debug-enabled build of
> openmpi/1.6.5, keeping with the Intel compiler.
>
> As shown below, the problem is still there. I'm including some gdb
> output this time. The job succeeds using only eth0 over 1G but hangs
> almost immediately when the eth2 10G interface is included. Any more
> suggestions would be greatly appreciated.
>
> [roberpj@bro127:~/samples/mpi_test] mpicc -g mpi_test.c
>
> o Using eth0 only:
>
> [roberpj@bro127:~/samples/mpi_test] /opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 --mca btl tcp,sm,self --mca btl_tcp_if_include eth0 --host bro127,bro128 ./a.out
> Number of processes = 2
> Test repeated 3 times for reliability
> I am process 0 on node bro127
> Run 1 of 3
> P0: Sending to P1
> P0: Waiting to receive from P1
> I am process 1 on node bro128
> P1: Waiting to receive from to P0
> P0: Received from to P1
> Run 2 of 3
> P0: Sending to P1
> P0: Waiting to receive from P1
> P1: Sending to to P0
> P1: Waiting to receive from to P0
> P0: Received from to P1
> Run 3 of 3
> P0: Sending to P1
> P0: Waiting to receive from P1
> P1: Sending to to P0
> P1: Waiting to receive from to P0
> P1: Sending to to P0
> P1: Done
> P0: Received from to P1
> P0: Done
>
> o Using eth0,eth2:
>
> [roberpj@bro127:~/samples/mpi_test] /opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 --mca btl tcp,sm,self --mca btl_tcp_if_include eth0,eth2 --host bro127,bro128 ./a.out
> Number of processes = 2
> Test repeated 3 times for reliability
> I am process 0 on node bro127
> Run 1 of 3
> P0: Sending to P1
> P0: Waiting to receive from P1
> I am process 1 on node bro128
> P1: Waiting to receive from to P0
> ^Cmpirun: killing job...
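One quick sanity check before digging into the verbose output: confirm that eth2 is actually up on both nodes and that the two eth2 addresses sit on the same subnet with the same prefix length. These are just generic iproute2 commands, nothing Open MPI-specific:

  # run on both bro127 and bro128
  ip -4 addr show eth2     # IPv4 address and prefix on the 10G interface
  ip link show eth2        # link state (UP/DOWN) and MTU

A mismatched netmask, or an interface that is administratively up but has no carrier, could produce exactly this kind of one-sided hang.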
>
> o Using eth0,eth2 with verbosity:
>
> [roberpj@bro127:~/samples/mpi_test] /opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 --mca btl tcp,sm,self --mca btl_tcp_if_include eth0,eth2 --mca btl_base_verbose 100 --host bro127,bro128 ./a.out
> [bro127:20157] mca: base: components_open: Looking for btl components
> [bro127:20157] mca: base: components_open: opening btl components
> [bro127:20157] mca: base: components_open: found loaded component self
> [bro127:20157] mca: base: components_open: component self has no register function
> [bro127:20157] mca: base: components_open: component self open function successful
> [bro127:20157] mca: base: components_open: found loaded component sm
> [bro127:20157] mca: base: components_open: component sm has no register function
> [bro128:23354] mca: base: components_open: Looking for btl components
> [bro127:20157] mca: base: components_open: component sm open function successful
> [bro127:20157] mca: base: components_open: found loaded component tcp
> [bro127:20157] mca: base: components_open: component tcp register function successful
> [bro127:20157] mca: base: components_open: component tcp open function successful
> [bro128:23354] mca: base: components_open: opening btl components
> [bro128:23354] mca: base: components_open: found loaded component self
> [bro128:23354] mca: base: components_open: component self has no register function
> [bro128:23354] mca: base: components_open: component self open function successful
> [bro128:23354] mca: base: components_open: found loaded component sm
> [bro128:23354] mca: base: components_open: component sm has no register function
> [bro128:23354] mca: base: components_open: component sm open function successful
> [bro128:23354] mca: base: components_open: found loaded component tcp
> [bro128:23354] mca: base: components_open: component tcp register function successful
> [bro128:23354] mca: base: components_open: component tcp open function successful
> [bro127:20157] select: initializing btl component self
> [bro127:20157] select: init of component self returned success
> [bro127:20157] select: initializing btl component sm
> [bro127:20157] select: init of component sm returned success
> [bro127:20157] select: initializing btl component tcp
> [bro127:20157] select: init of component tcp returned success
> [bro128:23354] select: initializing btl component self
> [bro128:23354] select: init of component self returned success
> [bro128:23354] select: initializing btl component sm
> [bro128:23354] select: init of component sm returned success
> [bro128:23354] select: initializing btl component tcp
> [bro128:23354] select: init of component tcp returned success
> [bro127:20157] btl: tcp: attempting to connect() to address 10.27.2.128 on port 4
> Number of processes = 2
> Test repeated 3 times for reliability
> [bro128:23354] btl: tcp: attempting to connect() to address 10.27.2.127 on port 4
> I am process 0 on node bro127
> Run 1 of 3
> P0: Sending to P1
> [bro127:20157] btl: tcp: attempting to connect() to address 10.29.4.128 on port 4
> P0: Waiting to receive from P1
> I am process 1 on node bro128
> P1: Waiting to receive from to P0
> [bro127][[9184,1],0][../../../../../../openmpi-1.6.5/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
> ^C mpirun: killing job...
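The verbose output shows the connections over the 1G subnet (10.27.2.x) being attempted in both directions, while the attempt from bro127 against 10.29.4.128 (the 10G side) ends in the readv timeout. A plain TCP test over that subnet, independent of Open MPI, should tell you whether the path itself is broken. The port number here is arbitrary, not one Open MPI uses:

  # on bro128:
  nc -l 5000              # some netcat variants need: nc -l -p 5000
  # on bro127:
  nc 10.29.4.128 5000     # type a line; it should appear on bro128

If that also stalls, the problem is in the network path over eth2 (routing, switch config, or a filter outside the nodes), not in Open MPI.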
>
> o Master node bro127 debugging info:
>
> [roberpj@bro127:~] gdb -p 21067
> (gdb) bt
> #0  0x00002ac7ae4a86f3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1  0x00002ac7acc3dedc in epoll_dispatch (base=0x3, arg=0x1916850, tv=0x20) at ../../../../openmpi-1.6.5/opal/event/epoll.c:215
> #2  0x00002ac7acc3f276 in opal_event_base_loop (base=0x3, flags=26306640) at ../../../../openmpi-1.6.5/opal/event/event.c:838
> #3  0x00002ac7acc3f122 in opal_event_loop (flags=3) at ../../../../openmpi-1.6.5/opal/event/event.c:766
> #4  0x00002ac7acc82c14 in opal_progress () at ../../../openmpi-1.6.5/opal/runtime/opal_progress.c:189
> #5  0x00002ac7b21a8c40 in mca_pml_ob1_recv (addr=0x3, count=26306640, datatype=0x20, src=-1, tag=0, comm=0x80000, status=0x7fff15ad5f38) at ../../../../../../openmpi-1.6.5/ompi/mca/pml/ob1/pml_ob1_irecv.c:105
> #6  0x00002ac7acb830f7 in PMPI_Recv (buf=0x3, count=26306640, type=0x20, source=-1, tag=0, comm=0x80000, status=0x4026e0) at precv.c:78
> #7  0x0000000000402b65 in main (argc=1, argv=0x7fff15ad6098) at mpi_test.c:72
> (gdb) frame 7
> #7  0x0000000000402b65 in main (argc=1, argv=0x7fff15ad6098) at mpi_test.c:72
> 72          MPI_Recv(&A[0], M, MPI_DOUBLE, procs-1, msgid, MPI_COMM_WORLD, &stat);
> (gdb)
>
> confirming ...
> [root@bro127:~] iptables --list
> Chain INPUT (policy ACCEPT)
> target     prot opt source               destination
>
> Chain FORWARD (policy ACCEPT)
> target     prot opt source               destination
>
> Chain OUTPUT (policy ACCEPT)
> target     prot opt source               destination
>
> o Slave node bro128 debugging info:
>
> [roberpj@bro128:~] top -u roberpj
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 24334 roberpj   20   0  115m 5208 3216 R 100.0  0.0   2:32.12 a.out
>
> [roberpj@bro128:~] gdb -p 24334
> (gdb) bt
> #0  0x00002b7475cc86f3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1  0x00002b747445dedc in epoll_dispatch (base=0x3, arg=0x9b6850, tv=0x20) at ../../../../openmpi-1.6.5/opal/event/epoll.c:215
> #2  0x00002b747445f276 in opal_event_base_loop (base=0x3, flags=10184784) at ../../../../openmpi-1.6.5/opal/event/event.c:838
> #3  0x00002b747445f122 in opal_event_loop (flags=3) at ../../../../openmpi-1.6.5/opal/event/event.c:766
> #4  0x00002b74744a2c14 in opal_progress () at ../../../openmpi-1.6.5/opal/runtime/opal_progress.c:189
> #5  0x00002b74799c8c40 in mca_pml_ob1_recv (addr=0x3, count=10184784, datatype=0x20, src=-1, tag=10899040, comm=0x0, status=0x7fff1ce5e778) at ../../../../../../openmpi-1.6.5/ompi/mca/pml/ob1/pml_ob1_irecv.c:105
> #6  0x00002b74743a30f7 in PMPI_Recv (buf=0x3, count=10184784, type=0x20, source=-1, tag=10899040, comm=0x0, status=0x4026e0) at precv.c:78
> #7  0x0000000000402c40 in main (argc=1, argv=0x7fff1ce5e8d8) at mpi_test.c:76
> (gdb) frame 7
> #7  0x0000000000402c40 in main (argc=1, argv=0x7fff1ce5e8d8) at mpi_test.c:76
> 76          MPI_Recv(&A[0], M, MPI_DOUBLE, myid-1, msgid, MPI_COMM_WORLD, &stat);
> (gdb)
>
> confirming ...
> [root@bro128:~] iptables --list
> Chain INPUT (policy ACCEPT)
> target     prot opt source               destination
>
> Chain FORWARD (policy ACCEPT)
> target     prot opt source               destination
>
> Chain OUTPUT (policy ACCEPT)
> target     prot opt source               destination
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
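The backtraces just confirm that both ranks are parked in MPI_Recv waiting for progress, and the filter table is indeed empty on both nodes. Since a timeout rather than a connection refusal usually means packets are silently disappearing somewhere, two more things worth ruling out (the address below is the 10G peer from your verbose output):

  # on bro127:
  ip route get 10.29.4.128                           # confirm traffic to the 10G peer actually leaves via eth2
  iptables -L -n -v                                  # per-rule packet counters while the job hangs
  iptables -t nat -L -n; iptables -t mangle -L -n    # rules can hide outside the filter table

If the route is right and no counters move, the drop is more likely happening on the 10G switch/fabric than on the hosts themselves.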