It's the failure on readv that's the source of the trouble. What happens if you 
if_include only eth2? Does it work then?
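For example, something along these lines (the same command as in your log, just 
with the interface list reduced to eth2; untested on my end):

  /opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 --mca btl tcp,sm,self \
      --mca btl_tcp_if_include eth2 --host bro127,bro128 ./a.out

If that also hangs with the readv timeout, the problem is isolated to the 10G 
interface/subnet itself rather than to mixing the two interfaces.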


On Jan 23, 2014, at 5:38 PM, Doug Roberts <robe...@sharcnet.ca> wrote:

> 
>> Date: Fri, 17 Jan 2014 19:24:50 -0800
>> From: Ralph Castain <r...@open-mpi.org>
>> 
>> The most common cause of this problem is a firewall between the
>> nodes - you can ssh across, but not communicate. Have you checked
>> to see that the firewall is turned off?
> 
> Turns out some iptables rules (typical on our clusters) were active.
> They have now been turned off for continued testing, as suggested. I have
> rerun the mpi_test code, this time with a debug-enabled build of
> openmpi/1.6.5, still using the Intel compiler.
> 
> As shown below, the problem is still there. I'm including some gdb
> output this time. The job succeeds when only eth0 (1G) is used but
> hangs almost immediately when the eth2 (10G) interface is included.
> Any further suggestions would be greatly appreciated.
> 
> [roberpj@bro127:~/samples/mpi_test] mpicc -g mpi_test.c
> 
> o Using eth0 only:
> 
> [roberpj@bro127:~/samples/mpi_test] /opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 --mca btl tcp,sm,self --mca btl_tcp_if_include eth0 --host bro127,bro128 ./a.out
> Number of processes = 2
> Test repeated 3 times for reliability
> I am process 0 on node bro127
> Run 1 of 3
> P0: Sending to P1
> P0: Waiting to receive from P1
> I am process 1 on node bro128
> P1: Waiting to receive from to P0
> P0: Received from to P1
> Run 2 of 3
> P0: Sending to P1
> P0: Waiting to receive from P1
> P1: Sending to to P0
> P1: Waiting to receive from to P0
> P0: Received from to P1
> Run 3 of 3
> P0: Sending to P1
> P0: Waiting to receive from P1
> P1: Sending to to P0
> P1: Waiting to receive from to P0
> P1: Sending to to P0
> P1: Done
> P0: Received from to P1
> P0: Done
> 
> o Using eth0,eth2:
> 
> [roberpj@bro127:~/samples/mpi_test] /opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 --mca btl tcp,sm,self --mca btl_tcp_if_include eth0,eth2 --host bro127,bro128 ./a.out
> Number of processes = 2
> Test repeated 3 times for reliability
> I am process 0 on node bro127
> Run 1 of 3
> P0: Sending to P1
> P0: Waiting to receive from P1
> I am process 1 on node bro128
> P1: Waiting to receive from to P0
> ^Cmpirun: killing job...
> 
> o Using eth0,eth2 with verbosity:
> 
> [roberpj@bro127:~/samples/mpi_test] /opt/sharcnet/openmpi/1.6.5/intel-debug/bin/mpirun -np 2 --mca btl tcp,sm,self --mca btl_tcp_if_include eth0,eth2 --mca btl_base_verbose 100 --host bro127,bro128 ./a.out
> [bro127:20157] mca: base: components_open: Looking for btl components
> [bro127:20157] mca: base: components_open: opening btl components
> [bro127:20157] mca: base: components_open: found loaded component self
> [bro127:20157] mca: base: components_open: component self has no register 
> function
> [bro127:20157] mca: base: components_open: component self open function 
> successful
> [bro127:20157] mca: base: components_open: found loaded component sm
> [bro127:20157] mca: base: components_open: component sm has no register 
> function
> [bro128:23354] mca: base: components_open: Looking for btl components
> [bro127:20157] mca: base: components_open: component sm open function 
> successful
> [bro127:20157] mca: base: components_open: found loaded component tcp
> [bro127:20157] mca: base: components_open: component tcp register function 
> successful
> [bro127:20157] mca: base: components_open: component tcp open function 
> successful
> [bro128:23354] mca: base: components_open: opening btl components
> [bro128:23354] mca: base: components_open: found loaded component self
> [bro128:23354] mca: base: components_open: component self has no register 
> function
> [bro128:23354] mca: base: components_open: component self open function 
> successful
> [bro128:23354] mca: base: components_open: found loaded component sm
> [bro128:23354] mca: base: components_open: component sm has no register 
> function
> [bro128:23354] mca: base: components_open: component sm open function 
> successful
> [bro128:23354] mca: base: components_open: found loaded component tcp
> [bro128:23354] mca: base: components_open: component tcp register function 
> successful
> [bro128:23354] mca: base: components_open: component tcp open function 
> successful
> [bro127:20157] select: initializing btl component self
> [bro127:20157] select: init of component self returned success
> [bro127:20157] select: initializing btl component sm
> [bro127:20157] select: init of component sm returned success
> [bro127:20157] select: initializing btl component tcp
> [bro127:20157] select: init of component tcp returned success
> [bro128:23354] select: initializing btl component self
> [bro128:23354] select: init of component self returned success
> [bro128:23354] select: initializing btl component sm
> [bro128:23354] select: init of component sm returned success
> [bro128:23354] select: initializing btl component tcp
> [bro128:23354] select: init of component tcp returned success
> [bro127:20157] btl: tcp: attempting to connect() to address 10.27.2.128 on 
> port 4
> Number of processes = 2
> Test repeated 3 times for reliability
> [bro128:23354] btl: tcp: attempting to connect() to address 10.27.2.127 on 
> port 4
> I am process 0 on node bro127
> Run 1 of 3
> P0: Sending to P1
> [bro127:20157] btl: tcp: attempting to connect() to address 10.29.4.128 on 
> port 4
> P0: Waiting to receive from P1
> I am process 1 on node bro128
> P1: Waiting to receive from to P0
> [bro127][[9184,1],0][../../../../../../openmpi-1.6.5/ompi/mca/btl/tcp/btl_tcp_frag.c:215:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection timed out (110)
> ^C mpirun: killing job...
> 
> o Master node bro127 debugging info:
> 
> [roberpj@bro127:~] gdb -p 21067
> (gdb) bt
> #0  0x00002ac7ae4a86f3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1  0x00002ac7acc3dedc in epoll_dispatch (base=0x3, arg=0x1916850, tv=0x20) 
> at ../../../../openmpi-1.6.5/opal/event/epoll.c:215
> #2  0x00002ac7acc3f276 in opal_event_base_loop (base=0x3, flags=26306640) at 
> ../../../../openmpi-1.6.5/opal/event/event.c:838
> #3  0x00002ac7acc3f122 in opal_event_loop (flags=3) at 
> ../../../../openmpi-1.6.5/opal/event/event.c:766
> #4  0x00002ac7acc82c14 in opal_progress () at 
> ../../../openmpi-1.6.5/opal/runtime/opal_progress.c:189
> #5  0x00002ac7b21a8c40 in mca_pml_ob1_recv (addr=0x3, count=26306640, 
> datatype=0x20, src=-1, tag=0, comm=0x80000, status=0x7fff15ad5f38)
>    at ../../../../../../openmpi-1.6.5/ompi/mca/pml/ob1/pml_ob1_irecv.c:105
> #6  0x00002ac7acb830f7 in PMPI_Recv (buf=0x3, count=26306640, type=0x20, 
> source=-1, tag=0, comm=0x80000, status=0x4026e0) at precv.c:78
> #7  0x0000000000402b65 in main (argc=1, argv=0x7fff15ad6098) at mpi_test.c:72
> (gdb) frame 7
> #7  0x0000000000402b65 in main (argc=1, argv=0x7fff15ad6098) at mpi_test.c:72
> 72               MPI_Recv(&A[0], M, MPI_DOUBLE, procs-1, msgid, MPI_COMM_WORLD, &stat);
> (gdb)
> 
> confirming ...
> [root@bro127:~] iptables --list
> Chain INPUT (policy ACCEPT)
> target     prot opt source               destination
> 
> Chain FORWARD (policy ACCEPT)
> target     prot opt source               destination
> 
> Chain OUTPUT (policy ACCEPT)
> target     prot opt source               destination
> 
> o Slave node bro128 debugging info:
> 
> [roberpj@bro128:~]  top -u roberpj
>  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 24334 roberpj   20   0  115m 5208 3216 R 100.0  0.0   2:32.12 a.out
> 
> [roberpj@bro128:~] gdb -p 24334
> (gdb) bt
> #0  0x00002b7475cc86f3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> #1  0x00002b747445dedc in epoll_dispatch (base=0x3, arg=0x9b6850, tv=0x20) at 
> ../../../../openmpi-1.6.5/opal/event/epoll.c:215
> #2  0x00002b747445f276 in opal_event_base_loop (base=0x3, flags=10184784) at 
> ../../../../openmpi-1.6.5/opal/event/event.c:838
> #3  0x00002b747445f122 in opal_event_loop (flags=3) at 
> ../../../../openmpi-1.6.5/opal/event/event.c:766
> #4  0x00002b74744a2c14 in opal_progress () at 
> ../../../openmpi-1.6.5/opal/runtime/opal_progress.c:189
> #5  0x00002b74799c8c40 in mca_pml_ob1_recv (addr=0x3, count=10184784, 
> datatype=0x20, src=-1, tag=10899040, comm=0x0, status=0x7fff1ce5e778)
>    at ../../../../../../openmpi-1.6.5/ompi/mca/pml/ob1/pml_ob1_irecv.c:105
> #6  0x00002b74743a30f7 in PMPI_Recv (buf=0x3, count=10184784, type=0x20, 
> source=-1, tag=10899040, comm=0x0, status=0x4026e0) at precv.c:78
> #7  0x0000000000402c40 in main (argc=1, argv=0x7fff1ce5e8d8) at mpi_test.c:76
> (gdb) frame 7
> #7  0x0000000000402c40 in main (argc=1, argv=0x7fff1ce5e8d8) at mpi_test.c:76
> 76               MPI_Recv(&A[0], M, MPI_DOUBLE, myid-1, msgid, MPI_COMM_WORLD, &stat);
> (gdb)
> 
> confirming ...
> [root@bro128:~] iptables --list
> Chain INPUT (policy ACCEPT)
> target     prot opt source               destination
> 
> Chain FORWARD (policy ACCEPT)
> target     prot opt source               destination
> 
> Chain OUTPUT (policy ACCEPT)
> target     prot opt source               destination
