Thanks, Ralph! When I add --mca oob_tcp_if_include ib0 (where ib0 is the InfiniBand interface reported by ifconfig) to mpirun, it starts working correctly! Why doesn't Open MPI do this by itself?
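For reference, the invocation that now works for me is sketched below; ib0 is just what ifconfig reports for the IPoIB interface on our cluster, so substitute your own. Adding btl_tcp_if_include as well is only a guess on my part (it pins MPI-level TCP traffic to the same interface), not something anyone confirmed in this thread:

$ salloc -N2 --exclusive -p test -J ompi
$ mpirun --mca oob_tcp_if_include ib0 --mca btl_tcp_if_include ib0 -np 2 hello_c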
Tue, 22 Jul 2014 11:26:16 -0700 from Ralph Castain <r...@open-mpi.org>:
>Okay, the problem is that the connection back to mpirun isn't getting thru. We are trying on the 10.0.251.53 address - is that blocked, or should we be using something else? If so, you might want to direct us by adding "-mca oob_tcp_if_include foo", where foo is the interface you want us to use
>
>On Jul 20, 2014, at 10:24 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
>>NIC = network interface controller?
>>There is QDR Infiniband 4x/10G Ethernet/Gigabit Ethernet.
>>I want to use QDR Infiniband.
>>Here is a new output:
>>$ mpirun -mca mca_base_env_list 'LD_PRELOAD' --debug-daemons --mca plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 2 hello_c |tee hello.out
>>Warning: Conflicting CPU frequencies detected, using: 2927.000000.
>>[compiler-2:30735] mca:base:select:( plm) Querying component [isolated]
>>[compiler-2:30735] mca:base:select:( plm) Query of component [isolated] set priority to 0
>>[compiler-2:30735] mca:base:select:( plm) Querying component [rsh]
>>[compiler-2:30735] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>[compiler-2:30735] mca:base:select:( plm) Querying component [slurm]
>>[compiler-2:30735] mca:base:select:( plm) Query of component [slurm] set priority to 75
>>[compiler-2:30735] mca:base:select:( plm) Selected component [slurm]
>>[compiler-2:30735] mca: base: components_register: registering oob components
>>[compiler-2:30735] mca: base: components_register: found loaded component tcp
>>[compiler-2:30735] mca: base: components_register: component tcp register function successful
>>[compiler-2:30735] mca: base: components_open: opening oob components
>>[compiler-2:30735] mca: base: components_open: found loaded component tcp
>>[compiler-2:30735] mca: base: components_open: component tcp open function successful
>>[compiler-2:30735] mca:oob:select: checking available component tcp
>>[compiler-2:30735] mca:oob:select: Querying component [tcp]
>>[compiler-2:30735] oob:tcp: component_available called
>>[compiler-2:30735] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>>[compiler-2:30735] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
>>[compiler-2:30735] [[65177,0],0] oob:tcp:init adding 10.0.251.53 to our list of V4 connections
>>[compiler-2:30735] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>>[compiler-2:30735] [[65177,0],0] oob:tcp:init adding 10.0.0.4 to our list of V4 connections
>>[compiler-2:30735] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
>>[compiler-2:30735] [[65177,0],0] oob:tcp:init adding 10.2.251.14 to our list of V4 connections
>>[compiler-2:30735] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
>>[compiler-2:30735] [[65177,0],0] oob:tcp:init adding 10.128.0.4 to our list of V4 connections
>>[compiler-2:30735] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
>>[compiler-2:30735] [[65177,0],0] oob:tcp:init adding 93.180.7.38 to our list of V4 connections
>>[compiler-2:30735] [[65177,0],0] TCP STARTUP
>>[compiler-2:30735] [[65177,0],0] attempting to bind to IPv4 port 0
>>[compiler-2:30735] [[65177,0],0] assigned IPv4 port 49759
>>[compiler-2:30735] mca:oob:select: Adding component to end
>>[compiler-2:30735] mca:oob:select: Found 1 active transports
>>[compiler-2:30735] mca: base: components_register: registering rml components
>>[compiler-2:30735] mca: base: components_register: found loaded component oob
>>[compiler-2:30735] mca: base: components_register: component oob has no register or open function
>>[compiler-2:30735] mca: base: components_open: opening rml components
>>[compiler-2:30735] mca: base: components_open: found loaded component oob
>>[compiler-2:30735] mca: base: components_open: component oob open function successful
>>[compiler-2:30735] orte_rml_base_select: initializing rml component oob
>>[compiler-2:30735] [[65177,0],0] posting recv
>>[compiler-2:30735] [[65177,0],0] posting persistent recv on tag 30 for peer [[WILDCARD],WILDCARD]
>>[compiler-2:30735] [[65177,0],0] posting recv
>>[compiler-2:30735] [[65177,0],0] posting persistent recv on tag 15 for peer [[WILDCARD],WILDCARD]
>>[compiler-2:30735] [[65177,0],0] posting recv
>>[compiler-2:30735] [[65177,0],0] posting persistent recv on tag 32 for peer [[WILDCARD],WILDCARD]
>>[compiler-2:30735] [[65177,0],0] posting recv
>>[compiler-2:30735] [[65177,0],0] posting persistent recv on tag 33 for peer [[WILDCARD],WILDCARD]
>>[compiler-2:30735] [[65177,0],0] posting recv
>>[compiler-2:30735] [[65177,0],0] posting persistent recv on tag 5 for peer [[WILDCARD],WILDCARD]
>>[compiler-2:30735] [[65177,0],0] posting recv
>>[compiler-2:30735] [[65177,0],0] posting persistent recv on tag 10 for peer [[WILDCARD],WILDCARD]
>>[compiler-2:30735] [[65177,0],0] posting recv
>>[compiler-2:30735] [[65177,0],0] posting persistent recv on tag 12 for peer [[WILDCARD],WILDCARD]
>>[compiler-2:30735] [[65177,0],0] posting recv
>>[compiler-2:30735] [[65177,0],0] posting persistent recv on tag 9 for peer [[WILDCARD],WILDCARD]
>>[compiler-2:30735] [[65177,0],0] posting recv
>>[compiler-2:30735] [[65177,0],0] posting persistent recv on tag 34 for peer [[WILDCARD],WILDCARD]
>>[compiler-2:30735] [[65177,0],0] posting recv
>>[compiler-2:30735] [[65177,0],0] posting persistent recv on tag 2 for peer [[WILDCARD],WILDCARD]
>>[compiler-2:30735] [[65177,0],0] posting recv
>>[compiler-2:30735] [[65177,0],0] posting persistent recv on tag 21 for peer [[WILDCARD],WILDCARD]
>>[compiler-2:30735] [[65177,0],0] posting recv
>>[compiler-2:30735] [[65177,0],0] posting persistent recv on tag 22 for peer [[WILDCARD],WILDCARD]
>>[compiler-2:30735] [[65177,0],0] posting recv
>>[compiler-2:30735] [[65177,0],0] posting persistent recv on tag 45 for peer [[WILDCARD],WILDCARD]
>>[compiler-2:30735] [[65177,0],0] posting recv
>>[compiler-2:30735] [[65177,0],0] posting persistent recv on tag 46 for peer [[WILDCARD],WILDCARD]
>>[compiler-2:30735] [[65177,0],0] posting recv
>>[compiler-2:30735] [[65177,0],0] posting persistent recv on tag 1 for peer [[WILDCARD],WILDCARD]
>>[compiler-2:30735] [[65177,0],0] posting recv
>>[compiler-2:30735] [[65177,0],0] posting persistent recv on tag 27 for peer [[WILDCARD],WILDCARD]
>>Daemon was launched on node1-128-17 - beginning to initialize
>>Daemon was launched on node1-128-18 - beginning to initialize
>>[node1-128-17:14779] mca: base: components_register: registering oob components
>>[node1-128-17:14779] mca: base: components_register: found loaded component tcp
>>[node1-128-17:14779] mca: base: components_register: component tcp register function successful
>>[node1-128-17:14779] mca: base: components_open: opening oob components
>>[node1-128-17:14779] mca: base: components_open: found loaded component tcp
>>[node1-128-17:14779] mca: base: components_open: component tcp open function successful
>>[node1-128-17:14779] mca:oob:select: checking available component tcp
>>[node1-128-17:14779] mca:oob:select: Querying component [tcp]
>>[node1-128-17:14779] oob:tcp: component_available called
>>[node1-128-17:14779] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>>[node1-128-17:14779] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
>>[node1-128-17:14779] [[65177,0],1] oob:tcp:init adding 10.0.128.17 to our list of V4 connections
>>[node1-128-17:14779] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>>[node1-128-17:14779] [[65177,0],1] oob:tcp:init adding 10.128.128.17 to our list of V4 connections
>>[node1-128-17:14779] [[65177,0],1] TCP STARTUP
>>[node1-128-17:14779] [[65177,0],1] attempting to bind to IPv4 port 0
>>[node1-128-17:14779] [[65177,0],1] assigned IPv4 port 46441
>>[node1-128-17:14779] mca:oob:select: Adding component to end
>>[node1-128-17:14779] mca:oob:select: Found 1 active transports
>>[node1-128-17:14779] mca: base: components_register: registering rml components
>>[node1-128-17:14779] mca: base: components_register: found loaded component oob
>>[node1-128-17:14779] mca: base: components_register: component oob has no register or open function
>>[node1-128-17:14779] mca: base: components_open: opening rml components
>>[node1-128-17:14779] mca: base: components_open: found loaded component oob
>>[node1-128-17:14779] mca: base: components_open: component oob open function successful
>>[node1-128-17:14779] orte_rml_base_select: initializing rml component oob
>>[node1-128-18:17849] mca: base: components_register: registering oob components
>>[node1-128-18:17849] mca: base: components_register: found loaded component tcp
>>[node1-128-18:17849] mca: base: components_register: component tcp register function successful
>>[node1-128-18:17849] mca: base: components_open: opening oob components
>>[node1-128-18:17849] mca: base: components_open: found loaded component tcp
>>[node1-128-18:17849] mca: base: components_open: component tcp open function successful
>>[node1-128-18:17849] mca:oob:select: checking available component tcp
>>[node1-128-18:17849] mca:oob:select: Querying component [tcp]
>>[node1-128-18:17849] oob:tcp: component_available called
>>[node1-128-18:17849] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>>[node1-128-18:17849] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
>>[node1-128-18:17849] [[65177,0],2] oob:tcp:init adding 10.0.128.18 to our list of V4 connections
>>[node1-128-18:17849] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>>[node1-128-18:17849] [[65177,0],2] oob:tcp:init adding 10.128.128.18 to our list of V4 connections
>>[node1-128-18:17849] [[65177,0],2] TCP STARTUP
>>[node1-128-18:17849] [[65177,0],2] attempting to bind to IPv4 port 0
>>[node1-128-18:17849] [[65177,0],2] assigned IPv4 port 60695
>>[node1-128-18:17849] mca:oob:select: Adding component to end
>>[node1-128-18:17849] mca:oob:select: Found 1 active transports
>>[node1-128-18:17849] mca: base: components_register: registering rml components
>>[node1-128-18:17849] mca: base: components_register: found loaded component oob
>>[node1-128-18:17849] mca: base: components_register: component oob has no register or open function
>>[node1-128-18:17849] mca: base: components_open: opening rml components
>>[node1-128-18:17849] mca: base: components_open: found loaded component oob
>>[node1-128-18:17849] mca: base: components_open: component oob open function successful
>>[node1-128-18:17849] orte_rml_base_select: initializing rml component oob
>>Daemon [[65177,0],1] checking in as pid 14779 on host node1-128-17
>>[node1-128-17:14779] [[65177,0],1] orted: up and running - waiting for commands!
>>[node1-128-17:14779] [[65177,0],1] posting recv
>>[node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 30 for peer [[WILDCARD],WILDCARD]
>>[node1-128-17:14779] [[65177,0],1] posting recv
>>[node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 15 for peer [[WILDCARD],WILDCARD]
>>[node1-128-17:14779] [[65177,0],1] posting recv
>>[node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 32 for peer [[WILDCARD],WILDCARD]
>>[node1-128-17:14779] [[65177,0],1] posting recv
>>[node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 11 for peer [[WILDCARD],WILDCARD]
>>[node1-128-17:14779] [[65177,0],1] posting recv
>>[node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 9 for peer [[WILDCARD],WILDCARD]
>>[node1-128-17:14779] [[65177,0],1]: set_addr to uri 4271439872.0;tcp://10.0.251.53,10.0.0.4,10.2.251.14,10.128.0.4,93.180.7.38:49759
>>[node1-128-17:14779] [[65177,0],1]:set_addr checking if peer [[65177,0],0] is reachable via component tcp
>>[node1-128-17:14779] [[65177,0],1] oob:tcp: working peer [[65177,0],0] address tcp://10.0.251.53,10.0.0.4,10.2.251.14,10.128.0.4,93.180.7.38:49759
>>[node1-128-17:14779] [[65177,0],1] PASSING ADDR 10.0.251.53 TO MODULE
>>[node1-128-17:14779] [[65177,0],1]:tcp set addr for peer [[65177,0],0]
>>[node1-128-17:14779] [[65177,0],1] PASSING ADDR 10.0.0.4 TO MODULE
>>[node1-128-17:14779] [[65177,0],1]:tcp set addr for peer [[65177,0],0]
>>[node1-128-17:14779] [[65177,0],1] PASSING ADDR 10.2.251.14 TO MODULE
>>[node1-128-17:14779] [[65177,0],1]:tcp set addr for peer [[65177,0],0]
>>[node1-128-17:14779] [[65177,0],1] PASSING ADDR 10.128.0.4 TO MODULE
>>[node1-128-17:14779] [[65177,0],1]:tcp set addr for peer [[65177,0],0]
>>[node1-128-17:14779] [[65177,0],1] PASSING ADDR 93.180.7.38 TO MODULE
>>[node1-128-17:14779] [[65177,0],1]:tcp set addr for peer [[65177,0],0]
>>[node1-128-17:14779] [[65177,0],1]: peer [[65177,0],0] is reachable via component tcp
>>[node1-128-17:14779] [[65177,0],1] posting recv
>>[node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 3 for peer [[WILDCARD],WILDCARD]
>>[node1-128-17:14779] [[65177,0],1] posting recv
>>[node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 21 for peer [[WILDCARD],WILDCARD]
>>[node1-128-17:14779] [[65177,0],1] posting recv
>>[node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 45 for peer [[WILDCARD],WILDCARD]
>>[node1-128-17:14779] [[65177,0],1] posting recv
>>[node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 46 for peer [[WILDCARD],WILDCARD]
>>[node1-128-17:14779] [[65177,0],1] posting recv
>>[node1-128-17:14779] [[65177,0],1] posting persistent recv on tag 1 for peer [[WILDCARD],WILDCARD]
>>[node1-128-17:14779] [[65177,0],1] OOB_SEND: rml_oob_send.c:199
>>[node1-128-17:14779] [[65177,0],1]:tcp:processing set_peer cmd
>>[node1-128-17:14779] [[65177,0],1]:tcp:processing set_peer cmd
>>[node1-128-17:14779] [[65177,0],1]:tcp:processing set_peer cmd
>>[node1-128-17:14779] [[65177,0],1]:tcp:processing set_peer cmd
>>[node1-128-17:14779] [[65177,0],1]:tcp:processing set_peer cmd
>>[node1-128-17:14779] [[65177,0],1] oob:base:send to target [[65177,0],0]
>>[node1-128-17:14779] [[65177,0],1] oob:tcp:send_nb to peer [[65177,0],0]:10
>>[node1-128-17:14779] [[65177,0],1] tcp:send_nb to peer [[65177,0],0]
>>[node1-128-17:14779] [[65177,0],1]:[oob_tcp.c:484] post send to [[65177,0],0]
>>[node1-128-17:14779] [[65177,0],1]:[oob_tcp.c:421] processing send to peer [[65177,0],0]:10
>>[node1-128-17:14779] [[65177,0],1]:[oob_tcp.c:455] queue pending to [[65177,0],0]
>>[node1-128-17:14779] [[65177,0],1] tcp:send_nb: initiating connection to [[65177,0],0]
>>[node1-128-17:14779] [[65177,0],1]:[oob_tcp.c:469] connect to [[65177,0],0]
>>[node1-128-17:14779] [[65177,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[65177,0],0]
>>[node1-128-17:14779] [[65177,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[65177,0],0] on socket 10
>>[node1-128-17:14779] [[65177,0],1] orte_tcp_peer_try_connect: attempting to connect to proc [[65177,0],0] on 10.0.251.53:49759 - 0 retries
>>[node1-128-17:14779] [[65177,0],1] waiting for connect completion to [[65177,0],0] - activating send event
>>Daemon [[65177,0],2] checking in as pid 17849 on host node1-128-18
>>[node1-128-18:17849] [[65177,0],2] orted: up and running - waiting for commands!
>>[node1-128-18:17849] [[65177,0],2] posting recv
>>[node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 30 for peer [[WILDCARD],WILDCARD]
>>[node1-128-18:17849] [[65177,0],2] posting recv
>>[node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 15 for peer [[WILDCARD],WILDCARD]
>>[node1-128-18:17849] [[65177,0],2] posting recv
>>[node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 32 for peer [[WILDCARD],WILDCARD]
>>[node1-128-18:17849] [[65177,0],2] posting recv
>>[node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 11 for peer [[WILDCARD],WILDCARD]
>>[node1-128-18:17849] [[65177,0],2] posting recv
>>[node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 9 for peer [[WILDCARD],WILDCARD]
>>[node1-128-18:17849] [[65177,0],2]: set_addr to uri 4271439872.0;tcp://10.0.251.53,10.0.0.4,10.2.251.14,10.128.0.4,93.180.7.38:49759
>>[node1-128-18:17849] [[65177,0],2]:set_addr checking if peer [[65177,0],0] is reachable via component tcp
>>[node1-128-18:17849] [[65177,0],2] oob:tcp: working peer [[65177,0],0] address tcp://10.0.251.53,10.0.0.4,10.2.251.14,10.128.0.4,93.180.7.38:49759
>>[node1-128-18:17849] [[65177,0],2] PASSING ADDR 10.0.251.53 TO MODULE
>>[node1-128-18:17849] [[65177,0],2]:tcp set addr for peer [[65177,0],0]
>>[node1-128-18:17849] [[65177,0],2] PASSING ADDR 10.0.0.4 TO MODULE
>>[node1-128-18:17849] [[65177,0],2]:tcp set addr for peer [[65177,0],0]
>>[node1-128-18:17849] [[65177,0],2] PASSING ADDR 10.2.251.14 TO MODULE
>>[node1-128-18:17849] [[65177,0],2]:tcp set addr for peer [[65177,0],0]
>>[node1-128-18:17849] [[65177,0],2] PASSING ADDR 10.128.0.4 TO MODULE
>>[node1-128-18:17849] [[65177,0],2]:tcp set addr for peer [[65177,0],0]
>>[node1-128-18:17849] [[65177,0],2] PASSING ADDR 93.180.7.38 TO MODULE
>>[node1-128-18:17849] [[65177,0],2]:tcp set addr for peer [[65177,0],0]
>>[node1-128-18:17849] [[65177,0],2]: peer [[65177,0],0] is reachable via component tcp
>>[node1-128-18:17849] [[65177,0],2] posting recv
>>[node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 3 for peer [[WILDCARD],WILDCARD]
>>[node1-128-18:17849] [[65177,0],2] posting recv
>>[node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 21 for peer [[WILDCARD],WILDCARD]
>>[node1-128-18:17849] [[65177,0],2] posting recv
>>[node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 45 for peer [[WILDCARD],WILDCARD]
>>[node1-128-18:17849] [[65177,0],2] posting recv
>>[node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 46 for peer [[WILDCARD],WILDCARD]
>>[node1-128-18:17849] [[65177,0],2] posting recv
>>[node1-128-18:17849] [[65177,0],2] posting persistent recv on tag 1 for peer [[WILDCARD],WILDCARD]
>>[node1-128-18:17849] [[65177,0],2] OOB_SEND: rml_oob_send.c:199
>>[node1-128-18:17849] [[65177,0],2]:tcp:processing set_peer cmd
>>[node1-128-18:17849] [[65177,0],2]:tcp:processing set_peer cmd
>>[node1-128-18:17849] [[65177,0],2]:tcp:processing set_peer cmd
>>[node1-128-18:17849] [[65177,0],2]:tcp:processing set_peer cmd
>>[node1-128-18:17849] [[65177,0],2]:tcp:processing set_peer cmd
>>[node1-128-18:17849] [[65177,0],2] oob:base:send to target [[65177,0],0]
>>[node1-128-18:17849] [[65177,0],2] oob:tcp:send_nb to peer [[65177,0],0]:10
>>[node1-128-18:17849] [[65177,0],2] tcp:send_nb to peer [[65177,0],0]
>>[node1-128-18:17849] [[65177,0],2]:[oob_tcp.c:484] post send to [[65177,0],0]
>>[node1-128-18:17849] [[65177,0],2]:[oob_tcp.c:421] processing send to peer [[65177,0],0]:10
>>[node1-128-18:17849] [[65177,0],2]:[oob_tcp.c:455] queue pending to [[65177,0],0]
>>[node1-128-18:17849] [[65177,0],2] tcp:send_nb: initiating connection to [[65177,0],0]
>>[node1-128-18:17849] [[65177,0],2]:[oob_tcp.c:469] connect to [[65177,0],0]
>>[node1-128-18:17849] [[65177,0],2] orte_tcp_peer_try_connect: attempting to connect to proc [[65177,0],0]
>>[node1-128-18:17849] [[65177,0],2] orte_tcp_peer_try_connect: attempting to connect to proc [[65177,0],0] on socket 10
>>[node1-128-18:17849] [[65177,0],2] orte_tcp_peer_try_connect: attempting to connect to proc [[65177,0],0] on 10.0.251.53:49759 - 0 retries
>>[node1-128-18:17849] [[65177,0],2] waiting for connect completion to [[65177,0],0] - activating send event
>>[node1-128-18:17837] [[61806,0],2] tcp:send_handler called to send to peer [[61806,0],0]
>>[node1-128-18:17837] [[61806,0],2] tcp:send_handler CONNECTING
>>[node1-128-18:17837] [[61806,0],2]:tcp:complete_connect called for peer [[61806,0],0] on socket 10
>>[node1-128-18:17837] [[61806,0],2]-[[61806,0],0] tcp_peer_complete_connect: connection failed: Connection timed out (110)
>>[node1-128-18:17837] [[61806,0],2] tcp_peer_close for [[61806,0],0] sd 10 state CONNECTING
>>[node1-128-18:17837] [[61806,0],2] tcp:lost connection called for peer [[61806,0],0]
>>[node1-128-18:17837] mca: base: close: component oob closed
>>[node1-128-18:17837] mca: base: close: unloading component oob
>>[node1-128-18:17837] [[61806,0],2] TCP SHUTDOWN
>>[node1-128-18:17837] [[61806,0],2] RELEASING PEER OBJ [[61806,0],0]
>>[node1-128-18:17837] [[61806,0],2] CLOSING SOCKET 10
>>[node1-128-18:17837] mca: base: close: component tcp closed
>>[node1-128-18:17837] mca: base: close: unloading component tcp
>>srun: error: node1-128-18: task 1: Exited with exit code 1
>>srun: Terminating job step 647191.1
>>[node1-128-17:14767] [[61806,0],1] tcp:send_handler called to send to peer [[61806,0],0]
>>[node1-128-17:14767] [[61806,0],1] tcp:send_handler CONNECTING
>>[node1-128-17:14767] [[61806,0],1]:tcp:complete_connect called for peer [[61806,0],0] on socket 10
>>[node1-128-17:14767] [[61806,0],1]-[[61806,0],0] tcp_peer_complete_connect: connection failed: Connection timed out (110)
>>[node1-128-17:14767] [[61806,0],1] tcp_peer_close for [[61806,0],0] sd 10 state CONNECTING
>>[node1-128-17:14767] [[61806,0],1] tcp:lost connection called for peer [[61806,0],0]
>>[node1-128-17:14767] mca: base: close: component oob closed
>>[node1-128-17:14767] mca: base: close: unloading component oob
>>[node1-128-17:14767] [[61806,0],1] TCP SHUTDOWN
>>[node1-128-17:14767] [[61806,0],1] RELEASING PEER OBJ [[61806,0],0]
>>[node1-128-17:14767] [[61806,0],1] CLOSING SOCKET 10
>>[node1-128-17:14767] mca: base: close: component tcp closed
>>[node1-128-17:14767] mca: base: close: unloading component tcp
>>srun: error: node1-128-17: task 0: Exited with exit code 1
>>[node1-128-17:14779] [[65177,0],1] tcp:send_handler called to send to peer [[65177,0],0]
>>[node1-128-17:14779] [[65177,0],1] tcp:send_handler CONNECTING
>>[node1-128-17:14779] [[65177,0],1]:tcp:complete_connect called for peer [[65177,0],0] on socket 10
>>[node1-128-17:14779] [[65177,0],1]-[[65177,0],0] tcp_peer_complete_connect: connection failed: Connection timed out (110)
>>[node1-128-17:14779] [[65177,0],1] tcp_peer_close for [[65177,0],0] sd 10 state CONNECTING
>>[node1-128-17:14779] [[65177,0],1] tcp:lost connection called for peer [[65177,0],0]
>>[node1-128-17:14779] mca: base: close: component oob closed
>>[node1-128-17:14779] mca: base: close: unloading component oob
>>[node1-128-17:14779] [[65177,0],1] TCP SHUTDOWN
>>[node1-128-17:14779] [[65177,0],1] RELEASING PEER OBJ [[65177,0],0]
>>[node1-128-17:14779] [[65177,0],1] CLOSING SOCKET 10
>>[node1-128-17:14779] mca: base: close: component tcp closed
>>[node1-128-17:14779] mca: base: close: unloading component tcp
>>[node1-128-18:17849] [[65177,0],2] tcp:send_handler called to send to peer [[65177,0],0]
>>[node1-128-18:17849] [[65177,0],2] tcp:send_handler CONNECTING
>>[node1-128-18:17849] [[65177,0],2]:tcp:complete_connect called for peer [[65177,0],0] on socket 10
>>[node1-128-18:17849] [[65177,0],2]-[[65177,0],0] tcp_peer_complete_connect: connection failed: Connection timed out (110)
>>[node1-128-18:17849] [[65177,0],2] tcp_peer_close for [[65177,0],0] sd 10 state CONNECTING
>>[node1-128-18:17849] [[65177,0],2] tcp:lost connection called for peer [[65177,0],0]
>>[node1-128-18:17849] mca: base: close: component oob closed
>>[node1-128-18:17849] mca: base: close: unloading component oob
>>[node1-128-18:17849] [[65177,0],2] TCP SHUTDOWN
>>[node1-128-18:17849] [[65177,0],2] RELEASING PEER OBJ [[65177,0],0]
>>[node1-128-18:17849] [[65177,0],2] CLOSING SOCKET 10
>>[node1-128-18:17849] mca: base: close: component tcp closed
>>[node1-128-18:17849] mca: base: close: unloading component tcp
>>srun: error: node1-128-17: task 0: Exited with exit code 1
>>srun: Terminating job step 647191.2
>>srun: error: node1-128-18: task 1: Exited with exit code 1
>>--------------------------------------------------------------------------
>>An ORTE daemon has unexpectedly failed after launch and before
>>communicating back to mpirun. This could be caused by a number
>>of factors, including an inability to create a connection back
>>to mpirun due to a lack of common network interfaces and/or no
>>route found between them. Please check network connectivity
>>(including firewalls and network routing requirements).
>>--------------------------------------------------------------------------
>>[compiler-2:30735] [[65177,0],0] orted_cmd: received halt_vm cmd
>>[compiler-2:30735] mca: base: close: component oob closed
>>[compiler-2:30735] mca: base: close: unloading component oob
>>[compiler-2:30735] [[65177,0],0] TCP SHUTDOWN
>>[compiler-2:30735] mca: base: close: component tcp closed
>>[compiler-2:30735] mca: base: close: unloading component tcp
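(A side note on the timeouts just above: both daemons tried mpirun's first advertised address, 10.0.251.53:49759, and the connect timed out, which is exactly the "no route back to mpirun" failure the error banner describes. A quick way to test that path, assuming nc is installed on the compute nodes and substituting the port number your own run prints, would be:

$ ssh node1-128-17 'nc -zv -w5 10.0.251.53 49759'

If that also times out, the 10.0.251.x subnet is firewalled or unrouted from the compute nodes, and oob_tcp_if_include is the right workaround.)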
>>
>>Sun, 20 Jul 2014 13:11:19 -0700 from Ralph Castain <r...@open-mpi.org>:
>>>Yeah, we aren't connecting back - is there a firewall running? You need to leave the "--debug-daemons --mca plm_base_verbose 5" on there as well to see the entire problem.
>>>
>>>What you can see here is that mpirun is listening on several interfaces:
>>>>[access1:24264] [[55095,0],0] oob:tcp:init adding 10.0.251.51 to our list of V4 connections
>>>>[access1:24264] [[55095,0],0] oob:tcp:init adding 10.2.251.11 to our list of V4 connections
>>>>[access1:24264] [[55095,0],0] oob:tcp:init adding 10.0.0.111 to our list of V4 connections
>>>>[access1:24264] [[55095,0],0] oob:tcp:init adding 10.128.0.1 to our list of V4 connections
>>>>[access1:24264] [[55095,0],0] oob:tcp:init adding 93.180.7.36 to our list of V4 connections
>>>
>>>It looks like you have multiple interfaces connected to the same subnet - this is generally a bad idea. I also saw that the last one in the list shows up twice in the kernel array - not sure why, but is there something special about that NIC?
>>>
>>>What do the NICs look like on the remote hosts?
>>>
>>>On Jul 20, 2014, at 10:59 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>
>>>>-------- Forwarded message --------
>>>>From: Timur Ismagilov <tismagi...@mail.ru>
>>>>To: Ralph Castain <r...@open-mpi.org>
>>>>Date: Sun, 20 Jul 2014 21:58:41 +0400
>>>>Subject: Re[2]: [OMPI users] Fwd: Re[4]: Salloc and mpirun problem
>>>>
>>>>Here it is:
>>>>$ salloc -N2 --exclusive -p test -J ompi
>>>>salloc: Granted job allocation 647049
>>>>
>>>>$ mpirun -mca mca_base_env_list 'LD_PRELOAD' -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 2 hello_c
>>>>[access1:24264] mca: base: components_register: registering oob components
>>>>[access1:24264] mca: base: components_register: found loaded component tcp
>>>>[access1:24264] mca: base: components_register: component tcp register function successful
>>>>[access1:24264] mca: base: components_open: opening oob components
>>>>[access1:24264] mca: base: components_open: found loaded component tcp
>>>>[access1:24264] mca: base: components_open: component tcp open function successful
>>>>[access1:24264] mca:oob:select: checking available component tcp
>>>>[access1:24264] mca:oob:select: Querying component [tcp]
>>>>[access1:24264] oob:tcp: component_available called
>>>>[access1:24264] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
>>>>[access1:24264] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
>>>>[access1:24264] [[55095,0],0] oob:tcp:init adding 10.0.251.51 to our list of V4 connections
>>>>[access1:24264] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
>>>>[access1:24264] [[55095,0],0] oob:tcp:init adding 10.0.0.111 to our list of V4 connections
>>>>[access1:24264] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
>>>>[access1:24264] [[55095,0],0] oob:tcp:init adding 10.2.251.11 to our list of V4 connections
>>>>[access1:24264] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
>>>>[access1:24264] [[55095,0],0] oob:tcp:init adding 10.128.0.1 to our list of V4 connections
>>>>[access1:24264] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
>>>>[access1:24264] [[55095,0],0] oob:tcp:init adding 93.180.7.36 to our list of V4 connections
>>>>[access1:24264] WORKING INTERFACE 7 KERNEL INDEX 7 FAMILY: V4
>>>>[access1:24264] [[55095,0],0] oob:tcp:init adding 93.180.7.36 to our list of V4 connections
>>>>[access1:24264] [[55095,0],0] TCP STARTUP
>>>>[access1:24264] [[55095,0],0] attempting to bind to IPv4 port 0
>>>>[access1:24264] [[55095,0],0] assigned IPv4 port 47756
>>>>[access1:24264] mca:oob:select: Adding component to end
>>>>[access1:24264] mca:oob:select: Found 1 active transports
>>>>[access1:24264] mca: base: components_register: registering rml components
>>>>[access1:24264] mca: base: components_register: found loaded component oob
>>>>[access1:24264] mca: base: components_register: component oob has no register or open function
>>>>[access1:24264] mca: base: components_open: opening rml components
>>>>[access1:24264] mca: base: components_open: found loaded component oob
>>>>[access1:24264] mca: base: components_open: component oob open function successful
>>>>[access1:24264] orte_rml_base_select: initializing rml component oob
>>>>[access1:24264] [[55095,0],0] posting recv
>>>>[access1:24264] [[55095,0],0] posting persistent recv on tag 30 for peer [[WILDCARD],WILDCARD]
>>>>[access1:24264] [[55095,0],0] posting recv
>>>>[access1:24264] [[55095,0],0] posting persistent recv on tag 15 for peer [[WILDCARD],WILDCARD]
>>>>[access1:24264] [[55095,0],0] posting recv
>>>>[access1:24264] [[55095,0],0] posting persistent recv on tag 32 for peer [[WILDCARD],WILDCARD]
>>>>[access1:24264] [[55095,0],0] posting recv
>>>>[access1:24264] [[55095,0],0] posting persistent recv on tag 33 for peer [[WILDCARD],WILDCARD]
>>>>[access1:24264] [[55095,0],0] posting recv
>>>>[access1:24264] [[55095,0],0] posting persistent recv on tag 5 for peer [[WILDCARD],WILDCARD]
>>>>[access1:24264] [[55095,0],0] posting recv
>>>>[access1:24264] [[55095,0],0] posting persistent recv on tag 10 for peer [[WILDCARD],WILDCARD]
>>>>[access1:24264] [[55095,0],0] posting recv
>>>>[access1:24264] [[55095,0],0] posting persistent recv on tag 12 for peer [[WILDCARD],WILDCARD]
>>>>[access1:24264] [[55095,0],0] posting recv
>>>>[access1:24264] [[55095,0],0] posting persistent recv on tag 9 for peer [[WILDCARD],WILDCARD]
>>>>[access1:24264] [[55095,0],0] posting recv
>>>>[access1:24264] [[55095,0],0] posting persistent recv on tag 34 for peer [[WILDCARD],WILDCARD]
>>>>[access1:24264] [[55095,0],0] posting recv
>>>>[access1:24264] [[55095,0],0] posting persistent recv on tag 2 for peer [[WILDCARD],WILDCARD]
>>>>[access1:24264] [[55095,0],0] posting recv
>>>>[access1:24264] [[55095,0],0] posting persistent recv on tag 21 for peer [[WILDCARD],WILDCARD]
>>>>[access1:24264] [[55095,0],0] posting recv
>>>>[access1:24264] [[55095,0],0] posting persistent recv on tag 22 for peer [[WILDCARD],WILDCARD]
>>>>[access1:24264] [[55095,0],0] posting recv
>>>>[access1:24264] [[55095,0],0] posting persistent recv on tag 45 for peer [[WILDCARD],WILDCARD]
>>>>[access1:24264] [[55095,0],0] posting recv
>>>>[access1:24264] [[55095,0],0] posting persistent recv on tag 46 for peer [[WILDCARD],WILDCARD]
>>>>[access1:24264] [[55095,0],0] posting recv
>>>>[access1:24264] [[55095,0],0] posting persistent recv on tag 1 for peer [[WILDCARD],WILDCARD]
>>>>[access1:24264] [[55095,0],0] posting recv
>>>>[access1:24264] [[55095,0],0] posting persistent recv on tag 27 for peer [[WILDCARD],WILDCARD]
>>>>--------------------------------------------------------------------------
>>>>An ORTE daemon has unexpectedly failed after launch and before
>>>>communicating back to mpirun. This could be caused by a number
>>>>of factors, including an inability to create a connection back
>>>>to mpirun due to a lack of common network interfaces and/or no
>>>>route found between them. Please check network connectivity
>>>>(including firewalls and network routing requirements).
>>>>--------------------------------------------------------------------------
>>>>[access1:24264] mca: base: close: component oob closed
>>>>[access1:24264] mca: base: close: unloading component oob
>>>>[access1:24264] [[55095,0],0] TCP SHUTDOWN
>>>>[access1:24264] mca: base: close: component tcp closed
>>>>[access1:24264] mca: base: close: unloading component tcp
>>>>
>>>>When I use srun I get:
>>>>$ salloc -N2 --exclusive -p test -J ompi
>>>>....
>>>>$ srun -N 2 ./hello_c
>>>>Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI semenov@compiler-2 Distribution, ident: 1.9a1r32252, repo rev: r32252, Jul 16, 2014 (nightly snapshot tarball), 146)
>>>>Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI semenov@compiler-2 Distribution, ident: 1.9a1r32252, repo rev: r32252, Jul 16, 2014 (nightly snapshot tarball), 146)
>>>>
>>>>Sun, 20 Jul 2014 09:28:13 -0700 from Ralph Castain <r...@open-mpi.org>:
>>>>>Try adding -mca oob_base_verbose 10 -mca rml_base_verbose 10 to your cmd line. It looks to me like we are unable to connect back to the node where you are running mpirun for some reason.
>>>>>
>>>>>On Jul 20, 2014, at 9:16 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>>>>>>I have the same problem in openmpi 1.8.1 (Apr 23, 2014).
>>>>>>Does the srun command have a --map-by <foo> mpirun parameter, or can I change it from the bash environment?
>>>>>>
>>>>>>-------- Forwarded message --------
>>>>>>From: Timur Ismagilov <tismagi...@mail.ru>
>>>>>>To: Mike Dubman <mi...@dev.mellanox.co.il>
>>>>>>Cc: Open MPI Users <us...@open-mpi.org>
>>>>>>Date: Thu, 17 Jul 2014 16:42:24 +0400
>>>>>>Subject: Re[4]: [OMPI users] Salloc and mpirun problem
>>>>>>
>>>>>>With Open MPI 1.9a1r32252 (Jul 16, 2014 (nightly snapshot tarball)) I got this output (same?):
>>>>>>$ salloc -N2 --exclusive -p test -J ompi
>>>>>>salloc: Granted job allocation 645686
>>>>>>
>>>>>>$ LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so mpirun -mca mca_base_env_list 'LD_PRELOAD' --mca plm_base_verbose 10 --debug-daemons -np 1 hello_c
>>>>>>[access1:04312] mca: base: components_register: registering plm components
>>>>>>[access1:04312] mca: base: components_register: found loaded component isolated
>>>>>>[access1:04312] mca: base: components_register: component isolated has no register or open function
>>>>>>[access1:04312] mca: base: components_register: found loaded component rsh
>>>>>>[access1:04312] mca: base: components_register: component rsh register function successful
>>>>>>[access1:04312] mca: base: components_register: found loaded component slurm
>>>>>>[access1:04312] mca: base: components_register: component slurm register function successful
>>>>>>[access1:04312] mca: base: components_open: opening plm components
>>>>>>[access1:04312] mca: base: components_open: found loaded component isolated
>>>>>>[access1:04312] mca: base: components_open: component isolated open function successful
>>>>>>[access1:04312] mca: base: components_open: found loaded component rsh
>>>>>>[access1:04312] mca: base: components_open: component rsh open function successful
>>>>>>[access1:04312] mca: base: components_open: found loaded component slurm
>>>>>>[access1:04312] mca: base: components_open: component slurm open function successful
>>>>>>[access1:04312] mca:base:select: Auto-selecting plm components
>>>>>>[access1:04312] mca:base:select:( plm) Querying component [isolated]
>>>>>>[access1:04312] mca:base:select:( plm) Query of component [isolated] set priority to 0
>>>>>>[access1:04312] mca:base:select:( plm) Querying component [rsh]
>>>>>>[access1:04312] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>>>>>[access1:04312] mca:base:select:( plm) Querying component [slurm]
>>>>>>[access1:04312] mca:base:select:( plm) Query of component [slurm] set priority to 75
>>>>>>[access1:04312] mca:base:select:( plm) Selected component [slurm]
>>>>>>[access1:04312] mca: base: close: component isolated closed
>>>>>>[access1:04312] mca: base: close: unloading component isolated
>>>>>>[access1:04312] mca: base: close: component rsh closed
>>>>>>[access1:04312] mca: base: close: unloading component rsh
>>>>>>Daemon was launched on node1-128-09 - beginning to initialize
>>>>>>Daemon was launched on node1-128-15 - beginning to initialize
>>>>>>Daemon [[39207,0],1] checking in as pid 26240 on host node1-128-09
>>>>>>[node1-128-09:26240] [[39207,0],1] orted: up and running - waiting for commands!
>>>>>>Daemon [[39207,0],2] checking in as pid 30129 on host node1-128-15
>>>>>>[node1-128-15:30129] [[39207,0],2] orted: up and running - waiting for commands!
>>>>>>srun: error: node1-128-09: task 0: Exited with exit code 1
>>>>>>srun: Terminating job step 645686.3
>>>>>>srun: error: node1-128-15: task 1: Exited with exit code 1
>>>>>>--------------------------------------------------------------------------
>>>>>>An ORTE daemon has unexpectedly failed after launch and before
>>>>>>communicating back to mpirun. This could be caused by a number
>>>>>>of factors, including an inability to create a connection back
>>>>>>to mpirun due to a lack of common network interfaces and/or no
>>>>>>route found between them. Please check network connectivity
>>>>>>(including firewalls and network routing requirements).
>>>>>>--------------------------------------------------------------------------
>>>>>>[access1:04312] [[39207,0],0] orted_cmd: received halt_vm cmd
>>>>>>[access1:04312] mca: base: close: component slurm closed
>>>>>>[access1:04312] mca: base: close: unloading component slurm
>>>>
>>>>----------------------------------------------------------------------
>>>>
>>>>_______________________________________________
>>>>users mailing list
>>>>us...@open-mpi.org
>>>>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>Link to this post: http://www.open-mpi.org/community/lists/users/2014/07/24828.php