Dear OpenMPI developers,

I have a strange problem while running my application (2000 processors). I'm using Open MPI 1.2.22 over InfiniBand. The following is my mca-params.conf:

btl = ^tcp
btl_tcp_if_exclude = eth0,ib0,ib1
oob_tcp_include = eth1,lo,eth0
btl_openib_warn_default_gid_prefix = 0
btl_openib_ib_timeout = 20

At a certain point in my run, the application died with this message:

[node265:05593] [0,1,1679]-[0,1,1680] mca_oob_tcp_peer_try_connect: connect to 10.161.12.14:36645 failed: Software caused connection abort (103)
[node484:06545] [0,1,1617]-[0,1,1681] mca_oob_tcp_peer_try_connect: connect to 10.161.12.14:36647 failed: Software caused connection abort (103)
[node295:05394] [0,1,1649]-[0,1,1681] mca_oob_tcp_peer_try_connect: connect to 10.161.12.14:36647 failed: Software caused connection abort (103)
[node182:05579] [0,1,1673]-[0,1,1681] mca_oob_tcp_peer_try_connect: connect to 10.161.12.14:36647 failed: Software caused connection abort (103)
[node182:05579] [0,1,1673]-[0,1,1681] mca_oob_tcp_peer_try_connect: connect to 10.161.12.14:36647 failed, connecting over all interfaces failed!

My question is: does this error depend on some timeout? How can I solve it?

Thanks in advance.

--
Ing. Gabriele Fatigati
Parallel programmer
CINECA Systems & Tecnologies Department
Supercomputing Group
Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
www.cineca.it
Tel: +39 051 6171722
g.fatigati [AT] cineca.it