Hi Gabriele
I don't think this is a timeout issue. OMPI 1.2.x doesn't scale very
well to that size due to a requirement that the underlying out-of-band
system fully connect at the TCP level. Thus, every process in your job
will be opening 2002 sockets (one to every other process, one to the
local orted, and one back to mpirun). More than likely, you are simply
running out of sockets on your nodes.
For a job this size, I would recommend upgrading to OMPI 1.3.1. This
uses a routing scheme for the out-of-band system, so each process only
opens 1 socket to its local daemon. Much more scalable, and I think it
would solve this problem. It will also start much faster, as a bonus.
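
To make the difference concrete, here is a rough back-of-the-envelope sketch in Python. It is not anything from the Open MPI code base. It only counts out-of-band sockets per process under the two schemes described above and compares the result with the open-file limit of the shell it runs in. The function names and the procs_per_node value are assumptions for illustration, and the exact full-mesh count depends on how you tally the connections, so treat the numbers as approximate.

```python
import resource  # Unix-only; used to read the open-file limit


def full_mesh_sockets(num_procs: int) -> int:
    """Per-process OOB sockets when every process connects to every other
    process (the 1.2.x behaviour described above): one per peer, plus one
    to the local orted and one back to mpirun."""
    return (num_procs - 1) + 2


def routed_sockets(num_procs: int) -> int:
    """Per-process OOB sockets when traffic is routed through the local
    daemon (the 1.3.x behaviour described above). The job size does not
    matter; the argument is kept only for symmetry."""
    return 1


if __name__ == "__main__":
    num_procs = 2000      # job size from the original report
    procs_per_node = 4    # ASSUMPTION: not stated anywhere in this thread

    # Soft limit on open file descriptors for this process (sockets count).
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)

    for label, count in (("fully connected", full_mesh_sockets),
                         ("routed", routed_sockets)):
        per_proc = count(num_procs)
        per_node = per_proc * procs_per_node
        print(f"{label}: {per_proc} sockets per process, "
              f"{per_node} per node, "
              f"soft fd limit here is {soft_limit}")
```

Running this on a compute node (rather than the login node) gives a quick feel for how close the fully connected case gets to the per-process descriptor limit. Node-wide limits can also come into play and are not checked here.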
HTH
Ralph
On Apr 1, 2009, at 3:58 AM, Gabriele Fatigati wrote:
Dear OpenMPI developers,
I have a strange problem while running my application (2000
processors). I'm using openmpi 1.2.22 over InfiniBand. The following
is the mca-params.conf:
btl = ^tcp
btl_tcp_if_exclude = eth0,ib0,ib1
oob_tcp_include = eth1,lo,eth0
btl_openib_warn_default_gid_prefix = 0
btl_openib_ib_timeout = 20
At a certain point in my run, the application died with this message:
[node265:05593] [0,1,1679]-[0,1,1680] mca_oob_tcp_peer_try_connect:
connect to 10.161.12.14:36645 failed: Software caused connection abort
(103)
[node484:06545] [0,1,1617]-[0,1,1681] mca_oob_tcp_peer_try_connect:
connect to 10.161.12.14:36647 failed: Software caused connection abort
(103)
[node295:05394] [0,1,1649]-[0,1,1681] mca_oob_tcp_peer_try_connect:
connect to 10.161.12.14:36647 failed: Software caused connection abort
(103)
[node182:05579] [0,1,1673]-[0,1,1681] mca_oob_tcp_peer_try_connect:
connect to 10.161.12.14:36647 failed: Software caused connection abort
(103)
[node182:05579] [0,1,1673]-[0,1,1681] mca_oob_tcp_peer_try_connect:
connect to 10.161.12.14:36647 failed, connecting over all interfaces
failed!
My question is: does this error depend on some timeout? How can I solve it?
Thanks in advance.
--
Ing. Gabriele Fatigati
Parallel programmer
CINECA Systems & Tecnologies Department
Supercomputing Group
Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
www.cineca.it Tel: +39 051 6171722
g.fatigati [AT] cineca.it