Hi Ralph,

Unfortunately, I can't upgrade OpenMPI on this machine at the moment. Is there a way to avoid this error, or at least to reduce the probability of it occurring?
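As a first check of the diagnosis quoted below (running out of sockets), here is a minimal Python sketch that compares the per-process open-file limit with the number of sockets each process would need. It assumes a Unix node, and it assumes that the per-process descriptor limit is the relevant ceiling, which may not be the case here. The 2002 figure is taken from Ralph's message; the helper name `fd_headroom` and the `reserved` allowance for other open files are hypothetical.

```python
import resource  # Unix-only standard library module


def fd_headroom(soft_limit, sockets_needed, reserved=64):
    """Return how many descriptors are left after the job's needs.

    soft_limit     -- current per-process open-file limit
    sockets_needed -- out-of-band sockets each process opens
    reserved       -- hypothetical allowance for stdio, libraries, files
    A negative result means the limit is too low.
    """
    return soft_limit - reserved - sockets_needed


if __name__ == "__main__":
    # 2002 sockets per process is the figure quoted in the thread
    # for a 2000-process job under OMPI 1.2.x.
    needed = 2002
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        print("soft limit is unlimited; descriptors are not the bottleneck")
    else:
        print("soft limit:", soft, "hard limit:", hard)
        print("headroom:", fd_headroom(soft, needed))
```

This would need to run on a compute node inside the job environment, since limits there can differ from those on a login node. If the headroom is negative, raising the soft limit with `ulimit -n` before launching (up to the hard limit) might reduce the chance of the error, though I have not verified that on this system.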
2009/4/1 Ralph Castain <r...@lanl.gov>:
> Hi Gabriele
>
> I don't think this is a timeout issue. OMPI 1.2.x doesn't scale very well to
> that size due to a requirement that the underlying out-of-band system fully
> connect at the TCP level. Thus, every process in your job will be opening
> 2002 sockets (one to every other process, one to the local orted, and one
> back to mpirun). More than likely, you are simply running out of sockets on
> your nodes.
>
> For a job this size, I would recommend upgrading to OMPI 1.3.1. This uses a
> routing scheme for the out-of-band system, so each process only opens 1
> socket to its local daemon. Much more scalable, and I think it would solve
> this problem. It will also start much faster, as a bonus.
>
> HTH
> Ralph
>
>
> On Apr 1, 2009, at 3:58 AM, Gabriele Fatigati wrote:
>
>> Dear OpenMPI developers,
>> I have a strange problem while running my application (2000
>> processors). I'm using openmpi 1.2.22 over Infiniband. The following is
>> the mca-params.conf:
>>
>>
>> btl = ^tcp
>> btl_tcp_if_exclude = eth0,ib0,ib1
>> oob_tcp_include = eth1,lo,eth0
>> btl_openib_warn_default_gid_prefix = 0
>> btl_openib_ib_timeout = 20
>>
>> At a certain point in my run, the application died with this message:
>>
>> [node265:05593] [0,1,1679]-[0,1,1680] mca_oob_tcp_peer_try_connect:
>> connect to 10.161.12.14:36645 failed: Software caused connection abort
>> (103)
>> [node484:06545] [0,1,1617]-[0,1,1681] mca_oob_tcp_peer_try_connect:
>> connect to 10.161.12.14:36647 failed: Software caused connection abort
>> (103)
>> [node295:05394] [0,1,1649]-[0,1,1681] mca_oob_tcp_peer_try_connect:
>> connect to 10.161.12.14:36647 failed: Software caused connection abort
>> (103)
>> [node182:05579] [0,1,1673]-[0,1,1681] mca_oob_tcp_peer_try_connect:
>> connect to 10.161.12.14:36647 failed: Software caused connection abort
>> (103)
>> [node182:05579] [0,1,1673]-[0,1,1681] mca_oob_tcp_peer_try_connect:
>> connect to 10.161.12.14:36647 failed, connecting over all interfaces
>> failed!
>>
>> My question is: does this error depend on some timeout? How can I solve it?
>> Thanks in advance.
>>
>> --
>> Ing. Gabriele Fatigati
>>
>> Parallel programmer
>>
>> CINECA Systems & Tecnologies Department
>>
>> Supercomputing Group
>>
>> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
>>
>> www.cineca.it Tel: +39 051 6171722
>>
>> g.fatigati [AT] cineca.it
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Ing. Gabriele Fatigati

Parallel programmer

CINECA Systems & Tecnologies Department

Supercomputing Group

Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy

www.cineca.it Tel: +39 051 6171722

g.fatigati [AT] cineca.it