You might want to upgrade to 1.10.1, or at least to 1.8.8, as 1.6.5 is pretty old.
> On Nov 26, 2015, at 1:49 PM, Grigory Shamov <grigory.sha...@umanitoba.ca> wrote:
>
> Hi All,
>
> For a parallel MPI job, we sometimes (not always) get the following message:
>
> [n047:25850] [[36630,0],1] -> [[36630,0],0] (node: n230) oob-tcp: Number of attempts to create TCP connection has been exceeded. Can not communicate with peer
> [n047:25850] [[36630,0],1] ORTE_ERROR_LOG: Unreachable in file ../../../../../openmpi-1.6.5/orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 412
> [n047:25850] [[36630,0],1] -> [[36630,0],0] (node: n230) oob-tcp: Number of attempts to create TCP connection has been exceeded. Can not communicate with peer
>
> These appear in the middle of a running job; we use OpenMPI 1.6.5 and OFED 2.4 on CentOS 6.
>
> --
> Grigory Shamov
> HPC Analist,
> Westgrid/ComputeCanada Site Lead
> University of Manitoba
> E2-588 EITC Building,
> (204) 474-9625
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/11/28113.php