On Wed, Mar 04, 2009 at 04:02:06PM -0500, Jeff Squyres wrote:
> This *usually* indicates a physical / layer 0 problem in your IB
> fabric.  You should do a diagnostic on your HCAs, cables, and
> switches.
>
> Increasing the timeout value should only be necessary on very large
> IB fabrics and/or very congested networks.
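For anyone chasing the same RETRY EXCEEDED ERROR, a minimal fabric
check with the standard OFED diagnostics might look like the sketch
below.  These tools ship with the infiniband-diags/ibutils packages of
that era; exact options vary by release, and the <lid> and <port>
arguments are placeholders, not values from this thread.

    # Show local HCA port state, link width/speed, and firmware level.
    ibstat

    # Walk the subnet and report ports with non-zero error counters
    # (symbol errors, link downed, etc.); flaky cables and ports
    # usually show up here first.
    ibcheckerrors

    # Read the counters of one suspect port directly; <lid> and <port>
    # are placeholders for whatever ibcheckerrors flagged above.
    perfquery <lid> <port>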
Thanks Jeff!  What is considered to be a very large IB fabric?  I
assume that with just over 180 compute nodes, our cluster does not
fall into this category.

Jan

> On Mar 4, 2009, at 3:28 PM, Jan Lindheim wrote:
>
> > I found several reports on the Open MPI users mailing list from
> > users who need to bump up the default value for
> > btl_openib_ib_timeout.  We also have some applications on our
> > cluster that have problems unless we raise this value from the
> > default 10 to 15:
> >
> > [24426,1],122][btl_openib_component.c:2905:handle_wc] from shc174
> > to: shc175
> > error polling LP CQ with status RETRY EXCEEDED ERROR status number
> > 12 for wr_id 250450816 opcode 11048 qp_idx 3
> >
> > This is seen with Open MPI 1.3 and OpenFabrics 1.4.
> >
> > Is this normal, or is it an indicator of other problems, maybe
> > related to hardware?  Are there other parameters that need to be
> > looked at too?
> >
> > Thanks for any insight on this!
> >
> > Regards,
> > Jan Lindheim
>
> --
> Jeff Squyres
> Cisco Systems
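For completeness, the parameter under discussion can be raised per-run
or system-wide as sketched below.  This assumes the openib BTL of
Open MPI 1.3; the process count and application name are placeholders.
Per the InfiniBand spec the actual timeout is 4.096 us * 2^value, so
going from 10 to 15 waits 32x longer on each retry.

    # Raise the openib retry timeout from the default 10 to 15 for a
    # single run; -np 4 and ./my_app are placeholders.
    mpirun --mca btl_openib_ib_timeout 15 -np 4 ./my_app

    # Or set it for all jobs in the MCA parameter file, where <prefix>
    # is a placeholder for the Open MPI installation prefix:
    #   echo "btl_openib_ib_timeout = 15" >> <prefix>/etc/openmpi-mca-params.conf

    # Verify the parameter name and its current default:
    ompi_info --param btl openib | grep ib_timeout

As Jeff notes, though, raising the timeout only papers over retries;
if the counters from the fabric diagnostics above are climbing, fix
the hardware first.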