On Wed, Mar 04, 2009 at 04:02:06PM -0500, Jeff Squyres wrote:
> This *usually* indicates a physical / layer 0 problem in your IB
> fabric.  You should do a diagnostic on your HCAs, cables, and
> switches.
>
> Increasing the timeout value should only be necessary on very large
> IB fabrics and/or very congested networks.
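For anyone chasing the same RETRY EXCEEDED ERROR, a minimal fabric
check with the standard OFED diagnostics might look like the sketch
below.  These tools ship with the infiniband-diags/ibutils packages of
that era; exact options vary by release, and the <lid> and <port>
arguments are placeholders, not values from this thread.

    # Show local HCA port state, link width/speed, and firmware level.
    ibstat

    # Walk the subnet and report ports with non-zero error counters
    # (symbol errors, link downed, etc.); flaky cables and ports
    # usually show up here first.
    ibcheckerrors

    # Read the counters of one suspect port directly; <lid> and <port>
    # are placeholders for whatever ibcheckerrors flagged above.
    perfquery <lid> <port>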
Thanks Jeff!  What is considered to be a very large IB fabric?  I
assume that with just over 180 compute nodes, our cluster does not
fall into this category.

Jan

> On Mar 4, 2009, at 3:28 PM, Jan Lindheim wrote:
>
> > I found several reports on the Open MPI users mailing list from
> > users who need to bump up the default value for
> > btl_openib_ib_timeout.  We also have some applications on our
> > cluster that have problems unless we raise this value from the
> > default 10 to 15:
> >
> > [24426,1],122][btl_openib_component.c:2905:handle_wc] from shc174
> > to: shc175
> > error polling LP CQ with status RETRY EXCEEDED ERROR status number
> > 12 for wr_id 250450816 opcode 11048 qp_idx 3
> >
> > This is seen with Open MPI 1.3 and OpenFabrics 1.4.
> >
> > Is this normal, or is it an indicator of other problems, maybe
> > related to hardware?  Are there other parameters that need to be
> > looked at too?
> >
> > Thanks for any insight on this!
> >
> > Regards,
> > Jan Lindheim
>
> --
> Jeff Squyres
> Cisco Systems
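For completeness, the parameter under discussion can be raised per-run
or system-wide as sketched below.  This assumes the openib BTL of
Open MPI 1.3; the process count and application name are placeholders.
Per the InfiniBand spec the actual timeout is 4.096 us * 2^value, so
going from 10 to 15 waits 32x longer on each retry.

    # Raise the openib retry timeout from the default 10 to 15 for a
    # single run; -np 4 and ./my_app are placeholders.
    mpirun --mca btl_openib_ib_timeout 15 -np 4 ./my_app

    # Or set it for all jobs in the MCA parameter file, where <prefix>
    # is a placeholder for the Open MPI installation prefix:
    #   echo "btl_openib_ib_timeout = 15" >> <prefix>/etc/openmpi-mca-params.conf

    # Verify the parameter name and its current default:
    ompi_info --param btl openib | grep ib_timeout

As Jeff notes, though, raising the timeout only papers over retries;
if the counters from the fabric diagnostics above are climbing, fix
the hardware first.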