Re: [OMPI users] RETRY EXCEEDED ERROR

Jan Lindheim Wed, 4 Mar 2009 16:46:00 -0500

On Wed, Mar 04, 2009 at 04:34:49PM -0500, Jeff Squyres wrote:
> On Mar 4, 2009, at 4:16 PM, Jan Lindheim wrote:
> 
> >On Wed, Mar 04, 2009 at 04:02:06PM -0500, Jeff Squyres wrote:
> >> This *usually* indicates a physical / layer 0 problem in your IB
> >> fabric.  You should do a diagnostic on your HCAs, cables, and
> >> switches.
> >>
> >> Increasing the timeout value should only be necessary on very
> >> large IB fabrics and/or very congested networks.
> >
> >Thanks Jeff!
> >What is considered to be very large IB fabrics?
> >I assume that with just over 180 compute nodes,
> >our cluster does not fall into this category.
> >
> 
> I was a little misleading in my note -- I should clarify.  It's really  
> congestion that matters, not the size of the fabric.  Congestion is  
> potentially more likely to happen in larger fabrics, since packets may  
> have to flow through more switches, there's likely more apps running  
> on the cluster, etc.  But it's all very application/cluster-specific;  
> only you can know if your fabric is heavily congested based on what  
> you run on it, etc.
> 
> -- 
> Jeff Squyres
> Cisco Systems


Thanks again Jeff!
Time to dig up diagnostics tools and look at port statistics.
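
For reference, here is a rough sketch of what the timeout value discussed above
translates to in wall-clock time. I am assuming the value in question is the
exponent Open MPI exposes as the btl_openib_ib_timeout MCA parameter (this part
of the thread does not name it), and that the InfiniBand local ACK timeout is
4.096 microseconds * 2^exponent. The function name is made up for illustration.

```python
# Sketch only: converts an InfiniBand local ACK timeout exponent into seconds.
# Assumption: the "timeout value" in this thread is such an exponent
# (e.g. Open MPI's btl_openib_ib_timeout). The helper name is hypothetical.

IB_TIMEOUT_BASE_SECONDS = 4.096e-6  # 4.096 microseconds


def ib_ack_timeout_seconds(exponent: int) -> float:
    """Return 4.096 us * 2**exponent, expressed in seconds."""
    # The field is 5 bits wide; 0 is excluded here to keep the sketch simple.
    if not 1 <= exponent <= 31:
        raise ValueError("exponent must be between 1 and 31")
    return IB_TIMEOUT_BASE_SECONDS * (2 ** exponent)


if __name__ == "__main__":
    for exponent in (10, 14, 20):
        seconds = ib_ack_timeout_seconds(exponent)
        print(f"exponent {exponent:2d}: {seconds:.6f} s per attempt")
```

Each step of the exponent doubles the wait per attempt, so raising it mostly
hides a flaky link rather than fixing it, which is why checking the port
counters comes first.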

Jan
