Re: [OMPI users] RETRY EXCEEDED ERROR

Jeff Squyres Wed, 4 Mar 2009 16:34:56 -0500

On Mar 4, 2009, at 4:16 PM, Jan Lindheim wrote:

On Wed, Mar 04, 2009 at 04:02:06PM -0500, Jeff Squyres wrote:
> This *usually* indicates a physical / layer 0 problem in your IB
> fabric. You should do a diagnostic on your HCAs, cables, and switches.
> Increasing the timeout value should only be necessary on very large IB
> fabrics and/or very congested networks.


Thanks Jeff!
What is considered to be very large IB fabrics?
I assume that with just over 180 compute nodes,
our cluster does not fall into this category.

I was a little misleading in my note -- I should clarify. It's really
congestion that matters, not the size of the fabric. Congestion is
potentially more likely to happen in larger fabrics, since packets may
have to flow through more switches, there's likely more apps running
on the cluster, etc. But it's all very application/cluster-specific;
only you can know if your fabric is heavily congested based on what
you run on it, etc.
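
For reference, here is a rough sketch of what "increasing the timeout
value" means in wall-clock terms. It assumes the value in question is
Open MPI's btl_openib_ib_timeout MCA parameter, which this thread does
not name. The openib BTL help text gives the per-attempt timeout as
4.096 microseconds * 2^btl_openib_ib_timeout. The function name and the
sample exponents below are illustrative only.

```python
# Sketch: convert an InfiniBand local ACK timeout exponent to seconds.
# Assumption (not stated in the thread): the exponent corresponds to
# Open MPI's btl_openib_ib_timeout MCA parameter, where
#     timeout = 4.096 microseconds * 2**btl_openib_ib_timeout

BASE_US = 4.096  # base unit in microseconds


def ack_timeout_seconds(exponent: int) -> float:
    """Return the per-attempt ACK timeout in seconds for an exponent.

    The exponent is a 5-bit field; 0 has a special meaning and is
    deliberately excluded from this sketch.
    """
    if not 1 <= exponent <= 31:
        raise ValueError("exponent must be in the range 1..31")
    return BASE_US * (2 ** exponent) * 1e-6


if __name__ == "__main__":
    # Hypothetical sample values, chosen only to show the growth rate.
    for exp in (10, 14, 20):
        print(f"exponent {exp:2d}: {ack_timeout_seconds(exp):.6f} s")
```

Each step of the exponent doubles the wait per attempt, so a small
increase goes a long way. If that assumption about the parameter holds,
it would be set on the command line with something like
"mpirun --mca btl_openib_ib_timeout <value> ...".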


--
Jeff Squyres
Cisco Systems
