On Mar 4, 2009, at 4:16 PM, Jan Lindheim wrote:

On Wed, Mar 04, 2009 at 04:02:06PM -0500, Jeff Squyres wrote:
> This *usually* indicates a physical / layer 0 problem in your IB
> fabric. You should do a diagnostic on your HCAs, cables, and switches.
>
> Increasing the timeout value should only be necessary on very large IB
> fabrics and/or very congested networks.

Thanks Jeff!
What is considered a very large IB fabric?
I assume that with just over 180 compute nodes,
our cluster does not fall into this category.


I was a little misleading in my note, so let me clarify: it's really congestion that matters, not the size of the fabric. Congestion is more likely in larger fabrics, since packets may have to cross more switches and there are likely more applications running on the cluster. But it's all very application- and cluster-specific; only you can judge whether your fabric is heavily congested, based on what you run on it.

--
Jeff Squyres
Cisco Systems
