On Mar 4, 2009, at 4:16 PM, Jan Lindheim wrote:
> On Wed, Mar 04, 2009 at 04:02:06PM -0500, Jeff Squyres wrote:
>> This *usually* indicates a physical / layer 0 problem in your IB
>> fabric. You should do a diagnostic on your HCAs, cables, and switches.
>>
>> Increasing the timeout value should only be necessary on very large IB
>> fabrics and/or very congested networks.
>
> Thanks Jeff!
> What is considered to be very large IB fabrics?
> I assume that with just over 180 compute nodes,
> our cluster does not fall into this category.

I was a little misleading in my note -- I should clarify. It's really
congestion that matters, not the size of the fabric. Congestion is
potentially more likely to happen in larger fabrics, since packets may
have to flow through more switches, there are likely more apps running
on the cluster, etc. But it's all very application/cluster-specific;
only you can know whether your fabric is heavily congested, based on
what you run on it.
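
For a rough sense of what raising the timeout means, here is a minimal
sketch. It assumes the value under discussion is the InfiniBand local ACK
timeout exponent (Open MPI exposes it as the btl_openib_ib_timeout MCA
parameter; that name is not stated in this thread), where the wait per
retry is 4.096 microseconds * 2^value. The function name is hypothetical.

```python
# Sketch only: assumes the "timeout value" is the InfiniBand local ACK
# timeout exponent, so the wait per retry is 4.096 us * 2**exponent.
BASE_US = 4.096  # base unit, in microseconds


def ack_timeout_seconds(exponent: int) -> float:
    """Return 4.096 us * 2**exponent, converted to seconds."""
    # Assumed valid range: the exponent is a 5-bit field, and 0 is
    # treated specially, so this sketch only accepts 1..31.
    if not 1 <= exponent <= 31:
        raise ValueError("exponent must be in 1..31")
    return BASE_US * (2 ** exponent) / 1_000_000


if __name__ == "__main__":
    # Each increment of the exponent doubles the wait per retry.
    for exp in (10, 14, 20):
        print(f"exponent {exp}: {ack_timeout_seconds(exp):.6f} s per retry")
```

Because each step doubles the wait, a few increments are a large change,
which is one reason to rule out hardware problems before touching it.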
--
Jeff Squyres
Cisco Systems