On Wed, Mar 04, 2009 at 04:34:49PM -0500, Jeff Squyres wrote:
> On Mar 4, 2009, at 4:16 PM, Jan Lindheim wrote:
>
> >On Wed, Mar 04, 2009 at 04:02:06PM -0500, Jeff Squyres wrote:
> >> This *usually* indicates a physical / layer 0 problem in your IB
> >> fabric. You should do a diagnostic on your HCAs, cables, and
> >> switches.
> >>
> >> Increasing the timeout value should only be necessary on very
> >> large IB fabrics and/or very congested networks.
> >
> >Thanks Jeff!
> >What is considered to be very large IB fabrics?
> >I assume that with just over 180 compute nodes,
> >our cluster does not fall into this category.
> >
>
> I was a little misleading in my note -- I should clarify. It's really
> congestion that matters, not the size of the fabric. Congestion is
> potentially more likely to happen in larger fabrics, since packets may
> have to flow through more switches, there's likely more apps running
> on the cluster, etc. But it's all very application/cluster-specific;
> only you can know if your fabric is heavily congested based on what
> you run on it, etc.
>
> --
> Jeff Squyres
> Cisco Systems
>
Thanks again Jeff!
Time to dig up diagnostic tools and look at port statistics.

Jan