What value do you suggest then? I know I've seen the problem persist at values of 14 and 16, and would rather be certain that this isn't going to kill the job that just sat in the queue for a week.

Andrew

Jeff Squyres wrote:
Roland thought that the default value of 10 might be a bit too low and that tuning it to be higher, particularly in apps that pound on a single port, would probably be acceptable.

Tuning up to 20 is probably a bit overkill.



On Nov 27, 2007, at 3:54 PM, Jeff Squyres wrote:

BTW, Andrew is correct about the unit for btl_openib_ib_timeout and that the value is simply passed down to the verbs library when making an IB connection. Open MPI does nothing else with that value; it's an IBTA-defined value.

The help message was wrong on the 1.2 branch for a while; I think it's been corrected in more recent versions of OMPI (i.e., >1.2 -- I don't recall which version specifically).


On Nov 27, 2007, at 3:19 PM, Andrew Friedley wrote:


Brock Palen wrote:
What would be a place to look? Should this just be the default then for
OMPI? ompi_info shows the default as 10 seconds? Is that right,
'seconds'?
The other IB guys can probably answer better than I can -- I'm not an
expert in this part of IB (or really any part I guess :).  Not sure why
a larger value isn't the default.  No, it's not seconds -- check the
description of the MCA parameter:

4.096 microseconds * (2^btl_openib_ib_timeout)
You sure?
ompi_info --param btl openib

MCA btl: parameter "btl_openib_ib_timeout" (current value: "10")
                         InfiniBand transmit timeout, in seconds (must be >= 1)
Yeah:

MCA btl: parameter "btl_openib_ib_timeout" (current value: "10")
        InfiniBand transmit timeout, plugged into formula:
        4.096 microseconds * (2^btl_openib_ib_timeout)
        (must be >= 0 and <= 31)
Reading earlier in the thread, you said OMPI v1.2.0; I got this from a
trunk checkout that's around 3 weeks old.  A quick check shows this
description was changed between 1.2.0 and 1.2.1.  However, the use of
this parameter hasn't changed -- it's simply passed along to IB verbs
when creating a queue pair (aka a connection).
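
To put numbers on the formula in the parameter description, here is a minimal sketch that evaluates 4.096 microseconds * (2^btl_openib_ib_timeout) for the values mentioned in this thread. The function name ib_timeout_seconds is made up for illustration and is not part of Open MPI or the verbs library. The sketch applies the formula literally and does not model any special meaning a particular value may have further down the stack.

```python
# Sketch only: evaluates the formula quoted in the MCA parameter description.
# ib_timeout_seconds is a hypothetical helper, not an Open MPI or verbs API.

BASE_US = 4.096  # microseconds, taken from the parameter description


def ib_timeout_seconds(exponent: int) -> float:
    """Return 4.096 us * 2^exponent, converted to seconds."""
    # Range check mirrors the help text: must be >= 0 and <= 31.
    if not 0 <= exponent <= 31:
        raise ValueError("btl_openib_ib_timeout must be between 0 and 31")
    return BASE_US * (2 ** exponent) / 1_000_000


if __name__ == "__main__":
    # Values discussed in the thread: the default (10), 14, 16, and 20.
    for value in (10, 14, 16, 20):
        print(f"{value:2d} -> {ib_timeout_seconds(value):.6f} s")
```

By this formula the default of 10 works out to roughly 4.2 milliseconds, 14 to about 67 milliseconds, 16 to about 268 milliseconds, and 20 to about 4.3 seconds, so each step up doubles the timeout.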

Andrew

--
Jeff Squyres
Cisco Systems