In the upcoming 1.5 series, we will introduce a new "sensor" framework to help 
resolve such issues. If requested, it will automatically track, among other 
things, the size of a sentinel file, CPU usage, and memory footprint, and it 
will terminate the job if any of them violates a user-specified limit (e.g., 
the file doesn't grow fast enough, or memory grows too large).
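
To make the sentinel-file idea concrete, here is a rough C sketch of a "file 
isn't growing fast enough" check. It is not the sensor framework's code; the 
names sentinel_state and sentinel_check, and the min_growth and max_stalled 
parameters, are made up for illustration. It assumes a POSIX system and that 
something calls the check at a fixed interval.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical bookkeeping for one sentinel file (not an Open MPI type). */
typedef struct {
    off_t last_size;  /* size seen at the previous check */
    int   stalled;    /* consecutive checks without enough growth */
} sentinel_state;

/*
 * Call once per sampling interval. Returns true when the file has failed to
 * grow by at least min_growth bytes for max_stalled consecutive checks, which
 * a watchdog could treat as "this job looks hung".
 * A file that cannot be stat()ed counts as a stalled check.
 */
bool sentinel_check(const char *path, sentinel_state *st,
                    off_t min_growth, int max_stalled)
{
    struct stat sb;

    assert(max_stalled > 0);

    if (stat(path, &sb) != 0) {
        st->stalled++;
    } else {
        if (sb.st_size - st->last_size >= min_growth)
            st->stalled = 0;      /* enough progress since the last check */
        else
            st->stalled++;
        st->last_size = sb.st_size;
    }
    return st->stalled >= max_stalled;
}
```

A monitor would keep one sentinel_state per job, initialized to {0, 0}, and 
kill the job the first time sentinel_check returns true.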

Backing off the polling rate requires more application-specific logic, like 
that offered below, so it is a little difficult for us to implement at the MPI 
library level. I'm not saying we never will; it's just not clear that anyone 
quite knows how to do it in a generalized form.
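
For what it's worth, the application-level version of "spin for a while, then 
back off" can be sketched in plain C roughly as follows. This is only an 
illustration under my own assumptions, not anything in Open MPI: 
wait_with_backoff, poll_fn, and countdown_done are hypothetical names, and in 
a real MPI code the callback would presumably wrap a nonblocking test such as 
MPI_Test or MPI_Iprobe.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for nanosleep() */
#endif
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback: returns true once the awaited event has happened. */
typedef bool (*poll_fn)(void *ctx);

/*
 * Poll done(ctx) up to max_polls times. The first spin_iters polls busy-spin;
 * after that we sleep between polls, doubling the sleep from first_sleep_ns
 * up to max_sleep_ns, so a hung job's CPU usage visibly drops off.
 * Returns the number of polls performed, or -1 if max_polls ran out.
 */
long wait_with_backoff(poll_fn done, void *ctx, long spin_iters,
                       long first_sleep_ns, long max_sleep_ns, long max_polls)
{
    long sleep_ns = first_sleep_ns;

    assert(first_sleep_ns > 0 && max_sleep_ns >= first_sleep_ns);

    for (long polls = 1; polls <= max_polls; polls++) {
        if (done(ctx))
            return polls;
        if (polls <= spin_iters)
            continue;                       /* still in the busy-spin phase */

        struct timespec ts = { .tv_sec  = sleep_ns / 1000000000L,
                               .tv_nsec = sleep_ns % 1000000000L };
        nanosleep(&ts, NULL);

        if (sleep_ns < max_sleep_ns) {      /* exponential backoff, capped */
            sleep_ns *= 2;
            if (sleep_ns > max_sleep_ns)
                sleep_ns = max_sleep_ns;
        }
    }
    return -1;
}

/* Toy predicate for trying it out: reports done after *ctx polls. */
static bool countdown_done(void *ctx)
{
    int *remaining = ctx;
    return --(*remaining) <= 0;
}
```

The hard part for the library is choosing spin_iters and the sleep limits 
without knowing the application's communication pattern, which is why this is 
easier to do in user code.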


On Sep 2, 2010, at 7:46 PM, Douglas Guptill wrote:

> Hi David:
> 
> On Fri, Sep 03, 2010 at 10:50:02AM +1000, David Singleton wrote:
>> 
>> I'm sure this has been discussed before but having watched hundreds of
>> thousands of cpuhrs being wasted by difficult-to-detect hung jobs, I'd
>> be keen to know why there isn't some sort of "spin-wait backoff" option.
>> For example, a way to specify spin-wait for x seconds/cycles/iterations
>> then backoff to lighter and lighter cpu usage.  At least that way, hung
>> jobs would become self-evident.
>> 
>> Maybe there is already some way of doing this?
> 
> For my solution to this, see
> 
>  http://www.open-mpi.org/community/lists/users/2010/07/13731.php
> 
> HTH,
> Douglas.
> -- 
>  Douglas Guptill                       voice: 902-461-9749
>  Research Assistant, LSC 4640          email: douglas.gupt...@dal.ca
>  Oceanography Department               fax:   902-494-3877
>  Dalhousie University
>  Halifax, NS, B3H 4J1, Canada
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

