In the upcoming 1.5 series, we will introduce a new "sensor" framework to help resolve such issues. Among other things, it will automatically track (if requested) the size of a sentinel file, CPU usage, and memory footprint, and will terminate the job if any of them violates a user-specified limit (e.g., the file doesn't grow fast enough, or memory grows too large).
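To make the sentinel-file idea concrete, here is a rough application-level sketch in C of a "did the file grow enough since the last check" test. It is not the sensor framework's code or API; the names file_size and sentinel_progressed, and the return-code convention, are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: size of the file in bytes, or -1 if it cannot be read. */
long file_size(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    long size = -1;
    if (fseek(f, 0, SEEK_END) == 0)
        size = ftell(f);
    fclose(f);
    return size;
}

/* Hypothetical check, meant to be called once per sampling interval.
 * Returns 1 if the file grew by at least min_growth bytes since the
 * previous sample, 0 if it did not (a possible hang), and -1 if the
 * file could not be read. *last_size is updated on every successful read. */
int sentinel_progressed(const char *path, long *last_size, long min_growth)
{
    assert(path != NULL && last_size != NULL);
    long now = file_size(path);
    if (now < 0)
        return -1;
    int grew_enough = (now - *last_size) >= min_growth;
    *last_size = now;
    return grew_enough;
}
```

A watchdog would call sentinel_progressed at a fixed interval and kill the job after some number of consecutive 0 results; how many to tolerate is a policy choice this sketch leaves out.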
Backing off the polling rate requires more application-specific logic like that offered below, so it is a little difficult for us to implement at the MPI library level. Not saying we eventually won't - just not sure anyone quite knows how to do so in a generalized form.

On Sep 2, 2010, at 7:46 PM, Douglas Guptill wrote:

> Hi David:
>
> On Fri, Sep 03, 2010 at 10:50:02AM +1000, David Singleton wrote:
>>
>> I'm sure this has been discussed before but having watched hundreds of
>> thousands of cpuhrs being wasted by difficult-to-detect hung jobs, I'd
>> be keen to know why there isn't some sort of "spin-wait backoff" option.
>> For example, a way to specify spin-wait for x seconds/cycles/iterations
>> then backoff to lighter and lighter cpu usage. At least that way, hung
>> jobs would become self-evident.
>>
>> Maybe there is already some way of doing this?
>
> For my solution to this, see
>
> http://www.open-mpi.org/community/lists/users/2010/07/13731.php
>
> HTH,
> Douglas.
> --
> Douglas Guptill                       voice: 902-461-9749
> Research Assistant, LSC 4640          email: douglas.gupt...@dal.ca
> Oceanography Department               fax:   902-494-3877
> Dalhousie University
> Halifax, NS, B3H 4J1, Canada
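For readers who want a picture of the "spin-wait, then back off" pattern David describes, here is a minimal sketch in plain C. It is not Douglas's solution from the linked post and not anything inside Open MPI; the names poll_fn, countdown_done, and wait_with_backoff are hypothetical, and the doubling schedule is just one possible choice. In an MPI program the predicate could wrap a call such as MPI_Test on an outstanding request, but that wiring is application-specific and not shown.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical predicate type: returns true once the awaited event happened. */
typedef bool (*poll_fn)(void *ctx);

/* Example predicate for demonstration: ctx points to an int countdown,
 * and the event "happens" on the poll that brings it to zero. */
bool countdown_done(void *ctx)
{
    int *remaining = ctx;
    return --(*remaining) <= 0;
}

/* Poll `done` up to max_polls times. The first spin_iters polls are a
 * tight busy-wait; after that the caller sleeps between polls, doubling
 * the sleep from min_sleep_ns up to max_sleep_ns so that a stalled
 * wait uses less and less CPU.
 * Returns the number of polls made, or -1 if max_polls was exhausted. */
long wait_with_backoff(poll_fn done, void *ctx, long spin_iters,
                       long min_sleep_ns, long max_sleep_ns, long max_polls)
{
    assert(done != NULL);
    assert(min_sleep_ns > 0 && max_sleep_ns >= min_sleep_ns);

    long sleep_ns = min_sleep_ns;
    for (long polls = 1; polls <= max_polls; ++polls) {
        if (done(ctx))
            return polls;
        if (polls <= spin_iters)
            continue;                       /* still in the busy-spin phase */

        struct timespec ts;
        ts.tv_sec = sleep_ns / 1000000000L;
        ts.tv_nsec = sleep_ns % 1000000000L;
        nanosleep(&ts, NULL);               /* yield the CPU between polls */

        sleep_ns *= 2;                      /* back off, capped at the maximum */
        if (sleep_ns > max_sleep_ns)
            sleep_ns = max_sleep_ns;
    }
    return -1;
}
```

With something like this, a healthy job pays only a small latency cost once it leaves the spin phase, while a hung job drops to near-zero CPU usage and becomes visible in ordinary monitoring. Picking spin_iters and the sleep bounds is exactly the application-specific part.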