LSF uses "exit rate" for that, but in SGE the load sensor has no knowledge of jobs running & exiting.
The way to do it in SGE (Open Grid Scheduler, Son of Grid Engine, etc.) is to record the exit rate in a job starter (aka starter method). If anyone has written one already, I would like to put it up on the Open Grid Scheduler howto page, as it is a nice feature for users migrating from LSF. The starter method should be really simple: just record the exit times of the last few jobs and calculate the rate of exit. If the rate is too high, disable the host.

http://gridscheduler.sourceforge.net/htmlman/htmlman5/queue_conf.html "starter_method"

Rayson

On Thu, Mar 10, 2011 at 11:47 AM, Reuti <[email protected]> wrote:
> Well, the feature to use Hawking radiation to allow the jobs to pop up on
> other nodes needs precise alignment of the installation - SCNR
>
> There is a demo script to check the size of e.g. /tmp here:
> http://arc.liv.ac.uk/SGE/howto/loadsensor.html and then use "load_thresholds
> tmpfree=1G" in the queue definition, so that the queue instance is set to
> alarm state in case it falls below a certain value.
>
> A load sensor can also deliver a boolean value, hence checking locally
> something like "all disks fine" and using this as a "load_threshold" can
> also be a solution. How to check something is of course specific to your
> node setup.
>
> The last necessary piece would be to inform the admin: this could be done
> by the load sensor too, but as the node is known not to be in a proper
> state I wouldn't recommend this. Better might be a cron job on the qmaster
> machine checking `qstat -explain a -qs a -u foobar` *) to look for passed
> load thresholds.
>
> -- Reuti
>
> *) There is no "show no jobs at all" switch to `qstat`, so using an
> unknown user "foobar" will help. And OTOH there is no "load_threshold" in
> the exechost definition.
>
>> -Ed
>>
>> Hi,
>>
>> On 10.03.2011 at 16:50, Edward Lauzier wrote:
>>
>> > I'm looking for best practices and techniques to detect blackhole hosts
>> > quickly and disable them.
>> > ( Platform LSF has this already built in...)
>> >
>> > What I see is possible is:
>> >
>> > Using a cron job on a GE client node...
>> >
>> > - tail -n 1000 -f <qmaster_messages_file> | egrep '<for_desired_string>'
>> > - if detected, use qmod -d '<queue_instance>' to disable
>> > - send email to ge_admin list
>> > - possibly send email of failed jobs to user(s)
>> >
>> > Must be robust enough to time out properly when GE is down or too busy
>> > for qmod to respond... and/or filesystem problems, etc...
>> >
>> > ( perl or php alarm and sig handlers for proc_open work well for
>> > enforcing timeouts...)
>> >
>> > Any hints would be appreciated before I start on it...
>> >
>> > Won't take long to write the code, just looking for best practices and
>> > maybe a setting I'm missing in the GE config...
>>
>> What is causing the blackhole? For example: if it's a full file system on
>> a node, you could detect it with a load sensor in SGE and define an alarm
>> threshold in the queue setup, so that no more jobs are scheduled to this
>> particular node.
>>
>> -- Reuti
>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
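
The starter method Rayson describes could be sketched roughly as below. This is a hypothetical sh sketch, not an existing script: the log path, the 60-second window, and the threshold of 5 exits are arbitrary assumptions. Also, because a starter method runs as the job owner (who normally cannot run `qmod -d`), the sketch drops a marker file for a privileged cron job or a load sensor to act on, rather than disabling the host directly.

```shell
#!/bin/sh
# Hypothetical starter-method sketch: record the wall-clock exit time of
# each job on this host and flag the host when too many jobs finish within
# a short window -- a likely sign of a blackhole host.
# LOG path, WINDOW, and MAX_EXITS are arbitrary assumptions.

LOG=${BLACKHOLE_LOG:-/tmp/sge_job_exits.$(hostname)}
WINDOW=60      # look-back window in seconds
MAX_EXITS=5    # more exits than this within WINDOW => suspicious

# Run the real job; SGE passes the job script (and its args) to the starter.
"$@"
rc=$?

# Append this job's exit timestamp to the per-host log.
now=$(date +%s)
echo "$now" >> "$LOG"

# Count exits recorded within the last WINDOW seconds.
recent=$(awk -v now="$now" -v w="$WINDOW" '$1 >= now - w' "$LOG" | wc -l)

if [ "$recent" -gt "$MAX_EXITS" ]; then
    # The job owner usually lacks the rights for `qmod -d`, so leave a
    # marker for a root cron job or a load sensor to act on instead.
    touch "/tmp/sge_blackhole_suspect.$(hostname)"
fi

# Preserve the job's own exit status.
exit $rc
```

A queue instance would then point at the script via the `starter_method` parameter in `qconf -mq <queue>`, per the queue_conf(5) page linked above; pairing the marker file with a boolean load sensor, as Reuti suggests, would let the normal alarm-threshold machinery take the host out of scheduling.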
