On 10.03.2011, at 17:59, Rayson Ho wrote:

> LSF uses "exit rate" for that, but in SGE the load sensor has no
> knowledge of jobs running & exiting.
>
> The way to do it in SGE (Open Grid Scheduler & Son of Grid Engine,
> etc.) is to record the exit rate in a job starter (aka starter method).
>
> And if anyone has written one already, I would like to put it up on
> the Open Grid Scheduler howto page, as it is a nice feature for users
> migrating from LSF.
>
> The starter method should be really simple: just record the exit times
> of the last few jobs and calculate the rate of exits. If the rate is
> too high, disable the host.
>
> http://gridscheduler.sourceforge.net/htmlman/htmlman5/queue_conf.html
>
> "starter_method"
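The starter-method approach Rayson outlines could be sketched roughly as below. This is a hypothetical script, not a tested starter method: the log path, window, threshold, and marker-file convention are all illustrative assumptions. Note also that `qmod -d` needs operator rights, which the job user running the starter normally lacks, so this sketch only drops a marker file for a privileged cron job to act on.

```shell
#!/bin/sh
# Hypothetical starter method: run the job, log its exit time, and flag
# the host when too many jobs exit within a short window.

LOG="${TMPDIR:-/tmp}/sge_exit_times"
WINDOW=60       # seconds to look back (assumed value)
MAXEXITS=5      # more exits than this per window is suspicious (assumed)

recent_exits() {
    # Count the timestamps in file $1 that fall within $3 seconds of $2.
    awk -v now="$2" -v w="$3" 'now - $1 <= w { c++ } END { print c + 0 }' "$1"
}

starter() {
    "$@"                      # run the actual job command line
    rc=$?
    now=$(date +%s)
    echo "$now" >> "$LOG"
    if [ "$(recent_exits "$LOG" "$now" "$WINDOW")" -gt "$MAXEXITS" ]; then
        # A job user cannot call qmod -d, so only leave a marker file
        # for a root cron job to inspect and act on.
        touch "${TMPDIR:-/tmp}/sge_blackhole_suspect"
    fi
    exit $rc
}

# Only act as a starter when given a job command line (this also keeps
# the functions sourceable on their own).
[ $# -gt 0 ] && starter "$@"
```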
Isn't it already too late when the "starter_method" is started? I mean, when no job information can be written (e.g. to the spool area), it will never get executed but will still trash the job.

-- Reuti

> Rayson
>
>
> On Thu, Mar 10, 2011 at 11:47 AM, Reuti <[email protected]> wrote:
>> Well, the feature of using Hawking radiation to let the jobs pop up
>> on other nodes needs precise alignment of the installation - SCNR
>>
>> There is a demo script to check the free space of e.g. /tmp at
>> http://arc.liv.ac.uk/SGE/howto/loadsensor.html; then use "load_thresholds
>> tmpfree=1G" in the queue definition, so that the queue instance is set to
>> alarm state in case the value falls below a certain level.
>>
>> A load sensor can also deliver a boolean value, hence checking locally
>> something like "all disks fine" and using this as a "load_threshold" can
>> also be a solution. How to check something is of course specific to your
>> node setup.
>>
>> The last necessary piece would be to inform the admin: this could be done
>> by the load sensor too, but as the node is known not to be in a proper
>> state I wouldn't recommend this. Better might be a cron job on the qmaster
>> machine checking `qstat -explain a -qs a -u foobar` *) to look for
>> exceeded load thresholds.
>>
>> -- Reuti
>>
>> *) There is no "show no jobs at all" switch to `qstat`, so using an
>> unknown user "foobar" helps. And OTOH there is no "load_threshold" in the
>> exechost definition.
>>
>>
>>> -Ed
>>>
>>>
>>> Hi,
>>>
>>> On 10.03.2011, at 16:50, Edward Lauzier wrote:
>>>
>>>> I'm looking for best practices and techniques to detect blackhole hosts
>>>> quickly and disable them. (Platform LSF has this already built in...)
>>>>
>>>> What I see is possible is:
>>>>
>>>> Using a cron job on a GE client node...
>>>>
>>>> - tail -n 1000 <qmaster_messages_file> | egrep '<for_desired_string>'
>>>> - if detected, use qmod -d '<queue_instance>' to disable it
>>>> - send email to the ge_admin list
>>>> - possibly send email about the failed jobs to the user(s)
>>>>
>>>> It must be robust enough to time out properly when GE is down or too
>>>> busy for qmod to respond, and/or on filesystem problems, etc.
>>>>
>>>> (Perl or PHP alarm and signal handlers around proc_open work well for
>>>> enforcing timeouts...)
>>>>
>>>> Any hints would be appreciated before I start on it...
>>>>
>>>> It won't take long to write the code; I'm just looking for best
>>>> practices and maybe a setting I'm missing in the GE config...
>>>
>>> What is causing the blackhole? For example: if it's a full file system
>>> on a node, you could detect it with a load sensor in SGE and define an
>>> alarm threshold in the queue setup, so that no more jobs are scheduled
>>> to this particular node.
>>>
>>> -- Reuti
>>>
>>
>>
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
>>
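The load sensor Reuti points to can be sketched as below, after the pattern of the demo at http://arc.liv.ac.uk/SGE/howto/loadsensor.html. This is an untested sketch: it reports free space in /tmp under the complex name "tmpfree", which must already exist in the complex configuration with matching type and units, both assumptions here.

```shell
#!/bin/sh
# Load sensor sketch: report free /tmp space as "tmpfree".
# Protocol: execd writes a line to stdin for each load interval; the
# sensor answers with a begin/values/end block; "quit" ends the sensor.

HOST=$(hostname)

tmpfree_kb() {
    # Free kilobytes on the filesystem holding the given path.
    df -P -k "$1" | awk 'NR == 2 { print $4 }'
}

sensor_loop() {
    while read -r line; do
        [ "$line" = "quit" ] && exit 0
        echo "begin"
        echo "$HOST:tmpfree:$(tmpfree_kb /tmp)k"
        echo "end"
    done
}

# Run the loop only when invoked as a sensor, so the functions can be
# sourced or exercised without blocking on stdin.
[ "${1:-}" = "run" ] && sensor_loop
```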
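The cron-driven watcher Edward describes (tail the qmaster messages file, disable the queue instance, mail the admins) could look roughly like this. The messages path, the error pattern, the queue-instance extraction, and the admin address are all site-specific assumptions, and the message format varies between Grid Engine versions; treat this as a starting point, not a finished monitor.

```shell
#!/bin/sh
# Sketch of a cron job scanning the qmaster messages file for a
# blackhole symptom and disabling the affected queue instance.

MESSAGES="${SGE_ROOT:-/opt/sge}/default/spool/qmaster/messages"
PATTERN='<for_desired_string>'      # replace with the error you see
ADMIN="ge-admin@example.com"        # assumed admin list address

queue_instance_of() {
    # Pull the first queue@host token out of a message line; whether
    # such a token appears is site- and version-specific.
    printf '%s\n' "$1" | grep -oE '[^ ]+@[^ ]+' | head -n 1
}

tail -n 1000 "$MESSAGES" 2>/dev/null | egrep "$PATTERN" | while read -r line; do
    qi=$(queue_instance_of "$line")
    [ -n "$qi" ] || continue
    # Guard qmod with a timeout so a down or overloaded qmaster cannot
    # hang the cron job (Edward's robustness requirement).
    if timeout 30 qmod -d "$qi"; then
        echo "disabled $qi: $line" | mail -s "blackhole host disabled" "$ADMIN"
    fi
done
```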
