Ed,

On 10.03.2011 at 17:20, Edward Lauzier wrote:
> This was caused by one host having a scsi disk error...
> sge_execd was ok, but could not properly fire up the shepherd...
> ( we could not log into the console... because of disk access errors... )
> So, the jobs failed with the error message:
>
> 03/10/2011 07:14:38|worker|ge-seq-prod|W|job 9548360.1 failed on host
> node1182 invalid execution state because: shepherd exited with exit status
> 127: invalid execution state
>
> And, man did it chew through a lot of jobs fast...
>
> We set the load adjustment to 0.50 per job for one minute and the load
> formula to slots...
>
> Things run fine and fast...
>
> And the scheduler can really dispatch fast, esp. to a blackhole host...

Well, the feature to use Hawking radiation to let the jobs pop up on other nodes needs precise alignment of the installation - SCNR

There is a demo script to check the free space of e.g. /tmp here:

http://arc.liv.ac.uk/SGE/howto/loadsensor.html

You can then use "load_thresholds tmpfree=1G" in the queue definition, so that the queue instance is set to alarm state in case the reported value falls below the threshold. (A minimal sketch of such a sensor is appended below the quoted thread.)

A load sensor can also deliver a boolean value, so checking something like "all disks fine" locally and using this as a load_threshold can also be a solution. How to check this is of course specific to your node setup.

The last necessary piece would be to inform the admin: this could be done by the load sensor too, but as the node is known not to be in a proper state, I wouldn't recommend it. Better might be a cron job on the qmaster machine checking `qstat -explain a -qs a -u foobar` *) to look for passed load thresholds. (A sketch of such a cron job is appended below as well.)

-- Reuti

*) There is no "show no jobs at all" switch for `qstat`, so specifying an unknown user like "foobar" helps. And OTOH there is no "load_threshold" in the exechost definition.

> -Ed
>
>
> Hi,
>
> On 10.03.2011 at 16:50, Edward Lauzier wrote:
>
> > I'm looking for best practices and techniques to detect blackhole hosts
> > quickly and disable them. ( Platform LSF has this already built in... )
> >
> > What I see is possible is:
> >
> > Using a cron job on a ge client node...
> >
> > - tail -n 1000 -f <qmaster_messages_file> | egrep '<for_desired_string>'
> > - if detected, use qmod -d '<queue_instance>' to disable
> > - send email to the ge_admin list
> > - possibly send email about the failed jobs to the user(s)
> >
> > Must be robust enough to time out properly when GE is down or too busy
> > for qmod to respond... and/or filesystem problems, etc...
> >
> > ( perl or php alarm and sig handlers for proc_open work well for enforcing
> > timeouts... )
> >
> > Any hints would be appreciated before I start on it...
> >
> > Won't take long to write the code, just looking for best practices and
> > maybe a setting I'm missing in the ge config...
>
> What is causing the blackhole? For example: if it's a full file system on a
> node, you could detect it with a load sensor in SGE and define an alarm
> threshold in the queue setup, so that no more jobs are scheduled to this
> particular node.
>
> -- Reuti
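
A minimal sketch of the load sensor idea, along the lines of the demo script from the howto page above. It follows the usual execd load sensor protocol (wait for a line on stdin, answer with a begin/end block, exit on "quit"); the complex name "tmpfree", the megabyte units and the df invocation are assumptions about the local setup, and the complex itself would have to be added as a MEMORY attribute with `qconf -mc` and the script registered as load_sensor in the host or global configuration:

    #!/bin/sh
    # Load sensor sketch: report free space in /tmp as the complex "tmpfree".
    # Adjust the complex name, filesystem and units to the local configuration.

    myhost=`hostname`

    while :; do
        # execd requests a report by writing a line to stdin; "quit" ends the sensor
        read input
        if [ $? -ne 0 -o "$input" = "quit" ]; then
            exit 0
        fi

        # free megabytes in /tmp (df -m is not strictly POSIX, but widely available)
        tmpfree=`df -P -m /tmp | awk 'NR==2 {print $4}'`

        echo "begin"
        echo "$myhost:tmpfree:${tmpfree}M"
        echo "end"
    done

With "load_thresholds tmpfree=1G" in the queue definition (and a "<=" relation in the complex definition), the queue instance should go into alarm state once the reported value drops below 1G.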
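
The boolean variant mentioned above could look almost the same. Here "disks_ok" is an assumed BOOL complex (added with `qconf -mc`), and the actual health check is only a placeholder for whatever makes sense on the nodes:

    #!/bin/sh
    # Load sensor sketch: report a site-specific "all disks fine" check
    # as the assumed BOOL complex "disks_ok".

    myhost=`hostname`

    while :; do
        read input
        if [ $? -ne 0 -o "$input" = "quit" ]; then
            exit 0
        fi

        # placeholder check: can we still create a file in /tmp?
        state=TRUE
        touch /tmp/.disk_probe.$$ 2>/dev/null || state=FALSE
        rm -f /tmp/.disk_probe.$$

        echo "begin"
        echo "$myhost:disks_ok:$state"
        echo "end"
    done

A load_threshold like "disks_ok=FALSE" in the queue definition should then put the instance into alarm state as soon as the sensor reports FALSE (assuming the usual "==" relation for a BOOL complex).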
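
And a sketch of the cron job for the qmaster machine; the settings.sh path and the admin address are placeholders for the local installation:

    #!/bin/sh
    # Cron job sketch for the qmaster: mail the admins when queue instances
    # are in alarm state. "foobar" is an unknown user so no job lines show up.

    ADMIN_MAIL="ge-admin@example.com"          # placeholder address
    . /usr/sge/default/common/settings.sh      # assumed $SGE_ROOT location

    out=`qstat -explain a -qs a -u foobar`

    if [ -n "$out" ]; then
        echo "$out" | mail -s "SGE: queue instances in load alarm state" "$ADMIN_MAIL"
    fi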
