Hi Reuti,

Thanks for the input...

I like the idea of a boolean load sensor. It could be used to set the value of a
host-specific boolean complex resource, and a default job submission could then
say something like:

  -l host_healthcheck=OK
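For the sensor itself, something along these lines might already be enough -- just a
rough, untested sketch modelled on the read-loop protocol from the demo script you
linked; the complex name "host_healthcheck" and the write-test on /tmp are only
placeholders for whatever check actually matters on our nodes:

#!/bin/sh
# Rough sketch of a boolean "host is healthy" load sensor (untested).
# Follows the usual load sensor protocol: wait for a line on stdin,
# exit on "quit", otherwise report begin / host:complex:value / end.

HOST=`uname -n`

while :; do
  read input
  if [ "$input" = "quit" ]; then
    exit 0
  fi

  # Placeholder check: can we still write to the local disk?
  STATE=1
  ( touch /tmp/.hc.$$ && rm -f /tmp/.hc.$$ ) >/dev/null 2>&1 || STATE=0

  echo "begin"
  echo "$HOST:host_healthcheck:$STATE"   # 1 = healthy, 0 = broken
  echo "end"
done

If I read complex(5) and queue_conf(5) right, the remaining pieces would be a
qconf -mc entry along the lines of "host_healthcheck hck BOOL == YES NO 0 0", the
script configured as load_sensor for the exec hosts, and then either a
load_threshold of host_healthcheck=0 on the queues or the -l request above (as a
BOOL it would probably have to be requested as host_healthcheck=1 rather than =OK)
-- corrections welcome.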
This may work...

(A rough sketch of the qmaster-side cron check you mention is below the quoted text.)

Thanks,
Ed

On Thu, Mar 10, 2011 at 11:47 AM, Reuti wrote:
> Ed,
>
> On 10.03.2011 at 17:20 Edward Lauzier wrote:
>
> > This was caused by one host having a SCSI disk error...
> > sge_execd was ok, but could not properly fire up the shepherd
> > (we could not log into the console because of disk access errors...).
> > So, the jobs failed with the error message:
> >
> > 03/10/2011 07:14:38|worker|ge-seq-prod|W|job 9548360.1 failed on host node1182 invalid execution state because: shepherd exited with exit status 127: invalid execution state
> >
> > And, man, did it chew through a lot of jobs fast...
> >
> > We set the load adjustment to 0.50 per job for one minute and the load
> > formula to slots...
> >
> > Things run fine and fast...
> >
> > And the scheduler can really dispatch fast, especially to a blackhole host...
>
> Well, the feature to use the Hawking radiation to allow the jobs to pop up
> on other nodes needs precise alignment of the installation - SCNR
>
> There is a demo script to check the size of e.g. /tmp here
> http://arc.liv.ac.uk/SGE/howto/loadsensor.html and then use
> "load_thresholds tmpfree=1G" in the queue definition, so that the queue
> instance is set to alarm state in case it falls below a certain value.
>
> A load sensor can also deliver a boolean value, hence checking locally
> something like "all disks fine" and using this as a "load_threshold" can
> also be a solution. How to check something is of course specific to your
> node setup.
>
> The last necessary piece would be to inform the admin: this could be done
> by the load sensor too, but as the node is known not to be in a proper
> state I wouldn't recommend this. Better might be a cron job on the qmaster
> machine checking `qstat -explain a -qs a -u foobar` *) to look for passed
> load thresholds.
>
> -- Reuti
>
> *) There is no "show no jobs at all" switch for `qstat`, so using an
> unknown user "foobar" will help. And OTOH there is no "load_threshold" in
> the exechost definition.
>
> > -Ed
> >
> > > Hi,
> > >
> > > On 10.03.2011 at 16:50 Edward Lauzier wrote:
> > >
> > > > I'm looking for best practices and techniques to detect blackhole
> > > > hosts quickly and disable them. (Platform LSF has this already
> > > > built in...)
> > > >
> > > > What I see as possible is:
> > > >
> > > > Using a cron job on a GE client node...
> > > >
> > > > - tail -n 1000 -f <qmaster_messages_file> | egrep '<for_desired_string>'
> > > > - if detected, use qmod -d '<queue_instance>' to disable
> > > > - send email to the ge_admin list
> > > > - possibly send email about the failed jobs to the user(s)
> > > >
> > > > It must be robust enough to time out properly when GE is down or too
> > > > busy for qmod to respond, and/or there are filesystem problems, etc...
> > > >
> > > > (Perl or PHP alarm and signal handlers around proc_open work well for
> > > > enforcing timeouts...)
> > > >
> > > > Any hints would be appreciated before I start on it...
> > > >
> > > > Won't take long to write the code, just looking for best practices
> > > > and maybe a setting I'm missing in the GE config...
> > >
> > > What is causing the blackhole? For example: if it's a full file system
> > > on a node, you could detect it with a load sensor in SGE and define an
> > > alarm threshold in the queue setup, so that no more jobs are scheduled
> > > to this particular node.
> > >
> > > -- Reuti
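PS: for the notification piece -- the cron job on the qmaster you suggest -- a
minimal, untested sketch; the mailx call and the ge-admin address are just
placeholders for the local setup:

#!/bin/sh
# Untested sketch: mail the admin whenever a queue instance is in alarm
# state. Run from cron on the qmaster with the SGE environment set
# (e.g. source $SGE_ROOT/default/common/settings.sh in the crontab line).

# Reuti's incantation: -qs a limits output to queue instances in alarm
# state, -explain a adds the reason, and -u foobar (an unknown user)
# suppresses the job listing. Add -f if your qstat needs it to print the
# queue listing at all.
OUT=`qstat -explain a -qs a -u foobar`

if [ -n "$OUT" ]; then
  echo "$OUT" | mailx -s "GE queue instance(s) in alarm state" ge-admin@example.com
fi

Running that every few minutes from the qmaster's crontab should be enough to get
a timely heads-up on passed thresholds without relying on the sick node itself.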