Ed,

Am 10.03.2011 um 17:20 schrieb Edward Lauzier:

> This was caused by one host having a scsi disk error...
> sge_execd was ok, but could not properly fire up the shepherd...
> ( we could not log into the console...because of disk access errors....)
> So, the jobs failed with the error message:
> 
> 03/10/2011 07:14:38|worker|ge-seq-prod|W|job 9548360.1 failed on host 
> node1182 invalid execution state because: shepherd exited with exit status 
> 127: invalid execution state
> 
> And, man did it chew through a lot of jobs fast...
> 
> We set the load adjustment to 0.50 per job for one minute and the load formula to 
> slots...
> 
> Things run fine and fast...
> 
> And the scheduler can really dispatch fast, esp to a blackhole host...

well, the feature to use Hawking radiation to allow the jobs to pop up on 
other nodes needs precise alignment of the installation - SCNR

There is a demo script to check the free space of e.g. /tmp here: 
http://arc.liv.ac.uk/SGE/howto/loadsensor.html. You can then use "load_thresholds 
tmpfree=1G" in the queue definition, so that the queue instance is put into alarm 
state once the free space falls below that value.
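
Roughly it could look like this (untested sketch; the complex name "tmpfree" is my 
assumption and would have to be added with `qconf -mc` as a MEMORY type with 
relation "<=", and the script registered as load_sensor in the global or host 
configuration with `qconf -mconf`):

    #!/bin/sh
    # minimal load sensor sketch, following the protocol of the demo script:
    # wait for a line on stdin, exit on "quit", otherwise report one value
    # between "begin" and "end"
    HOST=`hostname`
    while read input; do
        [ "$input" = "quit" ] && exit 0
        # available space in /tmp in 1024-byte blocks (column 4 of POSIX df)
        FREE=`df -Pk /tmp | awk 'NR==2 {print $4}'`
        echo "begin"
        echo "$HOST:tmpfree:${FREE}K"
        echo "end"
    done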

A load sensor can also deliver a boolean value, hence checking locally 
something like "all disks fine" and using this as a "load_threshold" can also be 
a solution. How to check this is of course specific to your node setup.
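
E.g. something like this (just a sketch; the complex name "disks_ok" and the check 
itself are made up, and the complex has to be added as a BOOL load value with 
`qconf -mc` first):

    #!/bin/sh
    # boolean load sensor sketch: report 1/0 for a (made up) complex "disks_ok";
    # the actual health check is site specific -- here: can we write to /tmp?
    HOST=`hostname`
    while read input; do
        [ "$input" = "quit" ] && exit 0
        if touch /tmp/.ls_probe.$$ 2>/dev/null; then
            rm -f /tmp/.ls_probe.$$
            OK=1
        else
            OK=0
        fi
        echo "begin"
        echo "$HOST:disks_ok:$OK"
        echo "end"
    done

Then something like "load_thresholds disks_ok=0" in the queue should put the queue 
instance into alarm state whenever the sensor reports 0 (check the relation of the 
BOOL complex, it should be "==").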

The last necessary piece would be to inform the admin: this could be done by 
the load sensor too, but as the node is known not to be in a proper state I 
wouldn't recommend this. Better might be a cron job on the qmaster machine 
checking `qstat -explain a -qs a -u foobar` *) to look for exceeded load 
thresholds.
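
A crude sketch for such a cron job (untested; the admin address is a placeholder, 
and depending on the qstat version you may have to adjust the grep pattern):

    #!/bin/sh
    # run from cron on the qmaster: mail the admin when queue instances are in
    # alarm state because of an exceeded load threshold
    ADMIN="ge-admin@example.com"
    OUT=`qstat -explain a -qs a -u foobar 2>/dev/null`
    # -explain a prints the reason for the alarm; if nothing matches, stay quiet
    if echo "$OUT" | grep -q "alarm"; then
        echo "$OUT" | mail -s "SGE: queue instances in load alarm state" "$ADMIN"
    fi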

-- Reuti

*) There is no switch for `qstat` to show no jobs at all, so using an unknown 
user "foobar" helps. And OTOH there is no "load_threshold" in the exechost 
definition, only in the queue configuration.


> -Ed
> 
> 
> 
> Hi,
> 
> Am 10.03.2011 um 16:50 schrieb Edward Lauzier:
> 
> > I'm looking for best practices and techniques to detect blackhole hosts 
> > quickly
> > and disable them.  ( Platform LSF has this already built in...)
> >
> > What I see is possible is:
> >
> > Using a cron job on a ge client node...
> >
> > -  tail -n 1000 -f <qmaster_messages_file> | egrep '<for_desired_string>'
> > -  if detected, use qmod -d '<queue_instance>' to disable
> > -  send email to ge_admin list
> > -  possibly send email of failed jobs to user(s)
> >
> > Must be robust to be able to timeout properly when ge is down or too busy
> > for qmod to respond...and/or filesystem problems, etc...
> >
> > ( Perl or PHP alarm and signal handlers for proc_open work well for enforcing 
> > timeouts...)
> >
> > Any hints would be appreciated before I start on it...
> >
> > Won't take long to write the code, just looking for best practices and maybe
> > a setting I'm missing in the ge config...
> 
> what is causing the blackhole? For example: if it's a full file system on a 
> node, you could detect it with a load sensor in SGE and define an alarm 
> threshold in the queue setup, so that no more jobs are scheduled to this 
> particular node.
> 
> -- Reuti
> 

