On 10.03.2011, at 17:58, Edward Lauzier wrote:

> Thanks for the input...
> 
> I like the idea of a boolean load sensor.  It could be used to set the value 
> of a host-specific
> boolean complex resource...and a default job submission could say...
> 
> -l host_healthcheck=OK
> 
> This may work...

It's not necessary to request it in the `qsub` command. Depending on the 
true/false logic you can have:

$ qconf -sq all.q
...
load_thresholds host_healthcheck=FALSE
...

This means: if the load sensor sets it to FALSE, the queue instance on this 
machine goes into alarm state and no further jobs are dispatched to it.
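
For this to work, the complex itself has to exist and the load sensor has to 
be registered on the execution host(s); roughly like this (the shortcut, the 
default column and the path to the script are placeholders only):

$ qconf -sc | grep host_healthcheck
host_healthcheck    hhc    BOOL    ==    YES    NO    TRUE    0

$ qconf -sconf node1182
...
load_sensor /path/to/healthcheck_sensor.sh
...

(the complex is added with `qconf -mc`, the load_sensor line with 
`qconf -mconf node1182`, or in the global configuration for all hosts)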

-- Reuti


> Thanks,
> 
> Ed
> 
> 
> On Thu, Mar 10, 2011 at 11:47 AM, Reuti wrote:
> Ed,
> 
> On 10.03.2011, at 17:20, Edward Lauzier wrote:
> 
> > This was caused by one host having a scsi disk error...
> > sge_execd was ok, but could not properly fire up the shepherd...
> > ( we could not log into the console...because of disk access errors....)
> > So, the jobs failed with the error message:
> >
> > 03/10/2011 07:14:38|worker|ge-seq-prod|W|job 9548360.1 failed on host 
> > node1182 invalid execution state because: shepherd exited with exit status 
> > 127: invalid execution state
> >
> > And, man did it chew through a lot of jobs fast...
> >
> > We set the load adjustment to 0.50 per job for one minute and the load 
> > formula to slots...
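> >
> > (In `qconf -msconf` that corresponds roughly to the following; the complex 
> > being adjusted, np_load_avg, is just an example here:)
> >
> > job_load_adjustments          np_load_avg=0.50
> > load_adjustment_decay_time    0:1:00
> > load_formula                  slots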
> >
> > Things run fine and fast...
> >
> > And the scheduler can really dispatch fast, esp to a blackhole host...
> 
> well, the feature to use Hawking radiation to let the jobs pop up on other 
> nodes needs precise alignment of the installation - SCNR
> 
> There is a demo script to check e.g. the free space in /tmp here: 
> http://arc.liv.ac.uk/SGE/howto/loadsensor.html. You can then use "load_thresholds 
> tmpfree=1G" in the queue definition, so that the queue instance is set to 
> alarm state once the value falls below that limit.
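> 
> Stripped down, such a load sensor is just a small script speaking the 
> begin/end protocol on stdout (untested sketch; tmpfree must of course be 
> defined as a complex first):
> 
> #!/bin/sh
> HOST=`hostname`
> while :; do
>     # sge_execd writes a line to stdin when a load report is due,
>     # and "quit" when the sensor should terminate
>     read input || exit 0
>     if [ "$input" = "quit" ]; then
>         exit 0
>     fi
>     # free space in /tmp in kilobytes
>     free=`df -Pk /tmp | awk 'NR==2 {print $4}'`
>     echo "begin"
>     echo "$HOST:tmpfree:${free}K"
>     echo "end"
> done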
> 
> A load sensor can also deliver a boolean value, hence checking locally 
> something like "all disks fine" and using the result as a load_threshold can 
> also be a solution. How to perform such a check is of course specific to your 
> node setup.
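> 
> The boolean variant only differs in what the loop above reports, e.g. a 
> trivial "can I still write to the local disk" probe (just one possible check):
> 
>     if touch /tmp/.sge_healthcheck 2>/dev/null; then
>         state=TRUE
>     else
>         state=FALSE
>     fi
>     rm -f /tmp/.sge_healthcheck
>     echo "begin"
>     echo "$HOST:host_healthcheck:$state"
>     echo "end"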
> 
> The last necessary piece would be to inform the admin: this could be done by 
> the load sensor too, but as the node is known not to be in a proper state I 
> wouldn't recommend this. Better might be a cron job on the qmaster machine 
> checking `qstat -explain a -qs a -u foobar` *) to look for exceeded load 
> thresholds.
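> 
> Such a cron job could be as simple as (a sketch; the settings file path and 
> the mail address are placeholders):
> 
> #!/bin/sh
> # source the SGE environment for cron (path is a placeholder)
> . /opt/sge/default/common/settings.sh
> # -qs a lists only queue instances in alarm state, the unknown user
> # "foobar" suppresses the job listing
> out=`qstat -explain a -qs a -u foobar`
> if [ -n "$out" ]; then
>     echo "$out" | mail -s "SGE: queue instance(s) in alarm state" ge-admin@example.com
> fi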
> 
> -- Reuti
> 
> *) There is no "show no jobs at all" switch for `qstat`, so specifying an 
> unknown user like "foobar" helps. And OTOH there is no "load_thresholds" in 
> the exechost definition.
> 
> 
> > -Ed
> >
> >
> >
> > Hi,
> >
> > On 10.03.2011, at 16:50, Edward Lauzier wrote:
> >
> > > I'm looking for best practices and techniques to detect blackhole hosts 
> > > quickly
> > > and disable them.  ( Platform LSF has this already built in...)
> > >
> > > What I see is possible is:
> > >
> > > Using a cron job on a ge client node...
> > >
> > > -  tail -n 1000 -f <qmaster_messages_file> | egrep '<desired_string>'
> > > -  if detected, use qmod -d '<queue_instance>' to disable
> > > -  send email to ge_admin list
> > > -  possibly send email of failed jobs to user(s)
> > >
> > > Must be robust enough to time out properly when GE is down or too busy
> > > for qmod to respond... and/or when there are filesystem problems, etc...
> > >
> > > ( perl or php alarm and sig handlers for proc_open work well for 
> > > enforcing timeouts...)
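> > >
> > > I.e. something along these lines (untested; the paths, the queue name, the 
> > > match string and the mail address are just examples, and `timeout` needs 
> > > GNU coreutils):
> > >
> > > #!/bin/sh
> > > tail -n 0 -F /opt/sge/default/spool/qmaster/messages | \
> > > while read line; do
> > >     case "$line" in
> > >         *"shepherd exited with exit status"*)
> > >             host=`echo "$line" | sed 's/.*failed on host \([^ .]*\).*/\1/'`
> > >             # don't hang forever if the qmaster is unresponsive
> > >             timeout 30 qmod -d "all.q@$host"
> > >             echo "$line" | mail -s "SGE: disabled all.q@$host" ge-admin@example.com
> > >             ;;
> > >     esac
> > > done
> > >
> > > ( a real version would of course have to remember which hosts it already 
> > > disabled and mailed about... )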
> > >
> > > Any hints would be appreciated before I start on it...
> > >
> > > Won't take long to write the code, just looking for best practices and 
> > > maybe
> > > a setting I'm missing in the ge config...
> >
> > what is causing the blackhole? For example: if it's a full file system on a 
> > node, you could detect it with a load sensor in SGE and define an alarm 
> > threshold in the queue setup, so that no more jobs are scheduled to this 
> > particular node.
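> >
> > Once the complex, the load sensor and the threshold are in place you can 
> > check what the execd reports per host, e.g. for a complex named tmpfree:
> >
> > $ qhost -F tmpfree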
> >
> > -- Reuti
> >
> 
> 


