Hi Reuti,

Thanks for the input...

I like the idea of a boolean load sensor.  It could be used to set the value
of a host-specific boolean complex resource...and a default job submission
could then request...

-l host_healthcheck=TRUE

This may work...
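
Something like this is what I'm picturing -- an untested sketch, names
made up, with the actual health logic left to the sensor:

    # add a boolean complex via `qconf -mc`:
    #name             shortcut  type  relop requestable consumable default urgency
    host_healthcheck  hhc       BOOL  ==    YES         NO         0       0

    # then have every job request a healthy host by default, via
    # $SGE_ROOT/default/common/sge_request:
    -l host_healthcheck=TRUE

A host whose sensor reports FALSE should then no longer be considered by
the scheduler for ordinary jobs.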

Thanks,

Ed


On Thu, Mar 10, 2011 at 11:47 AM, Reuti wrote:

> Ed,
>
> Am 10.03.2011 um 17:20 schrieb Edward Lauzier:
>
> > This was caused by one host having a scsi disk error...
> > sge_execd was ok, but could not properly fire up the shepherd...
> > ( we could not log into the console...because of disk access errors....)
> > So, the jobs failed with the error message:
> >
> > 03/10/2011 07:14:38|worker|ge-seq-prod|W|job 9548360.1 failed on host
> > node1182 invalid execution state because: shepherd exited with exit
> > status 127: invalid execution state
> >
> > And, man did it chew through a lot of jobs fast...
> >
> > We set the load adjustment to 0.50 per job for one minute and the load
> > formula to slots...
> >
> > Things run fine and fast...
> >
> > And the scheduler can really dispatch fast, esp to a blackhole host...
>
> well, the feature to use the Hawking radiation to allow the jobs to pop up
> on other nodes needs precise alignment of the installation - SCNR
>
> There is a demo script to check the free space of e.g. /tmp here
> http://arc.liv.ac.uk/SGE/howto/loadsensor.html; you can then use
> "load_thresholds tmpfree=1G" in the queue definition, so that the queue
> instance is set to alarm state once the free space falls below that value.
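>
> For example, in the queue configuration shown by `qconf -mq <queue>`
> (assuming the tmpfree complex is defined and reported by the sensor):
>
>   load_thresholds       np_load_avg=1.75,tmpfree=1G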
>
> A load sensor can also deliver a boolean value, hence checking locally
> something like "all disks fine" and using this as a "load_threshold" can
> also be a solution. How to perform such a check is of course specific to
> your node setup.
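>
> A minimal sketch of such a sensor, following the protocol from the howto
> above (the disk check is just a placeholder, and a boolean complex like
> host_healthcheck must be defined beforehand):
>
>   #!/bin/sh
>   HOST=`hostname`
>   while :; do
>       # sge_execd sends a line to request a report, "quit" to stop
>       read input || exit 1
>       [ "$input" = "quit" ] && exit 0
>       # placeholder check -- replace with whatever "all disks fine"
>       # means on your nodes (smartctl, dmesg, a test write, ...)
>       if touch /tmp/.healthcheck.$$ 2>/dev/null; then
>           rm -f /tmp/.healthcheck.$$
>           STATE=1
>       else
>           STATE=0
>       fi
>       echo "begin"
>       echo "$HOST:host_healthcheck:$STATE"
>       echo "end"
>   done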
>
> The last necessary piece would be to inform the admin: this could be done
> by the load sensor too, but as the node is known not to be in a proper
> state, I wouldn't recommend it. Better might be a cron-job on the qmaster
> machine checking `qstat -explain a -qs a -u foobar` *) to look for exceeded
> load thresholds.
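>
> An untested sketch of such a cron-job (the mail address is a placeholder):
>
>   #!/bin/sh
>   # mail the admin whenever a queue instance is in alarm state
>   ADMIN="ge-admin@example.com"
>   OUT=`qstat -explain a -qs a -u foobar`
>   if [ -n "$OUT" ]; then
>       echo "$OUT" | mail -s "queue instance(s) in alarm state" "$ADMIN"
>   fi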
>
> -- Reuti
>
> *) There is no "show no jobs at all" switch for `qstat`, so querying an
> unknown user like "foobar" helps. And OTOH there is no "load_thresholds"
> entry in the exechost definition, hence the threshold has to live in the
> queue configuration.
>
>
> > -Ed
> >
> >
> >
> > Hi,
> >
> > Am 10.03.2011 um 16:50 schrieb Edward Lauzier:
> >
> > > I'm looking for best practices and techniques to detect blackhole
> > > hosts quickly and disable them.  ( Platform LSF has this already
> > > built in...)
> > >
> > > What I see is possible is:
> > >
> > > Using a cron job on a ge client node...
> > >
> > > -  tail -n 1000 -f <qmaster_messages_file> | egrep '<for_desired_string>'
> > > -  if detected, use qmod -d '<queue_instance>' to disable (see the
> > >    sketch below)
> > > -  send email to the ge_admin list
> > > -  possibly send email about the failed jobs to the user(s)
> > >
> > > It must be robust enough to time out properly when ge is down or too
> > > busy for qmod to respond...and/or when there are filesystem problems,
> > > etc...
> > >
> > > ( perl or php alarm and sig handlers for proc_open work well for
> > > enforcing timeouts...)
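> > >
> > > Rough sketch of what I have in mind -- untested; the pattern, queue
> > > name and address are placeholders, and `timeout` assumes GNU coreutils
> > > (otherwise the perl alarm trick above):
> > >
> > >   #!/bin/sh
> > >   MSGFILE="$SGE_ROOT/default/spool/qmaster/messages"
> > >   tail -n 1000 -f "$MSGFILE" | \
> > >   egrep --line-buffered 'shepherd exited with exit status' | \
> > >   while read line; do
> > >       # pull the host name out of the message (placeholder pattern)
> > >       host=`echo "$line" | sed 's/.*failed on host \([^ ]*\).*/\1/'`
> > >       # don't hang forever if the qmaster is down or too busy
> > >       timeout 30 qmod -d "all.q@$host"
> > >       echo "$line" | mail -s "disabled all.q@$host" ge-admin@example.com
> > >   done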
> > >
> > > Any hints would be appreciated before I start on it...
> > >
> > > Won't take long to write the code, just looking for best practices
> > > and maybe a setting I'm missing in the ge config...
> >
> > what is causing the blackhole? For example: if it's a full file system
> > on a node, you could detect it with a load sensor in SGE and define an
> > alarm threshold in the queue setup, so that no more jobs are scheduled
> > to this particular node.
> >
> > -- Reuti
> >
>
>
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
