Hi,

On 10.03.2011 at 16:50, Edward Lauzier wrote:

> I'm looking for best practices and techniques to detect blackhole hosts
> quickly and disable them. (Platform LSF already has this built in...)
> 
> What I see is possible is:
> 
> Using a cron job on a ge client node...
> 
> -  tail -n 1000 <qmaster_messages_file> | egrep '<for_desired_string>'
> -  if detected, use qmod -d '<queue_instance>' to disable
> -  send email to ge_admin list
> -  possibly send email of failed jobs to user(s)
> 
> Must be robust to be able to timeout properly when ge is down or too busy
> for qmod to respond...and/or filesystem problems, etc...
> 
> ( perl or php alarm and sig handlers for proc_open work well for enforcing 
> timeouts...)
> 
> Any hints would be appreciated before I start on it...
> 
> Won't take long to write the code, just looking for best practices and maybe
> a setting I'm missing in the ge config...
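
The polling approach quoted above could be sketched roughly as follows. This is only a sketch under assumptions, not a tested tool: the log pattern in FAILURE_RE, the threshold of 3 and the helper names are hypothetical and need adapting to what your qmaster messages file really contains. The only SGE command used is `qmod -d`, as in the quoted list, and the timeout is enforced by Python's subprocess module instead of alarm handlers.

```python
import collections
import re
import subprocess

# Hypothetical pattern: adapt it to the lines your qmaster messages file
# actually contains for jobs that die on a bad host.
FAILURE_RE = re.compile(r"job \S+ failed on host (\S+)")


def last_lines(path, n=1000):
    """Return the last n lines of a file (the `tail -n` step)."""
    with open(path, errors="replace") as handle:
        return list(collections.deque(handle, maxlen=n))


def suspect_hosts(lines, threshold=3):
    """Return hosts named in at least `threshold` matching lines."""
    counts = collections.Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return sorted(host for host, n in counts.items() if n >= threshold)


def disable_queue_instance(queue_instance, timeout=30, runner=subprocess.run):
    """Run `qmod -d <queue_instance>`.

    Returns True on exit status 0, False on a non-zero status, a timeout
    (qmaster down or too busy) or a missing qmod binary.
    """
    try:
        result = runner(
            ["qmod", "-d", queue_instance],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0
```

From cron one would call last_lines() on the messages file, pass the result to suspect_hosts(), and call disable_queue_instance() for each queue instance on a suspect host. Mailing the admin list only when a host is newly disabled keeps the noise down.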

What is causing the blackhole? For example: if it's a full file system on a
node, you could detect it with a load sensor in SGE and define an alarm
threshold in the queue setup, so that no more jobs are scheduled to this
particular node.
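
A minimal sketch of such a load sensor, for the full-file-system case. The complex name tmp_free, the path /tmp and the unit (whole megabytes, as a plain integer) are assumptions for illustration: the complex would have to be defined in the cluster first and then referenced in the queue's load_thresholds. The loop follows the usual load sensor protocol: wait for a line on stdin, stop on "quit", otherwise print one begin/end report.

```python
import shutil
import socket
import sys

COMPLEX_NAME = "tmp_free"  # hypothetical complex, must exist in the cluster
CHECK_PATH = "/tmp"        # assumption: the file system that fills up


def report(host, name, value):
    """Format one load report block: begin, host:name:value, end."""
    return f"begin\n{host}:{name}:{value}\nend\n"


def free_megabytes(path):
    """Free space on the file system holding `path`, in whole megabytes."""
    return shutil.disk_usage(path).free // (1024 * 1024)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Answer every request line with a report until "quit" arrives."""
    # Assumption: the local hostname matches the name SGE uses for this host.
    host = socket.gethostname()
    for line in stdin:
        if line.strip() == "quit":
            break
        stdout.write(report(host, COMPLEX_NAME, free_megabytes(CHECK_PATH)))
        stdout.flush()
```

To use it as a sensor script, call main() at the end of the file and register the script as the load sensor for the host. The flush after each report matters, because the reports are read through a pipe.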

-- Reuti