Hi,

I'm looking for best practices and techniques to detect blackhole hosts
quickly and disable them. (Platform LSF has this already built in...)

What I see as possible is:

Using a cron job on a ge client node...

-  tail -n 1000 <qmaster_messages_file> | egrep '<for_desired_string>'
-  if detected, use qmod -d '<queue_instance>' to disable
-  send email to ge_admin list
-  possibly send email of failed jobs to user(s)
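
As a rough illustration of the scan step above, here is a minimal Python
sketch that does the equivalent of the tail | egrep pipeline. The function
name matching_tail_lines and its parameters are made up for this example;
the path to the qmaster messages file and the pattern to look for are
left to the caller, since both depend on the site and on which failure
message identifies a blackhole host.

```python
import re
from collections import deque


def matching_tail_lines(path, pattern, max_lines=1000):
    """Return the lines among the last max_lines of path that match pattern.

    Hypothetical helper: path would be the qmaster messages file and
    pattern the site-specific failure string; neither is assumed here.
    """
    regex = re.compile(pattern)
    # A bounded deque keeps only the final max_lines lines while reading.
    with open(path, "r", errors="replace") as handle:
        tail = deque(handle, maxlen=max_lines)
    return [line.rstrip("\n") for line in tail if regex.search(line)]
```

Each matching line would then still need to be parsed for the host or
queue instance before calling qmod -d, and a cron job would want to
remember what it has already handled so it does not act twice on the
same message.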

It must be robust enough to time out properly when ge is down or too busy
for qmod to respond, or when there are filesystem problems, etc.

(Perl or PHP alarm and signal handlers around proc_open work well for
enforcing timeouts.)
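
For comparison, the same timeout idea can be sketched in Python with the
standard subprocess module instead of alarm and signal handlers. This is
only a sketch under assumptions: run_with_timeout is a name invented for
the example, and the 30-second value in the usage comment is arbitrary.

```python
import subprocess


def run_with_timeout(argv, timeout_seconds):
    """Run argv and return (exit_code, stdout, stderr), or None on timeout.

    subprocess.run kills the child process itself when the timeout
    expires, so a hung command does not pile up under cron.
    """
    try:
        result = subprocess.run(
            argv,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.returncode, result.stdout, result.stderr


# Hypothetical usage, with the queue instance supplied by the caller:
#   outcome = run_with_timeout(["qmod", "-d", queue_instance], 30)
#   if outcome is None: qmod did not answer in time, so alert the admins.
```

Note that this only bounds the qmod call itself; reads from a stuck
filesystem would need their own guard, and a missing qmod binary raises
FileNotFoundError, which the caller would have to handle.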

Any hints would be appreciated before I start on it...

It won't take long to write the code; I'm just looking for best practices
and maybe a setting I'm missing in the ge config...

Thanks,

Ed Lauzier
The Broad Institute
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users