Hi, I'm looking for best practices and techniques to detect blackhole hosts quickly and disable them. (Platform LSF already has this built in.)
What I see as possible is using a cron job on a ge client node that would:

- tail -n 1000 <qmaster_messages_file> | egrep '<for_desired_string>'
- if detected, use qmod -d '<queue_instance>' to disable
- send email to the ge_admin list
- possibly send email about failed jobs to the user(s)

It must be robust enough to time out properly when ge is down or too busy for qmod to respond, and/or when there are filesystem problems, etc. (Perl or PHP alarm and signal handlers for proc_open work well for enforcing timeouts.)

Any hints would be appreciated before I start on it. It won't take long to write the code; I'm just looking for best practices and maybe a setting I'm missing in the ge config.

Thanks,
Ed Lauzier
The Broad Institute
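To make the cron-job idea above concrete, here is a rough Python sketch of the scan-and-disable step, offered as a starting point rather than a tested tool. Everything not named in the post is an assumption: the function names, the failure threshold, the queue name "all.q", and above all the regular expression, which must be adapted to whatever your qmaster messages file actually logs (the sketch only requires that it define a named group `host`). The timeout uses Python's `subprocess.run(..., timeout=...)` in place of the Perl/PHP alarm approach.

```python
import re
import subprocess
import sys
from collections import Counter, deque


def tail_lines(path, n=1000):
    """Return the last n lines of a text file, like `tail -n`."""
    with open(path, "r", errors="replace") as fh:
        return list(deque(fh, maxlen=n))


def count_failures(lines, pattern):
    """Count matching lines per host.

    `pattern` is a site-specific regex (an assumption here) that must
    define a named group 'host'.
    """
    rx = re.compile(pattern)
    hits = Counter()
    for line in lines:
        m = rx.search(line)
        if m:
            hits[m.group("host")] += 1
    return hits


def run_with_timeout(cmd, timeout_s=30):
    """Run cmd and return (ok, output); never block longer than timeout_s."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this.
        return False, "timed out"
    except OSError as exc:
        # e.g. qmod is not on PATH, or the filesystem is unavailable.
        return False, str(exc)
    return proc.returncode == 0, proc.stdout + proc.stderr


def disable_blackholes(messages_path, pattern, threshold=5, queue="all.q"):
    """Disable the queue instance on every host with >= threshold matches.

    `threshold` and `queue` are hypothetical defaults; adjust for your site.
    """
    results = []
    hits = count_failures(tail_lines(messages_path), pattern)
    for host, n in hits.items():
        if n >= threshold:
            ok, out = run_with_timeout(["qmod", "-d", f"{queue}@{host}"])
            results.append((host, n, ok, out))
    return results
```

A cron wrapper would call `disable_blackholes(...)` with the real messages path and pattern, then mail the returned list to ge_admin. Returning the `(ok, out)` pair for each host lets the mail report hosts where qmod itself timed out or failed, instead of silently skipping them.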
