Hi,

On 10.03.2011 at 16:50, Edward Lauzier wrote:
> I'm looking for best practices and techniques to detect blackhole hosts
> quickly and disable them. ( Platform LSF has this already built in...)
>
> What I see is possible is:
>
> Using a cron job on a ge client node...
>
> - tail -f 1000 <qmaster_messages_file> | egrep '<for_desired_string>'
> - if detected, use qmod -d '<queue_instance>' to disable
> - send email to ge_admin list
> - possibly send email of failed jobs to user(s)
>
> Must be robust to be able to timeout properly when ge is down or too busy
> for qmod to respond...and/or filesystem problems, etc...
>
> ( perl or php alarm and sig handlers for proc_open work well for enforcing
> timeouts...)
>
> Any hints would be appreciated before I start on it...
>
> Won't take long to write the code, just looking for best practices and maybe
> a setting I'm missing in the ge config...

What is causing the blackhole? For example, if it's a full file system on a node, you could detect it with a load sensor in SGE and define an alarm threshold in the queue setup, so that no more jobs are scheduled to this particular node.

-- Reuti
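
To illustrate the load sensor idea, here is a minimal sketch in Python, assuming the blackhole is caused by a full file system. It follows the usual load sensor convention: the execution daemon writes a line to the sensor's stdin whenever it wants a report, the sensor answers with a `begin` / `host:name:value` / `end` block, and it exits when it reads `quit`. The complex name `tmpfree_mb`, the watched path `/tmp`, and the function names are hypothetical and chosen only for this example. The complex would still have to be defined in the complex configuration, the script registered as a load sensor for the host, and the value referenced in the queue's `load_thresholds`.

```python
import shutil
import socket
import sys


def free_megabytes(path):
    """Return the free space on the file system holding `path`, in whole MB."""
    return shutil.disk_usage(path).free // (1024 * 1024)


def format_report(host, complex_name, value):
    """Build one load report: begin, host:name:value, end."""
    return "begin\n{}:{}:{}\nend\n".format(host, complex_name, value)


def run_sensor(stdin, stdout, path="/tmp", complex_name="tmpfree_mb", host=None):
    """Answer every request line with a report until 'quit' or end of input.

    `complex_name` and `path` are assumptions for this sketch. The host name
    must match the name under which the scheduler knows the node.
    """
    host = host or socket.gethostname()
    while True:
        line = stdin.readline()
        # An empty string means the daemon closed the pipe; stop in that case too.
        if not line or line.strip() == "quit":
            break
        stdout.write(format_report(host, complex_name, free_megabytes(path)))
        # Flush so the report is not held back in a buffer.
        stdout.flush()


if __name__ == "__main__":
    run_sensor(sys.stdin, sys.stdout)
```

With something like this in place, a queue instance on a node whose file system fills up would go into alarm state by itself, and no cron job parsing the qmaster messages file would be needed for this particular cause.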
