Edward Lauzier <[email protected]> writes:

> Hi,
>
> I'm looking for best practices and techniques to detect blackhole hosts
> quickly and disable them. ( Platform LSF has this already built in...)
>
> What I see is possible is:
>
> Using a cron job on a ge client node...
>
> - tail -f 1000 <qmaster_messages_file> | egrep '<for_desired_string>'
> - if detected, use qmod -d '<queue_instance>' to disable
> - send email to ge_admin list
> - possibly send email of failed jobs to user(s)
>
> Must be robust to be able to timeout properly when ge is down or too busy
> for qmod to respond...and/or filesystem problems, etc...

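
For what it's worth, the cron-driven steps quoted above could be sketched roughly as below. This is only an illustration under stated assumptions, not a tested GE tool: the `failed on host` pattern is a placeholder for whatever strings your qmaster actually logs, the threshold of three failures is arbitrary, the `*@<host>` queue-instance form is an assumption about how your queues are named, and the function names are made up for the example. The one GE command used is the `qmod -d` from the original message, run with a timeout so a down or busy qmaster cannot hang the job.

```python
import collections
import re
import subprocess

# Placeholder pattern (an assumption): adjust to the strings your qmaster logs.
# It matches on message text only and ignores the W/E severity field.
FAILURE_RE = re.compile(r"failed on host (?P<host>\S+)")


def tail_lines(path, n=1000):
    """Return the last n lines of a file, like `tail -n`."""
    with open(path, errors="replace") as fh:
        return list(collections.deque(fh, maxlen=n))


def suspect_hosts(lines, pattern=FAILURE_RE, threshold=3):
    """Return, sorted, the hosts named in at least `threshold` matching lines."""
    counts = collections.Counter(
        m.group("host") for m in map(pattern.search, lines) if m
    )
    return sorted(h for h, c in counts.items() if c >= threshold)


def disable(queue_instance, timeout=30, cmd=("qmod", "-d")):
    """Run `qmod -d <queue_instance>`; return True only on a clean exit.

    A hung, failing, or missing qmod is reported as False instead of
    blocking the cron job.
    """
    try:
        result = subprocess.run(
            [*cmd, queue_instance],
            timeout=timeout,
            capture_output=True,
            text=True,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def run(messages_file):
    """Scan the tail of the log and disable every suspect host's queues."""
    for host in suspect_hosts(tail_lines(messages_file)):
        ok = disable(f"*@{host}")  # queue-instance naming is an assumption
        print(f"{host}: {'disabled' if ok else 'qmod failed or timed out'}")
```

Calling `run()` with the path to the qmaster messages file from cron would cover the first two steps; the mail notifications are left out, and a filesystem problem reading the log would still surface as an exception from `tail_lines`.
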
Aside from the suggestions for a GE builtin, I'd expect to do that sort of job in a general framework with (something like) Nagios, testing the health of nodes generally. Of course, the event handler may not disable the node quickly enough in such cases, and it's infeasible to write tests for all the failure modes I've seen.

There are various other sorts of GE-specific alerts you might get out of Nagios, such as queues or jobs in an error state, nodes which are busy without any jobs supposed to be on them, etc.

Unfortunately, some messages in the GE log which are actually serious problems are marked as `W', not `E', and various `E's aren't serious, so you need to look for specific patterns there rather than filter on severity.

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
