Edward Lauzier <[email protected]> writes:

> Hi,
>
> I'm looking for best practices and techniques to detect blackhole hosts
> quickly and disable them.  (Platform LSF has this already built in...)
>
> What I see is possible is:
>
> Using a cron job on a ge client node...
>
> -  tail -n 1000 <qmaster_messages_file> | egrep '<for_desired_string>'
> -  if detected, use qmod -d '<queue_instance>' to disable
> -  send email to ge_admin list
> -  possibly send email of failed jobs to user(s)
>
> Must be robust to be able to timeout properly when ge is down or too busy
> for qmod to respond...and/or filesystem problems, etc...

Aside from the suggestions for a GE builtin, I'd expect to do that sort
of job in a general monitoring framework (something like Nagios) that
tests the health of nodes generally.  Of course, the event handler may
not disable the node quickly enough in such cases, and it's infeasible
to write tests for all the failure modes I've seen.  Nagios can also
raise various other GE-specific alerts, such as queues or jobs in an
error state, or nodes which are busy without any jobs supposed to be
running on them.  Unfortunately, some messages in the GE log which are
actually serious problems are marked as `W', not `E', and various `E's
aren't serious, so you need to look for specific patterns there rather
than filter on the severity field.
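
A minimal sketch of such a pattern-based check, in shell, might look
like the following.  It is only an illustration under stated
assumptions: the message pattern ("failed on host"), the pipe-separated
log layout, the function names, the threshold and the timeout value are
all hypothetical and need adjusting to what your own qmaster messages
file contains.  It also assumes the `timeout' utility from GNU coreutils
is available, so that a hung qmod can't block the cron job.

```shell
#!/bin/sh
# Sketch only: pattern, log layout and defaults below are assumptions.

# Hypothetical pattern; replace with messages seen on your own cluster.
PATTERN=${PATTERN:-'failed on host'}
# Command used to disable queues; overridable for dry runs (QMOD=echo).
QMOD=${QMOD:-qmod}
# Seconds to wait for qmod before giving up (assumed value).
QMOD_TIMEOUT=${QMOD_TIMEOUT:-30}

# Read log lines on stdin and print hosts with at least $1 matches
# (default 3).  Assumes lines of the form
#   date time|thread|host|level|message
# and matches on the message text only, ignoring the W/E level field.
suspect_hosts() {
    awk -F'|' -v pat="$PATTERN" '
        NF >= 5 && $5 ~ pat && match($5, /host [^ ]+/) {
            # Skip the 5 characters of "host " to get the host name.
            print substr($5, RSTART + 5, RLENGTH - 5)
        }' | sort | uniq -c | awk -v n="${1:-3}" '$1 + 0 >= n + 0 { print $2 }'
}

# Disable all queue instances on host $1, giving up after the timeout.
disable_host() {
    timeout "$QMOD_TIMEOUT" "$QMOD" -d "*@$1"
}

# Possible use from cron (not executed here):
#   tail -n 1000 <qmaster_messages_file> | suspect_hosts 3 |
#   while read -r h; do
#       disable_host "$h" || echo "qmod failed or timed out for $h" >&2
#   done
```

Requiring several matches per host before disabling it is a design
choice to avoid closing a node because of a single bad job; a threshold
of 1 gives the fastest reaction.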

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
