On Fri, 2013-06-21 at 08:45 +0100, William Hay wrote:
> 
> On 20 June 2013 17:49, Skylar Thompson <[email protected]> wrote:
> 
>         We have our own custom Nagios plugins we use. It basically
>         parses the "qstat -xml" output and looks for hosts that are
>         disabled/alarming/unavailable. Currently each node check
>         requires a call to the qmaster which is a lot of overhead, so
>         we only poll every four hours. We have load sensors running
>         on our exec hosts that will immediately raise an alarm if the
>         node reports hardware problems (monitored via local ipmitool
>         calls, and OpenManage for some), disk space, or out-of-memory
>         conditions (checked via parsing dmesg). This means that new
>         jobs are immediately prevented from running on those nodes,
>         so we really just clean stuff up once a day or even less
>         often.
>         
>         
> We have something similar here.  Rather than active checks, we have a
> script/daemon that regularly parses the qstat output looking for
> nodes in an alarmed state, determines the cause of the alarm, and
> pushes the results to our nagios/opsview server.  The nagios/opsview
> server is configured to flag up a problem if it doesn't receive an
> update for a service for a while.  This means we're only running one
> qstat command to check the entire cluster, so the load on grid engine
> is low.  We also push the results to opsview with a single
> send_nsca command, which helps keep the load on the opsview server
> low.
> 
Since I received an enquiry about this, we have now made the code
available for anyone who wants it:
https://github.com/UCL/opsview-gridengine-integration
Bear in mind that it was written for our own cluster, so there are a few
assumptions baked into the code.
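
For anyone who just wants the general shape of the approach, here is a
minimal sketch. It is not taken from the repository above. It assumes the
XML comes from "qstat -f -xml" with one Queue-List element per queue
instance (name and optional state children), that the state letters map
to Nagios severities as shown, and that passive services named
"GE queue <queue>" exist on the monitoring side. The function names, the
service naming and the send_nsca config path are all hypothetical.

```python
import subprocess
import xml.etree.ElementTree as ET

# Nagios passive-check return codes.
OK, WARNING, CRITICAL = 0, 1, 2

# Assumed severity for each Grid Engine queue-state letter; adjust to taste.
STATE_SEVERITY = {"a": WARNING, "A": WARNING, "d": WARNING,
                  "u": CRITICAL, "E": CRITICAL}


def queue_states(qstat_xml):
    """Map queue instance name -> state string ('' when no state is set)."""
    root = ET.fromstring(qstat_xml)
    states = {}
    for queue in root.iter("Queue-List"):
        name = queue.findtext("name")
        if name:
            states[name] = queue.findtext("state") or ""
    return states


def nsca_lines(states):
    """Build one tab-separated send_nsca line per queue instance.

    Each line is host, service, return code, plugin output.
    """
    lines = []
    for name, state in sorted(states.items()):
        qname, _, host = name.partition("@")
        host = host or qname  # fall back if the name has no "@host" part
        code = max((STATE_SEVERITY.get(ch, OK) for ch in state), default=OK)
        text = "state=%s" % state if state else "no alarm"
        lines.append("\t".join([host, "GE queue " + qname, str(code), text]))
    return "\n".join(lines) + "\n"


def send(payload, nsca_host, config="/etc/send_nsca.cfg"):
    """Push every result with a single send_nsca call (config path assumed)."""
    subprocess.run(["send_nsca", "-H", nsca_host, "-c", config],
                   input=payload.encode(), check=True)
```

A cron job or daemon loop would run qstat once, pass its output through
queue_states() and nsca_lines(), and hand the result to send(). The
freshness check on the nagios/opsview side then covers the case where
the script itself stops reporting.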

William

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
