On 20 June 2013 17:49, Skylar Thompson <[email protected]> wrote:

> We have our own custom Nagios plugins we use. It basically parses the
> "qstat -xml" output and looks for hosts that are
> disabled/alarming/unavailable. Currently each node check requires a call
> to the qmaster which is a lot of overhead, so we only poll every four
> hours. We have load sensors running on our exec hosts that will
> immediately raise an alarm if the node reports hardware problems
> (monitored via local ipmitool calls, and OpenManage for some), disk
> space, or out-of-memory conditions (checked via parsing dmesg). This
> means that new jobs are immediately prevented from running on those
> nodes, so we really just clean stuff up once a day or even less often.
>
We have something similar here.  Rather than active checks, we have a
script/daemon that regularly parses the qstat output looking for nodes in
an alarmed state, determines the cause of the alarm and pushes the results
to our nagios/opsview server.  The nagios/opsview server is configured
to flag up a problem if it doesn't receive an update for a service for a
while.  This means we're only running one qstat command to check the entire
cluster, so the load on grid engine isn't much.  We also push the results to
opsview with a single send_nsca command, which helps keep the load on the
opsview server low.  One thing we've noticed is that grid engine sometimes
retains old load values for uncontactable hosts, so an explicit check for
queues in an uncontactable state is necessary.
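
For illustration only, here is a rough Python sketch of that kind of passive
check.  It is not our actual script.  It assumes the XML from something like
"qstat -f -xml" has one Queue-List element per queue instance with name and
state children, and that the state letters mean what the table below says;
check both against your own installation.  The service name "sge_queue" and
the sample data are made up.  Each output line uses the tab-separated
host/service/return-code/message layout that send_nsca reads on stdin.

```python
import xml.etree.ElementTree as ET

# Assumed meaning of queue-instance state letters; verify with your qstat man page.
PROBLEM_STATES = {"u": "unreachable", "a": "load alarm",
                  "A": "suspend alarm", "E": "error"}
CRITICAL_FLAGS = {"u", "E"}  # assumption: treat these as CRITICAL, the rest as WARNING

# Hypothetical sample in the assumed layout (a healthy queue has no state element).
SAMPLE = """<job_info><queue_info>
<Queue-List><name>all.q@node1</name><state>au</state></Queue-List>
<Queue-List><name>all.q@node2</name></Queue-List>
<Queue-List><name>all.q@node3</name><state>a</state></Queue-List>
</queue_info></job_info>"""


def parse_queue_states(xml_text):
    """Return {queue_instance_name: state_string} from qstat-style XML."""
    root = ET.fromstring(xml_text)
    states = {}
    for queue in root.iter("Queue-List"):
        name = queue.findtext("name")
        if name is None:
            continue
        states[name] = queue.findtext("state") or ""
    return states


def nsca_lines(states, service="sge_queue"):
    """Build one passive-check line per queue instance for a single send_nsca run."""
    lines = []
    for name in sorted(states):
        state = states[name]
        host = name.split("@", 1)[-1]
        problems = [label for flag, label in PROBLEM_STATES.items() if flag in state]
        if not problems:
            code, text = 0, "OK"
        elif CRITICAL_FLAGS & set(state):
            # Explicit check for unreachable/error, since load values may be stale.
            code, text = 2, ", ".join(problems)
        else:
            code, text = 1, ", ".join(problems)
        lines.append(f"{host}\t{service}\t{code}\t{name}: {text}")
    return lines
```

Joining the result of nsca_lines(parse_queue_states(...)) with newlines and
piping it into one send_nsca process gives the "one qstat, one send_nsca"
pattern described above.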

Somewhere on our to-do list is to go the other way: writing a wrapper to
convert the output of nagios plugins to grid engine load sensor format so
we can use all the nice pre-written nagios plugins to inform grid engine of
issues with a node.
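
A minimal sketch of what such a wrapper might look like, in Python.  Nothing
here is written or tested against a real cluster.  It relies on the usual
load sensor protocol (wait for a line on stdin, exit on "quit", otherwise
print a begin / host:complex:value / end block) and on the Nagios convention
that a plugin's exit status is 0, 1, 2 or 3 for OK, WARNING, CRITICAL or
UNKNOWN.  The complex name "nagios_disk" used in the example is hypothetical
and would have to be defined in the complex configuration before grid engine
could act on it.

```python
import socket
import subprocess
import sys


def run_plugin(argv, timeout=30):
    """Run a Nagios-style plugin and return its status (3 = UNKNOWN on any failure)."""
    try:
        rc = subprocess.run(argv, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL, timeout=timeout).returncode
    except (OSError, subprocess.TimeoutExpired):
        return 3
    # Anything outside the conventional 0..3 range is reported as UNKNOWN.
    return rc if rc in (0, 1, 2, 3) else 3


def load_report(host, checks):
    """Format one load report; checks maps a complex name to a plugin command line."""
    lines = ["begin"]
    for complex_name, argv in checks.items():
        lines.append(f"{host}:{complex_name}:{run_plugin(argv)}")
    lines.append("end")
    return "\n".join(lines) + "\n"


def sensor_loop(checks, stdin=sys.stdin, stdout=sys.stdout):
    """Emit a report for each line read on stdin until 'quit' or end of input."""
    host = socket.gethostname()
    for line in stdin:
        if line.strip() == "quit":
            break
        stdout.write(load_report(host, checks))
        stdout.flush()  # the reader is a pipe, so flush after every report
```

Called as, say, sensor_loop({"nagios_disk": ["/path/to/check_disk", ...]}),
this would report the plugin's status as a numeric load value, which a
load threshold on the queue could then use to put the node into alarm.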


> Eventually, we'll probably have the qmaster generate a cache of "qstat
> -xml" regularly and just parse that. Even longer-term, we'd like to dump
> that into a network-accessible message bus so that anything can make use
> of those data.
>
> -- Skylar Thompson ([email protected])
> -- Genome Sciences Department, System Administrator
> -- Foege Building S046, (206)-685-7354
> -- University of Washington School of Medicine
>
> On 06/20/13 09:41, Dave Love wrote:
> > Tina Friedrich <[email protected]> writes:
> >
> >>> Which do you have in mind?  I've a nasty feeling I made mods to one
> >>> which I've never distributed, but I think it should be changed to do
> >>> passive monitoring anyhow, just running qstat/qhost once.
> >>
> >> Oh. Called "check_sge" (written in python). Was that the one? I never
> >> noticed any performance problems as such.
> >
> > It doesn't cause particular problems on not-very-large clusters here,
> > but it's clearly making a lot more calls on qmaster than necessary,
> > albeit less demanding ones, plus invocations of the script.
> >
> >>> Of course, if you just want to check if an execd is alive, you only
> >>> need check_tcp on the right port.
> >>
> >> True; I think the plugin did/does more than that - checks for error
> >> states etc.
> >
> > Right.  I'll try to check what, if anything, I did to it.
> >
> > I had intended to write notes on monitoring, but have never got round to
> > it.  If anyone else would like to contribute...
> >
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
>
>
>