On Fri, 2013-06-21 at 08:45 +0100, William Hay wrote:
> On 20 June 2013 17:49, Skylar Thompson <[email protected]> wrote:
> > We have our own custom Nagios plugins we use. It basically parses the
> > "qstat -xml" output and looks for hosts that are
> > disabled/alarming/unavailable. Currently each node check requires a
> > call to the qmaster, which is a lot of overhead, so we only poll every
> > four hours. We have load sensors running on our exec hosts that will
> > immediately raise an alarm if the node reports hardware problems
> > (monitored via local ipmitool calls, and OpenManage for some), disk
> > space, or out-of-memory conditions (checked via parsing dmesg). This
> > means that new jobs are immediately prevented from running on those
> > nodes, so we really just clean stuff up once a day or even less often.
>
> We have something similar here. Rather than active checks, we have a
> script/daemon that regularly parses the qstat output looking for nodes
> in an alarmed state, determines the cause of the alarm, and pushes the
> results to our nagios/opsview server. The nagios/opsview server is
> configured to flag up a problem if it doesn't receive an update for a
> service for a while. This means we're only running one qstat command to
> check the entire cluster, so the load on grid engine isn't much. We
> also push the results to opsview with a single send_nsca command, which
> helps keep the load on the opsview server low.

Since I received an enquiry about this, we have now made it available for anyone who wants it. The code is at:

https://github.com/UCL/opsview-gridengine-integration

Bear in mind that this was written for our own cluster, so there are a few assumptions baked into the code.
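For anyone curious about the general shape of the approach, here is a minimal sketch (not the actual UCL code) of the core idea: one `qstat -f -xml` call for the whole cluster, then scan the XML for queue instances whose state flags indicate trouble (d=disabled, a=alarm, u=unreachable, E=error). Element names follow the usual SGE qstat XML schema but may differ between Grid Engine versions, so treat them as assumptions; feeding the results to send_nsca is left out here.

```python
# Hypothetical sketch of the technique described above: a single qstat
# call per poll, parsed for alarmed queue instances. Not the UCL code.
import subprocess
import xml.etree.ElementTree as ET

# State letters treated as "alarmed" (d=disabled, a=alarm,
# u=unreachable, E=error) -- extend to suit your site.
ALARM_FLAGS = set("dauE")

def alarmed_queues(xml_text):
    """Return [(queue_instance, state), ...] for alarmed instances."""
    root = ET.fromstring(xml_text)
    results = []
    # "Queue-List" is the per-queue-instance element in SGE's
    # "qstat -f -xml" output; adjust the name for your GE version.
    for q in root.iter("Queue-List"):
        name = q.findtext("name", default="?")
        state = q.findtext("state", default="")
        if ALARM_FLAGS & set(state):
            results.append((name, state))
    return results

def poll_cluster():
    # One qstat call covers the entire cluster, which keeps the load
    # on the qmaster low compared with per-node active checks.
    out = subprocess.run(["qstat", "-f", "-xml"],
                         capture_output=True, text=True,
                         check=True).stdout
    return alarmed_queues(out)
```

A passive-check daemon would call `poll_cluster()` on a timer and push one result per node to the Nagios/Opsview server; the server-side freshness check then catches the case where the daemon itself dies.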
William
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
