[gridengine users] NHC and SoGE integration

Cam Karnes Wed, 20 Apr 2016 11:20:00 -0700

I’ve been scouring the internet for a while looking for good examples of NHC 
[https://github.com/mej/nhc] implementations on HPC clusters that use SoGE for 
resource management. So far I’ve found a few posts here and there via Google 
Groups, plus Dave Love’s writeup detailing the use of NHC as a load sensor for 
SoGE.


We’re currently trying to use NHC to get useful snapshots of the states of our 
HPC nodes. The checks and configurations of NHC aren’t a problem, but the 
framework itself definitely looks to be geared more towards SLURM and TORQUE.

We currently have two goals.

1) Have NHC act as the standard health-checking mechanism for all clustered 
devices: a centralized monitoring service (Zabbix, Nagios, Cacti, etc.) 
initiates each check upstream, then consumes the return values and diagnostic 
messages so that the node’s state is reflected there (wrong mounts, over-limit 
filesystems, free memory, etc.).

2) Have any failed check disable all queues on that node.
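
To make the two goals concrete, here is a minimal sketch of a wrapper that a 
monitoring agent could call. It assumes that the health check exits non-zero 
when a check fails, and that the account running it is allowed to run 
"qmod -d" against the node. The function names are hypothetical, and the host 
name passed in has to match the name SoGE knows the node by.

```python
import subprocess


def run_check(cmd):
    """Run a health-check command and return (exit_code, combined_output)."""
    proc = subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,      # do not block waiting for input
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout.strip()


def disable_command(host):
    """Build the qmod call that disables every queue instance on `host`."""
    return ["qmod", "-d", "*@" + host]


def check_and_act(host, check_cmd, run=run_check, act=subprocess.call):
    """Goal 1: return the code and message for the monitoring service.
    Goal 2: disable the node's queues when the check fails."""
    code, message = run(check_cmd)
    if code != 0:
        act(disable_command(host))
    return code, message


# Example call (the NHC path is an assumption about the local install):
# code, message = check_and_act("node01", ["/usr/sbin/nhc"])
```

The monitoring side would then map the returned code to its own states (for 
example 0 to OK and anything else to CRITICAL) and show the message as the 
diagnostic text. Re-enabling queues after a node recovers is deliberately left 
out of this sketch.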

NHC is working, and the configuration files we have in place run some basic 
checks that reflect unhealthy states, but the integration and automation of 
the service with SoGE is where we’re left scratching our heads a little. One 
option we’ve explored is sourcing NHC with certain environment variables set 
and running individual NHC functions from a script upstream.

There are some other things we’ve noticed that strike us as a little strange. 
For example, running NHC seems to require some STDIN when SoGE is detected as 
the resource manager. NHC also sets certain environment variables, such as 
TIMEOUT. If SoGE is detected, then NHC will set this to 0 regardless 
of what has been specified without the timeout flag. This breaks NHC functions 
like check_cmd_output.
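
One possible explanation for the STDIN behaviour, offered as an assumption 
rather than something confirmed from the NHC source, is that NHC switches to 
load-sensor mode when it detects SoGE. In the Grid Engine load sensor 
protocol, the execution daemon writes a line to the sensor's STDIN each time 
it wants a report, the sensor answers with a block delimited by "begin" and 
"end", and a line containing "quit" tells the sensor to exit. A minimal sketch 
of that loop follows. The complex name "nhc_status" is hypothetical and would 
need to be defined on the cluster before the value is of any use.

```python
import sys


def load_sensor(host, get_value, stdin=sys.stdin, stdout=sys.stdout,
                name="nhc_status"):
    """Answer each request line with one load report until 'quit' or EOF."""
    for line in stdin:
        if line.strip() == "quit":
            break
        # One report block per request: begin, host:name:value, end.
        stdout.write("begin\n")
        stdout.write("{}:{}:{}\n".format(host, name, get_value()))
        stdout.write("end\n")
        stdout.flush()   # the daemon reads the report from a pipe
```

A process written this way sits idle until something arrives on STDIN, which 
would look like "requiring input" when it is started by hand.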

Hopefully this wasn’t too NHC-specific a post, and thanks in advance.
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
