I've been scouring the internet for a while looking for good examples of NHC [https://github.com/mej/nhc] implementations on HPC clusters that use SoGE for resource management. So far I've found a few posts here and there via Google Groups, plus Dave Love's writeup on using NHC as a load sensor for SoGE.
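For anyone who hasn't used load sensors: as we understand it, the execution daemon starts the sensor once and then drives it over STDIN. Each line it sends is a request for a report, and the word "quit" tells the sensor to exit. Every report is wrapped in "begin" and "end" lines, with one host:name:value line per complex in between. This may be why NHC waits on STDIN when it detects SoGE. Below is a minimal Python sketch of that protocol. It is not NHC's actual implementation, and the complex name nhc_status and the function names are hypothetical.

```python
def format_report(host, values):
    """Build one load report: 'begin', one host:name:value line per item, 'end'."""
    lines = ["begin"]
    lines += [f"{host}:{name}:{value}" for name, value in values.items()]
    lines.append("end")
    return "\n".join(lines) + "\n"


def run_sensor(stdin, stdout, host, probe):
    """Answer every line received on stdin until 'quit' arrives or the pipe closes."""
    for line in stdin:
        if line.strip() == "quit":
            break
        stdout.write(format_report(host, probe()))
        stdout.flush()  # the daemon waits for the complete report


def probe_health():
    # Hypothetical placeholder: a real sensor would run its health checks here
    # and map the result to a value (0 = healthy in this sketch).
    return {"nhc_status": 0}
```

To try it as a real sensor, one would call run_sensor(sys.stdin, sys.stdout, socket.gethostname(), probe_health) from a small script and register that script as the load sensor. The complex itself would also have to be defined on the SoGE side.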
We're currently trying to use NHC to get useful snapshots of the state of our HPC nodes. The checks and configuration of NHC aren't a problem, but the framework itself definitely looks to be geared more towards SLURM and TORQUE.

We currently have two goals:

1) Have NHC act as the standard health-checking mechanism for all clustered devices. A check is initiated upstream by a centralized monitoring service (Zabbix, Nagios, Cacti, etc.), and the return values and diagnostic messages are consumed by that same service, which then reflects the node's state (wrong mounts, over-limit filesystems, free memory, etc.).
2) Have any failed check disable all queues for that node.

NHC is working, and the configuration files we have in place run some basic checks that reflect unhealthy states, but integrating and automating the service with SoGE is where we're left scratching our heads a little. One option we've explored is sourcing NHC with certain environment variables set and running individual NHC functions from an upstream script.

There are some other things we've noticed that strike us as a little strange. For example, running NHC seems to require some input on STDIN when SoGE is detected as the resource manager. NHC also sets certain environment variables itself, such as TIMEOUT. If SoGE is detected, NHC sets this to 0 regardless of any value specified without the timeout flag, which breaks NHC functions like check_cmd_output.

Hopefully this wasn't too NHC-specific a post, and thanks in advance.

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users