We use Node Health Check (NHC) [1] with SLURM [2] to ensure nodes are healthy; if a check fails, the node is automatically marked offline. We also run some of the same checks in Zabbix, which collects some performance metrics as well. We are currently evaluating PCP in combination with Zabbix, since XDMoD's SUPReMM features can use the PCP data to show job-level performance and utilization.

Nagios and Ganglia are probably more commonly used in HPC than Zabbix. The best tool is whichever one meets your needs.

- Trey

[1]: https://github.com/mej/nhc
[2]:
    grep Health /etc/slurm/slurm.conf
    HealthCheckInterval=600
    HealthCheckNodeState=ANY
    HealthCheckProgram=/usr/sbin/nhc

=============================

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: [email protected]
Jabber: [email protected]

On Sun, Jan 24, 2016 at 3:58 PM, Simpson Lachlan <[email protected]> wrote:

> Hi,
>
> I've installed XDMoD for measuring the SLURM data, but what do people use
> for the monitoring of their nodes?
>
> Are people predominantly using service-based software (the hypervisor itself),
> Ganglia, Nagios....?
>
> Cheers
> L.
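
For anyone who wants to check the same settings on their own cluster, here is a rough Python sketch that pulls the HealthCheck* parameters shown in [2] out of slurm.conf-style text. It is only an illustration, not something we run: the function name `parse_health_check_settings` and the `SAMPLE` string are made up for this example, and the parser ignores `Include` directives and anything else beyond simple `Key=Value` lines.

```python
def parse_health_check_settings(conf_text):
    """Return the HealthCheck* key/value pairs found in slurm.conf-style text.

    Hypothetical helper for illustration; it is not part of Slurm or NHC.
    """
    settings = {}
    for raw_line in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        # slurm.conf parameter names are case-insensitive.
        if key.lower().startswith("healthcheck"):
            settings[key] = value.strip()
    return settings


# Sample text copied from the grep output in [2].
SAMPLE = """\
HealthCheckInterval=600
HealthCheckNodeState=ANY
HealthCheckProgram=/usr/sbin/nhc
"""

if __name__ == "__main__":
    for name, value in parse_health_check_settings(SAMPLE).items():
        print(f"{name} = {value}")
```

With the values from [2], slurmd runs /usr/sbin/nhc every 600 seconds on nodes in any state. To try the sketch against a real file, read /etc/slurm/slurm.conf (the path used in [2]) and pass its contents to the function.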
