We use Node Health Check (NHC) [1] with SLURM [2] to ensure nodes are healthy;
if anything is wrong, the node is automatically marked offline.  We also
run some of the same checks in Zabbix, along with some performance
metrics.  We are currently evaluating PCP in combination with Zabbix,
since XDMoD's SUPReMM features can use the PCP data to show job-level
performance and utilization.

Nagios and Ganglia are probably more commonly used in HPC than Zabbix.

The best tools are whichever ones meet your needs.

- Trey


[1]: https://github.com/mej/nhc

[2]: grep Health /etc/slurm/slurm.conf
HealthCheckInterval=600
HealthCheckNodeState=ANY
HealthCheckProgram=/usr/sbin/nhc
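
For anyone unfamiliar with how a HealthCheckProgram fits in: slurmd runs it
on the node at the configured interval, and the program itself is expected
to take the node out of service when a check fails.  Below is a minimal
Python sketch of that idea.  It is not how NHC is implemented (NHC is a set
of bash checks); the /tmp path, the 1 GiB threshold, and the function names
are all hypothetical, and the only real external command used is
"scontrol update ... State=DRAIN".

```python
import shutil
import socket
import subprocess

# Hypothetical threshold for illustration: require 1 GiB free on /tmp.
MIN_FREE_BYTES = 1 * 1024**3


def check_free_space(path="/tmp", min_free=MIN_FREE_BYTES, usage=shutil.disk_usage):
    """Return an error string if `path` has too little free space, else None.

    `usage` is injectable so the check can be exercised without a real disk.
    """
    free = usage(path).free
    if free < min_free:
        return f"low disk space on {path}: {free} bytes free"
    return None


def drain_command(node, reason):
    """Build the scontrol command that puts `node` into the DRAIN state."""
    return ["scontrol", "update", f"NodeName={node}", "State=DRAIN", f"Reason={reason}"]


def main():
    """Run the single check; drain this node and return 1 if it fails."""
    problem = check_free_space()
    if problem is None:
        return 0
    # Assumes the short hostname matches the SLURM node name.
    node = socket.gethostname().split(".")[0]
    subprocess.run(drain_command(node, problem), check=False)
    return 1
```

To try it as a health check, a script would call main() and exit with its
return value, and HealthCheckProgram would point at that script.  In
practice NHC already covers this kind of check and many more, so the sketch
is only meant to show the mechanism.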


=============================

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: [email protected]
Jabber: [email protected]

On Sun, Jan 24, 2016 at 3:58 PM, Simpson Lachlan <
[email protected]> wrote:

> Hi,
>
> I've installed XDMoD for measuring the SLURM data, but what do people use
> for monitoring their nodes?
>
> Are people predominantly using service-based software (the hypervisor itself),
> Ganglia, Nagios....?
>
> Cheers
> L.
>
>
