Hiya,

On 25/01/16 08:57, Simpson Lachlan wrote:

> I've installed xdmod for measuring the SLURM data, but what do people
> use for the monitoring of their nodes?
> 
> Are people predominantly using service based software (the hypervisor
> itself), Ganglia, Nagios....?

We use Slurm's own health check options to run a set of scripts we have
built over the years.  You configure that in slurm.conf:

http://slurm.schedmd.com/slurm.conf.html#OPT_HealthCheckInterval
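
To give a rough idea of what such a script can look like, here is a
minimal sketch in Python. It is not one of our scripts: the file path,
the 5% disk threshold and the function names are all made up for
illustration. It relies on the fact that Slurm leaves it to the health
check program to drain the node itself, which is usually done with
"scontrol update".

```python
#!/usr/bin/env python3
# Hypothetical health check, assumed to be wired up in slurm.conf like:
#   HealthCheckProgram=/usr/local/sbin/node_health.py   (made-up path)
#   HealthCheckInterval=300
#   HealthCheckNodeState=ANY
import shutil
import socket
import subprocess

MIN_FREE_FRACTION = 0.05  # assumed threshold, tune for your site


def disk_problem(total, free, min_free_fraction=MIN_FREE_FRACTION):
    """Return a reason string if free space is too low, else None."""
    if total <= 0:
        return "could not determine filesystem size"
    if free / total < min_free_fraction:
        return "low disk space: {:.1%} free".format(free / total)
    return None


def drain_command(node, reason):
    """Build the scontrol command that drains a node with a reason."""
    return ["scontrol", "update", "NodeName=" + node,
            "State=DRAIN", "Reason=" + reason]


def main(path="/"):
    """Check one filesystem and drain this node if it looks unhealthy."""
    usage = shutil.disk_usage(path)
    reason = disk_problem(usage.total, usage.free)
    if reason is None:
        return 0
    # Assumes the Slurm node name matches the short hostname.
    node = socket.gethostname().split(".")[0]
    subprocess.call(drain_command(node, "healthcheck: " + reason))
    return 1
```

To install something like this as a real check you would add a call to
sys.exit(main()) at the bottom and make the file executable. Splitting
the decision from the scontrol call keeps the logic easy to test off
the cluster.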

If you don't want to write your own then look at LBNL's "Node Health
Check" scripts which are freely available here:

https://github.com/mej/nhc/

We also poll xCAT from Icinga (the community Nagios fork) to look for
nodes that xCAT thinks are down and trigger an alert on those.
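
For anyone wanting to try something similar, a Nagios-style check along
these lines might look roughly like the sketch below. This is not our
actual plugin. It assumes xCAT's "nodestat" prints one "node: status"
line per node and that "noping" marks an unreachable node; the
"compute" node group and the function names are hypothetical. The exit
codes are the standard Nagios plugin ones (0 OK, 2 CRITICAL, 3
UNKNOWN).

```python
import subprocess

# Standard Nagios/Icinga plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def parse_nodestat(text):
    """Parse assumed 'node: status' lines into a dict."""
    statuses = {}
    for line in text.splitlines():
        node, sep, status = line.partition(":")
        if sep and node.strip():
            statuses[node.strip()] = status.strip()
    return statuses


def evaluate(statuses, down_markers=("noping",)):
    """Turn node statuses into a (exit_code, message) pair."""
    if not statuses:
        return UNKNOWN, "UNKNOWN: no node status returned"
    down = sorted(n for n, s in statuses.items() if s in down_markers)
    if down:
        return CRITICAL, "CRITICAL: {} node(s) down: {}".format(
            len(down), ", ".join(down))
    return OK, "OK: {} node(s) responding".format(len(statuses))


def run_check(noderange="compute"):
    """Query xCAT for a node range (hypothetical group name) and evaluate."""
    out = subprocess.check_output(["nodestat", noderange],
                                  universal_newlines=True)
    return evaluate(parse_nodestat(out))
```

A wrapper script would print the message from run_check() and exit
with the returned code, which is all Icinga needs to raise the alert.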

All the best,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: [email protected] Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci
