Hiya,

On 25/01/16 08:57, Simpson Lachlan wrote:

> I've installed xdmod for measuring the SLURM data, but what do people
> use for the monitoring of their nodes?
>
> Are people predominantly using service-based software (the hypervisor
> itself), Ganglia, Nagios...?

We use Slurm's own health check options to run a set of scripts we
have built over the years. You configure that in slurm.conf:

http://slurm.schedmd.com/slurm.conf.html#OPT_HealthCheckInterval

If you don't want to write your own, then look at LBNL's "Node Health
Check" scripts, which are freely available here:

https://github.com/mej/nhc/

We also poll xCAT from Icinga (the community Nagios fork) to look for
nodes that xCAT thinks are down, and trigger an alert on those.

All the best,
Chris
-- 
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: [email protected] Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/      http://twitter.com/vlsci
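
For illustration only, here is a minimal sketch of the kind of script that could be hooked in through the HealthCheckProgram and HealthCheckInterval options in slurm.conf. It is not one of the scripts mentioned above. The checked path, the free-space threshold, and the function names are all assumptions made for this example. Since the script is expected to act on a problem itself, the sketch drains the node with scontrol and a reason string.

```python
#!/usr/bin/env python3
"""Hypothetical node health check, intended to be run by slurmd."""
import shutil
import socket
import subprocess

# Assumed values for this sketch; tune them for your own site.
MIN_FREE_FRACTION = 0.05
CHECK_PATHS = ["/tmp"]


def find_problem(paths, min_free=MIN_FREE_FRACTION, usage=shutil.disk_usage):
    """Return a reason string for the first failing check, or None."""
    for path in paths:
        try:
            stats = usage(path)
        except OSError as exc:
            return f"healthcheck: cannot stat {path}: {exc.strerror}"
        # Skip the ratio test for filesystems that report zero size.
        if stats.total > 0 and stats.free / stats.total < min_free:
            return f"healthcheck: {path} low on space"
    return None


def drain_command(node, reason):
    """Build the scontrol command line that drains a node."""
    return ["scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason={reason}"]


def main():
    """Run the checks and drain this node if one of them fails."""
    reason = find_problem(CHECK_PATHS)
    if reason is None:
        return 0
    # Assumes the Slurm node name matches the short hostname.
    node = socket.gethostname().split(".")[0]
    subprocess.run(drain_command(node, reason), check=False)
    return 1
```

To use something like this as a script, call main() at the bottom of the file, make the file executable, and point HealthCheckProgram at its full path in slurm.conf, with HealthCheckInterval set to the number of seconds between runs. The NHC project linked above covers far more checks than this sketch does.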
