[slurm-dev] Re: NHC and disk / dell server health

Ole Holm Nielsen Wed, 27 Jan 2016 00:54:06 -0800


On 01/27/2016 09:12 AM, Johan Guldmyr wrote:

has anybody already made some custom NHC checks that can be used to
check disk health or perhaps even hardware health on a dell server?


I've been thinking of using smartctl + NHC to test if the local disks
on the compute node are healthy.

Or for Dell hardware then "omreport" something or perhaps one could
call for example the check_openmanage nagios check from NHC..

We're extremely happy with NHC (Node Health Check was moved to https://github.com/mej/nhc recently) due to its numerous checks and its lightweight resource usage.

I haven't been able to find any command for checking disk health, since smartctl is completely unreliable for checking failing disks (a bad disk will usually have a PASSED SMART status). What I've seen many times is that a disk fails partly, so the kernel remounts file systems read-only. This prevents any further health checks from running, including NHC, and all batch jobs running on a system with read-only disks are going to fail (almost) silently :-( Normally I discover this scenario due to user complaints.
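As a rough illustration of spotting the read-only remount scenario, here is a minimal shell sketch that lists mount points whose options include "ro" in a /proc/mounts-style table. The function name list_ro_mounts and the default list of file system types are my own assumptions, not part of NHC, and since a read-only root may stop local checks from running at all, something like this would probably have to be run from outside the node (for example over ssh from a management host).

```shell
# Hypothetical helper, not part of NHC: print the mount points that are
# currently mounted read-only, reading a /proc/mounts-style table.
# Usage: list_ro_mounts [mounts_file] [fstype_regex]
list_ro_mounts() {
    mounts_file="${1:-/proc/mounts}"
    fstypes="${2:-}"
    # Assumed default: only look at common local disk file systems.
    [ -n "$fstypes" ] || fstypes='^(ext[234]|xfs|btrfs)$'

    # Fields in /proc/mounts: device, mount point, fstype, options, dump, pass.
    awk -v types="$fstypes" '
        $3 ~ types {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro") { print $2; break }
        }
    ' "$mounts_file"
}
```

Calling list_ro_mounts with no arguments inspects the local /proc/mounts; any output means that at least one local file system has gone read-only and the node should probably be drained.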

There is one hardware test which I do find useful for catching mostly memory errors. Use this NHC check in nhc.conf:


# Check Machine Check Exception (MCE, mcelog) errors (Intel only, not AMD)
* || check_hw_mcelog

You'll need to have the mcelogd daemon running. Make a manual test by: mcelog --client


--
Ole Holm Nielsen
PhD, Manager of IT services
Department of Physics, Technical University of Denmark
