On 01/27/2016 09:12 AM, Johan Guldmyr wrote:
Has anybody already made some custom NHC checks that can be used to
check disk health or perhaps even hardware health on a Dell server?

I've been thinking of using smartctl + NHC to test if the local disks
on the compute node are healthy.

Or, for Dell hardware, something with "omreport", or perhaps one could
call, for example, the check_openmanage Nagios check from NHC.

We're extremely happy with NHC (Node Health Check, which recently moved to https://github.com/mej/nhc) because of its numerous checks and its light resource usage.

I haven't been able to find any command for checking disk health. smartctl is completely unreliable for detecting failing disks: a bad disk will usually still report a PASSED SMART status. What I've seen many times is that a disk fails partially, so the kernel remounts the file systems read-only. This prevents any further health checks from running, including NHC, and all batch jobs running on a system with read-only disks are going to fail (almost) silently :-( Normally I only discover this scenario through user complaints.
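
The read-only remount described above is visible in /proc/mounts, so it can in principle be detected by a small shell test. The following is only a minimal sketch: the function name is_mounted_ro is hypothetical and not part of NHC, and the mount points in the usage note are placeholders. If the file system holding the health-check scripts is itself read-only or failing, a test like this may never get the chance to run, as noted above.

```shell
#!/bin/bash
# Hypothetical helper (not part of NHC): succeed (return 0) if the given
# mount point is currently mounted read-only.
# Reads a file in /proc/mounts format:
#   device mountpoint fstype options dump pass
is_mounted_ro() {
    local target="$1" mounts_file="${2:-/proc/mounts}"
    local dev mnt fstype opts rest
    local status=1
    while read -r dev mnt fstype opts rest; do
        [ "$mnt" = "$target" ] || continue
        # Wrap the option list in commas so that "ro" only matches as a
        # whole option, not inside e.g. "errors=remount-ro".
        # If a mount point is listed more than once, the last entry wins.
        case ",${opts}," in
            *,ro,*) status=0 ;;
            *)      status=1 ;;
        esac
    done < "$mounts_file"
    return "$status"
}
```

Usage could look like `is_mounted_ro /scratch && echo "/scratch is read-only"`, run from cron or from a monitoring host rather than from the possibly affected disk.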

There is one hardware test that I do find useful, mostly for catching memory errors. Use this NHC check in nhc.conf:

# Check Machine Check Exception (MCE, mcelog) errors (Intel only, not AMD)
* || check_hw_mcelog

You'll need to have the mcelogd daemon running. To test it manually, run: mcelog --client

--
Ole Holm Nielsen
PhD, Manager of IT services
Department of Physics, Technical University of Denmark
