On 01/27/2016 09:12 AM, Johan Guldmyr wrote:
> Has anybody already made some custom NHC checks that can be used to
> check disk health, or perhaps even hardware health, on a Dell server?
> I've been thinking of using smartctl + NHC to test whether the local
> disks on the compute node are healthy.
> Or, for Dell hardware, "omreport" something, or perhaps one could
> call, for example, the check_openmanage Nagios check from NHC.
We're extremely happy with NHC (Node Health Check was moved to
https://github.com/mej/nhc recently) due to its numerous checks and its
lightweight resource usage.
I haven't been able to find any reliable command for checking disk
health. In my experience smartctl is unreliable for detecting failing
disks (a bad disk will usually still report a PASSED SMART status). What
I've seen many times is that a disk fails partially, so the kernel
remounts the file systems read-only. This prevents any further health
checks from running, including NHC, and all batch jobs running on a
system with read-only disks are going to fail (almost) silently :-(
Normally I only discover this scenario through user complaints.
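
For illustration, here is a minimal sketch of how a read-only remount could be spotted from a script, for example one run by an external monitor that does not depend on the node's local disks being writable. The function name list_ro_mounts is hypothetical, not part of NHC, and the sketch assumes the usual /proc/mounts layout (device, mount point, type, options). I have not run this in production.

```shell
#!/bin/bash
# Hypothetical helper (not part of NHC): print every mount point whose
# mount options contain "ro" as a whole option.
# Reads /proc/mounts by default; another file can be given for testing.
list_ro_mounts() {
    local mounts_file="${1:-/proc/mounts}"
    local dev mnt fstype opts rest
    while read -r dev mnt fstype opts rest; do
        # Options are comma-separated. Wrapping them in commas makes
        # "ro" match only as a complete option, so "errors=remount-ro"
        # on a read-write mount is not a false positive.
        case ",$opts," in
            *,ro,*) echo "$mnt" ;;
        esac
    done < "$mounts_file"
}
```

Calling list_ro_mounts with no argument lists the current read-only mounts; the caller still has to filter out mounts that are read-only on purpose. NHC itself also ships a check_fs_mount_rw check for this situation, but it only helps as long as NHC is still able to run on the node.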
There is one hardware test which I do find useful, mostly for catching
memory errors. Use this NHC check in nhc.conf:

  # Check Machine Check Exception (MCE, mcelog) errors (Intel only, not AMD)
  * || check_hw_mcelog

You'll need to have the mcelogd daemon running. To test it manually, run:

  mcelog --client

--
Ole Holm Nielsen
PhD, Manager of IT services
Department of Physics, Technical University of Denmark