On 27.01.2016 at 09:53, Ole Holm Nielsen wrote:
> On 01/27/2016 09:12 AM, Johan Guldmyr wrote:
>> has anybody already made some custom NHC checks that can be used to
>> check disk health or perhaps even hardware health on a dell server?
>> I've been thinking of using smartctl + NHC to test if the local disks
>> on the compute node is healthy.
>>
>> Or for Dell hardware then "omreport" something or perhaps one could
>> call for example the check_openmanage nagios check from NHC..
> We're extremely happy with NHC (Node Health Check was moved to
> https://github.com/mej/nhc recently) due to its numerous checks and its
> lightweight resource usage.

When I first read about NHC I wondered what improvement it gives me over
(standard) monitoring.

Not that I dislike NHC per se: I found the sample configuration on
GitHub really nice, because I could immediately implement a handful of
checks in a short time that would have prevented a list of failures from
the last months.

But: if I fail to implement the proper monitoring rules (centrally), I
guess I will also fail to implement the proper checks in NHC(?)

> I haven't been able to find any command for checking disk health, since
> smartctl is completely unreliable for checking failing disks (a bad disk
> will usually have a PASSED SMART status). What I've seen many times is
> that a disk fails partly, so the kernel remounts file systems read-only.
> This prevents any further health checks from running, including NHC,
> and all batch jobs running on a system with read-only disks are going to
> fail (almost) silently :-( Normally I discover this scenario due to
> user complaints.

check_mk (we use it as part of OMD) complains automatically if mount
options change, and about really _a lot_ of other parameters of a node
(everything IPMI sensors provide, DRBD status, network, ...) out of the
box.

Running a custom script
https://mathias-kettner.de/checkmk_mkeventd_actions.html
that drains that node / puts jobs on hold / mails your users before your
users complain should be quite easy.

Other monitoring solutions can trigger actions after such an event, too.
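
To make the read-only remount case concrete, here is a minimal sketch of such a check in Python. It is not how check_mk or NHC implement it. It only scans text in /proc/mounts format for file systems whose options include "ro". The function names and the list of ignored file system types are assumptions made up for this example.

```python
# Hypothetical helper, not part of check_mk or NHC.
# File system types that are commonly read-only or virtual and should
# not count as a failure (this list is an assumption, adjust as needed).
IGNORED_FSTYPES = ("proc", "sysfs", "tmpfs", "devtmpfs",
                   "squashfs", "iso9660", "cgroup")


def read_only_mounts(mounts_text, ignored_fstypes=IGNORED_FSTYPES):
    """Return the mount points that are mounted read-only.

    mounts_text is expected in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fstype, options = fields[1], fields[2], fields[3]
        if fstype in ignored_fstypes:
            continue
        # Options are comma separated; "ro" must match as a whole option.
        if "ro" in options.split(","):
            found.append(mount_point)
    return found


def check_node(path="/proc/mounts"):
    """Return 0 if no read-only mounts were found, 1 otherwise."""
    with open(path) as handle:
        bad = read_only_mounts(handle.read())
    for mount_point in bad:
        print("read-only file system: " + mount_point)
    return 1 if bad else 0
```

A wrapper could pass the result of check_node() to sys.exit() and let the monitoring system or the resource manager act on a non-zero exit code. Whether the script can still run on a node whose root file system has been remounted read-only depends on the setup, which is the point made in the quoted text above.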

Before a colleague of mine introduced it, I liked to keep tests minimal
(KISS done wrong?). But since testing OMD on a Ganeti cluster and getting
warnings about things I wouldn't have known how to monitor, or how
important they are, I'm totally sold and can highly recommend it.

/BR
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321
