[slurm-dev] slurm-dev check health and trigger actions via monitoring (Re: Re: NHC and disk / dell server health)

Benjamin Redling Thu, 28 Jan 2016 01:48:38 -0800

On 27.01.2016 at 09:53, Ole Holm Nielsen wrote:
> On 01/27/2016 09:12 AM, Johan Guldmyr wrote:
>> has anybody already made some custom NHC checks that can be used to
>> check disk health or perhaps even hardware health on a dell server?


>> I've been thinking of using smartctl + NHC to test if the local disks
>> on the compute node is healthy.
>>
>> Or for Dell hardware then "omreport" something or perhaps one could
>> call for example the check_openmanage nagios check from NHC..

> We're extremely happy with NHC (Node Health Check was moved to
> https://github.com/mej/nhc recently) due to its numerous checks and its
> lightweight resource usage.

When I first read about NHC I wondered what improvement it gives me
over (standard) monitoring.
Not that I dislike NHC per se: I found the sample configuration on
GitHub really nice, because I could immediately implement a handful of
checks in a short time that would have prevented a list of failures
from the last few months.

But if I fail to implement the proper monitoring rules (centrally), I
will fail to implement the proper checks in NHC too, I guess(?)

> I haven't been able to find any command for checking disk health, since
> smartctl is completely unreliable for checking failing disks (a bad disk
> will usually have a PASSED SMART status).  What I've seen many times is
> that a disk fails partly, so the kernel remounts file systems read-only.
>  This prevents any further health checks from running, including NHC,
> and all batch jobs running on a system with read-only disks are going to
> fail (almost) silently :-(  Normally I discover this scenario due to
> user complaints.

check_mk (we use it as part of OMD) complains automatically if mount
options change -- and about really _a lot_ of other parameters of a node
(everything IPMI sensors provide, DRBD status, network, ...) out of the
box.
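
To illustrate the idea of alerting on changed mount options, here is a
minimal sketch in Python. It is not how check_mk implements its check;
the function names parse_mounts and changed_mount_options are made up
for this example, and the input is assumed to be text in the format of
/proc/mounts. A baseline is recorded once, and any later difference
(for example rw turning into ro) is reported.

```python
def parse_mounts(mounts_text):
    """Map each mount point to its set of mount options.

    Expects text in the format of /proc/mounts:
    device mount_point fstype options dump pass
    """
    mounts = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3]
        mounts[mount_point] = set(options.split(","))
    return mounts


def changed_mount_options(baseline, current):
    """Return {mount_point: (missing_options, added_options)} for every
    mount in the baseline whose options differ or that has disappeared."""
    changes = {}
    for mount_point, expected in baseline.items():
        actual = current.get(mount_point)
        if actual is None:
            # The file system is no longer mounted at all.
            changes[mount_point] = (expected, set())
        elif actual != expected:
            changes[mount_point] = (expected - actual, actual - expected)
    return changes
```

A file system that the kernel remounted read-only would show up with
"rw" in the missing set and "ro" in the added set.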

Running a custom script
https://mathias-kettner.de/checkmk_mkeventd_actions.html
that drains the node / puts jobs on hold / mails your users before they
complain should be quite easy.
Other monitoring solutions can trigger actions after such an event too.
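
A rough sketch of such an action script, in Python, under some
assumptions: the node name and a reason text are handed to the script
by whatever triggers it (how the monitoring system passes them is not
shown here), scontrol is in the PATH, and the script runs as a user
allowed to change node state. The helper names build_drain_command and
drain_node are hypothetical. Slurm wants a Reason when a node is
drained, so the function refuses an empty one.

```python
import subprocess


def build_drain_command(node, reason):
    """Build the scontrol call that drains a node.

    Each argument is a separate list element, so no shell quoting
    is needed even if the reason contains spaces.
    """
    if not node or not reason:
        raise ValueError("both node name and reason are required")
    return [
        "scontrol", "update",
        f"NodeName={node}",
        "State=DRAIN",
        f"Reason={reason}",
    ]


def drain_node(node, reason, run=subprocess.run):
    """Drain the node; 'run' can be replaced for dry runs or testing."""
    # check=True raises CalledProcessError if scontrol exits non-zero.
    return run(build_drain_command(node, reason), check=True)
```

Called as drain_node("node042", "file system remounted read-only"),
this would leave running jobs alone and stop new jobs from being
scheduled on that node. Holding jobs or mailing users would be
separate steps along the same lines.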

Before a colleague of mine introduced it I liked to keep tests minimal
(KISS done wrong?),
but since testing OMD on a Ganeti cluster and getting warnings about
things I wouldn't have known how to monitor -- or how
important they are -- I'm totally sold and can highly recommend it.


/BR
-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321
