[slurm-users] Re: [EXTERNAL] Node Health Check Program

Paul Edmon via slurm-users Tue, 19 Aug 2025 12:39:00 -0700

Thanks, that explains why no EL9 release has appeared yet. I tried out the dev branch and it works great.

NHC has been awesome to use (we've been using it for years). Thanks for maintaining it!


-Paul Edmon-

On 8/19/25 3:25 PM, Jennings, Michael E wrote:

Hi Paul!

Have you by chance given the `dev` branch a try?  All our production servers 
currently run `lbnl-nhc-1.5-0.82.gf8dc.el8.noarch` built from the `dev` branch, 
have been for some time now, and it's been rock solid.  Our RHEL-based clusters 
also use this version.  Our HPE/Cray Shasta clusters, including our largest 
(classified) clusters Crossroads, Tycho, and Venado, use a variant.  (Long 
story short, I've merged in all my changes into a separate branch, but the 
reverse is not yet true.)  This variant is, at present, COS/SLES-specific, but 
it has quite a few useful additional checks (many of them Cray-centric) 
contributed by other LANL folks that I haven't had a chance to upstream yet.

Well, to be fair, that's not exactly true.  I could just add them in en masse — 
no tidying, no unit tests, no code reviews — and believe me, I've come close to 
doing it several times!  If enough folks out there would find it useful at 
their site(s), I'm sure I could be persuaded. :-)

For better or worse, RHEL9 was only recently approved by the security folks 
here, so very few servers and clusters are running it (or derivatives like Alma 
or Rocky).  I do have a RHEL9-based VM running the same version I noted above 
(for servers); the only problem I ran into so far is updating the `sshd` 
service check due to the fact that, with OpenSSH 8.7, even the primary daemon 
process (the listener) rewrites its `argv[]` data.  Here's what I'm using that 
works:

* || check_ps_service -VS -u root -fm 'sshd: /*/sshd* -D*' sshd
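To see why the match string has to account for the rewritten title, here is a minimal Python sketch that tests the glob from the line above against two process titles. Both titles are hypothetical examples (compare them with real `ps` output on your own nodes), and Python's `fnmatch` is only a stand-in for NHC's own glob matching, so treat this as an illustration rather than a description of how NHC evaluates the check.

```python
from fnmatch import fnmatchcase

# Match string taken verbatim from the check_ps_service line above.
PATTERN = "sshd: /*/sshd* -D*"

# Hypothetical process titles, for illustration only; compare them with
# real `ps` output on your own nodes.
REWRITTEN_TITLE = "sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups"
PLAIN_TITLE = "/usr/sbin/sshd -D"


def title_matches(title: str, pattern: str = PATTERN) -> bool:
    """Return True if the whole title matches the shell-style glob."""
    # fnmatchcase is case-sensitive and anchors the pattern at both ends.
    return fnmatchcase(title, pattern)


if __name__ == "__main__":
    for title in (REWRITTEN_TITLE, PLAIN_TITLE):
        print(f"{title!r}: {title_matches(title)}")
```

Under these assumptions the rewritten title matches and the plain command line does not, which is consistent with an older check that expected the unmodified command line needing an update.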

I don't see any other mention of RHEL9-centric issue reports; did I miss one?

In any event, the project isn't dead, I swear!  And for what it's worth, it 
won't be going away any time soon; LANL HPC (independently of myself) evaluated 
all the available options at the time, on at least 3 separate occasions, and 
consistently found `lbnl-nhc` to be the best choice.  Since that time, it's 
been deployed on almost all services hosts (Quay servers, GitLab servers, 
OpenShift prime and worker nodes, our virtualization cluster, and numerous 
others) as well as all production clusters and supporting infrastructure.  NHC 
feeds its results into Splunk (hence the recently added JSON support), and we 
also use things like telegraf and LDMS, but in terms of situational awareness 
at the OS, scheduler, and cluster hardware levels, NHC is everywhere, and we've 
invested quite a bit into it in terms of time, training, and ancillary efforts 
(like our NHC Ansible role).

I'm not able to spend 90+% of my time on NHC right now, as I was briefly able 
to do last year, but it is still being developed and deployed at scale.

As far as forks go, the only thing I'm aware of in that vein is work that comes 
from the great team over at the University of Ghent in Belgium.  Their tree 
(github.com/hpcugent/nhc) was still undergoing development while I was stuck in 
legal limbo with Feynman, but I haven't checked recently to see if they've 
merged any of the recent features (and some pretty significant bugfixes, 
primarily around process management).

Hope that helps!
Michael

--
Michael E. Jennings (he/him) <m...@lanl.gov>            https://hpc.lanl.gov/
HPC Platform Integration Engineer - Platforms Design Team - HPC Design Group
Ultra-Scale Research Center (USRC), 4200 W Jemez #301-25   +1 (505) 412-4151
Los Alamos National Laboratory,  P.O. Box 1663,  Los Alamos, NM   87545-0001


________________________________________
From: Paul Edmon via slurm-users <slurm-users@lists.schedmd.com>
Sent: Tuesday, August 19, 2025 08:20
To: slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
Subject: [EXTERNAL] [slurm-users] Node Health Check Program

We've been using NHC (https://github.com/mej/nhc) for years with much
success. However that project hasn't had a release in 2 years and the
various Issues filed indicate that there might be problems with Rocky 9
(which we are looking to upgrade to). Do people that are at EL9 use NHC?
Is there a fork? Is there a different code that people use for doing
node health checks?

-Paul Edmon-


--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com

