One simple thing to do is enable HealthCheckProgram
[1] http://slurm.schedmd.com/slurm.conf.html#OPT_HealthCheckProgram
and use a simple script along the lines of:
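#!/bin/bash
# Try to sync the clock; -u sends from an unprivileged source port.
ntpdate -u ntpsserver.cluster.local ; rc=$?
# On failure, drain this node so Slurm stops scheduling jobs on it.
[[ $rc -ne 0 ]] && scontrol update NodeName=$HOSTNAME State=drain \
    Reason=ntp_failure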
But a better solution is probably to add clock drift checks to your
NHC / cluster monitoring solution and to replace the system boards of
repeat offenders.
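
As a rough sketch of what such a drift check could look like (not a
tested NHC module): the script below queries the NTP server with
"ntpdate -q", which reports the offset without setting the clock, and
drains the node if the absolute offset exceeds a limit. The 60-second
MAX_DRIFT default, the function names and the clock_drift reason string
are assumptions made up for illustration. The offset parsing assumes
the usual "server ..., stratum ..., offset ..., delay ..." output lines
of ntpdate -q.

```shell
#!/bin/bash
# Sketch of a drift check for HealthCheckProgram. MAX_DRIFT and the
# reason strings are assumptions, not values taken from Slurm or Munge.
MAX_DRIFT=${MAX_DRIFT:-60}   # seconds; hypothetical threshold

# Print the first "offset" value found in `ntpdate -q` output on stdin.
parse_offset() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "offset") {
               gsub(/,/, "", $(i + 1)); print $(i + 1); exit } }'
}

# Succeed (exit 0) when |offset| is greater than the limit.
drift_exceeds() {
    awk -v o="$1" -v m="$2" 'BEGIN { o += 0; m += 0
        if (o < 0) o = -o
        exit !(o > m) }'
}

# Query the server without touching the clock; drain the node on trouble.
check_clock() {
    local server=$1 out offset
    if ! out=$(ntpdate -q "$server" 2>/dev/null); then
        scontrol update NodeName="$HOSTNAME" State=drain Reason=ntp_failure
        return 1
    fi
    offset=$(parse_offset <<<"$out")
    if drift_exceeds "$offset" "$MAX_DRIFT"; then
        scontrol update NodeName="$HOSTNAME" State=drain Reason=clock_drift
        return 1
    fi
}

# Only act when a server name is given, e.g.:
#   clock_check.sh ntpsserver.cluster.local
if [[ $# -ge 1 ]]; then check_clock "$1"; fi
```

Given the roughly 2.5-minute failures reported below, a limit well
under 150 seconds seems sensible, but the right value is a site
decision. Like the script above, this assumes $HOSTNAME matches the
Slurm NodeName.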
On 10/06/2016 05:17 PM, Per Lönnborg wrote:
Hi,
as a sysadmin, I know the importance of keeping correct time on
"things". We use (of course) ntp for that.
But what is the preferred way to check that the compute nodes in
our cluster have the correct time, and if not, to see to it that Slurm
doesn't allocate these nodes to perform tasks?
About a year ago, we started to use Munge for authentication.
The default allowed time drift between nodes and the Slurm server for
Munge is +/- 5 minutes (300 sec). If the drift exceeds this, the node
cannot communicate with slurmd and will be marked "down*" in Slurm.
Perfect, we thought, since this check would be enough for Slurm
not to allocate time broken nodes to users. But...
I've read the Slurm documentation and it says "While Slurm itself
does not rely upon synchronized clocks on all nodes of a cluster
for proper operation, its underlying authentication mechanism does
have this requirement."
True. Tests we have performed with drifting time on a compute node
show that if a node clock is >5 minutes AFTER correct time we get
"Job credential expired". Fine. (= same TTL as Munge.) But if time
on the node is approx. JUST 2.5 minutes BEFORE correct time, we also
get "Job credential expired". NOT fine. That is about half the TTL
of Munge.
So, if we just let Slurm "rely" on Munge, we will have users
complaining about "Job credential expired" if time on node is
between about 2.5 minutes and 5 minutes wrong.
We have also tried to alter the TTL for Munge, but that doesn't
seem to be implemented...yet?
Earlier (before we used Munge) we had a bit of quite crappy code
in a Prolog-script that checked NTP, but it was buggy...
I would appreciate input from other admins on how to check and
maintain synchronized clocks in a Slurm managed cluster!
Thanks,
/Per Lönnborg
[1] http://slurm.schedmd.com/slurm.conf.html#OPT_HealthCheckProgram