One simple thing to do is enable
   [1]http://slurm.schedmd.com/slurm.conf.html#OPT_HealthCheckProgram
   and use a simple script along the lines of:


   #!/bin/bash

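   # Try to sync against the NTP server and remember the exit status.
   ntpdate -u ntpsserver.cluster.local ; rc=$?

   # Drain the node if the sync failed.
   [[ $rc -ne 0 ]] && scontrol update NodeName=$HOSTNAME State=drain \
       Reason=ntp_failure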



   But a better solution is probably to add clock drift checks to your
   nhc / cluster monitoring solution and replace system boards of repeat
   offenders.
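
   A drift check of that kind could look roughly like the sketch below.
   Everything in it is an assumption rather than a tested recipe: the
   names MAX_DRIFT, drift_exceeds and check_clock are made up for
   illustration, the 120 second limit is simply a value below the roughly
   2.5 minutes at which the quoted message reports "Job credential
   expired", and the parsing assumes that "ntpdate -q" prints the word
   "offset" followed by the offset in seconds. It only queries the clock
   and does not set it.

```shell
#!/bin/bash
# Sketch only: MAX_DRIFT, drift_exceeds and check_clock are illustrative names.

MAX_DRIFT=${MAX_DRIFT:-120}   # seconds; assumed value, tune for your site

# Return 0 if the absolute value of offset $1 (seconds, may be fractional
# or negative) is greater than limit $2, otherwise return 1.
drift_exceeds() {
    awk -v off="$1" -v max="$2" 'BEGIN {
        if (off < 0) off = -off
        exit !(off > max)
    }'
}

# Query (do not set) the clock offset against server $1 and drain the
# node if the offset is larger than MAX_DRIFT.
check_clock() {
    local server=$1 offset
    # Assumed output format: "... offset <seconds> ..."; keep the last value.
    offset=$(ntpdate -q "$server" 2>/dev/null |
             awk '{ for (i = 1; i < NF; i++) if ($i == "offset") v = $(i + 1) }
                  END { sub(/,$/, "", v); print v }')
    # No offset parsed (server unreachable or unexpected output): do nothing
    # here and leave that case to a separate check.
    [[ -n $offset ]] || return 0
    if drift_exceeds "$offset" "$MAX_DRIFT"; then
        scontrol update NodeName="$HOSTNAME" State=drain Reason=clock_drift
    fi
}
```

   Calling "check_clock ntpsserver.cluster.local" at the end of the
   script would run the check once; the same function could be called
   from an nhc check or a monitoring probe instead.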



   On 10/06/2016 05:17 PM, Per Lönnborg wrote:

     Hi,
     as a sysadmin, I know the importance of keeping correct time on
     "things". We use (of course) ntp for that.
     But what is the preferred way to check that the compute nodes in
     our cluster have the correct time, and if not, to make sure Slurm
     doesn't allocate these nodes to perform tasks?
     About a year ago, we started to use Munge for authentication.
     Default time drift between nodes and slurm server for Munge is +/-
     5 minutes (300sec). If time exceeds these values, the node cannot
     communicate with slurmd and will be marked "down*" in slurm.
     Perfect, we thought, since this check would be enough for Slurm
     not to allocate time broken nodes to users. But...
     I've read the Slurm documentation and it says "While Slurm itself
     does not rely upon synchronized clocks on all nodes of a cluster
     for proper operation, its underlying authentication mechanism does
     have this requirement."
     True. Tests we have performed with drifting time on a compute node
     show that if a node clock is >5 minutes AFTER correct time we get
     "Job credential expired". Fine. (= same TTL as Munge.) But if time
     on the node is approx. JUST 2.5 minutes BEFORE correct time, we
     also get "Job credential expired". NOT fine. About half the TTL
     vs. Munge.
     So, if we just let Slurm "rely" on Munge, we will have users
     complaining about "Job credential expired" if time on node is
     between about 2.5 minutes and 5 minutes wrong.
     We have also tried to alter the TTL for Munge, but that doesn't
     seem to be implemented... yet?
     Earlier (before we used Munge) we had a bit of quite crappy code
     in a Prolog-script that checked NTP, but it was buggy...
     I would appreciate input from other admins on how to check and
     maintain synchronized clocks in a Slurm-managed cluster!
     Thanks,
     /Per Lönnborg


   


   [1] http://slurm.schedmd.com/slurm.conf.html#OPT_HealthCheckProgram

