Check that the nodes all have the same time, or that they are all running ntpd against the same server. I found that the nodes that kept going down had their clocks out of sync.
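
If it helps, here is a rough Python sketch of how one might spot the odd ones out. It assumes you have already collected an epoch timestamp from each node at roughly the same moment (for example by running `date +%s` on each one); the node names, sample values, and the 1-second tolerance are all made up for illustration and are not Slurm settings.

```python
from statistics import median


def find_skewed_nodes(node_times, tolerance_s=1.0):
    """Return {node: offset_seconds} for nodes whose clock differs from
    the median reading by more than tolerance_s.

    node_times maps a node name to an epoch timestamp in seconds.
    The tolerance is an arbitrary illustrative value.
    """
    reference = median(node_times.values())
    return {
        node: t - reference
        for node, t in node_times.items()
        if abs(t - reference) > tolerance_s
    }


# Hypothetical readings, not real data from any cluster.
sample = {
    "node1": 1485300245.0,
    "node2": 1485300000.2,
    "node3": 1485300000.0,
    "node19": 1485299700.0,
}

for node, offset in sorted(find_skewed_nodes(sample).items()):
    print(f"{node}: {offset:+.1f} s from median")
```

Bear in mind that readings gathered one node after another include the collection delay, so small offsets mean little; only large ones are worth chasing.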

Cheers
L.

------
The most dangerous phrase in the language is, "We've always done it this way."
- Grace Hopper

On 25 January 2017 at 05:49, Allan Streib <astr...@indiana.edu> wrote:
>
> I have a cluster of 64 nodes and nodes 1 and 19 keep getting marked as
> down with a reason of not responding. They are up, pingable, slurmd is
> running, etc. everything looks normal.
>
> Using wireshark on the slurmctld I looked at traffic for node 1 and node
> 2. I can see traffic between the slurmctld node and node 2 at intervals
> about every 300 seconds but for node 1 sometimes the interval is as much
> as 1800 seconds.
>
> Any reason why these nodes might be getting "pinged" less often than the
> others? The slurm.conf is identical, and contains these timer settings
> (which I think are all defaults):
>
> # TIMERS
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=300
> Waittime=0
>
> Slurm version 14.11.7.
>
> Allan
> --
> Allan Streib
> Indiana University School of Informatics and Computing
> Digital Science Center :: Community Grids Lab :: FutureSystems
>