I have a cluster of 64 nodes, and nodes 1 and 19 keep getting marked as down with a reason of "not responding". They are up and pingable, slurmd is running, and everything else looks normal.
Using Wireshark on the slurmctld node, I looked at traffic for node 1 and node 2. I can see traffic between the slurmctld node and node 2 at intervals of about 300 seconds, but for node 1 the interval is sometimes as long as 1800 seconds.

Any reason why these nodes might be getting "pinged" less often than the others? The slurm.conf is identical, and contains these timer settings (which I think are all defaults):

# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0

Slurm version 14.11.7.

Allan

-- 
Allan Streib
Indiana University School of Informatics and Computing
Digital Science Center :: Community Grids Lab :: FutureSystems
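
In case it helps anyone reproduce the interval check described above, here is a minimal sketch of how the gaps could be computed from packet timestamps taken from a capture. The function name `long_gaps` and the sample timestamps are hypothetical and only illustrate the roughly 300 and 1800 second intervals mentioned above; they are not real capture data. The 300 second limit is the SlurmdTimeout value from the slurm.conf settings.

```python
from typing import Dict, List

# Assumption: compare gaps against SlurmdTimeout=300 from the slurm.conf above.
SLURMD_TIMEOUT = 300.0  # seconds


def long_gaps(timestamps: List[float], limit: float = SLURMD_TIMEOUT) -> List[float]:
    """Return the gaps between consecutive packets that exceed `limit` seconds."""
    ordered = sorted(timestamps)
    # Pair each timestamp with the next one and take the difference.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return [gap for gap in gaps if gap > limit]


# Hypothetical packet times in seconds since the start of the capture.
captures: Dict[str, List[float]] = {
    "node1": [0.0, 300.0, 2100.0, 2400.0],
    "node2": [0.0, 300.0, 600.0, 900.0],
}

for node, times in captures.items():
    print(node, long_gaps(times))
```

A node whose list is non-empty has gone longer than SlurmdTimeout without traffic from the controller at least once during the capture.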