I have a cluster of 64 nodes, and nodes 1 and 19 keep getting marked as
down with a reason of "not responding". They are up and pingable, slurmd
is running, and everything else looks normal.

Using Wireshark on the slurmctld host, I looked at traffic for node 1 and
node 2. I can see traffic between the slurmctld node and node 2 at
intervals of about 300 seconds, but for node 1 the interval is sometimes
as much as 1800 seconds.
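
A minimal sketch of how such gaps could be checked, assuming the packet
times for each node have been exported from the capture as epoch seconds.
The function name and the sample timestamps are hypothetical, made up for
illustration only; the 300-second limit mirrors the SlurmdTimeout value
in the slurm.conf below.

```python
SLURMD_TIMEOUT = 300  # seconds; assumed to match SlurmdTimeout in slurm.conf


def find_long_gaps(timestamps, limit=SLURMD_TIMEOUT):
    """Return (start, end, gap) for consecutive packets more than `limit` seconds apart."""
    ordered = sorted(timestamps)
    return [(a, b, b - a) for a, b in zip(ordered, ordered[1:]) if b - a > limit]


# Hypothetical timestamps (seconds), not taken from the real capture.
node1 = [0, 300, 2100, 2400]
node2 = [0, 300, 600, 900]

print(find_long_gaps(node1))  # [(300, 2100, 1800)]
print(find_long_gaps(node2))  # []
```

Any gap reported this way is longer than the timeout, which would be
consistent with the node being marked down even though it is reachable.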

Any reason why these nodes might be getting "pinged" less often than the
others? The slurm.conf is identical, and contains these timer settings
(which I think are all defaults):

# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0

Slurm version 14.11.7.

Allan
-- 
Allan Streib
Indiana University School of Informatics and Computing
Digital Science Center :: Community Grids Lab :: FutureSystems
