Check that all the nodes have the same time, or that ntpd on each one is
syncing against the same server. I found that the nodes that kept going
down had clocks that were out of sync.
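
If it helps, here is a rough sketch of how the clocks could be compared
once you have a timestamp from each node (for example the output of
`date +%s`, collected at roughly the same moment). The function name, the
sample readings and the 1-second threshold are all made up for
illustration; I am not claiming that is the tolerance Slurm itself uses.

```python
from typing import Dict, List


def find_skewed_nodes(
    node_epochs: Dict[str, float],
    reference_epoch: float,
    max_skew_s: float = 1.0,  # hypothetical threshold, adjust to taste
) -> List[str]:
    """Return the names of nodes whose clock differs from the reference
    (e.g. the slurmctld host) by more than max_skew_s seconds."""
    return sorted(
        name
        for name, epoch in node_epochs.items()
        if abs(epoch - reference_epoch) > max_skew_s
    )


if __name__ == "__main__":
    # Invented readings in seconds since the epoch, not real cluster data.
    sample = {
        "node1": 1485323400.0,
        "node2": 1485323340.2,
        "node19": 1485323275.0,
    }
    print(find_skewed_nodes(sample, reference_epoch=1485323340.0))
```

Any node this prints is worth a closer look with your NTP tooling before
digging further into Slurm.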

Cheers
L.

------
The most dangerous phrase in the language is, "We've always done it this
way."

- Grace Hopper

On 25 January 2017 at 05:49, Allan Streib <astr...@indiana.edu> wrote:

>
> I have a cluster of 64 nodes and nodes 1 and 19 keep getting marked as
> down with a reason of not responding. They are up, pingable, slurmd is
> running, etc.; everything looks normal.
>
> Using wireshark on the slurmctld node, I looked at traffic for node 1 and
> node 2. I can see traffic between the slurmctld node and node 2 at
> intervals of about 300 seconds, but for node 1 the interval is sometimes
> as much as 1800 seconds.
>
> Any reason why these nodes might be getting "pinged" less often than the
> others? The slurm.conf is identical, and contains these timer settings
> (which I think are all defaults):
>
> # TIMERS
> InactiveLimit=0
> KillWait=30
> MinJobAge=300
> SlurmctldTimeout=120
> SlurmdTimeout=300
> Waittime=0
>
> Slurm version 14.11.7.
>
> Allan
> --
> Allan Streib
> Indiana University School of Informatics and Computing
> Digital Science Center :: Community Grids Lab :: FutureSystems
>
