On Tue, 2017-01-24 at 14:11:11 -0800, Lachlan Musicman wrote:
> Check that they all have the same time, or run ntpd against the same server. I
> found that the nodes that kept going down had the time out of sync.

While I'm running three NTP servers locally (synced to the same external time
source, and listed in each other's ntp.conf) to avoid exactly that kind of
time-sync problem, I still see "error: ... not responding" messages in the log,
with the corresponding "now responding" messages timestamped about one minute
*before* the error line, e.g.
...
[2017-01-25T10:57:38.363] Node c052 now responding
[2017-01-25T10:57:38.363] Node c456 now responding
[2017-01-25T10:58:46.399] error: Nodes c[052,454,456,567,570,580,591] not responding
...
The order is certainly wrong. (This is Slurm 15.08.8.)
In rare cases, the "not responding" state persists long enough that the
corresponding node gets set to "down", even though it isn't actually down.
Are these timeouts and cycle durations hardwired in Slurm, or can I adjust them
somewhere in the config? (Somehow I cannot find anything that matches...)
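The closest candidates I can see would be something like the following in
slurm.conf, though I'm not sure these actually govern the ping cycle behind
those messages:

# slurm.conf (excerpt)
SlurmdTimeout=300    # seconds before an unresponsive node is set DOWN (default 300)
MessageTimeout=10    # seconds to wait for slurmctld/slurmd messages (default 10)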

Thanks,
 S
