The "agent maximum delay 21 seconds" is reporting what is actually  
observed. The timeouts could be due to many reasons: network problems,  
slurmd daemons paged out and not responding, etc. There is retry logic  
so this isn't critical (it is "debug3" and not normally reported)  
unless you start seeing messages like this  "Nodes r5i3n10 not  
responding", in which case you want to determine what happened to that  
node.
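
As a rough sketch (commands below are standard Slurm/shell tools; the node  
name is just the one from your log, and the log path depends on your  
SlurmdLogFile setting), a first pass at such a node might look like:

    # From the head node: what state does slurmctld report, and why?
    scontrol show node r5i3n10
    sinfo -R

    # On the node itself: is slurmd still running and logging?
    ssh r5i3n10 'ps -ef | grep [s]lurmd; tail /var/log/slurmd.log'

    # Once the cause is fixed, return the node to service
    scontrol update NodeName=r5i3n10 State=RESUME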

Quoting Michel Bourget <mic...@sgi.com>:

>
> Hi all,
>
> I have a situation where, in a 1024-node cluster, we see several of
> the following:
> [2012-03-23T05:03:37] debug2: agent maximum delay 21 seconds
> [2012-03-23T05:03:37] error: Nodes r5i3n10 not responding
> [2012-03-23T06:02:01] debug3: agent thread 140737047705344 timed out
> [2012-03-23T06:05:54] debug3: agent thread 140737027704576 timed out
>
> I am trying to figure out how this could be avoided.
> It looks like this could be related to the TreeWidth configuration, but we
> use the recommended default of 50, which is documented to be OK
> for clusters < 2500 nodes.
>
> I am also puzzled as to why 21 seconds shows up instead of
> COMMAND_TIMEOUT=30.
>
> FYI. slurm 2.2.7.
>
> Any clues ?
>
> TIA
>
> --
>
> -----------------------------------------------------------
>       Michel Bourget - SGI - Linux Software Engineering
>      "Past BIOS POST, everything else is extra" (travis)
> -----------------------------------------------------------
>