Hi all,

I have a situation where, on a 1024-node cluster, we see several of
the following messages:
[2012-03-23T05:03:37] debug2: agent maximum delay 21 seconds
[2012-03-23T05:03:37] error: Nodes r5i3n10 not responding
[2012-03-23T06:02:01] debug3: agent thread 140737047705344 timed out
[2012-03-23T06:05:54] debug3: agent thread 140737027704576 timed out

I am trying to work out how this could be avoided.
It looks like this could be related to the TreeWidth configuration, but
the cluster uses the recommended default of 50, which is documented to
be OK for clusters of fewer than 2500 nodes.
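
For reference, here is a rough sketch of the arithmetic I assume is behind
that limit. It models message forwarding as a simple full tree in which
every node relays to TreeWidth others; that model and the function name
hops_needed are my own simplification, not Slurm's actual forwarding code.

```python
def hops_needed(nodes: int, width: int) -> int:
    """Return the number of forwarding levels needed to reach `nodes`
    nodes, assuming an idealized full tree with fan-out `width`
    (an assumption, not Slurm's real algorithm)."""
    reachable = 0   # nodes reachable with the levels counted so far
    level_size = 1  # number of senders at the current level
    hops = 0
    while reachable < nodes:
        level_size *= width      # each sender contacts `width` nodes
        reachable += level_size
        hops += 1
    return hops


# 1024 nodes with a fan-out of 50 fit within two levels in this model.
print(hops_needed(1024, 50))
```

Under this model a fan-out of 50 covers 50 + 2500 nodes in two levels, so
1024 nodes should be well within range.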

I am also puzzled as to why 21 seconds shows up instead of 
COMMAND_TIMEOUT=30.

FYI: this is Slurm 2.2.7.

Any clues?

TIA

-- 

-----------------------------------------------------------
      Michel Bourget - SGI - Linux Software Engineering
     "Past BIOS POST, everything else is extra" (travis)
-----------------------------------------------------------
