I just migrated over 611 nodes to Slurm from Moab/Torque, the last set of
our nodes, and noticed that a subset of them, around 39 or so, show as down
with a * after the word "down". I have tried to change the state to IDLE, but
the log files show "Communication connection failure rpc:1008" errors, and
I can't seem to see what is causing this.


Any ideas on what to troubleshoot would be helpful. I tried munge -n |
ssh nodename unmunge, so munge is communicating just fine. Does it have
anything to do with any of the scheduler parameters? My thought is that
the message timeout is too low for a cluster of this size:
1831 nodes.
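
For anyone who wants to reproduce the list of affected nodes, here is a rough sketch of how I am picking them out. It assumes sinfo is on the PATH of the controller host and relies on sinfo appending a * to the state of nodes that are not responding. The helper name list_unresponsive is just something made up for this example, not part of Slurm.

```shell
# Hypothetical helper: reads "nodename state" pairs on stdin and prints
# the names of nodes whose state ends in "*" (not responding).
list_unresponsive() {
    awk '$2 ~ /\*$/ { print $1 }'
}

# Only query Slurm if sinfo is actually installed on this host.
if command -v sinfo >/dev/null 2>&1; then
    # -N: one line per node, -h: no header, %N: node name, %T: state
    # sort -u drops duplicates for nodes that belong to several partitions.
    sinfo -N -h -o '%N %T' | list_unresponsive | sort -u
fi
```

The resulting names can then be checked one at a time, for example with scontrol show node nodename, to see the reason recorded for each.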

Current setting is MessageTimeout          = 60 sec

Should I increase it to 5 minutes, or at least 2 minutes?

Jackie
