I just migrated 611 nodes from Moab/Torque to Slurm; this was the last set of our nodes. I noticed that a subset of them, around 39 or so, show as down with a * after the word "down". I have tried to change their state to IDLE, but the log files show "Communication connection failure" (rpc:1008) errors, and I can't seem to see what is causing this.
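For reference, the state check and the reset attempt described above look roughly like the sketch below. The node name is a placeholder, and the run wrapper with its DRY_RUN switch is only an illustration device I added here, not something Slurm provides; by default it just prints the commands instead of executing them.

```shell
#!/bin/sh
# Placeholder node name (assumption); replace with one of the affected nodes.
NODE="${NODE:-nodename}"

# Illustration-only helper: with DRY_RUN=1 (the default) commands are printed,
# with DRY_RUN=0 they are actually executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# List down/drained nodes together with the reason slurmctld recorded.
run sinfo -R

# Show the full record for one affected node, including State and Reason.
run scontrol show node "$NODE"

# Try to put the node back into service (the state change attempted above).
run scontrol update NodeName="$NODE" State=IDLE
```

Running it with DRY_RUN=0 on the controller host would issue the real commands.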
Any ideas on what to troubleshoot would be helpful. I tried munge -n | ssh nodename unmunge, so munge is communicating just fine. Does this have anything to do with any of the scheduler parameters? My thought is that the message timeout is too low for a cluster of this size (1831 nodes). The current setting is MessageTimeout = 60 sec; should I increase it to 5 minutes, or at least 2 minutes?

Jackie
