Jackie, what does the slurmd log look like on one of these nodes? The * means just what you thought, no communication.
Make sure you can ping the address from the slurmctld. Your timeout should be fine.

Danny

On May 27, 2014 4:40:23 PM PDT, Jacqueline Scoggins <[email protected]> wrote:
>I just migrated over 611 nodes to slurm from moab/torque, the last set of
>our nodes, and noticed that a subset of the nodes, around 39 or so, show
>down with a * after the word down. I have tried to change the state to IDLE
>but the log files show "Communication connection failure rpc:1008" errors
>and I can't seem to see what is causing this.
>
>Any ideas of what to troubleshoot would be helpful. Tried the munge -n |
>ssh nodename unmunge so munge is communicating just fine. Does it have
>anything to do with any of the scheduler parameters? My thoughts are that
>the message timeout is too low for a cluster of this size:
> 1831 nodes.
>
>Current setting is MessageTimeout = 60 sec
>
>Should I increase it to 5 minutes or at least 2 minutes?
>
>Jackie
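
For checking connectivity from the slurmctld host across all ~39 affected nodes at once, something like the rough Python sketch below may help. It is not a Slurm tool: it only tests whether a TCP connection to the slurmd port can be opened. The port 6818 is assumed to be the default SlurmdPort; check slurm.conf if yours differs. The function names are made up for illustration.

```python
import socket

# Assumption: 6818 is the default SlurmdPort; confirm against your slurm.conf.
SLURMD_PORT = 6818


def can_reach(host, port=SLURMD_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def unreachable(hosts, port=SLURMD_PORT, timeout=5.0):
    """Return the hosts from `hosts` that did not accept a TCP connection."""
    return [h for h in hosts if not can_reach(h, port, timeout)]
```

Run on the slurmctld host with the list of node names that show down*, for example unreachable(["nodeA", "nodeB"]) with your own hostnames in place of these placeholders. Any node it returns either has no slurmd listening, is blocked by a firewall, or does not resolve from the controller, which would point at the network or the daemon rather than MessageTimeout.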
