Jackie, what does the slurmd log look like on one of these nodes? The * means
just what you thought: the slurmctld is getting no response from the node.

Make sure you can ping the address from the slurmctld. 
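
Ping only shows that the host is up. A slightly stronger check is whether the slurmd port accepts a TCP connection from the slurmctld host. The following is a rough sketch, not part of Slurm: it assumes the default SlurmdPort of 6818 (check slurm.conf if yours differs), and the helper names `can_connect` and `unreachable` are made up for illustration.

```python
import socket

# Assumption: Slurm's default SlurmdPort. Use the value from your slurm.conf if it differs.
SLURMD_PORT = 6818


def can_connect(host, port=SLURMD_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def unreachable(hosts, port=SLURMD_PORT, timeout=5.0):
    """Return the subset of hosts whose port could not be reached."""
    return [h for h in hosts if not can_connect(h, port, timeout)]
```

Run from the slurmctld host, `unreachable(["node001", "node002"])` (placeholder node names) would return the nodes whose slurmd port cannot be reached. A node that answers ping but appears in that list points at slurmd not running, a firewall, or a wrong address rather than at the timeout.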

Your timeout should be fine. 

Danny 

On May 27, 2014 4:40:23 PM PDT, Jacqueline Scoggins <[email protected]> wrote:
>I just migrated over 611 nodes, the last set of our nodes, to slurm from
>moab/torque and noticed that a subset of them, around 39 or so, show down
>with a * after the word down. I have tried to change the state to IDLE,
>but the log files show "Communication connection failure rpc:1008" errors
>and I can't seem to see what is causing this.
>
>
>Any ideas of what to troubleshoot would be helpful. I tried
>munge -n | ssh nodename unmunge, so munge is communicating just fine. Does
>it have anything to do with any of the scheduler parameters? My thought is
>that the message timeout is too low for a cluster of this size: 1831 nodes.
>
>Current setting is MessageTimeout          = 60 sec
>
>Should I increase it to 5 minutes, or at least 2 minutes?
>
>Jackie
