[slurm-dev] Re: Slurm not working: Reason=Node unexpectedly rebooted

Tim Wickberg Mon, 05 Nov 2012 15:25:06 -0800

Tony -

I'd guess that the clocks across your cluster are not in sync. The 
messages you're seeing indicate that munge is not able to send messages 
between nodes correctly, and clock sync problems seem to be the most 
common cause of this. IIRC, by default, the messages are accepted within 
+/- 5 minutes. If your clocks are further apart than this, the nodes 
won't communicate properly.
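
A quick way to check for skew is to compare `date +%s` on two machines. The snippet below is only a rough sketch: `skew_ok` is a hypothetical helper name, the 300-second default is just the +/- 5 minute window recalled above (not a value checked against your munge configuration), and `node1` stands in for one of your compute nodes.

```shell
# Hypothetical helper: succeed if two epoch timestamps (in seconds)
# differ by no more than a tolerance.
# $1, $2 = timestamps; $3 = tolerance in seconds (assumed default: 300).
skew_ok() {
    skew_diff=$(( $1 - $2 ))
    # Take the absolute value of the difference.
    if [ "$skew_diff" -lt 0 ]; then
        skew_diff=$(( 0 - skew_diff ))
    fi
    [ "$skew_diff" -le "${3:-300}" ]
}

# Example use from the controller (node1 is a placeholder host name):
#   here=$(date +%s)
#   there=$(ssh node1 date +%s)
#   skew_ok "$here" "$there" || echo "node1 clock is too far off"
```

The ssh round trip adds a little delay to the measurement, so this is only good for spotting skew on the order of minutes, which is the scale that matters here.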


Testing munge on a single node only shows that node works, not that it 
is capable of sending messages across the network successfully.

To test that you'd want to do something more like:

munge -n | ssh node1 unmunge

and check that it returns success.
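
To run the same cross-node check against several nodes at once, something like the following could work. It is a sketch, not a tested tool: `check_munge_nodes` is a made-up function name, the node names are placeholders, and it assumes passwordless ssh from the machine you run it on.

```shell
# Hypothetical helper: encode a credential locally and decode it on each
# node given as an argument. Prints one line per node and returns
# non-zero if any node failed to decode the credential.
check_munge_nodes() {
    failed=0
    for node in "$@"; do
        # The pipeline's exit status is that of ssh, which passes
        # through the exit status of the remote unmunge.
        if munge -n | ssh "$node" unmunge >/dev/null 2>&1; then
            echo "$node: ok"
        else
            echo "$node: FAILED"
            failed=1
        fi
    done
    return "$failed"
}

# Example use (node names are placeholders):
#   check_munge_nodes node1 node2 node3
```

A failure here while the local `munge -n | unmunge` succeeds points at something that differs between nodes, such as the clock or the munge key.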

- Tim


On 11/05/2012 04:16 PM, Tony wrote:
>
> Hi, thanks for your replies.
> As you suggested, I set ReturnToService to 2. After restarting slurm, I
> have one node back, but another 2 are still down. The status is still
> Reason=Node unexpectedly rebooted [slurm@2012-11-04T22:05:38].
>
> I run slurmd -Dvvvv on a problematic node, it gives error like this:
> ----------------------
> slurmd: debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
> slurmd: error: slurm_receive_msg: Zero Bytes were transmitted or received
> slurmd: error: Unable to register: Zero Bytes were transmitted or received
> slurmd: debug:  Unable to register with slurm controller, retrying
> ----------------------
> It looks like there is a communication problem. But on that node, munge
> works (munge -n | unmunge returns success), and ssh also works.
> I really don't know where the problem is.
>
> The version of Slurm is 2.4.3.
> Config file as follows, I removed most of the comment lines

--
Tim Wickberg
[email protected]
Senior System Administrator
Office of Research, Rensselaer Polytechnic Institute
