Hi Tim, you were right: the problem was unsynchronized clocks. After I set up an NTP server for the entire cluster, all the nodes went back to idle :) Thank you so much!
-Tony

On Mon, Nov 5, 2012 at 5:50 PM, Tim Wickberg <[email protected]> wrote:
>
> Tony -
>
> I'd guess that the clocks across your cluster are not in sync - the
> messages you're seeing indicate that munge is not able to send messages
> between nodes correctly, and clock sync problems seem to be the common
> cause of this. IIRC, by default, the messages are accepted within +/- 5
> minutes. If your clocks are further apart than this the nodes won't
> communicate properly.
>
> Testing munge on a single node only shows that node works, not that it
> is capable of sending messages across the network successfully.
>
> To test that you'd want to do something more like:
>
> munge -n | ssh node1 unmunge
>
> which should return success.
>
> - Tim
>
>
> On 11/05/2012 04:16 PM, Tony wrote:
> >
> > Hi, thanks for you guys' reply.
> > As you suggested, I set ReturnToService to 2, after restarting slurm, I
> > have a node back. But still have another 2 down. The status is still
> > Reason=Node unexpectedly rebooted [slurm@2012-11-04T22:05:38].
> >
> > I run slurmd -Dvvvv on a problematic node, it gives error like this:
> > ----------------------
> > slurmd: debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
> > slurmd: error: slurm_receive_msg: Zero Bytes were transmitted or received
> > slurmd: error: Unable to register: Zero Bytes were transmitted or received
> > slurmd: debug: Unable to register with slurm controller, retrying
> > ----------------------
> > It looks communication has problem. But on that node, munge works (munge
> > -n |unmunge return success), ssh also works.
> > I really don't know where is the problem.
> >
> > The version of Slurm is 2.4.3.
> > Config file as follows, I removed most of the comment lines
>
> --
> Tim Wickberg
> [email protected]
> Senior System Administrator
> Office of Research, Rensselaer Polytechnic Institute
>
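
For anyone who finds this thread later, here is a rough sketch of the two checks discussed above: comparing each node's clock against the local one, then sending a munge credential across the network. The function names, the MAX_SKEW variable, and the node names are my own placeholders, not part of Slurm or munge. The 300 second limit only mirrors the "+/- 5 minutes" Tim recalled, so treat it as an assumption. It also assumes passwordless ssh to each node and a `date` that supports `+%s`.

```shell
#!/bin/sh
# Sketch only: helper names and MAX_SKEW are illustrative, not from Slurm or munge.
# 300 s mirrors the "+/- 5 minutes" recalled in the thread (an assumption).
MAX_SKEW=${MAX_SKEW:-300}

# skew_seconds A B: print the absolute difference between two epoch timestamps.
skew_seconds() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    echo "$diff"
}

# within_limit A B: succeed if the two timestamps differ by at most MAX_SKEW.
within_limit() {
    [ "$(skew_seconds "$1" "$2")" -le "$MAX_SKEW" ]
}

# check_node NODE: compare clocks over ssh, then try a cross-node munge round trip.
check_node() {
    node=$1
    remote=$(ssh "$node" date +%s) || { echo "$node: ssh failed"; return 1; }
    local_now=$(date +%s)
    skew=$(skew_seconds "$local_now" "$remote")
    if ! within_limit "$local_now" "$remote"; then
        echo "$node: clock skew ${skew}s exceeds ${MAX_SKEW}s"
        return 1
    fi
    echo "$node: clock skew ${skew}s is within ${MAX_SKEW}s"
    # Encode a credential locally and decode it on the remote node.
    if munge -n | ssh "$node" unmunge >/dev/null; then
        echo "$node: munge credential accepted"
    else
        echo "$node: munge credential rejected"
        return 1
    fi
}
```

To use it, source the file and run something like `for n in node1 node2; do check_node "$n"; done`, substituting your own hostnames. The measured skew includes the ssh round-trip time, so it is only good for spotting clocks that are minutes apart, not for fine-grained comparison.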
