Tony - I'd guess that the clocks across your cluster are not in sync. The messages you're seeing indicate that munge is not able to send messages between nodes correctly, and clock sync problems seem to be the common cause of this. IIRC, by default, the messages are accepted within +/- 5 minutes. If your clocks are further apart than this, the nodes won't communicate properly.
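As a rough sketch of how you could check for that, the script below compares this host's clock with each node's over ssh. The 300 second limit just mirrors the +/- 5 minutes I recalled above and is an assumption, not a verified munge default; clock_skew, check_node and the script name are made up for illustration.

```shell
#!/bin/sh
# Sketch: report nodes whose clock differs from this host's by more than MAX_SKEW.
# MAX_SKEW=300 is an assumption taken from the "+/- 5 minutes" recollection above.
MAX_SKEW=${MAX_SKEW:-300}

# clock_skew LOCAL REMOTE: print the absolute difference in seconds (hypothetical helper)
clock_skew() {
    delta=$(( $1 - $2 ))
    if [ "$delta" -lt 0 ]; then
        delta=$(( 0 - delta ))
    fi
    echo "$delta"
}

# check_node HOST: fetch the remote epoch time over ssh and compare (hypothetical helper)
check_node() {
    remote=$(ssh "$1" date +%s) || return 2
    local_now=$(date +%s)
    skew=$(clock_skew "$local_now" "$remote")
    if [ "$skew" -gt "$MAX_SKEW" ]; then
        echo "$1: clock differs by ${skew}s (limit ${MAX_SKEW}s)"
        return 1
    fi
    echo "$1: ok (${skew}s)"
}

# Usage: ./check_skew.sh node1 node2 ...
for node in "$@"; do
    check_node "$node"
done
```

The ssh round trip adds a second or so of error, which doesn't matter when you're looking for skew on the order of minutes.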
Testing munge on a single node only shows that node works, not that it is capable of sending messages across the network successfully. To test that, you'd want something more like this to return success:

    munge -n | ssh node1 unmunge

- Tim

On 11/05/2012 04:16 PM, Tony wrote:
>
> Hi, thanks for you guys' reply.
> As you suggested, I set ReturnToService to 2, after restarting slurm, I
> have a node back. But still have another 2 down. The status is still
> Reason=Node unexpectedly rebooted [slurm@2012-11-04T22:05:38].
>
> I run slurmd -Dvvvv on a problematic node, it gives error like this:
> ----------------------
> slurmd: debug: _slurm_recv_timeout at 0 of 4, recv zero bytes
> slurmd: error: slurm_receive_msg: Zero Bytes were transmitted or received
> slurmd: error: Unable to register: Zero Bytes were transmitted or received
> slurmd: debug: Unable to register with slurm controller, retrying
> ----------------------
> It looks communication has problem. But on that node, munge works (munge
> -n |unmunge return success), ssh also works.
> I really don't know where is the problem.
>
> The version of Slurm is 2.4.3.
> Config file as follows, I removed most of the comment lines

--
Tim Wickberg
[email protected]
Senior System Administrator
Office of Research, Rensselaer Polytechnic Institute
