Hi,
Tim, you were right: the problem was unsynchronized clocks. After I set
up an NTP server for the entire cluster, the nodes all went back to idle :)
Thank you so much!
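
For anyone who finds this thread later, here is a minimal sketch of a
clock check, not an official munge or Slurm tool. The function names
skew_ok and check_node are made up for illustration, the node names are
placeholders, and the 300-second limit simply follows the +/- 5 minutes
Tim recalled below, so treat it as an assumption rather than a verified
default.

```shell
# Hypothetical helper (not part of munge or Slurm): succeeds when two
# epoch timestamps differ by no more than LIMIT seconds. The default of
# 300 is an assumption based on the "+/- 5 minutes" mentioned in the thread.
skew_ok() {
    _a=$1
    _b=$2
    _limit=${3:-300}
    _diff=$(( _a - _b ))
    # Take the absolute value of the difference.
    if [ "$_diff" -lt 0 ]; then
        _diff=$(( 0 - _diff ))
    fi
    [ "$_diff" -le "$_limit" ]
}

# Hypothetical helper: compare the local clock with a remote node's clock
# over ssh. Returns 2 if the remote time could not be read, 1 if the skew
# is too large, 0 otherwise.
check_node() {
    _node=$1
    _remote=$(ssh "$_node" date +%s) || return 2
    _local=$(date +%s)
    if skew_ok "$_local" "$_remote"; then
        echo "$_node: clock within tolerance"
    else
        echo "$_node: clock skew too large"
        return 1
    fi
}
```

After sourcing this, something like "check_node node1" (with a real
hostname) can be run from the controller for each compute node. The ssh
round trip adds a little delay, so this only catches skew on the order of
seconds or more, which is enough for this kind of problem.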

-Tony


On Mon, Nov 5, 2012 at 5:50 PM, Tim Wickberg <[email protected]> wrote:

>
> Tony -
>
> I'd guess that the clocks across your cluster are not in sync - the
> messages you're seeing indicate that munge is not able to send messages
> between nodes correctly, and clock sync problems seem to be the common
> cause of this. IIRC, by default, the messages are accepted within +/- 5
> minutes. If your clocks are further apart than this the nodes won't
> communicate properly.
>
> Testing munge on a single node only shows that node works, not that it
> is capable of sending messages across the network successfully.
>
> To test that you'd want to do something more like:
>
> munge -n | ssh node1 unmunge
>
> and check that it returns success.
>
> - Tim
>
>
> On 11/05/2012 04:16 PM, Tony wrote:
> >
> > Hi, thanks for your replies.
> > As you suggested, I set ReturnToService to 2. After restarting slurm, one
> > node came back, but another 2 are still down. Their status is still
> > Reason=Node unexpectedly rebooted [slurm@2012-11-04T22:05:38].
> >
> > I ran slurmd -Dvvvv on a problematic node, and it gives errors like this:
> > ----------------------
> > slurmd: debug:  _slurm_recv_timeout at 0 of 4, recv zero bytes
> > slurmd: error: slurm_receive_msg: Zero Bytes were transmitted or received
> > slurmd: error: Unable to register: Zero Bytes were transmitted or received
> > slurmd: debug:  Unable to register with slurm controller, retrying
> > ----------------------
> > It looks like there is a communication problem. But on that node, munge
> > works (munge -n | unmunge returns success), and ssh also works.
> > I really don't know where the problem is.
> >
> > The version of Slurm is 2.4.3.
> > Config file as follows, I removed most of the comment lines
>
> --
> Tim Wickberg
> [email protected]
> Senior System Administrator
> Office of Research, Rensselaer Polytechnic Institute
>
