1) Capture the logs from all 5 servers.
2) Give the config for the "down" server, and also indicate what its server id is.
3) If possible, it would be interesting to see the netstat information from at least 2 of the servers: the one that's down and one or more of the others.
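
As a complement to the netstat output, a quick reachability check of the election port on each peer can be sketched as below. This is only a sketch: port 3888 is taken from the log in this thread, and the function names and host names are made up for illustration.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Try a plain TCP connect; return (True, "") or (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        # Covers "Connection refused", timeouts and name-resolution failures.
        return False, str(exc)


def check_election_ports(hosts, port=3888):
    """Map each host to the result of connecting to its election port."""
    return {host: port_open(host, port) for host in hosts}
```

Calling check_election_ports(["somehost1", "somehost2"]) from the "down" server (placeholder host names) would show which peers refuse connections on the election port, which is the same condition the WARN lines report.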

Patrick

Jean-Daniel Cryans wrote:
I believe we've just hit the same problem with zk-3.2.1

For some reason a machine crashed and it was part of our quorum of 5
servers. When we try to restart it, it does this (I replaced the
hostname and IP):

2010-01-25 10:25:06,469 WARN
org.apache.zookeeper.server.quorum.QuorumCnxManager: Cannot open
channel to 1 at election address somehost1/someip1:3888
java.net.ConnectException: Connection refused
        at sun.nio.ch.Net.connect(Native Method)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
        at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:356)
        at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:603)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:488)

It has been like that for almost 20 minutes now, trying every other
server in the quorum on different channels. ruok says imok but all
other commands say that ZK server isn't running. I don't believe that
3.2.2 will help unless ZK-547 does more than it seems to.
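
For comparing what each server answers to ruok versus stat, the four-letter commands can be sent over a plain TCP socket. The following is a rough sketch, not a tool from the ZooKeeper distribution: the function names are invented here, and the client port has to be supplied by the caller since it is not given in this thread.

```python
import socket


def four_letter_word(host, port, cmd, timeout=5.0):
    """Send a four-letter command (e.g. "ruok", "stat") and return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            # The server closes the connection once it has replied.
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def probe(servers, commands=("ruok", "stat")):
    """Run each command against each (host, port) pair, recording errors as text."""
    results = {}
    for host, port in servers:
        for cmd in commands:
            try:
                results[(host, port, cmd)] = four_letter_word(host, port, cmd)
            except OSError as exc:
                results[(host, port, cmd)] = "ERROR: %s" % exc
    return results
```

Running probe() over all five servers with their client ports would give one table showing which ones answer imok to ruok yet still report that the server isn't running for stat.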

Anything else I should look at?

Thx!

J-D

On Wed, Jan 13, 2010 at 11:19 AM, Nick Bailey <ni...@mailtrust.com> wrote:
So the solution for us was to just nuke zookeeper and restart everywhere.
 We will also be upgrading soon.

To answer your question, yes I believe all the servers were running normally
except for the fact that they were experiencing high CPU usage.  As we began
to see some CPU alerts I started restarting some of the servers.

It was then that we noticed that they were not actually running according to
'stat'.

I still have the log from one server with a debug level and the rest with a
warn level. If you would like to see any of these and analyze them just let
me know.

Thanks for the help,
Nick Bailey
