Everything is here: http://people.apache.org/~jdcryans/zk_election_bug.tar.gz
The server we are trying to start is sv4borg222 (myid is 2), and we started it around 10:03:21.

Thx!

J-D

On Mon, Jan 25, 2010 at 10:49 AM, Patrick Hunt <ph...@apache.org> wrote:
> 1) Capture the logs from all 5 servers
> 2) give the config for the "down" server, also indicate what its server id
> is.
> 3) if possible it would be interesting to see the netstat information from 2
> of the servers - the one that's down and one or more of the others.
>
> Patrick
>
> Jean-Daniel Cryans wrote:
>>
>> I believe we've just hit the same problem with zk-3.2.1
>>
>> For some reason a machine crashed and it was part of our quorum of 5
>> servers. When we try to restart it, it does this (I replaced
>> hostname and IP):
>>
>> 2010-01-25 10:25:06,469 WARN
>> org.apache.zookeeper.server.quorum.QuorumCnxManager: Cannot open
>> channel to 1 at election address somehost1/someip1:3888
>> java.net.ConnectException: Connection refused
>>         at sun.nio.ch.Net.connect(Native Method)
>>         at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
>>         at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
>>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
>>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:356)
>>         at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:603)
>>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:488)
>>
>> It has been like that for almost 20 minutes now, trying every other
>> server in the quorum on different channels. ruok says imok but all
>> other commands say that ZK server isn't running. I don't believe that
>> 3.2.2 will help unless ZK-547 does more than it seems to.
>>
>> Anything else I should look at?
>>
>> Thx!
>>
>> J-D
>>
>> On Wed, Jan 13, 2010 at 11:19 AM, Nick Bailey <ni...@mailtrust.com> wrote:
>>>
>>> So the solution for us was to just nuke zookeeper and restart everywhere.
>>> We will also be upgrading soon.
>>>
>>> To answer your question, yes I believe all the servers were running
>>> normally except for the fact that they were experiencing high CPU usage.
>>> As we began to see some CPU alerts I started restarting some of the
>>> servers.
>>>
>>> It was then that we noticed that they were not actually running according
>>> to 'stat'.
>>>
>>> I still have the log from one server with a debug level and the rest with
>>> a warn level. If you would like to see any of these and analyze them just
>>> let me know.
>>>
>>> Thanks for the help,
>>> Nick Bailey
>>>