Hi all, I run a 3 node ZK ensemble and I had a hardware failure of the leader, followed by an outage of the entire ensemble because no quorum could be reached. I have looked at the logs and can’t figure out what happened, and would like some help trying to figure out how to fix it or if there is a bug here.
I’m running 3.4.8 on centos 7. 3 nodes, 1, 2, and 3. 2 was the leader and went down. The ensemble remained unavailable until node 2 was brought back online where continues to operate normally. 1 looked as if it participated in an election, figured out that 3 would be the leader, then tried to connect but got a connection refused on port 2888, which is what I understand how the ensemble members use to communicate with each other about quorums. I also don’t see 3 attempting to connect to 1, which I find alarming. Any ideas? Logs from Node 1: 2016-09-15 14:00:46,812 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:170) at java.net.SocketInputStream.read(SocketInputStream.java:141) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.java:265) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99) at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153) at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:846) 2016-09-15 14:00:46,812 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called java.lang.Exception: shutdown Follower at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850) 2016-09-15 14:00:46,817 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] - Shutting down 2016-09-15 14:00:46,817 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@469] - shutting down 2016-09-15 14:00:46,817 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@107] - Shutting down 2016-09-15 14:00:46,817 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] - Shutting down 2016-09-15 14:00:46,817 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@415] - shutdown of request processor complete 2016-09-15 14:00:46,817 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@209] - Shutting down 2016-09-15 14:00:46,817 [myid:1] - INFO [SyncThread:1:SyncRequestProcessor@187] - SyncRequestProcessor exited! 2016-09-15 14:00:46,817 [myid:1] - INFO [FollowerRequestProcessor:1:FollowerRequestProcessor@97] - FollowerRequestProcessor exited loop! 2016-09-15 14:00:46,817 [myid:1] - INFO [CommitProcessor:1:CommitProcessor@153] - CommitProcessor exited loop! 2016-09-15 14:00:46,818 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING 2016-09-15 14:00:46,883 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot /opt/data-zookeeper/version-2/snapshot.2b0004b82a 2016-09-15 14:00:49,194 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@600] - Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING (n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state) 2016-09-15 14:00:49,395 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@600] - Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING (n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state) 2016-09-15 14:00:57,377 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id = 1, proposed zxid=0x2b00061453 2016-09-15 14:00:57,378 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@600] - Notification: 1 (message format version), 1 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING (n.state), 1 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state) 2016-09-15 14:00:57,378 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@600] - Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING (n.state), 1 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state) 2016-09-15 14:00:57,580 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer@844] - FOLLOWING 2016-09-15 14:00:57,580 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@170] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /opt/data-zookeeper/version-2 snapdir /opt/data-zookeeper/version-2 2016-09-15 14:00:57,580 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@63] - FOLLOWING - LEADER ELECTION TOOK - 10762 2016-09-15 14:00:57,621 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149] - Resolved hostname: zoo3 to address: zoo3/10.10.10.102 2016-09-15 14:00:57,622 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, tries=0, connecting to zoo3/10.10.10.102:2888 java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:589) at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:228) at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:846) 2016-09-15 14:00:58,623 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, tries=1, connecting to zoo3/10.10.10.102:2888 java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) Logs from node 3: 2016-09-15 14:00:46,792 [myid:3] - WARN [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader java.net.SocketTimeoutException: Read timed out at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:170) at java.net.SocketInputStream.read(SocketInputStream.java:141) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.java:265) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99) at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153) at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:846) 2016-09-15 14:00:46,793 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called java.lang.Exception: shutdown Follower at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850) 2016-09-15 14:00:46,798 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] - Shutting down 2016-09-15 14:00:46,798 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@469] - shutting down 2016-09-15 14:00:46,798 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@107] - Shutting down 2016-09-15 14:00:46,798 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] - Shutting down 2016-09-15 14:00:46,798 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@415] - shutdown of request processor complete 2016-09-15 14:00:46,798 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@209] - Shutting down 2016-09-15 14:00:46,799 [myid:3] - INFO [SyncThread:3:SyncRequestProcessor@187] - SyncRequestProcessor exited! 2016-09-15 14:00:46,799 [myid:3] - INFO [FollowerRequestProcessor:3:FollowerRequestProcessor@97] - FollowerRequestProcessor exited loop! 2016-09-15 14:00:46,799 [myid:3] - INFO [CommitProcessor:3:CommitProcessor@153] - CommitProcessor exited loop! 2016-09-15 14:00:46,799 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING 2016-09-15 14:00:46,838 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot /opt/data-zookeeper/version-2/snapshot.2b0004fab9 2016-09-15 14:00:49,193 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New election. My id = 3, proposed zxid=0x2b00061453 2016-09-15 14:00:49,194 [myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING (n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state) 2016-09-15 14:00:49,394 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 400 2016-09-15 14:00:49,394 [myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING (n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state) 2016-09-15 14:00:49,794 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 800 2016-09-15 14:00:49,795 [myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING (n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state) 2016-09-15 14:00:50,595 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 1600 2016-09-15 14:00:50,595 [myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING (n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state) 2016-09-15 14:00:52,195 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 3200 2016-09-15 14:00:52,196 [myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING (n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state) 2016-09-15 14:00:55,396 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 6400 2016-09-15 14:00:55,396 [myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING (n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state) 2016-09-15 14:01:01,796 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 12800 2016-09-15 14:01:01,796 [myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING (n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state) 2016-09-15 14:01:14,597 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 25600 2016-09-15 14:01:14,597 [myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING (n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state) 2016-09-15 14:01:40,197 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification time out: 51200 2016-09-15 14:01:40,197 [myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING (n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state) EndOfStreamException: Unable to read additional data from client sessionid 0x0, likely client has closed socket at java.lang.Thread.run(Thread.java:745) EndOfStreamException: Unable to read additional data from client sessionid 0x0, likely client has closed socket at java.lang.Thread.run(Thread.java:745) EndOfStreamException: Unable to read additional data from client sessionid 0x0, likely client has closed socket at java.lang.Thread.run(Thread.java:745)