Hi all,

I run a 3 node ZK ensemble and I had a hardware failure of the leader, followed 
by an outage of the entire ensemble because no quorum could be reached. I have 
looked at the logs and can’t figure out what happened, and would like some help 
trying to figure out how to fix it or if there is a bug here.

I’m running 3.4.8 on centos 7.

3 nodes, 1, 2, and 3.  2 was the leader and went down.  The ensemble remained 
unavailable until node 2 was brought back online where continues to operate 
normally.

1 looked as if it participated in an election, figured out that 3 would be the 
leader, then tried to connect but got a connection refused on port 2888, which 
is what I understand how the ensemble members use to communicate with each 
other about quorums.

I also don’t see 3 attempting to connect to 1, which I find alarming.

Any ideas?


Logs from Node 1:

2016-09-15 14:00:46,812 [myid:1] - WARN  
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when 
following the leader
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:170)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at 
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at 
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
        at 
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
        at 
org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
        at 
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:846)
2016-09-15 14:00:46,812 [myid:1] - INFO  
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower
        at 
org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850)
2016-09-15 14:00:46,817 [myid:1] - INFO  
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] - 
Shutting down
2016-09-15 14:00:46,817 [myid:1] - INFO  
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@469] - shutting down
2016-09-15 14:00:46,817 [myid:1] - INFO  
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@107] - 
Shutting down
2016-09-15 14:00:46,817 [myid:1] - INFO  
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] - Shutting down
2016-09-15 14:00:46,817 [myid:1] - INFO  
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@415] - shutdown 
of request processor complete
2016-09-15 14:00:46,817 [myid:1] - INFO  
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@209] - Shutting 
down
2016-09-15 14:00:46,817 [myid:1] - INFO  
[SyncThread:1:SyncRequestProcessor@187] - SyncRequestProcessor exited!
2016-09-15 14:00:46,817 [myid:1] - INFO  
[FollowerRequestProcessor:1:FollowerRequestProcessor@97] - 
FollowerRequestProcessor exited loop!
2016-09-15 14:00:46,817 [myid:1] - INFO  
[CommitProcessor:1:CommitProcessor@153] - CommitProcessor exited loop!
2016-09-15 14:00:46,818 [myid:1] - INFO  
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
2016-09-15 14:00:46,883 [myid:1] - INFO  
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot 
/opt/data-zookeeper/version-2/snapshot.2b0004b82a
2016-09-15 14:00:49,194 [myid:1] - INFO  
[WorkerReceiver[myid=1]:FastLeaderElection@600] - Notification: 1 (message 
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING 
(n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:49,395 [myid:1] - INFO  
[WorkerReceiver[myid=1]:FastLeaderElection@600] - Notification: 1 (message 
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING 
(n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:57,377 [myid:1] - INFO  
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New 
election. My id =  1, proposed zxid=0x2b00061453
2016-09-15 14:00:57,378 [myid:1] - INFO  
[WorkerReceiver[myid=1]:FastLeaderElection@600] - Notification: 1 (message 
format version), 1 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING 
(n.state), 1 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:57,378 [myid:1] - INFO  
[WorkerReceiver[myid=1]:FastLeaderElection@600] - Notification: 1 (message 
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING 
(n.state), 1 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:57,580 [myid:1] - INFO  
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer@844] - FOLLOWING
2016-09-15 14:00:57,580 [myid:1] - INFO  
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@170] - Created server 
with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir 
/opt/data-zookeeper/version-2 snapdir /opt/data-zookeeper/version-2
2016-09-15 14:00:57,580 [myid:1] - INFO  
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@63] - FOLLOWING - LEADER 
ELECTION TOOK - 10762
2016-09-15 14:00:57,621 [myid:1] - INFO  
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149] - 
Resolved hostname: zoo3 to address: zoo3/10.10.10.102
2016-09-15 14:00:57,622 [myid:1] - WARN  
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, 
tries=0, connecting to zoo3/10.10.10.102:2888
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at 
org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:228)
        at 
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:846)
2016-09-15 14:00:58,623 [myid:1] - WARN  
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception, 
tries=1, connecting to zoo3/10.10.10.102:2888
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at 
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)

Logs from node 3:
2016-09-15 14:00:46,792 [myid:3] - WARN  
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when 
following the leader
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:170)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at 
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at 
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
        at 
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
        at 
org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
        at 
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:846)
2016-09-15 14:00:46,793 [myid:3] - INFO  
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower
        at 
org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
        at 
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850)
2016-09-15 14:00:46,798 [myid:3] - INFO  
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] - 
Shutting down
2016-09-15 14:00:46,798 [myid:3] - INFO  
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@469] - shutting down
2016-09-15 14:00:46,798 [myid:3] - INFO  
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@107] - 
Shutting down
2016-09-15 14:00:46,798 [myid:3] - INFO  
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] - Shutting down
2016-09-15 14:00:46,798 [myid:3] - INFO  
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@415] - shutdown 
of request processor complete
2016-09-15 14:00:46,798 [myid:3] - INFO  
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@209] - Shutting 
down
2016-09-15 14:00:46,799 [myid:3] - INFO  
[SyncThread:3:SyncRequestProcessor@187] - SyncRequestProcessor exited!
2016-09-15 14:00:46,799 [myid:3] - INFO  
[FollowerRequestProcessor:3:FollowerRequestProcessor@97] - 
FollowerRequestProcessor exited loop!
2016-09-15 14:00:46,799 [myid:3] - INFO  
[CommitProcessor:3:CommitProcessor@153] - CommitProcessor exited loop!
2016-09-15 14:00:46,799 [myid:3] - INFO  
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
2016-09-15 14:00:46,838 [myid:3] - INFO  
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot 
/opt/data-zookeeper/version-2/snapshot.2b0004fab9
2016-09-15 14:00:49,193 [myid:3] - INFO  
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New 
election. My id =  3, proposed zxid=0x2b00061453
2016-09-15 14:00:49,194 [myid:3] - INFO  
[WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message 
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING 
(n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:49,394 [myid:3] - INFO  
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification 
time out: 400
2016-09-15 14:00:49,394 [myid:3] - INFO  
[WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message 
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING 
(n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:49,794 [myid:3] - INFO  
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification 
time out: 800
2016-09-15 14:00:49,795 [myid:3] - INFO  
[WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message 
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING 
(n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:50,595 [myid:3] - INFO  
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification 
time out: 1600
2016-09-15 14:00:50,595 [myid:3] - INFO  
[WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message 
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING 
(n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:52,195 [myid:3] - INFO  
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification 
time out: 3200
2016-09-15 14:00:52,196 [myid:3] - INFO  
[WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message 
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING 
(n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:55,396 [myid:3] - INFO  
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification 
time out: 6400
2016-09-15 14:00:55,396 [myid:3] - INFO  
[WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message 
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING 
(n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:01:01,796 [myid:3] - INFO  
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification 
time out: 12800
2016-09-15 14:01:01,796 [myid:3] - INFO  
[WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message 
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING 
(n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:01:14,597 [myid:3] - INFO  
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification 
time out: 25600
2016-09-15 14:01:14,597 [myid:3] - INFO  
[WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message 
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING 
(n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:01:40,197 [myid:3] - INFO  
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification 
time out: 51200
2016-09-15 14:01:40,197 [myid:3] - INFO  
[WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message 
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING 
(n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
EndOfStreamException: Unable to read additional data from client sessionid 0x0, 
likely client has closed socket
        at java.lang.Thread.run(Thread.java:745)
EndOfStreamException: Unable to read additional data from client sessionid 0x0, 
likely client has closed socket
        at java.lang.Thread.run(Thread.java:745)
EndOfStreamException: Unable to read additional data from client sessionid 0x0, 
likely client has closed socket
        at java.lang.Thread.run(Thread.java:745)

Reply via email to