Hi all,
I run a 3 node ZK ensemble and I had a hardware failure of the leader, followed
by an outage of the entire ensemble because no quorum could be reached. I have
looked at the logs and can’t figure out what happened, and would like some help
trying to figure out how to fix it or if there is a bug here.
I’m running 3.4.8 on centos 7.
3 nodes, 1, 2, and 3. 2 was the leader and went down. The ensemble remained
unavailable until node 2 was brought back online where continues to operate
normally.
1 looked as if it participated in an election, figured out that 3 would be the
leader, then tried to connect but got a connection refused on port 2888, which
is what I understand how the ensemble members use to communicate with each
other about quorums.
I also don’t see 3 attempting to connect to 1, which I find alarming.
Any ideas?
Logs from Node 1:
2016-09-15 14:00:46,812 [myid:1] - WARN
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
following the leader
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
at
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
at
org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:846)
2016-09-15 14:00:46,812 [myid:1] - INFO
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower
at
org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850)
2016-09-15 14:00:46,817 [myid:1] - INFO
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] -
Shutting down
2016-09-15 14:00:46,817 [myid:1] - INFO
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@469] - shutting down
2016-09-15 14:00:46,817 [myid:1] - INFO
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@107] -
Shutting down
2016-09-15 14:00:46,817 [myid:1] - INFO
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] - Shutting down
2016-09-15 14:00:46,817 [myid:1] - INFO
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@415] - shutdown
of request processor complete
2016-09-15 14:00:46,817 [myid:1] - INFO
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@209] - Shutting
down
2016-09-15 14:00:46,817 [myid:1] - INFO
[SyncThread:1:SyncRequestProcessor@187] - SyncRequestProcessor exited!
2016-09-15 14:00:46,817 [myid:1] - INFO
[FollowerRequestProcessor:1:FollowerRequestProcessor@97] -
FollowerRequestProcessor exited loop!
2016-09-15 14:00:46,817 [myid:1] - INFO
[CommitProcessor:1:CommitProcessor@153] - CommitProcessor exited loop!
2016-09-15 14:00:46,818 [myid:1] - INFO
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
2016-09-15 14:00:46,883 [myid:1] - INFO
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot
/opt/data-zookeeper/version-2/snapshot.2b0004b82a
2016-09-15 14:00:49,194 [myid:1] - INFO
[WorkerReceiver[myid=1]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING
(n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:49,395 [myid:1] - INFO
[WorkerReceiver[myid=1]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING
(n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:57,377 [myid:1] - INFO
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
election. My id = 1, proposed zxid=0x2b00061453
2016-09-15 14:00:57,378 [myid:1] - INFO
[WorkerReceiver[myid=1]:FastLeaderElection@600] - Notification: 1 (message
format version), 1 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING
(n.state), 1 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:57,378 [myid:1] - INFO
[WorkerReceiver[myid=1]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING
(n.state), 1 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:57,580 [myid:1] - INFO
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer@844] - FOLLOWING
2016-09-15 14:00:57,580 [myid:1] - INFO
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@170] - Created server
with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir
/opt/data-zookeeper/version-2 snapdir /opt/data-zookeeper/version-2
2016-09-15 14:00:57,580 [myid:1] - INFO
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@63] - FOLLOWING - LEADER
ELECTION TOOK - 10762
2016-09-15 14:00:57,621 [myid:1] - INFO
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149] -
Resolved hostname: zoo3 to address: zoo3/10.10.10.102
2016-09-15 14:00:57,622 [myid:1] - WARN
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception,
tries=0, connecting to zoo3/10.10.10.102:2888
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at
org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:228)
at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71)
at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:846)
2016-09-15 14:00:58,623 [myid:1] - WARN
[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected exception,
tries=1, connecting to zoo3/10.10.10.102:2888
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at
java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
Logs from node 3:
2016-09-15 14:00:46,792 [myid:3] - WARN
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
following the leader
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
at
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
at
org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:846)
2016-09-15 14:00:46,793 [myid:3] - INFO
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower
at
org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850)
2016-09-15 14:00:46,798 [myid:3] - INFO
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] -
Shutting down
2016-09-15 14:00:46,798 [myid:3] - INFO
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@469] - shutting down
2016-09-15 14:00:46,798 [myid:3] - INFO
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@107] -
Shutting down
2016-09-15 14:00:46,798 [myid:3] - INFO
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] - Shutting down
2016-09-15 14:00:46,798 [myid:3] - INFO
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@415] - shutdown
of request processor complete
2016-09-15 14:00:46,798 [myid:3] - INFO
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@209] - Shutting
down
2016-09-15 14:00:46,799 [myid:3] - INFO
[SyncThread:3:SyncRequestProcessor@187] - SyncRequestProcessor exited!
2016-09-15 14:00:46,799 [myid:3] - INFO
[FollowerRequestProcessor:3:FollowerRequestProcessor@97] -
FollowerRequestProcessor exited loop!
2016-09-15 14:00:46,799 [myid:3] - INFO
[CommitProcessor:3:CommitProcessor@153] - CommitProcessor exited loop!
2016-09-15 14:00:46,799 [myid:3] - INFO
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
2016-09-15 14:00:46,838 [myid:3] - INFO
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot
/opt/data-zookeeper/version-2/snapshot.2b0004fab9
2016-09-15 14:00:49,193 [myid:3] - INFO
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
election. My id = 3, proposed zxid=0x2b00061453
2016-09-15 14:00:49,194 [myid:3] - INFO
[WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING
(n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:49,394 [myid:3] - INFO
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification
time out: 400
2016-09-15 14:00:49,394 [myid:3] - INFO
[WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING
(n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:49,794 [myid:3] - INFO
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification
time out: 800
2016-09-15 14:00:49,795 [myid:3] - INFO
[WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING
(n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:50,595 [myid:3] - INFO
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification
time out: 1600
2016-09-15 14:00:50,595 [myid:3] - INFO
[WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING
(n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:52,195 [myid:3] - INFO
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification
time out: 3200
2016-09-15 14:00:52,196 [myid:3] - INFO
[WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING
(n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:55,396 [myid:3] - INFO
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification
time out: 6400
2016-09-15 14:00:55,396 [myid:3] - INFO
[WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING
(n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:01:01,796 [myid:3] - INFO
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification
time out: 12800
2016-09-15 14:01:01,796 [myid:3] - INFO
[WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING
(n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:01:14,597 [myid:3] - INFO
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification
time out: 25600
2016-09-15 14:01:14,597 [myid:3] - INFO
[WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING
(n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:01:40,197 [myid:3] - INFO
[QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] - Notification
time out: 51200
2016-09-15 14:01:40,197 [myid:3] - INFO
[WorkerReceiver[myid=3]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round), LOOKING
(n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
EndOfStreamException: Unable to read additional data from client sessionid 0x0,
likely client has closed socket
at java.lang.Thread.run(Thread.java:745)
EndOfStreamException: Unable to read additional data from client sessionid 0x0,
likely client has closed socket
at java.lang.Thread.run(Thread.java:745)
EndOfStreamException: Unable to read additional data from client sessionid 0x0,
likely client has closed socket
at java.lang.Thread.run(Thread.java:745)