Hello folks,

Wonder if anyone can help me. Suspect it must be something simple but I cant 
see it. Any suggestions about how to diagnose would be gratefully received.

I have a three node ZK cluster, when it starts up only two of  the nodes form a 
quorum. If I restart the leader the quorum reforms with the other two node…

Thanks in advance for any  help
Steve

This is the ‘stat’ for the leader/follow…

bash-5.0$ echo stat | nc zookeeper1 2181
This ZooKeeper instance is not currently serving requests

bash-5.0$ echo stat | nc zookeeper2 2181
Zookeeper version: 3.5.7-f0fdd52973d373ffd9c86b81d99842dc2c7f660e, built on 
02/10/2020 11:30 GMT
Clients:
/10.0.1.152:44910[1](queued=0,recved=151,sent=151)
/10.0.1.140:53138[1](queued=0,recved=187,sent=187)
/10.0.1.143:57422[1](queued=0,recved=151,sent=151)
/10.0.1.152:59242[0](queued=0,recved=1,sent=0)
/10.0.1.143:40826[1](queued=0,recved=1139,sent=1139)
/10.0.1.152:49188[1](queued=0,recved=200,sent=203)
/10.0.1.152:59548[1](queued=0,recved=1157,sent=1159)
/10.0.1.140:36624[1](queued=0,recved=151,sent=151)

Latency min/avg/max: 0/0/5
Received: 3338
Sent: 3342
Connections: 8
Outstanding: 0
Zxid: 0xc000000f3
Mode: follower
Node count: 181

bash-5.0$ echo stat | nc zookeeper3 2181
Zookeeper version: 3.5.7-f0fdd52973d373ffd9c86b81d99842dc2c7f660e, built on 
02/10/2020 11:30 GMT
Clients:
/10.0.1.152:49428[0](queued=0,recved=1,sent=0)
/10.0.1.140:32912[1](queued=0,recved=1426,sent=1429)

Latency min/avg/max: 0/0/4
Received: 1684
Sent: 1686
Connections: 2
Outstanding: 0
Zxid: 0xc000000f4
Mode: leader
Node count: 181
Proposal sizes last/min/max: 32/32/406
bash-5.0$

The trace for the failing node is:

server.1=0.0.0.0:2888:3888
server.2=zookeeper2:2888:3888
server.3=zookeeper3:2888:3888
ZooKeeper JMX enabled by default
Using config: /opt/zookeeper/bin/../conf/zoo.cfg
2020-02-19 04:23:27,759 [myid:] - INFO  [main:QuorumPeerConfig@135] - Reading 
configuration from: /opt/zookeeper/bin/../conf/zoo.cfg
2020-02-19 04:23:27,764 [myid:] - INFO  [main:QuorumPeerConfig@387] - 
clientPortAddress is 0.0.0.0:2181
2020-02-19 04:23:27,764 [myid:] - INFO  [main:QuorumPeerConfig@391] - 
secureClientPort is not set
2020-02-19 04:23:27,771 [myid:1] - INFO  [main:DatadirCleanupManager@78] - 
autopurge.snapRetainCount set to 3
2020-02-19 04:23:27,772 [myid:1] - INFO  [main:DatadirCleanupManager@79] - 
autopurge.purgeInterval set to 24
2020-02-19 04:23:27,772 [myid:1] - INFO  
[PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task started.
2020-02-19 04:23:27,773 [myid:1] - INFO  [main:ManagedUtil@46] - Log4j found 
with jmx enabled.
2020-02-19 04:23:27,774 [myid:1] - INFO  [PurgeTask:FileTxnSnapLog@115] - 
zookeeper.snapshot.trust.empty : false
2020-02-19 04:23:27,780 [myid:1] - INFO  
[PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task completed.
2020-02-19 04:23:27,781 [myid:1] - INFO  [main:QuorumPeerMain@141] - Starting 
quorum peer
2020-02-19 04:23:27,786 [myid:1] - INFO  [main:ServerCnxnFactory@135] - Using 
org.apache.zookeeper.server.NIOServerCnxnFactory as server connection factory
2020-02-19 04:23:27,788 [myid:1] - INFO  [main:NIOServerCnxnFactory@673] - 
Configuring NIO connection handler with 10s sessionless connection timeout, 2 
selector thread(s), 32 worker threads, and 64 kB direct buffers.
2020-02-19 04:23:27,791 [myid:1] - INFO  [main:NIOServerCnxnFactory@686] - 
binding to port 0.0.0.0/0.0.0.0:2181
2020-02-19 04:23:27,809 [myid:1] - INFO  [main:Log@169] - Logging initialized 
@249ms to org.eclipse.jetty.util.log.Slf4jLog
2020-02-19 04:23:27,913 [myid:1] - WARN  [main:ContextHandler@1520] - 
o.e.j.s.ServletContextHandler@5abca1e0{/,null,UNAVAILABLE} contextPath ends 
with /*
2020-02-19 04:23:27,913 [myid:1] - WARN  [main:ContextHandler@1531] - Empty 
contextPath
2020-02-19 04:23:27,922 [myid:1] - INFO  [main:X509Util@79] - Setting -D 
jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS 
renegotiation
2020-02-19 04:23:27,923 [myid:1] - INFO  [main:FileTxnSnapLog@115] - 
zookeeper.snapshot.trust.empty : false
2020-02-19 04:23:27,923 [myid:1] - INFO  [main:QuorumPeer@1470] - Local 
sessions disabled
2020-02-19 04:23:27,923 [myid:1] - INFO  [main:QuorumPeer@1481] - Local session 
upgrading disabled
2020-02-19 04:23:27,923 [myid:1] - INFO  [main:QuorumPeer@1448] - tickTime set 
to 2000
2020-02-19 04:23:27,923 [myid:1] - INFO  [main:QuorumPeer@1492] - 
minSessionTimeout set to 4000
2020-02-19 04:23:27,924 [myid:1] - INFO  [main:QuorumPeer@1503] - 
maxSessionTimeout set to 40000
2020-02-19 04:23:27,924 [myid:1] - INFO  [main:QuorumPeer@1518] - initLimit set 
to 30
2020-02-19 04:23:27,932 [myid:1] - INFO  [main:ZKDatabase@117] - 
zookeeper.snapshotSizeFactor = 0.33
2020-02-19 04:23:27,933 [myid:1] - INFO  [main:QuorumPeer@1763] - Using 
insecure (non-TLS) quorum communication
2020-02-19 04:23:27,933 [myid:1] - INFO  [main:QuorumPeer@1769] - Port 
unification disabled
2020-02-19 04:23:27,933 [myid:1] - INFO  [main:QuorumPeer@2136] - QuorumPeer 
communication is not secured! (SASL auth disabled)
2020-02-19 04:23:27,933 [myid:1] - INFO  [main:QuorumPeer@2165] - 
quorum.cnxn.threads.size set to 20
2020-02-19 04:23:27,934 [myid:1] - INFO  [main:FileSnap@83] - Reading snapshot 
/opt/zookeeper/data/version-2/snapshot.90000043e
2020-02-19 04:23:27,963 [myid:1] - INFO  [main:Server@359] - 
jetty-9.4.24.v20191120; built: 2019-11-20T21:37:49.771Z; git: 
363d5f2df3a8a28de40604320230664b9c793c16; jvm 1.8.0_111-internal-alpine-r0-b14
2020-02-19 04:23:27,989 [myid:1] - INFO  [main:DefaultSessionIdManager@333] - 
DefaultSessionIdManager workerName=node0
2020-02-19 04:23:27,989 [myid:1] - INFO  [main:DefaultSessionIdManager@338] - 
No SessionScavenger set, using defaults
2020-02-19 04:23:27,990 [myid:1] - INFO  [main:HouseKeeper@140] - node0 
Scavenging every 660000ms
2020-02-19 04:23:27,997 [myid:1] - INFO  [main:ContextHandler@825] - Started 
o.e.j.s.ServletContextHandler@5abca1e0{/,null,AVAILABLE}
2020-02-19 04:23:28,004 [myid:1] - INFO  [main:AbstractConnector@330] - Started 
ServerConnector@2b98378d{HTTP/1.1,[http/1.1]}{0.0.0.0:8080}
2020-02-19 04:23:28,004 [myid:1] - INFO  [main:Server@399] - Started @444ms
2020-02-19 04:23:28,004 [myid:1] - INFO  [main:JettyAdminServer@112] - Started 
AdminServer on address 0.0.0.0, port 8080 and command URL /commands
2020-02-19 04:23:28,007 [myid:1] - INFO  [main:QuorumCnxManager$Listener@867] - 
Election port bind maximum retries is 1000
2020-02-19 04:23:28,007 [myid:1] - INFO  
[QuorumPeerListener:QuorumCnxManager$Listener@917] - My election bind port: 
/0.0.0.0:3888
2020-02-19 04:23:28,014 [myid:1] - INFO  
[QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1175] - 
LOOKING
2020-02-19 04:23:28,015 [myid:1] - INFO  
[QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):FastLeaderElection@885]
 - New election. My id =  1, proposed zxid=0xa000000fb
2020-02-19 04:23:28,018 [myid:1] - INFO  
[WorkerSender[myid=1]:QuorumCnxManager@438] - Have smaller server identifier, 
so dropping the connection: (2, 1)
2020-02-19 04:23:28,019 [myid:1] - INFO  
[WorkerSender[myid=1]:QuorumCnxManager@438] - Have smaller server identifier, 
so dropping the connection: (3, 1)
2020-02-19 04:23:28,019 [myid:1] - INFO  
[WorkerReceiver[myid=1]:FastLeaderElection@679] - Notification: 2 (message 
format version), 1 (n.leader), 0xa000000fb (n.zxid), 0x1 (n.round), LOOKING 
(n.state), 1 (n.sid), 0xa (n.peerEPoch), LOOKING (my state)0 (n.config version)
2020-02-19 04:23:28,221 [myid:1] - INFO  
[QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):QuorumCnxManager@438] 
- Have smaller server identifier, so dropping the connection: (2, 1)
2020-02-19 04:23:28,222 [myid:1] - INFO  
[QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):QuorumCnxManager@438] 
- Have smaller server identifier, so dropping the connection: (3, 1)
2020-02-19 04:23:28,222 [myid:1] - INFO  
[QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):FastLeaderElection@919]
 - Notification time out: 400
2020-02-19 04:23:28,623 [myid:1] - INFO  
[QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):QuorumCnxManager@438] 
- Have smaller server identifier, so dropping the connection: (2, 1)
2020-02-19 04:23:28,624 [myid:1] - INFO  
[QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):QuorumCnxManager@438] 
- Have smaller server identifier, so dropping the connection: (3, 1)
2020-02-19 04:23:28,624 [myid:1] - INFO  
[QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):FastLeaderElection@919]
 - Notification time out: 800
…

And for the leader:
….
2020-02-19 05:02:10,341 [myid:3] - INFO  
[WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification: 2 (message 
format version), 3 (n.leader), 0xb0000018c (n.zxid), 0x1 (n.round), LOOKING 
(n.state), 3 (n.sid), 0xb (n.peerEPoch), LEADING (my state)0 (n.config version)
2020-02-19 05:02:10,341 [myid:3] - INFO  
[WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification: 2 (message 
format version), 3 (n.leader), 0xb0000018c (n.zxid), 0x1 (n.round), LEADING 
(n.state), 3 (n.sid), 0xc (n.peerEPoch), LEADING (my state)0 (n.config version)
2020-02-19 05:02:33,640 [myid:3] - WARN  [NIOWorkerThread-4:NIOServerCnxn@366] 
- Unable to read additional data from client sessionid 0x30002ba40710018, 
likely client has closed socket
2020-02-19 05:02:39,047 [myid:3] - INFO  [SessionTracker:ZooKeeperServer@398] - 
Expiring session 0x20002ba2fd6001a, timeout of 40000ms exceeded
2020-02-19 05:02:39,048 [myid:3] - INFO  
[SessionTracker:QuorumZooKeeperServer@157] - Submitting global closeSession 
request for session 0x20002ba2fd6001a
2020-02-19 05:03:10,340 [myid:3] - INFO  
[/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection request 
10.0.1.152:52492
2020-02-19 05:03:10,340 [myid:3] - WARN  
[SendWorker:1:QuorumCnxManager$SendWorker@1143] - Interrupted while waiting for 
message on queue
java.lang.InterruptedException
                at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
                at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
                at 
java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
                at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294)
                at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82)
                at 
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131)
2020-02-19 05:03:10,340 [myid:3] - WARN  
[SendWorker:1:QuorumCnxManager$SendWorker@1153] - Send worker leaving thread  
id 1 my id = 3
2020-02-19 05:03:10,340 [myid:3] - WARN  
[RecvWorker:1:QuorumCnxManager$RecvWorker@1227] - Connection broken for id 1, 
my id = 3, error =
java.net.SocketException: Socket closed
                at java.net.SocketInputStream.socketRead0(Native Method)
                at 
java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
                at java.net.SocketInputStream.read(SocketInputStream.java:170)
                at java.net.SocketInputStream.read(SocketInputStream.java:141)
                at 
java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
                at 
java.io.BufferedInputStream.read(BufferedInputStream.java:265)
                at java.io.DataInputStream.readInt(DataInputStream.java:387)
                at 
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1212)
2020-02-19 05:03:10,341 [myid:3] - WARN  
[RecvWorker:1:QuorumCnxManager$RecvWorker@1230] - Interrupting SendWorker
2020-02-19 05:03:10,340 [myid:3] - WARN  
[RecvWorker:3:QuorumCnxManager$RecvWorker@1227] - Connection broken for id 3, 
my id = 3, error =
java.io.EOFException
                at java.io.DataInputStream.readInt(DataInputStream.java:392)
                at 
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1212)
2020-02-19 05:03:10,341 [myid:3] - WARN  
[RecvWorker:3:QuorumCnxManager$RecvWorker@1230] - Interrupting SendWorker
2020-02-19 05:03:10,341 [myid:3] - INFO  
[/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection request 
10.0.1.142:46326
2020-02-19 05:03:10,343 [myid:3] - WARN  
[SendWorker:3:QuorumCnxManager$SendWorker@1143] - Interrupted while waiting for 
message on queue
java.lang.InterruptedException
                at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
                at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
                at 
java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
                at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294)
                at 
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82)
                at 
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131)
2020-02-19 05:03:10,344 [myid:3] - WARN  
[SendWorker:3:QuorumCnxManager$SendWorker@1153] - Send worker leaving thread  
id 3 my id = 3
2020-02-19 05:03:10,344 [myid:3] - INFO  
[WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification: 2 (message 
format version), 3 (n.leader), 0xb0000018c (n.zxid), 0x1 (n.round), LOOKING 
(n.state), 3 (n.sid), 0xb (n.peerEPoch), LEADING (my state)0 (n.config version)
2020-02-19 05:03:10,345 [myid:3] - INFO  
[WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification: 2 (message 
format version), 3 (n.leader), 0xb0000018c (n.zxid), 0x1 (n.round), LEADING 
(n.state), 3 (n.sid), 0xc (n.peerEPoch), LEADING (my state)0 (n.config version)
2020-02-19 05:03:11,048 [myid:3] - INFO  [SessionTracker:ZooKeeperServer@398] - 
Expiring session 0x30002ba40710018, timeout of 40000ms exceeded
2020-02-19 05:03:11,048 [myid:3] - INFO  
[SessionTracker:QuorumZooKeeperServer@157] - Submitting global closeSession 
request for session 0x30002ba40710018

All of the instances have a similar zoo.cfg:

bash-4.3# cat conf/zoo.cfg
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=30
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# Purge every 24 hours
autopurge.purgeInterval=24
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/opt/zookeeper/data
# the port at which the clients will connect
clientPort=2181
#Append other confg...
electionPortBindRetry=1000
4lw.commands.whitelist=stat, ruok, conf, mntr

server.1=zookeeper1:2888:3888
server.2=zookeeper2:2888:3888
server.3=0.0.0.0:2888:3888

Reply via email to