OK, that explains it. I will see if 3.4.14 fixes the issue for the time being...
Thanks Steve On 2/19/20, 8:57 AM, "Jan Kosecki" <[email protected]> wrote: Hi Steve, it's possible that the quorum state depends on the order your nodes start. In my kubernetes environment I've had a similar issue and I've noticed that starting brokers 1 by 1, following the order from configuration file allows all 3 to join the quorum but a reverse order would keep server started as the last outside of the quorum. I was also using 0.0.0.0 in the configuration and didn't try a full address due to readiness check configuration. Unfortunately I didn't have time to debug it any further so I've downgraded back to 3.4 for the time being. Hope you manage to find a solution, Best, Jan On Wed, 19 Feb 2020, 15:47 Steve Jerman, <[email protected]> wrote: > Hi, > > I've just been testing restarts ... I restarted one of the instances (id > 1) ... and it doesn't join the quorum ... same error. > > Odd that the system started fine but can't handle a restart.... > > Steve > > On 2/19/20, 7:45 AM, "Steve Jerman" <[email protected]> wrote: > > Thank You Mate, > > That fixed it. Unfortunately I can't easily avoid using 0.0.0.0 > > My configuration is built using Docker Storm and it doesn't let you > bind to a host name... > > Steve > > On 2/19/20, 5:27 AM, "Szalay-Bekő Máté" <[email protected]> > wrote: > > Hi Steve! > > If you are using ZooKeeper newer than 3.5.0, then this might be > the issue > we are just discussing / trying to fix in ZOOKEEPER-2164. > Can you test your setup with a config where you don't use 0.0.0.0 > in the > server addresses? > > If you need to bind to the 0.0.0.0 locally, then please set the > 'quorumListenOnAllIPs' config property to true. > > like: > > # usually you don't really need this, unless if you actually need > to bind > to multiple IPs > quorumListenOnAllIPs=true > > # it is best if all the zoo.cfg files contain the same address > settings, > and we don't use 0.0.0.0 here > server.1=zookeeper1:2888:3888 > server.2=zookeeper2:2888:3888 > server.3=zookeeper3:2888:3888 > > Kind regards, > Mate > > On Wed, Feb 19, 2020 at 6:08 AM Steve Jerman <[email protected]> > wrote: > > > Hello folks, > > > > Wonder if anyone can help me. Suspect it must be something > simple but I > > cant see it. Any suggestions about how to diagnose would be > gratefully > > received. > > > > I have a three node ZK cluster, when it starts up only two of > the nodes > > form a quorum. If I restart the leader the quorum reforms with > the other > > two node… > > > > Thanks in advance for any help > > Steve > > > > This is the ‘stat’ for the leader/follow… > > > > bash-5.0$ echo stat | nc zookeeper1 2181 > > This ZooKeeper instance is not currently serving requests > > > > bash-5.0$ echo stat | nc zookeeper2 2181 > > Zookeeper version: > 3.5.7-f0fdd52973d373ffd9c86b81d99842dc2c7f660e, built > > on 02/10/2020 11:30 GMT > > Clients: > > /10.0.1.152:44910[1](queued=0,recved=151,sent=151) > > /10.0.1.140:53138[1](queued=0,recved=187,sent=187) > > /10.0.1.143:57422[1](queued=0,recved=151,sent=151) > > /10.0.1.152:59242[0](queued=0,recved=1,sent=0) > > /10.0.1.143:40826[1](queued=0,recved=1139,sent=1139) > > /10.0.1.152:49188[1](queued=0,recved=200,sent=203) > > /10.0.1.152:59548[1](queued=0,recved=1157,sent=1159) > > /10.0.1.140:36624[1](queued=0,recved=151,sent=151) > > > > Latency min/avg/max: 0/0/5 > > Received: 3338 > > Sent: 3342 > > Connections: 8 > > Outstanding: 0 > > Zxid: 0xc000000f3 > > Mode: follower > > Node count: 181 > > > > bash-5.0$ echo stat | nc zookeeper3 2181 > > Zookeeper version: > 3.5.7-f0fdd52973d373ffd9c86b81d99842dc2c7f660e, built > > on 02/10/2020 11:30 GMT > > Clients: > > /10.0.1.152:49428[0](queued=0,recved=1,sent=0) > > /10.0.1.140:32912[1](queued=0,recved=1426,sent=1429) > > > > Latency min/avg/max: 0/0/4 > > Received: 1684 > > Sent: 1686 > > Connections: 2 > > Outstanding: 0 > > Zxid: 0xc000000f4 > > Mode: leader > > Node count: 181 > > Proposal sizes last/min/max: 32/32/406 > > bash-5.0$ > > > > The trace for the failing node is: > > > > server.1=0.0.0.0:2888:3888 > > server.2=zookeeper2:2888:3888 > > server.3=zookeeper3:2888:3888 > > ZooKeeper JMX enabled by default > > Using config: /opt/zookeeper/bin/../conf/zoo.cfg > > 2020-02-19 04:23:27,759 [myid:] - INFO > [main:QuorumPeerConfig@135] - > > Reading configuration from: /opt/zookeeper/bin/../conf/zoo.cfg > > 2020-02-19 04:23:27,764 [myid:] - INFO > [main:QuorumPeerConfig@387] - > > clientPortAddress is 0.0.0.0:2181 > > 2020-02-19 04:23:27,764 [myid:] - INFO > [main:QuorumPeerConfig@391] - > > secureClientPort is not set > > 2020-02-19 04:23:27,771 [myid:1] - INFO > [main:DatadirCleanupManager@78] > > - autopurge.snapRetainCount set to 3 > > 2020-02-19 04:23:27,772 [myid:1] - INFO > [main:DatadirCleanupManager@79] > > - autopurge.purgeInterval set to 24 > > 2020-02-19 04:23:27,772 [myid:1] - INFO > > [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task > started. > > 2020-02-19 04:23:27,773 [myid:1] - INFO [main:ManagedUtil@46] > - Log4j > > found with jmx enabled. > > 2020-02-19 04:23:27,774 [myid:1] - INFO > [PurgeTask:FileTxnSnapLog@115] - > > zookeeper.snapshot.trust.empty : false > > 2020-02-19 04:23:27,780 [myid:1] - INFO > > [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task > completed. > > 2020-02-19 04:23:27,781 [myid:1] - INFO [main:QuorumPeerMain@141] > - > > Starting quorum peer > > 2020-02-19 04:23:27,786 [myid:1] - INFO > [main:ServerCnxnFactory@135] - > > Using org.apache.zookeeper.server.NIOServerCnxnFactory as server > connection > > factory > > 2020-02-19 04:23:27,788 [myid:1] - INFO > [main:NIOServerCnxnFactory@673] > > - Configuring NIO connection handler with 10s sessionless > connection > > timeout, 2 selector thread(s), 32 worker threads, and 64 kB > direct buffers. > > 2020-02-19 04:23:27,791 [myid:1] - INFO > [main:NIOServerCnxnFactory@686] > > - binding to port 0.0.0.0/0.0.0.0:2181 > > 2020-02-19 <http://0.0.0.0/0.0.0.0:21812020-02-19> 04:23:27,809 > [myid:1] > > - INFO [main:Log@169] - Logging initialized @249ms to > > org.eclipse.jetty.util.log.Slf4jLog > > 2020-02-19 04:23:27,913 [myid:1] - WARN > [main:ContextHandler@1520] - > > o.e.j.s.ServletContextHandler@5abca1e0{/,null,UNAVAILABLE} > contextPath > > ends with /* > > 2020-02-19 04:23:27,913 [myid:1] - WARN > [main:ContextHandler@1531] - > > Empty contextPath > > 2020-02-19 04:23:27,922 [myid:1] - INFO [main:X509Util@79] - > Setting -D > > jdk.tls.rejectClientInitiatedRenegotiation=true to disable > client-initiated > > TLS renegotiation > > 2020-02-19 04:23:27,923 [myid:1] - INFO [main:FileTxnSnapLog@115] > - > > zookeeper.snapshot.trust.empty : false > > 2020-02-19 04:23:27,923 [myid:1] - INFO [main:QuorumPeer@1470] > - Local > > sessions disabled > > 2020-02-19 04:23:27,923 [myid:1] - INFO [main:QuorumPeer@1481] > - Local > > session upgrading disabled > > 2020-02-19 04:23:27,923 [myid:1] - INFO [main:QuorumPeer@1448] > - > > tickTime set to 2000 > > 2020-02-19 04:23:27,923 [myid:1] - INFO [main:QuorumPeer@1492] > - > > minSessionTimeout set to 4000 > > 2020-02-19 04:23:27,924 [myid:1] - INFO [main:QuorumPeer@1503] > - > > maxSessionTimeout set to 40000 > > 2020-02-19 04:23:27,924 [myid:1] - INFO [main:QuorumPeer@1518] > - > > initLimit set to 30 > > 2020-02-19 04:23:27,932 [myid:1] - INFO [main:ZKDatabase@117] - > > zookeeper.snapshotSizeFactor = 0.33 > > 2020-02-19 04:23:27,933 [myid:1] - INFO [main:QuorumPeer@1763] > - Using > > insecure (non-TLS) quorum communication > > 2020-02-19 04:23:27,933 [myid:1] - INFO [main:QuorumPeer@1769] > - Port > > unification disabled > > 2020-02-19 04:23:27,933 [myid:1] - INFO [main:QuorumPeer@2136] > - > > QuorumPeer communication is not secured! (SASL auth disabled) > > 2020-02-19 04:23:27,933 [myid:1] - INFO [main:QuorumPeer@2165] > - > > quorum.cnxn.threads.size set to 20 > > 2020-02-19 04:23:27,934 [myid:1] - INFO [main:FileSnap@83] - > Reading > > snapshot /opt/zookeeper/data/version-2/snapshot.90000043e > > 2020-02-19 04:23:27,963 [myid:1] - INFO [main:Server@359] - > > jetty-9.4.24.v20191120; built: 2019-11-20T21:37:49.771Z; git: > > 363d5f2df3a8a28de40604320230664b9c793c16; jvm > > 1.8.0_111-internal-alpine-r0-b14 > > 2020-02-19 04:23:27,989 [myid:1] - INFO > [main:DefaultSessionIdManager@333] > > - DefaultSessionIdManager workerName=node0 > > 2020-02-19 04:23:27,989 [myid:1] - INFO > [main:DefaultSessionIdManager@338] > > - No SessionScavenger set, using defaults > > 2020-02-19 04:23:27,990 [myid:1] - INFO [main:HouseKeeper@140] > - node0 > > Scavenging every 660000ms > > 2020-02-19 04:23:27,997 [myid:1] - INFO [main:ContextHandler@825] > - > > Started o.e.j.s.ServletContextHandler@5abca1e0{/,null,AVAILABLE} > > 2020-02-19 04:23:28,004 [myid:1] - INFO > [main:AbstractConnector@330] - > > Started ServerConnector@2b98378d{HTTP/1.1,[http/1.1]}{ > 0.0.0.0:8080} > > 2020-02-19 04:23:28,004 [myid:1] - INFO [main:Server@399] - > Started > > @444ms > > 2020-02-19 04:23:28,004 [myid:1] - INFO > [main:JettyAdminServer@112] - > > Started AdminServer on address 0.0.0.0, port 8080 and command > URL /commands > > 2020-02-19 04:23:28,007 [myid:1] - INFO > > [main:QuorumCnxManager$Listener@867] - Election port bind > maximum retries > > is 1000 > > 2020-02-19 04:23:28,007 [myid:1] - INFO > > [QuorumPeerListener:QuorumCnxManager$Listener@917] - My > election bind > > port: /0.0.0.0:3888 > > 2020-02-19 04:23:28,014 [myid:1] - INFO > > [QuorumPeer[myid=1](plain=0.0.0.0:2181 > )(secure=disabled):QuorumPeer@1175] > > - LOOKING > > 2020-02-19 04:23:28,015 [myid:1] - INFO > > [QuorumPeer[myid=1](plain=0.0.0.0:2181 > > )(secure=disabled):FastLeaderElection@885] - New election. My > id = 1, > > proposed zxid=0xa000000fb > > 2020-02-19 04:23:28,018 [myid:1] - INFO > > [WorkerSender[myid=1]:QuorumCnxManager@438] - Have smaller > server > > identifier, so dropping the connection: (2, 1) > > 2020-02-19 04:23:28,019 [myid:1] - INFO > > [WorkerSender[myid=1]:QuorumCnxManager@438] - Have smaller > server > > identifier, so dropping the connection: (3, 1) > > 2020-02-19 04:23:28,019 [myid:1] - INFO > > [WorkerReceiver[myid=1]:FastLeaderElection@679] - Notification: > 2 > > (message format version), 1 (n.leader), 0xa000000fb (n.zxid), 0x1 > > (n.round), LOOKING (n.state), 1 (n.sid), 0xa (n.peerEPoch), > LOOKING (my > > state)0 (n.config version) > > 2020-02-19 04:23:28,221 [myid:1] - INFO > > [QuorumPeer[myid=1](plain=0.0.0.0:2181 > > )(secure=disabled):QuorumCnxManager@438] - Have smaller server > > identifier, so dropping the connection: (2, 1) > > 2020-02-19 04:23:28,222 [myid:1] - INFO > > [QuorumPeer[myid=1](plain=0.0.0.0:2181 > > )(secure=disabled):QuorumCnxManager@438] - Have smaller server > > identifier, so dropping the connection: (3, 1) > > 2020-02-19 04:23:28,222 [myid:1] - INFO > > [QuorumPeer[myid=1](plain=0.0.0.0:2181 > > )(secure=disabled):FastLeaderElection@919] - Notification time > out: 400 > > 2020-02-19 04:23:28,623 [myid:1] - INFO > > [QuorumPeer[myid=1](plain=0.0.0.0:2181 > > )(secure=disabled):QuorumCnxManager@438] - Have smaller server > > identifier, so dropping the connection: (2, 1) > > 2020-02-19 04:23:28,624 [myid:1] - INFO > > [QuorumPeer[myid=1](plain=0.0.0.0:2181 > > )(secure=disabled):QuorumCnxManager@438] - Have smaller server > > identifier, so dropping the connection: (3, 1) > > 2020-02-19 04:23:28,624 [myid:1] - INFO > > [QuorumPeer[myid=1](plain=0.0.0.0:2181 > > )(secure=disabled):FastLeaderElection@919] - Notification time > out: 800 > > … > > > > And for the leader: > > …. > > 2020-02-19 05:02:10,341 [myid:3] - INFO > > [WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification: > 2 > > (message format version), 3 (n.leader), 0xb0000018c (n.zxid), 0x1 > > (n.round), LOOKING (n.state), 3 (n.sid), 0xb (n.peerEPoch), > LEADING (my > > state)0 (n.config version) > > 2020-02-19 05:02:10,341 [myid:3] - INFO > > [WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification: > 2 > > (message format version), 3 (n.leader), 0xb0000018c (n.zxid), 0x1 > > (n.round), LEADING (n.state), 3 (n.sid), 0xc (n.peerEPoch), > LEADING (my > > state)0 (n.config version) > > 2020-02-19 05:02:33,640 [myid:3] - WARN > > [NIOWorkerThread-4:NIOServerCnxn@366] - Unable to read > additional data > > from client sessionid 0x30002ba40710018, likely client has > closed socket > > 2020-02-19 05:02:39,047 [myid:3] - INFO > > [SessionTracker:ZooKeeperServer@398] - Expiring session > > 0x20002ba2fd6001a, timeout of 40000ms exceeded > > 2020-02-19 05:02:39,048 [myid:3] - INFO > > [SessionTracker:QuorumZooKeeperServer@157] - Submitting global > > closeSession request for session 0x20002ba2fd6001a > > 2020-02-19 05:03:10,340 [myid:3] - INFO [/0.0.0.0:3888 > > :QuorumCnxManager$Listener@924] - Received connection request > > 10.0.1.152:52492 > > 2020-02-19 05:03:10,340 [myid:3] - WARN > > [SendWorker:1:QuorumCnxManager$SendWorker@1143] - Interrupted > while > > waiting for message on queue > > java.lang.InterruptedException > > at > > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) > > at > > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088) > > at > > > java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418) > > at > > > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294) > > at > > > org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82) > > at > > > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131) > > 2020-02-19 05:03:10,340 [myid:3] - WARN > > [SendWorker:1:QuorumCnxManager$SendWorker@1153] - Send worker > leaving > > thread id 1 my id = 3 > > 2020-02-19 05:03:10,340 [myid:3] - WARN > > [RecvWorker:1:QuorumCnxManager$RecvWorker@1227] - Connection > broken for > > id 1, my id = 3, error = > > java.net.SocketException: Socket closed > > at java.net.SocketInputStream.socketRead0(Native > Method) > > at > > java.net.SocketInputStream.socketRead(SocketInputStream.java:116) > > at > > java.net.SocketInputStream.read(SocketInputStream.java:170) > > at > > java.net.SocketInputStream.read(SocketInputStream.java:141) > > at > > java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > > at > > java.io.BufferedInputStream.read(BufferedInputStream.java:265) > > at > > java.io.DataInputStream.readInt(DataInputStream.java:387) > > at > > > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1212) > > 2020-02-19 05:03:10,341 [myid:3] - WARN > > [RecvWorker:1:QuorumCnxManager$RecvWorker@1230] - Interrupting > SendWorker > > 2020-02-19 05:03:10,340 [myid:3] - WARN > > [RecvWorker:3:QuorumCnxManager$RecvWorker@1227] - Connection > broken for > > id 3, my id = 3, error = > > java.io.EOFException > > at > > java.io.DataInputStream.readInt(DataInputStream.java:392) > > at > > > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1212) > > 2020-02-19 05:03:10,341 [myid:3] - WARN > > [RecvWorker:3:QuorumCnxManager$RecvWorker@1230] - Interrupting > SendWorker > > 2020-02-19 05:03:10,341 [myid:3] - INFO [/0.0.0.0:3888 > > :QuorumCnxManager$Listener@924] - Received connection request > > 10.0.1.142:46326 > > 2020-02-19 05:03:10,343 [myid:3] - WARN > > [SendWorker:3:QuorumCnxManager$SendWorker@1143] - Interrupted > while > > waiting for message on queue > > java.lang.InterruptedException > > at > > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) > > at > > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088) > > at > > > java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418) > > at > > > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294) > > at > > > org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82) > > at > > > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131) > > 2020-02-19 05:03:10,344 [myid:3] - WARN > > [SendWorker:3:QuorumCnxManager$SendWorker@1153] - Send worker > leaving > > thread id 3 my id = 3 > > 2020-02-19 05:03:10,344 [myid:3] - INFO > > [WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification: > 2 > > (message format version), 3 (n.leader), 0xb0000018c (n.zxid), 0x1 > > (n.round), LOOKING (n.state), 3 (n.sid), 0xb (n.peerEPoch), > LEADING (my > > state)0 (n.config version) > > 2020-02-19 05:03:10,345 [myid:3] - INFO > > [WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification: > 2 > > (message format version), 3 (n.leader), 0xb0000018c (n.zxid), 0x1 > > (n.round), LEADING (n.state), 3 (n.sid), 0xc (n.peerEPoch), > LEADING (my > > state)0 (n.config version) > > 2020-02-19 05:03:11,048 [myid:3] - INFO > > [SessionTracker:ZooKeeperServer@398] - Expiring session > > 0x30002ba40710018, timeout of 40000ms exceeded > > 2020-02-19 05:03:11,048 [myid:3] - INFO > > [SessionTracker:QuorumZooKeeperServer@157] - Submitting global > > closeSession request for session 0x30002ba40710018 > > > > All of the instances have a similar zoo.cfg: > > > > bash-4.3# cat conf/zoo.cfg > > # The number of milliseconds of each tick > > tickTime=2000 > > # The number of ticks that the initial > > # synchronization phase can take > > initLimit=30 > > # The number of ticks that can pass between > > # sending a request and getting an acknowledgement > > syncLimit=5 > > # Purge every 24 hours > > autopurge.purgeInterval=24 > > # the directory where the snapshot is stored. > > # do not use /tmp for storage, /tmp here is just > > # example sakes. > > dataDir=/opt/zookeeper/data > > # the port at which the clients will connect > > clientPort=2181 > > #Append other confg... > > electionPortBindRetry=1000 > > 4lw.commands.whitelist=stat, ruok, conf, mntr > > > > server.1=zookeeper1:2888:3888 > > server.2=zookeeper2:2888:3888 > > server.3=0.0.0.0:2888:3888 > > > > > > >
