Ok, Just to confirm. Rolling back to 3.4.14 fixes the issue. The quorum starts up and restarting any of the instances works....
Are there any issues with using the 3.5 client with 3.4 server? Steve On 2/19/20, 9:02 AM, "Steve Jerman" <[email protected]> wrote: OK, that explains it. I will see if 3.4.14 fixes the issue for the time being... Thanks Steve On 2/19/20, 8:57 AM, "Jan Kosecki" <[email protected]> wrote: Hi Steve, it's possible that the quorum state depends on the order your nodes start. In my kubernetes environment I've had a similar issue and I've noticed that starting brokers 1 by 1, following the order from configuration file allows all 3 to join the quorum but a reverse order would keep server started as the last outside of the quorum. I was also using 0.0.0.0 in the configuration and didn't try a full address due to readiness check configuration. Unfortunately I didn't have time to debug it any further so I've downgraded back to 3.4 for the time being. Hope you manage to find a solution, Best, Jan On Wed, 19 Feb 2020, 15:47 Steve Jerman, <[email protected]> wrote: > Hi, > > I've just been testing restarts ... I restarted one of the instances (id > 1) ... and it doesn't join the quorum ... same error. > > Odd that the system started fine but can't handle a restart.... > > Steve > > On 2/19/20, 7:45 AM, "Steve Jerman" <[email protected]> wrote: > > Thank You Mate, > > That fixed it. Unfortunately I can't easily avoid using 0.0.0.0 > > My configuration is built using Docker Storm and it doesn't let you > bind to a host name... > > Steve > > On 2/19/20, 5:27 AM, "Szalay-Bekő Máté" <[email protected]> > wrote: > > Hi Steve! > > If you are using ZooKeeper newer than 3.5.0, then this might be > the issue > we are just discussing / trying to fix in ZOOKEEPER-2164. > Can you test your setup with a config where you don't use 0.0.0.0 > in the > server addresses? > > If you need to bind to the 0.0.0.0 locally, then please set the > 'quorumListenOnAllIPs' config property to true. > > like: > > # usually you don't really need this, unless if you actually need > to bind > to multiple IPs > quorumListenOnAllIPs=true > > # it is best if all the zoo.cfg files contain the same address > settings, > and we don't use 0.0.0.0 here > server.1=zookeeper1:2888:3888 > server.2=zookeeper2:2888:3888 > server.3=zookeeper3:2888:3888 > > Kind regards, > Mate > > On Wed, Feb 19, 2020 at 6:08 AM Steve Jerman <[email protected]> > wrote: > > > Hello folks, > > > > Wonder if anyone can help me. Suspect it must be something > simple but I > > cant see it. Any suggestions about how to diagnose would be > gratefully > > received. > > > > I have a three node ZK cluster, when it starts up only two of > the nodes > > form a quorum. If I restart the leader the quorum reforms with > the other > > two node… > > > > Thanks in advance for any help > > Steve > > > > This is the ‘stat’ for the leader/follow… > > > > bash-5.0$ echo stat | nc zookeeper1 2181 > > This ZooKeeper instance is not currently serving requests > > > > bash-5.0$ echo stat | nc zookeeper2 2181 > > Zookeeper version: > 3.5.7-f0fdd52973d373ffd9c86b81d99842dc2c7f660e, built > > on 02/10/2020 11:30 GMT > > Clients: > > /10.0.1.152:44910[1](queued=0,recved=151,sent=151) > > /10.0.1.140:53138[1](queued=0,recved=187,sent=187) > > /10.0.1.143:57422[1](queued=0,recved=151,sent=151) > > /10.0.1.152:59242[0](queued=0,recved=1,sent=0) > > /10.0.1.143:40826[1](queued=0,recved=1139,sent=1139) > > /10.0.1.152:49188[1](queued=0,recved=200,sent=203) > > /10.0.1.152:59548[1](queued=0,recved=1157,sent=1159) > > /10.0.1.140:36624[1](queued=0,recved=151,sent=151) > > > > Latency min/avg/max: 0/0/5 > > Received: 3338 > > Sent: 3342 > > Connections: 8 > > Outstanding: 0 > > Zxid: 0xc000000f3 > > Mode: follower > > Node count: 181 > > > > bash-5.0$ echo stat | nc zookeeper3 2181 > > Zookeeper version: > 3.5.7-f0fdd52973d373ffd9c86b81d99842dc2c7f660e, built > > on 02/10/2020 11:30 GMT > > Clients: > > /10.0.1.152:49428[0](queued=0,recved=1,sent=0) > > /10.0.1.140:32912[1](queued=0,recved=1426,sent=1429) > > > > Latency min/avg/max: 0/0/4 > > Received: 1684 > > Sent: 1686 > > Connections: 2 > > Outstanding: 0 > > Zxid: 0xc000000f4 > > Mode: leader > > Node count: 181 > > Proposal sizes last/min/max: 32/32/406 > > bash-5.0$ > > > > The trace for the failing node is: > > > > server.1=0.0.0.0:2888:3888 > > server.2=zookeeper2:2888:3888 > > server.3=zookeeper3:2888:3888 > > ZooKeeper JMX enabled by default > > Using config: /opt/zookeeper/bin/../conf/zoo.cfg > > 2020-02-19 04:23:27,759 [myid:] - INFO > [main:QuorumPeerConfig@135] - > > Reading configuration from: /opt/zookeeper/bin/../conf/zoo.cfg > > 2020-02-19 04:23:27,764 [myid:] - INFO > [main:QuorumPeerConfig@387] - > > clientPortAddress is 0.0.0.0:2181 > > 2020-02-19 04:23:27,764 [myid:] - INFO > [main:QuorumPeerConfig@391] - > > secureClientPort is not set > > 2020-02-19 04:23:27,771 [myid:1] - INFO > [main:DatadirCleanupManager@78] > > - autopurge.snapRetainCount set to 3 > > 2020-02-19 04:23:27,772 [myid:1] - INFO > [main:DatadirCleanupManager@79] > > - autopurge.purgeInterval set to 24 > > 2020-02-19 04:23:27,772 [myid:1] - INFO > > [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task > started. > > 2020-02-19 04:23:27,773 [myid:1] - INFO [main:ManagedUtil@46] > - Log4j > > found with jmx enabled. > > 2020-02-19 04:23:27,774 [myid:1] - INFO > [PurgeTask:FileTxnSnapLog@115] - > > zookeeper.snapshot.trust.empty : false > > 2020-02-19 04:23:27,780 [myid:1] - INFO > > [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task > completed. > > 2020-02-19 04:23:27,781 [myid:1] - INFO [main:QuorumPeerMain@141] > - > > Starting quorum peer > > 2020-02-19 04:23:27,786 [myid:1] - INFO > [main:ServerCnxnFactory@135] - > > Using org.apache.zookeeper.server.NIOServerCnxnFactory as server > connection > > factory > > 2020-02-19 04:23:27,788 [myid:1] - INFO > [main:NIOServerCnxnFactory@673] > > - Configuring NIO connection handler with 10s sessionless > connection > > timeout, 2 selector thread(s), 32 worker threads, and 64 kB > direct buffers. > > 2020-02-19 04:23:27,791 [myid:1] - INFO > [main:NIOServerCnxnFactory@686] > > - binding to port 0.0.0.0/0.0.0.0:2181 > > 2020-02-19 <http://0.0.0.0/0.0.0.0:21812020-02-19> 04:23:27,809 > [myid:1] > > - INFO [main:Log@169] - Logging initialized @249ms to > > org.eclipse.jetty.util.log.Slf4jLog > > 2020-02-19 04:23:27,913 [myid:1] - WARN > [main:ContextHandler@1520] - > > o.e.j.s.ServletContextHandler@5abca1e0{/,null,UNAVAILABLE} > contextPath > > ends with /* > > 2020-02-19 04:23:27,913 [myid:1] - WARN > [main:ContextHandler@1531] - > > Empty contextPath > > 2020-02-19 04:23:27,922 [myid:1] - INFO [main:X509Util@79] - > Setting -D > > jdk.tls.rejectClientInitiatedRenegotiation=true to disable > client-initiated > > TLS renegotiation > > 2020-02-19 04:23:27,923 [myid:1] - INFO [main:FileTxnSnapLog@115] > - > > zookeeper.snapshot.trust.empty : false > > 2020-02-19 04:23:27,923 [myid:1] - INFO [main:QuorumPeer@1470] > - Local > > sessions disabled > > 2020-02-19 04:23:27,923 [myid:1] - INFO [main:QuorumPeer@1481] > - Local > > session upgrading disabled > > 2020-02-19 04:23:27,923 [myid:1] - INFO [main:QuorumPeer@1448] > - > > tickTime set to 2000 > > 2020-02-19 04:23:27,923 [myid:1] - INFO [main:QuorumPeer@1492] > - > > minSessionTimeout set to 4000 > > 2020-02-19 04:23:27,924 [myid:1] - INFO [main:QuorumPeer@1503] > - > > maxSessionTimeout set to 40000 > > 2020-02-19 04:23:27,924 [myid:1] - INFO [main:QuorumPeer@1518] > - > > initLimit set to 30 > > 2020-02-19 04:23:27,932 [myid:1] - INFO [main:ZKDatabase@117] - > > zookeeper.snapshotSizeFactor = 0.33 > > 2020-02-19 04:23:27,933 [myid:1] - INFO [main:QuorumPeer@1763] > - Using > > insecure (non-TLS) quorum communication > > 2020-02-19 04:23:27,933 [myid:1] - INFO [main:QuorumPeer@1769] > - Port > > unification disabled > > 2020-02-19 04:23:27,933 [myid:1] - INFO [main:QuorumPeer@2136] > - > > QuorumPeer communication is not secured! (SASL auth disabled) > > 2020-02-19 04:23:27,933 [myid:1] - INFO [main:QuorumPeer@2165] > - > > quorum.cnxn.threads.size set to 20 > > 2020-02-19 04:23:27,934 [myid:1] - INFO [main:FileSnap@83] - > Reading > > snapshot /opt/zookeeper/data/version-2/snapshot.90000043e > > 2020-02-19 04:23:27,963 [myid:1] - INFO [main:Server@359] - > > jetty-9.4.24.v20191120; built: 2019-11-20T21:37:49.771Z; git: > > 363d5f2df3a8a28de40604320230664b9c793c16; jvm > > 1.8.0_111-internal-alpine-r0-b14 > > 2020-02-19 04:23:27,989 [myid:1] - INFO > [main:DefaultSessionIdManager@333] > > - DefaultSessionIdManager workerName=node0 > > 2020-02-19 04:23:27,989 [myid:1] - INFO > [main:DefaultSessionIdManager@338] > > - No SessionScavenger set, using defaults > > 2020-02-19 04:23:27,990 [myid:1] - INFO [main:HouseKeeper@140] > - node0 > > Scavenging every 660000ms > > 2020-02-19 04:23:27,997 [myid:1] - INFO [main:ContextHandler@825] > - > > Started o.e.j.s.ServletContextHandler@5abca1e0{/,null,AVAILABLE} > > 2020-02-19 04:23:28,004 [myid:1] - INFO > [main:AbstractConnector@330] - > > Started ServerConnector@2b98378d{HTTP/1.1,[http/1.1]}{ > 0.0.0.0:8080} > > 2020-02-19 04:23:28,004 [myid:1] - INFO [main:Server@399] - > Started > > @444ms > > 2020-02-19 04:23:28,004 [myid:1] - INFO > [main:JettyAdminServer@112] - > > Started AdminServer on address 0.0.0.0, port 8080 and command > URL /commands > > 2020-02-19 04:23:28,007 [myid:1] - INFO > > [main:QuorumCnxManager$Listener@867] - Election port bind > maximum retries > > is 1000 > > 2020-02-19 04:23:28,007 [myid:1] - INFO > > [QuorumPeerListener:QuorumCnxManager$Listener@917] - My > election bind > > port: /0.0.0.0:3888 > > 2020-02-19 04:23:28,014 [myid:1] - INFO > > [QuorumPeer[myid=1](plain=0.0.0.0:2181 > )(secure=disabled):QuorumPeer@1175] > > - LOOKING > > 2020-02-19 04:23:28,015 [myid:1] - INFO > > [QuorumPeer[myid=1](plain=0.0.0.0:2181 > > )(secure=disabled):FastLeaderElection@885] - New election. My > id = 1, > > proposed zxid=0xa000000fb > > 2020-02-19 04:23:28,018 [myid:1] - INFO > > [WorkerSender[myid=1]:QuorumCnxManager@438] - Have smaller > server > > identifier, so dropping the connection: (2, 1) > > 2020-02-19 04:23:28,019 [myid:1] - INFO > > [WorkerSender[myid=1]:QuorumCnxManager@438] - Have smaller > server > > identifier, so dropping the connection: (3, 1) > > 2020-02-19 04:23:28,019 [myid:1] - INFO > > [WorkerReceiver[myid=1]:FastLeaderElection@679] - Notification: > 2 > > (message format version), 1 (n.leader), 0xa000000fb (n.zxid), 0x1 > > (n.round), LOOKING (n.state), 1 (n.sid), 0xa (n.peerEPoch), > LOOKING (my > > state)0 (n.config version) > > 2020-02-19 04:23:28,221 [myid:1] - INFO > > [QuorumPeer[myid=1](plain=0.0.0.0:2181 > > )(secure=disabled):QuorumCnxManager@438] - Have smaller server > > identifier, so dropping the connection: (2, 1) > > 2020-02-19 04:23:28,222 [myid:1] - INFO > > [QuorumPeer[myid=1](plain=0.0.0.0:2181 > > )(secure=disabled):QuorumCnxManager@438] - Have smaller server > > identifier, so dropping the connection: (3, 1) > > 2020-02-19 04:23:28,222 [myid:1] - INFO > > [QuorumPeer[myid=1](plain=0.0.0.0:2181 > > )(secure=disabled):FastLeaderElection@919] - Notification time > out: 400 > > 2020-02-19 04:23:28,623 [myid:1] - INFO > > [QuorumPeer[myid=1](plain=0.0.0.0:2181 > > )(secure=disabled):QuorumCnxManager@438] - Have smaller server > > identifier, so dropping the connection: (2, 1) > > 2020-02-19 04:23:28,624 [myid:1] - INFO > > [QuorumPeer[myid=1](plain=0.0.0.0:2181 > > )(secure=disabled):QuorumCnxManager@438] - Have smaller server > > identifier, so dropping the connection: (3, 1) > > 2020-02-19 04:23:28,624 [myid:1] - INFO > > [QuorumPeer[myid=1](plain=0.0.0.0:2181 > > )(secure=disabled):FastLeaderElection@919] - Notification time > out: 800 > > … > > > > And for the leader: > > …. > > 2020-02-19 05:02:10,341 [myid:3] - INFO > > [WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification: > 2 > > (message format version), 3 (n.leader), 0xb0000018c (n.zxid), 0x1 > > (n.round), LOOKING (n.state), 3 (n.sid), 0xb (n.peerEPoch), > LEADING (my > > state)0 (n.config version) > > 2020-02-19 05:02:10,341 [myid:3] - INFO > > [WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification: > 2 > > (message format version), 3 (n.leader), 0xb0000018c (n.zxid), 0x1 > > (n.round), LEADING (n.state), 3 (n.sid), 0xc (n.peerEPoch), > LEADING (my > > state)0 (n.config version) > > 2020-02-19 05:02:33,640 [myid:3] - WARN > > [NIOWorkerThread-4:NIOServerCnxn@366] - Unable to read > additional data > > from client sessionid 0x30002ba40710018, likely client has > closed socket > > 2020-02-19 05:02:39,047 [myid:3] - INFO > > [SessionTracker:ZooKeeperServer@398] - Expiring session > > 0x20002ba2fd6001a, timeout of 40000ms exceeded > > 2020-02-19 05:02:39,048 [myid:3] - INFO > > [SessionTracker:QuorumZooKeeperServer@157] - Submitting global > > closeSession request for session 0x20002ba2fd6001a > > 2020-02-19 05:03:10,340 [myid:3] - INFO [/0.0.0.0:3888 > > :QuorumCnxManager$Listener@924] - Received connection request > > 10.0.1.152:52492 > > 2020-02-19 05:03:10,340 [myid:3] - WARN > > [SendWorker:1:QuorumCnxManager$SendWorker@1143] - Interrupted > while > > waiting for message on queue > > java.lang.InterruptedException > > at > > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) > > at > > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088) > > at > > > java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418) > > at > > > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294) > > at > > > org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82) > > at > > > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131) > > 2020-02-19 05:03:10,340 [myid:3] - WARN > > [SendWorker:1:QuorumCnxManager$SendWorker@1153] - Send worker > leaving > > thread id 1 my id = 3 > > 2020-02-19 05:03:10,340 [myid:3] - WARN > > [RecvWorker:1:QuorumCnxManager$RecvWorker@1227] - Connection > broken for > > id 1, my id = 3, error = > > java.net.SocketException: Socket closed > > at java.net.SocketInputStream.socketRead0(Native > Method) > > at > > java.net.SocketInputStream.socketRead(SocketInputStream.java:116) > > at > > java.net.SocketInputStream.read(SocketInputStream.java:170) > > at > > java.net.SocketInputStream.read(SocketInputStream.java:141) > > at > > java.io.BufferedInputStream.fill(BufferedInputStream.java:246) > > at > > java.io.BufferedInputStream.read(BufferedInputStream.java:265) > > at > > java.io.DataInputStream.readInt(DataInputStream.java:387) > > at > > > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1212) > > 2020-02-19 05:03:10,341 [myid:3] - WARN > > [RecvWorker:1:QuorumCnxManager$RecvWorker@1230] - Interrupting > SendWorker > > 2020-02-19 05:03:10,340 [myid:3] - WARN > > [RecvWorker:3:QuorumCnxManager$RecvWorker@1227] - Connection > broken for > > id 3, my id = 3, error = > > java.io.EOFException > > at > > java.io.DataInputStream.readInt(DataInputStream.java:392) > > at > > > org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1212) > > 2020-02-19 05:03:10,341 [myid:3] - WARN > > [RecvWorker:3:QuorumCnxManager$RecvWorker@1230] - Interrupting > SendWorker > > 2020-02-19 05:03:10,341 [myid:3] - INFO [/0.0.0.0:3888 > > :QuorumCnxManager$Listener@924] - Received connection request > > 10.0.1.142:46326 > > 2020-02-19 05:03:10,343 [myid:3] - WARN > > [SendWorker:3:QuorumCnxManager$SendWorker@1143] - Interrupted > while > > waiting for message on queue > > java.lang.InterruptedException > > at > > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014) > > at > > > java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088) > > at > > > java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418) > > at > > > org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294) > > at > > > org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82) > > at > > > org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131) > > 2020-02-19 05:03:10,344 [myid:3] - WARN > > [SendWorker:3:QuorumCnxManager$SendWorker@1153] - Send worker > leaving > > thread id 3 my id = 3 > > 2020-02-19 05:03:10,344 [myid:3] - INFO > > [WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification: > 2 > > (message format version), 3 (n.leader), 0xb0000018c (n.zxid), 0x1 > > (n.round), LOOKING (n.state), 3 (n.sid), 0xb (n.peerEPoch), > LEADING (my > > state)0 (n.config version) > > 2020-02-19 05:03:10,345 [myid:3] - INFO > > [WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification: > 2 > > (message format version), 3 (n.leader), 0xb0000018c (n.zxid), 0x1 > > (n.round), LEADING (n.state), 3 (n.sid), 0xc (n.peerEPoch), > LEADING (my > > state)0 (n.config version) > > 2020-02-19 05:03:11,048 [myid:3] - INFO > > [SessionTracker:ZooKeeperServer@398] - Expiring session > > 0x30002ba40710018, timeout of 40000ms exceeded > > 2020-02-19 05:03:11,048 [myid:3] - INFO > > [SessionTracker:QuorumZooKeeperServer@157] - Submitting global > > closeSession request for session 0x30002ba40710018 > > > > All of the instances have a similar zoo.cfg: > > > > bash-4.3# cat conf/zoo.cfg > > # The number of milliseconds of each tick > > tickTime=2000 > > # The number of ticks that the initial > > # synchronization phase can take > > initLimit=30 > > # The number of ticks that can pass between > > # sending a request and getting an acknowledgement > > syncLimit=5 > > # Purge every 24 hours > > autopurge.purgeInterval=24 > > # the directory where the snapshot is stored. > > # do not use /tmp for storage, /tmp here is just > > # example sakes. > > dataDir=/opt/zookeeper/data > > # the port at which the clients will connect > > clientPort=2181 > > #Append other confg... > > electionPortBindRetry=1000 > > 4lw.commands.whitelist=stat, ruok, conf, mntr > > > > server.1=zookeeper1:2888:3888 > > server.2=zookeeper2:2888:3888 > > server.3=0.0.0.0:2888:3888 > > > > > > >
