Hi Steve,

it's possible that the quorum state depends on the order your nodes start.
In my kubernetes environment I've had a similar issue and I've noticed that
starting brokers 1 by 1, following the order from configuration file allows
all 3 to join the quorum but a reverse order would keep server started as
the last outside of the quorum. I was also using 0.0.0.0 in the
configuration and didn't try a full address due to readiness check
configuration.

Unfortunately I didn't have time to debug it any further so I've downgraded
back to 3.4 for the time being.

Hope you manage to find a solution,

Best,
Jan

On Wed, 19 Feb 2020, 15:47 Steve Jerman, <[email protected]> wrote:

> Hi,
>
> I've just been testing restarts ... I restarted one of the instances (id
> 1) ... and it doesn't join the quorum ... same error.
>
> Odd that the system started fine but can't handle a restart....
>
> Steve
>
> On 2/19/20, 7:45 AM, "Steve Jerman" <[email protected]> wrote:
>
>     Thank You Mate,
>
>     That fixed it. Unfortunately I can't easily avoid using 0.0.0.0
>
>     My configuration is built using Docker Storm and it doesn't let you
> bind to a host name...
>
>     Steve
>
>     On 2/19/20, 5:27 AM, "Szalay-Bekő Máté" <[email protected]>
> wrote:
>
>         Hi Steve!
>
>         If you are using ZooKeeper newer than 3.5.0, then this might be
> the issue
>         we are just discussing / trying to fix in ZOOKEEPER-2164.
>         Can you test your setup with a config where you don't use 0.0.0.0
> in the
>         server addresses?
>
>         If you need to bind to the 0.0.0.0 locally, then please set the
>         'quorumListenOnAllIPs' config property to true.
>
>         like:
>
>         # usually you don't really need this, unless if you actually need
> to bind
>         to multiple IPs
>         quorumListenOnAllIPs=true
>
>         # it is best if all the zoo.cfg files contain the same address
> settings,
>         and we don't use 0.0.0.0 here
>         server.1=zookeeper1:2888:3888
>         server.2=zookeeper2:2888:3888
>         server.3=zookeeper3:2888:3888
>
>         Kind regards,
>         Mate
>
>         On Wed, Feb 19, 2020 at 6:08 AM Steve Jerman <[email protected]>
> wrote:
>
>         > Hello folks,
>         >
>         > Wonder if anyone can help me. Suspect it must be something
> simple but I
>         > cant see it. Any suggestions about how to diagnose would be
> gratefully
>         > received.
>         >
>         > I have a three node ZK cluster, when it starts up only two of
> the nodes
>         > form a quorum. If I restart the leader the quorum reforms with
> the other
>         > two node…
>         >
>         > Thanks in advance for any  help
>         > Steve
>         >
>         > This is the ‘stat’ for the leader/follow…
>         >
>         > bash-5.0$ echo stat | nc zookeeper1 2181
>         > This ZooKeeper instance is not currently serving requests
>         >
>         > bash-5.0$ echo stat | nc zookeeper2 2181
>         > Zookeeper version:
> 3.5.7-f0fdd52973d373ffd9c86b81d99842dc2c7f660e, built
>         > on 02/10/2020 11:30 GMT
>         > Clients:
>         > /10.0.1.152:44910[1](queued=0,recved=151,sent=151)
>         > /10.0.1.140:53138[1](queued=0,recved=187,sent=187)
>         > /10.0.1.143:57422[1](queued=0,recved=151,sent=151)
>         > /10.0.1.152:59242[0](queued=0,recved=1,sent=0)
>         > /10.0.1.143:40826[1](queued=0,recved=1139,sent=1139)
>         > /10.0.1.152:49188[1](queued=0,recved=200,sent=203)
>         > /10.0.1.152:59548[1](queued=0,recved=1157,sent=1159)
>         > /10.0.1.140:36624[1](queued=0,recved=151,sent=151)
>         >
>         > Latency min/avg/max: 0/0/5
>         > Received: 3338
>         > Sent: 3342
>         > Connections: 8
>         > Outstanding: 0
>         > Zxid: 0xc000000f3
>         > Mode: follower
>         > Node count: 181
>         >
>         > bash-5.0$ echo stat | nc zookeeper3 2181
>         > Zookeeper version:
> 3.5.7-f0fdd52973d373ffd9c86b81d99842dc2c7f660e, built
>         > on 02/10/2020 11:30 GMT
>         > Clients:
>         > /10.0.1.152:49428[0](queued=0,recved=1,sent=0)
>         > /10.0.1.140:32912[1](queued=0,recved=1426,sent=1429)
>         >
>         > Latency min/avg/max: 0/0/4
>         > Received: 1684
>         > Sent: 1686
>         > Connections: 2
>         > Outstanding: 0
>         > Zxid: 0xc000000f4
>         > Mode: leader
>         > Node count: 181
>         > Proposal sizes last/min/max: 32/32/406
>         > bash-5.0$
>         >
>         > The trace for the failing node is:
>         >
>         > server.1=0.0.0.0:2888:3888
>         > server.2=zookeeper2:2888:3888
>         > server.3=zookeeper3:2888:3888
>         > ZooKeeper JMX enabled by default
>         > Using config: /opt/zookeeper/bin/../conf/zoo.cfg
>         > 2020-02-19 04:23:27,759 [myid:] - INFO
> [main:QuorumPeerConfig@135] -
>         > Reading configuration from: /opt/zookeeper/bin/../conf/zoo.cfg
>         > 2020-02-19 04:23:27,764 [myid:] - INFO
> [main:QuorumPeerConfig@387] -
>         > clientPortAddress is 0.0.0.0:2181
>         > 2020-02-19 04:23:27,764 [myid:] - INFO
> [main:QuorumPeerConfig@391] -
>         > secureClientPort is not set
>         > 2020-02-19 04:23:27,771 [myid:1] - INFO
> [main:DatadirCleanupManager@78]
>         > - autopurge.snapRetainCount set to 3
>         > 2020-02-19 04:23:27,772 [myid:1] - INFO
> [main:DatadirCleanupManager@79]
>         > - autopurge.purgeInterval set to 24
>         > 2020-02-19 04:23:27,772 [myid:1] - INFO
>         > [PurgeTask:DatadirCleanupManager$PurgeTask@138] - Purge task
> started.
>         > 2020-02-19 04:23:27,773 [myid:1] - INFO  [main:ManagedUtil@46]
> - Log4j
>         > found with jmx enabled.
>         > 2020-02-19 04:23:27,774 [myid:1] - INFO
> [PurgeTask:FileTxnSnapLog@115] -
>         > zookeeper.snapshot.trust.empty : false
>         > 2020-02-19 04:23:27,780 [myid:1] - INFO
>         > [PurgeTask:DatadirCleanupManager$PurgeTask@144] - Purge task
> completed.
>         > 2020-02-19 04:23:27,781 [myid:1] - INFO  [main:QuorumPeerMain@141]
> -
>         > Starting quorum peer
>         > 2020-02-19 04:23:27,786 [myid:1] - INFO
> [main:ServerCnxnFactory@135] -
>         > Using org.apache.zookeeper.server.NIOServerCnxnFactory as server
> connection
>         > factory
>         > 2020-02-19 04:23:27,788 [myid:1] - INFO
> [main:NIOServerCnxnFactory@673]
>         > - Configuring NIO connection handler with 10s sessionless
> connection
>         > timeout, 2 selector thread(s), 32 worker threads, and 64 kB
> direct buffers.
>         > 2020-02-19 04:23:27,791 [myid:1] - INFO
> [main:NIOServerCnxnFactory@686]
>         > - binding to port 0.0.0.0/0.0.0.0:2181
>         > 2020-02-19 <http://0.0.0.0/0.0.0.0:21812020-02-19> 04:23:27,809
> [myid:1]
>         > - INFO  [main:Log@169] - Logging initialized @249ms to
>         > org.eclipse.jetty.util.log.Slf4jLog
>         > 2020-02-19 04:23:27,913 [myid:1] - WARN
> [main:ContextHandler@1520] -
>         > o.e.j.s.ServletContextHandler@5abca1e0{/,null,UNAVAILABLE}
> contextPath
>         > ends with /*
>         > 2020-02-19 04:23:27,913 [myid:1] - WARN
> [main:ContextHandler@1531] -
>         > Empty contextPath
>         > 2020-02-19 04:23:27,922 [myid:1] - INFO  [main:X509Util@79] -
> Setting -D
>         > jdk.tls.rejectClientInitiatedRenegotiation=true to disable
> client-initiated
>         > TLS renegotiation
>         > 2020-02-19 04:23:27,923 [myid:1] - INFO  [main:FileTxnSnapLog@115]
> -
>         > zookeeper.snapshot.trust.empty : false
>         > 2020-02-19 04:23:27,923 [myid:1] - INFO  [main:QuorumPeer@1470]
> - Local
>         > sessions disabled
>         > 2020-02-19 04:23:27,923 [myid:1] - INFO  [main:QuorumPeer@1481]
> - Local
>         > session upgrading disabled
>         > 2020-02-19 04:23:27,923 [myid:1] - INFO  [main:QuorumPeer@1448]
> -
>         > tickTime set to 2000
>         > 2020-02-19 04:23:27,923 [myid:1] - INFO  [main:QuorumPeer@1492]
> -
>         > minSessionTimeout set to 4000
>         > 2020-02-19 04:23:27,924 [myid:1] - INFO  [main:QuorumPeer@1503]
> -
>         > maxSessionTimeout set to 40000
>         > 2020-02-19 04:23:27,924 [myid:1] - INFO  [main:QuorumPeer@1518]
> -
>         > initLimit set to 30
>         > 2020-02-19 04:23:27,932 [myid:1] - INFO  [main:ZKDatabase@117] -
>         > zookeeper.snapshotSizeFactor = 0.33
>         > 2020-02-19 04:23:27,933 [myid:1] - INFO  [main:QuorumPeer@1763]
> - Using
>         > insecure (non-TLS) quorum communication
>         > 2020-02-19 04:23:27,933 [myid:1] - INFO  [main:QuorumPeer@1769]
> - Port
>         > unification disabled
>         > 2020-02-19 04:23:27,933 [myid:1] - INFO  [main:QuorumPeer@2136]
> -
>         > QuorumPeer communication is not secured! (SASL auth disabled)
>         > 2020-02-19 04:23:27,933 [myid:1] - INFO  [main:QuorumPeer@2165]
> -
>         > quorum.cnxn.threads.size set to 20
>         > 2020-02-19 04:23:27,934 [myid:1] - INFO  [main:FileSnap@83] -
> Reading
>         > snapshot /opt/zookeeper/data/version-2/snapshot.90000043e
>         > 2020-02-19 04:23:27,963 [myid:1] - INFO  [main:Server@359] -
>         > jetty-9.4.24.v20191120; built: 2019-11-20T21:37:49.771Z; git:
>         > 363d5f2df3a8a28de40604320230664b9c793c16; jvm
>         > 1.8.0_111-internal-alpine-r0-b14
>         > 2020-02-19 04:23:27,989 [myid:1] - INFO
> [main:DefaultSessionIdManager@333]
>         > - DefaultSessionIdManager workerName=node0
>         > 2020-02-19 04:23:27,989 [myid:1] - INFO
> [main:DefaultSessionIdManager@338]
>         > - No SessionScavenger set, using defaults
>         > 2020-02-19 04:23:27,990 [myid:1] - INFO  [main:HouseKeeper@140]
> - node0
>         > Scavenging every 660000ms
>         > 2020-02-19 04:23:27,997 [myid:1] - INFO  [main:ContextHandler@825]
> -
>         > Started o.e.j.s.ServletContextHandler@5abca1e0{/,null,AVAILABLE}
>         > 2020-02-19 04:23:28,004 [myid:1] - INFO
> [main:AbstractConnector@330] -
>         > Started ServerConnector@2b98378d{HTTP/1.1,[http/1.1]}{
> 0.0.0.0:8080}
>         > 2020-02-19 04:23:28,004 [myid:1] - INFO  [main:Server@399] -
> Started
>         > @444ms
>         > 2020-02-19 04:23:28,004 [myid:1] - INFO
> [main:JettyAdminServer@112] -
>         > Started AdminServer on address 0.0.0.0, port 8080 and command
> URL /commands
>         > 2020-02-19 04:23:28,007 [myid:1] - INFO
>         > [main:QuorumCnxManager$Listener@867] - Election port bind
> maximum retries
>         > is 1000
>         > 2020-02-19 04:23:28,007 [myid:1] - INFO
>         > [QuorumPeerListener:QuorumCnxManager$Listener@917] - My
> election bind
>         > port: /0.0.0.0:3888
>         > 2020-02-19 04:23:28,014 [myid:1] - INFO
>         > [QuorumPeer[myid=1](plain=0.0.0.0:2181
> )(secure=disabled):QuorumPeer@1175]
>         > - LOOKING
>         > 2020-02-19 04:23:28,015 [myid:1] - INFO
>         > [QuorumPeer[myid=1](plain=0.0.0.0:2181
>         > )(secure=disabled):FastLeaderElection@885] - New election. My
> id =  1,
>         > proposed zxid=0xa000000fb
>         > 2020-02-19 04:23:28,018 [myid:1] - INFO
>         > [WorkerSender[myid=1]:QuorumCnxManager@438] - Have smaller
> server
>         > identifier, so dropping the connection: (2, 1)
>         > 2020-02-19 04:23:28,019 [myid:1] - INFO
>         > [WorkerSender[myid=1]:QuorumCnxManager@438] - Have smaller
> server
>         > identifier, so dropping the connection: (3, 1)
>         > 2020-02-19 04:23:28,019 [myid:1] - INFO
>         > [WorkerReceiver[myid=1]:FastLeaderElection@679] - Notification:
> 2
>         > (message format version), 1 (n.leader), 0xa000000fb (n.zxid), 0x1
>         > (n.round), LOOKING (n.state), 1 (n.sid), 0xa (n.peerEPoch),
> LOOKING (my
>         > state)0 (n.config version)
>         > 2020-02-19 04:23:28,221 [myid:1] - INFO
>         > [QuorumPeer[myid=1](plain=0.0.0.0:2181
>         > )(secure=disabled):QuorumCnxManager@438] - Have smaller server
>         > identifier, so dropping the connection: (2, 1)
>         > 2020-02-19 04:23:28,222 [myid:1] - INFO
>         > [QuorumPeer[myid=1](plain=0.0.0.0:2181
>         > )(secure=disabled):QuorumCnxManager@438] - Have smaller server
>         > identifier, so dropping the connection: (3, 1)
>         > 2020-02-19 04:23:28,222 [myid:1] - INFO
>         > [QuorumPeer[myid=1](plain=0.0.0.0:2181
>         > )(secure=disabled):FastLeaderElection@919] - Notification time
> out: 400
>         > 2020-02-19 04:23:28,623 [myid:1] - INFO
>         > [QuorumPeer[myid=1](plain=0.0.0.0:2181
>         > )(secure=disabled):QuorumCnxManager@438] - Have smaller server
>         > identifier, so dropping the connection: (2, 1)
>         > 2020-02-19 04:23:28,624 [myid:1] - INFO
>         > [QuorumPeer[myid=1](plain=0.0.0.0:2181
>         > )(secure=disabled):QuorumCnxManager@438] - Have smaller server
>         > identifier, so dropping the connection: (3, 1)
>         > 2020-02-19 04:23:28,624 [myid:1] - INFO
>         > [QuorumPeer[myid=1](plain=0.0.0.0:2181
>         > )(secure=disabled):FastLeaderElection@919] - Notification time
> out: 800
>         > …
>         >
>         > And for the leader:
>         > ….
>         > 2020-02-19 05:02:10,341 [myid:3] - INFO
>         > [WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification:
> 2
>         > (message format version), 3 (n.leader), 0xb0000018c (n.zxid), 0x1
>         > (n.round), LOOKING (n.state), 3 (n.sid), 0xb (n.peerEPoch),
> LEADING (my
>         > state)0 (n.config version)
>         > 2020-02-19 05:02:10,341 [myid:3] - INFO
>         > [WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification:
> 2
>         > (message format version), 3 (n.leader), 0xb0000018c (n.zxid), 0x1
>         > (n.round), LEADING (n.state), 3 (n.sid), 0xc (n.peerEPoch),
> LEADING (my
>         > state)0 (n.config version)
>         > 2020-02-19 05:02:33,640 [myid:3] - WARN
>         > [NIOWorkerThread-4:NIOServerCnxn@366] - Unable to read
> additional data
>         > from client sessionid 0x30002ba40710018, likely client has
> closed socket
>         > 2020-02-19 05:02:39,047 [myid:3] - INFO
>         > [SessionTracker:ZooKeeperServer@398] - Expiring session
>         > 0x20002ba2fd6001a, timeout of 40000ms exceeded
>         > 2020-02-19 05:02:39,048 [myid:3] - INFO
>         > [SessionTracker:QuorumZooKeeperServer@157] - Submitting global
>         > closeSession request for session 0x20002ba2fd6001a
>         > 2020-02-19 05:03:10,340 [myid:3] - INFO  [/0.0.0.0:3888
>         > :QuorumCnxManager$Listener@924] - Received connection request
>         > 10.0.1.152:52492
>         > 2020-02-19 05:03:10,340 [myid:3] - WARN
>         > [SendWorker:1:QuorumCnxManager$SendWorker@1143] - Interrupted
> while
>         > waiting for message on queue
>         > java.lang.InterruptedException
>         >                 at
>         >
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>         >                 at
>         >
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>         >                 at
>         >
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>         >                 at
>         >
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294)
>         >                 at
>         >
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82)
>         >                 at
>         >
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131)
>         > 2020-02-19 05:03:10,340 [myid:3] - WARN
>         > [SendWorker:1:QuorumCnxManager$SendWorker@1153] - Send worker
> leaving
>         > thread  id 1 my id = 3
>         > 2020-02-19 05:03:10,340 [myid:3] - WARN
>         > [RecvWorker:1:QuorumCnxManager$RecvWorker@1227] - Connection
> broken for
>         > id 1, my id = 3, error =
>         > java.net.SocketException: Socket closed
>         >                 at java.net.SocketInputStream.socketRead0(Native
> Method)
>         >                 at
>         > java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>         >                 at
>         > java.net.SocketInputStream.read(SocketInputStream.java:170)
>         >                 at
>         > java.net.SocketInputStream.read(SocketInputStream.java:141)
>         >                 at
>         > java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>         >                 at
>         > java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>         >                 at
>         > java.io.DataInputStream.readInt(DataInputStream.java:387)
>         >                 at
>         >
> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1212)
>         > 2020-02-19 05:03:10,341 [myid:3] - WARN
>         > [RecvWorker:1:QuorumCnxManager$RecvWorker@1230] - Interrupting
> SendWorker
>         > 2020-02-19 05:03:10,340 [myid:3] - WARN
>         > [RecvWorker:3:QuorumCnxManager$RecvWorker@1227] - Connection
> broken for
>         > id 3, my id = 3, error =
>         > java.io.EOFException
>         >                 at
>         > java.io.DataInputStream.readInt(DataInputStream.java:392)
>         >                 at
>         >
> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1212)
>         > 2020-02-19 05:03:10,341 [myid:3] - WARN
>         > [RecvWorker:3:QuorumCnxManager$RecvWorker@1230] - Interrupting
> SendWorker
>         > 2020-02-19 05:03:10,341 [myid:3] - INFO  [/0.0.0.0:3888
>         > :QuorumCnxManager$Listener@924] - Received connection request
>         > 10.0.1.142:46326
>         > 2020-02-19 05:03:10,343 [myid:3] - WARN
>         > [SendWorker:3:QuorumCnxManager$SendWorker@1143] - Interrupted
> while
>         > waiting for message on queue
>         > java.lang.InterruptedException
>         >                 at
>         >
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>         >                 at
>         >
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>         >                 at
>         >
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>         >                 at
>         >
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294)
>         >                 at
>         >
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82)
>         >                 at
>         >
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131)
>         > 2020-02-19 05:03:10,344 [myid:3] - WARN
>         > [SendWorker:3:QuorumCnxManager$SendWorker@1153] - Send worker
> leaving
>         > thread  id 3 my id = 3
>         > 2020-02-19 05:03:10,344 [myid:3] - INFO
>         > [WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification:
> 2
>         > (message format version), 3 (n.leader), 0xb0000018c (n.zxid), 0x1
>         > (n.round), LOOKING (n.state), 3 (n.sid), 0xb (n.peerEPoch),
> LEADING (my
>         > state)0 (n.config version)
>         > 2020-02-19 05:03:10,345 [myid:3] - INFO
>         > [WorkerReceiver[myid=3]:FastLeaderElection@679] - Notification:
> 2
>         > (message format version), 3 (n.leader), 0xb0000018c (n.zxid), 0x1
>         > (n.round), LEADING (n.state), 3 (n.sid), 0xc (n.peerEPoch),
> LEADING (my
>         > state)0 (n.config version)
>         > 2020-02-19 05:03:11,048 [myid:3] - INFO
>         > [SessionTracker:ZooKeeperServer@398] - Expiring session
>         > 0x30002ba40710018, timeout of 40000ms exceeded
>         > 2020-02-19 05:03:11,048 [myid:3] - INFO
>         > [SessionTracker:QuorumZooKeeperServer@157] - Submitting global
>         > closeSession request for session 0x30002ba40710018
>         >
>         > All of the instances have a similar zoo.cfg:
>         >
>         > bash-4.3# cat conf/zoo.cfg
>         > # The number of milliseconds of each tick
>         > tickTime=2000
>         > # The number of ticks that the initial
>         > # synchronization phase can take
>         > initLimit=30
>         > # The number of ticks that can pass between
>         > # sending a request and getting an acknowledgement
>         > syncLimit=5
>         > # Purge every 24 hours
>         > autopurge.purgeInterval=24
>         > # the directory where the snapshot is stored.
>         > # do not use /tmp for storage, /tmp here is just
>         > # example sakes.
>         > dataDir=/opt/zookeeper/data
>         > # the port at which the clients will connect
>         > clientPort=2181
>         > #Append other confg...
>         > electionPortBindRetry=1000
>         > 4lw.commands.whitelist=stat, ruok, conf, mntr
>         >
>         > server.1=zookeeper1:2888:3888
>         > server.2=zookeeper2:2888:3888
>         > server.3=0.0.0.0:2888:3888
>         >
>
>
>
>
>

Reply via email to