[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing

2020-03-16 Thread Jira


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060596#comment-17060596
 ] 

Michael Dürr commented on ZOOKEEPER-2164:
-

Thank you very much [~eolivelli] and [~symat] !

> fast leader election keeps failing
> --
>
> Key: ZOOKEEPER-2164
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2164
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection
>Affects Versions: 3.4.5
>Reporter: Michi Mutsuzaki
>Assignee: Mate Szalay-Beko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.7.0, 3.6.1, 3.5.8
>
>  Time Spent: 7h 50m
>  Remaining Estimate: 0h
>
> I have a 3-node cluster with sids 1, 2 and 3. Originally 2 is the leader. 
> When I shut down 2, 1 and 3 keep going back to leader election. Here is what 
> seems to be happening.
> - Both 1 and 3 elect 3 as the leader.
> - 1 receives votes from 3 and itself, and starts trying to connect to 3 as a 
> follower.
> - 3 doesn't receive votes for 5 seconds because connectOne() to 2 doesn't 
> time out for 5 seconds: 
> https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L346
> - By the time 3 receives votes, 1 has given up trying to connect to 3: 
> https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/Learner.java#L247
> I'm using 3.4.5, but it looks like this part of the code hasn't changed for a 
> while, so I'm guessing later versions have the same issue.
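
The race described above can be put into numbers with a small Java model. This is only a sketch: the retry count and the pause between retries are illustrative assumptions, not values taken from Learner.java, and it assumes a refused connection attempt fails immediately.

```java
final class ElectionRaceSketch {
    // Time until a follower stops retrying, assuming every refused attempt
    // fails at once and the follower pauses between attempts (hypothetical model).
    static long followerWindowMs(int tries, long pauseMs) {
        return Math.max(0, tries - 1) * pauseMs;
    }

    // True if the follower gives up before the elected leader is unblocked.
    static boolean followerGivesUpFirst(long leaderBlockedMs, int tries, long pauseMs) {
        return followerWindowMs(tries, pauseMs) < leaderBlockedMs;
    }

    public static void main(String[] args) {
        // Leader blocked for 5 s in connectOne(), follower retries 5 times, 1 s apart.
        System.out.println(followerGivesUpFirst(5000, 5, 1000)); // true
        // Same follower, but the leader is blocked for only 500 ms.
        System.out.println(followerGivesUpFirst(500, 5, 1000));  // false
    }
}
```

Under these assumptions the election only settles when the leader is blocked for less time than the follower keeps retrying.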



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ZOOKEEPER-3760) remove a useless throwing CliException

2020-03-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ZOOKEEPER-3760:
--
Labels: pull-request-available  (was: )

> remove a useless throwing CliException
> --
>
> Key: ZOOKEEPER-3760
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3760
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.5.7
>Reporter: Jinjiang Ling
>Priority: Major
>  Labels: pull-request-available
> Attachments: ZOOKEEPER-3760-1.patch
>
>
> When I upgraded ZooKeeper from 3.4.13 to 3.5.7 in my application, I found that 
> the function processCmd in ZooKeeperMain.java looks like this:
> {code:java}
> protected boolean processCmd(MyCommandOptions co) throws CliException, 
> IOException, InterruptedException {
>     boolean watch = false;
>     try {
>         watch = processZKCmd(co);
>         exitCode = ExitCode.EXECUTION_FINISHED.getValue();
>     } catch (CliException ex) {
>         exitCode = ex.getExitCode();
>         System.err.println(ex.getMessage());
>     }
>     return watch;
> }
> {code}
> The signature declares CliException, but that exception is already caught 
> inside the function, so I think the declaration can be removed.





[jira] [Updated] (ZOOKEEPER-3760) remove a useless throwing CliException

2020-03-16 Thread Jinjiang Ling (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinjiang Ling updated ZOOKEEPER-3760:
-
Attachment: ZOOKEEPER-3760-1.patch

> remove a useless throwing CliException
> --
>
> Key: ZOOKEEPER-3760
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3760
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.5.7
>Reporter: Jinjiang Ling
>Priority: Major
> Attachments: ZOOKEEPER-3760-1.patch
>
>
> When I upgraded ZooKeeper from 3.4.13 to 3.5.7 in my application, I found that 
> the function processCmd in ZooKeeperMain.java looks like this:
> {code:java}
> protected boolean processCmd(MyCommandOptions co) throws CliException, 
> IOException, InterruptedException {
>     boolean watch = false;
>     try {
>         watch = processZKCmd(co);
>         exitCode = ExitCode.EXECUTION_FINISHED.getValue();
>     } catch (CliException ex) {
>         exitCode = ex.getExitCode();
>         System.err.println(ex.getMessage());
>     }
>     return watch;
> }
> {code}
> The signature declares CliException, but that exception is already caught 
> inside the function, so I think the declaration can be removed.





[jira] [Created] (ZOOKEEPER-3760) remove a useless throwing CliException

2020-03-16 Thread Jinjiang Ling (Jira)
Jinjiang Ling created ZOOKEEPER-3760:


 Summary: remove a useless throwing CliException
 Key: ZOOKEEPER-3760
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3760
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.5.7
Reporter: Jinjiang Ling


When I upgraded ZooKeeper from 3.4.13 to 3.5.7 in my application, I found that 
the function processCmd in ZooKeeperMain.java looks like this:
{code:java}
protected boolean processCmd(MyCommandOptions co) throws CliException, 
IOException, InterruptedException {
    boolean watch = false;
    try {
        watch = processZKCmd(co);
        exitCode = ExitCode.EXECUTION_FINISHED.getValue();
    } catch (CliException ex) {
        exitCode = ex.getExitCode();
        System.err.println(ex.getMessage());
    }
    return watch;
}
{code}
The signature declares CliException, but that exception is already caught 
inside the function, so I think the declaration can be removed.
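
To illustrate the point, here is a self-contained Java sketch with hypothetical stand-in types (DemoCliException and ProcessCmdSketch are not ZooKeeper classes). It shows that a method compiles without the throws entry when the only source of the checked exception is handled inside the method.

```java
// Hypothetical stand-in for a checked CLI exception carrying an exit code.
class DemoCliException extends Exception {
    private final int exitCode;

    DemoCliException(String message, int exitCode) {
        super(message);
        this.exitCode = exitCode;
    }

    int getExitCode() {
        return exitCode;
    }
}

final class ProcessCmdSketch {
    int exitCode = -1;

    // Stand-in for the inner command handler: fails with the checked exception.
    boolean processZKCmd(String cmd) throws DemoCliException {
        if (cmd.isEmpty()) {
            throw new DemoCliException("empty command", 2);
        }
        return cmd.startsWith("watch");
    }

    // No "throws DemoCliException" here: the exception is handled below,
    // so it can never escape this method.
    boolean processCmd(String cmd) {
        boolean watch = false;
        try {
            watch = processZKCmd(cmd);
            exitCode = 0;
        } catch (DemoCliException ex) {
            exitCode = ex.getExitCode();
            System.err.println(ex.getMessage());
        }
        return watch;
    }
}
```

One caveat for a real change of this kind: callers that wrap the method in a catch block for the removed checked exception would stop compiling, because Java rejects a catch clause for a checked exception that the try body cannot throw.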





[jira] [Commented] (ZOOKEEPER-3756) Members failing to rejoin quorum

2020-03-16 Thread Dai Shi (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060440#comment-17060440
 ] 

Dai Shi commented on ZOOKEEPER-3756:


I think you are right that Kubernetes networking is one of the main issues 
here. Because the server IPs in the ZooKeeper configs point to Kubernetes 
services, opening a TCP connection to those IPs when there are no backend 
endpoints (which is the case when a pod is deleted) just hangs.

I tried running with {{-Dzookeeper.cnxTimeout=500}}, and now the cluster stays 
down for around 3 to 5 seconds when restarting the leader instead of more than 
30 seconds. We may be able to tolerate this much downtime as a stopgap.

I can try to build a 3.6.0 Docker image and test the multiAddress feature as 
well. Is there anything I should pay attention to while upgrading to 3.6.0? 
Also, is it possible to downgrade back to 3.5.7 afterwards?
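
For readers following along, a minimal Java sketch of what lowering {{zookeeper.cnxTimeout}} changes, assuming the property value is used directly as the connect timeout in milliseconds. The class and method names are hypothetical; this is not the QuorumCnxManager code.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

final class BoundedConnectSketch {
    // The 5 second default mentioned in this thread.
    static final int DEFAULT_TIMEOUT_MS = 5000;

    // Reads -Dzookeeper.cnxTimeout=<ms>, falling back to the default
    // when the property is missing or not a number.
    static int timeoutMillis() {
        return Integer.getInteger("zookeeper.cnxTimeout", DEFAULT_TIMEOUT_MS);
    }

    // Returns true if a TCP connection could be opened within timeoutMs.
    // A hanging endpoint costs at most timeoutMs instead of blocking longer.
    static boolean canConnect(InetSocketAddress addr, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(addr, timeoutMs);
            return true;
        } catch (IOException e) { // includes SocketTimeoutException
            return false;
        }
    }
}
```

With a service IP that silently drops packets, each attempt in this sketch blocks for the full timeout, so going from 5000 ms to 500 ms shortens every blocked attempt by the same factor.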

> Members failing to rejoin quorum
> 
>
> Key: ZOOKEEPER-3756
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3756
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: leaderElection
>Affects Versions: 3.5.6, 3.5.7
>Reporter: Dai Shi
>Assignee: Mate Szalay-Beko
>Priority: Major
> Attachments: Dockerfile, configmap.yaml, docker-entrypoint.sh, 
> jmx.yaml, zoo-0.log, zoo-1.log, zoo-2.log, zoo-service.yaml, zookeeper.yaml
>
>
> Not sure if this is the place to ask, please close if it's not.
> I am seeing some behavior that I can't explain since upgrading to 3.5:
> In a 5 member quorum, when server 3 is the leader and each server has this in 
> their configuration: 
> {code:java}
> server.1=100.71.255.254:2888:3888:participant;2181
> server.2=100.71.255.253:2888:3888:participant;2181
> server.3=100.71.255.252:2888:3888:participant;2181
> server.4=100.71.255.251:2888:3888:participant;2181
> server.5=100.71.255.250:2888:3888:participant;2181{code}
> If servers 1 or 2 are restarted, they fail to rejoin the quorum with this in 
> the logs:
> {code:java}
> 2020-03-11 20:23:35,720 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1175] - 
> LOOKING
> 2020-03-11 20:23:35,721 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):FastLeaderElection@885]
>  - New election. My id =  2, proposed zxid=0x1b8005f4bba
> 2020-03-11 20:23:35,733 [myid:2] - INFO  
> [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, 
> so dropping the connection: (3, 2)
> 2020-03-11 20:23:35,734 [myid:2] - INFO  
> [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection 
> request 100.126.116.201:36140
> 2020-03-11 20:23:35,735 [myid:2] - INFO  
> [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, 
> so dropping the connection: (4, 2)
> 2020-03-11 20:23:35,740 [myid:2] - INFO  
> [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, 
> so dropping the connection: (5, 2)
> 2020-03-11 20:23:35,740 [myid:2] - INFO  
> [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection 
> request 100.126.116.201:36142
> 2020-03-11 20:23:35,740 [myid:2] - INFO  
> [WorkerReceiver[myid=2]:FastLeaderElection@679] - Notification: 2 (message 
> format version), 2 (n.leader), 0x1b8005f4bba (n.zxid), 0x1 (n.round), LOOKING 
> (n.state), 2 (n.sid), 0x1b8 (n.peerEPoch), LOOKING (my state)0 (n.config 
> version)
> 2020-03-11 20:23:35,742 [myid:2] - WARN  
> [SendWorker:3:QuorumCnxManager$SendWorker@1143] - Interrupted while waiting 
> for message on queue
> java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
> at 
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
> at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294)
> at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82)
> at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131)
> 2020-03-11 20:23:35,744 [myid:2] - WARN  
> [SendWorker:3:QuorumCnxManager$SendWorker@1153] - Send worker leaving thread  
> id 3 my id = 2
> 2020-03-11 20:23:35,745 [myid:2] - WARN  
> [RecvWorker:3:QuorumCnxManager$RecvWorker@1230] - Interrupting 
> SendWorker{code}
> The only way I can seem to get them to rejoin the quorum is to restart the 
> leader.
> However, if I remove server 4 and 5 from the configuration of server 1 or 2 
> (so only servers 1, 2, and 3 remain in the configuration 

[jira] [Created] (ZOOKEEPER-3759) A way to configure the jmx rmi port

2020-03-16 Thread Agostino Sarubbo (Jira)
Agostino Sarubbo created ZOOKEEPER-3759:
---

 Summary: A way to configure the jmx rmi port
 Key: ZOOKEEPER-3759
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3759
 Project: ZooKeeper
  Issue Type: Bug
Reporter: Agostino Sarubbo


The start script misses a way to configure a java_rmi port, see also:
https://issues.apache.org/jira/browse/KAFKA-8658
[https://github.com/apache/kafka/pull/7088/commits/d02e14da8752a08bfe4f837d1cfea2c7b51e07af]





[jira] [Commented] (ZOOKEEPER-3758) Update from 3.5.7 to 3.6.0 does not work

2020-03-16 Thread Agostino Sarubbo (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060265#comment-17060265
 ] 

Agostino Sarubbo commented on ZOOKEEPER-3758:
-

Hello, here is the requested data:
{code:java}
~ # java -version
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
{code}
{code:java}
# zoo.cfg
tickTime=2000
dataDir=/opt/loway/zookeeper/data
dataLogDir=/opt/loway/zookeeper/logs
clientPort=2181
secureClientPort=2281
initLimit=100
syncLimit=30
4lw.commands.whitelist=*
autopurge.purgeInterval=1
autopurge.snapRetainCount=5
server.1=zookeeper1.mydomain:2888:3888
server.2=zookeeper2.mydomain:2888:3888
server.3=zookeeper3.mydomain:2888:3888
server.4=zookeeper4.mydomain:2888:3888
server.5=zookeeper5.mydomain:2888:3888{code}
We update the ZooKeeper nodes one by one by installing the new version. We use 
static configs, and the update is run by Ansible, so there is no human error 
during the update.

Is there anything else I can provide to debug the issue?
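
As a reading aid for the zoo.cfg above: initLimit and syncLimit are counted in ticks, and the session timeout bounds default to 2 and 20 ticks. The Java sketch below (the class name is hypothetical) turns the posted values into milliseconds.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class ZooCfgTimingSketch {
    final int tickTimeMs;
    final int initLimitTicks;
    final int syncLimitTicks;

    // Parses the three timing keys from zoo.cfg-style text.
    ZooCfgTimingSketch(String cfgText) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(cfgText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        tickTimeMs = Integer.parseInt(p.getProperty("tickTime").trim());
        initLimitTicks = Integer.parseInt(p.getProperty("initLimit").trim());
        syncLimitTicks = Integer.parseInt(p.getProperty("syncLimit").trim());
    }

    // Time a follower gets to connect and sync to the leader.
    long initTimeoutMs() {
        return (long) tickTimeMs * initLimitTicks;
    }

    // Maximum time a follower may lag behind the leader.
    long syncTimeoutMs() {
        return (long) tickTimeMs * syncLimitTicks;
    }

    // Defaults when minSessionTimeout / maxSessionTimeout are not set.
    int defaultMinSessionTimeoutMs() {
        return 2 * tickTimeMs;
    }

    int defaultMaxSessionTimeoutMs() {
        return 20 * tickTimeMs;
    }

    public static void main(String[] args) {
        ZooCfgTimingSketch t =
                new ZooCfgTimingSketch("tickTime=2000\ninitLimit=100\nsyncLimit=30\n");
        System.out.println(t.initTimeoutMs());              // 200000
        System.out.println(t.syncTimeoutMs());              // 60000
        System.out.println(t.defaultMinSessionTimeoutMs()); // 4000
    }
}
```

The 4000 ms minimum matches the "minSessionTimeout set to 4000" line in the posted log.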

> Update from 3.5.7 to 3.6.0 does not work
> 
>
> Key: ZOOKEEPER-3758
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3758
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Reporter: Agostino Sarubbo
>Assignee: Mate Szalay-Beko
>Priority: Major
>
> Hello,
>  we have a cluster with 5 zookeeper servers. We tried the update from 3.5.7 
> to 3.6.0 but it does not work.
> We got the following:
> {code:java}
> 2020-03-16 10:40:45,514 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@863] - Peer state changed: looking
> 2020-03-16 10:40:45,514 [myid:1] - WARN  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@1501] - PeerState set to LOOKING
> 2020-03-16 10:40:45,514 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@1371] - LOOKING
> 2020-03-16 10:40:45,514 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):FastLeaderElection@931] - New election. My id = 1, proposed zxid=0x0
> 2020-03-16 10:40:45,515 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:1, n.round:0x1b, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,517 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:2, n.state:FOLLOWING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b0004, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,517 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:3, n.state:FOLLOWING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b0004, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,517 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:5, n.state:FOLLOWING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b0004, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,518 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:4, n.state:LEADING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b0004, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,518 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@857] - Peer state changed: following
> 2020-03-16 10:40:45,518 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@1453] - FOLLOWING
> 2020-03-16 10:40:45,518 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ZooKeeperServer@1246] - minSessionTimeout set to 4000
> 2020-03-16 10:40:45,518 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ZooKeeperServer@1255] - maxSessionTimeout set to 4
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ResponseCache@45] - Response cache size is initialized with value 400.
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ResponseCache@45] - Response cache size is initialized with value 400.
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@111] - zookeeper.pathStats.slotCapacity = 60

[jira] [Commented] (ZOOKEEPER-2164) fast leader election keeps failing

2020-03-16 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060204#comment-17060204
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-2164:
-

FYI: this ticket covered multiple leader election errors. We fixed one (with 
the 0.0.0.0 addresses), but the original one (slow leader election due to the 
synchronized {{connectOne}} method call and socket timeouts) remained unfixed. 
I have just run into the same original issue in ZOOKEEPER-3756 and plan to fix 
it. I don't think we should re-open this Jira; I will use the new one instead.

> fast leader election keeps failing
> --
>
> Key: ZOOKEEPER-2164
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2164
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection
>Affects Versions: 3.4.5
>Reporter: Michi Mutsuzaki
>Assignee: Mate Szalay-Beko
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.7.0, 3.6.1, 3.5.8
>
>  Time Spent: 7h 50m
>  Remaining Estimate: 0h
>
> I have a 3-node cluster with sids 1, 2 and 3. Originally 2 is the leader. 
> When I shut down 2, 1 and 3 keep going back to leader election. Here is what 
> seems to be happening.
> - Both 1 and 3 elect 3 as the leader.
> - 1 receives votes from 3 and itself, and starts trying to connect to 3 as a 
> follower.
> - 3 doesn't receive votes for 5 seconds because connectOne() to 2 doesn't 
> time out for 5 seconds: 
> https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L346
> - By the time 3 receives votes, 1 has given up trying to connect to 3: 
> https://github.com/apache/zookeeper/blob/41c9fcb3ca09cd3d05e59fe47f08ecf0b85532c8/src/java/main/org/apache/zookeeper/server/quorum/Learner.java#L247
> I'm using 3.4.5, but it looks like this part of the code hasn't changed for a 
> while, so I'm guessing later versions have the same issue.





[jira] [Commented] (ZOOKEEPER-3756) Members failing to rejoin quorum

2020-03-16 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060200#comment-17060200
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3756:
-

OK, I have a theory... Maybe this is what happens:
- After shutting down the leader, the whole leader election restarts.
- ZooKeeper tries to open socket connections to the other ZooKeeper servers 
using synchronized methods, so only one can run at a time (see on the master 
branch: 
https://github.com/apache/zookeeper/blob/a5a4743733b8939464af82c1ee68a593fadbe362/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L688
 and 
https://github.com/apache/zookeeper/blob/a5a4743733b8939464af82c1ee68a593fadbe362/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/QuorumCnxManager.java#L759)
- The default timeout is 5 seconds (this is why there are no leader-election 
related messages in your log files for 5 seconds, until we hit the timeout of 
the socket open to server 3).
- By the time the 5-second timeout elapsed, the leader election protocol had 
also timed out (but AFAIK it always increases its internal timeout; I will 
need to verify this).
- After this happens a few times, either the leader election protocol timeout 
has grown enough to tolerate the 5-second delay, and/or server 3 has restarted 
and the socket can be opened again, so the blockage clears and everything goes 
smoothly afterwards. But it took 30 seconds, which is way too long...
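
The serialization effect in the theory above can be reproduced with a few lines of Java. This is a sketch only: connectOne here is a hypothetical stand-in that sleeps instead of opening a socket, and the delay is shortened so the demo runs quickly.

```java
final class SerializedConnectSketch {
    private final long connectDelayMs;

    SerializedConnectSketch(long connectDelayMs) {
        this.connectDelayMs = connectDelayMs;
    }

    // Hypothetical stand-in for a blocking connect. Because the method is
    // synchronized, a second caller waits until the first one has returned.
    synchronized void connectOne(long sid) {
        try {
            Thread.sleep(connectDelayMs); // simulates waiting for a socket timeout
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Calls connectOne for every sid from its own thread and returns the
    // total wall-clock time in milliseconds.
    long connectAll(long... sids) {
        long start = System.nanoTime();
        Thread[] threads = new Thread[sids.length];
        for (int i = 0; i < sids.length; i++) {
            final long sid = sids[i];
            threads[i] = new Thread(() -> connectOne(sid));
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

Even though every call runs on its own thread, three calls with a 100 ms delay take about 300 ms in total, because the lock lets only one of them proceed at a time.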

The question is why the socket has to time out (wait for 5 seconds) instead of 
the connection attempt failing immediately with some 'host unreachable' 
exception, which is what we would expect when the server goes down and no IP 
connection can be established. Usually we don't see this problem in production, 
so I guess it has something to do with Kubernetes networking.

Still, this part needs to be refactored in ZooKeeper: we have to make 
{{connectOne}} asynchronous, which is not an easy task. This was also 
suggested in ZOOKEEPER-2164 (but in the end other errors were fixed in that 
ticket).
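
One possible shape for an asynchronous connect, sketched in Java under stated assumptions: every attempt is handed to a thread pool, so a hanging peer delays only its own future. All names are hypothetical, the connector callback stands in for the real socket code, and this is not the planned ZooKeeper change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.LongPredicate;

final class AsyncConnectSketch {
    // Submits one connection attempt per peer and returns without waiting.
    // 'connector' is a hypothetical callback: true means the peer was reached.
    static List<CompletableFuture<Boolean>> initiate(
            ExecutorService pool, LongPredicate connector, long... sids) {
        List<CompletableFuture<Boolean>> attempts = new ArrayList<>();
        for (long sid : sids) {
            attempts.add(CompletableFuture.supplyAsync(() -> connector.test(sid), pool));
        }
        return attempts;
    }

    // Demo connector: the peer 'deadSid' hangs for delayMs and then fails,
    // every other peer is reached immediately.
    static LongPredicate hangingPeer(long deadSid, long delayMs) {
        return sid -> {
            if (sid != deadSid) {
                return true;
            }
            try {
                Thread.sleep(delayMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return false;
        };
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        List<CompletableFuture<Boolean>> attempts =
                initiate(pool, hangingPeer(3, 500), 1, 2, 3);
        System.out.println(attempts.get(0).join()); // true
        System.out.println(attempts.get(2).join()); // false
        pool.shutdown();
    }
}
```

The pool size bounds how many hanging peers can be tolerated at once; with fewer threads than dead peers, healthy peers would again queue behind the slow attempts.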

In the meantime, there might be some workarounds:
# You can decrease the connection timeout to e.g. 500 ms or 1000 ms using the 
{{-Dzookeeper.cnxTimeout=500}} system property. I am not sure if it will help, 
but I would be glad if you could test it.
# Another, independent workaround would be to use the multiAddress feature of 
ZooKeeper 3.6.0, enabled with {{-Dzookeeper.multiAddress.enabled=true}}. Then 
ZooKeeper should periodically check the availability of the currently used 
election addresses and kill the socket if the host is unavailable. This way we 
might kill the dead socket before the timeout happens. However, it might run 
ICMP traffic (ping) in the background, and I am not sure how reliable that is 
in Kubernetes.
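
To make the caveat about the second workaround concrete, here is a Java sketch of an availability check, assuming it is built on something like {{InetAddress.isReachable}} (an assumption; this thread does not confirm how ZooKeeper implements it). Per the Java documentation, a typical implementation of that call sends ICMP echo requests if the process has the privilege and otherwise tries a TCP connection to port 7, which is why the result can depend on the network setup.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.util.List;
import java.util.Optional;

final class ReachabilityProbeSketch {
    // Returns the first candidate that answers within timeoutMs, if any.
    static Optional<InetAddress> firstReachable(List<InetAddress> candidates, int timeoutMs) {
        for (InetAddress candidate : candidates) {
            try {
                if (candidate.isReachable(timeoutMs)) {
                    return Optional.of(candidate);
                }
            } catch (IOException e) {
                // A network error counts as "unreachable"; try the next address.
            }
        }
        return Optional.empty();
    }
}
```

If ICMP and port 7 are both filtered, a check like this reports a healthy host as unreachable, so it would need to be tested in the target cluster before relying on it.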

Whether or not the workarounds fix the problem for you, I would suggest 
keeping this ticket open, and I will try to implement asynchronous connection 
establishment somehow.

> Members failing to rejoin quorum
> 
>
> Key: ZOOKEEPER-3756
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3756
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: leaderElection
>Affects Versions: 3.5.6, 3.5.7
>Reporter: Dai Shi
>Assignee: Mate Szalay-Beko
>Priority: Major
> Attachments: Dockerfile, configmap.yaml, docker-entrypoint.sh, 
> jmx.yaml, zoo-0.log, zoo-1.log, zoo-2.log, zoo-service.yaml, zookeeper.yaml
>
>
> Not sure if this is the place to ask, please close if it's not.
> I am seeing some behavior that I can't explain since upgrading to 3.5:
> In a 5 member quorum, when server 3 is the leader and each server has this in 
> their configuration: 
> {code:java}
> server.1=100.71.255.254:2888:3888:participant;2181
> server.2=100.71.255.253:2888:3888:participant;2181
> server.3=100.71.255.252:2888:3888:participant;2181
> server.4=100.71.255.251:2888:3888:participant;2181
> server.5=100.71.255.250:2888:3888:participant;2181{code}
> If servers 1 or 2 are restarted, they fail to rejoin the quorum with this in 
> the logs:
> {code:java}
> 2020-03-11 20:23:35,720 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1175] - 
> LOOKING
> 2020-03-11 20:23:35,721 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):FastLeaderElection@885]
>  - New election. My id =  2, proposed zxid=0x1b8005f4bba
> 2020-03-11 20:23:35,733 [myid:2] - INFO  
> [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, 
> so dropping the connection: (3, 2)
> 2020-03-11 20:23:35,734 [myid:2] - INFO  
> [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection 
> 

[jira] [Assigned] (ZOOKEEPER-3758) Update from 3.5.7 to 3.6.0 does not work

2020-03-16 Thread Mate Szalay-Beko (Jira)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mate Szalay-Beko reassigned ZOOKEEPER-3758:
---

Assignee: Mate Szalay-Beko

> Update from 3.5.7 to 3.6.0 does not work
> 
>
> Key: ZOOKEEPER-3758
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3758
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Reporter: Agostino Sarubbo
>Assignee: Mate Szalay-Beko
>Priority: Major
>
> Hello,
>  we have a cluster with 5 zookeeper servers. We tried the update from 3.5.7 
> to 3.6.0 but it does not work.
> We got the following:
> {code:java}
> 2020-03-16 10:40:45,514 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@863] - Peer state changed: looking
> 2020-03-16 10:40:45,514 [myid:1] - WARN  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@1501] - PeerState set to LOOKING
> 2020-03-16 10:40:45,514 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@1371] - LOOKING
> 2020-03-16 10:40:45,514 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):FastLeaderElection@931] - New election. My id = 1, proposed zxid=0x0
> 2020-03-16 10:40:45,515 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:1, n.round:0x1b, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,517 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:2, n.state:FOLLOWING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b0004, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,517 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:3, n.state:FOLLOWING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b0004, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,517 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:5, n.state:FOLLOWING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b0004, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,518 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:4, n.state:LEADING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b0004, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,518 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@857] - Peer state changed: following
> 2020-03-16 10:40:45,518 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@1453] - FOLLOWING
> 2020-03-16 10:40:45,518 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ZooKeeperServer@1246] - minSessionTimeout set to 4000
> 2020-03-16 10:40:45,518 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ZooKeeperServer@1255] - maxSessionTimeout set to 4
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ResponseCache@45] - Response cache size is initialized with value 400.
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ResponseCache@45] - Response cache size is initialized with value 400.
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@111] - zookeeper.pathStats.slotCapacity = 60
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@112] - zookeeper.pathStats.slotDuration = 15
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@113] - zookeeper.pathStats.maxDepth = 6
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@114] - zookeeper.pathStats.initialDelay = 5
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@115] - zookeeper.pathStats.delay = 5
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@116] - zookeeper.pathStats.enabled = 

[jira] [Comment Edited] (ZOOKEEPER-3758) Update from 3.5.7 to 3.6.0 does not work

2020-03-16 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060158#comment-17060158
 ] 

Mate Szalay-Beko edited comment on ZOOKEEPER-3758 at 3/16/20, 12:06 PM:


Also a small hint: I checked the code, and AFAICS this exception suggests that 
the given ZooKeeper instance (a follower) doesn't know the quorum address / 
port of the newly elected ZooKeeper server. This shouldn't really happen 
unless you hit some bug or configuration issue. I am happy to dig deeper if you 
can send more info.

Also, asking on the user mailing list (as Enrico suggested) is better, as more 
people are watching there.


was (Author: symat):
also a small hint: I checked the code and this exception shows that the given 
ZooKeeper instance (a follower) don't know what is the quorum address / port of 
the newly elected ZooKeeper server.  This shouldn't really happen, unless you 
hit some bug or configuration issue. I am happy to dig deeper if you can send 
more info.

Also asking in the user mail list (as Enrico suggested) is better, as more 
people are watching there.

> Update from 3.5.7 to 3.6.0 does not work
> 
>
> Key: ZOOKEEPER-3758
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3758
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Reporter: Agostino Sarubbo
>Priority: Major
>
> Hello,
>  we have a cluster with 5 zookeeper servers. We tried the update from 3.5.7 
> to 3.6.0 but it does not work.
> We got the following:
> {code:java}
> 2020-03-16 10:40:45,514 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@863] - Peer state changed: looking
> 2020-03-16 10:40:45,514 [myid:1] - WARN  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@1501] - PeerState set to LOOKING
> 2020-03-16 10:40:45,514 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@1371] - LOOKING
> 2020-03-16 10:40:45,514 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):FastLeaderElection@931] - New election. My id = 1, proposed zxid=0x0
> 2020-03-16 10:40:45,515 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:1, n.round:0x1b, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,517 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:2, n.state:FOLLOWING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b0004, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,517 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:3, n.state:FOLLOWING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b0004, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,517 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:5, n.state:FOLLOWING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b0004, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,518 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:4, n.state:LEADING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b0004, message format version:0x2, n.config version:0x0
> 2020-03-16 10:40:45,518 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@857] - Peer state changed: following
> 2020-03-16 10:40:45,518 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@1453] - FOLLOWING
> 2020-03-16 10:40:45,518 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ZooKeeperServer@1246] - minSessionTimeout set to 4000
> 2020-03-16 10:40:45,518 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ZooKeeperServer@1255] - maxSessionTimeout set to 4
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ResponseCache@45] - Response cache size is initialized with value 400.
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ResponseCache@45] - Response cache size is initialized with value 400.
> 2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@111] - 

[jira] [Comment Edited] (ZOOKEEPER-3758) Update from 3.5.7 to 3.6.0 does not work

2020-03-16 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060154#comment-17060154
 ] 

Mate Szalay-Beko edited comment on ZOOKEEPER-3758 at 3/16/20, 12:05 PM:


We tested the 3.5.7 -> 3.6.0 upgrade before the release, but of course it is 
always possible that we missed something...

Could you also share your configs and java version when you write the email? 
Also please provide some more background info, like: are you doing a 
rolling-upgrade, or just simply starting a new cluster with the old data? Did 
you change anything in the config compared to the old cluster? Are you using 
static config files, or dynamic reconfiguration?


was (Author: symat):
We tested the 3.5.7 -> 3.6.0 upgrade before the release, but of course it is 
always possible that we missed something...

Could you also share your configs and java version when you write the email? 
Also please provide some more background info, like: are you doing a 
rolling-upgrade, or just simply starting a new cluster with the old data? Did 
you change anything in the config compared to the old cluster?


[jira] [Commented] (ZOOKEEPER-3758) Update from 3.5.7 to 3.6.0 does not work

2020-03-16 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060158#comment-17060158
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3758:
-

Also a small hint: I checked the code, and this exception shows that the given 
ZooKeeper instance (a follower) doesn't know the quorum address / port of 
the newly elected leader. This shouldn't really happen unless you 
hit some bug or configuration issue. I am happy to dig deeper if you can send 
more info.

Also, asking on the user mailing list (as Enrico suggested) is better, as more 
people are watching there.
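
When reading an election log like the one in the issue description, it can help to tally which leader each peer reports before looking at the connection step. The following is only a minimal sketch, assuming the 3.6.0 notification format seen in the report (`n.sid:…, n.state:…, n.leader:…`). `tally_leaders` and `NOTIFICATION` are hypothetical names for illustration, not part of ZooKeeper.

```python
import re
from collections import Counter

# Fields of a 3.6.0-style FastLeaderElection notification, as seen in the report.
# The state pattern tolerates stray spaces left by line wrapping ("FOLLOWI NG").
NOTIFICATION = re.compile(r"n\.sid:(\d+),\s*n\.state:[A-Z ]+?\s*,\s*n\.leader:(\d+)")


def tally_leaders(log_text):
    """Count how many distinct peers (n.sid) report each leader (n.leader)."""
    latest = {}
    for sid, leader in NOTIFICATION.findall(log_text):
        latest[int(sid)] = int(leader)  # a later notification overrides an earlier one
    return Counter(latest.values())


# Notification fragments copied from the log in the issue description.
sample = (
    "n.sid:1, n.state:LOOKING , n.leader:1, n.round:0x1b "
    "n.sid:2, n.state:FOLLOWI NG, n.leader:4, n.round:0x1a "
    "n.sid:3, n.state:FOLLOWI NG, n.leader:4, n.round:0x1a "
    "n.sid:5, n.state:FOLLOWI NG, n.leader:4, n.round:0x1a "
    "n.sid:4, n.state:LEADING , n.leader:4, n.round:0x1a"
)

votes = tally_leaders(sample)
quorum = 5 // 2 + 1  # majority of the 5-server ensemble (assumes all 5 are voters)
for leader, count in votes.most_common():
    print(f"leader {leader}: {count} peer(s), quorum reached: {count >= quorum}")
```

On these fragments, four of the five peers report server 4 as leader, which matches server 1 switching to FOLLOWING. So the election itself looks consistent, and the problem is more likely in the step where the follower connects to the leader.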


[jira] [Commented] (ZOOKEEPER-3758) Update from 3.5.7 to 3.6.0 does not work

2020-03-16 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060154#comment-17060154
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3758:
-

We tested the 3.5.7 -> 3.6.0 upgrade before the release, but of course it is 
always possible that we missed something...

Could you also share your configs and java version when you write the email? 
Also please provide some more background info, like: are you doing a 
rolling-upgrade, or just simply starting a new cluster with the old data? Did 
you change anything in the config compared to the old cluster?


[jira] [Comment Edited] (ZOOKEEPER-3758) Update from 3.5.7 to 3.6.0 does not work

2020-03-16 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060154#comment-17060154
 ] 

Mate Szalay-Beko edited comment on ZOOKEEPER-3758 at 3/16/20, 11:56 AM:


We tested the 3.5.7 -> 3.6.0 upgrade before the release, but of course it is 
always possible that we missed something...

Could you also share your configs and java version when you write the email? 
Also please provide some more background info, like: are you doing a 
rolling-upgrade, or just simply starting a new cluster with the old data? Did 
you change anything in the config compared to the old cluster?


was (Author: symat):
We tested the 3.5.7 -> 3.6.0 upgrade before the release, but of course it tis 
always possible that we missed something...

Could you also share your configs and java version when you write the email? 
Also please provide some more background info, like: are you doing a 
rolling-upgrade, or just simply starting a new cluster with the old data? Did 
you change anything in the config compared to the old cluster?


[jira] [Commented] (ZOOKEEPER-3758) Update from 3.5.7 to 3.6.0 does not work

2020-03-16 Thread Enrico Olivelli (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060095#comment-17060095
 ] 

Enrico Olivelli commented on ZOOKEEPER-3758:


Please start a discussion on u...@zookeeper.apache.org
It will be easier to help you.



[jira] [Created] (ZOOKEEPER-3758) Update from 3.5.7 to 3.6.0 does not work

2020-03-16 Thread Agostino Sarubbo (Jira)
Agostino Sarubbo created ZOOKEEPER-3758:
---

 Summary: Update from 3.5.7 to 3.6.0 does not work
 Key: ZOOKEEPER-3758
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3758
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Reporter: Agostino Sarubbo


Hello,
we have a cluster with 5 ZooKeeper servers. We tried to update from 3.5.7 to 
3.6.0, but it does not work.

We got the following:
{code:java}
2020-03-16 10:40:45,514 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@863] - Peer state changed: looking
2020-03-16 10:40:45,514 [myid:1] - WARN  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@1501] - PeerState set to LOOKING
2020-03-16 10:40:45,514 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@1371] - LOOKING
2020-03-16 10:40:45,514 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):FastLeaderElection@931] - New election. My id = 1, proposed zxid=0x0
2020-03-16 10:40:45,515 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:1, n.state:LOOKING, n.leader:1, n.round:0x1b, n.peerEpoch:0x0, n.zxid:0x0, message format version:0x2, n.config version:0x0
2020-03-16 10:40:45,517 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:2, n.state:FOLLOWING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b0004, message format version:0x2, n.config version:0x0
2020-03-16 10:40:45,517 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:3, n.state:FOLLOWING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b0004, message format version:0x2, n.config version:0x0
2020-03-16 10:40:45,517 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:5, n.state:FOLLOWING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b0004, message format version:0x2, n.config version:0x0
2020-03-16 10:40:45,518 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection$Messenger$WorkerReceiver@376] - Notification: my state:LOOKING; n.sid:4, n.state:LEADING, n.leader:4, n.round:0x1a, n.peerEpoch:0x5c, n.zxid:0x5b0004, message format version:0x2, n.config version:0x0
2020-03-16 10:40:45,518 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@857] - Peer state changed: following
2020-03-16 10:40:45,518 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):QuorumPeer@1453] - FOLLOWING
2020-03-16 10:40:45,518 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ZooKeeperServer@1246] - minSessionTimeout set to 4000
2020-03-16 10:40:45,518 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ZooKeeperServer@1255] - maxSessionTimeout set to 4
2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ResponseCache@45] - Response cache size is initialized with value 400.
2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ResponseCache@45] - Response cache size is initialized with value 400.
2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@111] - zookeeper.pathStats.slotCapacity = 60
2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@112] - zookeeper.pathStats.slotDuration = 15
2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@113] - zookeeper.pathStats.maxDepth = 6
2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@114] - zookeeper.pathStats.initialDelay = 5
2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@115] - zookeeper.pathStats.delay = 5
2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):RequestPathMetricsCollector@116] - zookeeper.pathStats.enabled = false
2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ZooKeeperServer@1470] - The max bytes for all large requests are set to 104857600
2020-03-16 10:40:45,519 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=0.0.0.0:2281):ZooKeeperServer@1484] - The large request threshold is set to -1
2020-03-16 10:40:45,519 [myid:1] - 

[jira] [Commented] (ZOOKEEPER-3756) Members failing to rejoin quorum

2020-03-16 Thread Mate Szalay-Beko (Jira)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060051#comment-17060051
 ] 

Mate Szalay-Beko commented on ZOOKEEPER-3756:
-

Thanks, it's great that you were able to do this test and send all the logs. I 
need a bit more time to dig into it. I hope I can analyze it deeper and come 
back with some answers (possibly questions? :) ) today / tomorrow. 

> Members failing to rejoin quorum
> 
>
> Key: ZOOKEEPER-3756
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3756
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: leaderElection
>Affects Versions: 3.5.6, 3.5.7
>Reporter: Dai Shi
>Assignee: Mate Szalay-Beko
>Priority: Major
> Attachments: Dockerfile, configmap.yaml, docker-entrypoint.sh, 
> jmx.yaml, zoo-0.log, zoo-1.log, zoo-2.log, zoo-service.yaml, zookeeper.yaml
>
>
> Not sure if this is the place to ask, please close if it's not.
> I am seeing some behavior that I can't explain since upgrading to 3.5:
> In a 5 member quorum, when server 3 is the leader and each server has this in 
> their configuration: 
> {code:java}
> server.1=100.71.255.254:2888:3888:participant;2181
> server.2=100.71.255.253:2888:3888:participant;2181
> server.3=100.71.255.252:2888:3888:participant;2181
> server.4=100.71.255.251:2888:3888:participant;2181
> server.5=100.71.255.250:2888:3888:participant;2181{code}
> If servers 1 or 2 are restarted, they fail to rejoin the quorum with this in 
> the logs:
> {code:java}
> 2020-03-11 20:23:35,720 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):QuorumPeer@1175] - 
> LOOKING
> 2020-03-11 20:23:35,721 [myid:2] - INFO  
> [QuorumPeer[myid=2](plain=0.0.0.0:2181)(secure=disabled):FastLeaderElection@885]
>  - New election. My id =  2, proposed zxid=0x1b8005f4bba
> 2020-03-11 20:23:35,733 [myid:2] - INFO  
> [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, 
> so dropping the connection: (3, 2)
> 2020-03-11 20:23:35,734 [myid:2] - INFO  
> [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection 
> request 100.126.116.201:36140
> 2020-03-11 20:23:35,735 [myid:2] - INFO  
> [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, 
> so dropping the connection: (4, 2)
> 2020-03-11 20:23:35,740 [myid:2] - INFO  
> [WorkerSender[myid=2]:QuorumCnxManager@438] - Have smaller server identifier, 
> so dropping the connection: (5, 2)
> 2020-03-11 20:23:35,740 [myid:2] - INFO  
> [0.0.0.0/0.0.0.0:3888:QuorumCnxManager$Listener@924] - Received connection 
> request 100.126.116.201:36142
> 2020-03-11 20:23:35,740 [myid:2] - INFO  
> [WorkerReceiver[myid=2]:FastLeaderElection@679] - Notification: 2 (message 
> format version), 2 (n.leader), 0x1b8005f4bba (n.zxid), 0x1 (n.round), LOOKING 
> (n.state), 2 (n.sid), 0x1b8 (n.peerEPoch), LOOKING (my state)0 (n.config 
> version)
> 2020-03-11 20:23:35,742 [myid:2] - WARN  
> [SendWorker:3:QuorumCnxManager$SendWorker@1143] - Interrupted while waiting 
> for message on queue
> java.lang.InterruptedException
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
> at 
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
> at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1294)
> at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:82)
> at 
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1131)
> 2020-03-11 20:23:35,744 [myid:2] - WARN  
> [SendWorker:3:QuorumCnxManager$SendWorker@1153] - Send worker leaving thread  
> id 3 my id = 2
> 2020-03-11 20:23:35,745 [myid:2] - WARN  
> [RecvWorker:3:QuorumCnxManager$RecvWorker@1230] - Interrupting 
> SendWorker{code}
> The only way I have found to get them to rejoin the quorum is to restart the 
> leader.
> However, if I remove servers 4 and 5 from the configuration of server 1 or 2 
> (so only servers 1, 2, and 3 remain in the configuration file), then they can 
> rejoin the quorum fine. Is this expected, or am I doing something wrong? Any 
> help or explanation would be greatly appreciated. Thank you.
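
For readers of the quoted log: the "Have smaller server identifier, so dropping 
the connection: (3, 2)" lines appear to reflect the tie-break rule used for 
election connections, where only the peer with the larger server id keeps a 
connection it opened itself, and the smaller-id peer closes its socket and 
waits for the other side to connect back. The following is a minimal sketch of 
that rule as I understand it, not the ZooKeeper implementation; the class and 
method names are hypothetical.

```java
// Sketch only: models the rule behind the "Have smaller server identifier,
// so dropping the connection" log lines. The names below are hypothetical
// and are not part of the ZooKeeper API.
final class ConnectionTieBreakSketch {

    /**
     * Returns true if the server that opened the socket keeps it.
     * Assumption: only the peer with the larger id keeps an outbound
     * election connection; a server never connects to itself, so the
     * equal-id case is simply treated as "do not keep".
     */
    static boolean initiatorKeepsConnection(long myId, long remoteSid) {
        return myId > remoteSid;
    }

    public static void main(String[] args) {
        long myId = 2; // the restarted server from the quoted log
        for (long remote : new long[] {1, 3, 4, 5}) {
            String outcome = initiatorKeepsConnection(myId, remote)
                    ? "keep"
                    : "drop and wait for " + remote + " to connect back";
            // Printed in the same (remote, local) order as the log lines.
            System.out.println("(" + remote + ", " + myId + "): " + outcome);
        }
    }
}
```

Under this rule, server 2 dropping its own connections to servers 3, 4 and 5 
is the normal first step. The sketch does not explain why the expected 
connections back from those servers did not lead to server 2 rejoining.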


