[jira] [Commented] (ZOOKEEPER-2172) Cluster crashes when reconfig a new node as a participant

2018-03-07 Thread Yuval Dori (JIRA)

[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16389307#comment-16389307 ]

Yuval Dori commented on ZOOKEEPER-2172:
---

Hi,

We see this issue at a few of our customers running version 3.4.8.

We are currently upgrading them to 3.4.10.

As 3.5.3 is still in beta, is it possible to backport this fix?

Thanks,

Yuval

> Cluster crashes when reconfig a new node as a participant
> -
>
> Key: ZOOKEEPER-2172
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2172
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: leaderElection, quorum, server
>Affects Versions: 3.5.0
> Environment: Ubuntu 12.04 + java 7
>Reporter: Ziyou Wang
>Assignee: Mohammad Arshad
>Priority: Critical
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2172-02.patch, ZOOKEEPER-2172-03.patch, 
> ZOOKEEPER-2172-04.patch, ZOOKEEPER-2172-06.patch, ZOOKEEPER-2172-07.patch, 
> ZOOKEEPER-2172.patch, ZOOKEPER-2172-05.patch, history.txt, node-1.log, 
> node-2.log, node-3.log, zoo-1.log, zoo-2-1.log, zoo-2-2.log, zoo-2-3.log, 
> zoo-2.log, zoo-2212-1.log, zoo-2212-2.log, zoo-2212-3.log, zoo-3-1.log, 
> zoo-3-2.log, zoo-3-3.log, zoo-3.log, zoo-4-1.log, zoo-4-2.log, zoo-4-3.log, 
> zoo.cfg.dynamic.1005d, zoo.cfg.dynamic.next, zookeeper-1.log, 
> zookeeper-1.out, zookeeper-2.log, zookeeper-2.out, zookeeper-3.log, 
> zookeeper-3.out
>
>
> The operations are quite simple: start three zk servers one by one, then 
> reconfig the cluster to add the new one as a participant. When I add the 
> third one, the zk cluster may enter a weird state and cannot recover.
>  
> I found “2015-04-20 12:53:48,236 [myid:1] - INFO [ProcessThread(sid:1 
> cport:-1)::PrepRequestProcessor@547] - Incremental reconfig” in the node-1 log. 
> So the first node received the reconfig cmd at 12:53:48. Later, it logged 
> “2015-04-20 12:53:52,230 [myid:1] - ERROR 
> [LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception 
> causing shutdown while sock still open” and “2015-04-20 12:53:52,231 [myid:1] 
> - WARN [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - *** GOODBYE 
> /10.0.0.2:55890 ”. From then on, the first node and second node 
> rejected all client connections, and the third node didn’t join the cluster as 
> a participant. The whole cluster was down.
>  
> When the problem happened, all three nodes used the same dynamic 
> config file, zoo.cfg.dynamic.1005d, which only contained the first two 
> nodes. But there was another, unused dynamic config file in the node-1 directory, 
> zoo.cfg.dynamic.next, which already contained three nodes.
>  
> When I extended the waiting time between starting the third node and 
> reconfiguring the cluster, the problem didn’t show up again. So it should be a 
> race condition problem.
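
For context, a minimal sketch of the incremental reconfig step described above, using the ZooKeeper 3.5 Java client. The connect string, server id, and ports are made-up placeholders, and the entry point shown is org.apache.zookeeper.admin.ZooKeeperAdmin (where reconfig lives as of 3.5.3), so treat this as an illustration of the operation rather than the reporter's exact procedure:

import org.apache.zookeeper.admin.ZooKeeperAdmin;
import org.apache.zookeeper.data.Stat;

public class ReconfigSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connect string; the reporter's real addresses are unknown.
        ZooKeeperAdmin admin = new ZooKeeperAdmin("10.0.0.1:2181", 15000, event -> { });

        // Add server 3 as a participant. The spec uses the same format as a line
        // of the dynamic config file (e.g. zoo.cfg.dynamic.1005d):
        //   server.<id>=<host>:<quorum-port>:<election-port>:participant;<client-port>
        String joining = "server.3=10.0.0.3:2888:3888:participant;2181";

        Stat stat = new Stat();
        // fromConfig = -1 means "apply against the current configuration version".
        byte[] newConfig = admin.reconfigure(joining, null, null, -1, stat);
        System.out.println("committed config:\n" + new String(newConfig));
        admin.close();
    }
}

If I understand the 3.5 reconfig flow correctly, zoo.cfg.dynamic.next is where a proposed-but-not-yet-committed configuration is staged, which would fit the observation that it already listed three nodes while every node's active config file still listed two.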





[jira] [Commented] (ZOOKEEPER-2172) Cluster crashes when reconfig a new node as a participant

2018-03-07 Thread Yuval Dori (JIRA)

[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16389604#comment-16389604 ]

Yuval Dori commented on ZOOKEEPER-2172:
---

Thanks Andor.

Do you think it's another ZK bug or something else?



[jira] [Commented] (ZOOKEEPER-2172) Cluster crashes when reconfig a new node as a participant

2018-03-07 Thread Yuval Dori (JIRA)

[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16389619#comment-16389619 ]

Yuval Dori commented on ZOOKEEPER-2172:
---

So, in order to investigate this further, do I need to subscribe here:

[https://zookeeper.apache.org/lists.html] ?

Thanks,

Yuval



[jira] [Commented] (ZOOKEEPER-2172) Cluster crashes when reconfig a new node as a participant

2018-03-07 Thread Yuval Dori (JIRA)

[ https://issues.apache.org/jira/browse/ZOOKEEPER-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16389390#comment-16389390 ]

Yuval Dori commented on ZOOKEEPER-2172:
---

1. The use case here was adding a node to a ZK cluster using zookeeper-3.5.jar. 
It's not the same use case as our customers'. The first customer uses 5 machines 
with 3 ZK instances; they shut down 2 machines (one of them running ZK, so 2 ZK 
instances were left) and got "java.lang.IllegalStateException: instance must be 
started before calling this method". The second customer got this error when 
deploying the application.

This is the stack trace from this issue:

2015-04-20 12:53:52,230 [myid:1] - ERROR [LearnerHandler-/10.0.0.2:55890:LearnerHandler@580] - Unexpected exception causing shutdown while sock still open 
java.io.EOFException 
    at java.io.DataInputStream.readInt(Unknown Source) 
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) 
    at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) 
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) 
    at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:493) 
2015-04-20 12:53:52,231 [myid:1] - WARN [LearnerHandler-/10.0.0.2:55890:LearnerHandler@595] - *** GOODBYE /10.0.0.2:55890 

And this is our customer's stack trace:

2018-02-15T09:58:12.094+0100; ERROR; WSOSTSLXWIT01/MANAGER; P3424/T194; [SPACE/LearnerHandler-/10.17.46.142:49336/LearnerHandler]; Unexpected exception causing shutdown while sock still open 
java.net.SocketTimeoutException: Read timed out 
    at java.net.SocketInputStream.socketRead0(Native Method) 
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) 
    at java.net.SocketInputStream.read(SocketInputStream.java:171) 
    at java.net.SocketInputStream.read(SocketInputStream.java:141) 
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) 
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265) 
    at java.io.DataInputStream.readInt(DataInputStream.java:387) 
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) 
    at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) 
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99) 
    at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:542) 

As you can see, the line numbers in BinaryInputArchive.java and 
LearnerHandler.java are different, but I assumed that's due to the different 
versions (3.4.8 vs 3.5).
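
For what it's worth, the two traces fail at the same point, the leader's per-follower read loop, but for different reasons: EOFException means the peer closed the connection while the read was in flight, whereas SocketTimeoutException means the connection stayed open but no data arrived within the socket's read timeout. A minimal, self-contained Java sketch of that distinction (plain java.io/java.net, nothing ZooKeeper-specific; the timeout value is made up):

import java.io.DataInputStream;
import java.io.EOFException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadIntFailures {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            // Case 1: the peer connects and closes immediately -> readInt() hits
            // end-of-stream and throws EOFException (like the trace in this issue).
            new Socket("localhost", server.getLocalPort()).close();
            try (Socket s = server.accept();
                 DataInputStream in = new DataInputStream(s.getInputStream())) {
                in.readInt();
            } catch (EOFException e) {
                System.out.println("peer closed the socket: " + e);
            }

            // Case 2: the peer connects but never sends anything -> readInt()
            // blocks until SO_TIMEOUT expires and throws SocketTimeoutException
            // (like the customer's 3.4.8 trace).
            try (Socket idlePeer = new Socket("localhost", server.getLocalPort());
                 Socket s = server.accept();
                 DataInputStream in = new DataInputStream(s.getInputStream())) {
                s.setSoTimeout(500); // read timeout, in milliseconds
                in.readInt();
            } catch (SocketTimeoutException e) {
                System.out.println("read timed out: " + e);
            }
        }
    }
}

So the 3.4.8 trace looks like a stalled peer rather than a dropped connection, which may point at a different root cause than this ticket.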

 

The first customer tested with ZK 3.5.3 and it didn't reproduce!

What is the new feature that was added in 3.5?

I'd be happy to hear whether you think it's related or not.

Thanks,

Yuval
