Flavio Paiva Junqueira commented on ZOOKEEPER-475:

Great catch! (I know it was Hudson that flagged it, but it is good that you
noticed it.)

The short version of the story is that the synchronization is not correct in 

The longer version is like this. From the traces, I can see the following 
sequence of messages:

* Replica 1 sends a message to itself and to Replica 2 stating that its current 
vote is for replica 1;
* Replica 2 sends a message to itself and to Replica 1 stating that its current 
vote is for replica 2;
* Replica 1 updates its vote, and sends a message to itself stating that its 
current vote is for replica 2;
* Since replica 1 has two votes for 2 in an ensemble of 3 replicas, replica 1 
decides to follow 2.
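
To make the vote counting above concrete, here is a minimal sketch in Java. It
is not the FastLeaderElection code; the class VoteTally and its methods are
hypothetical names chosen for illustration, and it assumes a candidate wins
once a strict majority of the ensemble has voted for it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of ZooKeeper.
public class VoteTally {
    // Latest vote seen from each sender: sender id -> proposed leader id.
    private final Map<Integer, Integer> latestVote = new HashMap<>();
    private final int ensembleSize;

    public VoteTally(int ensembleSize) {
        this.ensembleSize = ensembleSize;
    }

    // Record a notification; a newer vote from the same sender replaces the old one.
    public void receive(int sender, int proposedLeader) {
        latestVote.put(sender, proposedLeader);
    }

    // Assumed rule: a strict majority of the ensemble must back the candidate.
    public boolean hasQuorumFor(int candidate) {
        long count = latestVote.values().stream()
                .filter(v -> v == candidate)
                .count();
        return count > ensembleSize / 2;
    }

    public static void main(String[] args) {
        // What replica 1 sees: its own vote for 1, replica 2's vote for 2,
        // then its own updated vote for 2.
        VoteTally atReplica1 = new VoteTally(3);
        atReplica1.receive(1, 1);
        atReplica1.receive(2, 2);
        atReplica1.receive(1, 2);

        // What replica 2 sees: its own vote for 2 and replica 1's first vote.
        // The updated vote from replica 1 never arrives.
        VoteTally atReplica2 = new VoteTally(3);
        atReplica2.receive(2, 2);
        atReplica2.receive(1, 1);

        System.out.println(atReplica1.hasQuorumFor(2)); // true
        System.out.println(atReplica2.hasQuorumFor(2)); // false
    }
}
```

In this sketch replica 1 counts two votes for 2 and follows it, while replica 2
still counts only its own vote and cannot conclude that it was elected.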

The problem is that replica 2 never receives the message from 1 stating that
1 changed its vote to 2, which prevents 2 from becoming the leader. Looking
more carefully at why that happened, you can see that when 1 tries to send a
message to 2, QuorumCnxManager in 1 is shutting down a connection to 2 at the
same time that it is trying to open a new one. The incorrect synchronization
prevents the creation of the new connection, and 1 and 2 end up not
connected.
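
As an illustration of this kind of race, here is a minimal Java sketch of a
connection table in which closing an old connection cannot evict a newer one.
It is neither the QuorumCnxManager code nor the fix for this issue; PeerLinks,
Link, and their methods are hypothetical names, and a Link merely stands in
for a socket and its worker threads.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical illustration only; not part of ZooKeeper.
public class PeerLinks {
    // Stand-in for a socket plus its sender/receiver threads.
    public static final class Link {
        private final long peerId;
        private volatile boolean open = true;

        Link(long peerId) {
            this.peerId = peerId;
        }

        public boolean isOpen() {
            return open;
        }

        // First phase of a shutdown: the link stops being usable.
        public void shutDown() {
            open = false;
        }
    }

    private final ConcurrentMap<Long, Link> links = new ConcurrentHashMap<>();

    // Atomically reuse the registered link if it is still open, else replace it.
    public Link connect(long peerId) {
        return links.compute(peerId,
                (id, cur) -> (cur != null && cur.isOpen()) ? cur : new Link(id));
    }

    // Unregister the link only if it is still the one registered for its peer,
    // so a late close of a stale link leaves a newer link in place.
    public void close(Link link) {
        link.shutDown();
        links.remove(link.peerId, link);
    }

    public boolean isConnected(long peerId) {
        Link link = links.get(peerId);
        return link != null && link.isOpen();
    }

    public static void main(String[] args) {
        PeerLinks atReplica1 = new PeerLinks();
        Link old = atReplica1.connect(2);

        old.shutDown();                     // shutdown of the old link begins
        Link fresh = atReplica1.connect(2); // a send opens a new link meanwhile
        atReplica1.close(old);              // shutdown of the old link completes

        System.out.println(fresh != old);              // true
        System.out.println(atReplica1.isConnected(2)); // true
    }
}
```

The design choice here is that both the replacement in connect and the
conditional removal in close are atomic per peer, so the two operations
cannot undo each other regardless of how they interleave.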

> FLENewEpochTest failed on nightly builds.
> -----------------------------------------
>                 Key: ZOOKEEPER-475
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-475
>             Project: Zookeeper
>          Issue Type: Bug
>            Reporter: Mahadev konar
>            Assignee: Flavio Paiva Junqueira
>             Fix For: 3.2.1, 3.3.0
> The FLENewEpochTest failed on one of the nightly builds -
> http://hudson.zones.apache.org/hudson/view/ZooKeeper/job/ZooKeeper-trunk/377.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
