[ https://issues.apache.org/jira/browse/ZOOKEEPER-822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899983#action_12899983 ]
Vishal K commented on ZOOKEEPER-822:
------------------------------------

While going through the code yesterday, I found two potential problems that I thought might be worth reporting in the context of this bug.

1. In FastLeaderElection.java

    /**
     * Check if all queues are empty, indicating that all messages have been delivered.
     */
    boolean haveDelivered() {
        for (ArrayBlockingQueue<ByteBuffer> queue : queueSendMap.values()) {
            LOG.debug("Queue size: " + queue.size());
            if (queue.size() == 0)
                return true;
        }
        return false;
    }

The haveDelivered() function returns true as soon as it finds one empty queue, without checking whether the rest of the queues are empty.

2. The QuorumCnxManager.connectAll() function connects to one peer at a time, and it uses a blocking connect (SocketChannel.open). I added a timeout to the SocketChannel.open and that did not fix the problem.

> Leader election taking a long time to complete
> -----------------------------------------------
>
>                 Key: ZOOKEEPER-822
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-822
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.0
>            Reporter: Vishal K
>            Priority: Blocker
>         Attachments: 822.tar.gz, rhel.tar.gz, test_zookeeper_1.log,
> test_zookeeper_2.log, zk_leader_election.tar.gz
>
>
> Created a 3 node cluster.
> 1. Fail the ZK leader
> 2. Let leader election finish. Restart the leader and let it join the
> 3. Repeat
> After a few rounds leader election takes anywhere from 25 to 60 seconds to finish.
> Note: we didn't have any ZK clients and no new znodes were created.
> zoo.cfg is shown below:
> #Mon Jul 19 12:15:10 UTC 2010
> server.1=192.168.4.12\:2888\:3888
> server.0=192.168.4.11\:2888\:3888
> clientPort=2181
> dataDir=/var/zookeeper
> syncLimit=2
> server.2=192.168.4.13\:2888\:3888
> initLimit=5
> tickTime=2000
> I have attached logs from two nodes that took a long time to form the cluster
> after failing the leader. The leader was down anyways so logs from that node
> shouldn't matter.
> Look for "START HERE". Logs after that point should be of interest.
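
For item 1 in the comment, here is a minimal sketch of what the check might look like if it followed its own Javadoc ("all queues are empty"). This is not the actual ZooKeeper patch: the class name HaveDeliveredSketch, the Long key type, and passing the map in as a parameter are assumptions made so the example stands alone. Only the queue type and the name queueSendMap come from the quoted code.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-alone class; the real method is an instance method
// that reads a queueSendMap field rather than taking it as a parameter.
public class HaveDeliveredSketch {

    /**
     * Returns true only if every send queue is empty.
     * An empty map (no peers) is treated as "everything delivered".
     */
    static boolean haveDelivered(Map<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap) {
        for (ArrayBlockingQueue<ByteBuffer> queue : queueSendMap.values()) {
            if (!queue.isEmpty()) {
                // At least one message is still waiting to be sent.
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<Long, ArrayBlockingQueue<ByteBuffer>> queues = new ConcurrentHashMap<>();
        queues.put(1L, new ArrayBlockingQueue<>(1)); // empty queue
        ArrayBlockingQueue<ByteBuffer> pending = new ArrayBlockingQueue<>(1);
        pending.add(ByteBuffer.allocate(4));          // one undelivered message
        queues.put(2L, pending);

        System.out.println("delivered with one pending queue: " + haveDelivered(queues));
        pending.clear();
        System.out.println("delivered after draining: " + haveDelivered(queues));
    }
}
```

The quoted version returns on the first empty queue it sees, so with one empty and one non-empty queue its answer depends on iteration order. The sketch inverts the test and returns false on the first non-empty queue instead.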