[ https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767645#action_12767645 ]
Mahadev konar commented on ZOOKEEPER-512:
-----------------------------------------

Flavio, the patch looks good.

- The following logging can be improved to include which quorum server it corresponds to, both for unit testing and in general:

{code}
LOG.info("Leaving listener");
if(!shutdown)
    LOG.fatal("As I'm leaving the listener thread, I won't be able to participate in leader election any longer... digital life sucks");
{code}

Also, I can see the hatred for digital life :), but a more informative log message would be better!

- I am also having trouble understanding this:

{code}
synchronized void connectOne(long sid){
    if (senderWorkerMap.get(sid) == null){
        InetSocketAddress electionAddr;
        if(self.quorumPeers.containsKey(sid))
            electionAddr = self.quorumPeers.get(sid).electionAddr;
        else{
            LOG.warn("Invalid server id: " + sid);
            return;
        }
{code}

You mentioned above that connectOne was being called with a sid that wasn't in the map. Is that possible?

> FLE election fails to elect leader
> ----------------------------------
>
>                 Key: ZOOKEEPER-512
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.2.0
>            Reporter: Patrick Hunt
>            Assignee: Flavio Paiva Junqueira
>            Priority: Blocker
>             Fix For: 3.3.0
>
>         Attachments: jst.txt, log3_debug.tar.gz, logs.tar.gz, logs2.tar.gz,
> t5_aj.tar.gz, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch,
> ZOOKEEPER-512.patch
>
>
> I was doing some fault injection testing of 3.2.1 with the ZOOKEEPER-508 patch
> applied and noticed that after some time the ensemble failed to re-elect a
> leader.
> See the attached log files (5-member ensemble; typically server 5 is the leader).
> Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes
> elapse with no quorum.
> Environment:
> I was doing fault injection testing using AspectJ.
> The faults are injected into SocketChannel read/write: I throw exceptions
> randomly at a 1/200 ratio (rand.nextFloat() <= .005 => throw IOException).
> You can see when a fault is injected in the log via:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@38]
> - READPACKET FORCED FAIL
> vs. a read/write that didn't force a failure:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@41]
> - READPACKET OK
> Otherwise standard code/config (straight FLE quorum with 5 members).
> Also see the attached jstack trace; this is for one of the servers. Notice in
> particular that the number of SendWorker threads != the number of RecvWorker threads.

-- 
This message is automatically generated by JIRA.
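Mahadev's first suggestion, logging which quorum server a listener belongs to, could look something like the sketch below. It is only an illustration: the class name `ListenerExitLogging`, the `mySid` field (assumed to hold this peer's server id) and the message wording are hypothetical, and `java.util.logging` stands in for the log4j `Logger` used in the patch, so `severe` plays the role of `fatal`.

```java
import java.util.logging.Logger;

/**
 * Hypothetical sketch: listener-exit log messages that name the quorum
 * server they belong to. java.util.logging stands in for log4j here.
 */
public class ListenerExitLogging {
    private static final Logger LOG =
        Logger.getLogger(ListenerExitLogging.class.getName());

    // Hypothetical: the server id (sid) of this quorum peer.
    private final long mySid;

    public ListenerExitLogging(long mySid) {
        this.mySid = mySid;
    }

    /** Message logged every time the listener thread exits. */
    String leavingMessage() {
        return "Leaving listener for server " + mySid;
    }

    /** Message logged only when the listener exits without a shutdown request. */
    String unexpectedExitMessage() {
        return "Listener thread of server " + mySid
            + " exited without a shutdown request; this server can no longer"
            + " participate in leader election";
    }

    /** Mirrors the structure of the excerpt: info always, severe if not shutting down. */
    void onListenerExit(boolean shutdown) {
        LOG.info(leavingMessage());
        if (!shutdown) {
            LOG.severe(unexpectedExitMessage());
        }
    }

    public static void main(String[] args) {
        new ListenerExitLogging(3).onListenerExit(false);
    }
}
```

With the sid in every message, logs from several in-process peers in a unit test can be told apart without relying on thread names.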
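On the connectOne question: the quoted excerpt already treats an unknown sid as a warning and returns without connecting. A minimal sketch of that guard, assuming plain concurrent maps in place of `self.quorumPeers` and `senderWorkerMap`, a hypothetical class name `ConnectOneGuard`, and a hypothetical boolean return value added only so the behavior can be checked, does the membership test and the address lookup with a single `get()`. If the peer map can change concurrently, one lookup avoids the window between `containsKey()` and `get()`.

```java
import java.net.InetSocketAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch of the sid guard in connectOne(). The maps are
 * simplified stand-ins; the real code stores QuorumServer objects.
 */
public class ConnectOneGuard {
    // Known ensemble members: sid -> election address (stand-in for self.quorumPeers).
    final Map<Long, InetSocketAddress> quorumPeers = new ConcurrentHashMap<>();
    // Live sender workers: sid -> worker (stand-in for senderWorkerMap).
    final Map<Long, Object> senderWorkerMap = new ConcurrentHashMap<>();

    /** Returns true if a connection attempt would be made to sid. */
    synchronized boolean connectOne(long sid) {
        if (senderWorkerMap.get(sid) != null) {
            return false; // already have a sender worker for this sid
        }
        // Single lookup instead of containsKey() followed by get().
        InetSocketAddress electionAddr = quorumPeers.get(sid);
        if (electionAddr == null) {
            System.err.println("Invalid server id: " + sid); // stand-in for LOG.warn
            return false;
        }
        // The real code would open a connection to electionAddr here.
        return true;
    }
}
```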
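The fault injection in the report was done with AspectJ around SocketChannel reads and writes. A rough plain-Java approximation, without AspectJ, is a decorating channel that throws an `IOException` whenever `rand.nextFloat() <= failRate`, the same rule quoted in the issue. The class `FaultyReadChannel` and its constructor are hypothetical, only reads are wrapped here (writes would follow the same pattern), and the `Random` is passed in so a seeded instance makes a test run reproducible.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.util.Random;

/**
 * Hypothetical plain-Java stand-in for AspectJ-based fault injection:
 * each read fails with probability of roughly failRate.
 */
public class FaultyReadChannel implements ReadableByteChannel {
    private final ReadableByteChannel delegate;
    private final Random rand;
    private final float failRate;

    public FaultyReadChannel(ReadableByteChannel delegate, Random rand, float failRate) {
        this.delegate = delegate;
        this.rand = rand;
        this.failRate = failRate;
    }

    @Override
    public int read(ByteBuffer dst) throws IOException {
        // Same rule as the report: rand.nextFloat() <= rate => throw IOException.
        if (rand.nextFloat() <= failRate) {
            throw new IOException("injected read fault");
        }
        return delegate.read(dst);
    }

    @Override
    public boolean isOpen() {
        return delegate.isOpen();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}
```

With `failRate` set to `0.005f`, about one read in 200 fails, matching the ratio in the report.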