[ https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746050#action_12746050 ]
Patrick Hunt commented on ZOOKEEPER-512:
----------------------------------------

I've been reading the Java API spec, for example:
http://java.sun.com/javase/6/docs/api/java/nio/channels/SocketChannel.html#read%28java.nio.ByteBuffer%29

There's nothing here (nor in the Socket docs) that I can find that says an IOException thrown by the read method results in what you say you are expecting. Unless you can find otherwise, I don't think it's prudent to assume a particular behavior.

The quorum was definitely _not_ formed when I took the log snapshot; there was no active leader. Clients were not able to connect to any server in the cluster, and running "stat" on the command port resulted in "zookeeper server not running" being returned by all 5 servers (not the typical "... mode:follower...." etc. stat result).

I'll re-run and attach with debug logs.

> FLE election fails to elect leader
> ----------------------------------
>
>                 Key: ZOOKEEPER-512
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.2.0
>            Reporter: Patrick Hunt
>            Priority: Blocker
>             Fix For: 3.2.1, 3.3.0
>
>         Attachments: jst.txt, logs.tar.gz, logs2.tar.gz
>
>
> I was doing some fault injection testing of 3.2.1 with ZOOKEEPER-508 patch
> applied and noticed that after some time the ensemble failed to re-elect a
> leader.
> See the attached log files - 5 member ensemble. Typically 5 is the leader.
> Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes
> elapse with no quorum.
> environment:
> I was doing fault injection testing using aspectj. The faults are injected
> into socketchannel read/write; I throw exceptions randomly at a 1/200 ratio
> (rand.nextFloat() <= .005 => throw IOException).
> You can see when a fault is injected in the log via:
> 2009-08-19 16:57:09,568 - INFO [Thread-74:readrequestfailsintermitten...@38]
> - READPACKET FORCED FAIL
> vs a read/write that didn't force fail:
> 2009-08-19 16:57:09,568 - INFO [Thread-74:readrequestfailsintermitten...@41]
> - READPACKET OK
> Otherwise standard code/config (straight FLE quorum with 5 members).
> Also see the attached jstack trace. This is for one of the servers. Notice in
> particular that the number of sendworkers != the number of recv workers.
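
For readers who want to reproduce the fault rule quoted above (rand.nextFloat() <= .005 => throw IOException), here is a minimal plain-Java sketch. It is not the AspectJ aspect used in the report, which is not shown in this message. The class name RandomFaultInjector and its methods are hypothetical, and the message strings only mirror the READPACKET log lines for illustration.

```java
import java.io.IOException;
import java.util.Random;

/**
 * Hypothetical stand-in for the fault rule described in the report.
 * The real test wove an AspectJ aspect around SocketChannel read/write;
 * this class only models the "fail roughly 1 call in 200" decision.
 */
public class RandomFaultInjector {
    private final Random rand;
    private final float failureRatio;

    public RandomFaultInjector(Random rand, float failureRatio) {
        this.rand = rand;
        this.failureRatio = failureRatio;
    }

    /** Throws IOException when the random draw is at or below the ratio. */
    public void maybeFail(String operation) throws IOException {
        if (rand.nextFloat() <= failureRatio) {
            throw new IOException(operation + " FORCED FAIL");
        }
    }

    /** Simulates a number of calls and counts how many were forced to fail. */
    public int countFailures(int calls) {
        int failures = 0;
        for (int i = 0; i < calls; i++) {
            try {
                maybeFail("READPACKET");
            } catch (IOException e) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // 0.005f corresponds to the 1/200 ratio mentioned in the report.
        RandomFaultInjector injector = new RandomFaultInjector(new Random(), 0.005f);
        int calls = 200_000;
        int failures = injector.countFailures(calls);
        System.out.println(failures + " forced failures in " + calls + " calls");
    }
}
```

In a real harness the maybeFail call would sit in front of the actual socket read or write, so the caller sees an IOException exactly as it would from a failed channel operation.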