[ https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12747315#action_12747315 ]

Flavio Paiva Junqueira commented on ZOOKEEPER-512:
--------------------------------------------------

I have finally been able to reproduce the problem reliably, and it is true that 
the ensemble stalls after a while. Looking at the logs, I came to the same 
conclusion as Pat: not enough votes are coming through. Soon after, however, I 
also realized that most of the server processes had died, and here is the cause:

{noformat}
2009-08-25 10:51:04,617 - FATAL [SyncThread:2:syncrequestproces...@131] - Severe unrecoverable error, exiting
java.net.SocketException: Socket closed
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:99)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
        at org.apache.zookeeper.server.quorum.Follower.writePacket(Follower.java:100)
        at org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:52)
        at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:147)
        at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:92)
{noformat}

After three ZooKeeper processes die this way, the remaining servers obviously 
cannot form a quorum in a five-member ensemble. My conclusion is that the 
injected faults (the AspectJ aspects) are killing the processes, and leader 
election cannot succeed without a quorum.
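
The quorum arithmetic behind this can be sketched in a few lines of Java. This 
is only an illustration, assuming a simple majority quorum over a five-member 
ensemble as in Pat's setup; the names QuorumCheck and hasQuorum are 
hypothetical and are not ZooKeeper APIs.

```java
public class QuorumCheck {
    // Hypothetical helper, not a ZooKeeper API: a simple majority quorum
    // needs strictly more than half of the ensemble to be alive.
    public static boolean hasQuorum(int ensembleSize, int liveServers) {
        return liveServers > ensembleSize / 2;
    }

    public static void main(String[] args) {
        int ensembleSize = 5; // assumption: the 5-member ensemble from the report
        for (int dead = 0; dead <= 3; dead++) {
            int live = ensembleSize - dead;
            System.out.println(dead + " dead, " + live + " live -> quorum: "
                    + hasQuorum(ensembleSize, live));
        }
    }
}
```

With five servers, losing two still leaves a majority of three, but losing a 
third leaves only two live servers, which is below the majority.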

I think we should still add the finally block from the broken patch I uploaded 
earlier. It makes sense to have it, but it is probably OK to postpone it to 
3.3.

 

> FLE election fails to elect leader
> ----------------------------------
>
>                 Key: ZOOKEEPER-512
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.2.0
>            Reporter: Patrick Hunt
>            Assignee: Flavio Paiva Junqueira
>            Priority: Blocker
>             Fix For: 3.2.1, 3.3.0
>
>         Attachments: jst.txt, log3_debug.tar.gz, logs.tar.gz, logs2.tar.gz, 
> t5_aj.tar.gz, ZOOKEEPER-512.patch
>
>
> I was doing some fault injection testing of 3.2.1 with the ZOOKEEPER-508 patch 
> applied and noticed that after some time the ensemble failed to re-elect a 
> leader.
> See the attached log files (5-member ensemble; typically server 5 is the 
> leader).
> Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes 
> elapse with no quorum.
> Environment:
> I was doing fault injection testing using AspectJ. The faults are injected 
> into SocketChannel read/write; I throw exceptions randomly at a 1/200 ratio 
> (rand.nextFloat() <= .005 => throw IOException).
> You can see when a fault is injected in the log via:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@38] - READPACKET FORCED FAIL
> versus a read/write that was not forced to fail:
> 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@41] - READPACKET OK
> Otherwise, standard code/config (straight FLE quorum with 5 members).
> Also see the attached jstack trace, which is for one of the servers. Notice in 
> particular that the number of send workers != the number of recv workers.
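
The random fault injection described in the quoted report could look roughly 
like the following plain-Java sketch. It is not Pat's actual AspectJ aspect, 
which is not shown in this thread; FaultInjector and maybeFail are hypothetical 
names, and only the rand.nextFloat() <= .005 check and the IOException come 
from the description.

```java
import java.io.IOException;
import java.util.Random;

public class FaultInjector {
    private final Random rand;
    private final float failureRatio;

    public FaultInjector(Random rand, float failureRatio) {
        this.rand = rand;
        this.failureRatio = failureRatio;
    }

    // Hypothetical hook: call before a read/write to fail it at random.
    // Throws IOException when the random draw is <= failureRatio.
    public void maybeFail(String op) throws IOException {
        if (rand.nextFloat() <= failureRatio) {
            throw new IOException(op + " FORCED FAIL (injected)");
        }
    }

    public static void main(String[] args) {
        // .005 is the 1/200 ratio quoted in the report.
        FaultInjector injector = new FaultInjector(new Random(), 0.005f);
        int failures = 0;
        int attempts = 10_000;
        for (int i = 0; i < attempts; i++) {
            try {
                injector.maybeFail("READPACKET");
            } catch (IOException e) {
                failures++;
            }
        }
        System.out.println("injected " + failures + " failures in " + attempts + " attempts");
    }
}
```

At this ratio a busy connection hits an injected failure fairly often, so code 
that treats any I/O exception as fatal will eventually bring its process down.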

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
