[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-10-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12770894#action_12770894
 ] 

Hudson commented on ZOOKEEPER-512:
--

Integrated in ZooKeeper-trunk #511 (See 
[http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/511/])
. FLE election fails to elect leader (flavio via mahadev)


 FLE election fails to elect leader
 --

 Key: ZOOKEEPER-512
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
 Project: Zookeeper
  Issue Type: Bug
  Components: quorum, server
Affects Versions: 3.2.0
Reporter: Patrick Hunt
Assignee: Flavio Paiva Junqueira
Priority: Blocker
 Fix For: 3.3.0

 Attachments: jst.txt, log3_debug.tar.gz, logs.tar.gz, logs2.tar.gz, 
 t5_aj.tar.gz, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, 
 ZOOKEEPER-512.patch, ZOOKEEPER-512.patch


 I was doing some fault injection testing of 3.2.1 with ZOOKEEPER-508 patch 
 applied and noticed that after some time the ensemble failed to re-elect a 
 leader.
 See the attached log files (5-member ensemble; server 5 is typically the leader).
 Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes 
 elapse.
 Environment:
 I was doing fault injection testing using aspectj. The faults are injected 
 into socketchannel read/write; I throw exceptions randomly at a 1/200 ratio 
 (rand.nextFloat() <= .005 => throw IOException).
 You can see when a fault is injected in the log via:
 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@38] 
 - READPACKET FORCED FAIL
 vs a read/write that didn't force fail:
 2009-08-19 16:57:09,568 - INFO  [Thread-74:readrequestfailsintermitten...@41] 
 - READPACKET OK
 Otherwise standard code/config (straight FLE quorum with 5 members).
 Also see the attached jstack trace, which is for one of the servers. Notice in 
 particular that the number of sendworkers != the number of recv workers.
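
 The injection rule described in the report can be sketched in plain Java. This is only an illustration: the real test used AspectJ advice around SocketChannel read/write, which is not shown in this thread. The class name `FaultInjector`, the method names, and the exact comparison against 0.005 are assumptions made for the example.

```java
import java.io.IOException;
import java.util.Random;

/** Hypothetical helper; stands in for the AspectJ advice used in the real test. */
final class FaultInjector {
    // Assumed threshold matching the 1/200 ratio mentioned in the report.
    static final float FAILURE_RATIO = 0.005f;

    private final Random rand;

    FaultInjector(Random rand) {
        this.rand = rand;
    }

    /** Throws IOException on roughly 1 call in 200; otherwise returns normally. */
    void maybeFail(String operation) throws IOException {
        if (rand.nextFloat() <= FAILURE_RATIO) {
            throw new IOException(operation + " FORCED FAIL");
        }
    }

    /** Runs n simulated reads and counts how many were forced to fail. */
    int countFailures(int n) {
        int failures = 0;
        for (int i = 0; i < n; i++) {
            try {
                maybeFail("READPACKET");
            } catch (IOException e) {
                failures++;
            }
        }
        return failures;
    }
}
```

 With a ratio this low, each individual read or write almost always succeeds, but a busy ensemble performs enough socket operations that some connection is broken every few seconds.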

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-10-27 Thread Mahadev konar (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12770641#action_12770641
 ] 

Mahadev konar commented on ZOOKEEPER-512:
-

+1 the patch looks good.




[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-10-23 Thread Flavio Paiva Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12769209#action_12769209
 ] 

Flavio Paiva Junqueira commented on ZOOKEEPER-512:
--

This is for retrying. If there is a problem while listening or trying to bind 
to the socket, it tries again and gives up after 3 consecutive attempts. 
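
The retry behavior described here could look roughly like the sketch below. It is not the code from the patch: the class `ListenerRetrySketch`, the `IoAction` interface, and both method names are hypothetical, and only the limit of 3 consecutive attempts comes from the comment above.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

/** Hypothetical sketch of "retry, then give up after 3 consecutive attempts". */
final class ListenerRetrySketch {
    static final int MAX_ATTEMPTS = 3; // limit taken from the comment above

    /** An action that may fail with an IOException (assumed helper type). */
    interface IoAction<T> {
        T run() throws IOException;
    }

    /** Runs the action until it succeeds or maxAttempts consecutive failures occur. */
    static <T> T retry(IoAction<T> action, int maxAttempts) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.run();
            } catch (IOException e) {
                last = e; // remember the failure and try again
            }
        }
        throw new IOException("giving up after " + maxAttempts + " attempts", last);
    }

    /** Binds a listening socket, retrying on failure. */
    static ServerSocket bindWithRetry(InetSocketAddress addr) throws IOException {
        return retry(() -> {
            ServerSocket ss = new ServerSocket();
            try {
                ss.bind(addr);
                return ss;
            } catch (IOException e) {
                ss.close(); // do not leak the unbound socket
                throw e;
            }
        }, MAX_ATTEMPTS);
    }
}
```

Once the limit is reached the caller has to decide what to do with the final exception. In the thread's context that is the point where the listener thread exits and the server can no longer take part in election.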




[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-10-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12768650#action_12768650
 ] 

Hadoop QA commented on ZOOKEEPER-512:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12422891/ZOOKEEPER-512.patch
  against trunk revision 828216.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/36/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/36/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/36/console

This message is automatically generated.




[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-10-22 Thread Flavio Paiva Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12768656#action_12768656
 ] 

Flavio Paiva Junqueira commented on ZOOKEEPER-512:
--

We have been testing this patch externally with Pat's fault injection framework 
that uses aspectj. It is difficult at this point to introduce his framework, so 
we have agreed to postpone adding such tests. The patch fixes some visible 
problems and passes previous tests.




[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-10-19 Thread Mahadev konar (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12767645#action_12767645
 ] 

Mahadev konar commented on ZOOKEEPER-512:
-

flavio, the patch looks good - 

The following logging can be improved to include which quorum server it 
corresponds to (for unit testing, and in general). 

{code}
LOG.info("Leaving listener");
if (!shutdown)
    LOG.fatal("As I'm leaving the listener thread, I won't be able "
            + "to participate in leader election any longer... digital life sucks");
{code}

Also, I can see the hatred for digital life :), but a more useful logging 
message would be better! 

- Also, I am having trouble understanding this - 

{code}
synchronized void connectOne(long sid){
    if (senderWorkerMap.get(sid) == null){
        InetSocketAddress electionAddr;
        if(self.quorumPeers.containsKey(sid))
            electionAddr = self.quorumPeers.get(sid).electionAddr;
        else{
            LOG.warn("Invalid server id: " + sid);
            return;
        }
{code} 

You mentioned above that connectOne was being called with a sid that wasn't in 
the map. Is that possible?
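
For readers following along, here is a self-contained sketch of the guard being discussed. It is not the patch itself: `ConnectGuardSketch`, `addPeer`, and the boolean return value are hypothetical simplifications, and the actual socket handling is left out.

```java
import java.net.InetSocketAddress;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified stand-in for the connectOne guard quoted above. */
final class ConnectGuardSketch {
    private final Map<Long, InetSocketAddress> electionAddrs = new HashMap<>();
    private final Set<Long> connected = new HashSet<>();

    synchronized void addPeer(long sid, InetSocketAddress electionAddr) {
        electionAddrs.put(sid, electionAddr);
    }

    /** Returns true if a connection to sid exists or was set up, false if sid is unknown. */
    synchronized boolean connectOne(long sid) {
        if (connected.contains(sid)) {
            return true; // a sender for this peer already exists
        }
        InetSocketAddress addr = electionAddrs.get(sid);
        if (addr == null) {
            // sid is not in the configured peer map: bail out instead of failing on a null lookup
            return false;
        }
        // A real implementation would open a socket to addr here; omitted in this sketch.
        connected.add(sid);
        return true;
    }
}
```

The point of the check is that a lookup for an unknown id yields no address, so returning early avoids dereferencing a missing entry.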

 FLE election fails to elect leader
 --

 Key: ZOOKEEPER-512
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
 Project: Zookeeper
  Issue Type: Bug
  Components: quorum, server
Affects Versions: 3.2.0
Reporter: Patrick Hunt
Assignee: Flavio Paiva Junqueira
Priority: Blocker
 Fix For: 3.3.0

 Attachments: jst.txt, log3_debug.tar.gz, logs.tar.gz, logs2.tar.gz, 
 t5_aj.tar.gz, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, 
 ZOOKEEPER-512.patch





[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-10-14 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12765674#action_12765674
 ] 

Patrick Hunt commented on ZOOKEEPER-512:


I don't think we are ready for that. It's pretty straightforward to run 
manually; however, when running automatically it's difficult to determine 
success/failure, for example. I'm sure we can do something, but it will take 
some effort.





[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-10-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12763850#action_12763850
 ] 

Hadoop QA commented on ZOOKEEPER-512:
-

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12417726/ZOOKEEPER-512.patch
  against trunk revision 823371.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/20/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/20/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/20/console

This message is automatically generated.




[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-27 Thread Flavio Paiva Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748298#action_12748298
 ] 

Flavio Paiva Junqueira commented on ZOOKEEPER-512:
--

Pat, I didn't understand from your last comment whether you have tried the patch I 
uploaded yesterday. If you did and it still doesn't work for you, I would 
appreciate it if you could upload logs and jstack traces when you have a chance. 




[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-26 Thread Benjamin Reed (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748018#action_12748018
 ] 

Benjamin Reed commented on ZOOKEEPER-512:
-

agreed. i think the problem is that under high load we don't have a period of 
error free operation. i think it is ok to generate errors randomly as we are 
doing, but we should have periods of error free operation so that things can 
settle down.

 FLE election fails to elect leader
 --

 Key: ZOOKEEPER-512
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
 Project: Zookeeper
  Issue Type: Bug
  Components: quorum, server
Affects Versions: 3.2.0
Reporter: Patrick Hunt
Assignee: Flavio Paiva Junqueira
Priority: Blocker
 Fix For: 3.2.1, 3.3.0

 Attachments: jst.txt, log3_debug.tar.gz, logs.tar.gz, logs2.tar.gz, 
 t5_aj.tar.gz, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, 
 ZOOKEEPER-512.patch





[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-26 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12748020#action_12748020
 ] 

Patrick Hunt commented on ZOOKEEPER-512:


I'm seeing 2 cases:

1) The entire quorum is unstable because clients are driving load and causing many 
(simulated) network failures. In this case I agree.

2) But I also see the case where the quorum is stable, but one server has 
been orphaned from the group. It is never able to reconnect, even though the 
clients are stopped and the quorum in general is stable.

Eventually 3 servers become orphaned (out of 5), in which case, regardless of 
whether clients are running or not, the quorum will never re-form. I don't agree 
that this is ok.
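
The arithmetic behind that last point can be written down as a small sketch, assuming a simple majority quorum (the report describes a "straight fle quorum with 5 members"). The class and method names are hypothetical.

```java
/** Hypothetical helper illustrating simple-majority quorum arithmetic. */
final class MajorityQuorumSketch {
    /** Smallest number of servers that forms a strict majority. */
    static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    /** True if the servers that are still connected can form a majority. */
    static boolean canFormQuorum(int ensembleSize, int orphaned) {
        return ensembleSize - orphaned >= majority(ensembleSize);
    }
}
```

For 5 servers the majority is 3, so the ensemble tolerates 1 or 2 orphaned servers, but with 3 orphaned only 2 remain connected and no leader can be elected until at least one of them reconnects.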





[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-25 Thread Mahadev konar (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747234#action_12747234
 ] 

Mahadev konar commented on ZOOKEEPER-512:
-

Can we move this out to 3.3? I don't think it's a regression, or is it? 

 FLE election fails to elect leader
 --

 Key: ZOOKEEPER-512
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
 Project: Zookeeper
  Issue Type: Bug
  Components: quorum, server
Affects Versions: 3.2.0
Reporter: Patrick Hunt
Assignee: Flavio Paiva Junqueira
Priority: Blocker
 Fix For: 3.2.1, 3.3.0

 Attachments: jst.txt, log3_debug.tar.gz, logs.tar.gz, logs2.tar.gz, 
 t5_aj.tar.gz, ZOOKEEPER-512.patch





[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-25 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747769#action_12747769
 ] 

Patrick Hunt commented on ZOOKEEPER-512:


I'm afraid that with this latest patch I'm still seeing behavior similar to what I was 
seeing previously.

This is when I drive the cluster hard: in this case 6 clients, each client connecting 
to each of the 5 servers, each session creating/getting/deleting a particular 
node inside a loop that runs every second (sleeps for 1 sec at end of loop).


 FLE election fails to elect leader
 --

 Key: ZOOKEEPER-512
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
 Project: Zookeeper
  Issue Type: Bug
  Components: quorum, server
Affects Versions: 3.2.0
Reporter: Patrick Hunt
Assignee: Flavio Paiva Junqueira
Priority: Blocker
 Fix For: 3.2.1, 3.3.0

 Attachments: jst.txt, log3_debug.tar.gz, logs.tar.gz, logs2.tar.gz, 
 t5_aj.tar.gz, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch





[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-24 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12747072#action_12747072
 ] 

Patrick Hunt commented on ZOOKEEPER-512:


I don't see any change in behavior; I still see similar issues as before.

Also, the patch fails to compile: close is declared to throw IOException, a 
checked exception. I had to wrap the call in a try/catch that logs a warning 
(log.warn).
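
For readers following along, the workaround described above can be sketched roughly as follows. This is not the code from the patch: the class name QuietClose and the helper closeAndWarn are made up for illustration, and java.util.logging stands in for whichever logger the real code uses.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of ZooKeeper or the attached patch.
class QuietClose {
    private static final Logger LOG = Logger.getLogger(QuietClose.class.getName());

    /**
     * Closes the resource. close() is declared to throw IOException, a checked
     * exception, so it has to be caught (or declared) for the caller to compile.
     * Returns true if close() succeeded, false if it threw and was logged.
     */
    public static boolean closeAndWarn(Closeable c) {
        try {
            c.close();
            return true;
        } catch (IOException e) {
            LOG.log(Level.WARNING, "Exception while closing", e);
            return false;
        }
    }

    public static void main(String[] args) {
        // A resource whose close() always fails, to exercise the catch branch.
        Closeable failing = () -> { throw new IOException("injected close failure"); };
        System.out.println("closed cleanly: " + closeAndWarn(failing));
    }
}
```

Catching and logging keeps the caller's signature unchanged, which matters when the call site is a method such as Thread.run() that cannot declare checked exceptions.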



[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-21 Thread Flavio Paiva Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745870#action_12745870
 ] 

Flavio Paiva Junqueira commented on ZOOKEEPER-512:
--

Two things:

1- I'm not sure what you've been searching for, so I don't have a pointer, but 
the behavior I expect is that if you get an IOException upon invoking a socket 
operation, then the operation won't be available after that. Am I not 
interpreting it correctly?
2- Visually inspecting the logs, I was able to count about 20 successful leader 
elections. In the previous set of logs, I think servers got stuck around 5, so 
I see improvement after you modified your fault injection. Also, according to 
server 5, a leader was elected successfully. Here is the tail of the log of 5: 

{noformat}
2009-08-20 13:43:56,636 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2185:quorump...@508] - LEADING
2009-08-20 13:43:56,636 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2185:zookeeperser...@160] - Created server
2009-08-20 13:43:56,643 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2185:files...@81] 
- Reading snapshot ./localhost:2185/data/version-2/snapshot.1c001f
2009-08-20 13:43:56,699 - INFO  
[QuorumPeer:/0:0:0:0:0:0:0:0:2185:filetxnsnap...@208] - Snapshotting: 1c001f
2009-08-20 13:43:56,844 - INFO  
[FollowerHandler-/127.0.0.1:55253:requestfailsintermitten...@91] - RECORD 
REQUEST OK
2009-08-20 13:43:56,845 - INFO  
[FollowerHandler-/127.0.0.1:55253:followerhand...@227] - Follower sid: 4 : info 
: org.apache.zookeeper.server.quorum.quorumpeer$quorumser...@1b1aa65
2009-08-20 13:43:56,845 - INFO  
[FollowerHandler-/127.0.0.1:55253:requestfailsintermitten...@91] - RECORD 
REQUEST OK
2009-08-20 13:43:56,845 - INFO  
[FollowerHandler-/127.0.0.1:55253:requestfailsintermitten...@91] - RECORD 
REQUEST OK
2009-08-20 13:43:56,846 - WARN  
[FollowerHandler-/127.0.0.1:55253:followerhand...@302] - Sending snapshot last 
zxid of peer is 0x1c001f  zxid of leader is 0x1d
2009-08-20 13:43:56,847 - INFO  
[FollowerHandler-/127.0.0.1:55253:requestfailsintermitten...@91] - RECORD 
REQUEST OK
2009-08-20 13:43:56,848 - INFO  
[FollowerHandler-/127.0.0.1:55253:requestfailsintermitten...@91] - RECORD 
REQUEST OK
2009-08-20 13:43:56,848 - INFO  
[FollowerHandler-/127.0.0.1:55254:requestfailsintermitten...@91] - RECORD 
REQUEST OK
2009-08-20 13:43:56,849 - INFO  
[FollowerHandler-/127.0.0.1:55254:followerhand...@227] - Follower sid: 2 : info 
: org.apache.zookeeper.server.quorum.quorumpeer$quorumser...@1ef9157
2009-08-20 13:43:56,849 - INFO  
[FollowerHandler-/127.0.0.1:55254:requestfailsintermitten...@91] - RECORD 
REQUEST OK
2009-08-20 13:43:56,850 - INFO  
[FollowerHandler-/127.0.0.1:55254:requestfailsintermitten...@91] - RECORD 
REQUEST OK
2009-08-20 13:43:56,850 - WARN  
[FollowerHandler-/127.0.0.1:55254:followerhand...@302] - Sending snapshot last 
zxid of peer is 0x1c001f  zxid of leader is 0x1d
2009-08-20 13:43:56,851 - INFO  
[FollowerHandler-/127.0.0.1:55254:requestfailsintermitten...@91] - RECORD 
REQUEST OK
2009-08-20 13:43:56,851 - WARN  [FollowerHandler-/127.0.0.1:55254:lea...@452] - 
Commiting zxid 0x1d from /127.0.0.1:3185 not first!
2009-08-20 13:43:56,852 - WARN  [FollowerHandler-/127.0.0.1:55254:lea...@454] - 
First is 0
2009-08-20 13:43:56,852 - INFO  
[FollowerHandler-/127.0.0.1:55254:requestfailsintermitten...@91] - RECORD 
REQUEST OK
2009-08-20 13:43:59,434 - INFO  [Thread-536:requestfailsintermitten...@120] - 
SOCKET REQUEST OK
2009-08-20 13:43:59,434 - INFO  [Thread-536:requestfailsintermitten...@120] - 
SOCKET REQUEST OK
2009-08-20 13:43:59,435 - INFO  [WorkerReceiver 
Thread:fastleaderelection$messenger$workerrecei...@254] - Sending new 
notification.
2009-08-20 13:43:59,435 - INFO  [Thread-535:requestfailsintermitten...@120] - 
SOCKET REQUEST OK
2009-08-20 13:43:59,846 - INFO  
[FollowerHandler-/127.0.0.1:55254:requestfailsintermitten...@91] - RECORD 
REQUEST OK
2009-08-20 13:43:59,848 - INFO  
[FollowerHandler-/127.0.0.1:55253:requestfailsintermitten...@91] - RECORD 
REQUEST OK
2009-08-20 13:44:00,765 - INFO  [NIOServerCxn.Factory:2185:nioserverc...@698] - 
Processing stat command from /127.0.0.1:38350
2009-08-20 13:44:00,766 - WARN  [NIOServerCxn.Factory:2185:nioserverc...@494] - 
Exception causing close of session 0x0 due to java.io.IOException: Responded to 
info probe
2009-08-20 13:44:00,766 - INFO  [NIOServerCxn.Factory:2185:nioserverc...@833] - 
closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected 
local=/127.0.0.1:2185 remote=/127.0.0.1:38350]
2009-08-20 13:44:00,846 - INFO  
[FollowerHandler-/127.0.0.1:55254:requestfailsintermitten...@91] - RECORD 
REQUEST OK
2009-08-20 13:44:00,848 - INFO  
[FollowerHandler-/127.0.0.1:55253:requestfailsintermitten...@91] - RECORD 
REQUEST OK
2009-08-20 13:44:01,847 - INFO  
[FollowerHandler-/127.0.0.1:55254:requestfailsintermitten...@91] - RECORD 

[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-21 Thread Flavio Paiva Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745872#action_12745872
 ] 

Flavio Paiva Junqueira commented on ZOOKEEPER-512:
--

Pat, could you run it again and switch log debug on for QuorumCnxManager, 
please?



[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-21 Thread Flavio Paiva Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745887#action_12745887
 ] 

Flavio Paiva Junqueira commented on ZOOKEEPER-512:
--

I just realized that there is a bug in the first comment I posted today. I 
wanted to say that: if you get an IOException upon invoking a socket operation, 
then the SOCKET won't be available after that. (I really miss the ability to 
edit comments.)



[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-21 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12746050#action_12746050
 ] 

Patrick Hunt commented on ZOOKEEPER-512:


I've been reading the Java API spec, for example:
http://java.sun.com/javase/6/docs/api/java/nio/channels/SocketChannel.html#read%28java.nio.ByteBuffer%29

There's nothing here (nor in the Socket docs) that I can find saying that an 
IOException thrown by the read method results in what you say you are 
expecting. Unless you can find otherwise, I don't think it's prudent to assume 
a particular behavior.

The quorum was definitely _not_ formed when I took the log snapshot; there was 
no active leader. Clients were not able to connect to any server in the 
cluster, and running stat on the command port resulted in "zookeeper server 
not running" being returned by all 5 servers (not the typical 
"... mode:follower" etc. stat result).

I'll re-run and attach with debug logs.



[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-20 Thread Flavio Paiva Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745372#action_12745372
 ] 

Flavio Paiva Junqueira commented on ZOOKEEPER-512:
--

I'm not convinced this is a bug. Right now it sounds to me that the problem is 
with the way you're injecting faults. More concretely, it sounds like some 
threads are getting an IOException, but the corresponding socket is not 
closing. As the recv and send workers come in pairs, if one dies and the other 
doesn't, we have a problem. At the same time, I believe the current code would 
eventually terminate a send/recv pair of workers if the socket closes. It is 
true, though, that the current code assumes that if RecvWorker catches an 
IOException when performing a socket operation, then the corresponding 
SendWorker will also catch an exception when trying to write to the socket. 
This is where I think your framework is broken, but please correct me if I'm 
missing anything.
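
To make the pairing assumption concrete, here is a minimal sketch of a send/recv worker pair in which the receiver does not rely on the sender hitting its own exception: when the receiver stops, it closes the socket and explicitly tells its paired sender to finish. This is an illustration only and not the QuorumCnxManager code; the class WorkerPairSketch, the finish() method, and the 100 ms poll interval are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; names and structure are not taken from ZooKeeper.
class WorkerPairSketch {

    static void closeQuietly(Socket s) {
        try { s.close(); } catch (IOException e) { /* nothing more to do */ }
    }

    static class SendWorker extends Thread {
        private final Socket sock;
        private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
        private volatile boolean running = true;

        SendWorker(Socket sock) { this.sock = sock; }

        // Called by the paired receiver so an idle sender does not linger.
        void finish() { running = false; interrupt(); }

        @Override
        public void run() {
            try {
                while (running) {
                    // An idle sender never touches the socket, so it would
                    // never see an IOException on its own.
                    byte[] msg = queue.poll(100, TimeUnit.MILLISECONDS);
                    if (msg != null) {
                        sock.getOutputStream().write(msg);
                    }
                }
            } catch (IOException | InterruptedException e) {
                // Either the write failed or finish() interrupted the wait.
            } finally {
                closeQuietly(sock);
            }
        }
    }

    static class RecvWorker extends Thread {
        private final Socket sock;
        private final SendWorker peer;

        RecvWorker(Socket sock, SendWorker peer) { this.sock = sock; this.peer = peer; }

        @Override
        public void run() {
            try {
                InputStream in = sock.getInputStream();
                while (in.read() != -1) { /* discard input in this sketch */ }
            } catch (IOException e) {
                // Read failed; fall through to the cleanup below.
            } finally {
                closeQuietly(sock); // do not assume the exception closed it
                peer.finish();      // take the paired sender down explicitly
            }
        }
    }

    /** Starts a pair over a loopback connection, drops the remote end, and
     *  reports whether both workers terminated within the join timeouts. */
    public static boolean bothWorkersStop() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket remote = new Socket(server.getInetAddress(), server.getLocalPort())) {
            Socket local = server.accept();
            SendWorker sw = new SendWorker(local);
            RecvWorker rw = new RecvWorker(local, sw);
            sw.start();
            rw.start();
            remote.close(); // the receiver sees end-of-stream (or an error)
            rw.join(2000);
            sw.join(2000);
            return !rw.isAlive() && !sw.isAlive();
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("both workers stopped: " + bothWorkersStop());
    }
}
```

Without the peer.finish() call, an idle sender in this sketch would keep polling its queue indefinitely, which is one way the number of send workers and recv workers could drift apart.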



[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-20 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745475#action_12745475
 ] 

Patrick Hunt commented on ZOOKEEPER-512:


Your explanation sounds reasonable, but I don't see anything in the Java 
socket{channel} APIs that talks about this. Perhaps I missed it. Do you have a 
pointer to something that talks about this? (I did some searches and couldn't 
find anything.) Basically, why should we assume that any IOException results 
in the socket being closed?
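
One concrete case where an IOException does not close the socket is a read timeout on a plain java.net.Socket: SocketTimeoutException is a subclass of IOException, and the Socket.setSoTimeout documentation states that the socket is still valid afterwards. The sketch below demonstrates just that case; it uses java.net.Socket rather than SocketChannel, and the class and method names are made up for the example.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo; not ZooKeeper code.
class TimeoutKeepsSocketOpen {

    /** Returns true if a read threw an IOException (a timeout) and the same
     *  socket was still open and able to read a byte afterwards. */
    public static boolean socketUsableAfterIOException() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {

            accepted.setSoTimeout(50); // short read timeout, in milliseconds
            boolean threw = false;
            try {
                accepted.getInputStream().read(); // nothing was sent yet
            } catch (SocketTimeoutException e) {
                threw = true; // an IOException subclass, yet the socket stays open
            }

            // The same socket still works: send one byte and read it back.
            client.getOutputStream().write(42);
            client.getOutputStream().flush();
            accepted.setSoTimeout(2000);
            int b = accepted.getInputStream().read();

            return threw && !accepted.isClosed() && b == 42;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("usable after IOException: " + socketUsableAfterIOException());
    }
}
```

This only shows that the general assumption does not hold for every IOException; it says nothing about which exceptions a SocketChannel read can throw while leaving the channel open.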



[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-20 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12745632#action_12745632
 ] 

Patrick Hunt commented on ZOOKEEPER-512:


Sorry, to be overly clear -- the same problem occurs in this case (close/throw) 
-- the quorum cannot be formed after some time.
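
For illustration, the "close/throw" variant of the fault injection could look roughly like the sketch below: at the 1/200 ratio from the issue description, the injector first closes the channel and then throws the IOException. The real faults are injected with an AspectJ aspect that is not shown in this thread, so the class FaultInjectorSketch, its methods, and the seeded Random are assumptions made for the example.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Random;

// Hypothetical stand-in for the AspectJ-based injector; not the real aspect.
class FaultInjectorSketch {
    static final float FAULT_RATE = 0.005f; // 1/200, as in the issue description

    private final Random rand;

    FaultInjectorSketch(long seed) {
        this.rand = new Random(seed); // seeded so runs are repeatable
    }

    /** Call before a read/write: occasionally closes the channel and throws. */
    public void maybeFail(Closeable channel) throws IOException {
        if (rand.nextFloat() <= FAULT_RATE) {
            channel.close();                       // "close" ...
            throw new IOException("FORCED FAIL");  // ... then "throw"
        }
    }

    /** Counts how many of the given number of operations get a forced failure. */
    public static int countFaults(long seed, int trials) {
        FaultInjectorSketch injector = new FaultInjectorSketch(seed);
        Closeable noop = () -> { }; // placeholder channel for the sketch
        int faults = 0;
        for (int i = 0; i < trials; i++) {
            try {
                injector.maybeFail(noop);
            } catch (IOException e) {
                faults++;
            }
        }
        return faults;
    }

    public static void main(String[] args) {
        System.out.println("faults in 200000 operations: " + countFaults(1L, 200000));
    }
}
```

Closing before throwing removes the ambiguity discussed earlier in the thread: once the channel is closed, the paired worker's next socket operation should fail as well.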
