date:20090826

[jira] Updated: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-26 Thread Flavio Paiva Junqueira (JIRA)

[
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Flavio Paiva Junqueira updated ZOOKEEPER-512:
-

Attachment: ZOOKEEPER-512.patch

I've found a corner case that I was not expecting, but it was an easy fix.
Basically, receiveConnection was passing an invalid server identifier to
connectOne, which threw an NPE that propagated to the Listener. The Listener
would consequently die and stop participating in leader election. The fix is
simply not to proceed with the connection when that happens.
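
As a rough illustration of the kind of guard described above (not the actual
patch), the sketch below rejects an unknown server identifier before calling
connectOne. The class name, the peers map, and the connectAttempts counter are
hypothetical and only stand in for the real connection manager.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the connection manager; not the actual ZooKeeper class.
class ConnectionGuardSketch {
    // Known peers: server id -> address string (assumed layout).
    private final Map<Long, String> peers = new HashMap<>();
    int connectAttempts = 0;

    ConnectionGuardSketch(Map<Long, String> peers) {
        this.peers.putAll(peers);
    }

    // Returns true if a connection attempt was made, false if the id was rejected.
    boolean receiveConnection(long sid) {
        if (!peers.containsKey(sid)) {
            // Unknown identifier: drop this connection instead of letting a
            // NullPointerException escape into the listener thread.
            return false;
        }
        connectOne(sid);
        return true;
    }

    private void connectOne(long sid) {
        String address = peers.get(sid);
        // Without the guard above, address would be null for an unknown sid
        // and this dereference would throw a NullPointerException.
        if (address.length() > 0) {
            connectAttempts++;
        }
    }
}
```

With a single known peer, receiveConnection returns true for that peer's id and
false for any other id, and the caller keeps running in both cases.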

This patch is working fine for my test. It is important to note, though, that
if we overwhelm the servers (clients are hammering the system and connections
are failing), there will be periods with no leader. The important invariant to
satisfy is that the system converges to a live state once it stabilizes. In my
tests, I observe periods with no leader while clients are hammering the servers
with requests, but the servers converge to a leader soon after the clients
stop. Of course, with no injected faults, the clients' requests are executed
just fine (there is always a leader). This is the behavior I expect to see.

At the same time, although I think it was a good idea to test such an extreme
case, I'm still not convinced that this test is realistic. It would be great if
we could model the cases this fault injection is trying to emulate to make sure
they are really expected cases.

Also, I don't see a good way of introducing a unit test for such extreme cases.
In fact, I'm not even sure it would make sense to test only leader election
under such extreme conditions.

FLE election fails to elect leader
--

Key: ZOOKEEPER-512
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-512
Project: Zookeeper
Issue Type: Bug
Components: quorum, server
Affects Versions: 3.2.0
Reporter: Patrick Hunt
Assignee: Flavio Paiva Junqueira
Priority: Blocker
Fix For: 3.2.1, 3.3.0

Attachments: jst.txt, log3_debug.tar.gz, logs.tar.gz, logs2.tar.gz,
t5_aj.tar.gz, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch, ZOOKEEPER-512.patch,
ZOOKEEPER-512.patch

I was doing some fault injection testing of 3.2.1 with ZOOKEEPER-508 patch
applied and noticed that after some time the ensemble failed to re-elect a
leader.
See the attached log files for a 5-member ensemble; typically server 5 is the
leader.
Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes
elapse.
environment:
I was doing fault injection testing using AspectJ. The faults are injected
into SocketChannel read/write; I throw exceptions randomly at a 1/200 ratio
(rand.nextFloat() <= .005 => throw IOException).
You can see when a fault is injected in the log via:
2009-08-19 16:57:09,568 - INFO [Thread-74:readrequestfailsintermitten...@38]
- READPACKET FORCED FAIL
vs a read/write that didn't force fail:
2009-08-19 16:57:09,568 - INFO [Thread-74:readrequestfailsintermitten...@41]
- READPACKET OK
Otherwise it is standard code/config (straight FLE quorum with 5 members).
Also see the attached jstack trace, which is for one of the servers. Notice in
particular that the number of send workers != the number of recv workers.
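
The 1/200 fault ratio described in the report can be sketched in plain Java as
follows. This is only an approximation of the idea: the original used an
AspectJ aspect around socket channel reads and writes, whereas the class and
method names here are hypothetical and the fault check is called explicitly.

```java
import java.io.IOException;
import java.util.Random;

// Hypothetical helper; the report used an AspectJ aspect rather than explicit calls.
class RandomFaultSketch {
    static final float FAIL_RATIO = 0.005f; // 1/200, as stated in the report

    private final Random rand;

    RandomFaultSketch(long seed) {
        this.rand = new Random(seed);
    }

    // Call before a read or write; throws to simulate a network failure.
    void maybeFail(String op) throws IOException {
        if (rand.nextFloat() <= FAIL_RATIO) {
            throw new IOException(op + " FORCED FAIL");
        }
    }

    // Counts how many of n simulated operations were forced to fail.
    int countFailures(int n) {
        int failures = 0;
        for (int i = 0; i < n; i++) {
            try {
                maybeFail("READPACKET");
            } catch (IOException e) {
                failures++;
            }
        }
        return failures;
    }
}
```

Over a large number of operations the failure count should be close to 0.5% of
the total, so even a modest request rate produces a steady stream of faults.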

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

zookeeper trunk build

2009-08-26 Thread Giridharan Kesavan

The ZooKeeper trunk build has moved from vesta.apache.org to an
h8.grid.sp2.yahoo.net machine.

Thanks,
Giri

[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-26 Thread Benjamin Reed (JIRA)

[
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748018#action_12748018
]

Benjamin Reed commented on ZOOKEEPER-512:
-

agreed. i think the problem is that under high load we don't have a period of
error free operation. i think it is ok to generate errors randomly as we are
doing, but we should have periods of error free operation so that things can
settle down.
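
One way to get random errors with error-free periods is to alternate a noisy
window with a quiet window. The sketch below is only an assumption about how
such a schedule could look; it counts operations rather than wall-clock time
to stay deterministic, and all names and parameters are hypothetical.

```java
import java.util.Random;

// Hypothetical fault schedule: faults may fire only inside the noisy window.
class QuietPeriodFaultSketch {
    private final Random rand;
    private final int noisyOps; // operations per cycle in which faults may fire
    private final int quietOps; // operations per cycle guaranteed fault-free
    private final float ratio;  // fault probability inside the noisy window
    private long opCount = 0;

    QuietPeriodFaultSketch(long seed, int noisyOps, int quietOps, float ratio) {
        this.rand = new Random(seed);
        this.noisyOps = noisyOps;
        this.quietOps = quietOps;
        this.ratio = ratio;
    }

    // Returns true if the current operation should be forced to fail.
    boolean shouldFail() {
        long pos = opCount++ % (noisyOps + quietOps);
        if (pos >= noisyOps) {
            return false; // quiet window: let the ensemble settle down
        }
        return rand.nextFloat() <= ratio;
    }
}
```

With a quiet window long enough for an election to complete, a test could then
check that a leader exists at the end of each quiet window.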


[jira] Updated: (ZOOKEEPER-518) DEBUG message for outstanding proposals in leader should be moved to trace.

2009-08-26 Thread Patrick Hunt (JIRA)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-518:
---

Component/s: server
Affects Version/s: 3.1.1, 3.2.0

 DEBUG message for outstanding proposals in leader should be moved to trace.
 ---

 Key: ZOOKEEPER-518
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-518
 Project: Zookeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.1.1, 3.2.0
Reporter: Mahadev konar
Assignee: Patrick Hunt
 Fix For: 3.2.1, 3.3.0


 this is the code in Leader.java
 {code}
 if (LOG.isDebugEnabled()) {
     LOG.debug("Ack zxid: 0x" + Long.toHexString(zxid));
     for (Proposal p : outstandingProposals.values()) {
         long packetZxid = p.packet.getZxid();
         LOG.debug("outstanding proposal: 0x"
                 + Long.toHexString(packetZxid));
     }
     LOG.debug("outstanding proposals all");
 }
 {code}
 We should move this debug output to trace, since it causes really high
 latencies in response times from ZooKeeper servers when folks use DEBUG
 logging for servers.
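
 A minimal sketch of the proposed change, assuming a logger that exposes
 isTraceEnabled() and trace(). MiniLog below is a hypothetical stand-in used
 only to keep the example self-contained, and the map of outstanding proposals
 is simplified to zxid keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class TraceGuardSketch {
    enum Level { TRACE, DEBUG, INFO }

    // Minimal stand-in logger (hypothetical); records trace lines only at TRACE.
    static class MiniLog {
        final Level threshold;
        final List<String> lines = new ArrayList<>();

        MiniLog(Level threshold) {
            this.threshold = threshold;
        }

        boolean isTraceEnabled() {
            return threshold == Level.TRACE;
        }

        void trace(String msg) {
            if (isTraceEnabled()) {
                lines.add(msg);
            }
        }
    }

    // Dumps outstanding proposals only when TRACE is on, so DEBUG runs skip the loop.
    static void logOutstanding(MiniLog log, long zxid, Map<Long, String> outstanding) {
        if (log.isTraceEnabled()) {
            log.trace("Ack zxid: 0x" + Long.toHexString(zxid));
            for (long packetZxid : outstanding.keySet()) {
                log.trace("outstanding proposal: 0x" + Long.toHexString(packetZxid));
            }
            log.trace("outstanding proposals all");
        }
    }
}
```

 With the threshold at DEBUG the loop over outstanding proposals is never
 entered, which is the latency saving the issue is after.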


[jira] Updated: (ZOOKEEPER-518) DEBUG message for outstanding proposals in leader should be moved to trace.

2009-08-26 Thread Patrick Hunt (JIRA)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-518:
---

Fix Version/s: 3.2.1
 Assignee: Patrick Hunt  (was: Mahadev konar)


[jira] Commented: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-26 Thread Patrick Hunt (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748020#action_12748020
 ] 

Patrick Hunt commented on ZOOKEEPER-512:


I'm seeing 2 cases:

1) the entire quorum is unstable because clients are driving load and causing
many (simulated) network failures. In this case I agree.

2) but I also see the case where the quorum is stable, but one server has been
orphaned from the group. It is never able to reconnect, even though the
clients are stopped and the quorum in general is stable.

Eventually 3 servers become orphaned (out of 5), in which case the quorum will
never re-form, regardless of whether clients are running. I don't agree that
this is ok.
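
The arithmetic behind the second case can be sketched as below, assuming a
simple majority quorum (strictly more than half of the ensemble). The class and
method names are hypothetical.

```java
// Hypothetical helper illustrating the majority rule for a fixed-size ensemble.
class MajorityQuorumSketch {
    // A simple majority quorum needs strictly more than half of the ensemble.
    static boolean hasQuorum(int ensembleSize, int connected) {
        return connected > ensembleSize / 2;
    }
}
```

For a 5-member ensemble, one orphaned server leaves 4 connected and the quorum
holds, but three orphaned servers leave only 2 connected, which is below the 3
needed, so no leader can be elected until at least one of them reconnects.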



[jira] Updated: (ZOOKEEPER-518) DEBUG message for outstanding proposals in leader should be moved to trace.

2009-08-26 Thread Patrick Hunt (JIRA)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-518:
---

Status: Patch Available  (was: Open)


[jira] Updated: (ZOOKEEPER-518) DEBUG message for outstanding proposals in leader should be moved to trace.

2009-08-26 Thread Patrick Hunt (JIRA)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-518:
---

Attachment: ZOOKEEPER-518.patch

this patch changes from debug to trace, that is all. I expect the qabot to fail 
(no test changed)


[jira] Updated: (ZOOKEEPER-512) FLE election fails to elect leader

2009-08-26 Thread Patrick Hunt (JIRA)


 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-512:
---

Fix Version/s: (was: 3.2.1)

