[jira] Commented: (ZOOKEEPER-368) Observers

2009-11-10 Thread Flavio Paiva Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775947#action_12775947
 ] 

Flavio Paiva Junqueira commented on ZOOKEEPER-368:
--

Henry, good job so far. Please bear with me for a little longer:

# Could you please update the changes to the test files? Due to a few recently 
committed patches they don't apply to trunk any longer;
# Could you make sure to remove all unnecessary LOG statements? Some of them 
look like messages you used for your own debugging (they start with HNR) and 
others are commented out. I think I've seen a TODO comment as well;
# It sounds like this feature works with both majority and hierarchical 
quorums. Is that correct? Can I have observers with hierarchical quorums?

This might be a little late for this patch now, but for future patches that 
introduce features like this, it is probably a good idea to have a brief design 
document explaining changes to the protocol and to ensemble configuration.  


 Observers
 -

 Key: ZOOKEEPER-368
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-368
 Project: Zookeeper
  Issue Type: New Feature
  Components: quorum
Reporter: Flavio Paiva Junqueira
Assignee: Henry Robinson
 Attachments: obs-refactor.patch, observer-refactor.patch, observers 
 sync benchmark.png, observers.patch, ZOOKEEPER-368.patch, 
 ZOOKEEPER-368.patch, ZOOKEEPER-368.patch, ZOOKEEPER-368.patch, 
 ZOOKEEPER-368.patch, ZOOKEEPER-368.patch, ZOOKEEPER-368.patch


 Currently, all servers of an ensemble participate actively in reaching 
 agreement on the order of ZooKeeper transactions. That is, all followers 
 receive proposals, acknowledge them, and receive commit messages from the 
 leader. A leader issues commit messages once it receives acknowledgments from 
 a quorum of followers. For cross-colo operation, it would be useful to have a 
 third role: observer. Using Paxos terminology, observers are similar to 
 learners. An observer does not participate actively in the agreement step of 
 the atomic broadcast protocol. Instead, it only commits proposals that have 
 been accepted by some quorum of followers.
 One simple solution to implement observers is to have the leader forwarding 
 commit messages not only to followers but also to observers, and have 
 observers applying transactions according to the order followers agreed upon. 
 In the current implementation of the protocol, however, commit messages do 
 not carry their corresponding transaction payload because all servers 
 different from the leader are followers and followers receive such a payload 
 first through a proposal message. Just forwarding commit messages as they 
 currently are to an observer consequently is not sufficient. We have a couple 
 of options:
 1- Include the transaction payload along in commit messages to observers;
 2- Send proposals to observers as well.
 Number 2 is simpler to implement because it doesn't require changing the 
 protocol implementation, but it increases traffic slightly. The performance 
 impact due to such an increase might be insignificant, though.
 For scalability purposes, we may consider having followers also forwarding 
 commit messages to observers. With this option, observers can connect to 
 followers, and receive messages from followers. This choice is important to 
 avoid increasing the load on the leader with the number of observers. 
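
 A minimal sketch of option 1 above, in Java, may make the message flow 
 easier to follow. Everything here is hypothetical (the class names 
 ObserverSketch, Follower, Observer and Leader, and the method 
 onCommitWithPayload, are not taken from the ZooKeeper code base), and it 
 assumes the leader counts itself as one voter. Followers acknowledge 
 proposals and later receive a bare commit; observers never vote and only 
 receive the commit together with its payload.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from the ZooKeeper sources.
class ObserverSketch {

    /** A voting server: buffers proposals and applies them on commit. */
    static class Follower {
        boolean up = true; // set to false to simulate an unreachable follower
        final Map<Long, String> pending = new HashMap<>();
        final List<String> applied = new ArrayList<>();

        /** Returns true if the proposal was stored and acknowledged. */
        boolean onProposal(long zxid, String payload) {
            if (!up) {
                return false;
            }
            pending.put(zxid, payload);
            return true;
        }

        /** Commit messages carry no payload; it was delivered by the proposal. */
        void onCommit(long zxid) {
            String payload = pending.remove(zxid);
            if (payload != null) {
                applied.add(payload);
            }
        }
    }

    /** A non-voting server: it never acknowledges anything. */
    static class Observer {
        final List<String> applied = new ArrayList<>();

        /** Option 1: the commit sent to an observer includes the payload. */
        void onCommitWithPayload(long zxid, String payload) {
            applied.add(payload);
        }
    }

    static class Leader {
        final List<Follower> followers = new ArrayList<>();
        final List<Observer> observers = new ArrayList<>();
        private long nextZxid = 1;

        /** Returns true if a majority of voters acknowledged the proposal. */
        boolean broadcast(String payload) {
            long zxid = nextZxid++;
            int acks = 1; // assumption: the leader counts its own vote
            for (Follower f : followers) {
                if (f.onProposal(zxid, payload)) {
                    acks++;
                }
            }
            int voters = followers.size() + 1; // observers are not voters
            if (acks <= voters / 2) {
                return false; // no quorum, nothing is committed
            }
            for (Follower f : followers) {
                f.onCommit(zxid);
            }
            for (Observer o : observers) {
                o.onCommitWithPayload(zxid, payload);
            }
            return true;
        }
    }

    public static void main(String[] args) {
        Leader leader = new Leader();
        leader.followers.add(new Follower());
        leader.followers.add(new Follower());
        Observer observer = new Observer();
        leader.observers.add(observer);
        leader.broadcast("create /a");
        System.out.println(observer.applied);
    }
}
```

 Note that adding observers in this sketch changes neither the number of 
 voters nor the number of acknowledgments required, which is the property 
 that makes the role attractive for cross-colo operation.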

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (ZOOKEEPER-368) Observers

2009-11-10 Thread Patrick Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775956#action_12775956
 ] 

Patrick Hunt commented on ZOOKEEPER-368:


Henry, I don't see any docs for this in src/docs. I suggest that you start a 
new document (a new XML file) for this feature. It should explain why and how 
(to run) at the very least, so that potential users can come up to speed.

Flavio, could you also review the comments on this JIRA as part of your commit 
review? We should make sure that either all of the issues are addressed or, at 
the very least, new JIRAs are created for the pending items (Henry, could you 
do this?), so that we don't lose the comments, concerns, and issues that have 
been identified previously. This is a major new and visible feature, so I 
think it warrants the extra time and effort.




[jira] Commented: (ZOOKEEPER-368) Observers

2009-11-10 Thread Henry Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776004#action_12776004
 ] 

Henry Robinson commented on ZOOKEEPER-368:
--

Hi Flavio / Patrick - 

Thanks for your comments! 

Design document - there's a brief writeup at 
http://wiki.apache.org/hadoop/ZooKeeper/Observers which very broadly covers the 
design. I will update it when I get a moment to do so. 

User documentation - yes, will do, already on my to do list. There is a section 
in the above wiki page that will be a good start. 

Quorums - yes, it should work with all mechanisms. The only caveat is that it 
only works with the simple LeaderElection protocol, which presumes a majority 
quorum approach (there are lines where votes > quorum.size() / 2 is hardcoded 
rather than using the verifier - I think this is the source of at least one of 
the to-dos). 
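
To illustrate why a hardcoded majority test and a pluggable verifier can 
disagree, here is a rough Java sketch. The names (QuorumCheckSketch, Verifier, 
Majority, Hierarchical) are made up for this example and are not the 
interfaces in the patch. The hierarchical rule is simplified to equal weights: 
a quorum is a majority of groups, each contributing a majority of its own 
members. The ack set is assumed to contain voting servers only, so observers 
never appear in it.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and rules are simplified for illustration.
class QuorumCheckSketch {

    interface Verifier {
        boolean containsQuorum(Set<Long> acks);
    }

    /** Equivalent to the hardcoded "votes > quorum.size() / 2" test. */
    static class Majority implements Verifier {
        private final int voters;

        Majority(int voters) {
            this.voters = voters;
        }

        public boolean containsQuorum(Set<Long> acks) {
            return acks.size() > voters / 2;
        }
    }

    /** Majority of groups, each with a majority of its members (equal weights). */
    static class Hierarchical implements Verifier {
        private final Map<Long, Integer> groupOf; // server id -> group id

        Hierarchical(Map<Long, Integer> groupOf) {
            this.groupOf = groupOf;
        }

        public boolean containsQuorum(Set<Long> acks) {
            Map<Integer, Integer> groupSize = new HashMap<>();
            Map<Integer, Integer> groupAcks = new HashMap<>();
            for (Map.Entry<Long, Integer> e : groupOf.entrySet()) {
                groupSize.merge(e.getValue(), 1, Integer::sum);
                if (acks.contains(e.getKey())) {
                    groupAcks.merge(e.getValue(), 1, Integer::sum);
                }
            }
            int satisfiedGroups = 0;
            for (Map.Entry<Integer, Integer> e : groupSize.entrySet()) {
                if (groupAcks.getOrDefault(e.getKey(), 0) > e.getValue() / 2) {
                    satisfiedGroups++;
                }
            }
            return satisfiedGroups > groupSize.size() / 2;
        }
    }

    /** Servers 1..9 split into groups {1,2,3}, {4,5,6}, {7,8,9}. */
    static Map<Long, Integer> threeGroupsOfThree() {
        Map<Long, Integer> groupOf = new HashMap<>();
        for (long id = 1; id <= 9; id++) {
            groupOf.put(id, (int) ((id - 1) / 3));
        }
        return groupOf;
    }

    static Set<Long> ids(Long... ids) {
        return new HashSet<>(Arrays.asList(ids));
    }

    public static void main(String[] args) {
        Set<Long> acks = ids(1L, 2L, 4L, 5L);
        System.out.println(new Majority(9).containsQuorum(acks));
        System.out.println(new Hierarchical(threeGroupsOfThree()).containsQuorum(acks));
    }
}
```

With nine servers in three groups, acks from servers 1, 2, 4 and 5 fail the 
majority test (4 is not more than 9 / 2) but satisfy the simplified 
hierarchical rule, so code that hardcodes the majority test would reach a 
different answer than the configured verifier.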

Debug messages: ugh, sorry about that. Will update the patch to build against 
trunk shortly and remove those messages. 




[jira] Updated: (ZOOKEEPER-472) Making DataNode not instantiate a HashMap when the node is ephmeral

2009-11-10 Thread Patrick Hunt (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Hunt updated ZOOKEEPER-472:
---

Status: Patch Available  (was: Open)

Is this ready? I see some updates, so I'm putting it into the patch queue. 
Reviewer, please be sure all the comments are addressed.

 Making DataNode not instantiate a HashMap when the node is ephmeral
 ---

 Key: ZOOKEEPER-472
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-472
 Project: Zookeeper
  Issue Type: Improvement
  Components: server
Affects Versions: 3.2.0, 3.1.1
Reporter: Erik Holstad
Assignee: Erik Holstad
Priority: Minor
 Fix For: 3.3.0

 Attachments: zookeeper-472.patch, zookeeper-472.patch, 
 zookeeper-472.patch, zookeeper-472.patch


 Looking at the code, there is the overhead of a HashSet object for that 
 node's children, even though the node might be an ephemeral node and so 
 cannot have children.
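
 One way to avoid that overhead is to allocate the child set lazily. The Java 
 sketch below only illustrates the idea; it is not the attached patch. The 
 class LazyChildrenNode and its methods are hypothetical, and it assumes that 
 an ephemeralOwner of 0 marks a persistent node.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of lazy child-set allocation; not the real DataNode.
class LazyChildrenNode {
    private final long ephemeralOwner; // assumption: 0 means "not ephemeral"
    private Set<String> children;      // stays null until the first child is added

    LazyChildrenNode(long ephemeralOwner) {
        this.ephemeralOwner = ephemeralOwner;
    }

    boolean isEphemeral() {
        return ephemeralOwner != 0;
    }

    /** Adds a child, allocating the set only when it is first needed. */
    synchronized boolean addChild(String name) {
        if (isEphemeral()) {
            throw new IllegalStateException("ephemeral nodes cannot have children");
        }
        if (children == null) {
            children = new HashSet<>();
        }
        return children.add(name);
    }

    /** Returns a read-only snapshot; empty when no set was ever allocated. */
    synchronized Set<String> getChildren() {
        if (children == null) {
            return Collections.emptySet();
        }
        return Collections.unmodifiableSet(new HashSet<>(children));
    }

    /** Exposed only so the sketch can show that nothing was allocated. */
    synchronized boolean hasAllocatedChildSet() {
        return children != null;
    }

    public static void main(String[] args) {
        LazyChildrenNode ephemeral = new LazyChildrenNode(42L);
        System.out.println(ephemeral.hasAllocatedChildSet());
    }
}
```

 Callers of getChildren() never see null, so code that iterates over children 
 does not need to change; only the allocation is deferred.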




[jira] Commented: (ZOOKEEPER-368) Observers

2009-11-10 Thread Henry Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776233#action_12776233
 ] 

Henry Robinson commented on ZOOKEEPER-368:
--

I just put up a set of notes on the patch on the wiki here: 
http://wiki.apache.org/hadoop/ZooKeeper/Observers/ReviewGuide to help make the 
review a little less painful - although non-comprehensive, it should help 
explain most of the major code changes. 

An updated patch will follow very shortly. 






[jira] Updated: (ZOOKEEPER-368) Observers

2009-11-10 Thread Henry Robinson (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henry Robinson updated ZOOKEEPER-368:
-

Attachment: ZOOKEEPER-368.patch

Updated patch - removed some erroneous debugging logs, made a slight 
improvement to one test. 

Please see review guide at 
http://wiki.apache.org/hadoop/ZooKeeper/Observers/ReviewGuide - comments on any 
further tests required would be particularly welcome. 

 Observers
 -

 Key: ZOOKEEPER-368
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-368
 Project: Zookeeper
  Issue Type: New Feature
  Components: quorum
Reporter: Flavio Paiva Junqueira
Assignee: Henry Robinson
 Attachments: obs-refactor.patch, observer-refactor.patch, observers 
 sync benchmark.png, observers.patch, ZOOKEEPER-368.patch, 
 ZOOKEEPER-368.patch, ZOOKEEPER-368.patch, ZOOKEEPER-368.patch, 
 ZOOKEEPER-368.patch, ZOOKEEPER-368.patch, ZOOKEEPER-368.patch, 
 ZOOKEEPER-368.patch





[jira] Created: (ZOOKEEPER-575) remove System.exit calls to make the server more container friendly

2009-11-10 Thread Patrick Hunt (JIRA)
remove System.exit calls to make the server more container friendly
---

 Key: ZOOKEEPER-575
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-575
 Project: Zookeeper
  Issue Type: Improvement
  Components: server
Reporter: Patrick Hunt
 Fix For: 3.3.0


There are a handful of places left in the code that still use System.exit. We 
should remove these to make the server more container friendly.

There are some legitimate places for the exits. Those in *Main.java, for 
example, should be fine, since these are the command-line main routines. 
Containers should be embedding code that runs just below this layer (or we 
should refactor so that they can).

The tricky bit is ensuring that the server shuts down when an unrecoverable 
error occurs; afaik these are the locations where we still have System.exit 
calls.
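
One possible shape for this, sketched in Java under the assumption that the 
server reports unrecoverable errors through a callback instead of exiting 
directly. The names ShutdownSketch, FatalErrorHandler and Server are 
hypothetical and do not exist in the code base. Only the command-line main 
routine would install a handler that calls System.exit; a container would 
install its own.

```java
// Hypothetical sketch; none of these types exist in the ZooKeeper sources.
class ShutdownSketch {

    /** Invoked by the server when it hits an unrecoverable error. */
    interface FatalErrorHandler {
        void onFatalError(String reason, Throwable cause);
    }

    static class Server {
        private final FatalErrorHandler handler;
        private volatile boolean running = true;

        Server(FatalErrorHandler handler) {
            this.handler = handler;
        }

        boolean isRunning() {
            return running;
        }

        /** Releases resources; safe to call more than once. */
        void shutdown() {
            running = false;
        }

        /** Replaces a direct System.exit call inside server code. */
        void fail(String reason, Throwable cause) {
            shutdown();                          // stop the server first
            handler.onFatalError(reason, cause); // then let the embedder decide
        }
    }

    /** Command-line entry point: exiting the JVM is legitimate here. */
    public static void main(String[] args) {
        Server server = new Server((reason, cause) -> {
            System.err.println("fatal: " + reason);
            System.exit(1);
        });
        System.out.println("running: " + server.isRunning());
        server.shutdown();
    }
}
```

A container would pass a handler that, for example, logs the error and 
undeploys the component, so the server still stops but the hosting JVM 
survives.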




Re: Possible race in LETest.java

2009-11-10 Thread Patrick Hunt
Closing the loop - what's the status on this? Can one of you open a 
JIRA and provide a patch for this?


Thanks,

Patrick

Flavio Junqueira wrote:
Hi Henry, Apologies for the delay. Your observation sounds right to 
me. Here is how I'm reading it; let me know if it makes sense.


If everyone votes for 3 in the second round and 3 has crashed, then in 
countVotes we will remove all votes to 3 and there will be no vote left. 
In such a case, there will be no winner as a result of the call to 
countVotes and lookForLeader won't change the current vote 
(LeaderElection.java:201). This is a situation in which we are stuck.


Does it sound reasonable to add an else to the if statement of 
LeaderElection.java:201 to reset the vote? This modification would 
implement resetting the vote when countVotes returns no winner, which 
should happen only when the replica itself votes for a dead leader.
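
To make the proposal concrete, here is a simplified Java sketch of the 
scenario. It is not the code in LeaderElection.java: the class 
VoteResetSketch, the method signatures, the use of -1 for "no winner", and 
the tie-break on the higher id are all assumptions made for illustration. 
The resetOnNoWinner flag stands in for the proposed else branch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; simplified from the behaviour described in this thread.
class VoteResetSketch {

    static final long NO_WINNER = -1;

    /**
     * Tallies votes, ignoring votes cast for servers that have not been
     * heard from. Returns the candidate with the most votes (higher id wins
     * a tie, an assumption of this sketch), or NO_WINNER if no vote counts.
     */
    static long countVotes(Map<Long, Long> votes, Set<Long> heardFrom) {
        Map<Long, Integer> tally = new HashMap<>();
        for (long candidate : votes.values()) {
            if (heardFrom.contains(candidate)) {
                tally.merge(candidate, 1, Integer::sum);
            }
        }
        long winner = NO_WINNER;
        int best = 0;
        for (Map.Entry<Long, Integer> e : tally.entrySet()) {
            int count = e.getValue();
            long id = e.getKey();
            if (count > best || (count == best && id > winner)) {
                best = count;
                winner = id;
            }
        }
        return winner;
    }

    /**
     * Picks the vote for the next round. Without the reset, a server keeps
     * voting for a dead candidate forever; with it, the server falls back
     * to voting for itself.
     */
    static long nextVote(long myId, long currentVote, Map<Long, Long> votes,
                         Set<Long> heardFrom, boolean resetOnNoWinner) {
        long winner = countVotes(votes, heardFrom);
        if (winner != NO_WINNER) {
            return winner;
        }
        return resetOnNoWinner ? myId : currentVote; // the proposed else branch
    }

    public static void main(String[] args) {
        Map<Long, Long> votes = new HashMap<>();
        votes.put(1L, 3L); // server 1 votes for 3
        votes.put(2L, 3L); // server 2 votes for 3
        Set<Long> heardFrom = votes.keySet(); // 3 has crashed and is silent
        System.out.println(nextVote(1L, 3L, votes, heardFrom, false));
        System.out.println(nextVote(1L, 3L, votes, heardFrom, true));
    }
}
```

In the stuck case every counted vote is for the crashed server, so the tally 
is empty. Once 1 and 2 reset to themselves, the following round contains 
votes for live servers only and the election can make progress.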


-Flavio

On Oct 28, 2009, at 7:44 AM, Henry Robinson wrote:

[ Sending this direct since the Apache mailserver is rejecting my 
e-mails at the moment ]


As I understand it, 1 and 2 receive a vote for 3 in the first round, 
which causes them to vote for 3 in the second round. So in the second 
round, all votes cast are for 3. But 3 has died, so all votes for it 
are discounted. 1 and 2 continue to vote for 3 ad infinitum, never 
resetting their vote.


Does this sound plausible, or am I missing something?

cheers,
Henry

On Tue, Oct 27, 2009 at 3:48 PM, Flavio Junqueira f...@yahoo-inc.com 
wrote:
Hi Henry, I don't understand how 1 and 2 do not end up electing 2 in 
your situation. If they exclude 3 in countVotes, then countVotes will 
end up returning 2 and not 3, assuming there is a vote for 2. What am 
I missing?


The problem with QuorumPeer you're pointing at was also an issue with 
the FLE tests, and I couldn't see an easy way around it other than 
timing out and restarting leader election.


Cheers,
-Flavio


On Oct 27, 2009, at 6:35 AM, Henry Robinson wrote:

I've been working on adding a TCPResponderThread to the leader election
process so that if a deployment needs to be TCP only, it can be and still
use all election types. Testing this has exposed what might be a race
condition in the leader election code that prevents a leader from being
elected.

Here's the behaviour I see in LETest occasionally. With three nodes (reduced
from 30 for ease of debugging), node 3 gets elected before either node 1 or
node 2 finishes its election (there is one round where each node learns that 3
has the highest id, and then 3 completes its second round by receiving votes
for itself from 1 and 2, but 1 and 2 do not receive votes from 3).

Now 3 is killed by the test harness. 1 and 2 are still voting for it, but
every time they try, the vote tally excludes 3 since it hasn't been heard
from. They then spin round the voting process, unable to reset their vote. I
expect that the heartbeat mechanism in a running QuorumPeer takes care of
this when the leader is lost, but the associated QuorumPeers aren't running.

If this is the case, then there is a simple fix: reset the nodes' votes to
themselves if they are voting for a node that hasn't been heard from. I
don't know why using TCP instead of UDP for the responder thread is
exacerbating this (and we can't rule out my introducing a bug :)); but as
it's a race condition the different timings associated with waiting on a TCP
socket might just be enough to expose the issue.

Can someone verify this might be possible / figure out what I missed?

cheers,
Henry