[jira] Commented: (ZOOKEEPER-368) Observers
[ https://issues.apache.org/jira/browse/ZOOKEEPER-368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775947#action_12775947 ]

Flavio Paiva Junqueira commented on ZOOKEEPER-368:
--

Henry, good job so far. Please bear with me for a little longer:

# Could you please update the changes to the test files? Due to a few recently committed patches, they no longer apply to trunk;
# Could you make sure to remove all unnecessary LOG statements? Some of them look like messages you used for your own debugging (they start with HNR) and others are commented out. I think I've seen a TODO comment as well;
# It sounds like this feature works with both majority and hierarchical quorums. Is that correct? Can I have observers with hierarchical quorums?

This might be a little late for this patch now, but for future patches that introduce features like this, it is probably a good idea to have a brief design document explaining changes to the protocol and to ensemble configuration.

Observers

Key: ZOOKEEPER-368
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-368
Project: Zookeeper
Issue Type: New Feature
Components: quorum
Reporter: Flavio Paiva Junqueira
Assignee: Henry Robinson
Attachments: obs-refactor.patch, observer-refactor.patch, observers sync benchmark.png, observers.patch, ZOOKEEPER-368.patch, ZOOKEEPER-368.patch, ZOOKEEPER-368.patch, ZOOKEEPER-368.patch, ZOOKEEPER-368.patch, ZOOKEEPER-368.patch, ZOOKEEPER-368.patch

Currently, all servers of an ensemble participate actively in reaching agreement on the order of ZooKeeper transactions. That is, all followers receive proposals, acknowledge them, and receive commit messages from the leader. A leader issues commit messages once it receives acknowledgments from a quorum of followers. For cross-colo operation, it would be useful to have a third role: observer. Using Paxos terminology, observers are similar to learners.

An observer does not participate actively in the agreement step of the atomic broadcast protocol. Instead, it only commits proposals that have been accepted by some quorum of followers.

One simple way to implement observers is to have the leader forward commit messages not only to followers but also to observers, and have observers apply transactions in the order the followers agreed upon. In the current implementation of the protocol, however, commit messages do not carry their corresponding transaction payload, because all servers other than the leader are followers, and followers receive the payload first through a proposal message. Consequently, just forwarding commit messages as they currently are to an observer is not sufficient. We have a couple of options:

1- Include the transaction payload in commit messages to observers;
2- Send proposals to observers as well.

Number 2 is simpler to implement because it doesn't require changing the protocol implementation, but it increases traffic slightly. The performance impact of such an increase might be insignificant, though.

For scalability purposes, we may consider having followers also forward commit messages to observers. With this option, observers can connect to followers and receive messages from followers. This choice is important to avoid increasing the load on the leader as the number of observers grows.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
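To make option 2 in the description above concrete, here is a minimal sketch, assuming a heavily simplified model: the leader sends every proposal to followers and observers alike, counts acknowledgments only from followers, and then sends a payload-free commit to everyone. All class and method names (`ObserverSketch`, `Peer`, `Leader`, `broadcast`) are hypothetical and are not taken from the ZooKeeper code base or from the attached patches; the leader's own vote and all networking are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not ZooKeeper's actual classes.
class ObserverSketch {

    /** A non-leader server: a voting follower or a non-voting observer. */
    static class Peer {
        final boolean voting;                               // true = follower, false = observer
        final Map<Long, String> pending = new HashMap<>();  // zxid -> payload from proposals
        final List<String> applied = new ArrayList<>();     // transactions applied, in commit order

        Peer(boolean voting) {
            this.voting = voting;
        }

        /** Buffers the payload; returns true only if this peer acknowledges (i.e. votes). */
        boolean onProposal(long zxid, String payload) {
            pending.put(zxid, payload);
            return voting;
        }

        /** Commit messages carry no payload, so it is looked up by zxid. */
        void onCommit(long zxid) {
            String payload = pending.remove(zxid);
            if (payload != null) {
                applied.add(payload);
            }
        }
    }

    static class Leader {
        final List<Peer> peers;

        Leader(List<Peer> peers) {
            this.peers = peers;
        }

        /** Proposes to all peers; commits only if a majority of followers acknowledged. */
        boolean broadcast(long zxid, String payload) {
            int followers = 0;
            int acks = 0;
            for (Peer p : peers) {
                if (p.voting) {
                    followers++;
                }
                if (p.onProposal(zxid, payload)) {
                    acks++;
                }
            }
            if (acks * 2 <= followers) {
                return false; // no quorum of followers, nothing is committed
            }
            for (Peer p : peers) {
                p.onCommit(zxid);
            }
            return true;
        }
    }
}
```

Because observers see the same proposal stream as followers, the commit message format stays unchanged; the cost is the extra proposal traffic the description mentions.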
[jira] Commented: (ZOOKEEPER-368) Observers
[ https://issues.apache.org/jira/browse/ZOOKEEPER-368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12775956#action_12775956 ]

Patrick Hunt commented on ZOOKEEPER-368:

Henry, I don't see any docs for this in src/docs. I suggest that you start a new document (a new XML file) for this feature. It should explain why and how (to run) at the very least, so that potential users can come up to speed.

Flavio, could you also review the comments on this JIRA as part of your commit review? We should make sure that either all of the issues are addressed or, at the very least, new JIRAs are created (Henry, could you do this?) for the pending items, so that we don't lose the comments/concerns/issues that have been identified previously. This is a major new and visible feature, so I think it warrants the extra time and effort.
[jira] Commented: (ZOOKEEPER-368) Observers
[ https://issues.apache.org/jira/browse/ZOOKEEPER-368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776004#action_12776004 ]

Henry Robinson commented on ZOOKEEPER-368:
--

Hi Flavio / Patrick -

Thanks for your comments!

Design document - there's a brief writeup at http://wiki.apache.org/hadoop/ZooKeeper/Observers which very broadly covers the design. I will update it when I get a moment to do so.

User documentation - yes, will do, already on my to-do list. There is a section in the above wiki page that will be a good start.

Quorums - yes, it should work with all mechanisms. The only caveat is that it only works with the simple LeaderElection protocol, which presumes a majority quorum approach (there are lines where a comparison of votes against quorum.size() / 2 is hardcoded rather than using the verifier - I think this is the source of at least one of the to-dos).

Debug messages - ugh, sorry about that. Will update the patch to build against trunk shortly and remove those messages.
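The difference between a hardcoded majority check and a pluggable verifier, as raised in the comment above, can be sketched as follows. This is a simplified illustration under stated assumptions: the `Verifier` interface, both implementations, and the unweighted group rule (a majority of groups, each with a majority of its members) are hypothetical stand-ins, not ZooKeeper's actual verifier classes, which may also take weights into account.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not ZooKeeper's actual verifier classes.
class VerifierSketch {

    /** Decides whether a set of acknowledging server ids forms a quorum. */
    interface Verifier {
        boolean containsQuorum(Set<Long> acks);
    }

    /** Equivalent to a hardcoded "votes > quorum.size() / 2" style check. */
    static class Majority implements Verifier {
        final int ensembleSize;

        Majority(int ensembleSize) {
            this.ensembleSize = ensembleSize;
        }

        public boolean containsQuorum(Set<Long> acks) {
            return acks.size() > ensembleSize / 2;
        }
    }

    /** Unweighted hierarchical rule: a majority of groups, each with a majority of its members. */
    static class Hierarchical implements Verifier {
        final Map<Long, Integer> groupOf; // server id -> group id

        Hierarchical(Map<Long, Integer> groupOf) {
            this.groupOf = groupOf;
        }

        public boolean containsQuorum(Set<Long> acks) {
            Map<Integer, Integer> groupSize = new HashMap<>();
            Map<Integer, Integer> groupAcks = new HashMap<>();
            for (Map.Entry<Long, Integer> e : groupOf.entrySet()) {
                groupSize.merge(e.getValue(), 1, Integer::sum);
                if (acks.contains(e.getKey())) {
                    groupAcks.merge(e.getValue(), 1, Integer::sum);
                }
            }
            int groupsWithMajority = 0;
            for (Map.Entry<Integer, Integer> e : groupSize.entrySet()) {
                if (groupAcks.getOrDefault(e.getKey(), 0) > e.getValue() / 2) {
                    groupsWithMajority++;
                }
            }
            return groupsWithMajority > groupSize.size() / 2;
        }
    }
}
```

With nine servers in three groups of three, acknowledgments from two servers in each of two groups satisfy the hierarchical rule but not the plain majority rule, which is why code that hardcodes the majority comparison cannot simply be reused with other quorum mechanisms.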
[jira] Updated: (ZOOKEEPER-472) Making DataNode not instantiate a HashMap when the node is ephmeral
[ https://issues.apache.org/jira/browse/ZOOKEEPER-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Hunt updated ZOOKEEPER-472:
---
Status: Patch Available (was: Open)

Is this ready? I see some updates, so I'm putting this into the patch queue. Reviewer, please be sure all the comments are addressed.

Making DataNode not instantiate a HashMap when the node is ephmeral

Key: ZOOKEEPER-472
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-472
Project: Zookeeper
Issue Type: Improvement
Components: server
Affects Versions: 3.2.0, 3.1.1
Reporter: Erik Holstad
Assignee: Erik Holstad
Priority: Minor
Fix For: 3.3.0
Attachments: zookeeper-472.patch, zookeeper-472.patch, zookeeper-472.patch, zookeeper-472.patch

Looking at the code, there is an overhead of a HashSet object for each node's children, even though the node might be an ephemeral node and cannot have children.
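One common way to avoid the per-node overhead described in this issue is to allocate the children set lazily. The sketch below shows that idea only; `LazyNodeSketch` and its methods are hypothetical names, and this is not the code in the attached zookeeper-472.patch files.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not the actual DataNode class or the attached patch.
class LazyNodeSketch {
    // Stays null until the first child is added, so childless nodes
    // (for example ephemeral ones) never pay for a HashSet.
    private Set<String> children;

    synchronized boolean addChild(String name) {
        if (children == null) {
            children = new HashSet<>();
        }
        return children.add(name);
    }

    synchronized boolean removeChild(String name) {
        return children != null && children.remove(name);
    }

    /** Returns a read-only snapshot; an empty set if no child was ever added. */
    synchronized Set<String> getChildren() {
        if (children == null) {
            return Collections.emptySet();
        }
        return Collections.unmodifiableSet(new HashSet<>(children));
    }

    /** Exposed only so the allocation behaviour can be checked. */
    synchronized boolean hasAllocatedChildren() {
        return children != null;
    }
}
```

The trade-off is a null check on every access in exchange for one fewer object per leaf node.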
[jira] Commented: (ZOOKEEPER-368) Observers
[ https://issues.apache.org/jira/browse/ZOOKEEPER-368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776233#action_12776233 ]

Henry Robinson commented on ZOOKEEPER-368:
--

I just put up a set of notes on the patch on the wiki here: http://wiki.apache.org/hadoop/ZooKeeper/Observers/ReviewGuide to help make the review a little less painful. Although not comprehensive, it should help explain most of the major code changes.

An updated patch will follow very shortly.
[jira] Updated: (ZOOKEEPER-368) Observers
[ https://issues.apache.org/jira/browse/ZOOKEEPER-368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henry Robinson updated ZOOKEEPER-368:
-
Attachment: ZOOKEEPER-368.patch

Updated patch - removed some erroneous debugging logs, made a slight improvement to one test.

Please see review guide at http://wiki.apache.org/hadoop/ZooKeeper/Observers/ReviewGuide - comments on any further tests required would be particularly welcome.
[jira] Created: (ZOOKEEPER-575) remove System.exit calls to make the server more container friendly
remove System.exit calls to make the server more container friendly
---

Key: ZOOKEEPER-575
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-575
Project: Zookeeper
Issue Type: Improvement
Components: server
Reporter: Patrick Hunt
Fix For: 3.3.0

There are a handful of places left in the code that still use System.exit; we should remove these to make the server more container friendly.

There are some legitimate places for the exits. Those in *Main.java, for example, should be fine, since these are the command line main routines. Containers should be embedding code that runs just below this layer (or we should refactor so that they can).

The tricky bit is ensuring the server shuts down when an unrecoverable error occurs; afaik these are the locations where we still have System.exit calls.
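A minimal sketch of the layering this issue describes, assuming a hypothetical exception type and entry point: the embeddable server code signals an unrecoverable error by stopping itself and throwing, and only the command line wrapper translates that into a process exit code. None of these names (`ContainerFriendlySketch`, `UnrecoverableServerException`, `ServerCore`, `runMain`) come from the ZooKeeper source.

```java
// Hypothetical illustration only; not ZooKeeper's actual classes.
class ContainerFriendlySketch {

    /** Raised by embeddable code instead of calling System.exit. */
    static class UnrecoverableServerException extends RuntimeException {
        final int exitCode;

        UnrecoverableServerException(String message, int exitCode) {
            super(message);
            this.exitCode = exitCode;
        }
    }

    /** The layer a container would embed; it never terminates the JVM itself. */
    static class ServerCore {
        private volatile boolean running = true;

        boolean isRunning() {
            return running;
        }

        /** Shuts the server down and reports the failure to whoever embeds it. */
        void fail(String reason, int exitCode) {
            running = false;
            throw new UnrecoverableServerException(reason, exitCode);
        }
    }

    /**
     * Command line layer: maps the outcome to an exit code. A real main()
     * would call System.exit(runMain(...)); a container would call the
     * server body directly and handle the exception itself.
     */
    static int runMain(Runnable serverBody) {
        try {
            serverBody.run();
            return 0;
        } catch (UnrecoverableServerException e) {
            return e.exitCode;
        }
    }
}
```

The server still stops on an unrecoverable error, which addresses the "tricky bit" above, but the decision to end the process stays with the outermost layer.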
Re: Possible race in LETest.java
Closing the loop - what's the status on this? Can one of you open a JIRA and provide a patch for this?

Thanks,
Patrick

Flavio Junqueira wrote:

Hi Henry,

Apologies for the delay. Your observation sounds right to me. Here is how I'm reading it; let me know if it makes sense. If everyone votes for 3 in the second round and 3 has crashed, then in countVotes we will remove all votes for 3 and there will be no vote left. In such a case, there will be no winner as a result of the call to countVotes, and lookForLeader won't change the current vote (LeaderElection.java:201). This is a situation in which we are stuck.

Does it sound reasonable to add an else to the if statement of LeaderElection.java:201 to reset the vote? This modification would implement resetting the vote when countVotes returns no winner, which should happen only when the replica itself votes for a dead leader.

-Flavio

On Oct 28, 2009, at 7:44 AM, Henry Robinson wrote:

[ Sending this direct since the Apache mailserver is rejecting my e-mails at the moment ]

As I understand it, 1 and 2 receive a vote for 3 in the first round, which causes them to vote for 3 in the second round. So in the second round, all votes cast are for 3. But 3 has died, so all votes for it are discounted. 1 and 2 continue to vote for 3 ad infinitum, never resetting their vote.

Does this sound plausible, or am I missing something?

cheers,
Henry

On Tue, Oct 27, 2009 at 3:48 PM, Flavio Junqueira f...@yahoo-inc.com wrote:

Hi Henry,

I don't understand how 1 and 2 do not end up electing 2 in your situation. If they exclude 3 in countVotes, then countVotes will end up returning 2 and not 3, assuming there is a vote for 2. What am I missing?

The problem with QuorumPeer you're pointing at was also an issue with the FLE tests, and I couldn't see an easy way around it other than timing out and restarting leader election.

Cheers,
-Flavio

On Oct 27, 2009, at 6:35 AM, Henry Robinson wrote:

I've been working on adding a TCPResponderThread to the leader election process so that if a deployment needs to be TCP only, it can be and still use all election types. Testing this has exposed what might be a race condition in the leader election code that prevents a leader from being elected.

Here's the behaviour I see in LETest occasionally. With three nodes (reduced from 30 for ease of debugging), node 3 gets elected before either node 1 or node 2 finishes its election (there is one round where each node finds that 3 has the highest id, and then 3 completes its second round by receiving votes for itself from 1 and 2, but 1 and 2 do not receive votes from 3). Now 3 is killed by the test harness. 1 and 2 are still voting for it, but every time they try, the vote tally excludes 3 since it hasn't been heard from. They then spin round the voting process, unable to reset their vote.

I expect that the heartbeat mechanism in a running QuorumPeer takes care of this when the leader is lost, but the associated QuorumPeers aren't running. If this is the case, then there is a simple fix: reset a node's vote to itself if it is voting for a node that hasn't been heard from.

I don't know why using TCP instead of UDP for the responder thread is exacerbating this (and we can't rule out my introducing a bug :)); but as it's a race condition, the different timings associated with waiting on a TCP socket might just be enough to expose the issue.

Can someone verify this might be possible / figure out what I missed?

cheers,
Henry
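The stuck state discussed in this thread, and the reset Flavio proposes, can be illustrated with a small model. This is a sketch under assumptions, not the code in LeaderElection.java: `countVotes` here simply discards votes for servers that have not been heard from and returns a hypothetical `NO_WINNER` marker when nothing is left, and `nextVote` adds the proposed fallback of voting for oneself in that case.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the actual LeaderElection implementation.
class VoteResetSketch {
    static final long NO_WINNER = -1L;

    /** Tallies votes, ignoring any vote for a server that has not been heard from. */
    static long countVotes(Collection<Long> votes, Set<Long> heardFrom) {
        Map<Long, Integer> tally = new HashMap<>();
        for (long candidate : votes) {
            if (heardFrom.contains(candidate)) {
                tally.merge(candidate, 1, Integer::sum);
            }
        }
        long winner = NO_WINNER;
        int best = 0;
        for (Map.Entry<Long, Integer> e : tally.entrySet()) {
            int count = e.getValue();
            // Ties are broken in favour of the higher server id.
            if (count > best || (count == best && e.getKey() > winner)) {
                winner = e.getKey();
                best = count;
            }
        }
        return winner;
    }

    /**
     * The proposed "else" branch: when every vote was discarded, fall back to
     * voting for ourselves instead of keeping a vote for a dead server.
     */
    static long nextVote(long myId, Collection<Long> votes, Set<Long> heardFrom) {
        long winner = countVotes(votes, heardFrom);
        return winner != NO_WINNER ? winner : myId;
    }
}
```

In the scenario from the thread, nodes 1 and 2 both hold a vote for 3 and only hear from each other, so the tally is empty; without the fallback their vote would never change.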