[kudu-CR] Cleanup/refactor tracking of consensus watermarks
Kudu Jenkins has posted comments on this change. Change subject: Cleanup/refactor tracking of consensus watermarks .. Patch Set 8: -Verified Build Started http://104.196.14.100/job/kudu-gerrit/3267/ -- To view, visit http://gerrit.cloudera.org:8080/4133 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] Cleanup/refactor tracking of consensus watermarks
Mike Percy has posted comments on this change. Change subject: Cleanup/refactor tracking of consensus watermarks .. Patch Set 8: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/4133 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] Cleanup/refactor tracking of consensus watermarks
Hello David Ribeiro Alves, Mike Percy, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4133 to look at the new patch set (#8). Change subject: Cleanup/refactor tracking of consensus watermarks .. Cleanup/refactor tracking of consensus watermarks This is a fairly invasive cleanup/refactor to consensus in preparation for propagating replication watermarks from the leader to the followers. The key item here is to address a long-standing TODO that the PeerMessageQueue should itself calculate the commit index, rather than delegate that job to ReplicaState. The flow used to be something like this on the leader upon receiving a response from a peer that it has replicated some new operations: - PeerMessageQueue::ResponseFromPeer - updates peer last_received - calls PeerMessageQueue::AdvanceQueueWatermark - updates PeerMessageQueue::majority_replicated to the new majority watermark - calls NotifyObserversOfMajorityReplOp --> submits an async task to the raft threadpool - for each observer (best I can tell there is currently always only one, but that's a separate issue) -- observer->UpdateMajorityReplicated() to grab committed index (the observer is always the RaftConsensus instance) --- RaftConsensus::UpdateMajorityReplicated ReplicaState::UpdateMajorityReplicatedUnlocked - this checks the condition that the majority-replicated watermark is within the leader's term before advancing the commit index. - calls AdvanceCommittedIndexUnlocked --- commits stuff, etc - sets *committed_index out-param -- PeerMessageQueue picks up this out-param and sets the new value within the PeerMessageQueue This was very difficult to follow, given that the "observer" in fact was passing data back to the "notifier" through an out-parameter. Moreover, it wasn't obvious to see which class was "authoritative" for the tracking and advancement of watermarks. The new design makes the PeerMessageQueue itself fully responsible for calculating when the commit index can advance. The flow is now the following, with '!' at the beginning of the line to denote where it is changed from before: - PeerMessageQueue::ResponseFromPeer - updates peer last_received - calls PeerMessageQueue::AdvanceQueueWatermark - updates PeerMessageQueue::majority_replicated to the new majority watermark. ! - If the majority-replicated watermark is within the current leader !term, advances the commit index (stored in QueueState) !- notifies observers of the new commit index ! - calls observer->NotifyCommitIndex() with the new commit index !- RaftConsensus::NotifyCommitIndex() ! - ReplicaState::AdvanceCommittedIndexUnlocked - commits stuff, etc. This required a fairly invasive surgery because the PeerMessageQueue itself doesn't remember the terms of pending operations, and thus the watermarks became indexes instead of OpIds. This is itself also a simplification -- we previously were very messy about using the word "index" when in fact we had an OpId type. In almost every case, we only used the index of those OpIds. In the Raft paper itself, these watermarks are just indexes and not OpIds, so this change brings us closer to the implementation described in the original design. I looped raft-consensus-itest.MultiThreadedInsertWithFailovers 800 times and the whole suite 500 times and got no failures[1]. [1] http://dist-test.cloudera.org//job?job_id=todd.1473280738.28972 Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 --- M docs/release_notes.adoc M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus.proto M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/raft_consensus-test.cc M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/raft_consensus_quorum-test.cc M src/kudu/consensus/raft_consensus_state.cc M src/kudu/consensus/raft_consensus_state.h M src/kudu/integration-tests/raft_consensus-itest.cc 15 files changed, 464 insertions(+), 592 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4133/8 -- To view, visit http://gerrit.cloudera.org:8080/4133 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy
[kudu-CR] Cleanup/refactor tracking of consensus watermarks
Kudu Jenkins has posted comments on this change. Change subject: Cleanup/refactor tracking of consensus watermarks .. Patch Set 8: Build Started http://104.196.14.100/job/kudu-gerrit/3266/ -- To view, visit http://gerrit.cloudera.org:8080/4133 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] Cleanup/refactor tracking of consensus watermarks
Mike Percy has posted comments on this change. Change subject: Cleanup/refactor tracking of consensus watermarks .. Patch Set 7: Code-Review+1 (1 comment) lgtm, just a syntax nit http://gerrit.cloudera.org:8080/#/c/4133/7/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: Line 703: && queue_state_.majority_replicated_index > queue_state_.committed_index) { nit: ampersands not consistent (put them at the end of the line) -- To view, visit http://gerrit.cloudera.org:8080/4133 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] Cleanup/refactor tracking of consensus watermarks
David Ribeiro Alves has posted comments on this change. Change subject: Cleanup/refactor tracking of consensus watermarks .. Patch Set 7: (1 comment) http://gerrit.cloudera.org:8080/#/c/4133/7/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: Line 267: // changes, so the refactor isn't trivial. > seems like premature optimization -- will make the code longer, and this ha k, fair enough -- To view, visit http://gerrit.cloudera.org:8080/4133 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] Cleanup/refactor tracking of consensus watermarks
Todd Lipcon has posted comments on this change. Change subject: Cleanup/refactor tracking of consensus watermarks .. Patch Set 7: (1 comment) http://gerrit.cloudera.org:8080/#/c/4133/7/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: Line 267: // changes, so the refactor isn't trivial. > final q: do you think it might be worth it to just look at the last message seems like premature optimization -- will make the code longer, and this has never shown up as a hot spot -- To view, visit http://gerrit.cloudera.org:8080/4133 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] Cleanup/refactor tracking of consensus watermarks
David Ribeiro Alves has posted comments on this change. Change subject: Cleanup/refactor tracking of consensus watermarks .. Patch Set 7: (2 comments) http://gerrit.cloudera.org:8080/#/c/4133/6/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: Line 459: for (const PeersMap::value_type& peer : peers_map_) { > yea, agreed that it's not a super impactful optimization, but more about co well the all replicated is always supposed to go back when we add a new node, so maybe we can change the name, but its nature seems to be necessary :) agreed on the follow on. http://gerrit.cloudera.org:8080/#/c/4133/7/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: Line 267: // changes, so the refactor isn't trivial. final q: do you think it might be worth it to just look at the last message, like we did before and then reverse iterate only if the term actually changed? -- To view, visit http://gerrit.cloudera.org:8080/4133 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] Cleanup/refactor tracking of consensus watermarks
Kudu Jenkins has posted comments on this change. Change subject: Cleanup/refactor tracking of consensus watermarks .. Patch Set 7: Build Started http://104.196.14.100/job/kudu-gerrit/3257/ -- To view, visit http://gerrit.cloudera.org:8080/4133 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] Cleanup/refactor tracking of consensus watermarks
Hello David Ribeiro Alves, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4133 to look at the new patch set (#7). Change subject: Cleanup/refactor tracking of consensus watermarks .. Cleanup/refactor tracking of consensus watermarks This is a fairly invasive cleanup/refactor to consensus in preparation for propagating replication watermarks from the leader to the followers. The key item here is to address a long-standing TODO that the PeerMessageQueue should itself calculate the commit index, rather than delegate that job to ReplicaState. The flow used to be something like this on the leader upon receiving a response from a peer that it has replicated some new operations: - PeerMessageQueue::ResponseFromPeer - updates peer last_received - calls PeerMessageQueue::AdvanceQueueWatermark - updates PeerMessageQueue::majority_replicated to the new majority watermark - calls NotifyObserversOfMajorityReplOp --> submits an async task to the raft threadpool - for each observer (best I can tell there is currently always only one, but that's a separate issue) -- observer->UpdateMajorityReplicated() to grab committed index (the observer is always the RaftConsensus instance) --- RaftConsensus::UpdateMajorityReplicated ReplicaState::UpdateMajorityReplicatedUnlocked - this checks the condition that the majority-replicated watermark is within the leader's term before advancing the commit index. - calls AdvanceCommittedIndexUnlocked --- commits stuff, etc - sets *committed_index out-param -- PeerMessageQueue picks up this out-param and sets the new value within the PeerMessageQueue This was very difficult to follow, given that the "observer" in fact was passing data back to the "notifier" through an out-parameter. Moreover, it wasn't obvious to see which class was "authoritative" for the tracking and advancement of watermarks. The new design makes the PeerMessageQueue itself fully responsible for calculating when the commit index can advance. The flow is now the following, with '!' at the beginning of the line to denote where it is changed from before: - PeerMessageQueue::ResponseFromPeer - updates peer last_received - calls PeerMessageQueue::AdvanceQueueWatermark - updates PeerMessageQueue::majority_replicated to the new majority watermark. ! - If the majority-replicated watermark is within the current leader !term, advances the commit index (stored in QueueState) !- notifies observers of the new commit index ! - calls observer->NotifyCommitIndex() with the new commit index !- RaftConsensus::NotifyCommitIndex() ! - ReplicaState::AdvanceCommittedIndexUnlocked - commits stuff, etc. This required a fairly invasive surgery because the PeerMessageQueue itself doesn't remember the terms of pending operations, and thus the watermarks became indexes instead of OpIds. This is itself also a simplification -- we previously were very messy about using the word "index" when in fact we had an OpId type. In almost every case, we only used the index of those OpIds. In the Raft paper itself, these watermarks are just indexes and not OpIds, so this change brings us closer to the implementation described in the original design. I looped raft-consensus-itest.MultiThreadedInsertWithFailovers 800 times and the whole suite 500 times and got no failures[1]. [1] http://dist-test.cloudera.org//job?job_id=todd.1473280738.28972 Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 --- M docs/release_notes.adoc M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus.proto M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/raft_consensus-test.cc M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/raft_consensus_quorum-test.cc M src/kudu/consensus/raft_consensus_state.cc M src/kudu/consensus/raft_consensus_state.h M src/kudu/integration-tests/raft_consensus-itest.cc 15 files changed, 464 insertions(+), 592 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4133/7 -- To view, visit http://gerrit.cloudera.org:8080/4133 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy
[kudu-CR] Cleanup/refactor tracking of consensus watermarks
Todd Lipcon has posted comments on this change. Change subject: Cleanup/refactor tracking of consensus watermarks .. Patch Set 6: (3 comments) http://gerrit.cloudera.org:8080/#/c/4133/6/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: Line 137: if (current_term != queue_state_.current_term) { > are we verifying that the term doesn't go back? should we? sure, I'll give that a try. I think we are, elsewhere, but maybe not here. Line 263: // TODO: this code seems wrong. When we are leader, we've already bumped > yeah, this seems like it's making things a bit messier (not that they were oops, I actually meant to remove this TODO in the current revision. I agree it would be nice to do the SetLeaderAndAppendNoOp thing, but we're currently abusing SetLeaderMode to handle refreshing the tracked peers on a configuration change, and I didn't want to open the can of worms of trying to add more refactoring on top. So, I just kept this as is. I'll update the TODO, though, to suggest this refactor. Line 459: // TODO: The fact that we only consider peers whose last exchange was > this is technically true, but not sure how important/impactful it is. If A yea, agreed that it's not a super impactful optimization, but more about code cleanliness and the fact that our "watermarks" actually move backwards, not just forwards (and thus aren't technically watermarks!) But, would like to address it as a follow-on. -- To view, visit http://gerrit.cloudera.org:8080/4133 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] Cleanup/refactor tracking of consensus watermarks
David Ribeiro Alves has posted comments on this change. Change subject: Cleanup/refactor tracking of consensus watermarks .. Patch Set 6: (3 comments) http://gerrit.cloudera.org:8080/#/c/4133/6/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: Line 137: if (current_term != queue_state_.current_term) { are we verifying that the term doesn't go back? should we? Line 263: // TODO: this code seems wrong. When we are leader, we've already bumped yeah, this seems like it's making things a bit messier (not that they were much more nice and neat before). Would it be possible to create a new SetLeaderMode like method that also took the first round to be appended and that would set both the first index and do the SetLeaderMode stuff? Something like SetLeaderAndAppendNoOp then we would always snoop for both index and term (even f if you don't like snooping, at least we'd be consistent). This may not work great with the replicating peers though (right now we set leader mode, update the peers config and only then append the no op) Line 459: // TODO: The fact that we only consider peers whose last exchange was this is technically true, but not sure how important/impactful it is. If A goes down we wouldn't be able to advance the index anyway, and we'll be able to again only once we've removed A and added a new node. So yeah, once we've stopped talking to a peer we stop moving the all_replicated and maybe we could advance it a little more, but IMO we should address it only if it becomes pathological. (not saying anything about the last_received, which I agree we should make match raft proper. iirc we actually tried to do that and hit some roadblock) -- To view, visit http://gerrit.cloudera.org:8080/4133 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] Cleanup/refactor tracking of consensus watermarks
Kudu Jenkins has posted comments on this change. Change subject: Cleanup/refactor tracking of consensus watermarks .. Patch Set 6: Build Started http://104.196.14.100/job/kudu-gerrit/3252/ -- To view, visit http://gerrit.cloudera.org:8080/4133 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] Cleanup/refactor tracking of consensus watermarks
Hello David Ribeiro Alves, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4133 to look at the new patch set (#6). Change subject: Cleanup/refactor tracking of consensus watermarks .. Cleanup/refactor tracking of consensus watermarks This is a fairly invasive cleanup/refactor to consensus in preparation for propagating replication watermarks from the leader to the followers. The key item here is to address a long-standing TODO that the PeerMessageQueue should itself calculate the commit index, rather than delegate that job to ReplicaState. The flow used to be something like this on the leader upon receiving a response from a peer that it has replicated some new operations: - PeerMessageQueue::ResponseFromPeer - updates peer last_received - calls PeerMessageQueue::AdvanceQueueWatermark - updates PeerMessageQueue::majority_replicated to the new majority watermark - calls NotifyObserversOfMajorityReplOp --> submits an async task to the raft threadpool - for each observer (best I can tell there is currently always only one, but that's a separate issue) -- observer->UpdateMajorityReplicated() to grab committed index (the observer is always the RaftConsensus instance) --- RaftConsensus::UpdateMajorityReplicated ReplicaState::UpdateMajorityReplicatedUnlocked - this checks the condition that the majority-replicated watermark is within the leader's term before advancing the commit index. - calls AdvanceCommittedIndexUnlocked --- commits stuff, etc - sets *committed_index out-param -- PeerMessageQueue picks up this out-param and sets the new value within the PeerMessageQueue This was very difficult to follow, given that the "observer" in fact was passing data back to the "notifier" through an out-parameter. Moreover, it wasn't obvious to see which class was "authoritative" for the tracking and advancement of watermarks. The new design makes the PeerMessageQueue itself fully responsible for calculating when the commit index can advance. The flow is now the following, with '!' at the beginning of the line to denote where it is changed from before: - PeerMessageQueue::ResponseFromPeer - updates peer last_received - calls PeerMessageQueue::AdvanceQueueWatermark - updates PeerMessageQueue::majority_replicated to the new majority watermark. ! - If the majority-replicated watermark is within the current leader !term, advances the commit index (stored in QueueState) !- notifies observers of the new commit index ! - calls observer->NotifyCommitIndex() with the new commit index !- RaftConsensus::NotifyCommitIndex() ! - ReplicaState::AdvanceCommittedIndexUnlocked - commits stuff, etc. This required a fairly invasive surgery because the PeerMessageQueue itself doesn't remember the terms of pending operations, and thus the watermarks became indexes instead of OpIds. This is itself also a simplification -- we previously were very messy about using the word "index" when in fact we had an OpId type. In almost every case, we only used the index of those OpIds. In the Raft paper itself, these watermarks are just indexes and not OpIds, so this change brings us closer to the implementation described in the original design. I looped raft-consensus-itest.MultiThreadedInsertWithFailovers 800 times and the whole suite 500 times and got no failures. Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 --- M docs/release_notes.adoc M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus.proto M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/raft_consensus-test.cc M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/raft_consensus_quorum-test.cc M src/kudu/consensus/raft_consensus_state.cc M src/kudu/consensus/raft_consensus_state.h M src/kudu/integration-tests/raft_consensus-itest.cc 15 files changed, 463 insertions(+), 592 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4133/6 -- To view, visit http://gerrit.cloudera.org:8080/4133 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] Cleanup/refactor tracking of consensus watermarks
David Ribeiro Alves has posted comments on this change. Change subject: Cleanup/refactor tracking of consensus watermarks .. Patch Set 4: (6 comments) http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_queue-test.cc File src/kudu/consensus/consensus_queue-test.cc: Line 311: ASSERT_EQ(queue_->GetMajorityReplicatedIndexForTests(), 0); > I disagree in this case, since it's obvious what the numeric constant is re ok http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: Line 137: // TODO: reset first_index_in_term? make it optional, reset here and set on the > hrm, having some trouble with this TODO actually. Trying to clean up the wa sounds reasonable Line 447: // TODO: why is this only if successful? shouldn't we still use > added this TODO in this patch but I think it's a pre-existing issue. Any th would have into the history to remind myself why this is here, but I seem to recall this being an important thing Line 499: boost::optional updated_commit_index; > given we already use boost::optional pretty often elsewhere, I dont think t ok Line 636: // If we're the leader, we are in charge of computing the new > hrm, why do you say we might not be the leader? We're still inside the lock queue is decoupled from state so isn't it possible that we lost an election but the queue doesn't know about it yet (i.e. that the call to change the queue's mode is queued behind this thread competing for the lock? http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/raft_consensus.cc File src/kudu/consensus/raft_consensus.cc: Line 1267: // TODO: interrogate queue rather than state? > trying to reduce the amount of duplication of responsibility (ideally could that makes sense in general, though I'm not totally sure what it looks like -- To view, visit http://gerrit.cloudera.org:8080/4133 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] Cleanup/refactor tracking of consensus watermarks
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4133 to look at the new patch set (#5). Change subject: Cleanup/refactor tracking of consensus watermarks .. Cleanup/refactor tracking of consensus watermarks This is a fairly invasive cleanup/refactor to consensus in preparation for propagating replication watermarks from the leader to the followers. The key item here is to address a long-standing TODO that the PeerMessageQueue should itself calculate the commit index, rather than delegate that job to ReplicaState. The flow used to be something like this on the leader upon receiving a response from a peer that it has replicated some new operations: - PeerMessageQueue::ResponseFromPeer - updates peer last_received - calls PeerMessageQueue::AdvanceQueueWatermark - updates PeerMessageQueue::majority_replicated to the new majority watermark - calls NotifyObserversOfMajorityReplOp --> submits an async task to the raft threadpool - for each observer (best I can tell there is currently always only one, but that's a separate issue) -- observer->UpdateMajorityReplicated() to grab committed index (the observer is always the RaftConsensus instance) --- RaftConsensus::UpdateMajorityReplicated ReplicaState::UpdateMajorityReplicatedUnlocked - this checks the condition that the majority-replicated watermark is within the leader's term before advancing the commit index. - calls AdvanceCommittedIndexUnlocked --- commits stuff, etc - sets *committed_index out-param -- PeerMessageQueue picks up this out-param and sets the new value within the PeerMessageQueue This was very difficult to follow, given that the "observer" in fact was passing data back to the "notifier" through an out-parameter. Moreover, it wasn't obvious to see which class was "authoritative" for the tracking and advancement of watermarks. The new design makes the PeerMessageQueue itself fully responsible for calculating when the commit index can advance. This required a fairly invasive surgery because the PeerMessageQueue itself doesn't remember the terms of pending operations, and thus the watermarks became indexes instead of OpIds. This is itself also a simplification -- we previously were very messy about using the word "index" when in fact we had an OpId type. In almost every case, we only used the index of those OpIds. In the Raft paper itself, these watermarks are just indexes and not OpIds, so this change brings us closer to the implementation described in the original design. Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 --- M docs/release_notes.adoc M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus.proto M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/raft_consensus-test.cc M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/raft_consensus_quorum-test.cc M src/kudu/consensus/raft_consensus_state.cc M src/kudu/consensus/raft_consensus_state.h M src/kudu/integration-tests/raft_consensus-itest.cc 15 files changed, 422 insertions(+), 591 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4133/5 -- To view, visit http://gerrit.cloudera.org:8080/4133 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] Cleanup/refactor tracking of consensus watermarks
Kudu Jenkins has posted comments on this change. Change subject: Cleanup/refactor tracking of consensus watermarks .. Patch Set 5: Build Started http://104.196.14.100/job/kudu-gerrit/3183/ -- To view, visit http://gerrit.cloudera.org:8080/4133 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] Cleanup/refactor tracking of consensus watermarks
Todd Lipcon has posted comments on this change. Change subject: Cleanup/refactor tracking of consensus watermarks .. Patch Set 4: (23 comments) http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus.proto File src/kudu/consensus/consensus.proto: Line 333: // The id of the last committed operation in the configuration. This is the > s/id/index/ Done Line 334: // id of the last operation the leader deemed committed from a consensus > s/id/index/ Done http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_peers-test.cc File src/kudu/consensus/consensus_peers-test.cc: Line 142: message_queue_->SetLeaderMode(0, 0, BuildRaftConfigPBForTests(3)); > Maybe better to use kMinimumTerm and kMinimumOpIdIndex from opid_util.h her Done http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_queue-test.cc File src/kudu/consensus/consensus_queue-test.cc: Line 202: queue_->SetLeaderMode(0, MinimumOpId().term(), BuildRaftConfigPBForTests(2)); > use MinimumOpId.index() of the equivalent constant switched to using kMinimumOpIdIndex and kMinimumTerm here and elsewhere per Mike's suggestion Line 242: queue_->SetLeaderMode(0, MinimumOpId().term(), BuildRaftConfigPBForTests(2)); > same Done Line 305: queue_->SetLeaderMode(0, MinimumOpId().term(), BuildRaftConfigPBForTests(3)); > same Done Line 311: ASSERT_EQ(queue_->GetMajorityReplicatedIndexForTests(), 0); > same here and below I disagree in this case, since it's obvious what the numeric constant is representing (given the other side of the equality assertion). In the "SetLeaderMode" case it's nice because you don't necessarily remember the order of the two integer arguments. Also I think it's important to realize that it's 0 as an integer, rather than maybe thinking it's 1 (which is actually the first term and index that any peer claims, right?) http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: Line 112: queue_state_.current_term = 0; > use constants here and elsewhere? actually per my reply elsewhere I think 0's clearer Line 137: // TODO: reset first_index_in_term? make it optional, reset here and set on the hrm, having some trouble with this TODO actually. Trying to clean up the way the queue tracks current_term as leader, and I think some of our tests may be abusing the queue. In particular: static inline void AppendReplicateMessagesToQueue( PeerMessageQueue* queue, const scoped_refptr& clock, int first, int count, int payload_size = 0) { for (int i = first; i < first + count; i++) { int term = i / 7; int index = i; CHECK_OK(queue->AppendOperation(make_scoped_refptr_replicate( CreateDummyReplicate(term, index, clock->Now(), payload_size).release(; } } this code is appending ops from different terms to a leader-mode queue, which I don't think would ever happen in real life, right? Any time the term advances, the leader would step down and then have to call SetLeaderMode again for the next term? Mike/David: If this seems right to you guys, I'll make the change so that the queue doesn't track current term by "snooping" on ops anymore, but instead takes a 'first_opid_in_term' parameter to SetLeaderMode, which will be the opid of the NO_OP. Line 447: // TODO: why is this only if successful? shouldn't we still use added this TODO in this patch but I think it's a pre-existing issue. Any thoughts on this issue? otherwise I"m inclined to leave this as a TODO here Line 499: boost::optional updated_commit_index; > do we really need to introduce a boost dependency here? since this is just given we already use boost::optional pretty often elsewhere, I dont think the "introduce a dependency" argument carries that much weight. I like optional because it's impossible to accidentally miss the "special" value (eg accidentally compare vs the -1 value and decide something is above this commit index, or whatever) Line 636: // If we're the leader, we are in charge of computing the new > I think it would make sense to document somewhere that calculating the comm hrm, why do you say we might not be the leader? We're still inside the lock here, so we know we're in leader mode. Line 658: if (queue_state_.majority_replicated_index >= queue_state_.first_index_in_term) { > maybe rename this var to "first_index_in_current_term" Done Line 669: VLOG_WITH_PREFIX_UNLOCKED(2) << "Commit index advanced to " << *updated_commit_index; > add the "before" ci Done Line 676: (peer->last_known_committed_idx < queue_state_.committed_index); > nit: rename the suffix of this var to _index for consistency Done http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_queue.h File src/kudu/consensus/consensus_queue.h: Line 139: // of the commit
[kudu-CR] Cleanup/refactor tracking of consensus watermarks
David Ribeiro Alves has posted comments on this change. Change subject: Cleanup/refactor tracking of consensus watermarks .. Patch Set 4: (17 comments) http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_queue-test.cc File src/kudu/consensus/consensus_queue-test.cc: Line 202: queue_->SetLeaderMode(0, MinimumOpId().term(), BuildRaftConfigPBForTests(2)); use MinimumOpId.index() of the equivalent constant Line 242: queue_->SetLeaderMode(0, MinimumOpId().term(), BuildRaftConfigPBForTests(2)); same Line 305: queue_->SetLeaderMode(0, MinimumOpId().term(), BuildRaftConfigPBForTests(3)); same Line 311: ASSERT_EQ(queue_->GetMajorityReplicatedIndexForTests(), 0); same here and below http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: Line 112: queue_state_.current_term = 0; use constants here and elsewhere? Line 499: boost::optional updated_commit_index; do we really need to introduce a boost dependency here? since this is just an int why not just have a NOT_UPDATED = -1 constant? Line 636: // If we're the leader, we are in charge of computing the new I think it would make sense to document somewhere that calculating the commit index here is safe, independently of whether we're leader or not (we might not be by the time we reach here) since the commit index is a property of the shared state machine and not something that is "decided" by the leader. Line 658: if (queue_state_.majority_replicated_index >= queue_state_.first_index_in_term) { maybe rename this var to "first_index_in_current_term" Line 669: VLOG_WITH_PREFIX_UNLOCKED(2) << "Commit index advanced to " << *updated_commit_index; add the "before" ci Line 676: (peer->last_known_committed_idx < queue_state_.committed_index); nit: rename the suffix of this var to _index for consistency http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_queue.h File src/kudu/consensus/consensus_queue.h: Line 139: // of the commit index from the follower code. having some trouble parsing this sentence: "updating the queue of the commit index" http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/raft_consensus-test.cc File src/kudu/consensus/raft_consensus-test.cc: Line 388: LOG(WARNING) << ""; ?? http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/raft_consensus.cc File src/kudu/consensus/raft_consensus.cc: Line 1267: // TODO: interrogate queue rather than state? what would be the difference/advantage? http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/raft_consensus.h File src/kudu/consensus/raft_consensus.h: Line 171: nit: extra line http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/raft_consensus_state.cc File src/kudu/consensus/raft_consensus_state.cc: Line 575: LOG(FATAL) << "TODO: remove me once convinced that this doesnt happen in tests"; add an actual error message that people can parse also add prefix Line 583: LOG_WITH_PREFIX_UNLOCKED(WARNING) << "given start C_O=" << committed_op more info in the log statement or remove http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/integration-tests/raft_consensus-itest.cc File src/kudu/integration-tests/raft_consensus-itest.cc: Line 2543: req.set_committed_index(0); use constant -- To view, visit http://gerrit.cloudera.org:8080/4133 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] Cleanup/refactor tracking of consensus watermarks
Mike Percy has posted comments on this change. Change subject: Cleanup/refactor tracking of consensus watermarks .. Patch Set 4: (4 comments) I am still working my way through this patch. There are several TODOs, not sure if you know the answers to those yet. But I will investigate when I pick up again on reviewing the logic changes in this patch. http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus.proto File src/kudu/consensus/consensus.proto: Line 333: // The id of the last committed operation in the configuration. This is the s/id/index/ Line 334: // id of the last operation the leader deemed committed from a consensus s/id/index/ http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_peers-test.cc File src/kudu/consensus/consensus_peers-test.cc: Line 142: message_queue_->SetLeaderMode(0, 0, BuildRaftConfigPBForTests(3)); Maybe better to use kMinimumTerm and kMinimumOpIdIndex from opid_util.h here and elsewhere http://gerrit.cloudera.org:8080/#/c/4133/4/src/kudu/consensus/consensus_queue.h File src/kudu/consensus/consensus_queue.h: Line 415: // TODO: doc me todo -- To view, visit http://gerrit.cloudera.org:8080/4133 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] Cleanup/refactor tracking of consensus watermarks
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4133 to look at the new patch set (#4). Change subject: Cleanup/refactor tracking of consensus watermarks .. Cleanup/refactor tracking of consensus watermarks This is a fairly invasive cleanup/refactor to consensus in preparation for propagating replication watermarks from the leader to the followers. The key item here is to address a long-standing TODO that the PeerMessageQueue should itself calculate the commit index, rather than delegate that job to ReplicaState. The flow used to be something like this on the leader upon receiving a response from a peer that it has replicated some new operations: - PeerMessageQueue::ResponseFromPeer - updates peer last_received - calls PeerMessageQueue::AdvanceQueueWatermark - updates PeerMessageQueue::majority_replicated to the new majority watermark - calls NotifyObserversOfMajorityReplOp --> submits an async task to the raft threadpool - for each observer (best I can tell there is currently always only one, but that's a separate issue) -- observer->UpdateMajorityReplicated() to grab committed index (the observer is always the RaftConsensus instance) --- RaftConsensus::UpdateMajorityReplicated ReplicaState::UpdateMajorityReplicatedUnlocked - this checks the condition that the majority-replicated watermark is within the leader's term before advancing the commit index. - calls AdvanceCommittedIndexUnlocked --- commits stuff, etc - sets *committed_index out-param -- PeerMessageQueue picks up this out-param and sets the new value within the PeerMessageQueue This was very difficult to follow, given that the "observer" in fact was passing data back to the "notifier" through an out-parameter. Moreover, it wasn't obvious to see which class was "authoritative" for the tracking and advancement of watermarks. The new design makes the PeerMessageQueue itself fully responsible for calculating when the commit index can advance. This required a fairly invasive surgery because the PeerMessageQueue itself doesn't remember the terms of pending operations, and thus the watermarks became indexes instead of OpIds. This is itself also a simplification -- we previously were very messy about using the word "index" when in fact we had an OpId type. In almost every case, we only used the index of those OpIds. In the Raft paper itself, these watermarks are just indexes and not OpIds, so this change brings us closer to the implementation described in the original design. Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 --- M docs/release_notes.adoc M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus.proto M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/raft_consensus-test.cc M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/raft_consensus_quorum-test.cc M src/kudu/consensus/raft_consensus_state.cc M src/kudu/consensus/raft_consensus_state.h M src/kudu/integration-tests/raft_consensus-itest.cc 15 files changed, 400 insertions(+), 569 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4133/4 -- To view, visit http://gerrit.cloudera.org:8080/4133 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] Cleanup/refactor tracking of consensus watermarks
Kudu Jenkins has posted comments on this change. Change subject: Cleanup/refactor tracking of consensus watermarks .. Patch Set 4: Build Started http://104.196.14.100/job/kudu-gerrit/3145/ -- To view, visit http://gerrit.cloudera.org:8080/4133 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2aa294472f018013d88f36a9358e9ebd9d5ed8f8 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No