[kudu-CR] make election timeout jitter more aggressive
Dan Burkert has submitted this change and it was merged. Change subject: make election timeout jitter more aggressive .. make election timeout jitter more aggressive Random election timeout jitter is necessary in Raft in order to guarantee that an election can be won. If the jitter is smaller than RTT or the accuracy of clocks, then elections could fail indefinitely. We frequently hit an issue during tests where timeouts tend to 'clump' together, causing elections to retry many times in a row, ultimately leading to test timeout. This commit increases the jitter, so that election timeout differences between nodes will hopefully be greater than the clock error. This issue could also manifest if the RTT between nodes is high. Change-Id: I2c9dad820c2b7d4bc4b9e791b78222559cdf63c8 Reviewed-on: http://gerrit.cloudera.org:8080/3828 Tested-by: Kudu Jenkins Reviewed-by: Dan Burkert --- M src/kudu/consensus/raft_consensus.cc 1 file changed, 1 insertion(+), 1 deletion(-) Approvals: Dan Burkert: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/3828 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: I2c9dad820c2b7d4bc4b9e791b78222559cdf63c8 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan Burkert Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
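A rough way to see why the jitter has to exceed the RTT or clock error is the sketch below. It is a simplified model and not taken from Kudu: it assumes two candidates whose jitter is drawn independently and uniformly from [0, spread), and it treats an election as likely to collide when the two timeouts land within `window_s` of each other. The name `ClumpProbability` and both parameters are hypothetical.

```cpp
#include <algorithm>

// Hypothetical helper, not Kudu code. For two independent uniform draws on
// [0, spread_s), the probability that they differ by less than window_s is
// 1 - (1 - window_s / spread_s)^2, and it saturates at 1 once the window is
// at least as wide as the spread.
double ClumpProbability(double spread_s, double window_s) {
  if (spread_s <= 0.0) {
    return 1.0;  // No jitter at all: the two timeouts always coincide.
  }
  const double ratio = std::clamp(window_s / spread_s, 0.0, 1.0);
  return 1.0 - (1.0 - ratio) * (1.0 - ratio);
}
```

Under this model, widening the spread for a fixed window lowers the chance that two timeouts clump together, and a window at least as large as the spread makes a collision certain on every attempt.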
[kudu-CR] make election timeout jitter more aggressive
Dan Burkert has posted comments on this change. Change subject: make election timeout jitter more aggressive .. Patch Set 3: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/3828 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2c9dad820c2b7d4bc4b9e791b78222559cdf63c8 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan Burkert Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] make election timeout jitter more aggressive
Hello Todd Lipcon, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3828 to look at the new patch set (#3). Change subject: make election timeout jitter more aggressive .. make election timeout jitter more aggressive Random election timeout jitter is necessary in Raft in order to guarantee that an election can be won. If the jitter is smaller than RTT or the accuracy of clocks, then elections could fail indefinitely. We frequently hit an issue during tests where timeouts tend to 'clump' together, causing elections to retry many times in a row, ultimately leading to test timeout. This commit increases the jitter, so that election timeout differences between nodes will hopefully be greater than the clock error. This issue could also manifest if the RTT between nodes is high. Change-Id: I2c9dad820c2b7d4bc4b9e791b78222559cdf63c8 --- M src/kudu/consensus/raft_consensus.cc 1 file changed, 1 insertion(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/28/3828/3 -- To view, visit http://gerrit.cloudera.org:8080/3828 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2c9dad820c2b7d4bc4b9e791b78222559cdf63c8 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan Burkert Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] make election timeout jitter more aggressive
Todd Lipcon has posted comments on this change. Change subject: make election timeout jitter more aggressive .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/3828 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2c9dad820c2b7d4bc4b9e791b78222559cdf63c8 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan Burkert Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] make election timeout jitter more aggressive
Dan Burkert has posted comments on this change. Change subject: make election timeout jitter more aggressive .. Patch Set 1: I've changed this so it's just making the jitter more aggressive. The clamp is still a theoretical issue, but it would require an unrealistic 20s RTT to manifest. -- To view, visit http://gerrit.cloudera.org:8080/3828 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2c9dad820c2b7d4bc4b9e791b78222559cdf63c8 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan Burkert Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] make election timeout jitter more aggressive
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/3828 to look at the new patch set (#2). Change subject: make election timeout jitter more aggressive .. make election timeout jitter more aggressive Random election timeout jitter is necessary in Raft in order to guarantee that an election can be won. If the jitter is smaller than RTT or the accuracy of clocks, then elections could fail indefinitely. We frequently hit an issue during tests where timeouts tend to 'clump' together, causing elections to retry many times in a row, ultimately leading to test timeout. This commit increases the jitter, so that election timeout differences between nodes will hopefully be greater than the clock error. This issue could also manifest if the RTT between nodes is high. Change-Id: I2c9dad820c2b7d4bc4b9e791b78222559cdf63c8 --- M src/kudu/consensus/raft_consensus.cc 1 file changed, 1 insertion(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/28/3828/2 -- To view, visit http://gerrit.cloudera.org:8080/3828 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2c9dad820c2b7d4bc4b9e791b78222559cdf63c8 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan Burkert Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] make election timeout jitter more aggressive
Todd Lipcon has posted comments on this change. Change subject: make election timeout jitter more aggressive .. Patch Set 1: Let's try and close this one out soon? Not sure where the conversation got left. -- To view, visit http://gerrit.cloudera.org:8080/3828 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2c9dad820c2b7d4bc4b9e791b78222559cdf63c8 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan Burkert Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] make election timeout jitter more aggressive
Dan Burkert has posted comments on this change. Change subject: make election timeout jitter more aggressive .. Patch Set 1: Thinking about this more, the 20s clamp is probably OK. It means we could theoretically not make progress if RTTs are > 20s, but we already can't make progress in that situation anyway since the election timeout is 1.5s. -- To view, visit http://gerrit.cloudera.org:8080/3828 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2c9dad820c2b7d4bc4b9e791b78222559cdf63c8 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan Burkert Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
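To put a number on when a 20s clamp would start to matter, here is a minimal sketch. It rests on two assumptions that the thread does not confirm: that the jitter spread grows as base * (factor^(n+1) - 1) after n failed elections (a formula that reproduces the spread figures quoted elsewhere in this thread), and that the clamp applies to the spread itself. The function names are hypothetical and this is not the code in raft_consensus.cc.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: jitter spread after `failed_elections` failures,
// limited to `cap_s`. The growth formula is an assumption (see lead-in).
double ClampedSpreadSeconds(double base_timeout_s, double backoff_factor,
                            int failed_elections, double cap_s) {
  const double spread =
      base_timeout_s * (std::pow(backoff_factor, failed_elections + 1) - 1.0);
  return std::min(spread, cap_s);
}

// Hypothetical helper: smallest number of failed elections at which the
// unclamped spread would exceed `cap_s`. Returns -1 if the spread never grows.
int FailuresUntilClamped(double base_timeout_s, double backoff_factor,
                         double cap_s) {
  if (base_timeout_s <= 0.0 || backoff_factor <= 1.0) {
    return -1;
  }
  int n = 0;
  while (base_timeout_s * (std::pow(backoff_factor, n + 1) - 1.0) <= cap_s) {
    ++n;
  }
  return n;
}
```

With a 1.5s base, a 1.5 backoff factor, and a 20s cap, `FailuresUntilClamped` returns 6 under these assumptions, so the clamp would only come into play after several consecutive failed elections.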
[kudu-CR] make election timeout jitter more aggressive
Dan Burkert has posted comments on this change. Change subject: make election timeout jitter more aggressive .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/3828/1//COMMIT_MSG Commit Message: Line 7: make election timeout jitter more aggressive > I agree with Todd, backoff cap avoids insane exponential backoffs. It seems I don't think I'm following. How can you add more jitter without increasing the average timeout? We are, after all, bounded by a minimum timeout of 0. The jitter *must* increase as a function of the number of retries, otherwise you risk a situation where the cluster can't make progress due to RTT being greater than the jitter. -- To view, visit http://gerrit.cloudera.org:8080/3828 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2c9dad820c2b7d4bc4b9e791b78222559cdf63c8 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan Burkert Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] make election timeout jitter more aggressive
Mike Percy has posted comments on this change. Change subject: make election timeout jitter more aggressive .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/3828/1//COMMIT_MSG Commit Message: Line 7: make election timeout jitter more aggressive > yea, but the jitter is only aggressive due to the backoff being more aggres I agree with Todd, backoff cap avoids insane exponential backoffs. It seems like the jitter is what we are really worried about here. And TBH I'm not convinced this is the problem. Although I'd support wider jitter variance. On a sort of side note, I tried to add a generic exponential backoff helper a long time ago in https://gerrit.cloudera.org/#/c/979/ ... maybe we should partially resurrect that patch and work on ensuring that we have a single exponential backoff function that is parameterized and flexible enough to handle all situations. -- To view, visit http://gerrit.cloudera.org:8080/3828 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2c9dad820c2b7d4bc4b9e791b78222559cdf63c8 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan Burkert Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] make election timeout jitter more aggressive
Todd Lipcon has posted comments on this change. Change subject: make election timeout jitter more aggressive .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/3828/1//COMMIT_MSG Commit Message: Line 7: make election timeout jitter more aggressive > The lower bound timeout isn't changed, only the upper bound. So the range yea, but the jitter is only aggressive due to the backoff being more aggressive -- To view, visit http://gerrit.cloudera.org:8080/3828 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2c9dad820c2b7d4bc4b9e791b78222559cdf63c8 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan Burkert Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] make election timeout jitter more aggressive
Dan Burkert has posted comments on this change. Change subject: make election timeout jitter more aggressive .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/3828/1//COMMIT_MSG Commit Message: Line 7: make election timeout jitter more aggressive > isn't it making the backoff more aggressive rather than making the jitter m The lower bound timeout isn't changed, only the upper bound. So the range of backoff times is greatly increased. For instance, the previous algorithm had spreads of (0.15s, 0.315s, 0.4965s, 0.696s) after 0, 1, 2, 3 failed elections, respectively. The new spreads are (0.75s, 1.875s, 3.56s, 6.09s). The actual timeout is the base (1.5s) plus a random value between 0 and the spread. -- To view, visit http://gerrit.cloudera.org:8080/3828 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2c9dad820c2b7d4bc4b9e791b78222559cdf63c8 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan Burkert Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
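The spread figures in the comment above are consistent with spread = base * (factor^(n+1) - 1), using a 1.5s base and a factor of 1.1 for the old behaviour and 1.5 for the new one. That formula is inferred from the quoted numbers rather than read from raft_consensus.cc, so the following is only a sketch under that assumption; `JitterSpreadSeconds` and `NextElectionTimeoutSeconds` are hypothetical names.

```cpp
#include <cmath>
#include <random>

// Hypothetical helper, not Kudu's actual code: upper bound of the random
// jitter added to the base timeout after `failed_elections` failures.
// Assumed formula: base * (factor^(n+1) - 1).
double JitterSpreadSeconds(double base_timeout_s, double backoff_factor,
                           int failed_elections) {
  return base_timeout_s *
         (std::pow(backoff_factor, failed_elections + 1) - 1.0);
}

// Base timeout plus a uniform random value in [0, spread), as described in
// the comment. Requires backoff_factor > 1 so that the spread is positive.
double NextElectionTimeoutSeconds(double base_timeout_s, double backoff_factor,
                                  int failed_elections, std::mt19937& rng) {
  const double spread =
      JitterSpreadSeconds(base_timeout_s, backoff_factor, failed_elections);
  std::uniform_real_distribution<double> jitter(0.0, spread);
  return base_timeout_s + jitter(rng);
}
```

Calling `JitterSpreadSeconds(1.5, 1.1, n)` for n = 0..3 gives roughly 0.15, 0.315, 0.4965, and 0.696, and `JitterSpreadSeconds(1.5, 1.5, n)` gives 0.75, 1.875, 3.5625, and 6.09375, which line up with the two sets of spreads quoted above.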
[kudu-CR] make election timeout jitter more aggressive
Todd Lipcon has posted comments on this change. Change subject: make election timeout jitter more aggressive .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/3828/1//COMMIT_MSG Commit Message: Line 7: make election timeout jitter more aggressive isn't it making the backoff more aggressive rather than making the jitter more aggressive? ie the biggest change is going from 1.1 base to 1.5? -- To view, visit http://gerrit.cloudera.org:8080/3828 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2c9dad820c2b7d4bc4b9e791b78222559cdf63c8 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan Burkert Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] make election timeout jitter more aggressive
Kudu Jenkins has posted comments on this change. Change subject: make election timeout jitter more aggressive .. Patch Set 1: Build Started http://104.196.14.100/job/kudu-gerrit/2697/ -- To view, visit http://gerrit.cloudera.org:8080/3828 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2c9dad820c2b7d4bc4b9e791b78222559cdf63c8 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan Burkert Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] make election timeout jitter more aggressive
Hello Adar Dembo, Todd Lipcon, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/3828 to review the following change. Change subject: make election timeout jitter more aggressive .. make election timeout jitter more aggressive Existing election timeouts have very low variance and are capped at a maximum value. Having a variance cap is problematic because it could cause Raft to not make progress when the RTT between nodes is greater than the cap. Counter-intuitively, having a low variance in timeouts causes elections to take longer since it leads to more frequent election retries. This commit removes the cap and increases the variance. Change-Id: I2c9dad820c2b7d4bc4b9e791b78222559cdf63c8 --- M src/kudu/consensus/raft_consensus.cc 1 file changed, 2 insertions(+), 10 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/28/3828/1 -- To view, visit http://gerrit.cloudera.org:8080/3828 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I2c9dad820c2b7d4bc4b9e791b78222559cdf63c8 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan Burkert Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Todd Lipcon