[kudu-CR] c++ client: use operation timeout as deadline for finding new leader master
Adar Dembo has submitted this change and it was merged. Change subject: c++ client: use operation timeout as deadline for finding new leader master .. c++ client: use operation timeout as deadline for finding new leader master We had been using the default RPC timeout, which may be set to a very low value as in ClientStressTest_MultiMaster_TestLeaderResolutionTimeout. Now we'll use the operation timeout as the overall deadline while still preserving the semantics of using the default RPC timeout for the GetMasterRegistration() RPCs themselves. As my patch series removes the guarantee that a leader master is elected at the time that cluster tests run, it's important that the logic for finding the leader master provide ample time for an election to finish. Also, I think I've addressed the root cause behind KUDU-573 by fixing a race in GetLeaderMasterRpc's SendRpcCb() and GetMasterRegistrationRpcCbForNode() methods. The race manifests when the last two RPC responses are "I am the leader" and "I am not the leader" respectively. In one interleaving, both responses enter SendRpcCb(), and the second calls DelayedRetryCb(). If that were a call to DelayedRetry() instead, the GetLeaderMasterRpc would be destroyed by the time the reactor thread reran the RPC. Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e Reviewed-on: http://gerrit.cloudera.org:8080/3718 Tested-by: Kudu Jenkins Reviewed-by: Todd Lipcon --- M src/kudu/client/client-internal.cc M src/kudu/integration-tests/external_mini_cluster.cc M src/kudu/master/master_rpc.cc M src/kudu/master/master_rpc.h M src/kudu/tserver/heartbeater.cc 5 files changed, 63 insertions(+), 38 deletions(-) Approvals: Todd Lipcon: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/3718 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] c++ client: use operation timeout as deadline for finding new leader master
Todd Lipcon has posted comments on this change. Change subject: c++ client: use operation timeout as deadline for finding new leader master .. Patch Set 2: Alright, let's try this and see if people complain about the UX. (I'm going to include myself in "people" in this case... may complain later) -- To view, visit http://gerrit.cloudera.org:8080/3718 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] c++ client: use operation timeout as deadline for finding new leader master
Todd Lipcon has posted comments on this change. Change subject: c++ client: use operation timeout as deadline for finding new leader master .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/3718 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] c++ client: use operation timeout as deadline for finding new leader master
Adar Dembo has posted comments on this change. Change subject: c++ client: use operation timeout as deadline for finding new leader master .. Patch Set 2: > to work around in that particular test, why not explicitly wait for > a leader election on the masters? We could, but that's not 100% robust. An election can happen at any time, and while we'll tolerate that later on in the test when we're writing and timed out master RPCs are non-fatal, we won't in the beginning when we're trying to build the client. Unfortunately (for me, I guess), I would like to eliminate all sources of flakiness that I can. > I think the user experience is > not so great to say that if your cluster is down you have to wait > 60 seconds to get an error (even though you may be willing to wait > 60 seconds if you seem to be making some progress). > > Perhaps we can "early out" in the case that you get NetworkError > from _all_ of the potential masters? There's definitely an argument to be made for considering all of the responses in aggregate when making decisions (right now decisions are made based on the last response's status), but I don't think it's this. How do we differentiate between "the cluster is down for good" and "the cluster is down momentarily"? I think the only way to be faithful to the user's wishes is to adhere to the operation's deadline. Another option is to introduce a third client-level timeout (alongside "default operation" and "default RPC") to be used solely for discovering the leader master. For this test, it'd be enough to keep it at its default value. But, it's more cognitive load for everyone else, so I've been reticent to go down that path. -- To view, visit http://gerrit.cloudera.org:8080/3718 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] c++ client: use operation timeout as deadline for finding new leader master
Todd Lipcon has posted comments on this change. Change subject: c++ client: use operation timeout as deadline for finding new leader master .. Patch Set 2: to work around in that particular test, why not explicitly wait for a leader election on the masters? I think the user experience is not so great to say that if your cluster is down you have to wait 60 seconds to get an error (even though you may be willing to wait 60 seconds if you seem to be making some progress). Perhaps we can "early out" in the case that you get NetworkError from _all_ of the potential masters? -- To view, visit http://gerrit.cloudera.org:8080/3718 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No