[kudu-CR] c++ client: use operation timeout as deadline for finding new leader master

2016-07-22 Thread Adar Dembo (Code Review)
Adar Dembo has submitted this change and it was merged.

Change subject: c++ client: use operation timeout as deadline for finding new 
leader master
..


c++ client: use operation timeout as deadline for finding new leader master

We had been using the default RPC timeout, which may be set to a very low
value as in ClientStressTest_MultiMaster_TestLeaderResolutionTimeout.
Now we'll use the operation timeout as the overall deadline while still
preserving the semantics of using the default RPC timeout for the
GetMasterRegistration() RPCs themselves.

As my patch series removes the guarantee that a leader master is elected at
the time that cluster tests run, it's important that the logic for finding
the leader master provide ample time for an election to finish.

Also, I think I've addressed the root cause behind KUDU-573 by fixing a race
in GetLeaderMasterRpc's SendRpcCb() and GetMasterRegistrationRpcCbForNode()
methods. The race manifests when the last two RPC responses are "I am the
leader" and "I am not the leader" respectively. In one interleaving, both
responses enter SendRpcCb(), and the second calls DelayedRetryCb(). If that
were a call to DelayedRetry() instead, the GetLeaderMasterRpc would be
destroyed by the time the reactor thread reran the RPC.

Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e
Reviewed-on: http://gerrit.cloudera.org:8080/3718
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon 
---
M src/kudu/client/client-internal.cc
M src/kudu/integration-tests/external_mini_cluster.cc
M src/kudu/master/master_rpc.cc
M src/kudu/master/master_rpc.h
M src/kudu/tserver/heartbeater.cc
5 files changed, 63 insertions(+), 38 deletions(-)

Approvals:
  Todd Lipcon: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/3718
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] c++ client: use operation timeout as deadline for finding new leader master

2016-07-22 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: c++ client: use operation timeout as deadline for finding new 
leader master
..


Patch Set 2:

Alright, let's try this and see if people complain about the UX. (I'm going to 
include myself in "people" in this case... may complain later)

-- 
To view, visit http://gerrit.cloudera.org:8080/3718
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] c++ client: use operation timeout as deadline for finding new leader master

2016-07-22 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: c++ client: use operation timeout as deadline for finding new 
leader master
..


Patch Set 2: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/3718
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] c++ client: use operation timeout as deadline for finding new leader master

2016-07-21 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: c++ client: use operation timeout as deadline for finding new 
leader master
..


Patch Set 2:

> to work around in that particular test, why not explicitly wait for
 > a leader election on the masters?

We could, but that's not 100% robust. An election can happen at any time, and 
while we'll tolerate that later on in the test when we're writing and timed out 
master RPCs are non-fatal, we won't in the beginning when we're trying to build 
the client. Unfortunately (for me, I guess), I would like to eliminate all 
sources of flakiness that I can.

 > I think the user experience is
 > not so great to say that if your cluster is down you have to wait
 > 60 seconds to get an error (even though you may be willing to wait
 > 60 seconds if you seem to be making some progress).
 > 
 > Perhaps we can "early out" in the case that you get NetworkError
 > from _all_ of the potential masters?

There's definitely an argument to be made for considering all of the responses 
in aggregate when making decisions (right now decisions are made based on the 
last response's status), but I don't think it's this. How do we differentiate 
between "the cluster is down for good" and "the cluster is down momentarily"? I 
think the only way to be faithful to the user's wishes is to adhere to the 
operation's deadline.

Another option is to introduce a third client-level timeout (alongside "default 
operation" and "default RPC") to be used solely for discovering the leader 
master. For this test, it'd be enough to keep it at its default value. But, 
it's more cognitive load for everyone else, so I've been reticent to go down 
that path.

-- 
To view, visit http://gerrit.cloudera.org:8080/3718
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] c++ client: use operation timeout as deadline for finding new leader master

2016-07-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: c++ client: use operation timeout as deadline for finding new 
leader master
..


Patch Set 2:

to work around in that particular test, why not explicitly wait for a leader 
election on the masters? I think the user experience is not so great to say 
that if your cluster is down you have to wait 60 seconds to get an error (even 
though you may be willing to wait 60 seconds if you seem to be making some 
progress).

Perhaps we can "early out" in the case that you get NetworkError from _all_ of 
the potential masters?

-- 
To view, visit http://gerrit.cloudera.org:8080/3718
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No