[kudu-CR] raft_consensus_nonvoter-itest: deflake a bit

2018-11-14 Thread Adar Dembo (Code Review)
Adar Dembo has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/11918 )

Change subject: raft_consensus_nonvoter-itest: deflake a bit
..

raft_consensus_nonvoter-itest: deflake a bit

I saw a failure in ReplicaBehindWalGcThresholdITest.ReplicaReplacement
(GetParam() was (1, false)) just after the master was restarted:

  raft_consensus_nonvoter-itest.cc:2070: Failure
  Failed
  Bad status: Service unavailable: Leader not yet ready to serve requests

This is odd as there's a WaitForCatalogManager() call in there, so why would
a subsequent GetTabletLocations RPC return this ServiceUnavailable? As best
I can tell, the only way for this to happen is if the attempt to grab the
leadership lock from within the ListTables RPC (sent from
WaitForCatalogManager()) returns IllegalState, which it'll do if the
UUID in the master's cstate doesn't match the UUID on disk. Perhaps this can
happen during a leader master election; maybe the cstate's UUID becomes
empty for a little while? If that's true, this should fix the problem by
considering IllegalState to be a non-final state and continuing the loop.
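
The idea of treating IllegalState as a non-final state can be sketched as a small polling loop. This is a minimal illustration, not the actual WaitForCatalogManager() code: the Code enum, Result struct, IsTransient() and WaitUntilReady() are hypothetical names, and the assumption that ServiceUnavailable is also retried is mine rather than something stated in this change.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <thread>

// Hypothetical stand-in for a status code; this is not Kudu's Status class.
enum class Code { kOk, kServiceUnavailable, kIllegalState, kNetworkError, kTimedOut };

struct Result {
  Code code;
  std::string message;
};

// Returns true for codes that mean "not ready yet, try again".
// Counting kIllegalState as transient mirrors the fix described above;
// counting kServiceUnavailable as transient is an assumption of this sketch.
bool IsTransient(Code c) {
  return c == Code::kServiceUnavailable || c == Code::kIllegalState;
}

// Polls `probe` until it succeeds, returns a final error, or the timeout expires.
Result WaitUntilReady(const std::function<Result()>& probe,
                      std::chrono::milliseconds timeout,
                      std::chrono::milliseconds backoff) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (true) {
    Result r = probe();
    if (r.code == Code::kOk || !IsTransient(r.code)) {
      return r;  // Success or a final error: stop looping.
    }
    if (std::chrono::steady_clock::now() >= deadline) {
      return {Code::kTimedOut, "timed out; last error: " + r.message};
    }
    std::this_thread::sleep_for(backoff);
  }
}
```

With a loop of this shape, a probe that briefly reports IllegalState during an election no longer ends the wait early; only success, a non-transient error, or the timeout does.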

I couldn't repro this failure, but Alexey managed to do so in a dist-test
loop with special latency injection enabled. Without the fix, 93 out of 256
runs failed, and with the fix, none failed.

Change-Id: I8192bd669e7e309943ea82718dd715238d520bbd
Reviewed-on: http://gerrit.cloudera.org:8080/11918
Reviewed-by: Alexey Serbin 
Tested-by: Kudu Jenkins
---
M src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc
M src/kudu/mini-cluster/external_mini_cluster.cc
M src/kudu/mini-cluster/external_mini_cluster.h
3 files changed, 21 insertions(+), 8 deletions(-)

Approvals:
  Alexey Serbin: Looks good to me, approved
  Kudu Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/11918
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I8192bd669e7e309943ea82718dd715238d520bbd
Gerrit-Change-Number: 11918
Gerrit-PatchSet: 3
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins (120)


[kudu-CR] raft_consensus_nonvoter-itest: deflake a bit

2018-11-14 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/11918 )

Change subject: raft_consensus_nonvoter-itest: deflake a bit
..


Patch Set 2: Code-Review+2



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8192bd669e7e309943ea82718dd715238d520bbd
Gerrit-Change-Number: 11918
Gerrit-PatchSet: 2
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Wed, 14 Nov 2018 23:26:16 +
Gerrit-HasComments: No


[kudu-CR] raft_consensus_nonvoter-itest: deflake a bit

2018-11-14 Thread Adar Dembo (Code Review)
Hello Alexey Serbin, Kudu Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/11918

to look at the new patch set (#2).

Change subject: raft_consensus_nonvoter-itest: deflake a bit
..

Change-Id: I8192bd669e7e309943ea82718dd715238d520bbd
---
M src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc
M src/kudu/mini-cluster/external_mini_cluster.cc
M src/kudu/mini-cluster/external_mini_cluster.h
3 files changed, 21 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/18/11918/2

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8192bd669e7e309943ea82718dd715238d520bbd
Gerrit-Change-Number: 11918
Gerrit-PatchSet: 2
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins (120)


[kudu-CR] raft_consensus_nonvoter-itest: deflake a bit

2018-11-14 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/11918 )

Change subject: raft_consensus_nonvoter-itest: deflake a bit
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/11918/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11918/1//COMMIT_MSG@21
PS1, Line 21: master's cstate doesn't match the UUID on disk
As it turned out, the reason was that the master's cstate had no leader:
(cstate.leader_uuid() != uuid) evaluated to true because it compared the UUID
of the system tablet replica with an empty string.


http://gerrit.cloudera.org:8080/#/c/11918/1//COMMIT_MSG@22
PS1, Line 22: maybe the cstate's UUID becomes
: empty for a little while
> I took a look at the test logs.  As far as I can see, that was ServiceUnava
Additional information: since I was not able to repro the initial issue in
over 1K runs, I injected random latency there:
https://gerrit.cloudera.org/#/c/11931/

With the injected latency, 93 out of 256 runs failed, 3 of them with the exact
error message from ListTablets seen in the original flaky run:
  http://dist-test.cloudera.org//job?job_id=aserbin.1542234637.28507

After I applied the patch, there was not a single failure in 256 runs:
  http://dist-test.cloudera.org//job?job_id=aserbin.1542235694.36306




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8192bd669e7e309943ea82718dd715238d520bbd
Gerrit-Change-Number: 11918
Gerrit-PatchSet: 1
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Wed, 14 Nov 2018 23:05:46 +
Gerrit-HasComments: Yes


[kudu-CR] raft_consensus_nonvoter-itest: deflake a bit

2018-11-14 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/11918 )

Change subject: raft_consensus_nonvoter-itest: deflake a bit
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11918/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11918/1//COMMIT_MSG@22
PS1, Line 22: maybe the cstate's UUID becomes
: empty for a little while
> I'm not sure about this: as I see from the code in catalog_manager.cc, the
I took a look at the test logs. As far as I can see, that was a
ServiceUnavailable status returned by MasterServiceImpl::GetTabletLocations()
because of the timing of the master restart and the re-election that follows
the startup of a single master.




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8192bd669e7e309943ea82718dd715238d520bbd
Gerrit-Change-Number: 11918
Gerrit-PatchSet: 1
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Wed, 14 Nov 2018 18:15:02 +
Gerrit-HasComments: Yes


[kudu-CR] raft_consensus_nonvoter-itest: deflake a bit

2018-11-13 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/11918 )

Change subject: raft_consensus_nonvoter-itest: deflake a bit
..


Patch Set 1: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11918/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11918/1//COMMIT_MSG@22
PS1, Line 22: maybe the cstate's UUID becomes
: empty for a little while
I'm not sure about this: as far as I can see from the code in
catalog_manager.cc, the message "Leader not yet ready to serve requests"
corresponds to the case when the cached leader term doesn't match the term
from the consensus state.

The fix looks good to me -- at least it should not make it worse.




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8192bd669e7e309943ea82718dd715238d520bbd
Gerrit-Change-Number: 11918
Gerrit-PatchSet: 1
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 13 Nov 2018 17:45:52 +
Gerrit-HasComments: Yes


[kudu-CR] raft_consensus_nonvoter-itest: deflake a bit

2018-11-09 Thread Adar Dembo (Code Review)
Hello Alexey Serbin,

I'd like you to do a code review. Please visit

http://gerrit.cloudera.org:8080/11918

to review the following change.


Change subject: raft_consensus_nonvoter-itest: deflake a bit
..

raft_consensus_nonvoter-itest: deflake a bit

I saw a failure in ReplicaBehindWalGcThresholdITest.ReplicaReplacement
(GetParam() was (1, false)) just after the master was restarted:

  raft_consensus_nonvoter-itest.cc:2070: Failure
  Failed
  Bad status: Service unavailable: Leader not yet ready to serve requests

This is odd as there's a WaitForCatalogManager() call in there, so why would
a subsequent GetTabletLocations RPC return this ServiceUnavailable? As best
I can tell, the only way for this to happen is if the attempt to grab the
leadership lock from within the ListTables RPC (sent from
WaitForCatalogManager()) returns IllegalState, which it'll do if the
UUID in the master's cstate doesn't match the UUID on disk. Perhaps this can
happen during a leader master election; maybe the cstate's UUID becomes
empty for a little while?

If that's true, this should fix the problem by considering IllegalState to
be a non-final state and continuing the loop.

Change-Id: I8192bd669e7e309943ea82718dd715238d520bbd
---
M src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc
M src/kudu/mini-cluster/external_mini_cluster.cc
M src/kudu/mini-cluster/external_mini_cluster.h
3 files changed, 21 insertions(+), 8 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/18/11918/1

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I8192bd669e7e309943ea82718dd715238d520bbd
Gerrit-Change-Number: 11918
Gerrit-PatchSet: 1
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin