[kudu-CR] raft consensus nonvoter-itest: deflake a bit
Adar Dembo has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/11918 )

Change subject: raft_consensus_nonvoter-itest: deflake a bit
..

raft_consensus_nonvoter-itest: deflake a bit

I saw a failure in ReplicaBehindWalGcThresholdITest.ReplicaReplacement (GetParam() was (1, false)) just after the master was restarted:

  raft_consensus_nonvoter-itest.cc:2070: Failure
  Failed
  Bad status: Service unavailable: Leader not yet ready to serve requests

This is odd as there's a WaitForCatalogManager() call in there, so why would a subsequent GetTabletLocations RPC return this ServiceUnavailable?

As best I can tell, the only way for this to happen is if the attempt to grab the leadership lock from within the ListTables RPC (sent from WaitForCatalogManager()) returns IllegalState, which it'll do if the UUID in the master's cstate doesn't match the UUID on disk. Perhaps this can happen during a leader master election; maybe the cstate's UUID becomes empty for a little while? If that's true, this should fix the problem by considering IllegalState to be a non-final state and continuing the loop.

I couldn't repro this failure, but Alexey managed to do so in a dist-test loop with special latency injection enabled. Without the fix, 93 out of 256 runs failed, and with the fix, none failed.

Change-Id: I8192bd669e7e309943ea82718dd715238d520bbd Reviewed-on: http://gerrit.cloudera.org:8080/11918 Reviewed-by: Alexey Serbin Tested-by: Kudu Jenkins --- M src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc M src/kudu/mini-cluster/external_mini_cluster.cc M src/kudu/mini-cluster/external_mini_cluster.h 3 files changed, 21 insertions(+), 8 deletions(-) Approvals: Alexey Serbin: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/11918 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I8192bd669e7e309943ea82718dd715238d520bbd Gerrit-Change-Number: 11918 Gerrit-PatchSet: 3 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins (120)
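Editorial sketch: the commit message above describes the fix as treating IllegalState as a non-final state and continuing the wait loop. The following is a minimal, self-contained illustration of that idea only. The names `Code`, `Status`, `IsTransient`, and `WaitUntilReady` are hypothetical stand-ins invented for this example; they are not Kudu's `kudu::Status` or the actual `WaitForCatalogManager()` implementation, and a real loop would presumably also sleep between attempts and honor a deadline.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical, simplified status type for illustration (not kudu::Status).
enum class Code { kOk, kServiceUnavailable, kIllegalState, kNetworkError };

struct Status {
  Code code;
  std::string message;
  bool ok() const { return code == Code::kOk; }
};

// Returns true if the status means "not ready yet, try again".
// Counting kIllegalState as transient mirrors the idea in the commit
// message; whether that is appropriate elsewhere is an assumption.
inline bool IsTransient(const Status& s) {
  return s.code == Code::kServiceUnavailable ||
         s.code == Code::kIllegalState;
}

// Calls `probe` up to `max_attempts` times. Stops at the first OK status
// or the first non-transient error and returns the last status observed.
inline Status WaitUntilReady(const std::function<Status()>& probe,
                             int max_attempts) {
  assert(max_attempts > 0);
  Status last{Code::kServiceUnavailable, "not attempted"};
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    last = probe();
    if (last.ok() || !IsTransient(last)) {
      break;  // Final state: success or a hard error.
    }
    // Transient state: fall through and probe again.
  }
  return last;
}
```

With a probe that first reports an illegal state, then service unavailability, then success, the loop returns OK after three calls; a probe that reports a hard error such as `kNetworkError` ends the loop immediately.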
[kudu-CR] raft consensus nonvoter-itest: deflake a bit
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/11918 ) Change subject: raft_consensus_nonvoter-itest: deflake a bit .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/11918 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8192bd669e7e309943ea82718dd715238d520bbd Gerrit-Change-Number: 11918 Gerrit-PatchSet: 2 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Comment-Date: Wed, 14 Nov 2018 23:26:16 + Gerrit-HasComments: No
[kudu-CR] raft consensus nonvoter-itest: deflake a bit
Hello Alexey Serbin, Kudu Jenkins,

I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/11918 to look at the new patch set (#2).

Change subject: raft_consensus_nonvoter-itest: deflake a bit
..

raft_consensus_nonvoter-itest: deflake a bit

I saw a failure in ReplicaBehindWalGcThresholdITest.ReplicaReplacement (GetParam() was (1, false)) just after the master was restarted:

  raft_consensus_nonvoter-itest.cc:2070: Failure
  Failed
  Bad status: Service unavailable: Leader not yet ready to serve requests

This is odd as there's a WaitForCatalogManager() call in there, so why would a subsequent GetTabletLocations RPC return this ServiceUnavailable?

As best I can tell, the only way for this to happen is if the attempt to grab the leadership lock from within the ListTables RPC (sent from WaitForCatalogManager()) returns IllegalState, which it'll do if the UUID in the master's cstate doesn't match the UUID on disk. Perhaps this can happen during a leader master election; maybe the cstate's UUID becomes empty for a little while? If that's true, this should fix the problem by considering IllegalState to be a non-final state and continuing the loop.

I couldn't repro this failure, but Alexey managed to do so in a dist-test loop with special latency injection enabled. Without the fix, 93 out of 256 runs failed, and with the fix, none failed.

Change-Id: I8192bd669e7e309943ea82718dd715238d520bbd --- M src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc M src/kudu/mini-cluster/external_mini_cluster.cc M src/kudu/mini-cluster/external_mini_cluster.h 3 files changed, 21 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/18/11918/2 -- To view, visit http://gerrit.cloudera.org:8080/11918 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I8192bd669e7e309943ea82718dd715238d520bbd Gerrit-Change-Number: 11918 Gerrit-PatchSet: 2 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins (120)
[kudu-CR] raft consensus nonvoter-itest: deflake a bit
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/11918 ) Change subject: raft_consensus_nonvoter-itest: deflake a bit .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/11918/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/11918/1//COMMIT_MSG@21 PS1, Line 21: master's cstate doesn't match the UUID on disk As it turned out, the reason was that the master's cstate had no leader, i.e. (cstate.leader_uuid() != uuid) would yield true since it was comparing the UUID of the system tablet replica with an empty string. http://gerrit.cloudera.org:8080/#/c/11918/1//COMMIT_MSG@22 PS1, Line 22: maybe the cstate's UUID becomes : empty for a little while > I took a look at the test logs. As far as I can see, that was ServiceUnava Additional information: since I was not able to repro the initial issue in over 1K runs, I injected random latency there: https://gerrit.cloudera.org/#/c/11931/ With that, 93 out of 256 runs failed, 3 of them with the exact error message from ListTablets as in the original flaky run: http://dist-test.cloudera.org//job?job_id=aserbin.1542234637.28507 After I applied the patch, there was not a single failure in 256 runs: http://dist-test.cloudera.org//job?job_id=aserbin.1542235694.36306 -- To view, visit http://gerrit.cloudera.org:8080/11918 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8192bd669e7e309943ea82718dd715238d520bbd Gerrit-Change-Number: 11918 Gerrit-PatchSet: 1 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Comment-Date: Wed, 14 Nov 2018 23:05:46 + Gerrit-HasComments: Yes
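Editorial sketch: the comment above observes that a comparison of the form (cstate.leader_uuid() != uuid) yields true when the consensus state has no leader, because the leader UUID is then an empty string. The snippet below only illustrates why a bare inequality cannot tell "no leader elected yet" apart from "another node is the leader". `ConsensusState`, `LeaderCheck`, `NaiveIsNotLeader`, and `CheckLeadership` are hypothetical names made up for this example and are not taken from catalog_manager.cc.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a consensus state; an empty leader_uuid
// is assumed to mean that no leader is currently known.
struct ConsensusState {
  std::string leader_uuid;
};

enum class LeaderCheck { kIsLeader, kNoLeader, kOtherLeader };

// The bare comparison: true both when another node leads and when
// there is no leader at all (empty string != any non-empty uuid).
inline bool NaiveIsNotLeader(const ConsensusState& cstate,
                             const std::string& uuid) {
  return cstate.leader_uuid != uuid;
}

// A three-way check that separates the "no leader yet" case, which a
// caller might reasonably treat as transient, from "someone else leads".
inline LeaderCheck CheckLeadership(const ConsensusState& cstate,
                                   const std::string& uuid) {
  if (cstate.leader_uuid.empty()) {
    return LeaderCheck::kNoLeader;
  }
  return cstate.leader_uuid == uuid ? LeaderCheck::kIsLeader
                                    : LeaderCheck::kOtherLeader;
}
```

Under these assumptions, a state with an empty leader UUID makes the naive check report "not leader" exactly as it would for a different leader, while the three-way check reports the missing leader explicitly.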
[kudu-CR] raft consensus nonvoter-itest: deflake a bit
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/11918 ) Change subject: raft_consensus_nonvoter-itest: deflake a bit .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/11918/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/11918/1//COMMIT_MSG@22 PS1, Line 22: maybe the cstate's UUID becomes : empty for a little while > I'm not sure about this: as I see from the code in catalog_manager.cc, the I took a look at the test logs. As far as I can see, that was ServiceUnavailable status returned by MasterServiceImpl::GetTabletLocations() because of master restart timings and following re-election upon start up of a single master. -- To view, visit http://gerrit.cloudera.org:8080/11918 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8192bd669e7e309943ea82718dd715238d520bbd Gerrit-Change-Number: 11918 Gerrit-PatchSet: 1 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Comment-Date: Wed, 14 Nov 2018 18:15:02 + Gerrit-HasComments: Yes
[kudu-CR] raft consensus nonvoter-itest: deflake a bit
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/11918 ) Change subject: raft_consensus_nonvoter-itest: deflake a bit .. Patch Set 1: Code-Review+2 (1 comment) http://gerrit.cloudera.org:8080/#/c/11918/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/11918/1//COMMIT_MSG@22 PS1, Line 22: maybe the cstate's UUID becomes : empty for a little while I'm not sure about this: as I see from the code in catalog_manager.cc, the message "Leader not yet ready to serve requests" corresponds to the case when cached leader term doesn't correspond to the term from the consensus state. The fix looks good to me -- at least it should not make it worse. -- To view, visit http://gerrit.cloudera.org:8080/11918 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8192bd669e7e309943ea82718dd715238d520bbd Gerrit-Change-Number: 11918 Gerrit-PatchSet: 1 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Comment-Date: Tue, 13 Nov 2018 17:45:52 + Gerrit-HasComments: Yes
[kudu-CR] raft consensus nonvoter-itest: deflake a bit
Hello Alexey Serbin,

I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/11918 to review the following change.

Change subject: raft_consensus_nonvoter-itest: deflake a bit
..

raft_consensus_nonvoter-itest: deflake a bit

I saw a failure in ReplicaBehindWalGcThresholdITest.ReplicaReplacement (GetParam() was (1, false)) just after the master was restarted:

  raft_consensus_nonvoter-itest.cc:2070: Failure
  Failed
  Bad status: Service unavailable: Leader not yet ready to serve requests

This is odd as there's a WaitForCatalogManager() call in there, so why would a subsequent GetTabletLocations RPC return this ServiceUnavailable?

As best I can tell, the only way for this to happen is if the attempt to grab the leadership lock from within the ListTables RPC (sent from WaitForCatalogManager()) returns IllegalState, which it'll do if the UUID in the master's cstate doesn't match the UUID on disk. Perhaps this can happen during a leader master election; maybe the cstate's UUID becomes empty for a little while? If that's true, this should fix the problem by considering IllegalState to be a non-final state and continuing the loop.

Change-Id: I8192bd669e7e309943ea82718dd715238d520bbd --- M src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc M src/kudu/mini-cluster/external_mini_cluster.cc M src/kudu/mini-cluster/external_mini_cluster.h 3 files changed, 21 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/18/11918/1 -- To view, visit http://gerrit.cloudera.org:8080/11918 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I8192bd669e7e309943ea82718dd715238d520bbd Gerrit-Change-Number: 11918 Gerrit-PatchSet: 1 Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Alexey Serbin