[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-30 Thread Andrew Wong (Code Review)
Andrew Wong has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/14222 )

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..

KUDU-2069 p4: stop replication from failed servers in maintenance mode

When determining whether a replica needs to be added, we may now
consider a set of of UUIDs that are allowed to be in a "bad" state while
not counting towards being under-replicated.

Since the goal of maintenance mode is to cope with unexpected failures,
"healthy" movement, e.g. through tooling that uses REPLACE and PROMOTE
tagging, is still allowed to and from tservers in maintenance mode.

Testing:
- a unit test is added to exercise the new quorum logic to ignore
  certain failed UUIDs, taking into account REPLACE and PROMOTE replicas
- integration tests are added to test:
  - behavior with RF=3 through restarts of the master
  - behavior when running move_replica tooling
  - behavior with RF=5 with background failures

Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Reviewed-on: http://gerrit.cloudera.org:8080/14222
Reviewed-by: Adar Dembo 
Reviewed-by: Alexey Serbin 
Tested-by: Andrew Wong 
---
M src/kudu/consensus/quorum_util-test.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/maintenance_mode-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
8 files changed, 657 insertions(+), 24 deletions(-)

Approvals:
  Adar Dembo: Looks good to me, but someone else must approve
  Alexey Serbin: Looks good to me, approved
  Andrew Wong: Verified

--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 7
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-30 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14222 )

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..


Patch Set 6: Verified+1

Unrelated failure. This doesn't touch the tserver at all.


--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 6
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Tue, 01 Oct 2019 04:34:25 +
Gerrit-HasComments: No


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-30 Thread Andrew Wong (Code Review)
Andrew Wong has removed a vote on this change.

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..


Removed Verified-1 by Kudu Jenkins (120)
--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 6
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-30 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14222 )

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..


Patch Set 6: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14222/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14222/6//COMMIT_MSG@13
PS6, Line 13: Since the goal of maintenance mode is to cope with unexpected 
failures,
: "healthy" movement, e.g. through tooling that uses REPLACE and 
PROMOTE
: tagging, is still allowed to and from tservers in maintenance 
mode.
That's great!



--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 6
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Tue, 01 Oct 2019 04:23:21 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-30 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14222 )

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..


Patch Set 6: Code-Review+1

Don't think the test failure is related, but not 100% sure.


--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 6
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Mon, 30 Sep 2019 22:28:27 +
Gerrit-HasComments: No


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-29 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14222 )

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..


Patch Set 6:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util-test.cc
File src/kudu/consensus/quorum_util-test.cc:

http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util-test.cc@645
PS5, Line 645:   // Even if the replica to replace is meant to be ignored on 
failure, we
 :   // should honor the replacement and try to add a replica.
 :   for (char health : { '+', '-', '?' }) {
 : RaftConfigPB config;
 : AddPeer(, "A", V, health, {{"REPLACE", true}});
 : AddPeer(, "B", V, '+');
 : AddPeer(, "C", V, '+');
 : EXPECT_TRUE(ShouldAddReplica(config, 3, { "A" }));
 :   }
 :   {
 : RaftConfigPB config;
 : AddPeer(, "A", V, '+', {{"REPLACE", true}});
 : AddPeer(, "B", V, '-');
 : AddPeer(, "C", V, '+');
 : // Ignoring failures shouldn't impede our ability to add a 
replica when the
 :
> nit: maybe, put this under for () cycle to iterate over the health status o
Done


http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util-test.cc@675
PS5, Line 675: p
> Does it make sense to add coverage for other health statuses as well (unkno
Done


http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util-test.cc@684
PS5, Line 684:   EXPECT_FALSE(ShouldAddReplica(config, 5, { "A", "B" }));
 :   // But when there is a failure that isn't suppose
> ditto
Done


http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util.cc
File src/kudu/consensus/quorum_util.cc:

http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util.cc@455
PS5, Line 455: if (VLOG_IS_ON(1) && ignore_failure_for_underreplication) {
 :   VLOG(1) << Substitute("ignoring $0 if failed", peer_uuid);
 : }
> Does it deserve VLOG(1) it's not important at all?
Done


http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/integration-tests/maintenance_mode-itest.cc
File src/kudu/integration-tests/maintenance_mode-itest.cc:

http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/integration-tests/maintenance_mode-itest.cc@269
PS5, Line 269:   ASSERT_OK(ChangeTServerState(maintenance_uuid, 
TServerStateChangePB::EXIT_MAINTENANCE_MODE));
> Thank you for adding this scenario.
Ack


http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/integration-tests/maintenance_mode-itest.cc@379
PS5, Line 379:   void SetUp() override {
 : SKIP_IF_SLOW_NOT_ALLOWED();
 : NO_FATALS(MaintenanceModeITest::SetUp());
 : NO_FATALS(SetUpCluster(5));
> Could it happen that replicas at the server fell behind WAL GC threshold si
Good point. I've separated the concern by disabling log GC and testing for 
FAILED_UNRECOVERABLE sepcifically using disk failures.


http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/integration-tests/maintenance_mode-itest.cc@385
PS5, Line 385: SKIP_IF_SLOW_NOT_ALLOWED();
> Does it make sense to verify the number of written rows after stopping the
Done



--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 6
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Sun, 29 Sep 2019 06:18:15 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-29 Thread Andrew Wong (Code Review)
Hello Tidy Bot, Alexey Serbin, Kudu Jenkins, Adar Dembo, Hao Hao,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14222

to look at the new patch set (#6).

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..

KUDU-2069 p4: stop replication from failed servers in maintenance mode

When determining whether a replica needs to be added, we may now
consider a set of of UUIDs that are allowed to be in a "bad" state while
not counting towards being under-replicated.

Since the goal of maintenance mode is to cope with unexpected failures,
"healthy" movement, e.g. through tooling that uses REPLACE and PROMOTE
tagging, is still allowed to and from tservers in maintenance mode.

Testing:
- a unit test is added to exercise the new quorum logic to ignore
  certain failed UUIDs, taking into account REPLACE and PROMOTE replicas
- integration tests are added to test:
  - behavior with RF=3 through restarts of the master
  - behavior when running move_replica tooling
  - behavior with RF=5 with background failures

Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
---
M src/kudu/consensus/quorum_util-test.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/maintenance_mode-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
8 files changed, 657 insertions(+), 24 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/22/14222/6
--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 6
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-27 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14222 )

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..


Patch Set 5:

(9 comments)

Overall looks good to me; a few nits.

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc
File src/kudu/consensus/quorum_util-test.cc:

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc@636
PS2, Line 636: EXPECT_FALSE(ShouldAddReplica(config, 5, { "A", "B" }));
> This is somewhat white-box knowledge, but the "whitelist" isn't even part o
I think you meant the set of servers to ignore is not a part of 
ShouldEvictReplica(): yes, indeed.  For some reason, it seemed to me that 
ShouldEvictReplica() is also affected by the maintenance mode.


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc@689
PS2, Line 689: EXPECT_FALSE(ShouldAddReplica(config, 5, { "A", "B" }));
> Sure, I'll add those cases.
Great, thanks a lot.


http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util-test.cc
File src/kudu/consensus/quorum_util-test.cc:

http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util-test.cc@645
PS5, Line 645:   {
 : RaftConfigPB config;
 : AddPeer(, "A", V, '+', {{"REPLACE", true}});
 : AddPeer(, "B", V, '+');
 : AddPeer(, "C", V, '+');
 : EXPECT_TRUE(ShouldAddReplica(config, 3, { "A" }));
 :   }
 :   {
 : RaftConfigPB config;
 : AddPeer(, "A", V, '-', {{"REPLACE", true}});
 : AddPeer(, "B", V, '+');
 : AddPeer(, "C", V, '+');
 : // Even if the replica to replace is meant to be ignored on 
failure, we
 : // should honor the replacement and try to add a replica.
 : EXPECT_TRUE(ShouldAddReplica(config, 3, { "A" }));
 :   }
nit: maybe, put this under for () cycle to iterate over the health status of 
replica "A"?  You could add the verification for the "unknown" health status of 
replica as well easily.


http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util-test.cc@675
PS5, Line 675: -
Does it make sense to add coverage for other health statuses as well (unknown, 
healthy) to explicitly state how the maintenance mode is interpreted for those?


http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util-test.cc@684
PS5, Line 684: AddPeer(, "A", N, '-', {{"PROMOTE", true}});
 : AddPeer(, "B", N, '-', {{"PROMOTE", true}});
ditto


http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util.cc
File src/kudu/consensus/quorum_util.cc:

http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/consensus/quorum_util.cc@455
PS5, Line 455: if (VLOG_IS_ON(2) && ignore_failure_for_underreplication) {
 :   VLOG(2) << Substitute("ignoring $0 if failed", peer_uuid);
 : }
Does it deserve VLOG(1) it's not important at all?


http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/integration-tests/maintenance_mode-itest.cc
File src/kudu/integration-tests/maintenance_mode-itest.cc:

http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/integration-tests/maintenance_mode-itest.cc@269
PS5, Line 269: TEST_F(MaintenanceModeRF3ITest, 
TestMaintenanceModeDoesntObstructMove) {
Thank you for adding this scenario.


http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/integration-tests/maintenance_mode-itest.cc@379
PS5, Line 379:   const TServerDetails* maintenance_details = FindOrDie(ts_map, 
maintenance_uuid);
 :   vector mnt_tablet_ids;
 :   ASSERT_OK(ListRunningTabletIds(maintenance_details, 
MonoDelta::FromSeconds(30), _tablet_ids));
 :   ASSERT_EQ(kNumTablets, mnt_tablet_ids.size());
Could it happen that replicas at the server fell behind WAL GC threshold since 
the workload is running in background?  If so, there might be an interval when 
such replicas first evicted, and then new replicas appear, replacing the 
evicted ones.  That might bring some flakiness at line 382, so maybe it's worth 
wrapping lines 380-382 into ASSERT_EVENTUALLY in that case.


http://gerrit.cloudera.org:8080/#/c/14222/5/src/kudu/integration-tests/maintenance_mode-itest.cc@385
PS5, Line 385:   NO_FATALS(create_table.StopAndJoin());
Does it make sense to verify the number of written rows after stopping the 
workload or that's too much?  Something like ClusterVerifier::CheckRowCount()



--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 5

[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-25 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14222 )

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..


Patch Set 5: Code-Review+1


--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 5
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Thu, 26 Sep 2019 05:56:24 +
Gerrit-HasComments: No


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-25 Thread Andrew Wong (Code Review)
Andrew Wong has removed a vote on this change.

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..


Removed Verified+1 by Andrew Wong 
--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 5
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-25 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14222 )

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..


Patch Set 5: Verified+1 Code-Review-1

I'll fix the IWYU issue, but I don't want it to gate review while I work on 
other things.


--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 5
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Thu, 26 Sep 2019 01:38:49 +
Gerrit-HasComments: No


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-25 Thread Andrew Wong (Code Review)
Andrew Wong has removed a vote on this change.

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..


Removed Code-Review-1 by Andrew Wong 
--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 5
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-25 Thread Andrew Wong (Code Review)
Andrew Wong has removed a vote on this change.

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..


Removed Verified-1 by Kudu Jenkins (120)
--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 5
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-25 Thread Andrew Wong (Code Review)
Hello Tidy Bot, Alexey Serbin, Kudu Jenkins, Adar Dembo, Hao Hao,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14222

to look at the new patch set (#5).

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..

KUDU-2069 p4: stop replication from failed servers in maintenance mode

When determining whether a replica needs to be added, we may now
consider a set of of UUIDs that are allowed to be in a "bad" state while
not counting towards being under-replicated.

Since the goal of maintenance mode is to cope with unexpected failures,
"healthy" movement, e.g. through tooling that uses REPLACE and PROMOTE
tagging, is still allowed to and from tservers in maintenance mode.

Testing:
- a unit test is added to exercise the new quorum logic to ignore
  certain failed UUIDs, taking into account REPLACE and PROMOTE replicas
- integration tests are added to test:
  - behavior with RF=3 through restarts of the master
  - behavior when running move_replica tooling
  - behavior with RF=5 with background failures

Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
---
M src/kudu/consensus/quorum_util-test.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/maintenance_mode-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
8 files changed, 593 insertions(+), 24 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/22/14222/5
--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 5
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-25 Thread Andrew Wong (Code Review)
Hello Tidy Bot, Alexey Serbin, Kudu Jenkins, Adar Dembo, Hao Hao,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14222

to look at the new patch set (#4).

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..

KUDU-2069 p4: stop replication from failed servers in maintenance mode

When determining whether a replica needs to be added, we may now
consider a set of of UUIDs that are allowed to be in a "bad" state while
not counting towards being under-replicated.

Since the goal of maintenance mode is to cope with unexpected failures,
"healthy" movement, e.g. through tooling that uses REPLACE and PROMOTE
tagging, is still allowed to and from tservers in maintenance mode.

Testing:
- a unit test is added to exercise the new quorum logic to ignore
  certain failed UUIDs, taking into account REPLACE and PROMOTE replicas
- integration tests are added to test:
  - behavior with RF=3 through restarts of the master
  - behavior when running move_replica tooling
  - behavior with RF=5 with background failures

Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
---
M src/kudu/consensus/quorum_util-test.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/maintenance_mode-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
8 files changed, 593 insertions(+), 24 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/22/14222/4
--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 4
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-25 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14222 )

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14222/3/src/kudu/integration-tests/maintenance_mode-itest.cc
File src/kudu/integration-tests/maintenance_mode-itest.cc:

http://gerrit.cloudera.org:8080/#/c/14222/3/src/kudu/integration-tests/maintenance_mode-itest.cc@165
PS3, Line 165:  int expected_failed) {
> warning: parameter 'num_replicas_failed' is unused [misc-unused-parameters]
Done



--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 4
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Thu, 26 Sep 2019 00:08:56 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-25 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14222 )

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..


Patch Set 3:

(20 comments)

http://gerrit.cloudera.org:8080/#/c/14222/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14222/2//COMMIT_MSG@10
PS2, Line 10: llowed to be in a "ba
> nit: maybe 'can be in a bad state (e.g. due to in maintenance mode)'? To ex
Done

Also worth noting that maintenance mode is not itself a bad state. It is 
protection against being considered a bad state.


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc
File src/kudu/consensus/quorum_util-test.cc:

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc@636
PS2, Line 636: EXPECT_FALSE(ShouldAddReplica(config, 5, { "A", "B" }));
> What about tests scenarios for ShouldRemoveReplica() when some of the table
This is somewhat white-box knowledge, but the "whitelist" isn't even part of 
the `ShouldAddReplica()` API, so I'm not sure testing it here makes sense.


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc@648
PS2, Line 648: ig, "B", V, '+
> server with unknown health
Done


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc@658
PS2, Line 658: // should honor the replacement and try to add
> nit: should this comment be moved to L664?
Ack


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc@689
PS2, Line 689: EXPECT_FALSE(ShouldAddReplica(config, 5, { "A", "B" }));
> Does it make sense to add the following cases for replication factor 3:
Sure, I'll add those cases.

Added some coverage for non-voters.


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util.h
File src/kudu/consensus/quorum_util.h:

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util.h@122
PS2, Line 122: set the UUID of the best candidate
> there is a quorum
Done


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util.h@132
PS2, Line 132:
> nit: would '{}' suffice here instead of 'std::unordered_set()'
I get an error when I try that: "chosen constructor is explicit in 
copy-initialization".


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util.cc
File src/kudu/consensus/quorum_util.cc:

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util.cc@456
PS2, Line 456: ) << Substitute("ignor
> Ah, just another observation.  Probably, I should have mentioned that earli
I updated my design doc with some notes on this:

Option 1: the rebalancing tool and the move tool will continue to work even 
during maintenance mode
REPLACE replicas _will_ continue to count towards underreplication when 
deciding to add replicas, even when in maintenance mode.
* There would be no changes to determining whether we should evict. In this 
way, maintenance mode does not affect the move tool at all.
* The downside here is that it may mean replication traffic. Using maintenance 
mode is generally a signal that something is wrong or needs to be fixed in a 
cluster/tserver. It thus behooves us to make sure there are fewer moving parts 
where we can.

Option 2: the rebalancing tool and the move tool will set REPLACE and PROMOTE, 
but those will not be honored by the master -- the master err on the side of 
not evicting and not adding replicas.
* When determining whether to add replicas, replicas marked REPLACE will not 
count towards being underreplicated, while those marked PROMOTE will count if 
healthy.
* When determining whether to remove replicas, replicas marked REPLACE will be 
prioritized as healthy (or even lower, so we truly never evict nodes in 
maintenance mode)
* The major downside here is that while the goal itself is clear, the exact 
semantics aren’t quite clear to articulate, especially when considering whether 
to evict (see ShouldEvictReplica() in quorum_util.cc)

Verdict: There is a clearer implementation path towards Option 1, and the 
semantics seem less surprising, when considering that operators may run tools 
manually. If we really want to avoid rebalancing during maintenance mode, it 
doesn’t seem unreasonable to implement checks in the rebalancer to check for 
maintenance mode.


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc
File src/kudu/integration-tests/maintenance_mode-itest.cc:

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc@19
PS2, Line 19: #include 
> nit: remove the extra line
Done


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc@87
PS2, Line 87: pts.num_
> nit: 'Perform' to align with the rest.
Done



[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-25 Thread Andrew Wong (Code Review)
Hello Alexey Serbin, Kudu Jenkins, Adar Dembo, Hao Hao,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14222

to look at the new patch set (#3).

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..

KUDU-2069 p4: stop replication from failed servers in maintenance mode

When determining whether a replica needs to be added, we may now
consider a set of of UUIDs that are allowed to be in a "bad" state while
not counting towards being under-replicated.

Since the goal of maintenance mode is to cope with unexpected failures,
"healthy" movement, e.g. through tooling that uses REPLACE and PROMOTE
tagging, is still allowed to and from tservers in maintenance mode.

Testing:
- a unit test is added to exercise the new quorum logic to ignore
  certain failed UUIDs, taking into account REPLACE and PROMOTE replicas
- integration tests are added to test:
  - behavior with RF=3 through restarts of the master
  - behavior when running move_replica tooling
  - behavior with RF=5 with background failures

Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
---
M src/kudu/consensus/quorum_util-test.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/maintenance_mode-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
8 files changed, 593 insertions(+), 24 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/22/14222/3
--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 3
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-16 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14222 )

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util.cc
File src/kudu/consensus/quorum_util.cc:

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util.cc@456
PS2, Line 456: peer.attrs().replace()
> I should have been more clear on what I said.
Ah, just another observation.  Probably, I should have mentioned that earlier 
for clarity.

If I'm not mistaken, only tools like 'cluster rebalancer' and 'tablet 
change_config move_replica' set the 'REPLACE' attribute to make replica to be 
moved to some other tablet server.  That corresponds to ChangeConfigType of 
MODIFY_PEER.  The automatic re-replication in master uses ADD_PEER and 
REMOVE_PEER correspondingly.

So, if we are about to ignore the REPLACE attribute when a tablet server is in 
the maintenance mode, that means a running session of the rebalancer tool of an 
explicit replica movement would stuck until the source tablet server is back 
into the regular (non-maintenance mode).

Maybe it's worth double-checking what we want from the operability perspective 
here if ignoring the REPLACE attribute for tablet servers went into the 
maintenance mode:
 * allow for accumulation of those 'REPLACE' attributes set for replicas hosted 
by a tablet server in the maintenance mode (and be prepared for corresponding 
movements once the server is back into the normal mode)
 * explicitly disallow setting the 'REPLACE' attribute for replicas hosted by a 
tablet server in the maintenance mode

Another alternative would be honoring the 'REPLACE' attribute even if a tablet 
server is in the maintenance mode, given that the presence of the 'REPLACE' 
attribute means it has been set explicitly, and it was not a result of the 
automatic re-replication activity.



--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 17 Sep 2019 00:38:48 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-16 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14222 )

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc
File src/kudu/integration-tests/maintenance_mode-itest.cc:

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc@156
PS2, Line 156: TestMaintenanceModeDoesntRereplicate
> Does it make sense to add a scenario to verify that it's possible to move r
To clarify, my comment was about the manual movement of replicas from the node, 
e.g., when explicitly moving a replica  using 'kudu tablet change_config 
move_replica'.

We talked offline regarding another comment related to this question.  So, if 
we are OK with introducing the restriction of not allowing replicas to be moved 
even explicitly, please ignore this comment.



--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 17 Sep 2019 00:07:46 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-16 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14222 )

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util.cc
File src/kudu/consensus/quorum_util.cc:

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util.cc@456
PS2, Line 456: peer.attrs().replace()
> Do we want to avoid replacing replicas from nodes in maintenance mode?
I should have been more clear on what I said.

My concern was that with this code moving replicas from a node put into the 
maintenance mode is impossible, even if it's done not because of automatic 
re-replication, but because of manual 'replica move' using kudu CLI.

Andrew and I discussed that offline, and it seems the constraint of not 
allowing replica movement from tablet servers in the maintenance mode is inline 
with the intention.



--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Mon, 16 Sep 2019 23:18:25 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-16 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14222 )

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/master/ts_manager.h
File src/kudu/master/ts_manager.h:

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/master/ts_manager.h@94
PS2, Line 94:   void GetWhitelistedUuids(std::unordered_set* 
uuids) const;
> Not sure I understand why that would be any better than generating the list
If O(num tservers in maintenance mode) is relatively high, GetWhitelistedUuids 
will take a lot of locks. The acquisitions are one-at-a-time, but it's possible 
it's still computationally more expensive than ~3 TSManager lookups in 
ShouldAddReplica. With the change to only call GetWhitelistedUuids once per 
heartbeat vs. one per reported tablet, it's probably better than lookups in 
ShouldAddReplica _for full reports_. But what about incremental reports, which 
may only include a few tablets?

As for the name, maybe it'd be clearer if you shifted the complexity to 
ShouldAddReplica by calling the method GetMaintenanceModeTServers. Then 
ShouldAddReplica gets to explain how/why MM affects rereplication, but since 
that area of code is already dealing with that concept, it's more natural for 
the full explanation to also live there.



--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Mon, 16 Sep 2019 23:12:46 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-15 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14222 )

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/master/ts_manager.h
File src/kudu/master/ts_manager.h:

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/master/ts_manager.h@94
PS2, Line 94:   void GetWhitelistedUuids(std::unordered_set* 
uuids) const;
Not sure I understand why that would be any better than generating the list up 
front. Passing in a const ref to ShouldAddReplica() doesn't seem expensive at 
all. With this call pulled out of the loop, aren't they equivalent, but with 
less complexity in this implementation? Or is the concern around the memory 
used by the whitelist?

Also what name would you prefer? I think "whitelisting" is apt, but you are the 
second person to mention this and I haven't been able to think of a better name.

> a list of people or things considered to be acceptable or trustworthy.
> "the software applies a blacklist of spammers and a whitelist of known good 
> senders"



--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Sun, 15 Sep 2019 06:44:10 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-13 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14222 )

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..


Patch Set 2:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc
File src/kudu/consensus/quorum_util-test.cc:

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc@636
PS2, Line 636: TEST_P(QuorumUtilHealthPolicyParamTest, 
ShouldAddReplicaWhitelist) {
What about tests scenarios for ShouldRemoveReplica() when some of the tablet 
servers hosting replicas are put into the maintenance mode?


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc@648
PS2, Line 648: unknown server
server with unknown health


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc@689
PS2, Line 689: }
Does it make sense to add the following cases for replication factor 3:
  * A:? B:+ C:+, all voters, where A is in maintenance mode
  * A:- B:+ C:+, all voters, where A, B, and C in maintenance mode
  * A:- B:- C:+, all voters, where A and B in maintenance mode

Also, how does the maintenance mode affect configurations with non-voter 
replicas?  Imagine, some non-voter replica was in process of copying when a 
tablet server with the failed replica was marked/put into the maintenance.  
What will happen to the non-voter replica?  Will it be removed since the tablet 
server with the failed voter replica has just been put into the maintenance 
mode?


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc
File src/kudu/integration-tests/maintenance_mode-itest.cc:

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc@19
PS2, Line 19:
nit: remove the extra line


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc@156
PS2, Line 156: TestMaintenanceModeDoesntRereplicate
> nit: TestMaintenanceModeWithoutRereplication?
Does it make sense to add a scenario to verify that it's possible to move 
replicas _from_ a tablet server put into the maintenance mode?


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc@195
PS2, Line 195: // Bringing the tablet server down shouldn't lead to 
re-replication.
Another reasons for re-replication are:
  * follower replica falling behind leader's WAL GC threshold
  * replica failed to bootstrap due to IO error

Does it make sense to add test scenario for those or we assume that's already 
covered due to the way how the code is written in quorum_util.cc?  If the 
latter, then please add a comment to reflect those possible cases as well.


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/master/catalog_manager.cc@4190
PS2, Line 4190: unordered_set whitelisted_uuids;
  : 
master_->ts_manager()->GetWhitelistedUuids(_uuids);
> It's wasteful to generate this for each tablet in the report.
+1



--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Fri, 13 Sep 2019 07:47:45 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-12 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14222 )

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util.h
File src/kudu/consensus/quorum_util.h:

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util.h@122
PS2, Line 122: there are enough replicas available
there is a quorum


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util.h@132
PS2, Line 132: std::unordered_set()
nit: would '{}' suffice here instead of 'std::unordered_set()'  ?


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util.cc
File src/kudu/consensus/quorum_util.cc:

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util.cc@456
PS2, Line 456: peer.attrs().replace()
Do we want to avoid replacing replicas from nodes in maintenance mode?

As I understand, the intent of the REPLACE attribute is inline with whatever 
maintenance mode is currently trying to do.  Especially, if thinking about 
moving replicas from a node switched into the maintenance mode, there is no 
other way to remove replica with the 3-4-3 replica management scheme but mark 
the source replica with the REPLACE attribute and add a new non-voter replica 
in the same Raft transaction.



--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Fri, 13 Sep 2019 05:06:23 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-12 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14222 )

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc
File src/kudu/integration-tests/maintenance_mode-itest.cc:

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc@99
PS2, Line 99:   Status GetNumBytesCopied(int64_t* num_bytes_copied) {
This is a proxy for the question "did we rereplicate anything", right? Would it 
be more intuitive if we looked at the count of the server entity's 
METRIC_handler_latency_kudu_tserver_TabletCopyService_BeginTabletCopySession 
instead?

Also, you only ever compare the value to 0, so maybe convert this into a yes/no 
function?


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc@151
PS2, Line 151:  protected:
Don't need this?


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc@208
PS2, Line 208:   int64_t num_bytes_copied = 0;
Nit: don't need to initialize this.


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/master/catalog_manager.cc@4190
PS2, Line 4190: unordered_set whitelisted_uuids;
  : 
master_->ts_manager()->GetWhitelistedUuids(_uuids);
It's wasteful to generate this for each tablet in the report.


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/master/ts_manager.h
File src/kudu/master/ts_manager.h:

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/master/ts_manager.h@94
PS2, Line 94:   void GetWhitelistedUuids(std::unordered_set* 
uuids) const;
"Whitelisted" doesn't really capture the expected semantics of these UUIDs.

Another thing I noticed is that there's potentially a big difference in 
cardinality between the returned set (number of tservers in maintenance mode) 
and the number of iterations in ShouldAddReplica (replication factor).

Taken together, perhaps we shouldn't generate this set up front, and instead 
pass the TSManager in some fashion into ShouldAddReplica? The consensus -> 
master dependency means we shouldn't pass in TSManager as-is, but maybe we 
could pass it via a lambda that takes a UUID and returns true if the tserver 
should be ignored for underreplication purposes?



--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Fri, 13 Sep 2019 04:24:17 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-12 Thread Hao Hao (Code Review)
Hao Hao has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14222 )

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/14222/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14222/2//COMMIT_MSG@10
PS2, Line 10: can be in a bad state
nit: maybe 'can be in a bad state (e.g. due to in maintenance mode)'? To 
explain a bit why bad state is acceptable.


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc
File src/kudu/consensus/quorum_util-test.cc:

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/consensus/quorum_util-test.cc@658
PS2, Line 658: // We should still add a replica in this case.
nit: should this comment be moved to L664?


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc
File src/kudu/integration-tests/maintenance_mode-itest.cc:

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc@87
PS2, Line 87: Performs
nit: 'Perform' to align with the rest.


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc@156
PS2, Line 156: TestMaintenanceModeDoesntRereplicate
nit: TestMaintenanceModeWithoutRereplication?


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/integration-tests/maintenance_mode-itest.cc@229
PS2, Line 229: Since our server is healthy, leaving maintenance mode shouldn't 
trigger
 :   // any re-replication either.
Should we add a test that re-replication can still happen to other tservers if 
any of them go down, while one is in maintenance mode?


http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/master/ts_manager.h
File src/kudu/master/ts_manager.h:

http://gerrit.cloudera.org:8080/#/c/14222/2/src/kudu/master/ts_manager.h@92
PS2, Line 92: that can be in a failed state without
:   // counting towards under-replication.
nit: can you elaborate a bit in which cases this should happen.



--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Thu, 12 Sep 2019 20:46:06 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-12 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14222

to look at the new patch set (#2).

Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..

KUDU-2069 p4: stop replication from failed servers in maintenance mode

When determining whether a replica needs to be added, we may now
consider a "whitelist" of UUIDs that can be in a bad state while not
counting towards being under-replicated.

Testing:
- a unit test is added to exercise the new quorum-specific whitelisting
  logic
- an integration test is also added to exercise this end-to-end

Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
---
M src/kudu/consensus/quorum_util-test.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/maintenance_mode-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
8 files changed, 343 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/22/14222/2
--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins (120)


[kudu-CR] KUDU-2069 p4: stop replication from failed servers in maintenance mode

2019-09-12 Thread Andrew Wong (Code Review)
Andrew Wong has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/14222


Change subject: KUDU-2069 p4: stop replication from failed servers in 
maintenance mode
..

KUDU-2069 p4: stop replication from failed servers in maintenance mode

When determining whether a replica needs to be added, we may now
consider a "whitelist" of UUIDs that can be in a bad state while not
counting towards being under-replicated.

Testing:
- a unit test is added to exercise the new quorum-specific whitelisting
  logic
- an integration test is also added to exercise this end-to-end

Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
---
M src/kudu/consensus/quorum_util-test.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/maintenance_mode-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
8 files changed, 341 insertions(+), 12 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/22/14222/1
--
To view, visit http://gerrit.cloudera.org:8080/14222
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I9a63b55011d16900c0d27eac0eb75880074204db
Gerrit-Change-Number: 14222
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong