[kudu-CR] Move CreateTableForTesting into registration-test
Hello Adar Dembo, Alexey Serbin, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6961 to look at the new patch set (#2). Change subject: Move CreateTableForTesting into registration-test .. Move CreateTableForTesting into registration-test This utility function was only used in one spot, so moving it out of the header. Change-Id: Id72a37b2dee05db65f27c27d52b0f18dd5c489f2 --- M src/kudu/integration-tests/registration-test.cc M src/kudu/master/master-test-util.h 2 files changed, 60 insertions(+), 59 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/6961/2 -- To view, visit http://gerrit.cloudera.org:8080/6961 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Id72a37b2dee05db65f27c27d52b0f18dd5c489f2 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins
[kudu-CR] KUDU-1125 (part 1) catalog manager: try to avoid unnecessarily rewriting tablet info
Hello Adar Dembo, Alexey Serbin, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6916 to look at the new patch set (#3). Change subject: KUDU-1125 (part 1) catalog_manager: try to avoid unnecessarily rewriting tablet info .. KUDU-1125 (part 1) catalog_manager: try to avoid unnecessarily rewriting tablet info This adds some checks in SysCatalogTable so that, if the updated version of a catalog table entry is identical to the previous version, we avoid writing anything to the table. This is a simple way of short-circuiting a common case of unnecessary slowness in tablet report processing. For example, when the master restarts or fails over, all of the tablet servers perform a full tablet report. However, most of the tablets will not have changed any information since their prior report, in which case the writes can be safely skipped on the master. This doesn't achieve all of the goals of KUDU-1125 (we still do separate sync writes for each _changed_ reported tablet) but should still be a substantial reduction. Perhaps most importantly, if a heartbeat from a tserver times out due to long processing, the retry from the TS should hit this code path and therefore be processed quickly. Change-Id: I0de7189b8f1dbeea55172929396b73fd92fd62ba --- M src/kudu/integration-tests/registration-test.cc M src/kudu/master/catalog_manager.cc M src/kudu/master/master-test-util.h M src/kudu/master/sys_catalog.cc M src/kudu/master/sys_catalog.h 5 files changed, 103 insertions(+), 17 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/16/6916/3 -- To view, visit http://gerrit.cloudera.org:8080/6916 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I0de7189b8f1dbeea55172929396b73fd92fd62ba Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] Move CreateTableForTEsting into registration-test
Hello Adar Dembo, Alexey Serbin, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/6961 to review the following change. Change subject: Move CreateTableForTEsting into registration-test .. Move CreateTableForTEsting into registration-test This utility function was only used in one spot, so moving it out of the header. Change-Id: Id72a37b2dee05db65f27c27d52b0f18dd5c489f2 --- M src/kudu/integration-tests/registration-test.cc M src/kudu/master/master-test-util.h 2 files changed, 60 insertions(+), 59 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/6961/1 -- To view, visit http://gerrit.cloudera.org:8080/6961 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: Id72a37b2dee05db65f27c27d52b0f18dd5c489f2 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin
[kudu-CR] KUDU-1125 (part 1) catalog manager: try to avoid unnecessarily rewriting tablet info
Todd Lipcon has posted comments on this change. Change subject: KUDU-1125 (part 1) catalog_manager: try to avoid unnecessarily rewriting tablet info .. Patch Set 2: (2 comments) http://gerrit.cloudera.org:8080/#/c/6916/2/src/kudu/integration-tests/registration-test.cc File src/kudu/integration-tests/registration-test.cc: Line 38: #include "kudu/tablet/tablet_replica.h" > warning: #includes are not sorted properly [llvm-include-order] Done http://gerrit.cloudera.org:8080/#/c/6916/2/src/kudu/master/master-test-util.h File src/kudu/master/master-test-util.h: Line 71: void CreateTableForTesting(MiniMaster* mini_master, > warning: function 'CreateTableForTesting' defined in a header file; functio this is only used by one test (registration-test). I'll move it into there in a separate commit. -- To view, visit http://gerrit.cloudera.org:8080/6916 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I0de7189b8f1dbeea55172929396b73fd92fd62ba Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1125 (part 1) catalog manager: try to avoid unnecessarily rewriting tablet info
Hello Adar Dembo, Alexey Serbin, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6916 to look at the new patch set (#2). Change subject: KUDU-1125 (part 1) catalog_manager: try to avoid unnecessarily rewriting tablet info .. KUDU-1125 (part 1) catalog_manager: try to avoid unnecessarily rewriting tablet info This adds some checks in SysCatalogTable so that, if the updated version of a catalog table entry is identical to the previous version, we avoid writing anything to the table. This is a simple way of short-circuiting a common case of unnecessary slowness in tablet report processing. For example, when the master restarts or fails over, all of the tablet servers perform a full tablet report. However, most of the tablets will not have changed any information since their prior report, in which case the writes can be safely skipped on the master. This doesn't achieve all of the goals of KUDU-1125 (we still do separate sync writes for each _changed_ reported tablet) but should still be a substantial reduction. Perhaps most importantly, if a heartbeat from a tserver times out due to long processing, the retry from the TS should hit this code path and therefore be processed quickly. Change-Id: I0de7189b8f1dbeea55172929396b73fd92fd62ba --- M src/kudu/integration-tests/registration-test.cc M src/kudu/master/catalog_manager.cc M src/kudu/master/master-test-util.h M src/kudu/master/sys_catalog.cc M src/kudu/master/sys_catalog.h 5 files changed, 103 insertions(+), 17 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/16/6916/2 -- To view, visit http://gerrit.cloudera.org:8080/6916 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I0de7189b8f1dbeea55172929396b73fd92fd62ba Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Todd LipconGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins
[kudu-CR] KUDU-871 (part 2). Make ConsensusMetadata thread-safe
Mike Percy has posted comments on this change. Change subject: KUDU-871 (part 2). Make ConsensusMetadata thread-safe .. Patch Set 1: Verified+1 I can't reproduce this failure (even after 1000s of iters on dist-test with different levels of stress / build types) -- To view, visit http://gerrit.cloudera.org:8080/6958 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id2c70605c0593e55486184705ea448dcb4bef2d7 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike PercyGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
Hello Todd Lipcon, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6925 to look at the new patch set (#8). Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet .. KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet The 'active ingredient' in this patch is the change to TsTabletManager::StartTabletCopy that causes an ALREADY_INPROGRESS response to be returned if the tablet is currently being copied and the tablet copy thread pool is full. Previously an ALREADY_INPROGRESS response would only occur if the tablet was currently being copied, and the threadpool was not full. The effect of the failure to return ALREADY_INPROGRESS was that a leader would be much more likely consider a tablet server failed and to subsequently drop the replica from the Raft config. As a result, on a highly loaded cluster, a tablet copy could be started at the same time, 300 seconds apart, on many tablet servers. The remaining changes are to return more specific errors out of the tablet copy service, which aids with testing specific codepaths. One of the existing tablet_copy-itest cases has been beefed up to cover the tablet copy threadpool full path. Without the changes outlined before it fails with: ../../src/kudu/integration-tests/tablet_copy-itest.cc:961: Failure Expected: (num_inprogress) > (0), actual: 0 vs 0 which is exactly what we would expect; the tablet server is failing to return INPROGRESS errors. Anecdotally, this patch has improved TTR times 5-10x on highly loaded clusters. It's still possible for tablets to be bounced around during re-replication if the copying tablet server has a full RPC queue, or it's unable to start the tablet copy for 300 seconds, but both of these conditions indicate that it's probably best to drop that tserver and retry on a (hopefully) less stressed server. Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75 --- M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/integration-tests/cluster_itest_util.cc M src/kudu/integration-tests/cluster_itest_util.h M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/tserver/tablet_service.cc M src/kudu/tserver/ts_tablet_manager.cc 8 files changed, 139 insertions(+), 57 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/6925/8 -- To view, visit http://gerrit.cloudera.org:8080/6925 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan BurkertGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] docs: revise security doc based on KUDU-1875 and add release notes for 1.4
Dan Burkert has posted comments on this change. Change subject: docs: revise security doc based on KUDU-1875 and add release notes for 1.4 .. Patch Set 5: Code-Review+2 Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/6922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2e386203f5ed3ef66e2ec136e67738b8c7eb8b1a Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Hao HaoGerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Hao Hao Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] docs: revise security doc based on KUDU-1875 and add release notes for 1.4
Hao Hao has posted comments on this change. Change subject: docs: revise security doc based on KUDU-1875 and add release notes for 1.4 .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/6922/4/docs/security.adoc File docs/security.adoc: Line 54: connections from publicly routable IPs will be rejected. > Reword as 'Unauthenticated connections from publicly routable IPs will be r Done -- To view, visit http://gerrit.cloudera.org:8080/6922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2e386203f5ed3ef66e2ec136e67738b8c7eb8b1a Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Hao HaoGerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Hao Hao Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] docs: revise security doc based on KUDU-1875 and add release notes for 1.4
Dan Burkert has submitted this change and it was merged. Change subject: docs: revise security doc based on KUDU-1875 and add release notes for 1.4 .. docs: revise security doc based on KUDU-1875 and add release notes for 1.4 Change-Id: I2e386203f5ed3ef66e2ec136e67738b8c7eb8b1a Reviewed-on: http://gerrit.cloudera.org:8080/6922 Reviewed-by: Dan BurkertTested-by: Dan Burkert --- M docs/prior_release_notes.adoc M docs/release_notes.adoc M docs/security.adoc 3 files changed, 210 insertions(+), 162 deletions(-) Approvals: Dan Burkert: Looks good to me, approved; Verified -- To view, visit http://gerrit.cloudera.org:8080/6922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: I2e386203f5ed3ef66e2ec136e67738b8c7eb8b1a Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Hao Hao Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Hao Hao Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] docs: revise security doc based on KUDU-1875 and add release notes for 1.4
Hello Todd Lipcon, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6922 to look at the new patch set (#5). Change subject: docs: revise security doc based on KUDU-1875 and add release notes for 1.4 .. docs: revise security doc based on KUDU-1875 and add release notes for 1.4 Change-Id: I2e386203f5ed3ef66e2ec136e67738b8c7eb8b1a --- M docs/prior_release_notes.adoc M docs/release_notes.adoc M docs/security.adoc 3 files changed, 210 insertions(+), 162 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/22/6922/5 -- To view, visit http://gerrit.cloudera.org:8080/6922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I2e386203f5ed3ef66e2ec136e67738b8c7eb8b1a Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Hao HaoGerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Hao Hao Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1988: add support for advertised host:port info.
Dan Burkert has posted comments on this change. Change subject: KUDU-1988: add support for advertised host:port info. .. Patch Set 8: (4 comments) http://gerrit.cloudera.org:8080/#/c/6827/8/src/kudu/server/rpc_server.cc File src/kudu/server/rpc_server.cc: Line 117: if (!options_.rpc_advertised_addresses.empty()) { This should guard against port 0, as on line 111 above. Line 210: for (const Sockaddr& addr : rpc_advertised_addresses_) { Same comment as in the webserver; this can just use the copy ctor. http://gerrit.cloudera.org:8080/#/c/6827/8/src/kudu/server/webserver.cc File src/kudu/server/webserver.cc: Line 207: RETURN_NOT_OK(ParseAddressList(opts_.webserver_advertised_addresses, This should check for 0 port as well. Line 308: for (const Sockaddr& addr : webserver_advertised_addresses_) { This can be simpler as a copy construct: *addresses = webserver_advertised_addresses_; -- To view, visit http://gerrit.cloudera.org:8080/6827 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I6735ca5630fc4c426bf72d0b21d6ef452173a890 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Patrik SundbergGerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Patrik Sundberg Gerrit-Reviewer: Tidy Bot Gerrit-HasComments: Yes
[kudu-CR] docs: revise security doc based on KUDU-1875 and add release notes for 1.4
Dan Burkert has posted comments on this change. Change subject: docs: revise security doc based on KUDU-1875 and add release notes for 1.4 .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/6922/4/docs/security.adoc File docs/security.adoc: Line 54: any unauthenticated connections from publicly routable IPs. Reword as 'Unauthenticated connections from publicly routable IPs will be rejected.' here and in the release notes. -- To view, visit http://gerrit.cloudera.org:8080/6922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2e386203f5ed3ef66e2ec136e67738b8c7eb8b1a Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Hao HaoGerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Hao Hao Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
Dan Burkert has posted comments on this change. Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/tserver/tablet_service.cc File src/kudu/tserver/tablet_service.cc: Line 1067: // Skip calling SetupErrorAndRespond since this path doesn't need the > Check out the 'Advanced per-instance throttling' section of util/logging.h OK so after trying this out on a cluster, I think we should allow it to log every time. To balance this out, I think we should downgrade the 'tablet x needs tablet copy' message. The net result is that we're logging the begin tablet copy result instead of the fact that we'll be requesting it. For reference, here's a cross section of these logs for a particular tablet: I0522 17:09:38.703362 15818 consensus_queue.cc:395] T c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 [LEADER]: Peer cc32936bc8594948a04fd4240da36aed needs tablet copy W0522 17:09:38.703636 4776 consensus_peers.cc:352] T c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 -> Peer cc32936bc8594948a04fd4240da36aed (vd0236.halxg.cloudera.com:7050): Unable to begin Tablet Copy on peer: error { code: THROTTLED status { code: SERVICE_UNAVAILABLE message: "Thread pool is at capacity (10/10 tasks running, 0/0 tasks queued)" } } I0522 17:09:40.211633 15820 consensus_queue.cc:395] T c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 [LEADER]: Peer cc32936bc8594948a04fd4240da36aed needs tablet copy W0522 17:09:40.211971 4776 consensus_peers.cc:352] T c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 -> Peer cc32936bc8594948a04fd4240da36aed (vd0236.halxg.cloudera.com:7050): Unable to begin Tablet Copy on peer: error { code: THROTTLED status { code: SERVICE_UNAVAILABLE message: "Thread pool is at capacity (10/10 tasks running, 0/0 tasks queued)" } } I0522 17:09:41.703528 11794 consensus_queue.cc:395] T c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 [LEADER]: Peer cc32936bc8594948a04fd4240da36aed needs tablet copy W0522 17:09:41.703760 4776 consensus_peers.cc:352] T c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 -> Peer cc32936bc8594948a04fd4240da36aed (vd0236.halxg.cloudera.com:7050): Unable to begin Tablet Copy on peer: error { code: THROTTLED status { code: SERVICE_UNAVAILABLE message: "Thread pool is at capacity (10/10 tasks running, 0/0 tasks queued)" } } On clusters approaching normalcy, I wouldn't expect to see these logs much at all. -- To view, visit http://gerrit.cloudera.org:8080/6925 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan BurkertGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] threadpool-test: use test fixture
Adar Dembo has submitted this change and it was merged. Change subject: threadpool-test: use test fixture .. threadpool-test: use test fixture Change-Id: Ic76e5f946356d2cf6869cb7a665d0ba5adde Reviewed-on: http://gerrit.cloudera.org:8080/6944 Reviewed-by: Todd LipconTested-by: Kudu Jenkins --- M src/kudu/util/threadpool-test.cc M src/kudu/util/threadpool.h 2 files changed, 140 insertions(+), 149 deletions(-) Approvals: Todd Lipcon: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/6944 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: Ic76e5f946356d2cf6869cb7a665d0ba5adde Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar Dembo Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1034 client does not failover due to timeout
Alexey Serbin has posted comments on this change. Change subject: KUDU-1034 client does not failover due to timeout .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/6924/1/src/kudu/rpc/retriable_rpc.h File src/kudu/rpc/retriable_rpc.h: Line 217: server_picker_->MarkServerFailed(server, result.status); > In the case of a timeout, that still might be a good idea, no? It's not 100 I think timeout is considered as non-retriable error. Given current behavior of the client (refreshing its meta-cache in case of absence of usable server for an RPC), marking a timed-out server would do no harm. Line 217: server_picker_->MarkServerFailed(server, result.status); > OK now I remember the issue I had when looking at this issue previously. I As I understand, regardless of whether the client aware of particular replica or not, marking a server failed will make the client to switch to a different tablet server in the scope of this particular RPC. Also, most of the errors in this context (except, may be, REPLICA_NOT_LEADER) have semantics of 'mark the whole server out': connection timeout, server too busy, invalid authn token. Yes, all the tablets will be marked as non-accessible when using MarkServerFailed(), but the client will refresh its meta-cache if it ended up with no active replica, right? E.g., take a look at handling REPLICA_NOT_LEADER error code. It might happen that by the time client marks the server as failed it becomes the leader. So, in that case the client will end up in having no servers in its metacache, and will refresh it to get a new leader. -- To view, visit http://gerrit.cloudera.org:8080/6924 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey SerbinGerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] threadpool-test: use test fixture
Todd Lipcon has posted comments on this change. Change subject: threadpool-test: use test fixture .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/6944 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ic76e5f946356d2cf6869cb7a665d0ba5adde Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] Fix flaky test TestRecoverFromOpIdOverflow (again)
Todd Lipcon has posted comments on this change. Change subject: Fix flaky test TestRecoverFromOpIdOverflow (again) .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/6943/2/src/kudu/integration-tests/ts_recovery-itest.cc File src/kudu/integration-tests/ts_recovery-itest.cc: PS2, Line 75: const int kOneReplica = 1; Don't think this adds much value. If the goal was to make the parameter to StartCluster more obvious, I think it's better to just do: StartCluster(extra_ts_flags, {}, /*replicas=*/ 1); or something of that nature -- To view, visit http://gerrit.cloudera.org:8080/6943 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I1f7326136479311ba2a84b384327e07d280df7c3 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike PercyGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1860: ksck doesn't identify tablets that are evicted but still in config
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6772 to look at the new patch set (#9). Change subject: KUDU-1860: ksck doesn't identify tablets that are evicted but still in config .. KUDU-1860: ksck doesn't identify tablets that are evicted but still in config This patch enhances ksck to gather consensus info from every tablet. It compares this info with master and outputs the master's config and every conflicting config, if there are any conflicts. To do this efficiently it reimplements the GetAllConsensusState RPC so that it gathers info about every replica's consensus state. This will catch at least the two problems identified in KUDU-1860: 1. The leader has a pending config to remove a tablet, but it is not committed so the master does not see this config. This can hide an unhealthy tablet if, e.g., one pending config member is down and the pending-to-be-kicked-out member is up, so 1/2 replicas are alive in the leader's active config but the master thinks 2/3 are alive. 2. No replica is leader but the master believes there is a leader because its cache is old and hasn't been updated. Sample output showing #1: https://gist.github.com/wdberkeley/d2606698e4f2e8ca3ef70d4dcef7ba9a Change-Id: I16e4de09821b372c3773b4ade3fd9e37ab818808 --- M src/kudu/consensus/consensus.proto M src/kudu/integration-tests/cluster_itest_util.cc M src/kudu/master/catalog_manager.cc M src/kudu/master/catalog_manager.h M src/kudu/tools/ksck-test.cc M src/kudu/tools/ksck.cc M src/kudu/tools/ksck.h M src/kudu/tools/ksck_remote.cc M src/kudu/tools/ksck_remote.h M src/kudu/tools/tool_action_cluster.cc M src/kudu/tserver/tablet_replica_lookup.h M src/kudu/tserver/tablet_service.cc M src/kudu/tserver/tablet_service.h M src/kudu/tserver/ts_tablet_manager.h 14 files changed, 502 insertions(+), 54 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/72/6772/9 -- To view, visit http://gerrit.cloudera.org:8080/6772 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I16e4de09821b372c3773b4ade3fd9e37ab818808 Gerrit-PatchSet: 9 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Will BerkeleyGerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley
[kudu-CR] threadpool-test: use test fixture
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6944 to look at the new patch set (#2). Change subject: threadpool-test: use test fixture .. threadpool-test: use test fixture Change-Id: Ic76e5f946356d2cf6869cb7a665d0ba5adde --- M src/kudu/util/threadpool-test.cc M src/kudu/util/threadpool.h 2 files changed, 140 insertions(+), 149 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/44/6944/2 -- To view, visit http://gerrit.cloudera.org:8080/6944 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic76e5f946356d2cf6869cb7a665d0ba5adde Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-1860: ksck doesn't identify tablets that are evicted but still in config
Will Berkeley has posted comments on this change. Change subject: KUDU-1860: ksck doesn't identify tablets that are evicted but still in config .. Patch Set 8: (5 comments) http://gerrit.cloudera.org:8080/#/c/6772/8/src/kudu/tools/ksck-test.cc File src/kudu/tools/ksck-test.cc: Line 72: const ChecksumOptions& options, > warning: parameter 'options' is unused [misc-unused-parameters] Done http://gerrit.cloudera.org:8080/#/c/6772/8/src/kudu/tools/ksck.cc File src/kudu/tools/ksck.cc: Line 66: using std::setfill; > warning: using decl 'setfill' is unused [misc-unused-using-decls] Done http://gerrit.cloudera.org:8080/#/c/6772/8/src/kudu/tools/ksck.h File src/kudu/tools/ksck.h: Line 113: term(term), > warning: value argument 'term' can be moved to avoid copy [misc-move-constr Done http://gerrit.cloudera.org:8080/#/c/6772/8/src/kudu/tserver/tablet_service.cc File src/kudu/tserver/tablet_service.cc: Line 109: using kudu::consensus::ConsensusStatePB; > warning: using decl 'ConsensusStatePB' is unused [misc-unused-using-decls] Done Line 1056: for (const scoped_refptr replica : tablet_replicas) { > warning: the loop variable's type is not a reference type; this creates a c Done -- To view, visit http://gerrit.cloudera.org:8080/6772 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I16e4de09821b372c3773b4ade3fd9e37ab818808 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Will BerkeleyGerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-HasComments: Yes
[kudu-CR] threadpool-test: use test fixture
Adar Dembo has posted comments on this change. Change subject: threadpool-test: use test fixture .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/6944/1/src/kudu/util/threadpool-test.cc File src/kudu/util/threadpool-test.cc: Line 42: using strings::Substitute; > are these Tidy Bot warnings correct? Yeah; these declarations belong in the next patch, not this one. Will fix. Line 53: ASSERT_OK(ThreadPoolBuilder(kDefaultPoolName).Build(_)); > looking at the cases, it seems like almost all of them end up "rebuilding". The token-based tests in the patches to follow generally use the default pool. -- To view, visit http://gerrit.cloudera.org:8080/6944 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ic76e5f946356d2cf6869cb7a665d0ba5adde Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1860: ksck doesn't identify tablets that are evicted but still in config
Will Berkeley has posted comments on this change. Change subject: KUDU-1860: ksck doesn't identify tablets that are evicted but still in config .. Patch Set 7: (2 comments) http://gerrit.cloudera.org:8080/#/c/6772/7/src/kudu/tools/ksck.cc File src/kudu/tools/ksck.cc: PS7, Line 627: -1 > Magic number removed by using an enum. See above for why the master doesn't s/see above/see response to other comment on this line/ http://gerrit.cloudera.org:8080/#/c/6772/7/src/kudu/tools/ksck_remote.h File src/kudu/tools/ksck_remote.h: PS7, Line 58: FetchConsensusInfo > Done. I also renamed the flag from check_consensus_info to check_consensus_ And then just to consensus -- To view, visit http://gerrit.cloudera.org:8080/6772 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I16e4de09821b372c3773b4ade3fd9e37ab818808 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Will BerkeleyGerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-HasComments: Yes
[kudu-CR] KUDU-871 (part 1). Refactor to make cmeta a first class object
Mike Percy has posted comments on this change. Change subject: KUDU-871 (part 1). Refactor to make cmeta a first class object .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/6957/1//COMMIT_MSG Commit Message: PS1, Line 16: 1) Allow RaftConsensus to operate while not initialized or in a shutdown :state. Also pass ConsensusMetadata into the RaftConsensus :constructor. > not 100% following this. The problem is that I need to provide a long lived object (see patch 3 in this series) in order to avoid racy handling of the ConsensusMetadata by TabletCopy / tombstoned voting for example. The long lived object I chose was ConsensusMetadata because making both TabletReplica and RaftConsensus immortal is a much bigger change. The compromise I landed on was ensuring that TabletReplica and RaftConsensus were always alive, however they would be replaced on TabletCopy. So the only immortal things end up being TabletMetadata and ConsensusMetadata. The upshot is that we always get a RaftConsensus object whenever we have a TabletReplica object, but we don't have to support a cyclical lifecycle DAG on either object (going from Shutdown() -> Init() -> Start()) PS1, Line 41: * Add additional lifecycle runtime assertions in TabletReplica and : RaftConsensus. > Can you also clearly document the lifecycle for RaftConsensus? I'm not enti The RaftConsensus::Shutdown() is similar to what it was before but you make a fair point that since RaftConsensus isn't destroyed on TabletPeer::Shutdown() anymore then it should be as aggressive as possible reclaiming memory when it's shut down. I'll check on that. I tried to give a summary of the lifecycle in my above comment. Let me know if I need to clarify anything and I'll add more details as a comment in the next rev. -- To view, visit http://gerrit.cloudera.org:8080/6957 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ia27a091d27b3996d37009d5ec866e744f9608388 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike PercyGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1860: ksck doesn't identify tablets that are evicted but still in config
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6772 to look at the new patch set (#8). Change subject: KUDU-1860: ksck doesn't identify tablets that are evicted but still in config .. KUDU-1860: ksck doesn't identify tablets that are evicted but still in config This patch enhances ksck to gather consensus info from every tablet. It compares this info with master and outputs the master's config and every conflicting config, if there are any conflicts. To do this efficiently it reimplements the GetAllConsensusState RPC so that it gathers info about every replica's consensus state. This will catch at least the two problems identified in KUDU-1860: 1. The leader has a pending config to remove a tablet, but it is not committed so the master does not see this config. This can hide an unhealthy tablet if, e.g., one pending config member is down and the pending-to-be-kicked-out member is up, so 1/2 replicas are alive in the leader's active config but the master thinks 2/3 are alive. 2. No replica is leader but the master believes there is a leader because its cache is old and hasn't been updated. Sample output showing #1: https://gist.github.com/wdberkeley/d2606698e4f2e8ca3ef70d4dcef7ba9a Change-Id: I16e4de09821b372c3773b4ade3fd9e37ab818808 --- M src/kudu/consensus/consensus.proto M src/kudu/integration-tests/cluster_itest_util.cc M src/kudu/master/catalog_manager.cc M src/kudu/master/catalog_manager.h M src/kudu/tools/ksck-test.cc M src/kudu/tools/ksck.cc M src/kudu/tools/ksck.h M src/kudu/tools/ksck_remote.cc M src/kudu/tools/ksck_remote.h M src/kudu/tools/tool_action_cluster.cc M src/kudu/tserver/tablet_replica_lookup.h M src/kudu/tserver/tablet_service.cc M src/kudu/tserver/tablet_service.h M src/kudu/tserver/ts_tablet_manager.h 14 files changed, 503 insertions(+), 53 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/72/6772/8 -- To view, visit http://gerrit.cloudera.org:8080/6772 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I16e4de09821b372c3773b4ade3fd9e37ab818808 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Will BerkeleyGerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley
[kudu-CR] KUDU-1580 retry tserver RPC if negotiation times out
Todd Lipcon has posted comments on this change. Change subject: KUDU-1580 retry tserver RPC if negotiation times out .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/6926/1/src/kudu/rpc/connection.cc File src/kudu/rpc/connection.cc: PS1, Line 170: const Status reported_status = status.IsTimedOut() : ? Status::Incomplete("RPC is not sent due to connection error", :status.ToString()) : : status; : > This sounds like a better idea. Thanks! Yea, changing Status is unfortunately a big project (has ABI compatibility concerns, too) and harder to figure out how to get all possible error codes into a single global namespace. -- To view, visit http://gerrit.cloudera.org:8080/6926 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icee8bf4978365a23d6627e7bc411b63f53540a3b Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey SerbinGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1580 retry tserver RPC if negotiation times out
Alexey Serbin has posted comments on this change. Change subject: KUDU-1580 retry tserver RPC if negotiation times out .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/6926/1/src/kudu/rpc/connection.cc File src/kudu/rpc/connection.cc: PS1, Line 170: const Status reported_status = status.IsTimedOut() : ? Status::Incomplete("RPC is not sent due to connection error", :status.ToString()) : : status; : > this seems like a very risky change to me, because now anywhere in the whol This sounds like a better idea. Thanks! Actually, while revving these set of patches I even started thinking about adding something like that into Status itself. But putting something like that into RpcController and/or Outbound call sounds like a more proper place for that. -- To view, visit http://gerrit.cloudera.org:8080/6926 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icee8bf4978365a23d6627e7bc411b63f53540a3b Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey SerbinGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-871 (part 2). Make ConsensusMetadata thread-safe
Mike Percy has posted comments on this change. Change subject: KUDU-871 (part 2). Make ConsensusMetadata thread-safe .. Patch Set 1: > I haven't looked at the implementation here yet but I'm wondering > whether this may have potential performance impact. > > In particular, we would hold this lock while flushing/fsyncing new > ConsensusMetadata to disk. Are there code paths that previously > would not block but now block? Or were we always previously > protected by the ReplicaState lock, and now we're protected by the > ConsensusMeta lock? No, they are the same. I don't think there will be much of a performance impact because the vast majority of the time we will be acquiring an uncontended futex due to holding the ReplicaState lock to do things like get the current term in RaftConsensus / ReplicaState. > Regarding the above blocking (which is already problematic) -- does > this get us closer or farther away from an end state where the > cmeta is synchronized using an RWCLock? It would be great to allow > read-only callers to continue to read while an fsync is going on, > for example. I like this idea but I hadn't factored it in when writing this. I don't think it really helps or hurts us from the perspective of doing that. However it's likely we'd need a significant refactor of ReplicaState to do that (which we should do anyway). -- To view, visit http://gerrit.cloudera.org:8080/6958 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id2c70605c0593e55486184705ea448dcb4bef2d7 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike PercyGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] threadpool-test: use test fixture
Todd Lipcon has posted comments on this change. Change subject: threadpool-test: use test fixture .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/6944/1/src/kudu/util/threadpool-test.cc File src/kudu/util/threadpool-test.cc: Line 42: using strings::Substitute; > warning: using decl 'Substitute' is unused [misc-unused-using-decls] are these Tidy Bot warnings correct? Line 53: ASSERT_OK(ThreadPoolBuilder(kDefaultPoolName).Build(_)); looking at the cases, it seems like almost all of them end up "rebuilding". Maybe it would be cleaner to just not have any default pool and have every test be responsible for building whatever pool they want? -- To view, visit http://gerrit.cloudera.org:8080/6944 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ic76e5f946356d2cf6869cb7a665d0ba5adde Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1580 retry tserver RPC if negotiation times out
Todd Lipcon has posted comments on this change. Change subject: KUDU-1580 retry tserver RPC if negotiation times out .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/6926/1/src/kudu/rpc/connection.cc File src/kudu/rpc/connection.cc: PS1, Line 170: const Status reported_status = status.IsTimedOut() : ? Status::Incomplete("RPC is not sent due to connection error", :status.ToString()) : : status; : this seems like a very risky change to me, because now anywhere in the whole codebase where we were checking for IsTimedOut may now have an unexpected Incomplete status instead. Perhaps we would be better off giving the RpcController and OutboundCall object some more getters to try to distinguish underlying specific causes for different timeouts? eg call->SetFailed(status, NEGOTIATION_TIMED_OUT, error.release()) or whatever? -- To view, visit http://gerrit.cloudera.org:8080/6926 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Icee8bf4978365a23d6627e7bc411b63f53540a3b Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey SerbinGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-871 (part 2). Make ConsensusMetadata thread-safe
Todd Lipcon has posted comments on this change. Change subject: KUDU-871 (part 2). Make ConsensusMetadata thread-safe .. Patch Set 1: I haven't looked at the implementation here yet but I'm wondering whether this may have potential performance impact. In particular, we would hold this lock while flushing/fsyncing new ConsensusMetadata to disk. Are there code paths that previously would not block but now block? Or were we always previously protected by the ReplicaState lock, and now we're protected by the ConsensusMeta lock? Regarding the above blocking (which is already problematic) -- does this get us closer or farther away from an end state where the cmeta is synchronized using an RWCLock? It would be great to allow read-only callers to continue to read while an fsync is going on, for example. -- To view, visit http://gerrit.cloudera.org:8080/6958 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Id2c70605c0593e55486184705ea448dcb4bef2d7 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike PercyGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-871 (part 1). Refactor to make cmeta a first class object
Todd Lipcon has posted comments on this change. Change subject: KUDU-871 (part 1). Refactor to make cmeta a first class object .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/6957/1//COMMIT_MSG Commit Message: PS1, Line 16: 1) Allow RaftConsensus to operate while not initialized or in a shutdown :state. Also pass ConsensusMetadata into the RaftConsensus :constructor. not 100% following this. I thought when discussing this, the idea was to do one or the other of these changes, not both? i.e., if you make it so you can instantiate RaftConsensus even for tombstoned tablet replicas, then why isn't it sufficient to still keep ConsensusMeta confined to being within the RaftConsensus class? Or alternatively, if you make TabletReplica aware of ConsensusMetadata, why do you need to make changes to the RaftConsensus lifecycle? PS1, Line 41: * Add additional lifecycle runtime assertions in TabletReplica and : RaftConsensus. Can you also clearly document the lifecycle for RaftConsensus? I'm not entirely following it from the patch. eg when you first create a RaftConsensus object, it seems to exist but have a lot of nullptr members. But then when you do Shutdown() on it, it doesn't re-nullify those members. Does that imply that a RaftConsensus instance that was tombstoned on a live server has different state than one that was tombstoned at startup? Does that have impact on thread lifecycle or memory consumption? Put another way, the above commit message tries to describe the changes, but I'm not sure what the new "state of the world" is, and the class docs here aren't clarified enough to make me feel confident that the new state of the world is correct. -- To view, visit http://gerrit.cloudera.org:8080/6957 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ia27a091d27b3996d37009d5ec866e744f9608388 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike PercyGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-1860: ksck doesn't identify tablets that are evicted but still in config
Will Berkeley has posted comments on this change. Change subject: KUDU-1860: ksck doesn't identify tablets that are evicted but still in config .. Patch Set 7: (54 comments) http://gerrit.cloudera.org:8080/#/c/6772/7/src/kudu/consensus/consensus.proto File src/kudu/consensus/consensus.proto: Line 471: // The ids of the tablets. > Can we add: An empty list means return info for all tablets known to the ta Done PS7, Line 557: // Returns the consensus state for a tablet. : rpc GetConsensusState(GetConsensusStateRequestPB) returns (GetConsensusStateResponsePB); > Can we delete this function now? Gotta love deleting code. I don't think th Done Line 560: // Returns the consensus state for a set of tablets. > I think we should document whether this returns info for tombstoned tablets Done PS7, Line 561: GetConsensusStates > I think GetConsensusState() is still a valid name for this endpoint, or if I like GetConsensusState. http://gerrit.cloudera.org:8080/#/c/6772/7/src/kudu/tools/ksck-test.cc File src/kudu/tools/ksck-test.cc: Line 70: const std::string& tablet_id, > warning: parameter 'tablet_id' is unused [misc-unused-parameters] Done Line 71: const Schema& schema, > warning: parameter 'schema' is unused [misc-unused-parameters] Done PS7, Line 232: > nit: spacing Done PS7, Line 236: InsertOrDie > Do we also need to worry about the duplicate UUID issue here? No. This is test code that's manually setting up a vanilla consensus configuration on each mock TS. If you wanted to test a case like that, you'd reach in and set it up in the specific test, as I do in the tests below for various situations Line 243:string tablet_id, > warning: the parameter 'tablet_id' is copied for each invocation but only u Done Line 243:string tablet_id, > I think this is complaining because we should either specify the input argu Kept it simple for the test and went with const string&. Line 394: FLAGS_check_consensus_info = false; > No need to reset your flags between tests due to KuduTest::flag_saver_ :) PS7, Line 401: shared_ptr ts > is it safe to declare this const shared_ptr& ts ? Done PS7, Line 401: "ts-id-0") > can we make "ts-id-0" and other identifier strings in here (except perhaps I don't think making constants for the whole file or for KsckTest makes sense just to cover these three specific tests, and the literals are actually used just once per test, so I don't think it makes it clearer or easier to store them in a variable. FWIW, it's also common to use a literal "ts-id-0" or similar in other tests in this file. Line 413: FLAGS_check_consensus_info = false; > also here Done Line 432: FLAGS_check_consensus_info = false; > and here Done Line 451: FLAGS_check_consensus_info = false; > and here Done Line 453: > Would also be interesting to test the duplicate TS UUID case. You mean the rare + bad case from L123 of ksck_remote.cc? That code path can't be tested here as that situation is dealt with by the RemoteKsckTabletServer, not by Ksck. What would happen, though, is that one of the duplicates' (same ts_id and tablet id) consensus info would be left out of the info provided to Ksck. I think it'd complicate the code to cover that rare edge case nicely, but it is easy to spit out a warning if we run into a duplicate. I'll do that. http://gerrit.cloudera.org:8080/#/c/6772/7/src/kudu/tools/ksck.cc File src/kudu/tools/ksck.cc: Line 63: using std::left; > warning: using decl 'left' is unused [misc-unused-using-decls] Done PS7, Line 623: peers_from_master > nit: peer_uuids_from_master might be clearer since when I read peer I think Done PS7, Line 624: std::transform(tablet->replicas().cbegin(), tablet->replicas().cend(), : peers_from_master.begin(), : [](const std::shared_ptr& r) { return r->ts_uuid(); }); > nit: This could be more simply written as a for-each loop: Done PS7, Line 627: boost::none > Doesn't the master know the OpId index? In its cmeta for the tablet I would guess so but that's not actually info that ksck gathers presently. KsckRemoteMaster is sneaky and gets its info about the master's POV on tablets by requesting scan tokens for a scan it'll never do. We'd have to refactor a fair bit more to add in the term/opid. PS7, Line 627: -1 > nit: magic number. Also, doesn't the master know the term? Magic number removed by using an enum. See above for why the master doesn't have the term here. PS7, Line 649: strings::Substitute("$0:$1", ts->uuid(), tablet->id()) > as noted elsewhere this seems hacky compared to: Done PS7, Line 658: peers_from_replica > same as above on naming Done PS7, Line 659: std::transform(peers.begin(), peers.end(), peers_from_replica.begin(), :[](const
[kudu-CR] Improve ASSERT STR CONTAINS()
Todd Lipcon has posted comments on this change. Change subject: Improve ASSERT_STR_CONTAINS() .. Patch Set 1: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/6956 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Idea2abc0115e13dbb9f3853a5b821ccbf833d184 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike PercyGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
Todd Lipcon has posted comments on this change. Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/tserver/tablet_service.cc File src/kudu/tserver/tablet_service.cc: Line 1067: // Skip calling SetupErrorAndRespond since this path doesn't need the > Yah, this shouldn't have an effect on the actual cluster dynamics. It's us Check out the 'Advanced per-instance throttling' section of util/logging.h -- it should do what you need -- To view, visit http://gerrit.cloudera.org:8080/6925 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan BurkertGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] docs: revise security doc based on KUDU-1875 and add release notes for 1.4
Todd Lipcon has posted comments on this change. Change subject: docs: revise security doc based on KUDU-1875 and add release notes for 1.4 .. Patch Set 4: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/6922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2e386203f5ed3ef66e2ec136e67738b8c7eb8b1a Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Hao HaoGerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Hao Hao Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
Dan Burkert has posted comments on this change. Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/6925/3//COMMIT_MSG Commit Message: PS3, Line 22: to to > Double to Done -- To view, visit http://gerrit.cloudera.org:8080/6925 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan BurkertGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
Dan Burkert has posted comments on this change. Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/consensus/consensus_peers.cc File src/kudu/consensus/consensus_peers.cc: Line 336: if (!controller_.status().ok() || !tc_response_.has_error()) { > Yah, and in fact it may be wrong. I think I meant: Looking more at this, we don't seem to be updating 'last_successful_communication_time ' in the successful case. -- To view, visit http://gerrit.cloudera.org:8080/6925 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan BurkertGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
Hello Todd Lipcon, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6925 to look at the new patch set (#7). Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet .. KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet The 'active ingredient' in this patch is the change to TsTabletManager::StartTabletCopy that causes an ALREADY_INPROGRESS response to be returned if the tablet is currently being copied and the tablet copy thread pool is full. Previously an ALREADY_INPROGRESS response would only occur if the tablet was currently being copied, and the threadpool was not full. The effect of the failure to return ALREADY_INPROGRESS was that a leader would be much more likely consider a tablet server failed and to subsequently drop the replica from the Raft config. As a result, on a highly loaded cluster, a tablet copy could be started at the same time, 300 seconds apart, on many tablet servers. The remaining changes are to return more specific errors out of the tablet copy service, which aids with testing specific codepaths. One of the existing tablet_copy-itest cases has been beefed up to cover the tablet copy threadpool full path. Without the changes outlined before it fails with: ../../src/kudu/integration-tests/tablet_copy-itest.cc:961: Failure Expected: (num_inprogress) > (0), actual: 0 vs 0 which is exactly what we would expect; the tablet server is failing to return INPROGRESS errors. Anecdotally, this patch has improved TTR times 5-10x on highly loaded clusters. It's still possible for tablets to be bounced around during re-replication if the copying tablet server has a full RPC queue, or it's unable to start the tablet copy for 300 seconds, but both of these conditions indicate that it's probably best to drop that tserver and retry on a (hopefully) less stressed server. Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75 --- M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/integration-tests/cluster_itest_util.cc M src/kudu/integration-tests/cluster_itest_util.h M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/tserver/tablet_service.cc M src/kudu/tserver/ts_tablet_manager.cc 8 files changed, 138 insertions(+), 55 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/6925/7 -- To view, visit http://gerrit.cloudera.org:8080/6925 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan BurkertGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-871 (part 2). Make ConsensusMetadata thread-safe
Hello David Ribeiro Alves, Todd Lipcon, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/6958 to review the following change. Change subject: KUDU-871 (part 2). Make ConsensusMetadata thread-safe .. KUDU-871 (part 2). Make ConsensusMetadata thread-safe This patch simply adds locks around all accessors and makes ConsensusMetadata refcounted. This enabled ConsensusMetadata to be a long-lived shared object that brokers access to the underlying ConsensusMetadata file in a followup patch. TODO: Could use an MT test in consensus_meta-test to exercise the lock. Change-Id: Id2c70605c0593e55486184705ea448dcb4bef2d7 --- M src/kudu/consensus/consensus_meta-test.cc M src/kudu/consensus/consensus_meta.cc M src/kudu/consensus/consensus_meta.h M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/raft_consensus_quorum-test.cc M src/kudu/consensus/raft_consensus_state.cc M src/kudu/consensus/raft_consensus_state.h M src/kudu/integration-tests/ts_recovery-itest.cc M src/kudu/master/sys_catalog.cc M src/kudu/master/sys_catalog.h M src/kudu/tablet/tablet_bootstrap-test.cc M src/kudu/tablet/tablet_bootstrap.cc M src/kudu/tablet/tablet_replica-test.cc M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tools/tool_action_local_replica.cc M src/kudu/tserver/tablet_copy_client.cc M src/kudu/tserver/tablet_copy_client.h M src/kudu/tserver/tablet_copy_source_session-test.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 22 files changed, 195 insertions(+), 98 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/58/6958/1 -- To view, visit http://gerrit.cloudera.org:8080/6958 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: Id2c70605c0593e55486184705ea448dcb4bef2d7 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike PercyGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Todd Lipcon
[kudu-CR] mini cluster: Test infrastructure improvements
Hello David Ribeiro Alves, Todd Lipcon, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/6959 to review the following change. Change subject: mini cluster: Test infrastructure improvements .. mini cluster: Test infrastructure improvements This patch adds a couple small improvements to the MiniCluster infra. * Add base class helpers to MiniClusterITestBase to avoid code duplication, including StopCluster() and ts_map_ for convenience and consistency with ExternalMiniClusterITestBase. * Add ListTablets() helper function to MiniTabletServer. Change-Id: I6bc6a04b113e59243fb788fec15b9935c3dcf069 --- M src/kudu/integration-tests/CMakeLists.txt M src/kudu/integration-tests/delete_tablet-itest.cc A src/kudu/integration-tests/mini_cluster-itest-base.cc M src/kudu/integration-tests/mini_cluster-itest-base.h M src/kudu/tserver/mini_tablet_server.cc M src/kudu/tserver/mini_tablet_server.h 6 files changed, 85 insertions(+), 27 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/59/6959/1 -- To view, visit http://gerrit.cloudera.org:8080/6959 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I6bc6a04b113e59243fb788fec15b9935c3dcf069 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike PercyGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-871 (part 1). Refactor to make cmeta a first class object
Hello David Ribeiro Alves, Todd Lipcon, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/6957 to review the following change. Change subject: KUDU-871 (part 1). Refactor to make cmeta a first class object .. KUDU-871 (part 1). Refactor to make cmeta a first class object This patch is primarily a refactor to TabletReplica and RaftConsensus to ensure that they are always constructed with a ConsensusMetadata object. The ConsensusMetadata file is required for tombstoned voting as the record of a replica's votes. Changes in this patch: 1) Allow RaftConsensus to operate while not initialized or in a shutdown state. Also pass ConsensusMetadata into the RaftConsensus constructor. 2) Refactor TabletReplica to create a RaftConsensus instance at construction time. This will allow us to incorporate tombstoned voting later by allowing RaftConsensus to be accessible from a TabletReplica that comes up fresh in the TABLET_DATA_TOMBSTONED state. Also, do not destroy the Consensus object on TabletPeer::Shutdown() for the same reasons. 3) Add an additional on-disk data state (TABLET_DATA_FRESH) to allow us to identify when we crashed between writing the superblock and the ConsensusMetadata file. If a tablet is in that state then it has never voted so is legal to fully delete instead of attempt to recover at startup time. Other changes: * Remove unused Consensus::State enum. * Get rid of TabletStatusListener * bug fix: RaftConsensus must hold lock before snoozing FD TODO: * Add additional lifecycle runtime assertions in TabletReplica and RaftConsensus. * Consider adding additional ReplicaState::State enum value for kInitialized but not kRunning. * Finish writing new crash recovery tests. Change-Id: Ia27a091d27b3996d37009d5ec866e744f9608388 --- M src/kudu/consensus/consensus.h M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/raft_consensus_quorum-test.cc M src/kudu/integration-tests/ts_recovery-itest.cc M src/kudu/master/sys_catalog.cc M src/kudu/master/sys_catalog.h M src/kudu/tablet/metadata.proto M src/kudu/tablet/tablet_bootstrap-test.cc M src/kudu/tablet/tablet_bootstrap.cc M src/kudu/tablet/tablet_bootstrap.h M src/kudu/tablet/tablet_replica-test.cc M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tserver/tablet_copy_client.cc M src/kudu/tserver/tablet_copy_client.h M src/kudu/tserver/tablet_copy_source_session-test.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h M src/kudu/tserver/tserver-path-handlers.cc 20 files changed, 411 insertions(+), 381 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/57/6957/1 -- To view, visit http://gerrit.cloudera.org:8080/6957 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: Ia27a091d27b3996d37009d5ec866e744f9608388 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike PercyGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-871 (part 3). Support tombstoned voting
Hello David Ribeiro Alves, Todd Lipcon, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/6960 to review the following change. Change subject: KUDU-871 (part 3). Support tombstoned voting .. KUDU-871 (part 3). Support tombstoned voting This patch makes it possible for tombstoned tablet replicas to vote in Raft elections. * Add ConsensusMetadata flush control locking mechanism and plumb it into TabletReplica. * Make ConsensusMetadata "immortal" to prevent flush races / interleavings between TabletCopy, TabletBootstrap, and RaftConsensus tombstoned voting. * Add new test TombstonedVotingITest * Fix DeleteTableITest.TestMergeConsensusMetadata after tombstoned voting changed its assumption that tombstoned tablets would not vote. TODO: * Add negative test for tombstoned voting. * Add a targeted concurrency test that induces lots of tombstoned voting while running TabletCopy, TabletBootstrap, and DeleteTablet. Change-Id: Ia19d75b185299443b27f41e468bbae20065e7570 --- M src/kudu/consensus/consensus-test-util.h M src/kudu/consensus/consensus.h M src/kudu/consensus/consensus_meta.cc M src/kudu/consensus/consensus_meta.h M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/raft_consensus_quorum-test.cc M src/kudu/consensus/raft_consensus_state.cc M src/kudu/consensus/raft_consensus_state.h M src/kudu/integration-tests/CMakeLists.txt M src/kudu/integration-tests/delete_table-itest.cc A src/kudu/integration-tests/tombstoned_voting-itest.cc M src/kudu/tablet/tablet_bootstrap.cc M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tserver/tablet_copy_client.cc M src/kudu/tserver/tablet_copy_client.h M src/kudu/tserver/tablet_service.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 20 files changed, 443 insertions(+), 102 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/60/6960/1 -- To view, visit http://gerrit.cloudera.org:8080/6960 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: Ia19d75b185299443b27f41e468bbae20065e7570 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike PercyGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Todd Lipcon
[kudu-CR] threadpool: new test for pools with no max threads
Alexey Serbin has posted comments on this change. Change subject: threadpool: new test for pools with no max_threads .. Patch Set 3: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/6945 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I37d0bd0a05098bcc1a81ec9f1ac33e74e98781e4 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
Dan Burkert has posted comments on this change. Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/consensus/consensus_peers.cc File src/kudu/consensus/consensus_peers.cc: Line 336: if (!controller_.status().ok() || !tc_response_.has_error()) { > This is, uh, confusing to say the least. Can you pull up the ALREADY_INPROG Yah, and in fact it may be wrong. I think I meant: if (!controller_.status().ok() || tc_response_.has_error()) I'll simplify it. -- To view, visit http://gerrit.cloudera.org:8080/6925 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan BurkertGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
Dan Burkert has posted comments on this change. Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/tserver/tablet_service.cc File src/kudu/tserver/tablet_service.cc: Line 1067: // Skip calling SetupErrorAndRespond since this path doesn't need the > This doesn't actually affect tablet copy operations, right? If SetupErrorAn Yah, this shouldn't have an effect on the actual cluster dynamics. It's useful for testing this specific case, though. I've changed the logging on the leader (consensus_peers.cc) so that all tablet copy failures are logged. I'm going to do another cluster test to verify that this isn't too noisy. Ideally it would be limited to a WARN log every 60 seconds or so, but I'm not sure how to do that on a per-tablet basis. -- To view, visit http://gerrit.cloudera.org:8080/6925 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan BurkertGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
Adar Dembo has posted comments on this change. Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet .. Patch Set 3: (3 comments) http://gerrit.cloudera.org:8080/#/c/6925/3//COMMIT_MSG Commit Message: PS3, Line 22: to to Double to http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/consensus/consensus_peers.cc File src/kudu/consensus/consensus_peers.cc: Line 336: if (!controller_.status().ok() || !tc_response_.has_error()) { This is, uh, confusing to say the least. Can you pull up the ALREADY_INPROGRESS comment and write something more comprehensive explaining what's going on here and why? http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/tserver/tablet_service.cc File src/kudu/tserver/tablet_service.cc: Line 1067: // Skip calling SetupErrorAndRespond since this path doesn't need the This doesn't actually affect tablet copy operations, right? If SetupErrorAndRespond() is called, it'd convert ServiceUnavailable+THROTTLED into Rpc::ERROR_SERVER_TOO_BUSY but as far as I can tell that has no real effect on Peer::ProcessTabletCopyResponse apart from preventing a noisy LOG(WARNING). Do we actually want that noisy log? -- To view, visit http://gerrit.cloudera.org:8080/6925 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan BurkertGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
Dan Burkert has posted comments on this change. Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet .. Patch Set 5: (2 comments) http://gerrit.cloudera.org:8080/#/c/6925/5/src/kudu/integration-tests/cluster_itest_util.h File src/kudu/integration-tests/cluster_itest_util.h: Line 337: const tablet::TabletDataState delete_type, > warning: parameter 'delete_type' is const-qualified in the function declara Done http://gerrit.cloudera.org:8080/#/c/6925/5/src/kudu/integration-tests/tablet_copy-itest.cc File src/kudu/integration-tests/tablet_copy-itest.cc: Line 66: using std::pair; > warning: using decl 'pair' is unused [misc-unused-using-decls] Done -- To view, visit http://gerrit.cloudera.org:8080/6925 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75 Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan BurkertGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
Hello Todd Lipcon, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6925 to look at the new patch set (#6). Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet .. KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet The 'active ingredient' in this patch is the change to TsTabletManager::StartTabletCopy that causes an ALREADY_INPROGRESS response to be returned if the tablet is currently being copied and the tablet copy thread pool is full. Previously an ALREADY_INPROGRESS response would only occur if the tablet was currently being copied, and the threadpool was not full. The effect of the failure to return ALREADY_INPROGRESS was that a leader would be much more likely consider a tablet server failed and to subsequently drop the replica from the Raft config. As a result, on a highly loaded cluster, a tablet copy could be started at the same time, 300 seconds apart, on many tablet servers. The remaining changes are to to return more specific errors out of the tablet copy service, which aids with testing specific codepaths. One of the existing tablet_copy-itest cases has been beefed up to cover this specific regression. Without the changes outlined before it fails with: ../../src/kudu/integration-tests/tablet_copy-itest.cc:961: Failure Expected: (num_inprogress) > (0), actual: 0 vs 0 which is exactly what we would expect; the tablet server is failing to return INPROGRESS errors. Anecdotally, this patch has improved TTR times 5-10x on highly loaded clusters. It's still possible for tablets to be bounced around during re-replication if the copying tablet server has a full RPC queue, or it's unable to start the tablet copy for 300 seconds, but both of these conditions indicate that it's probably best to drop that tserver and retry on a (hopefully) less stressed server. Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75 --- M src/kudu/consensus/consensus_peers.cc M src/kudu/integration-tests/cluster_itest_util.cc M src/kudu/integration-tests/cluster_itest_util.h M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/tserver/tablet_service.cc M src/kudu/tserver/ts_tablet_manager.cc 6 files changed, 124 insertions(+), 42 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/6925/6 -- To view, visit http://gerrit.cloudera.org:8080/6925 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan BurkertGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
Todd Lipcon has posted comments on this change. Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/integration-tests/cluster_itest_util.cc File src/kudu/integration-tests/cluster_itest_util.cc: PS3, Line 941: CHECK(resp.error().has_code()) << "Tablet copy error response has no code"; : CHECK(tserver::TabletServerErrorPB::Code_IsValid(resp.error().code())) : << "Tablet copy error response code is not valid"; > My understanding is that this is test-only code. oh, duh, sorry... missed the context here. -- To view, visit http://gerrit.cloudera.org:8080/6925 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan BurkertGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
Hello Todd Lipcon, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6925 to look at the new patch set (#5). Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet .. KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet The 'active ingredient' in this patch is the change to TsTabletManager::StartTabletCopy that causes an ALREADY_INPROGRESS response to be returned if the tablet is currently being copied and the tablet copy thread pool is full. Previously an ALREADY_INPROGRESS response would only occur if the tablet was currently being copied, and the threadpool was not full. The effect of the failure to return ALREADY_INPROGRESS was that a leader would be much more likely consider a tablet server failed and to subsequently drop the replica from the Raft config. As a result, on a highly loaded cluster, a tablet copy could be started at the same time, 300 seconds apart, on many tablet servers. The remaining changes are to to return more specific errors out of the tablet copy service, which aids with testing specific codepaths. One of the existing tablet_copy-itest cases has been beefed up to cover this specific regression. Without the changes outlined before it fails with: ../../src/kudu/integration-tests/tablet_copy-itest.cc:961: Failure Expected: (num_inprogress) > (0), actual: 0 vs 0 which is exactly what we would expect; the tablet server is failing to return INPROGRESS errors. Anecdotally, this patch has improved TTR times 5-10x on highly loaded clusters. It's still possible for tablets to be bounced around during re-replication if the copying tablet server has a full RPC queue, or it's unable to start the tablet copy for 300 seconds, but both of these conditions indicate that it's probably best to drop that tserver and retry on a (hopefully) less stressed server. Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75 --- M src/kudu/consensus/consensus_peers.cc M src/kudu/integration-tests/cluster_itest_util.cc M src/kudu/integration-tests/cluster_itest_util.h M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/tserver/tablet_service.cc M src/kudu/tserver/ts_tablet_manager.cc 6 files changed, 124 insertions(+), 40 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/6925/5 -- To view, visit http://gerrit.cloudera.org:8080/6925 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75 Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan BurkertGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
Dan Burkert has posted comments on this change. Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/integration-tests/cluster_itest_util.cc File src/kudu/integration-tests/cluster_itest_util.cc: PS3, Line 941: CHECK(resp.error().has_code()) << "Tablet copy error response has no code"; : CHECK(tserver::TabletServerErrorPB::Code_IsValid(resp.error().code())) : << "Tablet copy error response code is not valid"; > could we hit these CHECKs in a rolling upgrade scenario? we don't officiall My understanding is that this is test-only code. -- To view, visit http://gerrit.cloudera.org:8080/6925 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan BurkertGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet
Dan Burkert has posted comments on this change. Change subject: KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet .. Patch Set 3: (2 comments) http://gerrit.cloudera.org:8080/#/c/6925/2/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: PS2, Line 395: // The thread pool is at capacity. Check if the tablet is already in : // transition (i.e. being copied). : boost::optional transition; : { : std::lock_guard lock(lock_); : auto* t = FindOrNull(transition_in_progress_, tablet_id); : if (t) { : transition = *t; : } : } > +1 The 'happy path' in this case is that the thread pool is not oversubscribed. In that case the tablet copy immediately gets a thread, and as part of initializing, it already checks that there isn't a copy in progress. So, if we put the check up front, it would actually happen twice for the fast path. PS2, Line 406: cb(Status::IllegalState( : strings::Substitute("State transition of tablet $0 already in progress: $1", : tablet_id, *transition)), : TabletServerErrorPB::ALREADY_INPROGRESS); > It should get logged by the leader making the remote call. Yes, I've beefed up the logging of these errors on the leader. Going to do another cluster test to make sure it's not overwhelming. -- To view, visit http://gerrit.cloudera.org:8080/6925 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dan BurkertGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] threadpool: new test for pools with no max threads
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6945 to look at the new patch set (#3). Change subject: threadpool: new test for pools with no max_threads .. threadpool: new test for pools with no max_threads This test ensures that a pool created with effectively no max_threads works as expected. That is: 1. Tokenless tasks should trigger the creation of a new thread. 2. Token-based tasks can create new threads, but only up to the number of tokens submitted against. I intend to use this "feature" to consolidate some Raft thread pools where a num_cpus upper bound may be undesirable (i.e. where tasks submitted to the pools may result in blocking IO). Change-Id: I37d0bd0a05098bcc1a81ec9f1ac33e74e98781e4 --- M src/kudu/util/threadpool-test.cc M src/kudu/util/threadpool.h 2 files changed, 70 insertions(+), 17 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/6945/3 -- To view, visit http://gerrit.cloudera.org:8080/6945 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I37d0bd0a05098bcc1a81ec9f1ac33e74e98781e4 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] threadpool: new test for pools with no max threads
Adar Dembo has posted comments on this change. Change subject: threadpool: new test for pools with no max_threads .. Patch Set 2: (2 comments) http://gerrit.cloudera.org:8080/#/c/6945/2/src/kudu/util/threadpool-test.cc File src/kudu/util/threadpool-test.cc: PS2, Line 153: shared_ptr(new SlowTask(latch)); > nit: consider using std::make_shared(latch) instead Done PS2, Line 229: pool_->Wait(); : pool_->Shutdown(); > What if one of asserts above triggers due to a failure? Would it be necess ThreadPool's destructor also calls Shutdown(). -- To view, visit http://gerrit.cloudera.org:8080/6945 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I37d0bd0a05098bcc1a81ec9f1ac33e74e98781e4 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] threadpool: token-based task sequencing
Adar Dembo has posted comments on this change. Change subject: threadpool: token-based task sequencing .. Patch Set 3: Verified+1 Failure was in kinit in a Java test. -- To view, visit http://gerrit.cloudera.org:8080/6874 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If46dc34212027b6ea5dbc2ead7c7af8be57f2c70 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] threadpool: new test for pools with no max threads
Alexey Serbin has posted comments on this change. Change subject: threadpool: new test for pools with no max_threads .. Patch Set 2: (2 comments) http://gerrit.cloudera.org:8080/#/c/6945/2/src/kudu/util/threadpool-test.cc File src/kudu/util/threadpool-test.cc: PS2, Line 153: shared_ptr(new SlowTask(latch)); nit: consider using std::make_shared(latch) instead PS2, Line 229: pool_->Wait(); : pool_->Shutdown(); What if one of asserts above triggers due to a failure? Would it be necessary to call Shutdown() on pool_ or it's will be cleaned up automatically somewhere else? -- To view, visit http://gerrit.cloudera.org:8080/6945 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I37d0bd0a05098bcc1a81ec9f1ac33e74e98781e4 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] threadpool: new test for pools with no max threads
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6945 to look at the new patch set (#2). Change subject: threadpool: new test for pools with no max_threads .. threadpool: new test for pools with no max_threads This test ensures that a pool created with effectively no max_threads works as expected. That is: 1. Tokenless tasks should trigger the creation of a new thread. 2. Token-based tasks can create new threads, but only up to the number of tokens submitted against. I intend to use this "feature" to consolidate some Raft thread pools where a num_cpus upper bound may be undesirable (i.e. where tasks submitted to the pools may result in blocking IO). Change-Id: I37d0bd0a05098bcc1a81ec9f1ac33e74e98781e4 --- M src/kudu/util/threadpool-test.cc M src/kudu/util/threadpool.h 2 files changed, 70 insertions(+), 17 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/6945/2 -- To view, visit http://gerrit.cloudera.org:8080/6945 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I37d0bd0a05098bcc1a81ec9f1ac33e74e98781e4 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] threadpool: token-based task sequencing
Adar Dembo has posted comments on this change. Change subject: threadpool: token-based task sequencing .. Patch Set 2: (4 comments) http://gerrit.cloudera.org:8080/#/c/6874/2/src/kudu/util/threadpool-test.cc File src/kudu/util/threadpool-test.cc: Line 45: using strings::Substitute; > warning: using decl 'Substitute' is unused [misc-unused-using-decls] Done http://gerrit.cloudera.org:8080/#/c/6874/2/src/kudu/util/threadpool.cc File src/kudu/util/threadpool.cc: Line 168: return Submit(std::shared_ptr(new ClosureRunnable(std::move(c; > warning: std::move of the const variable has no effect; remove std::move() Done Line 380: return Submit(std::shared_ptr(new ClosureRunnable(std::move(c; > warning: std::move of the const variable has no effect; remove std::move() Done http://gerrit.cloudera.org:8080/#/c/6874/2/src/kudu/util/threadpool.h File src/kudu/util/threadpool.h: Line 285: Status DoSubmit(std::shared_ptr task, > warning: function 'kudu::ThreadPool::DoSubmit' has a definition with differ Done -- To view, visit http://gerrit.cloudera.org:8080/6874 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: If46dc34212027b6ea5dbc2ead7c7af8be57f2c70 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] consensus: consolidate Raft thread pools
Adar Dembo has posted comments on this change. Change subject: consensus: consolidate Raft thread pools .. Patch Set 1: I'm having trouble reconciling the SequenceToken lifecycle requirements with the complicated RaftConsensus lifecycle. The issue is that while RaftConsensus::ShutDown() is called in a normal context (typically a service pool thread handling the DeleteTablet() RPC), the object itself may be destroyed later, by its thread pool's worker thread. This happens because the RaftConsensus object is bound and retained by several callbacks that run on the thread pool. When this happens, the process crashes because ~RaftConsensus tries to release its SequenceToken, but the token's slot is still active because it's running this destructor. I'm not sure how to fix this because I still don't fully understand RaftConsensus's locking. Is it sufficient for RaftConsensus::Shutdown() to destroy the SequenceToken and for the various callbacks to check if the token is live before submitting to it? Or does that require additional synchronization? Here's a sample failure: https://gist.github.com/adembo/9a3c8e91807ba41bb4aceef67ec1dc0a with some additional logging. Note how thread 12205 initiates the crash, and the last thing it logged was "Destroying runnable" but not "Runnable destroyed"; this is a sign that the crash was in the worker thread's runnable release code. -- To view, visit http://gerrit.cloudera.org:8080/6946 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8958c607d11ac2e94908670c985059bef0ff5f54 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] threadpool-test: use test fixture
Hello Dan Burkert, David Ribeiro Alves, Todd Lipcon, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/6944 to review the following change. Change subject: threadpool-test: use test fixture .. threadpool-test: use test fixture Change-Id: Ic76e5f946356d2cf6869cb7a665d0ba5adde --- M src/kudu/util/threadpool-test.cc M src/kudu/util/threadpool.h 2 files changed, 146 insertions(+), 149 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/44/6944/1 -- To view, visit http://gerrit.cloudera.org:8080/6944 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: Ic76e5f946356d2cf6869cb7a665d0ba5adde Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Todd Lipcon
[kudu-CR] threadpool: new test for pools with no max threads
Hello Dan Burkert, David Ribeiro Alves, Todd Lipcon, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/6945 to review the following change. Change subject: threadpool: new test for pools with no max_threads .. threadpool: new test for pools with no max_threads This test ensures that a pool created with effectively no max_threads works as expected. That is: 1. Tokenless tasks should trigger the creation of a new thread. 2. Token-based tasks can create new threads, but only up to the number of tokens submitted against. I intend to use this "feature" to consolidate some Raft thread pools where a num_cpus upper bound may be undesirable (i.e. where tasks submitted to the pools may result in blocking IO). Change-Id: I37d0bd0a05098bcc1a81ec9f1ac33e74e98781e4 --- M src/kudu/util/threadpool-test.cc M src/kudu/util/threadpool.h 2 files changed, 70 insertions(+), 17 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/6945/1 -- To view, visit http://gerrit.cloudera.org:8080/6945 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I37d0bd0a05098bcc1a81ec9f1ac33e74e98781e4 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Todd Lipcon
[kudu-CR] threadpool: token-based task sequencing
Hello Dan Burkert, David Ribeiro Alves, Todd Lipcon, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6874 to look at the new patch set (#2). Change subject: threadpool: token-based task sequencing .. threadpool: token-based task sequencing This patch adds task sequencing to util/threadpool. Task sequencing allows m contexts to share a single pool with n threads while also ensuring that the pool executes tasks belonging to each context in order. Previously this was only achievable via "one singleton pool per context", which grossly inflated the total number of threads and overall state. A group of tasks are sequenced by submission to a dedicated "slot". A client obtains exclusive access to a slot via AllocateTokenSlot(), which returns a "token" that can be used for task submission/waiting. When the token is destroyed (or Release() is called), the slot is returned to the pool. For implementation simplicity, clients must wait for a token's outstanding tasks to complete before destroying their token. Some notes: - Slots and tokens are mapped 1-1 so they could theoretically be combined, but I prefer this separation of concerns. - The current implementation requires tokens to have no outstanding tasks when they are released. This was a conscious choice made so that token destruction (especially when done as part of a larger object graph destruct) won't take an unusually long amount of time. - I evaluated two other implementations. In one, tokens had an implicit lifecycle that was automatically managed by the threadpool. While simpler for clients, the threadpool was more inefficient with more allocations and deallocations in each submission. In the other, token submission was built using regular task submission. This afforded a certain separation between the "core" of the threadpool and the token logic, but it complicated locking and tracking of queued tasks. - I tried to keep submission (whether token-based or tokenless) fast. Just the change from std::list to std::deque for the main queue ought to improve performance of tokenless submissions. The next bottleneck is likely to be lock contention on the global threadpool lock. Change-Id: If46dc34212027b6ea5dbc2ead7c7af8be57f2c70 --- M src/kudu/util/debug-util.h M src/kudu/util/threadpool-test.cc M src/kudu/util/threadpool.cc M src/kudu/util/threadpool.h 4 files changed, 708 insertions(+), 70 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/74/6874/2 -- To view, visit http://gerrit.cloudera.org:8080/6874 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: If46dc34212027b6ea5dbc2ead7c7af8be57f2c70 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] consensus: consolidate Raft thread pools
Hello Dan Burkert, David Ribeiro Alves, Todd Lipcon, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/6946 to review the following change. Change subject: consensus: consolidate Raft thread pools .. consensus: consolidate Raft thread pools This patch consolidates the two thread pools used in Raft consensus: the observer and "Raft" (for lack of a better name) pools. The former runs tasks for infrequent events such as term and config changes while the latter is used for periodic events such as processing heartbeat responses. In this patch, each per-replica pool is consolidated into a single per-server pool. Sequence tokens are used to ensure that tasks belonging to a single replica are executed serially and in order. For the "Raft" pool, a single sequence token is shared by the PeerManager and RaftConsensus; this mirrors the existing sharing behavior and is safe because token operations are thread-safe. The non-default max_threads may come as a surprise, but David showed me how tasks submitted to either pool may sometimes lead to an fsync (mostly via cmeta flushes). As such, it isn't safe to use the default num_cpus upper bound, as that may cause such IO-based tasks to block other CPU-only tasks (or worse, periodic tasks like heartbeat response processing). What per-replica threads are not addressed by this patch? - Failure detection thread: a threadpool isn't the right model for these. Something like a hash wheel timer would be ideal. - Prepare thread pool (max_threads=1): these could be consolidated too, but their metrics are per-replica, and sequence tokens don't (yet) support that. Meaning, if they were consolidated now, the metrics would also consolidate and would be less useful as a result. Change-Id: I8958c607d11ac2e94908670c985059bef0ff5f54 --- M src/kudu/consensus/consensus_peers-test.cc M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_peers.h M src/kudu/consensus/consensus_queue-test.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/consensus/peer_manager.cc M src/kudu/consensus/peer_manager.h M src/kudu/consensus/raft_consensus.cc M src/kudu/consensus/raft_consensus.h M src/kudu/consensus/raft_consensus_quorum-test.cc M src/kudu/master/sys_catalog.cc M src/kudu/server/server_base.cc M src/kudu/server/server_base.h M src/kudu/tablet/tablet_replica-test.cc M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tserver/tablet_copy_source_session-test.cc M src/kudu/tserver/ts_tablet_manager.cc 19 files changed, 188 insertions(+), 117 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/46/6946/1 -- To view, visit http://gerrit.cloudera.org:8080/6946 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I8958c607d11ac2e94908670c985059bef0ff5f54 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar DemboGerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Todd Lipcon