[kudu-CR] Move CreateTableForTesting into registration-test

2017-05-22 Thread Todd Lipcon (Code Review)
Hello Adar Dembo, Alexey Serbin,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6961

to look at the new patch set (#2).

Change subject: Move CreateTableForTesting into registration-test
..

Move CreateTableForTesting into registration-test

This utility function was only used in one spot, so moving it out of the
header.

Change-Id: Id72a37b2dee05db65f27c27d52b0f18dd5c489f2
---
M src/kudu/integration-tests/registration-test.cc
M src/kudu/master/master-test-util.h
2 files changed, 60 insertions(+), 59 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/6961/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6961
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id72a37b2dee05db65f27c27d52b0f18dd5c489f2
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins


[kudu-CR] KUDU-1125 (part 1) catalog manager: try to avoid unnecessarily rewriting tablet info

2017-05-22 Thread Todd Lipcon (Code Review)
Hello Adar Dembo, Alexey Serbin, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6916

to look at the new patch set (#3).

Change subject: KUDU-1125 (part 1) catalog_manager: try to avoid unnecessarily 
rewriting tablet info
..

KUDU-1125 (part 1) catalog_manager: try to avoid unnecessarily rewriting tablet 
info

This adds some checks in SysCatalogTable so that, if the updated version
of a catalog table entry is identical to the previous version, we avoid
writing anything to the table.

This is a simple way of short-circuiting a common case of unnecessary
slowness in tablet report processing. For example, when the master
restarts or fails over, all of the tablet servers perform a full tablet
report. However, most of the tablets will not have changed any
information since their prior report, in which case the writes can be
safely skipped on the master.

This doesn't achieve all of the goals of KUDU-1125 (we still do separate
sync writes for each _changed_ reported tablet) but should still be a
substantial reduction. Perhaps most importantly, if a heartbeat from a
tserver times out due to long processing, the retry from the TS should
hit this code path and therefore be processed quickly.

Change-Id: I0de7189b8f1dbeea55172929396b73fd92fd62ba
---
M src/kudu/integration-tests/registration-test.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/master-test-util.h
M src/kudu/master/sys_catalog.cc
M src/kudu/master/sys_catalog.h
5 files changed, 103 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/16/6916/3
-- 
To view, visit http://gerrit.cloudera.org:8080/6916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0de7189b8f1dbeea55172929396b73fd92fd62ba
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] Move CreateTableForTEsting into registration-test

2017-05-22 Thread Todd Lipcon (Code Review)
Hello Adar Dembo, Alexey Serbin,

I'd like you to do a code review.  Please visit

http://gerrit.cloudera.org:8080/6961

to review the following change.

Change subject: Move CreateTableForTEsting into registration-test
..

Move CreateTableForTEsting into registration-test

This utility function was only used in one spot, so moving it out of the
header.

Change-Id: Id72a37b2dee05db65f27c27d52b0f18dd5c489f2
---
M src/kudu/integration-tests/registration-test.cc
M src/kudu/master/master-test-util.h
2 files changed, 60 insertions(+), 59 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/6961/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6961
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Id72a37b2dee05db65f27c27d52b0f18dd5c489f2
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 


[kudu-CR] KUDU-1125 (part 1) catalog manager: try to avoid unnecessarily rewriting tablet info

2017-05-22 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: KUDU-1125 (part 1) catalog_manager: try to avoid unnecessarily 
rewriting tablet info
..


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6916/2/src/kudu/integration-tests/registration-test.cc
File src/kudu/integration-tests/registration-test.cc:

Line 38: #include "kudu/tablet/tablet_replica.h"
> warning: #includes are not sorted properly [llvm-include-order]
Done


http://gerrit.cloudera.org:8080/#/c/6916/2/src/kudu/master/master-test-util.h
File src/kudu/master/master-test-util.h:

Line 71: void CreateTableForTesting(MiniMaster* mini_master,
> warning: function 'CreateTableForTesting' defined in a header file; functio
this is only used by one test (registration-test). I'll move it into there in a 
separate commit.


-- 
To view, visit http://gerrit.cloudera.org:8080/6916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0de7189b8f1dbeea55172929396b73fd92fd62ba
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1125 (part 1) catalog manager: try to avoid unnecessarily rewriting tablet info

2017-05-22 Thread Todd Lipcon (Code Review)
Hello Adar Dembo, Alexey Serbin, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6916

to look at the new patch set (#2).

Change subject: KUDU-1125 (part 1) catalog_manager: try to avoid unnecessarily 
rewriting tablet info
..

KUDU-1125 (part 1) catalog_manager: try to avoid unnecessarily rewriting tablet 
info

This adds some checks in SysCatalogTable so that, if the updated version
of a catalog table entry is identical to the previous version, we avoid
writing anything to the table.

This is a simple way of short-circuiting a common case of unnecessary
slowness in tablet report processing. For example, when the master
restarts or fails over, all of the tablet servers perform a full tablet
report. However, most of the tablets will not have changed any
information since their prior report, in which case the writes can be
safely skipped on the master.

This doesn't achieve all of the goals of KUDU-1125 (we still do separate
sync writes for each _changed_ reported tablet) but should still be a
substantial reduction. Perhaps most importantly, if a heartbeat from a
tserver times out due to long processing, the retry from the TS should
hit this code path and therefore be processed quickly.

Change-Id: I0de7189b8f1dbeea55172929396b73fd92fd62ba
---
M src/kudu/integration-tests/registration-test.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/master-test-util.h
M src/kudu/master/sys_catalog.cc
M src/kudu/master/sys_catalog.h
5 files changed, 103 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/16/6916/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0de7189b8f1dbeea55172929396b73fd92fd62ba
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins


[kudu-CR] KUDU-871 (part 2). Make ConsensusMetadata thread-safe

2017-05-22 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: KUDU-871 (part 2). Make ConsensusMetadata thread-safe
..


Patch Set 1: Verified+1

I can't reproduce this failure (even after 1000s of iters on dist-test with 
different levels of stress / build types)

-- 
To view, visit http://gerrit.cloudera.org:8080/6958
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id2c70605c0593e55486184705ea448dcb4bef2d7
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

2017-05-22 Thread Dan Burkert (Code Review)
Hello Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6925

to look at the new patch set (#8).

Change subject: KUDU-2020: tserver failure causes multiple tablet copy 
operations per under-replicated tablet
..

KUDU-2020: tserver failure causes multiple tablet copy operations per 
under-replicated tablet

The 'active ingredient' in this patch is the change to
TsTabletManager::StartTabletCopy that causes an ALREADY_INPROGRESS
response to be returned if the tablet is currently being copied and the
tablet copy thread pool is full. Previously an ALREADY_INPROGRESS
response would only occur if the tablet was currently being copied, and
the threadpool was not full.

The effect of the failure to return ALREADY_INPROGRESS was that a leader
would be much more likely consider a tablet server failed and to
subsequently drop the replica from the Raft config. As a result, on a
highly loaded cluster, a tablet copy could be started at the same time,
300 seconds apart, on many tablet servers.

The remaining changes are to return more specific errors out of the
tablet copy service, which aids with testing specific codepaths. One of
the existing tablet_copy-itest cases has been beefed up to cover the
tablet copy threadpool full path. Without the changes outlined before it
fails with:

../../src/kudu/integration-tests/tablet_copy-itest.cc:961: Failure
Expected: (num_inprogress) > (0), actual: 0 vs 0

which is exactly what we would expect; the tablet server is failing to
return INPROGRESS errors.

Anecdotally, this patch has improved TTR times 5-10x on highly loaded
clusters. It's still possible for tablets to be bounced around during
re-replication if the copying tablet server has a full RPC queue, or
it's unable to start the tablet copy for 300 seconds, but both of these
conditions indicate that it's probably best to drop that tserver and
retry on a (hopefully) less stressed server.

Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
8 files changed, 139 insertions(+), 57 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/6925/8
-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] docs: revise security doc based on KUDU-1875 and add release notes for 1.4

2017-05-22 Thread Dan Burkert (Code Review)
Dan Burkert has posted comments on this change.

Change subject: docs: revise security doc based on KUDU-1875 and add release 
notes for 1.4
..


Patch Set 5: Code-Review+2 Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/6922
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2e386203f5ed3ef66e2ec136e67738b8c7eb8b1a
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Hao Hao 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] docs: revise security doc based on KUDU-1875 and add release notes for 1.4

2017-05-22 Thread Hao Hao (Code Review)
Hao Hao has posted comments on this change.

Change subject: docs: revise security doc based on KUDU-1875 and add release 
notes for 1.4
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6922/4/docs/security.adoc
File docs/security.adoc:

Line 54: connections from publicly routable IPs will be rejected.
> Reword as 'Unauthenticated connections from publicly routable IPs will be r
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6922
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2e386203f5ed3ef66e2ec136e67738b8c7eb8b1a
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Hao Hao 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] docs: revise security doc based on KUDU-1875 and add release notes for 1.4

2017-05-22 Thread Dan Burkert (Code Review)
Dan Burkert has submitted this change and it was merged.

Change subject: docs: revise security doc based on KUDU-1875 and add release 
notes for 1.4
..


docs: revise security doc based on KUDU-1875 and
add release notes for 1.4

Change-Id: I2e386203f5ed3ef66e2ec136e67738b8c7eb8b1a
Reviewed-on: http://gerrit.cloudera.org:8080/6922
Reviewed-by: Dan Burkert 
Tested-by: Dan Burkert 
---
M docs/prior_release_notes.adoc
M docs/release_notes.adoc
M docs/security.adoc
3 files changed, 210 insertions(+), 162 deletions(-)

Approvals:
  Dan Burkert: Looks good to me, approved; Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/6922
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I2e386203f5ed3ef66e2ec136e67738b8c7eb8b1a
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Hao Hao 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] docs: revise security doc based on KUDU-1875 and add release notes for 1.4

2017-05-22 Thread Hao Hao (Code Review)
Hello Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6922

to look at the new patch set (#5).

Change subject: docs: revise security doc based on KUDU-1875 and add release 
notes for 1.4
..

docs: revise security doc based on KUDU-1875 and
add release notes for 1.4

Change-Id: I2e386203f5ed3ef66e2ec136e67738b8c7eb8b1a
---
M docs/prior_release_notes.adoc
M docs/release_notes.adoc
M docs/security.adoc
3 files changed, 210 insertions(+), 162 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/22/6922/5
-- 
To view, visit http://gerrit.cloudera.org:8080/6922
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2e386203f5ed3ef66e2ec136e67738b8c7eb8b1a
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Hao Hao 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1988: add support for advertised host:port info.

2017-05-22 Thread Dan Burkert (Code Review)
Dan Burkert has posted comments on this change.

Change subject: KUDU-1988: add support for advertised host:port info.
..


Patch Set 8:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6827/8/src/kudu/server/rpc_server.cc
File src/kudu/server/rpc_server.cc:

Line 117:   if (!options_.rpc_advertised_addresses.empty()) {
This should guard against port 0, as on line 111 above.


Line 210:   for (const Sockaddr& addr : rpc_advertised_addresses_) {
Same comment as in the webserver; this can just use the copy ctor.


http://gerrit.cloudera.org:8080/#/c/6827/8/src/kudu/server/webserver.cc
File src/kudu/server/webserver.cc:

Line 207: 
RETURN_NOT_OK(ParseAddressList(opts_.webserver_advertised_addresses,
This should check for 0 port as well.


Line 308:   for (const Sockaddr& addr : webserver_advertised_addresses_) {
This can be simpler as a copy construct:

*addresses = webserver_advertised_addresses_;


-- 
To view, visit http://gerrit.cloudera.org:8080/6827
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6735ca5630fc4c426bf72d0b21d6ef452173a890
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Patrik Sundberg 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Patrik Sundberg 
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes


[kudu-CR] docs: revise security doc based on KUDU-1875 and add release notes for 1.4

2017-05-22 Thread Dan Burkert (Code Review)
Dan Burkert has posted comments on this change.

Change subject: docs: revise security doc based on KUDU-1875 and add release 
notes for 1.4
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6922/4/docs/security.adoc
File docs/security.adoc:

Line 54: any unauthenticated connections from publicly routable IPs.
Reword as 'Unauthenticated connections from publicly routable IPs will be 
rejected.'  here and in the release notes.


-- 
To view, visit http://gerrit.cloudera.org:8080/6922
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2e386203f5ed3ef66e2ec136e67738b8c7eb8b1a
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Hao Hao 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

2017-05-22 Thread Dan Burkert (Code Review)
Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy 
operations per under-replicated tablet
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/tserver/tablet_service.cc
File src/kudu/tserver/tablet_service.cc:

Line 1067:   // Skip calling SetupErrorAndRespond since this path doesn't 
need the
> Check out the 'Advanced per-instance throttling' section of util/logging.h 
OK so after trying this out on a cluster, I think we should allow it to log 
every time.  To balance this out, I think we should downgrade the 'tablet x 
needs tablet copy' message.  The net result is that we're logging the begin 
tablet copy result instead of the fact that we'll be requesting it.  For 
reference, here's a cross section of these logs for a particular tablet:

I0522 17:09:38.703362 15818 consensus_queue.cc:395] T 
c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 [LEADER]: 
Peer cc32936bc8594948a04fd4240da36aed needs tablet copy
W0522 17:09:38.703636  4776 consensus_peers.cc:352] T 
c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 -> Peer 
cc32936bc8594948a04fd4240da36aed (vd0236.halxg.cloudera.com:7050): Unable to 
begin Tablet Copy on peer: error { code: THROTTLED status { code: 
SERVICE_UNAVAILABLE message: "Thread pool is at capacity (10/10 tasks running, 
0/0 tasks queued)" } }
I0522 17:09:40.211633 15820 consensus_queue.cc:395] T 
c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 [LEADER]: 
Peer cc32936bc8594948a04fd4240da36aed needs tablet copy
W0522 17:09:40.211971  4776 consensus_peers.cc:352] T 
c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 -> Peer 
cc32936bc8594948a04fd4240da36aed (vd0236.halxg.cloudera.com:7050): Unable to 
begin Tablet Copy on peer: error { code: THROTTLED status { code: 
SERVICE_UNAVAILABLE message: "Thread pool is at capacity (10/10 tasks running, 
0/0 tasks queued)" } }
I0522 17:09:41.703528 11794 consensus_queue.cc:395] T 
c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 [LEADER]: 
Peer cc32936bc8594948a04fd4240da36aed needs tablet copy
W0522 17:09:41.703760  4776 consensus_peers.cc:352] T 
c03811b02d7045e9a8cc426246c9595c P 70f7ee61ead54b1885d819f354eb3405 -> Peer 
cc32936bc8594948a04fd4240da36aed (vd0236.halxg.cloudera.com:7050): Unable to 
begin Tablet Copy on peer: error { code: THROTTLED status { code: 
SERVICE_UNAVAILABLE message: "Thread pool is at capacity (10/10 tasks running, 
0/0 tasks queued)" } }


On clusters approaching normalcy, I wouldn't expect to see these logs much at 
all.


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] threadpool-test: use test fixture

2017-05-22 Thread Adar Dembo (Code Review)
Adar Dembo has submitted this change and it was merged.

Change subject: threadpool-test: use test fixture
..


threadpool-test: use test fixture

Change-Id: Ic76e5f946356d2cf6869cb7a665d0ba5adde
Reviewed-on: http://gerrit.cloudera.org:8080/6944
Reviewed-by: Todd Lipcon 
Tested-by: Kudu Jenkins
---
M src/kudu/util/threadpool-test.cc
M src/kudu/util/threadpool.h
2 files changed, 140 insertions(+), 149 deletions(-)

Approvals:
  Todd Lipcon: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/6944
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ic76e5f946356d2cf6869cb7a665d0ba5adde
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1034 client does not failover due to timeout

2017-05-22 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change.

Change subject: KUDU-1034 client does not failover due to timeout
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6924/1/src/kudu/rpc/retriable_rpc.h
File src/kudu/rpc/retriable_rpc.h:

Line 217: server_picker_->MarkServerFailed(server, result.status);
> In the case of a timeout, that still might be a good idea, no? It's not 100
I think timeout is considered as non-retriable error.  Given current behavior 
of the client (refreshing its meta-cache in case of absence of usable server 
for an RPC), marking a timed-out server would do no harm.


Line 217: server_picker_->MarkServerFailed(server, result.status);
> OK now I remember the issue I had when looking at this issue previously. I 
As I understand, regardless of whether the client aware of particular replica 
or not, marking a server failed will make the client to switch to a different 
tablet server in the scope of this particular RPC.  Also, most of the errors in 
this context (except, may be, REPLICA_NOT_LEADER) have semantics of 'mark the 
whole server out': connection timeout, server too busy, invalid authn token.

Yes, all the tablets will be marked as non-accessible when using 
MarkServerFailed(), but the client will refresh its meta-cache if it ended up 
with no active replica, right?

E.g., take a look at handling REPLICA_NOT_LEADER error code.  It might happen 
that by the time client marks the server as failed it becomes the leader.  So, 
in that case the client will end up in having no servers in its metacache, and 
will refresh it to get a new leader.


-- 
To view, visit http://gerrit.cloudera.org:8080/6924
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icfcece485e4053d921ffdc865612b3e7b9a992a3
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] threadpool-test: use test fixture

2017-05-22 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: threadpool-test: use test fixture
..


Patch Set 2: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6944
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic76e5f946356d2cf6869cb7a665d0ba5adde
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] Fix flaky test TestRecoverFromOpIdOverflow (again)

2017-05-22 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: Fix flaky test TestRecoverFromOpIdOverflow (again)
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6943/2/src/kudu/integration-tests/ts_recovery-itest.cc
File src/kudu/integration-tests/ts_recovery-itest.cc:

PS2, Line 75:   const int kOneReplica = 1;
Don't think this adds much value. If the goal was to make the parameter to 
StartCluster more obvious, I think it's better to just do:

StartCluster(extra_ts_flags, {}, /*replicas=*/ 1);

or something of that nature


-- 
To view, visit http://gerrit.cloudera.org:8080/6943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1f7326136479311ba2a84b384327e07d280df7c3
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1860: ksck doesn't identify tablets that are evicted but still in config

2017-05-22 Thread Will Berkeley (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6772

to look at the new patch set (#9).

Change subject: KUDU-1860: ksck doesn't identify tablets that are evicted but 
still in config
..

KUDU-1860: ksck doesn't identify tablets that are evicted but still in config

This patch enhances ksck to gather consensus info from every
tablet. It compares this info with master and outputs the
master's config and every conflicting config, if there are any
conflicts. To do this efficiently it reimplements the
GetAllConsensusState RPC so that it gathers info about every
replica's consensus state.

This will catch at least the two problems identified in
KUDU-1860: 1. The leader has a pending config to remove a
tablet, but it is not committed so the master does not see this
config. This can hide an unhealthy tablet if, e.g., one pending
config member is down and the pending-to-be-kicked-out member is
up, so 1/2 replicas are alive in the leader's active config but
the master thinks 2/3 are alive. 2. No replica is leader but the
master believes there is a leader because its cache is old and
hasn't been updated.

Sample output showing #1:
https://gist.github.com/wdberkeley/d2606698e4f2e8ca3ef70d4dcef7ba9a

Change-Id: I16e4de09821b372c3773b4ade3fd9e37ab818808
---
M src/kudu/consensus/consensus.proto
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/tools/ksck-test.cc
M src/kudu/tools/ksck.cc
M src/kudu/tools/ksck.h
M src/kudu/tools/ksck_remote.cc
M src/kudu/tools/ksck_remote.h
M src/kudu/tools/tool_action_cluster.cc
M src/kudu/tserver/tablet_replica_lookup.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
M src/kudu/tserver/ts_tablet_manager.h
14 files changed, 502 insertions(+), 54 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/72/6772/9
-- 
To view, visit http://gerrit.cloudera.org:8080/6772
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I16e4de09821b372c3773b4ade3fd9e37ab818808
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 


[kudu-CR] threadpool-test: use test fixture

2017-05-22 Thread Adar Dembo (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6944

to look at the new patch set (#2).

Change subject: threadpool-test: use test fixture
..

threadpool-test: use test fixture

Change-Id: Ic76e5f946356d2cf6869cb7a665d0ba5adde
---
M src/kudu/util/threadpool-test.cc
M src/kudu/util/threadpool.h
2 files changed, 140 insertions(+), 149 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/44/6944/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6944
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic76e5f946356d2cf6869cb7a665d0ba5adde
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1860: ksck doesn't identify tablets that are evicted but still in config

2017-05-22 Thread Will Berkeley (Code Review)
Will Berkeley has posted comments on this change.

Change subject: KUDU-1860: ksck doesn't identify tablets that are evicted but 
still in config
..


Patch Set 8:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/6772/8/src/kudu/tools/ksck-test.cc
File src/kudu/tools/ksck-test.cc:

Line 72:   const ChecksumOptions& options,
> warning: parameter 'options' is unused [misc-unused-parameters]
Done


http://gerrit.cloudera.org:8080/#/c/6772/8/src/kudu/tools/ksck.cc
File src/kudu/tools/ksck.cc:

Line 66: using std::setfill;
> warning: using decl 'setfill' is unused [misc-unused-using-decls]
Done


http://gerrit.cloudera.org:8080/#/c/6772/8/src/kudu/tools/ksck.h
File src/kudu/tools/ksck.h:

Line 113:   term(term),
> warning: value argument 'term' can be moved to avoid copy [misc-move-constr
Done


http://gerrit.cloudera.org:8080/#/c/6772/8/src/kudu/tserver/tablet_service.cc
File src/kudu/tserver/tablet_service.cc:

Line 109: using kudu::consensus::ConsensusStatePB;
> warning: using decl 'ConsensusStatePB' is unused [misc-unused-using-decls]
Done


Line 1056:   for (const scoped_refptr replica : tablet_replicas) 
{
> warning: the loop variable's type is not a reference type; this creates a c
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6772
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I16e4de09821b372c3773b4ade3fd9e37ab818808
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-HasComments: Yes


[kudu-CR] threadpool-test: use test fixture

2017-05-22 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: threadpool-test: use test fixture
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6944/1/src/kudu/util/threadpool-test.cc
File src/kudu/util/threadpool-test.cc:

Line 42: using strings::Substitute;
> are these Tidy Bot warnings correct?
Yeah; these declarations belong in the next patch, not this one. Will fix.


Line 53: ASSERT_OK(ThreadPoolBuilder(kDefaultPoolName).Build(_));
> looking at the cases, it seems like almost all of them end up "rebuilding".
The token-based tests in the patches to follow generally use the default pool.


-- 
To view, visit http://gerrit.cloudera.org:8080/6944
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic76e5f946356d2cf6869cb7a665d0ba5adde
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1860: ksck doesn't identify tablets that are evicted but still in config

2017-05-22 Thread Will Berkeley (Code Review)
Will Berkeley has posted comments on this change.

Change subject: KUDU-1860: ksck doesn't identify tablets that are evicted but 
still in config
..


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6772/7/src/kudu/tools/ksck.cc
File src/kudu/tools/ksck.cc:

PS7, Line 627: -1
> Magic number removed by using an enum. See above for why the master doesn't
s/see above/see response to other comment on this line/


http://gerrit.cloudera.org:8080/#/c/6772/7/src/kudu/tools/ksck_remote.h
File src/kudu/tools/ksck_remote.h:

PS7, Line 58: FetchConsensusInfo
> Done. I also renamed the flag from check_consensus_info to check_consensus_
And then just to consensus


-- 
To view, visit http://gerrit.cloudera.org:8080/6772
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I16e4de09821b372c3773b4ade3fd9e37ab818808
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-871 (part 1). Refactor to make cmeta a first class object

2017-05-22 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: KUDU-871 (part 1). Refactor to make cmeta a first class object
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6957/1//COMMIT_MSG
Commit Message:

PS1, Line 16: 1) Allow RaftConsensus to operate while not initialized or in a 
shutdown
:state. Also pass ConsensusMetadata into the RaftConsensus
:constructor.
> not 100% following this.
The problem is that I need to provide a long lived object (see patch 3 in this 
series) in order to avoid racy handling of the ConsensusMetadata by TabletCopy 
/ tombstoned voting for example. The long lived object I chose was 
ConsensusMetadata because making both TabletReplica and RaftConsensus immortal 
is a much bigger change.

The compromise I landed on was ensuring that TabletReplica and RaftConsensus 
were always alive, however they would be replaced on TabletCopy. So the only 
immortal things end up being TabletMetadata and ConsensusMetadata.

The upshot is that we always get a RaftConsensus object whenever we have a 
TabletReplica object, but we don't have to support a cyclical lifecycle DAG on 
either object (going from Shutdown() -> Init() -> Start())


PS1, Line 41: * Add additional lifecycle runtime assertions in TabletReplica and
:   RaftConsensus.
> Can you also clearly document the lifecycle for RaftConsensus? I'm not enti
The RaftConsensus::Shutdown() is similar to what it was before but you make a 
fair point that since RaftConsensus isn't destroyed on TabletPeer::Shutdown() 
anymore then it should be as aggressive as possible reclaiming memory when it's 
shut down. I'll check on that.

I tried to give a summary of the lifecycle in my above comment. Let me know if 
I need to clarify anything and I'll add more details as a comment in the next 
rev.


-- 
To view, visit http://gerrit.cloudera.org:8080/6957
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia27a091d27b3996d37009d5ec866e744f9608388
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1860: ksck doesn't identify tablets that are evicted but still in config

2017-05-22 Thread Will Berkeley (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6772

to look at the new patch set (#8).

Change subject: KUDU-1860: ksck doesn't identify tablets that are evicted but 
still in config
..

KUDU-1860: ksck doesn't identify tablets that are evicted but still in config

This patch enhances ksck to gather consensus info from every
tablet. It compares this info with master and outputs the
master's config and every conflicting config, if there are any
conflicts. To do this efficiently it reimplements the
GetAllConsensusState RPC so that it gathers info about every
replica's consensus state.

This will catch at least the two problems identified in
KUDU-1860: 1. The leader has a pending config to remove a
tablet, but it is not committed so the master does not see this
config. This can hide an unhealthy tablet if, e.g., one pending
config member is down and the pending-to-be-kicked-out member is
up, so 1/2 replicas are alive in the leader's active config but
the master thinks 2/3 are alive. 2. No replica is leader but the
master believes there is a leader because its cache is old and
hasn't been updated.

Sample output showing #1:
https://gist.github.com/wdberkeley/d2606698e4f2e8ca3ef70d4dcef7ba9a

Change-Id: I16e4de09821b372c3773b4ade3fd9e37ab818808
---
M src/kudu/consensus/consensus.proto
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/tools/ksck-test.cc
M src/kudu/tools/ksck.cc
M src/kudu/tools/ksck.h
M src/kudu/tools/ksck_remote.cc
M src/kudu/tools/ksck_remote.h
M src/kudu/tools/tool_action_cluster.cc
M src/kudu/tserver/tablet_replica_lookup.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
M src/kudu/tserver/ts_tablet_manager.h
14 files changed, 503 insertions(+), 53 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/72/6772/8
-- 
To view, visit http://gerrit.cloudera.org:8080/6772
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I16e4de09821b372c3773b4ade3fd9e37ab818808
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 


[kudu-CR] KUDU-1580 retry tserver RPC if negotiation times out

2017-05-22 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: KUDU-1580 retry tserver RPC if negotiation times out
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6926/1/src/kudu/rpc/connection.cc
File src/kudu/rpc/connection.cc:

PS1, Line 170:   const Status reported_status = status.IsTimedOut()
 :   ? Status::Incomplete("RPC is not sent due to 
connection error",
 :status.ToString())
 :   : status;
 :  
> This sounds like a better idea.  Thanks!
Yea, changing Status is unfortunately a big project (has ABI compatibility 
concerns, too) and harder to figure out how to get all possible error codes 
into a single global namespace.


-- 
To view, visit http://gerrit.cloudera.org:8080/6926
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icee8bf4978365a23d6627e7bc411b63f53540a3b
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1580 retry tserver RPC if negotiation times out

2017-05-22 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change.

Change subject: KUDU-1580 retry tserver RPC if negotiation times out
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6926/1/src/kudu/rpc/connection.cc
File src/kudu/rpc/connection.cc:

PS1, Line 170:   const Status reported_status = status.IsTimedOut()
 :   ? Status::Incomplete("RPC is not sent due to 
connection error",
 :status.ToString())
 :   : status;
 :  
> this seems like a very risky change to me, because now anywhere in the whol
This sounds like a better idea.  Thanks!

Actually, while revving these set of patches I even started thinking about 
adding something like that into Status itself.  But putting something like that 
into RpcController and/or Outbound call sounds like a more proper place for 
that.


-- 
To view, visit http://gerrit.cloudera.org:8080/6926
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icee8bf4978365a23d6627e7bc411b63f53540a3b
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-871 (part 2). Make ConsensusMetadata thread-safe

2017-05-22 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: KUDU-871 (part 2). Make ConsensusMetadata thread-safe
..


Patch Set 1:

> I haven't looked at the implementation here yet but I'm wondering
 > whether this may have potential performance impact.
 > 
 > In particular, we would hold this lock while flushing/fsyncing new
 > ConsensusMetadata to disk. Are there code paths that previously
 > would not block but now block? Or were we always previously
 > protected by the ReplicaState lock, and now we're protected by the
 > ConsensusMeta lock?

No, they are the same. I don't think there will be much of a performance impact 
because the vast majority of the time we will be acquiring an uncontended futex 
due to holding the ReplicaState lock to do things like get the current term in 
RaftConsensus / ReplicaState.

 > Regarding the above blocking (which is already problematic) -- does
 > this get us closer or farther away from an end state where the
 > cmeta is synchronized using an RWCLock? It would be great to allow
 > read-only callers to continue to read while an fsync is going on,
 > for example.

I like this idea but I hadn't factored it in when writing this. I don't think 
it really helps or hurts us from the perspective of doing that. However it's 
likely we'd need a significant refactor of ReplicaState to do that (which we 
should do anyway).

-- 
To view, visit http://gerrit.cloudera.org:8080/6958
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id2c70605c0593e55486184705ea448dcb4bef2d7
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] threadpool-test: use test fixture

2017-05-22 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: threadpool-test: use test fixture
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6944/1/src/kudu/util/threadpool-test.cc
File src/kudu/util/threadpool-test.cc:

Line 42: using strings::Substitute;
> warning: using decl 'Substitute' is unused [misc-unused-using-decls]
are these Tidy Bot warnings correct?


Line 53: ASSERT_OK(ThreadPoolBuilder(kDefaultPoolName).Build(_));
looking at the cases, it seems like almost all of them end up "rebuilding". 
Maybe it would be cleaner to just not have any default pool and have every test 
be responsible for building whatever pool they want?


-- 
To view, visit http://gerrit.cloudera.org:8080/6944
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic76e5f946356d2cf6869cb7a665d0ba5adde
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1580 retry tserver RPC if negotiation times out

2017-05-22 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: KUDU-1580 retry tserver RPC if negotiation times out
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6926/1/src/kudu/rpc/connection.cc
File src/kudu/rpc/connection.cc:

PS1, Line 170:   const Status reported_status = status.IsTimedOut()
 :   ? Status::Incomplete("RPC is not sent due to 
connection error",
 :status.ToString())
 :   : status;
 :  
this seems like a very risky change to me, because now anywhere in the whole 
codebase where we were checking for IsTimedOut may now have an unexpected 
Incomplete status instead.

Perhaps we would be better off giving the RpcController and OutboundCall object 
some more getters to try to distinguish underlying specific causes for 
different timeouts? eg call->SetFailed(status, NEGOTIATION_TIMED_OUT, 
error.release())

or whatever?


-- 
To view, visit http://gerrit.cloudera.org:8080/6926
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Icee8bf4978365a23d6627e7bc411b63f53540a3b
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-871 (part 2). Make ConsensusMetadata thread-safe

2017-05-22 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: KUDU-871 (part 2). Make ConsensusMetadata thread-safe
..


Patch Set 1:

I haven't looked at the implementation here yet but I'm wondering whether this 
may have potential performance impact.

In particular, we would hold this lock while flushing/fsyncing new 
ConsensusMetadata to disk. Are there code paths that previously would not block 
but now block? Or were we always previously protected by the ReplicaState lock, 
and now we're protected by the ConsensusMeta lock?

Regarding the above blocking (which is already problematic) -- does this get us 
closer or farther away from an end state where the cmeta is synchronized using 
an RWCLock? It would be great to allow read-only callers to continue to read 
while an fsync is going on, for example.

-- 
To view, visit http://gerrit.cloudera.org:8080/6958
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id2c70605c0593e55486184705ea448dcb4bef2d7
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] KUDU-871 (part 1). Refactor to make cmeta a first class object

2017-05-22 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: KUDU-871 (part 1). Refactor to make cmeta a first class object
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6957/1//COMMIT_MSG
Commit Message:

PS1, Line 16: 1) Allow RaftConsensus to operate while not initialized or in a 
shutdown
:state. Also pass ConsensusMetadata into the RaftConsensus
:constructor.
not 100% following this.

I thought when discussing this, the idea was to do one or the other of these 
changes, not both?

i.e., if you make it so you can instantiate RaftConsensus even for tombstoned 
tablet replicas, then why isn't it sufficient to still keep ConsensusMeta 
confined to being within the RaftConsensus class?

Or alternatively, if you make TabletReplica aware of ConsensusMetadata, why do 
you need to make changes to the RaftConsensus lifecycle?


PS1, Line 41: * Add additional lifecycle runtime assertions in TabletReplica and
:   RaftConsensus.
Can you also clearly document the lifecycle for RaftConsensus? I'm not entirely 
following it from the patch. eg when you first create a RaftConsensus object, 
it seems to exist but have a lot of nullptr members. But then when you do 
Shutdown() on it, it doesn't re-nullify those members. Does that imply that a 
RaftConsensus instance that was tombstoned on a live server has different state 
than one that was tombstoned at startup? Does that have impact on thread 
lifecycle or memory consumption?

Put another way, the above commit message tries to describe the changes, but 
I'm not sure what the new "state of the world" is, and the class docs here 
aren't clarified enough to make me feel confident that the new state of the 
world is correct.


-- 
To view, visit http://gerrit.cloudera.org:8080/6957
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia27a091d27b3996d37009d5ec866e744f9608388
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1860: ksck doesn't identify tablets that are evicted but still in config

2017-05-22 Thread Will Berkeley (Code Review)
Will Berkeley has posted comments on this change.

Change subject: KUDU-1860: ksck doesn't identify tablets that are evicted but 
still in config
..


Patch Set 7:

(54 comments)

http://gerrit.cloudera.org:8080/#/c/6772/7/src/kudu/consensus/consensus.proto
File src/kudu/consensus/consensus.proto:

Line 471:   // The ids of the tablets.
> Can we add: An empty list means return info for all tablets known to the ta
Done


PS7, Line 557:   // Returns the consensus state for a tablet.
 :   rpc GetConsensusState(GetConsensusStateRequestPB) returns 
(GetConsensusStateResponsePB);
> Can we delete this function now? Gotta love deleting code. I don't think th
Done


Line 560:   // Returns the consensus state for a set of tablets.
> I think we should document whether this returns info for tombstoned tablets
Done


PS7, Line 561: GetConsensusStates
> I think GetConsensusState() is still a valid name for this endpoint, or if 
I like GetConsensusState.


http://gerrit.cloudera.org:8080/#/c/6772/7/src/kudu/tools/ksck-test.cc
File src/kudu/tools/ksck-test.cc:

Line 70:   const std::string& tablet_id,
> warning: parameter 'tablet_id' is unused [misc-unused-parameters]
Done


Line 71:   const Schema& schema,
> warning: parameter 'schema' is unused [misc-unused-parameters]
Done


PS7, Line 232:   
> nit: spacing
Done


PS7, Line 236: InsertOrDie
> Do we also need to worry about the duplicate UUID issue here?
No. This is test code that's manually setting up a vanilla consensus 
configuration on each mock TS.

If you wanted to test a case like that, you'd reach in and set it up in the 
specific test, as I do in the tests below for various situations


Line 243:string tablet_id,
> warning: the parameter 'tablet_id' is copied for each invocation but only u
Done


Line 243:string tablet_id,
> I think this is complaining because we should either specify the input argu
Kept it simple for the test and went with const string&.


Line 394:   FLAGS_check_consensus_info = false;
> No need to reset your flags between tests due to KuduTest::flag_saver_
:)


PS7, Line 401: shared_ptr ts
> is it safe to declare this const shared_ptr& ts ?
Done


PS7, Line 401: "ts-id-0")
> can we make "ts-id-0" and other identifier strings in here (except perhaps 
I don't think making constants for the whole file or for KsckTest makes sense 
just to cover these three specific tests, and the literals are actually used 
just once per test, so I don't think it makes it clearer or easier to store 
them in a variable.

FWIW, it's also common to use a literal "ts-id-0" or similar in other tests in 
this file.


Line 413:   FLAGS_check_consensus_info = false;
> also here
Done


Line 432:   FLAGS_check_consensus_info = false;
> and here
Done


Line 451:   FLAGS_check_consensus_info = false;
> and here
Done


Line 453: 
> Would also be interesting to test the duplicate TS UUID case.
You mean the rare + bad case from L123 of ksck_remote.cc? That code path can't 
be tested here as that situation is dealt with by the RemoteKsckTabletServer, 
not by Ksck.

What would happen, though, is that one of the duplicates' (same ts_id and 
tablet id) consensus info would be left out of the info provided to Ksck. I 
think it'd complicate the code to cover that rare edge case nicely, but it is 
easy to spit out a warning if we run into a duplicate. I'll do that.


http://gerrit.cloudera.org:8080/#/c/6772/7/src/kudu/tools/ksck.cc
File src/kudu/tools/ksck.cc:

Line 63: using std::left;
> warning: using decl 'left' is unused [misc-unused-using-decls]
Done


PS7, Line 623: peers_from_master
> nit: peer_uuids_from_master might be clearer since when I read peer I think
Done


PS7, Line 624:   std::transform(tablet->replicas().cbegin(), 
tablet->replicas().cend(),
 :  peers_from_master.begin(),
 :  [](const std::shared_ptr& r) 
{ return r->ts_uuid(); });
> nit: This could be more simply written as a for-each loop:
Done


PS7, Line 627: boost::none
> Doesn't the master know the OpId index?
In its cmeta for the tablet I would guess so but that's not actually info that 
ksck gathers presently. KsckRemoteMaster is sneaky and gets its info about the 
master's POV on tablets by requesting scan tokens for a scan it'll never do. 
We'd have to refactor a fair bit more to add in the term/opid.


PS7, Line 627: -1
> nit: magic number. Also, doesn't the master know the term?
Magic number removed by using an enum. See above for why the master doesn't 
have the term here.


PS7, Line 649: strings::Substitute("$0:$1", ts->uuid(), tablet->id())
> as noted elsewhere this seems hacky compared to:
Done


PS7, Line 658: peers_from_replica
> same as above on naming
Done


PS7, Line 659: std::transform(peers.begin(), peers.end(), 
peers_from_replica.begin(),
 :[](const 

[kudu-CR] Improve ASSERT STR CONTAINS()

2017-05-22 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: Improve ASSERT_STR_CONTAINS()
..


Patch Set 1: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6956
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Idea2abc0115e13dbb9f3853a5b821ccbf833d184
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

2017-05-22 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy 
operations per under-replicated tablet
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/tserver/tablet_service.cc
File src/kudu/tserver/tablet_service.cc:

Line 1067:   // Skip calling SetupErrorAndRespond since this path doesn't 
need the
> Yah, this shouldn't have an effect on the actual cluster dynamics.  It's us
Check out the 'Advanced per-instance throttling' section of util/logging.h -- 
it should do what you need


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] docs: revise security doc based on KUDU-1875 and add release notes for 1.4

2017-05-22 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: docs: revise security doc based on KUDU-1875 and add release 
notes for 1.4
..


Patch Set 4: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/6922
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2e386203f5ed3ef66e2ec136e67738b8c7eb8b1a
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Hao Hao 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

2017-05-22 Thread Dan Burkert (Code Review)
Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy 
operations per under-replicated tablet
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6925/3//COMMIT_MSG
Commit Message:

PS3, Line 22: to to
> Double to
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

2017-05-22 Thread Dan Burkert (Code Review)
Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy 
operations per under-replicated tablet
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

Line 336:   if (!controller_.status().ok() || !tc_response_.has_error()) {
> Yah, and in fact it may be wrong.  I think I meant:
Looking more at this, we don't seem to be updating 
'last_successful_communication_time ' in the successful case.


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

2017-05-22 Thread Dan Burkert (Code Review)
Hello Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6925

to look at the new patch set (#7).

Change subject: KUDU-2020: tserver failure causes multiple tablet copy 
operations per under-replicated tablet
..

KUDU-2020: tserver failure causes multiple tablet copy operations per 
under-replicated tablet

The 'active ingredient' in this patch is the change to
TsTabletManager::StartTabletCopy that causes an ALREADY_INPROGRESS
response to be returned if the tablet is currently being copied and the
tablet copy thread pool is full. Previously an ALREADY_INPROGRESS
response would only occur if the tablet was currently being copied, and
the threadpool was not full.

The effect of the failure to return ALREADY_INPROGRESS was that a leader
would be much more likely consider a tablet server failed and to
subsequently drop the replica from the Raft config. As a result, on a
highly loaded cluster, a tablet copy could be started at the same time,
300 seconds apart, on many tablet servers.

The remaining changes are to return more specific errors out of the
tablet copy service, which aids with testing specific codepaths. One of
the existing tablet_copy-itest cases has been beefed up to cover the
tablet copy threadpool full path. Without the changes outlined before it
fails with:

../../src/kudu/integration-tests/tablet_copy-itest.cc:961: Failure
Expected: (num_inprogress) > (0), actual: 0 vs 0

which is exactly what we would expect; the tablet server is failing to
return INPROGRESS errors.

Anecdotally, this patch has improved TTR times 5-10x on highly loaded
clusters. It's still possible for tablets to be bounced around during
re-replication if the copying tablet server has a full RPC queue, or
it's unable to start the tablet copy for 300 seconds, but both of these
conditions indicate that it's probably best to drop that tserver and
retry on a (hopefully) less stressed server.

Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
8 files changed, 138 insertions(+), 55 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/6925/7
-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-871 (part 2). Make ConsensusMetadata thread-safe

2017-05-22 Thread Mike Percy (Code Review)
Hello David Ribeiro Alves, Todd Lipcon,

I'd like you to do a code review.  Please visit

http://gerrit.cloudera.org:8080/6958

to review the following change.

Change subject: KUDU-871 (part 2). Make ConsensusMetadata thread-safe
..

KUDU-871 (part 2). Make ConsensusMetadata thread-safe

This patch simply adds locks around all accessors and makes
ConsensusMetadata refcounted. This enabled ConsensusMetadata to be a
long-lived shared object that brokers access to the underlying
ConsensusMetadata file in a followup patch.

TODO: Could use an MT test in consensus_meta-test to exercise the lock.

Change-Id: Id2c70605c0593e55486184705ea448dcb4bef2d7
---
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_meta.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/integration-tests/ts_recovery-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/master/sys_catalog.h
M src/kudu/tablet/tablet_bootstrap-test.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_replica-test.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_client.h
M src/kudu/tserver/tablet_copy_source_session-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
22 files changed, 195 insertions(+), 98 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/58/6958/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6958
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Id2c70605c0593e55486184705ea448dcb4bef2d7
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] mini cluster: Test infrastructure improvements

2017-05-22 Thread Mike Percy (Code Review)
Hello David Ribeiro Alves, Todd Lipcon,

I'd like you to do a code review.  Please visit

http://gerrit.cloudera.org:8080/6959

to review the following change.

Change subject: mini cluster: Test infrastructure improvements
..

mini cluster: Test infrastructure improvements

This patch adds a couple small improvements to the MiniCluster infra.

* Add base class helpers to MiniClusterITestBase to avoid code
  duplication, including StopCluster() and ts_map_ for convenience and
  consistency with ExternalMiniClusterITestBase.
* Add ListTablets() helper function to MiniTabletServer.

Change-Id: I6bc6a04b113e59243fb788fec15b9935c3dcf069
---
M src/kudu/integration-tests/CMakeLists.txt
M src/kudu/integration-tests/delete_tablet-itest.cc
A src/kudu/integration-tests/mini_cluster-itest-base.cc
M src/kudu/integration-tests/mini_cluster-itest-base.h
M src/kudu/tserver/mini_tablet_server.cc
M src/kudu/tserver/mini_tablet_server.h
6 files changed, 85 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/59/6959/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6959
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I6bc6a04b113e59243fb788fec15b9935c3dcf069
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-871 (part 1). Refactor to make cmeta a first class object

2017-05-22 Thread Mike Percy (Code Review)
Hello David Ribeiro Alves, Todd Lipcon,

I'd like you to do a code review.  Please visit

http://gerrit.cloudera.org:8080/6957

to review the following change.

Change subject: KUDU-871 (part 1). Refactor to make cmeta a first class object
..

KUDU-871 (part 1). Refactor to make cmeta a first class object

This patch is primarily a refactor to TabletReplica and RaftConsensus to
ensure that they are always constructed with a ConsensusMetadata object.
The ConsensusMetadata file is required for tombstoned voting as the
record of a replica's votes.

Changes in this patch:

1) Allow RaftConsensus to operate while not initialized or in a shutdown
   state. Also pass ConsensusMetadata into the RaftConsensus
   constructor.

2) Refactor TabletReplica to create a RaftConsensus instance at
   construction time. This will allow us to incorporate tombstoned
   voting later by allowing RaftConsensus to be accessible from a
   TabletReplica that comes up fresh in the TABLET_DATA_TOMBSTONED
   state. Also, do not destroy the Consensus object on
   TabletPeer::Shutdown() for the same reasons.

3) Add an additional on-disk data state (TABLET_DATA_FRESH) to allow us
   to identify when we crashed between writing the superblock and the
   ConsensusMetadata file. If a tablet is in that state then it has
   never voted so is legal to fully delete instead of attempt to recover
   at startup time.

Other changes:

* Remove unused Consensus::State enum.
* Get rid of TabletStatusListener
* bug fix: RaftConsensus must hold lock before snoozing FD

TODO:

* Add additional lifecycle runtime assertions in TabletReplica and
  RaftConsensus.
* Consider adding additional ReplicaState::State enum value for
  kInitialized but not kRunning.
* Finish writing new crash recovery tests.

Change-Id: Ia27a091d27b3996d37009d5ec866e744f9608388
---
M src/kudu/consensus/consensus.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/integration-tests/ts_recovery-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/master/sys_catalog.h
M src/kudu/tablet/metadata.proto
M src/kudu/tablet/tablet_bootstrap-test.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_bootstrap.h
M src/kudu/tablet/tablet_replica-test.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_client.h
M src/kudu/tserver/tablet_copy_source_session-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
M src/kudu/tserver/tserver-path-handlers.cc
20 files changed, 411 insertions(+), 381 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/57/6957/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6957
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ia27a091d27b3996d37009d5ec866e744f9608388
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-871 (part 3). Support tombstoned voting

2017-05-22 Thread Mike Percy (Code Review)
Hello David Ribeiro Alves, Todd Lipcon,

I'd like you to do a code review.  Please visit

http://gerrit.cloudera.org:8080/6960

to review the following change.

Change subject: KUDU-871 (part 3). Support tombstoned voting
..

KUDU-871 (part 3). Support tombstoned voting

This patch makes it possible for tombstoned tablet replicas to vote in
Raft elections.

* Add ConsensusMetadata flush control locking mechanism and plumb it
  into TabletReplica.
* Make ConsensusMetadata "immortal" to prevent flush races /
  interleavings between TabletCopy, TabletBootstrap, and RaftConsensus
  tombstoned voting.
* Add new test TombstonedVotingITest
* Fix DeleteTableITest.TestMergeConsensusMetadata after tombstoned
  voting changed its assumption that tombstoned tablets would not vote.

TODO:

* Add negative test for tombstoned voting.
* Add a targeted concurrency test that induces lots of tombstoned voting
  while running TabletCopy, TabletBootstrap, and DeleteTablet.

Change-Id: Ia19d75b185299443b27f41e468bbae20065e7570
---
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_meta.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/integration-tests/CMakeLists.txt
M src/kudu/integration-tests/delete_table-itest.cc
A src/kudu/integration-tests/tombstoned_voting-itest.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_client.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
20 files changed, 443 insertions(+), 102 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/60/6960/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6960
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ia19d75b185299443b27f41e468bbae20065e7570
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] threadpool: new test for pools with no max threads

2017-05-22 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change.

Change subject: threadpool: new test for pools with no max_threads
..


Patch Set 3: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6945
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I37d0bd0a05098bcc1a81ec9f1ac33e74e98781e4
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

2017-05-22 Thread Dan Burkert (Code Review)
Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy 
operations per under-replicated tablet
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

Line 336:   if (!controller_.status().ok() || !tc_response_.has_error()) {
> This is, uh, confusing to say the least. Can you pull up the ALREADY_INPROG
Yah, and in fact it may be wrong.  I think I meant:

if (!controller_.status().ok() || tc_response_.has_error())

I'll simplify it.


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

2017-05-22 Thread Dan Burkert (Code Review)
Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy 
operations per under-replicated tablet
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/tserver/tablet_service.cc
File src/kudu/tserver/tablet_service.cc:

Line 1067:   // Skip calling SetupErrorAndRespond since this path doesn't 
need the
> This doesn't actually affect tablet copy operations, right? If SetupErrorAn
Yah, this shouldn't have an effect on the actual cluster dynamics.  It's useful 
for testing this specific case, though.  I've changed the logging on the leader 
(consensus_peers.cc) so that all tablet copy failures are logged.  I'm going to 
do another cluster test to verify that this isn't too noisy.  Ideally it would 
be limited to a WARN log every 60 seconds or so, but I'm not sure how to do 
that on a per-tablet basis.


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

2017-05-22 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy 
operations per under-replicated tablet
..


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/6925/3//COMMIT_MSG
Commit Message:

PS3, Line 22: to to
Double to


http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

Line 336:   if (!controller_.status().ok() || !tc_response_.has_error()) {
This is, uh, confusing to say the least. Can you pull up the ALREADY_INPROGRESS 
comment and write something more comprehensive explaining what's going on here 
and why?


http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/tserver/tablet_service.cc
File src/kudu/tserver/tablet_service.cc:

Line 1067:   // Skip calling SetupErrorAndRespond since this path doesn't 
need the
This doesn't actually affect tablet copy operations, right? If 
SetupErrorAndRespond() is called, it'd convert ServiceUnavailable+THROTTLED 
into Rpc::ERROR_SERVER_TOO_BUSY but as far as I can tell that has no real 
effect on Peer::ProcessTabletCopyResponse apart from preventing a noisy 
LOG(WARNING).

Do we actually want that noisy log?


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

2017-05-22 Thread Dan Burkert (Code Review)
Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy 
operations per under-replicated tablet
..


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6925/5/src/kudu/integration-tests/cluster_itest_util.h
File src/kudu/integration-tests/cluster_itest_util.h:

Line 337: const tablet::TabletDataState delete_type,
> warning: parameter 'delete_type' is const-qualified in the function declara
Done


http://gerrit.cloudera.org:8080/#/c/6925/5/src/kudu/integration-tests/tablet_copy-itest.cc
File src/kudu/integration-tests/tablet_copy-itest.cc:

Line 66: using std::pair;
> warning: using decl 'pair' is unused [misc-unused-using-decls]
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

2017-05-22 Thread Dan Burkert (Code Review)
Hello Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6925

to look at the new patch set (#6).

Change subject: KUDU-2020: tserver failure causes multiple tablet copy 
operations per under-replicated tablet
..

KUDU-2020: tserver failure causes multiple tablet copy operations per 
under-replicated tablet

The 'active ingredient' in this patch is the change to
TsTabletManager::StartTabletCopy that causes an ALREADY_INPROGRESS
response to be returned if the tablet is currently being copied and the
tablet copy thread pool is full. Previously an ALREADY_INPROGRESS
response would only occur if the tablet was currently being copied, and
the threadpool was not full.

The effect of the failure to return ALREADY_INPROGRESS was that a leader
would be much more likely consider a tablet server failed and to
subsequently drop the replica from the Raft config. As a result, on a
highly loaded cluster, a tablet copy could be started at the same time,
300 seconds apart, on many tablet servers.

The remaining changes are to to return more specific errors out of the
tablet copy service, which aids with testing specific codepaths. One of
the existing tablet_copy-itest cases has been beefed up to cover this
specific regression. Without the changes outlined before it fails with:

../../src/kudu/integration-tests/tablet_copy-itest.cc:961: Failure
Expected: (num_inprogress) > (0), actual: 0 vs 0

which is exactly what we would expect; the tablet server is failing to
return INPROGRESS errors.

Anecdotally, this patch has improved TTR times 5-10x on highly loaded
clusters. It's still possible for tablets to be bounced around during
re-replication if the copying tablet server has a full RPC queue, or
it's unable to start the tablet copy for 300 seconds, but both of these
conditions indicate that it's probably best to drop that tserver and
retry on a (hopefully) less stressed server.

Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
6 files changed, 124 insertions(+), 42 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/6925/6
-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

2017-05-22 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy 
operations per under-replicated tablet
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/integration-tests/cluster_itest_util.cc
File src/kudu/integration-tests/cluster_itest_util.cc:

PS3, Line 941: CHECK(resp.error().has_code()) << "Tablet copy error 
response has no code";
 : 
CHECK(tserver::TabletServerErrorPB::Code_IsValid(resp.error().code()))
 : << "Tablet copy error response code is not valid";
> My understanding is that this is test-only code.
oh, duh, sorry... missed the context here.


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

2017-05-22 Thread Dan Burkert (Code Review)
Hello Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6925

to look at the new patch set (#5).

Change subject: KUDU-2020: tserver failure causes multiple tablet copy 
operations per under-replicated tablet
..

KUDU-2020: tserver failure causes multiple tablet copy operations per 
under-replicated tablet

The 'active ingredient' in this patch is the change to
TsTabletManager::StartTabletCopy that causes an ALREADY_INPROGRESS
response to be returned if the tablet is currently being copied and the
tablet copy thread pool is full. Previously an ALREADY_INPROGRESS
response would only occur if the tablet was currently being copied, and
the threadpool was not full.

The effect of the failure to return ALREADY_INPROGRESS was that a leader
would be much more likely consider a tablet server failed and to
subsequently drop the replica from the Raft config. As a result, on a
highly loaded cluster, a tablet copy could be started at the same time,
300 seconds apart, on many tablet servers.

The remaining changes are to to return more specific errors out of the
tablet copy service, which aids with testing specific codepaths. One of
the existing tablet_copy-itest cases has been beefed up to cover this
specific regression. Without the changes outlined before it fails with:

../../src/kudu/integration-tests/tablet_copy-itest.cc:961: Failure
Expected: (num_inprogress) > (0), actual: 0 vs 0

which is exactly what we would expect; the tablet server is failing to
return INPROGRESS errors.

Anecdotally, this patch has improved TTR times 5-10x on highly loaded
clusters. It's still possible for tablets to be bounced around during
re-replication if the copying tablet server has a full RPC queue, or
it's unable to start the tablet copy for 300 seconds, but both of these
conditions indicate that it's probably best to drop that tserver and
retry on a (hopefully) less stressed server.

Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
6 files changed, 124 insertions(+), 40 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/6925/5
-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

2017-05-22 Thread Dan Burkert (Code Review)
Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy 
operations per under-replicated tablet
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6925/3/src/kudu/integration-tests/cluster_itest_util.cc
File src/kudu/integration-tests/cluster_itest_util.cc:

PS3, Line 941: CHECK(resp.error().has_code()) << "Tablet copy error 
response has no code";
 : 
CHECK(tserver::TabletServerErrorPB::Code_IsValid(resp.error().code()))
 : << "Tablet copy error response code is not valid";
> could we hit these CHECKs in a rolling upgrade scenario? we don't officiall
My understanding is that this is test-only code.


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2020: tserver failure causes multiple tablet copy operations per under-replicated tablet

2017-05-22 Thread Dan Burkert (Code Review)
Dan Burkert has posted comments on this change.

Change subject: KUDU-2020: tserver failure causes multiple tablet copy 
operations per under-replicated tablet
..


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6925/2/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

PS2, Line 395:   // The thread pool is at capacity. Check if the tablet is 
already in
 :   // transition (i.e. being copied).
 :   boost::optional transition;
 :   {
 : std::lock_guard lock(lock_);
 : auto* t = FindOrNull(transition_in_progress_, tablet_id);
 : if (t) {
 :   transition = *t;
 : }
 :   }
> +1
The 'happy path' in this case is that the thread pool is not oversubscribed.  
In that case the tablet copy immediately gets a thread, and as part of 
initializing, it already checks that there isn't a copy in progress.

So, if we put the check up front, it would actually happen twice for the fast 
path.


PS2, Line 406: cb(Status::IllegalState(
 :   strings::Substitute("State transition of tablet $0 
already in progress: $1",
 :   tablet_id, *transition)),
 :   TabletServerErrorPB::ALREADY_INPROGRESS);
> It should get logged by the leader making the remote call.
Yes, I've beefed up the logging of these errors on the leader.  Going to do 
another cluster test to make sure it's not overwhelming.


-- 
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] threadpool: new test for pools with no max threads

2017-05-22 Thread Adar Dembo (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6945

to look at the new patch set (#3).

Change subject: threadpool: new test for pools with no max_threads
..

threadpool: new test for pools with no max_threads

This test ensures that a pool created with effectively no max_threads works
as expected. That is:
1. Tokenless tasks should trigger the creation of a new thread.
2. Token-based tasks can create new threads, but only up to the number of
   tokens submitted against.

I intend to use this "feature" to consolidate some Raft thread pools where a
num_cpus upper bound may be undesirable (i.e. where tasks submitted to the
pools may result in blocking IO).

Change-Id: I37d0bd0a05098bcc1a81ec9f1ac33e74e98781e4
---
M src/kudu/util/threadpool-test.cc
M src/kudu/util/threadpool.h
2 files changed, 70 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/6945/3
-- 
To view, visit http://gerrit.cloudera.org:8080/6945
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I37d0bd0a05098bcc1a81ec9f1ac33e74e98781e4
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] threadpool: new test for pools with no max threads

2017-05-22 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: threadpool: new test for pools with no max_threads
..


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6945/2/src/kudu/util/threadpool-test.cc
File src/kudu/util/threadpool-test.cc:

PS2, Line 153: shared_ptr(new SlowTask(latch));
> nit: consider using std::make_shared(latch) instead
Done


PS2, Line 229:   pool_->Wait();
 :   pool_->Shutdown();
> What if one of asserts above triggers due to a failure?  Would it be necess
ThreadPool's destructor also calls Shutdown().


-- 
To view, visit http://gerrit.cloudera.org:8080/6945
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I37d0bd0a05098bcc1a81ec9f1ac33e74e98781e4
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] threadpool: token-based task sequencing

2017-05-22 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: threadpool: token-based task sequencing
..


Patch Set 3: Verified+1

Failure was in kinit in a Java test.

-- 
To view, visit http://gerrit.cloudera.org:8080/6874
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If46dc34212027b6ea5dbc2ead7c7af8be57f2c70
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] threadpool: new test for pools with no max threads

2017-05-22 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change.

Change subject: threadpool: new test for pools with no max_threads
..


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6945/2/src/kudu/util/threadpool-test.cc
File src/kudu/util/threadpool-test.cc:

PS2, Line 153: shared_ptr(new SlowTask(latch));
nit: consider using std::make_shared(latch) instead


PS2, Line 229:   pool_->Wait();
 :   pool_->Shutdown();
What if one of asserts above triggers due to a failure?  Would it be necessary 
to call Shutdown() on pool_ or it's will be cleaned up automatically somewhere 
else?


-- 
To view, visit http://gerrit.cloudera.org:8080/6945
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I37d0bd0a05098bcc1a81ec9f1ac33e74e98781e4
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] threadpool: new test for pools with no max threads

2017-05-22 Thread Adar Dembo (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6945

to look at the new patch set (#2).

Change subject: threadpool: new test for pools with no max_threads
..

threadpool: new test for pools with no max_threads

This test ensures that a pool created with effectively no max_threads works
as expected. That is:
1. Tokenless tasks should trigger the creation of a new thread.
2. Token-based tasks can create new threads, but only up to the number of
   tokens submitted against.

I intend to use this "feature" to consolidate some Raft thread pools where a
num_cpus upper bound may be undesirable (i.e. where tasks submitted to the
pools may result in blocking IO).

Change-Id: I37d0bd0a05098bcc1a81ec9f1ac33e74e98781e4
---
M src/kudu/util/threadpool-test.cc
M src/kudu/util/threadpool.h
2 files changed, 70 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/6945/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6945
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I37d0bd0a05098bcc1a81ec9f1ac33e74e98781e4
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] threadpool: token-based task sequencing

2017-05-22 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: threadpool: token-based task sequencing
..


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6874/2/src/kudu/util/threadpool-test.cc
File src/kudu/util/threadpool-test.cc:

Line 45: using strings::Substitute;
> warning: using decl 'Substitute' is unused [misc-unused-using-decls]
Done


http://gerrit.cloudera.org:8080/#/c/6874/2/src/kudu/util/threadpool.cc
File src/kudu/util/threadpool.cc:

Line 168:   return Submit(std::shared_ptr(new 
ClosureRunnable(std::move(c;
> warning: std::move of the const variable has no effect; remove std::move() 
Done


Line 380:   return Submit(std::shared_ptr(new 
ClosureRunnable(std::move(c;
> warning: std::move of the const variable has no effect; remove std::move() 
Done


http://gerrit.cloudera.org:8080/#/c/6874/2/src/kudu/util/threadpool.h
File src/kudu/util/threadpool.h:

Line 285:   Status DoSubmit(std::shared_ptr task,
> warning: function 'kudu::ThreadPool::DoSubmit' has a definition with differ
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6874
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If46dc34212027b6ea5dbc2ead7c7af8be57f2c70
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] consensus: consolidate Raft thread pools

2017-05-22 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: consensus: consolidate Raft thread pools
..


Patch Set 1:

I'm having trouble reconciling the SequenceToken lifecycle requirements with 
the complicated RaftConsensus lifecycle.

The issue is that while RaftConsensus::ShutDown() is called in a normal context 
(typically a service pool thread handling the DeleteTablet() RPC), the  object 
itself may be destroyed later, by its thread pool's worker thread. This happens 
because the RaftConsensus object is bound and retained by several callbacks 
that run on the thread pool. When this happens, the process crashes because 
~RaftConsensus tries to release its SequenceToken, but the token's slot is 
still active because it's running this destructor.

I'm not sure how to fix this because I still don't fully understand 
RaftConsensus's locking. Is it sufficient for RaftConsensus::Shutdown() to 
destroy the SequenceToken and for the various callbacks to check if the token 
is live before submitting to it? Or does that require additional 
synchronization?

Here's a sample failure: 
https://gist.github.com/adembo/9a3c8e91807ba41bb4aceef67ec1dc0a with some 
additional logging. Note how thread 12205 initiates the crash, and the last 
thing it logged was "Destroying runnable" but not "Runnable destroyed"; this is 
a sign that the crash was in the worker thread's runnable release code.

-- 
To view, visit http://gerrit.cloudera.org:8080/6946
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8958c607d11ac2e94908670c985059bef0ff5f54
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] threadpool-test: use test fixture

2017-05-22 Thread Adar Dembo (Code Review)
Hello Dan Burkert, David Ribeiro Alves, Todd Lipcon,

I'd like you to do a code review.  Please visit

http://gerrit.cloudera.org:8080/6944

to review the following change.

Change subject: threadpool-test: use test fixture
..

threadpool-test: use test fixture

Change-Id: Ic76e5f946356d2cf6869cb7a665d0ba5adde
---
M src/kudu/util/threadpool-test.cc
M src/kudu/util/threadpool.h
2 files changed, 146 insertions(+), 149 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/44/6944/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6944
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ic76e5f946356d2cf6869cb7a665d0ba5adde
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] threadpool: new test for pools with no max threads

2017-05-22 Thread Adar Dembo (Code Review)
Hello Dan Burkert, David Ribeiro Alves, Todd Lipcon,

I'd like you to do a code review.  Please visit

http://gerrit.cloudera.org:8080/6945

to review the following change.

Change subject: threadpool: new test for pools with no max_threads
..

threadpool: new test for pools with no max_threads

This test ensures that a pool created with effectively no max_threads works
as expected. That is:
1. Tokenless tasks should trigger the creation of a new thread.
2. Token-based tasks can create new threads, but only up to the number of
   tokens submitted against.

I intend to use this "feature" to consolidate some Raft thread pools where a
num_cpus upper bound may be undesirable (i.e. where tasks submitted to the
pools may result in blocking IO).

Change-Id: I37d0bd0a05098bcc1a81ec9f1ac33e74e98781e4
---
M src/kudu/util/threadpool-test.cc
M src/kudu/util/threadpool.h
2 files changed, 70 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/6945/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6945
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I37d0bd0a05098bcc1a81ec9f1ac33e74e98781e4
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] threadpool: token-based task sequencing

2017-05-22 Thread Adar Dembo (Code Review)
Hello Dan Burkert, David Ribeiro Alves, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6874

to look at the new patch set (#2).

Change subject: threadpool: token-based task sequencing
..

threadpool: token-based task sequencing

This patch adds task sequencing to util/threadpool. Task sequencing allows
m contexts to share a single pool with n threads while also ensuring that
the pool executes tasks belonging to each context in order. Previously this
was only achievable via "one singleton pool per context", which grossly
inflated the total number of threads and overall state.

A group of tasks are sequenced by submission to a dedicated "slot". A client
obtains exclusive access to a slot via AllocateTokenSlot(), which returns a
"token" that can be used for task submission/waiting. When the token is
destroyed (or Release() is called), the slot is returned to the pool. For
implementation simplicity, clients must wait for a token's outstanding tasks
to complete before destroying their token.

Some notes:
- Slots and tokens are mapped 1-1 so they could theoretically be combined,
  but I prefer this separation of concerns.
- The current implementation requires tokens to have no outstanding tasks
  when they are released. This was a conscious choice made so that token
  destruction (especially when done as part of a larger object graph
  destruct) won't take an unusually long amount of time.
- I evaluated two other implementations. In one, tokens had an implicit
  lifecycle that was automatically managed by the threadpool. While simpler
  for clients, the threadpool was more inefficient with more allocations and
  deallocations in each submission. In the other, token submission was built
  using regular task submission. This afforded a certain separation between
  the "core" of the threadpool and the token logic, but it complicated
  locking and tracking of queued tasks.
- I tried to keep submission (whether token-based or tokenless) fast. Just
  the change from std::list to std::deque for the main queue ought to
  improve performance of tokenless submissions. The next bottleneck is
  likely to be lock contention on the global threadpool lock.

Change-Id: If46dc34212027b6ea5dbc2ead7c7af8be57f2c70
---
M src/kudu/util/debug-util.h
M src/kudu/util/threadpool-test.cc
M src/kudu/util/threadpool.cc
M src/kudu/util/threadpool.h
4 files changed, 708 insertions(+), 70 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/74/6874/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6874
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If46dc34212027b6ea5dbc2ead7c7af8be57f2c70
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] consensus: consolidate Raft thread pools

2017-05-22 Thread Adar Dembo (Code Review)
Hello Dan Burkert, David Ribeiro Alves, Todd Lipcon,

I'd like you to do a code review.  Please visit

http://gerrit.cloudera.org:8080/6946

to review the following change.

Change subject: consensus: consolidate Raft thread pools
..

consensus: consolidate Raft thread pools

This patch consolidates the two thread pools used in Raft consensus: the
observer and "Raft" (for lack of a better name) pools. The former runs tasks
for infrequent events such as term and config changes while the latter is
used for periodic events such as processing heartbeat responses.

In this patch, each per-replica pool is consolidated into a single
per-server pool. Sequence tokens are used to ensure that tasks belonging to
a single replica are executed serially and in order. For the "Raft" pool, a
single sequence token is shared by the PeerManager and RaftConsensus; this
mirrors the existing sharing behavior and is safe because token operations
are thread-safe.

The non-default max_threads may come as a surprise, but David showed me how
tasks submitted to either pool may sometimes lead to an fsync (mostly via
cmeta flushes). As such, it isn't safe to use the default num_cpus upper
bound, as that may cause such IO-based tasks to block other CPU-only tasks
(or worse, periodic tasks like heartbeat response processing).

What per-replica threads are not addressed by this patch?
- Failure detection thread: a threadpool isn't the right model
  for these. Something like a hash wheel timer would be ideal.
- Prepare thread pool (max_threads=1): these could be consolidated too, but
  their metrics are per-replica, and sequence tokens don't (yet) support
  that. Meaning, if they were consolidated now, the metrics would also
  consolidate and would be less useful as a result.

Change-Id: I8958c607d11ac2e94908670c985059bef0ff5f54
---
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_peers.h
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/peer_manager.cc
M src/kudu/consensus/peer_manager.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/server/server_base.cc
M src/kudu/server/server_base.h
M src/kudu/tablet/tablet_replica-test.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_copy_source_session-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
19 files changed, 188 insertions(+), 117 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/46/6946/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6946
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I8958c607d11ac2e94908670c985059bef0ff5f54
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Todd Lipcon