[kudu-CR] KUDU-1358 (part 1): master should accept heartbeat even if follower

2016-07-21 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1358 (part 1): master should accept heartbeat even if 
follower
..


Patch Set 12:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/3609/12//COMMIT_MSG
Commit Message:

Line 20: possible; the only way a master can "lose" a cached TSDescriptor is if 
the
> what about when a tserver restarts? is it still going to do a proper full T
If the tserver restarts, it always issues a full TR as its first heartbeat. And 
if that heartbeat fails, it'll get retried until the server acknowledges it.

That's always been the case; the subsequent patch just adds "if you see your 
master go from not_leader to leader, send a full TR".


http://gerrit.cloudera.org:8080/#/c/3609/12/src/kudu/integration-tests/mini_cluster.cc
File src/kudu/integration-tests/mini_cluster.cc:

Line 289: auto ts = mini_tablet_server->server();
> should this verify that the TS process is running? (there's a getter for th
I'm inclined to keep it the way it is. This is old code; my goal was to add the 
DO_NOT_MATCH_TSERVERS mode so I could use it in my integration test.

I'll document that the tservers may not be running, though.


Line 299:   }
> else LOG(FATAL) perhaps? or use a switch() so that you get a compilation er
I'll use a switch. Note that the downside of using enum classes is that you 
can't use operator<< on the value (it doesn't implicit cast to an int).


http://gerrit.cloudera.org:8080/#/c/3609/12/src/kudu/integration-tests/mini_cluster.h
File src/kudu/integration-tests/mini_cluster.h:

Line 145: // Match the tservers retrieved from each master against the 
tservers in
> not sure what 'Match' means here. Is this a verification? (i.e that the cou
OK, will elaborate in the comment.


-- 
To view, visit http://gerrit.cloudera.org:8080/3609
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I578674927b65b4171e8437de8515130e4a0ed139
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] master: do not delete unknown tablets

2016-07-20 Thread Adar Dembo (Code Review)
Hello Dan Burkert, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3645

to look at the new patch set (#8).

Change subject: master: do not delete unknown tablets
..

master: do not delete unknown tablets

Quoting from docs/design-docs/multi-master-1.0.md:

"The master and/or tserver must enforce that all actions take effect
iff they were sent by the master that is currently the leader.

After an exhaustive audit of all master state changes (see appendix A), it
was determined that the current protection mechanisms built into each RPC
are sufficient to provide fencing. The one exception is orphaned replica
deletion done in response to a heartbeat. To protect against that, true
orphans (i.e. tablets for which no persistent record exists) will not be
deleted at all. As the master retains deleted table/tablet metadata in
perpetuity, this should ensure that true orphans appear only under drastic
circumstances, such as a tserver that heartbeats to the wrong cluster."

The new test isn't ideal in that it must wait some time to allow the tserver
to receive an RPC from the master, but on my laptop it does fail without the
fix, and it should fail fairly often in other machines/environments too.

Change-Id: I331f2d5bb06c38daa7b09854dbb24a7881723551
---
M src/kudu/integration-tests/create-table-itest.cc
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/master/catalog_manager.cc
3 files changed, 78 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/3645/8
-- 
To view, visit http://gerrit.cloudera.org:8080/3645
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I331f2d5bb06c38daa7b09854dbb24a7881723551
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] master: do not delete unknown tablets

2016-07-20 Thread Adar Dembo (Code Review)
Hello Dan Burkert, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3645

to look at the new patch set (#7).

Change subject: master: do not delete unknown tablets
..

master: do not delete unknown tablets

Quoting from docs/design-docs/multi-master-1.0.md:

"The master and/or tserver must enforce that all actions take effect
iff they were sent by the master that is currently the leader.

After an exhaustive audit of all master state changes (see appendix A), it
was determined that the current protection mechanisms built into each RPC
are sufficient to provide fencing. The one exception is orphaned replica
deletion done in response to a heartbeat. To protect against that, true
orphans (i.e. tablets for which no persistent record exists) will not be
deleted at all. As the master retains deleted table/tablet metadata in
perpetuity, this should ensure that true orphans appear only under drastic
circumstances, such as a tserver that heartbeats to the wrong cluster."

The new test isn't ideal in that it must wait some time to allow the tserver
to receive an RPC from the master, but on my laptop it does fail without the
fix, and it should fail fairly often in other machines/environments too.

Change-Id: I331f2d5bb06c38daa7b09854dbb24a7881723551
---
M src/kudu/integration-tests/create-table-itest.cc
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/master/catalog_manager.cc
3 files changed, 77 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/3645/7
-- 
To view, visit http://gerrit.cloudera.org:8080/3645
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I331f2d5bb06c38daa7b09854dbb24a7881723551
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] c++ client: use operation timeout as deadline for finding new leader master

2016-07-20 Thread Adar Dembo (Code Review)
Hello Dan Burkert, Todd Lipcon,

I'd like you to do a code review.  Please visit

http://gerrit.cloudera.org:8080/3718

to review the following change.

Change subject: c++ client: use operation timeout as deadline for finding new 
leader master
..

c++ client: use operation timeout as deadline for finding new leader master

We had been using the default RPC timeout, which may be set to a very low
value as in ClientStressTest_MultiMaster_TestLeaderResolutionTimeout.
Now we'll use the operation timeout as the overall deadline while still
preserving the semantics of using the default RPC timeout for the
GetMasterRegistration() RPCs themselves.

As my patch series removes the guarantee that a leader master is elected at
the time that cluster tests run, it's important that the logic for finding
the leader master provide ample time for an election to finish.

Also, I think I've addressed the root cause behind KUDU-573 by fixing a race
in GetLeaderMasterRpc's SendRpcCb() and GetMasterRegistrationRpcCbForNode()
methods. The race manifests when the last two RPC responses are "I am the
leader" and "I am not the leader" respectively. In one interleaving, both
responses enter SendRpcCb(), and the second calls DelayedRetryCb(). If that
were a call to DelayedRetry() instead, the GetLeaderMasterRpc would be
destroyed by the time the reactor thread reran the RPC.

Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e
---
M src/kudu/client/client-internal.cc
M src/kudu/integration-tests/external_mini_cluster.cc
M src/kudu/master/master_rpc.cc
M src/kudu/master/master_rpc.h
M src/kudu/tserver/heartbeater.cc
5 files changed, 52 insertions(+), 38 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/18/3718/1
-- 
To view, visit http://gerrit.cloudera.org:8080/3718
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I0d770875bbf4703444abac11dbc232d7e382165e
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1358 (part 3): new multi-master stress test

2016-07-20 Thread Adar Dembo (Code Review)
Hello Dan Burkert, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3611

to look at the new patch set (#11).

Change subject: KUDU-1358 (part 3): new multi-master stress test
..

KUDU-1358 (part 3): new multi-master stress test

This commit adds a stress test for multiple masters. The idea is simple:
issue DDL operations at a high rate while periodically restarting a master.

There's a balance to be struck both in the throughput of the operations and
in the periodicity of the restarts; we need to ensure that the masters can
make enough forward progress (in spite of the failures) to process all of
the requests without timing out. To assist, the client uses abnormally long
timeouts on all operations.

Change-Id: I40b5b78c100a7b427b2f4aac3a54665e82a9618c
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/master-stress-test.cc
2 files changed, 421 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/11/3611/11
-- 
To view, visit http://gerrit.cloudera.org:8080/3611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I40b5b78c100a7b427b2f4aac3a54665e82a9618c
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1374: send full tablet report when new leader master is detected

2016-07-20 Thread Adar Dembo (Code Review)
Hello Dan Burkert, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3643

to look at the new patch set (#7).

Change subject: KUDU-1374: send full tablet report when new leader master is 
detected
..

KUDU-1374: send full tablet report when new leader master is detected

This should help prevent missed tablet reports in very specific edge cases,
detailed in the bug report.

The new integration test fails 100% of the time without the change, and
passes 100% of the time with it.

Change-Id: Ic16fc46736476dba39616e79ecfe79eee48b3d7f
---
M src/kudu/integration-tests/master_failover-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/tserver/heartbeater.cc
3 files changed, 100 insertions(+), 7 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/43/3643/7
-- 
To view, visit http://gerrit.cloudera.org:8080/3643
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic16fc46736476dba39616e79ecfe79eee48b3d7f
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1358 (part 1): master should accept heartbeat even if follower

2016-07-20 Thread Adar Dembo (Code Review)
Hello Dan Burkert, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3609

to look at the new patch set (#11).

Change subject: KUDU-1358 (part 1): master should accept heartbeat even if 
follower
..

KUDU-1358 (part 1): master should accept heartbeat even if follower

This patch changes the master's heartbeat acceptance code so that heartbeats
are not rejected outright if the master is a follower. To be specific,
tablet reports are ignored, but heartbeats are processed just enough to warm
the TSDescriptor cache. That way, if this master is elected leader, it can
respond to a CreateTable() even before the first round of heartbeats.

I reduced the complexity of the "should this tserver register or send a full
tablet report?" dance by removing TSDescriptor.has_tablet_report_. It was
used to guarantee a full tablet report in the event that 1) the tserver is
sending incremental tablet reports, and 2) the master has already registered
the tserver. I don't think this exact sequence of events is actually
possible; the only way a master can "lose" a cached TSDescriptor is if the
master is restarted, at which point it loses the tserver registration too.
Plus, all the unit tests passed (in slow mode).

I also snuck in a fix to TSManager::RegisterTS: it wasn't actually returning
a TSDescriptor in its out parameter.

Change-Id: I578674927b65b4171e8437de8515130e4a0ed139
---
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/master_replication-itest.cc
M src/kudu/integration-tests/mini_cluster.cc
M src/kudu/integration-tests/mini_cluster.h
M src/kudu/integration-tests/registration-test.cc
M src/kudu/integration-tests/table_locations-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/master_service.cc
M src/kudu/master/ts_descriptor.cc
M src/kudu/master/ts_descriptor.h
M src/kudu/master/ts_manager.cc
13 files changed, 216 insertions(+), 124 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/09/3609/11
-- 
To view, visit http://gerrit.cloudera.org:8080/3609
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I578674927b65b4171e8437de8515130e4a0ed139
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1358 (part 3): new multi-master stress test

2016-07-20 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1358 (part 3): new multi-master stress test
..


Patch Set 10:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/3611/10/src/kudu/integration-tests/master-stress-test.cc
File src/kudu/integration-tests/master-stress-test.cc:

Line 87:   virtual void SetUp() OVERRIDE {
> void SetUp() override {
Done


Line 212: // The client retried after the RPC timed out, but the master 
did in
> Seems to me this is unlikely to timeout, since it's not really doing much o
It's not possible for another thread to race in the rename/delete due to how 
BlockingGetTableName()/PutTableName() work. So this really is all about the 
timeouts.


Line 238: // TODO: Should be fixed with Exactly Once semantics, see 
KUDU-1537.
> same here
See above.


-- 
To view, visit http://gerrit.cloudera.org:8080/3611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I40b5b78c100a7b427b2f4aac3a54665e82a9618c
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1311 [master] support adding and dropping range partitions

2016-07-20 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1311 [master] support adding and dropping range partitions
..


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3648/11/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 1594:   if (has_metadata_changes) {
What about this? Shouldn't it be the same condition as when setting 
actions.table_to_update?


-- 
To view, visit http://gerrit.cloudera.org:8080/3648
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I42437f365397baf9d4b39b5b17a1587fae70c4be
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1311 [master] support adding and dropping range partitions

2016-07-20 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1311 [master] support adding and dropping range partitions
..


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3648/9/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 1522:   if (has_metadata_changes) {
> Yes, that's basically what 'has_metadata_changes' is tracking.
But what if !tablets_to_add.empty() and 
!has_metadata_changes_for_existing_tablets? Seems like that could happen if an 
AlterTable() comes in that just adds a new partition. In that case, we will set 
the table's state to ALTERING but not write it out, and revert the change 
below. That doesn't seem correct.


-- 
To view, visit http://gerrit.cloudera.org:8080/3648
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I42437f365397baf9d4b39b5b17a1587fae70c4be
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1311 [master] support adding and dropping range partitions

2016-07-20 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1311 [master] support adding and dropping range partitions
..


Patch Set 9:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/3648/9/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 1522:   if (has_metadata_changes) {
Shouldn't this run in any case where l.mutable_data() was modified?


Line 1593:   if (has_metadata_changes) {
This too.


-- 
To view, visit http://gerrit.cloudera.org:8080/3648
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I42437f365397baf9d4b39b5b17a1587fae70c4be
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes


[kudu-CR] client/sample.cc: fixed a couple of crashes

2016-07-20 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: client/sample.cc: fixed a couple of crashes
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/3685/1//COMMIT_MSG
Commit Message:

Line 11: while logging some messages from terminating reactor threads.
> is there another way we can fix this without forcing callers to uninstall t
As with the below, any ideas as to why the precommit test didn't experience 
this?

As for a less onerous fix, log_cb could be declared globally so that it doesn't 
go out of scope at the end of main(). That's probably what "real users" would 
do anyway.


Line 13: Fixed issue with an attempt to access non-existing element
> hmm, the bug fix looks reasonable, but I'm curious why we don't see this cr
I'm also surprised. front() on an empty vector is undefined behavior, so maybe 
the precommit compilers elide the entire statement, and Alexei has been using 
one that doesn't?


-- 
To view, visit http://gerrit.cloudera.org:8080/3685
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic5fa3b812e6402a113bf5e432a3a451dc4cc3735
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1358 (part 3): new multi-master stress test

2016-07-19 Thread Adar Dembo (Code Review)
Adar Dembo has uploaded a new patch set (#10).

Change subject: KUDU-1358 (part 3): new multi-master stress test
..

KUDU-1358 (part 3): new multi-master stress test

This commit adds a stress test for multiple masters. The idea is simple:
issue DDL operations at a high rate while periodically restarting a master.

There's a balance to be struck both in the throughput of the operations and
in the periodicity of the restarts; we need to ensure that the masters can
make enough forward progress (in spite of the failures) to process all of
the requests without timing out. To assist, the client uses abnormally long
timeouts on all operations.

Change-Id: I40b5b78c100a7b427b2f4aac3a54665e82a9618c
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/master-stress-test.cc
2 files changed, 422 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/11/3611/10
-- 
To view, visit http://gerrit.cloudera.org:8080/3611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I40b5b78c100a7b427b2f4aac3a54665e82a9618c
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] master: do not delete unknown tablets

2016-07-19 Thread Adar Dembo (Code Review)
Adar Dembo has uploaded a new patch set (#6).

Change subject: master: do not delete unknown tablets
..

master: do not delete unknown tablets

Quoting from docs/design-docs/multi-master-1.0.md:

"The master and/or tserver must enforce that all actions take effect
iff they were sent by the master that is currently the leader.

After an exhaustive audit of all master state changes (see appendix A), it
was determined that the current protection mechanisms built into each RPC
are sufficient to provide fencing. The one exception is orphaned replica
deletion done in response to a heartbeat. To protect against that, true
orphans (i.e. tablets for which no persistent record exists) will not be
deleted at all. As the master retains deleted table/tablet metadata in
perpetuity, this should ensure that true orphans appear only under drastic
circumstances, such as a tserver that heartbeats to the wrong cluster."

The new test isn't ideal in that it must wait some time to allow the tserver
to receive an RPC from the master, but on my laptop it does fail without the
fix, and it should fail fairly often in other machines/environments too.

Change-Id: I331f2d5bb06c38daa7b09854dbb24a7881723551
---
M src/kudu/integration-tests/create-table-itest.cc
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/master/catalog_manager.cc
3 files changed, 77 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/3645/6
-- 
To view, visit http://gerrit.cloudera.org:8080/3645
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I331f2d5bb06c38daa7b09854dbb24a7881723551
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1358 (part 1): master should accept heartbeat even if follower

2016-07-19 Thread Adar Dembo (Code Review)
Hello Dan Burkert,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3609

to look at the new patch set (#10).

Change subject: KUDU-1358 (part 1): master should accept heartbeat even if 
follower
..

KUDU-1358 (part 1): master should accept heartbeat even if follower

This patch changes the master's heartbeat acceptance code so that heartbeats
are not rejected outright if the master is a follower. To be specific,
tablet reports are ignored, but heartbeats are processed just enough to warm
the TSDescriptor cache. That way, if this master is elected leader, it can
respond to a CreateTable() even before the first round of heartbeats.

An unfortunate side effect is that the "should this tserver register or send
full tablet report?" dance is now more complicated; I tried to cover all
cases in the modified unit test.

I also snuck in a fix to TSManager::RegisterTS: it wasn't actually returning
a TSDescriptor in its out parameter.

Change-Id: I578674927b65b4171e8437de8515130e4a0ed139
---
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/master_replication-itest.cc
M src/kudu/integration-tests/mini_cluster.cc
M src/kudu/integration-tests/mini_cluster.h
M src/kudu/integration-tests/registration-test.cc
M src/kudu/integration-tests/table_locations-itest.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/master_service.cc
M src/kudu/master/ts_descriptor.cc
M src/kudu/master/ts_descriptor.h
M src/kudu/master/ts_manager.cc
12 files changed, 251 insertions(+), 94 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/09/3609/10
-- 
To view, visit http://gerrit.cloudera.org:8080/3609
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I578674927b65b4171e8437de8515130e4a0ed139
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1374: send full tablet report when new leader master is detected

2016-07-19 Thread Adar Dembo (Code Review)
Adar Dembo has uploaded a new patch set (#6).

Change subject: KUDU-1374: send full tablet report when new leader master is 
detected
..

KUDU-1374: send full tablet report when new leader master is detected

This should help prevent missed tablet reports in very specific edge cases,
detailed in the bug report.

The new integration test fails some fraction of the time without change, and
passes 100% of the time with it.

Change-Id: Ic16fc46736476dba39616e79ecfe79eee48b3d7f
---
M src/kudu/integration-tests/master_failover-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/tserver/heartbeater.cc
3 files changed, 109 insertions(+), 7 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/43/3643/6
-- 
To view, visit http://gerrit.cloudera.org:8080/3643
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic16fc46736476dba39616e79ecfe79eee48b3d7f
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1374: send full tablet report when new leader master is detected

2016-07-19 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1374: send full tablet report when new leader master is 
detected
..


Patch Set 5:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/3643/5//COMMIT_MSG
Commit Message:

Line 13: there was a way to mock a server-side krpc component as easily as it 
is to
> what about injecting a master crash in the AsyncSendFoo() method? that woul
Unfortunately, the code path that results in ts_proxy.FooAsync() is synchronous 
w.r.t. handling the heartbeats. So if we're going to induce a crash, it has to 
be somewhere in the RPC layer. Or we have to force the first RPC attempt to 
fail, then crash in the retry which is asynchronous.

I banged my head against the wall for a while here but I did produce a test 
that fails 4/10 times without the new heuristic. Take a look and let me know if 
you see a way to simplify it, or to make it fail more reliably.


Line 15: multi-master tests, though.
> is there a test you can loop before/after to verify this?
Done


http://gerrit.cloudera.org:8080/#/c/3643/5/src/kudu/tserver/heartbeater.cc
File src/kudu/tserver/heartbeater.cc:

Line 175:   // Indicates that the thread should set a full tablet report. Set 
when
> s/set/send/
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/3643
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic16fc46736476dba39616e79ecfe79eee48b3d7f
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] [java-client] Re-enable multi-master tests

2016-07-19 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: [java-client] Re-enable multi-master tests
..


Patch Set 4: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/3654
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia1051222738c84ef3d3e1a33b4981bc9454b7972
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: No


[kudu-CR] master: do not delete unknown tablets

2016-07-19 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: master: do not delete unknown tablets
..


Patch Set 5:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/3645/5//COMMIT_MSG
Commit Message:

Line 7: master: do not delete unknown tablets
> I wonder if we now need some tool to "reformat" a tablet server? or to inse
I'm almost positive that a user will one day blow away their master metadata 
directory and ask us to fix it. IIRC this happened with HDFS in the CDH3 days. 
The aggregate of all  tablet metadata should be sufficient to reconstruct the 
master metadata, provided all of the tablets are sufficiently replicated. But, 
I don't know if it makes sense to spend any time on this problem now, given 
that no one has run into this and our other priorities.

Reformatting a tserver is pretty easy: just stop it, delete its wal/data 
directories, and restart it. Or did you mean something else?

I don't really see a use case for dummy "deleted table/tablet" records, apart 
from the aforementioned recovery case, at which point these aren't "dummy" 
records. But, deleting orphaned tablets to reclaim space is a legit use case. 
The minimal fix is to add a gflag that defaults to off and grants the master 
permission to delete orphaned tablets. I'll do that here.

Beyond that, let me know what JIRAs I should file. :)


Line 9: Quoting from the multi-master design doc:
> nit: can you provide a link or source code path here?
Done


http://gerrit.cloudera.org:8080/#/c/3645/5/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

PS5, Line 1546: INFO)
> WARNING might be more appropriate here? or perhaps INFO if it's an incremen
I changed it to WARNING. It's hard to know in this context whether it's an 
incremental report or not, and besides, an incremental report only includes a 
tablet if its state has changed in some way.


-- 
To view, visit http://gerrit.cloudera.org:8080/3645
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I331f2d5bb06c38daa7b09854dbb24a7881723551
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] Doxygen for C++ client API

2016-07-19 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: Doxygen for C++ client API
..


Patch Set 11:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/3619/9/src/kudu/client/client.h
File src/kudu/client/client.h:

> 1.  It's a purely stylistic change, it's not related to doxygen requirement
1. OK, then I think reverting back to one space makes sense.
2. Thanks, that makes sense.
3. Cool.

Crap, I just realized I didn't publish this after I looked at PS10. Now I 
understand why you hadn't changed anything based on #1 until we talked on 
HipChat. My mistake.


Line 432: 
> I don't understand what "Required" and "Optional" tags are in this context.
I'm referring to how, in some of the old comments here, there were one-word 
sentences like "Required." or "Optional." that help the user understand which 
builder methods they must call and which they can omit.


http://gerrit.cloudera.org:8080/#/c/3619/11/src/kudu/client/client.h
File src/kudu/client/client.h:

Line 767:   /// the opration is complete. Otherwise, every operation returns 
immediately.
Nit: operation


Line 1079:   /// @copydoc KuduSession::Flush()
Will this also include the "Flush any pending writes" sentence from Flush()? If 
so, that would be kind of weird.


Line 1332:   /// @return Result status of the operation (beging scanning).
Nit: begin scanning


-- 
To view, visit http://gerrit.cloudera.org:8080/3619
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie7d42fb1c90b83074e357dcecf42489ed9fc4f02
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Dinesh Bhat 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Misty Stanley-Jones 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1358 (part 3): new multi-master stress test

2016-07-19 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1358 (part 3): new multi-master stress test
..


Patch Set 9:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/3611/9/src/kudu/integration-tests/master-stress-test.cc
File src/kudu/integration-tests/master-stress-test.cc:

PS9, Line 73: tT
> admin?
Done


Line 219: // TODO: Should be fixed with Exactly Once semantics.
> would be nice to tie these to JIRAs
Filed KUDU-1537.


Line 223:   num_altered++;
> why not get rid of these local variables and just IncrementBy(1) on the glo
At one point these were tight loops and I thought that'd be unnecessary 
contention on a shared resource. But that's no longer the case (and probably 
silly to begin with).


Line 323:   SleepFor(MonoDelta::FromMilliseconds(100));
> would be good to randomize the sleeps, so you might catch some slightly dif
Just for RestartMasterLoop(), right? Since the rest are interleaved by virtue 
of using multiple threads? Done.


-- 
To view, visit http://gerrit.cloudera.org:8080/3611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I40b5b78c100a7b427b2f4aac3a54665e82a9618c
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1358 (part 2): heartbeat to every master

2016-07-19 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1358 (part 2): heartbeat to every master
..


Patch Set 9:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/3610/9/src/kudu/tserver/heartbeater.cc
File src/kudu/tserver/heartbeater.cc:

Line 164:   mutable std::atomic_int next_report_seq_;
> hrm, this really has to be mutable?
For the tablet report generation functions, yes. But as discussed below, it's 
probably less confusing to mark them non-const and remove this.


Line 522: TabletReportState state;
> maybe use = { seqno }; here? that way you'll get a warning at some point la
Done


PS9, Line 528: const 
> maybe makes more sense for this to not be 'const' since it changes the sequ
I agree that mutable and const here is kind of "cheating". Will remove both.


http://gerrit.cloudera.org:8080/#/c/3610/9/src/kudu/tserver/heartbeater.h
File src/kudu/tserver/heartbeater.h:

Line 57:   // not dirty once the report has been acknowledged by every master.
> this makes it sound like it will keep reporting as "dirty" to all masters u
Yeah. I'll clarify the comment.


-- 
To view, visit http://gerrit.cloudera.org:8080/3610
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic85ac4193462d21c989dbd7874b451e8eaab8e3e
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1358 (part 1): master should accept heartbeat even if follower

2016-07-19 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1358 (part 1): master should accept heartbeat even if 
follower
..


Patch Set 9:

(1 comment)

> (1 comment)
 > 
 > Any way to system test this, like calling ListTabletServers against
 > a follower master? (or visiting its /tablet-servers web page?)

I added a new test in master_replication-itest. Let me know if it doesn't fit 
the bill.

http://gerrit.cloudera.org:8080/#/c/3609/9/src/kudu/master/catalog_manager.h
File src/kudu/master/catalog_manager.h:

PS9, Line 366:   };
> DISALLOW_COPY_AND_ASSIGN
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/3609
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I578674927b65b4171e8437de8515130e4a0ed139
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] [java-client] fix soundness hole in flushing async kudu session

2016-07-19 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: [java-client] fix soundness hole in flushing async kudu session
..


Patch Set 1: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/3676
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic838b65e2ccf5c48f0b51db235936b834aa24cae
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: No


[kudu-CR] KUDU-1311 [master] support adding and dropping range partitions

2016-07-19 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1311 [master] support adding and dropping range partitions
..


Patch Set 7: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/3648
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I42437f365397baf9d4b39b5b17a1587fae70c4be
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: No


[kudu-CR] [java-client] Re-enable multi-master tests

2016-07-19 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: [java-client] Re-enable multi-master tests
..


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/3654/3/java/kudu-client/src/main/java/org/kududb/client/GetMasterRegistrationReceived.java
File 
java/kudu-client/src/main/java/org/kududb/client/GetMasterRegistrationReceived.java:

PS3, Line 109: and
 : // there's as many of them as responses received.
This much is obvious from the code itself, but what's not obvious is the 
significance. Could you modify the comment to explain why this matters?


http://gerrit.cloudera.org:8080/#/c/3654/3/java/kudu-client/src/test/java/org/kududb/client/BaseKuduTest.java
File java/kudu-client/src/test/java/org/kududb/client/BaseKuduTest.java:

Line 358: waitForAllTabletServers();
I've been working on a change that'll allow ListTabletServers to be called on 
follower masters, as that reflects the new world order, where all masters 
receive heartbeats in order to keep their TS caches warm.

With that change in place, I don't think waitForAllTabletServers() can be used 
as a proxy for "wait for a new master to be elected and for tservers to report 
to it".  getTablesList() comes close (because it can only succeed after a new 
leader master is elected) but that's still not the same thing.

Why do you actually need to wait? AFAICT killMasterLeader() is only used in 
TestMasterFailover.testKillLeader(), and the operations that follow the call 
should all succeed even without waiting, because the client itself is supposed 
to wait. The only exception is the createTable(), but that will be fixed with 
my patches.


http://gerrit.cloudera.org:8080/#/c/3654/3/java/kudu-client/src/test/java/org/kududb/client/TestGetMasterRegistrationReceived.java
File 
java/kudu-client/src/test/java/org/kududb/client/TestGetMasterRegistrationReceived.java:

Line 91: // Permutation of the previous
Nit: terminate with a period. Below too.


Line 162: GetMasterRegistrationReceived grrm = new 
GetMasterRegistrationReceived(MASTERS, d);
Nice variable name.


Line 177:   d.join(1000); // Don't care about the response.
Is there any significance to the timeout value here? Could we call join() 
without a timeout?


Line 188:   // Helper method that determines if the callback or errback should 
be called.
This one and makeGMRR() can be static.


-- 
To view, visit http://gerrit.cloudera.org:8080/3654
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia1051222738c84ef3d3e1a33b4981bc9454b7972
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1311 [master] support adding and dropping range partitions

2016-07-19 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1311 [master] support adding and dropping range partitions
..


Patch Set 6:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/3648/6/src/kudu/client/client.cc
File src/kudu/client/client.cc:

PS6, Line 853: KuduPartialRow* lower_bound,
 :   
KuduPartialRow* upper_bound
Should we first check that the bounds aren't null?


http://gerrit.cloudera.org:8080/#/c/3648/6/src/kudu/integration-tests/external_mini_cluster.h
File src/kudu/integration-tests/external_mini_cluster.h:

Line 406:   Status WaitForMasterUpAndRunning() WARN_UNUSED_RESULT;
Maybe doc this new method? And since it's an ExternalMaster method, maybe 
rename to "WaitForCatalogManager" or something more specific?


http://gerrit.cloudera.org:8080/#/c/3648/4/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 1345:   }
> yes, otherwise the messages would need to be copied.
Can't these vectors be of crefs instead? I just don't think any other methods 
here modify the input, and it doesn't seem like something we should do.


http://gerrit.cloudera.org:8080/#/c/3648/6/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

PS6, Line 1267:   return Status::NotFound("New partition conflicts 
with existing partition",
  :   step.DebugString());
Can we point out the conflict in these messages? Or is that too verbose?


PS6, Line 1272: auto prev = std::prev(existing_iter);
  : TabletMetadataLock metadata(prev->second, 
TabletMetadataLock::READ);
Nit: combine?


Line 1284: if (metadata.pb.partition().partition_key_start()< 
upper_bound) {
Nit: partition_key_start() < upper_bound


Line 1331: existing_tablets.erase(existing_iter);
Maybe CHECK that this returns 1? Or add EraseOrDie to map-util.h and use that?


Line 1551:   {
Please add a comment here rationalizing the order of operations.


Line 1563: table->AddRemoveTablets(tablets_to_add, tablets_to_drop);
I don't remember why we thought this should be done with the global catalog 
manager spinlock held. Can you remind me?


http://gerrit.cloudera.org:8080/#/c/3648/6/src/kudu/master/catalog_manager.h
File src/kudu/master/catalog_manager.h:

Line 207:   // Adds all tablets to the return vector in partition key sorted 
order.
Nit: returned vector


Line 230:   std::map tablet_map() const {
Let's make the TabletInfoMap typedef public and use it in this method.


-- 
To view, visit http://gerrit.cloudera.org:8080/3648
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I42437f365397baf9d4b39b5b17a1587fae70c4be
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes


[kudu-CR] async background flush provision for C++ client

2016-07-18 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: async background flush provision for C++ client
..


Patch Set 1:

What was the rationale behind using a dedicated thread to manage background 
flushing? I think synchronization would be net simpler if instead, we reused 
the messenger's reactor threads for background flushing. They're already used 
to run callbacks whenever a Proxy::FooAsync() call is made, and you can use 
Messenger::ScheduleOnReactor() to schedule the background flushing to happen at 
some time in the future.

Some other arguments in favor of using reactor threads:
1. It reduces the thread count  of the client.
2. Structurally it makes the implementation similar to the Java client.

-- 
To view, visit http://gerrit.cloudera.org:8080/3668
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Idfde05bef342db24990c6e3da3b0270c3bb37a9d
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] doxygen for C++ client API

2016-07-18 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: doxygen for C++ client API
..


Patch Set 9:

(65 comments)

http://gerrit.cloudera.org:8080/#/c/3619/9/src/kudu/client/client.h
File src/kudu/client/client.h:

High level formatting questions:
1. I noticed that you've reformatted the comments to insert two spaces when 
beginning a new sentence rather than one. There's no standard for this as far 
as Kudu is concerned, but I think most of us only use one space. Is this change 
purely stylistic, or is it mandated by doxygen?
2. Should the sentence summarizing a given method end with a period? In some 
cases (e.g. SetVerboseLogLevel, UninstallLoggingCallback) it does, while in 
others (e.g. SetInternalSignalNumber, GetShortVersionString) it doesn't. Is 
there a protocol here? Or is it just inconsistent? If it's just an 
inconsistency, could you fix it? The same applies for parameters and return 
values.
3. Some getters only have a @return while others have both a @return and a 
short summarizing sentence. Why the inconsistency? If either are acceptable, 
I'd prefer if simple getters just had a @return, as it's less noise.


PS9, Line 91: Before a callback is registered, all internal client log events
: /// are logged to the stderr.
Nit: why combine this sentence into the previous paragraph instead of leaving 
it on its own as before?


Line 131: ///   Signal number to use for internal
Nit: for internal what? Maybe you meant "Signal number to use internally"?


PS9, Line 222: Each KuduClient instance is sandboxed with no
 : /// global cross-client state.
Not related to your change, but this isn't strictly true anymore. Basically, 
free functions (like those at the top of client.h) affect all clients because 
they modify global state.


Line 247:   /// @return Pointer to newly created instance: it's the caller's
Nit: semi-colon is more appropriate here.


Line 277:   /// Check if table alteration is in process
Nit: in-progress, to be consistent with IsCreateTableInProgress.


PS9, Line 282:   ///   Set to @c true if an AlterTable operation is in-progress;
 :   ///   set to @c false otherwise
Nit: can you reword to be consistent with IsCreateTableInProgress, since they 
both work the same way?


Line 287:   /// Get scheme for a table
Nit: schema, not scheme.


Line 290:   ///   Name of the table if question
Nit: in question


Line 292:   ///   Raw pointer to the schema object: caller gets ownership
Nit: semi-colon.


Line 310:   ///   Substring-filter to use; empty sub-string filter matches any 
tables
Nit: Substring filter


Line 320:   ///   Set only on success: set to @c true iff table exists
Nit: semi-colon


PS9, Line 326:   /// If the table has not been opened before, then open the 
table
 :   /// with the given name.  If the table has not been opened 
before
 :   /// for this client, this will do an RPC to ensure that the 
table exists
 :   /// and look up its schema.
Also not related to this change, but the RPC and schema lookup are always done, 
regardless if "the table has not been opened before" or not.


Line 346:   /// @return A new session object: caller is responsible for 
destroying it.
Nit: semi-colon.


Line 359:   /// @return @c true iff it's a multi-master session
This isn't related to sessions per-se, it's true if the client was configured 
to talk to multiple Kudu masters.


Line 365:   /// @return Default timeout for RPC operations.
Nit: we do distinguish between "operations" and "RPCs". The former are logical 
actions, like "create this new table and wait until it's fully created", while 
the latter are single messages sent to a server, like "Write() sent to a tablet 
server". The distinction is important because an operation may require sending 
multiple RPCs, or retrying some RPCs.

As such, "RPC operations" is confusing. Should probably just drop "operations".


Line 376:   /// by this client.  Check KuduScanner for more details on 
timestamps.
Nit: why the change from "See KuduScanner" to "Check KuduScanner"? Was the 
former unclear or imprecise?


PS9, Line 385: To use this the user must obtain
 :   /// the HybridTime encoded timestamp from the first client with
 :   /// KuduClient::GetLatestObservedTimestamp() and then set it
 :   /// in the new client with this method.
Nit: let's separate this into a new paragraph.


Line 432: class KUDU_EXPORT KuduTableCreator {
The "Required" and "Optional" tags are gone. I think they were useful; can we 
restore them?


Line 438:   /// This method must be called at an object before .
This sentence is unclear, and there's an extra space at the end.


Line 447:   /// Set the schema with which to create next table.
Nit: the subsequent builder methods refer to the table to be created as simply 
"the table". Can we do that here too, instead of "next table"? 

[kudu-CR] master: add assert checks for leader lock

2016-07-18 Thread Adar Dembo (Code Review)
Adar Dembo has submitted this change and it was merged.

Change subject: master: add assert checks for leader_lock
..


master: add assert checks for leader_lock

A side effect of recursive checking in RWMutex is that we can now assert
that a RWMutex is held for reading/writing. Let's add that to the various
catalog manager entry points.

Change-Id: Iefb5762c70192b27490cc71e20568815d18d6ad5
Reviewed-on: http://gerrit.cloudera.org:8080/3642
Tested-by: Kudu Jenkins
Reviewed-by: Dan Burkert 
---
M src/kudu/client/client-test.cc
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/create-table-stress-test.cc
M src/kudu/integration-tests/mini_cluster.cc
M src/kudu/integration-tests/mini_cluster.h
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test-util.h
M src/kudu/master/master.cc
9 files changed, 78 insertions(+), 41 deletions(-)

Approvals:
  Dan Burkert: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/3642
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Iefb5762c70192b27490cc71e20568815d18d6ad5
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] rw mutex: prevent recursive use

2016-07-18 Thread Adar Dembo (Code Review)
Adar Dembo has submitted this change and it was merged.

Change subject: rw_mutex: prevent recursive use
..


rw_mutex: prevent recursive use

Todd provided an example[1] of deadlocked rwlocks due to a fairness policy.
In the example, T1 (holding the lock for reading) join()ed on T2 (trying to
acquire the lock for reading) all while T3 was trying to acquire the lock
for writing. The lock's fairness policy prevented T2 from acquiring the read
lock thus deadlocking all three threads. The takeaway is to be careful when
calling join() while holding locks.

Beyond that, deadlocks can arise if rwlocks are taken recursively. That's
not a feature we need in our rwlocks, so I tried to disable it at the
pthread level. Unfortunately, the best I can do is disable recursive write
lock acquisition; read locks are apparently always recursive (see "man
pthread_rwlockattr_setkind_np"). So instead, I built recursive checking into
the RWMutex itself. It's quite slow so it's only present in debug builds.

Note that pthread rwlocks do have some built-in deadlock detection (i.e.
lock calls may return EDEADLK), but it doesn't appear to be comprehensive.

1. 
https://issues.apache.org/jira/browse/HDFS-2223?focusedCommentId=13097647=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13097647

Change-Id: I7ae30ec123a16c39ef0c15ee2d2176f807df03db
Reviewed-on: http://gerrit.cloudera.org:8080/3641
Tested-by: Kudu Jenkins
Reviewed-by: Dan Burkert 
Reviewed-by: Todd Lipcon 
---
M src/kudu/util/CMakeLists.txt
A src/kudu/util/rw_mutex-test.cc
M src/kudu/util/rw_mutex.cc
M src/kudu/util/rw_mutex.h
4 files changed, 314 insertions(+), 4 deletions(-)

Approvals:
  Dan Burkert: Looks good to me, approved
  Todd Lipcon: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/3641
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I7ae30ec123a16c39ef0c15ee2d2176f807df03db
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] Memory tracking for result tracker

2016-07-18 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: Memory tracking for result tracker
..


Patch Set 18:

(1 comment)

I don't have any more comments except on the tests.

http://gerrit.cloudera.org:8080/#/c/3627/18/src/kudu/rpc/rpc-stress-test.cc
File src/kudu/rpc/rpc-stress-test.cc:

Line 260: SleepFor(MonoDelta::FromMilliseconds(100));
I'm concerned that the sleeps will make these tests flaky. Could you loop them 
a thousand times in TSAN mode?


-- 
To view, visit http://gerrit.cloudera.org:8080/3627
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3b81dda41c8bc7f70380ce426142c34afe6f1625
Gerrit-PatchSet: 18
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] Memory tracking for result tracker

2016-07-15 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: Memory tracking for result tracker
..


Patch Set 13:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/3627/13/src/kudu/rpc/result_tracker.cc
File src/kudu/rpc/result_tracker.cc:

PS13, Line 78:   // Release all the memory for the stuff we'll delete on 
destruction.
 :   for (auto& client_state : clients_) {
 : for (auto& completion_record : 
client_state.second->completion_records) {
 :   
mem_tracker_->Release(completion_record.second->memory_footprint());
 : }
 : 
mem_tracker_->Release(client_state.second->memory_footprint());
 :   }
 :   mem_tracker_->Release(memory_footprint());
Can we aggregate this and make just one Release() call? This isn't a hot path, 
but it seems easy enough to do.


http://gerrit.cloudera.org:8080/#/c/3627/13/src/kudu/rpc/result_tracker.h
File src/kudu/rpc/result_tracker.h:

Line 228: return client_state_map_bytes_;
Typically memory_footprint() methods are supposed to account for the entirety 
of the object, including any nested allocations. Since you're just using it for 
a "shallow" accounting, perhaps call it memory_footprint_excluding_clients, and 
add a comment explaining that?

Ahh, the issue is that the ScopedMemTrackerUpdater expects a method named 
memory_footprint(). How about renaming all of them to 
memory_footprint_shallow() instead, and making it clear via comments which 
"deep" structures they are not accounting for?


Line 254: int64_t memory_footprint() const {
Likewise, memory_footprint_excluding_rpcs.


Line 255:   return kudu_malloc_usable_size(this)
Does this method need to be synchronized in some way? Can the CompletionRecord 
be modified while we're executing here? Or is every access expected to hold 
lock_? If so, maybe add Unlocked() to the name of the method.


Line 281: int64_t memory_footprint() const {
Likewise, memory_footprint_excluding_completion_records.


Line 282:   return kudu_malloc_usable_size(this) + 
completion_record_map_bytes_;
Do we need to synchronize access to completion_record_map_bytes_?


PS13, Line 327:   // Lock that protects access to 'clients_' and to the state 
contained in each
  :   // ClientState.
  :   // TODO consider a per-ClientState lock if we find this too 
coarse grained.
Does this also protect access to client_state_map_bytes_? If so, should it be 
taken in memory_footprint()?


http://gerrit.cloudera.org:8080/#/c/3627/13/src/kudu/rpc/rpc-stress-test.cc
File src/kudu/rpc/rpc-stress-test.cc:

Line 259: usleep(100 * 1000);
Nit: Can we use SleepFor() instead? It's more idiomatic for Kudu, and makes the 
units a little clearer.

Separately, why do we need to sleep here?


http://gerrit.cloudera.org:8080/#/c/3627/9/src/kudu/server/server_base.cc
File src/kudu/server/server_base.cc:

PS9, Line 89: }
: 
: } /
> Done re the id.
Hmm, OK, I guess that's true. Admittedly a separate memtracker is more of an 
issue when its parent is the tablet memtracker, as then a server can have 
hundreds or thousands additional memtrackers, which muddies the memory UI view 
significantly.


http://gerrit.cloudera.org:8080/#/c/3627/13/src/kudu/server/server_base.cc
File src/kudu/server/server_base.cc:

Line 102:   result_tracker_(new 
rpc::ResultTracker(CreateMemTrackerForForResultTracker(mem_tracker_))),
How about:

  result_tracker_(new rpc::ResultTracker(MemTracker::CreateTracker(-1, 
"result-tracker", mem_tracker_))),

Or is that too long?


-- 
To view, visit http://gerrit.cloudera.org:8080/3627
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3b81dda41c8bc7f70380ce426142c34afe6f1625
Gerrit-PatchSet: 13
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] doxygen for C++ client API

2016-07-15 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: doxygen for C++ client API
..


Patch Set 8:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/3619/8/CMakeLists.txt
File CMakeLists.txt:

Line 989: add_custom_target(doxy_install_client_alt_destdir
> Yesterday I tried dependency on 'kudu_client_exported' as well :)  The prob
Interesting. I think the kudu_client_exported problem is due a misuse of 
add_dependencies(), where the gen_version_info target is set as a dependency of 
kudu_util but not kudu_util_exported.

Can you try this patch? It should allow depending on kudu_client_exported 
properly, and it's a good improvement to make regardless.

  diff --git a/src/kudu/util/CMakeLists.txt b/src/kudu/util/CMakeLists.txt
  index a924f38..df1b4c9 100644
  --- a/src/kudu/util/CMakeLists.txt
  +++ b/src/kudu/util/CMakeLists.txt
  @@ -218,10 +218,9 @@ endif()
   ADD_EXPORTABLE_LIBRARY(kudu_util
 SRCS ${UTIL_SRCS}
 DEPS ${UTIL_LIBS}
  +  NONLINK_DEPS gen_version_info
 EXPORTED_DEPS ${EXPORTED_UTIL_LIBS})
 
  -add_dependencies(kudu_util gen_version_info)
  -
   ###
   # kudu_test_util
   ###


Line 992:   WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/src/kudu/client
> Yes, but why to depend on that at all and spend extra-time fixing that if s
Running make out of anywhere but the top-level directory isn't something most 
(if any) of us do, so I'm not sure how much I trust it. Plus, it means another 
place to update if we wanted to move the client code (or the client's install 
code) to a different directory. In that sense, the target abstracts away the 
filesystem layout, and that's a good thing.


-- 
To view, visit http://gerrit.cloudera.org:8080/3619
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie7d42fb1c90b83074e357dcecf42489ed9fc4f02
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dinesh Bhat 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-HasComments: Yes


[kudu-CR] doxygen for C++ client API

2016-07-15 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: doxygen for C++ client API
..


Patch Set 8:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/3619/8/CMakeLists.txt
File CMakeLists.txt:

Line 983: set(DOXY_SUBDIR ${CMAKE_CURRENT_BINARY_DIR}/docs/doxygen)
Could you first log what the location of doxygen is? Maybe logging 
DOXYGEN_EXECUTABLE is sufficient, not sure.


Line 989: add_custom_target(doxy_install_client_alt_destdir
What if doxy_install_client_alt_destdir depended on 'all'? Then could you drop 
the explicit "make all" step? I think that would be better, as it'd feed more 
dependency information into the generated makefiles ("make all install" as a 
command doesn't do that). Also, I don't think you'd need the remove_directory 
commands then either.


Line 990:   COMMAND ${CMAKE_COMMAND} -E remove_directory 
${DOXY_CLIENT_DESTDIR}
Why do we need to remove the old directory first?


Line 992:   WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/src/kudu/client
Why does the working directory need to be modified for this command?


Line 997:   COMMAND ${CMAKE_COMMAND} -E remove_directory 
${DOXY_CLIENT_DESTDIR}
Why this?


http://gerrit.cloudera.org:8080/#/c/3619/8/thirdparty/download-thirdparty.sh
File thirdparty/download-thirdparty.sh:

PS8, Line 63: if [[ "$FILENAME" =~ \.zip$ ]]; then
:   if ! unzip -q "$FILENAME"; then
: echo "Error unzipping $FILENAME, removing file"
: rm "$FILENAME"
: continue
:   fi
: elif [[ "$FILENAME" =~ \.(tar\.gz|tgz)$ ]]; then
Are these changes still useful? Or can they be reverted? They seem cosmetic to 
me, but maybe I'm missing something.


-- 
To view, visit http://gerrit.cloudera.org:8080/3619
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie7d42fb1c90b83074e357dcecf42489ed9fc4f02
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dinesh Bhat 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-HasComments: Yes


[kudu-CR] Memory tracking for result tracker

2016-07-14 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: Memory tracking for result tracker
..


Patch Set 9:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/3627/9/src/kudu/rpc/result_tracker.cc
File src/kudu/rpc/result_tracker.cc:

Line 75: mem_tracker_->Consume(sizeof(ClientState));
If we want to make ClientState memory tracking accurate, we need to account for 
two additional things:

1. Malloc "slop". sizeof(ClientState) may not account for the actual size of 
the memory allocation used in "new ClientState()". For example, if the size of 
the struct is 40 bytes, the allocation may be the next power of 2 (i.e. 64 
bytes). We deal with this elsewhere by using kudu_malloc_usable_size(ptr) as 
the "size of object"; sometimes this is internalized inside a 
memory_footprint() method.

2. The size of any nested heap pointers. ClientState contains a map, and that 
map contains nested pointers whose memory consumption isn't being accounted 
for. Look at how Schema::name_to_index_ is managed to see how you might do that.


Line 85: mem_tracker_->Consume(sizeof(CompletionRecord));
Same issues with CompletionRecord and its nested vector of OngoingRpcs (you can 
use kudu_malloc_usable_size(vector.data()), though you  need to check that 
vector.capacity() >0 first).

I think you're already taking care of the nested protobuf elsewhere.


Line 234:   mem_tracker_->Consume(response->ByteSize());
Should use SpaceUsed(), I think.


Line 255:   mem_tracker_->Release(sizeof(OnGoingRpcInfo));
We've found that Release() in a loop is generally less efficient than adding up 
all the memory consumed in the loop and making one call to Release() at the 
end. That change reduced block manager CPU consumption pretty dramatically on 
startup.


http://gerrit.cloudera.org:8080/#/c/3627/9/src/kudu/server/server_base.cc
File src/kudu/server/server_base.cc:

PS9, Line 89:   if (id != 0) {
: StrAppend(_str, " ", id);
:   }
Don't need this; each of these memtrackers is already "disambiguated" by virtue 
of having a unique parent.

That said, why bother creating a memtracker for the ResultTracker instead of 
just reusing the server memtracker? IIRC we only create separate memtrackers 
for objects that come and go (e.g. tablet, memrowset, deltamemstore).


-- 
To view, visit http://gerrit.cloudera.org:8080/3627
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3b81dda41c8bc7f70380ce426142c34afe6f1625
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] rw mutex: prevent recursive use

2016-07-14 Thread Adar Dembo (Code Review)
Hello Dan Burkert,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3641

to look at the new patch set (#6).

Change subject: rw_mutex: prevent recursive use
..

rw_mutex: prevent recursive use

Todd provided an example[1] of deadlocked rwlocks due to a fairness policy.
In the example, T1 (holding the lock for reading) join()ed on T2 (trying to
acquire the lock for reading) all while T3 was trying to acquire the lock
for writing. The lock's fairness policy prevented T2 from acquiring the read
lock thus deadlocking all three threads. The takeaway is to be careful when
calling join() while holding locks.

Beyond that, deadlocks can arise if rwlocks are taken recursively. That's
not a feature we need in our rwlocks, so I tried to disable it at the
pthread level. Unfortunately, the best I can do is disable recursive write
lock acquisition; read locks are apparently always recursive (see "man
pthread_rwlockattr_setkind_np"). So instead, I built recursive checking into
the RWMutex itself. It's quite slow so it's only present in debug builds.

Note that pthread rwlocks do have some built-in deadlock detection (i.e.
lock calls may return EDEADLK), but it doesn't appear to be comprehensive.

1. 
https://issues.apache.org/jira/browse/HDFS-2223?focusedCommentId=13097647=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13097647

Change-Id: I7ae30ec123a16c39ef0c15ee2d2176f807df03db
---
M src/kudu/util/CMakeLists.txt
A src/kudu/util/rw_mutex-test.cc
M src/kudu/util/rw_mutex.cc
M src/kudu/util/rw_mutex.h
4 files changed, 314 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/41/3641/6
-- 
To view, visit http://gerrit.cloudera.org:8080/3641
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7ae30ec123a16c39ef0c15ee2d2176f807df03db
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1530: Update docs about OS X build dependency on Xcode package

2016-07-14 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1530: Update docs about OS X build dependency on Xcode 
package
..


Patch Set 3: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/3653
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8962d5539f437dba8b120c70c90c1e384ed550c9
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Dinesh Bhat 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: No


[kudu-CR] Move the maintenance manager to util

2016-07-14 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: Move the maintenance manager to util
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3656/3/src/kudu/util/maintenance_manager.h
File src/kudu/util/maintenance_manager.h:

Line 243:   void GetMaintenanceManagerStatusDump(MaintenanceManagerStatusPB* 
out_pb);
Since this is a pointer, you don't even need to include 
maintenance_manager.pb.h; you could forward declare this instead.


-- 
To view, visit http://gerrit.cloudera.org:8080/3656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ifcf072d443ac3d069bda15b9dc0f8c442b9ac5c0
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1311 [master] support adding and dropping range partitions

2016-07-14 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1311 [master] support adding and dropping range partitions
..


Patch Set 4:

(37 comments)

http://gerrit.cloudera.org:8080/#/c/3648/4/src/kudu/client/client.cc
File src/kudu/client/client.cc:

PS4, Line 892: 
RETURN_NOT_OK(data_->client_->data_->WaitForAlterTableToFinish(
 : data_->client_, alter_name, deadline));
If this fails, don't we want to clear the meta cache anyway? Maybe ClearCache() 
should be set up as a ScopedCleanup thing right after L888-889.


Line 908: // It is sufficient to clear that cache even if the table 
alteration is not
Nit: "the cache".

Also, not sure about "sufficient"; it implies that there was something more we 
could do but have chosen not to. Maybe you meant "necessary"?


http://gerrit.cloudera.org:8080/#/c/3648/4/src/kudu/client/client.h
File src/kudu/client/client.h:

PS4, Line 537: unless the existing range partitions are dropped first
What happens if a single KuduTableAlterer has DropRange and AddRange on the 
same range? Is that legal, provided they're in that order?


Line 541:   // this method returns, however other existing clients may have to 
wait for a
Don't you mean "when Alter() returns success"?


http://gerrit.cloudera.org:8080/#/c/3648/4/src/kudu/client/meta_cache.cc
File src/kudu/client/meta_cache.cc:

Line 19: 
What were the include and using changes for?


http://gerrit.cloudera.org:8080/#/c/3648/4/src/kudu/client/table_alterer-internal.h
File src/kudu/client/table_alterer-internal.h:

Line 57:   struct PartitioningStep {
I think the code would be net less complex if PartitionStep and Step were 
combined. The overhead of two unique_ptrs per Step for non-partitioning steps 
is minimal, and it'll simplify ToRequest() somewhat.


http://gerrit.cloudera.org:8080/#/c/3648/4/src/kudu/integration-tests/alter_table-randomized-test.cc
File src/kudu/integration-tests/alter_table-randomized-test.cc:

Line 363: LOG(INFO) << "Adding column " << name;
Were these LOG statements just for debugging or do you want to keep them?


Line 459:   scanner.SetTimeoutMillis(6);
What's the significance of this value?


Line 510:   vector master_addrs_;
Unused?


PS4, Line 547: } else if (r < 850 && t.num_range_partitions() < 
kMaxRangePartitions) {
 :   t.AddRangePartition();
 : } else if (r < 900) {
 :   t.DropRangePartition();
It would be cool if this also tested "compound" operations, i.e. an 
AlterTable() that adds a partition, drops another, and adds a column too.


http://gerrit.cloudera.org:8080/#/c/3648/4/src/kudu/integration-tests/alter_table-test.cc
File src/kudu/integration-tests/alter_table-test.cc:

Line 27: #include "kudu/client/client.h"
Nit: the previous sort order was more correct.


Line 36: #include "kudu/master/master-test-util.h"
Nit: likewise, this was more correct before.


Line 463:   session->SetTimeoutMillis(15 * 1000);
Why this value?


Line 468: gscoped_ptr insert(table->NewInsert());
Use unique_ptr here?


PS4, Line 470: if (table->schema().num_columns() > 1) {
 :   RETURN_NOT_OK(insert->mutable_row()->SetInt32(1, i));
 : }
The assumption being that every column is going to be an Int32?


Line 550: int AlterTableTest::CountRows(const string& table_name) {
Can you reuse CountTableRows() from client-test-util.h? There may be some other 
useful goodies there.


Line 1064: TEST_F(AlterTableTest, TestAlterRangePartitioning) {
How about a test case for one AlterTable() adding and dropping the same 
partition?

Also, what about negative test cases? Like dropping partitions that don't 
exist, adding partitions that overlap with existing ones, adding the same 
partition twice, etc.


Line 1065:   gscoped_ptr table_alterer;
unique_ptr?


PS4, Line 1091:   table_alterer->wait(true);
  :   
ASSERT_OK(table_alterer->AddRangePartition(add_lower.release(), 
add_upper.release())->Alter());
Combine?


PS4, Line 1097:   ASSERT_OK(InsertRowsSequential(table_name, 0, 100));
  :   ASSERT_EQ(100, CountRows(table_name));
This is to test that IsAlterTableInProgress()->false actually means we can 
insert/scan?


http://gerrit.cloudera.org:8080/#/c/3648/4/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

PS4, Line 1215:   Schema schema;
  :   
RETURN_NOT_OK(SchemaFromPB(table->metadata().state().pb.schema(), ));
  :   PartitionSchema partition_schema;
  :   
RETURN_NOT_OK(PartitionSchema::FromPB(table->metadata().state().pb.partition_schema(),
  : schema, 
_schema));
This isn't safe; you should be getting the schema() through the COWLocked 
object.


PS4, Line 1223: std::lock_guard l(table->lock_);
  : tablets = table->tablet_map_;
This is why you needed to 

[kudu-CR] Move the maintenance manager to util

2016-07-14 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: Move the maintenance manager to util
..


Patch Set 1:

(15 comments)

What's the motivation for this move?

http://gerrit.cloudera.org:8080/#/c/3656/1//COMMIT_MSG
Commit Message:

Line 10: It doesn't change the namespace (kudu::) since that would be more 
involved.
But isn't kudu:: the right namespace for something in util?


http://gerrit.cloudera.org:8080/#/c/3656/1/src/kudu/integration-tests/full_stack-insert-scan-test.cc
File src/kudu/integration-tests/full_stack-insert-scan-test.cc:

Line 40: #include "kudu/util/maintenance_manager.h"
Nit: resort.


http://gerrit.cloudera.org:8080/#/c/3656/1/src/kudu/master/master.cc
File src/kudu/master/master.cc:

Line 39: #include "kudu/util/maintenance_manager.h"
Nit: resort.


http://gerrit.cloudera.org:8080/#/c/3656/1/src/kudu/tablet/CMakeLists.txt
File src/kudu/tablet/CMakeLists.txt:

Line 36: ../util/maintenance_manager.cc
What's this doing here? The maintenance manager is now part of libkudu_util, so 
shouldn't this just be removed?


Line 101: ADD_KUDU_TEST(maintenance_manager-test)
The test should be moved too.


http://gerrit.cloudera.org:8080/#/c/3656/1/src/kudu/tablet/tablet.cc
File src/kudu/tablet/tablet.cc:

Line 48: #include "kudu/util/maintenance_manager.h"
Nit: resort.


http://gerrit.cloudera.org:8080/#/c/3656/1/src/kudu/tablet/tablet_peer-test.cc
File src/kudu/tablet/tablet_peer-test.cc:

Line 35: #include "kudu/util/maintenance_manager.h"
Nit: resort.


http://gerrit.cloudera.org:8080/#/c/3656/1/src/kudu/tablet/tablet_peer_mm_ops.cc
File src/kudu/tablet/tablet_peer_mm_ops.cc:

Line 27: #include "kudu/util/maintenance_manager.h"
Nit: resort.


http://gerrit.cloudera.org:8080/#/c/3656/1/src/kudu/tserver/mini_tablet_server.cc
File src/kudu/tserver/mini_tablet_server.cc:

Line 34: #include "kudu/util/maintenance_manager.h"
Nit: resort.


http://gerrit.cloudera.org:8080/#/c/3656/1/src/kudu/tserver/tablet_server-test-base.h
File src/kudu/tserver/tablet_server-test-base.h:

Line 43: #include "kudu/util/maintenance_manager.h"
Nit: resort.


http://gerrit.cloudera.org:8080/#/c/3656/1/src/kudu/tserver/tablet_server.cc
File src/kudu/tserver/tablet_server.cc:

Line 30: #include "kudu/util/maintenance_manager.h"
Nit: resort.


http://gerrit.cloudera.org:8080/#/c/3656/1/src/kudu/tserver/tserver-path-handlers.cc
File src/kudu/tserver/tserver-path-handlers.cc:

Line 35: #include "kudu/util/maintenance_manager.h"
Nit: resort.


http://gerrit.cloudera.org:8080/#/c/3656/1/src/kudu/util/maintenance_manager-test.cc
File src/kudu/util/maintenance_manager-test.cc:

Line 25: #include "maintenance_manager.h"
Should be kudu/util, and resort.


http://gerrit.cloudera.org:8080/#/c/3656/1/src/kudu/util/maintenance_manager.cc
File src/kudu/util/maintenance_manager.cc:

Line 18: #include "maintenance_manager.h"
Nit: kudu/util.


http://gerrit.cloudera.org:8080/#/c/3656/1/src/kudu/util/maintenance_manager.h
File src/kudu/util/maintenance_manager.h:

Line 17: #pragma once
Nit: separate from the license with an empty line.


-- 
To view, visit http://gerrit.cloudera.org:8080/3656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ifcf072d443ac3d069bda15b9dc0f8c442b9ac5c0
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1358 (part 3): new multi-master stress test

2016-07-14 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1358 (part 3): new multi-master stress test
..


Patch Set 9: Verified+1

chrpath() failures.

-- 
To view, visit http://gerrit.cloudera.org:8080/3611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I40b5b78c100a7b427b2f4aac3a54665e82a9618c
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] rw mutex: prevent recursive use

2016-07-14 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: rw_mutex: prevent recursive use
..


Patch Set 5: Verified+1

KUDU-1527 and isolate failures.

-- 
To view, visit http://gerrit.cloudera.org:8080/3641
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7ae30ec123a16c39ef0c15ee2d2176f807df03db
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] KUDU-1358 (part 2): heartbeat to every master

2016-07-14 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1358 (part 2): heartbeat to every master
..


Patch Set 9: Verified+1

Isolate failed.

-- 
To view, visit http://gerrit.cloudera.org:8080/3610
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic85ac4193462d21c989dbd7874b451e8eaab8e3e
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] KUDU-1358 (part 1): master should accept heartbeat even if follower

2016-07-14 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1358 (part 1): master should accept heartbeat even if 
follower
..


Patch Set 9: Verified+1

chrpath() failed.

-- 
To view, visit http://gerrit.cloudera.org:8080/3609
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I578674927b65b4171e8437de8515130e4a0ed139
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] master: do not delete unknown tablets

2016-07-14 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: master: do not delete unknown tablets
..


Patch Set 5: Verified+1

More isolate and chrpath() failures.

-- 
To view, visit http://gerrit.cloudera.org:8080/3645
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I331f2d5bb06c38daa7b09854dbb24a7881723551
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-HasComments: No


[kudu-CR] KUDU-1374: send full tablet report when new leader master is detected

2016-07-14 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1374: send full tablet report when new leader master is 
detected
..


Patch Set 5: Verified+1

More chrpath() and isolate failures.

-- 
To view, visit http://gerrit.cloudera.org:8080/3643
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic16fc46736476dba39616e79ecfe79eee48b3d7f
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: No


[kudu-CR] KUDU-1358 (part 2): heartbeat to every master

2016-07-14 Thread Adar Dembo (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3610

to look at the new patch set (#9).

Change subject: KUDU-1358 (part 2): heartbeat to every master
..

KUDU-1358 (part 2): heartbeat to every master

Now that followers accept heartbeats, let's modify the tserver to send one
to every master. Spawning a heartbeater thread for each master seemed like
the natural way to do this; it should simplify dynamic master changes in the
future (i.e. just add or remove threads as needed).

The "dirty tablet" state is now encapsulated in the heartbeater threads
themselves, and the heartbeater must "fan out" to manipulate all of it. It's
a little noisy but I think it's reasonable. The alternative is for this
state to remain in the TSTabletManager, for the heartbeater to continue
tracking which master is the leader, and for it to only send tablet reports
to that master. This can be done with a few changes (e.g. adding term
numbers to the heartbeat response), but the only benefit is reduced network
traffic when tablets are dirty, so that didn't seem worth the complexity.

There's no new test here, but this code path is exercised in the test I
reenabled, and in the new stress test (follow-on patch).

Change-Id: Ic85ac4193462d21c989dbd7874b451e8eaab8e3e
---
M src/kudu/integration-tests/master_failover-itest.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/tserver/heartbeater.cc
M src/kudu/tserver/heartbeater.h
M src/kudu/tserver/ts_tablet_manager-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
7 files changed, 311 insertions(+), 263 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/10/3610/9
-- 
To view, visit http://gerrit.cloudera.org:8080/3610
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic85ac4193462d21c989dbd7874b451e8eaab8e3e
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] master: add read-write lock to serialize operations around elections

2016-07-14 Thread Adar Dembo (Code Review)
Adar Dembo has submitted this change and it was merged.

Change subject: master: add read-write lock to serialize operations around 
elections
..


master: add read-write lock to serialize operations around elections

This rigmarole began with an investigation into a test failure [1], which
led to a new integration test that hammers VisitTablesAndTablets() while
creating tables. That test revealed other locking issues, which brings us
to where we are now.

This patch introduces a read-write lock to serialize all master operations
so that they fall on one side or the other of a leader election. The idea
is to avoid performing operations concurrently with a reload of the master
metadata; doing so can lead to problems in Shutdown() and (very rarely,
perhaps only conceptually) to inconsistent on-disk state.

I was hoping this lock could replace the fencing done by leader_ready_term_,
but eventually reasoned that we need both; without leader_ready_term_
fencing, the master's consensus state machine could fool an operation into
thinking the master became the leader before the metadata was reloaded.

Three other things of note here:
- The new lock is acquired via TryLock() so that, if the lock could not be
  acquired, the RPC will fail rather than block. A future patch modifies
  TSHeartbeat() to partially accept heartbeats even if the master is a
  follower; TryLock() means that a transitioning leader that is pelted with
  RPCs won't fill up its service queue and can still process heartbeats.
- TableInfo's AddTask() and RemoveTask() methods now don't hold the table's
  lock when adding and removing refs from the task respectively. This is
  the fix for the original test failure.
- When reloading metadata, we now abort all outstanding table tasks to
  avoid orphaning them.

1. 
http://dist-test.cloudera.org:8080/diagnose?key=224b3aa2-3c87-11e6-9a09-0242ac110001

Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8
Reviewed-on: http://gerrit.cloudera.org:8080/3550
Reviewed-by: Dan Burkert 
Tested-by: Adar Dembo 
---
M src/kudu/integration-tests/create-table-stress-test.cc
M src/kudu/integration-tests/mini_cluster.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master.cc
M src/kudu/master/master_service.cc
6 files changed, 334 insertions(+), 144 deletions(-)

Approvals:
  Dan Burkert: Looks good to me, approved
  Adar Dembo: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/3550
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] master: fix corruption when AlterTable() races with CreateTable()

2016-07-14 Thread Adar Dembo (Code Review)
Adar Dembo has submitted this change and it was merged.

Change subject: master: fix corruption when AlterTable() races with 
CreateTable()
..


master: fix corruption when AlterTable() races with CreateTable()

Admittedly, this is a contrived scenario:
1. T1 tries to create table with name 'foo'
2. T2 tries to rename table with name 'bar' to 'foo'

With just the right timing, both operations succeed and the metadata now has
two tables named 'foo', each with a different table ID. The fix is simple:
generalize the "tables being created" logic already used by CreateTable().

Without the fix, the new test failed every 50th run or so. With it, it
doesn't fail in 1000 runs.

Change-Id: I6c9e4214c09bc47a5a10b12d6ffe8b35906708c9
Reviewed-on: http://gerrit.cloudera.org:8080/3607
Reviewed-by: Dan Burkert 
Tested-by: Adar Dembo 
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
3 files changed, 231 insertions(+), 30 deletions(-)

Approvals:
  Dan Burkert: Looks good to me, approved
  Adar Dembo: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/3607
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I6c9e4214c09bc47a5a10b12d6ffe8b35906708c9
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] master: fix corruption when AlterTable() races with CreateTable()

2016-07-14 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: master: fix corruption when AlterTable() races with 
CreateTable()
..


Patch Set 7: Verified+1

Isolate crashed.

-- 
To view, visit http://gerrit.cloudera.org:8080/3607
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6c9e4214c09bc47a5a10b12d6ffe8b35906708c9
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] KUDU-1358 (part 3): new multi-master stress test

2016-07-14 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1358 (part 3): new multi-master stress test
..


Patch Set 8: Verified+1

Another instance of KUDU-1527.

-- 
To view, visit http://gerrit.cloudera.org:8080/3611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I40b5b78c100a7b427b2f4aac3a54665e82a9618c
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] c++ client: various fixes to DDL operations

2016-07-14 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: c++ client: various fixes to DDL operations
..


Patch Set 8: Verified+1

A whole slew of chrpath() failures in dist-test.

-- 
To view, visit http://gerrit.cloudera.org:8080/3608
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I09768240bd04cca95d95aefe17c34d276075125b
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] master: add assert checks for leader lock

2016-07-14 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: master: add assert checks for leader_lock
..


Patch Set 4: Verified+1

Another chrpath() failure in dist-test.

-- 
To view, visit http://gerrit.cloudera.org:8080/3642
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iefb5762c70192b27490cc71e20568815d18d6ad5
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] rw mutex: prevent recursive use

2016-07-14 Thread Adar Dembo (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3641

to look at the new patch set (#4).

Change subject: rw_mutex: prevent recursive use
..

rw_mutex: prevent recursive use

Todd provided an example[1] of deadlocked rwlocks due to a fairness policy.
In the example, T1 (holding the lock for reading) join()ed on T2 (trying to
acquire the lock for reading) all while T3 was trying to acquire the lock
for writing. The lock's fairness policy prevented T2 from acquiring the read
lock thus deadlocking all three threads. The takeaway is to be careful when
calling join() while holding locks.

Beyond that, deadlocks can arise if rwlocks are taken recursively. That's
not a feature we need in our rwlocks, so I tried to disable it at the
pthread level. Unfortunately, the best I can do is disable recursive write
lock acquisition; read locks are apparently always recursive (see "man
pthread_rwlockattr_setkind_np"). So instead, I built recursive checking into
the RWMutex itself. It's quite slow so it's only present in debug builds.

Note that pthread rwlocks do have some built-in deadlock detection (i.e.
lock calls may return EDEADLK), but it doesn't appear to be comprehensive.

1. 
https://issues.apache.org/jira/browse/HDFS-2223?focusedCommentId=13097647=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13097647

Change-Id: I7ae30ec123a16c39ef0c15ee2d2176f807df03db
---
M src/kudu/util/CMakeLists.txt
A src/kudu/util/rw_mutex-test.cc
M src/kudu/util/rw_mutex.cc
M src/kudu/util/rw_mutex.h
4 files changed, 297 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/41/3641/4
-- 
To view, visit http://gerrit.cloudera.org:8080/3641
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7ae30ec123a16c39ef0c15ee2d2176f807df03db
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] rw mutex: prevent recursive use

2016-07-14 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: rw_mutex: prevent recursive use
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3641/1/src/kudu/util/rw_mutex.h
File src/kudu/util/rw_mutex.h:

Line 85: I_AM_NEITHER,
> You can just call these 'NEITHER', 'READER', and 'WRITER', there is no clas
I'll change it, but what could the new values clash with in the first place? 
The Priority enum values are PREFER_READING and PREFER_WRITING.


-- 
To view, visit http://gerrit.cloudera.org:8080/3641
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I7ae30ec123a16c39ef0c15ee2d2176f807df03db
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] WIP [java-client] Re-enable multi-master tests

2016-07-14 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: WIP [java-client] Re-enable multi-master tests
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/3654/1/java/kudu-client/src/test/java/org/kududb/client/MiniKuduCluster.java
File java/kudu-client/src/test/java/org/kududb/client/MiniKuduCluster.java:

Line 198:   "--raft_heartbeat_interval_ms=200");
Why this change?


http://gerrit.cloudera.org:8080/#/c/3654/1/java/kudu-client/src/test/java/org/kududb/client/TestMasterFailover.java
File java/kudu-client/src/test/java/org/kududb/client/TestMasterFailover.java:

PS1, Line 42:   /**
:* This test is disabled as we're not supporting multi-master 
just yet.
:*/
Remove this.


-- 
To view, visit http://gerrit.cloudera.org:8080/3654
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia1051222738c84ef3d3e1a33b4981bc9454b7972
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes


[kudu-CR] [c++-client]: cache non-covering ranges in meta cache

2016-07-14 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: [c++-client]: cache non-covering ranges in meta cache
..


Patch Set 5:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/3581/5/src/kudu/client/client-test.cc
File src/kudu/client/client-test.cc:

PS5, Line 1223:   int original_table_locations_ttl_ms = 
FLAGS_table_locations_ttl_ms;
  :   FLAGS_table_locations_ttl_ms = new_ttl;
  :   auto cleanup = MakeScopedCleanup([&] () {
  :   FLAGS_table_locations_ttl_ms = 
original_table_locations_ttl_ms;
  :   });
You can instantiate a FlagSaver instead.


Line 1238:   ASSERT_FALSE(entry.stale());
The TTL is 25ms, so it's conceivable that on a bogged down test environment the 
entry may become stale between the time that it is inserted and the time that 
stale() is called.

Could you loop this test 1000 times with TSAN and see if it's indeed flaky? 
Alternatively we could raise the TTL but that just means a longer sleep below. 
Or perhaps this (and L1246-1247) should be done in a loop and accepted provided 
one iteration passes.


http://gerrit.cloudera.org:8080/#/c/3581/5/src/kudu/client/client.h
File src/kudu/client/client.h:

Line 300:   FRIEND_TEST(ClientTest, TestMetaCacheExpiry);
Nit: sort alphabetically.


http://gerrit.cloudera.org:8080/#/c/3581/5/src/kudu/client/meta_cache.cc
File src/kudu/client/meta_cache.cc:

PS5, Line 753:   if (new_status.IsServiceUnavailable()) {
 : // One or more of the tablets is not running; retry after a 
backoff period.
 : mutable_retrier()->DelayedRetry(this, new_status);
 : ignore_result(delete_me.release());
 : return;
 :   }
To be clear, this has nothing to do with non-covering range partition support, 
right? This should have been added as part of 30d76a3? As an analogue of 
1a97c42?


Line 790:   if (rpc.resp().tablet_locations().empty()) {
rpc.resp().tablet_locations() is called four times, maybe pull it out into a 
cref local variable?


Line 803: string last_upper_bound =
If possible, could you add an ASCII art diagram to this section that will 
visually map tablets (or non-covered ranges) to sections of code that handle 
them?

For example:

  // ab cbc   b
  //   +--+   +-+  +-+
  //   |  |   | |  | |
  //   +--+   +-+  +-+

And then in the code comments, add references to a), b), c) etc. as needed.


PS5, Line 817: tablet.partition().partition_key_start()
Can you pull this (and end) into local variables?


PS5, Line 848: FindOrDie(tablets_by_key, 
tablet.partition().partition_key_start());
This assumes that a tablet can't change in partition size (i.e. that should 
mean a new tablet), right? Since the partition information comes off the wire, 
it might be better to handle that weird state gracefully.

Alternatively, if tablets_by_id_ values were MetaCacheEntries instead of 
RemoteTablets, you'd have the TTL right there. But then you'd need to share 
MetaCacheEntries amongst the two maps (using shared ownership); not sure if 
that's net less complexity or not.


PS5, Line 850: entry.upper_bound_partition_key() == 
tablet.partition().partition_key_end());
This part also depends on well-formed information off the wire.


PS5, Line 865:   MetaCacheEntry entry(expiration_time, remote);
 :   VLOG(3) << "Caching '" << rpc.table_name() << "' entry " 
<< entry.DebugString(rpc.table());
 : 
 :   InsertOrDie(_by_id_, tablet_id, remote);
 :   InsertOrDie(_by_key, 
partition.partition_key_start(), std::move(entry));
Maybe encapsulate this (and the non-covered range variant) in a helper method?


PS5, Line 874:   // There is a non-covered range between the last tablet 
upper bound and
 :   // the end of the partition key space.
What if the total number of tablets is a multiple of 
MAX_RETURNED_TABLE_LOCATIONS and we just did a lookup for the last batch of 
tablets? There may be a non-covered range at the end, but we won't insert an 
entry for it here. Is that a problem?


Line 885:   *cache_entry = *DCHECK_NOTNULL(FindFloorOrNull(tablets_by_key, 
rpc.partition_key()));
Why FindFloorOrNull if you DCHECK_NOTNULL() right after? Seems unsafe.


http://gerrit.cloudera.org:8080/#/c/3581/1/src/kudu/client/meta_cache.h
File src/kudu/client/meta_cache.h:

Line 283:   const scoped_refptr& tablet() const {
> That wouldn't type check.  I'm not seeing a warning with clang or gcc 4.9, 
Sure, or DCHECK(tablet_.get())


http://gerrit.cloudera.org:8080/#/c/3581/5/src/kudu/client/table-internal.cc
File src/kudu/client/table-internal.cc:

PS5, Line 133: if (s.ok()) {
 :   break;
 : } else {
 :   LOG(WARNING) << "Error getting table locations: " << 
s.ToString() << ", retrying.";
 : }

[kudu-CR] rw mutex: prevent recursive use

2016-07-14 Thread Adar Dembo (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3641

to look at the new patch set (#3).

Change subject: rw_mutex: prevent recursive use
..

rw_mutex: prevent recursive use

Todd provided an example[1] of deadlocked rwlocks due to a fairness policy.
In the example, T1 (holding the lock for reading) join()ed on T2 (trying to
acquire the lock for reading) all while T3 was trying to acquire the lock
for writing. The lock's fairness policy prevented T2 from acquiring the read
lock thus deadlocking all three threads. The takeaway is to be careful when
calling join() while holding locks.

Beyond that, deadlocks can arise if rwlocks are taken recursively. That's
not a feature we need in our rwlocks, so I tried to disable it at the
pthread level. Unfortunately, the best I can do is disable recursive write
lock acquisition; read locks are apparently always recursive (see "man
pthread_rwlockattr_setkind_np"). So instead, I built recursive checking into
the RWMutex itself. It's quite slow so it's only present in debug builds.

Note that pthread rwlocks do have some built-in deadlock detection (i.e.
lock calls may return EDEADLK), but it doesn't appear to be comprehensive.

1. 
https://issues.apache.org/jira/browse/HDFS-2223?focusedCommentId=13097647=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13097647

Change-Id: I7ae30ec123a16c39ef0c15ee2d2176f807df03db
---
M src/kudu/util/CMakeLists.txt
A src/kudu/util/rw_mutex-test.cc
M src/kudu/util/rw_mutex.cc
M src/kudu/util/rw_mutex.h
4 files changed, 297 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/41/3641/3
-- 
To view, visit http://gerrit.cloudera.org:8080/3641
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7ae30ec123a16c39ef0c15ee2d2176f807df03db
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] rw mutex: add configurable priority

2016-07-14 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: rw_mutex: add configurable priority
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3603/4/src/kudu/util/rw_mutex.h
File src/kudu/util/rw_mutex.h:

Line 28: // Implemented as a thin wrapper around pthread_rwlock_t.
> A note about reentrance is warranted I think.
I added one in the follow-on patch (the one that bans recursive acquisition).


-- 
To view, visit http://gerrit.cloudera.org:8080/3603
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I16ba6cd041f126c94e63fa07a1e84c88db6778d7
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] env: add GetFileSizeOnDiskRecursively

2016-07-13 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: env: add GetFileSizeOnDiskRecursively
..


Patch Set 2: Verified+1

Unrelated flake in ClientStressTest.StartScans.

-- 
To view, visit http://gerrit.cloudera.org:8080/3644
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic49629595e776ce5c755e15b04c6509053ff361f
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-HasComments: No


[kudu-CR] KUDU-1358 (part 1): master should accept heartbeat even if follower

2016-07-13 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1358 (part 1): master should accept heartbeat even if 
follower
..


Patch Set 6: Verified+1

Two isolate failures and one chrpath failure.

-- 
To view, visit http://gerrit.cloudera.org:8080/3609
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I578674927b65b4171e8437de8515130e4a0ed139
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] master: do not delete unknown tablets

2016-07-13 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: master: do not delete unknown tablets
..


Patch Set 2: Verified+1

One run failed from the same dist-failure as 
http://gerrit.cloudera.org:8080/3610/6. Another failed when the isolate server 
started returning code 500.

-- 
To view, visit http://gerrit.cloudera.org:8080/3645
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I331f2d5bb06c38daa7b09854dbb24a7881723551
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-HasComments: No


[kudu-CR] KUDU-1358 (part 2): heartbeat to every master

2016-07-13 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1358 (part 2): heartbeat to every master
..


Patch Set 6: Verified+1

Looks like I got hit by some sort of sporadic dist-test failure:

  open: Text file busy
  elf_open: Invalid argument
  Traceback (most recent call last):
File "../../build-support/run_dist_test.py", line 138, in 
  main()
File "../../build-support/run_dist_test.py", line 120, in main
  fixup_rpaths(os.path.join(ROOT, "thirdparty"))
File "../../build-support/run_dist_test.py", line 90, in fixup_rpaths
  fix_rpath(p)
File "../../build-support/run_dist_test.py", line 79, in fix_rpath
  subprocess.check_call(["chrpath", "-r", new_path, path])
File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
  raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['chrpath', '-r', '$ORIGIN/../lib', 
'/tmp/run_tha_testAzqv38/thirdparty/installed/bin/llvm-symbolizer']' returned 
non-zero exit status 1

-- 
To view, visit http://gerrit.cloudera.org:8080/3610
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic85ac4193462d21c989dbd7874b451e8eaab8e3e
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] master: add read-write lock to serialize operations around elections

2016-07-13 Thread Adar Dembo (Code Review)
Hello Dan Burkert, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3550

to look at the new patch set (#8).

Change subject: master: add read-write lock to serialize operations around 
elections
..

master: add read-write lock to serialize operations around elections

This rigmarole began with an investigation into a test failure [1], which
led to a new integration test that hammers VisitTablesAndTablets() while
creating tables. That test revealed other locking issues, which brings us
to where we are now.

This patch introduces a read-write lock to serialize all master operations
so that they fall on one side or the other of a leader election. The idea
is to avoid performing operations concurrently with a reload of the master
metadata; doing so can lead to problems in Shutdown() and (very rarely,
perhaps only conceptually) to inconsistent on-disk state.

I was hoping this lock could replace the fencing done by leader_ready_term_,
but eventually reasoned that we need both; without leader_ready_term_
fencing, the master's consensus state machine could fool an operation into
thinking the master became the leader before the metadata was reloaded.

Three other things of note here:
- The new lock is acquired via TryLock() so that, if the lock could not be
  acquired, the RPC will fail rather than block. A future patch modifies
  TSHeartbeat() to partially accept heartbeats even if the master is a
  follower; TryLock() means that a transitioning leader that is pelted with
  RPCs won't fill up its service queue and can still process heartbeats.
- TableInfo's AddTask() and RemoveTask() methods now don't hold the table's
  lock when adding and removing refs from the task respectively. This is
  the fix for the original test failure.
- When reloading metadata, we now abort all outstanding table tasks to
  avoid orphaning them.

1. 
http://dist-test.cloudera.org:8080/diagnose?key=224b3aa2-3c87-11e6-9a09-0242ac110001

Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8
---
M src/kudu/integration-tests/create-table-stress-test.cc
M src/kudu/integration-tests/mini_cluster.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master.cc
M src/kudu/master/master_service.cc
6 files changed, 334 insertions(+), 144 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/50/3550/8
-- 
To view, visit http://gerrit.cloudera.org:8080/3550
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] [java client] Redo how we manage exceptions

2016-07-13 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: [java client] Redo how we manage exceptions
..


Patch Set 8: Code-Review+1

Will leave the +2ing for Dan, since he had some comments originally.

-- 
To view, visit http://gerrit.cloudera.org:8080/3055
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iba6e8a022d7a5391c3657cbdc9d3f06f951be048
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] master: add read-write lock to serialize operations around elections

2016-07-13 Thread Adar Dembo (Code Review)
Hello Dan Burkert, Todd Lipcon,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3550

to look at the new patch set (#7).

Change subject: master: add read-write lock to serialize operations around 
elections
..

master: add read-write lock to serialize operations around elections

This rigmarole began with an investigation into a test failure [1], which
led to a new integration test that hammers VisitTablesAndTablets() while
creating tables. That test revealed other locking issues, which brings us
to where we are now.

This patch introduces a read-write lock to serialize all master operations
so that they fall on one side or the other of a leader election. The idea
is to avoid performing operations concurrently with a reload of the master
metadata; doing so can lead to problems in Shutdown() and (very rarely,
perhaps only conceptually) to inconsistent on-disk state.

I was hoping this lock could replace the fencing done by leader_ready_term_,
but eventually reasoned that we need both; without leader_ready_term_
fencing, the master's consensus state machine could fool an operation into
thinking the master became the leader before the metadata was reloaded.

Three other things of note here:
- The new lock is acquired via TryLock() so that, if the lock could not be
  acquired, the RPC will fail rather than block. A future patch modifies
  TSHeartbeat() to partially accept heartbeats even if the master is a
  follower; TryLock() means that a transitioning leader that is pelted with
  RPCs won't fill up its service queue and can still process heartbeats.
- TableInfo's AddTask() and RemoveTask() methods now don't hold the table's
  lock when adding and removing refs from the task respectively. This is
  the fix for the original test failure.
- When reloading metadata, we now abort all outstanding table tasks to
  avoid orphaning them.

1. 
http://dist-test.cloudera.org:8080/diagnose?key=224b3aa2-3c87-11e6-9a09-0242ac110001

Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8
---
M src/kudu/integration-tests/create-table-stress-test.cc
M src/kudu/integration-tests/mini_cluster.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master.cc
M src/kudu/master/master_service.cc
6 files changed, 331 insertions(+), 144 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/50/3550/7
-- 
To view, visit http://gerrit.cloudera.org:8080/3550
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1358 (part 2): heartbeat to every master

2016-07-13 Thread Adar Dembo (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3610

to look at the new patch set (#5).

Change subject: KUDU-1358 (part 2): heartbeat to every master
..

KUDU-1358 (part 2): heartbeat to every master

Now that followers accept heartbeats, let's modify the tserver to send one
to every master. Spawning a heartbeater thread for each master seemed like
the natural way to do this; it should simplify dynamic master changes in the
future (i.e. just add or remove threads as needed).

The "dirty tablet" state is now encapsulated in the heartbeater threads
themselves, and the heartbeater must "fan out" to manipulate all of it. It's
a little noisy but I think it's reasonable. The alternative is for this
state to remain in the TSTabletManager, for the heartbeater to continue
tracking which master is the leader, and for it to only send tablet reports
to that master. This can be done with a few changes (e.g. adding term
numbers to the heartbeat response), but the only benefit is reduced network
traffic when tablets are dirty, so that didn't seem worth the complexity.

There's no new test here, but this code path is exercised in the test I
reenabled, and in the new stress test (follow-on patch).

Change-Id: Ic85ac4193462d21c989dbd7874b451e8eaab8e3e
---
M src/kudu/integration-tests/master_failover-itest.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/tserver/heartbeater.cc
M src/kudu/tserver/heartbeater.h
M src/kudu/tserver/ts_tablet_manager-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
7 files changed, 316 insertions(+), 267 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/10/3610/5
-- 
To view, visit http://gerrit.cloudera.org:8080/3610
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic85ac4193462d21c989dbd7874b451e8eaab8e3e
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] rw mutex: prevent recursive use

2016-07-13 Thread Adar Dembo (Code Review)
Adar Dembo has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/3641

Change subject: rw_mutex: prevent recursive use
..

rw_mutex: prevent recursive use

Todd provided an example[1] of deadlocked rwlocks due to a fairness policy.
In the example, T1 (holding the lock for reading) join()ed on T2 (trying to
acquire the lock for reading) all while T3 was trying to acquire the lock
for writing. The lock's fairness policy prevented T2 from acquiring the read
lock thus deadlocking all three threads. The takeaway is to be careful when
calling join() while holding locks.

Beyond that, deadlocks can arise if rwlocks are taken recursively. That's
not a feature we need in our rwlocks, so I tried to disable it at the
pthread level. Unfortunately, the best I can do is disable recursive write
lock acquisition; read locks are apparently always recursive (see "man
pthread_rwlockattr_setkind_np"). So instead, I built recursive checking into
the RWMutex itself. It's quite slow so it's only present in debug builds.

Note that pthread rwlocks do have some built-in deadlock detection (i.e.
lock calls may return EDEADLK), but it doesn't appear to be comprehensive.

1. 
https://issues.apache.org/jira/browse/HDFS-2223?focusedCommentId=13097647=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13097647

Change-Id: I7ae30ec123a16c39ef0c15ee2d2176f807df03db
---
M src/kudu/util/CMakeLists.txt
A src/kudu/util/rw_mutex-test.cc
M src/kudu/util/rw_mutex.cc
M src/kudu/util/rw_mutex.h
4 files changed, 307 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/41/3641/1
-- 
To view, visit http://gerrit.cloudera.org:8080/3641
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I7ae30ec123a16c39ef0c15ee2d2176f807df03db
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 


[kudu-CR] KUDU-1374: send full tablet report when new leader master is detected

2016-07-13 Thread Adar Dembo (Code Review)
Adar Dembo has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/3643

Change subject: KUDU-1374: send full tablet report when new leader master is 
detected
..

KUDU-1374: send full tablet report when new leader master is detected

This should help prevent missed tablet reports in very specific edge cases,
detailed in the bug report.

No new tests, as I'm not quite sure how to test this (it would be easy if
there was a way to mock a server-side krpc component as easily as it is to
mock a client-side one). It should reduce flakiness in some existing
multi-master tests, though.

Change-Id: Ic16fc46736476dba39616e79ecfe79eee48b3d7f
---
M src/kudu/tserver/heartbeater.cc
1 file changed, 30 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/43/3643/1
-- 
To view, visit http://gerrit.cloudera.org:8080/3643
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ic16fc46736476dba39616e79ecfe79eee48b3d7f
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 


[kudu-CR] master: add assert checks for leader lock

2016-07-13 Thread Adar Dembo (Code Review)
Adar Dembo has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/3642

Change subject: master: add assert checks for leader_lock
..

master: add assert checks for leader_lock

A side effect of recursive checking in RWMutex is that we can now assert
that a RWMutex is held for reading/writing. Let's add that to the various
catalog manager entry points.

Change-Id: Iefb5762c70192b27490cc71e20568815d18d6ad5
---
M src/kudu/client/client-test.cc
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/create-table-stress-test.cc
M src/kudu/integration-tests/mini_cluster.cc
M src/kudu/integration-tests/mini_cluster.h
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test-util.h
M src/kudu/master/master.cc
9 files changed, 78 insertions(+), 41 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/3642/1
-- 
To view, visit http://gerrit.cloudera.org:8080/3642
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Iefb5762c70192b27490cc71e20568815d18d6ad5
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 


[kudu-CR] master: do not delete unknown tablets

2016-07-13 Thread Adar Dembo (Code Review)
Adar Dembo has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/3645

Change subject: master: do not delete unknown tablets
..

master: do not delete unknown tablets

Quoting from the multi-master design doc:

"The master and/or tserver must enforce that all actions take effect
iff they were sent by the master that is currently the leader.

After an exhaustive audit of all master state changes (see appendix A), it
was determined that the current protection mechanisms built into each RPC
are sufficient to provide fencing. The one exception is orphaned replica
deletion done in response to a heartbeat. To protect against that, true
orphans (i.e. tablets for which no persistent record exists) will not be
deleted at all. As the master retains deleted table/tablet metadata in
perpetuity, this should ensure that true orphans appear only under drastic
circumstances, such as a tserver that heartbeats to the wrong cluster."

The new test isn't ideal in that it must wait some time to allow the tserver
to receive an RPC from the master, but on my laptop it does fail without the
fix, and it should fail fairly often in other machines/environments too.

Change-Id: I331f2d5bb06c38daa7b09854dbb24a7881723551
---
M src/kudu/integration-tests/create-table-itest.cc
M src/kudu/integration-tests/delete_table-test.cc
M src/kudu/master/catalog_manager.cc
3 files changed, 46 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/45/3645/1
-- 
To view, visit http://gerrit.cloudera.org:8080/3645
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I331f2d5bb06c38daa7b09854dbb24a7881723551
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 


[kudu-CR] [java client] Redo how we manage exceptions

2016-07-13 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: [java client] Redo how we manage exceptions
..


Patch Set 7:

(3 comments)

Only a few nits left.

When you're done, could you make a pass over the client-consuming modules in 
the java subdirectory and update their exception handling, if need be? In a 
separate patch, of course.

http://gerrit.cloudera.org:8080/#/c/3055/7/java/kudu-client/src/main/java/org/kududb/client/NonCoveredRangeException.java
File 
java/kudu-client/src/main/java/org/kududb/client/NonCoveredRangeException.java:

Line 25: @InterfaceAudience.Public
Hmm, shouldn't this be Private?


http://gerrit.cloudera.org:8080/#/c/3055/7/java/kudu-client/src/main/java/org/kududb/client/RowResultIterator.java
File java/kudu-client/src/main/java/org/kududb/client/RowResultIterator.java:

Line 55: int numRows, Slice bs, Slice indirectBs) {
Nit: indentation here looks off.


PS7, Line 66: Schema 
schema,
: 
WireProtocol.RowwiseRowBlockPB data,
: final 
CallResponse callResponse)
Nit: here too.


-- 
To view, visit http://gerrit.cloudera.org:8080/3055
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iba6e8a022d7a5391c3657cbdc9d3f06f951be048
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] rw mutex: add configurable priority

2016-07-13 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: rw_mutex: add configurable priority
..


Patch Set 3:

> I have experienced in the past an issue where the fairness policy
 > causes a deadlock. See 
 > https://issues.apache.org/jira/browse/HDFS-2223?focusedCommentId=13097647=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13097647
 > for example.
 > 
 > I think the issue only happens when you recursively acquire the
 > read lock, or if you are holding the read lock while waiting on
 > another thread which needs to acquire the read lock. I'm not sure
 > if we have cases of either, but we should probably document this
 > danger.

I see.

I added a comment to the fairness policies to highlight the dangers. I've also 
got a follow-on patch that prohibits recursive locking altogether, so that 
should help too.

-- 
To view, visit http://gerrit.cloudera.org:8080/3603
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I16ba6cd041f126c94e63fa07a1e84c88db6778d7
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] [java client] Redo how we manage exceptions

2016-07-13 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: [java client] Redo how we manage exceptions
..


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/3055/5/java/kudu-client/src/main/java/org/kududb/client/AsyncKuduClient.java
File java/kudu-client/src/main/java/org/kududb/client/AsyncKuduClient.java:

Line 1029: Status statusTimedOut = Status.TimedOut(message + request);
> Right I'm saying this is equivalent, the cause is what is prompting us to r
Oh, I forgot there's a Status for each exception in the chain. In that case, 
yes, I think this is fine, as there's a programmatic representation for each 
cause.


http://gerrit.cloudera.org:8080/#/c/3055/5/java/kudu-client/src/main/java/org/kududb/client/RowResultIterator.java
File java/kudu-client/src/main/java/org/kududb/client/RowResultIterator.java:

PS5, Line 71:   throw new RuntimeException("RowResult block has " + 
bs.length() +
:   " bytes of data but expected " + expectedSize + " for " 
+ numRows + " rows");
> I'm not fully understanding your concern with having this here. Is it just 
Yeah, I find throwing from a constructor to be generally weird. And beyond 
that, throwing a checked exception from a constructor is even weirder, and I 
think this should be a KuduException because it's thrown in response to bad 
on-the-wire data.

An init() method is okay too, it's just more work for the caller than a factory 
method (i.e. two calls to make instead of one).


-- 
To view, visit http://gerrit.cloudera.org:8080/3055
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iba6e8a022d7a5391c3657cbdc9d3f06f951be048
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Jean-Daniel Cryans 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] [java client] Redo how we manage exceptions

2016-07-12 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: [java client] Redo how we manage exceptions
..


Patch Set 5:

(16 comments)

Definitely make a pass over the Javadoc for:
1. Adding @throws where necessary, and
2. Updating public-facing docs to avoid mentioning private exception classes.

Also, what value does the separation between RecoverableException and 
NonRecoverableException bring, for internal client use that is?

http://gerrit.cloudera.org:8080/#/c/3055/5/java/kudu-client/src/main/java/org/kududb/client/AsyncKuduClient.java
File java/kudu-client/src/main/java/org/kududb/client/AsyncKuduClient.java:

Line 1029: Status statusTimedOut = Status.TimedOut(message + request);
Perhaps unrelated, but in the C++ client we try to preserve the last "real" 
error so that in the event of a timeout, we can provide that to the user too.


Line 1280:   return new NonRecoverableException(status);
Ugh. Can we use an errback instead of using the callback for exceptions too?


http://gerrit.cloudera.org:8080/#/c/3055/5/java/kudu-client/src/main/java/org/kududb/client/AsyncKuduScanner.java
File java/kudu-client/src/main/java/org/kududb/client/AsyncKuduScanner.java:

Line 843: Status statusCorruption = Status.Corruption("Scan RPC 
response was for scanner"
I think we typically reserve "corruption" for scrambled data and equivalent. 
Maybe IllegalState or InvalidArgument here?


http://gerrit.cloudera.org:8080/#/c/3055/5/java/kudu-client/src/main/java/org/kududb/client/AsyncKuduSession.java
File java/kudu-client/src/main/java/org/kududb/client/AsyncKuduSession.java:

Line 491:   public Deferred apply(final Operation operation) 
throws KuduException {
Update Javadoc. Make a pass over the other changed methods too, to make sure 
they have a @throws.


Line 536: Status.ConfigurationError("MANUAL_FLUSH is enabled 
but the buffer is too big");
Maybe IllegalState instead?


http://gerrit.cloudera.org:8080/#/c/3055/5/java/kudu-client/src/main/java/org/kududb/client/GetMasterRegistrationReceived.java
File 
java/kudu-client/src/main/java/org/kududb/client/GetMasterRegistrationReceived.java:

Line 127:   Status statusConfigurationError = Status.ConfigurationError(
Why not NotFound?


http://gerrit.cloudera.org:8080/#/c/3055/5/java/kudu-client/src/main/java/org/kududb/client/KuduClient.java
File java/kudu-client/src/main/java/org/kududb/client/KuduClient.java:

Line 92:* @throws NonRecoverableException for any error returned by sending 
RPCs to the master
Nope, NonRecoverableException isn't public.


Line 107:   } catch (Exception ex) {
Isn't it only d.join() that needs to be surrounded in the try/catch?


http://gerrit.cloudera.org:8080/#/c/3055/5/java/kudu-client/src/main/java/org/kududb/client/KuduRpc.java
File java/kudu-client/src/main/java/org/kududb/client/KuduRpc.java:

Why RuntimeExceptions from this file and not some kind of KuduException?


http://gerrit.cloudera.org:8080/#/c/3055/5/java/kudu-client/src/main/java/org/kududb/client/KuduSession.java
File java/kudu-client/src/main/java/org/kududb/client/KuduSession.java:

Line 52:* Blocking call with a different behavior based on the flush mode. 
PleaseThrottleException is
Should update this to avoid mentioning private exception types.


PS5, Line 99: DeferredGroupException.
We shouldn't mention this; it's not a public exception any more.


http://gerrit.cloudera.org:8080/#/c/3055/5/java/kudu-client/src/main/java/org/kududb/client/NoLeaderMasterFoundException.java
File 
java/kudu-client/src/main/java/org/kududb/client/NoLeaderMasterFoundException.java:

Line 42:* Factory method that creates a NoLeaderException given a message 
and a list
Nit: NoLeaderMasterFoundException


Line 53:   static NoLeaderMasterFoundException create(String msg, 
List causes) {
What value does this factory method add? I think all other exceptions are 
constructed directly; can we do the same here?

I think it'd make more sense that way, because then the caller is responsible 
for constructing the Status, and the caller is best positioned to know exactly 
what kind of Status to use.


http://gerrit.cloudera.org:8080/#/c/3055/5/java/kudu-client/src/main/java/org/kududb/client/RowResultIterator.java
File java/kudu-client/src/main/java/org/kududb/client/RowResultIterator.java:

PS5, Line 71:   throw new RuntimeException("RowResult block has " + 
bs.length() +
:   " bytes of data but expected " + expectedSize + " for " 
+ numRows + " rows");
Maybe we can move this check into a method where it'd be more natural to throw 
a checked exception?


http://gerrit.cloudera.org:8080/#/c/3055/5/java/kudu-client/src/main/java/org/kududb/client/Status.java
File java/kudu-client/src/main/java/org/kududb/client/Status.java:

PS5, Line 26: Wraps {@link org.kududb.WireProtocol.AppStatusPB}.
:  * See also {@code src/kudu/util/status.h} in the 

[kudu-CR] [c++-client]: cache non-covering ranges in meta cache

2016-07-11 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: [c++-client]: cache non-covering ranges in meta cache
..


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/3581/1/src/kudu/client/meta_cache.h
File src/kudu/client/meta_cache.h:

Line 283: DCHECK_NOTNULL(tablet_.get());
> That doesn't compile since tablet_ is not a pointer.
Maybe "return DCHECK_NOTNULL(tablet_.get());" then?

Would be good to avoid that warning.


Line 311:   // Returns true if this meta cache entry is stale.
> Yes, it will probably crash.  Having the table is necessary for any kind of
Can you rename the function then? I really prefer to see ToString() as a no-arg 
method like in Java, and to use an alternative name (e.g. DebugString(), 
PartitionDebugString(), things like that) for cases like this.


-- 
To view, visit http://gerrit.cloudera.org:8080/3581
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I05bcb3fe05d51d7c455e1d68bd2baa6f3c2b9d21
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] master: reduce timeout for master to tserver rpcs

2016-07-11 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: master: reduce timeout for master to tserver rpcs
..


Patch Set 3:

> I do recall seeing a once-in-a-blue-moon AlterTable() timeout that
 > I traced back to this 30s timeout and a coinciding leader election,
 > but I can't remember the details now. I'll retest without the
 > timeout to see if I can retrigger it.

I ran the test 1000 times and it didn't fail. So I'll abandon this.

-- 
To view, visit http://gerrit.cloudera.org:8080/3606
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06b5f32c0295f0a9d24ff42695905858ae6101f6
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] master: reduce timeout for master to tserver rpcs

2016-07-11 Thread Adar Dembo (Code Review)
Adar Dembo has abandoned this change.

Change subject: master: reduce timeout for master to tserver rpcs
..


Abandoned

-- 
To view, visit http://gerrit.cloudera.org:8080/3606
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: abandon
Gerrit-Change-Id: I06b5f32c0295f0a9d24ff42695905858ae6101f6
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] c++ client: various fixes to DDL operations

2016-07-11 Thread Adar Dembo (Code Review)
Hello Dan Burkert, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3608

to look at the new patch set (#4).

Change subject: c++ client: various fixes to DDL operations
..

c++ client: various fixes to DDL operations

1. Remove the num_attempts>1 workarounds found in some operations. The
   CreateTable one was actually broken (the PartitionSchema::FromPB() call
   triggered a CHECK) and this would be more robustly handled via the new
   "exactly once" semantics.
2. In SyncLeaderMasterRpc(), take an extra ref on the master proxy.
   Otherwise a concurrent master leader election could crash the client.
3. Add exponential backoff to SyncLeaderMasterRpc(). It's nothing fancy, but
   I was tired of the tight loops and log spew.

Change-Id: I09768240bd04cca95d95aefe17c34d276075125b
---
M src/kudu/client/client-internal.cc
M src/kudu/client/client-internal.h
M src/kudu/client/client.cc
M src/kudu/client/scanner-internal.cc
4 files changed, 34 insertions(+), 68 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/08/3608/4
-- 
To view, visit http://gerrit.cloudera.org:8080/3608
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I09768240bd04cca95d95aefe17c34d276075125b
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] Update docs on how to run gcovr

2016-07-11 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: Update docs on how to run gcovr
..


Patch Set 2: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/3508
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I494136e20452b76572d753b54fc7a095aa54a69b
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Mike Percy 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR](gh-pages) Add weekly update for 07/11

2016-07-11 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: Add weekly update for 07/11
..


Patch Set 5: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/3616
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5c60067b9e548ff1f75b1cd940646ad6debf462d
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: gh-pages
Gerrit-Owner: Jean-Daniel Cryans 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Misty Stanley-Jones 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] KUDU-1358 (part 2): heartbeat to every master

2016-07-11 Thread Adar Dembo (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3610

to look at the new patch set (#3).

Change subject: KUDU-1358 (part 2): heartbeat to every master
..

KUDU-1358 (part 2): heartbeat to every master

Now that followers accept heartbeats, let's modify the tserver to send one
to every master. Spawning a heartbeater thread for each master seemed like
the natural way to do this; it should simplify dynamic master changes in the
future (i.e. just add or remove threads on the fly).

As a result, the "dirty tablet" state is now encapsulated in the heartbeater
threads themselves, and the heartbeater must "fan out" to manipulate all of
it. It's a little noisy but I think it's worth the encapsulation.

There's no new test here, but this code path is exercised in the test I
reenabled, and in the new stress test (follow-on patch).

Change-Id: Ic85ac4193462d21c989dbd7874b451e8eaab8e3e
---
M src/kudu/integration-tests/master_failover-itest.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/tserver/heartbeater.cc
M src/kudu/tserver/heartbeater.h
M src/kudu/tserver/ts_tablet_manager-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
7 files changed, 316 insertions(+), 267 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/10/3610/3
-- 
To view, visit http://gerrit.cloudera.org:8080/3610
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic85ac4193462d21c989dbd7874b451e8eaab8e3e
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] rw mutex: add try lock methods

2016-07-11 Thread Adar Dembo (Code Review)
Adar Dembo has submitted this change and it was merged.

Change subject: rw_mutex: add try lock methods
..


rw_mutex: add try lock methods

I found these to be useful, so I'm exposing them in our wrapper, and beefing
up kudu::shared_lock accordingly.

Change-Id: Ic3dd4f406b5f90898870083ed3143667d1627bd6
Reviewed-on: http://gerrit.cloudera.org:8080/3604
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon 
Reviewed-by: Dan Burkert 
---
M src/kudu/util/locks.h
M src/kudu/util/rw_mutex.cc
M src/kudu/util/rw_mutex.h
3 files changed, 40 insertions(+), 4 deletions(-)

Approvals:
  Dan Burkert: Looks good to me, approved
  Todd Lipcon: Looks good to me, but someone else must approve
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/3604
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ic3dd4f406b5f90898870083ed3143667d1627bd6
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1358 (part 2): heartbeat to every master

2016-07-11 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1358 (part 2): heartbeat to every master
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3610/2/src/kudu/tserver/heartbeater.cc
File src/kudu/tserver/heartbeater.cc:

Line 130:   int last_locate_master_idx_;
> as far as I can tell this isn't used anywhere.
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/3610
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic85ac4193462d21c989dbd7874b451e8eaab8e3e
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1358 (part 1): master should accept heartbeat even if follower

2016-07-11 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1358 (part 1): master should accept heartbeat even if 
follower
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3609/2/src/kudu/master/catalog_manager.h
File src/kudu/master/catalog_manager.h:

Line 345:   : catalog_(catalog),
> indent 4
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/3609
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I578674927b65b4171e8437de8515130e4a0ed139
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] master: reduce timeout for master to tserver rpcs

2016-07-11 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: master: reduce timeout for master to tserver rpcs
..


Patch Set 2:

Filed KUDU-1526 for the test failure (which appears to be an unrelated flake).

 > I think we bumped this to be pretty high because we found that
 > creating a new tablet can actually take reasonably long (it
 > involves a bunch of fsyncs, preallocating log, etc) especially when
 > there are lots of them going on at the same time (as when you
 > create a table with 10s of tablets per server).

Note that this only changes the "generic" timeout. AsyncCreateReplica() uses 
its own timeout gflag which is still set to 30s. That one is more dangerous to 
dial down, as a timeout would cause the master to replace the tablet in 
question. But, a discrepancy between the two gflags is weird, so I am tempted 
to revert this.

I do recall seeing a once-in-a-blue-moon AlterTable() timeout that I traced 
back to this 30s timeout and a coinciding leader election, but I can't remember 
the details now. I'll retest without the timeout to see if I can retrigger it.

-- 
To view, visit http://gerrit.cloudera.org:8080/3606
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06b5f32c0295f0a9d24ff42695905858ae6101f6
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] master: reduce timeout for master to tserver rpcs

2016-07-11 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: master: reduce timeout for master to tserver rpcs
..


Patch Set 2:

> I think we bumped this to be pretty high because we found that
 > creating a new tablet can actually take reasonably long (it
 > involves a bunch of fsyncs, preallocating log, etc) especially when
 > there are lots of them going on at the same time (as when you
 > create a table with 10s of tablets per server).

-- 
To view, visit http://gerrit.cloudera.org:8080/3606
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06b5f32c0295f0a9d24ff42695905858ae6101f6
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] rw mutex: add configurable priority

2016-07-11 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: rw_mutex: add configurable priority
..


Patch Set 2:

> Is there any possibility that the different priorities could create
 > a case where OSX would deadlock but Linux wouldn't? am slightly
 > worried about introducing different semantics on the different
 > platforms (and not just a perf difference)

Lock priority can affect starvation or denial of service, but I don't see how 
it could affect deadlocks (which are, in my experience, "provable" one way or 
the other). If a particular application of the new PREFER_WRITING priority 
somehow prevented a deadlock, then before this patch that RWMutex was already 
deadlock prone and buggy.

More concretely, I expect that with this patch, an application of 
PREFER_WRITING on Linux may show better performance or fewer timeouts when 
under stress than the same code on macOS.

-- 
To view, visit http://gerrit.cloudera.org:8080/3603
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I16ba6cd041f126c94e63fa07a1e84c88db6778d7
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] c++ client: various fixes to DDL operations

2016-07-11 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: c++ client: various fixes to DDL operations
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/3608/1/src/kudu/client/client-internal.cc
File src/kudu/client/client-internal.cc:

PS1, Line 149: num_attempts
> make this explicit
Done


PS1, Line 155: 10
> so many magic numbers.. pull constants/flags?
I didn't write this; I just copied it from scanner-internal.cc. I have no idea 
why this formula was chosen or what the significance of each constant is; Dan 
put it together for KUDU-1079.

But, I took your advice and deduplicated the code.


-- 
To view, visit http://gerrit.cloudera.org:8080/3608
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I09768240bd04cca95d95aefe17c34d276075125b
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] rw mutex: add try lock methods

2016-07-10 Thread Adar Dembo (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3604

to look at the new patch set (#2).

Change subject: rw_mutex: add try lock methods
..

rw_mutex: add try lock methods

I found these to be useful, so I'm exposing them in our wrapper, and beefing
up kudu::shared_lock accordingly.

Change-Id: Ic3dd4f406b5f90898870083ed3143667d1627bd6
---
M src/kudu/util/locks.h
M src/kudu/util/rw_mutex.cc
M src/kudu/util/rw_mutex.h
3 files changed, 40 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/04/3604/2
-- 
To view, visit http://gerrit.cloudera.org:8080/3604
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic3dd4f406b5f90898870083ed3143667d1627bd6
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] master: add read-write lock to serialize operations around elections

2016-07-10 Thread Adar Dembo (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3550

to look at the new patch set (#5).

Change subject: master: add read-write lock to serialize operations around 
elections
..

master: add read-write lock to serialize operations around elections

This rigmarole began with an investigation into a test failure [1], which
led to a new integration test that hammers VisitTablesAndTablets() while
creating tables. That test revealed other locking issues, which brings us
to where we are now.

This patch introduces a read-write lock to serialize all master operations
so that they fall on one side or the other of a leader election. The idea
is to avoid performing operations concurrently with a reload of the master
metadata; doing so can lead to problems in Shutdown() and (very rarely,
perhaps only conceptually) to inconsistent on-disk state.

I was hoping this lock could replace the fencing done by leader_ready_term_,
but eventually reasoned that we need both; without leader_ready_term_
fencing, the master's consensus state machine could fool an operation into
thinking the master became the leader before the metadata was reloaded.

Three other things of note here:
- The new lock is acquired via TryLock() so that, if the lock could not be
  acquired, the RPC will fail rather than block. A future patch modifies
  TSHeartbeat() to partially accept heartbeats even if the master is a
  follower; TryLock() means that a transitioning leader that is pelted with
  RPCs won't fill up its service queue and can still process heartbeats.
- TableInfo's AddTask() and RemoveTask() methods now don't hold the table's
  lock when adding and removing refs from the task respectively. This is
  the fix for the original test failure.
- When reloading metadata, we now abort all outstanding table tasks to
  avoid orphaning them.

1. 
http://dist-test.cloudera.org:8080/diagnose?key=224b3aa2-3c87-11e6-9a09-0242ac110001

Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8
---
M src/kudu/integration-tests/create-table-stress-test.cc
M src/kudu/integration-tests/mini_cluster.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master.cc
M src/kudu/master/master_service.cc
6 files changed, 331 insertions(+), 144 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/50/3550/5
-- 
To view, visit http://gerrit.cloudera.org:8080/3550
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] rw mutex: add configurable priority

2016-07-10 Thread Adar Dembo (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3603

to look at the new patch set (#2).

Change subject: rw_mutex: add configurable priority
..

rw_mutex: add configurable priority

The glibc implementation of pthread rwlocks exposes priorities that can help
if avoiding reader or writer starvation is desirable. I have a use case for
the latter, so let's expose the priorities in our wrapper.

Note: pthread rwlock priorities don't exist on macOS, which is why they're a
"best effort".

Change-Id: I16ba6cd041f126c94e63fa07a1e84c88db6778d7
---
M src/kudu/util/rw_mutex.cc
M src/kudu/util/rw_mutex.h
2 files changed, 47 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/03/3603/2
-- 
To view, visit http://gerrit.cloudera.org:8080/3603
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I16ba6cd041f126c94e63fa07a1e84c88db6778d7
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-1358 (part 1): master should accept heartbeat even if follower

2016-07-10 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: KUDU-1358 (part 1): master should accept heartbeat even if 
follower
..


Patch Set 1:

Filed KUDU-1521 for the Java test failure.

-- 
To view, visit http://gerrit.cloudera.org:8080/3609
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I578674927b65b4171e8437de8515130e4a0ed139
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] master: fix corruption when AlterTable() races with CreateTable()

2016-07-10 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: master: fix corruption when AlterTable() races with 
CreateTable()
..


Patch Set 1:

(2 comments)

The test failure was in the initial write to ITScannerMultiTablet. I suspect 
it's flakiness, but I don't know for sure because kudu-gerrit wasn't archiving 
failsafe-based test logs. I've reconfigured it to do that and will keep an eye 
out going forward.

http://gerrit.cloudera.org:8080/#/c/3607/1/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 880:   SetupError(resp->mutable_error(), 
MasterErrorPB::TABLE_NOT_FOUND, s);
> While we're revisiting this code-- is TABLE_NOT_FOUND really the right erro
See https://gerrit.cloudera.org/#/c/2714 for why this combination was chosen. I 
agree that it's somewhat unintuitive, but I don't think there's a slam dunk 
option amongst any of our existing error codes.


http://gerrit.cloudera.org:8080/#/c/3607/1/src/kudu/master/master-test.cc
File src/kudu/master/master-test.cc:

Line 1087: CreateTableRequestPB req;
> Why not use CreateTable here?
I need access to resp.error().code(); CreateTable only gives me  
resp.error().status(). Figured this was the path of least resistance, as in 
TestConcurrentCreateOfSameTable.


-- 
To view, visit http://gerrit.cloudera.org:8080/3607
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6c9e4214c09bc47a5a10b12d6ffe8b35906708c9
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] master: add read-write lock to serialize operations around elections

2016-07-10 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: master: add read-write lock to serialize operations around 
elections
..


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/3550/4/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 358: // If there is an error (e.g., we are not the leader) 
abort this task
> Should this comment be update?
Not necessarily; this master could lose its leadership after taking the lock. 
It'll find out when trying to replicate to the followers, and that error will 
propagate up here.


Line 3147:   std::lock_guard l(catalog_->state_lock_);
> I expected these checks to be made 'on demand' in CheckIsInitializedOrRespo
For one, some users don't use the CheckIs...() methods; they call the status() 
accessors only.

For two, I modeled this class after lock_guard and shared_lock; I want its 
users to think that the resource is taken when the class is declared, and 
released when the instance goes out of scope.


-- 
To view, visit http://gerrit.cloudera.org:8080/3550
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I5084c09f1a77ccf620fb6cd621094c4778d636f8
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] master-test: rewrite to use std::thread instead of kudu::Thread

2016-07-10 Thread Adar Dembo (Code Review)
Adar Dembo has submitted this change and it was merged.

Change subject: master-test: rewrite to use std::thread instead of kudu::Thread
..


master-test: rewrite to use std::thread instead of kudu::Thread

It's more terse than kudu::Thread and is easier to use with lambdas.

There are no functional changes here.

Change-Id: Ibd1880b4a47cb741742ec00ecfd10ab2e5c223e3
Reviewed-on: http://gerrit.cloudera.org:8080/3602
Tested-by: Kudu Jenkins
Reviewed-by: Dan Burkert 
---
M src/kudu/master/master-test.cc
1 file changed, 86 insertions(+), 108 deletions(-)

Approvals:
  Dan Burkert: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/3602
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ibd1880b4a47cb741742ec00ecfd10ab2e5c223e3
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] rw mutex: add configurable priority

2016-07-10 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: rw_mutex: add configurable priority
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/3603/1/src/kudu/util/rw_mutex.h
File src/kudu/util/rw_mutex.h:

Line 31:   enum Priority {
> consider making this an enum class.
Neat, wasn't aware of this feature. Unfortunately, it makes logging the enum 
value in the case where it somehow isn't one these two values difficult:

  /home/adar/Source/kudu/src/kudu/util/rw_mutex.cc: In member function ‘void 
kudu::RWMutex::Init(kudu::RWMutex::Priority)’:
  /home/adar/Source/kudu/src/kudu/util/rw_mutex.cc:53:41: error: no match for 
‘operator<<’ (operand types are ‘std::basic_ostream’ and 
‘kudu::RWMutex::Priority’)
   LOG(FATAL) << "Unknown priority: " << prio;

The solution proposed in http://stackoverflow.com/a/11421471 is pretty ugly, so 
unless you know of a better way, I'm going to keep the code as-is.


Line 57:   void Init(Priority prio);
> any reason to have a separate init instead of doing it in the ctor as befor
My understanding is that there's no way to "chain" constructors in C++, so I 
pulled the common initialization code into Init() and am invoking it from both 
constructors.


-- 
To view, visit http://gerrit.cloudera.org:8080/3603
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I16ba6cd041f126c94e63fa07a1e84c88db6778d7
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] c++ client: various fixes to DDL operations

2016-07-08 Thread Adar Dembo (Code Review)
Hello Dan Burkert, David Ribeiro Alves, Todd Lipcon,

I'd like you to do a code review.  Please visit

http://gerrit.cloudera.org:8080/3608

to review the following change.

Change subject: c++ client: various fixes to DDL operations
..

c++ client: various fixes to DDL operations

1. Remove the num_attempts>1 workarounds found in some operations. The
   CreateTable one was actually broken (the PartitionSchema::FromPB() call
   triggered a CHECK) and this would be more robustly handled via the new
   "exactly once" semantics.
2. In SyncLeaderMasterRpc(), take an extra ref on the master proxy.
   Otherwise a concurrent master leader election could crash the client.
3. Add exponential backoff to SyncLeaderMasterRpc(). It's nothing fancy, but
   I was tired of the tight loops and log spew.

Change-Id: I09768240bd04cca95d95aefe17c34d276075125b
---
M src/kudu/client/client-internal.cc
M src/kudu/client/client-internal.h
M src/kudu/client/client.cc
3 files changed, 29 insertions(+), 64 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/08/3608/1
-- 
To view, visit http://gerrit.cloudera.org:8080/3608
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I09768240bd04cca95d95aefe17c34d276075125b
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] master: fix corruption when AlterTable() races with CreateTable()

2016-07-08 Thread Adar Dembo (Code Review)
Hello Dan Burkert, David Ribeiro Alves, Todd Lipcon,

I'd like you to do a code review.  Please visit

http://gerrit.cloudera.org:8080/3607

to review the following change.

Change subject: master: fix corruption when AlterTable() races with 
CreateTable()
..

master: fix corruption when AlterTable() races with CreateTable()

Admittedly, this is a contrived scenario:
1. T1 tries to create table with name 'foo'
2. T2 tries to rename table with name 'bar' to 'foo'

With just the right timing, both operations succeed and the metadata now has
two tables named 'foo', each with a different table ID. The fix is simple:
generalize the "tables being created" logic already used by CreateTable().

Without the fix, the new test failed every 50th run or so. With it, it
doesn't fail in 1000 runs.

Change-Id: I6c9e4214c09bc47a5a10b12d6ffe8b35906708c9
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
3 files changed, 231 insertions(+), 30 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/07/3607/1
-- 
To view, visit http://gerrit.cloudera.org:8080/3607
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I6c9e4214c09bc47a5a10b12d6ffe8b35906708c9
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Todd Lipcon 


  1   2   3   4   >