[kudu-CR] WIP [catalog manager] added 3-4-3 behavior

2017-11-21 Thread Alexey Serbin (Code Review)
Hello Tidy Bot, Mike Percy, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8619

to look at the new patch set (#3).

Change subject: WIP [catalog_manager] added 3-4-3 behavior
..

WIP [catalog_manager] added 3-4-3 behavior

Updated the catalog_manager to behave as specified in 3-4-3 v1 design
implementation plan, patch2.

WIP: need to add comprehesive functional/integration tests using
 minicluster once the leader replica starts reporting at least about
 reachable and unreachable replicas.

Change-Id: I6f0469ac641bf7a03dbef01eaa3f1b58a5bf5d27
---
M src/kudu/consensus/quorum_util-test.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
5 files changed, 829 insertions(+), 105 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/19/8619/3
--
To view, visit http://gerrit.cloudera.org:8080/8619
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6f0469ac641bf7a03dbef01eaa3f1b58a5bf5d27
Gerrit-Change-Number: 8619
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR](branch-1.3.x) [security] test and fixes for TLS socket EINTR issues

2017-11-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8599 )

Change subject: [security] test and fixes for TLS socket EINTR issues
..

[security] test and fixes for TLS socket EINTR issues

SSL_{read,write}() can return SSL_ERROR_WANT_{READ,WRITE}
correspondingly when signal interrupts recv()/send() calls even if
SSL_MODE_AUTO_RETRY is set in the TLS context.  To handle that
properly in Socket::Blocking{Recv,Write}() methods, return
NetworkError() with appropriate POSIX error code from
TlsSocket::{Recv,Write}().

As a by-product, this changelist fixes flakiness in TestUniqueClientIds
scenario of the ClientStressTest test and other flaky tests which failed
with errors like below:

  Bad status: IO error: Could not connect to the cluster: \
Client connection negotiation failed: client connection to \
IP:port: Read zero bytes on a blocking Recv() call: \
Transferred 0 of 4 bytes

Prior to this fix, the test failure ratio observed with dist-test
for TSAN builds was about 6% in multiple 1K runs.  After the fix,
no failures observed.

Change-Id: Ibec9049186f79f1c43295e4735538ed7ba4e516e
Reviewed-on: http://gerrit.cloudera.org:8080/8462
Tested-by: Kudu Jenkins
Reviewed-by: Dan Burkert 
Reviewed-by: Alexey Serbin 
(cherry picked from commit 18e024cf8bcaea192efb63780802cc4c799bbb9c)
Reviewed-on: http://gerrit.cloudera.org:8080/8599
Reviewed-by: Todd Lipcon 
---
M src/kudu/security/CMakeLists.txt
M src/kudu/security/tls_handshake.cc
A src/kudu/security/tls_socket-test.cc
M src/kudu/security/tls_socket.cc
M src/kudu/util/net/socket.cc
5 files changed, 227 insertions(+), 7 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Todd Lipcon: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/8599
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.3.x
Gerrit-MessageType: merged
Gerrit-Change-Id: Ibec9049186f79f1c43295e4735538ed7ba4e516e
Gerrit-Change-Number: 8599
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR](branch-1.3.x) KUDU-2218. tls socket: properly handle temporary socket errors in Writev

2017-11-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8600 )

Change subject: KUDU-2218. tls_socket: properly handle temporary socket errors 
in Writev
..

KUDU-2218. tls_socket: properly handle temporary socket errors in Writev

This fixes a bug which caused RaftConsensusITest.TestLargeBatches to
fail when run under stress, as in the following command line:

taskset -c 0-4 \
 build/latest/bin/raft_consensus-itest \
   --gtest_filter=\*LargeBat\* \
   --stress-cpu-threads=8

This would produce an error like:
Network error: failed to write to TLS socket: error:1409F07F:SSL 
routines:SSL3_WRITE_PENDING:bad write retry:s3_pkt.c:878

This means that we were retrying a write after getting EAGAIN, but with
a different buffer than the first time.

I tracked this down to mishandling of temporary socket errors in
TlsSocket::Writev(). In the case that we successfully write part of the
io vector but hit such an error trying to write a later element in the
vector, we were still propagating the error back up to the caller. The
caller didn't realize that part of the write was successful, and thus it
would retry the write from the beginning.

The fix is to fix the above, but also to enable partial writes in
TlsContext. The new test fails if either of the above two changes are
backed out.

Change-Id: If797f220f42bfb2e6f452b66f15e7a758e883472
Reviewed-on: http://gerrit.cloudera.org:8080/8570
Tested-by: Kudu Jenkins
Reviewed-by: Alexey Serbin 
(cherry picked from commit 64eb9f37b171419ed12a3795efe28faf2fd33b3d)
Reviewed-on: http://gerrit.cloudera.org:8080/8600
Reviewed-by: Todd Lipcon 
---
M src/kudu/security/tls_context.cc
M src/kudu/security/tls_socket-test.cc
M src/kudu/security/tls_socket.cc
3 files changed, 219 insertions(+), 70 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Todd Lipcon: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/8600
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.3.x
Gerrit-MessageType: merged
Gerrit-Change-Id: If797f220f42bfb2e6f452b66f15e7a758e883472
Gerrit-Change-Number: 8600
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR](branch-1.4.x) KUDU-2218. tls socket: properly handle temporary socket errors in Writev

2017-11-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8602 )

Change subject: KUDU-2218. tls_socket: properly handle temporary socket errors 
in Writev
..

KUDU-2218. tls_socket: properly handle temporary socket errors in Writev

This fixes a bug which caused RaftConsensusITest.TestLargeBatches to
fail when run under stress, as in the following command line:

taskset -c 0-4 \
 build/latest/bin/raft_consensus-itest \
   --gtest_filter=\*LargeBat\* \
   --stress-cpu-threads=8

This would produce an error like:
Network error: failed to write to TLS socket: error:1409F07F:SSL 
routines:SSL3_WRITE_PENDING:bad write retry:s3_pkt.c:878

This means that we were retrying a write after getting EAGAIN, but with
a different buffer than the first time.

I tracked this down to mishandling of temporary socket errors in
TlsSocket::Writev(). In the case that we successfully write part of the
io vector but hit such an error trying to write a later element in the
vector, we were still propagating the error back up to the caller. The
caller didn't realize that part of the write was successful, and thus it
would retry the write from the beginning.

The fix is to fix the above, but also to enable partial writes in
TlsContext. The new test fails if either of the above two changes are
backed out.

Change-Id: If797f220f42bfb2e6f452b66f15e7a758e883472
Reviewed-on: http://gerrit.cloudera.org:8080/8570
Tested-by: Kudu Jenkins
Reviewed-by: Alexey Serbin 
(cherry picked from commit 64eb9f37b171419ed12a3795efe28faf2fd33b3d)
Reviewed-on: http://gerrit.cloudera.org:8080/8602
Tested-by: Alexey Serbin 
Reviewed-by: Todd Lipcon 
---
M src/kudu/security/tls_context.cc
M src/kudu/security/tls_socket-test.cc
M src/kudu/security/tls_socket.cc
3 files changed, 219 insertions(+), 70 deletions(-)

Approvals:
  Alexey Serbin: Verified
  Todd Lipcon: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/8602
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.4.x
Gerrit-MessageType: merged
Gerrit-Change-Id: If797f220f42bfb2e6f452b66f15e7a758e883472
Gerrit-Change-Number: 8602
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR](branch-1.3.x) [security] test and fixes for TLS socket EINTR issues

2017-11-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8599 )

Change subject: [security] test and fixes for TLS socket EINTR issues
..


Patch Set 2: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8599
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.3.x
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibec9049186f79f1c43295e4735538ed7ba4e516e
Gerrit-Change-Number: 8599
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 22 Nov 2017 06:43:41 +
Gerrit-HasComments: No


[kudu-CR](branch-1.3.x) KUDU-2218. tls socket: properly handle temporary socket errors in Writev

2017-11-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8600 )

Change subject: KUDU-2218. tls_socket: properly handle temporary socket errors 
in Writev
..


Patch Set 2: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8600
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.3.x
Gerrit-MessageType: comment
Gerrit-Change-Id: If797f220f42bfb2e6f452b66f15e7a758e883472
Gerrit-Change-Number: 8600
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 22 Nov 2017 06:43:58 +
Gerrit-HasComments: No


[kudu-CR](branch-1.4.x) KUDU-2218. tls socket: properly handle temporary socket errors in Writev

2017-11-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8602 )

Change subject: KUDU-2218. tls_socket: properly handle temporary socket errors 
in Writev
..


Patch Set 2: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8602
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.4.x
Gerrit-MessageType: comment
Gerrit-Change-Id: If797f220f42bfb2e6f452b66f15e7a758e883472
Gerrit-Change-Number: 8602
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 22 Nov 2017 06:43:01 +
Gerrit-HasComments: No


[kudu-CR](branch-1.4.x) [security] test and fixes for TLS socket EINTR issues

2017-11-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8601 )

Change subject: [security] test and fixes for TLS socket EINTR issues
..


Patch Set 2: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8601
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.4.x
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibec9049186f79f1c43295e4735538ed7ba4e516e
Gerrit-Change-Number: 8601
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 22 Nov 2017 06:42:46 +
Gerrit-HasComments: No


[kudu-CR](branch-1.4.x) [security] test and fixes for TLS socket EINTR issues

2017-11-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8601 )

Change subject: [security] test and fixes for TLS socket EINTR issues
..

[security] test and fixes for TLS socket EINTR issues

SSL_{read,write}() can return SSL_ERROR_WANT_{READ,WRITE}
correspondingly when signal interrupts recv()/send() calls even if
SSL_MODE_AUTO_RETRY is set in the TLS context.  To handle that
properly in Socket::Blocking{Recv,Write}() methods, return
NetworkError() with appropriate POSIX error code from
TlsSocket::{Recv,Write}().

As a by-product, this changelist fixes flakiness in TestUniqueClientIds
scenario of the ClientStressTest test and other flaky tests which failed
with errors like below:

  Bad status: IO error: Could not connect to the cluster: \
Client connection negotiation failed: client connection to \
IP:port: Read zero bytes on a blocking Recv() call: \
Transferred 0 of 4 bytes

Prior to this fix, the test failure ratio observed with dist-test
for TSAN builds was about 6% in multiple 1K runs.  After the fix,
no failures observed.

Change-Id: Ibec9049186f79f1c43295e4735538ed7ba4e516e
Reviewed-on: http://gerrit.cloudera.org:8080/8462
Tested-by: Kudu Jenkins
Reviewed-by: Dan Burkert 
Reviewed-by: Alexey Serbin 
(cherry picked from commit 18e024cf8bcaea192efb63780802cc4c799bbb9c)
(cherry picked from commit d98c24fa69ccea676cdf50857ebdff8fe866f751)
Reviewed-on: http://gerrit.cloudera.org:8080/8601
Tested-by: Alexey Serbin 
Reviewed-by: Todd Lipcon 
---
M src/kudu/security/CMakeLists.txt
M src/kudu/security/tls_handshake.cc
A src/kudu/security/tls_socket-test.cc
M src/kudu/security/tls_socket.cc
M src/kudu/util/net/socket.cc
5 files changed, 227 insertions(+), 7 deletions(-)

Approvals:
  Alexey Serbin: Verified
  Todd Lipcon: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/8601
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.4.x
Gerrit-MessageType: merged
Gerrit-Change-Id: Ibec9049186f79f1c43295e4735538ed7ba4e516e
Gerrit-Change-Number: 8601
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] tablet: mark delta tracker read-only on error

2017-11-21 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8605 )

Change subject: tablet: mark delta tracker read-only on error
..


Patch Set 9: Code-Review+2

Carrying forward Mike's +2


--
To view, visit http://gerrit.cloudera.org:8080/8605
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib950048e9cd0929a10714ab1cc2bd829835afced
Gerrit-Change-Number: 8605
Gerrit-PatchSet: 9
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 22 Nov 2017 06:42:31 +
Gerrit-HasComments: No


[kudu-CR] handle disk failures during tablet copies

2017-11-21 Thread Andrew Wong (Code Review)
Andrew Wong has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/7654 )

Change subject: handle disk failures during tablet copies
..

handle disk failures during tablet copies

There are two components in a tablet copy: the copy client (that
receives data) and the copy session source (that sends data).

Coarse-grain handling of disk failures during tablet copies is done for
the tablet copy client as:
- Before starting a copy client, if no disks are available to place the
  tablet, simply return (instead of failing a CHECK).
- Before downloading each WAL segments or block, check that the tablet
  is in a healthy group.
And for the tablet copy session as:
- Before sending a block or log segment, check if the tablet has an
  error.

Upon returning an error, the tablet copy client will shutdown the
replica, leaving it in a failed state.

A test is added to ensure that both copy clients and that source
sessions with failed disks will return errors to the copying client.

Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
Reviewed-on: http://gerrit.cloudera.org:8080/7654
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy 
---
M src/kudu/tserver/tablet_copy-test-base.h
M src/kudu/tserver/tablet_copy_client-test.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_client.h
M src/kudu/tserver/tablet_copy_service-test.cc
M src/kudu/tserver/tablet_copy_source_session.cc
M src/kudu/tserver/tablet_copy_source_session.h
7 files changed, 124 insertions(+), 7 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Mike Percy: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/7654
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
Gerrit-Change-Number: 7654
Gerrit-PatchSet: 10
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot


[kudu-CR] tablet: add early returns to maintenance functions

2017-11-21 Thread Andrew Wong (Code Review)
Andrew Wong has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8606 )

Change subject: tablet: add early returns to maintenance functions
..

tablet: add early returns to maintenance functions

When a Tablet is stopped, further maintenance ops are not scheduled,
further IO is prevented, etc. This patch optimizes this further to stop
various functions that are called by maintenance threads to prevent
their execution, returning an error instead.

Previously, certain ops (e.g. flush DMS) would guarantee durability by
checking for the success these functions. These checks are now replaced
with on-error checks that the tablet has been stopped, since these
failures are inconsequential if the tablet is stopped.

Currently this is an optimization for tablets that are shutting down to
return early from these calls, but in the future, this could be useful
in stopping all IO done by a tablet that is failing, e.g. due to disk
failure.

Change-Id: I84ad557851863f6fd9acff28ddcd1244e62cf516
Reviewed-on: http://gerrit.cloudera.org:8080/8606
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy 
---
M src/kudu/integration-tests/stop_tablet-itest.cc
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tablet/tablet_replica_mm_ops.cc
5 files changed, 34 insertions(+), 8 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Mike Percy: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/8606
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I84ad557851863f6fd9acff28ddcd1244e62cf516
Gerrit-Change-Number: 8606
Gerrit-PatchSet: 10
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] tablet: make various update paths atomic

2017-11-21 Thread Andrew Wong (Code Review)
Andrew Wong has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8441 )

Change subject: tablet: make various update paths atomic
..

tablet: make various update paths atomic

A few codepaths in the tablet subsystem aren't atomic, i.e. if they fail
mid-method (e.g. due to a file error), they leave some in-memory
structures updated and others untouched. This has been safe because we
CHECKed to ensure their success. In preparation for _not_ crashing in
these methods, this patch refactors some of these functions to be
atomic, and notes others that still have the possibility of failing in
such a state (these calls still trigger a CHECK failure).

There are no functional changes in this patch.

Change-Id: I0241cd4206ce77ef1fb334458b091bc2092f4141
Reviewed-on: http://gerrit.cloudera.org:8080/8441
Reviewed-by: Mike Percy 
Tested-by: Kudu Jenkins
---
M src/kudu/tablet/delta_compaction.cc
M src/kudu/tablet/delta_compaction.h
M src/kudu/tablet/delta_tracker.cc
M src/kudu/tablet/delta_tracker.h
M src/kudu/tablet/diskrowset.cc
M src/kudu/tablet/diskrowset.h
M src/kudu/tablet/metadata-test.cc
M src/kudu/tablet/rowset_metadata.cc
M src/kudu/tablet/rowset_metadata.h
9 files changed, 224 insertions(+), 169 deletions(-)

Approvals:
  Mike Percy: Looks good to me, approved
  Kudu Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/8441
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I0241cd4206ce77ef1fb334458b091bc2092f4141
Gerrit-Change-Number: 8441
Gerrit-PatchSet: 15
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR](branch-1.5.x) KUDU-2218. tls socket: properly handle temporary socket errors in Writev

2017-11-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8604 )

Change subject: KUDU-2218. tls_socket: properly handle temporary socket errors 
in Writev
..


Patch Set 2: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8604
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.5.x
Gerrit-MessageType: comment
Gerrit-Change-Id: If797f220f42bfb2e6f452b66f15e7a758e883472
Gerrit-Change-Number: 8604
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 22 Nov 2017 06:09:00 +
Gerrit-HasComments: No


[kudu-CR](branch-1.5.x) KUDU-2218. tls socket: properly handle temporary socket errors in Writev

2017-11-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8604 )

Change subject: KUDU-2218. tls_socket: properly handle temporary socket errors 
in Writev
..

KUDU-2218. tls_socket: properly handle temporary socket errors in Writev

This fixes a bug which caused RaftConsensusITest.TestLargeBatches to
fail when run under stress, as in the following command line:

taskset -c 0-4 \
 build/latest/bin/raft_consensus-itest \
   --gtest_filter=\*LargeBat\* \
   --stress-cpu-threads=8

This would produce an error like:
Network error: failed to write to TLS socket: error:1409F07F:SSL 
routines:SSL3_WRITE_PENDING:bad write retry:s3_pkt.c:878

This means that we were retrying a write after getting EAGAIN, but with
a different buffer than the first time.

I tracked this down to mishandling of temporary socket errors in
TlsSocket::Writev(). In the case that we successfully write part of the
io vector but hit such an error trying to write a later element in the
vector, we were still propagating the error back up to the caller. The
caller didn't realize that part of the write was successful, and thus it
would retry the write from the beginning.

The fix is to fix the above, but also to enable partial writes in
TlsContext. The new test fails if either of the above two changes are
backed out.

While merging from the main trunk, the SCOPED_CLEANUP macro was replaced
with MakeScopedCleanup.

Change-Id: If797f220f42bfb2e6f452b66f15e7a758e883472
Reviewed-on: http://gerrit.cloudera.org:8080/8570
Tested-by: Kudu Jenkins
Reviewed-by: Alexey Serbin 
(cherry picked from commit 64eb9f37b171419ed12a3795efe28faf2fd33b3d)
Reviewed-on: http://gerrit.cloudera.org:8080/8604
Tested-by: Alexey Serbin 
Reviewed-by: Todd Lipcon 
---
M src/kudu/security/tls_context.cc
M src/kudu/security/tls_socket-test.cc
M src/kudu/security/tls_socket.cc
3 files changed, 219 insertions(+), 70 deletions(-)

Approvals:
  Alexey Serbin: Verified
  Todd Lipcon: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/8604
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.5.x
Gerrit-MessageType: merged
Gerrit-Change-Id: If797f220f42bfb2e6f452b66f15e7a758e883472
Gerrit-Change-Number: 8604
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR](branch-1.5.x) [security] test and fixes for TLS socket EINTR issues

2017-11-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8603 )

Change subject: [security] test and fixes for TLS socket EINTR issues
..


Patch Set 2: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8603
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.5.x
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibec9049186f79f1c43295e4735538ed7ba4e516e
Gerrit-Change-Number: 8603
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 22 Nov 2017 06:08:20 +
Gerrit-HasComments: No


[kudu-CR](branch-1.5.x) [security] test and fixes for TLS socket EINTR issues

2017-11-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8603 )

Change subject: [security] test and fixes for TLS socket EINTR issues
..

[security] test and fixes for TLS socket EINTR issues

SSL_{read,write}() can return SSL_ERROR_WANT_{READ,WRITE}
correspondingly when signal interrupts recv()/send() calls even if
SSL_MODE_AUTO_RETRY is set in the TLS context.  To handle that
properly in Socket::Blocking{Recv,Write}() methods, return
NetworkError() with appropriate POSIX error code from
TlsSocket::{Recv,Write}().

As a by-product, this changelist fixes flakiness in TestUniqueClientIds
scenario of the ClientStressTest test and other flaky tests which failed
with errors like below:

  Bad status: IO error: Could not connect to the cluster: \
Client connection negotiation failed: client connection to \
IP:port: Read zero bytes on a blocking Recv() call: \
Transferred 0 of 4 bytes

Prior to this fix, the test failure ratio observed with dist-test
for TSAN builds was about 6% in multiple 1K runs.  After the fix,
no failures observed.

While merging from the main trunk, the SCOPED_CLEANUP macro was
replaced with MakeScopedCleanup.

Change-Id: Ibec9049186f79f1c43295e4735538ed7ba4e516e
Reviewed-on: http://gerrit.cloudera.org:8080/8462
Tested-by: Kudu Jenkins
Reviewed-by: Dan Burkert 
Reviewed-by: Alexey Serbin 
(cherry picked from commit 18e024cf8bcaea192efb63780802cc4c799bbb9c)
Reviewed-on: http://gerrit.cloudera.org:8080/8603
Reviewed-by: Todd Lipcon 
---
M src/kudu/security/CMakeLists.txt
M src/kudu/security/tls_handshake.cc
A src/kudu/security/tls_socket-test.cc
M src/kudu/security/tls_socket.cc
M src/kudu/util/net/socket.cc
5 files changed, 227 insertions(+), 7 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Todd Lipcon: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/8603
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.5.x
Gerrit-MessageType: merged
Gerrit-Change-Id: Ibec9049186f79f1c43295e4735538ed7ba4e516e
Gerrit-Change-Number: 8603
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] shutdown tablets on disk failure at runtime

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7442 )

Change subject: shutdown tablets on disk failure at runtime
..


Patch Set 11:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7442/11/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

http://gerrit.cloudera.org:8080/#/c/7442/11/src/kudu/tserver/ts_tablet_manager.cc@1367
PS11, Line 1367: Tablet* tablet = replica->tablet()
prefer: shared_ptr tablet = replica->shared_tablet();

because the TabletReplica is allowed to delete the Tablet when it shuts down.


http://gerrit.cloudera.org:8080/#/c/7442/11/src/kudu/tserver/ts_tablet_manager.cc@1368
PS11, Line 1368: (!tablet || tablet->HasBeenStopped())
hmm. this doesn't really make sense to me. Don't we want to shut down the 
replica regardless of this stuff?



--
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 11
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 22 Nov 2017 05:39:20 +
Gerrit-HasComments: Yes


[kudu-CR] error manager: synchronize/serialize handling

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8395 )

Change subject: error_manager: synchronize/serialize handling
..


Patch Set 12: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8395
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie61c408a0b4424f933f40a31147568c2f906be0e
Gerrit-Change-Number: 8395
Gerrit-PatchSet: 12
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 22 Nov 2017 05:29:18 +
Gerrit-HasComments: No


[kudu-CR] tablet: add early returns to maintenance functions

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8606 )

Change subject: tablet: add early returns to maintenance functions
..


Patch Set 9: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8606
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I84ad557851863f6fd9acff28ddcd1244e62cf516
Gerrit-Change-Number: 8606
Gerrit-PatchSet: 9
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 22 Nov 2017 05:26:49 +
Gerrit-HasComments: No


[kudu-CR] tablet: mark delta tracker read-only on error

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8605 )

Change subject: tablet: mark delta tracker read-only on error
..


Patch Set 8: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8605
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib950048e9cd0929a10714ab1cc2bd829835afced
Gerrit-Change-Number: 8605
Gerrit-PatchSet: 8
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 22 Nov 2017 05:25:55 +
Gerrit-HasComments: No


[kudu-CR] handle disk failures during tablet copies

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7654 )

Change subject: handle disk failures during tablet copies
..


Patch Set 9: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/7654
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
Gerrit-Change-Number: 7654
Gerrit-PatchSet: 9
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Comment-Date: Wed, 22 Nov 2017 05:26:11 +
Gerrit-HasComments: No


[kudu-CR] tablet: remove ignore result in ApplyRowOperation

2017-11-21 Thread Andrew Wong (Code Review)
Andrew Wong has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8620 )

Change subject: tablet: remove ignore_result in ApplyRowOperation
..

tablet: remove ignore_result in ApplyRowOperation

ApplyRowOperation() previously ignored the result it returned. This
patch updates this to check for errors that we expect, rather than a
blanket ignorance of all errors. Unexpected Statuses are returned.

Change-Id: I54ea74f5ee10733b2d7b13b1971f9177dddb50cb
Reviewed-on: http://gerrit.cloudera.org:8080/8620
Reviewed-by: Mike Percy 
Tested-by: Kudu Jenkins
---
M src/kudu/tablet/tablet.cc
1 file changed, 11 insertions(+), 5 deletions(-)

Approvals:
  Mike Percy: Looks good to me, approved
  Kudu Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/8620
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I54ea74f5ee10733b2d7b13b1971f9177dddb50cb
Gerrit-Change-Number: 8620
Gerrit-PatchSet: 5
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] shutdown tablets on disk failure at runtime

2017-11-21 Thread Andrew Wong (Code Review)
Hello Tidy Bot, Mike Percy, David Ribeiro Alves, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7442

to look at the new patch set (#11).

Change subject: shutdown tablets on disk failure at runtime
..

shutdown tablets on disk failure at runtime

Before, various code paths pass along disk failure Statuses until they
eventually hit a CHECK failure and crash the server. Such fatal errors
were "safe" by design, as they would ensure no additional changes were
made durable to each tablet. This patch aims to achieve similar behavior
for failed replicas while keeping the server alive.

These failures are permitted provided the following have occurred for
each tablet in the affected directory:
- The failed directory is immediately marked as failed, preventing
  further tablets from being striped across a failed disk.
- The tablet's MvccManager is shut down to prevent further writes from
  being made durable and preventing I/O to the tablet.
- A request is submitted to a threadpool to eventually completely shut
  down the replica, eventually marking it for eviction.

Beyond the above functionality, to cancel replica maintenance ops along
with the rest of the error handling, I updated the locking behavior of
TabletReplica so access to its maintenance ops can be done in a
thread-safe way (by guarding the list of ops with the replica's lock).

NOTE: failures of the metadata directory and the WAL directory are
fatal. Code paths that update these explicitly crash the server.

This is a part of a series of patches to handle disk failure. To see how
this patch fits in, see section 3 of:
https://docs.google.com/document/d/1yGVzDzV14mKReZ7EzlZZV_KfDBRnHJkRtlDox_RPXAA/edit

Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
---
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_server.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
7 files changed, 162 insertions(+), 47 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/7442/11
--
To view, visit http://gerrit.cloudera.org:8080/7442
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8
Gerrit-Change-Number: 7442
Gerrit-PatchSet: 11
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] shutdown tablets on disk failure at runtime

2017-11-21 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7442 )

Change subject: shutdown tablets on disk failure at runtime
..


Patch Set 10:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/7442/2/src/kudu/tablet/tablet.h
File src/kudu/tablet/tablet.h:

http://gerrit.cloudera.org:8080/#/c/7442/2/src/kudu/tablet/tablet.h@157
PS2, Line 157: 
> Seems like this has a new (sort of awkward) contract now since you have to
Hrm, the contract was meant to freeze any further or currently-running calls to 
these functions. Good point that we might just be able to get by by returning a 
status.


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.h
File src/kudu/tablet/tablet_replica.h:

http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.h@268
PS10, Line 268:   void RegisterMaintenanceOps(MaintenanceManager* 
maintenance_manager);
> warning: function 'kudu::tablet::TabletReplica::RegisterMaintenanceOps' has
Done


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.cc
File src/kudu/tablet/tablet_replica.cc:

http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.cc@733
PS10, Line 733: void TabletReplica::MakeUnavailable(const string& reason) {
> My suggestion for this function:
Done


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.cc@739
PS10, Line 739:   tablet_->Stop();
> is there a reason we need to call tablet_->Stop() while holding TabletRepli
Done


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.cc@740
PS10, Line 740:   tablet_->CancelMaintenanceOps();
> this isn't needed because Stop() does it automatically
Done


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.cc@749
PS10, Line 749: Status::IOError(reason, "", EIO)
> This seems a bit unexpected. If we need a Status why not pass it into MakeU
Done


http://gerrit.cloudera.org:8080/#/c/7442/2/src/kudu/tablet/tablet_replica_mm_ops.cc
File src/kudu/tablet/tablet_replica_mm_ops.cc:

http://gerrit.cloudera.org:8080/#/c/7442/2/src/kudu/tablet/tablet_replica_mm_ops.cc@137
PS2, Line 137:   CHECK(!tablet->rowsets_flush_sem_.try_lock());
> Hrm. Is it safe to change a CHECK() to a LOG(ERROR) here? I would like to s
Duly noted. I'm still thinking about how to best test this (other than the 
integration tests for disk failure as a whole), which I could move into this 
patch as well.

Also note that this is being replaced with a LOG(ERROR) and then another CHECK 
below. If there is a disk failure, it _shouldn't_ matter what state the MRSs 
are in, since the tablet shouldn't be used anyway. In the error-handling code, 
it's set to FAILED, so scans won't be able to touch it, and further 
transactions and maintenance ops should no longer be run.


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.h@125
PS10, Line 125:   // Transitions the tablet state specified by 'tablet_id' with 
the specified
> How about using this for the comment:
Done


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.h@127
PS10, Line 127:   Status StartTransitionTabletState(const std::string& 
tablet_id,
> 1. why is this public?
1. Done
2. Done
3. I think we do need to transition to avoid bootstrap/copy overlapping with 
shutdown


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.h@212
PS10, Line 212: FailTablet
> Perhaps a better name for this method would be FailTabletAndScheduleShutdow
Done


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.h@309
PS10, Line 309: FailTabletAsync
> The Async suffix makes me think that this schedules work on a background th
Done


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.cc@1316
PS10, Line 1316: if (!s.IsDiskFailure()) {
   :   return s;
   : }
   : LOG(FATAL) << Substitute("Tablet $0's consensus metadata 
is in a failed disk", tablet_id);
> 1) s/in a failed disk/on a failed disk/
Done


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.cc@1351
PS10, Line 1351:   LOG(INFO) << Substitute("Forcing failure of tablet $0", 
tablet_id);
> use LogPrefix(tablet_id) here
Done


http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.cc@1372
PS10, Line 1372: s = StartTransitionTabletState(tablet_id, "failing tablet",
> What is the purpose of all this transitioning stuff in here?
Logically, that's what it's doing, but I think shutdown, bootstrapping, and 
copying are designed to be synchronized in th

[kudu-CR] error manager: synchronize/serialize handling

2017-11-21 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8395 )

Change subject: error_manager: synchronize/serialize handling
..


Patch Set 12:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8395/9/src/kudu/fs/error_manager-test.cc
File src/kudu/fs/error_manager-test.cc:

http://gerrit.cloudera.org:8080/#/c/8395/9/src/kudu/fs/error_manager-test.cc@92
PS9, Line 92: SleepForRand();
> Why not just "curry" it, which means embed the parameter at the bind site?
Oh neat! Didn't realize we had such currying abilities in K++ :)


http://gerrit.cloudera.org:8080/#/c/8395/11/src/kudu/fs/error_manager-test.cc
File src/kudu/fs/error_manager-test.cc:

http://gerrit.cloudera.org:8080/#/c/8395/11/src/kudu/fs/error_manager-test.cc@184
PS11, Line 184:
> since you are using emplace_back, you can now omit the thread( part also
Done



--
To view, visit http://gerrit.cloudera.org:8080/8395
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie61c408a0b4424f933f40a31147568c2f906be0e
Gerrit-Change-Number: 8395
Gerrit-PatchSet: 12
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 22 Nov 2017 03:17:52 +
Gerrit-HasComments: Yes


[kudu-CR] error manager: synchronize/serialize handling

2017-11-21 Thread Andrew Wong (Code Review)
Hello Tidy Bot, Mike Percy, Kudu Jenkins, Adar Dembo, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8395

to look at the new patch set (#12).

Change subject: error_manager: synchronize/serialize handling
..

error_manager: synchronize/serialize handling

The FsErrorManager is helper infrastructure that other classes can use to
help provide API contracts related to error handling.

Here is an example error-handling contract provided to the TSTabletManager
in a future patch, if a disk failure Status is returned during tablet
operation, either:
1) the tablet server will crash and we can rely on Kudu's crash-consistency
   mechanisms for safety, or
2) any affected Tablet will transition to the 'kStopped' state via an error
   manager callback. Most components will simply pass the non-OK Status up
   the call chain. However, if a method expects an IO operation to
   succeed for correctness purposes, and it receives an error Status,
   then it should check that the Tablet is stopped. If so, it can be
   assumed that the error was handled. Otherwise, Kudu must rely on the
   crash-consistency mechanisms and crash.

Given the above contract, The state of a tablet server post-disk-failure
depends significantly on the completion of disk-failure-handling
callbacks. Error-handling _must_ finish before anything is propagated
back to the offending caller.

To ensure correct completion of error-handling even in a concurrent
setting, this patch extends the error manager with a mutex and a second
error-handling callback type. This ensures that errors indirectly caused
by disk failures can be handled by non-disk-specific handling,
serializing failure-handling in the same fashion.

As an example of where this is necessary, say a tablet has data in a
single directory and hits a bad disk. That directory is immediately
marked failed and handling starts to fail all tablets in the directory.
Before, if the tablet were to create a new block before being failed, it
would fail immediately, complaining that no directories are available,
and would eventually fail a CHECK that translates roughly to: "Has error
handling for this tablet completed?" By wrapping block creation with
tablet-specific error handling and with serialized error-handling, this
CHECK will pass.

A previous revision accomplished this by using a single-threaded
threadpool to trigger error-handling and ensuring completion of
error-handling by waiting on the threadpool. I ultimately went with this
implentation since it's a bit more straight-forward wrt when to make the
different calls (i.e. "trigger disk error-handling vs trigger tablet
error-handling", instead of "trigger error-handling vs wait for
error-handling to finish"). This also has the benefit of being
extendable for other kinds of errors in the future.

Change-Id: Ie61c408a0b4424f933f40a31147568c2f906be0e
---
M src/kudu/fs/CMakeLists.txt
M src/kudu/fs/data_dirs-test.cc
A src/kudu/fs/error_manager-test.cc
A src/kudu/fs/error_manager.cc
M src/kudu/fs/error_manager.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/fs_manager.cc
M src/kudu/fs/fs_manager.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/tserver/tablet_server.cc
M src/kudu/util/status.h
12 files changed, 389 insertions(+), 58 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/95/8395/12
--
To view, visit http://gerrit.cloudera.org:8080/8395
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie61c408a0b4424f933f40a31147568c2f906be0e
Gerrit-Change-Number: 8395
Gerrit-PatchSet: 12
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] tablet: make various update paths atomic

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8441 )

Change subject: tablet: make various update paths atomic
..


Patch Set 14: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8441
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0241cd4206ce77ef1fb334458b091bc2092f4141
Gerrit-Change-Number: 8441
Gerrit-PatchSet: 14
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 22 Nov 2017 03:02:46 +
Gerrit-HasComments: No


[kudu-CR] tablet: remove ignore result in ApplyRowOperation

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8620 )

Change subject: tablet: remove ignore_result in ApplyRowOperation
..


Patch Set 4: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8620
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I54ea74f5ee10733b2d7b13b1971f9177dddb50cb
Gerrit-Change-Number: 8620
Gerrit-PatchSet: 4
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 22 Nov 2017 03:02:32 +
Gerrit-HasComments: No


[kudu-CR] tablet: remove ignore result in ApplyRowOperation

2017-11-21 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8620 )

Change subject: tablet: remove ignore_result in ApplyRowOperation
..


Patch Set 4:

fixed iwyu and rebased.


--
To view, visit http://gerrit.cloudera.org:8080/8620
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I54ea74f5ee10733b2d7b13b1971f9177dddb50cb
Gerrit-Change-Number: 8620
Gerrit-PatchSet: 4
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 22 Nov 2017 02:54:01 +
Gerrit-HasComments: No


[kudu-CR] tablet: remove ignore result in ApplyRowOperation

2017-11-21 Thread Andrew Wong (Code Review)
Hello Tidy Bot, Mike Percy, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8620

to look at the new patch set (#4).

Change subject: tablet: remove ignore_result in ApplyRowOperation
..

tablet: remove ignore_result in ApplyRowOperation

ApplyRowOperation() previously ignored the result it returned. This
patch updates this to check for errors that we expect, rather than a
blanket ignorance of all errors. Unexpected Statuses are returned.

Change-Id: I54ea74f5ee10733b2d7b13b1971f9177dddb50cb
---
M src/kudu/tablet/tablet.cc
1 file changed, 11 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/20/8620/4
--
To view, visit http://gerrit.cloudera.org:8080/8620
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I54ea74f5ee10733b2d7b13b1971f9177dddb50cb
Gerrit-Change-Number: 8620
Gerrit-PatchSet: 4
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] tablet: remove ignore result in ApplyRowOperation

2017-11-21 Thread Andrew Wong (Code Review)
Hello Mike Percy, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8620

to look at the new patch set (#3).

Change subject: tablet: remove ignore_result in ApplyRowOperation
..

tablet: remove ignore_result in ApplyRowOperation

ApplyRowOperation() previously ignored the result it returned. This
patch updates this to check for errors that we expect, rather than a
blanket ignorance of all errors. Unexpected Statuses are returned.

Change-Id: I54ea74f5ee10733b2d7b13b1971f9177dddb50cb
---
M src/kudu/tablet/tablet.cc
1 file changed, 11 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/20/8620/3
--
To view, visit http://gerrit.cloudera.org:8080/8620
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I54ea74f5ee10733b2d7b13b1971f9177dddb50cb
Gerrit-Change-Number: 8620
Gerrit-PatchSet: 3
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] ntp: dump diagnostics when NTP is unsynchronized or error is too high

2017-11-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8557 )

Change subject: ntp: dump diagnostics when NTP is unsynchronized or error is 
too high
..

ntp: dump diagnostics when NTP is unsynchronized or error is too high

This adds some code to shell out to various NTP diagnostics tools when
clock problems are detected, just before fataling.

Hopefully this will make it easier to diagnose improper NTP setups in
the wild.

Change-Id: Ifeb2206a0475b8c8e183a74aee21315e6e43dc33
Reviewed-on: http://gerrit.cloudera.org:8080/8557
Reviewed-by: Alexey Serbin 
Tested-by: Kudu Jenkins
---
M src/kudu/clock/hybrid_clock-test.cc
M src/kudu/clock/hybrid_clock.cc
M src/kudu/clock/hybrid_clock.h
M src/kudu/clock/system_ntp.cc
M src/kudu/clock/system_ntp.h
M src/kudu/clock/time_service.h
M src/kudu/security/test/mini_kdc.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/util/path_util.cc
M src/kudu/util/path_util.h
M src/kudu/util/test_util.cc
M src/kudu/util/test_util.h
12 files changed, 154 insertions(+), 42 deletions(-)

Approvals:
  Alexey Serbin: Looks good to me, approved
  Kudu Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/8557
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ifeb2206a0475b8c8e183a74aee21315e6e43dc33
Gerrit-Change-Number: 8557
Gerrit-PatchSet: 6
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 


[kudu-CR] schema: fast-path for Schema::KeyEquals, etc.

2017-11-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8627 )

Change subject: schema: fast-path for Schema::KeyEquals, etc.
..

schema: fast-path for Schema::KeyEquals, etc.

If two schemas are the same object, then KeyEquals can be fast-pathed.
The same goes for various ColumnSchema equality checks.

This hasn't been a hot path in release builds, but I noticed a lot of wasted
CPU in a debug build evaluating Schema::KeyEquals, and my suspicion is that in
a great number of cases the objects are identical. Even in release builds, may
as well avoid the unnecessary cycles.

Change-Id: Ia5fadfd70ea651b9105837535f21a34c6e0af6ae
Reviewed-on: http://gerrit.cloudera.org:8080/8627
Reviewed-by: Alexey Serbin 
Tested-by: Kudu Jenkins
---
M src/kudu/common/schema.h
1 file changed, 5 insertions(+), 0 deletions(-)

Approvals:
  Alexey Serbin: Looks good to me, approved
  Kudu Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/8627
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ia5fadfd70ea651b9105837535f21a34c6e0af6ae
Gerrit-Change-Number: 8627
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] consensus: Add gflag to enable 3-4-3 re-replication

2017-11-21 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8626 )

Change subject: consensus: Add gflag to enable 3-4-3 re-replication
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc@124
PS1, Line 124: DEFINE_bool(raft_rereplication_add_before_evict, false,
 : "When enabled, failed replicas will only be evicted 
after a "
 : "replacement has been added to the config.");
> I think 'raft_prepare_replacement_before_eviction' sounds much better.
A little bit more of bike-shedding from my side: I think it's worth noting in 
the description that this flag means different things in case of 
master/catalog_manager and tserver.

Instead, maybe, introduce a flag to both tserver and master, named like 
'ts_tablet_prepare_replacement_before_eviction' ?

Naming is hard.



--
To view, visit http://gerrit.cloudera.org:8080/8626
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I247842af7d0172d6e6b167a29613f6b261990a01
Gerrit-Change-Number: 8626
Gerrit-PatchSet: 1
Gerrit-Owner: Mike Percy 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Comment-Date: Wed, 22 Nov 2017 02:21:30 +
Gerrit-HasComments: Yes


[kudu-CR] consensus: Add gflag to enable 3-4-3 re-replication

2017-11-21 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8626 )

Change subject: consensus: Add gflag to enable 3-4-3 re-replication
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc@124
PS1, Line 124: DEFINE_bool(raft_rereplication_add_before_evict, false,
 : "When enabled, failed replicas will only be evicted 
after a "
 : "replacement has been added to the config.");
> I don't like putting version numbers in for config flags because they often
I think 'raft_prepare_replacement_before_eviction' sounds much better.



--
To view, visit http://gerrit.cloudera.org:8080/8626
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I247842af7d0172d6e6b167a29613f6b261990a01
Gerrit-Change-Number: 8626
Gerrit-PatchSet: 1
Gerrit-Owner: Mike Percy 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Comment-Date: Wed, 22 Nov 2017 01:43:43 +
Gerrit-HasComments: Yes


[kudu-CR] WIP [catalog manager] added 3-4-3 behavior

2017-11-21 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8619 )

Change subject: WIP [catalog_manager] added 3-4-3 behavior
..


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/consensus/quorum_util.cc
File src/kudu/consensus/quorum_util.cc:

http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/consensus/quorum_util.cc@421
PS2, Line 421: peer.attrs().has_promote()
> I don't think we need to rely on has_ since this defaults to false.
Done


http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/consensus/quorum_util.cc@450
PS2, Line 450:   for (const RaftPeerPB& peer : config.peers()) {
> I find the logic a little tough to follow in here. Maybe it would be a litt
I thought about that but it would require keeping the finding in the temporary 
containers, etc. while it's possible to do so without allocating an extra stuff.

But if it would be easier to follow -- sure, I'll update that as needed.

Yeah, that a good point about using the same code for deciding  to evict either 
a voter or non-voter.  The logic in the functions to evict voter and non-voters 
was using different criteria in the first version, but now it's the same.  I'll 
update that, thanks.


http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/master/catalog_manager.cc@204
PS2, Line 204: // TODO(aserbin): can we get a better name for this flag? 
Ideally, there should
> i posted a patch at https://gerrit.cloudera.org/c/8626/ so maybe we can use
thanks! that would be a good thing to have.


http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/master/catalog_manager.cc@3007
PS2, Line 3007: member_type_(member_type)
> this field only seems relevant for adding new replicas or a forced promotio
good catch -- i'll move it into appropriate function.


http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/master/catalog_manager.cc@3198
PS2, Line 3198: peer->set_member_type(member_type_);
> why is this field needed for an eviction?
Good point: I think it's not needed for eviction.  I'll move it into the 
AsyncAddReplicaTask.


http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/master/catalog_manager.cc@3482
PS2, Line 3482: else {
  : string to_evict;
  : if
> nit: we can make this an "else if" and avoid an additional level of nesting
I was trying to keep to_evict local.  All right, having less level of nesting 
is a priority makes more sense, I'll switch to 'else if'.



--
To view, visit http://gerrit.cloudera.org:8080/8619
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6f0469ac641bf7a03dbef01eaa3f1b58a5bf5d27
Gerrit-Change-Number: 8619
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 22 Nov 2017 01:40:52 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2222. update scan delta compact-test: increase write timeout

2017-11-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8628 )

Change subject: KUDU-. update_scan_delta_compact-test: increase write 
timeout
..

KUDU-. update_scan_delta_compact-test: increase write timeout

In ASAN builds, the 5-second timeout on 1000-row batches was not sufficient to
prevent write timeouts.

With this, the test passed on dist-test 100/100[1], whereas previously it
failed 87/100[2].

[1] http://dist-test.cloudera.org/job?job_id=todd.1511306572.114498
[2] http://dist-test.cloudera.org/job?job_id=todd.1511304651.87727

Change-Id: Ic393579320b87e0f5cec494870441b8aa69db3b6
Reviewed-on: http://gerrit.cloudera.org:8080/8628
Reviewed-by: Alexey Serbin 
Tested-by: Kudu Jenkins
---
M src/kudu/integration-tests/update_scan_delta_compact-test.cc
1 file changed, 1 insertion(+), 1 deletion(-)

Approvals:
  Alexey Serbin: Looks good to me, approved
  Kudu Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/8628
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ic393579320b87e0f5cec494870441b8aa69db3b6
Gerrit-Change-Number: 8628
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] Add a test for data consistency after stopping tablets

2017-11-21 Thread Andrew Wong (Code Review)
Andrew Wong has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8618 )

Change subject: Add a test for data consistency after stopping tablets
..

Add a test for data consistency after stopping tablets

This adds a new test which starts a write workload and then shuts down all of
the servers while the writes are in-flight. The shutdown of the servers
causes the tablets to shut down, hence testing the new "abort" path for
in-flight writes.

Passed 1000/1000 in a TSAN build:
http://dist-test.cloudera.org/job?job_id=todd.1511294742.123710

Change-Id: I7b3b9e90fd1fbfb555a22074fd4a777a2b2e56e2
Reviewed-on: http://gerrit.cloudera.org:8080/8618
Reviewed-by: Andrew Wong 
Tested-by: Andrew Wong 
Reviewed-by: Mike Percy 
---
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/cluster_verifier.h
M src/kudu/integration-tests/stop_tablet-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/integration-tests/test_workload.h
M src/kudu/mini-cluster/external_mini_cluster.cc
M src/kudu/mini-cluster/internal_mini_cluster.cc
M src/kudu/tools/ksck_remote.cc
8 files changed, 81 insertions(+), 15 deletions(-)

Approvals:
  Andrew Wong: Looks good to me, approved; Verified
  Mike Percy: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/8618
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I7b3b9e90fd1fbfb555a22074fd4a777a2b2e56e2
Gerrit-Change-Number: 8618
Gerrit-PatchSet: 4
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] consensus: Add gflag to enable 3-4-3 re-replication

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8626 )

Change subject: consensus: Add gflag to enable 3-4-3 re-replication
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc@124
PS1, Line 124: DEFINE_bool(raft_rereplication_add_before_evict, false,
 : "When enabled, failed replicas will only be evicted 
after a "
 : "replacement has been added to the config.");
> OK, I see.
I don't like putting version numbers in for config flags because they often 
last long after the versions have meaning, but you make a good point.

How about: raft_prepare_replacement_before_eviction

This is something that is actually new, and that would apply to many future 
designs, and represents something we're not doing that causes availability 
problems today.



--
To view, visit http://gerrit.cloudera.org:8080/8626
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I247842af7d0172d6e6b167a29613f6b261990a01
Gerrit-Change-Number: 8626
Gerrit-PatchSet: 1
Gerrit-Owner: Mike Percy 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Comment-Date: Wed, 22 Nov 2017 00:44:41 +
Gerrit-HasComments: Yes


[kudu-CR] Add a test for data consistency after stopping tablets

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8618 )

Change subject: Add a test for data consistency after stopping tablets
..


Patch Set 3: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8618
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7b3b9e90fd1fbfb555a22074fd4a777a2b2e56e2
Gerrit-Change-Number: 8618
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 22 Nov 2017 00:41:41 +
Gerrit-HasComments: No


[kudu-CR] Add a test for data consistency after stopping tablets

2017-11-21 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8618 )

Change subject: Add a test for data consistency after stopping tablets
..


Patch Set 3: Verified+1 Code-Review+2

Build failure was caused by KUDU-.


--
To view, visit http://gerrit.cloudera.org:8080/8618
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7b3b9e90fd1fbfb555a22074fd4a777a2b2e56e2
Gerrit-Change-Number: 8618
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 22 Nov 2017 00:38:24 +
Gerrit-HasComments: No


[kudu-CR] Add a test for data consistency after stopping tablets

2017-11-21 Thread Andrew Wong (Code Review)
Andrew Wong has removed a vote on this change.

Change subject: Add a test for data consistency after stopping tablets
..


Removed Verified-1 by Kudu Jenkins (120)
--
To view, visit http://gerrit.cloudera.org:8080/8618
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I7b3b9e90fd1fbfb555a22074fd4a777a2b2e56e2
Gerrit-Change-Number: 8618
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] ntp: dump diagnostics when NTP is unsynchronized or error is too high

2017-11-21 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8557 )

Change subject: ntp: dump diagnostics when NTP is unsynchronized or error is 
too high
..


Patch Set 5: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8557
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifeb2206a0475b8c8e183a74aee21315e6e43dc33
Gerrit-Change-Number: 8557
Gerrit-PatchSet: 5
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-Comment-Date: Wed, 22 Nov 2017 00:38:04 +
Gerrit-HasComments: No


[kudu-CR] schema: fast-path for Schema::KeyEquals, etc.

2017-11-21 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8627 )

Change subject: schema: fast-path for Schema::KeyEquals, etc.
..


Patch Set 2: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8627
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia5fadfd70ea651b9105837535f21a34c6e0af6ae
Gerrit-Change-Number: 8627
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 22 Nov 2017 00:36:32 +
Gerrit-HasComments: No


[kudu-CR] schema: fast-path for Schema::KeyEquals, etc.

2017-11-21 Thread Todd Lipcon (Code Review)
Hello Alexey Serbin, Kudu Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8627

to look at the new patch set (#2).

Change subject: schema: fast-path for Schema::KeyEquals, etc.
..

schema: fast-path for Schema::KeyEquals, etc.

If two schemas are the same object, then KeyEquals can be fast-pathed.
The same goes for various ColumnSchema equality checks.

This hasn't been a hot path in release builds, but I noticed a lot of wasted
CPU in a debug build evaluating Schema::KeyEquals, and my suspicion is that in
a great number of cases the objects are identical. Even in release builds, may
as well avoid the unnecessary cycles.

Change-Id: Ia5fadfd70ea651b9105837535f21a34c6e0af6ae
---
M src/kudu/common/schema.h
1 file changed, 5 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/27/8627/2
--
To view, visit http://gerrit.cloudera.org:8080/8627
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia5fadfd70ea651b9105837535f21a34c6e0af6ae
Gerrit-Change-Number: 8627
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] schema: fast-path for Schema::KeyEquals

2017-11-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8627 )

Change subject: schema: fast-path for Schema::KeyEquals
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8627/1/src/kudu/common/schema.h
File src/kudu/common/schema.h:

http://gerrit.cloudera.org:8080/#/c/8627/1/src/kudu/common/schema.h@706
PS1, Line 706: if (this == &other) return true;
> Does it make sense to add the same for ColumnSchema::{Equals,EqualsType,Equ
Done



--
To view, visit http://gerrit.cloudera.org:8080/8627
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia5fadfd70ea651b9105837535f21a34c6e0af6ae
Gerrit-Change-Number: 8627
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 22 Nov 2017 00:30:03 +
Gerrit-HasComments: Yes


[kudu-CR] tablet: introduce closed mvcc and stopped tablets

2017-11-21 Thread Andrew Wong (Code Review)
Andrew Wong has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/7439 )

Change subject: tablet: introduce closed mvcc and stopped tablets
..

tablet: introduce closed mvcc and stopped tablets

Currently, the only way to stop an Applying transaction is to wait for
it to finish and Commit it. This constraint was put in place to
guarantee on-disk correctness, but is sometimes too strict. E.g. if the
tablet is shutting down, the Apply doesn't need to finish.

This patch adds a new state to the MvccManager in which it is closed
for transactions. Once in this closed state:
1. New Applies will return and not move to the Commit phase, and any
   methods waiting for the tablet's Applies to Commit (e.g. new snapshot
   scans, FlushMRS) will respond with an error immediately. This allows
   an escape from the existing invariant that Applies _must_ Commit,
   provided the MvccManager is in this closed state.
2. Applies that are already underway may still Commit, but will return
   early on a best-effort basis. These non-Committed operations are
   inconsequential w.r.t. consistency; having some in-flight
   transactions Commit and others not is consistent with the server
   shutting down in between the Commits of two transactions.
3. New transactions drivers will abort immediately before even reaching
   the Prepare phase, ensuring no more writes to the tablet are made
   durable.

The Tablet class uses this closed MVCC state in a new "stopped" state of
its own. A Tablet that has been stopped will avoid further activity:
its MvccManager is closed to prevent further writes, and its maintenance
ops are cancelled to prevent further scheduling.

This patch includes these new behaviors when shutting down a tablet,
with the assumption that a tablet will only be shut down when it's being
deleted and we don't care too much about its in-flight transactions
Committing or its further maintenance ops. Code paths that previously
crashed if Applies did not succeed (e.g. TransactionDriver::ApplyTask,
MvccManager::AbortTransaction, etc.) or that waited for Applies to
finish (e.g. Tablet:: FlushUnlocked) will now _not_ crash if the Tablet
has been stopped and will log a warning instead.

Testing is done by adding the following:
- a test in mvcc-test to shut down MVCC and delete an Applying
  transaction, ensuring that there are no errors when it leaves scope.
- a test in mvcc-test to wait on an Applying transaction, shut down
  MVCC, and ensure that any waiters will return with an error.
- a new test stop_tablet-itest is added to ensure stopped leaders block
  writes (because they cannot start new transactions) and stopped
  followers don't (because while they cannot service the op, there still
  exists a majority that can); that stopped tablets don't prevent
  fault-tolerant scans; and that stopping the only tablet does prevent
  scans

Change-Id: I983620f27e7226806a2cca253db7619731914d42
Reviewed-on: http://gerrit.cloudera.org:8080/7439
Reviewed-by: Mike Percy 
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon 
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/stop_tablet-itest.cc
M src/kudu/tablet/local_tablet_writer.h
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/rowset.cc
M src/kudu/tablet/tablet-harness.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica_mm_ops.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tserver/tablet_service.cc
17 files changed, 733 insertions(+), 135 deletions(-)

Approvals:
  Mike Percy: Looks good to me, approved
  Kudu Jenkins: Verified
  Todd Lipcon: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/7439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42
Gerrit-Change-Number: 7439
Gerrit-PatchSet: 40
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] tablet: introduce closed mvcc and stopped tablets

2017-11-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7439 )

Change subject: tablet: introduce closed mvcc and stopped tablets
..


Patch Set 39: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/7439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42
Gerrit-Change-Number: 7439
Gerrit-PatchSet: 39
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 22 Nov 2017 00:27:39 +
Gerrit-HasComments: No


[kudu-CR] ntp: dump diagnostics when NTP is unsynchronized or error is too high

2017-11-21 Thread Todd Lipcon (Code Review)
Hello Will Berkeley, Tidy Bot, Alexey Serbin, Jean-Daniel Cryans, Andrew Wong, 
Kudu Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8557

to look at the new patch set (#5).

Change subject: ntp: dump diagnostics when NTP is unsynchronized or error is 
too high
..

ntp: dump diagnostics when NTP is unsynchronized or error is too high

This adds some code to shell out to various NTP diagnostics tools when
clock problems are detected, just before fataling.

Hopefully this will make it easier to diagnose improper NTP setups in
the wild.

Change-Id: Ifeb2206a0475b8c8e183a74aee21315e6e43dc33
---
M src/kudu/clock/hybrid_clock-test.cc
M src/kudu/clock/hybrid_clock.cc
M src/kudu/clock/hybrid_clock.h
M src/kudu/clock/system_ntp.cc
M src/kudu/clock/system_ntp.h
M src/kudu/clock/time_service.h
M src/kudu/security/test/mini_kdc.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/util/path_util.cc
M src/kudu/util/path_util.h
M src/kudu/util/test_util.cc
M src/kudu/util/test_util.h
12 files changed, 154 insertions(+), 42 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/57/8557/5
--
To view, visit http://gerrit.cloudera.org:8080/8557
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ifeb2206a0475b8c8e183a74aee21315e6e43dc33
Gerrit-Change-Number: 8557
Gerrit-PatchSet: 5
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 


[kudu-CR] ntp: dump NTP diagnostics when NTP is unsynchronized or error is too high

2017-11-21 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8557 )

Change subject: ntp: dump NTP diagnostics when NTP is unsynchronized or error 
is too high
..


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.h
File src/kudu/clock/system_ntp.h:

http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.h@50
PS4, Line 50: log
> can't do default parameters since it's a virtual method (tidy complained)
ah, sure


http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.cc
File src/kudu/clock/system_ntp.cc:

http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.cc@90
PS4, Line 90: ERROR
> I wanted to log the whole thing at ERROR level since this is only used righ
sgtm



--
To view, visit http://gerrit.cloudera.org:8080/8557
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifeb2206a0475b8c8e183a74aee21315e6e43dc33
Gerrit-Change-Number: 8557
Gerrit-PatchSet: 4
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-Comment-Date: Wed, 22 Nov 2017 00:22:25 +
Gerrit-HasComments: Yes


[kudu-CR] KUDU-2222. update scan delta compact-test: increase write timeout

2017-11-21 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8628 )

Change subject: KUDU-. update_scan_delta_compact-test: increase write 
timeout
..


Patch Set 1: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8628
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic393579320b87e0f5cec494870441b8aa69db3b6
Gerrit-Change-Number: 8628
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Comment-Date: Wed, 22 Nov 2017 00:20:31 +
Gerrit-HasComments: No


[kudu-CR] schema: fast-path for Schema::KeyEquals

2017-11-21 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8627 )

Change subject: schema: fast-path for Schema::KeyEquals
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8627/1/src/kudu/common/schema.h
File src/kudu/common/schema.h:

http://gerrit.cloudera.org:8080/#/c/8627/1/src/kudu/common/schema.h@706
PS1, Line 706: if (this == &other) return true;
Does it make sense to add the same for 
ColumnSchema::{Equals,EqualsType,EqualsPhysicalType} while you are at it?



--
To view, visit http://gerrit.cloudera.org:8080/8627
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia5fadfd70ea651b9105837535f21a34c6e0af6ae
Gerrit-Change-Number: 8627
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Comment-Date: Wed, 22 Nov 2017 00:18:48 +
Gerrit-HasComments: Yes


[kudu-CR] consensus: Add gflag to enable 3-4-3 re-replication

2017-11-21 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8626 )

Change subject: consensus: Add gflag to enable 3-4-3 re-replication
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc@124
PS1, Line 124: DEFINE_bool(raft_rereplication_add_before_evict, false,
 : "When enabled, failed replicas will only be evicted 
after a "
 : "replacement has been added to the config.");
> The idea is that this enables 3-4-3 re-replication, and must be set to "tru
OK, I see.

I remember there were some musings about preference of 3-2-3 path for the case 
when we definitely know that one out of 3 replicas is failed, no?  Overall, do 
you think '3-4-3' theme is good to describe the new way of auto-recovery in 
case of replica failure?

Maybe, something like 'raft_rereplication_v1' is the best way to name it?



--
To view, visit http://gerrit.cloudera.org:8080/8626
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I247842af7d0172d6e6b167a29613f6b261990a01
Gerrit-Change-Number: 8626
Gerrit-PatchSet: 1
Gerrit-Owner: Mike Percy 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Comment-Date: Wed, 22 Nov 2017 00:00:58 +
Gerrit-HasComments: Yes


[kudu-CR] ntp: dump NTP diagnostics when NTP is unsynchronized or error is too high

2017-11-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8557 )

Change subject: ntp: dump NTP diagnostics when NTP is unsynchronized or error 
is too high
..


Patch Set 4:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/8557/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8557/4//COMMIT_MSG@7
PS4, Line 7: NTP
> nit: drop this
Done


http://gerrit.cloudera.org:8080/#/c/8557/4//COMMIT_MSG@9
PS4, Line 9: ntp
> nit: NTP
Done


http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.h
File src/kudu/clock/system_ntp.h:

http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.h@50
PS4, Line 50: log
> nit: maybe, make this a parameter by default?
can't do default parameters since it's a virtual method (tidy complained)


http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.cc
File src/kudu/clock/system_ntp.cc:

http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.cc@90
PS4, Line 90: ERROR
> nit: maybe, just INFO in case of s.ok() ?
I wanted to log the whole thing at ERROR level since this is only used right 
before a FATAL, and this way we'll guarantee to see the info in the ERROR log 
along with the fatal message, whereas INFO might be filtered out.


http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.cc@77
PS4, Line 77: void TryRun(vector cmd, vector* log) {
:   string exe, out, err;
:   Status s = FindExecutable(cmd[0], {"/sbin", "/usr/sbin/"}, 
&exe);
:   if (!s.ok()) {
: LOG_STRING(WARNING, log) << "could not find executable: " << 
cmd[0];
: return;
:   }
:
:   cmd[0] = exe;
:   s = Subprocess::Call(cmd, "", &out, &err);
:   // Subprocess::Call() returns RuntimeError in the case that the 
process returns
:   // a non-zero exit code, but that might still generate useful 
err.
:   if (s.ok() || (s.IsRuntimeError() && (!out.empty() || 
!err.empty( {
: LOG_STRING(ERROR, log)
: << JoinStrings(cmd, " ")
: << "\n--"
: << (!out.empty() ? Substitute("\nstdout:\n$0", out) : "")
: << (!err.empty() ? Substitute("\nstderr:\n$0", err) : "")
: << "\n";
:   } else {
: LOG_STRING(WARNING, log) << "failed to run executable: " << 
cmd[0];
:   }
:
: }
> maybe, move this into src/kudu/util/subprocess.cc?
eh, I think this is specific enough to what we're doing here (logging the 
output to ERROR, even in the case of non-zero exit code, etc) that I would 
rather not put it in the util/ code


http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/util/subprocess.h
File src/kudu/util/subprocess.h:

http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/util/subprocess.h@220
PS4, Line 220: Status FindExecutable(const std::string& binary,
> maybe, the better place for that is in path_util.h?
Done


http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/util/test_util.cc
File src/kudu/util/test_util.cc:

http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/util/test_util.cc@a307
PS4, Line 307:
> remove the declaration part from test_util.h as well?
Done



--
To view, visit http://gerrit.cloudera.org:8080/8557
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifeb2206a0475b8c8e183a74aee21315e6e43dc33
Gerrit-Change-Number: 8557
Gerrit-PatchSet: 4
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-Comment-Date: Tue, 21 Nov 2017 23:59:49 +
Gerrit-HasComments: Yes


[kudu-CR] consensus: Add gflag to enable 3-4-3 re-replication

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8626 )

Change subject: consensus: Add gflag to enable 3-4-3 re-replication
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc@124
PS1, Line 124: DEFINE_bool(raft_rereplication_add_before_evict, false,
 : "When enabled, failed replicas will only be evicted 
after a "
 : "replacement has been added to the config.");
> Does this mean the replicas are to be evicted by the leader replica when th
The idea is that this enables 3-4-3 re-replication, and must be set to "true" 
on all masters and tservers to work properly.



--
To view, visit http://gerrit.cloudera.org:8080/8626
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I247842af7d0172d6e6b167a29613f6b261990a01
Gerrit-Change-Number: 8626
Gerrit-PatchSet: 1
Gerrit-Owner: Mike Percy 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Comment-Date: Tue, 21 Nov 2017 23:42:52 +
Gerrit-HasComments: Yes


[kudu-CR] consensus: Add gflag to enable 3-4-3 re-replication

2017-11-21 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8626 )

Change subject: consensus: Add gflag to enable 3-4-3 re-replication
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc@124
PS1, Line 124: DEFINE_bool(raft_rereplication_add_before_evict, false,
 : "When enabled, failed replicas will only be evicted 
after a "
 : "replacement has been added to the config.");
Does this mean the replicas are to be evicted by the leader replica when this 
is set to 'true'?  Or this flag actually switches to the new scheme where 
master, but not the leader replica, evicts failed replicas?



--
To view, visit http://gerrit.cloudera.org:8080/8626
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I247842af7d0172d6e6b167a29613f6b261990a01
Gerrit-Change-Number: 8626
Gerrit-PatchSet: 1
Gerrit-Owner: Mike Percy 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Comment-Date: Tue, 21 Nov 2017 23:41:25 +
Gerrit-HasComments: Yes


[kudu-CR] schema: fast-path for Schema::KeyEquals

2017-11-21 Thread Todd Lipcon (Code Review)
Hello Alexey Serbin,

I'd like you to do a code review. Please visit

http://gerrit.cloudera.org:8080/8627

to review the following change.


Change subject: schema: fast-path for Schema::KeyEquals
..

schema: fast-path for Schema::KeyEquals

If two schemas are the same object, then KeyEquals can be fast-pathed.

This hasn't been a hot path in release builds, but I noticed a lot of wasted
CPU in a debug build evaluating Schema::KeyEquals, and my suspicion is that in
a great number of cases the objects are identical. Even in release builds, may
as well avoid the unnecessary cycles.

Change-Id: Ia5fadfd70ea651b9105837535f21a34c6e0af6ae
---
M src/kudu/common/schema.h
1 file changed, 1 insertion(+), 0 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/27/8627/1
--
To view, visit http://gerrit.cloudera.org:8080/8627
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ia5fadfd70ea651b9105837535f21a34c6e0af6ae
Gerrit-Change-Number: 8627
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Alexey Serbin 


[kudu-CR] KUDU-2222. update scan delta compact-test: increase write timeout

2017-11-21 Thread Todd Lipcon (Code Review)
Hello Alexey Serbin,

I'd like you to do a code review. Please visit

http://gerrit.cloudera.org:8080/8628

to review the following change.


Change subject: KUDU-. update_scan_delta_compact-test: increase write 
timeout
..

KUDU-. update_scan_delta_compact-test: increase write timeout

In ASAN builds, the 5-second timeout on 1000-row batches was not sufficient to
prevent write timeouts.

With this, the test passed on dist-test 100/100[1], whereas previously it
failed 87/100[2].

[1] http://dist-test.cloudera.org/job?job_id=todd.1511306572.114498
[2] http://dist-test.cloudera.org/job?job_id=todd.1511304651.87727

Change-Id: Ic393579320b87e0f5cec494870441b8aa69db3b6
---
M src/kudu/integration-tests/update_scan_delta_compact-test.cc
1 file changed, 1 insertion(+), 1 deletion(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/28/8628/1
--
To view, visit http://gerrit.cloudera.org:8080/8628
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ic393579320b87e0f5cec494870441b8aa69db3b6
Gerrit-Change-Number: 8628
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Alexey Serbin 


[kudu-CR] ntp: dump NTP diagnostics when NTP is unsynchronized or error is too high

2017-11-21 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8557 )

Change subject: ntp: dump NTP diagnostics when NTP is unsynchronized or error 
is too high
..


Patch Set 4:

(7 comments)

overall looks good, just some nits

http://gerrit.cloudera.org:8080/#/c/8557/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8557/4//COMMIT_MSG@7
PS4, Line 7: NTP
nit: drop this


http://gerrit.cloudera.org:8080/#/c/8557/4//COMMIT_MSG@9
PS4, Line 9: ntp
nit: NTP


http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.h
File src/kudu/clock/system_ntp.h:

http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.h@50
PS4, Line 50: log
nit: maybe, make this a parameter by default?


http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.cc
File src/kudu/clock/system_ntp.cc:

http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.cc@90
PS4, Line 90: ERROR
nit: maybe, just INFO in case of s.ok() ?


http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.cc@77
PS4, Line 77: void TryRun(vector cmd, vector* log) {
:   string exe, out, err;
:   Status s = FindExecutable(cmd[0], {"/sbin", "/usr/sbin/"}, 
&exe);
:   if (!s.ok()) {
: LOG_STRING(WARNING, log) << "could not find executable: " << 
cmd[0];
: return;
:   }
:
:   cmd[0] = exe;
:   s = Subprocess::Call(cmd, "", &out, &err);
:   // Subprocess::Call() returns RuntimeError in the case that the 
process returns
:   // a non-zero exit code, but that might still generate useful 
err.
:   if (s.ok() || (s.IsRuntimeError() && (!out.empty() || 
!err.empty( {
: LOG_STRING(ERROR, log)
: << JoinStrings(cmd, " ")
: << "\n--"
: << (!out.empty() ? Substitute("\nstdout:\n$0", out) : "")
: << (!err.empty() ? Substitute("\nstderr:\n$0", err) : "")
: << "\n";
:   } else {
: LOG_STRING(WARNING, log) << "failed to run executable: " << 
cmd[0];
:   }
:
: }
maybe, move this into src/kudu/util/subprocess.cc?


http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/util/subprocess.h
File src/kudu/util/subprocess.h:

http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/util/subprocess.h@220
PS4, Line 220: Status FindExecutable(const std::string& binary,
maybe, the better place for that is in path_util.h?


http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/util/test_util.cc
File src/kudu/util/test_util.cc:

http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/util/test_util.cc@a307
PS4, Line 307:
remove the declaration part from test_util.h as well?



-- 
To view, visit http://gerrit.cloudera.org:8080/8557
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifeb2206a0475b8c8e183a74aee21315e6e43dc33
Gerrit-Change-Number: 8557
Gerrit-PatchSet: 4
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-Comment-Date: Tue, 21 Nov 2017 23:27:52 +
Gerrit-HasComments: Yes


[kudu-CR] WIP [catalog manager] added 3-4-3 behavior

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8619 )

Change subject: WIP [catalog_manager] added 3-4-3 behavior
..


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/consensus/quorum_util.cc
File src/kudu/consensus/quorum_util.cc:

http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/consensus/quorum_util.cc@421
PS2, Line 421: peer.attrs().has_promote()
I don't think we need to rely on has_ since this defaults to false.


http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/consensus/quorum_util.cc@450
PS2, Line 450:   for (const RaftPeerPB& peer : config.peers()) {
I find the logic a little tough to follow in here. Maybe it would be a little 
easier to follow if it was structured as 2 loops?

1. loop over the config and collect stats (number of healthy / failed voters, 
number of healthy / failed non-voters, number of nodes to replace)
2. decide whether any can be evicted, if not, return false
3. if a node can be evicted, loop again with a predicate indicating the 
criteria for the node to evict and return the first match.

Also, I wonder why we need a separate function for evicting voters and 
non-voters?


http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/master/catalog_manager.cc@204
PS2, Line 204: // TODO(aserbin): can we get a better name for this flag? 
Ideally, there should
i posted a patch at https://gerrit.cloudera.org/c/8626/ so maybe we can use 
that one everywhere


http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/master/catalog_manager.cc@3007
PS2, Line 3007: member_type_(member_type)
this field only seems relevant for adding new replicas or a forced promotion / 
demotion


http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/master/catalog_manager.cc@3198
PS2, Line 3198: peer->set_member_type(member_type_);
why is this field needed for an eviction?


http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/master/catalog_manager.cc@3482
PS2, Line 3482: else {
  : string to_evict;
  : if
nit: we can make this an "else if" and avoid an additional level of nesting



--
To view, visit http://gerrit.cloudera.org:8080/8619
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6f0469ac641bf7a03dbef01eaa3f1b58a5bf5d27
Gerrit-Change-Number: 8619
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 21 Nov 2017 23:24:41 +
Gerrit-HasComments: Yes


[kudu-CR] [java] Add ReplicaSelection in KuduScanToken

2017-11-21 Thread Dan Burkert (Code Review)
Dan Burkert has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8559 )

Change subject: [java] Add ReplicaSelection in KuduScanToken
..

[java] Add ReplicaSelection in KuduScanToken

This patch adds ReplicaSelection in KuduScanToken for java client,
so that deserializing a ScanToken results in propagating the replica
selection policy of the serializer into the deserializer.

Change-Id: I860fcc73e486642ab5177cfd0dc0bdc98fdc6914
Reviewed-on: http://gerrit.cloudera.org:8080/8559
Reviewed-by: Dan Burkert 
Tested-by: Kudu Jenkins
---
M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java
M 
java/kudu-client/src/test/java/org/apache/kudu/client/TestScannerMultiTablet.java
M src/kudu/client/client.proto
M src/kudu/common/common.proto
4 files changed, 50 insertions(+), 0 deletions(-)

Approvals:
  Dan Burkert: Looks good to me, approved
  Kudu Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/8559
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I860fcc73e486642ab5177cfd0dc0bdc98fdc6914
Gerrit-Change-Number: 8559
Gerrit-PatchSet: 3
Gerrit-Owner: Hao Hao 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins


[kudu-CR] consensus: Add gflag to enable 3-4-3 re-replication

2017-11-21 Thread Mike Percy (Code Review)
Hello Alexey Serbin,

I'd like you to do a code review. Please visit

http://gerrit.cloudera.org:8080/8626

to review the following change.


Change subject: consensus: Add gflag to enable 3-4-3 re-replication
..

consensus: Add gflag to enable 3-4-3 re-replication

This gflag will be used to control whether 3-4-3 re-replication is used
on a cluster.

Change-Id: I247842af7d0172d6e6b167a29613f6b261990a01
---
M src/kudu/consensus/raft_consensus.cc
1 file changed, 7 insertions(+), 0 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/26/8626/1
--
To view, visit http://gerrit.cloudera.org:8080/8626
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I247842af7d0172d6e6b167a29613f6b261990a01
Gerrit-Change-Number: 8626
Gerrit-PatchSet: 1
Gerrit-Owner: Mike Percy 
Gerrit-Reviewer: Alexey Serbin 


[kudu-CR] Add a test for data consistency after stopping tablets

2017-11-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8618 )

Change subject: Add a test for data consistency after stopping tablets
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8618/3/src/kudu/integration-tests/stop_tablet-itest.cc
File src/kudu/integration-tests/stop_tablet-itest.cc:

http://gerrit.cloudera.org:8080/#/c/8618/3/src/kudu/integration-tests/stop_tablet-itest.cc@231
PS3, Line 231:  why does the client get a RemoteError in this test?
> what does the RemoteError say?
while starting up the tservers, they start their RPC server before they 
register the TabletService RPC service itself into the messenger. So, the 
client bounces around and gets ServiceUnavailable on all replicas, and then 
returns RemoteError instead of TimedOut



--
To view, visit http://gerrit.cloudera.org:8080/8618
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7b3b9e90fd1fbfb555a22074fd4a777a2b2e56e2
Gerrit-Change-Number: 8618
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:44:09 +
Gerrit-HasComments: Yes


[kudu-CR] WIP [catalog manager] added 3-4-3 behavior

2017-11-21 Thread Alexey Serbin (Code Review)
Hello Tidy Bot, Mike Percy, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8619

to look at the new patch set (#2).

Change subject: WIP [catalog_manager] added 3-4-3 behavior
..

WIP [catalog_manager] added 3-4-3 behavior

Updated the catalog_manager to behave as specified in 3-4-3 v1 design
implementation plan, patch2.

WIP: need to add units tests for new quorum_util functions and
 comprehesive functional/integration tests

Change-Id: I6f0469ac641bf7a03dbef01eaa3f1b58a5bf5d27
---
M src/kudu/consensus/quorum_util-test.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
5 files changed, 806 insertions(+), 96 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/19/8619/2
--
To view, visit http://gerrit.cloudera.org:8080/8619
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6f0469ac641bf7a03dbef01eaa3f1b58a5bf5d27
Gerrit-Change-Number: 8619
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics

2017-11-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8617 )

Change subject: KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics
..

KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics

InternalMiniCluster tests were spitting out TSAN warnings where one thread
(associated with one mini-tserver) accesses the CacheMetrics while another
thread is busy restarting a tserver. The restarting tserver ends up
replacing the CacheMetrics instance, which caused TSAN to warn due to the
unprotected concurrent access.

This patch is a relatively simple workaround that just prevents the Cache
from switching CacheMetrics instances once it is first constructed. The
underlying metrics are ref-counted, so we shouldn't have any leak or illegal
access. The only downside is that, after a restart, new tservers won't be able
to "attach" to the cache instance. But, we already had somewhat randomly
associated the cache metrics with one of the tablet servers, so it seems
unlikely that any assertions are depending on the functionality.

Along the way, I removed the Cache::NewId() method which was unused.

Change-Id: Ifc5c6e9306df78c364c8b89651ddcc56b90a924f
Reviewed-on: http://gerrit.cloudera.org:8080/8617
Reviewed-by: Mike Percy 
Tested-by: Todd Lipcon 
---
M src/kudu/util/cache-test.cc
M src/kudu/util/cache.cc
M src/kudu/util/cache.h
M src/kudu/util/nvm_cache.cc
4 files changed, 21 insertions(+), 31 deletions(-)

Approvals:
  Mike Percy: Looks good to me, approved
  Todd Lipcon: Verified

--
To view, visit http://gerrit.cloudera.org:8080/8617
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ifc5c6e9306df78c364c8b89651ddcc56b90a924f
Gerrit-Change-Number: 8617
Gerrit-PatchSet: 5
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics

2017-11-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8617 )

Change subject: KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics
..


Patch Set 4: Verified+1

Unrelated race


--
To view, visit http://gerrit.cloudera.org:8080/8617
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifc5c6e9306df78c364c8b89651ddcc56b90a924f
Gerrit-Change-Number: 8617
Gerrit-PatchSet: 4
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:41:09 +
Gerrit-HasComments: No


[kudu-CR] KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics

2017-11-21 Thread Todd Lipcon (Code Review)
Todd Lipcon has removed a vote on this change.

Change subject: KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics
..


Removed Verified-1 by Kudu Jenkins (120)
--
To view, visit http://gerrit.cloudera.org:8080/8617
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: Ifc5c6e9306df78c364c8b89651ddcc56b90a924f
Gerrit-Change-Number: 8617
Gerrit-PatchSet: 4
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] tablet: introduce closed mvcc and stopped tablets

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7439 )

Change subject: tablet: introduce closed mvcc and stopped tablets
..


Patch Set 39: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/7439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42
Gerrit-Change-Number: 7439
Gerrit-PatchSet: 39
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:40:26 +
Gerrit-HasComments: No


[kudu-CR] Add a test for data consistency after stopping tablets

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8618 )

Change subject: Add a test for data consistency after stopping tablets
..


Patch Set 3: Code-Review+1


--
To view, visit http://gerrit.cloudera.org:8080/8618
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7b3b9e90fd1fbfb555a22074fd4a777a2b2e56e2
Gerrit-Change-Number: 8618
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:40:07 +
Gerrit-HasComments: No


[kudu-CR] tablet: introduce closed mvcc and stopped tablets

2017-11-21 Thread Andrew Wong (Code Review)
Hello Tidy Bot, Mike Percy, David Ribeiro Alves, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7439

to look at the new patch set (#39).

Change subject: tablet: introduce closed mvcc and stopped tablets
..

tablet: introduce closed mvcc and stopped tablets

Currently, the only way to stop an Applying transaction is to wait for
it to finish and Commit it. This constraint was put in place to
guarantee on-disk correctness, but is sometimes too strict. E.g. if the
tablet is shutting down, the Apply doesn't need to finish.

This patch adds a new state to the MvccManager in which it is closed
for transactions. Once in this closed state:
1. New Applies will return and not move to the Commit phase, and any
   methods waiting for the tablet's Applies to Commit (e.g. new snapshot
   scans, FlushMRS) will respond with an error immediately. This allows
   an escape from the existing invariant that Applies _must_ Commit,
   provided the MvccManager is in this closed state.
2. Applies that are already underway may still Commit, but will return
   early on a best-effort basis. These non-Committed operations are
   inconsequential w.r.t. consistency; having some in-flight
   transactions Commit and others not is consistent with the server
   shutting down in between the Commits of two transactions.
3. New transactions drivers will abort immediately before even reaching
   the Prepare phase, ensuring no more writes to the tablet are made
   durable.

The Tablet class uses this closed MVCC state in a new "stopped" state of
its own. A Tablet that has been stopped will avoid further activity:
its MvccManager is closed to prevent further writes, and its maintenance
ops are cancelled to prevent further scheduling.

This patch includes these new behaviors when shutting down a tablet,
with the assumption that a tablet will only be shut down when it's being
deleted and we don't care too much about its in-flight transactions
Committing or its further maintenance ops. Code paths that previously
crashed if Applies did not succeed (e.g. TransactionDriver::ApplyTask,
MvccManager::AbortTransaction, etc.) or that waited for Applies to
finish (e.g. Tablet:: FlushUnlocked) will now _not_ crash if the Tablet
has been stopped and will log a warning instead.

Testing is done by adding the following:
- a test in mvcc-test to shut down MVCC and delete an Applying
  transaction, ensuring that there are no errors when it leaves scope.
- a test in mvcc-test to wait on an Applying transaction, shut down
  MVCC, and ensure that any waiters will return with an error.
- a new test stop_tablet-itest is added to ensure stopped leaders block
  writes (because they cannot start new transactions) and stopped
  followers don't (because while they cannot service the op, there still
  exists a majority that can); that stopped tablets don't prevent
  fault-tolerant scans; and that stopping the only tablet does prevent
  scans

Change-Id: I983620f27e7226806a2cca253db7619731914d42
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/stop_tablet-itest.cc
M src/kudu/tablet/local_tablet_writer.h
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/rowset.cc
M src/kudu/tablet/tablet-harness.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica_mm_ops.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tserver/tablet_service.cc
17 files changed, 733 insertions(+), 135 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/39/7439/39
--
To view, visit http://gerrit.cloudera.org:8080/7439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42
Gerrit-Change-Number: 7439
Gerrit-PatchSet: 39
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] tablet: introduce closed mvcc and stopped tablets

2017-11-21 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7439 )

Change subject: tablet: introduce closed mvcc and stopped tablets
..


Patch Set 39:

Fixed IWYU + the build failure was KUDU-


--
To view, visit http://gerrit.cloudera.org:8080/7439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42
Gerrit-Change-Number: 7439
Gerrit-PatchSet: 39
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:40:04 +
Gerrit-HasComments: No


[kudu-CR] tablet: mark delta tracker read-only on error

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8605 )

Change subject: tablet: mark delta tracker read-only on error
..


Patch Set 7: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8605
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib950048e9cd0929a10714ab1cc2bd829835afced
Gerrit-Change-Number: 8605
Gerrit-PatchSet: 7
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:38:29 +
Gerrit-HasComments: No


[kudu-CR] tablet: add early returns to maintenance functions

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8606 )

Change subject: tablet: add early returns to maintenance functions
..


Patch Set 8: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8606/5/src/kudu/integration-tests/stop_tablet-itest.cc
File src/kudu/integration-tests/stop_tablet-itest.cc:

http://gerrit.cloudera.org:8080/#/c/8606/5/src/kudu/integration-tests/stop_tablet-itest.cc@93
PS5, Line 93: server.
> I think that's an optimization, but it wouldn't get the same result as canc
hmm, ok



--
To view, visit http://gerrit.cloudera.org:8080/8606
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I84ad557851863f6fd9acff28ddcd1244e62cf516
Gerrit-Change-Number: 8606
Gerrit-PatchSet: 8
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:36:48 +
Gerrit-HasComments: Yes


[kudu-CR] tablet: make various update paths atomic

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8441 )

Change subject: tablet: make various update paths atomic
..


Patch Set 13: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8441
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0241cd4206ce77ef1fb334458b091bc2092f4141
Gerrit-Change-Number: 8441
Gerrit-PatchSet: 13
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:35:06 +
Gerrit-HasComments: No


[kudu-CR] tablet: remove ignore result in ApplyRowOperation

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8620 )

Change subject: tablet: remove ignore_result in ApplyRowOperation
..


Patch Set 2: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8620
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I54ea74f5ee10733b2d7b13b1971f9177dddb50cb
Gerrit-Change-Number: 8620
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:34:33 +
Gerrit-HasComments: No


[kudu-CR] handle disk failures during tablet copies

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7654 )

Change subject: handle disk failures during tablet copies
..


Patch Set 8: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/7654
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
Gerrit-Change-Number: 7654
Gerrit-PatchSet: 8
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:34:14 +
Gerrit-HasComments: No


[kudu-CR] tablet: mark delta tracker read-only on error

2017-11-21 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8605 )

Change subject: tablet: mark delta tracker read-only on error
..


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8605/6/src/kudu/tablet/delta_tracker.cc
File src/kudu/tablet/delta_tracker.cc:

http://gerrit.cloudera.org:8080/#/c/8605/6/src/kudu/tablet/delta_tracker.cc@355
PS6, Line 355: CheckWritableUnlocked(
> nit: perhaps rename to CheckWritableUnlocked()
Done


http://gerrit.cloudera.org:8080/#/c/8605/6/src/kudu/tablet/delta_tracker.cc@740
PS6, Line 740: read_only_ = true;
> CHECK(s.IsDiskFailure()) << LogPrefix() << s.ToString();
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/8605
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib950048e9cd0929a10714ab1cc2bd829835afced
Gerrit-Change-Number: 8605
Gerrit-PatchSet: 7
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:31:10 +
Gerrit-HasComments: Yes


[kudu-CR] tablet: add early returns to maintenance functions

2017-11-21 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8606 )

Change subject: tablet: add early returns to maintenance functions
..


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8606/6/src/kudu/tablet/tablet_replica_mm_ops.cc
File src/kudu/tablet/tablet_replica_mm_ops.cc:

http://gerrit.cloudera.org:8080/#/c/8606/6/src/kudu/tablet/tablet_replica_mm_ops.cc@200
PS6, Line 200: ())
> LogPrefix() includes the trailing space, so it's usually preferable to writ
Done



--
To view, visit http://gerrit.cloudera.org:8080/8606
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I84ad557851863f6fd9acff28ddcd1244e62cf516
Gerrit-Change-Number: 8606
Gerrit-PatchSet: 8
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:31:02 +
Gerrit-HasComments: Yes


[kudu-CR] tablet: make various update paths atomic

2017-11-21 Thread Andrew Wong (Code Review)
Hello Tidy Bot, Alexey Serbin, Mike Percy, Kudu Jenkins, Adar Dembo, Todd 
Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8441

to look at the new patch set (#13).

Change subject: tablet: make various update paths atomic
..

tablet: make various update paths atomic

A few codepaths in the tablet subsystem aren't atomic, i.e. if they fail
mid-method (e.g. due to a file error), they leave some in-memory
structures updated and others untouched. This has been safe because we
CHECKed to ensure their success. In preparation for _not_ crashing in
these methods, this patch refactors some of these functions to be
atomic, and notes others that still have the possibility of failing in
such a state (these calls still trigger a CHECK failure).

There are no functional changes in this patch.

Change-Id: I0241cd4206ce77ef1fb334458b091bc2092f4141
---
M src/kudu/tablet/delta_compaction.cc
M src/kudu/tablet/delta_compaction.h
M src/kudu/tablet/delta_tracker.cc
M src/kudu/tablet/delta_tracker.h
M src/kudu/tablet/diskrowset.cc
M src/kudu/tablet/diskrowset.h
M src/kudu/tablet/metadata-test.cc
M src/kudu/tablet/rowset_metadata.cc
M src/kudu/tablet/rowset_metadata.h
9 files changed, 224 insertions(+), 169 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/41/8441/13
--
To view, visit http://gerrit.cloudera.org:8080/8441
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0241cd4206ce77ef1fb334458b091bc2092f4141
Gerrit-Change-Number: 8441
Gerrit-PatchSet: 13
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] tablet: make various update paths atomic

2017-11-21 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8441 )

Change subject: tablet: make various update paths atomic
..


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8441/12/src/kudu/tablet/diskrowset.cc
File src/kudu/tablet/diskrowset.cc:

http://gerrit.cloudera.org:8080/#/c/8441/12/src/kudu/tablet/diskrowset.cc@579
PS12, Line 579: LOG_WITH_PREFIX(WARNING) << "Error during major delta 
compaction! Rolling back rowset metadata";
> WITH_PREFIX here so you get the tablet ID
Done



--
To view, visit http://gerrit.cloudera.org:8080/8441
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0241cd4206ce77ef1fb334458b091bc2092f4141
Gerrit-Change-Number: 8441
Gerrit-PatchSet: 13
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:30:48 +
Gerrit-HasComments: Yes


[kudu-CR] tablet: add early returns to maintenance functions

2017-11-21 Thread Andrew Wong (Code Review)
Hello Mike Percy, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8606

to look at the new patch set (#8).

Change subject: tablet: add early returns to maintenance functions
..

tablet: add early returns to maintenance functions

When a Tablet is stopped, further maintenance ops are not scheduled,
further IO is prevented, etc. This patch optimizes this further to stop
various functions that are called by maintenance threads to prevent
their execution, returning an error instead.

Previously, certain ops (e.g. flush DMS) would guarantee durability by
checking for the success these functions. These checks are now replaced
with on-error checks that the tablet has been stopped, since these
failures are inconsequential if the tablet is stopped.

Currently this is an optimization for tablets that are shutting down to
return early from these calls, but in the future, this could be useful
in stopping all IO done by a tablet that is failing, e.g. due to disk
failure.

Change-Id: I84ad557851863f6fd9acff28ddcd1244e62cf516
---
M src/kudu/integration-tests/stop_tablet-itest.cc
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tablet/tablet_replica_mm_ops.cc
5 files changed, 34 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/06/8606/8
--
To view, visit http://gerrit.cloudera.org:8080/8606
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I84ad557851863f6fd9acff28ddcd1244e62cf516
Gerrit-Change-Number: 8606
Gerrit-PatchSet: 8
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] handle disk failures during tablet copies

2017-11-21 Thread Andrew Wong (Code Review)
Hello Tidy Bot, Mike Percy, Kudu Jenkins, Adar Dembo,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7654

to look at the new patch set (#8).

Change subject: handle disk failures during tablet copies
..

handle disk failures during tablet copies

There are two components in a tablet copy: the copy client (that
receives data) and the copy session source (that sends data).

Coarse-grain handling of disk failures during tablet copies is done for
the tablet copy client as:
- Before starting a copy client, if no disks are available to place the
  tablet, simply return (instead of failing a CHECK).
- Before downloading each WAL segments or block, check that the tablet
  is in a healthy group.
And for the tablet copy session as:
- Before sending a block or log segment, check if the tablet has an
  error.

Upon returning an error, the tablet copy client will shutdown the
replica, leaving it in a failed state.

A test is added to ensure that both copy clients and that source
sessions with failed disks will return errors to the copying client.

Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
---
M src/kudu/tserver/tablet_copy-test-base.h
M src/kudu/tserver/tablet_copy_client-test.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_client.h
M src/kudu/tserver/tablet_copy_service-test.cc
M src/kudu/tserver/tablet_copy_source_session.cc
M src/kudu/tserver/tablet_copy_source_session.h
7 files changed, 124 insertions(+), 7 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/54/7654/8
--
To view, visit http://gerrit.cloudera.org:8080/7654
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
Gerrit-Change-Number: 7654
Gerrit-PatchSet: 8
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot


[kudu-CR] handle disk failures during tablet copies

2017-11-21 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7654 )

Change subject: handle disk failures during tablet copies
..


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7654/7/src/kudu/tserver/tablet_copy_source_session.cc
File src/kudu/tserver/tablet_copy_source_session.cc:

http://gerrit.cloudera.org:8080/#/c/7654/7/src/kudu/tserver/tablet_copy_source_session.cc@133
PS7, Line 133: ));
> we can remove this now
Done



--
To view, visit http://gerrit.cloudera.org:8080/7654
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
Gerrit-Change-Number: 7654
Gerrit-PatchSet: 8
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:30:43 +
Gerrit-HasComments: Yes


[kudu-CR] tablet: mark delta tracker read-only on error

2017-11-21 Thread Andrew Wong (Code Review)
Hello Mike Percy, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8605

to look at the new patch set (#7).

Change subject: tablet: mark delta tracker read-only on error
..

tablet: mark delta tracker read-only on error

When DeltaTracker::Flush() fails, it leaves a DeltaMemStore in the redo
store list. This is fine for reads because the readpath expects any
DeltaStore to be in that slot, but this is a problem for updates, which
expect only DeltaFileReaders in the list while the compact_flush_lock_
is held.

Before, we would get around this by CHECKing so Flush() could never
return in such a state. This patch introduces a flag to the
DeltaTracker that indicates whether it has experienced such an error.
In order to ensure type-correctness, this flag must be checked
immediately upon entering the critical section of the
compact_flush_lock_.

Change-Id: Ib950048e9cd0929a10714ab1cc2bd829835afced
---
M src/kudu/tablet/delta_tracker.cc
M src/kudu/tablet/delta_tracker.h
2 files changed, 35 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/05/8605/7
--
To view, visit http://gerrit.cloudera.org:8080/8605
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib950048e9cd0929a10714ab1cc2bd829835afced
Gerrit-Change-Number: 8605
Gerrit-PatchSet: 7
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8617 )

Change subject: KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics
..


Patch Set 4: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8617
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifc5c6e9306df78c364c8b89651ddcc56b90a924f
Gerrit-Change-Number: 8617
Gerrit-PatchSet: 4
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:16:43 +
Gerrit-HasComments: No


[kudu-CR] Add a test for data consistency after stopping tablets

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8618 )

Change subject: Add a test for data consistency after stopping tablets
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8618/3/src/kudu/integration-tests/stop_tablet-itest.cc
File src/kudu/integration-tests/stop_tablet-itest.cc:

http://gerrit.cloudera.org:8080/#/c/8618/3/src/kudu/integration-tests/stop_tablet-itest.cc@249
PS3, Line 249: Shutdown
Did you mean to Shutdown() or Stop() here? Maybe one after the other?



--
To view, visit http://gerrit.cloudera.org:8080/8618
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7b3b9e90fd1fbfb555a22074fd4a777a2b2e56e2
Gerrit-Change-Number: 8618
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:13:56 +
Gerrit-HasComments: Yes


[kudu-CR] Add a test for data consistency after stopping tablets

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8618 )

Change subject: Add a test for data consistency after stopping tablets
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8618/3/src/kudu/integration-tests/stop_tablet-itest.cc
File src/kudu/integration-tests/stop_tablet-itest.cc:

http://gerrit.cloudera.org:8080/#/c/8618/3/src/kudu/integration-tests/stop_tablet-itest.cc@249
PS3, Line 249: Shutdown
> Did you mean to Shutdown() or Stop() here? Maybe one after the other?
Oh, wait, Shutdown() calls Stop() first. Good idea with this.



--
To view, visit http://gerrit.cloudera.org:8080/8618
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7b3b9e90fd1fbfb555a22074fd4a777a2b2e56e2
Gerrit-Change-Number: 8618
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:14:34 +
Gerrit-HasComments: Yes


[kudu-CR] Add a test for data consistency after stopping tablets

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8618 )

Change subject: Add a test for data consistency after stopping tablets
..


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8618/3/src/kudu/integration-tests/cluster_verifier.cc
File src/kudu/integration-tests/cluster_verifier.cc:

http://gerrit.cloudera.org:8080/#/c/8618/3/src/kudu/integration-tests/cluster_verifier.cc@92
PS3, Line 92: we should support
totally agree


http://gerrit.cloudera.org:8080/#/c/8618/3/src/kudu/integration-tests/stop_tablet-itest.cc
File src/kudu/integration-tests/stop_tablet-itest.cc:

http://gerrit.cloudera.org:8080/#/c/8618/3/src/kudu/integration-tests/stop_tablet-itest.cc@231
PS3, Line 231:  why does the client get a RemoteError in this test?
what does the RemoteError say?



--
To view, visit http://gerrit.cloudera.org:8080/8618
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7b3b9e90fd1fbfb555a22074fd4a777a2b2e56e2
Gerrit-Change-Number: 8618
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:12:18 +
Gerrit-HasComments: Yes


[kudu-CR] tablet: introduce closed mvcc and stopped tablets

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7439 )

Change subject: tablet: introduce closed mvcc and stopped tablets
..


Patch Set 38: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/7439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42
Gerrit-Change-Number: 7439
Gerrit-PatchSet: 38
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:08:02 +
Gerrit-HasComments: No


[kudu-CR] tablet: introduce closed mvcc and stopped tablets

2017-11-21 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7439 )

Change subject: tablet: introduce closed mvcc and stopped tablets
..


Patch Set 38:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7439/37/src/kudu/integration-tests/stop_tablet-itest.cc
File src/kudu/integration-tests/stop_tablet-itest.cc:

http://gerrit.cloudera.org:8080/#/c/7439/37/src/kudu/integration-tests/stop_tablet-itest.cc@193
PS37, Line 193: ASSERT_TRUE(s.IsTimedOut())
> nit: add << s.ToString() so this is diagnosable when / if it fails in the f
Done



--
To view, visit http://gerrit.cloudera.org:8080/7439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42
Gerrit-Change-Number: 7439
Gerrit-PatchSet: 38
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:05:48 +
Gerrit-HasComments: Yes


[kudu-CR] tablet: introduce closed mvcc and stopped tablets

2017-11-21 Thread Andrew Wong (Code Review)
Hello Tidy Bot, Mike Percy, David Ribeiro Alves, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7439

to look at the new patch set (#38).

Change subject: tablet: introduce closed mvcc and stopped tablets
..

tablet: introduce closed mvcc and stopped tablets

Currently, the only way to stop an Applying transaction is to wait for
it to finish and Commit it. This constraint was put in place to
guarantee on-disk correctness, but is sometimes too strict. E.g. if the
tablet is shutting down, the Apply doesn't need to finish.

This patch adds a new state to the MvccManager in which it is closed
for transactions. Once in this closed state:
1. New Applies will return and not move to the Commit phase, and any
   methods waiting for the tablet's Applies to Commit (e.g. new snapshot
   scans, FlushMRS) will respond with an error immediately. This allows
   an escape from the existing invariant that Applies _must_ Commit,
   provided the MvccManager is in this closed state.
2. Applies that are already underway may still Commit, but will return
   early on a best-effort basis. These non-Committed operations are
   inconsequential w.r.t. consistency; having some in-flight
   transactions Commit and others not is consistent with the server
   shutting down in between the Commits of two transactions.
3. New transactions drivers will abort immediately before even reaching
   the Prepare phase, ensuring no more writes to the tablet are made
   durable.

The Tablet class uses this closed MVCC state in a new "stopped" state of
its own. A Tablet that has been stopped will avoid further activity:
its MvccManager is closed to prevent further writes, and its maintenance
ops are cancelled to prevent further scheduling.

This patch includes these new behaviors when shutting down a tablet,
with the assumption that a tablet will only be shut down when it's being
deleted and we don't care too much about its in-flight transactions
Committing or its further maintenance ops. Code paths that previously
crashed if Applies did not succeed (e.g. TransactionDriver::ApplyTask,
MvccManager::AbortTransaction, etc.) or that waited for Applies to
finish (e.g. Tablet:: FlushUnlocked) will now _not_ crash if the Tablet
has been stopped and will log a warning instead.

Testing is done by adding the following:
- a test in mvcc-test to shut down MVCC and delete an Applying
  transaction, ensuring that there are no errors when it leaves scope.
- a test in mvcc-test to wait on an Applying transaction, shut down
  MVCC, and ensure that any waiters will return with an error.
- a new test stop_tablet-itest is added to ensure stopped leaders block
  writes (because they cannot start new transactions) and stopped
  followers don't (because while they cannot service the op, there still
  exists a majority that can); that stopped tablets don't prevent
  fault-tolerant scans; and that stopping the only tablet does prevent
  scans

Change-Id: I983620f27e7226806a2cca253db7619731914d42
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/stop_tablet-itest.cc
M src/kudu/tablet/local_tablet_writer.h
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/rowset.cc
M src/kudu/tablet/tablet-harness.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica_mm_ops.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tserver/tablet_service.cc
17 files changed, 735 insertions(+), 135 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/39/7439/38
--
To view, visit http://gerrit.cloudera.org:8080/7439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42
Gerrit-Change-Number: 7439
Gerrit-PatchSet: 38
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] tablet: introduce closed mvcc and stopped tablets

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7439 )

Change subject: tablet: introduce closed mvcc and stopped tablets
..


Patch Set 37:

(1 comment)

looks good

http://gerrit.cloudera.org:8080/#/c/7439/37/src/kudu/integration-tests/stop_tablet-itest.cc
File src/kudu/integration-tests/stop_tablet-itest.cc:

http://gerrit.cloudera.org:8080/#/c/7439/37/src/kudu/integration-tests/stop_tablet-itest.cc@193
PS37, Line 193: ASSERT_TRUE(s.IsTimedOut())
nit: add << s.ToString() so this is diagnosable when / if it fails in the 
future.



--
To view, visit http://gerrit.cloudera.org:8080/7439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42
Gerrit-Change-Number: 7439
Gerrit-PatchSet: 37
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:02:10 +
Gerrit-HasComments: Yes


[kudu-CR] tablet: introduce closed mvcc and stopped tablets

2017-11-21 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7439 )

Change subject: tablet: introduce closed mvcc and stopped tablets
..


Patch Set 37:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7439/36/src/kudu/integration-tests/stop_tablet-itest.cc
File src/kudu/integration-tests/stop_tablet-itest.cc:

http://gerrit.cloudera.org:8080/#/c/7439/36/src/kudu/integration-tests/stop_tablet-itest.cc@138
PS36, Line 138: *leader_num = i;
> It seems like this works but I'm worried that if someone changes the read t
Done



--
To view, visit http://gerrit.cloudera.org:8080/7439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42
Gerrit-Change-Number: 7439
Gerrit-PatchSet: 37
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 21 Nov 2017 21:57:57 +
Gerrit-HasComments: Yes


[kudu-CR] tablet: introduce closed mvcc and stopped tablets

2017-11-21 Thread Andrew Wong (Code Review)
Hello Tidy Bot, Mike Percy, David Ribeiro Alves, Kudu Jenkins, Todd Lipcon,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7439

to look at the new patch set (#37).

Change subject: tablet: introduce closed mvcc and stopped tablets
..

tablet: introduce closed mvcc and stopped tablets

Currently, the only way to stop an Applying transaction is to wait for
it to finish and Commit it. This constraint was put in place to
guarantee on-disk correctness, but is sometimes too strict. E.g. if the
tablet is shutting down, the Apply doesn't need to finish.

This patch adds a new state to the MvccManager in which it is closed
for transactions. Once in this closed state:
1. New Applies will return and not move to the Commit phase, and any
   methods waiting for the tablet's Applies to Commit (e.g. new snapshot
   scans, FlushMRS) will respond with an error immediately. This allows
   an escape from the existing invariant that Applies _must_ Commit,
   provided the MvccManager is in this closed state.
2. Applies that are already underway may still Commit, but will return
   early on a best-effort basis. These non-Committed operations are
   inconsequential w.r.t. consistency; having some in-flight
   transactions Commit and others not is consistent with the server
   shutting down in between the Commits of two transactions.
3. New transactions drivers will abort immediately before even reaching
   the Prepare phase, ensuring no more writes to the tablet are made
   durable.

The Tablet class uses this closed MVCC state in a new "stopped" state of
its own. A Tablet that has been stopped will avoid further activity:
its MvccManager is closed to prevent further writes, and its maintenance
ops are cancelled to prevent further scheduling.

This patch includes these new behaviors when shutting down a tablet,
with the assumption that a tablet will only be shut down when it's being
deleted and we don't care too much about its in-flight transactions
Committing or its further maintenance ops. Code paths that previously
crashed if Applies did not succeed (e.g. TransactionDriver::ApplyTask,
MvccManager::AbortTransaction, etc.) or that waited for Applies to
finish (e.g. Tablet:: FlushUnlocked) will now _not_ crash if the Tablet
has been stopped and will log a warning instead.

Testing is done by adding the following:
- a test in mvcc-test to shut down MVCC and delete an Applying
  transaction, ensuring that there are no errors when it leaves scope.
- a test in mvcc-test to wait on an Applying transaction, shut down
  MVCC, and ensure that any waiters will return with an error.
- a new test stop_tablet-itest is added to ensure stopped leaders block
  writes (because they cannot start new transactions) and stopped
  followers don't (because while they cannot service the op, there still
  exists a majority that can); that stopped tablets don't prevent
  fault-tolerant scans; and that stopping the only tablet does prevent
  scans

Change-Id: I983620f27e7226806a2cca253db7619731914d42
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/stop_tablet-itest.cc
M src/kudu/tablet/local_tablet_writer.h
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/rowset.cc
M src/kudu/tablet/tablet-harness.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica_mm_ops.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tserver/tablet_service.cc
17 files changed, 735 insertions(+), 135 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/39/7439/37
--
To view, visit http://gerrit.cloudera.org:8080/7439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42
Gerrit-Change-Number: 7439
Gerrit-PatchSet: 37
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] [spark] add 'local-cluster' mode for unit test

2017-11-21 Thread Dan Burkert (Code Review)
Dan Burkert has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8585 )

Change subject: [spark] add 'local-cluster' mode for unit test
..


Patch Set 1:

I pointed SPARK_HOME at the unpacked release tarball downloaded from 
spark.apache.org.  Is the source tarball really necessary?


-- 
To view, visit http://gerrit.cloudera.org:8080/8585
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I07b2daad6f6883eb7425fa7132208818bbe65788
Gerrit-Change-Number: 8585
Gerrit-PatchSet: 1
Gerrit-Owner: Hao Hao 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Comment-Date: Tue, 21 Nov 2017 21:46:28 +
Gerrit-HasComments: No


[kudu-CR] KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics

2017-11-21 Thread Todd Lipcon (Code Review)
Hello Tidy Bot, Mike Percy, Andrew Wong, Kudu Jenkins, Andrew Wong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8617

to look at the new patch set (#4).

Change subject: KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics
..

KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics

InternalMiniCluster tests were spitting out TSAN warnings where one thread
(associated with one mini-tserver) accesses the CacheMetrics while another
thread is busy restarting a tserver. The restarting tserver ends up
replacing the CacheMetrics instance, which caused TSAN to warn due to the
unprotected concurrent access.

This patch is a relatively simple workaround that just prevents the Cache
from switching CacheMetrics instances once it is first constructed. The
underlying metrics are ref-counted, so we shouldn't have any leak or illegal
access. The only downside is that, after a restart, new tservers won't be able
to "attach" to the cache instance. But, we already had somewhat randomly
associated the cache metrics with one of the tablet servers, so it seems
unlikely that any assertions are depending on the functionality.

Along the way, I removed the Cache::NewId() method which was unused.

Change-Id: Ifc5c6e9306df78c364c8b89651ddcc56b90a924f
---
M src/kudu/util/cache-test.cc
M src/kudu/util/cache.cc
M src/kudu/util/cache.h
M src/kudu/util/nvm_cache.cc
4 files changed, 21 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/17/8617/4
--
To view, visit http://gerrit.cloudera.org:8080/8617
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ifc5c6e9306df78c364c8b89651ddcc56b90a924f
Gerrit-Change-Number: 8617
Gerrit-PatchSet: 4
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] [java] Add ReplicaSelection in KuduScanToken

2017-11-21 Thread Dan Burkert (Code Review)
Dan Burkert has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8559 )

Change subject: [java] Add ReplicaSelection in KuduScanToken
..


Patch Set 2: Code-Review+2

Looks good.  Are you going to add support to the C++ client as well?


--
To view, visit http://gerrit.cloudera.org:8080/8559
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I860fcc73e486642ab5177cfd0dc0bdc98fdc6914
Gerrit-Change-Number: 8559
Gerrit-PatchSet: 2
Gerrit-Owner: Hao Hao 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Comment-Date: Tue, 21 Nov 2017 21:31:27 +
Gerrit-HasComments: No


[kudu-CR] [spark] add 'local-cluster' mode for unit test

2017-11-21 Thread Hao Hao (Code Review)
Hao Hao has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8585 )

Change subject: [spark] add 'local-cluster' mode for unit test
..


Patch Set 1:

> (5 comments)
 >
 > I wasn't able to get the tests to run on my system, I got a bunch
 > of error messages like this:
 >
 > 12:22:13.383 [ERROR - ExecutorRunner for app-20171121122213-/0]
 > (Logging.scala:91) Error running executor
 > java.lang.IllegalStateException: Cannot find any build directories.
 > at 
 > org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:248)
 > at 
 > org.apache.spark.launcher.AbstractCommandBuilder.getScalaVersion(AbstractCommandBuilder.java:241)
 > at 
 > org.apache.spark.launcher.AbstractCommandBuilder.buildClassPath(AbstractCommandBuilder.java:195)
 > at 
 > org.apache.spark.launcher.AbstractCommandBuilder.buildJavaCommand(AbstractCommandBuilder.java:118)
 > at 
 > org.apache.spark.launcher.WorkerCommandBuilder.buildCommand(WorkerCommandBuilder.scala:39)
 > at 
 > org.apache.spark.launcher.WorkerCommandBuilder.buildCommand(WorkerCommandBuilder.scala:47)
 > at 
 > org.apache.spark.deploy.worker.CommandUtils$.buildCommandSeq(CommandUtils.scala:63)
 > at 
 > org.apache.spark.deploy.worker.CommandUtils$.buildProcessBuilder(CommandUtils.scala:51)
 > at 
 > org.apache.spark.deploy.worker.ExecutorRunner.org$apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:145)
 > at 
 > org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73)
 >
 >
 >
 > Does that ring any bells?

Yes, it should be caused by either the "SPARK_HOME" is not properly set, or you 
did not happen to build the source code pointed by the "SPARK_HOME"? Sorry that 
I should also document the latter.


--
To view, visit http://gerrit.cloudera.org:8080/8585
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I07b2daad6f6883eb7425fa7132208818bbe65788
Gerrit-Change-Number: 8585
Gerrit-PatchSet: 1
Gerrit-Owner: Hao Hao 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Hao Hao 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Comment-Date: Tue, 21 Nov 2017 21:22:31 +
Gerrit-HasComments: No


[kudu-CR] [spark] add 'local-cluster' mode for unit test

2017-11-21 Thread Dan Burkert (Code Review)
Dan Burkert has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8585 )

Change subject: [spark] add 'local-cluster' mode for unit test
..


Patch Set 1:

(5 comments)

I wasn't able to get the tests to run on my system, I got a bunch of error 
messages like this:

12:22:13.383 [ERROR - ExecutorRunner for app-20171121122213-/0] 
(Logging.scala:91) Error running executor
java.lang.IllegalStateException: Cannot find any build directories.
at 
org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:248)
at 
org.apache.spark.launcher.AbstractCommandBuilder.getScalaVersion(AbstractCommandBuilder.java:241)
at 
org.apache.spark.launcher.AbstractCommandBuilder.buildClassPath(AbstractCommandBuilder.java:195)
at 
org.apache.spark.launcher.AbstractCommandBuilder.buildJavaCommand(AbstractCommandBuilder.java:118)
at 
org.apache.spark.launcher.WorkerCommandBuilder.buildCommand(WorkerCommandBuilder.scala:39)
at 
org.apache.spark.launcher.WorkerCommandBuilder.buildCommand(WorkerCommandBuilder.scala:47)
at 
org.apache.spark.deploy.worker.CommandUtils$.buildCommandSeq(CommandUtils.scala:63)
at 
org.apache.spark.deploy.worker.CommandUtils$.buildProcessBuilder(CommandUtils.scala:51)
at 
org.apache.spark.deploy.worker.ExecutorRunner.org$apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:145)
at 
org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73)



Does that ring any bells?

http://gerrit.cloudera.org:8080/#/c/8585/1/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala
File 
java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala:

http://gerrit.cloudera.org:8080/#/c/8585/1/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala@69
PS1, Line 69: testMode.toLowerCase
Can be simplified to

System.getenv("TEST_MODE") match {


http://gerrit.cloudera.org:8080/#/c/8585/1/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala@70
PS1, Line 70: // According to documentation on spark code, this mode: 
creates a Spark standalone
Might be useful to add a reference to SPARK-595.


http://gerrit.cloudera.org:8080/#/c/8585/1/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala@78
PS1, Line 78: conf.setMaster("local[*]")
I think it'd be better to throw an exception here so that it fails when 
TEST_MODE is set without SPARK_HOME.  Is there a reason I'm missing to do a 
silent fallback?


http://gerrit.cloudera.org:8080/#/c/8585/1/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala@80
PS1, Line 80: val executorLog: String = System.getenv("EXE_LOG_PATH")
These configuration knobs are going to be hard to use, and I don't think they 
will ultimately be very useful.  Can we skip adding them for now?  We can 
always add them back later if they are necessary.


http://gerrit.cloudera.org:8080/#/c/8585/1/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala@90
PS1, Line 90: "local" | _
can be simplified to _



--
To view, visit http://gerrit.cloudera.org:8080/8585
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I07b2daad6f6883eb7425fa7132208818bbe65788
Gerrit-Change-Number: 8585
Gerrit-PatchSet: 1
Gerrit-Owner: Hao Hao 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Comment-Date: Tue, 21 Nov 2017 21:13:42 +
Gerrit-HasComments: Yes


  1   2   >