[kudu-CR] WIP [catalog manager] added 3-4-3 behavior
Hello Tidy Bot, Mike Percy, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8619 to look at the new patch set (#3). Change subject: WIP [catalog_manager] added 3-4-3 behavior .. WIP [catalog_manager] added 3-4-3 behavior Updated the catalog_manager to behave as specified in 3-4-3 v1 design implementation plan, patch2. WIP: need to add comprehesive functional/integration tests using minicluster once the leader replica starts reporting at least about reachable and unreachable replicas. Change-Id: I6f0469ac641bf7a03dbef01eaa3f1b58a5bf5d27 --- M src/kudu/consensus/quorum_util-test.cc M src/kudu/consensus/quorum_util.cc M src/kudu/consensus/quorum_util.h M src/kudu/master/catalog_manager.cc M src/kudu/master/catalog_manager.h 5 files changed, 829 insertions(+), 105 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/19/8619/3 -- To view, visit http://gerrit.cloudera.org:8080/8619 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6f0469ac641bf7a03dbef01eaa3f1b58a5bf5d27 Gerrit-Change-Number: 8619 Gerrit-PatchSet: 3 Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR](branch-1.3.x) [security] test and fixes for TLS socket EINTR issues
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8599 ) Change subject: [security] test and fixes for TLS socket EINTR issues .. [security] test and fixes for TLS socket EINTR issues SSL_{read,write}() can return SSL_ERROR_WANT_{READ,WRITE} correspondingly when signal interrupts recv()/send() calls even if SSL_MODE_AUTO_RETRY is set in the TLS context. To handle that properly in Socket::Blocking{Recv,Write}() methods, return NetworkError() with appropriate POSIX error code from TlsSocket::{Recv,Write}(). As a by-product, this changelist fixes flakiness in TestUniqueClientIds scenario of the ClientStressTest test and other flaky tests which failed with errors like below: Bad status: IO error: Could not connect to the cluster: \ Client connection negotiation failed: client connection to \ IP:port: Read zero bytes on a blocking Recv() call: \ Transferred 0 of 4 bytes Prior to this fix, the test failure ratio observed with dist-test for TSAN builds was about 6% in multiple 1K runs. After the fix, no failures observed. Change-Id: Ibec9049186f79f1c43295e4735538ed7ba4e516e Reviewed-on: http://gerrit.cloudera.org:8080/8462 Tested-by: Kudu Jenkins Reviewed-by: Dan Burkert Reviewed-by: Alexey Serbin (cherry picked from commit 18e024cf8bcaea192efb63780802cc4c799bbb9c) Reviewed-on: http://gerrit.cloudera.org:8080/8599 Reviewed-by: Todd Lipcon --- M src/kudu/security/CMakeLists.txt M src/kudu/security/tls_handshake.cc A src/kudu/security/tls_socket-test.cc M src/kudu/security/tls_socket.cc M src/kudu/util/net/socket.cc 5 files changed, 227 insertions(+), 7 deletions(-) Approvals: Kudu Jenkins: Verified Todd Lipcon: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/8599 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: branch-1.3.x Gerrit-MessageType: merged Gerrit-Change-Id: Ibec9049186f79f1c43295e4735538ed7ba4e516e Gerrit-Change-Number: 8599 Gerrit-PatchSet: 3 Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR](branch-1.3.x) KUDU-2218. tls socket: properly handle temporary socket errors in Writev
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8600 ) Change subject: KUDU-2218. tls_socket: properly handle temporary socket errors in Writev .. KUDU-2218. tls_socket: properly handle temporary socket errors in Writev This fixes a bug which caused RaftConsensusITest.TestLargeBatches to fail when run under stress, as in the following command line: taskset -c 0-4 \ build/latest/bin/raft_consensus-itest \ --gtest_filter=\*LargeBat\* \ --stress-cpu-threads=8 This would produce an error like: Network error: failed to write to TLS socket: error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry:s3_pkt.c:878 This means that we were retrying a write after getting EAGAIN, but with a different buffer than the first time. I tracked this down to mishandling of temporary socket errors in TlsSocket::Writev(). In the case that we successfully write part of the io vector but hit such an error trying to write a later element in the vector, we were still propagating the error back up to the caller. The caller didn't realize that part of the write was successful, and thus it would retry the write from the beginning. The fix is to fix the above, but also to enable partial writes in TlsContext. The new test fails if either of the above two changes are backed out. Change-Id: If797f220f42bfb2e6f452b66f15e7a758e883472 Reviewed-on: http://gerrit.cloudera.org:8080/8570 Tested-by: Kudu Jenkins Reviewed-by: Alexey Serbin (cherry picked from commit 64eb9f37b171419ed12a3795efe28faf2fd33b3d) Reviewed-on: http://gerrit.cloudera.org:8080/8600 Reviewed-by: Todd Lipcon --- M src/kudu/security/tls_context.cc M src/kudu/security/tls_socket-test.cc M src/kudu/security/tls_socket.cc 3 files changed, 219 insertions(+), 70 deletions(-) Approvals: Kudu Jenkins: Verified Todd Lipcon: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/8600 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: branch-1.3.x Gerrit-MessageType: merged Gerrit-Change-Id: If797f220f42bfb2e6f452b66f15e7a758e883472 Gerrit-Change-Number: 8600 Gerrit-PatchSet: 3 Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR](branch-1.4.x) KUDU-2218. tls socket: properly handle temporary socket errors in Writev
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8602 ) Change subject: KUDU-2218. tls_socket: properly handle temporary socket errors in Writev .. KUDU-2218. tls_socket: properly handle temporary socket errors in Writev This fixes a bug which caused RaftConsensusITest.TestLargeBatches to fail when run under stress, as in the following command line: taskset -c 0-4 \ build/latest/bin/raft_consensus-itest \ --gtest_filter=\*LargeBat\* \ --stress-cpu-threads=8 This would produce an error like: Network error: failed to write to TLS socket: error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry:s3_pkt.c:878 This means that we were retrying a write after getting EAGAIN, but with a different buffer than the first time. I tracked this down to mishandling of temporary socket errors in TlsSocket::Writev(). In the case that we successfully write part of the io vector but hit such an error trying to write a later element in the vector, we were still propagating the error back up to the caller. The caller didn't realize that part of the write was successful, and thus it would retry the write from the beginning. The fix is to fix the above, but also to enable partial writes in TlsContext. The new test fails if either of the above two changes are backed out. Change-Id: If797f220f42bfb2e6f452b66f15e7a758e883472 Reviewed-on: http://gerrit.cloudera.org:8080/8570 Tested-by: Kudu Jenkins Reviewed-by: Alexey Serbin (cherry picked from commit 64eb9f37b171419ed12a3795efe28faf2fd33b3d) Reviewed-on: http://gerrit.cloudera.org:8080/8602 Tested-by: Alexey Serbin Reviewed-by: Todd Lipcon --- M src/kudu/security/tls_context.cc M src/kudu/security/tls_socket-test.cc M src/kudu/security/tls_socket.cc 3 files changed, 219 insertions(+), 70 deletions(-) Approvals: Alexey Serbin: Verified Todd Lipcon: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/8602 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: branch-1.4.x Gerrit-MessageType: merged Gerrit-Change-Id: If797f220f42bfb2e6f452b66f15e7a758e883472 Gerrit-Change-Number: 8602 Gerrit-PatchSet: 3 Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Todd Lipcon
[kudu-CR](branch-1.3.x) [security] test and fixes for TLS socket EINTR issues
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8599 ) Change subject: [security] test and fixes for TLS socket EINTR issues .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8599 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: branch-1.3.x Gerrit-MessageType: comment Gerrit-Change-Id: Ibec9049186f79f1c43295e4735538ed7ba4e516e Gerrit-Change-Number: 8599 Gerrit-PatchSet: 2 Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 22 Nov 2017 06:43:41 + Gerrit-HasComments: No
[kudu-CR](branch-1.3.x) KUDU-2218. tls socket: properly handle temporary socket errors in Writev
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8600 ) Change subject: KUDU-2218. tls_socket: properly handle temporary socket errors in Writev .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8600 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: branch-1.3.x Gerrit-MessageType: comment Gerrit-Change-Id: If797f220f42bfb2e6f452b66f15e7a758e883472 Gerrit-Change-Number: 8600 Gerrit-PatchSet: 2 Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 22 Nov 2017 06:43:58 + Gerrit-HasComments: No
[kudu-CR](branch-1.4.x) KUDU-2218. tls socket: properly handle temporary socket errors in Writev
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8602 ) Change subject: KUDU-2218. tls_socket: properly handle temporary socket errors in Writev .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8602 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: branch-1.4.x Gerrit-MessageType: comment Gerrit-Change-Id: If797f220f42bfb2e6f452b66f15e7a758e883472 Gerrit-Change-Number: 8602 Gerrit-PatchSet: 2 Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 22 Nov 2017 06:43:01 + Gerrit-HasComments: No
[kudu-CR](branch-1.4.x) [security] test and fixes for TLS socket EINTR issues
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8601 ) Change subject: [security] test and fixes for TLS socket EINTR issues .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8601 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: branch-1.4.x Gerrit-MessageType: comment Gerrit-Change-Id: Ibec9049186f79f1c43295e4735538ed7ba4e516e Gerrit-Change-Number: 8601 Gerrit-PatchSet: 2 Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 22 Nov 2017 06:42:46 + Gerrit-HasComments: No
[kudu-CR](branch-1.4.x) [security] test and fixes for TLS socket EINTR issues
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8601 ) Change subject: [security] test and fixes for TLS socket EINTR issues .. [security] test and fixes for TLS socket EINTR issues SSL_{read,write}() can return SSL_ERROR_WANT_{READ,WRITE} correspondingly when signal interrupts recv()/send() calls even if SSL_MODE_AUTO_RETRY is set in the TLS context. To handle that properly in Socket::Blocking{Recv,Write}() methods, return NetworkError() with appropriate POSIX error code from TlsSocket::{Recv,Write}(). As a by-product, this changelist fixes flakiness in TestUniqueClientIds scenario of the ClientStressTest test and other flaky tests which failed with errors like below: Bad status: IO error: Could not connect to the cluster: \ Client connection negotiation failed: client connection to \ IP:port: Read zero bytes on a blocking Recv() call: \ Transferred 0 of 4 bytes Prior to this fix, the test failure ratio observed with dist-test for TSAN builds was about 6% in multiple 1K runs. After the fix, no failures observed. Change-Id: Ibec9049186f79f1c43295e4735538ed7ba4e516e Reviewed-on: http://gerrit.cloudera.org:8080/8462 Tested-by: Kudu Jenkins Reviewed-by: Dan Burkert Reviewed-by: Alexey Serbin (cherry picked from commit 18e024cf8bcaea192efb63780802cc4c799bbb9c) (cherry picked from commit d98c24fa69ccea676cdf50857ebdff8fe866f751) Reviewed-on: http://gerrit.cloudera.org:8080/8601 Tested-by: Alexey Serbin Reviewed-by: Todd Lipcon --- M src/kudu/security/CMakeLists.txt M src/kudu/security/tls_handshake.cc A src/kudu/security/tls_socket-test.cc M src/kudu/security/tls_socket.cc M src/kudu/util/net/socket.cc 5 files changed, 227 insertions(+), 7 deletions(-) Approvals: Alexey Serbin: Verified Todd Lipcon: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/8601 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: branch-1.4.x Gerrit-MessageType: merged Gerrit-Change-Id: Ibec9049186f79f1c43295e4735538ed7ba4e516e Gerrit-Change-Number: 8601 Gerrit-PatchSet: 3 Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Todd Lipcon
[kudu-CR] tablet: mark delta tracker read-only on error
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8605 ) Change subject: tablet: mark delta tracker read-only on error .. Patch Set 9: Code-Review+2 Carrying forward Mike's +2 -- To view, visit http://gerrit.cloudera.org:8080/8605 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib950048e9cd0929a10714ab1cc2bd829835afced Gerrit-Change-Number: 8605 Gerrit-PatchSet: 9 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 22 Nov 2017 06:42:31 + Gerrit-HasComments: No
[kudu-CR] handle disk failures during tablet copies
Andrew Wong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/7654 ) Change subject: handle disk failures during tablet copies .. handle disk failures during tablet copies There are two components in a tablet copy: the copy client (that receives data) and the copy session source (that sends data). Coarse-grain handling of disk failures during tablet copies is done for the tablet copy client as: - Before starting a copy client, if no disks are available to place the tablet, simply return (instead of failing a CHECK). - Before downloading each WAL segments or block, check that the tablet is in a healthy group. And for the tablet copy session as: - Before sending a block or log segment, check if the tablet has an error. Upon returning an error, the tablet copy client will shutdown the replica, leaving it in a failed state. A test is added to ensure that both copy clients and that source sessions with failed disks will return errors to the copying client. Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7 Reviewed-on: http://gerrit.cloudera.org:8080/7654 Tested-by: Kudu Jenkins Reviewed-by: Mike Percy --- M src/kudu/tserver/tablet_copy-test-base.h M src/kudu/tserver/tablet_copy_client-test.cc M src/kudu/tserver/tablet_copy_client.cc M src/kudu/tserver/tablet_copy_client.h M src/kudu/tserver/tablet_copy_service-test.cc M src/kudu/tserver/tablet_copy_source_session.cc M src/kudu/tserver/tablet_copy_source_session.h 7 files changed, 124 insertions(+), 7 deletions(-) Approvals: Kudu Jenkins: Verified Mike Percy: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/7654 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7 Gerrit-Change-Number: 7654 Gerrit-PatchSet: 10 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot
[kudu-CR] tablet: add early returns to maintenance functions
Andrew Wong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8606 ) Change subject: tablet: add early returns to maintenance functions .. tablet: add early returns to maintenance functions When a Tablet is stopped, further maintenance ops are not scheduled, further IO is prevented, etc. This patch optimizes this further to stop various functions that are called by maintenance threads to prevent their execution, returning an error instead. Previously, certain ops (e.g. flush DMS) would guarantee durability by checking for the success these functions. These checks are now replaced with on-error checks that the tablet has been stopped, since these failures are inconsequential if the tablet is stopped. Currently this is an optimization for tablets that are shutting down to return early from these calls, but in the future, this could be useful in stopping all IO done by a tablet that is failing, e.g. due to disk failure. Change-Id: I84ad557851863f6fd9acff28ddcd1244e62cf516 Reviewed-on: http://gerrit.cloudera.org:8080/8606 Tested-by: Kudu Jenkins Reviewed-by: Mike Percy --- M src/kudu/integration-tests/stop_tablet-itest.cc M src/kudu/tablet/tablet.cc M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tablet/tablet_replica_mm_ops.cc 5 files changed, 34 insertions(+), 8 deletions(-) Approvals: Kudu Jenkins: Verified Mike Percy: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/8606 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I84ad557851863f6fd9acff28ddcd1244e62cf516 Gerrit-Change-Number: 8606 Gerrit-PatchSet: 10 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] tablet: make various update paths atomic
Andrew Wong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8441 ) Change subject: tablet: make various update paths atomic .. tablet: make various update paths atomic A few codepaths in the tablet subsystem aren't atomic, i.e. if they fail mid-method (e.g. due to a file error), they leave some in-memory structures updated and others untouched. This has been safe because we CHECKed to ensure their success. In preparation for _not_ crashing in these methods, this patch refactors some of these functions to be atomic, and notes others that still have the possibility of failing in such a state (these calls still trigger a CHECK failure). There are no functional changes in this patch. Change-Id: I0241cd4206ce77ef1fb334458b091bc2092f4141 Reviewed-on: http://gerrit.cloudera.org:8080/8441 Reviewed-by: Mike Percy Tested-by: Kudu Jenkins --- M src/kudu/tablet/delta_compaction.cc M src/kudu/tablet/delta_compaction.h M src/kudu/tablet/delta_tracker.cc M src/kudu/tablet/delta_tracker.h M src/kudu/tablet/diskrowset.cc M src/kudu/tablet/diskrowset.h M src/kudu/tablet/metadata-test.cc M src/kudu/tablet/rowset_metadata.cc M src/kudu/tablet/rowset_metadata.h 9 files changed, 224 insertions(+), 169 deletions(-) Approvals: Mike Percy: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/8441 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I0241cd4206ce77ef1fb334458b091bc2092f4141 Gerrit-Change-Number: 8441 Gerrit-PatchSet: 15 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR](branch-1.5.x) KUDU-2218. tls socket: properly handle temporary socket errors in Writev
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8604 ) Change subject: KUDU-2218. tls_socket: properly handle temporary socket errors in Writev .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8604 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: branch-1.5.x Gerrit-MessageType: comment Gerrit-Change-Id: If797f220f42bfb2e6f452b66f15e7a758e883472 Gerrit-Change-Number: 8604 Gerrit-PatchSet: 2 Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 22 Nov 2017 06:09:00 + Gerrit-HasComments: No
[kudu-CR](branch-1.5.x) KUDU-2218. tls socket: properly handle temporary socket errors in Writev
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8604 ) Change subject: KUDU-2218. tls_socket: properly handle temporary socket errors in Writev .. KUDU-2218. tls_socket: properly handle temporary socket errors in Writev This fixes a bug which caused RaftConsensusITest.TestLargeBatches to fail when run under stress, as in the following command line: taskset -c 0-4 \ build/latest/bin/raft_consensus-itest \ --gtest_filter=\*LargeBat\* \ --stress-cpu-threads=8 This would produce an error like: Network error: failed to write to TLS socket: error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry:s3_pkt.c:878 This means that we were retrying a write after getting EAGAIN, but with a different buffer than the first time. I tracked this down to mishandling of temporary socket errors in TlsSocket::Writev(). In the case that we successfully write part of the io vector but hit such an error trying to write a later element in the vector, we were still propagating the error back up to the caller. The caller didn't realize that part of the write was successful, and thus it would retry the write from the beginning. The fix is to fix the above, but also to enable partial writes in TlsContext. The new test fails if either of the above two changes are backed out. While merging from the main trunk, the SCOPED_CLEANUP macro was replaced with MakeScopedCleanup. Change-Id: If797f220f42bfb2e6f452b66f15e7a758e883472 Reviewed-on: http://gerrit.cloudera.org:8080/8570 Tested-by: Kudu Jenkins Reviewed-by: Alexey Serbin (cherry picked from commit 64eb9f37b171419ed12a3795efe28faf2fd33b3d) Reviewed-on: http://gerrit.cloudera.org:8080/8604 Tested-by: Alexey Serbin Reviewed-by: Todd Lipcon --- M src/kudu/security/tls_context.cc M src/kudu/security/tls_socket-test.cc M src/kudu/security/tls_socket.cc 3 files changed, 219 insertions(+), 70 deletions(-) Approvals: Alexey Serbin: Verified Todd Lipcon: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/8604 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: branch-1.5.x Gerrit-MessageType: merged Gerrit-Change-Id: If797f220f42bfb2e6f452b66f15e7a758e883472 Gerrit-Change-Number: 8604 Gerrit-PatchSet: 3 Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Todd Lipcon
[kudu-CR](branch-1.5.x) [security] test and fixes for TLS socket EINTR issues
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8603 ) Change subject: [security] test and fixes for TLS socket EINTR issues .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8603 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: branch-1.5.x Gerrit-MessageType: comment Gerrit-Change-Id: Ibec9049186f79f1c43295e4735538ed7ba4e516e Gerrit-Change-Number: 8603 Gerrit-PatchSet: 2 Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 22 Nov 2017 06:08:20 + Gerrit-HasComments: No
[kudu-CR](branch-1.5.x) [security] test and fixes for TLS socket EINTR issues
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8603 ) Change subject: [security] test and fixes for TLS socket EINTR issues .. [security] test and fixes for TLS socket EINTR issues SSL_{read,write}() can return SSL_ERROR_WANT_{READ,WRITE} correspondingly when signal interrupts recv()/send() calls even if SSL_MODE_AUTO_RETRY is set in the TLS context. To handle that properly in Socket::Blocking{Recv,Write}() methods, return NetworkError() with appropriate POSIX error code from TlsSocket::{Recv,Write}(). As a by-product, this changelist fixes flakiness in TestUniqueClientIds scenario of the ClientStressTest test and other flaky tests which failed with errors like below: Bad status: IO error: Could not connect to the cluster: \ Client connection negotiation failed: client connection to \ IP:port: Read zero bytes on a blocking Recv() call: \ Transferred 0 of 4 bytes Prior to this fix, the test failure ratio observed with dist-test for TSAN builds was about 6% in multiple 1K runs. After the fix, no failures observed. While merging from the main trunk, the SCOPED_CLEANUP macro was replaced with MakeScopedCleanup. Change-Id: Ibec9049186f79f1c43295e4735538ed7ba4e516e Reviewed-on: http://gerrit.cloudera.org:8080/8462 Tested-by: Kudu Jenkins Reviewed-by: Dan Burkert Reviewed-by: Alexey Serbin (cherry picked from commit 18e024cf8bcaea192efb63780802cc4c799bbb9c) Reviewed-on: http://gerrit.cloudera.org:8080/8603 Reviewed-by: Todd Lipcon --- M src/kudu/security/CMakeLists.txt M src/kudu/security/tls_handshake.cc A src/kudu/security/tls_socket-test.cc M src/kudu/security/tls_socket.cc M src/kudu/util/net/socket.cc 5 files changed, 227 insertions(+), 7 deletions(-) Approvals: Kudu Jenkins: Verified Todd Lipcon: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/8603 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: branch-1.5.x Gerrit-MessageType: merged Gerrit-Change-Id: Ibec9049186f79f1c43295e4735538ed7ba4e516e Gerrit-Change-Number: 8603 Gerrit-PatchSet: 3 Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] shutdown tablets on disk failure at runtime
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/7442 ) Change subject: shutdown tablets on disk failure at runtime .. Patch Set 11: (2 comments) http://gerrit.cloudera.org:8080/#/c/7442/11/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: http://gerrit.cloudera.org:8080/#/c/7442/11/src/kudu/tserver/ts_tablet_manager.cc@1367 PS11, Line 1367: Tablet* tablet = replica->tablet() prefer: shared_ptr tablet = replica->shared_tablet(); because the TabletReplica is allowed to delete the Tablet when it shuts down. http://gerrit.cloudera.org:8080/#/c/7442/11/src/kudu/tserver/ts_tablet_manager.cc@1368 PS11, Line 1368: (!tablet || tablet->HasBeenStopped()) hmm. this doesn't really make sense to me. Don't we want to shut down the replica regardless of this stuff? -- To view, visit http://gerrit.cloudera.org:8080/7442 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8 Gerrit-Change-Number: 7442 Gerrit-PatchSet: 11 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 22 Nov 2017 05:39:20 + Gerrit-HasComments: Yes
[kudu-CR] error manager: synchronize/serialize handling
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8395 ) Change subject: error_manager: synchronize/serialize handling .. Patch Set 12: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8395 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie61c408a0b4424f933f40a31147568c2f906be0e Gerrit-Change-Number: 8395 Gerrit-PatchSet: 12 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 22 Nov 2017 05:29:18 + Gerrit-HasComments: No
[kudu-CR] tablet: add early returns to maintenance functions
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8606 ) Change subject: tablet: add early returns to maintenance functions .. Patch Set 9: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8606 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I84ad557851863f6fd9acff28ddcd1244e62cf516 Gerrit-Change-Number: 8606 Gerrit-PatchSet: 9 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 22 Nov 2017 05:26:49 + Gerrit-HasComments: No
[kudu-CR] tablet: mark delta tracker read-only on error
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8605 ) Change subject: tablet: mark delta tracker read-only on error .. Patch Set 8: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8605 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib950048e9cd0929a10714ab1cc2bd829835afced Gerrit-Change-Number: 8605 Gerrit-PatchSet: 8 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 22 Nov 2017 05:25:55 + Gerrit-HasComments: No
[kudu-CR] handle disk failures during tablet copies
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/7654 ) Change subject: handle disk failures during tablet copies .. Patch Set 9: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/7654 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7 Gerrit-Change-Number: 7654 Gerrit-PatchSet: 9 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Comment-Date: Wed, 22 Nov 2017 05:26:11 + Gerrit-HasComments: No
[kudu-CR] tablet: remove ignore result in ApplyRowOperation
Andrew Wong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8620 ) Change subject: tablet: remove ignore_result in ApplyRowOperation .. tablet: remove ignore_result in ApplyRowOperation ApplyRowOperation() previously ignored the result it returned. This patch updates this to check for errors that we expect, rather than a blanket ignorance of all errors. Unexpected Statuses are returned. Change-Id: I54ea74f5ee10733b2d7b13b1971f9177dddb50cb Reviewed-on: http://gerrit.cloudera.org:8080/8620 Reviewed-by: Mike Percy Tested-by: Kudu Jenkins --- M src/kudu/tablet/tablet.cc 1 file changed, 11 insertions(+), 5 deletions(-) Approvals: Mike Percy: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/8620 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I54ea74f5ee10733b2d7b13b1971f9177dddb50cb Gerrit-Change-Number: 8620 Gerrit-PatchSet: 5 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] shutdown tablets on disk failure at runtime
Hello Tidy Bot, Mike Percy, David Ribeiro Alves, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7442 to look at the new patch set (#11). Change subject: shutdown tablets on disk failure at runtime .. shutdown tablets on disk failure at runtime Before, various code paths pass along disk failure Statuses until they eventually hit a CHECK failure and crash the server. Such fatal errors were "safe" by design, as they would ensure no additional changes were made durable to each tablet. This patch aims to achieve similar behavior for failed replicas while keeping the server alive. These failures are permitted provided the following have occurred for each tablet in the affected directory: - The failed directory is immediately marked as failed, preventing further tablets from being striped across a failed disk. - The tablet's MvccManager is shut down to prevent further writes from being made durable and preventing I/O to the tablet. - A request is submitted to a threadpool to eventually completely shut down the replica, eventually marking it for eviction. Beyond the above functionality, to cancel replica maintenance ops along with the rest of the error handling, I updated the locking behavior of TabletReplica so access to its maintenance ops can be done in a thread-safe way (by guarding the list of ops with the replica's lock). NOTE: failures of the metadata directory and the WAL directory are fatal. Code paths that update these explicitly crash the server. This is a part of a series of patches to handle disk failure. To see how this patch fits in, see section 3 of: https://docs.google.com/document/d/1yGVzDzV14mKReZ7EzlZZV_KfDBRnHJkRtlDox_RPXAA/edit Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8 --- M src/kudu/tablet/tablet.cc M src/kudu/tablet/tablet.h M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tserver/tablet_server.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/ts_tablet_manager.h 7 files changed, 162 insertions(+), 47 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/7442/11 -- To view, visit http://gerrit.cloudera.org:8080/7442 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I109635a54268b9db741b2ae9ea3e9f1fe072d0a8 Gerrit-Change-Number: 7442 Gerrit-PatchSet: 11 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] shutdown tablets on disk failure at runtime
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7442 ) Change subject: shutdown tablets on disk failure at runtime .. Patch Set 10: (15 comments) http://gerrit.cloudera.org:8080/#/c/7442/2/src/kudu/tablet/tablet.h File src/kudu/tablet/tablet.h: http://gerrit.cloudera.org:8080/#/c/7442/2/src/kudu/tablet/tablet.h@157 PS2, Line 157: > Seems like this has a new (sort of awkward) contract now since you have to Hrm, the contract was meant to freeze any further or currently-running calls to these functions. Good point that we might just be able to get by by returning a status. http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.h File src/kudu/tablet/tablet_replica.h: http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.h@268 PS10, Line 268: void RegisterMaintenanceOps(MaintenanceManager* maintenance_manager); > warning: function 'kudu::tablet::TabletReplica::RegisterMaintenanceOps' has Done http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.cc File src/kudu/tablet/tablet_replica.cc: http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.cc@733 PS10, Line 733: void TabletReplica::MakeUnavailable(const string& reason) { > My suggestion for this function: Done http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.cc@739 PS10, Line 739: tablet_->Stop(); > is there a reason we need to call tablet_->Stop() while holding TabletRepli Done http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.cc@740 PS10, Line 740: tablet_->CancelMaintenanceOps(); > this isn't needed because Stop() does it automatically Done http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tablet/tablet_replica.cc@749 PS10, Line 749: Status::IOError(reason, "", EIO) > This seems a bit unexpected. If we need a Status why not pass it into MakeU Done http://gerrit.cloudera.org:8080/#/c/7442/2/src/kudu/tablet/tablet_replica_mm_ops.cc File src/kudu/tablet/tablet_replica_mm_ops.cc: http://gerrit.cloudera.org:8080/#/c/7442/2/src/kudu/tablet/tablet_replica_mm_ops.cc@137 PS2, Line 137: CHECK(!tablet->rowsets_flush_sem_.try_lock()); > Hrm. Is it safe to change a CHECK() to a LOG(ERROR) here? I would like to s Duly noted. I'm still thinking about how to best test this (other than the integration tests for disk failure as a whole), which I could move into this patch as well. Also note that this is being replaced with a LOG(ERROR) and then another CHECK below. If there is a disk failure, it _shouldn't_ matter what state the MRSs are in, since the tablet shouldn't be used anyway. In the error-handling code, it's set to FAILED, so scans won't be able to touch it, and further transactions and maintenance ops should no longer be run. http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.h File src/kudu/tserver/ts_tablet_manager.h: http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.h@125 PS10, Line 125: // Transitions the tablet state specified by 'tablet_id' with the specified > How about using this for the comment: Done http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.h@127 PS10, Line 127: Status StartTransitionTabletState(const std::string& tablet_id, > 1. why is this public? 1. Done 2. Done 3. I think we do need to transition to avoid bootstrap/copy overlapping with shutdown http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.h@212 PS10, Line 212: FailTablet > Perhaps a better name for this method would be FailTabletAndScheduleShutdow Done http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.h@309 PS10, Line 309: FailTabletAsync > The Async suffix makes me think that this schedules work on a background th Done http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.cc@1316 PS10, Line 1316: if (!s.IsDiskFailure()) { : return s; : } : LOG(FATAL) << Substitute("Tablet $0's consensus metadata is in a failed disk", tablet_id); > 1) s/in a failed disk/on a failed disk/ Done http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.cc@1351 PS10, Line 1351: LOG(INFO) << Substitute("Forcing failure of tablet $0", tablet_id); > use LogPrefix(tablet_id) here Done http://gerrit.cloudera.org:8080/#/c/7442/10/src/kudu/tserver/ts_tablet_manager.cc@1372 PS10, Line 1372: s = StartTransitionTabletState(tablet_id, "failing tablet", > What is the purpose of all this transitioning stuff in here? Logically, that's what it's doing, but I think shutdown, bootstrapping, and copying are designed to be synchronized in th
[kudu-CR] error manager: synchronize/serialize handling
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8395 ) Change subject: error_manager: synchronize/serialize handling .. Patch Set 12: (2 comments) http://gerrit.cloudera.org:8080/#/c/8395/9/src/kudu/fs/error_manager-test.cc File src/kudu/fs/error_manager-test.cc: http://gerrit.cloudera.org:8080/#/c/8395/9/src/kudu/fs/error_manager-test.cc@92 PS9, Line 92: SleepForRand(); > Why not just "curry" it, which means embed the parameter at the bind site? Oh neat! Didn't realize we had such currying abilities in K++ :) http://gerrit.cloudera.org:8080/#/c/8395/11/src/kudu/fs/error_manager-test.cc File src/kudu/fs/error_manager-test.cc: http://gerrit.cloudera.org:8080/#/c/8395/11/src/kudu/fs/error_manager-test.cc@184 PS11, Line 184: > since you are using emplace_back, you can now omit the thread( part also Done -- To view, visit http://gerrit.cloudera.org:8080/8395 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ie61c408a0b4424f933f40a31147568c2f906be0e Gerrit-Change-Number: 8395 Gerrit-PatchSet: 12 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 22 Nov 2017 03:17:52 + Gerrit-HasComments: Yes
[kudu-CR] error manager: synchronize/serialize handling
Hello Tidy Bot, Mike Percy, Kudu Jenkins, Adar Dembo, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8395 to look at the new patch set (#12). Change subject: error_manager: synchronize/serialize handling .. error_manager: synchronize/serialize handling The FsErrorManager is helper infrastructure that other classes can use to help provide API contracts related to error handling. Here is an example error-handling contract provided to the TSTabletManager in a future patch, if a disk failure Status is returned during tablet operation, either: 1) the tablet server will crash and we can rely on Kudu's crash-consistency mechanisms for safety, or 2) any affected Tablet will transition to the 'kStopped' state via an error manager callback. Most components will simply pass the non-OK Status up the call chain. However, if a method expects an IO operation to succeed for correctness purposes, and it receives an error Status, then it should check that the Tablet is stopped. If so, it can be assumed that the error was handled. Otherwise, Kudu must rely on the crash-consistency mechanisms and crash. Given the above contract, The state of a tablet server post-disk-failure depends significantly on the completion of disk-failure-handling callbacks. Error-handling _must_ finish before anything is propagated back to the offending caller. To ensure correct completion of error-handling even in a concurrent setting, this patch extends the error manager with a mutex and a second error-handling callback type. This ensures that errors indirectly caused by disk failures can be handled by non-disk-specific handling, serializing failure-handling in the same fashion. As an example of where this is necessary, say a tablet has data in a single directory and hits a bad disk. That directory is immediately marked failed and handling starts to fail all tablets in the directory. Before, if the tablet were to create a new block before being failed, it would fail immediately, complaining that no directories are available, and would eventually fail a CHECK that translates roughly to: "Has error handling for this tablet completed?" By wrapping block creation with tablet-specific error handling and with serialized error-handling, this CHECK will pass. A previous revision accomplished this by using a single-threaded threadpool to trigger error-handling and ensuring completion of error-handling by waiting on the threadpool. I ultimately went with this implentation since it's a bit more straight-forward wrt when to make the different calls (i.e. "trigger disk error-handling vs trigger tablet error-handling", instead of "trigger error-handling vs wait for error-handling to finish"). This also has the benefit of being extendable for other kinds of errors in the future. Change-Id: Ie61c408a0b4424f933f40a31147568c2f906be0e --- M src/kudu/fs/CMakeLists.txt M src/kudu/fs/data_dirs-test.cc A src/kudu/fs/error_manager-test.cc A src/kudu/fs/error_manager.cc M src/kudu/fs/error_manager.h M src/kudu/fs/file_block_manager.cc M src/kudu/fs/fs_manager.cc M src/kudu/fs/fs_manager.h M src/kudu/fs/log_block_manager-test.cc M src/kudu/fs/log_block_manager.cc M src/kudu/tserver/tablet_server.cc M src/kudu/util/status.h 12 files changed, 389 insertions(+), 58 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/95/8395/12 -- To view, visit http://gerrit.cloudera.org:8080/8395 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ie61c408a0b4424f933f40a31147568c2f906be0e Gerrit-Change-Number: 8395 Gerrit-PatchSet: 12 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] tablet: make various update paths atomic
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8441 ) Change subject: tablet: make various update paths atomic .. Patch Set 14: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8441 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0241cd4206ce77ef1fb334458b091bc2092f4141 Gerrit-Change-Number: 8441 Gerrit-PatchSet: 14 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 22 Nov 2017 03:02:46 + Gerrit-HasComments: No
[kudu-CR] tablet: remove ignore result in ApplyRowOperation
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8620 ) Change subject: tablet: remove ignore_result in ApplyRowOperation .. Patch Set 4: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8620 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I54ea74f5ee10733b2d7b13b1971f9177dddb50cb Gerrit-Change-Number: 8620 Gerrit-PatchSet: 4 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 22 Nov 2017 03:02:32 + Gerrit-HasComments: No
[kudu-CR] tablet: remove ignore result in ApplyRowOperation
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8620 ) Change subject: tablet: remove ignore_result in ApplyRowOperation .. Patch Set 4: fixed iwyu and rebased. -- To view, visit http://gerrit.cloudera.org:8080/8620 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I54ea74f5ee10733b2d7b13b1971f9177dddb50cb Gerrit-Change-Number: 8620 Gerrit-PatchSet: 4 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 22 Nov 2017 02:54:01 + Gerrit-HasComments: No
[kudu-CR] tablet: remove ignore result in ApplyRowOperation
Hello Tidy Bot, Mike Percy, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8620 to look at the new patch set (#4). Change subject: tablet: remove ignore_result in ApplyRowOperation .. tablet: remove ignore_result in ApplyRowOperation ApplyRowOperation() previously ignored the result it returned. This patch updates this to check for errors that we expect, rather than a blanket ignorance of all errors. Unexpected Statuses are returned. Change-Id: I54ea74f5ee10733b2d7b13b1971f9177dddb50cb --- M src/kudu/tablet/tablet.cc 1 file changed, 11 insertions(+), 5 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/20/8620/4 -- To view, visit http://gerrit.cloudera.org:8080/8620 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I54ea74f5ee10733b2d7b13b1971f9177dddb50cb Gerrit-Change-Number: 8620 Gerrit-PatchSet: 4 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] tablet: remove ignore result in ApplyRowOperation
Hello Mike Percy, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8620 to look at the new patch set (#3). Change subject: tablet: remove ignore_result in ApplyRowOperation .. tablet: remove ignore_result in ApplyRowOperation ApplyRowOperation() previously ignored the result it returned. This patch updates this to check for errors that we expect, rather than a blanket ignorance of all errors. Unexpected Statuses are returned. Change-Id: I54ea74f5ee10733b2d7b13b1971f9177dddb50cb --- M src/kudu/tablet/tablet.cc 1 file changed, 11 insertions(+), 5 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/20/8620/3 -- To view, visit http://gerrit.cloudera.org:8080/8620 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I54ea74f5ee10733b2d7b13b1971f9177dddb50cb Gerrit-Change-Number: 8620 Gerrit-PatchSet: 3 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] ntp: dump diagnostics when NTP is unsynchronized or error is too high
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8557 ) Change subject: ntp: dump diagnostics when NTP is unsynchronized or error is too high .. ntp: dump diagnostics when NTP is unsynchronized or error is too high This adds some code to shell out to various NTP diagnostics tools when clock problems are detected, just before fataling. Hopefully this will make it easier to diagnose improper NTP setups in the wild. Change-Id: Ifeb2206a0475b8c8e183a74aee21315e6e43dc33 Reviewed-on: http://gerrit.cloudera.org:8080/8557 Reviewed-by: Alexey Serbin Tested-by: Kudu Jenkins --- M src/kudu/clock/hybrid_clock-test.cc M src/kudu/clock/hybrid_clock.cc M src/kudu/clock/hybrid_clock.h M src/kudu/clock/system_ntp.cc M src/kudu/clock/system_ntp.h M src/kudu/clock/time_service.h M src/kudu/security/test/mini_kdc.cc M src/kudu/tools/kudu-tool-test.cc M src/kudu/util/path_util.cc M src/kudu/util/path_util.h M src/kudu/util/test_util.cc M src/kudu/util/test_util.h 12 files changed, 154 insertions(+), 42 deletions(-) Approvals: Alexey Serbin: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/8557 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: Ifeb2206a0475b8c8e183a74aee21315e6e43dc33 Gerrit-Change-Number: 8557 Gerrit-PatchSet: 6 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley
[kudu-CR] schema: fast-path for Schema::KeyEquals, etc.
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8627 ) Change subject: schema: fast-path for Schema::KeyEquals, etc. .. schema: fast-path for Schema::KeyEquals, etc. If two schemas are the same object, then KeyEquals can be fast-pathed. The same goes for various ColumnSchema equality checks. This hasn't been a hot path in release builds, but I noticed a lot of wasted CPU in a debug build evaluating Schema::KeyEquals, and my suspicion is that in a great number of cases the objects are identical. Even in release builds, may as well avoid the unnecessary cycles. Change-Id: Ia5fadfd70ea651b9105837535f21a34c6e0af6ae Reviewed-on: http://gerrit.cloudera.org:8080/8627 Reviewed-by: Alexey Serbin Tested-by: Kudu Jenkins --- M src/kudu/common/schema.h 1 file changed, 5 insertions(+), 0 deletions(-) Approvals: Alexey Serbin: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/8627 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: Ia5fadfd70ea651b9105837535f21a34c6e0af6ae Gerrit-Change-Number: 8627 Gerrit-PatchSet: 3 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] consensus: Add gflag to enable 3-4-3 re-replication
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/8626 ) Change subject: consensus: Add gflag to enable 3-4-3 re-replication .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc File src/kudu/consensus/raft_consensus.cc: http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc@124 PS1, Line 124: DEFINE_bool(raft_rereplication_add_before_evict, false, : "When enabled, failed replicas will only be evicted after a " : "replacement has been added to the config."); > I think 'raft_prepare_replacement_before_eviction' sounds much better. A little bit more of bike-shedding from my side: I think it's worth noting in the description that this flag means different things in case of master/catalog_manager and tserver. Instead, maybe, introduce a flag to both tserver and master, named like 'ts_tablet_prepare_replacement_before_eviction' ? Naming is hard. -- To view, visit http://gerrit.cloudera.org:8080/8626 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I247842af7d0172d6e6b167a29613f6b261990a01 Gerrit-Change-Number: 8626 Gerrit-PatchSet: 1 Gerrit-Owner: Mike Percy Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Comment-Date: Wed, 22 Nov 2017 02:21:30 + Gerrit-HasComments: Yes
[kudu-CR] consensus: Add gflag to enable 3-4-3 re-replication
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/8626 ) Change subject: consensus: Add gflag to enable 3-4-3 re-replication .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc File src/kudu/consensus/raft_consensus.cc: http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc@124 PS1, Line 124: DEFINE_bool(raft_rereplication_add_before_evict, false, : "When enabled, failed replicas will only be evicted after a " : "replacement has been added to the config."); > I don't like putting version numbers in for config flags because they often I think 'raft_prepare_replacement_before_eviction' sounds much better. -- To view, visit http://gerrit.cloudera.org:8080/8626 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I247842af7d0172d6e6b167a29613f6b261990a01 Gerrit-Change-Number: 8626 Gerrit-PatchSet: 1 Gerrit-Owner: Mike Percy Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Comment-Date: Wed, 22 Nov 2017 01:43:43 + Gerrit-HasComments: Yes
[kudu-CR] WIP [catalog manager] added 3-4-3 behavior
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/8619 ) Change subject: WIP [catalog_manager] added 3-4-3 behavior .. Patch Set 2: (6 comments) http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/consensus/quorum_util.cc File src/kudu/consensus/quorum_util.cc: http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/consensus/quorum_util.cc@421 PS2, Line 421: peer.attrs().has_promote() > I don't think we need to rely on has_ since this defaults to false. Done http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/consensus/quorum_util.cc@450 PS2, Line 450: for (const RaftPeerPB& peer : config.peers()) { > I find the logic a little tough to follow in here. Maybe it would be a litt I thought about that but it would require keeping the finding in the temporary containers, etc. while it's possible to do so without allocating an extra stuff. But if it would be easier to follow -- sure, I'll update that as needed. Yeah, that a good point about using the same code for deciding to evict either a voter or non-voter. The logic in the functions to evict voter and non-voters was using different criteria in the first version, but now it's the same. I'll update that, thanks. http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/master/catalog_manager.cc@204 PS2, Line 204: // TODO(aserbin): can we get a better name for this flag? Ideally, there should > i posted a patch at https://gerrit.cloudera.org/c/8626/ so maybe we can use thanks! that would be a good thing to have. http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/master/catalog_manager.cc@3007 PS2, Line 3007: member_type_(member_type) > this field only seems relevant for adding new replicas or a forced promotio good catch -- i'll move it into appropriate function. http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/master/catalog_manager.cc@3198 PS2, Line 3198: peer->set_member_type(member_type_); > why is this field needed for an eviction? Good point: I think it's not needed for eviction. I'll move it into the AsyncAddReplicaTask. http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/master/catalog_manager.cc@3482 PS2, Line 3482: else { : string to_evict; : if > nit: we can make this an "else if" and avoid an additional level of nesting I was trying to keep to_evict local. All right, having less level of nesting is a priority makes more sense, I'll switch to 'else if'. -- To view, visit http://gerrit.cloudera.org:8080/8619 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6f0469ac641bf7a03dbef01eaa3f1b58a5bf5d27 Gerrit-Change-Number: 8619 Gerrit-PatchSet: 2 Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 22 Nov 2017 01:40:52 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-2222. update scan delta compact-test: increase write timeout
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8628 ) Change subject: KUDU-. update_scan_delta_compact-test: increase write timeout .. KUDU-. update_scan_delta_compact-test: increase write timeout In ASAN builds, the 5-second timeout on 1000-row batches was not sufficient to prevent write timeouts. With this, the test passed on dist-test 100/100[1], whereas previously it failed 87/100[2]. [1] http://dist-test.cloudera.org/job?job_id=todd.1511306572.114498 [2] http://dist-test.cloudera.org/job?job_id=todd.1511304651.87727 Change-Id: Ic393579320b87e0f5cec494870441b8aa69db3b6 Reviewed-on: http://gerrit.cloudera.org:8080/8628 Reviewed-by: Alexey Serbin Tested-by: Kudu Jenkins --- M src/kudu/integration-tests/update_scan_delta_compact-test.cc 1 file changed, 1 insertion(+), 1 deletion(-) Approvals: Alexey Serbin: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/8628 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: Ic393579320b87e0f5cec494870441b8aa69db3b6 Gerrit-Change-Number: 8628 Gerrit-PatchSet: 2 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] Add a test for data consistency after stopping tablets
Andrew Wong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8618 ) Change subject: Add a test for data consistency after stopping tablets .. Add a test for data consistency after stopping tablets This adds a new test which starts a write workload and then shuts down all of the servers while the writes are in-flight. The shutdown of the servers causes the tablets to shut down, hence testing the new "abort" path for in-flight writes. Passed 1000/1000 in a TSAN build: http://dist-test.cloudera.org/job?job_id=todd.1511294742.123710 Change-Id: I7b3b9e90fd1fbfb555a22074fd4a777a2b2e56e2 Reviewed-on: http://gerrit.cloudera.org:8080/8618 Reviewed-by: Andrew Wong Tested-by: Andrew Wong Reviewed-by: Mike Percy --- M src/kudu/integration-tests/cluster_verifier.cc M src/kudu/integration-tests/cluster_verifier.h M src/kudu/integration-tests/stop_tablet-itest.cc M src/kudu/integration-tests/test_workload.cc M src/kudu/integration-tests/test_workload.h M src/kudu/mini-cluster/external_mini_cluster.cc M src/kudu/mini-cluster/internal_mini_cluster.cc M src/kudu/tools/ksck_remote.cc 8 files changed, 81 insertions(+), 15 deletions(-) Approvals: Andrew Wong: Looks good to me, approved; Verified Mike Percy: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/8618 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I7b3b9e90fd1fbfb555a22074fd4a777a2b2e56e2 Gerrit-Change-Number: 8618 Gerrit-PatchSet: 4 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] consensus: Add gflag to enable 3-4-3 re-replication
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8626 ) Change subject: consensus: Add gflag to enable 3-4-3 re-replication .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc File src/kudu/consensus/raft_consensus.cc: http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc@124 PS1, Line 124: DEFINE_bool(raft_rereplication_add_before_evict, false, : "When enabled, failed replicas will only be evicted after a " : "replacement has been added to the config."); > OK, I see. I don't like putting version numbers in for config flags because they often last long after the versions have meaning, but you make a good point. How about: raft_prepare_replacement_before_eviction This is something that is actually new, and that would apply to many future designs, and represents something we're not doing that causes availability problems today. -- To view, visit http://gerrit.cloudera.org:8080/8626 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I247842af7d0172d6e6b167a29613f6b261990a01 Gerrit-Change-Number: 8626 Gerrit-PatchSet: 1 Gerrit-Owner: Mike Percy Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Comment-Date: Wed, 22 Nov 2017 00:44:41 + Gerrit-HasComments: Yes
[kudu-CR] Add a test for data consistency after stopping tablets
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8618 ) Change subject: Add a test for data consistency after stopping tablets .. Patch Set 3: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8618 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I7b3b9e90fd1fbfb555a22074fd4a777a2b2e56e2 Gerrit-Change-Number: 8618 Gerrit-PatchSet: 3 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 22 Nov 2017 00:41:41 + Gerrit-HasComments: No
[kudu-CR] Add a test for data consistency after stopping tablets
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8618 ) Change subject: Add a test for data consistency after stopping tablets .. Patch Set 3: Verified+1 Code-Review+2 Build failure was caused by KUDU-. -- To view, visit http://gerrit.cloudera.org:8080/8618 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I7b3b9e90fd1fbfb555a22074fd4a777a2b2e56e2 Gerrit-Change-Number: 8618 Gerrit-PatchSet: 3 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 22 Nov 2017 00:38:24 + Gerrit-HasComments: No
[kudu-CR] Add a test for data consistency after stopping tablets
Andrew Wong has removed a vote on this change. Change subject: Add a test for data consistency after stopping tablets .. Removed Verified-1 by Kudu Jenkins (120) -- To view, visit http://gerrit.cloudera.org:8080/8618 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: deleteVote Gerrit-Change-Id: I7b3b9e90fd1fbfb555a22074fd4a777a2b2e56e2 Gerrit-Change-Number: 8618 Gerrit-PatchSet: 3 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] ntp: dump diagnostics when NTP is unsynchronized or error is too high
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/8557 ) Change subject: ntp: dump diagnostics when NTP is unsynchronized or error is too high .. Patch Set 5: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8557 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifeb2206a0475b8c8e183a74aee21315e6e43dc33 Gerrit-Change-Number: 8557 Gerrit-PatchSet: 5 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Wed, 22 Nov 2017 00:38:04 + Gerrit-HasComments: No
[kudu-CR] schema: fast-path for Schema::KeyEquals, etc.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/8627 ) Change subject: schema: fast-path for Schema::KeyEquals, etc. .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8627 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia5fadfd70ea651b9105837535f21a34c6e0af6ae Gerrit-Change-Number: 8627 Gerrit-PatchSet: 2 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 22 Nov 2017 00:36:32 + Gerrit-HasComments: No
[kudu-CR] schema: fast-path for Schema::KeyEquals, etc.
Hello Alexey Serbin, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8627 to look at the new patch set (#2). Change subject: schema: fast-path for Schema::KeyEquals, etc. .. schema: fast-path for Schema::KeyEquals, etc. If two schemas are the same object, then KeyEquals can be fast-pathed. The same goes for various ColumnSchema equality checks. This hasn't been a hot path in release builds, but I noticed a lot of wasted CPU in a debug build evaluating Schema::KeyEquals, and my suspicion is that in a great number of cases the objects are identical. Even in release builds, may as well avoid the unnecessary cycles. Change-Id: Ia5fadfd70ea651b9105837535f21a34c6e0af6ae --- M src/kudu/common/schema.h 1 file changed, 5 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/27/8627/2 -- To view, visit http://gerrit.cloudera.org:8080/8627 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ia5fadfd70ea651b9105837535f21a34c6e0af6ae Gerrit-Change-Number: 8627 Gerrit-PatchSet: 2 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon
[kudu-CR] schema: fast-path for Schema::KeyEquals
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8627 ) Change subject: schema: fast-path for Schema::KeyEquals .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/8627/1/src/kudu/common/schema.h File src/kudu/common/schema.h: http://gerrit.cloudera.org:8080/#/c/8627/1/src/kudu/common/schema.h@706 PS1, Line 706: if (this == &other) return true; > Does it make sense to add the same for ColumnSchema::{Equals,EqualsType,Equ Done -- To view, visit http://gerrit.cloudera.org:8080/8627 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia5fadfd70ea651b9105837535f21a34c6e0af6ae Gerrit-Change-Number: 8627 Gerrit-PatchSet: 1 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 22 Nov 2017 00:30:03 + Gerrit-HasComments: Yes
[kudu-CR] tablet: introduce closed mvcc and stopped tablets
Andrew Wong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/7439 ) Change subject: tablet: introduce closed mvcc and stopped tablets .. tablet: introduce closed mvcc and stopped tablets Currently, the only way to stop an Applying transaction is to wait for it to finish and Commit it. This constraint was put in place to guarantee on-disk correctness, but is sometimes too strict. E.g. if the tablet is shutting down, the Apply doesn't need to finish. This patch adds a new state to the MvccManager in which it is closed for transactions. Once in this closed state: 1. New Applies will return and not move to the Commit phase, and any methods waiting for the tablet's Applies to Commit (e.g. new snapshot scans, FlushMRS) will respond with an error immediately. This allows an escape from the existing invariant that Applies _must_ Commit, provided the MvccManager is in this closed state. 2. Applies that are already underway may still Commit, but will return early on a best-effort basis. These non-Committed operations are inconsequential w.r.t. consistency; having some in-flight transactions Commit and others not is consistent with the server shutting down in between the Commits of two transactions. 3. New transactions drivers will abort immediately before even reaching the Prepare phase, ensuring no more writes to the tablet are made durable. The Tablet class uses this closed MVCC state in a new "stopped" state of its own. A Tablet that has been stopped will avoid further activity: its MvccManager is closed to prevent further writes, and its maintenance ops are cancelled to prevent further scheduling. This patch includes these new behaviors when shutting down a tablet, with the assumption that a tablet will only be shut down when it's being deleted and we don't care too much about its in-flight transactions Committing or its further maintenance ops. Code paths that previously crashed if Applies did not succeed (e.g. TransactionDriver::ApplyTask, MvccManager::AbortTransaction, etc.) or that waited for Applies to finish (e.g. Tablet:: FlushUnlocked) will now _not_ crash if the Tablet has been stopped and will log a warning instead. Testing is done by adding the following: - a test in mvcc-test to shut down MVCC and delete an Applying transaction, ensuring that there are no errors when it leaves scope. - a test in mvcc-test to wait on an Applying transaction, shut down MVCC, and ensure that any waiters will return with an error. - a new test stop_tablet-itest is added to ensure stopped leaders block writes (because they cannot start new transactions) and stopped followers don't (because while they cannot service the op, there still exists a majority that can); that stopped tablets don't prevent fault-tolerant scans; and that stopping the only tablet does prevent scans Change-Id: I983620f27e7226806a2cca253db7619731914d42 Reviewed-on: http://gerrit.cloudera.org:8080/7439 Reviewed-by: Mike Percy Tested-by: Kudu Jenkins Reviewed-by: Todd Lipcon --- M src/kudu/integration-tests/CMakeLists.txt A src/kudu/integration-tests/stop_tablet-itest.cc M src/kudu/tablet/local_tablet_writer.h M src/kudu/tablet/mvcc-test.cc M src/kudu/tablet/mvcc.cc M src/kudu/tablet/mvcc.h M src/kudu/tablet/rowset.cc M src/kudu/tablet/tablet-harness.h M src/kudu/tablet/tablet.cc M src/kudu/tablet/tablet.h M src/kudu/tablet/tablet_bootstrap.cc M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica_mm_ops.cc M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tablet/transactions/transaction_driver.h M src/kudu/tablet/transactions/write_transaction.cc M src/kudu/tserver/tablet_service.cc 17 files changed, 733 insertions(+), 135 deletions(-) Approvals: Mike Percy: Looks good to me, approved Kudu Jenkins: Verified Todd Lipcon: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/7439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42 Gerrit-Change-Number: 7439 Gerrit-PatchSet: 40 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] tablet: introduce closed mvcc and stopped tablets
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/7439 ) Change subject: tablet: introduce closed mvcc and stopped tablets .. Patch Set 39: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/7439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42 Gerrit-Change-Number: 7439 Gerrit-PatchSet: 39 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 22 Nov 2017 00:27:39 + Gerrit-HasComments: No
[kudu-CR] ntp: dump diagnostics when NTP is unsynchronized or error is too high
Hello Will Berkeley, Tidy Bot, Alexey Serbin, Jean-Daniel Cryans, Andrew Wong, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8557 to look at the new patch set (#5). Change subject: ntp: dump diagnostics when NTP is unsynchronized or error is too high .. ntp: dump diagnostics when NTP is unsynchronized or error is too high This adds some code to shell out to various NTP diagnostics tools when clock problems are detected, just before fataling. Hopefully this will make it easier to diagnose improper NTP setups in the wild. Change-Id: Ifeb2206a0475b8c8e183a74aee21315e6e43dc33 --- M src/kudu/clock/hybrid_clock-test.cc M src/kudu/clock/hybrid_clock.cc M src/kudu/clock/hybrid_clock.h M src/kudu/clock/system_ntp.cc M src/kudu/clock/system_ntp.h M src/kudu/clock/time_service.h M src/kudu/security/test/mini_kdc.cc M src/kudu/tools/kudu-tool-test.cc M src/kudu/util/path_util.cc M src/kudu/util/path_util.h M src/kudu/util/test_util.cc M src/kudu/util/test_util.h 12 files changed, 154 insertions(+), 42 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/57/8557/5 -- To view, visit http://gerrit.cloudera.org:8080/8557 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ifeb2206a0475b8c8e183a74aee21315e6e43dc33 Gerrit-Change-Number: 8557 Gerrit-PatchSet: 5 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley
[kudu-CR] ntp: dump NTP diagnostics when NTP is unsynchronized or error is too high
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/8557 ) Change subject: ntp: dump NTP diagnostics when NTP is unsynchronized or error is too high .. Patch Set 4: (2 comments) http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.h File src/kudu/clock/system_ntp.h: http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.h@50 PS4, Line 50: log > can't do default parameters since it's a virtual method (tidy complained) ah, sure http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.cc File src/kudu/clock/system_ntp.cc: http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.cc@90 PS4, Line 90: ERROR > I wanted to log the whole thing at ERROR level since this is only used righ sgtm -- To view, visit http://gerrit.cloudera.org:8080/8557 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifeb2206a0475b8c8e183a74aee21315e6e43dc33 Gerrit-Change-Number: 8557 Gerrit-PatchSet: 4 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Wed, 22 Nov 2017 00:22:25 + Gerrit-HasComments: Yes
[kudu-CR] KUDU-2222. update scan delta compact-test: increase write timeout
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/8628 ) Change subject: KUDU-. update_scan_delta_compact-test: increase write timeout .. Patch Set 1: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8628 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic393579320b87e0f5cec494870441b8aa69db3b6 Gerrit-Change-Number: 8628 Gerrit-PatchSet: 1 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Comment-Date: Wed, 22 Nov 2017 00:20:31 + Gerrit-HasComments: No
[kudu-CR] schema: fast-path for Schema::KeyEquals
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/8627 ) Change subject: schema: fast-path for Schema::KeyEquals .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/8627/1/src/kudu/common/schema.h File src/kudu/common/schema.h: http://gerrit.cloudera.org:8080/#/c/8627/1/src/kudu/common/schema.h@706 PS1, Line 706: if (this == &other) return true; Does it make sense to add the same for ColumnSchema::{Equals,EqualsType,EqualsPhysicalType} while you are at it? -- To view, visit http://gerrit.cloudera.org:8080/8627 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ia5fadfd70ea651b9105837535f21a34c6e0af6ae Gerrit-Change-Number: 8627 Gerrit-PatchSet: 1 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Comment-Date: Wed, 22 Nov 2017 00:18:48 + Gerrit-HasComments: Yes
[kudu-CR] consensus: Add gflag to enable 3-4-3 re-replication
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/8626 ) Change subject: consensus: Add gflag to enable 3-4-3 re-replication .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc File src/kudu/consensus/raft_consensus.cc: http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc@124 PS1, Line 124: DEFINE_bool(raft_rereplication_add_before_evict, false, : "When enabled, failed replicas will only be evicted after a " : "replacement has been added to the config."); > The idea is that this enables 3-4-3 re-replication, and must be set to "tru OK, I see. I remember there were some musings about preference of 3-2-3 path for the case when we definitely know that one out of 3 replicas is failed, no? Overall, do you think '3-4-3' theme is good to describe the new way of auto-recovery in case of replica failure? Maybe, something like 'raft_rereplication_v1' is the best way to name it? -- To view, visit http://gerrit.cloudera.org:8080/8626 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I247842af7d0172d6e6b167a29613f6b261990a01 Gerrit-Change-Number: 8626 Gerrit-PatchSet: 1 Gerrit-Owner: Mike Percy Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Comment-Date: Wed, 22 Nov 2017 00:00:58 + Gerrit-HasComments: Yes
[kudu-CR] ntp: dump NTP diagnostics when NTP is unsynchronized or error is too high
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8557 ) Change subject: ntp: dump NTP diagnostics when NTP is unsynchronized or error is too high .. Patch Set 4: (7 comments) http://gerrit.cloudera.org:8080/#/c/8557/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8557/4//COMMIT_MSG@7 PS4, Line 7: NTP > nit: drop this Done http://gerrit.cloudera.org:8080/#/c/8557/4//COMMIT_MSG@9 PS4, Line 9: ntp > nit: NTP Done http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.h File src/kudu/clock/system_ntp.h: http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.h@50 PS4, Line 50: log > nit: maybe, make this a parameter by default? can't do default parameters since it's a virtual method (tidy complained) http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.cc File src/kudu/clock/system_ntp.cc: http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.cc@90 PS4, Line 90: ERROR > nit: maybe, just INFO in case of s.ok() ? I wanted to log the whole thing at ERROR level since this is only used right before a FATAL, and this way we'll guarantee to see the info in the ERROR log along with the fatal message, whereas INFO might be filtered out. http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.cc@77 PS4, Line 77: void TryRun(vector cmd, vector* log) { : string exe, out, err; : Status s = FindExecutable(cmd[0], {"/sbin", "/usr/sbin/"}, &exe); : if (!s.ok()) { : LOG_STRING(WARNING, log) << "could not find executable: " << cmd[0]; : return; : } : : cmd[0] = exe; : s = Subprocess::Call(cmd, "", &out, &err); : // Subprocess::Call() returns RuntimeError in the case that the process returns : // a non-zero exit code, but that might still generate useful err. : if (s.ok() || (s.IsRuntimeError() && (!out.empty() || !err.empty( { : LOG_STRING(ERROR, log) : << JoinStrings(cmd, " ") : << "\n--" : << (!out.empty() ? Substitute("\nstdout:\n$0", out) : "") : << (!err.empty() ? Substitute("\nstderr:\n$0", err) : "") : << "\n"; : } else { : LOG_STRING(WARNING, log) << "failed to run executable: " << cmd[0]; : } : : } > maybe, move this into src/kudu/util/subprocess.cc? eh, I think this is specific enough to what we're doing here (logging the output to ERROR, even in the case of non-zero exit code, etc) that I would rather not put it in the util/ code http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/util/subprocess.h File src/kudu/util/subprocess.h: http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/util/subprocess.h@220 PS4, Line 220: Status FindExecutable(const std::string& binary, > maybe, the better place for that is in path_util.h? Done http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/util/test_util.cc File src/kudu/util/test_util.cc: http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/util/test_util.cc@a307 PS4, Line 307: > remove the declaration part from test_util.h as well? Done -- To view, visit http://gerrit.cloudera.org:8080/8557 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifeb2206a0475b8c8e183a74aee21315e6e43dc33 Gerrit-Change-Number: 8557 Gerrit-PatchSet: 4 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Tue, 21 Nov 2017 23:59:49 + Gerrit-HasComments: Yes
[kudu-CR] consensus: Add gflag to enable 3-4-3 re-replication
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8626 ) Change subject: consensus: Add gflag to enable 3-4-3 re-replication .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc File src/kudu/consensus/raft_consensus.cc: http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc@124 PS1, Line 124: DEFINE_bool(raft_rereplication_add_before_evict, false, : "When enabled, failed replicas will only be evicted after a " : "replacement has been added to the config."); > Does this mean the replicas are to be evicted by the leader replica when th The idea is that this enables 3-4-3 re-replication, and must be set to "true" on all masters and tservers to work properly. -- To view, visit http://gerrit.cloudera.org:8080/8626 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I247842af7d0172d6e6b167a29613f6b261990a01 Gerrit-Change-Number: 8626 Gerrit-PatchSet: 1 Gerrit-Owner: Mike Percy Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Comment-Date: Tue, 21 Nov 2017 23:42:52 + Gerrit-HasComments: Yes
[kudu-CR] consensus: Add gflag to enable 3-4-3 re-replication
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/8626 ) Change subject: consensus: Add gflag to enable 3-4-3 re-replication .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc File src/kudu/consensus/raft_consensus.cc: http://gerrit.cloudera.org:8080/#/c/8626/1/src/kudu/consensus/raft_consensus.cc@124 PS1, Line 124: DEFINE_bool(raft_rereplication_add_before_evict, false, : "When enabled, failed replicas will only be evicted after a " : "replacement has been added to the config."); Does this mean the replicas are to be evicted by the leader replica when this is set to 'true'? Or this flag actually switches to the new scheme where master, but not the leader replica, evicts failed replicas? -- To view, visit http://gerrit.cloudera.org:8080/8626 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I247842af7d0172d6e6b167a29613f6b261990a01 Gerrit-Change-Number: 8626 Gerrit-PatchSet: 1 Gerrit-Owner: Mike Percy Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Comment-Date: Tue, 21 Nov 2017 23:41:25 + Gerrit-HasComments: Yes
[kudu-CR] schema: fast-path for Schema::KeyEquals
Hello Alexey Serbin, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/8627 to review the following change. Change subject: schema: fast-path for Schema::KeyEquals .. schema: fast-path for Schema::KeyEquals If two schemas are the same object, then KeyEquals can be fast-pathed. This hasn't been a hot path in release builds, but I noticed a lot of wasted CPU in a debug build evaluating Schema::KeyEquals, and my suspicion is that in a great number of cases the objects are identical. Even in release builds, may as well avoid the unnecessary cycles. Change-Id: Ia5fadfd70ea651b9105837535f21a34c6e0af6ae --- M src/kudu/common/schema.h 1 file changed, 1 insertion(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/27/8627/1 -- To view, visit http://gerrit.cloudera.org:8080/8627 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: Ia5fadfd70ea651b9105837535f21a34c6e0af6ae Gerrit-Change-Number: 8627 Gerrit-PatchSet: 1 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Alexey Serbin
[kudu-CR] KUDU-2222. update scan delta compact-test: increase write timeout
Hello Alexey Serbin, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/8628 to review the following change. Change subject: KUDU-. update_scan_delta_compact-test: increase write timeout .. KUDU-. update_scan_delta_compact-test: increase write timeout In ASAN builds, the 5-second timeout on 1000-row batches was not sufficient to prevent write timeouts. With this, the test passed on dist-test 100/100[1], whereas previously it failed 87/100[2]. [1] http://dist-test.cloudera.org/job?job_id=todd.1511306572.114498 [2] http://dist-test.cloudera.org/job?job_id=todd.1511304651.87727 Change-Id: Ic393579320b87e0f5cec494870441b8aa69db3b6 --- M src/kudu/integration-tests/update_scan_delta_compact-test.cc 1 file changed, 1 insertion(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/28/8628/1 -- To view, visit http://gerrit.cloudera.org:8080/8628 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: Ic393579320b87e0f5cec494870441b8aa69db3b6 Gerrit-Change-Number: 8628 Gerrit-PatchSet: 1 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Alexey Serbin
[kudu-CR] ntp: dump NTP diagnostics when NTP is unsynchronized or error is too high
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/8557 ) Change subject: ntp: dump NTP diagnostics when NTP is unsynchronized or error is too high .. Patch Set 4: (7 comments) overall looks good, just some nits http://gerrit.cloudera.org:8080/#/c/8557/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/8557/4//COMMIT_MSG@7 PS4, Line 7: NTP nit: drop this http://gerrit.cloudera.org:8080/#/c/8557/4//COMMIT_MSG@9 PS4, Line 9: ntp nit: NTP http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.h File src/kudu/clock/system_ntp.h: http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.h@50 PS4, Line 50: log nit: maybe, make this a parameter by default? http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.cc File src/kudu/clock/system_ntp.cc: http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.cc@90 PS4, Line 90: ERROR nit: maybe, just INFO in case of s.ok() ? http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/clock/system_ntp.cc@77 PS4, Line 77: void TryRun(vector cmd, vector* log) { : string exe, out, err; : Status s = FindExecutable(cmd[0], {"/sbin", "/usr/sbin/"}, &exe); : if (!s.ok()) { : LOG_STRING(WARNING, log) << "could not find executable: " << cmd[0]; : return; : } : : cmd[0] = exe; : s = Subprocess::Call(cmd, "", &out, &err); : // Subprocess::Call() returns RuntimeError in the case that the process returns : // a non-zero exit code, but that might still generate useful err. : if (s.ok() || (s.IsRuntimeError() && (!out.empty() || !err.empty( { : LOG_STRING(ERROR, log) : << JoinStrings(cmd, " ") : << "\n--" : << (!out.empty() ? Substitute("\nstdout:\n$0", out) : "") : << (!err.empty() ? Substitute("\nstderr:\n$0", err) : "") : << "\n"; : } else { : LOG_STRING(WARNING, log) << "failed to run executable: " << cmd[0]; : } : : } maybe, move this into src/kudu/util/subprocess.cc? http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/util/subprocess.h File src/kudu/util/subprocess.h: http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/util/subprocess.h@220 PS4, Line 220: Status FindExecutable(const std::string& binary, maybe, the better place for that is in path_util.h? http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/util/test_util.cc File src/kudu/util/test_util.cc: http://gerrit.cloudera.org:8080/#/c/8557/4/src/kudu/util/test_util.cc@a307 PS4, Line 307: remove the declaration part from test_util.h as well? -- To view, visit http://gerrit.cloudera.org:8080/8557 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifeb2206a0475b8c8e183a74aee21315e6e43dc33 Gerrit-Change-Number: 8557 Gerrit-PatchSet: 4 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Jean-Daniel Cryans Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Tue, 21 Nov 2017 23:27:52 + Gerrit-HasComments: Yes
[kudu-CR] WIP [catalog manager] added 3-4-3 behavior
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8619 ) Change subject: WIP [catalog_manager] added 3-4-3 behavior .. Patch Set 2: (6 comments) http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/consensus/quorum_util.cc File src/kudu/consensus/quorum_util.cc: http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/consensus/quorum_util.cc@421 PS2, Line 421: peer.attrs().has_promote() I don't think we need to rely on has_ since this defaults to false. http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/consensus/quorum_util.cc@450 PS2, Line 450: for (const RaftPeerPB& peer : config.peers()) { I find the logic a little tough to follow in here. Maybe it would be a little easier to follow if it was structured as 2 loops? 1. loop over the config and collect stats (number of healthy / failed voters, number of healthy / failed non-voters, number of nodes to replace) 2. decide whether any can be evicted, if not, return false 3. if a node can be evicted, loop again with a predicate indicating the criteria for the node to evict and return the first match. Also, I wonder why we need a separate function for evicting voters and non-voters? http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/master/catalog_manager.cc@204 PS2, Line 204: // TODO(aserbin): can we get a better name for this flag? Ideally, there should i posted a patch at https://gerrit.cloudera.org/c/8626/ so maybe we can use that one everywhere http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/master/catalog_manager.cc@3007 PS2, Line 3007: member_type_(member_type) this field only seems relevant for adding new replicas or a forced promotion / demotion http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/master/catalog_manager.cc@3198 PS2, Line 3198: peer->set_member_type(member_type_); why is this field needed for an eviction? http://gerrit.cloudera.org:8080/#/c/8619/2/src/kudu/master/catalog_manager.cc@3482 PS2, Line 3482: else { : string to_evict; : if nit: we can make this an "else if" and avoid an additional level of nesting -- To view, visit http://gerrit.cloudera.org:8080/8619 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I6f0469ac641bf7a03dbef01eaa3f1b58a5bf5d27 Gerrit-Change-Number: 8619 Gerrit-PatchSet: 2 Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 21 Nov 2017 23:24:41 + Gerrit-HasComments: Yes
[kudu-CR] [java] Add ReplicaSelection in KuduScanToken
Dan Burkert has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8559 ) Change subject: [java] Add ReplicaSelection in KuduScanToken .. [java] Add ReplicaSelection in KuduScanToken This patch adds ReplicaSelection in KuduScanToken for java client, so that deserializing a ScanToken results in propagating the replica selection policy of the serializer into the deserializer. Change-Id: I860fcc73e486642ab5177cfd0dc0bdc98fdc6914 Reviewed-on: http://gerrit.cloudera.org:8080/8559 Reviewed-by: Dan Burkert Tested-by: Kudu Jenkins --- M java/kudu-client/src/main/java/org/apache/kudu/client/KuduScanToken.java M java/kudu-client/src/test/java/org/apache/kudu/client/TestScannerMultiTablet.java M src/kudu/client/client.proto M src/kudu/common/common.proto 4 files changed, 50 insertions(+), 0 deletions(-) Approvals: Dan Burkert: Looks good to me, approved Kudu Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/8559 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I860fcc73e486642ab5177cfd0dc0bdc98fdc6914 Gerrit-Change-Number: 8559 Gerrit-PatchSet: 3 Gerrit-Owner: Hao Hao Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Hao Hao Gerrit-Reviewer: Kudu Jenkins
[kudu-CR] consensus: Add gflag to enable 3-4-3 re-replication
Hello Alexey Serbin, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/8626 to review the following change. Change subject: consensus: Add gflag to enable 3-4-3 re-replication .. consensus: Add gflag to enable 3-4-3 re-replication This gflag will be used to control whether 3-4-3 re-replication is used on a cluster. Change-Id: I247842af7d0172d6e6b167a29613f6b261990a01 --- M src/kudu/consensus/raft_consensus.cc 1 file changed, 7 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/26/8626/1 -- To view, visit http://gerrit.cloudera.org:8080/8626 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: I247842af7d0172d6e6b167a29613f6b261990a01 Gerrit-Change-Number: 8626 Gerrit-PatchSet: 1 Gerrit-Owner: Mike Percy Gerrit-Reviewer: Alexey Serbin
[kudu-CR] Add a test for data consistency after stopping tablets
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8618 ) Change subject: Add a test for data consistency after stopping tablets .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/8618/3/src/kudu/integration-tests/stop_tablet-itest.cc File src/kudu/integration-tests/stop_tablet-itest.cc: http://gerrit.cloudera.org:8080/#/c/8618/3/src/kudu/integration-tests/stop_tablet-itest.cc@231 PS3, Line 231: why does the client get a RemoteError in this test? > what does the RemoteError say? while starting up the tservers, they start their RPC server before they register the TabletService RPC service itself into the messenger. So, the client bounces around and gets ServiceUnavailable on all replicas, and then returns RemoteError instead of TimedOut -- To view, visit http://gerrit.cloudera.org:8080/8618 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I7b3b9e90fd1fbfb555a22074fd4a777a2b2e56e2 Gerrit-Change-Number: 8618 Gerrit-PatchSet: 3 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 21 Nov 2017 22:44:09 + Gerrit-HasComments: Yes
[kudu-CR] WIP [catalog manager] added 3-4-3 behavior
Hello Tidy Bot, Mike Percy, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8619 to look at the new patch set (#2). Change subject: WIP [catalog_manager] added 3-4-3 behavior .. WIP [catalog_manager] added 3-4-3 behavior Updated the catalog_manager to behave as specified in 3-4-3 v1 design implementation plan, patch2. WIP: need to add units tests for new quorum_util functions and comprehesive functional/integration tests Change-Id: I6f0469ac641bf7a03dbef01eaa3f1b58a5bf5d27 --- M src/kudu/consensus/quorum_util-test.cc M src/kudu/consensus/quorum_util.cc M src/kudu/consensus/quorum_util.h M src/kudu/master/catalog_manager.cc M src/kudu/master/catalog_manager.h 5 files changed, 806 insertions(+), 96 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/19/8619/2 -- To view, visit http://gerrit.cloudera.org:8080/8619 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I6f0469ac641bf7a03dbef01eaa3f1b58a5bf5d27 Gerrit-Change-Number: 8619 Gerrit-PatchSet: 2 Gerrit-Owner: Alexey Serbin Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8617 ) Change subject: KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics .. KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics InternalMiniCluster tests were spitting out TSAN warnings where one thread (associated with one mini-tserver) accesses the CacheMetrics while another thread is busy restarting a tserver. The restarting tserver ends up replacing the CacheMetrics instance, which caused TSAN to warn due to the unprotected concurrent access. This patch is a relatively simple workaround that just prevents the Cache from switching CacheMetrics instances once it is first constructed. The underlying metrics are ref-counted, so we shouldn't have any leak or illegal access. The only downside is that, after a restart, new tservers won't be able to "attach" to the cache instance. But, we already had somewhat randomly associated the cache metrics with one of the tablet servers, so it seems unlikely that any assertions are depending on the functionality. Along the way, I removed the Cache::NewId() method which was unused. Change-Id: Ifc5c6e9306df78c364c8b89651ddcc56b90a924f Reviewed-on: http://gerrit.cloudera.org:8080/8617 Reviewed-by: Mike Percy Tested-by: Todd Lipcon --- M src/kudu/util/cache-test.cc M src/kudu/util/cache.cc M src/kudu/util/cache.h M src/kudu/util/nvm_cache.cc 4 files changed, 21 insertions(+), 31 deletions(-) Approvals: Mike Percy: Looks good to me, approved Todd Lipcon: Verified -- To view, visit http://gerrit.cloudera.org:8080/8617 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: Ifc5c6e9306df78c364c8b89651ddcc56b90a924f Gerrit-Change-Number: 8617 Gerrit-PatchSet: 5 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8617 ) Change subject: KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics .. Patch Set 4: Verified+1 Unrelated race -- To view, visit http://gerrit.cloudera.org:8080/8617 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifc5c6e9306df78c364c8b89651ddcc56b90a924f Gerrit-Change-Number: 8617 Gerrit-PatchSet: 4 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 21 Nov 2017 22:41:09 + Gerrit-HasComments: No
[kudu-CR] KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics
Todd Lipcon has removed a vote on this change. Change subject: KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics .. Removed Verified-1 by Kudu Jenkins (120) -- To view, visit http://gerrit.cloudera.org:8080/8617 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: deleteVote Gerrit-Change-Id: Ifc5c6e9306df78c364c8b89651ddcc56b90a924f Gerrit-Change-Number: 8617 Gerrit-PatchSet: 4 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] tablet: introduce closed mvcc and stopped tablets
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/7439 ) Change subject: tablet: introduce closed mvcc and stopped tablets .. Patch Set 39: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/7439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42 Gerrit-Change-Number: 7439 Gerrit-PatchSet: 39 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 21 Nov 2017 22:40:26 + Gerrit-HasComments: No
[kudu-CR] Add a test for data consistency after stopping tablets
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8618 ) Change subject: Add a test for data consistency after stopping tablets .. Patch Set 3: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/8618 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I7b3b9e90fd1fbfb555a22074fd4a777a2b2e56e2 Gerrit-Change-Number: 8618 Gerrit-PatchSet: 3 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 21 Nov 2017 22:40:07 + Gerrit-HasComments: No
[kudu-CR] tablet: introduce closed mvcc and stopped tablets
Hello Tidy Bot, Mike Percy, David Ribeiro Alves, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7439 to look at the new patch set (#39). Change subject: tablet: introduce closed mvcc and stopped tablets .. tablet: introduce closed mvcc and stopped tablets Currently, the only way to stop an Applying transaction is to wait for it to finish and Commit it. This constraint was put in place to guarantee on-disk correctness, but is sometimes too strict. E.g. if the tablet is shutting down, the Apply doesn't need to finish. This patch adds a new state to the MvccManager in which it is closed for transactions. Once in this closed state: 1. New Applies will return and not move to the Commit phase, and any methods waiting for the tablet's Applies to Commit (e.g. new snapshot scans, FlushMRS) will respond with an error immediately. This allows an escape from the existing invariant that Applies _must_ Commit, provided the MvccManager is in this closed state. 2. Applies that are already underway may still Commit, but will return early on a best-effort basis. These non-Committed operations are inconsequential w.r.t. consistency; having some in-flight transactions Commit and others not is consistent with the server shutting down in between the Commits of two transactions. 3. New transactions drivers will abort immediately before even reaching the Prepare phase, ensuring no more writes to the tablet are made durable. The Tablet class uses this closed MVCC state in a new "stopped" state of its own. A Tablet that has been stopped will avoid further activity: its MvccManager is closed to prevent further writes, and its maintenance ops are cancelled to prevent further scheduling. This patch includes these new behaviors when shutting down a tablet, with the assumption that a tablet will only be shut down when it's being deleted and we don't care too much about its in-flight transactions Committing or its further maintenance ops. Code paths that previously crashed if Applies did not succeed (e.g. TransactionDriver::ApplyTask, MvccManager::AbortTransaction, etc.) or that waited for Applies to finish (e.g. Tablet:: FlushUnlocked) will now _not_ crash if the Tablet has been stopped and will log a warning instead. Testing is done by adding the following: - a test in mvcc-test to shut down MVCC and delete an Applying transaction, ensuring that there are no errors when it leaves scope. - a test in mvcc-test to wait on an Applying transaction, shut down MVCC, and ensure that any waiters will return with an error. - a new test stop_tablet-itest is added to ensure stopped leaders block writes (because they cannot start new transactions) and stopped followers don't (because while they cannot service the op, there still exists a majority that can); that stopped tablets don't prevent fault-tolerant scans; and that stopping the only tablet does prevent scans Change-Id: I983620f27e7226806a2cca253db7619731914d42 --- M src/kudu/integration-tests/CMakeLists.txt A src/kudu/integration-tests/stop_tablet-itest.cc M src/kudu/tablet/local_tablet_writer.h M src/kudu/tablet/mvcc-test.cc M src/kudu/tablet/mvcc.cc M src/kudu/tablet/mvcc.h M src/kudu/tablet/rowset.cc M src/kudu/tablet/tablet-harness.h M src/kudu/tablet/tablet.cc M src/kudu/tablet/tablet.h M src/kudu/tablet/tablet_bootstrap.cc M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica_mm_ops.cc M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tablet/transactions/transaction_driver.h M src/kudu/tablet/transactions/write_transaction.cc M src/kudu/tserver/tablet_service.cc 17 files changed, 733 insertions(+), 135 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/39/7439/39 -- To view, visit http://gerrit.cloudera.org:8080/7439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42 Gerrit-Change-Number: 7439 Gerrit-PatchSet: 39 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] tablet: introduce closed mvcc and stopped tablets
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7439 ) Change subject: tablet: introduce closed mvcc and stopped tablets .. Patch Set 39: Fixed IWYU + the build failure was KUDU- -- To view, visit http://gerrit.cloudera.org:8080/7439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42 Gerrit-Change-Number: 7439 Gerrit-PatchSet: 39 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 21 Nov 2017 22:40:04 + Gerrit-HasComments: No
[kudu-CR] tablet: mark delta tracker read-only on error
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8605 ) Change subject: tablet: mark delta tracker read-only on error .. Patch Set 7: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8605 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib950048e9cd0929a10714ab1cc2bd829835afced Gerrit-Change-Number: 8605 Gerrit-PatchSet: 7 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 21 Nov 2017 22:38:29 + Gerrit-HasComments: No
[kudu-CR] tablet: add early returns to maintenance functions
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8606 ) Change subject: tablet: add early returns to maintenance functions .. Patch Set 8: Code-Review+2 (1 comment) http://gerrit.cloudera.org:8080/#/c/8606/5/src/kudu/integration-tests/stop_tablet-itest.cc File src/kudu/integration-tests/stop_tablet-itest.cc: http://gerrit.cloudera.org:8080/#/c/8606/5/src/kudu/integration-tests/stop_tablet-itest.cc@93 PS5, Line 93: server. > I think that's an optimization, but it wouldn't get the same result as canc hmm, ok -- To view, visit http://gerrit.cloudera.org:8080/8606 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I84ad557851863f6fd9acff28ddcd1244e62cf516 Gerrit-Change-Number: 8606 Gerrit-PatchSet: 8 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 21 Nov 2017 22:36:48 + Gerrit-HasComments: Yes
[kudu-CR] tablet: make various update paths atomic
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8441 ) Change subject: tablet: make various update paths atomic .. Patch Set 13: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8441 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0241cd4206ce77ef1fb334458b091bc2092f4141 Gerrit-Change-Number: 8441 Gerrit-PatchSet: 13 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 21 Nov 2017 22:35:06 + Gerrit-HasComments: No
[kudu-CR] tablet: remove ignore result in ApplyRowOperation
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8620 ) Change subject: tablet: remove ignore_result in ApplyRowOperation .. Patch Set 2: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8620 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I54ea74f5ee10733b2d7b13b1971f9177dddb50cb Gerrit-Change-Number: 8620 Gerrit-PatchSet: 2 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 21 Nov 2017 22:34:33 + Gerrit-HasComments: No
[kudu-CR] handle disk failures during tablet copies
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/7654 ) Change subject: handle disk failures during tablet copies .. Patch Set 8: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/7654 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7 Gerrit-Change-Number: 7654 Gerrit-PatchSet: 8 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Comment-Date: Tue, 21 Nov 2017 22:34:14 + Gerrit-HasComments: No
[kudu-CR] tablet: mark delta tracker read-only on error
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8605 ) Change subject: tablet: mark delta tracker read-only on error .. Patch Set 7: (2 comments) http://gerrit.cloudera.org:8080/#/c/8605/6/src/kudu/tablet/delta_tracker.cc File src/kudu/tablet/delta_tracker.cc: http://gerrit.cloudera.org:8080/#/c/8605/6/src/kudu/tablet/delta_tracker.cc@355 PS6, Line 355: CheckWritableUnlocked( > nit: perhaps rename to CheckWritableUnlocked() Done http://gerrit.cloudera.org:8080/#/c/8605/6/src/kudu/tablet/delta_tracker.cc@740 PS6, Line 740: read_only_ = true; > CHECK(s.IsDiskFailure()) << LogPrefix() << s.ToString(); Done -- To view, visit http://gerrit.cloudera.org:8080/8605 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib950048e9cd0929a10714ab1cc2bd829835afced Gerrit-Change-Number: 8605 Gerrit-PatchSet: 7 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 21 Nov 2017 22:31:10 + Gerrit-HasComments: Yes
[kudu-CR] tablet: add early returns to maintenance functions
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8606 ) Change subject: tablet: add early returns to maintenance functions .. Patch Set 8: (1 comment) http://gerrit.cloudera.org:8080/#/c/8606/6/src/kudu/tablet/tablet_replica_mm_ops.cc File src/kudu/tablet/tablet_replica_mm_ops.cc: http://gerrit.cloudera.org:8080/#/c/8606/6/src/kudu/tablet/tablet_replica_mm_ops.cc@200 PS6, Line 200: ()) > LogPrefix() includes the trailing space, so it's usually preferable to writ Done -- To view, visit http://gerrit.cloudera.org:8080/8606 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I84ad557851863f6fd9acff28ddcd1244e62cf516 Gerrit-Change-Number: 8606 Gerrit-PatchSet: 8 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 21 Nov 2017 22:31:02 + Gerrit-HasComments: Yes
[kudu-CR] tablet: make various update paths atomic
Hello Tidy Bot, Alexey Serbin, Mike Percy, Kudu Jenkins, Adar Dembo, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8441 to look at the new patch set (#13). Change subject: tablet: make various update paths atomic .. tablet: make various update paths atomic A few codepaths in the tablet subsystem aren't atomic, i.e. if they fail mid-method (e.g. due to a file error), they leave some in-memory structures updated and others untouched. This has been safe because we CHECKed to ensure their success. In preparation for _not_ crashing in these methods, this patch refactors some of these functions to be atomic, and notes others that still have the possibility of failing in such a state (these calls still trigger a CHECK failure). There are no functional changes in this patch. Change-Id: I0241cd4206ce77ef1fb334458b091bc2092f4141 --- M src/kudu/tablet/delta_compaction.cc M src/kudu/tablet/delta_compaction.h M src/kudu/tablet/delta_tracker.cc M src/kudu/tablet/delta_tracker.h M src/kudu/tablet/diskrowset.cc M src/kudu/tablet/diskrowset.h M src/kudu/tablet/metadata-test.cc M src/kudu/tablet/rowset_metadata.cc M src/kudu/tablet/rowset_metadata.h 9 files changed, 224 insertions(+), 169 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/41/8441/13 -- To view, visit http://gerrit.cloudera.org:8080/8441 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I0241cd4206ce77ef1fb334458b091bc2092f4141 Gerrit-Change-Number: 8441 Gerrit-PatchSet: 13 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] tablet: make various update paths atomic
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8441 ) Change subject: tablet: make various update paths atomic .. Patch Set 13: (1 comment) http://gerrit.cloudera.org:8080/#/c/8441/12/src/kudu/tablet/diskrowset.cc File src/kudu/tablet/diskrowset.cc: http://gerrit.cloudera.org:8080/#/c/8441/12/src/kudu/tablet/diskrowset.cc@579 PS12, Line 579: LOG_WITH_PREFIX(WARNING) << "Error during major delta compaction! Rolling back rowset metadata"; > WITH_PREFIX here so you get the tablet ID Done -- To view, visit http://gerrit.cloudera.org:8080/8441 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0241cd4206ce77ef1fb334458b091bc2092f4141 Gerrit-Change-Number: 8441 Gerrit-PatchSet: 13 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 21 Nov 2017 22:30:48 + Gerrit-HasComments: Yes
[kudu-CR] tablet: add early returns to maintenance functions
Hello Mike Percy, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8606 to look at the new patch set (#8). Change subject: tablet: add early returns to maintenance functions .. tablet: add early returns to maintenance functions When a Tablet is stopped, further maintenance ops are not scheduled, further IO is prevented, etc. This patch optimizes this further to stop various functions that are called by maintenance threads to prevent their execution, returning an error instead. Previously, certain ops (e.g. flush DMS) would guarantee durability by checking for the success these functions. These checks are now replaced with on-error checks that the tablet has been stopped, since these failures are inconsequential if the tablet is stopped. Currently this is an optimization for tablets that are shutting down to return early from these calls, but in the future, this could be useful in stopping all IO done by a tablet that is failing, e.g. due to disk failure. Change-Id: I84ad557851863f6fd9acff28ddcd1244e62cf516 --- M src/kudu/integration-tests/stop_tablet-itest.cc M src/kudu/tablet/tablet.cc M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tablet/tablet_replica_mm_ops.cc 5 files changed, 34 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/06/8606/8 -- To view, visit http://gerrit.cloudera.org:8080/8606 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I84ad557851863f6fd9acff28ddcd1244e62cf516 Gerrit-Change-Number: 8606 Gerrit-PatchSet: 8 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] handle disk failures during tablet copies
Hello Tidy Bot, Mike Percy, Kudu Jenkins, Adar Dembo, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7654 to look at the new patch set (#8). Change subject: handle disk failures during tablet copies .. handle disk failures during tablet copies There are two components in a tablet copy: the copy client (that receives data) and the copy session source (that sends data). Coarse-grain handling of disk failures during tablet copies is done for the tablet copy client as: - Before starting a copy client, if no disks are available to place the tablet, simply return (instead of failing a CHECK). - Before downloading each WAL segments or block, check that the tablet is in a healthy group. And for the tablet copy session as: - Before sending a block or log segment, check if the tablet has an error. Upon returning an error, the tablet copy client will shutdown the replica, leaving it in a failed state. A test is added to ensure that both copy clients and that source sessions with failed disks will return errors to the copying client. Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7 --- M src/kudu/tserver/tablet_copy-test-base.h M src/kudu/tserver/tablet_copy_client-test.cc M src/kudu/tserver/tablet_copy_client.cc M src/kudu/tserver/tablet_copy_client.h M src/kudu/tserver/tablet_copy_service-test.cc M src/kudu/tserver/tablet_copy_source_session.cc M src/kudu/tserver/tablet_copy_source_session.h 7 files changed, 124 insertions(+), 7 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/54/7654/8 -- To view, visit http://gerrit.cloudera.org:8080/7654 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7 Gerrit-Change-Number: 7654 Gerrit-PatchSet: 8 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot
[kudu-CR] handle disk failures during tablet copies
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7654 ) Change subject: handle disk failures during tablet copies .. Patch Set 8: (1 comment) http://gerrit.cloudera.org:8080/#/c/7654/7/src/kudu/tserver/tablet_copy_source_session.cc File src/kudu/tserver/tablet_copy_source_session.cc: http://gerrit.cloudera.org:8080/#/c/7654/7/src/kudu/tserver/tablet_copy_source_session.cc@133 PS7, Line 133: )); > we can remove this now Done -- To view, visit http://gerrit.cloudera.org:8080/7654 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7 Gerrit-Change-Number: 7654 Gerrit-PatchSet: 8 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Comment-Date: Tue, 21 Nov 2017 22:30:43 + Gerrit-HasComments: Yes
[kudu-CR] tablet: mark delta tracker read-only on error
Hello Mike Percy, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8605 to look at the new patch set (#7). Change subject: tablet: mark delta tracker read-only on error .. tablet: mark delta tracker read-only on error When DeltaTracker::Flush() fails, it leaves a DeltaMemStore in the redo store list. This is fine for reads because the readpath expects any DeltaStore to be in that slot, but this is a problem for updates, which expect only DeltaFileReaders in the list while the compact_flush_lock_ is held. Before, we would get around this by CHECKing so Flush() could never return in such a state. This patch introduces a flag to the DeltaTracker that indicates whether it has experienced such an error. In order to ensure type-correctness, this flag must be checked immediately upon entering the critical section of the compact_flush_lock_. Change-Id: Ib950048e9cd0929a10714ab1cc2bd829835afced --- M src/kudu/tablet/delta_tracker.cc M src/kudu/tablet/delta_tracker.h 2 files changed, 35 insertions(+), 5 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/05/8605/7 -- To view, visit http://gerrit.cloudera.org:8080/8605 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ib950048e9cd0929a10714ab1cc2bd829835afced Gerrit-Change-Number: 8605 Gerrit-PatchSet: 7 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon
[kudu-CR] KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8617 ) Change subject: KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics .. Patch Set 4: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8617 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ifc5c6e9306df78c364c8b89651ddcc56b90a924f Gerrit-Change-Number: 8617 Gerrit-PatchSet: 4 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 21 Nov 2017 22:16:43 + Gerrit-HasComments: No
[kudu-CR] Add a test for data consistency after stopping tablets
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8618 ) Change subject: Add a test for data consistency after stopping tablets .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/8618/3/src/kudu/integration-tests/stop_tablet-itest.cc File src/kudu/integration-tests/stop_tablet-itest.cc: http://gerrit.cloudera.org:8080/#/c/8618/3/src/kudu/integration-tests/stop_tablet-itest.cc@249 PS3, Line 249: Shutdown Did you mean to Shutdown() or Stop() here? Maybe one after the other? -- To view, visit http://gerrit.cloudera.org:8080/8618 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I7b3b9e90fd1fbfb555a22074fd4a777a2b2e56e2 Gerrit-Change-Number: 8618 Gerrit-PatchSet: 3 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 21 Nov 2017 22:13:56 + Gerrit-HasComments: Yes
[kudu-CR] Add a test for data consistency after stopping tablets
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8618 ) Change subject: Add a test for data consistency after stopping tablets .. Patch Set 3: (1 comment) http://gerrit.cloudera.org:8080/#/c/8618/3/src/kudu/integration-tests/stop_tablet-itest.cc File src/kudu/integration-tests/stop_tablet-itest.cc: http://gerrit.cloudera.org:8080/#/c/8618/3/src/kudu/integration-tests/stop_tablet-itest.cc@249 PS3, Line 249: Shutdown > Did you mean to Shutdown() or Stop() here? Maybe one after the other? Oh, wait, Shutdown() calls Stop() first. Good idea with this. -- To view, visit http://gerrit.cloudera.org:8080/8618 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I7b3b9e90fd1fbfb555a22074fd4a777a2b2e56e2 Gerrit-Change-Number: 8618 Gerrit-PatchSet: 3 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 21 Nov 2017 22:14:34 + Gerrit-HasComments: Yes
[kudu-CR] Add a test for data consistency after stopping tablets
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8618 ) Change subject: Add a test for data consistency after stopping tablets .. Patch Set 3: (2 comments) http://gerrit.cloudera.org:8080/#/c/8618/3/src/kudu/integration-tests/cluster_verifier.cc File src/kudu/integration-tests/cluster_verifier.cc: http://gerrit.cloudera.org:8080/#/c/8618/3/src/kudu/integration-tests/cluster_verifier.cc@92 PS3, Line 92: we should support totally agree http://gerrit.cloudera.org:8080/#/c/8618/3/src/kudu/integration-tests/stop_tablet-itest.cc File src/kudu/integration-tests/stop_tablet-itest.cc: http://gerrit.cloudera.org:8080/#/c/8618/3/src/kudu/integration-tests/stop_tablet-itest.cc@231 PS3, Line 231: why does the client get a RemoteError in this test? what does the RemoteError say? -- To view, visit http://gerrit.cloudera.org:8080/8618 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I7b3b9e90fd1fbfb555a22074fd4a777a2b2e56e2 Gerrit-Change-Number: 8618 Gerrit-PatchSet: 3 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 21 Nov 2017 22:12:18 + Gerrit-HasComments: Yes
[kudu-CR] tablet: introduce closed mvcc and stopped tablets
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/7439 ) Change subject: tablet: introduce closed mvcc and stopped tablets .. Patch Set 38: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/7439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42 Gerrit-Change-Number: 7439 Gerrit-PatchSet: 38 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 21 Nov 2017 22:08:02 + Gerrit-HasComments: No
[kudu-CR] tablet: introduce closed mvcc and stopped tablets
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7439 ) Change subject: tablet: introduce closed mvcc and stopped tablets .. Patch Set 38: (1 comment) http://gerrit.cloudera.org:8080/#/c/7439/37/src/kudu/integration-tests/stop_tablet-itest.cc File src/kudu/integration-tests/stop_tablet-itest.cc: http://gerrit.cloudera.org:8080/#/c/7439/37/src/kudu/integration-tests/stop_tablet-itest.cc@193 PS37, Line 193: ASSERT_TRUE(s.IsTimedOut()) > nit: add << s.ToString() so this is diagnosable when / if it fails in the f Done -- To view, visit http://gerrit.cloudera.org:8080/7439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42 Gerrit-Change-Number: 7439 Gerrit-PatchSet: 38 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 21 Nov 2017 22:05:48 + Gerrit-HasComments: Yes
[kudu-CR] tablet: introduce closed mvcc and stopped tablets
Hello Tidy Bot, Mike Percy, David Ribeiro Alves, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7439 to look at the new patch set (#38). Change subject: tablet: introduce closed mvcc and stopped tablets .. tablet: introduce closed mvcc and stopped tablets Currently, the only way to stop an Applying transaction is to wait for it to finish and Commit it. This constraint was put in place to guarantee on-disk correctness, but is sometimes too strict. E.g. if the tablet is shutting down, the Apply doesn't need to finish. This patch adds a new state to the MvccManager in which it is closed for transactions. Once in this closed state: 1. New Applies will return and not move to the Commit phase, and any methods waiting for the tablet's Applies to Commit (e.g. new snapshot scans, FlushMRS) will respond with an error immediately. This allows an escape from the existing invariant that Applies _must_ Commit, provided the MvccManager is in this closed state. 2. Applies that are already underway may still Commit, but will return early on a best-effort basis. These non-Committed operations are inconsequential w.r.t. consistency; having some in-flight transactions Commit and others not is consistent with the server shutting down in between the Commits of two transactions. 3. New transactions drivers will abort immediately before even reaching the Prepare phase, ensuring no more writes to the tablet are made durable. The Tablet class uses this closed MVCC state in a new "stopped" state of its own. A Tablet that has been stopped will avoid further activity: its MvccManager is closed to prevent further writes, and its maintenance ops are cancelled to prevent further scheduling. This patch includes these new behaviors when shutting down a tablet, with the assumption that a tablet will only be shut down when it's being deleted and we don't care too much about its in-flight transactions Committing or its further maintenance ops. Code paths that previously crashed if Applies did not succeed (e.g. TransactionDriver::ApplyTask, MvccManager::AbortTransaction, etc.) or that waited for Applies to finish (e.g. Tablet:: FlushUnlocked) will now _not_ crash if the Tablet has been stopped and will log a warning instead. Testing is done by adding the following: - a test in mvcc-test to shut down MVCC and delete an Applying transaction, ensuring that there are no errors when it leaves scope. - a test in mvcc-test to wait on an Applying transaction, shut down MVCC, and ensure that any waiters will return with an error. - a new test stop_tablet-itest is added to ensure stopped leaders block writes (because they cannot start new transactions) and stopped followers don't (because while they cannot service the op, there still exists a majority that can); that stopped tablets don't prevent fault-tolerant scans; and that stopping the only tablet does prevent scans Change-Id: I983620f27e7226806a2cca253db7619731914d42 --- M src/kudu/integration-tests/CMakeLists.txt A src/kudu/integration-tests/stop_tablet-itest.cc M src/kudu/tablet/local_tablet_writer.h M src/kudu/tablet/mvcc-test.cc M src/kudu/tablet/mvcc.cc M src/kudu/tablet/mvcc.h M src/kudu/tablet/rowset.cc M src/kudu/tablet/tablet-harness.h M src/kudu/tablet/tablet.cc M src/kudu/tablet/tablet.h M src/kudu/tablet/tablet_bootstrap.cc M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica_mm_ops.cc M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tablet/transactions/transaction_driver.h M src/kudu/tablet/transactions/write_transaction.cc M src/kudu/tserver/tablet_service.cc 17 files changed, 735 insertions(+), 135 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/39/7439/38 -- To view, visit http://gerrit.cloudera.org:8080/7439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42 Gerrit-Change-Number: 7439 Gerrit-PatchSet: 38 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] tablet: introduce closed mvcc and stopped tablets
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/7439 ) Change subject: tablet: introduce closed mvcc and stopped tablets .. Patch Set 37: (1 comment) looks good http://gerrit.cloudera.org:8080/#/c/7439/37/src/kudu/integration-tests/stop_tablet-itest.cc File src/kudu/integration-tests/stop_tablet-itest.cc: http://gerrit.cloudera.org:8080/#/c/7439/37/src/kudu/integration-tests/stop_tablet-itest.cc@193 PS37, Line 193: ASSERT_TRUE(s.IsTimedOut()) nit: add << s.ToString() so this is diagnosable when / if it fails in the future. -- To view, visit http://gerrit.cloudera.org:8080/7439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42 Gerrit-Change-Number: 7439 Gerrit-PatchSet: 37 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 21 Nov 2017 22:02:10 + Gerrit-HasComments: Yes
[kudu-CR] tablet: introduce closed mvcc and stopped tablets
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/7439 ) Change subject: tablet: introduce closed mvcc and stopped tablets .. Patch Set 37: (1 comment) http://gerrit.cloudera.org:8080/#/c/7439/36/src/kudu/integration-tests/stop_tablet-itest.cc File src/kudu/integration-tests/stop_tablet-itest.cc: http://gerrit.cloudera.org:8080/#/c/7439/36/src/kudu/integration-tests/stop_tablet-itest.cc@138 PS36, Line 138: *leader_num = i; > It seems like this works but I'm worried that if someone changes the read t Done -- To view, visit http://gerrit.cloudera.org:8080/7439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42 Gerrit-Change-Number: 7439 Gerrit-PatchSet: 37 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Tue, 21 Nov 2017 21:57:57 + Gerrit-HasComments: Yes
[kudu-CR] tablet: introduce closed mvcc and stopped tablets
Hello Tidy Bot, Mike Percy, David Ribeiro Alves, Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7439 to look at the new patch set (#37). Change subject: tablet: introduce closed mvcc and stopped tablets .. tablet: introduce closed mvcc and stopped tablets Currently, the only way to stop an Applying transaction is to wait for it to finish and Commit it. This constraint was put in place to guarantee on-disk correctness, but is sometimes too strict. E.g. if the tablet is shutting down, the Apply doesn't need to finish. This patch adds a new state to the MvccManager in which it is closed for transactions. Once in this closed state: 1. New Applies will return and not move to the Commit phase, and any methods waiting for the tablet's Applies to Commit (e.g. new snapshot scans, FlushMRS) will respond with an error immediately. This allows an escape from the existing invariant that Applies _must_ Commit, provided the MvccManager is in this closed state. 2. Applies that are already underway may still Commit, but will return early on a best-effort basis. These non-Committed operations are inconsequential w.r.t. consistency; having some in-flight transactions Commit and others not is consistent with the server shutting down in between the Commits of two transactions. 3. New transactions drivers will abort immediately before even reaching the Prepare phase, ensuring no more writes to the tablet are made durable. The Tablet class uses this closed MVCC state in a new "stopped" state of its own. A Tablet that has been stopped will avoid further activity: its MvccManager is closed to prevent further writes, and its maintenance ops are cancelled to prevent further scheduling. This patch includes these new behaviors when shutting down a tablet, with the assumption that a tablet will only be shut down when it's being deleted and we don't care too much about its in-flight transactions Committing or its further maintenance ops. Code paths that previously crashed if Applies did not succeed (e.g. TransactionDriver::ApplyTask, MvccManager::AbortTransaction, etc.) or that waited for Applies to finish (e.g. Tablet:: FlushUnlocked) will now _not_ crash if the Tablet has been stopped and will log a warning instead. Testing is done by adding the following: - a test in mvcc-test to shut down MVCC and delete an Applying transaction, ensuring that there are no errors when it leaves scope. - a test in mvcc-test to wait on an Applying transaction, shut down MVCC, and ensure that any waiters will return with an error. - a new test stop_tablet-itest is added to ensure stopped leaders block writes (because they cannot start new transactions) and stopped followers don't (because while they cannot service the op, there still exists a majority that can); that stopped tablets don't prevent fault-tolerant scans; and that stopping the only tablet does prevent scans Change-Id: I983620f27e7226806a2cca253db7619731914d42 --- M src/kudu/integration-tests/CMakeLists.txt A src/kudu/integration-tests/stop_tablet-itest.cc M src/kudu/tablet/local_tablet_writer.h M src/kudu/tablet/mvcc-test.cc M src/kudu/tablet/mvcc.cc M src/kudu/tablet/mvcc.h M src/kudu/tablet/rowset.cc M src/kudu/tablet/tablet-harness.h M src/kudu/tablet/tablet.cc M src/kudu/tablet/tablet.h M src/kudu/tablet/tablet_bootstrap.cc M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica_mm_ops.cc M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tablet/transactions/transaction_driver.h M src/kudu/tablet/transactions/write_transaction.cc M src/kudu/tserver/tablet_service.cc 17 files changed, 735 insertions(+), 135 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/39/7439/37 -- To view, visit http://gerrit.cloudera.org:8080/7439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42 Gerrit-Change-Number: 7439 Gerrit-PatchSet: 37 Gerrit-Owner: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] [spark] add 'local-cluster' mode for unit test
Dan Burkert has posted comments on this change. ( http://gerrit.cloudera.org:8080/8585 ) Change subject: [spark] add 'local-cluster' mode for unit test .. Patch Set 1: I pointed SPARK_HOME at the unpacked release tarball downloaded from spark.apache.org. Is the source tarball really necessary? -- To view, visit http://gerrit.cloudera.org:8080/8585 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I07b2daad6f6883eb7425fa7132208818bbe65788 Gerrit-Change-Number: 8585 Gerrit-PatchSet: 1 Gerrit-Owner: Hao Hao Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Hao Hao Gerrit-Reviewer: Kudu Jenkins Gerrit-Comment-Date: Tue, 21 Nov 2017 21:46:28 + Gerrit-HasComments: No
[kudu-CR] KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics
Hello Tidy Bot, Mike Percy, Andrew Wong, Kudu Jenkins, Andrew Wong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8617 to look at the new patch set (#4). Change subject: KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics .. KUDU-2165 workaround: avoid TSAN warnings on CacheMetrics InternalMiniCluster tests were spitting out TSAN warnings where one thread (associated with one mini-tserver) accesses the CacheMetrics while another thread is busy restarting a tserver. The restarting tserver ends up replacing the CacheMetrics instance, which caused TSAN to warn due to the unprotected concurrent access. This patch is a relatively simple workaround that just prevents the Cache from switching CacheMetrics instances once it is first constructed. The underlying metrics are ref-counted, so we shouldn't have any leak or illegal access. The only downside is that, after a restart, new tservers won't be able to "attach" to the cache instance. But, we already had somewhat randomly associated the cache metrics with one of the tablet servers, so it seems unlikely that any assertions are depending on the functionality. Along the way, I removed the Cache::NewId() method which was unused. Change-Id: Ifc5c6e9306df78c364c8b89651ddcc56b90a924f --- M src/kudu/util/cache-test.cc M src/kudu/util/cache.cc M src/kudu/util/cache.h M src/kudu/util/nvm_cache.cc 4 files changed, 21 insertions(+), 31 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/17/8617/4 -- To view, visit http://gerrit.cloudera.org:8080/8617 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ifc5c6e9306df78c364c8b89651ddcc56b90a924f Gerrit-Change-Number: 8617 Gerrit-PatchSet: 4 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] [java] Add ReplicaSelection in KuduScanToken
Dan Burkert has posted comments on this change. ( http://gerrit.cloudera.org:8080/8559 ) Change subject: [java] Add ReplicaSelection in KuduScanToken .. Patch Set 2: Code-Review+2 Looks good. Are you going to add support to the C++ client as well? -- To view, visit http://gerrit.cloudera.org:8080/8559 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I860fcc73e486642ab5177cfd0dc0bdc98fdc6914 Gerrit-Change-Number: 8559 Gerrit-PatchSet: 2 Gerrit-Owner: Hao Hao Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Hao Hao Gerrit-Reviewer: Kudu Jenkins Gerrit-Comment-Date: Tue, 21 Nov 2017 21:31:27 + Gerrit-HasComments: No
[kudu-CR] [spark] add 'local-cluster' mode for unit test
Hao Hao has posted comments on this change. ( http://gerrit.cloudera.org:8080/8585 ) Change subject: [spark] add 'local-cluster' mode for unit test .. Patch Set 1: > (5 comments) > > I wasn't able to get the tests to run on my system, I got a bunch > of error messages like this: > > 12:22:13.383 [ERROR - ExecutorRunner for app-20171121122213-/0] > (Logging.scala:91) Error running executor > java.lang.IllegalStateException: Cannot find any build directories. > at > org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:248) > at > org.apache.spark.launcher.AbstractCommandBuilder.getScalaVersion(AbstractCommandBuilder.java:241) > at > org.apache.spark.launcher.AbstractCommandBuilder.buildClassPath(AbstractCommandBuilder.java:195) > at > org.apache.spark.launcher.AbstractCommandBuilder.buildJavaCommand(AbstractCommandBuilder.java:118) > at > org.apache.spark.launcher.WorkerCommandBuilder.buildCommand(WorkerCommandBuilder.scala:39) > at > org.apache.spark.launcher.WorkerCommandBuilder.buildCommand(WorkerCommandBuilder.scala:47) > at > org.apache.spark.deploy.worker.CommandUtils$.buildCommandSeq(CommandUtils.scala:63) > at > org.apache.spark.deploy.worker.CommandUtils$.buildProcessBuilder(CommandUtils.scala:51) > at > org.apache.spark.deploy.worker.ExecutorRunner.org$apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:145) > at > org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73) > > > > Does that ring any bells? Yes, it should be caused by either the "SPARK_HOME" is not properly set, or you did not happen to build the source code pointed by the "SPARK_HOME"? Sorry that I should also document the latter. -- To view, visit http://gerrit.cloudera.org:8080/8585 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I07b2daad6f6883eb7425fa7132208818bbe65788 Gerrit-Change-Number: 8585 Gerrit-PatchSet: 1 Gerrit-Owner: Hao Hao Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Hao Hao Gerrit-Reviewer: Kudu Jenkins Gerrit-Comment-Date: Tue, 21 Nov 2017 21:22:31 + Gerrit-HasComments: No
[kudu-CR] [spark] add 'local-cluster' mode for unit test
Dan Burkert has posted comments on this change. ( http://gerrit.cloudera.org:8080/8585 ) Change subject: [spark] add 'local-cluster' mode for unit test .. Patch Set 1: (5 comments) I wasn't able to get the tests to run on my system, I got a bunch of error messages like this: 12:22:13.383 [ERROR - ExecutorRunner for app-20171121122213-/0] (Logging.scala:91) Error running executor java.lang.IllegalStateException: Cannot find any build directories. at org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:248) at org.apache.spark.launcher.AbstractCommandBuilder.getScalaVersion(AbstractCommandBuilder.java:241) at org.apache.spark.launcher.AbstractCommandBuilder.buildClassPath(AbstractCommandBuilder.java:195) at org.apache.spark.launcher.AbstractCommandBuilder.buildJavaCommand(AbstractCommandBuilder.java:118) at org.apache.spark.launcher.WorkerCommandBuilder.buildCommand(WorkerCommandBuilder.scala:39) at org.apache.spark.launcher.WorkerCommandBuilder.buildCommand(WorkerCommandBuilder.scala:47) at org.apache.spark.deploy.worker.CommandUtils$.buildCommandSeq(CommandUtils.scala:63) at org.apache.spark.deploy.worker.CommandUtils$.buildProcessBuilder(CommandUtils.scala:51) at org.apache.spark.deploy.worker.ExecutorRunner.org$apache$spark$deploy$worker$ExecutorRunner$$fetchAndRunExecutor(ExecutorRunner.scala:145) at org.apache.spark.deploy.worker.ExecutorRunner$$anon$1.run(ExecutorRunner.scala:73) Does that ring any bells? http://gerrit.cloudera.org:8080/#/c/8585/1/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala File java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala: http://gerrit.cloudera.org:8080/#/c/8585/1/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala@69 PS1, Line 69: testMode.toLowerCase Can be simplified to System.getenv("TEST_MODE") match { http://gerrit.cloudera.org:8080/#/c/8585/1/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala@70 PS1, Line 70: // According to documentation on spark code, this mode: creates a Spark standalone Might be useful to add a reference to SPARK-595. http://gerrit.cloudera.org:8080/#/c/8585/1/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala@78 PS1, Line 78: conf.setMaster("local[*]") I think it'd be better to throw an exception here so that it fails when TEST_MODE is set without SPARK_HOME. Is there a reason I'm missing to do a silent fallback? http://gerrit.cloudera.org:8080/#/c/8585/1/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala@80 PS1, Line 80: val executorLog: String = System.getenv("EXE_LOG_PATH") These configuration knobs are going to be hard to use, and I don't think they will ultimately be very useful. Can we skip adding them for now? We can always add them back later if they are necessary. http://gerrit.cloudera.org:8080/#/c/8585/1/java/kudu-spark/src/test/scala/org/apache/kudu/spark/kudu/TestContext.scala@90 PS1, Line 90: "local" | _ can be simplified to _ -- To view, visit http://gerrit.cloudera.org:8080/8585 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I07b2daad6f6883eb7425fa7132208818bbe65788 Gerrit-Change-Number: 8585 Gerrit-PatchSet: 1 Gerrit-Owner: Hao Hao Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Comment-Date: Tue, 21 Nov 2017 21:13:42 + Gerrit-HasComments: Yes