[kudu-CR] disk failure: release failed txs from tracker
Todd Lipcon has posted comments on this change. Change subject: disk failure: release failed txs from tracker .. Patch Set 8: (6 comments) http://gerrit.cloudera.org:8080/#/c/7439/8/src/kudu/tablet/mvcc.h File src/kudu/tablet/mvcc.h: Line 404: void Cancel(); here's an idea: how about rather than specifically cancelling each transaction, there is an MvccManager::Shutdown() call, and then we allow AbortTransaction() of an APPLYING transaction only in the case that the MvccManager has entered shutdown state? then we don't need new per-transaction states, and it more accurately captures the idea that "you can only cancel operations once the entire tablet ahs started shutting itself down" (and we can still ensure that, on destruction, the map has been cleared out) http://gerrit.cloudera.org:8080/#/c/7439/8/src/kudu/tablet/transactions/transaction_driver.cc File src/kudu/tablet/transactions/transaction_driver.cc: Line 481: CHECK(s.IsDiskFailure()); per conversation on Slack, I think generalizing this to something like the following would separate the concerns a bit better: - on a disk error, mark the tablet as shutdown - when attempting to apply operations on a shutdown tablet, return Status::Aborted - if we get Status::Aborted here, we can CHECK that the tablet is in Shutdown() state and then go to the cancellation path -- basically this would be the same code as in HandleFailure today, except widened the scope to allow cancelling a replicating/replicated transaction in the case that the tablet is in failed state? (eg by taking a new parameter to HandleFailure indicating that it's due to the tablet shutdown?) http://gerrit.cloudera.org:8080/#/c/7439/8/src/kudu/tablet/transactions/write_transaction.cc File src/kudu/tablet/transactions/write_transaction.cc: Line 134: if (PREDICT_FALSE(tablet->IsInFailedDir())) { making ApplyRowOperations return a Status would allow centralizing this logic into Tablet a bit better and also generalizing to all "Shutdown" cases instead of just specifically the disk-failure one Line 139: tablet->ApplyRowOperations(state()); here if Tablet indicated that it has shut down then you can early-return without creating a commit message. that would be a pretty good safety net that we don't accidentally end up writing a commit message http://gerrit.cloudera.org:8080/#/c/7439/8/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: Line 1076: tablet->SetInFailedDir(); again maybe generalize to just 'MarkFailed' since the handling isn't particular to the reason for the failure. (eg if we detected some internal inconsistency like a bad checksum or something we may also want to markfailed even though it's not in a failed dir) Line 1078: LOG(FATAL) << Substitute("Data directory $0 failed. Disk failure handling not implemented", now implemented? -- To view, visit http://gerrit.cloudera.org:8080/7439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] disk failure: release failed txs from tracker
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7439 to look at the new patch set (#8). Change subject: disk failure: release failed txs from tracker .. disk failure: release failed txs from tracker Currently, if a TransactionState object is deleted while it is Applying, Kudu will crash as a way to prevent the failed transaction from persisting incorrect state onto disk. However, this is not always necessary: transactions that fail due to disk failure may want to keep the server alive by marking the tablet as unusable. In this case, there needs to be a way to release transactions from the TransactionTracker without crashing Kudu, regardless of state. This patch adds the ability to Cancel transactions. Unlike Aborts, which ensure the transaction is not in the APPLYING state when destructed, a Canceled transaction has no such constraint. This is only "safe" because the TransactionTracker's tablet is expected to shut down and no longer be used. Additionally, a new flag is added to Tablet to indicate that it resides on a failed disk. If set, the tablet's transactions will end early. Currently, this codepath will not have a visible effect on Kudu as disk failures are still fatal, but it is tested for. Testing is done in a couple ways: - A test is added in mvcc-test to Cancel an Applying ScopedTransaction and ensure that there are no errors when it leaves scope. - A test is added to tablet_replica-test to register a WriteTransaction, set the tablet's "in-failed-dir" flag, and begin Applying. The transaction exits early and releases itself from MVCC. This is part of a series of patches to address disk failure. To see how this patch fits in, see section 2.3 of: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit Change-Id: I983620f27e7226806a2cca253db7619731914d42 --- M src/kudu/tablet/mvcc-test.cc M src/kudu/tablet/mvcc.cc M src/kudu/tablet/mvcc.h M src/kudu/tablet/tablet.cc M src/kudu/tablet/tablet.h M src/kudu/tablet/tablet_replica-test.cc M src/kudu/tablet/transactions/alter_schema_transaction.cc M src/kudu/tablet/transactions/alter_schema_transaction.h M src/kudu/tablet/transactions/transaction.h M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tablet/transactions/transaction_driver.h M src/kudu/tablet/transactions/transaction_tracker-test.cc M src/kudu/tablet/transactions/write_transaction.cc M src/kudu/tablet/transactions/write_transaction.h M src/kudu/tserver/ts_tablet_manager.cc 15 files changed, 133 insertions(+), 29 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/39/7439/8 -- To view, visit http://gerrit.cloudera.org:8080/7439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: release failed txs from tracker
Andrew Wong has posted comments on this change. Change subject: disk failure: release failed txs from tracker .. Patch Set 6: (3 comments) http://gerrit.cloudera.org:8080/#/c/7439/6/src/kudu/tablet/transactions/transaction.h File src/kudu/tablet/transactions/transaction.h: Line 20: #include > warning: #includes are not sorted properly [llvm-include-order] Done http://gerrit.cloudera.org:8080/#/c/7439/6/src/kudu/tablet/transactions/transaction_driver.cc File src/kudu/tablet/transactions/transaction_driver.cc: Line 45: using std::shared_ptr; > warning: using decl 'shared_ptr' is unused [misc-unused-using-decls] Done http://gerrit.cloudera.org:8080/#/c/7439/6/src/kudu/tablet/transactions/transaction_tracker-test.cc File src/kudu/tablet/transactions/transaction_tracker-test.cc: Line 24: #include "kudu/tablet/transactions/transaction_driver.h" > warning: #includes are not sorted properly [llvm-include-order] Done -- To view, visit http://gerrit.cloudera.org:8080/7439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] disk failure: release failed txs from tracker
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7439 to look at the new patch set (#7). Change subject: disk failure: release failed txs from tracker .. disk failure: release failed txs from tracker Currently, if a TransactionState object is deleted while it is Applying, Kudu will crash as a way to prevent the failed transaction from persisting incorrect state onto disk. However, this is not always necessary: transactions that fail due to disk failure may want to keep the server alive by marking the tablet as unusable. In this case, there needs to be a way to release transactions from the TransactionTracker without crashing Kudu, regardless of state. This patch adds the ability to Cancel transactions. Unlike Aborts, which ensure the transaction is not in the APPLYING state when destructed, a Canceled transaction has no such constraint. This is only "safe" because the TransactionTracker's tablet is expected to shut down and no longer be used. Additionally, a new flag is added to Tablet to indicate that it resides on a failed disk. If set, the tablet's transactions will end early. Currently, this codepath will not have a visible effect on Kudu as disk failures are still fatal, but it is tested for. Testing is done in a couple ways: - A test is added in mvcc-test to Cancel an Applying ScopedTransaction and ensure that there are no errors when it leaves scope. - A test is added to tablet_replica-test to register a WriteTransaction, set the tablet's "in-failed-dir" flag, and being Applying. The transaction exits early and releases itself from MVCC. This is part of a series of patches to address disk failure. To see how this patch fits in, see section 2.3 of: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit Change-Id: I983620f27e7226806a2cca253db7619731914d42 --- M src/kudu/tablet/mvcc-test.cc M src/kudu/tablet/mvcc.cc M src/kudu/tablet/mvcc.h M src/kudu/tablet/tablet.cc M src/kudu/tablet/tablet.h M src/kudu/tablet/tablet_replica-test.cc M src/kudu/tablet/transactions/alter_schema_transaction.cc M src/kudu/tablet/transactions/alter_schema_transaction.h M src/kudu/tablet/transactions/transaction.h M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tablet/transactions/transaction_driver.h M src/kudu/tablet/transactions/transaction_tracker-test.cc M src/kudu/tablet/transactions/write_transaction.cc M src/kudu/tablet/transactions/write_transaction.h M src/kudu/tserver/ts_tablet_manager.cc 15 files changed, 133 insertions(+), 27 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/39/7439/7 -- To view, visit http://gerrit.cloudera.org:8080/7439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: release failed txs from tracker
Andrew Wong has posted comments on this change. Change subject: disk failure: release failed txs from tracker .. Patch Set 6: (1 comment) http://gerrit.cloudera.org:8080/#/c/7439/4//COMMIT_MSG Commit Message: PS4, Line 15: without crashing Kudu, regardless of state. : : T > currently I think we directly enter the 'APPLYING' state when we submit a t Realized I never clicked respond. Right, if the transaction is waiting in the queue, there's not much we can do. When the transaction starts applying, however, we can check if the tablet has been failed and return early. Yeah, a flag to check whether the tablet is on a failed disk (not shut down quite yet) should do the trick (I have essentially that in a separate patch). I think it's fair game to end the transaction mid-apply (e.g. in ApplyRowOperations), lest we fail some checks if the apply is the thing that's failing. I agree that the distinction between "Abort" and "Cancel" is a bit (maybe too) subtle. The necessity comes from the fact that, for write transactions, the TransactionDrivers have ScopedTransaction objects that call MvccManager::AbortTransaction when they're deleted (e.g. when the driver is released). The idea with "Cancel" is that we should be able to safely eject the transaction from the TransactionTracker (deleting the ScopedTransaction, and effectively "completing" the Apply, without ever officially committing the tx, and not have to worry about the constraints of the MvccManager that crash Kudu if aborting during an Apply. Maybe a cleaner API would be to have some TransactionTracker::Abort(TransactionDriver) that would release the driver regardless of state, but that'd be more of a cosmetic change. The underlying behavior would be the same. -- To view, visit http://gerrit.cloudera.org:8080/7439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] disk failure: release failed txs from tracker
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7439 to look at the new patch set (#6). Change subject: disk failure: release failed txs from tracker .. disk failure: release failed txs from tracker Currently, if a TransactionState object is deleted while it is Applying, Kudu will crash as a way to prevent the failed transaction from persisting incorrect state onto disk. However, this is not always necessary: transactions that fail due to disk failure may want to keep the server alive by marking the tablet as unusable. In this case, there needs to be a way to release transactions from the TransactionTracker without crashing Kudu, regardless of state. This patch adds the ability to Cancel transactions. Unlike Aborts, which ensure the transaction is not in the APPLYING state when destructed, a Canceled transaction has no such constraint. This is only "safe" because the TransactionTracker's tablet is expected to shut down and no longer be used. Additionally, a new flag is added to Tablet to indicate that it resides on a failed disk. If set, the tablet's transactions will end early. Currently, this codepath will not have a visible effect on Kudu as disk failures are still fatal, but testing is still done to test when this flag is manually set without crashes. Testing is done in a couple ways: - A test is added in mvcc-test to Cancel an Applying ScopedTransaction and ensure that there are no errors when it leaves scope. - A test is added to tablet_replica-test to register a WriteTransaction, set the tablet's "in-failed-dir" flag, and being Applying. The transaction exits early and releases itself from MVCC. This is part of a series of patches to address disk failure. To see how this patch fits in, see section 2.3 of: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit Change-Id: I983620f27e7226806a2cca253db7619731914d42 --- M src/kudu/tablet/mvcc-test.cc M src/kudu/tablet/mvcc.cc M src/kudu/tablet/mvcc.h M src/kudu/tablet/tablet.cc M src/kudu/tablet/tablet.h M src/kudu/tablet/tablet_replica-test.cc M src/kudu/tablet/transactions/alter_schema_transaction.cc M src/kudu/tablet/transactions/alter_schema_transaction.h M src/kudu/tablet/transactions/transaction.h M src/kudu/tablet/transactions/transaction_driver.cc M src/kudu/tablet/transactions/transaction_driver.h M src/kudu/tablet/transactions/transaction_tracker-test.cc M src/kudu/tablet/transactions/write_transaction.cc M src/kudu/tablet/transactions/write_transaction.h M src/kudu/tserver/ts_tablet_manager.cc 15 files changed, 133 insertions(+), 27 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/39/7439/6 -- To view, visit http://gerrit.cloudera.org:8080/7439 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon