[kudu-CR] disk failure: release failed txs from tracker

2017-08-08 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: disk failure: release failed txs from tracker
..


Patch Set 8:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7439/8/src/kudu/tablet/mvcc.h
File src/kudu/tablet/mvcc.h:

Line 404:   void Cancel();
here's an idea: how about rather than specifically cancelling each transaction, 
there is an MvccManager::Shutdown() call, and then we allow AbortTransaction() 
of an APPLYING transaction only in the case that the MvccManager has entered 
shutdown state?

then we don't need new per-transaction states, and it more accurately captures 
the idea that "you can only cancel operations once the entire tablet ahs 
started shutting itself down"

(and we can still ensure that, on destruction, the map has been cleared out)


http://gerrit.cloudera.org:8080/#/c/7439/8/src/kudu/tablet/transactions/transaction_driver.cc
File src/kudu/tablet/transactions/transaction_driver.cc:

Line 481:   CHECK(s.IsDiskFailure());
per conversation on Slack, I think generalizing this to something like the 
following would separate the concerns a bit better:

- on a disk error, mark the tablet as shutdown
- when attempting to apply operations on a shutdown tablet, return 
Status::Aborted
- if we get Status::Aborted here, we can CHECK that the tablet is in Shutdown() 
state and then go to the cancellation path
-- basically this would be the same code as in HandleFailure today, except 
widened the scope to allow cancelling a replicating/replicated transaction in 
the case that the tablet is in failed state? (eg by taking a new parameter to 
HandleFailure indicating that it's due to the tablet shutdown?)


http://gerrit.cloudera.org:8080/#/c/7439/8/src/kudu/tablet/transactions/write_transaction.cc
File src/kudu/tablet/transactions/write_transaction.cc:

Line 134:   if (PREDICT_FALSE(tablet->IsInFailedDir())) {
making ApplyRowOperations return a Status would allow centralizing this logic 
into Tablet a bit better and also generalizing to all "Shutdown" cases instead 
of just specifically the disk-failure one


Line 139:   tablet->ApplyRowOperations(state());
here if Tablet indicated that it has shut down then you can early-return 
without creating a commit message. that would be a pretty good safety net that 
we don't accidentally end up writing a commit message


http://gerrit.cloudera.org:8080/#/c/7439/8/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 1076: tablet->SetInFailedDir();
again maybe generalize to just 'MarkFailed' since the handling isn't particular 
to the reason for the failure. (eg if we detected some internal inconsistency 
like a bad checksum or something we may also want to markfailed even though 
it's not in a failed dir)


Line 1078:   LOG(FATAL) << Substitute("Data directory $0 failed. Disk failure 
handling not implemented",
now implemented?


-- 
To view, visit http://gerrit.cloudera.org:8080/7439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: release failed txs from tracker

2017-08-08 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7439

to look at the new patch set (#8).

Change subject: disk failure: release failed txs from tracker
..

disk failure: release failed txs from tracker

Currently, if a TransactionState object is deleted while it is Applying,
Kudu will crash as a way to prevent the failed transaction from
persisting incorrect state onto disk. However, this is not always
necessary: transactions that fail due to disk failure may want to keep
the server alive by marking the tablet as unusable. In this case, there
needs to be a way to release transactions from the TransactionTracker
without crashing Kudu, regardless of state.

This patch adds the ability to Cancel transactions. Unlike Aborts, which
ensure the transaction is not in the APPLYING state when destructed, a
Canceled transaction has no such constraint. This is only "safe" because
the TransactionTracker's tablet is expected to shut down and no longer
be used.

Additionally, a new flag is added to Tablet to indicate that it resides
on a failed disk. If set, the tablet's transactions will end early.
Currently, this codepath will not have a visible effect on Kudu as disk
failures are still fatal, but it is tested for.

Testing is done in a couple ways:
- A test is added in mvcc-test to Cancel an Applying ScopedTransaction
  and ensure that there are no errors when it leaves scope.
- A test is added to tablet_replica-test to register a WriteTransaction,
  set the tablet's "in-failed-dir" flag, and begin Applying. The
  transaction exits early and releases itself from MVCC.

This is part of a series of patches to address disk failure. To see how
this patch fits in, see section 2.3 of:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit

Change-Id: I983620f27e7226806a2cca253db7619731914d42
---
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica-test.cc
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/alter_schema_transaction.h
M src/kudu/tablet/transactions/transaction.h
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/ts_tablet_manager.cc
15 files changed, 133 insertions(+), 29 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/39/7439/8
-- 
To view, visit http://gerrit.cloudera.org:8080/7439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: release failed txs from tracker

2017-08-08 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: disk failure: release failed txs from tracker
..


Patch Set 6:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7439/6/src/kudu/tablet/transactions/transaction.h
File src/kudu/tablet/transactions/transaction.h:

Line 20: #include 
> warning: #includes are not sorted properly [llvm-include-order]
Done


http://gerrit.cloudera.org:8080/#/c/7439/6/src/kudu/tablet/transactions/transaction_driver.cc
File src/kudu/tablet/transactions/transaction_driver.cc:

Line 45: using std::shared_ptr;
> warning: using decl 'shared_ptr' is unused [misc-unused-using-decls]
Done


http://gerrit.cloudera.org:8080/#/c/7439/6/src/kudu/tablet/transactions/transaction_tracker-test.cc
File src/kudu/tablet/transactions/transaction_tracker-test.cc:

Line 24: #include "kudu/tablet/transactions/transaction_driver.h"
> warning: #includes are not sorted properly [llvm-include-order]
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: release failed txs from tracker

2017-08-08 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7439

to look at the new patch set (#7).

Change subject: disk failure: release failed txs from tracker
..

disk failure: release failed txs from tracker

Currently, if a TransactionState object is deleted while it is Applying,
Kudu will crash as a way to prevent the failed transaction from
persisting incorrect state onto disk. However, this is not always
necessary: transactions that fail due to disk failure may want to keep
the server alive by marking the tablet as unusable. In this case, there
needs to be a way to release transactions from the TransactionTracker
without crashing Kudu, regardless of state.

This patch adds the ability to Cancel transactions. Unlike Aborts, which
ensure the transaction is not in the APPLYING state when destructed, a
Canceled transaction has no such constraint. This is only "safe" because
the TransactionTracker's tablet is expected to shut down and no longer
be used.

Additionally, a new flag is added to Tablet to indicate that it resides
on a failed disk. If set, the tablet's transactions will end early.
Currently, this codepath will not have a visible effect on Kudu as disk
failures are still fatal, but it is tested for.

Testing is done in a couple ways:
- A test is added in mvcc-test to Cancel an Applying ScopedTransaction
  and ensure that there are no errors when it leaves scope.
- A test is added to tablet_replica-test to register a WriteTransaction,
  set the tablet's "in-failed-dir" flag, and being Applying. The
  transaction exits early and releases itself from MVCC.

This is part of a series of patches to address disk failure. To see how
this patch fits in, see section 2.3 of:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit

Change-Id: I983620f27e7226806a2cca253db7619731914d42
---
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica-test.cc
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/alter_schema_transaction.h
M src/kudu/tablet/transactions/transaction.h
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/ts_tablet_manager.cc
15 files changed, 133 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/39/7439/7
-- 
To view, visit http://gerrit.cloudera.org:8080/7439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: release failed txs from tracker

2017-08-08 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: disk failure: release failed txs from tracker
..


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7439/4//COMMIT_MSG
Commit Message:

PS4, Line 15: without crashing Kudu, regardless of state.
: 
: T
> currently I think we directly enter the 'APPLYING' state when we submit a t
Realized I never clicked respond.

Right, if the transaction is waiting in the queue, there's not much we can do. 
When the transaction starts applying, however, we can check if the tablet has 
been failed and return early.

Yeah, a flag to check whether the tablet is on a failed disk (not shut down 
quite yet) should do the trick (I have essentially that in a separate patch). I 
think it's fair game to end the transaction mid-apply (e.g. in 
ApplyRowOperations), lest we fail some checks if the apply is the thing that's 
failing.

I agree that the distinction between "Abort" and "Cancel" is a bit (maybe too) 
subtle. The necessity comes from the fact that, for write transactions, the 
TransactionDrivers have ScopedTransaction objects that call 
MvccManager::AbortTransaction when they're deleted (e.g. when the driver is 
released).

The idea with "Cancel" is that we should be able to safely eject the 
transaction from the TransactionTracker (deleting the ScopedTransaction, and 
effectively "completing" the Apply, without ever officially committing the tx, 
and not have to worry about the constraints of the MvccManager that crash Kudu 
if aborting during an Apply.

Maybe a cleaner API would be to have some 
TransactionTracker::Abort(TransactionDriver) that would release the driver 
regardless of state, but that'd be more of a cosmetic change. The underlying 
behavior would be the same.


-- 
To view, visit http://gerrit.cloudera.org:8080/7439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: release failed txs from tracker

2017-08-07 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7439

to look at the new patch set (#6).

Change subject: disk failure: release failed txs from tracker
..

disk failure: release failed txs from tracker

Currently, if a TransactionState object is deleted while it is Applying,
Kudu will crash as a way to prevent the failed transaction from
persisting incorrect state onto disk. However, this is not always
necessary: transactions that fail due to disk failure may want to keep
the server alive by marking the tablet as unusable. In this case, there
needs to be a way to release transactions from the TransactionTracker
without crashing Kudu, regardless of state.

This patch adds the ability to Cancel transactions. Unlike Aborts, which
ensure the transaction is not in the APPLYING state when destructed, a
Canceled transaction has no such constraint. This is only "safe" because
the TransactionTracker's tablet is expected to shut down and no longer
be used.

Additionally, a new flag is added to Tablet to indicate that it resides
on a failed disk. If set, the tablet's transactions will end early.
Currently, this codepath will not have a visible effect on Kudu as disk
failures are still fatal, but testing is still done to test when this
flag is manually set without crashes.

Testing is done in a couple ways:
- A test is added in mvcc-test to Cancel an Applying ScopedTransaction
  and ensure that there are no errors when it leaves scope.
- A test is added to tablet_replica-test to register a WriteTransaction,
  set the tablet's "in-failed-dir" flag, and being Applying. The
  transaction exits early and releases itself from MVCC.

This is part of a series of patches to address disk failure. To see how
this patch fits in, see section 2.3 of:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit

Change-Id: I983620f27e7226806a2cca253db7619731914d42
---
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet.h
M src/kudu/tablet/tablet_replica-test.cc
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/alter_schema_transaction.h
M src/kudu/tablet/transactions/transaction.h
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/ts_tablet_manager.cc
15 files changed, 133 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/39/7439/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7439
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I983620f27e7226806a2cca253db7619731914d42
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon