[kudu-CR] disk failure: reassign failed tablets
Andrew Wong has posted comments on this change. Change subject: disk failure: reassign failed tablets .. Patch Set 8: (5 comments) http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/client/scanner-internal.cc File src/kudu/client/scanner-internal.cc: PS7, Line 232: case tserver::TabletServerErrorPB::TABLET_FAILED: // fall-through > would it make more sense to have this be like: TABLET_NOT_FOUND? How do we Hrm, maybe, but I'm keeping this as is for now. Reasoning here was that before when a tablet was in the FAILED state, we would treat it as TABLET_NOT_RUNNING. I'm looking in client/scanner-internal.cc and it seems like we blacklist the location for TNR (if there's somewhere else I should be looking, please let me know). I'm not sure it makes sense to retry on TNR. I suppose it could retry if the tablet were NOT_STARTED or BOOTSTRAPPING, but tablets in QUIESCING and SHUTDOWN are also considered NOT_RUNNING. http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/consensus/consensus_peers.cc File src/kudu/consensus/consensus_peers.cc: PS7, Line 284: sponse_.error().code() == TabletServerErrorPB::TABLET_FAILED) > maybe in this case we should directly call: NotifyObserversOfFailedFollower Done. http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: PS7, Line 638: // Initiate Tablet Copy on the peer if the tablet is not found. : if (response.has_error()) { : CHECK_EQ(tserver::TabletServerErrorPB::TABLET_NOT_FOUND, response.error().code()); : peer->needs_tablet_copy = true; : VLOG_WITH_PREFIX_UNLOCKED(1) << "Marked peer as needing tablet copy: " : << peer->ToString(); : *more_pending = true; : return; : } : : // Sanity checks. : // Some of these can be eventually removed, but they are handy for now. : DCHECK(response.status().IsInitialized()) : << "Error: Uninitialized: " << response.InitializationErrorString() : << ". Response: "<< SecureShortDebugString(response); : // TODO: Include uuid in error messages as well. : DCHECK(response.has_responder_uuid() && !response.responder_uuid().empty()) : > see my comment on the call site Done http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: PS7, Line 170: DEFINE_bool(master_tombstone_failed_tablet_replicas, true, : "Whether the master should tombstone (delete) tablet replicas that " : "are reporting a failed state. Only for testing!"); : TAG_FLAG(master_tombstone_failed_tablet_replicas, hidden); > is this a test only thing? As of now, yes. Will update to make that clear. http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/tablet/metadata.proto File src/kudu/tablet/metadata.proto: PS7, Line 161: the tablet will be evicted and > ?? Should be evicted and replaced. -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] disk failure: reassign failed tablets
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7440 to look at the new patch set (#8). Change subject: disk failure: reassign failed tablets .. disk failure: reassign failed tablets Tablets put into the state tablet::FAILED are left until they are manually deleted. This is an issue because failed tablets don't get evicted and reassigned (e.g. if a tablet fails to bootstrap, it will sit, responding to heartbeats, doing nothing else). To remediate this, this patch changes the tserver response generated by FAILED tablets to a new TABLET_FAILED state, on which a leader will immediately evict the peer. Additionally, a new tablet state is added: FAILED_AND_SHUTDOWN. Like QUIESCING and SHUTDOWN, TabletReplica::Shutdown() can wait on FAILED_AND_SHUTDOWN. This is useful if a failed tablet needs to be shut down and still needs to be reassigned. Calling normal Shutdown() cannot leave the replica in the FAILED state, and the SHUTDOWN state cannot itself indicate the need for eviction. Prior to this patch, tablets were set to FAILED when they failed to delete metadata. This is no longer the case. Since error statuses during deletion are only returned during IO to the metadata directory, and because the metadata directory is a single point of failure, failures on this codepath are made fatal for now. Once this is no longer the case, these failures should be made benign, as proper error handling should make files in the failed metadata directory unreachable. This ensures the tablets that were meant to be deleted are not reassigned. The test raft_consensus-itest is updated to ensure that failed tablets are evicted and replaced. The test tablet_server-test is also updated to ensure that, instead of TABLET_NOT_RUNNING, TABLET_FAILED is returned by failed tablets. This patch is a part of a series of patches to handle disk failure. See section 2.5 in this doc: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 --- M src/kudu/client/scanner-internal.cc M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/consensus/consensus_queue.h M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/integration-tests/ts_recovery-itest.cc M src/kudu/master/catalog_manager.cc M src/kudu/tablet/metadata.proto M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/tablet_service.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/tserver.proto 14 files changed, 131 insertions(+), 65 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/8 -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: reassign failed tablets
David Ribeiro Alves has posted comments on this change. Change subject: disk failure: reassign failed tablets .. Patch Set 7: (5 comments) http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/client/scanner-internal.cc File src/kudu/client/scanner-internal.cc: PS7, Line 232: case tserver::TabletServerErrorPB::TABLET_FAILED: // fall-through would it make more sense to have this be like: TABLET_NOT_FOUND? How do we respond to TABLET_NOT_RUNNING on the client? do we retry? my fear is that we might eventually do that (or might even do that already in some cases like single replica tablets) but this is a non-retriable error. Alternatively actually sending a tablet failed reply might make sense too. http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/consensus/consensus_peers.cc File src/kudu/consensus/consensus_peers.cc: PS7, Line 284: response_.error().code() != TabletServerErrorPB::TABLET_FAILED maybe in this case we should directly call: NotifyObserversOfFailedFollower() or a new analog public method (like queue_->NotifyPeerHasFailed())? http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: PS7, Line 638: // We only let special types of errors through to this point from the peer. : if (response.has_error()) { : // Initiate Tablet Copy on the peer if the tablet is not found. : if (tserver::TabletServerErrorPB::TABLET_NOT_FOUND == response.error().code()) { : peer->needs_tablet_copy = true; : VLOG_WITH_PREFIX_UNLOCKED(1) << "Marked peer as needing tablet copy: " : << peer->ToString(); : *more_pending = true; : } else { : // Ignore the response from the peer if it is failed. : CHECK(tserver::TabletServerErrorPB::TABLET_FAILED == response.error().code()) : << SecureShortDebugString(response); : VLOG_WITH_PREFIX_UNLOCKED(1) << "Ignoring response from tablet to promote eviction: " : << peer->ToString(); : *more_pending = false; : } : return; : } see my comment on the call site http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: PS7, Line 170: DEFINE_bool(master_tombstone_failed_tablet_replicas, true, : "Whether the master should tombstone (delete) tablet replicas that " : "are reporting a failed state."); : TAG_FLAG(master_tombstone_failed_tablet_replicas, hidden); is this a test only thing? http://gerrit.cloudera.org:8080/#/c/7440/7/src/kudu/tablet/metadata.proto File src/kudu/tablet/metadata.proto: PS7, Line 161: the tablet will be replicated. ?? -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] disk failure: reassign failed tablets
Andrew Wong has posted comments on this change. Change subject: disk failure: reassign failed tablets .. Patch Set 7: (1 comment) http://gerrit.cloudera.org:8080/#/c/7440/2/src/kudu/consensus/consensus_queue.cc File src/kudu/consensus/consensus_queue.cc: Line 642: peer->needs_tablet_copy = true; > Why Tablet Copy? Why not evict and let the master reassign the replica? Hrm, good point. This would probably introduce a cases where a disk fails and a single server has to take on a bunch of tablet copies. Done. -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] disk failure: reassign failed tablets
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7440 to look at the new patch set (#7). Change subject: disk failure: reassign failed tablets .. disk failure: reassign failed tablets Tablets put into the state tablet::FAILED are left until they are manually deleted. This is an issue because failed tablets don't get evicted and reassigned (e.g. if a tablet fails to bootstrap, it will sit, responding to heartbeats, doing nothing else). To remediate this, this patch changes the tserver response generated by FAILED tablets to a new TABLET_FAILED, which is ignored by leaders to promote eviction. Additionally, a new tablet state is added: FAILED_AND_SHUTDOWN. Like QUIESCING and SHUTDOWN, TabletReplica::Shutdown() can wait on FAILED_AND_SHUTDOWN. This is useful if a failed tablet needs to be shut down and still needs to be reassigned. Calling normal Shutdown() cannot leave the replica in the FAILED state, and the SHUTDOWN state cannot itself indicate the need for eviction. Prior to this patch, tablets were set to FAILED when they failed to delete metadata. This is no longer the case. Since error statuses during deletion are only returned during IO to the metadata directory, and because the metadata directory is a single point of failure, failures on this codepath are made fatal for now. Once this is no longer the case, these failures should be made benign, as proper error handling should make files in the failed metadata directory unreachable. This ensures the tablets that were meant to be deleted are not reassigned. The test raft_consensus-itest is updated to ensure that failed tablets are evicted and replaced. The test tablet_server-test is also updated to ensure that, instead of TABLET_NOT_RUNNING, TABLET_FAILED is returned by failed tablets. This patch is a part of a series of patches to handle disk failure. See section 2.5 in this doc: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 --- M src/kudu/client/scanner-internal.cc M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/integration-tests/ts_recovery-itest.cc M src/kudu/master/catalog_manager.cc M src/kudu/tablet/metadata.proto M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/tablet_service.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/tserver.proto 13 files changed, 119 insertions(+), 54 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/7 -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 7 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: reassign failed tablets
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7440 to look at the new patch set (#6). Change subject: disk failure: reassign failed tablets .. disk failure: reassign failed tablets Tablets put into the state tablet::FAILED are left until they are manually deleted. This is an issue because failed tablets don't get evicted and reassigned (e.g. if a tablet fails to bootstrap, it will sit, responding to heartbeats, doing nothing else). To remediate this, this patch changes the tserver response generated by FAILED tablets to a new TABLET_FAILED, which is ignored by leaders to promote eviction. Additionally, a new tablet state is added: FAILED_AND_SHUTDOWN. Like QUIESCING and SHUTDOWN, TabletReplica::Shutdown() can wait on FAILED_AND_SHUTDOWN. This is useful if a failed tablet needs to be shut down and still needs to be reassigned. Calling normal Shutdown() cannot leave the replica in the FAILED state, and the SHUTDOWN state cannot itself indicate the need for eviction. Prior to this patch, tablets were set to FAILED when they failed to delete metadata. This is no longer the case. Since error statuses during deletion are only returned during IO to the metadata directory, and because the metadata directory is a single point of failure, failures on this codepath are made fatal for now. Once this is no longer the case, these failures should be made benign, as proper error handling should make files in the failed metadata directory unreachable. This ensures the tablets that were meant to be deleted are not reassigned. The test raft_consensus-itest is updated to ensure that failed tablets are evicted and replaced. The test tablet_server-test is also updated to ensure that, instead of TABLET_NOT_RUNNING, TABLET_FAILED is returned by failed tablets. This patch is a part of a series of patches to handle disk failure. See section 2.5 in this doc: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 --- M src/kudu/client/scanner-internal.cc M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/master/catalog_manager.cc M src/kudu/tablet/metadata.proto M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/tablet_service.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/tserver.proto 12 files changed, 105 insertions(+), 56 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/6 -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 6 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: reassign failed tablets
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7440 to look at the new patch set (#5). Change subject: disk failure: reassign failed tablets .. disk failure: reassign failed tablets Tablets put into the state tablet::FAILED are left until they are manually deleted. This is an issue because failed tablets don't get evicted and reassigned (e.g. if a tablet fails to bootstrap, it will sit, responding to heartbeats, doing nothing else). To remediate this, this patch changes the tserver response generated by FAILED tablets to a new TABLET_FAILED, which is ignored by leaders to promote eviction. Additionally, a new tablet state is added: FAILED_AND_SHUTDOWN. Like QUIESCING and SHUTDOWN, TabletReplica::Shutdown() can wait on FAILED_AND_SHUTDOWN. This is useful if a failed tablet needs to be shut down and still needs to be reassigned. Calling normal Shutdown() cannot leave the replica in the FAILED state, and the SHUTDOWN state cannot itself indicate the need for eviction. Additionally, prior to this patch, tablets were set to FAILED when they failed to delete metadata. This is no longer the case. Since error statuses during deletion are only returned during IO to the metadata directory, and because the metadata directory is a single point of failure, failures on this codepath are made fatal for now. Once this is no longer the case, these failures should be made benign, as proper error handling should make files on the failed metadata directory unreachable. This ensures the tablets that were meant to be deleted are not reassigned. This patch is a part of a series of patches to handle disk failure. See section 2.5 in this doc: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 --- M src/kudu/client/scanner-internal.cc M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/master/catalog_manager.cc M src/kudu/tablet/metadata.proto M src/kudu/tablet/tablet_replica.cc M src/kudu/tablet/tablet_replica.h M src/kudu/tserver/tablet_server-test.cc M src/kudu/tserver/tablet_service.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/tserver.proto 12 files changed, 111 insertions(+), 55 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/5 -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: reassign failed tablets
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7440 to look at the new patch set (#4). Change subject: disk failure: reassign failed tablets .. disk failure: reassign failed tablets Tablets are marked tablet::FAILED in a number of places (e.g. failing to bootstrap) and will be marked tablet::FAILED_AND_SHUTDOWN upon disk failure. Currently, a tablet in either state is left alone and will respond to heartbeats with TABLET_NOT_RUNNING messages, which indicate it is responsive, but the tablet itself is not used for anything. This patch changes this behavior to ensure that tablets in either state will respond with the new TABLET_FAILED response that does not indicate the tablet is responsive, promoting eviction. Additionally, prior to this patch, tablets were set to FAILED when they failed to delete metadata. This is no longer the case. Since error statuses during deletion are only returned during IO to the metadata directory, and because the metadata directory is a single point of failure, failures in this codepath are made fatal for now. Once this is no longer the case, these failures should be made benign, as proper error handling should make files on the failed metadata directory unreachable. This ensures the tablets that were meant to be deleted are not reassigned. This patch is a part of a series of patches to handle disk failure. See section 2.5 in this doc: https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 --- M src/kudu/client/scanner-internal.cc M src/kudu/consensus/consensus_peers.cc M src/kudu/consensus/consensus_queue.cc M src/kudu/integration-tests/raft_consensus-itest.cc M src/kudu/master/catalog_manager.cc M src/kudu/tserver/tablet_service.cc M src/kudu/tserver/ts_tablet_manager.cc M src/kudu/tserver/tserver.proto 8 files changed, 76 insertions(+), 40 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/4 -- To view, visit http://gerrit.cloudera.org:8080/7440 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon