[kudu-CR] [catalog manager] proper handling of catalog shutdown
Hello Adar Dembo, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6170 to look at the new patch set (#5). Change subject: [catalog_manager] proper handling of catalog shutdown .. [catalog_manager] proper handling of catalog shutdown This changelist introduces the categorization of the system catalog's read/write operation failures happened while executing the leader post-election task. There are two categories of errors: fatal and non-fatal. If a read/write operation fails in between terms of the system catalog leadership, the error is considered non-fatal. In case of a non-fatal error the leader post-election task bails out: the catalog is no longer the leader at the original term and the task should be executed by the new leader upon the ElectedAsLeaderCb. If a read/write operation fails happened at the same term of the system catalog leadership, the error is considered fatal: this causes the master process to crash. This is done do avoid any possible inconsistency when reading and persisting important system information such as tables/tablets metadata, IPKI certificate authority information and TSKs (Token Signing Keys). All read/write operation failures happened during the system catalog's shutdown are ignored. Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8 --- M src/kudu/master/catalog_manager.cc M src/kudu/master/catalog_manager.h M src/kudu/master/master-test.cc M src/kudu/master/sys_catalog-test.cc M src/kudu/master/sys_catalog.cc 5 files changed, 187 insertions(+), 120 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/5 -- To view, visit http://gerrit.cloudera.org:8080/6170 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8 Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey SerbinGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] [catalog manager] proper handling of catalog shutdown
David Ribeiro Alves has posted comments on this change. Change subject: [catalog_manager] proper handling of catalog shutdown .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/6170/4/src/kudu/master/catalog_manager.h File src/kudu/master/catalog_manager.h: PS4, Line 585: bool* error_is_not_leader I agree with dan. In fact this was part of the motivation for my comments yesterday about not crashing, you wouldn't need to have specific handling for the "not leader" case, and instead have generic handling for all cases where writes to consensus fail. -- To view, visit http://gerrit.cloudera.org:8080/6170 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey SerbinGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] [catalog manager] proper handling of catalog shutdown
Dan Burkert has posted comments on this change. Change subject: [catalog_manager] proper handling of catalog shutdown .. Patch Set 4: (4 comments) In general I'm not a fan of the additional bool* out params, when the callers could just check resp.error().code() == TabletServerErrorPB::NOT_THE_LEADER The bool also doesn't seem to be used in non-test code. http://gerrit.cloudera.org:8080/#/c/6170/4/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: Line 458: { New scope seems redundant. Line 459: bool error_is_not_leader; What is this boolean doing? It's passed as an out parameter to TryGenerateNewTskUnlocked, and then not used thereafter. Line 763: if (s.IsNotFound()) { This method might be simpler if it's structured as a single if/else chain: if (s.ok()) { return InitCertAuth... } else if (s.IsNotFound()) { ... } else { return s; } Line 764: LOG(INFO) << "Did not find CA certificate and key for Kudu IPKI, " This is going to be pretty chatty, perhaps VLOG? -- To view, visit http://gerrit.cloudera.org:8080/6170 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey SerbinGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] [catalog manager] proper handling of catalog shutdown
Hello Adar Dembo, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6170 to look at the new patch set (#4). Change subject: [catalog_manager] proper handling of catalog shutdown .. [catalog_manager] proper handling of catalog shutdown Re-factored the code to proper handle the system catalog's shutdown. Besides, updated the logic to properly handle the NOT_THE_LEADER error for write operations against the system catalog table in cases when such an error is a recoverable failure. Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8 --- M src/kudu/master/catalog_manager.cc M src/kudu/master/catalog_manager.h M src/kudu/master/sys_catalog-test.cc M src/kudu/master/sys_catalog.cc M src/kudu/master/sys_catalog.h 5 files changed, 245 insertions(+), 128 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/4 -- To view, visit http://gerrit.cloudera.org:8080/6170 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey SerbinGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] [catalog manager] proper handling of catalog shutdown
Alexey Serbin has posted comments on this change. Change subject: [catalog_manager] proper handling of catalog shutdown .. Patch Set 3: (2 comments) http://gerrit.cloudera.org:8080/#/c/6170/2/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: PS2, Line 458: { : bool error_is_not_leader; : const Status& s = : catalog_manager_->TryGenerateNewTskUnlocked(_is_not_leader); : if (!s.ok()) { : LOG(WARNING) << "Failed to add new TSK: " << s.ToString(); : if (!error_is_not_leader) { : // Do not expect any other errors besides NOT_THE_LEADER error. : CHECK_OK(s); : } : } : } > say that the error is something else (like the machine is partitioned) does AFAIK, your understanding is correct -- it might happen that checks in the ScopedLeaderSharedLock ctor went fine, but then situation might change down the road. For example, the master could lose its leadership role while holding that shared lock -- there is no guarantee it will stay a master up to the very end of the operation which holds that lock. I think you are correct -- we shouldn't crash the master server if we can let it continue working knowing its operation in such a state does not bring inconsistency into the clustered master. This is exactly the case: if master some error happened and the master cannot write the newly generated TSK into the system table, we should not let it run only if it stays the leader master AND no any active TSK is left. I need to figure out how to check that master is still a leader (the latter part with TSK it clear to me). Probably, I can use the underlying system_catalog_'s consensus implementation to figure that out. This is a background task which runs on its own -- I'm not sure it's possible to implement the behavior you described. However, I might be missing something. PS2, Line 751: if (!(s.ok() || s.IsNotFound())) { : return s; : } > could you leave a comment that we only handle the OK and NotFound cases bel Done -- To view, visit http://gerrit.cloudera.org:8080/6170 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey SerbinGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] [catalog manager] proper handling of catalog shutdown
Hello Adar Dembo, Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6170 to look at the new patch set (#3). Change subject: [catalog_manager] proper handling of catalog shutdown .. [catalog_manager] proper handling of catalog shutdown Re-factored the code to proper handle the system catalog's shutdown. Besides, updated the logic to properly handle the NOT_THE_LEADER error for write operations against the system catalog table in cases when such an error is a recoverable failure. Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8 --- M src/kudu/master/catalog_manager.cc M src/kudu/master/catalog_manager.h M src/kudu/master/sys_catalog-test.cc M src/kudu/master/sys_catalog.cc M src/kudu/master/sys_catalog.h 5 files changed, 243 insertions(+), 128 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/3 -- To view, visit http://gerrit.cloudera.org:8080/6170 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8 Gerrit-PatchSet: 3 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey SerbinGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] [catalog manager] proper handling of catalog shutdown
Alexey Serbin has posted comments on this change. Change subject: [catalog_manager] proper handling of catalog shutdown .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/6170/2/src/kudu/master/catalog_manager.h File src/kudu/master/catalog_manager.h: Line 602: Status StoreCertAuthorityInfo(const security::PrivateKey& key, > warning: function 'kudu::master::CatalogManager::StoreCertAuthorityInfo' ha Done -- To view, visit http://gerrit.cloudera.org:8080/6170 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey SerbinGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] [catalog manager] proper handling of catalog shutdown
Alexey Serbin has posted comments on this change. Change subject: [catalog_manager] proper handling of catalog shutdown .. Patch Set 2: > Will defer to Mike and/or David; I bet they have stronger opinions > on this than I do. OK, sure. Thank you for the fast review! -- To view, visit http://gerrit.cloudera.org:8080/6170 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey SerbinGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] [catalog manager] proper handling of catalog shutdown
Adar Dembo has posted comments on this change. Change subject: [catalog_manager] proper handling of catalog shutdown .. Patch Set 2: Code-Review+1 Will defer to Mike and/or David; I bet they have stronger opinions on this than I do. -- To view, visit http://gerrit.cloudera.org:8080/6170 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey SerbinGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: No
[kudu-CR] [catalog manager] proper handling of catalog shutdown
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/6170 to look at the new patch set (#2). Change subject: [catalog_manager] proper handling of catalog shutdown .. [catalog_manager] proper handling of catalog shutdown Re-factored the code to proper handle the system catalog's shutdown. Besides, updated the logic to properly handle the NOT_THE_LEADER error for write operations against the system catalog table in cases when such an error is a recoverable failure. Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8 --- M src/kudu/master/catalog_manager.cc M src/kudu/master/catalog_manager.h M src/kudu/master/sys_catalog-test.cc M src/kudu/master/sys_catalog.cc M src/kudu/master/sys_catalog.h 5 files changed, 243 insertions(+), 127 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/2 -- To view, visit http://gerrit.cloudera.org:8080/6170 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey SerbinGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] [catalog manager] proper handling of catalog shutdown
Alexey Serbin has posted comments on this change. Change subject: [catalog_manager] proper handling of catalog shutdown .. Patch Set 1: (5 comments) > (4 comments) > > As a general point, it sucks that Status isn't full-featured enough > to include this additional bit of information. IIRC Mike proposed a > sweeping change to Status handling a couple years ago which would > have made it versatile enough to include this case. It would be > interesting to hear his comments on this. Yep, it might be handy to have that. Probably, having this use-case might help to make it happen :) http://gerrit.cloudera.org:8080/#/c/6170/1/src/kudu/master/catalog_manager.cc File src/kudu/master/catalog_manager.cc: Line 204: using std::bind; > warning: using decl 'bind' is unused [misc-unused-using-decls] Done http://gerrit.cloudera.org:8080/#/c/6170/1/src/kudu/master/catalog_manager.h File src/kudu/master/catalog_manager.h: Line 20: #include > Is this for existing code? Or something new? Now it's just for the existing code: StartTabletCopy() method's signature has std::function<...> as a parameter. PS1, Line 585: master's > Nit: to make it clearer that this is a variant of InitCertAuthority(), can Done Line 596: // Store CA certificate information into the system table. > Doc what is_not_leader_error means. Done http://gerrit.cloudera.org:8080/#/c/6170/1/src/kudu/master/sys_catalog.h File src/kudu/master/sys_catalog.h: Line 153:bool* is_not_leader); > Maybe rename to "error_is_not_leader" or "is_not_leader_error"? Done -- To view, visit http://gerrit.cloudera.org:8080/6170 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey SerbinGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] [catalog manager] proper handling of catalog shutdown
Adar Dembo has posted comments on this change. Change subject: [catalog_manager] proper handling of catalog shutdown .. Patch Set 1: (4 comments) As a general point, it sucks that Status isn't full-featured enough to include this additional bit of information. IIRC Mike proposed a sweeping change to Status handling a couple years ago which would have made it versatile enough to include this case. It would be interesting to hear his comments on this. http://gerrit.cloudera.org:8080/#/c/6170/1/src/kudu/master/catalog_manager.h File src/kudu/master/catalog_manager.h: Line 20: #include Is this for existing code? Or something new? PS1, Line 585: master's Nit: to make it clearer that this is a variant of InitCertAuthority(), can you change this to "IPKI"? Line 596: // Store CA certificate information into the system table. Doc what is_not_leader_error means. http://gerrit.cloudera.org:8080/#/c/6170/1/src/kudu/master/sys_catalog.h File src/kudu/master/sys_catalog.h: Line 153:bool* is_not_leader); Maybe rename to "error_is_not_leader" or "is_not_leader_error"? -- To view, visit http://gerrit.cloudera.org:8080/6170 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey SerbinGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] [catalog manager] proper handling of catalog shutdown
Alexey Serbin has uploaded a new change for review. http://gerrit.cloudera.org:8080/6170 Change subject: [catalog_manager] proper handling of catalog shutdown .. [catalog_manager] proper handling of catalog shutdown Re-factored the code to proper handle the system catalog shutdown. Besides, updated the logic to properly handle the NOT_THE_LEADER error for write operations against the system catalog table for cases when such an error is a recoverable failure. Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8 --- M src/kudu/master/catalog_manager.cc M src/kudu/master/catalog_manager.h M src/kudu/master/sys_catalog-test.cc M src/kudu/master/sys_catalog.cc M src/kudu/master/sys_catalog.h 5 files changed, 228 insertions(+), 124 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/1 -- To view, visit http://gerrit.cloudera.org:8080/6170 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8 Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Alexey Serbin