[kudu-CR] [catalog manager] proper handling of catalog shutdown

2017-03-02 Thread Alexey Serbin (Code Review)
Hello Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#5).

Change subject: [catalog_manager] proper handling of catalog shutdown
..

[catalog_manager] proper handling of catalog shutdown

This changelist introduces the categorization of the system catalog's
read/write operation failures happened while executing the leader
post-election task.  There are two categories of errors: fatal and
non-fatal.

If a read/write operation fails in between terms of the system catalog
leadership, the error is considered non-fatal.  In case of a non-fatal
error the leader post-election task bails out: the catalog is no longer
the leader at the original term and the task should be executed by the
new leader upon the ElectedAsLeaderCb.

If a read/write operation fails happened at the same term of the system
catalog leadership, the error is considered fatal: this causes the
master process to crash.  This is done do avoid any possible
inconsistency when reading and persisting important system information
such as tables/tablets metadata, IPKI certificate authority information
and TSKs (Token Signing Keys).

All read/write operation failures happened during the system catalog's
shutdown are ignored.

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/sys_catalog-test.cc
M src/kudu/master/sys_catalog.cc
5 files changed, 187 insertions(+), 120 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/5
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] [catalog manager] proper handling of catalog shutdown

2017-02-28 Thread David Ribeiro Alves (Code Review)
David Ribeiro Alves has posted comments on this change.

Change subject: [catalog_manager] proper handling of catalog shutdown
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6170/4/src/kudu/master/catalog_manager.h
File src/kudu/master/catalog_manager.h:

PS4, Line 585: bool* error_is_not_leader
I agree with dan. In fact this was part of the motivation for my comments 
yesterday about not crashing, you wouldn't need to have specific handling for 
the "not leader" case, and instead have generic handling for all cases where 
writes to consensus fail.


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] [catalog manager] proper handling of catalog shutdown

2017-02-28 Thread Dan Burkert (Code Review)
Dan Burkert has posted comments on this change.

Change subject: [catalog_manager] proper handling of catalog shutdown
..


Patch Set 4:

(4 comments)

In general I'm not a fan of the additional bool* out params, when the callers 
could just check

resp.error().code() == TabletServerErrorPB::NOT_THE_LEADER

The bool also doesn't seem to be used in non-test code.

http://gerrit.cloudera.org:8080/#/c/6170/4/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 458: {
New scope seems redundant.


Line 459:   bool error_is_not_leader;
What is this boolean doing?  It's passed as an out parameter to 
TryGenerateNewTskUnlocked, and then not used thereafter.


Line 763:   if (s.IsNotFound()) {
This method might be simpler if it's structured as a single if/else chain:

if (s.ok()) {
  return InitCertAuth...
} else if (s.IsNotFound()) {
  ...
} else {
  return s;
}


Line 764: LOG(INFO) << "Did not find CA certificate and key for Kudu IPKI, "
This is going to be pretty chatty, perhaps VLOG?


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] [catalog manager] proper handling of catalog shutdown

2017-02-27 Thread Alexey Serbin (Code Review)
Hello Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#4).

Change subject: [catalog_manager] proper handling of catalog shutdown
..

[catalog_manager] proper handling of catalog shutdown

Re-factored the code to proper handle the system catalog's shutdown.
Besides, updated the logic to properly handle the NOT_THE_LEADER error
for write operations against the system catalog table in cases when
such an error is a recoverable failure.

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/sys_catalog-test.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/master/sys_catalog.h
5 files changed, 245 insertions(+), 128 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/4
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] [catalog manager] proper handling of catalog shutdown

2017-02-27 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] proper handling of catalog shutdown
..


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6170/2/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

PS2, Line 458:  {
 :   bool error_is_not_leader;
 :   const Status& s =
 :   
catalog_manager_->TryGenerateNewTskUnlocked(_is_not_leader);
 :   if (!s.ok()) {
 : LOG(WARNING) << "Failed to add new TSK: " << 
s.ToString();
 : if (!error_is_not_leader) {
 :   // Do not expect any other errors besides 
NOT_THE_LEADER error.
 :   CHECK_OK(s);
 : }
 :   }
 : }
> say that the error is something else (like the machine is partitioned) does
AFAIK, your understanding is correct -- it might happen that checks in the 
ScopedLeaderSharedLock ctor went fine, but then situation might change down the 
road.  For example, the master could lose its leadership role while holding 
that shared lock -- there is no guarantee it will stay a master up to the very 
end of the operation which holds that lock.

I think you are correct -- we shouldn't crash the master server if we can let 
it continue working knowing its operation  in such a state does not bring 
inconsistency into the clustered master.  This is exactly the case: if master 
some error happened and the master cannot write the newly generated TSK into 
the system table, we should not let it run only if it stays the leader master 
AND no any active TSK is left.  I need to figure out how to check that master 
is still a leader (the latter part with TSK it clear to me).  Probably, I can 
use the underlying system_catalog_'s consensus implementation to figure that 
out.

This is a background task which runs on its own -- I'm not sure it's possible 
to implement the behavior you described.  However, I might be missing something.


PS2, Line 751:   if (!(s.ok() || s.IsNotFound())) {
 : return s;
 :   }
> could you leave a comment that we only handle the OK and NotFound cases bel
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] [catalog manager] proper handling of catalog shutdown

2017-02-27 Thread Alexey Serbin (Code Review)
Hello Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#3).

Change subject: [catalog_manager] proper handling of catalog shutdown
..

[catalog_manager] proper handling of catalog shutdown

Re-factored the code to proper handle the system catalog's shutdown.
Besides, updated the logic to properly handle the NOT_THE_LEADER error
for write operations against the system catalog table in cases when
such an error is a recoverable failure.

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/sys_catalog-test.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/master/sys_catalog.h
5 files changed, 243 insertions(+), 128 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/3
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] [catalog manager] proper handling of catalog shutdown

2017-02-27 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] proper handling of catalog shutdown
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6170/2/src/kudu/master/catalog_manager.h
File src/kudu/master/catalog_manager.h:

Line 602:   Status StoreCertAuthorityInfo(const security::PrivateKey& key,
> warning: function 'kudu::master::CatalogManager::StoreCertAuthorityInfo' ha
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] [catalog manager] proper handling of catalog shutdown

2017-02-27 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] proper handling of catalog shutdown
..


Patch Set 2:

> Will defer to Mike and/or David; I bet they have stronger opinions
 > on this than I do.

OK, sure.  Thank you for the fast review!

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] [catalog manager] proper handling of catalog shutdown

2017-02-27 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: [catalog_manager] proper handling of catalog shutdown
..


Patch Set 2: Code-Review+1

Will defer to Mike and/or David; I bet they have stronger opinions on this than 
I do.

-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] [catalog manager] proper handling of catalog shutdown

2017-02-27 Thread Alexey Serbin (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/6170

to look at the new patch set (#2).

Change subject: [catalog_manager] proper handling of catalog shutdown
..

[catalog_manager] proper handling of catalog shutdown

Re-factored the code to proper handle the system catalog's shutdown.
Besides, updated the logic to properly handle the NOT_THE_LEADER error
for write operations against the system catalog table in cases when
such an error is a recoverable failure.

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/sys_catalog-test.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/master/sys_catalog.h
5 files changed, 243 insertions(+), 127 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] [catalog manager] proper handling of catalog shutdown

2017-02-27 Thread Alexey Serbin (Code Review)
Alexey Serbin has posted comments on this change.

Change subject: [catalog_manager] proper handling of catalog shutdown
..


Patch Set 1:

(5 comments)

> (4 comments)
 > 
 > As a general point, it sucks that Status isn't full-featured enough
 > to include this additional bit of information. IIRC Mike proposed a
 > sweeping change to Status handling a couple years ago which would
 > have made it versatile enough to include this case. It would be
 > interesting to hear his comments on this.

Yep, it might be handy to have that.  Probably, having this use-case might help 
to make it happen :)

http://gerrit.cloudera.org:8080/#/c/6170/1/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 204: using std::bind;
> warning: using decl 'bind' is unused [misc-unused-using-decls]
Done


http://gerrit.cloudera.org:8080/#/c/6170/1/src/kudu/master/catalog_manager.h
File src/kudu/master/catalog_manager.h:

Line 20: #include 
> Is this for existing code? Or something new?
Now it's just for the existing code: StartTabletCopy() method's signature has 
std::function<...> as a parameter.


PS1, Line 585: master's
> Nit: to make it clearer that this is a variant of InitCertAuthority(), can 
Done


Line 596:   // Store CA certificate information into the system table.
> Doc what is_not_leader_error means.
Done


http://gerrit.cloudera.org:8080/#/c/6170/1/src/kudu/master/sys_catalog.h
File src/kudu/master/sys_catalog.h:

Line 153:bool* is_not_leader);
> Maybe rename to "error_is_not_leader" or "is_not_leader_error"?
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Alexey Serbin 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] [catalog manager] proper handling of catalog shutdown

2017-02-27 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: [catalog_manager] proper handling of catalog shutdown
..


Patch Set 1:

(4 comments)

As a general point, it sucks that Status isn't full-featured enough to include 
this additional bit of information. IIRC Mike proposed a sweeping change to 
Status handling a couple years ago which would have made it versatile enough to 
include this case. It would be interesting to hear his comments on this.

http://gerrit.cloudera.org:8080/#/c/6170/1/src/kudu/master/catalog_manager.h
File src/kudu/master/catalog_manager.h:

Line 20: #include 
Is this for existing code? Or something new?


PS1, Line 585: master's
Nit: to make it clearer that this is a variant of InitCertAuthority(), can you 
change this to "IPKI"?


Line 596:   // Store CA certificate information into the system table.
Doc what is_not_leader_error means.


http://gerrit.cloudera.org:8080/#/c/6170/1/src/kudu/master/sys_catalog.h
File src/kudu/master/sys_catalog.h:

Line 153:bool* is_not_leader);
Maybe rename to "error_is_not_leader" or "is_not_leader_error"?


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] [catalog manager] proper handling of catalog shutdown

2017-02-27 Thread Alexey Serbin (Code Review)
Alexey Serbin has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/6170

Change subject: [catalog_manager] proper handling of catalog shutdown
..

[catalog_manager] proper handling of catalog shutdown

Re-factored the code to proper handle the system catalog shutdown.
Besides, updated the logic to properly handle the NOT_THE_LEADER error
for write operations against the system catalog table for cases
when such an error is a recoverable failure.

Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/sys_catalog-test.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/master/sys_catalog.h
5 files changed, 228 insertions(+), 124 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/6170/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin