[kudu-CR] KUDU-1516 ksck should check for more raft-related status issues (partial)

2016-07-20 Thread Jean-Daniel Cryans (Code Review)
Jean-Daniel Cryans has posted comments on this change.

Change subject: KUDU-1516 ksck should check for more raft-related status issues 
(partial)
..


Patch Set 4:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/3632/4/src/kudu/tools/ksck.cc
File src/kudu/tools/ksck.cc:

Line 484: 
nit: extra empty line


Line 560: errors.push_back(Substitute("$0 does not have a majority of 
replicas in RUNNING state",
Not sure about that one. I've seen clusters that had more tombstoned tablets 
than live ones, and it's not really a problem.
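
One possible reading of this concern, sketched below under stated assumptions: if the majority is computed only over the replicas in the master's current config, tombstoned leftovers on other tablet servers cannot trip the check. ReplicaState, ReportedReplica, and MajorityOfConfigRunning are hypothetical names for illustration, not the types or functions in ksck.cc.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified replica states; not the enum used by Kudu.
enum class ReplicaState { kRunning, kTombstoned, kOther };

// One replica as reported by a tablet server (hypothetical struct).
struct ReportedReplica {
  std::string ts_uuid;
  ReplicaState state;
};

// Returns true if a strict majority of the replicas in the master's current
// config are RUNNING. Replicas reported by servers outside the config
// (for example tombstoned leftovers) are ignored.
// Assumes at most one reported replica per tablet server.
inline bool MajorityOfConfigRunning(
    const std::set<std::string>& config_uuids,
    const std::vector<ReportedReplica>& reported) {
  std::size_t running = 0;
  for (const ReportedReplica& r : reported) {
    if (config_uuids.count(r.ts_uuid) == 0) continue;  // Not in the config.
    if (r.state == ReplicaState::kRunning) ++running;
  }
  assert(running <= config_uuids.size());
  return running >= config_uuids.size() / 2 + 1;
}
```

With a three-replica config, two RUNNING replicas pass the check no matter how many tombstoned replicas other servers still report.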


http://gerrit.cloudera.org:8080/#/c/3632/4/src/kudu/tools/ksck.h
File src/kudu/tools/ksck.h:

Line 235: 
nit: extra empty line?


-- 
To view, visit http://gerrit.cloudera.org:8080/3632
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iec6590ba52548a9ee11d63269b134320b10809da
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-HasComments: Yes


[kudu-CR] KUDU-1516 ksck should check for more raft-related status issues (partial)

2016-07-20 Thread Kudu Jenkins (Code Review)
Kudu Jenkins has posted comments on this change.

Change subject: KUDU-1516 ksck should check for more raft-related status issues 
(partial)
..


Patch Set 4:

Build Started http://104.196.14.100/job/kudu-gerrit/2574/


Gerrit-MessageType: comment
Gerrit-Change-Id: Iec6590ba52548a9ee11d63269b134320b10809da
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-HasComments: No


[kudu-CR] KUDU-1516 ksck should check for more raft-related status issues (partial)

2016-07-19 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: KUDU-1516 ksck should check for more raft-related status issues 
(partial)
..


Patch Set 2:

Here's some example output on a cluster with a messed up table:
https://gist.github.com/697f2970c4fbaf5f5888b6864d628968

I think there are some more improvements to be made, like distinguishing between 
an under-replicated-but-available tablet and an under-replicated-below-majority 
tablet.
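
The distinction suggested here could be sketched roughly as follows. This is a minimal illustration, assuming the caller already knows how many replicas are in the tablet's config and how many of them are RUNNING on reachable servers. TabletHealth, ClassifyTablet, and HealthToString are hypothetical names, not part of ksck.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical three-way classification of a tablet.
enum class TabletHealth { kHealthy, kUnderReplicated, kUnavailable };

// total_replicas: replicas in the tablet's config.
// running_replicas: those on reachable servers in RUNNING state.
inline TabletHealth ClassifyTablet(std::size_t total_replicas,
                                   std::size_t running_replicas) {
  assert(running_replicas <= total_replicas);
  const std::size_t majority = total_replicas / 2 + 1;
  if (running_replicas == total_replicas) return TabletHealth::kHealthy;
  // Some replicas are missing, but a majority can still make progress.
  if (running_replicas >= majority) return TabletHealth::kUnderReplicated;
  // Below majority: the tablet cannot elect a leader or accept writes.
  return TabletHealth::kUnavailable;
}

inline std::string HealthToString(TabletHealth health) {
  switch (health) {
    case TabletHealth::kHealthy: return "healthy";
    case TabletHealth::kUnderReplicated: return "under-replicated";
    case TabletHealth::kUnavailable: return "unavailable";
  }
  return "unknown";
}
```

For a tablet with three replicas, two RUNNING replicas would be reported as under-replicated but available, while one RUNNING replica would be reported as unavailable.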


Gerrit-MessageType: comment
Gerrit-Change-Id: Iec6590ba52548a9ee11d63269b134320b10809da
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Jean-Daniel Cryans 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-HasComments: No


[kudu-CR] KUDU-1516 ksck should check for more raft-related status issues (partial)

2016-07-19 Thread Will Berkeley (Code Review)
Will Berkeley has posted comments on this change.

Change subject: KUDU-1516 ksck should check for more raft-related status issues 
(partial)
..


Patch Set 1:

> BTW, I tried this on a cluster with a bad table:
 > WARNING: Unable to connect to Tablet Server 5fb7d6c7083943059521e03d6ece2863
 > (10.20.132.112:7050) because Network error: Client connection
 > negotiation failed: client connection to 10.20.132.112:7050:
 > connect: Connection refused (error 111)
 > WARNING: Unable to connect to Tablet Server acd8306f95334ec1bfce8cb30d7ca36d
 > (10.20.126.115:7050) because Network error: Client connection
 > negotiation failed: client connection to 10.20.126.115:7050:
 > connect: Connection refused (error 111)
 > WARNING: Unable to connect to Tablet Server dff78a5acdbb4a47ba2c7a62d1bcc5ee
 > (10.20.132.107:7050) because Network error: Client connection
 > negotiation failed: client connection to 10.20.132.107:7050:
 > connect: Connection refused (error 111)
 > WARNING: Connected to 69 Tablet Servers, 3 weren't reachable
 > WARNING: Tablet 3bf432551c5d4c529616f8e7ce829424 of table
 > 'usertable' does not have a majority of replicas in RUNNING state
 > WARNING: Tablet 2f652871b74b4d0f9bf99e730486a451 of table
 > 'usertable' does not have a majority of replicas in RUNNING state
 > WARNING: Tablet b009973af71842cf99e10d25254b5557 of table
 > 'usertable' does not have a majority of replicas in RUNNING state
 > WARNING: Tablet 71ca44eebda44903868014175e02862a of table
 > 'usertable' does not have a majority of replicas in RUNNING state
 > WARNING: Table usertable has 4 bad tablets
 > INFO: Table IntegrationTestBigLinkedListHeads is HEALTHY
 > INFO: Table IntegrationTestBigLinkedList is HEALTHY
 > WARNING: 1 out of 3 tables are not in a healthy state
 > ==
 > Errors:
 > ==
 > Tablet server aliveness check error: Network error: Not all Tablet
 > Servers are reachable
 > Table consistency check error: Corruption: 1 tables are bad
 > 
 > FAILED
 > 
 > From this output you can see how it would be useful to give
 > slightly more info on why the bad tablets are bad. Let me know if
 > you'll have time to keep working on this - otherwise I might try to
 > take it from where you left off.

Go for it. Unfortunately, I don't think I'll have much time over the next couple 
of weeks.


Gerrit-MessageType: comment
Gerrit-Change-Id: Iec6590ba52548a9ee11d63269b134320b10809da
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Reviewer: Will Berkeley 
Gerrit-HasComments: No


[kudu-CR] KUDU-1516 ksck should check for more raft-related status issues (partial)

2016-07-18 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change.

Change subject: KUDU-1516 ksck should check for more raft-related status issues 
(partial)
..


Patch Set 1:

BTW, I tried this on a cluster with a bad table:
WARNING: Unable to connect to Tablet Server 5fb7d6c7083943059521e03d6ece2863 
(10.20.132.112:7050) because Network error: Client connection negotiation 
failed: client connection to 10.20.132.112:7050: connect: Connection refused 
(error 111)
WARNING: Unable to connect to Tablet Server acd8306f95334ec1bfce8cb30d7ca36d 
(10.20.126.115:7050) because Network error: Client connection negotiation 
failed: client connection to 10.20.126.115:7050: connect: Connection refused 
(error 111)
WARNING: Unable to connect to Tablet Server dff78a5acdbb4a47ba2c7a62d1bcc5ee 
(10.20.132.107:7050) because Network error: Client connection negotiation 
failed: client connection to 10.20.132.107:7050: connect: Connection refused 
(error 111)
WARNING: Connected to 69 Tablet Servers, 3 weren't reachable
WARNING: Tablet 3bf432551c5d4c529616f8e7ce829424 of table 'usertable' does not 
have a majority of replicas in RUNNING state
WARNING: Tablet 2f652871b74b4d0f9bf99e730486a451 of table 'usertable' does not 
have a majority of replicas in RUNNING state
WARNING: Tablet b009973af71842cf99e10d25254b5557 of table 'usertable' does not 
have a majority of replicas in RUNNING state
WARNING: Tablet 71ca44eebda44903868014175e02862a of table 'usertable' does not 
have a majority of replicas in RUNNING state
WARNING: Table usertable has 4 bad tablets
INFO: Table IntegrationTestBigLinkedListHeads is HEALTHY
INFO: Table IntegrationTestBigLinkedList is HEALTHY
WARNING: 1 out of 3 tables are not in a healthy state
==
Errors:
==
Tablet server aliveness check error: Network error: Not all Tablet Servers are 
reachable
Table consistency check error: Corruption: 1 tables are bad

FAILED

From this output you can see how it would be useful to give slightly more info 
on why the bad tablets are bad. Let me know if you'll have time to keep 
working on this - otherwise I might try to take it from where you left off.


Gerrit-MessageType: comment
Gerrit-Change-Id: Iec6590ba52548a9ee11d63269b134320b10809da
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] KUDU-1516 ksck should check for more raft-related status issues (partial)

2016-07-13 Thread Will Berkeley (Code Review)
Will Berkeley has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/3632

Change subject: KUDU-1516 ksck should check for more raft-related status issues 
(partial)
..

KUDU-1516 ksck should check for more raft-related status issues (partial)

This patch improves ksck. The main way it does so is by adding "tablet
server POV" information. ksck now gathers information about tablet
replicas from the tablet servers and cross-references this information
with the master metadata. This adds the following checks:

* each tablet has a majority of replicas on live tablet servers
* if a tablet has a majority of replicas on live tablet
servers, then a majority of its replicas are in RUNNING state
* the assignment of tablets to tablet servers in the master agrees with
the assignment of tablet replicas reported by the tablet servers
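
The three checks listed above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code in this patch: ReplicaState, TabletServerReport, TabletView, TabletCheck, MajoritySize, and CheckTablet are hypothetical names, and the agreement check covers only one direction (every replica the master lists on a reachable server is confirmed by that server).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified replica states; not the enum used by Kudu.
enum class ReplicaState { kRunning, kBootstrapping, kTombstoned, kFailed };

// What one tablet server reports: tablet id -> state of its local replica.
using TabletServerReport = std::map<std::string, ReplicaState>;

struct TabletView {
  std::string tablet_id;
  // UUIDs of the tablet servers that the master believes host a replica.
  std::vector<std::string> master_replica_uuids;
};

struct TabletCheck {
  bool majority_on_live_servers;
  bool majority_running;
  bool master_agrees_with_servers;
};

// Smallest number of replicas that forms a strict majority of n.
inline std::size_t MajoritySize(std::size_t n) { return n / 2 + 1; }

// live_reports holds one entry per reachable tablet server, keyed by UUID.
inline TabletCheck CheckTablet(
    const TabletView& tablet,
    const std::map<std::string, TabletServerReport>& live_reports) {
  std::size_t live = 0, running = 0, confirmed = 0;
  for (const std::string& uuid : tablet.master_replica_uuids) {
    auto server = live_reports.find(uuid);
    if (server == live_reports.end()) continue;  // Server unreachable.
    ++live;
    auto replica = server->second.find(tablet.tablet_id);
    if (replica == server->second.end()) continue;  // Server lacks the replica.
    ++confirmed;
    if (replica->second == ReplicaState::kRunning) ++running;
  }
  assert(running <= confirmed && confirmed <= live);
  const std::size_t majority = MajoritySize(tablet.master_replica_uuids.size());
  TabletCheck result;
  result.majority_on_live_servers = live >= majority;
  result.majority_running = running >= majority;
  // Agreement can only be judged for servers that were reachable.
  result.master_agrees_with_servers = confirmed == live;
  return result;
}
```

Keeping the three results separate, rather than collapsing them into one pass/fail flag, would let the tool say which check a bad tablet failed.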

There's a flag to revert to the old behavior that only uses master
metadata.

This patch does not include other desiderata from KUDU-1516, like
a consensus canary or a write op canary.

I'm planning to add canaries and make more improvements to ksck in
follow-up patches.

Change-Id: Iec6590ba52548a9ee11d63269b134320b10809da
---
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/master/master.proto
M src/kudu/tools/ksck-test.cc
M src/kudu/tools/ksck.cc
M src/kudu/tools/ksck.h
M src/kudu/tools/ksck_remote-test.cc
M src/kudu/tools/ksck_remote.cc
M src/kudu/tools/ksck_remote.h
M src/kudu/tools/kudu-ksck.cc
9 files changed, 191 insertions(+), 26 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/32/3632/1

Gerrit-MessageType: newchange
Gerrit-Change-Id: Iec6590ba52548a9ee11d63269b134320b10809da
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley 


[kudu-CR] KUDU-1516 ksck should check for more raft-related status issues (partial)

2016-07-13 Thread Kudu Jenkins (Code Review)
Kudu Jenkins has posted comments on this change.

Change subject: KUDU-1516 ksck should check for more raft-related status issues 
(partial)
..


Patch Set 1:

Build Started http://104.196.14.100/job/kudu-gerrit/2386/


Gerrit-MessageType: comment
Gerrit-Change-Id: Iec6590ba52548a9ee11d63269b134320b10809da
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: No