[kudu-CR] KUDU-1516 ksck should check for more raft-related status issues (partial)
Jean-Daniel Cryans has posted comments on this change.

Change subject: KUDU-1516 ksck should check for more raft-related status issues (partial)

Patch Set 4:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/3632/4/src/kudu/tools/ksck.cc
File src/kudu/tools/ksck.cc:

Line 484:
nit: extra empty line

Line 560: errors.push_back(Substitute("$0 does not have a majority of replicas in RUNNING state",
Not sure about that one. I've seen clusters that had more tombstoned tablets than live ones, but it's not really a problem.

http://gerrit.cloudera.org:8080/#/c/3632/4/src/kudu/tools/ksck.h
File src/kudu/tools/ksck.h:

Line 235:
nit: extra empty line?

-- 
To view, visit http://gerrit.cloudera.org:8080/3632
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iec6590ba52548a9ee11d63269b134320b10809da
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley
Gerrit-Reviewer: Jean-Daniel Cryans
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy
Gerrit-Reviewer: Todd Lipcon
Gerrit-Reviewer: Will Berkeley
Gerrit-HasComments: Yes
[kudu-CR] KUDU-1516 ksck should check for more raft-related status issues (partial)
Kudu Jenkins has posted comments on this change.

Patch Set 4:

Build Started http://104.196.14.100/job/kudu-gerrit/2574/
[kudu-CR] KUDU-1516 ksck should check for more raft-related status issues (partial)
Todd Lipcon has posted comments on this change.

Patch Set 2:

Here's some example output on a cluster with a messed-up table:

https://gist.github.com/697f2970c4fbaf5f5888b6864d628968

I think there are some more improvements to be made, like distinguishing between an under-replicated-but-available tablet and an under-replicated-below-majority tablet.
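The distinction suggested in this comment can be sketched roughly as follows. This is only an illustration, assuming a tablet is described by its configured replica count and the number of replicas in RUNNING state; the enum and function names are hypothetical and do not come from the patch.

```cpp
#include <cassert>

// Hypothetical classification; these names are NOT part of ksck.
enum class TabletHealth {
  kHealthy,          // every configured replica is RUNNING
  kUnderReplicated,  // some replicas missing, but a majority is RUNNING
  kUnavailable       // fewer than a majority of replicas are RUNNING
};

// Classifies a tablet from its configured replica count and the number of
// replicas currently RUNNING. A strict majority of n is n / 2 + 1.
TabletHealth ClassifyTablet(int num_replicas, int num_running) {
  assert(num_replicas > 0);
  assert(num_running >= 0 && num_running <= num_replicas);
  const int majority = num_replicas / 2 + 1;
  if (num_running == num_replicas) return TabletHealth::kHealthy;
  if (num_running >= majority) return TabletHealth::kUnderReplicated;
  return TabletHealth::kUnavailable;
}
```

With three replicas, two RUNNING replicas would be reported as under-replicated but available, while one RUNNING replica would be reported as unavailable.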
[kudu-CR] KUDU-1516 ksck should check for more raft-related status issues (partial)
Will Berkeley has posted comments on this change.

Patch Set 1:

> BTW, I tried this on a cluster with a bad table:
>
> WARNING: Unable to connect to Tablet Server 5fb7d6c7083943059521e03d6ece2863 (10.20.132.112:7050) because Network error: Client connection negotiation failed: client connection to 10.20.132.112:7050: connect: Connection refused (error 111)
> WARNING: Unable to connect to Tablet Server acd8306f95334ec1bfce8cb30d7ca36d (10.20.126.115:7050) because Network error: Client connection negotiation failed: client connection to 10.20.126.115:7050: connect: Connection refused (error 111)
> WARNING: Unable to connect to Tablet Server dff78a5acdbb4a47ba2c7a62d1bcc5ee (10.20.132.107:7050) because Network error: Client connection negotiation failed: client connection to 10.20.132.107:7050: connect: Connection refused (error 111)
> WARNING: Connected to 69 Tablet Servers, 3 weren't reachable
> WARNING: Tablet 3bf432551c5d4c529616f8e7ce829424 of table 'usertable' does not have a majority of replicas in RUNNING state
> WARNING: Tablet 2f652871b74b4d0f9bf99e730486a451 of table 'usertable' does not have a majority of replicas in RUNNING state
> WARNING: Tablet b009973af71842cf99e10d25254b5557 of table 'usertable' does not have a majority of replicas in RUNNING state
> WARNING: Tablet 71ca44eebda44903868014175e02862a of table 'usertable' does not have a majority of replicas in RUNNING state
> WARNING: Table usertable has 4 bad tablets
> INFO: Table IntegrationTestBigLinkedListHeads is HEALTHY
> INFO: Table IntegrationTestBigLinkedList is HEALTHY
> WARNING: 1 out of 3 tables are not in a healthy state
> ==
> Errors:
> ==
> Tablet server aliveness check error: Network error: Not all Tablet Servers are reachable
> Table consistency check error: Corruption: 1 tables are bad
>
> FAILED
>
> From this output you can see how it would be useful to give slightly more info on why the bad tablets are bad. Let me know if you'll have time to keep working on this - otherwise I might try to take it from where you left off.

Go for it. Unfortunately, I don't think I'll have much time over the next couple of weeks.
[kudu-CR] KUDU-1516 ksck should check for more raft-related status issues (partial)
Todd Lipcon has posted comments on this change.

Patch Set 1:

BTW, I tried this on a cluster with a bad table:

WARNING: Unable to connect to Tablet Server 5fb7d6c7083943059521e03d6ece2863 (10.20.132.112:7050) because Network error: Client connection negotiation failed: client connection to 10.20.132.112:7050: connect: Connection refused (error 111)
WARNING: Unable to connect to Tablet Server acd8306f95334ec1bfce8cb30d7ca36d (10.20.126.115:7050) because Network error: Client connection negotiation failed: client connection to 10.20.126.115:7050: connect: Connection refused (error 111)
WARNING: Unable to connect to Tablet Server dff78a5acdbb4a47ba2c7a62d1bcc5ee (10.20.132.107:7050) because Network error: Client connection negotiation failed: client connection to 10.20.132.107:7050: connect: Connection refused (error 111)
WARNING: Connected to 69 Tablet Servers, 3 weren't reachable
WARNING: Tablet 3bf432551c5d4c529616f8e7ce829424 of table 'usertable' does not have a majority of replicas in RUNNING state
WARNING: Tablet 2f652871b74b4d0f9bf99e730486a451 of table 'usertable' does not have a majority of replicas in RUNNING state
WARNING: Tablet b009973af71842cf99e10d25254b5557 of table 'usertable' does not have a majority of replicas in RUNNING state
WARNING: Tablet 71ca44eebda44903868014175e02862a of table 'usertable' does not have a majority of replicas in RUNNING state
WARNING: Table usertable has 4 bad tablets
INFO: Table IntegrationTestBigLinkedListHeads is HEALTHY
INFO: Table IntegrationTestBigLinkedList is HEALTHY
WARNING: 1 out of 3 tables are not in a healthy state
==
Errors:
==
Tablet server aliveness check error: Network error: Not all Tablet Servers are reachable
Table consistency check error: Corruption: 1 tables are bad

FAILED

From this output you can see how it would be useful to give slightly more info on why the bad tablets are bad. Let me know if you'll have time to keep working on this - otherwise I might try to take it from where you left off.
[kudu-CR] KUDU-1516 ksck should check for more raft-related status issues (partial)
Will Berkeley has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/3632

Change subject: KUDU-1516 ksck should check for more raft-related status issues (partial)

KUDU-1516 ksck should check for more raft-related status issues (partial)

This patch improves ksck. The main way it does so is by adding "tablet server POV" information. ksck now gathers information about tablet replicas from the tablet servers and cross-references this information with the master metadata. This adds the following checks:

* each tablet has a majority of replicas on live tablet servers
* if a tablet has a majority of replicas on live tablet servers, then a majority of its replicas are in RUNNING state
* the assignment of tablets to tablet servers in the master agrees with the assignment of tablet replicas reported by the tablet servers

There's a flag to revert to the old behavior that only uses master metadata.

This patch does not include other desiderata from KUDU-1516, like a consensus canary or a write op canary. I'm planning to add canaries and make more improvements to ksck in follow-up patches.

Change-Id: Iec6590ba52548a9ee11d63269b134320b10809da
---
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/master/master.proto
M src/kudu/tools/ksck-test.cc
M src/kudu/tools/ksck.cc
M src/kudu/tools/ksck.h
M src/kudu/tools/ksck_remote-test.cc
M src/kudu/tools/ksck_remote.cc
M src/kudu/tools/ksck_remote.h
M src/kudu/tools/kudu-ksck.cc
9 files changed, 191 insertions(+), 26 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/32/3632/1
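The three checks listed in the change description above could be expressed along the following lines. This is a minimal sketch under stated assumptions, not the code from the patch: `ReplicaView`, `MajoritySize`, and `CheckTablet` are hypothetical names, and the per-replica fields are a simplified stand-in for whatever ksck actually gathers from the master and the tablet servers. Only the "does not have a majority of replicas in RUNNING state" wording is taken from the thread.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified view of one replica; NOT a type from ksck.cc.
struct ReplicaView {
  std::string ts_uuid;  // tablet server the master assigned the replica to
  bool ts_live;         // true if that tablet server could be reached
  bool reported_by_ts;  // true if the server itself reports hosting the replica
  bool running;         // true if the server reports the replica as RUNNING
};

// Smallest number of replicas that forms a strict majority of n.
int MajoritySize(int n) {
  assert(n > 0);
  return n / 2 + 1;
}

// Returns one message per failed check; an empty result means the tablet passed.
std::vector<std::string> CheckTablet(const std::string& tablet_id,
                                     const std::vector<ReplicaView>& replicas) {
  std::vector<std::string> errors;
  if (replicas.empty()) {
    errors.push_back(tablet_id + " has no replicas in the master metadata");
    return errors;
  }
  const int majority = MajoritySize(static_cast<int>(replicas.size()));
  int live = 0;
  int running = 0;
  for (const ReplicaView& r : replicas) {
    if (!r.ts_live) continue;  // an unreachable server reports nothing
    ++live;
    if (!r.reported_by_ts) {
      // Third check: master and tablet server disagree on the assignment.
      errors.push_back(tablet_id + " is assigned to " + r.ts_uuid +
                       " by the master, but that server does not report it");
      continue;
    }
    if (r.running) ++running;
  }
  if (live < majority) {
    // First check: a majority of replicas must be on live tablet servers.
    errors.push_back(tablet_id +
                     " does not have a majority of replicas on live tablet servers");
  } else if (running < majority) {
    // Second check: only meaningful once a majority of servers is live.
    errors.push_back(tablet_id +
                     " does not have a majority of replicas in RUNNING state");
  }
  return errors;
}
```

In this sketch the RUNNING check is evaluated only when the liveness check passes, mirroring the "if ... then" phrasing of the second bullet, so a tablet whose servers are mostly down yields a single liveness message rather than two.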
[kudu-CR] KUDU-1516 ksck should check for more raft-related status issues (partial)
Kudu Jenkins has posted comments on this change.

Patch Set 1:

Build Started http://104.196.14.100/job/kudu-gerrit/2386/