[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Will Berkeley has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/10061 ) Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags This adds checks to ksck that look for experimental, unsafe, and hidden flags set to non-default values on Kudu masters and tablet servers. If any are found, ksck generates a table summarizing the different flags and their values. For example: Flag |Value| Tags | Master +-+--+--- codegen_dump_functions | true| runtime,experimental | localhost:7052,localhost:7053 min_compression_ratio | 0.80004 | experimental | all 3 server(s) checked safe_time_max_lag_ms | 4 | experimental | localhost:7052 safe_time_max_lag_ms | 5 | experimental | localhost:7053 The table has one row for each unique (flag, value) pair, listing all daemons with --flag=value. So, in the above output, there are two rows for the flag --safe_time_max_lag_ms because it's set to two different values on two masters. This makes it easy to scan for concerning flags and their values. Since the output might not scale to a large number of servers, the CSV of servers is abbreviated, by default, to 3 entries, with the number of additional servers indicated. The number of entries before truncation kicks in is controlled by --truncate_server_csv_length. Additionally, if all checked servers have an unusual --flag=value we call that out specially. For example, the above table reprinted with --truncate_server_csv_length=1 would look like Flag |Value| Tags | Master +-+--+-- codegen_dump_functions | true| runtime,experimental | localhost:7052 and 1 other server(s) min_compression_ratio | 0.80004 | experimental | all 3 server(s) checked safe_time_max_lag_ms | 4 | experimental | localhost:7052 safe_time_max_lag_ms | 5 | experimental | localhost:7053 assuming that there are 3 servers checked in total. Having unusual flags or failing to gather flags isn't considered an error, since it doesn't indicate the cluster is unhealthy (in the latter case because the daemon may not support the GetFlags RPC). Instead, flag checks surface their results in a new warnings section near the end of the ksck output. The new warnings section looks like this in context: == Warnings: == Some masters have unsafe, experimental, or hidden flags set unable to get flag information for tablet server 812db6461bae4f62a651e132f783ab53 (127.0.0.1:7250): could not get status from server: Client connection negotiation failed: client connection to 127.0.0.1:7250: connect: Connection refused (error 61) Some tablet servers have unsafe, experimental, or hidden flags set tserver flag check error: 1 of 3 tservers' flags were not available == Errors: == Network error: error fetching info from tablet servers: failed to gather info for all tablet servers: 1 of 3 had errors FAILED Runtime error: ksck discovered errors Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Reviewed-on: http://gerrit.cloudera.org:8080/10061 Tested-by: Kudu Jenkins Reviewed-by: Andrew Wong Reviewed-by: Attila Bukor --- M src/kudu/tools/ksck-test.cc M src/kudu/tools/ksck.cc M src/kudu/tools/ksck.h M src/kudu/tools/ksck_remote-test.cc M src/kudu/tools/ksck_remote.cc M src/kudu/tools/ksck_remote.h M src/kudu/tools/ksck_results.cc M src/kudu/tools/ksck_results.h 8 files changed, 459 insertions(+), 4 deletions(-) Approvals: Kudu Jenkins: Verified Andrew Wong: Looks good to me, approved Attila Bukor: Looks good to me, but someone else must approve -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 11 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Attila Bukor has posted comments on this change. ( http://gerrit.cloudera.org:8080/10061 ) Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. Patch Set 10: Code-Review+1 (1 comment) http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc File src/kudu/tools/ksck_results.cc: http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc@149 PS7, Line 149: n > It's unsigned in the latest patch set. damn, it seems I still can't use gerrit properly... -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 10 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Sun, 27 May 2018 08:37:46 + Gerrit-HasComments: Yes
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/10061 ) Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. Patch Set 7: (1 comment) http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc File src/kudu/tools/ksck_results.cc: http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc@149 PS7, Line 149: n > Andrew, you +2d, but this flag is still signed, shouldn't it be fixed first It's unsigned in the latest patch set. -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 7 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Sun, 27 May 2018 08:26:03 + Gerrit-HasComments: Yes
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Attila Bukor has posted comments on this change. ( http://gerrit.cloudera.org:8080/10061 ) Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. Patch Set 10: (1 comment) http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc File src/kudu/tools/ksck_results.cc: http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc@149 PS7, Line 149: n > Aggh yeah, sorry I wasn't clear! Should've highlighted the flag. Andrew, you +2d, but this flag is still signed, shouldn't it be fixed first? -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 10 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Sun, 27 May 2018 07:14:02 + Gerrit-HasComments: Yes
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10061 ) Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. Patch Set 10: Code-Review+2 (1 comment) http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc File src/kudu/tools/ksck_results.cc: http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc@149 PS7, Line 149: n > Oh, you were talking about n this whole time? I'll make the truncate server Aggh yeah, sorry I wasn't clear! Should've highlighted the flag. -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 10 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Sun, 27 May 2018 02:55:03 + Gerrit-HasComments: Yes
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Hello Alexey Serbin, Attila Bukor, Kudu Jenkins, Andrew Wong, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10061 to look at the new patch set (#10). Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags This adds checks to ksck that look for experimental, unsafe, and hidden flags set to non-default values on Kudu masters and tablet servers. If any are found, ksck generates a table summarizing the different flags and their values. For example: Flag |Value| Tags | Master +-+--+--- codegen_dump_functions | true| runtime,experimental | localhost:7052,localhost:7053 min_compression_ratio | 0.80004 | experimental | all 3 server(s) checked safe_time_max_lag_ms | 4 | experimental | localhost:7052 safe_time_max_lag_ms | 5 | experimental | localhost:7053 The table has one row for each unique (flag, value) pair, listing all daemons with --flag=value. So, in the above output, there are two rows for the flag --safe_time_max_lag_ms because it's set to two different values on two masters. This makes it easy to scan for concerning flags and their values. Since the output might not scale to a large number of servers, the CSV of servers is abbreviated, by default, to 3 entries, with the number of additional servers indicated. The number of entries before truncation kicks in is controlled by --truncate_server_csv_length. Additionally, if all checked servers have an unusual --flag=value we call that out specially. For example, the above table reprinted with --truncate_server_csv_length=1 would look like Flag |Value| Tags | Master +-+--+-- codegen_dump_functions | true| runtime,experimental | localhost:7052 and 1 other server(s) min_compression_ratio | 0.80004 | experimental | all 3 server(s) checked safe_time_max_lag_ms | 4 | experimental | localhost:7052 safe_time_max_lag_ms | 5 | experimental | localhost:7053 assuming that there are 3 servers checked in total. Having unusual flags or failing to gather flags isn't considered an error, since it doesn't indicate the cluster is unhealthy (in the latter case because the daemon may not support the GetFlags RPC). Instead, flag checks surface their results in a new warnings section near the end of the ksck output. The new warnings section looks like this in context: == Warnings: == Some masters have unsafe, experimental, or hidden flags set unable to get flag information for tablet server 812db6461bae4f62a651e132f783ab53 (127.0.0.1:7250): could not get status from server: Client connection negotiation failed: client connection to 127.0.0.1:7250: connect: Connection refused (error 61) Some tablet servers have unsafe, experimental, or hidden flags set tserver flag check error: 1 of 3 tservers' flags were not available == Errors: == Network error: error fetching info from tablet servers: failed to gather info for all tablet servers: 1 of 3 had errors FAILED Runtime error: ksck discovered errors Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 --- M src/kudu/tools/ksck-test.cc M src/kudu/tools/ksck.cc M src/kudu/tools/ksck.h M src/kudu/tools/ksck_remote-test.cc M src/kudu/tools/ksck_remote.cc M src/kudu/tools/ksck_remote.h M src/kudu/tools/ksck_results.cc M src/kudu/tools/ksck_results.h 8 files changed, 459 insertions(+), 4 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/10061/10 -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 10 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/10061 ) Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. Patch Set 7: (1 comment) http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc File src/kudu/tools/ksck_results.cc: http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc@149 PS7, Line 149: n > I don't think it's redundant. E.g. server.size() - (-1) > 0 is valid, but n Oh, you were talking about n this whole time? I'll make the truncate server csv flag an unsigned int. -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 7 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Sat, 26 May 2018 07:09:31 + Gerrit-HasComments: Yes
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Hello Alexey Serbin, Attila Bukor, Kudu Jenkins, Andrew Wong, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10061 to look at the new patch set (#9). Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags This adds checks to ksck that look for experimental, unsafe, and hidden flags set to non-default values on Kudu masters and tablet servers. If any are found, ksck generates a table summarizing the different flags and their values. For example: Flag |Value| Tags | Master +-+--+--- codegen_dump_functions | true| runtime,experimental | localhost:7052,localhost:7053 min_compression_ratio | 0.80004 | experimental | all 3 server(s) checked safe_time_max_lag_ms | 4 | experimental | localhost:7052 safe_time_max_lag_ms | 5 | experimental | localhost:7053 The table has one row for each unique (flag, value) pair, listing all daemons with --flag=value. So, in the above output, there are two rows for the flag --safe_time_max_lag_ms because it's set to two different values on two masters. This makes it easy to scan for concerning flags and their values. Since the output might not scale to a large number of servers, the CSV of servers is abbreviated, by default, to 3 entries, with the number of additional servers indicated. The number of entries before truncation kicks in is controlled by --truncate_server_csv_length. Additionally, if all checked servers have an unusual --flag=value we call that out specially. For example, the above table reprinted with --truncate_server_csv_length=1 would look like Flag |Value| Tags | Master +-+--+-- codegen_dump_functions | true| runtime,experimental | localhost:7052 and 1 other server(s) min_compression_ratio | 0.80004 | experimental | all 3 server(s) checked safe_time_max_lag_ms | 4 | experimental | localhost:7052 safe_time_max_lag_ms | 5 | experimental | localhost:7053 assuming that there are 3 servers checked in total. Having unusual flags or failing to gather flags isn't considered an error, since it doesn't indicate the cluster is unhealthy (in the latter case because the daemon may not support the GetFlags RPC). Instead, flag checks surface their results in a new warnings section near the end of the ksck output. The new warnings section looks like this in context: == Warnings: == Some masters have unsafe, experimental, or hidden flags set unable to get flag information for tablet server 812db6461bae4f62a651e132f783ab53 (127.0.0.1:7250): could not get status from server: Client connection negotiation failed: client connection to 127.0.0.1:7250: connect: Connection refused (error 61) Some tablet servers have unsafe, experimental, or hidden flags set tserver flag check error: 1 of 3 tservers' flags were not available == Errors: == Network error: error fetching info from tablet servers: failed to gather info for all tablet servers: 1 of 3 had errors FAILED Runtime error: ksck discovered errors Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 --- M src/kudu/tools/ksck-test.cc M src/kudu/tools/ksck.cc M src/kudu/tools/ksck.h M src/kudu/tools/ksck_remote-test.cc M src/kudu/tools/ksck_remote.cc M src/kudu/tools/ksck_remote.h M src/kudu/tools/ksck_results.cc M src/kudu/tools/ksck_results.h 8 files changed, 457 insertions(+), 4 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/10061/9 -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 9 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10061 ) Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. Patch Set 8: (1 comment) http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc File src/kudu/tools/ksck_results.cc: http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc@149 PS7, Line 149: n > Ugh...I must still be tired from yesterday. The block returns if servers.si I don't think it's redundant. E.g. server.size() - (-1) > 0 is valid, but n = -1 shouldn't be valid, no? -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 8 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Fri, 25 May 2018 22:36:41 + Gerrit-HasComments: Yes
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/10061 ) Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. Patch Set 7: (1 comment) http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc File src/kudu/tools/ksck_results.cc: http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc@149 PS7, Line 149: n > Hmm... maybe i'm missing something. Doesn't that block return in the `n >= Ugh...I must still be tired from yesterday. The block returns if servers.size() <= n, so below, server.size() > n i.e. server.size() - n > 0 and a check is redundant. -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 7 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Fri, 25 May 2018 22:29:58 + Gerrit-HasComments: Yes
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10061 ) Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. Patch Set 8: (2 comments) http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_remote-test.cc File src/kudu/tools/ksck_remote-test.cc: http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_remote-test.cc@216 PS7, Line 216: private: : client::KuduSchema schema_; : Random random_; : }; : : TEST_F(RemoteKsckTest, TestClusterOk) { : ASSERT_OK(ksck_->CheckMasterHealth()); > I thought not but actually the test will usually fail if this is replaced w Yep, sgtm. Was just curious since it would also hammer in the point that these other tests _aren't_ necessarily doing a full ksck run. http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc File src/kudu/tools/ksck_results.cc: http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc@149 PS7, Line 149: n > The block L144-146 guarantees that n > server.size() here. Hmm... maybe i'm missing something. Doesn't that block return in the `n >= server.size()` case, and mean that over here, we're guaranteed the opposite? `n < server.size()` -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 8 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Fri, 25 May 2018 22:22:45 + Gerrit-HasComments: Yes
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/10061 ) Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. Patch Set 7: (7 comments) http://gerrit.cloudera.org:8080/#/c/10061/7//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/10061/7//COMMIT_MSG@37 PS7, Line 37: localhost:7052 and 1 other server(s) > Shouldn't this be not truncated if `--truncate_server_csv_length=2`? Whoops, meant = 1. http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck.cc File src/kudu/tools/ksck.cc: http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck.cc@202 PS7, Line 202:}); : : sh > nit: extra line here Well, it's actually intentional, but if you don't like it that much... http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck.cc@349 PS7, Line 349:if (FLAGS_consensus) { : return ts->FetchConsensusState(&health); : } : return Status::OK(); : }); > nit: revert here too Done http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck_remote-test.cc File src/kudu/tools/ksck_remote-test.cc: http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck_remote-test.cc@95 PS5, Line 95: > I agree with Attila, it'd be weird for programmers and code reviewers in th Mmm I see what you guys mean, it works because we are using an internal minicluster. Sorry I was thick and didn't figure that out before. http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_remote-test.cc File src/kudu/tools/ksck_remote-test.cc: http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_remote-test.cc@216 PS7, Line 216: ASSERT_OK(ksck_->CheckMasterHealth()); : ASSERT_OK(ksck_->CheckMasterUnusualFlags()); : ASSERT_OK(ksck_->CheckMasterConsensus()); : ASSERT_OK(ksck_->CheckClusterRunning()); : ASSERT_OK(ksck_->FetchTableAndTabletInfo()); : ASSERT_OK(ksck_->FetchInfoFromTabletServers()); : ASSERT_OK(ksck_->CheckTabletServerUnusualFlags()); > nit: any reason to not use ksck_->Run() here? No checksumming? I thought not but actually the test will usually fail if this is replaced with ksck_->Run(). The reason is that calling CheckTablesConsistency will frequently find at least one tablet in the state where it has a leader but the leader hasn't asserted its leadership yet, so the other nodes think there's no leader. It's easy to solve by just inserting some rows at the start of the test, and we ought to actually do that in each of these tests as its the simplest way to wait until the tables are fully ready. Then we can use ksck_->RunAndPrintResults() everywhere. Doing that involves changing every test and doing a workaround for some checksum tests as they insert N rows and expect to checksum N rows, so if we insert rows in the test fixture we need to expect those in the checksum tests. Is it ok if I do a quick followup for that stuff after this patch? http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc File src/kudu/tools/ksck_results.cc: http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc@149 PS7, Line 149: n > Do you think it's worth validating this is positive? The block L144-146 guarantees that n > server.size() here. http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc@374 PS7, Line 374: const string& name = flag.first.first; : const string& value = flag.first.second; : flags_table.AddRow({name, : value, : FindOrDie(flag_tags_map, name), : ServerCsv(num_servers, flag.second)}); : } > nit: spacing Done -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 7 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Fri, 25 May 2018 21:36:11 + Gerrit-HasComments: Yes
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Hello Alexey Serbin, Attila Bukor, Kudu Jenkins, Andrew Wong, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10061 to look at the new patch set (#8). Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags This adds checks to ksck that look for experimental, unsafe, and hidden flags set to non-default values on Kudu masters and tablet servers. If any are found, ksck generates a table summarizing the different flags and their values. For example: Flag |Value| Tags | Master +-+--+--- codegen_dump_functions | true| runtime,experimental | localhost:7052,localhost:7053 min_compression_ratio | 0.80004 | experimental | all 3 server(s) checked safe_time_max_lag_ms | 4 | experimental | localhost:7052 safe_time_max_lag_ms | 5 | experimental | localhost:7053 The table has one row for each unique (flag, value) pair, listing all daemons with --flag=value. So, in the above output, there are two rows for the flag --safe_time_max_lag_ms because it's set to two different values on two masters. This makes it easy to scan for concerning flags and their values. Since the output might not scale to a large number of servers, the CSV of servers is abbreviated, by default, to 3 entries, with the number of additional servers indicated. The number of entries before truncation kicks in is controlled by --truncate_server_csv_length. Additionally, if all checked servers have an unusual --flag=value we call that out specially. For example, the above table reprinted with --truncate_server_csv_length=1 would look like Flag |Value| Tags | Master +-+--+-- codegen_dump_functions | true| runtime,experimental | localhost:7052 and 1 other server(s) min_compression_ratio | 0.80004 | experimental | all 3 server(s) checked safe_time_max_lag_ms | 4 | experimental | localhost:7052 safe_time_max_lag_ms | 5 | experimental | localhost:7053 assuming that there are 3 servers checked in total. Having unusual flags or failing to gather flags isn't considered an error, since it doesn't indicate the cluster is unhealthy (in the latter case because the daemon may not support the GetFlags RPC). Instead, flag checks surface their results in a new warnings section near the end of the ksck output. The new warnings section looks like this in context: == Warnings: == Some masters have unsafe, experimental, or hidden flags set unable to get flag information for tablet server 812db6461bae4f62a651e132f783ab53 (127.0.0.1:7250): could not get status from server: Client connection negotiation failed: client connection to 127.0.0.1:7250: connect: Connection refused (error 61) Some tablet servers have unsafe, experimental, or hidden flags set tserver flag check error: 1 of 3 tservers' flags were not available == Errors: == Network error: error fetching info from tablet servers: failed to gather info for all tablet servers: 1 of 3 had errors FAILED Runtime error: ksck discovered errors Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 --- M src/kudu/tools/ksck-test.cc M src/kudu/tools/ksck.cc M src/kudu/tools/ksck.h M src/kudu/tools/ksck_remote-test.cc M src/kudu/tools/ksck_remote.cc M src/kudu/tools/ksck_remote.h M src/kudu/tools/ksck_results.cc M src/kudu/tools/ksck_results.h 8 files changed, 457 insertions(+), 4 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/10061/8 -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 8 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10061 ) Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. Patch Set 7: (1 comment) Sorry for the extra churn http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc File src/kudu/tools/ksck_results.cc: http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc@374 PS7, Line 374: const string& name = flag.first.first; : const string& value = flag.first.second; : flags_table.AddRow({name, : value, : FindOrDie(flag_tags_map, name), : ServerCsv(num_servers, flag.second)}); : } nit: spacing -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 7 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Fri, 25 May 2018 20:48:15 + Gerrit-HasComments: Yes
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10061 ) Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. Patch Set 7: (6 comments) http://gerrit.cloudera.org:8080/#/c/10061/7//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/10061/7//COMMIT_MSG@37 PS7, Line 37: localhost:7052 and 1 other server(s) Shouldn't this be not truncated if `--truncate_server_csv_length=2`? http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck.cc File src/kudu/tools/ksck.cc: http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck.cc@202 PS7, Line 202:}); : : sh nit: extra line here http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck.cc@349 PS7, Line 349:if (FLAGS_consensus) { : return ts->FetchConsensusState(&health); : } : return Status::OK(); : }); nit: revert here too http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck_remote-test.cc File src/kudu/tools/ksck_remote-test.cc: http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck_remote-test.cc@95 PS5, Line 95: > I meant adding this flag only in the test context, but I'm not sure if it's I agree with Attila, it'd be weird for programmers and code reviewers in the future to have to spend cycles on an update to this test for an update of the flag. http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_remote-test.cc File src/kudu/tools/ksck_remote-test.cc: http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_remote-test.cc@216 PS7, Line 216: ASSERT_OK(ksck_->CheckMasterHealth()); : ASSERT_OK(ksck_->CheckMasterUnusualFlags()); : ASSERT_OK(ksck_->CheckMasterConsensus()); : ASSERT_OK(ksck_->CheckClusterRunning()); : ASSERT_OK(ksck_->FetchTableAndTabletInfo()); : ASSERT_OK(ksck_->FetchInfoFromTabletServers()); : ASSERT_OK(ksck_->CheckTabletServerUnusualFlags()); nit: any reason to not use ksck_->Run() here? No checksumming? http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc File src/kudu/tools/ksck_results.cc: http://gerrit.cloudera.org:8080/#/c/10061/7/src/kudu/tools/ksck_results.cc@149 PS7, Line 149: n Do you think it's worth validating this is positive? -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 7 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Fri, 25 May 2018 20:16:41 + Gerrit-HasComments: Yes
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Attila Bukor has posted comments on this change. ( http://gerrit.cloudera.org:8080/10061 ) Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. Patch Set 7: Code-Review+1 (2 comments) http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck.cc File src/kudu/tools/ksck.cc: http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck.cc@369 PS5, Line 369: // Failing to gather flags is only a warning. > Because the first and most important part is filling out the summary. The f makes sense, thanks http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck_remote-test.cc File src/kudu/tools/ksck_remote-test.cc: http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck_remote-test.cc@95 PS5, Line 95: > If the flag moves to stable we pick a new flag. I don't think it's worth ad I meant adding this flag only in the test context, but I'm not sure if it's possible to do and changing the flag used for this test if safe_time_max_lag_ms ever gets stable sounds good enough. The comment seems clear enough that whoever breaks this test by graduating this flag should know what to do. -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 7 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Fri, 25 May 2018 19:57:55 + Gerrit-HasComments: Yes
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Hello Alexey Serbin, Attila Bukor, Kudu Jenkins, Andrew Wong, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10061 to look at the new patch set (#7). Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags This adds checks to ksck that look for experimental, unsafe, and hidden flags set to non-default values on Kudu masters and tablet servers. If any are found, ksck generates a table summarizing the different flags and their values. For example: Flag |Value| Tags | Master +-+--+--- codegen_dump_functions | true| runtime,experimental | localhost:7052,localhost:7053 min_compression_ratio | 0.80004 | experimental | all 3 server(s) checked safe_time_max_lag_ms | 4 | experimental | localhost:7052 safe_time_max_lag_ms | 5 | experimental | localhost:7053 The table has one row for each unique (flag, value) pair, listing all daemons with --flag=value. So, in the above output, there are two rows for the flag --safe_time_max_lag_ms because it's set to two different values on two masters. This makes it easy to scan for concerning flags and their values. Since the output might not scale to a large number of servers, the CSV of servers is abbreviated, by default, to 3 entries, with the number of additional servers indicated. The number of entries before truncation kicks in is controlled by --truncate_server_csv_length. Additionally, if all checked servers have an unusual --flag=value we call that out specially. For example, the above table reprinted with --truncate_server_csv_length=2 would look like Flag |Value| Tags | Master +-+--+-- codegen_dump_functions | true| runtime,experimental | localhost:7052 and 1 other server(s) min_compression_ratio | 0.80004 | experimental | all 3 server(s) checked safe_time_max_lag_ms | 4 | experimental | localhost:7052 safe_time_max_lag_ms | 5 | experimental | localhost:7053 assuming that there are 3 servers checked in total. Having unusual flags or failing to gather flags isn't considered an error, since it doesn't indicate the cluster is unhealthy (in the latter case because the daemon may not support the GetFlags RPC). Instead, flag checks surface their results in a new warnings section near the end of the ksck output. The new warnings section looks like this in context: == Warnings: == Some masters have unsafe, experimental, or hidden flags set unable to get flag information for tablet server 812db6461bae4f62a651e132f783ab53 (127.0.0.1:7250): could not get status from server: Client connection negotiation failed: client connection to 127.0.0.1:7250: connect: Connection refused (error 61) Some tablet servers have unsafe, experimental, or hidden flags set tserver flag check error: 1 of 3 tservers' flags were not available == Errors: == Network error: error fetching info from tablet servers: failed to gather info for all tablet servers: 1 of 3 had errors FAILED Runtime error: ksck discovered errors Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 --- M src/kudu/tools/ksck-test.cc M src/kudu/tools/ksck.cc M src/kudu/tools/ksck.h M src/kudu/tools/ksck_remote-test.cc M src/kudu/tools/ksck_remote.cc M src/kudu/tools/ksck_remote.h M src/kudu/tools/ksck_results.cc M src/kudu/tools/ksck_results.h 8 files changed, 458 insertions(+), 9 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/10061/7 -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 7 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/10061 ) Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. Patch Set 6: (8 comments) http://gerrit.cloudera.org:8080/#/c/10061/5//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/10061/5//COMMIT_MSG@30 PS5, Line 30: rolled by --truncate_server_csv_length. : Additionally, if all ch > nit: mind including this in the above snippet too? The two are far away from each other in the ksck output so I made a second snippet. http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck.cc File src/kudu/tools/ksck.cc: http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck.cc@200 PS5, Line 200: tatus s = master->FetchInfo().AndThen([&]() { : return master->FetchConsensusState(); : }); : > nit: revert formatting? Done http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck.cc@369 PS5, Line 369: // Failing to gather flags is only a warning. > why isn't this part simply above the lock instead of moving the lock to a n Because the first and most important part is filling out the summary. The flags is ordered after as it's a secondary concern. If the lock part is below the flags part, it looks like the flags part has something to do with the locking, which it doesn't. http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck_remote-test.cc File src/kudu/tools/ksck_remote-test.cc: http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck_remote-test.cc@94 PS4, Line 94: InternalMiniClusterOptions opts; : > Oh huh. I thought IMCs reflected all in-process updates of flags. Those don My bad. You are right. We will see the flag change. It may or may not affect the behavior of the server, and that's irrelevant here, which is why I made the mistake of thinking we should set it before start. http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck_remote-test.cc@229 PS4, Line 229: ASSERT_OK(ksck_->CheckMasterHealth()); : ASSERT_OK(ksck_->CheckMasterUnusualFlags()); : ASSERT_OK(ksck_->CheckClusterRunning()); : ASSERT_OK(ksck_->FetchTableAndTabletInfo()); : ASSERT_OK(ksck_->FetchInfoFromTabletServers()); : ASSERT_OK(ksck_->CheckTabletServerUnusualFlags()); : ASSERT_OK(ksck_->PrintResults()); : > Ah, I would've thought CheckMasterUnusualFlags() and CheckTabletServerUnusu Done http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck_remote-test.cc File src/kudu/tools/ksck_remote-test.cc: http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck_remote-test.cc@95 PS5, Line 95: > can we define a test flag instead? moving this flag to stable would break t If the flag moves to stable we pick a new flag. I don't think it's worth adding a new flag to the server just for these tests. http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck_results.cc File src/kudu/tools/ksck_results.cc: http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck_results.cc@138 PS5, Line 138: DCHECK_LE(servers.size(), server_count); > flags are usually defined the same way on each server, especially when a cl Good idea! http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck_results.cc@267 PS5, Line 267: > Since these are warnings, let's just print `s.message()`. That way we won't Done -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 6 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Fri, 25 May 2018 19:00:22 + Gerrit-HasComments: Yes
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Hello Alexey Serbin, Attila Bukor, Kudu Jenkins, Andrew Wong, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10061 to look at the new patch set (#6). Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags This adds checks to ksck that look for experimental, unsafe, and hidden flags set to non-default values on Kudu masters and tablet servers. If any are found, ksck generates a table summarizing the different flags and their values. For example: Flag |Value| Tags | Master +-+--+--- codegen_dump_functions | true| runtime,experimental | localhost:7052,localhost:7053 min_compression_ratio | 0.80004 | experimental | all 3 server(s) checked safe_time_max_lag_ms | 4 | experimental | localhost:7052 safe_time_max_lag_ms | 5 | experimental | localhost:7053 The table has one row for each unique (flag, value) pair, listing all daemons with --flag=value. So, in the above output, there are two rows for the flag --safe_time_max_lag_ms because it's set to two different values on two masters. This makes it easy to scan for concerning flags and their values. Since the output might not scale to a large number of servers, the CSV of servers is abbreviated, by default, to 3 entries, with the number of additional servers indicated. The number of entries before truncation kicks in is controlled by --truncate_server_csv_length. Additionally, if all checked servers have an unusual --flag=value we call that out specially. For example, the above table reprinted with --truncate_server_csv_length=2 would look like Flag |Value| Tags | Master +-+--+-- codegen_dump_functions | true| runtime,experimental | localhost:7052 and 1 other server(s) min_compression_ratio | 0.80004 | experimental | all 3 server(s) checked safe_time_max_lag_ms | 4 | experimental | localhost:7052 safe_time_max_lag_ms | 5 | experimental | localhost:7053 assuming that there are 3 servers checked in total. Having unusual flags or failing to gather flags isn't considered an error, since it doesn't indicate the cluster is unhealthy (in the latter case because the daemon may not support the GetFlags RPC). Instead, flag checks surface their results in a new warnings section near the end of the ksck output. The new warnings section looks like this in context: == Warnings: == Some masters have unsafe, experimental, or hidden flags set unable to get flag information for tablet server 812db6461bae4f62a651e132f783ab53 (127.0.0.1:7250): could not get status from server: Client connection negotiation failed: client connection to 127.0.0.1:7250: connect: Connection refused (error 61) Some tablet servers have unsafe, experimental, or hidden flags set tserver flag check error: 1 of 3 tservers' flags were not available == Errors: == Network error: error fetching info from tablet servers: failed to gather info for all tablet servers: 1 of 3 had errors FAILED Runtime error: ksck discovered errors Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 --- M src/kudu/tools/ksck-test.cc M src/kudu/tools/ksck.cc M src/kudu/tools/ksck.h M src/kudu/tools/ksck_remote-test.cc M src/kudu/tools/ksck_remote.cc M src/kudu/tools/ksck_remote.h M src/kudu/tools/ksck_results.cc M src/kudu/tools/ksck_results.h 8 files changed, 457 insertions(+), 9 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/10061/6 -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 6 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10061 ) Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. Patch Set 5: (5 comments) http://gerrit.cloudera.org:8080/#/c/10061/5//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/10061/5//COMMIT_MSG@30 PS5, Line 30: in a new warnings section near the : end of the ksck output. nit: mind including this in the above snippet too? http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck.cc File src/kudu/tools/ksck.cc: http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck.cc@200 PS5, Line 200: tatus s = master->FetchInfo() : .AndThen([&]() { : return master->FetchConsensusState(); : }); nit: revert formatting? http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck_remote-test.cc File src/kudu/tools/ksck_remote-test.cc: http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck_remote-test.cc@94 PS4, Line 94: // Add a harmless experimental flag so we can detect it with ksck. : FLAGS_safe_time_max_lag_ms = 6; > We should set it here in SetUp before we start the minicluster. I will make Oh huh. I thought IMCs reflected all in-process updates of flags. Those don't show up in the flags endpoint? http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck_remote-test.cc@229 PS4, Line 229: TEST_F(RemoteKsckTest, TestCheckUnusualFlags) { : ASSERT_OK(ksck_->CheckMasterHealth()); : ASSERT_OK(ksck_->CheckMasterUnusualFlags()); : ASSERT_OK(ksck_->CheckClusterRunning()); : ASSERT_OK(ksck_->FetchTableAndTabletInfo()); : ASSERT_OK(ksck_->FetchInfoFromTabletServers()); : ASSERT_OK(ksck_->CheckTabletServerUnusualFlags()); : ASSERT_OK(ksck_->PrintResults()); > Well there was an extra CheckMasterHealth in there, and CheckMasterCons Ah, I would've thought CheckMasterUnusualFlags() and CheckTabletServerUnusualFlags() were standalone. If that's not the case, mind documenting the prerequisites to running them? http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck_results.cc File src/kudu/tools/ksck_results.cc: http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck_results.cc@267 PS5, Line 267: s.ToString() Since these are warnings, let's just print `s.message()`. That way we won't confuse users by presenting an error type. -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 5 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Fri, 25 May 2018 16:08:03 + Gerrit-HasComments: Yes
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Attila Bukor has posted comments on this change. ( http://gerrit.cloudera.org:8080/10061 ) Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. Patch Set 5: (3 comments) http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck.cc File src/kudu/tools/ksck.cc: http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck.cc@369 PS5, Line 369: // Fetch the flags information. why isn't this part simply above the lock instead of moving the lock to a new block? http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck_remote-test.cc File src/kudu/tools/ksck_remote-test.cc: http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck_remote-test.cc@95 PS5, Line 95: FLAGS_safe_time_max_lag_ms = 6; can we define a test flag instead? moving this flag to stable would break these tests http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck_results.cc File src/kudu/tools/ksck_results.cc: http://gerrit.cloudera.org:8080/#/c/10061/5/src/kudu/tools/ksck_results.cc@138 PS5, Line 138: return JoinStrings(first_n, ", "); flags are usually defined the same way on each server, especially when a cluster management tool like Cloudera Manager is used, so we should check if all servers are affected and print something like "ALL" instead of listing them. -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 5 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Fri, 25 May 2018 15:57:20 + Gerrit-HasComments: Yes
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Hello Alexey Serbin, Attila Bukor, Kudu Jenkins, Andrew Wong, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10061 to look at the new patch set (#5). Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags This adds checks to ksck that look for experimental, unsafe, and hidden flags set to non-default values on Kudu masters and tablet servers. If any are found, ksck generates a table summarizing the different flags and their values. For example: Flag |Value| Tags | Master +-+--+-- codegen_dump_functions | true| runtime,experimental | localhost:7052,localhost:7053 min_compression_ratio | 0.80004 | experimental | localhost:7052,localhost:7051,localhost:7053 safe_time_max_lag_ms | 4 | experimental | localhost:7052 safe_time_max_lag_ms | 5 | experimental | localhost:7053 The table has one row for each unique (flag, value) pair, listing all daemons with --flag=value. So, in the above output, there are two rows for the flag --safe_time_max_lag_ms because it's set to two different values on two masters. This makes it easy to scan for concerning flags and their values. Having unusual flags or failing to gather flags isn't considered an error, since it doesn't indicate the cluster is unhealthy (in the latter case because the daemon may not support the GetFlags RPC). Instead, flag checks surface their results in a new warnings section near the end of the ksck output. Finally, since the output might not scale to a large number of servers, the CSV of servers is abbreviated, by default, to 3 entries, with the number of additional servers indicated. The number of entries before truncation kicks in is controlled by --truncate_server_csv_length. Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 --- M src/kudu/tools/ksck-test.cc M src/kudu/tools/ksck.cc M src/kudu/tools/ksck.h M src/kudu/tools/ksck_remote-test.cc M src/kudu/tools/ksck_remote.cc M src/kudu/tools/ksck_remote.h M src/kudu/tools/ksck_results.cc M src/kudu/tools/ksck_results.h 8 files changed, 444 insertions(+), 12 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/10061/5 -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 5 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/10061 ) Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. Patch Set 4: (9 comments) http://gerrit.cloudera.org:8080/#/c/10061/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/10061/1//COMMIT_MSG@22 PS1, Line 22: are two rows : for the > Mmm I had the same thought which inspired the flag to turn off the flags ch Done http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck.h File src/kudu/tools/ksck.h: http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck.h@356 PS4, Line 356: state_, KsckFetchState::UNINITIALIZED > nit: reverse the order of these here and L252 Done http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck.h@356 PS4, Line 356: CHECK_NE(state_, KsckFetchState::UNINITIALIZED); > Does state_ reflect all of FetchUnusualFlags, FetchConsensusState, and Fetc state_ should reflect the tripartite state of fetching the core info from the server, so we shouldn't touch it in the flags stuff since it's extra. Just for sanity checks I can add another state for flags fetching. http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck.h@485 PS4, Line 485: Check for "unusual" flags on masters. : // "Unusual" flags are ones tagged hidden, experimental, or unsafe. > Note only non-default unusual flags. I don't think that's mentioned anywher Done http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck.cc File src/kudu/tools/ksck.cc: http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck.cc@298 PS4, Line 298: "Some masters have unsafe, experimental, or hidden flags set")); > nit: spacing Done http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck.cc@302 PS4, Line 302: return Status::Incomplete( > nit: spacing Done http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck.cc@436 PS4, Line 436: "Some tablet servers have unsafe, experimental, or hidden flags set")); > nit: spacing Done http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck_remote-test.cc File src/kudu/tools/ksck_remote-test.cc: http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck_remote-test.cc@94 PS4, Line 94: // Add a harmless experimental flag so we can detect it with ksck. : FLAGS_safe_time_max_lag_ms = 6; > Can we put this down in TestCheckUnusualFlags? It's kind of weird seeing L2 We should set it here in SetUp before we start the minicluster. I will make a note that we set an unusual flag in the test itself. I could also subclass RemoteKsckTest but that feels like overkill. http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck_remote-test.cc@229 PS4, Line 229: ASSERT_OK(ksck_->CheckMasterHealth()); : ASSERT_OK(ksck_->CheckMasterHealth()); : ASSERT_OK(ksck_->CheckMasterUnusualFlags()); : ASSERT_OK(ksck_->CheckMasterConsensus()); : ASSERT_OK(ksck_->CheckClusterRunning()); : ASSERT_OK(ksck_->FetchTableAndTabletInfo()); : ASSERT_OK(ksck_->FetchInfoFromTabletServers()); : ASSERT_OK(ksck_->CheckTabletServerUnusualFlags()); > Must all of these be run to pass this test? Well there was an extra CheckMasterHealth in there, and CheckMasterConsensus isn't needed, but otherwise yes. CheckMasterHealth gathers the info. CheckMasterUnusualFlags does the master flag check The next 3 do the info gathering for tservers. CheckTabletServerUnusualFlags does the tserver flag check. -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 4 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Fri, 25 May 2018 07:19:46 + Gerrit-HasComments: Yes
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10061 ) Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. Patch Set 4: (8 comments) http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck.h File src/kudu/tools/ksck.h: http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck.h@356 PS4, Line 356: CHECK_NE(state_, KsckFetchState::UNINITIALIZED); Does state_ reflect all of FetchUnusualFlags, FetchConsensusState, and FetchInfo? It's probably worth documenting its semantics, e.g. if FetchInfo() succeeds but FetchUnusualFlags() hasn't been called, is it still valid to call flags()? Is that an important point to make? Since we don't set the state in FetchUnusualFlags(), it seems a little weird to have this check here. http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck.h@356 PS4, Line 356: state_, KsckFetchState::UNINITIALIZED nit: reverse the order of these here and L252 http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck.h@485 PS4, Line 485: Check for "unusual" flags on masters. : // "Unusual" flags are ones tagged hidden, experimental, or unsafe. Note only non-default unusual flags. I don't think that's mentioned anywhere except the commit message. http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck.cc File src/kudu/tools/ksck.cc: http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck.cc@298 PS4, Line 298: "Some masters have unsafe, experimental, or hidden flags set")); nit: spacing http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck.cc@302 PS4, Line 302: return Status::Incomplete( nit: spacing http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck.cc@436 PS4, Line 436: "Some tablet servers have unsafe, experimental, or hidden flags set")); nit: spacing http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck_remote-test.cc File src/kudu/tools/ksck_remote-test.cc: http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck_remote-test.cc@94 PS4, Line 94: // Add a harmless experimental flag so we can detect it with ksck. : FLAGS_safe_time_max_lag_ms = 6; Can we put this down in TestCheckUnusualFlags? It's kind of weird seeing L244 without seeing this. http://gerrit.cloudera.org:8080/#/c/10061/4/src/kudu/tools/ksck_remote-test.cc@229 PS4, Line 229: ASSERT_OK(ksck_->CheckMasterHealth()); : ASSERT_OK(ksck_->CheckMasterHealth()); : ASSERT_OK(ksck_->CheckMasterUnusualFlags()); : ASSERT_OK(ksck_->CheckMasterConsensus()); : ASSERT_OK(ksck_->CheckClusterRunning()); : ASSERT_OK(ksck_->FetchTableAndTabletInfo()); : ASSERT_OK(ksck_->FetchInfoFromTabletServers()); : ASSERT_OK(ksck_->CheckTabletServerUnusualFlags()); Must all of these be run to pass this test? -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 4 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Alexey Serbin Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: Attila Bukor Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Fri, 25 May 2018 02:49:24 + Gerrit-HasComments: Yes
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Hello Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10061 to look at the new patch set (#4). Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags This adds checks to ksck that look for experimental, unsafe, and hidden flags set to non-default values on Kudu masters and tablet servers. If any are found, ksck generates a table summarizing the different flags and their values. For example: Flag |Value| Tags | Master +-+--+-- codegen_dump_functions | true| runtime,experimental | localhost:7052,localhost:7053 min_compression_ratio | 0.80004 | experimental | localhost:7052,localhost:7051,localhost:7053 safe_time_max_lag_ms | 4 | experimental | localhost:7052 safe_time_max_lag_ms | 5 | experimental | localhost:7053 The table has one row for each unique (flag, value) pair, listing all daemons with --flag=value. So, in the above output, there are two rows for the flag --safe_time_max_lag_ms because it's set to two different values on two masters. This makes it easy to scan for concerning flags and their values. Having unusual flags or failing to gather flags isn't considered an error, since it doesn't indicate the cluster is unhealthy (in the latter case because the daemon may not support the GetFlags RPC). Instead, flag checks surface their results in a new warnings section near the end of the ksck output. Finally, since the output might not scale to a large number of servers, the CSV of servers is abbreviated, by default, to 3 entries, with the number of additional servers indicated. The number of entries before truncation kicks in is controlled by --truncate_server_csv_length. Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 --- M src/kudu/tools/ksck-test.cc M src/kudu/tools/ksck.cc M src/kudu/tools/ksck.h M src/kudu/tools/ksck_remote-test.cc M src/kudu/tools/ksck_remote.cc M src/kudu/tools/ksck_remote.h M src/kudu/tools/ksck_results.cc M src/kudu/tools/ksck_results.h 8 files changed, 396 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/10061/4 -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 4 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Hello Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10061 to look at the new patch set (#3). Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags This adds checks to ksck that look for experimental, unsafe, and hidden flags set to non-default values on Kudu masters and tablet servers. If any are found, ksck generates a table summarizing the different flags and their values. For example: WARNING: Some masters have hidden, experimental, or unsafe flags set: Flag |Value| Tags | Master +-+--+-- codegen_dump_functions | true| runtime,experimental | localhost:7052,localhost:7053 min_compression_ratio | 0.80004 | experimental | localhost:7052,localhost:7051,localhost:7053 safe_time_max_lag_ms | 4 | experimental | localhost:7052 safe_time_max_lag_ms | 5 | experimental | localhost:7053 The table has one row for each unique (flag, value) pair, listing all daemons with --flag=value. So, in the above output, there are two rows for the flag --safe_time_max_lag_ms because it's set to two different values on two masters. This makes it easy to scan for concerning flags and their values. A new flag --check_unusual_flags controls whether flag checks are done. It default to true. Flag checks may be worth turning off if a cluster has a large amount of tablet servers with experimental flags, or if a newer kudu tool is run against a cluster without GetFlags RPC support, since in these cases the flag check output will be noisy. Additionally, since the output might not scale to a large number of servers, the CSV of servers is abbreviated, by default, to 3 entries, with the number of additional servers indicated. The number of entries before truncation kicks in is controlled by --truncate_server_csv_length. Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 --- M src/kudu/integration-tests/cluster_verifier.cc M src/kudu/tools/ksck-test.cc M src/kudu/tools/ksck.cc M src/kudu/tools/ksck.h M src/kudu/tools/ksck_remote-test.cc M src/kudu/tools/ksck_remote.cc M src/kudu/tools/ksck_remote.h M src/kudu/tools/tool_action_cluster.cc 8 files changed, 346 insertions(+), 2 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/10061/3 -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 3 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Hello Kudu Jenkins, Todd Lipcon, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/10061 to look at the new patch set (#2). Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags This adds checks to ksck that look for experimental, unsafe, and hidden flags set to non-default values on Kudu masters and tablet servers. If any are found, ksck generates a table summarizing the different flags and their values. For example: WARNING: Some masters have hidden, experimental, or unsafe flags set: Flag |Value| Tags | Master +-+--+-- codegen_dump_functions | true| runtime,experimental | localhost:7052,localhost:7053 min_compression_ratio | 0.80004 | experimental | localhost:7052,localhost:7051,localhost:7053 safe_time_max_lag_ms | 4 | experimental | localhost:7052 safe_time_max_lag_ms | 5 | experimental | localhost:7053 The table has one row for each unique (flag, value) pair, listing all daemons with --flag=value. So, in the above output, there are two rows for the flag --safe_time_max_lag_ms because it's set to two different values on two masters. This makes it easy to scan for concerning flags and their values. A new flag --check_unusual_flags controls whether flag checks are done. It default to true. Flag checks may be worth turning off if a cluster has a large amount of tablet servers with experimental flags, or if a newer kudu tool is run against a cluster without GetFlags RPC support, since in these cases the flag check output will be noisy. Additionally, since the output might not scale to a large number of servers, the CSV of servers is abbreviated, by default, to 3 entries, with the number of additional servers indicated. The number of entries before truncation kicks in is controlled by --truncate_server_csv_length. Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 --- M src/kudu/integration-tests/cluster_verifier.cc M src/kudu/tools/ksck-test.cc M src/kudu/tools/ksck.cc M src/kudu/tools/ksck.h M src/kudu/tools/ksck_remote-test.cc M src/kudu/tools/ksck_remote.cc M src/kudu/tools/ksck_remote.h M src/kudu/tools/tool_action_cluster.cc 8 files changed, 346 insertions(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/10061/2 -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 2 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/10061 ) Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/10061/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/10061/1//COMMIT_MSG@22 PS1, Line 22: listing all : daemons > is this output going to scale to large clusters? eg if you have 200 tablet Mmm I had the same thought which inspired the flag to turn off the flags check. It's a bummer not to see which tablet servers have a flag set. Maybe, if we did something like list the first few tablet servers, and after some threshold abbreviate, e.g. my-tserver-0, my-tserver-1, my-tserver-2, and 86 other tablet servers since users running a lot of tablet servers are most likely using config management software, showing a few addresses might help them figure out why the experimental flag is set. For example, maybe they set an experimental flag on a rack and forgot to remove it from the config group for that rack, and seeing a few hostnames that have rack info encoded in them might help. Thoughts? -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 1 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Reviewer: Will Berkeley Gerrit-Comment-Date: Fri, 13 Apr 2018 20:00:38 + Gerrit-HasComments: Yes
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/10061 ) Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/10061/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/10061/1//COMMIT_MSG@22 PS1, Line 22: listing all : daemons is this output going to scale to large clusters? eg if you have 200 tablet servers, it's likely that all of them have the same experimental flags set. Maybe that's fine and we'll just have a really long line in the output, but maybe we should just abbreviate to a count like "89/89 tservers"? -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 1 Gerrit-Owner: Will Berkeley Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Fri, 13 Apr 2018 19:52:22 + Gerrit-HasComments: Yes
[kudu-CR] [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags
Will Berkeley has uploaded this change for review. ( http://gerrit.cloudera.org:8080/10061 Change subject: [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags .. [tools] ksck improvements [5/n]: Checks for experimental, unsafe, hidden flags This adds checks to ksck that look for experimental, unsafe, and hidden flags set to non-default values on Kudu masters and tablet servers. If any are found, ksck generates a table summarizing the different flags and their values. For example: WARNING: Some masters have hidden, experimental, or unsafe flags set: Flag |Value| Tags | Master +-+--+-- codegen_dump_functions | true| runtime,experimental | localhost:7052,localhost:7053 min_compression_ratio | 0.80004 | experimental | localhost:7052,localhost:7051,localhost:7053 safe_time_max_lag_ms | 4 | experimental | localhost:7052 safe_time_max_lag_ms | 5 | experimental | localhost:7053 The table has one row for each unique (flag, value) pair, listing all daemons with --flag=value. So, in the above output, there are two rows for the flag --safe_time_max_lag_ms because it's set to two different values on two masters. This makes it easy to scan for concerning flags and their values. A new flag --check_unusual_flags controls whether flag checks are done. It default to true. Flag checks may be worth turning off if a cluster has a large amount of tablet servers with experimental flags, or if a newer kudu tool is run against a cluster without GetFlags RPC support, since in these cases the flag check output will be noisy. Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 --- M src/kudu/integration-tests/cluster_verifier.cc M src/kudu/tools/ksck-test.cc M src/kudu/tools/ksck.cc M src/kudu/tools/ksck.h M src/kudu/tools/ksck_remote-test.cc M src/kudu/tools/ksck_remote.cc M src/kudu/tools/ksck_remote.h M src/kudu/tools/tool_action_cluster.cc 8 files changed, 323 insertions(+), 1 deletion(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/10061/1 -- To view, visit http://gerrit.cloudera.org:8080/10061 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: Idd6c179e5256b2f2bae2f7486c5e0365ef184706 Gerrit-Change-Number: 10061 Gerrit-PatchSet: 1 Gerrit-Owner: Will Berkeley