[kudu-CR] disk failure: tests for disk failure recovery
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7031 to look at the new patch set (#12). Change subject: disk failure: tests for disk failure recovery .. disk failure: tests for disk failure recovery This patch adds an EMC test that spawns three servers and triggers EIOs on two of them to fail two different tablets. With improper disk-failure-handling, this scenario alone would have been enough to leave the server with only a single copy of data, as the two servers with EIOs would have been shut down entirely. With proper disk-failure handling, this scenario would be salvageable, and data would be replicated on the remaining disks. This exercises the FlushMRS codepath. Tests are also added to test behavior during FlushDMS calls and during scans, ensuring the servers return to a normal state. All of these tests are parameterized to run with both the LBM and FBM. Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97 --- M src/kudu/integration-tests/CMakeLists.txt A src/kudu/integration-tests/disk_failure-itest.cc 2 files changed, 407 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/12 -- To view, visit http://gerrit.cloudera.org:8080/7031 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97 Gerrit-PatchSet: 12 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: tests for disk failure recovery
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7031 to look at the new patch set (#11). Change subject: disk failure: tests for disk failure recovery .. disk failure: tests for disk failure recovery This patch adds an EMC test that spawns three servers and triggers EIOs on two of them to fail two different tablets. With improper disk-failure-handling, this scenario alone would have been enough to leave the server with only a single copy of data, as the two servers with EIOs would have been shut down entirely. With proper disk-failure handling, this scenario would be salvageable, and data would be replicated on the remaining disks. This exercises the FlushMRS codepath. Tests are also added to test behavior during FlushDMS calls and during scans, ensuring the servers return to a normal state. All of these tests are parameterized to run with both the LBM and FBM. Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97 --- M src/kudu/integration-tests/CMakeLists.txt A src/kudu/integration-tests/disk_failure-itest.cc 2 files changed, 407 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/11 -- To view, visit http://gerrit.cloudera.org:8080/7031 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97 Gerrit-PatchSet: 11 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: tests for disk failure recovery
Adar Dembo has posted comments on this change. Change subject: disk failure: tests for disk failure recovery .. Patch Set 10: (8 comments) http://gerrit.cloudera.org:8080/#/c/7031/7/src/kudu/integration-tests/disk_failure-itest.cc File src/kudu/integration-tests/disk_failure-itest.cc: PS7, Line 68: > Done Not done? Also apply this change to your other testing change. Line 233: // servers with EIOs would have been shut down entirely. > Applying sequentially means there are only MRS flushes. OK, could you add a comment explaining that? http://gerrit.cloudera.org:8080/#/c/7031/10/src/kudu/integration-tests/disk_failure-itest.cc File src/kudu/integration-tests/disk_failure-itest.cc: PS10, Line 58: const vector kDefaultFlags = { : // Flush frequently to trigger writes. : "--flush_threshold_mb=1", : "--flush_threshold_secs=1", : : // Ensure a tablet will only store data on a single disk. : "--fs_target_data_dirs_per_tablet=1", : }; Nit: move these to GetDefaultFlags() (i.e. closer to where they're actually used). Line 73: void SetupDefaultTables() { It's creating just one table though. So SetupDefaultTable() maybe? Line 93: if (GetParam() == FBM) { How about place the string representation of the gflag directly in the enum, then you won't need this conversion. Line 107: CHECK_EQ(3, tablet_servers_.size()); Nit: compare with FLAGS_num_tablet_servers, so that if we want to update that value, there are two places in the code to update instead of one. Line 332: SetServerSurvivalFlags(ext_tservers); NO_FATALS. Line 350: write_workload.set_write_pattern(TestWorkload::WritePattern::INSERT_SEQUENTIAL_ROWS); Add comment justifying SEQUENTIAL. -- To view, visit http://gerrit.cloudera.org:8080/7031 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97 Gerrit-PatchSet: 10 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Andrew Wong Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon Gerrit-HasComments: Yes
[kudu-CR] disk failure: tests for disk failure recovery
Andrew Wong has posted comments on this change. Change subject: disk failure: tests for disk failure recovery .. Patch Set 7: (34 comments) http://gerrit.cloudera.org:8080/#/c/7031/3//COMMIT_MSG Commit Message: PS3, Line 16: exercises > typo Done http://gerrit.cloudera.org:8080/#/c/7031/3/src/kudu/integration-tests/disk_failure-itest.cc File src/kudu/integration-tests/disk_failure-itest.cc: Line 26: #include "kudu/gutil/map-util.h" > warning: #includes are not sorted properly [llvm-include-order] Done Line 38: > warning: using decl 'map' is unused [misc-unused-using-decls] Done PS3, Line 61: > nit: indent These have been removed to use TestWorkload instead PS3, Line 62: > nit: using... same Line 81: NO_FATALS(CreateTable(table_id)); > warning: redundant 'test_info_' declaration [readability-redundant-declarat seems harmless? PS3, Line 82: WaitForTSAndReplicas(table_id); : } : } : : // Sets up a cluster with three servers with two disks each. : vectorSetupDefaultCluster() { : FLAGS_num_tablet_servers = 3; : CreateCluster("survivable_cluster", kDefaultFlags, {}, /* num_data_dirs */ 2); : CreateClient(_); : vector tservers; : AppendValuesFromMap(tablet_servers_, ); : CHECK_EQ(3, tservers.size()); : vector ext_tservers; : for (auto* details : tservers) { : ext_tservers.push_back(cluster_->tablet_server_by_uuid(details->uuid())); : } : return ext_tservers; > nit move this comment to before the test declaration, or at least give an i Done PS3, Line 150: if (counts_after - counts_before > 0) { : dirs_written->push_back(e.first); : } else { : dirs_not_written->push_back(e.first); : } : } : } > this is a bit funky (depending on the number of files) any way we can be mo Done. I've changed it to count the files before and compare it with the counts after running whatever function. PS3, Line 214: d on tw > tablet servers Done http://gerrit.cloudera.org:8080/#/c/7031/6/src/kudu/integration-tests/disk_failure-itest.cc File src/kudu/integration-tests/disk_failure-itest.cc: Line 28: #include "kudu/integration-tests/test_workload.h" > warning: #includes are not sorted properly [llvm-include-order] Done Line 66: string GlobForBlockFileInDataDir(const string& data_dir) { > warning: invalid case style for global constant 'ts_flags' [readability-ide Done Line 101: void SetServerSurvivalFlags(vector & ext_tservers) { > warning: non-const reference parameter 'ext_tservers', make it const or use Done Line 126: const std::function & f, > warning: the parameter 'f' is copied for each invocation but only used as a Done Line 165: void GetDataDirsWrittenToByWorkload(const vector ext_tservers, > warning: the const qualified parameter 'ext_tservers' is copied for each in Done Line 166: TestWorkload* workload, > warning: non-const reference parameter 'workload', make it const or use a p Done http://gerrit.cloudera.org:8080/#/c/7031/7/src/kudu/integration-tests/disk_failure-itest.cc File src/kudu/integration-tests/disk_failure-itest.cc: Line 30: #include "kudu/util/path_util.h" > warning: #includes are not sorted properly [llvm-include-order] Done PS7, Line 51: uint32_t > Simple int would be fine here, I think. Done Line 67: if (FLAGS_block_manager == "file") { > Would it be useful to parameterize these tests and run them on both block m Done PS7, Line 68: ** > Isn't this middle entry also '??'? I thought block paths were data/ab/cd/ab Done Line 91: vector tservers; > Maybe omit this and iterate on the tablet_servers_ map directly? Done Line 101: void SetServerSurvivalFlags(vector & ext_tservers) { > warning: non-const reference parameter 'ext_tservers', make it const or use Done Line 150: if (counts_after - counts_before > 0) { > Would if counts_after > counts_before be simpler? Done Line 165: void GetDataDirsWrittenToByWorkload(const vector ext_tservers, > warning: the const qualified parameter 'ext_tservers' is copied for each in Done Line 176: workload->StopAndJoin(); > IIUC, we're writing just enough to figure out who is writing where, and the Yes, exactly Line 184: ext_tserver->GetInt64Metric(_ENTITY_server, nullptr, _data_dirs_failed, > Need ASSERT_OK here too? Done Line 200: ASSERT_OK(row->SetInt32(0, i +
[kudu-CR] disk failure: tests for disk failure recovery
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7031 to look at the new patch set (#10). Change subject: disk failure: tests for disk failure recovery .. disk failure: tests for disk failure recovery This patch adds an EMC test that spawns three servers and triggers EIOs on two of them to fail two different tablets. With improper disk-failure-handling, this scenario alone would have been enough to leave the server with only a single copy of data, as the two servers with EIOs would have been shut down entirely. With proper disk-failure handling, this scenario would be salvageable, and data would be replicated on the remaining disks. This exercises the FlushMRS codepath. Tests are also added to test behavior during FlushDMS calls and during scans, ensuring the servers return to a normal state. All of these tests are parameterized to run with both the LBM and FBM. Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97 --- M src/kudu/integration-tests/CMakeLists.txt A src/kudu/integration-tests/disk_failure-itest.cc 2 files changed, 382 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/10 -- To view, visit http://gerrit.cloudera.org:8080/7031 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97 Gerrit-PatchSet: 10 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: tests for disk failure recovery
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7031 to look at the new patch set (#9). Change subject: disk failure: tests for disk failure recovery .. disk failure: tests for disk failure recovery This patch adds an EMC test that spawns three servers and triggers EIOs on two of them to fail two different tablets. With improper disk-failure-handling, this scenario alone would have been enough to leave the server with only a single copy of data, as the two servers with EIOs would have been shut down entirely. With proper disk-failure handling, this scenario would be salvageable, and data would be replicated on the remaining disks. This exercises the FlushMRS codepath. Tests are also added to test behavior during FlushDMS calls and during scans, ensuring the servers return to a normal state. All of these tests are parameterized to run with both the LBM and FBM. Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97 --- M src/kudu/integration-tests/CMakeLists.txt A src/kudu/integration-tests/disk_failure-itest.cc 2 files changed, 380 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/9 -- To view, visit http://gerrit.cloudera.org:8080/7031 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97 Gerrit-PatchSet: 9 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: tests for disk failure recovery
Andrew Wong has abandoned this change. Change subject: disk failure: tests for disk failure recovery .. Abandoned With the right error handling, this patch shouldn't be needed. Fault-tolerant scans shouldn't see errors at this level. -- To view, visit http://gerrit.cloudera.org:8080/7243 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: abandon Gerrit-Change-Id: I364c0ae2ac48920bcbd5b662b931ca448464c90e Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: tests for disk failure recovery
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7031 to look at the new patch set (#8). Change subject: disk failure: tests for disk failure recovery .. disk failure: tests for disk failure recovery This patch adds an EMC test that spawns three servers and triggers EIOs on two of them to fail two different tablets. With improper disk-failure-handling, this scenario alone would have been enough to leave the server with only a single copy of data, as the two servers with EIOs would have been shut down entirely. With proper disk-failure handling, this scenario would be salvageable, and data would be replicated on the remaining disks. This exercises the FlushMRS codepath. Tests are also added to test behavior during FlushDMS calls and scans, ensuring the servers return to a normal state. Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97 --- M src/kudu/integration-tests/CMakeLists.txt A src/kudu/integration-tests/disk_failure-itest.cc 2 files changed, 361 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/8 -- To view, visit http://gerrit.cloudera.org:8080/7031 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97 Gerrit-PatchSet: 8 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Adar Dembo Gerrit-Reviewer: David Ribeiro Alves Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon
[kudu-CR] disk failure: tests for disk failure recovery
Hello Kudu Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7243 to look at the new patch set (#5). Change subject: disk failure: tests for disk failure recovery .. disk failure: tests for disk failure recovery This patch adds an EMC test that spawns three servers and triggers EIOs on two of them to fail two different tablets. With improper disk-failure-handling, this scenario alone would have been enough to leave the server with only a single copy of data, as the two servers with EIOs would have been shut down entirely. With proper disk-failure handling, this scenario would be salvageable, and data would be replicated on the remaining disks. This exercises the FlushMRS codepath. Tests are also added to test behavior during FlushDMS calls and scans, ensuring the servers return to a normal state. Change-Id: I364c0ae2ac48920bcbd5b662b931ca448464c90e --- M src/kudu/integration-tests/CMakeLists.txt A src/kudu/integration-tests/disk_failure-itest.cc 2 files changed, 361 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/43/7243/5 -- To view, visit http://gerrit.cloudera.org:8080/7243 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I364c0ae2ac48920bcbd5b662b931ca448464c90e Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew WongGerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon