[kudu-CR] disk failure: tests for disk failure recovery

2017-08-10 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7031

to look at the new patch set (#12).

Change subject: disk failure: tests for disk failure recovery
..

disk failure: tests for disk failure recovery

This patch adds an EMC test that spawns three servers and triggers EIOs
on two of them to fail two different tablets. With improper
disk-failure-handling, this scenario alone would have been enough to
leave the server with only a single copy of data, as the two servers
with EIOs would have been shut down entirely.

With proper disk-failure handling, this scenario would be salvageable,
and data would be replicated on the remaining disks. This exercises the
FlushMRS codepath.

Tests are also added to test behavior during FlushDMS calls and during
scans, ensuring the servers return to a normal state. All of these tests
are parameterized to run with both the LBM and FBM.

Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/disk_failure-itest.cc
2 files changed, 407 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/12
-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: tests for disk failure recovery

2017-08-10 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7031

to look at the new patch set (#11).

Change subject: disk failure: tests for disk failure recovery
..

disk failure: tests for disk failure recovery

This patch adds an EMC test that spawns three servers and triggers EIOs
on two of them to fail two different tablets. With improper
disk-failure-handling, this scenario alone would have been enough to
leave the server with only a single copy of data, as the two servers
with EIOs would have been shut down entirely.

With proper disk-failure handling, this scenario would be salvageable,
and data would be replicated on the remaining disks. This exercises the
FlushMRS codepath.

Tests are also added to test behavior during FlushDMS calls and during
scans, ensuring the servers return to a normal state. All of these tests
are parameterized to run with both the LBM and FBM.

Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/disk_failure-itest.cc
2 files changed, 407 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/11
-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: tests for disk failure recovery

2017-07-19 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: disk failure: tests for disk failure recovery
..


Patch Set 10:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/7031/7/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

PS7, Line 68: 
> Done
Not done? Also apply this change to your other testing change.


Line 233: // servers with EIOs would have been shut down entirely.
> Applying sequentially means there are only MRS flushes.
OK, could you add a comment explaining that?


http://gerrit.cloudera.org:8080/#/c/7031/10/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

PS10, Line 58: const vector kDefaultFlags = {
 :   // Flush frequently to trigger writes.
 :   "--flush_threshold_mb=1",
 :   "--flush_threshold_secs=1",
 : 
 :   // Ensure a tablet will only store data on a single disk.
 :   "--fs_target_data_dirs_per_tablet=1",
 : };
Nit: move these to GetDefaultFlags() (i.e. closer to where they're actually 
used).


Line 73:   void SetupDefaultTables() {
It's creating just one table though. So SetupDefaultTable() maybe?


Line 93: if (GetParam() == FBM) {
How about place the string representation of the gflag directly in the enum, 
then you won't need this conversion.


Line 107: CHECK_EQ(3, tablet_servers_.size());
Nit: compare with FLAGS_num_tablet_servers, so that if we want to update that 
value, there are two places in the code to update instead of one.


Line 332:   SetServerSurvivalFlags(ext_tservers);
NO_FATALS.


Line 350:   
write_workload.set_write_pattern(TestWorkload::WritePattern::INSERT_SEQUENTIAL_ROWS);
Add comment justifying SEQUENTIAL.


-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: tests for disk failure recovery

2017-07-18 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: disk failure: tests for disk failure recovery
..


Patch Set 7:

(34 comments)

http://gerrit.cloudera.org:8080/#/c/7031/3//COMMIT_MSG
Commit Message:

PS3, Line 16: exercises 
> typo
Done


http://gerrit.cloudera.org:8080/#/c/7031/3/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

Line 26: #include "kudu/gutil/map-util.h"
> warning: #includes are not sorted properly [llvm-include-order]
Done


Line 38: 
> warning: using decl 'map' is unused [misc-unused-using-decls]
Done


PS3, Line 61: 
> nit: indent
These have been removed to use TestWorkload instead


PS3, Line 62: 
> nit: using...
same


Line 81:   NO_FATALS(CreateTable(table_id));
> warning: redundant 'test_info_' declaration [readability-redundant-declarat
seems harmless?


PS3, Line 82:   WaitForTSAndReplicas(table_id);
: }
:   }
: 
:   // Sets up a cluster with three servers with two disks each.
:   vector SetupDefaultCluster() {
: FLAGS_num_tablet_servers = 3;
: CreateCluster("survivable_cluster", kDefaultFlags, {}, /* 
num_data_dirs */ 2);
: CreateClient(_);
: vector tservers;
: AppendValuesFromMap(tablet_servers_, );
: CHECK_EQ(3, tservers.size());
: vector ext_tservers;
: for (auto* details : tservers) {
:   
ext_tservers.push_back(cluster_->tablet_server_by_uuid(details->uuid()));
: }
: return ext_tservers;
> nit move this comment to before the test declaration, or at least give an i
Done


PS3, Line 150:   if (counts_after - counts_before > 0) {
 : dirs_written->push_back(e.first);
 :   } else {
 : dirs_not_written->push_back(e.first);
 :   }
 : }
 :   }
> this is a bit funky (depending on the number of files) any way we can be mo
Done. I've changed it to count the files before and compare it with the counts 
after running whatever function.


PS3, Line 214: d on tw
> tablet servers
Done


http://gerrit.cloudera.org:8080/#/c/7031/6/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

Line 28: #include "kudu/integration-tests/test_workload.h"
> warning: #includes are not sorted properly [llvm-include-order]
Done


Line 66: string GlobForBlockFileInDataDir(const string& data_dir) {
> warning: invalid case style for global constant 'ts_flags' [readability-ide
Done


Line 101:   void SetServerSurvivalFlags(vector& 
ext_tservers) {
> warning: non-const reference parameter 'ext_tservers', make it const or use
Done


Line 126:   const 
std::function& f,
> warning: the parameter 'f' is copied for each invocation but only used as a
Done


Line 165:   void GetDataDirsWrittenToByWorkload(const 
vector ext_tservers,
> warning: the const qualified parameter 'ext_tservers' is copied for each in
Done


Line 166:   TestWorkload* workload,
> warning: non-const reference parameter 'workload', make it const or use a p
Done


http://gerrit.cloudera.org:8080/#/c/7031/7/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

Line 30: #include "kudu/util/path_util.h"
> warning: #includes are not sorted properly [llvm-include-order]
Done


PS7, Line 51: uint32_t
> Simple int would be fine here, I think.
Done


Line 67:   if (FLAGS_block_manager == "file") {
> Would it be useful to parameterize these tests and run them on both block m
Done


PS7, Line 68: **
> Isn't this middle entry also '??'? I thought block paths were data/ab/cd/ab
Done


Line 91: vector tservers;
> Maybe omit this and iterate on the tablet_servers_ map directly?
Done


Line 101:   void SetServerSurvivalFlags(vector& 
ext_tservers) {
> warning: non-const reference parameter 'ext_tservers', make it const or use
Done


Line 150:   if (counts_after - counts_before > 0) {
> Would if counts_after > counts_before be simpler?
Done


Line 165:   void GetDataDirsWrittenToByWorkload(const 
vector ext_tservers,
> warning: the const qualified parameter 'ext_tservers' is copied for each in
Done


Line 176: workload->StopAndJoin();
> IIUC, we're writing just enough to figure out who is writing where, and the
Yes, exactly


Line 184:   ext_tserver->GetInt64Metric(_ENTITY_server, nullptr, 
_data_dirs_failed,
> Need ASSERT_OK here too?
Done


Line 200: ASSERT_OK(row->SetInt32(0, i + 

[kudu-CR] disk failure: tests for disk failure recovery

2017-07-18 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7031

to look at the new patch set (#10).

Change subject: disk failure: tests for disk failure recovery
..

disk failure: tests for disk failure recovery

This patch adds an EMC test that spawns three servers and triggers EIOs
on two of them to fail two different tablets. With improper
disk-failure-handling, this scenario alone would have been enough to
leave the server with only a single copy of data, as the two servers
with EIOs would have been shut down entirely.

With proper disk-failure handling, this scenario would be salvageable,
and data would be replicated on the remaining disks. This exercises the
FlushMRS codepath.

Tests are also added to test behavior during FlushDMS calls and during
scans, ensuring the servers return to a normal state. All of these tests
are parameterized to run with both the LBM and FBM.

Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/disk_failure-itest.cc
2 files changed, 382 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/10
-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: tests for disk failure recovery

2017-07-18 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7031

to look at the new patch set (#9).

Change subject: disk failure: tests for disk failure recovery
..

disk failure: tests for disk failure recovery

This patch adds an EMC test that spawns three servers and triggers EIOs
on two of them to fail two different tablets. With improper
disk-failure-handling, this scenario alone would have been enough to
leave the server with only a single copy of data, as the two servers
with EIOs would have been shut down entirely.

With proper disk-failure handling, this scenario would be salvageable,
and data would be replicated on the remaining disks. This exercises the
FlushMRS codepath.

Tests are also added to test behavior during FlushDMS calls and during
scans, ensuring the servers return to a normal state. All of these tests
are parameterized to run with both the LBM and FBM.

Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/disk_failure-itest.cc
2 files changed, 380 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/9
-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: tests for disk failure recovery

2017-07-08 Thread Andrew Wong (Code Review)
Andrew Wong has abandoned this change.

Change subject: disk failure: tests for disk failure recovery
..


Abandoned

With the right error handling, this patch shouldn't be needed. Fault-tolerant 
scans shouldn't see errors at this level.

-- 
To view, visit http://gerrit.cloudera.org:8080/7243
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: abandon
Gerrit-Change-Id: I364c0ae2ac48920bcbd5b662b931ca448464c90e
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: tests for disk failure recovery

2017-07-08 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7031

to look at the new patch set (#8).

Change subject: disk failure: tests for disk failure recovery
..

disk failure: tests for disk failure recovery

This patch adds an EMC test that spawns three servers and triggers EIOs
on two of them to fail two different tablets. With improper
disk-failure-handling, this scenario alone would have been enough to
leave the server with only a single copy of data, as the two servers
with EIOs would have been shut down entirely.

With proper disk-failure handling, this scenario would be salvageable,
and data would be replicated on the remaining disks. This exercises the
FlushMRS codepath.

Tests are also added to test behavior during FlushDMS calls and scans,
ensuring the servers return to a normal state.

Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/disk_failure-itest.cc
2 files changed, 361 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/7031/8
-- 
To view, visit http://gerrit.cloudera.org:8080/7031
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8ff63ec71ab718866484b9f3ec7264bc72ecfe97
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: tests for disk failure recovery

2017-07-07 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7243

to look at the new patch set (#5).

Change subject: disk failure: tests for disk failure recovery
..

disk failure: tests for disk failure recovery

This patch adds an EMC test that spawns three servers and triggers EIOs
on two of them to fail two different tablets. With improper
disk-failure-handling, this scenario alone would have been enough to
leave the server with only a single copy of data, as the two servers
with EIOs would have been shut down entirely.

With proper disk-failure handling, this scenario would be salvageable,
and data would be replicated on the remaining disks. This exercises the
FlushMRS codepath.

Tests are also added to test behavior during FlushDMS calls and scans,
ensuring the servers return to a normal state.

Change-Id: I364c0ae2ac48920bcbd5b662b931ca448464c90e
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/disk_failure-itest.cc
2 files changed, 361 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/43/7243/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7243
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I364c0ae2ac48920bcbd5b662b931ca448464c90e
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon