[kudu-CR] disk failure: don't open tablets on failed disks

2017-08-24 Thread Mike Percy (Code Review)
Mike Percy has submitted this change and it was merged.

Change subject: disk failure: don't open tablets on failed disks
..


disk failure: don't open tablets on failed disks

Currently Kudu servers open the FS layout with failed disks. However,
the moment tablets attempt to bootstrap (i.e. open blocks, etc.), they
will attempt to read from the failed disk and fail. This can be avoided
by checking whether a tablet's disk group contains a failed disk before
attempting to read data from the tablet. If so, the tablet should be
marked as having an error so it can be reassigned.

The default behavior of the 'fs_target_data_dirs_per_tablet' flag is
updated to take into account disk state when assigning new directory
groups. This allows the tablet to be reassigned to a server without
being spread across a failed directory.

Testing is done by loading data into a cluster configured to use
multiple directories for data blocks, failing a single directory on one
of the tablet servers, and ensuring that the tablets with blocks on the
failed directory get re-replicated at startup time. The test uses a
cluster verifier to verify the healthy end-state of the cluster.
Necessary changes have been made to do this on a cluster comprising of
multiple data directories.

Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Reviewed-on: http://gerrit.cloudera.org:8080/7766
Reviewed-by: Mike Percy 
Tested-by: Kudu Jenkins
---
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/disk_failure-itest.cc
M src/kudu/integration-tests/external_mini_cluster_fs_inspector.cc
M src/kudu/tserver/ts_tablet_manager.cc
8 files changed, 185 insertions(+), 13 deletions(-)

Approvals:
  Mike Percy: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/7766
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: don't open tablets on failed disks

2017-08-24 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: disk failure: don't open tablets on failed disks
..


Patch Set 6:

python test hung, was killed and left no server logs or junit xml report. 
retriggered again.

-- 
To view, visit http://gerrit.cloudera.org:8080/7766
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] disk failure: don't open tablets on failed disks

2017-08-23 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: disk failure: don't open tablets on failed disks
..


Patch Set 6:

I retriggered this one too

-- 
To view, visit http://gerrit.cloudera.org:8080/7766
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] disk failure: don't open tablets on failed disks

2017-08-23 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: disk failure: don't open tablets on failed disks
..


Patch Set 6: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/7766
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] disk failure: don't open tablets on failed disks

2017-08-23 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: disk failure: don't open tablets on failed disks
..


Patch Set 5:

(6 comments)

I'm using a cluster verifier, but it doesn't quite work if you mix around the 
block manager type. I've deparameterized the test for this reason.

http://gerrit.cloudera.org:8080/#/c/7766/4/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

Line 31: METRIC_DECLARE_gauge_uint64(data_dirs_failed);
> not used in an external minicluster test
As per our hipchat convo, this is needed to get metric.


Line 33: DECLARE_string(block_manager);
> same
Done


PS4, Line 42: 
> does 30 work?
Done


Line 55: StartCluster(GetClusterFlags(), {}, 3, 5);
> I dont think you need this when you can just use cluster_->tablet_server(i)
Done


PS4, Line 96:   ExternalTabletServer* ts = cluster_->tablet_server(0);
:   ASSERT_OK(cluster_->WaitForTabletsRunning(ts, kNumTablets, 
kAgreementTimeout));
: 
> how about:
I'm going to keep this here since it's just a way to ensure we don't restart 
before things are persisted. I'll use the cluster verifier below.


PS4, Line 116: }
 : 
> additionally, how about adding:
See above comment.


-- 
To view, visit http://gerrit.cloudera.org:8080/7766
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: don't open tablets on failed disks

2017-08-23 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7766

to look at the new patch set (#6).

Change subject: disk failure: don't open tablets on failed disks
..

disk failure: don't open tablets on failed disks

Currently Kudu servers open the FS layout with failed disks. However,
the moment tablets attempt to bootstrap (i.e. open blocks, etc.), they
will attempt to read from the failed disk and fail. This can be avoided
by checking whether a tablet's disk group contains a failed disk before
attempting to read data from the tablet. If so, the tablet should be
marked as having an error so it can be reassigned.

The default behavior of the 'fs_target_data_dirs_per_tablet' flag is
updated to take into account disk state when assigning new directory
groups. This allows the tablet to be reassigned to a server without
being spread across a failed directory.

Testing is done by loading data into a cluster configured to use
multiple directories for data blocks, failing a single directory on one
of the tablet servers, and ensuring that the tablets with blocks on the
failed directory get re-replicated at startup time. The test uses a
cluster verifier to verify the healthy end-state of the cluster.
Necessary changes have been made to do this on a cluster comprising of
multiple data directories.

Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
---
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/disk_failure-itest.cc
M src/kudu/integration-tests/external_mini_cluster_fs_inspector.cc
M src/kudu/tserver/ts_tablet_manager.cc
8 files changed, 185 insertions(+), 13 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/66/7766/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7766
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: don't open tablets on failed disks

2017-08-23 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: disk failure: don't open tablets on failed disks
..


Patch Set 4:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7766/4/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

Line 31: METRIC_DECLARE_gauge_uint64(data_dirs_failed);
not used in an external minicluster test


Line 33: DECLARE_string(block_manager);
same


PS4, Line 42: 120
does 30 work?


Line 55:   void SetupDefaultCluster(vector* 
ext_tservers) {
I dont think you need this when you can just use cluster_->tablet_server(i) to 
get what you are loading into this vector


PS4, Line 96:   // Ensure the tablets get to a running state.
:   
ASSERT_OK(cluster_->WaitForTabletsRunning(cluster_->tablet_server(0),
: kNumTablets, 
kAgreementTimeout));
how about:

  auto& ts0 = ts_map_[cluster_->tablet_server(0)->uuid()];
  string tablet_id;
  ASSERT_EVENTUALLY([&] {
vector tablet_ids;
ASSERT_OK(ListRunningTabletIds(ts0, kTimeout, _ids));
ASSERT_EQ(1, tablet_ids.size());
tablet_id = tablet_ids[0];
  }
  ASSERT_OK(WaitForServersToAgree(kTimeout, ts_map_, tablet_id, 
workload.batches_completed()));


PS4, Line 116:   
ASSERT_OK(cluster_->WaitForTabletsRunning(cluster_->tablet_server(0),
 : kNumTablets, 
kAgreementTimeout));
additionally, how about adding:

  ASSERT_OK(WaitForServersToAgree(kTimeout, ts_map_, tablet_id, 
workload.batches_completed()));


-- 
To view, visit http://gerrit.cloudera.org:8080/7766
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: don't open tablets on failed disks

2017-08-23 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: disk failure: don't open tablets on failed disks
..


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7766/4/src/kudu/fs/file_block_manager.cc
File src/kudu/fs/file_block_manager.cc:

PS4, Line 597: TODO(awong): currently the FsReport only reports on containers 
for the
 : // log block manager. Implement some sort of reporting for 
failed
 : // directories as well.
> mind creating a jira to track this and assigning to yourself?
Done


http://gerrit.cloudera.org:8080/#/c/7766/4/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

PS4, Line 104:   Substitute("--env_inject_eio_globs=$0", 
JoinPathSegments(failed_dir, "**")),
 :   "--env_inject_eio=1.0",
 :   "--suicide_on_eio=false",
> what if you just append to the current flags and not replace them all?
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7766
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: don't open tablets on failed disks

2017-08-23 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7766

to look at the new patch set (#5).

Change subject: disk failure: don't open tablets on failed disks
..

disk failure: don't open tablets on failed disks

Currently Kudu servers open the FS layout with failed disks. However,
the moment tablets attempt to bootstrap (i.e. open blocks, etc.), they
will attempt to read from the failed disk and fail. This can be avoided
by checking whether a tablet's disk group contains a failed disk before
attempting to read data from the tablet. If so, the tablet should be
marked as having an error so it can be reassigned.

The default behavior of the 'fs_target_data_dirs_per_tablet' flag is
updated to take into account disk state when assigning new directory
groups. This allows the tablet to be reassigned to a server without
being spread across a failed directory.

Testing is done by loading data into a cluster configured to use
multiple directories for data blocks, failing a single directory on one
of the tablet servers, and ensuring that the tablets with blocks on the
failed directory get re-replicated at startup time.

Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
---
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/disk_failure-itest.cc
M src/kudu/tserver/ts_tablet_manager.cc
7 files changed, 201 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/66/7766/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7766
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: don't open tablets on failed disks

2017-08-23 Thread David Ribeiro Alves (Code Review)
David Ribeiro Alves has posted comments on this change.

Change subject: disk failure: don't open tablets on failed disks
..


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7766/4/src/kudu/fs/file_block_manager.cc
File src/kudu/fs/file_block_manager.cc:

PS4, Line 597: TODO(awong): currently the FsReport only reports on containers 
for the
 : // log block manager. Implement some sort of reporting for 
failed
 : // directories as well.
mind creating a jira to track this and assigning to yourself?


http://gerrit.cloudera.org:8080/#/c/7766/4/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

PS4, Line 104:   Substitute("--env_inject_eio_globs=$0", 
JoinPathSegments(failed_dir, "**")),
 :   "--env_inject_eio=1.0",
 :   "--suicide_on_eio=false",
what if you just append to the current flags and not replace them all?


-- 
To view, visit http://gerrit.cloudera.org:8080/7766
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: don't open tablets on failed disks

2017-08-23 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: disk failure: don't open tablets on failed disks
..


Patch Set 4:

(1 comment)

Just a lint failure, still good for review.

http://gerrit.cloudera.org:8080/#/c/7766/4/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

Line 23: #include "kudu/integration-tests/cluster_itest_util.h"
> warning: #includes are not sorted properly [llvm-include-order]
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7766
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: don't open tablets on failed disks

2017-08-23 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: disk failure: don't open tablets on failed disks
..


Patch Set 4:

Rebased onto "open FS layout in presence of disk failure" instead of the 
"persistent disk state" patch.

-- 
To view, visit http://gerrit.cloudera.org:8080/7766
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: No


[kudu-CR] disk failure: don't open tablets on failed disks

2017-08-23 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7766

to look at the new patch set (#4).

Change subject: disk failure: don't open tablets on failed disks
..

disk failure: don't open tablets on failed disks

Currently Kudu servers open the FS layout with failed disks. However,
the moment tablets attempt to bootstrap (i.e. open blocks, etc.), they
will attempt to read from the failed disk and fail. This can be avoided
by checking whether a tablet's disk group contains a failed disk before
attempting to read data from the tablet. If so, the tablet should be
marked as having an error so it can be reassigned.

The default behavior of the 'fs_target_data_dirs_per_tablet' flag is
updated to take into account disk state when assigning new directory
groups. This allows the tablet to be reassigned to a server without
being spread across a failed directory.

Testing is done by loading data into a cluster configured to use
multiple directories for data blocks, failing a single directory on one
of the tablet servers, and ensuring that the tablets with blocks on the
failed directory get re-replicated at startup time.

Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
---
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/disk_failure-itest.cc
M src/kudu/tserver/ts_tablet_manager.cc
7 files changed, 203 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/66/7766/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7766
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: don't open tablets on failed disks

2017-08-23 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: disk failure: don't open tablets on failed disks
..


Patch Set 3:

(25 comments)

This patch is dependent on the disk state patch; currently working out the 
status of that first and then will push these changes.

http://gerrit.cloudera.org:8080/#/c/7766/3//COMMIT_MSG
Commit Message:

PS3, Line 21: Testing is done by loading data into a cluster with multi-disk
: servers, failing a single directory of one of the servers, and 
ensuring
: that the tablets spread across the failed disk get replicated 
upon the
: next startup.
> how about: Testing is done by loading data into a cluster configured to use
Done


http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/fs/data_dirs.cc
File src/kudu/fs/data_dirs.cc:

Line 986:   if (PREDICT_FALSE(!failed_dirs.empty())) {
> Why do we need this condition? Won't the for loop no-op if failed_dirs is e
Done


http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/fs/file_block_manager.cc
File src/kudu/fs/file_block_manager.cc:

PS3, Line 581: // Don't report failed directories.
> shoudln't we _report_ failed directories? ok if you want to do it in a foll
Right, the intent here was to not report a failed directory as healthy. I agree 
it should somehow show itself in the FsReport. For another patch though, I 
think.


PS3, Line 581: // Don't report failed directories.
> There's some confusion about this, I think. The FsReport is intended for th
Thanks for explaining. I'll leave this for now and leave any additional 
reporting out of scope.


PS3, Line 693: VLOG(3) << Substitute("Block $0 is in a failed directory; not 
deleting",
 : block_id.ToString());
> how about using LOG_EVERY_N at INFO level, that way we'd get something in t
Done


http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

Line 1694:   // Ignore attempts to delete blocks in failed directories.
> Shouldn't this precede the RemoveLogBlockUnlocked? This is deleting the blo
Done


PS3, Line 1700:   VLOG(3) << Substitute("Block $0 is in a failed directory; 
not deleting",
  : block_id.ToString());
> same
It's unique to this patch in that we'll only ever try to delete a block in a 
failed disk. This snippet here has the property that we won't even try to write 
to the failed directory, and in that sense, disk failures on DeleteBlock() are 
non-fatal.


Line 1702:   return Status::OK();
> Agreed with Mike: if return an error, don't callers handle it appropriately
Done


Line 1702:   return Status::OK();
> I'm not sure why we are returning OK here. Also, new API semantics should b
Yeah, you make a good point. Failing to delete blocks is non-fatal anyway, so 
we should be able to just return a disk error here.


Line 1702:   return Status::OK();
> also: if these are the semantics we want (not sure), shouldn't we have a un
You're right, I've changed the semantics to return an error. Added a test too.


http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

PS3, Line 43: TabletServerIntegrationTestBase
> Would you mind inheriting from ExternalMiniClusterITestBase instead in this
Ah, it's all the rage, huh. Done


PS3, Line 56: if (GetParam() == "file") {
:   flags.emplace_back("--block_manager=file");
: } else {
:   flags.emplace_back("--block_manager=log");
: }
> maybe: flags.emplace_back(Substitute("--block_manager=$0", GetParam()));
Done


PS3, Line 65:   void SetupDefaultCluster(vector* 
ext_tservers) {
: FLAGS_num_tablet_servers = 3;
: vector flags = GetClusterFlags();
: NO_FATALS(CreateCluster("survivable_cluster", flags, {}, /* 
num_data_dirs */ 2));
: NO_FATALS(CreateClient(_));
: CHECK_EQ(FLAGS_num_tablet_servers, tablet_servers_.size());
: ext_tservers->clear();
: for (const auto& e : tablet_servers_) {
:   
ext_tservers->push_back(cluster_->tablet_server_by_uuid(e.second->uuid()));
: }
:   }
> can you add more disks? 2 is kind of special-casy because of tablet overlap
Yep, and that's not really necessary in this case because we're spreading 
across all directories.


PS3, Line 96: server
> a tablet server
Done


PS3, Line 97: server
> tablet server
Done


PS3, Line 98: .
> while it is shut down.
Done


Line 109:   write_workload.Setup();
> This creates a table. Why are you creating the table yourself above?
Good point, removed.


Line 110:   write_workload.Start();
> You should call workload.stopAndJoin() at some point during the test to shu
Done


PS3, 

[kudu-CR] disk failure: don't open tablets on failed disks

2017-08-22 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: disk failure: don't open tablets on failed disks
..


Patch Set 3:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/fs/data_dirs.cc
File src/kudu/fs/data_dirs.cc:

Line 986:   if (PREDICT_FALSE(!failed_dirs.empty())) {
Why do we need this condition? Won't the for loop no-op if failed_dirs is empty?


http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/fs/file_block_manager.cc
File src/kudu/fs/file_block_manager.cc:

PS3, Line 581: // Don't report failed directories.
> shoudln't we _report_ failed directories? ok if you want to do it in a foll
There's some confusion about this, I think. The FsReport is intended for the 
block manager to report fine grained details about the filesystem; if a data 
directory has "failed", the block manager can't open its on-disk state on that 
directory and thus can't include it in the report.

This is more of an issue for the log block manager, where we actually report 
something meaningful.


http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

Line 1694:   // Ignore attempts to delete blocks in failed directories.
Shouldn't this precede the RemoveLogBlockUnlocked? This is deleting the block 
from in-memory maps, but not from disk. That makes LogBlockManager::DeleteBlock 
differ from FileBlockManager::DeleteBlock in semantics.


Line 1702:   return Status::OK();
> also: if these are the semantics we want (not sure), shouldn't we have a un
Agreed with Mike: if return an error, don't callers handle it appropriately 
(i.e. WARN or whatever since we don't care that much if block deletion fails)?


-- 
To view, visit http://gerrit.cloudera.org:8080/7766
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: don't open tablets on failed disks

2017-08-22 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: disk failure: don't open tablets on failed disks
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

Line 1702:   return Status::OK();
> I'm not sure why we are returning OK here. Also, new API semantics should b
also: if these are the semantics we want (not sure), shouldn't we have a unit 
test for this?


-- 
To view, visit http://gerrit.cloudera.org:8080/7766
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: don't open tablets on failed disks

2017-08-22 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change.

Change subject: disk failure: don't open tablets on failed disks
..


Patch Set 3:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/7766/3//COMMIT_MSG
Commit Message:

PS3, Line 21: Testing is done by loading data into a cluster with multi-disk
: servers, failing a single directory of one of the servers, and 
ensuring
: that the tablets spread across the failed disk get replicated 
upon the
: next startup.
how about: Testing is done by loading data into a cluster configured to use 
multiple directories for data blocks, failing a single directory on one of the 
tablet servers, and ensuring that the tablets with blocks on the failed 
directory get re-replicated at startup time.


http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

Line 1702:   return Status::OK();
I'm not sure why we are returning OK here. Also, new API semantics should be 
documented at the interface level.


http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

PS3, Line 43: TabletServerIntegrationTestBase
Would you mind inheriting from ExternalMiniClusterITestBase instead in this 
class? The newer tests are inheriting from that instead.


PS3, Line 96: server
a tablet server


PS3, Line 97: server
tablet server


PS3, Line 98: .
while it is shut down.


Line 109:   write_workload.Setup();
This creates a table. Why are you creating the table yourself above?


Line 110:   write_workload.Start();
You should call workload.stopAndJoin() at some point during the test to shut 
the writer thread down again. Did you want it running this whole time?


PS3, Line 114: WaitForTSAndReplicas
what is the purpose of calling this function?


PS3, Line 124:   NO_FATALS(SetServerSurvivalFlags(ext_tservers));
> why is this not set on boot?
agree


http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 765: LOG(ERROR) << "Exiting bootstrapping early; tablet is in a failed 
directory";
how about: LOG(ERROR) << LogPrefix(tablet_id) << "aborting tablet bootstrap: 
tablet has data in a failed directory";


-- 
To view, visit http://gerrit.cloudera.org:8080/7766
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: don't open tablets on failed disks

2017-08-22 Thread David Ribeiro Alves (Code Review)
David Ribeiro Alves has posted comments on this change.

Change subject: disk failure: don't open tablets on failed disks
..


Patch Set 3:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/fs/file_block_manager.cc
File src/kudu/fs/file_block_manager.cc:

PS3, Line 581: // Don't report failed directories.
shoudln't we _report_ failed directories? ok if you want to do it in a follow 
up but this is likely one of the most important things we should report


PS3, Line 693: VLOG(3) << Substitute("Block $0 is in a failed directory; not 
deleting",
 : block_id.ToString());
how about using LOG_EVERY_N at INFO level, that way we'd get something in the 
logs but it woudn't spam


http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

PS3, Line 1700:   VLOG(3) << Substitute("Block $0 is in a failed directory; 
not deleting",
  : block_id.ToString());
same
also do we have checks for the other ops? why is delete particular in this 
regard?


http://gerrit.cloudera.org:8080/#/c/7766/3/src/kudu/integration-tests/disk_failure-itest.cc
File src/kudu/integration-tests/disk_failure-itest.cc:

PS3, Line 56: if (GetParam() == "file") {
:   flags.emplace_back("--block_manager=file");
: } else {
:   flags.emplace_back("--block_manager=log");
: }
maybe: flags.emplace_back(Substitute("--block_manager=$0", GetParam()));


PS3, Line 65:   void SetupDefaultCluster(vector* 
ext_tservers) {
: FLAGS_num_tablet_servers = 3;
: vector flags = GetClusterFlags();
: NO_FATALS(CreateCluster("survivable_cluster", flags, {}, /* 
num_data_dirs */ 2));
: NO_FATALS(CreateClient(_));
: CHECK_EQ(FLAGS_num_tablet_servers, tablet_servers_.size());
: ext_tservers->clear();
: for (const auto& e : tablet_servers_) {
:   
ext_tservers->push_back(cluster_->tablet_server_by_uuid(e.second->uuid()));
: }
:   }
can you add more disks? 2 is kind of special-casy because of tablet overlaps 
right?


PS3, Line 124:   NO_FATALS(SetServerSurvivalFlags(ext_tservers));
why is this not set on boot?


PS3, Line 129: Substitute("--block_manager=$0", GetParam()
why are you re-settting the block manager? can it change?


PS3, Line 138:   
ASSERT_OK(cluster_->WaitForTabletsRunning(cluster_->tablet_server(0),
 : kNumTablets, 
kAgreementTimeout));
so remind me again what happened to the failed tablets? were they replicated? 
where?


PS3, Line 137:   // Ensure again that the tablets get to a running, consistent 
state.
 :   
ASSERT_OK(cluster_->WaitForTabletsRunning(cluster_->tablet_server(0),
 : kNumTablets, 
kAgreementTimeout));
 :   ASSERT_OK(WaitForServersToAgree(kAgreementTimeout, 
tablet_servers_, tablet_id, 1));
not sure I get this. you inject a failure, wait for it to happen and then make 
sure all tablets are started? and why are they agreeing on index 1, isn't that 
kind of a given even with the failures?


-- 
To view, visit http://gerrit.cloudera.org:8080/7766
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 
Gerrit-HasComments: Yes


[kudu-CR] disk failure: don't open tablets on failed disks

2017-08-21 Thread Andrew Wong (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7766

to look at the new patch set (#3).

Change subject: disk failure: don't open tablets on failed disks
..

disk failure: don't open tablets on failed disks

Currently Kudu servers open the FS layout with failed disks. However,
the moment tablets attempt to bootstrap (i.e. open blocks, etc.), they
will attempt to read from the failed disk and fail. This can be avoided
by checking whether a tablet's disk group contains a failed disk before
attempting to read data from the tablet. If so, the tablet should be
marked as having an error so it can be reassigned.

The default behavior of the 'fs_target_data_dirs_per_tablet' flag is
updated to take into account disk state when assigning new directory
groups. This allows the tablet to be reassigned to a server without
being spread across a failed directory.

Testing is done by loading data into a cluster with multi-disk
servers, failing a single directory of one of the servers, and ensuring
that the tablets spread across the failed disk get replicated upon the
next startup.

Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
---
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/disk_failure-itest.cc
M src/kudu/tserver/ts_tablet_manager.cc
7 files changed, 222 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/66/7766/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7766
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: David Ribeiro Alves 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon 


[kudu-CR] disk failure: don't open tablets on failed disks

2017-08-21 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change.

Change subject: disk failure: don't open tablets on failed disks
..


Patch Set 2:

This is currently dependent on https://gerrit.cloudera.org/#/c/7440/, which 
will change given some of the work Mike's been doing.

I don't think those changes will affect the contents of this patch, and I would 
still appreciate comments here.

-- 
To view, visit http://gerrit.cloudera.org:8080/7766
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: No


[kudu-CR] disk failure: don't open tablets on failed disks

2017-08-21 Thread Andrew Wong (Code Review)
Andrew Wong has uploaded a new patch set (#2).

Change subject: disk failure: don't open tablets on failed disks
..

disk failure: don't open tablets on failed disks

Currently Kudu servers open the FS layout with failed disks. However,
the moment tablets attempt to bootstrap (i.e. open blocks, etc.), they
will attempt to read from the failed disk and fail. This can be avoided
by checking whether a tablet's disk group contains a failed disk before
attempting to read data from the tablet. If so, the tablet should be
marked as having an error so it can be reassigned.

The default behavior of the 'fs_target_data_dirs_per_tablet' flag is
updated to take into account disk state when assigning new directory
groups. This allows the tablet to be reassigned to a server without
being spread across a failed directory.

Testing is done by loading data into a cluster with multi-disk
servers, failing a single directory of one of the servers, and ensuring
that the tablets spread across the failed disk get replicated upon the
next startup.

Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
---
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/file_block_manager.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/disk_failure-itest.cc
M src/kudu/tserver/ts_tablet_manager.cc
7 files changed, 247 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/66/7766/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7766
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id3fae98355657f6aa4b134c542f92fc07f5c0aa1
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins