[kudu-CR] handle disk failures during tablet copies

2017-11-21 Thread Andrew Wong (Code Review)
Andrew Wong has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/7654 )

Change subject: handle disk failures during tablet copies
..

handle disk failures during tablet copies

There are two components in a tablet copy: the copy client (that
receives data) and the copy session source (that sends data).

Coarse-grain handling of disk failures during tablet copies is done for
the tablet copy client as:
- Before starting a copy client, if no disks are available to place the
  tablet, simply return (instead of failing a CHECK).
- Before downloading each WAL segment or block, check that the tablet
  is in a healthy group.
And for the tablet copy session as:
- Before sending a block or log segment, check if the tablet has an
  error.

Upon returning an error, the tablet copy client will shut down the
replica, leaving it in a failed state.

A test is added to ensure that both copy clients and source sessions
with failed disks return errors to the copying client.

Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
Reviewed-on: http://gerrit.cloudera.org:8080/7654
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy 
---
M src/kudu/tserver/tablet_copy-test-base.h
M src/kudu/tserver/tablet_copy_client-test.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_client.h
M src/kudu/tserver/tablet_copy_service-test.cc
M src/kudu/tserver/tablet_copy_source_session.cc
M src/kudu/tserver/tablet_copy_source_session.h
7 files changed, 124 insertions(+), 7 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Mike Percy: Looks good to me, approved
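
The client-side check described in the commit message can be pictured roughly as follows. This is only a sketch under stated assumptions, not the code in the patch: `Status` is a minimal stand-in for Kudu's class, and `SketchCopyClient`, `DownloadAll`, and the health callback are hypothetical names introduced for illustration. Only the name `CheckHealthyDirGroup` appears in the review thread, and its signature here is assumed.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Minimal stand-in for kudu::Status (assumption: the real class is richer).
class Status {
 public:
  static Status OK() { return Status(true, ""); }
  static Status IOError(std::string msg) { return Status(false, std::move(msg)); }
  bool ok() const { return ok_; }
  const std::string& message() const { return msg_; }

 private:
  Status(bool ok, std::string msg) : ok_(ok), msg_(std::move(msg)) {}
  bool ok_;
  std::string msg_;
};

// Hypothetical copy client. It re-checks the health of the tablet's
// directory group before every WAL segment or block it downloads.
class SketchCopyClient {
 public:
  explicit SketchCopyClient(std::function<bool()> dir_group_healthy)
      : dir_group_healthy_(std::move(dir_group_healthy)) {}

  Status DownloadAll(const std::vector<std::string>& items) {
    for (const std::string& item : items) {
      // Stop at the first sign of a failed disk instead of crashing.
      Status s = CheckHealthyDirGroup();
      if (!s.ok()) return s;
      downloaded_.push_back(item);  // stands in for the actual transfer
    }
    return Status::OK();
  }

  const std::vector<std::string>& downloaded() const { return downloaded_; }

 private:
  Status CheckHealthyDirGroup() const {
    if (!dir_group_healthy_()) {
      return Status::IOError("directory group contains a failed directory");
    }
    return Status::OK();
  }

  std::function<bool()> dir_group_healthy_;
  std::vector<std::string> downloaded_;
};
```

In this sketch a disk that fails partway through the copy surfaces as an error on the next item, and the items downloaded before the failure remain visible to the caller.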

--
To view, visit http://gerrit.cloudera.org:8080/7654
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
Gerrit-Change-Number: 7654
Gerrit-PatchSet: 10
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot


[kudu-CR] handle disk failures during tablet copies

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7654 )

Change subject: handle disk failures during tablet copies
..


Patch Set 9: Code-Review+2



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
Gerrit-Change-Number: 7654
Gerrit-PatchSet: 9
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Comment-Date: Wed, 22 Nov 2017 05:26:11 +
Gerrit-HasComments: No


[kudu-CR] handle disk failures during tablet copies

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7654 )

Change subject: handle disk failures during tablet copies
..


Patch Set 8: Code-Review+2



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
Gerrit-Change-Number: 7654
Gerrit-PatchSet: 8
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:34:14 +
Gerrit-HasComments: No


[kudu-CR] handle disk failures during tablet copies

2017-11-21 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7654 )

Change subject: handle disk failures during tablet copies
..


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7654/7/src/kudu/tserver/tablet_copy_source_session.cc
File src/kudu/tserver/tablet_copy_source_session.cc:

http://gerrit.cloudera.org:8080/#/c/7654/7/src/kudu/tserver/tablet_copy_source_session.cc@133
PS7, Line 133: ));
> we can remove this now
Done




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
Gerrit-Change-Number: 7654
Gerrit-PatchSet: 8
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Comment-Date: Tue, 21 Nov 2017 22:30:43 +
Gerrit-HasComments: Yes


[kudu-CR] handle disk failures during tablet copies

2017-11-21 Thread Andrew Wong (Code Review)
Hello Tidy Bot, Mike Percy, Kudu Jenkins, Adar Dembo,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7654

to look at the new patch set (#8).

Change subject: handle disk failures during tablet copies
..

handle disk failures during tablet copies

There are two components in a tablet copy: the copy client (that
receives data) and the copy session source (that sends data).

Coarse-grain handling of disk failures during tablet copies is done for
the tablet copy client as:
- Before starting a copy client, if no disks are available to place the
  tablet, simply return (instead of failing a CHECK).
- Before downloading each WAL segment or block, check that the tablet
  is in a healthy group.
And for the tablet copy session as:
- Before sending a block or log segment, check if the tablet has an
  error.

Upon returning an error, the tablet copy client will shut down the
replica, leaving it in a failed state.

A test is added to ensure that both copy clients and source sessions
with failed disks return errors to the copying client.

Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
---
M src/kudu/tserver/tablet_copy-test-base.h
M src/kudu/tserver/tablet_copy_client-test.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_client.h
M src/kudu/tserver/tablet_copy_service-test.cc
M src/kudu/tserver/tablet_copy_source_session.cc
M src/kudu/tserver/tablet_copy_source_session.h
7 files changed, 124 insertions(+), 7 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/54/7654/8

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
Gerrit-Change-Number: 7654
Gerrit-PatchSet: 8
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot


[kudu-CR] handle disk failures during tablet copies

2017-11-21 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7654 )

Change subject: handle disk failures during tablet copies
..


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7654/7/src/kudu/tserver/tablet_copy_source_session.cc
File src/kudu/tserver/tablet_copy_source_session.cc:

http://gerrit.cloudera.org:8080/#/c/7654/7/src/kudu/tserver/tablet_copy_source_session.cc@133
PS7, Line 133: nullptr
we can remove this now




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
Gerrit-Change-Number: 7654
Gerrit-PatchSet: 7
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Comment-Date: Tue, 21 Nov 2017 20:31:29 +
Gerrit-HasComments: Yes


[kudu-CR] handle disk failures during tablet copies

2017-11-21 Thread Andrew Wong (Code Review)
Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7654 )

Change subject: handle disk failures during tablet copies
..


Patch Set 7:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/7654/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7654/6//COMMIT_MSG@16
PS6, Line 16: WAL
> WAL segment
Done


http://gerrit.cloudera.org:8080/#/c/7654/6/src/kudu/tserver/tablet_copy_source_session.cc
File src/kudu/tserver/tablet_copy_source_session.cc:

http://gerrit.cloudera.org:8080/#/c/7654/6/src/kudu/tserver/tablet_copy_source_session.cc@133
PS6, Line 133: nullptr
> nit: since this is an optional out-param of the function, defaulting it to
Done with default


http://gerrit.cloudera.org:8080/#/c/7654/6/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

http://gerrit.cloudera.org:8080/#/c/7654/6/src/kudu/tserver/ts_tablet_manager.cc@694
PS6, Line 694:   Status s = tc_client.FetchAll(replica);
             :   if (!s.ok()) {
             :     LOG(WARNING) << LogPrefix(tablet_id) << "Tablet Copy: Unable to fetch data from remote peer "
             :                  << kSrcPeerInfo << ": " << s.ToString();
             :     r
> There is no need for this; the TabletCopyClient destructor will run Abort()
Done


http://gerrit.cloudera.org:8080/#/c/7654/6/src/kudu/tserver/ts_tablet_manager.cc@992
PS6, Line 992: (elapsed_ms > FLAGS_tablet_start_warn_threshold_ms) {
> that should have already happened above on line 972, right?
Ah, this should be "while starting", although I think this change could be 
pushed to the "handle failures at runtime" patch, since only then can errors 
get set in the replica externally.




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
Gerrit-Change-Number: 7654
Gerrit-PatchSet: 7
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Comment-Date: Tue, 21 Nov 2017 18:17:49 +
Gerrit-HasComments: Yes


[kudu-CR] handle disk failures during tablet copies

2017-11-21 Thread Andrew Wong (Code Review)
Hello Tidy Bot, Mike Percy, Kudu Jenkins, Adar Dembo,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7654

to look at the new patch set (#7).

Change subject: handle disk failures during tablet copies
..

handle disk failures during tablet copies

There are two components in a tablet copy: the copy client (that
receives data) and the copy session source (that sends data).

Coarse-grain handling of disk failures during tablet copies is done for
the tablet copy client as:
- Before starting a copy client, if no disks are available to place the
  tablet, simply return (instead of failing a CHECK).
- Before downloading each WAL segment or block, check that the tablet
  is in a healthy group.
And for the tablet copy session as:
- Before sending a block or log segment, check if the tablet has an
  error.

Upon returning an error, the tablet copy client will shut down the
replica, leaving it in a failed state.

A test is added to ensure that both copy clients and source sessions
with failed disks return errors to the copying client.

Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
---
M src/kudu/tserver/tablet_copy-test-base.h
M src/kudu/tserver/tablet_copy_client-test.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_client.h
M src/kudu/tserver/tablet_copy_service-test.cc
M src/kudu/tserver/tablet_copy_source_session.cc
M src/kudu/tserver/tablet_copy_source_session.h
7 files changed, 124 insertions(+), 7 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/54/7654/7

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
Gerrit-Change-Number: 7654
Gerrit-PatchSet: 7
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot


[kudu-CR] handle disk failures during tablet copies

2017-11-20 Thread Mike Percy (Code Review)
Mike Percy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7654 )

Change subject: handle disk failures during tablet copies
..


Patch Set 6:

(4 comments)

Some of these changes make sense but see my comments about Abort()

http://gerrit.cloudera.org:8080/#/c/7654/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7654/6//COMMIT_MSG@16
PS6, Line 16: WALs
WAL segment


http://gerrit.cloudera.org:8080/#/c/7654/6/src/kudu/tserver/tablet_copy_source_session.cc
File src/kudu/tserver/tablet_copy_source_session.cc:

http://gerrit.cloudera.org:8080/#/c/7654/6/src/kudu/tserver/tablet_copy_source_session.cc@133
PS6, Line 133: nullptr
nit: since this is an optional out-param of the function, defaulting it to 
nullptr in the header file might be the user-friendlier option. Otherwise, 
would be helpful to add a comment to document what this is, like:

  RETURN_NOT_OK(CheckHealthyDirGroup(/*error_code=*/ nullptr));
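
To illustrate the first option (a default argument in the header), here is a minimal sketch. It is not the actual Kudu declaration: the `ErrorCode` enum, the `SketchSession` class, and the `bool` return type are assumptions made to keep the example self-contained. Only the function name and the `error_code` out-param come from the comment above.

```cpp
#include <cassert>

// Hypothetical error codes; the real enum lives in Kudu's protobufs.
enum class ErrorCode { kNone, kIOError };

class SketchSession {
 public:
  explicit SketchSession(bool healthy) : healthy_(healthy) {}

  // 'error_code' is an optional out-param. Defaulting it to nullptr lets
  // callers that do not need the code omit the argument entirely.
  bool CheckHealthyDirGroup(ErrorCode* error_code = nullptr) const {
    if (healthy_) return true;
    if (error_code != nullptr) *error_code = ErrorCode::kIOError;
    return false;
  }

 private:
  bool healthy_;
};
```

With the default in place, a call site can write `CheckHealthyDirGroup()` rather than `CheckHealthyDirGroup(/*error_code=*/ nullptr)`, and callers that want the code still pass a pointer.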


http://gerrit.cloudera.org:8080/#/c/7654/6/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

http://gerrit.cloudera.org:8080/#/c/7654/6/src/kudu/tserver/ts_tablet_manager.cc@694
PS6, Line 694:   // In case of failure, shutdown the replica.
             :   auto failure_cleanup = MakeScopedCleanup([&] {
             :     replica->SetError(s);
             :     replica->Shutdown();
             :   });
There is no need for this; the TabletCopyClient destructor will run Abort() and 
tombstone the replica if it didn't succeed.
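
The destructor behavior referred to here follows the usual RAII pattern, which can be sketched as below. This is an illustration under assumptions, not the TabletCopyClient source: `AbortOnDestroyClient`, `SketchReplica`, and `Finish()` are hypothetical names, and tombstoning is reduced to setting a flag.

```cpp
#include <cassert>

// Hypothetical replica; tombstoning is modeled as a simple flag.
struct SketchReplica {
  bool tombstoned = false;
};

// Hypothetical client whose destructor aborts the copy unless it finished.
class AbortOnDestroyClient {
 public:
  explicit AbortOnDestroyClient(SketchReplica* replica) : replica_(replica) {}
  AbortOnDestroyClient(const AbortOnDestroyClient&) = delete;
  AbortOnDestroyClient& operator=(const AbortOnDestroyClient&) = delete;

  // Cleanup runs on every exit path, so callers need no scoped cleanup
  // of their own for the failure case.
  ~AbortOnDestroyClient() { Abort(); }

  // Marks the copy as successful; Abort() becomes a no-op afterwards.
  void Finish() { finished_ = true; }

  // Idempotent: does nothing after success or after a previous abort.
  void Abort() {
    if (finished_ || aborted_) return;
    aborted_ = true;
    replica_->tombstoned = true;
  }

 private:
  SketchReplica* replica_;
  bool finished_ = false;
  bool aborted_ = false;
};
```

Because the cleanup lives in the destructor, an early return after a failed fetch leaves the replica tombstoned without any extra code at the call site.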


http://gerrit.cloudera.org:8080/#/c/7654/6/src/kudu/tserver/ts_tablet_manager.cc@992
PS6, Line 992: / If the replica was marked failed while bootstrapping, abort.
that should have already happened above on line 972, right?




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
Gerrit-Change-Number: 7654
Gerrit-PatchSet: 6
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy 
Gerrit-Reviewer: Tidy Bot
Gerrit-Comment-Date: Tue, 21 Nov 2017 06:03:57 +
Gerrit-HasComments: Yes


[kudu-CR] handle disk failures during tablet copies

2017-11-20 Thread Andrew Wong (Code Review)
Hello Tidy Bot, Kudu Jenkins, Adar Dembo,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7654

to look at the new patch set (#6).

Change subject: handle disk failures during tablet copies
..

handle disk failures during tablet copies

There are two components in a tablet copy: the copy client (that
receives data) and the copy session source (that sends data).

Coarse-grain handling of disk failures during tablet copies is done for
the tablet copy client as:
* Before starting a copy client, if no disks are available to place the
  tablet, simply return (instead of failing a CHECK).
* Before downloading each WALs or block, check that the tablet is in a
  healthy group.
And for the tablet copy session as:
* Before sending a block or log segment, check if the tablet has an
  error.

Upon returning an error, the tablet copy client will shut down the
replica, leaving it in a failed state.

A test is added to ensure that both copy clients and source sessions
with failed disks return errors to the copying client.

Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
---
M src/kudu/tserver/tablet_copy-test-base.h
M src/kudu/tserver/tablet_copy_client-test.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_client.h
M src/kudu/tserver/tablet_copy_service-test.cc
M src/kudu/tserver/tablet_copy_source_session.cc
M src/kudu/tserver/tablet_copy_source_session.h
M src/kudu/tserver/ts_tablet_manager.cc
8 files changed, 140 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/54/7654/6

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
Gerrit-Change-Number: 7654
Gerrit-PatchSet: 6
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot


[kudu-CR] handle disk failures during tablet copies

2017-11-20 Thread Andrew Wong (Code Review)
Hello Tidy Bot, Kudu Jenkins, Adar Dembo,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7654

to look at the new patch set (#3).

Change subject: handle disk failures during tablet copies
..

handle disk failures during tablet copies

There are two components in a tablet copy: the copy client (that
receives data) and the copy session source (that sends data).

Coarse-grain handling of disk failures during tablet copies is done for
the tablet copy client as:
* Before starting a copy client, if no disks are available to place the
  tablet, simply return (instead of failing a CHECK).
* Before downloading each WALs or block, check that the tablet is in a
  healthy group.
And for the tablet copy session as:
* Before sending a block or log segment, check if the tablet has an
  error.

Upon returning an error, the tablet copy client will shut down the
replica, leaving it in a failed state.

A test is added to ensure that both copy clients and source sessions
with failed disks return errors to the copying client.

Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
---
M src/kudu/tserver/tablet_copy-test-base.h
M src/kudu/tserver/tablet_copy_client-test.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_client.h
M src/kudu/tserver/tablet_copy_service-test.cc
M src/kudu/tserver/tablet_copy_source_session.cc
M src/kudu/tserver/tablet_copy_source_session.h
M src/kudu/tserver/ts_tablet_manager.cc
8 files changed, 134 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/54/7654/3

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
Gerrit-Change-Number: 7654
Gerrit-PatchSet: 3
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot


[kudu-CR] handle disk failures during tablet copies

2017-11-20 Thread Andrew Wong (Code Review)
Andrew Wong has abandoned this change. ( http://gerrit.cloudera.org:8080/8607 )

Change subject: handle disk failures during tablet copies
..


Abandoned

This is a duplicate
--
To view, visit http://gerrit.cloudera.org:8080/8607
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: Iacbfe446d01dd523fb2f2f81880e5af2551e979f
Gerrit-Change-Number: 8607
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Kudu Jenkins


[kudu-CR] handle disk failures during tablet copies

2017-11-20 Thread Andrew Wong (Code Review)
Hello Tidy Bot, Kudu Jenkins, Adar Dembo,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/7654

to look at the new patch set (#2).

Change subject: handle disk failures during tablet copies
..

handle disk failures during tablet copies

There are two components in a tablet copy: the copy client (that
receives data) and the copy session source (that sends data).

Coarse-grain handling of disk failures during tablet copies is done for
the tablet copy client as:
* Before starting a copy client, if no disks are available to place the
  tablet, simply return (instead of failing a CHECK).
* Before downloading each WALs or block, check that the tablet is in a
  healthy group.
And for the tablet copy session as:
* Before sending a block or log segment, check if the tablet has an
  error.

Upon returning an error, the tablet copy client will shut down the
replica, leaving it in a failed state.

A test is added to ensure that both copy clients and source sessions
with failed disks return errors to the copying client.

Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
---
M src/kudu/tablet/tablet.h
M src/kudu/tserver/tablet_copy-test-base.h
M src/kudu/tserver/tablet_copy_client-test.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_client.h
M src/kudu/tserver/tablet_copy_service-test.cc
M src/kudu/tserver/tablet_copy_source_session.cc
M src/kudu/tserver/tablet_copy_source_session.h
8 files changed, 119 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/54/7654/2

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
Gerrit-Change-Number: 7654
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot


[kudu-CR] handle disk failures during tablet copies

2017-11-20 Thread Andrew Wong (Code Review)
Andrew Wong has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/8607 )


Change subject: handle disk failures during tablet copies
..

handle disk failures during tablet copies

There are two components in a tablet copy: the copy client (that
receives data) and the copy session source (that sends data).

Coarse-grain handling of disk failures during tablet copies is done for
the tablet copy client as:
* Before starting a copy client, if no disks are available to place the
  tablet, simply return (instead of failing a CHECK).
* Before downloading each WALs or block, check that the tablet is in a
  healthy group.
And for the tablet copy session as:
* Before sending a block or log segment, check if the tablet has an
  error.

Upon returning an error, the tablet copy client will shut down the
replica, leaving it in a failed state.

A test is added to ensure that both copy clients and source sessions
with failed disks return errors to the copying client.

Change-Id: Iacbfe446d01dd523fb2f2f81880e5af2551e979f
---
M src/kudu/tablet/tablet.h
M src/kudu/tserver/tablet_copy-test-base.h
M src/kudu/tserver/tablet_copy_client-test.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_client.h
M src/kudu/tserver/tablet_copy_service-test.cc
M src/kudu/tserver/tablet_copy_source_session.cc
M src/kudu/tserver/tablet_copy_source_session.h
8 files changed, 119 insertions(+), 8 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/07/8607/1

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Iacbfe446d01dd523fb2f2f81880e5af2551e979f
Gerrit-Change-Number: 8607
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong 


[kudu-CR] handle disk failures during tablet copies

2017-08-16 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change.

Change subject: handle disk failures during tablet copies
..


Patch Set 1:

(3 comments)

I imagine Mike will do a more thorough review, but overall looks good to me.

http://gerrit.cloudera.org:8080/#/c/7654/1//COMMIT_MSG
Commit Message:

Line 10: receiving data) and the copy session sources (that sending data).
"receive" and "send". Or "are receiving" and "are sending".


http://gerrit.cloudera.org:8080/#/c/7654/1/src/kudu/fs/data_dirs.cc
File src/kudu/fs/data_dirs.cc:

PS1, Line 525:   if (group->uuid_indices().size() != valid_uuid_indices.size()) {
             :     return Status::IOError("Directory group contains a failed directory");
             :   }
             :   group_uuid_indices = _uuid_indices;
Unrelated to this patch?


http://gerrit.cloudera.org:8080/#/c/7654/1/src/kudu/tserver/tablet_copy_client.cc
File src/kudu/tserver/tablet_copy_client.cc:

Line 305: 
Leftover from a change since removed? Or is this stylistic?



Gerrit-MessageType: comment
Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes


[kudu-CR] handle disk failures during tablet copies

2017-08-10 Thread Andrew Wong (Code Review)
Andrew Wong has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7654

Change subject: handle disk failures during tablet copies
..

handle disk failures during tablet copies

There are two components to tablet copies: the copy clients (that
receiving data) and the copy session sources (that sending data).

Coarse-grain handling of disk failures during tablet copies is done as
follows. For tablet copy source sessions:
- if a disk fails in the session (i.e. during a call to
  ReadFileChunkToBuf, etc.), the error should handle itself at the block
  layer and return the error to the client
- if a disk fails during the session in some other thread, the next call
  to GetBlockPiece or GetLogSegmentPiece should return the error that
  failed the replica

For tablet copy clients:
- when getting next blocks, the client repeatedly gets blocks for the
  copy. If this fails, the client will fail.
- everything will handle itself at the block layer.
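
The second source-session case above (a disk failing in another thread between calls) might look something like the following sketch. The names `SketchReplica`, `SketchSourceSession`, and `MarkFailed` are hypothetical, the `bool` return with a string out-param is a simplification of Kudu's Status-based API, and only `GetBlockPiece` is a name taken from this message.

```cpp
#include <atomic>
#include <cassert>
#include <string>

// Hypothetical replica whose failed state may be set from another thread.
class SketchReplica {
 public:
  void MarkFailed() { failed_.store(true); }
  bool failed() const { return failed_.load(); }

 private:
  std::atomic<bool> failed_{false};
};

// Hypothetical source session that consults the replica before each read.
class SketchSourceSession {
 public:
  explicit SketchSourceSession(const SketchReplica* replica)
      : replica_(replica) {}

  // Returns true and fills 'data' on success. If the replica has been
  // failed since the previous call, returns false and fills 'error'.
  bool GetBlockPiece(std::string* data, std::string* error) const {
    if (replica_->failed()) {
      *error = "replica failed: disk error";
      return false;
    }
    *data = "block bytes";  // stands in for reading a chunk from disk
    return true;
  }

 private:
  const SketchReplica* replica_;
};
```

The check happens on every call, so a failure recorded between two requests is reported to the client on the next request rather than being discovered by a failed read.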

Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
---
M src/kudu/fs/data_dirs.cc
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_source_session.cc
M src/kudu/tserver/ts_disk_failure-test.cc
4 files changed, 94 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/54/7654/1

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ic18d93c218ea13f3086f420a4847cb5e29a47bc7
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong