[kudu-CR] [docs] Document how to recover from a majority failed tablet

2017-10-26 Thread Will Berkeley (Code Review)
Hello Kudu Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8402

to look at the new patch set (#2).

Change subject: [docs] Document how to recover from a majority failed tablet
..

[docs] Document how to recover from a majority failed tablet

This adds some docs on how to recover when a tablet can no longer find
a majority due to the permanent failure of replicas. Manual
intervention is required and boils down to two steps:

1. copy the data from a healthy replica to where the revived replicas
will be
2. set the consensus configuration of the tablet so it matches the new
locations of replicas
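
A minimal sketch of the two steps above, written as Python helpers that
only build the command lines and do not run them. The subcommand names
(`kudu local_replica copy_from_remote`, `kudu local_replica cmeta
rewrite_raft_config`) and the `--fs_wal_dir` / `--fs_data_dirs` flags are
assumptions about the Kudu CLI and should be checked against `kudu --help`
for the version in use. All tablet IDs, UUIDs, hosts, and paths are
hypothetical placeholders.

```python
from typing import List, Tuple

# (uuid, host, port) of a tablet server that should hold a replica.
Peer = Tuple[str, str, int]


def fs_flags(wal_dir: str, data_dirs: List[str]) -> List[str]:
    """Flags that point the tool at the local tablet server's directories."""
    return [f"--fs_wal_dir={wal_dir}", f"--fs_data_dirs={','.join(data_dirs)}"]


def copy_command(tablet_id: str, source: str,
                 wal_dir: str, data_dirs: List[str]) -> List[str]:
    """Step 1: copy the tablet from a healthy replica (source is host:port)."""
    # Assumed subcommand: kudu local_replica copy_from_remote <tablet_id> <source>
    return (["kudu", "local_replica", "copy_from_remote", tablet_id, source]
            + fs_flags(wal_dir, data_dirs))


def rewrite_config_command(tablet_id: str, peers: List[Peer],
                           wal_dir: str, data_dirs: List[str]) -> List[str]:
    """Step 2: rewrite the on-disk cmeta so it lists the new replica locations."""
    # Assumed peer format: uuid:host:port, one argument per replica.
    peer_args = [f"{uuid}:{host}:{port}" for uuid, host, port in peers]
    return (["kudu", "local_replica", "cmeta", "rewrite_raft_config", tablet_id]
            + peer_args + fs_flags(wal_dir, data_dirs))
```

Both helpers return argument lists that could be handed to
`subprocess.run` on each affected tablet server once that server has been
stopped.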

Step 2 requires downtime even for healthy replicas, since new servers
can't be added to consensus configs without either rewriting the on-disk
cmeta or having a majority available. It might be worth allowing a tool
to bypass this restriction so that healthy tablet servers don't need to
be shut down in order to recover tablets on unhealthy ones.

I tested this procedure by failing tablets in various ways:
- deleting important bits like cmeta or tablet metadata
- deleting entire data dirs
- tombstoning 2/3 replicas (and disabling tombstoned voting)
and I was always able to recover using these instructions.
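
The "2/3 replicas" case above is the smallest instance of a tablet losing
its majority. A minimal sketch of the arithmetic, assuming a strict
majority of the configured replicas is required (the function name is
hypothetical and not part of Kudu):

```python
def has_majority(total_replicas: int, healthy_replicas: int) -> bool:
    """Return True if the healthy replicas form a strict majority."""
    return healthy_replicas > total_replicas // 2


# With 3 replicas, losing 2 leaves 1 healthy replica: no majority.
print(has_majority(3, 1))  # False
# Losing only 1 of 3 still leaves a majority of 2.
print(has_majority(3, 2))  # True
```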

Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
---
M docs/administration.adoc
1 file changed, 104 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/8402/2
--
To view, visit http://gerrit.cloudera.org:8080/8402
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 2
Gerrit-Owner: Will Berkeley 
Gerrit-Reviewer: Kudu Jenkins


[kudu-CR] [docs] Document how to recover from a majority failed tablet

2017-10-26 Thread Will Berkeley (Code Review)
Will Berkeley has uploaded this change for review. (
http://gerrit.cloudera.org:8080/8402 )


Change subject: [docs] Document how to recover from a majority failed tablet
..

[docs] Document how to recover from a majority failed tablet

This adds some docs on how to recover when a tablet can no longer find
a majority due to the permanent failure of replicas. Manual
intervention is required and boils down to two steps:

1. copy the data from a healthy replica to where the revived replicas
will be
2. set the consensus configuration of the tablet so it matches the new
locations of replicas

Step 2 requires downtime even for healthy replicas, since new servers
can't be added to consensus configs without either rewriting the on-disk
cmeta or having a majority available. It might be worth allowing a tool
to bypass this restriction so that healthy tablet servers don't need to
be shut down in order to recover tablets on unhealthy ones.

I tested this procedure by failing tablets in various ways:
- deleting important bits like cmeta or tablet metadata
- deleting entire data dirs
- tombstoning 2/3 replicas (and disabling tombstoned voting)
and I was always able to recover using these instructions.

Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
---
M docs/administration.adoc
1 file changed, 104 insertions(+), 0 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/02/8402/1
--

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ic6326f65d029a1cd75e487b16ce5be4baea2f215
Gerrit-Change-Number: 8402
Gerrit-PatchSet: 1
Gerrit-Owner: Will Berkeley