[jira] [Commented] (KUDU-1618) Add local_replica tool to delete a replica

2016-10-25 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606498#comment-15606498
 ] 

Todd Lipcon commented on KUDU-1618:
---

If I recall correctly, ksck does call ListTablets on each of the tablet 
servers, in which case it could notice replicas hosted on tservers where they 
"shouldn't be".
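
A minimal sketch of the cross-check described above, assuming ksck already has the raft config of each tablet as reported by the master and the ListTablets result of each tserver. The function name and both input dictionaries are hypothetical illustrations, not Kudu APIs:

```python
from typing import Dict, List, Set, Tuple


def find_extra_replicas(
    raft_configs: Dict[str, Set[str]],
    hosted_tablets: Dict[str, Set[str]],
) -> List[Tuple[str, str]]:
    """Return (tserver_uuid, tablet_id) pairs hosted outside the raft config.

    raft_configs:   tablet_id -> tserver UUIDs in the tablet's raft config
                    (hypothetical shape; what the master reports).
    hosted_tablets: tserver UUID -> tablet_ids the tserver says it hosts
                    (hypothetical shape; what ListTablets returns).
    """
    extras = []
    for tserver_uuid, tablet_ids in hosted_tablets.items():
        for tablet_id in tablet_ids:
            # A tablet unknown to the master has an empty config here,
            # so any replica of it is also reported as extra.
            if tserver_uuid not in raft_configs.get(tablet_id, set()):
                extras.append((tserver_uuid, tablet_id))
    return sorted(extras)
```

For example, with a config of {"t1": {"a", "b", "c"}} and a tserver "d" that also lists "t1", the only pair reported would be ("d", "t1").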

> Add local_replica tool to delete a replica
> --
>
> Key: KUDU-1618
> URL: https://issues.apache.org/jira/browse/KUDU-1618
> Project: Kudu
> Issue Type: Improvement
> Components: ops-tooling
> Affects Versions: 1.0.0
> Reporter: Todd Lipcon
> Assignee: Dinesh Bhat
>
> Occasionally we've hit cases where a tablet is corrupt in such a way that the 
> tserver fails to start or crashes soon after starting. Typically we'd prefer 
> the tablet just get marked FAILED but in the worst case it causes the whole 
> tserver to fail.
> For these cases we should add a 'local_replica' subtool to fully remove a 
> local tablet. Related, it might be useful to have a 'local_replica archive' 
> which would create a tarball from the data in this tablet for later 
> examination by developers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KUDU-1618) Add local_replica tool to delete a replica

2016-10-25 Thread Dinesh Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606212#comment-15606212
 ] 

Dinesh Bhat commented on KUDU-1618:
---

Thanks [~tlipcon], I agree with your points above that this is not a bug. I 
confirmed that ksck currently doesn't know about the spurious replica 
resulting from the tool's action. I wonder whether it's even possible to show 
this info via ksck, because I guess these reports are not sent to the master?


[jira] [Commented] (KUDU-1618) Add local_replica tool to delete a replica

2016-10-14 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577292#comment-15577292
 ] 

Todd Lipcon commented on KUDU-1618:
---

bq. Todd Lipcon thanks for the quick reply. By 'shouldn't have a replica' in 
the above comment, you meant that the current tablet server, where we are 
trying to bring up the replica, is no longer part of the raft config for that 
tablet, right? It has other tservers as replicas at this point. That makes 
sense. I believe the tserver keeps trying until a future change_config brings 
in this tserver as a replica for that tablet.

Right, when you copied the replica it copied the new configuration, and this 
node is not a part of that configuration. So, it knows that it shouldn't try 
to get the other nodes to vote for it. It would be reasonable to say that we 
should detect this scenario and mark the tablet as 'failed', but it's actually 
somewhat useful occasionally -- e.g., I've used this before to copy a tablet 
from a running cluster onto my laptop so I could then use tools like 
'dump_tablet' against it locally. Given that I don't think you can get into 
this state without using explicit tablet copy repair tools, I don't think it 
should really be considered a bug.
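
To illustrate the membership check described above, here is a simplified model, not Kudu's actual consensus code. The RaftConfig class and should_start_election function are hypothetical names invented for this sketch:

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class RaftConfig:
    # UUIDs of the tservers that are members of this tablet's raft config
    # (hypothetical shape, for illustration only).
    member_uuids: FrozenSet[str]


def should_start_election(local_uuid: str, config: RaftConfig) -> bool:
    """A copied replica only asks for votes if its own node is in the config."""
    return local_uuid in config.member_uuids


# The copied replica carries the new config, which omits the local node,
# so it stays up without trying to get elected.
copied_config = RaftConfig(member_uuids=frozenset({"ts-a", "ts-b", "ts-c"}))
local_can_campaign = should_start_election("ts-laptop", copied_config)
```

In this model the replica's data can still be loaded and inspected locally, which matches the laptop use case mentioned above.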

bq. One follow-up question: what state should the replica be in after step 6? 
I see it in RUNNING state, which was slightly confusing, because this replica 
isn't an active replica at this point.

The tablet's state refers more to the data layer. It's up and running, it has 
replayed its log, it has valid data, etc. So it's RUNNING even though it's not 
actually an active part of any raft configuration. If you run ksck on this 
cluster, does it report the "extra" replica anywhere? That might be a useful 
thing for ksck to do so we can detect if this ever happens in real life.


[jira] [Commented] (KUDU-1618) Add local_replica tool to delete a replica

2016-10-14 Thread Dinesh Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576389#comment-15576389
 ] 

Dinesh Bhat commented on KUDU-1618:
---

[~tlipcon] thanks for the quick reply. By 'shouldn't have a replica' in the 
above comment, you meant that the current tablet server, where we are trying 
to bring up the replica, is no longer part of the raft config for that tablet, 
right? It has other tservers as replicas at this point. That makes sense. I 
believe the tserver keeps trying until a future change_config brings in this 
tserver as a replica for that tablet.
One follow-up question: what state should the replica be in after step 6? I 
see it in RUNNING state, which was slightly confusing, because this replica 
isn't an active replica at this point.


[jira] [Commented] (KUDU-1618) Add local_replica tool to delete a replica

2016-10-14 Thread Todd Lipcon (JIRA)

[ 
https://issues.apache.org/jira/browse/KUDU-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576300#comment-15576300
 ] 

Todd Lipcon commented on KUDU-1618:
---

This seems like expected behavior to me. You created a replica on a node that 
was removed from the raft config, so when it starts up, it's confused because 
the metadata says it shouldn't have a replica.
