Re: Missing replication metadata

2017-07-24 Thread Josh Elser

Sounds good.

Just opened ACCUMULO-4684 for docs.

On 7/24/17 2:13 PM, Adam J. Shook wrote:
Thanks, Josh.  As this is our stage cluster, we aren't too worried about 
the missing data; I just want to clean up the metadata so the queue 
looks better.  I'll take the back-fill approach and see how that goes.


--Adam

On Mon, Jul 24, 2017 at 1:55 PM, Josh Elser wrote:




On 7/24/17 1:44 PM, Adam J. Shook wrote:

We had some corrupt WAL blocks on our stage environment the
other day and opted to delete them.  We now have some missing
metadata and about 3k files pending for replication.  I've dug
into it a bit and noticed that many of the WALs in the `order`
queue of the replication table A) no longer exist in HDFS and B)
have no entries in the `repl` section of the replication table.

Based on the code, if there are no entries in the `repl`
section, then the work will never be queued for completion via
ZooKeeper and therefore never finished -- does this make sense?


Yeah, that sounds about right. I'm lamenting that I never wrote up
docs for the user-manual to cover the table-schema. I should ... do
that...

I think the order entry is created when the repl entry is. Would
have to dig back into code though.

What'd be the suggestion here to proceed?  I'm thinking a one-off
tool to backfill the `repl` section should do the trick, but I am
wondering if this is something that should be changed in Accumulo?


A tool to back-fill makes sense to me. I'm not sure what we could do
in Accumulo automatically. Any time there is data-loss (data gone
missing or old data coming back), Accumulo really can't do anything
on its own. As you described in your scenario, you made the
conscious decision to nuke the files with missing blocks. However,
providing tools to handle "common" failure scenarios outside of our
purview sounds like a good idea.

Improving our docs around how to "re-sync" two tables being
replicated would also be great. We have the hammer via
snapshot+export, just need to be clear with the instructions.

Cheers,
--Adam




Re: Missing replication metadata

2017-07-24 Thread Adam J. Shook
Thanks, Josh.  As this is our stage cluster, we aren't too worried about
the missing data; I just want to clean up the metadata so the queue looks
better.  I'll take the back-fill approach and see how that goes.

--Adam

On Mon, Jul 24, 2017 at 1:55 PM, Josh Elser wrote:

>
>
> On 7/24/17 1:44 PM, Adam J. Shook wrote:
>
>> We had some corrupt WAL blocks on our stage environment the other day and
>> opted to delete them.  We now have some missing metadata and about 3k files
>> pending for replication.  I've dug into it a bit and noticed that many of
>> the WALs in the `order` queue of the replication table A) no longer exist
>> in HDFS and B) have no entries in the `repl` section of the replication
>> table.
>>
>> Based on the code, if there are no entries in the `repl` section, then
>> the work will never be queued for completion via ZooKeeper and therefore
>> never finished -- does this make sense?
>>
>
> Yeah, that sounds about right. I'm lamenting that I never wrote up docs
> for the user-manual to cover the table-schema. I should ... do that...
>
> I think the order entry is created when the repl entry is. Would have to
> dig back into code though.
>
>> What'd be the suggestion here to proceed?  I'm thinking a one-off tool to
>> backfill the `repl` section should do the trick, but I am wondering if this
>> is something that should be changed in Accumulo?
>>
>
> A tool to back-fill makes sense to me. I'm not sure what we could do in
> Accumulo automatically. Any time there is data-loss (data gone missing or
> old data coming back), Accumulo really can't do anything on its own. As you
> described in your scenario, you made the conscious decision to nuke the
> files with missing blocks. However, providing tools to handle "common"
> failure scenarios outside of our purview sounds like a good idea.
>
> Improving our docs around how to "re-sync" two tables being replicated
> would also be great. We have the hammer via snapshot+export, just need to
> be clear with the instructions.
>
>> Cheers,
>> --Adam
>>
>


Re: Missing replication metadata

2017-07-24 Thread Josh Elser



On 7/24/17 1:44 PM, Adam J. Shook wrote:
> We had some corrupt WAL blocks on our stage environment the other day
> and opted to delete them.  We now have some missing metadata and about
> 3k files pending for replication.  I've dug into it a bit and noticed
> that many of the WALs in the `order` queue of the replication table A)
> no longer exist in HDFS and B) have no entries in the `repl` section of
> the replication table.
>
> Based on the code, if there are no entries in the `repl` section, then
> the work will never be queued for completion via ZooKeeper and therefore
> never finished -- does this make sense?

Yeah, that sounds about right. I'm lamenting that I never wrote up docs
for the user-manual to cover the table-schema. I should ... do that...

I think the order entry is created when the repl entry is. Would have to
dig back into code though.

> What'd be the suggestion here to proceed?  I'm thinking a one-off tool
> to backfill the `repl` section should do the trick, but I am wondering
> if this is something that should be changed in Accumulo?

A tool to back-fill makes sense to me. I'm not sure what we could do in
Accumulo automatically. Any time there is data-loss (data gone missing
or old data coming back), Accumulo really can't do anything on its own.
As you described in your scenario, you made the conscious decision to
nuke the files with missing blocks. However, providing tools to handle
"common" failure scenarios outside of our purview sounds like a good idea.

Improving our docs around how to "re-sync" two tables being replicated
would also be great. We have the hammer via snapshot+export, just need
to be clear with the instructions.

> Cheers,
> --Adam


Missing replication metadata

2017-07-24 Thread Adam J. Shook
We had some corrupt WAL blocks on our stage environment the other day and
opted to delete them.  We now have some missing metadata and about 3k files
pending for replication.  I've dug into it a bit and noticed that many of
the WALs in the `order` queue of the replication table A) no longer exist
in HDFS and B) have no entries in the `repl` section of the replication
table.

Based on the code, if there are no entries in the `repl` section, then the
work will never be queued for completion via ZooKeeper and therefore never
finished -- does this make sense?  What'd be the suggestion here to
proceed?  I'm thinking a one-off tool to backfill the `repl` section should
do the trick, but I am wondering if this is something that should be
changed in Accumulo?

Cheers,
--Adam
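
For readers following this thread, the check described above (entries in the
`order` queue with no matching `repl` entry, and whether the WAL still exists
in HDFS) can be sketched roughly as follows. This is a minimal sketch in
Python over an in-memory model, not Accumulo client code: the `Record` and
`Orphan` types, the `find_orphans` function, the `wal_exists` callback, and
the sample paths and table ids are all hypothetical, and the real replication
table's row and column encoding is not reproduced here.

```python
from typing import Callable, Iterable, List, NamedTuple


class Record(NamedTuple):
    """Simplified, hypothetical view of one replication-table entry."""
    section: str   # "order" or "repl"
    wal: str       # path of the write-ahead log the entry refers to
    table_id: str  # source table the WAL belongs to


class Orphan(NamedTuple):
    """An `order` entry that has no matching `repl` entry."""
    wal: str
    table_id: str
    exists_in_hdfs: bool


def find_orphans(records: Iterable[Record],
                 wal_exists: Callable[[str], bool]) -> List[Orphan]:
    """Return `order` entries lacking a `repl` entry, in input order.

    `wal_exists` is a stand-in for a filesystem check; a back-fill tool
    would likely treat missing and present WALs differently.
    """
    records = list(records)
    # Every (wal, table) pair that already has a `repl` entry.
    have_repl = {(r.wal, r.table_id) for r in records if r.section == "repl"}
    orphans = []
    for r in records:
        if r.section == "order" and (r.wal, r.table_id) not in have_repl:
            orphans.append(Orphan(r.wal, r.table_id, wal_exists(r.wal)))
    return orphans


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample = [
        Record("order", "hdfs://nn/wal/a", "1"),
        Record("repl", "hdfs://nn/wal/a", "1"),
        Record("order", "hdfs://nn/wal/b", "1"),
    ]
    still_present = {"hdfs://nn/wal/a"}
    for orphan in find_orphans(sample, lambda p: p in still_present):
        print(orphan)
```

A one-off tool along these lines would first report the orphaned entries, so
the back-fill step can be reviewed before anything is written to the
replication table.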