Re: Corrupt WAL

2018-08-22 Thread Adam J. Shook
The code referenced in the PR works to detect and move a WAL, replacing it
with an empty one, but isn't fully wrapped up/merged.  Some priorities were
shifted and this got pushed back, though I do plan on addressing the
comments in the code review Soon™.

I'd suggest upgrading to 1.9.2 once you resolve the issue.  We've been
running it for a while and have not had any WAL-related errors.

--Adam

On Tue, Aug 21, 2018 at 6:58 PM Ed Coleman  wrote:

> There has been work done in https://github.com/apache/accumulo/pull/574.
> I'm not certain of the state of the code, but the description may provide
> you with things that you could look at manually.
>
>
> -----Original Message-----
> From: tech.s...@gmail.com [mailto:tech.s...@gmail.com]
> Sent: Tuesday, August 21, 2018 5:45 PM
> To: user@accumulo.apache.org
> Subject: Re: Corrupt WAL
>
> Was there any success with this workaround strategy?  I am also
> experiencing this issue.
>
> On 2018/06/13 16:30:22, "Adam J. Shook"  wrote:
> > Sorry, I had the error backwards.  There is an OPEN for the WAL and
> > then immediately a COMPACTION_FINISH entry.  This would cause the error.
> >
> > On Wed, Jun 13, 2018 at 11:34 AM, Adam J. Shook 
> > wrote:
> >
> > > > Looking at the log I see that the last two entries are
> > > > COMPACTION_START of one RFile immediately followed by a
> > > > COMPACTION_START of a separate RFile which (I believe) would lead to
> > > > the error.  Would this necessarily be an issue if the compactions are
> > > > for separate RFiles?
> > > >
> > > > This is a dev cluster and I don't necessarily care about it, but is
> > > > there a (good) means to do WAL log surgery?  I imagine I can just
> > > > chop off bytes until the log is parseable and missing the info about
> > > > the compactions.
> > >
> > > On Tue, Jun 12, 2018 at 2:32 PM, Keith Turner 
> wrote:
> > >
> > >> On Tue, Jun 12, 2018 at 12:10 PM, Adam J. Shook
> > >> 
> > >> wrote:
> > >> > Yes, that is the error.  I'll inspect the logs and report back.
> > >>
> > >> Ok.  The LogReader command has a mechanism to filter which tablet
> > >> is displayed.  If the walog has  alot of data in it, may need to
> > >> use this.
> > >>
> > >> Also, be aware that only 5 mutations are shown for a "many mutations"
> > >> objects in the walog.   The -m options changes this.  May want to see
> > >> more when deciding if the info in the log is important.
> > >>
> > >>
> > >> >
> > >> > On Tue, Jun 12, 2018 at 10:14 AM, Keith Turner 
> > >> wrote:
> > >> >>
> > >> >> Is the message you are seeing "COMPACTION_FINISH (without
> > >> >> preceding COMPACTION_START)" ?  That messages indicates that the
> > >> >> WALs are incomplete, probably as a result of the NN problems.
> > >> >> Could do the following :
> > >> >>
> > >> >> 1) Run the following command to see whats in the log.  Need to
> > >> >> see what is there for the root tablet.
> > >> >>
> > >> >>accumulo org.apache.accumulo.tserver.logger.LogReader
> > >> >>
> > >> >> 2) Replace the log file with an empty file after seeing if there
> > >> >> is anything important in it.
> > >> >>
> > >> >> I think the list of WALs for the root tablet is stored in ZK at
> > >> >> /accumulo//walogs
> > >> >>
> > >> >> On Mon, Jun 11, 2018 at 5:26 PM, Adam J. Shook
> > >> >> 
> > >> >> wrote:
> > >> >> > Hey all,
> > >> >> >
> > >> >> > The root tablet on one of our dev systems isn't loading due to
> > >> >> > an illegal state exception -- COMPACTION_FINISH preceding
> > >> >> > COMPACTION_START.  What'd be the best way to mitigate this
> > >> >> > issue?  This was likely caused due to both of our NameNodes
> > >> >> > failing.
> > >> >> >
> > >> >> > Thank you,
> > >> >> > --Adam
> > >> >
> > >> >
> > >>
> > >
> > >
> >
>
>


RE: Corrupt WAL

2018-08-21 Thread Ed Coleman
There has been work done in https://github.com/apache/accumulo/pull/574. I'm not
certain of the state of the code, but the description may provide you with
things that you could look at manually.


-----Original Message-----
From: tech.s...@gmail.com [mailto:tech.s...@gmail.com] 
Sent: Tuesday, August 21, 2018 5:45 PM
To: user@accumulo.apache.org
Subject: Re: Corrupt WAL

Was there any success with this workaround strategy?  I am also experiencing 
this issue.

On 2018/06/13 16:30:22, "Adam J. Shook"  wrote: 
> Sorry, I had the error backwards.  There is an OPEN for the WAL and 
> then immediately a COMPACTION_FINISH entry.  This would cause the error.
> 
> On Wed, Jun 13, 2018 at 11:34 AM, Adam J. Shook 
> wrote:
> 
> > Looking at the log I see that the last two entries are 
> > COMPACTION_START of one RFile immediately followed by a 
> > COMPACTION_START of a separate RFile which (I believe) would lead to 
> > the error.  Would this necessarily be an issue if the compactions are for 
> > separate RFiles?
> >
> > This is a dev cluster and I don't necessarily care about it, but is 
> > there a (good) means to do WAL log surgery?  I imagine I can just 
> > chop off bytes until the log is parseable and missing the info about the 
> > compactions.
> >
> > On Tue, Jun 12, 2018 at 2:32 PM, Keith Turner  wrote:
> >
> >> On Tue, Jun 12, 2018 at 12:10 PM, Adam J. Shook 
> >> 
> >> wrote:
> >> > Yes, that is the error.  I'll inspect the logs and report back.
> >>
> >> Ok.  The LogReader command has a mechanism to filter which tablet 
> >> is displayed.  If the walog has  alot of data in it, may need to 
> >> use this.
> >>
> >> Also, be aware that only 5 mutations are shown for a "many mutations"
> >> objects in the walog.   The -m options changes this.  May want to see
> >> more when deciding if the info in the log is important.
> >>
> >>
> >> >
> >> > On Tue, Jun 12, 2018 at 10:14 AM, Keith Turner 
> >> wrote:
> >> >>
> >> >> Is the message you are seeing "COMPACTION_FINISH (without 
> >> >> preceding COMPACTION_START)" ?  That messages indicates that the 
> >> >> WALs are incomplete, probably as a result of the NN problems.  
> >> >> Could do the following :
> >> >>
> >> >> 1) Run the following command to see whats in the log.  Need to 
> >> >> see what is there for the root tablet.
> >> >>
> >> >>accumulo org.apache.accumulo.tserver.logger.LogReader
> >> >>
> >> >> 2) Replace the log file with an empty file after seeing if there 
> >> >> is anything important in it.
> >> >>
> >> >> I think the list of WALs for the root tablet is stored in ZK at 
> >> >> /accumulo//walogs
> >> >>
> >> >> On Mon, Jun 11, 2018 at 5:26 PM, Adam J. Shook 
> >> >> 
> >> >> wrote:
> >> >> > Hey all,
> >> >> >
> >> >> > The root tablet on one of our dev systems isn't loading due to
> >> >> > an illegal state exception -- COMPACTION_FINISH preceding
> >> >> > COMPACTION_START.  What'd be the best way to mitigate this
> >> >> > issue?  This was likely caused due to both of our NameNodes
> >> >> > failing.
> >> >> >
> >> >> > Thank you,
> >> >> > --Adam
> >> >
> >> >
> >>
> >
> >
>



Re: Corrupt WAL

2018-08-21 Thread tech . shan
Was there any success with this workaround strategy?  I am also experiencing 
this issue.

On 2018/06/13 16:30:22, "Adam J. Shook"  wrote: 
> Sorry, I had the error backwards.  There is an OPEN for the WAL and then
> immediately a COMPACTION_FINISH entry.  This would cause the error.
> 
> On Wed, Jun 13, 2018 at 11:34 AM, Adam J. Shook 
> wrote:
> 
> > Looking at the log I see that the last two entries are COMPACTION_START of
> > one RFile immediately followed by a COMPACTION_START of a separate RFile
> > which (I believe) would lead to the error.  Would this necessarily be an
> > issue if the compactions are for separate RFiles?
> >
> > This is a dev cluster and I don't necessarily care about it, but is there
> > a (good) means to do WAL log surgery?  I imagine I can just chop off bytes
> > until the log is parseable and missing the info about the compactions.
> >
> > On Tue, Jun 12, 2018 at 2:32 PM, Keith Turner  wrote:
> >
> >> On Tue, Jun 12, 2018 at 12:10 PM, Adam J. Shook 
> >> wrote:
> >> > Yes, that is the error.  I'll inspect the logs and report back.
> >>
> >> Ok.  The LogReader command has a mechanism to filter which tablet is
> >> displayed.  If the walog has  alot of data in it, may need to use
> >> this.
> >>
> >> Also, be aware that only 5 mutations are shown for a "many mutations"
> >> objects in the walog.   The -m options changes this.  May want to see
> >> more when deciding if the info in the log is important.
> >>
> >>
> >> >
> >> > On Tue, Jun 12, 2018 at 10:14 AM, Keith Turner 
> >> wrote:
> >> >>
> >> >> Is the message you are seeing "COMPACTION_FINISH (without preceding
> >> >> COMPACTION_START)" ?  That messages indicates that the WALs are
> >> >> incomplete, probably as a result of the NN problems.  Could do the
> >> >> following :
> >> >>
> >> >> 1) Run the following command to see whats in the log.  Need to see
> >> >> what is there for the root tablet.
> >> >>
> >> >>accumulo org.apache.accumulo.tserver.logger.LogReader
> >> >>
> >> >> 2) Replace the log file with an empty file after seeing if there is
> >> >> anything important in it.
> >> >>
> >> >> I think the list of WALs for the root tablet is stored in ZK at
> >> >> /accumulo//walogs
> >> >>
> >> >> On Mon, Jun 11, 2018 at 5:26 PM, Adam J. Shook 
> >> >> wrote:
> >> >> > Hey all,
> >> >> >
> >> >> > The root tablet on one of our dev systems isn't loading due to an
> >> >> > illegal state exception -- COMPACTION_FINISH preceding
> >> >> > COMPACTION_START.  What'd be the best way to mitigate this issue?
> >> >> > This was likely caused due to both of our NameNodes failing.
> >> >> >
> >> >> > Thank you,
> >> >> > --Adam
> >> >
> >> >
> >>
> >
> >
>


Re: Corrupt WAL

2018-06-13 Thread Adam J. Shook
Sorry, I had the error backwards.  There is an OPEN for the WAL and then
immediately a COMPACTION_FINISH entry.  This would cause the error.
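
To make the failing pattern concrete, here is a minimal Python sketch of the ordering check implied by the error message. It is an illustration only, not Accumulo's recovery code: the function name is hypothetical, the events are plain strings rather than parsed WAL entries, and real recovery tracks more state (for example, which tablet each entry belongs to).

```python
from typing import Iterable, Optional


def first_unmatched_finish(events: Iterable[str]) -> Optional[int]:
    """Return the index of the first COMPACTION_FINISH that has no
    COMPACTION_START before it, or None if the ordering is consistent.

    Hypothetical helper for illustration; not part of Accumulo.
    """
    start_pending = False
    for index, event in enumerate(events):
        if event == "COMPACTION_START":
            start_pending = True
        elif event == "COMPACTION_FINISH":
            if not start_pending:
                # A finish with no start on record: the pattern in this WAL.
                return index
            start_pending = False
    return None


# The sequence described above: OPEN followed directly by COMPACTION_FINISH.
corrupt = ["OPEN", "COMPACTION_FINISH"]
# A consistent sequence, for comparison.
healthy = ["OPEN", "DEFINE_TABLET", "COMPACTION_START", "COMPACTION_FINISH"]

print(first_unmatched_finish(corrupt))
print(first_unmatched_finish(healthy))
```

For the corrupt sequence the function reports the position of the orphaned COMPACTION_FINISH; for the consistent one it returns None.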

On Wed, Jun 13, 2018 at 11:34 AM, Adam J. Shook 
wrote:

> Looking at the log I see that the last two entries are COMPACTION_START of
> one RFile immediately followed by a COMPACTION_START of a separate RFile
> which (I believe) would lead to the error.  Would this necessarily be an
> issue if the compactions are for separate RFiles?
>
> This is a dev cluster and I don't necessarily care about it, but is there
> a (good) means to do WAL log surgery?  I imagine I can just chop off bytes
> until the log is parseable and missing the info about the compactions.
>
> On Tue, Jun 12, 2018 at 2:32 PM, Keith Turner  wrote:
>
>> On Tue, Jun 12, 2018 at 12:10 PM, Adam J. Shook 
>> wrote:
>> > Yes, that is the error.  I'll inspect the logs and report back.
>>
>> Ok.  The LogReader command has a mechanism to filter which tablet is
>> displayed.  If the walog has  alot of data in it, may need to use
>> this.
>>
>> Also, be aware that only 5 mutations are shown for a "many mutations"
>> objects in the walog.   The -m options changes this.  May want to see
>> more when deciding if the info in the log is important.
>>
>>
>> >
>> > On Tue, Jun 12, 2018 at 10:14 AM, Keith Turner 
>> wrote:
>> >>
>> >> Is the message you are seeing "COMPACTION_FINISH (without preceding
>> >> COMPACTION_START)" ?  That messages indicates that the WALs are
>> >> incomplete, probably as a result of the NN problems.  Could do the
>> >> following :
>> >>
>> >> 1) Run the following command to see whats in the log.  Need to see
>> >> what is there for the root tablet.
>> >>
>> >>accumulo org.apache.accumulo.tserver.logger.LogReader
>> >>
>> >> 2) Replace the log file with an empty file after seeing if there is
>> >> anything important in it.
>> >>
>> >> I think the list of WALs for the root tablet is stored in ZK at
>> >> /accumulo//walogs
>> >>
>> >> On Mon, Jun 11, 2018 at 5:26 PM, Adam J. Shook 
>> >> wrote:
>> >> > Hey all,
>> >> >
>> >> > The root tablet on one of our dev systems isn't loading due to an
>> >> > illegal state exception -- COMPACTION_FINISH preceding
>> >> > COMPACTION_START.  What'd be the best way to mitigate this issue?
>> >> > This was likely caused due to both of our NameNodes failing.
>> >> >
>> >> > Thank you,
>> >> > --Adam
>> >
>> >
>>
>
>


Re: Corrupt WAL

2018-06-12 Thread Keith Turner
On Tue, Jun 12, 2018 at 12:10 PM, Adam J. Shook  wrote:
> Yes, that is the error.  I'll inspect the logs and report back.

Ok.  The LogReader command has a mechanism to filter which tablet is
displayed.  If the walog has a lot of data in it, you may need to use
this.

Also, be aware that only 5 mutations are shown for a "many mutations"
object in the walog.  The -m option changes this.  You may want to see
more when deciding if the info in the log is important.


>
> On Tue, Jun 12, 2018 at 10:14 AM, Keith Turner  wrote:
>>
>> Is the message you are seeing "COMPACTION_FINISH (without preceding
>> COMPACTION_START)" ?  That messages indicates that the WALs are
>> incomplete, probably as a result of the NN problems.  Could do the
>> following :
>>
>> 1) Run the following command to see whats in the log.  Need to see
>> what is there for the root tablet.
>>
>>accumulo org.apache.accumulo.tserver.logger.LogReader
>>
>> 2) Replace the log file with an empty file after seeing if there is
>> anything important in it.
>>
>> I think the list of WALs for the root tablet is stored in ZK at
>> /accumulo//walogs
>>
>> On Mon, Jun 11, 2018 at 5:26 PM, Adam J. Shook 
>> wrote:
>> > Hey all,
>> >
>> > The root tablet on one of our dev systems isn't loading due to an
>> > illegal state exception -- COMPACTION_FINISH preceding
>> > COMPACTION_START.  What'd be the best way to mitigate this issue?
>> > This was likely caused due to both of our NameNodes failing.
>> >
>> > Thank you,
>> > --Adam
>
>


Re: Corrupt WAL

2018-06-12 Thread Keith Turner
Is the message you are seeing "COMPACTION_FINISH (without preceding
COMPACTION_START)"?  That message indicates that the WALs are
incomplete, probably as a result of the NN problems.  You could do the
following:

1) Run the following command to see what's in the log.  You need to see
what is there for the root tablet.

   accumulo org.apache.accumulo.tserver.logger.LogReader

2) Replace the log file with an empty file after checking whether there
is anything important in it.

I think the list of WALs for the root tablet is stored in ZK at
/accumulo//walogs
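
As a rough sketch of step 1, the Python helper below assembles the LogReader invocation as an argument list that could be handed to `subprocess.run`. The class name and the `-m` option come from this thread; the helper name, the placeholder WAL path, and passing the path as a trailing positional argument are assumptions for illustration, so check the tool's own usage output before relying on them.

```python
from typing import List, Optional

# Class name as given in the thread.
LOG_READER_CLASS = "org.apache.accumulo.tserver.logger.LogReader"


def log_reader_command(wal_path: str,
                       max_mutations: Optional[int] = None) -> List[str]:
    """Build the argv for inspecting one WAL file with LogReader.

    Hypothetical helper: wal_path as a trailing positional argument is
    an assumption; -m is the option mentioned elsewhere in this thread.
    """
    command = ["accumulo", LOG_READER_CLASS]
    if max_mutations is not None:
        # Raise the number of mutations shown per "many mutations" entry.
        command += ["-m", str(max_mutations)]
    command.append(wal_path)
    return command


# Placeholder path; substitute the WAL listed for the root tablet.
print(" ".join(log_reader_command("/path/to/wal-file", max_mutations=100)))
```

Building the command as a list rather than a single string avoids shell quoting problems if the WAL path contains unusual characters.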

On Mon, Jun 11, 2018 at 5:26 PM, Adam J. Shook  wrote:
> Hey all,
>
> The root tablet on one of our dev systems isn't loading due to an illegal
> state exception -- COMPACTION_FINISH preceding COMPACTION_START.  What'd be
> the best way to mitigate this issue?  This was likely caused due to both of
> our NameNodes failing.
>
> Thank you,
> --Adam


Re: Corrupt WAL

2018-06-11 Thread Adam J. Shook
The WAL is from 1.9.1.

On Mon, Jun 11, 2018 at 6:33 PM, Christopher  wrote:

> That's what I was thinking it was related to. Do you know if the
> particular WAL file was created from a previous version, from before you
> upgraded?
>
> On Mon, Jun 11, 2018 at 6:00 PM Adam J. Shook 
> wrote:
>
>> Sorry would have been good to include that :)  It's the newest 1.9.1.  I
>> think it relates to https://github.com/apache/accumulo/pull/458, just
>> not sure what the best thing to do here is.
>>
>> On Mon, Jun 11, 2018 at 5:46 PM, Christopher  wrote:
>>
>>> What version are you using?
>>>
>>> On Mon, Jun 11, 2018 at 5:27 PM Adam J. Shook 
>>> wrote:
>>>
>>>> Hey all,
>>>>
>>>> The root tablet on one of our dev systems isn't loading due to an
>>>> illegal state exception -- COMPACTION_FINISH preceding COMPACTION_START.
>>>> What'd be the best way to mitigate this issue?  This was likely caused due
>>>> to both of our NameNodes failing.
>>>>
>>>> Thank you,
>>>> --Adam
>>>>
>>>
>>


Re: Corrupt WAL

2018-06-11 Thread Christopher
What version are you using?

On Mon, Jun 11, 2018 at 5:27 PM Adam J. Shook  wrote:

> Hey all,
>
> The root tablet on one of our dev systems isn't loading due to an illegal
> state exception -- COMPACTION_FINISH preceding COMPACTION_START.  What'd be
> the best way to mitigate this issue?  This was likely caused due to both of
> our NameNodes failing.
>
> Thank you,
> --Adam
>


Corrupt WAL

2018-06-11 Thread Adam J. Shook
Hey all,

The root tablet on one of our dev systems isn't loading due to an illegal
state exception -- COMPACTION_FINISH preceding COMPACTION_START.  What'd be
the best way to mitigate this issue?  This was likely caused due to both of
our NameNodes failing.

Thank you,
--Adam