Re: java.IO.EOFException: ..../accumulo/recovery/.../part-r-00000/index not a SequenceFile.

2016-10-21 Thread Jeff Kubina
Mike,

Yes, thanks for the help. We had to delete the recovered files generated
from the WAL a few times but that worked. Then we restarted the two tablets
with the TProtocolException exceptions to fix those errors. We saved off
the log files for you.

Jeff


-- 
Jeff Kubina
410-988-4436


On Fri, Oct 21, 2016 at 10:57 AM, Michael Wall  wrote:

> Andrew/Jeff,
>
> How's it going? Did you resolve your issue?
>
> Mike
>
> On Tue, Oct 18, 2016 at 10:42 AM, Andrew Hulbert 
> wrote:
>
>> I think it is attempting to do migrations at the moment FYI
>>
>> On 10/18/2016 10:40 AM, Andrew Hulbert wrote:
>>
>> Yes, it looks similar.
>>
>> Esp these parts:
>>
>> 2015-11-19 22:43:05,998 [impl.TabletServerBatchReaderIterator] DEBUG: 
>> org.apache.thrift.protocol.TProtocolException: Expected protocol id ff82 
>> but got 19
>> java.io.IOException: org.apache.thrift.protocol.TProtocolException: Expected 
>> protocol id ff82 but got 19
>>  at 
>> org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:702)
>>  at 
>> org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator$QueryTask.run(TabletServerBatchReaderIterator.java:349)
>>  at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>  at 
>> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>>  at java.lang.Thread.run(Thread.java:745)
>> Caused by: org.apache.thrift.protocol.TProtocolException: Expected protocol 
>> id ff82 but got 19
>>  at 
>> org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:472)
>>  at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
>>  at 
>> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_startMultiScan(TabletClientService.java:317)
>>  at 
>> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.startMultiScan(TabletClientService.java:297)
>>  at 
>> org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:634)
>>  ... 6 more
>>
>>
>>
>>
>> On 10/18/2016 10:34 AM, Josh Elser wrote:
>>
>> Or, if it's more convenient, this is the issue I was thinking of:
>> https://issues.apache.org/jira/browse/ACCUMULO-4065
>>
>> Andrew Hulbert wrote:
>>
>> I'll try to dig up the full error from the tserver
>>
>>
>> On 10/18/2016 10:30 AM, Josh Elser wrote:
>>
>> Do you have the full exception for the "Expected protocol id.." error?
>>
>> That looks like it might be incorrect usage of Thrift on our part..
>>
>> Andrew Hulbert wrote:
>>
>> Mike,
>>
>> So backing up and then later deleting the recovery directories a few
>> times did the trick. It seemed that removing the initial bad one caused
>> the others to go through for the most part...
>>
>> I believe all the WAL files were there. I'll look for the WAL deleted in
>> the GC logs and see if there's any evidence of that. It is version 1.6.4
>> by the way. Unfortunately can't send the logs to you here but I did save
>> them off and I'll talk to Jeff about what we can do.
>>
>> We are currently getting a new error that I'm going to look into...
>>
>> Expected protocol id 82 but got 0
>>
>> Expected protocol id 82 but got 6e
>>
>> etc.
>>
>> Looking into that now! Thanks for the help so far, as usual!
>>
>> Andrew
>>
>> On 10/18/2016 09:46 AM, Michael Wall wrote:
>>
>> Andrew,
>>
>> That is what I was going to suggest you try. Where is that "Unable to
>> find recovery files for extent" log? Any way we can see some actual
>> logs?
>>
>> Are all the WALs there? Do you find any of the WALs deleted by GC in
>> the gc logs? Do you find any duplicate WALs in the HDFS trash?
>>
>> On Tue, Oct 18, 2016 at 9:32 AM, Andrew Hulbert wrote:
>>
>> Mike,
>>
>> For one of the WALs I backed up the recovery directory and that
>> initiated a new recovery attempt as indicated in the tserver debug
>> log...
>>
>> Then the exception was thrown:
>>
>> Unable to find recovery files for extent xx logentry x
>> hdfs://path/to/wal/
>>
>> Any ideas? I figure we can zero out the WAL and it will go on with
>> life but it would be nice to try and get the data!
>>
>> Thanks!
>>
>>
>> On 10/18/2016 08:55 AM, Jeff Kubina wrote:
>>
>>
>> On Tue, Oct 18, 2016 at 6:32 AM, Michael Wall wrote:
>>
>> Take a look at the master logs for where the WAL was sorted
>> to the /accumulo/recovery/... directory. Then look to see if
>> those WALs are still around and contain content.
>>
>>
>> Checked one of them, yes it is around with content.
>>
>> Where is 

Re: java.IO.EOFException: ..../accumulo/recovery/.../part-r-00000/index not a SequenceFile.

2016-10-21 Thread Michael Wall
Andrew/Jeff,

How's it going? Did you resolve your issue?

Mike

On Tue, Oct 18, 2016 at 10:42 AM, Andrew Hulbert  wrote:

> I think it is attempting to do migrations at the moment FYI
>
> On 10/18/2016 10:40 AM, Andrew Hulbert wrote:
>
> Yes, it looks similar.
>
> Esp these parts:
>
> 2015-11-19 22:43:05,998 [impl.TabletServerBatchReaderIterator] DEBUG: 
> org.apache.thrift.protocol.TProtocolException: Expected protocol id ff82 
> but got 19
> java.io.IOException: org.apache.thrift.protocol.TProtocolException: Expected 
> protocol id ff82 but got 19
>   at 
> org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:702)
>   at 
> org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator$QueryTask.run(TabletServerBatchReaderIterator.java:349)
>   at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at 
> org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.thrift.protocol.TProtocolException: Expected protocol 
> id ff82 but got 19
>   at 
> org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:472)
>   at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
>   at 
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_startMultiScan(TabletClientService.java:317)
>   at 
> org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.startMultiScan(TabletClientService.java:297)
>   at 
> org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:634)
>   ... 6 more
>
>
>
>
> On 10/18/2016 10:34 AM, Josh Elser wrote:
>
> Or, if it's more convenient, this is the issue I was thinking of:
> https://issues.apache.org/jira/browse/ACCUMULO-4065
>
> Andrew Hulbert wrote:
>
> I'll try to dig up the full error from the tserver
>
>
> On 10/18/2016 10:30 AM, Josh Elser wrote:
>
> Do you have the full exception for the "Expected protocol id.." error?
>
> That looks like it might be incorrect usage of Thrift on our part..
>
> Andrew Hulbert wrote:
>
> Mike,
>
> So backing up and then later deleting the recovery directories a few
> times did the trick. It seemed that removing the initial bad one caused
> the others to go through for the most part...
>
> I believe all the WAL files were there. I'll look for the WAL deleted in
> the GC logs and see if there's any evidence of that. It is version 1.6.4
> by the way. Unfortunately can't send the logs to you here but I did save
> them off and I'll talk to Jeff about what we can do.
>
> We are currently getting a new error that I'm going to look into...
>
> Expected protocol id 82 but got 0
>
> Expected protocol id 82 but got 6e
>
> etc.
>
> Looking into that now! Thanks for the help so far, as usual!
>
> Andrew
>
> On 10/18/2016 09:46 AM, Michael Wall wrote:
>
> Andrew,
>
> That is what I was going to suggest you try. Where is that "Unable to
> find recovery files for extent" log? Any way we can see some actual
> logs?
>
> Are all the WALs there? Do you find any of the WALs deleted by GC in
> the gc logs? Do you find any duplicate WALs in the HDFS trash?
>
> On Tue, Oct 18, 2016 at 9:32 AM, Andrew Hulbert wrote:
>
> Mike,
>
> For one of the WALs I backed up the recovery directory and that
> initiated a new recovery attempt as indicated in the tserver debug
> log...
>
> Then the exception was thrown:
>
> Unable to find recovery files for extent xx logentry x
> hdfs://path/to/wal/
>
> Any ideas? I figure we can zero out the WAL and it will go on with
> life but it would be nice to try and get the data!
>
> Thanks!
>
>
> On 10/18/2016 08:55 AM, Jeff Kubina wrote:
>
>
> On Tue, Oct 18, 2016 at 6:32 AM, Michael Wall wrote:
>
> Take a look at the master logs for where the WAL was sorted
> to the /accumulo/recovery/... directory. Then look to see if
> those WALs are still around and contain content.
>
>
> Checked one of them, yes it is around with content.
>
> Where is this EOF exception, on a tserver?
>
>
> Yes, the tserver.
>
> Is the master log complaining about anything?
>
>
> Repeating a message similar to the tserver but also that the
> tablet assignment failed for the tserver.
>
> tservers are not balancing because of all this.
>


Re: java.IO.EOFException: ..../accumulo/recovery/.../part-r-00000/index not a SequenceFile.

2016-10-18 Thread Andrew Hulbert

I think it is attempting to do migrations at the moment FYI


On 10/18/2016 10:40 AM, Andrew Hulbert wrote:


Yes, it looks similar.

Esp these parts:

2015-11-19 22:43:05,998 [impl.TabletServerBatchReaderIterator] DEBUG: 
org.apache.thrift.protocol.TProtocolException: Expected protocol id ff82 
but got 19
java.io.IOException: org.apache.thrift.protocol.TProtocolException: Expected 
protocol id ff82 but got 19
at 
org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:702)
at 
org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator$QueryTask.run(TabletServerBatchReaderIterator.java:349)
at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at 
org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.protocol.TProtocolException: Expected protocol id 
ff82 but got 19
at 
org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:472)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at 
org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.recv_startMultiScan(TabletClientService.java:317)
at 
org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Client.startMultiScan(TabletClientService.java:297)
at 
org.apache.accumulo.core.client.impl.TabletServerBatchReaderIterator.doLookup(TabletServerBatchReaderIterator.java:634)
... 6 more



On 10/18/2016 10:34 AM, Josh Elser wrote:
Or, if it's more convenient, this is the issue I was thinking of: 
https://issues.apache.org/jira/browse/ACCUMULO-4065


Andrew Hulbert wrote:

I'll try to dig up the full error from the tserver


On 10/18/2016 10:30 AM, Josh Elser wrote:

Do you have the full exception for the "Expected protocol id.." error?

That looks like it might be incorrect usage of Thrift on our part..

Andrew Hulbert wrote:

Mike,

So backing up and then later deleting the recovery directories a few
times did the trick. It seemed that removing the initial bad one caused
the others to go through for the most part...

I believe all the WAL files were there. I'll look for the WAL deleted in
the GC logs and see if there's any evidence of that. It is version 1.6.4
by the way. Unfortunately can't send the logs to you here but I did save
them off and I'll talk to Jeff about what we can do.

We are currently getting a new error that I'm going to look into...

Expected protocol id 82 but got 0

Expected protocol id 82 but got 6e

etc.

Looking into that now! Thanks for the help so far, as usual!

Andrew

On 10/18/2016 09:46 AM, Michael Wall wrote:

Andrew,

That is what I was going to suggest you try. Where is that "Unable to
find recovery files for extent" log? Any way we can see some actual
logs?

Are all the WALs there? Do you find any of the WALs deleted by GC in
the gc logs? Do you find any duplicate WALs in the HDFS trash?

On Tue, Oct 18, 2016 at 9:32 AM, Andrew Hulbert wrote:

Mike,

For one of the WALs I backed up the recovery directory and that
initiated a new recovery attempt as indicated in the tserver debug
log...

Then the exception was thrown:

Unable to find recovery files for extent xx logentry x
hdfs://path/to/wal/

Any ideas? I figure we can zero out the WAL and it will go on with
life but it would be nice to try and get the data!

Thanks!


On 10/18/2016 08:55 AM, Jeff Kubina wrote:


On Tue, Oct 18, 2016 at 6:32 AM, Michael Wall wrote:

Take a look at the master logs for where the WAL was sorted
to the /accumulo/recovery/... directory. Then look to see if
those WALs are still around and contain content.


Checked one of them, yes it is around with content.

Where is this EOF exception, on a tserver?


Yes, the tserver.

Is the master log complaining about anything?


Repeating a message similar to the tserver but also that the
tablet assignment failed for the tserver.

tservers are not balancing because of all this.















Re: java.IO.EOFException: ..../accumulo/recovery/.../part-r-00000/index not a SequenceFile.

2016-10-18 Thread Andrew Hulbert

I'll try to dig up the full error from the tserver


On 10/18/2016 10:30 AM, Josh Elser wrote:

Do you have the full exception for the "Expected protocol id.." error?

That looks like it might be incorrect usage of Thrift on our part..

Andrew Hulbert wrote:

Mike,

So backing up and then later deleting the recovery directories a few
times did the trick. It seemed that removing the initial bad one caused
the others to go through for the most part...

I believe all the WAL files were there. I'll look for the WAL deleted in
the GC logs and see if there's any evidence of that. It is version 1.6.4
by the way. Unfortunately can't send the logs to you here but I did save
them off and I'll talk to Jeff about what we can do.

We are currently getting a new error that I'm going to look into...

Expected protocol id 82 but got 0

Expected protocol id 82 but got 6e

etc.

Looking into that now! Thanks for the help so far, as usual!

Andrew

On 10/18/2016 09:46 AM, Michael Wall wrote:

Andrew,

That is what I was going to suggest you try.  Where is that "Unable to
find recovery files for extent" log?  Any way we can see some actual
logs?

Are all the WALs there?  Do you find any of the WALs deleted by GC in
the gc logs?  Do you find any duplicate WALs in the HDFS trash?

On Tue, Oct 18, 2016 at 9:32 AM, Andrew Hulbert wrote:

Mike,

For one of the WALs I backed up the recovery directory and that
initiated a new recovery attempt as indicated in the tserver debug
log...

Then the exception was thrown:

Unable to find recovery files for extent xx logentry x
hdfs://path/to/wal/

Any ideas? I figure we can zero out the WAL and it will go on with
life but it would be nice to try and get the data!

Thanks!


On 10/18/2016 08:55 AM, Jeff Kubina wrote:


On Tue, Oct 18, 2016 at 6:32 AM, Michael Wall wrote:

Take a look at the master logs for where the WAL was sorted
to the /accumulo/recovery/... directory.  Then look to see if
those WALs are still around and contain content.


Checked one of them, yes it is around with content.

Where is this EOF exception, on a tserver?


Yes, the tserver.

Is the master log complaining about anything?


Repeating a message similar to the tserver but also that the
tablet assignment failed for the tserver.

tservers are not balancing because of all this.











Re: java.IO.EOFException: ..../accumulo/recovery/.../part-r-00000/index not a SequenceFile.

2016-10-18 Thread Andrew Hulbert

Note that the error is more like this:

Expected protocol id ff82 but got 35 (0!;38\\;82,:9997, 
)
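
For anyone collecting these messages from the tserver logs, a small helper along the following lines might make it easier to see which unexpected bytes keep showing up. This is only a sketch: the class and method names (ProtocolIdMismatch, unexpectedBytes, describe) are made up for illustration, and the regular expression is based solely on the message text quoted in this thread. It does not diagnose the cause; it just tabulates the "got" values so they can be compared across servers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Accumulo or Thrift.
public class ProtocolIdMismatch {
    // Matches text such as "Expected protocol id ff82 but got 19".
    // \s+ tolerates the message being wrapped across lines in a log or email.
    private static final Pattern MISMATCH = Pattern.compile(
        "Expected\\s+protocol\\s+id\\s+([0-9a-fA-F]+)\\s+but\\s+got\\s+([0-9a-fA-F]+)");

    /** Returns the unexpected ("got") byte values found in the given log text. */
    public static List<Integer> unexpectedBytes(String logText) {
        List<Integer> result = new ArrayList<>();
        Matcher m = MISMATCH.matcher(logText);
        while (m.find()) {
            // Parse as long and keep the low 8 bits, in case the hex value
            // was printed sign-extended (for example "ffffffa0").
            result.add((int) (Long.parseLong(m.group(2), 16) & 0xff));
        }
        return result;
    }

    /** Renders a byte as hex, plus its ASCII character when it is printable. */
    public static String describe(int b) {
        String hex = String.format("0x%02x", b);
        return (b >= 0x20 && b < 0x7f) ? hex + " '" + (char) b + "'" : hex;
    }

    public static void main(String[] args) {
        // Sample lines copied from the messages in this thread.
        String sample = String.join("\n",
            "Expected protocol id 82 but got 0",
            "Expected protocol id 82 but got 6e",
            "Expected protocol id ff82 but got 35");
        for (int b : unexpectedBytes(sample)) {
            System.out.println(describe(b));
        }
    }
}
```

If many of the "got" values turn out to be printable ASCII, that may be worth mentioning when comparing against ACCUMULO-4065, but that is a guess and not something this sketch can confirm.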




On 10/18/2016 10:28 AM, Andrew Hulbert wrote:


Mike,

So backing up and then later deleting the recovery directories a few 
times did the trick. It seemed that removing the initial bad one 
caused the others to go through for the most part...


I believe all the WAL files were there. I'll look for the WAL deleted 
in the GC logs and see if there's any evidence of that. It is version 
1.6.4 by the way. Unfortunately can't send the logs to you here but I 
did save them off and I'll talk to Jeff about what we can do.


We are currently getting a new error that I'm going to look into...

Expected protocol id 82 but got 0

Expected protocol id 82 but got 6e

etc.

Looking into that now! Thanks for the help so far, as usual!

Andrew

On 10/18/2016 09:46 AM, Michael Wall wrote:

Andrew,

That is what I was going to suggest you try.  Where is that "Unable
to find recovery files for extent" log?  Any way we can see some
actual logs?

Are all the WALs there?  Do you find any of the WALs deleted by GC in
the gc logs?  Do you find any duplicate WALs in the HDFS trash?


On Tue, Oct 18, 2016 at 9:32 AM, Andrew Hulbert wrote:


Mike,

For one of the WALs I backed up the recovery directory and that
initiated a new recovery attempt as indicated in the tserver
debug log...

Then the exception was thrown:

Unable to find recovery files for extent xx logentry x
hdfs://path/to/wal/

Any ideas? I figure we can zero out the WAL and it will go on
with life but it would be nice to try and get the data!

Thanks!


On 10/18/2016 08:55 AM, Jeff Kubina wrote:


On Tue, Oct 18, 2016 at 6:32 AM, Michael Wall wrote:

Take a look at the master logs for where the WAL was sorted
to the /accumulo/recovery/... directory.  Then look to see
if those WALs are still around and contain content.


Checked one of them, yes it is around with content.

Where is this EOF exception, on a tserver?


Yes, the tserver.

Is the master log complaining about anything?


Repeating a message similar to the tserver but also that the
tablet assignment failed for the tserver.

tservers are not balancing because of all this.











Re: java.IO.EOFException: ..../accumulo/recovery/.../part-r-00000/index not a SequenceFile.

2016-10-18 Thread Josh Elser

Do you have the full exception for the "Expected protocol id.." error?

That looks like it might be incorrect usage of Thrift on our part..

Andrew Hulbert wrote:

Mike,

So backing up and then later deleting the recovery directories a few
times did the trick. It seemed that removing the initial bad one caused
the others to go through for the most part...

I believe all the WAL files were there. I'll look for the WAL deleted in
the GC logs and see if there's any evidence of that. It is version 1.6.4
by the way. Unfortunately can't send the logs to you here but I did save
them off and I'll talk to Jeff about what we can do.

We are currently getting a new error that I'm going to look into...

Expected protocol id 82 but got 0

Expected protocol id 82 but got 6e

etc.

Looking into that now! Thanks for the help so far, as usual!

Andrew

On 10/18/2016 09:46 AM, Michael Wall wrote:

Andrew,

That is what I was going to suggest you try.  Where is that "Unable to
find recovery files for extent" log?  Any way we can see some actual logs?

Are all the WALs there?  Do you find any of the WALs deleted by GC in
the gc logs?  Do you find any duplicate WALs in the HDFS trash?

On Tue, Oct 18, 2016 at 9:32 AM, Andrew Hulbert wrote:

Mike,

For one of the WALs I backed up the recovery directory and that
initiated a new recovery attempt as indicated in the tserver debug
log...

Then the exception was thrown:

Unable to find recovery files for extent xx logentry x
hdfs://path/to/wal/

Any ideas? I figure we can zero out the WAL and it will go on with
life but it would be nice to try and get the data!

Thanks!


On 10/18/2016 08:55 AM, Jeff Kubina wrote:


On Tue, Oct 18, 2016 at 6:32 AM, Michael Wall wrote:

Take a look at the master logs for where the WAL was sorted
to the /accumulo/recovery/... directory.  Then look to see if
those WALs are still around and contain content.


Checked one of them, yes it is around with content.

Where is this EOF exception, on a tserver?


Yes, the tserver.

Is the master log complaining about anything?


Repeating a message similar to the tserver but also that the
tablet assignment failed for the tserver.

tservers are not balancing because of all this.









Re: java.IO.EOFException: ..../accumulo/recovery/.../part-r-00000/index not a SequenceFile.

2016-10-18 Thread Andrew Hulbert

Mike,

So backing up and then later deleting the recovery directories a few 
times did the trick. It seemed that removing the initial bad one caused 
the others to go through for the most part...


I believe all the WAL files were there. I'll look for the WAL deleted in 
the GC logs and see if there's any evidence of that. It is version 1.6.4 
by the way. Unfortunately can't send the logs to you here but I did save 
them off and I'll talk to Jeff about what we can do.


We are currently getting a new error that I'm going to look into...

Expected protocol id 82 but got 0

Expected protocol id 82 but got 6e

etc.

Looking into that now! Thanks for the help so far, as usual!

Andrew

On 10/18/2016 09:46 AM, Michael Wall wrote:

Andrew,

That is what I was going to suggest you try.  Where is that "Unable to
find recovery files for extent" log?  Any way we can see some actual logs?

Are all the WALs there?  Do you find any of the WALs deleted by GC in
the gc logs?  Do you find any duplicate WALs in the HDFS trash?


On Tue, Oct 18, 2016 at 9:32 AM, Andrew Hulbert wrote:


Mike,

For one of the WALs I backed up the recovery directory and that
initiated a new recovery attempt as indicated in the tserver debug
log...

Then the exception was thrown:

Unable to find recovery files for extent xx logentry x
hdfs://path/to/wal/

Any ideas? I figure we can zero out the WAL and it will go on with
life but it would be nice to try and get the data!

Thanks!


On 10/18/2016 08:55 AM, Jeff Kubina wrote:


On Tue, Oct 18, 2016 at 6:32 AM, Michael Wall wrote:

Take a look at the master logs for where the WAL was sorted
to the /accumulo/recovery/... directory.  Then look to see if
those WALs are still around and contain content.


Checked one of them, yes it is around with content.

Where is this EOF exception, on a tserver?


Yes, the tserver.

Is the master log complaining about anything?


Repeating a message similar to the tserver but also that the
tablet assignment failed for the tserver.

tservers are not balancing because of all this.









Re: java.IO.EOFException: ..../accumulo/recovery/.../part-r-00000/index not a SequenceFile.

2016-10-18 Thread Michael Wall
Andrew,

That is what I was going to suggest you try.  Where is that "Unable to find
recovery files for extent" log?  Any way we can see some actual logs?

Are all the WALs there?  Do you find any of the WALs deleted by GC in the gc
logs?  Do you find any duplicate WALs in the HDFS trash?

On Tue, Oct 18, 2016 at 9:32 AM, Andrew Hulbert  wrote:

> Mike,
> For one of the WALs I backed up the recovery directory and that initiated
> a new recovery attempt as indicated in the tserver debug log...
>
> Then the exception was thrown:
>
> Unable to find recovery files for extent xx logentry x
> hdfs://path/to/wal/
>
> Any ideas? I figure we can zero out the WAL and it will go on with life
> but it would be nice to try and get the data!
>
> Thanks!
>
>
> On 10/18/2016 08:55 AM, Jeff Kubina wrote:
>
>
> On Tue, Oct 18, 2016 at 6:32 AM, Michael Wall  wrote:
>
>> Take a look at the master logs for where the WAL was sorted to the 
>> /accumulo/recovery/...
>> directory.  Then look to see if those WALs are still around and contain
>> content.
>>
>
> Checked one of them, yes it is around with content.
>
>> Where is this EOF exception, on a tserver?
>>
>
> Yes, the tserver.
>
>
>> Is the master log complaining about anything?
>>
>
> Repeating a message similar to the tserver but also that the tablet
> assignment failed for the tserver.
>
> tservers are not balancing because of all this.
>
>
>
>


Re: java.IO.EOFException: ..../accumulo/recovery/.../part-r-00000/index not a SequenceFile.

2016-10-18 Thread Andrew Hulbert

Mike,

For one of the WALs I backed up the recovery directory and that 
initiated a new recovery attempt as indicated in the tserver debug log...


Then the exception was thrown:

Unable to find recovery files for extent xx logentry x 
hdfs://path/to/wal/


Any ideas? I figure we can zero out the WAL and it will go on with life 
but it would be nice to try and get the data!


Thanks!

On 10/18/2016 08:55 AM, Jeff Kubina wrote:


On Tue, Oct 18, 2016 at 6:32 AM, Michael Wall wrote:


Take a look at the master logs for where the WAL was sorted to the
/accumulo/recovery/... directory.  Then look to see if those WALs
are still around and contain content.


Checked one of them, yes it is around with content.

Where is this EOF exception, on a tserver?


Yes, the tserver.

Is the master log complaining about anything?


Repeating a message similar to the tserver but also that the tablet 
assignment failed for the tserver.


tservers are not balancing because of all this.






Re: java.IO.EOFException: ..../accumulo/recovery/.../part-r-00000/index not a SequenceFile.

2016-10-18 Thread Jeff Kubina
On Tue, Oct 18, 2016 at 6:32 AM, Michael Wall  wrote:

> Take a look at the master logs for where the WAL was sorted to the 
> /accumulo/recovery/...
> directory.  Then look to see if those WALs are still around and contain
> content.
>

Checked one of them, yes it is around with content.

> Where is this EOF exception, on a tserver?
>

Yes, the tserver.


> Is the master log complaining about anything?
>

Repeating a message similar to the tserver but also that the tablet
assignment failed for the tserver.

tservers are not balancing because of all this.


Re: java.IO.EOFException: ..../accumulo/recovery/.../part-r-00000/index not a SequenceFile.

2016-10-18 Thread Michael Wall
Jeff,

Take a look at the master logs for where the WAL was sorted to the
/accumulo/recovery/...
directory.  Then look to see if those WALs are still around and contain
content.

Where is this EOF exception, on a tserver?

Is the master log complaining about anything?

Mike

On Mon, Oct 17, 2016 at 6:15 PM, Jeff Kubina  wrote:

> We had a lot of datanodes lock up nearly simultaneously in our Accumulo
> instance. Many more of the tservers also went offline. After about two
> hours we were able to get all the datanodes and tservers back online with
> no HDFS blocks lost. However, we have two tservers throwing about 70
> exceptions caused by:
>
> java.IO.EOFException: /accumulo/recovery/.../part-r-0/index not a
> SequenceFile.
>
> For all of the exceptions, the "/accumulo/recovery/.../part-r-0/index"
> files are empty but their associated /accumulo/recovery/.../part-r-0/data
> files are not.
>
> Any suggestions on how we can best recover from these exceptions?
>
>
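
The "not a SequenceFile" message is consistent with the index files being empty: a Hadoop SequenceFile starts with the three bytes 'S', 'E', 'Q' followed by a version byte, so a zero-length file cannot supply a header. A minimal sketch of that check follows, assuming the index or data file has first been copied to local disk (for example with hdfs dfs -get). The class and method names (SeqHeaderCheck, hasSequenceFileMagic) are hypothetical, and the code uses only the JDK, not the Hadoop API, so it looks at the magic bytes and nothing else.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Accumulo or Hadoop.
public class SeqHeaderCheck {
    // SequenceFiles begin with the ASCII bytes "SEQ" (a version byte follows).
    private static final byte[] MAGIC = {'S', 'E', 'Q'};

    /** Returns true if the local file starts with the SequenceFile magic bytes. */
    public static boolean hasSequenceFileMagic(Path localFile) throws IOException {
        try (InputStream in = Files.newInputStream(localFile)) {
            byte[] header = new byte[MAGIC.length];
            int off = 0;
            while (off < header.length) {
                int n = in.read(header, off, header.length - off);
                if (n < 0) {
                    // File ended before a full header: empty or truncated.
                    return false;
                }
                off += n;
            }
            return Arrays.equals(header, MAGIC);
        }
    }

    public static void main(String[] args) throws IOException {
        // Each argument is the path of a local copy of an index or data file.
        for (String arg : args) {
            Path p = Paths.get(arg);
            System.out.println(p + ": " + (hasSequenceFileMagic(p) ? "SEQ header present" : "no SEQ header"));
        }
    }
}
```

A file that passes this check can still be damaged further in, so it only helps separate the clearly empty or truncated files from the rest.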