Or, if it's more convenient, this is the issue I was thinking of: https://issues.apache.org/jira/browse/ACCUMULO-4065

Andrew Hulbert wrote:
I'll try to dig up the full error from the tserver

On 10/18/2016 10:30 AM, Josh Elser wrote:
Do you have the full exception for the "Expected protocol id.." error?

That looks like it might be incorrect usage of Thrift on our part.

Andrew Hulbert wrote:

So backing up and then later deleting the recovery directories a few
times did the trick. It seemed that removing the initial bad one caused
the others to go through for the most part...

I believe all the WAL files were there. I'll look for the WAL deleted in
the GC logs and see if there's any evidence of that. It is version 1.6.4
by the way. Unfortunately can't send the logs to you here but I did save
them off and I'll talk to Jeff about what we can do.

We are currently getting a new error that I'm going to look into...

Expected protocol id ffffffff82 but got 0

Expected protocol id ffffffff82 but got 6e


Looking into that now! Thanks for the help so far, as usual!
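For anyone reading the two "Expected protocol id" lines above, a minimal sketch of how they can be decoded. It assumes the messages come from Thrift's compact protocol, whose protocol id byte is 0x82, and that the leading f digits are sign extension of that byte when it is printed as hex. The helper names are hypothetical and are not part of Accumulo or Thrift.

```python
import re

# Assumption: the compact protocol starts each message with the byte 0x82.
COMPACT_PROTOCOL_ID = 0x82

_PATTERN = re.compile(
    r"Expected protocol id ([0-9a-fA-F]+) but got ([0-9a-fA-F]+)"
)


def decode_protocol_id_error(message):
    """Hypothetical helper: return (expected_byte, received_byte)."""
    match = _PATTERN.search(message)
    if match is None:
        raise ValueError("not a protocol id error: %r" % (message,))
    # Keep only the low 8 bits so sign-extended hex collapses to one byte.
    expected = int(match.group(1), 16) & 0xFF
    received = int(match.group(2), 16) & 0xFF
    return expected, received


def describe(message):
    """Summarise the first byte the reader saw instead of the protocol id."""
    expected, received = decode_protocol_id_error(message)
    printable = chr(received) if 0x20 <= received < 0x7F else None
    return {
        "expected_is_compact": expected == COMPACT_PROTOCOL_ID,
        "received": received,
        "received_as_ascii": printable,
    }


if __name__ == "__main__":
    for line in (
        "Expected protocol id ffffffff82 but got 0",
        "Expected protocol id ffffffff82 but got 6e",
    ):
        print(line, "->", describe(line))
```

Under those assumptions, the reader expected a compact-protocol message but the first byte it saw was 0x00 in one case and 0x6e (ASCII "n") in the other, so the bytes being read were probably not the start of a compact-protocol message. The full stack trace is still needed to say where those bytes came from.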


On 10/18/2016 09:46 AM, Michael Wall wrote:

That is what I was going to suggest you try. Where is that "Unable to
find recovery files for extent" log? Any way we can see some actual

Are all the WALs there? Do you see any of the WALs being deleted by the
GC in the gc logs? Do you find any duplicate WALs in the HDFS trash?

On Tue, Oct 18, 2016 at 9:32 AM, Andrew Hulbert <ahulb...@ccri.com> wrote:


For one of the WALs I backed up the recovery directory and that
initiated a new recovery attempt as indicated in the tserver debug

Then the exception was thrown:

Unable to find recovery files for extent xxxxxx logentry xxxxx

Any ideas? I figure we can zero out the WAL and it will go on with
life but it would be nice to try and get the data!


On 10/18/2016 08:55 AM, Jeff Kubina wrote:

On Tue, Oct 18, 2016 at 6:32 AM, Michael Wall <mjw...@gmail.com> wrote:

Take a look at the master logs for where the WAL was sorted
to the /accumulo/recovery/... directory. Then look to see if
those WALs are still around and contain content.

Checked one of them, yes it is around with content.

Where is this EOF exception, on a tserver?

Yes, the tserver.

Is the master log complaining about anything?

It is repeating a message similar to the tserver's, and also reporting
that tablet assignment failed for the tserver.

tservers are not balancing because of all this.

