Note that the error is more like this:

Expected protocol id ffffff82 but got 35 (0!;38\\;82,<servername>:9997, <somelonghex>)
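
For anyone reading the hex in that message: a minimal sketch, assuming the text comes from a check like the one in Thrift's compact protocol, where the expected id is the single byte 0x82 and both values are printed with Integer.toHexString. The class and method names here are hypothetical and only illustrate the formatting. A Java byte is signed, so 0x82 widens to a negative int and prints as ffffff82 (eight hex digits), while a small positive byte such as 0x35 prints as just 35.

```java
public class ProtocolIdHex {
    // Assumption: the expected id is the one-byte value 0x82.
    static final byte EXPECTED_ID = (byte) 0x82;

    // Hypothetical helper that mimics the wording of the error above.
    static String describe(byte got) {
        // Integer.toHexString takes an int, so each byte is sign-extended first.
        return "Expected protocol id " + Integer.toHexString(EXPECTED_ID)
                + " but got " + Integer.toHexString(got);
    }

    public static void main(String[] args) {
        // The first byte read was 0x35 instead of the expected id.
        System.out.println(describe((byte) 0x35));
    }
}
```

If that assumption holds, the "got" value is simply the first byte the reader saw on the connection, which would explain why it varies (35, 0, 6e) from one message to the next.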



On 10/18/2016 10:28 AM, Andrew Hulbert wrote:

Mike,

So backing up and then later deleting the recovery directories a few times did the trick. It seemed that removing the initial bad one caused the others to go through for the most part...

I believe all the WAL files were there. I'll look for deleted WALs in the GC logs and see if there's any evidence of that. It is version 1.6.4, by the way. Unfortunately I can't send the logs to you here, but I did save them off and I'll talk to Jeff about what we can do.

We are currently getting a new error that I'm going to look into...

Expected protocol id ffffffff82 but got 0

Expected protocol id ffffffff82 but got 6e

etc.

Looking into that now! Thanks for the help so far, as usual!

Andrew

On 10/18/2016 09:46 AM, Michael Wall wrote:
Andrew,

That is what I was going to suggest you try. Where is that "Unable to find recovery files for extent" log? Any way we can see some actual logs?

Are all the WALs there? Do you find any of the WALs deleted by GC in the gc logs? Do you find any duplicate WALs in the HDFS trash?

On Tue, Oct 18, 2016 at 9:32 AM, Andrew Hulbert <ahulb...@ccri.com> wrote:

    Mike,

    For one of the WALs I backed up the recovery directory and that
    initiated a new recovery attempt as indicated in the tserver
    debug log...

    Then the exception was thrown:

    Unable to find recovery files for extent xxxxxx logentry xxxxx
    hdfs://path/to/wal/yyyy

    Any ideas? I figure we can zero out the WAL and it will go on
    with life but it would be nice to try and get the data!

    Thanks!


    On 10/18/2016 08:55 AM, Jeff Kubina wrote:

    On Tue, Oct 18, 2016 at 6:32 AM, Michael Wall <mjw...@gmail.com> wrote:

        Take a look at the master logs for where the WAL was sorted
        to the /accumulo/recovery/... directory.  Then look to see
        if those WALs are still around and contain content.


    Checked one of them, yes it is around with content.

        Where is this EOF exception, on a tserver?


    Yes, the tserver.

        Is the master log complaining about anything?


    Repeating a message similar to the tserver but also that the
    tablet assignment failed for the tserver.

    tservers are not balancing because of all this.





