Cool. I know it's one of those things that's hard to pin down, but if we can be somewhat sure it was just server failure, that's good. I'd hate to write a workaround and then find out it was just because I wrote some bad code :P

Maybe something in the GC logic which presently does the closing of the WALs would be easiest? Should be pretty easy to build a util from the core logic too.
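
For the util route, a rough sketch of what I was picturing is below: scan the replication section of accumulo.metadata and backfill a createdTime into any Status record that doesn't have one. The connection details are obviously placeholders, and the exact class/constant names (MetadataSchema.ReplicationSection, the replication Status protobuf) are from memory, so double-check them against the version you're running.

import java.util.Map.Entry;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.metadata.MetadataTable;
import org.apache.accumulo.core.metadata.schema.MetadataSchema.ReplicationSection;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.accumulo.server.replication.proto.Replication.Status;

/**
 * One-off repair sketch: backfill a createdTime on replication Status
 * records in accumulo.metadata that are missing one, so they stop
 * sitting there untouched. Placeholder instance/credentials; uses "now"
 * as the createdTime, which should be good enough to unblock replication.
 */
public class AddMissingCreatedTime {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("myInstance", "zk1:2181")
        .getConnector("root", new PasswordToken("secret"));

    // Scan only the replication section of the metadata table
    Scanner scanner = conn.createScanner(MetadataTable.NAME, Authorizations.EMPTY);
    scanner.setRange(ReplicationSection.getRange());
    scanner.fetchColumnFamily(ReplicationSection.COLF);

    BatchWriter bw = conn.createBatchWriter(MetadataTable.NAME, new BatchWriterConfig());
    long now = System.currentTimeMillis();

    for (Entry<Key,Value> entry : scanner) {
      Status status = Status.parseFrom(entry.getValue().get());
      if (status.hasCreatedTime()) {
        continue; // record is fine, nothing to do
      }
      // Rewrite the same cell with the createdTime filled in
      Status fixed = Status.newBuilder(status).setCreatedTime(now).build();
      Mutation m = new Mutation(entry.getKey().getRow());
      m.put(entry.getKey().getColumnFamily(), entry.getKey().getColumnQualifier(),
          new Value(fixed.toByteArray()));
      bw.addMutation(m);
      System.out.println("Backfilled createdTime for " + entry.getKey().getRow());
    }
    bw.close();
  }
}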

LMK if/how I can help.
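
For the GC-side route (the "backfill after ~12hrs" idea from earlier in the thread), the eligibility check could be as simple as the sketch below. Using the WAL's HDFS modification time as the age proxy is just an assumption on my part -- the Master/GC might have a better notion of how long the record has been sitting there.

import java.io.IOException;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch of the check the Master/GC could run: if a WAL's Status record
 * still has no createdTime after a threshold, assign one. The HDFS
 * modification time of the WAL stands in for "how long has this been
 * hanging around", since the record itself carries no timestamp to age.
 */
public class CreatedTimeBackfillCheck {
  private static final long THRESHOLD_MS = TimeUnit.HOURS.toMillis(12);

  static boolean shouldBackfillCreatedTime(FileSystem fs, Path wal) throws IOException {
    long age = System.currentTimeMillis() - fs.getFileStatus(wal).getModificationTime();
    return age > THRESHOLD_MS;
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path wal = new Path(args[0]); // e.g. hdfs://foo:8020/accumulo/wal/...
    System.out.println(wal + " eligible for backfill: " + shouldBackfillCreatedTime(fs, wal));
  }
}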

Adam J. Shook wrote:
No exception on write -- this is coming from the master when it goes to
assign work to the accumulo.replication table.  Some of the WALs are
fairly old.

Not too sure why it didn't get the attribute; my guess is server failure
before it was able to append the created time.  Some of the WAL files
are empty, others have data in them.  I think a tool will suffice for
now when the issue crops up, but it'll need to get fixed in the
Master/GC so that, after some condition, it assigns the WAL a
createdTime so replication will occur -- or gives it a createdTime
whenever the first metadata entry is added.

--Adam

On Fri, Feb 17, 2017 at 1:40 PM, Josh Elser <josh.el...@gmail.com> wrote:

    Hey Adam,

    Thanks for sharing this one.

    Adam J. Shook wrote:

        Hello folks,

        One of our clusters has been throwing a handful of replication
        errors from the status maker -- see below.  The WAL files in
        question do not belong to an active tserver -- some investigation
        in the code shows that the createdTime could not be written, and
        these WALs will sit here until a created time is added.


    Does that mean you saw an exception when the mutation carrying the
    createdTime was written to accumulo.metadata and failed? Or is the
    cause of why that WAL didn't get this 'attribute' still unknown?

    I think the kind of fix to make is dependent on the cause here. e.g.
    if this is just a bug, a standalone tool to fix this case would be
    good. However, if there's an inherent issue where this case might
    happen and we can't guarantee the record was written (server
    failure), it might be best to add some process to the master/gc to
    eventually add one (e.g. if we see the WAL has been hanging out in
    that state, add a createdTime after ~12hrs).


        I wanted to bring some attention to this -- I think my immediate
        course of action here is to manually add a createdTime so the
        files will be replicated, then address this within the Accumulo
        source code itself.  Thoughts?

        Status record ([begin: 0 end: 0 infiniteEnd: true closed:true]) for
        hdfs://foo:8020/accumulo/wal/blah/blah in table k was written to
        metadata table which lacked createtime

        Thank you,
        --Adam

