Re: Fwd: why compaction failure on one table brings other tables offline, how to recover

Josh Elser Fri, 08 Apr 2016 07:52:33 -0700


Billie Rinaldi wrote:

*From:* Jayesh Patel
*Sent:* Thursday, April 07, 2016 4:36 PM
*To:* '[email protected] <mailto:[email protected]>'
<[email protected] <mailto:[email protected]>>
*Subject:* RE: why compaction failure on one table brings other tables
offline, how to recover


I have a 3 node Accumulo 1.7 cluster with a few small tables (few MB in
size at most).


I had one of those tables fail minc because I had configured a
SummingCombiner with FIXEDLEN but had smaller values:

MinC failed (trying to convert to long, but byte array isn't long
enough, wanted 8 found 1) to create
hdfs://instance-accumulo:8020/accumulo/tables/1/default_tablet/F0002bcs.rf_tmp
retrying ...


I have learned since to set the ‘lossy’ parameter to true to avoid this.
*Why is the default value for it false* if it can cause catastrophic
failure that you’ll read about ahead?

I'm pretty sure I told you this on StackOverflow, but if you're not writing 8-byte long values, don't use FIXEDLEN. Use VARLEN instead.
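
To make the failure mode concrete, here is a minimal stdlib-only sketch of why a fixed-length long decode rejects a 1-byte value. The class and method names (LongEncodingSketch, encodeFixed, decodeFixed) are hypothetical stand-ins for illustration and are not Accumulo's encoder classes; the exception type is also an assumption, and only the "wanted 8 found 1" wording is taken from the log message quoted above.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for a fixed-length long encoding; not Accumulo code.
class LongEncodingSketch {

    // A fixed-length encoding always writes exactly 8 big-endian bytes.
    static byte[] encodeFixed(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    // Decoding needs at least 8 bytes; anything shorter cannot be a long.
    static long decodeFixed(byte[] b) {
        if (b.length < Long.BYTES) {
            throw new IllegalArgumentException(
                "byte array isn't long enough, wanted 8 found " + b.length);
        }
        return ByteBuffer.wrap(b, 0, Long.BYTES).getLong();
    }

    public static void main(String[] args) {
        // A value written with the matching encoder round-trips.
        byte[] good = encodeFixed(42L);
        System.out.println(good.length + " bytes decode to " + decodeFixed(good));

        // A 1-byte value, like the one in the report, fails to decode.
        byte[] bad = new byte[] {1};
        try {
            decodeFixed(bad);
        } catch (IllegalArgumentException e) {
            System.out.println("decode failed: " + e.getMessage());
        }
    }
}
```

If a combiner hits this kind of decode error while data is being flushed, the flush cannot complete, which is consistent with the "MinC failed ... retrying" message in the report.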

However, this brought the tablets for other tables offline without
any apparent errors or warnings. *Can someone please explain why?*


Can you provide logs? We are not wizards :)

In order to recover from this, I did a ‘droptable’ from the shell on the
affected tables, but they all got stuck in the ‘DELETING’ state.  I was
able to finally delete them using zkcli ‘rmr’ command. *Is there a
better way?*

Again, not sure why they would have gotten stuck in the deleting phase without more logs/context (nor how far along in the deletion process they got). It's possible that there were still entries in the accumulo.metadata table.

I’m assuming there is a more proper way because when I created the
tables again (with the same name), they went back to having a single
offline tablet right away. *Is this because there are “traces” of the
old table left behind that affect the new table even though the new
table has a different table id?*  I ended up wiping out hdfs and
recreating the accumulo instance.

Accumulo uses monotonically increasing IDs to identify tables. The human-readable names are only there for your benefit. Creating a table with the same name would not cause a problem. It sounds like you got the metadata table in a bad state or have tabletservers in a bad state (if you haven't restarted them).

It seems that a small bug, writing 1 byte value instead of 8 bytes,
caused us to dump the whole accumulo instance.  Luckily the data wasn’t
that important, but this whole episode makes us wonder why doing things
the right way (assuming there is a right way) wasn’t obvious or if
Accumulo is just very fragile.

Causing Accumulo to be unable to flush data from memory to disk in a minor compaction is a very bad idea, and one that we cannot automatically recover from because of the combiner configuration you set.

If you can provide logs and stack traces from the Accumulo services, we can try to help you further. This is not normal. If you don't believe me, take a look at the distributed tests we run each release where we write hundreds of gigabytes of data across many servers while randomly killing Accumulo processes.


Please ask away any questions/clarification you might have. We’ll
appreciate any input you might have so we make educated decisions about
using Accumulo going forward.


Thank you,

Jayesh
