You might be seeing https://issues.apache.org/jira/browse/ACCUMULO-4160
On Mon, Apr 11, 2016 at 5:52 PM Jayesh Patel <[email protected]> wrote:

> There really aren't a lot of log messages that can explain why tablets for
> other tables went offline, except the following:
>
> 2016-04-11 13:32:18,258 [tserver.TabletServerResourceManager$AssignmentWatcher] WARN :
> tserver:instance-accumulo-3 Assignment for 2<< has been running for at least 973455566ms
> java.lang.Exception: Assignment of 2<<
>         at sun.misc.Unsafe.park(Native Method)
>         at java.util.concurrent.locks.LockSupport.park(Unknown Source)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(Unknown Source)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(Unknown Source)
>         at java.util.concurrent.locks.ReentrantLock$FairSync.lock(Unknown Source)
>         at java.util.concurrent.locks.ReentrantLock.lock(Unknown Source)
>         at org.apache.accumulo.tserver.TabletServer.acquireRecoveryMemory(TabletServer.java:2230)
>         at org.apache.accumulo.tserver.TabletServer.access$2600(TabletServer.java:252)
>         at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2150)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:61)
>         at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Unknown Source)
>
> Table 2<< here doesn't have the issue with minc failing, and so it shouldn't
> be offline. These messages happened on a restart of a tserver, if that
> offers any clues. All the nodes were rebooted at that time due to a power
> failure. I'm assuming that its tablet went offline soon after this message
> first appeared in the logs.
>
> The other tidbit of note is that Accumulo operates for hours or days without
> taking the tablets offline even though minc is failing; it's the crash of a
> tserver due to an OutOfMemory situation in one case that seems to have taken
> the tablet offline. Is it safe to assume that other tservers are not able to
> pick up the tablets that are failing minc from a crashed tserver?
>
> -----Original Message-----
> From: Josh Elser [mailto:[email protected]]
> Sent: Friday, April 08, 2016 10:52 AM
> To: [email protected]
> Subject: Re: Fwd: why compaction failure on one table brings other tables
> offline, how to recover
>
> Billie Rinaldi wrote:
> > *From:* Jayesh Patel
> > *Sent:* Thursday, April 07, 2016 4:36 PM
> > *To:* '[email protected] <mailto:[email protected]>'
> > <[email protected] <mailto:[email protected]>>
> > *Subject:* RE: why compaction failure on one table brings other tables
> > offline, how to recover
> >
> > I have a 3 node Accumulo 1.7 cluster with a few small tables (a few MB
> > in size at most).
> >
> > I had one of those tables fail minc because I had configured a
> > SummingCombiner with FIXEDLEN but had smaller values:
> >
> > MinC failed (trying to convert to long, but byte array isn't long
> > enough, wanted 8 found 1) to create
> > hdfs://instance-accumulo:8020/accumulo/tables/1/default_tablet/F0002bcs.rf_tmp
> > retrying ...
> >
> > I have learned since to set the 'lossy' parameter to true to avoid this.
> > *Why is the default value for it false* if it can cause the catastrophic
> > failure that you'll read about ahead?
>
> I'm pretty sure I told you this on StackOverflow, but if you're not
> writing 8-byte long values, don't use FIXEDLEN. Use VARLEN instead.
>
> > However, this brought the tablets for other tables offline
> > without any apparent errors or warnings. *Can someone please explain
> > why?*
>
> Can you provide logs? We are not wizards :)
>
> > In order to recover from this, I did a 'droptable' from the shell on
> > the affected tables, but they all got stuck in the 'DELETING' state.
> > I was able to finally delete them using the zkcli 'rmr' command. *Is
> > there a better way?*
>
> Again, not sure why they would have gotten stuck in the deleting phase
> without more logs/context (nor how far along in the deletion process they
> got). It's possible that there were still entries in the accumulo.metadata
> table.
>
> > I'm assuming there is a more proper way, because when I created the
> > tables again (with the same name), they went back to having a single
> > offline tablet right away. *Is this because there are "traces" of the
> > old table left behind that affect the new table even though the new
> > table has a different table id?* I ended up wiping out HDFS and
> > recreating the Accumulo instance.
>
> Accumulo uses monotonically increasing IDs to identify tables. The
> human-readable names are only there for your benefit. Creating a table
> with the same name would not cause a problem. It sounds like you got the
> metadata table in a bad state or have tabletservers in a bad state (if
> you haven't restarted them).
>
> > It seems that a small bug, writing a 1-byte value instead of 8 bytes,
> > caused us to dump the whole Accumulo instance. Luckily the data
> > wasn't that important, but this whole episode makes us wonder why
> > doing things the right way (assuming there is a right way) wasn't
> > obvious, or if Accumulo is just very fragile.
>
> Causing Accumulo to be unable to flush data from memory to disk in a
> minor compaction is a very bad idea, and one that we cannot automatically
> recover from because of the combiner configuration you set.
>
> If you can provide logs and stack traces from the Accumulo services, we
> can try to help you further. This is not normal. If you don't believe me,
> take a look at the distributed tests we run each release, where we write
> hundreds of gigabytes of data across many servers while randomly killing
> Accumulo processes.
>
> > Please ask away any questions/clarifications you might have. We'll
> > appreciate any input you might have so we make educated decisions
> > about using Accumulo going forward.
> >
> > Thank you,
> > Jayesh
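
For what it's worth, here is a rough sketch of what Josh's VARLEN suggestion looks like
through the Java client API. The instance name, zookeepers, credentials, table name
("mytable"), and column family ("counts") below are made-up placeholders, so adjust them
for your own setup:

    import java.util.Collections;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.iterators.LongCombiner;
    import org.apache.accumulo.core.iterators.user.SummingCombiner;

    public class AttachVarlenSummingCombiner {
      public static void main(String[] args) throws Exception {
        // Placeholder connection details; substitute your instance name,
        // zookeepers, and credentials.
        Connector conn = new ZooKeeperInstance("instance-accumulo", "zk1:2181")
            .getConnector("root", new PasswordToken("secret"));

        // Sum values in a hypothetical "counts" column family on "mytable".
        IteratorSetting setting = new IteratorSetting(10, "sum", SummingCombiner.class);
        SummingCombiner.setColumns(setting,
            Collections.singletonList(new IteratorSetting.Column("counts")));

        // VARLEN decodes variable-length encoded longs, so short values do not
        // blow up the way FIXEDLEN (which requires exactly 8 bytes) did in this
        // thread ("wanted 8 found 1").
        SummingCombiner.setEncodingType(setting, LongCombiner.Type.VARLEN);

        // Optional: lossy=true makes the combiner skip values it cannot decode
        // instead of throwing and failing the minor compaction; the skipped
        // values are silently dropped, which is why false is the default.
        // SummingCombiner.setLossiness(setting, true);

        // attachIterator applies the setting to the scan, minc, and majc scopes
        // by default.
        conn.tableOperations().attachIterator("mytable", setting);
      }
    }

The shell's setiter command can configure the same iterator interactively if you'd
rather not write code.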
