You might be seeing https://issues.apache.org/jira/browse/ACCUMULO-4160
On Mon, Apr 11, 2016 at 5:52 PM Jayesh Patel <[email protected]> wrote:

> There really aren't a lot of log messages that can explain why tablets for
> other tables went offline, except the following:
>
> 2016-04-11 13:32:18,258 [tserver.TabletServerResourceManager$AssignmentWatcher] WARN :
> tserver:instance-accumulo-3 Assignment for 2<< has been running for at least 973455566ms
> java.lang.Exception: Assignment of 2<<
>         at sun.misc.Unsafe.park(Native Method)
>         at java.util.concurrent.locks.LockSupport.park(Unknown Source)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(Unknown Source)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(Unknown Source)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(Unknown Source)
>         at java.util.concurrent.locks.ReentrantLock$FairSync.lock(Unknown Source)
>         at java.util.concurrent.locks.ReentrantLock.lock(Unknown Source)
>         at org.apache.accumulo.tserver.TabletServer.acquireRecoveryMemory(TabletServer.java:2230)
>         at org.apache.accumulo.tserver.TabletServer.access$2600(TabletServer.java:252)
>         at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2150)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:61)
>         at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>         at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
>         at java.lang.Thread.run(Unknown Source)
>
> Table 2<< here doesn't have the issue with minc failing, and so it shouldn't
> be offline. These messages happened on a restart of a tserver, if that
> offers any clues. All the nodes were rebooted at that time due to a power
> failure. I'm assuming that its tablet went offline soon after this message
> first appeared in the logs.
>
> The other tidbit of note is that Accumulo operates for hours or days without
> taking the tablets offline even though minc is failing; it's the crash of a
> tserver due to an OutOfMemory situation in one case that seems to have taken
> the tablet offline. Is it safe to assume that other tservers are not able to
> pick up the tablets that are failing minc from a crashed tserver?
>
> -----Original Message-----
> From: Josh Elser [mailto:[email protected]]
> Sent: Friday, April 08, 2016 10:52 AM
> To: [email protected]
> Subject: Re: Fwd: why compaction failure on one table brings other tables
> offline, how to recover
>
> Billie Rinaldi wrote:
> > *From:* Jayesh Patel
> > *Sent:* Thursday, April 07, 2016 4:36 PM
> > *To:* '[email protected] <mailto:[email protected]>'
> > <[email protected] <mailto:[email protected]>>
> > *Subject:* RE: why compaction failure on one table brings other tables
> > offline, how to recover
> >
> > I have a 3 node Accumulo 1.7 cluster with a few small tables (a few MB
> > in size at most).
> >
> > I had one of those tables fail minc because I had configured a
> > SummingCombiner with FIXEDLEN but had smaller values:
> >
> > MinC failed (trying to convert to long, but byte array isn't long
> > enough, wanted 8 found 1) to create
> > hdfs://instance-accumulo:8020/accumulo/tables/1/default_tablet/F0002bcs.rf_tmp
> > retrying ...
> >
> > I have learned since to set the 'lossy' parameter to true to avoid this.
> > *Why is the default value for it false* if it can cause the catastrophic
> > failure that you'll read about ahead?
>
> I'm pretty sure I told you this on StackOverflow, but if you're not
> writing 8-byte long values, don't use FIXEDLEN. Use VARLEN instead.
>
> > However, this brought the tablets for other tables offline
> > without any apparent errors or warnings. *Can someone please explain
> > why?*
>
> Can you provide logs? We are not wizards :)
>
> > In order to recover from this, I did a 'droptable' from the shell on
> > the affected tables, but they all got stuck in the 'DELETING' state.
> > I was able to finally delete them using the zkcli 'rmr' command. *Is
> > there a better way?*
>
> Again, not sure why they would have gotten stuck in the deleting phase
> without more logs/context (nor how far along in the deletion process they
> got). It's possible that there were still entries in the accumulo.metadata
> table.
>
> > I'm assuming there is a more proper way, because when I created the
> > tables again (with the same name), they went back to having a single
> > offline tablet right away. *Is this because there are "traces" of the
> > old table left behind that affect the new table even though the new
> > table has a different table id?* I ended up wiping out HDFS and
> > recreating the Accumulo instance.
>
> Accumulo uses monotonically increasing IDs to identify tables. The
> human-readable names are only there for your benefit. Creating a table
> with the same name would not cause a problem. It sounds like you got the
> metadata table in a bad state or have tabletservers in a bad state (if
> you haven't restarted them).
>
> > It seems that a small bug, writing a 1-byte value instead of 8 bytes,
> > caused us to dump the whole Accumulo instance. Luckily the data
> > wasn't that important, but this whole episode makes us wonder why
> > doing things the right way (assuming there is a right way) wasn't
> > obvious, or if Accumulo is just very fragile.
>
> Causing Accumulo to be unable to flush data from memory to disk in a
> minor compaction is a very bad idea, and one that we cannot automatically
> recover from because of the combiner configuration you set.
>
> If you can provide logs and stack traces from the Accumulo services, we
> can try to help you further. This is not normal. If you don't believe me,
> take a look at the distributed tests we run each release, where we write
> hundreds of gigabytes of data across many servers while randomly killing
> Accumulo processes.
>
> > Please ask away any questions/clarifications you might have. We'll
> > appreciate any input you might have so we make educated decisions
> > about using Accumulo going forward.
> >
> > Thank you,
> > Jayesh
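
For what it's worth, here is a rough sketch of what Josh's VARLEN suggestion looks like
through the Java client API. The instance name, zookeepers, credentials, table name
("mytable"), and column family ("counts") below are made-up placeholders, so adjust them
for your own setup:

    import java.util.Collections;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.iterators.LongCombiner;
    import org.apache.accumulo.core.iterators.user.SummingCombiner;

    public class AttachVarlenSummingCombiner {
      public static void main(String[] args) throws Exception {
        // Placeholder connection details; substitute your instance name,
        // zookeepers, and credentials.
        Connector conn = new ZooKeeperInstance("instance-accumulo", "zk1:2181")
            .getConnector("root", new PasswordToken("secret"));

        // Sum values in a hypothetical "counts" column family on "mytable".
        IteratorSetting setting = new IteratorSetting(10, "sum", SummingCombiner.class);
        SummingCombiner.setColumns(setting,
            Collections.singletonList(new IteratorSetting.Column("counts")));

        // VARLEN decodes variable-length encoded longs, so short values do not
        // blow up the way FIXEDLEN (which requires exactly 8 bytes) did in this
        // thread ("wanted 8 found 1").
        SummingCombiner.setEncodingType(setting, LongCombiner.Type.VARLEN);

        // Optional: lossy=true makes the combiner skip values it cannot decode
        // instead of throwing and failing the minor compaction; the skipped
        // values are silently dropped, which is why false is the default.
        // SummingCombiner.setLossiness(setting, true);

        // attachIterator applies the setting to the scan, minc, and majc scopes
        // by default.
        conn.tableOperations().attachIterator("mytable", setting);
      }
    }

The shell's setiter command can configure the same iterator interactively if you'd
rather not write code.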
