When you add edges, are you by chance creating one mutation and adding a lot of edges to it? This could create a large mutation, which would have to fit in JVM memory on the tserver (which looks like it's 1g).
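One way to avoid the one-giant-mutation problem is to cap how much each mutation holds before handing it to the BatchWriter. The sketch below is hypothetical scaffolding, not Accumulo API code: the `String[]` edge payloads and the byte budget are made up, and the actual `Mutation`/`writer.addMutation(...)` calls are only indicated in comments. It just shows the batching logic — start a new mutation-sized batch whenever the estimated size crosses a budget, and let the BatchWriter's own client-side buffer do the aggregation.

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeBatcher {
    /**
     * Split a list of (qualifier, value) edge payloads for one row into
     * batches whose estimated byte size stays under maxBytes. In real
     * ingest code each batch would become its own Mutation handed to a
     * BatchWriter, so no single mutation has to fit the whole edge list
     * in tserver memory at once.
     */
    public static List<List<String[]>> splitIntoBatches(List<String[]> edges, int maxBytes) {
        List<List<String[]>> batches = new ArrayList<>();
        List<String[]> current = new ArrayList<>();
        int size = 0;
        for (String[] edge : edges) {
            int edgeBytes = edge[0].length() + edge[1].length(); // rough size estimate
            if (!current.isEmpty() && size + edgeBytes > maxBytes) {
                batches.add(current);          // close out this mutation-sized batch
                current = new ArrayList<>();   // real code: writer.addMutation(m); m = new Mutation(row)
                size = 0;
            }
            current.add(edge);
            size += edgeBytes;
        }
        if (!current.isEmpty()) {
            batches.add(current);              // flush the final partial batch
        }
        return batches;
    }
}
```

The BatchWriter already buffers and sends small mutations efficiently, so there is no throughput benefit to packing everything into one mutation — only the memory risk.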
Accumulo logs messages about what's going on with the Java GC every few seconds. Try grepping the tserver logs for GC — what does this look like?

On Tue, Feb 11, 2014 at 2:23 PM, Kesten Broughton <[email protected]> wrote:

> Hi david,
>
> Responses inline
>
> What is the average load on the servers while the ingest runs?
>
> We are seeing ingest rates (ingest column on accumulo dashboard) of
> 200-400k. Load is low, perhaps up to 1 on a 4 core vm. Less on
> bare-metal. Often we see only one tablet server (of two) ingesting.
> However, both show they are online. Sometimes it is just highly skewed.
> We are now running pre-split ingests.
>
> How large are the mutations?
>
> How do we determine this?
>
> What are your heap sizes?
>
> Typically we are running with configs based on the example 2Gb
> accumulo-site.xml. Our block count is under 2000. See config bundle for
> more details.
>
> How much memory do the servers have?
>
> metal hdfs cluster - 256 Gb, 2nd metal cluster - 128 Gb, virtual boxes -
> 16Gb
>
> Can you move beyond a three node cluster?
>
> We are moving to this now. ETA 2 days for virtualized stack of 9 nodes,
> 8 core, 64 Gb, fully separating 3 zookeepers, master nodes and 3
> datanodes. ETA for metal version of this 1-2 weeks.
>
> Are you querying while writing to the same table?
>
> No
>
> From: David Medinets <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Tuesday, February 11, 2014 at 10:55 AM
> To: accumulo-user <[email protected]>
> Subject: Re: ingest problems
>
> My cluster ingests data every night. We use a map-reduce program to
> generate rFiles. Then import those files into Accumulo. No hiccups. No
> instability. I've also used map-reduce to directly write mutations.
> Haven't seen any issues there either.
>
> What is the average load on the servers while the ingest runs?
> How large are the mutations?
> What are your heap sizes?
> How much memory do the servers have?
> Can you move beyond a three node cluster?
> Are you querying while writing to the same table?
>
> On Tue, Feb 11, 2014 at 11:28 AM, Josh Elser <[email protected]> wrote:
>
>> On 2/11/14, 11:10 AM, Kesten Broughton wrote:
>>
>>> Hi there,
>>>
>>> We have been experimenting with accumulo for about two months now. Our
>>> biggest painpoint has been on ingest.
>>> Often we will have an ingest process fail 2 or 3 times 3/4 of the way
>>> through an ingest and then on a final try it works, without any
>>> changes.
>>
>> Funny, most times I hear that people consider Accumulo to handle ingest
>> fairly well, but let's see what we can do to help.
>>
>> We need a bit more information than what you provided here though:
>> what's your "ingest process"? Are you using some other workflow
>> library? Are you running MapReduce? Do you just have a Java class with
>> a main method that uses a BatchWriter?
>>
>> The fact that it "works sometimes" implies that the problem might be
>> resource related.
>>
>>> Once the ingest works, the cluster is usually stable for querying for
>>> weeks or months, only requiring the occasional start-all.sh if there
>>> is a problem.
>>>
>>> Sometimes our ingest can be 24 hours long, and we need a stronger
>>> ingest story to be able to commit to accumulo.
>>
>> You should be able to run ingest 24/7 with Accumulo without it falling
>> over (I do regularly to stress-test it). The limitation should only be
>> the disk space you have available.
>>
>>> Our cluster architecture has been:
>>> 3 hdfs datanodes overlaid with name node, secondary nn and accumulo
>>> master each collocated with a datanode, and a zookeeper server on
>>> each. We realize this is not optimal and are transitioning to separate
>>> hardware for zookeepers and name/secondary/accumulomaster nodes.
>>> However, the big concern is that sometimes a failed ingest will bork
>>> the whole cluster and we have to re-init accumulo with an accumulo
>>> init, destroying all our data.
>>> We have experienced this on at least three different clusters of this
>>> description.
>>
>> Can you be more specific than "bork the whole cluster"? Unless you're
>> hitting a really nasty bug, there shouldn't be any way that a client
>> writing data into Accumulo will destroy an instance.
>>
>>> The most recent attempt was on a 65GB dataset. The cluster had been up
>>> for over 24 hours. The ingest test takes 40 mins, and about 5 mins in,
>>> one of the datanodes failed.
>>> There were no error logs on the failed node, and the two other nodes
>>> had logs filled with zookeeper connection errors. We were unable to
>>> recover the cluster and had to re-init.
>>
>> Check both the log4j logs and the stdout/stderr redirection files for
>> the datanode process. Typically, if you get an OOME, log4j gets torn
>> down before that exception can be printed to the normal log files.
>> "Silent" failures seem indicative of a lack of physical resources on
>> the box (over-subscribed the node) or insufficient resources provided
>> to the processes (-Xmx was too small for the process).
>>
>>> I know a vague description of problems is difficult to respond to, and
>>> the next time we have an ingest failure, I will bring specifics
>>> forward. But I'm writing to know if
>>> 1. Ingest failures are a known fail point for accumulo, or if we are
>>> perhaps unlucky/mis-configured.
>>
>> No -- something else is going on here.
>>
>>> 2. Are there any guidelines for capturing ingest failures /
>>> determining root causes when errors don't show up in the logs
>>
>> For any help request, be sure to gather Accumulo, Hadoop and ZooKeeper
>> versions, OS and Java versions. Capturing log files and stdout/stderr
>> files is important; beware that if you restart the Accumulo process on
>> that node, it will overwrite the stdout/stderr files, so make sure to
>> copy them out of the way.
>>
>>> 3. Are there any means of checkpointing a data ingest, so that if a
>>> failure were to occur at hour 23.5 we could roll back to hour 23 and
>>> continue? Client code could checkpoint and restart at the last one,
>>> but if the underlying accumulo cluster can't be recovered, that's of
>>> no use.
>>
>> You can do anything you want in your client ingest code :)
>>
>> Assuming that you're using a BatchWriter, if you manually call flush()
>> and it returns without Exception, you can assume that all data up to
>> that point written with that BatchWriter instance is "ingested". This
>> can easily be extrapolated: if you're ingesting CSV files, ensure that
>> a flush() happens every 1000 lines and record that somewhere so that
>> your ingest process can advance itself to the appropriate place in the
>> CSV file and proceed from where it left off.
>>
>>> thanks,
>>>
>>> kesten
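The flush-and-checkpoint pattern described in the quoted thread can be sketched roughly as below. This is hypothetical scaffolding, not real ingest code: the `onFlush` callback stands in for calling `BatchWriter.flush()` and then durably recording the offset, and building a mutation per CSV line is elided to a comment. The point is only the bookkeeping — record the line offset after each successful flush, and on restart resume from the last recorded offset.

```java
import java.util.List;
import java.util.function.Consumer;

public class CheckpointedIngest {
    /**
     * Ingest CSV lines starting at the last checkpoint, invoking onFlush
     * with the new checkpointed line count every flushEvery lines. In
     * real code onFlush would call BatchWriter.flush() and then persist
     * the offset; if flush() returns without exception, every line up to
     * that offset can be considered safely ingested.
     */
    public static int ingest(List<String> lines, int lastCheckpoint,
                             int flushEvery, Consumer<Integer> onFlush) {
        int checkpoint = lastCheckpoint;
        for (int i = lastCheckpoint; i < lines.size(); i++) {
            // real code: build a Mutation from lines.get(i) and addMutation() it
            if ((i + 1 - lastCheckpoint) % flushEvery == 0) {
                checkpoint = i + 1;        // real code: writer.flush() first,
                onFlush.accept(checkpoint); // then persist the new offset
            }
        }
        // final flush for the tail that didn't fill a whole batch
        if (checkpoint < lines.size()) {
            checkpoint = lines.size();
            onFlush.accept(checkpoint);
        }
        return checkpoint;
    }
}
```

On a restart after a failure, the client would pass the last persisted offset back in as `lastCheckpoint` and skip straight to that line, which gives the hour-23-rollback behavior asked about without any server-side support.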
