RE: EXTERNAL: Re: Failing Tablet Servers

Cardon, Tejay E Thu, 20 Sep 2012 15:51:21 -0700

Sorry, yes it's the AccumuloOutputFormat.  I do about 1,000,000 mutation.puts 
before I do a context.write.  Any idea how many is safe?

Thanks,
Tejay

From: Jim Klucar [mailto:[email protected]]
Sent: Thursday, September 20, 2012 4:44 PM
To: [email protected]
Subject: Re: EXTERNAL: Re: Failing Tablet Servers

Do you mean AccumuloOutputFormat? Is the map failing or the reduce failing? How 
many Mutation.put are you doing before a context.write? Too many puts will 
crash the mutation object. You need to periodically call context.write and 
create a new mutation object. At some point I wrote a ContextFlushingMutation 
that handled this problem for you, but I'd have to dig around for it or rewrite 
it.
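
A minimal sketch of the "write periodically, then start a new mutation" pattern described above, in plain Java with no Accumulo dependency. RowBatch, ChunkedRowWriter, maxPuts, and the sink callback are hypothetical names used only for illustration; this is not the ContextFlushingMutation mentioned above. In a real job the stand-in RowBatch would presumably be an Accumulo Mutation and the sink would call context.write. The thread does not say what threshold is safe, so maxPuts is left as a tunable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical stand-in for a per-row mutation: a row ID plus buffered column updates. */
final class RowBatch {
    final String row;
    final List<String[]> updates = new ArrayList<>();

    RowBatch(String row) { this.row = row; }

    void put(String family, String qualifier, String value) {
        updates.add(new String[] {family, qualifier, value});
    }

    int size() { return updates.size(); }
}

/** Buffers puts for one row and hands a batch to the sink every maxPuts updates. */
final class ChunkedRowWriter {
    private final String row;
    private final int maxPuts;             // assumed tunable; not a value from the thread
    private final Consumer<RowBatch> sink; // in a real job this would wrap context.write
    private RowBatch current;

    ChunkedRowWriter(String row, int maxPuts, Consumer<RowBatch> sink) {
        if (maxPuts < 1) throw new IllegalArgumentException("maxPuts must be >= 1");
        this.row = row;
        this.maxPuts = maxPuts;
        this.sink = sink;
        this.current = new RowBatch(row);
    }

    void put(String family, String qualifier, String value) {
        current.put(family, qualifier, value);
        if (current.size() >= maxPuts) flush(); // write out before the batch grows too large
    }

    /** Emits any buffered updates and starts a fresh batch for the same row. */
    void flush() {
        if (current.size() == 0) return;
        sink.accept(current);
        current = new RowBatch(row);
    }

    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        ChunkedRowWriter writer =
            new ChunkedRowWriter("row1", 3, batch -> batchSizes.add(batch.size()));
        for (int i = 0; i < 7; i++) {
            writer.put("fam" + (i % 2), "col" + i, "v" + i);
        }
        writer.flush(); // emit the final partial batch
        System.out.println(batchSizes); // prints [3, 3, 1]
    }
}
```

The final flush() matters: without it, the last partial batch for a row would never reach the sink.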

Sent from my iPhone

On Sep 20, 2012, at 5:29 PM, "Cardon, Tejay E" 
<[email protected]> wrote:
John,
Thanks for the quick response.  I'm not seeing any errors in the logger logs.  
I am using native maps, and I left the memory map size at 1GB.  I assume that's 
plenty large if I'm using native maps, right?

Thanks,
Tejay

From: John Vines [mailto:[email protected]]
Sent: Thursday, September 20, 2012 3:20 PM
To: [email protected]
Subject: EXTERNAL: Re: Failing Tablet Servers

Okay, so we know that you're killing servers. We know when you drop the amount 
of data down, you have no issues. There are two immediate issues that come to 
mind:
1. You modified tservers opts to give them 10G of memory. Did you up the memory 
map size in accumulo-site.xml to make those larger, or did you leave those 
alone? Or did you up them to match the 10G? If you upped them and aren't using 
the native maps, that would be problematic as you need space for other purposes 
as well.

2. You seem to be making giant rows. Depending on your Key/Value size, it's 
possible for you to write a row that you cannot send (especially if using a 
WholeRowIterator) that can cause a cascading error when doing log recovery. Are 
you seeing any sort of errors in your loggers' logs?

John
On Thu, Sep 20, 2012 at 5:05 PM, Cardon, Tejay E 
<[email protected]> wrote:
I'm seeing some strange behavior on a moderate (30 node) cluster.  I've got 27 
tablet servers on large Dell servers with 30GB of memory each.  I've set the 
TServer_OPTS to give them each 10G of memory.  I'm running an ingest process 
that uses AccumuloInputFormat in a MapReduce job to write 1,000 rows with each 
row containing ~1,000,000 columns in 160,000 families.  The MapReduce initially 
runs quite quickly and I can see the ingest rate peak on the monitor page.  
However, after about 30 seconds of high ingest, the ingest falls to 0.  It then 
stalls out and my map tasks are eventually killed.  In the end, the map/reduce 
fails and I usually end up with between 3 and 7 of my Tservers dead.

Inspecting the tserver.err logs shows nothing, even on the nodes that fail.  
The tserver.out log shows a Java OutOfMemoryError, and nothing else.  I've 
included a zip with the logs from one of the failed tservers and a second one 
with the logs from the master.  Other than the out of memory, I'm not seeing 
anything that stands out to me.

If I reduce the data size to only 100,000 columns, rather than 1,000,000, the 
process takes about 4 minutes and completes without incident.

Am I just ingesting too quickly?

Thanks,
Tejay Cardon
