That's the rub. I have 120 reducers running, so I wind up with 120 RFiles to import. I haven't tried playing with a custom partitioner to send adjacent ranges to reducers so the RFiles won't have overlapping keys. Perhaps that would help?
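A partitioner along those lines usually boils down to a binary search over the table's sorted split points, so each reducer receives one contiguous, non-overlapping key range (the same idea Accumulo's RangePartitioner implements). A minimal, Hadoop-free sketch of just that lookup; the class and method names here are illustrative, not Accumulo's:

```java
import java.util.Arrays;

// Sketch: map a row to a reducer index using the table's sorted split
// points. With N split points there are N+1 partitions, one per
// contiguous key range; a row equal to a split point belongs to the
// tablet that ends at that split (end row inclusive).
public class SplitPartitioner {
    private final String[] splits; // sorted split points, e.g. from the live table

    public SplitPartitioner(String[] splits) {
        this.splits = splits.clone();
        Arrays.sort(this.splits);
    }

    public int getPartition(String row) {
        int idx = Arrays.binarySearch(splits, row);
        // negative result encodes the insertion point: first split > row
        return idx < 0 ? -(idx + 1) : idx;
    }

    public static void main(String[] args) {
        SplitPartitioner p = new SplitPartitioner(new String[] {"g", "m", "t"});
        System.out.println(p.getPartition("a")); // 0: before "g"
        System.out.println(p.getPartition("m")); // 1: ends the ("g","m"] range
        System.out.println(p.getPartition("z")); // 3: after the last split
    }
}
```

Wiring this into MapReduce would mean implementing Hadoop's `Partitioner` with this logic and feeding it the target table's current split points, so each reducer's RFile covers exactly one tablet range.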
Thanks,
Ed

From: Mike Drob <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, May 28, 2014 11:22 AM
To: "[email protected]" <[email protected]>
Subject: Re: stupid/dangerous batch load question

Are you partitioning the resultant files by the existing table splits, or just sending everything to one file?

If you are importing multiple files, then there is potential that some of the files succeed and others fail. Depending on how your data is laid out, this may cause application-level corruption, but the underlying key/value store should be OK.

On Wed, May 28, 2014 at 12:49 PM, Seidl, Ed <[email protected]> wrote:

I have a large amount of data that I am batch loading into Accumulo. I'm using MapReduce to read in chunks of data and write out RFiles to be loaded with importdirectory. I've noticed that the import hangs for longer and longer as more data is added. For instance, one table, which currently has ~2500 tablets, now takes around 2 hours to process the importdirectory. Poking around in the source for TableOperationsImpl (1.5.0), I see that there is an option to not wait on certain operations (like compact). Would it be dangerous to (optionally) return immediately from importdirectory and instead check the fail directory to detect errors in the import? I know this will eventually cause a backup in the staging directories, but is there any potential to corrupt the tables?

Thanks,
Ed
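The "check the fail directory" idea above could be as simple as polling that directory after the (hypothetically non-waiting) import returns: bulk import moves any RFiles it could not load into the failure directory, so an empty directory means every file made it in. A minimal local-filesystem illustration; a real check would look at the HDFS failure path, and the file names here are made up:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: after a bulk import, RFiles that could not be loaded are left
// in the failure directory. An empty directory means every file was
// imported; anything remaining needs to be re-imported (or investigated).
public class FailDirCheck {
    public static boolean importSucceeded(Path failDir) throws IOException {
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(failDir)) {
            return !stream.iterator().hasNext(); // success iff nothing was left behind
        }
    }

    public static void main(String[] args) throws IOException {
        Path failDir = Files.createTempDirectory("failures");
        System.out.println(importSucceeded(failDir)); // true: nothing failed
        Files.createFile(failDir.resolve("part-00042.rf"));
        System.out.println(importSucceeded(failDir)); // false: one RFile was not loaded
    }
}
```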
