Re: stupid/dangerous batch load question

Josh Elser Wed, 28 May 2014 11:25:41 -0700

On 5/28/14, 2:22 PM, Mike Drob wrote:

Are you partitioning the resultant files by the existing table splits,
or just sending everything to one file?

Emphasis on this. Sending a large file to every tablet for a table can be very expensive. Trying to align the files you're generating with the splits of a table will help alleviate that cost.
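As a rough illustration of what "aligning with the splits" means, here is a minimal sketch that maps each row to the tablet that would hold it, so that each output file only touches one tablet. The class name `SplitAlignedPartitioner` and the sample split points are hypothetical and not part of Accumulo. It assumes the split points have already been fetched from the table and that plain `String` ordering matches the table's row ordering, which holds for ASCII rows. Accumulo also ships a `RangePartitioner` for MapReduce jobs that serves the same purpose.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Accumulo: maps a row to the index of the
// tablet (and so the output file) that would hold it, given the split points.
final class SplitAlignedPartitioner {
    // Sorted ascending; N split points define N + 1 tablets.
    private final String[] splits;

    SplitAlignedPartitioner(List<String> splitPoints) {
        splits = splitPoints.toArray(new String[0]);
        Arrays.sort(splits);
    }

    // A tablet covers rows in (previousSplit, split], so a row equal to a
    // split point belongs to the tablet that ends at that split.
    int partitionFor(String row) {
        int idx = Arrays.binarySearch(splits, row);
        // A negative result encodes the insertion point as -(insertionPoint) - 1.
        return idx >= 0 ? idx : -(idx + 1);
    }

    int numPartitions() {
        return splits.length + 1;
    }

    public static void main(String[] args) {
        SplitAlignedPartitioner p =
            new SplitAlignedPartitioner(Arrays.asList("g", "n", "t"));
        for (String row : new String[] {"apple", "g", "horse", "zebra"}) {
            System.out.println(row + " -> file " + p.partitionFor(row));
        }
    }
}
```

In a MapReduce job this lookup would sit in the partitioner, with one reducer (and so one rfile) per tablet.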

If you are importing multiple files, then there is potential that some
of the files succeed and others fail. Depending on how your data is laid
out, this may cause application level corruption, but the underlying
key/value store should be ok.


On Wed, May 28, 2014 at 12:49 PM, Seidl, Ed wrote:

    I have a large amount of data that I am batch loading into Accumulo.
    I'm using mapreduce to read in chunks of data and write out rfiles
    to be loaded with importdirectory.  I've noticed that the import
    will hang for longer and longer times as more data is added.  For
    instance, one table, which currently has ~2500 tablets, now takes
    around 2 hours to process the importdirectory.

    In poking around in the source for TableOperationsImpl (1.5.0), I
    see that there is an option to not wait on certain operations (like
    compact).  Would it be dangerous to (optionally) return immediately
    from importdirectory, and instead check the fail directory to detect
    errors in the import?  I know this will eventually cause a backup in
    the staging directories, but is there any potential to corrupt the
    tables?

    Thanks,
    Ed
