Whoops, forgot the link to the GroupedKeyPartitioner from the excellent accumulo-recipes project: https://github.com/calrissian/accumulo-recipes/blob/master/thirdparty/spark/src/main/scala/org/calrissian/accumulorecipes/spark/support/GroupedKeyPartitioner.scala
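
For anyone who doesn't want to pull in the whole recipe, here is a rough sketch of the same idea (this is not the actual GroupedKeyPartitioner, just an illustration with a made-up class name): build a Spark Partitioner from the table's split points so that each RDD partition maps to exactly one tablet, and no RFile ever spans tablets.

import org.apache.spark.Partitioner

// Hypothetical partitioner: routes each row to the tablet whose split range
// would own it. One partition per tablet, so each output RFile is imported
// into exactly one tablet.
class SplitAlignedPartitioner(splits: Array[String]) extends Partitioner {
  require(splits.sameElements(splits.sorted), "split points must be sorted")

  // N split points => N + 1 tablets
  override def numPartitions: Int = splits.length + 1

  override def getPartition(key: Any): Int = {
    val row = key.toString
    // binary search for the first split point >= row; if there is none,
    // the row belongs to the last (open-ended) tablet
    var lo = 0
    var hi = splits.length
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      if (splits(mid) < row) lo = mid + 1 else hi = mid
    }
    lo
  }
}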
On Thu, Jun 16, 2016 at 8:40 PM Russ Weeks <[email protected]> wrote:

> > 1) Avoid lots of small files. Target files as large as you can, relative
> > to your ingest latency requirements and your max file size (set on your
> > instance or table)
>
> If you're using Spark to produce the RFiles, one trick for this is to call
> coalesce() on your RDD to reduce the number of RFiles that are written to
> HDFS.
>
> > 2) Avoid having to import one file to multiple tablets.
>
> This is huge. Again, if you're using Spark you must not use the
> HashPartitioner to create RDDs or you'll wind up in a situation where every
> tablet owns a piece of every RFile. Ideally you would use something like
> the GroupedKeyPartitioner [1] to align the RDD partitions with the tablet
> splits, but even the built-in RangePartitioner will be much better than the
> HashPartitioner.
>
> -Russ
>
> On Thu, Jun 16, 2016 at 7:24 PM Josh Elser <[email protected]> wrote:
>
>> There are two big things that are required to really scale up bulk
>> loading. Sadly (I guess) they are both things you would need to
>> implement on your own:
>>
>> 1) Avoid lots of small files. Target files as large as you can,
>> relative to your ingest latency requirements and your max file size (set
>> on your instance or table)
>>
>> 2) Avoid having to import one file to multiple tablets. Remember that
>> the majority of the metadata update for Accumulo is updating the tablet
>> row with the new file. When you have one file which spans many tablets,
>> you are now creating N metadata updates instead of just one. When you
>> create the files, take into account the split points of your table, and
>> use that to try to target one file per tablet.
>>
>> Roshan Punnoose wrote:
>> > We are trying to perform bulk ingest at scale and wanted to get some
>> > quick thoughts on how to increase performance and stability. One of the
>> > problems we have is that we sometimes import thousands of small files,
>> > and I don't believe there is a good way around this in the architecture
>> > as of yet. Already I have run into an RPC timeout issue because the
>> > import process is taking longer than 5m. And another issue where we have
>> > so many files after a bulk import that we have had to bump
>> > tserver.scan.files.open.max to 1K.
>> >
>> > Here are some other configs that we have been toying with:
>> > - master.fate.threadpool.size: 20
>> > - master.bulk.threadpool.size: 20
>> > - master.bulk.timeout: 20m
>> > - tserver.bulk.process.threads: 20
>> > - tserver.bulk.assign.threads: 20
>> > - tserver.bulk.timeout: 20m
>> > - tserver.compaction.major.concurrent.max: 20
>> > - tserver.scan.files.open.max: 1200
>> > - tserver.server.threads.minimum: 64
>> > - table.file.max: 64
>> > - table.compaction.major.ratio: 20
>> >
>> > (HDFS)
>> > - dfs.namenode.handler.count: 100
>> > - dfs.datanode.handler.count: 50
>> >
>> > Just want to get any quick ideas for performing bulk ingest at scale.
>> > Thanks guys
>> >
>> > p.s. This is on Accumulo 1.6.5
>> >
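
To tie the two points above together, here is a rough end-to-end sketch of what the Spark side can look like against 1.6.x. The table name, instance, credentials, paths, column names and the toy input RDD are all placeholders, and the partitioner is the SplitAlignedPartitioner sketched earlier (not the real GroupedKeyPartitioner): fetch the table's split points, repartition and sort so each partition lines up with one tablet, write one RFile per partition, then import the directory.

import org.apache.accumulo.core.client.ZooKeeperInstance
import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.JavaConverters._

val sc = new SparkContext(new SparkConf().setAppName("bulk-ingest-sketch"))
val table = "mytable"
val conn = new ZooKeeperInstance("myinstance", "zk1:2181")
  .getConnector("root", new PasswordToken("secret"))

// 1) Fetch the table's current split points so the RFiles line up with tablets.
//    (String comparison only matches Text byte order for plain ASCII rows.)
val splits = conn.tableOperations().listSplits(table).asScala
  .map(_.toString).toArray.sorted

// Toy input: (rowId, value) pairs; in practice this comes from your ETL.
val rows = sc.parallelize(Seq("a" -> "1", "m" -> "2", "z" -> "3"))

// 2) One sorted partition per tablet. If the partitions come out too small for
//    your target file size, coalesce() the upstream RDD first.
val sorted = rows.repartitionAndSortWithinPartitions(new SplitAlignedPartitioner(splits))

// 3) Write one RFile per partition. AccumuloFileOutputFormat needs keys in
//    sorted order, which the sort above provides for a single column per row.
val job = Job.getInstance(sc.hadoopConfiguration)
AccumuloFileOutputFormat.setCompressionType(job, "gz")
sorted
  .map { case (row, v) => (new Key(row, "cf", "cq"), new Value(v.getBytes("UTF-8"))) }
  .saveAsNewAPIHadoopFile("/tmp/bulk/files", classOf[Key], classOf[Value],
    classOf[AccumuloFileOutputFormat], job.getConfiguration)

// 4) Bulk import; the failure directory must already exist and be empty.
conn.tableOperations().importDirectory(table, "/tmp/bulk/files", "/tmp/bulk/failures", false)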
