Thanks guys! Awesome stuff.

On Thu, Jun 16, 2016, 11:41 PM Russ Weeks <[email protected]> wrote:
> Whoops, forgot the link. GroupedKeyPartitioner is from the excellent
> accumulo-recipes project:
> https://github.com/calrissian/accumulo-recipes/blob/master/thirdparty/spark/src/main/scala/org/calrissian/accumulorecipes/spark/support/GroupedKeyPartitioner.scala
>
> On Thu, Jun 16, 2016 at 8:40 PM Russ Weeks <[email protected]> wrote:
>
>> > 1) Avoid lots of small files. Target files as large as you can,
>> relative to your ingest latency requirements and your max file size
>> (set on your instance or table)
>>
>> If you're using Spark to produce the RFiles, one trick for this is to
>> call coalesce() on your RDD to reduce the number of RFiles that are
>> written to HDFS.
>>
>> > 2) Avoid having to import one file to multiple tablets.
>>
>> This is huge. Again, if you're using Spark, you must not use the
>> HashPartitioner to create RDDs or you'll wind up in a situation where
>> every tablet owns a piece of every RFile. Ideally you would use
>> something like the GroupedKeyPartitioner [1] to align the RDD
>> partitions with the tablet splits, but even the built-in
>> RangePartitioner will be much better than the HashPartitioner.
>>
>> -Russ
>>
>> On Thu, Jun 16, 2016 at 7:24 PM Josh Elser <[email protected]> wrote:
>>
>>> There are two big things that are required to really scale up bulk
>>> loading. Sadly (I guess) they are both things you would need to
>>> implement on your own:
>>>
>>> 1) Avoid lots of small files. Target files as large as you can,
>>> relative to your ingest latency requirements and your max file size
>>> (set on your instance or table).
>>>
>>> 2) Avoid having to import one file to multiple tablets. Remember that
>>> the majority of the metadata update for Accumulo is updating the
>>> tablet row with the new file. When you have one file which spans many
>>> tablets, you are now creating N metadata updates instead of just one.
>>> When you create the files, take into account the split points of your
>>> table, and use that to try to target one file per tablet.
>>>
>>> Roshan Punnoose wrote:
>>> > We are trying to perform bulk ingest at scale and wanted to get
>>> > some quick thoughts on how to increase performance and stability.
>>> > One of the problems we have is that we sometimes import thousands
>>> > of small files, and I don't believe there is a good way around this
>>> > in the architecture as of yet. Already I have run into an RPC
>>> > timeout issue because the import process is taking longer than 5m,
>>> > and another issue where we have so many files after a bulk import
>>> > that we have had to bump tserver.scan.files.open.max to 1K.
>>> >
>>> > Here are some other configs that we have been toying with:
>>> > - master.fate.threadpool.size: 20
>>> > - master.bulk.threadpool.size: 20
>>> > - master.bulk.timeout: 20m
>>> > - tserver.bulk.process.threads: 20
>>> > - tserver.bulk.assign.threads: 20
>>> > - tserver.bulk.timeout: 20m
>>> > - tserver.compaction.major.concurrent.max: 20
>>> > - tserver.scan.files.open.max: 1200
>>> > - tserver.server.threads.minimum: 64
>>> > - table.file.max: 64
>>> > - table.compaction.major.ratio: 20
>>> >
>>> > (HDFS)
>>> > - dfs.namenode.handler.count: 100
>>> > - dfs.datanode.handler.count: 50
>>> >
>>> > Just want to get any quick ideas for performing bulk ingest at
>>> > scale. Thanks guys
>>> >
>>> > p.s. This is on Accumulo 1.6.5
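
For reference, a minimal Spark/Scala sketch of the coalesce() trick Russ
describes above. The helper name writeRFiles, the RDD kvs, and the numFiles
parameter are illustrative, not from the thread; it assumes Accumulo 1.6's
AccumuloFileOutputFormat and that Key/Value go through Kryo for the shuffle
(as far as I know they are Writables, not java.io.Serializable).

import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD

// Illustrative helper: shrink the partition count before sorting so the job
// writes a handful of large RFiles instead of one small RFile per input
// partition. Each output partition becomes exactly one RFile.
def writeRFiles(kvs: RDD[(Key, Value)], outputDir: String, numFiles: Int): Unit = {
  val job = Job.getInstance()
  // sortByKey() keeps the coalesced partition count and range-partitions the
  // keys, so each RFile covers a contiguous, sorted slice of the key space.
  kvs.coalesce(numFiles)
     .sortByKey()
     .saveAsNewAPIHadoopFile(
       outputDir,
       classOf[Key],
       classOf[Value],
       classOf[AccumuloFileOutputFormat],
       job.getConfiguration)
}

Picking numFiles from the batch size (total bytes divided by the target RFile
size) keeps the output in line with Josh's "as large as you can" advice.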
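
And a sketch of the split-alignment idea (Josh's point 2, Russ's partitioner
suggestion). TabletSplitPartitioner here is a hypothetical stand-in for
GroupedKeyPartitioner, just to show the shape: partitions mirror tablet
boundaries, so each RFile lands in exactly one tablet. It assumes the split
points were fetched up front, e.g. via
connector.tableOperations().listSplits(table), and passed in ascending order
as byte arrays.

import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.hadoop.io.WritableComparator
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Hypothetical partitioner that mirrors the table's tablet boundaries.
// `splits` holds the split points (tablet end rows) in ascending byte order,
// e.g. tableOperations().listSplits(table).asScala.map(_.copyBytes).toArray
class TabletSplitPartitioner(splits: Array[Array[Byte]]) extends Partitioner {
  override def numPartitions: Int = splits.length + 1

  override def getPartition(key: Any): Int = {
    val row = key.asInstanceOf[Key].getRow
    // Tablet i holds rows in (split(i-1), split(i)]; the last tablet holds
    // everything after the final split. Linear scan for clarity only; a real
    // partitioner would binary-search.
    val i = splits.indexWhere { s =>
      WritableComparator.compareBytes(row.getBytes, 0, row.getLength, s, 0, s.length) <= 0
    }
    if (i < 0) splits.length else i
  }
}

// Illustrative helper: repartition by tablet, sort within each partition, and
// write one RFile per tablet.
def writeAlignedRFiles(kvs: RDD[(Key, Value)], splits: Array[Array[Byte]], outputDir: String): Unit = {
  val job = Job.getInstance()
  kvs.repartitionAndSortWithinPartitions(new TabletSplitPartitioner(splits))
     .saveAsNewAPIHadoopFile(
       outputDir,
       classOf[Key],
       classOf[Value],
       classOf[AccumuloFileOutputFormat],
       job.getConfiguration)
}

The same Kryo caveat applies to this shuffle. Once the files are written, the
directory goes to tableOperations().importDirectory(table, dir, failuresDir,
false); with the partitions aligned this way, each file should map to a single
tablet, so the metadata update stays at one mutation per file.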
