Whoops, forgot the link to the GroupedKeyPartitioner from the excellent accumulo-recipes project: https://github.com/calrissian/accumulo-recipes/blob/master/thirdparty/spark/src/main/scala/org/calrissian/accumulorecipes/spark/support/GroupedKeyPartitioner.scala
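
For anyone who doesn't want to pull in the whole recipe, here is a rough sketch of the same idea (this is not the actual GroupedKeyPartitioner, just an illustration with a made-up class name): build a Spark Partitioner from the table's split points so that each RDD partition maps to exactly one tablet, and no RFile ever spans tablets.

import org.apache.spark.Partitioner

// Hypothetical partitioner: routes each row to the tablet whose split range
// would own it. One partition per tablet, so each output RFile is imported
// into exactly one tablet.
class SplitAlignedPartitioner(splits: Array[String]) extends Partitioner {
  require(splits.sameElements(splits.sorted), "split points must be sorted")

  // N split points => N + 1 tablets
  override def numPartitions: Int = splits.length + 1

  override def getPartition(key: Any): Int = {
    val row = key.toString
    // binary search for the first split point >= row; if there is none,
    // the row belongs to the last (open-ended) tablet
    var lo = 0
    var hi = splits.length
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      if (splits(mid) < row) lo = mid + 1 else hi = mid
    }
    lo
  }
}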
On Thu, Jun 16, 2016 at 8:40 PM Russ Weeks <[email protected]> wrote:

> > 1) Avoid lots of small files. Target files as large as you can, relative
> > to your ingest latency requirements and your max file size (set on your
> > instance or table)
>
> If you're using Spark to produce the RFiles, one trick for this is to call
> coalesce() on your RDD to reduce the number of RFiles that are written to
> HDFS.
>
> > 2) Avoid having to import one file to multiple tablets.
>
> This is huge. Again, if you're using Spark you must not use the
> HashPartitioner to create RDDs or you'll wind up in a situation where every
> tablet owns a piece of every RFile. Ideally you would use something like
> the GroupedKeyPartitioner [1] to align the RDD partitions with the tablet
> splits, but even the built-in RangePartitioner will be much better than the
> HashPartitioner.
>
> -Russ
>
> On Thu, Jun 16, 2016 at 7:24 PM Josh Elser <[email protected]> wrote:
>
>> There are two big things that are required to really scale up bulk
>> loading. Sadly (I guess) they are both things you would need to
>> implement on your own:
>>
>> 1) Avoid lots of small files. Target files as large as you can,
>> relative to your ingest latency requirements and your max file size (set
>> on your instance or table)
>>
>> 2) Avoid having to import one file to multiple tablets. Remember that
>> the majority of the metadata update for Accumulo is updating the tablet
>> row with the new file. When you have one file which spans many tablets,
>> you are now creating N metadata updates instead of just one. When you
>> create the files, take into account the split points of your table, and
>> use that to try to target one file per tablet.
>>
>> Roshan Punnoose wrote:
>> > We are trying to perform bulk ingest at scale and wanted to get some
>> > quick thoughts on how to increase performance and stability. One of the
>> > problems we have is that we sometimes import thousands of small files,
>> > and I don't believe there is a good way around this in the architecture
>> > as of yet. Already I have run into an RPC timeout issue because the
>> > import process is taking longer than 5m. And another issue where we have
>> > so many files after a bulk import that we have had to bump
>> > tserver.scan.files.open.max to 1K.
>> >
>> > Here are some other configs that we have been toying with:
>> > - master.fate.threadpool.size: 20
>> > - master.bulk.threadpool.size: 20
>> > - master.bulk.timeout: 20m
>> > - tserver.bulk.process.threads: 20
>> > - tserver.bulk.assign.threads: 20
>> > - tserver.bulk.timeout: 20m
>> > - tserver.compaction.major.concurrent.max: 20
>> > - tserver.scan.files.open.max: 1200
>> > - tserver.server.threads.minimum: 64
>> > - table.file.max: 64
>> > - table.compaction.major.ratio: 20
>> >
>> > (HDFS)
>> > - dfs.namenode.handler.count: 100
>> > - dfs.datanode.handler.count: 50
>> >
>> > Just want to get any quick ideas for performing bulk ingest at scale.
>> > Thanks guys
>> >
>> > p.s. This is on Accumulo 1.6.5
>> >
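
To tie the two points above together, here is a rough end-to-end sketch of what the Spark side can look like against 1.6.x. The table name, instance, credentials, paths, column names and the toy input RDD are all placeholders, and the partitioner is the SplitAlignedPartitioner sketched earlier (not the real GroupedKeyPartitioner): fetch the table's split points, repartition and sort so each partition lines up with one tablet, write one RFile per partition, then import the directory.

import org.apache.accumulo.core.client.ZooKeeperInstance
import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.JavaConverters._

val sc = new SparkContext(new SparkConf().setAppName("bulk-ingest-sketch"))
val table = "mytable"
val conn = new ZooKeeperInstance("myinstance", "zk1:2181")
  .getConnector("root", new PasswordToken("secret"))

// 1) Fetch the table's current split points so the RFiles line up with tablets.
//    (String comparison only matches Text byte order for plain ASCII rows.)
val splits = conn.tableOperations().listSplits(table).asScala
  .map(_.toString).toArray.sorted

// Toy input: (rowId, value) pairs; in practice this comes from your ETL.
val rows = sc.parallelize(Seq("a" -> "1", "m" -> "2", "z" -> "3"))

// 2) One sorted partition per tablet. If the partitions come out too small for
//    your target file size, coalesce() the upstream RDD first.
val sorted = rows.repartitionAndSortWithinPartitions(new SplitAlignedPartitioner(splits))

// 3) Write one RFile per partition. AccumuloFileOutputFormat needs keys in
//    sorted order, which the sort above provides for a single column per row.
val job = Job.getInstance(sc.hadoopConfiguration)
AccumuloFileOutputFormat.setCompressionType(job, "gz")
sorted
  .map { case (row, v) => (new Key(row, "cf", "cq"), new Value(v.getBytes("UTF-8"))) }
  .saveAsNewAPIHadoopFile("/tmp/bulk/files", classOf[Key], classOf[Value],
    classOf[AccumuloFileOutputFormat], job.getConfiguration)

// 4) Bulk import; the failure directory must already exist and be empty.
conn.tableOperations().importDirectory(table, "/tmp/bulk/files", "/tmp/bulk/failures", false)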
