1) Is your Accumulo Garbage Collector process running? It deletes un-referenced files, so leftover rfiles will accumulate if it isn't.
2) I've heard it said that 200 tablets per tserver is the sweet spot, but it depends a lot on your read and write patterns.
3) Idle tablets are compacted based on the table.compaction.major.everything.idle setting; see https://accumulo.apache.org/1.7/accumulo_user_manual#_table_compaction_major_everything_idle
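For reference, here's a minimal Java sketch of both options against the 1.7 client API: lowering the idle-compaction threshold on a table, and forcing a full major compaction programmatically rather than from the shell. The instance name, ZooKeeper host, credentials, and table name are placeholders.

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;

    public class IdleCompactionExample {
        public static void main(String[] args) throws Exception {
            // Placeholder instance name, ZooKeeper quorum, credentials, and table name.
            Connector conn = new ZooKeeperInstance("myInstance", "zkhost:2181")
                    .getConnector("root", new PasswordToken("secret"));

            // Lower the idle-compaction threshold so tablets that stop receiving
            // writes get major compacted sooner than the default (1h).
            conn.tableOperations().setProperty("mytable",
                    "table.compaction.major.everything.idle", "30m");

            // Or force a full major compaction of the table right away:
            // null start/end rows = whole table; flush memory first, don't block.
            conn.tableOperations().compact("mytable", null, null, true, false);
        }
    }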
On Tue, Jun 7, 2016 at 4:03 PM, Andrew Hulbert <[email protected]> wrote:
> Hi all,
>
> A few questions on behavior if you have any time...
>
> 1. When looking in Accumulo's HDFS directories I'm seeing a situation
> where "tablets" (aka directories) for a table have more than the default 1G
> split threshold worth of rfiles in them. In one large instance, we have
> 400G worth of rfiles in the default_tablet directory (a mix of A-, C-, and
> F-type rfiles). We took one of these tables and compacted it, and now there
> is appropriately ~1G worth of files in HDFS. On an unrelated table we have
> tablets with 100+G of bulk-imported rfiles in the tablet's HDFS directory.
>
> This seems to be common across multiple clouds. All the ingest is done
> via batch writing. Is anyone aware of why this would happen or whether it is
> even important? Perhaps these are leftover rfiles from some process. Their
> timestamps cover large date ranges.
>
> 2. There's been some discussion of the number of files per tserver for
> efficiency. Are there any limits on the size of rfiles for efficiency? For
> instance, I assume that compacting all the files into a single rfile per 1G
> split is more efficient because it avoids merging (but maybe decreases
> concurrency). However, would it be better to have 500 tablets per node on a
> table with 1G splits versus having 50 tablets with 10G splits? Assuming
> HDFS and Accumulo don't mind 10G files!
>
> 3. Is there any way to force idle tablets to actually major compact other
> than from the shell? It seems like it never happens.
>
> Thanks!
>
> Andrew
>
