Hi Mike,
Thanks for responding.
1) Yeah, the GC is running... 90% were still in use and not deleted.
Though the GC issue seems interesting, since the compacted table had the
same number of entries after compaction. I found a few examples (perhaps
all are this way) of files that have a ~del marker in the metadata
table. I'll go search the GC logs to see what's happening, or figure out
whether they are still thought to be "in use". Do you know what marks them
as in use?
2) Cool, 200 seems a little low though unless you increased the split
size. Has anyone tried large split sizes?
3) Yeah I saw that...unfortunately the cluster is rarely idle and it
seems that extremely few tablets from lots of different tables have ever
actually been completely compacted. It's a pretty multi-tenant environment.
-Andrew
On 06/07/2016 05:18 PM, Mike Drob wrote:
1) Is your Accumulo Garbage Collector process running? It will delete
un-referenced files.
2) I've heard it said that 200 tablets per tserver is the sweet spot,
but it depends a lot on your read and write patterns.
3)
https://accumulo.apache.org/1.7/accumulo_user_manual#_table_compaction_major_everything_idle
On Tue, Jun 7, 2016 at 4:03 PM, Andrew Hulbert wrote:
Hi all,
A few questions on behavior if you have any time...
1. When looking in Accumulo's HDFS directories I'm seeing a
situation where "tablets" aka "directories" for a table have more
than the default 1G split threshold worth of rfiles in them. In
one large instance, we have 400G worth of rfiles in the
default_tablet directory (a mix of A, C, and F-type rfiles). We
took one of these tables and compacted it and now there are
appropriately ~1G worth of files in HDFS. On an unrelated table we
have tablets with 100+G of bulk imported rfiles in the tablet's
HDFS directory.
This seems to be common across multiple clouds. All the ingest is
done via batch writing. Is anyone aware of why this would happen
or if it is even important? Perhaps these are leftover rfiles from
some process. Their timestamps cover large date ranges.
2. There's been some discussion on the number of files per tserver
for efficiency. Are there any limits on the size of rfiles for
efficiency? For instance, I assume that compacting all the files
into a single rfile per 1G split is more efficient because it avoids
merging (but maybe decreases concurrency). However, would it be
better to have 500 tablets per node on a table with 1G splits
versus 50 tablets with 10G splits? That's assuming HDFS and
Accumulo don't mind 10G files!
3. Is there any way to force idle tablets to actually major
compact other than the shell? Seems like it never happens.
Thanks!
Andrew
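
P.S. To put the tradeoff in question 2 in numbers, here is a minimal sketch of the arithmetic. It assumes 500G of data per node (a figure chosen only because it matches the 500 x 1G example) and that every tablet sits right at the split threshold, so the result is a rough lower bound on the tablet count rather than anything Accumulo itself reports. The function name is made up for illustration.

```python
import math


def tablets_per_node(data_gb_per_node, split_threshold_gb):
    """Rough lower bound on tablets per node, assuming each tablet
    holds exactly one split threshold's worth of data."""
    return math.ceil(data_gb_per_node / split_threshold_gb)


# Assumed per-node data volume; not a measured value from the cluster.
DATA_GB = 500

for split_gb in (1, 10):
    count = tablets_per_node(DATA_GB, split_gb)
    print(f"{split_gb}G splits: about {count} tablets per node")
```

Both layouts hold the same total data; the split threshold only changes how many tablets (and therefore how many files to track and merge) each node carries.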