Keith Turner wrote:
On Tue, Jun 7, 2016 at 5:48 PM, Andrew Hulbert <[email protected]> wrote:
Yeah, it looks like in both cases there are files that have ~del markers but are also referenced as file entries for tablets. I assume there's no problem with both? Most are many, many months old.
Yeah, nothing inherently wrong with it. It's easier to create the ~del
entry when we know one tablet is done with it. The GC still checks the
tablet row-space to make sure no tablets still have a reference (to
Keith's point about how multiple tablets can refer to the same file).
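If you want to see what the GC is looking at, a rough (untested) sketch like this lists the ~del candidate rows; the instance name, ZooKeepers, and credentials are placeholders:

import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class ListDeleteCandidates {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
        .getConnector("root", new PasswordToken("secret"));

    Scanner scanner = conn.createScanner("accumulo.metadata", Authorizations.EMPTY);
    // Deletion candidates live in the ~del row space; each row is "~del"
    // followed by the path of a file the GC may remove. The GC only deletes
    // the file once no tablet's file: entry points at that path any more.
    scanner.setRange(new Range("~del", "~dem"));

    for (Entry<Key, Value> entry : scanner) {
      System.out.println(entry.getKey().getRow());
    }
  }
}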
Many actually seem to have multiple file: assignments (multiple rows in the metadata table)... which shouldn't happen, right?
It's OK for multiple tablets (rows in the metadata table) to reference the same file. When a tablet splits, both children may reference some of the parent's files. When a file is bulk imported, it may go to multiple tablets.
I also assume that the files in the directory don't particularly matter, since they are assigned to other tablets in the metadata table.
Cool & thanks again. Fun to learn the internals.
-Andrew
On 06/07/2016 05:34 PM, Josh Elser wrote:
re #1, you can try grep'ing over the Accumulo metadata table to
see if there are references to the file. It's possible that some
files might be kept around for table snapshots (but these should
eventually be compacted per Mike's point in #3, I believe).
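That grep can also be done with the client API instead of the shell. Here's a rough, untested sketch (the file name and connection details are placeholders) that scans the file: column family and filters client-side:

import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class FindFileReferences {
  public static void main(String[] args) throws Exception {
    String fileName = "F0000abc.rf"; // placeholder: an rfile seen in HDFS

    Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
        .getConnector("root", new PasswordToken("secret"));

    Scanner scanner = conn.createScanner("accumulo.metadata", Authorizations.EMPTY);
    // Tablet file references live in the "file" column family; the column
    // qualifier holds the file's path.
    scanner.fetchColumnFamily(new Text("file"));

    // Each match is a tablet (metadata row) still referencing the file;
    // multiple matches are normal after splits or bulk imports.
    for (Entry<Key, Value> entry : scanner) {
      if (entry.getKey().getColumnQualifier().toString().endsWith(fileName)) {
        System.out.println(entry.getKey().getRow() + " -> "
            + entry.getKey().getColumnQualifier());
      }
    }
  }
}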
Mike Drob wrote:
1) Is your Accumulo Garbage Collector process running? It will delete un-referenced files.
2) I've heard it said that 200 tablets per tserver is the sweet spot, but it depends a lot on your read and write patterns.
3) https://accumulo.apache.org/1.7/accumulo_user_manual#_table_compaction_major_everything_idle
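If I'm reading that section right, the knob it describes is the per-table table.compaction.major.everything.idle property, so something like this untested sketch (connection details, table name, and the 30m value are placeholders) shortens the idle window; note that idle compactions are opportunistic, not guaranteed:

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

public class TuneIdleCompaction {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
        .getConnector("root", new PasswordToken("secret"));

    // table.compaction.major.everything.idle controls how long a tablet must
    // sit idle (no mutations) before the tserver may compact all of its files
    // into one.
    conn.tableOperations().setProperty("mytable",
        "table.compaction.major.everything.idle", "30m");
  }
}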
On Tue, Jun 7, 2016 at 4:03 PM, Andrew Hulbert <[email protected]> wrote:
Hi all,
A few questions on behavior if you have any time...
1. When looking in Accumulo's HDFS directories, I'm seeing a situation where "tablets" aka "directories" for a table have more than the default 1G split threshold worth of rfiles in them. In one large instance, we have 400G worth of rfiles in the default_tablet directory (a mix of A-, C-, and F-type rfiles). We took one of these tables and compacted it, and now there is appropriately ~1G worth of files in HDFS. On an unrelated table, we have tablets with 100+G of bulk-imported rfiles in the tablet's HDFS directory. This seems to be common across multiple clouds. All the ingest is done via batch writing. Is anyone aware of why this would happen, or whether it is even important? Perhaps these are leftover rfiles from some process; their timestamps cover large date ranges.
2. There's been some discussion on the number of files per tserver for efficiency. Are there any limits on the size of rfiles for efficiency? For instance, I assume that compacting all the files into a single rfile per 1G split is more efficient because it avoids merging (but maybe decreases concurrency). However, would it be better to have 500 tablets per node on a table with 1G splits versus having 50 tablets with 10G splits? Assuming HDFS and Accumulo don't mind 10G files!
3. Is there any way to force idle tablets to actually major compact, other than from the shell? It seems like it never happens.
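(For what it's worth, I know a full compaction can be forced through the client API, roughly like the untested sketch below, with placeholder connection details and table name; but that's an explicit compaction rather than the idle logic I'm asking about.)

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

public class ForceCompaction {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
        .getConnector("root", new PasswordToken("secret"));

    // compact(table, startRow, endRow, flush, wait): null rows mean the whole
    // table, flush=true flushes in-memory data first, wait=true blocks until
    // the compaction finishes.
    conn.tableOperations().compact("mytable", null, null, true, true);
  }
}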
Thanks!
Andrew