On Tue, Jun 7, 2016 at 5:48 PM, Andrew Hulbert <[email protected]> wrote:
> Yeah it looks like in both cases there are tablets that have ~del markers
> but are also referenced as entries for tablets. I assume there's no problem
> with both? Most are many many months old.
>
> Many actually seem to have multiple file: assignments (multiple rows in the
> metadata table) ...which shouldn't happen, right?
>

It's ok for multiple tablets (rows in the metadata table) to reference the
same file. When a tablet splits, both children may reference some of the
parent's files. When a file is bulk imported, it may go to multiple tablets.
(Sketches of scanning the metadata table for these file references, and of
forcing a compaction programmatically, are at the bottom of this message.)

> I also assume that the files in the directory don't particularly matter
> since they are assigned to other tablets in the metadata table.
>
> Cool & thanks again. Fun to learn the internals.
>
> -Andrew
>
>
> On 06/07/2016 05:34 PM, Josh Elser wrote:
>
>> re #1, you can try grep'ing over the Accumulo metadata table to see if
>> there are references to the file. It's possible that some files might be
>> kept around for table snapshots (but these should eventually be compacted
>> per Mike's point in #3, I believe).
>>
>> Mike Drob wrote:
>>
>>> 1) Is your Accumulo Garbage Collector process running? It will delete
>>> un-referenced files.
>>> 2) I've heard it said that 200 tablets per tserver is the sweet spot,
>>> but it depends a lot on your read and write patterns.
>>> 3)
>>> https://accumulo.apache.org/1.7/accumulo_user_manual#_table_compaction_major_everything_idle
>>>
>>> On Tue, Jun 7, 2016 at 4:03 PM, Andrew Hulbert <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>>     Hi all,
>>>
>>>     A few questions on behavior if you have any time...
>>>
>>>     1. When looking in Accumulo's HDFS directories I'm seeing a
>>>     situation where "tablets" aka "directories" for a table have more
>>>     than the default 1G split threshold worth of rfiles in them. In one
>>>     large instance, we have 400G worth of rfiles in the default_tablet
>>>     directory (a mix of A, C, and F-type rfiles). We took one of these
>>>     tables and compacted it and now there are appropriately ~1G worth of
>>>     files in HDFS. On an unrelated table we have tablets with 100+G of
>>>     bulk imported rfiles in the tablet's HDFS directory.
>>>
>>>     This seems to be common across multiple clouds. All the ingest is
>>>     done via batch writing. Is anyone aware of why this would happen or
>>>     if it is even important? Perhaps these are leftover rfiles from some
>>>     process. Their timestamps cover large date ranges.
>>>
>>>     2. There's been some discussion on the number of files per tserver
>>>     for efficiency. Are there any limits on the size of rfiles for
>>>     efficiency? For instance, I assume that compacting all the files
>>>     into a single rfile per 1G split is more efficient because it avoids
>>>     merging (but maybe decreases concurrency). However, would it be
>>>     better to have 500 tablets per node on a table with 1G splits versus
>>>     having 50 tablets with 10G splits? Assuming HDFS and Accumulo don't
>>>     mind 10G files!
>>>
>>>     3. Is there any way to force idle tablets to actually major compact
>>>     other than the shell? Seems like it never happens.
>>>
>>>     Thanks!
>>>
>>>     Andrew
>>>
>>
>
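Re the metadata grep Josh mentioned: this is roughly what that lookup looks
like through the Java client API instead of the shell. It's only a sketch;
the instance name, zookeepers, credentials, and the file path to search for
are placeholders, not anything from your setup.

import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class FindFileReferences {
  public static void main(String[] args) throws Exception {
    String fileToFind = args[0]; // e.g. ".../t-0000abc/F0000def.rf" (placeholder)

    // Placeholder connection details.
    Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
        .getConnector("root", new PasswordToken("secret"));

    // Each tablet (row in accumulo.metadata) lists the rfiles it references
    // under the "file" column family.
    Scanner scanner = conn.createScanner("accumulo.metadata", Authorizations.EMPTY);
    scanner.fetchColumnFamily(new Text("file"));

    for (Entry<Key,Value> e : scanner) {
      if (e.getKey().getColumnQualifier().toString().contains(fileToFind)) {
        // The row identifies the tablet that still references the file,
        // which is why the GC has not deleted it.
        System.out.println(e.getKey().getRow() + " -> " + e.getKey().getColumnQualifier());
      }
    }
  }
}

If a file shows up in more than one tablet's row here, that's the normal
split / bulk-import sharing described above, not a problem.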

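And for #3, the same compaction the shell's compact command triggers can be
kicked off from TableOperations, or the idle-compaction property from the
manual section Mike linked can be set per table so the tservers handle it on
their own. Again just a sketch; "mytable" and the "1h" value are made-up
example values, not recommendations.

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.admin.TableOperations;

public class CompactIdleTable {

  static void compactNow(Connector conn) throws Exception {
    TableOperations ops = conn.tableOperations();
    // Force a full major compaction over the whole table (null start/end row)
    // and block until it finishes.
    ops.compact("mytable", null, null, true, true);
  }

  static void enableIdleCompactions(Connector conn) throws Exception {
    // Tablets that have been idle for this long get major compacted on
    // their own, without a manual compact from the shell.
    conn.tableOperations().setProperty("mytable",
        "table.compaction.major.everything.idle", "1h");
  }
}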