Keith Turner wrote:
On Tue, Jun 7, 2016 at 5:48 PM, Andrew Hulbert <[email protected]> wrote:
Yeah, it looks like in both cases there are files that have ~del markers but are also referenced as file entries for tablets. I assume there's no problem with both? Most are many, many months old.
Yeah, nothing inherently wrong with it. It's easier to create the ~del
entry when we know one tablet is done with it. The GC still checks the
tablet row-space to make sure no tablets still have a reference (to
Keith's point about how multiple tablets can refer to the same file).
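If you want to see what the GC is looking at, a rough (untested) sketch like this lists the ~del candidate rows; the instance name, ZooKeepers, and credentials are placeholders:

import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class ListDeleteCandidates {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
        .getConnector("root", new PasswordToken("secret"));

    Scanner scanner = conn.createScanner("accumulo.metadata", Authorizations.EMPTY);
    // Deletion candidates live in the ~del row space; each row is "~del"
    // followed by the path of a file the GC may remove. The GC only deletes
    // the file once no tablet's file: entry points at that path any more.
    scanner.setRange(new Range("~del", "~dem"));

    for (Entry<Key, Value> entry : scanner) {
      System.out.println(entry.getKey().getRow());
    }
  }
}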
Many actually seem to have multiple file: assignments (multiple rows in the metadata table)... which shouldn't happen, right?
It's OK for multiple tablets (rows in the metadata table) to reference the same file. When a tablet splits, both children may reference some of the parent's files. When a file is bulk imported, it may go to multiple tablets.
I also assume that the files in the directory don't particularly matter, since they are assigned to other tablets in the metadata table.
Cool & thanks again. Fun to learn the internals.
-Andrew
On 06/07/2016 05:34 PM, Josh Elser wrote:
re #1, you can try grep'ing over the Accumulo metadata table to
see if there are references to the file. It's possible that some
files might be kept around for table snapshots (but these should
eventually be compacted per Mike's point in #3, I believe).
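That grep can also be done with the client API instead of the shell. Here's a rough, untested sketch (the file name and connection details are placeholders) that scans the file: column family and filters client-side:

import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class FindFileReferences {
  public static void main(String[] args) throws Exception {
    String fileName = "F0000abc.rf"; // placeholder: an rfile seen in HDFS

    Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
        .getConnector("root", new PasswordToken("secret"));

    Scanner scanner = conn.createScanner("accumulo.metadata", Authorizations.EMPTY);
    // Tablet file references live in the "file" column family; the column
    // qualifier holds the file's path.
    scanner.fetchColumnFamily(new Text("file"));

    // Each match is a tablet (metadata row) still referencing the file;
    // multiple matches are normal after splits or bulk imports.
    for (Entry<Key, Value> entry : scanner) {
      if (entry.getKey().getColumnQualifier().toString().endsWith(fileName)) {
        System.out.println(entry.getKey().getRow() + " -> "
            + entry.getKey().getColumnQualifier());
      }
    }
  }
}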
Mike Drob wrote:
1) Is your Accumulo Garbage Collector process running? It will delete un-referenced files.
2) I've heard it said that 200 tablets per tserver is the sweet spot, but it depends a lot on your read and write patterns.
3) https://accumulo.apache.org/1.7/accumulo_user_manual#_table_compaction_major_everything_idle
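If I'm reading that section right, the knob it describes is the per-table table.compaction.major.everything.idle property, so something like this untested sketch (connection details, table name, and the 30m value are placeholders) shortens the idle window; note that idle compactions are opportunistic, not guaranteed:

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

public class TuneIdleCompaction {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
        .getConnector("root", new PasswordToken("secret"));

    // table.compaction.major.everything.idle controls how long a tablet must
    // sit idle (no mutations) before the tserver may compact all of its files
    // into one.
    conn.tableOperations().setProperty("mytable",
        "table.compaction.major.everything.idle", "30m");
  }
}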
On Tue, Jun 7, 2016 at 4:03 PM, Andrew Hulbert <[email protected]> wrote:
Hi all,
A few questions on behavior if you have any time...
1. When looking in Accumulo's HDFS directories, I'm seeing a situation where "tablets" aka "directories" for a table have more than the default 1G split threshold worth of rfiles in them. In one large instance, we have 400G worth of rfiles in the default_tablet directory (a mix of A-, C-, and F-type rfiles). We took one of these tables and compacted it, and now there is appropriately ~1G worth of files in HDFS. On an unrelated table, we have tablets with 100+G of bulk-imported rfiles in the tablet's HDFS directory. This seems to be common across multiple clouds. All the ingest is done via batch writing. Is anyone aware of why this would happen, or whether it is even important? Perhaps these are leftover rfiles from some process; their timestamps cover large date ranges.
2. There's been some discussion on the number of files per tserver for efficiency. Are there any limits on the size of rfiles for efficiency? For instance, I assume that compacting all the files into a single rfile per 1G split is more efficient because it avoids merging (but maybe decreases concurrency). However, would it be better to have 500 tablets per node on a table with 1G splits versus having 50 tablets with 10G splits? Assuming HDFS and Accumulo don't mind 10G files!
3. Is there any way to force idle tablets to actually major compact, other than from the shell? It seems like it never happens.
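(For what it's worth, I know a full compaction can be forced through the client API, roughly like the untested sketch below, with placeholder connection details and table name; but that's an explicit compaction rather than the idle logic I'm asking about.)

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

public class ForceCompaction {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
        .getConnector("root", new PasswordToken("secret"));

    // compact(table, startRow, endRow, flush, wait): null rows mean the whole
    // table, flush=true flushes in-memory data first, wait=true blocks until
    // the compaction finishes.
    conn.tableOperations().compact("mytable", null, null, true, true);
  }
}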
Thanks!
Andrew