is a file listed in metadata under *all* of the tablets that might have entries in the file?

(this example is probably bad, but i hope you get the gist):

if a table has tablets for rows A, B, C, D, and E, and a new rfile has
entries for B and E, would tablets B, C, D, and E all have pointers to the
new file in !METADATA? or is there some cleverness, and only B and E point
to the file?

On 11/11/14 11:49 AM, Josh Elser wrote:
You could also affirm your thoughts about RFile usage by generating a histogram over the metadata table for your table.

The table ID is the common prefix in the metadata table. Each unique row is a tablet and will contain zero to many keys with a column family of 'file'. The column qualifier is a URI to the file (you can make a Path object from it). The value is a CSV where the first element is the approximate size of the file in bytes and the second element is the number of key-value pairs in the file.

This might help you make a more quantitative analysis over the distribution of files as you tweak things.
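
Something like the following would roll that up into a files-per-tablet count. Treat it as an untested sketch against the 1.6-era client API: the instance, zookeepers, credentials, and table ID below are placeholders, and on 1.4/1.5 you would scan !METADATA instead of accumulo.metadata.

import java.util.Map.Entry;
import java.util.TreeMap;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class FilesPerTablet {
  public static void main(String[] args) throws Exception {
    // placeholders: instance name, zookeepers, credentials, and the table id
    Connector conn = new ZooKeeperInstance("myInstance", "zkhost:2181")
        .getConnector("root", new PasswordToken("secret"));
    String tableId = "3"; // find yours with 'tables -l' in the shell

    Scanner scanner = conn.createScanner("accumulo.metadata", Authorizations.EMPTY);
    // tablet rows are "<tableId>;<endRow>", plus "<tableId><" for the last tablet
    scanner.setRange(new Range(new Text(tableId + ";"), true, new Text(tableId + "<"), true));
    scanner.fetchColumnFamily(new Text("file"));

    TreeMap<String,Integer> filesPerTablet = new TreeMap<String,Integer>();
    for (Entry<Key,Value> entry : scanner) {
      String tablet = entry.getKey().getRow().toString();
      Integer count = filesPerTablet.get(tablet);
      filesPerTablet.put(tablet, count == null ? 1 : count + 1);
      // entry.getValue() is "<approx bytes>,<num key-value pairs>" if you want sizes too
    }
    for (Entry<String,Integer> e : filesPerTablet.entrySet()) {
      System.out.println(e.getKey() + "\t" + e.getValue() + " files");
    }
  }
}

Tablets whose count jumps after a bulk load are the ones that many of your rfiles overlapped.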

To reaffirm what Mike said, getting close to a 1:1 ratio of files to tablets is definitely ideal, but it can be difficult to manage when considering all of the potential knobs you can turn (for both ingest and query characteristics).

Mike Drob wrote:
I'm not sure how to quantify this and give you a way to verify, but in
my experience you want to be producing rfiles that load into a single
tablet. Typically, this means setting the number of reducers equal to the
number of tablets in the table you will be importing into, and perhaps
using a custom partitioner. I think your intuition is spot on here.


Of course, if that means that you have a bunch of tiny files, then maybe
it's time to rethink your split strategy.
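
To make that concrete, the job wiring looks roughly like the sketch below. It loosely follows Accumulo's bulk ingest example and is untested; the instance, credentials, table name, and paths are placeholders, and your own input format, mapper, and reducer are omitted.

import java.io.PrintWriter;
import java.util.Collection;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
import org.apache.accumulo.core.client.mapreduce.lib.partition.RangePartitioner;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.commons.codec.binary.Base64;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class DenseBulkJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "dense-rfile-bulk-gen");
    job.setJarByClass(DenseBulkJob.class);
    // input format, mapper, and reducer for your own data go here;
    // the reduce side must emit Accumulo Key/Value pairs
    job.setOutputKeyClass(Key.class);
    job.setOutputValueClass(Value.class);

    // placeholders: instance, zookeepers, credentials, table name, paths
    Connector conn = new ZooKeeperInstance("myInstance", "zkhost:2181")
        .getConnector("root", new PasswordToken("secret"));

    // write the table's current split points to a cut file
    // (the bulk ingest example base64-encodes them, one per line)
    Collection<Text> splits = conn.tableOperations().listSplits("mytable");
    FileSystem fs = FileSystem.get(conf);
    Path splitsFile = new Path("/tmp/mytable-splits.txt");
    PrintWriter out = new PrintWriter(fs.create(splitsFile));
    for (Text split : splits)
      out.println(new String(Base64.encodeBase64(split.copyBytes())));
    out.close();

    // one reducer per tablet, and a partitioner that sends each key to the
    // reducer owning its tablet, so each output rfile falls within one tablet
    job.setNumReduceTasks(splits.size() + 1);
    job.setPartitionerClass(RangePartitioner.class);
    RangePartitioner.setSplitFile(job, splitsFile.toString());

    job.setOutputFormatClass(AccumuloFileOutputFormat.class);
    AccumuloFileOutputFormat.setOutputPath(job, new Path("/tmp/bulk/files"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With the cut points taken from the live table's splits, each reducer writes an rfile whose key range fits inside a single tablet, which is the single-tablet rfile shape described above.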

On Tue, Nov 11, 2014 at 5:56 AM, Jeff Turner <[email protected]> wrote:

    is there a good way to compare the overall system effect of
    bulk loading different sets of rfiles that have the same data,
    but very different "densities"?

    i've been working on a way to re-feed a lot of data into a table,
    and have started to believe that our default scheme for creating
    rfiles - mapred into ~100-200 splits, sampled from 50k tablets -
    is actually pretty bad. subjectively, it feels like rfiles that
    "span" 300 or 400 tablets are bad in at least two ways for the
    tservers - until the files are compacted, all of the "potential"
    tservers have to check the file, right? and then, during compaction,
    do portions of that rfile get volleyed around the cloud until all
    tservers have grabbed their portion? (so, there's network overhead,
    repeatedly reading files and skipping most of the data, ...)

    if my new idea works, i will have a lot more control over the
    density of rfiles, and most of them will span just one or two
    tablets.

    so, is there a way to measure/simulate overall system benefit or
    cost of different approaches to building bulk-load data (destined
    for an established table, across N tservers, ...)?

    i guess that a related question would be "are 1000 smaller and
    denser bulk files better than 100 larger bulk files produced under
    a typical getSplits() scheme?"

    thanks,
    jeff
