is a file listed in metadata under *all* of the tablets that might have
entries in the file?
(this example is probably bad, but i hope you get the gist):
if a table has tablets for rows A, B, C, D, and E, and a new rfile has
entries for B and E, would tablets B, C, D, and E all have pointers to
the new file in !METADATA? or is there some cleverness, and only B and E
point to the file?
On 11/11/14 11:49 AM, Josh Elser wrote:
You could also affirm your thoughts about RFile usage by generating a
histogram over the metadata table for your table.
The table ID is the common prefix in the metadata table. Each unique
row is a tablet, and will contain zero to many keys with a column
family of 'file'. The column qualifier is a URI to the file (you can
make a Path object from it). The value is a CSV where the first
element is the approximate size of the file (bytes) and the second
element is the number of key-value pairs in the file.
This might help you make a more quantitative analysis over the
distribution of files as you tweak things.
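A minimal sketch of the histogram described above, assuming the
Accumulo 1.6-era Java client; the instance, zookeeper, user, password,
and table names below are placeholders, and on 1.6+ the metadata table
is named accumulo.metadata rather than !METADATA. Tablets with zero
file entries simply won't appear in the output.

import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class FilesPerTablet {
  public static void main(String[] args) throws Exception {
    // placeholder connection details
    Connector conn = new ZooKeeperInstance("instance", "zkhost:2181")
        .getConnector("user", new PasswordToken("secret"));

    // resolve the table name to its id, the common row prefix in metadata
    String tableId = conn.tableOperations().tableIdMap().get("mytable");

    Scanner scan = conn.createScanner("!METADATA", Authorizations.EMPTY);
    // tablet rows look like "<tableId>;<endRow>", and the last (default)
    // tablet's row is "<tableId><", so scan exactly that span
    scan.setRange(new Range(new Text(tableId + ";"), true,
        new Text(tableId + "<\0"), false));
    scan.fetchColumnFamily(new Text("file"));

    // count 'file' entries per tablet row
    Text currentTablet = null;
    int count = 0;
    for (Entry<Key,Value> e : scan) {
      Text row = e.getKey().getRow();
      if (currentTablet == null || !row.equals(currentTablet)) {
        if (currentTablet != null)
          System.out.println(currentTablet + "\t" + count);
        currentTablet = new Text(row);
        count = 0;
      }
      count++;
    }
    if (currentTablet != null)
      System.out.println(currentTablet + "\t" + count);
  }
}

Cutting out the count column and piping it through sort -n | uniq -c
gives the files-per-tablet distribution.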
To reaffirm what Mike said, trying to get close to 1:1 files to
tablets is definitely ideal but can be difficult to manage when
considering all of the potential knobs you can turn (for both ingest and
query characteristics).
Mike Drob wrote:
I'm not sure how to quantify this and give you a way to verify, but in
my experience you want to be producing rfiles that load into a single
tablet. Typically, this means number of reducers equal to the number of
tablets in the table that you will be importing and perhaps a custom
partitioner. I think your intuition is spot on, here.
Of course, if that means that you have a bunch of tiny files, then maybe
it's time to rethink your split strategy.
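A rough sketch of that "reducers equal to tablets" setup: a custom
MapReduce partitioner over Text row keys, the pattern typically used
with AccumuloFileOutputFormat. The driver would fetch the table's split
points (e.g. connector.tableOperations().listSplits(table)), write them
one per line to a file, pass the file's path in the job configuration,
and set the number of reduce tasks to splits + 1. The config key below
is made up, it assumes split points are printable rows, and Accumulo
also ships a RangePartitioner that serves the same purpose.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class TabletRangePartitioner extends Partitioner<Text,Text>
    implements Configurable {
  private Configuration conf;
  private final List<Text> splits = new ArrayList<Text>();

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    try {
      // file containing one split point (tablet end row) per line;
      // "bulk.split.file" is a made-up key for this sketch
      Path p = new Path(conf.get("bulk.split.file"));
      BufferedReader in = new BufferedReader(
          new InputStreamReader(p.getFileSystem(conf).open(p)));
      String line;
      while ((line = in.readLine()) != null)
        splits.add(new Text(line));
      in.close();
      Collections.sort(splits);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public int getPartition(Text row, Text value, int numPartitions) {
    // reducer i gets rows <= splits[i]; the last reducer gets everything
    // past the final split (the table's default tablet)
    int idx = Collections.binarySearch(splits, row);
    if (idx < 0)
      idx = -idx - 1;
    return Math.min(idx, numPartitions - 1);
  }
}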
On Tue, Nov 11, 2014 at 5:56 AM, Jeff Turner <[email protected]> wrote:
is there a good way to compare the overall system effect of
bulk loading different sets of rfiles that have the same data,
but very different "densities"?
i've been working on a way to re-feed a lot of data into a table,
and have started to believe that our default scheme for creating
rfiles - mapred into ~100-200 splits, sampled from 50k tablets -
is actually pretty bad. subjectively, it feels like rfiles that
"span" 300 or 400 tablets are bad in at least two ways for the tservers -
until the files are compacted, all of the "potential" tservers have
to check the file, right? and then, during compaction, do portions
of that rfile get volleyed around the cloud until all tservers
have grabbed their portion? (so, there's network overhead,
repeatedly
reading files and skipping most of the data, ...)
if my new idea works, i will have a lot more control over the
density
of rfiles, and most of them will span just one or two tablets.
so, is there a way to measure/simulate overall system benefit or
cost
of different approaches to building bulk-load data (destined for an
established table, across N tservers, ...)?
i guess that a related question would be "are 1000 smaller and
denser
bulk files better than 100 larger bulk files produced under a
typical
getSplits() scheme?"
thanks,
jeff
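One way to quantify the "density" question before importing anything:
take a candidate rfile's first and last rows (for example, from
"accumulo rfile-info") and count how many of the table's split points
fall between them. A small sketch, assuming an already-built Connector
and placeholder names:

import java.util.Collection;

import org.apache.accumulo.core.client.Connector;
import org.apache.hadoop.io.Text;

public class RfileSpan {
  // number of tablets whose range intersects [firstRow, lastRow] of a file
  static int tabletsSpanned(Connector conn, String table,
      Text firstRow, Text lastRow) throws Exception {
    Collection<Text> splits = conn.tableOperations().listSplits(table);
    int spanned = 1; // the tablet containing firstRow
    for (Text split : splits) {
      // a split point at or after firstRow and before lastRow pushes the
      // file's key range into one more tablet
      if (split.compareTo(firstRow) >= 0 && split.compareTo(lastRow) < 0)
        spanned++;
    }
    return spanned;
  }
}

Summing or histogramming this number over a set of candidate files gives
a concrete way to compare the 100-large-files scheme against the
1000-small-files scheme.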