is a file listed in metadata under *all* of the tablets that might have
entries in the file?
(this example is probably bad, but i hope you get the gist):
if a table has tablets for rows A, B, C, D, and E, and a new rfile has
entries for B and E, would tablets B, C, D, and E all have pointers to
the new file in !METADATA? or is there some cleverness, and only B and E
point to the file?
On 11/11/14 11:49 AM, Josh Elser wrote:
You could also affirm your thoughts about RFile usage by generating a
histogram over the metadata table for your table.
The table ID is the common prefix in the metadata table. Each unique
row is a tablet, and will contain zero to many keys with a column
family of 'file'. The column qualifier is a URI to the file (you can
make a Path object from it). The value is a CSV where the first
element is the approximate size of the file (bytes) and the second
element is the number of key-value pairs in the file.
This might help you make a more quantitative analysis over the
distribution of files as you tweak things.
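A minimal sketch of the histogram described above, assuming the
Accumulo 1.6-era Java client; the instance, zookeeper, user, password,
and table names below are placeholders, and on 1.6+ the metadata table
is named accumulo.metadata rather than !METADATA. Tablets with zero
file entries simply won't appear in the output.

import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class FilesPerTablet {
  public static void main(String[] args) throws Exception {
    // placeholder connection details
    Connector conn = new ZooKeeperInstance("instance", "zkhost:2181")
        .getConnector("user", new PasswordToken("secret"));

    // resolve the table name to its id, the common row prefix in metadata
    String tableId = conn.tableOperations().tableIdMap().get("mytable");

    Scanner scan = conn.createScanner("!METADATA", Authorizations.EMPTY);
    // tablet rows look like "<tableId>;<endRow>", and the last (default)
    // tablet's row is "<tableId><", so scan exactly that span
    scan.setRange(new Range(new Text(tableId + ";"), true,
        new Text(tableId + "<\0"), false));
    scan.fetchColumnFamily(new Text("file"));

    // count 'file' entries per tablet row
    Text currentTablet = null;
    int count = 0;
    for (Entry<Key,Value> e : scan) {
      Text row = e.getKey().getRow();
      if (currentTablet == null || !row.equals(currentTablet)) {
        if (currentTablet != null)
          System.out.println(currentTablet + "\t" + count);
        currentTablet = new Text(row);
        count = 0;
      }
      count++;
    }
    if (currentTablet != null)
      System.out.println(currentTablet + "\t" + count);
  }
}

Cutting out the count column and piping it through sort -n | uniq -c
gives the files-per-tablet distribution.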
To reaffirm what Mike said, trying to get close to 1:1 files to
tablets is definitely ideal but can be difficult to manage when
considering all of the potential knobs you can turn (for both ingest and
query characteristics).
Mike Drob wrote:
I'm not sure how to quantify this and give you a way to verify, but in
my experience you want to be producing rfiles that load into a single
tablet. Typically, this means number of reducers equal to the number of
tablets in the table that you will be importing and perhaps a custom
partitioner. I think your intuition is spot on, here.
Of course, if that means that you have a bunch of tiny files, then maybe
it's time to rethink your split strategy.
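A rough sketch of that "reducers equal to tablets" setup: a custom
MapReduce partitioner over Text row keys, the pattern typically used
with AccumuloFileOutputFormat. The driver would fetch the table's split
points (e.g. connector.tableOperations().listSplits(table)), write them
one per line to a file, pass the file's path in the job configuration,
and set the number of reduce tasks to splits + 1. The config key below
is made up, it assumes split points are printable rows, and Accumulo
also ships a RangePartitioner that serves the same purpose.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class TabletRangePartitioner extends Partitioner<Text,Text>
    implements Configurable {
  private Configuration conf;
  private final List<Text> splits = new ArrayList<Text>();

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    try {
      // file containing one split point (tablet end row) per line;
      // "bulk.split.file" is a made-up key for this sketch
      Path p = new Path(conf.get("bulk.split.file"));
      BufferedReader in = new BufferedReader(
          new InputStreamReader(p.getFileSystem(conf).open(p)));
      String line;
      while ((line = in.readLine()) != null)
        splits.add(new Text(line));
      in.close();
      Collections.sort(splits);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public int getPartition(Text row, Text value, int numPartitions) {
    // reducer i gets rows <= splits[i]; the last reducer gets everything
    // past the final split (the table's default tablet)
    int idx = Collections.binarySearch(splits, row);
    if (idx < 0)
      idx = -idx - 1;
    return Math.min(idx, numPartitions - 1);
  }
}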
On Tue, Nov 11, 2014 at 5:56 AM, Jeff Turner <[email protected]> wrote:
is there a good way to compare the overall system effect of
bulk loading different sets of rfiles that have the same data,
but very different "densities"?
i've been working on a way to re-feed a lot of data into a table,
and have started to believe that our default scheme for creating
rfiles - mapred into ~100-200 splits, sampled from 50k tablets -
is actually pretty bad. subjectively, it feels like rfiles that
"span" 300 or 400 tablets are bad in at least two ways for the tservers -
until the files are compacted, all of the "potential" tservers have
to check the file, right? and then, during compaction, do portions
of that rfile get volleyed around the cloud until all tservers
have grabbed their portion? (so, there's network overhead,
repeatedly
reading files and skipping most of the data, ...)
if my new idea works, i will have a lot more control over the
density
of rfiles, and most of them will span just one or two tablets.
so, is there a way to measure/simulate overall system benefit or
cost
of different approaches to building bulk-load data (destined for an
established table, across N tservers, ...)?
i guess that a related question would be "are 1000 smaller and
denser
bulk files better than 100 larger bulk files produced under a
typical
getSplits() scheme?"
thanks,
jeff
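One way to quantify the "density" question before importing anything:
take a candidate rfile's first and last rows (for example, from
"accumulo rfile-info") and count how many of the table's split points
fall between them. A small sketch, assuming an already-built Connector
and placeholder names:

import java.util.Collection;

import org.apache.accumulo.core.client.Connector;
import org.apache.hadoop.io.Text;

public class RfileSpan {
  // number of tablets whose range intersects [firstRow, lastRow] of a file
  static int tabletsSpanned(Connector conn, String table,
      Text firstRow, Text lastRow) throws Exception {
    Collection<Text> splits = conn.tableOperations().listSplits(table);
    int spanned = 1; // the tablet containing firstRow
    for (Text split : splits) {
      // a split point at or after firstRow and before lastRow pushes the
      // file's key range into one more tablet
      if (split.compareTo(firstRow) >= 0 && split.compareTo(lastRow) < 0)
        spanned++;
    }
    return spanned;
  }
}

Summing or histogramming this number over a set of candidate files gives
a concrete way to compare the 100-large-files scheme against the
1000-small-files scheme.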