Is there a good way to compare the overall system effect of bulk loading different sets of rfiles that contain the same data but have very different "densities"?
I've been working on a way to re-feed a lot of data into a table, and I've started to believe that our default scheme for creating rfiles (mapred into ~100-200 splits, sampled from 50k tablets) is actually pretty bad. Subjectively, it feels like an rfile that "spans" 300 or 400 tablets hurts the tservers in at least two ways: until the file is compacted, all of the "potential" tservers have to check it, right? And then, during compaction, do portions of that rfile get volleyed around the cluster until every tserver has grabbed its portion? (So there's network overhead, repeated reads of the file that skip most of its data, and so on.)

If my new idea works, I will have much more control over the density of the rfiles, and most of them will span just one or two tablets. So, is there a way to measure or simulate the overall system benefit or cost of different approaches to building bulk-load data (destined for an established table, across N tservers, ...)? I guess a related question would be: are 1000 smaller, denser bulk files better than 100 larger bulk files produced under a typical getSplits() scheme?
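For concreteness, one rough offline proxy for "density" would be to count how many tablets each bulk file overlaps, using the table's current split points. Below is a minimal sketch of that idea, assuming a client version with the public RFile reader API (1.8+) and an existing Connector; the class and method names are just illustrative, and the full scan per file is only there to find the first/last row (if `accumulo rfile-info` prints the first/last keys in your version, that would be cheaper).

```java
import java.util.Map.Entry;
import java.util.TreeSet;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.rfile.RFile;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.Text;

public class BulkFileDensity {

  /**
   * Counts how many tablets of the target table a bulk rfile overlaps,
   * i.e. how many tservers would have to consider it after import.
   */
  static int tabletsSpanned(Connector conn, String table, String rfilePath) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Scan the file once just to find its first and last row; fine for a
    // one-off measurement, but wasteful for anything repeated.
    Text firstRow = null, lastRow = null;
    Scanner scanner = RFile.newScanner().from(rfilePath).withFileSystem(fs).build();
    try {
      for (Entry<Key,Value> e : scanner) {
        if (firstRow == null)
          firstRow = new Text(e.getKey().getRow());
        lastRow = new Text(e.getKey().getRow());
      }
    } finally {
      scanner.close();
    }
    if (firstRow == null)
      return 0; // empty file

    // Tablets are delimited by the table's split points, so a file spans
    // 1 + (number of splits falling inside its [firstRow, lastRow) row range).
    TreeSet<Text> splits = new TreeSet<>(conn.tableOperations().listSplits(table));
    return 1 + splits.subSet(firstRow, true, lastRow, false).size();
  }
}
```

That only captures the "how many tservers have to consider the file at import time" half; the compaction-side cost would probably have to be observed directly, e.g. by importing each variant into a test table and watching compaction and ingest activity.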
Thanks,
Jeff