Is there a good way to compare the overall system effect of bulk loading different sets of rfiles that contain the same data but have very different "densities"?
I've been working on a way to re-feed a lot of data into a table, and I've started to believe that our default scheme for creating rfiles (mapred into ~100-200 splits, sampled from 50k tablets) is actually pretty bad. Subjectively, it feels like an rfile that "spans" 300 or 400 tablets hurts the tservers in at least two ways: until the file is compacted, all of the "potential" tservers have to check it, right? And then, during compaction, do portions of that rfile get volleyed around the cluster until every tserver has grabbed its portion? (So there's network overhead, repeated reads of the file that skip most of its data, and so on.)

If my new idea works, I will have much more control over the density of the rfiles, and most of them will span just one or two tablets. So, is there a way to measure or simulate the overall system benefit or cost of different approaches to building bulk-load data (destined for an established table, across N tservers, ...)? I guess a related question would be: are 1000 smaller, denser bulk files better than 100 larger bulk files produced under a typical getSplits() scheme?
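For concreteness, one rough offline proxy for "density" would be to count how many tablets each bulk file overlaps, using the table's current split points. Below is a minimal sketch of that idea, assuming a client version with the public RFile reader API (1.8+) and an existing Connector; the class and method names are just illustrative, and the full scan per file is only there to find the first/last row (if `accumulo rfile-info` prints the first/last keys in your version, that would be cheaper).

```java
import java.util.Map.Entry;
import java.util.TreeSet;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.rfile.RFile;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.Text;

public class BulkFileDensity {

  /**
   * Counts how many tablets of the target table a bulk rfile overlaps,
   * i.e. how many tservers would have to consider it after import.
   */
  static int tabletsSpanned(Connector conn, String table, String rfilePath) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Scan the file once just to find its first and last row; fine for a
    // one-off measurement, but wasteful for anything repeated.
    Text firstRow = null, lastRow = null;
    Scanner scanner = RFile.newScanner().from(rfilePath).withFileSystem(fs).build();
    try {
      for (Entry<Key,Value> e : scanner) {
        if (firstRow == null)
          firstRow = new Text(e.getKey().getRow());
        lastRow = new Text(e.getKey().getRow());
      }
    } finally {
      scanner.close();
    }
    if (firstRow == null)
      return 0; // empty file

    // Tablets are delimited by the table's split points, so a file spans
    // 1 + (number of splits falling inside its [firstRow, lastRow) row range).
    TreeSet<Text> splits = new TreeSet<>(conn.tableOperations().listSplits(table));
    return 1 + splits.subSet(firstRow, true, lastRow, false).size();
  }
}
```

That only captures the "how many tservers have to consider the file at import time" half; the compaction-side cost would probably have to be observed directly, e.g. by importing each variant into a test table and watching compaction and ingest activity.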
Thanks,
Jeff