I'm not sure how to quantify this or give you a way to verify it, but in my experience you want to be producing rfiles that each load into a single tablet. Typically, that means setting the number of reducers equal to the number of tablets in the table you will be importing into, and perhaps using a custom partitioner. I think your intuition is spot on here.
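A rough sketch of what I mean, assuming the stock MapReduce helpers (AccumuloFileOutputFormat and RangePartitioner) and a recent 1.x client API; the class name, mapper, and HDFS paths below are just placeholders for your own job:

    import java.io.BufferedWriter;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;
    import java.util.Collection;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
    import org.apache.accumulo.core.client.mapreduce.lib.partition.RangePartitioner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class DenseBulkLoadJob {

      // Dump the table's current split points to an HDFS file, Base64-encoded,
      // one per line, which is the format RangePartitioner reads.
      static Path writeSplitFile(Connector conn, String table, Configuration conf)
          throws Exception {
        Collection<Text> splits = conn.tableOperations().listSplits(table);
        Path splitFile = new Path("/tmp/" + table + "-splits.txt"); // placeholder path
        FileSystem fs = FileSystem.get(conf);
        try (BufferedWriter out = new BufferedWriter(
            new OutputStreamWriter(fs.create(splitFile, true), StandardCharsets.UTF_8))) {
          for (Text split : splits) {
            out.write(Base64.getEncoder().encodeToString(split.copyBytes()));
            out.newLine();
          }
        }
        return splitFile;
      }

      static Job configureJob(Connector conn, String table, Configuration conf)
          throws Exception {
        Path splitFile = writeSplitFile(conn, table, conf);
        int numSplits = conn.tableOperations().listSplits(table).size();

        Job job = Job.getInstance(conf, "dense bulk load for " + table);
        job.setJarByClass(DenseBulkLoadJob.class);
        // job.setMapperClass(MyKeyValueMapper.class); // your mapper emitting Key/Value
        job.setMapOutputKeyClass(Key.class);
        job.setMapOutputValueClass(Value.class);

        // One reducer per tablet, partitioned on the table's own split points,
        // so each output rfile covers at most one tablet's range.
        job.setNumReduceTasks(numSplits + 1);
        job.setPartitionerClass(RangePartitioner.class);
        RangePartitioner.setSplitFile(job, splitFile.toString());

        job.setOutputFormatClass(AccumuloFileOutputFormat.class);
        AccumuloFileOutputFormat.setOutputPath(job, new Path("/tmp/bulk-out")); // placeholder
        return job;
      }
    }

The key part is the last block: reducers = tablets, and the partition boundaries come from the live table's splits rather than a sample, so each rfile lands on one tablet at import time.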
Of course, if that means that you have a bunch of tiny files, then maybe it's time to rethink your split strategy.

On Tue, Nov 11, 2014 at 5:56 AM, Jeff Turner <[email protected]> wrote:
> is there a good way to compare the overall system effect of
> bulk loading different sets of rfiles that have the same data,
> but very different "densities"?
>
> i've been working on a way to re-feed a lot of data in to a table,
> and have started to believe that our default scheme for creating
> rfiles - mapred in to ~100-200 splits, sampled from 50k tablets -
> is actually pretty bad. subjectively, it feels like rfiles that "span"
> 300 or 400 tablets is bad in at least two ways for the tservers -
> until the files are compacted, all of the "potential" tservers have
> to check the file, right? and then, during compaction, do portions
> of that rfile get volleyed around the cloud until all tservers
> have grabbed their portion? (so, there's network overhead, repeatedly
> reading files and skipping most of the data, ...)
>
> if my new idea works, i will have a lot more control over the density
> of rfiles, and most of them will span just one or two tablets.
>
> so, is there a way to measure/simulate overall system benefit or cost
> of different approaches to building bulk-load data (destined for an
> established table, across N tservers, ...)?
>
> i guess that a related question would be "are 1000 smaller and denser
> bulk files better than 100 larger bulk files produced under a typical
> getSplits() scheme?"
>
> thanks,
> jeff
>
