I'm not sure how to quantify this or give you a way to verify it, but in my experience you want to be producing rfiles that each load into a single tablet. Typically, that means setting the number of reducers equal to the number of tablets in the table you will be importing into, and perhaps using a custom partitioner. I think your intuition is spot on here.
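A rough sketch of what I mean, assuming the stock MapReduce helpers (AccumuloFileOutputFormat and RangePartitioner) and a recent 1.x client API; the class name, mapper, and HDFS paths below are just placeholders for your own job:

    import java.io.BufferedWriter;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;
    import java.util.Collection;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
    import org.apache.accumulo.core.client.mapreduce.lib.partition.RangePartitioner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class DenseBulkLoadJob {

      // Dump the table's current split points to an HDFS file, Base64-encoded,
      // one per line, which is the format RangePartitioner reads.
      static Path writeSplitFile(Connector conn, String table, Configuration conf)
          throws Exception {
        Collection<Text> splits = conn.tableOperations().listSplits(table);
        Path splitFile = new Path("/tmp/" + table + "-splits.txt"); // placeholder path
        FileSystem fs = FileSystem.get(conf);
        try (BufferedWriter out = new BufferedWriter(
            new OutputStreamWriter(fs.create(splitFile, true), StandardCharsets.UTF_8))) {
          for (Text split : splits) {
            out.write(Base64.getEncoder().encodeToString(split.copyBytes()));
            out.newLine();
          }
        }
        return splitFile;
      }

      static Job configureJob(Connector conn, String table, Configuration conf)
          throws Exception {
        Path splitFile = writeSplitFile(conn, table, conf);
        int numSplits = conn.tableOperations().listSplits(table).size();

        Job job = Job.getInstance(conf, "dense bulk load for " + table);
        job.setJarByClass(DenseBulkLoadJob.class);
        // job.setMapperClass(MyKeyValueMapper.class); // your mapper emitting Key/Value
        job.setMapOutputKeyClass(Key.class);
        job.setMapOutputValueClass(Value.class);

        // One reducer per tablet, partitioned on the table's own split points,
        // so each output rfile covers at most one tablet's range.
        job.setNumReduceTasks(numSplits + 1);
        job.setPartitionerClass(RangePartitioner.class);
        RangePartitioner.setSplitFile(job, splitFile.toString());

        job.setOutputFormatClass(AccumuloFileOutputFormat.class);
        AccumuloFileOutputFormat.setOutputPath(job, new Path("/tmp/bulk-out")); // placeholder
        return job;
      }
    }

The key part is the last block: reducers = tablets, and the partition boundaries come from the live table's splits rather than a sample, so each rfile lands on one tablet at import time.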
Of course, if that means that you have a bunch of tiny files, then maybe it's time to rethink your split strategy.

On Tue, Nov 11, 2014 at 5:56 AM, Jeff Turner <[email protected]> wrote:
> is there a good way to compare the overall system effect of
> bulk loading different sets of rfiles that have the same data,
> but very different "densities"?
>
> i've been working on a way to re-feed a lot of data in to a table,
> and have started to believe that our default scheme for creating
> rfiles - mapred in to ~100-200 splits, sampled from 50k tablets -
> is actually pretty bad. subjectively, it feels like rfiles that "span"
> 300 or 400 tablets is bad in at least two ways for the tservers -
> until the files are compacted, all of the "potential" tservers have
> to check the file, right? and then, during compaction, do portions
> of that rfile get volleyed around the cloud until all tservers
> have grabbed their portion? (so, there's network overhead, repeatedly
> reading files and skipping most of the data, ...)
>
> if my new idea works, i will have a lot more control over the density
> of rfiles, and most of them will span just one or two tablets.
>
> so, is there a way to measure/simulate overall system benefit or cost
> of different approaches to building bulk-load data (destined for an
> established table, across N tservers, ...)?
>
> i guess that a related question would be "are 1000 smaller and denser
> bulk files better than 100 larger bulk files produced under a typical
> getSplits() scheme?"
>
> thanks,
> jeff
>
