Jeff, "Density" is an interesting measure here, because RFiles are going to be sorted such that, even when the file is split between tablets, a read of the file is going to be (mostly) a sequential scan. I think instead you might want to look at a few other metrics: network overhead, name node operation rates, and number of files per tablet.
The network overhead is going to be more an issue of locality than density, so you'd have to do more than just have separate files per tablet to optimize that. You'll need some way of specifying or hinting at where the files should be generated. As an aside, we generally want to shoot for "probabilistic locality", so that the aggregate traffic over top-level switches is much smaller than the total data processed (smaller enough that it isn't a bottleneck). This is generally a little easier than guaranteeing that files are always read from a local drive. You might be able to measure this by monitoring your network usage, assuming those sensors are available to you. Accumulo also prints out information on entries/s and bytes/s for compactions in the debug logs.

Impact on the namenode is more or less proportional to the number of blocks+files that you generate. As long as you're not generating a large number of files that are smaller than your block size (64MB? 128MB?), you're probably going to be close to optimal here. I'm not sure at what point the number of files+blocks becomes a bottleneck, but I've seen it happen when generating a very large number of tiny files. This is something that may cause you problems if you generate 50K files per ingest cycle rather than 500 or 5K. Measure this by looking at the sizes of the files that are being ingested (first sketch at the bottom of this message).

Number of files per tablet has a big effect on performance -- much more so than number of tablets per file. Query latency and aggregate scan performance are directly proportional to the number of files per tablet, and generating one file per tablet versus one file per group of tablets doesn't really change this metric. You can measure it by scanning the metadata table as Josh suggested (second sketch at the bottom).

I'm very interested in this subject, so please let us know what you find.

Cheers,
Adam

On Tue, Nov 11, 2014 at 6:56 AM, Jeff Turner <[email protected]> wrote:
> is there a good way to compare the overall system effect of
> bulk loading different sets of rfiles that have the same data,
> but very different "densities"?
>
> i've been working on a way to re-feed a lot of data in to a table,
> and have started to believe that our default scheme for creating
> rfiles - mapred in to ~100-200 splits, sampled from 50k tablets -
> is actually pretty bad.  subjectively, it feels like rfiles that "span"
> 300 or 400 tablets is bad in at least two ways for the tservers -
> until the files are compacted, all of the "potential" tservers have
> to check the file, right?  and then, during compaction, do portions
> of that rfile get volleyed around the cloud until all tservers
> have grabbed their portion?  (so, there's network overhead, repeatedly
> reading files and skipping most of the data, ...)
>
> if my new idea works, i will have a lot more control over the density
> of rfiles, and most of them will span just one or two tablets.
>
> so, is there a way to measure/simulate overall system benefit or cost
> of different approaches to building bulk-load data (destined for an
> established table, across N tservers, ...)?
>
> i guess that a related question would be "are 1000 smaller and denser
> bulk files better than 100 larger bulk files produced under a typical
> getSplits() scheme?"
>
> thanks,
> jeff
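P.S. To make the "measure this" suggestions a little more concrete, here's a rough, untested sketch of checking the sizes of the RFiles in a bulk import directory against the HDFS block size, using the plain Hadoop FileSystem API. The class name and the directory argument are just placeholders for whatever your ingest job produces:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BulkFileSizes {
  public static void main(String[] args) throws Exception {
    // directory holding the rfiles you're about to bulk import (placeholder arg)
    Path bulkDir = new Path(args[0]);
    FileSystem fs = FileSystem.get(new Configuration());
    long blockSize = fs.getDefaultBlockSize(bulkDir);

    int small = 0, total = 0;
    for (FileStatus st : fs.listStatus(bulkDir)) {
      if (!st.isFile())
        continue;
      total++;
      if (st.getLen() < blockSize)
        small++;
      System.out.println(st.getPath().getName() + "\t" + st.getLen() + " bytes");
    }
    System.out.println(small + " of " + total + " files are smaller than the "
        + blockSize + "-byte block size");
  }
}

If a large fraction of the files come back well under the block size, that's the namenode-pressure case I mentioned above.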
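And a similarly rough sketch of counting files per tablet by scanning the metadata table. This assumes the 1.6-style "accumulo.metadata" table (older versions use !METADATA), a Connector you've already built as a user that can read that table, and the table id you can get from "tables -l" in the shell:

import java.util.Map;
import java.util.TreeMap;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class FilesPerTablet {
  public static void printCounts(Connector conn, String tableId) throws Exception {
    Scanner scanner = conn.createScanner("accumulo.metadata", Authorizations.EMPTY);
    // tablet rows for a table look like "<tableId>;<endRow>", plus "<tableId><" for the last tablet
    scanner.setRange(new Range(new Text(tableId + ";"), new Text(tableId + "<")));
    // each rfile a tablet references shows up as one entry in the "file" column family
    scanner.fetchColumnFamily(new Text("file"));

    Map<String,Integer> filesPerTablet = new TreeMap<String,Integer>();
    for (Map.Entry<Key,Value> entry : scanner) {
      String tabletRow = entry.getKey().getRow().toString();
      Integer count = filesPerTablet.get(tabletRow);
      filesPerTablet.put(tabletRow, count == null ? 1 : count + 1);
    }

    for (Map.Entry<String,Integer> e : filesPerTablet.entrySet())
      System.out.println(e.getValue() + " file(s) in tablet " + e.getKey());
  }
}

I believe the value of each "file" entry also encodes the file's size and entry count, so you could pull density-type numbers for your bulk files out of the same scan if you wanted.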
