i had a conversation about this yesterday that confused me.
if this is all documented somewhere, or is described in ppt, let me know.

how much does the goal of "approximately one rfile per tserver" help
with building "good" (i.e., dense) rfiles for bulk-load?

i had not heard of the notion of one file per tserver (as an efficiency trick).
i assumed that, since tserver/task-tracker are usually 1:1, the goal was really
to establish splits that would use a reasonable number of reducers - a
compromise, compared to rfile-per-tablet, when tables have very large numbers
of tablets.
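for concreteness, this is roughly the down-sampling i mean - a minimal sketch
against a 1.6-era client API (instance, zookeeper, user, and table names are
all placeholders), where accumulo itself picks ~200 evenly spaced cut points
that i'd then hand to the job's partitioner:

// minimal sketch, 1.6-era client API: ask accumulo to down-sample a table's
// ~50k split points to roughly the number of reducers we want.
import java.util.Collection;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.hadoop.io.Text;

public class SampleSplits {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
        .getConnector("user", new PasswordToken("secret"));

    // at most ~200 roughly evenly spaced split points for the reducers
    Collection<Text> cuts = conn.tableOperations().listSplits("mytable", 200);
    for (Text cut : cuts)
      System.out.println(cut);
  }
}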

so, a few questions:

- do tservers generally get a contiguous range of tablets? (see the sketch below)
- if an rfile spans five consecutive tablets, what is the likelihood that they
  are all on one tserver? if tablets are allocated less systematically, then
  the rfile would cover multiple tservers.

- even if we can generate rfiles that "fit into one tserver", isn't there still
  an issue that the new rfile may cover hundreds of the tablets owned by that
  tserver? so any scan of any of those tablets will have to peek into the new
  file (until compaction).
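to poke at the first two questions, i've been looking at something like the
following - a rough sketch, assuming a 1.6-era client API, that the metadata
table is named "accumulo.metadata", and that current tablet locations live in
the "loc" column family (connection details are placeholders). sorting the
output by end row makes it easy to eyeball whether a tserver holds long
contiguous runs of tablets:

// rough sketch, 1.6-era client API; assumes the metadata table is
// "accumulo.metadata" and current tablet locations are in the "loc" family.
import java.util.Map.Entry;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class TabletLocations {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
        .getConnector("user", new PasswordToken("secret"));

    Scanner scan = conn.createScanner("accumulo.metadata", Authorizations.EMPTY);
    scan.fetchColumnFamily(new Text("loc"));  // current location entries

    // row is <tableId>;<endRow> (the last tablet is <tableId><),
    // value is the host:port of the hosting tserver
    for (Entry<Key, Value> e : scan)
      System.out.println(e.getKey().getRow() + " -> " + e.getValue());
  }
}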

i think i'm getting close to, or already past, the point of diminishing returns
as far as optimizing.  now i'm just curious.



On 11/11/14 2:57 PM, Adam Fuchs wrote:
Jeff,

"Density" is an interesting measure here, because RFiles are going to
be sorted such that, even when the file is split between tablets, a
read of the file is going to be (mostly) a sequential scan. I think
instead you might want to look at a few other metrics: network
overhead, name node operation rates, and number of files per tablet.

The network overhead is going to be more an issue of locality than
density, so you'd have to do more than just have separate files per
tablet to optimize that. You'll need some way of specifying or hinting
at where the files should be generated. As an aside, we generally want
to shoot for "probabilistic locality" so that the aggregate traffic
over top-level switches is much smaller than the total data processed
(enough smaller that it isn't a bottleneck). This is generally a
little easier than guaranteeing that files are always read from a
local drive. You might be able to measure this by monitoring your
network usage, assuming those sensors are available to you. Accumulo
also prints out information on entries/s and bytes/s for compactions
in the debug logs.

Impact on the namenode is more or less proportional to the number of
blocks+files that you generate. As long as you're not generating a
large number of files that are smaller than your block size (64MB?
128MB?) you're probably going to be close to optimal here. I'm not
sure at what point the number of files+blocks becomes a bottleneck,
but I've seen it happen when generating a very large number of tiny
files. This is something that may cause you problems if you generate
50K files per ingest cycle rather than 500 or 5K. Measure this by
looking at the size of files that are being ingested.
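If it helps, here is a minimal sketch of that check using the plain Hadoop
FileSystem API (the bulk directory path is just an example): it counts how
many staged rfiles come in under the HDFS block size.

// minimal sketch, plain Hadoop FileSystem API: flag bulk files smaller than
// the block size, since lots of sub-block files is what hurts the namenode.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BulkFileSizes {
  public static void main(String[] args) throws Exception {
    Path dir = new Path(args[0]);              // e.g. /user/jeff/bulk/ingest-001
    FileSystem fs = FileSystem.get(new Configuration());
    long blockSize = fs.getDefaultBlockSize(dir);

    int small = 0, total = 0;
    for (FileStatus st : fs.listStatus(dir)) {
      if (!st.isFile()) continue;
      total++;
      if (st.getLen() < blockSize) small++;
    }
    System.out.printf("%d of %d files are smaller than the %d-byte block size%n",
        small, total, blockSize);
  }
}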

Number of files per tablet has a big effect on performance -- much
more so than number of tablets per file. Query latency and aggregate
scan performance are directly proportional to the number of files per
tablet. Generating one file per tablet or one file per group of
tablets doesn't really change this metric. You can measure this by
scanning the metadata table as Josh suggested.
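For reference, a sketch of that metadata scan -- this assumes the metadata
table is named "accumulo.metadata", that file entries are in the "file" column
family, and that you've looked up your table's id (e.g. from
tableOperations().tableIdMap()); connection details are placeholders.

// rough sketch: count rfiles per tablet for one table so you can watch the
// files-per-tablet metric before and after a bulk import.
import java.util.Map.Entry;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class FilesPerTablet {
  public static void main(String[] args) throws Exception {
    String tableId = args[0];  // table id from tableIdMap(), e.g. "2a"
    Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
        .getConnector("user", new PasswordToken("secret"));

    Scanner scan = conn.createScanner("accumulo.metadata", Authorizations.EMPTY);
    scan.setRange(new Range(tableId + ";", tableId + "<"));  // all tablets of this table
    scan.fetchColumnFamily(new Text("file"));

    Text current = null;
    int count = 0;
    for (Entry<Key, Value> e : scan) {
      Text row = e.getKey().getRow();
      if (current != null && !row.equals(current)) {
        System.out.println(current + " : " + count + " files");
        count = 0;
      }
      current = row;
      count++;
    }
    if (current != null)
      System.out.println(current + " : " + count + " files");
  }
}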

I'm very interested in this subject, so please let us know what you find.

Cheers,
Adam

On Tue, Nov 11, 2014 at 6:56 AM, Jeff Turner <[email protected]> wrote:
is there a good way to compare the overall system effect of
bulk loading different sets of rfiles that have the same data,
but very different "densities"?

i've been working on a way to re-feed a lot of data into a table,
and have started to believe that our default scheme for creating
rfiles - mapred into ~100-200 splits, sampled from 50k tablets -
is actually pretty bad.  subjectively, it feels like rfiles that "span"
300 or 400 tablets are bad in at least two ways for the tservers -
until the files are compacted, all of the "potential" tservers have
to check the file, right?  and then, during compaction, do portions
of that rfile get volleyed around the cloud until all tservers
have grabbed their portion?  (so, there's network overhead, repeatedly
reading files and skipping most of the data, ...)

if my new idea works, i will have a lot more control over the density
of rfiles, and most of them will span just one or two tablets.
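as a sanity check on that, i've been estimating how many tablets a file will
span from its first and last row (which i record when writing the file). a
rough sketch, again assuming the 1.6-era client API with placeholder
connection and table names:

// rough sketch: count how many of the table's split points fall inside a
// file's row range; each one pushes the file into one more tablet.
import java.util.Collection;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.hadoop.io.Text;

public class FileSpan {
  public static void main(String[] args) throws Exception {
    Text firstRow = new Text(args[0]);
    Text lastRow = new Text(args[1]);

    Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
        .getConnector("user", new PasswordToken("secret"));
    Collection<Text> splits = conn.tableOperations().listSplits("mytable");

    int tablets = 1;  // a file that crosses no split stays in one tablet
    for (Text split : splits) {
      if (split.compareTo(firstRow) >= 0 && split.compareTo(lastRow) < 0)
        tablets++;
    }
    System.out.println("file spans ~" + tablets + " tablets");
  }
}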

so, is there a way to measure/simulate overall system benefit or cost
of different approaches to building bulk-load data (destined for an
established table, across N tservers, ...)?

i guess that a related question would be "are 1000 smaller and denser
bulk files better than 100 larger bulk files produced under a typical
getSplits() scheme?"

thanks,
jeff
