Jeff, "Density" is an interesting measure here, because RFiles are going to be sorted such that, even when the file is split between tablets, a read of the file is going to be (mostly) a sequential scan. I think instead you might want to look at a few other metrics: network overhead, name node operation rates, and number of files per tablet.
The network overhead is going to be more an issue of locality than density, so you'd have to do more than just have separate files per tablet to optimize that. You'll need some way of specifying or hinting at where the files should be generated. As an aside, we generally want to shoot for "probabilistic locality", so that the aggregate traffic over top-level switches is much smaller than the total data processed (smaller enough that it isn't a bottleneck). This is generally a little easier than guaranteeing that files are always read from a local drive. You might be able to measure this by monitoring your network usage, assuming those sensors are available to you. Accumulo also prints out information on entries/s and bytes/s for compactions in the debug logs.

Impact on the namenode is more or less proportional to the number of blocks+files that you generate. As long as you're not generating a large number of files that are smaller than your block size (64MB? 128MB?), you're probably going to be close to optimal here. I'm not sure at what point the number of files+blocks becomes a bottleneck, but I've seen it happen when generating a very large number of tiny files. This is something that may cause you problems if you generate 50K files per ingest cycle rather than 500 or 5K. Measure this by looking at the sizes of the files that are being ingested (first sketch at the bottom of this message).

Number of files per tablet has a big effect on performance -- much more so than number of tablets per file. Query latency and aggregate scan performance are directly proportional to the number of files per tablet, and generating one file per tablet versus one file per group of tablets doesn't really change this metric. You can measure it by scanning the metadata table as Josh suggested (second sketch at the bottom).

I'm very interested in this subject, so please let us know what you find.

Cheers,
Adam

On Tue, Nov 11, 2014 at 6:56 AM, Jeff Turner <[email protected]> wrote:
> is there a good way to compare the overall system effect of
> bulk loading different sets of rfiles that have the same data,
> but very different "densities"?
>
> i've been working on a way to re-feed a lot of data in to a table,
> and have started to believe that our default scheme for creating
> rfiles - mapred in to ~100-200 splits, sampled from 50k tablets -
> is actually pretty bad.  subjectively, it feels like rfiles that "span"
> 300 or 400 tablets is bad in at least two ways for the tservers -
> until the files are compacted, all of the "potential" tservers have
> to check the file, right?  and then, during compaction, do portions
> of that rfile get volleyed around the cloud until all tservers
> have grabbed their portion?  (so, there's network overhead, repeatedly
> reading files and skipping most of the data, ...)
>
> if my new idea works, i will have a lot more control over the density
> of rfiles, and most of them will span just one or two tablets.
>
> so, is there a way to measure/simulate overall system benefit or cost
> of different approaches to building bulk-load data (destined for an
> established table, across N tservers, ...)?
>
> i guess that a related question would be "are 1000 smaller and denser
> bulk files better than 100 larger bulk files produced under a typical
> getSplits() scheme?"
>
> thanks,
> jeff
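P.S. To make the "measure this" suggestions a little more concrete, here's a rough, untested sketch of checking the sizes of the RFiles in a bulk import directory against the HDFS block size, using the plain Hadoop FileSystem API. The class name and the directory argument are just placeholders for whatever your ingest job produces:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BulkFileSizes {
  public static void main(String[] args) throws Exception {
    // directory holding the rfiles you're about to bulk import (placeholder arg)
    Path bulkDir = new Path(args[0]);
    FileSystem fs = FileSystem.get(new Configuration());
    long blockSize = fs.getDefaultBlockSize(bulkDir);

    int small = 0, total = 0;
    for (FileStatus st : fs.listStatus(bulkDir)) {
      if (!st.isFile())
        continue;
      total++;
      if (st.getLen() < blockSize)
        small++;
      System.out.println(st.getPath().getName() + "\t" + st.getLen() + " bytes");
    }
    System.out.println(small + " of " + total + " files are smaller than the "
        + blockSize + "-byte block size");
  }
}

If a large fraction of the files come back well under the block size, that's the namenode-pressure case I mentioned above.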
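And a similarly rough sketch of counting files per tablet by scanning the metadata table. This assumes the 1.6-style "accumulo.metadata" table (older versions use !METADATA), a Connector you've already built as a user that can read that table, and the table id you can get from "tables -l" in the shell:

import java.util.Map;
import java.util.TreeMap;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class FilesPerTablet {
  public static void printCounts(Connector conn, String tableId) throws Exception {
    Scanner scanner = conn.createScanner("accumulo.metadata", Authorizations.EMPTY);
    // tablet rows for a table look like "<tableId>;<endRow>", plus "<tableId><" for the last tablet
    scanner.setRange(new Range(new Text(tableId + ";"), new Text(tableId + "<")));
    // each rfile a tablet references shows up as one entry in the "file" column family
    scanner.fetchColumnFamily(new Text("file"));

    Map<String,Integer> filesPerTablet = new TreeMap<String,Integer>();
    for (Map.Entry<Key,Value> entry : scanner) {
      String tabletRow = entry.getKey().getRow().toString();
      Integer count = filesPerTablet.get(tabletRow);
      filesPerTablet.put(tabletRow, count == null ? 1 : count + 1);
    }

    for (Map.Entry<String,Integer> e : filesPerTablet.entrySet())
      System.out.println(e.getValue() + " file(s) in tablet " + e.getKey());
  }
}

I believe the value of each "file" entry also encodes the file's size and entry count, so you could pull density-type numbers for your bulk files out of the same scan if you wanted.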
