No trickery there -- all tablets for which there are keys in the file will reference the file directly after bulk load.
Adam

On Tue, Nov 11, 2014 at 2:57 PM, Jeff Turner <[email protected]> wrote:
> is a file listed in metadata under *all* of the tablets that might have
> entries in the file?
>
> (this example is probably bad, but i hope you get the gist):
>
> if a table has tablets for rows A, B, C, D, and E, and a new rfile has
> entries for B and E, would tablets B, C, D, and E all have pointers to
> the new file in !METADATA? or is there some cleverness, and only B and E
> point to the file?
>
> On 11/11/14 11:49 AM, Josh Elser wrote:
>>
>> You could also affirm your thoughts about RFile usage by generating a
>> histogram over the metadata table for your table.
>>
>> The table ID is the common prefix in the metadata table. Each unique row
>> is a tablet, and will contain zero to many keys with a column family of
>> 'file'. The column qualifier is a URI to the file (you can make a Path
>> object from it). The value is a CSV where the first element is the
>> approximate size of the file (bytes) and the second element is the
>> number of key-value pairs in the file. [see the scanner sketch after
>> this thread]
>>
>> This might help you make a more quantitative analysis of the
>> distribution of files as you tweak things.
>>
>> To reaffirm what Mike said, trying to get close to 1:1 files to tablets
>> is definitely ideal, but it can be difficult to manage when considering
>> all of the potential knobs you can turn (for both ingest and query
>> characteristics).
>>
>> Mike Drob wrote:
>>>
>>> I'm not sure how to quantify this and give you a way to verify, but in
>>> my experience you want to be producing rfiles that load into a single
>>> tablet. Typically, this means a number of reducers equal to the number
>>> of tablets in the table you will be importing into, and perhaps a
>>> custom partitioner. [see the job-setup sketch after this thread] I
>>> think your intuition is spot on, here.
>>>
>>> Of course, if that means that you have a bunch of tiny files, then
>>> maybe it's time to rethink your split strategy.
>>>
>>> On Tue, Nov 11, 2014 at 5:56 AM, Jeff Turner <[email protected]> wrote:
>>>
>>>     is there a good way to compare the overall system effect of bulk
>>>     loading different sets of rfiles that have the same data, but very
>>>     different "densities"?
>>>
>>>     i've been working on a way to re-feed a lot of data into a table,
>>>     and have started to believe that our default scheme for creating
>>>     rfiles -- mapred into ~100-200 splits, sampled from 50k tablets --
>>>     is actually pretty bad. subjectively, it feels like rfiles that
>>>     "span" 300 or 400 tablets are bad in at least two ways for the
>>>     tservers: until the files are compacted, all of the "potential"
>>>     tservers have to check the file, right? and then, during
>>>     compaction, do portions of that rfile get volleyed around the cloud
>>>     until all tservers have grabbed their portion? (so, there's network
>>>     overhead, repeatedly reading files and skipping most of the
>>>     data, ...)
>>>
>>>     if my new idea works, i will have a lot more control over the
>>>     density of rfiles, and most of them will span just one or two
>>>     tablets.
>>>
>>>     so, is there a way to measure/simulate the overall system benefit
>>>     or cost of different approaches to building bulk-load data
>>>     (destined for an established table, across N tservers, ...)?
>>>
>>>     i guess that a related question would be "are 1000 smaller and
>>>     denser bulk files better than 100 larger bulk files produced under
>>>     a typical getSplits() scheme?"
>>>
>>>     thanks,
>>>     jeff
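
To make Josh's histogram suggestion concrete, here is a minimal sketch (mine, not
from the thread) that scans the metadata table and counts 'file' entries per
tablet. It assumes a 1.6-era client API, where the metadata table is named
accumulo.metadata (it was !METADATA in 1.4); the instance name, zookeepers,
credentials, and table name are placeholders you'd replace.

import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

public class FilesPerTabletHistogram {
  public static void main(String[] args) throws Exception {
    // Placeholder connection details -- substitute your instance, zookeepers,
    // credentials, and table name.
    Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
        .getConnector("user", new PasswordToken("secret"));
    String tableId = conn.tableOperations().tableIdMap().get("mytable");

    // Tablet rows for a table run from "<tableId>;" (tablets with an end row)
    // up to "<tableId><" (the default/last tablet).
    Scanner scanner = conn.createScanner("accumulo.metadata", Authorizations.EMPTY);
    scanner.setRange(new Range(new Text(tableId + ";"), true, new Text(tableId + "<"), true));
    scanner.fetchColumnFamily(new Text("file"));

    Map<String,Integer> filesPerTablet = new TreeMap<String,Integer>();
    for (Entry<Key,Value> e : scanner) {
      // One entry per (tablet, file) pair; the value is "<approx bytes>,<num entries>"
      // if you also want to sum sizes or key-value counts.
      String tablet = e.getKey().getRow().toString();
      Integer count = filesPerTablet.get(tablet);
      filesPerTablet.put(tablet, count == null ? 1 : count + 1);
    }

    for (Entry<String,Integer> e : filesPerTablet.entrySet())
      System.out.println(e.getKey() + "\t" + e.getValue() + " files");
  }
}

Grouping the same scan by file URI (the column qualifier) instead of by tablet
row shows how many tablets reference each bulk-loaded file, which is a direct
check of the answer at the top of this message.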

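To illustrate Mike's point about one reducer per tablet, here is a hedged
sketch of bulk-ingest job setup, roughly the pattern used in Accumulo's
bundled bulk ingest example: pull the table's current split points, write them
to the cut file RangePartitioner reads, and set the reducer count to
splits + 1 so each reducer's output rfile covers a single tablet. The class
name, the workDir argument, and the paths are made up for the example, and the
mapper/reducer (which must emit sorted Key/Value pairs) are omitted.

import java.io.PrintStream;
import java.util.Collection;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.mapreduce.AccumuloFileOutputFormat;
import org.apache.accumulo.core.client.mapreduce.lib.partition.RangePartitioner;
import org.apache.commons.codec.binary.Base64;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class BulkIngestJobSetup {
  // Configures an existing Job so its reducers line up 1:1 with the table's
  // current tablets.
  public static void configure(Job job, Connector conn, String table, String workDir)
      throws Exception {
    FileSystem fs = FileSystem.get(job.getConfiguration());

    // Write the table's current split points, base64-encoded one per line,
    // to the cut file that RangePartitioner reads.
    Collection<Text> splits = conn.tableOperations().listSplits(table);
    Path splitsFile = new Path(workDir + "/splits.txt");
    PrintStream out = new PrintStream(fs.create(splitsFile, true));
    for (Text split : splits) {
      // Copy only the valid bytes; Text.getBytes() returns a padded backing array.
      byte[] bytes = new byte[split.getLength()];
      System.arraycopy(split.getBytes(), 0, bytes, 0, split.getLength());
      out.println(new String(Base64.encodeBase64(bytes)));
    }
    out.close();

    // One reducer per tablet: number of tablets = number of splits + 1.
    job.setNumReduceTasks(splits.size() + 1);
    job.setPartitionerClass(RangePartitioner.class);
    RangePartitioner.setSplitFile(job, splitsFile.toString());

    // Each reducer then writes one rfile covering a single tablet's range,
    // ready for importDirectory().
    job.setOutputFormatClass(AccumuloFileOutputFormat.class);
    AccumuloFileOutputFormat.setOutputPath(job, new Path(workDir + "/files"));
  }
}

The +1 accounts for the tablet after the last split point. With ~50k tablets
that means ~50k reducers, which is where the tiny-files tradeoff Mike mentions
starts to bite; grouping a handful of adjacent tablets per reducer is a
possible middle ground between density and file count.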