Sounds like you're on the right track.
> On Aug 1, 2015, at 8:11 PM, Dave Latham <[email protected]> wrote:
>
> Thanks Andrew and Vladimir. As Vladimir notes, it looks like it is checked
> at scanner creation:
> StoreScanner constructor -> getScannersNoCompaction -> selectScannersFrom
> -> StoreFileScanner.shouldUseScanner -> StoreFile.passesTimerangeFilter
>
> The StoreScanner would probably need to store the timerange for that family
> separately from the scan, in the same way it keeps the set of columns. So
> it may not be too intrusive.
>
> Vladimir, note that B is 100x larger than A, rather than the other way
> round. Cutting out the old store files could well also reduce disk IO for
> that family by 100x.
>
> On Sat, Aug 1, 2015 at 7:17 PM, Vladimir Rodionov <[email protected]>
> wrote:
>
>> I think TimeRange is handled higher, when the region scanner is created.
>> With data size in B 100x smaller than in A, I do not understand where the
>> source of the IO bottleneck is?
>>> On Aug 1, 2015 9:16 AM, "Andrew Purtell" <[email protected]> wrote:
>>>
>>> Hi Dave,
>>>
>>>> Would HBase be willing to accept updating Scan to have a different
>>>> TimeRange for each column family?
>>>
>>> We could try it. I'm not sure how familiar you are with the relevant
>>> code. I'm guessing some? Look at ScanQueryMatcher. This and related
>>> concerns govern how we search through store files. Timerange handling is
>>> done at the top level (the SQM). Then for each column we have a leaf
>>> tracker (implementing ColumnTracker) that tracks column specific info
>>> like number of versions for a cell found in each. We'd need to push
>>> timerange handling down into the column trackers. This would be a tricky
>>> refactor on delicate code. I suspect we could be comfortable making this
>>> change in master and on branch-1 for upcoming unscheduled minor release
>>> line 1.3. Would that work? Or would this change need to go further back?
>>>
>>> Maybe someone else has another suggestion.
>>>
>>>
>>>> On Sat, Aug 1, 2015 at 7:17 AM, Dave Latham <[email protected]> wrote:
>>>>
>>>> I have a table with 2 column families, call them A and B, with new data
>>>> regularly being added. They are very different sizes: B is 100x the
>>>> size of A. Among other uses for this data, I have a MapReduce job that
>>>> needs to read all of A, but only recent data from B (e.g. last day).
>>>> Here are some methods I've considered:
>>>>
>>>> 1. Use a Filter to throw out older data from B (this is what I
>>>>    currently do). However, all the data from B still needs to be read
>>>>    from disk, causing a disk IO bottleneck.
>>>> 2. Configure the table input format to read from B only, using a
>>>>    TimeRange for recent data, and have each map task open a separate
>>>>    scanner for A (without a TimeRange), then merge the data in the map
>>>>    task. However, this adds complexity to the job and gives up the
>>>>    atomicity/consistency guarantees as new writes hit both column
>>>>    families.
>>>> 3. Add a new column family C to the table with an additional copy of
>>>>    the data in B, but set a TTL on it. All writes duplicate the data
>>>>    written to B and C. Change the scan to include C instead of B.
>>>>    However, this adds all the overhead of another column family, more
>>>>    writes, and having to set the TTL to the maximum of any time window
>>>>    I want to scan efficiently.
>>>> 4. Implement an enhancement to HBase's Scan to allow giving each
>>>>    column family its own TimeRange. The job would then be able to skip
>>>>    most old large store files (hopefully all of them with tiered
>>>>    compaction at some point).
>>>>
>>>> Does anyone have other suggestions? Would HBase be willing to accept
>>>> updating Scan to have a different TimeRange for each column family?
>>>>
>>>>
>>>> Dave
>>>
>>>
>>>
>>> --
>>> Best regards,
>>>
>>> - Andy
>>>
>>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>>> (via Tom White)
>>
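
For anyone following the thread, here is a rough sketch of the idea in option 4: keep a time range per column family next to the scan-wide one, and use it when deciding which store files to open. It is only a toy model of the kind of check the thread attributes to StoreFile.passesTimerangeFilter. Every class and method name below (PerFamilyTimeRangeSketch, Range, FileInfo, overlaps, selectFiles) is hypothetical and not part of HBase, and the half-open [min, max) interval is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-ins for illustration. None of these are HBase classes.
class PerFamilyTimeRangeSketch {

    /** Timestamp interval [min, max) in milliseconds; half-open by assumption. */
    static final class Range {
        final long min;
        final long max;

        Range(long min, long max) {
            this.min = min;
            this.max = max;
        }

        static Range allTime() {
            return new Range(0L, Long.MAX_VALUE);
        }
    }

    /** A store file reduced to its family and the oldest/newest cell timestamps it holds. */
    static final class FileInfo {
        final String name;
        final String family;
        final long minTs; // inclusive
        final long maxTs; // inclusive

        FileInfo(String name, String family, long minTs, long maxTs) {
            this.name = name;
            this.family = family;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }
    }

    /** True if the file could hold a cell inside the range, so it has to be read. */
    static boolean overlaps(FileInfo file, Range range) {
        return file.maxTs >= range.min && file.minTs < range.max;
    }

    /** Keeps the files whose timestamps overlap their family's range, or the scan-wide default. */
    static List<String> selectFiles(List<FileInfo> files,
                                    Map<String, Range> perFamily,
                                    Range scanDefault) {
        List<String> kept = new ArrayList<>();
        for (FileInfo file : files) {
            Range range = perFamily.getOrDefault(file.family, scanDefault);
            if (overlaps(file, range)) {
                kept.add(file.name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        long now = 10 * day;
        List<FileInfo> files = Arrays.asList(
            new FileInfo("A-old", "A", 0, day),
            new FileInfo("A-new", "A", 9 * day, now),
            new FileInfo("B-old", "B", 0, 8 * day),
            new FileInfo("B-new", "B", 9 * day, now));

        // Family A has no entry, so it falls back to the scan-wide default (all time).
        // Family B is limited to the last day.
        Map<String, Range> perFamily = new HashMap<>();
        perFamily.put("B", new Range(now - day, Long.MAX_VALUE));

        // B-old is the only file skipped.
        System.out.println(selectFiles(files, perFamily, Range.allTime()));
    }
}
```

In this model all of A is still read, while the old B file is never opened. That is where the disk IO saving described in the thread would come from, assuming compaction keeps old and new data in separate files.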
