Re: scan column families with different time ranges

Andrew Purtell Sat, 01 Aug 2015 21:06:27 -0700

Sounds like you're on the right track.


> On Aug 1, 2015, at 8:11 PM, Dave Latham <[email protected]> wrote:
> 
> Thanks Andrew and Vladimir.  As Vladimir notes, it looks like it is checked
> at scanner creation:
> StoreScanner constructor -> getScannersNoCompaction -> selectScannersFrom
> -> StoreFileScanner.shouldUseScanner -> StoreFile.passesTimerangeFilter
> 
> The StoreScanner would probably need to store the timerange for that family
> separately from the scan, in the same way it keeps the set of columns.  So
> it may not be too intrusive.
> 
> Vladimir, note that B is 100x larger than A, rather than the other way
> round.  Cutting out the old store files could well also reduce disk IO for
> that family by 100x.
> 
> On Sat, Aug 1, 2015 at 7:17 PM, Vladimir Rodionov <[email protected]>
> wrote:
> 
>> I think TimeRange is handled higher up, when the region scanner is
>> created. With the data size in B 100x smaller than in A, I do not
>> understand where the source of the IO bottleneck is.
>>> On Aug 1, 2015 9:16 AM, "Andrew Purtell" <[email protected]> wrote:
>>> 
>>> Hi Dave,
>>> 
>>>> Would HBase be willing to accept updating Scan to have a different
>>>> TimeRange for each column family?
>>> 
>>> We could try it. I'm not sure how familiar you are with the relevant
>>> code. I'm guessing some? Look at ScanQueryMatcher. This and related
>>> concerns govern how we search through store files. Timerange handling is
>>> done at the top level (the SQM). Then for each column we have a leaf
>>> tracker (implementing ColumnTracker) that tracks column specific info
>>> like number of versions for a cell found in each. We'd need to push
>>> timerange handling down into the column trackers. This would be a tricky
>>> refactor on delicate code. I suspect we could be comfortable making this
>>> change in master and on branch-1 for upcoming unscheduled minor release
>>> line 1.3. Would that work? Or would this change need to go further back?
>>> 
>>> Maybe someone else has another suggestion.
>>> 
>>> 
>>>> On Sat, Aug 1, 2015 at 7:17 AM, Dave Latham <[email protected]> wrote:
>>>> 
>>>> I have a table with 2 column families, call them A and B, with new data
>>>> regularly being added. They are very different sizes: B is 100x the size
>>>> of A.  Among other uses for this data, I have a MapReduce job that needs
>>>> to read all of A, but only recent data from B (e.g. last day).  Here are
>>>> some methods I've considered:
>>>> 
>>>>   1. Use a Filter to throw out older data from B (this is what I
>>>>   currently do).  However, all the data from B still needs to be read
>>>>   from disk, causing a disk IO bottleneck.
>>>>   2. Configure the table input format to read from B only, using a
>>>>   TimeRange for recent data, and have each map task open a separate
>>>>   scanner for A (without a TimeRange), then merge the data in the map
>>>>   task.  However, this adds complexity to the job and gives up the
>>>>   atomicity/consistency guarantees as new writes hit both column
>>>>   families.
>>>>   3. Add a new column family C to the table with an additional copy of
>>>>   the data in B, but set a TTL on it.  All writes duplicate the data
>>>>   written to B and C.  Change the scan to include C instead of B.
>>>>   However, this adds all the overhead of another column family, more
>>>>   writes, and having to set the TTL to the maximum of any time window I
>>>>   want to scan efficiently.
>>>>   4. Implement an enhancement to HBase's Scan to allow giving each
>>>>   column family its own TimeRange.  The job would then be able to skip
>>>>   most old large store files (hopefully all of them with tiered
>>>>   compaction at some point).
>>>> 
>>>> Does anyone have other suggestions?  Would HBase be willing to accept
>>>> updating Scan to have a different TimeRange for each column family?
>>>> 
>>>> 
>>>> Dave
>>> 
>>> 
>>> 
>>> --
>>> Best regards,
>>> 
>>>   - Andy
>>> 
>>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>>> (via Tom White)
>>

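For readers following the thread, option 4 from the original message and the store file selection path traced in the follow-up (selectScannersFrom -> shouldUseScanner -> passesTimerangeFilter) can be illustrated with a small standalone sketch. It is not HBase code: Range, FileInfo, FamilyScan, setFamilyTimeRange and selectFiles are all hypothetical names invented for this example. It only shows the idea that a per-family time range would let a scan skip whole store files whose timestamps cannot match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: none of these types are HBase classes. */
public class PerFamilyTimeRangeSketch {

    /** Half-open timestamp interval [min, max), in milliseconds. */
    public static final class Range {
        public final long min;
        public final long max;

        public Range(long min, long max) {
            if (min > max) {
                throw new IllegalArgumentException("min must not exceed max");
            }
            this.min = min;
            this.max = max;
        }
    }

    /** Hypothetical store file metadata: family plus inclusive min/max cell timestamps. */
    public static final class FileInfo {
        public final String family;
        public final String name;
        public final long minTs;
        public final long maxTs;

        public FileInfo(String family, String name, long minTs, long maxTs) {
            this.family = family;
            this.name = name;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }

        /** True if some timestamp in [minTs, maxTs] could fall inside the range. */
        public boolean overlaps(Range r) {
            return maxTs >= r.min && minTs < r.max;
        }
    }

    /** Hypothetical scan: a default "all time" range plus optional per-family overrides. */
    public static final class FamilyScan {
        private final Range defaultRange = new Range(0L, Long.MAX_VALUE);
        private final Map<String, Range> perFamily = new HashMap<>();

        public FamilyScan setFamilyTimeRange(String family, long min, long max) {
            perFamily.put(family, new Range(min, max));
            return this;
        }

        public Range rangeFor(String family) {
            return perFamily.getOrDefault(family, defaultRange);
        }
    }

    /** Keeps only the files whose timestamps overlap the range set for their own family. */
    public static List<FileInfo> selectFiles(FamilyScan scan, List<FileInfo> files) {
        List<FileInfo> kept = new ArrayList<>();
        for (FileInfo f : files) {
            if (f.overlaps(scan.rangeFor(f.family))) {
                kept.add(f);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        long now = 10 * day; // made-up "current time" for the example
        List<FileInfo> files = Arrays.asList(
            new FileInfo("A", "a-all", 0L, now - 1),
            new FileInfo("B", "b-old", 0L, 8 * day),
            new FileInfo("B", "b-recent", 9 * day, now - 1));
        // Read all of A, but only the last day of B.
        FamilyScan scan = new FamilyScan().setFamilyTimeRange("B", now - day, now);
        for (FileInfo f : selectFiles(scan, files)) {
            System.out.println(f.family + "/" + f.name);
        }
    }
}
```

With the sample data in main, the old B file is dropped before any of its blocks would be read, while the A file and the recent B file are kept. A Filter (option 1) cannot achieve this, because it only sees cells after they have already been read from disk.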