Have you considered using the essential column family feature (through a Filter)? In your case, A would be the essential column family. Within the TimeRange for recent data, the filter would return both column families; outside the TimeRange, only family A is returned.
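To make the idea concrete, here is a rough, self-contained sketch of the decision rules such a filter would implement. It is plain Java modeling what an HBase Filter subclass would do via isFamilyEssential() and its per-cell check, not the actual HBase API; the class and method names here are illustrative:

```java
import java.util.Arrays;

// Models the essential-column-family approach: family A is "essential"
// (always scanned), while cells from other families are kept only when
// their timestamp falls inside the recent TimeRange. A real HBase filter
// would express this through Filter.isFamilyEssential(byte[]) and the
// per-cell filter method.
public class EssentialFamilySketch {
    private final byte[] essentialFamily; // family A: always read
    private final long minTs;             // start of the recent TimeRange
    private final long maxTs;             // end (exclusive) of the TimeRange

    public EssentialFamilySketch(byte[] essentialFamily, long minTs, long maxTs) {
        this.essentialFamily = essentialFamily;
        this.minTs = minTs;
        this.maxTs = maxTs;
    }

    // Only family A is scanned up front; other families' store files are
    // consulted lazily, and only for rows that matched on A.
    public boolean isFamilyEssential(byte[] family) {
        return Arrays.equals(family, essentialFamily);
    }

    // Keep every cell from A; keep cells from other families only when
    // their timestamp is inside the recent TimeRange.
    public boolean keepCell(byte[] family, long timestamp) {
        if (Arrays.equals(family, essentialFamily)) {
            return true;
        }
        return timestamp >= minTs && timestamp < maxTs;
    }
}
```

Note this only changes when B's store files are consulted, not whether their data is read for matching rows, so it may or may not relieve the IO bottleneck described below.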
Cheers

On Sat, Aug 1, 2015 at 7:17 AM, Dave Latham <[email protected]> wrote:

> I have a table with 2 column families, call them A and B, with new data
> regularly being added. They are very different sizes: B is 100x the size
> of A. Among other uses for this data, I have a MapReduce job that needs
> to read all of A, but only recent data from B (e.g. the last day). Here
> are some methods I've considered:
>
> 1. Use a Filter to throw out older data from B (this is what I currently
> do). However, all the data from B still needs to be read from disk,
> causing a disk IO bottleneck.
>
> 2. Configure the table input format to read from B only, using a
> TimeRange for recent data, and have each map task open a separate
> scanner for A (without a TimeRange), then merge the data in the map
> task. However, this adds complexity to the job and gives up the
> atomicity/consistency guarantees as new writes hit both column families.
>
> 3. Add a new column family C to the table with an additional copy of the
> data in B, but set a TTL on it. All writes duplicate the data written to
> B and C. Change the scan to include C instead of B. However, this adds
> all the overhead of another column family, more writes, and having to
> set the TTL to the maximum of any time window I want to scan
> efficiently.
>
> 4. Implement an enhancement to HBase's Scan to allow giving each column
> family its own TimeRange. The job would then be able to skip most old
> large store files (hopefully all of them with tiered compaction at some
> point).
>
> Does anyone have other suggestions? Would HBase be willing to accept
> updating Scan to have a different TimeRange for each column family?
>
> Dave
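For what it's worth, the map-side merge that option 2 above would require can be sketched roughly as below. Plain Java stands in for the two scanners (the full scan of A and the TimeRange-restricted scan of B), each modeled as a map sorted by row key; all names are illustrative, and real code would drive two HBase ResultScanners instead:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;

// Stitches rows from two independently scanned column families back
// together by row key, as each map task would have to do under option 2.
// Note this merge sees two separate snapshots, which is where the
// atomicity/consistency guarantee is lost for concurrent writes.
public class MapSideMergeSketch {
    public static List<String> mergeRows(SortedMap<String, String> a,
                                         SortedMap<String, String> b) {
        List<String> out = new ArrayList<>();
        Iterator<Map.Entry<String, String>> ia = a.entrySet().iterator();
        Iterator<Map.Entry<String, String>> ib = b.entrySet().iterator();
        Map.Entry<String, String> ea = ia.hasNext() ? ia.next() : null;
        Map.Entry<String, String> eb = ib.hasNext() ? ib.next() : null;
        while (ea != null || eb != null) {
            int cmp = ea == null ? 1 : eb == null ? -1
                    : ea.getKey().compareTo(eb.getKey());
            if (cmp < 0) {            // row present only in A
                out.add(ea.getKey() + ":A=" + ea.getValue());
                ea = ia.hasNext() ? ia.next() : null;
            } else if (cmp > 0) {     // row present only in B (recent data)
                out.add(eb.getKey() + ":B=" + eb.getValue());
                eb = ib.hasNext() ? ib.next() : null;
            } else {                  // row in both families: combine
                out.add(ea.getKey() + ":A=" + ea.getValue()
                        + ",B=" + eb.getValue());
                ea = ia.hasNext() ? ia.next() : null;
                eb = ib.hasNext() ? ib.next() : null;
            }
        }
        return out;
    }
}
```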
