Have you considered using the essential column family feature (through a Filter)? In your case, A would be the essential column family. Within the TimeRange for recent data, the filter would return both column families; outside the TimeRange, only family A is returned.
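To make the idea concrete, here is a rough, self-contained sketch of the decision rules such a filter would implement. It is plain Java modeling what an HBase Filter subclass would do via isFamilyEssential() and its per-cell check, not the actual HBase API; the class and method names here are illustrative:

```java
import java.util.Arrays;

// Models the essential-column-family approach: family A is "essential"
// (always scanned), while cells from other families are kept only when
// their timestamp falls inside the recent TimeRange. A real HBase filter
// would express this through Filter.isFamilyEssential(byte[]) and the
// per-cell filter method.
public class EssentialFamilySketch {
    private final byte[] essentialFamily; // family A: always read
    private final long minTs;             // start of the recent TimeRange
    private final long maxTs;             // end (exclusive) of the TimeRange

    public EssentialFamilySketch(byte[] essentialFamily, long minTs, long maxTs) {
        this.essentialFamily = essentialFamily;
        this.minTs = minTs;
        this.maxTs = maxTs;
    }

    // Only family A is scanned up front; other families' store files are
    // consulted lazily, and only for rows that matched on A.
    public boolean isFamilyEssential(byte[] family) {
        return Arrays.equals(family, essentialFamily);
    }

    // Keep every cell from A; keep cells from other families only when
    // their timestamp is inside the recent TimeRange.
    public boolean keepCell(byte[] family, long timestamp) {
        if (Arrays.equals(family, essentialFamily)) {
            return true;
        }
        return timestamp >= minTs && timestamp < maxTs;
    }
}
```

Note this only changes when B's store files are consulted, not whether their data is read for matching rows, so it may or may not relieve the IO bottleneck described below.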
Cheers

On Sat, Aug 1, 2015 at 7:17 AM, Dave Latham <[email protected]> wrote:

> I have a table with 2 column families, call them A and B, with new data
> regularly being added. They are very different sizes: B is 100x the size
> of A. Among other uses for this data, I have a MapReduce job that needs
> to read all of A, but only recent data from B (e.g. the last day). Here
> are some methods I've considered:
>
> 1. Use a Filter to throw out older data from B (this is what I currently
> do). However, all the data from B still needs to be read from disk,
> causing a disk IO bottleneck.
>
> 2. Configure the table input format to read from B only, using a
> TimeRange for recent data, and have each map task open a separate
> scanner for A (without a TimeRange), then merge the data in the map
> task. However, this adds complexity to the job and gives up the
> atomicity/consistency guarantees as new writes hit both column families.
>
> 3. Add a new column family C to the table with an additional copy of the
> data in B, but set a TTL on it. All writes duplicate the data written to
> B and C. Change the scan to include C instead of B. However, this adds
> all the overhead of another column family, more writes, and having to
> set the TTL to the maximum of any time window I want to scan
> efficiently.
>
> 4. Implement an enhancement to HBase's Scan to allow giving each column
> family its own TimeRange. The job would then be able to skip most old
> large store files (hopefully all of them with tiered compaction at some
> point).
>
> Does anyone have other suggestions? Would HBase be willing to accept
> updating Scan to have a different TimeRange for each column family?
>
> Dave
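For what it's worth, the map-side merge that option 2 above would require can be sketched roughly as below. Plain Java stands in for the two scanners (the full scan of A and the TimeRange-restricted scan of B), each modeled as a map sorted by row key; all names are illustrative, and real code would drive two HBase ResultScanners instead:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;

// Stitches rows from two independently scanned column families back
// together by row key, as each map task would have to do under option 2.
// Note this merge sees two separate snapshots, which is where the
// atomicity/consistency guarantee is lost for concurrent writes.
public class MapSideMergeSketch {
    public static List<String> mergeRows(SortedMap<String, String> a,
                                         SortedMap<String, String> b) {
        List<String> out = new ArrayList<>();
        Iterator<Map.Entry<String, String>> ia = a.entrySet().iterator();
        Iterator<Map.Entry<String, String>> ib = b.entrySet().iterator();
        Map.Entry<String, String> ea = ia.hasNext() ? ia.next() : null;
        Map.Entry<String, String> eb = ib.hasNext() ? ib.next() : null;
        while (ea != null || eb != null) {
            int cmp = ea == null ? 1 : eb == null ? -1
                    : ea.getKey().compareTo(eb.getKey());
            if (cmp < 0) {            // row present only in A
                out.add(ea.getKey() + ":A=" + ea.getValue());
                ea = ia.hasNext() ? ia.next() : null;
            } else if (cmp > 0) {     // row present only in B (recent data)
                out.add(eb.getKey() + ":B=" + eb.getValue());
                eb = ib.hasNext() ? ib.next() : null;
            } else {                  // row in both families: combine
                out.add(ea.getKey() + ":A=" + ea.getValue()
                        + ",B=" + eb.getValue());
                ea = ia.hasNext() ? ia.next() : null;
                eb = ib.hasNext() ? ib.next() : null;
            }
        }
        return out;
    }
}
```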
