Hi Ayya,

As of now there's no intelligent way to prime the caches as you've described. I've had some brief discussions with the esteemed Mr. Purtell on this topic in the past, but without some application-aware intelligence, the best we could come up with were simple heuristics.

Of course, if your dataset is small enough to fit in memory, we have flags like CACHE_ON_WRITE that can be set on column families or tables; these result in all written data landing in the cache. I believe there was also a patch posted to load a region's blocks on open, but this is likewise limited by the amount of available memory. That approach could be extended to load only a sampled portion of the region, up to some percentage of the available cache, for instance. As I said, these are merely simple heuristics.
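To make the sampling idea concrete, here is a rough sketch (illustrative only, not HBase code; the function name and parameters are made up for this example) of prefetching an evenly spaced sample of a region's blocks on open, capped at a fraction of the block cache:

```python
# Hypothetical sketch of the heuristic described above: on region open,
# prefetch an evenly spaced sample of the region's blocks, widening the
# sampling stride until the sample fits within a fixed fraction of the
# available block cache. All names are illustrative, not HBase APIs.

def pick_blocks_to_prefetch(block_sizes, cache_bytes, cache_fraction=0.2):
    """Return indices of blocks to load, sampled evenly across the region,
    staying within cache_fraction of the total cache where possible."""
    if not block_sizes:
        return []
    budget = cache_bytes * cache_fraction
    stride = 1
    while True:
        sample = list(range(0, len(block_sizes), stride))
        total = sum(block_sizes[i] for i in sample)
        # Stop once the sample fits the budget, or we are down to one block.
        if total <= budget or stride >= len(block_sizes):
            return sample
        stride *= 2

# Example: 8 blocks of 64 KB each, a 1 MB cache, 20% budget (~205 KB).
# Every 4th block fits: blocks 0 and 4 are prefetched.
chosen = pick_blocks_to_prefetch([64 * 1024] * 8, 1024 * 1024)
```

A real implementation would of course work in terms of HFile block indexes rather than a flat size list, but the budget-capped sampling is the whole idea.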
Do you have any ideas about how your application logic might make its way into the regions? As one idea, a post-open region coprocessor might provide that context at the right moment in the HBase region lifecycle. As you say, "thoughts and suggestions welcome."

Thanks,
Nick

On Thu, Apr 23, 2015 at 12:18 PM, ayyajnam nahdravhbuhs <[email protected]> wrote:
> Hi,
>
> I have been toying with the idea of a predictive cache for batch HBase
> jobs.
>
> Traditionally speaking, Hadoop is a batch processing framework. We use
> HBase as a data store for a number of batch jobs that run on Hadoop.
> Depending on the job that is run, and the way the data is laid out, HBase
> might perform great for some jobs but become a performance bottleneck for
> others. This is seen specifically in cases where the same table is used
> as input for different jobs with different access patterns.
>
> HBase currently supports various cache implementations (Bucket, LRU,
> Combined), but none of these mechanisms is job aware. A job-aware cache
> should be able to determine the best data to cache based on data requests
> from previous runs of the job. The learning process can happen in the
> background and will require access information from multiple runs of the
> job. It should produce a per-job output that can be consumed by a new
> predictive caching algorithm. When a job is then run with this predictive
> cache, the cache can query the learning results whenever it has to decide
> which block to evict or load.
>
> Just wanted to check if anyone knows of any related work in this area.
>
> Thoughts and suggestions welcome.
>
> Thanks,
> Ayya
>
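For what it's worth, the eviction side of the proposal above can be sketched in a few lines. This is a toy model, not HBase code: the class, the capacity-in-blocks simplification, and the learned-counts map are all assumptions made for illustration. The point is only that eviction consults a per-job access profile from earlier runs instead of recency.

```python
# Toy sketch of a job-aware cache: when full, evict the resident block
# that this job touched least in previous runs, per a learned profile.
# Capacity is counted in blocks for simplicity; a real block cache would
# track bytes and fall back to LRU for blocks with no history.

class JobAwareCache:
    def __init__(self, capacity, learned_counts):
        self.capacity = capacity        # max number of resident blocks
        self.learned = learned_counts   # block_id -> hits in prior runs
        self.blocks = {}                # block_id -> cached data

    def put(self, block_id, data):
        if block_id not in self.blocks and len(self.blocks) >= self.capacity:
            # Evict the block with the lowest historical hit count;
            # unknown blocks (count 0) are evicted first.
            victim = min(self.blocks, key=lambda b: self.learned.get(b, 0))
            del self.blocks[victim]
        self.blocks[block_id] = data

# Profile learned from previous runs of the same job (hypothetical numbers).
profile = {"b1": 90, "b2": 5, "b3": 40}
cache = JobAwareCache(capacity=2, learned_counts=profile)
cache.put("b1", "...")
cache.put("b2", "...")
cache.put("b3", "...")  # cache full: b2 (5 prior hits) is evicted, not b1
```

Wiring something like this into a region would still need the delivery mechanism discussed above, e.g. a coprocessor hook loading the job's profile when the region opens.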
