Hi,

I have been toying with the idea of a predictive cache for batch HBase jobs.
Traditionally, Hadoop is a batch processing framework, and we use HBase as the data store for a number of batch jobs that run on it. Depending on the job and on how the data is laid out, HBase may perform well for some jobs but become a performance bottleneck for others. This shows up in particular when the same table is used as input for different jobs with different access patterns.

HBase currently supports several cache implementations (Bucket, LRU, Combined), but none of them is job-aware. A job-aware cache should be able to determine the best data to cache based on the requests made during previous runs of the job. The learning process can happen in the background and will require access information from multiple runs of the job. It should produce a per-job output that a new predictive caching algorithm can use. When a job is then run with this predictive cache, the cache can query the learning results whenever it has to decide which block to evict or load.

Just wanted to check whether anyone knows of related work in this area. Thoughts and suggestions are welcome.

Thanks,
Ayya
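
P.S. To make the idea a bit more concrete, here is a rough, language-neutral sketch in Python of what I have in mind. It is not HBase code and does not use any HBase API. The names `learn_profile`, `PredictiveCache` and `load`, the block ids, and the scoring rule (average accesses per previous run, with LRU order as the tie-breaker) are all assumptions made purely for illustration.

```python
from collections import Counter, OrderedDict


def learn_profile(runs):
    """Aggregate block-access traces from previous runs of one job.

    `runs` is a list of traces; each trace lists the block ids the job
    requested, in order. Returns block id -> average accesses per run.
    """
    totals = Counter()
    for trace in runs:
        totals.update(trace)
    return {block: count / len(runs) for block, count in totals.items()}


class PredictiveCache:
    """Fixed-capacity cache that consults a learned per-job profile.

    Hypothetical policy: evict the cached block with the lowest predicted
    access count (oldest first among ties), and skip caching a new block
    that is predicted to be colder than the eviction candidate.
    """

    def __init__(self, capacity, profile):
        self.capacity = capacity
        self.profile = profile
        self.blocks = OrderedDict()  # block id -> data, least recently used first
        self.hits = 0
        self.misses = 0

    def _score(self, block_id):
        # Blocks never seen in previous runs are treated as coldest.
        return self.profile.get(block_id, 0.0)

    def get(self, block_id, load):
        if block_id in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block_id)  # refresh recency
            return self.blocks[block_id]
        self.misses += 1
        data = load(block_id)
        self._admit(block_id, data)
        return data

    def _admit(self, block_id, data):
        if len(self.blocks) >= self.capacity:
            # min() returns the first minimum in iteration order,
            # i.e. the least recently used block among equal scores.
            victim = min(self.blocks, key=self._score)
            if self._score(block_id) < self._score(victim):
                return  # incoming block is predicted colder: do not cache it
            del self.blocks[victim]
        self.blocks[block_id] = data


# Two made-up traces from earlier runs of the same job.
previous_runs = [
    ["a", "b", "a", "c", "a", "b"],
    ["a", "b", "a", "d", "a", "b"],
]
profile = learn_profile(previous_runs)

cache = PredictiveCache(capacity=2, profile=profile)
for block in ["a", "b", "c", "a", "d", "b", "a"]:
    cache.get(block, load=lambda b: f"data-{b}")

print(cache.hits, cache.misses)  # prints: 3 4
```

In this toy trace the rarely used blocks `c` and `d` never displace the hot blocks `a` and `b`, whereas a plain LRU cache of the same size would miss on every request. A real implementation would of course need a much richer profile (ordering, scan ranges, time since job start) than a simple per-block count.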
