http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
Here you can find the discussion, trade-offs and working code/API (even for M/R) about this and the approach you are trying out.

Regards,
Shahab

On Mon, Sep 23, 2013 at 5:41 PM, anil gupta <[email protected]> wrote:

> Hi All,
>
> I have a secondary index(inverted index) table with a rowkey on the basis
> of Timestamp of an event. Assume the rowkey as <TimeStamp in Epoch>.
> I also store some extra(apart from main_table rowkey) columns in that table
> for doing filtering.
>
> The requirement is to do range-based scan on the basis of time of
> event. Hence, the index with this rowkey.
> I cannot use Hashing or MD5 digest solution because then i cannot do range
> based scans. And, i already have a index like OpenTSDB in another table
> for the same dataset.(I have many secondary Index for same data set.)
>
> Problem: When we increase the write workload during stress test. Time
> secondary index becomes a bottleneck due to the famous Region HotSpotting
> problem.
> Solution: I am thinking of adding a prefix of { (<TimeStamp in Epoch>%10) =
> bucket} in the rowkey. Then my row key will become:
> <Bucket><TimeStamp in Epoch>
> By using above rowkey i can at least alleviate *WRITE* problem.(i don't
> think problem can be fixed permanently because of the use case requirement.
> I would love to be proven wrong.)
> However, with the above row key, now when i want to *READ* data, for every
> single range scans i have to read data from 10 different regions. This
> extra load for read is scaring me a bit.
>
> I am wondering if anyone has better suggestion/approach to solve this
> problem given the constraints i have. Looking for feedback from community.
>
> --
> Thanks & Regards,
> Anil Gupta
>
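
To make the read-side cost of the <Bucket><TimeStamp in Epoch> key concrete, here is a minimal sketch in plain Python. It does not use the HBase client or the HBaseWD API; a sorted in-memory list stands in for the table. The names `salted_key`, `scan_ranges` and `range_scan` are hypothetical, and the fixed-width, zero-padded epoch-seconds encoding is an assumption that the original message does not specify.

```python
import bisect
import heapq

NUM_BUCKETS = 10  # matches the "% 10" in the proposed rowkey
TS_WIDTH = 10     # assumption: epoch seconds, zero-padded to a fixed width


def salted_key(ts: int) -> str:
    """Build <Bucket><TimeStamp in Epoch>, with bucket = ts % NUM_BUCKETS."""
    bucket = ts % NUM_BUCKETS
    return f"{bucket}{ts:0{TS_WIDTH}d}"


def scan_ranges(start_ts: int, stop_ts: int):
    """One [start_key, stop_key) pair per bucket for the time range."""
    return [
        (f"{b}{start_ts:0{TS_WIDTH}d}", f"{b}{stop_ts:0{TS_WIDTH}d}")
        for b in range(NUM_BUCKETS)
    ]


def range_scan(sorted_keys, start_ts: int, stop_ts: int):
    """Run one scan per bucket, then merge the results back into time order.

    sorted_keys stands in for the table: rowkeys in lexicographic order.
    """
    per_bucket = []
    for start_key, stop_key in scan_ranges(start_ts, stop_ts):
        lo = bisect.bisect_left(sorted_keys, start_key)
        hi = bisect.bisect_left(sorted_keys, stop_key)
        # Strip the one-character bucket prefix to recover the timestamp.
        per_bucket.append([int(k[1:]) for k in sorted_keys[lo:hi]])
    # Each per-bucket list is already sorted, so a k-way merge is enough.
    return list(heapq.merge(*per_bucket))


if __name__ == "__main__":
    table = sorted(salted_key(t) for t in range(1379960000, 1379960100))
    print(range_scan(table, 1379960010, 1379960020))
```

Consecutive timestamps land in different buckets, which is what spreads the writes. The price is that every time-range read becomes `NUM_BUCKETS` scans plus a merge, so the bucket count is the knob that trades write distribution against read fan-out.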
