On Jun 02, Sam Seigal wrote:
><eventid> - <yyyy-mm-dd>
>
>My eventId can be one of 12 distinct values (let us say from A-L), and I
>have a 4 node cluster running HBase right now.
>
>After doing some research in our OLTP database, I found that the majority
>(about 45% of the data) from the last 6 months written in the OLTP database
>has the event id equal to value "A".
(Disclaimer: HBase n00b pretending to be an expert; I might be grossly wrong in certain respects.)

HBase regions are not organized like a trie; each region simply covers a contiguous, sorted range of row keys. So dense clustering under a given first byte of the row key should not be a problem for how the regions are constructed. With the default splitting scheme, a region is split once it grows past a configured size, so, assuming comparable row sizes, the regions should end up holding roughly the same number of keys each. The "A" range would just be covered by more regions than the other event ids.

The potential problem, which might be a bit harder to dodge, occurs when all the region sizes are comparable but the access pattern is heavily skewed towards certain regions. At that point, you would have to split the hot regions manually. I am not sure whether HBase can spread hot regions across different physical nodes on the fly, though.
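To make the splitting argument concrete, here is a toy simulation in plain Python. It is not HBase code: it assumes a region is just a sorted run of row keys that splits at its median once it holds more than a fixed number of rows, which stands in for the size threshold under the "comparable row sizes" assumption. The function name, the 200-row limit, and the sequence suffix that keeps keys unique are all made up for illustration.

```python
from bisect import insort

def simulate_regions(row_keys, max_rows_per_region):
    """Toy model: insert keys one at a time and split any region that
    exceeds max_rows_per_region at its median key."""
    regions = [[]]  # sorted key lists, ordered by key range
    for key in row_keys:
        # Pick the last region whose first key is <= key (else the first region).
        idx = 0
        for i, region in enumerate(regions):
            if region and region[0] <= key:
                idx = i
        insort(regions[idx], key)
        if len(regions[idx]) > max_rows_per_region:
            mid = len(regions[idx]) // 2
            regions[idx:idx + 1] = [regions[idx][:mid], regions[idx][mid:]]
    return regions

# Skewed workload from the quoted message: 45% of rows carry event id "A",
# the rest are spread over B-L. The trailing counter is a hypothetical
# uniquifier, not part of the original key design.
keys = []
for i in range(2000):
    event = "A" if i % 100 < 45 else "BCDEFGHIJKL"[i % 11]
    keys.append(f"{event}-2011-05-{i % 28 + 1:02d}-{i:05d}")

regions = simulate_regions(keys, max_rows_per_region=200)
sizes = [len(r) for r in regions]
regions_with_a = sum(1 for r in regions if r[0].startswith("A"))
print(sizes, regions_with_a)
```

In this model every region ends up with between 100 and 200 rows, and the "A" prefix is spread over several regions instead of bloating one. What the model does not capture is the read/write skew: if most traffic goes to the "A" regions, they are still the hot ones, however evenly sized they are.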
