Hi Cristofer,

Thanks for the elaborate response!
I don't have much information about the production data, as I work with partial data. But based on discussions with my project partners, I have some answers for you.

The number of model numbers and serial numbers will be finite, and not very large. As far as I know, there is no predefined rule for creating model numbers or serial numbers.

I have two access patterns. First, I count the number of rows for a specific model number; I use a row key filter for this. Second, I filter the rows based on model number, serial number, and some other columns; I scan the table with a column value filter for this case.

I will evaluate salting as you have explained.

Regards,
Anand.C

On Tue, Jul 17, 2012 at 12:30 AM, Cristofer Weber <[email protected]> wrote:

> Hi Anand,
>
> As usual, the answer is that 'it depends' :)
>
> I think that the main question here is: why are you afraid that this setup
> would lead to region server hotspotting? Is it because you don't know what
> your production data will look like?
>
> Based on what you told me about your rowkey, you will query mostly by
> providing model no. + serial no., but:
> 1 - How is your rowkey distributed? Are there tons of different
> modelNumbers AND serialNumbers? Few modelNumbers and a lot of
> serialNumbers? Few of both?
> 2 - Putting modelNumber at the front of your rowkey means that your data
> will be sorted by modelNumber first. So, what is the rule that determines
> how a modelNumber is created? Is it a sequential number that increases over
> time? If so, are newer members accessed a lot more than older members? If
> not, what drives this number? Is it an encoding rule?
> 3 - Do you expect more write/read load on a few of these modelNumbers
> and/or serialNumbers? Will it be similar to a Pareto distribution?
> Distributed over what?
>
> Also, two other things got my attention here...
> 1 - Why are you filtering with regex? If your queries are over model no.
> + serial no., why don't you just scan starting at your
> modelNumber+serialNumber and stopping at your next
> modelNumber+serialNumber? Or is there another access pattern that doesn't
> fit your composite rowkey?
> 2 - Why do you have to add a timestamp to ensure uniqueness?
>
> Now, answering your question without more info about your data, you can
> apply hashing in two ways:
> 1 - Generating a hash (MD5 is the most common, as far as I have read) and
> using only this hash as your rowkey. Based on what you have told me, this
> doesn't fit your needs, because you would not be able to apply your
> filter anymore.
> 2 - Salting, by prefixing your current rowkey with a pinch of hash. Notice
> that the hash portion must be your rowkey prefix to ensure a roughly
> balanced distribution over something (where something is your region
> servers). I'm working on a case that is a bit similar to yours, and what
> I'm doing right now is calculating the hashValue of my rowkey and using a
> Java Formatter to create a hex string to prepend to my rowkey. Something
> like String.format("%03x", hashValue)
>
> In both cases, you still have to split your regions in advance, and it
> is better to work out your splits before starting to feed your table
> with production data.
>
> Also, you have to study the consequences that changing your rowkey will
> bring. It doesn't come for free.
>
> There are a lot of words and a lot of questions here, so by now I feel I
> have started to shoot in the dark. Try to understand your production data,
> and if you have more to share, it will certainly help!
>
> Regards,
> Cristofer
>
> -----Original Message-----
> From: AnandaVelMurugan Chandra Mohan [mailto:[email protected]]
> Sent: Monday, July 16, 2012 02:30
> To: [email protected]
> Subject: Rowkey hashing to avoid hotspotting
>
> Hi,
>
> I am using HBase to store data about mechanical components. Each component
> has a model no., a serial no., and some other attributes.
>
> I would be querying my data mostly by model no. and serial no. So I
> created a composite key with these two attributes and added a timestamp to
> make it unique.
>
> To filter the data, I use a row key filter with a regex string comparator,
> and it works well with sample seed data. Now I am afraid that this setup
> will lead to region server hotspotting when we load production data into
> HBase. I read that hashing may solve this problem. Can someone help me
> implement hashing of the row key? Also, I want the row filter to keep
> working, as I have to display the number of components on a web page and I
> use the row key filter to implement that functionality. Any guidance would
> be of great help.
>
> --
> Regards,
> Anand

--
Regards,
Anand
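
For reference, the salting idea discussed in this thread (a short hash prefix formatted with String.format("%03x", ...) in front of the existing composite rowkey) could be sketched roughly as below. This is only an illustration under stated assumptions, not the implementation either poster used: the class and method names, the "|" delimiter, the bucket count of 16, and the use of String.hashCode() as the hash are all hypothetical choices, and no HBase API is called. The sketch also shows one consequence of salting for the "count rows for a model number" access pattern: because the salt is derived from the full key, rows of one model end up in every bucket, so counting them needs one prefix scan per bucket.

```java
import java.util.ArrayList;
import java.util.List;

final class SaltedRowKey {
    // Hypothetical bucket count; in practice it would be chosen to match
    // the number of pre-split regions.
    static final int BUCKETS = 16;

    // Derive a salt from the hash of the original key and render it as a
    // zero-padded, 3-digit hex string. String.hashCode() is an assumption
    // here; any stable hash would do.
    static String saltFor(String originalKey) {
        int bucket = Math.floorMod(originalKey.hashCode(), BUCKETS);
        return String.format("%03x", bucket);
    }

    // Build salt|model|serial|timestamp. The "|" delimiter is hypothetical.
    static String saltedKey(String model, String serial, long timestamp) {
        String original = model + "|" + serial + "|" + timestamp;
        return saltFor(original) + "|" + original;
    }

    // All rows of one model are spread over every bucket, so counting them
    // needs one prefix scan per bucket. This returns those prefixes.
    static List<String> scanPrefixesForModel(String model) {
        List<String> prefixes = new ArrayList<>();
        for (int b = 0; b < BUCKETS; b++) {
            prefixes.add(String.format("%03x", b) + "|" + model + "|");
        }
        return prefixes;
    }

    public static void main(String[] args) {
        // Example values are made up for illustration.
        System.out.println(saltedKey("M100", "S0001", 1342400000000L));
        System.out.println(scanPrefixesForModel("M100"));
    }
}
```

Each prefix returned by scanPrefixesForModel would be used as the start of a separate scan, and the per-bucket counts summed on the client side. That extra fan-out is part of the cost of changing the rowkey that Cristofer mentions.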
