Read hotspotting? Hmm, there's a cache, and I don't see any real use case where it would occur naturally.
Sent from a remote device. Please excuse any typos...

Mike Segel

On Jul 17, 2012, at 10:53 AM, Alex Baranau <[email protected]> wrote:

> The most common reason for RS hotspotting while writing data to HBase is
> writing rows with monotonically increasing/decreasing row keys. E.g. if you
> put a timestamp in the first part of your key, then you are likely to have
> monotonically increasing row keys. You can find more info about this issue
> and how to solve it here: [1], and you may also want to look at an already
> implemented salting solution [2].
>
> As for RS hotspotting during reading - it is hard to predict without
> knowing what the most common data access patterns are. E.g. putting model #
> in the first part of a key may seem like a good distribution, but if your
> web site is used mostly by Mercedes owners, the majority of the read load
> may be directed to just a few regions. Again, salting can help a lot here.
>
> +1 to what Cristofer said on other things, esp.: use partial key scans
> where possible instead of filters, and pre-split your table.
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
>
> [1] http://bit.ly/HnKjbc
> [2] https://github.com/sematext/HBaseWD
>
> On Tue, Jul 17, 2012 at 10:44 AM, AnandaVelMurugan Chandra Mohan <
> [email protected]> wrote:
>
>> Hi Cristofer,
>>
>> Thanks for the elaborate response!
>>
>> I don't have much information about the production data, as I work with
>> partial data. But based on discussions with my project partners, I have
>> some answers for you.
>>
>> The number of model numbers and serial numbers will be finite. Not so
>> many... As far as I know, there is no predefined rule for model number or
>> serial number creation.
>>
>> I have two access patterns. I count the number of rows for a specific
>> model number. I use a rowkey filter for this. I also filter the rows based
>> on model, serial number and some other columns. I scan the table with a
>> column value filter for this case.
>>
>> I will evaluate salting as you have explained.
>>
>> Regards,
>> Anand.C
>>
>> On Tue, Jul 17, 2012 at 12:30 AM, Cristofer Weber <
>> [email protected]> wrote:
>>
>>> Hi Anand,
>>>
>>> As usual, the answer is that 'it depends' :)
>>>
>>> I think that the main question here is: why are you afraid that this
>>> setup would lead to region server hotspotting? Is it because you don't
>>> know what your production data will look like?
>>>
>>> Based on what you told us about your rowkey, you will query mostly by
>>> providing model no. + serial no., but:
>>> 1 - How is your rowkey distributed? Are there tons of different
>>> modelNumbers AND serialNumbers? Few modelNumbers and a lot of
>>> serialNumbers? Few of both?
>>> 2 - Putting modelNumber at the front of your rowkey means that your data
>>> will be sorted by rowkey. So, what is the rule that determines how a
>>> modelNumber is created? Is it a sequential number that increases over
>>> time? If so, are newer members accessed a lot more than older members?
>>> If not, what will drive this number? Is it an encoding rule?
>>> 3 - Do you expect more write/read load over a few of these modelNumbers
>>> and/or serialNumbers? Will it be similar to a Pareto Distribution?
>>> Distributed over what?
>>>
>>> Also, two other things got my attention here...
>>> 1 - Why are you filtering with regex? If your queries are over model no.
>>> + serial no., why don't you just scan starting at your
>>> modelNumber+SerialNumber, and stopping at your next
>>> modelNumber+SerialNumber? Or is there another access pattern that doesn't
>>> apply to your composite rowkey?
>>> 2 - Why do you have to add a timestamp to ensure uniqueness?
>>>
>>> Now, answering your question without more info about your data, you can
>>> apply hash in two ways:
>>> 1 - Generating a hash (MD5 is the most common as far as I read about) and
>>> using only this hash as your rowkey.
>>> Based on what you have told us, this way
>>> doesn't fit your needs, because you would not be able to apply your
>>> filter anymore.
>>> 2 - Salting, by prefixing your current rowkey with a pinch of hash.
>>> Notice that the hash portion must be your rowkey prefix to ensure a kind
>>> of balanced distribution over something (where something is your region
>>> servers). I'm working with a case that is a bit similar to yours, and
>>> what I'm doing right now is calculating the hashValue of my rowkey and
>>> using a Java Formatter to create a hex string to prepend to my rowkey.
>>> Something like String.format("%03x", hashValue)
>>>
>>> In both cases, you still have to split your regions in advance, and it
>>> will be better to work out your splitting before starting to feed your
>>> table with production data.
>>>
>>> Also, you have to study the consequences that changing your rowkey will
>>> bring. It's not free.
>>>
>>> There are a lot of words here and a lot of questions, so by now I feel
>>> I've started to shoot in the dark. Try to understand your production data
>>> and if you have more to share, for sure it will help!
>>>
>>> Regards,
>>> Cristofer
>>>
>>> -----Original Message-----
>>> From: AnandaVelMurugan Chandra Mohan [mailto:[email protected]]
>>> Sent: Monday, July 16, 2012 02:30
>>> To: [email protected]
>>> Subject: Rowkey hashing to avoid hotspotting
>>>
>>> Hi,
>>>
>>> I am using HBase to store data about mechanical components. Each
>>> component has a model no. and serial no. and some other attributes.
>>>
>>> I will be querying my data mostly by model no. and serial no., so I
>>> created a composite key with these two attributes and added a timestamp
>>> to make it unique.
>>>
>>> To filter the data, I use a rowkey filter with a regex string comparator,
>>> and it works well with sample seed data. Now I am afraid that this setup
>>> will lead to region server hotspotting when we load production data into
>>> HBase.
>>> I read that hashing may solve this problem. Can someone help me
>>> implement hashing of the row key? I would also want the row filter to
>>> keep working, as I have to display the number of components on a web page
>>> and I use the row key filter to implement that functionality. Any
>>> guidance would be of great help.
>>>
>>> --
>>> Regards,
>>> Anand
>>>
>>
>> --
>> Regards,
>> Anand
>>
>
> --
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
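
For reference, here is a minimal sketch of the salting idea Cristofer describes in the thread (a hex prefix built with String.format("%03x", hashValue)). It is plain Java with no HBase calls, and most of it is assumption: the class name SaltedRowKey, the BUCKETS constant, the helper names and the sample key layout are all hypothetical. Reducing the hash to a fixed number of buckets is my choice so that the split points can be listed up front; it is not necessarily what Cristofer does.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of HBase or of the HBaseWD project.
final class SaltedRowKey {
    // Assumed number of salt buckets; pick it to match your pre-split regions.
    static final int BUCKETS = 16;

    // Map a row key to a bucket in [0, BUCKETS) using a hash of its bytes.
    static int bucketOf(String rowKey) {
        int hashValue = Arrays.hashCode(rowKey.getBytes(StandardCharsets.UTF_8));
        return Math.floorMod(hashValue, BUCKETS); // floorMod keeps the result non-negative
    }

    // Prepend the bucket as a 3-digit hex string, as suggested in the thread.
    static String salt(String rowKey) {
        return String.format("%03x", bucketOf(rowKey)) + rowKey;
    }

    // Keys at which to pre-split the table: one boundary per bucket after the first.
    static List<String> splitPoints() {
        List<String> points = new ArrayList<>();
        for (int b = 1; b < BUCKETS; b++) {
            points.add(String.format("%03x", b));
        }
        return points;
    }

    // The salt depends on the whole key, so a "count by model" query can no
    // longer be a single prefix scan: it needs one prefix scan per bucket.
    static List<String> scanPrefixes(String keyPrefix) {
        List<String> prefixes = new ArrayList<>();
        for (int b = 0; b < BUCKETS; b++) {
            prefixes.add(String.format("%03x", b) + keyPrefix);
        }
        return prefixes;
    }

    public static void main(String[] args) {
        // Assumed key layout: model | serial | timestamp.
        String key = "M100|S0042|1342500000000";
        System.out.println(salt(key));
        System.out.println(splitPoints());
        System.out.println(scanPrefixes("M100|"));
    }
}
```

The scanPrefixes helper is there to make the "it's not free" point concrete: once keys are salted, counting the rows for one model means issuing one partial key scan per bucket and adding up the results, instead of a single scan.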
