Hi Cristofer,

The data I store is test cell reports about a component. I have many test cell reports for each model number + serial number combination, so to make the rowkey unique, I added a timestamp.

On Wed, Jul 18, 2012 at 3:14 AM, Cristofer Weber <[email protected]> wrote:

> So, Anand, there are some things that can help, but again, most of them
> are related to the famous access patterns.
>
> Sometimes it is not easy to get more information about them in advance,
> but if you are replacing another system you can study its data
> distribution, grouping for counts, mean, changes over time, etc. It is
> possible to analyze with partial data too, but it is risky because you
> will be subject to the way this partial data was gathered; sample data
> may not be representative.
>
> Salting your rowkey with a hash calculated over your model# will probably
> result in a uniform distribution over a range (if using modulus), and
> pre-splitting your table will balance your load over your Region Servers.
> Also, you will be able to recalculate your hash for your model# before
> scanning for it, allowing for a scan over specific rowkeys while
> restricting this scan by startRow and stopRow. Remember that if your
> rowkeys share the same prefix they will probably be located in the same
> region, and your scan will be favored by this.
>
> I'm still curious about your need to add a timestamp after your
> model#,serial#... I have some background in manufacturing systems and
> usually a serial number is unique. But, of course, it's just curiosity. :-)
>
> Regards,
> Cristofer
>
> -----Original Message-----
> From: Alex Baranau [mailto:[email protected]]
> Sent: Tuesday, July 17, 2012 12:53
> To: [email protected]
> Subject: Re: Rowkey hashing to avoid hotspotting
>
> The most common reason for RS hotspotting while writing data to HBase is
> writing rows with monotonically increasing/decreasing row keys. E.g. if
> you put a timestamp in the first part of your key, then you are likely to
> have monotonically increasing row keys. You can find more info about this
> issue and how to solve it here: [1], and you may also want to look at an
> already implemented salting solution [2].
>
> As for RS hotspotting during reading, it is hard to predict without
> knowing what the most common data access patterns are. E.g. putting
> model # in the first part of a key may seem like a good distribution, but
> if your web site is used mostly by Mercedes owners, the majority of the
> read load may be directed to just a few regions. Again, salting can help
> a lot here.
>
> +1 to what Cristofer said on other things, esp.: use partial key scans
> where possible instead of filters, and pre-split your table.
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
>
> [1] http://bit.ly/HnKjbc
> [2] https://github.com/sematext/HBaseWD
>
> On Tue, Jul 17, 2012 at 10:44 AM, AnandaVelMurugan Chandra Mohan
> <[email protected]> wrote:
>
> > Hi Cristofer,
> >
> > Thanks for the elaborate response!!!
> >
> > I do not have much information about production data as I work with
> > partial data. But based on discussion with my project partners, I have
> > some answers for you.
> >
> > The number of model numbers and serial numbers will be finite. Not so
> > many... As far as I know, there is no predefined rule for model number
> > or serial number creation.
> >
> > I have two access patterns. I count the number of rows for a specific
> > model number. I use a rowkey filter for this. Also, I filter the rows
> > based on model, serial number and some other columns. I scan the table
> > with a column value filter for this case.
> >
> > I will evaluate salting as you have explained.
> >
> > Regards,
> > Anand.C
> >
> > On Tue, Jul 17, 2012 at 12:30 AM, Cristofer Weber
> > <[email protected]> wrote:
> >
> > > Hi Anand,
> > >
> > > As usual, the answer is that 'it depends' :)
> > >
> > > I think that the main question here is: why are you afraid that this
> > > setup would lead to region server hotspotting? Is it because you
> > > don't know how your production data will look?
> > >
> > > Based on what you told us about your rowkey, you will query mostly by
> > > providing model no. + serial no., but:
> > > 1 - How is your rowkey distribution? Are there tons of different
> > > modelNumbers AND serialNumbers? Few modelNumbers and a lot of
> > > serialNumbers? Few of both?
> > > 2 - Putting modelNumber in front of your rowkey means that your data
> > > will be sorted by it. So, what is the rule that determines a
> > > modelNumber's creation? Is it a sequential number that will increase
> > > over time? If so, are newer members accessed a lot more than older
> > > members? If not, what will drive this number? Is it an encoding rule?
> > > 3 - Do you expect more write/read load over a few of these
> > > modelNumbers and/or serialNumbers? Will it be similar to a Pareto
> > > Distribution? Distributed over what?
> > >
> > > Also, two other things got my attention here...
> > > 1 - Why are you filtering with regex? If your queries are over
> > > model no. + serial no., why don't you just scan starting at your
> > > modelNumber+SerialNumber, and stopping at your next SerialNumber? Or
> > > is there another access pattern that doesn't apply to your composite
> > > rowkey?
> > > 2 - Why do you have to add a timestamp to ensure uniqueness?
> > >
> > > Now, answering your question without more info about your data, you
> > > can apply a hash in two ways:
> > > 1 - Generating a hash (MD5 is the most common as far as I have read)
> > > and using only this hash as your rowkey. Based on what you have told
> > > us, this way doesn't fit your needs, because you would not be able to
> > > apply your filter anymore.
> > > 2 - Salting, by prefixing your current rowkey with a pinch of hash.
> > > Notice that the hash portion must be your rowkey prefix to ensure a
> > > kind of balanced distribution over something (where something is your
> > > region servers). I'm working with a case that is a bit similar to
> > > yours, and what I'm doing right now is calculating the hashValue of
> > > my rowkey and using a Java Formatter to create a hex string to
> > > prepend to my rowkey. Something like a
> > > String.format("%03x", hashValue)
> > >
> > > In both cases, you still have to split your regions in advance, and
> > > it will be better to work out your splitting before starting to feed
> > > your table with production data.
> > >
> > > Also, you have to study the consequences that changing your rowkey
> > > will bring. It's not for free.
> > >
> > > There are a lot of words here and a lot of questions, so by now I
> > > feel I have started to shoot in the dark. Try to understand your
> > > production data, and if you have more to share, for sure it will
> > > help!
> > >
> > > Regards,
> > > Cristofer
> > >
> > > -----Original Message-----
> > > From: AnandaVelMurugan Chandra Mohan [mailto:[email protected]]
> > > Sent: Monday, July 16, 2012 02:30
> > > To: [email protected]
> > > Subject: Rowkey hashing to avoid hotspotting
> > >
> > > Hi,
> > >
> > > I am using HBase to store data about mechanical components. Each
> > > component has a model no., a serial no. and some other attributes.
> > >
> > > I would be querying my data mostly by model no. and serial no. So I
> > > created a composite key with these two attributes and added a
> > > timestamp to make it unique.
> > >
> > > To filter the data, I use a rowkey filter with a regex string
> > > comparator, and it works well with sample seed data. Now I am afraid
> > > that this setup will lead to region server hotspotting when we load
> > > production data into HBase. I read that hashing may solve this
> > > problem. Can someone help me implement hashing of the row key? Also,
> > > I would want the row filter to keep working, as I have to display the
> > > number of components in a web page and I use the row key filter to
> > > implement that functionality. Any guidance would be of great help.
> > >
> > > --
> > > Regards,
> > > Anand
> >
> > --
> > Regards,
> > Anand
>
> --
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr

--
Regards,
Anand
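
For reference, the salting idea discussed in this thread (a short hash of the model number as the rowkey prefix, pre-split boundaries per salt bucket, and a startRow/stopRow range instead of a regex filter) could be sketched in plain Java roughly as below. This is only an illustration under assumptions, not code from anyone in the thread: the class name `SaltedRowKeys`, the bucket count of 16, the `|` separator, the use of the first two MD5 bytes, and the zero-padded timestamp are all hypothetical choices, and it assumes ASCII model and serial numbers that never contain the separator. It deliberately uses no HBase classes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class SaltedRowKeys {
    // Hypothetical bucket count; in practice it would be chosen to match the pre-split regions.
    static final int BUCKETS = 16;
    // Hypothetical field separator; assumed never to occur inside a model or serial number.
    static final char SEP = '|';

    // Salt computed from the model number only, so all rows of one model share one prefix.
    static String salt(String model) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(model.getBytes(StandardCharsets.UTF_8));
            // Take the first two digest bytes as an unsigned number, then apply the modulus.
            int hashValue = (((digest[0] & 0xff) << 8) | (digest[1] & 0xff)) % BUCKETS;
            return String.format("%03x", hashValue);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Inclusive start of the key range holding every row of the given model.
    static String startRow(String model) {
        return salt(model) + SEP + model + SEP;
    }

    // Exclusive end of that range: the same prefix with its last character incremented.
    static String stopRow(String model) {
        String prefix = startRow(model);
        char last = prefix.charAt(prefix.length() - 1);
        return prefix.substring(0, prefix.length() - 1) + (char) (last + 1);
    }

    // Full key: salt, model, serial, then a zero-padded timestamp so that
    // lexicographic order follows time order within one model + serial pair.
    static String rowKey(String model, String serial, long timestampMillis) {
        return startRow(model) + serial + SEP + String.format("%013d", timestampMillis);
    }

    // Boundaries between salt buckets, usable as split points when pre-splitting the table.
    static String[] splitKeys() {
        String[] keys = new String[BUCKETS - 1];
        for (int i = 1; i < BUCKETS; i++) {
            keys[i - 1] = String.format("%03x", i);
        }
        return keys;
    }
}
```

With keys built this way, counting the rows of one model becomes a range scan between `startRow(model)` and `stopRow(model)`, and the same idea extends to a model + serial prefix. In HBase these strings would be converted to bytes before being used as row keys, scan boundaries, or split points.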
