Thanks a lot, guys! I will evaluate and implement a solution based on your suggestions.
On Thu, Jul 19, 2012 at 10:22 PM, syed kather <[email protected]> wrote:

> Anand,
> I had a case in which I combined 4 fields into one row key. The
> serial number can be the first part of the rowkey and the model number
> the second part, so that a B-Search on the row key will be faster because
> we can avoid a lot of jumps while doing the B-Search.
> Note: if the serial number changes frequently, then put the serial
> number first.
>
> To solve the hotspotting problem, I have started implementing
>
> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>
> In my case I had 20 million rows in my HBase table. I had the same
> problem while reading in MapReduce.
>
> Thanks and Regards,
> S SYED ABDUL KATHER
>
> On Thu, Jul 19, 2012 at 8:52 PM, Alex Baranau <[email protected]> wrote:
>
> > > I read somewhere that HBase is not
> > > good at handling more than 100 column families
> >
> > Heh. Usually it is not good to have more than two or three, actually.
> > See [1], and maybe also [2].
> >
> > Alex Baranau
> > ------
> > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> >
> > [1] http://hbase.apache.org/book/number.of.cfs.html
> > [2] http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know
> >
> > On Thu, Jul 19, 2012 at 11:08 AM, AnandaVelMurugan Chandra Mohan
> > <[email protected]> wrote:
> >
> > > Hi Cristofer,
> > >
> > > No problem... I am happy to share and learn. :)
> > >
> > > Regarding a timestamp-based column family, I haven't thought about it.
> > > But my only concern is the number of column families. I read somewhere
> > > that HBase is not good at handling more than 100 column families.
> > >
> > > On Wed, Jul 18, 2012 at 11:15 PM, Cristofer Weber
> > > <[email protected]> wrote:
> > >
> > > > Hi Anand!
> > > >
> > > > I see...
> > > > sorry for being so curious, but since I started studying HBase I
> > > > have been curious about how people are modeling their tables, and in
> > > > what kinds of systems HBase is in use.
> > > >
> > > > Have you evaluated recording your reports in a distinct CF using
> > > > timestamps as column qualifiers? It's my curiosity asking again!
> > > >
> > > > Thanks for sharing!
> > > >
> > > > Regards,
> > > > Cristofer
> > > >
> > > > -----Original Message-----
> > > > From: AnandaVelMurugan Chandra Mohan [mailto:[email protected]]
> > > > Sent: Wednesday, July 18, 2012 13:04
> > > > To: [email protected]
> > > > Subject: Re: Rowkey hashing to avoid hotspotting
> > > >
> > > > Hi Cristofer,
> > > >
> > > > The data I store is test cell reports about a component. I have many
> > > > test cell reports for each model number + serial number combination.
> > > > So to make the rowkey unique, I added a timestamp.
> > > >
> > > > On Wed, Jul 18, 2012 at 3:14 AM, Cristofer Weber
> > > > <[email protected]> wrote:
> > > >
> > > > > So, Anand, there are some things that can help, but again, most of
> > > > > them are related to the famous access patterns.
> > > > >
> > > > > Sometimes it is not easy to get more information about them in
> > > > > advance, but if you are replacing another system you can study its
> > > > > data distribution, grouping for counts, mean, changes over time, etc.
> > > > > It is possible to analyze with partial data too, but it is risky
> > > > > because you will be subject to the way this partial data was
> > > > > gathered; sample data may not be representative.
> > > > >
> > > > > Salting your rowkey with a hash calculated over your model# will
> > > > > probably result in a uniform distribution over a range (if using
> > > > > modulus), and pre-splitting your table will balance your load over
> > > > > your Region Servers.
> > > > > Also, you will be able to recalculate your hash for your model#
> > > > > before scanning for it, allowing for a scan over a specific rowkey
> > > > > range while restricting this scan by startRow and stopRow. Remember
> > > > > that if your rowkeys share the same prefix they will probably be
> > > > > located in the same region, and your scan will be favored by this.
> > > > >
> > > > > I'm still curious about your need to add a timestamp after your
> > > > > model#,serial#... I have some background in manufacturing systems and
> > > > > usually a serial number is unique. But, of course, it's just
> > > > > curiosity. :-)
> > > > >
> > > > > Regards,
> > > > > Cristofer
> > > > >
> > > > > -----Original Message-----
> > > > > From: Alex Baranau [mailto:[email protected]]
> > > > > Sent: Tuesday, July 17, 2012 12:53
> > > > > To: [email protected]
> > > > > Subject: Re: Rowkey hashing to avoid hotspotting
> > > > >
> > > > > The most common reason for RS hotspotting while writing data to HBase
> > > > > is writing rows with monotonically increasing/decreasing row keys.
> > > > > E.g. if you put a timestamp in the first part of your key, then you
> > > > > are likely to have monotonically increasing row keys. You can find
> > > > > more info about this issue and how to solve it here: [1], and you may
> > > > > also want to look at an already implemented salting solution [2].
> > > > >
> > > > > As for RS hotspotting during reading, it is hard to predict without
> > > > > knowing what the most common data access patterns are. E.g. putting
> > > > > the model # in the first part of a key may seem like a good
> > > > > distribution, but if your web site is used mostly by Mercedes owners,
> > > > > the majority of the read load may be directed to just a few regions.
> > > > > Again, salting can help a lot here.
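
A minimal sketch of the salting and startRow/stopRow idea described in the quoted messages, in plain Java with no HBase dependency. The class name, the bucket count of 16, the `|` separator, and the key layout are all assumptions made for illustration; the thread only specifies a hash of the model number, a modulus, and something like `String.format("%03x", hashValue)`. It also assumes model and serial numbers never contain `|`.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of HBase or HBaseWD.
final class SaltedRowKey {
    // Assumed bucket count; it would normally match the number of pre-split regions.
    static final int BUCKETS = 16;

    // Derives a 3-hex-digit salt from the model number only, so the same
    // salt can be recomputed at read time from the model number alone.
    static String salt(String modelNo) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(modelNo.getBytes(StandardCharsets.UTF_8));
            // First two digest bytes as a non-negative int, reduced by modulus.
            int hashValue = (((d[0] & 0xff) << 8) | (d[1] & 0xff)) % BUCKETS;
            return String.format("%03x", hashValue);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is a required JDK algorithm
        }
    }

    // Assumed layout: salt|model|serial|timestamp
    static String rowKey(String modelNo, String serialNo, long timestamp) {
        return salt(modelNo) + "|" + modelNo + "|" + serialNo + "|" + timestamp;
    }

    // Inclusive lower bound for a scan over every row of one model number.
    static String startRow(String modelNo) {
        return salt(modelNo) + "|" + modelNo + "|";
    }

    // Exclusive upper bound: same prefix with the trailing '|' (0x7C)
    // replaced by the next character, '}' (0x7D).
    static String stopRow(String modelNo) {
        return salt(modelNo) + "|" + modelNo + "}";
    }
}
```

With keys built this way, counting rows for one model number could be done with a scan bounded by `startRow(modelNo)` and `stopRow(modelNo)` instead of a regex row filter. Because the salt depends only on the model number, all rows of one model stay in one bucket; a very popular model would still concentrate its load there.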
> > > > >
> > > > > +1 to what Cristofer said on other things, esp.: use partial key
> > > > > scans where possible instead of filters, and pre-split your table.
> > > > >
> > > > > Alex Baranau
> > > > > ------
> > > > > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase -
> > > > > ElasticSearch - Solr
> > > > >
> > > > > [1] http://bit.ly/HnKjbc
> > > > > [2] https://github.com/sematext/HBaseWD
> > > > >
> > > > > On Tue, Jul 17, 2012 at 10:44 AM, AnandaVelMurugan Chandra Mohan
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Hi Cristofer,
> > > > > >
> > > > > > Thanks for the elaborate response!
> > > > > >
> > > > > > I don't have much information about production data as I work with
> > > > > > partial data. But based on discussion with my project partners, I
> > > > > > have some answers for you.
> > > > > >
> > > > > > The number of model numbers and serial numbers will be finite. Not
> > > > > > so many... As far as I know, there is no predefined rule for model
> > > > > > number or serial number creation.
> > > > > >
> > > > > > I have two access patterns. I count the number of rows for a
> > > > > > specific model number. I use a rowkey filter for this. Also, I
> > > > > > filter the rows based on model, serial number and some other
> > > > > > columns. I scan the table with a column value filter for this case.
> > > > > >
> > > > > > I will evaluate salting as you have explained.
> > > > > >
> > > > > > Regards,
> > > > > > Anand.C
> > > > > >
> > > > > > On Tue, Jul 17, 2012 at 12:30 AM, Cristofer Weber
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Anand,
> > > > > > >
> > > > > > > As usual, the answer is that 'it depends' :)
> > > > > > >
> > > > > > > I think that the main question here is: why are you afraid that
> > > > > > > this setup would lead to region server hotspotting?
> > > > > > > Is it because you don't know how your production data will look?
> > > > > > >
> > > > > > > Based on what you told us about your rowkey, you will query mostly
> > > > > > > by providing model no. + serial no., but:
> > > > > > > 1 - How is your rowkey distributed? Are there tons of different
> > > > > > > modelNumbers AND serialNumbers? Few modelNumbers and a lot of
> > > > > > > serialNumbers? Few of both?
> > > > > > > 2 - Putting modelNumber at the front of your rowkey means that
> > > > > > > your data will be sorted by it. So, what is the rule that
> > > > > > > determines how a modelNumber is created? Is it a sequential number
> > > > > > > that increases over time? If so, are newer members accessed a lot
> > > > > > > more than older members? If not, what will drive this number? Is
> > > > > > > it an encoding rule?
> > > > > > > 3 - Do you expect more write/read load over a few of these
> > > > > > > modelNumbers and/or serialNumbers? Will it be similar to a Pareto
> > > > > > > distribution? Distributed over what?
> > > > > > >
> > > > > > > Also, two other things got my attention here...
> > > > > > > 1 - Why are you filtering with regex? If your queries are over
> > > > > > > model no. + serial no., why don't you just scan starting at your
> > > > > > > modelNumber+SerialNumber, and stopping at your next SerialNumber?
> > > > > > > Or is there another access pattern that doesn't apply to your
> > > > > > > composite rowkey?
> > > > > > > 2 - Why do you have to add a timestamp to ensure uniqueness?
> > > > > > >
> > > > > > > Now, answering your question without more info about your data,
> > > > > > > you can apply a hash in two ways:
> > > > > > > 1 - Generating a hash (MD5 is the most common as far as I have
> > > > > > > read) and using only this hash as your rowkey.
> > > > > > > Based on what you have told us, this way doesn't fit your needs,
> > > > > > > because you would not be able to apply your filter anymore.
> > > > > > > 2 - Salting, by prefixing your current rowkey with a pinch of
> > > > > > > hash. Notice that the hash portion must be your rowkey prefix to
> > > > > > > ensure a kind of balanced distribution over something (where
> > > > > > > something is your region servers). I'm working with a case that is
> > > > > > > a bit similar to yours, and what I'm doing right now is
> > > > > > > calculating the hashValue of my rowkey and using a Java Formatter
> > > > > > > to create a hex string to prepend to my rowkey. Something like
> > > > > > > String.format("%03x", hashValue)
> > > > > > >
> > > > > > > In both cases, you still have to split your regions in advance,
> > > > > > > and it will be better to work out your splitting before starting
> > > > > > > to feed your table with production data.
> > > > > > >
> > > > > > > Also, you have to study the consequences that changing your rowkey
> > > > > > > will bring. It's not free.
> > > > > > >
> > > > > > > There are a lot of words here and a lot of questions, so by now I
> > > > > > > feel I have started to shoot in the dark. Try to understand your
> > > > > > > production data, and if you have more to share, for sure it will
> > > > > > > help!
> > > > > > >
> > > > > > > Regards,
> > > > > > > Cristofer
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: AnandaVelMurugan Chandra Mohan [mailto:[email protected]]
> > > > > > > Sent: Monday, July 16, 2012 02:30
> > > > > > > To: [email protected]
> > > > > > > Subject: Rowkey hashing to avoid hotspotting
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I am using HBase to store data about mechanical components. Each
> > > > > > > component has model no.
> > > > > > > and serial no. and some other attributes.
> > > > > > >
> > > > > > > I would be querying my data mostly by model no. and serial no. So
> > > > > > > I created a composite key with these two attributes and added a
> > > > > > > timestamp to make it unique.
> > > > > > >
> > > > > > > To filter the data, I use a rowkey filter with a regex string
> > > > > > > comparator, and it works well with sample seed data. Now I am
> > > > > > > afraid this setup will lead to region server hotspotting when we
> > > > > > > load production data into HBase. I read that hashing may solve
> > > > > > > this problem. Can someone help me implement hashing of the row
> > > > > > > key? Also, I would want the row filter to keep working, as I have
> > > > > > > to display the number of components in a web page and I use the
> > > > > > > row key filter to implement that functionality. Any guidance
> > > > > > > would be of great help.
> > > > > > >
> > > > > > > --
> > > > > > > Regards,
> > > > > > > Anand
> > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > > Anand
> > > > >
> > > > > --
> > > > > Alex Baranau
> > > > > ------
> > > > > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase -
> > > > > ElasticSearch - Solr
> > > >
> > > > --
> > > > Regards,
> > > > Anand
> > >
> > > --
> > > Regards,
> > > Anand
> >
> > --
> > Alex Baranau
> > ------
> > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

--
Regards,
Anand
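
The thread recommends pre-splitting the table before loading production data. The sketch below shows one way the split points could be derived when keys carry a 3-hex-digit salt prefix such as the one produced by String.format("%03x", hashValue). The class name and the bucket count passed in are hypothetical; the thread does not prescribe either.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase.
final class SaltSplitPoints {
    // For N salt buckets, returns the N-1 boundaries "001", "002", ...
    // so that each bucket prefix falls into its own region.
    static List<byte[]> splitKeys(int buckets) {
        List<byte[]> splits = new ArrayList<>();
        for (int bucket = 1; bucket < buckets; bucket++) {
            // Same "%03x" formatting as the salt, so boundaries line up with key prefixes.
            splits.add(String.format("%03x", bucket).getBytes(StandardCharsets.UTF_8));
        }
        return splits;
    }
}
```

The resulting list, converted to a `byte[][]`, is the kind of argument accepted by the HBase admin `createTable` overload that takes split keys. Whether one region per bucket is the right granularity depends on the cluster size and data volume, which the thread leaves open.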
