RE: Rowkey hashing to avoid hotspotting

Cristofer Weber Wed, 18 Jul 2012 10:45:44 -0700

Hi Anand!

I see... sorry for being so curious, but since I started studying HBase I am 
curious about how people are modeling their tables, and in what kinds of 
systems HBase is in use.


Have you evaluated recording your reports in a distinct CF using timestamps as 
column qualifiers? It's my curiosity asking again!

Thanks for sharing!

Regards,
Cristofer

-----Original Message-----
From: AnandaVelMurugan Chandra Mohan [mailto:[email protected]]
Sent: Wednesday, July 18, 2012 13:04
To: [email protected]
Subject: Re: Rowkey hashing to avoid hotspotting

Hi Cristofer,

The data I store is test cell reports about a component. I have many test cell 
reports for each model number + serial number combination, so I added a 
timestamp to make the rowkey unique.


On Wed, Jul 18, 2012 at 3:14 AM, Cristofer Weber < [email protected]> 
wrote:

> So, Anand, there are some things that can help, but again, most of 
> them are related to the famous access patterns.
>
> Sometimes it is not easy to get more information about them in advance, 
> but if you are replacing another system you can study its data 
> distribution: group counts, means, changes over time, etc. It is 
> possible to analyze partial data too, but it is risky because you 
> depend on the way this partial data was gathered; sample data may 
> not be representative.
>
> Salting your rowkey with a hash calculated over your model# will 
> probably result in a uniform distribution over a range (if using 
> modulus), and pre-splitting your table will balance your load over 
> your Region Servers.
> Also, you will be able to recalculate the hash for your model# before 
> scanning for it, allowing a scan over specific rowkeys while 
> restricting this scan by startRow and stopRow. Remember that if your 
> rowkeys share the same prefix they will probably be located in the 
> same region, and your scan will benefit from this.
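
The "uniform distribution (if using modulus)" claim can be checked against whatever model numbers are available before committing to a salt. The following is only a minimal sketch: the class name, the method name and the use of String.hashCode() as the hash are assumptions for illustration, not something taken from this thread.

```java
import java.util.List;

final class SaltDistribution {
    // Counts how many model numbers fall into each salt bucket.
    // String.hashCode() is a stand-in hash (assumption); floorMod keeps
    // the bucket index non-negative even for negative hash codes.
    static int[] bucketCounts(List<String> modelNumbers, int buckets) {
        int[] counts = new int[buckets];
        for (String model : modelNumbers) {
            counts[Math.floorMod(model.hashCode(), buckets)]++;
        }
        return counts;
    }
}
```

If a few buckets dominate the counts on representative data, the salt will not spread the load much, whatever the number of regions.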
>
> I'm still curious about why you need to add a timestamp after your 
> model#,serial#... I have some background in manufacturing systems and 
> usually a serial number is unique. But, of course, it's just 
> curiosity.  :-)
>
> Regards,
> Cristofer
>
> -----Original Message-----
> From: Alex Baranau [mailto:[email protected]]
> Sent: Tuesday, July 17, 2012 12:53
> To: [email protected]
> Subject: Re: Rowkey hashing to avoid hotspotting
>
> The most common reason for RS hotspotting while writing data to HBase 
> is writing rows with monotonically increasing/decreasing row keys. 
> E.g. if you put a timestamp in the first part of your key, then you are 
> likely to have monotonically increasing row keys. You can find more 
> info about this issue and how to solve it here: [1], and you may also 
> want to look at an already implemented salting solution [2].
>
> As for RS hotspotting during reading - it is hard to predict without 
> knowing what the most common data access patterns are. E.g. putting 
> model # in the first part of a key may seem like a good distribution, 
> but if your web site is used mostly by Mercedes owners, the majority of 
> the read load may be directed to just a few regions. Again, salting can 
> help a lot here.
>
> +1 to what Cristofer said on other things, esp: use partial key scans 
> where possible instead of filters and pre-split your table.
>
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - 
> ElasticSearch - Solr
>
> [1] http://bit.ly/HnKjbc
> [2] https://github.com/sematext/HBaseWD
>
> On Tue, Jul 17, 2012 at 10:44 AM, AnandaVelMurugan Chandra Mohan < 
> [email protected]> wrote:
>
> > Hi Cristofer,
> >
> > Thanks for the elaborate response!!!
> >
> > I don't have much information about production data, as I work with 
> > partial data. But based on discussions with my project partners, I 
> > have some answers for you.
> >
> > The number of model numbers and serial numbers will be finite. Not so 
> > many... As far as I know, there is no predefined rule for model number 
> > or serial number creation.
> >
> > I have two access patterns. I count the number of rows for a specific 
> > model number. I use a rowkey filter for this. Also, I filter the rows 
> > based on model, serial number and some other columns. I scan the 
> > table with a column value filter for this case.
> >
> > I will evaluate salting as you have explained.
> >
> > Regards,
> > Anand.C
> >
> > On Tue, Jul 17, 2012 at 12:30 AM, Cristofer Weber < 
> > [email protected]> wrote:
> >
> > > Hi Anand,
> > >
> > > As usual, the answer is that 'it depends'  :)
> > >
> > > I think that the main question here is: why are you afraid that 
> > > this setup would lead to region server hotspotting? Is it because 
> > > you don't know what your production data will look like?
> > >
> > > Based on what you told about your rowkey, you will query mostly by 
> > > providing model no. + serial no., but:
> > > 1 - How is your rowkey distribution? Are there tons of different 
> > > modelNumbers AND serialNumbers? Few modelNumbers and a lot of 
> > > serialNumbers? Few of both?
> > > 2 - Putting modelNumber in front of your rowkey means that your 
> > > data will be sorted by modelNumber first. So, what is the rule that 
> > > determines how a modelNumber is created? Is it a sequential number 
> > > that increases over time? If so, are newer members accessed a lot 
> > > more than older members? If not, what drives this number? Is it an 
> > > encoding rule?
> > > 3 - Do you expect more write/read load over a few of these 
> > > modelNumbers and/or serialNumbers? Will it be similar to a Pareto 
> > > distribution? Distributed over what?
> > >
> > > Also, two other things got my attention here...
> > > 1 - Why are you filtering with regex? If your queries are over 
> > > model no. + serial no., why don't you just scan starting at your 
> > > modelNumber+SerialNumber and stopping at your next SerialNumber? 
> > > Or is there another access pattern that doesn't apply to your 
> > > composite rowkey?
> > > 2 - Why do you have to add a timestamp to ensure uniqueness?
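
The start/stop scan suggested in point 1 needs an exclusive stop row that sorts just after every key sharing the modelNumber+SerialNumber prefix. A minimal sketch of one way to compute it follows; the class and method names and the "|" delimiter are hypothetical, and the resulting byte arrays would be handed to the scan's start and stop row in your HBase client.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PrefixScanBounds {
    // Inclusive start row: the prefix itself.
    static byte[] startRow(String prefix) {
        return prefix.getBytes(StandardCharsets.UTF_8);
    }

    // Exclusive stop row: the smallest key that sorts after every key
    // beginning with the prefix. Trailing 0xff bytes cannot be
    // incremented, so they are dropped first. An empty result means
    // "no upper bound" (scan to the end of the table).
    static byte[] stopRow(byte[] prefix) {
        int end = prefix.length;
        while (end > 0 && prefix[end - 1] == (byte) 0xff) {
            end--;
        }
        if (end == 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++;
        return stop;
    }
}
```

With a prefix such as "M100|S42|", the scan then touches only the rows of that component instead of filtering the whole table with a regex.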
> > >
> > > Now, answering your question without more info about your data, 
> > > you can apply a hash in two ways:
> > > 1 - Generating a hash (MD5 is the most common as far as I have 
> > > read) and using only this hash as your rowkey. Based on what you 
> > > have told, this way doesn't fit your needs, because you would not 
> > > be able to apply your filter anymore.
> > > 2 - Salting, by prefixing your current rowkey with a pinch of hash. 
> > > Notice that the hash portion must be your rowkey prefix to ensure a 
> > > kind of balanced distribution over something (where something is 
> > > your region servers). I'm working with a case that is a bit similar 
> > > to yours, and what I'm doing right now is calculating the hashValue 
> > > of my rowkey and using a Java Formatter to create a hex string to 
> > > prepend to my rowkey. Something like String.format("%03x", hashValue)
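
The salting described in option 2 could look roughly like the sketch below. Several details are assumptions, not something confirmed in this thread: the salt is derived from the model number only (so it can be recomputed before a scan), the bucket count is 16, the hash is MD5, and "|" is used as a field delimiter.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SaltedRowKey {
    // Assumed bucket count; must stay <= 0x1000 so the salt fits in
    // three hex digits.
    static final int BUCKETS = 16;

    // Three-hex-digit salt derived from the model number only.
    static String salt(String modelNumber) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(modelNumber.getBytes(StandardCharsets.UTF_8));
            // Fold the first two digest bytes into a non-negative int,
            // then reduce it to a bucket by modulus.
            int hashValue = ((digest[0] & 0xff) << 8) | (digest[1] & 0xff);
            return String.format("%03x", hashValue % BUCKETS);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every Java platform", e);
        }
    }

    // Salt first, then the original composite key fields.
    static String rowKey(String modelNumber, String serialNumber, long timestamp) {
        return salt(modelNumber) + "|" + modelNumber + "|" + serialNumber + "|" + timestamp;
    }
}
```

Because the salt depends only on the model number, all rows of one model keep a common prefix and stay adjacent in the sorted table.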
> > >
> > > In both cases, you still have to split your regions in advance, 
> > > and it will be better to work your splitting before starting to 
> > > feed your table with production data.
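
For a table keyed by a three-hex-digit salt, the split points for pre-splitting can be derived from the bucket count. This is a sketch under assumptions: the class name, method name and parameters are hypothetical, and the resulting strings would still have to be converted to bytes and passed to whatever table-creation call your HBase client version provides.

```java
import java.util.ArrayList;
import java.util.List;

final class SaltSplits {
    // Returns regions - 1 split keys that divide the salt range
    // [0, buckets) into roughly equal, contiguous slices.
    static List<String> splitKeys(int buckets, int regions) {
        if (regions < 1 || regions > buckets || buckets > 0x1000) {
            throw new IllegalArgumentException("need 1 <= regions <= buckets <= 0x1000");
        }
        List<String> keys = new ArrayList<>();
        for (int r = 1; r < regions; r++) {
            // First salt value owned by region r, in the same "%03x"
            // format used for the rowkey prefix.
            keys.add(String.format("%03x", r * buckets / regions));
        }
        return keys;
    }
}
```

For example, 16 buckets over 4 regions gives the split keys 004, 008 and 00c, so each region starts out owning four salt values.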
> > >
> > > Also, you have to study the consequences that changing your rowkey 
> > > will bring. It's not for free.
> > >
> > > There are a lot of words here and a lot of questions, so by now I 
> > > feel I have started to shoot in the dark. Try to understand your 
> > > production data, and if you have more to share, for sure it will 
> > > help!
> > >
> > > Regards,
> > > Cristofer
> > >
> > > -----Original Message-----
> > > From: AnandaVelMurugan Chandra Mohan [mailto:[email protected]]
> > > Sent: Monday, July 16, 2012 02:30
> > > To: [email protected]
> > > Subject: Rowkey hashing to avoid hotspotting
> > >
> > > Hi,
> > >
> > > I am using HBase to store data about mechanical components. Each 
> > > component has a model no. and serial no. and some other attributes.
> > >
> > > I would be querying my data mostly by model no. and serial no. So 
> > > I created a composite key with these two attributes and added a 
> > > timestamp to make it unique.
> > >
> > > To filter the data, I use a rowkey filter with a regex string 
> > > comparator and it works well with sample seed data. Now I am 
> > > afraid that this setup will lead to region server hotspotting 
> > > when we load production data in HBase. I read that hashing may 
> > > solve this problem. Can someone help me implement hashing of the 
> > > row key? Also, I would want the row filter to keep working, as I 
> > > have to display the number of components in a web page and I use 
> > > the row key filter to implement that functionality. Any guidance 
> > > would be of great help.
> > >
> > > --
> > > Regards,
> > > Anand
> > >
> >
> >
> >
> > --
> > Regards,
> > Anand
> >
>
>
>
> --
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - 
> ElasticSearch - Solr
>



--
Regards,
Anand
