Re: Rowkey hashing to avoid hotspotting

AnandaVelMurugan Chandra Mohan Tue, 17 Jul 2012 07:45:11 -0700

Hi Cristofer,

Thanks for the elaborate response!


I don't have much information about the production data, as I work with
partial data. But based on discussions with my project partners, I have some
answers for you.

The number of model numbers and serial numbers will be finite, and not very
large. As far as I know, there is no predefined rule for model number or
serial number creation.

I have two access patterns. First, I count the number of rows for a specific
model number; I use a rowkey filter for this. Second, I filter the rows based
on model, serial number and some other columns; I scan the table with a
column value filter for this case.
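
For the first access pattern, a bounded range scan is one alternative to the
rowkey filter, along the lines Cristofer suggests in the quoted message below.
The following is only a minimal sketch in plain Java. It assumes a
hypothetical rowkey layout of modelNumber, a '|' separator, serialNumber and
timestamp; the separator and the class and method names are illustrative and
not taken from the actual schema. It only computes the start and stop rows;
passing them to an HBase Scan is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PrefixScanBounds {
    // Inclusive start row: the prefix itself.
    static byte[] startRow(String prefix) {
        return prefix.getBytes(StandardCharsets.UTF_8);
    }

    // Exclusive stop row: the smallest key that sorts after every key
    // beginning with the prefix. Trailing 0xFF bytes are dropped and the
    // last remaining byte is incremented.
    static byte[] stopRow(String prefix) {
        byte[] stop = prefix.getBytes(StandardCharsets.UTF_8);
        for (int i = stop.length - 1; i >= 0; i--) {
            if (stop[i] != (byte) 0xFF) {
                stop[i]++;
                return Arrays.copyOf(stop, i + 1);
            }
        }
        // Empty prefix or all bytes 0xFF: no upper bound, scan to the end.
        return new byte[0];
    }

    public static void main(String[] args) {
        // Hypothetical model number "M100" followed by the assumed separator.
        String prefix = "M100|";
        System.out.println(new String(startRow(prefix), StandardCharsets.UTF_8));
        System.out.println(new String(stopRow(prefix), StandardCharsets.UTF_8));
    }
}
```

Counting the rows between these two bounds would give the number of components
for one model without evaluating a regex against every rowkey in the table.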

I will evaluate salting as you have explained.
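
As a starting point for that evaluation, here is a minimal sketch of the
salting idea from the quoted message: a short hex prefix derived from a hash
of the rowkey is prepended to it. The bucket count of 16, the use of
String.hashCode() instead of MD5, the sample rowkey and all class and method
names are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

class SaltedRowKey {
    // Hypothetical number of salt buckets; in practice this would be chosen
    // with the number of regions or region servers in mind.
    static final int BUCKETS = 16;

    // Three-digit hex prefix for a bucket, using the "%03x" format
    // mentioned in the thread.
    static String saltPrefix(int bucket) {
        return String.format("%03x", bucket);
    }

    // Prepends the salt prefix to the original rowkey. String.hashCode() is
    // used here only for brevity; it stands in for whichever hash is chosen.
    static String salt(String rowKey) {
        int bucket = Math.floorMod(rowKey.hashCode(), BUCKETS);
        return saltPrefix(bucket) + rowKey;
    }

    // All possible salt prefixes, in sorted order.
    static List<String> bucketPrefixes() {
        List<String> prefixes = new ArrayList<>();
        for (int b = 0; b < BUCKETS; b++) {
            prefixes.add(saltPrefix(b));
        }
        return prefixes;
    }

    public static void main(String[] args) {
        // Hypothetical rowkey: modelNumber|serialNumber|timestamp.
        System.out.println(salt("M100|S42|1342536311"));
        System.out.println(bucketPrefixes());
    }
}
```

One consequence worth noting: because the salt is derived from the whole
rowkey, rows of one model number end up spread over all buckets, so counting
them would take one prefix scan per entry of bucketPrefixes(). The same list,
without its first entry, could also serve as candidate split points when
pre-splitting the table.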

Regards,
Anand.C

On Tue, Jul 17, 2012 at 12:30 AM, Cristofer Weber <
[email protected]> wrote:

> Hi Anand,
>
> As usual, the answer is that 'it depends'  :)
>
> I think that the main question here is: why are you afraid that this setup
> would lead to region server hotspotting? Is it because you don't know what
> your production data will look like?
>
> Based on what you said about your rowkey, you will query mostly by
> providing model no. + serial no., but:
> 1 - How is your rowkey distributed? Are there tons of different
> modelNumbers AND serialNumbers? Few modelNumbers and a lot of
> serialNumbers? Few of both?
> 2 - Putting modelNumber at the front of your rowkey means that your data
> will be sorted by rowkey, modelNumber first. So, what is the rule that
> determines how a modelNumber is created? Is it a sequential number that
> increases over time? If so, are newer members accessed a lot more than
> older members? If not, what drives this number? Is it an encoding rule?
> 3 - Do you expect more write/read load on a few of these modelNumbers
> and/or serialNumbers? Will it be similar to a Pareto distribution?
> Distributed over what?
>
> Also, two other things caught my attention here...
> 1 - Why are you filtering with regex? If your queries are over model no. +
> serial no., why don't you just scan starting at your
> modelNumber+SerialNumber and stopping at your next
> modelNumber+SerialNumber? Or is there another access pattern that doesn't
> fit your composite rowkey?
> 2 - Why do you have to add a timestamp to ensure uniqueness?
>
> Now, answering your question without more info about your data, you can
> apply hashing in two ways:
> 1 - Generating a hash (MD5 is the most common, as far as I have read) and
> using only this hash as your rowkey. Based on what you have told us, this
> approach doesn't fit your needs, because you would not be able to apply
> your filter anymore.
> 2 - Salting, by prefixing your current rowkey with a pinch of hash. Notice
> that the hash portion must be your rowkey prefix to ensure a reasonably
> balanced distribution over something (where something is your region
> servers). I'm working on a case that is a bit similar to yours, and what
> I'm doing right now is calculating the hashValue of my rowkey and using a
> Java Formatter to create a hex string to prepend to my rowkey. Something
> like String.format("%03x", hashValue).
>
> In both cases, you still have to split your regions in advance, and it
> is better to work out your splits before you start feeding your table
> with production data.
>
> Also, you have to study the consequences that changing your rowkey will
> bring. It's not for free.
>
> There are a lot of words and a lot of questions here, so by now I feel I
> have started shooting in the dark. Try to understand your production data,
> and if you have more to share, it will certainly help!
>
> Regards,
> Cristofer
>
> -----Original Message-----
> From: AnandaVelMurugan Chandra Mohan [mailto:[email protected]]
> Sent: Monday, July 16, 2012 02:30
> To: [email protected]
> Subject: Rowkey hashing to avoid hotspotting
>
> Hi,
>
> I am using HBase to store data about mechanical components. Each component
> has a model no., a serial no. and some other attributes.
>
> I will be querying my data mostly by model no. and serial no., so I
> created a composite key from these two attributes and added a timestamp to
> make it unique.
>
> To filter the data, I use a rowkey filter with a regex string comparator,
> and it works well with sample seed data. Now I am afraid that this setup
> will lead to region server hotspotting when we load production data into
> HBase. I read that hashing may solve this problem. Can someone help me
> implement hashing of the row key? I would also want the row filter to keep
> working, as I have to display the number of components on a web page and I
> use the row key filter to implement that functionality. Any guidance would
> be of great help.
>
> --
> Regards,
> Anand
>



-- 
Regards,
Anand
