Re: Rowkey hashing to avoid hotspotting

Alex Baranau Tue, 17 Jul 2012 11:50:22 -0700

You might be right: when read load is concentrated on one or a few RSs,
they will not look as dead as they do when hotspotted by writes. I think
what I really meant by "hotspotting while reading" was "uneven read load
distribution".


Caches will help for sure, but that might not be enough. Having one or a
few RSs sweating more than the others in a cluster is already not a very
desirable situation. Also, it may be that it's not a specific set of
records within Regions on an RS (read as "data blocks") that is under load,
but whole regions that for some reason hold more hot data (as in the
example above: with keys prefixed with model, several whole regions
containing data of the same model may hold data that is frequently accessed).
In this case HBase (depending on hardware) may not be able to fit all that
data in cache on the one hot RS (or the few hot RSs). Compare that with the
situation where the hot data is distributed over many more RSs (which then
act like a distributed cache), e.g. with salting.
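
To make the salting idea concrete, here is a minimal sketch in plain Java
(no HBase client involved). The bucket count, the "|" separator, the hash
function and the sample key layout are all assumptions made for
illustration; they are not taken from HBaseWD or from anyone's schema in
this thread. It shows how a salted key could be built, and that a
partial-key scan on the original prefix turns into one scan per bucket.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SaltedRowKey {
    // Assumption: 16 salt buckets, e.g. one per pre-split region.
    public static final int BUCKETS = 16;

    // Stable bucket derived from the bytes of the original key
    // (a simple polynomial hash, chosen only for illustration).
    public static int bucketOf(String originalKey) {
        int h = 0;
        for (byte b : originalKey.getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + (b & 0xff);
        }
        return Math.floorMod(h, BUCKETS);
    }

    // Salted key = two hex digits of the bucket + "|" + original key.
    public static String salt(String originalKey) {
        return String.format("%02x|%s", bucketOf(originalKey), originalKey);
    }

    // Rows sharing an original prefix are spread over all buckets, so a
    // partial-key scan needs one start prefix per bucket.
    public static List<String> scanPrefixes(String keyPrefix) {
        List<String> prefixes = new ArrayList<>();
        for (int b = 0; b < BUCKETS; b++) {
            prefixes.add(String.format("%02x|%s", b, keyPrefix));
        }
        return prefixes;
    }

    public static void main(String[] args) {
        // Hypothetical key layout: model, serial number, timestamp.
        System.out.println(salt("MERC-SN0001-1342500000"));
        System.out.println(scanPrefixes("MERC-"));
    }
}
```

The price is visible in scanPrefixes: reading everything for one model
means issuing a scan per bucket and merging the results on the client.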

In general, yes, you will not see issues as big with uneven *read* load
distribution over the cluster as you might see with uneven *write* load
distribution.
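
For the write side, a toy simulation may help show why monotonically
increasing keys are the bad case. This is only a sketch under made-up
assumptions: eight regions split at the single-character boundaries "1"
to "7", keys that are plain second-resolution timestamps, and a salt
taken from the timestamp's hash. It models nothing of HBase beyond
"a row goes to the region whose sorted key range contains it".

```java
import java.util.Arrays;

public class WriteSpread {
    // Assumption: 8 regions, pre-split at these row-key boundaries.
    static final String[] SPLITS = {"1", "2", "3", "4", "5", "6", "7"};
    static final int BUCKETS = SPLITS.length + 1;

    // Region index = number of split points that are <= rowKey.
    public static int regionOf(String rowKey) {
        int idx = Arrays.binarySearch(SPLITS, rowKey);
        return idx >= 0 ? idx + 1 : -idx - 1;
    }

    // Count how many of `writes` consecutive timestamp keys land in each region.
    public static int[] countPerRegion(boolean salted, int writes) {
        int[] counts = new int[BUCKETS];
        long start = 1342500000L; // arbitrary starting timestamp
        for (long ts = start; ts < start + writes; ts++) {
            String key = String.valueOf(ts);
            if (salted) {
                // Prefix with a bucket digit derived from the key itself.
                key = Math.floorMod(Long.hashCode(ts), BUCKETS) + "-" + key;
            }
            counts[regionOf(key)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("plain:  " + Arrays.toString(countPerRegion(false, 800)));
        System.out.println("salted: " + Arrays.toString(countPerRegion(true, 800)));
    }
}
```

With these assumptions all 800 plain keys fall into a single region, while
the salted keys come out at 100 per region.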

Alex Baranau
------
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
Solr

On Tue, Jul 17, 2012 at 12:44 PM, Michel Segel <[email protected]> wrote:

> Reading hot spotting?
> Hmmm there's a cache and I don't see any real use cases where you would
> have it occur naturally.
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Jul 17, 2012, at 10:53 AM, Alex Baranau <[email protected]>
> wrote:
>
> > The most common reason for RS hotspotting during writes in HBase is
> > writing rows with monotonically increasing/decreasing row keys. E.g. if you
> > put a timestamp in the first part of your key, then you are likely to have
> > monotonically increasing row keys. You can find more info about this issue
> > and how to solve it here: [1], and you may also want to look at an already
> > implemented salting solution [2].
> >
> > As for RS hotspotting during reads - it is hard to predict without
> > knowing what the most common data access patterns are. E.g. putting model #
> > in the first part of a key may seem to give a good distribution, but if your
> > web site is used mostly by Mercedes owners, the majority of the read load
> > may be directed to just a few regions. Again, salting can help a lot here.
> >
> > +1 to what Cristofer said on other things, esp.: use partial key scans where
> > possible instead of filters, and pre-split your table.
> >
> > Alex Baranau
> > ------
> > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> > Solr
> >
> > [1] http://bit.ly/HnKjbc
> > [2] https://github.com/sematext/HBaseWD
> >
> > On Tue, Jul 17, 2012 at 10:44 AM, AnandaVelMurugan Chandra Mohan <
> > [email protected]> wrote:
> >
> >> Hi Cristofer,
> >>
> >> Thanks for the elaborate response!
> >>
> >> I don't have much information about the production data, as I work with
> >> partial data. But based on discussions with my project partners, I have
> >> some answers for you.
> >>
> >> The number of model numbers and serial numbers will be finite. Not so many...
> >> As far as I know, there is no predefined rule for model number or serial
> >> number creation.
> >>
> >> I have two access patterns. I count the number of rows for a specific model
> >> number. I use a rowkey filter for this. I also filter the rows based on
> >> model, serial number and some other columns. I scan the table with a column
> >> value filter for this case.
> >>
> >> I will evaluate salting as you have explained.
> >>
> >> Regards,
> >> Anand.C
> >>
> >> On Tue, Jul 17, 2012 at 12:30 AM, Cristofer Weber <
> >> [email protected]> wrote:
> >>
> >>> Hi Anand,
> >>>
> >>> As usual, the answer is that 'it depends'  :)
> >>>
> >>> I think that the main question here is: why are you afraid that this setup
> >>> would lead to region server hotspotting? Is it because you don't know what
> >>> your production data will look like?
> >>>
> >>> Based on what you told us about your rowkey, you will query mostly by
> >>> providing model no. + serial no., but:
> >>> 1 - How is your rowkey distributed? Are there tons of different
> >>> modelNumbers AND serialNumbers? Few modelNumbers and a lot of
> >>> serialNumbers? Few of both?
> >>> 2 - Putting modelNumber at the front of your rowkey means that your data
> >>> will be sorted by it, since rows are sorted by rowkey. So, what is the rule
> >>> that determines how a modelNumber is created? Is it a sequential number
> >>> that will increase over time? If so, are newer members accessed a lot more
> >>> than older members? If not, what will drive this number? Is it an encoding
> >>> rule?
> >>> 3 - Do you expect more write/read load over a few of these modelNumbers
> >>> and/or serialNumbers? Will it be similar to a Pareto distribution?
> >>> Distributed over what?
> >>>
> >>> Also, two other things got my attention here...
> >>> 1 - Why are you filtering with regex? If your queries are over model no. +
> >>> serial no., why don't you just scan starting at your
> >>> modelNumber+SerialNumber and stopping at your next
> >>> modelNumber+SerialNumber? Or is there another access pattern that doesn't
> >>> fit your composite rowkey?
> >>> 2 - Why do you have to add a timestamp to ensure uniqueness?
> >>>
> >>> Now, answering your question without more info about your data, you can
> >>> apply a hash in two ways:
> >>> 1 - Generating a hash (MD5 is the most common as far as I have read) and
> >>> using only this hash as your rowkey. Based on what you have told us, this
> >>> way doesn't fit your needs, because you would not be able to apply your
> >>> filter anymore.
> >>> 2 - Salting, by prefixing your current rowkey with a pinch of hash. Notice
> >>> that the hash portion must be your rowkey prefix to ensure a kind of
> >>> balanced distribution over something (where something is your region
> >>> servers). I'm working with a case that is a bit similar to yours, and what
> >>> I'm doing right now is calculating the hashValue of my rowkey and using a
> >>> Java Formatter to create a hex string to prepend to my rowkey. Something
> >>> like String.format("%03x", hashValue)
> >>>
> >>> In both cases, you still have to split your regions in advance, and it
> >>> will be better to work out your splitting before starting to feed your
> >>> table with production data.
> >>>
> >>> Also, you have to study the consequences that changing your rowkey will
> >>> bring. It's not for free.
> >>>
> >>> There are a lot of words here and a lot of questions, so by now I feel
> >>> I've started to shoot in the dark. Try to understand your production data,
> >>> and if you have more to share, for sure it will help!
> >>>
> >>> Regards,
> >>> Cristofer
> >>>
> >>> -----Original Message-----
> >>> From: AnandaVelMurugan Chandra Mohan [mailto:[email protected]]
> >>> Sent: Monday, July 16, 2012 02:30
> >>> To: [email protected]
> >>> Subject: Rowkey hashing to avoid hotspotting
> >>>
> >>> Hi,
> >>>
> >>> I am using HBase to store data about mechanical components. Each component
> >>> has a model no. and serial no. and some other attributes.
> >>>
> >>> I would be querying my data mostly by model no. and serial no. So I
> >>> created a composite key with these two attributes and added a timestamp to
> >>> make it unique.
> >>>
> >>> To filter the data, I use a rowkey filter with a regex string comparator,
> >>> and it works well with sample seed data. Now I am afraid that this setup
> >>> will lead to region server hotspotting when we load production data into
> >>> HBase. I read that hashing may solve this problem. Can someone help me
> >>> implement hashing of the row key? Also, I would want the row filter to keep
> >>> working, as I have to display the number of components in a web page and I
> >>> use a row key filter to implement that functionality. Any guidance would be
> >>> of great help.
> >>>
> >>> --
> >>> Regards,
> >>> Anand
> >>>
> >>
> >>
> >>
> >> --
> >> Regards,
> >> Anand
> >>
> >
> >
> >
> > --
> > Alex Baranau
> > ------
> > Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> > Solr
>



