Hi Eric,

Yes, you can split an existing region. You can do that easily with the web interface. After the split, at some point, one of the 2 regions will be moved to another server to balance the load. You can also move it manually.
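(For anyone finding this thread in the archives: the same split can also be requested from the HBase shell with `split 'tablename', 'splitkey'`, and `move` relocates a region by hand.) Conceptually, a split just cuts one key range in two. A minimal Python sketch of that bookkeeping, with invented keys and no real HBase API calls:

```python
# A region is modeled as a half-open key range (start, end);
# end=None stands for "up to the end of the table".
def split_region(region, split_key):
    """Divide one region into two daughters at split_key,
    the way a manual split would."""
    start, end = region
    if not (start < split_key and (end is None or split_key < end)):
        raise ValueError("split key must fall inside the region")
    return (start, split_key), (split_key, end)

# Splitting the single initial region at key "25":
print(split_region(("", None), "25"))  # (('', '25'), ('25', None))
```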
JM

2012/9/4, Eric Czech <[email protected]>:
> Thanks again, both of you.
>
> I'll look at pre-splitting the regions so that there isn't so much
> initial contention. The issue I'll have though is that I won't know all
> the prefix values at first and will have to be able to add them later.
>
> Is it possible to split regions on an existing table? Or is that
> inadvisable in favor of doing the splits when the table is created?
>
> On Mon, Sep 3, 2012 at 5:19 PM, Mohit Anchlia
> <[email protected]> wrote:
>
>> You can also look at pre-splitting the regions for timeseries type
>> data.
>>
>> On Mon, Sep 3, 2012 at 1:11 PM, Jean-Marc Spaggiari
>> <[email protected]> wrote:
>>
>> > Initially your table will contain only one region.
>> >
>> > When you reach its maximum size, it will split into 2 regions which
>> > are going to be distributed over the cluster.
>> >
>> > The 2 regions are going to be ordered by keys. So all entries
>> > starting with 1 will be on the first region, and the middle key
>> > (let's say 25......) will start the 2nd region.
>> >
>> > So region 1 will contain 1 to 24999, and the 2nd region will
>> > contain keys from 25 on.
>> >
>> > And so on.
>> >
>> > Since keys are ordered, all keys starting with a 1 are going to be
>> > close by on the same region, except if the region is big enough to
>> > be split and served by more region servers.
>> >
>> > So when you load all your entries starting with 1, or 3, they will
>> > go to one unique region. Only entries starting with 2 are going to
>> > be sometimes on region 1, sometimes on region 2.
>> >
>> > Of course, the more data you load, the more regions you will have,
>> > and the less hot-spotting you will have. But at the beginning, it
>> > might be difficult for some of your servers.
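To illustrate the ordering JM describes: after a split at a middle key like 25, a row key is routed to the region whose start key is the greatest one less than or equal to it, so every key beginning with 1 stays on the first region. A small sketch with invented region boundaries:

```python
import bisect

# Region start keys after the first split (lexicographic, as HBase orders keys)
region_starts = ["", "25"]

def region_for(row_key):
    """Index of the region holding row_key: greatest start key <= row_key."""
    return bisect.bisect_right(region_starts, row_key) - 1

assert region_for("1AAAA") == 0   # everything starting with 1 -> first region
assert region_for("24999") == 0
assert region_for("25000") == 1   # middle key onward -> second region
assert region_for("3ZZZZ") == 1
```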
>> >
>> >
>> > 2012/9/3, Eric Czech <[email protected]>:
>> > > With regards to:
>> > >
>> > >> If you have 3 region servers and your data is evenly distributed,
>> > >> that means all the data starting with a 1 will be on server 1,
>> > >> and so on.
>> > >
>> > > Assuming there are multiple regions in existence for each prefix,
>> > > why would they not be distributed across all the machines?
>> > >
>> > > In other words, if there are many regions with keys that generally
>> > > start with 1, why would they ALL be on server 1 like you said?
>> > > It's my understanding that the regions aren't placed around the
>> > > cluster according to the range of information they contain, so I'm
>> > > not quite following that explanation.
>> > >
>> > > Putting the higher cardinality values in front of the key isn't
>> > > entirely out of the question, but I'd like to use the low
>> > > cardinality key out front for the sake of selecting rows for
>> > > MapReduce jobs. Otherwise, I always have to scan the full table
>> > > for each job.
>> > >
>> > > On Mon, Sep 3, 2012 at 3:20 PM, Jean-Marc Spaggiari
>> > > <[email protected]> wrote:
>> > >> Yes, you're right, but again, it will depend on the number of
>> > >> region servers and the distribution of your data.
>> > >>
>> > >> If you have 3 region servers and your data is evenly distributed,
>> > >> that means all the data starting with a 1 will be on server 1,
>> > >> and so on.
>> > >>
>> > >> So if you write a million lines starting with a 1, they will all
>> > >> land on the same server.
>> > >>
>> > >> Of course, you can pre-split your table, like 1a to 1z, and
>> > >> assign each region to one of your 3 servers. That way you will
>> > >> avoid hot-spotting even if you write millions of lines starting
>> > >> with a 1.
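A rough sketch of the pre-split JM suggests: pick split keys inside a hot prefix (1a ... 1z style) up front, then spread the resulting regions over the servers round-robin, so writes with that prefix hit every server. The names and boundaries below are invented for illustration:

```python
from string import ascii_lowercase

def presplit_keys(prefix, parts):
    """Evenly spaced split keys inside one prefix, e.g. '1' + a..z."""
    step = len(ascii_lowercase) // parts
    return [prefix + ascii_lowercase[i * step] for i in range(1, parts)]

def assign_round_robin(regions, n_servers):
    """Spread regions across servers so one prefix is served by all of them."""
    return {region: i % n_servers for i, region in enumerate(regions)}

splits = presplit_keys("1", 3)   # two split keys -> 3 regions for prefix 1
regions = [("1", splits[0]), (splits[0], splits[1]), (splits[1], "2")]
# Each of the 3 servers gets one of the prefix-1 regions:
print(assign_round_robin(regions, 3))
```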
>> > >>
>> > >> If you have one hundred regions, you will face the same issue at
>> > >> the beginning, but the more data you add, the more your table
>> > >> will be split across all the servers and the less hot-spotting
>> > >> you will have.
>> > >>
>> > >> Can't you just reverse your fields and put the 1 to 30 at the end
>> > >> of the key?
>> > >>
>> > >> 2012/9/3, Eric Czech <[email protected]>:
>> > >>> Thanks for the response Jean-Marc!
>> > >>>
>> > >>> I understand what you're saying, but in a more extreme case,
>> > >>> let's say I'm choosing the leading number on the range 1 - 3
>> > >>> instead of 1 - 30. In that case, it seems like all of the data
>> > >>> for any one prefix would already be split well across the
>> > >>> cluster, and as long as the second value isn't written
>> > >>> sequentially, there wouldn't be an issue.
>> > >>>
>> > >>> Is my reasoning there flawed at all?
>> > >>>
>> > >>> On Mon, Sep 3, 2012 at 2:31 PM, Jean-Marc Spaggiari
>> > >>> <[email protected]> wrote:
>> > >>>> Hi Eric,
>> > >>>>
>> > >>>> In HBase, data is stored sequentially based on the key's
>> > >>>> alphabetical order.
>> > >>>>
>> > >>>> It will depend on the number of regions and region servers you
>> > >>>> have, but if you write data from 23AAAAAA to 23ZZZZZZ, they
>> > >>>> will most probably go to the same region even if the
>> > >>>> cardinality of the 2nd part of the key is high.
>> > >>>>
>> > >>>> If the first number is always changing between 1 and 30 for
>> > >>>> each write, then you will reach multiple regions/servers;
>> > >>>> else, you might have some hot-spotting.
>> > >>>>
>> > >>>> JM
>> > >>>>
>> > >>>> 2012/9/3, Eric Czech <[email protected]>:
>> > >>>>> Hi everyone,
>> > >>>>>
>> > >>>>> I was curious whether or not I should expect any write hot
>> > >>>>> spots if I structured my composite keys in a way such that the
>> > >>>>> first field is a low cardinality (maybe 30 distinct values)
>> > >>>>> value and the next field contains a very high cardinality
>> > >>>>> value that would not be written sequentially.
>> > >>>>>
>> > >>>>> More concisely, I want to do this:
>> > >>>>>
>> > >>>>> Given one number between 1 and 30, write many millions of rows
>> > >>>>> with keys like <number chosen> : <some generally distinct,
>> > >>>>> non-sequential value>
>> > >>>>>
>> > >>>>> Would there be any problem with the millions of writes
>> > >>>>> happening with the same first field key prefix even if the
>> > >>>>> second field is largely unique?
>> > >>>>>
>> > >>>>> Thank you!
>> > >>>>>
>> > >>>
>> > >
>> >
>>
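Putting the thread's two options side by side: a low-cardinality prefix first gives easy prefix scans for MapReduce, but a burst of writes with one prefix lands on a single region, while the reversed key suggested above spreads the writes and gives up the prefix scan. A toy sketch with invented boundaries and values:

```python
import bisect
from collections import Counter

region_starts = ["", "1", "2", "3"]      # toy table pre-split on the first byte

def region_for(key):
    """Region index: greatest start key <= key (lexicographic order)."""
    return bisect.bisect_right(region_starts, key) - 1

suffixes = ["0a1", "1f4", "2k9", "3p2"]  # high-cardinality, non-sequential part

# Prefix first: a burst of writes for prefix "2" all hits one region.
prefix_first = Counter(region_for("2:" + s) for s in suffixes)

# Reversed key: the same writes spread across the regions.
reversed_key = Counter(region_for(s + ":2") for s in suffixes)

print(prefix_first)   # every write lands in the region holding "2"
print(reversed_key)   # one write per region
```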
