Hi Eric,

Yes, you can split an existing region. You can do that easily with the web interface. After the split, at some point, one of the 2 regions will be moved to another server to balance the load. You can also move it manually.
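(For anyone finding this thread in the archives: the same split can also be requested from the HBase shell with `split 'tablename', 'splitkey'`, and `move` relocates a region by hand.) Conceptually, a split just cuts one key range in two. A minimal Python sketch of that bookkeeping, with invented keys and no real HBase API calls:

```python
# A region is modeled as a half-open key range (start, end);
# end=None stands for "up to the end of the table".
def split_region(region, split_key):
    """Divide one region into two daughters at split_key,
    the way a manual split would."""
    start, end = region
    if not (start < split_key and (end is None or split_key < end)):
        raise ValueError("split key must fall inside the region")
    return (start, split_key), (split_key, end)

# Splitting the single initial region at key "25":
print(split_region(("", None), "25"))  # (('', '25'), ('25', None))
```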
JM

2012/9/4, Eric Czech <[email protected]>:
> Thanks again, both of you.
>
> I'll look at pre-splitting the regions so that there isn't so much
> initial contention. The issue I'll have though is that I won't know all
> the prefix values at first and will have to be able to add them later.
>
> Is it possible to split regions on an existing table? Or is that
> inadvisable in favor of doing the splits when the table is created?
>
> On Mon, Sep 3, 2012 at 5:19 PM, Mohit Anchlia
> <[email protected]> wrote:
>
>> You can also look at pre-splitting the regions for timeseries type
>> data.
>>
>> On Mon, Sep 3, 2012 at 1:11 PM, Jean-Marc Spaggiari
>> <[email protected]> wrote:
>>
>> > Initially your table will contain only one region.
>> >
>> > When you reach its maximum size, it will split into 2 regions which
>> > are going to be distributed over the cluster.
>> >
>> > The 2 regions are going to be ordered by keys. So all entries
>> > starting with 1 will be on the first region, and the middle key
>> > (let's say 25......) will start the 2nd region.
>> >
>> > So region 1 will contain 1 to 24999, and the 2nd region will
>> > contain keys from 25 on.
>> >
>> > And so on.
>> >
>> > Since keys are ordered, all keys starting with a 1 are going to be
>> > close by on the same region, except if the region is big enough to
>> > be split and served by more region servers.
>> >
>> > So when you load all your entries starting with 1, or 3, they will
>> > go to one unique region. Only entries starting with 2 are going to
>> > be sometimes on region 1, sometimes on region 2.
>> >
>> > Of course, the more data you load, the more regions you will have,
>> > and the less hot-spotting you will have. But at the beginning, it
>> > might be difficult for some of your servers.
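To illustrate the ordering JM describes: after a split at a middle key like 25, a row key is routed to the region whose start key is the greatest one less than or equal to it, so every key beginning with 1 stays on the first region. A small sketch with invented region boundaries:

```python
import bisect

# Region start keys after the first split (lexicographic, as HBase orders keys)
region_starts = ["", "25"]

def region_for(row_key):
    """Index of the region holding row_key: greatest start key <= row_key."""
    return bisect.bisect_right(region_starts, row_key) - 1

assert region_for("1AAAA") == 0   # everything starting with 1 -> first region
assert region_for("24999") == 0
assert region_for("25000") == 1   # middle key onward -> second region
assert region_for("3ZZZZ") == 1
```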
>> >
>> >
>> > 2012/9/3, Eric Czech <[email protected]>:
>> > > With regards to:
>> > >
>> > >> If you have 3 region servers and your data is evenly distributed,
>> > >> that means all the data starting with a 1 will be on server 1,
>> > >> and so on.
>> > >
>> > > Assuming there are multiple regions in existence for each prefix,
>> > > why would they not be distributed across all the machines?
>> > >
>> > > In other words, if there are many regions with keys that generally
>> > > start with 1, why would they ALL be on server 1 like you said?
>> > > It's my understanding that the regions aren't placed around the
>> > > cluster according to the range of information they contain, so I'm
>> > > not quite following that explanation.
>> > >
>> > > Putting the higher cardinality values in front of the key isn't
>> > > entirely out of the question, but I'd like to use the low
>> > > cardinality key out front for the sake of selecting rows for
>> > > MapReduce jobs. Otherwise, I always have to scan the full table
>> > > for each job.
>> > >
>> > > On Mon, Sep 3, 2012 at 3:20 PM, Jean-Marc Spaggiari
>> > > <[email protected]> wrote:
>> > >> Yes, you're right, but again, it will depend on the number of
>> > >> region servers and the distribution of your data.
>> > >>
>> > >> If you have 3 region servers and your data is evenly distributed,
>> > >> that means all the data starting with a 1 will be on server 1,
>> > >> and so on.
>> > >>
>> > >> So if you write a million lines starting with a 1, they will all
>> > >> land on the same server.
>> > >>
>> > >> Of course, you can pre-split your table, like 1a to 1z, and
>> > >> assign each region to one of your 3 servers. That way you will
>> > >> avoid hot-spotting even if you write millions of lines starting
>> > >> with a 1.
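A rough sketch of the pre-split JM suggests: pick split keys inside a hot prefix (1a ... 1z style) up front, then spread the resulting regions over the servers round-robin, so writes with that prefix hit every server. The names and boundaries below are invented for illustration:

```python
from string import ascii_lowercase

def presplit_keys(prefix, parts):
    """Evenly spaced split keys inside one prefix, e.g. '1' + a..z."""
    step = len(ascii_lowercase) // parts
    return [prefix + ascii_lowercase[i * step] for i in range(1, parts)]

def assign_round_robin(regions, n_servers):
    """Spread regions across servers so one prefix is served by all of them."""
    return {region: i % n_servers for i, region in enumerate(regions)}

splits = presplit_keys("1", 3)   # two split keys -> 3 regions for prefix 1
regions = [("1", splits[0]), (splits[0], splits[1]), (splits[1], "2")]
# Each of the 3 servers gets one of the prefix-1 regions:
print(assign_round_robin(regions, 3))
```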
>> > >>
>> > >> If you have one hundred regions, you will face the same issue at
>> > >> the beginning, but the more data you add, the more your table
>> > >> will be split across all the servers and the less hot-spotting
>> > >> you will have.
>> > >>
>> > >> Can't you just reverse your fields and put the 1 to 30 at the end
>> > >> of the key?
>> > >>
>> > >> 2012/9/3, Eric Czech <[email protected]>:
>> > >>> Thanks for the response Jean-Marc!
>> > >>>
>> > >>> I understand what you're saying, but in a more extreme case,
>> > >>> let's say I'm choosing the leading number on the range 1 - 3
>> > >>> instead of 1 - 30. In that case, it seems like all of the data
>> > >>> for any one prefix would already be split well across the
>> > >>> cluster, and as long as the second value isn't written
>> > >>> sequentially, there wouldn't be an issue.
>> > >>>
>> > >>> Is my reasoning there flawed at all?
>> > >>>
>> > >>> On Mon, Sep 3, 2012 at 2:31 PM, Jean-Marc Spaggiari
>> > >>> <[email protected]> wrote:
>> > >>>> Hi Eric,
>> > >>>>
>> > >>>> In HBase, data is stored sequentially based on the key's
>> > >>>> alphabetical order.
>> > >>>>
>> > >>>> It will depend on the number of regions and region servers you
>> > >>>> have, but if you write data from 23AAAAAA to 23ZZZZZZ, they
>> > >>>> will most probably go to the same region even if the
>> > >>>> cardinality of the 2nd part of the key is high.
>> > >>>>
>> > >>>> If the first number is always changing between 1 and 30 for
>> > >>>> each write, then you will reach multiple regions/servers;
>> > >>>> else, you might have some hot-spotting.
>> > >>>>
>> > >>>> JM
>> > >>>>
>> > >>>> 2012/9/3, Eric Czech <[email protected]>:
>> > >>>>> Hi everyone,
>> > >>>>>
>> > >>>>> I was curious whether or not I should expect any write hot
>> > >>>>> spots if I structured my composite keys in a way such that the
>> > >>>>> first field is a low cardinality (maybe 30 distinct values)
>> > >>>>> value and the next field contains a very high cardinality
>> > >>>>> value that would not be written sequentially.
>> > >>>>>
>> > >>>>> More concisely, I want to do this:
>> > >>>>>
>> > >>>>> Given one number between 1 and 30, write many millions of rows
>> > >>>>> with keys like <number chosen> : <some generally distinct,
>> > >>>>> non-sequential value>
>> > >>>>>
>> > >>>>> Would there be any problem with the millions of writes
>> > >>>>> happening with the same first field key prefix even if the
>> > >>>>> second field is largely unique?
>> > >>>>>
>> > >>>>> Thank you!
>> > >>>>>
>> > >>>
>> > >
>> >
>>
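Putting the thread's two options side by side: a low-cardinality prefix first gives easy prefix scans for MapReduce, but a burst of writes with one prefix lands on a single region, while the reversed key suggested above spreads the writes and gives up the prefix scan. A toy sketch with invented boundaries and values:

```python
import bisect
from collections import Counter

region_starts = ["", "1", "2", "3"]      # toy table pre-split on the first byte

def region_for(key):
    """Region index: greatest start key <= key (lexicographic order)."""
    return bisect.bisect_right(region_starts, key) - 1

suffixes = ["0a1", "1f4", "2k9", "3p2"]  # high-cardinality, non-sequential part

# Prefix first: a burst of writes for prefix "2" all hits one region.
prefix_first = Counter(region_for("2:" + s) for s in suffixes)

# Reversed key: the same writes spread across the regions.
reversed_key = Counter(region_for(s + ":2") for s in suffixes)

print(prefix_first)   # every write lands in the region holding "2"
print(reversed_key)   # one write per region
```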
