On Mon, Oct 24, 2011 at 9:22 PM, Karthik Ranganathan <[email protected]> wrote:
>
> << ...mod the hash with the number of machines I have... >>
> This means that the data will change with the number of machines - so all
> your data will map to different regions if you add a new machine to your
> cluster.
>
> << What I do not understand is the advantages/disadvantages of having
> regions that are too big vs regions that are too thin. >>
> The disadvantage is that some regions (and consequently nodes) will have a
> lot of data, which will adversely affect things like storage (if dfs is
> local to that node), block cache hit ratio, etc.
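[Editorial note: Karthik's point about the mod changing with cluster size can be sketched in a few lines of plain Java. This is a hypothetical illustration, not code from the thread; the key names and counts are made up.]

```java
import java.util.ArrayList;
import java.util.List;

public class ModRemapDemo {
    // Which bucket (region prefix) a key lands in when we mod by the cluster size.
    static int bucket(String key, int numMachines) {
        return Math.abs(key.hashCode() % numMachines);
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) keys.add("row-" + i);

        // Count how many keys change buckets when the cluster grows from 4 to 5 machines.
        int moved = 0;
        for (String k : keys) {
            if (bucket(k, 4) != bucket(k, 5)) moved++;
        }
        System.out.println(moved + " of " + keys.size() + " keys map to a different prefix");
    }
}
```

For roughly uniform hashes, most keys change buckets when the divisor changes, which is exactly the "all your data will map to different regions" problem: the prefix stored in the row key no longer matches the one the new formula computes.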
Can you please explain a bit more on how a bigger region size will affect
the block cache hit ratio?

> In general - per our experience using HBase, it's much more desirable to
> disperse data up-front. If you are building indexes using MR, then you
> probably don't need range scan ability on your keys.
>
> Thanks
> Karthik
>
>
> On 10/24/11 4:48 PM, "Sam Seigal" <[email protected]> wrote:
>
>>According to my understanding, the way that HBase works is that on a
>>brand new system, all keys will start going to a single region, i.e. a
>>single region server. Once that region reaches a max region size, it
>>will split and then move to another region server, and so on and so
>>forth.
>>
>>Initially hooking up HBase to a prod system, I am concerned about this
>>behaviour, since a clean HBase cluster is going to experience a surge
>>of traffic all going into one region server initially. This is the
>>motivation behind pre-defining the regions, so the initial surge of
>>traffic is distributed evenly.
>>
>>My strategy is to take the incoming data, calculate the hash and then
>>mod the hash with the number of machines I have. I will split the
>>regions according to the prefix #. This should, I think, provide for
>>better data distribution when the cluster first starts up with one
>>region / region server.
>>
>>These regions should then grow fairly uniformly. Once they reach a
>>size like ~5G, I can do a rolling split.
>>
>>Also, I want to make sure my regions do not grow so much in size that
>>when I end up adding more machines, it does not take a very long time
>>to perform a rolling split.
>>
>>What I do not understand is the advantages/disadvantages of having
>>regions that are too big vs regions that are too thin. What does this
>>impact? Compaction time? Split time? What is the concern when it comes
>>to how the architecture works? I think if I understand this better, I
>>can manage my regions more efficiently.
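[Editorial note: the salting scheme Sam describes - prefix each row key with hash(key) mod N, then pre-split on the prefix boundaries - can be sketched as below. The bucket count, key format, and helper names are assumptions for illustration; only the mod-the-hash idea comes from the thread.]

```java
public class SaltedKeys {
    // Assumption: one initial region per bucket; Sam's N is the machine count.
    static final int NUM_BUCKETS = 8;

    // Build the salted row key: "<prefix>-<originalKey>".
    static String saltedKey(String originalKey) {
        int prefix = Math.abs(originalKey.hashCode() % NUM_BUCKETS);
        return prefix + "-" + originalKey;
    }

    // Split points for table creation: regions cover [, "1-"), ["1-", "2-"),
    // ... ["7-", ) so each prefix gets its own initial region.
    static String[] splitPoints() {
        String[] splits = new String[NUM_BUCKETS - 1];
        for (int i = 1; i < NUM_BUCKETS; i++) splits[i - 1] = i + "-";
        return splits;
    }

    public static void main(String[] args) {
        System.out.println(saltedKey("order-12345"));
        for (String s : splitPoints()) System.out.print(s + " ");
        System.out.println();
    }
}
```

Note the trade-off Karthik raises: because NUM_BUCKETS is baked into every stored key, changing it later (to match a bigger cluster) invalidates the prefix on all existing rows.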
>>
>>
>>
>>On Mon, Oct 24, 2011 at 3:23 PM, Nicolas Spiegelberg
>><[email protected]> wrote:
>>> Isn't a better strategy to create the HBase keys as
>>>
>>> Key = hash(MySQL_key) + MySQL_key
>>>
>>> That way you'll know your key distribution and can add new machines
>>> seamlessly. I'm assuming that your rows don't overlap between any 2
>>> machines. If so, you could append the MACHINE_ID to the key (not
>>> prepend). I don't think you want the machine # as the first dimension
>>> on your rows, because you want the data from new machines to be evenly
>>> spread out across the existing regions.
>>>
>>>
>>> On 10/24/11 9:07 AM, "Stack" <[email protected]> wrote:
>>>
>>>>On Mon, Oct 24, 2011 at 1:27 AM, Sam Seigal <[email protected]> wrote:
>>>>> According to the HBase book, pre-splitting tables and doing manual
>>>>> splits is a better long-term strategy than letting HBase handle it.
>>>>>
>>>>
>>>>It's good for getting a table off the ground, yes.
>>>>
>>>>
>>>>> Since I do not know what the keys from the prod system are going to
>>>>> look like, I am adding a machine number prefix to the row keys
>>>>> and pre-splitting the tables based on the prefix (prefix 0 goes to
>>>>> machine A, prefix 1 goes to machine B, etc).
>>>>>
>>>>
>>>>You don't need to do an in-order scan of the data? What does the rest
>>>>of your row key look like?
>>>>
>>>>
>>>>> Once I decide to add more machines, I can always do a rolling split
>>>>> and add more prefixes.
>>>>>
>>>>
>>>>Yes.
>>>>
>>>>> Is this a good strategy for pre-splitting the tables?
>>>>>
>>>>
>>>>So, you'll start out with one region per server?
>>>>
>>>>What do you think the rate of splitting will be like? Are you using
>>>>the default region size or have you bumped this up?
>>>>
>>>>St.Ack
>>>
>>>
>
>
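[Editorial note: Nicolas's Key = hash(MySQL_key) + MySQL_key scheme can be sketched as follows. The choice of MD5 and the prefix width are assumptions for illustration; the thread only specifies "hash". Unlike the mod-by-cluster-size approach, this prefix never depends on the number of machines, so keys stay valid as the cluster grows.]

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashPrefixedKey {
    // Key = hash(MySQL_key) + MySQL_key: a fixed-width hash prefix spreads
    // writes evenly across the keyspace, while keeping the original key so a
    // reader can re-derive the full row key from the MySQL key alone.
    static String hbaseKey(String mysqlKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(mysqlKey.getBytes(StandardCharsets.UTF_8));
            // Use the first two digest bytes (4 hex chars) as the prefix.
            return String.format("%02x%02x", digest[0] & 0xFF, digest[1] & 0xFF)
                    + mysqlKey;
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is required to be available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hbaseKey("user:42"));
    }
}
```

The cost, as Stack's question hints, is that range scans over the original key order are no longer possible: consecutive MySQL keys land in unrelated parts of the table.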
