Hi Ram,

SPLIT_POLICY is defined the same way MAX_FILESIZE is, so I think it's a table attribute and can be altered. That's good news! I will probably try it.
Also, admin.split(rowkey) is what I will use until I'm able to properly use/set the SPLIT_POLICY. I will simply (try to) count the rows in a region and split in the middle.

Thanks for the hint regarding the SPLIT_POLICY.

JM

2013/1/22, ramkrishna vasudevan <[email protected]>:
>>> Also, last thing. If I want to change the default behaviour and split
>>> based on the row number instead of the midkey, can I hook somewhere?
>
> HTableDescriptor myHtd = new HTableDescriptor();
> myHtd.setValue(HTableDescriptor.SPLIT_POLICY,
>     KeyPrefixRegionSplitPolicy.class.getName());
>
> So the region split policy can be changed only during table creation, I
> suppose. (I may be wrong; I'm not sure whether there is any other way.)
>
> When I said split based on row key, my point was to use
> admin.split(rowkey). I will check more on your calculations and figures
> and get back to you.
>
> Regards
> Ram
>
> On Tue, Jan 22, 2013 at 7:17 PM, Jean-Marc Spaggiari <
> [email protected]> wrote:
>
>> Hi Anoop, Hi Ram,
>>
>> Thanks for your replies.
>>
>> I looked at the code and found in HFileBlockIndex the midkey
>> function, which does the computation used in the
>> Store.getSplitPoint() method.
>>
>> Now, if all the keys are almost equal in size, and the table has only
>> one big 10GB region, then if we lower the maxfilesize parameter to
>> something like 300MB, we should see only almost equal regions, right?
>> That's not the result I got, so I'm trying to figure out where I'm
>> wrong.
>>
>> Also, last thing. If I want to change the default behaviour and split
>> based on the row number instead of the midkey, can I hook somewhere?
>>
>> Or will I have to disable the default split (by setting the
>> maxfilesize to something like 20GB) and run a job to split the regions
>> manually?
>>
>> Thanks,
>>
>> JM
>>
>> 2013/1/22, ramkrishna vasudevan <[email protected]>:
>> > Hi Jean
>> >
>> > Before replying with what I know: region splits can be configured
>> > too.
>> >
>> > OK, now on how the split happens:
>> > -> You can explicitly ask for the region to be split on a specific
>> >    row key, if you know that splitting on that row key will yield
>> >    almost equal region sizes.
>> > -> When HBase itself splits, it just takes the midkey from the
>> >    HFiles. Here the midkey is the first key in the mid block of the
>> >    HFile.
>> > Also, individual rows cannot be split. So if one row is nearly the
>> > size of the region and the other rows are smaller, it tries to find
>> > the mid block inside the HFile, and one of the blocks is going to be
>> > very large and may end up split off as one region. I know this has
>> > to do with the internals of the splitting code.
>> >
>> > Regards
>> > Ram
>> >
>> > On Tue, Jan 22, 2013 at 5:12 PM, Jean-Marc Spaggiari <
>> > [email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> I'm wondering what the HBase split policy is.
>> >>
>> >> I mean, let's imagine this situation.
>> >>
>> >> I have a region full of rows starting from AA to AZ. Thousands of
>> >> hundreds. I also have a few rows from B to DZ. Let's say only one
>> >> hundred.
>> >>
>> >> Region is just above the maxfilesize, so it's fine.
>> >>
>> >> Now, I add "A" and store a very big row into it. Almost half the
>> >> size of my maxfilesize value. That means it's now time to split
>> >> this row.
>> >>
>> >> How will HBase decide where to split it? Is it going to use the
>> >> lexical order? Which means it will split somewhere between B and C?
>> >> If it's done that way, I will have one VERY small region, and one
>> >> VERY big one which will still be over the maxfilesize and will need
>> >> to be split again, and most probably many times, right?
>> >>
>> >> Or will HBase take the middle of the region, look at the closest
>> >> key, and cut there?
>> >>
>> >> Yesterday, for one table, I merged all my regions into a single
>> >> one. This gave me something like a 10GB region. Since I want to
>> >> have at least 100 regions for this table, I set the maxfilesize to
>> >> 100MB. I restarted HBase and let it work overnight.
>> >>
>> >> This morning, I have some very big regions, still over 100MB, and
>> >> some very small ones. And the big regions are at least a hundred
>> >> times bigger than the small ones.
>> >>
>> >> I just stopped the cluster again to re-merge the regions into a
>> >> single one and see whether I did something wrong in the process,
>> >> but in the meantime, I'm looking for more information about the way
>> >> HBase decides where to cut, and whether there is a way to customize
>> >> that.
>> >>
>> >> Thanks,
>> >>
>> >> JM
>> >>
>> >> PS: Numbers are off the top of my head. I don't really recall how
>> >> big the last region was yesterday. I will take more notes when the
>> >> current MassMerge is done.
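
To make the midkey discussion in the thread concrete, here is a minimal toy sketch in plain Java. It is not HBase code: the class `SplitPointSketch`, the `Block` type, and all block sizes are hypothetical, made up for illustration. It assumes the simplified description given in the thread (the midkey is the first key of the middle block of the file) and compares it with the row-count approach mentioned at the top of the thread, where the middle row key would then be passed to admin.split(rowkey).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only; none of these types are HBase classes.
class SplitPointSketch {

    /** One data block of a hypothetical store file. */
    static final class Block {
        final String firstKey; // first row key stored in the block
        final long sizeKb;     // block size in KB (made-up numbers)

        Block(String firstKey, long sizeKb) {
            this.firstKey = firstKey;
            this.sizeKb = sizeKb;
        }
    }

    /** Simplified midkey: first key of the block in the middle of the block list. */
    static String midKeyByBlockIndex(List<Block> blocks) {
        if (blocks.isEmpty()) {
            throw new IllegalArgumentException("no blocks");
        }
        return blocks.get(blocks.size() / 2).firstKey;
    }

    /** Alternative from the thread: the middle row by row count, given sorted keys. */
    static String middleRowKey(List<String> sortedRowKeys) {
        if (sortedRowKeys.isEmpty()) {
            throw new IllegalArgumentException("no rows");
        }
        return sortedRowKeys.get(sortedRowKeys.size() / 2);
    }

    /** KB that would land left ([0]) and right ([1]) of a split at splitKey. */
    static long[] sizeOnEachSide(List<Block> blocks, String splitKey) {
        long left = 0;
        long right = 0;
        for (Block b : blocks) {
            if (b.firstKey.compareTo(splitKey) < 0) {
                left += b.sizeKb;
            } else {
                right += b.sizeKb;
            }
        }
        return new long[] {left, right};
    }

    /** One huge row "A" in its own oversized block, followed by ordinary 64 KB blocks. */
    static List<Block> exampleBlocks() {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block("A", 50_000));
        for (String key : Arrays.asList("AA", "AF", "AM", "AT", "B", "C")) {
            blocks.add(new Block(key, 64));
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<Block> blocks = exampleBlocks();
        String mid = midKeyByBlockIndex(blocks);
        long[] sides = sizeOnEachSide(blocks, mid);
        // Prints: midkey=AM left=50128KB right=256KB
        System.out.println("midkey=" + mid + " left=" + sides[0] + "KB right=" + sides[1] + "KB");
    }
}
```

In this toy model the split point falls in the middle of the block list, but because one block holds a single very large row, almost all the bytes end up on one side. That is consistent with the uneven region sizes reported in the thread, though it does not prove that this is what happened on the real cluster.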
