Re: HBase split policy

Jean-Marc Spaggiari Tue, 22 Jan 2013 06:10:59 -0800

Hi Ram,

SPLIT_POLICY is defined the same way MAX_FILESIZE is, so I think
it's a table attribute and can be altered. That's good news! I will
probably try it.


Also, admin.split(rowkey) is what I will use until I'm able to
properly use/set the SPLIT_POLICY. I will simply (try to) count the
rows in a region, and split in the middle...
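
To make the "count the rows and split in the middle" idea concrete, here is a minimal Java sketch. It assumes the row keys of the region have already been collected in sorted order (for example from a scan). The class and method names are made up for illustration, and plain String ordering stands in for HBase's byte ordering.

```java
import java.util.Arrays;
import java.util.List;

class RowCountSplit {
    // Returns the row key that leaves half of the rows (by count) on each side.
    // Assumption: sortedRowKeys is already in the same order the region uses.
    static String middleRowKey(List<String> sortedRowKeys) {
        if (sortedRowKeys.size() < 2) {
            throw new IllegalArgumentException("need at least two rows to split");
        }
        // The chosen key would be the first row of the second daughter region.
        return sortedRowKeys.get(sortedRowKeys.size() / 2);
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("A", "AA", "AB", "AC", "AD", "B", "C", "D");
        // On a live cluster this key would then be handed to admin.split(...).
        System.out.println("split at row key: " + middleRowKey(keys));
    }
}
```

This splits by row count only; it ignores row sizes, so one very big row can still leave the two halves unequal in bytes.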

Thanks for the hint regarding the SPLIT_POLICY.

JM

2013/1/22, ramkrishna vasudevan <[email protected]>:
>>>Also, last thing. If I want to change the default behaviour and split
>>>based on the row number instead of the midkey, can I hook somewhere?
>
> HTableDescriptor myHtd = new HTableDescriptor();
>     myHtd.setValue(HTableDescriptor.SPLIT_POLICY,
>         KeyPrefixRegionSplitPolicy.class.getName());
> So the region split policy can be changed only during table creation, I
> suppose. (I may be wrong; I'm not sure whether there is any other way.)
>
> When I said split based on row key, I meant using
> admin.split(rowkey). I will check more on your calculations and figures
> and get back to you.
>
> Regards
> Ram
>
>
> On Tue, Jan 22, 2013 at 7:17 PM, Jean-Marc Spaggiari <
> [email protected]> wrote:
>
>> Hi Anoop, Hi Ram,
>>
>> Thanks for your replies.
>>
>> I looked at the code and found in the HFileBlockIndex the midkey
>> function which is doing the computation used in the
>> Store.getSplitPoint() method.
>>
>> Now, if all the keys are almost equal in size, and the table has only
>> one big 10GB region, if we lower the maxfilesize parameter to
>> something like 300MB, we should see only regions of almost equal size,
>> right? That's not the result I got, so I'm trying to figure out where
>> I'm wrong.
>>
>> Also, last thing. If I want to change the default behaviour and split
>> based on the row number instead of the midkey, can I hook somewhere?
>>
>
>
>> Or will I have to disable the default split (by setting the
>> maxfilesize to something like 20GB) and run a job to split the regions
>> manually?
>>
>> Thanks,
>>
>> JM
>>
>> 2013/1/22, ramkrishna vasudevan <[email protected]>:
>> > Hi Jean
>> >
>> > Before replying with what I know: region splits can be configured too.
>> >
>> > OK, now on how the split happens:
>> > -> You can explicitly ask for the region to be split on a specific row
>> > key, if you know that splitting on that row key will yield almost equal
>> > region sizes.
>> > -> When HBase itself splits, it just takes the midkey from the HFiles.
>> > Here the midkey is the first key in the middle block of the HFile.
>> > Also, individual rows cannot be split. So if one row is nearly the size
>> > of the region and the other rows are much smaller, HBase still looks for
>> > the middle block inside the HFile; one of the blocks is going to be
>> > huge, and it may end up as a region of its own. I know this has to do
>> > with the internals of the splitting code.
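
The effect described above can be sketched with a small, self-contained Java simulation. It is only a model of the description in this message (midkey = first key of the middle block, rows never cut), not the actual HFileBlockIndex code. The block target of 64 KB, the row sizes, and all class and method names are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class MidKeySketch {
    // One simulated block: its first row key and its total size in KB.
    static final class Block {
        final String firstKey;
        long sizeKb;
        Block(String firstKey) { this.firstKey = firstKey; }
    }

    // Packs rows (already in key order) into blocks. A block is closed once it
    // reaches targetKb; a row is never cut, so a huge row yields a huge block.
    static List<Block> pack(String[] keys, long[] sizesKb, long targetKb) {
        List<Block> blocks = new ArrayList<>();
        Block current = null;
        for (int i = 0; i < keys.length; i++) {
            if (current == null) {
                current = new Block(keys[i]);
                blocks.add(current);
            }
            current.sizeKb += sizesKb[i];
            if (current.sizeKb >= targetKb) {
                current = null; // the next row opens a new block
            }
        }
        return blocks;
    }

    // "Midkey" as described in the thread: first key of the middle block.
    static String midKey(List<Block> blocks) {
        return blocks.get(blocks.size() / 2).firstKey;
    }

    // Size in KB of everything that sorts before the middle block.
    static long leftSizeKb(List<Block> blocks) {
        long sum = 0;
        for (int i = 0; i < blocks.size() / 2; i++) {
            sum += blocks.get(i).sizeKb;
        }
        return sum;
    }

    // Hypothetical data: one 5000 KB row "A" followed by 1000 rows of 1 KB.
    static List<Block> skewedBlocks() {
        int small = 1000;
        String[] keys = new String[small + 1];
        long[] sizes = new long[small + 1];
        keys[0] = "A";
        sizes[0] = 5000;
        for (int i = 0; i < small; i++) {
            keys[i + 1] = String.format("AA%04d", i);
            sizes[i + 1] = 1;
        }
        return pack(keys, sizes, 64);
    }

    public static void main(String[] args) {
        List<Block> blocks = skewedBlocks();
        long left = leftSizeKb(blocks);
        System.out.println("blocks=" + blocks.size() + " midkey=" + midKey(blocks)
            + " leftKb=" + left + " rightKb=" + (6000 - left));
        // prints: blocks=17 midkey=AA0448 leftKb=5448 rightKb=552
    }
}
```

In this model the split falls in the middle of the block list, not in the middle of the bytes, so the side holding the big row is roughly ten times larger than the other.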
>> >
>> >
>> > Regards
>> > Ram
>> >
>> > On Tue, Jan 22, 2013 at 5:12 PM, Jean-Marc Spaggiari <
>> > [email protected]> wrote:
>> >
>> >> Hi,
>> >>
>> >> I'm wondering what HBase's split policy is.
>> >>
>> >> I mean, let's imagine this situation.
>> >>
>> >> I have a region full of rows starting from AA to AZ. Hundreds of
>> >> thousands of them. I also have a few rows from B to DZ. Let's say
>> >> only one hundred.
>> >>
>> >> The region is just under the maxfilesize, so it's fine.
>> >>
>> >> Now, I add "A" and store a very big row into it. Almost half the size
>> >> of my maxfilesize value. That means it's now time to split this region.
>> >>
>> >> How will HBase decide where to split it? Is it going to use the
>> >> lexical order? Which means it will split somewhere between B and C? If
>> >> it's done that way, I will have one VERY small region, and one VERY
>> >> big which will still be over the maxfilesize and will need to be split
>> >> again, and most probably many times, right?
>> >>
>> >> Or will HBase take the middle of the region, look at the closest key,
>> >> and cut there?
>> >>
>> >> Yesterday, for one table, I merged all my regions into a single one.
>> >> This gave me something like a 10GB region. Since I want to have at
>> >> least 100 regions for this table, I set the maxfilesize to
>> >> 100MB. I restarted HBase and let it work overnight.
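
As a rough sanity check on those numbers, the arithmetic below shows what repeated splitting would give if every split cut a region into two exactly equal halves. That equal-halves behaviour is an idealising assumption, not how HBase is guaranteed to behave, and the class and method names are invented for this sketch.

```java
class IdealSplitRounds {
    // Rounds of halving until a region is at or below maxMb,
    // assuming each split produces two equal halves.
    static int rounds(long regionMb, long maxMb) {
        int rounds = 0;
        while (regionMb > maxMb) {
            regionMb /= 2;
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        long startMb = 10L * 1024; // the roughly 10GB merged region
        long maxMb = 100;          // the maxfilesize from the message
        int r = rounds(startMb, maxMb);
        long regions = 1L << r;    // each round doubles the region count
        System.out.println(r + " rounds, " + regions + " regions of about "
            + (startMb / regions) + "MB each");
        // prints: 7 rounds, 128 regions of about 80MB each
    }
}
```

Under that assumption the table would end up with evenly sized regions, so a large spread in region sizes points at split points that are not at the byte midpoint.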
>> >>
>> >> This morning, I have some very big regions, still over 100MB, and
>> >> some very small ones. And the big regions are at least a hundred times
>> >> bigger than the small ones.
>> >>
>> >> I just stopped the cluster again to re-merge the regions into a single
>> >> one and see if I have not done something wrong in the process, but in
>> >> the meantime, I'm looking for more information about the way HBase is
>> >> deciding where to cut, and if there is a way to customize that.
>> >>
>> >> Thanks,
>> >>
>> >> JM
>> >>
>> >> PS: Numbers are off the top of my head. I don't really recall how big
>> >> the last region was yesterday. I will take more notes once the current
>> >> MassMerge is done.
>> >>
>> >
>>
>
