Re: Pre-split table using shell

Michael Segel Tue, 12 Jun 2012 06:49:26 -0700

Ok...

Please tell me that this isn't a production system.


Is this on EC2? 

On Jun 12, 2012, at 6:55 AM, Simon Kelly wrote:

> Thanks Michael
> 
> I'm 100% sure it's not the UUID distribution that's causing the problem. I'm
> going to try using the API to create the table and see if that changes things.
> 
> The reason I want to pre-split the table is that HBase doesn't handle the
> initial load to a single regionserver and I can't start the system off
> slowly and allow a few splits to happen before fully loading it. It's 100%
> or nothing. I'm also stuck with only 8 GB of RAM per server and only 5
> servers, so I need to get as much as I can out of them from the get-go.
> 
> Simon
> 
> On 12 June 2012 13:37, Michael Segel <[email protected]> wrote:
> 
>> Ok,
>> Now that I'm awake, and am drinking my first cup of joe...
>> 
>> If you just generate UUIDs you are not going to have an even distribution.
>> Nor are they going to be truly random due to how the machines are
>> generating their random numbers.
>> But this is not important in solving your problem....
>> 
>> There is a set of UUIDs which are hashed and then truncated back down to a
>> 128-bit string.
>> You can generate the UUID, take a hash (SHA-1 or MD5) and then truncate it
>> to 128 bits.
>> This would generate a more random distribution across your splits.
>> 
>> I'm also a bit curious about why you're pre-splitting in the first place.
>> I mean, I understand why people do it, but it's a short-term fix and I
>> wonder how much pain you feel.
>> 
>> Of course YMMV based on your use case.
>> 
>> Hash your key and you'll be ok.
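The hash-and-truncate suggestion above could look roughly like the sketch below. It is only an illustration, not code from this thread: the class and method names (HashedRowKey, hashedKey) are made up, and it assumes the 16-byte most-significant/least-significant layout described in the original post. MD5 already yields 16 bytes; SHA-1 yields 20, so it is cut down to the first 16.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.UUID;

// Hypothetical helper: derives a 128-bit row key by hashing a UUID.
final class HashedRowKey {

    // algorithm is a JCA digest name such as "MD5" or "SHA-1".
    static byte[] hashedKey(UUID uuid, String algorithm) {
        // Serialize the UUID as 16 big-endian bytes (msb first, then lsb).
        byte[] raw = ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
        try {
            byte[] digest = MessageDigest.getInstance(algorithm).digest(raw);
            // Keep only the first 128 bits (a no-op for MD5, a truncation for SHA-1).
            return Arrays.copyOf(digest, 16);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(algorithm + " is not available", e);
        }
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        System.out.println("MD5 key length:   " + hashedKey(id, "MD5").length);
        System.out.println("SHA-1 key length: " + hashedKey(id, "SHA-1").length);
    }
}
```

Note that the hash is one-way: a point lookup still works because the key can be recomputed from the UUID, but the UUID cannot be recovered from the row key, so it would have to be stored in a column if it is needed when scanning.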
>> 
>> 
>> 
>> On Jun 12, 2012, at 4:41 AM, Simon Kelly wrote:
>> 
>>> Yes, I'm aware that UUIDs are designed to be unique and not evenly
>>> distributed, but I wouldn't expect a big gap in their distribution either.
>>> 
>>> The other thing that is really confusing me is that the region splits
>>> aren't lexicographically sorted. Perhaps there is a problem with the way
>>> I'm specifying the splits in the split file. I haven't been able to find
>>> any docs on what format the split keys should be in, so I've used what's
>>> produced by Bytes.toStringBinary. Is that correct?
>>> 
>>> Simon
>>> 
>>> On 12 June 2012 10:23, Michael Segel <[email protected]> wrote:
>>> 
>>>> UUIDs are unique but not necessarily random, and even in random
>>>> samplings you may not see an even distribution except over time.
>>>> 
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>> On Jun 12, 2012, at 3:18 AM, "Simon Kelly" <[email protected]> wrote:
>>>> 
>>>>> Hi
>>>>> 
>>>>> I'm getting some unexpected results with a pre-split table where some
>>>>> of the regions are not getting any data.
>>>>> 
>>>>> The table keys are UUIDs (generated using Java's UUID.randomUUID()),
>>>>> which I'm storing as a byte[16]:
>>>>> 
>>>>>  key[0-7] = uuid most significant bits
>>>>>  key[8-15] = uuid least significant bits
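For reference, the key layout above can be produced with a few lines of plain Java. This is a sketch of one way to do it, not the code actually used in the thread; the class name UuidRowKey is invented for the example. ByteBuffer writes longs big-endian by default, which gives exactly the byte order listed.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper illustrating the byte[16] layout described above.
final class UuidRowKey {

    // Bytes 0-7 hold the most significant bits, bytes 8-15 the least significant bits.
    static byte[] toBytes(UUID uuid) {
        return ByteBuffer.allocate(16)
                .putLong(uuid.getMostSignificantBits())
                .putLong(uuid.getLeastSignificantBits())
                .array();
    }

    // Inverse of toBytes: rebuilds the UUID from a 16-byte key.
    static UUID fromBytes(byte[] key) {
        ByteBuffer buf = ByteBuffer.wrap(key);
        long msb = buf.getLong();
        long lsb = buf.getLong();
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        byte[] key = toBytes(id);
        System.out.println(id + " -> " + key.length + " bytes, round trip ok: "
                + id.equals(fromBytes(key)));
    }
}
```

With this layout the first key byte comes from the top of the most significant long. For a version 4 UUID the version nibble sits in byte 6 and the variant bits in byte 8, so byte 0 is taken entirely from the random source.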
>>>>> 
>>>>> The table is created via the shell as follows:
>>>>> 
>>>>>  create 'table', {NAME => 'cf'}, {SPLITS_FILE => 'splits.txt'}
>>>>> 
>>>>> The splits.txt is generated using the code here:
>>>>> http://pastebin.com/DAExXMDz which generates 32 regions split between
>>>>> x00 and xFF. I have also tried with 16-byte region keys (x00x00... to
>>>>> xFFxFF...).
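The pastebin code is not reproduced here, so the following is only a guess at the general shape of such a generator: 31 one-byte split points that divide the first key byte into 32 equal ranges. The class and method names (SplitKeys, evenSplits, escape) are hypothetical. The escape method always writes \xNN, whereas Bytes.toStringBinary leaves printable ASCII bytes as literal characters, so the two outputs will not match for every key.

```java
// Hypothetical generator for evenly spaced one-byte split keys.
final class SplitKeys {

    // Returns regions - 1 split keys; region i then starts at byte value i * 256 / regions.
    static byte[][] evenSplits(int regions) {
        if (regions < 2 || regions > 256) {
            throw new IllegalArgumentException("regions must be between 2 and 256");
        }
        byte[][] splits = new byte[regions - 1][];
        for (int i = 1; i < regions; i++) {
            splits[i - 1] = new byte[] { (byte) (i * 256 / regions) };
        }
        return splits;
    }

    // Renders every byte as \xNN (upper-case hex), including printable ones.
    static String escape(byte[] key) {
        StringBuilder sb = new StringBuilder();
        for (byte b : key) {
            sb.append(String.format("\\x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One split key per line, e.g. for writing to a splits file.
        for (byte[] split : evenSplits(32)) {
            System.out.println(escape(split));
        }
    }
}
```

Whether the shell turns a line such as \x08 back into the single byte 0x08, or keeps it as four literal characters, may depend on the HBase version, so it is worth checking the start and end keys the cluster reports. Passing the byte[][] straight to the createTable(desc, splitKeys) overload of the Java admin API avoids the text round trip altogether.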
>>>>> 
>>>>> As far as I understand, this should distribute the rows evenly across
>>>>> the regions, but I'm getting a bunch of regions with no rows. I'm also
>>>>> confused as to the ordering of the regions, since it seems the start and
>>>>> end keys aren't really matching up correctly. You can see the regions and
>>>>> the requests they are getting here: http://pastebin.com/B4771g5X
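One way to sanity-check the expectation of an even spread, without touching a cluster, is to bucket random UUID keys locally using the same unsigned byte-wise ordering HBase applies to row keys. The sketch below does that; it is a local simulation only, and the names (RegionSim, regionFor, simulate) are invented for the example. It assumes 32 regions with one-byte split keys at multiples of 8.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical local simulation: which region would each random UUID key land in?
final class RegionSim {

    // Lexicographic comparison treating each byte as unsigned (0x80 sorts after 0x7F).
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // Region i covers keys in [splits[i-1], splits[i]); region 0 has no lower bound.
    static int regionFor(byte[] key, byte[][] sortedSplits) {
        int region = 0;
        while (region < sortedSplits.length
                && compareUnsigned(key, sortedSplits[region]) >= 0) {
            region++;
        }
        return region;
    }

    // Evenly spaced one-byte split keys (an assumption about the intended layout).
    static byte[][] evenSplits(int regions) {
        byte[][] splits = new byte[regions - 1][];
        for (int i = 1; i < regions; i++) {
            splits[i - 1] = new byte[] { (byte) (i * 256 / regions) };
        }
        return splits;
    }

    // Generates random UUID keys and counts how many fall into each region.
    static int[] simulate(int rows, int regions) {
        byte[][] splits = evenSplits(regions);
        int[] counts = new int[regions];
        for (int i = 0; i < rows; i++) {
            UUID id = UUID.randomUUID();
            byte[] key = ByteBuffer.allocate(16)
                    .putLong(id.getMostSignificantBits())
                    .putLong(id.getLeastSignificantBits())
                    .array();
            counts[regionFor(key, splits)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = simulate(64000, 32);
        for (int r = 0; r < counts.length; r++) {
            System.out.printf("region %2d: %d rows%n", r, counts[r]);
        }
    }
}
```

If every bucket fills up locally while some regions on the cluster stay empty, that would point away from the UUID distribution and toward the split keys the cluster actually ended up with.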
>>>>> 
>>>>> Thanks in advance for the help.
>>>>> Simon
>>>> 
>> 
>>
