Hi Anoop,
Thanks.
I am creating the splits using the hex split example from the HBase
documentation, and I am passing the splits explicitly during table creation. The
leading zeros in some of the key ranges were lost when pasting, because the
spreadsheet treated those values as numbers and the others as text.
All key ranges have a consistent length with leading zeros. I am pasting them
again, this time taking care not to lose the leading zeros.
StartKey          EndKey
(empty)           0000000000199999
0000000000199999  0000000000333332
0000000000333332  00000000004ccccb
00000000004ccccb  0000000000666664
0000000000666664  00000000007ffffd
00000000007ffffd  0000000000999996
0000000000999996  0000000000b3332f
0000000000b3332f  0000000000ccccc8
0000000000ccccc8  0000000000e66661
0000000000e66661  (empty)
    import java.math.BigInteger;

    // Returns numRegions-1 split keys, evenly spaced between startKey and endKey
    // (both hex strings), each formatted as a zero-padded 16-character hex string.
    public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
        byte[][] splits = new byte[numRegions - 1][];
        BigInteger lowestKey = new BigInteger(startKey, 16);
        BigInteger highestKey = new BigInteger(endKey, 16);
        BigInteger range = highestKey.subtract(lowestKey);
        BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
        lowestKey = lowestKey.add(regionIncrement);
        for (int i = 0; i < numRegions - 1; i++) {
            BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
            // Zero-pad to 16 hex digits so that all split keys have the same length.
            byte[] b = String.format("%016x", key).getBytes();
            splits[i] = b;
        }
        return splits;
    }
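As a cross-check on the key ranges listed above, here is a minimal sketch that restates the same split arithmetic with plain long values. The helper name hexSplits and the inputs (0 to ffffff, 10 regions) are assumptions for illustration; the actual arguments used are not stated in this thread. With these assumed inputs the arithmetic yields multiples of 0x199999, which is the pattern of the boundaries in the list.

```java
class HexSplitCheck {
    // Compact long-based restatement of the split computation.
    // hexSplits is a hypothetical helper name, not part of HBase.
    static String[] hexSplits(long lowest, long highest, int numRegions) {
        long increment = (highest - lowest) / numRegions;
        String[] splits = new String[numRegions - 1];
        for (int i = 0; i < numRegions - 1; i++) {
            // Same formatting as the split code: 16 hex digits, zero-padded.
            splits[i] = String.format("%016x", lowest + increment * (i + 1));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Assumed inputs; the thread does not say which values were passed.
        for (String s : hexSplits(0x0L, 0xffffffL, 10)) {
            System.out.println(s);
        }
    }
}
```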
After some more investigation I realized, as RamKrishna had indicated, that the
formatting of the keys was causing this behavior. The format on which the
split is based must be consistent with the format used to generate the keys. In
this particular case the split points are computed from the hex value of
the key and then padded with zeros to make
a 16-byte start/end key. The same formatting has to be applied
when generating the keys during record insertion. If the formatting is not
consistent, the hash values are different, which is why I was not getting the
distribution I was expecting. With these changes made, the writes are now
distributed across the regions.
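To illustrate the point about consistent formatting, here is a minimal sketch that formats row keys with the same "%016x" pattern as the split keys and looks up which region a key would fall into. The class and method names (ConsistentKeys, rowKey, regionIndex) are hypothetical, the sample ids are chosen only so that they span the split range, and the byte-wise unsigned comparison is my assumption of how the row keys are ordered.

```java
import java.nio.charset.StandardCharsets;

class ConsistentKeys {
    // Split keys as listed in this mail (16 hex digits, zero-padded).
    static final String[] SPLITS = {
        "0000000000199999", "0000000000333332", "00000000004ccccb",
        "0000000000666664", "00000000007ffffd", "0000000000999996",
        "0000000000b3332f", "0000000000ccccc8", "0000000000e66661"
    };

    // Row key built with the same format as the split keys.
    static byte[] rowKey(long id) {
        return String.format("%016x", id).getBytes(StandardCharsets.US_ASCII);
    }

    // Byte-wise unsigned lexicographic comparison (assumed row key ordering).
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // Region 0 lies before the first split; region i starts at SPLITS[i - 1].
    static int regionIndex(byte[] key) {
        int i = 0;
        while (i < SPLITS.length
                && compareUnsigned(SPLITS[i].getBytes(StandardCharsets.US_ASCII), key) <= 0) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        for (long id : new long[] {0x100000L, 0x200000L, 0xf00000L}) {
            System.out.println(Long.toHexString(id) + " -> region " + regionIndex(rowKey(id)));
        }
    }
}
```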
Thank you all for the help; it is much appreciated.
Thanks and Regards
Pankaj Misra
________________________________________
From: Anoop Sam John [[email protected]]
Sent: Tuesday, September 25, 2012 4:05 PM
To: [email protected]
Subject: RE: HBase BatchMutations - HOT Region Problem
Hi
There is a utility class, Bytes, available in HBase; its toBytes(int) method
converts an int to a byte[].
Why do only some of the region split keys have leading zeros? How did you create
the splits? Did you pass the splits explicitly, or was the split key creation
done by HBase code? How did you convert the byte[] keys into hex format to paste
them below?
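A small sketch of why a fixed-width binary encoding matters for ordering. It uses only the JDK: intToBytes is a hypothetical stand-in that writes a 4-byte big-endian value, which to my understanding is the layout Bytes.toBytes(int) produces. For non-negative ints this encoding sorts in numeric order, whereas decimal strings of different lengths do not.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class FixedWidthKeys {
    // JDK-only stand-in (hypothetical name): 4-byte big-endian encoding of an int.
    static byte[] intToBytes(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Byte-wise unsigned lexicographic comparison.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // Fixed-width binary: 9 sorts before 10, as expected numerically.
        System.out.println(compareUnsigned(intToBytes(9), intToBytes(10)) < 0);
        // Decimal strings: "9" sorts after "10" because '9' > '1'.
        byte[] nine = "9".getBytes(StandardCharsets.US_ASCII);
        byte[] ten = "10".getBytes(StandardCharsets.US_ASCII);
        System.out.println(compareUnsigned(nine, ten) > 0);
    }
}
```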
-Anoop-
________________________________________
From: Pankaj Misra [[email protected]]
Sent: Tuesday, September 25, 2012 12:41 PM
To: [email protected]
Subject: RE: HBase BatchMutations - HOT Region Problem
Please find attached the table split and the snapshot below.
Start Key         End Key
(empty)           199999
199999            333332
333332            00000000004ccccb
00000000004ccccb  666664
666664            00000000007ffffd
00000000007ffffd  999996
999996            0000000000b3332f
0000000000b3332f  0000000000ccccc8
0000000000ccccc8  0000000000e66661
0000000000e66661  (empty)
As can be seen from the snapshot, the last region alone is being filled with
all the data, including keys that do not belong to that range.
One doubt I do have, however, is about the way the keys are generated on the
client side. The keys are generated incrementally per thread and added to the
thread's offset. Each key is then converted to its string representation and
written as a ByteBuffer. Could converting an integer key to its String form and
then writing it as a ByteBuffer be the problem?
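A minimal sketch of the doubt above. It assumes the split keys are really the zero-padded 16-character strings (as the follow-up at the top of this thread clarifies) and that row keys are compared byte by byte. Under those assumptions a plain decimal string such as "500000" starts with a character greater than '0', so it sorts after the last split key. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class DecimalKeyOrdering {
    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // Byte-wise unsigned lexicographic comparison (assumed row key ordering).
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // True if the key sorts at or after the given split key.
    static boolean atOrAfter(String key, String split) {
        return compareUnsigned(ascii(key), ascii(split)) >= 0;
    }

    public static void main(String[] args) {
        String lastSplit = "0000000000e66661";
        for (String key : new String[] {"1", "99999", "500000", "999999"}) {
            // Unpadded decimal keys begin with '1'..'9', which is greater than '0'.
            System.out.println(key + " lands in last region: " + atOrAfter(key, lastSplit));
        }
    }
}
```

Only the key "0" would sort before the first split here, since it is a prefix of the padded split keys.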
Thanks and Regards
Pankaj Misra
________________________________________
From: Anoop Sam John [[email protected]]
Sent: Tuesday, September 25, 2012 12:18 PM
To: [email protected]
Subject: RE: HBase BatchMutations - HOT Region Problem
Your table is presplit. Can you give the splitkeys that you have used?
-Anoop-
________________________________________
From: Pankaj Misra [[email protected]]
Sent: Tuesday, September 25, 2012 11:45 AM
To: [email protected]
Subject: HBase BatchMutations - HOT Region Problem
Dear All,
I am using HBase 0.94.1 with Hadoop 0.23.1. I have written a multi-threaded
Thrift client to load data into HBase using BatchMutations. The size of
each batch is 1000 rows and the table in HBase is split into 10 regions. The
row keys increase incrementally (0...999999), with an offset applied for each of
the threads (0...99999, 100000...199999, 200000...299999, ...), so in theory
every thread is expected to write to a different region. The individual regions
are wide, i.e. every region is expected to store about 100000 rows, which
makes a total of 1000000 rows across all the regions.
I am using the Thrift server/client and only 1 region server, as per the default
HBase setup.
So if I spawn 10 threads with offsets applied accordingly, I was expecting the
regions to fill up in parallel, which does not seem to be the case.
All the inserts pile into the same region, which makes the writes inefficient
because frequent compaction cycles block all the threads. If the threads
were writing to different regions, this problem would be much
smaller.
I am not sure if I am missing something; any ideas would be very helpful.
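For reference, the per-thread offsets described above can be sketched as follows. This is only the arithmetic of the scheme (10 threads, 100000 rows each, batches of 1000); the class and method names are hypothetical and no Thrift calls are shown.

```java
class ThreadRanges {
    static final int THREADS = 10;
    static final int ROWS_PER_THREAD = 100_000;
    static final int BATCH_SIZE = 1_000;

    // First row id written by thread t (its offset).
    static long firstRow(int t) {
        return (long) t * ROWS_PER_THREAD;
    }

    // Last row id written by thread t.
    static long lastRow(int t) {
        return firstRow(t) + ROWS_PER_THREAD - 1;
    }

    // Number of batches each thread sends.
    static int batchesPerThread() {
        return ROWS_PER_THREAD / BATCH_SIZE;
    }

    public static void main(String[] args) {
        for (int t = 0; t < THREADS; t++) {
            System.out.println("thread " + t + ": " + firstRow(t) + "..." + lastRow(t)
                    + " in " + batchesPerThread() + " batches");
        }
    }
}
```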
Thanks and Regards
Pankaj Misra
________________________________
Impetus Ranked in the Top 50 India's Best Companies to Work For 2012.
Impetus webcast 'Designing a Test Automation Framework for Multi-vendor
Interoperable Systems' available at http://lf1.me/0E/.
NOTE: This message may contain information that is confidential, proprietary,
privileged or otherwise protected by law. The message is intended solely for
the named addressee. If received in error, please destroy and notify the
sender. Any use of this email is prohibited when received in error. Impetus
does not represent, warrant and/or guarantee, that the integrity of this
communication has been maintained nor that the communication is free of errors,
virus, interception or interference.