I've been thinking about this topic lately so I'll fork from another
discussion to ask if anyone has a good approach to determining keys
for pre-splitting from a known dataset. We have a key scenario similar
to what Ted describes below.

We periodically run MR jobs to transform and bulk load data from HDFS
into HBase using Pig. The approach I've used to find the best keys for
the splits is very manual and clunky so I'm wondering if others have a
better approach, perhaps one that could even lead to automation. :)

Here's what I've done:

1. Use Pig to read in our datasets and join/filter/transform/etc.,
then write the output back to HDFS ordered by key with N reducers,
where N is the number of splits we'll create.
2. Manually pluck out the first key of each reducer output file to
make a list of split keys.
3. Create the HBase table with the keys from step 2.
4. Re-run step 1, this time removing the 'ORDER BY key' and
writing to HBase.

The pre-created splits are guaranteed to be evenly distributed, but
the process of determining the keys to split on isn't ideal. Is there
a better technique to do steps 1-2 in a way where the split keys can
just be output to a file?
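
For illustration, here is a minimal sketch of how steps 1-2 might be
automated once the sorted keys (or a representative sample of them) are
available as a list. The function name pick_split_keys and the idea of
holding the keys in memory are assumptions made for the example, not
anything Pig or HBase provides. It takes the key at each evenly spaced
rank and skips the very first key, since the first region starts at the
empty key anyway.

```python
def pick_split_keys(sorted_keys, num_regions):
    """Return up to num_regions - 1 split keys taken at evenly spaced ranks.

    sorted_keys must already be sorted the same way HBase sorts row keys
    (byte-wise). Hypothetical helper, for illustration only.
    """
    if num_regions < 2 or not sorted_keys:
        return []
    n = len(sorted_keys)
    splits = []
    for i in range(1, num_regions):
        # Key sitting at the i/num_regions quantile of the sorted data.
        key = sorted_keys[i * n // num_regions]
        # Skip repeats so the split list stays strictly increasing.
        if not splits or key != splits[-1]:
            splits.append(key)
    return splits


if __name__ == "__main__":
    keys = sorted("k%03d" % i for i in range(100))
    # One split key per line, ready to be written to a file.
    for split in pick_split_keys(keys, 4):
        print(split)
```

For a dataset too large to hold in memory, the same rank-based
selection could be applied to a random sample of the keys, at the cost
of slightly less even regions.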

Suggestions?

---------- Forwarded message ----------
From: Ted Dunning <[email protected]>
Date: Tue, Mar 29, 2011 at 11:38 AM
Subject: Re: Performance test results
To: [email protected]
Cc: Jean-Daniel Cryans <[email protected]>, Eran Kutner
<[email protected]>, Stack <[email protected]>


Watch out when pre-splitting.  Your key distribution may not be as uniform
as you might think.  This particularly happens when keys are represented in
some printable form.  Base 64, for instance, only populates a small fraction
of the base 256 key space.
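
As a rough illustration of the point above (a sketch, not part of the
original message), the snippet below measures how many of the 256
possible leading byte values actually occur in a set of keys. The
helper name leading_byte_coverage and the synthetic three-byte keys are
assumptions made for the example.

```python
import base64


def leading_byte_coverage(keys):
    """Fraction of the 256 possible first-byte values seen in keys."""
    first_bytes = {k[0] for k in keys if k}
    return len(first_bytes) / 256.0


# Synthetic raw keys whose first byte covers every value from 0 to 255.
raw_keys = [bytes([i]) + b"\x00\x00" for i in range(256)]
# The same keys in printable Base 64 form.
b64_keys = [base64.b64encode(k) for k in raw_keys]

if __name__ == "__main__":
    print(leading_byte_coverage(raw_keys))
    print(leading_byte_coverage(b64_keys))
```

The raw keys cover the whole first-byte range, while the Base 64 keys
can only start with one of 64 alphabet characters, so splits spaced
evenly over the full byte range would leave most regions empty.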

On Tue, Mar 29, 2011 at 10:54 AM, Jean-Daniel Cryans <[email protected]> wrote:

> - Inserting into a new table without pre-splitting it is bound to be a
> red herring of bad performance. Please pre-split it with methods such
> as
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#createTable(org.apache.hadoop.hbase.HTableDescriptor, byte[][])
>
