For 1), if you apply hashing to <date>_<somedata>, the date prefix would no longer be useful for range queries. You should first evaluate the distribution of <somedata> as a row key. Assuming the distribution is uneven, you can apply a hash function to the row key. Using MurmurHash is as simple as:
MurmurHash.getInstance().hash(rowkey, 0, rowkey.length, seed)
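To make the idea concrete, here is a minimal standalone sketch of salting the row key with a hash-derived bucket prefix before the Put. It is an illustration, not your exact setup: CRC32 from the JDK stands in for org.apache.hadoop.hbase.util.MurmurHash (which is not on the classpath outside HBase), and the bucket count and key are made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class HashedKeyExample {
    // Number of buckets to spread writes across; tune to your region count.
    static final int NUM_BUCKETS = 16;

    // Prefix the original key with a hash-derived bucket so rows sharing a
    // date prefix no longer cluster on one region server. CRC32 is a
    // stand-in for MurmurHash in this self-contained sketch.
    static String saltedKey(String originalKey) {
        CRC32 crc = new CRC32();
        crc.update(originalKey.getBytes(StandardCharsets.UTF_8));
        long bucket = crc.getValue() % NUM_BUCKETS;
        return String.format("%02d_%s", bucket, originalKey);
    }

    public static void main(String[] args) {
        String key = "20110320_someData";
        String salted = saltedKey(key);
        System.out.println(salted);  // bucket prefix + "_" + original key
        // The Put would then use the salted key instead of the raw one:
        // Put put = new Put(Bytes.toBytes(salted));
    }
}
```

Note the trade-off: a read by the original key must recompute the same bucket, and a date-range scan turns into NUM_BUCKETS parallel scans.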
For 2), you can evaluate MurmurHash and JenkinsHash. Using different hash
functions in your system entails storing metadata for each table about the
choice of hash function.

Cheers

On Sun, Mar 20, 2011 at 8:21 AM, Oleg Ruchovets <[email protected]> wrote:

> I took the org.apache.hadoop.hbase.util.MurmurHash class and want to use
> it for my hashing.
> Until now I had key, value pairs (key format <date>_<somedata>).
> Using MurmurHash I get a hash of my key.
> My questions are:
> 1) What is the way to use hashing? Meaning, how should the code be
> written so that instead of writing key, value it uses the hash too?
> 2) Can different hash functions be used for different HBase tables?
> What is the way to do it?
>
> Thanks in advance
> Oleg.
>
> On Sun, Mar 20, 2011 at 12:25 AM, Ted Yu <[email protected]> wrote:
>
> > I guess you chose the date prefix for query considerations.
> > You should introduce hashing so that the row keys are not clustered
> > together.
> >
> > On Sat, Mar 19, 2011 at 3:00 PM, Oleg Ruchovets <[email protected]>
> > wrote:
> >
> > > We want to insert into HBase on a daily basis (HBase 0.90.1, hadoop
> > > append). Currently we have ~10 million records per day. We use
> > > map/reduce to prepare the data, and write it to HBase in chunks
> > > (5000 puts per chunk). The whole process takes 1h 20 minutes; tests
> > > verified that the write to HBase itself takes ~1 hour.
> > >
> > > I have a couple of questions:
> > > 1) The reducers write data with keys like <date>_<some_text>; the
> > > strange thing is that all records were written to one node.
> > >
> > > Is that correct behaviour? What is the way to get better distribution
> > > across the cluster? During the insertion process I saw that the node
> > > receiving all the data carried most of the load, while the other
> > > nodes had almost no resource utilisation (CPU, I/O, ...).
> > >
> > > Oleg.
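One way to handle the per-table metadata mentioned in 2) is a small registry mapping table name to hash function, consulted by both writers and readers so they agree on how a given table's keys were hashed. A hypothetical standalone sketch (table names are invented; CRC32 and Arrays.hashCode stand in for MurmurHash and JenkinsHash, whose HBase classes are not on the classpath here):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;
import java.util.zip.CRC32;

public class TableHashRegistry {
    // Hypothetical registry: table name -> hash function. Inside HBase you
    // could instead persist a Hash.MURMUR_HASH / Hash.JENKINS_HASH constant
    // per table and call Hash.getInstance(type) on it.
    static final Map<String, ToIntFunction<byte[]>> REGISTRY = new HashMap<>();

    static {
        REGISTRY.put("events", TableHashRegistry::crcHash);  // stand-in for MurmurHash
        REGISTRY.put("metrics", Arrays::hashCode);           // stand-in for JenkinsHash
    }

    static int crcHash(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return (int) crc.getValue();
    }

    // Both the writer and any later reader resolve the function through the
    // registry, so key hashing stays consistent per table.
    static int hashFor(String table, byte[] rowKey) {
        return REGISTRY.get(table).applyAsInt(rowKey);
    }

    public static void main(String[] args) {
        byte[] key = "20110320_someData".getBytes();
        System.out.println(hashFor("events", key));
        System.out.println(hashFor("metrics", key));
    }
}
```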
