On Sun, Mar 20, 2011 at 5:58 PM, Ted Yu <[email protected]> wrote:

> For 1), if you apply hashing to <date>_<somedata>, date prefix wouldn't be
> useful.
> You should evaluate the distribution of <somedata> as row key. Assuming
> distribution is uneven, you can apply hashing function to row key.
> Using MurmurHash is as simple as:
> MurmurHash.getInstance().hash(rowkey, 0, rowkey.length, seed)
>


Thank you for the quick answer.

In the case of a key <date>_<somedata>,
where date is 20110310,
the prefix "20110310_" has length 9.

public class MyHash extends MurmurHash {
  private static MyHash _instance = new MyHash();

  public static Hash getInstance() {
    return _instance;
  }

  @Override
  public int hash(byte[] data, int offset, int length, int seed) {
    // Skip the 9-byte "yyyymmdd_" date prefix and hash only <somedata>.
    return super.hash(data, 9, length - 9, seed);
  }
}
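As a side note, skipping the date prefix means two keys that share the same <somedata> hash identically regardless of date. A minimal self-contained sketch of that effect (using a stand-in FNV-1a hash instead of HBase's MurmurHash; class and method names are illustrative):

```java
// Illustrative stand-in (not HBase's MurmurHash): hash only the bytes after
// the 9-byte "yyyymmdd_" prefix, so the hash depends solely on <somedata>.
public class PrefixSkippingHash {
    // FNV-1a over data[offset .. offset+length-1]
    static int fnv1a(byte[] data, int offset, int length) {
        int h = 0x811C9DC5;
        for (int i = offset; i < offset + length; i++) {
            h ^= (data[i] & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }

    public static void main(String[] args) {
        byte[] a = "20110310_user42".getBytes();
        byte[] b = "20110311_user42".getBytes();  // different date, same data
        // Skipping the date prefix makes both keys hash identically.
        System.out.println(fnv1a(a, 9, a.length - 9) == fnv1a(b, 9, b.length - 9));
        // prints "true"
    }
}
```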

Usage:

int rowKeyHash = MurmurHash.getInstance().hash(rowkey, 0, rowkey.length, seed);

or

int rowKeyHash = MyHash.getInstance().hash(rowkey);

What should I do with the int rowKeyHash? How should the code be written to
make use of rowKeyHash?
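For what it's worth, one common pattern (an illustrative sketch, not something stated in this thread; class and method names are made up) is to prepend the hash as a fixed-width prefix to the original row key, so writes spread across regions while the original key stays recoverable:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative sketch: prepend the 4-byte hash to the original row key.
// ByteBuffer.putInt writes big-endian, like HBase's Bytes.toBytes(int).
public class SaltedKey {
    public static byte[] salt(int rowKeyHash, byte[] rowkey) {
        ByteBuffer buf = ByteBuffer.allocate(4 + rowkey.length);
        buf.putInt(rowKeyHash);  // 4-byte hash prefix distributes the keys
        buf.put(rowkey);         // original key kept after the prefix
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] rowkey = "20110310_abc".getBytes();
        byte[] salted = salt(12345, rowkey);
        // The original key survives after the 4-byte prefix.
        System.out.println(new String(Arrays.copyOfRange(salted, 4, salted.length)));
        // prints "20110310_abc"
    }
}
```

The Put would then be constructed with the salted key instead of the raw key. The trade-off is that range scans over the original key order are no longer possible without scanning all hash buckets.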

Currently my code looks like this:

......

put.setWriteToWAL(false);
puts.add(put);
counter++;

if (counter > batchSize) {
  try {
    table.getWriteBuffer().addAll(puts);
    table.flushCommits();
    puts.clear();
  } finally {
    counter = 0;
  }
}


......

Thanks
Oleg.

> For 2), you can evaluate MurmurHash and JenkinsHash. Using different hash
> functions in your system entails storing metadata for each table about the
> choice of hash function.
>
> Cheers
>
> On Sun, Mar 20, 2011 at 8:21 AM, Oleg Ruchovets <[email protected]
> >wrote:
>
> > I took the org.apache.hadoop.hbase.util.MurmurHash class and want to use
> > it for my hashing.
> > Until now I had key, value pairs (key format <date>_<somedata>), and
> > using MurmurHash I get a hash for my key.
> > My questions are:
> >   1) What is the way to use hashing? Meaning, how should the code be
> > written so that instead of writing key, value it uses the hash too?
> >   2) Can a different hash function be used for different HBase tables?
> > What is the way to do it?
> >
> > Thanks in advance
> > Oleg.
> >
> >
> >
> > On Sun, Mar 20, 2011 at 12:25 AM, Ted Yu <[email protected]> wrote:
> >
> > > I guess you chose the date prefix for query considerations.
> > > You should introduce hashing so that the row keys are not clustered
> > > together.
> > >
> > > On Sat, Mar 19, 2011 at 3:00 PM, Oleg Ruchovets <[email protected]
> > > >wrote:
> > >
> > > > We want to insert into HBase on a daily basis (HBase 0.90.1, hadoop
> > > > append). Currently we have ~10 million records per day. We use
> > > > map/reduce to prepare the data, and write it to HBase in chunks of
> > > > 5000 puts each.
> > > > The whole process takes 1 h 20 min; some tests verified that writing
> > > > to HBase takes ~1 hour.
> > > >
> > > > I have a couple of questions:
> > > > 1) The reducers write data with keys like <date>_<some_text>; the
> > > > strange thing is that all records were written to a single node.
> > > >
> > > > Is that correct behaviour? What is the way to get a better
> > > > distribution across the cluster? During the insertion process I saw
> > > > that most of the load went to the one node where all the data was
> > > > inserted, while the other nodes had almost no resource utilisation
> > > > (CPU, I/O, ...).
> > > >
> > > > Oleg.
> > > >
> > >
> >
>
