On Sun, Mar 20, 2011 at 5:58 PM, Ted Yu <[email protected]> wrote:
> For 1), if you apply hashing to <date>_<somedata>, the date prefix wouldn't
> be useful.
> You should evaluate the distribution of <somedata> as the row key. Assuming
> the distribution is uneven, you can apply a hashing function to the row key.
> Using MurmurHash is as simple as:
> MurmurHash.getInstance().hash(rowkey, 0, rowkey.length, seed)
>
Thank you for the quick answer.
In the case of a key <date>_<somedata>,
where the date is 20110310,
the "20110310_" prefix has length 9:
public class MyHash extends MurmurHash {
    private static MyHash _instance = new MyHash();

    public static Hash getInstance() {
        return _instance;
    }

    @Override
    public int hash(byte[] data, int offset, int length, int seed) {
        // skip the 9-byte "<date>_" prefix and hash only <somedata>
        return super.hash(data, 9, length - 9, seed);
    }
}
using either:
int rowKeyHash = MurmurHash.getInstance().hash(rowkey, 0, rowkey.length, seed);
or
int rowKeyHash = MyHash.getInstance().hash(rowkey);
What should I do with rowKeyHash? How should the code be written to actually
use it?
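Should the hash become a prefix (salt) of the row key, something like the
sketch below? (Only a guess on my side; SaltedPutBuilder, SEED and the
family/qualifier arguments are made-up names for the example.)

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.MurmurHash;

public class SaltedPutBuilder {
    // arbitrary seed; must stay the same for every write (and later read)
    private static final int SEED = 1;

    public static Put buildPut(byte[] rowkey, byte[] family,
                               byte[] qualifier, byte[] value) {
        // hash only <somedata>, skipping the 9-byte "<date>_" prefix
        int rowKeyHash = MurmurHash.getInstance()
                .hash(rowkey, 9, rowkey.length - 9, SEED);
        // prepend the hash so rows of the same day spread across regions:
        // <hash>_<date>_<somedata>
        byte[] saltedKey = Bytes.add(Bytes.toBytes(rowKeyHash),
                Bytes.toBytes("_"), rowkey);
        Put put = new Put(saltedKey);
        put.add(family, qualifier, value);
        return put;
    }
}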
Currently my code looks like this:
......
put.setWriteToWAL(false);
puts.add(put);
counter++;
if (counter > batchSize) {
    try {
        table.getWriteBuffer().addAll(puts);
        table.flushCommits();
        puts.clear();
    } finally {
        counter = 0;
    }
}
......
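(A side question: is pushing puts into table.getWriteBuffer() directly any
different from going through the plain batch API? This is roughly what I
mean; BatchFlush and flushBatch are just made-up names for the sketch.)

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

public class BatchFlush {
    // called with the same table and puts list as in the loop above
    public static void flushBatch(HTable table, List<Put> puts) throws IOException {
        table.put(puts);         // adds the whole batch to the client write buffer
        table.flushCommits();    // pushes the buffered puts to the region servers
        puts.clear();
    }
}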
Thanks
Oleg.
> For 2), you can evaluate MurmurHash and JenkinsHash. Using different hash
> functions in your system entails storing metadata for each table about the
> choice of hash function.
>
> Cheers
>
> On Sun, Mar 20, 2011 at 8:21 AM, Oleg Ruchovets <[email protected]> wrote:
>
> > I took the org.apache.hadoop.hbase.util.MurmurHash class and want to use
> > it for my hashing.
> > Until now I had key, value pairs (key format <date>_<somedata>).
> > Using MurmurHash I get a hash of my key.
> > My questions are:
> > 1) What is the way to use hashing? Meaning, how should the code be
> > written so that instead of writing the raw key, value it uses the hash too?
> > 2) Can a different hash function be used for different HBase tables?
> > What is the way to do it?
> >
> > Thanks in advance
> > Oleg.
> >
> >
> >
> > On Sun, Mar 20, 2011 at 12:25 AM, Ted Yu <[email protected]> wrote:
> >
> > > I guess you chose the date prefix for query considerations.
> > > You should introduce hashing so that the row keys are not clustered
> > > together.
> > >
> > > On Sat, Mar 19, 2011 at 3:00 PM, Oleg Ruchovets <[email protected]> wrote:
> > >
> > > > We want to insert into HBase on a daily basis (HBase 0.90.1, hadoop
> > > > append). Currently we have ~10 million records per day. We use
> > > > map/reduce to prepare the data and write it to HBase in chunks (5000
> > > > puts per chunk). The whole process takes 1h 20 minutes; some tests
> > > > verified that writing to HBase takes ~1 hour.
> > > >
> > > > I have a couple of questions:
> > > > 1) The reducers write data whose key looks like <date>_<some_text>,
> > > > and the strange thing is that all records were written to one node.
> > > >
> > > > Is that correct behaviour? What is the way to get a better
> > > > distribution across the cluster? During the insertion process I saw
> > > > that most of the load hit the specific node where all the data was
> > > > being inserted, while all the other nodes had almost no resource
> > > > utilisation (CPU, I/O ...).
> > > >
> > > > Oleg.
> > > >
> > >
> >
>