For 1), if you apply hashing to <date>_<somedata>, the date prefix would no longer be useful for range queries. You should first evaluate the distribution of <somedata> as a row key. Assuming the distribution is uneven, you can apply a hash function to the row key. Using MurmurHash is as simple as:
MurmurHash.getInstance().hash(rowkey, 0, rowkey.length, seed)
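To make the idea concrete, here is a minimal standalone sketch of salting the row key with a hash-derived bucket prefix before the Put. It is an illustration, not your exact setup: CRC32 from the JDK stands in for org.apache.hadoop.hbase.util.MurmurHash (which is not on the classpath outside HBase), and the bucket count and key are made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class HashedKeyExample {
    // Number of buckets to spread writes across; tune to your region count.
    static final int NUM_BUCKETS = 16;

    // Prefix the original key with a hash-derived bucket so rows sharing a
    // date prefix no longer cluster on one region server. CRC32 is a
    // stand-in for MurmurHash in this self-contained sketch.
    static String saltedKey(String originalKey) {
        CRC32 crc = new CRC32();
        crc.update(originalKey.getBytes(StandardCharsets.UTF_8));
        long bucket = crc.getValue() % NUM_BUCKETS;
        return String.format("%02d_%s", bucket, originalKey);
    }

    public static void main(String[] args) {
        String key = "20110320_someData";
        String salted = saltedKey(key);
        System.out.println(salted);  // bucket prefix + "_" + original key
        // The Put would then use the salted key instead of the raw one:
        // Put put = new Put(Bytes.toBytes(salted));
    }
}
```

Note the trade-off: a read by the original key must recompute the same bucket, and a date-range scan turns into NUM_BUCKETS parallel scans.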
For 2), you can evaluate MurmurHash and JenkinsHash. Using different hash
functions in your system entails storing metadata for each table about the
choice of hash function.

Cheers

On Sun, Mar 20, 2011 at 8:21 AM, Oleg Ruchovets <[email protected]> wrote:

> I took the org.apache.hadoop.hbase.util.MurmurHash class and want to use
> it for my hashing.
> Until now I had key, value pairs (key format <date>_<somedata>).
> Using MurmurHash I get a hash of my key.
> My questions are:
> 1) What is the way to use hashing? Meaning, how should the code be
> written so that instead of writing key, value it uses the hash too?
> 2) Can different hash functions be used for different HBase tables?
> What is the way to do it?
>
> Thanks in advance
> Oleg.
>
> On Sun, Mar 20, 2011 at 12:25 AM, Ted Yu <[email protected]> wrote:
>
> > I guess you chose the date prefix for query considerations.
> > You should introduce hashing so that the row keys are not clustered
> > together.
> >
> > On Sat, Mar 19, 2011 at 3:00 PM, Oleg Ruchovets <[email protected]>
> > wrote:
> >
> > > We want to insert into HBase on a daily basis (HBase 0.90.1, hadoop
> > > append). Currently we have ~10 million records per day. We use
> > > map/reduce to prepare the data, and write it to HBase in chunks
> > > (5000 puts per chunk). The whole process takes 1h 20 minutes; tests
> > > verified that the write to HBase itself takes ~1 hour.
> > >
> > > I have a couple of questions:
> > > 1) The reducers write data with keys like <date>_<some_text>; the
> > > strange thing is that all records were written to one node.
> > >
> > > Is that correct behaviour? What is the way to get better distribution
> > > across the cluster? During the insertion process I saw that the node
> > > receiving all the data carried most of the load, while the other
> > > nodes had almost no resource utilisation (CPU, I/O, ...).
> > >
> > > Oleg.
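One way to handle the per-table metadata mentioned in 2) is a small registry mapping table name to hash function, consulted by both writers and readers so they agree on how a given table's keys were hashed. A hypothetical standalone sketch (table names are invented; CRC32 and Arrays.hashCode stand in for MurmurHash and JenkinsHash, whose HBase classes are not on the classpath here):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;
import java.util.zip.CRC32;

public class TableHashRegistry {
    // Hypothetical registry: table name -> hash function. Inside HBase you
    // could instead persist a Hash.MURMUR_HASH / Hash.JENKINS_HASH constant
    // per table and call Hash.getInstance(type) on it.
    static final Map<String, ToIntFunction<byte[]>> REGISTRY = new HashMap<>();

    static {
        REGISTRY.put("events", TableHashRegistry::crcHash);  // stand-in for MurmurHash
        REGISTRY.put("metrics", Arrays::hashCode);           // stand-in for JenkinsHash
    }

    static int crcHash(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return (int) crc.getValue();
    }

    // Both the writer and any later reader resolve the function through the
    // registry, so key hashing stays consistent per table.
    static int hashFor(String table, byte[] rowKey) {
        return REGISTRY.get(table).applyAsInt(rowKey);
    }

    public static void main(String[] args) {
        byte[] key = "20110320_someData".getBytes();
        System.out.println(hashFor("events", key));
        System.out.println(hashFor("metrics", key));
    }
}
```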
