In 0.94, we have src/main/java/org/apache/hadoop/hbase/util/MurmurHash.java
For Hadoop 1, there is src/core/org/apache/hadoop/util/hash/MurmurHash.java

Cheers

On Mon, Jul 8, 2013 at 8:29 AM, Michael Segel <michael_se...@hotmail.com> wrote:

> Is murmur part of the standard Java libraries?
>
> If not, you end up having to do a bit more maintenance of your cluster
> and that's going to be part of your tradeoff.
>
> On Jul 8, 2013, at 10:14 AM, Mike Axiak <m...@axiak.net> wrote:
>
> > Hello Jason,
> >
> > Have you considered the following rowkey?
> >
> > murmur_128(userId) + timestamp + userId ?
> >
> > This handles both of your cases as (1) murmur 128 is much faster than
> > md5 so will have very low overhead and (2) the userid at the end of
> > the key will ensure that no murmur collisions will cause issues. This
> > key also handles incrementing userIds well because close userIds will
> > likely be in separate regions.
> >
> > Cheers,
> > Mike
> >
> > On Mon, Jul 8, 2013 at 10:19 AM, Jason Huang <jason.hu...@icare.com> wrote:
> >> Hello,
> >>
> >> I am trying to get some advice on the pros/cons of using a
> >> separator/delimiter as part of an HBase row key.
> >>
> >> Currently one of our user activity tables has a rowkey design of
> >> "UserID^TimeStamp" with a separator of "^". (UserID is a string that
> >> won't include '^'.)
> >>
> >> This is designed for the two common use cases in our system:
> >> (1) If we come from a context where the UserID is known, we can do a
> >> scan easily for all the user activities with a startRowKey and
> >> stopRowKey.
> >> (2) If we come from an external networked table where the row key of
> >> this user activity table is stored and can be retrieved as
> >> activityRowKey, then we can use the following code to parse out the
> >> UserID and do the same scan as in (1):
> >>
> >> String activityRowKeyStr = Bytes.toString(activityRowKey);
> >> // The UserID is everything before the first "^"
> >> String userId =
> >>     activityRowKeyStr.substring(0, activityRowKeyStr.indexOf("^"));
> >>
> >> Then I can set startRowKey and stopRowKey for the scan based on userId.
> >> Here we get the benefit of having the User ID as part of the row key
> >> with the separator (compared to another solution that stores the
> >> userID as one of the columns in the user activity table).
> >>
> >> The reason I pick a separator after UserID is that sometimes we may
> >> not get a fixed length string of the UserID value. At one point I
> >> actually thought of using MD5 to hash the UserID and make it a fixed
> >> length; however, the possibility of collision and the possible
> >> overhead of applying the hash function made me pick the separator "^".
> >>
> >> My questions:
> >> (1) My argument is that using a separator is better than using an MD5
> >> hash value. Does that seem reasonable? Could you comment on other
> >> pros and cons that I might have missed (as the basis for my argument)?
> >>
> >> (2) On using a separator/delimiter, besides the requirement that this
> >> separator/delimiter shouldn't appear elsewhere in the rowkey, are
> >> there any other requirements? Are there any special
> >> separators/delimiters that are better/worse than the average ones?
> >>
> >> thanks!
> >>
> >> Jason
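
For reference, here is a rough sketch of how the "UserID^TimeStamp" layout discussed above could be parsed and turned into scan bounds. It is only an illustration under some assumptions: the class and method names (SeparatorRowKeySketch, userIdOf, startRow, stopRow) are made up for this example, keys are assumed to be UTF-8 strings, and the UserID is assumed never to contain '^'. It uses plain Java only; the resulting byte arrays are what one would presumably hand to an HBase scan as the inclusive start row and exclusive stop row.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HBase: illustrates the "UserID^TimeStamp" layout.
class SeparatorRowKeySketch {
    static final char SEPARATOR = '^';

    // The UserID is everything before the first separator.
    static String userIdOf(String rowKey) {
        int sep = rowKey.indexOf(SEPARATOR);
        if (sep < 0) {
            throw new IllegalArgumentException("no separator in row key: " + rowKey);
        }
        return rowKey.substring(0, sep);
    }

    // Inclusive lower bound: the smallest key that starts with "userId^".
    static byte[] startRow(String userId) {
        return (userId + SEPARATOR).getBytes(StandardCharsets.UTF_8);
    }

    // Exclusive upper bound: the separator is bumped to the next byte value
    // ('^' is 0x5E, so this yields '_' at 0x5F). Every key beginning with
    // "userId^" sorts strictly below it in byte order.
    static byte[] stopRow(String userId) {
        return (userId + (char) (SEPARATOR + 1)).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String activityRowKey = "alice^1373296140000"; // made-up sample key
        String userId = userIdOf(activityRowKey);
        System.out.println(userId);
        System.out.println(new String(startRow(userId), StandardCharsets.UTF_8));
        System.out.println(new String(stopRow(userId), StandardCharsets.UTF_8));
    }
}
```

One thing this makes visible: because the bounds include the separator, a scan for user "alice" does not pick up rows for a user named "alice2", which is a point in favor of a separator when UserIDs vary in length.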