In 0.94, we have src/main/java/org/apache/hadoop/hbase/util/MurmurHash.java

For hadoop 1, there is src/core/org/apache/hadoop/util/hash/MurmurHash.java
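
A rough sketch of the key layout Mike suggests in the quoted message, hash(userId) + timestamp + userId, in plain Java. This is an illustration only: HashedRowKey and its methods are made-up names, and MD5 stands in for murmur_128 solely because it also yields 16 bytes and ships with the JDK; a real murmur implementation would replace hash128.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of HBase. Key layout:
//   [16-byte hash of userId][8-byte big-endian timestamp][userId bytes]
final class HashedRowKey {
    static final int HASH_LEN = 16;
    static final int TS_LEN = Long.BYTES;

    // Stand-in for murmur_128 (assumption): MD5 also produces 16 bytes.
    static byte[] hash128(String userId) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(userId.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    static byte[] build(String userId, long timestamp) {
        byte[] id = userId.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(HASH_LEN + TS_LEN + id.length)
                .put(hash128(userId))
                .putLong(timestamp)
                .put(id)
                .array();
    }

    // The prefix has a fixed width, so no separator is needed:
    // the userId is everything after the first 24 bytes.
    static String userIdOf(byte[] rowKey) {
        int off = HASH_LEN + TS_LEN;
        return new String(rowKey, off, rowKey.length - off, StandardCharsets.UTF_8);
    }

    static long timestampOf(byte[] rowKey) {
        return ByteBuffer.wrap(rowKey, HASH_LEN, TS_LEN).getLong();
    }

    // Every row of one user starts with the same 16 bytes,
    // so the hash doubles as the scan prefix for that user.
    static byte[] scanPrefix(String userId) {
        return hash128(userId);
    }
}
```

If two userIds ever hashed to the same 16 bytes, their rows would stay distinct because the userId is the key suffix, but a prefix scan would return both users, so the caller would still check userIdOf on each row.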

Cheers

On Mon, Jul 8, 2013 at 8:29 AM, Michael Segel <michael_se...@hotmail.com> wrote:

> Is murmur part of the standard java libraries?
>
> If not, you end up having to do a bit more maintenance of your cluster and
> that's going to be part of your tradeoff.
>
> On Jul 8, 2013, at 10:14 AM, Mike Axiak <m...@axiak.net> wrote:
>
> > Hello Jason,
> >
> > Have you considered the following rowkey?
> >
> >  murmur_128(userId) + timestamp + userId ?
> >
> > This handles both of your cases as (1) murmur 128 is much faster than
> > md5 so will have very low overhead and (2) the userid at the end of
> > the key will ensure that no murmur collisions will cause issues. This
> > key also handles incrementing userIds well because close userIds will
> > likely be in separate regions.
> >
> > Cheers,
> > Mike
> >
> > On Mon, Jul 8, 2013 at 10:19 AM, Jason Huang <jason.hu...@icare.com> wrote:
> >> Hello,
> >>
> >> I am trying to get some advice on the pros/cons of using a
> >> separator/delimiter as part of an HBase row key.
> >>
> >> Currently one of our user activity tables has a rowkey design of
> >> "UserID^TimeStamp" with a separator of "^". (UserID is a string that
> >> won't include '^').
> >>
> >> This is designed for the two common use cases in our system:
> >> (1) If we come from a context where the UserID is known, we can easily
> >> scan for all of that user's activities with a startRowKey and stopRowKey.
> >> (2) If we come from an external networked table where the row key of this
> >> user activity table is stored and can be retrieved as activityRowKey,
> >> then we can use the following code to parse out the UserID and do the
> >> same scan as in (1):
> >>
> >>    String activityRowKeyStr = Bytes.toString(activityRowKey);
> >>    // The key is "UserID^TimeStamp", so the UserID precedes the '^'.
> >>    String userId =
> >>        activityRowKeyStr.substring(0, activityRowKeyStr.indexOf('^'));
> >>
> >> Then I can set startRowKey and stopRowKey for the scan based on userId.
> >> Here we get the benefit of having the UserID as part of the row key with
> >> the separator (compared to another solution that stores the UserID as
> >> one of the columns in the user activity table).
> >>
> >> The reason I picked a separator after UserID is that we may not always
> >> get a fixed-length string for the UserID value. At one point I actually
> >> thought of using MD5 to hash the UserID and make it a fixed length;
> >> however, the possibility of collisions and the possible overhead of
> >> applying the hash function made me pick the separator "^".
> >>
> >> My question:
> >> (1) My argument is that using a separator is better than using an MD5
> >> hash value. Does that seem reasonable? Could you comment on other pros
> >> and cons that I might have missed (as the basis for my argument)?
> >>
> >> (2) On using a separator/delimiter, besides the requirement that the
> >> separator/delimiter shouldn't appear elsewhere in the rowkey, are there
> >> any other requirements? Are there any special separators/delimiters that
> >> are better/worse than the average ones?
> >>
> >> thanks!
> >>
> >> Jason
> >
>
>
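
For the separator-based design in Jason's original message, the parsing step and the per-user scan bounds could look roughly like the sketch below. SeparatorRowKey and its methods are hypothetical names, not HBase APIs, and the sketch assumes UTF-8 keys in which '^' never appears inside the UserID.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for keys of the form "UserID^TimeStamp"; not an HBase API.
final class SeparatorRowKey {
    static final char SEP = '^';

    // The UserID is the part of the key before the first separator.
    static String userIdOf(byte[] rowKey) {
        String s = new String(rowKey, StandardCharsets.UTF_8);
        int i = s.indexOf(SEP);
        if (i < 0) throw new IllegalArgumentException("no separator in row key");
        return s.substring(0, i);
    }

    // Inclusive start of the user's range: "UserID^".
    static byte[] startRow(String userId) {
        return (userId + SEP).getBytes(StandardCharsets.UTF_8);
    }

    // Exclusive stop: the same bytes with the trailing separator bumped by one
    // ('^' is 0x5E, so the stop row ends in '_' = 0x5F).
    static byte[] stopRow(String userId) {
        byte[] stop = startRow(userId);
        stop[stop.length - 1]++;
        return stop;
    }

    // Unsigned lexicographic byte comparison, the order HBase uses for row keys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // True if rowKey falls inside [startRow, stopRow) for the given user.
    static boolean inRange(byte[] rowKey, String userId) {
        return compare(rowKey, startRow(userId)) >= 0
                && compare(rowKey, stopRow(userId)) < 0;
    }
}
```

In a real client the two byte arrays would be handed to a Scan as its start and stop rows. Because the separator is part of both bounds, a user such as "ab" does not pick up rows of "abc" or "abA", whichever way those IDs sort relative to '^'.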
