Just hashing is almost certainly fine. I'd XOR the two 64-bit halves of the
UUID to make a 64-bit value. With fewer than 10 million users, and assuming
the UUIDs are effectively random, the chance of any collision is roughly
n^2 / 2^65, or a few in a million. Collisions do little damage anyway.
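
A minimal sketch of that XOR fold, using only java.util.UUID from the JDK.
The class and method names (UuidToLong, toLong) and the sample UUID are made
up for illustration and are not part of Mahout:

```java
import java.util.UUID;

class UuidToLong {
    // Fold a 128-bit UUID into 64 bits by XORing its upper and lower halves.
    static long toLong(UUID id) {
        return id.getMostSignificantBits() ^ id.getLeastSignificantBits();
    }

    public static void main(String[] args) {
        // Hypothetical sample ID; real input would come from the user data.
        UUID id = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
        System.out.println(id + " -> " + toLong(id));
    }
}
```

The mapping is deterministic, so it can be applied independently in each
mapper without a separate job to assign sequential IDs. It is one-way, so
keep a long -> UUID lookup if you need to translate results back.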

Note that in the Hadoop jobs the longs are hashed down to ints anyway!
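
To illustrate what hashing a long down to an int looks like, here is a sketch
of one common way to do it in Java. Whether the Hadoop jobs use exactly this
formula is an assumption; the names LongToIndex and toIndex are hypothetical:

```java
class LongToIndex {
    // Fold a 64-bit ID into a non-negative 32-bit index.
    // Long.hashCode XORs the upper and lower 32 bits; the mask clears the sign bit.
    static int toIndex(long id) {
        return 0x7FFFFFFF & Long.hashCode(id);
    }

    public static void main(String[] args) {
        // Two different longs that land on the same index, showing that
        // collisions at this stage are already tolerated.
        System.out.println(toIndex(1L));
        System.out.println(toIndex(1L << 32));
    }
}
```

Since only 31 bits survive this step, collisions here are far more likely
than in the 64-bit value derived from the UUID.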

On Fri, Jun 22, 2012 at 3:43 PM, Jonathan Hodges <[email protected]> wrote:
> I have some input data I don’t control where the user IDs are UUID format.
> The UUIDs are larger than the long type I need for Mahout.  Is there a best
> practice converting this type of data?
>
>
> Since our set is less than 10 million unique users I was thinking about
> chaining together a few MR jobs to convert the user UUIDs to unique
> sequential longs.  Before going through the trouble I thought I would ask
> the community for ideas as I am still very new to Mahout.
>
>
> Thanks in advance.
>
>
> -Jonathan
