The GroupLens and other datasets come with separate itemID->movie name/genre
and userID->zipcode mappings. How would you carry around separate "side"
data models like these?

It then becomes interesting to create separate derived mappings like
userID->median rating or itemID->count of ratings. Clusters of rating events
with the same timestamp also sound really intriguing.
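
A rough sketch of what I mean by derived side tables, assuming the ratings
arrive as userID,itemID,rating,timestamp rows with numeric IDs. The function
name and the dictionary layout are made up for illustration; none of this is
Mahout API.

```python
import csv
from collections import defaultdict
from statistics import median


def build_side_tables(lines):
    """Derive side tables from userID,itemID,rating[,timestamp] rows.

    Hypothetical helper for illustration only, not part of Mahout.
    """
    ratings_by_user = defaultdict(list)
    count_by_item = defaultdict(int)
    events_by_timestamp = defaultdict(list)

    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        user_id, item_id, rating = int(row[0]), int(row[1]), float(row[2])
        ratings_by_user[user_id].append(rating)
        count_by_item[item_id] += 1
        if len(row) > 3:  # the timestamp column is optional
            events_by_timestamp[int(row[3])].append((user_id, item_id))

    # userID -> median rating
    median_by_user = {u: median(r) for u, r in ratings_by_user.items()}
    # timestamp -> rating events, keeping only timestamps shared by 2+ events
    clusters = {ts: ev for ts, ev in events_by_timestamp.items() if len(ev) > 1}
    return median_by_user, dict(count_by_item), clusters


sample = ["1,10,4.0,100", "1,11,2.0,100", "1,12,5.0,200", "2,10,3.0,300"]
median_by_user, count_by_item, clusters = build_side_tables(sample)
print(median_by_user)
print(count_by_item)
print(clusters)
```

Each table is a plain dict keyed by the same long IDs the preference file
uses, so it can sit next to the main data model and be looked up by ID.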

Lance

On Mon, Aug 29, 2011 at 8:10 AM, Sebastian Schelter <[email protected]> wrote:

> The sample code that I wrote for a magazine article that will be published
> shortly might help you with that issue.
>
> The essence is that you need to preprocess your data into two files. One
> holds all preferences using longs only, the other one has the original
> strings. Be aware that you need to generate the longs in the preference file
> by hashing the strings correctly. You can either use
>
> new MemoryIDMigrator().toLongID(...)
>
> for that if you use Java to preprocess your data, or this Python snippet
> if you prefer a scripting language:
>
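> #!/usr/bin/python
> # -*- coding: utf-8 -*-
>
> import hashlib
>
> def mahout_hash(value):
>     # Python 3: MD5 works on bytes, so encode the string as UTF-8 first.
>     md5_hash = hashlib.md5(value.encode('utf-8')).digest()
>     # Read the first 8 digest bytes as a signed big-endian 64-bit integer,
>     # the same value the shift-and-or loop yields in a wrapping int64.
>     return str(int.from_bytes(md5_hash[:8], byteorder='big', signed=True))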
>
> After that you only need to instantiate a FileDataModel and a
> FileIDMigrator to work with the data as shown here:
>
> https://github.com/sscdotopen/musicwithtaste/blob/master/src/main/java/io/ssc/musicwithtaste/examples/RunSimilarArtistsExample.java
>
> --sebastian
>
>
> On 29.08.2011 16:58, Sean Owen wrote:
>
>> Really, the best thing is to use numeric IDs. Hash the string or otherwise
>> turn them into numbers first.
>>
>> If you really need to work with Strings, see the IDMigrator class, which
>> provides a little automatic help in doing so.
>>
>> On Mon, Aug 29, 2011 at 3:04 PM, Amit Mahale <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I was playing with Mahout and found that the FileDataModel accepts data
>>> in the format
>>>
>>>     userId,itemId,pref (long,long,Double).
>>>
>>> The data that I want to experiment with is in the format
>>>
>>>     String,long,double
>>>
>>> What is the best/easiest method to work with this dataset in Mahout?
>>> Input, please.
>>>
>>> Thanks
>>>
>>>
>>
>


-- 
Lance Norskog
[email protected]
