The GroupLens and other datasets come with separate itemID->movie name/genre and userID->zipcode mappings. How would you carry around separate "side" data models like these?
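One minimal way to picture this, purely as a sketch: keep the side data in plain lookup tables keyed by the same long IDs the preference data uses, and join them onto results only when needed. The names `ItemInfo`, `ITEMS`, `USER_ZIP` and `describe`, and all sample values, are made up for illustration and are not part of Mahout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemInfo:
    """Hypothetical side record for one item."""
    title: str
    genres: tuple


# Side tables keyed by the numeric IDs used in the preference file.
# All values here are placeholders, not real dataset rows.
ITEMS = {
    1: ItemInfo("Movie A", ("Comedy",)),
    2: ItemInfo("Movie B", ("Drama", "Romance")),
}
USER_ZIP = {10: "00000", 11: "11111"}


def describe(recommended, items):
    """Attach titles to (itemID, score) pairs; unknown IDs keep the raw ID."""
    out = []
    for item_id, score in recommended:
        info = items.get(item_id)
        out.append((info.title if info else str(item_id), score))
    return out


if __name__ == "__main__":
    print(describe([(1, 4.2), (99, 3.0)], ITEMS))
```

The recommender itself never sees these tables; they only decorate its output, so they can live in memory, a file, or a database without touching the preference data.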
It then becomes interesting to create separate things like 'userID->median rating' and 'itemID->count of ratings'. Clusters of rating events with the same timestamp also sound really intriguing.

Lance

On Mon, Aug 29, 2011 at 8:10 AM, Sebastian Schelter <[email protected]> wrote:
> My sample code that I wrote for a magazine article that will shortly be
> published might help you with that issue.
>
> The essence is that you need to preprocess your data into two files. One
> holds all preferences using longs only, the other one has the original
> strings. Be aware that you need to generate the longs in the preference file
> by hashing the strings correctly. You can either use
>
> new MemoryIDMigrator().toLongID(...)
>
> for that if you use Java to preprocess your data, or this Python snippet
> if you prefer a scripting language:
>
> #!/usr/bin/python
> # -*- coding: utf-8 -*-
>
> import hashlib, numpy
>
> def mahout_hash(value):
>     md5_hash = hashlib.md5(value).digest()
>     hash = numpy.int64(0)
>     for c in md5_hash[:8]:
>         hash = hash << 8 | ord(c)
>     return str(hash)
>
> After that you only need to instantiate a FileDataModel and a
> FileIDMigrator to work with the data as shown here:
>
> https://github.com/sscdotopen/musicwithtaste/blob/master/src/main/java/io/ssc/musicwithtaste/examples/RunSimilarArtistsExample.java
>
> --sebastian
>
>
> On 29.08.2011 16:58, Sean Owen wrote:
>
>> Really, the best thing is to use numeric IDs. Hash the string or otherwise
>> turn them into numbers first.
>>
>> If you really need to work with Strings, see the IDMigrator class which
>> provides a little automatic help in doing so.
>>
>> On Mon, Aug 29, 2011 at 3:04 PM, Amit Mahale <[email protected]>
>> wrote:
>>
>>> Hello,
>>>
>>> I was playing with Mahout and found that the FileDataModel accepts data
>>> in the format
>>>
>>> userId,itemId,pref (long,long,Double).
>>>
>>> The data that I want to experiment with is of the format
>>>
>>> String,long,double
>>>
>>> What is the best/easiest method to work with this dataset on Mahout?
>>> Inputs, please.
>>>
>>> Thanks
>>>
>>
>

--
Lance Norskog
[email protected]
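For anyone trying the two-file preprocessing described in the quoted message on a current Python, here is a rough sketch using only the standard library. It assumes the intent of the quoted snippet is "first 8 bytes of the MD5 of the UTF-8 string, read as a big-endian signed 64-bit integer"; that should be checked against the output of `toLongID` before relying on it. The `convert` helper, the sample rows, and the one-string-per-line layout of the second file are assumptions made for illustration.

```python
import csv
import hashlib
import io


def mahout_hash(value: str) -> int:
    """First 8 bytes of MD5(value as UTF-8), as a signed 64-bit integer.

    Assumed to mirror the quoted snippet; verify against Mahout's output.
    """
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


def convert(rows, prefs_out, ids_out):
    """Split (user_string, item_id, pref) rows into two outputs.

    prefs_out receives hashed_user,item,pref lines (longs only);
    ids_out receives each distinct original string once, one per line.
    """
    seen = set()
    writer = csv.writer(prefs_out, lineterminator="\n")
    for user, item, pref in rows:
        writer.writerow([mahout_hash(user), item, pref])
        if user not in seen:
            seen.add(user)
            ids_out.write(user + "\n")


if __name__ == "__main__":
    # Made-up sample in the String,long,double shape from the question.
    sample = [("alice", 101, 4.5), ("bob", 101, 3.0), ("alice", 102, 2.0)]
    prefs, ids = io.StringIO(), io.StringIO()
    convert(sample, prefs, ids)
    print(prefs.getvalue(), end="")
    print(ids.getvalue(), end="")
```

Using `int.from_bytes(..., signed=True)` avoids the numpy dependency and gives the same two's-complement wraparound that a Java `long` would have.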
