The sample code I wrote for a magazine article that will be published
shortly might help you with that issue.
The essence is that you need to preprocess your data into two files. One
holds all preferences using longs only, the other one has the original
strings. Be aware that you need to generate the longs in the preference
file by hashing the strings exactly as Mahout does. You can either use
new MemoryIDMigrator().toLongID(...)
for that if you preprocess your data in Java, or this Python snippet
if you prefer a scripting language:
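#!/usr/bin/python
# -*- coding: utf-8 -*-
# Python 2 only: ord(c) relies on the digest being a str of
# one-character strings.
import hashlib, numpy

def mahout_hash(value):
    md5_hash = hashlib.md5(value).digest()
    # fixed-width accumulator, so the result is a signed 64-bit
    # value like a Java long
    hash = numpy.int64(0)
    # fold the first 8 digest bytes together, most significant first
    for c in md5_hash[:8]:
        hash = hash << 8 | ord(c)
    return str(hash)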
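In case it helps, here is a minimal sketch of the whole preprocessing step for Python 3, without the NumPy dependency. It rests on a few assumptions that are mine and not checked against your data: the input lines look like name,itemId,pref, Mahout encodes the strings as UTF-8 before hashing, and the ID file should hold one original string per line. The function name preprocess and the file paths are made up for illustration.

```python
import csv
import hashlib
import struct


def mahout_hash(value):
    """Return the first 8 bytes of the MD5 digest as a signed 64-bit int."""
    # Assumption: the Java side hashes the UTF-8 bytes of the string.
    digest = hashlib.md5(value.encode("utf-8")).digest()
    # ">q" reads 8 bytes as a big-endian signed 64-bit integer, which is
    # equivalent to shifting the bytes in one by one into a Java long.
    return struct.unpack(">q", digest[:8])[0]


def preprocess(in_path, prefs_path, ids_path):
    """Split name,itemId,pref lines into a numeric preference file and
    a file listing each original string once (hypothetical layout)."""
    seen = set()
    with open(in_path, newline="", encoding="utf-8") as src, \
            open(prefs_path, "w", newline="", encoding="utf-8") as prefs, \
            open(ids_path, "w", encoding="utf-8") as ids:
        writer = csv.writer(prefs, lineterminator="\n")
        for row in csv.reader(src):
            if not row:
                continue  # skip blank lines
            name, item_id, pref = row
            # preference file: longs only
            writer.writerow([mahout_hash(name), item_id, pref])
            # ID file: every original string exactly once
            if name not in seen:
                seen.add(name)
                ids.write(name + "\n")
```

Calling preprocess("raw.csv", "prefs.csv", "ids.txt") would then produce the two files described above. It is worth comparing a few hashed values against the Java migrator before relying on them.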
After that you only need to instantiate a FileDataModel and a
FileIDMigrator to work with the data as shown here:
https://github.com/sscdotopen/musicwithtaste/blob/master/src/main/java/io/ssc/musicwithtaste/examples/RunSimilarArtistsExample.java
--sebastian
On 29.08.2011 16:58, Sean Owen wrote:
Really, the best thing is to use numeric IDs. Hash the string or otherwise
turn them into numbers first.
If you really need to work with Strings, see the IDMigrator class, which
provides a little automatic help in doing so.
On Mon, Aug 29, 2011 at 3:04 PM, Amit Mahale wrote:
Hello,
I was playing with Mahout and found that the FileDataModel accepts data in
the format
userId,itemId,pref(long,long,Double).
The data that I want to experiment with is of the format
String,long,double
What is the best/easiest method to work with this dataset in Mahout?
Inputs, please.
Thanks