Could do... it's trivial enough to convert the input or modify the mapper that it's almost not worth committing and maintaining. It's all just the stock algorithm otherwise.
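(For illustration, a minimal sketch of the "convert the input" route rather than the custom mappers below. It assumes the same convention those mappers rely on: the first number on a line is the source article ID and every following number is a linked article ID. It writes plain "userID,itemID" lines, the text input the stock RecommenderJob reads when --booleanData is used. The class name and I/O handling are made up for the example; this is not committed code.)

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical one-off converter: Wikipedia link-graph lines -> "userID,itemID" pairs.
public final class LinksToCSV {

  private static final Pattern NUMBERS = Pattern.compile("(\\d+)");

  public static void main(String[] args) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(args[0])); // raw link file
    PrintWriter out = new PrintWriter(args[1]);                      // CSV for RecommenderJob
    String line;
    while ((line = in.readLine()) != null) {
      Matcher m = NUMBERS.matcher(line);
      if (!m.find()) {
        continue;                  // no IDs on this line; skip it
      }
      String userID = m.group();   // first number: source article ("user")
      while (m.find()) {           // remaining numbers: linked articles ("items")
        out.println(userID + "," + m.group());
      }
    }
    out.close();
    in.close();
  }
}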
I will dump this into the wiki with some other thoughts.

On Thu, May 27, 2010 at 3:57 PM, Grant Ingersoll <[email protected]> wrote:
> Great stuff, Sean! Code sounds like it could go into examples with some of
> the other Wikipedia stuff?
>
> Also, how about c-n-p to
> https://cwiki.apache.org/confluence/display/MAHOUT/MahoutBenchmarks?
>
> -Grant
>
> On May 26, 2010, at 1:32 PM, Sean Owen wrote:
>
>> The only customization needed was in the first mapper/reducer to parse
>> the particular format of the input:
>>
>> http://users.on.net/~henry/home/wikipedia.htm
>>
>> I can post the code somewhere... it's in the book too. Oh why not do
>> it here, it's pasted later.
>>
>> The rest is just the stock code from HEAD in SVN. The command line is
>> something like:
>>
>> hadoop jar mahout-core-0.4-SNAPSHOT.job
>> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
>> -Dmapred.input.dir=input/input.txt -Dmapred.output.dir=output
>> --booleanData true
>>
>> Your mileage may vary depending on the machine you run it on of course.
>>
>>
>> public final class WikipediaItemIDIndexMapper extends MapReduceBase
>>     implements Mapper<LongWritable,Text,IntWritable,VLongWritable> {
>>
>>   private static final Pattern NUMBERS = Pattern.compile("(\\d+)");
>>
>>   @Override
>>   public void map(LongWritable key,
>>                   Text value,
>>                   OutputCollector<IntWritable,VLongWritable> output,
>>                   Reporter reporter) throws IOException {
>>     String line = value.toString();
>>     Matcher m = NUMBERS.matcher(line);
>>     m.find();
>>     IntWritable index = new IntWritable();
>>     VLongWritable itemID = new VLongWritable();
>>     while (m.find()) {
>>       long item = Long.parseLong(m.group());
>>       itemID.set(item);
>>       index.set(idToIndex(item));
>>       output.collect(index, itemID);
>>     }
>>   }
>>
>>   static int idToIndex(long itemID) {
>>     return 0x7FFFFFFF & ((int) itemID ^ (int) (itemID >>> 32));
>>   }
>>
>> }
>>
>>
>> public final class WikipediaToItemPrefsMapper extends MapReduceBase
>>     implements Mapper<LongWritable,Text,VLongWritable,VLongWritable> {
>>
>>   private static final Pattern NUMBERS = Pattern.compile("(\\d+)");
>>
>>   @Override
>>   public void map(LongWritable key,
>>                   Text value,
>>                   OutputCollector<VLongWritable,VLongWritable> output,
>>                   Reporter reporter) throws IOException {
>>     String line = value.toString();
>>     Matcher m = NUMBERS.matcher(line);
>>     m.find();
>>     VLongWritable userID = new VLongWritable(Long.parseLong(m.group()));
>>     VLongWritable itemID = new VLongWritable();
>>     while (m.find()) {
>>       itemID.set(Long.parseLong(m.group()));
>>       output.collect(userID, itemID);
>>     }
>>   }
>>
>> }
>>
>>
>> On Wed, May 26, 2010 at 4:45 PM, Jake Mannix <[email protected]> wrote:
>>> Hey Sean,
>>>
>>> Very cool! Is there any custom code you used to import the link data /
>>> instructions on how to reproduce this?
>>>
>>>   -jake
>>>
>>> On May 26, 2010 8:09 AM, "Sean Owen" <[email protected]> wrote:
>>>
>>> Hi all, thought the list might be interested in some recent numbers I
>>> collected on distributed recommenders, in reality, on Hadoop. I just
>>> finished running a set of recommendations based on the Wikipedia link
>>> graph, for book purposes (yeah, it's unconventional). I ran on my
>>> laptop, but it ought to be crudely representative of how it runs in a
>>> real cluster.
>>>
>>> The input is 1058MB as a text file, and contains 130M article-article
>>> associations, from 5.7M articles to 3.8M distinct articles ("users"
>>> and "items", respectively).
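(An aside on the idToIndex helper in WikipediaItemIDIndexMapper quoted above: it folds a 64-bit item ID into a non-negative int by XORing the high and low 32-bit words and masking off the sign bit, which is what lets a long ID be addressed by an int index in the "item ID to item index mapping" phase listed below. Shown standalone with two hand-worked values; the sample inputs are arbitrary.)

static int idToIndex(long itemID) {
  // XOR the high and low 32-bit words, then clear the sign bit so the
  // result is always a non-negative int.
  return 0x7FFFFFFF & ((int) itemID ^ (int) (itemID >>> 32));
}

// idToIndex(12345L)      == 12345       (high word is zero, so the low word survives)
// idToIndex(0xFFFFFFFFL) == 0x7FFFFFFF  (all-ones low word; the sign bit is masked off)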
>>> I estimate cost based on Amazon's North
>>> American small Linux-based instance pricing of $0.085/hour. I ran on a
>>> dual-core laptop with plenty of RAM, allowing 1GB per worker, so this
>>> is valid.
>>>
>>> In this run, I run recommendations for all 5.7M "users". You can
>>> certainly run for any subset of all users of course.
>>>
>>> Phase 1 (Item ID to item index mapping)
>>> 29 minutes CPU time
>>> $0.05
>>> 60MB output
>>>
>>> Phase 2 (Create user vectors)
>>> 88 minutes CPU time
>>> $0.13
>>> Output: 1159MB
>>>
>>> Phase 3 (Count co-occurrence)
>>> 77 hours CPU time
>>> $6.54
>>> Output: 23.6GB
>>>
>>> Phase 4 (Partial multiply prep)
>>> 636 minutes
>>> $0.90
>>> Output: 24.6GB
>>>
>>> Phase 5 (Aggregate and recommend)
>>> about 600 hours
>>> about $51.00
>>> about 10GB
>>> (I estimated these rather than let it run at home for days!)
>>>
>>> Note that phases 1 and 3 may be run less frequently, and need not be
>>> run every time.
>>> But the cost is dominated by the last step, which is most of the work.
>>> I've ignored storage costs since
>>>
>>> This implies a cost of $0.01 (or about 8 instance-minutes) per 1,000
>>> user recommendations. That's not bad if, say, you want to update recs
>>> for your site's 100,000 daily active users for a dollar.
>>>
>>> There are several levers one could pull internally to sacrifice
>>> accuracy for speed, but it's currently set to pretty normal values. So
>>> this is just one possibility.
>>>
>>> Now that's not terrible, but it is about 8x more computing than would
>>> be needed by a non-distributed implementation *if* you could fit the
>>> whole data set into a very large instance's memory, which is still
>>> possible at this scale but needs a pretty big instance. That's a very
>>> apples-to-oranges comparison of course; different algorithms, entirely
>>> different environments. This is about the amount of overhead I'd
>>> expect from distributing -- interesting to note how non-trivial it is.
>>>
>>> Still to-do is to actually run this on EMR at some point or a real
>>> cluster to see how well this estimate holds up.
>>> And still to-do is to make this faster.
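(A quick sanity check of the per-1,000-users figure above, using only the phase costs quoted in the email; phase 5 is an estimate, so the total is approximate.)

public final class CostCheck {
  public static void main(String[] args) {
    double ratePerHour = 0.085;                                 // small-instance $/hour quoted above
    double totalDollars = 0.05 + 0.13 + 6.54 + 0.90 + 51.00;   // phase costs, roughly $58.62
    double users = 5700000.0;                                   // 5.7M "users" recommended for
    double perThousandDollars = totalDollars / (users / 1000.0);            // roughly $0.0103
    double perThousandMinutes = (totalDollars / ratePerHour) * 60.0
        / (users / 1000.0);                                                  // roughly 7.3 instance-minutes
    System.out.println(perThousandDollars + " dollars, "
        + perThousandMinutes + " instance-minutes per 1,000 users");
  }
}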
