Great stuff, Sean! Sounds like the code could go into examples with some of the other Wikipedia stuff?
Also, how about c-n-p to https://cwiki.apache.org/confluence/display/MAHOUT/MahoutBenchmarks?

-Grant

On May 26, 2010, at 1:32 PM, Sean Owen wrote:

> The only customization needed was in the first mapper/reducer to parse
> the particular format of the input:
>
> http://users.on.net/~henry/home/wikipedia.htm
>
> I can post the code somewhere... it's in the book too. Oh why not do
> it here, it's pasted later.
>
> The rest is just the stock code from HEAD in SVN. The command line is
> something like:
>
> hadoop jar mahout-core-0.4-SNAPSHOT.job
>   org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
>   -Dmapred.input.dir=input/input.txt -Dmapred.output.dir=output
>   --booleanData true
>
> Your mileage may vary depending on the machine you run it on, of course.
>
> public final class WikipediaItemIDIndexMapper extends MapReduceBase implements
>     Mapper<LongWritable,Text,IntWritable,VLongWritable> {
>
>   private static final Pattern NUMBERS = Pattern.compile("(\\d+)");
>
>   @Override
>   public void map(LongWritable key,
>                   Text value,
>                   OutputCollector<IntWritable,VLongWritable> output,
>                   Reporter reporter) throws IOException {
>     String line = value.toString();
>     Matcher m = NUMBERS.matcher(line);
>     m.find(); // skip the first number on the line, the source article ("user") ID
>     IntWritable index = new IntWritable();
>     VLongWritable itemID = new VLongWritable();
>     while (m.find()) { // remaining numbers are linked-article ("item") IDs
>       long item = Long.parseLong(m.group());
>       itemID.set(item);
>       index.set(idToIndex(item));
>       output.collect(index, itemID);
>     }
>   }
>
>   // Hash a 64-bit item ID down to a non-negative int index
>   static int idToIndex(long itemID) {
>     return 0x7FFFFFFF & ((int) itemID ^ (int) (itemID >>> 32));
>   }
>
> }
>
> public final class WikipediaToItemPrefsMapper extends MapReduceBase implements
>     Mapper<LongWritable,Text,VLongWritable,VLongWritable> {
>
>   private static final Pattern NUMBERS = Pattern.compile("(\\d+)");
>
>   @Override
>   public void map(LongWritable key,
>                   Text value,
>                   OutputCollector<VLongWritable,VLongWritable> output,
>                   Reporter reporter) throws IOException {
>     String line = value.toString();
>     Matcher m = NUMBERS.matcher(line);
>     m.find(); // first number on the line is the source article ("user") ID
>     VLongWritable userID = new VLongWritable(Long.parseLong(m.group()));
>     VLongWritable itemID = new VLongWritable();
>     while (m.find()) { // remaining numbers are linked-article ("item") IDs
>       itemID.set(Long.parseLong(m.group()));
>       output.collect(userID, itemID);
>     }
>   }
>
> }
>
> On Wed, May 26, 2010 at 4:45 PM, Jake Mannix <[email protected]> wrote:
>> Hey Sean,
>>
>> Very cool! Is there any custom code you used to import the link data /
>> instructions on how to reproduce this?
>>
>>   -jake
>>
>> On May 26, 2010 8:09 AM, "Sean Owen" <[email protected]> wrote:
>>
>> Hi all, thought the list might be interested in some recent numbers I
>> collected on distributed recommenders, in reality, on Hadoop. I just
>> finished running a set of recommendations based on the Wikipedia link
>> graph, for book purposes (yeah, it's unconventional). I ran on my
>> laptop, but it ought to be crudely representative of how it runs in a
>> real cluster.
>>
>> The input is 1058MB as a text file, and contains 130M article-article
>> associations, from 5.7M articles to 3.8M distinct articles ("users"
>> and "items", respectively). I estimate cost based on Amazon's North
>> American small Linux-based instance pricing of $0.085/hour. I ran on a
>> dual-core laptop with plenty of RAM, allowing 1GB per worker, so this
>> is valid.
>>
>> In this run, I run recommendations for all 5.7M "users". You can
>> certainly run for any subset of all users, of course.
>>
>> Phase 1 (Item ID to item index mapping)
>> 29 minutes CPU time
>> $0.05
>> Output: 60MB
>>
>> Phase 2 (Create user vectors)
>> 88 minutes CPU time
>> $0.13
>> Output: 1159MB
>>
>> Phase 3 (Count co-occurrence)
>> 77 hours CPU time
>> $6.54
>> Output: 23.6GB
>>
>> Phase 4 (Partial multiply prep)
>> 636 minutes CPU time
>> $0.90
>> Output: 24.6GB
>>
>> Phase 5 (Aggregate and recommend)
>> about 600 hours CPU time
>> about $51.00
>> Output: about 10GB
>> (I estimated these rather than let it run at home for days!)
>>
>> Note that phases 1 and 3 may be run less frequently, and need not be
>> run every time.
>> But the cost is dominated by the last step, which is most of the work.
>> I've ignored storage costs since they're small by comparison.
>>
>> This implies a cost of $0.01 (or about 8 instance-minutes) per 1,000
>> user recommendations. That's not bad if, say, you want to update recs
>> for your site's 100,000 daily active users for a dollar.
>>
>> There are several levers one could pull internally to sacrifice
>> accuracy for speed, but it's currently set to pretty normal values. So
>> this is just one possibility.
>>
>> Now that's not terrible, but it is about 8x more computing than would
>> be needed by a non-distributed implementation *if* you could fit the
>> whole data set into a very large instance's memory, which is still
>> possible at this scale but needs a pretty big instance. That's a very
>> apples-to-oranges comparison of course; different algorithms, entirely
>> different environments. This is about the amount of overhead I'd
>> expect from distributing -- interesting to note how non-trivial it is.
>>
>> Still to-do is to actually run this on EMR at some point, or a real
>> cluster, to see how well this estimate holds up.
>> And still to-do is to make this faster.
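For readers following along: the parsing in the two custom mappers above boils down to "the first number on each input line is the source article (the 'user'); every following number is a linked article (an 'item')". A standalone sketch of that logic, runnable outside Hadoop; the sample line here is hypothetical and not necessarily the exact format of the linked Wikipedia dump:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WikipediaLineParseDemo {

  private static final Pattern NUMBERS = Pattern.compile("(\\d+)");

  public static void main(String[] args) {
    // Hypothetical line: source article 12 links to articles 34, 56 and 78
    String line = "12: 34 56 78";
    Matcher m = NUMBERS.matcher(line);
    m.find(); // first match is the source article ("user") ID
    long userID = Long.parseLong(m.group());
    List<Long> itemIDs = new ArrayList<>();
    while (m.find()) { // remaining matches are the linked ("item") IDs
      itemIDs.add(Long.parseLong(m.group()));
    }
    System.out.println(userID + " -> " + itemIDs); // prints: 12 -> [34, 56, 78]
  }
}
```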
