Great stuff, Sean!  Sounds like the code could go into examples with some of the
other Wikipedia stuff?

Also, how about copy-and-pasting it to
https://cwiki.apache.org/confluence/display/MAHOUT/MahoutBenchmarks?

-Grant 

On May 26, 2010, at 1:32 PM, Sean Owen wrote:

> The only customization needed was in the first mapper/reducer to parse
> the particular format of the input:
> 
> http://users.on.net/~henry/home/wikipedia.htm
> 
> I can post the code somewhere... it's in the book too. Actually, why
> not just do it here; it's pasted below.
> 
> The rest is just the stock code from HEAD in SVN. The command line is
> something like:
> 
> hadoop jar mahout-core-0.4-SNAPSHOT.job
> org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
> -Dmapred.input.dir=input/input.txt -Dmapred.output.dir=output
> --booleanData true
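> 
> (The -D options are just the usual Hadoop input/output paths, and
> --booleanData tells the job to treat the input as plain user-item
> associations with no preference values.)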
> 
> Your mileage may vary depending on the machine you run it on, of course.
> 
> 
> 
> import java.io.IOException;
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
> 
> import org.apache.hadoop.io.IntWritable;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.io.VLongWritable;
> import org.apache.hadoop.mapred.MapReduceBase;
> import org.apache.hadoop.mapred.Mapper;
> import org.apache.hadoop.mapred.OutputCollector;
> import org.apache.hadoop.mapred.Reporter;
> 
> public final class WikipediaItemIDIndexMapper extends MapReduceBase implements
>    Mapper<LongWritable,Text,IntWritable,VLongWritable> {
> 
>  private static final Pattern NUMBERS = Pattern.compile("(\\d+)");
> 
>  @Override
>  public void map(LongWritable key,
>                  Text value,
>                  OutputCollector<IntWritable,VLongWritable> output,
>                  Reporter reporter) throws IOException {
>    String line = value.toString();
>    Matcher m = NUMBERS.matcher(line);
>    m.find(); // skip the first number on the line (the source article / "user" ID)
>    IntWritable index = new IntWritable();
>    VLongWritable itemID = new VLongWritable();
>    while (m.find()) {
>      long item = Long.parseLong(m.group());
>      itemID.set(item);
>      index.set(idToIndex(item));
>      output.collect(index, itemID);
>    }
>  }
> 
>  // fold a 64-bit item ID into a non-negative int index
>  static int idToIndex(long itemID) {
>    return 0x7FFFFFFF & ((int) itemID ^ (int) (itemID >>> 32));
>  }
> 
> }
> 
> 
> import java.io.IOException;
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
> 
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.io.VLongWritable;
> import org.apache.hadoop.mapred.MapReduceBase;
> import org.apache.hadoop.mapred.Mapper;
> import org.apache.hadoop.mapred.OutputCollector;
> import org.apache.hadoop.mapred.Reporter;
> 
> public final class WikipediaToItemPrefsMapper extends MapReduceBase implements
>    Mapper<LongWritable,Text,VLongWritable,VLongWritable> {
> 
>  private static final Pattern NUMBERS = Pattern.compile("(\\d+)");
> 
>  @Override
>  public void map(LongWritable key,
>                  Text value,
>                  OutputCollector<VLongWritable,VLongWritable> output,
>                  Reporter reporter) throws IOException {
>    String line = value.toString();
>    Matcher m = NUMBERS.matcher(line);
>    m.find(); // the first number on the line is the source article, i.e. the "user" ID
>    VLongWritable userID = new VLongWritable(Long.parseLong(m.group()));
>    VLongWritable itemID = new VLongWritable();
>    while (m.find()) {
>      itemID.set(Long.parseLong(m.group()));
>      output.collect(userID, itemID);
>    }
>  }
> 
> }
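> 
> For reference, here is a tiny standalone sketch of what those mappers
> pull out of one line of the links file. The sample line below is
> hypothetical (I'm assuming a "sourceID: targetID targetID ..." layout);
> the regex only looks for runs of digits, so the exact separators don't
> matter:
> 
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
> 
> public final class WikipediaLineParseDemo {
> 
>  private static final Pattern NUMBERS = Pattern.compile("(\\d+)");
> 
>  public static void main(String[] args) {
>    String line = "1234: 5678 91011 121314"; // hypothetical input line
>    Matcher m = NUMBERS.matcher(line);
>    m.find();
>    // first number on the line: the source article, i.e. the "user" ID
>    long userID = Long.parseLong(m.group());
>    System.out.println("user: " + userID);
>    // every remaining number: a linked-to article, i.e. an "item" ID
>    while (m.find()) {
>      System.out.println("  item: " + Long.parseLong(m.group()));
>    }
>  }
> }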
> 
> 
> On Wed, May 26, 2010 at 4:45 PM, Jake Mannix <[email protected]> wrote:
>> Hey Sean,
>> 
>>  Very cool!  Is there any custom code you used to import the link data /
>> instructions on how to reproduce this?
>> 
>>  -jake
>> 
>> On May 26, 2010 8:09 AM, "Sean Owen" <[email protected]> wrote:
>> 
>> Hi all, thought the list might be interested in some recent numbers I
>> collected on distributed recommenders, running for real on Hadoop. I
>> just finished running a set of recommendations based on the Wikipedia
>> link graph, for book purposes (yeah, it's unconventional). I ran on my
>> laptop, but it ought to be crudely representative of how it runs in a
>> real cluster.
>> 
>> The input is 1058MB as a text file, and contains 130M article-article
>> associations, from 5.7M articles to 3.8M distinct articles ("users"
>> and "items", respectively). I estimate cost based on Amazon's North
>> American small Linux-based instance pricing of $0.085/hour. I ran on a
>> dual-core laptop with plenty of RAM, allowing 1GB per worker, so that
>> pricing should be a fair proxy.
>> 
>> In this run, I computed recommendations for all 5.7M "users". You can
>> of course run it for any subset of users instead.
>> 
>> Phase 1 (Item ID to item index mapping)
>> 29 minutes CPU time
>> $0.05
>> Output: 60MB
>> 
>> Phase 2 (Create user vectors)
>> 88 minutes CPU time
>> $0.13
>> Output: 1159MB
>> 
>> Phase 3 (Count co-occurrence)
>> 77 hours CPU time
>> $6.54
>> Output: 23.6GB
>> 
>> Phase 4 (Partial multiply prep)
>> 636 minutes CPU time
>> $0.90
>> Output: 24.6GB
>> 
>> Phase 5 (Aggregate and recommend)
>> about 600 hours CPU time
>> about $51.00
>> Output: about 10GB
>> (I estimated these rather than let it run at home for days!)
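>> 
>> (Per-phase cost is just the CPU time priced at the instance rate: for
>> example, phase 2 is 88 minutes, or about 1.47 instance-hours, times
>> $0.085/hour, which is about $0.13.)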
>> 
>> 
>> Note that phases 1 and 3 may be run less frequently, and need not be
>> run every time. The cost is dominated by the last step anyway, which
>> is most of the work. I've ignored storage costs here.
>> 
>> This implies a cost of $0.01 (or about 8 instance-minutes) per 1,000
>> user recommendations. That's not bad if, say, you want to update recs
>> for your site's 100,000 daily active users for a dollar.
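>> 
>> (Spelling out that arithmetic: roughly $0.05 + $0.13 + $6.54 + $0.90 +
>> $51.00, or about $58.60 total, over 5.7M users works out to about
>> $0.01 per 1,000 users; similarly, roughly 690 instance-hours over 5.7M
>> users is about 7-8 instance-minutes per 1,000.)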
>> 
>> There are several levers one could pull internally to sacrifice
>> accuracy for speed, but they're currently set to pretty normal values.
>> So this is just one possibility.
>> 
>> Now that's not terrible, but it is about 8x more computing than would
>> be needed by a non-distributed implementation *if* you could fit the
>> whole data set into a very large instance's memory, which is still
>> possible at this scale but needs a pretty big instance. That's a very
>> apples-to-oranges comparison of course; different algorithms, entirely
>> different environments. This is about the amount of overhead I'd
>> expect from distributing -- interesting to note how non-trivial it is.
>> 
>> 
>> Still to do: actually run this on EMR or a real cluster at some point,
>> to see how well these estimates hold up. And, of course, make it
>> faster.
>> 

