Hey Sean, Very cool! Is there any custom code you used to import the link data / instructions on how to reproduce this?
-jake

On May 26, 2010 8:09 AM, "Sean Owen" <[email protected]> wrote:

Hi all, thought the list might be interested in some recent numbers I collected on distributed recommenders, running on Hadoop. I just finished running a set of recommendations based on the Wikipedia link graph, for book purposes (yeah, it's unconventional). I ran on my laptop, but it ought to be crudely representative of how it runs on a real cluster.

The input is 1058MB as a text file and contains 130M article-article associations, from 5.7M articles to 3.8M distinct articles ("users" and "items", respectively). I estimate cost based on Amazon's North American small Linux-based instance pricing of $0.085/hour. I ran on a dual-core laptop with plenty of RAM, allowing 1GB per worker, so this is a reasonable basis for the estimate. In this run, I compute recommendations for all 5.7M "users"; you can of course run for any subset of users.

Phase 1 (Item ID to item index mapping): 29 minutes CPU time, $0.05, output 60MB
Phase 2 (Create user vectors): 88 minutes CPU time, $0.13, output 1159MB
Phase 3 (Count co-occurrence): 77 hours CPU time, $6.54, output 23.6GB
Phase 4 (Partial multiply prep): 636 minutes CPU time, $0.90, output 24.6GB
Phase 5 (Aggregate and recommend): about 600 hours CPU time, about $51.00, output about 10GB

(I estimated the Phase 5 figures rather than let it run at home for days!) Note that phases 1 and 3 may be run less frequently and need not be run every time, but the cost is dominated by the last phase anyway, which is most of the work. I've ignored storage costs, which are minor by comparison.

This implies a cost of about $0.01 (or about 8 instance-minutes) per 1,000 user recommendations. That's not bad if, say, you want to update recs for your site's 100,000 daily active users for a dollar. There are several levers one could pull internally to sacrifice accuracy for speed, but it's currently set to pretty normal values, so this is just one possibility.

Now that's not terrible, but it is about 8x more computing than would be needed by a non-distributed implementation *if* you could fit the whole data set into a very large instance's memory, which is still possible at this scale but needs a pretty big instance. That's a very apples-to-oranges comparison of course; different algorithms, entirely different environments. This is about the amount of overhead I'd expect from distributing -- interesting to note how non-trivial it is.

Still to-do is to actually run this on EMR or a real cluster at some point to see how well this estimate holds up. And still to-do is to make this faster.
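
For anyone who wants to sanity-check the arithmetic, here is a minimal sketch in Java that recomputes the dollar figures from the CPU-hour numbers above, using the quoted $0.085/hour small-instance price; the phase durations are the ones listed, and everything else is just arithmetic:

  // Back-of-envelope check: CPU hours per phase times $0.085/hour,
  // then total cost per 1,000 of the 5.7M users recommended for.
  public class RecommenderCostCheck {
    public static void main(String[] args) {
      double pricePerHour = 0.085;
      // Phases 1-5: 29 min, 88 min, 77 h, 636 min, ~600 h
      double[] hours = {29.0 / 60, 88.0 / 60, 77.0, 636.0 / 60, 600.0};
      double total = 0.0;
      for (int i = 0; i < hours.length; i++) {
        double cost = hours[i] * pricePerHour;
        total += cost;
        System.out.printf("Phase %d: %6.1f hours -> $%.2f%n", i + 1, hours[i], cost);
      }
      double perThousandUsers = total / (5700000.0 / 1000.0);
      System.out.printf("Total: $%.2f; per 1,000 users: $%.3f%n", total, perThousandUsers);
    }
  }

This lands at roughly $58-59 total and about a penny per 1,000 users, consistent with the figures quoted above.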

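On the question of importing the link data: I can't speak to the exact code used for this run, but Mahout's Hadoop RecommenderJob consumes plain "userID,itemID" text lines (treated as boolean preferences), so the conversion boils down to flattening the link graph into one line per article-article link. The sketch below assumes a hypothetical input format of "sourceId: targetId targetId ..." per line, which is an assumption for illustration only; the real dump format will differ.

  // Hypothetical converter: link graph -> "userID,itemID" CSV lines.
  // Assumes each input line looks like "sourceId: targetId targetId ..."
  // (an assumed format for illustration, not necessarily the real dump format).
  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import java.io.PrintWriter;

  public class LinksToPrefs {
    public static void main(String[] args) throws IOException {
      try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
           PrintWriter out = new PrintWriter(args[1])) {
        String line;
        while ((line = in.readLine()) != null) {
          String[] parts = line.split(":", 2);
          if (parts.length < 2) {
            continue; // skip lines without a link list
          }
          String source = parts[0].trim(); // the linking article acts as the "user"
          for (String target : parts[1].trim().split("\\s+")) {
            if (!target.isEmpty()) {
              out.println(source + ',' + target); // each linked-to article is an "item"
            }
          }
        }
      }
    }
  }

The resulting CSV is the kind of text input the distributed recommender job described above starts from.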