Thank you very much. The pointer to Myrrix is very useful.

Myrrix, however, relies on an iterative sparse matrix factorization to do PCA. I want to produce Amazon-like recommendations, i.e., "70% of users who bought this also bought that." So I specifically want the direct kNN algorithm. Any idea what Mahout + Hadoop can deliver on that one?

Thanks,
Jacob
On Sun, Dec 2, 2012 at 5:25 PM, Sean Owen <[email protected]> wrote:
> My guess is: less than $10. Little enough that I wouldn't worry about
> it. But I have not tried it directly.
>
> You just have 10K items, so it ought to be relatively quick to find
> similar items for them. You will want to look at ItemSimilarityJob.
> Setting some parameters like --maxSimilaritiesPerRow and --threshold
> will be important to speed. On EMR, I suggest using 2-4 m1.xlarge
> instances and using spot instances. For the master, use on-demand and
> use m1.large. The usual Hadoop tunings like mapred.reduce.tasks matter
> a lot too. When set up well it should be quite economical.
>
> Since you mentioned implicit feedback and EMR, you may benefit from a
> look at Myrrix (http://myrrix.com). It can compute recommendations or
> item-item similarities, on Hadoop / EMR if desired, and is built for
> this implicit feedback model. The scale is no problem. It's
> pre-packaged and tuned to run by itself, so, might save you time and
> money versus trying to configure, run and tune it from scratch
> (http://myrrix.com/purchase-computation-layer/). For what it may be
> worth I do have one recent benchmark on EMR
> (http://myrrix.com/example-wikipedia-links/) computing a model over
> 13M Wikipedia articles for about $7.
>
> On Sun, Dec 2, 2012 at 9:12 PM, Koobas <[email protected]> wrote:
> > I was wondering if somebody could give me a rough estimate of the cost of
> > running Mahout on Amazon's Elastic MapReduce for a specific problem.
> > I am working with a common case of implicit feedback.
> > I have a simple, boolean input, i.e., user-item pairs (userID, itemID).
> > I would like to find 50 nearest neighbors for each item.
> > I have 10M users, 10K items, and 500M records.
> > If anybody has any ballpark idea of the kind of cost it would take to
> > solve the problem using EMR, I would appreciate it very much.
> >
> > Jacob
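
To make the target of this thread concrete (boolean (userID, itemID) pairs, nearest neighbours per item, ranked by "X% of users who bought this also bought that"), here is a minimal in-memory sketch in plain Python. It is not Mahout code and is not how ItemSimilarityJob is implemented; the function name `top_k_also_bought` and its parameters are made up for illustration. It only shows the statistic on toy data; at 500M records a distributed job would be needed.

```python
from collections import defaultdict
from itertools import permutations


def top_k_also_bought(pairs, k=50):
    """Hypothetical helper: pairs is an iterable of (user_id, item_id).

    Returns {item: [(other_item, fraction), ...]}, where fraction is the
    share of the item's buyers who also bought other_item, best first.
    """
    # Group the boolean feedback by user, dropping duplicate pairs.
    items_by_user = defaultdict(set)
    for user, item in pairs:
        items_by_user[user].add(item)

    buyers = defaultdict(int)                         # item -> distinct buyers
    together = defaultdict(lambda: defaultdict(int))  # item -> other -> co-buyers
    for items in items_by_user.values():
        for item in items:
            buyers[item] += 1
        # Every ordered pair of distinct items bought by the same user.
        for a, b in permutations(items, 2):
            together[a][b] += 1

    result = {}
    for item, others in together.items():
        ranked = sorted(
            ((other, count / buyers[item]) for other, count in others.items()),
            key=lambda t: (-t[1], t[0]),  # highest fraction first, ties by item id
        )
        result[item] = ranked[:k]
    return result
```

The fraction is asymmetric by design: the share of A's buyers who also bought B generally differs from the share of B's buyers who also bought A. Items that never co-occur with any other item do not appear in the result.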
