Yeah, that review, IMHO, had issues. It's important to note the context: the person was selling their own services. They were trying to run some sample code, non-distributed code, in a sort of distributed fashion. The result was predictably not so good. That was a long time ago.
2M users and 10M items isn't big, and it doesn't sound hard even for a non-distributed Mahout recommender. Sure, let's hear more and we can give some ideas.

On Wed, Dec 29, 2010 at 4:08 AM, Sebastian Schelter <[email protected]> wrote:
> Hi all,
>
> once again, I'm moving a twitter conversation to this mailing list.
>
> Let me introduce Andy, who is currently evaluating recommendation
> components for his NYC-located startup and looking into Mahout for that
> reason:
>
> "We are coding primarily in Scala and looking to build or license a
> recommendation component. The base requirement is that it be capable of
> hybrid recommendations on a body of ~2MM users and ~10MM items with rich
> metadata. The paper I referenced seems to indicate Mahout is not a
> great fit. Can you point me to recent improvements that make the
> assertions in the paper obsolete? Any guidance is very much appreciated!"
>
> The paper which he's quoting is an old review of Mahout's recommender
> support available at
> http://www.iletken-project.com/documents/mahout_review_by_iletken.pdf .
> I think we should give great advice to Andy and simultaneously give the
> community an update about the criticized facts in that review that are
> not true anymore.
>
> I'll make a first try to address the state of that review:
>
> - Mahout currently offers parallel algorithms for Collaborative
> Filtering, see
> https://cwiki.apache.org/confluence/display/MAHOUT/Itembased+Collaborative+Filtering
> which can also be used to precompute a model which can then be used for
> online recommendations.
>
> - Mahout has some support for matrix factorization based recommenders (
> https://hudson.apache.org/hudson/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/impl/recommender/svd/SVDRecommender.html
> ), a superior algorithm to this (
> https://issues.apache.org/jira/browse/MAHOUT-525 ) as well as a parallel
> implementation ( https://issues.apache.org/jira/browse/MAHOUT-542 ) are
> currently in the making
>
> - The memory consumption of Taste has significantly improved. I never
> tried to load the Netflix dataset, but I'm pretty sure it fits into some
> hundred megabytes of memory.
>
> Furthermore I think we need to know more details about Andy's use case to
> give him proper answers about Mahout fitting his project:
>
> - Do you have explicit ratings from the users or are you working with
> implicit data?
>
> - What exactly do you mean by hybrid recommendations? Do you mean a
> combination of content based and collaborative filtering techniques?
>
> - How fast do you need the recommendations? Would it be ok to have them
> precomputed on a daily basis, for example, or do you need them in realtime?
>
> - How often do new users and new items enter your dataset? How sparse is
> your rating data?
>
> --sebastian
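
For anyone reading along, the idea of precomputing an item-based model offline and then using it for online recommendations can be illustrated with a small in-memory sketch. This is plain Java, not Mahout code: the class name `CooccurrenceSketch`, its two methods, and the use of raw co-occurrence counts on implicit data are all assumptions made for illustration, and a real similarity measure or the Mahout job referenced above would differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of Mahout.
class CooccurrenceSketch {

    // Offline step: for every pair of distinct items, count how many users
    // interacted with both of them.
    static Map<Long, Map<Long, Integer>> buildModel(Map<Long, Set<Long>> itemsByUser) {
        Map<Long, Map<Long, Integer>> model = new HashMap<>();
        for (Set<Long> items : itemsByUser.values()) {
            for (Long a : items) {
                for (Long b : items) {
                    if (!a.equals(b)) {
                        model.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
                    }
                }
            }
        }
        return model;
    }

    // Online step: score each item the user has not seen by summing its
    // co-occurrence counts with the items the user has seen.
    static List<Long> recommend(Map<Long, Map<Long, Integer>> model, Set<Long> seen, int howMany) {
        Map<Long, Integer> scores = new HashMap<>();
        for (Long item : seen) {
            Map<Long, Integer> row = model.getOrDefault(item, Collections.<Long, Integer>emptyMap());
            for (Map.Entry<Long, Integer> e : row.entrySet()) {
                if (!seen.contains(e.getKey())) {
                    scores.merge(e.getKey(), e.getValue(), Integer::sum);
                }
            }
        }
        List<Long> ranked = new ArrayList<>(scores.keySet());
        // Highest score first; ties are broken by item ID so the order is deterministic.
        ranked.sort((x, y) -> {
            int byScore = Integer.compare(scores.get(y), scores.get(x));
            return byScore != 0 ? byScore : Long.compare(x, y);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(howMany, ranked.size())));
    }
}
```

With four users holding items {10, 11}, {10, 11, 12}, {11, 12} and {10, 13}, calling `buildModel` once and then `recommend(model, Set.of(10L, 11L), 2)` returns `[12, 13]`. The split mirrors the precompute-then-serve pattern described in the quoted message: the expensive pairwise counting happens offline, and the online step only reads the rows for the items a user has already seen.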
