Hi all,

once again, I'm moving a Twitter conversation to this mailing list.

Let me introduce Andy, who is currently evaluating recommendation components for his NYC-based startup and is looking into Mahout for that reason:

"We are coding primarily in Scala and looking to build or license a recommendation component. The base requirement is that it be capable of hybrid recommendations on a body of ~2MM users and ~10MM items with rich metadata. The paper I referenced seems to indicate Mahout is not a great fit - can you point me to recent improvements that make the assertions in the paper obsolete? Any guidance is very much appreciated!"

The paper he's quoting is an old review of Mahout's recommender support, available at http://www.iletken-project.com/documents/mahout_review_by_iletken.pdf . I think we should give Andy good advice and simultaneously give the community an update on the criticisms in that review that no longer hold.

Here is a first attempt at addressing the state of that review:

- Mahout currently offers parallel algorithms for collaborative filtering, see https://cwiki.apache.org/confluence/display/MAHOUT/Itembased+Collaborative+Filtering . These can also be used to precompute a model, which can then be used for online recommendations.
- Mahout has some support for matrix factorization based recommenders ( https://hudson.apache.org/hudson/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/impl/recommender/svd/SVDRecommender.html ). A superior algorithm ( https://issues.apache.org/jira/browse/MAHOUT-525 ) as well as a parallel implementation ( https://issues.apache.org/jira/browse/MAHOUT-542 ) are currently in the making.
- The memory consumption of Taste has significantly improved. I never tried to load the Netflix dataset, but I'm pretty sure it fits into a few hundred megabytes of memory.

Furthermore, I think we need to know more details about Andy's use case to give him proper answers about whether Mahout fits his project:

- Do you have explicit ratings from the users or are you working with implicit data?
- What exactly do you mean by hybrid recommendations? Do you mean a combination of content-based and collaborative filtering techniques?
- How fast do you need the recommendations? Would it be OK to have them precomputed, e.g. on a daily basis, or do you need them in real time?
- How often do new users and new items enter your dataset?
- How sparse is your rating data?

--sebastian
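
P.S. To make the "precompute a model, then use it for online recommendations" idea a bit more concrete, here is a minimal sketch in plain Python. It is not Mahout code and does not use Mahout's API. The function names (`precompute_item_similarities`, `recommend`), the cosine similarity, the `top_k` cutoff and the toy ratings are all assumptions chosen for illustration. In a real setup the first step would be the expensive batch job, and only the second step would run at request time.

```python
from collections import defaultdict
from math import sqrt


def precompute_item_similarities(ratings, top_k=2):
    """Offline step (hypothetical helper): build an item -> neighbours model.

    ratings: dict mapping user -> {item: rating}
    returns: dict mapping item -> list of (other_item, cosine_similarity),
             keeping only the top_k most similar items.
    """
    # Invert the data to item -> {user: rating}.
    by_item = defaultdict(dict)
    for user, prefs in ratings.items():
        for item, value in prefs.items():
            by_item[item][user] = value

    # Euclidean norm of each item's rating vector.
    norms = {
        item: sqrt(sum(v * v for v in users.values()))
        for item, users in by_item.items()
    }

    model = {}
    for a, users_a in by_item.items():
        sims = []
        for b, users_b in by_item.items():
            if a == b:
                continue
            # Dot product over users who rated both items.
            dot = sum(v * users_b[u] for u, v in users_a.items() if u in users_b)
            if dot > 0:
                sims.append((b, dot / (norms[a] * norms[b])))
        # Most similar first; ties broken by item name for determinism.
        sims.sort(key=lambda pair: (-pair[1], pair[0]))
        model[a] = sims[:top_k]
    return model


def recommend(model, user_prefs, how_many=3):
    """Online step (hypothetical helper): score unseen items from the model.

    The predicted rating of a candidate is the similarity-weighted average
    of the user's ratings on the items it is a neighbour of.
    """
    scores = defaultdict(float)
    weights = defaultdict(float)
    for item, rating in user_prefs.items():
        for other, sim in model.get(item, []):
            if other in user_prefs:
                continue  # never recommend what the user already has
            scores[other] += sim * rating
            weights[other] += sim
    ranked = sorted(
        ((scores[i] / weights[i], i) for i in scores),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [item for _, item in ranked[:how_many]]


# Toy data, made up for this sketch.
toy_ratings = {
    "alice": {"a": 5, "b": 4},
    "bob": {"a": 4, "b": 5, "c": 3},
    "carol": {"b": 2, "c": 5},
}
toy_model = precompute_item_similarities(toy_ratings, top_k=2)
```

At request time only `recommend(toy_model, {...})` is called, so the cost per request depends on the number of items the user has rated and on `top_k`, not on the total number of users. Whether that trade-off suits Andy's setup depends on the answers to the questions above.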
