Hi all,

Once again, I'm moving a Twitter conversation to this mailing list.

Let me introduce Andy, who is currently evaluating recommendation
components for his NYC-based startup and is looking into Mahout for that
reason:

"We are coding primarily in Scala and looking to build or license a
recommendation component. The base requirement is that it be capable of
hybrid recommendations on a body of ~2MM users and ~10MM items with rich
metadata.  The paper I referenced seems to indicate Mahout is not a
great fit- can you point me to recent improvements that make the
assertions in the paper obsolete? Any guidance is very much appreciated!"

The paper he refers to is an old review of Mahout's recommender
support, available at
http://www.iletken-project.com/documents/mahout_review_by_iletken.pdf .
I think we should give Andy solid advice and, at the same time, give the
community an update on the criticisms in that review that no longer
hold.

Here is a first attempt at addressing the points raised in that review:

 - Mahout currently offers parallel algorithms for Collaborative
Filtering, see
https://cwiki.apache.org/confluence/display/MAHOUT/Itembased+Collaborative+Filtering
These can also be used to precompute a model, which can then be used for
online recommendations.

 - Mahout has some support for matrix factorization based recommenders (
https://hudson.apache.org/hudson/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/impl/recommender/svd/SVDRecommender.html
). A superior algorithm (
https://issues.apache.org/jira/browse/MAHOUT-525 ) as well as a parallel
implementation ( https://issues.apache.org/jira/browse/MAHOUT-542 ) are
currently in the works.

 - The memory consumption of Taste has improved significantly. I have
never tried to load the Netflix dataset, but I'm pretty sure it fits into
a few hundred megabytes of memory.
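
To illustrate the precompute-then-serve idea from the first point, here
is a minimal sketch in plain Python. It is not Mahout's code or API: the
function names, the choice of cosine similarity, the top_k cutoff and the
toy ratings are all assumptions made up for this example. The offline
step builds a small item-to-item similarity model, and the online step
only looks at that model plus the current user's preferences.

```python
from collections import defaultdict
from math import sqrt


def precompute_item_similarities(ratings, top_k=2):
    """Offline step: keep the top_k most similar items per item (cosine)."""
    # Invert user -> {item: rating} into item -> {user: rating}.
    by_item = defaultdict(dict)
    for user, prefs in ratings.items():
        for item, value in prefs.items():
            by_item[item][user] = value

    norms = {i: sqrt(sum(v * v for v in users.values()))
             for i, users in by_item.items()}

    model = {}
    for i, users_i in by_item.items():
        sims = []
        for j, users_j in by_item.items():
            if i == j:
                continue
            # Dot product over users who rated both items.
            dot = sum(v * users_j[u] for u, v in users_i.items() if u in users_j)
            if dot > 0:
                sims.append((dot / (norms[i] * norms[j]), j))
        sims.sort(reverse=True)
        model[i] = [(j, s) for s, j in sims[:top_k]]
    return model


def recommend(user_prefs, model, how_many=3):
    """Online step: similarity-weighted average of the user's own ratings."""
    weighted = defaultdict(float)
    weights = defaultdict(float)
    for item, value in user_prefs.items():
        for other, sim in model.get(item, []):
            if other in user_prefs:
                continue  # never recommend something the user already has
            weighted[other] += sim * value
            weights[other] += sim
    scores = {i: weighted[i] / weights[i] for i in weighted}
    return sorted(scores, key=lambda i: (-scores[i], i))[:how_many]


# Hypothetical toy data: user -> {item: rating}.
ratings = {
    "alice": {"a": 5.0, "b": 4.0},
    "bob": {"a": 4.0, "b": 5.0, "c": 3.0},
    "carol": {"b": 2.0, "c": 5.0},
}

model = precompute_item_similarities(ratings)  # e.g. run as a nightly batch
print(recommend(ratings["alice"], model))      # cheap lookup at request time
```

The point of the split is that the expensive pairwise comparison happens
once in the batch step, while the per-request work depends only on how
many items the user has rated and on top_k.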

Furthermore, I think we need to know more about Andy's use case to
give him proper answers about whether Mahout fits his project:

- Do you have explicit ratings from the users or are you working with
implicit data?

- What exactly do you mean by hybrid recommendations? Do you mean a
combination of content-based and collaborative filtering techniques?

- How quickly do you need the recommendations? Would it be OK to have
them precomputed, e.g. on a daily basis, or do you need them in real time?

- How often do new users and new items enter your dataset? How sparse is
your rating data?

--sebastian
