Hi Dan,

On 30.11.2011, at 21:23, Dan Beaulieu wrote:

> Hi all, this is a tangent and can mostly be ignored by the people
> interested in this problem.
>
> I'm new to Machine Learning and especially Mahout. Following this
> discussion has made me a bit confused.
> Isn't Mahout used for large datasets where it makes sense to distribute the
> work? Why then isn't anyone pointing
> out that the problem may be the use of one single Mahout node? Is it
> because it's boolean based? Is it because the data set
> isn't really that large?

Isabel already gave a good explanation. Still, as it turns out, the cause of these performance issues currently seems to be the item similarity computation. There is a distributed approach to calculating this data:
https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.html

Sebastian Schelter wrote a tutorial on how to use this job:
http://ssc.io/deploying-a-massively-scalable-recommender-system-with-apache-mahout/

That said, not everybody maintains a Hadoop cluster. I, for example, have not used a cluster yet. As a rule of thumb (by Sean Owen), you can compute everything up to 100,000,000 ratings on a normal machine.

>
> Even if for whatever reason a single node will do for this case, is it
> really expected that the recommendation process would finish in less than
> half a second?

Yes, it is. Recommendation is a real-time problem, but how to do it in real time is still an open question that a lot of research is going into. Many of the people behind Mahout work in an academic context, so it is not yet clear how best to handle the different problems.

Mahout offers a lot of possibilities for tuning. I published a benchmark for a small dataset here:
http://thread.gmane.org/gmane.comp.apache.mahout.user/10433

For every recommender there is a trade-off between:
- accuracy
- space
- time

It is a tough task to find the sweet spot.

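To make the space-versus-time part of that trade-off a bit more concrete, here is a minimal sketch of the idea behind holding common similarities in memory, as Sean suggests with CachingItemSimilarity further down. This is not Mahout code and does not use Mahout's API: the class name CachedTanimoto, its methods, and the cache size parameter are hypothetical, and I am assuming boolean preference data with a Tanimoto (Jaccard) similarity purely for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/**
 * Illustrative only (hypothetical class, not part of Mahout): a bounded LRU
 * cache in front of a Tanimoto item similarity over boolean preferences.
 */
class CachedTanimoto {
    // item ID -> IDs of the users who have a (boolean) preference for that item
    private final Map<Long, Set<Long>> usersByItem;
    // pair key -> similarity; evicts the least recently used pair when full
    private final Map<String, Double> cache;
    // counts how often the similarity really had to be computed
    private int computations = 0;

    CachedTanimoto(Map<Long, Set<Long>> usersByItem, final int maxEntries) {
        this.usersByItem = usersByItem;
        // accessOrder=true turns the LinkedHashMap into an LRU structure
        this.cache = new LinkedHashMap<String, Double>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Double> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached similarity if present, otherwise computes and stores it. */
    double similarity(long a, long b) {
        // The similarity is symmetric, so (a, b) and (b, a) share one cache key.
        String key = Math.min(a, b) + ":" + Math.max(a, b);
        Double cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        double value = tanimoto(a, b);
        cache.put(key, value);
        return value;
    }

    /** Tanimoto coefficient: |users(a) ∩ users(b)| / |users(a) ∪ users(b)|. */
    private double tanimoto(long a, long b) {
        computations++;
        Set<Long> ua = usersByItem.getOrDefault(a, Collections.<Long>emptySet());
        Set<Long> ub = usersByItem.getOrDefault(b, Collections.<Long>emptySet());
        int intersection = 0;
        for (Long user : ua) {
            if (ub.contains(user)) {
                intersection++;
            }
        }
        int union = ua.size() + ub.size() - intersection;
        return union == 0 ? 0.0 : (double) intersection / union;
    }

    int computations() {
        return computations;
    }
}
```

A larger maxEntries spends more memory to save more recomputation; a smaller one does the opposite. Accuracy is untouched here, since the cache only stores exact values. It is the candidate-limiting lever discussed in the thread that trades accuracy for time.
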
> This makes me think if that is the expectation then the data set is
> actually small and Mahout might be overkill...
>
> What obvious piece of the Mahout puzzle am I missing?

Hope that helps
Manuel

>
> Thanks.
>
> Dan
>
> On Wed, Nov 30, 2011 at 11:56 AM, Sean Owen <[email protected]> wrote:
>
>> Have you used CachingItemSimilarity? That will hold common similarities in
>> memory. It's a lot easier than pre-computing and might help.
>>
>> I think something like your change is a good one (Sebastian what do you
>> think) in that it gives you the ultimate lever to control how many
>> candidates are evaluated. That ought to make it go as fast as you like, but
>> it trades off quality. Still I'd be really surprised if there's no viable
>> middle ground -- this works fine at smaller scale, where 100s of candidates
>> are evaluated, perhaps, and you can use your lever to get to 100s of
>> candidates at your scale too. Is that still both slow and inaccurate?
>>
>> On Wed, Nov 30, 2011 at 3:18 PM, Daniel Zohar <[email protected]> wrote:
>>
>>> I just tested the app with Mahout 0.6.
>>> There seems to be a small performance improvement, but still
>>> recommendations for the 'heavy users' take between 1-5 seconds.
>>>
>>>
>>

--
Manuel Blechschmidt
Dortustr. 57
14467 Potsdam
Mobile: 0173/6322621
Twitter: http://twitter.com/Manuel_B
