Re: Mahout performance issues

Manuel Blechschmidt Thu, 01 Dec 2011 07:18:41 -0800

Hi Dan,

On 30.11.2011, at 21:23, Dan Beaulieu wrote:


> Hi all, this is a tangent and can mostly be ignored by the people
> interested in this problem.
> 
> I'm new to Machine Learning and especially Mahout. Following this
> discussion has made me a bit confused.
> Isn't Mahout used for large datasets where it makes sense to distribute the
> work? Why then isn't anyone pointing
> out that the problem may be the use of one single Mahout node? Is it
> because it's boolean based? Is it because the data set
> isn't really that large?

Isabel already gave a good explanation. Nevertheless, as it turns out, the 
cause of this performance issue currently seems to be the item similarity.

There is a distributed approach to calculating this data:
https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.html

Sebastian Schelter wrote a tutorial on how to use this job:
http://ssc.io/deploying-a-massively-scalable-recommender-system-with-apache-mahout/

Nevertheless, not everybody maintains a Hadoop cluster. For example, I have 
not used a cluster yet. As a rule of thumb (by Sean Owen), you can calculate 
everything up to 100,000,000 ratings on a normal single machine.
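
On a single machine, one way to reduce the item similarity cost is to keep 
already computed similarities in memory, which is the idea behind the 
CachingItemSimilarity that Sean mentions in the quoted message further down. 
The following is only a minimal sketch of that idea in plain Java. It does not 
use Mahout; PairSimilarity, CachedPairSimilarity and the toy similarity 
function are hypothetical names made up for illustration, and the cache is 
unbounded, which a real implementation would probably not want.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an item similarity; NOT Mahout's interface.
interface PairSimilarity {
    double similarity(long itemA, long itemB);
}

// Memoizes a symmetric similarity so each unordered pair is computed once.
class CachedPairSimilarity implements PairSimilarity {
    private final PairSimilarity delegate;
    private final Map<String, Double> cache = new HashMap<>();
    int misses = 0; // how often the delegate was actually called

    CachedPairSimilarity(PairSimilarity delegate) {
        this.delegate = delegate;
    }

    @Override
    public double similarity(long itemA, long itemB) {
        // Order the IDs so that (a, b) and (b, a) share one cache entry.
        long lo = Math.min(itemA, itemB);
        long hi = Math.max(itemA, itemB);
        String key = lo + ":" + hi;
        Double cached = cache.get(key);
        if (cached == null) {
            misses++;
            cached = delegate.similarity(lo, hi);
            cache.put(key, cached);
        }
        return cached;
    }
}

class CachingSketch {
    public static void main(String[] args) {
        // Toy similarity, assumed for this demo only.
        PairSimilarity slow = (a, b) -> 1.0 / (1.0 + Math.abs(a - b));
        CachedPairSimilarity cached = new CachedPairSimilarity(slow);
        cached.similarity(1, 3);
        cached.similarity(3, 1); // served from the cache
        System.out.println("delegate calls: " + cached.misses);
    }
}
```

The sketch trades memory for time: every cached pair costs space, so in 
practice the cache size would have to be limited.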

> 
> Even if for whatever reason a single node will do for this case, is it
> really expected that the recommendation process would finish in less than
> half a second?

Yes, it is. Recommendation is a real-time problem, but how to do it in real 
time is still an open question that a lot of research goes into. Many people 
from Mahout work in an academic context, so it is not yet clear how best to 
handle the different problems.
Mahout offers a lot of options to tweak. For a small dataset I did a 
benchmark, published here:
http://thread.gmane.org/gmane.comp.apache.mahout.user/10433

Actually, for every recommender there is a trade-off between:
- accuracy
- space
- time

It is a tough task to find the sweet spot.
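
To make the accuracy versus time side of this trade-off concrete, here is a 
small sketch in plain Java. It is not Mahout code: CandidateCapSketch, topN 
and the score array are hypothetical and only stand in for an expensive 
per-candidate estimate. Capping the number of evaluated candidates bounds the 
work per request, but the best items may lie outside the cap.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class CandidateCapSketch {
    // Evaluates at most maxCandidates items (the first ones in the array)
    // and returns the indices of the n best-scoring among them.
    // scores[i] stands in for an expensive per-candidate estimate.
    static List<Integer> topN(double[] scores, int maxCandidates, int n) {
        int limit = Math.min(maxCandidates, scores.length);
        List<Integer> evaluated = new ArrayList<>();
        for (int i = 0; i < limit; i++) {
            evaluated.add(i);
        }
        // Sort descending by score.
        evaluated.sort(Comparator.comparingDouble((Integer i) -> -scores[i]));
        return new ArrayList<>(evaluated.subList(0, Math.min(n, evaluated.size())));
    }

    public static void main(String[] args) {
        double[] scores = {0.2, 0.9, 0.1, 0.4, 0.95, 0.3};
        System.out.println(topN(scores, 6, 2)); // all candidates: [4, 1]
        System.out.println(topN(scores, 3, 2)); // capped at 3:    [1, 0]
    }
}
```

With the cap at 3, the item with the highest score (index 4) is never looked 
at, so the result is cheaper to compute but less accurate.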

> This makes me think if that is the expectation then the data set is
> actually small and Mahout might be overkill...
> 
> What obvious piece of the Mahout puzzle am I missing?

Hope that helps
    Manuel

> 
> Thanks.
> 
> Dan
> 
> On Wed, Nov 30, 2011 at 11:56 AM, Sean Owen <[email protected]> wrote:
> 
>> Have you used CachingItemSimilarity? That will hold common similarities in
>> memory. It's a lot easier than pre-computing and might help.
>> 
>> I think something like your change is a good one (Sebastian what do you
>> think) in that it gives you the ultimate lever to control how many
>> candidates are evaluated. That ought to make it go as fast as you like, but
>> it trades off quality. Still I'd be really surprised if there's no viable
>> middle ground -- this works fine at smaller scale, where 100s of candidates
>> are evaluated, perhaps, and you can use your lever to get to 100s of
>> candidates at your scale too. Is that still both slow and inaccurate?
>> 
>> On Wed, Nov 30, 2011 at 3:18 PM, Daniel Zohar <[email protected]> wrote:
>> 
>>> I just tested the app with Mahout 0.6.
>>> There seems to be a small performance improvement, but still
>>> recommendations for the 'heavy users' take between 1-5 seconds.
>>> 
>>> 
>> 

-- 
Manuel Blechschmidt
Dortustr. 57
14467 Potsdam
Mobil: 0173/6322621
Twitter: http://twitter.com/Manuel_B
