Sean: Thanks for your quick and detailed response. I think it's great (and
am very appreciative!) that Mahout implemented the basic Collaborative
Filtering algorithms because that means that a lot of the "heavy lifting"
has already been done. The built-in Mahout evaluators also make it easier to
compare different prediction calculation and similarity weighting
approaches. I will take your suggestions under advisement. Thanks again for
your help. Carlos

On Wed, Jul 6, 2011 at 5:37 PM, Sean Owen <[email protected]> wrote:

> On Wed, Jul 6, 2011 at 10:02 PM, Carlos Seminario <[email protected]> wrote:
> >
> > Although this is certainly a sound approach, other approaches have been
> > suggested in the literature as cited in
> > https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+Documentation.
> > Can you please provide some insight as to why you selected the above
> > prediction calculation approach for Mahout?
> >
>
> This is a simple weighted average, and it is the simplest and most
> canonical thing to do -- it's what the literature suggests. I can imagine
> several other things you could do here, and you're welcome to modify the
> code to do them. The framework can't implement every one of the thousand
> possible things you might do at each point, so it usually provides the
> basic, simple pieces and invites you to extend or modify as you like.
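>
> For concreteness, here is a rough sketch of that weighted average in Java
> (simplified and illustrative -- not the exact Mahout code; the Neighbor
> type is made up for this example):
>
>   // Estimate a preference as a similarity-weighted average of the
>   // neighbors' preferences for the item.
>   double weightedSum = 0.0;
>   double totalWeight = 0.0;
>   for (Neighbor n : neighbors) {
>     double sim = n.similarityToUser();      // e.g. Pearson correlation
>     Double pref = n.preferenceFor(itemID);  // null if neighbor hasn't rated it
>     if (pref != null) {
>       weightedSum += sim * pref;
>       totalWeight += Math.abs(sim);         // abs() keeps the divisor positive
>     }
>   }
>   // (real code would guard against totalWeight == 0)
>   double estimate = weightedSum / totalWeight;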
>
> I have written these implementations to reflect the "canonical" and basic
> way of doing things, not my inventions or ideas. I'm implementing standard
> ideas, not my own.
>
> But I do have plenty of other ideas for you if you like. For example:
>
> In this formulation, the estimated preference value used to rank
> recommendations is the mean of all the independent predictions. That's
> quite sensible: I think the implicit assumption is that these predictions
> follow some normal distribution whose mean is the "real" preference for
> that item. So the sample mean is as good an estimate as any of that real
> preference.
>
> One problem is that this takes no account of how certain you are that the
> sample mean is close to the real mean. For instance, the mean of 100
> predictions is probably more reliable than the mean of just 1, right? You
> know the population mean is far more likely to be close to the sample mean
> in the first case.
>
> You could use this idea directly by ranking by sample mean minus sample
> standard deviation, instead of just sample mean. That's not an estimate of
> the actual preference, but a sort of lower bound -- a value the real
> preference probably exceeds.
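>
> A sketch of that idea (again illustrative, not Mahout code):
>
>   // Rank by a pessimistic score: sample mean minus sample standard
>   // deviation. Items with few or noisy predictions are penalized.
>   static double pessimisticScore(double[] predictions) {
>     int n = predictions.length;
>     double sum = 0.0;
>     for (double p : predictions) {
>       sum += p;
>     }
>     double mean = sum / n;
>     double sqDiff = 0.0;
>     for (double p : predictions) {
>       sqDiff += (p - mean) * (p - mean);
>     }
>     double stdDev = n > 1 ? Math.sqrt(sqDiff / (n - 1)) : 0.0;
>     return mean - stdDev;
>   }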
>
>
> > I also noticed that Mahout has implemented the following
> > PearsonCorrelationSimilarity weighting when the WEIGHTED parameter is
> > used in the similarity constructor:
> >
> > Would you please provide some insight as to why you decided to use this
> > weighting approach?
> >
>
> This is somewhat made-up. There is no strong mathematical justification
> for it. I can explain the intuition behind why it is sensible, but I think
> the answer is just that it is a crude adjustment to a similarity metric
> you probably won't use anyway -- one that is so well known it needs to be
> supported.
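>
> If it helps, the intuition is that the weighting pushes a raw similarity
> toward +1 or -1 in proportion to how many items the two users overlap on,
> so correlations computed over more data count for more. A sketch of that
> idea (an illustrative simplification, not the exact Mahout source):
>
>   // Pull a raw similarity toward +/-1 as overlap grows.
>   static double weightSimilarity(double sim, int overlapCount, int numItems) {
>     double scale = 1.0 - (double) overlapCount / (double) (numItems + 1);
>     return sim < 0.0
>         ? -1.0 + scale * (1.0 + sim)   // negative similarities move toward -1
>         : 1.0 - scale * (1.0 - sim);   // positive similarities move toward +1
>   }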
>
>
>
> >
> > It appears that Mahout calculates similarities between users to determine
> > the neighborhood and then again during the prediction calculation. When
> > running an evaluator (e.g., AverageAbsoluteDifferenceRecommenderEvaluator),
> > I can see that the user similarities are computed repeatedly for each
> > user. Is there a reason why it was implemented this way? (“time vs space”
> > tradeoff?)
> >
>
> UserSimilarity implementations always compute user-user similarity. You can
> wrap in CachingUserSimilarity if you want it cached. These are separate
> concerns.
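>
> For example, assuming you already have a DataModel called `model`, the
> wrapping looks something like this:
>
>   import org.apache.mahout.cf.taste.impl.similarity.CachingUserSimilarity;
>   import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
>   import org.apache.mahout.cf.taste.model.DataModel;
>   import org.apache.mahout.cf.taste.similarity.UserSimilarity;
>
>   UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
>   // Cache user-user similarities so they are not recomputed on every call
>   UserSimilarity cached = new CachingUserSimilarity(similarity, model);
>   // then pass `cached` to the neighborhood and recommender as usual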
>
>
> > Can you provide some insight as to why you decided to use this approach?
> > Were there any other approaches you considered but rejected, and if so,
> > why did you reject them?
>
> Same as #1: this is just a simple weighted average.
>
