Sean: Thanks for your quick and detailed response. I think it's great (and am very appreciative!) that Mahout implemented the basic Collaborative Filtering algorithms because that means that a lot of the "heavy lifting" has already been done. The built-in Mahout evaluators also make it easier to compare different prediction calculation and similarity weighting approaches. I will take your suggestions under advisement. Thanks again for your help .. Carlos
On Wed, Jul 6, 2011 at 5:37 PM, Sean Owen <[email protected]> wrote:

> On Wed, Jul 6, 2011 at 10:02 PM, Carlos Seminario <[email protected]> wrote:
>
> > Although this is certainly a sound approach, other approaches have been
> > suggested in the literature as cited in
> > https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+Documentation.
> > Can you please provide some insight as to why you selected the above
> > prediction calculation approach for Mahout?
>
> This is a simple weighted average, and is the simplest and most canonical
> thing to do -- it's what the literature suggests. I can imagine several
> other things you could do here, and you're welcome to modify the code to
> do them. The framework can't implement every one of the thousand possible
> things you might do at every point, so it usually provides the basic,
> simple pieces and invites you to extend or modify them as you like.
>
> I have written these implementations to reflect the "canonical" and basic
> way of doing things: standard ideas, not my own inventions.
>
> But I do have plenty of other ideas for you if you like. For example:
>
> In this formulation, the estimated preference value used to rank
> recommendations is the mean of all the independent predictions. That's
> quite sensible: I think the implicit assumption is that these predictions
> follow some normal distribution whose mean is the "real" preference for
> that item, so the sample mean is as good an estimate as any of that real
> preference.
>
> One problem is that this takes no account of how certain you are that the
> sample mean is close to the real mean. For instance, the mean of 100
> predictions is probably more reliable than the mean of just one, right?
> You know that the population mean is far more likely to be close to the
> sample mean.
>
> You could use this idea directly by ranking by sample mean minus sample
> standard deviation, instead of just sample mean. That's not an estimate
> of the actual preference, but a sort of lower bound that the actual
> preference probably exceeds.
>
> > I also noticed that Mahout has implemented the following
> > PearsonCorrelationSimilarity weighting when the WEIGHTED parameter is
> > used in the similarity constructor:
> >
> > Would you please provide some insight as to why you decided to use this
> > weighting approach?
>
> This is somewhat made-up; there is no strong mathematical justification
> for it. I can explain the intuition behind why it is sensible, but I
> think the answer is just that it is a crude adjustment to a similarity
> metric you probably won't use anyway, but that is so well known it needs
> to be supported.
>
> > It appears that Mahout calculates similarities between users to
> > determine the neighborhood and then again during the prediction
> > calculation. When running an evaluator (e.g.,
> > DifferenceRecommenderEvaluator), I can see that the user similarities
> > are computed repeatedly for each user. Is there a reason why it was
> > implemented this way? (“time vs space” tradeoff?)
>
> UserSimilarity implementations always compute user-user similarity. You
> can wrap one in CachingUserSimilarity if you want the results cached.
> These are separate concerns.
>
> > Can you provide some insight as to why you decided to use this
> > approach? Were there any other approaches you considered but rejected,
> > and if so, why did you reject them?
>
> Same as #1, this is just a simple weighted average.
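For concreteness, here is a minimal sketch of the weighted-average estimate
Sean describes in his first and last answers: the predicted rating is the sum
of the neighbors' ratings weighted by their similarity to the target user,
normalized by the total absolute weight. The method and array names are
illustrative, not Mahout's actual recommender internals:

    // Estimate a user's preference for an item as the similarity-weighted
    // average of the neighbors' ratings for that item. Illustrative only.
    static double estimatePreference(double[] neighborRatings,
                                     double[] neighborSimilarities) {
      double weightedSum = 0.0;
      double totalWeight = 0.0;
      for (int n = 0; n < neighborRatings.length; n++) {
        weightedSum += neighborSimilarities[n] * neighborRatings[n];
        totalWeight += Math.abs(neighborSimilarities[n]);
      }
      // No neighbor had a usable similarity: no estimate is possible.
      if (totalWeight == 0.0) {
        return Double.NaN;
      }
      return weightedSum / totalWeight;
    }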
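Sean's "sample mean minus sample standard deviation" idea can likewise be
sketched in a few lines. Ranking by this score penalizes items whose estimate
rests on widely scattered predictions; again, this is an illustration of the
idea, not Mahout code:

    // Score an item by a rough lower bound on its real preference: the mean
    // of the neighbor predictions minus their standard deviation.
    static double lowerBoundScore(double[] predictions) {
      int n = predictions.length;
      double mean = 0.0;
      for (double p : predictions) {
        mean += p;
      }
      mean /= n;
      double variance = 0.0;
      for (double p : predictions) {
        variance += (p - mean) * (p - mean);
      }
      variance /= n; // or (n - 1) for the unbiased sample variance
      // Note: a single prediction has deviation 0 and gets no penalty here,
      // so a fuller implementation might add a small-sample adjustment.
      return mean - Math.sqrt(variance);
    }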
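On the caching point, wrapping a similarity in CachingUserSimilarity looks
roughly like this under the Mahout 0.x Taste API (the data file path is a
placeholder):

    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.similarity.CachingUserSimilarity;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class CachedSimilarityExample {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));
        // Cache user-user similarity values so evaluators and recommenders
        // don't recompute them for every estimate.
        UserSimilarity similarity = new CachingUserSimilarity(
            new PearsonCorrelationSimilarity(model), model);
        System.out.println(similarity.userSimilarity(1L, 2L));
      }
    }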
