On Wed, Jul 6, 2011 at 10:02 PM, Carlos Seminario <[email protected]> wrote:

> Although this is certainly a sound approach, other approaches have been
> suggested in the literature as cited in
> https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+Documentation.
> Can you please provide some insight as to why you selected the above
> prediction calculation approach for Mahout?
This is a simple weighted average -- the simplest and most canonical thing to do, and what the literature suggests. I can imagine several other things you could do here, and you're welcome to modify the code to do them. The framework can't implement every possible variation at every point, so it generally provides the basic, simple pieces and invites you to extend or modify them as you like. These implementations reflect the "canonical" way of doing things, not my own inventions.

But I do have plenty of other ideas for you if you like. For example: in this formulation, the estimated preference value used to rank recommendations is the mean of all the independent predictions. That's quite sensible: the implicit assumption is that these predictions follow some normal distribution whose mean is the "real" preference for that item, so the sample mean is as good an estimate as any of that real preference. One problem is that this takes no account of how certain you are that the sample mean is close to the real mean. The mean of 100 predictions is probably more reliable than the mean of 1, right? With 100, you know the population mean is far more likely to be close to the sample mean. You could use this idea directly by ranking by sample mean minus sample standard deviation, instead of just the sample mean. That's not an estimate of the actual preference, but a sort of lower bound that the preference is probably larger than. (There's a toy sketch of this at the end of this message.)

> I also noticed that Mahout has implemented the following
> PearsonCorrelationSimilarity weighting when the WEIGHTED parameter is used
> in the similarity constructor:
>
> Would you please provide some insight as to why you decided to use this
> weighting approach?

This is somewhat made up. There is no strong mathematical justification for it. I can explain the intuition behind why it's sensible, but I think the honest answer is that it's a crude adjustment to a similarity metric you probably won't use anyway, yet one so well known that it needs to be supported.

> It appears that Mahout calculates similarities between users to determine
> the neighborhood and then again during the prediction calculation. When
> running an evaluator (e.g., DifferenceRecommenderEvaluator), I can see that
> the user similarities are computed repeatedly for each user. Is there a
> reason why it was implemented this way? ("time vs space" tradeoff?)

UserSimilarity implementations just compute user-user similarity; caching is a separate concern. If you want the values cached, wrap the similarity in CachingUserSimilarity (there's a short snippet at the end of this message).

> Can you provide some insight as to why you decided to use this approach?
> Were there any other approaches you considered but rejected, and if so,
> why did you reject them?

Same as #1, this is just a simple weighted average.
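In case it's useful, here is what the caching wrap mentioned above looks like in practice. A minimal sketch -- the class name and "ratings.csv" are placeholders for your own setup:

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.CachingUserSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class CachedSimilarityExample {
  public static void main(String[] args) throws Exception {
    // "ratings.csv" is a placeholder for your own preference data file
    DataModel model = new FileDataModel(new File("ratings.csv"));
    // Wrapping in CachingUserSimilarity caches user-user similarity values,
    // so the neighborhood computation and the prediction step don't
    // recompute the same pair twice
    UserSimilarity similarity =
        new CachingUserSimilarity(new PearsonCorrelationSimilarity(model), model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);
    System.out.println(recommender.recommend(1L, 3)); // top 3 items for user 1
  }
}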

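And here is a standalone toy sketch of the two ranking ideas from #1: the plain similarity-weighted average, and ranking by sample mean minus sample standard deviation. It is illustration only, not the framework's code -- in particular, dividing by the sum of absolute similarities is just one common way to keep negative similarities from flipping the sign:

// Standalone sketch of the two ranking ideas above -- not Mahout code.
public class EstimateSketch {

  // The default idea: a similarity-weighted average of neighbors'
  // preferences. Dividing by the sum of absolute similarities (so negative
  // similarities don't flip the sign) is one common choice here.
  static double weightedAverage(double[] similarities, double[] prefs) {
    double numerator = 0.0;
    double denominator = 0.0;
    for (int i = 0; i < similarities.length; i++) {
      numerator += similarities[i] * prefs[i];
      denominator += Math.abs(similarities[i]);
    }
    return numerator / denominator;
  }

  // The variant: treat each neighbor's preference as an independent
  // prediction and rank by sample mean minus sample standard deviation --
  // a crude lower bound that penalizes items with few or noisy predictions.
  static double meanMinusStdDev(double[] prefs) {
    int n = prefs.length;
    double sum = 0.0;
    for (double p : prefs) {
      sum += p;
    }
    double mean = sum / n;
    double sumSq = 0.0;
    for (double p : prefs) {
      sumSq += (p - mean) * (p - mean);
    }
    double stdDev = n > 1 ? Math.sqrt(sumSq / (n - 1)) : 0.0;
    return mean - stdDev;
  }

  public static void main(String[] args) {
    // Item A: many consistent predictions. Item B: two wildly different ones.
    double[] prefsA = {4.0, 4.5, 4.0, 4.2, 4.1};
    double[] prefsB = {5.0, 2.0};
    double[] simsA = {0.9, 0.8, 0.7, 0.6, 0.5};
    System.out.println("A weighted: " + weightedAverage(simsA, prefsA));
    System.out.println("A ranked:   " + meanMinusStdDev(prefsA)); // near its mean
    System.out.println("B ranked:   " + meanMinusStdDev(prefsB)); // penalized heavily
  }
}

Ranking by meanMinusStdDev instead of the plain mean pushes down items whose estimates rest on a few noisy predictions, which is exactly the certainty adjustment described above.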