This is a good discussion of the issue:
https://issues.apache.org/jira/browse/MAHOUT-898

Negative weights are problematic. I think taking the absolute value
gives slightly less explainable results, but that's a matter of taste.
For example, a rating of 3 weighted by -4 results in a prediction of
-3. It's not clear that -3 represents "the opposite of 3", and on a
1-5 rating scale, for example, it doesn't. A negative weight is really
a vote to be infinitely far from a value, and that is weird. Don't do
it.
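To make the arithmetic concrete, here is a minimal, hypothetical sketch
in plain Java (the class and method names are invented for illustration;
this is not Mahout's code) contrasting the two normalizations when a
single similarity weight is negative:

    // Hypothetical demo, not from Mahout: contrasts a signed-sum
    // denominator with an absolute-value denominator for one neighbor
    // whose similarity weight is negative.
    public class NegativeWeightDemo {

      // Weighted average, normalized by the signed sum of similarities.
      static float estimateSigned(float[] ratings, float[] sims) {
        float preference = 0.0f;
        float totalSimilarity = 0.0f;
        for (int i = 0; i < ratings.length; i++) {
          preference += sims[i] * ratings[i];
          totalSimilarity += sims[i];
        }
        return preference / totalSimilarity;
      }

      // Same numerator, normalized by the sum of |similarity| instead.
      static float estimateAbs(float[] ratings, float[] sims) {
        float preference = 0.0f;
        float totalAbsSimilarity = 0.0f;
        for (int i = 0; i < ratings.length; i++) {
          preference += sims[i] * ratings[i];
          totalAbsSimilarity += Math.abs(sims[i]);
        }
        return preference / totalAbsSimilarity;
      }

      public static void main(String[] args) {
        float[] ratings = {3.0f};
        float[] sims = {-4.0f};
        // Signed denominator: (-4 * 3) / -4 = 3.0 (the signs cancel).
        System.out.println(estimateSigned(ratings, sims));
        // Absolute denominator: (-4 * 3) / 4 = -3.0, outside a 1-5 scale.
        System.out.println(estimateAbs(ratings, sims));
      }
    }

Neither answer is satisfying, which is the point: a negative weight has
no clean interpretation as a vote for any value on the rating scale.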
On Mon, Nov 26, 2012 at 9:51 PM, Evgeny Karataev <[email protected]> wrote:

> Thank you Sean and Paulo.
>
> Paulo, I guess in my original email I meant what you said in your last
> email (about rating normalization). So that part is not done.
>
> I've looked at the code:
> https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/recommender/GenericItemBasedRecommender.java#L230
>
> and the formula looks almost exactly like formula 4.12 in "A
> Comprehensive Survey of Neighborhood-based Recommendation Methods"
> (http://www.springerlink.com/content/n3jq77686228781n/). However, the
> difference is that you divide the weighted preference by
> totalSimilarity:
>
> ...
> // Weights can be negative!
> preference += theSimilarity * preferencesFromUser.getValue(i);
> totalSimilarity += theSimilarity;
> ...
> float estimate = (float) (preference / totalSimilarity);
> ...
>
> whereas, in other papers, the denominator is the sum of the absolute
> values of the similarities.
>
> If I am not mistaken, and as the comment in the code states, the
> weights (similarities) can be negative. And they might actually sum to
> 0. Then you would divide preference by 0. What would be the estimate
> in that case?
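In Java the unguarded division does not throw: a nonzero float divided
by 0.0f is positive or negative Infinity, and 0.0f / 0.0f is NaN, so a
caller that never checks can silently rank such items first or last. A
guard might look like this hypothetical helper; the parameter names
echo the snippet quoted above, but this is not Mahout's source:

    // Hypothetical guard for a zero or near-zero similarity sum, where
    // the plain division would yield Infinity or NaN.
    static float safeEstimate(float preference, float totalSimilarity) {
      if (Math.abs(totalSimilarity) < 1.0e-6f) {
        return Float.NaN;  // signal "no estimate possible" to the caller
      }
      return preference / totalSimilarity;
    }

Normalizing by the sum of absolute similarity values, as the papers do,
also avoids the cancellation that makes a zero denominator likely in
the first place.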
> On Mon, Nov 26, 2012 at 4:32 PM, Paulo Villegas <[email protected]> wrote:
>
>> > What do you mean here? You never need to actually subtract the mean
>> > from the data. The similarity metric's math is just adjusted to work
>> > as if it were. So no, there is no idea of adding back a mean. I
>> > don't think there's something not implemented.
>>
>> No, not about the similarity metric: as I said, the computation of the
>> similarity metric *is* centered (or can be; the code has that option).
>>
>> But once you have the similarities computed, you go on and use them to
>> predict the rating for unknown items. It's this rating prediction
>> where mean centering (or, more generally, rating normalization) is not
>> done and could be done.
>>
>> The papers mentioned in the original post explain it; I just searched
>> around and found another one that also mentions it:
>>
>> "An Empirical Analysis of Design Choices in Neighborhood-Based
>> Collaborative Filtering Algorithms"
>>
>> (googling it will give you a PDF right away). The rating prediction is
>> Equation 1, and there you can see what I mean by mean centering in the
>> prediction.
>>
>> Basically, you use the similarities you have already computed as
>> weights for the averaging sum that creates the prediction, but those
>> weights do not multiply the bare ratings for the other items; they
>> multiply each rating's deviation from that user's average rating
>> (Equation 1 is for the user-based case).
>>
>> The rationale is that each user's scale is different and tends to
>> cluster ratings around a different mean. By subtracting that mean, we
>> bring into the equation only the user's perceived difference between
>> that item and her average opinion, and factor out the user's mean
>> opinion (which would otherwise introduce some bias). Then we add back
>> to the result the average rating of the target user, which restores
>> the normal range for the prediction, but this time using the target
>> user's own bias. This helps to achieve predictions more in line with
>> the target user's own scale.
>>
>> The same paper explains it later on (more eloquently than me :-) in
>> section 7.1, in the more general context of rating normalization (also
>> proposing z-score as a more elaborate choice, and evaluating the
>> results).
>>
>> Paulo
>>
>> On 26/11/12 21:51, Sean Owen wrote:
>>>
>>> On Mon, Nov 26, 2012 at 8:20 PM, Paulo Villegas <[email protected]> wrote:
>>>
>>>> The thing is, in an Item- or User-based neighborhood recommender,
>>>> there's more than one thing that can be centered :-)
>>>>
>>>> What those papers talk about (from memory; it's been a while since I
>>>> last read them, and I don't have them at hand now) is centering the
>>>> preference around the user's (or item's) average before entering it
>>>> in the neighborhood formula, and then moving the result back to its
>>>> usual range by adding back the average preference (this time for the
>>>> target item or user).
>>>>
>>>> This is something that the code in Mahout does not currently do. You
>>>> can check for yourself; the formula is pretty straightforward:
>
> --
> Best Regards,
> Evgeny Karataev
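Putting Paulo's description into code: a minimal, hypothetical sketch
of the user-based, mean-centered prediction he outlines (in the style
of Equation 1 of the paper he cites; none of these names come from
Mahout). The denominator uses absolute similarity values, which also
sidesteps the zero-sum case raised earlier in the thread:

    // Hypothetical sketch of user-based prediction with mean centering;
    // not Mahout code. Each weight multiplies the neighbor's deviation
    // from that neighbor's own mean rating, and the target user's mean
    // is added back at the end to restore the rating scale.
    public class MeanCenteredPrediction {

      static double predict(double targetMean,
                            double[] neighborSims,
                            double[] neighborRatings,
                            double[] neighborMeans) {
        double weightedDeviation = 0.0;
        double totalAbsSim = 0.0;
        for (int v = 0; v < neighborSims.length; v++) {
          // Center each neighbor's rating on that neighbor's own mean.
          weightedDeviation +=
              neighborSims[v] * (neighborRatings[v] - neighborMeans[v]);
          totalAbsSim += Math.abs(neighborSims[v]);
        }
        if (totalAbsSim == 0.0) {
          return Double.NaN;  // no usable neighbors
        }
        // Add the target user's mean back in.
        return targetMean + weightedDeviation / totalAbsSim;
      }

      public static void main(String[] args) {
        // The target user averages 2.5. One neighbor with similarity 0.8
        // rated the item 5 against a personal mean of 4, so the deviation
        // is +1 and the prediction is 2.5 + 1.0 = 3.5.
        System.out.println(predict(2.5,
            new double[] {0.8},
            new double[] {5.0},
            new double[] {4.0}));
      }
    }

The example shows why centering matters: the raw weighted average would
predict 5, far above anything the target user ever gives, while the
centered version predicts 3.5, high for this user but on her own scale.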
