Hi, I am exploring a random-walk based algorithm for recommender systems that works by propagating item preferences across the user-user graph. To do this, I need to compute user-user similarity and form a neighborhood for each user. I have tried the following three simple techniques to score a pair of users and build the neighborhood:
1. Score = (number of items common to users A and B) / (number of items preferred by A + number of items preferred by B)
2. Scoring based on Mahout's CosineSimilarity
3. Scoring based on Mahout's LogLikelihoodSimilarity

My understanding is that the log-likelihood-based similarity is more robust; however, I get better results with the naive approach (technique 1 above; a short sketch of how I compute it is at the bottom of this mail).

The problems I am addressing are collaborator recommendation, conference recommendation and reference recommendation, and the data has implicit feedback. So my first question is: are there cases where the cosine and log-likelihood metrics fail to capture similarity? For the problems above, users collaborate with only a few other users (based on their area of interest), publish in only a few conferences (again based on area of interest), and cite publications in a specific domain, so the preference counts are fairly small compared to other domains (music, video, etc.).

Secondly, for CosineSimilarity, should I treat the preferences as boolean or use the counts? (I think the log-likelihood metric does not take the preference counts into account; please correct me if I am wrong.)

Any insight into this is much appreciated.

Thanks,
Rohit

P.S. Ted, Pat: I am following the discussion on the thread "LogLikelihoodSimilarity Calculation", and your answers helped me a lot in understanding how it works, which made me wonder why things are different in my case.
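
P.P.S. For completeness, here is roughly how I compute the naive score (technique 1). This is only a minimal sketch: it assumes each user's preferences are already loaded as a set of item IDs (the class name and the toy data below are made up for illustration), and it treats the preferences as boolean, which is also why I am unsure what to do with the counts in the cosine case.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NaiveUserScore {

    // Technique 1: score(A, B) = |items(A) ∩ items(B)| / (|items(A)| + |items(B)|)
    // Preferences are treated as boolean, i.e. plain sets of item IDs.
    public static double score(Set<Long> itemsOfA, Set<Long> itemsOfB) {
        if (itemsOfA.isEmpty() || itemsOfB.isEmpty()) {
            return 0.0;
        }
        Set<Long> common = new HashSet<Long>(itemsOfA);
        common.retainAll(itemsOfB);
        return (double) common.size() / (itemsOfA.size() + itemsOfB.size());
    }

    public static void main(String[] args) {
        // Toy example: two researchers with overlapping collaborators.
        Set<Long> a = new HashSet<Long>(Arrays.asList(1L, 2L, 3L, 4L));
        Set<Long> b = new HashSet<Long>(Arrays.asList(3L, 4L, 5L));
        System.out.println(score(a, b)); // 2 / (4 + 3) ≈ 0.286
    }
}

(As far as I can tell, this is just the Dice coefficient scaled by 1/2, since Dice = 2|A ∩ B| / (|A| + |B|).) I then form each user's neighborhood from the highest-scoring other users before running the walk.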
