Ted,

Thanks for your response. Following is the information about the approach and the datasets:
I am using the ItemSimilarityJob and passing it "itemID, userID, prefCount"
tuples as input to compute user-user similarity using LLR. I picked up this
approach from an answer to one of the Stack Overflow questions on calculating
user similarity with Mahout. (A small sketch of this setup is included at the
end of this message.)

Here are the stats for the datasets:

Coauthor dataset:   users = 29189, items = 140091, averageItemsClicked = 15.808660796875536
Conference dataset: users = 29189, items = 2393,   averageItemsClicked = 7.265099866388023
Reference dataset:  users = 29189, items = 201570, averageItemsClicked = 61.08564870327863

By scale, did you mean rating scale? If so, I am using preference counts, not
ratings.

Thanks,
Rohit

On Tue, Sep 30, 2014 at 12:08 AM, Ted Dunning <[email protected]> wrote:

> How are you using LLR to compute user similarity? It is normally used to
> compute item similarity.
>
> Also, what is your scale? How many users? How many items? How many
> actions per user?
>
> On Mon, Sep 29, 2014 at 6:24 PM, Parimi Rohit <[email protected]>
> wrote:
>
> > Hi,
> >
> > I am exploring a random-walk based algorithm for recommender systems
> > which works by propagating the item preferences for users over the
> > user-user graph. To do this, I have to compute user-user similarity and
> > form a neighborhood. I have tried the following three simple techniques
> > to compute the score between two users and find the neighborhood.
> >
> > 1. Score = (common items between users A and B) / (items preferred by A
> >    + items preferred by B)
> > 2. Scoring based on Mahout's Cosine Similarity
> > 3. Scoring based on Mahout's LogLikelihood similarity
> >
> > My understanding is that similarity based on LogLikelihood is more
> > robust; however, I get better results using the naive approach
> > (technique 1 from the list above). The problems I am addressing are
> > collaborator recommendation, conference recommendation, and reference
> > recommendation, and the data has implicit feedback.
> >
> > So, my question is: are there any cases where the cosine similarity and
> > loglikelihood metrics fail to capture similarity? For example, for the
> > problems stated above, users only collaborate with a few other users
> > (based on area of interest), publish in only a few conferences (again
> > based on area of interest), and refer to publications in a specific
> > domain. So, the preference counts are fairly small compared to other
> > domains (music/video, etc.).
> >
> > Secondly, for CosineSimilarity, should I treat the preferences as
> > boolean or use the counts? (I think the loglikelihood metric does not
> > take preference counts into account... correct me if I am wrong.)
> >
> > Any insight into this is much appreciated.
> >
> > Thanks,
> > Rohit
> >
> > p.s. Ted, Pat: I am following the discussion on the thread
> > "LogLikelihoodSimilarity Calculation" and your answers helped me a lot
> > to understand how it works and made me wonder why things are different
> > in my case.
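
For reference, here is a minimal sketch of getting a user-user LLR score out
of Mahout's in-memory Taste API. This is not the Hadoop ItemSimilarityJob
pipeline described above, just the single-machine equivalent; the class name,
file path, and user IDs below are placeholders I made up for illustration:

import java.io.File;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserLLRSketch {
  public static void main(String[] args) throws Exception {
    // CSV of "userID,itemID,prefCount" lines; the path is a placeholder.
    DataModel model = new FileDataModel(new File("coauthor.csv"));

    // LogLikelihoodSimilarity implements UserSimilarity directly, so a
    // user-user score can be computed without transposing the input the
    // way the ItemSimilarityJob route requires.
    UserSimilarity llr = new LogLikelihoodSimilarity(model);

    // LLR only looks at which items the two users have in common; the
    // preference counts themselves are ignored. User IDs are made up.
    double score = llr.userSimilarity(1L, 2L);
    System.out.println("LLR user-user similarity: " + score);
  }
}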

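And a sketch of the naive overlap score (technique 1 in the quoted message),
using plain Java sets of item IDs; the helper name and the toy item sets are
made up for illustration:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class OverlapScoreSketch {

  // Technique 1: score = |items(A) n items(B)| / (|items(A)| + |items(B)|)
  static double overlapScore(Set<Long> itemsA, Set<Long> itemsB) {
    Set<Long> common = new HashSet<>(itemsA);
    common.retainAll(itemsB);
    return (double) common.size() / (itemsA.size() + itemsB.size());
  }

  public static void main(String[] args) {
    // Toy example: the two users share items 3 and 4.
    Set<Long> a = new HashSet<>(Arrays.asList(1L, 2L, 3L, 4L));
    Set<Long> b = new HashSet<>(Arrays.asList(3L, 4L, 5L));
    System.out.println(overlapScore(a, b)); // 2 / (4 + 3) = 0.2857...
  }
}

For what it's worth, treating preferences as boolean for cosine amounts to
taking cosine of 0/1 vectors, which works out to |A n B| / sqrt(|A| * |B|),
so on implicit data it behaves much like the overlap score above.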