This is an incredibly tiny dataset.  If you delete singletons, it is likely
to get significantly smaller.

I think that something like LDA might work much better for you. It was
designed to work on small data like this.
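
A sketch of what I mean: fit topics to each user's history with whatever
LDA implementation you like, then compare users on their dense topic
distributions instead of on the raw sparse counts (illustrative Java, not
a specific Mahout API):

    class TopicSimilaritySketch {
      // Cosine between two users' LDA topic distributions. Dense topic
      // vectors behave much better than sparse count vectors on tiny data.
      static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
          dot += a[i] * b[i];
          na += a[i] * a[i];
          nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
      }
    }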


On Tue, Sep 30, 2014 at 11:13 AM, Parimi Rohit <[email protected]>
wrote:

> Ted, thanks for your response. The following is the information about the
> approach and the datasets:
>
> I am using the ItemSimilarityJob and passing it "itemID, userID,
> prefCount" tuples as input to compute user-user similarity using LLR. I
> took this approach from a response to one of the Stack Overflow questions
> on computing user similarity with Mahout.
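>
> For reference, the invocation is roughly the following (a sketch, not my
> exact code; the paths and the similarity cap are placeholders):
>
>     import org.apache.hadoop.util.ToolRunner;
>     import org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob;
>
>     public class UserSimilarityViaItemJob {
>       public static void main(String[] args) throws Exception {
>         // Input rows are "itemID,userID,prefCount" instead of the usual
>         // "userID,itemID,pref": with the first two columns swapped, the
>         // job treats users as "items", so its item-item output is really
>         // user-user similarity.
>         ToolRunner.run(new ItemSimilarityJob(), new String[] {
>             "--input", "/path/to/swapped-prefs.csv",        // placeholder
>             "--output", "/path/to/user-user-similarities",  // placeholder
>             "--similarityClassname", "SIMILARITY_LOGLIKELIHOOD",
>             "--maxSimilaritiesPerItem", "50"
>         });
>       }
>     }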
>
>
> Following are the stats for the datasets:
>
> Coauthor dataset:
>
> users = 29189
> items =  140091
> averageItemsClicked = 15.808660796875536
>
> Conference Dataset:
>
> users = 29189
> items =  2393
> averageItemsClicked = 7.265099866388023
>
> Reference Dataset:
>
> users = 29189
> items =  201570
> averageItemsClicked = 61.08564870327863
>
> By scale, did you mean rating scale? If so, I am using preference counts,
> not ratings.
>
> Thanks,
> Rohit
>
>
> On Tue, Sep 30, 2014 at 12:08 AM, Ted Dunning <[email protected]>
> wrote:
>
> > How are you using LLR to compute user similarity?  It is normally used to
> > compute item similarity.
> >
> > Also, what is your scale?  How many users?  How many items?  How many
> > actions per user?
> >
> >
> >
> > On Mon, Sep 29, 2014 at 6:24 PM, Parimi Rohit <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > I am exploring a random-walk based algorithm for recommender systems
> > > that works by propagating the item preferences for users on the
> > > user-user graph. To do this, I have to compute user-user similarity
> > > and form a neighborhood.
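> > >
> > > One propagation step, in a simplified sketch (illustrative Java, not
> > > the actual implementation):
> > >
> > >     import java.util.HashMap;
> > >     import java.util.Map;
> > >
> > >     class PropagationSketch {
> > >       // One propagation step: a user's item scores are accumulated
> > >       // from similar users' scores, weighted by user-user similarity.
> > >       static Map<Long, Double> propagateStep(
> > >           Map<Long, Double> neighborSimilarity,            // userID -> similarity
> > >           Map<Long, Map<Long, Double>> itemScoresByUser) { // userID -> itemID -> score
> > >         Map<Long, Double> propagated = new HashMap<Long, Double>();
> > >         for (Map.Entry<Long, Double> n : neighborSimilarity.entrySet()) {
> > >           Map<Long, Double> scores = itemScoresByUser.get(n.getKey());
> > >           if (scores == null) {
> > >             continue;
> > >           }
> > >           for (Map.Entry<Long, Double> s : scores.entrySet()) {
> > >             Double old = propagated.get(s.getKey());
> > >             double add = n.getValue() * s.getValue();
> > >             propagated.put(s.getKey(), old == null ? add : old + add);
> > >           }
> > >         }
> > >         return propagated;
> > >       }
> > >     }
> > >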
> > > I have tried the following three simple techniques to compute the score
> > > between two users and find the neighborhood.
> > >
> > > 1. Score = (common items between users A and B) / (items preferred by
> > > A + items preferred by B); see the sketch after this list
> > > 2. Scoring based on Mahout's CosineSimilarity
> > > 3. Scoring based on Mahout's LogLikelihoodSimilarity
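> > >
> > > Technique 1, concretely (a minimal sketch):
> > >
> > >     import java.util.HashSet;
> > >     import java.util.Set;
> > >
> > >     class OverlapScore {
> > >       // score = common items / (items preferred by A + items preferred by B)
> > >       static double score(Set<Long> itemsA, Set<Long> itemsB) {
> > >         Set<Long> common = new HashSet<Long>(itemsA);
> > >         common.retainAll(itemsB);
> > >         return (double) common.size() / (itemsA.size() + itemsB.size());
> > >       }
> > >     }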
> > >
> > > My understanding is that similarity based on log likelihood is more
> > > robust; however, I get better results using the naive approach
> > > (technique 1 from the above list). The problems I am addressing are
> > > collaborator recommendation, conference recommendation, and reference
> > > recommendation, and the data has implicit feedback.
> > >
> > > So, my question is: are there any cases where the cosine similarity and
> > > log-likelihood metrics fail (to capture similarity)? For example, for
> > > the problems stated above, users only collaborate with a few other
> > > users (based on area of interest), publish in only a few conferences
> > > (again based on area of interest), and refer to publications in a
> > > specific domain. So the preference counts are fairly small compared to
> > > other domains (music/video etc.).
> > >
> > > Secondly, for CosineSimilarity, should I treat the preferences as
> > > boolean or use the counts? (I think the log-likelihood metric does not
> > > take the preference counts into account... correct me if I am wrong.)
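> > >
> > > In code, the two options I am comparing look roughly like this (a
> > > sketch; "prefs.csv" is a placeholder):
> > >
> > >     import java.io.File;
> > >     import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
> > >     import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
> > >     import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
> > >     import org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity;
> > >     import org.apache.mahout.cf.taste.model.DataModel;
> > >     import org.apache.mahout.cf.taste.similarity.UserSimilarity;
> > >
> > >     class SimilaritySketch {
> > >       public static void main(String[] args) throws Exception {
> > >         DataModel withCounts = new FileDataModel(new File("prefs.csv"));
> > >         // Boolean view of the same data: only "user touched item" survives.
> > >         DataModel booleanOnly = new GenericBooleanPrefDataModel(
> > >             GenericBooleanPrefDataModel.toDataMap(withCounts));
> > >
> > >         // Cosine over the raw counts, vs. log-likelihood, which ignores
> > >         // preference values and uses only co-occurrence.
> > >         UserSimilarity cosine = new UncenteredCosineSimilarity(withCounts);
> > >         UserSimilarity llr = new LogLikelihoodSimilarity(booleanOnly);
> > >
> > >         System.out.println(cosine.userSimilarity(1L, 2L));
> > >         System.out.println(llr.userSimilarity(1L, 2L));
> > >       }
> > >     }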
> > >
> > > Any insight into this is much appreciated.
> > >
> > > Thanks,
> > > Rohit
> > >
> > > p.s. Ted, Pat: I am following the discussion on the thread
> > > "LogLikelihoodSimilarity Calculation" and your answers helped me a lot
> > > to understand how it works and made me wonder why things are different
> > > in my case.
> > >
> >
>
