Thank you so much, Sean.  It was great to get confirmation from you
regarding the choice of algorithm.

As suggested, I used the following params:

            similarityJob.run(new String[]{
                "--tempDir", tmpDir.getAbsolutePath(),
                "--similarityClassname", CooccurrenceCountSimilarity.class.getName(),
                "--booleanData", String.valueOf(Boolean.TRUE)});

and got output!!!!  Hooray.
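
For anyone following along, here is a minimal plain-Java sketch of why this
combination produces output while Pearson did not. It is not Mahout's code:
the class and method names are made up for illustration, and it assumes that
co-occurrence similarity over boolean data amounts to counting the users who
chose both groups. Pearson, by contrast, divides by the variance of the
ratings, which is zero when every rating is 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustration only: hypothetical names, not Mahout's implementation.
final class CooccurrenceSketch {

    // For each unordered group pair "a,b" (a sorts before b),
    // count the users who chose both groups.
    static Map<String, Integer> cooccurrenceCounts(Map<String, Set<String>> groupsByUser) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Set<String> groups : groupsByUser.values()) {
            List<String> sorted = new ArrayList<>(new TreeSet<>(groups));
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    counts.merge(sorted.get(i) + "," + sorted.get(j), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Pearson correlation of two equal-length rating vectors.
    // Evaluates to NaN when either vector has zero variance (0.0 / 0.0).
    static double pearson(double[] x, double[] y) {
        double meanX = Arrays.stream(x).average().orElse(Double.NaN);
        double meanY = Arrays.stream(y).average().orElse(Double.NaN);
        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Made-up sample data: user -> groups the user chose.
        Map<String, Set<String>> groupsByUser = new HashMap<>();
        groupsByUser.put("u1", new TreeSet<>(Arrays.asList("g1", "g2")));
        groupsByUser.put("u2", new TreeSet<>(Arrays.asList("g1", "g2", "g3")));
        groupsByUser.put("u3", new TreeSet<>(Arrays.asList("g2", "g3")));

        // prints {g1,g2=2, g1,g3=1, g2,g3=2}
        System.out.println(cooccurrenceCounts(groupsByUser));

        // Every rating is 1, so the variance is zero: prints NaN
        System.out.println(pearson(new double[]{1, 1, 1}, new double[]{1, 1, 1}));
    }
}
```

The nested loop is quadratic in the number of groups per user, which is fine
for a toy example but is one reason heavy users can become a memory and
runtime concern at scale.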

Question:  Is CooccurrenceCountSimilarity the best choice in this case?


Anyway, now I am going to try this on our production cluster with billions of
lines.  Last time I tried, I ran into OutOfMemoryErrors.  Any
suggestions regarding memory settings?

Thanks once again for your help.


On Fri, Jun 22, 2012 at 11:08 PM, Sean Owen <[email protected]> wrote:

> Using 1 is just fine for the reasons you give. You would be surprised how
> OK it is to use this even for dislikes. In fact just omit the third field
> in your CSV.
>
> However, you need to set the boolean data flag and choose a similarity
> metric that is defined over such data. Pearson and cosine, for example, are
> not, since every value is 1. This is why there is no output.
> On Jun 23, 2012 1:33 AM, "Something Something" <[email protected]>
> wrote:
>
> > I tested my setup of ItemSimilarityJob using the MovieLens dataset & got
> > the expected results.  It looks like my setup is good.
> >
> > Here's what I have:
> >
> > I have data coming in the following format: UserId, GroupId, Frequency (how
> > many times the user chose the group), Max timestamp (the last time the user
> > chose the group).
> >
> > Based on this dataset we need to figure out which groups look alike. I
> > decided to use "item based collaborative filtering" but I have 3 concerns:
> >
> > 1)  We don't have any knowledge of "Dislikes"; we only know which groups
> > users "Like".
> > 2)  We don't really have ratings. In other words, users don't rate the
> > group. Either they choose OR they don't.
> > 3)  Frequency doesn't really imply interest level.
> >
> >
> > I decided to try 'ItemSimilarityJob' by using a CSV file in the following
> > format:
> >
> > UserId, GroupId, "1"
> >
> > In other words, the rating value is always 1.  There are NO rows with value
> > "0".  This is producing NO OUTPUT, but the job finishes successfully.
> >
> > Is this the right way to solve the problem?  Is there some other algorithm
> > that I should be using?  Thanks for the help.
> >
>
