I don't see why documents with only one word in common should have a similarity of 1.0 in RowSimilarityJob. consider() is only invoked if you specify a threshold for the similarity.
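To spell out the degenerate case: if the similarity is computed over
matching entries only, two documents that share just a single term t
reduce to the one-element vectors (a_t) and (b_t), and

  sim = (a_t * b_t) / (sqrt(a_t^2) * sqrt(b_t^2)) = 1.0

no matter what the counts a_t and b_t are. (This is bangbig's formula
from further down the thread, restated for the one-common-word case;
the cosine over all entries does not collapse like this.)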
UncenteredCosineSimilarity works on matching entries only, which is
problematic for documents, because for documents the empty entries carry
meaning (0 term occurrences), unlike in collaborative filtering data.
Maybe we should remove UncenteredCosine and create another similarity
implementation that computes the cosine correctly over all entries.

--sebastian

On 02.10.2012 10:08, yamo93 wrote:
> Hello Seb,
>
> In my understanding, the algorithm is the same (except for the
> normalization part) as UncenteredCosine (with the drawback that vectors
> with only one word in common have a similarity of 1.0) ... but the
> results are quite different (is this just an effect of the consider()
> method, which removes irrelevant values?) ...
>
> I looked at the code, but there is almost nothing in
> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity;
> the code seems to be in SimilarityReducer, which is not so simple to
> understand ...
>
> Thanks for helping,
>
> On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
>> The cosine similarity as computed by RowSimilarityJob is the cosine
>> similarity between the whole vectors.
>>
>> See
>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
>> for details.
>>
>> First, both vectors are scaled to unit length in normalize(), and
>> after this their dot product in similarity() (which can be computed
>> from the elements that exist in both vectors) gives the cosine
>> between them.
>>
>> On 01.10.2012 21:52, bangbig wrote:
>>> I think it's better to first understand how RowSimilarityJob gets
>>> its result.
>>> For two items,
>>> itemA, 0, 0, a1, a2, a3, 0
>>> itemB, 0, b1, b2, b3, 0, 0
>>> when computing, it just uses the overlapping parts of the vectors
>>> (a1, a2 and b2, b3).
>>> The cosine similarity thus is
>>> (a1*b2 + a2*b3) / (sqrt(a1*a1 + a2*a2) * sqrt(b2*b2 + b3*b3)).
>>> 1) if itemA and itemB have just one common word, the result is 1;
>>> 2) if the values of the vectors are almost the same, the result
>>> would also be nearly 1;
>>> and for the two cases above, I think you could consider using
>>> association rules to address the problem.
>>>
>>> At 2012-10-01 20:53:16, yamo93 <[email protected]> wrote:
>>>> It seems that RowSimilarityJob does not have the same weakness,
>>>> even though I also use CosineSimilarity. Why?
>>>>
>>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>>> Yes, this is one of the weaknesses of this particular flavor of
>>>>> this particular similarity metric. The sparser the data, the worse
>>>>> the problem is in general. There are some band-aid solutions, like
>>>>> applying some kind of weight against similarities that are based
>>>>> on a small intersection size. Or you can pretend that missing
>>>>> values are 0 (PreferenceInferrer), which can introduce its own
>>>>> problems, or perhaps use some mean value.
>>>>>
>>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <[email protected]> wrote:
>>>>>> Thanks for replying.
>>>>>>
>>>>>> So, documents with only one word in common have a better chance
>>>>>> of being similar than documents with more words in common, right?
>>>>>>
>>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>>> Similar items, right? You should look at the vectors that have
>>>>>>> 1.0 similarity and see if they are in fact collinear. This is
>>>>>>> still by far the most likely explanation. Remember that the
>>>>>>> vector similarity is computed over elements that exist in both
>>>>>>> vectors only. They just have to have 2 identical values for this
>>>>>>> to happen.
>>>>>>>
>>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <[email protected]> wrote:
>>>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>>>> It sounds like a bug somewhere.
>>>>>>>>
>>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>>> It's possible this is correct. 1.0 is the maximum similarity
>>>>>>>>> and occurs when two vectors are just a scalar multiple of each
>>>>>>>>> other (0 angle between them). It's possible there are several
>>>>>>>>> of these, and so their 1.0 similarities dominate the result.
>>>>>>>>>
>>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <[email protected]> wrote:
>>>>>>>>>> I saw something strange: all recommended items, returned by
>>>>>>>>>> mostSimilarItems(), have a value of 1.0.
>>>>>>>>>> Is it normal?
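For anyone who wants to check the difference concretely, here is a
minimal, self-contained sketch. This is plain Java, not Mahout's actual
implementation; the class name, vectors and numbers are illustrative
only, and zero entries stand for absent terms.

public class CosineSketch {

  // Full cosine: dot product over all entries divided by the product of
  // the full vector norms. Absent (zero) entries add nothing to the dot
  // product, so it can equally be computed from the entries present in
  // both vectors -- normalize to unit length first, then the dot product
  // over common entries is the cosine, as Sebastian describes above.
  static double fullCosine(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  // "Uncentered cosine over matching entries only": the norms are taken
  // over the co-occurring entries alone, which is what inflates
  // similarities for sparse document vectors.
  static double matchingEntriesCosine(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      if (a[i] != 0 && b[i] != 0) { // entry present in both vectors
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
      }
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    // bangbig's layout: itemA and itemB overlap in positions 2 and 3.
    double[] itemA = {0, 0, 3, 1, 2, 0};
    double[] itemB = {0, 4, 1, 2, 0, 0};
    System.out.println(fullCosine(itemA, itemB));            // ~0.29
    System.out.println(matchingEntriesCosine(itemA, itemB)); // ~0.71

    // Two documents with a single word in common: the matching-entries
    // variant is pinned at 1.0 regardless of the rest of the vectors.
    double[] docA = {5, 0, 7, 0};
    double[] docB = {2, 3, 0, 0};
    System.out.println(matchingEntriesCosine(docA, docB));   // 1.0
    System.out.println(fullCosine(docA, docB));              // ~0.32
  }
}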

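And a sketch of the "band-aid" Sean mentions: damping similarities that
rest on a small intersection. The n/(n+k) shrink used here is one common
weighting choice, not something Mahout ships; the class, method and
constant are hypothetical.

public class IntersectionDamping {

  // Shrink a similarity toward 0 when only few entries co-occur;
  // k controls how many common entries are needed to keep the score.
  static double damp(double sim, int intersectionSize, int k) {
    return sim * intersectionSize / (double) (intersectionSize + k);
  }

  public static void main(String[] args) {
    int k = 5; // illustrative constant; larger k penalizes small overlaps more
    // A "perfect" 1.0 built on one common term is pulled down hard,
    // while the same score backed by 50 common terms barely moves.
    System.out.println(damp(1.0, 1, k));  // ~0.17
    System.out.println(damp(1.0, 50, k)); // ~0.91
  }
}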