Yes, you are right!
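For reference, a minimal sketch (plain Java over dense double arrays; the class and method names are mine, not the actual Mahout source) of the whole-vector computation Sebastian describes below: scale each vector to unit length, then take the dot product. Because absent entries are 0, the dot product only needs the positions that are nonzero in both vectors, yet the result is still the cosine between the whole vectors.

public class WholeVectorCosine {

  // Scale v to unit length (the role normalize() plays in CosineSimilarity).
  // Assumes v is not the all-zero vector.
  static double[] normalize(double[] v) {
    double norm = 0.0;
    for (double x : v) {
      norm += x * x;
    }
    norm = Math.sqrt(norm);
    double[] unit = new double[v.length];
    for (int i = 0; i < v.length; i++) {
      unit[i] = v[i] / norm;
    }
    return unit;
  }

  // Dot product of the two unit vectors = cosine of the angle between them
  // (the role similarity() plays in CosineSimilarity).
  static double cosine(double[] a, double[] b) {
    double[] ua = normalize(a);
    double[] ub = normalize(b);
    double dot = 0.0;
    for (int i = 0; i < ua.length; i++) {
      dot += ua[i] * ub[i]; // positions where either entry is 0 add nothing
    }
    return dot;
  }

  public static void main(String[] args) {
    // bangbig's example vectors from further down the thread,
    // with a1..a3 = 1, 2, 3 and b1..b3 = 4, 5, 6.
    double[] itemA = {0, 0, 1.0, 2.0, 3.0, 0};
    double[] itemB = {0, 4.0, 5.0, 6.0, 0, 0};
    System.out.println(cosine(itemA, itemB)); // ~0.518, well below 1.0
  }
}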
At 2012-10-02 04:25:09, "Sebastian Schelter" <[email protected]> wrote:
>The cosine similarity as computed by RowSimilarityJob is the cosine
>similarity between the whole vectors.
>
>See
>org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
>for details.
>
>First, both vectors are scaled to unit length in normalize(), and then
>their dot product in similarity() (which can be computed from the
>elements that exist in both vectors) gives the cosine between them.
>
>On 01.10.2012 21:52, bangbig wrote:
>> I think it helps to understand how RowSimilarityJob gets the result.
>> For two items,
>> itemA: 0, 0, a1, a2, a3, 0
>> itemB: 0, b1, b2, b3, 0, 0
>> the computation uses only the overlapping parts of the vectors
>> (a1, a2 from itemA and b2, b3 from itemB).
>> The cosine similarity is thus
>> (a1*b2 + a2*b3) / (sqrt(a1*a1 + a2*a2) * sqrt(b2*b2 + b3*b3))
>> 1) if itemA and itemB have just one word in common, the result is 1;
>> 2) if the values of the vectors are almost the same, the value is also
>> nearly 1.
>> For the two cases above, you might consider using association rules to
>> address the problem.
>>
>> At 2012-10-01 20:53:16, yamo93 <[email protected]> wrote:
>>> It seems that RowSimilarityJob does not have the same weakness, but I
>>> also use CosineSimilarity. Why?
>>>
>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>> Yes, this is one of the weaknesses of this particular flavor of this
>>>> particular similarity metric. The sparser the data, the worse the
>>>> problem is, in general. There are some band-aid solutions, like
>>>> applying some kind of weight against similarities based on small
>>>> intersection size. Or you can pretend that missing values are 0
>>>> (PreferenceInferrer), which can introduce its own problems, or
>>>> perhaps use some mean value.
>>>>
>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <[email protected]> wrote:
>>>>> Thanks for replying.
>>>>>
>>>>> So documents with only one word in common are more likely to come
>>>>> out as similar than documents with more words in common, right?
>>>>>
>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>> Similar items, right? You should look at the vectors that have 1.0
>>>>>> similarity and see if they are in fact collinear. This is still by
>>>>>> far the most likely explanation. Remember that the vector
>>>>>> similarity is computed over elements that exist in both vectors
>>>>>> only. They just have to have 2 identical values for this to happen.
>>>>>>
>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <[email protected]> wrote:
>>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>>> It sounds like a bug somewhere.
>>>>>>>
>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>>>>> occurs when two vectors are just a scalar multiple of each other
>>>>>>>> (0 angle between them). It's possible there are several of these,
>>>>>>>> and so their 1.0 similarities dominate the result.
>>>>>>>>
>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <[email protected]> wrote:
>>>>>>>>> I saw something strange: all recommended items returned by
>>>>>>>>> mostSimilarItems() have a value of 1.0.
>>>>>>>>> Is that normal?
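To make the weakness Sean describes above concrete, here is a hypothetical sketch (the intersectionCosine name is mine; this is not the Mahout implementation) of a cosine computed over only the elements present in both vectors, which is how he says the in-memory vector similarity behaves. With positive weights, one common word always gives 1.0, and so does any number of common words whose values are proportional.

import java.util.Map;

public class IntersectionOnlyCosine {

  // Cosine restricted to the keys (word indexes) present in both maps.
  static double intersectionCosine(Map<Integer, Double> a, Map<Integer, Double> b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (Map.Entry<Integer, Double> e : a.entrySet()) {
      Double bv = b.get(e.getKey());
      if (bv != null) { // skip words that only one document contains
        double av = e.getValue();
        dot += av * bv;
        normA += av * av;
        normB += bv * bv;
      }
    }
    if (normA == 0.0 || normB == 0.0) {
      return 0.0; // no overlap at all
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    // One common word (index 7) with different weights: similarity is 1.0.
    Map<Integer, Double> docA = Map.of(1, 2.0, 7, 5.0);
    Map<Integer, Double> docB = Map.of(3, 4.0, 7, 9.0);
    System.out.println(intersectionCosine(docA, docB)); // 1.0

    // Two common words (2 and 5) with proportional values: also exactly
    // 1.0, because the overlapping sub-vectors (1, 2) and (3, 6) are
    // collinear.
    Map<Integer, Double> docC = Map.of(2, 1.0, 5, 2.0, 9, 8.0);
    Map<Integer, Double> docD = Map.of(2, 3.0, 5, 6.0, 4, 7.0);
    System.out.println(intersectionCosine(docC, docD)); // 1.0
  }
}

This is also why Sean's band-aid of weighting by intersection size helps: it penalizes exactly the degenerate 1.0s that come from tiny overlaps.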
