Yes, you get it. I thought RowSimilarityJob was from Taste when I wrote the previous email.
At 2012-10-02 19:26:48, yamo93 <[email protected]> wrote:
>Ok, I think I understood.
>
>Let's take an example with two vectors, (1,1,1) and (0,1,0).
>With UncenteredCosineSimilarity (as implemented in Taste), the similarity is 1.
>With cosine (as implemented in RowSimilarityJob), the similarity is 1/sqrt(3).
>
>OK?
>
>On 10/02/2012 11:50 AM, Sebastian Schelter wrote:
>> I don't see why documents with only one word in common should have a
>> similarity of 1.0 in RowSimilarityJob. consider() is only invoked if you
>> specify a threshold for the similarity.
>>
>> UncenteredCosineSimilarity works on matching entries only, which is
>> problematic for documents, as empty entries have a meaning (0 term
>> occurrences), as opposed to collaborative filtering data.
>>
>> Maybe we should remove UncenteredCosine and create another similarity
>> implementation that computes the cosine correctly over all entries.
>>
>> --sebastian
>>
>> On 02.10.2012 10:08, yamo93 wrote:
>>> Hello Seb,
>>>
>>> In my understanding, the algorithm is the same (except for the
>>> normalization part) as UncenteredCosine (with the drawback that vectors
>>> with only one word in common have a similarity of 1.0)... but the
>>> results are quite different. Is this just an effect of the consider()
>>> method, which removes irrelevant values?
>>>
>>> I looked at the code, but there is almost nothing in
>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity;
>>> the code seems to be in SimilarityReducer, which is not so simple to
>>> understand.
>>>
>>> Thanks for helping,
>>>
>>> On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
>>>> The cosine similarity as computed by RowSimilarityJob is the cosine
>>>> similarity between the whole vectors.
>>>>
>>>> See
>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
>>>> for details.
>>>>
>>>> First, both vectors are scaled to unit length in normalize(); after
>>>> this, their dot product in similarity() (which can be computed from
>>>> elements that exist in both vectors) gives the cosine between them.
>>>>
>>>> On 01.10.2012 21:52, bangbig wrote:
>>>>> I think it's better to understand how RowSimilarityJob gets the
>>>>> result.
>>>>> For two items,
>>>>> itemA: 0, 0, a1, a2, a3, 0
>>>>> itemB: 0, b1, b2, b3, 0, 0
>>>>> when computing, it uses only the overlapping entries of the vectors
>>>>> (a1 with b2, and a2 with b3 above).
>>>>> The cosine similarity is thus
>>>>> (a1*b2 + a2*b3) / (sqrt(a1*a1 + a2*a2) * sqrt(b2*b2 + b3*b3)).
>>>>> 1) If itemA and itemB have just one common word, the result is 1;
>>>>> 2) if the values of the vectors are almost the same, the value will
>>>>> also be nearly 1.
>>>>> For the two cases above, I think you can consider using association
>>>>> rules to address the problem.
>>>>>
>>>>> At 2012-10-01 20:53:16, yamo93 <[email protected]> wrote:
>>>>>> It seems that RowSimilarityJob does not have the same weakness, but
>>>>>> I also use CosineSimilarity. Why?
>>>>>>
>>>>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>>>>> Yes, this is one of the weaknesses of this particular flavor of
>>>>>>> this particular similarity metric. The sparser the data, the worse
>>>>>>> the problem is, in general. There are some band-aid solutions, like
>>>>>>> applying some kind of weight against similarities based on small
>>>>>>> intersection size. Or you can pretend that missing values are 0
>>>>>>> (PreferenceInferrer), which can introduce its own problems, or
>>>>>>> perhaps use some mean value.
>>>>>>>
>>>>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <[email protected]> wrote:
>>>>>>>> Thanks for replying.
>>>>>>>>
>>>>>>>> So, documents with only one word in common have a better chance of
>>>>>>>> being similar than documents with more words in common, right?
>>>>>>>>
>>>>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>>>>> Similar items, right? You should look at the vectors that have
>>>>>>>>> 1.0 similarity and see if they are in fact collinear. This is
>>>>>>>>> still by far the most likely explanation. Remember that the
>>>>>>>>> vector similarity is computed over elements that exist in both
>>>>>>>>> vectors only. They just have to have 2 identical values for this
>>>>>>>>> to happen.
>>>>>>>>>
>>>>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <[email protected]> wrote:
>>>>>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>>>>>> It sounds like a bug somewhere.
>>>>>>>>>>
>>>>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>>>>> It's possible this is correct. 1.0 is the maximum similarity
>>>>>>>>>>> and occurs when two vectors are just a scalar multiple of each
>>>>>>>>>>> other (0 angle between them). It's possible there are several
>>>>>>>>>>> of these, and so their 1.0 similarities dominate the result.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <[email protected]> wrote:
>>>>>>>>>>>> I saw something strange: all recommended items, returned by
>>>>>>>>>>>> mostSimilarItems(), have a value of 1.0.
>>>>>>>>>>>> Is this normal?
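
A minimal standalone sketch of the two variants discussed in the thread above (plain Java, not Mahout's actual classes; the class and method names are illustrative). A 0 entry means "term absent from the document":

public class CosineVariants {

  // Cosine over the whole vectors, as RowSimilarityJob computes it:
  // scale both to unit length, then take the dot product.
  static double fullCosine(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  // Cosine over matching (both non-zero) entries only, in the spirit of
  // Taste's UncenteredCosineSimilarity: entries missing from either
  // vector are skipped instead of being treated as 0.
  static double matchingEntriesCosine(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      if (a[i] != 0 && b[i] != 0) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
      }
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    double[] x = {1, 1, 1};
    double[] y = {0, 1, 0};
    System.out.println(fullCosine(x, y));            // 1/sqrt(3) ~ 0.577
    System.out.println(matchingEntriesCosine(x, y)); // 1.0

    // The weakness from the thread: two documents sharing exactly one
    // word are "perfectly similar" under the matching-entries variant.
    double[] itemA = {0, 0, 5, 0, 0, 0};
    double[] itemB = {0, 3, 5, 0, 0, 0};
    System.out.println(matchingEntriesCosine(itemA, itemB)); // 1.0
  }
}

Running it reproduces yamo93's example: the whole-vector cosine of (1,1,1) and (0,1,0) is 1/sqrt(3), while the matching-entries variant returns 1.0, as it does for any two documents with a single word in common.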

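And a similarly hedged sketch of the "weight against small intersection size" band-aid Sean Owen mentions. The n / (n + k) damping factor is one common choice assumed here for illustration, not something Mahout prescribes:

public class OverlapDamping {

  // Shrink a raw similarity by how many co-occurring terms back it, so a
  // 1.0 built on a single shared word counts for little while a 1.0 built
  // on many shared words survives nearly intact.
  static double damp(double similarity, int overlapSize, int k) {
    return similarity * overlapSize / (overlapSize + (double) k);
  }

  public static void main(String[] args) {
    System.out.println(damp(1.0, 1, 5));  // ~0.17: one shared term, heavily damped
    System.out.println(damp(1.0, 50, 5)); // ~0.91: well supported, barely damped
  }
}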