By the way, note that distance is not the same thing as similarity.
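The remark is worth pinning down, since the messages below use "distance" and "similarity" interchangeably: cosine similarity lies in [-1, 1] (in [0, 1] for non-negative term vectors), and a distance is usually derived from it, commonly as one minus the similarity, so a high similarity means a low distance. A minimal sketch in plain Java (the class name is illustrative):

    public class SimilarityVsDistance {
        public static void main(String[] args) {
            // The whole-vector cosine from the (1,1,1) vs. (0,1,0) example below.
            double similarity = 1.0 / Math.sqrt(3.0);
            // One common way to turn a similarity into a distance.
            double distance = 1.0 - similarity;
            System.out.println("similarity = " + similarity); // ~0.577, fairly similar
            System.out.println("distance   = " + distance);   // ~0.423, fairly close
            // Collinear vectors: similarity 1.0, distance 0.0.
        }
    }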
At 2012-10-02 19:43:58, bangbig <[email protected]> wrote:
>Yes, you get it.
>I thought RowSimilarityJob was from Taste when I wrote the previous email.
>
>At 2012-10-02 19:26:48, yamo93 <[email protected]> wrote:
>>Ok, I think I understood.
>>
>>Let's take an example with two vectors, (1,1,1) and (0,1,0).
>>With UncenteredCosineSimilarity (as implemented in Taste), the distance is 1.
>>With Cosine (as implemented in RowSimilarityJob), the distance is 1/sqrt(3).
>>
>>Ok?
>>
>>On 10/02/2012 11:50 AM, Sebastian Schelter wrote:
>>> I don't see why documents with only one word in common should have a
>>> similarity of 1.0 in RowSimilarityJob. consider() is only invoked if you
>>> specify a threshold for the similarity.
>>>
>>> UncenteredCosineSimilarity works on matching entries only, which is
>>> problematic for documents, as empty entries have a meaning (0 term
>>> occurrences) as opposed to collaborative filtering data.
>>>
>>> Maybe we should remove UncenteredCosine and create another similarity
>>> implementation that computes the cosine correctly over all entries.
>>>
>>> --sebastian
>>>
>>> On 02.10.2012 10:08, yamo93 wrote:
>>>> Hello Seb,
>>>>
>>>> In my understanding, the algorithm is the same (except for the
>>>> normalization part) as UncenteredCosine (with the drawback that vectors
>>>> with only one word in common have a distance of 1.0) ... but the results
>>>> are quite different (is this just an effect of the consider() method,
>>>> which removes irrelevant values?) ...
>>>>
>>>> I looked at the code but there is almost nothing in
>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity;
>>>> the code seems to be in SimilarityReducer, which is not so simple to
>>>> understand ...
>>>>
>>>> Thanks for helping,
>>>>
>>>> On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
>>>>> The cosine similarity as computed by RowSimilarityJob is the cosine
>>>>> similarity between the whole vectors.
>>>>>
>>>>> See
>>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
>>>>> for details.
>>>>>
>>>>> First, both vectors are scaled to unit length in normalize(); after
>>>>> that, their dot product in similarity() (which can be computed from the
>>>>> elements that exist in both vectors) gives the cosine between them.
>>>>>
>>>>> On 01.10.2012 21:52, bangbig wrote:
>>>>>> I think it's better to understand how RowSimilarityJob gets the
>>>>>> result.
>>>>>> For two items,
>>>>>> itemA: 0, 0, a1, a2, a3, 0
>>>>>> itemB: 0, b1, b2, b3, 0, 0
>>>>>> it uses only the overlapping parts of the vectors (a1, a2 and b2, b3
>>>>>> here) when computing.
>>>>>> The cosine similarity thus is (a1*b2 + a2*b3) / (sqrt(a1*a1 + a2*a2)
>>>>>> * sqrt(b2*b2 + b3*b3)).
>>>>>> 1) if itemA and itemB have just one common word, the result is 1;
>>>>>> 2) if the values of the vectors are almost the same, the value would
>>>>>> also be nearly 1.
>>>>>> For the two cases above, I think you could consider using association
>>>>>> rules to address the problem.
>>>>>>
>>>>>> At 2012-10-01 20:53:16, yamo93 <[email protected]> wrote:
>>>>>>> It seems that RowSimilarityJob does not have the same weakness, but I
>>>>>>> also use CosineSimilarity. Why?
>>>>>>>
>>>>>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>>>>>> Yes, this is one of the weaknesses of this particular flavor of this
>>>>>>>> particular similarity metric. The more sparse the data, the worse
>>>>>>>> the problem is in general. There are some band-aid solutions, like
>>>>>>>> applying some kind of weight against similarities based on small
>>>>>>>> intersection size. Or you can pretend that missing values are 0
>>>>>>>> (PreferenceInferrer), which can introduce its own problems, or
>>>>>>>> perhaps use some mean value.
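To make the intersection-size weighting concrete, here is a minimal sketch of one way to damp similarities that rest on only a few common entries; the class name and the damping constant are illustrative assumptions, not Mahout API:

    // Sketch: scale down a similarity computed over few common entries, on the
    // assumption that scores backed by tiny overlaps are less trustworthy.
    public class OverlapDampedSimilarity {

        // Overlaps at least this large are trusted fully; purely
        // illustrative, tune per dataset.
        private static final int FULL_TRUST_OVERLAP = 50;

        static double damp(double rawSimilarity, int commonEntries) {
            double weight =
                Math.min(commonEntries, FULL_TRUST_OVERLAP) / (double) FULL_TRUST_OVERLAP;
            return rawSimilarity * weight;
        }

        public static void main(String[] args) {
            System.out.println(damp(1.0, 1));  // 0.02: one word in common counts for little
            System.out.println(damp(0.9, 80)); // 0.9: a well-supported score is kept as-is
        }
    }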
>>>>>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <[email protected]> wrote:
>>>>>>>>> Thanks for replying.
>>>>>>>>>
>>>>>>>>> So, documents with only one word in common have a better chance of
>>>>>>>>> being similar than documents with more words in common, right?
>>>>>>>>>
>>>>>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>>>>>> Similar items, right? You should look at the vectors that have 1.0
>>>>>>>>>> similarity and see if they are in fact collinear. This is still by
>>>>>>>>>> far the most likely explanation. Remember that the vector
>>>>>>>>>> similarity is computed over elements that exist in both vectors
>>>>>>>>>> only. They just have to have 2 identical values for this to
>>>>>>>>>> happen.
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <[email protected]> wrote:
>>>>>>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>>>>>>> It sounds like a bug somewhere.
>>>>>>>>>>>
>>>>>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>>>>>>>>> occurs when two vectors are just a scalar multiple of each other
>>>>>>>>>>>> (0 angle between them). It's possible there are several of
>>>>>>>>>>>> these, and so their 1.0 similarities dominate the result.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <[email protected]> wrote:
>>>>>>>>>>>>> I saw something strange: all recommended items, returned by
>>>>>>>>>>>>> mostSimilarItems(), have a value of 1.0.
>>>>>>>>>>>>> Is it normal?
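Putting numbers to yamo93's example: below is a self-contained sketch, not Mahout's actual code, contrasting the two flavors of cosine discussed above over sparse vectors stored as index-to-value maps. It reproduces the 1.0 (matching entries only, as in Taste's UncenteredCosineSimilarity) versus 1/sqrt(3) (whole vectors, as in RowSimilarityJob) result:

    import java.util.HashMap;
    import java.util.Map;

    public class TwoCosines {

        // Cosine over ALL entries (what RowSimilarityJob computes):
        // missing entries count as 0, and the norms cover the whole vectors.
        static double wholeVectorCosine(Map<Integer, Double> a, Map<Integer, Double> b) {
            double dot = 0;
            for (Map.Entry<Integer, Double> e : a.entrySet()) {
                Double other = b.get(e.getKey());
                if (other != null) {
                    dot += e.getValue() * other; // zero entries add nothing to the dot product
                }
            }
            return dot / (norm(a) * norm(b));
        }

        // Cosine over MATCHING entries only (Taste's UncenteredCosineSimilarity
        // behavior): dimensions present in only one vector are ignored entirely.
        static double matchingEntriesCosine(Map<Integer, Double> a, Map<Integer, Double> b) {
            double dot = 0, normA = 0, normB = 0;
            for (Map.Entry<Integer, Double> e : a.entrySet()) {
                Double other = b.get(e.getKey());
                if (other != null) {
                    dot += e.getValue() * other;
                    normA += e.getValue() * e.getValue(); // norms restricted to the overlap
                    normB += other * other;
                }
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }

        static double norm(Map<Integer, Double> v) {
            double sum = 0;
            for (double x : v.values()) {
                sum += x * x;
            }
            return Math.sqrt(sum);
        }

        public static void main(String[] args) {
            Map<Integer, Double> a = new HashMap<>(); // (1, 1, 1)
            a.put(0, 1.0); a.put(1, 1.0); a.put(2, 1.0);
            Map<Integer, Double> b = new HashMap<>(); // (0, 1, 0), stored sparsely
            b.put(1, 1.0);

            System.out.println(matchingEntriesCosine(a, b)); // 1.0: one common word suffices
            System.out.println(wholeVectorCosine(a, b));     // 0.577... = 1/sqrt(3)
        }
    }

Note that the matching-entries flavor ignores the two words that appear only in the first document, which is exactly why a single shared word is enough to score 1.0.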

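Sebastian's normalize()-then-dot-product point can also be checked directly: once each vector is scaled to unit length over all of its entries, the dot product over just the co-occurring entries equals the full cosine, because every other term of the sum is zero. A sketch under the same assumptions as above, using bangbig's itemA/itemB layout with made-up values:

    import java.util.HashMap;
    import java.util.Map;

    public class NormalizeThenDot {

        // Scale a sparse vector to unit length over ALL of its entries,
        // mirroring what RowSimilarityJob's normalize() step achieves.
        static Map<Integer, Double> normalize(Map<Integer, Double> v) {
            double norm = 0;
            for (double x : v.values()) {
                norm += x * x;
            }
            norm = Math.sqrt(norm);
            Map<Integer, Double> unit = new HashMap<>();
            for (Map.Entry<Integer, Double> e : v.entrySet()) {
                unit.put(e.getKey(), e.getValue() / norm);
            }
            return unit;
        }

        // Dot product over co-occurring entries only; for unit vectors this IS
        // the cosine, since dimensions missing from either vector contribute 0.
        static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
            double sum = 0;
            for (Map.Entry<Integer, Double> e : a.entrySet()) {
                Double other = b.get(e.getKey());
                if (other != null) {
                    sum += e.getValue() * other;
                }
            }
            return sum;
        }

        public static void main(String[] args) {
            Map<Integer, Double> a = new HashMap<>();
            a.put(2, 3.0); a.put(3, 1.0); a.put(4, 2.0); // itemA: 0, 0, a1, a2, a3, 0
            Map<Integer, Double> b = new HashMap<>();
            b.put(1, 1.0); b.put(2, 2.0); b.put(3, 2.0); // itemB: 0, b1, b2, b3, 0, 0

            double cosine = dot(normalize(a), normalize(b));
            // (3*2 + 1*2) / (sqrt(14) * 3) = 8 / 11.22... ~ 0.7127
            System.out.println(cosine);
        }
    }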