By the way, note that distance is not the same thing as similarity.
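The remark is worth pinning down, since the messages below use "distance" and "similarity" interchangeably: cosine similarity lies in [-1, 1] (in [0, 1] for non-negative term vectors), and a distance is usually derived from it, commonly as one minus the similarity, so a high similarity means a low distance. A minimal sketch in plain Java (the class name is illustrative):

    public class SimilarityVsDistance {
        public static void main(String[] args) {
            // The whole-vector cosine from the (1,1,1) vs. (0,1,0) example below.
            double similarity = 1.0 / Math.sqrt(3.0);
            // One common way to turn a similarity into a distance.
            double distance = 1.0 - similarity;
            System.out.println("similarity = " + similarity); // ~0.577, fairly similar
            System.out.println("distance   = " + distance);   // ~0.423, fairly close
            // Collinear vectors: similarity 1.0, distance 0.0.
        }
    }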
At 2012-10-02 19:43:58, bangbig <[email protected]> wrote:
>Yes, you get it.
>I thought RowSimilarityJob was from Taste when I wrote the previous email.
>
>At 2012-10-02 19:26:48, yamo93 <[email protected]> wrote:
>>Ok, I think I understood.
>>
>>Let's take an example with two vectors, (1,1,1) and (0,1,0).
>>With UncenteredCosineSimilarity (as implemented in Taste), the distance is 1.
>>With Cosine (as implemented in RowSimilarityJob), the distance is 1/sqrt(3).
>>
>>Ok?
>>
>>On 10/02/2012 11:50 AM, Sebastian Schelter wrote:
>>> I don't see why documents with only one word in common should have a
>>> similarity of 1.0 in RowSimilarityJob. consider() is only invoked if you
>>> specify a threshold for the similarity.
>>>
>>> UncenteredCosineSimilarity works on matching entries only, which is
>>> problematic for documents, as empty entries have a meaning (0 term
>>> occurrences) as opposed to collaborative filtering data.
>>>
>>> Maybe we should remove UncenteredCosine and create another similarity
>>> implementation that computes the cosine correctly over all entries.
>>>
>>> --sebastian
>>>
>>> On 02.10.2012 10:08, yamo93 wrote:
>>>> Hello Seb,
>>>>
>>>> In my understanding, the algorithm is the same (except for the
>>>> normalization part) as UncenteredCosine (with the drawback that vectors
>>>> with only one word in common have a distance of 1.0) ... but the results
>>>> are quite different (is this just an effect of the consider() method,
>>>> which removes irrelevant values?) ...
>>>>
>>>> I looked at the code but there is almost nothing in
>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity;
>>>> the code seems to be in SimilarityReducer, which is not so simple to
>>>> understand ...
>>>>
>>>> Thanks for helping,
>>>>
>>>> On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
>>>>> The cosine similarity as computed by RowSimilarityJob is the cosine
>>>>> similarity between the whole vectors.
>>>>>
>>>>> See
>>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
>>>>> for details.
>>>>>
>>>>> First, both vectors are scaled to unit length in normalize(); after
>>>>> that, their dot product in similarity() (which can be computed from the
>>>>> elements that exist in both vectors) gives the cosine between them.
>>>>>
>>>>> On 01.10.2012 21:52, bangbig wrote:
>>>>>> I think it's better to understand how RowSimilarityJob gets the
>>>>>> result.
>>>>>> For two items,
>>>>>> itemA: 0, 0, a1, a2, a3, 0
>>>>>> itemB: 0, b1, b2, b3, 0, 0
>>>>>> it uses only the overlapping parts of the vectors (a1, a2 and b2, b3
>>>>>> here) when computing.
>>>>>> The cosine similarity thus is (a1*b2 + a2*b3) / (sqrt(a1*a1 + a2*a2)
>>>>>> * sqrt(b2*b2 + b3*b3)).
>>>>>> 1) if itemA and itemB have just one common word, the result is 1;
>>>>>> 2) if the values of the vectors are almost the same, the value would
>>>>>> also be nearly 1.
>>>>>> For the two cases above, I think you could consider using association
>>>>>> rules to address the problem.
>>>>>>
>>>>>> At 2012-10-01 20:53:16, yamo93 <[email protected]> wrote:
>>>>>>> It seems that RowSimilarityJob does not have the same weakness, but I
>>>>>>> also use CosineSimilarity. Why?
>>>>>>>
>>>>>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>>>>>> Yes, this is one of the weaknesses of this particular flavor of this
>>>>>>>> particular similarity metric. The more sparse the data, the worse
>>>>>>>> the problem is in general. There are some band-aid solutions, like
>>>>>>>> applying some kind of weight against similarities based on small
>>>>>>>> intersection size. Or you can pretend that missing values are 0
>>>>>>>> (PreferenceInferrer), which can introduce its own problems, or
>>>>>>>> perhaps use some mean value.
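To make the intersection-size weighting concrete, here is a minimal sketch of one way to damp similarities that rest on only a few common entries; the class name and the damping constant are illustrative assumptions, not Mahout API:

    // Sketch: scale down a similarity computed over few common entries, on the
    // assumption that scores backed by tiny overlaps are less trustworthy.
    public class OverlapDampedSimilarity {

        // Overlaps at least this large are trusted fully; purely
        // illustrative, tune per dataset.
        private static final int FULL_TRUST_OVERLAP = 50;

        static double damp(double rawSimilarity, int commonEntries) {
            double weight =
                Math.min(commonEntries, FULL_TRUST_OVERLAP) / (double) FULL_TRUST_OVERLAP;
            return rawSimilarity * weight;
        }

        public static void main(String[] args) {
            System.out.println(damp(1.0, 1));  // 0.02: one word in common counts for little
            System.out.println(damp(0.9, 80)); // 0.9: a well-supported score is kept as-is
        }
    }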
>>>>>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <[email protected]> wrote:
>>>>>>>>> Thanks for replying.
>>>>>>>>>
>>>>>>>>> So, documents with only one word in common have a better chance of
>>>>>>>>> being similar than documents with more words in common, right?
>>>>>>>>>
>>>>>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>>>>>> Similar items, right? You should look at the vectors that have 1.0
>>>>>>>>>> similarity and see if they are in fact collinear. This is still by
>>>>>>>>>> far the most likely explanation. Remember that the vector
>>>>>>>>>> similarity is computed over elements that exist in both vectors
>>>>>>>>>> only. They just have to have 2 identical values for this to
>>>>>>>>>> happen.
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <[email protected]> wrote:
>>>>>>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>>>>>>> It sounds like a bug somewhere.
>>>>>>>>>>>
>>>>>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>>>>>>>>> occurs when two vectors are just a scalar multiple of each other
>>>>>>>>>>>> (0 angle between them). It's possible there are several of
>>>>>>>>>>>> these, and so their 1.0 similarities dominate the result.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <[email protected]> wrote:
>>>>>>>>>>>>> I saw something strange: all recommended items, returned by
>>>>>>>>>>>>> mostSimilarItems(), have a value of 1.0.
>>>>>>>>>>>>> Is it normal?
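Putting numbers to yamo93's example: below is a self-contained sketch, not Mahout's actual code, contrasting the two flavors of cosine discussed above over sparse vectors stored as index-to-value maps. It reproduces the 1.0 (matching entries only, as in Taste's UncenteredCosineSimilarity) versus 1/sqrt(3) (whole vectors, as in RowSimilarityJob) result:

    import java.util.HashMap;
    import java.util.Map;

    public class TwoCosines {

        // Cosine over ALL entries (what RowSimilarityJob computes):
        // missing entries count as 0, and the norms cover the whole vectors.
        static double wholeVectorCosine(Map<Integer, Double> a, Map<Integer, Double> b) {
            double dot = 0;
            for (Map.Entry<Integer, Double> e : a.entrySet()) {
                Double other = b.get(e.getKey());
                if (other != null) {
                    dot += e.getValue() * other; // zero entries add nothing to the dot product
                }
            }
            return dot / (norm(a) * norm(b));
        }

        // Cosine over MATCHING entries only (Taste's UncenteredCosineSimilarity
        // behavior): dimensions present in only one vector are ignored entirely.
        static double matchingEntriesCosine(Map<Integer, Double> a, Map<Integer, Double> b) {
            double dot = 0, normA = 0, normB = 0;
            for (Map.Entry<Integer, Double> e : a.entrySet()) {
                Double other = b.get(e.getKey());
                if (other != null) {
                    dot += e.getValue() * other;
                    normA += e.getValue() * e.getValue(); // norms restricted to the overlap
                    normB += other * other;
                }
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }

        static double norm(Map<Integer, Double> v) {
            double sum = 0;
            for (double x : v.values()) {
                sum += x * x;
            }
            return Math.sqrt(sum);
        }

        public static void main(String[] args) {
            Map<Integer, Double> a = new HashMap<>(); // (1, 1, 1)
            a.put(0, 1.0); a.put(1, 1.0); a.put(2, 1.0);
            Map<Integer, Double> b = new HashMap<>(); // (0, 1, 0), stored sparsely
            b.put(1, 1.0);

            System.out.println(matchingEntriesCosine(a, b)); // 1.0: one common word suffices
            System.out.println(wholeVectorCosine(a, b));     // 0.577... = 1/sqrt(3)
        }
    }

Note that the matching-entries flavor ignores the two words that appear only in the first document, which is exactly why a single shared word is enough to score 1.0.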

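Sebastian's normalize()-then-dot-product point can also be checked directly: once each vector is scaled to unit length over all of its entries, the dot product over just the co-occurring entries equals the full cosine, because every other term of the sum is zero. A sketch under the same assumptions as above, using bangbig's itemA/itemB layout with made-up values:

    import java.util.HashMap;
    import java.util.Map;

    public class NormalizeThenDot {

        // Scale a sparse vector to unit length over ALL of its entries,
        // mirroring what RowSimilarityJob's normalize() step achieves.
        static Map<Integer, Double> normalize(Map<Integer, Double> v) {
            double norm = 0;
            for (double x : v.values()) {
                norm += x * x;
            }
            norm = Math.sqrt(norm);
            Map<Integer, Double> unit = new HashMap<>();
            for (Map.Entry<Integer, Double> e : v.entrySet()) {
                unit.put(e.getKey(), e.getValue() / norm);
            }
            return unit;
        }

        // Dot product over co-occurring entries only; for unit vectors this IS
        // the cosine, since dimensions missing from either vector contribute 0.
        static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
            double sum = 0;
            for (Map.Entry<Integer, Double> e : a.entrySet()) {
                Double other = b.get(e.getKey());
                if (other != null) {
                    sum += e.getValue() * other;
                }
            }
            return sum;
        }

        public static void main(String[] args) {
            Map<Integer, Double> a = new HashMap<>();
            a.put(2, 3.0); a.put(3, 1.0); a.put(4, 2.0); // itemA: 0, 0, a1, a2, a3, 0
            Map<Integer, Double> b = new HashMap<>();
            b.put(1, 1.0); b.put(2, 2.0); b.put(3, 2.0); // itemB: 0, b1, b2, b3, 0, 0

            double cosine = dot(normalize(a), normalize(b));
            // (3*2 + 1*2) / (sqrt(14) * 3) = 8 / 11.22... ~ 0.7127
            System.out.println(cosine);
        }
    }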