Yes, you get it. I thought RowSimilarityJob was from Taste when I wrote the previous email.
At 2012-10-02 19:26:48, yamo93 <[email protected]> wrote:
>Ok, I think I understood.
>
>Let's take an example with two vectors, (1,1,1) and (0,1,0).
>With UncenteredCosineSimilarity (as implemented in Taste), the similarity is 1.
>With cosine (as implemented in RowSimilarityJob), the similarity is 1/sqrt(3).
>
>OK?
>
>On 10/02/2012 11:50 AM, Sebastian Schelter wrote:
>> I don't see why documents with only one word in common should have a
>> similarity of 1.0 in RowSimilarityJob. consider() is only invoked if you
>> specify a threshold for the similarity.
>>
>> UncenteredCosineSimilarity works on matching entries only, which is
>> problematic for documents, as empty entries have a meaning (0 term
>> occurrences), as opposed to collaborative filtering data.
>>
>> Maybe we should remove UncenteredCosine and create another similarity
>> implementation that computes the cosine correctly over all entries.
>>
>> --sebastian
>>
>> On 02.10.2012 10:08, yamo93 wrote:
>>> Hello Seb,
>>>
>>> In my understanding, the algorithm is the same (except for the
>>> normalization part) as UncenteredCosine (with the drawback that vectors
>>> with only one word in common have a similarity of 1.0)... but the
>>> results are quite different. Is this just an effect of the consider()
>>> method, which removes irrelevant values?
>>>
>>> I looked at the code, but there is almost nothing in
>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity;
>>> the code seems to be in SimilarityReducer, which is not so simple to
>>> understand.
>>>
>>> Thanks for helping,
>>>
>>> On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
>>>> The cosine similarity as computed by RowSimilarityJob is the cosine
>>>> similarity between the whole vectors.
>>>>
>>>> See
>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
>>>> for details.
>>>>
>>>> First, both vectors are scaled to unit length in normalize(); after
>>>> this, their dot product in similarity() (which can be computed from
>>>> elements that exist in both vectors) gives the cosine between them.
>>>>
>>>> On 01.10.2012 21:52, bangbig wrote:
>>>>> I think it's better to understand how RowSimilarityJob gets the
>>>>> result.
>>>>> For two items,
>>>>> itemA: 0, 0, a1, a2, a3, 0
>>>>> itemB: 0, b1, b2, b3, 0, 0
>>>>> when computing, it uses only the overlapping entries of the vectors
>>>>> (a1 with b2, and a2 with b3 above).
>>>>> The cosine similarity is thus
>>>>> (a1*b2 + a2*b3) / (sqrt(a1*a1 + a2*a2) * sqrt(b2*b2 + b3*b3)).
>>>>> 1) If itemA and itemB have just one common word, the result is 1;
>>>>> 2) if the values of the vectors are almost the same, the value will
>>>>> also be nearly 1.
>>>>> For the two cases above, I think you can consider using association
>>>>> rules to address the problem.
>>>>>
>>>>> At 2012-10-01 20:53:16, yamo93 <[email protected]> wrote:
>>>>>> It seems that RowSimilarityJob does not have the same weakness, but
>>>>>> I also use CosineSimilarity. Why?
>>>>>>
>>>>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>>>>> Yes, this is one of the weaknesses of this particular flavor of
>>>>>>> this particular similarity metric. The sparser the data, the worse
>>>>>>> the problem is, in general. There are some band-aid solutions, like
>>>>>>> applying some kind of weight against similarities based on small
>>>>>>> intersection size. Or you can pretend that missing values are 0
>>>>>>> (PreferenceInferrer), which can introduce its own problems, or
>>>>>>> perhaps use some mean value.
>>>>>>>
>>>>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <[email protected]> wrote:
>>>>>>>> Thanks for replying.
>>>>>>>>
>>>>>>>> So, documents with only one word in common have a better chance of
>>>>>>>> being similar than documents with more words in common, right?
>>>>>>>>
>>>>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>>>>> Similar items, right? You should look at the vectors that have
>>>>>>>>> 1.0 similarity and see if they are in fact collinear. This is
>>>>>>>>> still by far the most likely explanation. Remember that the
>>>>>>>>> vector similarity is computed over elements that exist in both
>>>>>>>>> vectors only. They just have to have 2 identical values for this
>>>>>>>>> to happen.
>>>>>>>>>
>>>>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <[email protected]> wrote:
>>>>>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>>>>>> It sounds like a bug somewhere.
>>>>>>>>>>
>>>>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>>>>> It's possible this is correct. 1.0 is the maximum similarity
>>>>>>>>>>> and occurs when two vectors are just a scalar multiple of each
>>>>>>>>>>> other (0 angle between them). It's possible there are several
>>>>>>>>>>> of these, and so their 1.0 similarities dominate the result.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <[email protected]> wrote:
>>>>>>>>>>>> I saw something strange: all recommended items, returned by
>>>>>>>>>>>> mostSimilarItems(), have a value of 1.0.
>>>>>>>>>>>> Is this normal?
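
A minimal standalone sketch of the two variants discussed in the thread above (plain Java, not Mahout's actual classes; the class and method names are illustrative). A 0 entry means "term absent from the document":

public class CosineVariants {

  // Cosine over the whole vectors, as RowSimilarityJob computes it:
  // scale both to unit length, then take the dot product.
  static double fullCosine(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  // Cosine over matching (both non-zero) entries only, in the spirit of
  // Taste's UncenteredCosineSimilarity: entries missing from either
  // vector are skipped instead of being treated as 0.
  static double matchingEntriesCosine(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      if (a[i] != 0 && b[i] != 0) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
      }
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    double[] x = {1, 1, 1};
    double[] y = {0, 1, 0};
    System.out.println(fullCosine(x, y));            // 1/sqrt(3) ~ 0.577
    System.out.println(matchingEntriesCosine(x, y)); // 1.0

    // The weakness from the thread: two documents sharing exactly one
    // word are "perfectly similar" under the matching-entries variant.
    double[] itemA = {0, 0, 5, 0, 0, 0};
    double[] itemB = {0, 3, 5, 0, 0, 0};
    System.out.println(matchingEntriesCosine(itemA, itemB)); // 1.0
  }
}

Running it reproduces yamo93's example: the whole-vector cosine of (1,1,1) and (0,1,0) is 1/sqrt(3), while the matching-entries variant returns 1.0, as it does for any two documents with a single word in common.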

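And a similarly hedged sketch of the "weight against small intersection size" band-aid Sean Owen mentions. The n / (n + k) damping factor is one common choice assumed here for illustration, not something Mahout prescribes:

public class OverlapDamping {

  // Shrink a raw similarity by how many co-occurring terms back it, so a
  // 1.0 built on a single shared word counts for little while a 1.0 built
  // on many shared words survives nearly intact.
  static double damp(double similarity, int overlapSize, int k) {
    return similarity * overlapSize / (overlapSize + (double) k);
  }

  public static void main(String[] args) {
    System.out.println(damp(1.0, 1, 5));  // ~0.17: one shared term, heavily damped
    System.out.println(damp(1.0, 50, 5)); // ~0.91: well supported, barely damped
  }
}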