I don't see why documents with only one word in common should have a similarity of 1.0 in RowSimilarityJob. consider() is only invoked if you specify a threshold for the similarity.
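To spell out the degenerate case: if the similarity is computed over
matching entries only, two documents that share just a single term t
reduce to the one-element vectors (a_t) and (b_t), and

  sim = (a_t * b_t) / (sqrt(a_t^2) * sqrt(b_t^2)) = 1.0

no matter what the counts a_t and b_t are. (This is bangbig's formula
from further down the thread, restated for the one-common-word case;
the cosine over all entries does not collapse like this.)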
UncenteredCosineSimilarity works on matching entries only, which is
problematic for documents, because for documents the empty entries carry
meaning (0 term occurrences), unlike in collaborative filtering data.
Maybe we should remove UncenteredCosine and create another similarity
implementation that computes the cosine correctly over all entries.

--sebastian

On 02.10.2012 10:08, yamo93 wrote:
> Hello Seb,
>
> In my understanding, the algorithm is the same (except for the
> normalization part) as UncenteredCosine (with the drawback that vectors
> with only one word in common have a similarity of 1.0) ... but the
> results are quite different (is this just an effect of the consider()
> method, which removes irrelevant values?) ...
>
> I looked at the code, but there is almost nothing in
> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity;
> the code seems to be in SimilarityReducer, which is not so simple to
> understand ...
>
> Thanks for helping,
>
> On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
>> The cosine similarity as computed by RowSimilarityJob is the cosine
>> similarity between the whole vectors.
>>
>> See
>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
>> for details.
>>
>> First, both vectors are scaled to unit length in normalize(), and
>> after this their dot product in similarity() (which can be computed
>> from the elements that exist in both vectors) gives the cosine
>> between them.
>>
>> On 01.10.2012 21:52, bangbig wrote:
>>> I think it's better to first understand how RowSimilarityJob gets
>>> its result.
>>> For two items,
>>> itemA, 0, 0, a1, a2, a3, 0
>>> itemB, 0, b1, b2, b3, 0, 0
>>> when computing, it just uses the overlapping parts of the vectors
>>> (a1, a2 and b2, b3).
>>> The cosine similarity thus is
>>> (a1*b2 + a2*b3) / (sqrt(a1*a1 + a2*a2) * sqrt(b2*b2 + b3*b3)).
>>> 1) if itemA and itemB have just one common word, the result is 1;
>>> 2) if the values of the vectors are almost the same, the result
>>> would also be nearly 1;
>>> and for the two cases above, I think you could consider using
>>> association rules to address the problem.
>>>
>>> At 2012-10-01 20:53:16, yamo93 <[email protected]> wrote:
>>>> It seems that RowSimilarityJob does not have the same weakness,
>>>> even though I also use CosineSimilarity. Why?
>>>>
>>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>>> Yes, this is one of the weaknesses of this particular flavor of
>>>>> this particular similarity metric. The sparser the data, the worse
>>>>> the problem is in general. There are some band-aid solutions, like
>>>>> applying some kind of weight against similarities that are based
>>>>> on a small intersection size. Or you can pretend that missing
>>>>> values are 0 (PreferenceInferrer), which can introduce its own
>>>>> problems, or perhaps use some mean value.
>>>>>
>>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <[email protected]> wrote:
>>>>>> Thanks for replying.
>>>>>>
>>>>>> So, documents with only one word in common have a better chance
>>>>>> of being similar than documents with more words in common, right?
>>>>>>
>>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>>> Similar items, right? You should look at the vectors that have
>>>>>>> 1.0 similarity and see if they are in fact collinear. This is
>>>>>>> still by far the most likely explanation. Remember that the
>>>>>>> vector similarity is computed over elements that exist in both
>>>>>>> vectors only. They just have to have 2 identical values for this
>>>>>>> to happen.
>>>>>>>
>>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <[email protected]> wrote:
>>>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>>>> It sounds like a bug somewhere.
>>>>>>>>
>>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>>> It's possible this is correct. 1.0 is the maximum similarity
>>>>>>>>> and occurs when two vectors are just a scalar multiple of each
>>>>>>>>> other (0 angle between them). It's possible there are several
>>>>>>>>> of these, and so their 1.0 similarities dominate the result.
>>>>>>>>>
>>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <[email protected]> wrote:
>>>>>>>>>> I saw something strange: all recommended items, returned by
>>>>>>>>>> mostSimilarItems(), have a value of 1.0.
>>>>>>>>>> Is it normal?
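For anyone who wants to check the difference concretely, here is a
minimal, self-contained sketch. This is plain Java, not Mahout's actual
implementation; the class name, vectors and numbers are illustrative
only, and zero entries stand for absent terms.

public class CosineSketch {

  // Full cosine: dot product over all entries divided by the product of
  // the full vector norms. Absent (zero) entries add nothing to the dot
  // product, so it can equally be computed from the entries present in
  // both vectors -- normalize to unit length first, then the dot product
  // over common entries is the cosine, as Sebastian describes above.
  static double fullCosine(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  // "Uncentered cosine over matching entries only": the norms are taken
  // over the co-occurring entries alone, which is what inflates
  // similarities for sparse document vectors.
  static double matchingEntriesCosine(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      if (a[i] != 0 && b[i] != 0) { // entry present in both vectors
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
      }
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    // bangbig's layout: itemA and itemB overlap in positions 2 and 3.
    double[] itemA = {0, 0, 3, 1, 2, 0};
    double[] itemB = {0, 4, 1, 2, 0, 0};
    System.out.println(fullCosine(itemA, itemB));            // ~0.29
    System.out.println(matchingEntriesCosine(itemA, itemB)); // ~0.71

    // Two documents with a single word in common: the matching-entries
    // variant is pinned at 1.0 regardless of the rest of the vectors.
    double[] docA = {5, 0, 7, 0};
    double[] docB = {2, 3, 0, 0};
    System.out.println(matchingEntriesCosine(docA, docB));   // 1.0
    System.out.println(fullCosine(docA, docB));              // ~0.32
  }
}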

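And a sketch of the "band-aid" Sean mentions: damping similarities that
rest on a small intersection. The n/(n+k) shrink used here is one common
weighting choice, not something Mahout ships; the class, method and
constant are hypothetical.

public class IntersectionDamping {

  // Shrink a similarity toward 0 when only few entries co-occur;
  // k controls how many common entries are needed to keep the score.
  static double damp(double sim, int intersectionSize, int k) {
    return sim * intersectionSize / (double) (intersectionSize + k);
  }

  public static void main(String[] args) {
    int k = 5; // illustrative constant; larger k penalizes small overlaps more
    // A "perfect" 1.0 built on one common term is pulled down hard,
    // while the same score backed by 50 common terms barely moves.
    System.out.println(damp(1.0, 1, k));  // ~0.17
    System.out.println(damp(1.0, 50, k)); // ~0.91
  }
}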