Re: Need to reduce execution time of RowSimilarityJob

Sebastian Schelter Mon, 01 Oct 2012 13:03:40 -0700

This is not true.


On 01.10.2012 21:52, bangbig wrote:
> I think it helps to understand how RowSimilarityJob computes its result.
> For two items,
> itemA, 0, 0,  a1, a2, a3, 0
> itemB, 0, b1, b2, b3, 0,  0
> the computation uses only the overlapping parts of the vectors (shown in
> blue in the original message), i.e. the positions where both are non-zero.
> The cosine similarity is thus
> (a1*b2 + a2*b3) / (sqrt(a1*a1 + a2*a2) * sqrt(b2*b2 + b3*b3))
> 1) If itemA and itemB have just one word in common, the result is 1;
> 2) if the values of the vectors are almost the same, the result is also
> nearly 1.
> For these two cases, I think you could consider using association rules
> to approach the problem.
> 
> At 2012-10-01 20:53:16,yamo93 <[email protected]> wrote:
>> It seems that RowSimilarityJob does not have the same weakness, but I 
>> also use CosineSimilarity there. Why?
>>
>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>> Yes, this is one of the weaknesses of this particular flavor of this
>>> particular similarity metric. The more sparse, the worse the problem
>>> is in general. There are some band-aid solutions like applying some
>>> kind of weight against similarities based on small intersection size.
>>> Or you can pretend that missing values are 0 (PreferenceInferrer),
>>> which can introduce its own problems, or perhaps some mean value.
>>>
>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <[email protected]> wrote:
>>>> Thanks for replying.
>>>>
>>>> So, documents with only one word in common are more likely to come out
>>>> as similar than documents with more words in common, right?
>>>>
>>>>
>>>>
>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>> Similar items, right? You should look at the vectors that have 1.0
>>>>> similarity and see if they are in fact collinear. This is still by far
>>>>> the most likely explanation. Remember that the vector similarity is
>>>>> computed over elements that exist in both vectors only. They just have
>>>>> to have 2 identical values for this to happen.
>>>>>
>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <[email protected]> wrote:
>>>>>> For each item, I have 10 recommended items with a value of 1.0.
>>>>>> It sounds like a bug somewhere.
>>>>>>
>>>>>>
>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>>>>> occurs when two vectors are just a scalar multiple of each other (0
>>>>>>> angle between them). It's possible there are several of these, and so
>>>>>>> their 1.0 similarities dominate the result.
>>>>>>>
>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <[email protected]> wrote:
>>>>>>>> I saw something strange: all recommended items, returned by
>>>>>>>> mostSimilarItems(), have a value of 1.0.
>>>>>>>> Is this normal?
>>>>
>>
>
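
To make the disagreement in this thread concrete, here is a minimal Python sketch (not Mahout code) of the two cosine definitions being discussed: one over the complete vectors, treating missing entries as 0, and one restricted to the entries present in both vectors, as in the formula quoted above. The function names, the sample vectors, and the damping constant `k` are all hypothetical and chosen only for illustration. The sketch does not establish which definition any particular Mahout class applies.

```python
import math


def cosine_full(a, b):
    """Cosine over the complete sparse vectors; missing entries count as 0."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_overlap(a, b):
    """Cosine restricted to the keys that are present in both vectors."""
    common = a.keys() & b.keys()
    dot = sum(a[k] * b[k] for k in common)
    norm_a = math.sqrt(sum(a[k] ** 2 for k in common))
    norm_b = math.sqrt(sum(b[k] ** 2 for k in common))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def damped(similarity, overlap_size, k=5):
    """Hypothetical band-aid: shrink similarities backed by a small overlap.

    k is an arbitrary illustrative constant, not a Mahout parameter.
    """
    return similarity * overlap_size / (overlap_size + k)


# Sparse vectors as {column index: value}; the values are made up.
item_a = {2: 3.0, 3: 1.0, 4: 2.0}   # 0, 0, a1, a2, a3, 0
item_b = {1: 5.0, 2: 1.0, 3: 4.0}   # 0, b1, b2, b3, 0, 0
doc_x = {0: 2.0, 7: 1.0}            # shares only column 7 with doc_y
doc_y = {7: 9.0, 8: 4.0}

if __name__ == "__main__":
    for name, (u, v) in {"item_a/item_b": (item_a, item_b),
                         "doc_x/doc_y": (doc_x, doc_y)}.items():
        print(name, "overlap-only:", cosine_overlap(u, v),
              "full:", cosine_full(u, v))
```

With a single shared column, the overlap-only variant compares two one-dimensional vectors, which are always collinear, so it returns 1.0 regardless of the values. The full-vector variant does not have this property because the non-shared entries still contribute to the norms. The `damped` helper illustrates the kind of intersection-size weighting mentioned earlier in the thread.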
