Re: Need to reduce execution time of RowSimilarityJob

Sean Owen Mon, 01 Oct 2012 03:38:01 -0700

Yes, this is one of the weaknesses of this flavor of the similarity
metric, which only looks at elements present in both vectors. The
sparser the data, the worse the problem generally gets. There are
some band-aid solutions, like down-weighting similarities that are
based on a small intersection size. Or you can pretend that missing
values are 0 (PreferenceInferrer), or perhaps some mean value, though
that can introduce its own problems.
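
To make the effect concrete, here is a minimal sketch in plain Python.
It is not Mahout code: the function names, the sample documents and
the shrink constant are all made up for illustration. It computes the
cosine over co-occurring elements only, then applies one possible
intersection-size weight.

```python
import math

def cosine_over_overlap(a, b):
    """Cosine similarity using only the keys present in both sparse vectors.

    Returns (similarity, overlap_size). Hypothetical helper, not a Mahout API.
    """
    common = a.keys() & b.keys()
    if not common:
        return 0.0, 0
    dot = sum(a[k] * b[k] for k in common)
    norm_a = math.sqrt(sum(a[k] ** 2 for k in common))
    norm_b = math.sqrt(sum(b[k] ** 2 for k in common))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0, len(common)
    return dot / (norm_a * norm_b), len(common)

def damped(similarity, overlap, shrink=5.0):
    """Scale a similarity down when it rests on few shared elements.

    The shrink value is an arbitrary assumption; it would need tuning.
    """
    return similarity * overlap / (overlap + shrink)

# Two documents sharing a single word: the restricted vectors are
# one-dimensional, so they are always collinear.
doc_a = {"mahout": 3.0, "hadoop": 1.0}
doc_b = {"mahout": 7.0, "spark": 2.0}

# Two documents sharing three words with different weights.
doc_c = {"mahout": 3.0, "hadoop": 1.0, "pig": 2.0}
doc_d = {"mahout": 2.0, "hadoop": 2.0, "pig": 2.0}

sim_one, n_one = cosine_over_overlap(doc_a, doc_b)
sim_many, n_many = cosine_over_overlap(doc_c, doc_d)

print(sim_one, n_one, damped(sim_one, n_one))
print(sim_many, n_many, damped(sim_many, n_many))
```

The single-word pair scores higher than the three-word pair on the raw
measure, and the order reverses once the weight is applied. Whether
that kind of weight suits a given data set is a judgment call.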


On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <[email protected]> wrote:
> Thanks for replying.
>
> So, documents with only one word in common are more likely to come out
> as similar than documents with more words in common, right?
>
>
>
> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>
>> Similar items, right? You should look at the vectors that have 1.0
>> similarity and see if they are in fact collinear. This is still by far
>> the most likely explanation. Remember that the vector similarity is
>> computed over elements that exist in both vectors only. They just have
>> to have 2 identical values for this to happen.
>>
>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <[email protected]> wrote:
>>>
>>> For each item, I have 10 recommended items with a value of 1.0.
>>> It sounds like a bug somewhere.
>>>
>>>
>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>
>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>> occurs when two vectors are just a scalar multiple of each other (0
>>>> angle between them). It's possible there are several of these, and so
>>>> their 1.0 similarities dominate the result.
>>>>
>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <[email protected]> wrote:
>>>>>
>>>>> I saw something strange: all recommended items returned by
>>>>> mostSimilarItems() have a value of 1.0.
>>>>> Is that normal?
>
>
