OK, you're right: I have some sparseness in my data.

Would you recommend a different similarity algorithm for text-based data?

On 10/01/2012 12:37 PM, Sean Owen wrote:
Yes, this is one of the weaknesses of this particular flavor of this
particular similarity metric. The more sparse, the worse the problem
is in general. There are some band-aid solutions like applying some
kind of weight against similarities based on small intersection size.
Or you can pretend that missing values are 0 (PreferenceInferrer), or
perhaps some mean value, though that can introduce its own problems.
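
Something like this, as a rough sketch against the Taste API (the data file
name and item ID below are made up, and it's worth verifying whether the
inferrer actually comes into play on the item-item path in the version
you're using):

import java.io.File;
import org.apache.mahout.cf.taste.common.Weighting;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.AveragingPreferenceInferrer;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

public class SparseSimilaritySketch {
  public static void main(String[] args) throws Exception {
    // "data.csv" is a placeholder for your userID,itemID,value file.
    DataModel model = new FileDataModel(new File("data.csv"));

    // Band-aid 1: WEIGHTED pushes results further toward +/-1 the more data
    // points they are based on, one (imperfect) way to account for
    // intersection size.
    PearsonCorrelationSimilarity similarity =
        new PearsonCorrelationSimilarity(model, Weighting.WEIGHTED);

    // Band-aid 2: fill in a user's missing values with that user's average
    // instead of skipping them; this can introduce its own problems.
    similarity.setPreferenceInferrer(new AveragingPreferenceInferrer(model));

    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);
    // 123L is a made-up item ID; 10 most-similar items, as in your test.
    for (RecommendedItem item : recommender.mostSimilarItems(123L, 10)) {
      System.out.println(item.getItemID() + " " + item.getValue());
    }
  }
}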

On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <[email protected]> wrote:
Thanks for replying.

So documents with only one word in common have a better chance of coming
out as similar than documents with more words in common, right?



On 10/01/2012 11:28 AM, Sean Owen wrote:
Similar items, right? You should look at the vectors that have 1.0
similarity and see if they are in fact collinear. This is still by far
the most likely explanation. Remember that the vector similarity is
computed only over the elements that exist in both vectors. They just have
to have 2 identical values for this to happen.
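
For example, with made-up term weights: two documents that share only two
terms, with the same two weights on those terms, come out at exactly 1.0,
even though the full vectors are quite different, because everything
outside the intersection is simply ignored:

import java.util.HashMap;
import java.util.Map;

// A toy stand-in for the "intersection only" behavior described above,
// not the exact Mahout computation.
public class IntersectionCosineDemo {

  static double cosineOverSharedTerms(Map<String, Double> d1, Map<String, Double> d2) {
    double dot = 0.0, norm1 = 0.0, norm2 = 0.0;
    for (Map.Entry<String, Double> e : d1.entrySet()) {
      Double w2 = d2.get(e.getKey());
      if (w2 == null) {
        continue; // term missing from the other document: ignored, not treated as 0
      }
      double w1 = e.getValue();
      dot += w1 * w2;
      norm1 += w1 * w1;
      norm2 += w2 * w2;
    }
    return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
  }

  public static void main(String[] args) {
    Map<String, Double> d1 = new HashMap<String, Double>();
    d1.put("mahout", 1.0); d1.put("taste", 2.0); d1.put("hadoop", 5.0);
    Map<String, Double> d2 = new HashMap<String, Double>();
    d2.put("mahout", 1.0); d2.put("taste", 2.0); d2.put("lucene", 4.0);
    // Shared terms only: (1, 2) vs (1, 2), which are collinear.
    System.out.println(cosineOverSharedTerms(d1, d2)); // prints 1.0 (up to rounding)
  }
}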

On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <[email protected]> wrote:
For each item, I have 10 recommended items with a value of 1.0.
It sounds like a bug somewhere.


On 10/01/2012 11:06 AM, Sean Owen wrote:
It's possible this is correct. 1.0 is the maximum similarity and
occurs when two vectors are just scalar multiples of each other (0
angle between them). It's possible there are several of these, and so
their 1.0 similarities dominate the result.
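
For instance, as a standalone toy example: (1, 2, 3) and (2, 4, 6) are
different vectors, but one is a scalar multiple of the other, so the angle
between them is 0 and the cosine similarity is exactly 1.0:

public class ScalarMultipleDemo {
  public static void main(String[] args) {
    double[] a = {1, 2, 3};
    double[] b = {2, 4, 6}; // b = 2 * a: same direction, zero angle
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    // 28 / (sqrt(14) * sqrt(56)) = 28 / 28 = 1.0
    System.out.println(dot / (Math.sqrt(normA) * Math.sqrt(normB)));
  }
}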

On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <[email protected]> wrote:
I saw something strange: all recommended items returned by
mostSimilarItems() have a value of 1.0.
Is that normal?

