Yes, this is one of the weaknesses of this particular flavor of this similarity metric: because it is computed only over the elements that exist in both vectors, a small overlap can easily produce a similarity of 1.0. In general, the sparser the data, the worse the problem. There are some band-aid solutions, such as down-weighting similarities that are based on a small intersection. Or you can fill in the missing values (PreferenceInferrer), pretending they are 0 or perhaps some mean value, which can introduce its own problems.
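To make the effect concrete, here is a minimal sketch in plain Python, not Mahout's code. It assumes an uncentered cosine computed only over the keys two sparse vectors share. The `damped` function and its `shrink` constant are hypothetical, just one way to express the "weight against small intersections" band-aid.

```python
import math

def cosine_over_overlap(a, b):
    """Cosine similarity restricted to keys present in both sparse vectors.

    Returns (similarity, overlap_size); similarity is None when undefined.
    """
    common = a.keys() & b.keys()
    if not common:
        return None, 0
    dot = sum(a[k] * b[k] for k in common)
    norm_a = math.sqrt(sum(a[k] ** 2 for k in common))
    norm_b = math.sqrt(sum(b[k] ** 2 for k in common))
    if norm_a == 0 or norm_b == 0:
        return None, len(common)
    return dot / (norm_a * norm_b), len(common)

def damped(sim, overlap, shrink=5.0):
    """Hypothetical band-aid: pull similarities toward 0 when overlap is small."""
    return sim * overlap / (overlap + shrink)

# Made-up term-weight vectors, for illustration only.
doc_a = {"mahout": 3.0, "hadoop": 1.0, "vector": 2.0}
doc_b = {"mahout": 7.0, "recipe": 4.0}                 # shares one word with doc_a
doc_c = {"mahout": 3.0, "hadoop": 1.0, "vector": 1.0}  # shares three words with doc_a

sim_ab, n_ab = cosine_over_overlap(doc_a, doc_b)
sim_ac, n_ac = cosine_over_overlap(doc_a, doc_c)

print(sim_ab, n_ab)  # one shared word: the two 1-element vectors are collinear
print(sim_ac, n_ac)  # three shared words: close, but below 1.0
print(damped(sim_ab, n_ab), damped(sim_ac, n_ac))
```

With a single shared word the restricted vectors are always collinear, so the one-word pair outranks the three-word pair until the damping is applied. The value of `shrink` is arbitrary here and would need tuning on real data.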

On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <[email protected]> wrote:
> Thanks for replying.
>
> So, documents with only one word in common have more chance to be similar
> than documents with more words in common, right ?
>
> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>
>> Similar items, right? You should look at the vectors that have 1.0
>> similarity and see if they are in fact collinear. This is still by far
>> the most likely explanation. Remember that the vector similarity is
>> computed over elements that exist in both vectors only. They just have
>> to have 2 identical values for this to happen.
>>
>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <[email protected]> wrote:
>>>
>>> For each item, i have 10 recommended items with a value of 1.0.
>>> It sounds like a bug somewhere.
>>>
>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>
>>>> It's possible this is correct. 1.0 is the maximum similarity and
>>>> occurs when two vector are just a scalar multiple of each other (0
>>>> angle between them). It's possible there are several of these, and so
>>>> their 1.0 similarities dominate the result.
>>>>
>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <[email protected]> wrote:
>>>>>
>>>>> I saw something strange : all recommended items, returned by
>>>>> mostSimilarItems(), have a value of 1.0.
>>>>> Is it normal ?
