OK, I think I understand.

Let's take an example with two vectors, (1,1,1) and (0,1,0).
With UncenteredCosineSimilarity (as implemented in Taste), the similarity is 1.
With Cosine (as implemented in RowSimilarityJob), the similarity is 1/sqrt(3).

OK?
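For concreteness, the two numbers above can be reproduced with a short sketch (illustrative only; the class and method names below are made up, not Mahout's actual code):

```java
// Illustrative sketch, not Mahout code: contrast cosine over matching
// entries (Taste's UncenteredCosineSimilarity) with cosine over all
// entries (RowSimilarityJob's Cosine).
public class CosineExample {

    // cosine over all entries, as RowSimilarityJob computes it
    static double fullCosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // cosine over entries that are non-zero in both vectors only
    static double matchingCosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != 0 && b[i] != 0) {
                dot += a[i] * b[i];
                na += a[i] * a[i];
                nb += b[i] * b[i];
            }
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double[] u = {1, 1, 1};
        double[] v = {0, 1, 0};
        System.out.println(matchingCosine(u, v)); // 1.0
        System.out.println(fullCosine(u, v));     // 0.577... = 1/sqrt(3)
    }
}
```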

On 10/02/2012 11:50 AM, Sebastian Schelter wrote:
I don't see why documents with only one word in common should have a
similarity of 1.0 in RowSimilarityJob. consider() is only invoked if you
specify a threshold for the similarity.

UncenteredCosineSimilarity works on matching entries only, which is
problematic for documents, as empty entries have a meaning (0 term
occurrences) as opposed to collaborative filtering data.

Maybe we should remove UncenteredCosine and create another similarity
implementation that computes the cosine correctly over all entries.

--sebastian


On 02.10.2012 10:08, yamo93 wrote:
Hello Seb,

In my understanding, the algorithm is the same (except the normalization
part) as UncenteredCosine (with the drawback that vectors with only one
word in common have a similarity of 1.0)... but the results are quite
different. Is this just an effect of the consider() method, which removes
irrelevant values?

I looked at the code, but there is hardly anything in
org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity;
the code seems to be in SimilarityReducer, which is not so simple to
understand...

Thanks for helping,

On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
The cosine similarity as computed by RowSimilarityJob is the cosine
similarity between the whole vectors.

See org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
for details.

First, both vectors are scaled to unit length in normalize(); then their
dot product, computed in similarity() (which only needs the elements that
exist in both vectors), gives the cosine between them.
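A minimal sketch of that two-step computation (illustrative only, not the actual CosineSimilarity code):

```java
// Illustrative sketch of the two-step computation described above:
// normalize to unit length, then take the dot product. Not the actual
// CosineSimilarity implementation.
public class NormalizeThenDot {

    static double[] normalize(double[] v) {
        double norm = 0;
        for (double x : v) norm += x * x;
        norm = Math.sqrt(norm);
        double[] unit = new double[v.length];
        for (int i = 0; i < v.length; i++) unit[i] = v[i] / norm;
        return unit;
    }

    static double dot(double[] a, double[] b) {
        // zero entries contribute nothing, so only elements present in
        // both vectors actually matter here
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    static double cosine(double[] a, double[] b) {
        return dot(normalize(a), normalize(b));
    }
}
```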

On 01.10.2012 21:52, bangbig wrote:
I think it's better to understand how the RowSimilarityJob gets the
result.
For two items,
itemA: 0, 0,  a1, a2, a3, 0
itemB: 0, b1, b2, b3, 0,  0
when computing, it only uses the overlapping entries of the vectors
(a1/b2 and a2/b3 here). The cosine similarity thus is
(a1*b2 + a2*b3) / (sqrt(a1*a1 + a2*a2) * sqrt(b2*b2 + b3*b3))
1) if itemA and itemB have just one common word, the result is 1;
2) if the values of the two vectors are almost the same, the result will
also be nearly 1;
and for the two cases above, I think you can consider using association
rules to address the problem.
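The two cases above can be checked with a small sketch using the matching-entries cosine from that formula (illustrative only, not Mahout code):

```java
// Illustrative sketch of the two cases above, using the matching-entries
// cosine from the formula (not Mahout code).
public class OverlapCases {

    static double overlapCosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != 0 && b[i] != 0) {
                dot += a[i] * b[i];
                na += a[i] * a[i];
                nb += b[i] * b[i];
            }
        }
        return na == 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // case 1: a single common word gives a similarity of exactly 1,
        // whatever the two values are
        System.out.println(overlapCosine(new double[]{0, 3, 0}, new double[]{5, 2, 0}));
        // case 2: nearly identical overlapping values give nearly 1
        System.out.println(overlapCosine(new double[]{0, 1, 2, 3}, new double[]{4, 1.1, 2.0, 2.9}));
    }
}
```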

At 2012-10-01 20:53:16,yamo93 <[email protected]> wrote:
It seems that RowSimilarityJob does not have the same weakness, but I
also use CosineSimilarity. Why?

On 10/01/2012 12:37 PM, Sean Owen wrote:
Yes, this is one of the weaknesses of this particular flavor of this
particular similarity metric. The more sparse the data, the worse the
problem is in general. There are some band-aid solutions, like applying
some kind of weight against similarities based on small intersection
size. Or you can pretend that missing values are 0 (PreferenceInferrer),
or perhaps some mean value, though that can introduce its own problems.
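As an illustration of the intersection-size weighting idea, here is one possible shape it could take (the n / (n + 1) weight is purely made up for this sketch, not something Mahout provides):

```java
// Illustrative sketch of weighting a similarity by intersection size.
// The weight n / (n + 1) is just one possible choice, not Mahout code.
public class IntersectionWeighting {

    static double weighted(double rawSimilarity, int intersectionSize) {
        // similarities built on tiny overlaps are shrunk toward 0;
        // with a large overlap the weight approaches 1
        double weight = intersectionSize / (intersectionSize + 1.0);
        return rawSimilarity * weight;
    }

    public static void main(String[] args) {
        System.out.println(weighted(1.0, 1));  // a 1.0 built on one shared word is halved
        System.out.println(weighted(1.0, 50)); // a 1.0 built on fifty shared words stays close to 1
    }
}
```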

On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <[email protected]> wrote:
Thanks for replying.

So documents with only one word in common have a better chance of being
similar than documents with more words in common, right?



On 10/01/2012 11:28 AM, Sean Owen wrote:
Similar items, right? You should look at the vectors that have 1.0
similarity and see if they are in fact collinear. This is still by far
the most likely explanation. Remember that the vector similarity is
computed over elements that exist in both vectors only. They just have
to have 2 identical values for this to happen.

On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <[email protected]> wrote:
For each item, I have 10 recommended items with a value of 1.0.
It sounds like a bug somewhere.


On 10/01/2012 11:06 AM, Sean Owen wrote:
It's possible this is correct. 1.0 is the maximum similarity and occurs
when two vectors are just a scalar multiple of each other (0 angle
between them). It's possible there are several of these, and so their
1.0 similarities dominate the result.

On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <[email protected]> wrote:
I saw something strange: all recommended items returned by
mostSimilarItems() have a value of 1.0.
Is that normal?
