Hi Stefan,
I checked the implementation of RowSimilarityJob and we might still have
a bug in the 0.5 release... (f**k). I don't know if your problem is
caused by that, but the similarity scores might not be correct...
We had this issue in 0.4 already, when someone realized that
cooccurrences were mapped out inconsistently, so for 0.5 we made sure
that we always map out the smaller row as the first value. But apparently
I did not adjust the value ordering for the Cooccurrence object accordingly...
In 0.5 the code is:
if (rowA <= rowB) {
  rowPair.set(rowA, rowB, weightA, weightB);
} else {
  rowPair.set(rowB, rowA, weightB, weightA);
}
coocurrence.set(column.get(), valueA, valueB);
But it should be (this was already fixed in the current trunk a few days ago):
if (rowA <= rowB) {
  rowPair.set(rowA, rowB, weightA, weightB);
  coocurrence.set(column.get(), valueA, valueB);
} else {
  rowPair.set(rowB, rowA, weightB, weightA);
  coocurrence.set(column.get(), valueB, valueA);
}
Maybe you could rerun your test with the current trunk?
--sebastian
On 14.06.2011 20:54, Sean Owen wrote:
It is a similarity, not a distance. Higher values mean more
similarity, not less.
I agree that similarity ought to decrease with more dimensions. That
is what you observe -- except that you see quite high average
similarity with no dimension reduction!
An average cosine similarity of 0.87 sounds "high" to me for anything
but a few dimensions. What's the dimensionality of the input without
dimension reduction?
Something is amiss in this pipeline. It is an interesting question!
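To make the distinction concrete, here is a toy sketch in plain Java (not
what RowSimilarityJob actually runs, and the method names are made up):
the similarity is the cosine itself, while a cosine distance is commonly
taken as 1 minus that value, so the two move in opposite directions.

// Toy illustration only: cosine similarity vs. cosine distance.
public class CosineSketch {

  // cos(angle) = dot(a, b) / (|a| * |b|); for non-negative vectors such as
  // TF-IDF rows this lies in [0, 1], and higher means more similar.
  static double cosineSimilarity(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  // One common cosine distance: 1 - similarity, so it grows as vectors diverge.
  static double cosineDistance(double[] a, double[] b) {
    return 1.0 - cosineSimilarity(a, b);
  }

  public static void main(String[] args) {
    double[] a = {1, 0, 2};
    double[] b = {1, 1, 1};
    System.out.println("similarity = " + cosineSimilarity(a, b)); // ~0.77, high = similar
    System.out.println("distance   = " + cosineDistance(a, b));   // ~0.23, low = similar
  }
}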
On Tue, Jun 14, 2011 at 7:39 PM, Stefan Wienert<[email protected]> wrote:
Actually I'm using RowSimilarityJob() with
--input input
--output output
--numberOfColumns documentCount
--maxSimilaritiesPerRow documentCount
--similarityClassname SIMILARITY_UNCENTERED_COSINE
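(For completeness: that boils down to something like the sketch below if the
job is driven from Java instead of the command line. This is only a sketch:
the import path is from memory and may differ in your checkout, and
documentCount is a placeholder for the real number of documents.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
// Package path as I remember it from the 0.5 sources -- please verify.
import org.apache.mahout.math.hadoop.similarity.RowSimilarityJob;

public class RunRowSimilarity {
  public static void main(String[] args) throws Exception {
    int documentCount = 1000; // placeholder: the actual number of documents
    ToolRunner.run(new Configuration(), new RowSimilarityJob(), new String[] {
        "--input", "input",
        "--output", "output",
        "--numberOfColumns", String.valueOf(documentCount),
        "--maxSimilaritiesPerRow", String.valueOf(documentCount),
        "--similarityClassname", "SIMILARITY_UNCENTERED_COSINE"
    });
  }
}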
Actually I am not really sure what this SIMILARITY_UNCENTERED_COSINE
calculates...
the source says: "distributed implementation of cosine similarity that
does not center its data"
So... this seems to be the similarity and not the distance?
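If I read the name correctly (the following is just a sketch of the idea in
plain Java, not the actual Mahout code), "uncentered" means the raw vectors
are used as-is, whereas a centered variant would subtract each vector's own
mean first, which makes it behave like a Pearson correlation:

// Sketch of uncentered vs. centered cosine -- illustration only.
public class UncenteredVsCentered {

  static double cosine(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  // Subtract the vector's own mean from every component.
  static double[] center(double[] v) {
    double mean = 0;
    for (double x : v) {
      mean += x;
    }
    mean /= v.length;
    double[] c = new double[v.length];
    for (int i = 0; i < v.length; i++) {
      c[i] = v[i] - mean;
    }
    return c;
  }

  public static void main(String[] args) {
    double[] a = {3, 1, 0};
    double[] b = {2, 2, 1};
    System.out.println("uncentered cosine: " + cosine(a, b));
    System.out.println("centered cosine:   " + cosine(center(a), center(b)));
  }
}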
Cheers,
Stefan
2011/6/14 Stefan Wienert<[email protected]>:
but... why do I get such different results with plain cosine similarity
when I use no dimension reduction (100,000 dimensions)?
2011/6/14 Fernando Fernández<[email protected]>:
Actually that's what your results are showing, aren't they? With rank 1000
the similarity avg is the lowest...
2011/6/14 Jake Mannix<[email protected]>
Actually, wait - are your graphs showing *similarity* or *distance*? In
higher dimensions, *distance* (and the angle between the vectors) should
grow, but on the other hand, *similarity* (cos(angle)) should go toward 0.
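To illustrate, a toy experiment (dense random Gaussian vectors, which is not
what sparse TF-IDF data looks like, but it shows the tendency): the average
cosine between unrelated vectors shrinks toward 0 as the dimension grows,
roughly like 1/sqrt(dim). That is why a high average similarity in a space
with very many dimensions looks suspicious.

import java.util.Random;

// Toy experiment only: mean |cosine| of random zero-mean vectors by dimension.
public class HighDimCosine {

  static double cosine(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  public static void main(String[] args) {
    Random rnd = new Random(42);
    int trials = 200;
    for (int dim : new int[] {10, 100, 1000, 10000}) {
      double sumAbs = 0;
      for (int t = 0; t < trials; t++) {
        double[] a = new double[dim];
        double[] b = new double[dim];
        for (int i = 0; i < dim; i++) {
          a[i] = rnd.nextGaussian();
          b[i] = rnd.nextGaussian();
        }
        sumAbs += Math.abs(cosine(a, b));
      }
      System.out.println("dim=" + dim + "  mean |cosine| = " + sumAbs / trials);
    }
  }
}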
On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert<[email protected]>
wrote:
Hey guys,
I am seeing some strange results in my LSA pipeline.
First, let me explain the steps my data goes through:
1) Extract the term-document matrix (TDM) from a Lucene datastore, using
TF-IDF as the weighting
2) Transpose the TDM
3a) Run Mahout SVD (Lanczos) on the transposed TDM
3b) Run Mahout SSVD (stochastic SVD) on the transposed TDM
3c) Use no dimension reduction (for testing purposes)
4) Transpose the result (only for none / SVD)
5) Calculate the cosine similarity (from Mahout)
Now... some strange things happen:
First of all: the demo data shows the similarity from document 1 to
all other documents.
The results using only cosine similarity (without dimension reduction):
http://the-lord.de/img/none.png
The result using SVD, rank 10:
http://the-lord.de/img/svd-10.png
Some points fall down to the bottom.
The results using SSVD, rank 10:
http://the-lord.de/img/ssvd-10.png
The result using SVD, rank 100:
http://the-lord.de/img/svd-100.png
More points fall down to the bottom.
The results using SSVD, rank 100:
http://the-lord.de/img/ssvd-100.png
The results using SVD, rank 200:
http://the-lord.de/img/svd-200.png
Even more points fall down to the bottom.
The results using SVD, rank 1000:
http://the-lord.de/img/svd-1000.png
Most points are at the bottom.
Please note the scale:
- the avg for none: 0.8712
- the avg for SVD rank 10: 0.2648
- the avg for SVD rank 100: 0.0628
- the avg for SVD rank 200: 0.0238
- the avg for SVD rank 1000: 0.0116
So my question is:
Can you explain this behavior? Why do the similarity scores become more
and more equal (and drop toward zero) as the SVD rank increases? I thought
it would be the opposite.
Cheers
Stefan
--
Stefan Wienert
http://www.wienert.cc
[email protected]
Telefon: +495251-2026838
Mobil: +49176-40170270