Interesting. (One point confuses me re: Lanczos -- is it computing the U eigenvectors or the V ones? The doc just says "eigenvectors" without specifying left or right. If it is V (the right singular vectors), this sequence should be fine.)
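(For reference, the algebra behind the question: with A = U * Sigma * V', we have A' * A = V * Sigma^2 * V' and A * A' = U * Sigma^2 * U'. So a Lanczos solver run on A' * A yields the right singular vectors V, while one run on A * A' yields the left singular vectors U -- which one "eigenvectors" means depends on which of the two operators the solver builds internally.)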
With SSVD I don't do a transpose; I just compute U, which produces the document singular vectors directly. Also, I am not sure that Lanczos actually normalizes the eigenvectors, but SSVD does (or multiplies the normalized version by the square root of the singular value, whichever is requested). So depending on which space your rotated results end up in, the cosine similarities may differ (see the sketch at the end of this mail). I assume you used the normalized (true) eigenvectors from SSVD. It would also be interesting to know what oversampling parameter (p) you used.

Thanks.
-d

On Tue, Jun 14, 2011 at 2:04 PM, Stefan Wienert <[email protected]> wrote:
> So... let's check the dimensions:
>
> First step: Lucene output:
> 227 rows (= docs) and 107909 cols (= terms)
>
> transposed to:
> 107909 rows and 227 cols
>
> reduced with SVD (rank 100) to:
> 99 rows and 227 cols
>
> transposed to (actually there was a bug, with no effect on the SVD
> result but on the NONE result):
> 227 rows and 99 cols
>
> So... now the cosine results are very similar to SVD 200.
>
> Results are added.
>
> @Sebastian: I will check if the bug affects my results.
>
> 2011/6/14 Fernando Fernández <[email protected]>:
>> Hi Stefan,
>>
>> Are you sure you need to transpose the input matrix? I thought that what
>> you get from the Lucene index was already a document(rows)-term(columns)
>> matrix, but you say that you obtain a term-document matrix and transpose
>> it. Is this correct? What are you using to obtain this matrix from Lucene?
>> Is it possible that you are calculating similarities with the wrong matrix
>> in one of the two cases? (With/without dimension reduction.)
>>
>> Best,
>> Fernando.
>>
>> 2011/6/14 Sebastian Schelter <[email protected]>
>>
>>> Hi Stefan,
>>>
>>> I checked the implementation of RowSimilarityJob and we might still have
>>> a bug in the 0.5 release... (f**k). I don't know if your problem is
>>> caused by that, but the similarity scores might not be correct...
>>>
>>> We had this issue in 0.4 already, when someone realized that
>>> cooccurrences were mapped out inconsistently, so for 0.5 we made sure
>>> that we always map the smaller row as the first value. But apparently I
>>> did not adjust the value setting for the Cooccurrence object...
>>>
>>> In 0.5 the code is:
>>>
>>> if (rowA <= rowB) {
>>>   rowPair.set(rowA, rowB, weightA, weightB);
>>> } else {
>>>   rowPair.set(rowB, rowA, weightB, weightA);
>>> }
>>> coocurrence.set(column.get(), valueA, valueB);
>>>
>>> But it should be (already fixed in current trunk some days ago):
>>>
>>> if (rowA <= rowB) {
>>>   rowPair.set(rowA, rowB, weightA, weightB);
>>>   coocurrence.set(column.get(), valueA, valueB);
>>> } else {
>>>   rowPair.set(rowB, rowA, weightB, weightA);
>>>   coocurrence.set(column.get(), valueB, valueA);
>>> }
>>>
>>> Maybe you could rerun your test with the current trunk?
>>>
>>> --sebastian
>>>
>>>
>>> On 14.06.2011 20:54, Sean Owen wrote:
>>>
>>>> It is a similarity, not a distance. Higher values mean more
>>>> similarity, not less.
>>>>
>>>> I agree that similarity ought to decrease with more dimensions. That
>>>> is what you observe -- except that you see quite high average
>>>> similarity with no dimension reduction!
>>>>
>>>> An average cosine similarity of 0.87 sounds "high" to me for anything
>>>> but a few dimensions. What's the dimensionality of the input without
>>>> dimension reduction?
>>>>
>>>> Something is amiss in this pipeline. It is an interesting question!
>>>>
>>>> On Tue, Jun 14, 2011 at 7:39 PM, Stefan Wienert <[email protected]> wrote:
>>>>
>>>>> Actually I'm using RowSimilarityJob() with
>>>>> --input input
>>>>> --output output
>>>>> --numberOfColumns documentCount
>>>>> --maxSimilaritiesPerRow documentCount
>>>>> --similarityClassname SIMILARITY_UNCENTERED_COSINE
>>>>>
>>>>> Actually, I am not really sure what this SIMILARITY_UNCENTERED_COSINE
>>>>> calculates...
>>>>> The source says: "distributed implementation of cosine similarity that
>>>>> does not center its data".
>>>>>
>>>>> So... this seems to be the similarity and not the distance?
>>>>>
>>>>> Cheers,
>>>>> Stefan
>>>>>
>>>>> 2011/6/14 Stefan Wienert <[email protected]>:
>>>>>
>>>>>> But... why do I get different results with cosine similarity when
>>>>>> using no dimension reduction (with 100,000 dimensions)?
>>>>>>
>>>>>> 2011/6/14 Fernando Fernández <[email protected]>:
>>>>>>
>>>>>>> Actually, that's what your results are showing, aren't they? With
>>>>>>> rank 1000 the similarity avg is the lowest...
>>>>>>>
>>>>>>> 2011/6/14 Jake Mannix <[email protected]>
>>>>>>>
>>>>>>>> Actually, wait -- are your graphs showing *similarity* or
>>>>>>>> *distance*? In higher dimensions, *distance* (1 - cos(angle))
>>>>>>>> should grow, but on the other hand, *similarity* (cos(angle))
>>>>>>>> should go toward 0.
>>>>>>>>
>>>>>>>> On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hey guys,
>>>>>>>>>
>>>>>>>>> I have some strange results in my LSA pipeline.
>>>>>>>>>
>>>>>>>>> First, let me explain the steps my data goes through:
>>>>>>>>> 1) Extract the term-document matrix from a Lucene datastore, using
>>>>>>>>> TF-IDF weighting
>>>>>>>>> 2) Transpose the TDM
>>>>>>>>> 3a) Run Mahout SVD (Lanczos) on the transposed TDM
>>>>>>>>> 3b) Run Mahout SSVD (stochastic SVD) on the transposed TDM
>>>>>>>>> 3c) Use no dimension reduction (for testing purposes)
>>>>>>>>> 4) Transpose the result (ONLY none / svd)
>>>>>>>>> 5) Calculate the cosine similarity (from Mahout)
>>>>>>>>>
>>>>>>>>> Now... some strange things happen.
>>>>>>>>> First of all: the demo data shows the similarity from document 1
>>>>>>>>> to all other documents.
>>>>>>>>>
>>>>>>>>> The results using only cosine similarity (without dimension
>>>>>>>>> reduction):
>>>>>>>>> http://the-lord.de/img/none.png
>>>>>>>>>
>>>>>>>>> The result using SVD, rank 10:
>>>>>>>>> http://the-lord.de/img/svd-10.png
>>>>>>>>> Some points fall down to the bottom.
>>>>>>>>>
>>>>>>>>> The results using SSVD, rank 10:
>>>>>>>>> http://the-lord.de/img/ssvd-10.png
>>>>>>>>>
>>>>>>>>> The result using SVD, rank 100:
>>>>>>>>> http://the-lord.de/img/svd-100.png
>>>>>>>>> More points fall down to the bottom.
>>>>>>>>>
>>>>>>>>> The results using SSVD, rank 100:
>>>>>>>>> http://the-lord.de/img/ssvd-100.png
>>>>>>>>>
>>>>>>>>> The results using SVD, rank 200:
>>>>>>>>> http://the-lord.de/img/svd-200.png
>>>>>>>>> Even more points fall down to the bottom.
>>>>>>>>>
>>>>>>>>> The results using SVD, rank 1000:
>>>>>>>>> http://the-lord.de/img/svd-1000.png
>>>>>>>>> Most points are at the bottom.
>>>>>>>>>
>>>>>>>>> Please note the scale:
>>>>>>>>> - the avg for none: 0.8712
>>>>>>>>> - the avg for SVD rank 10: 0.2648
>>>>>>>>> - the avg for SVD rank 100: 0.0628
>>>>>>>>> - the avg for SVD rank 200: 0.0238
>>>>>>>>> - the avg for SVD rank 1000: 0.0116
>>>>>>>>>
>>>>>>>>> So my question is:
>>>>>>>>> Can you explain this behavior?
>>>>>>>>> Why are the documents getting more equal with more ranks in SVD?
>>>>>>>>> I thought it would be the opposite.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Stefan
>
> --
> Stefan Wienert
>
> http://www.wienert.cc
> [email protected]
>
> Telefon: +495251-2026838
> Mobil: +49176-40170270
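P.S. To make the normalization point above concrete: pairwise cosines don't change if you normalize each document vector, but they do change if the coordinates themselves are stretched by sqrt(sigma_k) -- which is exactly the difference between rows of U and rows of U * Sigma^(1/2). A minimal sketch in plain Java (no Mahout dependency; the singular values and U-rows below are made up for illustration):

public class ScalingDemo {

  static double cosine(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int k = 0; k < a.length; k++) {
      dot += a[k] * b[k];
      na  += a[k] * a[k];
      nb  += b[k] * b[k];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  public static void main(String[] args) {
    // Hypothetical singular values and U-rows, for illustration only.
    double[] sigma = { 9.0, 4.0, 1.0 };
    double[] docA  = { 0.5, 0.3, 0.8 };   // row of U for document A
    double[] docB  = { 0.4, -0.6, 0.7 };  // row of U for document B

    double[] scaledA = new double[3];
    double[] scaledB = new double[3];
    for (int k = 0; k < 3; k++) {
      // rows of U * Sigma^(1/2): each coordinate stretched by sqrt(sigma_k)
      scaledA[k] = docA[k] * Math.sqrt(sigma[k]);
      scaledB[k] = docB[k] * Math.sqrt(sigma[k]);
    }

    System.out.println("cosine in U space:             " + cosine(docA, docB));       // ~0.583
    System.out.println("cosine in U*Sigma^(1/2) space: " + cosine(scaledA, scaledB)); // ~0.496
  }
}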

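P.P.S. Re: Stefan's question about what SIMILARITY_UNCENTERED_COSINE computes -- as I read the source comment he quotes, "uncentered" just means plain cosine on the raw vectors, as opposed to the Pearson-style variant that subtracts each vector's mean first. A toy sketch (plain Java again, made-up numbers):

public class UncenteredCosineDemo {

  static double cosine(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na  += a[i] * a[i];
      nb  += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  // Subtract the vector's own mean from every component.
  static double[] center(double[] v) {
    double mean = 0;
    for (double x : v) mean += x;
    mean /= v.length;
    double[] c = new double[v.length];
    for (int i = 0; i < v.length; i++) c[i] = v[i] - mean;
    return c;
  }

  public static void main(String[] args) {
    double[] a = { 1, 2, 3 };
    double[] b = { 2, 4, 7 };
    // "Uncentered": cosine on the raw vectors.
    System.out.println("uncentered cosine: " + cosine(a, b));                  // ~0.997
    // Centered variant: cosine after mean-subtraction (Pearson correlation).
    System.out.println("centered cosine:   " + cosine(center(a), center(b)));  // ~0.993
  }
}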