Interesting. (One point confuses me re: Lanczos -- is it computing the U eigenvectors or the V ones? The doc just says "eigenvectors" without specifying left or right. If it is V (the right singular vectors), this sequence should be fine.)
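(For reference, the algebra behind the question: with A = U * Sigma * V', we have A' * A = V * Sigma^2 * V' and A * A' = U * Sigma^2 * U'. So a Lanczos solver run on A' * A yields the right singular vectors V, while one run on A * A' yields the left singular vectors U -- which one "eigenvectors" means depends on which of the two operators the solver builds internally.)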
With SSVD I don't do a transpose; I just compute U, which produces the document singular vectors directly. Also, I am not sure that Lanczos actually normalizes the eigenvectors, but SSVD does (or multiplies the normalized version by the square root of the singular value, whichever is requested). So depending on which space your rotated results end up in, the cosine similarities may differ (see the sketch at the end of this mail). I assume you used the normalized (true) eigenvectors from SSVD. It would also be interesting to know what oversampling parameter (p) you used.

Thanks.
-d

On Tue, Jun 14, 2011 at 2:04 PM, Stefan Wienert <[email protected]> wrote:
> So... let's check the dimensions:
>
> First step: Lucene output:
> 227 rows (= docs) and 107909 cols (= terms)
>
> transposed to:
> 107909 rows and 227 cols
>
> reduced with SVD (rank 100) to:
> 99 rows and 227 cols
>
> transposed to (actually there was a bug, with no effect on the SVD
> result but on the NONE result):
> 227 rows and 99 cols
>
> So... now the cosine results are very similar to SVD 200.
>
> Results are added.
>
> @Sebastian: I will check if the bug affects my results.
>
> 2011/6/14 Fernando Fernández <[email protected]>:
>> Hi Stefan,
>>
>> Are you sure you need to transpose the input matrix? I thought that what
>> you get from the Lucene index was already a document(rows)-term(columns)
>> matrix, but you say that you obtain a term-document matrix and transpose
>> it. Is this correct? What are you using to obtain this matrix from Lucene?
>> Is it possible that you are calculating similarities with the wrong matrix
>> in one of the two cases? (With/without dimension reduction.)
>>
>> Best,
>> Fernando.
>>
>> 2011/6/14 Sebastian Schelter <[email protected]>
>>
>>> Hi Stefan,
>>>
>>> I checked the implementation of RowSimilarityJob and we might still have
>>> a bug in the 0.5 release... (f**k). I don't know if your problem is
>>> caused by that, but the similarity scores might not be correct...
>>>
>>> We had this issue in 0.4 already, when someone realized that
>>> cooccurrences were mapped out inconsistently, so for 0.5 we made sure
>>> that we always map the smaller row as the first value. But apparently I
>>> did not adjust the value setting for the Cooccurrence object...
>>>
>>> In 0.5 the code is:
>>>
>>> if (rowA <= rowB) {
>>>   rowPair.set(rowA, rowB, weightA, weightB);
>>> } else {
>>>   rowPair.set(rowB, rowA, weightB, weightA);
>>> }
>>> coocurrence.set(column.get(), valueA, valueB);
>>>
>>> But it should be (already fixed in current trunk some days ago):
>>>
>>> if (rowA <= rowB) {
>>>   rowPair.set(rowA, rowB, weightA, weightB);
>>>   coocurrence.set(column.get(), valueA, valueB);
>>> } else {
>>>   rowPair.set(rowB, rowA, weightB, weightA);
>>>   coocurrence.set(column.get(), valueB, valueA);
>>> }
>>>
>>> Maybe you could rerun your test with the current trunk?
>>>
>>> --sebastian
>>>
>>>
>>> On 14.06.2011 20:54, Sean Owen wrote:
>>>
>>>> It is a similarity, not a distance. Higher values mean more
>>>> similarity, not less.
>>>>
>>>> I agree that similarity ought to decrease with more dimensions. That
>>>> is what you observe -- except that you see quite high average
>>>> similarity with no dimension reduction!
>>>>
>>>> An average cosine similarity of 0.87 sounds "high" to me for anything
>>>> but a few dimensions. What's the dimensionality of the input without
>>>> dimension reduction?
>>>>
>>>> Something is amiss in this pipeline. It is an interesting question!
>>>>
>>>> On Tue, Jun 14, 2011 at 7:39 PM, Stefan Wienert <[email protected]> wrote:
>>>>
>>>>> Actually I'm using RowSimilarityJob() with
>>>>> --input input
>>>>> --output output
>>>>> --numberOfColumns documentCount
>>>>> --maxSimilaritiesPerRow documentCount
>>>>> --similarityClassname SIMILARITY_UNCENTERED_COSINE
>>>>>
>>>>> Actually, I am not really sure what this SIMILARITY_UNCENTERED_COSINE
>>>>> calculates...
>>>>> The source says: "distributed implementation of cosine similarity that
>>>>> does not center its data".
>>>>>
>>>>> So... this seems to be the similarity and not the distance?
>>>>>
>>>>> Cheers,
>>>>> Stefan
>>>>>
>>>>> 2011/6/14 Stefan Wienert <[email protected]>:
>>>>>
>>>>>> But... why do I get different results with cosine similarity when
>>>>>> using no dimension reduction (with 100,000 dimensions)?
>>>>>>
>>>>>> 2011/6/14 Fernando Fernández <[email protected]>:
>>>>>>
>>>>>>> Actually, that's what your results are showing, aren't they? With
>>>>>>> rank 1000 the similarity avg is the lowest...
>>>>>>>
>>>>>>> 2011/6/14 Jake Mannix <[email protected]>
>>>>>>>
>>>>>>>> Actually, wait -- are your graphs showing *similarity* or
>>>>>>>> *distance*? In higher dimensions, *distance* (1 - cos(angle))
>>>>>>>> should grow, but on the other hand, *similarity* (cos(angle))
>>>>>>>> should go toward 0.
>>>>>>>>
>>>>>>>> On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hey guys,
>>>>>>>>>
>>>>>>>>> I have some strange results in my LSA pipeline.
>>>>>>>>>
>>>>>>>>> First, let me explain the steps my data goes through:
>>>>>>>>> 1) Extract the term-document matrix from a Lucene datastore, using
>>>>>>>>> TF-IDF weighting
>>>>>>>>> 2) Transpose the TDM
>>>>>>>>> 3a) Run Mahout SVD (Lanczos) on the transposed TDM
>>>>>>>>> 3b) Run Mahout SSVD (stochastic SVD) on the transposed TDM
>>>>>>>>> 3c) Use no dimension reduction (for testing purposes)
>>>>>>>>> 4) Transpose the result (ONLY none / svd)
>>>>>>>>> 5) Calculate the cosine similarity (from Mahout)
>>>>>>>>>
>>>>>>>>> Now... some strange things happen.
>>>>>>>>> First of all: the demo data shows the similarity from document 1
>>>>>>>>> to all other documents.
>>>>>>>>>
>>>>>>>>> The results using only cosine similarity (without dimension
>>>>>>>>> reduction):
>>>>>>>>> http://the-lord.de/img/none.png
>>>>>>>>>
>>>>>>>>> The result using SVD, rank 10:
>>>>>>>>> http://the-lord.de/img/svd-10.png
>>>>>>>>> Some points fall down to the bottom.
>>>>>>>>>
>>>>>>>>> The results using SSVD, rank 10:
>>>>>>>>> http://the-lord.de/img/ssvd-10.png
>>>>>>>>>
>>>>>>>>> The result using SVD, rank 100:
>>>>>>>>> http://the-lord.de/img/svd-100.png
>>>>>>>>> More points fall down to the bottom.
>>>>>>>>>
>>>>>>>>> The results using SSVD, rank 100:
>>>>>>>>> http://the-lord.de/img/ssvd-100.png
>>>>>>>>>
>>>>>>>>> The results using SVD, rank 200:
>>>>>>>>> http://the-lord.de/img/svd-200.png
>>>>>>>>> Even more points fall down to the bottom.
>>>>>>>>>
>>>>>>>>> The results using SVD, rank 1000:
>>>>>>>>> http://the-lord.de/img/svd-1000.png
>>>>>>>>> Most points are at the bottom.
>>>>>>>>>
>>>>>>>>> Please note the scale:
>>>>>>>>> - the avg for none: 0.8712
>>>>>>>>> - the avg for SVD rank 10: 0.2648
>>>>>>>>> - the avg for SVD rank 100: 0.0628
>>>>>>>>> - the avg for SVD rank 200: 0.0238
>>>>>>>>> - the avg for SVD rank 1000: 0.0116
>>>>>>>>>
>>>>>>>>> So my question is:
>>>>>>>>> Can you explain this behavior?
>>>>>>>>> Why are the documents getting more equal with more ranks in SVD?
>>>>>>>>> I thought it would be the opposite.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Stefan
>
> --
> Stefan Wienert
>
> http://www.wienert.cc
> [email protected]
>
> Telefon: +495251-2026838
> Mobil: +49176-40170270
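P.S. To make the normalization point above concrete: pairwise cosines don't change if you normalize each document vector, but they do change if the coordinates themselves are stretched by sqrt(sigma_k) -- which is exactly the difference between rows of U and rows of U * Sigma^(1/2). A minimal sketch in plain Java (no Mahout dependency; the singular values and U-rows below are made up for illustration):

public class ScalingDemo {

  static double cosine(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int k = 0; k < a.length; k++) {
      dot += a[k] * b[k];
      na  += a[k] * a[k];
      nb  += b[k] * b[k];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  public static void main(String[] args) {
    // Hypothetical singular values and U-rows, for illustration only.
    double[] sigma = { 9.0, 4.0, 1.0 };
    double[] docA  = { 0.5, 0.3, 0.8 };   // row of U for document A
    double[] docB  = { 0.4, -0.6, 0.7 };  // row of U for document B

    double[] scaledA = new double[3];
    double[] scaledB = new double[3];
    for (int k = 0; k < 3; k++) {
      // rows of U * Sigma^(1/2): each coordinate stretched by sqrt(sigma_k)
      scaledA[k] = docA[k] * Math.sqrt(sigma[k]);
      scaledB[k] = docB[k] * Math.sqrt(sigma[k]);
    }

    System.out.println("cosine in U space:             " + cosine(docA, docB));       // ~0.583
    System.out.println("cosine in U*Sigma^(1/2) space: " + cosine(scaledA, scaledB)); // ~0.496
  }
}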

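P.P.S. Re: Stefan's question about what SIMILARITY_UNCENTERED_COSINE computes -- as I read the source comment he quotes, "uncentered" just means plain cosine on the raw vectors, as opposed to the Pearson-style variant that subtracts each vector's mean first. A toy sketch (plain Java again, made-up numbers):

public class UncenteredCosineDemo {

  static double cosine(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na  += a[i] * a[i];
      nb  += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }

  // Subtract the vector's own mean from every component.
  static double[] center(double[] v) {
    double mean = 0;
    for (double x : v) mean += x;
    mean /= v.length;
    double[] c = new double[v.length];
    for (int i = 0; i < v.length; i++) c[i] = v[i] - mean;
    return c;
  }

  public static void main(String[] args) {
    double[] a = { 1, 2, 3 };
    double[] b = { 2, 4, 7 };
    // "Uncentered": cosine on the raw vectors.
    System.out.println("uncentered cosine: " + cosine(a, b));                  // ~0.997
    // Centered variant: cosine after mean-subtraction (Pearson correlation).
    System.out.println("centered cosine:   " + cosine(center(a), center(b)));  // ~0.993
  }
}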