But... why do I get such different results for cosine similarity when I use no dimension reduction at all (i.e. with all 100,000 dimensions)?
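To make Jake's similarity-vs-distance question concrete, here is a small Python sketch of the two quantities. It is not the Mahout code: the function names and the toy vectors are made up for illustration, and I have not checked here which of the two values the Mahout job actually writes out. It only shows that an average near 0 means "very similar" if the numbers are distances, but "very dissimilar" if they are similarities.

```python
import math

def cosine_similarity(a, b):
    """cos(angle) between two equal-length vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """1 - cos(angle); 0.0 means same direction."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical toy TF-IDF rows (one row per document), not the real data.
docs = [
    [1.0, 2.0, 0.0, 0.0],  # document 1 (the reference)
    [2.0, 4.0, 0.0, 0.0],  # same direction as document 1
    [0.0, 0.0, 3.0, 1.0],  # shares no terms with document 1
    [1.0, 0.0, 1.0, 0.0],  # partial overlap with document 1
]

reference = docs[0]
similarities = [cosine_similarity(reference, d) for d in docs[1:]]
distances = [cosine_distance(reference, d) for d in docs[1:]]

# Averages over "document 1 vs. all other documents", as in the plots.
avg_similarity = sum(similarities) / len(similarities)
avg_distance = sum(distances) / len(distances)

print("avg similarity:", avg_similarity)
print("avg distance:  ", avg_distance)
```

The two averages always sum to 1, so the same set of documents gives a low average under one reading and a high average under the other.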
2011/6/14 Fernando Fernández <[email protected]>:
> Actually that's what your results are showing, aren't they? With rank 1000
> the similarity avg is the lowest...
>
>
> 2011/6/14 Jake Mannix <[email protected]>
>
>> actually, wait - are your graphs showing *similarity*, or *distance*? In
>> higher dimensions, *distance* (and cosine angle) should grow, but on the
>> other hand, *similarity* (1-cos(angle)) should go toward 0.
>>
>> On Tue, Jun 14, 2011 at 10:15 AM, Stefan Wienert <[email protected]>
>> wrote:
>>
>> > Hey Guys,
>> >
>> > I have some strange results in my LSA-Pipeline.
>> >
>> > First, I explain the steps my data is making:
>> > 1) Extract Term-Document-Matrix from a Lucene datastore using TFIDF as
>> > weighter
>> > 2) Transposing TDM
>> > 3a) Using Mahout SVD (Lanczos) with the transposed TDM
>> > 3b) Using Mahout SSVD (stochastic SVD) with the transposed TDM
>> > 3c) Using no dimension reduction (for testing purpose)
>> > 4) Transpose result (ONLY none / svd)
>> > 5) Calculating Cosine Similarity (from Mahout)
>> >
>> > Now... Some strange things happen:
>> > First of all: The demo data shows the similarity from document 1 to
>> > all other documents.
>> >
>> > the results using only cosine similarity (without dimension reduction):
>> > http://the-lord.de/img/none.png
>> >
>> > the result using svd, rank 10
>> > http://the-lord.de/img/svd-10.png
>> > some points falling down to the bottom.
>> >
>> > the results using ssvd rank 10
>> > http://the-lord.de/img/ssvd-10.png
>> >
>> > the result using svd, rank 100
>> > http://the-lord.de/img/svd-100.png
>> > more points falling down to the bottom.
>> >
>> > the results using ssvd rank 100
>> > http://the-lord.de/img/ssvd-100.png
>> >
>> > the results using svd rank 200
>> > http://the-lord.de/img/svd-200.png
>> > even more points falling down to the bottom.
>> >
>> > the results using svd rank 1000
>> > http://the-lord.de/img/svd-1000.png
>> > most points are at the bottom
>> >
>> > please beware of the scale:
>> > - the avg from none: 0,8712
>> > - the avg from svd rank 10: 0,2648
>> > - the avg from svd rank 100: 0,0628
>> > - the avg from svd rank 200: 0,0238
>> > - the avg from svd rank 1000: 0,0116
>> >
>> > so my question is:
>> > Can you explain this behavior? Why are the documents getting more
>> > equal with more ranks in svd. I thought it was the opposite.
>> >
>> > Cheers
>> > Stefan
>> >

--
Stefan Wienert
http://www.wienert.cc
[email protected]
Phone: +495251-2026838
Mobile: +49176-40170270
