Thanks for the clarification, Jake. The end goal is to run the SVD against my n-gram vectors, which have 380K dimensions.
I'll update the wiki once I have this working.

Tim

On Mon, Mar 14, 2011 at 1:09 PM, Jake Mannix <jake.man...@gmail.com> wrote:
>
> On Sun, Mar 13, 2011 at 6:47 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>
>> Looking for a little clarification on using SVD to reduce the dimensions
>> of my vectors for clustering ...
>>
>> Using the ASF mail archives for Mahout-588, I have 6,076,937 tfidf vectors
>> with 20,444 dimensions. I successfully ran Mahout SVD on the vectors
>> using:
>>
>> bin/mahout svd -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
>>   -o /asf-mail-archives/mahout-0.4/svd \
>>   --rank 100 --numCols 20444 --numRows 6076937 --cleansvd true
>>
>> This produced 87 eigenvectors of size 20,444. I'm not clear on why only
>> 87, but I'm assuming that has something to do with Lanczos???
>
> Hi Timothy,
>
> The LanczosSolver looks for 100 eigenvectors, but then does some cleanup
> after the fact: convergence issues and numeric overflow can cause some
> eigenvectors to show up twice, so the last step in Mahout SVD is to remove
> these spurious eigenvectors (and also any that just don't appear to be
> "eigen" enough, i.e. they don't satisfy the eigenvector criterion with
> high enough fidelity).
>
> If you really need more eigenvectors, you can try re-running with
> rank=150 and then take the top 100 out of however many you get.
>
>> So then I proceeded to transpose the SVD output using:
>>
>> bin/mahout transpose -i /mnt/dev/svd/cleanEigenvectors --numCols 20444 --numRows 87
>>
>> Next, I tried to run transpose on my original vectors using:
>>
>> transpose -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors --numCols 20444 --numRows 6076937
>
> The problem with this is that tfidf-vectors is a
> SequenceFile<Text,VectorWritable>, which is fine as input to
> DistributedLanczosSolver (it just needs <Writable,VectorWritable> pairs)
> but not so fine for being treated as a "matrix" - you need to run the
> RowIdJob on these tfidf-vectors first. This will normalize your
> SequenceFile<Text,VectorWritable> into a
> SequenceFile<IntWritable,VectorWritable> and a
> SequenceFile<IntWritable,Text> (where the original one is the join of
> these two new ones on the new int key).
>
> Hope that helps.
>
>   -jake
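
For reference, the RowIdJob step Jake describes is run through the mahout driver
script. A minimal sketch, assuming the rowid shortcut name, the -i/-o flags, and
an illustrative output path (all of which may differ by Mahout version, so check
the program list bin/mahout prints before running):

# sketch only: shortcut name, flags, and output path are assumptions
bin/mahout rowid \
  -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
  -o /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors-rowid

If it runs as sketched, the output directory holds the int-keyed matrix
(SequenceFile<IntWritable,VectorWritable>) and the doc-id index
(SequenceFile<IntWritable,Text>) that Jake mentions; the matrix part is what
would then be fed to transpose and the rest of the pipeline in place of the raw
tfidf-vectors.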