Looking for a little clarification on using SVD to reduce the dimensions of my vectors for clustering ...
Using the ASF mail archives for Mahout-588, I have 6,076,937 tfidf vectors with 20,444 dimensions. I successfully ran Mahout SVD on the vectors using:

    bin/mahout svd -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
      -o /asf-mail-archives/mahout-0.4/svd \
      --rank 100 --numCols 20444 --numRows 6076937 --cleansvd true

This produced 87 eigenvectors of size 20,444. I'm not clear on why I got only 87 when I asked for rank 100, but I'm assuming that has something to do with Lanczos?

I then transposed the SVD output using:

    bin/mahout transpose -i /mnt/dev/svd/cleanEigenvectors --numCols 20444 --numRows 87

Next, I tried to run transpose on my original vectors using:

    transpose -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors --numCols 20444 --numRows 6076937

This failed with the error:

    java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
        at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:100)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:363)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:312)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

So I think I'm missing something ... I'm basing my process on the steps outlined in this thread: http://lucene.472066.n3.nabble.com/Using-SVD-with-Canopy-KMeans-td1407217.html, i.e.

    bin/mahout svd (original -> svdOut)
    bin/mahout cleansvd ...
    bin/mahout transpose svdOut -> svdT
    bin/mahout transpose original -> originalT
    bin/mahout matrixmult originalT svdT -> newMatrix
    bin/mahout kmeans newMatrix

Based on Ted's last comment in that thread, it seems like I may not need to transpose the original matrix? I just want to be sure this process is correct.
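To make the shapes concrete, here is a toy sketch in plain Python of the projection I believe these steps are meant to compute. It is not Mahout code: the tiny matrices and the helper names (`transpose`, `matmul`, `transpose_times`) are made up for illustration, and `transpose_times` encodes my assumption that the matrixmult step multiplies the transpose of its first input by its second, which would explain why the thread transposes the original first.

```python
def transpose(m):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain dense product a * b for list-of-rows matrices."""
    assert len(a[0]) == len(b), "inner dimensions must agree"
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a]


def transpose_times(a, b):
    """Hypothetical stand-in for the multiply job: computes a^T * b.
    That matrixmult behaves this way is an assumption, not verified here."""
    return matmul(transpose(a), b)


# Toy stand-ins: 4 documents x 3 terms (really 6,076,937 x 20,444),
# and 2 eigenvectors of length 3 (really 87 x 20,444).
original = [[1.0, 0.0, 2.0],
            [0.0, 3.0, 1.0],
            [2.0, 1.0, 0.0],
            [1.0, 1.0, 1.0]]
eigenvectors = [[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0]]

svd_t = transpose(eigenvectors)        # terms x rank
reduced = matmul(original, svd_t)      # documents x rank, the input for kmeans

# Same result via the "transpose the original, then multiply" route.
original_t = transpose(original)       # terms x documents
reduced_via_transpose = transpose_times(original_t, svd_t)

print(len(reduced), len(reduced[0]))   # rows x columns of the reduced matrix
```

If that assumption about the multiply step is wrong and it computes a plain product, then the original would go in untransposed and only the eigenvectors need transposing; either way the result should have one row per document and one column per eigenvector.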