Looking for a little clarification on using SVD to reduce the dimensionality
of my vectors for clustering ...

Using the ASF mail archives for Mahout-588, I have 6,076,937 tfidf vectors
with 20,444 dimensions. I successfully run Mahout SVD on the vectors using:

bin/mahout svd \
    -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
    -o /asf-mail-archives/mahout-0.4/svd \
    --rank 100 --numCols 20444 --numRows 6076937 --cleansvd true

This produced 87 eigenvectors of size 20,444. I'm not clear on why only 87
came out rather than the requested rank of 100, but I'm assuming that has
something to do with Lanczos?
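
(FWIW, I think the count of 87 can be sanity-checked by dumping the
cleanEigenvectors file and counting the keys, along these lines, assuming
seqdumper still takes -s for the input file and prints one "Key: ..." line
per vector:)

bin/mahout seqdumper -s /mnt/dev/svd/cleanEigenvectors | grep -c "Key:"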

So then I proceeded to transpose the SVD output using:

bin/mahout transpose -i /mnt/dev/svd/cleanEigenvectors \
    --numCols 20444 --numRows 87

Next, I tried to run transpose on my original vectors using:

bin/mahout transpose -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
    --numCols 20444 --numRows 6076937

This failed with error:

java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast
to org.apache.hadoop.io.IntWritable
        at 
org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:100)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:363)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:312)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
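
From the exception, my guess is that the tfidf-vectors are keyed by Text
(the document IDs) rather than IntWritable, but I haven't verified that.
I think the key/value classes can be checked by dumping one of the part
files, e.g. (assuming seqdumper prints the key and value classes in its
header; the part file name here is just a guess):

bin/mahout seqdumper \
    -s /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors/part-00000 \
    | head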

So I think I'm missing something ... I'm basing my process on the steps
outlined in this thread:
http://lucene.472066.n3.nabble.com/Using-SVD-with-Canopy-KMeans-td1407217.html
i.e.

bin/mahout svd (original -> svdOut)
bin/mahout cleansvd ...
bin/mahout transpose svdOut -> svdT
bin/mahout transpose original -> originalT
bin/mahout matrixmult originalT svdT -> newMatrix
bin/mahout kmeans newMatrix
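
For the last two steps, the concrete commands I had in mind look roughly
like this (the flag names are my best guess at the matrixmult and kmeans
drivers, the -k/-x values and the <...> paths are just placeholders, and
I'm assuming matrixmult pairs up rows of its two inputs, which is why both
row counts below are 20,444; please correct me if that's wrong):

bin/mahout matrixmult \
    --inputPathA <originalT> --numRowsA 20444 --numColsA 6076937 \
    --inputPathB <svdT> --numRowsB 20444 --numColsB 87

bin/mahout kmeans \
    -i <newMatrix> -c <initial-clusters> -o <kmeans-output> \
    -k 20 -x 10 -cl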

Based on Ted's last comment in that thread, it seems like I may not need to
transpose the original matrix? Just want to be sure this process is correct.
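
For reference, the shapes I think I'm working with are:

    tfidf-vectors (original)  :  6,076,937 rows x 20,444 cols
    cleanEigenvectors (svdOut):         87 rows x 20,444 cols
    svdT                      :     20,444 rows x 87 cols
    originalT                 :     20,444 rows x 6,076,937 cols

and what I ultimately want to hand to kmeans is a 6,076,937 x 87 matrix,
i.e. one reduced vector per document.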
