What down-projection techniques are available in Mahout, and what others would be useful? For example, I'm intrigued by the manifold-finders like ISOMAP.
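
As a rough illustration of the random projection Ted describes in the quoted message below, here is a minimal pure-Python sketch. It is not Mahout code: the function names, the Gaussian entries, and the 1/sqrt(k) scaling are my own assumptions for illustration, and the sizes are toy values rather than the 20,444 dimensions from the thread.

```python
import math
import random


def random_projection_matrix(n_dims, k, seed=42):
    """Build an n_dims x k matrix of Gaussian entries scaled by 1/sqrt(k).

    With this scaling, squared vector lengths are preserved in expectation.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(k)]
            for _ in range(n_dims)]


def project(sparse_vec, matrix):
    """Project a sparse vector {term_index: weight} down to k dense dimensions."""
    k = len(matrix[0])
    out = [0.0] * k
    for index, weight in sparse_vec.items():
        row = matrix[index]
        for j in range(k):
            out[j] += weight * row[j]
    return out


def squared_norm(vec):
    """Sum of squares of a dense vector."""
    return sum(x * x for x in vec)


if __name__ == "__main__":
    # Toy sizes and hypothetical tf-idf weights, for illustration only.
    matrix = random_projection_matrix(n_dims=2000, k=200)
    doc = {3: 1.5, 170: 0.7, 1999: 2.1}
    reduced = project(doc, matrix)
    original_sq = sum(w * w for w in doc.values())
    # The ratio should be close to 1 when k is reasonably large.
    print(len(reduced), squared_norm(reduced) / original_sq)
```

Clustering would then run on the projected vectors instead of the originals. No SVD is computed at any point, which is the saving Ted refers to, at the cost of needing more target dimensions than SVD would.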
Lance

On Sun, Mar 13, 2011 at 8:18 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> For clustering purposes, you probably don't even need SVD here. You can
> project randomly down to 100-200 dimensions and do the clustering. You have
> to use a higher number of dimensions than you would with SVD, but avoiding
> the SVD is a big win. Depending on the density of your data, this may or
> may not make clustering faster. The key question is whether the total data
> size is larger or smaller.
>
> Also, since your data is essentially count data, you have large amounts of
> noise which probably make everything after about 20-30 singular vectors into
> random noise anyway. As such, I recommend replacing later singular vectors
> with random numbers anyway. These will be quasi-orthogonal and thus pretty
> much as good as real singular vectors for reducing dimensionality, not quite
> so good as a minimal basis.
>
> On Sun, Mar 13, 2011 at 6:47 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>
>> Looking for a little clarification with using SVD to reduce dimensions of my
>> vectors for clustering ...
>>
>> Using the ASF mail archives for Mahout-588, I have 6,076,937 tfidf vectors
>> with 20,444 dimensions. I successfully run Mahout SVD on the vectors using:
>>
>> bin/mahout svd -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
>>   -o /asf-mail-archives/mahout-0.4/svd \
>>   --rank 100 --numCols 20444 --numRows 6076937 --cleansvd true
>>
>> This produced 87 eigenvectors of size 20,444. I'm not clear as to why only
>> 87, but I'm assuming that has something to do with Lanczos???
>>
>> So then I proceeded to transpose the SVD output using:
>>
>> bin/mahout transpose -i /mnt/dev/svd/cleanEigenvectors --numCols 20444 --numRows 87
>>
>> Next, I tried to run transpose on my original vectors using:
>>
>> transpose -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors --numCols 20444 --numRows 6076937
>>
>> This failed with error:
>>
>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
>>     at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:100)
>>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:363)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:312)
>>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>
>> So I think I'm missing something ... I'm basing my process on the steps
>> outlined in thread:
>>
>> http://lucene.472066.n3.nabble.com/Using-SVD-with-Canopy-KMeans-td1407217.html
>>
>> i.e.
>>
>> bin/mahout svd (original -> svdOut)
>> bin/mahout cleansvd ...
>> bin/mahout transpose svdOut -> svdT
>> bin/mahout transpose original -> originalT
>> bin/mahout matrixmult originalT svdT -> newMatrix
>> bin/mahout kmeans newMatrix
>>
>> Based on Ted's last comment in that thread, it seems like I may not need to
>> transpose the original matrix? Just want to be sure this process is
>> correct.
>>
>

--
Lance Norskog
goks...@gmail.com