What down-projection techniques are available in Mahout, and what
others would be useful? For example, I'm intrigued by the
manifold-finders like ISOMAP.

Lance

On Sun, Mar 13, 2011 at 8:18 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> For clustering purposes, you probably don't even need SVD here.  You can
> project randomly down to 100-200 dimensions and do the clustering.  You have
> to use a higher number of dimensions than you would with SVD, but avoiding
> the SVD is a big win.  Depending on the density of your data, this may or
> may not make clustering faster.  The key question is whether the projected
> (dense) data is larger or smaller in total than the original sparse data.
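
To make the random projection idea concrete, here is a minimal sketch in plain
Python. It is not Mahout code: the function names, the toy sizes (2000 input
dimensions, 200 output dimensions) and the Gaussian entries scaled by
1/sqrt(n_out) are all my assumptions, chosen only to show that distances
between sparse vectors roughly survive the projection.

```python
import math
import random


def random_projection_matrix(n_in, n_out, seed=0):
    """Build an n_in x n_out matrix of Gaussian entries scaled by 1/sqrt(n_out)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_out)]
            for _ in range(n_in)]


def project(sparse_vec, matrix, n_out):
    """Project a sparse vector {index: value} to a dense list of length n_out."""
    out = [0.0] * n_out
    for idx, val in sparse_vec.items():
        row = matrix[idx]
        for j in range(n_out):
            out[j] += val * row[j]
    return out


def sparse_distance(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def dense_distance(a, b):
    """Euclidean distance between two dense vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy data (hypothetical): two sparse "documents" in a 2000-term space.
n_in, n_out = 2000, 200
R = random_projection_matrix(n_in, n_out, seed=42)
doc_a = {3: 1.5, 40: 0.7, 1999: 2.0}
doc_b = {3: 1.0, 41: 0.9, 500: 1.2}

before = sparse_distance(doc_a, doc_b)
after = dense_distance(project(doc_a, R, n_out), project(doc_b, R, n_out))
print("distance before projection:", before)
print("distance after projection: ", after)
```

Note that each projected vector is dense with n_out entries, which is why the
size comparison with the original sparse data matters.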
>
> Also, since your data is essentially count data, you have large amounts of
> noise, which probably turns everything after about 20-30 singular vectors
> into random noise anyway.  As such, I recommend replacing the later singular
> vectors with random vectors.  These will be quasi-orthogonal and thus pretty
> much as good as real singular vectors for reducing dimensionality, though
> not quite as good as a minimal basis.
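
A quick way to see the quasi-orthogonality claim: draw a few random unit
vectors in a 20,444-dimensional space (the dimensionality from the original
post) and look at the largest absolute cosine between any pair. This is only
an illustrative sketch in plain Python; the helper names, the seed and the
choice of five Gaussian vectors are my assumptions.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a Gaussian random vector and normalize it to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def max_abs_cosine(vectors):
    """Largest |cosine| over all pairs; inputs are assumed to be unit vectors."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(dot))
    return worst


rng = random.Random(7)
dim = 20444
vectors = [random_unit_vector(dim, rng) for _ in range(5)]
print("largest |cosine| between any two random vectors:",
      max_abs_cosine(vectors))
```

The cosines have a standard deviation of roughly 1/sqrt(dim), so in this many
dimensions the vectors come out close to orthogonal without any explicit
orthogonalization step.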
>
> On Sun, Mar 13, 2011 at 6:47 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>
>> Looking for a little clarification on using SVD to reduce the dimensions of
>> my vectors for clustering ...
>>
>> Using the ASF mail archives for Mahout-588, I have 6,076,937 tfidf vectors
>> with 20,444 dimensions. I successfully ran Mahout SVD on the vectors using:
>>
>> bin/mahout svd \
>>    -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
>>    -o /asf-mail-archives/mahout-0.4/svd \
>>    --rank 100 --numCols 20444 --numRows 6076937 --cleansvd true
>>
>> This produced 87 eigenvectors of size 20,444. I'm not clear as to why only
>> 87, but I'm assuming that has something to do with Lanczos?
>>
>> So then I proceeded to transpose the SVD output using:
>>
>> bin/mahout transpose -i /mnt/dev/svd/cleanEigenvectors \
>>    --numCols 20444 --numRows 87
>>
>> Next, I tried to run transpose on my original vectors using:
>>
>> bin/mahout transpose \
>>    -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
>>    --numCols 20444 --numRows 6076937
>>
>> This failed with the error:
>>
>> java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
>>        at org.apache.mahout.math.hadoop.TransposeJob$TransposeMapper.map(TransposeJob.java:100)
>>        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:363)
>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:312)
>>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>
>> So I think I'm missing something ... I'm basing my process on the steps
>> outlined in this thread:
>>
>> http://lucene.472066.n3.nabble.com/Using-SVD-with-Canopy-KMeans-td1407217.html
>>
>> i.e.
>>
>> bin/mahout svd (original -> svdOut)
>> bin/mahout cleansvd ...
>> bin/mahout transpose svdOut -> svdT
>> bin/mahout transpose original -> originalT
>> bin/mahout matrixmult originalT svdT -> newMatrix
>> bin/mahout kmeans newMatrix
>>
>> Based on Ted's last comment in that thread, it seems like I may not need to
>> transpose the original matrix? Just want to be sure this process is
>> correct.
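
On the transpose question, the shape arithmetic may help. With documents as
rows (n x d) and the eigenvectors stored as rows (k x d), the reduced matrix is
documents times the transposed eigenvectors, which is n x k. The sketch below
uses tiny made-up matrices in plain Python to show only that arithmetic.
Whether bin/mahout matrixmult expects one operand already transposed is
something I have not verified, so treat this as the math and not as a
description of the Mahout job.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product; raises ValueError if the inner dimensions differ."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions differ: %d vs %d"
                         % (len(a[0]), len(b)))
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


# Hypothetical toy data: 4 documents over 3 terms (n=4, d=3).
docs = [[1.0, 0.0, 2.0],
        [0.0, 3.0, 0.0],
        [4.0, 0.0, 0.0],
        [0.0, 1.0, 1.0]]

# Two basis vectors stored as rows (k=2, d=3), like the 87 x 20444 output.
basis_rows = [[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]]

basis_cols = transpose(basis_rows)   # d x k
reduced = matmul(docs, basis_cols)   # n x k: one short row per document
print(len(reduced), "x", len(reduced[0]))
```

In this orientation only the eigenvector matrix is transposed; the document
matrix is used as-is.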
>>
>



-- 
Lance Norskog
goks...@gmail.com