Thanks for the clarification, Jake.

The end goal is to run the SVD against my n-gram vectors, which have 380K
dimensions.

I'll update the wiki once I have this working.

Tim

On Mon, Mar 14, 2011 at 1:09 PM, Jake Mannix <jake.man...@gmail.com> wrote:

>
> On Sun, Mar 13, 2011 at 6:47 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>
>> Looking for a little clarification on using SVD to reduce the dimensions
>> of my vectors for clustering ...
>>
>> Using the ASF mail archives for Mahout-588, I have 6,076,937 tfidf vectors
>> with 20,444 dimensions. I successfully ran Mahout SVD on the vectors
>> using:
>>
>> bin/mahout svd \
>>    -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
>>    -o /asf-mail-archives/mahout-0.4/svd \
>>    --rank 100 --numCols 20444 --numRows 6076937 --cleansvd true
>>
>> This produced 87 eigenvectors of size 20,444. I'm not clear on why there
>> are only 87, but I assume it has something to do with Lanczos?
>>
>
> Hi Timothy,
>
>   The LanczosSolver looks for 100 eigenvectors, but then does some cleanup
> after the fact: convergence issues and numeric overflow can cause some
> eigenvectors to show up twice. The last step in Mahout SVD is to remove
> these spurious eigenvectors, and also any which just don't appear to be
> "eigen" enough (i.e., they don't satisfy the eigenvector criterion with
> high enough fidelity).
>
>   If you really need more eigenvectors, you can try re-running with
> rank=150, and then take the top 100 out of however many you get back.
>
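
The cleanup step described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Mahout's implementation: the function name `clean_eigenvectors`, the thresholds `max_error` and `max_overlap`, and the toy matrix are all assumptions made for the example. A candidate is kept only if applying the matrix leaves its direction essentially unchanged, and only if it is not nearly parallel to a vector already kept.

```python
import math


def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


def clean_eigenvectors(m, candidates, max_error=1e-6, max_overlap=0.999):
    """Keep candidates that act like eigenvectors of m and are not duplicates.

    max_error and max_overlap are illustrative thresholds, not Mahout defaults.
    """
    kept = []
    for v in candidates:
        mv = mat_vec(m, v)
        if norm(v) == 0 or norm(mv) == 0:
            # Direction is undefined (this also skips eigenvalue-0 vectors).
            continue
        # For an exact eigenvector, m*v is parallel to v, so this error is 0.
        error = 1.0 - abs(dot(v, mv)) / (norm(v) * norm(mv))
        if error > max_error:
            continue
        # Drop vectors that are nearly parallel to one already kept.
        if any(abs(dot(v, k)) / (norm(v) * norm(k)) > max_overlap for k in kept):
            continue
        kept.append(v)
    return kept


# Toy symmetric matrix with eigenvalues 2, 3 and 5.
m = [[2, 0, 0], [0, 3, 0], [0, 0, 5]]
candidates = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],  # repeated vector, dropped as a duplicate
    [1, 1, 0],  # not an eigenvector, dropped by the error check
    [0, 0, 1],
]
cleaned = clean_eigenvectors(m, candidates)
print(len(candidates), "requested,", len(cleaned), "kept")
```

Five candidates go in and fewer come out, which is the same effect as asking for rank 100 and getting 87 back.
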
>> So then I proceeded to transpose the SVD output using:
>>
>> bin/mahout transpose -i /mnt/dev/svd/cleanEigenvectors \
>>    --numCols 20444 --numRows 87
>>
>> Next, I tried to run transpose on my original vectors using:
>>
>> transpose -i /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
>>    --numCols 20444 --numRows 6076937
>>
>>
> So the problem with this is that the tfidf-vectors output is a
> SequenceFile<Text,VectorWritable>, which is fine as input to
> DistributedLanczosSolver (which just needs <Writable,VectorWritable>
> pairs), but not so fine for being treated as a real "matrix": you need to
> run the RowIdJob on these tfidf-vectors first. This will normalize your
> SequenceFile<Text,VectorWritable> into a
> SequenceFile<IntWritable,VectorWritable> and a
> SequenceFile<IntWritable,Text> (where the original one is the join of
> these new ones on the new int key).
>
> Hope that helps.
>
>   -jake
>
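
To make the RowIdJob description concrete, here is a small plain-Python sketch of the same split-and-join idea. It does not call Mahout or Hadoop: `assign_row_ids`, `rejoin`, and the sample message keys are hypothetical names invented for illustration, and the lists of tuples merely stand in for the two SequenceFiles.

```python
def assign_row_ids(keyed_vectors):
    """Split (text key, vector) pairs into an int-keyed matrix and an index.

    matrix    stands in for SequenceFile<IntWritable,VectorWritable>
    doc_index stands in for SequenceFile<IntWritable,Text>
    """
    matrix = []
    doc_index = []
    for row_id, (key, vector) in enumerate(keyed_vectors):
        matrix.append((row_id, vector))
        doc_index.append((row_id, key))
    return matrix, doc_index


def rejoin(matrix, doc_index):
    """Join the two outputs on the int key to recover the original pairs."""
    keys = dict(doc_index)
    return [(keys[row_id], vector) for row_id, vector in matrix]


# Hypothetical text-keyed tfidf vectors (keys and weights are made up).
keyed = [
    ("msg-a", [0.0, 1.2, 0.0]),
    ("msg-b", [0.7, 0.0, 0.3]),
    ("msg-c", [0.0, 0.0, 2.1]),
]
matrix, doc_index = assign_row_ids(keyed)
print(matrix[1])     # row 1 of the int-keyed matrix
print(doc_index[1])  # which original key row 1 came from
```

With integer row keys, each vector has a definite row position, which is what a matrix operation such as transpose needs; the index keeps the way back to the original text keys.
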
