The SSVD process for dimensionality reduction is easier. Assuming your data points are row vectors of the input (which is the case with the output of Mahout's seq2sparse), you need the U*Sigma output of the PCA flow.
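For intuition on why U*Sigma is the output to use, here is a small NumPy sketch. It is not Mahout code: the random data, the variable names, and the use of an exact SVD in place of the stochastic one are all assumptions made for illustration, and it assumes the PCA option amounts to subtracting column means before the decomposition. It shows that the first k columns of U*Sigma are the same as projecting the mean-centered rows onto the first k right singular vectors, so each row of U*Sigma is the reduced version of the matching input row.

```python
import numpy as np

# Made-up data: 20 data points as rows, 6 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
k = 2  # target dimensionality (illustrative)

# PCA step (assumed): subtract the column means.
Xc = X - X.mean(axis=0)

# Exact thin SVD: Xc = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# U * Sigma, truncated to k columns: one reduced row per input row.
reduced = U[:, :k] * s[:k]

# The same result obtained by projecting the centered rows onto V_k.
projected = Xc @ Vt[:k].T

print(reduced.shape)
print(np.allclose(reduced, projected))
```

The rows of `reduced` line up one-to-one with the rows of the input, so they can go straight into clustering with no separate transpose or multiply step.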
Concretely, you need something like

mahout ssvd -i input -o output -k 80 -pca true -us true -U false -V false...

This information is also in the latest SSVD manual on the wiki. Take the latest trunk: some of the PCA flow components got broken recently, and I fixed them just last week.

On Oct 19, 2012 9:06 AM, "Matt Molek" <[email protected]> wrote:
> Sorry for the basic question. I've been reading about this for a few hours,
> but I'm still confused. I want to use ssvd to reduce the dimensionality of
> some tfidf-vectors so I can perform clustering on the result.
>
> Among many other things, I've read:
> https://cwiki.apache.org/MAHOUT/dimensional-reduction.html
>
> Which states the process for svd is:
>
> bin/mahout svd (original -> svdOut)
> bin/mahout cleansvd ...
> bin/mahout transpose svdOut -> svdT
> bin/mahout transpose original -> originalT
> bin/mahout matrixmult originalT svdT -> newMatrix
> bin/mahout kmeans newMatrix
>
> I know you don't need to do cleansvd with ssvd output. My main question is
> which of the three outputs of ssvd should I be transposing and multiplying
> with the original tfidf-matrix? I'm having trouble understanding the math
> that's going on.
>
> ssvd outputs U, V, and sigma, and despite reading a bunch, I'm still
> confused on which of these outputs I should be using, and how. Could anyone
> spell it out for me?
>
> Thanks for any help,
> Matt
