There seems to be some confusion about the options.

(1) --pca=true enables the PCA flow in general. There is more to it than just taking a mean and re-centering.

(2) --us=true enables computation of the U*Sigma output, which approximates the dimensionality-reduced output with the original variances. This is what one usually wants from PCA, although in some cases it may be useful to use just U.

(3) Optionally, one may supply an externally computed column mean via --pcaOffset. The motivation behind this option is that PCA is rarely a standalone job in a pipeline. Usually there is an MR job that preps the PCA input, in which case it is very easy to take row averages in the reducers of that previous step (and do the final averaging in the front end). That saves one MR pass over the input, because otherwise SSVD needs one additional MR pass over A to compute the average.
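To make the "row averages in the reducers, final averaging in the front end" idea concrete, here is a minimal plain-Java sketch. It does not use any Mahout API: the class ColumnMeanSketch and its Partial, finalMean and center names are hypothetical, invented for illustration only. It assumes each reducer of the upstream job can emit per-column sums plus a row count, and that the front end merges those into the column mean one would then write out and hand to --pcaOffset.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from Mahout.
class ColumnMeanSketch {

    /** What one reducer of the upstream job might accumulate: column sums and a row count. */
    static final class Partial {
        final double[] sums;
        long rows;

        Partial(int cols) {
            this.sums = new double[cols];
        }

        void add(double[] row) {
            for (int j = 0; j < sums.length; j++) {
                sums[j] += row[j];
            }
            rows++;
        }
    }

    /** Front-end step: merge all partials and divide by the total number of rows. */
    static double[] finalMean(List<Partial> partials, int cols) {
        double[] mean = new double[cols];
        long total = 0;
        for (Partial p : partials) {
            for (int j = 0; j < cols; j++) {
                mean[j] += p.sums[j];
            }
            total += p.rows;
        }
        if (total == 0) {
            throw new IllegalArgumentException("no rows seen");
        }
        for (int j = 0; j < cols; j++) {
            mean[j] /= total;
        }
        return mean;
    }

    /** Textbook centering (subtract the column mean from every row), shown only to define the term. */
    static double[][] center(double[][] a, double[] mean) {
        double[][] out = new double[a.length][];
        for (int i = 0; i < a.length; i++) {
            out[i] = new double[a[i].length];
            for (int j = 0; j < a[i].length; j++) {
                out[i][j] = a[i][j] - mean[j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two "reducers", each seeing a different slice of the rows of A.
        Partial p1 = new Partial(2);
        p1.add(new double[] {1, 2});
        p1.add(new double[] {3, 4});
        Partial p2 = new Partial(2);
        p2.add(new double[] {5, 6});

        double[] mean = finalMean(Arrays.asList(p1, p2), 2);
        System.out.println(Arrays.toString(mean)); // prints [3.0, 4.0]
    }
}
```

Carrying the row count alongside the sums matters because the reducers generally see different numbers of rows, so averaging the per-reducer averages directly would give the wrong mean.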
Bottom line: typically one wants something along the lines of

ssvd --pca=true -u=false -v=false -us=true ...

On Wed, Jul 3, 2013 at 8:58 AM, Dmitriy Lyubimov <[email protected]> wrote:

> On Jul 3, 2013 6:56 AM, "Chirag Lakhani" <[email protected]> wrote:
> >
> > So how does the column mean get calculated if the --pcaOffset option is
> > not specified?
>
> By taking the average of all row vectors. See the code for details.
>
> > I would think you are just doing SVD at that point.
>
> This statement is incorrect. I know because I designed this code.
>
> > On Tue, Jul 2, 2013 at 5:52 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> > > On Tue, Jul 2, 2013 at 1:52 PM, Chirag Lakhani <[email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > I am trying to use the Mahout/Java API to do PCA but I am confused
> > > > about the right order to do things. To start, I have a list of
> > > > DenseVectors that I am reading into the code and turning into a
> > > > distributed matrix in the following form.
> > > >
> > > > DistributedRowMatrix m = new DistributedRowMatrix(input_vec,
> > > > matrix_path, num_rows, num_cols);
> > > >
> > > > When I run this code, I would have thought it would output the result
> > > > into the path called "matrix_path" so that I can then use something
> > > > like MatrixColumnMeansJob.run to get the mean. When I run this bit of
> > > > code I get no output. Is there something else I should do, or is there
> > > > a better way to calculate the mean for my file?
> > > >
> > > > From what I understand about the SSVD CI code, you need to calculate
> > > > the column mean and then output it into a directory.
> > >
> > > No, you don't have to (although you have an _option_ to calculate and
> > > substitute one yourself if for some reason it is already known).
> > > Default use assumes it would calculate it for you.
> > >
> > > > Is there a good way to do this if I am starting from a file which is
> > > > a sequence file of DenseVectors?
> > >
> > > Yes. Just don't specify the --pcaOffset option.
> > >
> > > > --
> > > > *Chirag Lakhani*
> > > > Data Scientist
> > > > Zaloni, Inc. | www.zaloni.com
> > > > 633 Davis Dr., Suite 200
> > > > Durham, NC 27713
> > > > e: [email protected]
> > > > p: 919.602.4965 x7020
