Okay, thanks. It looks like I have that part running, so I will go back to SSVDCli to finish the rest. Thanks for your help.
Chirag

On Wed, Jul 3, 2013 at 4:19 PM, Dmitriy Lyubimov <[email protected]> wrote:

> On Wed, Jul 3, 2013 at 12:25 PM, Chirag Lakhani <[email protected]> wrote:
>
> > Okay thanks for that. After working on that issue I am still having
> > trouble running the SSVD solver. I know I have asked this before but I
> > still cannot initiate the SSVD solver when the input called inputFolder
> > is the location of the sequence files of DenseVectors. Is there
> > something I am missing with this code?
> >
> > String inputFolder = "/data_csv_for_pca/";
> > String pcaOutput = "/vectors/";
> > String column_type = "DenseVector";
> > Path input_vec = new Path(inputFolder);
> >
> > SSVDSolver solver = new SSVDSolver(conf, new Path[] {input_vec},
> >     new Path(pcaOutput), 18, 5, 3, 10);
>
> SSVDSolver does not encapsulate the entire PCA workflow on its own.
>
> You can use SSVDCli as an example to build the entire thing to embed.
> The SSVDSolver class does not compute the PCA offset on its own; SSVDCli
> uses another job from Distributed Matrix to compute that (again, see the
> SSVDCli code).
>
> Problems with not finding input -- about 1 million reasons in your case.
> Try to use absolute hdfs://-prefixed paths for all files.
>
> > On Wed, Jul 3, 2013 at 12:24 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> > > There's probably confusion about options.
> > >
> > > (1) --pca=true enables the PCA flow in general. There's more to it
> > > than just taking a mean and re-centering.
> > > (2) --us=true enables computation of the U*Sigma flow, which is what
> > > approximates dimensionality-reduced output with original variances.
> > > This is what one usually wants from PCA, although in some cases it
> > > may be useful to just use U.
> > > (3) Optionally, one may supply an externally computed colmean by
> > > using --pcaOffset. The motivation behind this option is that PCA is
> > > usually never a standalone job in a pipeline. Usually there's an MR
> > > job that preps the PCA input, in which case it is very easy to take
> > > row averages in the reducers of the previous step (and do final
> > > averaging in the front end). That saves one MR pass over the input,
> > > because in SSVD the average will require one additional MR pass
> > > over A.
> > >
> > > Bottom line, typically one wants something along the lines of
> > >
> > > ssvd --pca=true -u=false -v=false -us=true ...
> > >
> > > On Wed, Jul 3, 2013 at 8:58 AM, Dmitriy Lyubimov <[email protected]> wrote:
> > >
> > > > On Jul 3, 2013 6:56 AM, "Chirag Lakhani" <[email protected]> wrote:
> > > > >
> > > > > So how does the column mean get calculated if the --pcaOffset
> > > > > option is not
> > > >
> > > > By taking the average of all row vectors. See code for details.
> > > >
> > > > > specified? I would think you are just doing SVD at that point.
> > > >
> > > > This statement is incorrect. I know because I designed this code.
> > > >
> > > > > On Tue, Jul 2, 2013 at 5:52 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > >
> > > > > > On Tue, Jul 2, 2013 at 1:52 PM, Chirag Lakhani <[email protected]> wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I am trying to use the Mahout/Java API to do PCA but I am
> > > > > > > confused about the right order to do things. To start, I
> > > > > > > have a list of DenseVectors that I am reading into the code
> > > > > > > and turning into a distributed matrix in the following form.
> > > > > > >
> > > > > > > DistributedRowMatrix m = new DistributedRowMatrix(input_vec,
> > > > > > >     matrix_path, num_rows, num_cols);
> > > > > > >
> > > > > > > When I run this code, I would have thought it would output
> > > > > > > the result into the path called "matrix_path" so that I can
> > > > > > > then use something like MatrixColumnMeansJob.run to get the
> > > > > > > mean. When I run this bit of code I get no output. Is there
> > > > > > > something else I should do, or is there a better way to
> > > > > > > calculate the mean for my file?
> > > > > > >
> > > > > > > From what I understand about the SSVD CLI code, you need to
> > > > > > > calculate the column mean and then output it into a
> > > > > > > directory.
> > > > > >
> > > > > > No, you don't have to (although you have an _option_ to
> > > > > > calculate and substitute one yourself if for some reason it is
> > > > > > already known). Default use assumes it would calculate it for
> > > > > > you.
> > > > > >
> > > > > > > Is there a good way to do this if I am starting from a file
> > > > > > > which is a sequence file of DenseVectors?
> > > > > >
> > > > > > Yes. Just don't specify the --pcaOffset option.
> > > > > >
> > > > > > > --
> > > > > > > *Chirag Lakhani*
> > > > > > > Data Scientist
> > > > > > > Zaloni, Inc. | www.zaloni.com
> > > > > > > 633 Davis Dr., Suite 200
> > > > > > > Durham, NC 27713
> > > > > > > e: [email protected]
> > > > > > > p: 919.602.4965 x7020

--
*Chirag Lakhani*
Data Scientist
Zaloni, Inc. | www.zaloni.com
633 Davis Dr., Suite 200
Durham, NC 27713
e: [email protected]
p: 919.602.4965 x7020
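
Note on the column mean discussed in the thread: it is described above as the average of all row vectors, and option (3) suggests precomputing it by aggregating in the reducers of an upstream job and finishing the average in the front end. The following is only a rough plain-Java sketch of that arithmetic, not Mahout code. The class, the Partial type and the method names are hypothetical, and it assumes each reducer emits column sums plus a row count (rather than a per-reducer average) so that partitions of unequal size combine correctly.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from Mahout.
class ColumnMeanSketch {

    /** What one upstream reducer would emit: per-column sums and the number of rows seen. */
    static final class Partial {
        final double[] sums;
        final long rows;

        Partial(double[] sums, long rows) {
            this.sums = sums;
            this.rows = rows;
        }
    }

    /** Reducer-side step: add up the rows of one partition, column by column. */
    static Partial partialSums(double[][] rows, int numCols) {
        double[] sums = new double[numCols];
        for (double[] row : rows) {
            for (int j = 0; j < numCols; j++) {
                sums[j] += row[j];
            }
        }
        return new Partial(sums, rows.length);
    }

    /** Front-end step: merge all partials and divide by the total row count. */
    static double[] columnMean(List<Partial> partials, int numCols) {
        double[] mean = new double[numCols];
        long total = 0;
        for (Partial p : partials) {
            for (int j = 0; j < numCols; j++) {
                mean[j] += p.sums[j];
            }
            total += p.rows;
        }
        if (total == 0) {
            throw new IllegalArgumentException("no rows to average");
        }
        for (int j = 0; j < numCols; j++) {
            mean[j] /= total;
        }
        return mean;
    }

    public static void main(String[] args) {
        // Two partitions of a 3 x 2 matrix, as two reducers might see them.
        double[][] part1 = {{1.0, 2.0}, {3.0, 4.0}};
        double[][] part2 = {{5.0, 6.0}};
        double[] mean = columnMean(
            Arrays.asList(partialSums(part1, 2), partialSums(part2, 2)), 2);
        System.out.println(Arrays.toString(mean)); // prints [3.0, 4.0]
    }
}
```

A vector computed this way is the kind of externally computed colmean that --pcaOffset is said to accept; how it has to be written out for SSVD to read it is not covered here (see the SSVDCli code, as suggested above).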
