Thanks for pointing out the relevant code explicitly. I will try that out. At the moment I am getting a java.lang.StackOverflowError, but according to a previous comment I need to use the trunk version.
Chirag

On Wed, Jul 3, 2013 at 4:39 PM, Dmitriy Lyubimov <[email protected]> wrote:

> yeah. specifically this code computes the mean (it is called "xi" to
> conform to notations used in math solution for MAHOUT-817)
>
>       // MAHOUT-817
>       if (pca && xiPath == null) {
>         xiPath = new Path(tempPath, "xi");
>         if (overwrite) {
>           fs.delete(xiPath, true);
>         }
> ====>   MatrixColumnMeansJob.run(conf, inputPaths[0], xiPath);
>       }
>
> ... and then passing it all to the SSVD solver:
>
>       SSVDSolver solver =
>         new SSVDSolver(conf,
>                        inputPaths,
>                        new Path(tempPath, "ssvd"),
>                        r,
>                        k,
>                        p,
>                        reduceTasks);
>
>       solver.setMinSplitSize(minSplitSize);
>       solver.setComputeU(computeU);
>       solver.setComputeV(computeV);
>       solver.setcUHalfSigma(cUHalfSigma);
>       solver.setcVHalfSigma(cVHalfSigma);
>       solver.setcUSigma(cUSigma);
>       solver.setOuterBlockHeight(h);
>       solver.setAbtBlockHeight(abh);
>       solver.setQ(q);
>       solver.setBroadcast(broadcast);
>       solver.setOverwrite(overwrite);
>
>       if (xiPath != null) {
> ====>   solver.setPcaMeanPath(new Path(xiPath, "part-*"));
>       }
>
> essential pieces marked with double arrows.
>
> On Wed, Jul 3, 2013 at 1:34 PM, Chirag Lakhani <[email protected]> wrote:
>
> > okay thanks. It looks like I have that part running so I will go back
> > to the SSVDCli to finish the rest. Thanks for your help.
> >
> > Chirag
> >
> > On Wed, Jul 3, 2013 at 4:19 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> > > On Wed, Jul 3, 2013 at 12:25 PM, Chirag Lakhani <[email protected]> wrote:
> > >
> > > > Okay thanks for that. After working on that issue I am still
> > > > having trouble running the SSVD solver. I know I have asked this
> > > > before, but I still cannot initiate the SSVD solver when the input,
> > > > called inputFolder, is the location of the sequence files of
> > > > DenseVectors. Is there something I am missing with this code?
> > > >
> > > >     String inputFolder = "/data_csv_for_pca/";
> > > >     String pcaOutput = "/vectors/";
> > > >     String column_type = "DenseVector";
> > > >     Path input_vec = new Path(inputFolder);
> > > >
> > > >     SSVDSolver solver = new SSVDSolver(conf, new Path[] {input_vec},
> > > >         new Path(pcaOutput), 18, 5, 3, 10);
> > >
> > > SSVDSolver does not encapsulate the entire PCA workflow on its own.
> > >
> > > You can use SSVDCli as an example to build the entire thing to embed.
> > > The SSVDSolver class does not compute the PCA offset on its own;
> > > SSVDCli uses another job from Distributed Matrix to compute that
> > > (again, see SSVDCli code).
> > >
> > > Problems with not finding input -- about 1 million reasons in your
> > > case. Try to use absolute hdfs:// -prefixed paths for all files.
> > >
> > > > On Wed, Jul 3, 2013 at 12:24 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > >
> > > > > There's probably confusion about options.
> > > > >
> > > > > (1) --pca=true enables pca flow in general. There's more to it
> > > > > than just taking a mean and re-centering.
> > > > > (2) --us=true enables computation of the U*Sigma flow, which is
> > > > > what approximates dimensionality-reduced output with original
> > > > > variances. This is what one usually wants from PCA, although in
> > > > > some cases it may be useful to just use U.
> > > > > (3) optionally, one may supply externally computed colmean by
> > > > > using --pcaOffset. Motivation behind this option is that usually
> > > > > PCA is never a standalone job in a pipeline. Usually there's a MR
> > > > > job that preps the PCA input, in which case it is very easy to
> > > > > take row averages in the reducers of the previous step (and do
> > > > > final averaging in front end). That saves one MR pass over the
> > > > > input, because in SSVD average will require one additional MR
> > > > > pass over A.
> > > > >
> > > > > Bottom line, typically one wants something along the lines
> > > > >
> > > > >     ssvd --pca=true -u=false -v=false -us=true ...
> > > > >
> > > > > On Wed, Jul 3, 2013 at 8:58 AM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > >
> > > > > > On Jul 3, 2013 6:56 AM, "Chirag Lakhani" <[email protected]> wrote:
> > > > > >
> > > > > > > So how does the column mean get calculated if the
> > > > > > > --pcaOffset option is not specified?
> > > > > >
> > > > > > By taking average of all row vectors. See code for details.
> > > > > >
> > > > > > > I would think you are just doing SVD at that point.
> > > > > >
> > > > > > This statement is incorrect. I know because I designed this code.
> > > > > >
> > > > > > > On Tue, Jul 2, 2013 at 5:52 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > > > >
> > > > > > > > On Tue, Jul 2, 2013 at 1:52 PM, Chirag Lakhani <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > I am trying to use the Mahout/Java API to do PCA but I am
> > > > > > > > > confused about the right order to do things. To start, I
> > > > > > > > > have a list of DenseVectors that I am reading into the
> > > > > > > > > code and turning into a distributed matrix in the
> > > > > > > > > following form.
> > > > > > > > >
> > > > > > > > >     DistributedRowMatrix m = new DistributedRowMatrix(input_vec,
> > > > > > > > >         matrix_path, num_rows, num_cols);
> > > > > > > > >
> > > > > > > > > When I run this code, I would have thought it would
> > > > > > > > > output the result into the path called "matrix_path" so
> > > > > > > > > that I can then use something like
> > > > > > > > > MatrixColumnMeansJob.run to get the mean. When I run this
> > > > > > > > > bit of code I get no output. Is there something else I
> > > > > > > > > should do, or is there a better way to calculate the mean
> > > > > > > > > for my file?
> > > > > > > > >
> > > > > > > > > From what I understand about the SSVD CLI code, you need
> > > > > > > > > to calculate the column mean and then output it into a
> > > > > > > > > directory.
> > > > > > > >
> > > > > > > > No, you don't have to (although you have an _option_ to
> > > > > > > > calculate and substitute one yourself if for some reason it
> > > > > > > > is already known.) Default use assumes it would calculate
> > > > > > > > it for you.
> > > > > > > >
> > > > > > > > > Is there a good way to do this if I am starting from a
> > > > > > > > > file which is a sequence file of DenseVectors?
> > > > > > > >
> > > > > > > > Yes. just don't specify --pcaOffset option.

--
*Chirag Lakhani*
Data Scientist
Zaloni, Inc. | www.zaloni.com
633 Davis Dr., Suite 200
Durham, NC 27713
e: [email protected]
p: 919.602.4965 x7020
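
For reference, the column mean discussed in this thread ("xi") is simply the average of all row vectors. Below is a minimal plain-Java sketch of that arithmetic. It does not use Mahout or Hadoop, the class and method names (ColumnMeanSketch, columnMeans, center) are hypothetical, and it is not how the SSVD solver itself applies the mean (as noted in the thread, the PCA flow involves more than taking a mean and re-centering).

```java
import java.util.Arrays;

// Plain-Java illustration only; all names here are hypothetical and
// nothing in this class comes from Mahout.
final class ColumnMeanSketch {

    // Returns the mean row vector ("xi"): entry j is the average of column j.
    static double[] columnMeans(double[][] rows) {
        if (rows.length == 0) {
            throw new IllegalArgumentException("need at least one row");
        }
        double[] xi = new double[rows[0].length];
        for (double[] row : rows) {
            for (int j = 0; j < xi.length; j++) {
                xi[j] += row[j];
            }
        }
        for (int j = 0; j < xi.length; j++) {
            xi[j] /= rows.length;
        }
        return xi;
    }

    // Returns a copy of the input with xi subtracted from every row.
    static double[][] center(double[][] rows, double[] xi) {
        double[][] centered = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            centered[i] = new double[xi.length];
            for (int j = 0; j < xi.length; j++) {
                centered[i][j] = rows[i][j] - xi[j];
            }
        }
        return centered;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 6}};
        double[] xi = columnMeans(a);
        System.out.println(Arrays.toString(xi));                    // [2.0, 4.0]
        System.out.println(Arrays.deepToString(center(a, xi)));     // [[-1.0, -2.0], [1.0, 2.0]]
    }
}
```

On a real data set the same average would be computed by a distributed job (MatrixColumnMeansJob in the code quoted above) rather than in memory.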
