Re: PCA using Java Code

Chirag Lakhani Wed, 03 Jul 2013 13:36:18 -0700

Okay, thanks.  It looks like I have that part running, so I will go back to
the SSVDCli to finish the rest.  Thanks for your help.


Chirag


On Wed, Jul 3, 2013 at 4:19 PM, Dmitriy Lyubimov <[email protected]> wrote:

> On Wed, Jul 3, 2013 at 12:25 PM, Chirag Lakhani <[email protected]> wrote:
>
> > Okay, thanks for that.  After working on that issue I am still having
> > trouble running the SSVD solver.  I know I have asked this before, but I
> > still cannot initiate the SSVD solver when the input, called inputFolder,
> > is the location of the sequence files of DenseVectors.  Is there
> > something I am missing with this code?
> >
> >
> > String inputFolder = "/data_csv_for_pca/";
> > String pcaOutput = "/vectors/";
> > String column_type = "DenseVector";
> > Path input_vec = new Path(inputFolder);
> >
> > SSVDSolver solver = new SSVDSolver(conf, new Path[] {input_vec},
> >     new Path(pcaOutput), 18, 5, 3, 10);
> >
>
>
> SSVDSolver does not encapsulate the entire PCA workflow on its own.
>
> You can use SSVDCli as an example of how to build the entire thing to
> embed. The SSVDSolver class does not compute the PCA offset on its own;
> SSVDCli uses another job from Distributed Matrix to compute that (again,
> see the SSVDCli code).
>
> As for problems with not finding the input, there are about 1 million
> possible reasons in your case. Try using absolute, hdfs://-prefixed paths
> for all files.
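
A minimal sketch of the path advice above, in plain Java with no Hadoop
dependency. The namenode address is a placeholder assumption (it does not
come from this thread) and the qualify helper is hypothetical; the resulting
strings are what one would then hand to new Path(...).

```java
import java.net.URI;

class HdfsPathSketch {
    // Assumed placeholder; replace with the cluster's actual namenode address.
    static final String ASSUMED_NAMENODE = "hdfs://namenode.example.com:8020";

    /**
     * Returns the path unchanged if it already carries a scheme
     * (e.g. hdfs://...), otherwise prefixes the assumed namenode address.
     */
    static String qualify(String path) {
        URI uri = URI.create(path);
        if (uri.getScheme() != null) {
            return path; // already fully qualified
        }
        if (!path.startsWith("/")) {
            throw new IllegalArgumentException("expected an absolute path: " + path);
        }
        return ASSUMED_NAMENODE + path;
    }

    public static void main(String[] args) {
        // The two locations used in the snippet quoted above.
        System.out.println(qualify("/data_csv_for_pca/"));
        System.out.println(qualify("/vectors/"));
    }
}
```

Failing fast on relative paths is a deliberate choice here: a relative path
resolves against whatever working directory the job happens to run with, which
is one of the easier ways to end up with "input not found".
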
>
>
> >
> >
> > On Wed, Jul 3, 2013 at 12:24 PM, Dmitriy Lyubimov <[email protected]> wrote:
> >
> > > There's probably confusion about the options.
> > >
> > > (1) --pca=true enables the PCA flow in general. There's more to it than
> > > just taking a mean and re-centering.
> > > (2) --us=true enables computation of the U*Sigma output, which
> > > approximates the dimensionality-reduced output with the original
> > > variances. This is what one usually wants from PCA, although in some
> > > cases it may be useful to just use U.
> > > (3) Optionally, one may supply an externally computed column mean by
> > > using --pcaOffset. The motivation behind this option is that PCA is
> > > usually never a standalone job in a pipeline. Usually there's an MR job
> > > that preps the PCA input, in which case it is very easy to take row
> > > averages in the reducers of the previous step (and do the final
> > > averaging in the front end). That saves one MR pass over the input,
> > > because computing the average inside SSVD requires one additional MR
> > > pass over A.
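
A minimal sketch of the idea in (3), in plain Java rather than as an actual MR
job: each reducer keeps per-column sums and a row count, and the front end
merges them into the column mean. All class and method names here are
hypothetical, and writing the result out in whatever form --pcaOffset expects
is not shown.

```java
import java.util.Arrays;
import java.util.List;

class ColumnMeanSketch {
    /** What one reducer might emit: per-column sums plus the number of rows seen. */
    static final class Partial {
        final double[] sums;
        final long rows;

        Partial(double[] sums, long rows) {
            this.sums = sums;
            this.rows = rows;
        }
    }

    /** Reducer side: accumulate the rows of one partition. */
    static Partial accumulate(double[][] rows, int numCols) {
        double[] sums = new double[numCols];
        for (double[] row : rows) {
            for (int j = 0; j < numCols; j++) {
                sums[j] += row[j];
            }
        }
        return new Partial(sums, rows.length);
    }

    /** Front end: merge the partials and divide by the total row count. */
    static double[] columnMean(List<Partial> partials, int numCols) {
        double[] mean = new double[numCols];
        long total = 0;
        for (Partial p : partials) {
            for (int j = 0; j < numCols; j++) {
                mean[j] += p.sums[j];
            }
            total += p.rows;
        }
        if (total == 0) {
            throw new IllegalArgumentException("no rows");
        }
        for (int j = 0; j < numCols; j++) {
            mean[j] /= total;
        }
        return mean;
    }

    public static void main(String[] args) {
        Partial a = accumulate(new double[][] {{1, 2}, {3, 4}}, 2);
        Partial b = accumulate(new double[][] {{5, 6}}, 2);
        System.out.println(Arrays.toString(columnMean(Arrays.asList(a, b), 2)));
    }
}
```

Carrying sums and counts instead of per-reducer averages keeps the merge exact
when partitions hold different numbers of rows.
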
> > >
> > > Bottom line, typically one wants something along the lines of
> > >
> > > ssvd --pca=true -u=false -v=false -us=true ...
> > >
> > >
> > >
> > >
> > > On Wed, Jul 3, 2013 at 8:58 AM, Dmitriy Lyubimov <[email protected]> wrote:
> > >
> > > >
> > > > On Jul 3, 2013 6:56 AM, "Chirag Lakhani" <[email protected]> wrote:
> > > > >
> > > > > So how does the column mean get calculated if the --pcaOffset
> > > > > option is not
> > > > By taking the average of all row vectors. See the code for details.
> > > >
> > > > > specified?  I would think you are just doing SVD at that point.
> > > > This statement is incorrect. I know because I designed this code.
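
For reference, the textbook difference between PCA and a plain SVD is that PCA
decomposes the mean-centered matrix, where the mean is the average of all row
vectors. A minimal plain-Java sketch of that centering step, for illustration
only: the names are hypothetical, and it makes no claim about how the SSVD job
handles the mean internally.

```java
import java.util.Arrays;

class MeanCenterSketch {
    /** Column mean: the average of all row vectors. */
    static double[] columnMean(double[][] a) {
        double[] mean = new double[a[0].length];
        for (double[] row : a) {
            for (int j = 0; j < mean.length; j++) {
                mean[j] += row[j];
            }
        }
        for (int j = 0; j < mean.length; j++) {
            mean[j] /= a.length;
        }
        return mean;
    }

    /** Returns a copy of the matrix with the column mean subtracted from every row. */
    static double[][] center(double[][] a) {
        double[] mean = columnMean(a);
        double[][] out = new double[a.length][mean.length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < mean.length; j++) {
                out[i][j] = a[i][j] - mean[j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 6}};
        System.out.println(Arrays.deepToString(center(a)));
    }
}
```
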
> > > >
> > > > >
> > > > >
> > > > > On Tue, Jul 2, 2013 at 5:52 PM, Dmitriy Lyubimov <[email protected]> wrote:
> > > > >
> > > > > > On Tue, Jul 2, 2013 at 1:52 PM, Chirag Lakhani <[email protected]> wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I am trying to use the Mahout/Java API to do PCA, but I am
> > > > > > > confused about the right order in which to do things.  To
> > > > > > > start, I have a list of DenseVectors that I am reading into the
> > > > > > > code and turning into a distributed matrix in the following
> > > > > > > form.
> > > > > > >
> > > > > > > DistributedRowMatrix m = new DistributedRowMatrix(input_vec,
> > > > > > >     matrix_path, num_rows, num_cols);
> > > > > > >
> > > > > > > When I run this code, I would have thought it would output the
> > > > > > > result into the path called "matrix_path" so that I can then
> > > > > > > use something like MatrixColumnMeansJob.run to get the mean.
> > > > > > > When I run this bit of code I get no output.  Is there
> > > > > > > something else I should do, or is there a better way to
> > > > > > > calculate the mean for my file?
> > > > > > >
> > > > > > >
> > > > > > > From what I understand about the SSVD CLI code, you need to
> > > > > > > calculate the column mean and then output it into a directory.
> > > > > >
> > > > > >
> > > > > >
> > > > > > No, you don't have to (although you have an _option_ to calculate
> > > > > > and substitute one yourself if for some reason it is already
> > > > > > known). The default usage assumes it will calculate it for you.
> > > > > >
> > > > > >
> > > > > >
> > > > > > > Is there a good way to do this if I am starting from a file
> > > > > > > which is a sequence file of DenseVectors?
> > > > > > >
> > > > > >
> > > > > > Yes. Just don't specify the --pcaOffset option.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > --
> > > > > > >
> > > > > > > *Chirag Lakhani*
> > > > > > >
> > > > > > > Data Scientist
> > > > > > >
> > > > > > > Zaloni, Inc. | www.zaloni.com
> > > > > > >
> > > > > > > 633 Davis Dr., Suite 200
> > > > > > >
> > > > > > > Durham, NC 27713
> > > > > > > e: [email protected]
> > > > > > > p: 919.602.4965 x7020
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> >
> >
> >
> >
>



