On Wed, Apr 27, 2011 at 6:41 PM, Ted Dunning <[email protected]> wrote:

> > 3. Now that I have the centered data, computing the covariance matrix
> > shouldn't be too hard if I have represented my matrix as a distributed
> > row matrix. I can then use "times" to produce the covariance matrix.
>
> Actually, this is liable to be a disaster because the covariance matrix
> will be dense after you subtract the mean.
This is exactly what I was thinking.

> a) can you do the SVD of the original matrix rather than the eigenvalue
> computation of the covariance? I think that this is likely to be
> numerically better.
>
> b) is there some perturbation trick that you can use to avoid the mean
> shift problem? I know that you can deal with (A - \lambda I), but here you
> have (A - e m'), where e is the vector of all ones.

I would love to know the answer to this question. Thinking on it a little
further, this is not so bad: say we had a finished patch for the idea
discussed in MAHOUT-672 (virtual distributed matrices), where in this case
we have (A - e m'), with e and m represented in a nice compact fashion
(they're just vectors, after all). Lanczos then operates by repeatedly
multiplying this matrix against some dense vector. Computing A . v is
fine, and (e m') . v = (v.dot(m)) e is also easy to compute, so repeated
iteration is not so bad at all (a small sketch of this product follows
below).

I'm guessing that I've just reinvented sparse PCA, unless this is all crazy?

-jake
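For concreteness, here is a minimal sketch of the rank-one trick described above, in plain Java with dense arrays for brevity. The class and method names are hypothetical (this is not the MAHOUT-672 API, and a real DistributedRowMatrix would keep A sparse and distributed); the point is just that (A - e m') . v can be computed as A.v - (m . v) e without ever materializing the dense centered matrix.

    /**
     * Sketch: multiply the virtual mean-centered matrix (A - e m') by a
     * dense vector v without densifying A. Here A is n x d, m is the
     * column-mean vector (length d), and e is implicitly all ones.
     */
    public class CenteredMatVecSketch {

      // Computes (A - e m') . v = A.v - (m . v) e
      static double[] timesCentered(double[][] a, double[] m, double[] v) {
        int n = a.length;
        double[] result = new double[n];

        // m . v, computed once; (e m') . v is just this scalar times e
        double mDotV = 0.0;
        for (int j = 0; j < v.length; j++) {
          mDotV += m[j] * v[j];
        }

        for (int i = 0; i < n; i++) {
          double dot = 0.0;
          for (int j = 0; j < a[i].length; j++) {
            dot += a[i][j] * v[j];   // row_i(A) . v
          }
          result[i] = dot - mDotV;   // subtract the rank-one correction
        }
        return result;
      }

      public static void main(String[] args) {
        double[][] a = {{1, 0, 2}, {0, 3, 0}};
        double[] m = {0.5, 1.5, 1.0};   // column means of A
        double[] v = {1, 1, 1};
        // Prints [0.0, 0.0], matching (A - e m') v computed densely
        System.out.println(java.util.Arrays.toString(timesCentered(a, m, v)));
      }
    }

Presumably the symmetric product Lanczos actually needs would apply this once for A and once for A', but either way the correction for the mean shift stays a single dot product plus a scalar broadcast per multiply.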
