On Wed, Apr 27, 2011 at 6:41 PM, Ted Dunning <[email protected]> wrote:

> > 3. Now that I have the centered data, computing the covariance matrix
> > shouldn't be too hard if I have represented my matrix as a distributed
> > row matrix. I can then use "times" to produce the covariance matrix.
>
> Actually, this is liable to be a disaster because the covariance matrix
> will be dense after you subtract the mean.
This is exactly what I was thinking.

> a) can you do the SVD of the original matrix rather than the eigenvalue
> computation of the covariance? I think that this is likely to be
> numerically better.
>
> b) is there some perturbation trick that you can use to avoid the mean
> shift problem? I know that you can deal with (A - \lambda I), but here you
> have (A - e m'), where e is the vector of all ones.

I would love to know the answer to this question. Thinking on it a little
further, this is not so bad: say we had a finished patch for the idea
discussed in MAHOUT-672 (virtual distributed matrices), where in this case
we have (A - e m'), with e and m represented in a nice compact fashion
(they're just vectors, after all). Lanczos then operates by repeatedly
multiplying this matrix against some dense vector. Computing A . v is
fine, and (e m') . v = (v.dot(m)) e is also easy to compute, so repeated
iteration is not so bad at all (a small sketch of this product follows
below).

I'm guessing that I've just reinvented sparse PCA, unless this is all crazy?

-jake
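For concreteness, here is a minimal sketch of the rank-one trick described above, in plain Java with dense arrays for brevity. The class and method names are hypothetical (this is not the MAHOUT-672 API, and a real DistributedRowMatrix would keep A sparse and distributed); the point is just that (A - e m') . v can be computed as A.v - (m . v) e without ever materializing the dense centered matrix.

    /**
     * Sketch: multiply the virtual mean-centered matrix (A - e m') by a
     * dense vector v without densifying A. Here A is n x d, m is the
     * column-mean vector (length d), and e is implicitly all ones.
     */
    public class CenteredMatVecSketch {

      // Computes (A - e m') . v = A.v - (m . v) e
      static double[] timesCentered(double[][] a, double[] m, double[] v) {
        int n = a.length;
        double[] result = new double[n];

        // m . v, computed once; (e m') . v is just this scalar times e
        double mDotV = 0.0;
        for (int j = 0; j < v.length; j++) {
          mDotV += m[j] * v[j];
        }

        for (int i = 0; i < n; i++) {
          double dot = 0.0;
          for (int j = 0; j < a[i].length; j++) {
            dot += a[i][j] * v[j];   // row_i(A) . v
          }
          result[i] = dot - mDotV;   // subtract the rank-one correction
        }
        return result;
      }

      public static void main(String[] args) {
        double[][] a = {{1, 0, 2}, {0, 3, 0}};
        double[] m = {0.5, 1.5, 1.0};   // column means of A
        double[] v = {1, 1, 1};
        // Prints [0.0, 0.0], matching (A - e m') v computed densely
        System.out.println(java.util.Arrays.toString(timesCentered(a, m, v)));
      }
    }

Presumably the symmetric product Lanczos actually needs would apply this once for A and once for A', but either way the correction for the mean shift stays a single dot product plus a scalar broadcast per multiply.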
