On Wed, Apr 27, 2011 at 5:28 PM, Vckay <[email protected]> wrote:

> Assuming the data is available as a text file with rows representing
> measurements,

An org.apache.mahout.math.hadoop.DistributedRowMatrix is the traditional
approach to this.

> 1. Have a dataCenteringDriver that calls an empiricalMeanGenerator driver.
> This would compute the empirical mean. I have done this. I unfortunately
> have not found an elegant way of using a single map-reduce + combiner job
> to finally generate one single output vector.

Use a combiner and a single reducer.

> Currently, I am generating key value pairs where key refers to the index
> of the vector and value refers to the mean for the dimension and am
> saving them.

I think you should use a constant key and a value that is the original
vector. Then define the combiner and reducer to sum the values and output a
key and vector. All of the combiner outputs will go to the same reducer,
which will output a single vector.
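Roughly like this (a rough sketch, not tested; I'm assuming the rows already
sit in a sequence file of (IntWritable, VectorWritable) pairs as in a
DistributedRowMatrix, and that the row count is known up front and passed in
under a made-up config key):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Map: emit every row under one constant key so everything funnels
// into a single reduce group.
public class RowMapper
    extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {
  private static final IntWritable CONSTANT_KEY = new IntWritable(0);

  @Override
  protected void map(IntWritable rowIndex, VectorWritable row, Context ctx)
      throws IOException, InterruptedException {
    ctx.write(CONSTANT_KEY, row);
  }
}

// Combine: sum partial groups of rows map-side.
public class VectorSumCombiner
    extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {
  @Override
  protected void reduce(IntWritable key, Iterable<VectorWritable> rows,
      Context ctx) throws IOException, InterruptedException {
    Vector sum = null;
    for (VectorWritable v : rows) {
      sum = sum == null ? v.get().clone() : sum.plus(v.get());
    }
    ctx.write(key, new VectorWritable(sum));
  }
}

// Reduce (single reducer): finish the sum and divide by the row count
// to get the empirical mean as one vector.
public class MeanReducer
    extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {
  private double numRows;

  @Override
  protected void setup(Context ctx) {
    numRows = ctx.getConfiguration().getLong("mean.num.rows", 1); // made-up key
  }

  @Override
  protected void reduce(IntWritable key, Iterable<VectorWritable> sums,
      Context ctx) throws IOException, InterruptedException {
    Vector total = null;
    for (VectorWritable v : sums) {
      total = total == null ? v.get().clone() : total.plus(v.get());
    }
    ctx.write(key, new VectorWritable(total.divide(numRows)));
  }
}

Wire it up with job.setCombinerClass(VectorSumCombiner.class),
job.setReducerClass(MeanReducer.class) and job.setNumReduceTasks(1).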
> 2. Assuming I get a vector out of the empiricalMeanGenerator phase, I am
> planning on using the VectorCache as a way of passing this vector onto the
> job that takes the input matrix (now a DistributedRowMatrix) to center the
> data.

Sounds about right. But when did the input get converted? I think it should
be converted before starting.

> 3. Now that I have the centered data, computing the covariance matrix
> shouldn't be too hard if I have represented my matrix as a distributed row
> matrix. I can then use "times" to produce the covariance matrix.

Actually, this is liable to be a disaster because the covariance matrix
will be dense after you subtract the mean. If your data is dense to begin
with, this is no big deal. If your data is sparse, then this is a huge
deal.

A more subtle point is that the computational cost of PCA on a dense matrix
is something like k n^3, where k is the number of eigenvalues you want and
n is the size of the matrix. If you can compute PCA for an input of size N
on a single machine in time T, then you are going to have a horrendous
scaling problem. The issue is that by moving to network transfer instead of
memory transfer, you are losing several orders of magnitude in speed. That
means that even matching the speed of the sequential algorithm will take
quite a few machines. If you then increase the size of the problem by, say,
10x to make up for this, you now have a problem that takes a thousand times
longer than before (at n^3, ten times the size is a thousand times the
work), and you are still going to be in the hole.

This sort of thing is why almost all really, really large-scale data-mining
exercises work with sparse matrices. That drops most algorithms to
something like O(n), so that adding more horsepower does some real damage
to the size of n, and so that increasing n is relatively survivable. In
fact, O(n) algorithms are almost a defining property of scalable data
mining.

Back to your algorithm, I have two questions:

a) Can you do the SVD of the original matrix rather than the eigenvalue
computation of the covariance? I think that this is likely to be
numerically better. (See the note below.)

b) Is there some perturbation trick you can use to avoid the mean-shift
problem? I know that you can deal with (A - \lambda I), but here you have
(A - e m') where e is the vector of all ones. (See the sketch below.)

> 4. Last lap: use the DistributedLanczosSolver to produce the
> eigenvectors.
>
> What do you guys think?

The last step sounds good!
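On (a), to spell out the connection: if A is the centered matrix with n
rows, the covariance is C = A'A / (n - 1). If A = U S V' is the SVD, then

  C = V (S^2 / (n - 1)) V'

so the right singular vectors of A are exactly the eigenvectors of C, and
eigenvalue i is s_i^2 / (n - 1). Working with A directly avoids forming A'A
at all, which both keeps things sparser and avoids squaring the condition
number.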

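And on (b), here is the kind of perturbation trick I had in mind: never
materialize (A - e m') at all. Lanczos only ever touches the matrix through
matrix-vector products, and

  (A - e m') x = A x - e (m' x)

i.e. do the sparse multiply and then subtract the scalar m'x from every
component. A rough sketch (MeanCenteredTimes is a made-up name; I'm
assuming the times(Vector) that DistributedRowMatrix exposes):

import org.apache.mahout.math.Vector;
import org.apache.mahout.math.hadoop.DistributedRowMatrix;

// Applies (A - e m') x = A x - e (m' x) without forming the dense
// centered matrix: A stays sparse, the correction is rank one.
public class MeanCenteredTimes {
  private final DistributedRowMatrix a; // sparse A
  private final Vector mean;            // empirical mean m

  public MeanCenteredTimes(DistributedRowMatrix a, Vector mean) {
    this.a = a;
    this.mean = mean;
  }

  public Vector times(Vector x) {
    Vector ax = a.times(x);      // sparse distributed multiply
    double shift = mean.dot(x);  // scalar m' x
    return ax.plus(-shift);      // A x minus the constant vector e (m' x)
  }
}

The transpose product works the same way: (A - e m')' y = A' y - m (e' y),
i.e. subtract (sum of the components of y) times m, so the A'A-style
products that Lanczos needs stay sparse-friendly too.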