On Wed, Apr 27, 2011 at 5:28 PM, Vckay <[email protected]> wrote:

> Assuming the data is available as a text file with rows representing
> measurements,

An org.apache.mahout.math.hadoop.DistributedRowMatrix is the traditional
approach to this.

> 1. Have a dataCenteringDriver that calls an empiricalMeanGenerator driver.
> This would compute the empirical mean. I have done this. I unfortunately
> have not found an elegant way of using a single map-reduce + combiner job
> to finally generate one single output vector.

Use a combiner and a single reducer.

> Currently, I am generating key value pairs where key refers to the index
> of the vector and value refers to the mean for the dimension and am
> saving them.

I think you should use a constant key and a value that is the original
vector. Then define the combiner and reducer to sum the values and output a
key and vector. All of the combiner outputs will go to the same reducer,
which will output a single vector.
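Roughly like this (a rough sketch, not tested; I'm assuming the rows already
sit in a sequence file of (IntWritable, VectorWritable) pairs as in a
DistributedRowMatrix, and that the row count is known up front and passed in
under a made-up config key):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Map: emit every row under one constant key so everything funnels
// into a single reduce group.
public class RowMapper
    extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {
  private static final IntWritable CONSTANT_KEY = new IntWritable(0);

  @Override
  protected void map(IntWritable rowIndex, VectorWritable row, Context ctx)
      throws IOException, InterruptedException {
    ctx.write(CONSTANT_KEY, row);
  }
}

// Combine: sum partial groups of rows map-side.
public class VectorSumCombiner
    extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {
  @Override
  protected void reduce(IntWritable key, Iterable<VectorWritable> rows,
      Context ctx) throws IOException, InterruptedException {
    Vector sum = null;
    for (VectorWritable v : rows) {
      sum = sum == null ? v.get().clone() : sum.plus(v.get());
    }
    ctx.write(key, new VectorWritable(sum));
  }
}

// Reduce (single reducer): finish the sum and divide by the row count
// to get the empirical mean as one vector.
public class MeanReducer
    extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {
  private double numRows;

  @Override
  protected void setup(Context ctx) {
    numRows = ctx.getConfiguration().getLong("mean.num.rows", 1); // made-up key
  }

  @Override
  protected void reduce(IntWritable key, Iterable<VectorWritable> sums,
      Context ctx) throws IOException, InterruptedException {
    Vector total = null;
    for (VectorWritable v : sums) {
      total = total == null ? v.get().clone() : total.plus(v.get());
    }
    ctx.write(key, new VectorWritable(total.divide(numRows)));
  }
}

Wire it up with job.setCombinerClass(VectorSumCombiner.class),
job.setReducerClass(MeanReducer.class) and job.setNumReduceTasks(1).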
> 2. Assuming I get a vector out of the empiricalMeanGenerator phase, I am
> planning on using the VectorCache as a way of passing this vector onto the
> job that takes the input matrix (now a DistributedRowMatrix) to center the
> data.

Sounds about right. But when did the input get converted? I think it should
be converted before starting.

> 3. Now that I have the centered data, computing the covariance matrix
> shouldn't be too hard if I have represented my matrix as a distributed row
> matrix. I can then use "times" to produce the covariance matrix.

Actually, this is liable to be a disaster because the covariance matrix
will be dense after you subtract the mean. If your data is dense to begin
with, this is no big deal. If your data is sparse, then this is a huge
deal.

A more subtle point is that the computational cost of PCA on a dense matrix
is something like k n^3, where k is the number of eigenvalues you want and
n is the size of the matrix. If you can compute PCA for an input of size N
on a single machine in time T, then you are going to have a horrendous
scaling problem. The issue is that by moving to network transfer instead of
memory transfer, you are losing several orders of magnitude in speed. That
means that even matching the speed of the sequential algorithm will take
quite a few machines. If you then increase the size of the problem by, say,
10x to make up for this, you now have a problem that takes a thousand times
longer than before (at n^3, ten times the size is a thousand times the
work), and you are still going to be in the hole.

This sort of thing is why almost all really, really large-scale data-mining
exercises work with sparse matrices. That drops most algorithms to
something like O(n), so that adding more horsepower does some real damage
to the size of n, and so that increasing n is relatively survivable. In
fact, O(n) algorithms are almost a defining property of scalable data
mining.

Back to your algorithm, I have two questions:

a) Can you do the SVD of the original matrix rather than the eigenvalue
computation of the covariance? I think that this is likely to be
numerically better. (See the note below.)

b) Is there some perturbation trick you can use to avoid the mean-shift
problem? I know that you can deal with (A - \lambda I), but here you have
(A - e m') where e is the vector of all ones. (See the sketch below.)

> 4. Last lap: use the DistributedLanczosSolver to produce the
> eigenvectors.
>
> What do you guys think?

The last step sounds good!
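On (a), to spell out the connection: if A is the centered matrix with n
rows, the covariance is C = A'A / (n - 1). If A = U S V' is the SVD, then

  C = V (S^2 / (n - 1)) V'

so the right singular vectors of A are exactly the eigenvectors of C, and
eigenvalue i is s_i^2 / (n - 1). Working with A directly avoids forming A'A
at all, which both keeps things sparser and avoids squaring the condition
number.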

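And on (b), here is the kind of perturbation trick I had in mind: never
materialize (A - e m') at all. Lanczos only ever touches the matrix through
matrix-vector products, and

  (A - e m') x = A x - e (m' x)

i.e. do the sparse multiply and then subtract the scalar m'x from every
component. A rough sketch (MeanCenteredTimes is a made-up name; I'm
assuming the times(Vector) that DistributedRowMatrix exposes):

import org.apache.mahout.math.Vector;
import org.apache.mahout.math.hadoop.DistributedRowMatrix;

// Applies (A - e m') x = A x - e (m' x) without forming the dense
// centered matrix: A stays sparse, the correction is rank one.
public class MeanCenteredTimes {
  private final DistributedRowMatrix a; // sparse A
  private final Vector mean;            // empirical mean m

  public MeanCenteredTimes(DistributedRowMatrix a, Vector mean) {
    this.a = a;
    this.mean = mean;
  }

  public Vector times(Vector x) {
    Vector ax = a.times(x);      // sparse distributed multiply
    double shift = mean.dot(x);  // scalar m' x
    return ax.plus(-shift);      // A x minus the constant vector e (m' x)
  }
}

The transpose product works the same way: (A - e m')' y = A' y - m (e' y),
i.e. subtract (sum of the components of y) times m, so the A'A-style
products that Lanczos needs stay sparse-friendly too.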