On Wed, Apr 27, 2011 at 8:41 PM, Ted Dunning <[email protected]> wrote:
> On Wed, Apr 27, 2011 at 5:28 PM, Vckay <[email protected]> wrote:
> > Assuming the data is available as a text file with rows representing
> > measurements,
>
> A org.apache.mahout.math.hadoop.DistributedRowMatrix is the traditional
> approach to this.

All right, I will do that.

> > 1. Have a dataCenteringDriver that calls an empiricalMeanGenerator
> > driver. This would compute the empirical mean. I have done this. I
> > unfortunately have not found an elegant way of using a single Map
> > Reduce + Combiner job to finally generate one single output vector.
>
> Use a combiner and a single reducer.
>
> > Currently, I am generating key value pairs where the key refers to the
> > index of the vector and the value refers to the mean for that dimension,
> > and am saving them.
>
> I think you should use a constant key and a value that is the original
> vector. Then define the combiner and reducer to sum the values and output
> a key and vector. All of the combiner outputs will go to the same reducer,
> which will output a single vector.

I am not completely sure I understand what you are saying here (I am
probably missing something really obvious). Let me go over this once more:
the mapper emits every vector with a constant key. The combiner takes in
all the vectors, adds them up, and sends the reducer the aggregated output
as a <constant key, vector> pair. However, the problem I can see is that
the reducer won't know the number of values that were aggregated by the
combiner, which it needs in order to compute the mean. I am talking about
the issue mentioned here:
http://philippeadjiman.com/blog/2010/01/14/hadoop-tutorial-series-issue-4-to-use-or-not-to-use-a-combiner/
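One workaround I can think of for the count problem is to have the mapper
widen each emitted vector by one slot holding a 1.0, so the running count
travels with the running sum through any number of combine passes, and the
single reducer just divides at the end. Below is a rough, untested sketch
of what I mean. The class names (EmpiricalMeanMapper, VectorSumCombiner,
EmpiricalMeanReducer) are placeholders of mine, and I am assuming the new
org.apache.hadoop.mapreduce API and the usual (IntWritable, VectorWritable)
row format of a DistributedRowMatrix.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Mapper: emit every row under a single constant key, widened by one slot
// whose value 1.0 carries the row count alongside the partial sums.
public class EmpiricalMeanMapper
    extends Mapper<IntWritable, VectorWritable, NullWritable, VectorWritable> {
  @Override
  protected void map(IntWritable rowId, VectorWritable row, Context ctx)
      throws IOException, InterruptedException {
    Vector v = row.get();
    Vector widened = new DenseVector(v.size() + 1);
    for (int i = 0; i < v.size(); i++) {
      widened.setQuick(i, v.getQuick(i));
    }
    widened.setQuick(v.size(), 1.0);        // count slot
    ctx.write(NullWritable.get(), new VectorWritable(widened));
  }
}

// Combiner: just add the widened vectors; the count slot sums correctly too.
class VectorSumCombiner
    extends Reducer<NullWritable, VectorWritable, NullWritable, VectorWritable> {
  @Override
  protected void reduce(NullWritable key, Iterable<VectorWritable> vals, Context ctx)
      throws IOException, InterruptedException {
    Vector sum = null;
    for (VectorWritable vw : vals) {
      sum = (sum == null) ? new DenseVector(vw.get().size()).plus(vw.get())
                          : sum.plus(vw.get());
    }
    ctx.write(key, new VectorWritable(sum));
  }
}

// Single reducer: same summation, then divide the first n slots by the count.
class EmpiricalMeanReducer
    extends Reducer<NullWritable, VectorWritable, NullWritable, VectorWritable> {
  @Override
  protected void reduce(NullWritable key, Iterable<VectorWritable> vals, Context ctx)
      throws IOException, InterruptedException {
    Vector sum = null;
    for (VectorWritable vw : vals) {
      sum = (sum == null) ? new DenseVector(vw.get().size()).plus(vw.get())
                          : sum.plus(vw.get());
    }
    int n = sum.size() - 1;
    double count = sum.get(n);              // how many rows were summed
    Vector mean = sum.viewPart(0, n).divide(count);
    ctx.write(key, new VectorWritable(mean));
  }
}

The driver would set the combiner via job.setCombinerClass(...) and force a
single reducer with job.setNumReduceTasks(1), so the final output is exactly
one mean vector. Does that match what you had in mind?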
> > 2. Assuming I get a vector out of the empiricalMeanGenerator phase, I am
> > planning on using the VectorCache as a way of passing this vector on to
> > the job that takes the input matrix (now a DistributedRowMatrix) and
> > centers the data.
>
> Sounds about right.
>
> But when did the input get converted? I think it should be converted
> before starting.

Yep. Sorry. I messed up the work flow.

> > 3. Now that I have the centered data, computing the covariance matrix
> > shouldn't be too hard if I have represented my matrix as a distributed
> > row matrix. I can then use "times" to produce the covariance matrix.
>
> Actually, this is liable to be a disaster because the covariance matrix
> will be dense after you subtract the mean.
>
> If your data is dense to begin with, this is no big deal. If your data is
> sparse, then this is a huge deal.
>
> A more subtle point is that the computational cost of PCA on a dense
> matrix is something like k n^3, where k is the number of eigenvalues you
> want and n is the size of the matrix. If you can compute PCA for an input
> of size N on a single machine in time T, then you are going to have a
> horrendous scaling problem. The issue is that by moving to network
> transfer instead of memory transfer, you are losing several orders of
> magnitude in speed. That means that even matching the speed of the
> sequential algorithm will take quite a few machines. If you increase the
> size of the problem by, say, 10 to make up for this, you now have a
> problem that will take a thousand times longer than before, and you are
> still going to be in the hole.
>
> This sort of thing is why almost all really, really large scale data
> mining exercises work with sparse matrices. That drops most algorithms to
> something like O(n), so that adding more horsepower does some real damage
> to the size of n and so that increasing n is relatively survivable. In
> fact, O(n) algorithms are almost a defining property of scalable data
> mining.
>
> Back to your algorithm, I have two questions:
>
> a) Can you do the SVD of the original matrix rather than the eigenvalue
> computation of the covariance? I think that this is likely to be
> numerically better.
>
> b) Is there some perturbation trick you can do to avoid the mean-shift
> problem? I know that you can deal with (A - \lambda I), but you have
> (A - e m'), where e is the vector of all ones.

This is an interesting issue: I briefly thought about it when I saw a
thread where you had mentioned it previously. However, the good thing is
that I probably don't have to worry about this in a first-stage
implementation, because I plan to test the algorithm on image data, which
is pretty dense. I believe there are ways of doing distributed PCA by
allowing each node to work on a chunk of the matrix separately. For
example: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4016296&tag=1
I will probably look at that once I am done with the first step of wrapping
my head around the Mahout library and how a simple PCA implementation can
be done with it.

> > 4. Last lap: use the DistributedLanczosSolver to produce the
> > eigenvectors.
> >
> > What do you guys think?
>
> The last step sounds good!
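For concreteness, this is roughly how I picture steps 3 and 4 hanging
together on the driver side. It is untested and written from memory, so the
exact DistributedRowMatrix / DistributedLanczosSolver signatures, and my
recollection that times() on a DistributedRowMatrix computes
this.transpose().times(other), are assumptions I still need to check
against trunk; the paths and dimensions below are made up.

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.hadoop.DistributedRowMatrix;
import org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver;

public class PcaDriverSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    int numRows = 100000;   // number of centered measurement rows (made up)
    int numCols = 1024;     // dimensionality of each measurement (made up)
    int desiredRank = 20;   // how many principal components we want

    // Centered data written by the centering job, one row vector per record.
    DistributedRowMatrix centered = new DistributedRowMatrix(
        new Path("/pca/centered"), new Path("/pca/tmp"), numRows, numCols);
    centered.setConf(conf);

    // Step 3: if times() really computes this.transpose().times(other), then
    // centered.times(centered) is the numCols x numCols scatter matrix A'A;
    // dividing its entries by (numRows - 1) would give the covariance proper.
    DistributedRowMatrix scatter = centered.times(centered);
    scatter.setConf(conf);

    // Step 4: Lanczos on the symmetric scatter matrix for the top eigenpairs.
    Matrix eigenVectors = new DenseMatrix(desiredRank, numCols);
    List<Double> eigenValues = new ArrayList<Double>();
    new DistributedLanczosSolver()
        .solve(scatter, desiredRank, eigenVectors, eigenValues, true);
  }
}

If your suggestion (a) turns out to be the better route, the scatter-matrix
step would simply drop out and the centered matrix itself would be handed to
the solver, but I'd like to get the straightforward version working first.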
