Hello all, I am trying to implement PCA using some of the libraries from Mahout, following the TODO list posted here: https://issues.apache.org/jira/browse/MAHOUT-512 . I understand the idea behind PCA conceptually, but I am rather new to both Hadoop and Mahout. Here is what I think the workflow should look like; it would be awesome if someone could pitch in on better ways of doing things.
Assuming the data is available as a text file with rows representing measurements:

1. Have a dataCenteringDriver that calls an empiricalMeanGenerator job to compute the empirical mean. I have done this, but I unfortunately have not found an elegant way to get a single output vector out of one map-reduce-plus-combiner job. Currently I generate key-value pairs, where the key is an index into the vector and the value is the mean for that dimension, and save those; I am planning on adding a separate job that reads this file back and converts it into a single vector. I feel there should be a more efficient way of doing this, so please do chime in (see the first sketch below this list).

2. Assuming I get a vector out of the empiricalMeanGenerator phase, I plan on using VectorCache to pass this vector to the job that takes the input matrix (now a DistributedRowMatrix) and centers the data (second sketch below).

3. Now that I have the centered data, computing the covariance matrix shouldn't be too hard, since my matrix is already a DistributedRowMatrix: I can use "times" to produce the covariance matrix.

4. Last lap: use the DistributedLanczosSolver to produce the eigenvectors (steps 3 and 4 are covered by the last sketch below).
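Concretely, here is a rough sketch of the single-reducer alternative I am toying with for step 1: every mapper emits its row under one constant key with a trailing 1.0 appended as a row count, a combiner pre-aggregates the partial sums, and a single reducer (job.setNumReduceTasks(1)) divides through and emits exactly one mean vector. All the class names here are my own placeholders, not existing Mahout code:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class EmpiricalMeanJob {

  // Every row goes to the same key, so a single reducer sees everything.
  private static final IntWritable ONLY_KEY = new IntWritable(0);

  public static class SumMapper
      extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {
    @Override
    protected void map(IntWritable row, VectorWritable value, Context ctx)
        throws IOException, InterruptedException {
      Vector v = value.get();
      // Append a trailing 1.0 as the row count, so sum and count travel
      // through the combiner together in a single vector.
      Vector withCount = new DenseVector(v.size() + 1);
      withCount.viewPart(0, v.size()).assign(v);
      withCount.setQuick(v.size(), 1.0);
      ctx.write(ONLY_KEY, new VectorWritable(withCount));
    }
  }

  // Combiner: just sums the partial (sum, count) vectors.
  public static class SumCombiner
      extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<VectorWritable> values,
        Context ctx) throws IOException, InterruptedException {
      Vector sum = null;
      for (VectorWritable vw : values) {
        sum = (sum == null) ? vw.get().clone() : sum.plus(vw.get());
      }
      ctx.write(key, new VectorWritable(sum));
    }
  }

  // Single reducer (driver sets job.setNumReduceTasks(1)): divides the
  // summed vector by the accumulated count and emits one mean vector.
  public static class MeanReducer
      extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<VectorWritable> values,
        Context ctx) throws IOException, InterruptedException {
      Vector sum = null;
      for (VectorWritable vw : values) {
        sum = (sum == null) ? vw.get().clone() : sum.plus(vw.get());
      }
      int dim = sum.size() - 1;
      double count = sum.get(dim);
      // Drop the trailing count element and divide to get the mean.
      ctx.write(key, new VectorWritable(sum.viewPart(0, dim).divide(count)));
    }
  }
}

The trailing-count trick is just so the sum and the count survive the combiner in one value; if there is a cleaner idiom for this in Mahout, I'd love to hear it.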
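For step 2, this is roughly how I am imagining the hand-off, going by how the spectral clustering code uses VectorCache. I am writing the package path and signatures from memory, so please correct me if I have them wrong; the output path is made up:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.mahout.clustering.spectral.common.VectorCache;
import org.apache.mahout.math.Vector;

public class MeanHandOff {
  // Driver side: stash the empirical mean in the DistributedCache before
  // launching the centering job ("/tmp/pca/mean" is a made-up path).
  public static void pushMean(Vector meanVector, Configuration conf)
      throws IOException {
    VectorCache.save(new IntWritable(0), meanVector,
        new Path("/tmp/pca/mean"), conf);
  }

  // Mapper side, e.g. in setup(): pull the mean back out; map() would
  // then emit row.minus(mean) to center each input row.
  public static Vector pullMean(Configuration conf) throws IOException {
    return VectorCache.load(conf);
  }
}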
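And for steps 3 and 4, something like the sketch below is what I have in mind. If I am reading MatrixMultiplicationJob correctly, DistributedRowMatrix.times(other) actually computes this-transpose times other, which is exactly what I want here: for centered A, A^T A is the covariance matrix up to the 1/(n-1) factor, and skipping that factor only rescales the eigenvalues, not the eigenvectors. Paths and dimensions are placeholders, and the constructor/configure calls are from memory against trunk, so the exact signatures may be off:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.hadoop.DistributedRowMatrix;
import org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver;

public class PcaSketch {
  public static void main(String[] args) throws Exception {
    int numRows = 100000;  // placeholder: number of measurements
    int numCols = 50;      // placeholder: dimensionality
    int rank = 10;         // placeholder: how many eigenvectors to keep

    Configuration conf = new Configuration();

    // Centered data written by step 2 as a SequenceFile of VectorWritables.
    DistributedRowMatrix centered = new DistributedRowMatrix(
        new Path("/tmp/pca/centered"), new Path("/tmp/pca/tmp"),
        numRows, numCols);
    centered.configure(new JobConf(conf));

    // times(other) computes this^T * other, so this is A^T A: the
    // covariance matrix up to the 1/(n-1) factor.
    DistributedRowMatrix cov = centered.times(centered);
    cov.configure(new JobConf(conf)); // not sure if times() wires this up already

    // Step 4: Lanczos on the (symmetric) covariance matrix.
    Matrix eigenVectors = new DenseMatrix(rank, cov.numCols());
    List<Double> eigenValues = new ArrayList<Double>();
    new DistributedLanczosSolver().solve(cov, rank, eigenVectors,
        eigenValues, true);
  }
}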
What do you guys think?

Thanks,
VC