Hello all,
  I am trying to implement PCA using some of the libraries from Mahout. I am
following the TODO list posted here:
https://issues.apache.org/jira/browse/MAHOUT-512 . I understand the idea
behind PCA conceptually, but I am rather new to both Hadoop and Mahout. Here
is what I think the workflow should look like. It would be awesome if
someone could pitch in on better ways of doing things.

Assuming the data is available as a text file with rows representing
measurements,

1. Have a dataCenteringDriver that calls an empiricalMeanGenerator driver.
This would compute the empirical mean. I have done this, but I unfortunately
have not found an elegant way to get one single output vector out of a
single map-reduce + combiner job. Currently, I am generating key-value
pairs where the key is the index into the vector and the value is the mean
for that dimension, and am saving them. I am planning on creating a
separate job to read this file and convert it into a single vector. I feel
there should be a more efficient way of doing this; please do chime in on
this (I have put one idea in the first sketch after this list).

2. Assuming I get a vector out of the empiricalMeanGenerator phase, I am
planning on using the VectorCache as a way of passing this vector on to the
job that takes the input matrix (now a DistributedRowMatrix) and centers
the data (second sketch below).

3. Now that I have the centered data, computing the covariance matrix
shouldn't be too hard if I have represented my matrix as a
DistributedRowMatrix. I can then use "times" to produce the covariance
matrix (third sketch below).

4. Last lap: use the DistributedLanczosSolver to produce the eigenvectors
(last sketch below).
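
For step 1, here is a rough sketch of how I think the mean could come out
of a single job: every mapper emits under one constant key, and an extra
trailing slot in the vector carries the row count so the combiner can
pre-aggregate safely. All the class names below are mine, and I am assuming
whitespace-delimited text input, so treat this as a sketch rather than
working code:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class EmpiricalMeanJob {

  // One constant key means a single reduce group sees every partial
  // sum and can emit exactly one mean vector.
  private static final IntWritable ONLY_KEY = new IntWritable(0);

  public static class RowMapper
      extends Mapper<LongWritable, Text, IntWritable, VectorWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] tokens = line.toString().trim().split("\\s+");
      // d measurement slots plus one trailing slot holding the count 1,
      // so sums and counts travel together through the combiner.
      Vector v = new DenseVector(tokens.length + 1);
      for (int i = 0; i < tokens.length; i++) {
        v.set(i, Double.parseDouble(tokens[i]));
      }
      v.set(tokens.length, 1.0);
      ctx.write(ONLY_KEY, new VectorWritable(v));
    }
  }

  // Combiner: pure vector addition, which is associative, so it is
  // safe to run zero or more times before the reducer.
  public static class SumCombiner
      extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<VectorWritable> vals,
        Context ctx) throws IOException, InterruptedException {
      Vector sum = null;
      for (VectorWritable vw : vals) {
        sum = (sum == null) ? vw.get().clone() : sum.plus(vw.get());
      }
      ctx.write(key, new VectorWritable(sum));
    }
  }

  // Reducer: add up the remaining partial sums, then divide the first
  // d slots by the count in the last slot to get the mean vector.
  public static class MeanReducer
      extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<VectorWritable> vals,
        Context ctx) throws IOException, InterruptedException {
      Vector sum = null;
      for (VectorWritable vw : vals) {
        sum = (sum == null) ? vw.get().clone() : sum.plus(vw.get());
      }
      int d = sum.size() - 1;
      Vector mean = sum.viewPart(0, d).divide(sum.get(d));
      ctx.write(key, new VectorWritable(mean));
    }
  }
}

With job.setCombinerClass(SumCombiner.class),
job.setReducerClass(MeanReducer.class) and a single reduce task, this
should write exactly one (key, vector) pair.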
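
For step 2, in case VectorCache does not work out, the fallback I had in
mind is to write the mean as a one-entry SequenceFile and read it back in
setup() of the centering mapper. The "mean.path" configuration key below is
something I made up:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class CenteringMapper
    extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

  private Vector mean;

  @Override
  protected void setup(Context ctx) throws IOException {
    // "mean.path" is a made-up config key pointing at the single
    // (IntWritable, VectorWritable) pair written by the mean job.
    Configuration conf = ctx.getConfiguration();
    Path meanPath = new Path(conf.get("mean.path"));
    SequenceFile.Reader reader =
        new SequenceFile.Reader(FileSystem.get(conf), meanPath, conf);
    try {
      IntWritable key = new IntWritable();
      VectorWritable value = new VectorWritable();
      if (reader.next(key, value)) {
        mean = value.get();
      }
    } finally {
      reader.close();
    }
  }

  @Override
  protected void map(IntWritable row, VectorWritable vw, Context ctx)
      throws IOException, InterruptedException {
    // Input rows are the (IntWritable, VectorWritable) pairs that
    // DistributedRowMatrix expects; subtract the mean from each row.
    ctx.write(row, new VectorWritable(vw.get().minus(mean)));
  }
}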
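
For step 3, my reading of the DistributedRowMatrix javadoc is that
times(other) computes this.transpose().times(other), since it is the rows
that are distributed. If that is right, it is convenient here: calling
times on the centered matrix with itself gives X^T X directly, with no
explicit transpose job. I have left out the 1/(n-1) scaling, which I think
can be folded into the eigenvalues afterwards. Please correct me if I have
the semantics of times wrong:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.math.hadoop.DistributedRowMatrix;

public class CovarianceStep {
  // Returns X^T X for the centered matrix X, assuming times(other)
  // computes this.transpose().times(other) as I believe it does.
  public static DistributedRowMatrix covariance(Configuration conf,
      Path centered, Path tmp, int numRows, int numCols) throws IOException {
    DistributedRowMatrix x =
        new DistributedRowMatrix(centered, tmp, numRows, numCols);
    x.setConf(conf);
    return x.times(x);
  }
}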
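
And for step 4, I was thinking of driving the existing solver the same way
its main() does. I am going from memory on the option names (they should
match the "mahout svd" command line), so treat these flags as approximate:

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver;

public class EigenStep {
  public static void main(String[] args) throws Exception {
    // Paths and sizes here are placeholders; --symmetric is set to
    // true because the covariance matrix is symmetric.
    ToolRunner.run(new DistributedLanczosSolver().job(), new String[] {
        "--input", "/pca/covariance",
        "--output", "/pca/eigenvectors",
        "--tempDir", "/pca/tmp",
        "--numRows", "1000",
        "--numCols", "1000",
        "--rank", "20",
        "--symmetric", "true"});
  }
}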

What do you guys think?

Thanks
VC
