On Wed, Apr 27, 2011 at 8:41 PM, Ted Dunning <[email protected]> wrote:

> On Wed, Apr 27, 2011 at 5:28 PM, Vckay <[email protected]> wrote:
>
> > Assuming the data is available as a text file with rows representing
> > measurements,
> >
>
>


> A org.apache.mahout.math.hadoop.DistributedRowMatrix is the traditional
> approach to this.
>
>
>


All right, I will do that.



> > 1. Have a dataCenteringDriver that calls a empiricalMeanGenerator driver.
> > This would compute the empirical mean. I have done this. I unfortunately
> > have not found an elegant way of using a Single Map Reduce + Combiner Job
> > of
> > finally generating one single output vector.
>
>
> Use a combiner and a single reducer.
>
>
> > Currently, I am generating key
> > value pairs where key refers to the index of the vector and value refers
> to
> > the mean for the dimension and am saving them.
>
>
> I think you should use a constant key and a value that is the original
> vector.  Then define combiner and reducer to sum the values and output a
> key
> and vector.  All of the combiner outputs will go to the same reducer which
> will output a single vector.
>
>

I am not completely sure I understand what you are saying here. (I am
probably missing something really obvious). Let me go over this once more:
So mapper transmits every vector with a constant key. The combiner takes in
all the vectors, adds them up and sends reducer the aggregated output as  a
<constant key, vector > . However, the problem that I can see is that the
reducer wont know the number of values that were aggregated by combiner
which is needed for it to compute the mean. I am talking about the issue
mentioned here: (
http://philippeadjiman.com/blog/2010/01/14/hadoop-tutorial-series-issue-4-to-use-or-not-to-use-a-combiner/
)


2. Assuming, I get a vector out of the empiricalMeanGenerator phase, I am
> > planning on using the VectorCache as a way of passing this vector onto
> the
> > job that takes the input Matrix (now distributedRowMatrix) to center the
> > data.
> >
>
> Sounds about right.
>
> But when did the input get converted?  I think it should be converted
> before
> starting.
>
>
>
> Yep. Sorry. I messed up the work flow.


> > 3. Now that I have the centered data, computing the covariance matrix
> > shouldn't be too hard if I have represented my matrix as a distributed
> row
> > matrix. I can then use "times" to produce the covariance matrix.
> >
>
> Actually, this is liable to be a disaster because the covariance matrix
> will
> be dense after you subtract the mean.
>
> If your data is dense to begin with, this is no big deal.  If your data is
> sparse, then this is a huge deal.
>
> A more subtle point is that computational cost of PCA on a dense matrix is
> something like k n^3 where k is the number of eigen-values you want and n
> is
> the size of the matrix.  If you can compute PCA for an input of size N on a
> single machine in time T, then you are going to have a horrendous scaling
> problem.  The issue is that by moving to network transfer instead of memory
> transfer, you are losing several orders of magnitude in speed.  That means
> that even to match the speed of the sequential algorithm will take quite a
> few machines.  If you increase the size of the problem by, say, 10 to make
> up for this, you now have a problem that will take a thousand times longer
> than before and you are still going to be in the hole.
>
> This sort of thing is why almost all really, really large scale data mining
> exercises work with sparse matrices.  That drops most algorithms to
> something like O(n) so that adding more horsepower does some real damage to
> the size of n and so that increasing n is relatively survivable.  In fact,
> O(n) algorithms are almost a defining property of scalable data-mining.
>
>

>
> Back to your algorithm, I have two questions:
>
> a) can you do the SVD of the original matrix rather than the eigen-value
> computation of the covariance?  I think that this is likely to be
> numerically better.
>
> b) is there some perturbation trick such that you can do to avoid the mean
> shift problem?  I know that you can deal with (A - \lambda I), but you have
> (A -  e m') where e is the vector with all ones.
>
>

This is an interesting issue: I briefly thought about it when I saw a thread
where you had mentioned this issue previously. However, the good thing is
that I don't probably have to worry about this in a first stage
implementation because I plan to test the algorithm on image data which is
pretty dense. I believe however there are ways of doing distributed PCA by
allowing each node to work on a chunk of the matrix separately. For example,
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4016296&tag=1 . I will
probably look at that once I get done with the first step  of wrapping my
head around the mahout library and how a simple PCA implementation can be
done using that.




> 4. Last lap use DistributedLancsos solver to produce the eigen vectors.
> >
> > What do you guys think?
> >
>
> The last step sounds good!
>

Reply via email to