Doing PCA on a 5000 x 5000 matrix is still an in-memory task.  That's
only 25M doubles, 200MB of memory.  Lots of techniques can run on a
matrix of that size.  You could do Lanczos with whatever rank you want
on it (but don't worry about distributed Lanczos).
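A rough sketch of why this fits in memory, in plain Java (this is illustrative code, not the Mahout API; the class name GramAccumulator and the toy rows are made up). It streams over the rows once and accumulates the d x d Gram matrix A^T A, which for d = 5000 is exactly the 200MB array mentioned above:

```java
import java.util.Arrays;

public class GramAccumulator {
    /** Accumulate A^T A for a tall matrix given row by row. */
    public static double[][] gram(Iterable<double[]> rows, int d) {
        double[][] g = new double[d][d];          // d*d doubles, 8 bytes each
        for (double[] row : rows) {
            for (int i = 0; i < d; i++) {
                for (int j = 0; j < d; j++) {
                    g[i][j] += row[i] * row[j];   // outer product of the row
                }
            }
        }
        return g;
    }

    public static void main(String[] args) {
        long bytes = 5000L * 5000L * 8L;          // memory for a 5000x5000 result
        System.out.println(bytes);                // prints 200000000, i.e. 200MB

        // Tiny made-up demo with d = 2.
        double[][] g = gram(Arrays.asList(new double[] {1, 2},
                                          new double[] {3, 4}), 2);
        System.out.println(g[0][0] + " " + g[0][1]);  // prints 10.0 14.0
    }
}
```

Once the Gram matrix is in memory, any in-core symmetric eigensolver can finish the PCA without a distributed solver.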

2011/6/23 <[email protected]>

> I will; if it works, I may have to make an m/r job for it. All the data we
> have will be tall and dense (let's say 5000 columns, with millions of
> rows). Now doing PCA on that will create a covariance matrix that is
> square and dense. Thanks again guys.
>
> -Trevor
>
> > Try the QR trick.  It is amazingly effective.
> >
> > 2011/6/23 <[email protected]>
> >
> >> Alright, thanks guys.
> >>
> >> > The cases where Lanczos or the stochastic projection helps are cases
> >> > where you have *many* columns but where the data are sparse.  If you
> >> > have a very tall, dense matrix, the QR method is much to be preferred.
> >> >
> >> > 2011/6/23 <[email protected]>
> >> >
> >> >> Ok, then what would you think to be the minimum number of columns
> >> >> in the dataset for Lanczos to give a reasonable result?
> >> >>
> >> >> Thanks,
> >> >> -Trevor
> >> >>
> >> >> > A gazillion rows of 2-columned data is really much better suited
> >> >> > to doing the following:
> >> >> >
> >> >> > if each row is of the form [a, b], then compute the matrix
> >> >> >
> >> >> > [[a*a, a*b], [a*b, b*b]]
> >> >> >
> >> >> > (the outer product of the vector with itself)
> >> >> >
> >> >> > Then take the matrix sum of all of these, from each row of your
> >> >> > input matrix.
> >> >> >
> >> >> > You'll now have a 2x2 matrix, which you can diagonalize by hand.
> >> >> > It will give you your eigenvalues, and also the right-singular
> >> >> > vectors of your original matrix.
> >> >> >
> >> >> >   -jake
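The recipe above can be sketched in plain Java (illustrative only; the class name TwoColumnPca and the sample rows are made up, and this is not Mahout code). It sums the 2x2 outer products of every row, giving A^T A, then diagonalizes the symmetric 2x2 result in closed form from its trace and determinant:

```java
public class TwoColumnPca {
    /** Returns {lambda1, lambda2, vx, vy}: the eigenvalues of A^T A
     *  (largest first) and the top right-singular vector of A. */
    public static double[] topEig(double[][] rows) {
        // Accumulate the sum of outer products [[a*a, a*b], [a*b, b*b]].
        double p = 0, q = 0, r = 0;           // A^T A = [[p, q], [q, r]]
        for (double[] row : rows) {
            double a = row[0], b = row[1];
            p += a * a;
            q += a * b;
            r += b * b;
        }
        // Eigenvalues of a symmetric 2x2 matrix via its trace and determinant.
        double trace = p + r;
        double det = p * r - q * q;
        double disc = Math.sqrt(trace * trace - 4 * det);
        double lambda1 = (trace + disc) / 2;  // largest eigenvalue
        double lambda2 = (trace - disc) / 2;
        // Eigenvector for lambda1: (q, lambda1 - p), normalized.
        // (Assumes q != 0; a diagonal A^T A would need a special case.)
        double vx = q, vy = lambda1 - p;
        double norm = Math.sqrt(vx * vx + vy * vy);
        return new double[] {lambda1, lambda2, vx / norm, vy / norm};
    }

    public static void main(String[] args) {
        double[][] rows = { {1, 2}, {3, 4}, {5, 6}, {7, 8} };  // made-up data
        double[] e = topEig(rows);
        System.out.println("eigenvalues: " + e[0] + ", " + e[1]);
        System.out.println("top right-singular vector: (" + e[2] + ", " + e[3] + ")");
    }
}
```

The same accumulation step is what an m/r job would do per row; only the tiny 2x2 diagonalization happens on a single machine.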
> >> >> >
> >> >> > 2011/6/23 <[email protected]>
> >> >> >
> >> >> >> Yes, exactly why I asked it for only 2 eigenvalues. So what is
> >> >> >> being said is: if I have, let's say, 50M rows of 2-columned data,
> >> >> >> Lanczos can't do anything with it (assuming it puts the 0
> >> >> >> eigenvalue in the mix - of the 2 eigenvectors, only 1 is returned
> >> >> >> because of the 0 eigenvalue taking up a slot)?
> >> >> >>
> >> >> >> If the eigenvalue of 0 is invalid, then should it not be filtered
> >> >> >> out so that it returns "rank" eigenvalues that could be valid?
> >> >> >>
> >> >> >> -Trevor
> >> >> >>
> >> >> >> > Ah, if your matrix only has 2 columns, you can't go to rank 10.
> >> >> >> > Try on some slightly less synthetic data of more than rank 10.
> >> >> >> > You can't ask Lanczos for more reduced rank than that of the
> >> >> >> > matrix itself.
> >> >> >> >
> >> >> >> >   -jake
> >> >> >> >
> >> >> >> > 2011/6/23 <[email protected]>
> >> >> >> >
> >> >> >> >> Alright, I can reorder - that is easy; I just had to verify
> >> >> >> >> that the ordering was correct. But when I increased the rank,
> >> >> >> >> I got Lanczos bailing out, which incidentally causes a
> >> >> >> >> NullPointerException:
> >> >> >> >> INFO: 9 passes through the corpus so far...
> >> >> >> >> WARNING: Lanczos parameters out of range: alpha = NaN,
> >> >> >> >> beta = NaN.  Bailing out early!
> >> >> >> >> INFO: Lanczos iteration complete - now to diagonalize the
> >> >> >> >> tri-diagonal auxiliary matrix.
> >> >> >> >> Exception in thread "main" java.lang.NullPointerException
> >> >> >> >>        at org.apache.mahout.math.DenseVector.assign(DenseVector.java:133)
> >> >> >> >>        at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:160)
> >> >> >> >>        at pca.PCASolver.solve(PCASolver.java:53)
> >> >> >> >>        at pca.PCA.main(PCA.java:20)
> >> >> >> >>
> >> >> >> >> So I should probably note that my data only has 2 columns;
> >> >> >> >> the real data will have quite a bit more.
> >> >> >> >>
> >> >> >> >> The failure happens with rank 10 and more, with the last, and
> >> >> >> >> therefore most significant, eigenvector being <NaN,NaN>.
> >> >> >> >>
> >> >> >> >> -Trevor
> >> >> >> >> > The 0 eigenvalue output is not valid, and yes, the output
> >> >> >> >> > will list the results in *increasing* order, even though it
> >> >> >> >> > is finding the largest eigenvalues/vectors first.
> >> >> >> >> >
> >> >> >> >> > Remember that convergence is gradual, so if you only ask
> >> >> >> >> > for 3 eigenvectors/values, you won't be very accurate.  If
> >> >> >> >> > you ask for 10 or more, the largest few will now be quite
> >> >> >> >> > good.  If you ask for 50, now the top 10-20 will be
> >> >> >> >> > *extremely* accurate, and maybe the top 30 will still be
> >> >> >> >> > quite good.
> >> >> >> >> >
> >> >> >> >> > Try out a non-distributed form of what is in the
> >> >> >> >> > EigenverificationJob to re-order the output and collect how
> >> >> >> >> > accurate your results are (it computes errors for you as
> >> >> >> >> > well).
> >> >> >> >> >
> >> >> >> >> >   -jake
> >> >> >> >> >
> >> >> >> >> > 2011/6/23 <[email protected]>
> >> >> >> >> >
> >> >> >> >> >> So, I know that MAHOUT-369 fixed a bug with the distributed
> >> >> >> >> >> version of the LanczosSolver, but I am experiencing a
> >> >> >> >> >> similar problem with the non-distributed version.
> >> >> >> >> >>
> >> >> >> >> >> I send a dataset of Gaussian-distributed numbers (testing
> >> >> >> >> >> PCA stuff) and my eigenvalues are seemingly reversed. Below
> >> >> >> >> >> I have the output given in the logs from LanczosSolver.
> >> >> >> >> >>
> >> >> >> >> >> Output:
> >> >> >> >> >> INFO: Eigenvector 0 found with eigenvalue 0.0
> >> >> >> >> >> INFO: Eigenvector 1 found with eigenvalue 347.8703086831804
> >> >> >> >> >> INFO: LanczosSolver finished.
> >> >> >> >> >>
> >> >> >> >> >> So it returns a vector with eigenvalue 0 before one with
> >> >> >> >> >> an eigenvalue of 347? What's more interesting is that when
> >> >> >> >> >> I increase the rank, I get a new eigenvector with a value
> >> >> >> >> >> between 0 and 347:
> >> >> >> >> >>
> >> >> >> >> >> INFO: Eigenvector 0 found with eigenvalue 0.0
> >> >> >> >> >> INFO: Eigenvector 1 found with eigenvalue 44.794928654801566
> >> >> >> >> >> INFO: Eigenvector 2 found with eigenvalue 347.8286920203704
> >> >> >> >> >>
> >> >> >> >> >> Shouldn't the eigenvalues be in descending order? Also, is
> >> >> >> >> >> the 0.0 eigenvalue even valid?
> >> >> >> >> >>
> >> >> >> >> >> Thanks,
> >> >> >> >> >> Trevor
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >
> >>
> >>
> >>
> >
>
>
>
