Yes, but with millions of rows an M/R job to create the covariance
matrix would be required, unless I am missing something.
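A sketch of what such a job might look like, in one pass: each mapper
emits row i of the outer product v v^T for every row vector v, and the
reducer sums those into row i of A^T A. This assumes rows are stored as
Mahout VectorWritables keyed by row id; class names and job wiring are
illustrative, and the rows would need to be mean-centered first for a
true covariance matrix.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;
    import org.apache.mahout.math.function.Functions;

    public final class GramMatrixJob {

      // For each input row v, emit v scaled by v[i] under key i: that is
      // row i of the outer product v v^T. Zero entries contribute a zero
      // row, so skipping them via iterateNonZero() is safe.
      public static class GramMapper
          extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {
        @Override
        protected void map(IntWritable rowId, VectorWritable row, Context ctx)
            throws IOException, InterruptedException {
          Vector v = row.get();
          Iterator<Vector.Element> it = v.iterateNonZero();
          while (it.hasNext()) {
            Vector.Element e = it.next();
            ctx.write(new IntWritable(e.index()),
                      new VectorWritable(v.times(e.get())));
          }
        }
      }

      // Sum the partial rows per key; this also works as a combiner.
      public static class GramReducer
          extends Reducer<IntWritable, VectorWritable, IntWritable, VectorWritable> {
        @Override
        protected void reduce(IntWritable i, Iterable<VectorWritable> partials,
            Context ctx) throws IOException, InterruptedException {
          Vector sum = null;
          for (VectorWritable w : partials) {
            if (sum == null) {
              // copy rather than alias: Hadoop reuses the writable object
              sum = new DenseVector(w.get().size());
            }
            sum.assign(w.get(), Functions.PLUS);
          }
          ctx.write(i, new VectorWritable(sum));
        }
      }
    }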

> Doing PCA on a 5000 x 5000 matrix is still an in-memory thing. That's
> only 25M doubles, 200MB of memory. Lots of techniques can run on that
> set. You could do Lanczos with whatever rank you want on it (but don't
> worry about distributed Lanczos).
>
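To make the arithmetic concrete: 5000 * 5000 * 8 bytes is exactly the
200MB above. Once the covariance matrix fits in memory, the
decomposition itself is a couple of lines; a hedged sketch, assuming
the JAMA-style SingularValueDecomposition in mahout-math (for a
symmetric positive semi-definite matrix, the singular values and right
singular vectors are exactly its eigenvalues and eigenvectors):

    import org.apache.mahout.math.DenseMatrix;
    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.SingularValueDecomposition;

    public class InMemoryPca {
      public static void main(String[] args) {
        // fill this from the output of the covariance M/R job
        Matrix cov = new DenseMatrix(5000, 5000);
        SingularValueDecomposition svd = new SingularValueDecomposition(cov);
        double[] eigenvalues = svd.getSingularValues(); // descending order
        Matrix components = svd.getV(); // principal directions, per column
      }
    }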
> 2011/6/23 <[email protected]>
>
>> I will, and if it works I may have to make an m/r job for it. All
>> the data we have will be tall and dense (let's say 5000 columns, with
>> millions of rows). Doing PCA on that will create a covariance matrix
>> that is square and dense. Thanks again, guys.
>>
>> -Trevor
>>
>> > Try the QR trick.  It is amazingly effective.
>> >
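The "QR trick" here is presumably the standard tall-and-skinny
factorization: A = QR, where R is only n x n, and the SVD of R shares
A's singular values and right singular vectors. A sketch on toy
dimensions, again assuming the JAMA-style classes in mahout-math:

    import org.apache.mahout.math.DenseMatrix;
    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.QRDecomposition;
    import org.apache.mahout.math.SingularValueDecomposition;

    public class QrTrick {
      public static void main(String[] args) {
        Matrix a = new DenseMatrix(100000, 50); // tall, dense; fill with data
        Matrix r = new QRDecomposition(a).getR(); // small: 50 x 50
        SingularValueDecomposition svd = new SingularValueDecomposition(r);
        double[] sigma = svd.getSingularValues(); // singular values of a
        Matrix v = svd.getV(); // right singular vectors of a
      }
    }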
>> > 2011/6/23 <[email protected]>
>> >
>> >> Alright, thanks guys.
>> >>
>> >> > The cases where Lanczos or the stochastic projection helps are
>> >> > cases where you have *many* columns but where the data are
>> >> > sparse. If you have a very tall dense matrix, the QR method is
>> >> > much to be preferred.
>> >> >
>> >> > 2011/6/23 <[email protected]>
>> >> >
>> >> >> Ok, then what would you think to be the minimum number of
>> >> >> columns in the dataset for Lanczos to give a reasonable result?
>> >> >>
>> >> >> Thanks,
>> >> >> -Trevor
>> >> >>
>> >> >> > A gazillion rows of 2-columned data is really much better
>> >> >> > suited to doing the following:
>> >> >> >
>> >> >> > if each row is of the form [a, b], then compute the matrix
>> >> >> >
>> >> >> > [[a*a, a*b], [a*b, b*b]]
>> >> >> >
>> >> >> > (the outer product of the vector with itself)
>> >> >> >
>> >> >> > Then take the matrix sum of all of these, from each row of
>> >> >> > your input matrix.
>> >> >> >
>> >> >> > You'll now have a 2x2 matrix, which you can diagonalize by
>> >> >> > hand. It will give you your eigenvalues, and also the
>> >> >> > right-singular vectors of your original matrix.
>> >> >> >
>> >> >> >   -jake
>> >> >> >
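For concreteness, a minimal plain-Java version of this trick: stream
over the rows, accumulate the three distinct entries of the symmetric
2x2 sum of outer products, then diagonalize it in closed form. The toy
rows array stands in for the real input.

    public class TwoColumnPca {
      public static void main(String[] args) {
        double[][] rows = {{1.0, 2.0}, {2.0, 4.1}, {3.0, 5.9}}; // stand-in
        double paa = 0.0, pab = 0.0, pbb = 0.0;
        for (double[] row : rows) { // sum of outer products = A^T A
          double a = row[0], b = row[1];
          paa += a * a;
          pab += a * b;
          pbb += b * b;
        }
        // closed-form eigenvalues of the symmetric [[paa, pab], [pab, pbb]]
        double mean = (paa + pbb) / 2.0;
        double d = Math.sqrt((paa - pbb) * (paa - pbb) / 4.0 + pab * pab);
        double lambda1 = mean + d; // largest eigenvalue
        double lambda2 = mean - d;
        // eigenvector for lambda1 (valid when pab != 0)
        double norm = Math.hypot(pab, lambda1 - paa);
        double[] v1 = {pab / norm, (lambda1 - paa) / norm};
        System.out.printf("lambda1=%f lambda2=%f v1=(%f, %f)%n",
            lambda1, lambda2, v1[0], v1[1]);
      }
    }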
>> >> >> > 2011/6/23 <[email protected]>
>> >> >> >
>> >> >> >> Yes, exactly why I asked it for only 2 eigenvalues. So what
>> >> >> >> is being said is that if I have, let's say, 50M rows of
>> >> >> >> 2-columned data, Lanczos can't do anything with it (assuming
>> >> >> >> it puts the 0 eigenvalue in the mix - of the 2 eigenvectors
>> >> >> >> only 1 is returned because the 0 eigenvalue takes up a slot)?
>> >> >> >>
>> >> >> >> If the eigenvalue of 0 is invalid, then should it not be
>> >> >> >> filtered out so that it returns "rank" number of eigenvalues
>> >> >> >> that could be valid?
>> >> >> >>
>> >> >> >> -Trevor
>> >> >> >>
>> >> >> >> > Ah, if your matrix only has 2 columns, you can't go to
>> >> >> >> > rank 10. Try it on some slightly less synthetic data of
>> >> >> >> > rank greater than 10. You can't ask Lanczos for a reduced
>> >> >> >> > rank greater than that of the matrix itself.
>> >> >> >> >
>> >> >> >> >   -jake
>> >> >> >> >
>> >> >> >> > 2011/6/23 <[email protected]>
>> >> >> >> >
>> >> >> >> >> Alright, I can reorder, that is easy; I just had to
>> >> >> >> >> verify that the ordering was correct. When I increased the
>> >> >> >> >> rank of the results I got Lanczos bailing out, which
>> >> >> >> >> incidentally causes a NullPointerException:
>> >> >> >> >>
>> >> >> >> >> INFO: 9 passes through the corpus so far...
>> >> >> >> >> WARNING: Lanczos parameters out of range: alpha = NaN,
>> >> >> >> >> beta = NaN. Bailing out early!
>> >> >> >> >> INFO: Lanczos iteration complete - now to diagonalize the
>> >> >> >> >> tri-diagonal auxiliary matrix.
>> >> >> >> >> Exception in thread "main" java.lang.NullPointerException
>> >> >> >> >>   at org.apache.mahout.math.DenseVector.assign(DenseVector.java:133)
>> >> >> >> >>   at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:160)
>> >> >> >> >>   at pca.PCASolver.solve(PCASolver.java:53)
>> >> >> >> >>   at pca.PCA.main(PCA.java:20)
>> >> >> >> >>
>> >> >> >> >> I should probably note that my data only has 2 columns;
>> >> >> >> >> the real data will have quite a bit more.
>> >> >> >> >>
>> >> >> >> >> The failure happens with a rank of 10 or more, with the
>> >> >> >> >> last, and therefore most significant, eigenvector being
>> >> >> >> >> <NaN,NaN>.
>> >> >> >> >>
>> >> >> >> >> -Trevor
>> >> >> >> >>
>> >> >> >> >> > The 0 eigenvalue output is not valid, and yes, the
>> >> >> >> >> > output will list the results in *increasing* order, even
>> >> >> >> >> > though it is finding the largest eigenvalues/vectors
>> >> >> >> >> > first.
>> >> >> >> >> >
>> >> >> >> >> > Remember that convergence is gradual, so if you only ask
>> >> >> >> >> > for 3 eigenvectors/values, you won't be very accurate.
>> >> >> >> >> > If you ask for 10 or more, the largest few will now be
>> >> >> >> >> > quite good. If you ask for 50, the top 10-20 will be
>> >> >> >> >> > *extremely* accurate, and maybe the top 30 will still be
>> >> >> >> >> > quite good.
>> >> >> >> >> >
>> >> >> >> >> > Try out a non-distributed form of what is in the
>> >> >> >> >> > EigenVerificationJob to re-order the output and check
>> >> >> >> >> > how accurate your results are (it computes errors for
>> >> >> >> >> > you as well).
>> >> >> >> >> >
>> >> >> >> >> >   -jake
>> >> >> >> >> >
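A non-distributed re-ordering step can be as small as the helper below:
a sketch, assuming the solver's output has been copied into parallel
lists of eigenvalues and eigenvectors, which sorts the pairs in
descending order and drops the spurious 0.0 entry (unlike
EigenVerificationJob, it computes no error estimates).

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    import org.apache.mahout.math.Vector;

    public class EigenReorder {
      public static final class EigenPair {
        public final double value;
        public final Vector vector;
        EigenPair(double value, Vector vector) {
          this.value = value;
          this.vector = vector;
        }
      }

      public static List<EigenPair> reorder(List<Double> values,
                                            List<Vector> vectors) {
        List<EigenPair> pairs = new ArrayList<EigenPair>();
        for (int i = 0; i < values.size(); i++) {
          if (values.get(i) > 0.0) { // skip the invalid 0.0 slot
            pairs.add(new EigenPair(values.get(i), vectors.get(i)));
          }
        }
        Collections.sort(pairs, new Comparator<EigenPair>() {
          @Override
          public int compare(EigenPair a, EigenPair b) {
            return Double.compare(b.value, a.value); // largest first
          }
        });
        return pairs;
      }
    }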
>> >> >> >> >> > 2011/6/23 <[email protected]>
>> >> >> >> >> >
>> >> >> >> >> >> So, I know that MAHOUT-369 fixed a bug with the
>> >> >> >> >> >> distributed version of the LanczosSolver, but I am
>> >> >> >> >> >> experiencing a similar problem with the non-distributed
>> >> >> >> >> >> version.
>> >> >> >> >> >>
>> >> >> >> >> >> I send in a dataset of Gaussian-distributed numbers
>> >> >> >> >> >> (testing PCA stuff) and my eigenvalues are seemingly
>> >> >> >> >> >> reversed. Below is the output given in the logs from
>> >> >> >> >> >> LanczosSolver.
>> >> >> >> >> >>
>> >> >> >> >> >> Output:
>> >> >> >> >> >> INFO: Eigenvector 0 found with eigenvalue 0.0
>> >> >> >> >> >> INFO: Eigenvector 1 found with eigenvalue 347.8703086831804
>> >> >> >> >> >> INFO: LanczosSolver finished.
>> >> >> >> >> >>
>> >> >> >> >> >> So it returns a vector with eigenvalue 0 before one
>> >> >> >> >> >> with an eigenvalue of 347? What's more interesting is
>> >> >> >> >> >> that when I increase the rank, I get a new eigenvector
>> >> >> >> >> >> with a value between 0 and 347:
>> >> >> >> >> >>
>> >> >> >> >> >> INFO: Eigenvector 0 found with eigenvalue 0.0
>> >> >> >> >> >> INFO: Eigenvector 1 found with eigenvalue 44.794928654801566
>> >> >> >> >> >> INFO: Eigenvector 2 found with eigenvalue 347.8286920203704
>> >> >> >> >> >>
>> >> >> >> >> >> Shouldn't the eigenvalues be in descending order? Also,
>> >> >> >> >> >> is the 0.0 eigenvalue even valid?
>> >> >> >> >> >>
>> >> >> >> >> >> Thanks,
>> >> >> >> >> >> Trevor

