The cases where Lanczos or the stochastic projection helps are cases where
you have *many* columns but where the data are sparse.  If you have a very
tall dense matrix, the QR method is much preferred.
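
For the two-column case in the quoted thread below, the outer-product approach
jake describes could be sketched roughly as follows.  This is only an
illustration, not Mahout code: the class name TwoColumnGram and its methods are
hypothetical, the input is assumed to fit in a plain double[][] with exactly
two columns, and the rows are assumed to be mean-centered already if the goal
is PCA.  The 2x2 sum of outer products is A^T A, so its eigenvalues are the
squared singular values of the input and its eigenvectors are the
right-singular vectors.

```java
// Hypothetical helper, not part of Mahout: accumulates the 2x2 matrix A^T A
// from rows of the form [a, b] and diagonalizes it in closed form.
final class TwoColumnGram {

    /** Sums the outer products of each row with itself, giving A^T A. */
    static double[][] gram(double[][] rows) {
        double saa = 0.0, sab = 0.0, sbb = 0.0;
        for (double[] r : rows) {
            saa += r[0] * r[0];
            sab += r[0] * r[1];
            sbb += r[1] * r[1];
        }
        return new double[][] {{saa, sab}, {sab, sbb}};
    }

    /** Closed-form eigenvalues of a symmetric 2x2 matrix, largest first. */
    static double[] eigenvalues(double[][] g) {
        double mean = (g[0][0] + g[1][1]) / 2.0;
        double halfDiff = (g[0][0] - g[1][1]) / 2.0;
        double radius = Math.hypot(halfDiff, g[0][1]);
        return new double[] {mean + radius, mean - radius};
    }

    /** Unit eigenvector of the symmetric 2x2 matrix g for eigenvalue lambda. */
    static double[] eigenvector(double[][] g, double lambda) {
        // Each row of (g - lambda*I) yields a candidate solution; use the
        // one with the larger norm to avoid dividing by a tiny number.
        double x1 = g[0][1], y1 = lambda - g[0][0];
        double x2 = lambda - g[1][1], y2 = g[0][1];
        double n1 = Math.hypot(x1, y1);
        double n2 = Math.hypot(x2, y2);
        if (n1 == 0.0 && n2 == 0.0) {
            // g is a multiple of the identity, so any unit vector works.
            return new double[] {1.0, 0.0};
        }
        return n1 >= n2
            ? new double[] {x1 / n1, y1 / n1}
            : new double[] {x2 / n2, y2 / n2};
    }

    public static void main(String[] args) {
        double[][] rows = {{3.0, 0.0}, {0.0, 4.0}, {1.0, 1.0}};
        double[][] g = gram(rows);
        for (double lambda : eigenvalues(g)) {
            double[] v = eigenvector(g, lambda);
            System.out.println("eigenvalue " + lambda
                + " singular value " + Math.sqrt(Math.max(lambda, 0.0))
                + " vector [" + v[0] + ", " + v[1] + "]");
        }
    }
}
```

For 50M rows the same accumulation could be done in a single streaming pass,
since only three running sums are kept.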

2011/6/23 <[email protected]>

> Ok, then what would you think to be the minimum number of columns in the
> dataset for Lanczos to give a reasonable result?
>
> Thanks,
> -Trevor
>
> > A gazillion rows of 2-columned data is really much better suited to doing
> > the following:
> >
> > if each row is of the form [a, b], then compute the matrix
> >
> > [[a*a, a*b], [a*b, b*b]]
> >
> > (the outer product of the vector with itself)
> >
> > Then take the matrix sum of all of these, from each row of your input
> > matrix.
> >
> > You'll now have a 2x2 matrix, which you can diagonalize by hand.  It will
> > give you your eigenvalues, and also the right-singular vectors of your
> > original matrix.
> >
> >   -jake
> >
> > 2011/6/23 <[email protected]>
> >
> >> Yes, exactly why I asked it for only 2 eigenvalues. So what is being
> >> said is: if I have, let's say, 50M rows of 2-column data, Lanczos can't
> >> do anything with it (assuming it puts the 0 eigenvalue in the mix; of the
> >> 2 eigenvectors only 1 is returned because of the 0 eigenvalue taking up
> >> a slot)?
> >>
> >> If the eigenvalue of 0 is invalid, then should it not be filtered out so
> >> that it returns "rank" number of eigenvalues that could be valid?
> >>
> >> -Trevor
> >>
> >> > Ah, if your matrix only has 2 columns, you can't go to rank 10.  Try on
> >> > some slightly less synthetic data of more than rank 10.  You can't
> >> > ask Lanczos for more reduced rank than that of the matrix itself.
> >> >
> >> >   -jake
> >> >
> >> > 2011/6/23 <[email protected]>
> >> >
> >> >> Alright, I can reorder; that is easy. I just had to verify that the
> >> >> ordering was correct. So when I increased the rank of the results, I get
> >> >> Lanczos bailing out, which incidentally causes a NullPointerException:
> >> >>
> >> >> INFO: 9 passes through the corpus so far...
> >> >> WARNING: Lanczos parameters out of range: alpha = NaN, beta = NaN. Bailing out early!
> >> >> INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal auxiliary matrix.
> >> >> Exception in thread "main" java.lang.NullPointerException
> >> >>        at org.apache.mahout.math.DenseVector.assign(DenseVector.java:133)
> >> >>        at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:160)
> >> >>        at pca.PCASolver.solve(PCASolver.java:53)
> >> >>        at pca.PCA.main(PCA.java:20)
> >> >>
> >> >> So I should probably note that my data only has 2 columns; the real
> >> >> data will have quite a bit more.
> >> >>
> >> >> The failure happens with a rank of 10 or more, with the last, and
> >> >> therefore most significant, eigenvector being <NaN,NaN>.
> >> >>
> >> >> -Trevor
> >> >> > The 0 eigenvalue output is not valid, and yes, the output will list the
> >> >> > results in *increasing* order, even though it is finding the largest
> >> >> > eigenvalues/vectors first.
> >> >> >
> >> >> > Remember that convergence is gradual, so if you only ask for 3
> >> >> > eigenvectors/values, you won't be very accurate.  If you ask for 10 or
> >> >> > more, the largest few will now be quite good.  If you ask for 50, now
> >> >> > the top 10-20 will be *extremely* accurate, and maybe the top 30 will
> >> >> > still be quite good.
> >> >> >
> >> >> > Try out a non-distributed form of what is in the EigenverificationJob to
> >> >> > re-order the output and collect how accurate your results are (it
> >> >> > computes errors for you as well).
> >> >> >
> >> >> >   -jake
> >> >> >
> >> >> > 2011/6/23 <[email protected]>
> >> >> >
> >> >> >> So, I know that MAHOUT-369 fixed a bug with the distributed version of
> >> >> >> the LanczosSolver, but I am experiencing a similar problem with the
> >> >> >> non-distributed version.
> >> >> >>
> >> >> >> I send a dataset of Gaussian-distributed numbers (testing PCA stuff) and
> >> >> >> my eigenvalues are seemingly reversed. Below I have the output given in
> >> >> >> the logs from LanczosSolver.
> >> >> >>
> >> >> >> Output:
> >> >> >> INFO: Eigenvector 0 found with eigenvalue 0.0
> >> >> >> INFO: Eigenvector 1 found with eigenvalue 347.8703086831804
> >> >> >> INFO: LanczosSolver finished.
> >> >> >>
> >> >> >> So it returns a vector with eigenvalue 0 before one with an eigenvalue
> >> >> >> of 347? What's more interesting is that when I increase the rank, I get
> >> >> >> a new eigenvector with a value between 0 and 347:
> >> >> >>
> >> >> >> INFO: Eigenvector 0 found with eigenvalue 0.0
> >> >> >> INFO: Eigenvector 1 found with eigenvalue 44.794928654801566
> >> >> >> INFO: Eigenvector 2 found with eigenvalue 347.8286920203704
> >> >> >>
> >> >> >> Shouldn't the eigenvalues be in descending order? Also is the 0.0
> >> >> >> eigenvalue even valid?
> >> >> >>
> >> >> >> Thanks,
> >> >> >> Trevor
