This method (summing outer products, quoted below) isn't usually as numerically stable as, for instance, using a QR decomposition. If your original data matrix A is n x 2, then A = QR, where Q is n x 2 and R is 2 x 2. R is trivial to decompose into U S V', and since Q has orthonormal columns, the singular values and right singular vectors of R are the same as those of A, which was your original goal. If you want the left singular vectors, use QU.
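A minimal sketch of the QR route for the 2-column case, in plain Python rather than Mahout. The function names are made up for illustration, the factorization uses modified Gram-Schmidt (a Householder QR would be the more robust choice), and the whole matrix is assumed to fit in memory. It returns only the singular values of R, which are also the singular values of the original matrix.

```python
import math


def qr_r_two_columns(rows):
    """Return (r00, r01, r11), the upper-triangular R of a thin QR of an n x 2 matrix.

    Hypothetical helper: modified Gram-Schmidt on the two columns.
    Assumes the first column is not all zeros.
    """
    col0 = [row[0] for row in rows]
    col1 = [row[1] for row in rows]
    r00 = math.sqrt(sum(x * x for x in col0))
    if r00 == 0.0:
        raise ValueError("first column is all zeros")
    q0 = [x / r00 for x in col0]
    # Project the second column onto q0, then take the norm of what is left.
    r01 = sum(q * x for q, x in zip(q0, col1))
    residual = [x - r01 * q for x, q in zip(col1, q0)]
    r11 = math.sqrt(sum(x * x for x in residual))
    return r00, r01, r11


def singular_values_via_qr(rows):
    """Singular values (largest first) of an n x 2 matrix, computed from its R factor."""
    a, b, d = qr_r_two_columns(rows)
    # For R = [[a, b], [0, d]] with a, d >= 0:
    #   (s1 + s2)^2 = (a + d)^2 + b^2 and (s1 - s2)^2 = (a - d)^2 + b^2.
    s1 = 0.5 * (math.hypot(a + d, b) + math.hypot(a - d, b))
    # s1 * s2 = |det R| = a * d, which avoids a second subtraction.
    s2 = (a * d) / s1
    return s1, s2
```

Squaring these singular values gives the eigenvalues that the outer-product sum would produce, but without ever forming the squared quantities from the raw data.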
On Thu, Jun 23, 2011 at 11:03 AM, Jake Mannix <[email protected]> wrote:

> A gazillion rows of 2-columned data is really much better suited to doing
> the following:
>
> if each row is of the form [a, b], then compute the matrix
>
> [[a*a, a*b], [a*b, b*b]]
>
> (the outer product of the vector with itself)
>
> Then take the matrix sum of all of these, from each row of your input
> matrix.
>
> You'll now have a 2x2 matrix, which you can diagonalize by hand. It will
> give you your eigenvalues, and also the right-singular vectors of your
> original matrix.
>
> -jake
>
> 2011/6/23 <[email protected]>
>
> > Yes, exactly why I asked it for only 2 eigenvalues. So what is being said,
> > is if I have lets say 50M rows of 2 columned data, Lanczos can't do
> > anything with it (assuming it puts the 0 eigenvalue in the mix - of the 2
> > eigenvectors only 1 is returned because of the 0 eigenvalue taking up a
> > slot)?
> >
> > If the eigenvalue of 0 is invalid, then should it not be filtered out so
> > that it returns "rank" number of eigenvalues that could be valid?
> >
> > -Trevor
> >
> > > Ah, if your matrix only has 2 columns, you can't go to rank 10. Try on
> > > some slightly less synthetic data of more than rank 10. You can't
> > > ask Lanczos for more reduced rank than that of the matrix itself.
> > >
> > > -jake
> > >
> > > 2011/6/23 <[email protected]>
> > >
> > >> Alright I can reorder that is easy, just had to verify that the ordering
> > >> was correct. So when I increased the rank of the results I get Lanczos
> > >> bailing out. Which incidentally causes a NullPointerException:
> > >>
> > >> INFO: 9 passes through the corpus so far...
> > >> WARNING: Lanczos parameters out of range: alpha = NaN, beta = NaN. Bailing out early!
> > >> INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal auxiliary matrix.
> > >> Exception in thread "main" java.lang.NullPointerException
> > >>     at org.apache.mahout.math.DenseVector.assign(DenseVector.java:133)
> > >>     at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:160)
> > >>     at pca.PCASolver.solve(PCASolver.java:53)
> > >>     at pca.PCA.main(PCA.java:20)
> > >>
> > >> So I should probably note that my data only has 2 columns, the real data
> > >> will have quite a bit more.
> > >>
> > >> The failing happens with 10 and more for rank, with the last, and
> > >> therefore most significant eigenvector being <NaN,NaN>.
> > >>
> > >> -Trevor
> > >>
> > >> > The 0 eigenvalue output is not valid, and yes, the output will list the
> > >> > results in *increasing* order, even though it is finding the largest
> > >> > eigenvalues/vectors first.
> > >> >
> > >> > Remember that convergence is gradual, so if you only ask for 3
> > >> > eigevectors/values, you won't be very accurate. If you ask for 10 or
> > >> > more, the largest few will now be quite good. If you ask for 50, now
> > >> > the top 10-20 will be *extremely* accurate, and maybe the top 30 will
> > >> > still be quite good.
> > >> >
> > >> > Try out a non-distributed form of what is in the EigenverificationJob
> > >> > to re-order the output and collect how accurate your results are (it
> > >> > computes errors for you as well).
> > >> >
> > >> > -jake
> > >> >
> > >> > 2011/6/23 <[email protected]>
> > >> >
> > >> >> So, I know that MAHOUT-369 fixed a bug with the distributed version of
> > >> >> the LanczosSolver but I am experiencing a similar problem with the
> > >> >> non-distributed version.
> > >> >>
> > >> >> I send a dataset of gaussian distributed numbers (testing PCA stuff)
> > >> >> and my eigenvalues are seemingly reversed. Below I have the output
> > >> >> given in the logs from LanczosSolver.
> > >> >>
> > >> >> Output:
> > >> >> INFO: Eigenvector 0 found with eigenvalue 0.0
> > >> >> INFO: Eigenvector 1 found with eigenvalue 347.8703086831804
> > >> >> INFO: LanczosSolver finished.
> > >> >>
> > >> >> So it returns a vector with eigenvalue 0 before one with an eigenvalue
> > >> >> of 347?. Whats more interesting is that when I increase the rank, I
> > >> >> get a new eigenvector with a value between 0 and 347:
> > >> >>
> > >> >> INFO: Eigenvector 0 found with eigenvalue 0.0
> > >> >> INFO: Eigenvector 1 found with eigenvalue 44.794928654801566
> > >> >> INFO: Eigenvector 2 found with eigenvalue 347.8286920203704
> > >> >>
> > >> >> Shouldn't the eigenvalues be in descending order? Also is the 0.0
> > >> >> eigenvalue even valid?
> > >> >>
> > >> >> Thanks,
> > >> >> Trevor
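For reference, the outer-product approach Jake describes in the quoted thread can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Mahout code: the function names are hypothetical, and the 2x2 symmetric matrix is diagonalized with the usual closed-form rotation angle. The rows are consumed in a single streaming pass, so the input never needs to fit in memory.

```python
import math


def gram_2x2(rows):
    """Sum the outer products [a, b]^T [a, b] over all rows in one pass.

    Returns the three distinct entries (saa, sab, sbb) of the symmetric
    matrix [[saa, sab], [sab, sbb]]. Hypothetical helper name.
    """
    saa = sab = sbb = 0.0
    for a, b in rows:
        saa += a * a
        sab += a * b
        sbb += b * b
    return saa, sab, sbb


def eig_sym_2x2(saa, sab, sbb):
    """Eigenvalues (largest first) and unit eigenvectors of [[saa, sab], [sab, sbb]]."""
    mean = 0.5 * (saa + sbb)
    radius = math.hypot(0.5 * (saa - sbb), sab)
    lam1, lam2 = mean + radius, mean - radius
    # Angle of the principal axis; (cos, sin) pairs with the larger eigenvalue.
    theta = 0.5 * math.atan2(2.0 * sab, saa - sbb)
    v1 = (math.cos(theta), math.sin(theta))
    v2 = (-math.sin(theta), math.cos(theta))
    return (lam1, lam2), (v1, v2)
```

The eigenvectors returned here are the right singular vectors of the original n x 2 matrix, and the square roots of the eigenvalues are its singular values. Because the smaller eigenvalue is computed by subtraction, it can lose accuracy when the two eigenvalues differ by many orders of magnitude.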
