Yes, that is exactly why I asked it for only 2 eigenvalues. So what is being said is that if I have, let's say, 50M rows of 2-column data, Lanczos can't do anything with it (assuming it puts the 0 eigenvalue in the mix: of the 2 eigenvectors, only 1 is returned because the 0 eigenvalue takes up a slot)?
If the eigenvalue of 0 is invalid, then should it not be filtered out so that it returns "rank" number of eigenvalues that could be valid?

-Trevor

> Ah, if your matrix only has 2 columns, you can't go to rank 10. Try on
> some slightly less synthetic data of more than rank 10. You can't
> ask Lanczos for more reduced rank than that of the matrix itself.
>
> -jake
>
> 2011/6/23 <[email protected]>
>
>> Alright, I can reorder, that is easy; I just had to verify that the
>> ordering was correct. So when I increased the rank of the results I get
>> Lanczos bailing out, which incidentally causes a NullPointerException:
>>
>> INFO: 9 passes through the corpus so far...
>> WARNING: Lanczos parameters out of range: alpha = NaN, beta = NaN.
>> Bailing out early!
>> INFO: Lanczos iteration complete - now to diagonalize the tri-diagonal
>> auxiliary matrix.
>> Exception in thread "main" java.lang.NullPointerException
>>     at org.apache.mahout.math.DenseVector.assign(DenseVector.java:133)
>>     at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:160)
>>     at pca.PCASolver.solve(PCASolver.java:53)
>>     at pca.PCA.main(PCA.java:20)
>>
>> So I should probably note that my data only has 2 columns; the real data
>> will have quite a bit more.
>>
>> The failure happens with a rank of 10 or more, with the last, and
>> therefore most significant, eigenvector being <NaN,NaN>.
>>
>> -Trevor
>>
>> > The 0 eigenvalue output is not valid, and yes, the output will list the
>> > results in *increasing* order, even though it is finding the largest
>> > eigenvalues/vectors first.
>> >
>> > Remember that convergence is gradual, so if you only ask for 3
>> > eigenvectors/values, you won't be very accurate. If you ask for 10 or
>> > more, the largest few will now be quite good. If you ask for 50, now the
>> > top 10-20 will be *extremely* accurate, and maybe the top 30 will still
>> > be quite good.
>> >
>> > Try out a non-distributed form of what is in the EigenverificationJob to
>> > re-order the output and collect how accurate your results are (it
>> > computes errors for you as well).
>> >
>> > -jake
>> >
>> > 2011/6/23 <[email protected]>
>> >
>> >> So, I know that MAHOUT-369 fixed a bug with the distributed version of
>> >> the LanczosSolver, but I am experiencing a similar problem with the
>> >> non-distributed version.
>> >>
>> >> I send a dataset of Gaussian-distributed numbers (testing PCA stuff)
>> >> and my eigenvalues are seemingly reversed. Below I have the output
>> >> given in the logs from LanczosSolver.
>> >>
>> >> Output:
>> >> INFO: Eigenvector 0 found with eigenvalue 0.0
>> >> INFO: Eigenvector 1 found with eigenvalue 347.8703086831804
>> >> INFO: LanczosSolver finished.
>> >>
>> >> So it returns a vector with eigenvalue 0 before one with an eigenvalue
>> >> of 347? What's more interesting is that when I increase the rank, I get
>> >> a new eigenvector with a value between 0 and 347:
>> >>
>> >> INFO: Eigenvector 0 found with eigenvalue 0.0
>> >> INFO: Eigenvector 1 found with eigenvalue 44.794928654801566
>> >> INFO: Eigenvector 2 found with eigenvalue 347.8286920203704
>> >>
>> >> Shouldn't the eigenvalues be in descending order? Also, is the 0.0
>> >> eigenvalue even valid?
>> >>
>> >> Thanks,
>> >> Trevor
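
For anyone following the thread, here is a rough sketch of the post-processing discussed above: drop the invalid 0.0 (or NaN) eigenvalues, re-order the rest in descending order, and cap the requested rank at what the matrix can support. It is plain Java and does not call Mahout at all. The class name, the method names, and the EPSILON tolerance are all made up for illustration, and it assumes the eigenvalues have already been copied out of the solver into a double[].

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Mahout: cleans up a list of eigenvalues
// such as the ones LanczosSolver reports in its log output.
public class EigenvalueCleanup {

    // Assumed tolerance below which an eigenvalue is treated as the spurious 0.
    static final double EPSILON = 1e-9;

    // Drops NaN and near-zero eigenvalues, then sorts the rest largest-first.
    static List<Double> cleanAndSort(double[] eigenvalues) {
        List<Double> kept = new ArrayList<>();
        for (double value : eigenvalues) {
            if (!Double.isNaN(value) && Math.abs(value) > EPSILON) {
                kept.add(value);
            }
        }
        kept.sort(Collections.reverseOrder());
        return kept;
    }

    // A matrix's rank cannot exceed min(rows, columns), so a requested rank
    // above that bound is capped here instead of being passed to a solver.
    static int capRank(int requestedRank, int numRows, int numCols) {
        return Math.min(requestedRank, Math.min(numRows, numCols));
    }

    public static void main(String[] args) {
        // Eigenvalues as logged in the original post, in increasing order.
        double[] reported = {0.0, 44.794928654801566, 347.8286920203704};
        System.out.println("Cleaned, descending: " + cleanAndSort(reported));

        // 50M rows of 2-column data with a requested rank of 10.
        System.out.println("Usable rank bound: " + capRank(10, 50_000_000, 2));
    }
}
```

Note that capRank only gives an upper bound. If the solver spends one slot on the 0 eigenvalue, as it appears to in the logs above, the number of usable eigenvectors may be smaller still.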
