Try the QR trick. It is amazingly effective.

2011/6/23 <[email protected]>:
> Alright, thanks guys.

jake wrote:

> The cases where Lanczos or the stochastic projection helps are cases
> where you have *many* columns but where the data are sparse. If you
> have a very tall, dense matrix, the QR method is much preferred.

Trevor wrote:

> Ok, then what would you think is the minimum number of columns in the
> dataset for Lanczos to give a reasonable result?
>
> Thanks,
> -Trevor

jake wrote:

> A gazillion rows of 2-column data is really much better suited to the
> following: if each row is of the form [a, b], compute the matrix
>
>     [[a*a, a*b], [a*b, b*b]]
>
> (the outer product of the row vector with itself), then take the matrix
> sum of these over every row of your input matrix. You'll now have a 2x2
> matrix, which you can diagonalize by hand. It will give you your
> eigenvalues, and also the right-singular vectors of your original
> matrix.
>
> -jake

Trevor wrote:

> Yes, that is exactly why I asked it for only 2 eigenvalues. So what is
> being said is: if I have, say, 50M rows of 2-column data, Lanczos can't
> do anything with it (assuming it puts the 0 eigenvalue in the mix, so
> that of the 2 eigenvectors only 1 is returned, the 0 eigenvalue taking
> up a slot)?
>
> If the eigenvalue of 0 is invalid, then shouldn't it be filtered out,
> so that "rank" potentially valid eigenvalues are returned?
>
> -Trevor

jake wrote:

> Ah, if your matrix only has 2 columns, you can't go to rank 10. Try
> some slightly less synthetic data of rank greater than 10. You can't
> ask Lanczos for a reduced rank greater than the rank of the matrix
> itself.
>
> -jake
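The 2x2 outer-product trick jake describes above can be sketched in plain Java. The `TwoColumnPca` class and its method names are hypothetical, not part of Mahout: it accumulates the Gram matrix AᵀA one row at a time, then diagonalizes the resulting symmetric 2x2 matrix in closed form via its trace and determinant.

```java
// Hypothetical sketch of the outer-product trick for 2-column data:
// sum the per-row outer products [[a*a, a*b], [a*b, b*b]] to get A^T A,
// then diagonalize the symmetric 2x2 result in closed form.
public class TwoColumnPca {

    /** Accumulates A^T A over all rows; returns {saa, sab, sbb}. */
    public static double[] gram(double[][] rows) {
        double saa = 0, sab = 0, sbb = 0;
        for (double[] r : rows) {
            saa += r[0] * r[0];
            sab += r[0] * r[1];
            sbb += r[1] * r[1];
        }
        return new double[] {saa, sab, sbb};
    }

    /** Eigenvalues of the symmetric matrix [[saa, sab], [sab, sbb]], largest first. */
    public static double[] eigenvalues(double saa, double sab, double sbb) {
        double trace = saa + sbb;
        double det = saa * sbb - sab * sab;
        double disc = Math.sqrt(trace * trace - 4.0 * det);
        return new double[] {(trace + disc) / 2.0, (trace - disc) / 2.0};
    }

    public static void main(String[] args) {
        double[][] rows = {{1, 0}, {0, 1}, {1, 1}};
        double[] g = gram(rows);                       // {2, 1, 2}
        double[] eig = eigenvalues(g[0], g[1], g[2]);  // {3.0, 1.0}
        System.out.println(eig[0] + " " + eig[1]);
    }
}
```

For the three rows {1,0}, {0,1}, {1,1} the accumulated Gram matrix is [[2,1],[1,2]], whose eigenvalues are 3 and 1; this single pass over the rows is all the "eigensolving" a 2-column dataset needs, however many rows it has.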
Trevor wrote:

> Alright, I can reorder; that is easy. I just had to verify that the
> ordering was correct. But when I increased the rank of the results, I
> got Lanczos bailing out, which incidentally causes a
> NullPointerException:
>
>     INFO: 9 passes through the corpus so far...
>     WARNING: Lanczos parameters out of range: alpha = NaN, beta = NaN.
>     Bailing out early!
>     INFO: Lanczos iteration complete - now to diagonalize the
>     tri-diagonal auxiliary matrix.
>     Exception in thread "main" java.lang.NullPointerException
>         at org.apache.mahout.math.DenseVector.assign(DenseVector.java:133)
>         at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:160)
>         at pca.PCASolver.solve(PCASolver.java:53)
>         at pca.PCA.main(PCA.java:20)
>
> I should probably note that my data only has 2 columns; the real data
> will have quite a few more.
>
> The failure happens with a rank of 10 or more, with the last, and
> therefore most significant, eigenvector being <NaN,NaN>.
>
> -Trevor

jake wrote:

> The 0 eigenvalue output is not valid, and yes, the output lists the
> results in *increasing* order, even though it finds the largest
> eigenvalues/vectors first.
>
> Remember that convergence is gradual, so if you only ask for 3
> eigenvectors/values, you won't be very accurate. If you ask for 10 or
> more, the largest few will be quite good. If you ask for 50, the top
> 10-20 will be *extremely* accurate, and maybe the top 30 will still be
> quite good.
>
> Try a non-distributed form of what is in the EigenverificationJob to
> re-order the output and collect how accurate your results are (it
> computes the errors for you as well).
>
> -jake

Trevor wrote:

> So, I know that MAHOUT-369 fixed a bug with the distributed version of
> the LanczosSolver, but I am experiencing a similar problem with the
> non-distributed version.
>
> I pass in a dataset of Gaussian-distributed numbers (testing PCA) and
> my eigenvalues are seemingly reversed. Below is the output given in the
> logs by LanczosSolver:
>
>     INFO: Eigenvector 0 found with eigenvalue 0.0
>     INFO: Eigenvector 1 found with eigenvalue 347.8703086831804
>     INFO: LanczosSolver finished.
>
> So it returns a vector with eigenvalue 0 before one with an eigenvalue
> of 347? What's more interesting is that when I increase the rank, I get
> a new eigenvector with a value between 0 and 347:
>
>     INFO: Eigenvector 0 found with eigenvalue 0.0
>     INFO: Eigenvector 1 found with eigenvalue 44.794928654801566
>     INFO: Eigenvector 2 found with eigenvalue 347.8286920203704
>
> Shouldn't the eigenvalues be in descending order? And is the 0.0
> eigenvalue even valid?
>
> Thanks,
> Trevor
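The re-ordering and error check jake suggests above, in the spirit of Mahout's EigenverificationJob, can be done non-distributed with a few lines of linear algebra. The `EigenCheck` class below is a hypothetical sketch, not the Mahout API: it estimates each candidate's eigenvalue with the Rayleigh quotient vᵀAv / vᵀv and measures the residual ||Av - lambda*v||; pairs with a large residual, or a spurious near-zero eigenvalue like the 0.0 in the thread, can then be dropped and the rest sorted in descending order.

```java
// Hypothetical sketch of verifying candidate eigenpairs of a symmetric
// matrix A: estimate the eigenvalue with the Rayleigh quotient, then
// measure how far (A v - lambda v) is from zero.
public class EigenCheck {

    /** Rayleigh quotient v^T A v / v^T v: the best eigenvalue estimate for v. */
    public static double rayleigh(double[][] a, double[] v) {
        int n = v.length;
        double num = 0, den = 0;
        for (int i = 0; i < n; i++) {
            double avi = 0;
            for (int j = 0; j < n; j++) avi += a[i][j] * v[j];
            num += v[i] * avi;
            den += v[i] * v[i];
        }
        return num / den;
    }

    /** Euclidean norm of (A v - lambda v); near zero for a true eigenpair. */
    public static double residual(double[][] a, double[] v, double lambda) {
        int n = v.length;
        double sum = 0;
        for (int i = 0; i < n; i++) {
            double avi = 0;
            for (int j = 0; j < n; j++) avi += a[i][j] * v[j];
            sum += (avi - lambda * v[i]) * (avi - lambda * v[i]);
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[][] a = {{2, 1}, {1, 2}};  // A^T A from the 2-column example
        double[] v = {1, 1};              // the true top eigenvector (unnormalized)
        double lambda = rayleigh(a, v);   // 3.0
        System.out.println(lambda + " " + residual(a, v, lambda));
    }
}
```

With candidates ranked by eigenvalue after this check, the "increasing order" output and the invalid 0.0 slot both become easy to clean up in a post-processing step.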

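One way to read the "QR trick" recommended at the top of the thread for very tall, dense matrices (a sketch under that assumption; `ThinQr` is a hypothetical class, not Mahout code): factor the tall m-by-k matrix A as A = QR, and work with the small k-by-k upper-triangular factor R, which satisfies RᵀR = AᵀA and therefore has the same singular values as A. A minimal thin QR via modified Gram-Schmidt, assuming full column rank:

```java
// Hypothetical sketch of the QR trick for a tall, dense matrix A (m >> k):
// thin QR via modified Gram-Schmidt. Since R^T R = A^T A, the k-by-k
// factor R carries the whole spectrum of the big matrix.
public class ThinQr {

    /** Returns the k-by-k upper-triangular R of A (m-by-k, full column rank). */
    public static double[][] r(double[][] a) {
        int m = a.length, k = a[0].length;
        double[][] q = new double[m][];
        for (int i = 0; i < m; i++) q[i] = a[i].clone();  // work on a copy
        double[][] r = new double[k][k];
        for (int j = 0; j < k; j++) {
            double norm = 0;
            for (int i = 0; i < m; i++) norm += q[i][j] * q[i][j];
            r[j][j] = Math.sqrt(norm);
            for (int i = 0; i < m; i++) q[i][j] /= r[j][j];  // normalize column j
            for (int l = j + 1; l < k; l++) {
                double dot = 0;
                for (int i = 0; i < m; i++) dot += q[i][j] * q[i][l];
                r[j][l] = dot;
                for (int i = 0; i < m; i++) q[i][l] -= dot * q[i][j];  // orthogonalize
            }
        }
        return r;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 0}, {0, 1}, {1, 1}};
        double[][] r = r(a);
        // R^T R must match A^T A, here [[2, 1], [1, 2]].
        System.out.println(r[0][0] * r[0][0] + " " + r[0][0] * r[0][1]);
    }
}
```

The eigenproblem on the tall matrix then reduces to a k-by-k one on R, which any dense solver handles trivially, which is why this beats Lanczos when the matrix is tall, dense, and narrow.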