That times method won't get me past what you described in
http://article.gmane.org/gmane.comp.apache.mahout.user/7384/match=pca
with 2 dense matrices being multiplied, will it? And it is conceivable
that we will have billions of rows to do this computation on for my
dataset. On another note, is that not a confusing function name, since
it does not do just multiplication but a transpose on the matrix as
well?

-Trevor
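For reference, a minimal sketch of the call under discussion, against
the 0.5-era Mahout API (the paths, dimensions, and configuration below
are hypothetical, and the constructor signature is worth
double-checking against the javadoc for your version):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.math.hadoop.DistributedRowMatrix;

    // Rows of A are stored on HDFS as a
    // SequenceFile<IntWritable, VectorWritable>.
    int numRows = 50000000;
    int numCols = 5000;
    DistributedRowMatrix a = new DistributedRowMatrix(
        new Path("/data/A"),        // hypothetical input path
        new Path("/tmp/drm-work"),  // scratch space for intermediates
        numRows, numCols);
    a.setConf(new Configuration());

    // Despite the name, this is a transpose-times: it launches a
    // MapReduce job computing A^T * A, a numCols x numCols result.
    // No mean-subtraction is done, so this is the Gram matrix, not
    // the covariance; either center the rows first or correct
    // afterwards with cov = (A^T A - n * mu * mu^T) / (n - 1).
    DistributedRowMatrix ata = a.times(a);

The transpose is what makes the row-wise MapReduce natural: A^T * B is
just the sum over rows i of the outer products of row A_i with row
B_i, so each mapper only ever needs one row of each matrix at a time.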
> Well I'm going to pretend for a second that you're talking about
> *billions* of rows, because tens of millions of rows still fit not
> just on one box, but on one box *in memory*... but yes, the
> mapreduce job you want already exists in Mahout:
> DistributedRowMatrix#times(DistributedRowMatrix) does exactly this.
>
> Although it doesn't do the mean-subtraction.
>
> -jake
>
> 2011/6/23 <[email protected]>
>
> > Yes but a M/R job to create the covariance matrix would be
> > required. With millions of rows, that is, unless I am missing
> > something.
> >
> > > Doing PCA on a 5000 x 5000 matrix is still an in-memory thing.
> > > That's only 25M doubles, 200MB of memory. Lots of techniques
> > > can run on that set. You could do Lanczos with whatever rank
> > > you want on it (but don't worry about distributed Lanczos).
> > >
> > > 2011/6/23 <[email protected]>
> > >
> > > > I will, if it works I may have to make an M/R job for it.
> > > > All the data we have will be tall and dense (let's say 5000
> > > > columns, with millions of rows). Now doing PCA on that will
> > > > create a covariance matrix that is square and dense. Thanks
> > > > again guys.
> > > >
> > > > -Trevor
> > > >
> > > > > Try the QR trick. It is amazingly effective.
> > > > >
> > > > > 2011/6/23 <[email protected]>
> > > > >
> > > > > > Alright, thanks guys.
> > > > > >
> > > > > > > The cases where Lanczos or the stochastic projection
> > > > > > > helps are cases where you have *many* columns but
> > > > > > > where the data are sparse. If you have a very tall
> > > > > > > dense matrix, the QR method is much to be preferred.
> > > > > > >
> > > > > > > 2011/6/23 <[email protected]>
> > > > > > >
> > > > > > > > Ok, then what would you think to be the minimum
> > > > > > > > number of columns in the dataset for Lanczos to
> > > > > > > > give a reasonable result?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > -Trevor
> > > > > > > >
> > > > > > > > > A gazillion rows of 2-column data is really much
> > > > > > > > > better suited to doing the following:
> > > > > > > > >
> > > > > > > > > If each row is of the form [a, b], then compute
> > > > > > > > > the matrix
> > > > > > > > >
> > > > > > > > > [[a*a, a*b], [a*b, b*b]]
> > > > > > > > >
> > > > > > > > > (the outer product of the vector with itself).
> > > > > > > > >
> > > > > > > > > Then take the matrix sum of all of these, from
> > > > > > > > > each row of your input matrix.
> > > > > > > > >
> > > > > > > > > You'll now have a 2x2 matrix, which you can
> > > > > > > > > diagonalize by hand. It will give you your
> > > > > > > > > eigenvalues, and also the right-singular vectors
> > > > > > > > > of your original matrix.
> > > > > > > > >
> > > > > > > > > -jake
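In code, that accumulation and the by-hand diagonalization come to
only a few lines. A sketch in plain Java (here rows stands for
whatever iterable yields the [a, b] pairs; the eigenvalue formula is
the standard closed form for a symmetric 2x2 matrix):

    // Sum the 2x2 outer products [[a*a, a*b], [a*b, b*b]] over all
    // rows of the input.
    double saa = 0.0, sab = 0.0, sbb = 0.0;
    for (double[] row : rows) {
      double a = row[0];
      double b = row[1];
      saa += a * a;
      sab += a * b;
      sbb += b * b;
    }

    // Diagonalize the symmetric 2x2 [[saa, sab], [sab, sbb]]:
    // eigenvalues are the mean of the diagonal plus or minus the
    // discriminant.
    double mid = (saa + sbb) / 2.0;
    double disc = Math.sqrt((saa - sbb) * (saa - sbb) / 4.0
                            + sab * sab);
    double lambda1 = mid + disc;   // largest eigenvalue
    double lambda2 = mid - disc;   // smallest eigenvalue

    // Unnormalized eigenvector for lambda1; assumes sab != 0 (if
    // sab == 0 the matrix is already diagonal and the coordinate
    // axes are the eigenvectors).
    double[] v1 = { sab, lambda1 - saa };

Since the three partial sums combine by plain addition, this is a
single pass over the data and parallelizes trivially if the rows are
already sitting in HDFS.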
> > > > > > > > > 2011/6/23 <[email protected]>
> > > > > > > > >
> > > > > > > > > > Yes, exactly why I asked it for only 2
> > > > > > > > > > eigenvalues. So what is being said is: if I
> > > > > > > > > > have, let's say, 50M rows of 2-column data,
> > > > > > > > > > Lanczos can't do anything with it (assuming it
> > > > > > > > > > puts the 0 eigenvalue in the mix - of the 2
> > > > > > > > > > eigenvectors only 1 is returned because of the
> > > > > > > > > > 0 eigenvalue taking up a slot)?
> > > > > > > > > >
> > > > > > > > > > If the eigenvalue of 0 is invalid, then should
> > > > > > > > > > it not be filtered out, so that it returns
> > > > > > > > > > "rank" eigenvalues that could all be valid?
> > > > > > > > > >
> > > > > > > > > > -Trevor
> > > > > > > > > >
> > > > > > > > > > > Ah, if your matrix only has 2 columns, you
> > > > > > > > > > > can't go to rank 10. Try on some slightly
> > > > > > > > > > > less synthetic data of more than rank 10.
> > > > > > > > > > > You can't ask Lanczos for more reduced rank
> > > > > > > > > > > than that of the matrix itself.
> > > > > > > > > > >
> > > > > > > > > > > -jake
> > > > > > > > > > >
> > > > > > > > > > > 2011/6/23 <[email protected]>
> > > > > > > > > > >
> > > > > > > > > > > > Alright, I can reorder, that is easy; I
> > > > > > > > > > > > just had to verify that the ordering was
> > > > > > > > > > > > correct. So when I increase the rank of
> > > > > > > > > > > > the results I get Lanczos bailing out,
> > > > > > > > > > > > which incidentally causes a
> > > > > > > > > > > > NullPointerException:
> > > > > > > > > > > >
> > > > > > > > > > > > INFO: 9 passes through the corpus so far...
> > > > > > > > > > > > WARNING: Lanczos parameters out of range:
> > > > > > > > > > > > alpha = NaN, beta = NaN. Bailing out early!
> > > > > > > > > > > > INFO: Lanczos iteration complete - now to
> > > > > > > > > > > > diagonalize the tri-diagonal auxiliary
> > > > > > > > > > > > matrix.
> > > > > > > > > > > > Exception in thread "main" java.lang.NullPointerException
> > > > > > > > > > > >   at org.apache.mahout.math.DenseVector.assign(DenseVector.java:133)
> > > > > > > > > > > >   at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:160)
> > > > > > > > > > > >   at pca.PCASolver.solve(PCASolver.java:53)
> > > > > > > > > > > >   at pca.PCA.main(PCA.java:20)
> > > > > > > > > > > >
> > > > > > > > > > > > So I should probably note that my data only
> > > > > > > > > > > > has 2 columns; the real data will have
> > > > > > > > > > > > quite a bit more.
> > > > > > > > > > > >
> > > > > > > > > > > > The failure happens with a rank of 10 and
> > > > > > > > > > > > more, with the last, and therefore most
> > > > > > > > > > > > significant, eigenvector being <NaN,NaN>.
> > > > > > > > > > > >
> > > > > > > > > > > > -Trevor
> > > > > > > > > > > >
> > > > > > > > > > > > > The 0 eigenvalue output is not valid,
> > > > > > > > > > > > > and yes, the output will list the
> > > > > > > > > > > > > results in *increasing* order, even
> > > > > > > > > > > > > though it is finding the largest
> > > > > > > > > > > > > eigenvalues/vectors first.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Remember that convergence is gradual, so
> > > > > > > > > > > > > if you only ask for 3
> > > > > > > > > > > > > eigenvectors/values, you won't be very
> > > > > > > > > > > > > accurate. If you ask for 10 or more, the
> > > > > > > > > > > > > largest few will now be quite good. If
> > > > > > > > > > > > > you ask for 50, now the top 10-20 will
> > > > > > > > > > > > > be *extremely* accurate, and maybe the
> > > > > > > > > > > > > top 30 will still be quite good.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Try out a non-distributed form of what
> > > > > > > > > > > > > is in the EigenVerificationJob to
> > > > > > > > > > > > > re-order the output and collect how
> > > > > > > > > > > > > accurate your results are (it computes
> > > > > > > > > > > > > errors for you as well).
> > > > > > > > > > > > >
> > > > > > > > > > > > > -jake
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2011/6/23 <[email protected]>
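A non-distributed version of that verification pass is small enough
to sketch here. This assumes Mahout's in-memory math API; m is the
symmetric matrix the solver ran against, candidates holds the
eigenvectors it returned, and the names and the 1-minus-cosine error
measure are illustrative rather than EigenVerificationJob's exact
code:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.Vector;

    /** One verified eigenpair: eigenvalue, error estimate, vector. */
    class Eigen {
      final double value;   // Rayleigh quotient v.(Mv) / v.v
      final double error;   // 1 - cos(Mv, v); ~0 for a true eigenvector
      final Vector vector;
      Eigen(double value, double error, Vector vector) {
        this.value = value;
        this.error = error;
        this.vector = vector;
      }
    }

    List<Eigen> verify(Matrix m, List<Vector> candidates,
                       double minEigenvalue) {
      List<Eigen> kept = new ArrayList<Eigen>();
      for (Vector v : candidates) {
        Vector mv = m.times(v);
        double value = v.dot(mv) / v.dot(v);
        if (value <= minEigenvalue) {
          continue;  // drops the spurious 0.0 eigenpair
        }
        double error = 1.0 - mv.dot(v) / (mv.norm(2) * v.norm(2));
        kept.add(new Eigen(value, error, v));
      }
      // Largest (and therefore most accurate) eigenvalues first.
      Collections.sort(kept, new Comparator<Eigen>() {
        public int compare(Eigen x, Eigen y) {
          return Double.compare(y.value, x.value);
        }
      });
      return kept;
    }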
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > So, I know that MAHOUT-369 fixed a bug
> > > > > > > > > > > > > > with the distributed version of the
> > > > > > > > > > > > > > LanczosSolver, but I am experiencing a
> > > > > > > > > > > > > > similar problem with the
> > > > > > > > > > > > > > non-distributed version.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I send a dataset of
> > > > > > > > > > > > > > Gaussian-distributed numbers (testing
> > > > > > > > > > > > > > PCA stuff) and my eigenvalues are
> > > > > > > > > > > > > > seemingly reversed. Below I have the
> > > > > > > > > > > > > > output given in the logs from
> > > > > > > > > > > > > > LanczosSolver.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Output:
> > > > > > > > > > > > > > INFO: Eigenvector 0 found with eigenvalue 0.0
> > > > > > > > > > > > > > INFO: Eigenvector 1 found with eigenvalue 347.8703086831804
> > > > > > > > > > > > > > INFO: LanczosSolver finished.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > So it returns a vector with eigenvalue
> > > > > > > > > > > > > > 0 before one with an eigenvalue of
> > > > > > > > > > > > > > 347? What's more interesting is that
> > > > > > > > > > > > > > when I increase the rank, I get a new
> > > > > > > > > > > > > > eigenvector with a value between 0 and
> > > > > > > > > > > > > > 347:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > INFO: Eigenvector 0 found with eigenvalue 0.0
> > > > > > > > > > > > > > INFO: Eigenvector 1 found with eigenvalue 44.794928654801566
> > > > > > > > > > > > > > INFO: Eigenvector 2 found with eigenvalue 347.8286920203704
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Shouldn't the eigenvalues be in
> > > > > > > > > > > > > > descending order? Also, is the 0.0
> > > > > > > > > > > > > > eigenvalue even valid?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Trevor
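Finally, for the "QR trick" suggested up-thread for tall, dense
matrices: since A = QR implies A^T A = R^T R, the small n x n factor
R carries all the spectral information of an m x n matrix with
m >> n. A sketch using Mahout's in-memory decompositions (it assumes
the mean-centered matrix fits in memory as a DenseMatrix; rows is a
hypothetical double[m][n]):

    import org.apache.mahout.math.DenseMatrix;
    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.QRDecomposition;
    import org.apache.mahout.math.SingularValueDecomposition;

    // A: tall and dense, m x n with m >> n, already mean-centered
    // so that A^T A is (n - 1 times) the covariance matrix.
    Matrix a = new DenseMatrix(rows);

    // R is n x n and satisfies A^T A = R^T R, so an SVD of R yields
    // the singular values and right-singular vectors (the principal
    // axes) of A without ever forming A^T A explicitly.
    Matrix r = new QRDecomposition(a).getR();
    SingularValueDecomposition svd = new SingularValueDecomposition(r);

    Matrix v = svd.getV();                    // n x n principal axes
    double[] sigma = svd.getSingularValues(); // in descending order

For data too big to hold at once, the same idea works blockwise: QR
each block of rows, stack the resulting R factors, and QR again, so
only O(n^2) state is ever kept in memory.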
