For billions of rows, you can do block-wise QR and get the SVD pretty easily.
Alternatively, the distributed matrix multiply (DistributedRowMatrix#times)
will get you there with slightly less numerical stability.
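[Editor's note: the block-wise QR route works because an orthogonal factor
leaves singular values untouched. QR-factor each row block A_i = Q_i R_i,
stack the small R factors, and reduce the stack once more; the final R
satisfies A^T A = R^T R, so its SVD has the same singular values and
right-singular vectors as A. Below is a minimal two-block sketch, assuming
Apache Commons Math (commons-math3) for the dense QR and SVD, with an
in-memory loop standing in for the distributed pass; the class name is
illustrative.]

    import org.apache.commons.math3.linear.Array2DRowRealMatrix;
    import org.apache.commons.math3.linear.QRDecomposition;
    import org.apache.commons.math3.linear.RealMatrix;
    import org.apache.commons.math3.linear.SingularValueDecomposition;

    // Block-wise QR ("TSQR") sketch for a tall, dense matrix A. Each row
    // block is reduced to its n x n R factor; Q is never materialized.
    public class BlockwiseQRSVD {

        static RealMatrix rFactor(RealMatrix block) {
            int n = block.getColumnDimension();
            // Keep only the square top part of R.
            return new QRDecomposition(block).getR()
                    .getSubMatrix(0, n - 1, 0, n - 1);
        }

        public static void main(String[] args) {
            // Two toy blocks standing in for distributed row blocks of A.
            RealMatrix block1 = new Array2DRowRealMatrix(new double[][] {
                { 1, 2 }, { 3, 4 }, { 5, 6 } });
            RealMatrix block2 = new Array2DRowRealMatrix(new double[][] {
                { 7, 8 }, { 9, 10 }, { 11, 12 } });

            int n = block1.getColumnDimension();

            // Stack the per-block R factors and reduce once more.
            RealMatrix stacked = new Array2DRowRealMatrix(2 * n, n);
            stacked.setSubMatrix(rFactor(block1).getData(), 0, 0);
            stacked.setSubMatrix(rFactor(block2).getData(), n, 0);
            RealMatrix r = rFactor(stacked);

            // SVD of the small n x n R gives A's singular values and V.
            SingularValueDecomposition svd = new SingularValueDecomposition(r);
            System.out.println("singular values: "
                + java.util.Arrays.toString(svd.getSingularValues()));
            System.out.println("right-singular vectors V:\n" + svd.getV());
        }
    }

If the left-singular vectors are also needed, one more pass over the row
blocks computes U = A * V * inv(Sigma).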
On Thu, Jun 23, 2011 at 12:53 PM, Jake Mannix <[email protected]> wrote:
> Well, I'm going to pretend for a second that you're talking about
> *billions* of rows, because tens of millions of rows still fit not just
> on one box, but on one box *in memory*... but yes, the mapreduce job you
> want already exists in Mahout:
> DistributedRowMatrix#times(DistributedRowMatrix) does exactly this.
>
> Although it doesn't do the mean-subtraction.
>
> -jake
>
> 2011/6/23 <[email protected]>
>> Yes, but an M/R job to create the covariance matrix would be required
>> with millions of rows, that is, unless I am missing something.
>>
>>> Doing PCA on a 5000 x 5000 matrix is still an in-memory thing. That's
>>> only 25M doubles, 200MB of memory. Lots of techniques can run on that
>>> set. You could do Lanczos with whatever rank you want on it (but don't
>>> worry about distributed Lanczos).
>>>
>>> 2011/6/23 <[email protected]>
>>>> I will; if it works I may have to make an M/R job for it. All the
>>>> data we have will be tall and dense (let's say 5000 columns, with
>>>> millions of rows). Now doing PCA on that will create a covariance
>>>> matrix that is square and dense. Thanks again, guys.
>>>>
>>>> -Trevor
>>>>
>>>>> Try the QR trick. It is amazingly effective.
>>>>>
>>>>> 2011/6/23 <[email protected]>
>>>>>> Alright, thanks guys.
>>>>>>
>>>>>>> The cases where Lanczos or the stochastic projection helps are
>>>>>>> cases where you have *many* columns but where the data are sparse.
>>>>>>> If you have a very tall dense matrix, the QR method is much to be
>>>>>>> preferred.
>>>>>>>
>>>>>>> 2011/6/23 <[email protected]>
>>>>>>>> Ok, then what would you think to be the minimum number of columns
>>>>>>>> in the dataset for Lanczos to give a reasonable result?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Trevor
>>>>>>>>
>>>>>>>>> A gazillion rows of 2-column data is really much better suited
>>>>>>>>> to doing the following:
>>>>>>>>>
>>>>>>>>> if each row is of the form [a, b], then compute the matrix
>>>>>>>>>
>>>>>>>>> [[a*a, a*b], [a*b, b*b]]
>>>>>>>>>
>>>>>>>>> (the outer product of the vector with itself)
>>>>>>>>>
>>>>>>>>> Then take the matrix sum of all of these, from each row of your
>>>>>>>>> input matrix.
>>>>>>>>>
>>>>>>>>> You'll now have a 2x2 matrix, which you can diagonalize by hand.
>>>>>>>>> It will give you your eigenvalues, and also the right-singular
>>>>>>>>> vectors of your original matrix.
>>>>>>>>>
>>>>>>>>> -jake
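[Editor's note: a self-contained, plain-Java version of the 2x2 trick Jake
describes above; the class name and the closed-form symmetric eigensolver
are illustrative, not Mahout code. The summed matrix is A^T A, so its
eigenvalues are the squared singular values of A and its eigenvectors are
A's right-singular vectors.]

    // Sum the outer products of each row with itself, then diagonalize
    // the resulting symmetric 2x2 matrix in closed form.
    public class TwoColumnPCA {

        /** Eigen-decompose the symmetric 2x2 matrix [[p, q], [q, r]]. */
        static double[][] eigen2x2(double p, double q, double r) {
            double mean = (p + r) / 2.0;
            double disc = Math.sqrt((p - r) * (p - r) / 4.0 + q * q);
            double l1 = mean + disc;   // largest eigenvalue
            double l2 = mean - disc;   // smallest eigenvalue
            // Eigenvector for l1 is (l1 - r, q); if q == 0 the matrix is
            // already diagonal and the top eigenvector is axis-aligned.
            double vx, vy;
            if (q != 0.0)      { vx = l1 - r; vy = q; }
            else if (p >= r)   { vx = 1.0;    vy = 0.0; }
            else               { vx = 0.0;    vy = 1.0; }
            double norm = Math.hypot(vx, vy);
            return new double[][] { { l1, l2 }, { vx / norm, vy / norm } };
        }

        public static void main(String[] args) {
            // Toy data standing in for the 50M rows of [a, b] pairs.
            double[][] rows = { { 1.0, 2.0 }, { 2.0, 3.9 }, { 3.0, 6.1 } };

            // Matrix sum of the outer products [[a*a, a*b], [a*b, b*b]].
            double saa = 0, sab = 0, sbb = 0;
            for (double[] row : rows) {
                saa += row[0] * row[0];
                sab += row[0] * row[1];
                sbb += row[1] * row[1];
            }

            double[][] eig = eigen2x2(saa, sab, sbb);
            System.out.printf("eigenvalues of A^T A: %.4f, %.4f%n",
                              eig[0][0], eig[0][1]);
            System.out.printf("top right-singular vector: (%.4f, %.4f)%n",
                              eig[1][0], eig[1][1]);
        }
    }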
>>>>>>>>> 2011/6/23 <[email protected]>
>>>>>>>>>> Yes, exactly why I asked it for only 2 eigenvalues. So what is
>>>>>>>>>> being said is: if I have, let's say, 50M rows of 2-column data,
>>>>>>>>>> Lanczos can't do anything with it (assuming it puts the 0
>>>>>>>>>> eigenvalue in the mix; of the 2 eigenvectors, only 1 is
>>>>>>>>>> returned because the 0 eigenvalue takes up a slot)?
>>>>>>>>>>
>>>>>>>>>> If the eigenvalue of 0 is invalid, then should it not be
>>>>>>>>>> filtered out, so that it returns "rank" number of eigenvalues
>>>>>>>>>> that could be valid?
>>>>>>>>>>
>>>>>>>>>> -Trevor
>>>>>>>>>>
>>>>>>>>>>> Ah, if your matrix only has 2 columns, you can't go to rank
>>>>>>>>>>> 10. Try it on some slightly less synthetic data of more than
>>>>>>>>>>> rank 10. You can't ask Lanczos for a higher reduced rank than
>>>>>>>>>>> the rank of the matrix itself.
>>>>>>>>>>>
>>>>>>>>>>> -jake
>>>>>>>>>>>
>>>>>>>>>>> 2011/6/23 <[email protected]>
>>>>>>>>>>>> Alright, I can reorder; that is easy. I just had to verify
>>>>>>>>>>>> that the ordering was correct. So when I increased the rank
>>>>>>>>>>>> of the results, I got Lanczos bailing out, which incidentally
>>>>>>>>>>>> causes a NullPointerException:
>>>>>>>>>>>>
>>>>>>>>>>>> INFO: 9 passes through the corpus so far...
>>>>>>>>>>>> WARNING: Lanczos parameters out of range: alpha = NaN,
>>>>>>>>>>>> beta = NaN. Bailing out early!
>>>>>>>>>>>> INFO: Lanczos iteration complete - now to diagonalize the
>>>>>>>>>>>> tri-diagonal auxiliary matrix.
>>>>>>>>>>>> Exception in thread "main" java.lang.NullPointerException
>>>>>>>>>>>>   at org.apache.mahout.math.DenseVector.assign(DenseVector.java:133)
>>>>>>>>>>>>   at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:160)
>>>>>>>>>>>>   at pca.PCASolver.solve(PCASolver.java:53)
>>>>>>>>>>>>   at pca.PCA.main(PCA.java:20)
>>>>>>>>>>>>
>>>>>>>>>>>> So I should probably note that my data only has 2 columns;
>>>>>>>>>>>> the real data will have quite a bit more.
>>>>>>>>>>>>
>>>>>>>>>>>> The failure happens with a rank of 10 or more, with the last,
>>>>>>>>>>>> and therefore most significant, eigenvector being <NaN,NaN>.
>>>>>>>>>>>>
>>>>>>>>>>>> -Trevor
>>>>>>>>>>>>
>>>>>>>>>>>>> The 0 eigenvalue output is not valid, and yes, the output
>>>>>>>>>>>>> will list the results in *increasing* order, even though it
>>>>>>>>>>>>> is finding the largest eigenvalues/vectors first.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Remember that convergence is gradual, so if you only ask for
>>>>>>>>>>>>> 3 eigenvectors/values, you won't be very accurate. If you
>>>>>>>>>>>>> ask for 10 or more, the largest few will now be quite good.
>>>>>>>>>>>>> If you ask for 50, the top 10-20 will be *extremely*
>>>>>>>>>>>>> accurate, and maybe the top 30 will still be quite good.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Try out a non-distributed form of what is in the
>>>>>>>>>>>>> EigenverificationJob to re-order the output and collect how
>>>>>>>>>>>>> accurate your results are (it computes errors for you as
>>>>>>>>>>>>> well).
>>>>>>>>>>>>>
>>>>>>>>>>>>> -jake
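[Editor's note: a non-distributed sketch in the spirit of the verification
step Jake points at: estimate each lambda with a Rayleigh quotient, score
each candidate pair by its residual ||Mv - lambda*v||, drop near-zero
eigenvalues, and sort descending. The names and the tolerance here are
assumptions for illustration, not the actual EigenverificationJob code;
commons-math3 is assumed for the linear algebra.]

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import org.apache.commons.math3.linear.Array2DRowRealMatrix;
    import org.apache.commons.math3.linear.ArrayRealVector;
    import org.apache.commons.math3.linear.RealMatrix;
    import org.apache.commons.math3.linear.RealVector;

    public class EigenCheck {

        record ScoredPair(double eigenvalue, RealVector vector, double error) {}

        static List<ScoredPair> verify(RealMatrix m,
                                       List<RealVector> candidates,
                                       double minEigenvalue) {
            List<ScoredPair> kept = new ArrayList<>();
            for (RealVector v : candidates) {
                RealVector unit = v.mapDivide(v.getNorm());
                RealVector mv = m.operate(unit);
                double lambda = unit.dotProduct(mv);   // Rayleigh quotient
                double error = mv.subtract(unit.mapMultiply(lambda)).getNorm();
                if (lambda > minEigenvalue) {          // drop the 0-eigenvalue slot
                    kept.add(new ScoredPair(lambda, unit, error));
                }
            }
            // Largest eigenvalues first, i.e. descending order.
            kept.sort(Comparator.comparingDouble(ScoredPair::eigenvalue)
                                .reversed());
            return kept;
        }

        public static void main(String[] args) {
            RealMatrix m = new Array2DRowRealMatrix(new double[][] {
                { 4, 1 }, { 1, 3 } });
            List<RealVector> candidates = List.of(
                new ArrayRealVector(new double[] { 1, 0.5 }),
                new ArrayRealVector(new double[] { -0.5, 1 }));
            for (ScoredPair p : verify(m, candidates, 1e-9)) {
                System.out.printf("lambda=%.4f residual=%.4e%n",
                                  p.eigenvalue(), p.error());
            }
        }
    }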
>>>>>>>>>>>>> 2011/6/23 <[email protected]>
>>>>>>>>>>>>>> So, I know that MAHOUT-369 fixed a bug with the distributed
>>>>>>>>>>>>>> version of the LanczosSolver, but I am experiencing a
>>>>>>>>>>>>>> similar problem with the non-distributed version.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I send a dataset of Gaussian-distributed numbers (testing
>>>>>>>>>>>>>> PCA stuff) and my eigenvalues are seemingly reversed. Below
>>>>>>>>>>>>>> is the output given in the logs from LanczosSolver.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Output:
>>>>>>>>>>>>>> INFO: Eigenvector 0 found with eigenvalue 0.0
>>>>>>>>>>>>>> INFO: Eigenvector 1 found with eigenvalue 347.8703086831804
>>>>>>>>>>>>>> INFO: LanczosSolver finished.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So it returns a vector with eigenvalue 0 before one with an
>>>>>>>>>>>>>> eigenvalue of 347? What's more interesting is that when I
>>>>>>>>>>>>>> increase the rank, I get a new eigenvector with a value
>>>>>>>>>>>>>> between 0 and 347:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> INFO: Eigenvector 0 found with eigenvalue 0.0
>>>>>>>>>>>>>> INFO: Eigenvector 1 found with eigenvalue 44.794928654801566
>>>>>>>>>>>>>> INFO: Eigenvector 2 found with eigenvalue 347.8286920203704
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Shouldn't the eigenvalues be in descending order? Also, is
>>>>>>>>>>>>>> the 0.0 eigenvalue even valid?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Trevor
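[Editor's note: one loose end from further up the thread. As Jake says,
DistributedRowMatrix#times gives the raw product, not the covariance,
because it skips the mean subtraction. Here is an in-memory sketch of the
centered computation, in plain Java; a distributed version would accumulate
the same row count, column sums, and raw products per block and combine
them the same way.]

    public class CovarianceSketch {

        static double[][] covariance(double[][] rows) {
            int n = rows.length, d = rows[0].length;

            // Column means.
            double[] mean = new double[d];
            for (double[] row : rows)
                for (int j = 0; j < d; j++) mean[j] += row[j] / n;

            // C = (1 / (n - 1)) * sum of (x - mean)(x - mean)^T over rows.
            double[][] c = new double[d][d];
            for (double[] row : rows)
                for (int i = 0; i < d; i++)
                    for (int j = 0; j < d; j++)
                        c[i][j] += (row[i] - mean[i]) * (row[j] - mean[j])
                                   / (n - 1);
            return c;
        }

        public static void main(String[] args) {
            double[][] rows = { { 1, 2 }, { 2, 4.1 }, { 3, 5.9 }, { 4, 8.2 } };
            for (double[] r : covariance(rows))
                System.out.println(java.util.Arrays.toString(r));
        }
    }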
