Alright, thank you again; that makes it a little easier. Now to see about implementing PCA as described in "Map-Reduce for Machine Learning on Multicore", once I can get it to yield the correct covariance matrix, of course. If this gets done I may be able to contribute it back to Mahout (my employer may agree to it), assuming someone else doesn't get PCA done earlier.

-Trevor
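For anyone trying the same thing: the PCA recipe in that paper is the "summation form", cov = (1/n) * sum(x x^T) - mean * mean^T, where the two sums are exactly the per-record quantities the map tasks emit and the reduce step adds up. Below is a minimal sequential Java sketch of that computation; the class name is made up here and this is not Mahout code.

/** Illustrative only: summation-form covariance, as in the NIPS multicore
 *  paper. cov = (1/n) * sum(x x^T) - mean mean^T. Each row's contribution is
 *  independent, so the two running sums are exactly what mappers would emit
 *  and a reducer would add. (Divide by n - 1 instead for sample covariance.) */
public class CovarianceSketch {
  public static double[][] covariance(double[][] rows) {
    int d = rows[0].length;
    double[][] outerSum = new double[d][d]; // running sum of x * x^T
    double[] sum = new double[d];           // running sum of x
    for (double[] x : rows) {
      for (int i = 0; i < d; i++) {
        sum[i] += x[i];
        for (int j = 0; j < d; j++) {
          outerSum[i][j] += x[i] * x[j];
        }
      }
    }
    int n = rows.length;
    double[][] cov = new double[d][d];
    for (int i = 0; i < d; i++) {
      for (int j = 0; j < d; j++) {
        cov[i][j] = outerSum[i][j] / n - (sum[i] / n) * (sum[j] / n);
      }
    }
    return cov;
  }

  public static void main(String[] args) {
    double[][] data = {{1, 2}, {3, 4}, {5, 7}}; // tiny made-up dataset
    for (double[] row : covariance(data)) {
      System.out.println(java.util.Arrays.toString(row));
    }
  }
}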
> Btw, the JIRA involved was
> https://issues.apache.org/jira/browse/MAHOUT-376
>
> On Thu, Jun 23, 2011 at 11:44 AM, Ted Dunning <[email protected]> wrote:
>
>> If you don't need all 5000 singular values, then you can directly use
>> the stochastic decomposition algorithms in Mahout.
>>
>> If you do want all 5000 singular values, then you can probably use all
>> but the first and last few steps of the stochastic decomposition
>> algorithm to get what you need. If you look back at the pertinent JIRAs
>> you will see some PDF documents that describe my approach and Dmitriy's
>> approach. His approach is what we have now, since he did all the work.
>> Either should work for you.
>>
>> 2011/6/23 <[email protected]>
>>
>>> I will; if it works I may have to make an m/r job for it. All the data
>>> we have will be tall and dense (let's say 5000 columns, with millions
>>> of rows). Doing PCA on that will create a covariance matrix that is
>>> square and dense. Thanks again, guys.
>>>
>>> -Trevor
>>>
>>>> Try the QR trick. It is amazingly effective.
>>>>
>>>> 2011/6/23 <[email protected]>
>>>>
>>>>> Alright, thanks guys.
>>>>>
>>>>>> The cases where Lanczos or the stochastic projection helps are cases
>>>>>> where you have *many* columns but where the data are sparse. If you
>>>>>> have a very tall dense matrix, the QR method is much preferred.
>>>>>>
>>>>>> 2011/6/23 <[email protected]>
>>>>>>
>>>>>>> Ok, then what would you think to be the minimum number of columns
>>>>>>> in the dataset for Lanczos to give a reasonable result?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> -Trevor
>>>>>>>
>>>>>>>> A gazillion rows of 2-columned data is really much better suited
>>>>>>>> to doing the following:
>>>>>>>>
>>>>>>>> If each row is of the form [a, b], then compute the matrix
>>>>>>>>
>>>>>>>>   [[a*a, a*b], [a*b, b*b]]
>>>>>>>>
>>>>>>>> (the outer product of the vector with itself).
>>>>>>>>
>>>>>>>> Then take the matrix sum of all of these, from each row of your
>>>>>>>> input matrix.
>>>>>>>>
>>>>>>>> You'll now have a 2x2 matrix, which you can diagonalize by hand.
>>>>>>>> It will give you your eigenvalues, and also the right-singular
>>>>>>>> vectors of your original matrix.
>>>>>>>>
>>>>>>>> -jake
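Jake's 2x2 trick above is short enough to sketch in plain Java. The following is illustrative only, not Mahout code: accumulate the outer products, then diagonalize the symmetric 2x2 result [[p, q], [q, r]] with the closed-form eigenvalues (p + r)/2 +/- sqrt(((p - r)/2)^2 + q^2).

/** Illustrative sketch of the 2-column trick described above (not Mahout
 *  code): sum the outer product of every row with itself, then diagonalize
 *  the resulting symmetric 2x2 matrix in closed form. */
public class TwoColumnEigen {
  public static void main(String[] args) {
    // Made-up, nearly rank-1 data, like correlated 2-column test input.
    double[][] rows = {{1.0, 2.0}, {2.0, 4.1}, {3.0, 5.9}, {4.0, 8.2}};

    // Accumulate A^T A = sum over rows [a, b] of [[a*a, a*b], [a*b, b*b]].
    double p = 0, q = 0, r = 0;
    for (double[] row : rows) {
      double a = row[0], b = row[1];
      p += a * a;
      q += a * b;
      r += b * b;
    }

    // Eigenvalues of the symmetric matrix [[p, q], [q, r]] in closed form.
    double mean = (p + r) / 2;
    double disc = Math.sqrt(((p - r) / 2) * ((p - r) / 2) + q * q);
    double lambda1 = mean + disc; // largest eigenvalue
    double lambda2 = mean - disc;

    // Eigenvector for lambda1: (q, lambda1 - p) works whenever q != 0.
    double norm = Math.hypot(q, lambda1 - p);
    System.out.println("eigenvalues: " + lambda1 + ", " + lambda2);
    System.out.println("top right-singular vector: ["
        + (q / norm) + ", " + ((lambda1 - p) / norm) + "]");
  }
}

The square roots of these eigenvalues are the singular values of the original tall matrix, which is why this stands in for a full SVD on 2-column data.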
>>>>>>>> 2011/6/23 <[email protected]>
>>>>>>>>
>>>>>>>>> Yes, exactly why I asked it for only 2 eigenvalues. So what is
>>>>>>>>> being said is: if I have, let's say, 50M rows of 2-columned data,
>>>>>>>>> Lanczos can't do anything with it (assuming it puts the 0
>>>>>>>>> eigenvalue in the mix, so of the 2 eigenvectors only 1 is returned
>>>>>>>>> because the 0 eigenvalue takes up a slot)?
>>>>>>>>>
>>>>>>>>> If the eigenvalue of 0 is invalid, then should it not be filtered
>>>>>>>>> out so that it returns "rank" number of eigenvalues that could be
>>>>>>>>> valid?
>>>>>>>>>
>>>>>>>>> -Trevor
>>>>>>>>>
>>>>>>>>>> Ah, if your matrix only has 2 columns, you can't go to rank 10.
>>>>>>>>>> Try on some slightly less synthetic data of more than rank 10.
>>>>>>>>>> You can't ask Lanczos for more reduced rank than that of the
>>>>>>>>>> matrix itself.
>>>>>>>>>>
>>>>>>>>>> -jake
>>>>>>>>>>
>>>>>>>>>> 2011/6/23 <[email protected]>
>>>>>>>>>>
>>>>>>>>>>> Alright, I can reorder, that is easy; I just had to verify that
>>>>>>>>>>> the ordering was correct. But when I increased the rank of the
>>>>>>>>>>> results I got Lanczos bailing out, which incidentally causes a
>>>>>>>>>>> NullPointerException:
>>>>>>>>>>>
>>>>>>>>>>> INFO: 9 passes through the corpus so far...
>>>>>>>>>>> WARNING: Lanczos parameters out of range: alpha = NaN, beta = NaN.
>>>>>>>>>>> Bailing out early!
>>>>>>>>>>> INFO: Lanczos iteration complete - now to diagonalize the
>>>>>>>>>>> tri-diagonal auxiliary matrix.
>>>>>>>>>>> Exception in thread "main" java.lang.NullPointerException
>>>>>>>>>>>   at org.apache.mahout.math.DenseVector.assign(DenseVector.java:133)
>>>>>>>>>>>   at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:160)
>>>>>>>>>>>   at pca.PCASolver.solve(PCASolver.java:53)
>>>>>>>>>>>   at pca.PCA.main(PCA.java:20)
>>>>>>>>>>>
>>>>>>>>>>> So I should probably note that my data only has 2 columns; the
>>>>>>>>>>> real data will have quite a bit more.
>>>>>>>>>>>
>>>>>>>>>>> The failure happens with rank 10 and above, with the last, and
>>>>>>>>>>> therefore most significant, eigenvector being <NaN, NaN>.
>>>>>>>>>>>
>>>>>>>>>>> -Trevor
>>>>>>>>>>>
>>>>>>>>>>>> The 0 eigenvalue output is not valid, and yes, the output will
>>>>>>>>>>>> list the results in *increasing* order, even though it is
>>>>>>>>>>>> finding the largest eigenvalues/vectors first.
>>>>>>>>>>>>
>>>>>>>>>>>> Remember that convergence is gradual, so if you only ask for 3
>>>>>>>>>>>> eigenvectors/values, you won't be very accurate. If you ask for
>>>>>>>>>>>> 10 or more, the largest few will now be quite good. If you ask
>>>>>>>>>>>> for 50, now the top 10-20 will be *extremely* accurate, and
>>>>>>>>>>>> maybe the top 30 will still be quite good.
>>>>>>>>>>>>
>>>>>>>>>>>> Try out a non-distributed form of what is in the
>>>>>>>>>>>> EigenverificationJob to re-order the output and collect how
>>>>>>>>>>>> accurate your results are (it computes errors for you as well).
>>>>>>>>>>>>
>>>>>>>>>>>> -jake
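The verification step Jake suggests can be mimicked in memory along these lines. The sketch below is hypothetical plain Java, not the actual EigenverificationJob: for each candidate eigenvector v of a symmetric matrix M, estimate the eigenvalue as the Rayleigh quotient v . (Mv), measure the residual |Mv - lambda*v|, and re-sort by descending eigenvalue. The test matrix and candidate vectors are made up.

import java.util.Arrays;
import java.util.Comparator;

/** Rough in-memory analogue of the eigen-verification idea (illustrative
 *  only, not the Mahout class): estimate each eigenvalue from its vector,
 *  compute the residual error, and re-order by descending eigenvalue. */
public class EigenCheckSketch {
  static double[] multiply(double[][] m, double[] v) {
    double[] out = new double[m.length];
    for (int i = 0; i < m.length; i++)
      for (int j = 0; j < v.length; j++)
        out[i] += m[i][j] * v[j];
    return out;
  }

  static double dot(double[] a, double[] b) {
    double s = 0;
    for (int i = 0; i < a.length; i++) s += a[i] * b[i];
    return s;
  }

  /** Returns {eigenvalueEstimate, residualNorm} for a unit-norm candidate v. */
  static double[] check(double[][] m, double[] v) {
    double[] mv = multiply(m, v);
    double lambda = dot(v, mv); // Rayleigh quotient
    double err = 0;
    for (int i = 0; i < v.length; i++) {
      double d = mv[i] - lambda * v[i];
      err += d * d;
    }
    return new double[] {lambda, Math.sqrt(err)};
  }

  public static void main(String[] args) {
    double[][] m = {{4, 1}, {1, 3}}; // made-up symmetric test matrix
    // Approximate eigenvectors, deliberately in ascending order.
    double[][] candidates = {{0.526, -0.851}, {0.851, 0.526}};
    Arrays.sort(candidates,
        Comparator.comparingDouble((double[] v) -> -check(m, v)[0]));
    for (double[] v : candidates) {
      double[] res = check(m, v);
      System.out.println("lambda ~ " + res[0] + ", residual = " + res[1]);
    }
  }
}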
>>>>>>>>>>>> 2011/6/23 <[email protected]>
>>>>>>>>>>>>
>>>>>>>>>>>>> So, I know that MAHOUT-369 fixed a bug with the distributed
>>>>>>>>>>>>> version of the LanczosSolver, but I am experiencing a similar
>>>>>>>>>>>>> problem with the non-distributed version.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I send a dataset of Gaussian-distributed numbers (testing PCA
>>>>>>>>>>>>> stuff) and my eigenvalues are seemingly reversed. Below is the
>>>>>>>>>>>>> output given in the logs from LanczosSolver.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Output:
>>>>>>>>>>>>> INFO: Eigenvector 0 found with eigenvalue 0.0
>>>>>>>>>>>>> INFO: Eigenvector 1 found with eigenvalue 347.8703086831804
>>>>>>>>>>>>> INFO: LanczosSolver finished.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So it returns a vector with eigenvalue 0 before one with an
>>>>>>>>>>>>> eigenvalue of 347? What's more interesting is that when I
>>>>>>>>>>>>> increase the rank, I get a new eigenvector with a value
>>>>>>>>>>>>> between 0 and 347:
>>>>>>>>>>>>>
>>>>>>>>>>>>> INFO: Eigenvector 0 found with eigenvalue 0.0
>>>>>>>>>>>>> INFO: Eigenvector 1 found with eigenvalue 44.794928654801566
>>>>>>>>>>>>> INFO: Eigenvector 2 found with eigenvalue 347.8286920203704
>>>>>>>>>>>>>
>>>>>>>>>>>>> Shouldn't the eigenvalues be in descending order? Also, is the
>>>>>>>>>>>>> 0.0 eigenvalue even valid?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Trevor
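For completeness, the "QR trick" Ted recommends upthread for tall, dense matrices works roughly like this: factor the n x k matrix A as Q*R with orthonormal Q, so that A^T A = R^T R and the singular values of the small k x k R equal those of A. The real distributed versions are described in the PDFs attached to the JIRAs; the sketch below is only an in-memory modified Gram-Schmidt illustration, not Mahout's implementation.

/** Illustrative sketch of the tall-skinny QR idea (not Mahout code): reduce
 *  an n x k matrix A (n >> k) to a k x k upper-triangular R via modified
 *  Gram-Schmidt. Since A = QR with orthonormal Q, the singular values of R
 *  equal those of A, so the SVD shrinks to a k x k in-memory problem.
 *  Assumes full column rank (no zero pivots). */
public class TallSkinnyQR {
  public static double[][] computeR(double[][] a) {
    int n = a.length, k = a[0].length;
    double[][] q = new double[n][k];
    double[][] r = new double[k][k];
    for (int i = 0; i < n; i++) q[i] = a[i].clone(); // work on a copy
    for (int j = 0; j < k; j++) {
      double norm = 0;
      for (int i = 0; i < n; i++) norm += q[i][j] * q[i][j];
      r[j][j] = Math.sqrt(norm);
      for (int i = 0; i < n; i++) q[i][j] /= r[j][j];
      // Orthogonalize the remaining columns against column j.
      for (int c = j + 1; c < k; c++) {
        double proj = 0;
        for (int i = 0; i < n; i++) proj += q[i][j] * q[i][c];
        r[j][c] = proj;
        for (int i = 0; i < n; i++) q[i][c] -= proj * q[i][j];
      }
    }
    return r;
  }

  public static void main(String[] args) {
    double[][] a = {{1, 2}, {3, 4}, {5, 6}, {7, 8}}; // tall and dense
    for (double[] row : computeR(a)) {
      System.out.println(java.util.Arrays.toString(row));
    }
  }
}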
