One of the algorithms we would like to try out is PCA, where it seems advisable to avoid subtracting the mean matrix (m*m^T) from the data matrix, say A, so as not to destroy sparsity (see http://search.lucidimagination.com/search/document/cd4c36c2f27080d/regarding_pca_implementation#2eae2e2861213ae0). From what I understand of the Lanczos algorithm, it shouldn't be too hard to modify the solver code so that I can pass A and m*m^T separately, rather than combining them into a single matrix, and then do repeated multiplications. Unfortunately, I have not yet had time to look at SSVD, so it would be extremely helpful if someone who has looked at the problem more closely could comment on how to make these (potential?) modifications to the SSVD code to avoid having to deal with dense matrices.
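For the record, here is a minimal pure-Python sketch of the idea (not Mahout code; all helper names are invented for illustration). Lanczos only ever needs a matrix-vector product, so the rank-one mean correction m*m^T can be applied implicitly to each product instead of being subtracted from the matrix, which would densify it:

```python
# Hedged sketch, not Mahout's implementation. Lanczos needs only
# matrix-vector products, so we can compute
#   (A^T A - c * m m^T) x  =  A^T (A x) - c * (m . x) * m
# while A stays sparse; only vectors are ever materialized.

def spmv(rows, x):
    """y = A x for sparse A stored as a list of {col: value} dicts (one per row)."""
    return [sum(v * x[j] for j, v in row.items()) for row in rows]

def spmv_t(rows, y, n_cols):
    """z = A^T y for the same sparse representation."""
    z = [0.0] * n_cols
    for yi, row in zip(y, rows):
        for j, v in row.items():
            z[j] += yi * v
    return z

def centered_gram_matvec(rows, m, x, c=1.0):
    """(A^T A - c * m m^T) x without forming any dense matrix."""
    z = spmv_t(rows, spmv(rows, x), len(x))
    dot = sum(mi * xi for mi, xi in zip(m, x))  # m . x, a scalar
    return [zi - c * dot * mi for zi, mi in zip(z, m)]
```

The same trick should carry over to any solver that touches the matrix only through multiplications: pass A and m separately and fold the rank-one term into the product.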
Thanks in advance,
Eshwaran

On Jun 6, 2011, at 3:32 AM, Ted Dunning wrote:

> I would push for SSVD as well if you want a real SVD.
>
> Also, I don't think that you lose information about which vectors are which
> (or, as Jake put it, "what they mean"). The stochastic decomposition gives a
> very accurate estimate of the top-k singular vectors. It does this by using
> the random projection to project the top singular vectors into a subspace
> and then correcting the results back into the original space. This is not
> the same as simply doing the decomposition on the random projection and
> then using that decomposition.
>
> On Fri, Jun 3, 2011 at 8:16 PM, Eshwaran Vijaya Kumar <[email protected]> wrote:
>
>> Hi Jake,
>> Thank you for your reply. Good to know that we can use Lanczos. I will
>> have to look into the SSVD algorithm more closely to figure out whether
>> the information loss is worth the gain in speed (and computational
>> efficiency). I guess we will have to run more tests to see which works
>> best before deciding which path to take.
>>
>> Esh
>>
>> On Jun 3, 2011, at 6:23 PM, Jake Mannix wrote:
>>
>>> With 50k columns, you're well within the "sweet spot" for traditional SVD
>>> via Lanczos, so give it a try.
>>>
>>> SSVD will probably run faster, but you lose some information on what the
>>> singular vectors "mean". If you don't need this information, SSVD may be
>>> better for you.
>>>
>>> What would be awesome for *us* is if you tried both and told us what you
>>> found, in terms of performance and relevance. :)
>>>
>>> -jake
>>>
>>> On Jun 3, 2011 4:49 PM, "Eshwaran Vijaya Kumar" <[email protected]> wrote:
>>>
>>> Hello all,
>>> We are trying to build a clustering system which will have an SVD
>>> component. I believe Mahout has two SVD solvers: DistributedLanczosSolver
>>> and SSVD. Could someone give me some tips on which would be the better
>>> choice of solver, given that the size of the data will be roughly 100
>>> million rows, with each row having roughly 50K dimensions (100 million x
>>> 50,000)? We will be working with text data, so the resultant matrix
>>> should be relatively sparse to begin with.
>>>
>>> Thanks
>>> Eshwaran
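For anyone following along, the stochastic decomposition Ted describes (random projection into a subspace, then correcting the result back into the original space) can be sketched roughly as below. This is a generic randomized-SVD sketch in NumPy, not Mahout's actual SSVD code; the function name and parameters are invented for illustration:

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_iter=2, seed=0):
    """Rough sketch of a stochastic SVD: estimate the top-k singular triples.

    Not Mahout's implementation -- just the general recipe: random projection,
    a few power iterations, then a small exact SVD lifted back to full space.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Random projection: sample the range of A with a Gaussian test matrix.
    Omega = rng.standard_normal((n, k + oversample))
    Y = A @ Omega
    # Power iterations sharpen the estimate of the top singular subspace.
    for _ in range(n_iter):
        Y = A @ (A.T @ Y)
    # Orthonormal basis Q for the sampled range.
    Q, _ = np.linalg.qr(Y)
    # Decompose the small projected matrix, then "correct back":
    # lifting through Q recovers singular vectors in the original space.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k]
```

The "correcting back" step is the multiplication U = Q @ Ub: the small SVD is computed in the projected coordinates, then mapped through the orthonormal basis Q, which is why the resulting singular vectors retain their meaning in the original space.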
