I would encourage you to take a stab at a patch on this. You aren't the only person to have expressed interest in scaling PCA, but you aren't a member of a large horde, either.
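As a starting point, here is the identity that makes the patch tractable. For PCA you want the SVD of the centered matrix B = A - 1 m^T, where m is the vector of column means and 1 is a column of ones. B is dense if you ever form it, but every product in the SSVD pipeline can be split so that only the sparse A is touched; for the projection step, Y = B Omega = A Omega - 1 (m^T Omega). A minimal NumPy/SciPy sketch of just that step (illustrative only, toy sizes, not the actual Mahout SSVD code):

    import numpy as np
    from scipy.sparse import random as sparse_random

    n, d, k = 1000, 500, 10                   # toy sizes; the real data is ~1e8 x 5e4
    A = sparse_random(n, d, density=0.01, format="csr")
    m = np.asarray(A.mean(axis=0)).ravel()    # column means, length d
    Omega = np.random.randn(d, k)             # random projection matrix

    # Y = (A - 1 m^T) Omega = A Omega - 1 (m^T Omega); broadcasting supplies the
    # rank-one correction, so the dense centered matrix is never formed.
    Y = A @ Omega - m @ Omega

    # sanity check against the direct (dense, small-scale-only) computation
    B = A.toarray() - np.outer(np.ones(n), m)
    assert np.allclose(Y, B @ Omega)

The later products factor the same way, e.g. Q^T B = Q^T A - (Q^T 1) m^T, which is why the change should stay mechanical.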
On Wed, Jun 8, 2011 at 7:39 AM, Eshwaran Vijaya Kumar <[email protected]> wrote:
> Thanks Ted. That is good news.
>
> On Jun 7, 2011, at 11:12 PM, Ted Dunning wrote:
>
>> I think that incorporating mean subtraction into the SSVD code should
>> be relatively straightforward. The trick is that you have to project
>> the original matrix and the mean separately and then combine the
>> results. There are later steps that require similar mods, but since it
>> is all based on a simple product that can always be factored, it
>> should be straightforward.
>>
>> On Tue, Jun 7, 2011 at 4:06 PM, Eshwaran Vijaya Kumar
>> <[email protected]> wrote:
>>> One of the algorithms we would like to try out is PCA, where it seems
>>> best to avoid subtracting the mean matrix (m*m^T) from the data
>>> matrix, say A, so as not to destroy sparsity (refer:
>>> http://search.lucidimagination.com/search/document/cd4c36c2f27080d/regarding_pca_implementation#2eae2e2861213ae0
>>> ). From what I understand about the Lanczos algorithm, it shouldn't
>>> be too hard to modify the solver code so that I can pass A and m*m^T
>>> without combining them into a single matrix, and then do repeated
>>> multiplications. Unfortunately, I have not yet had time to look at
>>> SSVD, so it would be extremely helpful if someone who has looked at
>>> the problem more closely could comment on how to make these
>>> (potential?) modifications to the SSVD code to avoid having to deal
>>> with dense matrices.
>>>
>>> Thanks in advance
>>> Eshwaran
>>>
>>> On Jun 6, 2011, at 3:32 AM, Ted Dunning wrote:
>>>
>>>> I would push for SSVD as well if you want a real SVD.
>>>>
>>>> Also, I don't think that you lose information about which vectors
>>>> are which (or, as Jake put it, "what they mean"). The stochastic
>>>> decomposition gives a very accurate estimate of the top-k singular
>>>> vectors. It does this by using the random projection to project the
>>>> top singular vectors into a sub-space and then correcting the
>>>> results obtained back into the original space. This is not the same
>>>> as simply doing the decomposition on the random projection and then
>>>> using that decomposition.
>>>>
>>>> On Fri, Jun 3, 2011 at 8:16 PM, Eshwaran Vijaya Kumar
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi Jake,
>>>>> Thank you for your reply. Good to know that we can use Lanczos. I
>>>>> will have to look into the SSVD algorithm more closely to figure
>>>>> out whether the information loss is worth the gain in speed (and
>>>>> computational efficiency). I guess we will have to run more tests
>>>>> to see which works best before deciding which path to take.
>>>>>
>>>>> Esh
>>>>>
>>>>> On Jun 3, 2011, at 6:23 PM, Jake Mannix wrote:
>>>>>
>>>>>> With 50k columns, you're well within the "sweet spot" for
>>>>>> traditional SVD via Lanczos, so give it a try.
>>>>>>
>>>>>> SSVD will probably run faster, but you lose some information on
>>>>>> what the singular vectors "mean". If you don't need this
>>>>>> information, SSVD may be better for you.
>>>>>>
>>>>>> What would be awesome for *us* is if you tried both and told us
>>>>>> what you found, in terms of performance and relevance. :)
>>>>>>
>>>>>> -jake
>>>>>>
>>>>>> On Jun 3, 2011 4:49 PM, "Eshwaran Vijaya Kumar"
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Hello all,
>>>>>> We are trying to build a clustering system which will have an SVD
>>>>>> component. I believe Mahout has two SVD solvers:
>>>>>> DistributedLanczosSolver and SSVD. Could someone give me some tips
>>>>>> on which would be the better choice of solver, given that the data
>>>>>> will be roughly 100 million rows with roughly 50K dimensions per
>>>>>> row (100 million x 50,000)? We will be working with text data, so
>>>>>> the resulting matrix should be relatively sparse to begin with.
>>>>>>
>>>>>> Thanks
>>>>>> Eshwaran
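A footnote on the Lanczos route discussed in the thread above: since B^T B = A^T A - n m m^T (with n the number of rows), the solver never needs the centered matrix either; it only needs a matrix-times-vector callback that applies m m^T as a rank-one correction after the sparse products. A small SciPy sketch of that callback (illustrative; in Mahout the same arithmetic would go into the timesSquared-style path of the distributed matrix):

    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import LinearOperator, eigsh

    n, d = 1000, 500                          # toy sizes
    A = sparse_random(n, d, density=0.01, format="csr")
    m = np.asarray(A.mean(axis=0)).ravel()    # column means

    # (A - 1 m^T)^T (A - 1 m^T) v == A^T (A v) - n (m . v) m,
    # so A stays sparse and m m^T is applied as a rank-one update.
    def centered_gram_matvec(v):
        return A.T @ (A @ v) - n * (m @ v) * m

    op = LinearOperator((d, d), matvec=centered_gram_matvec, dtype=np.float64)
    vals, vecs = eigsh(op, k=10)              # Lanczos on the implicit B^T B

The eigenvalues that come back are the squared singular values of the centered matrix, obtained without ever densifying anything.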

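And to unpack Ted's point above about SSVD correcting results back into the original space: the stochastic method orthonormalizes the projected range of A and then lifts the small factorization back, so the returned U and V live in A's own row and column spaces rather than in the random subspace. A bare-bones dense sketch of that recipe (the basic Halko/Martinsson/Tropp scheme, minus the blocking and refinements a production SSVD adds):

    import numpy as np

    def randomized_svd_sketch(A, k, p=10):
        """Project, orthonormalize, then correct back to the original space."""
        Omega = np.random.randn(A.shape[1], k + p)  # oversampled random test matrix
        Q, _ = np.linalg.qr(A @ Omega)              # orthonormal basis for range(A Omega)
        Ub, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
        return Q @ Ub[:, :k], s[:k], Vt[:k, :]      # lift left vectors back into R^n

    # exact-rank toy case: recovery is essentially exact
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 80))
    U, s, Vt = randomized_svd_sketch(A, k=5)
    assert np.allclose(A, (U * s) @ Vt)

On an exactly rank-k matrix the recovery is exact up to rounding, which is the sense in which the singular vectors keep their meaning; for the real 100 million x 50,000 matrix the same steps run out of core, which is what the SSVD job is for.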