Thanks Ted. That is good news.

On Jun 7, 2011, at 11:12 PM, Ted Dunning wrote:
> I think that incorporating mean subtraction into the SSVD code should
> be relatively straightforward. The trick is that you have to project
> the original matrix and the mean separately and then combine the
> results. [See the first sketch below.] There are later steps that
> require similar mods, but since it is all based on a simple product
> that can always be factored, it should be straightforward.
>
> On Tue, Jun 7, 2011 at 4:06 PM, Eshwaran Vijaya Kumar
> <[email protected]> wrote:
>> One of the algorithms we would like to try out is PCA, where it seems to be
>> good to avoid subtracting the mean matrix (m m^T) from the data matrix, say
>> A, so as not to destroy sparsity (refer:
>> http://search.lucidimagination.com/search/document/cd4c36c2f27080d/regarding_pca_implementation#2eae2e2861213ae0
>> ). From what I understand of the Lanczos algorithm, it shouldn't be too
>> hard to modify the solver code so that I can pass A and m m^T without
>> combining them into a single matrix and then do repeated multiplications.
>> [See the second sketch below.] Unfortunately, I have not yet had time to
>> look at SSVD, so it would be extremely helpful if someone who has looked
>> at the problem more closely could comment on how to make these (potential?)
>> modifications to the SSVD code to avoid having to deal with dense matrices.
>>
>> Thanks in advance
>> Eshwaran
>>
>> On Jun 6, 2011, at 3:32 AM, Ted Dunning wrote:
>>
>>> I would push for SSVD as well if you want a real SVD.
>>>
>>> Also, I don't think that you lose information about which vectors are which
>>> (or, as Jake put it, "what they mean"). The stochastic decomposition gives a
>>> very accurate estimate of the top-k singular vectors. It does this by using
>>> the random projection to project the top singular vectors into a subspace
>>> and then correcting the results obtained back into the original space.
>>> [See the third sketch below.] This is not the same as simply doing the
>>> decomposition on the random projection and then using that decomposition.
>>>
>>> On Fri, Jun 3, 2011 at 8:16 PM, Eshwaran Vijaya Kumar
>>> <[email protected]> wrote:
>>>
>>>> Hi Jake,
>>>> Thank you for your reply. Good to know that we can use Lanczos. I will
>>>> have to look into the SSVD algorithm more closely to figure out whether
>>>> the information loss is worth the gain in speed (and computational
>>>> efficiency). I guess we will have to run more tests to see which works
>>>> best before deciding which path to take.
>>>>
>>>> Esh
>>>>
>>>> On Jun 3, 2011, at 6:23 PM, Jake Mannix wrote:
>>>>
>>>>> With 50K columns, you're well within the "sweet spot" for traditional SVD
>>>>> via Lanczos, so give it a try.
>>>>>
>>>>> SSVD will probably run faster, but you lose some information on what the
>>>>> singular vectors "mean". If you don't need this information, SSVD may be
>>>>> better for you.
>>>>>
>>>>> What would be awesome for *us* is if you tried both and told us what you
>>>>> found, in terms of performance and relevance. :)
>>>>>
>>>>> -jake
>>>>>
>>>>> On Jun 3, 2011 4:49 PM, "Eshwaran Vijaya Kumar"
>>>>> <[email protected]> wrote:
>>>>>
>>>>> Hello all,
>>>>> We are trying to build a clustering system which will have an SVD
>>>>> component. I believe Mahout has two SVD solvers: DistributedLanczosSolver
>>>>> and SSVD. Could someone give me some tips on which would be the better
>>>>> choice of solver, given that the data will be roughly 100 million rows
>>>>> with roughly 50K dimensions per row (100 million x 50,000)? We will be
>>>>> working with text data, so the resultant matrix should be relatively
>>>>> sparse to begin with.
>>>>>
>>>>> Thanks
>>>>> Eshwaran
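
First sketch. For concreteness, here is a minimal illustration of the
factoring Ted describes: the projection of the mean-centered matrix splits as
(A - 1 m^T) Omega = A Omega - 1 (m^T Omega), so the dense centered matrix
never has to be formed. This is illustrative Java, not the actual Mahout SSVD
code; the class and method names are made up, and a CSR-like sparse layout for
A is assumed.

    public class CenteredProjection {

        /**
         * Computes Y = (A - 1*m^T) * Omega as A*Omega - 1*(m^T*Omega),
         * touching only the nonzero entries of the sparse matrix A.
         *
         * cols[i]/vals[i] hold the column indices and values of row i of A,
         * m is the vector of column means, omega is a dense n x k test matrix.
         */
        static double[][] project(int[][] cols, double[][] vals,
                                  double[] m, double[][] omega) {
            int rows = cols.length, k = omega[0].length;

            // t = m^T * Omega: one k-vector shared by every row of the result.
            double[] t = new double[k];
            for (int j = 0; j < m.length; j++)
                for (int p = 0; p < k; p++)
                    t[p] += m[j] * omega[j][p];

            double[][] y = new double[rows][k];
            for (int i = 0; i < rows; i++) {
                // Sparse row i of A times Omega.
                for (int nz = 0; nz < cols[i].length; nz++) {
                    int j = cols[i][nz];
                    for (int p = 0; p < k; p++)
                        y[i][p] += vals[i][nz] * omega[j][p];
                }
                // Subtract the shared correction 1*(m^T*Omega).
                for (int p = 0; p < k; p++)
                    y[i][p] -= t[p];
            }
            return y;
        }
    }

The later SSVD steps Ted mentions factor the same way, since each is linear in
the centered matrix.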
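
Second sketch. The same idea answers the Lanczos question: the solver only
touches the operator through matrix-vector products, so centering folds in as
a rank-one correction. Using A^T 1 = n m, one gets
(A - 1 m^T)^T (A - 1 m^T) x = A^T(A x) - n m (m^T x). A sketch under the same
sparse layout; the method name is hypothetical, not Mahout's API (Mahout's
DistributedLanczosSolver consumes such products through its own interfaces).

    public class CenteredGramOperator {

        /**
         * Returns (A - 1*m^T)^T (A - 1*m^T) x = A^T(A x) - n * m * (m^T x),
         * using only sparse products with A plus a rank-one correction.
         */
        static double[] timesSquaredCentered(int[][] cols, double[][] vals,
                                             int nCols, double[] m, double[] x) {
            int nRows = cols.length;

            // y = A x (sparse product)
            double[] y = new double[nRows];
            for (int i = 0; i < nRows; i++)
                for (int nz = 0; nz < cols[i].length; nz++)
                    y[i] += vals[i][nz] * x[cols[i][nz]];

            // z = A^T y (sparse product)
            double[] z = new double[nCols];
            for (int i = 0; i < nRows; i++)
                for (int nz = 0; nz < cols[i].length; nz++)
                    z[cols[i][nz]] += vals[i][nz] * y[i];

            // Rank-one correction: z -= n * m * (m^T x)
            double mtx = 0;
            for (int j = 0; j < nCols; j++) mtx += m[j] * x[j];
            for (int j = 0; j < nCols; j++) z[j] -= nRows * m[j] * mtx;
            return z;
        }
    }

Whether the rank-one term carries the factor n depends on whether one iterates
on the centered Gram matrix itself or on the covariance (the Gram matrix
divided by n).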

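Third sketch. The "project into a subspace and correct back" structure Ted
describes for the stochastic decomposition corresponds to the randomized
range-finding step (in the style of Halko et al.). This is a toy, dense,
in-memory sketch, not the distributed Mahout SSVD code: form Y = A Omega for a
Gaussian test matrix Omega, orthonormalize Y to Q, and compute the small
matrix B = Q^T A. An exact SVD of B = U~ S V^T is then cheap, and U = Q U~
carries the singular vectors back into the original space.

    import java.util.Random;

    public class RangeFinder {

        /**
         * Returns B = Q^T A, where Q is an orthonormal basis of range(A * Omega).
         * The SVD of the small k x n matrix B, B = U~ S V^T, is then computed by
         * any dense solver, and U = Q U~ maps the left singular vectors back into
         * the original space -- the "correcting back" step.
         */
        static double[][] smallFactor(double[][] a, int k, long seed) {
            int rows = a.length, cols = a[0].length;
            Random rnd = new Random(seed);

            // Y = A * Omega, with Omega a Gaussian cols x k test matrix.
            double[][] y = new double[rows][k];
            for (int j = 0; j < cols; j++)
                for (int p = 0; p < k; p++) {
                    double w = rnd.nextGaussian();   // Omega[j][p]
                    for (int i = 0; i < rows; i++)
                        y[i][p] += a[i][j] * w;
                }

            // Orthonormalize the columns of Y in place (modified Gram-Schmidt) -> Q.
            for (int p = 0; p < k; p++) {
                for (int q = 0; q < p; q++) {
                    double dot = 0;
                    for (int i = 0; i < rows; i++) dot += y[i][p] * y[i][q];
                    for (int i = 0; i < rows; i++) y[i][p] -= dot * y[i][q];
                }
                double norm = 0;
                for (int i = 0; i < rows; i++) norm += y[i][p] * y[i][p];
                norm = Math.sqrt(norm);
                for (int i = 0; i < rows; i++) y[i][p] /= norm;
            }

            // B = Q^T A (k x n), small enough for an exact in-memory SVD.
            double[][] b = new double[k][cols];
            for (int p = 0; p < k; p++)
                for (int i = 0; i < rows; i++)
                    for (int j = 0; j < cols; j++)
                        b[p][j] += y[i][p] * a[i][j];
            return b;
        }
    }

In practice A stays sparse and distributed and the projection and QR steps are
done blockwise; the sketch only shows why the recovered singular vectors live
in the original space rather than in the random projection.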