> From: Ted Dunning <[email protected]>
> Subject: Re: SVD in Mahout (was: Mahout Lanczos SVD complexity)
> Date: 19.12.2011 16:58:57
> ----------------------------------------
> The users of SVD on Mahout are more interested in their own application
> level metrics than the metrics associated with the decomposition itself.
> Moreover, at the scale people have been working with, getting any
> decomposition at all is refreshing.
Heh. Is that the official Mahout philosophy for other algorithms as well? "Who cares about correctness, we are happy we can run it on some data at all, so shut up?" I hope you're not serious, Ted. Aren't you afraid people will draw wrong conclusions about an SVD application, using your (possibly wildly inaccurate) SVD implementation? Retract publications?

By all means, use whatever decomposition suits you. But SVD already has a well-established meaning in linear algebra, and using that acronym comes with certain expectations. People unfamiliar with the pitfalls of your implementation may assume they're really getting SVD (or at least a version that's "reasonably close" -- in the numerical computing sense). A big fat accuracy warning is in order here. Nobody expects more or less random vectors, even if these happen to perform better than the real truncated SVD in your app [citation needed].

> The examples that you gave in your thread involved walking *way* down the
> spectrum to the smaller singular values. That is absolutely not the
> interest with most Mahout users because that would involve fairly massive
> over-fitting.

Too many opinions, too little data. Instead, I decided to run the English Wikipedia experiments with factors=10 and oversampling=5, as per your concerns.

(cross-posting to the gensim mailing list, as this might be of interest to gensim users as well)

Data: English Wikipedia as a term-document matrix (0.42B non-zeroes, 3.5M documents, 10K features). Requesting the top 10 factors (1% of the total singular value mass), not 500 factors like before (15% of the total mass).

Accuracy is evaluated by comparing the reconstruction error against Lapack's in-core DSYEV routine on A*A^T, whose eigenvalues Lambda_lapack are the squared singular values:

Error = |A*A^T - U*S^2*U^T| / |A*A^T - U_lapack*Lambda_lapack*U_lapack^T|

batch algo      error
--------------+------
baseline*       1.986
0 power iters   1.877
1 power iter    1.094
2 power iters   1.009
4 power iters   1.0005
6 power iters   1.00009

The results are completely in line with Martinsson et al.'s [1], as well as my previous experiments: no power iteration steps plus massive truncation = rubbish output. Accuracy improves exponentially with the number of iteration steps (but see my initial warning re. numerical issues with a higher number of steps if implemented naively).

So, your worry that the SVD inaccuracy is somehow due to asking for too many factors, and therefore irrelevant for thinner SVDs, is without substance. Your users certainly deserve to know that without power iterations, the SVD output is on par with the baseline* -- a "decomposition" where all factors are simply zero.

From all the dev replies here -- no users actually replied -- I get the vibe that the accuracy discussion annoys you. Now, I dropped by to give a friendly hint about possible serious accuracy concerns, based on experience with mid-scale (billions of non-zeroes) SVD computations in one specific domain (term-document matrices in NLP). And possibly to learn about your issues on tera-feature scale datasets in return, which I'm very interested in. Apparently neither of us is getting anything out of this, so I'll stop here.

Best,
Radim

[1] http://www.sci.ccny.cuny.edu/~szlam/npisvdnipsshort.pdf

>
> 2011/12/19 Radim Rehurek <[email protected]>
>
> > No problem.
> >
> > If you decide to run some SSVD accuracy experiments in the future (don't
> > your users ask for that?), please include me in cc -- just in case I miss
> > the post on this mailing list.
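
P.S. For anyone who wants to experiment with this themselves: below is a minimal numpy sketch of the algorithm family under discussion -- randomized SVD with power iterations, in the style of [1]. It is my own illustration (the function name, parameter names and defaults are made up), not Mahout's SSVD code nor gensim's. The QR re-orthonormalization between steps is the standard fix for the numerical issue I warned about: applying (A*A^T)^q naively pushes the smaller singular directions below machine precision.

import numpy as np

def rand_svd(A, k, oversample=5, n_iter=2, seed=0):
    """Approximate rank-k truncated SVD of an m x n matrix A."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Range finder: sketch the column space of A with a Gaussian test matrix.
    Q, _ = np.linalg.qr(A @ rng.standard_normal((n, k + oversample)))
    # Power iterations sharpen the spectral decay; re-orthonormalize after
    # every multiplication instead of forming (A*A^T)^q directly.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Project onto the (k + oversample)-dim subspace and decompose the small
    # projected matrix exactly.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]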

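And a sketch of the error metric itself, reusing rand_svd from above. Here numpy's eigh stands in for Lapack's DSYEV (both are dense symmetric eigensolvers), |.| is taken to be the Frobenius norm, and the toy matrix is random Gaussian, so the absolute ratios won't match the Wikipedia numbers above -- it's the trend with the number of power iterations that carries over.

import numpy as np

def relative_error(A, U, s, k):
    """|A*A^T - U*S^2*U^T| / |A*A^T - U_lapack*Lambda_lapack*U_lapack^T|."""
    AAT = A @ A.T
    # Exact reference: top-k eigenpairs of A*A^T (eigh sorts ascending).
    w, V = np.linalg.eigh(AAT)
    w, V = w[::-1][:k], V[:, ::-1][:, :k]
    exact = np.linalg.norm(AAT - V @ np.diag(w) @ V.T)      # Frobenius norm
    approx = np.linalg.norm(AAT - U @ np.diag(s**2) @ U.T)
    return approx / exact

# The all-zero "baseline" corresponds to setting U = 0, i.e. |A*A^T| / exact.
A = np.random.default_rng(1).standard_normal((200, 500))
for q in (0, 1, 2, 4, 6):
    U, s, _ = rand_svd(A, k=10, n_iter=q)
    print(q, relative_error(A, U, s, k=10))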