> From: Ted Dunning <[email protected]>
> Subject: Re: SVD in Mahout (was: Mahout Lanczos SVD complexity)
> Date: 19.12.2011 16:58:57
> ----------------------------------------
> The users of SVD on Mahout are more interested in their own application
> level metrics than the metrics associated with the decomposition itself.
>  Moreover, at the scale people have been working with, getting any
> decomposition at all is refreshing.


Heh. Is that the official Mahout philosophy for other algorithms as well? "Who 
cares about correctness, we are happy we can run it on some data at all, so 
shut up?" I hope you're not serious, Ted.

Aren't you afraid people will draw wrong conclusions about an SVD application, 
using your (possibly wildly inaccurate) SVD implementation? Retract 
publications?

By all means, use whatever decomposition suits you. But SVD already has a 
well-established meaning in linear algebra and using that acronym comes with 
certain expectations. People unfamiliar with the pitfalls of your 
implementation may assume they're really getting SVD (or at least a version 
that's "reasonably close" -- in the numerical computing sense). A big fat 
accuracy warning is in order here. Nobody expects more-or-less random vectors, 
even if these happen to perform better than the real truncated SVD in your app 
[citation needed].

> The examples that you gave in your thread involved walking *way* down the
> spectrum to the smaller singular values.  That is absolutely not the
> interest with most Mahout users because that would involve fairly massive
> over-fitting.


Too many opinions, too little data. Instead, I decided to run the English 
Wikipedia experiments with factors=10 and oversampling=5, as per your concerns.

(cross-posting to the gensim mailing list, as this might be of interest to 
gensim users as well)
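
For the gensim readers, a minimal sketch of what such a run looks like with 
gensim's LsiModel. The file names are hypothetical; extra_samples and 
power_iters are the knobs corresponding to the oversampling and power 
iteration steps discussed here:

    from gensim import corpora, models

    # assumed inputs: a serialized bag-of-words corpus plus its dictionary
    corpus = corpora.MmCorpus('enwiki_bow.mm')
    id2word = corpora.Dictionary.load_from_text('enwiki_wordids.txt')

    # 10 factors, oversampling of 5, two power iteration steps
    lsi = models.LsiModel(corpus, id2word=id2word, num_topics=10,
                          extra_samples=5, power_iters=2)
    print(lsi.projection.s)  # the 10 recovered singular values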

Data: English Wikipedia as term-document matrix (0.42B non-zeroes, 3.5M 
documents, 10K features).
Requesting the top 10 factors (1% of the total singular value mass), not 500 
factors like before (15% of the total mass). Accuracy is evaluated by comparing 
the reconstruction error against LAPACK's in-core DSYEV routine on A*A^T: 
error = |A*A^T - U*S^2*U^T| / |A*A^T - U_lapack*Lambda_lapack*U_lapack^T|, 
where Lambda_lapack holds the exact eigenvalues of A*A^T (i.e., the squared 
singular values).
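
In case anyone wants to reproduce the metric on their own data, here is a 
minimal NumPy sketch of that error ratio (a small dense A is assumed, so the 
exact decomposition fits in core; the function and variable names are mine, 
and the choice of Frobenius norm is my assumption):

    import numpy as np

    def error_ratio(A, U, s):
        """|A*A^T - U*S^2*U^T| / |A*A^T - U_lapack*Lambda_lapack*U_lapack^T|."""
        B = A @ A.T
        lam, U_exact = np.linalg.eigh(B)  # exact in-core eigendecomposition
        k = len(s)
        top = np.argsort(lam)[::-1][:k]   # indices of the k largest eigenvalues
        exact = (U_exact[:, top] * lam[top]) @ U_exact[:, top].T
        approx = (U * s**2) @ U.T
        # Frobenius norms; a ratio of 1.0 means "as accurate as the exact
        # truncated decomposition"
        return np.linalg.norm(B - approx) / np.linalg.norm(B - exact)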

batch algo     | error
---------------+---------
baseline*      | 1.986
0 power iters  | 1.877
1 power iter   | 1.094
2 power iters  | 1.009
4 power iters  | 1.0005
6 power iters  | 1.00009

(*) baseline = all factors zero; see below.

The results are completely in line with Martinsson et al. [1] as well as with 
my previous experiments: no power iteration steps plus massive truncation = 
rubbish output. Accuracy improves exponentially with the number of iteration 
steps (but see my initial warning about numerical issues with a higher number 
of steps if implemented naively).
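
For completeness, the standard fix for those numerical issues is to 
re-orthonormalize between the individual multiplications of the power 
iteration, rather than forming (A*A^T)^q * Omega explicitly. A sketch of the 
whole pipeline, after [1] and the related randomized-SVD literature -- my own 
minimal version, not Mahout's or gensim's actual code:

    import numpy as np

    def ssvd(A, k, oversampling=5, power_iters=2, seed=0):
        """Randomized truncated SVD with stabilized power iterations."""
        rng = np.random.default_rng(seed)
        # a random projection captures an approximate basis for the range of A
        Y = A @ rng.standard_normal((A.shape[1], k + oversampling))
        Q, _ = np.linalg.qr(Y)
        for _ in range(power_iters):
            # the naive variant multiplies by (A*A^T) directly and loses the
            # smaller singular directions to round-off; a QR step after each
            # product keeps the basis orthonormal and numerically healthy
            Q, _ = np.linalg.qr(A.T @ Q)
            Q, _ = np.linalg.qr(A @ Q)
        # project A into the subspace and decompose the small matrix exactly
        U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
        return (Q @ U_small)[:, :k], s[:k], Vt[:k]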

So your worry that the SVD inaccuracy is merely an artifact of asking for too 
many factors, and therefore irrelevant for thinner SVDs, is without substance. 
Your users certainly deserve to know that without power iterations, the SVD 
output is on par with the baseline -- a "decomposition" where all factors are 
simply zero.

From all the dev replies here -- no users actually replied -- I get the vibe 
that the accuracy discussion annoys you. Now, I dropped by to give a friendly 
hint about possible serious accuracy concerns, based on experience with 
mid-scale (billions of non-zeroes) SVD computations in one specific domain 
(term-document matrices in NLP), and possibly to learn about your issues on 
tera-feature scale datasets in return, which I'm very interested in. 
Apparently neither of us is getting anything out of this, so I'll stop here.

Best,
Radim

[1] http://www.sci.ccny.cuny.edu/~szlam/npisvdnipsshort.pdf



>
> 2011/12/19 Radim Rehurek <[email protected]>
>
> > No problem.
> >
> > If you decide to run some SSVD accuracy experiments in the future (don't
> > your users ask for that?), please include me in cc -- just in case I miss
> > the post on this mailing list.
> >
