By the way, are there similar experimental results for performance/accuracy of SVD in Mahout?
I believe there are several SVD implementations available in Mahout (Lanczos/randomized, distributed/local). I would be interested in seeing a comparison of their speed and accuracy on some reasonably large public dataset.

Cheers,
Radim


> ------------ Original message ------------
> From: Radim Rehurek <[email protected]>
> Subject: Re: Mahout Lanczos SVD complexity
> Date: 18.12.2011 09:41:23
> ----------------------------------------
> Hi Ted,
>
> appreciate your comments!
>
> > You are correct that the randomized projection methods are of limited
> > accuracy when the singular values are not decreasing rapidly, but I think
> > that you were asking for too many singular values (500) from a relatively
> > small matrix to make much sense. My argument for saying this is that the
> > singular values at this point have become quite small and are likely to be
> > modeling the randomized variation in your corpus rather than the systematic
> > patterns of occurrence. As such, the "accuracy" of the decomposition is of
> > little interest.
>
> Agreed that the small factors are modelling random variations and could likely
> be replaced by random direction vectors (of comparable magnitude).
>
> The 500 is a common "gold-standard" dimensionality used in Latent Semantic
> Indexing (one of the applications of SVD), and users explicitly ask for SVD
> accuracy -- so there it is, hard numbers :)
> Also note that a few million documents, with a vocabulary of a few 10k-100k,
> is by far the most common use-case for gensim users. That's why I picked the
> English Wikipedia to test on. If the use-cases of Mahout SVD target millions
> of features on billions of documents, YMMV.
>
> > It may well be that using a high number of singular values increases
> > apparent accuracy in your application over using a smaller number, but it
> > is likely that augmenting a small decomposition with random vectors instead
> > of true singular vectors would provide virtually identical performance.
> > As such, computing these extra singular vectors more or less accurately is
> > hardly of the essence.
>
> > In order to determine whether this is the case, the interesting comparison
> > to make is whether you get higher application accuracy on held-out data
> > from, say, 100 singular vectors + 400 random vectors, or from your 500
> > singular vectors. It is common with LSI-like applications to test 100
> > singular vectors against 500 singular vectors, but that does not really
> > measure how much information is being extracted in the singular
> > decomposition.
>
> But those tests didn't measure application accuracy! (though I very much urge
> users to do that). They simply measure SVD reconstruction error (not LSI
> retrieval error or anything).
>
> Again, I completely agree that real app accuracy beats some theoretical
> nuisance metric (such as reconstruction error in the Frobenius norm).
>
> Best,
> Radim
>
> > Regarding the accuracy of the decomposition, your choice of k is well below
> > the bound suggested by the theoretical considerations in the Halko et al.
> > paper. For small matrices with the size measured in thousands or tens of
> > thousands and very tight desired accuracies, this bound is not all that
> > helpful and it isn't surprising that you may have some problems.
>
> > In any case, I think that your experience is divergent enough from the
> > applications being considered in Mahout that conclusions should be drawn
> > with some caution.
>
> > On Sat, Dec 17, 2011 at 10:00 PM, Radim Rehurek <[email protected]> wrote:
>
> > > Ted Dunning wrote:
> > > > Also, it isn't entirely clear yet whether power iterations are more
> > > > efficient than simply increasing the fudge factor p. Power iterations
> > > > are very effective, and increasing p increases costs in the cube, but
> > > > running MR passes is expensive enough that some increase in p might be
> > > > sufficient and still faster than a power iteration.
> > > Don't even think about it -- on large matrices (significant truncation),
> > > the randomized SVD without power iterations is complete rubbish,
> > > regardless of any sane value of `p`.
> > >
> > > (Assuming your fudge factor p = the oversampling parameter `p` from
> > > Halko et al.)
> > >
> > > You can see some accuracy experiments (Wikipedia, images) here:
> > > http://groups.google.com/group/gensim/msg/1f3106b85837ce9c and here:
> > > http://groups.google.com/group/gensim/msg/240a348b70f18b30 .
> > >
> > > Also see the end of that (longish) thread for some notes on accuracy in
> > > the face of many power iterations (~ beware of numeric overflows).
> > >
> > > Best,
> > > Radim
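[Editor's note: the power-iteration scheme under discussion is easy to sketch in plain NumPy. The following is a minimal illustration of the Halko et al. approach, not the Mahout or gensim implementation; `p` is the oversampling parameter and `power_iters` the number of power iterations debated above.]

```python
import numpy as np

def randomized_svd(A, k, p=10, power_iters=0, seed=0):
    """Truncated SVD via random projection (after Halko et al.).

    k: number of singular triplets requested; p: oversampling;
    power_iters: number of power iterations, which are crucial on
    matrices whose singular values decay slowly.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Sample the range of A with a Gaussian test matrix.
    Y = A @ rng.standard_normal((n, k + p))
    # Power iterations sharpen the captured spectrum; re-orthonormalizing
    # with QR between multiplies guards against the numeric overflow
    # mentioned at the end of the thread.
    for _ in range(power_iters):
        Y, _ = np.linalg.qr(Y)
        Y, _ = np.linalg.qr(A.T @ Y)
        Y = A @ Y
    Q, _ = np.linalg.qr(Y)
    # Exact SVD of the small projected matrix B = Q^T A.
    U_hat, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_hat)[:, :k], s[:k], Vt[:k]
```

On a test matrix with a slowly decaying spectrum, `power_iters=2` gives a visibly smaller reconstruction error than `power_iters=0` at the same `p`, which is the qualitative point made in the thread.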

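[Editor's note: for completeness, the "theoretical nuisance metric" referred to above -- SVD reconstruction error in the Frobenius norm -- can be computed directly. A minimal NumPy sketch with an exact truncated SVD as the reference point; the function names are illustrative, not from either codebase.]

```python
import numpy as np

def rel_frobenius_error(A, U, s, Vt):
    """Relative reconstruction error ||A - U diag(s) V^T||_F / ||A||_F."""
    return np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)

def truncated_svd(A, k):
    """Exact rank-k SVD, the gold standard a randomized method is scored against."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k]
```

By the Eckart-Young theorem, the exact rank-k truncation is optimal in this norm, and its error equals the norm of the discarded tail of singular values -- which is why, when that tail decays slowly, even a large k leaves a substantial error.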