Hi Ted, appreciate your comments!
> You are correct that the randomized projection methods are of limited
> accuracy when the singular values are not decreasing rapidly, but I think
> that you were asking for too many singular values (500) from a relatively
> small matrix to make much sense. My argument for saying this is that the
> singular values at this point have become quite small and are likely to be
> modeling the randomized variation in your corpus rather than the systematic
> patterns of occurrence. As such, the "accuracy" of the decomposition is of
> little interest.

Agreed that the small factors are modelling random variations and could likely be replaced by random direction vectors (of comparable magnitude). The 500 is a common "gold-standard" dimensionality used in Latent Semantic Indexing (one of the applications of SVD), and users explicitly ask for SVD accuracy -- so there it is, hard numbers :)

Also note that a few million documents, with a vocabulary of 10k-100k terms, is by far the most common use case for gensim users. That's why I picked the English Wikipedia to test on. If the use cases of Mahout SVD target millions of features on billions of documents, YMMV.

> It may well be that using a high number of singular values increases
> apparent accuracy in your application over using a smaller number, but it
> is likely that augmenting a small decomposition with random vectors instead
> of true singular vectors would provide virtually identical performance. As
> such, computing these extra singular vectors more or less accurately is
> hardly of the essence.
>
> In order to determine whether this is the case, the interesting comparison
> to make is whether you get higher application accuracy on held-out data
> from, say, 100 singular vectors + 400 random vectors or your 500 singular
> vectors. It is common with LSI-like applications to test 100 singular
> vectors against 500 singular vectors, but that does not really measure how
> much information is being extracted in the singular decomposition.

But those tests didn't measure application accuracy! (though I very much urge users to do that). They simply measure SVD reconstruction error (not LSI retrieval error or anything). Again, I completely agree that real app accuracy beats some theoretical nuisance metric (such as reconstruction error in the Frobenius norm).
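To make "reconstruction error" concrete, here is a toy NumPy sketch of the metric (relative error in the Frobenius norm). This is not the actual benchmark code behind the numbers in those threads; the matrix size, the rank k and the rel_error helper are made up purely for illustration:

    # Toy illustration of truncated-SVD reconstruction error, measured as
    # relative error in the Frobenius norm. Small dense matrix only; the
    # real corpora are sparse and much larger.
    import numpy as np

    np.random.seed(0)
    A = np.random.randn(2000, 500)    # stand-in for a term-document matrix
    k = 100                           # number of requested factors

    # exact SVD as the reference point
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    def rel_error(A, U_hat, s_hat, Vt_hat):
        """Relative Frobenius-norm error of an approximate truncated SVD."""
        return np.linalg.norm(A - np.dot(U_hat * s_hat, Vt_hat)) / np.linalg.norm(A)

    # error of the optimal rank-k approximation (the best any method can do)
    print(rel_error(A, U[:, :k], s[:k], Vt[:k]))

Plug the U/s/V^T coming out of whatever approximate solver you like into rel_error() in place of the exact factors and compare the two numbers.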
Best,
Radim

> Regarding the accuracy of the decomposition, your choice of k is well below
> the bound suggested by the theoretical considerations in the Halko et al.
> paper. For small matrices with the size measured in thousands or tens of
> thousands and very tight desired accuracies, this bound is not all that
> helpful and it isn't surprising that you may have some problems.
>
> In any case, I think that your experience is divergent enough from the
> applications being considered in Mahout that conclusions should be drawn
> with some caution.
>
> On Sat, Dec 17, 2011 at 10:00 PM, Radim Rehurek <[email protected]> wrote:
>
> > Ted Dunning wrote:
> > > Also, it isn't entirely clear yet whether power iterations are more
> > > efficient than simply increasing the fudge factor p. Power iterations are
> > > very effective, and increasing p increases costs in the cube, but running
> > > MR passes is expensive enough that some increase in p might be sufficient
> > > and still faster than a power iteration.
> >
> > Don't even think about it -- on large matrices (significant truncation),
> > the randomized SVD without power iterations is complete rubbish, regardless
> > of any sane value of `p`.
> >
> > (assuming your fudge factor p = the oversampling parameter `p` from Halko
> > et al.)
> >
> > You can see some accuracy experiments (wikipedia, images) here:
> > http://groups.google.com/group/gensim/msg/1f3106b85837ce9c and here
> > http://groups.google.com/group/gensim/msg/240a348b70f18b30 .
> >
> > Also see the end of that (longish) thread for some notes on accuracy in the
> > face of many power iterations (~ beware of numeric overflows).
> >
> > Best,
> > Radim
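PS: for anyone who hasn't waded through the Halko et al. paper, below is a rough single-machine sketch of the algorithm we keep referring to, just to show where the oversampling `p` and the power iterations enter. Plain NumPy, purely illustrative -- it is neither gensim's implementation nor Mahout's MR one, and it skips the re-orthonormalization between power steps that a careful implementation needs (repeatedly multiplying by A*A^T without it is how you run into the overflow issues mentioned above):

    # Rough sketch of randomized truncated SVD (Halko, Martinsson & Tropp).
    # k = requested rank, p = oversampling, q = power iterations.
    # Illustration only: no re-orthonormalization between power steps, no
    # sparse/streamed input, nothing distributed.
    import numpy as np

    def randomized_svd(A, k, p=10, q=0):
        m, n = A.shape
        # sample the range of A with a Gaussian test matrix of k+p columns
        Omega = np.random.randn(n, k + p)
        Y = np.dot(A, Omega)
        # power iterations: Y = (A A^T)^q A Omega, sharpens a flat spectrum
        for _ in range(q):
            Y = np.dot(A, np.dot(A.T, Y))
        # orthonormal basis of the sampled range
        Q, _ = np.linalg.qr(Y)
        # project A into the basis and decompose the small (k+p) x n matrix
        B = np.dot(Q.T, A)
        U_b, s, Vt = np.linalg.svd(B, full_matrices=False)
        U = np.dot(Q, U_b)
        return U[:, :k], s[:k], Vt[:k]

With q=0 you get the pure oversampling variant; even q=1 or q=2 makes a big difference when the singular values decay slowly.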
