Hi Ted,

appreciate your comments!

> You are correct that the randomized projection methods are of limited
> accuracy when the singular values are not decreasing rapidly, but I think
> that you were asking for too many singular values (500) from a relatively
> small matrix to make much sense.  My argument for saying this is that the
> singular values at this point have become quite small and are likely to be
> modeling the randomized variation in your corpus rather than the systematic
> patterns of occurrence.  As such, the "accuracy" of the decomposition is of
> little interest.

Agreed that the small factors are modelling random variations and could likely 
be replaced by random direction vectors (of comparable magnitude). 
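
To illustrate what I mean, here's a toy numpy sketch (the matrix A, k=100 and
extra=400 are made-up numbers, not anything from gensim or Mahout): keep the
first k true singular vectors and pad the basis with random directions scaled
to the magnitude of the discarded tail.

    import numpy as np

    rng = np.random.RandomState(0)
    A = rng.rand(2000, 1000)              # toy term-document-like matrix

    k, extra = 100, 400
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    U_k = U[:, :k]                        # the "real" directions
    R = rng.randn(A.shape[0], extra)      # random directions ...
    R -= U_k.dot(U_k.T.dot(R))            # ... orthogonalized against U_k
    Q, _ = np.linalg.qr(R)
    U_aug = np.hstack([U_k, Q])           # 100 true + 400 random basis vectors

    # give the random part magnitudes comparable to the discarded tail
    s_aug = np.concatenate([s[:k], np.full(extra, s[k:k + extra].mean())])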

The 500 is a common "gold-standard" dimensionality used in Latent Semantic 
Indexing (one of the applications of SVD), and users explicitly ask for SVD 
accuracy -- so there it is, hard numbers :)
Also note that a few million documents, with a vocabulary in the tens to
hundreds of thousands of terms, is by far the most common use case for gensim
users. That's why I picked the English Wikipedia to test on. If the use cases
for Mahout's SVD target millions of features over billions of documents, YMMV.
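
(For reference, this is roughly how gensim users run that 500-dim LSI on the
wikipedia dump -- the file names below are placeholders for whatever the
preprocessing step produced, and power_iters/extra_samples may have different
defaults depending on the gensim version:)

    from gensim import corpora, models

    # dictionary + TF-IDF corpus produced by the wikipedia preprocessing
    id2word = corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
    mm = corpora.MmCorpus('wiki_en_tfidf.mm')

    # num_topics=500 is the "gold-standard" LSI dimensionality; power_iters
    # and extra_samples trade accuracy of the stochastic SVD for extra work
    lsi = models.LsiModel(mm, id2word=id2word, num_topics=500,
                          power_iters=2, extra_samples=100)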


> It may well be that using a high number of singular values increases
> apparent accuracy in your application over using a smaller number, but it
> is likely that augmenting a small decomposition with random vectors instead
> of true singular vectors would provide virtually identical performance.  As
> such, computing these extra singular vectors more or less accurately is
> hardly of the essence.
> 
> In order to determine whether this is the case, the interesting comparison
> to make is whether you get higher application accuracy on held out data
> from, say, 100 singular vectors + 400 random vectors or your 500 singular
> vectors.  It is common with LSI-like applications to test 100 singular
> vectors against 500 singular vectors, but that does not really measure how much
> information is being extracted in the singular decomposition.

But those tests didn't measure application accuracy! (though I very much urge 
users to do that). They simply measured SVD reconstruction error (not LSI 
retrieval error or anything else).

Again, I completely agree that real application accuracy beats some theoretical 
nuisance metric (such as reconstruction error in the Frobenius norm).
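
(To be concrete, the "reconstruction error" I mean above is just the relative
Frobenius-norm error of the truncated factors; a plain numpy sketch, not the
actual gensim/Mahout code paths -- A, U, s, Vt are whatever your decomposition
produced:)

    import numpy as np

    def rel_frobenius_error(A, U, s, Vt):
        """|| A - U diag(s) Vt ||_F / ||A||_F for a truncated SVD."""
        return np.linalg.norm(A - U.dot(np.diag(s)).dot(Vt)) / np.linalg.norm(A)

    # e.g. exact rank-k truncation vs. a randomized rank-k decomposition:
    # rel_frobenius_error(A, U[:, :k], s[:k], Vt[:k, :])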


Best,
Radim


> Regarding the accuracy of the decomposition, your choice of k is well below
> the bound suggested by the theoretical considerations in the Halko, et al,
> paper.  For small matrices with the size measured in thousands or tens of
> thousands and very tight desired accuracies, this bound is not all that
> helpful and it isn't surprising that you may have some problems.
> 
> In any, case I think that your experience is divergent enough from the
> applications being considered in Mahout, that conclusions should be drawn
> with some caution.
> 
> On Sat, Dec 17, 2011 at 10:00 PM, Radim Rehurek <[email protected]>wrote:
> 
> > Ted Dunning wrote:
> > > Also, it isn't entirely clear yet whether power iterations are more
> > > efficient than simply increasing the fudge factor p. Power iterations are
> > > very effective, and increasing p increases costs in the cube, but running
> > > MR passes is expensive enough that some increase in p might be sufficient
> > > and still faster than a power iteration.
> >
> > Don't even think about it -- on large matrices (significant truncation),
> > the randomized SVD without power iterations is complete rubbish, regardless
> > of any sane value of `p`.
> >
> > (assuming your fudge factor p = oversampling parameter `p` from Halko et
> > al.).
> >
> > You can see some accuracy experiments (wikipedia, images) here:
> > http://groups.google.com/group/gensim/msg/1f3106b85837ce9c and here
> > http://groups.google.com/group/gensim/msg/240a348b70f18b30 .
> >
> > Also see the end of that (longish) thread for some notes on accuracy in
> > face of many power iterations (~ beware of numeric overflows).
> >
> > Best,
> > Radim
> >
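
PS -- for anyone who wants to reproduce the power-iteration comparison from
the quoted discussion, here is a minimal numpy sketch of the Halko et al.
scheme (k, p, q and everything else here are illustrative; this is not
Mahout's or gensim's implementation):

    import numpy as np

    def randomized_svd(A, k, p=10, q=0):
        """Rank-k randomized SVD with oversampling p and q power iterations."""
        rng = np.random.RandomState(0)
        Y = A.dot(rng.randn(A.shape[1], k + p))   # sample the range of A
        for _ in range(q):                        # power iterations
            Y, _ = np.linalg.qr(Y)                # re-orthonormalize: avoids the
            Y = A.dot(A.T.dot(Y))                 # overflow issue mentioned above
        Q, _ = np.linalg.qr(Y)                    # orthonormal basis for range(A)
        B = Q.T.dot(A)                            # small projected matrix
        Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
        return Q.dot(Ub)[:, :k], s[:k], Vt[:k, :]

    # q=0 (no power iterations) vs. q=1 or 2 shows the accuracy gap on
    # matrices with slowly decaying spectra, no matter how large you make p.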