Hello Dmitriy,

> ------------ Original message ------------
> From: Dmitriy Lyubimov <[email protected]>
> Subject: Re: SVD in Mahout (was: Mahout Lanczos SVD complexity)
> Date: 18.12.2011 22:36:11
> ----------------------------------------
> >> The 500 is a common "gold-standard" dimensionality used in Latent Semantic
> >> Indexing (one of the applications of SVD), and users explicitly ask for SVD
> >> accuracy -- so there it is, hard numbers :)
> >> Also note that a few million documents, with a few 10k-100k vocabulary, is by
> >> far the most common use-case for gensim users. That's why I picked the English
> >> wikipedia to test on. If use-cases of Mahout SVD target millions of features on
> >> billions of documents, YMMV.
>
> If I remember it correctly, Dumais and Deerwester were speaking of
> ~200 s.v. in their experiments.

Actually, there have been many experiments on choosing the optimal
dimensionality since then. See e.g. "Bradford, 2008: An empirical study of required
dimensionality for large-scale latent semantic indexing applications" or "Zha
et al. 1998: Large-scale SVD and subspace-based methods for information
retrieval" (which takes an MDL approach) for more info.


> As far as lsi is concerned, why would one be interested in that many?
> The measures you get are going to greatly depend on the corpus you are
> picking. So your solution for "topics" is biased to begin with. ( The
> mental model for it that i kind of like to think of is that every
> person would have a slightly different meaning of what "politeness"
> means, depending on his upbringing and experience, i.e. on his
> personal "training corpus") .


Not sure what you mean by "my solution". Latent semantic analysis was developed 
in the 80s (not by me). For whatever reason, it still seems to be popular, and 
people ask for tools that implement LSA efficiently. I kinda thought perhaps 
some people used the SVD in Mahout for similar goals, that's why I brought it 
up.

Also, the discussion is about truncated SVD accuracy. This can be measured down 
to machine precision -- no need to resort to opinions :) I'm genuinely curious 
about the Mahout implementation, whatever your domain of application is (incl. 
rocket boosters).
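
To make "measured" concrete, here is a minimal sketch of one such check. It
assumes NumPy and uses a small random dense matrix as a stand-in for a real
term-document matrix; the matrix size, k, and the helper name
truncated_svd_error are illustrative only and are not taken from gensim or
Mahout. It compares the rank-k reconstruction error against the optimum given
by the Eckart-Young theorem, i.e. the norm of the discarded singular values:

```python
import numpy as np


def truncated_svd_error(A, U, s, Vt):
    """Relative Frobenius error ||A - U diag(s) Vt||_F / ||A||_F."""
    return np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)


# Hypothetical test matrix; a real check would use the actual corpus matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
k = 10

# Reference decomposition from LAPACK, truncated to the top k factors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
err = truncated_svd_error(A, U[:, :k], s[:k], Vt[:k])

# Eckart-Young: no rank-k approximation can beat the energy in the
# discarded singular values.
best = np.sqrt(np.sum(s[k:] ** 2)) / np.linalg.norm(A)

# For an exact truncated SVD this gap sits at round-off level; for an
# approximate algorithm, the same gap quantifies the loss of accuracy.
gap = abs(err - best)
print(f"rank-{k} error: {err:.6f}, optimum: {best:.6f}, gap: {gap:.2e}")
```

Swapping the factors U, s, Vt for the output of any other implementation (at
the same k) would give a directly comparable number.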

Best,
Radim


> So in many cases data is rather biased to begin with. that's why lsi
> is not the same as trying to compute geometry of a rocket booster.
>
>
>
