Or, in terms of pragmatic problems: if I work in the computer industry
and want LSI to figure out that "java coffee" and "java code" are
completely orthogonal concepts despite the common term, I just throw in
a mixture of texts mentioning both uses, and as long as it tells me
those are different things with a high degree of confidence, I don't
care about the absolute value of that confidence. Which it does.
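For instance, here is a minimal sketch using gensim (the library named in
the quoted message below). The toy documents, the choice of num_topics=2,
and the query string are illustrative assumptions for this thread, not
anything from an actual experiment:

# Toy illustration: train LSI on a small mixed corpus and check that the
# two senses of "java" separate, in *relative* terms.
from gensim import corpora, models, similarities

docs = [
    "java coffee beans roast brew espresso",
    "drink java coffee every morning brew",
    "java code compiles to jvm bytecode",
    "write java code with classes and interfaces",
]
texts = [d.split() for d in docs]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

# Two topics suffice for a two-sense toy example; real corpora need more.
lsi = models.LsiModel(bow_corpus, id2word=dictionary, num_topics=2)

index = similarities.MatrixSimilarity(lsi[bow_corpus],
                                      num_features=lsi.num_topics)
query = lsi[dictionary.doc2bow("java coffee".split())]

# The point is the relative ordering: coffee documents should score well
# above code documents; the absolute cosine values are beside the point.
for doc, score in sorted(zip(docs, index[query]), key=lambda x: -x[1]):
    print("%+.3f  %s" % (score, doc))

The printout matters only in its ordering (coffee docs above code docs);
that is the "relative, not absolute, confidence" claim in practice.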

On Sun, Dec 18, 2011 at 1:35 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> The 500 is a common "gold-standard" dimensionality used in Latent Semantic
>>> Indexing (one of the applications of SVD), and users explicitly ask for SVD
>>> accuracy -- so there it is, hard numbers :)
>>> Also note that a few million documents, with a few 10k-100k vocabulary, is
>>> by far the most common use-case for gensim users. That's why I picked the
>>> English Wikipedia to test on. If use-cases of Mahout SVD target millions
>>> of features on billions of documents, YMMV.
>
> If I remember correctly, Dumais and Deerwester were speaking of ~200
> singular values in their experiments.
>
> As far as LSI is concerned, why would one be interested in that many?
> The measures you get are going to depend heavily on the corpus you
> pick, so your solution for "topics" is biased to begin with. (The
> mental model I like for this is that every person has a slightly
> different meaning of "politeness", depending on their upbringing and
> experience, i.e. on their personal "training corpus".)
>
> So in many cases the data is rather biased to begin with. That's why
> LSI is not the same as trying to compute the geometry of a rocket
> booster.
