>> The 500 is a common "gold-standard" dimensionality used in Latent Semantic
>> Indexing (one of the applications of SVD), and users explicitly ask for SVD
>> accuracy -- so there it is, hard numbers :)
>> Also note that a few million documents, with a few 10k-100k vocabulary, is by
>> far the most common use-case for gensim users. That's why I picked the English
>> wikipedia to test on. If use-cases of Mahout SVD target millions of features
>> on billions of documents, YMMV.
If I remember correctly, Dumais and Deerwester were speaking of ~200 singular values in their experiments. As far as LSI is concerned, why would one be interested in that many? The measures you get are going to depend heavily on the corpus you pick, so your solution for "topics" is biased to begin with. (The mental model I like for this is that every person has a slightly different sense of what "politeness" means, depending on their upbringing and experience, i.e. on their personal "training corpus".) So in many cases the data is biased to begin with; that's why LSI is not the same as computing the geometry of a rocket booster.
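For concreteness, here is a minimal gensim sketch (the toy corpus is made up, purely illustrative) showing that the number of retained singular values is just the num_topics parameter, so comparing ~200 against 500 is a one-line change:

    from gensim import corpora, models

    # toy corpus just to make the snippet runnable; real inputs would be
    # millions of documents with a 10k-100k vocabulary
    texts = [["human", "computer", "interaction"],
             ["graph", "minors", "trees"],
             ["human", "trees", "graph"]]
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # ~200 factors, the range Dumais & Deerwester reported; on this toy
    # corpus the effective rank caps how many are actually retained
    lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=200)
    print(lsi.projection.s)  # the retained singular values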
