Or, in terms of practical problems: if I work in the computer industry and want LSI to figure out that "java coffee" and "java code" are completely orthogonal concepts despite the common term present, I just throw in a mixture of texts mentioning both uses, and as long as it tells me those are different things with a high degree of confidence, I don't care about the absolute value of that confidence. Which it does.
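(A minimal sketch of the point above, not anyone's actual experiment: build a tiny term-document matrix over made-up "coffee" and "code" snippets, run a truncated SVD -- the core of LSI -- and check that the two senses of "java" land in clearly different directions of the latent space even though they share the term.)

```python
import numpy as np

# Hypothetical toy corpus: two documents per sense of "java".
docs = [
    "java coffee beans roast brew",
    "java coffee brew espresso",
    "java code compile class method",
    "java code class runtime method",
]
vocab = sorted({w for d in docs for w in d.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Term-document count matrix A (terms x documents).
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[idx[w], j] += 1

# Rank-2 truncated SVD: project documents into a 2-D latent space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one row per document

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same_topic = cos(doc_vecs[0], doc_vecs[1])   # coffee vs coffee
cross_topic = cos(doc_vecs[0], doc_vecs[2])  # coffee vs code
print(same_topic, cross_topic)
```

The exact similarity values depend on the corpus, which is the thread's point; what matters is only that same-sense documents come out far more similar than cross-sense ones.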
On Sun, Dec 18, 2011 at 1:35 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> The 500 is a common "gold-standard" dimensionality used in Latent Semantic
>>> Indexing (one of the applications of SVD), and users explicitly ask for SVD
>>> accuracy -- so there it is, hard numbers :)
>>> Also note that a few million documents, with a few 10k-100k vocabulary, is
>>> by far the most common use-case for gensim users. That's why I picked the
>>> English wikipedia to test on. If use-cases of Mahout SVD target millions of
>>> features on billions of documents, YMMV.
>
> If I remember it correctly, Dumais and Deerwester were speaking of
> ~200 singular values in their experiments.
>
> As far as LSI is concerned, why would one be interested in that many?
> The measures you get are going to greatly depend on the corpus you are
> picking, so your solution for "topics" is biased to begin with. (The
> mental model I like to think of is that every person has a slightly
> different meaning of what "politeness" means, depending on his
> upbringing and experience, i.e. on his personal "training corpus".)
>
> So in many cases the data is rather biased to begin with. That's why LSI
> is not the same as trying to compute the geometry of a rocket booster.
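(A hedged sketch of how the "200 vs 500 singular values" question is often approached in practice: look at the singular value spectrum and ask how many components are needed to capture most of its energy. The matrix below is a synthetic low-rank-plus-noise stand-in for a term-document matrix, not any real corpus.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: low-rank "signal" plus small noise, so the
# spectrum has a clear elbow (real term-doc matrices decay more slowly).
m, n, true_rank = 300, 200, 20
signal = rng.standard_normal((m, true_rank)) @ rng.standard_normal((true_rank, n))
A = signal + 0.1 * rng.standard_normal((m, n))

# Full singular value spectrum, then cumulative energy fraction.
s = np.linalg.svd(A, compute_uv=False)
energy = np.cumsum(s**2) / np.sum(s**2)

# Smallest k capturing 95% of the spectrum's energy.
k = int(np.searchsorted(energy, 0.95)) + 1
print(k)
```

On a matrix like this the elbow sits near the true rank, so a handful of components suffice; on a biased, slowly decaying real corpus the cutoff is far murkier, which is the quoted objection to treating any fixed dimensionality as a gold standard.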
