As far as accuracy goes, I only ran experiments on smallish inputs that I could hold in memory (so I had full control of the setup) and that had a good decay in the singular values, and the results were quite good even with a single power iteration. That is not the same as running on completely random input, which I also tried, and there, sure enough, the problems were evident.
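For concreteness, here is a small numpy sketch of that kind of experiment: a randomized range finder with a single power iteration against a matrix whose spectrum is known and decays nicely. The sizes, oversampling and decay rate here are arbitrary illustrative choices, not what I actually ran.

```python
import numpy as np

# Toy check of randomized SVD with a single power iteration.
# All sizes and the decay rate are arbitrary, purely for illustration.
rng = np.random.default_rng(0)
m, n, k, p = 2000, 500, 10, 10          # rows, cols, rank wanted, oversampling

# Build a matrix with a known, reasonably fast singular value decay.
sv = 0.7 ** np.arange(n)                # geometric decay of singular values
U, _ = np.linalg.qr(rng.standard_normal((m, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = (U * sv) @ V.T                      # A = U diag(sv) V^T

# Randomized range finder with one power iteration: Y = A (A^T A) Omega.
Omega = rng.standard_normal((n, k + p))
Y = A @ Omega
Y = A @ (A.T @ Y)                       # the single power iteration
Q, _ = np.linalg.qr(Y)

# Project onto the captured range and take the small SVD.
B = Q.T @ A
_, s_approx, _ = np.linalg.svd(B, full_matrices=False)

print("exact  :", sv[:k])
print("approx :", s_approx[:k])
```

With a decay like that, the leading singular values come back essentially exact; with a flat (random) spectrum they don't, which is the point above.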
However, if you are looking for trends, and the trends are there, you will find them. And if they are not, then the results are pretty useless regardless of how they are computed. I guess there are problems out there that try to get rid of low-frequency but high-amplitude noise (I think I saw a post from someone about that recently), but those require a modified approach anyway; direct SVD doesn't help much with them either.

On Sun, Dec 18, 2011 at 1:45 PM, Dmitriy Lyubimov <[email protected]> wrote:
> Or, in terms of pragmatic problems: if I work in the computer industry
> and want LSI to figure out that "java coffee" and "java code" are
> completely orthogonal concepts despite the common term, I just throw in
> a mixture of texts mentioning both uses, and as long as it tells me
> those are different things with a high degree of confidence, I don't
> care about the absolute value of that confidence. Which it does.
>
> On Sun, Dec 18, 2011 at 1:35 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>> The 500 is a common "gold-standard" dimensionality used in Latent Semantic
>>>> Indexing (one of the applications of SVD), and users explicitly ask for SVD
>>>> accuracy -- so there it is, hard numbers :)
>>>> Also note that a few million documents, with a few 10k-100k vocabulary, is by
>>>> far the most common use-case for gensim users. That's why I picked the English
>>>> wikipedia to test on. If use-cases of Mahout SVD target millions of features
>>>> on billions of documents, YMMV.
>>
>> If I remember correctly, Dumais and Deerwester were speaking of
>> ~200 singular values in their experiments.
>>
>> As far as LSI is concerned, why would one be interested in that many?
>> The measures you get are going to depend greatly on the corpus you are
>> picking, so your solution for "topics" is biased to begin with. (The
>> mental model I like to think of is that every person has a slightly
>> different notion of what "politeness" means, depending on his upbringing
>> and experience, i.e. on his personal "training corpus".)
>>
>> So in many cases the data is rather biased to begin with. That's why LSI
>> is not the same as trying to compute the geometry of a rocket booster.
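To make the "java coffee" vs. "java code" example above concrete, here is a toy sketch of that kind of check in gensim. The corpus is made up and far too small to be meaningful, and the parameters (two topics, a handful of documents) are illustrative only; the point is just the shape of the check, i.e. that the two senses of "java" end up clearly separated in the LSI space.

```python
from gensim import corpora, models, similarities

# A tiny made-up corpus: two documents about java-the-drink,
# two about java-the-language. Both groups share the term "java".
texts = [doc.lower().split() for doc in [
    "java coffee beans roast brew espresso cup",
    "morning coffee cup of java brew drink",
    "java code compiler class object method jvm",
    "java programming code jvm bytecode class library",
]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# LSI with 2 topics -- enough room for the two senses to separate.
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

# Cosine similarity of every document to every other one in the LSI space.
index = similarities.MatrixSimilarity(lsi[corpus])
for i, doc in enumerate(corpus):
    print(i, index[lsi[doc]])
# Expectation: docs 0 and 1 (coffee) score high against each other and low
# against docs 2 and 3 (code), and vice versa -- the exact numbers are beside
# the point, only the relative separation matters.
```

On a real corpus you would obviously use many more documents and topics; as said above, what matters is that the two uses come out as different things, not the absolute value of the similarity.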
