As far as accuracy goes, I only did experiments on small-ish inputs that I
could hold in memory (so I could verify things), and with a good decay in
singular values it was quite good even with a single power iteration. That
is not the same as running it on completely random inputs, which I also
did, and there, sure enough, the problems were evident.
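To make the decay point concrete, here is a minimal numpy sketch (a Halko-style randomized SVD toy, not the actual Mahout SSVD code; the test matrices and sizes are made up for illustration) comparing a single power iteration on a matrix with fast singular-value decay against a completely random one:

    import numpy as np

    def randomized_svd(A, k, n_power_iter=1, oversample=10, seed=None):
        """Randomized SVD sketch: range finding plus power iterations."""
        rng = np.random.default_rng(seed)
        m, n = A.shape
        Omega = rng.standard_normal((n, k + oversample))
        Y = A @ Omega
        for _ in range(n_power_iter):        # power iterations sharpen the subspace
            Y = A @ (A.T @ Y)
        Q, _ = np.linalg.qr(Y)               # orthonormal basis for the range of A
        B = Q.T @ A                           # small projected matrix
        Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
        return (Q @ Ub)[:, :k], s[:k], Vt[:k]

    rng = np.random.default_rng(0)
    m, n, k = 500, 300, 20

    # Matrix with fast singular-value decay (the "good" case described above).
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    A_decay = U @ np.diag(np.exp(-np.arange(n) / 10.0)) @ V.T

    # Completely random matrix: the spectrum is nearly flat, so the sketch struggles.
    A_rand = rng.standard_normal((m, n))

    for name, A in [("fast decay", A_decay), ("random", A_rand)]:
        exact = np.linalg.svd(A, compute_uv=False)[:k]
        _, approx, _ = randomized_svd(A, k, n_power_iter=1, seed=1)
        rel_err = np.max(np.abs(exact - approx) / exact)
        print(f"{name:10s} max relative s.v. error: {rel_err:.2e}")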

However, if you are looking for trends, and the trends are there, you will
find them. And if they are not, then the results are pretty useless
regardless of how they are computed.

I guess there are problems out there that try to get rid of low-frequency
but high-amplitude noise (I think I saw a post from someone recently), but
those require a modified approach anyway. Direct SVD doesn't help much
there either.

On Sun, Dec 18, 2011 at 1:45 PM, Dmitriy Lyubimov <[email protected]> wrote:
> Or, in terms of pragmatic problems: if I work in the computer industry
> and want LSI to figure out that "java coffee" and "java code" are
> completely orthogonal concepts despite the common term present, I just
> throw in a mixture of texts mentioning both uses, and as long as it tells
> me those are different things with a high degree of confidence, I don't
> care about the absolute value of that confidence. Which it does.
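For what it's worth, that quoted experiment can be sketched as a toy in a few lines of numpy (the mini-corpus and vocabulary below are made up purely for illustration and are not from gensim or Mahout): within-topic documents end up nearly parallel in the latent space, while the two senses of "java" end up far apart.

    import numpy as np

    vocab = ["java", "coffee", "roast", "brew", "code", "compiler", "jvm"]
    # Columns are documents: three about coffee, three about programming.
    docs = np.array([
        [1, 1, 1, 1, 1, 1],   # "java" appears in every document
        [2, 1, 1, 0, 0, 0],   # coffee
        [1, 2, 0, 0, 0, 0],   # roast
        [0, 1, 2, 0, 0, 0],   # brew
        [0, 0, 0, 2, 1, 1],   # code
        [0, 0, 0, 1, 2, 0],   # compiler
        [0, 0, 0, 0, 1, 2],   # jvm
    ], dtype=float)

    U, s, Vt = np.linalg.svd(docs, full_matrices=False)
    k = 2                                   # keep 2 latent "concepts"
    doc_lsi = (np.diag(s[:k]) @ Vt[:k]).T   # documents in latent space

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print("coffee vs coffee doc :", round(cos(doc_lsi[0], doc_lsi[1]), 3))
    print("code   vs code doc   :", round(cos(doc_lsi[3], doc_lsi[4]), 3))
    print("coffee vs code doc   :", round(cos(doc_lsi[0], doc_lsi[3]), 3))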
>
> On Sun, Dec 18, 2011 at 1:35 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>> The 500 is a common "gold-standard" dimensionality used in Latent Semantic
>>>> Indexing (one of the applications of SVD), and users explicitly ask for SVD
>>>> accuracy -- so there it is, hard numbers :)
>>>> Also note that a few million documents, with a few 10k-100k vocabulary, is
>>>> by far the most common use-case for gensim users. That's why I picked the
>>>> English wikipedia to test on. If use-cases of Mahout SVD target millions
>>>> of features on billions of documents, YMMV.
>>
>> If I remember it correctly, Dumais and Deerwester were speaking of
>> ~200 s.v. in their experiments.
>>
>> As far as LSI is concerned, why would one be interested in that many?
>> The measures you get are going to depend greatly on the corpus you
>> pick, so your solution for "topics" is biased to begin with. (The
>> mental model I kind of like to think of is that every person has a
>> slightly different meaning of what "politeness" means, depending on
>> his upbringing and experience, i.e. on his personal "training corpus".)
>>
>> So in many cases the data is rather biased to begin with. That's why
>> LSI is not the same as trying to compute the geometry of a rocket booster.
