Why does extracting concepts from documents require such a large K?

On Wed, Jun 12, 2013 at 7:08 PM, Ankur Desai -X (ankurdes - SATYAM COMPUTER SERVICES LIMITED at Cisco) <[email protected]> wrote:

> Thanks Jake for your response. I am trying to get concepts out of the
> documents and for this I want the K to be large around 500. I will run CVB
> based on your suggestions and see what I get. Appreciate your prompt
> response.
>
> -Ankur
>
> -----Original Message-----
> From: Jake Mannix [mailto:[email protected]]
> Sent: Wednesday, June 12, 2013 9:22 AM
> To: [email protected]
> Subject: Re: Mahout CVB parameters
>
> What is the number of terms in your dictionary, after tokenization and
> vectorization? Typically, for english, you'll get reasonable topics with
> anywhere from 20-200 topics, tending toward the lower end if you've not got
> very many documents (like in your case) 20 topics will yield very generic
> things, 100 is pretty nice, a lot of the time, but 200 or more can lead to
> really niche things (I've found things like getting one topic to be
> basically all female first names, for example).
>
> Maximum # of iterations I'd say that 20-30 tends to always be enough, but
> while you're running it, it should be spitting out the perplexity as it
> goes (you can tell it to calculate this every N iterations, and set N to 1
> to check after each iteration, while you're trying to see how it goes).
> When this perplexity plateaus, you're done. But in practice, I've never
> needed more than 30 iterations (less the larger your corpus is).
>
> As for the smoothing parameters, we really should have an implementation
> of one of the various ways of finding it automatically, but for now, doing
> a grid search over values in the range of 0.001 to 0.1 while first testing
> things out tends to be helpful (so try (alpha, beta) = {0.001, 0.01, 0.1} x
> {0.001, 0.01, 0.1})
>
> Hope that helps.
>
>
> On Wed, Jun 12, 2013 at 9:01 AM, Ankur Desai -X (ankurdes - SATYAM
> COMPUTER SERVICES LIMITED at Cisco) <[email protected]> wrote:
>
> > Hi,
> >
> > I am using mahout CVB to generate topics from about 8K documents. I
> > am struggling to determine what are some of the best parameters values
> > to use?
> > Please help, if you know best way to determine the parameter values
> > like topic and term smoothing, max number of iterations, or total
> > number of topics to generate.
> >
> > Thanks,
> > Ankur
> >
>
> --
>
>   -jake
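
For anyone following the thread, here is a rough Python sketch of the two tuning steps described above: enumerating the 3 x 3 (alpha, beta) smoothing grid, and deciding when the perplexity has plateaued. It does not call Mahout. The function names, the 1% relative tolerance, and the sample perplexity trace are assumptions made up for illustration, not Mahout defaults or measured results.

```python
from itertools import product

# Smoothing values suggested in the thread for both alpha and beta.
SMOOTHING_VALUES = [0.001, 0.01, 0.1]


def smoothing_grid(values=SMOOTHING_VALUES):
    """Return every (alpha, beta) pair in the product values x values."""
    return list(product(values, values))


def has_plateaued(perplexities, rel_tol=0.01):
    """True if the last step improved perplexity by less than rel_tol.

    rel_tol is a hypothetical threshold (1% relative improvement).
    A step where perplexity got worse also counts as a plateau.
    """
    if len(perplexities) < 2:
        return False
    prev, curr = perplexities[-2], perplexities[-1]
    return (prev - curr) / prev < rel_tol


def iterations_until_plateau(perplexities, rel_tol=0.01):
    """Return the 1-based iteration where a plateau is first seen, else None."""
    for i in range(2, len(perplexities) + 1):
        if has_plateaued(perplexities[:i], rel_tol):
            return i
    return None


if __name__ == "__main__":
    # Nine candidate settings to try, one CVB run each.
    for alpha, beta in smoothing_grid():
        print(f"alpha={alpha} beta={beta}")

    # Made-up perplexity trace, one value per iteration, for illustration only.
    trace = [1000.0, 800.0, 700.0, 698.0, 697.0]
    print(iterations_until_plateau(trace))
```

In practice the perplexity values would come from the log of each run, and the setting with the lowest plateau perplexity would be the one to keep.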
