Thanks, Jake, for your response. I am trying to extract concepts from the documents, and for this I want K to be large, around 500. I will run CVB based on your suggestions and see what I get. I appreciate your prompt response.
-Ankur

-----Original Message-----
From: Jake Mannix [mailto:[email protected]]
Sent: Wednesday, June 12, 2013 9:22 AM
To: [email protected]
Subject: Re: Mahout CVB parameters

What is the number of terms in your dictionary, after tokenization and vectorization?

Typically, for English, you'll get reasonable topics with anywhere from 20-200 topics, tending toward the lower end if you've not got very many documents (like in your case). 20 topics will yield very generic things, 100 is pretty nice a lot of the time, but 200 or more can lead to really niche things (I've found things like getting one topic to be basically all female first names, for example).

For the maximum number of iterations, I'd say that 20-30 tends to always be enough, but while you're running it, it should be spitting out the perplexity as it goes (you can tell it to calculate this every N iterations, and set N to 1 to check after each iteration while you're trying to see how it goes). When this perplexity plateaus, you're done. But in practice, I've never needed more than 30 iterations (fewer, the larger your corpus is).

As for the smoothing parameters, we really should have an implementation of one of the various ways of finding them automatically, but for now, doing a grid search over values in the range of 0.001 to 0.1 while first testing things out tends to be helpful (so try (alpha, beta) = {0.001, 0.01, 0.1} x {0.001, 0.01, 0.1}).

Hope that helps.

On Wed, Jun 12, 2013 at 9:01 AM, Ankur Desai -X (ankurdes - SATYAM COMPUTER SERVICES LIMITED at Cisco) <[email protected]> wrote:

> Hi,
>
> I am using Mahout CVB to generate topics from about 8K documents. I
> am struggling to determine the best parameter values to use.
> Please help, if you know the best way to determine parameter values
> like topic and term smoothing, max number of iterations, or total
> number of topics to generate.
>
> Thanks,
> Ankur
>

--

  -jake
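
For anyone following the thread, the grid search and the "stop when perplexity plateaus" rule described above can be sketched roughly as follows. This is only an illustration in plain Python, not Mahout code: the function names, the 1% relative tolerance, and the sample perplexity values are all assumptions made up for the example, and the perplexity numbers would in practice come from the output of the CVB run.

```python
from itertools import product


def smoothing_grid(values=(0.001, 0.01, 0.1)):
    """Return every (alpha, beta) pair from the Cartesian product of values."""
    return list(product(values, repeat=2))


def has_plateaued(perplexities, rel_tol=0.01):
    """True if the latest relative change in perplexity is below rel_tol.

    rel_tol is a hypothetical threshold; the thread does not specify one.
    """
    if len(perplexities) < 2:
        return False
    prev, curr = perplexities[-2], perplexities[-1]
    return abs(prev - curr) / prev < rel_tol


def iterations_until_plateau(perplexities, rel_tol=0.01, max_iter=30):
    """Return the 1-based iteration at which to stop.

    Stops at the first iteration where perplexity has plateaued, or at
    max_iter (30, the upper bound suggested in the thread) otherwise.
    """
    limit = min(len(perplexities), max_iter)
    for i in range(1, limit + 1):
        if has_plateaued(perplexities[:i], rel_tol):
            return i
    return limit


if __name__ == "__main__":
    # The nine (alpha, beta) combinations to try.
    for alpha, beta in smoothing_grid():
        print(f"alpha={alpha}, beta={beta}")

    # Invented perplexity readings, one per iteration, for illustration only.
    sample = [1000.0, 800.0, 700.0, 695.0, 694.0]
    print("stop at iteration", iterations_until_plateau(sample))
```

Each (alpha, beta) pair would correspond to one separate CVB run; comparing the final perplexities across the nine runs is one way to pick between them.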
