Thanks, Jake, for your response.  I am trying to extract concepts from the 
documents, and for this I want K to be large, around 500.  I will run CVB 
based on your suggestions and see what I get.  I appreciate your prompt response.

-Ankur

-----Original Message-----
From: Jake Mannix [mailto:[email protected]] 
Sent: Wednesday, June 12, 2013 9:22 AM
To: [email protected]
Subject: Re: Mahout CVB parameters

What is the number of terms in your dictionary, after tokenization and 
vectorization?  Typically, for English, you'll get reasonable topics with 
anywhere from 20 to 200 topics, tending toward the lower end if you don't have 
very many documents (as in your case).  20 topics will yield very generic 
things, 100 is pretty nice a lot of the time, but 200 or more can lead to 
really niche things (for example, I've found one topic that was basically 
all female first names).

Maximum number of iterations: I'd say that 20-30 is almost always enough, but 
while you're running it, it should be printing the perplexity as it goes (you 
can tell it to calculate this every N iterations, and set N to 1 to check after 
each iteration while you're seeing how it goes).  When the perplexity 
plateaus, you're done.  In practice, I've never needed more than 30 
iterations (fewer the larger your corpus is).
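
To make "when the perplexity plateaus" concrete, here is a minimal sketch in 
Python.  It assumes you have already collected the perplexity values from the 
job output into a list, one per check.  The function name, the 1% tolerance and 
the two-reading patience are illustrative assumptions, not Mahout settings.

```python
def plateau_iteration(perplexities, rel_tol=0.01, patience=2):
    """Return the 1-based position of the reading at which the relative
    improvement has stayed below rel_tol for `patience` consecutive
    readings, or None if that never happens.

    rel_tol and patience are illustrative defaults, not Mahout parameters.
    """
    calm = 0  # consecutive readings with negligible improvement
    for i in range(1, len(perplexities)):
        prev, cur = perplexities[i - 1], perplexities[i]
        improvement = (prev - cur) / prev  # perplexity is positive, lower is better
        if improvement < rel_tol:
            calm += 1
            if calm >= patience:
                return i + 1
        else:
            calm = 0
    return None


if __name__ == "__main__":
    # Made-up readings, only to show the call.
    readings = [1000.0, 800.0, 700.0, 695.0, 693.0, 692.5]
    print(plateau_iteration(readings))
```

If perplexity is computed every N iterations rather than every iteration, the 
returned position counts checks, so multiply by N to get the iteration number.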

As for the smoothing parameters, we really should have an implementation of one 
of the various ways of finding them automatically, but for now, doing a grid 
search over values in the range of 0.001 to 0.1 while first testing things out 
tends to be helpful (so try (alpha, beta) in {0.001, 0.01, 0.1} x {0.001, 0.01, 
0.1}).
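
The grid above is small enough to enumerate by hand, but a short Python sketch 
may help keep the nine runs organized.  It only builds the (alpha, beta) pairs 
and picks a winner from results you record yourself; it does not invoke Mahout. 
Ranking the settings by final perplexity is an assumption made here for 
illustration, and the helper names are hypothetical.

```python
from itertools import product

# Candidate values for both smoothing parameters, taken from the range above.
SMOOTHING_VALUES = (0.001, 0.01, 0.1)


def smoothing_grid(values=SMOOTHING_VALUES):
    """Return every (alpha, beta) pair in the Cartesian product values x values."""
    return list(product(values, repeat=2))


def best_setting(perplexity_by_setting):
    """Return the (alpha, beta) key with the lowest recorded perplexity.

    Assumes lower perplexity is better and that all runs used the same
    corpus, number of topics and iteration count.
    """
    return min(perplexity_by_setting, key=perplexity_by_setting.get)


if __name__ == "__main__":
    for alpha, beta in smoothing_grid():
        # Run CVB once per pair and note the final perplexity it reports.
        print(f"alpha={alpha} beta={beta}")
```

After the runs finish, pass a dict mapping each (alpha, beta) pair to its 
recorded perplexity into best_setting to get the pair to keep.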

Hope that helps.


On Wed, Jun 12, 2013 at 9:01 AM, Ankur Desai -X (ankurdes - SATYAM COMPUTER 
SERVICES LIMITED at Cisco) <[email protected]> wrote:

> Hi,
>
> I am using Mahout CVB to generate topics from about 8K documents.  I 
> am struggling to determine the best parameter values to use.
> Please help if you know the best way to determine parameter values 
> such as topic and term smoothing, max number of iterations, or total 
> number of topics to generate.
>
> Thanks,
> Ankur
>



-- 

  -jake