Why does extracting concepts from documents require such a large K?



On Wed, Jun 12, 2013 at 7:08 PM, Ankur Desai -X (ankurdes - SATYAM COMPUTER
SERVICES LIMITED at Cisco) <[email protected]> wrote:

> Thanks, Jake, for your response.  I am trying to get concepts out of the
> documents, and for this I want K to be large, around 500.  I will run CVB
> based on your suggestions and see what I get.  Appreciate your prompt
> response.
>
> -Ankur
>
> -----Original Message-----
> From: Jake Mannix [mailto:[email protected]]
> Sent: Wednesday, June 12, 2013 9:22 AM
> To: [email protected]
> Subject: Re: Mahout CVB parameters
>
> What is the number of terms in your dictionary, after tokenization and
> vectorization?  Typically, for English, you'll get reasonable topics with
> anywhere from 20-200 topics, tending toward the lower end if you've not got
> very many documents (like in your case).  20 topics will yield very generic
> things, 100 is pretty nice a lot of the time, but 200 or more can lead to
> really niche things (I've found things like getting one topic to be
> basically all female first names, for example).
>
> Maximum # of iterations: I'd say that 20-30 tends to always be enough, but
> while you're running it, it should be spitting out the perplexity as it
> goes (you can tell it to calculate this every N iterations, and set N to 1
> to check after each iteration, while you're trying to see how it goes).
> When this perplexity plateaus, you're done.  But in practice, I've never
> needed more than 30 iterations (fewer the larger your corpus is).
>
> As for the smoothing parameters, we really should have an implementation
> of one of the various ways of finding it automatically, but for now, doing
> a grid search over values in the range of 0.001 to 0.1 while first testing
> things out tends to be helpful (so try (alpha, beta) = {0.001, 0.01, 0.1} x
> {0.001, 0.01, 0.1})
>
> Hope that helps.
>
>
> On Wed, Jun 12, 2013 at 9:01 AM, Ankur Desai -X (ankurdes - SATYAM
> COMPUTER SERVICES LIMITED at Cisco) <[email protected]> wrote:
>
> > Hi,
> >
> > I am using Mahout CVB to generate topics from about 8K documents.  I
> > am struggling to determine the best parameter values to use.
> > Please help, if you know the best way to determine parameter values
> > like topic and term smoothing, max number of iterations, or total
> > number of topics to generate.
> >
> > Thanks,
> > Ankur
> >
>
>
>
> --
>
>   -jake
>
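
For anyone following along, the two pieces of advice quoted above (the
3 x 3 grid over the smoothing parameters, and stopping once perplexity
plateaus) can be sketched roughly as below. This is only an illustration
in plain Python, not Mahout code: the function names, the 1% relative
tolerance, and the sample perplexity numbers are all assumptions made up
for the example, and how you actually launch CVB and collect its
perplexity output is left out.

```python
from itertools import product

# Candidate smoothing values, taken from the range suggested in the thread.
SMOOTHING_VALUES = [0.001, 0.01, 0.1]


def smoothing_grid(values=SMOOTHING_VALUES):
    """Return every (alpha, beta) pair in the Cartesian product of `values`."""
    return list(product(values, values))


def plateau_iteration(perplexities, rel_tol=0.01):
    """Return the 1-based iteration at which perplexity first plateaus.

    "Plateau" here means the relative change from the previous iteration
    is below `rel_tol`. The threshold is a hypothetical choice for this
    sketch, not a Mahout setting. Returns None if no plateau is reached.
    """
    for i in range(1, len(perplexities)):
        prev, cur = perplexities[i - 1], perplexities[i]
        if abs(prev - cur) / prev < rel_tol:
            return i + 1
    return None


if __name__ == "__main__":
    # Nine (alpha, beta) combinations to try, one CVB run each.
    for alpha, beta in smoothing_grid():
        print(f"alpha={alpha}, beta={beta}")

    # Made-up perplexity values, purely to exercise the helper.
    example = [1200.0, 900.0, 780.0, 775.0, 774.0]
    print(plateau_iteration(example))
```

With perplexity reported after every iteration (N set to 1, as suggested
above), a helper like plateau_iteration gives a rough idea of how many
iterations each (alpha, beta) pair actually needs.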