Hi John,

  I'm not an expert in the field, but I have done a bit of work building
topic models with LDA, and here are some of the "tricks" I've used:

  1) Yes, remove stop words. In fact, remove all words occurring in more
than (say) half (or, more conservatively, 90%) of your documents, as
they'll just be noise and dominate your topics (there's a sketch of this
kind of pruning after the list).

  2) More features is better, if you have the memory for it. Note that
Mahout's LDA currently holds a numTopics * numFeatures matrix in memory
in the mapper tasks, which usually bounds you to a few hundred thousand
features, maybe up as high as a million (assuming doubles, 200 topics x
1,000,000 features is already about 1.6GB per mapper). So don't stem,
and throw in commonly occurring (or, more importantly, high
log-likelihood) bigrams and trigrams as independent features (bigram
scoring is sketched below).

  3) Violate the underlying assumption of LDA, that you're modeling raw
"token occurrences", and weight your vectors not with "tf" but with
"tf*idf". That makes rarer features more prominent, which ends up making
your topics look a lot nicer (also sketched below).
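
Here are quick sketches of those three tricks, in plain Python rather
than Mahout code, just to make them concrete. First, the
document-frequency pruning from point 1 (when vectorizing with Mahout, I
believe seq2sparse's --maxDFPercent option does roughly the same thing):

  from collections import Counter

  def prune_by_doc_frequency(docs, max_df=0.5):
      """docs: list of token lists; drop terms in > max_df of docs."""
      n_docs = len(docs)
      df = Counter()
      for doc in docs:
          df.update(set(doc))  # count each term once per document
      too_common = {t for t, c in df.items() if c > max_df * n_docs}
      return [[t for t in doc if t not in too_common] for doc in docs]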
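
Second, for point 2, scoring candidate bigrams with Dunning's
log-likelihood ratio, so you can keep just the top-scoring ones as extra
features (this is my own plain-Python version of the statistic;
seq2sparse can also generate n-grams with an LLR cutoff via, if I
remember the flags right, its --maxNGramSize and --minLLR options):

  import math
  from collections import Counter

  def xlogx(x):
      return 0.0 if x == 0 else x * math.log(x)

  def llr(k11, k12, k21, k22):
      # Dunning's log-likelihood ratio for a 2x2 contingency table
      n = k11 + k12 + k21 + k22
      rows = xlogx(k11 + k12) + xlogx(k21 + k22)
      cols = xlogx(k11 + k21) + xlogx(k12 + k22)
      cells = xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
      return 2.0 * (cells - rows - cols + xlogx(n))

  def score_bigrams(docs):
      bigrams, starts, ends = Counter(), Counter(), Counter()
      for doc in docs:
          for a, b in zip(doc, doc[1:]):
              bigrams[a, b] += 1
              starts[a] += 1   # bigrams beginning with a
              ends[b] += 1     # bigrams ending with b
      n = sum(bigrams.values())
      scores = {}
      for (a, b), k11 in bigrams.items():
          k12 = starts[a] - k11      # a followed by something else
          k21 = ends[b] - k11        # b preceded by something else
          k22 = n - k11 - k12 - k21  # neither
          scores[a, b] = llr(k11, k12, k21, k22)
      return scores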
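
And third, the tf*idf weighting from point 3, in its classic
tf * log(N/df) form (seq2sparse can emit tf-idf vectors directly too,
via its --weight option, if memory serves):

  import math
  from collections import Counter

  def tfidf_vectors(docs):
      # docs: list of token lists -> one {term: weight} dict per doc
      n_docs = len(docs)
      df = Counter()
      for doc in docs:
          df.update(set(doc))  # document frequency of each term
      vectors = []
      for doc in docs:
          tf = Counter(doc)
          vectors.append({t: c * math.log(n_docs / df[t])
                          for t, c in tf.items()})
      return vectors

Note that a term occurring in every document gets a weight of zero here,
which dovetails nicely with point 1.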

Those are the main tricks I can think of right now.

If you're using Mahout trunk, try the new LDA impl:

  $MAHOUT_HOME/bin/mahout cvb0 --help

It operates on the same kind of input as the old implementation (i.e. a
corpus stored as a SequenceFile<IntWritable, VectorWritable>).

  -jake

On Tue, Jan 24, 2012 at 12:14 PM, John Conwell <[email protected]> wrote:

> I'm trying to find out if there are any standard best practices for
> document tokenization when prepping your data for LDA in order to get a
> higher quality topic model, and to understand how the feature space affects
> topic model quality.
>
> For example, will the topic model be "better" with a richer feature
> space from not stemming terms, or is it better to have a more
> normalized feature space by applying stemming?
>
> Is it better to filter out stop words, or keep them in?
>
> Is it better to include bi- and/or tri-grams of highly correlated terms
> in the feature space?
>
> In essence, what characteristics of the feature space that LDA uses for
> input will create a higher-quality topic model?
>
> Thanks,
> JohnC
>
