Hi John,

I'm not an expert in the field, but I've done a bit of work building topic models with LDA. Here are some of the "tricks" I've used:
1) Yes, remove stop words. In fact, remove all words occurring in more than (say) half of your documents (or, more conservatively, 90%), as they'll be noise and will just dominate your topics.

2) More features is better, if you have the memory for it (note that Mahout's LDA currently holds numTopics * numFeatures in memory in the mapper tasks, which means you're usually limited to a few hundred thousand features, maybe as high as a million, currently). So don't stem, and throw in commonly occurring (or, more importantly, high log-likelihood) bigrams and trigrams as independent features.

3) Violate the underlying assumption of LDA that you're modeling raw "token occurrences": weight your vectors not as "tf" but as "tf*idf". That makes rarer features more prominent, which ends up making your topics look a lot nicer.

Those are the main tricks I can think of right now. (I've pasted a rough seq2sparse sketch below the quoted message showing where each of these knobs lives.)

If you're using Mahout trunk, try the new LDA impl:

$MAHOUT_HOME/bin/mahout cvb0 --help

It operates on the same kind of input as the previous one (i.e. a corpus which is a SequenceFile<IntWritable, VectorWritable>).

-jake

On Tue, Jan 24, 2012 at 12:14 PM, John Conwell <[email protected]> wrote:

> I'm trying to find out if there are any standard best practices for
> document tokenization when prepping your data for LDA in order to get a
> higher-quality topic model, and to understand how the feature space
> affects topic model quality.
>
> For example, will the topic model be "better" with a richer feature space
> from not stemming terms, or is it better to have a more normalized
> feature space by applying stemming?
>
> Is it better to filter out stop words, or keep them in?
>
> Is it better to include bi- and/or tri-grams of highly correlated terms
> in the feature space?
>
> In essence, what characteristics of the feature space that LDA uses for
> input will create a higher-quality topic model?
>
> Thanks,
> JohnC
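P.S. In case it's easier to see where those knobs live than to describe them, here's roughly the seq2sparse invocation I have in mind for 1)-3). Flag names are from memory and the paths are placeholders, so double-check against $MAHOUT_HOME/bin/mahout seq2sparse --help:

  # -x 50        : 1) prune terms appearing in more than 50% of docs (use 90 to be conservative)
  # -ng 3 -ml 50 : 2) emit bigrams/trigrams as extra features, keeping only high log-likelihood-ratio ones
  # -wt tfidf    : 3) tf*idf weighting instead of raw term frequency
  # -nv          : named vectors, handy when inspecting topics afterwards
  $MAHOUT_HOME/bin/mahout seq2sparse \
    -i /path/to/text-seqfiles \
    -o /path/to/vectors \
    -x 50 -ng 3 -ml 50 \
    -wt tfidf -nv

If you also want an explicit stopword list on top of the DF pruning, that would go into the Lucene analyzer you pass with -a, if I remember the option right.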

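And to get from those vectors to the input cvb0 expects: seq2sparse writes its tfidf-vectors keyed by Text, so you'd re-key them with the rowid job first (names and output layout also from memory, so trust --help over me):

  # re-key SequenceFile<Text,VectorWritable> to the IntWritable-keyed form cvb0 wants;
  # the output dir should end up containing a 'matrix' part plus a 'docIndex' mapping
  $MAHOUT_HOME/bin/mahout rowid \
    -i /path/to/vectors/tfidf-vectors \
    -o /path/to/matrix

  $MAHOUT_HOME/bin/mahout cvb0 --help   # then point its corpus input at /path/to/matrix/matrix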