I'm trying to find out whether there are any standard best practices for
document tokenization when preparing data for LDA, in order to get a
higher-quality topic model, and to understand how the feature space
affects topic model quality.

For example, will the topic model be "better" with a richer feature
space (not stemming terms), or with a more normalized feature space
(applying stemming)?
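
For concreteness, here is a minimal sketch of the two options I mean,
using NLTK's PorterStemmer (the specific stemmer and tokenizer here are
just assumptions for illustration):

    import re
    from nltk.stem import PorterStemmer

    def tokenize(text):
        # naive lowercase word tokenizer, for illustration only
        return re.findall(r"[a-z]+", text.lower())

    stemmer = PorterStemmer()
    doc = "The runners were running through the streets"

    raw_tokens = tokenize(doc)                       # richer feature space
    stemmed = [stemmer.stem(t) for t in raw_tokens]  # normalized feature space

    print(raw_tokens)  # ['the', 'runners', 'were', 'running', ...]
    print(stemmed)     # ['the', 'runner', 'were', 'run', ...]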

Is it better to filter out stop words, or keep them in?
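
For instance, dropping tokens that appear in NLTK's English stop word
list (this assumes nltk.download('stopwords') has already been run):

    from nltk.corpus import stopwords

    stop_words = set(stopwords.words("english"))

    tokens = ["the", "topic", "model", "is", "trained", "on", "these", "tokens"]
    filtered = [t for t in tokens if t not in stop_words]
    print(filtered)  # ['topic', 'model', 'trained', 'tokens']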

Is it better to include bigrams and/or trigrams of highly correlated
terms in the feature space?
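
What I have in mind is something like gensim's Phrases model, which
merges frequently co-occurring token pairs into single tokens (the
min_count and threshold values below are arbitrary placeholders):

    from gensim.models.phrases import Phrases

    docs = [["machine", "learning", "is", "fun"],
            ["machine", "learning", "with", "topic", "models"],
            ["machine", "learning", "and", "machine", "learning"]]

    # first pass finds bigrams; a second pass over the
    # bigrammed corpus can then find trigrams
    bigram = Phrases(docs, min_count=1, threshold=1)
    trigram = Phrases(bigram[docs], min_count=1, threshold=1)

    print(trigram[bigram[docs[0]]])  # e.g. ['machine_learning', 'is', 'fun']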

In essence: what characteristics of the feature space that LDA uses as
input will produce a higher-quality topic model?

Thanks,
JohnC
