I'm trying to find out whether there are any standard best practices for
document tokenization when preparing data for LDA, in order to get a
higher-quality topic model, and to understand how the feature space
affects topic model quality.

For example, will the topic model be "better" with a richer feature
space (not stemming terms), or with a more normalized feature space
(applying stemming)?
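
For concreteness, here is a minimal sketch of the two options I mean,
using NLTK's PorterStemmer (the specific stemmer and tokenizer here are
just assumptions for illustration):

    import re
    from nltk.stem import PorterStemmer

    def tokenize(text):
        # naive lowercase word tokenizer, for illustration only
        return re.findall(r"[a-z]+", text.lower())

    stemmer = PorterStemmer()
    doc = "The runners were running through the streets"

    raw_tokens = tokenize(doc)                       # richer feature space
    stemmed = [stemmer.stem(t) for t in raw_tokens]  # normalized feature space

    print(raw_tokens)  # ['the', 'runners', 'were', 'running', ...]
    print(stemmed)     # ['the', 'runner', 'were', 'run', ...]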

Is it better to filter out stop words, or keep them in?
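
For instance, dropping tokens that appear in NLTK's English stop word
list (this assumes nltk.download('stopwords') has already been run):

    from nltk.corpus import stopwords

    stop_words = set(stopwords.words("english"))

    tokens = ["the", "topic", "model", "is", "trained", "on", "these", "tokens"]
    filtered = [t for t in tokens if t not in stop_words]
    print(filtered)  # ['topic', 'model', 'trained', 'tokens']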

Is it better to include bigrams and/or trigrams of highly correlated
terms in the feature space?
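
What I have in mind is something like gensim's Phrases model, which
merges frequently co-occurring token pairs into single tokens (the
min_count and threshold values below are arbitrary placeholders):

    from gensim.models.phrases import Phrases

    docs = [["machine", "learning", "is", "fun"],
            ["machine", "learning", "with", "topic", "models"],
            ["machine", "learning", "and", "machine", "learning"]]

    # first pass finds bigrams; a second pass over the
    # bigrammed corpus can then find trigrams
    bigram = Phrases(docs, min_count=1, threshold=1)
    trigram = Phrases(bigram[docs], min_count=1, threshold=1)

    print(trigram[bigram[docs[0]]])  # e.g. ['machine_learning', 'is', 'fun']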

In essence: what characteristics of the feature space that LDA uses as
input will produce a higher-quality topic model?

Thanks,
JohnC
