I'm trying to find out whether there are any standard best practices for document tokenization when preparing data for LDA, both to get a higher-quality topic model and to understand how the feature space affects topic model quality.
For example, will the topic model be "better" with the richer feature space that comes from leaving terms unstemmed, or with the more normalized feature space that stemming produces? Is it better to filter out stop words or to keep them in? Is it better to include bigrams and/or trigrams of highly correlated terms in the feature space? In essence, what characteristics of the feature space that LDA uses as input will create a higher-quality topic model? A sketch of the variants I'm comparing is below.
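To make the question concrete, here is a minimal sketch of the preprocessing variants I have in mind, using gensim and NLTK (the toy corpus, toggle flags, and Phrases thresholds are placeholder choices, not my real setup):

```python
import re

from gensim.corpora import Dictionary
from gensim.models import LdaModel, Phrases
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import PorterStemmer


def preprocess(docs, remove_stopwords=True, stem=True, add_bigrams=True):
    """Tokenize docs, then apply each optional normalization step."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in docs]
    if remove_stopwords:
        tokenized = [[t for t in toks if t not in STOPWORDS] for toks in tokenized]
    if stem:
        stemmer = PorterStemmer()
        tokenized = [[stemmer.stem(t) for t in toks] for toks in tokenized]
    if add_bigrams:
        # Phrases merges frequently co-occurring pairs into single "a_b" tokens.
        bigrams = Phrases(tokenized, min_count=2, threshold=1.0)
        tokenized = [bigrams[toks] for toks in tokenized]
    return tokenized


# Toy corpus standing in for the real documents.
docs = [
    "topic models learn latent topics from word co-occurrence patterns",
    "stemming terms and removing stop words both shrink the feature space",
    "bigrams of correlated terms can enrich the feature space for topic models",
]

tokens = preprocess(docs)
dictionary = Dictionary(tokens)
corpus = [dictionary.doc2bow(toks) for toks in tokens]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)
print(lda.print_topics())
```

Each combination of the three flags yields a different feature space for LDA, and I'd like to know which combinations tend to produce higher-quality topics.

Thanks, JohnC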
