I'm using the CVB variant of LDA, and when I tried to run LDAPrintTopics I noticed that the topicModel output datatypes have changed from the original LDA implementation. So I figured I'd write my own version for CVB and base it on the LDA implementation.
While doing so I noticed something odd. When LDAPrintTopics gathers the top N terms per topic (topWordsForTopics), it normalizes the values in the vector, which makes sense. But during the normalization it weights each value using Math.exp(score) instead of the raw score. I understand that Math.exp(score) gives exponentially larger values a stronger weighting than smaller ones, but why is this done as part of the normalization?

A few related questions:
- If I use the topicModel output as input to another algorithm, should I run the topicModel vectors through the same exp-weighted normalization first?
- If so, why not persist the topicModel in this weighted, normalized format in the first place?
- Should I apply the same weighting and normalization to the docTopics output as well? The docTopics vectors are normalized (they sum to 1), but are they normalized in the same manner?

I'm just trying to work out how to use the LDA output, and whether there are any steps I need to take before feeding it into something else.

-- Thanks, John C
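For context on the exp-then-normalize step asked about above: one plausible explanation (an assumption on my part, not confirmed from the Mahout source here) is that the stored topic scores are log-domain values, so applying Math.exp before normalizing simply converts them back into the probability domain. The sketch below contrasts straight normalization with exp-weighted normalization; the class and method names are hypothetical, not Mahout APIs.

```java
import java.util.Arrays;

public class TopicNormalization {

    // Straight normalization: divide each score by the sum of all scores.
    static double[] normalize(double[] scores) {
        double sum = Arrays.stream(scores).sum();
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = scores[i] / sum;
        }
        return out;
    }

    // Exp-weighted normalization: exponentiate first, then normalize.
    // If the inputs are log-probabilities, this recovers the original
    // probability distribution (a softmax over the log scores).
    static double[] expNormalize(double[] scores) {
        double[] out = new double[scores.length];
        double sum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            out[i] = Math.exp(scores[i]);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical topic distribution and its log-domain representation.
        double[] p = {0.5, 0.3, 0.2};
        double[] logs = {Math.log(0.5), Math.log(0.3), Math.log(0.2)};

        // expNormalize on the log scores recovers the original distribution.
        System.out.println(Arrays.toString(expNormalize(logs)));
    }
}
```

If this reading is right, the exp weighting isn't an extra emphasis step at all; it is just undoing a log transform, which would also explain why docTopics (already stored as probabilities summing to 1) would not need it.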
