Ok, I think I just wrote all that (and wasted a couple of hours) for nothing.
 It looks like the topicModel output for the CVB algorithm is the normalized
output of the last model generated in the tempState folder.  Basically it
automatically does some of what LDAPrintTopics does for me: it normalizes
the topic word weights.
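
For reference, here's roughly what I understand that normalization step to
be (the one using Math.exp that I asked about below).  This is just a sketch
of my reading of it; the assumption that the raw scores are log weights,
which would explain the exp, is mine:

// A minimal sketch of the exp-based normalization as I read it; assumes
// the raw per-term scores for one topic are log weights (my assumption,
// not something the code documents).
static double[] expNormalize(double[] logWeights) {
  double sum = 0.0;
  double[] out = new double[logWeights.length];
  for (int i = 0; i < logWeights.length; i++) {
    out[i] = Math.exp(logWeights[i]); // back to linear scale
    sum += out[i];
  }
  for (int i = 0; i < out.length; i++) {
    out[i] /= sum; // now the weights sum to 1
  }
  return out;
}

If that reading is right, the exp isn't an extra weighting at all; it's just
converting log weights back to probabilities before normalizing.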

That means there is no reason to do the weighting normalization for CVB,
correct?  And we still have to manually pull out the top N terms by weight
for each topic, and match their indexes in the vector against the dictionary
in order to get a readable top N words per topic, correct?  (Something like
the sketch below is what I have in mind.)
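
Here is the kind of thing I mean, as a rough sketch.  I'm assuming the topic
row comes back as a Mahout Vector and that the dictionary has already been
loaded into an index-to-term String array; the method name and signature are
mine, not anything in Mahout:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import org.apache.mahout.common.Pair;
import org.apache.mahout.math.Vector;

// Pull the top n terms for one topic and make them readable by looking up
// each vector index in the dictionary.
static List<Pair<String, Double>> topTerms(Vector topic,
                                           String[] dictionary, int n) {
  // Order by weight only, so the queue tracks the top n by weight.
  Comparator<Pair<String, Double>> byWeight =
      new Comparator<Pair<String, Double>>() {
        public int compare(Pair<String, Double> a, Pair<String, Double> b) {
          return a.getSecond().compareTo(b.getSecond());
        }
      };
  // Min-heap: the weakest of the current top n sits at the head.
  PriorityQueue<Pair<String, Double>> queue =
      new PriorityQueue<Pair<String, Double>>(n + 1, byWeight);
  Iterator<Vector.Element> it = topic.iterateNonZero();
  while (it.hasNext()) {
    Vector.Element e = it.next();
    queue.add(new Pair<String, Double>(dictionary[e.index()], e.get()));
    if (queue.size() > n) {
      queue.poll(); // evict the lowest-weight term
    }
  }
  List<Pair<String, Double>> top =
      new ArrayList<Pair<String, Double>>(queue);
  Collections.sort(top, Collections.reverseOrder(byWeight));
  return top;
}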

But, I think in all that looking I found a bug in LDAPrintTopics.  It is
supposed to spit out the top N words per topic, where top N is based on the
term weight for that topic.  The function maybeEnqueue() uses a
PriorityQueue<Pair<String,Double>> but doesn't pass in a Comparator, so the
queue falls back on the Comparable implementation of Pair<A,B>, which
compares the String in the Pair first and only compares the Double value if
the Strings are equal.  But that never happens, since no term is duplicated
within a topic, so the term weight is never actually checked.  I double
checked by putting a breakpoint in Pair's compareTo method, and it never
made it past the string comparison.

All this means that LDAPrintTopics is actually outputting the top N terms
per topic in term-string sort order, not by weight.
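
A quick way to see it (this snippet just exercises Pair's natural ordering
as described above; the values are made up):

// The String wins before the Double is ever looked at, so the comparison
// result has nothing to do with the weights.
Pair<String, Double> a = new Pair<String, Double>("apple", 0.001);
Pair<String, Double> b = new Pair<String, Double>("zebra", 0.999);
System.out.println(a.compareTo(b)); // negative: ordered by term, not weight

The fix should just be handing maybeEnqueue's queue a Comparator on
getSecond(), like the byWeight one in the sketch above.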



On Fri, Jan 27, 2012 at 2:11 PM, John Conwell <[email protected]> wrote:

> I used the CVB variant of LDA, and when I tried to run LDAPrintTopics I
> noticed the topicModel output datatypes changed from the original LDA
> implementation.  So I figured I'd just write my own for CVB, and base it
> off the LDA implementation.
>
> And I noticed something odd.  When running LDAPrintTopics, it gathers the
> top N terms by topic (topWordsForTopics) and normalizes the values in the
> vector, which makes sense.  But during the normalization calculation it
> also weights the vector by using Math.exp(score) instead of just the
> straight score for all calculations.
>
> I get that using Math.exp(score) will give exponentially larger values a
> stronger weighting than smaller values, but why is this done in the
> normalization?
>
> And if I was going to use the topicModel output as the input to some other
> algorithm, would I want to run the topicModel vectors through the same kind
> of weighting normalization?  And if so, why not just persist the topicModel
> in this weighted, normalized format in the first place?
>
> And finally, should I also use this same weighting normalization on
> the docTopics output as well?  The docTopics are normalized (well, they all
> add up to 1), but are they normalized in the same manner?
>
> I'm just trying to figure out how to use the LDA output, and whether there
> are any steps I need to take before using it as input to something else.
>
> --
>
> Thanks,
> John C
>
>


-- 

Thanks,
John C
