Hey John,
Sorry I didn't get back to this earlier: you are essentially correct,
but if you want the effect of "LDAPrintTopics" on the new CVB data,
since the format is exactly that of a DistributedRowMatrix (i.e. a simple
SequenceFile<IntWritable,VectorWritable>), you need do nothing other than:
$MAHOUT_HOME/bin/vectordump -s <modelpath> -d <dictionarypath> \
-dt sequencefile -p -sort -o ./output_topics.txt
If you only want the top N terms/features per topic, add "-vs 100" to that
option list.
Hope that helps.
-jake
p.s. yes, LDAPrintTopics does a lot of funny things, and might indeed be
buggy. But I'm more interested in finding bugs / pieces of missing docs
in the new CVB code, as we are probably removing the old code in the
next release.
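By the way, the Comparable-ordering bug John describes below is easy to reproduce in a few lines. This is just a sketch with a hypothetical TermWeight class standing in for Mahout's Pair<String,Double>, which compares the same way (String first, Double only on a tie):

```java
import java.util.PriorityQueue;

// Hypothetical stand-in for Pair<String,Double>: the String is compared
// first, so the Double weight only matters when two terms are identical.
class TermWeight implements Comparable<TermWeight> {
    final String term;
    final double weight;

    TermWeight(String term, double weight) {
        this.term = term;
        this.weight = weight;
    }

    @Override
    public int compareTo(TermWeight o) {
        int c = term.compareTo(o.term);                 // string compared first
        return c != 0 ? c : Double.compare(weight, o.weight); // weight only on ties
    }
}

public class QueueDemo {
    public static void main(String[] args) {
        // No Comparator passed, so the queue falls back to compareTo above.
        PriorityQueue<TermWeight> q = new PriorityQueue<>();
        q.add(new TermWeight("zebra", 0.9));
        q.add(new TermWeight("apple", 0.1));
        // The head is "apple" despite its tiny weight: ordering is
        // lexicographic by term, not by weight.
        System.out.println(q.peek().term);
    }
}
```

Since distinct terms never compare equal on the String, the weight branch is dead code in practice, which matches what John saw with the breakpoint.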
On Fri, Jan 27, 2012 at 5:21 PM, John Conwell <[email protected]> wrote:
> Ok, I think I just wrote all that (and wasted a couple hours) for nothing.
> It looks like topicModel output for the CVB algorithm is the normalized
> output from the last model generated from the tempState folder. Basically
> it automatically does for me some of what LDAPrintTopics does: it normalizes
> the topic word weights.
>
> That means there is no reason to do the weighting normalization for CVB,
> correct? And we still have to manually pull out the top N terms by weight
> for each topic, and match their indexes in the vector against the dictionary
> in order to get a readable top N words per topic, correct?
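Right, that manual step (pick the top N indexes by weight, then map them through the dictionary) might look like the sketch below. The arrays here are hypothetical stand-ins for what you'd actually load from the model output and the dictionary file:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TopTerms {
    // Return the dictionary terms for the n largest weights,
    // highest-weight first.
    static List<String> topN(double[] weights, String[] dictionary, int n) {
        Integer[] idx = new Integer[weights.length];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = i;
        }
        // Sort indexes by descending weight -- not by term text.
        Arrays.sort(idx, (a, b) -> Double.compare(weights[b], weights[a]));
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n && i < idx.length; i++) {
            out.add(dictionary[idx[i]]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy data standing in for one topic's weight vector + dictionary.
        double[] w = {0.1, 0.7, 0.2};
        String[] dict = {"alpha", "beta", "gamma"};
        System.out.println(topN(w, dict, 2));
    }
}
```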
>
> But, I think in all that looking I found a bug in LDAPrintTopics. It is
> supposed to spit out the top N words per topic, where top N is based on the
> term weight for that topic. The function maybeEnqueue() uses a
> PriorityQueue<Pair<String,Double>>, but doesn't pass in a Comparator, so it
> uses the Comparable implementation of Pair<A,B>, which first compares the
> String in the Pair and only compares the Double value if the Strings are equal.
> But this never happens since no terms are duplicated for a topic, and
> hence the term weight value is never checked. I double checked by putting
> a breakpoint in the compareTo method in Pair, and it never made it past the
> string comparison.
>
> All this means is that LDAPrintTopics is outputting the top N terms per
> topic in term-string sorted order, not weight order.
>
>
>
> On Fri, Jan 27, 2012 at 2:11 PM, John Conwell <[email protected]> wrote:
>
> > I used the CVB variant of LDA, and when I tried to run LDAPrintTopics I
> > noticed the topicModel output datatypes changed from the original LDA
> > implementation. So I figured I'd just write my own for CVB, and base it
> > off the LDA implementation.
> >
> > And I noticed something odd. When running LDAPrintTopics, it gathers the
> > top N terms by topic (topWordsForTopics), and normalizes the values in the
> > vector, which makes sense. But during the normalization calculation it
> > also weights the vector by using Math.exp(score) instead of just the
> > straight score for all calculations.
> >
> > I get that using Math.exp(score) weights larger values exponentially more
> > strongly than smaller values, but why is this done in the
> > normalization?
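For what it's worth, exponentiating scores and then normalizing them to sum to 1 is just a softmax over the vector. A minimal sketch of that calculation (my reading of what the code does, not the actual Mahout source):

```java
public class ExpNormalize {
    // Softmax-style normalization: exponentiate each score, then divide
    // by the sum so the results form a distribution summing to 1.
    static double[] expNormalize(double[] scores) {
        double sum = 0.0;
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = Math.exp(scores[i]); // exp sharpens large scores
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum;
        }
        return out;
    }

    public static void main(String[] args) {
        // A gap of 1.0 in raw scores becomes a factor of e (~2.72)
        // in the normalized weights, so the top terms dominate.
        double[] w = expNormalize(new double[]{1.0, 2.0, 3.0});
        System.out.printf("%.3f %.3f %.3f%n", w[0], w[1], w[2]);
    }
}
```

The effect is to sharpen the distribution toward the highest-scoring terms, which is presumably why it's used when printing "top words" rather than being persisted in the model itself.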
> >
> > And if I was going to use the topicModel output as the input to some other
> > algorithm, would I want to run the topicModel vectors through the same kind
> > of weighting normalization? And if so, why not just persist the topicModel
> > in this weighted normalized format in the first place?
> >
> > And finally, should I also use this same weighting normalization on
> > the docTopics output as well? The docTopics are normalized (well, they all
> > add up to 1), but are they normalized in the same manner?
> >
> > I'm just trying to figure out how to use the LDA output, and figure out if
> > there are any steps I need to consider before I use it as input to
> > something else.
> >
> > --
> >
> > Thanks,
> > John C
> >
> >
>
>
> --
>
> Thanks,
> John C
>