Totally understand on the bug fix.  But for anyone who wants the fix, I've
created a patch, attached to this email.

Basically, just create a new Comparator when you create the PriorityQueue
that compares only the second value of each Pair (the double value) and
ignores the string.
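
For reference, the core of the change looks roughly like this (a sketch
assuming Mahout's Pair class and its getSecond() accessor; the class name is
made up, and the attached patch is the authoritative version):

  import java.util.Comparator;
  import java.util.PriorityQueue;
  import java.util.Queue;
  import org.apache.mahout.common.Pair;

  public class TopWordsQueueFactory {   // hypothetical name, not in the patch
    // Order queue entries by the topic weight (the Double) only, ignoring
    // the term String entirely.
    public static Queue<Pair<String,Double>> newQueue(int capacity) {
      return new PriorityQueue<Pair<String,Double>>(capacity,
          new Comparator<Pair<String,Double>>() {
            @Override
            public int compare(Pair<String,Double> a, Pair<String,Double> b) {
              return a.getSecond().compareTo(b.getSecond());
            }
          });
    }
  }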

On Fri, Jan 27, 2012 at 8:24 PM, Jake Mannix <[email protected]> wrote:

> Hey John,
>
>  Sorry I didn't get back to respond to this earlier: you are essentially
> correct, but if you want to get the effect of "LDAPrintTopics" on the new
> CVB data, since the format is exactly that of a DistributedRowMatrix (i.e.
> a simple SequenceFile<IntWritable,VectorWritable>), you need do nothing
> other than:
>
>  $MAHOUT_HOME/bin/vectordump -s <modelpath> -d <dictionarypath> \
>      -dt sequencefile -p -sort -o ./output_topics.txt
>
>  If you only want the top N terms/features per topic, add the "-vs 100" to
> that option list.
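>
>  For example:
>
>  $MAHOUT_HOME/bin/vectordump -s <modelpath> -d <dictionarypath> \
>      -dt sequencefile -p -sort -vs 100 -o ./output_topics.txt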
>
>  Hope that helps.
>
>  -jake
>
> p.s. yes, LDAPrintTopics does a lot of funny things, and might indeed be
> buggy.  But I'm more interested in finding bugs / pieces of missing docs
> in the new CVB code, as we are probably removing the old code in the
> next release.
>
> On Fri, Jan 27, 2012 at 5:21 PM, John Conwell <[email protected]> wrote:
>
> > Ok, I think I just wrote all that (and wasted a couple hours) for
> > nothing.  It looks like the topicModel output for the CVB algorithm is
> > the normalized output from the last model generated from the tempState
> > folder.  Basically it automatically does for me some of what
> > LDAPrintTopics does: it normalizes the topic word weights.
> >
> > That means there is no reason to do the weighting normalization for CVB,
> > correct?  And we still have to manually pull out the top N terms by
> > weight for the topic, and match their index in the vector with the
> > dictionary in order to get a readable list of the top N words per topic,
> > correct?
> >
> > But I think in all that looking I found a bug in LDAPrintTopics.  It is
> > supposed to spit out the top N words per topic, where top N is based on
> > the term weight for that topic.  The function maybeEnqueue() uses a
> > PriorityQueue<Pair<String,Double>>, but doesn't pass in a Comparator, so
> > it uses the Comparable implementation for Pair<A,B>, which compares the
> > String in the Pair first and only compares the Double value if the
> > Strings are equal.  But that never happens, since no terms are duplicated
> > within a topic, so the term weight value is never checked.  I double
> > checked by putting a breakpoint in the compareTo method in Pair, and it
> > never made it past the string comparison.
> >
> > All this means that LDAPrintTopics is outputting the top N terms per
> > topic in term-string sorted order, not by term weight.
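> >
> > To make the ordering concrete, here's a tiny sketch (using Mahout's Pair
> > as described above; the demo class is made up, not code from
> > LDAPrintTopics):
> >
> >   import java.util.PriorityQueue;
> >   import java.util.Queue;
> >   import org.apache.mahout.common.Pair;
> >
> >   public class PairOrderingDemo {
> >     public static void main(String[] args) {
> >       // With no Comparator supplied, PriorityQueue falls back to Pair's
> >       // natural ordering, which compares the String before the Double.
> >       Queue<Pair<String,Double>> q = new PriorityQueue<Pair<String,Double>>();
> >       q.add(new Pair<String,Double>("zebra", 0.99));
> >       q.add(new Pair<String,Double>("apple", 0.01));
> >       // Prints ("apple", 0.01): the lexicographically smallest term, not
> >       // the smallest weight, so the entry the queue evicts first is
> >       // chosen alphabetically rather than by weight.
> >       System.out.println(q.peek());
> >     }
> >   }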
> >
> >
> >
> > On Fri, Jan 27, 2012 at 2:11 PM, John Conwell <[email protected]> wrote:
> >
> > > I used the CVB variant of LDA, and when I tried to run LDAPrintTopics I
> > > noticed the topicModel output datatypes changed from the original LDA
> > > implementation.  So I figured I'd just write my own for CVB and base it
> > > off the LDA implementation.
> > >
> > > And I noticed something odd.  When running LDAPrintTopics, it gathers
> > > the top N terms per topic (topWordsForTopics) and normalizes the values
> > > in the vector, which makes sense.  But during the normalization
> > > calculation it also weights the vector by using Math.exp(score) instead
> > > of just the straight score for all calculations.
> > >
> > > I get that using Math.exp(score) will give exponentially larger values
> > > a stronger weighting than smaller values, but why is this done in the
> > > normalization?
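> > >
> > > For concreteness, the weighting I'm describing amounts to something
> > > like this hypothetical helper (not the actual Mahout code):
> > >
> > >   public class ExpNormalize {   // made-up class name, just a sketch
> > >     // Normalize a topic's term scores so they sum to 1, weighting each
> > >     // score by Math.exp() first, as LDAPrintTopics appears to do.
> > >     public static double[] expNormalize(double[] scores) {
> > >       double sum = 0.0;
> > >       for (double score : scores) {
> > >         sum += Math.exp(score);
> > >       }
> > >       double[] normalized = new double[scores.length];
> > >       for (int i = 0; i < scores.length; i++) {
> > >         normalized[i] = Math.exp(scores[i]) / sum;
> > >       }
> > >       return normalized;
> > >     }
> > >   }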
> > >
> > > And if I was going to use the topicModel output as the input to some
> > > other algorithm, would I want to run the topicModel vectors through the
> > > same kind of weighting normalization?  And if so, why not just persist
> > > the topicModel in this weighted, normalized format in the first place?
> > >
> > > And finally, should I also use this same weighting normalization on the
> > > docTopics output?  The docTopics are normalized (well, they all add up
> > > to 1), but are they normalized in the same manner?
> > >
> > > I'm just trying to figure out how to use the LDA output, and whether
> > > there are any steps I need to consider before I use it as input to
> > > something else.
> > >
> > > --
> > >
> > > Thanks,
> > > John C
> > >
> > >
> >
> >
> > --
> >
> > Thanks,
> > John C
> >
>



-- 

Thanks,
John C
