Totally understand on the bug fix. But for anyone who wants the fix, I've created a patch, attached to this email.
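For anyone who'd rather see it than apply the patch, here's a minimal, self-contained sketch of the idea. Note the `Pair` class below is just a simplified stand-in for Mahout's `org.apache.mahout.common.Pair` so the sketch compiles on its own (the accessor names are illustrative), and `maybeEnqueue()` only mirrors the logic in LDAPrintTopics, it isn't the actual source:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class TopTermsPatchSketch {

    // Simplified stand-in for Mahout's Pair<A,B> (accessor names assumed
    // for illustration) so this sketch is self-contained.
    static class Pair<A, B> {
        private final A first;
        private final B second;
        Pair(A first, B second) { this.first = first; this.second = second; }
        A getFirst()  { return first; }
        B getSecond() { return second; }
    }

    // The fix: compare only the weight (the Double), ignoring the term
    // string, so the queue's head is always the lowest-weighted term.
    static final Comparator<Pair<String, Double>> BY_WEIGHT =
        new Comparator<Pair<String, Double>>() {
            @Override
            public int compare(Pair<String, Double> a, Pair<String, Double> b) {
                return a.getSecond().compareTo(b.getSecond());
            }
        };

    // Mirrors the maybeEnqueue() pattern: keep at most maxN entries,
    // evicting the lowest-weighted one when a heavier term arrives.
    static void maybeEnqueue(PriorityQueue<Pair<String, Double>> queue,
                             String term, double weight, int maxN) {
        if (queue.size() < maxN) {
            queue.add(new Pair<String, Double>(term, weight));
        } else if (weight > queue.peek().getSecond()) {
            queue.poll();
            queue.add(new Pair<String, Double>(term, weight));
        }
    }

    public static void main(String[] args) {
        // Pass BY_WEIGHT to the constructor instead of relying on
        // Pair's Comparable implementation (which compares strings first).
        PriorityQueue<Pair<String, Double>> queue =
            new PriorityQueue<Pair<String, Double>>(2, BY_WEIGHT);
        maybeEnqueue(queue, "apple", 0.1, 2);
        maybeEnqueue(queue, "zebra", 0.9, 2);
        maybeEnqueue(queue, "mango", 0.5, 2);
        // The two heaviest terms survive; polling a min-heap returns the
        // lighter of them first.
        System.out.println(queue.poll().getFirst()); // mango
        System.out.println(queue.poll().getFirst()); // zebra
    }
}
```

Without the Comparator, "apple" would sort first by string order and never be evicted, which is exactly the top-N-by-term-string bug described below.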
Basically, just create a new Comparator when you create the PriorityQueue, and
only compare the second value of each Pair (the double value), ignoring the
string.

On Fri, Jan 27, 2012 at 8:24 PM, Jake Mannix <[email protected]> wrote:

> Hey John,
>
> Sorry I didn't get back to respond to this earlier: you are essentially
> correct, but if you want to have the effect of "LDAPrintTopics" in the new
> CVB data, since the format is exactly that of a DistributedRowMatrix (i.e.
> a simple SequenceFile<IntWritable,VectorWritable>), you need do nothing
> other than:
>
>   $MAHOUT_HOME/bin/vectordump -s <modelpath> -d <dictionarypath> \
>     -dt sequencefile -p -sort -o ./output_topics.txt
>
> If you only want the top N terms/features per topic, add "-vs 100" to
> that option list.
>
> Hope that helps.
>
>   -jake
>
> p.s. Yes, LDAPrintTopics does a lot of funny things, and might indeed be
> buggy. But I'm more interested in finding bugs / pieces of missing docs
> in the new CVB code, as we are probably removing the old code in the
> next release.
>
> On Fri, Jan 27, 2012 at 5:21 PM, John Conwell <[email protected]> wrote:
>
> > Ok, I think I just wrote all that (and wasted a couple of hours) for
> > nothing. It looks like the topicModel output for the CVB algorithm is
> > the normalized output from the last model generated from the tempState
> > folder. Basically it automatically does for me some of what
> > LDAPrintTopics does; it normalizes the topic word weights.
> >
> > That means there is no reason to do the weighting normalization for
> > CVB, correct? And we still have to manually pull out the top N terms
> > by weight for the topic, and match their index in the vector with the
> > dictionary in order to get a readable top N words per topic, correct?
> >
> > But, I think in all that looking I found a bug in LDAPrintTopics. It
> > is supposed to spit out the top N words per topic, where top N is
> > based on the term weight for that topic.
> > The function maybeEnqueue() uses a PriorityQueue<Pair<String,Double>>,
> > but doesn't pass in a Comparator, so it uses the Comparable
> > implementation for Pair<A,B>, which first compares the String in the
> > Pair and only compares the Double value if the strings are equal. But
> > this never happens, since no terms are duplicated within a topic, and
> > hence the term weight value is never checked. I double-checked by
> > putting a breakpoint in the compareTo method in Pair, and it never
> > made it past the string comparison.
> >
> > All this means is that LDAPrintTopics is outputting the top N terms
> > per topic in term string sorted order.
> >
> > On Fri, Jan 27, 2012 at 2:11 PM, John Conwell <[email protected]> wrote:
> >
> > > I used the CVB variant of LDA, and when I tried to run
> > > LDAPrintTopics I noticed the topicModel output datatypes changed
> > > from the original LDA implementation. So I figured I'd just write my
> > > own for CVB, and base it off the LDA implementation.
> > >
> > > And I noticed something odd. When running LDAPrintTopics, it gathers
> > > the top N terms by topic (topWordsForTopics) and normalizes the
> > > values in the vector, which makes sense. But during the
> > > normalization calculation it also weights the vector by using
> > > Math.exp(score) instead of just the straight score for all
> > > calculations.
> > >
> > > I get that using Math.exp(score) will give exponentially larger
> > > values a stronger weighting than smaller values, but why is this
> > > done in the normalization?
> > >
> > > And if I was going to use the topicModel output as the input to some
> > > other algorithm, would I want to run the topicModel vectors through
> > > the same kind of weighting normalization? And if so, why not just
> > > persist the topicModel in this weighted normalized format in the
> > > first place?
> > > And finally, should I also use this same weighting normalization on
> > > the docTopics output as well? The docTopics are normalized (well,
> > > they all add up to 1), but are they normalized in the same manner?
> > >
> > > I'm just trying to figure out how to use the LDA output, and figure
> > > out if there are any steps I need to consider before I use it as
> > > input to something else.
> > >
> > > --
> > > Thanks,
> > > John C
> >
> > --
> > Thanks,
> > John C

--
Thanks,
John C
