CVB: Incorrect mapping between p(topic | term) and p(doc | topic) dump files

Mohammed Omer Sun, 13 Jul 2014 11:07:39 -0700

All - I'm having the same issue as mentioned at
http://comments.gmane.org/gmane.comp.apache.mahout.user/18889 on Mahout
0.9. My CVB clusters describe my corpus well; however, the mapping file
generated by mahout's `rowid` seems to be way off.


For example, there's a very obvious cluster which has keywords like "beer,
stout, pale" - the only cluster to contain these keywords. In my vectordump
for the p(term | topic) this cluster is at line 217. Vector dump generated
by:

echo `date` ": Dumping the p(term | topic) vectors to local filesystem..."
$mahout_bin/mahout vectordump -i results/cvb_results/to_out \
  --dictionary results/seq2sparse_results/dictionary.file-0 \
  --vectorSize $NUM_KEYWORDS -sort results/cvb_results/to_out \
  -o $OUTPUT_DIR/$PTOPIC_TERM_FILE -dt sequencefile

And, while the results of dumping out the p(doc | topic) group all of the
documents which contain the words "beer, stout, pale" together - it dumps
them into cluster number 8. The dump is created via:

echo `date` ": Dumping the p(doc | topic) vectors to local filesystem..."
$mahout_bin/mahout vectordump -i results/cvb_results/do_out \
  -sort results/cvb_results/do_out \
  -o $OUTPUT_DIR/$PDOC_TOPIC_FILE -p true -c csv -n true -u true

I.e., a line in the p(doc | topic) dump looks like:

123    0.001,...,0.60,...

Where 123 maps to a document about "beer, stout, pale" and 0.60 is the
9th comma-separated value, which puts the document in cluster id #8
(zero-indexed).
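
A minimal sketch of that index arithmetic, assuming each dump line is a
document id, whitespace, then the comma-separated probabilities.
`dominant_topic` is a hypothetical helper name, not a Mahout command:

```shell
# Hypothetical helper (not part of Mahout): for each input line of the form
# "<doc id><whitespace><p0>,<p1>,...", print the zero-based index of the
# largest probability.
dominant_topic() {
  awk '{
    n = split($NF, p, ",")            # last field holds the CSV probabilities
    best = 1                          # awk arrays from split() start at 1
    for (i = 2; i <= n; i++)
      if (p[i] + 0 > p[best] + 0) best = i
    print best - 1                    # convert to a zero-based cluster id
  }'
}

# Toy line with three topics; the largest value is the third one.
printf '123\t0.1,0.3,0.6\n' | dominant_topic   # prints 2
```

Ties resolve to the lowest index, which is an arbitrary choice in this sketch.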

However, if we look at the p(term | topic) file dumped earlier, cluster
id#8 has nothing to do with this document.

Additionally, I wrote a script to review all of the documents belonging to
any given cluster, and all of the documents in cluster #8 actually map to
the p(term | topic) entry described by cluster #217. That is to say, these
are the only documents containing the ngrams / keywords that cluster #217
lists as describing it.
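
The script itself is not reproduced here. A rough sketch of that kind of
check, assuming the same "<doc id> <csv probabilities>" line layout as the
sample above and using a hypothetical `docs_in_topic` helper, could look
like:

```shell
# Hypothetical helper (not part of Mahout): print the ids of all documents
# whose largest probability falls in the given zero-based topic index.
#   $1 = topic index, $2 = path to the p(doc | topic) CSV dump
docs_in_topic() {
  awk -v want="$1" '{
    n = split($NF, p, ",")            # last field holds the CSV probabilities
    best = 1
    for (i = 2; i <= n; i++)
      if (p[i] + 0 > p[best] + 0) best = i
    if (best - 1 == want) print $1    # first field is the document id
  }' "$2"
}

# Example call (paths are the variables from the dump commands above):
#   docs_in_topic 8 "$OUTPUT_DIR/$PDOC_TOPIC_FILE"
```

The resulting ids can then be looked up in the rowid docIndex to pull the
original documents and compare their terms with a given p(term | topic)
line.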

I can't figure out where the gap is. Is it in the rowid docIndex/matrix I
have? I also tried dumping the above two files without sorting, since I
figured sorting might be rearranging the order of the cluster probabilities
in the p(doc | topic) dump, but the results were inconclusive.

I would love any ideas - I've been stumped on this for a little while now.

Thank you,

Mo
