Using Mahout to train a CVB model and retrieve its topics

Folcon Sat, 28 Jul 2012 18:10:40 -0700

Hi Everyone,

I'm posting this again because my original message did not seem to appear on
the mailing list. I'm very sorry if I have done this in error.


I'm doing this so that I can use the topics to train a maxent algorithm to
predict the classes of documents given their topic mixtures. Any further help
in this direction would be appreciated!

I've been trying to extract the topics from my run of cvb. Here's what I have
done so far.

I still don't know how to output the topics, but I have worked out how to run
cvb and produce what I think are the document vectors. However, I'm not having
any luck dumping them, so help here would still be appreciated!

I set the following values on the master, otherwise none of this works:
    export MAHOUT_HOME=/home/sgeadmin/mahout
    export HADOOP_HOME=/usr/lib/hadoop
    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
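
Since the later steps depend on these four variables, a quick check before
starting a job can save a failed run. This is only a minimal sketch: the
helper name missing_vars is hypothetical and not part of Mahout or Hadoop.

```shell
#!/bin/sh
# Hypothetical helper (not part of Mahout or Hadoop).
# Prints the names of any listed variables that are unset or empty,
# and returns non-zero if at least one is missing.
missing_vars() {
    missing=""
    for name in "$@"; do
        # Indirect lookup of the variable called $name (POSIX sh has no ${!name}).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    # Strip the leading space before printing.
    echo "${missing# }"
    [ -z "$missing" ]
}

# Usage: check the four variables from this post before starting a job.
if ! missing_vars MAHOUT_HOME HADOOP_HOME JAVA_HOME HADOOP_CONF_DIR >/dev/null; then
    echo "Some Mahout/Hadoop variables are unset" >&2
fi
```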

So first I uploaded the documents using StarCluster's put:
    starcluster put mycluster text_train /home/sgeadmin/
    starcluster put mycluster text_test /home/sgeadmin/

Then I added them to Hadoop's filesystem (HDFS):
    dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop starcluster

Then I called Mahout's seqdirectory to turn the text into sequence files:
    $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train \
        --output /user/sgeadmin/text_seq -c UTF-8 -ow

Then I called Mahout's seq2sparse to turn them into vectors:
    $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec \
        -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow

Finally I called cvb. I believe that the -dt flag states where the inferred
topics should go, but because I haven't yet been able to dump them I can't
confirm this.
    $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors \
        -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 \
        -dict /user/sgeadmin/text_vec/dictionary.file-0 \
        -dt /user/sgeadmin/text_cvb_document \
        -mt /user/sgeadmin/text_states

The -k flag is the number of topics and the -nt flag is the size of the
dictionary, which I computed by counting the number of entries in
dictionary.file-0 inside the vectors directory (in this case under
/user/sgeadmin/text_vec). The -x flag is the number of iterations.
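
The dictionary count used for -nt could be scripted along these lines. This
is a sketch under an assumption: the dictionary has already been dumped to
text (for example with Mahout's seqdumper) and each entry appears as one
line starting with "Key: ". The function name count_dictionary_entries and
the sample dump are illustrative only.

```shell
#!/bin/sh
# Count dictionary entries in a text dump read from stdin.
# Assumption: each entry is one line starting with "Key: "; header and
# summary lines (such as "Input Path:" or "Count:") are ignored.
count_dictionary_entries() {
    # grep -c exits non-zero when it counts 0 lines, so mask that status.
    grep -c '^Key: ' || true
}

# Illustrative sample in the assumed dump format (not real output).
sample_dump='Input Path: dictionary.file-0
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.IntWritable
Key: apple: Value: 0
Key: banana: Value: 1
Key: cherry: Value: 2
Count: 3'

# Prints 3 for the sample above.
printf '%s\n' "$sample_dump" | count_dictionary_entries
```

On the cluster, the sample would be replaced by the text dump of
/user/sgeadmin/text_vec/dictionary.file-0, and the resulting number passed
to cvb as the -nt value.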

If you know how to get the document topic probabilities from here, help would
be most appreciated!
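
Once the -dt output can be dumped to text, picking out each document's
topic mixture might look like the sketch below. It rests on an unconfirmed
assumption: that every dumped line has the form
"Key: <doc id>: Value: {<topic>:<probability>,...}". The function name
top_topic_per_document and the sample lines are hypothetical.

```shell
#!/bin/sh
# For each dumped document line on stdin, print:
#   <doc id> <most probable topic> <its probability>
# Assumption: lines look like "Key: 0: Value: {0:0.1,1:0.7,2:0.2}".
top_topic_per_document() {
    awk '
    /^Key: / {
        # The document id sits between "Key: " and ": Value: ".
        doc = $0
        sub(/^Key: /, "", doc)
        sub(/: Value: .*$/, "", doc)

        # The topic vector is the text between the braces.
        vec = $0
        sub(/^[^{]*[{]/, "", vec)
        sub(/[}].*$/, "", vec)

        # Scan the topic:probability pairs and keep the largest one.
        best_topic = ""; best_p = -1; best_text = ""
        n = split(vec, pairs, ",")
        for (i = 1; i <= n; i++) {
            split(pairs[i], kv, ":")
            if (kv[2] + 0 > best_p) {
                best_p = kv[2] + 0
                best_topic = kv[1]
                best_text = kv[2]
            }
        }
        print doc, best_topic, best_text
    }'
}

# Illustrative input only (not real cvb output).
printf '%s\n' \
    'Key: 0: Value: {0:0.1,1:0.7,2:0.2}' \
    'Key: 1: Value: {0:0.6,1:0.3,2:0.1}' | top_topic_per_document
```

For the maxent step, the full vector between the braces (rather than only
the top topic) would be the feature set for each document.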

Kind Regards,
Folcon
