Hi Folcon,
 
In the folder you specified for the -dt option of the cvb command
there should be sequence files with the document-to-topic associations (Key:
IntWritable, Value: VectorWritable).  You
can dump them in text format with: mahout seqdumper -s <sequence file>
In the text output from seqdumper, the key is a document id and the vector
contains the topics and their scores for that document.  I think all topics
are listed for each document, but many have near-zero scores.
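To make that concrete, here is a small Python sketch that parses one line of seqdumper's text output into a document id and a topic-to-score dict. The exact line format (`Key: <id>: Value: {topic:score,...}`) is an assumption based on typical seqdumper dumps, so treat the regex as a starting point, not a spec:

```python
import re

def parse_seqdumper_line(line):
    """Parse an assumed 'Key: <id>: Value: {t:score,...}' seqdumper line
    into (doc_id, {topic: score}); returns None if the line doesn't match."""
    m = re.match(r"Key:\s*(\d+):\s*Value:\s*\{(.*)\}", line.strip())
    if m is None:
        return None
    doc_id = int(m.group(1))
    topics = {}
    if m.group(2):
        for pair in m.group(2).split(","):
            topic, score = pair.split(":")
            topics[int(topic)] = float(score)
    return doc_id, topics

# Made-up example line:
doc_id, topics = parse_seqdumper_line("Key: 7: Value: {0:0.01,3:0.95,12:0.04}")
print(doc_id, max(topics, key=topics.get))  # document id and its top topic
```

From there, taking the highest-scoring topic (or the full mixture) per document gives you the features Folcon wants for the maxent step.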
In my case I used rowid to convert the keys of the original sparse
document vectors from Text to Integer before running cvb.  This also generates a
mapping file, so I know which textual keys correspond to the numeric document
ids (my original document ids were file names and I had created named vectors).
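That join of numeric ids back to textual keys is simple once the mapping file has been dumped to a dict; the sketch below assumes you have already parsed both the doc-topic vectors and the rowid mapping into plain dicts (all names and data here are hypothetical):

```python
def attach_names(doc_topics, id_to_name):
    """Replace numeric document ids with the textual keys recovered from
    the rowid mapping file (both dicts assumed already parsed from dumps)."""
    return {id_to_name[doc_id]: topics for doc_id, topics in doc_topics.items()}

# Hypothetical parsed data:
doc_topics = {0: {3: 0.9, 7: 0.1}, 1: {2: 0.8, 5: 0.2}}
id_to_name = {0: "doc_a.txt", 1: "doc_b.txt"}

named = attach_names(doc_topics, id_to_name)
print(named["doc_a.txt"])  # topic mixture keyed by the original file name
```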
Hope this helps.
Dan

________________________________
 From: Folcon <[email protected]>
To: [email protected] 
Sent: Saturday, July 28, 2012 8:28 PM
Subject: Using Mahout to train a CVB model and retrieve its topics
  
Hi Everyone,

I'm posting this because my original message did not seem to appear on the
mailing list; apologies if I've done this in error.

My aim is to use the topics to train a maxent algorithm to predict the
classes of documents given their topic mixtures. Any further help in this
direction would be appreciated!

I've been trying to extract the topics from my run of cvb. Here's what I've
done so far.

I still don't know how to output the topics, but I have worked out how to
run cvb and produce what I think are the document vectors. However, I'm not
having any luck dumping them, so help here would still be appreciated!

I set the values of:
    export MAHOUT_HOME=/home/sgeadmin/mahout
    export HADOOP_HOME=/usr/lib/hadoop
    export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
    export HADOOP_CONF_DIR=$HADOOP_HOME/conf
on the master, otherwise none of this works.

First I uploaded the documents using StarCluster's put:
    starcluster put mycluster text_train /home/sgeadmin/
    starcluster put mycluster text_test /home/sgeadmin/

Then I added them to Hadoop's HDFS filesystem:
    dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop starcluster

Then I called Mahout's seqdirectory to turn the text into sequence files:
    $MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train --
output /user/sgeadmin/text_seq -c UTF-8 -ow

Then I called Mahout's seq2sparse to turn them into vectors:
    $MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec -
wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow

Finally I called cvb. I believe the -dt flag specifies where the inferred
document/topic distributions should go, but since I haven't yet been able to
dump them I can't confirm this.
    $MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors -o 
/user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 -dict 
/user/sgeadmin/text_vec/dictionary.file-0 -dt /user/sgeadmin/text_cvb_document -
mt /user/sgeadmin/text_states

The -k flag is the number of topics and the -nt flag is the size of the
dictionary; I computed the latter by counting the number of entries in
dictionary.file-0 inside the vectors directory (in this case
/user/sgeadmin/text_vec). The -x flag is the number of iterations.
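One way to automate that count: dump dictionary.file-0 to text with seqdumper and count the entry lines. The sketch below assumes each dictionary entry appears on a line starting with "Key:" in the dump, which may not match your seqdumper version exactly; the dump fragment is hypothetical:

```python
def count_dictionary_entries(dump_lines):
    """Count entries in a seqdumper text dump of dictionary.file-0,
    assuming each term occupies one line starting with 'Key:'."""
    return sum(1 for line in dump_lines if line.lstrip().startswith("Key:"))

# Hypothetical dump fragment (headers are skipped, entries are counted):
dump = [
    "Input Path: /user/sgeadmin/text_vec/dictionary.file-0",
    "Key: apple: Value: 0",
    "Key: banana: Value: 1",
    "Key: cherry: Value: 2",
]
print(count_dictionary_entries(dump))  # value to pass as -nt
```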

If you know how to extract the document-topic probabilities from here, help
would be most appreciated!

Kind Regards,
Folcon
