Hi Folcon,

In the folder you specified for the -dt option of the cvb command there should be sequence files with the document-to-topic associations (Key: IntWritable, Value: VectorWritable). You can dump them in text format with:

mahout seqdumper -s <sequence file>

In the seqdumper text output, the key is a document id and the vector contains the topics and their scores for that document. I think all topics are listed for each document, but many with near-zero score.

In my case I used rowid to convert the keys of the original sparse document vectors from Text to Integer before running cvb. This generates a mapping file, so I know which textual keys correspond to the numeric document ids (my original document ids were file names and I had created named vectors).

Hope this helps.

Dan
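As a rough sketch of post-processing that seqdumper output: assuming it prints one line per record in the form "Key: <id>: Value: {topic:score,topic:score,...}" (the exact line format may vary between Mahout versions, so check yours first), a small awk filter can pull the highest-scoring topic for each document:

```shell
# Read seqdumper-style lines on stdin and print "<doc id> <top topic>".
# Assumes the "Key: <id>: Value: {t:s,...}" line format described above.
top_topic() {
  awk -F'Value: ' '/^Key:/ {
    split($1, k, " "); id = k[2]; sub(/:$/, "", id)   # document id
    gsub(/[{}]/, "", $2)                              # strip vector braces
    n = split($2, pairs, ",")
    best = -1; top = ""
    for (i = 1; i <= n; i++) {
      split(pairs[i], kv, ":")
      if (kv[2] + 0 > best) { best = kv[2] + 0; top = kv[1] }
    }
    print id, top
  }'
}

# e.g. (the part-file path is hypothetical):
# mahout seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000 | top_topic
```

If you went through rowid first, the mapping file it produced can then translate the numeric document ids back to your original textual keys.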
________________________________
From: Folcon <[email protected]>
To: [email protected]
Sent: Saturday, July 28, 2012 8:28 PM
Subject: Using Mahout to train a CVB and retrieve its topics

Hi Everyone,

I'm posting this because my original message did not seem to appear on the mailing list; I'm very sorry if I have done this in error. My goal is to use the topics to train a maxent algorithm to predict the classes of documents given their topic mixtures, so any further help in that direction would be appreciated!

I've been trying to extract the topics from my run of cvb. Here's what I've done so far. I still don't know how to output the topics, but I have worked out how to run cvb and produce what I think are the document vectors. However, I'm not having any luck dumping them, so help here would still be appreciated!

First I set the following on the master (otherwise none of this works):

export MAHOUT_HOME=/home/sgeadmin/mahout
export HADOOP_HOME=/usr/lib/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export HADOOP_CONF_DIR=$HADOOP_HOME/conf

Then I uploaded the documents using StarCluster's put:

starcluster put mycluster text_train /home/sgeadmin/
starcluster put mycluster text_test /home/sgeadmin/

Then I added them to Hadoop's filesystem (HDFS):

dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop starcluster

Then I called Mahout's seqdirectory to turn the text into sequence files:

$MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train --output /user/sgeadmin/text_seq -c UTF-8 -ow

Then I called Mahout's seq2sparse to turn them into vectors:

$MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow

Finally I called cvb. I believe the -dt flag states where the inferred document/topic distributions should go, but because I haven't yet been able to dump them I can't confirm this.
$MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 -dict /user/sgeadmin/text_vec/dictionary.file-0 -dt /user/sgeadmin/text_cvb_document -mt /user/sgeadmin/text_states

Here -k is the number of topics, -nt is the size of the dictionary (I computed this by counting the entries of dictionary.file-0 under /user/sgeadmin/text_vec), and -x is the number of iterations.

If you know how to get the document/topic probabilities from here, help would be most appreciated!

Kind Regards,

Folcon
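[Editor's note on the -nt value above: one way to count the dictionary entries is to dump dictionary.file-0 with seqdumper and count its records. This sketch assumes each entry appears in the text output as a line starting with "Key:", which may differ in your Mahout version.]

```shell
# Count "Key:" lines on stdin, one per dictionary entry (an assumption
# about the seqdumper text output format; verify against your version).
count_terms() {
  grep -c '^Key:'
}

# e.g.:
# mahout seqdumper -s /user/sgeadmin/text_vec/dictionary.file-0 | count_terms
```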
