Hi Everyone,
I'm reposting this because my original message did not seem to appear on the
mailing list. I'm very sorry if I have done this in error.
I've been trying to extract the topics out of my run of cvb. I want to use
the topics to train a maxent algorithm to predict the classes of documents
given their topic mixtures. Any further aid in this direction would be
appreciated! Here's what I did so far.
Ok, so I still don't know how to output the topics, but I have worked out how
to run cvb and get what I think are the document vectors. However, I'm not
having any luck dumping them, so help here would still be appreciated!
I set the values of:
export MAHOUT_HOME=/home/sgeadmin/mahout
export HADOOP_HOME=/usr/lib/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
on the master; otherwise none of this works.
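Since everything depends on those four variables, a quick check before running any of the later commands can save a confusing failure. This is only a minimal sketch: check_mahout_env is a hypothetical helper name of my own, not something that ships with Mahout or Hadoop.

```shell
# Hypothetical helper (not part of Mahout): report which of the
# required variables are unset or empty in the current shell.
check_mahout_env() {
    missing=0
    for name in MAHOUT_HOME HADOOP_HOME JAVA_HOME HADOOP_CONF_DIR; do
        # Look up the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    # Returns 0 when all four are set, 1 otherwise.
    return "$missing"
}

check_mahout_env || echo "set the variables above before calling mahout" >&2
```

It only tests that the variables are non-empty; it does not verify that the paths exist on the master.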
So first I uploaded the documents using StarCluster's put:
starcluster put mycluster text_train /home/sgeadmin/
starcluster put mycluster text_test /home/sgeadmin/
Then I added them to Hadoop's filesystem (HDFS):
dumbo put /home/sgeadmin/text_train /user/sgeadmin/ -hadoop starcluster
Then I called Mahout's seqdirectory to turn the text into sequence files:
$MAHOUT_HOME/bin/mahout seqdirectory --input /user/sgeadmin/text_train \
    --output /user/sgeadmin/text_seq -c UTF-8 -ow
Then I called Mahout's seq2sparse to turn them into vectors:
$MAHOUT_HOME/bin/mahout seq2sparse -i text_seq -o /user/sgeadmin/text_vec \
    -wt tf -a org.apache.lucene.analysis.WhitespaceAnalyzer -ow
Finally I called cvb. I believe that the -dt flag states where the inferred
topics should go, but because I haven't yet been able to dump them I can't
confirm this.
$MAHOUT_HOME/bin/mahout cvb -i /user/sgeadmin/text_vec/tf-vectors \
    -o /user/sgeadmin/text_lda -k 100 -nt 29536 -x 20 \
    -dict /user/sgeadmin/text_vec/dictionary.file-0 \
    -dt /user/sgeadmin/text_cvb_document \
    -mt /user/sgeadmin/text_states
The -k flag is the number of topics, the -nt flag is the size of the
dictionary (I computed this by counting the number of entries in
dictionary.file-0 inside the vectors directory, in this case
/user/sgeadmin/text_vec), and -x is the number of iterations.
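For anyone wanting to reproduce the -nt count, here is a rough sketch of one way to do it. It assumes the dictionary has been dumped to text with Mahout's seqdumper and that each entry appears on its own line starting with "Key: "; that output format is my assumption and may differ between Mahout versions. count_dictionary_terms is a hypothetical helper name.

```shell
# Hypothetical helper: count dictionary entries in seqdumper-style
# text read from stdin. Assumes one entry per line, each line
# beginning with "Key: " (header lines such as "Key class: ..." and
# the trailing "Count: ..." line are not counted).
count_dictionary_terms() {
    # grep exits non-zero when nothing matches; "|| true" keeps the
    # function usable in scripts that stop on errors.
    grep -c '^Key: ' || true
}

# Possible use on the cluster (not run here; the seqdumper input flag
# has differed between Mahout releases, so check your version's help):
#   $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_vec/dictionary.file-0 \
#       | count_dictionary_terms
```

The number it prints is what would be passed as -nt.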
If you know how to get the document topic probabilities from here, help
would be most appreciated!
Kind Regards,
Folcon