Hi, I am trying to run Mahout LDA over the Reuters data set as described in Mahout in Action; however, I always get only one topic returned. I am running Mahout 0.5, and here are my steps:

$ mvn -e -q exec:java -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" -Dexec.args="reuters/ reuters-extracted/"

Next I had to put the output directory (reuters-extracted) into HDFS which wasn't mentioned in the book.

$ hadoop dfs -put reuters-extracted/* reuters/
$ bin/mahout seqdirectory -c UTF-8 -i reuters/ -o reuters-seqfiles
$ bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -ow
$ bin/mahout lda -i reuters-vectors/tf-vectors -o reuters-lda-sparse -k 10 -v 70000 -x 20
$ bin/mahout org.apache.mahout.clustering.lda.LDAPrintTopics -i reuters-lda-sparse/state-20/ -d reuters-vectors/dictionary.file-* -dt sequencefile -w 5
Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop
No HADOOP_CONF_DIR set, using /usr/lib/hadoop/src/conf
11/11/09 17:43:55 WARN driver.MahoutDriver: No org.apache.mahout.clustering.lda.LDAPrintTopics.props found on classpath, will use command-line arguments only
Topic 0
===========
pct [p(pct|topic_0) = 0.04985000259283585
from [p(from|topic_0) = 0.04332905057607894
said [p(said|topic_0) = 0.03736886059106963
1986 [p(1986|topic_0) = 0.015418741367019371
dlrs [p(dlrs|topic_0) = 0.014674464223644563
11/11/09 17:44:01 INFO driver.MahoutDriver: Program took 6337 ms

Any suggestions?

Thanks,
Varnit
