It seems to be a bug in ldatopics in Mahout 0.5; the ldatopics utility works as expected in 0.6.
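For anyone who wants to check whether they are hitting the same problem, here is a rough sketch that counts the "Topic N" header lines in the printed output. It assumes the output format shown in the quoted message; `count_topics` is a made-up helper name, not part of Mahout. With -k 10 one would expect the count to be 10 rather than 1.

```shell
# Hypothetical helper (not part of Mahout): count "Topic N" header lines
# in LDAPrintTopics output read from stdin.
count_topics() {
  # grep -c prints the number of matching lines; "|| true" keeps the exit
  # status at zero when there are no matches (grep would otherwise return 1).
  grep -c -E '^Topic [0-9]+' || true
}

# Sample taken from the output quoted in this thread: a single topic header.
printf 'Topic 0\n===========\npct [p(pct|topic_0) = 0.04985000259283585\n' | count_topics
# prints 1
```

In practice the output of the LDAPrintTopics command can be piped straight into the function instead of the sample text.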
-varnit

On Nov 9, 2011, at 12:48 PM, Varnit Khanna wrote:

> Hi,
> I am trying to run the Mahout LDA over the Reuters data set as
> described in Mahout in Action however I always get only 1 topic
> returned. I am running on Mahout 0.5 and here are my steps:
>
> $ mvn -e -q exec:java
> -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters"
> -Dexec.args="reuters/ reuters-extracted/"
>
> Next I had to put the output directory (reuters-extracted) into HDFS
> which wasn't mentioned in the book.
>
> $ hadoop dfs -put reuters-extracted/* reuters/
>
> $ bin/mahout seqdirectory -c UTF-8 -i reuters/ -o reuters-seqfiles
> $ bin/mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -ow
> $ bin/mahout lda -i reuters-vectors/tf-vectors -o reuters-lda-sparse
> -k 10 -v 70000 -x 20
>
> $ bin/mahout org.apache.mahout.clustering.lda.LDAPrintTopics -i
> reuters-lda-sparse/state-20/ -d reuters-vectors/dictionary.file-* -dt
> sequencefile -w 5
> Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop
> No HADOOP_CONF_DIR set, using /usr/lib/hadoop/src/conf
> 11/11/09 17:43:55 WARN driver.MahoutDriver: No
> org.apache.mahout.clustering.lda.LDAPrintTopics.props found on
> classpath, will use command-line arguments only
> Topic 0
> ===========
> pct [p(pct|topic_0) = 0.04985000259283585
> from [p(from|topic_0) = 0.04332905057607894
> said [p(said|topic_0) = 0.03736886059106963
> 1986 [p(1986|topic_0) = 0.015418741367019371
> dlrs [p(dlrs|topic_0) = 0.014674464223644563
> 11/11/09 17:44:01 INFO driver.MahoutDriver: Program took 6337 ms
>
>
> Any suggestions?
>
> Thanks,
> Varnit
