Hi Yutaka,

On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 <[email protected]> wrote:

> Hi
> Here is a question around how to evaluate the result of Mahout 0.7 CVB
> (Collapsed Variational Bayes), which used to be LDA
> (Latent Dirichlet Allocation) in Mahout versions under 0.5.
> I believe I have no problem running CVB itself and this is purely a
> question on the efficient way to visualize or evaluate the result.
> Looks like result evaluation in Mahout-0.5 at least could be done using the
> utility called "LDAPrintTopic", however this is already
> obsolete since Mahout 0.5. (See "Mahout in Action" p.181 on LDA)
>
> As I said, I'm using Mahout-0.7. I believe I'm running CVB
> successfully and obtained results in two separate directories: in
> /user/hadoop/temp/topicModelState/model-1 through model-20 as specified as
> number of iterations and also in
> /user/hadoop/NHTSA-LDA-sparse/part-m-00000 through part-m-00009 as
> specified as number of topics that I wanted to extract/decompose.
>
> Neither of the files contained in the directory can be dumped using Mahout
> vectordump, however the output format is way different
> from what you should've gotten using LDAPrintTopic in below 0.5 which
> should give you back the result as the Topic Id. and its
> associated top terms in very direct format. (See "Mahout in Action" p.181
> again).
>

Vectordump should be exactly what you want, actually.

>
> Here is what I've done as below.
> 1. Say I have already generated document vectors and use tf-vectors to
> generate a document/term matrix as
>
> $MAHOUT_HOME/bin/mahout rowid -i NHTSA-vectors03/tf-vectors -o
> NHTSA-matrix03
>
> 2. and get rid of the matrix docIndex as it should get in my way (as I've
> been advised somewhere…)
> $HADOOP_HOME/bin/hadoop dfs -mv NHTSA-matrix03/docIndex
> NHTSA-matrix03-docIndex
>
> 3. confirmed that I have only what I need here as
> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-matrix03/
> Found 1 items
> -rw-r--r-- 1 hadoop supergroup 42471833 2012-12-20 07:11
> /user/hadoop/NHTSA-matrix03/matrix
>
> 4. and kick off CVB as
> $MAHOUT_HOME/bin/mahout cvb -i NHTSA-matrix03 -o NHTSA-LDA-sparse -dict
> NHTSA-vectors03/dictionary.file-* -k 10 -x 20 -ow
> …
> ….
> 12/12/20 19:37:31 INFO driver.MahoutDriver: Program took 43987688 ms
> (Minutes: 733.1281333333334)
> (Took over 12hrs to process 100k documents on my laptop with
> pseudo-distributed Hadoop 0.20.203)
>
> 5. Take a look at what I've got.
> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-LDA-sparse
> Found 12 items
> -rw-r--r-- 1 hadoop supergroup 0 2012-12-20 19:37
> /user/hadoop/NHTSA-LDA-sparse/_SUCCESS
> drwxr-xr-x - hadoop supergroup 0 2012-12-20 19:36
> /user/hadoop/NHTSA-LDA-sparse/_logs
> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:36
> /user/hadoop/NHTSA-LDA-sparse/part-m-00000
> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:36
> /user/hadoop/NHTSA-LDA-sparse/part-m-00001
> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:36
> /user/hadoop/NHTSA-LDA-sparse/part-m-00002
> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:36
> /user/hadoop/NHTSA-LDA-sparse/part-m-00003
> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37
> /user/hadoop/NHTSA-LDA-sparse/part-m-00004
> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37
> /user/hadoop/NHTSA-LDA-sparse/part-m-00005
> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37
> /user/hadoop/NHTSA-LDA-sparse/part-m-00006
> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37
> /user/hadoop/NHTSA-LDA-sparse/part-m-00007
> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37
> /user/hadoop/NHTSA-LDA-sparse/part-m-00008
> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37
> /user/hadoop/NHTSA-LDA-sparse/part-m-00009
> [hadoop@localhost NHTSA]$
>

Ok, these should be your model files, and to view them, you can do it the way
you can view any SequenceFile<IntWritable, VectorWritable>, like this:

  $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse \
    -dict NHTSA-vectors03/dictionary.file-* -o topic_dump.txt \
    --dictionaryType sequencefile --vectorSize 5 --sort

This will dump the top 5 terms (with weights - not sure if they'll be
normalized properly) from each topic to the output file "topic_dump.txt".

Incidentally, this same command can be run on the topicModelState
directories as well, which lets you see how fast your topic model was
converging (and thus shows you, on a smaller data set, how many iterations
you may want to be running with later on).

>
> and
> $HADOOP_HOME/bin/hadoop dfs -ls temp/topicModelState
> Found 20 items
> drwxr-xr-x - hadoop supergroup 0 2012-12-20 07:59
> /user/hadoop/temp/topicModelState/model-1
> drwxr-xr-x - hadoop supergroup 0 2012-12-20 13:32
> /user/hadoop/temp/topicModelState/model-10
> drwxr-xr-x - hadoop supergroup 0 2012-12-20 14:09
> /user/hadoop/temp/topicModelState/model-11
> drwxr-xr-x - hadoop supergroup 0 2012-12-20 14:46
> /user/hadoop/temp/topicModelState/model-12
> drwxr-xr-x - hadoop supergroup 0 2012-12-20 15:23
> /user/hadoop/temp/topicModelState/model-13
> drwxr-xr-x - hadoop supergroup 0 2012-12-20 15:59
> /user/hadoop/temp/topicModelState/model-14
> drwxr-xr-x - hadoop supergroup 0 2012-12-20 16:36
> /user/hadoop/temp/topicModelState/model-15
> drwxr-xr-x - hadoop supergroup 0 2012-12-20 17:13
> /user/hadoop/temp/topicModelState/model-16
> drwxr-xr-x - hadoop supergroup 0 2012-12-20 17:48
> /user/hadoop/temp/topicModelState/model-17
> drwxr-xr-x - hadoop supergroup 0 2012-12-20 18:25
> /user/hadoop/temp/topicModelState/model-18
> drwxr-xr-x - hadoop supergroup 0 2012-12-20 18:59
> /user/hadoop/temp/topicModelState/model-19
> drwxr-xr-x - hadoop supergroup 0 2012-12-20 08:37
> /user/hadoop/temp/topicModelState/model-2
> drwxr-xr-x - hadoop supergroup 0 2012-12-20 19:36
> /user/hadoop/temp/topicModelState/model-20
> drwxr-xr-x - hadoop supergroup 0 2012-12-20 09:13
> /user/hadoop/temp/topicModelState/model-3
> drwxr-xr-x - hadoop supergroup 0 2012-12-20 09:50
> /user/hadoop/temp/topicModelState/model-4
> drwxr-xr-x - hadoop supergroup 0 2012-12-20 10:27
> /user/hadoop/temp/topicModelState/model-5
> drwxr-xr-x - hadoop supergroup 0 2012-12-20 11:04
> /user/hadoop/temp/topicModelState/model-6
> drwxr-xr-x - hadoop supergroup 0 2012-12-20 11:41
> /user/hadoop/temp/topicModelState/model-7
> drwxr-xr-x - hadoop supergroup 0 2012-12-20 12:18
> /user/hadoop/temp/topicModelState/model-8
> drwxr-xr-x - hadoop supergroup 0 2012-12-20 12:55
> /user/hadoop/temp/topicModelState/model-9
>
> Hope someone can help with this.
> Regards,
> Yutaka
>

--

  -jake
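
P.S. For the convergence check on the topicModelState directories, here is a
rough sketch of how the vectordump call could be repeated for model-1 through
model-20. The paths and vectordump flags are the ones used in this thread;
the helper name dump_cmd, the DRY_RUN switch, the per-iteration output file
names and the /opt/mahout fallback are made up for illustration. By default
it only prints the commands, so nothing runs until you set DRY_RUN=0.

```shell
#!/bin/sh
# Sketch only: repeat the vectordump call from this thread for every
# intermediate model so the top terms can be compared across iterations.

STATE_DIR="temp/topicModelState"           # directory from the listing in this thread
DICT="NHTSA-vectors03/dictionary.file-*"   # same dictionary argument as the cvb run
ITERATIONS=20                              # matches -x 20 in the cvb run
DRY_RUN="${DRY_RUN:-1}"                    # hypothetical switch: 1 = print only
# Fallback path is a placeholder for when MAHOUT_HOME is not set.
MAHOUT_BIN="${MAHOUT_HOME:-/opt/mahout}/bin/mahout"

# Print the vectordump command for iteration number $1.
dump_cmd() {
  echo "$MAHOUT_BIN vectordump -i $STATE_DIR/model-$1" \
       "-dict $DICT -o topic_dump_model-$1.txt" \
       "--dictionaryType sequencefile --vectorSize 5 --sort"
}

i=1
while [ "$i" -le "$ITERATIONS" ]; do
  if [ "$DRY_RUN" = "1" ]; then
    dump_cmd "$i"             # show what would be run
  else
    eval "$(dump_cmd "$i")"   # actually run it
  fi
  i=$((i + 1))
done
```

Comparing, say, topic_dump_model-5.txt with topic_dump_model-20.txt should
then give a feel for when the top terms stop changing.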
