Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?
Hi Jake. Due to housekeeping matters for other things, I have not actually built Mahout 0.7 from the trunk code yet, but before doing so I tried Mahout 0.6 so that I could run LDA straightforwardly. I successfully ran LDA with a TF vector file as input, 68 iterations across 43 documents, specifying 12 topics to be identified.

$MAHOUT_HOME/bin/mahout lda --input JAText-Mahout-0.6-LDA/JAText-luceneTFvectors01/part-out.vec --output JAText-Mahout-0.6-LDA/output --numTopics 12

$HADOOP_HOME/bin/hadoop dfs -ls JAText-Mahout-0.6-LDA/output/
Found 70 items
drwxr-xr-x - hadoop supergroup 0 2013-03-18 13:48 /user/hadoop/JAText-Mahout-0.6-LDA/output/docTopics
drwxr-xr-x - hadoop supergroup 0 2013-03-18 13:03 /user/hadoop/JAText-Mahout-0.6-LDA/output/state-0
drwxr-xr-x - hadoop supergroup 0 2013-03-18 13:04 /user/hadoop/JAText-Mahout-0.6-LDA/output/state-1
.
.
drwxr-xr-x - hadoop supergroup 0 2013-03-18 13:47 /user/hadoop/JAText-Mahout-0.6-LDA/output/state-67
drwxr-xr-x - hadoop supergroup 0 2013-03-18 13:48 /user/hadoop/JAText-Mahout-0.6-LDA/output/state-68

I do see part-m-0 sequence files for each iteration stage. My question is that the $MAHOUT_HOME/bin/mahout ldatopics utility for Mahout 0.6 (https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html) doesn't work, due to a NullPointerException. I could confirm the docTopics result using seqdumper, but I was not able to see any of the results for the state-* sequence files above. Here is what happens with the ldatopics command:

$MAHOUT_HOME/bin/mahout ldatopics -i JAText-Mahout-0.6-LDA/output/state-68 -d JAText-TFDictionary.txt
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
No HADOOP_CONF_DIR set, using /usr/local/hadoop/conf
MAHOUT-JOB: /usr/local/mahout-distribution-0.6/mahout-examples-0.6-job.jar
Exception in thread "main" java.lang.NullPointerException
        at org.apache.mahout.common.Pair.compareTo(Pair.java:90)
        at org.apache.mahout.common.Pair.compareTo(Pair.java:23)
        at java.util.PriorityQueue.siftUpComparable(PriorityQueue.java:582)
        at java.util.PriorityQueue.siftUp(PriorityQueue.java:574)
        at java.util.PriorityQueue.offer(PriorityQueue.java:274)
        at java.util.PriorityQueue.add(PriorityQueue.java:251)
        at org.apache.mahout.clustering.lda.LDAPrintTopics.maybeEnqueue(LDAPrintTopics.java:150)
        at org.apache.mahout.clustering.lda.LDAPrintTopics.topWordsForTopics(LDAPrintTopics.java:216)
        at org.apache.mahout.clustering.lda.LDAPrintTopics.main(LDAPrintTopics.java:128)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Apart from this, I can confirm the 12 topics for each document, 43 documents in total, using seqdumper (but not with ldatopics):

$MAHOUT_HOME/bin/mahout seqdumper -s JAText-Mahout-0.6-LDA/output/docTopics/part-m-0
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
No HADOOP_CONF_DIR set, using /usr/local/hadoop/conf
MAHOUT-JOB: /usr/local/mahout-distribution-0.6/mahout-examples-0.6-job.jar
13/03/18 14:12:04 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --seqFile=JAText-Mahout-0.6-LDA/output/docTopics/part-m-0, --startPhase=0, --tempDir=temp}
Input Path: JAText-Mahout-0.6-LDA/output/docTopics/part-m-0
Key class: class org.apache.hadoop.io.LongWritable
Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: {0:0.0718128116030847,1:0.07204818495147658,2:0.07165839473775905,3:0.07471123413425951,4:0.07228942239756206,5:0.07223674970698116,6:0.08965049111711978,7:0.07114235379664942,8:0.18392117686641946,9:0.0713290760585,10:0.0725383327578603,11:0.0724357150231}
Key: 1: Value: {0:0.07340159249672981,1:0.07673280973643179,2:0.07227506725925102,3:0.17698846760344888,4:0.07957759924990469,5:0.07593691263843196,6:0.0723139656294,7:0.07195475314903217,8:0.07480823084457076,9:0.07539197289261017,10:0.07323355269079787,11:0.07732127004222797}
Key: 2: Value:
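The NullPointerException earlier in this message is thrown while Pair objects are being ordered inside a java.util.PriorityQueue, which is consistent with a pair whose word component is null (for example, a term id that has no entry in the supplied dictionary) reaching the queue. A minimal, hypothetical illustration of that failure mode follows; the WordScore class is my own stand-in for demonstration, not Mahout's actual Pair:

```java
import java.util.PriorityQueue;

// Minimal stand-in for a (word, score) pair whose compareTo assumes the
// word is never null -- hypothetical, for illustration only.
class WordScore implements Comparable<WordScore> {
    final String word;   // may be null if a dictionary lookup failed
    final double score;

    WordScore(String word, double score) {
        this.word = word;
        this.score = score;
    }

    @Override
    public int compareTo(WordScore other) {
        int cmp = Double.compare(score, other.score);
        // NPE here when either word is null, mirroring the Pair.compareTo
        // frame at the top of the stack trace above
        return cmp != 0 ? cmp : word.compareTo(other.word);
    }
}

public class NpeDemo {
    // Returns true if adding a null-word pair to the queue throws NPE.
    public static boolean offerThrowsNpe() {
        PriorityQueue<WordScore> queue = new PriorityQueue<>();
        queue.add(new WordScore("BRAKE", 0.5));
        try {
            // an equal score forces the tie-breaking word comparison
            queue.add(new WordScore(null, 0.5));
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(offerThrowsNpe()); // prints "true"
    }
}
```

If this is indeed the cause, it would explain why seqdumper (which never consults the dictionary) works on the same files while ldatopics fails.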
Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?
Hi Jake. Now this is very clear, and I will work on this build from the latest source. Thank you.

Regards,,,
Y.Mandai

Sent from my iPhone

On 2013/02/23, at 3:14, Jake Mannix jake.man...@gmail.com wrote:

On Fri, Feb 22, 2013 at 2:26 AM, 万代豊 20525entrad...@gmail.com wrote:

Thanks, Jake, for your attention on this. I believe I have the trunk code from the official download site. Well, my Mahout version is 0.7; I downloaded it from a local mirror site, http://ftp.jaist.ac.jp/pub/apache/mahout/0.7/, and confirmed that the timestamp on the mirror site is 12-Jun-2012 and that the timestamps of my installed files are all identical. Note that I'm using the precompiled jar files only and have not built from source code locally on my machine; I believe this should not affect things negatively. Mahout 0.7 is my first and only experienced version; I have never tried older ones, nor the newer 0.8 snapshot... Can you think of any other possible workaround?

You should try to build from trunk source; this bug is fixed in trunk, and that's the correct workaround. That, or wait for our next officially released version (0.8).

Also, am I doing OK with giving heap size to both Hadoop and Mahout in this case? I could confirm the heap assignment for the Hadoop jobs since they are resident processes, while the Mahout RunJar dies immediately, before the VisualVM utility can recognize it, so I'm not confident whether RunJar really got as much heap as it wanted...

Heap is not going to help you here; you're dealing with a bug. The correct code doesn't need very much memory at all (less than 100MB to do the job you're talking about).

Regards,,,
Y.Mandai

2013/2/22 Jake Mannix jake.man...@gmail.com

This looks like you've got an old version of Mahout - are you running on trunk? This has been fixed on trunk; there was a bug in the 0.6 (roughly) timeframe in which vectors for vectordump --sort were assumed, incorrectly, to be of size MAX_INT, which led to heap problems no matter how much heap you gave it.
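Jake's description of the fixed --sort path (versus the 0.6-era bug that sized a structure by MAX_INT) amounts to bounded top-k selection: keep a k-sized min-heap and evict the smallest candidate, so memory stays O(k) no matter how large the vector claims to be. A generic sketch of that technique, not Mahout's actual VectorHelper code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class TopK {
    // Returns the k largest values seen, using O(k) memory regardless of
    // how many entries the (possibly huge, sparse) vector iterates over.
    static double[] topK(Iterable<Double> entries, int k) {
        // min-heap: the root is the smallest of the current top-k candidates
        PriorityQueue<Double> heap = new PriorityQueue<>(k);
        for (double v : entries) {
            if (heap.size() < k) {
                heap.offer(v);
            } else if (v > heap.peek()) {
                heap.poll();     // evict the smallest candidate
                heap.offer(v);
            }
        }
        // pop ascending, fill from the back to get a descending result
        double[] out = new double[heap.size()];
        for (int i = out.length - 1; i >= 0; i--) {
            out[i] = heap.poll();
        }
        return out;
    }

    public static void main(String[] args) {
        List<Double> weights = Arrays.asList(0.1, 0.9, 0.3, 0.7, 0.5);
        System.out.println(Arrays.toString(topK(weights, 2))); // prints "[0.9, 0.7]"
    }
}
```

The buggy code path instead allocated a queue sized by vector.size(), which for a sparse vector can report Integer.MAX_VALUE, hence the OutOfMemoryError regardless of -Xmx.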
Well, maybe you could have worked around it with 2^32 * (4 + 8) bytes ~ 48GB, but really the solution is to upgrade to run off of trunk.

On Wed, Feb 20, 2013 at 8:47 PM, 万代豊 20525entrad...@gmail.com wrote:

My trial is as below; however, it still doesn't get through... I increased MAHOUT_HEAPSIZE as below and also removed the comment mark from the mahout shell script so that I could check it is actually taking effect.

Added JAVA_HEAP_MAX=-Xmx4g (the default was 3GB):

~bin/mahout~
JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx4g    <- increased from the original 3g to 4g

# check envvars which might override default args
if [ "$MAHOUT_HEAPSIZE" != "" ]; then
  echo "run with heapsize $MAHOUT_HEAPSIZE"
  JAVA_HEAP_MAX="-Xmx${MAHOUT_HEAPSIZE}m"
  echo $JAVA_HEAP_MAX
fi

Also set the same 4GB heap size in hadoop-env.sh:

~hadoop-env.sh~
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=4000

[hadoop@localhost NHTSA]$ export MAHOUT_HEAPSIZE=4000
[hadoop@localhost NHTSA]$ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5 --printKey TRUE --sortVectors TRUE
run with heapsize 4000    <- looks like RunJar is taking the 4GB heap?
-Xmx4000m                 <- right?
Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
13/02/21 13:23:17 INFO common.AbstractJob: Command line arguments: {--dictionary=[NHTSA-vectors01/dictionary.file-*], --dictionaryType=[sequencefile], --endPhase=[2147483647], --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE], --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
13/02/21 13:23:17 INFO vectors.VectorDumper: Sort?
true
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
        at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
        at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
        at org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
        at org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
        at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?
My trial is as below; however, it still doesn't get through... I increased MAHOUT_HEAPSIZE as below and also removed the comment mark from the mahout shell script so that I could check it is actually taking effect.

Added JAVA_HEAP_MAX=-Xmx4g (the default was 3GB):

~bin/mahout~
JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx4g    <- increased from the original 3g to 4g

# check envvars which might override default args
if [ "$MAHOUT_HEAPSIZE" != "" ]; then
  echo "run with heapsize $MAHOUT_HEAPSIZE"
  JAVA_HEAP_MAX="-Xmx${MAHOUT_HEAPSIZE}m"
  echo $JAVA_HEAP_MAX
fi

Also set the same 4GB heap size in hadoop-env.sh:

~hadoop-env.sh~
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=4000

[hadoop@localhost NHTSA]$ export MAHOUT_HEAPSIZE=4000
[hadoop@localhost NHTSA]$ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5 --printKey TRUE --sortVectors TRUE
run with heapsize 4000    <- looks like RunJar is taking the 4GB heap?
-Xmx4000m                 <- right?
Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
13/02/21 13:23:17 INFO common.AbstractJob: Command line arguments: {--dictionary=[NHTSA-vectors01/dictionary.file-*], --dictionaryType=[sequencefile], --endPhase=[2147483647], --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE], --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
13/02/21 13:23:17 INFO vectors.VectorDumper: Sort?
true
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
        at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
        at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
        at org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
        at org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
        at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
[hadoop@localhost NHTSA]$

I've also monitored, through the VisualVM utility, that at least all the Hadoop tasks are taking 4GB of heap. I have done a clusterdump to extract the top 10 terms from the result of k-means, using exactly the same input data sets as below; however, that task requires no extra heap beyond the default.
$ $MAHOUT_HOME/bin/mahout clusterdump -dt sequencefile -d NHTSA-vectors01/dictionary.file-* -i NHTSA-kmeans-clusters01/clusters-9-final -o NHTSA-kmeans-clusterdump01 -b 30 -n 10

I believe the vectordump and clusterdump utilities derive from different roots in terms of their heap requirements. Still waiting for some advice from you people.

Regards,,,
Y.Mandai

2013/2/19 万代豊 20525entrad...@gmail.com

Well, the --sortVectors option of the vectordump utility, used to evaluate the result of CVB clustering, unfortunately brought me an OutOfMemory issue... Here is the case that seems to go well without the --sortVectors option:

$ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5 --printKey TRUE
...
WHILE FOR:1.3623429635926918E-6,WHILE FRONT:1.6746456292420305E-11,WHILE FUELING:1.9818992669733008E-11,WHILE FUELING,:1.0646022811429909E-11,WHILE GETTING:5.89954370861319E-6,WHILE GOING:1.4587091471519642E-6,WHILE HAVING:5.137634548963784E-7,WHILE HOLDING:7.275884421503996E-7,WHILE I:2.86243736646287E-4,WHILE I'M:5.372854590432754E-7,WHILE IDLING:1.7433432428460682E-6,WHILE IDLING,:6.519276066493627E-8,WHILE IDLING.:1.1614897786179032E-8,WHILE IM:2.161108807903E-11,WHILE IN:5.032593039252978E-6,WHILE INFLATING:8.13895666336E-13,WHILE INSPECTING:3.854370531928256E-
...

Once you give --sortVectors TRUE as below, I ran into an OutOfMemory exception:

$ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d NHTSA-vectors01/dictionary.file-* -dt
Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?
Well, the --sortVectors option of the vectordump utility, used to evaluate the result of CVB clustering, unfortunately brought me an OutOfMemory issue... Here is the case that seems to go well without the --sortVectors option:

$ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5 --printKey TRUE
...
WHILE FOR:1.3623429635926918E-6,WHILE FRONT:1.6746456292420305E-11,WHILE FUELING:1.9818992669733008E-11,WHILE FUELING,:1.0646022811429909E-11,WHILE GETTING:5.89954370861319E-6,WHILE GOING:1.4587091471519642E-6,WHILE HAVING:5.137634548963784E-7,WHILE HOLDING:7.275884421503996E-7,WHILE I:2.86243736646287E-4,WHILE I'M:5.372854590432754E-7,WHILE IDLING:1.7433432428460682E-6,WHILE IDLING,:6.519276066493627E-8,WHILE IDLING.:1.1614897786179032E-8,WHILE IM:2.161108807903E-11,WHILE IN:5.032593039252978E-6,WHILE INFLATING:8.13895666336E-13,WHILE INSPECTING:3.854370531928256E-
...

Once you give --sortVectors TRUE as below, I ran into an OutOfMemory exception:

$ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5 --printKey TRUE --sortVectors TRUE
Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
13/02/19 18:56:03 INFO common.AbstractJob: Command line arguments: {--dictionary=[NHTSA-vectors01/dictionary.file-*], --dictionaryType=[sequencefile], --endPhase=[2147483647], --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE], --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
13/02/19 18:56:03 INFO vectors.VectorDumper: Sort?
true
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
        at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
        at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
        at org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
        at org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
        at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

I see that there are several parameters that affect giving heap to a Mahout job, either dependently or independently across Hadoop and Mahout, such as MAHOUT_HEAPSIZE, JAVA_HEAP_MAX, HADOOP_OPTS, etc. Can anyone advise me which configuration files, shell scripts, or XMLs I should add some additional heap in, and also the proper way to monitor the actual heap usage here?
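On the monitoring question raised above: for a short-lived JVM that exits before VisualVM can attach, one generic way to confirm the heap the process actually received is to print Runtime.maxMemory() from inside it. This is a plain-Java sketch, not a Mahout facility:

```java
public class HeapCheck {
    // Reports the maximum heap the running JVM actually received, in bytes.
    // This reflects whichever -Xmx finally won on the command line, so it
    // answers "did RunJar really get the heap I asked for?" directly.
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    public static void main(String[] args) {
        System.out.println(maxHeapBytes() / (1024 * 1024) + " MB max heap");
    }
}
```

Running this with the same JAVA_HEAP_MAX settings as the Mahout launcher would show whether the flag took effect, without needing to attach an external profiler.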
I'm running mahout-distribution-0.7 on Hadoop 0.20.203.0 with a pseudo-distributed configuration, on a VMware Player partition running CentOS 6.3 64-bit.

Regards,,,
Y.Mandai

2013/2/1 Jake Mannix jake.man...@gmail.com

On Fri, Feb 1, 2013 at 3:35 AM, Yutaka Mandai 20525entrad...@gmail.com wrote:

Thanks, Jake, for your guidance. Good to know that I wasn't always wrong but was just not familiar enough with the vectordump usage. I'll try this out as soon as I can. I hope that --sort doesn't eat up too much heap.

If you're using code on master, --sort should only be using an additional K objects of memory (where K is the value you passed to --vectorSize), as it's just using an auxiliary heap to grab the top k items of the vector. It was a bug previously that it tried to instantiate a vector.size() [which in some cases was Integer.MAX_INT] sized list somewhere.

Regards,
Yutaka

Sent from my iPhone

On 2013/01/31, at 23:33, Jake Mannix jake.man...@gmail.com wrote:

Hi Yutaka,

On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 20525entrad...@gmail.com wrote:

Hi. Here is a question about how to evaluate the result of Mahout 0.7 CVB (Collapsed Variational Bayes), which used to be LDA (Latent Dirichlet Allocation) in Mahout versions below 0.5. I believe I have no problem running CVB itself, and this is purely a
Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?
On Fri, Feb 1, 2013 at 3:35 AM, Yutaka Mandai 20525entrad...@gmail.com wrote:

Thanks, Jake, for your guidance. Good to know that I wasn't always wrong but was just not familiar enough with the vectordump usage. I'll try this out as soon as I can. I hope that --sort doesn't eat up too much heap.

If you're using code on master, --sort should only be using an additional K objects of memory (where K is the value you passed to --vectorSize), as it's just using an auxiliary heap to grab the top k items of the vector. It was a bug previously that it tried to instantiate a vector.size() [which in some cases was Integer.MAX_INT] sized list somewhere.

Regards,
Yutaka

Sent from my iPhone

On 2013/01/31, at 23:33, Jake Mannix jake.man...@gmail.com wrote:

Hi Yutaka,

On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 20525entrad...@gmail.com wrote:

Hi. Here is a question about how to evaluate the result of Mahout 0.7 CVB (Collapsed Variational Bayes), which used to be LDA (Latent Dirichlet Allocation) in Mahout versions below 0.5. I believe I have no problem running CVB itself, and this is purely a question on the efficient way to visualize or evaluate the result. It looks like result evaluation in Mahout 0.5, at least, could be done using the utility called LDAPrintTopics; however, this has been obsolete since Mahout 0.5. (See Mahout in Action p.181 on LDA.) As said, I'm using Mahout 0.7. I believe I'm running CVB successfully and have obtained results in two separate directories: in /user/hadoop/temp/topicModelState/model-1 through model-20, as specified by the number of iterations, and also in /user/hadoop/NHTSA-LDA-sparse/part-m-0 through part-m-9, as specified by the number of topics that I wanted to extract/decompose. The files in these directories can be dumped using Mahout vectordump; however, the output format is way different from what you would have gotten using LDAPrintTopics below 0.5, which should give you back the result as the Topic Id
and its associated top terms in a very direct format. (See Mahout in Action p.181 again.)

Vectordump should be exactly what you want, actually.

Here is what I've done:

1. Say I have already generated document vectors, and use tf-vectors to generate a document/term matrix:
$MAHOUT_HOME/bin/mahout rowid -i NHTSA-vectors03/tf-vectors -o NHTSA-matrix03

2. Get rid of the matrix docIndex, as it would get in my way (as advised somewhere…):
$HADOOP_HOME/bin/hadoop dfs -mv NHTSA-matrix03/docIndex NHTSA-matrix03-docIndex

3. Confirm that I have only what I need here:
$HADOOP_HOME/bin/hadoop dfs -ls NHTSA-matrix03/
Found 1 items
-rw-r--r-- 1 hadoop supergroup 42471833 2012-12-20 07:11 /user/hadoop/NHTSA-matrix03/matrix

4. Kick off CVB:
$MAHOUT_HOME/bin/mahout cvb -i NHTSA-matrix03 -o NHTSA-LDA-sparse -dict NHTSA-vectors03/dictionary.file-* -k 10 -x 20 -ow
…
12/12/20 19:37:31 INFO driver.MahoutDriver: Program took 43987688 ms (Minutes: 733.12814)
(Took over 12 hrs to process 100k documents on my laptop with pseudo-distributed Hadoop 0.20.203.)

5. Take a look at what I've got.
$HADOOP_HOME/bin/hadoop dfs -ls NHTSA-LDA-sparse
Found 12 items
-rw-r--r-- 1 hadoop supergroup 0 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/_SUCCESS
drwxr-xr-x - hadoop supergroup 0 2012-12-20 19:36 /user/hadoop/NHTSA-LDA-sparse/_logs
-rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:36 /user/hadoop/NHTSA-LDA-sparse/part-m-0
-rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:36 /user/hadoop/NHTSA-LDA-sparse/part-m-1
-rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:36 /user/hadoop/NHTSA-LDA-sparse/part-m-2
-rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:36 /user/hadoop/NHTSA-LDA-sparse/part-m-3
-rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-4
-rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-5
-rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-6
-rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-7
-rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-8
-rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-9
[hadoop@localhost NHTSA]$

OK, these should be your model files, and to view them, you can do it the way you can view any SequenceFile<IntWritable, VectorWritable>, like this:

$MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -dict NHTSA-vectors03/dictionary.file-* -o topic_dump.txt --dictionaryType sequencefile --vectorSize 5 --sort

This will dump the top 5
Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?
Hi Yutaka,

On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 20525entrad...@gmail.com wrote:

Hi. Here is a question about how to evaluate the result of Mahout 0.7 CVB (Collapsed Variational Bayes), which used to be LDA (Latent Dirichlet Allocation) in Mahout versions below 0.5. I believe I have no problem running CVB itself, and this is purely a question on the efficient way to visualize or evaluate the result. It looks like result evaluation in Mahout 0.5, at least, could be done using the utility called LDAPrintTopics; however, this has been obsolete since Mahout 0.5. (See Mahout in Action p.181 on LDA.) As said, I'm using Mahout 0.7. I believe I'm running CVB successfully and have obtained results in two separate directories: in /user/hadoop/temp/topicModelState/model-1 through model-20, as specified by the number of iterations, and also in /user/hadoop/NHTSA-LDA-sparse/part-m-0 through part-m-9, as specified by the number of topics that I wanted to extract/decompose. The files in these directories can be dumped using Mahout vectordump; however, the output format is way different from what you would have gotten using LDAPrintTopics below 0.5, which should give you back the result as the Topic Id and its associated top terms in a very direct format. (See Mahout in Action p.181 again.)

Vectordump should be exactly what you want, actually.

Here is what I've done:

1. Say I have already generated document vectors, and use tf-vectors to generate a document/term matrix:
$MAHOUT_HOME/bin/mahout rowid -i NHTSA-vectors03/tf-vectors -o NHTSA-matrix03

2. Get rid of the matrix docIndex, as it would get in my way (as advised somewhere…):
$HADOOP_HOME/bin/hadoop dfs -mv NHTSA-matrix03/docIndex NHTSA-matrix03-docIndex

3.
Confirm that I have only what I need here:
$HADOOP_HOME/bin/hadoop dfs -ls NHTSA-matrix03/
Found 1 items
-rw-r--r-- 1 hadoop supergroup 42471833 2012-12-20 07:11 /user/hadoop/NHTSA-matrix03/matrix

4. Kick off CVB:
$MAHOUT_HOME/bin/mahout cvb -i NHTSA-matrix03 -o NHTSA-LDA-sparse -dict NHTSA-vectors03/dictionary.file-* -k 10 -x 20 -ow
…
12/12/20 19:37:31 INFO driver.MahoutDriver: Program took 43987688 ms (Minutes: 733.12814)
(Took over 12 hrs to process 100k documents on my laptop with pseudo-distributed Hadoop 0.20.203.)

5. Take a look at what I've got.
$HADOOP_HOME/bin/hadoop dfs -ls NHTSA-LDA-sparse
Found 12 items
-rw-r--r-- 1 hadoop supergroup 0 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/_SUCCESS
drwxr-xr-x - hadoop supergroup 0 2012-12-20 19:36 /user/hadoop/NHTSA-LDA-sparse/_logs
-rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:36 /user/hadoop/NHTSA-LDA-sparse/part-m-0
-rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:36 /user/hadoop/NHTSA-LDA-sparse/part-m-1
-rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:36 /user/hadoop/NHTSA-LDA-sparse/part-m-2
-rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:36 /user/hadoop/NHTSA-LDA-sparse/part-m-3
-rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-4
-rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-5
-rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-6
-rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-7
-rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-8
-rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-9
[hadoop@localhost NHTSA]$

OK, these should be your model files, and to view them, you can do it the way you can view any SequenceFile<IntWritable, VectorWritable>, like this:

$MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -dict
NHTSA-vectors03/dictionary.file-* -o topic_dump.txt --dictionaryType sequencefile --vectorSize 5 --sort

This will dump the top 5 terms (with weights - not sure if they'll be normalized properly) from each topic to the output file topic_dump.txt.

Incidentally, this same command can be run on the topicModelState directories as well, which lets you see how fast your topic model was converging (and thus shows you, on a smaller data set, how many iterations you may want to run later on).

and

$HADOOP_HOME/bin/hadoop dfs -ls temp/topicModelState
Found 20 items
drwxr-xr-x - hadoop supergroup 0 2012-12-20 07:59 /user/hadoop/temp/topicModelState/model-1
drwxr-xr-x - hadoop supergroup 0 2012-12-20 13:32 /user/hadoop/temp/topicModelState/model-10
drwxr-xr-x - hadoop supergroup 0 2012-12-20 14:09 /user/hadoop/temp/topicModelState/model-11
drwxr-xr-x - hadoop supergroup 0 2012-12-20 14:46 /user/hadoop/temp/topicModelState/model-12
drwxr-xr-x - hadoop supergroup
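Jake's point about running the dump over the topicModelState directories is that successive model-i snapshots can be compared to judge convergence. One simple, hypothetical measure (not something Mahout computes for you here) is the L1 distance between the same topic's term distribution at two successive iterations; the numbers below are made up for illustration:

```java
public class Convergence {
    // L1 distance between one topic's term distribution at two successive
    // iterations; values shrinking toward zero suggest the model has settled.
    static double l1Distance(double[] prev, double[] curr) {
        double d = 0.0;
        for (int i = 0; i < prev.length; i++) {
            d += Math.abs(prev[i] - curr[i]);
        }
        return d;
    }

    public static void main(String[] args) {
        // made-up term weights for one topic at model-1 and model-2
        double[] model1 = {0.50, 0.30, 0.20};
        double[] model2 = {0.45, 0.35, 0.20};
        System.out.println(l1Distance(model1, model2)); // approximately 0.1
    }
}
```

Plotting this distance for model-1 vs model-2, model-2 vs model-3, and so on would show, on a small data set, roughly how many iterations are worth running on the full corpus.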