Well, the --sortVectors option for the vectordump utility, which I am using to evaluate the result of CVB clustering, unfortunately brought me an OutOfMemory issue...
Here is the case that seems to go well without the --sortVectors option:

$ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5 --printKey TRUE
...
WHILE FOR:1.3623429635926918E-6,WHILE FRONT:1.6746456292420305E-11,
WHILE FUELING:1.9818992669733008E-11,WHILE FUELING,:1.0646022811429909E-11,
WHILE GETTING:5.89954370861319E-6,WHILE GOING:1.4587091471519642E-6,
WHILE HAVING:5.137634548963784E-7,WHILE HOLDING:7.275884421503996E-7,
WHILE I:2.86243736646287E-4,WHILE I'M:5.372854590432754E-7,
WHILE IDLING:1.7433432428460682E-6,WHILE IDLING,:6.519276066493627E-8,
WHILE IDLING.:1.1614897786179032E-8,WHILE IM:2.1611666608807903E-11,
WHILE IN:5.032593039252978E-6,WHILE INFLATING:8.138999995666336E-13,
WHILE INSPECTING:3.854370531928256E- ...

Once I give --sortVectors TRUE as below, I run into an OutOfMemory exception:

$ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5 --printKey TRUE *--sortVectors TRUE*

Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
13/02/19 18:56:03 INFO common.AbstractJob: Command line arguments: {--dictionary=[NHTSA-vectors01/dictionary.file-*], --dictionaryType=[sequencefile], --endPhase=[2147483647], --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE], --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
13/02/19 18:56:03 INFO vectors.VectorDumper: Sort? true
*Exception in thread "main" java.lang.OutOfMemoryError: Java heap space*
        at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
        at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
        at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
        at org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
        at org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
        at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

I see that there are several parameters that control the heap given to a Mahout job, some dependent on and some independent of Hadoop, such as MAHOUT_HEAPSIZE, JAVA_HEAP_MAX, HADOOP_OPTS, etc. Can anyone advise me which configuration files, shell scripts, or XMLs I should give some additional heap in, and also the proper way to monitor the actual heap usage here?
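For reference, here is a minimal sketch of my current understanding, quite possibly wrong (the 4096 values below are just example sizes), based on skimming bin/mahout and hadoop-env.sh:

# Sketch (untested): bin/mahout in the 0.7 distribution appears to turn
# MAHOUT_HEAPSIZE (a size in MB) into JAVA_HEAP_MAX (-Xmx). But since the
# job is launched through /usr/local/hadoop/bin/hadoop here, the client JVM
# may instead pick up HADOOP_HEAPSIZE / HADOOP_OPTS from hadoop-env.sh.
# The stack trace above shows the sort dying inside that client JVM
# (RunJar -> MahoutDriver), so one of these, rather than any per-task
# setting, should be the knob that matters for vectordump:
export MAHOUT_HEAPSIZE=4096       # Mahout side, in MB
export HADOOP_OPTS="-Xmx4096m"    # Hadoop-side alternative
$MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse \
  -d NHTSA-vectors01/dictionary.file-* -dt sequencefile \
  --vectorSize 5 --printKey TRUE --sortVectors TRUE

# MapReduce child tasks get their heap separately, via the
# mapred.child.java.opts property (e.g. -Xmx1024m) in mapred-site.xml,
# but vectordump does not appear to spawn any MR tasks here.

# To monitor actual heap usage, find the JVM's pid with jps and poll it
# with jstat (both ship with the JDK):
jps -l                      # look for the RunJar / MahoutDriver process
jstat -gcutil <pid> 1000    # heap occupancy and GC activity every second

Is that the right set of knobs, or is there another XML I am missing?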
I'm running mahout-distribution-0.7 on Hadoop-0.20.203.0 in a pseudo-distributed configuration, on a VMware Player partition running CentOS 6.3 64-bit.

Regards,,,
Y.Mandai

2013/2/1 Jake Mannix <jake.man...@gmail.com>

> On Fri, Feb 1, 2013 at 3:35 AM, Yutaka Mandai <20525entrad...@gmail.com>
> wrote:
>
> > Thank you, Jake, for your guidance.
> > Good to know that I wasn't entirely wrong but was just not familiar
> > enough with the vectordump usage.
> > I'll try this out as soon as I can.
> > Hope that --sort doesn't eat up too much heap.
>
> If you're using code on master, --sort should only be using an additional K
> objects of memory (where K is the value you passed to --vectorSize), as
> it's just using an auxiliary heap to grab the top K items of the vector.
> It was a bug previously that it tried to instantiate a vector.size()
> [which in some cases was Integer.MAX_INT] sized list somewhere.
>
> > Regards,,,
> > Yutaka
> >
> > Sent from my iPhone
> >
> > On 2013/01/31, at 23:33, Jake Mannix <jake.man...@gmail.com> wrote:
> >
> > > Hi Yutaka,
> > >
> > > On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 <20525entrad...@gmail.com> wrote:
> > >
> > >> Hi,
> > >> Here is a question about how to evaluate the result of Mahout 0.7 CVB
> > >> (Collapsed Variational Bayes), which used to be LDA
> > >> (Latent Dirichlet Allocation) in Mahout versions up to 0.5.
> > >> I believe I have no problem running CVB itself; this is purely a
> > >> question about an efficient way to visualize or evaluate the result.
> > >>
> > >> It looks like result evaluation in Mahout 0.5 could at least be done
> > >> using the utility called "LDAPrintTopic"; however, this is already
> > >> obsolete after Mahout 0.5. (See "Mahout in Action" p.181 on LDA.)
> > >>
> > >> As said, I'm using Mahout 0.7. I believe I'm running CVB
> > >> successfully and have obtained results in two separate directories:
> > >> /user/hadoop/temp/topicModelState/model-1 through model-20, matching
> > >> the specified number of iterations, and
> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00000 through part-m-00009,
> > >> matching the specified number of topics that I wanted to
> > >> extract/decompose into.
> > >>
> > >> The files in either directory can be dumped using Mahout vectordump;
> > >> however, the output format is very different
> > >> from what you would have gotten from LDAPrintTopic in 0.5 and below,
> > >> which gave you back the topic id and its
> > >> associated top terms in a very direct format. (See "Mahout in Action"
> > >> p.181 again.)
> > >
> > > Vectordump should be exactly what you want, actually.
> > >
> > >> Here is what I've done:
> > >> 1. Say I have already generated document vectors, and I use tf-vectors
> > >> to generate a document/term matrix:
> > >>
> > >> $MAHOUT_HOME/bin/mahout rowid -i NHTSA-vectors03/tf-vectors -o
> > >> NHTSA-matrix03
> > >>
> > >> 2. and get rid of the matrix docIndex, as it would get in my way (as I
> > >> was advised somewhere…):
> > >> $HADOOP_HOME/bin/hadoop dfs -mv NHTSA-matrix03/docIndex
> > >> NHTSA-matrix03-docIndex
> > >>
> > >> 3. confirmed that I have only what I need here:
> > >> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-matrix03/
> > >> Found 1 items
> > >> -rw-r--r-- 1 hadoop supergroup 42471833 2012-12-20 07:11 /user/hadoop/NHTSA-matrix03/matrix
> > >>
> > >> 4. and kick off CVB:
> > >> $MAHOUT_HOME/bin/mahout cvb -i NHTSA-matrix03 -o NHTSA-LDA-sparse -dict
> > >> NHTSA-vectors03/dictionary.file-* -k 10 -x 20 -ow
> > >> …
> > >> ….
> > >> 12/12/20 19:37:31 INFO driver.MahoutDriver: Program took 43987688 ms
> > >> (Minutes: 733.1281333333334)
> > >> (It took over 12 hrs to process 100k documents on my laptop with
> > >> pseudo-distributed Hadoop 0.20.203.)
> > >>
> > >> 5. Take a look at what I've got:
> > >> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-LDA-sparse
> > >> Found 12 items
> > >> -rw-r--r-- 1 hadoop supergroup 0 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/_SUCCESS
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 19:36 /user/hadoop/NHTSA-LDA-sparse/_logs
> > >> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:36 /user/hadoop/NHTSA-LDA-sparse/part-m-00000
> > >> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:36 /user/hadoop/NHTSA-LDA-sparse/part-m-00001
> > >> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:36 /user/hadoop/NHTSA-LDA-sparse/part-m-00002
> > >> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:36 /user/hadoop/NHTSA-LDA-sparse/part-m-00003
> > >> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-00004
> > >> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-00005
> > >> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-00006
> > >> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-00007
> > >> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-00008
> > >> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-00009
> > >> [hadoop@localhost NHTSA]$
> > >
> > > OK, these should be your model files, and to view them, you
> > > can do it the way you can view any
> > > SequenceFile<IntWritable, VectorWritable>, like this:
> > >
> > > $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse
> > > -dict NHTSA-vectors03/dictionary.file-* -o topic_dump.txt --dictionaryType
> > > sequencefile --vectorSize 5 --sort
> > >
> > > This will dump the top 5 terms (with weights - not sure if they'll be
> > > normalized properly) from each topic to the output file "topic_dump.txt".
> > >
> > > Incidentally, this same command can be run on the topicModelState
> > > directories as well, which lets you see how fast your topic model was
> > > converging (and thus shows you, on a smaller data set, how many
> > > iterations you may want to run with later on).
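> > > For example, here is a quick sketch of comparing an early model state
> > > against the final one (untested; the output file names are just
> > > placeholders), using the same dictionary as above:
> > >
> > > $MAHOUT_HOME/bin/mahout vectordump -i temp/topicModelState/model-1
> > > -dict NHTSA-vectors03/dictionary.file-* --dictionaryType sequencefile
> > > --vectorSize 5 --sort -o topics_iter01.txt
> > >
> > > $MAHOUT_HOME/bin/mahout vectordump -i temp/topicModelState/model-20
> > > -dict NHTSA-vectors03/dictionary.file-* --dictionaryType sequencefile
> > > --vectorSize 5 --sort -o topics_iter20.txt
> > >
> > > If the top-5 terms per topic have stopped moving well before the last
> > > iteration, you can probably get away with fewer iterations on the full
> > > run.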
> > >>
> > >> and
> > >> $HADOOP_HOME/bin/hadoop dfs -ls temp/topicModelState
> > >> Found 20 items
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 07:59 /user/hadoop/temp/topicModelState/model-1
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 13:32 /user/hadoop/temp/topicModelState/model-10
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 14:09 /user/hadoop/temp/topicModelState/model-11
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 14:46 /user/hadoop/temp/topicModelState/model-12
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 15:23 /user/hadoop/temp/topicModelState/model-13
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 15:59 /user/hadoop/temp/topicModelState/model-14
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 16:36 /user/hadoop/temp/topicModelState/model-15
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 17:13 /user/hadoop/temp/topicModelState/model-16
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 17:48 /user/hadoop/temp/topicModelState/model-17
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 18:25 /user/hadoop/temp/topicModelState/model-18
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 18:59 /user/hadoop/temp/topicModelState/model-19
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 08:37 /user/hadoop/temp/topicModelState/model-2
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 19:36 /user/hadoop/temp/topicModelState/model-20
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 09:13 /user/hadoop/temp/topicModelState/model-3
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 09:50 /user/hadoop/temp/topicModelState/model-4
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 10:27 /user/hadoop/temp/topicModelState/model-5
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 11:04 /user/hadoop/temp/topicModelState/model-6
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 11:41 /user/hadoop/temp/topicModelState/model-7
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 12:18 /user/hadoop/temp/topicModelState/model-8
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 12:55 /user/hadoop/temp/topicModelState/model-9
> > >>
> > >> Hope someone could help this out.
> > >> Regards,,,
> > >> Yutaka
> > >
> > > --
> > >   -jake
>
> --
>   -jake