Well, the --sortVectors option for the vectordump utility, which I am using to evaluate the result of CVB clustering, unfortunately brought me an OutOfMemory issue...
Here is the case that seems to go well without the --sortVectors option:

$ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5 --printKey TRUE
...
WHILE FOR:1.3623429635926918E-6,WHILE FRONT:1.6746456292420305E-11,
WHILE FUELING:1.9818992669733008E-11,WHILE FUELING,:1.0646022811429909E-11,
WHILE GETTING:5.89954370861319E-6,WHILE GOING:1.4587091471519642E-6,
WHILE HAVING:5.137634548963784E-7,WHILE HOLDING:7.275884421503996E-7,
WHILE I:2.86243736646287E-4,WHILE I'M:5.372854590432754E-7,
WHILE IDLING:1.7433432428460682E-6,WHILE IDLING,:6.519276066493627E-8,
WHILE IDLING.:1.1614897786179032E-8,WHILE IM:2.1611666608807903E-11,
WHILE IN:5.032593039252978E-6,WHILE INFLATING:8.138999995666336E-13,
WHILE INSPECTING:3.854370531928256E- ...

Once I give --sortVectors TRUE as below, I run into an OutOfMemory exception:

$ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5 --printKey TRUE *--sortVectors TRUE*

Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
13/02/19 18:56:03 INFO common.AbstractJob: Command line arguments: {--dictionary=[NHTSA-vectors01/dictionary.file-*], --dictionaryType=[sequencefile], --endPhase=[2147483647], --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE], --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
13/02/19 18:56:03 INFO vectors.VectorDumper: Sort? true
*Exception in thread "main" java.lang.OutOfMemoryError: Java heap space*
        at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
        at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:221)
        at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.<init>(VectorHelper.java:218)
        at org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
        at org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
        at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

I see that there are several parameters that control the heap given to a Mahout job, some dependent on and some independent of Hadoop, such as MAHOUT_HEAPSIZE, JAVA_HEAP_MAX, HADOOP_OPTS, etc. Can anyone advise me which configuration files, shell scripts, or XMLs I should give some additional heap in, and also the proper way to monitor the actual heap usage here?
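For reference, here is a minimal sketch of my current understanding, quite possibly wrong (the 4096 values below are just example sizes), based on skimming bin/mahout and hadoop-env.sh:

# Sketch (untested): bin/mahout in the 0.7 distribution appears to turn
# MAHOUT_HEAPSIZE (a size in MB) into JAVA_HEAP_MAX (-Xmx). But since the
# job is launched through /usr/local/hadoop/bin/hadoop here, the client JVM
# may instead pick up HADOOP_HEAPSIZE / HADOOP_OPTS from hadoop-env.sh.
# The stack trace above shows the sort dying inside that client JVM
# (RunJar -> MahoutDriver), so one of these, rather than any per-task
# setting, should be the knob that matters for vectordump:
export MAHOUT_HEAPSIZE=4096       # Mahout side, in MB
export HADOOP_OPTS="-Xmx4096m"    # Hadoop-side alternative
$MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse \
  -d NHTSA-vectors01/dictionary.file-* -dt sequencefile \
  --vectorSize 5 --printKey TRUE --sortVectors TRUE

# MapReduce child tasks get their heap separately, via the
# mapred.child.java.opts property (e.g. -Xmx1024m) in mapred-site.xml,
# but vectordump does not appear to spawn any MR tasks here.

# To monitor actual heap usage, find the JVM's pid with jps and poll it
# with jstat (both ship with the JDK):
jps -l                      # look for the RunJar / MahoutDriver process
jstat -gcutil <pid> 1000    # heap occupancy and GC activity every second

Is that the right set of knobs, or is there another XML I am missing?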
I'm running mahout-distribution-0.7 on Hadoop-0.20.203.0 in a pseudo-distributed configuration, on a VMware Player partition running CentOS 6.3 64-bit.

Regards,,,
Y.Mandai

2013/2/1 Jake Mannix <jake.man...@gmail.com>

> On Fri, Feb 1, 2013 at 3:35 AM, Yutaka Mandai <20525entrad...@gmail.com>
> wrote:
>
> > Thank you, Jake, for your guidance.
> > Good to know that I wasn't entirely wrong but was just not familiar
> > enough with the vectordump usage.
> > I'll try this out as soon as I can.
> > Hope that --sort doesn't eat up too much heap.
>
> If you're using code on master, --sort should only be using an additional K
> objects of memory (where K is the value you passed to --vectorSize), as
> it's just using an auxiliary heap to grab the top K items of the vector.
> It was a bug previously that it tried to instantiate a vector.size()
> [which in some cases was Integer.MAX_INT] sized list somewhere.
>
> > Regards,,,
> > Yutaka
> >
> > Sent from my iPhone
> >
> > On 2013/01/31, at 23:33, Jake Mannix <jake.man...@gmail.com> wrote:
> >
> > > Hi Yutaka,
> > >
> > > On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 <20525entrad...@gmail.com> wrote:
> > >
> > >> Hi,
> > >> Here is a question about how to evaluate the result of Mahout 0.7 CVB
> > >> (Collapsed Variational Bayes), which used to be LDA
> > >> (Latent Dirichlet Allocation) in Mahout versions up to 0.5.
> > >> I believe I have no problem running CVB itself; this is purely a
> > >> question about an efficient way to visualize or evaluate the result.
> > >>
> > >> It looks like result evaluation in Mahout 0.5 could at least be done
> > >> using the utility called "LDAPrintTopic"; however, this is already
> > >> obsolete after Mahout 0.5. (See "Mahout in Action" p.181 on LDA.)
> > >>
> > >> As said, I'm using Mahout 0.7. I believe I'm running CVB
> > >> successfully and have obtained results in two separate directories:
> > >> /user/hadoop/temp/topicModelState/model-1 through model-20, matching
> > >> the specified number of iterations, and
> > >> /user/hadoop/NHTSA-LDA-sparse/part-m-00000 through part-m-00009,
> > >> matching the specified number of topics that I wanted to
> > >> extract/decompose into.
> > >>
> > >> The files in either directory can be dumped using Mahout vectordump;
> > >> however, the output format is very different
> > >> from what you would have gotten from LDAPrintTopic in 0.5 and below,
> > >> which gave you back the topic id and its
> > >> associated top terms in a very direct format. (See "Mahout in Action"
> > >> p.181 again.)
> > >
> > > Vectordump should be exactly what you want, actually.
> > >
> > >> Here is what I've done:
> > >> 1. Say I have already generated document vectors, and I use tf-vectors
> > >> to generate a document/term matrix:
> > >>
> > >> $MAHOUT_HOME/bin/mahout rowid -i NHTSA-vectors03/tf-vectors -o
> > >> NHTSA-matrix03
> > >>
> > >> 2. and get rid of the matrix docIndex, as it would get in my way (as I
> > >> was advised somewhere…):
> > >> $HADOOP_HOME/bin/hadoop dfs -mv NHTSA-matrix03/docIndex
> > >> NHTSA-matrix03-docIndex
> > >>
> > >> 3. confirmed that I have only what I need here:
> > >> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-matrix03/
> > >> Found 1 items
> > >> -rw-r--r-- 1 hadoop supergroup 42471833 2012-12-20 07:11 /user/hadoop/NHTSA-matrix03/matrix
> > >>
> > >> 4. and kick off CVB:
> > >> $MAHOUT_HOME/bin/mahout cvb -i NHTSA-matrix03 -o NHTSA-LDA-sparse -dict
> > >> NHTSA-vectors03/dictionary.file-* -k 10 -x 20 -ow
> > >> …
> > >> ….
> > >> 12/12/20 19:37:31 INFO driver.MahoutDriver: Program took 43987688 ms
> > >> (Minutes: 733.1281333333334)
> > >> (It took over 12 hrs to process 100k documents on my laptop with
> > >> pseudo-distributed Hadoop 0.20.203.)
> > >>
> > >> 5. Take a look at what I've got:
> > >> $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-LDA-sparse
> > >> Found 12 items
> > >> -rw-r--r-- 1 hadoop supergroup 0 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/_SUCCESS
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 19:36 /user/hadoop/NHTSA-LDA-sparse/_logs
> > >> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:36 /user/hadoop/NHTSA-LDA-sparse/part-m-00000
> > >> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:36 /user/hadoop/NHTSA-LDA-sparse/part-m-00001
> > >> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:36 /user/hadoop/NHTSA-LDA-sparse/part-m-00002
> > >> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:36 /user/hadoop/NHTSA-LDA-sparse/part-m-00003
> > >> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-00004
> > >> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-00005
> > >> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-00006
> > >> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-00007
> > >> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-00008
> > >> -rw-r--r-- 1 hadoop supergroup 827345 2012-12-20 19:37 /user/hadoop/NHTSA-LDA-sparse/part-m-00009
> > >> [hadoop@localhost NHTSA]$
> > >
> > > OK, these should be your model files, and to view them, you
> > > can do it the way you can view any
> > > SequenceFile<IntWritable, VectorWritable>, like this:
> > >
> > > $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse
> > > -dict NHTSA-vectors03/dictionary.file-* -o topic_dump.txt --dictionaryType
> > > sequencefile --vectorSize 5 --sort
> > >
> > > This will dump the top 5 terms (with weights - not sure if they'll be
> > > normalized properly) from each topic to the output file "topic_dump.txt".
> > >
> > > Incidentally, this same command can be run on the topicModelState
> > > directories as well, which lets you see how fast your topic model was
> > > converging (and thus shows you, on a smaller data set, how many
> > > iterations you may want to run with later on).
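> > > For example, here is a quick sketch of comparing an early model state
> > > against the final one (untested; the output file names are just
> > > placeholders), using the same dictionary as above:
> > >
> > > $MAHOUT_HOME/bin/mahout vectordump -i temp/topicModelState/model-1
> > > -dict NHTSA-vectors03/dictionary.file-* --dictionaryType sequencefile
> > > --vectorSize 5 --sort -o topics_iter01.txt
> > >
> > > $MAHOUT_HOME/bin/mahout vectordump -i temp/topicModelState/model-20
> > > -dict NHTSA-vectors03/dictionary.file-* --dictionaryType sequencefile
> > > --vectorSize 5 --sort -o topics_iter20.txt
> > >
> > > If the top-5 terms per topic have stopped moving well before the last
> > > iteration, you can probably get away with fewer iterations on the full
> > > run.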
> > >>
> > >> and
> > >> $HADOOP_HOME/bin/hadoop dfs -ls temp/topicModelState
> > >> Found 20 items
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 07:59 /user/hadoop/temp/topicModelState/model-1
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 13:32 /user/hadoop/temp/topicModelState/model-10
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 14:09 /user/hadoop/temp/topicModelState/model-11
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 14:46 /user/hadoop/temp/topicModelState/model-12
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 15:23 /user/hadoop/temp/topicModelState/model-13
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 15:59 /user/hadoop/temp/topicModelState/model-14
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 16:36 /user/hadoop/temp/topicModelState/model-15
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 17:13 /user/hadoop/temp/topicModelState/model-16
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 17:48 /user/hadoop/temp/topicModelState/model-17
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 18:25 /user/hadoop/temp/topicModelState/model-18
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 18:59 /user/hadoop/temp/topicModelState/model-19
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 08:37 /user/hadoop/temp/topicModelState/model-2
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 19:36 /user/hadoop/temp/topicModelState/model-20
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 09:13 /user/hadoop/temp/topicModelState/model-3
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 09:50 /user/hadoop/temp/topicModelState/model-4
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 10:27 /user/hadoop/temp/topicModelState/model-5
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 11:04 /user/hadoop/temp/topicModelState/model-6
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 11:41 /user/hadoop/temp/topicModelState/model-7
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 12:18 /user/hadoop/temp/topicModelState/model-8
> > >> drwxr-xr-x - hadoop supergroup 0 2012-12-20 12:55 /user/hadoop/temp/topicModelState/model-9
> > >>
> > >> Hope someone could help this out.
> > >> Regards,,,
> > >> Yutaka
> > >
> > > --
> > >   -jake
>
> --
>   -jake