Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?

2013-03-17 Thread 万代豊
Hi Jake,
Due to some housekeeping on other matters, I have actually not built
Mahout 0.7 from the trunk code yet, but before doing so I tried
Mahout-0.6 so that I could run LDA straightforwardly.

I have successfully run LDA with a TF vector file as input, over 68 iterations
across 43 documents, specifying 12 topics to be identified.

$MAHOUT_HOME/bin/mahout lda --input
JAText-Mahout-0.6-LDA/JAText-luceneTFvectors01/part-out.vec --output
JAText-Mahout-0.6-LDA/output --numTopics 12
$HADOOP_HOME/bin/hadoop dfs -ls JAText-Mahout-0.6-LDA/output/
Found 70 items
drwxr-xr-x   - hadoop supergroup  0 2013-03-18 13:48
/user/hadoop/JAText-Mahout-0.6-LDA/output/docTopics
drwxr-xr-x   - hadoop supergroup  0 2013-03-18 13:03
/user/hadoop/JAText-Mahout-0.6-LDA/output/state-0
drwxr-xr-x   - hadoop supergroup  0 2013-03-18 13:04
/user/hadoop/JAText-Mahout-0.6-LDA/output/state-1
  .
  .
drwxr-xr-x   - hadoop supergroup  0 2013-03-18 13:47
/user/hadoop/JAText-Mahout-0.6-LDA/output/state-67
drwxr-xr-x   - hadoop supergroup  0 2013-03-18 13:48
/user/hadoop/JAText-Mahout-0.6-LDA/output/state-68

I can actually see a part-m-0 sequence file for each iteration stage.

The question here is that the $MAHOUT_HOME/bin/mahout ldatopics utility for Mahout
0.6 (https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html)
doesn't work, due to a NullPointerException.

I could only confirm the result of docTopics using seqdumper; I was not able
to see any of the results for the above state-* sequence files.
Here is what happens with the ldatopics command.
$MAHOUT_HOME/bin/mahout ldatopics -i JAText-Mahout-0.6-LDA/output/state-68
-d JAText-TFDictionary.txt
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
No HADOOP_CONF_DIR set, using /usr/local/hadoop/conf
MAHOUT-JOB: /usr/local/mahout-distribution-0.6/mahout-examples-0.6-job.jar
Exception in thread "main" java.lang.NullPointerException
 at org.apache.mahout.common.Pair.compareTo(Pair.java:90)
 at org.apache.mahout.common.Pair.compareTo(Pair.java:23)
 at java.util.PriorityQueue.siftUpComparable(PriorityQueue.java:582)
 at java.util.PriorityQueue.siftUp(PriorityQueue.java:574)
 at java.util.PriorityQueue.offer(PriorityQueue.java:274)
 at java.util.PriorityQueue.add(PriorityQueue.java:251)
 at
org.apache.mahout.clustering.lda.LDAPrintTopics.maybeEnqueue(LDAPrintTopics.java:150)
 at
org.apache.mahout.clustering.lda.LDAPrintTopics.topWordsForTopics(LDAPrintTopics.java:216)
 at
org.apache.mahout.clustering.lda.LDAPrintTopics.main(LDAPrintTopics.java:128)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
 at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
 at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
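
(A guess on my side, not verified: the NPE is thrown from Pair.compareTo while
LDAPrintTopics builds its top-words queue, which looks more like a null word coming
back from the dictionary lookup than a problem with the state files themselves. As a
quick sanity check on the dictionary I am feeding it, I would look for empty or
malformed lines; the awk condition below is only my assumption that each dictionary
line should carry at least two fields, and it assumes the dictionary is a local text file:

awk 'NF < 2 { print NR": ["$0"]" }' JAText-TFDictionary.txt | head
)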

Apart from this, I can confirm the 12 topics for each document (43 documents
in total) as follows, using seqdumper (but not with ldatopics).

$MAHOUT_HOME/bin/mahout seqdumper -s
JAText-Mahout-0.6-LDA/output/docTopics/part-m-0
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
No HADOOP_CONF_DIR set, using /usr/local/hadoop/conf
MAHOUT-JOB: /usr/local/mahout-distribution-0.6/mahout-examples-0.6-job.jar
13/03/18 14:12:04 INFO common.AbstractJob: Command line arguments:
{--endPhase=2147483647,
--seqFile=JAText-Mahout-0.6-LDA/output/docTopics/part-m-0,
--startPhase=0, --tempDir=temp}
Input Path: JAText-Mahout-0.6-LDA/output/docTopics/part-m-0
Key class: class org.apache.hadoop.io.LongWritable Value Class: class
org.apache.mahout.math.VectorWritable
Key: 0: Value:
{0:0.0718128116030847,1:0.07204818495147658,2:0.07165839473775905,3:0.07471123413425951,4:0.07228942239756206,5:0.07223674970698116,6:0.08965049111711978,7:0.07114235379664942,8:0.18392117686641946,9:0.0713290760585,10:0.0725383327578603,11:0.0724357150231}
Key: 1: Value:
{0:0.07340159249672981,1:0.07673280973643179,2:0.07227506725925102,3:0.17698846760344888,4:0.07957759924990469,5:0.07593691263843196,6:0.0723139656294,7:0.07195475314903217,8:0.07480823084457076,9:0.07539197289261017,10:0.07323355269079787,11:0.07732127004222797}
Key: 2: Value:
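
(Each Key above is a document id and each Value is its topic-id:probability vector.
As a rough sketch on my side, the dominant topic per document can be pulled out of
this seqdumper output with awk; it assumes the one-line "Key: ... Value: {...}" format
that seqdumper actually prints, which the mail wrapping above obscures:

$MAHOUT_HOME/bin/mahout seqdumper -s JAText-Mahout-0.6-LDA/output/docTopics/part-m-0 \
  | awk -F'Value: ' '/^Key:/ {
      split($1, k, " ")              # k[2] holds the document key, e.g. "0:"
      gsub(/[{}]/, "", $2)           # strip the braces around the vector
      n = split($2, p, ",")
      best = ""; bv = -1
      for (i = 1; i <= n; i++) {
        split(p[i], kv, ":")
        if (kv[2] + 0 > bv) { bv = kv[2] + 0; best = kv[1] }
      }
      print "doc", k[2], "-> topic", best, "(p =", bv")"
    }'
)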

Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?

2013-02-22 Thread Yutaka Mandai
Jake,
Now this is very clear, and I will work on building from the latest source.
Thank you.
Regards,,,
Y.Mandai


Sent from my iPhone

On 2013/02/23, at 3:14, Jake Mannix jake.man...@gmail.com wrote:

 On Fri, Feb 22, 2013 at 2:26 AM, 万代豊 20525entrad...@gmail.com wrote:
 
 Thanks Jake for your attention on this.
 I believe I have the trunk code from the official download site.
 Well, my Mahout version is 0.7, and I downloaded it from the local mirror site
 http://ftp.jaist.ac.jp/pub/apache/mahout/0.7/ and confirmed that the
 timestamp on the mirror site is 12-Jun-2012 and that the timestamps of my
 installed files are all identical.
 Note that I'm using the precompiled jar files only and have not built from
 the source code locally on my machine.
 I believe this should not have a negative effect.
 
 Mahout-0.7 is the first and only version I have experience with. I have never
 tried older ones, nor the newer 0.8 snapshot either...
 
 Can you think of any other possible workaround?
 
 
 You should try to build from the trunk source; this bug is fixed in trunk, and
 that's the correct workaround.  That, or wait for our next officially released
 version (0.8).
 
 
 
 Also, am I doing OK with giving heap size to both Hadoop and Mahout for
 this case?
 I could confirm the heap assignment for the Hadoop jobs, since they are
 resident processes, while the Mahout RunJar dies immediately, before the
 VisualVM utility can recognize it, so I'm not confident whether RunJar really
 got as much as it wanted or not...
 
 
 Heap is not going to help you here; you're dealing with a bug.  The correct
 code doesn't really need very much memory at all (less than 100MB to do
 the job you're talking about).
 
 
 
 Regards,,,
 Y.Mandai
 
 
 
 2013/2/22 Jake Mannix jake.man...@gmail.com
 
 This looks like you've got an old version of Mahout - are you running on
 trunk?  This has been fixed on trunk; there was a bug in the 0.6 (roughly)
 timeframe in which vectors for vectordump --sort were assumed incorrectly
 to be of size MAX_INT, which led to heap problems no matter how much
 heap you gave it.  Well, maybe you could have worked around it with
 2^32 * (4 + 8) bytes ~ 48GB, but really the solution is to upgrade to run
 off of trunk.
 
 
 On Wed, Feb 20, 2013 at 8:47 PM, 万代豊 20525entrad...@gmail.com wrote:
 
 My trial is as below. However, it still doesn't get through...

 I increased MAHOUT_HEAPSIZE as below and also removed the comment marks
 from the mahout shell script so that I can check it's actually taking effect.
 I changed JAVA_HEAP_MAX to -Xmx4g (the default was 3g).
 
 ~bin/mahout~
 JAVA=$JAVA_HOME/bin/java
 JAVA_HEAP_MAX=-Xmx4g   # <- increased from the original 3g to 4g
 # check envvars which might override default args
 if [ "$MAHOUT_HEAPSIZE" != "" ]; then
   echo "run with heapsize $MAHOUT_HEAPSIZE"
   JAVA_HEAP_MAX="-Xmx""$MAHOUT_HEAPSIZE""m"
   echo $JAVA_HEAP_MAX
 fi
 
 I also set the same 4GB heap size in hadoop-env.sh:
 
 ~hadoop-env.sh~
 # The maximum amount of heap to use, in MB. Default is 1000.
 export HADOOP_HEAPSIZE=4000
 
 [hadoop@localhost NHTSA]$ export MAHOUT_HEAPSIZE=4000
 [hadoop@localhost NHTSA]$ $MAHOUT_HOME/bin/mahout vectordump -i
 NHTSA-LDA-sparse -d NHTSA-vectors01/dictionary.file-* -dt sequencefile
 --vectorSize 5 --printKey TRUE --sortVectors TRUE
 run with heapsize 4000   <- Looks like RunJar is taking the 4G heap?
 -Xmx4000m                <- Right?
 Running on hadoop, using /usr/local/hadoop/bin/hadoop and
 HADOOP_CONF_DIR=
 MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
 13/02/21 13:23:17 INFO common.AbstractJob: Command line arguments:
 {--dictionary=[NHTSA-vectors01/dictionary.file-*],
 --dictionaryType=[sequencefile], --endPhase=[2147483647],
 --input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
 --startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
 13/02/21 13:23:17 INFO vectors.VectorDumper: Sort? true
 Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
 at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
 at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.init(VectorHelper.java:221)
 at org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.init(VectorHelper.java:218)
 at org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
 at org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
 at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)

Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?

2013-02-20 Thread 万代豊
My trial is as below. However, it still doesn't get through...

I increased MAHOUT_HEAPSIZE as below and also removed the comment marks
from the mahout shell script so that I can check it's actually taking effect.
I changed JAVA_HEAP_MAX to -Xmx4g (the default was 3g).

~bin/mahout~
JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx4g   # <- increased from the original 3g to 4g
# check envvars which might override default args
if [ "$MAHOUT_HEAPSIZE" != "" ]; then
  echo "run with heapsize $MAHOUT_HEAPSIZE"
  JAVA_HEAP_MAX="-Xmx""$MAHOUT_HEAPSIZE""m"
  echo $JAVA_HEAP_MAX
fi

I also set the same 4GB heap size in hadoop-env.sh:

~hadoop-env.sh~
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=4000

[hadoop@localhost NHTSA]$ export MAHOUT_HEAPSIZE=4000
[hadoop@localhost NHTSA]$ $MAHOUT_HOME/bin/mahout vectordump -i
NHTSA-LDA-sparse -d NHTSA-vectors01/dictionary.file-* -dt sequencefile
--vectorSize 5 --printKey TRUE --sortVectors TRUE
run with heapsize 4000   <- Looks like RunJar is taking the 4G heap?
-Xmx4000m                <- Right?
Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
13/02/21 13:23:17 INFO common.AbstractJob: Command line arguments:
{--dictionary=[NHTSA-vectors01/dictionary.file-*],
--dictionaryType=[sequencefile], --endPhase=[2147483647],
--input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
--startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
13/02/21 13:23:17 INFO vectors.VectorDumper: Sort? true
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
 at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
 at
org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.init(VectorHelper.java:221)
 at
org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.init(VectorHelper.java:218)
 at
org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
 at
org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
 at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
 at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
 at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
[hadoop@localhost NHTSA]$
I've also monitored, through the VisualVM utility, that at least all the Hadoop
tasks are taking 4GB of heap.

I have run clusterdump to extract the top 10 terms from the result of
k-means as below, using exactly the same input data sets; however, this task
requires no extra heap beyond the default.

$ $MAHOUT_HOME/bin/mahout clusterdump -dt sequencefile -d
NHTSA-vectors01/dictionary.file-* -i
NHTSA-kmeans-clusters01/clusters-9-final -o NHTSA-kmeans-clusterdump01
-b 30 -n 10

I believe the vectordump and clusterdump utilities derive from different
roots in terms of their heap requirements.

Still waiting for some advice from you people.
Regards,,,
Y.Mandai
2013/2/19 万代豊 20525entrad...@gmail.com


 Well, the --sortVectors option for the vectordump utility, used to evaluate the
 result of CVB clustering, unfortunately brought me an OutOfMemory issue...

 Here is the case that seems to go well without the --sortVectors option.
 $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
 NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
 --printKey TRUE
 ...
 WHILE FOR:1.3623429635926918E-6,WHILE FRONT:1.6746456292420305E-11,WHILE
 FUELING:1.9818992669733008E-11,WHILE FUELING,:1.0646022811429909E-11,WHILE
 GETTING:5.89954370861319E-6,WHILE GOING:1.4587091471519642E-6,WHILE
 HAVING:5.137634548963784E-7,WHILE HOLDING:7.275884421503996E-7,WHILE
 I:2.86243736646287E-4,WHILE I'M:5.372854590432754E-7,WHILE
 IDLING:1.7433432428460682E-6,WHILE IDLING,:6.519276066493627E-8,WHILE
 IDLING.:1.1614897786179032E-8,WHILE IM:2.161108807903E-11,WHILE
 IN:5.032593039252978E-6,WHILE INFLATING:8.13895666336E-13,WHILE
 INSPECTING:3.854370531928256E-
 ...

 Once you give --sortVectors TRUE as below, I ran into an OutOfMemory
 exception.
 $ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
 NHTSA-vectors01/dictionary.file-* -dt 

Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?

2013-02-19 Thread 万代豊
Well, the --sortVectors option for the vectordump utility, used to evaluate the
result of CVB clustering, unfortunately brought me an OutOfMemory issue...

Here is the case that seems to go well without the --sortVectors option.
$ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
--printKey TRUE
...
WHILE FOR:1.3623429635926918E-6,WHILE FRONT:1.6746456292420305E-11,WHILE
FUELING:1.9818992669733008E-11,WHILE FUELING,:1.0646022811429909E-11,WHILE
GETTING:5.89954370861319E-6,WHILE GOING:1.4587091471519642E-6,WHILE
HAVING:5.137634548963784E-7,WHILE HOLDING:7.275884421503996E-7,WHILE
I:2.86243736646287E-4,WHILE I'M:5.372854590432754E-7,WHILE
IDLING:1.7433432428460682E-6,WHILE IDLING,:6.519276066493627E-8,WHILE
IDLING.:1.1614897786179032E-8,WHILE IM:2.161108807903E-11,WHILE
IN:5.032593039252978E-6,WHILE INFLATING:8.13895666336E-13,WHILE
INSPECTING:3.854370531928256E-
...

Once you give --sortVectors TRUE as below, I ran into an OutOfMemory
exception.
$ $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d
NHTSA-vectors01/dictionary.file-* -dt sequencefile --vectorSize 5
--printKey TRUE --sortVectors TRUE
Running on hadoop, using /usr/local/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /usr/local/mahout/mahout-examples-0.7-job.jar
13/02/19 18:56:03 INFO common.AbstractJob: Command line arguments:
{--dictionary=[NHTSA-vectors01/dictionary.file-*],
--dictionaryType=[sequencefile], --endPhase=[2147483647],
--input=[NHTSA-LDA-sparse], --printKey=[TRUE], --sortVectors=[TRUE],
--startPhase=[0], --tempDir=[temp], --vectorSize=[5]}
13/02/19 18:56:03 INFO vectors.VectorDumper: Sort? true
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
 at org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:108)
 at
org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.init(VectorHelper.java:221)
 at
org.apache.mahout.utils.vectors.VectorHelper$TDoublePQ.init(VectorHelper.java:218)
 at
org.apache.mahout.utils.vectors.VectorHelper.topEntries(VectorHelper.java:84)
 at
org.apache.mahout.utils.vectors.VectorHelper.vectorToJson(VectorHelper.java:133)
 at org.apache.mahout.utils.vectors.VectorDumper.run(VectorDumper.java:245)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:266)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
 at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
 at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
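
(In the meantime, since the dump without --sortVectors does complete, a workaround
I am considering - just a sketch, and it will mis-handle the few terms that themselves
contain commas or colons, and it mixes all topics together unless you dump one part
file at a time - is to write the unsorted dump to a file and sort it afterwards:

$MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse -d \
  NHTSA-vectors01/dictionary.file-* -dt sequencefile --printKey TRUE -o topics_unsorted.txt
tr ',' '\n' < topics_unsorted.txt | tr -d '{}' | sort -t: -k2 -g -r | head -20

The -o output flag is the one from Jake's earlier vectordump command, and GNU sort's
-g is needed because the weights are printed in E-notation.)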
I see that there are several parameters that affect the heap given to a Mahout
job, either dependently or independently across Hadoop and Mahout, such as
MAHOUT_HEAPSIZE, JAVA_HEAP_MAX, HADOOP_OPTS, etc.

Can anyone advise me on which configuration files, shell scripts, or XMLs I
should give some additional heap, and also on the proper way to monitor the
actual heap usage here?
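
(For what it's worth, here is how I am watching the client-side JVM right now, in
case I am measuring the wrong thing - just plain JDK tools, and the grep pattern is
my assumption that the client shows up under Hadoop's RunJar main class:

jps -l | grep org.apache.hadoop.util.RunJar    # find the pid of the client JVM that bin/hadoop launches
jstat -gc <pid> 1000                           # sample its heap pool capacities/usage once a second

As noted before, the process may die before these catch it, which is exactly the
trouble I had with VisualVM.)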

I'm running Mahout-distribution-0.7 on Hadoop-0.20.203.0 in a
pseudo-distributed configuration on a VMware Player partition running
CentOS 6.3 64-bit.

Regards,,,
Y.Mandai
2013/2/1 Jake Mannix jake.man...@gmail.com

 On Fri, Feb 1, 2013 at 3:35 AM, Yutaka Mandai 20525entrad...@gmail.com
 wrote:

  Thanks Jake for your guidance.
  Good to know that I wasn't always wrong but was just not familiar enough
  with the vectordump usage.
  I'll try this out as soon as I can.
  Hope that --sort doesn't eat up too much heap.
 

 If you're using code on master, --sort should only be using an additional K
 objects of memory (where K is the value you passed to --vectorSize), as
 it's just using an auxiliary heap to grab the top k items of the vector.
  It was a bug previously that it tried to instantiate a vector.size()
 [which in some cases was Integer.MAX_INT] sized list somewhere.


 
  Regards,,,
  Yutaka
 
  Sent from my iPhone
 
  On 2013/01/31, at 23:33, Jake Mannix jake.man...@gmail.com wrote:
 
   Hi Yutaka,
  
  
   On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 20525entrad...@gmail.com wrote:
  
   Hi
   Here is a question about how to evaluate the result of Mahout 0.7 CVB
   (Collapsed Variational Bayes), which used to be LDA
   (Latent Dirichlet Allocation) in Mahout versions below 0.5.
   I believe I have no problem running CVB itself, and this is purely a
   

Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?

2013-02-01 Thread Jake Mannix
On Fri, Feb 1, 2013 at 3:35 AM, Yutaka Mandai 20525entrad...@gmail.com wrote:

 Thanks Jake for your guidance.
 Good to know that I wasn't always wrong but was just not familiar enough
 with the vectordump usage.
 I'll try this out as soon as I can.
 Hope that --sort doesn't eat up too much heap.


If you're using code on master, --sort should only be using an additional K
objects of memory (where K is the value you passed to --vectorSize), as
it's just using an auxiliary heap to grab the top k items of the vector.
 It was a bug previously that it tried to instantiate a vector.size()
[which in some cases was Integer.MAX_INT] sized list somewhere.



 Regards,,,
 Yutaka

 Sent from my iPhone

 On 2013/01/31, at 23:33, Jake Mannix jake.man...@gmail.com wrote:

  Hi Yutaka,
 
 
  On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 20525entrad...@gmail.com wrote:
 
  Hi,
  Here is a question about how to evaluate the result of Mahout 0.7 CVB
  (Collapsed Variational Bayes), which used to be LDA
  (Latent Dirichlet Allocation) in Mahout versions below 0.5.
  I believe I have no problem running CVB itself, and this is purely a
  question on the efficient way to visualize or evaluate the result.
 
  It looks like result evaluation, in Mahout-0.5 at least, could be done using
  the utility called LDAPrintTopics; however, this has been obsolete since
  Mahout 0.5. (See Mahout in Action p.181 on LDA.)
 
  As said, I'm using Mahout-0.7. I believe I'm running CVB
  successfully and have obtained results in two separate directories:
  /user/hadoop/temp/topicModelState/model-1 through model-20, matching the
  specified number of iterations, and also in
  /user/hadoop/NHTSA-LDA-sparse/part-m-0 through part-m-9, matching the
  specified number of topics that I wanted to extract/decompose into.
 
  The files contained in these directories can be dumped using Mahout
  vectordump; however, the output format is way different
  from what you would have gotten using LDAPrintTopics below 0.5, which
  gives you back the result as the topic id and its
  associated top terms in a very direct format. (See Mahout in Action p.181
  again.)
 
 
  Vectordump should be exactly what you want, actually.
 
 
 
  Here is what I've done, as below.
  1. Say I have already generated document vectors, and I use tf-vectors to
  generate a document/term matrix:
 
  $MAHOUT_HOME/bin/mahout rowid -i NHTSA-vectors03/tf-vectors -o
  NHTSA-matrix03
 
  2. Get rid of the matrix docIndex, as it would get in my way (as
  advised somewhere…):
  $HADOOP_HOME/bin/hadoop dfs -mv NHTSA-matrix03/docIndex
  NHTSA-matrix03-docIndex
 
  3. Confirm that I have only what I need here:
  $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-matrix03/
  Found 1 items
  -rw-r--r--   1 hadoop supergroup   42471833 2012-12-20 07:11
  /user/hadoop/NHTSA-matrix03/matrix
 
  4. Kick off CVB:
  $MAHOUT_HOME/bin/mahout cvb -i NHTSA-matrix03 -o NHTSA-LDA-sparse -dict
  NHTSA-vectors03/dictionary.file-* -k 10 -x 20 -ow
  …
  ….
  12/12/20 19:37:31 INFO driver.MahoutDriver: Program took 43987688 ms
  (Minutes: 733.12814)
  (It took over 12 hours to process 100k documents on my laptop with
  pseudo-distributed Hadoop 0.20.203.)
 
  5. Take a look at what I've got.
  $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-LDA-sparse
  Found 12 items
  -rw-r--r--   1 hadoop supergroup  0 2012-12-20 19:37
  /user/hadoop/NHTSA-LDA-sparse/_SUCCESS
  drwxr-xr-x   - hadoop supergroup  0 2012-12-20 19:36
  /user/hadoop/NHTSA-LDA-sparse/_logs
  -rw-r--r--   1 hadoop supergroup 827345 2012-12-20 19:36
  /user/hadoop/NHTSA-LDA-sparse/part-m-0
  -rw-r--r--   1 hadoop supergroup 827345 2012-12-20 19:36
  /user/hadoop/NHTSA-LDA-sparse/part-m-1
  -rw-r--r--   1 hadoop supergroup 827345 2012-12-20 19:36
  /user/hadoop/NHTSA-LDA-sparse/part-m-2
  -rw-r--r--   1 hadoop supergroup 827345 2012-12-20 19:36
  /user/hadoop/NHTSA-LDA-sparse/part-m-3
  -rw-r--r--   1 hadoop supergroup 827345 2012-12-20 19:37
  /user/hadoop/NHTSA-LDA-sparse/part-m-4
  -rw-r--r--   1 hadoop supergroup 827345 2012-12-20 19:37
  /user/hadoop/NHTSA-LDA-sparse/part-m-5
  -rw-r--r--   1 hadoop supergroup 827345 2012-12-20 19:37
  /user/hadoop/NHTSA-LDA-sparse/part-m-6
  -rw-r--r--   1 hadoop supergroup 827345 2012-12-20 19:37
  /user/hadoop/NHTSA-LDA-sparse/part-m-7
  -rw-r--r--   1 hadoop supergroup 827345 2012-12-20 19:37
  /user/hadoop/NHTSA-LDA-sparse/part-m-8
  -rw-r--r--   1 hadoop supergroup 827345 2012-12-20 19:37
  /user/hadoop/NHTSA-LDA-sparse/part-m-9
  [hadoop@localhost NHTSA]$
 
 
  Ok, these should be your model files, and to view them, you
  can do it the way you can view any
  SequenceFile<IntWritable, VectorWritable>, like this:
 
  $MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse
  -dict NHTSA-vectors03/dictionary.file-* -o topic_dump.txt
 --dictionaryType
  sequencefile
  --vectorSize 5 --sort
 
  This will dump the top 5 

Re: What will be the LDAPrintTopics compatible/equivalent feature in Mahout-0.7?

2013-01-31 Thread Jake Mannix
Hi Yutaka,


On Thu, Jan 31, 2013 at 3:03 AM, 万代豊 20525entrad...@gmail.com wrote:

 Hi,
 Here is a question about how to evaluate the result of Mahout 0.7 CVB
 (Collapsed Variational Bayes), which used to be LDA
 (Latent Dirichlet Allocation) in Mahout versions below 0.5.
 I believe I have no problem running CVB itself, and this is purely a
 question on the efficient way to visualize or evaluate the result.

 It looks like result evaluation, in Mahout-0.5 at least, could be done using
 the utility called LDAPrintTopics; however, this has been obsolete since
 Mahout 0.5. (See Mahout in Action p.181 on LDA.)

 As said, I'm using Mahout-0.7. I believe I'm running CVB
 successfully and have obtained results in two separate directories:
 /user/hadoop/temp/topicModelState/model-1 through model-20, matching the
 specified number of iterations, and also in
 /user/hadoop/NHTSA-LDA-sparse/part-m-0 through part-m-9, matching the
 specified number of topics that I wanted to extract/decompose into.

 The files contained in these directories can be dumped using Mahout
 vectordump; however, the output format is way different
 from what you would have gotten using LDAPrintTopics below 0.5, which
 gives you back the result as the topic id and its
 associated top terms in a very direct format. (See Mahout in Action p.181
 again.)


Vectordump should be exactly what you want, actually.



 Here is what I've done, as below.
 1. Say I have already generated document vectors, and I use tf-vectors to
 generate a document/term matrix:

 $MAHOUT_HOME/bin/mahout rowid -i NHTSA-vectors03/tf-vectors -o
 NHTSA-matrix03

 2. Get rid of the matrix docIndex, as it would get in my way (as
 advised somewhere…):
 $HADOOP_HOME/bin/hadoop dfs -mv NHTSA-matrix03/docIndex
 NHTSA-matrix03-docIndex

 3. Confirm that I have only what I need here:
 $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-matrix03/
 Found 1 items
 -rw-r--r--   1 hadoop supergroup   42471833 2012-12-20 07:11
 /user/hadoop/NHTSA-matrix03/matrix

 4. Kick off CVB:
 $MAHOUT_HOME/bin/mahout cvb -i NHTSA-matrix03 -o NHTSA-LDA-sparse -dict
 NHTSA-vectors03/dictionary.file-* -k 10 -x 20 -ow
 …
 ….
 12/12/20 19:37:31 INFO driver.MahoutDriver: Program took 43987688 ms
 (Minutes: 733.12814)
 (It took over 12 hours to process 100k documents on my laptop with
 pseudo-distributed Hadoop 0.20.203.)

 5. Take a look at what I've got.
 $HADOOP_HOME/bin/hadoop dfs -ls NHTSA-LDA-sparse
 Found 12 items
 -rw-r--r--   1 hadoop supergroup  0 2012-12-20 19:37
 /user/hadoop/NHTSA-LDA-sparse/_SUCCESS
 drwxr-xr-x   - hadoop supergroup  0 2012-12-20 19:36
 /user/hadoop/NHTSA-LDA-sparse/_logs
 -rw-r--r--   1 hadoop supergroup 827345 2012-12-20 19:36
 /user/hadoop/NHTSA-LDA-sparse/part-m-0
 -rw-r--r--   1 hadoop supergroup 827345 2012-12-20 19:36
 /user/hadoop/NHTSA-LDA-sparse/part-m-1
 -rw-r--r--   1 hadoop supergroup 827345 2012-12-20 19:36
 /user/hadoop/NHTSA-LDA-sparse/part-m-2
 -rw-r--r--   1 hadoop supergroup 827345 2012-12-20 19:36
 /user/hadoop/NHTSA-LDA-sparse/part-m-3
 -rw-r--r--   1 hadoop supergroup 827345 2012-12-20 19:37
 /user/hadoop/NHTSA-LDA-sparse/part-m-4
 -rw-r--r--   1 hadoop supergroup 827345 2012-12-20 19:37
 /user/hadoop/NHTSA-LDA-sparse/part-m-5
 -rw-r--r--   1 hadoop supergroup 827345 2012-12-20 19:37
 /user/hadoop/NHTSA-LDA-sparse/part-m-6
 -rw-r--r--   1 hadoop supergroup 827345 2012-12-20 19:37
 /user/hadoop/NHTSA-LDA-sparse/part-m-7
 -rw-r--r--   1 hadoop supergroup 827345 2012-12-20 19:37
 /user/hadoop/NHTSA-LDA-sparse/part-m-8
 -rw-r--r--   1 hadoop supergroup 827345 2012-12-20 19:37
 /user/hadoop/NHTSA-LDA-sparse/part-m-9
 [hadoop@localhost NHTSA]$


Ok, these should be your model files, and to view them, you
can do it the way you can view any
SequenceFile<IntWritable, VectorWritable>, like this:

$MAHOUT_HOME/bin/mahout vectordump -i NHTSA-LDA-sparse
-dict NHTSA-vectors03/dictionary.file-* -o topic_dump.txt --dictionaryType
sequencefile
--vectorSize 5 --sort

This will dump the top 5 terms (with weights - not sure if they'll be
normalized properly) from each topic to the output file topic_dump.txt

Incidentally, this same command can be run on the topicModelState
directories as well, which lets you see how fast your topic model was
converging (and thus show you on a smaller data set how many iterations you
may want to be running with later on).
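
(E.g., something like this loop - a sketch only, assuming the model-state
directories are readable by vectordump just like the final output, which I haven't
double-checked:

for i in $(seq 1 20); do
  $MAHOUT_HOME/bin/mahout vectordump -i temp/topicModelState/model-$i \
    -dict NHTSA-vectors03/dictionary.file-* --dictionaryType sequencefile \
    --vectorSize 5 --sort -o topic_dump_model_$i.txt
done

and then eyeballing successive topic_dump_model_*.txt files to see when the top
terms stop changing.)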



 and
 $HADOOP_HOME/bin/hadoop dfs -ls temp/topicModelState
 Found 20 items
 drwxr-xr-x   - hadoop supergroup  0 2012-12-20 07:59
 /user/hadoop/temp/topicModelState/model-1
 drwxr-xr-x   - hadoop supergroup  0 2012-12-20 13:32
 /user/hadoop/temp/topicModelState/model-10
 drwxr-xr-x   - hadoop supergroup  0 2012-12-20 14:09
 /user/hadoop/temp/topicModelState/model-11
 drwxr-xr-x   - hadoop supergroup  0 2012-12-20 14:46
 /user/hadoop/temp/topicModelState/model-12
 drwxr-xr-x   - hadoop supergroup