It's caused by not setting the correct word count, I believe. Use the same value as the dictionary count. It has to be fixed one of these days.
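To illustrate the failure mode, here is a hypothetical sketch (not the actual Mahout source; the class and variable names are invented): if the inference step keeps per-topic word scores in an array sized by --numWords and indexes it by term id, then any dictionary id >= numWords overflows it, which matches the ArrayIndexOutOfBoundsException: 123 reported below.

```java
// Hypothetical sketch, not Mahout code: shows why a --numWords value
// smaller than the dictionary size causes an out-of-bounds term id.
public class NumWordsCheck {
    public static void main(String[] args) {
        int numWords = 100;                        // value passed as --numWords
        double[] topicTermScores = new double[numWords];
        int termId = 123;                          // a term id from the dictionary
        try {
            topicTermScores[termId] = 0.5;         // index by term id, as inference would
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("term id " + termId + " >= --numWords " + numWords
                + "; pass the dictionary size instead");
        }
    }
}
```

Since any term id at or past numWords triggers this, passing the dictionary entry count as --numWords avoids it entirely.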
Robin

On Sun, May 23, 2010 at 3:50 PM, 杨杰 <[email protected]> wrote:
> Jeff and Robin,
>
> Thank you for your suggestion! There is another problem: having compiled
> the source from trunk and applied the patch MAHOUT-397, I retried the lda
> experiment, but another exception was thrown:
>
> 10/05/23 17:01:52 INFO common.HadoopUtil: Deleting mahout/seq-sparse-tf/lda-out
> 10/05/23 17:01:55 INFO lda.LDADriver: Iteration 1
> 10/05/23 17:01:55 WARN mapred.JobClient: Use GenericOptionsParser for
> parsing the arguments. Applications should implement Tool for the same.
> 10/05/23 17:01:56 INFO input.FileInputFormat: Total input paths to process : 1
> 10/05/23 17:01:56 INFO mapred.JobClient: Running job: job_201005231654_0001
> 10/05/23 17:01:57 INFO mapred.JobClient: map 0% reduce 0%
> 10/05/23 17:02:10 INFO mapred.JobClient: Task Id :
> attempt_201005231654_0001_m_000000_0, Status : FAILED
> java.lang.ArrayIndexOutOfBoundsException: 123
>     at org.apache.mahout.clustering.lda.LDAInference.infer(LDAInference.java:106)
>     at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:45)
>     at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:36)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:518)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:303)
>     at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> The COMMAND is the same as the former one, except "-ow", which was "-w"
> in the 0.3 distribution; the dataset is also the same as with Mahout 0.3
> (on which the experiment works OK, except for *only one map* in each
> iteration~).
>
> Is it because of the absence of some other patches? Or is there some
> other mistake in my operations?
>
> Thank you!
>
> On Sun, May 23, 2010 at 8:01 AM, Robin Anil <[email protected]> wrote:
> > David's rule of thumb was to let the iterations go until the relative
> > change in LL becomes around 10^-4.
> >
> > Robin
> >
> > On Sat, May 22, 2010 at 9:12 PM, Jeff Eastman <[email protected]> wrote:
> >
> >> I suggest you try running with a trunk checkout and upgrading to Hadoop
> >> 0.20.2. Mahout is still in motion, and I've run LDA on Reuters on trunk
> >> in the last few days. The maxIter parameter should not be an issue; you
> >> could try removing it entirely, and LDA will default to running to
> >> convergence (about 100 iterations, which can take some time). I've found
> >> the Reuters results don't change too much after 20. Even with a clean
> >> trunk checkout, Reuters will only use a single node, and the iterations
> >> should take about 5 mins each. If you want to run on a multi-node
> >> cluster, install the patch in MAHOUT-397 (
> >> https://issues.apache.org/jira/browse/MAHOUT-397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> >> ) and use the same arguments as in examples/bin/build-reuters.sh. Even
> >> on a 3-node cluster, this brings the iteration time down to about a
> >> minute and a half, which is worth doing.
> >>
> >> Hope this helps,
> >> Jeff
> >>
> >> http://www.windwardsolutions.com
> >>
> >> On 5/22/10 5:40 AM, 杨杰 wrote:
> >>
> >>> Hi, everyone
> >>>
> >>> I'm trying Mahout now. When running LDA on the Reuters corpus
> >>> (http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/),
> >>> a parameter refuses to work.
> >>> This parameter is "maxIter", without which I cannot choose the number
> >>> of iterations to run~
> >>>
> >>> My CMD is:
> >>> bin/mahout.hadoop lda --input mahout/seq-sparse-tf/vectors --output
> >>> mahout/seq-sparse-tf/lda-out5 --numWords 34000 --numTopics 20
> >>> --maxIter 1
> >>>
> >>> But I got an exception:
> >>> 10/05/22 20:32:11 ERROR lda.LDADriver: Exception
> >>> org.apache.commons.cli2.OptionException: Unexpected 2 while processing Options
> >>>     at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:100)
> >>>     at org.apache.mahout.clustering.lda.LDADriver.main(LDADriver.java:115)
> >>>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>     at java.lang.reflect.Method.invoke(Method.java:597)
> >>>     at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> >>>     at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> >>>     at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:172)
> >>>     ...
> >>>
> >>> What's the problem? I'm using version 0.3 & Hadoop 0.20.0.
> >>>
> >>> Thank you!
>
> --
> Yang Jie (杨杰)
> hi.baidu.com/thinkdifferent
>
> Group of CLOUD, Xi'an Jiaotong University
> Department of Computer Science and Technology, Xi'an Jiaotong University
>
> PHONE: 86 1346888 3723
> TEL: 86 29 82665263 EXT. 608
> MSN: [email protected]
>
> once i didn't know software is not free, but found it days later; now
> i realize that it's indeed free.
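For reference, David's rule of thumb quoted above (iterate until the relative change in LL is around 10^-4) can be sketched as follows. This is only an illustration: the class name, helper method, and log-likelihood values are all invented, not taken from Mahout.

```java
// Sketch of the stopping rule from this thread: stop once the relative
// change in log-likelihood (LL) drops below roughly 1e-4.
// The LL sequence below is fabricated purely for illustration.
public class ConvergenceCheck {
    static boolean converged(double prevLL, double ll) {
        return Math.abs((ll - prevLL) / prevLL) < 1.0e-4;
    }

    public static void main(String[] args) {
        double[] ll = {-1000000.0, -950000.0, -949950.0, -949945.0};
        for (int i = 1; i < ll.length; i++) {
            if (converged(ll[i - 1], ll[i])) {
                System.out.println("converged at iteration " + i);
                return;
            }
        }
        System.out.println("not converged; keep iterating");
    }
}
```

With these invented values, the relative change from -950000 to -949950 is about 5e-5, so the loop reports convergence at the second iteration; a real run would read the LL from each LDA iteration's output instead.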
