Thanks Jake! I changed my command to:

$MAHOUT lda -i $BASE_DIR/termvecs/tf-vectors -o $BASE_DIR/lda_working -k 2 -v 10000
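(Here MAHOUT and BASE_DIR are shell variables I set beforehand; they expand to roughly MAHOUT=~/Scripts/Mahout/trunk/bin/mahout and BASE_DIR=/home/ben/Scripts/eipi, matching the paths in my earlier commands and in the trace below.)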
And now I get:

11/09/05 12:06:25 WARN mapred.LocalJobRunner: job_local_0001
org.apache.mahout.math.IndexException: Index 10007 is outside allowable range of [0,10000)
    at org.apache.mahout.math.AbstractMatrix.get(AbstractMatrix.java:412)
    at org.apache.mahout.clustering.lda.LDAState.logProbWordGivenTopic(LDAState.java:45)
    at org.apache.mahout.clustering.lda.LDAInference.eStepForWord(LDAInference.java:225)
    at org.apache.mahout.clustering.lda.LDAInference.infer(LDAInference.java:110)
    at org.apache.mahout.clustering.lda.LDAWordTopicMapper.map(LDAWordTopicMapper.java:48)
    at org.apache.mahout.clustering.lda.LDAWordTopicMapper.map(LDAWordTopicMapper.java:36)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
11/09/05 12:06:26 INFO mapred.JobClient:  map 0% reduce 0%
11/09/05 12:06:26 INFO mapred.JobClient: Job complete: job_local_0001
11/09/05 12:06:26 INFO mapred.JobClient: Counters: 0
Exception in thread "main" java.lang.InterruptedException: LDA Iteration failed processing /home/ben/Scripts/eipi/lda_working/state-0
    at org.apache.mahout.clustering.lda.LDADriver.runIteration(LDADriver.java:427)
    at org.apache.mahout.clustering.lda.LDADriver.run(LDADriver.java:226)
    at org.apache.mahout.clustering.lda.LDADriver.run(LDADriver.java:174)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.lda.LDADriver.main(LDADriver.java:90)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

I reran the seqdirectory and seq2sparse commands and they seemed to work fine, but I keep getting this error. Any idea what I'm doing wrong?

Thanks,
-Ben

----- Original Message -----
From: Jake Mannix <[email protected]>
To: [email protected]; Ben West <[email protected]>
Cc:
Sent: Monday, September 5, 2011 11:30 AM
Subject: Re: LDA question

Hi Ben,

On Mon, Sep 5, 2011 at 8:38 AM, Ben West <[email protected]> wrote:
>
> ~/Scripts/Mahout/trunk/bin/mahout seqdirectory --input
> /home/ben/Scripts/eipi/files --output /home/ben/Scripts/eipi/mahout_out
> -chunk 1
> ~/Scripts/Mahout/trunk/bin/mahout seq2sparse -i
> /home/ben/Scripts/eipi/mahout_out -o /home/ben/Scripts/eipi/termvecs -wt tf
> -seq
>

The "output" directory (/home/ben/Scripts/eipi/termvecs) has a bunch of subdirectories, only one of which actually contains your vectors. In this case, you've done tf-normalization, so they're in /home/ben/Scripts/eipi/termvecs/tf-vectors. This is the directory you want to give to LDA as input.

  -jake
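P.S. Rereading the trace, "Index 10007 is outside allowable range of [0,10000)" makes me suspect the dictionary contains more than the 10000 terms I passed with -v. If it helps anyone, this is how I'd count the dictionary entries (just a sketch: I'm assuming seq2sparse wrote its dictionary to termvecs/dictionary.file-0, and seqdumper's flags may differ between Mahout versions):

$MAHOUT seqdumper -s $BASE_DIR/termvecs/dictionary.file-0 | grep -c 'Key:'

If that prints a number larger than 10000, then presumably setting -v to at least that value would keep every term index inside the topic-word matrix.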
