I added a try block in the mapper that catches the exception and outputs:
java.lang.IllegalStateException: This is probably because the --numWords
argument is set too small. It needs to be >= the number of words (terms,
actually) in the corpus, and can be larger if some storage inefficiency
can be tolerated.
at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:49)
at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 3816
at org.apache.mahout.math.DenseMatrix.getQuick(DenseMatrix.java:77)
at org.apache.mahout.clustering.lda.LDAState.logProbWordGivenTopic(LDAState.java:44)
at org.apache.mahout.clustering.lda.LDAInference.eStepForWord(LDAInference.java:205)
at org.apache.mahout.clustering.lda.LDAInference.infer(LDAInference.java:103)
at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:47)
... 5 more
I'll commit that for now while we explore a more elegant solution.
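A minimal sketch of the kind of bound check being discussed (hypothetical class and method names, not the actual Mahout code): validate the term index against numWords before indexing into the topic-term matrix, and turn the raw ArrayIndexOutOfBoundsException into an IllegalStateException with a diagnostic hint.

```java
// Hypothetical sketch of the --numWords guard; names are illustrative,
// not the real LDAMapper implementation.
public class NumWordsGuard {
  private final int numWords;

  public NumWordsGuard(int numWords) {
    this.numWords = numWords;
  }

  /** Returns the index unchanged if valid, else throws with a helpful message. */
  public int checkTermIndex(int termIndex) {
    if (termIndex < 0 || termIndex >= numWords) {
      throw new IllegalStateException(
          "Term index " + termIndex + " is out of range. This is probably "
          + "because the --numWords argument (" + numWords + ") is set too "
          + "small; it needs to be >= the number of unique terms in the corpus.");
    }
    return termIndex;
  }

  public static void main(String[] args) {
    NumWordsGuard guard = new NumWordsGuard(100);
    System.out.println(guard.checkTermIndex(42));  // in range, passes through
    try {
      guard.checkTermIndex(3816);                  // too large, like the trace above
    } catch (IllegalStateException e) {
      System.out.println("caught: " + e.getMessage());
    }
  }
}
```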
On 5/23/10 2:45 PM, Sean Owen wrote:
Even something as simple as checking that bound and throwing
IllegalStateException with a custom message -- yeah I imagine it's
hard to detect this anytime earlier. Just a thought.
On Sun, May 23, 2010 at 6:29 PM, Jeff Eastman
<[email protected]> wrote:
I agree it is not very friendly. It is impossible to tell the correct value
during options processing. It needs to be >= the actual number of unique
terms in the corpus, and that is hard to anticipate, though I think it is
known in seq2sparse. If it turns out to be the dictionary size (I'm
investigating), then it could be computed by adding a dictionary path
argument instead of the current option. The trouble with that is the
dictionary is not needed for anything else by LDA.
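If the value does turn out to be the dictionary size, deriving it could look something like the sketch below. This assumes a plain one-term-per-line dictionary dump purely for illustration; the real seq2sparse dictionary is a SequenceFile and would need its own reader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical sketch: compute numWords from a dictionary instead of
// requiring the user to guess it. Assumes a one-term-per-line text dump.
public class DictionarySize {
  /** Counts non-empty lines, i.e. the number of unique terms. */
  public static int countTerms(Reader dictionary) throws IOException {
    BufferedReader reader = new BufferedReader(dictionary);
    int count = 0;
    String line;
    while ((line = reader.readLine()) != null) {
      if (!line.trim().isEmpty()) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) throws IOException {
    String dict = "the\nquick\nbrown\nfox\n";
    int numWords = countTerms(new StringReader(dict));
    System.out.println("numWords = " + numWords); // prints numWords = 4
  }
}
```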