I added a try block in the mapper that catches the exception and outputs:

java.lang.IllegalStateException: This is probably because the --numWords argument is set too small. It needs to be >= the number of words (terms actually) in the corpus and can be larger if some storage inefficiency can be tolerated.
    at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:49)
    at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:1)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 3816
    at org.apache.mahout.math.DenseMatrix.getQuick(DenseMatrix.java:77)
    at org.apache.mahout.clustering.lda.LDAState.logProbWordGivenTopic(LDAState.java:44)
    at org.apache.mahout.clustering.lda.LDAInference.eStepForWord(LDAInference.java:205)
    at org.apache.mahout.clustering.lda.LDAInference.infer(LDAInference.java:103)
    at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:47)
    ... 5 more

I'll commit that for now while we explore a more elegant solution.
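For reference, here is a minimal sketch of the kind of bounds check being discussed. The names (checkWordIndex, NumWordsCheck) are illustrative only, not the actual patch:

```java
public class NumWordsCheck {

    // Translate an out-of-bounds word index into a friendlier
    // IllegalStateException before the DenseMatrix access blows up.
    static void checkWordIndex(int wordIndex, int numWords) {
        if (wordIndex >= numWords) {
            throw new IllegalStateException(
                "word index " + wordIndex + " >= numWords (" + numWords + "). "
                + "This is probably because the --numWords argument is set too small. "
                + "It needs to be >= the number of unique terms in the corpus.");
        }
    }

    public static void main(String[] args) {
        checkWordIndex(3815, 3816); // in range: no exception
        try {
            checkWordIndex(3816, 3816); // out of range: triggers the check
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```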


On 5/23/10 2:45 PM, Sean Owen wrote:
Even something as simple as checking that bound and throwing
IllegalStateException with a custom message -- yeah I imagine it's
hard to detect this anytime earlier. Just a thought.

On Sun, May 23, 2010 at 6:29 PM, Jeff Eastman
<[email protected]>  wrote:
I agree it is not very friendly. It's impossible to tell the correct value during options processing. It needs to be >= the actual number of unique terms in the corpus, and that is hard to anticipate, though I think it is known in seq2sparse. If it turns out to be the dictionary size (I'm investigating), then it could be computed by adding a dictionary path argument instead of the current option. The trouble with that is the dictionary is not needed for anything else by LDA.
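If the dictionary route pans out, deriving numWords would just be counting dictionary entries. A rough sketch, assuming a plain-text one-term-per-line dictionary (the real seq2sparse dictionary is a SequenceFile, so this is only illustrative):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class DictionarySize {

    // Count the entries in a one-term-per-line dictionary file;
    // the result would serve as a safe value for --numWords.
    static int numWordsFromDictionary(String path) throws IOException {
        int count = 0;
        try (BufferedReader r = new BufferedReader(new FileReader(path))) {
            while (r.readLine() != null) {
                count++;
            }
        }
        return count;
    }
}
```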

