I added a try block in the mapper that catches the exception and outputs:
java.lang.IllegalStateException: This is probably because the --numWords
argument is set too small. It needs to be >= the number of words (terms,
actually) in the corpus, and can be larger if some storage inefficiency
can be tolerated.
at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:49)
at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 3816
at org.apache.mahout.math.DenseMatrix.getQuick(DenseMatrix.java:77)
at org.apache.mahout.clustering.lda.LDAState.logProbWordGivenTopic(LDAState.java:44)
at org.apache.mahout.clustering.lda.LDAInference.eStepForWord(LDAInference.java:205)
at org.apache.mahout.clustering.lda.LDAInference.infer(LDAInference.java:103)
at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:47)
... 5 more
I'll commit that for now while we explore a more elegant solution.
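A minimal sketch of the kind of bound check being discussed (hypothetical class and method names, not the actual Mahout code): validate the term index against numWords before indexing into the topic-term matrix, and turn the raw ArrayIndexOutOfBoundsException into an IllegalStateException with a diagnostic hint.

```java
// Hypothetical sketch of the --numWords guard; names are illustrative,
// not the real LDAMapper implementation.
public class NumWordsGuard {
  private final int numWords;

  public NumWordsGuard(int numWords) {
    this.numWords = numWords;
  }

  /** Returns the index unchanged if valid, else throws with a helpful message. */
  public int checkTermIndex(int termIndex) {
    if (termIndex < 0 || termIndex >= numWords) {
      throw new IllegalStateException(
          "Term index " + termIndex + " is out of range. This is probably "
          + "because the --numWords argument (" + numWords + ") is set too "
          + "small; it needs to be >= the number of unique terms in the corpus.");
    }
    return termIndex;
  }

  public static void main(String[] args) {
    NumWordsGuard guard = new NumWordsGuard(100);
    System.out.println(guard.checkTermIndex(42));  // in range, passes through
    try {
      guard.checkTermIndex(3816);                  // too large, like the trace above
    } catch (IllegalStateException e) {
      System.out.println("caught: " + e.getMessage());
    }
  }
}
```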
On 5/23/10 2:45 PM, Sean Owen wrote:
Even something as simple as checking that bound and throwing
IllegalStateException with a custom message -- yeah I imagine it's
hard to detect this anytime earlier. Just a thought.
On Sun, May 23, 2010 at 6:29 PM, Jeff Eastman
<[email protected]> wrote:
I agree it is not very friendly. It is impossible to tell the correct value
during options processing. It needs to be >= the actual number of unique
terms in the corpus, and that is hard to anticipate, though I think it is
known in seq2sparse. If it turns out to be the dictionary size (I'm
investigating), then it could be computed by adding a dictionary path
argument instead of the current option. The trouble with that is the
dictionary is not needed for anything else by LDA.
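If the value does turn out to be the dictionary size, deriving it could look something like the sketch below. This assumes a plain one-term-per-line dictionary dump purely for illustration; the real seq2sparse dictionary is a SequenceFile and would need its own reader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical sketch: compute numWords from a dictionary instead of
// requiring the user to guess it. Assumes a one-term-per-line text dump.
public class DictionarySize {
  /** Counts non-empty lines, i.e. the number of unique terms. */
  public static int countTerms(Reader dictionary) throws IOException {
    BufferedReader reader = new BufferedReader(dictionary);
    int count = 0;
    String line;
    while ((line = reader.readLine()) != null) {
      if (!line.trim().isEmpty()) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) throws IOException {
    String dict = "the\nquick\nbrown\nfox\n";
    int numWords = countTerms(new StringReader(dict));
    System.out.println("numWords = " + numWords); // prints numWords = 4
  }
}
```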