Hi everybody,

I am implementing a naive Bayes classifier over a large amount of data, using EMR as 
the Hadoop cluster. By "large amount of data" I mean that the final models are 
around 45GB. While the feature extraction step works fine, calling 
TrainNaiveBayesJob results in the following exception:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:491)
        at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:444)
        at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:139)
        at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:122)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1875)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2007)
        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
        at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(Unknown Source)
        at com.google.common.collect.AbstractIterator.hasNext(Unknown Source)
        at com.google.common.collect.Iterators$5.hasNext(Unknown Source)
        at com.google.common.collect.ForwardingIterator.hasNext(Unknown Source)
        at org.apache.mahout.classifier.naivebayes.BayesUtils.readModelFromDir(BayesUtils.java:79)
        at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:161)

It took me a little while to realize that the naive Bayes MR job finishes fine 
on each of the reducers, but after the reduce step the namenode gets the models 
from the reducers, loads them into memory, validates them and, after the 
validation, serializes them. My second thought was to increase the heap memory 
of the namenode (using bootstrap-actions in EMR: 
s3://elasticmapreduce/bootstrap-actions/configure-daemons 
--namenode-heap-size=60000), but even with this setup I am getting the same 
exception.
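
For what it's worth, the stack trace points at the single JVM that runs 
TrainNaiveBayesJob iterating the reducers' output (BayesUtils.readModelFromDir 
over a SequenceFile) and keeping everything on one heap. Below is a rough 
sketch of that read-everything pattern, only to illustrate where the memory 
goes; it is not Mahout's actual code, and the Text key type and "part-*" 
directory layout are my assumptions:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class LoadWholeModelSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path modelDir = new Path(args[0]);                 // the job's output dir (placeholder)

    // Every weight vector is kept reachable until the loop ends,
    // so the heap has to hold the whole ~45GB model at once.
    List<Vector> allWeights = new ArrayList<Vector>();
    for (FileStatus part : fs.globStatus(new Path(modelDir, "part-*"))) {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, part.getPath(), conf);
      try {
        Text key = new Text();                          // key type assumed for illustration
        VectorWritable value = new VectorWritable();
        while (reader.next(key, value)) {
          allWeights.add(value.get().clone());          // defensive copy, pinned on the heap
        }
      } finally {
        reader.close();
      }
    }
    System.out.println("Loaded " + allWeights.size() + " weight vectors");
  }
}

With a pattern like this, memory usage grows with the full model size no matter 
how much memory the reducers had.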

Has anybody dealt with a similar problem? Any suggestions (other than trying a 
bigger master node)?

Also, what is the rationale for loading the whole model into the memory of the 
namenode? While I understand the need for validation, can't it be done in 
chunks instead of over the complete model, to avoid this scalability issue?
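
To make the "chunks" idea concrete, here is an illustrative streaming variant 
of the same loop that keeps only bounded running aggregates instead of 
retaining every vector. The non-negativity check and the total-weight counter 
are made up for the example; this is not Mahout's actual validation:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

public class StreamingValidationSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path modelDir = new Path(args[0]);                 // job output dir (placeholder)

    double totalWeight = 0;                            // bounded state, independent of model size
    long vectorCount = 0;
    for (FileStatus part : fs.globStatus(new Path(modelDir, "part-*"))) {
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, part.getPath(), conf);
      try {
        Text key = new Text();
        VectorWritable value = new VectorWritable();
        while (reader.next(key, value)) {
          // Example per-chunk check: weight sums must be non-negative.
          double sum = value.get().zSum();
          if (sum < 0) {
            throw new IllegalStateException("Negative weight sum for key " + key);
          }
          totalWeight += sum;
          vectorCount++;
          // No references are kept, so earlier vectors can be garbage-collected.
        }
      } finally {
        reader.close();
      }
    }
    System.out.println(vectorCount + " vectors, total weight " + totalWeight);
  }
}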

Thanks!