Hi everybody,

I am implementing a Naive Bayes classifier that handles a large amount of data, using EMR as the Hadoop cluster. By large amount of data I mean that the final models are around 45GB. While the feature extraction step works fine, calling TrainNaiveBayesJob results in the following exception:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:491)
    at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:444)
    at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:139)
    at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:122)
    at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1875)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2007)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
    at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(Unknown Source)
    at com.google.common.collect.AbstractIterator.hasNext(Unknown Source)
    at com.google.common.collect.Iterators$5.hasNext(Unknown Source)
    at com.google.common.collect.ForwardingIterator.hasNext(Unknown Source)
    at org.apache.mahout.classifier.naivebayes.BayesUtils.readModelFromDir(BayesUtils.java:79)
    at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:161)
It took me a little while to realize that the MapReduce job for Naive Bayes finishes fine on each of the reducers, but after the reduce step the namenode fetches the models from the reducers, loads them into memory, validates them, and only after that validation serializes the final model. My second thought was to increase the heap memory of the namenode (using a bootstrap action in EMR: s3://elasticmapreduce/bootstrap-actions/configure-daemons --namenode-heap-size=60000), but even with this setup I am still getting the same exception.
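
For what it's worth, judging from the stack trace the read step seems to follow roughly the pattern sketched below: every weight vector in the model directory is deserialized and kept on the heap at the same time, so the memory needed grows with the total model size (~45GB here). The class name, key type, and variable names are mine, not Mahout's; this is only meant to illustrate the shape of the problem.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    // Illustrative only -- NOT Mahout's actual code.
    public class ModelReadSketch {

      // Reads every (key, weight-vector) pair of one model file into a list.
      // Nothing is released until the whole file has been read, so the heap
      // has to hold the entire model at once.
      public static List<Vector> readAllWeights(Path seqFile, Configuration conf) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        List<Vector> weights = new ArrayList<Vector>();
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, seqFile, conf);
        try {
          Text key = new Text();                    // key type is a placeholder
          VectorWritable value = new VectorWritable();
          while (reader.next(key, value)) {         // VectorWritable.readFields -> RandomAccessSparseVector
            weights.add(value.get().clone());       // all vectors stay in memory
          }
        } finally {
          reader.close();
        }
        return weights;
      }
    }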
Has anybody dealt with a similar problem? Any suggestions (other than trying a bigger master node)?

Also, what is the rationale for loading the whole model into the namenode's memory? While I understand the need for validation, couldn't it be done in chunks of data instead of over the complete model, to avoid this scalability issue? (A rough sketch of what I mean is below.)
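
By "in chunks" I mean something like the following streaming check: read one vector at a time, verify a per-vector invariant, keep only a small running aggregate, and let the vector be garbage-collected before the next one is read. The invariant shown (no negative weights) and the names are placeholders of mine, not whatever Mahout actually validates.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.VectorWritable;

    // Illustrative only -- NOT the actual Mahout validation.
    public class StreamingValidationSketch {

      // Validates vector by vector: peak memory is one vector plus a few
      // doubles, independent of the total model size.
      public static double validateStreaming(Path seqFile, Configuration conf) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, seqFile, conf);
        double totalWeight = 0.0;                   // small running aggregate
        try {
          Text key = new Text();                    // key type is a placeholder
          VectorWritable value = new VectorWritable();
          while (reader.next(key, value)) {
            if (value.get().minValue() < 0.0) {     // stand-in per-vector invariant
              throw new IllegalStateException("negative weight for key " + key);
            }
            totalWeight += value.get().zSum();      // nothing else is retained
          }
        } finally {
          reader.close();
        }
        return totalWeight;
      }
    }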
Thanks!