Use seq2encoded instead to create smaller vectors. See the other thread.
Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.

On Thu, Jan 3, 2013 at 3:47 PM, Robin Anil <[email protected]> wrote:

> The model is bounded by the feature space, so if you are using up to
> trigrams you need to estimate the memory needed. IIRC it is roughly:
> num classes * num features * 12-16 bytes per entry.
>
> See if you can actually build a model with that. Otherwise, I would
> suggest pruning features from the input vectors, e.g. those with
> frequency < 5.
>
> Robin
>
> Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
>
> On Thu, Jan 3, 2013 at 2:23 PM, Adam Baron <[email protected]> wrote:
>
>> I'm trying to run Naïve Bayes on 2.4GB of tfidf-vectors representing a
>> bunch of 1-, 2-, and 3-grams. However, no matter how much I increase
>> mapred.child.java.opts, I get "java.lang.OutOfMemoryError: Java heap
>> space" errors. My most recent attempt before e-mailing this list was
>> 32GB for mapred.child.java.opts and 33GB for mapred.child.ulimit.
>>
>> I'm running "mahout trainnb" with these arguments:
>> -i [my tfidf-vectors directory on HDFS]
>> -el
>> -o [name of a model file that does not yet exist, in an HDFS directory
>>     that does exist]
>> -li [name of a label index file that does not yet exist, in an HDFS
>>     directory that does exist]
>> -ow
>>
>> Any idea what I can try to get this to work? I don't think I fancy
>> going above 32GB of heap for a 2.4GB input file.
>> Below is the output when I run the command:
>>
>> 13/01/03 14:08:43 INFO common.HadoopUtil: Deleting temp
>> 13/01/03 14:09:31 INFO input.FileInputFormat: Total input paths to process : 1
>> 13/01/03 14:09:32 INFO mapred.JobClient: Running job: job_201211120903_15452
>> 13/01/03 14:09:33 INFO mapred.JobClient:  map 0% reduce 0%
>> 13/01/03 14:09:44 INFO mapred.JobClient:  map 51% reduce 0%
>> 13/01/03 14:09:45 INFO mapred.JobClient:  map 71% reduce 0%
>> 13/01/03 14:09:47 INFO mapred.JobClient:  map 88% reduce 0%
>> 13/01/03 14:09:48 INFO mapred.JobClient:  map 99% reduce 0%
>> 13/01/03 14:09:52 INFO mapred.JobClient:  map 100% reduce 0%
>> 13/01/03 14:09:59 INFO mapred.JobClient:  map 100% reduce 5%
>> 13/01/03 14:10:02 INFO mapred.JobClient:  map 100% reduce 31%
>> 13/01/03 14:10:05 INFO mapred.JobClient:  map 100% reduce 33%
>> 13/01/03 14:10:08 INFO mapred.JobClient:  map 100% reduce 75%
>> 13/01/03 14:10:11 INFO mapred.JobClient:  map 100% reduce 78%
>> 13/01/03 14:10:15 INFO mapred.JobClient:  map 100% reduce 82%
>> 13/01/03 14:10:17 INFO mapred.JobClient:  map 100% reduce 89%
>> 13/01/03 14:10:20 INFO mapred.JobClient:  map 100% reduce 95%
>> 13/01/03 14:10:23 INFO mapred.JobClient:  map 100% reduce 100%
>> 13/01/03 14:10:29 INFO mapred.JobClient: Job complete: job_201211120903_15452
>> 13/01/03 14:10:29 INFO mapred.JobClient: Counters: 22
>> 13/01/03 14:10:29 INFO mapred.JobClient:   Job Counters
>> 13/01/03 14:10:29 INFO mapred.JobClient:     Launched reduce tasks=1
>> 13/01/03 14:10:29 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=258302
>> 13/01/03 14:10:29 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
>> 13/01/03 14:10:29 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
>> 13/01/03 14:10:29 INFO mapred.JobClient:     Launched map tasks=19
>> 13/01/03 14:10:29 INFO mapred.JobClient:     Data-local map tasks=19
>> 13/01/03 14:10:29 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=36375
>> 13/01/03 14:10:29 INFO mapred.JobClient:   FileSystemCounters
>> 13/01/03 14:10:29 INFO mapred.JobClient:     FILE_BYTES_READ=306924353
>> 13/01/03 14:10:29 INFO mapred.JobClient:     HDFS_BYTES_READ=2545107495
>> 13/01/03 14:10:29 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=614908308
>> 13/01/03 14:10:29 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=217788513
>> 13/01/03 14:10:29 INFO mapred.JobClient:   Map-Reduce Framework
>> 13/01/03 14:10:29 INFO mapred.JobClient:     Reduce input groups=2
>> 13/01/03 14:10:29 INFO mapred.JobClient:     Combine output records=20
>> 13/01/03 14:10:29 INFO mapred.JobClient:     Map input records=370867
>> 13/01/03 14:10:29 INFO mapred.JobClient:     Reduce shuffle bytes=290705921
>> 13/01/03 14:10:29 INFO mapred.JobClient:     Reduce output records=2
>> 13/01/03 14:10:29 INFO mapred.JobClient:     Spilled Records=40
>> 13/01/03 14:10:29 INFO mapred.JobClient:     Map output bytes=2524521040
>> 13/01/03 14:10:29 INFO mapred.JobClient:     Combine input records=370867
>> 13/01/03 14:10:29 INFO mapred.JobClient:     Map output records=370867
>> 13/01/03 14:10:29 INFO mapred.JobClient:     SPLIT_RAW_BYTES=3458
>> 13/01/03 14:10:29 INFO mapred.JobClient:     Reduce input records=20
>> 13/01/03 14:10:29 INFO input.FileInputFormat: Total input paths to process : 1
>> 13/01/03 14:10:29 INFO mapred.JobClient: Running job: job_201211120903_15453
>> 13/01/03 14:10:30 INFO mapred.JobClient:  map 0% reduce 0%
>> 13/01/03 14:10:45 INFO mapred.JobClient:  map 50% reduce 0%
>> 13/01/03 14:10:47 INFO mapred.JobClient:  map 100% reduce 0%
>> 13/01/03 14:11:04 INFO mapred.JobClient:  map 100% reduce 16%
>> 13/01/03 14:11:07 INFO mapred.JobClient:  map 100% reduce 33%
>> 13/01/03 14:11:10 INFO mapred.JobClient:  map 100% reduce 100%
>> 13/01/03 14:11:18 INFO mapred.JobClient: Job complete: job_201211120903_15453
>> 13/01/03 14:11:18 INFO mapred.JobClient: Counters: 22
>> 13/01/03 14:11:18 INFO mapred.JobClient:   Job Counters
>> 13/01/03 14:11:18 INFO mapred.JobClient:     Launched reduce tasks=1
>> 13/01/03 14:11:18 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=36791
>> 13/01/03 14:11:18 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
>> 13/01/03 14:11:18 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
>> 13/01/03 14:11:18 INFO mapred.JobClient:     Launched map tasks=2
>> 13/01/03 14:11:18 INFO mapred.JobClient:     Data-local map tasks=2
>> 13/01/03 14:11:18 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=20671
>> 13/01/03 14:11:18 INFO mapred.JobClient:   FileSystemCounters
>> 13/01/03 14:11:18 INFO mapred.JobClient:     FILE_BYTES_READ=202961723
>> 13/01/03 14:11:18 INFO mapred.JobClient:     HDFS_BYTES_READ=301359707
>> 13/01/03 14:11:18 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=308381584
>> 13/01/03 14:11:18 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=205579891
>> 13/01/03 14:11:18 INFO mapred.JobClient:   Map-Reduce Framework
>> 13/01/03 14:11:18 INFO mapred.JobClient:     Reduce input groups=2
>> 13/01/03 14:11:18 INFO mapred.JobClient:     Combine output records=4
>> 13/01/03 14:11:18 INFO mapred.JobClient:     Map input records=2
>> 13/01/03 14:11:18 INFO mapred.JobClient:     Reduce shuffle bytes=7559204
>> 13/01/03 14:11:18 INFO mapred.JobClient:     Reduce output records=2
>> 13/01/03 14:11:18 INFO mapred.JobClient:     Spilled Records=10
>> 13/01/03 14:11:18 INFO mapred.JobClient:     Map output bytes=217788354
>> 13/01/03 14:11:18 INFO mapred.JobClient:     Combine input records=4
>> 13/01/03 14:11:18 INFO mapred.JobClient:     Map output records=4
>> 13/01/03 14:11:18 INFO mapred.JobClient:     SPLIT_RAW_BYTES=296
>> 13/01/03 14:11:18 INFO mapred.JobClient:     Reduce input records=4
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>   at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:434)
>>   at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
>>   at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:139)
>>   at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:118)
>>   at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1766)
>>   at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1894)
>>   at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
>>   at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
>>   at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
>>   at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
>>   at com.google.common.collect.Iterators$5.hasNext(Iterators.java:525)
>>   at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
>>   at org.apache.mahout.classifier.naivebayes.BayesUtils.readModelFromDir(BayesUtils.java:61)
>>   at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:137)
>>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>   at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.main(TrainNaiveBayesJob.java:62)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>   at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>   at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>   at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>   at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
>>
>> Thanks,
>> Adam
>>
>> PS: I was able to run the classify-20newsgroups.sh example packaged in
>> Mahout 0.7 needing only to increase my mapred.child.java.opts to 2GB
>> (since it had similar errors at 1GB).
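
Robin's back-of-the-envelope estimate above (num classes * num features * 12-16 bytes) can be turned into a quick sanity check before launching a training run. This is a rough sketch, not Mahout code: the per-entry byte range is Robin's figure for the sparse map overhead, and the class/feature counts below are hypothetical, not taken from Adam's data.

```python
def naive_bayes_model_bytes(num_classes, num_features, bytes_per_entry=16):
    """Rough upper bound on the in-memory Naive Bayes model size:
    one weight per (class, feature) pair, at ~12-16 bytes per entry
    of sparse-map overhead (Robin's estimate)."""
    return num_classes * num_features * bytes_per_entry

# Hypothetical example: 2 classes, 50 million 1/2/3-gram features.
size = naive_bayes_model_bytes(2, 50_000_000)
gib = size / 1024**3  # about 1.49 GiB of heap just for the model
```

If the result lands anywhere near the configured heap, pruning rare features (e.g. frequency < 5, as suggested above) is cheaper than raising mapred.child.java.opts further.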
