On Wednesday 31 August 2011 15:12:25 Marek Bachmann wrote:
> On 31.08.2011 12:58, Markus Jelsma wrote:
> > UNFO?? That's interesting! Anyway, I understand you don't want to
> > parse this file. See your other thread.
>
> Interesting? Do you know the file type? Is this something that shouldn't
> be public? Actually, I noticed it is UNF0 (ZERO!), not the letter O.

No, just interesting because I had never heard of it. Could be:
http://www.file-extensions.org/unf-file-extension
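If you'd rather keep those files out of the crawl altogether instead of only
skipping the parse, a URL filter rule is the simplest route. A rough sketch,
assuming the stock urlfilter-regex plugin is enabled; the rule goes in
conf/regex-urlfilter.txt above the final catch-all '+.' line, and the pattern
is only a guess at what you want to exclude:

    # skip the UNF0 dumps, whatever they are (case-insensitive match)
    -(?i)\.unf0$

The minus sign rejects every URL matching the pattern, and rules are applied
top to bottom with the first match winning, so order matters.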
> > The OOM can happen for many reasons. By default Nutch takes 1G of RAM
> > (hence the 52%). You can toggle the setting via the Xmx JVM parameter.
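To make that concrete: your job runs in the local job runner, so only one JVM
is involved and raising its heap should be enough. A rough sketch, assuming
you start things via bin/nutch; 1500 MB is just a guess at what your 2 GB box
can spare, and <segment> stands for whatever segment you are parsing:

    # heap for the Nutch/Hadoop JVM, in MB (the default corresponds to -Xmx1000m)
    export NUTCH_HEAPSIZE=1500
    bin/nutch parse <segment>

If your bin/nutch doesn't pick up NUTCH_HEAPSIZE, passing -Xmx1500m to java
directly has the same effect. On a real cluster you would look at
mapred.child.java.opts instead.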
> >
> > On Wednesday 31 August 2011 01:47:05 Marek Bachmann wrote:
> >> Hello,
> >>
> >> I ran into a bad situation.
> >>
> >> After crawling and parsing about 130k pages in multiple
> >> generate/fetch/parse/update cycles, the parser crashed today with:
> >>
> >> Error parsing:
> >> http://www.usf.uni-kassel.de/ftp/user/eisner/Felix/precip/B1/ECHAM5/GPREC_2041_11.31.UNF0: failed(2,200):
> >> org.apache.nutch.parse.ParseException: Unable to successfully parse content
> >> Exception in thread "main" java.io.IOException: Job failed!
> >>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
> >>         at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
> >>         at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
> >>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>         at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)
> >>
> >> and, more verbosely, in the hadoop.log:
> >>
> >> java.lang.OutOfMemoryError: Java heap space
> >>         at org.apache.nutch.protocol.Content.readFields(Content.java:140)
> >>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
> >>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
> >>         at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
> >>         at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
> >>         at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
> >>         at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
> >>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
> >>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
> >>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> >>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> >>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> >>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> >>
> >> The strange thing is that the parser didn't stop running. It remains in
> >> a state where it consumes 100 % CPU and doesn't do anything any more.
> >>
> >> The last lines it wrote to the hadoop.log file were:
> >>
> >> java.lang.OutOfMemoryError: Java heap space
> >>         at org.apache.nutch.protocol.Content.readFields(Content.java:140)
> >>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
> >>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
> >>         at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
> >>         at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
> >>         at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
> >>         at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
> >>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
> >>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
> >>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> >>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> >>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> >>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> >> 2011-08-31 01:27:00,722 INFO mapred.JobClient - Job complete: job_local_0001
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient - Counters: 11
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -   ParserStatus
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     failed=313
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     success=14826
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -   FileSystemCounters
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     FILE_BYTES_READ=2047029532
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     FILE_BYTES_WRITTEN=819506637
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -   Map-Reduce Framework
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Combine output records=0
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Map input records=15746
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Spilled Records=15138
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Map output bytes=83235364
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Map input bytes=306386116
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Combine input records=0
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Map output records=15139
> >>
> >> Now it is 01:42 and nothing has happened since this last log entry, but
> >> the java process is still using all of the CPU.
> >>
> >> I think there is something wrong.
> >>
> >> It seems to me that my machine has too little memory (2 GB). But I am a
> >> little curious that top says the java process is only using 52 % of the
> >> memory.
> >>
> >> Any suggestions?
> >>
> >> BTW: I don't want to parse UNFO files. In fact I have no idea what this
> >> is! But there are many strange file types in our university network.
> >> Handling them is another topic for me :)

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

