UNFO?? That's interesting! Anyway, I understand you don't want to parse this file. See your other thread.
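In case it helps for keeping such files out of future fetch cycles entirely: with the standard urlfilter-regex plugin enabled you can exclude them in conf/regex-urlfilter.txt. This is only a sketch under that assumption, and it assumes the extension really is .UNF0 as in the log line quoted below; adjust the pattern to your URLs:

    # skip URLs ending in .UNF plus an optional digit (hypothetical pattern)
    -\.UNF[0-9]?$

Exclusion rules start with '-' and are applied in order, so put the line before the catch-all '+.' at the end of the file.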
The OOM can happen for many reasons. By default Nutch takes 1 GB of RAM, which is roughly the 52% of your 2 GB that top reports. You can raise that limit via the -Xmx JVM parameter.
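Your log shows the LocalJobRunner, so the whole parse job runs inside the single JVM started by bin/nutch, and that is the heap you need to grow. A minimal sketch, assuming the Nutch 1.x bin/nutch script, which honours the NUTCH_HEAPSIZE environment variable (value in MB, default 1000); the 1500 here is just an example for a 2 GB machine:

    # give the Nutch client JVM more heap than the default 1000 MB
    export NUTCH_HEAPSIZE=1500
    bin/nutch parse <segment_dir>

If you later move to a real Hadoop cluster, the equivalent knob is mapred.child.java.opts (e.g. -Xmx1500m), because map tasks then run in their own JVMs.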
On Wednesday 31 August 2011 01:47:05 Marek Bachmann wrote:
> Hello,
>
> I ran into a bad situation.
>
> After crawling and parsing about 130k pages in multiple
> generate/fetch/parse/update cycles, today the parser crashed with:
>
> Error parsing:
> http://www.usf.uni-kassel.de/ftp/user/eisner/Felix/precip/B1/ECHAM5/GPREC_2041_11.31.UNF0:
> failed(2,200): org.apache.nutch.parse.ParseException: Unable to
> successfully parse content
> Exception in thread "main" java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
>     at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)
>
> and in the hadoop.log, more verbosely:
>
> java.lang.OutOfMemoryError: Java heap space
>     at org.apache.nutch.protocol.Content.readFields(Content.java:140)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>     at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
>     at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
>     at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
>     at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>
> The strange thing is that the parser didn't stop running. It remains in
> a state where it consumes 100% CPU and doesn't do anything any more.
>
> The last lines it wrote to the hadoop.log file were:
>
> java.lang.OutOfMemoryError: Java heap space
>     at org.apache.nutch.protocol.Content.readFields(Content.java:140)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>     at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
>     at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
>     at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
>     at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> 2011-08-31 01:27:00,722 INFO mapred.JobClient - Job complete: job_local_0001
> 2011-08-31 01:27:08,975 INFO mapred.JobClient - Counters: 11
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -   ParserStatus
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     failed=313
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     success=14826
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -   FileSystemCounters
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     FILE_BYTES_READ=2047029532
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     FILE_BYTES_WRITTEN=819506637
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -   Map-Reduce Framework
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Combine output records=0
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Map input records=15746
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Spilled Records=15138
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Map output bytes=83235364
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Map input bytes=306386116
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Combine input records=0
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Map output records=15139
>
> Now it is 01:42 and nothing has happened since this last log entry, but
> the java process is still using all of the CPU.
>
> I think there is something wrong.
>
> It seems to me that my machine has too little memory (2 GB). But I am a
> little curious that top says the java process is only using 52% of the
> memory.
>
> Any suggestions?
>
> BTW: I don't want to parse UNFO files. In fact I have no idea what this
> is! But in our university network there are many strange file types.
> Handling this is another topic for me :)

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

