Hello,

I ran into a bad situation.

After crawling and parsing about 130k pages in multiple
generate/fetch/parse/update cycles today, the parser crashed with:

Error parsing: http://www.usf.uni-kassel.de/ftp/user/eisner/Felix/precip/B1/ECHAM5/GPREC_2041_11.31.UNF0: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
        at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)

and in the hadoop.log, more verbosely:

java.lang.OutOfMemoryError: Java heap space
        at org.apache.nutch.protocol.Content.readFields(Content.java:140)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)

The strange thing is that the parser didn't stop running. It remains in
a state where it consumes 100% CPU but doesn't do anything any more.

The last lines it wrote to the hadoop.log file were:

[the same OutOfMemoryError stack trace as above, followed by:]
2011-08-31 01:27:00,722 INFO  mapred.JobClient - Job complete: job_local_0001
2011-08-31 01:27:08,975 INFO  mapred.JobClient - Counters: 11
2011-08-31 01:27:08,975 INFO  mapred.JobClient -   ParserStatus
2011-08-31 01:27:08,975 INFO  mapred.JobClient -     failed=313
2011-08-31 01:27:08,975 INFO  mapred.JobClient -     success=14826
2011-08-31 01:27:08,975 INFO  mapred.JobClient -   FileSystemCounters
2011-08-31 01:27:08,975 INFO  mapred.JobClient -     FILE_BYTES_READ=2047029532
2011-08-31 01:27:08,975 INFO  mapred.JobClient -     FILE_BYTES_WRITTEN=819506637
2011-08-31 01:27:08,975 INFO  mapred.JobClient -   Map-Reduce Framework
2011-08-31 01:27:08,975 INFO  mapred.JobClient -     Combine output records=0
2011-08-31 01:27:08,975 INFO  mapred.JobClient -     Map input records=15746
2011-08-31 01:27:08,975 INFO  mapred.JobClient -     Spilled Records=15138
2011-08-31 01:27:08,975 INFO  mapred.JobClient -     Map output bytes=83235364
2011-08-31 01:27:08,975 INFO  mapred.JobClient -     Map input bytes=306386116
2011-08-31 01:27:08,975 INFO  mapred.JobClient -     Combine input records=0
2011-08-31 01:27:08,975 INFO  mapred.JobClient -     Map output records=15139

Now it is 01:42 and nothing has happened since this last log entry, but
the java process is still using all of the CPU.

I think there is something wrong.
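
If a thread dump of the hung process would help, I can grab one the
next time it happens, using jstack from the JDK. The pid below is just
a placeholder for whatever top shows for the java process:

    # 12345 is a hypothetical pid taken from top; -l adds lock info
    jstack -l 12345 > parse-hang-threads.txt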

It seems to me that my machine simply has too little memory (2 GB). But
I am a little curious that top says the java process is only using 52%
of the memory.
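
If it really is just heap, I suppose I could give the parse step more
memory. As far as I can tell, bin/nutch honours a NUTCH_HEAPSIZE
environment variable (in MB, default 1000), and since the trace shows
LocalJobRunner, that should be the relevant knob here; on a real
cluster it would be mapred.child.java.opts instead. A sketch of what I
have in mind (untested):

    # local mode: raise the heap of the single Nutch JVM to ~1.5 GB
    export NUTCH_HEAPSIZE=1500
    bin/nutch parse crawl/segments/<segment>

    <!-- distributed mode: per-task JVM heap, e.g. in mapred-site.xml -->
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1024m</value>
    </property>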

Any suggestions?

BTW: I don't want to parse UNF0 files at all. In fact, I have no idea
what they are! But there are many strange file types on our university
network. Handling them is another topic for me :)
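
Still, in case those big binary files are what blows the heap, my plan
would be to keep them out of the crawl in the first place: exclude the
extension in conf/regex-urlfilter.txt and cap the download size with
http.content.limit. Both snippets below are untested guesses at the
exact syntax I need:

    # conf/regex-urlfilter.txt: skip the .UNF0 files by extension
    -(?i)\.UNF0$

    <!-- conf/nutch-site.xml: truncate anything larger than ~1 MB -->
    <property>
      <name>http.content.limit</name>
      <value>1048576</value>
    </property>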


