Hello, I ran into a bad situation.
After crawling and parsing about 130k pages in multiple generate/fetch/parse/update cycles today, the parser crashed with:

    Error parsing: http://www.usf.uni-kassel.de/ftp/user/eisner/Felix/precip/B1/ECHAM5/GPREC_2041_11.31.UNF0: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
    Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
        at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)

and, more verbosely, in the hadoop.log:

    java.lang.OutOfMemoryError: Java heap space
        at org.apache.nutch.protocol.Content.readFields(Content.java:140)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)

The strange thing is that the parser didn't stop running. It remains in a state where it consumes 100% CPU but doesn't do anything anymore.
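Since the trace ends in LocalJobRunner, everything seems to run in a single JVM, so I guess the first thing I'll try is giving that JVM more heap before the next parse run. If I read the bin/nutch script correctly, it honours a NUTCH_HEAPSIZE environment variable (in MB, default 1000), so something like this (the segment path is just a placeholder for my real one):

    # give the local Nutch JVM a larger heap (value in MB)
    export NUTCH_HEAPSIZE=1536
    bin/nutch parse crawl/segments/<segment>

Is that the right knob for local mode, or is there a per-job setting I should use instead?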
The last lines it wrote to the hadoop.log file were the same OutOfMemoryError stack trace as above, followed by:

    2011-08-31 01:27:00,722 INFO mapred.JobClient - Job complete: job_local_0001
    2011-08-31 01:27:08,975 INFO mapred.JobClient - Counters: 11
    2011-08-31 01:27:08,975 INFO mapred.JobClient -   ParserStatus
    2011-08-31 01:27:08,975 INFO mapred.JobClient -     failed=313
    2011-08-31 01:27:08,975 INFO mapred.JobClient -     success=14826
    2011-08-31 01:27:08,975 INFO mapred.JobClient -   FileSystemCounters
    2011-08-31 01:27:08,975 INFO mapred.JobClient -     FILE_BYTES_READ=2047029532
    2011-08-31 01:27:08,975 INFO mapred.JobClient -     FILE_BYTES_WRITTEN=819506637
    2011-08-31 01:27:08,975 INFO mapred.JobClient -   Map-Reduce Framework
    2011-08-31 01:27:08,975 INFO mapred.JobClient -     Combine output records=0
    2011-08-31 01:27:08,975 INFO mapred.JobClient -     Map input records=15746
    2011-08-31 01:27:08,975 INFO mapred.JobClient -     Spilled Records=15138
    2011-08-31 01:27:08,975 INFO mapred.JobClient -     Map output bytes=83235364
    2011-08-31 01:27:08,975 INFO mapred.JobClient -     Map input bytes=306386116
    2011-08-31 01:27:08,975 INFO mapred.JobClient -     Combine input records=0
    2011-08-31 01:27:08,975 INFO mapred.JobClient -     Map output records=15139

Now it is 01:42 and nothing has happened since that last log entry, but the java process is still using all of the CPU. I think something is wrong. It seems to me that my machine has too little memory (2 GB). But I am a little curious that top says the java process is only using 52% of the memory. Any suggestions?

BTW: I don't want to parse UNF0 files. In fact, I have no idea what they are! But there are many strange file types in our university network. Handling them is another topic for me :)
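One more idea: since the OutOfMemoryError happens while deserializing a Content record, maybe a single huge file (like that UNF0 one) is what blows up the heap. If I understand conf/nutch-default.xml correctly, http.content.limit caps how many bytes of each document are kept, so I could override it in my nutch-site.xml rather than keeping unlimited content around. A sketch of what I mean (the 1 MB value is just a guess on my part):

    <!-- conf/nutch-site.xml: cap fetched content at ~1 MB per document -->
    <property>
      <name>http.content.limit</name>
      <value>1048576</value>
    </property>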
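And to keep these files out of future fetch lists entirely, I suppose I could add an exclude rule to conf/regex-urlfilter.txt before the catch-all rule; a sketch covering just the one extension I have actually seen fail:

    # skip these binary climate-model dumps by extension (hypothetical rule)
    -\.UNF0$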

