UNFO?? That's interesting! Anyway, I understand you don't want to parse this file. See your other thread.
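In case it helps for keeping such files out of future fetch cycles entirely: with the standard urlfilter-regex plugin enabled you can exclude them in conf/regex-urlfilter.txt. This is only a sketch under that assumption, and it assumes the extension really is .UNF0 as in the log line quoted below; adjust the pattern to your URLs:

    # skip URLs ending in .UNF plus an optional digit (hypothetical pattern)
    -\.UNF[0-9]?$

Exclusion rules start with '-' and are applied in order, so put the line before the catch-all '+.' at the end of the file.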
The OOM can happen for many reasons. By default Nutch takes 1 GB of RAM, which is roughly the 52% of your 2 GB that top reports. You can raise that limit via the -Xmx JVM parameter.
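Your log shows the LocalJobRunner, so the whole parse job runs inside the single JVM started by bin/nutch, and that is the heap you need to grow. A minimal sketch, assuming the Nutch 1.x bin/nutch script, which honours the NUTCH_HEAPSIZE environment variable (value in MB, default 1000); the 1500 here is just an example for a 2 GB machine:

    # give the Nutch client JVM more heap than the default 1000 MB
    export NUTCH_HEAPSIZE=1500
    bin/nutch parse <segment_dir>

If you later move to a real Hadoop cluster, the equivalent knob is mapred.child.java.opts (e.g. -Xmx1500m), because map tasks then run in their own JVMs.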
On Wednesday 31 August 2011 01:47:05 Marek Bachmann wrote:
> Hello,
>
> I ran into a bad situation.
>
> After crawling and parsing about 130k pages in multiple
> generate/fetch/parse/update cycles, today the parser crashed with:
>
> Error parsing:
> http://www.usf.uni-kassel.de/ftp/user/eisner/Felix/precip/B1/ECHAM5/GPREC_2041_11.31.UNF0:
> failed(2,200): org.apache.nutch.parse.ParseException: Unable to
> successfully parse content
> Exception in thread "main" java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
>     at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)
>
> and in the hadoop.log, more verbosely:
>
> java.lang.OutOfMemoryError: Java heap space
>     at org.apache.nutch.protocol.Content.readFields(Content.java:140)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>     at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
>     at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
>     at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
>     at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>
> The strange thing is that the parser didn't stop running. It remains in
> a state where it consumes 100% CPU and doesn't do anything any more.
>
> The last lines it wrote to the hadoop.log file were:
>
> java.lang.OutOfMemoryError: Java heap space
>     at org.apache.nutch.protocol.Content.readFields(Content.java:140)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>     at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
>     at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
>     at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
>     at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> 2011-08-31 01:27:00,722 INFO mapred.JobClient - Job complete: job_local_0001
> 2011-08-31 01:27:08,975 INFO mapred.JobClient - Counters: 11
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -   ParserStatus
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     failed=313
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     success=14826
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -   FileSystemCounters
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     FILE_BYTES_READ=2047029532
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     FILE_BYTES_WRITTEN=819506637
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -   Map-Reduce Framework
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Combine output records=0
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Map input records=15746
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Spilled Records=15138
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Map output bytes=83235364
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Map input bytes=306386116
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Combine input records=0
> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Map output records=15139
>
> Now it is 01:42 and nothing has happened since this last log entry, but
> the java process is still using all of the CPU.
>
> I think there is something wrong.
>
> It seems to me that my machine has too little memory (2 GB). But I am a
> little curious that top says the java process is only using 52% of the
> memory.
>
> Any suggestions?
>
> BTW: I don't want to parse UNFO files. In fact I have no idea what this
> is! But in our university network there are many strange file types.
> Handling this is another topic for me :)

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

