On Wednesday 31 August 2011 15:12:25 Marek Bachmann wrote:
> On 31.08.2011 12:58, Markus Jelsma wrote:
> > UNFO?? That's interesting! Anyway, I understand you don't want to
> > parse this file. See your other thread.
>
> Interesting? Do you know the file type? Is this something that shouldn't
> be public? Actually, I noticed it is UNF0 (ZERO!), not the letter O.

No, just interesting because I had never heard of it. Could be:
http://www.file-extensions.org/unf-file-extension
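If you'd rather keep those files out of the crawl altogether instead of only
skipping the parse, a URL filter rule is the simplest route. A rough sketch,
assuming the stock urlfilter-regex plugin is enabled; the rule goes in
conf/regex-urlfilter.txt above the final catch-all '+.' line, and the pattern
is only a guess at what you want to exclude:

    # skip the UNF0 dumps, whatever they are (case-insensitive match)
    -(?i)\.unf0$

The minus sign rejects every URL matching the pattern, and rules are applied
top to bottom with the first match winning, so order matters.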
> > The OOM can happen for many reasons. By default Nutch takes 1G of RAM
> > (hence the 52%). You can toggle the setting via the Xmx JVM parameter.
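To make that concrete: your job runs in the local job runner, so only one JVM
is involved and raising its heap should be enough. A rough sketch, assuming
you start things via bin/nutch; 1500 MB is just a guess at what your 2 GB box
can spare, and <segment> stands for whatever segment you are parsing:

    # heap for the Nutch/Hadoop JVM, in MB (the default corresponds to -Xmx1000m)
    export NUTCH_HEAPSIZE=1500
    bin/nutch parse <segment>

If your bin/nutch doesn't pick up NUTCH_HEAPSIZE, passing -Xmx1500m to java
directly has the same effect. On a real cluster you would look at
mapred.child.java.opts instead.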
> >
> > On Wednesday 31 August 2011 01:47:05 Marek Bachmann wrote:
> >> Hello,
> >>
> >> I ran into a bad situation.
> >>
> >> After crawling and parsing about 130k pages in multiple
> >> generate/fetch/parse/update cycles, the parser crashed today with:
> >>
> >> Error parsing:
> >> http://www.usf.uni-kassel.de/ftp/user/eisner/Felix/precip/B1/ECHAM5/GPREC_2041_11.31.UNF0: failed(2,200):
> >> org.apache.nutch.parse.ParseException: Unable to successfully parse content
> >> Exception in thread "main" java.io.IOException: Job failed!
> >>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
> >>         at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:157)
> >>         at org.apache.nutch.parse.ParseSegment.run(ParseSegment.java:178)
> >>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >>         at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:164)
> >>
> >> and, more verbosely, in the hadoop.log:
> >>
> >> java.lang.OutOfMemoryError: Java heap space
> >>         at org.apache.nutch.protocol.Content.readFields(Content.java:140)
> >>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
> >>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
> >>         at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
> >>         at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
> >>         at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
> >>         at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
> >>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
> >>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
> >>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> >>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> >>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> >>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> >>
> >> The strange thing is that the parser didn't stop running. It remains in
> >> a state where it consumes 100 % CPU and doesn't do anything any more.
> >>
> >> The last lines it wrote to the hadoop.log file were:
> >>
> >> java.lang.OutOfMemoryError: Java heap space
> >>         at org.apache.nutch.protocol.Content.readFields(Content.java:140)
> >>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
> >>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
> >>         at org.apache.hadoop.io.SequenceFile$Reader.deserializeValue(SequenceFile.java:1817)
> >>         at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1790)
> >>         at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:103)
> >>         at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:78)
> >>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192)
> >>         at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176)
> >>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
> >>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
> >>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
> >>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> >> 2011-08-31 01:27:00,722 INFO mapred.JobClient - Job complete: job_local_0001
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient - Counters: 11
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -   ParserStatus
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     failed=313
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     success=14826
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -   FileSystemCounters
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     FILE_BYTES_READ=2047029532
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     FILE_BYTES_WRITTEN=819506637
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -   Map-Reduce Framework
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Combine output records=0
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Map input records=15746
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Spilled Records=15138
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Map output bytes=83235364
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Map input bytes=306386116
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Combine input records=0
> >> 2011-08-31 01:27:08,975 INFO mapred.JobClient -     Map output records=15139
> >>
> >> Now it is 01:42 and nothing has happened since this last log entry, but
> >> the java process is still using all of the CPU.
> >>
> >> I think there is something wrong.
> >>
> >> It seems to me that my machine has too little memory (2 GB). But I am a
> >> little curious that top says the java process is only using 52 % of the
> >> memory.
> >>
> >> Any suggestions?
> >>
> >> BTW: I don't want to parse UNFO files. In fact I have no idea what this
> >> is! But there are many strange file types in our university network.
> >> Handling them is another topic for me :)

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

