Hi,
For some reason the fetcher sometimes produces corrupt, unreadable segments. It
then exits with an exception like "problem advancing post" or
"NegativeArraySizeException", etc.:
java.lang.RuntimeException: problem advancing post rec#702
    at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1225)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:250)
    at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:246)
    at org.apache.nutch.fetcher.Fetcher$FetcherReducer.reduce(Fetcher.java:1431)
    at org.apache.nutch.fetcher.Fetcher$FetcherReducer.reduce(Fetcher.java:1392)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:520)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at org.apache.hadoop.io.Text.readString(Text.java:402)
    at org.apache.nutch.metadata.Metadata.readFields(Metadata.java:243)
    at org.apache.nutch.parse.ParseData.readFields(ParseData.java:144)
    at org.apache.nutch.parse.ParseImpl.readFields(ParseImpl.java:70)
    at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
    at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:1282)
    at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1222)
    ... 7 more
2013-05-26 22:41:41,344 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1327)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1520)
    at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1556)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1529)
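If I read this first trace correctly, the EOFException from DataInputStream.readFully
means the stream ended in the middle of a serialized Metadata value, i.e. the record
on disk is shorter than its declared length. A tiny standalone demo of that failure
mode (plain java.io, nothing Nutch-specific, the class name is mine):

import java.io.*;

// Standalone demo: readFully() throws EOFException when the stream ends
// before the requested bytes arrive -- the same thing Text.readString hits
// when a serialized record is truncated on disk.
public class TruncatedRecordDemo {
  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buf);
    out.writeUTF("Content-Type");   // 2-byte length header + 12 bytes of data
    byte[] bytes = buf.toByteArray();

    // Simulate corruption by dropping the tail of the record.
    DataInputStream in = new DataInputStream(
        new ByteArrayInputStream(bytes, 0, bytes.length - 3));
    try {
      in.readUTF();                 // header promises 12 bytes, only 9 remain
    } catch (EOFException e) {
      System.out.println("Same EOFException as in the fetcher trace: " + e);
    }
  }
}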
These errors produce the following exception when trying to index.
java.io.IOException: IO error in map input file file:/opt/nutch/crawl/segments/20130526223014/crawl_parse/part-00000
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:242)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:216)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: org.apache.hadoop.fs.ChecksumException: Checksum error: file:/opt/nutch/crawl/segments/20130526223014/crawl_parse/part-00000 at 2620416
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:219)
    at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:237)
    at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
    at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
    at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
    at java.io.DataInputStream.readFully(DataInputStream.java:195)
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1992)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2124)
    at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
    ... 5 more
Is there any way we can debug this? The errors are usually related to Nutch
reading metadata, but since the metadata itself cannot be read, I cannot tell
which data is causing the issue :) Any hints on how to tackle these issues?
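For reference, this is the kind of scanner I had in mind to narrow it down: walk
the part file record by record with checksum verification switched off and report
where deserialization blows up. A rough, untested sketch against the old
mapred-era SequenceFile API (the class name is mine):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Scan a segment part file record by record and report where reading fails.
public class SegmentScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path part = new Path(args[0]); // e.g. crawl/segments/.../crawl_parse/part-00000
    FileSystem fs = part.getFileSystem(conf);
    fs.setVerifyChecksum(false);   // read past the .crc mismatch

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    long records = 0;
    long lastGoodPos = reader.getPosition();
    try {
      while (reader.next(key, val)) {
        records++;
        lastGoodPos = reader.getPosition();
      }
      System.out.println("Read " + records + " records cleanly.");
    } catch (Exception e) {
      // The key is deserialized before the value, so on an EOF in the value
      // this should be the URL of the truncated record.
      System.err.println("Failed after " + records + " records, near byte "
          + lastGoodPos + "; key at failure: " + key);
      e.printStackTrace();
    } finally {
      reader.close();
    }
  }
}

If that prints the key at the failure point, at least we would know which URL's
record is truncated.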
Markus