Hi Markus, a similar problem was posted some time ago:
http://lucene.472066.n3.nabble.com/NegativeArraySizeException-and-quot-problem-advancing-port-rec-quot-during-fetching-tt3994633.html#a3996554

Sebastian

On 05/27/2013 11:06 AM, Markus Jelsma wrote:
> Hi,
>
> For some reason the fetcher sometimes produces corrupt, unreadable segments.
> It then exits with an exception such as "problem advancing post" or
> "negative array size exception".
>
> java.lang.RuntimeException: problem advancing post rec#702
>     at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1225)
>     at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:250)
>     at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:246)
>     at org.apache.nutch.fetcher.Fetcher$FetcherReducer.reduce(Fetcher.java:1431)
>     at org.apache.nutch.fetcher.Fetcher$FetcherReducer.reduce(Fetcher.java:1392)
>     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:520)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
> Caused by: java.io.EOFException
>     at java.io.DataInputStream.readFully(DataInputStream.java:197)
>     at org.apache.hadoop.io.Text.readString(Text.java:402)
>     at org.apache.nutch.metadata.Metadata.readFields(Metadata.java:243)
>     at org.apache.nutch.parse.ParseData.readFields(ParseData.java:144)
>     at org.apache.nutch.parse.ParseImpl.readFields(ParseImpl.java:70)
>     at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>     at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:1282)
>     at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1222)
>     ... 7 more
>
> 2013-05-26 22:41:41,344 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1327)
>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1520)
>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1556)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1529)
>
> These errors produce the following exception when trying to index:
>
> java.io.IOException: IO error in map input file
> file:/opt/nutch/crawl/segments/20130526223014/crawl_parse/part-00000
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:242)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:216)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> Caused by: org.apache.hadoop.fs.ChecksumException: Checksum error:
> file:/opt/nutch/crawl/segments/20130526223014/crawl_parse/part-00000 at 2620416
>     at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:219)
>     at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:237)
>     at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
>     at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
>     at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
>     at java.io.DataInputStream.readFully(DataInputStream.java:195)
>     at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>     at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1992)
>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2124)
>     at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
>     ... 5 more
>
> Is there any way we can debug this? The error is usually related to Nutch
> reading metadata, but since we cannot read the metadata, I cannot know what
> data is causing the issue :) Any hints to share on how to tackle these issues?
>
> Markus
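One thing worth knowing when reading that last trace: the "Checksum error: ... at 2620416" comes from Hadoop's ChecksumFileSystem, which verifies data in fixed-size chunks (512 bytes by default, `io.bytes.per.checksum`) against CRC32 checksums stored in a hidden `.crc` side file, and the reported number is the byte offset of the failing chunk in the data file. The sketch below is not Hadoop's actual on-disk `.crc` format, just a minimal illustration of the mechanism: record a CRC per chunk at write time, re-verify at read time, and report the starting offset of the first chunk that no longer matches.

```python
import zlib

CHUNK = 512  # Hadoop's default io.bytes.per.checksum


def chunk_crcs(data: bytes) -> list:
    """Compute one CRC32 per fixed-size chunk, as done at write time."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]


def first_bad_offset(data: bytes, crcs: list):
    """Re-verify every chunk against the recorded CRCs; return the byte
    offset where the first corrupt chunk starts, or None if all match."""
    for n, crc in enumerate(chunk_crcs(data)):
        if crc != crcs[n]:
            return n * CHUNK
    return None


# Simulate a segment part file that is silently corrupted after writing.
data = bytearray(b"nutch-segment-record " * 200)  # 4200 bytes of fake records
crcs = chunk_crcs(bytes(data))                    # checksums recorded at write time
data[2620] ^= 0xFF                                # flip one byte mid-file

offset = first_bad_offset(bytes(data), crcs)
print("corruption in chunk starting at byte", offset)  # → 2560 (chunk 5)
```

So in the quoted trace the damage sits in the chunk starting at byte 2620416 of `part-00000`; anything before that point is still readable. Two things that may help recover, both to be taken as suggestions rather than a verified fix: `bin/nutch readseg -dump` on the segment should get you the readable portion up to the bad chunk, and Hadoop's `io.skip.checksum.errors` property, when set to true, tells the SequenceFile reader to skip past checksum errors (resyncing at the next sync marker) instead of aborting, at the cost of losing the damaged records.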

