Hi Markus, a similar problem was posted some time ago:
http://lucene.472066.n3.nabble.com/NegativeArraySizeException-and-quot-problem-advancing-port-rec-quot-during-fetching-tt3994633.html#a3996554

Sebastian

On 05/27/2013 11:06 AM, Markus Jelsma wrote:
> Hi,
>
> For some reason the fetcher sometimes produces corrupt, unreadable segments.
> It then exits with an exception such as "problem advancing post" or
> "negative array size exception".
>
> java.lang.RuntimeException: problem advancing post rec#702
>     at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1225)
>     at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:250)
>     at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:246)
>     at org.apache.nutch.fetcher.Fetcher$FetcherReducer.reduce(Fetcher.java:1431)
>     at org.apache.nutch.fetcher.Fetcher$FetcherReducer.reduce(Fetcher.java:1392)
>     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:520)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
> Caused by: java.io.EOFException
>     at java.io.DataInputStream.readFully(DataInputStream.java:197)
>     at org.apache.hadoop.io.Text.readString(Text.java:402)
>     at org.apache.nutch.metadata.Metadata.readFields(Metadata.java:243)
>     at org.apache.nutch.parse.ParseData.readFields(ParseData.java:144)
>     at org.apache.nutch.parse.ParseImpl.readFields(ParseImpl.java:70)
>     at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>     at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>     at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:1282)
>     at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1222)
>     ... 7 more
>
> 2013-05-26 22:41:41,344 ERROR fetcher.Fetcher - Fetcher: java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1327)
>     at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1520)
>     at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:1556)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:1529)
>
> These errors produce the following exception when trying to index:
>
> java.io.IOException: IO error in map input file
> file:/opt/nutch/crawl/segments/20130526223014/crawl_parse/part-00000
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:242)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:216)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> Caused by: org.apache.hadoop.fs.ChecksumException: Checksum error:
> file:/opt/nutch/crawl/segments/20130526223014/crawl_parse/part-00000 at 2620416
>     at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:219)
>     at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:237)
>     at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
>     at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
>     at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
>     at java.io.DataInputStream.readFully(DataInputStream.java:195)
>     at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>     at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1992)
>     at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2124)
>     at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:76)
>     at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:236)
>     ... 5 more
>
> Is there any way we can debug this? The error is usually related to Nutch
> reading metadata, but since we cannot read the metadata, I cannot know what
> data is causing the issue :) Any hints to share on how to tackle these issues?
>
> Markus
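One thing worth knowing when reading that last trace: the "Checksum error: ... at 2620416" comes from Hadoop's ChecksumFileSystem, which verifies data in fixed-size chunks (512 bytes by default, `io.bytes.per.checksum`) against CRC32 checksums stored in a hidden `.crc` side file, and the reported number is the byte offset of the failing chunk in the data file. The sketch below is not Hadoop's actual on-disk `.crc` format, just a minimal illustration of the mechanism: record a CRC per chunk at write time, re-verify at read time, and report the starting offset of the first chunk that no longer matches.

```python
import zlib

CHUNK = 512  # Hadoop's default io.bytes.per.checksum


def chunk_crcs(data: bytes) -> list:
    """Compute one CRC32 per fixed-size chunk, as done at write time."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]


def first_bad_offset(data: bytes, crcs: list):
    """Re-verify every chunk against the recorded CRCs; return the byte
    offset where the first corrupt chunk starts, or None if all match."""
    for n, crc in enumerate(chunk_crcs(data)):
        if crc != crcs[n]:
            return n * CHUNK
    return None


# Simulate a segment part file that is silently corrupted after writing.
data = bytearray(b"nutch-segment-record " * 200)  # 4200 bytes of fake records
crcs = chunk_crcs(bytes(data))                    # checksums recorded at write time
data[2620] ^= 0xFF                                # flip one byte mid-file

offset = first_bad_offset(bytes(data), crcs)
print("corruption in chunk starting at byte", offset)  # → 2560 (chunk 5)
```

So in the quoted trace the damage sits in the chunk starting at byte 2620416 of `part-00000`; anything before that point is still readable. Two things that may help recover, both to be taken as suggestions rather than a verified fix: `bin/nutch readseg -dump` on the segment should get you the readable portion up to the bad chunk, and Hadoop's `io.skip.checksum.errors` property, when set to true, tells the SequenceFile reader to skip past checksum errors (resyncing at the next sync marker) instead of aborting, at the cost of losing the damaged records.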

