If I take a certain crawl_generate input and run fetch on it, this always
happens on the same URL.
The URL points to a 300 MB txt file.
Other large files fetch fine, without this problem. Also, if I open the
file myself with Notepad++, it is not corrupted; it opens fine.
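To rule out the input side, one check that might help is scanning the crawl_generate part files and confirming that every record still deserializes. A minimal sketch, assuming the standard <Text, CrawlDatum> segment layout (the segment path and part file name below are placeholders for my setup):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

// Scans one crawl_generate part file and reports how far it can be read
// before deserialization fails. The path is a placeholder.
public class ScanGenerate {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path part = new Path("crawl/segments/20120722000000/crawl_generate/part-00000");
    FileSystem fs = part.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
    Text key = new Text();
    CrawlDatum value = new CrawlDatum();
    long records = 0;
    try {
      while (reader.next(key, value)) {
        records++;
      }
      System.out.println("OK, read " + records + " records");
    } catch (Exception e) {
      System.out.println("Failed after " + records + " records: " + e);
    } finally {
      reader.close();
    }
  }
}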

I managed to debug it on the cluster and get to the specific point where the
exception is thrown. It seems that the stream suddenly ends, and I can't
understand why.
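For what it's worth, the EOFException inside Text.readFields is exactly what you get when a serialized record is cut off mid-stream. A tiny standalone illustration of the same failure mode (not my actual job, just a demonstration):

import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.Text;

// Serializes a Text, then deserializes it from a truncated buffer to show
// that a stream ending mid-record surfaces as java.io.EOFException in
// Text.readFields, just like in the reducer stack trace below.
public class TruncatedTextDemo {
  public static void main(String[] args) throws Exception {
    DataOutputBuffer out = new DataOutputBuffer();
    new Text("http://example.com/some/large/file.txt").write(out);

    DataInputBuffer in = new DataInputBuffer();
    // Hand the reader only half of the serialized bytes.
    in.reset(out.getData(), out.getLength() / 2);

    Text copy = new Text();
    copy.readFields(in);  // throws java.io.EOFException
  }
}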

There is no disk space problem, and the datanodes seem healthy.




Ferdy Galema wrote
> 
> Hi,
> 
> This most certainly has something to do with data corruption. When all
> your mappers succeed and the error is in reducer code (looks like your
> case), it may be mapreduce intermediate output that cannot be read
> successfully. Is there enough free space on your configured
> mapred.local.dir devices? Are they healthy? Perhaps try to run some
> sanity check jobs. (Something like Hadoop teragen and terasort).
> 
> Ferdy.
> 
> On Sun, Jul 22, 2012 at 3:56 PM, nutch.buddy@ <nutch.buddy@> wrote:
> 
>> I'm still stuck with this problem.
>>
>> I now know that the 'NegativeArraySizeException' is not related.
>>
>> During the fetch process, at 100% map, 98% reduce,
>> I get the following exceptions:
>> java.lang.RuntimeException: problem advancing post rec#9618
>> at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1182)
>> at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:246)
>> at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:242)
>> at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:40)
>> at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:469)
>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
>> at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>> ...
>> Caused by: java.io.EOFException
>> at java.io.DataInputStream.readByte(DataInputStream.java:250)
>> at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:308)
>> at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:329)
>> at org.apache.hadoop.io.Text.readFields(Text.java:282)
>> at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:73)
>> at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:44)
>> at org.apache.hadoop.mapred.Task$ValuesIterator.readNextKey(Task.java:1225)
>> at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:1180)
>>
>>
>> I use an 8-node cluster with CDH4.
>>
> 
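Regarding the teragen/terasort sanity check suggested above: for reference, it can also be driven from a small program rather than the command line. A rough sketch, assuming the Hadoop examples jar is on the classpath (the row count and HDFS paths are arbitrary placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.terasort.TeraGen;
import org.apache.hadoop.examples.terasort.TeraSort;
import org.apache.hadoop.util.ToolRunner;

// Runs a small teragen followed by terasort as a cluster sanity check.
// Row count and output paths are placeholders.
public class SanityCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    int gen = ToolRunner.run(conf, new TeraGen(),
        new String[] { "1000000", "/tmp/teragen-out" });
    if (gen != 0) {
      System.err.println("teragen failed");
      return;
    }

    int sort = ToolRunner.run(conf, new TeraSort(),
        new String[] { "/tmp/teragen-out", "/tmp/terasort-out" });
    System.out.println(sort == 0 ? "terasort OK" : "terasort failed");
  }
}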



